GB 2312

MIME / IANA: GB_2312-80 (GB2312 for usual EUC form)
Alias(es): iso-ir-58, chinese, csISO58GB231280
Language(s): Simplified Chinese, English, Russian; partial support: Greek, Japanese
Standard: GB/T 2312-1980, RFC 1345
Classification: ISO-2022-compatible DBCS, CJK encoding
Extensions: ISO-IR-165
Encoding formats: EUC-CN (GB2312), HZ-GB-2312
Succeeded by: GBK, GB 18030
Other related encoding(s): JIS X 0208, KS X 1001

GB/T 2312-1980 is a key official character set of the People's Republic of China, used for simplified Chinese characters. GB2312 is the registered internet name for EUC-CN, which is its usual encoded form. GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese. GB2312 (1980) has been superseded by GBK and GB18030, which include additional characters, but GB2312 remains in widespread use as a subset of those encodings.

According to a National Standard Bulletin of the People's Republic of China, the national standard GB 2312-1980 is no longer mandatory, and its designation has been changed to GB/T 2312-1980.[1]

While GB2312 covers over 99% of the characters of contemporary usage,[2] historical texts and many names remain out of scope. GB2312 includes 6,763 Chinese characters (on two levels: the first is arranged by reading, the second by radical then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks. As of February 2018, 0.3% of all web pages use GB2312, a drop from 3.5% in January 2010.[3]

An analogous character set, GB/T 12345, is closely related to GB2312 but replaces simplified forms with traditional character forms and adds 62 supplemental characters.[4] GB-encoded fonts often come in pairs, one with the GB 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.

Characters

Characters in GB2312 are arranged in a 94x94 grid (as in ISO 2022), and the two-byte code point of each character is expressed in the kuten (or quwei) form, which specifies a row (ku or qu) and the position of the character within the row (cell, ten or wei).

The rows (numbered from 1 to 94) contain characters as follows:

  • 01–09, comprising punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, Bopomofo
  • 16–55, the first level of Chinese characters, arranged according to Pinyin (3,755 characters).
  • 56–87, the second level of Chinese characters, arranged according to radical and strokes (3,008 characters).
  • 88–89, further Chinese characters (103 characters), defined only for GB/T 12345, not GB 2312.

The rows 10–15 and 90–94 are unassigned.
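The row layout above can be sketched as a small lookup. The function name `classify_row` and the category labels are hypothetical, chosen only for illustration; they are not part of the standard.

```python
def classify_row(row: int) -> str:
    """Return a rough category for a GB 2312 row number (1-94).

    Hypothetical helper; the labels are illustrative only.
    """
    if not 1 <= row <= 94:
        raise ValueError("row must be in 1..94")
    if row <= 9:
        return "symbols"          # punctuation, kana, Greek, Cyrillic, Pinyin, Bopomofo
    if 16 <= row <= 55:
        return "hanzi-level-1"    # arranged according to Pinyin
    if 56 <= row <= 87:
        return "hanzi-level-2"    # arranged according to radical and strokes
    if 88 <= row <= 89:
        return "gbt12345-only"    # defined for GB/T 12345, not GB 2312
    return "unassigned"           # rows 10-15 and 90-94


print(classify_row(45))  # hanzi-level-1
```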

In total, GB/T 2312-1980 contains 682 symbols and 6,763 Chinese characters.

Encodings of GB2312

EUC-CN

EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB2312, thus maintaining compatibility with ASCII. Every character not found in ASCII is represented by two bytes. The first byte ranges from 0xA1 to 0xF7 (161–247), and the second byte ranges from 0xA1 to 0xFE (161–254). Because both ranges lie outside ASCII, it is possible, as in UTF-8, to tell whether a byte is part of a multi-byte sequence, but not whether it is the first or the second byte of its pair.
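The byte ranges above suggest a simple structural check. The following is a minimal sketch, not a full validator: the helper name `is_wellformed_euc_cn` is hypothetical, and it only tests byte ranges, not whether each two-byte pair is actually assigned a character in GB 2312.

```python
def is_wellformed_euc_cn(data: bytes) -> bool:
    """Check EUC-CN byte-range structure only (hypothetical helper).

    Does not verify that each two-byte pair maps to an assigned character.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte ASCII.
            i += 1
        elif 0xA1 <= b <= 0xF7:
            # Lead byte: a trail byte in 0xA1-0xFE must follow.
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                return False
            i += 2
        else:
            return False
    return True


print(is_wellformed_euc_cn(b"abc\xcd\xe2"))  # True
print(is_wellformed_euc_cn(b"\xcd"))         # False: truncated pair
```

Because the lead and trail ranges overlap, the scan has to start from a known character boundary; a byte such as 0xCD seen in isolation could be either half of a pair.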

Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is more storage efficient: while UTF-8 uses three bytes[a] per CJK ideograph, GB2312 only uses two. However, GB2312 does not cover as many ideographs as Unicode does.
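The size difference can be checked directly. This small sketch uses Python's built-in `gb2312` codec (Python's name for the EUC-CN form) on a single ideograph; the variable names are illustrative.

```python
text = "外"                          # a CJK ideograph covered by GB 2312
gb_bytes = text.encode("gb2312")     # EUC-CN form of GB 2312
utf8_bytes = text.encode("utf-8")

print(len(gb_bytes), len(utf8_bytes))  # 2 3
```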

To map the kuten code points to bytes, add 160 (0xA0) to the row number (ku, the 1000s and 100s place) of the code point to form the high byte, and add 160 to the column number (ten, the 10s and 1s place) of the code point to form the low byte.

For example, the GB2312 code point 4566 ("外"[5], meaning "foreign") has row number 45, so the high byte is 45+160=205=0xCD; its column number is 66, so the low byte is 66+160=226=0xE2. The full encoding is therefore 0xCDE2[6].
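The kuten-to-byte arithmetic can be sketched in a few lines. The function names `kuten_to_euc_cn` and `euc_cn_to_kuten` are hypothetical; the final line cross-checks the result against Python's built-in `gb2312` codec.

```python
def kuten_to_euc_cn(kuten: int) -> bytes:
    """Convert a four-digit kuten code point (e.g. 4566) to EUC-CN bytes."""
    row, cell = divmod(kuten, 100)
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([row + 0xA0, cell + 0xA0])


def euc_cn_to_kuten(pair: bytes) -> int:
    """Inverse mapping: two EUC-CN bytes back to a kuten code point."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("expected two bytes in 0xA1..0xFE")
    return (pair[0] - 0xA0) * 100 + (pair[1] - 0xA0)


encoded = kuten_to_euc_cn(4566)
print(encoded.hex())               # cde2
print(euc_cn_to_kuten(encoded))    # 4566
print(encoded.decode("gb2312"))    # 外
```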

HZ

HZ is another encoding of GB2312 that is used mostly for Usenet postings.
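As a rough illustration, Python ships an `hz` codec. Background not stated in this article: per RFC 1843, HZ keeps the stream 7-bit by switching into GB mode with `~{` and back to ASCII with `~}`. The sketch below only checks that the encoded form is 7-bit and round-trips; the variable names are illustrative.

```python
text = "外"
hz_bytes = text.encode("hz")                     # "hz" is Python's built-in HZ codec
is_seven_bit = all(b < 0x80 for b in hz_bytes)   # no byte has the high bit set
round_trip = hz_bytes.decode("hz")

print(hz_bytes, is_seven_bit, round_trip == text)
```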

Two implementations of GB2312

There are two implementations of GB2312, which differ in a few code points.

EUC-CN | GBK/GB18030 subset | GB2312.TXT | Character name[7]:3
A1A4 | U+00B7 · MIDDLE DOT | U+30FB KATAKANA MIDDLE DOT | 间隔点; 'separator dot'
A1AA | U+2014 EM DASH | U+2015 HORIZONTAL BAR | 破折号; 'em dash'

The GBK/GB18030 subset is compatible with both GBK and GB18030; GB2312.TXT is the formerly official mapping from ftp.unicode.org,[8] which has been obsolete since August 2011[9] and was missing as of September 2016. Further vendor-specific mappings also existed.[8]

As of 2015, Microsoft .NET Framework uses the GBK/GB18030 subset, while ICU,[10] iconv-1.14,[11] php-5.6, ActivePerl-5.20, Java 1.7 and Python 3.4[12] use GB2312.TXT. Ruby 2.2 is compatible with both implementations; it internally converts the conflicting characters to the subset. W3C's technical recommendation specifies that a GBK encoding be inferred for streams labelled gb2312, which in turn uses a GB18030 decoder.[13]
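The two conflicting code points can be inspected by decoding them with different codecs. This is a sketch under the assumption that the local Python's `gb2312` codec still follows GB2312.TXT, as the text reports for Python 3.4, and that its `gbk` codec follows the GBK/GB18030 subset; the helper name `codepoint` is hypothetical, and the output may differ across implementations.

```python
CONFLICTS = [b"\xa1\xa4", b"\xa1\xaa"]   # the two EUC-CN pairs that differ


def codepoint(raw: bytes, codec: str) -> str:
    """Decode one two-byte sequence and format the result as U+XXXX."""
    ch = raw.decode(codec)
    return f"U+{ord(ch):04X}"


for raw in CONFLICTS:
    # Columns: EUC-CN bytes, result via "gb2312" codec, result via "gbk" codec
    print(raw.hex().upper(), codepoint(raw, "gb2312"), codepoint(raw, "gbk"))
```

When exchanging data between systems, comparing these two code points is a quick way to find out which mapping a given library uses.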

References

  1. ^ "2017年第7号中国国家标准公告 (China National Standard Bulletin 2017 No.7)". Standardization Administration of the People's Republic of China. Retrieved 3 July 2018.
  2. ^ Hannas, William C. (1997). Asia's Orthographic Dilemma. University of Hawai‘i Press. p. 264.
  3. ^ "Historical trends in the usage of character encodings, February 2019". w3techs.com. Retrieved 2019-02-01.
  4. ^ "GB/T 12345" (PDF).
  5. ^ https://archive.org/details/GB2312-1980/page/n17
  6. ^ https://web.archive.org/web/20160303230643/http://cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html
  7. ^ "GB 2312-1980: Information technology—Chinese ideogram coded character set for information interchange (basic set)". Retrieved 2 October 2016.
  8. ^ a b Haible, Bruno. "GB2312 (Conversion Tables)". Retrieved 29 September 2016.
  9. ^ "Readme – MAPPINGS/OBSOLETE/EASTASIA". 9 August 2001. Retrieved 29 September 2016.
  10. ^ "java-EUC_CN-1.3_P.ucm". Retrieved 29 September 2016.
  11. ^ "libiconv:lib/gb2312.h". GNU Savannah. Retrieved 29 September 2016.
  12. ^ "Issue 24036". Python Bug Tracker.
  13. ^ "Encoding § Names and labels". W3C. Retrieved 29 September 2016.

Notes

  a. ^ Only for ideographs covered by GB2312, all of which fall within the Unicode BMP.