Big-5 or Big5 is a character encoding method used in Taiwan, Hong Kong, and Macau for Traditional Chinese characters.
Mainland China, which uses Simplified Chinese Characters, uses the GB instead.
The original Big5 character set lacked many commonly used characters. To solve this problem, each vendor developed its own extension. The ETen extension became part of the current Big5 standard through popularity.
The structure of Big5 does not conform to the ISO 2022 standard, but rather bears a certain similarity to the Shift JIS encoding. It is a double-byte character set (DBCS) with the following structure: {| border=1 style="border-collapse: collapse" |- | First byte ("lead byte") || 0x81 to 0xfe (or 0xa1 to 0xf9 for non-user-defined characters) |- | Second byte || 0x40 to 0x7e, 0xa1 to 0xfe |} (the prefix 0x signifying hexadecimal numbers).
Certain variants of the Big5 character set, for example the HKSCS, use an expanded range for the lead byte including values in the 0x81 to 0xA0 range (similar to Shift JIS).
If the second byte is not in the correct range, behaviour is undefined (i.e., varies from system to system).
The numerical value of individual Big5 codes are frequently given as a 4-digit hexadecimal number, which describes the two bytes that comprise the Big5 code as if the two bytes were a big endian representation of a 16-bit number. For example, the Big5 code for a full-width space, which are the bytes 0xa1 0x40, is usually written as 0xa140 or just A140.
Strictly speaking, the Big5 encoding contains only DBCS characters. However, in practice, the Big5 codes are always used together with an unspecified, system-dependent single-byte character set (ASCII, or an 8-bit character set such as code page 437), so that you will find a mix of DBCS characters and single-byte characters in Big5-encoded text. Bytes in the range 0x00 to 0x7f that are not part of a double-byte character are assumed to be single-byte characters. (For a more detailed description of this problem, please see the discussion on "The Matching SBCS" below.)
The meaning of non-ASCII single bytes outside the permitted values that are not part of a double-byte character varies from system to system. In old MSDOS-based systems, they are likely to be displayed as 8-bit characters; in modern systems, they are likely to either give unpredictable results or generate an error.
0x8140 to 0xa0fe | Reserved for user-defined characters 造字 |
0xa140 to 0xa3bf | "Graphical characters" 圖形碼 |
0xa3c0 to 0xa3fe | Reserved, not for user-defined characters |
0xa440 to 0xc67e | Frequently used characters 常用字 |
0xc6a1 to 0xc8fe | Reserved for user-defined characters |
0xc940 to 0xf9d5 | Less frequently used characters 次常用字 |
0xf9d6 to 0xfefe | Reserved for user-defined characters |
In most vendor extensions, extended characters are placed in the various zones reserved for user-defined characters, each of which are normally regarded as associated with the preceding zone. For example, additional "graphical characters" (e.g., punctuation marks) would be expected to be placed in the 0xa3c0–0xa3fe range, and additional logograms would be placed in either the 0xc6a1–0xc8fe or the 0xf9d6–0xfefe range. Sometimes, this is not possible due to the large number of extended characters to be added; for example, Cyrillic letters and Japanese kana have been placed in the zone associated with "frequently-used characters".
(The above might need some explanation by putting it in historical perspective, as it is theoretically incorrect: Back when text mode personal computing was still the norm, characters were normally represented as single bytes and each character takes one position on the screen. There was therefore a practical reason to insist that double-byte characters must take up two positions on the screen, namely that off-the-shelf, American-made software would then be usable without modification in a DBCS-based system. If a character can take an arbitrary number of screen positions, software that assumes that one byte of text takes one screen position would produce incorrect output. Of course, if a computer never had to deal with the text screen, the manufacturer would not enforce this artificial restriction; the Apple Macintosh is an example. Nevertheless, the encoding itself must be designed so that it works correctly on text-screen-based systems.)
To illustrate this point, consider the Big5 code 0xa14b (…). To English speakers this looks like an ellipsis and the Unicode standard identifies it as such; however, in Chinese, the ellipsis consists of six dots that fit in the space of two Chinese characters (……), so in fact there is no Big5 code for the Chinese ellipsis, and the Big5 code 0xa14b just represents half of a Chinese ellipsis. It represents only half of an ellipsis because the whole ellipsis should take the space of two Chinese characters, and in many DBCS systems one DBCS character must take exactly the space of one Chinese character.
Characters encoded in Big5 do not always represent things that can be readily used in plain text files; an example is "citation mark" (0xa1ca, ﹋), which is, when used, required to be typeset under the title of literary works. Another example is the Suzhou numerals, which is a form of scientific notation that requires the number to be laid out in a 2-D form consisting of at least two rows.
The SBCS to use being unspecified implies that the SBCS used can theoretically vary from system to system. Nowadays, ASCII is the only possible SBCS one would use. However, in old DOS-based systems, Code Page 437—with its extra special symbols in the control code area including position 127—was much more common. Yet, on a Macintosh system with the Chinese Language Kit, or on a Unix system running the cxterm terminal emulator, the SBCS paired with Big5 would not be Code Page 437.
Outside the valid range of Big5, the old DOS-based systems would routinely interpret things according to the SBCS that is paired with Big5 on that system. In such systems, characters 127 to 160, for example, were very likely not avoided because they would produce invalid Big5, but used because they would be valid characters in Code Page 437.
The modern characterization of Big5 as an MBCS consisting of the DBCS of Big5 plus the SBCS of ASCII is therefore historically incorrect and potentially flawed, as the choice of the matching SBCS was, and theoretically still is, quite independent of the flavour of Big5 being used.
Big5 was rapidly popularized in Taiwan and worldwide among Chinese who used the traditional Chinese character set through its adoption in several commercial software packages, notably the E-TEN Chinese DOS input system (ETen Chinese System). The Republic of China government declared Big5 as their standard in mid-1980s since it was, by then, the de facto standard for using traditional Chinese on computers.
The plethora of variations make UTF-8 or UTF-16 a more consistent code page for modern use.
In some versions of Eten, there are extra graphical symbols and Simplified Chinese characters.
After installing Microsoft's HKSCS patch on top of traditional Chinese Windows (or any version of Windows 2000 and above with proper language pack), applications using code page 950 automatically use a hidden code page 951 table. The table supports all code points in HKSCS-2001, except for the compatibility code points specified by the standard.
Code page 950 used by Windows 2000 and Windows XP maps hiragana and katakana characters to Unicode private use area block when exporting to Unicode, but to the proper hiragana and katakana Unicode blocks in Windows Vista.
Despite the problems, characters previously mapped to Unicode Private Use Area are remapped to the standardized equivalents when exporting characters to Unicode format.
Big5-2003 incorporates all Big-5 characters introduced in the 1984 ETEN extensions (code points A3C0-A3E0, C6A1-C7F2, and F9D6-F9FE) and the Euro symbol. Cyrillic characters were not included because the authority claimed CNS 11643 does not include such characters.
Hong Kong also adopted Big5 for character encoding. However, Cantonese uses many archaic and some colloquial Chinese characters that were not available in the normal Big5 character set. To solve this problem, the Hong Kong Government created the Big5 extensions Government Chinese Character Set in 1995 and Hong Kong Supplementary Character Set in 1999. The Hong Kong extensions were commonly distributed as a patch. It is still being distributed as a patch by Microsoft, but a full Unicode font is also available from the Hong Kong Government’s web site.
There are two encoding schemes of HKSCS: one encoding scheme is for the Big-5 coding standard and the other is for the ISO 10646 standard. Subsequent to the initial release, there are also HKSCS-2001 and HKSCS-2004. The HKSCS-2004 is aligned technically with the ISO/IEC 10646:2003 and its Amendment 1 published in April 2004 by the International Organization for Standardization (ISO).
HKSCS includes all the characters from the common ETEN extension, plus some characters from Simplified Chinese, place names, people's names, and Cantonese phrases (including profanity).
Category:Character sets Category:Encodings of Asian Languages
This text is licensed under the Creative Commons CC-BY-SA License. This text was originally published on Wikipedia and was developed by the Wikipedia community.
The World News (WN) Network, has created this privacy statement in order to demonstrate our firm commitment to user privacy. The following discloses our information gathering and dissemination practices for wn.com, as well as e-mail newsletters.
We do not collect personally identifiable information about you, except when you provide it to us. For example, if you submit an inquiry to us or sign up for our newsletter, you may be asked to provide certain information such as your contact details (name, e-mail address, mailing address, etc.).
When you submit your personally identifiable information through wn.com, you are giving your consent to the collection, use and disclosure of your personal information as set forth in this Privacy Policy. If you would prefer that we not collect any personally identifiable information from you, please do not provide us with any such information. We will not sell or rent your personally identifiable information to third parties without your consent, except as otherwise disclosed in this Privacy Policy.
Except as otherwise disclosed in this Privacy Policy, we will use the information you provide us only for the purpose of responding to your inquiry or in connection with the service for which you provided such information. We may forward your contact information and inquiry to our affiliates and other divisions of our company that we feel can best address your inquiry or provide you with the requested service. We may also use the information you provide in aggregate form for internal business purposes, such as generating statistics and developing marketing plans. We may share or transfer such non-personally identifiable information with or to our affiliates, licensees, agents and partners.
We may retain other companies and individuals to perform functions on our behalf. Such third parties may be provided with access to personally identifiable information needed to perform their functions, but may not use such information for any other purpose.
In addition, we may disclose any information, including personally identifiable information, we deem necessary, in our sole discretion, to comply with any applicable law, regulation, legal proceeding or governmental request.
We do not want you to receive unwanted e-mail from us. We try to make it easy to opt-out of any service you have asked to receive. If you sign-up to our e-mail newsletters we do not sell, exchange or give your e-mail address to a third party.
E-mail addresses are collected via the wn.com web site. Users have to physically opt-in to receive the wn.com newsletter and a verification e-mail is sent. wn.com is clearly and conspicuously named at the point of
collection.If you no longer wish to receive our newsletter and promotional communications, you may opt-out of receiving them by following the instructions included in each newsletter or communication or by e-mailing us at michaelw(at)wn.com
The security of your personal information is important to us. We follow generally accepted industry standards to protect the personal information submitted to us, both during registration and once we receive it. No method of transmission over the Internet, or method of electronic storage, is 100 percent secure, however. Therefore, though we strive to use commercially acceptable means to protect your personal information, we cannot guarantee its absolute security.
If we decide to change our e-mail practices, we will post those changes to this privacy statement, the homepage, and other places we think appropriate so that you are aware of what information we collect, how we use it, and under what circumstances, if any, we disclose it.
If we make material changes to our e-mail practices, we will notify you here, by e-mail, and by means of a notice on our home page.
The advertising banners and other forms of advertising appearing on this Web site are sometimes delivered to you, on our behalf, by a third party. In the course of serving advertisements to this site, the third party may place or recognize a unique cookie on your browser. For more information on cookies, you can visit www.cookiecentral.com.
As we continue to develop our business, we might sell certain aspects of our entities or assets. In such transactions, user information, including personally identifiable information, generally is one of the transferred business assets, and by submitting your personal information on Wn.com you agree that your data may be transferred to such parties in these circumstances.