UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed “octets” in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently in practice) are encoded using fewer bytes, making the encoding scheme reasonably efficient. In particular, the first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as the corresponding ASCII character, making valid ASCII text valid UTF-8-encoded Unicode text as well.
The official IANA code for the UTF-8 character encoding is UTF-8.
In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only bytes where the high bit was set.
In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Labs then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string to find code point boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.
UTF-8 was first officially presented at the USENIX conference in San Diego, held January 25–29, 1993.
The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.
| Bits | Last code point | Byte 1 |
| 7 | U+007F | 0xxxxxxx |
Prosser’s and Thompson’s challenge was to extend this scheme to handle code points with up to 31 bits. The solution proposed by Prosser as subsequently modified by Thompson was as follows:
| Bits | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
| 7 | U+007F | 0xxxxxxx | | | | | |
| 11 | U+07FF | 110xxxxx | 10xxxxxx | | | | |
| 16 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
| 21 | U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
| 26 | U+3FFFFFF | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 31 | U+7FFFFFFF | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
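The bit layout in the table can be turned into a short encoder. The following is only a sketch of the original one-to-six-byte scheme, not of modern UTF-8: the function name `encode_original_utf8` is hypothetical, and the code deliberately omits the later restrictions (the U+10FFFF limit and the exclusion of surrogates).

```python
def encode_original_utf8(code_point: int) -> bytes:
    """Encode one value using the original 1-6 byte scheme (illustrative only)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if code_point <= 0x7F:
        return bytes([code_point])  # 0xxxxxxx: identical to ASCII

    # (last value in the row, sequence length, lead-byte prefix bits)
    rows = [
        (0x7FF, 2, 0xC0),        # 110xxxxx
        (0xFFFF, 3, 0xE0),       # 1110xxxx
        (0x1FFFFF, 4, 0xF0),     # 11110xxx
        (0x3FFFFFF, 5, 0xF8),    # 111110xx
        (0x7FFFFFFF, 6, 0xFC),   # 1111110x
    ]
    for last, length, prefix in rows:
        if code_point <= last:
            break

    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (code_point & 0x3F)
        code_point >>= 6
    # Whatever bits remain go into the lead byte.
    out[0] = prefix | code_point
    return bytes(out)
```

For code points that modern UTF-8 also allows, the result should match Python's built-in encoder, for example for the euro sign U+20AC.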
The salient features of the above scheme are as follows:
# Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value. (Thus, valid ASCII text is also valid UTF‑8-encoded Unicode text.)
# For every UTF‑8 byte sequence corresponding to a single Unicode character, the first byte unambiguously indicates the length of the sequence in bytes.
# All continuation bytes (bytes 2–6 in the table above) have 10 as their two most-significant bits (bits 7–6); in contrast, the first byte never has 10 as its two most-significant bits. As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF-8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.
# As a consequence of feature 3 above, starting with any arbitrary byte anywhere in a (valid) UTF-8 stream, it is necessary to back up by at most five bytes to reach the beginning of the byte sequence corresponding to a single character (three bytes in actual UTF-8, as explained in the next section). If it is not possible to back up, or a byte is missing because of, for example, a communication failure, a single character can be discarded and the next character can still be read correctly.
# Starting with the second row in the table above (two bytes), every additional byte extends the maximum number of bits by five (six additional bits from the additional continuation byte, minus one bit lost in the first byte).
# Prosser’s and Thompson’s scheme was sufficiently general to be extended beyond 6-byte sequences (however, this would have allowed FE or FF bytes to occur in valid UTF-8 text — see under Advantages in section "Compared to single byte encodings" below — and indefinite extension would lose the desirable feature that the length of a sequence can be determined from the start byte only).
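Features 2 to 4 can be illustrated with two small helpers. This is a sketch that assumes valid input; the names `char_start` and `sequence_length` are hypothetical and do not come from any specification.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the sequence containing data[index]."""
    start = index
    # Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1
    return start


def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a first byte (original 1-6 byte scheme)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx
    if lead < 0xC0:
        raise ValueError("continuation byte, not a first byte")
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    if lead < 0xFC:
        return 5  # 111110xx
    if lead < 0xFE:
        return 6  # 1111110x
    raise ValueError("FE and FF never start a sequence")
```

Starting in the middle of the three-byte euro sign in `"a\u20acb".encode("utf-8")`, `char_start` steps back to index 1, and `sequence_length` applied to that byte reports three bytes.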
| Code point range | Binary code point | UTF-8 bytes | Example |
So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
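The byte counts per range can be checked with Python's built-in UTF-8 encoder. The sample characters below are arbitrary picks, one from each range, chosen for illustration.

```python
# One sample per range: U+0041 (ASCII), U+00E9 (Latin letter with diacritic),
# U+20AC (rest of the Basic Multilingual Plane), U+1D11E (supplementary plane).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001d11e"]

# Map each code point label to the number of bytes in its UTF-8 encoding.
lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}

# The two-byte range runs from U+0080 to U+07FF inclusive.
two_byte_count = 0x800 - 0x80
```

Here `lengths` maps the four labels to 1, 2, 3 and 4 bytes respectively, and `two_byte_count` is 1,920, matching the figure given above.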
Cells with a large dot are continuation bytes. The hexadecimal number shown after a plus sign ("+") is the value of the 6 bits they add.
Cells containing a large single-digit number are the start bytes for a sequence of that many bytes. The unbolded hexadecimal code point number shown in the cell is the lowest character value encoded using that start byte. When a start byte could form both overlong and valid encodings, the lowest non-overlong-encoded code point is shown, marked by an asterisk "*".
Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for overlong encoding of basic ASCII characters. The remaining red cells indicate start bytes of sequences that could only encode numbers larger than the 0x10FFFF limit of Unicode. The byte 244 (hex 0xF4) could also encode some values greater than 0x10FFFF; such a sequence is also invalid.
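The roles described in this legend can be summarized by classifying each of the 256 byte values. The function name `classify_byte` and its labels are hypothetical. The sketch looks at single bytes only, so it cannot catch a sequence that starts with 0xF4 and encodes a value above 0x10FFFF.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value (0-255) by its role in modern four-byte UTF-8."""
    if b <= 0x7F:
        return "ascii"         # 0xxxxxxx
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx
    if b <= 0xC1:
        return "invalid"       # C0, C1: could only start overlong encodings
    if b <= 0xDF:
        return "lead-2"        # 110xxxxx
    if b <= 0xEF:
        return "lead-3"        # 1110xxxx
    if b <= 0xF4:
        return "lead-4"        # 11110xxx, limited so that values stay <= 0x10FFFF
    return "invalid"           # F5..FF: could only encode values above 0x10FFFF


# Count the byte values that must never appear in valid UTF-8.
invalid_count = sum(classify_byte(b) == "invalid" for b in range(256))
```

Under these rules `invalid_count` is 13: the two bytes C0 and C1 plus the eleven bytes F5 through FF.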
Many earlier decoders would happily try to decode these. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes. Invalid UTF-8 has been used to bypass security validations in high-profile products, including Microsoft's IIS web server.
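The following sketch shows why overlong encodings are dangerous. The helper `naive_decode_two_bytes` is a hypothetical, deliberately unsafe decoder written for illustration; it is not taken from any real product. It is compared with Python's strict built-in decoder on the bytes C0 AF, an overlong form of the slash character.

```python
def naive_decode_two_bytes(data: bytes) -> str:
    """Decode a two-byte sequence WITHOUT an overlong check (unsafe on purpose)."""
    lead, cont = data
    # Take 5 payload bits from the first byte and 6 from the continuation byte.
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


overlong_slash = b"\xc0\xaf"  # overlong two-byte encoding of U+002F "/"

# The naive decoder produces a slash that a byte-level filter would have missed.
naive_result = naive_decode_two_bytes(overlong_slash)

# Python's decoder rejects the sequence because C0 is never a valid start byte.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```

A validation step that scans the raw bytes for `/` sees nothing suspicious in `overlong_slash`, yet the naive decoder yields a slash afterwards. That ordering problem is the kind of bypass described above.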
Many UTF-8 decoders throw exceptions on encountering errors, since such errors suggest the input is not a UTF-8 string at all. This can turn what would otherwise be harmless errors (producing a message such as "no such file") into a denial-of-service bug. For instance, Python 3.0 would exit immediately if the command line contained invalid UTF-8, so it was impossible to write a Python program that could handle such input.
An increasingly popular option is to detect errors with a separate API, and for converters to translate the first byte to a replacement and continue parsing with the next byte. Popular replacements are:
Replacing errors is "lossy": more than one UTF-8 string converts to the same Unicode result. Therefore the original UTF-8 should be stored, and translation should only be used when displaying the text to the user.
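As one concrete illustration, Python's built-in decoder offers several error handlers. The sketch below uses `replace`, which substitutes U+FFFD and is lossy, and `surrogateescape`, which maps each undecodable byte to a lone surrogate so the original bytes can be recovered. The sample file name is made up.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; the lone E9 byte is not valid UTF-8

# Lossy: the offending byte becomes U+FFFD and parsing resumes at the next byte.
replaced = raw.decode("utf-8", errors="replace")

# Lossless: the offending byte E9 becomes the lone surrogate U+DCE9.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

With `replace`, any other invalid byte in that position would decode to the same string, which is why the original bytes should be kept if they matter.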
Whether an actual application should do this with surrogate halves is debatable. Allowing them allows lossless storage of invalid UTF-16, and allows CESU encoding (described below) to be decoded. There are other code points that are far more important to detect and reject, such as the reversed-BOM U+FFFE, or the C1 controls, caused by improper conversion of CP1252 text or double-encoding of UTF-8. These are invalid in HTML.
Alternatively, the name "utf-8" may be used by all standards conforming to the Internet Assigned Numbers Authority (IANA) list (which include CSS, HTML, XML, and HTTP headers), as the declaration is case insensitive.
Other descriptions that omit the hyphen or replace it with a space, such as "utf8" or "UTF 8", are not accepted as correct by the governing standards. Despite this, most agents such as browsers can understand them, and so standards intended to describe existing practice (such as HTML5) may effectively require their recognition.
MySQL omits the hyphen in the following query: SET NAMES 'utf8'
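Python's codec registry is one example of software that is lenient about the label. It is shown here only to illustrate that leniency; it is not a statement about what the standards permit.

```python
import codecs

# Look up the codec under the standard name, a lower-case variant,
# and the hyphen-less spelling; report the canonical name Python uses.
canonical_names = {
    label: codecs.lookup(label).name
    for label in ("UTF-8", "utf-8", "utf8")
}
```

All three labels resolve to the same codec, which Python reports as `utf-8`.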
Many pieces of software added UTF-8 conversions for UCS-2 data and did not alter their UTF-8 conversion when UCS-2 was replaced with UTF-16, which supports surrogate pairs. The result is that each half of a UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 encoding, resulting in 6-byte sequences rather than 4 for characters outside the Basic Multilingual Plane. Oracle databases use this, as well as Java and Tcl as described below, and probably a great deal of other Windows software where the programmers were unaware of the complexities of UTF-16. Although most usage is by accident, a supposed benefit is that this preserves UTF-16 binary sorting order when CESU-8 is binary sorted.
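The six-byte result can be reproduced with a small sketch. The function name `encode_cesu8` is hypothetical. The code splits each supplementary character into its UTF-16 surrogate pair and then uses Python's `surrogatepass` error handler so that each half is emitted as a three-byte sequence.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text so that each UTF-16 surrogate half becomes its own 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            # "surrogatepass" lets the UTF-8 codec encode a lone surrogate.
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


clef = "\U0001d11e"                  # U+1D11E, outside the Basic Multilingual Plane
standard = clef.encode("utf-8")      # 4 bytes in standard UTF-8
cesu = encode_cesu8(clef)            # 6 bytes: two 3-byte surrogate encodings
```

For characters inside the Basic Multilingual Plane the two encodings are identical. Only supplementary characters differ.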
All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through its standard stream reader and writer classes. However, it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constant strings in class files. Tcl also uses the same Modified UTF-8 as Java for its internal representation of Unicode data, but uses strict CESU-8 for external data.
The presence of the UTF-8 BOM may cause interoperability problems with existing software that could otherwise handle UTF-8; for example:
If compatibility with existing programs is not important, the BOM could be used to identify whether a file is in UTF-8 rather than a legacy encoding, but this is still problematic, because in many instances the BOM is added or removed without actually changing the encoding, or texts in various encodings are concatenated together. Checking whether the text is valid UTF-8 is more reliable than relying on the BOM.
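Both approaches can be sketched in a few lines. The helper names `is_valid_utf8` and `strip_utf8_bom` are hypothetical; `codecs.BOM_UTF8` is the standard-library constant for the bytes EF BB BF.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if one is present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Valid text without a BOM still passes `is_valid_utf8`, while text in a legacy 8-bit encoding with non-ASCII characters usually fails, so the check does not depend on whether a BOM was added or removed along the way.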
They supersede the definitions given in the following obsolete works:
They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.
This text is licensed under the Creative Commons CC-BY-SA License. This text was originally published on Wikipedia and was developed by the Wikipedia community.