Unicode Character Representation

Keywords: Unicode CF_UNICODETEXT Surrogate Pair UTF UTF16

I need some help regarding Unicode characters.
Using WordPad, I entered a Unicode character by holding the [Alt] key and typing "+119966" on the NumPad. This creates a stylish "c" character (font Cambria Math) from the Unicode table with hex value '0x1D49E'.
After copying this character to ClipBoard and BinaryClipGetting it with format CF_UNICODETEXT the hexadecimal representation becomes '35D89EDC'.
What is the relation between C notation '0x1D49E' and hexstring '35D89EDC'?
How can I convert a sequence of such C notated characters into a Unicode string (to put this Unicode string via ClipBoard into the WordPad editor)?
Test script:

CF_UNICODETEXT = 13

; Get Unicode character from WordPad via ClipBoard,
; e.g. 'Alt+119966' resp. '0x1D49E'.
sClipGetUnicode = ClipGetEx (CF_UNICODETEXT)
ClipPut (sClipGetUnicode) ; Inserting ClipBoard content back into Wordpad looks like the same character.

iBBSize = BinaryClipGet (0, CF_UNICODETEXT)
hBB = BinaryAlloc (iBBSize)
iBytes = BinaryClipGet (hBB, CF_UNICODETEXT)
sUnicode = BinaryPeekStrW (hBB, 0, iBBSize)
sHex = BinaryPeekHex (hBB, 0, iBBSize) ; e.g. '35D89EDC'.
hBB = BinaryFree (hBB)

iStrCmp = StrCmp (sClipGetUnicode, sUnicode) ; 0 means Unicode strings are the same.
ClipPut (sUnicode) ; Inserting ClipBoard content into Wordpad looks like the same character.
Exit

Answer:

Ugh... That's unpleasant looking. You just encountered what's known as a "Surrogate Pair" in Unicode.

Here's the quick & dirty explanation:

The decimal value 119,966 [0x1D49E] is greater than 65,535 [0xFFFF]. As such, it cannot be represented by a single 16-bit code unit in the UTF16 encoding that Windows uses to represent Unicode characters. This means that the character in question, although it is a single code point in the Unicode character set, must be encoded as a pair of 16-bit code units known as a Surrogate Pair. Storing any code point in a single code unit requires a code unit of at least 21 bits; the UTF32 encoding, with its 32-bit code units, accommodates this requirement. The Unicode code space contains 1,114,112 code points, with a maximum code point value of 0x10FFFF.
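The bit counts above are easy to check. The following is only a sketch, written in Python rather than WinBatch; the helper name needs_surrogate_pair is made up for this illustration.

```python
# Hypothetical helper for illustration: a code point above 0xFFFF
# does not fit in one 16-bit UTF16 code unit.
def needs_surrogate_pair(code_point):
    return code_point > 0xFFFF

MAX_CODE_POINT = 0x10FFFF   # highest code point in the Unicode code space
script_c = 0x1D49E          # the character from the question

print(needs_surrogate_pair(script_c))    # True
print(script_c.bit_length())             # 17 bits are needed for this value
print(MAX_CODE_POINT.bit_length())       # 21 bits cover every code point
print(MAX_CODE_POINT + 1)                # 1114112 code points in total
```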

Anytime that you see a UTF16 code unit with a value in the range of 0xD800 thru 0xDFFF, you know that you are dealing with one part of a Surrogate Pair. Specifically, you will see a code unit with a value in the range 0xD800 thru 0xDBFF, followed by another code unit in the range 0xDC00 thru 0xDFFF. Collectively, this pair of 16-bit code units represents a Unicode character with a value that requires 17 or more bits in which to store it.
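As a rough illustration of those ranges, a code unit can be classified with two comparisons. This is a Python sketch, not WinBatch code, and classify_code_unit is a name invented for the example.

```python
# Hypothetical helper for illustration: report which part of a
# Surrogate Pair (if any) a 16-bit UTF16 code unit belongs to.
def classify_code_unit(unit):
    if 0xD800 <= unit <= 0xDBFF:
        return "first half of a Surrogate Pair"
    if 0xDC00 <= unit <= 0xDFFF:
        return "second half of a Surrogate Pair"
    return "ordinary code unit"

for unit in (0xD835, 0xDC9E, 0x0063):
    print(hex(unit), classify_code_unit(unit))
```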

UTF16 Surrogate Pairs are analogous to UTF8 encoding in that both can represent the entire Unicode character set, albeit they use multiple code units to do so for some of the characters in the Unicode character set.
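To make the comparison concrete, here is a small Python sketch (not WinBatch) that counts how many code units the same character occupies in each encoding. It relies only on Python's built-in codecs; the function name code_unit_counts is an assumption for the example.

```python
# Hypothetical helper for illustration: count the code units one
# code point occupies in UTF8, UTF16 and UTF32.
def code_unit_counts(code_point):
    ch = chr(code_point)
    return {
        "UTF8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

print(code_unit_counts(0x1D49E))  # {'UTF8': 4, 'UTF16': 2, 'UTF32': 1}
print(code_unit_counts(0x0063))   # {'UTF8': 1, 'UTF16': 1, 'UTF32': 1}
```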

In the case of your character, 0x1D49E, if you choose to have WordPad save the document as a Unicode text file, you can then examine a hex dump of the file. That hex dump would show the following bytes in sequence:

FF FE 35 D8 9E DC

Now, we have to keep a few things in mind here. The initial "FF FE" is what's known as a BOM [Byte Order Mark]. It is the value 0xFEFF written out as a single 16-bit code unit; because its bytes appear low order byte first, as "FF FE", they tell us that the remainder of the file is UTF16 Little Endian encoded Unicode text. In a Big Endian file the same mark would appear as "FE FF".

Now, the remainder of the byte stream in the file, "35 D8 9E DC", must be interpreted as 16-bit unsigned integers in Little Endian order, which means that the corresponding Unicode code units are 0xD835 and 0xDC9E. As I previously explained, this is a Surrogate Pair based on the ranges of the values that the two code units belong to.
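Reading the hex dump as code units can be sketched like this in Python (not WinBatch). The function name read_utf16_units is invented for the example, and falling back to Little Endian when no BOM is present is an assumption that matches what Windows normally produces.

```python
import struct

# Hypothetical helper for illustration: turn a UTF16 byte stream
# into a list of 16-bit code units, honoring a leading BOM.
def read_utf16_units(data):
    if data[:2] == b"\xff\xfe":      # BOM says Little Endian
        order, body = "<", data[2:]
    elif data[:2] == b"\xfe\xff":    # BOM says Big Endian
        order, body = ">", data[2:]
    else:                            # assumption: no BOM means Little Endian
        order, body = "<", data
    count = len(body) // 2
    return list(struct.unpack(f"{order}{count}H", body[:count * 2]))

units = read_utf16_units(bytes.fromhex("FF FE 35 D8 9E DC"))
print([hex(u) for u in units])  # ['0xd835', '0xdc9e']
```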

Here's how we convert between a UTF16 Surrogate Pair and the code point value that a single UTF32 code unit holds:

Given a pair of code units, 0xD835 and 0xDC9E, we need to express them in binary notation and take the lower 10 bits of each value and re-assemble them into a 32-bit integer as follows:

0xD835 = 1101 1000 0011 0101
0xDC9E = 1101 1100 1001 1110

Taking the lower 10 bits of each, we get, respectively, binary values of 00 0011 0101 and 00 1001 1110. We put them together in sequence, the lower 10 bits of the 1st half of the Surrogate Pair, then the lower 10 bits of the 2nd half of the Surrogate Pair, as follows:

00 0011 0101 00 1001 1110

Normalizing the notation for these 20 bits, we get the following:

0000 1101 0100 1001 1110

Next, we convert to hex notation:

0xD49E

Finally, we add 0x10000 to the value, getting the final result:

0x1D49E

which is your original Unicode code point value, i.e. the value that a single UTF32 code unit holds.
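The steps above boil down to two masks, a shift and an addition. Here is a minimal Python sketch (not WinBatch); surrogate_pair_to_code_point is a name chosen for this example.

```python
# Hypothetical helper for illustration: combine the two halves of a
# Surrogate Pair into the code point they represent.
def surrogate_pair_to_code_point(first, second):
    if not (0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF):
        raise ValueError("not a valid Surrogate Pair")
    upper_10_bits = first & 0x3FF    # lower 10 bits of the 1st half
    lower_10_bits = second & 0x3FF   # lower 10 bits of the 2nd half
    return ((upper_10_bits << 10) | lower_10_bits) + 0x10000

print(hex(surrogate_pair_to_code_point(0xD835, 0xDC9E)))  # 0x1d49e
```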

Reversing the process, given the initial value of 0x1D49E, we do the following:

Subtract 0x10000 from the value to obtain the following:

0xD49E

Then, with the value represented in binary, we take its lower 20 bits:

0000 1101 0100 1001 1110

And split it apart, such that the high order 10 bits are our "first" value, and the low order 10 bits are our "second" value, as follows:

0000 1101 01 = 0000 0011 0101 = 0x35

and

00 1001 1110 = 0000 1001 1110 = 0x9E

Next, we add 0xD800 to the 1st value, and we add 0xDC00 to the 2nd value, obtaining the following values:

0xD835 and 0xDC9E

which gets us back to our original Surrogate Pair values.
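The reverse direction can be sketched the same way in Python (not WinBatch); code_point_to_surrogate_pair is a name made up for the example.

```python
# Hypothetical helper for illustration: split a code point above
# 0xFFFF into the two halves of its UTF16 Surrogate Pair.
def code_point_to_surrogate_pair(code_point):
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a Surrogate Pair")
    offset = code_point - 0x10000           # a 20-bit value
    first = 0xD800 + (offset >> 10)         # high order 10 bits
    second = 0xDC00 + (offset & 0x3FF)      # low order 10 bits
    return first, second

first, second = code_point_to_surrogate_pair(0x1D49E)
print(hex(first), hex(second))  # 0xd835 0xdc9e
```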

Finally, since we're using a Little Endian architecture, when streaming the values out to a file or in from a file, the bytes in the integers are processed from low order byte to high order byte, thus giving us a byte stream of "35 D8 9E DC". If we prefix that with the BOM value of "FF FE", we get the original byte stream that we read in from our UTF16LE encoded text file.
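Putting it together for a whole sequence of C notated values, the byte stream can be built as follows. This is a Python sketch of the idea, not WinBatch code; code_points_to_utf16le and its with_bom parameter are assumptions made for the example. The same arithmetic would have to be carried out with whatever binary buffer functions your scripting language offers.

```python
import struct

# Hypothetical helper for illustration: encode a sequence of code
# points as a UTF16 Little Endian byte stream, optionally with a BOM.
def code_points_to_utf16le(code_points, with_bom=True):
    units = [0xFEFF] if with_bom else []
    for cp in code_points:
        if cp >= 0x10000:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # 1st half of the pair
            units.append(0xDC00 + (offset & 0x3FF))  # 2nd half of the pair
        else:
            units.append(cp)                         # fits in one code unit
    # "<" writes each 16-bit unit low order byte first.
    return struct.pack(f"<{len(units)}H", *units)

stream = code_points_to_utf16le([0x1D49E])
print(" ".join(f"{b:02X}" for b in stream))  # FF FE 35 D8 9E DC
```

For text placed on the ClipBoard as CF_UNICODETEXT, the BOM would normally be left out, which is why the hex string in the question starts directly with '35D8'.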

Article ID:   W18291

Filename:   Unicode Character Representation.txt

File Created: 2012:10:11:07:48:38

Last Updated: 2012:10:11:07:48:38
