Friday, June 12, 2009

Some Unicode character sets (10 bits)

0-127: C0 Controls and Basic Latin

128 characters; hex 0000-007F; dec 0-127, i.e. this makes full use of 7 bits. Subsets are: (a) C0 controls (0000-001F or 0-31); (b) ASCII punctuation and symbols, including 'space' (0020-002F or 32-47); (c) ASCII digits 0-9 (0030-0039 or 48-57); (d) more ASCII punctuation and symbols (003A-0040 or 58-64); (e) uppercase Latin alphabet A-Z (0041-005A or 65-90); (f) more ASCII punctuation and symbols (005B-0060 or 91-96); (g) lowercase Latin alphabet a-z (0061-007A or 97-122); (h) more ASCII punctuation and symbols, plus the control character 'delete' (007B-007F or 123-127).
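As a rough illustration, the subset boundaries listed above can be put into a small lookup table. This is only a sketch: the names `ASCII_SUBSETS` and `ascii_subset` are made up for this example and are not part of any library.

```python
# Subset boundaries copied from the list above (inclusive code points).
# The table and function names are hypothetical, chosen for this sketch.
ASCII_SUBSETS = [
    (0x00, 0x1F, "C0 controls"),
    (0x20, 0x2F, "punctuation and symbols"),
    (0x30, 0x39, "digits"),
    (0x3A, 0x40, "punctuation and symbols"),
    (0x41, 0x5A, "uppercase Latin letters"),
    (0x5B, 0x60, "punctuation and symbols"),
    (0x61, 0x7A, "lowercase Latin letters"),
    (0x7B, 0x7F, "punctuation and symbols (plus delete)"),
]


def ascii_subset(ch):
    """Return the subset label for a single character in the 0-127 range."""
    cp = ord(ch)
    for low, high, label in ASCII_SUBSETS:
        if low <= cp <= high:
            return label
    raise ValueError(f"U+{cp:04X} is outside the 0-127 range")


for ch in ("A", "z", "7", " ", "~"):
    print(repr(ch), ord(ch), ascii_subset(ch))
```

Adding up the sizes of the eight ranges gives 128, which matches the character count quoted for the block.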

128-255: C1 Controls and Latin-1 Supplement

128 characters; hex 0080-00FF; dec 128-255, i.e. together with 'C0 Controls and Basic Latin', this makes full use of 8 bits. Subsets are: (a) C1 controls (0080-009F or 128-159); (b) Latin-1 punctuation and symbols, including currency signs, superscripts and fractions (00A0-00BF or 160-191); (c) letters (00C0-00D6 or 192-214); (d) mathematical operator × (00D7 or 215); (e) more letters (00D8-00F6 or 216-246); (f) mathematical operator ÷ (00F7 or 247); (g) more letters (00F8-00FF or 248-255). The letters are the extra ones needed for writing Western European languages, i.e. uppercase and lowercase accented vowels, ç, ñ, ß, ø, ð, þ, and æ.
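To see how the two operators interrupt the runs of letters, the three letter ranges can be printed with Python's standard `unicodedata` module. This is a minimal sketch; `LETTER_RANGES` and `letters` are names chosen for the example only.

```python
import unicodedata

# Letter ranges (c), (e) and (g) from the list above, inclusive.
LETTER_RANGES = [(0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]

letters = "".join(
    chr(cp) for low, high in LETTER_RANGES for cp in range(low, high + 1)
)
print(len(letters), letters)

# The two code points skipped by the ranges are the operators (d) and (f).
for cp in (0xD7, 0xF7):
    ch = chr(cp)
    print(f"U+{cp:04X}", ch, unicodedata.category(ch), unicodedata.name(ch))
```

The general category reported for × and ÷ is `Sm` (math symbol), while every character in the three ranges is classified as a letter.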

256-383: Latin Extended-A

128 characters; hex 0100-017F; dec 256-383, i.e. this uses half of the extra range opened up by the 9th bit (256-511). Adds accented Latin vowels and consonants for Eastern European languages, and for others such as Finnish, Turkish and Greenlandic.

384-591: Latin Extended-B

208 characters; hex 0180-024F; dec 384-591, i.e. this uses up the rest of the 9-bit range (up to 511) and starts on the 10-bit range (up to 1023). Subsets: (a) non-European and historic Latin; (b) African letters for clicks; (c) Croatian digraphs matching Serbian Cyrillic letters; (d) Pinyin diacritic-vowel combinations; (e) phonetic and historic letters; (f) additions for Slovenian and Croatian; (g) additions for Romanian; (h) miscellaneous additions; (i) additions for Livonian; (j) additions for Sinology; (k) miscellaneous additions.
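The bit counts mentioned so far can be checked with Python's built-in `int.bit_length()`. The sketch below looks only at the last code point of each of the first four blocks; `BLOCK_ENDS` and `bits_needed` are hypothetical names used for illustration.

```python
# Last code point of each block described so far.
BLOCK_ENDS = {
    "C0 Controls and Basic Latin": 0x007F,
    "C1 Controls and Latin-1 Supplement": 0x00FF,
    "Latin Extended-A": 0x017F,
    "Latin Extended-B": 0x024F,
}


def bits_needed(code_point):
    """Smallest number of bits that can hold code_point (at least 1)."""
    return max(1, code_point.bit_length())


for name, last in BLOCK_ENDS.items():
    print(f"{name}: last code point {last} needs {bits_needed(last)} bits")
```

Latin Extended-B is the first block whose last code point (591) no longer fits in 9 bits, since 9 bits only reach 511.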

592-687: IPA Extensions

96 characters; hex 0250-02AF; dec 592-687. Additional characters needed for the International Phonetic Alphabet (IPA).

688-767: Spacing Modifier Letters

80 characters; hex 02B0-02FF; dec 688-767.

768-879: Combining Diacritical Marks

112 characters; hex 0300-036F; dec 768-879.
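Characters in this block are not meant to stand alone: each one attaches to the character before it. As a small sketch using the standard `unicodedata` module, here is U+0301 (combining acute accent) following a plain 'e', and what NFC normalization does with the pair. The variable names are chosen for the example only.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT, a member of this block.
decomposed = "e\u0301"

# NFC normalization merges the pair into one precomposed character
# when Unicode defines one.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(f"U+{ord(composed):04X}", unicodedata.name(composed))

# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301"))
```

The precomposed result is U+00E9, which sits back in the Latin-1 Supplement block described earlier.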

880-1023: Greek and Coptic

144 characters (9 are blank and reserved for future use); hex 0370-03FF; dec 880-1023.
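As a cross-check on the figures in this post, the eight blocks can be written down as a table and their sizes recomputed from the hex ranges. This is only a sketch; `BLOCKS`, `block_size` and `block_of` are names invented for the example.

```python
# (name, first code point, last code point), taken from the headings above.
BLOCKS = [
    ("C0 Controls and Basic Latin", 0x0000, 0x007F),
    ("C1 Controls and Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended-B", 0x0180, 0x024F),
    ("IPA Extensions", 0x0250, 0x02AF),
    ("Spacing Modifier Letters", 0x02B0, 0x02FF),
    ("Combining Diacritical Marks", 0x0300, 0x036F),
    ("Greek and Coptic", 0x0370, 0x03FF),
]


def block_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


def block_of(code_point):
    """Return the block name for a code point below 1024, else None."""
    for name, first, last in BLOCKS:
        if first <= code_point <= last:
            return name
    return None


for name, first, last in BLOCKS:
    print(f"{first:04X}-{last:04X}  {block_size(first, last):3d}  {name}")

print("total:", sum(block_size(first, last) for _, first, last in BLOCKS))
```

The sizes add up to 1024, so these eight blocks between them cover every value that fits in 10 bits.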
