
© 2008–2013 Carsten Bormann

What every Programmer should know about

😱 Unicode 😱

2nd semester, Media Informatics

Prof. Dr.-Ing. Carsten Bormann

U+1F4A9 = 💩


Textual information: characters

Primary source of information on the Web: text

Characters: letters, digits, punctuation, special characters

Which characters exist? ➔ Character repertoire

How are they encoded digitally? ➔ Character set

What do they look like? ➔ Font (typeface)


Digital encoding

Encoding as a string of bits
– each bit is 0 or 1
– n bits ➔ 2^n possibilities (2^5 = 32, 2^7 = 128, 2^8 = 256, ...)

Example: numbers
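The 2^n figures can be checked with a few lines of Ruby, the language used later in these slides. This is only a small sketch; the method names `possibilities` and `bit_string` are made up for illustration.

```ruby
# Number of distinct values that n bits can represent.
def possibilities(n_bits)
  2**n_bits
end

# A non-negative integer written as a fixed-width string of bits.
def bit_string(number, width)
  number.to_s(2).rjust(width, "0")
end

[5, 7, 8].each { |n| puts "#{n} bits -> #{possibilities(n)} possibilities" }
puts bit_string(65, 8)
```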


Character codes: Baudot (ITA2, ITU-T S.1)

Telegraphy (50 bit/s): 5 bits ➔ 32 symbols
– A-Z = 26
– Digits + punctuation = 21
– 6 symbols are unambiguous
– 26 symbols are doubly assigned
– Bu/Zi (letters/figures shift) for switching between the two assignments


Character codes: 7-bit codes

7 bits per character (the eighth bit is left free for parity)

ASCII, ISO 646 = IA5 ~ DIN 66003
– National variants: not all codes are assigned the same characters

Control characters: CR, LF, ... (0 – 31)

Graphic characters: !"#$...A-Z...a-z... (32* – 127*)
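The split into control and graphic characters, and the use of the eighth bit for parity, can be sketched as follows. The method names are hypothetical. The sketch assumes that the asterisks on 32 and 127 mark the special cases SPACE and DEL, and it picks even parity arbitrarily, since the slide does not say which parity is meant.

```ruby
# Classify a 7-bit code according to the ranges above.
# Assumption: 32 (SPACE) and 127 (DEL) are the special cases marked with "*".
def ascii_class(code)
  raise ArgumentError, "not a 7-bit code" unless (0..127).cover?(code)
  return :control if code < 32 || code == 127
  return :space if code == 32
  :graphic
end

# Even parity (an assumption): set the 8th bit so that the total
# number of 1 bits in the byte becomes even.
def with_even_parity(code)
  ones = code.to_s(2).count("1")
  ones.odd? ? (code | 0x80) : code
end

puts ascii_class(13)                  # CR
puts ascii_class("A".ord)
puts with_even_parity("C".ord).to_s(2)
```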



8-bit codes

Problem: national variants are unwieldy
– European integration…

The 8th bit is unused ➔ idea: 2 tables

Left table ~ ASCII


8-bit codes

ISO 6937:
– Left table: ISO 646:1973 (ASCII without $)
– Right table for all languages written in Latin script
  – Diacritics
  – Special/composite characters

ISO 8859-n:
– Left table: ASCII (ISO 646:1990)
– Right table in about 15 variants (ISO 8859-1 to -15)
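To illustrate the two-table idea: the same byte from the right table means different things in different ISO 8859 variants, while the left (ASCII) table stays the same. A minimal sketch, assuming Ruby 1.9 or later; `decode_byte` is a hypothetical helper name.

```ruby
# Interpret a single byte under the given 8-bit character set
# and return the result as a UTF-8 string.
def decode_byte(byte_value, charset)
  [byte_value].pack("C").force_encoding(charset).encode("UTF-8")
end

# Right table: byte 0xA4 differs between the variants.
puts decode_byte(0xA4, "ISO-8859-1")   # currency sign
puts decode_byte(0xA4, "ISO-8859-15")  # euro sign

# Left table: byte 0x41 is "A" in both.
puts decode_byte(0x41, "ISO-8859-1")
puts decode_byte(0x41, "ISO-8859-15")
```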





Classic character codes

Telegraphy: 5-bit code, 2^5 = 32
– Through double assignment: 26+26+6 = 58 characters

ASCII/ISO 646: 7-bit code, 2^7 = 128
– C-set: 32 control characters; G-set: 96 (94) graphic characters

ISO 6937: 8-bit code, 2^8 = 256
– 2 C-sets, 2 G-sets; about 600 characters through composition

ISO 8859-n: 8-bit code, 2^8 = 256
– Variants specific to economic regions, each with 94+96 = 190 characters (incl. ASCII)


Problems with 8-bit codes

Bengali, Devanagari, Tamil, Thai, Tibetan, ...

What about the ideographic scripts?
– Kanji (Japan), Hanzi (China), Hanja (Korea, alongside Hangul)
– Thousands of symbols

Other symbols
– Dingbats, mathematical symbols, electrical engineering, ...
– half-width spaces, low left quotation marks, ...

Combining several scripts in one application

Multiple assignment = ISO 2022 (code extension)

16-/32-bit codes = ISO 10646 (Unicode)



Unicode (ISO 10646)

Goal: be able to represent all defined characters

Idea: a 32-bit character set, encoded efficiently
– 2^31 ~ 2 billion characters (in practice: up to 0x10FFFF ~ 2^20 ~ 1 million max.)

128 groups, 256 planes, 256 rows, 256 cells
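The group/plane/row/cell structure is simply the four octets of a 31-bit code point (7 + 8 + 8 + 8 bits). As a rough sketch, with the hypothetical helper `ucs4_position`:

```ruby
# Split a code point into the ISO 10646 group, plane, row and cell.
# 128 groups need 7 bits; planes, rows and cells need 8 bits each.
def ucs4_position(codepoint)
  {
    group: (codepoint >> 24) & 0x7F,
    plane: (codepoint >> 16) & 0xFF,
    row:   (codepoint >> 8) & 0xFF,
    cell:  codepoint & 0xFF
  }
end

p ucs4_position(0x1F4A9)   # U+1F4A9 from the title slide
p ucs4_position(0x00E4)    # a BMP character: group 0, plane 0
```

Everything that is actually assigned lies in group 0, planes 0x00 to 0x10, which is where the limit 0x10FFFF comes from.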


Unicode BMP: 16-bit character set

Idea: overlay the Kanji and Hanzi variants
– Plane 00 of group 00 is sufficient

Basic Multilingual Plane (BMP)

UCS-2 format
– MSB first vs. LSB first: byte order mark (BOM) FEFF…
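How the BOM exposes the byte order can be sketched in Ruby. `guess_byte_order` is a hypothetical helper that only looks at the first two bytes; it is an illustration, not a complete encoding detector.

```ruby
# U+FEFF serialized MSB first is FE FF, LSB first is FF FE.
# The swapped value U+FFFE is not a character, so the order is unambiguous.
def guess_byte_order(bytes)
  case bytes.first(2)
  when [0xFE, 0xFF] then :big_endian
  when [0xFF, 0xFE] then :little_endian
  else :unknown
  end
end

bom = "\uFEFF"
p bom.encode("UTF-16BE").bytes
p bom.encode("UTF-16LE").bytes
p guess_byte_order((bom + "a").encode("UTF-16LE").bytes)
```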


Unicode BMP: A-zone

ASCII and Latin-1 are code-compatible subsets

The other 8859-n parts are also present (shifted)

Greek, Hebrew, Arabic, ...

Punctuation, mathematics, dingbats, ...


Representing Unicode

UCS: UCS-2, UCS-4
– Byte-order problems ➔ FEFF (byte order mark, BOM)

UTF: UCS Transformation Format
– UTF-7: +ACQ- (the character $)
– UTF-8: splits the code point across bytes; unambiguous even when decoding starts in the middle of a stream

  0000 – 007F:    0xxx xxxx
  0080 – 07FF:    110x xxxx, 10xx xxxx
  0800 – FFFF:    1110 xxxx, 10xx xxxx, 10xx xxxx
  10000 – 10FFFF: 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx

– UTF-16: like UCS-2, but with surrogates
  10000 – 10FFFF: subtract 10000, then 1101 10xx xxxx xxxx, 1101 11xx xxxx xxxx
  UTF-16BE vs. UTF-16LE (oops) ➔ BOM...

– UTF-32: like UCS-4, but restricted to 0..0x10FFFF
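The UTF-8 bit patterns and the UTF-16 surrogate rule can be written out by hand and compared with Ruby's built-in conversion. This is a sketch for illustration only: `utf8_bytes` and `utf16_units` are made-up names, and the surrogate range D800–DFFF is not rejected as a real encoder would have to.

```ruby
# Encode one code point into UTF-8 bytes, following the bit patterns above.
def utf8_bytes(cp)
  case cp
  when 0x0000..0x007F
    [cp]
  when 0x0080..0x07FF
    [0b1100_0000 | (cp >> 6), 0b1000_0000 | (cp & 0x3F)]
  when 0x0800..0xFFFF
    [0b1110_0000 | (cp >> 12),
     0b1000_0000 | ((cp >> 6) & 0x3F),
     0b1000_0000 | (cp & 0x3F)]
  when 0x10000..0x10FFFF
    [0b1111_0000 | (cp >> 18),
     0b1000_0000 | ((cp >> 12) & 0x3F),
     0b1000_0000 | ((cp >> 6) & 0x3F),
     0b1000_0000 | (cp & 0x3F)]
  else
    raise ArgumentError, "outside 0..0x10FFFF"
  end
end

# Encode one code point into UTF-16 code units (16-bit values).
def utf16_units(cp)
  return [cp] if cp < 0x10000
  v = cp - 0x10000                              # 20 bits remain
  [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]    # 1101 10xx..., 1101 11xx...
end

cp = 0x1F4A9
puts utf8_bytes(cp).map { |b| format("%02X", b) }.join(" ")
puts utf16_units(cp).map { |u| format("%04X", u) }.join(" ")

# Cross-check against Ruby's own encoder.
puts utf8_bytes(cp) == [cp].pack("U").bytes
```

Because every continuation byte starts with the bits 10, a decoder that starts reading in the middle of a stream can skip ahead to the next byte that does not, which is the "unambiguous even mid-stream" property.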


Characters vs. glyphs

Character code: code combinations for graphic characters

Their appearance can nevertheless differ:

Variations in form can be abstracted away:
– e.g.:
– Ligatures:
– Arabic writing: initial, medial, final, isolated
– Arabic vs. European:

Glyph registry vs. additional characters in Unicode



Normalization: NFD, NFC, NFKD, NFKC

[Figure: examples of the four normalization forms NFD, NFC, NFKD, NFKC]
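The four forms can be tried out directly. The sketch below assumes Ruby 2.2 or later, where `String#unicode_normalize` is available (newer than the Ruby versions discussed in these slides); `normalization_forms` is a hypothetical helper.

```ruby
# Return the code points of a string under each normalization form.
def normalization_forms(str)
  [:nfd, :nfc, :nfkd, :nfkc].each_with_object({}) do |form, result|
    result[form] = str.unicode_normalize(form).codepoints
  end
end

# "ü" as one code point: the D forms decompose it into u + combining diaeresis.
p normalization_forms("\u00FC")

# The "fi" ligature U+FB01: only the K (compatibility) forms replace it with f + i.
p normalization_forms("\uFB01")
```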


Character sets in practice

Industry is in transition from ISO 8859 to Unicode
– Windows-1252 (an extension of ISO 8859-1) is widespread

Unicode is the base character set of HTML
– but HTML itself is often encoded in ISO 8859-1 (default!)

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
– <?xml version="1.0" encoding="iso-8859-1"?>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
– <?xml version="1.0"?>
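What goes wrong when the declared charset does not match the actual bytes can be simulated in Ruby. A sketch with hypothetical helper names: the first method shows UTF-8 content read as ISO 8859-1, the second checks whether ISO 8859-1 bytes would even be valid UTF-8.

```ruby
# UTF-8 bytes interpreted as ISO 8859-1: every byte becomes its own character.
def misread_as_latin1(text)
  text.encode("UTF-8").dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# ISO 8859-1 bytes interpreted as UTF-8: often not valid UTF-8 at all.
def latin1_valid_as_utf8?(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8").valid_encoding?
end

puts misread_as_latin1("\u00E4")        # the two bytes C3 A4 show up as two characters
puts latin1_valid_as_utf8?("\u00E4")    # the lone byte E4 is not valid UTF-8
puts latin1_valid_as_utf8?("abc")       # pure ASCII is unaffected
```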


Apache and the character set

httpd.conf, .htaccess

AddCharset UTF-8 .html

AddType 'text/html; charset=UTF-8' html

Selectively:

<Files "example.html">
AddCharset UTF-8 .html
</Files>

http://www.w3.org/International/questions/qa-htaccess-charset


Useful Unicode characters

Quotation marks:
– Low left „ (U+201E, &#8222;)
– High right “ (in English: left) (U+201C, &#8220;)
– English right ” (U+201D, &#8221;)

Dashes
– En dash (Halbgeviertstrich) – common today – (U+2013, &#8211;)
– Em dash (Geviertstrich) — traditional/USA — (U+2014, &#8212;)

Euro sign € (U+20AC, &#8364;)

Caution: character references between &#128; and &#159; are errors (leftovers from Windows-1252)
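The byte range 0x80–0x9F is exactly where Windows-1252 and ISO 8859-1 differ: Windows-1252 puts printable characters there, while in ISO 8859-1 and Unicode these positions are control codes. A small sketch with made-up helper names:

```ruby
# Interpret one byte under the given character set, result as UTF-8.
def byte_as(byte_value, charset)
  [byte_value].pack("C").force_encoding(charset).encode("UTF-8")
end

# True if the text contains code points from the C1 control range,
# which usually indicates Windows-1252 data mislabeled as ISO 8859-1.
def c1_leftovers?(text)
  text.codepoints.any? { |cp| (0x80..0x9F).cover?(cp) }
end

euro  = byte_as(0x80, "Windows-1252")   # the euro sign, U+20AC
wrong = byte_as(0x80, "ISO-8859-1")     # the control code U+0080

puts format("U+%04X", euro.ord)
puts format("U+%04X", wrong.ord)
puts c1_leftovers?(wrong)
```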

ASCII-8BIT (BINARY) Big5 (CP950) CP51932 CP850 (IBM850) CP852 CP855 CP949 Emacs-Mule EUC-JP (eucJP) EUC-KR (eucKR) EUC-TW (eucTW) eucJP-ms (euc-jp-ms) GB12345 GB18030 GB1988 GB2312 (EUC-CN, eucCN) GBK (CP936) IBM437 (CP437) IBM737 (CP737) IBM775 (CP775) IBM852 IBM855 IBM857 (CP857) IBM860 (CP860) IBM861 (CP861) IBM862 (CP862) IBM863 (CP863) IBM864 (CP864) IBM865 (CP865) IBM866 (CP866) IBM869 (CP869) ISO-2022-JP (ISO2022-JP) ISO-2022-JP-2 (ISO2022-JP2) ISO-8859-1 (ISO8859-1) ISO-8859-10 (ISO8859-10) ISO-8859-11 (ISO8859-11) ISO-8859-13 (ISO8859-13) ISO-8859-14 (ISO8859-14) ISO-8859-15 (ISO8859-15) ISO-8859-16 (ISO8859-16) ISO-8859-2 (ISO8859-2) ISO-8859-3 (ISO8859-3) ISO-8859-4 (ISO8859-4) ISO-8859-5 (ISO8859-5) ISO-8859-6 (ISO8859-6) ISO-8859-7 (ISO8859-7) ISO-8859-8 (ISO8859-8) ISO-8859-9 (ISO8859-9) KOI8-R (CP878) KOI8-U macCentEuro macCroatian macCyrillic macGreek macIceland MacJapanese (MacJapan) macRoman macRomania macThai macTurkish macUkraine Shift_JIS (SJIS) stateless-ISO-2022-JP TIS-620 US-ASCII (ASCII, ANSI_X3.4-1968, 646) UTF-16BE (UCS-2BE) UTF-16LE UTF-32BE (UCS-4BE) UTF-32LE (UCS-4LE) UTF-7 (CP65000) UTF-8 (CP65001, locale, external) UTF8-MAC (UTF-8-MAC) Windows-1250 (CP1250) Windows-1251 (CP1251) Windows-1252 (CP1252) Windows-1253 (CP1253) Windows-1254 (CP1254) Windows-1255 (CP1255) Windows-1256 (CP1256) Windows-1257 (CP1257) Windows-1258 (CP1258) Windows-31J (CP932, csWindows31J) Windows-874 (CP874)


UTF-8 in programming languages

Ruby 1.8:
– Strings are byte sequences
– ASCII compatibility is assumed

Ruby 1.9/2.0:
– String#bytes, #codepoints, #chars
– String#encoding

"a".encoding ➔ #<Encoding:UTF-8> == Encoding::UTF_8

String.new.encoding ➔ #<Encoding:ASCII-8BIT> == Encoding::BINARY

– String#force_encoding(Encoding::UTF_8)
– String#valid_encoding?
– String#encode(Encoding::UTF_8, invalid: :replace)
– String#encode("UTF-8", "ISO8859-1")
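The methods listed above can be combined into a short tour. This is a sketch with hypothetical helper names (`describe`, `valid_utf8?`, `latin1_bytes_to_utf8`); only the String methods themselves come from the slide.

```ruby
# Three views of the same string: bytes, code points, characters.
def describe(str)
  {
    bytes:      str.bytes.length,
    codepoints: str.codepoints,
    chars:      str.chars.length,
    encoding:   str.encoding
  }
end

# force_encoding only relabels the bytes; valid_encoding? checks the label.
def valid_utf8?(bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8).valid_encoding?
end

# encode with a source encoding actually converts the bytes.
def latin1_bytes_to_utf8(bytes)
  bytes.pack("C*").encode("UTF-8", "ISO8859-1")
end

p describe("\u00E4\u20AC\u{1F4A9}")     # 2 + 3 + 4 bytes, 3 code points
p valid_utf8?([0xE4])                   # a lone ISO 8859-1 byte
p valid_utf8?([0xC3, 0xA4])
p latin1_bytes_to_utf8([0xE4])
```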


# -*- coding: UTF-8 -*-

UTF-8 is the default source encoding in Ruby 2.0

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/




Being “helpful” rarely helps (“ASCII compatible”)


>> u = "a".encode("UTF-8")
=> "a"
>> b = "a".force_encoding("BINARY")
=> "a"
>> u + b
=> "aa"
>>

>> u = "ä".encode("UTF-8")
=> "ä"
>> b = "ä".force_encoding("BINARY")
=> "\xC3\xA4"
>> u + b
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
>>
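The same behaviour as a script rather than an irb session: concatenation across encodings works as long as both strings contain only ASCII, and fails as soon as real data with non-ASCII characters arrives. `try_concat` is a hypothetical helper; `String#b` (Ruby 2.0 and later) returns a copy tagged as ASCII-8BIT.

```ruby
# Concatenate two strings; return the exception class if Ruby refuses.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  e.class
end

p try_concat("a", "a".b)             # both ASCII-only: silently "compatible"
p try_concat("\u00E4", "\u00E4".b)   # non-ASCII on both sides: error
p "\u00E4".ascii_only?               # the property that decides the outcome
```

The "helpful" part is that the bug stays hidden during tests with ASCII data and only surfaces later with real input.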


WTF OSX HFS+ NFD

File system of OS X: HFS+
– January 19, 1998
– Apple had not quite understood Unicode yet

HFS+: file names in NFD
– Müller ➔ Mu¨ller

alma:tmp cabo$ ls -l *ml*t
-rw-r--r-- 1 cabo wheel 13 Feb 26 15:18 ümläut
alma:tmp cabo$ irb
>> Dir["*ml*t"].first.chars.to_a
=> ["u", " ̈", "m", "l", "a", " ̈", "u", "t"]
>> Dir["*ml*t"].first.encode("UTF-8", "UTF-8-MAC").chars.to_a
=> ["ü", "m", "l", "ä", "u", "t"]
>>

“UTF-8-MAC” as a common name for UTF-8 in NFD
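The effect can be reproduced without an HFS+ volume by building the decomposed name by hand. A sketch, assuming a Ruby build that knows the "UTF-8-MAC" encoding name used in the session above; `composed_name` is a hypothetical helper.

```ruby
# Convert a file name from decomposed form (as stored by HFS+)
# to composed form, using the UTF-8-MAC source encoding.
def composed_name(name)
  name.encode("UTF-8", "UTF-8-MAC")
end

# The decomposed name: each umlaut is a base letter plus U+0308.
nfd_name = "u\u0308mla\u0308ut"

p nfd_name.chars.length                       # 8 code points
p composed_name(nfd_name).chars.length        # 6 code points
p nfd_name == "\u00FCml\u00E4ut"              # false: the strings differ byte for byte
p composed_name(nfd_name) == "\u00FCml\u00E4ut"
```

Comparing user input with names read from the file system therefore only works reliably after both sides have been brought into the same normalization form.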
