HTTP: The Definitive Guide, Chapter 16

HTTP: The Definitive Guide (ch. 16, Internationalization)
People Who Dream of Becoming Architects (아키텍트를 꿈꾸는 사람들)

Cecil

Contents

• HTTP support for international content

• Internationalized URIs

• Other considerations

HTTP for Multilingual Content

Accept-Charset: iso-8859-1, utf-8
Accept-Language: fr, en;q=0.8
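
As a rough illustration of how a server might read the two request headers above, the sketch below splits a header value into tokens and sorts them by their q-value. The function name parse_quality_list is hypothetical, and treating a missing q parameter as 1.0 is the usual convention, stated here as an assumption rather than taken from the slide.

```python
def parse_quality_list(header_value):
    """Parse a comma-separated header value with optional q-values.

    Returns (token, q) pairs sorted from most to least preferred.
    """
    items = []
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        token, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # assumption: a missing q parameter means full preference
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        items.append((token, q))
    # sorted() is stable, so tokens with equal q keep their header order
    return sorted(items, key=lambda item: item[1], reverse=True)


languages = parse_quality_list("fr, en;q=0.8")
charsets = parse_quality_list("iso-8859-1, utf-8")
print(languages)
print(charsets)
```

With the slide's headers, French is preferred over English, and both charsets are equally acceptable.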



The gzip, compress, and deflate encodings are lossless compression algorithms used to reduce the size of transmitted messages without loss of information. Of these, gzip typically is the most effective compression algorithm and is the most widely used.

Accept-Encoding Headers

Of course, we don’t want servers encoding content in ways that the client can’t decipher. To prevent servers from using encodings that the client doesn’t support, the client passes along a list of supported content encodings in the Accept-Encoding request header. If the HTTP request does not contain an Accept-Encoding header, a server can assume that the client will accept any encoding (equivalent to passing Accept-Encoding: *).

Figure 15-4 shows an example of Accept-Encoding in an HTTP transaction.

Table 15-2. Content-encoding tokens

Content-encoding value | Description
gzip | Indicates that the GNU zip encoding was applied to the entity. (a)
compress | Indicates that the Unix file compression program has been run on the entity.
deflate | Indicates that the entity has been compressed into the zlib format. (b)
identity | Indicates that no encoding has been performed on the entity. When a Content-Encoding header is not present, this can be assumed.

(a) RFC 1952 describes the gzip encoding.
(b) RFCs 1950 and 1951 describe the zlib format and deflate compression.

Figure 15-4. Content encoding

Request message:
GET /logo.gif HTTP/1.1
Accept-encoding: gzip
[...]

Response message (body compressed with gzip by the server, then decompressed with gunzip by the client):
HTTP/1.1 200 OK
Content-type: image/gif
Content-encoding: gzip
[...]

The server compresses the image with gzip to transport a smaller file over the thin network connection between itself and the client. This saves network bandwidth and reduces the amount of time that the client waits for the transfer. The client, however, has to spend time decompressing the image once it has been served.
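
The exchange described above can be sketched with Python's standard gzip module. This is only a minimal illustration: encode_body is a hypothetical helper, the body is a stand-in for /logo.gif, and q-values in Accept-Encoding are deliberately ignored.

```python
import gzip


def encode_body(body, accept_encoding):
    """Return (headers, payload), gzipping only if the client allows it.

    accept_encoding is the raw Accept-Encoding value, or None when the
    header is absent. Simplification: q-values are ignored.
    """
    if accept_encoding is None:
        accepted = {"*"}  # no header: any encoding is acceptable
    else:
        accepted = {
            token.split(";")[0].strip().lower()
            for token in accept_encoding.split(",")
        }
    if "gzip" in accepted or "*" in accepted:
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    return {}, body  # identity: no Content-Encoding header is sent


body = b"GIF89a" + b"\x00" * 4096  # hypothetical stand-in for /logo.gif
headers, payload = encode_body(body, "gzip")
restored = gzip.decompress(payload)  # what the client does on receipt
print(headers, len(body), len(payload))
```

The compressed payload is smaller than the original body, and decompressing it on the client side yields the original bytes again.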

www.it-ebooks.info



Delta Encoding

We have described different versions of a web page as different instances of a page. If a client has an expired copy of a page, it requests the latest instance of the page. If the server has a newer instance of the page, it will send it to the client, and it will send the full new instance of the page even if only a small portion of the page actually has changed.

Rather than sending it the entire new page, the client would get the page faster if the server sent just the changes to the client’s copy of the page (provided that the number of changes is small). Delta encoding is an extension to the HTTP protocol that optimizes transfers by communicating changes instead of entire objects. Delta encoding is a type of instance manipulation, because it relies on clients and servers exchanging information about particular instances of an object. RFC 3229 describes delta encoding.

Figure 15-10 illustrates more clearly the mechanism of requesting, generating, receiving, and applying a delta-encoded document. The client has to tell the server which version of the page it has, that it is willing to accept a delta from the latest version of the page, and which algorithms it knows for applying those deltas to its current version.
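
To make the idea of "sending just the changes" concrete, here is a small sketch that computes a delta between two instances of a page and applies it to the old copy. It uses Python's difflib purely for illustration; the copy/insert operation format and the names make_delta and apply_delta are assumptions of this sketch, not one of the delta algorithms defined for HTTP.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe how to turn old into new as copy/insert operations."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse text the client has
        elif j2 > j1:
            delta.append(("insert", new[j1:j2]))  # ship only the new text
        # a pure deletion needs no operation: the old span is not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new instance from the client's old copy plus the delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


old_page = "<html><body><p>Price: $10</p>" + "<p>filler</p>" * 50 + "</body></html>"
new_page = old_page.replace("$10", "$12")
delta = make_delta(old_page, new_page)
rebuilt = apply_delta(old_page, delta)
sent = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(new_page), sent)
```

Only the inserted text has to travel over the network; everything else is rebuilt from the copy the client already holds.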

Figure 15-9. Entity range request example

Request message (client to www.joes-hardware.com):
GET /bigfile.html HTTP/1.1
[...]

Response message:
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 65537
Accept-ranges: bytes
[...]

The client received only the first 20224 bytes of the resource.

Range request message:
GET /bigfile.html HTTP/1.1
Range: bytes=20224-
[...]

Range response message:
HTTP/1.1 200 OK
Range: bytes=20224-
Accept-ranges: bytes
[...]

The client’s original request was interrupted, but a second request for the part of the message that was not received allows the client to resume from the point of the interruption.
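
The resume step in the figure can be sketched as follows. The helper names resume_range_header and merge_partial are hypothetical, and the 65537-byte resource is simulated in memory rather than fetched. Note that in HTTP/1.1 a server that honors a range request normally answers with 206 Partial Content and a Content-Range header.

```python
def resume_range_header(bytes_received):
    """Build the Range header for the bytes not yet received."""
    if bytes_received < 0:
        raise ValueError("bytes_received must be non-negative")
    # Byte offsets are zero-based, so the first missing byte has an
    # offset equal to the number of bytes already received.
    return {"Range": f"bytes={bytes_received}-"}


def merge_partial(first_part, rest):
    """Append the resumed bytes to the part that arrived earlier."""
    return first_part + rest


# Simulated resource of the size shown in the figure (65537 bytes).
resource = bytes(i % 256 for i in range(65537))
first_part = resource[:20224]          # what arrived before the interruption
headers = resume_range_header(len(first_part))
rest = resource[20224:]                # what the range response would carry
complete = merge_partial(first_part, rest)
print(headers, len(complete))
```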


Content-Type: text/html; charset=utf-8
Content-Language: fr

Encoding scheme

Language tag

Language encoding



only with transporting the character data and the associated language and charset labels. The presentation of the character shapes is handled by the user’s graphics display software (browser, operating system, fonts), as shown in Figure 16-2c.

The Wrong Charset Gives the Wrong Characters

If the client uses the wrong charset parameter, the client will display strange, bogus characters. Let’s say a browser got the value 225 (binary 11100001) from the body:

• If the browser thinks the body is encoded with iso-8859-1 Western European character codes, it will show a lowercase Latin “a” with acute accent: á

• If the browser is using iso-8859-6 Arabic codes, it will show “FEH”: ف

• If the browser is using iso-8859-7 Greek, it will show a small “Alpha”: α
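
The three cases above can be reproduced with Python's built-in codecs. This is a small sketch that decodes the same single byte under each charset and looks up the resulting character's Unicode name.

```python
import unicodedata

raw = bytes([225])  # the single body byte 11100001

results = {}
for charset in ("iso-8859-1", "iso-8859-6", "iso-8859-7"):
    char = raw.decode(charset)          # same bits, different charset label
    results[charset] = unicodedata.name(char)
    print(charset, char, results[charset])
```

The byte never changes; only the charset label decides which character the client ends up displaying.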

Figure 16-2. HTTP “charset” combines a character encoding scheme and a coded character set

(a) Decode using encoding scheme: the data bits ...11100001 are decoded, using iso-8859-6’s encoding, into character code 225.

(b) Find character using coded character set: in the iso-8859-6 coded character set (65 LATIN CAPITAL LETTER A, 66 LATIN CAPITAL LETTER B, ..., 224 ARABIC TATWEEL, 225 ARABIC LETTER FEH, 226 ARABIC LETTER QAF, 227 ARABIC LETTER KAF), code 225 is the unique character "ARABIC LETTER FEH".

(c) Find display shape using fonts and formatting software: fonts and presentation logic turn the character into a glyph.

The MIME charset tag describes the combination of character encoding scheme and coded character set mapping.


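A way to encode characters into bits and to decode bits back into characters

Charset: the combination of a particular coded character set and a particular character encoding scheme

Major character sets

• US-ASCII
  • American Standard Code for Information Interchange; the most widely used
  • Uses only code values 0–127

• ISO-8859
  • Extensions of US-ASCII that use the high bit to add the characters needed for international writing

• UCS (Universal Character Set)
  • Represents all of the world's characters in a single coded character set
  • The basic set consists of about 50,000 characters
  • Has extended code space for millions of characters

Character encoding schemes

• Fixed width: 8-bit
  • Represents each coded character with a fixed number of bits
  • Fast to process, but may waste space

• Variable width (nonmodal): UTF-8
  • Uses a different number of bits for different character code numbers
  • The more frequently a character is used, the shorter its bit length

• Variable width (modal): iso-2022-jp
  • Uses special escape patterns to switch to a different mode

Nonmodal (UTF-8) vs. modal (iso-2022-jp)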



8-bit

The 8-bit fixed-width identity encoding simply encodes each character code with its corresponding 8-bit value. It supports only character sets with a code range of 256 characters. The iso-8859 family of character sets uses the 8-bit identity encoding.

UTF-8

UTF-8 is a popular character encoding scheme designed for UCS (UTF stands for “UCS Transformation Format”). UTF-8 uses a nonmodal, variable-length encoding for the character code values, where the leading bits of the first byte tell the length of the encoded character in bytes, and any subsequent byte contains six bits of code value (see Table 16-2).

If the first encoded byte has a high bit of 0, the length is just 1 byte, and the remaining 7 bits contain the character code. This has the nice result of ASCII compatibility (but not iso-8859 compatibility, because iso-8859 uses the high bit).

For example, character code 90 (ASCII “Z”) would be encoded as 1 byte (01011010), while code 5073 (13-bit binary value 1001111010001) would be encoded into 3 bytes:

11100001 10001111 10010001
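
The worked example above can be checked with a hand-written encoder. The sketch below covers only the 1-, 2-, and 3-byte patterns; utf8_encode and bits are hypothetical helper names, and the result is compared against Python's own UTF-8 codec.

```python
def utf8_encode(code):
    """Encode one character code using the 1- to 3-byte UTF-8 patterns."""
    if code < 0x80:        # up to 7 bits:  0ccccccc
        return bytes([code])
    if code < 0x800:       # up to 11 bits: 110ccccc 10cccccc
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:     # up to 16 bits: 1110cccc 10cccccc 10cccccc
        return bytes([
            0xE0 | (code >> 12),
            0x80 | ((code >> 6) & 0x3F),
            0x80 | (code & 0x3F),
        ])
    raise ValueError("longer forms are omitted from this sketch")


def bits(data):
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{byte:08b}" for byte in data)


print(bits(utf8_encode(90)))    # ASCII "Z", a single byte
print(bits(utf8_encode(5073)))  # three bytes
```

Code 5073 needs 13 bits, so it falls into the 3-byte pattern: the first byte carries the top 4 bits and each continuation byte carries 6 more.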

iso-2022-jp

iso-2022-jp is a widely used encoding for Japanese Internet documents. iso-2022-jp is a variable-length, modal encoding, with all values less than 128 to prevent problems with non–8-bit-clean software.

The encoding context always is set to one of four predefined character sets.* Special “escape sequences” shift from one set to another. iso-2022-jp initially uses the US-ASCII character set, but it can switch to the JIS X 0201 (JIS-Roman) character set or the much larger JIS X 0208-1978 and JIS X 0208-1983 character sets using 3-byte escape sequences.

Table 16-2. UTF-8 variable-width, nonmodal encoding

Character code bits | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6
0–7 | 0ccccccc | - | - | - | - | -
8–11 | 110ccccc | 10cccccc | - | - | - | -
12–16 | 1110cccc | 10cccccc | 10cccccc | - | - | -
17–21 | 11110ccc | 10cccccc | 10cccccc | 10cccccc | - | -
22–26 | 111110cc | 10cccccc | 10cccccc | 10cccccc | 10cccccc | -
27–31 | 1111110c | 10cccccc | 10cccccc | 10cccccc | 10cccccc | 10cccccc

* The iso-2022-jp encoding is tightly bound to these four character sets, whereas some other encodings are independent of the particular character set.


UTF-8: the leading bits of the first byte give the length of the encoded character

iso-2022-jp: escape sequences set the context to one of four predefined character sets



The escape sequences are shown in Table 16-3. In practice, Japanese text begins with “ESC $ @” or “ESC $ B” and ends with “ESC ( B” or “ESC ( J”.

When in the US-ASCII or JIS-Roman modes, a single byte is used per character. When using the larger JIS X 0208 character set, two bytes are used per character code. The encoding restricts the bytes sent to be between 33 and 126.*
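
The mode switching described above can be observed with Python's iso2022_jp codec. This is a small sketch; the sample text is arbitrary, and which of the two JIS X 0208 escape sequences appears is a property of Python's codec rather than something stated in the text.

```python
ESC = b"\x1b"

text = "abc こんにちは xyz"          # ASCII, five hiragana, ASCII again
encoded = text.encode("iso2022_jp")
print(encoded)

# Cut out the bytes between "ESC $ B" (enter JIS X 0208) and
# "ESC ( B" (return to US-ASCII).
japanese = encoded.split(ESC + b"$B", 1)[1].split(ESC + b"(B", 1)[0]
print(len(japanese), list(japanese))
```

Every byte in the output stays below 128, and each of the five Japanese characters occupies two bytes inside the escaped segment.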

euc-jp

euc-jp is another popular Japanese encoding. EUC stands for “Extended Unix Code,” first developed to support Asian characters on Unix operating systems.

Like iso-2022-jp, the euc-jp encoding is a variable-length encoding that allows the use of several standard Japanese character sets. But unlike iso-2022-jp, the euc-jp encoding is not modal. There are no escape sequences to shift between modes.

euc-jp supports four coded character sets: JIS X 0201 (JIS-Roman, ASCII with a few Japanese substitutions), JIS X 0208, half-width katakana (63 characters used in the original Japanese telegraph system), and JIS X 0212.

One byte is used to encode JIS Roman (ASCII compatible), two bytes are used for JIS X 0208 and half-width katakana, and three bytes are used for JIS X 0212. The coding is a bit wasteful but is simple to process.

The encoding patterns are outlined in Table 16-4.

Table 16-3. iso-2022-jp character set switching escape sequences

Escape sequence | Resulting coded character set | Bytes per code
ESC ( B | US-ASCII | 1
ESC ( J | JIS X 0201-1976 (JIS Roman) | 1
ESC $ @ | JIS X 0208-1978 | 2
ESC $ B | JIS X 0208-1983 | 2

* Though the bytes can have only 94 values (between 33 and 126), this is sufficient to cover all the characters in the JIS X 0208 character sets, because the character sets are organized into a 94 × 94 grid of code values, enough to cover all JIS X 0208 character codes.

Table 16-4. euc-jp encoding values

Which byte | Encoding values
JIS X 0201 (94 coded characters)
1st byte | 33–126
JIS X 0208 (6879 coded characters)
1st byte | 161–254
2nd byte | 161–254
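
The byte counts and value ranges for euc-jp can be spot-checked with Python's euc_jp codec. This is a rough sketch: the three sample characters are arbitrary choices, and JIS X 0212 is left out.

```python
samples = {
    "JIS-Roman / ASCII": "A",
    "JIS X 0208": "漢",
    "half-width katakana": "ｱ",
}

lengths = {}
for label, char in samples.items():
    data = char.encode("euc_jp")
    lengths[label] = len(data)
    print(label, list(data))

# Mixed text needs no escape sequences: the byte values themselves
# tell the decoder which character set each character belongs to.
mixed = "A漢ｱ".encode("euc_jp")
print(list(mixed))
```

Unlike iso-2022-jp, the non-ASCII bytes here use the high bit, which is what lets a decoder distinguish the character sets without keeping track of a mode.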


Language Tags



• Regional languages (as in “sgn-US-MA” for Martha’s Vineyard sign language)

• Standardized nonvariant languages (e.g., “i-navajo”)

• Nonstandard languages (e.g., “x-snowboarder-slang”*)

Subtags

Language tags have one or more parts, separated by hyphens, called subtags:

• The first subtag is called the primary subtag. The values are standardized.

• The second subtag is optional and follows its own naming standard.

• Any trailing subtags are unregistered.

The primary subtag contains only letters (A–Z). Subsequent subtags can contain letters or numbers, up to eight characters in length. An example is shown in Figure 16-9.

Capitalization

All tags are case-insensitive: the tags “en” and “eN” are equivalent. However, lowercasing conventionally is used to represent general languages, while uppercasing is used to signify particular countries. For example, “fr” means all languages classified as French, while “FR” signifies the country France.†
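
The subtag and capitalization rules above can be sketched as a small validator. The names parse_language_tag and same_language_tag are hypothetical, and limiting the primary subtag to eight letters is an assumption of this sketch; the text only states the eight-character limit for subsequent subtags.

```python
import re

PRIMARY = re.compile(r"[A-Za-z]{1,8}")        # letters only (length limit assumed)
SUBSEQUENT = re.compile(r"[A-Za-z0-9]{1,8}")  # letters or digits, up to eight


def parse_language_tag(tag):
    """Split a language tag into subtags, checking the rules quoted above."""
    subtags = tag.split("-")
    if not PRIMARY.fullmatch(subtags[0]):
        raise ValueError(f"bad primary subtag in {tag!r}")
    for subtag in subtags[1:]:
        if not SUBSEQUENT.fullmatch(subtag):
            raise ValueError(f"bad subtag {subtag!r} in {tag!r}")
    return subtags


def same_language_tag(a, b):
    """Tags are case-insensitive, so compare them in lowercase."""
    return parse_language_tag(a.lower()) == parse_language_tag(b.lower())


print(parse_language_tag("sgn-US-MA"))
print(same_language_tag("en", "eN"))
```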

IANA Language Tag Registrations

The values of the first and second language subtags are defined by various standards documents and their maintaining organizations. The IANA‡ administers the list of standard language tags, using the rules outlined in RFC 3066.

If a language tag is composed of standard country and language values, the tag doesn’t have to be specially registered. Only those language tags that can’t be composed out of the standard country and language values need to be registered specially with the

* Describes the unique dialect spoken by “shredders.”

† This convention is recommended by ISO standard 3166.

‡ See http://www.iana.org and RFC 2860.

Figure 16-9. Language tags are separated into subtags

sgn-US-MA (Martha’s Vineyard sign language): first subtag “sgn” (sign language), second subtag “US” (America), third subtag “MA” (Massachusetts regional variant)


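Short, standardized strings for naming languages

• First subtag: a language token from the ISO 639 standard language set

• Second subtag: a code chosen from the ISO 3166 country code and region standard set

• Third subtag: for extensions; no special rules

• ex) en-US, en-GS …

Internationalized URIs

To keep identifiers readable and shareable, URIs consist only of US-ASCII characters.

URI escape: a way to safely insert reserved characters or other unsupported characters into a URI (using the % character)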



filenames that contain international characters. This is incorrect and may cause problems with some applications.

For example, the filename Sven Ölssen.html (containing an umlaut) might be encoded by a web server as Sven%20%D6lssen.html. It’s fine to encode the space with %20, but is technically illegal to encode the Ö with %D6, because the code D6 (decimal 214) falls outside the range of ASCII. ASCII defines only codes up to 0x7F (decimal 127).
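
The ambiguity described above shows up directly in Python's urllib.parse. This is a small sketch: the same filename escapes to different byte sequences depending on which charset is assumed, and nothing in the resulting URI records which one was used.

```python
from urllib.parse import quote, unquote

name = "Sven Ölssen.html"

latin1_escaped = quote(name, encoding="iso-8859-1")  # Ö is the single byte D6
utf8_escaped = quote(name)                           # quote() defaults to UTF-8

print(latin1_escaped)
print(utf8_escaped)
print(unquote("big%20sale.txt"))  # escaping a space is unproblematic
```

A receiver that sees %D6 has no standard way to tell whether it was meant as an iso-8859-1 Ö or as a byte in some other character set.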

Modal Switches in URIs

Some URIs also use sequences of ASCII characters to represent characters in other character sets. For example, iso-2022-jp encoding might be used to insert “ESC ( J” to shift into JIS-Roman and “ESC ( B” to shift back to ASCII. This works in some local circumstances, but the behavior is not well defined, and there is no standardized scheme to identify the particular encoding used for the URL. As the authors of RFC 2396 say:

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277].

However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

Currently, URIs are not very international-friendly. The goal of URI portability outweighed the goal of language flexibility. There are efforts currently underway to internationalize URIs, but in the near term, HTTP applications should stick with ASCII. It’s been around since 1968, so it can’t be all that bad.

Figure 16-10. URI characters are transported as escaped code bytes but processed unescaped

Conceptual characters (external form: email, web, billboard, radio): “Big Sale at Joe’s”, http://www.joes-hardware.com/big%20sale.txt

URI code bytes (what you enter and send, in current character set): ... o=111 m=109 /=47 b=98 i=105 g=103 %=37 2=50 0=48 s=115 ...

Unescaped ASCII code bytes (what you process, in US-ASCII character set): ... 111 109 47 98 105 103 32 115 ...


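Other Considerations

• HTTP headers
  • Must consist only of characters from the US-ASCII character set

• Dates
  • Using the correct GMT date format is recommended

• Domain names
  • Internationalized domain names (IDN)
  • Most web browsers support Punycode
  • Punycode: a way to convert a Unicode string into characters that can be used in host names
  • ex) 한글.com -> xn--bj0bj06e.com

http://xn--bj0bj06e.com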
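
The Punycode conversion on this slide can be tried with Python's built-in idna codec. This is a minimal sketch; the codec implements the older IDNA 2003 rules, which is assumed to be sufficient for this example.

```python
host = "한글.com"

# Each non-ASCII label is converted to Punycode and prefixed with "xn--";
# the plain ASCII label "com" is left unchanged.
ascii_host = host.encode("idna")
round_trip = ascii_host.decode("idna")

print(ascii_host)
print(round_trip)
```

The encoded host name contains only ASCII bytes, so it can be used wherever a traditional host name is accepted, and decoding it restores the original Unicode form.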

Q&A

References

• David Gourley, Brian Totty, Marjorie Sayer, Sailu Reddy, Anshu Aggarwal. HTTP 완벽 가이드 (Korean edition of HTTP: The Definitive Guide; translated by 이응준 and 정상일). Mapo-gu, Seoul: Insight, 2014.
