Ruling the root : CJK Rules for The Root Zone
CJK Rules For The Root Zone
Kenny Huang, Ph.D. 黃勝雄博士
Member, CDNC / CGP
Co-author, RFC3743 IETF
Member, Executive Council, APNIC
Member, Board of Directors, TWNIC
[email protected]
June 2014
Problem : CJK Is Complicated
Putting CJK labels in the root zone is even more complicated
Institutionalized Problem Solving : Structure
Constraints for CJK LGR
Independent Tasks
Each CJK Panel creates an LGR. Each LGR includes a repertoire and variants.
•Define label permissions
•Define variant labels
•Assign dispositions: Allocatable or Block
Coordination Tasks
If an LGR includes Han characters:
•The variant *mappings* must agree for all the panels
•The variant *types* may be different
•The repertoires may be different
*Presented by Lee Han Chuan & IP, Shanghai 2014 May 29
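Overlap Case Illustration
Chinese LGR, code point 一 (U+4E00):
No. | Variant | Unicode | Disposition
1 | 壹 | U+58F9 | allocate
2 | 弌 | U+5F0C | block
3 | 壱 | U+58F1 | block
Japanese LGR, code point 一 (U+4E00):
No. | Variant | Unicode | Disposition
(no variants listed)
Integrate? The Generation Panel decides (T / F): the outcome is either the Integrated Root Zone Label Generation Rules or Rejected.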
High Level Conflict Strategies
ID | Strategy | Pros | Cons | Rank
1 | Adopt X; abandon Rcjk | Permit X | No label rule |
2 | Adopt X; intersection ∩(Rcjk) | Permit X; permit ∩(variants/disp) | Rules changed |
3 | Adopt X; union ∪(Rcjk) | Permit X; permit ∪(variants/disp) | Rules changed |
4 | Abandon X and Rcjk | No conflict | Label not available |
5 | Adopt rules based on frequency of use | Fair & scientific approach | Rules changed; fair does not necessarily mean appropriate |
Legend: CJK overlap (X); C: rule Rc; J: rule Rj; K: rule Rk
Unified CJK LGR Illustration
Chinese LGR, code point 一 (U+4E00):
No. | Variant | Unicode | Disposition
1 | 壹 | U+58F9 | allocate
2 | 弌 | U+5F0C | block
3 | 壱 | U+58F1 | block
Japanese LGR, code point 一 (U+4E00):
(no variants listed)
Integrated LGR (Union), code point 一 (U+4E00):
No. | Variant | Unicode | Disposition
1 | 壹 | U+58F9 | allocate
2 | 弌 | U+5F0C | block
3 | 壱 | U+58F1 | block
Integrated LGR (Intersection), code point 一 (U+4E00):
(no variants listed)
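The union and intersection options for the 一 (U+4E00) example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Lgr` dictionary model, the `integrate` function, and the tie-break rule for dispositions are hypothetical and are not part of any LGR specification or tool. The pairing of variants with dispositions follows the order in which they appear in the illustration.

```python
from typing import Dict

# Hypothetical, simplified model: code point -> {variant: disposition}.
Lgr = Dict[str, Dict[str, str]]

chinese_lgr: Lgr = {
    "\u4e00": {"\u58f9": "allocate", "\u5f0c": "block", "\u58f1": "block"},
}
japanese_lgr: Lgr = {
    "\u4e00": {},  # no variants listed for U+4E00
}


def integrate(a: Lgr, b: Lgr, mode: str) -> Lgr:
    """Merge the variant sets of code points that appear in both LGRs."""
    merged: Lgr = {}
    for cp in a.keys() & b.keys():
        va, vb = a[cp], b[cp]
        if mode == "union":
            keys = va.keys() | vb.keys()
        elif mode == "intersection":
            keys = va.keys() & vb.keys()
        else:
            raise ValueError("mode must be 'union' or 'intersection'")
        # Assumption: if both panels define a variant, keep a's disposition.
        merged[cp] = {v: va.get(v, vb.get(v)) for v in sorted(keys)}
    return merged


union_lgr = integrate(chinese_lgr, japanese_lgr, "union")
intersection_lgr = integrate(chinese_lgr, japanese_lgr, "intersection")
```

In this toy case the union keeps all three Chinese variants, while the intersection leaves U+4E00 with no variants, which matches the two integrated tables in the illustration.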
CJK Integration Methodology: Divide & Conquer (D&C)
[Diagram. Labels: Unified CJK Rules; Variant Dispositions; Minimal Viable Solution; CJK Rules; Root Zone Admin; Strategic Direction; Plan and Define; Resources; CJK Overlap; JK Overlap; CJ Overlap; CK Overlap; CJ Usage Pattern; CK Usage Pattern; Services; LGR Constraints; Evaluation Method; Diversified CJK Demands; C Demands; J Demands; Requires; Split; Merge.]
Splitting Non-overlapping Code Points From Repertoires
Step 1. Source: CJK Han-overlap in IANA IDN Repository
Repertoire | Han code points | Registry
C-Han (Rc, Chinese LGR) | 19520 | CNNIC/TWNIC
J-Han (Rj, Japanese LGR) | 6356 | JPRS
K-Han (Rk, Korean LGR) | 0 | KRNIC
C/J Overlap: 6181 (develop conflict strategy)
C only: 13339; J only: 175 (no conflict)
Unified code points: 13339 + 175 = 13514
Problem Domain (Unsolved Overlap) : 6181
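The split itself is plain set arithmetic. The sketch below uses tiny made-up repertoires (not the real IANA tables) to show the operation, then re-derives the reported counts from the C-Han, J-Han, and overlap totals. The function name `split_overlap` is hypothetical.

```python
from typing import Set, Tuple


def split_overlap(rc: Set[int], rj: Set[int]) -> Tuple[Set[int], Set[int], Set[int]]:
    """Return (c_only, j_only, overlap) for two repertoires of code points."""
    overlap = rc & rj
    return rc - overlap, rj - overlap, overlap


# Toy repertoires, for illustration only.
rc = {0x4E00, 0x58F9, 0x5F0C, 0x4EBA}
rj = {0x4E00, 0x58F1, 0x4EBA}
c_only, j_only, overlap = split_overlap(rc, rj)

# Consistency check against the counts reported for step 1.
C_HAN, J_HAN, CJ_OVERLAP = 19520, 6356, 6181
c_only_count = C_HAN - CJ_OVERLAP   # code points only in the Chinese repertoire
j_only_count = J_HAN - CJ_OVERLAP   # code points only in the Japanese repertoire
unified_count = c_only_count + j_only_count
```

The derived counts reproduce the 13339, 175, and 13514 figures, so only the 6181 overlapping code points need a conflict strategy.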
Engineering Design
Step 2. Computation for Word Usage and Frequency
Corpora: TC : Apple News; SC : Sina News; JP : Mainichi News
The C/J overlap code points are matched against each corpus to obtain usage and frequency of use.
•Split unused code points
•Split code points of low frequency of use
The sample size is statistically significant.
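A minimal sketch of the matching step, assuming each corpus is available as a plain string. The function name `frequency_of_use` is hypothetical, and the choice of denominator (all non-whitespace characters in the corpus) is an assumption, since the slides do not state how the percentage was normalized.

```python
from collections import Counter
from typing import Dict, Set


def frequency_of_use(text: str, overlap: Set[str]) -> Dict[str, float]:
    """Percentage frequency of each overlap character within `text`."""
    chars = [ch for ch in text if not ch.isspace()]
    total = len(chars)
    counts = Counter(ch for ch in chars if ch in overlap)
    # Assumption: percentage of all non-whitespace characters in the corpus.
    return {ch: (100.0 * counts[ch] / total if total else 0.0) for ch in overlap}


# Toy corpus and overlap set, for illustration only.
sample_text = "日本人 日本"
sample_overlap = {"日", "人", "一"}
freq = frequency_of_use(sample_text, sample_overlap)

# Code points that never occur in the corpus are candidates for splitting.
unused = {ch for ch, f in freq.items() if f == 0.0}
```

Running the same function over the Chinese and Japanese corpora would give two frequency tables that can then be compared code point by code point.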
Splitting Unused Code Points from The Overlap
Step 3. C / J Overlap Data Set : 6181
Category | Code points | Assigned to
Total unused | 2739 |
C only | 1927 | Rc
J only | 203 | Rj
C / J usage overlap | 1312 |
Total used : 3442
Unified code points: 2739 + 203 + 1927 = 4869
Problem Domain (Unsolved Overlap) : 1312
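The four usage categories can be expressed as set operations on the overlap and the sets of characters observed in each corpus. The sketch below is illustrative: `classify_by_usage` and the toy sets are hypothetical, and the final lines only check that the reported counts add up.

```python
from typing import Dict, Set


def classify_by_usage(overlap: Set[str], used_c: Set[str], used_j: Set[str]) -> Dict[str, Set[str]]:
    """Partition the overlap by which corpus actually uses each code point."""
    return {
        "unused": overlap - used_c - used_j,
        "c_only": (overlap & used_c) - used_j,
        "j_only": (overlap & used_j) - used_c,
        "both": overlap & used_c & used_j,
    }


# Toy data, for illustration only.
parts = classify_by_usage(
    overlap={"一", "人", "的", "年", "弌"},
    used_c={"一", "人", "的"},
    used_j={"一", "人", "年"},
)

# Consistency check against the counts reported for step 3.
UNUSED, C_ONLY, J_ONLY, BOTH = 2739, 1927, 203, 1312
total_overlap = UNUSED + C_ONLY + J_ONLY + BOTH
total_used = C_ONLY + J_ONLY + BOTH
unified = UNUSED + C_ONLY + J_ONLY
```

Only the "both" category carries forward as the unsolved problem domain; the other three categories have no usage conflict between the two corpora.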
Computing Frequency of Use of Code Points
Step 4. Initial Data Set : 1312
Top 10 Most Popular Words
Rank | TC | SC | JP
1 | 的, 2774 | 日, 20942 | 日, 822
2 | 人, 1005 | 月, 20315 | 年, 496
3 | 在, 975 | 人, 4430 | 国, 393
4 | 一, 964 | 国, 3754 | 会, 345
5 | 是, 960 | 中, 3521 | 月, 325
6 | 不, 951 | 被, 2791 | 人, 325
7 | 中, 896 | 称, 2340 | 大, 319
8 | 有, 883 | 地, 2226 | 市, 253
9 | 大, 776 | 南, 2152 | 本, 251
10 | 台, 718 | 生, 2027 | 中, 250
[Bar chart: Top 20 : Chinese Frequency of Use > Japanese Frequency of Use. Series: C-Freq, J-Freq. Vertical axis: Frequency of Use %, 0 to 4.5. Data labels in chart order: 4.063, 4.1884, 1.7338, 0.6, 0.6094, 0.886, 0.5582, 0.5944, 0.5518, 0.468, 0.7042, 0.4304, 0.6026, 0.4488, 0.36, 0.3562, 0.4026, 0.385, 0.7508, 0.325.]
Generated Data Set : 939
[Bar chart: Top 20 : Chinese Frequency of Use < Japanese Frequency of Use. Series: C-Freq, J-Freq. Vertical axis: Frequency of Use %, 0 to 0.6. Data labels in chart order: 0.1988, 0.2788, 0.5512, 0.1956, 0.3834, 0.1644, 0.2366, 0.1056, 0.1212, 0.1088, 0.1688, 0.1422, 0.1344, 0.1622, 0.0856, 0.2812, 0.1588, 0.0912, 0.1134, 0.1288.]
Generated Data Set : 363
[Bar chart: Chinese Frequency of Use = Japanese Frequency of Use. Series: C-Freq, J-Freq. Vertical axis: Frequency of Use %, 0 to 0.025.]
Code points in chart order: U+8FCE, U+7D20, U+675F, U+541B, U+846C, U+79E9, U+82BD, U+96C0, U+5857, U+5353
Data labels in chart order: 0.0222, 0.0144, 0.0112, 0.0056, 0.0056, 0.0022, 0.0012, 0.0012, 0.0012, 0.0012
Generated Data Set : 10
Frequency of Use Reassembly
C / J Usage Overlap Data Set : 1312
Freq C > J : 939 (Rc)
Freq J > C : 363 (Rj)
J = C : 10
Unified code points: 939 + 363 = 1302
Problem Domain (Unsolved Overlap) : 10
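The reassembly rule can be sketched as a three-way comparison per code point. The function `assign_by_frequency` and the sample frequencies are hypothetical; the slides give only the resulting counts, which the last lines re-check.

```python
from typing import Dict, Set, Tuple


def assign_by_frequency(c_freq: Dict[str, float], j_freq: Dict[str, float]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (rc, rj, tied) for code points present in both frequency tables."""
    rc, rj, tied = set(), set(), set()
    for cp in c_freq.keys() & j_freq.keys():
        if c_freq[cp] > j_freq[cp]:
            rc.add(cp)      # Chinese rule applies
        elif c_freq[cp] < j_freq[cp]:
            rj.add(cp)      # Japanese rule applies
        else:
            tied.add(cp)    # equal frequency: still unsolved
    return rc, rj, tied


# Made-up frequencies (percent), for illustration only.
rc_set, rj_set, tied_set = assign_by_frequency(
    c_freq={"U+7684": 2.5, "U+5E74": 0.25, "U+8FCE": 0.125},
    j_freq={"U+7684": 0.5, "U+5E74": 1.5, "U+8FCE": 0.125},
)

# Consistency check against the counts reported for the reassembly.
C_GT_J, J_GT_C, EQUAL = 939, 363, 10
unified = C_GT_J + J_GT_C
usage_overlap = unified + EQUAL
```

Exact ties are the only cases this rule cannot settle, which is why the problem domain shrinks to 10 code points.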
Data Processing & Computation Recap
LOGIC Design
>20K Han Code Points
→ Splitting Non-overlapping (Filtering Process) → 6181 CJK Overlap
→ Splitting Unused (Filtering Process) → 1312 Usage Overlap
→ Frequency of Use Computation → 10 Code Points
Methodology Review
CJK Coordination
Re-Sampling & Computation
Statistical Justification
Problem domain was effectively reduced
Future Work
[Histogram: Chinese Frequency of Use Minus Japanese Frequency of Use. Horizontal axis: difference (%). Vertical axis: number of X within the same range, 0 to 600. Data labels in chart order: 15, 7, 8, 15, 23, 23, 34, 33, 69, 122, 289, 501, 80, 26, 14, 5, 5, 3, 2, 1. Regions labeled Rc and Rj.]
Mean = 0.034465
S.D. = 0.158477
•Redefine the overlap range: expand (?) by standard deviation
•Requires intensive CJK coordination & deliberation
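One possible reading of "expand by standard deviation" is to treat every code point whose frequency difference lies within k standard deviations of the mean as part of the overlap, instead of exact ties only. The sketch below illustrates that reading and nothing more: the function `overlap_band`, the parameter `k`, and the use of the population standard deviation are all assumptions, not the author's stated method.

```python
from statistics import mean, pstdev
from typing import Dict, Set


def overlap_band(diffs: Dict[str, float], k: float = 1.0) -> Set[str]:
    """Code points whose C-minus-J difference is within k standard deviations of the mean."""
    values = list(diffs.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return {cp for cp, d in diffs.items() if abs(d - mu) <= k * sigma}


# Made-up differences (C-Freq minus J-Freq, percent), for illustration only.
toy_diffs = {"cp1": -1.0, "cp2": 0.0, "cp3": 0.0, "cp4": 1.0}
narrow = overlap_band(toy_diffs, k=1.0)
wide = overlap_band(toy_diffs, k=2.0)

# Band implied by the reported mean and S.D. at k = 1.
MEAN, SD = 0.034465, 0.158477
lower, upper = MEAN - SD, MEAN + SD
```

With the reported statistics, a one-standard-deviation band would span roughly -0.124 to 0.193 percentage points. Whether that band, or any other width, is appropriate is exactly the coordination question raised above.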
Re-consider Language Tag
[Diagram. Labels: Policy; C tag; J tag; K tag; Language tag support; TLD registries; IANA/Verisign provisioning; root server operators publication; Internet query; distribution masters; root servers; DNS resolvers.]
Sources of Language Tag
•RFC 2860 : The name space of language tags is administered by IANA
•ISO Standard 639 :
  •When a language has both an IANA-registered tag and a tag derived from an ISO registered code, one MUST use the ISO tag.
  •Maintenance Agency : International Information Centre for Terminology (Austria)
Perfection Syndrome
“Engineering isn't about perfect solutions; it's about doing the best you can with limited resources.”
Randy Pausch