Chapter 4: Lexical Analysis
Introduction to Compilers
Source Program → Lexical Analyzer → Token Stream
4.1 Introduction
Lexical analysis: the process by which the compiler groups certain strings of characters into individual tokens.
Lexical Analyzer = Scanner = Lexer
Text p.130
Token: the smallest grammatically meaningful unit.
- Token: a single syntactic entity (terminal symbol).
- Token number: an integer code, used because integers are processed more efficiently than strings.
- Token value: a numeric value or string value.
ex) if ( a > 10 ) ...
 Token        : if   (    a     >    10   )
 Token number : 32   7    4     25   5    8
 Token value  : 0    0    'a'   0    10   0
Token classes
Special form (fixed by the language designer)
 1. Keywords: const, else, if, int, ...
 2. Operator symbols: +, -, *, /, ++, --, etc.
 3. Delimiters: ;, ,, (, ), [, ], etc.
General form (chosen by the programmer)
 4. Identifiers: stk, ptr, sum, ...
 5. Constants: 526, 3.0, 0.1234e-10, 'c', "string", etc.
Token structure: represented by a regular expression.
ex) id = (l + _)(l + d + _)*   (l: letter, d: digit)
Interaction of the Lexical Analyzer with the Parser
The lexical analyzer is a procedure called by the syntax analyzer.
 L.A. is modeled by a finite automaton.
 S.A. is modeled by a pushdown automaton.
Token type: the form of the token that the scanner passes to the parser.
 (token number, token value)
ex) if ( x > y ) x = 10 ;
 (32,0) (7,0) (4,x) (25,0) (4,y) (8,0) (4,x) (23,0) (5,10) (20,0)
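The (token number, token value) pair could be held in a small C record, as in the sketch below. The token numbers are the ones used in the example above; the struct, the enum names, and the helper functions are hypothetical and only illustrate the idea.

```c
#include <stdio.h>
#include <string.h>

/* Token numbers taken from the example above; the enum names are illustrative. */
enum {
    T_IDENT = 4, T_CONST = 5, T_LPAREN = 7, T_RPAREN = 8,
    T_SEMI = 20, T_ASSIGN = 23, T_GT = 25, T_IF = 32
};

/* Hypothetical token record: a number plus an optional value. */
struct token {
    int  number;       /* index used by the parser */
    char value[32];    /* lexeme for identifiers and constants, "0" otherwise */
};

/* Build a token; tokens without a value carry "0", as in the example. */
static struct token make_token(int number, const char *value)
{
    struct token t;
    t.number = number;
    strncpy(t.value, value ? value : "0", sizeof t.value - 1);
    t.value[sizeof t.value - 1] = '\0';
    return t;
}

/* Format a token as "(number,value)" into buf. */
static void format_token(const struct token *t, char *buf, size_t n)
{
    snprintf(buf, n, "(%d,%s)", t->number, t->value);
}
```

For instance, make_token(T_IDENT, "x") formats as (4,x), and make_token(T_IF, NULL) formats as (32,0).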
[Figure: Source Program → Lexical Analyzer (= Scanner) ⇄ Syntax Analyzer (= Parser). The parser issues "get token" and the scanner returns a token. Parser actions: Shift (get-token), Reduce, Accept, Error.]
Reasons for separating the analysis phase of compiling into lexical analysis (scanning) and syntax analysis (parsing):
 1. Modular construction gives a simpler design.
 2. Compiler efficiency is improved.
 3. Compiler portability is enhanced.
Parsing table: determines the parser's action (Shift, Reduce, Accept, Error).
The token number is an index into the parsing table.
[Figure: parsing table indexed by state and token number.]
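Uses of the symbol table
- During L.A. and S.A., information about identifiers is collected and stored.
- It is used during semantic analysis and code generation.
- Each entry holds a name plus attributes.
ex) Hashed symbol table (see Chapter 12)
[Figure: hash buckets pointing to symbol table entries, each holding a name and its attributes.]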
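A hashed symbol table of the kind pictured might look roughly like the following sketch. The slides defer the details to Chapter 12, so everything here is an assumption made for illustration: the bucket count, the hash function, the single int standing in for the attributes, and the names sym_hash, sym_lookup, and sym_insert.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211            /* assumed table size (a prime) */

/* One symbol table entry: name + attributes, chained within its bucket. */
struct symbol {
    char          *name;
    int            attributes;  /* placeholder for type, scope, ... */
    struct symbol *next;
};

static struct symbol *bucket[NBUCKETS];

/* Simple multiplicative string hash (an illustrative choice). */
static unsigned sym_hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the entry for name, or NULL if it is not in the table. */
struct symbol *sym_lookup(const char *name)
{
    struct symbol *p;
    for (p = bucket[sym_hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Return the existing entry for name, or add a new one at the bucket head. */
struct symbol *sym_insert(const char *name, int attributes)
{
    struct symbol *p = sym_lookup(name);
    unsigned h;
    if (p != NULL)
        return p;
    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->name = malloc(strlen(name) + 1);
    if (p->name == NULL) {
        free(p);
        return NULL;
    }
    strcpy(p->name, name);
    p->attributes = attributes;
    h = sym_hash(name);
    p->next = bucket[h];
    bucket[h] = p;
    return p;
}
```

A scanner would typically call sym_insert when it first meets an identifier, and later phases would call sym_lookup to read the attributes back.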
4.2 Token Recognition
Specification of token structure: regular expressions (RE)
Specification of the programming language: context-free grammar (CFG)
Scanner design steps
 1. Describe the structure of the tokens as regular expressions.
 2. Or, directly design a transition diagram for the tokens.
 3. Program a scanner according to the diagram.
 4. Verify the scanner's behavior using regular language theory.
Character classification
 letter : a | b | c | ... | z | A | B | C | ... | Z   (abbreviated l)
 digit : 0 | 1 | 2 | ... | 9   (abbreviated d)
 special character : + | - | * | / | . | , | ...
4.2.1 Identifier Recognition
Transition diagram
 start → S
 S --(l, _)--> A
 A --(l, d, _)--> A
 Accepting state: A.
Regular grammar
 S → lA | _A
 A → lA | dA | _A | ε
Regular expression
 S = lA + _A = (l + _)A
 A = lA + dA + _A + ε = (l + d + _)A + ε = (l + d + _)*
 S = (l + _)(l + d + _)*
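The two-state diagram for identifiers can be coded directly as a state machine. The sketch below is one possible rendering, not the textbook's scanner: the function name is_identifier and the ID_REJECT trap state are additions, and isalpha/isalnum from <ctype.h> are assumed to stand for the classes l and d.

```c
#include <ctype.h>

/* States of the transition diagram: S = start, A = accepting. */
enum id_state { ID_S, ID_A, ID_REJECT };

/* Return 1 if s matches (l + _)(l + d + _)*, else 0. */
int is_identifier(const char *s)
{
    enum id_state state = ID_S;
    for (; *s != '\0' && state != ID_REJECT; s++) {
        unsigned char ch = (unsigned char)*s;
        switch (state) {
        case ID_S:   /* first character: letter or underscore */
            state = (isalpha(ch) || ch == '_') ? ID_A : ID_REJECT;
            break;
        case ID_A:   /* later characters: letter, digit, or underscore */
            state = (isalnum(ch) || ch == '_') ? ID_A : ID_REJECT;
            break;
        default:
            break;
        }
    }
    return state == ID_A;
}
```

With this sketch, stk and _x1 are accepted, while 9a and the empty string are rejected.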
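4.2.2 Integer Number Recognition
Form: integers are written in decimal, octal, or hexadecimal.
 Decimal: starts with a non-zero digit.
 Octal: starts with 0.
 Hexadecimal: starts with 0x or 0X.
Transition diagram
 start → S
 S --n--> A,  A --d--> A
 S --0--> B
 B --o--> C,  C --o--> C
 B --(x, X)--> D,  D --h--> E,  E --h--> E
 Accepting states: A, B, C, E.
 n : non-zero digit, o : octal digit, h : hexadecimal digit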
Regular grammar
 S → nA | 0B
 A → dA | ε
 B → oC | xD | XD | ε
 C → oC | ε
 D → hE
 E → hE | ε
Regular expression
 E = hE + ε = h*ε = h*
 D = hE = hh* = h⁺
 C = oC + ε = o*
 B = oC + xD + XD + ε = oo* + (x + X)D + ε = o⁺ + (x + X)h⁺ + ε
 A = dA + ε = d*
 S = nA + 0B = nd* + 0(o⁺ + (x + X)h⁺ + ε) = nd* + 0 + 0o⁺ + 0(x + X)h⁺
∴ S = nd* + 0 + 0o⁺ + 0(x + X)h⁺
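As a check on the grammar for integers, its states S, A, B, C, D, E can be transcribed one-to-one into C. This is a minimal sketch under assumptions: the names is_integer and is_octal and the INT_REJECT trap state are hypothetical, and isdigit/isxdigit from <ctype.h> stand in for d and h.

```c
#include <ctype.h>

/* States S, A, B, C, D, E follow the regular grammar; INT_REJECT is a trap state. */
enum int_state { INT_S, INT_A, INT_B, INT_C, INT_D, INT_E, INT_REJECT };

static int is_octal(int ch) { return ch >= '0' && ch <= '7'; }

/* Return 1 if s matches nd* + 0 + 0o+ + 0(x + X)h+, else 0. */
int is_integer(const char *s)
{
    enum int_state st = INT_S;
    for (; *s != '\0' && st != INT_REJECT; s++) {
        unsigned char ch = (unsigned char)*s;
        switch (st) {
        case INT_S:
            if (ch == '0')        st = INT_B;   /* octal, hex, or plain 0 */
            else if (isdigit(ch)) st = INT_A;   /* n: non-zero digit */
            else                  st = INT_REJECT;
            break;
        case INT_A:
            st = isdigit(ch) ? INT_A : INT_REJECT;
            break;
        case INT_B:
            if (is_octal(ch))                st = INT_C;
            else if (ch == 'x' || ch == 'X') st = INT_D;
            else                             st = INT_REJECT;
            break;
        case INT_C:
            st = is_octal(ch) ? INT_C : INT_REJECT;
            break;
        case INT_D:   /* at least one hex digit must follow 0x */
        case INT_E:
            st = isxdigit(ch) ? INT_E : INT_REJECT;
            break;
        default:
            break;
        }
    }
    return st == INT_A || st == INT_B || st == INT_C || st == INT_E;
}
```

State D is not accepting, so a bare 0x is rejected; 09 is rejected because 9 is not an octal digit.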
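4.2.3 Real Number Recognition
Form: fixed-point numbers and floating-point numbers
Transition diagram
 start → S
 S --d--> A,  A --d--> A
 A --.--> B
 B --d--> C,  C --d--> C
 C --e--> D
 D --d--> E,  D --'+'--> F,  D --'-'--> G
 F --d--> E,  G --d--> E,  E --d--> E
 Accepting states: C, E.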
Regular grammar
 S → dA
 A → dA | .B
 B → dC
 C → dC | eD | ε
 D → dE | +F | -G
 E → dE | ε
 F → dE
 G → dE
Regular expression
 E = dE + ε = d*
 F = dE = dd* = d⁺
 G = dE = dd* = d⁺
 D = dE + '+'F + -G = dd* + '+'d⁺ + -d⁺ = d⁺ + '+'d⁺ + -d⁺ = (ε + '+' + -)d⁺
 C = dC + eD + ε = dC + e(ε + '+' + -)d⁺ + ε = d*(e(ε + '+' + -)d⁺ + ε)
 B = dC = dd*(e(ε + '+' + -)d⁺ + ε) = d⁺(e(ε + '+' + -)d⁺ + ε)
 A = dA + .B = d*.d⁺(e(ε + '+' + -)d⁺ + ε)
 S = dA = dd*.d⁺(e(ε + '+' + -)d⁺ + ε) = d⁺.d⁺(e(ε + '+' + -)d⁺ + ε)
   = d⁺.d⁺ + d⁺.d⁺e(ε + '+' + -)d⁺
Note: the terminal symbol + is written as '+'.
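The real-number grammar maps onto a state machine in the same way. The sketch below keeps one C state per nonterminal (S, A, B, C, D, E, F, G); the function name is_real and the REAL_REJECT trap state are assumptions. Following the grammar literally, only a lowercase e introduces the exponent and digits are required on both sides of the decimal point.

```c
#include <ctype.h>

/* One state per nonterminal of the grammar; REAL_REJECT is a trap state. */
enum real_state {
    REAL_S, REAL_A, REAL_B, REAL_C, REAL_D,
    REAL_E, REAL_F, REAL_G, REAL_REJECT
};

/* Return 1 if s is a fixed-point or floating-point number per the grammar. */
int is_real(const char *s)
{
    enum real_state st = REAL_S;
    for (; *s != '\0' && st != REAL_REJECT; s++) {
        unsigned char ch = (unsigned char)*s;
        int d = isdigit(ch);
        switch (st) {
        case REAL_S: st = d ? REAL_A : REAL_REJECT; break;
        case REAL_A: st = d ? REAL_A : (ch == '.' ? REAL_B : REAL_REJECT); break;
        case REAL_B: st = d ? REAL_C : REAL_REJECT; break;
        case REAL_C: st = d ? REAL_C : (ch == 'e' ? REAL_D : REAL_REJECT); break;
        case REAL_D:               /* exponent: optional sign, then digits */
            if (d)              st = REAL_E;
            else if (ch == '+') st = REAL_F;
            else if (ch == '-') st = REAL_G;
            else                st = REAL_REJECT;
            break;
        case REAL_E:
        case REAL_F:
        case REAL_G:               /* a digit leads to (or stays in) E */
            st = d ? REAL_E : REAL_REJECT;
            break;
        default:
            break;
        }
    }
    return st == REAL_C || st == REAL_E;   /* accepting states */
}
```

Under these rules 3.0 and 0.1234e-10 are accepted, while 3. and .5 and 1e5 are rejected.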
4.2.4 String Constant Recognition
Form: a sequence of characters between a pair of double quotes.
Transition diagram
 start → S
 S --"--> A
 A --a--> A
 A --\--> C,  C --c--> A
 A --"--> B
 Accepting state: B.
 where a = char_set - {", \} and c = char_set
Regular grammar
 S → "A
 A → aA | "B | \C
 B → ε
 C → cA
Regular expression
 A = aA + "B + \C = aA + " + \cA = (a + \c)A + " = (a + \c)*"
 S = "A = "(a + \c)*"
∴ S = "(a + \c)*"
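The string-constant grammar can be sketched as a four-state machine in C. The name is_string_constant and the STR_REJECT trap state are hypothetical. State C consumes whatever character follows a backslash, which is how an escaped double quote stays inside the string.

```c
/* States S, A, B, C follow the regular grammar; STR_REJECT is a trap state. */
enum str_state { STR_S, STR_A, STR_B, STR_C, STR_REJECT };

/* Return 1 if s is exactly one string constant "(a + \c)*", else 0. */
int is_string_constant(const char *s)
{
    enum str_state st = STR_S;
    for (; *s != '\0' && st != STR_REJECT; s++) {
        char ch = *s;
        switch (st) {
        case STR_S:                 /* must open with a double quote */
            st = (ch == '"') ? STR_A : STR_REJECT;
            break;
        case STR_A:
            if (ch == '"')       st = STR_B;   /* closing quote */
            else if (ch == '\\') st = STR_C;   /* start of an escape */
            /* any other character is an a: stay in A */
            break;
        case STR_C:                 /* c: any character after a backslash */
            st = STR_A;
            break;
        case STR_B:                 /* nothing may follow the closing quote */
            st = STR_REJECT;
            break;
        default:
            break;
        }
    }
    return st == STR_B;
}
```

An unterminated string ends in state A or C and is therefore rejected.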
4.2.5 Comment Recognition
Transition diagram
 start → S
 S --/--> A
 A --*--> B
 B --a--> B
 B --*--> C
 C --*--> C
 C --b--> B
 C --/--> D
 Accepting state: D.
 where a = char_set - {*} and b = char_set - {*, /}
Regular grammar
 S → /A
 A → *B
 B → aB | *C
 C → *C | bB | /D
 D → ε
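Regular expression (here the terminal asterisk is written '*' to distinguish it from the closure operator *)
 C = '*'C + bB + /D = '*'* (bB + /)
 B = aB + '*'C = aB + '*' '*'* (bB + /)
   = aB + '*' '*'* bB + '*' '*'* /
   = (a + '*' '*'* b)B + '*' '*'* /
   = (a + '*' '*'* b)* '*' '*'* /
 A = '*'B = '*' (a + '*' '*'* b)* '*' '*'* /
 S = /A = / '*' (a + '*' '*'* b)* '*' '*'* /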
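A program fragment that recognizes a comment. It runs once the opening /* has been recognized, and it does not check for end of input.
 do {
     while (ch != '*') ch = getchar();  /* skip ahead to the next '*' */
     ch = getchar();                    /* read the character after the '*' */
 } while (ch != '/');                   /* a '/' right after '*' ends the comment */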
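The loop above never terminates on an unclosed comment, because getchar() keeps returning EOF. A variant that guards against this is sketched below. To keep it easy to test, it scans a string instead of standard input; that change, the name skip_comment, and the NULL return convention are assumptions, not part of the original program.

```c
#include <stddef.h>

/*
 * If p points at a slash-star opener, return a pointer just past the
 * matching star-slash. Return NULL if p does not start a comment or
 * the comment is never closed.
 */
const char *skip_comment(const char *p)
{
    if (p[0] != '/' || p[1] != '*')
        return NULL;
    p += 2;                                /* step over the opener */
    for (;;) {
        while (*p != '\0' && *p != '*')    /* skip ahead to the next star */
            p++;
        if (*p == '\0')
            return NULL;                   /* end of input inside the comment */
        p++;                               /* consume the star */
        if (*p == '/')
            return p + 1;                  /* star followed by slash: done */
        /* otherwise re-examine *p: it may be another star */
    }
}
```

As in the original loop, the character after a star is examined again when it is not a slash, so a run of stars before the closing slash is handled correctly.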