제 4 장 어휘 분석 컴파일러 입문. Lexical Analysis the process by which the compiler groups certain strings of characters into individual tokens. Lexical Analyzer

제 4 장 어휘 분석

컴파일러 입문

SourceProgram

Lexical AnalyzerTokenStream

4.1 서 론

Lexical Analysis the process by which the compiler groups certain

strings of characters into individual tokens.

Lexical Analyzer Scanner Lexer

Text p.130

Token 문법적으로 의미 있는 최소 단위

Token - a single syntactic entity(terminal symbol).

Token Number - string 처리의 효율성 위한 integer number.

Token Value - numeric value or string value.

ex) if ( a > 10 ) ...

Token Number : 32 7 4 25 5 8 Token Value : 0 0 ‘a’ 0 10 0

Token classes Special form - language designer

1. Keyword --- const, else, if, int, ...2. Operator symbols --- +, -, *, /, ++, -- etc.3. Delimiters --- ;, ,, (, ), [, ] etc.

General form - programmer4. identifier --- stk, ptr, sum, ...5. constant --- 526, 3.0, 0.1234e-10, ‘c’, “string” etc.

Token Structure - represented by regular expres-sion.

ex) id = (l + _)( l + d + _)*

Interaction of Lexical Analyzer with Parser Lexical Analyzer is the procedure of Syntax Ana-

lyzer.

L.A. Finite Automata.

S.A. Pushdown Automata.

Token type scanner 가 parser 에게 넘겨주는 토큰 형태 .

(token number, token value)

ex) if ( x > y ) x = 10 ; (32,0) (7,0) (4,x) (25,0) (4,y) (8,0) (4,x) (23,0) (5,10) (20,0)

SourceProgram

Lexical Analyzer(=Scanner)

Shift(get-token)ReduceAcceptError

Syntax Analyzer(=Parser)

get token

token

The reasons for separating the analysis phase of compiling into lexical analysis(scanning) and syntax analysis(parsing).

1. modular construction - simpler design.2. compiler efficiency is improved.3. compiler portability is enhanced.

Parsing table Parser 의 행동 (Shift, Reduce, Accept, Error) 을 결정 .

Token number 는 Parsing table 의 index.

Tokennum State

Symbol table 의 용도 L.A 와 S.A 시 identifier 에 관한 정보를 수집하여 저장 . Semantic analysis 와 Code generation 시에 사용 . name + attributes

ex) Hashed symbol table

- chapter 12 참조

attributesname

symbol tablebucket

4.2 토큰 인식

Specification of token structure - RE Specification of PL - CFG Scanner design steps

1. describe the structure of tokens in re.2. or, directly design a transition diagram for the tokens.3. and program a scanner according to the diagram.4. moreover, we verify the scanner action through regular

language theory.

Character classification letter : a | b | c... | z | A | B | C |…| Z l digit : 0 | 1 | 2... | 9 d special character : + | - | * | / | . | , | ...

S Astartl, _

l, d, _

4.2.1 Identifier Recognition

Transition diagram

Regular grammar S lA | _A A lA | dA | _A | ε

Regular expression S = lA + _A = (l + _)A A = lA + dA + _A + ε = (l + d + _)A + ε = (l + d + _)*

S = (l + _)( l + d + _)*

Form : 10 진수 , 8 진수 , 16 진수로 구분되어진다 . 10 진수 : 0 이 아닌 수 시작

8 진수 : 0 으로 시작 , 16 진수 : 0x, 0X 로 시작

Transition diagram

4.2.2 Integer number Recognition

S An

D

start

B C

E

0o

x, Xh

o

h

d

n : non-zero digito : octal digit h : hexa digit

Regular grammar

S nA | 0B A dA | ε B oC | xD | XD | ε

C oC | ε D hE E hE | ε

Regular expression E = hE + ε = h*ε = h* D = hE = hh* = h+

C = oC + ε = o* B = oC + xD + XD + ε = o+ + (x + X)D = o+ + (x + X)h+ + ε A = dA + ε = d*

S = nA + 0B = nd* + 0(o+ + (x + X)h+ + ε) = nd* + 0 + 0o+ + 0(x + X)h+

∴ S = nd* + 0 + 0o+ + 0(x + X)h+

S Cd

start o

d

A B D E

G

F

.

d e

d

d

+

-

d d

d

4.2.3 Real number Recognition

Form : Fixed-point number & Floating-point number Transition diagram

Regular grammar S dA D dE | +F | -G A dA | .B E dE |ε B dC F dE C dC | eD |ε G dE

Regular expressionE = dE + ε = d* F = dE = dd* = d+ G = dE = dd* = d+

D = dE + '+'F + -G = dd* + '+'d+ + -d +

= d+ + '+'d+ + -d+ = (ε + '+' +-)d +

C = dC + eD + ε = dC+e(ε + '+' +-)d+ + e

= d*(e(ε + '+' +-) d+ + ε)

B = dC=dd*(e(ε + '+' +-)d+ +ε)

= d++(e(ε + '+' +-) d+ +ε)

A = dA + .B

= d*.d+(e(ε + '+' +-)d+ + ε)

S = dA

= dd*. d+(e(ε + '+' +-) d+ +ε)

= d+.d+(e(ε + '+' +-) d+ + ε)

= d+.d++ d+.d+e(ε + '+' +-) d+

참고 Terminal + 를 ‘ +’ 로 표기 .

Form : a sequence of characters between a pair of double

quotes.

Transition diagram

where, a = char_set - {", \} and c = char_set

Regular grammar

S "A A aA | "B | \C B ε C cA

Bstart" "

A

c\

a

S

C

4.2.4 String Constant Recognition

Regular expression

A = aA + " B + \C

= aA + " + \cA

= (a + \c)A + "

= (a + \c)* "

S = " A

= "(a + \c)*"

∴ S = "(a + \c)* "

a

start S /*

A DB C

*

*/

b

4.2.5 Comment Recognition

Transition diagram

where, a = char_set - {*} and b = char_set - {*, /}.

Regular grammarS /AA *BB aB | *CC *C | bB | /DD ε

Regular expression

C = *C + bB + /D = **(bB + /)

B = aB + ***(bB + /)

= aB + ***bB + ***/

= (a + *** b)B + ***/= (a + ***b)****/

A = *B = *(a + ***b)****/

S = /A = /* (a + ***b)****/

A program which recognizes a comment statement.

do {

while (ch != '*') ch = getchar(); ch = getchar();

} while (ch != '/');

Documents

제 4 장 어휘 분석 컴파일러 입문. Lexical Analysis the process by which the compiler groups certain strings of characters into individual tokens. Lexical Analyzer