
Compiler Construction

Page 1: Compiler Construction

Compiler Construction

Week 2 Lecture: Lexical Analysis

Page 2: Compiler Construction

Lexical Analysis

“get next token” is a command sent from the parser to the lexical analyzer.

On receipt of the command, the lexical analyzer scans the input until it determines the next token, and returns it.

[Figure: Source Program, Lexical Analyzer, Parser, and Symbol Table. The lexical analyzer passes a token to the parser; the parser sends “get next token” to the lexical analyzer.]
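The request/response protocol described on this slide can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name `Lexer`, the method `get_next_token`, the function `parse`, and the three token kinds (`id`, `num`, `punct`) are assumptions made for the example.

```python
class Lexer:
    """Toy lexical analyzer: hands out one token per get_next_token() call."""

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def get_next_token(self):
        # Skip white space before the next token.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos == len(self.source):
            return ("eof", "")
        start = self.pos
        ch = self.source[self.pos]
        if ch.isalpha():
            # Identifier: a letter followed by letters and digits.
            while self.pos < len(self.source) and self.source[self.pos].isalnum():
                self.pos += 1
            return ("id", self.source[start:self.pos])
        if ch.isdigit():
            # Unsigned integer constant.
            while self.pos < len(self.source) and self.source[self.pos].isdigit():
                self.pos += 1
            return ("num", self.source[start:self.pos])
        # Anything else is returned as a one-character token.
        self.pos += 1
        return ("punct", ch)


def parse(lexer):
    """Stand-in for a parser: it only pulls tokens until end of input."""
    tokens = []
    while True:
        token = lexer.get_next_token()
        if token[0] == "eof":
            return tokens
        tokens.append(token)
```

The lexer does no work until the parser asks for a token, which is the point of the “get next token” command.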

Page 3: Compiler Construction

Other jobs of the lexical analyzer

We also want the lexer to
    Strip out comments and white space from the source code.
    Correlate parser errors with the source code location (the parser doesn’t know what line of the file it’s at, but the lexer does).

Page 4: Compiler Construction

Tokens, patterns, and lexemes

A TOKEN is a set of strings over the source alphabet.
A PATTERN is a rule that describes that set.
A LEXEME is a sequence of characters matching that pattern.

E.g. in Pascal, for the statement

    const pi = 3.1416;

the substring pi is a lexeme for the token identifier.

Page 5: Compiler Construction

Example tokens, lexemes, patterns

Token      Sample lexemes                  Informal description of pattern
if         if                              if
while      while                           while
relation   <, <=, =, <>, >, >=             < or <= or = or <> or > or >=
id         count, sun, i, j, pi, D2        Letter followed by letters and digits
num        0, 12, 3.1416, 6.02E23          Any numeric constant
literal    “please enter input values”     Any characters between “ and ”
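A rough sketch of how the patterns in this table could be turned into a working tokenizer with Python's `re` module. The regular expressions are one possible reading of the informal descriptions, not the definitions used in the course; the upper-case group names, the use of straight double quotes for literals, and the function name `tokenize` are assumptions.

```python
import re

# One regular expression per token, tried in this order at each position.
# Keywords come before "ID" so that "if" is not reported as an identifier.
TOKEN_PATTERNS = [
    ("IF", r"if\b"),
    ("WHILE", r"while\b"),
    ("NUM", r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("RELATION", r"<=|<>|>=|<|=|>"),
    ("LITERAL", r'"[^"]*"'),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(text):
    """Return a list of (token, lexeme) pairs, skipping white space."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"lexical error at offset {pos}")
        # lastgroup is the name of the alternative that matched.
        out.append((match.lastgroup, match.group()))
        pos = match.end()
    return out
```

Each pair holds the token first and the lexeme second, so `count` and `pi` are different lexemes for the same token `ID`.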

Page 6: Compiler Construction

Tokens

Together, the complete set of tokens forms the set of terminal symbols used in the grammar for the parser.

In most languages, the tokens fall into these categories:
    Keywords
    Operators
    Identifiers
    Constants
    Literal strings
    Punctuation

Usually the token is represented as an integer. The lexer and parser just agree on which integers are used for each token.

Page 7: Compiler Construction

Token attributes

If there is more than one lexeme for a token, we have to save additional information about the token.

Example: the token number matches lexemes 10 and 20.

Code generation needs the actual number, not just the token.

With each token, we associate ATTRIBUTES. Normally just a pointer into the symbol table.

Page 8: Compiler Construction

Example attributes

For C source code

    E = M * C * C

We have token/attribute pairs

    <ID, ptr to symbol table entry for E>
    <Assign_op, NULL>
    <ID, ptr to symbol table entry for M>
    <Mult_op, NULL>
    <ID, ptr to symbol table entry for C>
    <Mult_op, NULL>
    <ID, ptr to symbol table entry for C>
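The token/attribute pairs above can be reproduced with a small sketch. It assumes lexemes are separated by spaces, uses a Python list as the symbol table, and uses a list index in place of a pointer; the function name `tokens_with_attributes` is invented for the example.

```python
def tokens_with_attributes(source):
    """Return (pairs, symbol_table) for a space-separated statement."""
    symbol_table = []  # identifier names; the index plays the role of the pointer
    pairs = []
    for lexeme in source.split():
        if lexeme == "=":
            pairs.append(("Assign_op", None))
        elif lexeme == "*":
            pairs.append(("Mult_op", None))
        elif lexeme.isidentifier():
            # Enter each identifier once; later uses share the same entry.
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            pairs.append(("ID", symbol_table.index(lexeme)))
        else:
            raise ValueError(f"unexpected lexeme: {lexeme!r}")
    return pairs, symbol_table
```

Both occurrences of C carry the same attribute, because they refer to the same symbol table entry.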

Page 9: Compiler Construction

Lexical errors

When errors occur, we could just crash.
It is better to print an error message, then continue.

Possible techniques to continue on error:
    Delete a character
    Insert a missing character
    Replace an incorrect character by a correct character
    Transpose adjacent characters

Page 10: Compiler Construction

Token specification

REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification.
Every pattern specifies a set of strings, so an RE names a set of strings.

Definitions:
    The ALPHABET (often written ∑) is the set of legal input symbols.
    A STRING over some alphabet ∑ is a finite sequence of symbols from ∑.
    The LENGTH of string s is written |s|.
    The EMPTY STRING is a special 0-length string denoted ε.

Page 11: Compiler Construction

More definitions: strings and substrings

A PREFIX of s is formed by removing 0 or more trailing symbols of s.

A SUFFIX of s is formed by removing 0 or more leading symbols of s.

A SUBSTRING of s is formed by deleting a prefix and a suffix from s.

A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix, suffix, or substring of s but with x ≠ s.

Page 12: Compiler Construction

More definitions

A LANGUAGE is a set of strings over a fixed alphabet ∑.

Example languages:
    Ø (the empty set)
    { ε }
    { a, aa, aaa, aaaa }

The CONCATENATION of two strings x and y is written xy.

String EXPONENTIATION is written s^i, where s^0 = ε and s^i = s^(i-1) s for i > 0.

Page 13: Compiler Construction

Operations on languages

We often want to perform operations on sets of strings (languages). The important ones are:

The UNION of L and M:
    L ∪ M = { s | s is in L OR s is in M }

The CONCATENATION of L and M:
    LM = { st | s is in L and t is in M }

The KLEENE CLOSURE of L:
    $L^* = \bigcup_{i=0}^{\infty} L^i$

The POSITIVE CLOSURE of L:
    $L^+ = \bigcup_{i=1}^{\infty} L^i$
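These operations map directly onto Python sets of strings. The sketch below is an illustration under one stated assumption: because the Kleene and positive closures are infinite for any language containing a nonempty string, the functions take a `max_power` bound and return only the union of powers up to that bound. All function names are invented for the example.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i, with L^0 = {""} (the language containing only the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """Union of L^0 .. L^max_power: a finite approximation of L*."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure(L, max_power):
    """Union of L^1 .. L^max_power: a finite approximation of L+."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result
```

The only difference between the two closures is whether the power 0, and therefore the empty string, is included.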

Page 14: Compiler Construction

Regular expressions

REs let us precisely define a set of strings.

For C identifiers, we might use

    ( letter | _ ) ( letter | digit | _ )*

Parentheses are for grouping, | means “OR”, and * means Kleene closure.

Every RE r defines a language L(r).

Page 15: Compiler Construction

Regular expressions

Here are the rules for writing REs over an alphabet ∑:

1. ε is an RE denoting { ε }, the language containing only the empty string.
2. If a is in ∑, then a is a RE denoting { a }.
3. If r and s are REs denoting L(r) and L(s), then
    1. (r)|(s) is a RE denoting L(r) ∪ L(s)
    2. (r)(s) is a RE denoting L(r) L(s)
    3. (r)* is a RE denoting (L(r))*
    4. (r) is a RE denoting L(r)

Page 16: Compiler Construction

Additional conventions

To avoid too many parentheses, we assume:

1. * has the highest precedence, and is left associative.
2. Concatenation has the 2nd highest precedence, and is left associative.
3. | has the lowest precedence and is left associative.

Page 17: Compiler Construction

Example REs

1. a | b
2. ( a | b ) ( a | b )
3. a*
4. ( a | b )*
5. a | a*b

Page 18: Compiler Construction

Equivalence of REs

Axiom                               Description
r|s = s|r                           | is commutative
r|(s|t) = (r|s)|t                   | is associative
(rs)t = r(st)                       Concatenation is associative
r(s|t) = rs|rt,  (s|t)r = sr|tr     Concatenation distributes over |
εr = r,  rε = r                     ε is the identity element for concatenation
r* = (r|ε)*                         Relation between * and ε
r** = r*                            * is idempotent
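One way to build intuition for these axioms is to compare the languages of two REs on all short strings. The sketch below does that with Python's `re` module; it is a bounded check over strings of length at most 4 over the alphabet {a, b}, so it can refute an equivalence but cannot prove one. The function names `language` and `equivalent` are invented for the example.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=4):
    """Strings of length <= max_len over alphabet that pattern matches in full."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }


def equivalent(p, q):
    """True if p and q accept exactly the same short strings."""
    return language(p) == language(q)
```

For instance, `equivalent("a(b|ab)", "ab|aab")` checks one instance of the distributive law with r = a, s = b, t = ab.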

Page 19: Compiler Construction

Regular definitions

To make our REs simpler, we can give names to subexpressions. A REGULAR DEFINITION is a sequence

    d_1 -> r_1
    d_2 -> r_2
    …
    d_n -> r_n

Page 20: Compiler Construction

Regular definitions

Example for identifiers in C:

    letter -> A | B | … | Z | a | b | … | z
    digit  -> 0 | 1 | … | 9
    id     -> ( letter | _ ) ( letter | digit | _ )*

Example for numbers in Pascal:

    digit             -> 0 | 1 | … | 9
    digits            -> digit digit*
    optional_fraction -> . digits | ε
    optional_exponent -> ( E ( + | - | ε ) digits ) | ε
    num               -> digits optional_fraction optional_exponent
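The Pascal number definition can be mirrored almost line by line by composing Python regular expression strings, with each named definition becoming a variable. This is a sketch in Python's `re` dialect rather than Lex syntax; the alternative "| ε" is written with the `?` operator, and the helper name `is_num` is an assumption.

```python
import re

# Each variable corresponds to one line of the regular definition.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"
optional_fraction = rf"(?:\.{digits})?"        # . digits | ε
optional_exponent = rf"(?:E[+-]?{digits})?"    # ( E ( + | - | ε ) digits ) | ε
num = rf"{digits}{optional_fraction}{optional_exponent}"

NUM = re.compile(num)


def is_num(text):
    """True if the whole of text is a lexeme for the token num."""
    return NUM.fullmatch(text) is not None
```

Because later definitions only refer to earlier names, the whole definition expands into a single ordinary regular expression.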

Page 21: Compiler Construction

Notational shorthand

To simplify our REs, we can use a few shortcuts:

1. + means “one or more instances of”
    a+    (ab)+
2. ? means “zero or one instance of”
    optional_fraction -> ( . digits )?
3. [] creates a character class
    [A-Za-z][A-Za-z0-9]*

You can prove that these shortcuts do not increase the representational power of REs, but they are convenient.

Page 22: Compiler Construction

Token recognition

We now know how to specify the tokens for our language. But how do we write a program to recognize them?

    if    -> if
    then  -> then
    else  -> else
    relop -> < | <= | = | <> | > | >=
    id    -> letter ( letter | digit )*
    num   -> digit+ ( . digit+ )? ( E (+|-)? digit+ )?

Page 23: Compiler Construction

Token recognition

We also want to strip whitespace, so we need definitions

    delim -> blank | tab | newline
    ws    -> delim+

Page 24: Compiler Construction

Attribute values

Regular expression   Token   Attribute value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      ptr to sym table entry
num                  num     ptr to sym table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE

Page 25: Compiler Construction

Transition diagrams

Transition diagrams are also called finite automata.
We have a collection of STATES drawn as nodes in a graph.
TRANSITIONS between states are represented by directed edges in the graph.
Each transition leaving a state s is labeled with a set of input characters that can occur after state s.
For now, the transitions must be DETERMINISTIC.
Each transition diagram has a single START state and a set of TERMINAL STATES.
The label OTHER on an edge indicates all possible inputs not handled by the other transitions.
Usually, when we recognize OTHER, we need to put it back in the source stream since it is part of the next token. This action is denoted with a * next to the corresponding state.
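A transition diagram for the relop token can be hand-coded as nested tests on the current character. The sketch below is one possible encoding, not a diagram taken from the slides: the function name `next_relop` and its return convention are assumptions. The "put it back" action for OTHER appears as simply not advancing past the character that ended the token.

```python
def next_relop(text, pos):
    """Return (attribute, new_pos) for a relop starting at text[pos], else None."""

    def peek(i):
        # "" stands for end of input, which never matches a real character.
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":
        d = peek(pos + 1)
        if d == "=":
            return ("LE", pos + 2)
        if d == ">":
            return ("NE", pos + 2)
        # OTHER: the lookahead character belongs to the next token, so we
        # leave it in the input (the state marked * in the diagram).
        return ("LT", pos + 1)
    if c == "=":
        return ("EQ", pos + 1)
    if c == ">":
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)
        return ("GT", pos + 1)  # OTHER: retract, as above
    return None
```

The returned position tells the caller where the next token starts, so a retracted character is read again on the next call.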

Page 26: Compiler Construction

Automated lexical analyzer generation

Next time we discuss Lex and how it does its job:
    Given a set of regular expressions, produce C code to recognize the tokens.

Page 27: Compiler Construction

Lexical Analysis

Page 28: Compiler Construction

Lexical Analysis Example

Page 29: Compiler Construction

Lexical Analysis With Lex

Page 30: Compiler Construction

Lexical analysis with Lex

Page 31: Compiler Construction

Lex source program format

The Lex program has three sections, separated by %%:

    declarations
    %%
    translation rules
    %%
    auxiliary code

Page 32: Compiler Construction

Declarations section

Code between %{ and %} is inserted directly into lex.yy.c. It should contain:
    Manifest constants (#define for each token)
    Global variables, function declarations, typedefs

Outside %{ and %}, REGULAR DEFINITIONS are declared. Examples:

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]

Each definition is a name followed by a pattern.
Declared names can be used in later patterns, if surrounded by { }.

Page 33: Compiler Construction

Translation rules section

Translation rules take the form

    p1    { action1 }
    p2    { action2 }
    …
    pn    { actionn }

where pi is a regular expression and actioni is a C program fragment to be executed whenever pi is recognized in the input stream.

In regular expressions, references to regular definitions must be enclosed in {} to distinguish them from the corresponding character sequences.

Page 34: Compiler Construction

Auxiliary procedures

Arbitrary C code can be placed in this section, e.g. functions to manipulate the symbol table.

(Already explained.)

Page 35: Compiler Construction

Special characters

Some characters have special meaning to Lex.
    ‘.’ in a RE stands for ANY character
    ‘*’ stands for Kleene closure
    ‘+’ stands for positive closure
    ‘?’ stands for 0-or-1 instance of
    ‘-’ produces a character range (e.g. in [A-Z])

When you want to use these characters literally in a RE, they must be “escaped”, e.g. in the RE {digit}+(\.{digit}+)? the ‘.’ is escaped with ‘\’.

Page 36: Compiler Construction

Lex interface to yacc

The yacc parser calls a function yylex() produced by lex.
yylex() returns the next token it finds in the input stream.
yacc expects the token’s attribute, if any, to be returned via the global variable yylval.
The declaration of yylval is up to you (the compiler writer).

In our example, we use a union, since we have a few different kinds of attributes.

Page 37: Compiler Construction

Lookahead in Lex

Sometimes, we don’t know what the next token is until we have looked ahead several characters. Recognition of the DO keyword in Fortran is a famous example.

    DO5I=1.25    assigns the value 1.25 to DO5I
    DO5I=1,25    is a DO loop

Lex handles long-range lookahead with r1/r2, which matches r1 only when it is followed by r2:

    DO/({letter}|{digit})*=({letter}|{digit})*,

This recognizes the keyword DO (if it’s followed by letters & digits, ‘=’, more letters & digits, followed by a ‘,’).
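The effect of the r1/r2 operator can be imitated in Python with a lookahead assertion, which tests the following text without consuming it. This is an analogy in Python's `re` dialect, not Lex syntax; the names `DO_KEYWORD` and `starts_do_loop` are invented for the example, and blanks (which Fortran ignores) are assumed to have been removed already.

```python
import re

# "DO" is matched only if followed by letters/digits, "=", letters/digits, ",".
# The (?=...) part is checked but is not part of the matched lexeme.
DO_KEYWORD = re.compile(r"DO(?=[A-Za-z0-9]*=[A-Za-z0-9]*,)")


def starts_do_loop(statement):
    """True if statement begins with the keyword DO rather than an identifier."""
    return DO_KEYWORD.match(statement) is not None
```

In both Lex and this sketch, only the two characters DO are consumed; the lookahead text is scanned again as part of the following tokens.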

Page 38: Compiler Construction

Finite Automata for Lexical Analysis

Page 39: Compiler Construction

Automatic lexical analyzer generation

How do Lex and similar tools do their job?
    Lex translates regular expressions into transition diagrams.
    Then it translates the transition diagrams into C code to recognize tokens in the input stream.

There are many possible algorithms.

The simplest algorithm is RE -> NFA -> DFA -> C code.

Page 40: Compiler Construction

Finite automata (FAs) and regular languages

A RECOGNIZER takes language L and string x as input, and responds YES if x∈L, or NO otherwise.

The finite automaton (FA) is one class of recognizer.
A FA is DETERMINISTIC if there is only one possible transition for each <state,input> pair.
A FA is NONDETERMINISTIC if there is more than one possible transition for some <state,input> pair.
BUT both DFAs and NFAs recognize the same class of languages: REGULAR languages, or the class of languages that can be written as regular expressions.

Page 41: Compiler Construction

NFAs

A NFA is a 5-tuple < S, ∑, move, s0, F >
    S is the set of STATES in the automaton.
    ∑ is the INPUT CHARACTER SET.
    move( s, c ) = T is the TRANSITION FUNCTION, specifying the set of states T ⊆ S that the automaton can move to on seeing input c while in state s.
    s0 is the START STATE.
    F is the set of FINAL, or ACCEPTING, STATES.

Page 42: Compiler Construction

NFA example

The NFA

[Figure: NFA transition diagram]

has move() function:

[Figure: transition table for move()]

and recognizes the language L = (a|b)*abb (the set of all strings of a’s and b’s ending with abb).

Page 43: Compiler Construction

The language defined by a NFA

An NFA ACCEPTS string x iff there exists a path from s0 to an accepting state, such that the edge labels along the path spell out x.

The LANGUAGE DEFINED BY a NFA N, written L(N), is the set of strings it accepts.
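The acceptance condition can be tested without searching paths one at a time, by tracking the set of all states the NFA could be in. The sketch below assumes a four-state NFA for (a|b)*abb as it is commonly drawn in textbooks (the slide's own figure is not reproduced in this text, so the state numbering is an assumption), with no ε-transitions; `MOVE`, `START`, `ACCEPTING`, and `nfa_accepts` are names chosen for the example.

```python
# Assumed NFA for (a|b)*abb: state 0 loops on a and b, then a, b, b lead to 3.
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(x):
    """True if some path from START spelling out x ends in an accepting state."""
    current = {START}
    for c in x:
        # Follow every possible transition on c from every current state.
        current = set().union(*(MOVE.get((s, c), set()) for s in current))
    return bool(current & ACCEPTING)
```

A string is accepted as soon as one of the tracked states is accepting at the end of the input, which matches the "there exists a path" wording of the definition.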

Page 44: Compiler Construction

Another NFA example

This NFA accepts L = aa*|bb*

Page 45: Compiler Construction

Deterministic FAs (DFAs)

The DFA is a special case of the NFA in which:
    No state has an ε-transition.
    No state has more than one edge leaving it for the same input character.

The benefit of DFAs is that they are simple to simulate: there is only one choice for the machine’s state after each input symbol.

Page 46: Compiler Construction

Algorithm to simulate a DFA

Inputs: string x terminated by EOF; DFA D = < S, ∑, move, s0, F >
Outputs: YES if D accepts x; NO otherwise
Method:

    s = s0;
    c = nextchar;
    while ( c != EOF ) {
        s = move( s, c );
        c = nextchar;
    }
    if ( s ∈ F ) return YES
    else return NO
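The simulation algorithm translates almost directly into Python. The transition table below is an assumed four-state DFA for (a|b)*abb, where state k means "the last k characters match the first k characters of abb"; it is not copied from the slides. The end of the Python string plays the role of EOF.

```python
# Assumed DFA for (a|b)*abb over the alphabet {a, b}.
MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
ACCEPTING = {3}


def dfa_accepts(x):
    """Run the DFA over x; exactly one state is active at every step."""
    s = START
    for c in x:
        s = MOVE[(s, c)]  # raises KeyError for symbols outside the alphabet
    return s in ACCEPTING
```

Each input character costs one table lookup, which is why generated lexers are usually driven by a DFA.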

Page 47: Compiler Construction

DFA example

This DFA accepts L = (a|b)*abb

Page 48: Compiler Construction

RE -> DFA

Now we know how to simulate DFAs.
If we can convert our REs into a DFA, we can automatically generate lexical analyzers.
BUT it is not easy to convert REs directly into a DFA.
Instead, we will convert our REs to a NFA, then convert the NFA to a DFA.

Page 49: Compiler Construction

Converting a NFA to a DFA

Page 50: Compiler Construction

NFA -> DFA

NFAs are ambiguous: we don’t know what state a NFA is in after observing each input.
The simplest conversion method is to have the DFA track the SUBSET of states the NFA MIGHT be in.
We need three functions for the construction:
    ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone.
    ε-closure(T): the set of NFA states reachable from some state s ∈ T on ε-transitions alone.
    move(T,a): the set of NFA states to which there is a transition on input a from some NFA state s ∈ T.

Page 51: Compiler Construction

Subset construction algorithm

Inputs: a NFA N = < SN, ∑, tranN, n0, FN >
Outputs: a DFA D = < SD, ∑, tranD, d0, FD >
Method:

    add a state d0 to SD corresponding to ε-closure(n0)
    while there is an unexpanded state di ∈ SD {
        for each input symbol a ∈ ∑ {
            dj = ε-closure(move(di,a))
            if dj ∉ SD,
                add dj to SD
            tranD( di, a ) = dj
        }
    }

FD is the set of states in SD that contain at least one state of FN.
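The subset construction, together with the ε-closure and move functions it relies on, can be sketched as below. The representation is an assumption made for the example: NFA transitions are a dict from (state, symbol) to a set of states, with `None` standing for ε, and each DFA state is a frozenset of NFA states. The sample NFA for aa*|bb* is an assumed five-state machine, not the one drawn on the slides.

```python
def epsilon_closure(states, trans):
    """All NFA states reachable from states using ε-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, a, trans):
    """NFA states reachable from some state in states on input a."""
    return {t for s in states for t in trans.get((s, a), set())}


def subset_construction(trans, start, accepting, alphabet):
    """Return (d0, dtran, daccept) for the DFA equivalent to the NFA."""
    d0 = epsilon_closure({start}, trans)
    dstates, unexpanded, dtran = {d0}, [d0], {}
    while unexpanded:
        d = unexpanded.pop()
        for a in alphabet:
            target = epsilon_closure(move(d, a, trans), trans)
            if target not in dstates:
                dstates.add(target)
                unexpanded.append(target)
            dtran[(d, a)] = target
    # A DFA state accepts if it contains at least one accepting NFA state.
    daccept = {d for d in dstates if d & accepting}
    return d0, dtran, daccept


def run_dfa(d0, dtran, daccept, x):
    """Simulate the constructed DFA on string x."""
    d = d0
    for c in x:
        d = dtran[(d, c)]
    return d in daccept


# Assumed NFA for aa*|bb*: ε-moves from 0 to 1 and 3, then a-loop or b-loop.
NFA_TRANS = {
    (0, None): {1, 3},
    (1, "a"): {2}, (2, "a"): {2},
    (3, "b"): {4}, (4, "b"): {4},
}
NFA_ACCEPTING = {2, 4}
```

In this sketch the empty set of NFA states becomes an ordinary DFA state with transitions only to itself, acting as a dead state for inputs that can no longer be accepted.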

Page 52: Compiler Construction

Examples: convert these NFAs

a)

b)

Page 53: Compiler Construction

Converting a RE to a NFA

Page 54: Compiler Construction

RE -> NFA

The construction is bottom up.
Construct NFAs to recognize ε and each element a ∈ ∑.
Recursively expand those NFAs for alternation, concatenation, and Kleene closure.

Every step introduces at most two additional NFA states.
Therefore the NFA has at most twice as many states as the regular expression has symbols and operators.

Page 55: Compiler Construction

RE -> NFA algorithm

Inputs: A RE r over alphabet ∑
Outputs: A NFA N accepting L(r)
Method: Parse r.

If r = ε, then N is

[Figure: NFA for ε]

If r = a ∈ ∑, then N is

[Figure: NFA for a]

If r = s | t, construct N(s) for s and N(t) for t, then N is

[Figure: NFA for s | t]

Page 56: Compiler Construction

RE -> NFA algorithm

If r = st, construct N(s) for s and N(t) for t, then N is

[Figure: NFA for st]

If r = s*, construct N(s) for s, then N is

[Figure: NFA for s*]

If r = ( s ), construct N(s), then let N be N(s).
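The recursive construction can be sketched in Python for REs that have already been parsed. Everything about the representation is an assumption made for the example: an RE is a nested tuple such as `("alt", ("sym", "a"), ("sym", "b"))`, `None` labels an ε-transition, and states are integers drawn from a counter. One simplification is named plainly: for concatenation the sketch links N(s) to N(t) with an ε-edge instead of merging the two states, which adds no new states and accepts the same language.

```python
from itertools import count


def build(node, trans, fresh):
    """Add the NFA for node to trans; return its (start, accept) states."""

    def edge(src, label, dst):
        trans.setdefault((src, label), set()).add(dst)

    kind = node[0]
    if kind == "cat":
        s1, f1 = build(node[1], trans, fresh)
        s2, f2 = build(node[2], trans, fresh)
        edge(f1, None, s2)  # ε-link instead of merging f1 with s2
        return s1, f2
    start, accept = next(fresh), next(fresh)  # at most two new states per step
    if kind == "eps":
        edge(start, None, accept)
    elif kind == "sym":
        edge(start, node[1], accept)
    elif kind == "alt":
        for sub in node[1:]:
            s, f = build(sub, trans, fresh)
            edge(start, None, s)
            edge(f, None, accept)
    elif kind == "star":
        s, f = build(node[1], trans, fresh)
        edge(start, None, s)       # enter the loop body
        edge(f, None, s)           # repeat the body
        edge(f, None, accept)      # leave after one or more repetitions
        edge(start, None, accept)  # zero repetitions
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    return start, accept


def closure(states, trans):
    """ε-closure of a set of NFA states."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result


def accepts(node, x):
    """Build the NFA for node and simulate it on x."""
    trans = {}
    start, accept = build(node, trans, count())
    current = closure({start}, trans)
    for c in x:
        moved = {t for s in current for t in trans.get((s, c), ())}
        current = closure(moved, trans)
    return accept in current


# (a|b)*abb as a nested tuple; concatenation is left associative.
A_OR_B_STAR = ("star", ("alt", ("sym", "a"), ("sym", "b")))
ABB = ("cat", ("cat", ("cat", A_OR_B_STAR, ("sym", "a")), ("sym", "b")), ("sym", "b"))
```

Parenthesized subexpressions need no case of their own here, because grouping is already captured by the nesting of the tuples.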

Page 57: Compiler Construction

Example

Use the NFA construction algorithm to build a NFA for r = (a|b)*abb