Compiler Parsing — Xipeng Qiu (邱锡鹏), School of Computer Science, Fudan University. [email protected], http://jkx.fudan.edu.cn/~xpqiu/compiler




Top-down Analysis

Outline
- Parser overview
- Context-free grammars (CFG)
- LL Parsing
- LR Parsing

Languages and Automata
Formal languages are very important in CS, especially in programming languages. Regular languages are the weakest formal languages widely used, with many applications. We will also study context-free languages.

Limitations of Regular Languages
Intuition: a finite automaton that runs long enough must repeat states. A finite automaton cannot remember the number of times it has visited a particular state; it has finite memory, only enough to store which state it is in. It cannot count, except up to a finite limit. E.g., the language of balanced parentheses { (^i )^i | i >= 0 } is not regular.

Parser Overview

The Functionality of the Parser
Syntax analysis for natural languages: recognize whether a sentence is grammatically well-formed and identify the function of each component.
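The counting argument can be made concrete with a short sketch (mine, not from the slides): recognizing balanced parentheses needs a counter that can grow without bound, which a finite automaton's fixed set of states cannot provide. The checker below handles general balanced strings, a superset of { (^i )^i }.

```python
def balanced(s: str) -> bool:
    """Recognize balanced parentheses with an unbounded counter."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1          # the counter can grow without bound
        elif ch == ')':
            depth -= 1
            if depth < 0:       # a ')' with no matching '('
                return False
        else:
            return False        # only parentheses are allowed
    return depth == 0
```

Because `depth` can take arbitrarily many values, no fixed number of DFA states can simulate it.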

Syntax Analysis Example (slide figure)

Comparison with Lexical Analysis

  Phase   | Input                  | Output
  Lexer   | Sequence of characters | Sequence of tokens
  Parser  | Sequence of tokens     | Parse tree

The Role of the Parser
Not all sequences of tokens are programs; the parser must distinguish between valid and invalid sequences of tokens.

We need: a language for describing valid sequences of tokens, and a method for distinguishing valid from invalid sequences of tokens.

Context-Free Grammars
Programming language constructs have recursive structure.

An EXPR is:
if EXPR then EXPR else EXPR fi, or
while EXPR loop EXPR pool, or ...
Context-free grammars are a natural notation for this recursive structure.

CFGs (Cont.)
A CFG consists of: a set of terminals T, a set of non-terminals N, a start symbol S (a non-terminal), and a set of productions. Assuming X ∈ N, each production has the form X → ε, or X → Y1 Y2 ... Yn where Yi ∈ N ∪ T.

Notational Conventions
In these lecture notes, non-terminals are written upper-case and terminals lower-case; the start symbol is the left-hand side of the first production. In a production A → α, A is the left side and α is the right side. The productions A → α1, A → α2, ..., A → αk are abbreviated A → α1 | α2 | ... | αk.

Examples of CFGs
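As a small sketch (my encoding, not from the slides), a CFG with the four components above can be represented directly in Python: non-terminals map to lists of alternatives, each alternative a tuple of symbols, with ε as the empty tuple.

```python
# Grammar: E -> E + E | E * E | ( E ) | int
grammar = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("int",)],
}
terminals = {"+", "*", "(", ")", "int"}
start = "E"

def is_terminal(sym: str) -> bool:
    """A symbol is a terminal iff it has no replacement rules."""
    return sym in terminals
```

This encoding is enough to drive the derivation and parsing algorithms that follow.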

Examples of CFGs (cont.)
Simple arithmetic expressions (slide figure).

The Language of a CFG
Read productions as replacement rules: X → Y1 ... Yn means X can be replaced by Y1 ... Yn; X → ε means X can be erased (replaced with the empty string).

Key Ideas
1. Begin with a string consisting of the start symbol S.
2. Replace any non-terminal X in the string by the right-hand side of some production X → Y1 ... Yn.
3. Repeat (2) until there are no non-terminals in the string.

The Language of a CFG (Cont.)
More formally, write
X1 ... Xi ... Xn → X1 ... Xi-1 Y1 ... Ym Xi+1 ... Xn
if there is a production Xi → Y1 ... Ym. Write
X1 ... Xn →* Y1 ... Ym
if X1 ... Xn can be rewritten to Y1 ... Ym in 0 or more steps.

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
L(G) = { a1 ... an | S →* a1 ... an and every ai is a terminal }

Terminals
Terminals are so called because there are no rules for replacing them.

Once generated, terminals are permanent

Terminals ought to be tokens of the language

Examples
L(G) is the language of the CFG G. Example: the strings of balanced parentheses. Two grammars for this language, and some elements of the language, were shown as figures on the slides.

Arithmetic Example
Simple arithmetic expressions; some elements of the language were shown as figures on the slides.

Notes
The idea of a CFG is a big step. But:

- Membership in a language is a yes-or-no answer; we also need the parse tree of the input.
- We must handle errors gracefully.
- We need an implementation of CFGs (e.g., bison).

More Notes
The form of the grammar is important: many grammars generate the same language, and tools are sensitive to the grammar.

Note: tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice.

Derivations and Parse Trees
A derivation is a sequence of productions: S → ... → ... → ...

A derivation can be drawn as a tree: the start symbol is the tree's root, and for a production X → Y1 ... Yn we add children Y1, ..., Yn to node X.

Derivation Example
Grammar: E → E + E | E * E | ( E ) | id
String: id * id + id

A left-most derivation, building the parse tree step by step:
E
⇒ E + E
⇒ E * E + E
⇒ id * E + E
⇒ id * id + E
⇒ id * id + id

Notes on Derivations
A parse tree has terminals at the leaves and non-terminals at the interior nodes.

An in-order traversal of the leaves is the original input

The parse tree shows the association of operations; the input string does not.

Left-most and Right-most Derivations
Two choices are made in each step of a derivation: which non-terminal to replace, and which alternative to use for that non-terminal. The example above is a left-most derivation: at each step, replace the left-most non-terminal.

There is an equivalent notion of a right-most derivation


Right-most Derivation in Detail
E
⇒ E + E
⇒ E + id
⇒ E * E + id
⇒ E * id + id
⇒ id * id + id

Derivations and Parse Trees
Note that the right-most and left-most derivations have the same parse tree.

The difference is the order in which the branches are added.

Ambiguity
Grammar: E → E + E | E * E | ( E ) | int

Strings: int + int + int and int * int + int

Ambiguity. Example
The string int + int + int has two parse trees, corresponding to (int + int) + int and int + (int + int); + is left-associative. Likewise, int * int + int has two parse trees, corresponding to (int * int) + int and int * (int + int); * has higher precedence than +.

Ambiguity (Cont.)
A grammar is ambiguous if it has more than one parse tree for some string. Equivalently, there is more than one right-most or left-most derivation for some string.

Ambiguity (Cont.)
Ambiguity is bad: it leaves the meaning of some programs ill-defined.

Ambiguity is common in programming languages: arithmetic expressions, IF-THEN-ELSE.

Dealing with Ambiguity
There are several ways to handle ambiguity.

The most direct method is to rewrite the grammar unambiguously:
E → E + T | T
T → T * int | int | ( E )

This enforces precedence of * over + and left-associativity of + and *.

Ambiguity. Example
The string int * int + int has only one parse tree now.

Ambiguity: The Dangling Else
Consider the grammar: E → if E then E | if E then E else E | OTHER

This grammar is also ambiguous.

The Dangling Else: A Fix
An else matches the closest unmatched then. We can describe this in the grammar by distinguishing between matched and unmatched then:
E → MIF /* all then are matched */
  | UIF /* some then are unmatched */
MIF → if E then MIF else MIF | OTHER
UIF → if E then E | if E then MIF else UIF
This describes the same set of strings.

Ambiguity
There are no general techniques for handling ambiguity; it is impossible to automatically convert an ambiguous grammar to an unambiguous one. Used with care, ambiguity can simplify the grammar and sometimes allows more natural definitions, so we need disambiguation mechanisms.

LL(1) Parsing

Intro to Top-Down Parsing
Terminals are seen in order of appearance in the token stream: t1 t2 t3 t4 t5. The parse tree is constructed from the top, and from left to right.

Recursive Descent Parsing
Consider the grammar: E → T + E | T and T → int | int * T | ( E ). The token stream is: int5 * int2. Start with the top-level non-terminal E.

Try the rules for E in order.

Recursive Descent Parsing. Example (Cont.)
Try E0 → T1 + E2. Then try a rule for T1: T1 → ( E3 ), but ( does not match the input token int5. Try T1 → int; the token matches, but the + after T1 does not match the input token *. Try T1 → int * T2; this will match, but the + after T1 will be unmatched. Having exhausted the choices for T1, backtrack to the choice for E0.

Try E0 → T1. Follow the same steps as before for T1, and succeed with T1 → int * T2 and T2 → int, giving the parse tree E0(T1(int5 * T2(int2))).

Recursive Descent Parsing. Notes
Easy to implement by hand.
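The hand-implementation can be sketched as follows (my sketch, not the course's code): each function takes a position in the token list and returns the position after a match, or None on failure. Trying the alternatives in order, and falling back when a later token fails to match, is exactly the backtracking described above.

```python
# Backtracking recursive descent for:
#   E -> T + E | T        T -> int | int * T | ( E )
def parse_E(toks, i):
    j = parse_T(toks, i)                      # try E -> T + E
    if j is not None and j < len(toks) and toks[j] == '+':
        k = parse_E(toks, j + 1)
        if k is not None:
            return k
    return parse_T(toks, i)                   # backtrack: E -> T

def parse_T(toks, i):
    if i < len(toks) and toks[i] == 'int':
        if i + 1 < len(toks) and toks[i + 1] == '*':   # try T -> int * T
            j = parse_T(toks, i + 2)
            if j is not None:
                return j
        return i + 1                          # T -> int
    if i < len(toks) and toks[i] == '(':      # T -> ( E )
        j = parse_E(toks, i + 1)
        if j is not None and j < len(toks) and toks[j] == ')':
            return j + 1
    return None

def accepts(toks):
    return parse_E(toks, 0) == len(toks)
```

On `['int', '*', 'int']` the first attempt E → T + E consumes the whole input as T but finds no +, so the parser backtracks to E → T and succeeds, mirroring the slide's trace.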

But does not always work

Recursive-Descent Parsing
Parsing: given a string of tokens t1 t2 ... tn, find its parse tree. Recursive-descent parsing tries all the productions exhaustively. At a given moment the fringe of the parse tree is t1 t2 ... tk A ... Try all the productions for A: if A → B C is a production, the new fringe is t1 t2 ... tk B C ... Backtrack when the fringe doesn't match the string; stop when there are no more non-terminals.

When Recursive Descent Does Not Work
Consider a production S → S a: in the process of parsing S we try the above rule. What goes wrong? A left-recursive grammar has a non-terminal S with S ⇒+ S α for some α. Recursive descent does not work in such cases; it goes into an infinite loop.

Elimination of Left Recursion
Consider the left-recursive grammar S → S α | β. S generates all strings starting with a β and followed by a number of α's. It can be rewritten using right recursion:
S → β S'
S' → α S' | ε

Elimination of Left-Recursion. Example
Consider the grammar S → 1 | S 0 (β = 1 and α = 0).

It can be rewritten as S → 1 S' and S' → 0 S' | ε.

More Elimination of Left-Recursion
In general: S → S α1 | ... | S αn | β1 | ... | βm. All strings derived from S start with one of β1, ..., βm and continue with several instances of α1, ..., αn. Rewrite as:
S → β1 S' | ... | βm S'
S' → α1 S' | ... | αn S' | ε

Summary of Recursive Descent
A simple and general parsing strategy. Left-recursion must be eliminated first, but that can be done automatically. Unpopular because of backtracking, which is thought to be too inefficient. In practice, backtracking is eliminated by restricting the grammar.
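The general rewrite above is mechanical, which is why it can be automated. A minimal sketch (my code, handling only immediate left recursion) splits the alternatives into the left-recursive α's and the base-case β's:

```python
def eliminate_left_recursion(nonterm, alts):
    """S -> S a1 | ... | b1 | ...  becomes  S -> b1 S' | ...; S' -> a1 S' | ... | eps."""
    rec = [alt[1:] for alt in alts if alt and alt[0] == nonterm]   # the alphas
    base = [alt for alt in alts if not alt or alt[0] != nonterm]   # the betas
    if not rec:
        return {nonterm: alts}          # nothing to do
    prime = nonterm + "'"
    return {
        nonterm: [beta + [prime] for beta in base],
        prime: [alpha + [prime] for alpha in rec] + [[]],          # [] is eps
    }
```

Applied to S → 1 | S 0 it produces S → 1 S' and S' → 0 S' | ε, matching the slide's example.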

Predictive Parsers
Like recursive descent, but the parser can predict which production to use by looking at the next few tokens, with no backtracking. Predictive parsers accept LL(k) grammars: the first L means a left-to-right scan of the input, the second L means a leftmost derivation, and k means the prediction is based on k tokens of lookahead. In practice, LL(1) is used.

LL(1) Languages
In recursive descent, for each non-terminal and input token there may be a choice of production. LL(1) means that for each non-terminal and token there is only one production that could lead to success. This can be specified as a 2D table: one dimension for the current non-terminal to expand, one dimension for the next token; a table entry contains one production.

Predictive Parsing and Left Factoring
Recall the grammar E → T + E | T and T → int | int * T | ( E ). Prediction is impossible because for T two productions start with int, and for E it is not clear how to predict. A grammar must be left-factored before use for predictive parsing.

LL(1) Parsing Table Example
Left-factored grammar:
E → T X
X → + E | ε
T → ( E ) | int Y
Y → * T | ε

The LL(1) parsing table (blank entries are errors):

       int      *      +      (      )      $
  E    T X                    T X
  X                    + E           ε      ε
  T    int Y                  ( E )
  Y             * T    ε             ε      ε

LL(1) Parsing Table Example (Cont.)
Consider the [E, int] entry: when the current non-terminal is E and the next input is int, use the production E → T X; this production can generate an int in the first position. Consider the [Y, +] entry: when the current non-terminal is Y and the current token is +, get rid of Y (expand it to ε). We'll see later why this is so.

LL(1) Parsing Tables. Errors
Blank entries indicate error situations. Consider the [E, *] entry: there is no way to derive a string starting with * from the non-terminal E.

Using Parsing Tables
The method is similar to recursive descent, except that for each non-terminal S we look at the next token a and choose the production shown at [S, a]. We use a stack to keep track of pending non-terminals, reject when we encounter an error state, and accept when we encounter end-of-input.

LL(1) Parsing Algorithm
initialize stack = <S $> and next (pointer to tokens)
repeat
  case stack of
    <X, rest> : if T[X, *next] = Y1 ... Yn
                then stack = <Y1 ... Yn, rest>
                else error ();
    <t, rest> : if t == *next++
                then stack = <rest>
                else error ();
until stack == < >

LL(1) Parsing Example
Stack          Input           Action
E $            int * int $     T X
T X $          int * int $     int Y
int Y X $      int * int $     terminal
Y X $          * int $         * T
* T X $        * int $         terminal
T X $          int $           int Y
int Y X $      int $           terminal
Y X $          $               ε
X $            $               ε
$              $               ACCEPT

Constructing Parsing Tables
LL(1) languages are those defined by a parsing table for the LL(1) algorithm; no table entry can be multiply defined.

We want to generate parsing tables from the CFG.

Top-Down Parsing. Review
Top-down parsing expands a parse tree from the start symbol to the leaves, always expanding the leftmost non-terminal.
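The table-driven algorithm above can be sketched directly (my sketch, with the slides' table transcribed as a Python dict; ε is the empty tuple, and the top of the stack is the end of the list):

```python
TABLE = {
    ('E', 'int'): ('T', 'X'), ('E', '('): ('T', 'X'),
    ('X', '+'): ('+', 'E'),   ('X', ')'): (), ('X', '$'): (),
    ('T', 'int'): ('int', 'Y'), ('T', '('): ('(', 'E', ')'),
    ('Y', '*'): ('*', 'T'),   ('Y', '+'): (), ('Y', ')'): (), ('Y', '$'): (),
}
NONTERMS = {'E', 'X', 'T', 'Y'}

def ll1_parse(tokens):
    toks = tokens + ['$']
    stack = ['$', 'E']
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            rhs = TABLE.get((top, toks[i]))
            if rhs is None:
                return False          # blank table entry: error
            stack.extend(reversed(rhs))   # push rhs, leftmost symbol on top
        else:
            if top != toks[i]:
                return False          # terminal mismatch
            i += 1
    return i == len(toks)
```

Running it on `['int', '*', 'int']` reproduces the stack trace shown on the slide, ending in ACCEPT.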

The leaves at any point form a string β A γ, where β contains only terminals. If the input string is β b δ, the prefix β matches and the next token is b.

Predictive Parsing. Review
A predictive parser is described by a table: for each non-terminal A and for each token b we specify a production A → α. When trying to expand A, we use A → α if b is the next token.

Once we have the table, the parsing algorithm is simple and fast; no backtracking is necessary.

Constructing Predictive Parsing Tables
Consider the state S →* β A γ, with b the next token, trying to match β b δ. There are two possibilities:

1. b belongs to an expansion of A. Any A → α can be used if b can start a string derived from α; in this case we say that b ∈ First(α).

2. b does not belong to an expansion of A. Then the expansion of A is empty and b belongs to an expansion of γ; this means that b can appear after A in a derivation of the form S →* β A b ω. We say that b ∈ Follow(A) in this case.

What productions can we use in this case? Any A → α can be used if α can expand to ε; we say that ε ∈ First(A) in this case.

Computing First Sets
Definition: First(X) = { b | X →* bα } ∪ { ε | X →* ε }. For a terminal b, First(b) = { b }.

For all productions X → A1 ... An:
- Add First(A1) − {ε} to First(X). Stop if ε ∉ First(A1).
- Add First(A2) − {ε} to First(X). Stop if ε ∉ First(A2).
- ...
- Add First(An) − {ε} to First(X). Stop if ε ∉ First(An).
- Add ε to First(X).

First Sets. Example
Recall the grammar: E → T X; X → + E | ε; T → ( E ) | int Y; Y → * T | ε.
First sets:
First( ( ) = { ( }      First( T ) = { int, ( }
First( ) ) = { ) }      First( E ) = { int, ( }
First( int ) = { int }  First( X ) = { +, ε }
First( + ) = { + }      First( Y ) = { *, ε }
First( * ) = { * }

Computing Follow Sets
Definition: Follow(X) = { b | S →* β X b δ }. Compute the First sets for all non-terminals first, and add $ to Follow(S) (if S is the start non-terminal).

For all productions Y → ... X A1 ... An:
- Add First(A1) − {ε} to Follow(X). Stop if ε ∉ First(A1).
- Add First(A2) − {ε} to Follow(X). Stop if ε ∉ First(A2).
- ...
- Add First(An) − {ε} to Follow(X). Stop if ε ∉ First(An).
- Add Follow(Y) to Follow(X).

Follow Sets. Example
Recall the grammar: E → T X; X → + E | ε; T → ( E ) | int Y; Y → * T | ε.
Follow sets:
Follow( + ) = { int, ( }    Follow( * ) = { int, ( }
Follow( ( ) = { int, ( }    Follow( E ) = { ), $ }
Follow( X ) = { $, ) }      Follow( T ) = { +, ), $ }
Follow( ) ) = { +, ), $ }   Follow( Y ) = { +, ), $ }
Follow( int ) = { *, +, ), $ }

Constructing LL(1) Parsing Tables
Construct a parsing table T for CFG G.
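Both computations are fixpoint iterations over the productions. A sketch for the non-terminal sets of the same grammar (my code; ε is modeled as the empty string ''):

```python
GRAMMAR = {
    'E': [['T', 'X']],
    'X': [['+', 'E'], []],
    'T': [['(', 'E', ')'], ['int', 'Y']],
    'Y': [['*', 'T'], []],
}
TERMS = {'+', '*', '(', ')', 'int'}

def first_sets():
    first = {t: {t} for t in TERMS}
    first.update({n: set() for n in GRAMMAR})
    changed = True
    while changed:
        changed = False
        for X, alts in GRAMMAR.items():
            for alt in alts:
                before = len(first[X])
                for sym in alt:
                    first[X] |= first[sym] - {''}
                    if '' not in first[sym]:
                        break               # sym cannot vanish; stop here
                else:
                    first[X].add('')        # every symbol can derive eps
                changed |= len(first[X]) != before
    return first

def follow_sets(first, start='E'):
    follow = {n: set() for n in GRAMMAR}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for Y, alts in GRAMMAR.items():
            for alt in alts:
                for i, X in enumerate(alt):
                    if X not in GRAMMAR:
                        continue
                    before = len(follow[X])
                    nullable_rest = True
                    for sym in alt[i + 1:]:
                        follow[X] |= first[sym] - {''}
                        if '' not in first[sym]:
                            nullable_rest = False
                            break
                    if nullable_rest:
                        follow[X] |= follow[Y]   # X can end a Y-derivation
                    changed |= len(follow[X]) != before
    return follow
```

Running it reproduces the sets listed above, e.g. First(X) = {+, ε} and Follow(T) = {+, ), $}.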

For each production A → α in G do:
- For each terminal b ∈ First(α), set T[A, b] = α.
- If α →* ε, for each b ∈ Follow(A), set T[A, b] = α.
- If α →* ε and $ ∈ Follow(A), set T[A, $] = α.

Constructing LL(1) Tables. Example
Recall the grammar: E → T X; X → + E | ε; T → ( E ) | int Y; Y → * T | ε. Where in the row of Y do we put Y → * T? In the column of First( * T ) = { * }.

Where in the row of Y do we put Y → ε? In the columns of Follow(Y) = { $, +, ) }.

Notes on LL(1) Parsing Tables
If any entry is multiply defined then G is not LL(1): if G is ambiguous, if G is left-recursive, if G is not left-factored, and in other cases as well. Most programming language grammars are not LL(1). There are tools that build LL(1) tables.

LR Parsing

Bottom-up Parsing
Bottom-up parsing is more general than top-down parsing, and builds on ideas in top-down parsing; it is the preferred method in practice. It is also called LR parsing: L means that tokens are read from left to right, R means that a rightmost derivation is constructed. Variants: LR(0), LR(1), SLR(1), LALR(1).
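The three construction rules translate directly into code. A sketch (my code) using the First/Follow sets computed on the previous slides, hardcoded here so the block stands alone:

```python
FIRST = {'+': {'+'}, '*': {'*'}, '(': {'('}, ')': {')'}, 'int': {'int'},
         'E': {'int', '('}, 'X': {'+', ''}, 'T': {'int', '('}, 'Y': {'*', ''}}
FOLLOW = {'E': {')', '$'}, 'X': {')', '$'},
          'T': {'+', ')', '$'}, 'Y': {'+', ')', '$'}}
GRAMMAR = {'E': [['T', 'X']], 'X': [['+', 'E'], []],
           'T': [['(', 'E', ')'], ['int', 'Y']], 'Y': [['*', 'T'], []]}

def first_of_string(alpha):
    """First of a symbol string; '' marks that the whole string derives eps."""
    out = set()
    for sym in alpha:
        out |= FIRST[sym] - {''}
        if '' not in FIRST[sym]:
            return out
    out.add('')
    return out

def build_table():
    table = {}
    for A, alts in GRAMMAR.items():
        for alpha in alts:
            fa = first_of_string(alpha)
            for b in fa - {''}:
                table[(A, b)] = alpha     # rule 1: b in First(alpha)
            if '' in fa:
                for b in FOLLOW[A]:       # rules 2-3: alpha =>* eps
                    table[(A, b)] = alpha
    return table
```

The result matches the table shown earlier: e.g. T[E, int] = T X, T[Y, +] = ε, and [E, *] stays blank.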

The Idea
LR parsing reduces a string to the start symbol by inverting productions:

str ← input string of terminals
repeat
  identify β in str such that A → β is a production (i.e., str = α β γ)
  replace β by A in str (i.e., str becomes α A γ)
until str = S

An Introductory Example
LR parsers don't need left-factored grammars and can also handle left-recursive grammars. Consider the following grammar: E → E + ( E ) | int. Why is this not LL(1)? Consider the string: int + ( int ) + ( int ).

A Bottom-up Parse in Detail
int + (int) + (int)
E + (int) + (int)
E + (E) + (int)

E + (int)
E + (E)
E

This is a rightmost derivation in reverse.

Where Do Reductions Happen
Let α β γ be a step of a bottom-up parse, and assume the next reduction is by A → β. Then γ is a string of terminals. Why? Because α A γ → α β γ is a step in a right-most derivation.

Top-down vs. Bottom-up
Bottom-up: we don't need to figure out as much of the parse tree for a given amount of input.

Notation
Idea: split the string into two substrings. The right substring (a string of terminals) is as yet unexamined by the parser; the left substring has terminals and non-terminals. The dividing point is marked by a | (the | is not part of the string). Initially, all input is unexamined: | x1 x2 ... xn.

Shift-Reduce Parsing
Bottom-up parsing uses only three kinds of actions:

shift, reduce, and accept.

Shift
Shift: move | one place to the right; this shifts a terminal to the left string.

E + ( | int )  ⇒  E + ( int | )

Reduce
Reduce: apply an inverse production at the right end of the left string. If E → E + ( E ) is a production, then

E + ( E + ( E ) | )  ⇒  E + ( E | )

Shift-Reduce Example
| int + (int) + (int)$      shift
int | + (int) + (int)$      reduce E → int
E | + (int) + (int)$        shift 3 times
E + (int | ) + (int)$       reduce E → int
E + (E | ) + (int)$         shift
E + (E) | + (int)$          reduce E → E + (E)
E | + (int)$                shift 3 times
E + (int | )$               reduce E → int
E + (E | )$                 shift
E + (E) | $                 reduce E → E + (E)
E | $                       accept

Key Issue: When to Shift or Reduce?
Decide based on the left string (the stack). Idea: use a finite automaton (DFA) to decide when to shift or reduce. The DFA input is the stack; its alphabet consists of terminals and non-terminals.

LR Parsing Engine
Basic mechanism: use a set of parser states, a stack of states, and a parsing table to determine what action to apply (shift/reduce) and the next state. The parser actions can be precisely determined from the table.

Representing the DFA
Parsers represent the DFA as a 2D table (recall table-driven lexical analysis). Rows correspond to DFA states; columns correspond to terminals and non-terminals. Typically the columns are split into those for terminals (the action table) and those for non-terminals (the goto table).
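The trace above can be replayed with a toy loop (my sketch): reduce greedily whenever the top of the stack matches a right-hand side, otherwise shift. This greedy policy happens to work for the grammar E → E + ( E ) | int; real LR parsers consult a DFA/table to make this decision, as the next slides explain.

```python
def shift_reduce(tokens):
    stack, rest = [], tokens + ['$']
    trace = []
    while True:
        if stack and stack[-1] == 'int':
            stack[-1] = 'E'                        # reduce E -> int
            trace.append('reduce E -> int')
        elif stack[-5:] == ['E', '+', '(', 'E', ')']:
            stack[-5:] = ['E']                     # reduce E -> E + ( E )
            trace.append('reduce E -> E + ( E )')
        elif rest[0] != '$':
            stack.append(rest.pop(0))              # shift
            trace.append('shift')
        else:
            break                                  # no action left
    return stack == ['E'], trace
```

On `int + (int) + (int)` the emitted actions match the slide's trace; on a malformed input such as `int +` the stack never collapses to a lone E.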

[Slide figure: the 12-state DFA for E → E + ( E ) | int, with reduce actions E → int on $/+ and on )/+, E → E + ( E ) on $/+ and on )/+, and accept on $.]

The LR Parsing Table
Algorithm: look at the entry for the current state S and input terminal c.
- If Table[S, c] = s(S'), shift: push(S').
- If Table[S, c] = A → α, reduce: pop(|α|); S' = top(); push(A); push(Table[S', A]).
The terminal columns give the next action and next state; the non-terminal columns give the next state.

Representing the DFA. Example
The table for a fragment of our DFA (s = shift, r = reduce, g = goto; blank = error):

        int   +            (    )            $   E
   3                      s4
   4    s5                                       g6
   5          r E→int          r E→int
   6          s8               s7
   7          r E→E+(E)        r E→E+(E)

The LR Parsing Algorithm
After a shift or reduce action we re-run the DFA on the entire stack. This is wasteful, since most of the work is repeated. Instead, remember for each stack element the state it brings the DFA to: the LR parser maintains a stack of pairs sym1, state1 ... symn, staten, where statek is the final state of the DFA on sym1 ... symk.

LR Parsing Notes
LR parsing can be used to parse more grammars than LL; most programming language grammars are LR. It can be described as a simple table, and there are many tools for building the table. How is the table constructed?

Key Issue: How is the DFA Constructed?
The stack describes the context of the parse: what non-terminal we are looking for, what production rhs we are looking for, and what we have seen so far from that rhs. Each DFA state describes several such contexts; e.g., when we are looking for non-terminal E, we might be looking for either an int or an E + (E) rhs.

LR(0) Items
An LR(0) item of a grammar G is a production with a dot somewhere on the right-hand side. For example, A → XYZ yields the items:
A → . X Y Z
A → X . Y Z
A → X Y . Z
A → X Y Z .
The parser begins with S' → . S $ (state 0).

The Closure Operation
The operation of extending the context with items is called the closure operation.

Closure(I) =
repeat
  for each item [X → α . Y β] in I
    for each production Y → γ
      add [Y → . γ] to I
until I is unchanged
return I

Goto Operation (p. 222)
Goto(I, X) is defined to be the closure of the set of all items [A → α X . β] such that [A → α . X β] is in I, where I is a set of items and X is a grammar symbol.

Goto(I, X) =
set J to the empty set
for each item [A → α . X β] in I
  add [A → α X . β] to J
return Closure(J)

The Sets-of-Items Construction
Initialize T = { Closure({ [S' → . S $] }) };
Initialize E = {};
repeat
  for each state (a set of items) I in T, and each grammar symbol X
    let J be Goto(I, X)
    T = T ∪ { J }
    E = E ∪ { I →X J }
until E and T did not change
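The three pieces above — Closure, Goto, and the fixpoint over states — fit in a short sketch (my code), shown here for grammar G1 from the next slide (S' → S $; S → ( L ) | x; L → S | L , S), with items represented as (lhs, rhs, dot) triples:

```python
GRAMMAR = {
    "S'": [('S', '$')],
    'S': [('(', 'L', ')'), ('x',)],
    'L': [('S',), ('L', ',', 'S')],
}

def closure(items):
    items = set(items)
    while True:
        new = set(items)
        for lhs, rhs, dot in items:
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for alt in GRAMMAR[rhs[dot]]:
                    new.add((rhs[dot], alt, 0))   # add Y -> . gamma
        if new == items:
            return frozenset(items)
        items = new

def goto(items, X):
    return closure({(lhs, rhs, dot + 1)
                    for lhs, rhs, dot in items
                    if dot < len(rhs) and rhs[dot] == X})

def build_states():
    start = closure({("S'", ('S', '$'), 0)})
    states, work = {start}, [start]
    while work:
        I = work.pop()
        symbols = {rhs[dot] for _, rhs, dot in I if dot < len(rhs)} - {'$'}
        for X in symbols:                 # no Goto on $, as noted below
            J = goto(I, X)
            if J and J not in states:
                states.add(J)
                work.append(J)
    return states
```

The returned set of frozensets is exactly T from the construction; the edges E could be collected in the same loop.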

For the symbol $ we do not compute Goto(I, $).

Example (1) (p. 224)
Grammar G1:
S' → S $
S → ( L ) | x
L → S | L , S

Example (2)
Grammar G2:
S → E $
E → T + E | T
T → x
G2 is not LR(0). We can extend LR(0) in a simple way: SLR(1) (simple LR), where reduce actions are indicated by the Follow set.

SLR(1)
R = {};
for each state I in T
  for each item [A → α .] in I
    for each token a in Follow(A)
      R = R ∪ { (I, a, A → α) }

(I, a, A → α) indicates that in state I, on lookahead symbol a, the parser will reduce by rule A → α.

LR(0) Limitations
An LR(0) machine only works if states with reduce actions have a single reduce action. With more complex grammars, the construction gives states with shift/reduce or reduce/reduce conflicts; we need to use look-ahead to choose.

LR(1) Items
An LR(1) item is a pair [X → α . β, a], where X → αβ is a production and a is a terminal (the lookahead terminal). LR(1) means 1 lookahead terminal.

[X → α . β, a] describes a context of the parser: we are trying to find an X followed by an a, and we have α already on top of the stack; thus we need to see next a prefix derived from β a.

Convention
We add to our grammar a fresh new start symbol S and a production S → E $, where E is the old start symbol. For the grammar E → int and E → E + ( E ), the initial parsing context contains [S → . E $, ?]: we are trying to find an S as a string derived from E $, and the stack is empty.

LR(1) Items (Cont.)
In a context containing [E → E + . ( E ), +], if ( follows then we can perform a shift to a context containing [E → E + ( . E ), +]. In a context containing [E → E + ( E ) ., +], we can perform a reduction with E → E + ( E ), but only if a + follows.

LR(1) Items (Cont.)
Consider the item [E → E + ( . E ), +]. We expect a string derived from E ) +. We describe this by extending the context with two more items:
[E → . int, )]
[E → . E + ( E ), )]

The Closure Operation
The operation of extending the context with items is called the closure operation.

Closure(I) =
repeat
  for each item [A → α . X β, z] in I
    for each production X → γ
      for each w ∈ First(β z)
        add [X → . γ, w] to I
until I is unchanged
return I

The Goto Operation
Goto(I, X) =
J = {};
for each item [A → α . X β, z] in I
  add [A → α X . β, z] to J;
return Closure(J);

Example (1)
Grammar: E → E + ( E ) | int. Construct the start context: Closure({ [S → . E $, ?] }):
[S → . E $, ?]
[E → . E + ( E ), $]   [E → . int, $]
[E → . E + ( E ), +]   [E → . int, +]
We abbreviate this as:
[S → . E $, ?]
[E → . E + ( E ), $/+]
[E → . int, $/+]
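The LR(1) closure differs from the LR(0) one only in propagating lookaheads via First(β z). A sketch (my code) for this grammar, with items as (lhs, rhs, dot, lookahead) tuples; since no symbol here derives ε, First of a non-empty string is just First of its first symbol:

```python
GRAMMAR = {'S': [('E', '$')], 'E': [('E', '+', '(', 'E', ')'), ('int',)]}
TERMS = {'+', '(', ')', 'int', '$'}

def first(sym):
    if sym in TERMS:
        return {sym}
    out = set()
    for alt in GRAMMAR[sym]:
        if alt[0] != sym:          # skip the left-recursive alternative
            out |= first(alt[0])
    return out

def closure(items):
    items = set(items)
    while True:
        new = set(items)
        for lhs, rhs, dot, la in items:
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                rest = rhs[dot + 1:]
                lookaheads = first(rest[0]) if rest else {la}
                for alt in GRAMMAR[rhs[dot]]:
                    for w in lookaheads:
                        new.add((rhs[dot], alt, 0, w))
        if new == items:
            return items
        items = new
```

Closing { [S → . E $, ?] } reproduces the start context shown above: the E items appear with lookaheads $ (from the $ after E in S → E $) and + (from the + after E in E → E + ( E )).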

LR Parsing Tables. Notes
Parsing tables (i.e., the DFA) can be constructed automatically for a CFG.

But we still need to understand the construction to work with parser generators; e.g., they report errors in terms of sets of items.

What kind of errors can we expect?

Shift/Reduce Conflicts
If a DFA state contains both [X → α . a β, b] and [Y → γ ., a],

then on input a we could either shift into state [X → α a . β, b], or reduce with Y → γ.

This is called a shift-reduce conflict.

Shift/Reduce Conflicts
Typically due to ambiguities in the grammar. Classic example: the dangling else, S → if E then S | if E then S else S | OTHER. The DFA will have a state containing
[S → if E then S ., else]
[S → if E then S . else S, x]
If else follows, we can either shift or reduce. The default in bison is to shift, and the default behavior is as needed in this case.

More Shift/Reduce Conflicts
Consider the ambiguous grammar E → E + E | E * E | int. We will have states containing
[E → E * . E, +]          [E → E * E ., +]
[E → . E + E, +]    →E    [E → E . + E, +]
Again we have a shift/reduce conflict on input +. We need to reduce (* binds more tightly than +). Recall the solution: declare the precedence of * and +.

Reduce/Reduce Conflicts
If a DFA state contains both [X → α ., a] and [Y → β ., a], then on input a we don't know which production to reduce with.

This is called a reduce/reduce conflict.

Reduce/Reduce Conflicts
Usually due to gross ambiguity in the grammar. Example: a sequence of identifiers, S → ε | id | id S. There are two parse trees for the string id: S → id, and S → id S → id. How does this confuse the parser?

LR(1) but not SLR(1)
Let G have productions S → aAb | Ac and A → a | ε. The state reached on a contains
[S → a . A b]  [A → a .]  [A → . a]  [A → .]
Since Follow(A) = {b, c}, SLR(1) gets a reduce-reduce conflict here, while LR(1) lookaheads distinguish the two contexts.

LR(1) Parsing Tables are Big
But many states are similar: e.g., a state reducing with [E → int ., $/+] and one reducing with [E → int ., )/+]. Idea: merge the DFA states whose items differ only in the lookahead tokens; we say that such states have the same core. We obtain a single state with [E → int ., $/+/)].

The Core of a Set of LR Items
Definition: the core of a set of LR items is the set of first components, without the lookahead terminals.

Example: the core of { [X → α . β, b], [Y → γ . δ, d] } is { X → α . β, Y → γ . δ }.

LALR States
Consider for example the LR(1) states { [X → α ., a], [Y → β ., c] } and { [X → α ., b], [Y → β ., d] }. They have the same core and can be merged, and the merged state contains { [X → α ., a/b], [Y → β ., c/d] }. These are called LALR(1) states (LALR stands for LookAhead LR). There are typically 10 times fewer LALR(1) states than LR(1) states.

A LALR(1) DFA
Repeat until all states have distinct cores:
- choose two distinct states with the same core;
- merge the states by creating a new one with the union of all the items;
- point edges from predecessors to the new state;
- the new state points to all the previous successors.

Conversion LR(1) to LALR(1). Example (slide figure: the LR(1) DFA for E → E + ( E ) | int collapses to a smaller LALR(1) DFA by merging same-core states).

The LALR Parser Can Have Conflicts
Consider for example the LR(1) states { [X → α ., a], [Y → β ., b] } and { [X → α ., b], [Y → β ., a] }. The merged LALR(1) state { [X → α ., a/b], [Y → β ., a/b] } has a new reduce-reduce conflict.

In practice such cases are rare.

LALR vs. LR Parsing
LALR languages are not natural; they are an efficiency hack on LR languages.

Any reasonable programming language has an LALR(1) grammar.

LALR(1) has become a standard for programming languages and for parser generators.

A Hierarchy of Grammar Classes (slide figure)
