
LL and LR Parsing
Lecture 6

February 5, 2018

Context-free Grammars

A context-free grammar consists of

• A set of non-terminals N
    • Written in uppercase throughout these notes

• A set of terminals T, made up of tokens
    • Lowercase or punctuation throughout these notes

• A start symbol S (a non-terminal)
• A set of productions (rewrite rules)

Assuming E ∈ N:
    E → ε, or
    E → Y1 Y2 ... Yn where Yi ∈ N ∪ T

Compiler Construction 2/49

Context-free?

Production rules hint at expressiveness!

Regular              A → aB, C → ε
Context-free         A → α
Context-sensitive    αAβ → αγβ
Type-0               α → β

α, β, γ ∈ (N ∪ T)*

“What just happened? We must be missing some context...”


Parsing and Context-free Grammars

• Lexical Analysis
    • Regular Expressions specify a Regular Language containing strings of characters (lexemes) that correspond to a token

• Parsing
    • Context-free Grammars specify a Context-free Language containing strings of tokens that correspond to a grammatical rule (production)


Generativeness

• Regular expressions and context-free grammars are generative
    • You can generate every string in the language using the regex or grammar!


Generating Strings

• Consider regex: ab*a
    • You can generate aa, aba, abba, abbba, ...

• Consider context-free grammar:

E → ( E ) E
  | ε

    • You can generate ε, (), (()), (())(), ...

• Generating strings with a grammar can be thought of as creating a parse tree!


Language membership

• We care about whether an input string of tokens is syntactically correct (i.e., obeys our language’s grammar)

• So far, we have looked at theoretical implications of grammars

L(G) = { a1 ... an | S →* a1 ... an }
For an input string x, is x ∈ L(G)?

Parsing part 1: We need a yes/no answer!


Language membership

S → a B
  | b C
B → b b C
C → c c

What strings are in this language? (Hint: there are only two!)
If my input string is “dabc”, we ask: can the grammar generate this string? (No)

• N.B. from a theoretical perspective it doesn’t matter how we find out; that’s the job of the parsing algorithm!


Parsing Algorithms

• LL (top down)
    • Reads input from left to right and uses left-most derivations to construct a parse tree

• LR (bottom up)
    • Reads input from left to right and constructs a right-most derivation (in reverse) to build a parse tree

• Both algorithms are driven by the input grammar and the input to be parsed.


Parsing Algorithm Intuition

• You start with a sequence of tokens, t1 t2 t3 t4 t5
    • and also a grammar!

• Two general approaches to constructing the parse tree
    • top-down parsing is when you predict the grammatical rule used to produce the tokens seen so far
    • bottom-up parsing is when you consider tokens one at a time until you match a grammatical rule


Top Down Parsing

S → a B c
B → C x B
B → ε
C → d

Input string: “adxdxc”

[Figure: the parse tree over a d x d x c, built top-down one expansion per slide]

1. S → a B c
2. B → C x B
3. C → d
4. B → C x B
5. C → d
6. B → ε

Leftmost derivation: S ⇒ a B c ⇒ a C x B c ⇒ a d x B c ⇒ a d x C x B c ⇒ a d x d x B c ⇒ a d x d x c

Bottom-up Parsing

S → a B c
B → C x B
B → ε
C → d

Input string: “adxdxc”

[Figure: the same parse tree, built bottom-up one step per slide]

Tokens right now:

a
ad
aC          (d replaced using C → d)
aCx
aCxd
aCxC        (C → d)
aCxCx
aCxCxε
aCxCxB      (B → ε)
aCxB        (B → C x B)
aB          (B → C x B)
aBc
S           (S → a B c)


LL(k) parsing

An LL parser reads tokens from left to right and constructs a top-down leftmost derivation. LL(k) parsing predicts which production rule to use from k tokens of lookahead. LL(1) parsing is a special case using one token of lookahead. LL(1) parsing is fast and easy, but does not work if the grammar is ambiguous, left-recursive, or not left-factored.


General LL(1) Algorithm

• Process 1 token at a time
    • Consider a ‘current’ non-terminal symbol, starting with S

• While input is not empty
    • Given the next token (t) and the ‘current’ non-terminal N, choose a rule R of the form N → α
    • For each element X in rule R, from left to right
        • If X is a non-terminal, ‘expand’ X by recursing! Set ‘current’ to X and consider the same token t.
        • If X is a terminal, check whether t matches it. If it matches, consume t from the input and loop.

• Note the need for particular types of grammars! What if we have a rule S → S α?


Recursive Descent Parsing

• Recursive Descent Parsing can parse LL(k) grammars with backtracking
    • We can use RDP to parse LL(1) grammars by recursing through the rules of the grammar based upon the next available token

• Intuition: Construct mutually-recursive functions that consume tokens according to the grammar rules!

• TL;DR “Try all productions exhaustively, backtrack”


Recursive Descent Parsing

E → T + E | T
T → ( E ) | int | int * T

Input: int * int

1. Try E0 → T1 + E2

2. Try T1 → ( E3 )
    • Nope! Token ‘int’ does not match ‘(’ in T1 → ( E3 )

3. Try T1 → int. Match!
    • But the next token ‘*’ does not match ‘+’ from E0

4. Try T1 → int * T2
    • Matches ‘int’, but ‘+’ from E0 remains unmatched

5. Exhausted choices for T1, so we backtrack to E0


Recursive Descent Parsing (2)

E → T + E | T
T → ( E ) | int | int * T

Input: int * int

6. Try E0 → T1

7. Exhaustively try T1 → α productions
    • Succeed with T1 → int * T2 and T2 → int

E → T → int * T → int * int
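The "try all productions exhaustively" idea can be sketched in C. This is only one possible way to organize the search, not the method used in the course: instead of undoing choices explicitly, each function returns the set of all input positions at which its non-terminal could end, so every alternative is explored. The names `EndSet`, `parse_E`, `parse_T`, and `accepts` are made up for this sketch, and each token is assumed to be a single character, with `i` standing for `int`.

```c
#include <assert.h>
#include <string.h>

/* Bit k of an EndSet is 1 if some derivation can end just before token k. */
typedef unsigned long long EndSet;

static EndSet parse_T(const char *s, int n, int pos);

/* E -> T + E | T : try both alternatives for every way T can end. */
static EndSet parse_E(const char *s, int n, int pos) {
    EndSet ends = 0;
    EndSet t_ends = parse_T(s, n, pos);
    for (int k = pos; k <= n; k++) {
        if (!((t_ends >> k) & 1ULL)) continue;
        if (k < n && s[k] == '+')            /* E -> T + E */
            ends |= parse_E(s, n, k + 1);
        ends |= 1ULL << k;                   /* E -> T */
    }
    return ends;
}

/* T -> ( E ) | int | int * T : all three alternatives are tried. */
static EndSet parse_T(const char *s, int n, int pos) {
    EndSet ends = 0;
    if (pos < n && s[pos] == '(') {          /* T -> ( E ) */
        EndSet e_ends = parse_E(s, n, pos + 1);
        for (int k = pos + 1; k < n; k++)
            if (((e_ends >> k) & 1ULL) && s[k] == ')')
                ends |= 1ULL << (k + 1);
    }
    if (pos < n && s[pos] == 'i') {
        ends |= 1ULL << (pos + 1);           /* T -> int */
        if (pos + 1 < n && s[pos + 1] == '*')
            ends |= parse_T(s, n, pos + 2);  /* T -> int * T */
    }
    return ends;
}

/* Returns 1 if the whole token string can be derived from E, else 0. */
int accepts(const char *s) {
    int n = (int)strlen(s);
    assert(n < 63);                          /* positions must fit in 64 bits */
    return (int)((parse_E(s, n, 0) >> n) & 1ULL);
}
```

With this encoding, `accepts("i*i")` corresponds to the input `int * int` above: `T → int` alone ends after the first token, which is not the end of the input, while `T → int * T` reaches the end, so the string is accepted.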



S → a B
  | b C
B → b b C
C → c c

    void S() {
        if (next_char() == 'a')      { consume('a'); B(); }
        else if (next_char() == 'b') { consume('b'); C(); }
        else { error(); }
    }

    void B() {
        if (next_char() == 'b') { consume('b'); consume('b'); C(); }
        else { error(); }
    }

    void C() {
        if (next_char() == 'c') { consume('c'); consume('c'); }
        else { error(); }
    }
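The slide leaves `next_char`, `consume`, and `error` undefined. A minimal self-contained sketch follows, assuming the input is a C string read through a global cursor and that an error simply sets a flag; those helper definitions and the `in_language` wrapper are assumptions made for illustration, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

static const char *input;   /* remaining input (assumed global cursor) */
static int failed;          /* set to 1 by error() */

static char next_char(void) { return *input; }
static void error(void)     { failed = 1; }

/* Advance past c if it is the next character; otherwise record an error. */
static void consume(char c) {
    if (!failed && *input == c) input++;
    else error();
}

static void C(void) {       /* C -> c c */
    if (next_char() == 'c') { consume('c'); consume('c'); }
    else { error(); }
}

static void B(void) {       /* B -> b b C */
    if (next_char() == 'b') { consume('b'); consume('b'); C(); }
    else { error(); }
}

static void S(void) {       /* S -> a B | b C */
    if (next_char() == 'a')      { consume('a'); B(); }
    else if (next_char() == 'b') { consume('b'); C(); }
    else { error(); }
}

/* Returns 1 if text is in the language: S succeeded and consumed everything. */
int in_language(const char *text) {
    assert(text != NULL);
    input = text;
    failed = 0;
    S();
    return !failed && *input == '\0';
}
```

Under these assumptions, `in_language("abbcc")` and `in_language("bcc")` return 1, and `in_language("dabc")` returns 0, matching the earlier language-membership slide.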


T → line_number \n B
  | ε

B → if \n T
  | else \n T
  | class \n C
  | string \n C

C → text \n T

That’s right, subsequent assignments PA3 through PA6 provide inputs that can be parsed through recursive descent!



Observations

• At any given moment, the fringe of the parse tree is: t1 t2 ... tk A ...
    • Try all productions for A: if A → B C is a production, the new fringe is t1 t2 ... tk B C ...
    • Backtrack when the fringe does not match the input string


What Could Go Wrong?


Recursive Descent Failure

S → S a

    void S() {
        S();   /* left recursion: S() calls itself before consuming any input, so it never returns */
        if (next_char() == 'a') { consume('a'); }
    }


Eliminating Left Recursion

• Left-recursive grammars have some non-terminal S that can derive a string beginning with itself:

S →+ S α

Recursive Descent (and LL(k)) parsers cannot parse left-recursive grammars!


Eliminating Left Recursion

Consider the left-recursive grammar:

S → S α | β

S generates all strings starting with β followed by any number of α (zero or more)

Rewrite using right-recursion

S → β T
T → α T | ε


Concrete Left Recursion Elimination

S → 1 | S 0

Can be rewritten as

S → 1 T
T → 0 T | ε
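The rewritten grammar maps directly onto two small functions, which the left-recursive version could not do without recursing forever. The sketch below assumes the input is a C string of `0` and `1` characters; the function names are chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* T -> 0 T | epsilon : consumes zero or more '0' characters. */
static const char *parse_T(const char *p) {
    if (*p == '0') return parse_T(p + 1);   /* T -> 0 T */
    return p;                               /* T -> epsilon */
}

/* S -> 1 T : returns the position after the match, or NULL on failure. */
static const char *parse_S(const char *p) {
    if (*p != '1') return NULL;
    return parse_T(p + 1);
}

/* Returns 1 if the whole text is a '1' followed by zero or more '0's. */
int matches(const char *text) {
    assert(text != NULL);
    const char *end = parse_S(text);
    return end != NULL && *end == '\0';
}
```

Each call consumes at least one character before recursing, so the recursion always terminates; `matches("100")` returns 1 and `matches("101")` returns 0.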


More Left Recursion Elimination

In general

S → S α1 | ... | S αn | β1 | ... | βm

All strings derived from S start with one of β1, ..., βm and continue with zero or more instances of α1, ..., αn.

Rewrite as

S → β1 T | ... | βm T
T → α1 T | ... | αn T | ε


Recursive Descent Summary

• Simple and general parsing strategy
    • Left-recursion must be eliminated first!
    • There’s an algorithm for that

• Requires significant backtracking
    • Backtracking is avoidable for some grammars!


LL(1) Predictive Parsing

• LL(1) parsing assumes that for each non-terminal and token there is only one production that could lead to success
• This sounds deterministic! We can use a table-based approach like with lexing
    • One dimension for current non-terminal to expand
    • One dimension for next token seen on the input
    • Each table entry contains one production


Predictive Parsing and Left Factoring

S → a B
  | b C

vs.

E → T + E | T
T → int | int * T | ( E )

• First grammar: Easy! One token → one rule

• Second grammar: Hard! Two T productions start with ‘int’

• We must left-factor before using LL(1) predictive parsing


Left Factoring

E → T + E | T
T → int | int * T | ( E )

Factor out the common prefixes of production rules

E → T X
X → + E | ε
T → ( E ) | int Y
Y → * T | ε
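After left factoring, one token of lookahead is enough to pick a production, so the recursive functions no longer need to backtrack. A minimal sketch, assuming single-character tokens with `i` standing for `int`; the helper `match`, the globals `cur` and `ok`, and the wrapper `parse` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

static const char *cur;   /* next unread token */
static int ok;            /* cleared on the first syntax error */

/* Consume token t if it is next; otherwise record an error. */
static void match(char t) { if (ok && *cur == t) cur++; else ok = 0; }

static void E(void);
static void T(void);

static void X(void) {               /* X -> + E | epsilon */
    if (ok && *cur == '+') { match('+'); E(); }
}

static void Y(void) {               /* Y -> * T | epsilon */
    if (ok && *cur == '*') { match('*'); T(); }
}

static void T(void) {               /* T -> ( E ) | int Y */
    if (!ok) return;
    if (*cur == '(')      { match('('); E(); match(')'); }
    else if (*cur == 'i') { match('i'); Y(); }
    else ok = 0;
}

static void E(void) { T(); X(); }   /* E -> T X */

/* Returns 1 if the token string is derivable from E, else 0. */
int parse(const char *tokens) {
    assert(tokens != NULL);
    cur = tokens;
    ok = 1;
    E();
    return ok && *cur == '\0';
}
```

In this sketch X and Y fall back to ε whenever the next token is not `+` or `*`; a wrong token is then caught later, either by a failing `match` or by the end-of-input check in `parse`.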


Parse Tables!

• Parse tables are a fast implementation of LL(1) parsers
    • N.B. LL(1) grammars represent a subset of context-free grammars
    • Restrict ambiguities in resolving rules to make a table possible!

• Table T is 2-dimensional: T[A][t] = A → Y1 Y2 ... Ym means “when you are expanding non-terminal A and see token t, start considering A → Y1 Y2 ... Ym”


Parse Tables!


LL(1) Parsing Table ($ means end of input)

       int      *      +      (        )      $
  T    int Y                  ( E )
  E    T X                    T X
  X                    + E             ε      ε
  Y             * T    ε               ε      ε


Parse Tables!

• T[E][int] = T X
    • Interpretation: “If I’m considering nonterminal E and I see ‘int’, follow production E → T X”




Parse Tables!

• T[Y][+] = ε
    • Interpretation: “If I’m considering nonterminal Y and I see a ‘+’, get rid of the Y”




Parse Tables!

• Blank entries indicate errors! Consider T[E][*]
    • Interpretation: “There is no way to derive a string starting with * from non-terminal E.”




Using Parse Tables

• Much like recursive descent

• For each non-terminal S
    • Look at next token a
    • Choose production shown in T[S][a]

• We use a stack to track pending non-terminals
• Reject when we encounter an error state (a blank)
• Accept when we encounter end-of-input


LL(1) Predictive Parsing with Table

    push($);   // we succeed if we get to the end
    push(S);   // start symbol
    do {
        X = pop();
        if (X == $) {
            // stack exhausted: accept only if the input is exhausted too
            if (next_token() == $) { accept(); } else { error(); }
        } else if (is_terminal(X)) {
            if (X == next_token()) {
                consume(next_token());
            } else { error(); }
        } else {
            // X is a non-terminal
            if (T[X][next_token()] == "X → Y1 Y2 ... Ym") {
                // push the right-hand side in reverse so Y1 ends up on top
                push(Ym); ... push(Y2); push(Y1);
            } else { error(); }
        }
    } while (X != $);
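The loop above can be made concrete for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε. The sketch below is one possible encoding, not the course's implementation: grammar symbols are single characters (`i` stands for `int`), the table is written as a `switch` instead of a 2-D array, and the names `table`, `is_nonterminal`, and `ll1_parse` are made up for the example.

```c
#include <assert.h>
#include <string.h>

/* Right-hand side stored in T[nt][tok]: "" is epsilon, NULL is a blank (error) entry. */
static const char *table(char nt, char tok) {
    switch (nt) {
    case 'E': return (tok == 'i' || tok == '(') ? "TX" : NULL;
    case 'T': return tok == 'i' ? "iY" : tok == '(' ? "(E)" : NULL;
    case 'X': return tok == '+' ? "+E"
                   : (tok == ')' || tok == '$') ? "" : NULL;
    case 'Y': return tok == '*' ? "*T"
                   : (tok == '+' || tok == ')' || tok == '$') ? "" : NULL;
    default:  return NULL;
    }
}

static int is_nonterminal(char c) {
    return c != '\0' && strchr("ETXY", c) != NULL;
}

/* tokens must end with '$', e.g. "i*i$". Returns 1 on accept, 0 on error. */
int ll1_parse(const char *tokens) {
    char stack[256];
    size_t top = 0;
    assert(tokens != NULL);
    stack[top++] = '$';                      /* we succeed if we get to the end */
    stack[top++] = 'E';                      /* start symbol */
    while (top > 0) {
        char x = stack[--top];
        char t = *tokens;
        if (x == '$') return t == '$';       /* accept only at end of input */
        if (!is_nonterminal(x)) {
            if (x != t) return 0;            /* terminal mismatch */
            tokens++;                        /* consume */
        } else {
            const char *rhs = table(x, t);
            if (rhs == NULL) return 0;       /* blank table entry */
            size_t len = strlen(rhs);
            if (top + len > sizeof stack) return 0;   /* stack would overflow */
            for (size_t k = len; k > 0; k--)
                stack[top++] = rhs[k - 1];   /* push Ym ... Y1 */
        }
    }
    return 0;
}
```

Calling `ll1_parse("i*i$")` walks through the stack contents E $, T X $, int Y X $, and so on, and returns 1; `ll1_parse("*i$")` hits the blank entry T[E][*] and returns 0.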


Stack            Input            Action
E $              int * int $      T X
T X $            int * int $      int Y
int Y X $        int * int $      consume
Y X $            * int $          * T
* T X $          * int $          consume
T X $            int $            int Y
int Y X $        int $            consume
Y X $            $                ε
X $              $                ε
$                $                ACCEPT



LL(1) Languages

• LL(1) languages can be LL(1) parsed
    • Formally, a language Q is LL(1) if there exists an LL(1) table such that the LL(1) parsing algorithm using that table accepts exactly the strings in Q.

• No table entry can be multiply defined
    • This restricts the grammar!

• Once we construct the table
    1. The parsing algorithm is simple and fast
    2. No backtracking is necessary

• Wouldn’t it be nice to generate a parsing table from a CFG?


FIRST and FOLLOW sets

• FIRST(α) is the set of all terminal symbols that can begin some string derived from α (plus ε if α can derive ε)

α → ... → a β

FIRST(α) = { a ∈ T | α →* a β } ∪ { ε | α →* ε }

Example:

S → a | b S c

FIRST(S) = {a, b}


Example FIRST sets

S → a S e | S T T
T → R S e | Q
R → x S x | ε
Q → S T | ε

FIRST(S) = ?
FIRST(T) = ?
FIRST(R) = ?
FIRST(Q) = ?


FOLLOW sets

• FOLLOW(A) is the set of terminals (including $) that can immediately follow a non-terminal A

FOLLOW(A) = { a ∈ T | S →* ... A a ... } ∪ { $ | S →* ... A }

• Compute FIRST sets for all non-terminals

• Add $ to FOLLOW(S) (the start symbol always ends with end-of-input)

• For all productions Y → ... X A1 ... An
    • For i = 1, ..., n: add FIRST(Ai) − {ε} to FOLLOW(X). Stop if ε ∉ FIRST(Ai).
    • If every Ai can derive ε (or nothing follows X), add FOLLOW(Y) to FOLLOW(X)

• Repeat until no FOLLOW set changes
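The FIRST and FOLLOW rules can be run as a fixed-point iteration. The sketch below does this for the left-factored grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε. Everything about the encoding is an assumption made for illustration: symbols are single characters (`i` for `int`), sets are bitmasks, and only non-terminals get FOLLOW sets.

```c
#include <assert.h>
#include <string.h>

#define EPS 0x40u                       /* epsilon bit; bits 0..5 are terminals */
static const char TERMS[] = "i*+()$";   /* 'i' stands for int */
static const char NTS[]   = "EXTY";     /* E is the start symbol */

struct Prod { char lhs; const char *rhs; };   /* rhs "" means epsilon */
static const struct Prod G[] = {
    {'E', "TX"}, {'X', "+E"}, {'X', ""},
    {'T', "(E)"}, {'T', "iY"}, {'Y', "*T"}, {'Y', ""},
};
enum { NPROD = sizeof G / sizeof G[0] };

static unsigned first[4], follow[4];    /* indexed like NTS */

static int nt_index(char c) {
    const char *p = c ? strchr(NTS, c) : NULL;
    return p ? (int)(p - NTS) : -1;
}

static unsigned term_bit(char c) {
    const char *p = strchr(TERMS, c);
    assert(c != '\0' && p != NULL);     /* c must be a known terminal */
    return 1u << (p - TERMS);
}

/* FIRST of a symbol string, using the current FIRST sets of the non-terminals. */
static unsigned first_of(const char *s) {
    unsigned out = 0;
    for (; *s; s++) {
        int n = nt_index(*s);
        unsigned f = n < 0 ? term_bit(*s) : first[n];
        out |= f & ~EPS;
        if (!(f & EPS)) return out;     /* this symbol cannot vanish: stop */
    }
    return out | EPS;                   /* every symbol (or none) derives epsilon */
}

void compute_sets(void) {
    int changed = 1;
    memset(first, 0, sizeof first);
    memset(follow, 0, sizeof follow);
    while (changed) {                   /* FIRST: iterate until nothing changes */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int a = nt_index(G[p].lhs);
            unsigned f = first[a] | first_of(G[p].rhs);
            if (f != first[a]) { first[a] = f; changed = 1; }
        }
    }
    follow[nt_index('E')] = term_bit('$');   /* $ follows the start symbol */
    changed = 1;
    while (changed) {                   /* FOLLOW: iterate until nothing changes */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            for (const char *s = G[p].rhs; *s; s++) {
                int x = nt_index(*s);
                if (x < 0) continue;    /* only non-terminals are tracked */
                unsigned rest = first_of(s + 1);
                unsigned f = follow[x] | (rest & ~EPS);
                if (rest & EPS) f |= follow[nt_index(G[p].lhs)];
                if (f != follow[x]) { follow[x] = f; changed = 1; }
            }
        }
    }
}

unsigned first_set(char nt)  { return first[nt_index(nt)]; }
unsigned follow_set(char nt) { return follow[nt_index(nt)]; }

/* Returns 1 if set (ignoring epsilon) holds exactly the terminals in members. */
int set_equals(unsigned set, const char *members) {
    unsigned want = 0;
    for (; *members; members++) want |= term_bit(*members);
    return (set & ~EPS) == want;
}
```

After `compute_sets()`, `set_equals(follow_set('X'), "$)")` and `set_equals(follow_set('Y'), "+)$")` both hold, which agrees with the FOLLOW(X) and FOLLOW(Y) values given for this grammar in these notes.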

Example FOLLOW Set


FOLLOW(“+”) = { int, ( }
FOLLOW(“(”) = { int, ( }
FOLLOW(X) = { $, ) }
FOLLOW(Y) = { +, ), $ }


Back to Parsing Tables

• Recall: We want to build an LL(1) Parsing Table

For each production A → α in G do:
    • For each terminal b ∈ FIRST(α) do
        • T[A][b] = α
    • If α →* ε, for each b ∈ FOLLOW(A) do
        • T[A][b] = α
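The construction loop can be sketched in C for the grammar E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε. To keep the example short, the FIRST and FOLLOW sets are written in by hand rather than computed; FOLLOW(X) and FOLLOW(Y) are the values given in these notes, while FOLLOW(E) and FOLLOW(T) were worked out by hand for this sketch and are never consulted because no E or T production derives ε. Symbols are single characters (`i` for `int`), and all names are hypothetical.

```c
#include <assert.h>
#include <string.h>

static const char TERMS[] = "i*+()$";   /* column order; 'i' stands for int */
static const char NTS[]   = "ETXY";     /* row order */

struct Prod {
    char lhs;
    const char *rhs;      /* "" is epsilon */
    const char *first;    /* terminals in FIRST(rhs), epsilon excluded */
    int nullable;         /* 1 if rhs ->* epsilon */
};

static const struct Prod G[] = {
    {'E', "TX",  "i(", 0},
    {'X', "+E",  "+",  0},
    {'X', "",    "",   1},
    {'T', "(E)", "(",  0},
    {'T', "iY",  "i",  0},
    {'Y', "*T",  "*",  0},
    {'Y', "",    "",   1},
};
enum { NPROD = sizeof G / sizeof G[0] };

/* FOLLOW sets, indexed like NTS (hand-written assumption, see lead-in). */
static const char *FOLLOW[] = { "$)", "+$)", "$)", "+)$" };

static int table[4][6];   /* production index, or -1 for a blank entry */

static int index_in(const char *set, char c) {
    const char *p = strchr(set, c);
    assert(c != '\0' && p != NULL);
    return (int)(p - set);
}

/* Put production p into row `row` under every column in cols; count clashes. */
static int fill(int row, const char *cols, int p) {
    int conflicts = 0;
    for (; *cols; cols++) {
        int col = index_in(TERMS, *cols);
        if (table[row][col] >= 0 && table[row][col] != p) conflicts++;
        table[row][col] = p;
    }
    return conflicts;
}

/* Builds the table; returns the number of multiply defined entries. */
int build_table(void) {
    int conflicts = 0;
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 6; c++)
            table[r][c] = -1;
    for (int p = 0; p < NPROD; p++) {
        int row = index_in(NTS, G[p].lhs);
        conflicts += fill(row, G[p].first, p);        /* b in FIRST(alpha) */
        if (G[p].nullable)
            conflicts += fill(row, FOLLOW[row], p);   /* b in FOLLOW(A) */
    }
    return conflicts;
}

/* Right-hand side stored at T[nt][tok], or NULL if the entry is blank. */
const char *entry(char nt, char tok) {
    int p = table[index_in(NTS, nt)][index_in(TERMS, tok)];
    return p < 0 ? NULL : G[p].rhs;
}
```

For this grammar `build_table()` reports no multiply defined entries, `entry('Y', '+')` is the empty string (ε), and `entry('E', '*')` is NULL, i.e. a blank error entry.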


Parsing Table

E → T X
X → + E | ε
T → ( E ) | int Y
Y → * T | ε

Where do we put Y → * T?

• Well, FIRST(* T) = {*}, thus column * of row Y gets * T

       int    *      +    (    )    $
  Y           * T



Parsing Table

E → T X
X → + E | ε
T → ( E ) | int Y
Y → * T | ε

Where do we put Y → ε?

• Well, FOLLOW(Y) = {$, +, )}, thus columns $, +, and ) in row Y get Y → ε

       int    *      +    (    )    $
  Y           * T    ε         ε    ε



Notes on LL(1) Parsing Tables

• If any entry is multiply defined then G is not LL(1). This happens, for example, when
    • G is ambiguous
    • G is left-recursive
    • G is not left-factored


Ambiguity in parse tables

E → E + T        T → F
E → T            F → id
T → T * F        F → ( E )

For the E productions, we need FIRST(T) = {(, id} and FIRST(E) = {(, id}

But now, which rule (E → E + T or E → T) gets put in T[E][(] and T[E][id]?

       +    *    (    )    id    $
  E              ?         ?
  T
  F


Simple Parsing Strategies

• Recursive Descent Parsing
    • Backtracking is annoying, BUT super useful for PA3-6

• Predictive Parsing a.k.a. LL(k)
    • Predict production from k tokens of lookahead
    • Build LL(1) table
    • Parsing is now fast and easy!

• Next up, LR Parsing, a more powerful strategy for parsing non-LL(1) grammars