LL and LR Parsing
Lecture 6
February 5, 2018
Context-free Grammars
A context-free grammar consists of
- A set of non-terminals N
  - Written in uppercase throughout these notes
- A set of terminals T comprised of tokens
  - Lowercase or punctuation throughout these notes
- A start symbol S (a non-terminal)
- A set of productions (rewrite rules)
Assuming E ∈ N:
  E → ε, or
  E → Y1 Y2 ... Yn where Yi ∈ N ∪ T
Compiler Construction 2/49

Context-free?
Production rules hint at expressiveness!
  Regular:            A → aB, C → ε
  Context-free:       A → α
  Context-sensitive:  αAβ → αγβ
  Type-0:             α → β
α, β, γ ∈ (N ∪ T)*
“What just happened? We must be missing some context...”

Parsing and Context-free Grammars
- Lexical Analysis
  - Regular Expressions specify a Regular Language containing strings of characters (lexemes) that correspond to a token
- Parsing
  - Context-free Grammars specify a Context-free Language containing strings of tokens that correspond to a grammatical rule (production)

Generativeness
- Regular expressions and context-free grammars are generative
  - You can generate every string in the language using the regex or grammar!

Generating Strings
- Consider regex: ab*a
  - You can generate aa, aba, abba, abbba, ...
- Consider context-free grammar:
    E → ( E ) E | ε
  - You can generate ε, (), (()), (())(), ...
- Generating strings with a grammar can be thought of as creating a parse tree!

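One way to see the grammar E → ( E ) E | ε in action is to turn it around and ask whether a given string could have been generated by it. The following is a minimal sketch, not code from the lecture: the function names parse_E and in_language are made up for illustration, and the input is assumed to be a plain C string of parentheses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: tries to match E starting at s[*pos].
 * E -> ( E ) E | epsilon.  If the next character is not '(',
 * the epsilon alternative is taken and nothing is consumed. */
static bool parse_E(const char *s, size_t *pos) {
    if (s[*pos] != '(') return true;        /* E -> epsilon */
    (*pos)++;                               /* match '(' */
    if (!parse_E(s, pos)) return false;     /* inner E */
    if (s[*pos] != ')') return false;       /* match ')' */
    (*pos)++;
    return parse_E(s, pos);                 /* trailing E */
}

/* Returns true when the whole string s can be generated from E. */
bool in_language(const char *s) {
    size_t pos = 0;
    return parse_E(s, &pos) && s[pos] == '\0';
}
```

Under these assumptions, in_language("(())()") holds while in_language("(()") does not; the empty string is accepted because E can derive ε.
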
Language membership
- We care about whether an input string of tokens is syntactically correct (i.e., obeys our language’s grammar)
- So far, we have looked at theoretical implications of grammars
L(G) = { a1...an | S →* a1...an }
For an input string x, is x ∈ L(G)?
Parsing part 1: We need a yes/no answer!

Language membership
  S → a B | b C
  B → b b C
  C → c c
What strings are in this language? (Hint: there are only two!)
If my input string is “dabc”, we ask: can the grammar generate this string? (No)
- N.B. From a theoretical perspective it doesn’t matter how we find the answer; that’s the job of the parsing algorithm!

Parsing Algorithms
- LL (top down)
  - Reads input from left to right and uses left-most derivations to construct a parse tree
- LR (bottom up)
  - Reads input from left to right and constructs a right-most derivation in reverse to build a parse tree
- Both algorithms are driven by the input grammar and the input to be parsed.

Parsing Algorithm Intuition
- You start with a sequence of tokens, t1 t2 t3 t4 t5
  - and also a grammar!
- Two general approaches to constructing the parse tree
  - top-down parsing is when you predict the grammatical rule used to produce the tokens seen so far
  - bottom-up parsing is when you consider tokens one at a time until you match a grammatical rule

Top Down Parsing
  S → a B c
  B → C x B
  B → ε
  C → d | a B c
Input string: “adxdxc”
The parse tree grows from the root S downward, always expanding the left-most non-terminal:
  S → a B c              (S → a B c)
    → a C x B c          (B → C x B)
    → a d x B c          (C → d)
    → a d x C x B c      (B → C x B)
    → a d x d x B c      (C → d)
    → a d x d x c        (B → ε)

Bottom-up Parsing
  S → a B c
  B → C x B
  B → ε
  C → d | a B c
Input string: “adxdxc”
Tokens are read one at a time; whenever the right end of the sequence matches the right-hand side of a rule, it is replaced by that rule’s non-terminal:
  Tokens right now: a
  Tokens right now: ad
  Tokens right now: aC        (C → d)
  Tokens right now: aCx
  Tokens right now: aCxd
  Tokens right now: aCxC      (C → d)
  Tokens right now: aCxCx
  Tokens right now: aCxCxε    (empty string before c)
  Tokens right now: aCxCxB    (B → ε)
  Tokens right now: aCxB      (B → C x B)
  Tokens right now: aB        (B → C x B)
  Tokens right now: aBc
  Tokens right now: S         (S → a B c)

LL(k) parsing
An LL parser reads tokens from left to right and constructs a top-down leftmost derivation. LL(k) parsing predicts which production rule to use from k tokens of lookahead. LL(1) parsing is a special case using one token of lookahead. LL(1) parsing is fast and easy, but does not work if the grammar is ambiguous, left-recursive, or non-left-factored.

General LL(1) Algorithm
- Process 1 token at a time
  - Consider a ‘current’ non-terminal symbol, start with S
- While input is not empty
  - Given the next token (t) and the ‘current’ non-terminal N, choose a rule R of the form N → α
  - For each element X in rule R from left to right
    - If X is a non-terminal, ‘expand’ X by recursing! Set ‘current’ to X and consider the same token t.
    - If X is a terminal, check whether t matches. If it matches, consume t from the input and loop; otherwise report an error.
- Note the need for particular types of grammars! What if we have a rule S → Sα?

Recursive Descent Parsing
- Recursive Descent Parsing can parse LL(k) grammars with backtracking
  - We can use RDP to parse LL(1) grammars by recursing through the rules of the grammar based upon the next available token
- Intuition: Construct mutually-recursive functions that consume tokens according to the grammar rules!
- TL;DR “Try all productions exhaustively, backtrack”

Recursive Descent Parsing
  E → T + E | T
  T → ( E ) | int | int * T
Input: int * int
1. Try E0 → T1 + E2
2. Try T1 → ( E3 )
   - Nope! Token ‘int’ does not match ‘(’ in T1 → ( E3 )
3. Try T1 → int. Match!
   - But the next token ‘*’ does not match ‘+’ from E0
4. Try T1 → int * T2
   - Matches ‘int * int’ (with T2 → int), but ‘+’ from E0 remains unmatched
5. Exhausted choices for T1, so we backtrack to E0

Recursive Descent Parsing (2)
  E → T + E | T
  T → ( E ) | int | int * T
Input: int * int
6. Try E0 → T1
7. Exhaustively try T1 → α productions
   - Succeed with T1 → int * T2 and T2 → int
E → T → int * T → int * int

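The walkthrough above can be sketched in C. This is only an illustration under stated assumptions, not the course's reference code: tokens are single characters with 'i' standing for int, and the names PosSet, match, parse_T, parse_E and accepts are invented here. Instead of undoing choices explicitly, each function returns the set of all positions at which its non-terminal could end, which has the same effect as trying every production exhaustively.

```c
#include <stdbool.h>
#include <string.h>

/* Bit k set means "this non-terminal can end at index k" (assumed encoding). */
typedef unsigned int PosSet;

static PosSet parse_E(const char *s, size_t n, size_t pos);

/* If the token at pos equals t, return {pos + 1}; otherwise the empty set. */
static PosSet match(const char *s, size_t n, size_t pos, char t) {
    return (pos < n && s[pos] == t) ? (1u << (pos + 1)) : 0u;
}

/* T -> ( E ) | int | int * T : collect the end positions of every alternative */
static PosSet parse_T(const char *s, size_t n, size_t pos) {
    PosSet out = 0;
    if (pos < n && s[pos] == '(') {                 /* T -> ( E ) */
        PosSet inner = parse_E(s, n, pos + 1);
        for (size_t k = 0; k <= n; k++)
            if ((inner >> k) & 1u) out |= match(s, n, k, ')');
    }
    out |= match(s, n, pos, 'i');                   /* T -> int */
    if (pos + 1 < n && s[pos] == 'i' && s[pos + 1] == '*')
        out |= parse_T(s, n, pos + 2);              /* T -> int * T */
    return out;
}

/* E -> T + E | T : both alternatives start with T */
static PosSet parse_E(const char *s, size_t n, size_t pos) {
    PosSet ts = parse_T(s, n, pos);
    PosSet out = ts;                                /* E -> T */
    for (size_t k = 0; k <= n; k++)                 /* E -> T + E */
        if (((ts >> k) & 1u) && k < n && s[k] == '+')
            out |= parse_E(s, n, k + 1);
    return out;
}

/* Accept when E can span the entire input. */
bool accepts(const char *s) {
    size_t n = strlen(s);
    if (n >= 31) return false;   /* sketch limit: positions must fit in PosSet */
    return ((parse_E(s, n, 0) >> n) & 1u) != 0;
}
```

With this encoding, accepts("i*i") succeeds for the same reason as steps 6 and 7: T can end after the first int or after the whole input, and only the second choice leaves nothing unmatched.
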
Recursive Descent Parsing
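  S → a B | b C
  B → b b C
  C → c c

#include <stdio.h>
#include <stdlib.h>

/* Assumed helpers (not defined on the slide): a cursor over the input. */
static const char *input;                 /* remaining input, set by the caller */
char next_char(void) { return *input; }   /* peek at the next token */
void error(void) { fprintf(stderr, "syntax error\n"); exit(1); }
void consume(char c) { if (*input == c) input++; else error(); }

void B(void);
void C(void);

void S(void) {                            /* S → a B | b C */
  if (next_char() == 'a')      { consume('a'); B(); }
  else if (next_char() == 'b') { consume('b'); C(); }
  else { error(); }
}

void B(void) {                            /* B → b b C */
  if (next_char() == 'b') { consume('b'); consume('b'); C(); }
  else { error(); }
}

void C(void) {                            /* C → c c */
  if (next_char() == 'c') { consume('c'); consume('c'); }
  else { error(); }
}
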
Recursive Descent Parsing
  T → line_number \n B | ε
  B → if \n T | else \n T | class \n C | string \n C
  C → text \n T
That’s right, subsequent assignments PA3 through PA6 provide inputs that can be parsed through recursive descent!

Recursive Descent Parsing
Observations
- At any given moment, the fringe of the parse tree is: t1 t2 ... tk A ...
  - Try all productions for A: if A → B C is a production, the new fringe is t1 t2 ... tk B C ...
  - Backtrack when the fringe does not match the input string

What Could Go Wrong?
Recursive Descent Failure
  S → S a

void S() {
  S();   /* recurses forever before consuming any input */
  if (next_char() == 'a') { consume('a'); }
}

Eliminating Left Recursion
- Left-recursive grammars have some non-terminal S with
    S →+ S α
  (S derives, in one or more steps, a string that begins with S itself)
Recursive Descent (and LL(k)) parsers cannot parse left-recursive grammars!

Eliminating Left Recursion
Consider the left-recursive grammar:
S → S α | β
S generates all strings starting with β followed by zero or more α
Rewrite using right-recursion
  S → β T
  T → α T | ε

Concrete Left Recursion Elimination
S → 1 | S 0
Can be rewritten as
  S → 1 T
  T → 0 T | ε

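To illustrate why the rewritten grammar suits recursive descent, here is a small sketch of a recognizer for S → 1 T, T → 0 T | ε. The cursor variable p and the function name matches_one_zeros are assumptions made for this example. Each function consumes at least one character before recursing, so the infinite recursion caused by S → S 0 cannot occur.

```c
#include <stdbool.h>

static const char *p;   /* cursor into the input (hypothetical global) */

/* T -> 0 T | epsilon */
static void T(void) {
    if (*p == '0') { p++; T(); }   /* T -> 0 T */
    /* otherwise T -> epsilon: consume nothing */
}

/* S -> 1 T */
static bool S(void) {
    if (*p != '1') return false;
    p++;
    T();
    return true;
}

/* True when the whole input is a '1' followed by zero or more '0's. */
bool matches_one_zeros(const char *input) {
    p = input;
    return S() && *p == '\0';
}
```

For instance, matches_one_zeros("1000") holds, while "01" and "101" are rejected.
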
More Left Recursion Elimination
In general
  S → S α1 | ... | S αn | β1 | ... | βm
All strings derived from S start with one of β1, ..., βm and continue with zero or more instances of α1, ..., αn.
Rewrite as
  S → β1 T | ... | βm T
  T → α1 T | ... | αn T | ε

Recursive Descent Summary
- Simple and general parsing strategy
  - Left-recursion must be eliminated first!
  - There’s an algorithm for that
- Requires significant backtracking
  - Backtracking is avoidable for some grammars!

LL(1) Predictive Parsing
- LL(1) parsing assumes that for each non-terminal and token there is only one production that could lead to success
  - This sounds deterministic! We can use a table-based approach like with lexing
  - One dimension for current non-terminal to expand
  - One dimension for next token seen on the input
  - Each table entry contains one production

Predictive Parsing and Left Factoring
  S → a B | b C
  B → b b C
  C → c c
vs.
  E → T + E | T
  T → int | int * T | ( E )
- First grammar: Easy! One token → one rule
- Second grammar: Hard! Two T productions start with ‘int’
- We must left-factor before using LL(1) predictive parsing

Left Factoring
  E → T + E | T
  T → int | int * T | ( E )
Factor out the common prefixes of production rules
  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε

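After left factoring, one token of lookahead is enough to pick a production, so a recursive descent parser needs no backtracking. The sketch below assumes single-character tokens with 'i' standing for int; the names tok, ok, expect and parse are illustrative, not from the lecture.

```c
#include <stdbool.h>

static const char *tok;   /* next token; 'i' stands for int (assumption) */
static bool ok;           /* cleared on the first syntax error */

static void parse_E(void);
static void parse_T(void);

/* Consume c if it is the next token, otherwise record an error. */
static void expect(char c) { if (ok && *tok == c) tok++; else ok = false; }

static void parse_Y(void) {            /* Y -> * T | epsilon */
    if (ok && *tok == '*') { tok++; parse_T(); }
}

static void parse_X(void) {            /* X -> + E | epsilon */
    if (ok && *tok == '+') { tok++; parse_E(); }
}

static void parse_T(void) {            /* T -> ( E ) | int Y */
    if (!ok) return;
    if (*tok == '(')      { tok++; parse_E(); expect(')'); }
    else if (*tok == 'i') { tok++; parse_Y(); }
    else ok = false;
}

static void parse_E(void) {            /* E -> T X */
    parse_T();
    parse_X();
}

/* True when the whole input is derivable from E. */
bool parse(const char *input) {
    tok = input;
    ok = true;
    parse_E();
    return ok && *tok == '\0';
}
```

Every function decides what to do by inspecting only *tok, which is exactly the property the unfactored T → int | int * T lacked.
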
Parse Tables!
- Parse tables are a fast implementation of LL(1) parsers
  - N.B. LL(1) grammars represent a subset of context-free grammars
  - Restrict ambiguities in resolving rules to make a table possible!
- Table T is 2-dimensional: T[A][t] = A → Y1 Y2 ... Ym means “when you are expanding non-terminal A and see token t, start considering A → Y1 Y2 ... Ym”

Parse Tables!
  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε
LL(1) Parsing Table ($ means end of input)
       int      *      +      (        )      $
  T    int Y                  ( E )
  E    T X                    T X
  X                    + E             ε      ε
  Y             * T    ε               ε      ε

Parse Tables!
- T[E][int] = T X
  - Interpretation: “If I’m considering non-terminal E and I see ‘int’, follow production E → T X”
- T[Y][+] = ε
  - Interpretation: “If I’m considering non-terminal Y and I see a ‘+’, get rid of the Y”
- Blank entries indicate errors! Consider T[E][*]
  - Interpretation: “There is no way to derive a string starting with * from non-terminal E.”

Using Parse Tables
- Much like recursive descent
- For each non-terminal S
  - Look at next token a
  - Choose production shown in T[S][a]
- We use a stack to track pending non-terminals
  - Reject when we encounter an error state (a blank)
  - Accept when we reach end-of-input ($) with nothing else left on the stack

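LL(1) Predictive Parsing with Table
push($);   // we succeed if we get to the end
push(S);   // start symbol
do {
  X = pop();
  if (X == $) {
    // bottom of the stack: accept only if the input is exhausted too
    if (next_token() == $) { accept(); } else { error(); }
  } else if (is_terminal(X)) {
    if (X == next_token()) {
      consume(next_token());
    } else { error(); }
  } else {
    // X is a non-terminal: look up the production to expand
    if (T[X][next_token()] == "X → Y1 Y2 ... Ym") {
      // push the right-hand side in reverse so that Y1 ends up on top
      push(Ym); ... push(Y2); push(Y1);
    } else { error(); }   // blank table entry
  }
} while (X != $);
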
  Stack          Input          Action
  E $            int * int $    T X
  T X $          int * int $    int Y
  int Y X $      int * int $    consume
  Y X $          * int $        * T
  * T X $        * int $        consume
  T X $          int $          int Y
  int Y X $      int $          consume
  Y X $          $              ε
  X $            $              ε
  $              $              ACCEPT

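The trace above can be reproduced with a small table-driven parser. The following is a sketch under stated assumptions rather than the lecture's implementation: tokens are single characters ('i' for int, '$' for end of input), the table is hard-coded from the slide, and the names TERMS, NONTERMS, table and ll1_parse are invented for this example.

```c
#include <stdbool.h>
#include <string.h>

/* Column order and row order of the table (assumed encodings). */
static const char *TERMS = "i*+()$";
static const char *NONTERMS = "TEXY";

/* table[row][col] is the right-hand side to push: "" means epsilon,
 * NULL means a blank (error) entry. */
static const char *table[4][6] = {
    /*        i     *     +     (      )     $   */
    /* T */ {"iY", NULL, NULL, "(E)", NULL, NULL},
    /* E */ {"TX", NULL, NULL, "TX",  NULL, NULL},
    /* X */ {NULL, NULL, "+E", NULL,  "",   ""  },
    /* Y */ {NULL, "*T", "",   NULL,  "",   ""  },
};

/* The input must end with '$', e.g. "i*i$". */
bool ll1_parse(const char *input) {
    char stack[256];
    size_t top = 0;
    stack[top++] = '$';                       /* bottom-of-stack marker */
    stack[top++] = 'E';                       /* start symbol */
    while (top > 0) {
        char X = stack[--top];
        const char *col = *input ? strchr(TERMS, *input) : NULL;
        if (col == NULL) return false;        /* unknown token or missing '$' */
        const char *row = strchr(NONTERMS, X);
        if (row == NULL) {                    /* X is a terminal or '$' */
            if (X != *input) return false;
            if (X == '$') return true;        /* stack and input both exhausted */
            input++;                          /* consume the matched token */
        } else {
            const char *rhs = table[row - NONTERMS][col - TERMS];
            if (rhs == NULL) return false;    /* blank entry: syntax error */
            size_t len = strlen(rhs);
            if (top + len > sizeof stack) return false;   /* sketch limit */
            for (size_t k = len; k > 0; k--)  /* push in reverse order */
                stack[top++] = rhs[k - 1];
        }
    }
    return false;
}
```

Calling ll1_parse("i*i$") walks through the same stack contents as the trace; an input such as "*i$" is rejected at once because T[E][*] is blank.
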
LL(1) Languages
- LL(1) languages can be LL(1) parsed
  - Formally, a language Q is LL(1) if there exists an LL(1) table such that the LL(1) parsing algorithm using that table accepts exactly the strings in Q.
- No table entry can be multiply defined
  - This restricts the grammar!
- Once we construct the table
  1. The parsing algorithm is simple and fast
  2. No backtracking is necessary
- Wouldn’t it be nice to generate a parsing table from a CFG?

FIRST and FOLLOW sets
- FIRST(α) is the set of all terminal symbols that can begin some derivation starting with α
    α → ... → aβ
  FIRST(α) = { a ∈ T | α →* aβ } ∪ { ε | α →* ε }
Example:
  S → a | b S c
  FIRST(S) = {a, b}

Example FIRST sets
  S → a S e | S T T
  T → R S e | Q
  R → x S x | ε
  Q → S T | ε
FIRST(S) = ?
FIRST(T) = ?
FIRST(R) = ?
FIRST(Q) = ?

FOLLOW sets
- FOLLOW(A) is the set of terminals (including $) that can immediately follow a non-terminal A in some derivation
  FOLLOW(A) = { a ∈ T | S →+ ...Aa... } ∪ { $ | S →+ ...A }
- Compute FIRST sets for all non-terminals
- Add $ to FOLLOW(S) (the start symbol always ends with end-of-input)
- For all productions Y → ... X A1 ... An
  - For i = 1, ..., n: add FIRST(Ai) - {ε} to FOLLOW(X). Stop if ε ∉ FIRST(Ai).
  - If every Ai can derive ε (or nothing follows X), add FOLLOW(Y) to FOLLOW(X)
- Repeat until no FOLLOW set changes

Example FOLLOW Set
  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε
FOLLOW(“+”) = { int, ( }
FOLLOW(“(”) = { int, ( }
FOLLOW(X) = { $, ) }
FOLLOW(Y) = { +, ), $ }

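The FIRST and FOLLOW rules can be run as a fixed-point iteration: keep applying them until no set grows. The sketch below does this for the example grammar, representing each set as a bitmask. The encoding ('i' for int, 'e' as a marker for ε, the macros EPS_BIT and END_BIT) and all function names are assumptions made for illustration.

```c
#include <string.h>

/* Grammar from the slide; "" is an epsilon production. */
typedef struct { char lhs; const char *rhs; } Prod;
static const Prod G[] = {
    {'E', "TX"}, {'X', "+E"}, {'X', ""},
    {'T', "(E)"}, {'T', "iY"}, {'Y', "*T"}, {'Y', ""},
};
enum { NPROD = sizeof G / sizeof G[0] };

static const char *SYMS = "i*+()$e";   /* bit order; 'e' marks epsilon */
static const char *NT = "EXTY";        /* index order of the set arrays */
#define END_BIT (1u << 5)              /* '$' */
#define EPS_BIT (1u << 6)              /* epsilon */

static unsigned bit(char c) { return 1u << (strchr(SYMS, c) - SYMS); }
static int nt(char c) { const char *p = strchr(NT, c); return p ? (int)(p - NT) : -1; }

/* FIRST of a symbol string, given the current FIRST sets of the non-terminals. */
static unsigned first_of(const char *s, const unsigned first[4]) {
    unsigned out = 0;
    for (; *s; s++) {
        int n = nt(*s);
        unsigned f = n >= 0 ? first[n] : bit(*s);
        out |= f & ~EPS_BIT;
        if (!(f & EPS_BIT)) return out;   /* this symbol cannot vanish: stop */
    }
    return out | EPS_BIT;                 /* every symbol can derive epsilon */
}

/* Fills first[] and follow[] (indexed like NT) by iterating to a fixed point. */
void compute_sets(unsigned first[4], unsigned follow[4]) {
    memset(first, 0, 4 * sizeof first[0]);
    memset(follow, 0, 4 * sizeof follow[0]);
    follow[nt('E')] = END_BIT;            /* $ follows the start symbol */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int a = nt(G[p].lhs);
            unsigned f = first[a] | first_of(G[p].rhs, first);
            if (f != first[a]) { first[a] = f; changed = 1; }
            for (const char *s = G[p].rhs; *s; s++) {
                int x = nt(*s);
                if (x < 0) continue;      /* only non-terminals are tracked */
                unsigned rest = first_of(s + 1, first);
                unsigned g = follow[x] | (rest & ~EPS_BIT);
                if (rest & EPS_BIT) g |= follow[a];   /* the rest can vanish */
                if (g != follow[x]) { follow[x] = g; changed = 1; }
            }
        }
    }
}
```

Under these assumptions the result agrees with the slide: follow[nt('X')] holds $ and ), and follow[nt('Y')] holds +, ) and $. The sketch tracks FOLLOW for non-terminals only, so FOLLOW(“+”) and FOLLOW(“(”) are not computed.
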
Back to Parsing Tables
- Recall: We want to build an LL(1) Parsing Table
For each production A → α in G do:
- For each terminal b ∈ FIRST(α) do
  - T[A][b] = α
- If α →* ε, for each b ∈ FOLLOW(A) do
  - T[A][b] = α

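The construction rule above can be sketched as follows. To keep the example short, FIRST(α) for each production and FOLLOW(A) for each non-terminal are hard-coded for the left-factored expression grammar instead of being computed; the struct layout, the 'i' token for int and the names set_entry and build_table are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <string.h>

typedef struct {
    char lhs;            /* non-terminal A */
    const char *rhs;     /* alpha; "" is epsilon */
    const char *first;   /* terminals in FIRST(alpha), hard-coded */
    bool nullable;       /* true when alpha ->* epsilon */
} Prod;

static const Prod G[] = {
    {'E', "TX",  "i(", false},
    {'X', "+E",  "+",  false},
    {'X', "",    "",   true },
    {'T', "(E)", "(",  false},
    {'T', "iY",  "i",  false},
    {'Y', "*T",  "*",  false},
    {'Y', "",    "",   true },
};
static const char *NT = "EXTY";       /* row order */
static const char *TERMS = "i*+()$";  /* column order */
static const char *FOLLOW[] = {"$)", "$)", "+$)", "+$)"};   /* in NT order */

/* table[A][b] is the index into G of the chosen production, or -1 if blank. */
static int table[4][6];

/* Returns false if the entry already holds a different production. */
static bool set_entry(int row, char b, int prod) {
    int col = (int)(strchr(TERMS, b) - TERMS);
    if (table[row][col] != -1 && table[row][col] != prod) return false;
    table[row][col] = prod;
    return true;
}

/* Returns false if some entry would be multiply defined (G is not LL(1)). */
bool build_table(void) {
    bool ok = true;
    memset(table, -1, sizeof table);   /* all bytes 0xFF, so every int is -1 */
    for (int p = 0; p < (int)(sizeof G / sizeof G[0]); p++) {
        int row = (int)(strchr(NT, G[p].lhs) - NT);
        for (const char *b = G[p].first; *b; b++)
            if (!set_entry(row, *b, p)) ok = false;
        if (G[p].nullable)
            for (const char *b = FOLLOW[row]; *b; b++)
                if (!set_entry(row, *b, p)) ok = false;
    }
    return ok;
}
```

For this grammar build_table() reports no conflict; table[0][0] ends up as production 0 (E → T X) and table[3][2] as production 6 (Y → ε).
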
Parsing Table
  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε
Where do we put Y → * T?
- Well, FIRST(* T) = {*}, thus column * of row Y gets * T
Where do we put Y → ε?
- Well, FOLLOW(Y) = {$, +, )}, thus columns $, +, and ) in row Y get Y → ε
       int      *      +      (        )      $
  T    int Y                  ( E )
  E    T X                    T X
  X                    + E             ε      ε
  Y             * T    ε               ε      ε

Notes on LL(1) Parsing Tables
- If any entry is multiply defined then G is not LL(1). This happens, for example, when:
  - G is ambiguous
  - G is left-recursive
  - G is not left-factored

Ambiguity in parse tables
  E → E + T      T → F
  E → T          F → id
  T → T * F      F → ( E )
For the E productions, we need FIRST(T) = {(, id} and FIRST(E) = {(, id}
But now, which rule (E → E + T or E → T) gets put in T[E][(] and T[E][id]??
       +      *      (      )      id     $
  E                  ?             ?
  T
  F

Simple Parsing Strategies
- Recursive Descent Parsing
  - Backtracking is annoying, BUT super useful for PA3-6
- Predictive Parsing a.k.a. LL(k)
  - Predict production from k tokens of lookahead
  - Build LL(1) table
  - Parsing is now fast and easy!
- Next up, LR Parsing, a more powerful strategy for parsing non-LL(1) grammars