Programming Language Syntax (Grammar)

Author: 陳鍾誠
Affiliation: Department of Information Management, Kinmen Institute of Technology (金門技術學院資管系)
Email: ccc@kmit.edu.tw
URL: http://ccc.kmit.edu.tw

Date: 112/04/20

Grammar

Language

Recursive Definition

Mathematical Expression

Structure of Expressions

Formal Language

Backus Naur Form (BNF)

1960 by J. Backus and P. Naur

EBNF (Extended BNF)

BNF → EBNF

Formalism (Formal notation)

N. Chomsky: father of modern structural linguistics

Differing structural trees for the same expression

Problem of Different structural trees

No Ambiguous Sentence

Context-Free Language

Syntactic equations of the form defined in EBNF generate context-free languages. The term "context-free" is due to Chomsky and stems from the fact that substitution of the symbol left of = by a sequence derived from the expression to the right of = is always permitted, regardless of the context in which the symbol is embedded within the sentence.

It has turned out that this restriction to context freedom (in the sense of Chomsky) is quite acceptable for programming languages, and that it is even desirable.

Context dependence in another sense, however, is indispensable. We will return to this topic in Chapter 8.

Regular Expression

A language is regular, if its syntax can be expressed by a single EBNF expression.

The requirement that a single equation suffices also implies that only terminal symbols occur in the expression.

Such an expression is called a regular expression.
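As a small illustration, the EBNF expression letter {letter | digit}, once letter and digit are expanded into their terminal characters, is a single expression over terminals and so corresponds to a regular expression. The sketch below assumes ASCII letters and digits; the names IDENT and is_identifier are made up for this example.

```python
import re

# EBNF: identifier = letter {letter | digit}
# Assumption: letters are ASCII A-Z / a-z and digits are 0-9.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(text: str) -> bool:
    """Return True if the whole string is a sentence of the identifier language."""
    return IDENT.fullmatch(text) is not None
```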

Syntax Analysis vs. Regular Expression

The reason for our interest in regular languages lies in the fact that programs for the recognition of regular sentences are particularly simple and efficient. By "recognition" we mean the determination of the structure of the sentence, and thereby naturally the determination of whether the sentence is well formed, that is, it belongs to the language. Sentence recognition is called syntax analysis.

Regular Expression vs. State Machine

For the recognition of regular sentences a finite automaton, also called a state machine, is necessary and sufficient. In each step the state machine reads the next symbol and changes state. The resulting state is solely determined by the previous state and the symbol read. If the resulting state is unique, the state machine is deterministic, otherwise nondeterministic. If the state machine is formulated as a program, the state is represented by the current point of program execution.
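A minimal sketch of a deterministic state machine in Python, assuming the regular language letter {letter | digit}. The state names, the character classes, and the transition table are assumptions made for this example; the next state depends only on the current state and the symbol read.

```python
# Deterministic state machine for: letter {letter | digit}
START, IN_IDENT, REJECT = "start", "in_ident", "reject"

def char_class(ch: str) -> str:
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# (state, input class) -> next state; missing entries lead to REJECT.
TRANSITIONS = {
    (START, "letter"): IN_IDENT,
    (IN_IDENT, "letter"): IN_IDENT,
    (IN_IDENT, "digit"): IN_IDENT,
}

def accepts(text: str) -> bool:
    state = START
    for ch in text:
        # One step: read the next symbol and change state.
        state = TRANSITIONS.get((state, char_class(ch)), REJECT)
        if state == REJECT:
            return False
    return state == IN_IDENT      # IN_IDENT is the only accepting state
```

Because every (state, class) pair has at most one successor, this machine is deterministic.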

EBNF → Program

The analyzing program can be derived directly from the defining syntax in EBNF. For each EBNF construct K there exists a translation rule which yields a program fragment Pr(K). The translation rules from EBNF to program text are shown below. Therein sym denotes a global variable always representing the symbol last read from the source text by a call to procedure next. Procedure error terminates program execution, signaling that the symbol sequence read so far does not belong to the language.
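A minimal sketch of the idea in Python, hand-translating a hypothetical one-rule grammar, A = "x" | "(" A {"," A} ")". The general pattern assumed here is: a terminal becomes a test-and-advance, a sequence becomes statements in order, an alternative becomes a branch on sym, and a repetition {...} becomes a while loop. sym and error follow the names in the text; next_sym stands in for procedure next, and Recognizer, ParseError, expect, and recognize are names invented for this sketch.

```python
class ParseError(Exception):
    """Raised by error(): the symbols read so far are not in the language."""

class Recognizer:
    # Hypothetical grammar: A = "x" | "(" A {"," A} ")" .
    def __init__(self, text: str):
        self.text = text
        self.pos = 0
        self.sym = ""        # symbol last read; "" marks end of input
        self.next_sym()

    def next_sym(self) -> None:
        # Each symbol is a single character in this sketch.
        self.sym = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1

    def error(self) -> None:
        raise ParseError(f"unexpected {self.sym!r} at position {self.pos - 1}")

    def expect(self, terminal: str) -> None:
        # Terminal "t": if sym is "t" then advance, otherwise error.
        if self.sym == terminal:
            self.next_sym()
        else:
            self.error()

    def A(self) -> None:
        if self.sym == "x":             # alternative 1: "x"
            self.next_sym()
        elif self.sym == "(":           # alternative 2: "(" A {"," A} ")"
            self.next_sym()
            self.A()
            while self.sym == ",":      # repetition {"," A}
                self.next_sym()
                self.A()
            self.expect(")")
        else:
            self.error()

def recognize(text: str) -> bool:
    r = Recognizer(text)
    try:
        r.A()
    except ParseError:
        return False
    return r.sym == ""                  # all input must be consumed
```

Unlike the procedure error described above, this sketch raises an exception rather than terminating the program, so that the result can be reported as a boolean.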

Analyzing program

EBNF with only 1 rule

First()

Precondition

Lexical Analysis for Identifier

Lexical Analysis for Integer

Scanner

The process of syntax analysis is based on a procedure to obtain the next symbol. This procedure in turn is based on the definition of symbols in terms of sequences of one or more characters. This latter procedure is called a scanner, and syntax analysis on this second, lower level, lexical analysis.

Lexical Analysis vs. Syntax Analysis

A Scanner Example

As an example we show a scanner for a parser of EBNF. Its terminal symbols and their definition in terms of characters are

Procedure GetSym() –(1)

Procedure GetSym() –(2)

Procedure GetSym() –(3)
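A hedged sketch of what a scanner for EBNF text might look like in Python. It is not the GetSym procedure from the slides. The token set is an assumption based on common EBNF notation: identifiers, double-quoted literals, and the single-character symbols ( ) [ ] { } | = and the period. The name get_syms and the (kind, text) token encoding are made up for this example.

```python
from typing import Iterator, Tuple

SINGLE_CHAR = set("()[]{}|=.")    # assumed punctuation symbols of EBNF

def get_syms(source: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, text); kind is 'ident', 'literal', a punctuation char, or 'other'."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():                        # skip blanks between symbols
            i += 1
        elif ch.isalpha():                      # identifier = letter {letter | digit}
            j = i + 1
            while j < n and source[j].isalnum():
                j += 1
            yield ("ident", source[i:j])
            i = j
        elif ch == '"':                         # literal = '"' {character} '"'
            j = source.find('"', i + 1)
            if j == -1:                         # unterminated literal
                yield ("other", source[i:])
                return
            yield ("literal", source[i + 1:j])
            i = j + 1
        elif ch in SINGLE_CHAR:
            yield (ch, ch)
            i += 1
        else:
            yield ("other", ch)
            i += 1
```

A parser would call this one symbol at a time; here the generator plays the role of the "obtain the next symbol" procedure.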

Syntax Analysis Overview

• Goal – determine if the input token stream satisfies the syntax of the program
• What do we need to do this?
  – An expressive way to describe the syntax
  – A mechanism that determines if the input token stream satisfies the syntax description
• For lexical analysis
  – Regular expressions describe tokens
  – Finite automata = mechanisms to generate tokens from input stream

Just Use Regular Expressions?

• REs can expressively describe tokens
  – Easy to implement via DFAs
• So just use them to describe the syntax of a programming language?
  – NO! – They don’t have enough power to express any non-trivial syntax
  – Example – Nested constructs (blocks, expressions, statements) – Detect balanced braces:
      {{} {} {{} { }}}
      { { { { { }}}} }
      . . .
  – We need unbounded counting!
  – FSAs cannot count except in a strictly modulo fashion
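The counting argument can be made concrete with a minimal Python sketch. The function name braces_balanced is made up here, and characters other than braces are simply ignored. The depth variable is an unbounded counter, which is exactly what a finite automaton with a fixed number of states cannot provide.

```python
def braces_balanced(text: str) -> bool:
    """Check that every '{' has a matching '}' and nesting is well formed."""
    depth = 0                       # unbounded counter of currently open braces
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:           # a closing brace with no matching opener
                return False
    return depth == 0               # nothing may remain open at the end
```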

Context-Free Grammars

• Consist of 4 components:
  – Terminal symbols = token or ε
  – Non-terminal symbols = syntactic variables
  – Start symbol S = special non-terminal
  – Productions of the form LHS → RHS
      LHS = single non-terminal
      RHS = string of terminals and non-terminals
      Specify how non-terminals may be expanded
• Language generated by a grammar is the set of strings of terminals derived from the start symbol by repeatedly applying the productions
  – L(G) = language generated by grammar G
• Example:
    S → a S a
    S → T
    T → b T b
    T → ε

CFG - Example

• Grammar for balanced-parentheses language
    S → ( S ) S
    S → ε
• 1 non-terminal: S
• 2 terminals: “(”, “)”
• Start symbol: S
• 2 productions
• If grammar accepts a string, there is a derivation of that string using the productions
  – “(())”: S ⇒ (S)S ⇒ ((S)S)S ⇒ (()S)S ⇒ (())S ⇒ (())
• Question: why is the final S required?

More on CFGs

• Shorthand notation – vertical bar for multiple productions
    S → a S a | T
    T → b T b | ε
• CFGs are powerful enough to express the syntax of most programming languages
• Derivation = successive application of productions starting from S
• Acceptance? = Determine if there is a derivation for an input token stream

A Parser

[Figure: a Parser box with two inputs, a context-free grammar G and a token stream s (from the lexer), and two outputs: "Yes, if s in L(G); No, otherwise" and error messages]

• Syntax analyzers (parsers) = CFG acceptors which also output the corresponding derivation when the token stream is accepted
• Various kinds: LL(k), LR(k), SLR, LALR

RE is a Subset of CFG

Can inductively build a grammar for each RE

    RE          Grammar
    ε           S → ε
    a           S → a
    R1 R2       S → S1 S2
    R1 | R2     S → S1 | S2
    R1*         S → S1 S | ε

Where
    G1 = grammar for R1, with start symbol S1
    G2 = grammar for R2, with start symbol S2
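The inductive construction can be sketched in Python. The nested-tuple encoding of regular expressions, the function name re_to_cfg, and the generated non-terminal names S1, S2, ... are all assumptions made for this example; each case adds the productions for one form of RE, with an empty right-hand side standing for ε.

```python
from itertools import count
from typing import List, Tuple

# Regular expressions as nested tuples (encoding assumed for this sketch):
#   ("eps",)  ("sym", "a")  ("cat", r1, r2)  ("alt", r1, r2)  ("star", r)
Production = Tuple[str, List[str]]      # (LHS, RHS symbols); [] means epsilon

def re_to_cfg(regex) -> Tuple[str, List[Production]]:
    """Return (start symbol, productions) of a grammar for the same language."""
    fresh = count(1)
    productions: List[Production] = []

    def new_symbol() -> str:
        return f"S{next(fresh)}"

    def build(r) -> str:
        kind = r[0]
        if kind == "eps":                           # S -> epsilon
            s = new_symbol()
            productions.append((s, []))
        elif kind == "sym":                         # S -> a
            s = new_symbol()
            productions.append((s, [r[1]]))
        elif kind == "cat":                         # S -> S1 S2
            s1, s2 = build(r[1]), build(r[2])
            s = new_symbol()
            productions.append((s, [s1, s2]))
        elif kind == "alt":                         # S -> S1 | S2
            s1, s2 = build(r[1]), build(r[2])
            s = new_symbol()
            productions.append((s, [s1]))
            productions.append((s, [s2]))
        elif kind == "star":                        # S -> S1 S | epsilon
            s1 = build(r[1])
            s = new_symbol()
            productions.append((s, [s1, s]))
            productions.append((s, []))
        else:
            raise ValueError(f"unknown regex node {kind!r}")
        return s

    return build(regex), productions
```

For example, re_to_cfg(("star", ("sym", "a"))) builds a grammar for a* whose start symbol has one recursive production and one empty production.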

Grammar for Sum Expression

Grammar
    S → E + S | E
    E → number | (S)

Expanded
    S → E + S
    S → E
    E → number
    E → (S)

• 4 productions
• 2 non-terminals (S, E)
• 4 terminals: “(”, “)”, “+”, number
• Start symbol: S

Constructing a Derivation

• Start from S (the start symbol)
• Use productions to derive a sequence of tokens
• For arbitrary strings α, β, γ and for a production A → β, a single step of the derivation is α A γ ⇒ α β γ (substitute β for A)
• Example: with the production S → E + S, one step is (S + E) + E ⇒ (E + S + E) + E

Class Problem

    S → E + S | E
    E → number | (S)

Derive: (1 + 2 + (3 + 4)) + 5

Parse Tree

[Figure: parse tree for (1 + 2 + (3 + 4)) + 5 under the grammar S → E + S | E, E → number | (S)]

• Parse tree = tree representation of the derivation
• Leaves of the tree are terminals
• Internal nodes are non-terminals
• No information about the order of the derivation steps
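A parse tree of this kind can be built mechanically. Below is a minimal sketch, assuming a recursive-descent parser written directly from S → E + S | E and E → number | (S). The names tokenize, parse, and leaves, and the nested-tuple tree encoding, are made up for this example; the tokenizer simply ignores characters outside the grammar.

```python
import re
from typing import List, Tuple, Union

Tree = Union[str, Tuple[str, list]]     # leaf token, or (non-terminal, children)

def tokenize(text: str) -> List[str]:
    return re.findall(r"\d+|[()+]", text)

def parse(text: str) -> Tree:
    tokens = tokenize(text)
    pos = 0

    def peek() -> str:
        return tokens[pos] if pos < len(tokens) else ""

    def take(expected: str = "") -> str:
        nonlocal pos
        tok = peek()
        if tok == "" or (expected and tok != expected):
            raise ValueError(f"unexpected token {tok!r} at index {pos}")
        pos += 1
        return tok

    def parse_S() -> Tree:              # S -> E + S | E
        e = parse_E()
        if peek() == "+":
            return ("S", [e, take("+"), parse_S()])
        return ("S", [e])

    def parse_E() -> Tree:              # E -> number | ( S )
        if peek() == "(":
            return ("E", [take("("), parse_S(), take(")")])
        tok = take()
        if not tok.isdigit():
            raise ValueError(f"expected a number, got {tok!r}")
        return ("E", [tok])

    tree = parse_S()
    if peek() != "":
        raise ValueError("trailing input")
    return tree

def leaves(tree: Tree) -> List[str]:
    """Terminals at the leaves of the tree, read left to right."""
    if isinstance(tree, str):
        return [tree]
    return [tok for child in tree[1] for tok in leaves(child)]
```

Reading the leaves left to right gives back the token sequence, while the internal nodes are all S or E, matching the bullets above.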

Parse Tree vs Abstract Syntax Tree

[Figure: the parse tree for (1 + 2 + (3 + 4)) + 5 next to its abstract syntax tree, which keeps only the four + nodes and the leaves 1, 2, 3, 4, 5]

• Parse tree also called “concrete syntax”
• AST discards (abstracts) unneeded information – more compact format

Derivation Order

• Can choose to apply productions in any order: select a non-terminal and substitute the RHS of a production
• Two standard orders: leftmost and rightmost
• Leftmost derivation
  – In the string, find the leftmost non-terminal and apply a production to it
  – E + S ⇒ 1 + S
• Rightmost derivation
  – Same, but find rightmost non-terminal
  – E + S ⇒ E + E + S

Leftmost/Rightmost Derivation Examples

» S → E + S | E
» E → number | (S)

» Leftmost derive: (1 + 2 + (3 + 4)) + 5

S ⇒ E + S ⇒ (S)+S ⇒ (E+S)+S ⇒ (1+S)+S ⇒ (1+E+S)+S ⇒ (1+2+S)+S ⇒ (1+2+E)+S ⇒ (1+2+(S))+S ⇒ (1+2+(E+S))+S ⇒ (1+2+(3+S))+S ⇒ (1+2+(3+E))+S ⇒ (1+2+(3+4))+S ⇒ (1+2+(3+4))+E ⇒ (1+2+(3+4))+5

» Now, rightmost derive the same input string

S ⇒ E+S ⇒ E+E ⇒ E+5 ⇒ (S)+5 ⇒ (E+S)+5 ⇒ (E+E+S)+5 ⇒ (E+E+E)+5 ⇒ (E+E+(S))+5 ⇒ (E+E+(E+S))+5 ⇒ (E+E+(E+E))+5 ⇒ (E+E+(E+4))+5 ⇒ (E+E+(3+4))+5 ⇒ (E+2+(3+4))+5 ⇒ (1+2+(3+4))+5

Result: Same parse tree: same productions chosen, but in different order
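Both derivation orders can be replayed from a single parse tree, which makes the "same tree, different order" point concrete. The sketch below is a hedged illustration: the tree for the shorter sentence (1+2)+3 is written by hand, and the helpers S, E, render, and derive are names invented for this example.

```python
def S(*kids):
    return ("S", list(kids))

def E(*kids):
    return ("E", list(kids))

# Hand-built parse tree for (1+2)+3 under S -> E + S | E, E -> number | (S).
TREE = S(E("(", S(E("1"), "+", S(E("2"))), ")"), "+", S(E("3")))

def render(form) -> str:
    """Show a sentential form: non-terminal nodes by name, terminals as-is."""
    return "".join(n if isinstance(n, str) else n[0] for n in form)

def derive(tree, leftmost: bool = True):
    """List the sentential forms of the leftmost or rightmost derivation."""
    form = [tree]
    steps = [render(form)]
    while True:
        idxs = [i for i, node in enumerate(form) if isinstance(node, tuple)]
        if not idxs:                    # only terminals left: derivation done
            return steps
        i = idxs[0] if leftmost else idxs[-1]
        form[i:i + 1] = form[i][1]      # apply the production used at this node
        steps.append(render(form))

print(" => ".join(derive(TREE, leftmost=True)))
print(" => ".join(derive(TREE, leftmost=False)))
```

Both calls expand the same eight tree nodes, so they use the same productions; only the order in which the non-terminals are replaced differs.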

Class Problem

    S → E + S | E
    E → number | (S) | -S

Do the rightmost derivation of : 1 + (2 + -(3 + 4)) + 5

Ambiguous Grammars

• In the sum expression grammar, leftmost and rightmost derivations produced identical parse trees
• + operator associates to the right in parse tree regardless of derivation order

[Figure: abstract syntax tree for (1+2+(3+4))+5, with four + nodes and the leaves 1, 2, 3, 4, 5]

An Ambiguous Grammar

• + associates to the right because of the right-recursive production: S → E + S
• Consider another grammar
    S → S + S | S * S | number
• Ambiguous grammar = different derivations produce different parse trees
  – More specifically, G is ambiguous if there are 2 distinct leftmost (rightmost) derivations for some sentence

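Ambiguous Grammar - Example

S → S + S | S * S | number

Consider the expression: 1 + 2 * 3

Derivation 1: S ⇒ S+S ⇒ 1+S ⇒ 1+S*S ⇒ 1+2*S ⇒ 1+2*3

Derivation 2: S ⇒ S*S ⇒ S+S*S ⇒ 1+S*S ⇒ 1+2*S ⇒ 1+2*3

[Figure: two trees. Derivation 1 gives + at the root with children 1 and (2 * 3); derivation 2 gives * at the root with children (1 + 2) and 3]

Obviously not equal!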

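Impact of Ambiguity

• Different parse trees correspond to different evaluations!
• Thus, program meaning is not defined!!

[Figure: the two trees for 1 + 2 * 3. The tree with + at the root evaluates to 7; the tree with * at the root evaluates to 9]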
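The two evaluations can be checked with a short Python sketch. The nested-tuple encoding of the trees and the names OPS, evaluate, tree_1, and tree_2 are assumptions made for this example.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a tree given as a number or an (op, left, right) tuple."""
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

tree_1 = ("+", 1, ("*", 2, 3))      # + at the root: 1 + (2 * 3)
tree_2 = ("*", ("+", 1, 2), 3)      # * at the root: (1 + 2) * 3

print(evaluate(tree_1), evaluate(tree_2))   # prints: 7 9
```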

Can We Get Rid of Ambiguity?

• Ambiguity is a function of the grammar, not the language!
• A context-free language L is inherently ambiguous if all grammars for L are ambiguous
• Every deterministic CFL has an unambiguous grammar
  – So, no deterministic CFL is inherently ambiguous
  – No inherently ambiguous programming languages have been invented
• To construct a useful parser, must devise an unambiguous grammar

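Eliminating Ambiguity

• Often can eliminate ambiguity by adding non-terminals and allowing recursion only on right or left
    S → S + T | T
    T → T * num | num
  – T non-terminal enforces precedence
  – Left-recursion; left associativity

[Figure: parse tree for 1 + 2 * 3 under this grammar. The root S expands to S + T; the left S derives T and then 1; the right T expands to T * 3, whose inner T derives 2]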
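A minimal sketch of a parser for the grammar S → S + T | T, T → T * num | num. A recursive-descent procedure cannot call itself first without consuming input, so this sketch codes each left-recursive rule as a loop, T {"+" T} and num {"*" num}, and folds the results to the left; that rewrite, the name parse_sum, and the tuple AST encoding are assumptions of this example, not part of the slides.

```python
import re

def parse_sum(text: str):
    """Parse with S -> S + T | T and T -> T * num | num; return a tuple AST."""
    tokens = re.findall(r"\d+|[+*]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def num():
        tok = advance()
        if tok is None or not tok.isdigit():
            raise ValueError(f"expected a number, got {tok!r}")
        return int(tok)

    def T():                            # T -> T * num | num
        node = num()
        while peek() == "*":
            advance()
            node = ("*", node, num())   # fold to the left: left associativity
        return node

    def S():                            # S -> S + T | T
        node = T()
        while peek() == "+":
            advance()
            node = ("+", node, T())     # T binds tighter than +: precedence
        return node

    tree = S()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree
```

With this grammar, 1 + 2 * 3 has only one tree, ("+", 1, ("*", 2, 3)), because * can only appear inside a T.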

A Closer Look at Eliminating Ambiguity

• Precedence enforced by
  – Introduce distinct non-terminals for each precedence level
  – Operators for a given precedence level are specified as RHS for the production
  – Higher precedence operators are accessed by referencing the next-higher precedence non-terminal

Associativity

• An operator is either left, right or non associative
  – Left: a + b + c = (a + b) + c
  – Right: a ^ b ^ c = a ^ (b ^ c)
  – Non: a < b < c is illegal (thus undefined)
• Position of the recursion relative to the operator dictates the associativity
  – Left (right) recursion → left (right) associativity
  – Non: Don’t be recursive, simply reference next higher precedence non-terminal on both sides of operator

Class Problem (Tough)

S → S + S | S - S | S * S | S / S | (S) | -S | S ^ S | number

Enforce the standard arithmetic precedence rules and remove all ambiguity from the above grammar

Precedence (high to low)
    (), unary -
    ^
    *, /
    +, -

Associativity
    ^ = right
    rest are left
