Scanner
Associate Professor 許良全, Computer Center, Chung Cheng Institute of Technology
Compiler Design Copyright © 1998 by LCH
Overview of Scanning
The purpose of a scanner is to group input characters into tokens.
A scanner is sometimes called a lexical analyzer.
A precise definition of tokens is necessary to ensure that lexical rules are properly enforced.
Scanners normally seek to make a token as long as possible. E.g., ABC is scanned as one identifier rather than three.
All scanners perform much the same function; the point of using a scanner generator is to limit the effort of building a scanner from scratch.
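The longest-match rule can be illustrated with a small sketch. The token classes (ID, NUM, SKIP) and the function name `scan` are assumptions made for illustration only; they are not taken from the slides.

```python
import re

# Hypothetical token classes, chosen only for this illustration.
TOKEN_PATTERNS = [
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("SKIP", re.compile(r"\s+")),
]


def scan(text):
    """Group characters into (kind, lexeme) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_kind, best_len = kind, m.end() - pos
        if best_kind is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if best_kind != "SKIP":  # white space separates tokens but is not one
            tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

Calling `scan("ABC")` yields the single pair `("ID", "ABC")` rather than three one-letter identifiers, because each pattern consumes as many characters as it can.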
Finite State Systems
The finite state automaton is a mathematical model of a system with discrete inputs and outputs.
Examples of Finite State Systems
Elevators: do not remember all previous requests for service, but only the current floor, the direction of motion, and the collection of not-yet-satisfied requests for service.
Vending machines: insert enough coins and you'll get a Pepsi eventually.
Computers: the state of the CPU, main memory, and auxiliary storage at any time is one of a very large but finite number of states.
Human brains: 2^35 cells or neurons at most.
Definition of Finite Automata
A finite automaton (FA) is an idealized computer that recognizes strings belonging to regular sets. It is a 5-tuple (Q, Σ, δ, q0, F):
A finite set of states, Q.
A finite input alphabet, Σ, or vocabulary, V.
A special start, or initial, state q0, with q0 ∈ Q.
A set of final, or accepting, states F, with F ⊆ Q.
A transition function, δ, that maps Q × Σ to Q.
FA and Transition Diagrams
[Figure: transition diagram with edges labeled a, b, c, and a. Legend: a state; a transition; the start state; a final state.]
FA and Transition Tables
states   inputs
         a     b     c
q0       q1
q1             q2
q2                   q3
q3       q1          q3
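A transition table maps directly onto a lookup structure. The sketch below assumes the table reads q0 to q1 on a, q1 to q2 on b, q2 to q3 on c, and q3 to q1 on a or to q3 on c; it also assumes q0 is the start state and q3 the only final state. The column placement is not recoverable with certainty from the slide, so treat these entries as an illustration.

```python
# Transition function as a dict: (state, input symbol) -> next state.
# Assumed reading of the table; a missing entry means "no transition".
DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "c"): "q3",
    ("q3", "a"): "q1",
    ("q3", "c"): "q3",
}
START = "q0"
FINAL = {"q3"}  # assumption: q3 is the only accepting state


def accepts(string):
    """Run the automaton over string and report whether it ends in a final state."""
    state = START
    for ch in string:
        state = DELTA.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in FINAL
```

Under these assumptions the automaton accepts strings such as abc, abcc, and abcabc, and rejects ab and the empty string.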
Regular Expressions
The languages accepted by finite automata are easily described by simple expressions called regular expressions.
Strings are built from characters in V via catenation, e.g., !=, for, while.
An empty or null string, denoted by λ, is allowed.
The characters (, ), ', *, +, and | are called meta-characters. They must be quoted when used as ordinary characters in order to avoid ambiguity. E.g.
Delim = ('('|')'|:=|;|,|'+'|-|'*'|/|=|$$$)
Definition of Regular Expression
A regular expression denotes a set of strings:
∅ is a regular expression denoting the empty set (the set containing no strings).
λ is a regular expression denoting the set that contains only the empty string. Note that this set contains one element.
A string s is a regular expression denoting a set containing only s. If s contains meta-characters, s can be quoted to avoid ambiguity.
If A and B are regular expressions, then A|B, AB, and A* are also regular expressions, corresponding to alternation, catenation, and Kleene closure respectively.
Properties of Regular Expressions
Let P and Q be sets of strings.
The string s ∈ (P|Q) iff s ∈ P or s ∈ Q.
The string s ∈ P* iff s can be broken into zero or more pieces: s = s1s2s3…sn such that each si ∈ P.
P+ denotes all strings consisting of one or more strings in P catenated together.
P* = (P+|λ) and P+ = PP* = P*P.
If A is a set of characters, Not(A) denotes (V − A), i.e., all characters in V not included in A.
If k is a constant, the set A^k represents all strings formed by catenating k strings from A, i.e., A^k = (AAA…) (k copies).
Examples of Regular Expressions
Let D = (0|…|9), L = (A|…|Z).
A comment that begins with -- and ends with Eol:
Comment = --Not(Eol)*Eol
A fixed decimal literal:
Lit = D+.D+
An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores:
ID = L(L|D)*(_(L|D)+)*
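These three definitions can be tried out with Python's `re` module. This is only a sketch: it assumes Eol is the newline character, keeps L as the uppercase letters defined above, and escapes the dot in Lit because `.` is a meta-character in Python's notation. The names COMMENT, LIT, and ID are chosen for this example.

```python
import re

# Comment = --Not(Eol)*Eol, assuming Eol is "\n"
COMMENT = re.compile(r"--[^\n]*\n")

# Lit = D+.D+ (the dot is escaped so that it matches only a literal ".")
LIT = re.compile(r"[0-9]+\.[0-9]+")

# ID = L(L|D)*(_(L|D)+)*
ID = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")
```

With `fullmatch`, ID accepts A_B1 but rejects A__B (consecutive underscores) and A_ (trailing underscore), which is what the prose definition requires.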
Using a Scanner Generator: Lex
Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Labs. It is written in C and runs under UNIX.
Lex produces an entire scanner module that can be compiled and linked with other compiler modules.
Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed.
A typical lex program contains three sections separated by %% delimiters.
First Section of Lex
The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67)
[] delimits character classes.
- denotes ranges: [xyz] == [x-z].
\ denotes the escape character, as in C.
^ complements a character class (Not): [^xy] denotes all characters except x and y.
|, *, and + (alternation, Kleene closure, and positive closure) are provided.
() can be used to control grouping of subexpressions.
(expr)? == (expr)|λ, i.e., it matches expr zero times or once.
{} signals the macro expansion of a symbol defined in the first section.
First Section of Lex, cont.
Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.
[ab][cd] will match any of ad, ac, bc, and bd.
begin == "begin" == [b][e][g][i][n]
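The character-class examples can be checked by brute force. The sketch below uses Python's `re` module rather than Lex itself, on the assumption that the two notations agree for these simple classes; the helper `matches_of` is a name made up for this example.

```python
import re
from itertools import product


def matches_of(pattern, alphabet, length):
    """Return every string of the given length over alphabet that pattern matches in full."""
    return sorted(
        "".join(chars)
        for chars in product(alphabet, repeat=length)
        if re.fullmatch(pattern, "".join(chars))
    )
```

For instance, `matches_of("[ab][cd]", "abcd", 2)` enumerates exactly ac, ad, bc, and bd, and `[x-z]` and `[xyz]` select the same characters.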
Second Section of Lex
The second section of Lex defines a table of regular expressions and corresponding commands.
When an expression is matched, its associated command is executed.
Auxiliary functions may be defined in the third section.
Input that is matched is stored in the string variable yytext, whose length is yyleng.
Lex creates an integer function yylex() that may be called from the parser.
The value returned is usually the token code of the token scanned by Lex.
When yylex() encounters end of file, it calls a user-supplied integer function named yywrap() to wrap up input processing.
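The interplay of the rule table, yytext, yyleng, yylex(), and yywrap() can be imitated in a few lines. This is a toy model written in Python, not the code Lex generates: the class name `Scanner`, the token codes, and the rules themselves are all assumptions made for illustration.

```python
import re

# Hypothetical token codes; a real parser would define its own.
EOF, ID, NUM, PLUS = 0, 1, 2, 3


class Scanner:
    """Toy imitation of a Lex-style rule table (not Lex's real implementation)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.yytext = ""  # the most recently matched input
        self.yyleng = 0   # its length
        # (regular expression, command) pairs; a command returns a token
        # code, or None when the matched text should simply be skipped.
        self.rules = [
            (re.compile(r"[ \t\n]+"), lambda: None),
            (re.compile(r"[A-Za-z][A-Za-z0-9]*"), lambda: ID),
            (re.compile(r"[0-9]+"), lambda: NUM),
            (re.compile(r"\+"), lambda: PLUS),
        ]

    def yywrap(self):
        """Called at end of input; returning 1 signals that no more input follows."""
        return 1

    def yylex(self):
        while True:
            if self.pos >= len(self.text):
                self.yywrap()
                return EOF
            # Choose the longest match; on a tie the earliest rule wins.
            best = None
            for pattern, command in self.rules:
                m = pattern.match(self.text, self.pos)
                if m and (best is None or m.end() > best[0].end()):
                    best = (m, command)
            if best is None:
                raise ValueError(f"illegal character {self.text[self.pos]!r}")
            m, command = best
            self.yytext = m.group()
            self.yyleng = len(self.yytext)
            self.pos = m.end()
            code = command()
            if code is not None:
                return code
```

A caller playing the role of the parser would invoke `yylex()` repeatedly and inspect `yytext` after each call, stopping when the EOF code comes back.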
Dealing with Multiple Input Files
yylex() uses three user-defined functions to handle character I/O:
input(): retrieve a single character, 0 on EOF
output(c): write a single character to the output
unput(c): put a single character back on the input to be re-read
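The contract of these three functions can be sketched as a small class. The class name `CharStream` and its internal fields are assumptions for illustration; only the behavior of input(), output(c), and unput(c) follows the description above.

```python
class CharStream:
    """Toy character I/O layer with one source string and a pushback stack."""

    def __init__(self, text):
        self._chars = text
        self._pos = 0
        self._pushed_back = []  # characters returned by unput(), re-read first
        self.out = []           # characters written by output()

    def input(self):
        """Retrieve a single character, or 0 at end of input."""
        if self._pushed_back:
            return self._pushed_back.pop()
        if self._pos < len(self._chars):
            ch = self._chars[self._pos]
            self._pos += 1
            return ch
        return 0

    def output(self, c):
        """Write a single character to the output."""
        self.out.append(c)

    def unput(self, c):
        """Put a single character back on the input to be re-read."""
        self._pushed_back.append(c)
```

Because all reading goes through input(), a replacement that switches to another source when the current one is exhausted is one way to handle several input files.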
Translating Regular Expressions into Finite Automata
Remember the relationship between RE and FA.
The main job of a scanner generator program is to transform a regular expression definition into an equivalent (D)FA.
A regular expression is first translated into a nondeterministic finite automaton (NFA), and the NFA is then translated into a DFA (2 steps).
An NFA, when reading a particular input, is not required to make a unique (deterministic) choice of which state to visit.
Translating RE into NFA
Any regular expression can be transformed into an NFA with the following properties:
There is a unique final state.
The final state has no successors.
Every other state has either one or two successors.
Regular expressions are built out of the atomic regular expressions a (where a is a character in V) and λ by using the three operations AB, A|B, and A*.
NFA for a and λ
[Figure: NFAs for the atomic regular expressions a and λ]
An NFA for A|B
[Figure: NFA for A|B, built from the finite automaton for A and the finite automaton for B]
An NFA for AB
[Figure: NFA for AB, built from the finite automaton for A followed by the finite automaton for B]
An NFA for A*
[Figure: NFA for A*, built from the finite automaton for A]
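The constructions for atoms, A|B, AB, and A* can be sketched in code. The version below is the standard Thompson-style construction, which may differ in detail from the figures on the slides; the class name `Builder`, the encoding of λ as `None`, and the helper functions are assumptions for illustration.

```python
LAMBDA = None  # assumed encoding of the label on a λ-transition


class Builder:
    """Builds NFA fragments; each fragment is a (start, final) pair of states."""

    def __init__(self):
        self.edges = {}  # state -> list of (label, target state)

    def new_state(self):
        s = len(self.edges)
        self.edges[s] = []
        return s

    def atom(self, label):
        """NFA for a single character, or for λ when label is LAMBDA."""
        start, final = self.new_state(), self.new_state()
        self.edges[start].append((label, final))
        return start, final

    def alt(self, a, b):
        """NFA for A|B: a new start branches to A and B; both rejoin at a new final."""
        start, final = self.new_state(), self.new_state()
        self.edges[start] += [(LAMBDA, a[0]), (LAMBDA, b[0])]
        self.edges[a[1]].append((LAMBDA, final))
        self.edges[b[1]].append((LAMBDA, final))
        return start, final

    def cat(self, a, b):
        """NFA for AB: the final state of A leads into the start of B."""
        self.edges[a[1]].append((LAMBDA, b[0]))
        return a[0], b[1]

    def star(self, a):
        """NFA for A*: A may be skipped entirely or repeated any number of times."""
        start, final = self.new_state(), self.new_state()
        self.edges[start] += [(LAMBDA, a[0]), (LAMBDA, final)]
        self.edges[a[1]] += [(LAMBDA, a[0]), (LAMBDA, final)]
        return start, final


def closure(edges, states):
    """All states reachable from states using λ-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, target in edges[s]:
            if label is LAMBDA and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def accepts(edges, nfa, text):
    """Simulate the NFA by tracking every state it could be in."""
    start, final = nfa
    current = closure(edges, {start})
    for ch in text:
        moved = {t for s in current for label, t in edges[s] if label == ch}
        current = closure(edges, moved)
    return final in current
```

As a usage example, `b = Builder()` followed by `b.cat(b.atom("a"), b.star(b.alt(b.atom("b"), b.atom("c"))))` builds an NFA for a(b|c)*. In the result the final state has no successors and every other state has one or two, matching the properties listed earlier.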
Translating NFA into DFA
Each state of the DFA (M) corresponds to a set of states of the NFA (N).
Transforming N to M is done by subset construction.
M will be in state {x,y,z} after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses.
M keeps track of all the possible routes N might take and runs them in parallel.
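The subset construction can be sketched as a worklist algorithm. The NFA below, which recognizes (a|b)*ab, is a hypothetical example written for this sketch, as are the names `NFA_EDGES`, `subset_construction`, and `dfa_accepts` and the encoding of λ as `None`.

```python
from collections import deque

LAMBDA = None  # assumed encoding of the label on a λ-transition

# Hypothetical NFA N for (a|b)*ab: state -> list of (label, target state).
NFA_EDGES = {
    0: [(LAMBDA, 1)],
    1: [("a", 1), ("b", 1), ("a", 2)],
    2: [("b", 3)],
    3: [],
}
NFA_START, NFA_FINAL = 0, 3
ALPHABET = "ab"


def closure(states):
    """All NFA states reachable from states using λ-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, target in NFA_EDGES[s]:
            if label is LAMBDA and target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)


def subset_construction():
    """Build DFA M, each of whose states is a set of states of NFA N."""
    start = closure({NFA_START})
    delta = {}  # (DFA state, character) -> DFA state
    todo, seen = deque([start]), {start}
    while todo:
        current = todo.popleft()
        for ch in ALPHABET:
            moved = {t for s in current for label, t in NFA_EDGES[s] if label == ch}
            if not moved:
                continue  # N has no move on ch from any of these states
            target = closure(moved)
            delta[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A DFA state is final if it contains N's final state.
    finals = {s for s in seen if NFA_FINAL in s}
    return start, delta, finals


def dfa_accepts(dfa, text):
    """Run the DFA; a missing transition means the string is rejected."""
    start, delta, finals = dfa
    state = start
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals
```

For this NFA the construction yields the DFA states {0,1}, {1,2}, {1}, and {1,3}, of which only {1,3} is final because it is the only one containing N's final state.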