ZHENG LI (李征) [email protected]
Program Analysis (Software Source Code Analysis Techniques)
Lexical and Syntax Analysis
Topic Covered Today
▪ Compilation
▪ Lexical Analysis
▪ Semantic Analysis
Compilation
• Translating from high-level language to machine
code is organized into several phases or passes.
• In the early days passes communicated through
files, but this is no longer necessary.
Language Specification
• We must first describe the language in question
by giving its specification.
• Syntax:
  • Defines symbols (vocabulary)
  • Defines programs (sentences)
• Semantics:
  • Gives meaning to sentences.
• The formal specifications are often the input to
tools that build translators automatically.
Compiler passes
[Figure: compiler passes. source program → lexical scanner → parser → semantic analyzer (front end) → translator → optimizer → final assembly (back end) → target program. The symbol table manager and the error handler interact with all passes.]
Symbol Table Management
▪ The symbol table is a data structure used by all phases
of the compiler to keep track of user defined symbols
and keywords.
▪ During early phases (lexical and syntax analysis)
symbols are discovered and put into the symbol table
▪ During later phases symbols are looked up to validate
their usage.
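The insert-then-look-up pattern described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a dictionary-backed table; the class name `SymbolTable` and the attribute keys `kind` and `type` are hypothetical choices, not part of the lecture.

```python
class SymbolTable:
    """Maps each symbol name to a record (dict) of attributes."""

    def __init__(self, keywords):
        self.entries = {}
        # Keywords are pre-loaded so they can be told apart from identifiers.
        for word in keywords:
            self.entries[word] = {"kind": "keyword"}

    def insert(self, name):
        """Early phases: add the symbol if it is new and return its entry."""
        return self.entries.setdefault(name, {"kind": "identifier"})

    def lookup(self, name):
        """Later phases: return the entry, or None if the symbol was never seen."""
        return self.entries.get(name)


table = SymbolTable(keywords=["if", "then", "else"])
table.insert("oldsum")                    # discovered during lexical analysis
table.insert("oldsum")["type"] = "int"    # a later phase adds an attribute
```

Inserting the same name twice returns the same entry, so every phase sees the attributes recorded by the phases before it.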
Symbol Tables
Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Note: Each token has a unique token identifier that defines its category of lexemes.
Error Management
▪ Errors can occur at all phases in the compiler
▪ Invalid input characters, syntax errors, semantic
errors, etc.
▪ Good compilers will attempt to recover from
errors and continue.
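As a small illustration of recovering and continuing, the sketch below handles only the simplest case named above, invalid input characters: each one is reported and skipped, and scanning goes on. The function name `scan_with_recovery` and the chosen set of valid characters are assumptions made for this example.

```python
import string


def scan_with_recovery(text, valid_chars):
    """Return (kept_text, errors); invalid characters are reported and skipped."""
    kept, errors = [], []
    for position, ch in enumerate(text):
        if ch in valid_chars:
            kept.append(ch)
        else:
            # Record the problem, then carry on instead of aborting.
            errors.append(f"invalid character {ch!r} at position {position}")
    return "".join(kept), errors


# Hypothetical character set for a tiny language.
VALID = set(string.ascii_letters + string.digits + " =+-*/();<>")
cleaned, errors = scan_with_recovery("a = b @ 1 # 2;", VALID)
```

Because the scan continues after the first error, one run reports both the `@` and the `#`.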
Lexical analyzer
▪ Also called a scanner or tokenizer
▪ Converts stream of characters into a stream of
tokens
▪ Tokens are:
Keywords such as for, while, and class.
Special characters such as +, -, (, and <
Variable name occurrences
Constant occurrences such as 1, 0, true.
Lexical analyzer
▪ The lexical analyzer is usually a subroutine of
the parser.
▪ Each token is a single entity. A numerical code
is usually assigned to each type of token.
▪ Lexical analyzers perform:
Line reconstruction
delete comments
delete white spaces
perform text substitution
Lexical translation: translation of lexemes -> tokens
Often additional information is associated with a token.
Lexical analyzer
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand Notation:
“+” : one or more    r* = r+ | ε    r+ = r r*
“?” : zero or one    r? = r | ε
[range] : set range of characters (replaces “|”)
[A-Z] = A | B | C | … | Z
id → [A-Za-z][A-Za-z0-9]*
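The shorthand definition of id carries over directly to Python's `re` module. The sketch below is only an illustration; the helper name `is_identifier` is an assumption.

```python
import re

# id → [A-Za-z][A-Za-z0-9]*   (one letter, then any number of letters or digits)
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(lexeme):
    """True if the whole lexeme matches the id pattern."""
    return ID.fullmatch(lexeme) is not None
```

With this pattern, `oldsum` and `x1` are identifiers, while `1x` is not because it starts with a digit.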
Token Definitions
Example of extracting lexemes and producing the corresponding tokens:
sum = oldsum - value / 100;
Token Lexeme
IDENT sum
ASSIGN_OP =
IDENT oldsum
SUBTRACT_OP -
IDENT value
DIVISION_OP /
INT_LIT 100
SEMICOLON ;
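The table above can be reproduced with a small regular-expression scanner. This is a sketch under assumptions, not the lecture's implementation: the token names come from the table, while `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical names and only the operators of this one statement are covered.

```python
import re

# One (token name, pattern) pair per token class; WS is matched but discarded.
TOKEN_SPEC = [
    ("INT_LIT",     r"[0-9]+"),
    ("IDENT",       r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN_OP",   r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON",   r";"),
    ("WS",          r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Convert a character string into a list of (token, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "WS":        # white space is deleted, not tokenized
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("sum = oldsum - value / 100;")
```

The name of the matching group doubles as the token code, and the matched text is the lexeme.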
Parser
▪ Performs syntax analysis
▪ Imposes syntactic structure on a sentence.
▪ Parse trees are used to expose the structure.
These trees are often not explicitly built.
Simpler representations of them are often used.
▪ The parser accepts a string of tokens and builds a
parse tree representing the program.
Parser
▪ The collection of all the programs in a given
language is usually specified using a list of rules
known as a context free grammar.
Parser
▪ A grammar has four components:
  • A set of tokens known as terminal symbols
  • A set of variables or non-terminals
  • A set of productions where each production
    consists of a non-terminal, an arrow, and a
    sequence of tokens and/or non-terminals
  • A designation of one of the non-terminals as the
    start symbol.
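The four components can be written down as plain data. The sketch below is one possible encoding, assuming a small expression grammar invented for illustration; the class name `Grammar` and the check `is_well_formed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    terminals: set        # tokens
    nonterminals: set     # variables
    productions: list     # (non-terminal, [symbols...]) pairs, read as head → body
    start: str            # the designated start symbol

    def is_well_formed(self):
        """Check that the four components are consistent with each other."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                head in self.nonterminals and all(s in symbols for s in body)
                for head, body in self.productions
            )
        )


# Illustrative grammar for arithmetic expressions (not taken from the slides).
expr_grammar = Grammar(
    terminals={"num", "id", "+", "*", "(", ")"},
    nonterminals={"expr", "term", "factor"},
    productions=[
        ("expr", ["expr", "+", "term"]),
        ("expr", ["term"]),
        ("term", ["term", "*", "factor"]),
        ("term", ["factor"]),
        ("factor", ["(", "expr", ")"]),
        ("factor", ["num"]),
        ("factor", ["id"]),
    ],
    start="expr",
)
```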
Abstract Syntax Tree
▪ The parse tree is used to recognize the components of
the program and to check that the syntax is correct.
▪ As the parser applies productions, it usually generates
the components of a simpler tree (known as an Abstract
Syntax Tree).
▪ The meaning of the component is derived out of the
way the statement is organized in a subtree.
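To make the idea concrete, here is a hedged sketch of a recursive-descent parser that builds an AST directly while it applies productions. The grammar (expressions with +, -, *, / and parentheses), the nested-tuple tree shape, and the name `parse_expression` are all assumptions chosen for this example.

```python
def parse_expression(tokens):
    """Parse lexemes such as ['a', '+', 'b', '*', '2'] into a nested-tuple AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():                      # factor → id | num | ( expr )
        token = peek()
        if token == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()                  # parentheses shape the tree but leave no node
            return node
        if token is not None and token.isalnum():
            return advance()           # identifiers and numbers become leaves
        raise SyntaxError(f"unexpected token {token!r}")

    def term():                        # term → factor ((* | /) factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def expr():                        # expr → term ((+ | -) term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

For `['a', '+', 'b', '*', '2']` the result is `('+', 'a', ('*', 'b', '2'))`: the multiplication sits in a subtree below the addition, so the meaning (multiply first) follows from how the subtree is organized.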
Comparison with Lexical Analysis
Phase     Input                     Output
Lexer     Sequence of characters    Sequence of tokens
Parser    Sequence of tokens        Parse tree
Semantic Analyzer
▪ The semantic analyzer completes the symbol table
with information on the characteristics of each
identifier.
▪ The symbol table is usually initialized during parsing.
▪ One entry is created for each identifier and constant.
▪ Scope is taken into account: two different variables with
the same name will have different entries in the symbol table.
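One common way to take scope into account is a stack of per-scope tables. The sketch below assumes that design; the class `ScopedSymbolTable` and its method names are hypothetical, and a real compiler would usually keep the entries of a closed scope rather than discard them.

```python
class ScopedSymbolTable:
    def __init__(self):
        self.scopes = [{}]             # the innermost scope is the last element

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()              # simplification: entries are discarded

    def declare(self, name, **attributes):
        """Create one entry for the name in the current scope."""
        if name in self.scopes[-1]:
            raise KeyError(f"{name!r} already declared in this scope")
        self.scopes[-1][name] = attributes
        return attributes

    def lookup(self, name):
        """Search from the innermost scope outwards."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None


table = ScopedSymbolTable()
outer_x = table.declare("x", type="int")
table.enter_scope()
inner_x = table.declare("x", type="float")   # same name, different entry
seen_inside = table.lookup("x")
table.exit_scope()
seen_outside = table.lookup("x")
```

Inside the inner scope the lookup finds the inner `x`; after leaving it, the same lookup finds the outer `x` again.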
Translator
▪ The lexical scanner, parser, and semantic analyzer are
collectively known as the front end of the compiler.
▪ The second part, or back end, starts by generating
low-level code from the (possibly optimized) AST.
▪ Rather than generate code for a specific architecture,
most compilers generate an intermediate language.
▪ Three-address code is popular.
Really a flattened tree representation.
Simple.
Flexible (captures the essence of many target
architectures).
Can be interpreted.
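The "flattened tree" view can be shown with a short sketch that walks an expression AST and emits one three-address instruction per operator node. The nested-tuple AST shape, the temporary names `t1`, `t2`, …, and the function name `to_three_address` are assumptions made for illustration.

```python
import itertools


def to_three_address(ast):
    """Flatten a nested-tuple expression AST into three-address instructions."""
    code = []
    temp_numbers = itertools.count(1)

    def emit(node):
        if isinstance(node, str):
            return node                            # leaf: variable or constant
        op, left, right = node
        left_name = emit(left)                     # operands are generated first
        right_name = emit(right)
        result = f"t{next(temp_numbers)}"          # fresh temporary for this node
        code.append(f"{result} := {left_name} {op} {right_name}")
        return result

    result = emit(ast)
    return code, result


code, result = to_three_address(("+", "a", ("*", "b", "2")))
# code is ["t1 := b * 2", "t2 := a + t1"] and result is "t2"
```

Each instruction names at most three addresses (two operands and one result), and the nesting of the tree is replaced by temporaries.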
Optimizers
▪ Intermediate code is examined
and improved.
▪ Can be simple:
changing “a:=a+1” to “increment a”
changing “3*5” to “15”
▪ Can be complicated:
reorganizing data and data accesses for cache efficiency
▪ Optimization can improve running time by orders of
magnitude, often also decreasing program size.
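The simple case of changing “3*5” to “15” is known as constant folding. The sketch below applies it to a nested-tuple expression tree; the tree shape, the `FOLDABLE` table, and the function name `fold_constants` are assumptions for this example, and only +, - and * on integers are folded.

```python
import operator

FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(node):
    """Replace operator nodes whose operands are both integer constants."""
    if not isinstance(node, tuple):
        return node                                # leaf: constant or variable name
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if op in FOLDABLE and isinstance(left, int) and isinstance(right, int):
        return FOLDABLE[op](left, right)           # computed at compile time
    return (op, left, right)


folded = fold_constants(("+", "a", ("*", 3, 5)))
# folded is ("+", "a", 15)
```

Subtrees that involve a variable, such as `("+", "a", 1)`, are left unchanged because their value is not known at compile time.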
Code Generation
▪ Generation of “real executable code” for a particular target machine.
▪ It is completed by the Final Assembly phase.
▪ Final output can either be
assembly language for the target machine
object code ready for linking
▪ The “target machine” can be a virtual machine (such
as the Java Virtual Machine, JVM), and the “real
executable code” is “virtual code” (such as Java
Bytecode).
Compiler Overview
Source Program
    IF (a<b) THEN c=1*d;
Lexical Analyzer → Token Sequence
    IF ( ID “a” < ID “b” ) THEN ID “c” = CONST “1” * ID “d”
Syntax Analyzer → Syntax Tree
    IF_stmt
        cond_expr: <
            a
            b
        list: assign_stmt
            lhs: c
            rhs: *
                1
                d
Semantic Analyzer → 3-Address Code
    GE a, b, L1
    MULT 1, d, c
    L1:
Code Optimizer → Optimized 3-Addr. Code
    GE a, b, L1
    MOV d, c
    L1:
Code Generation → Assembly Code
    loadi R1,a
    cmpi R1,b
    jge L1
    loadi R1,d
    storei R1,c
    L1:
Exercise: Abstract Syntax Tree
x := a + b;
y := a * b;
while (y > a) {
a := a + 1;
x := a + b;
}
Email: [email protected]
http://cist.buct.edu.cn/staff/zheng/
Office: 科410