Transcript
Page 1: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Theory of Computation

Formalism, Computation, & Compilation

Vladimir Kulyukin

Page 2: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Outline

Formalism & Computation

Software/Hardware Duality

Church’s Thesis

Programming Language L

Compilation

Finite State Automata & Tokenization

CFGs & Syntactic Analysis

Recursive-Descent Parsing

Page 3: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Formalism & Computation

Page 4: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Outline

To a software developer, the question “why do we need programming languages?” seems silly: we need programming languages to develop software, of course!

To a CS theorist, there is a different answer: in order to study computation, one must have a formalism in which computation can be expressed

Is there a best formalism to work with? Unlikely, because the formalism we use is inseparable from the computation we study (this is the software/hardware duality principle)

Page 5: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Church’s Thesis

On first look, the previous answer seems circular (and it is, to some extent!): before deciding on a formalism we must have a pretty good idea of the computation we want to study and, vice versa, we cannot begin to study computation until we have a formalism that allows us to express that computation

Chicken-and-egg conundrum: which comes first – formalism or computation?

This is the heart of what is known as Church's thesis

Page 6: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Alonzo Church (1903 - 1995)

Alonzo Church developed λ-calculus, a formal system for defining functions, applying functions, and recursion

Page 7: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Church’s Thesis

The commonsense formulation of Church's Thesis:

Everything computable can be computed by a formalism X

X can be replaced by λ-calculus or a Turing machine or some other formalism (C++, Python, Java, etc.)

Another subtle and often unstated assumption in Church’s thesis is that there is a device that can mechanically execute computational instructions expressed in that formalism

Page 8: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Choice of Formalism

Choice of formalism is both objective and subjective

It is objective in that many formalisms have been shown to be equivalent (at least, on natural numbers): any computation that can be expressed in one can be expressed in another, and vice versa

Similarly, programming languages are equivalent in the sense that an algorithm implemented in one language can be implemented in a different one without any loss of generality (modulo standard tradeoffs such as speed vs. ease of development & maintenance)

It is subjective in that people always have their own personal preferences

Page 9: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Choice of Formalism

There is a simple assembly-like programming language L developed in Chapter 2 of Computability, Complexity, and Languages by Davis, Weyuker, and Sigal

While L is a theoretical construct, it can be thought of as a higher level assembly language

Since L is a programming language, it is, in my humble opinion, more appealing to the programmatically inclined than more formal constructs such as λ-calculus or Turing machines

Page 10: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Programming Language L

Page 11: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

L’s Tokens

Input variables: X1, X2, X3, ...

Local variables: Z1, Z2, Z3, ...

Output variable: Y

Labels: A1, B1, C1, D1, E1, A2, ...

If the subscript is omitted, it is assumed to be 1. For example, X is the same as X1, Z is the same as Z1, A is the same as A1

Page 12: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

L’s Basic Instructions (Primitives)

1. V ← V + 1 (increment)

2. V ← V - 1 (decrement)

3. V ← V (no-op)

4. IF V != 0 GOTO L (cond. branch)

NOTE: In instructions 1, 2, 3 the variables on the left-hand side and the right-hand side are the same

Page 13: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Instruction V ← V + 1

● These instructions are primitives:

X1 ← X1 + 1

Z10 ← Z10 + 1

Y ← Y + 1

X102 ← X102 + 1

● These instructions are NOT primitives:

X1 ← X10 + 1

Z10 ← X1 + 1

Y ← X102 + 1

Page 14: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Instruction V ← V - 1

● These instructions are primitives:

X1 ← X1 - 1

Z10 ← Z10 - 1

Y ← Y - 1

X102 ← X102 - 1

● These instructions are NOT primitives:

X1 ← X10 - 1

Z10 ← X1 - 1

Y ← X102 - 1

Page 15: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Instruction V ← V

● These instructions are primitives:

X1 ← X1

Z10 ← Z10

X120 ← X120

Y ← Y

● These instructions are NOT primitives:

X1 ← Y

X120 ← Z10

Z10 ← X1

Page 16: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

L’s Labeled Primitives

1. [L] V ← V + 1 (increment)

2. [L] V ← V - 1 (decrement)

3. [L] V ← V (no-op)

4. [L] IF V != 0 GOTO L (cond. branch)

NOTE: At the beginning of the line the label is in square brackets. However, in conditional dispatches the square brackets are dropped after GOTO.

Page 17: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Labeled Primitives: Examples

● [A1] X1 ← X1 + 1

● [B1] X23 ← X23 - 1

● [C10] Z12 ← Z12 + 1

● [E1] Y ← Y

● [D101] IF X1 != 0 GOTO E1

Page 18: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Increments and Decrements

● Since there is no upper limit on variable values, the increment instruction always succeeds (there are no buffer overflows):

– V ← V + 1

– In the above instruction V’s value is always incremented by 1

● Since variable values are natural numbers, the decrement instruction has no effect if the value of the variable is 0:

– V ← V - 1

– If V is 0 before the instruction, V remains 0 after the instruction

– If V > 0 before the instruction, V’s value is decremented by 1
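The two rules above can be sketched in Python, whose integers are unbounded, so the "no overflow" property carries over directly. The function names `increment` and `decrement` are illustrative and do not come from the slides.

```python
def increment(v: int) -> int:
    """V <- V + 1: always succeeds, since Python ints have no upper limit."""
    return v + 1


def decrement(v: int) -> int:
    """V <- V - 1 on natural numbers: 0 stays 0, anything else drops by 1."""
    return max(v - 1, 0)


print(increment(0), decrement(0), decrement(5))  # prints: 1 0 4
```

Clamping at 0 with `max` is one simple way to keep every variable a natural number.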

Page 19: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

The Output Value of L’s Program

● The output value of an L program is the value of the Y variable

● If an L program goes into an infinite loop, the value is undefined

● Thus, an L program implements a function that maps the values of the input variables into the value of Y

Page 20: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Exit Label E

● We will assume that each L program has a unique exit label E (or E1)

● If a conditional dispatch with GOTO E or GOTO E1 is executed, control exits the program and its execution terminates

● If we want to be explicit about this, we can assume that the implicit last statement of every L-program is [E1] return Y

Page 21: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Example

$$
f(x) = \begin{cases} 1 & \text{if } x = 0 \\ x & \text{otherwise} \end{cases}
$$

Page 22: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Implementing f(x) in L

[A1] X1 ← X1 - 1

Y ← Y + 1

IF X1 != 0 GOTO A1

Or, if we do not want to use subscripts:

[A] X ← X - 1

Y ← Y + 1

IF X != 0 GOTO A
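To see that this program computes f, here is a minimal interpreter sketch for L in Python. The tuple encoding of instructions, the name `run`, and the `max_steps` guard are assumptions made for this illustration; the slides do not prescribe any of them. A GOTO to a label that no instruction carries (such as E1) is treated as program exit.

```python
from collections import defaultdict

# Each instruction is (label_or_None, op, variable, goto_target_or_None).
PROGRAM = [
    ("A1", "dec", "X1", None),   # [A1] X1 <- X1 - 1
    (None, "inc", "Y", None),    #      Y  <- Y + 1
    (None, "if", "X1", "A1"),    #      IF X1 != 0 GOTO A1
]


def run(program, inputs, max_steps=10_000):
    """Execute an L program and return the value of Y.

    `inputs` maps input variable names to natural numbers; every other
    variable starts at 0. `max_steps` is a guard added for this sketch only.
    """
    env = defaultdict(int, inputs)
    # Map each label to the index of the first instruction that carries it.
    labels = {}
    for index, (label, _, _, _) in enumerate(program):
        if label is not None and label not in labels:
            labels[label] = index
    pc, steps = 0, 0
    while pc < len(program):
        if steps == max_steps:
            raise RuntimeError("step limit reached; the program may not terminate")
        steps += 1
        _, op, var, target = program[pc]
        if op == "inc":
            env[var] += 1
        elif op == "dec":
            env[var] = max(env[var] - 1, 0)
        elif op == "if" and env[var] != 0:
            if target not in labels:
                return env["Y"]      # GOTO E1 (or any unbound label) exits
            pc = labels[target]
            continue
        # "nop" and an IF whose variable is 0 fall through to the next line.
        pc += 1
    return env["Y"]                  # ran off the end: implicit [E1] return Y


print([run(PROGRAM, {"X1": x}) for x in range(5)])  # prints: [1, 1, 2, 3, 4]
```

The printed list shows f(0) = 1 and f(x) = x for the other inputs tried. A real L program that loops forever has an undefined value; the step limit here only turns that into an error for the sake of the demonstration.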

Page 23: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Compiling L-Programs

Page 24: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Three Stages of Compilation

● Syntactic Analysis: The source program is processed to determine its conformity to the language grammar and its structure

● Contextual Analysis: The output of the syntactic analysis (a parse tree) is checked for its conformity to the language’s contextual constraints

● Code Generation: The checked parse tree is used to generate the target code, e.g. Java byte code or assembly or some other target language

Page 25: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Components of Syntactic Analysis

● Syntactic Analysis consists of Tokenization and Parsing

● Tokenization: We have to define a set of FA’s (regular expressions) to tokenize input statements (primitive instructions)

● Parsing: We have to define a CFG to map tokenized input statements (primitive instructions) into parse trees

Page 26: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Two Basic Design Principles

● Zero Token Ambiguity: Each sequence of non-white-space characters must be mapped to at most one token

● Zero Statement (Instruction) Ambiguity: Each sequence of tokens recognized in between the beginning of a line and a newline character must have at most one parse tree

Page 27: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of L-Programs

Page 28: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Sample L Program

Here is a sample program in L:

[A1] X1 <= X1 - 1

Y <= Y + 1

IF X1 != 0 GOTO A1

Page 29: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Input Variables (InputVarToken)

Input variables are tokens of the form X1, X2, X3, etc. In general, an input variable is Xk, where k is a natural number greater than 0. An NFA is as follows:

X, then one digit in [1-9], then a loop on [0-9] (as a regular expression: X[1-9][0-9]*)
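The automaton for input variables can be simulated by hand. The sketch below is a deterministic version with three states; the state names and the function name `accepts_input_var` are invented for this illustration. It follows the automaton as drawn, so a bare X (with the subscript omitted) is rejected.

```python
def accepts_input_var(s: str) -> bool:
    """Simulate a three-state automaton for X[1-9][0-9]*."""
    state = "start"
    for ch in s:
        if state == "start" and ch == "X":
            state = "seen_x"
        elif state == "seen_x" and ch in "123456789":
            state = "digits"          # the only accepting state
        elif state == "digits" and ch in "0123456789":
            state = "digits"          # loop on further digits
        else:
            return False              # no transition available: reject
    return state == "digits"


for candidate in ["X1", "X102", "X", "X0", "Y"]:
    print(candidate, accepts_input_var(candidate))
```

The same shape works for local variables by replacing X with Z.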

Page 30: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Output Variables (OutputVarToken)

L has only one output variable: Y. Here is an NFA:

Y

Page 31: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Local Variables (LocalVarToken)

Local variables are tokens of the form Z1, Z2, Z3, etc. In general, a local variable is Zk, where k is a natural number greater than 0. An NFA is as follows:

Z, then one digit in [1-9], then a loop on [0-9] (as a regular expression: Z[1-9][0-9]*)

Page 32: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Labels

● There are two places where a label can occur in a primitive instruction: at the beginning of a line and at the end of a line

● At the beginning of a line a label is bracketed; at the end of a line it is not

● Furthermore, labels that start with A, B, C, D are non-exit labels; labels that start with E are exit labels

Page 33: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Non-Exit Non-Bracketed Labels (NExLblToken)

Non-exit labels that occur at the end of a line are tokens of the form Λ1, Λ2, Λ3, etc. In general, a label is Λk, where k is a natural number greater than 0 and Λ is in {A, B, C, D}. An NFA is as follows:

One of A, B, C, D, then one digit in [1-9], then a loop on [0-9] (as a regular expression: [A-D][1-9][0-9]*)

Page 34: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Non-Exit Bracketed Labels (NExBrLblToken)

Non-exit labels that occur at the beginning of a line are tokens of the form [Λ1], [Λ2], [Λ3], etc. In general, a label is [Λk], where k is a natural number greater than 0 and Λ is in {A, B, C, D}. An NFA is as follows:

[, then one of A, B, C, D, then one digit in [1-9], then a loop on [0-9], then ] (as a regular expression: \[[A-D][1-9][0-9]*\])

Page 35: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Exit Non-Bracketed Label (ExLblToken)

Every L program has a unique exit label (E1). If the exit label occurs at the end of a line, it is not bracketed. An NFA is as follows (this assumes that we always use the numeral 1):

E 1

Page 36: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization: Exit Bracketed Label (ExBrLblToken)

Every L program has a unique exit label (E1). If the exit label occurs at the beginning of a line, it is bracketed. An NFA is as follows:

[, then E, then 1, then ] (as a regular expression: \[E1\])

Page 37: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Operators

There are four operator tokens in L: <=, +, -, != . Here are possible NFAs for operators:

AssignOperToken: < then =

NotEqOperToken: ! then =

PlusOperToken: +

MinusOperToken: -

Page 38: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Keywords

L has two keywords: IF and GOTO. Two possible NFAs:

IFToken: I then F

GOTOToken: G then O then T then O

Page 39: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Literals

L has 2 literals: 0 and 1. Two possible NFAs:

ZeroLitToken: 0

OneLitToken: 1

Page 40: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Complete List of Tokens

1. InputVarToken

2. OutputVarToken

3. LocalVarToken

4. NExLblToken

5. ExLblToken

6. NExBrLblToken

7. ExBrLblToken

8. AssignOperToken

9. NotEqOperToken

10. PlusOperToken

11. MinusOperToken

12. IFToken

13. GOTOToken

14. ZeroLitToken

15. OneLitToken

Page 41: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization Algorithm: Outline

● Read in a line of text

● Partition the line into substrings on white space

● Run each substring through all possible NFAs

● Each substring can be recognized by at most one NFA

● If a substring is not recognized by an NFA, report an error; otherwise, create an appropriate token, depending on what NFA recognized the substring

● The output is a sequence of tokens
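The outline above can be sketched in Python with one regular expression per token type standing in for each NFA. The table `TOKEN_PATTERNS`, the function `tokenize_line`, and the choice to represent a token as a (name, lexeme) pair are assumptions of this sketch. For bracketed labels the lexeme is taken without the brackets.

```python
import re

# (token name, regular expression) pairs; group 1, when present, is the lexeme.
TOKEN_PATTERNS = [
    ("InputVarToken",   r"X[1-9][0-9]*"),
    ("OutputVarToken",  r"Y"),
    ("LocalVarToken",   r"Z[1-9][0-9]*"),
    ("NExLblToken",     r"[A-D][1-9][0-9]*"),
    ("ExLblToken",      r"E1"),
    ("NExBrLblToken",   r"\[([A-D][1-9][0-9]*)\]"),
    ("ExBrLblToken",    r"\[(E1)\]"),
    ("AssignOperToken", r"<="),
    ("NotEqOperToken",  r"!="),
    ("PlusOperToken",   r"\+"),
    ("MinusOperToken",  r"-"),
    ("IFToken",         r"IF"),
    ("GOTOToken",       r"GOTO"),
    ("ZeroLitToken",    r"0"),
    ("OneLitToken",     r"1"),
]


def tokenize_line(line):
    """Split a line on white space and map each substring to one token."""
    tokens = []
    for substring in line.split():
        matches = []
        # Run the substring through every pattern, as the outline suggests.
        for name, pattern in TOKEN_PATTERNS:
            m = re.fullmatch(pattern, substring)
            if m:
                lexeme = m.group(1) if m.groups() else substring
                matches.append((name, lexeme))
        # Zero matches is a lexical error; more than one would be ambiguity.
        if len(matches) != 1:
            raise ValueError(f"cannot tokenize {substring!r}")
        tokens.append(matches[0])
    return tokens


print(tokenize_line("IF X1 != 0 GOTO A1"))
```

Trying every pattern, rather than stopping at the first match, lets the sketch detect a violation of the zero-token-ambiguity principle instead of silently picking one token type.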

Page 42: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Back to Sample L Program

Here is a sample program in L:

[A1] X1 <= X1 - 1

Y <= Y + 1

IF X1 != 0 GOTO A1

Page 43: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Line 1

● "[A1] X1 <= X1 - 1"

● White space partitioning gives us the following substrings: "[A1]", "X1", "<=", "X1", "-", "1"

● "[A1]" is recognized by the Non-Exit Bracketed Label NFA; so create NExBrLblToken("A1")

● "X1" is recognized by the Input Variable NFA; so create InputVarToken("X1")

● "<=" is recognized by the Assignment Operator NFA; so create AssignOperToken("<=")

● "X1" is recognized by the Input Variable NFA; so create InputVarToken("X1")

● "-" is recognized by the Minus Operator NFA; so create MinusOperToken("-")

● "1" is recognized by the One Literal NFA; so create OneLitToken("1")

● The output is this sequence of tokens: <NExBrLblToken("A1"), InputVarToken("X1"), AssignOperToken("<="), InputVarToken("X1"), MinusOperToken("-"), OneLitToken("1")>

Page 44: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Line 1

The line "[A1] X1 <= X1 - 1" gives us the following sequence of tokens:

NExBrLblToken("A1") InputVarToken("X1") AssignOperToken("<=") InputVarToken("X1") MinusOperToken("-") OneLitToken("1")

Page 45: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Line 2

The line "Y <= Y + 1" gives us the following sequence of tokens:

OutputVarToken("Y") AssignOperToken("<=") OutputVarToken("Y") PlusOperToken("+") OneLitToken("1")

Page 46: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Tokenization of Line 3

The line "IF X1 != 0 GOTO A1" gives us the following sequence of tokens:

IFToken("IF") InputVarToken("X1") NotEqOperToken("!=") ZeroLitToken("0") GOTOToken("GOTO") NExLblToken("A1")

Page 47: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing

Page 48: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Recursive Descent Parsing

● Recursive Descent Parsing is an algorithm that should be considered for any unambiguous CF grammar

● All programming languages are specified either with unambiguous CF grammars or with ambiguous CF grammars where ambiguity can be easily handled (e.g., look-ahead)

● The basic step in designing an RD parser is to design a parsing procedure parseN for every non-terminal symbol N in the grammar

Page 49: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Developing Recursive-Descent Parser for L

● To develop a recursive-descent parser for L we need to accomplish three tasks:

– Develop a CFG G for L

– Derive a set of RD parsing procedures from G

– Implement the procedures in a programming language (Java, Python, C/C++, C#, etc.)

Page 50: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

A CFG Grammar for L

Page 51: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

CFG Productions

● Incrmnt → VarToken AssignOperToken VarToken PlusOperToken OneLitToken

Note: this rule is simplified, because, technically speaking, VarToken is not present in the list of tokens. So, we have to write an additional production of the form:

VarToken → InputVarToken | OutputVarToken | LocalVarToken

● Decrmnt → VarToken AssignOperToken VarToken MinusOperToken OneLitToken

● NOP → VarToken AssignOperToken VarToken

● CDisp → IFToken VarToken NotEqOperToken ZeroLitToken GOTOToken DispLBL

● DispLBL → NExLblToken | ExLblToken

Page 52: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

CFG Productions

● LProgram → LInstructSEQ

To recognize an L program is to recognize a sequence of L instructions

● LInstructSEQ → ε

A sequence of L instructions can be empty

● LInstructSEQ → LInstruct LInstructSEQ

A non-empty sequence of L instructions starts with an L instruction and is followed by a sequence of L instructions

Page 53: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

CFG Productions

● LInstruct → LblStmnt | Stmnt

To recognize an L instruction is to recognize a labeled statement or an unlabeled statement

● LblStmnt → BrLBL Stmnt

To recognize a labeled statement is to recognize a bracketed label and then to recognize a statement

● BrLBL → NExBrLblToken | ExBrLblToken

To recognize a bracketed label is to recognize a non-exit bracketed label token or to recognize an exit bracketed label token (note that NExBrLblToken and ExBrLblToken are tokens, not syntactic categories)

Page 54: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Recursive-Descent Parsing Procedures

Page 55: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Procedures for L

● Let us agree that each parsing procedure returns a ParseTree data structure (the base class)

● Consider the first rule in our grammar: LProgram → LInstructSEQ

ParseTree parseLProgram(input, start_pos)
{
    // A program is just an instruction sequence, so delegate to it.
    ParseTree progTree = parseLInstructSEQ(input, start_pos);
    return progTree;
}

Page 56: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseLInstructSEQ Procedure

● There are 2 productions:

1) LInstructSEQ → ε

2) LInstructSEQ → LInstruct LInstructSEQ

ParseTree parseLInstructSEQ(input, start_pos) {
    // Production 1: no tokens remain, so the sequence is empty.
    if ( start_pos is past the end of input )
        return the empty LInstructSEQ;
    else {
        // Production 2: one instruction, then the rest of the sequence.
        ParseTree firstIns = parseLInstruct(input, start_pos);
        if ( firstIns == null ) return null;
        ParseTree restInstructs = parseLInstructSEQ(input, firstIns.getNextPos());
        return new LInstructSEQ(firstIns, restInstructs);
    }
}

Page 57: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseLInstruct Procedure

● Two productions for LInstruct: LInstruct → LblStmnt | Stmnt

ParseTree parseLInstruct(input, start_pos) {
    // Try the labeled alternative first; fall back to a plain statement.
    ParseTree lblSt = parseLblStmnt(input, start_pos);
    if ( lblSt == null )
        return parseStmnt(input, start_pos);
    else
        return lblSt;
}

Page 58: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseLblStmnt Procedure

● G has one production for LblStmnt: LblStmnt → BrLBL Stmnt

ParseTree parseLblStmnt(input, start_pos) {
    ParseTree brLbl = parseBrLbl(input, start_pos);
    if ( brLbl == null ) return null;
    else {
        // The statement starts right after the bracketed label.
        ParseTree stmnt = parseStmnt(input, brLbl.getNextPos());
        if ( stmnt == null ) return null;
        else
            return new LblStmnt(brLbl, stmnt);
    }
}

Page 59: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseBrLbl Procedure

● G has two productions for BrLBL:

BrLBL → NExBrLblToken | ExBrLblToken

● Note that both right-hand sides consist of tokens; they do not need to be parsed, because they are terminals to the parser

● So, in this case, instead of parsing we have to make sure that these terminals are in the input

Page 60: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseBrLbl Procedure

ParseTree parseBrLbl(input, start_pos) {
    // Both alternatives are single tokens; == here compares the token type.
    if ( input[start_pos] == NExBrLblToken )
        return new BrLbl(input[start_pos]);
    else if ( input[start_pos] == ExBrLblToken )
        return new BrLbl(input[start_pos]);
    else
        return null;
}

Page 61: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseIncrmnt Procedure

● The rest of the parsing procedures can be derived in a similar fashion

● There is one rule for Incrmnt: Incrmnt → VarToken AssignOperToken VarToken PlusOperToken OneLitToken

● This rule does not require any parsing; it requires only matching of tokens

Page 62: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

ParseIncrmnt Procedure

ParseTree parseIncrmnt(input, start_pos) {
    // Match the five tokens of the rule in order; != compares the token type.
    if ( input[start_pos] != VarToken ) return null;
    else if ( input[start_pos+1] != AssignOperToken ) return null;
    else if ( input[start_pos+2] != VarToken ) return null;
    else if ( input[start_pos+3] != PlusOperToken ) return null;
    else if ( input[start_pos+4] != OneLitToken ) return null;
    else
        // Build the node from the matched tokens themselves.
        return new Incrmnt(input[start_pos], input[start_pos+1], input[start_pos+2],
                           input[start_pos+3], input[start_pos+4]);
}
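The parsing procedures above can be put together into a small runnable sketch in Python. Several things here are assumptions of the sketch rather than part of the slides: tokens are (name, lexeme) pairs, parse trees are nested tuples, each procedure returns a (tree, next position) pair or None, and the production Stmnt → Incrmnt | Decrmnt | NOP | CDisp is assumed because the slides use Stmnt without listing its rule. The requirement that both sides of an assignment name the same variable is not checked here; that would be a job for contextual analysis.

```python
VAR_TOKENS = {"InputVarToken", "OutputVarToken", "LocalVarToken"}
BR_LABELS = {"NExBrLblToken", "ExBrLblToken"}
DISP_LABELS = {"NExLblToken", "ExLblToken"}

# Right-hand sides of the token-only productions. NOP comes last because it
# is a prefix of Incrmnt and Decrmnt.
STATEMENT_SHAPES = {
    "Incrmnt": ["VarToken", "AssignOperToken", "VarToken", "PlusOperToken", "OneLitToken"],
    "Decrmnt": ["VarToken", "AssignOperToken", "VarToken", "MinusOperToken", "OneLitToken"],
    "CDisp": ["IFToken", "VarToken", "NotEqOperToken", "ZeroLitToken", "GOTOToken", "DispLBL"],
    "NOP": ["VarToken", "AssignOperToken", "VarToken"],
}


def matches(token, symbol):
    """Check one token against one grammar symbol."""
    kind = token[0]
    if symbol == "VarToken":
        return kind in VAR_TOKENS
    if symbol == "DispLBL":
        return kind in DISP_LABELS
    return kind == symbol


def parse_stmnt(tokens, pos):
    # Stmnt -> Incrmnt | Decrmnt | NOP | CDisp (assumed production)
    for name, shape in STATEMENT_SHAPES.items():
        chunk = tokens[pos:pos + len(shape)]
        if len(chunk) == len(shape) and all(map(matches, chunk, shape)):
            return (name, *chunk), pos + len(shape)
    return None


def parse_lbl_stmnt(tokens, pos):
    # LblStmnt -> BrLBL Stmnt
    if pos >= len(tokens) or tokens[pos][0] not in BR_LABELS:
        return None
    result = parse_stmnt(tokens, pos + 1)
    if result is None:
        return None
    stmnt, next_pos = result
    return ("LblStmnt", ("BrLBL", tokens[pos]), stmnt), next_pos


def parse_linstruct(tokens, pos):
    # LInstruct -> LblStmnt | Stmnt: try the labeled form first.
    return parse_lbl_stmnt(tokens, pos) or parse_stmnt(tokens, pos)


def parse_linstruct_seq(tokens, pos=0):
    # LInstructSEQ -> epsilon | LInstruct LInstructSEQ
    if pos >= len(tokens):
        return ("LInstructSEQ",)
    result = parse_linstruct(tokens, pos)
    if result is None:
        raise SyntaxError(f"no instruction starts at token {pos}")
    first, next_pos = result
    return ("LInstructSEQ", first, parse_linstruct_seq(tokens, next_pos))


line2 = [("OutputVarToken", "Y"), ("AssignOperToken", "<="),
         ("OutputVarToken", "Y"), ("PlusOperToken", "+"), ("OneLitToken", "1")]
print(parse_linstruct_seq(line2))
```

Because the sequence procedure recurses once per instruction, a very long program would hit Python's recursion limit; a loop would avoid that at the cost of no longer mirroring the grammar rule directly.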

Page 63: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Back to Sample L-Program

Let us parse the following L program:

[A1] X1 <= X1 - 1

Y <= Y + 1

IF X1 != 0 GOTO A1

Page 64: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: Line 1 Tokenized

The line "[A1] X1 <= X1 - 1" gives us the following sequence of tokens:

NExBrLblToken("A1") InputVarToken("X1") AssignOperToken("<=") InputVarToken("X1") MinusOperToken("-") OneLitToken("1")

Page 65: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: Line 1 ParseTree

LInstruct
    LblStmnt
        BrLbl
            NExBrLblToken "[A1]"
        Stmnt
            Decrmnt
                InputVarToken "X1"
                AssignOperToken "<="
                InputVarToken "X1"
                MinusOperToken "-"
                OneLitToken "1"

Page 66: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: Line 2 Tokenized

The line "Y <= Y + 1" gives us the following sequence of tokens:

OutputVarToken("Y") AssignOperToken("<=") OutputVarToken("Y") PlusOperToken("+") OneLitToken("1")

Page 67: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: Line 2 ParseTree

LInstruct
    Stmnt
        Incrmnt
            OutputVarToken "Y"
            AssignOperToken "<="
            OutputVarToken "Y"
            PlusOperToken "+"
            OneLitToken "1"

Page 68: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: Line 3 Tokenized

The line "IF X1 != 0 GOTO A1" gives us the following sequence of tokens:

IFToken("IF") InputVarToken("X1") NotEqOperToken("!=") ZeroLitToken("0") GOTOToken("GOTO") NExLblToken("A1")

Page 69: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: Line 3 ParseTree

LInstruct
    Stmnt
        CDisp
            IFToken "IF"
            InputVarToken "X1"
            NotEqOperToken "!="
            ZeroLitToken "0"
            GOTOToken "GOTO"
            NExLblToken "A1"

Page 70: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

Parsing Example: LProgram ParseTree

LProgram
    LInstructSEQ
        LInstruct "[A1] X1 <= X1 - 1"
        LInstruct "Y <= Y + 1"
        LInstruct "IF X1 != 0 GOTO A1"

Page 71: Theory of Computation (Fall 2014): Formalism, Computation, & Compilation

References & Reading Suggestions

Hopcroft and Ullman. Introduction to Automata Theory, Languages, and Computation, Narosa Publishing House

Moll, Arbib, and Kfoury. An Introduction to Formal Language Theory

Davis, Weyuker, Sigal. Computability, Complexity, and Languages, 2nd Edition, Academic Press

Brooks Webber. Formal Language: A Practical Introduction, Franklin, Beedle & Associates, Inc