The View from 35,000 Feet
rial size=-1 color=black>
The View from 35,000 Feet
The View from 35,000 Feet
Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these
materials for their personal use.
Implications
Must recognize legal (and illegal) programs
Must generate correct code
Must manage storage of all variables (and code)
Must agree with OS & linker on format for object code
Big step up from assembly languageuse higher level notations
High-level View of a Compiler
Source
code
Machine
code
Compiler
Errors
Traditional Two-pass Compiler
Implications
Use an intermediate representation (
IR
)
Front end maps legal source code into
IR
Back end maps IR into target machine code
Admits multiple front ends & multiple passes
(
better code
)
Typically, front end is O(n) or O(n log n), while back end is NPC
(NP-complete)
Source
code
Front
End
Errors
Machine
code
Back
End
IR
Can we build
n x m
compilers with
n+m
components?
Must encode all language specific knowledge in each front end
Must encode all features in a single IR
Must encode all target specific knowledge in each back end
Limited success in systems with very low-level
IR
s
(and by narrowing range of languages/targets)
A Common Fallacy
Fortran
Scheme
Java
Smalltalk
Front
end
Front
end
Front
end
Front
end
Back
end
Back
end
Target 2
Target 1
Target 3
Back
end
Responsibilities
Recognize legal (& illegal) programs
Report errors in a useful way
Produce
IR
& preliminary storage map
Shape the code for the back end
Much of front end construction can be automated
The Front End
Source
code
Scanner
IR
Parser
Errors
tokens
The Front End
Scanner
Maps character stream into wordsthe basic unit of syntax
Produces pairs a word & its part of speech
x = x + y ;
becomes
<
id
,x> = <
id
,x> + <
id
,y> ;
word lexeme, part of speech token type
In casual speech, we call the pair a
token
Typical tokens include
number
,
identifier
, +, , new,
while
,
if
Scanner eliminates white space (
including comments
)
Speed is important
Source
code
Scanner
IR
Parser
Errors
tokens
The Front End
Parser
Recognizes context-free syntax & reports errors
Guides context-sensitive (semantic) analysis (
type checking
)
(Hence the term syntax-directed translation)
Builds
IR
for source program
Hand-coded parsers are fairly easy to build
Most books advocate using automatic parser generators
Source
code
Scanner
IR
Parser
Errors
tokens
The Front End
Context-free syntax is specified with a grammar
SheepNoise
SheepNoise
baa
| baa
This grammar defines the set of noises that a sheep makes
under normal circumstances
It is written in a variant of BackusNaur Form (BNF)
Formally, a grammar
G = (S,N,T,P)
S
is the
start symbol
N
is a set of
non-terminal symbols
T
is a set of
terminal symbols
or
words
P
is a set of
productions
or
rewrite rules
(
P : N
N
T
)
(Example due to Dr. Scott K. Warren)
Context-free syntax can be put to better use
This grammar defines simple expressions with addition &
subtraction over number and id
This grammar, like many, falls in a class called context-free
grammars, abbreviated
CFG
The Front End
1.
goal
expr
2.
expr
expr op term
3. |
term
4.
term
number
5. | id
6.
op
+
7. | -
S
=
goal
T
= { number, id, +, - }
N = {
goal
,
expr
,
term
,
op
}
P = { 1, 2, 3, 4, 5, 6, 7}
Given a
CFG
, we can
derive
sentences by repeated substitution
To recognize a valid sentence in some CFG, we reverse this
process and build up a
parse
The Front End
Production Result
goal
1
expr
2
expr op term
5
expr op
y
7
expr
- y
2
expr op term
- y
4
expr op
2 - y
6
expr
+ 2 - y
3
term
+ 2 - y
5
x + 2 - y
The Front End
A parse can be represented by a tree (
parse tree
or
syntax tree
)
x + 2 - y
This contains a lot of unneeded
information.
term
op
term
expr
term
expr
goal
expr
op
<
id,
x
>
<
number,
2
>
<
id,
y
>
+
-
1.
goal
expr
2.
expr
expr op term
3. |
term
4.
term
number
5. | id
6.
op
+
7. | -
The Front End
Compilers often use an
abstract syntax tree
This is much more concise
AST
s are one kind of
intermediate representation (
IR
)
+
-
<
id,
x
>
<
number,
2
>
<
id,
y
>
The
AST
summarizes
grammatical structure,
without including detail
about the derivation
The Back End
Responsibilities
Translate IR into target machine code
Choose instructions to implement each IR operation
Decide which value to keep in registers
Ensure conformance with system interfaces
Automation has been
less
successful in the back end
Errors
IR
Register
Allocation
Instruction
Selection
Machine
code
Instruction
Scheduling
IR
IR
The Back End
Instruction Selection
Produce fast, compact code
Take advantage of target features such as addressing modes
Usually viewed as a pattern matching problem
ad hoc
methods, pattern matching, dynamic programming
This was the problem of the future in 1978
Spurred by transition from
PDP-11
to
VAX-11
Orthogonality of
RISC
simplified this problem
Errors
IR
Register
Allocation
Instruction
Selection
Machine
code
Instruction
Scheduling
IR
IR
The Back End
Register Allocation
Have each value in a register when it is used
Manage a limited set of resources
Can change instruction choices & insert
LOAD
s &
STORE
s
Optimal allocation is NP-Complete
(1 or
k
registers)
Compilers approximate solutions to NP-Complete problems
Errors
IR
Register
Allocation
Instruction
Selection
Machine
code
Instruction
Scheduling
IR
IR
The Back End
Instruction Scheduling
Avoid hardware stalls and interlocks
Use all functional units productively
Can increase lifetime of variables
(changing the allocation)
Optimal scheduling is NP-Complete in nearly all cases
Heuristic techniques are well developed
Errors
IR
Register
Allocation
Instruction
Selection
Machine
code
Instruction
Scheduling
IR
IR
Traditional Three-pass Compiler
Code Improvement (or
Optimization
)
Analyzes
IR
and rewrites (or
transforms
)
IR
Primary goal is to reduce running time of the compiled code
May also improve space, power consumption,
Must preserve meaning of the code
Measured by values of named variables
Subject of Cmp Sci 710 (et al.), introduction in this course
Errors
Source
Code
Middle
End
Front
End
Machine
code
Back
End
IR
IR
The Optimizer (or Middle End)
Typical Transformations
Discover & propagate some constant value
Move a computation to a less frequently executed place
Specialize some computation based on context
Discover a redundant computation & remove it
Remove useless or unreachable code
Encode an idiom in some particularly efficient form
Errors
O
p
t
1
O
p
t
3
O
p
t
2
O
p
t
n
...
IR
IR
IR
IR
IR
Modern optimizers are structured as a series of passes
Example
Optimization of Subscript Expressions in Fortran
Address(A(I,J)) = address(A(0,0)) + J * (column size) + I
Does the user realize a multiplication
is generated here?
Example
Optimization of Subscript Expressions in Fortran
Address(A(I,J)) = address(A(0,0)) + J * (column size) + I
Does the user realize a multiplication
is generated here?
DO I = 1, M
A(I,J) = A(I,J) + C
ENDDO
Example
Optimization of Subscript Expressions in Fortran
Address(A(I,J)) = address(A(0,0)) + J * (column size) + I
Does the user realize a multiplication
is generated here?
DO I = 1, M
A(I,J) = A(I,J) + C
ENDDO
compute addr(A(0,J)
DO I = 1, M
add 1 to get addr(A(I,J)
A(I,J) = A(I,J) + C
ENDDO
Modern Restructuring Compiler
Typical
Restructuring
Transformations:
Blocking for memory hierarchy and register reuse
Vectorization
Parallelization
All based on dependence
Also full and partial inlining
Subject of Cmp Sci 710
Errors
Source
Code
Restructure
r
Front
End
Machine
code
Opt +
Ba