Understanding Source Code Evolution Using Abstract Syntax Tree Matching
tiu@cs.umd.edu
Jeffrey S. Foster
jfoster@cs.umd.edu
Michael Hicks
mwh@cs.umd.edu
Department of Computer Science
University of Maryland at College Park
ABSTRACT
Mining software repositories at the source code level can pro-
vide a greater understanding of how software evolves. We
present a tool for quickly comparing the source code of dif-
ferent versions of a C program. The approach is based on
partial abstract syntax tree matching, and can track sim-
ple changes to global variables, types and functions. These
changes can characterize aspects of software evolution use-
ful for answering higher level questions. In particular, we
consider how they could be used to inform the design of a
dynamic software updating system. We report results based
on measurements of various versions of popular open source
programs, including BIND, OpenSSH, Apache, Vsftpd and
the Linux kernel.
Categories and Subject Descriptors
F.3.2 [Logics And Meanings Of Programs]: Semantics
of Programming LanguagesProgram Analysis
General Terms
Languages, Measurement
Keywords
Source code analysis, abstract syntax trees, software evolu-
tion
1.
INTRODUCTION
Understanding how software evolves over time can im-
prove our ability to build and maintain it.
Source code
repositories contain rich historical information, but we lack
eective tools to mine repositories for key facts and statistics
that paint a clear image of the software evolution process.
Our interest in characterizing software evolution is mo-
tivated by two problems.
First, we are interested in dy-
namic software updating (DSU), a technique for xing bugs
or adding features in running programs without halting ser-
vice [4]. DSU can be tricky for programs whose types change,
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for prot or commercial advantage and that copies
bear this notice and the full citation on the rst page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specic
permission and/or a fee.
MSR 05 Saint Louis, MO USA
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...
$
5.00.
so understanding how the type structure of real programs
changes over time can be invaluable for weighing the merits
of DSU implementation choices. Second, we are interested in
a kind of release digest for explaining changes in a software
release: what functions or variables have changed, where the
hot spots are, whether or not the changes aect certain com-
ponents, etc. Typical release notes can be too high level for
developers, and output from diff can be too low level.
To answer these and other software evolution questions,
we have developed a tool that can quickly tabulate and sum-
marize simple changes to successive versions of C programs
by partially matching their abstract syntax trees. The tool
identies the changes, additions, and deletions of global vari-
ables, types, and functions, and uses this information to re-
port a variety of statistics.
Our approach is based on the observation that for C pro-
grams, function names are relatively stable over time. We
analyze the bodies of functions of the same name and match
their abstract syntax trees structurally. During this process,
we compute a bijection between type and variable names in
the two program versions. We then use this information to
determine what changes have been made to the code. This
approach allows us to report a name or type change as single
dierence, even if it results in multiple changes to the source
code. For example, changing a variable name from x to y
would cause a tool like diff to report all lines that formerly
referred to x as changed (since they would now refer to y),
even if they are structurally the same. Our system avoids
this problem.
We have used our tool to study the evolution history of a
variety of popular open source programs, including Apache,
OpenSSH, Vsftpd, Bind, and the Linux kernel. This study
has revealed trends that we have used to inform our de-
sign for DSU. In particular, we observed that function and
global variable additions are far more frequent than dele-
tions; the rates of addition and deletion vary from program
to program.
We also found that function bodies change
quite frequently over time, but function prototypes change
only rarely. Finally, type denitions (like struct and union
declarations) change infrequently, and often in simple ways.
2.
APPROACH
Figure 1 provides an overview of our tool. We begin by
parsing the two program versions to produce abstract syn-
tax trees (ASTs), which we traverse in parallel to collect
type and name mappings. With the mappings at hand, we
then detect and collect changes to report to the user, either
AST 1
Parser
Parser
AST 2
Program version 1
Program version 2
Facts Processor
Type Matchings
Bijection Computation
Changes
&
Statistics
Change
Detector
Name Matchings
Figure 1: High level view of AST matching
typedef i n t
s z t ;
i n t c o u n t ;
s t r u c t f o o {
i n t
i ;
f l o a t
f ;
char c ;
} ;
i n t baz ( i n t a , i n t b ) {
s t r u c t f o o
s f ;
s z t
c = 2 ;
s f . i = a + b + c ;
c o u n t ++;
}
i n t c o u n t e r ;
typedef i n t
s i z e t ;
s t r u c t b a r {
i n t
i ;
f l o a t
f ;
char c ;
} ;
i n t baz ( i n t d , i n t e ) {
s t r u c t b a r s b ;
s i z e t g = 2 ;
s b . i = d + e + g ;
c o u n t e r ++;
}
void
b i f f ( void ) { }
Figure 2: Two successive program versions
directly or in summary form. In this section, we describe
the matching algorithm, illustrate how changes are detected
and reported, and describe our implementation and its per-
formance.
2.1
AST Matching
Figure 2 presents an example of two successive versions
of a program.
Assuming the example on the left is the
initial version, our tool discovers that the body of baz is
unchangedwhich is what we would like, because even though
every line has been syntactically modied, the function in
fact is structurally the same, and produces the same out-
put. Our tool also determines that the type sz t has been
renamed size t, the global variable count has been renamed
counter, the structure foo has been renamed bar, and the
function biff() has been added. Notice that if we had done
a line-oriented diff instead, nearly all the lines in the pro-
gram would have been marked as changed, providing little
useful information.
To report these results, the tool must nd a bijection be-
tween the old and new names in the program, even though
functions and type declarations have been reordered and
modied. To do this, the tool begins by nding function
names that are common between program versions; our as-
sumption is that function names do not change very often.
The tool then uses partial matching of function bodies to
determine name maps between old and new versions, and
nally tries to nd bijections i.e., one-to-one, onto submaps
of the name maps.
We traverse the ASTs of the function bodies of the old and
new versions in parallel, adding entries to a LocalNameMap
and GlobalNameMap to form mappings between local vari-
able names and global variable names, respectively. Two
variables are considered equal if we encounter them in the
same syntactic position in the two function bodies. For ex-
ample, in Figure 2, parallel traversal of the two versions of
baz results in the LocalNameMap
a d, b e, sf sb, c g
and a GlobalNameMap with count counter. Similarly,
procedure GenerateMaps(V ersion1, V ersion2)
F
1
set of all functions in Version 1
F
2
set of all functions in Version 2
global T ypeM ap
global GlobalN ameM ap
for each function f F
1
F
2
do
8
<
:
AST
1
AST of f in Version 1
AST
2
AST of f in Version 2
Match Ast(AST
1
, AST
2
)
procedure Match Ast(AST
1
, AST
2
)
local LocalN ameM ap
for each (node
1
, node
2
) (AST
1
, AST
2
)
do
8
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
:
if (node
1
, node
2
) = (t
1
x
1
, t
2
x
2
)
//declaration
then
T ypeM ap T ypeM ap {t
1
t
2
}
LocalN ameM ap LocalN ameM ap {x
1
x
2
}
else if (node
1
, node
2
) = (y
1
:= e
1
op e
1
, y
2
:= e
2
op e
2
)
then
8
>
>
>
>
>
<
>
>
>
>
>
:
Match Ast(e
1
, e
2
)
Match Ast(e
1
, e
2
)
if isLocal(y
1
) and isLocal(y
2
) then
LocalN ameM ap LocalN ameM ap {y
1
y
2
}
else if isGlobal(y
1
) and isGlobal(y
2
) then
GlobalN ameM ap GlobalN ameM ap {y
1
y
2
}
else if . . .
else break
Figure 3: Map Generation Algorithm
we form a TypeMap between named types (typedefs and
aggregates) that are used in the same syntactic positions
in the two function bodies. For example, in Figure 2, the
name map pair sb sf will introduce a type map pair
struct foo struct bar.
We dene a renaming to be a name or type pair j
1
j
2
where j
1
j
2
exists in the bijection, j
1
does not exist in the
new version, and j
2
does not exist in the old version. Based
on this denition, our tool will report count counter
and struct foo struct bar as renamings, rather than
additions and deletions. This approach ensures that consis-
tent renamings are not presented as changes, and that type
changes are decoupled from value changes, which helps us
better understand how types and values evolve.
Figure 3 gives pseudocode for our algorithm. We accumu-
late global maps TypeMap and GlobalNameMap, as well as
a LocalNameMap per function body. We invoke the routine
Match Ast on each function common to the two versions.
When we encounter a node with a declaration t
1