A Scalable Mixed-Level Approach to Dynamic Analysis of C and C++ Programs

gn=right>

A Scalable Mixed-Level Approach to Dynamic Analysis of C and C++ Programs A Scalable Mixed-Level Approach to Dynamic
Analysis of C and C++ Programs
Master of Engineering Thesis Proposal
Philip J. Guo
May 16, 2005
Abstract
Source-level and binary-level dynamic program analyses have com-
plementary advantages. Current dynamic analyses for software engi-
neering typically operate either at the source code level or at the binary
level (possibly translating results to source code terms for output). I
propose a mixed-level approach that combines the best of the source-
level and binary-level approaches throughout the analysis. Compared
to a one-level approach, the mixed-level approach simplies implemen-
tation, improves robustness and scalability, and enables analyses that
cannot be performed purely at the source or binary level.
For my Master of Engineering thesis, I will implement a dynamic
instrumentation toolkit that embodies the mixed-level approach and
use it to build two distinct dynamic analysis tools. One performs
value proling outputting a rich set of run-time values for further
analysis and the other performs value partitioning determining
abstract types for concrete values. These two tools work together
to serve as a C/C++ Front End for the Daikon Dynamic Invariant
Detector.
Thesis Supervisor: Michael D. Ernst, Title: Associate Professor
1 1
Introduction
The C and C++ programming languages are used to create a wide variety of
modern computer programs. Thanks to the eorts of the free software and
open-source movements, there is a large collection of C and C++ programs
whose source code is freely accessible for analysis. Program analysis can assist
in improving human understanding of code, generating specications, nding
bugs, and improving software quality and security. C and C++ programs are
especially interesting and important to analyze due to their widespread pres-
ence in many types of software. However, these languages pose a challenge
for analysis tools due to their source-level syntactic complexities and lack of
type safety: they allow programmers to directly manipulate memory using
pointers and to perform low-level operations such as arbitrary type casts.
My Master of Engineering thesis consists of the design and implementa-
tion of a scalable approach for performing dynamic analysis of C and C++
programs. Program analysis is typically performed on either the source code
or on a compiled (binary) representation of a program. These approaches
each have complementary advantages, but there is a large domain of analy-
ses for which neither is best-suited. Thus, I propose a mixed-level approach
which combines the benets of both source and binary-based approaches to
provide scalability, robustness, and ease-of-implementation. I will build a
toolkit based upon this mixed-level approach and then use it to implement
two important types of analyses: value proling and value partitioning. Value
proling [2, 3] is a technique that observes the run-time values of variables,
expressions, registers, and memory locations, and is an important component
of any analysis which is interested in what the program computes. Value
partitioning divides the values observed during execution into sets based on
abstract types to provide ner-granularity for the few basic representation
types built into the C and C++ languages.
Both
analyses are crucial for providing program trace data for the Daikon
Dynamic Invariant Detector [5], a tool which infers likely relations between
variables (called invariants) from observing their run-time values. Daikon is
language-independent but requires front ends for each language to produce
program traces for it to analyze. Although the mixed-level approach and
toolkit can be used to create many types of analyses, the particular value
proling and partitioning tools I will develop for my thesis are intended to
serve together as a C/C++ Front End for Daikon.
2 2
Background
There are two main classes of program analysis: static and dynamic anal-
ysis [4]. Static analysis involves reasoning about a programs properties
without actually executing it, while dynamic analysis involves analyzing a
program by executing it and observing its run-time behavior. Static analysis
is usually more sound and conservative because it tries to nd properties
that are true for all possible executions, but this comes at the expense of
slower performance and more limited scalability. It must make approxima-
tions about the run-time behavior of a target program (the program being
analyzed) without actually executing it, and usually it makes more conser-
vative approximations. In contrast, dynamic analysis is typically unsound
because it can only be fully correct for some particular executions, but it can
be faster and more scalable in practice. The more exhaustively the target
program exhibits the properties to be analyzed during its executions, the
closer a dynamic analysis gets to achieving soundness. This thesis will only
focus on dynamic analysis.
Dynamic program analysis can be used for optimizations (proling and
tracing), error detection (testing, assertion checking, type checking, memory
safety and leak detection), and program understanding (coverage, call graph
construction). The two most common ways to obtain information from a run-
ning program are to modify a programs source code (source-based approach),
or to modify its compiled binary representation (binary-based approach), so
that it yields the desired information during execution. Each approach has
its own advantages and applications for which it is better-suited.
2.1
Source-based approach
A source-based approach can take advantage of the same level of abstraction
that the language provides to programmers, thus producing high-level output
without architectural dependencies. It can report results using language-level
terms such as functions and variables because it integrates directly into the
source code of the target program (the program being analyzed). A source-
based analysis can also be relatively easy to implement because the developer
only needs to consider one level of abstraction, that of the instrumented
language. It is relatively easy to examine the output of a source-to-source
3 rewriting tool and check whether it has the correct meaning: debugging
the tools output is very similar to debugging ones own programs. It can
also inherit the portability of the target program and take advantage of
compiler optimizations because the inserted code is compiled together with
the target programs code. Examples of source-based tools for a particular
type of dynamic analysis, the detection of memory errors, include Safe-C [1],
CCured [9], and most recently, work by Xu et al. [13], all of which rewrite
the source code of C programs to maintain pointer metadata in order to keep
track of which regions of memory are accessible.
2.2
Binary-based approach
A binary-based approach is useful for problems (e.g. instruction-level prol-
ing, memory tracking, and compiler optimizations) which can be expressed
more simply at a lower level of abstraction. A compiled program binary le is
a at list of instructions rather than a nested expression that requires parsing,
and there are fewer conceptually distinct machine operations than language-
level abstractions. For instance, the language-level description of data has a
complex structure in terms of pointers, arrays, and recursive structures. In
contrast, the machine-level representation of data is as a at memory with
load and store operations. If the property to be analyzed can be expressed in
terms of the simpler machine representation, the language-level complexities
can be ignored, which is especially advantageous for languages such as C and
C++ which contain complex source code constructs. Binary-based analy-
sis is inherently language-independent because programs written in dierent
languages can often compile to one common binary representation. Also, it
has the advantage of providing better program coverage than source-based
analyses because it makes no distinction between a main program and the
libraries that it uses. Execution in a library is analyzed in just the same
way as the main programs execution; there is no need to recompile libraries
or to create hand-written simulations or summaries of their behavior as is