Build Interceptor
Summary: Build Interceptor captures the .i files of any
project while it is built from source using the gcc tool-chain.
Maintainer: Daniel S. Wilkerson
Developers: Karl Chen
Anyone who has tried this on a large scale has found that it is
non-trivial to build a project from source and obtain the .i files
generated during the build process. I give step-by-step instructions
on how to use the provided scripts to do this without *any*
modification to the build process of the project you are trying to
capture.
This work was supported by professors Alex Aiken and David Wagner and was done at
UC Berkeley.
Here are the current releases. Feel free to just get the current
Subversion repository version as a guest user.
A warning on the toxicity of the techniques necessary while
operating in a Byzantine Compilation Regime
WARNING: During its install process,
build-interceptor searches for programs on your system that resemble a
compiler tool-chain and messes with them. It is therefore necessary
to do this install (but not the interception) as root.
I must recommend that you install a separate operating system
image (perhaps within a chroot bubble or a virtual machine) solely for
the purpose of building packages with build-interceptor.
That said, my install/de-install makefile is pretty smart at
preventing you from shooting yourself in the foot (it saved me several
times): you can review the files to be moved before committing to do
so and it won't allow nonsensical operations such as
double-installation. Therefore I am willing to use it on my own
development box and I have never messed it up this way.
We move the original installed tool-chain like this because it is
the only way to be absolutely sure that all calls to the compiler,
linker etc. are intercepted; all other tricks with environment
variables etc. can be subverted by the build process, whereas with our
technique in order to avoid interception the build process would have
to use a different compiler or actively search for where we hid the
real one.
Introduction
Welcome!
Build Interceptor is a collection of scripts for recording the .i
files generated during a build of C or C++ programs with the gcc
tool-chain. No modification to the original build process is
necessary.
Limitations
The method described here requires that you be root on your box so you
can replace the system cc1 and cc1plus programs, among others; this is
done so that the build process you are intercepting does not have to
be changed at all. You can probably also get it to work by setting
environment variables such as GCC_EXEC_PREFIX. I could not figure out
how to change the compiler proper (cc1) from gcc spec files.
The previous version of this tool would not work with the gcc 3 series
as the preprocessor and compiler had been integrated; however since
then by looking through the source code I discovered the seemingly
undocumented flag "--no-integrated-cpp" which solves this problem.
Compilers other than gcc are not supported. Gcc 3.3 and 3.4 work;
gcc 3.2.3 seems to not work.
Background
When gcc/g++ compiles, it pre-processes .c or .cc files to .i or .ii
files (respectively), compiles .i or .ii files to .s files, assembles
.s files to .o files, and links .o files to executables. It
traditionally does all these stages with separate programs (new
versions of gcc complicate this by integrating preprocessing and
compilation), in particular the compiler-proper program being called
cc1 or cc1plus for C or C++ (respectively).
Basics of how the build interception works
The cc1_interceptor.pl script captures the .i and .ii files generated
by the gcc compiler tool chain by replacing and imitating cc1. It
- copies the pre-processed input, the .i file, to a new file,
- runs the real cc1 passing in the copy,
- puts the fully-qualified filename of the copy into a string in
the section ".note.cc1_interceptor" in the assembly output.
This name flows to the .o and then to the executable (the linker will
concatenate multiple occurrences of this section) where it can later
be retrieved using objdump. This is easier to do if you use Ben
Liblit's extract-section script which he ships as part of "The
Cooperative Bug Isolation Project" and which I include in this
project; see below for details.
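The third step above can be sketched as follows. This is an illustration in Python, not the Perl of cc1_interceptor.pl; the helper name note_section_asm and the exact directive layout are assumptions, and only the section name and the "tmpfile:" key come from this document.

```python
def note_section_asm(tmpfile: str) -> str:
    """Return assembly text that records `tmpfile` in its own ELF section.

    Hypothetical helper: the real script may emit different directives.
    """
    # Escape backslashes and double quotes so the path survives inside a
    # GNU assembler string literal.
    escaped = tmpfile.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "\t.section\t.note.cc1_interceptor\n"
        f'\t.ascii\t"tmpfile:{escaped}\\n"\n'
    )


def append_note(asm_text: str, tmpfile: str) -> str:
    """Append the note section to the assembly produced by the real cc1."""
    if not asm_text.endswith("\n"):
        asm_text += "\n"
    return asm_text + note_section_asm(tmpfile)
```

Because the note is appended at the very end of the assembly output, it does not disturb the sections the compiler emitted before it.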
The build interceptor process works by first moving away the system
executables (using the Intercept.mk makefile, as root) and replacing
them with softlinks to the interception scripts provided.
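The move-and-softlink idea can be sketched in Python as below. The function names are hypothetical and Intercept.mk does considerably more checking; the "_orig" suffix follows the gcc_orig and make_orig names used later in this document. Try it on a scratch directory, not on a real tool-chain.

```python
import os


def intercept_on(bin_dir, tool, interceptor):
    """Move bin_dir/tool to bin_dir/tool_orig and link tool -> interceptor."""
    real = os.path.join(bin_dir, tool)
    hidden = os.path.join(bin_dir, tool + "_orig")
    # Refuse double-installation: it would overwrite the hidden original.
    if os.path.lexists(hidden):
        raise RuntimeError(f"{tool} is already intercepted")
    os.rename(real, hidden)
    os.symlink(interceptor, real)


def intercept_off(bin_dir, tool):
    """Undo intercept_on: remove the softlink and restore the original."""
    real = os.path.join(bin_dir, tool)
    hidden = os.path.join(bin_dir, tool + "_orig")
    if not os.path.islink(real) or not os.path.exists(hidden):
        raise RuntimeError(f"{tool} is not intercepted")
    os.remove(real)
    os.rename(hidden, real)
```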
Licensing
All files in this directory tree and its subtrees are distributed
under the license in License.txt; please see that file for copyright
and terms of use.
Design
Simplicity
There are other ways one might attempt build process interception.
This particular design has been chosen to avoid some problems that are
not at all obvious if you have not tried this before. The salient
lesson of those other projects is that build-processes are very
complex and interception is hard to do without breaking them; testing
is very difficult because if something fails it is hard to know
what went wrong or even if something went wrong. The number one
concern of the design is therefore to keep things as simple and
non-intrusive as possible.
Our design builds on the experience of the MOPS project and
Cooperative Bug Isolation Project (CBI), which I talk more about in
the Acknowledgments section below.
Staged interception
We do not pipeline the build interception with any further analysis
of the generated .i files. That is, we just save the generated .i
files; we don't run an analysis right then. The MOPS project (below)
did attempt to analyze .i files as they were generated. When a build
would fail, they assumed that their analysis had failed. When we
later separated the interception from the analysis, we found that in
fact the interception was often failing but this was going undetected.
Another reason to separate them is that if your analysis does
fail, you often want to re-run it multiple times as you gradually
minimize the input, for example with the Delta interesting-file minimizer tool.
This is only possible if you have already materialized the .i file
somewhere separately.
Basically a complex process should be staged if at all possible to
reduce complexity.
Metadata lives in data
We do not attempt to keep metadata on build-process-generated files
anywhere outside the files themselves. Early versions of the MOPS
project attempted to put derived data from a .i file into another
file and then somehow maintain an association between the two. This
was found to be impossible due to build processes moving files around
etc.
All metadata for a file is inserted into the file in one way or
another, depending on the current language the file is in: at the
compile stage, it is inserted into the generated assembly (a trick
novel to build_interceptor) and at the link stage it is inserted into
the .o file using objcopy (a trick from MOPS and, I think, also from
CBI).
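The objcopy-based insertion can be sketched as building a command line of the same shape as the one quoted from as_interceptor.pl in the Extraction section. The helper name embed_command is hypothetical, the md5 value is taken as a parameter because this document does not say what is hashed, and treating any non-empty value of BUILD_INTERCEPTOR_DONT_EMBED_PREPROC as "on" is an assumption.

```python
def embed_command(outfile, md5, tmpfile, env):
    """Return the objcopy argv that embeds `tmpfile` into `outfile`.

    Returns None when embedding is switched off through the environment.
    `env` is a mapping such as os.environ.
    """
    if env.get("BUILD_INTERCEPTOR_DONT_EMBED_PREPROC"):
        return None
    # objcopy --add-section NAME=FILE adds a section named NAME whose
    # contents are read from FILE.
    return ["objcopy", outfile, "--add-section", f".file.{md5}={tmpfile}"]
```

The argv could then be handed to subprocess.run; returning it rather than running it keeps the sketch free of side effects.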
Avoid long-range communication outside of data
We do not attempt complex out-of-band communication between the
various sub-processes of gcc, which differs from both MOPS and CBI.
MOPS for example attempts to capture the preprocessing stage, analyze
it, and then insert the results in after the linking stage. Getting
rid of this long-range dependency between stages greatly simplifies
things.
We do by default insert the preprocessing output captured at the
start of the compilation stage into the .o file at the end of the
assembly stage. This is pretty simple as the out of band data is the
preprocessing output which has been stored in a temporary file with a
name computed to not collide with others and located in a canonical
place; the name of this file is in-band, embedded in the file as it is
passed along.
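One way to compute such a non-colliding name is sketched below. The layout is inferred from the single example path shown in the Extraction section; reading its two trailing numbers as a timestamp and a process id is my assumption, and capture_name is a hypothetical helper.

```python
import os
import time


def capture_name(preproc_dir, source_path, now=None, pid=None):
    """Build a capture file name under `preproc_dir` for `source_path`.

    Assumed layout: <preproc_dir>/.<absolute source path>-<time>-<pid>
    """
    now = int(time.time()) if now is None else now
    pid = os.getpid() if pid is None else pid
    abs_src = os.path.abspath(source_path)
    return f"{preproc_dir}/.{abs_src}-{now}-{pid}"
```

Under these assumptions, two compilations of the same source file collide only if they share both the second and the process id.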
Avoid parsing complex command-lines
Similarly we manage to almost completely avoid parsing the
command-line arguments of gcc, though a few situations forced us to do
it a little. Again, the simplification of the process is huge; we
only parse arguments of simple tools such as cc1 and collect2; their
command-lines are much simpler as another tool uses them, not a human.
Something you might be tempted to do along these lines is to remove
-O* flags from the compile stage to speed things up, since perhaps you
are only interested in the .i files and not in actually using the
resulting executables. Removing -O* from the compile stage alone will
not work: if the flag has been passed to the preprocessing stage, the
compile stage will fail on the preprocessed output because various
things have been inlined. I suppose it would work to remove it from all stages,
probably using the gcc spec file mechanism, but I don't consider it
worth the complexity and possibility of failure.
Goals and amount of interception
Only use what you need
What tools must be intercepted during the build process depends on
what your goal is. You can turn off the interception of tools by
removing them from intercept.progs after it is built.
File-by-file
For a file-by-file analysis of source code, you simply need the source
files after pre-processing. It is sufficient to just intercept
cc1/cc1plus and (after running reorg_build.pl) look at the resulting
.i files.
Note that even if you do not intercept cpp/cpp0/tradcpp0/gcc -E, the
gcc spec file will tell gcc to not pass -P which means there should
always be line directives in the .i file. So if your analysis finds
an error, it can always map it back to the original source line.
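Mapping a line of a .i file back to its source can be sketched as below, assuming the usual gcc line-marker form '# LINENUM "FILENAME" FLAGS'. The function name map_lines is hypothetical and not part of build-interceptor.

```python
import re

# A gcc line marker: '# 12 "foo.c"' optionally followed by flags.
_MARKER = re.compile(r'^#\s+(\d+)\s+"((?:[^"\\]|\\.)*)"')


def map_lines(i_text):
    """Map each line of a .i file to its (source file, source line).

    Entry k of the result describes .i line k+1; marker lines and any
    lines before the first marker map to None.
    """
    origin = []
    cur_file, cur_line = None, 0
    for text in i_text.splitlines():
        m = _MARKER.match(text)
        if m:
            # The marker says: the NEXT line came from this file and line.
            cur_file, cur_line = m.group(2), int(m.group(1))
            origin.append(None)
        else:
            origin.append((cur_file, cur_line) if cur_file is not None else None)
            cur_line += 1
    return origin
```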
Whole-program
For a whole-program analysis of all the source in the package, you
need to know for each executable which .i files went into it. Each
such executable (and any other files produced by the linker) will
result in a .ld file which lists all the .i files that went into it
that were compiled during the build.
For a really whole-program analysis that also looks at libraries, or
if you wanted to modify the .i files, recompile, and re-link, you need
to know *all* the .o files that went into an executable. For this you
will need to also intercept collect2, which is implemented; however
the script reorg.pl would also have to be extended to extract the
linker --trace output, but this is straightforward.
You would want to intercept 'as' to make a mapping between .s files
output by cc1/cc1plus and .o files linked together by the linker as
well as the command-line. It would probably be best to insert the
metadata after assembly using objcopy, just as with collect2.
Source-to-source
If you wanted to do a source-to-source transformation on the
original source you would need the preprocessing command line as well,
and so would have to intercept cpp/cpp0/tradcpp0/gcc -E; probably you
would insert the metadata into the file as the initializer of a global
string variable with an unusual name.
"Replaying" a build process from the interception record is probably
trickier than one might at first imagine: build processes sometimes do
strange things such as move files around. You would have to intercept
mv and perhaps rm etc. I have not done this but it is not hard given
the infrastructure. One thing you will likely want is for the build
process to be deterministic, so the make interceptor removes -j from
the command line; try out the TestMake.mk makefile with and without
it.
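Removing -j from a make command line can be sketched like this. It is an illustration only: strip_jobs is a hypothetical name, the real make interceptor may handle more or fewer spellings, and this version does not handle -j bundled with other short options (such as -kj4).

```python
def strip_jobs(argv):
    """Return a copy of `argv` without make's parallel-jobs options."""
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in ("-j", "--jobs"):
            # The job count is optional; skip it only if it is a number.
            if i + 1 < len(argv) and argv[i + 1].isdigit():
                i += 1
        elif (arg.startswith("-j") and arg[2:].isdigit()) or arg.startswith("--jobs="):
            pass  # attached count, e.g. -j4 or --jobs=4
        else:
            out.append(arg)
        i += 1
    return out
```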
Miscellaneous difficulties with gcc layering
You might have to experiment to figure out exactly which layer to
intercept. I am using gcc 3.4.0 and it seems that neither cpp nor gcc
-E call each other nor a program called cpp0, which seems to not exist
anymore; however perhaps gcc 2.95.3 does. Similarly, ld does not call
collect2, though the gcc source code suggests in a comment that they
are interchangeable; why do they both exist? To assist in this
experimentation, each interceptor script prints at the start its 1)
name, 2) parent process id, 3) own process id and 4) arguments all to
standard error (this may have been commented out, just uncomment).
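A trace line of that kind can be sketched as follows. The output format and the names trace_line and trace are assumptions for illustration; the actual scripts are Perl and may print the four items differently.

```python
import os
import sys


def trace_line(name, argv, ppid=None, pid=None):
    """Format the interceptor name, parent pid, own pid and arguments."""
    ppid = os.getppid() if ppid is None else ppid
    pid = os.getpid() if pid is None else pid
    return f"{name}: ppid={ppid} pid={pid} args={' '.join(argv)}"


def trace(name, argv):
    """Write the trace line to standard error, away from build output."""
    print(trace_line(name, argv), file=sys.stderr)
```

Matching each line's parent pid against the own pid of other lines reconstructs which layer invoked which.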
Using the scripts
Setup
This is the one-time initial setup of build_interceptor. Note that
as is traditional, commands executed as a normal user are preceded by
a '$' and those executed as root are preceded by a '#'.
NOTE: Build interceptor is incompatible with ccache. If you have
ccache installed, turn it off first by moving the ccache scripts
away.
- Make a place to put the .i files in your $HOME directory.
$ cd
$ mkdir preproc-foo1
$ ln -s preproc-foo1 preproc
- Build the intercept.progs and other support files.
$ make
Now check that the files you want to intercept are generated in
intercept.progs. You can change this file if you need to, but only do
it while build interception is off! Otherwise you can get into an
inconsistent state.
Interception
- Move your system gcc to gcc_orig and link gcc to gcc_interceptor.pl.
$ cd; cd build_interceptor
$ su
# make -f Intercept.mk on
You could exit the root shell now, but I find it easier to instead
just leave one shell open as root for turning interception on and
off and do user things in another shell.
# exit (leave the root shell)
At any time you can check the interception state; this works as
root or non-root, however other targets in Intercept.mk that mutate
the system state will check if you are root before allowing them.
$ make -f Intercept.mk
If you are intercepting make as well and you want to avoid running
the intercepted make, you can do this while interception is on.
$ make_orig -f Intercept.mk
- Build your project.
If you mess up and need to start over again, just do this.
$ rm -rf preproc/*
If you want to build two different projects and capture both, just
move the link.
$ cd
$ mkdir preproc-foo2
$ rm preproc
$ ln -s preproc-foo2 preproc
Before compiling anything else with gcc:
1) Make the data read-only.
$ cd
$ chmod -R a-w preproc-foo1
2) Point the preprocessor capture at another directory.
$ mkdir preproc-junk
$ rm preproc
$ ln -s preproc-junk preproc
- When you are done, put gcc back where it was.
$ cd; cd build_interceptor
$ su
# make -f Intercept.mk off
# exit (leave the root shell)
Extraction
After intercepting a build, one would like to access the
intercepted .i files. Build-interceptor comes with a script for just
this purpose: extract_build.pl. This script creates an 'abstraction'
of the build process: a directory containing 1) the intercepted .i
files and 2) a Makefile such that typing 'make' "replays" the build.
That is, suppose we have intercepted the build of an executable
'a.out'.
- We may then extract the entire build at once.
$ extract_build.pl -infile a.out -outdir xdir
The result will be a new directory xdir that contains a Makefile and
some .i files in a src subdirectory. The
generic_Makefile is the same for all projects and contains the build
logic; it is included by the Makefile which has variables configured
from interception of the build process.
$ ls xdir
Makefile
generic_Makefile
src
- The xdir/Makefile is very simple: it just compiles each .i file
and links them together; therefore the extracted build process is much
more likely to be amenable to a static analysis or a source-to-source
transformation than the original build process. Changing to that
directory we may now rebuild a.out from those .i files.
$ cd xdir
$ make
$ make check # to run the resulting executable
I think it is possible however for extract_build.pl to fail to
correctly set up the Makefile, depending on the complexity of the
original build process. Therefore we give two more primitive ways of
getting at the .i files directly. First, the .i files are embedded
into the ELF files; you can get them out of the ELF as follows.
However, even this method may cause problems, because for some huge
projects (Mozilla) the embedded .i files will cause the ELF file to
exceed the file size limit on some systems (like mine which is 2 Gig).
In case of this eventuality do as follows.
- Turn off the "feature" that the .i file is embedded into the ELF
by setting the environment variable
BUILD_INTERCEPTOR_DONT_EMBED_PREPROC or commenting out this line in
as_interceptor.pl
system('objcopy', $outfile, '--add-section', ".file.$md5=$tmpfile")
- The .i files may be found down in $HOME/preproc. Print out the
name of the temporary file where the .i file was saved; it is still
there unless you have intercepted another project in the mean time and
also gotten very unlucky.
$ extract_section.pl .note.cc1_interceptor a.out
(
. . .
tmpfile:/home/dsw/preproc/./home/dsw/foo/hello.c-1153018736-18133
)
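Since the linker concatenates the notes from every object file, the section of a linked executable can hold many such records. Pulling the temporary file names out of the text printed above can be sketched as below; tmpfiles is a hypothetical helper, and it relies only on the "tmpfile:" prefix visible in the sample output.

```python
def tmpfiles(section_text):
    """Return every path recorded on a 'tmpfile:' line, in order."""
    paths = []
    for line in section_text.splitlines():
        line = line.strip()
        if line.startswith("tmpfile:"):
            paths.append(line[len("tmpfile:"):])
    return paths
```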
Files
Build-interceptor needs a place to put the pre-processed output,
the .i files. The name of the directory where it puts them is
hard-coded into the scripts:
$HOME/preproc: where the scripts put the .i files.
However it is not recommended to use the tool by simply making a
preproc directory since after interception is over, you want to move
that directory so that other compilations on your system do not
inadvertently put more .i files in there. Thus in the above
instructions I use a layer of indirection as follows:
$HOME/preproc-foo1: an actual directory for holding the .i files.
$HOME/preproc: a softlink to preproc-foo1 that should be moved as soon
as interception is done.
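Moving the softlink can be sketched in Python as below; repoint is a hypothetical helper, not something shipped with build-interceptor. The point of the sketch is that the old link is replaced rather than followed, so nothing is created inside the directory it used to point to.

```python
import os


def repoint(link, target):
    """Make `link` a softlink to `target`, replacing any existing softlink."""
    if os.path.lexists(link) and not os.path.islink(link):
        raise RuntimeError(f"{link} exists and is not a softlink")
    # Create the new link under a temporary name, then rename it over the
    # old one; rename replaces the link itself instead of following it.
    tmp = f"{link}.tmp-{os.getpid()}"
    os.symlink(target, tmp)
    os.replace(tmp, link)
```

For example, repoint(os.path.expanduser("~/preproc"), "preproc-junk") would park the capture directory once a build has been intercepted.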
Weaknesses / Bugs
The primary assumption is that there is a binary file gcc-VERSION
and that all other names such as "gcc" or "cc" are symbolic links (not
hard-links) to gcc-VERSION. If this is not the case things will not
work. In particular this assumption fails for Slackware.
Using this assumption, build-interceptor gets the gcc version at
run time from the binary name. If you have multiple gcc versions
installed simultaneously, they must be named gcc-x.y
(e.g. /usr/bin/gcc-3.4) for this version detecting to work.
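The version detection described above can be sketched as follows. This is an illustration under the stated assumption that every compiler name is a symbolic link to a binary called gcc-VERSION; gcc_version is a hypothetical name and the actual scripts may parse the name differently.

```python
import os
import re


def gcc_version(path):
    """Follow symbolic links from `path` and read the version off the name.

    Returns e.g. "3.4" for a link that resolves to /usr/bin/gcc-3.4, or
    None when the resolved name is not of the form gcc-x.y.
    """
    real = os.path.realpath(path)
    m = re.fullmatch(r"gcc-(\d+(?:\.\d+)*)", os.path.basename(real))
    return m.group(1) if m else None
```

A hard link named "gcc" resolves to itself, so the sketch returns None for it, which is the failure mode noted above for Slackware.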
Build-interceptor is continually changing to deal with various usage
scenarios. There are some old scripts lying around that I don't want to
get rid of but that are unlikely to work out of the box. If I don't
explicitly mention that you should use a script, then it is not
guaranteed to work.
Acknowledgments
I used code and ideas for build-process interception from two
previous projects that dealt with this same problem: MOPS and the
Cooperative Bug Isolation Project (CBI).
The idea of inserting metadata into an unused section in ELF .o files
was borrowed from Ben and Hao. I extended it back to the assembly
stage.
Ben Liblit, Hao Chen, John Kodumal, and Simon Goldsmith contributed to
the discussions leading to these scripts. Thanks especially to Simon
Goldsmith for proof-reading this Readme [I of course take
responsibility for any remaining mistakes].
Thanks to Andy Begel for his in-depth explanation of dynamic linking
under various circumstances and operating systems.