Create Code Property Graphs For LLVM Bitcode
This article shows how to create Code Property Graphs (CPGs) from LLVM bitcode using llvm2cpg.
Example: Generating a Sample CPG
Before you generate a CPG for your own project, work through this example, which generates one for a sample project.
llvm2cpg takes LLVM bitcode as an input. Create an LLVM IR file called main.ll with the following content:
; main.ll
declare void @callee(i32 %sink)
define void @caller(i32 %x) {
call void @callee(i32 %x)
ret void
}
Run llvm2cpg against your newly-created file:
llvm2cpg -output-dir=`pwd` main.ll
You should see output that's similar to the following:
[2019-11-06 10:30:05.543] [llvm2cpg] [info] More details: /var/folders/pp/lt3pgm5971n1qw7pp2g_bmfr0000gn/T/llvm2cpg-d36a81.log
[2019-11-06 10:30:05.543] [llvm2cpg] [info] Loading main.ll
[2019-11-06 10:30:05.544] [llvm2cpg] [info] Emitting CPG 1/1
[2019-11-06 10:30:05.544] [llvm2cpg] [info] Serializing CPG
[2019-11-06 10:30:05.545] [llvm2cpg] [info] Saving CPG on disk
[2019-11-06 10:30:05.546] [llvm2cpg] [info] CPG is successfully saved on disk: /tmp/sc-w4QviQTLdY/cpg.bin.zip
[2019-11-06 10:30:05.546] [llvm2cpg] [info] Shutting down
Start Ocular by running sl ocular.
At this point, you can run a query against the CPG you just created using the Ocular command line interface:
importCpg("/tmp/sc-w4QviQTLdY/cpg.bin.zip")
def sink = cpg.method.name("callee").parameter
def source = cpg.method.name("caller").parameter
sink.reachableBy(source).allFlows.p
After running the final command, you should see output similar to the following:
2019-11-06 10:34:14.344 [main] INFO mainTasksSize: 1, reachedEndNode: 1,
res18: List[String] = List(
""" ______________________________________
| tracked| lineNumber| method| file |
|=====================================|
| x | N/A | caller| main.ll|
| x | N/A | caller| main.ll|
| sink | N/A | callee| main.ll|
"""
)
If so, you can proceed with the next steps, which show you how to generate a CPG for your project.
Obtaining Bitcode from Source Code
As mentioned above, llvm2cpg takes LLVM bitcode as input. That bitcode can take one of three forms:
- IR (a human-readable representation)
- Bitcode (a bitstream representation)
- Embedded bitcode (a bitstream representation embedded into a binary)
There are several ways for you to get LLVM bitcode out of high-level source code.
Sample Program
For the remainder of this article, we will be using this program as a sample:
/// main.c
extern int printf(const char *, ...);
void callee(int x) {
printf("%d\n", x);
}
int main(int argc, char **argv) {
callee(14);
callee(42);
return 0;
}
IR
To emit IR for the sample program, run:
clang -c -S -emit-llvm main.c -o main.ll
The resulting file main.ll can then be passed to llvm2cpg:
llvm2cpg main.ll
Bitcode
There are two ways to get the bitcode. The first way is to run:
clang -c -emit-llvm main.c -o main.bc
Alternatively, you can compile with LTO enabled (-flto selects full LTO), which makes the object file itself contain bitcode:
clang -c -flto main.c -o main.o
Regardless of whether your output is main.bc or main.o, both contain bitcode. You can verify as follows:
> file main.o main.bc
main.o: LLVM bitcode, wrapper x86_64
main.bc: LLVM bitcode, wrapper x86_64
Both types of files can be passed to llvm2cpg (e.g., llvm2cpg main.bc or llvm2cpg main.o).
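If the file utility is not available, the same check can be approximated by inspecting the first four bytes of a file. The sketch below is a minimal illustration, not part of llvm2cpg: is_llvm_bitcode is a hypothetical helper name, and it recognizes only the two bitstream magic numbers (raw bitcode and the bitcode wrapper). It does not recognize textual IR or bitcode embedded in a binary.

```shell
# Sketch: return 0 if the file starts with an LLVM bitcode magic number.
# is_llvm_bitcode is a hypothetical helper, not an llvm2cpg command.
# Usage (assumed): is_llvm_bitcode main.bc && llvm2cpg main.bc
is_llvm_bitcode() {
  # Read the first four bytes as lowercase hex with no separators.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \t\n')
  case "$magic" in
    4243c0de) return 0 ;;  # raw bitcode: 'B' 'C' 0xC0 0xDE
    dec0170b) return 0 ;;  # bitcode wrapper: 0x0B17C0DE stored little-endian
    *)        return 1 ;;
  esac
}
```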
Embedded Bitcode
Finally, you can obtain the embedded bitcode from your source code:
> clang -fembed-bitcode main.c -o main
> ./main
You can then pass main to llvm2cpg:
llvm2cpg main
You'll see output that is similar to the following:
[2019-11-06 10:00:59.021] [llvm2cpg] [info] Loading main
[2019-11-06 10:00:59.027] [llvm2cpg] [info] Emitting CPG 1/1
[2019-11-06 10:00:59.028] [llvm2cpg] [info] Serializing CPG
[2019-11-06 10:00:59.028] [llvm2cpg] [info] Saving CPG on disk
[2019-11-06 10:00:59.029] [llvm2cpg] [info] CPG is successfully saved on disk: ./cpg.bin.zip
[2019-11-06 10:00:59.029] [llvm2cpg] [info] Shutting down
Using embedded bitcode is ideal, since it results in the most straightforward integration and can be added to an existing build system without affecting the resulting software.
Obtaining Bitcode for Your Source Code
Getting bitcode for your own projects can be less straightforward than for our sample program, especially given the variety of build systems in use. The key point is that you need to inject one of the following flags into your build:
Flag | Notes |
---|---|
-emit-llvm | The build doesn't finish (and linking fails since no object files are produced), but all bitcode files are available |
-flto | The build completes, and all of the created intermediate object files contain bitcode |
-fembed-bitcode | The build completes and the resulting binary contains bitcode |
For example, if you're building your project with CMake, you'd run:
cmake -DCMAKE_C_FLAGS=-fembed-bitcode -DCMAKE_CXX_FLAGS=-fembed-bitcode source-root
If you're using Xcode, add the flag to both OTHER_CFLAGS and OTHER_LDFLAGS.
If you're using xcodebuild, then you'd use:
xcodebuild OTHER_CFLAGS=-fembed-bitcode OTHER_CPLUSPLUSFLAGS=-fembed-bitcode OTHER_LDFLAGS=-fembed-bitcode
Debugging
You may find it helpful to include the following debugging-related flags in your build command as well:
Flag | Description |
---|---|
-fno-builtin | Disables the special handling and optimization of standard library functions such as malloc |
-fno-inline-functions | Disables function inlining to ease debugging |
-fno-debug-macro | Disables the generation of debug info for macros |
We recommend that users of other build systems look into using whole-program-llvm.
Caveats
- The -fembed-bitcode flag may not work on macOS if a project links to a static library that wasn't compiled with embedded bitcode support.
- If you combine -fembed-bitcode with -flto, no bitcode will be embedded into the binary.
- In some cases, llvm2cpg can't read debug information emitted by Xcode's version of Clang. Everything will work normally, but the debug information won't be taken into account.
- When working with [fooObject data] and [anyObject data] in Objective-C, note that the compiler doesn't type-check the two (nor are they checked at runtime). The two object types are therefore treated as the same, and there is no effect on the optimized machine code that's emitted.
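Because some of these caveats leave the binary without any bitcode, it can help to check a built binary before passing it to llvm2cpg. The sketch below filters a section listing for names that suggest embedded bitcode. The helper name mentions_embedded_bitcode is hypothetical, and the section names are assumptions about Clang's output (.llvmbc in ELF files, the __LLVM segment in Mach-O files), so treat a negative result as a hint rather than proof.

```shell
# Sketch: succeed if a section listing read from stdin mentions an
# embedded-bitcode section. The names matched here are assumptions:
# ".llvmbc" for ELF and the "__LLVM" segment for Mach-O.
# Usage (assumed): objdump -h main | mentions_embedded_bitcode   # Linux
#                  otool -l main   | mentions_embedded_bitcode   # macOS
mentions_embedded_bitcode() {
  grep -Eq '\.llvmbc|__LLVM'
}
```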
Getting the CPG out of a Project
Once you've obtained the bitcode for your project, you can get the CPG. The command required to do so depends on the method you used to obtain your bitcode.
-emit-llvm
Option 1:
cd build-directory
llvm2cpg `find ./ -name "*.bc"`
Option 2:
cd build-directory
llvm2cpg `find ./ -name "*.ll"`
-flto
cd build-directory
llvm2cpg `find ./ -name "*.o"`
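The backtick form used in the find commands above splits the results on whitespace, so it breaks if any path in the build directory contains a space. A more defensive variant passes NUL-separated paths through xargs. This is a sketch under assumptions: run_on_matches is a hypothetical helper, and it relies on the -print0 and -0 options supported by the GNU and BSD versions of find and xargs. For very long file lists, xargs may split the work across several invocations, which would produce more than one CPG.

```shell
# Sketch: run a command over every file under DIR whose name matches
# PATTERN, keeping paths that contain spaces intact.
# run_on_matches is a hypothetical helper, not an llvm2cpg command.
# Usage (assumed): run_on_matches . '*.bc' llvm2cpg
#                  run_on_matches . '*.o'  llvm2cpg
run_on_matches() {
  dir=$1
  pattern=$2
  shift 2
  # -print0 and -0 separate paths with NUL bytes instead of whitespace.
  find "$dir" -type f -name "$pattern" -print0 | xargs -0 "$@"
}
```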
-fembed-bitcode
cd build-directory
llvm2cpg program-binary
whole-program-llvm
cd build-directory
llvm2cpg bitcode.bc