Create Code Property Graphs For LLVM Bitcode

This article shows how to create Code Property Graphs (CPGs) from LLVM bitcode.

Example: Generating a Sample CPG

Before generating a CPG for your own project, work through this example, which generates one for a sample project.

llvm2cpg takes LLVM bitcode as an input. Create an LLVM IR file called main.ll with the following content:

; main.ll
declare void @callee(i32 %sink)
define void @caller(i32 %x) {
  call void @callee(i32 %x)
  ret void
}

Run llvm2cpg against your newly-created file:

llvm2cpg -output-dir=`pwd` main.ll

You should see output that's similar to the following:

[2019-11-06 10:30:05.543] [llvm2cpg] [info] More details: /var/folders/pp/lt3pgm5971n1qw7pp2g_bmfr0000gn/T/llvm2cpg-d36a81.log
[2019-11-06 10:30:05.543] [llvm2cpg] [info] Loading main.ll
[2019-11-06 10:30:05.544] [llvm2cpg] [info] Emitting CPG 1/1
[2019-11-06 10:30:05.544] [llvm2cpg] [info] Serializing CPG
[2019-11-06 10:30:05.545] [llvm2cpg] [info] Saving CPG on disk
[2019-11-06 10:30:05.546] [llvm2cpg] [info] CPG is successfully saved on disk: /tmp/sc-w4QviQTLdY/cpg.bin.zip
[2019-11-06 10:30:05.546] [llvm2cpg] [info] Shutting down

Start Ocular by running sl ocular.

At this point, you can run a query against the CPG you just created using the Ocular command line interface:

importCpg("/tmp/sc-w4QviQTLdY/cpg.bin.zip")
def sink = cpg.method.name("callee").parameter
def source = cpg.method.name("caller").parameter
sink.reachableBy(source).allFlows.p

After running the final command, you should see output similar to the following:

2019-11-06 10:34:14.344 [main] INFO mainTasksSize: 1, reachedEndNode: 1,
res18: List[String] = List(
  """ ______________________________________
 | tracked| lineNumber| method| file   |
 |=====================================|
 | x      | N/A       | caller| main.ll|
 | x      | N/A       | caller| main.ll|
 | sink   | N/A       | callee| main.ll|
"""
)

If so, you can proceed with the next steps, which show you how to generate a CPG for your project.

Obtaining Bitcode from Source Code

As mentioned above, llvm2cpg takes LLVM bitcode as input. That bitcode can take one of three forms:

IR (a human-readable representation)
Bitcode (a bitstream representation)
Embedded bitcode (a bitstream representation embedded into a binary)

There are several ways for you to get LLVM bitcode out of high-level source code.

Sample Program

For the remainder of this article, we will be using this program as a sample:

/// main.c
extern int printf(const char *, ...);
void callee(int x) {
  printf("%d\n", x);
}
int main(int argc, char **argv) {
    callee(14);
    callee(42);
    return 0;
}

IR

To emit an IR for the sample project, run:

clang -c -S -emit-llvm main.c -o main.ll

The resulting file main.ll can then be passed to llvm2cpg:

llvm2cpg main.ll

Bitcode

There are two ways to get the bitcode. The first way is to run:

clang -c -emit-llvm main.c -o main.bc

Alternatively, compile with link-time optimization (LTO) in full mode, which makes the object file itself contain bitcode:

clang -c -flto main.c -o main.o

Both main.bc and main.o contain bitcode. You can verify this as follows:

> file main.o main.bc
main.o:  LLVM bitcode, wrapper x86_64
main.bc: LLVM bitcode, wrapper x86_64

Both types of files can be passed to llvm2cpg (e.g., llvm2cpg main.bc or llvm2cpg main.o).
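If the file utility is not available, a similar check can be approximated by reading the first four bytes of a file. The sketch below assumes the two magic numbers described in the LLVM bitcode format documentation (raw bitcode and the wrapper format); the function name is_llvm_bitcode is hypothetical and not part of llvm2cpg. It does not detect bitcode embedded inside a native binary.

```shell
# Succeed (exit status 0) when FILE starts with an LLVM bitcode magic number.
# Raw bitcode begins with the bytes 'B' 'C' 0xC0 0xDE.
# The wrapper format begins with 0xDE 0xC0 0x17 0x0B (0x0B17C0DE, little-endian).
is_llvm_bitcode() {
  # Dump the first four bytes as hex and strip all whitespace.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    4243c0de|dec0170b) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_llvm_bitcode main.bc && echo "bitcode" prints a message only when the magic number matches.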

Embedded Bitcode

Finally, you can obtain the embedded bitcode from your source code:

> clang -fembed-bitcode main.c -o main
> ./main

You can then pass main to llvm2cpg:

llvm2cpg main

You'll see output that is similar to the following:

[2019-11-06 10:00:59.021] [llvm2cpg] [info] Loading main
[2019-11-06 10:00:59.027] [llvm2cpg] [info] Emitting CPG 1/1
[2019-11-06 10:00:59.028] [llvm2cpg] [info] Serializing CPG
[2019-11-06 10:00:59.028] [llvm2cpg] [info] Saving CPG on disk
[2019-11-06 10:00:59.029] [llvm2cpg] [info] CPG is successfully saved on disk: ./cpg.bin.zip
[2019-11-06 10:00:59.029] [llvm2cpg] [info] Shutting down

Using embedded bitcode is ideal, since it results in the most straightforward integration and can be added to an existing build system without affecting the resulting software.
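As a recap of the variants above, the following sketch prints the clang command for each form without running it. The helper name bitcode_command and its mode names (ir, bitcode, lto, embedded) are made up for illustration; the commands themselves are the ones shown in this section, with output names derived from the source file name.

```shell
# Print (without running) the clang command for one of the bitcode forms.
# Usage: bitcode_command <ir|bitcode|lto|embedded> <source.c>
bitcode_command() {
  mode=$1
  src=$2
  base=${src%.c}   # strip the .c suffix to derive output names
  case "$mode" in
    ir)       echo "clang -c -S -emit-llvm $src -o $base.ll" ;;
    bitcode)  echo "clang -c -emit-llvm $src -o $base.bc" ;;
    lto)      echo "clang -c -flto $src -o $base.o" ;;
    embedded) echo "clang -fembed-bitcode $src -o $base" ;;
    *)        echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

Calling bitcode_command embedded main.c prints the command from the Embedded Bitcode section; an unrecognized mode returns a non-zero status.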

Obtaining Bitcode for Your Source Code

Getting bitcode for your own projects can be less straightforward than for the sample program, especially given the variety of build systems in use. In every case, you need to inject one of the following flags into your build:

Flag	Notes
`-emit-llvm`	The build does not complete (linking fails because no object files are produced), but all bitcode files are available
`-flto`	The build completes, and all of the created intermediate object files contain bitcode
`-fembed-bitcode`	The build completes and the resulting binary contains bitcode

For example, if you're building your project with CMake, you'd run:

cmake -DCMAKE_C_FLAGS=-fembed-bitcode -DCMAKE_CXX_FLAGS=-fembed-bitcode source-root

If you're using Xcode, add the flag to both OTHER_CFLAGS and OTHER_LDFLAGS.

If you're using xcodebuild, then you'd use:

xcodebuild OTHER_CFLAGS=-fembed-bitcode OTHER_CPLUSPLUSFLAGS=-fembed-bitcode OTHER_LDFLAGS=-fembed-bitcode

Debugging

You may find it helpful to include the following debugging-related flags in your build command as well:

Flag	Description
`-fno-builtin`	Disables special handling and optimization of standard library functions such as `malloc`
`-fno-inline-functions`	Disables function inlining to ease debugging
`-fno-debug-macro`	Disables the generation of debug info for macros

We recommend that users of other build systems look into using whole-program-llvm.

Caveats

The -fembed-bitcode flag may not work on macOS if a project links to a static library that wasn't compiled with embedded bitcode support.
If you combine -fembed-bitcode with -flto, no bitcode will be embedded into the binary.
In some cases, llvm2cpg can't read debug information emitted by Xcode's version of Clang. Everything will work normally, but the debug information won't be taken into account.
When working with [fooObject data] and [anyObject data] in Objective-C, note that the compiler doesn’t typecheck the two calls, and they aren't checked at runtime either. The two object types are therefore treated as the same, with no effect on the optimized machine code that’s emitted.
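The caveat about combining -fembed-bitcode with -flto can be checked mechanically before a build starts. The sketch below is one possible guard; the function name has_conflicting_bitcode_flags is hypothetical, and treating the =value variants (such as -flto=thin) the same way as the plain flags is an assumption rather than something this article states.

```shell
# Succeed (exit status 0) when the given flags request both embedded bitcode and LTO.
# Assumption: -fembed-bitcode=... and -flto=... conflict like the plain flags do.
has_conflicting_bitcode_flags() {
  embed=0
  lto=0
  for flag in "$@"; do
    case "$flag" in
      -fembed-bitcode|-fembed-bitcode=*) embed=1 ;;
      -flto|-flto=*) lto=1 ;;
    esac
  done
  [ "$embed" -eq 1 ] && [ "$lto" -eq 1 ]
}
```

A build script could call has_conflicting_bitcode_flags $CFLAGS and print a warning when it succeeds.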

Getting the CPG out of a Project

Once you've obtained the bitcode for your project, you can get the CPG. The command required to do so depends on the method you used to obtain your bitcode.

`-emit-llvm`

Option 1:

cd build-directory
llvm2cpg `find ./ -name "*.bc"`

Option 2:

cd build-directory
llvm2cpg `find ./ -name "*.ll"`

`-flto`

cd build-directory
llvm2cpg `find ./ -name "*.o"`
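The backtick-based find commands in this section split their output on whitespace, so a path that contains a space is passed as two separate arguments. A possible alternative is to let find invoke the command itself. In the sketch below, run_on_matches is a hypothetical helper name, and it relies on llvm2cpg accepting several input files in one invocation, as the commands above imply. For very long file lists, find may split the work across more than one invocation.

```shell
# Run a command with every file under DIR whose name matches PATTERN.
# Usage: run_on_matches <dir> <pattern> <command> [args...]
# find hands each path over as a single argument, so spaces in names are preserved.
run_on_matches() {
  dir=$1
  pattern=$2
  shift 2
  find "$dir" -type f -name "$pattern" -exec "$@" {} +
}
```

For example, run_on_matches . '*.bc' llvm2cpg corresponds to Option 1 for -emit-llvm, and run_on_matches . '*.o' llvm2cpg to the -flto case.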

`-fembed-bitcode`

cd build-directory
llvm2cpg program-binary

whole-program-llvm

cd build-directory
llvm2cpg bitcode.bc
