Skip to main content

Code Property Graph

The Code Property Graph is a data structure designed to mine large codebases for instances of programming patterns. These patterns are formulated in a domain-specific language (DSL) based on Scala. It serves as a single intermediate program representation across all languages supported by Ocular.

Property graphs are a generic abstraction supported by many contemporary graph databases such as Neo4j, OrientDB, and JanusGraph. In fact, Ocular's predecessor Joern made use of general purpose graph databases as a storage backend and the graph query language Gremlin. As the limitations of this approach became more apparent over the years, we replaced both the storage backend and query language with our own graph database OverflowDB.

Qwiet has open-sourced the implementation of the code property graph and its specification.

Building Blocks of Code Property Graphs

Code Property Graphs are graphs as studied in discrete mathematics, and more specifically property graphs. A property graph is composed of the following building blocks:

  • Nodes and their types. Nodes represent program constructs. This includes low-level language constructs such as methods, variables, and control structures, but also higher level constructs such as HTTP endpoints or findings. Each node has a type. The type indicates the type of program construct represented by the node, e.g., a node with the type METHOD represents a method while a node with type LOCAL represents the declaration of a local variable.

  • Labeled directed edges. Relations between program constructs are represented via edges between their corresponding nodes. For example, to express that a method contains a local variable, we can create an edge with the label CONTAINS from the method's node to the local's node. By using labeled edges, we can represent multiple types of relations in the same graph. Moreover, edges are directed to express, e.g., that the method contains the local but not the other way around. Multiple edges may exist between the same two nodes.

  • Key-Value Pairs. Nodes carry key-value pairs (attributes), where the valid keys depend on the node type. For example, a method has at least a name and a signature while a local declaration has at least the name and the type of the declared variable.

In summary, Code Property Graphs are directed, edge-labeled, attributed multigraphs, and we insist that each node carries at least one attribute that indicates its type.

note

A Look into the Rear-View Mirror

The Code Property Graph was first introduced in the paper Modeling and Discovering Vulnerabilities with Code Property Graphs in the context of vulnerability discovery for C system code and the Linux kernel in particular. The core ideas outlined in this early work are the following:

  • Different classic program representations are merged into a property graph into a single data structure that holds information about the program’s syntax, control- and intra-procedural data-flow.

  • The property graph is stored in a graph database and made accessible via a domain specific language (DSL) for the identification of programming patterns - based on a DSL for graph traversals.

  • The query language allows to seamlessly transition between the original code representations, making it possible to combine aspects of the code from different views these representations offer in a single query.

From 2014-2016, research followed on (a) extending the concept for interprocedural analysis, (b) using the graph as a basis for learning typical data-flow patterns in a program, (c) the effects of introducing further classic program representations such as dominator trees, (d) and the applicability of the approach for dynamically-typed languages such as PHP.

From 2017 onwards, the code property graph served as the technological foundation for the static analysis solutions developed at Qwiet Inc. The representation has since undergone heavy extensions and transformations. At the statement and expression level, it has matured into a generic container format which allows for hosting of graphs generated by 8 different language frontends, to enable querying with the same query language across programming languages. Moreover, the concept of overlays was introduced to allow representing code on different levels of abstraction, enabling transitioning between these layers of abstraction using the query language in the same way as for the original three low-level code representations. Finally, programming models and APIs are now available for parallel graph processing at low memory footprint - a core ingredient for scaling the approach.

A First Example

Let us look at some code through the lens of the Code Property Graph. Take this simple faulty program written in C that prints an unimportant message to stderr when called with the correct arguments:

// x42.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
char password[26];
strcpy(password, argv[1]);
if (strcmp(password, "correcthorsebatterystaple") == 0) {
fprintf(stderr, "It depends!\n");
exit(42);
}
printf("What is the meaning of life?\n");
exit(0);
}

You've probably already identified the vulnerability. strcpy reads bytes from argv[1] without length checking. A carefully crafted input string will let you jump over the if check, straight to any point in the program, for example to the line which prints the unimportant message to stderr. All is required to achieve that is to call the binary with a 40-length string to overflow the buffer, and concatenate it with the address of the first instruction in the inner body (which is trivial to obtain with after dumping the hex contents of the binary with something like objdump). The exploit execution could look something like this: perl -e 'print "A"x12."\xf9\x83\x04\x08"' | ./x42.

Alright. This eyeball-method works well on a straightforward example like the one above, but it does not scale to real-world applications with hundreds of thousands lines of code, built on legacy architectural designs, using external libraries and exhibiting irregularities. Let us instead look at a method that does work, and does scale, and can handle irregularities, the Code Property Graph. The learning curve of the Code Property Graph is steep, but its rewards are great. Here is a visual representation of the Code Property Graph for the program you've just examined:

Each one of the 152 rectangles represents a node in the graph, and each color its type; edges are not depicted, but are nonetheless there. This visual oversimplification hides away a lot of information, but it should nonetheless help you construct a strong basic mental framework for working with the Code Property Graph. Let us examine the nodes in this graph and how the relationships between them can uncover the buffer overflow hidden away in plain sight.

A very simple node to look at is the META_DATA, highlighted in the blue rectangle:

There is one META_DATA node for each Code Property Graph, and its properties include: id, an integer that uniquely identifies the node in the graph (in this case with a value of 2), language, which holds the identifier for the type of input the Code Property Graph has been generated from (with a value of C), and type, which specifies the node type, i.e. what properties it has and is allowed to have (with a value of META_DATA); taken together, it looks something like this {"id": 2, "type": "META_DATA", "language": "C"}. This node only provides a bit of information, namely that the language we are analyzing is C. Other nodes will help us reach our goal of identifying the vulnerability.

We already know that the language we are working with is C, and it is common knowledge that programs written in C tend to be vulnerable when calling certain unsafe functions like gets or strcpy, so why not continue this exploration by finding all nodes in the graph of type CALL with a name value of gets OR strcpy? As expected, one node pops up:

Just because an unsafe function is called, that does not mean the program which calls it is vulnerable. Its input has to be controllable from the outside, and it has to be called without bounds-checking: both statements can be tested for easily with the Code Property Graph. For the first statement, we might want to find out if any of the arguments passed to the main function are used in the call to strcpy. The METHOD node comes in handy in this case. But you will be surprised if you query the graph for nodes of type METHOD, 14 results appear for that simple C program with only a main function:

There is a good reason for this, and it has to do with the flexibility of the Code Property Graph. In order to represent programming languages with very different syntax in the same data structure for the specific purpose of vulnerability discovery, it becomes useful to re-arrange, re-group or rename specific statements. For C-like languages, a statement of the form 26 * sizeof(char) becomes a node of type METHOD with the value for its name property set to <operator>.multiplication, and it would have edges to nodes representing 26 and sizeof(char). If this does not make sense yet, no worries. This shift in thinking takes some time, but as soon as you get the hang of it, you will swift through codebases at incredible speeds. Back to our example, we wanted to check if the arguments passed to the main functions end up being used in strcpy. If we query the graph again for nodes of type METHOD with the name property set to main, we find the one node representing the entry function:

From that node, we can follow its edges to nodes of type METHOD_PARAMETER_IN, which contain the substring char * in their CODE property value (char * because character buffers are easy to misuse):

This node represents, as its type suggests, the parameter of a method, and it includes the following properties: {"id": 8, "code": "char *argv[]", "LINE_NUMBER": 5}. In order to asses the first statement, i.e. that an argument passed from the outside of the program ends up in the strcpy call, we can follow the edges to the argument nodes of the node representing strcpy and check if one of them matches the name of the node we just found, argv. One does. It's type is CALL, its CODE value argv[1] and its name <operator>.indirectIndexAccess; a definite match.

Our first statement is true. We've proved that an external buffer ends up being read in strcpy. Incidentally, we've proved the second statement is true as well, because the argument node of strcpy has the name <operator>.indirectIndexAccess and the CODE value argv[1], properties which clearly point to a lack of bound-checks.

That is it for the Introduction. You've learned the basics of a novel data structure, and finished your first walkthrough of a vulnerability discovery using the Code Property Graph!