Policy Language

The ShiftLeft Policy Language is used to create, customize and manage Policies. The Policy Language is made up of four types of directives:

  • Tagging. Exposed methods, interface interactions and transformations are tagged in the Code Property Graph (CPG) based on syntax-patterns. The Policy contains tagging directives to encode these patterns.

  • Flow Description. The Policy specifies patterns for information flows that should be reported as possible instances of vulnerabilities, particularly data leaks.

  • (Advanced) Taint Semantics. At the lowest level of abstraction, Policies define taint semantics. These directives map method input parameters to output parameters to express the propagation of taint. This information is stored in the Inter-Procedural Control Flow Graph and can subsequently be accessed by static taint tracking algorithms.

  • (Advanced) Type Annotations. The Code Property Graph contains information about the type hierarchy (return and parameter types of methods, inheritance, member variables, vtable entries used for dynamic dispatch). Normally, it is the job of the language frontend ($lang2cpg) to generate this information. However, it is also possible to annotate additional type information via policies.

Each type of directive is documented in detail below, with examples of how it is employed in the default Policy.

For more complex Policy use cases that may require additional features and/or integration with other services, contact us.

Tagging Directives

Tagging Directives to Mark IO Endpoints, Transformations and Exposed Methods

As a result of invoking library methods, data may be read from the outside world or written to it. Tagging directives are used to inform ShiftLeft about these methods. These directives result in tagging the CPG.

The Policy Language provides the IO, TRANSFORMER and EXPOSED tagging directives for IO endpoints, transformers and exposed methods, respectively. The directives all follow the format

$command label = METHOD -f "$fullName" [{ (PAR -i $i|RET|INST) "(SOURCE|SINK|DESCRIPTOR|DESCRIPTOR_USE)" }]

where

  • $command: either IO, TRANSFORMER or EXPOSED
  • $fullName: method signature
  • $i: parameter number
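
To make the format concrete, the following is a minimal Python sketch that splits a tagging directive into its parts. It is not ShiftLeft's parser: the function name parse_tagging_directive and both regular expressions are assumptions made for illustration, and the comma between entries inside the braces follows the directives shown in this document rather than a published grammar.

```python
import re

# Hypothetical patterns that mirror the documented format; not ShiftLeft's parser.
_DIRECTIVE = re.compile(
    r'^(?P<command>\w+)\s+(?P<label>\w+)\s*=\s*METHOD\s+-f\s+"(?P<full_name>[^"]*)"'
    r'\s*(?:\{(?P<body>.*)\})?\s*$'
)
_ENTRY = re.compile(
    r'^(?:PAR\s+-i\s+(?P<index>\d+)|(?P<slot>RET|INST))\s+'
    r'"(?P<role>SOURCE|SINK|DESCRIPTOR_USE|DESCRIPTOR)"$'
)


def parse_tagging_directive(line):
    """Return (command, label, full_name, [(slot, role), ...])."""
    match = _DIRECTIVE.match(line.strip())
    if match is None:
        raise ValueError(f"not a tagging directive: {line!r}")
    entries = []
    # The optional braces hold comma-separated (slot, role) entries.
    for raw in (match.group("body") or "").split(","):
        raw = raw.strip()
        if not raw:
            continue
        entry = _ENTRY.match(raw)
        if entry is None:
            raise ValueError(f"bad entry: {raw!r}")
        slot = entry.group("slot") or f"PAR {entry.group('index')}"
        entries.append((slot, entry.group("role")))
    return (match.group("command"), match.group("label"),
            match.group("full_name"), entries)


example = 'IO file = METHOD -f "java.io.FileInputStream.read:int(byte[])" { PAR -i 1 "SOURCE" }'
print(parse_tagging_directive(example))
# ('IO', 'file', 'java.io.FileInputStream.read:int(byte[])', [('PAR 1', 'SOURCE')])
```

The sketch only checks the shape of a directive; it does not validate that the method signature exists or that the command name is one the Policy Language accepts.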

IO Tagging Directives

Employed to describe the effects of calls to external libraries. An example is the following directive from the default Policy for "java.io.FileInputStream":

IO file = METHOD -f "java.io.FileInputStream.read:int(byte[])" { PAR -i 1 "SOURCE" }

This directive specifies that, upon invoking the method "java.io.FileInputStream.read:int(byte[])", its first parameter serves as a data source, and that this is data read from a file. Similarly, the method "java.io.FileInputStream.read:int()" introduces an integer read from a file into the program. This can be encoded via the Policy directive

IO file = METHOD -f "java.io.FileInputStream.read:int()" { RET "SOURCE" }

Library methods can also write data to the outside. Analogously to FileInputStream, FileOutputStream can be annotated as follows:

IO file = METHOD -f "java.io.FileOutputStream.write:void(byte[])" { PAR -i 1 "SINK" }

This directive indicates that data which reaches the first parameter of the method "java.io.FileOutputStream.write:void(byte[])" is written to a file.

DESCRIPTOR Flows Tagging Directives

Usually, the primary data flow of an IO flow does not include the creation of less relevant objects. However, these object creations sometimes determine whether the flow is actually vulnerable. For this reason, descriptor flows of the source and sink can provide additional clues about the primary data flow.

An example where descriptors are required:

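File f1 = new File(HttpRequest.read());           // file name taken from an HTTP request
FileOutputStream fs1 = new FileOutputStream(f1);  // descriptor used by the write
File f2 = new File("static.txt");                 // fixed file name
FileInputStream fs2 = new FileInputStream(f2);    // descriptor used by the read
byte[] res = new byte[1024];
fs2.read(res);                                    // SOURCE: data read from a file
fs1.write(res);                                   // SINK: data written to a file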

In this example, data is read from one file (SOURCE) and written to another file (SINK). Such operations are common and usually intended, and the primary data flow (from SOURCE to SINK) shows nothing attacker-controllable. Descriptor flows help here: they trace back the creation of the files and identify whether a file name is controllable from outside, which is the case for f1, whose name comes from HTTP.

Within an IO flow, DESCRIPTOR_USE marks where a descriptor is used during the read or write operation. In this case, the stream to which the data is written is the instance:

IO file = METHOD -f "java.io.FileOutputStream.write:void(byte[])" { PAR -i 1 "SINK", INST "DESCRIPTOR_USE" }

To describe these descriptor instances, tag their creation with DESCRIPTOR. This augments the information in the IO flow summary:

IO fileStream = METHOD -f "java.io.FileOutputStream.<init>:void()" { INST "DESCRIPTOR" }

Descriptor flows are found recursively, so a descriptor creation can itself make use of other descriptors. This is especially the case for constructor parameters:

IO fileStream = METHOD -f "java.io.FileOutputStream.<init>:void(java.io.File)" { INST "DESCRIPTOR", PAR -i 1 "DESCRIPTOR_USE" }

N.B. Parameters can have different IO names and tags depending on the actual context; only the correct one is considered.

IO fileStream = METHOD -f "java.io.FileOutputStream.<init>:void(java.lang.String)" { INST "DESCRIPTOR", PAR -i 1 "DESCRIPTOR_USE" }
IO filePath = METHOD -f "java.io.FileOutputStream.<init>:void(java.lang.String)" { PAR -i 1 "SINK" }

TRANSFORMER Tagging Directives

Allows you to specify which methods transform data or may be considered data validation routines. As an example, consider the method "encodeBase64", which takes an input string as its first argument and returns a base64-encoded version of that string. This behavior is captured by the following directive

TRANSFORM base64encode = METHOD -f "org.apache.commons.codec.binary.Base64.encodeBase64:byte[](byte[])" { PAR -i 1 "SINK", RET "SOURCE" }

The first parameter is a data sink, and the return value acts as a data source. Transformer directives can also be used to model validations of input arguments. So you may wish to specify that a string is considered to be validated if it passes through a string comparison. To achieve this, transformations such as the following can be defined:

TRANSFORM stringCompare = METHOD -f "java.lang.String.compareTo:int(java.lang.String)" { PAR -i 1 "SINK", RET "SOURCE" }
TRANSFORM stringCompare = METHOD -f "java.lang.String.contains:boolean(java.lang.CharSequence)" { INST "SINK", RET "SOURCE" }
TRANSFORM stringCompare = METHOD -f "java.lang.String.matches:boolean(java.lang.String)" { INST "SINK", RET "SOURCE" }
TRANSFORM stringCompare = METHOD -f "java.lang.String.startsWith:boolean(java.lang.String)" { INST "SINK", RET "SOURCE" }
...

WHEN Tagging Directives

Describes the effects of transformers. For example, a base64 encoder generates encoded data. The Policy Language allows arbitrary tags to be added or removed as a result of a transformation. So

WHEN TRANSFORM base64encode => DATA +encoded

specifies that, on invocation of the Apache encodeBase64 method, the tag "encoded" is added to the tags of the output. Similarly, the tag "encoded" can be removed via the directive

WHEN TRANSFORM base64decode => DATA -encoded

assuming that a transformer with the "base64decode" label is specified.
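
The effect of these WHEN directives can be pictured as set operations on the tags carried by the data. The following Python sketch is only a model of that idea, not ShiftLeft's implementation; the function name apply_tag_effect and the representation of tags as a set of strings are assumptions.

```python
def apply_tag_effect(tags, effect):
    """Apply an effect such as '+encoded' or '-encoded' to a set of data tags.

    Hypothetical model: tags are plain strings and an effect is a sign
    followed by a tag name, as in the DATA part of a WHEN directive.
    """
    sign, name = effect[:1], effect[1:]
    if sign == "+":
        return frozenset(tags) | {name}
    if sign == "-":
        # Removing a tag that is not present leaves the set unchanged.
        return frozenset(tags) - {name}
    raise ValueError(f"effect must start with '+' or '-': {effect!r}")


data_tags = frozenset({"pii"})
after_encode = apply_tag_effect(data_tags, "+encoded")     # base64encode
after_decode = apply_tag_effect(after_encode, "-encoded")  # base64decode
print(sorted(after_encode), sorted(after_decode))
# ['encoded', 'pii'] ['pii']
```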

EXPOSED Tagging Directives

Used to mark methods that can be triggered by attackers from the outside. For example, suppose you know that the handler "com.mycustomer.MyClass.httpHandler:java.lang.String(java.lang.String)" can be called by an attacker from the outside and that its first parameter is attacker-controlled. Moreover, the return value of the method is passed back to the attacker. This can be captured with the directive

EXPOSED http = METHOD -f "com.mycustomer.MyClass.httpHandler:java.lang.String(java.lang.String)" { PAR -i 1 "SOURCE", RET "SINK" }

Note to experts: EXPOSED directives perform the same actions as IO directives, with the small difference that SOURCE parameters and SINK parameters correspond to output and input parameters respectively for IO directives, and to input and output parameters for EXPOSED directives.

Flow Description Directives

With data sources, sinks, descriptors and transformers tagged, these are now combined in flow descriptions to describe flows of interest. A flow description has the form

CONCLUSION label = FLOW (DATA | IO | DATASOURCE) $expr1 [-> (DATA | IO) $expr2]

where $expr1 and $expr2 are Boolean expressions over tags in accordance with the grammar

expr := tag
| tag OR expr
| tag AND expr
| NOT expr
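
As an illustration, the grammar above can be read literally as a small right-recursive evaluator. The Python sketch below is a model under stated assumptions, not ShiftLeft's implementation: the function name evaluate_tag_expr is hypothetical, tokens are assumed to be separated by whitespace, the keywords are matched case-insensitively, and the operator grouping simply follows the grammar as written (the left operand of AND and OR is always a single tag, and the negation keyword applies to everything that follows it).

```python
def evaluate_tag_expr(expr, tags):
    """Evaluate a Boolean tag expression against a set of tags."""
    tokens = expr.split()

    def parse(pos):
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        if token.upper() == "NOT":
            # Negation of the rest of the expression.
            value, pos = parse(pos + 1)
            return (not value), pos
        if token.upper() in ("AND", "OR"):
            raise ValueError(f"expected a tag, got {token!r}")
        value = token in tags  # a bare tag is true if it is present
        pos += 1
        if pos < len(tokens):
            operator = tokens[pos].upper()
            if operator not in ("AND", "OR"):
                raise ValueError(f"expected AND or OR, got {tokens[pos]!r}")
            rest, pos = parse(pos + 1)
            value = (value and rest) if operator == "AND" else (value or rest)
        return value, pos

    result, _ = parse(0)
    return result


print(evaluate_tag_expr("http AND NOT sensitive", {"http"}))               # True
print(evaluate_tag_expr("http AND NOT sensitive", {"http", "sensitive"}))  # False
```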

Examples of valid expressions are http, http OR ftp and http AND NOT sensitive. The simplest flow expressions only characterize tags at the end of an information flow. For example, to capture all flows that reach the outside world and are encrypted, the flow description is formulated as

CONCLUSION encrypted-to-outside = FLOW DATA (encrypted)

Similarly, to capture all flows to http methods, use the description

CONCLUSION to-http = FLOW IO (http)

It is also possible to constrain both data tags and method tags. For example, to capture unencrypted data tagged as personally identifiable information sent out over HTTP, use the following directive:

CONCLUSION unencrypted-pii-to-http = FLOW DATA (pii AND NOT encrypted) -> IO (http)

Moreover, it is possible to restrict tags at the data source. So all flows where attacker-controlled data enters the application via HTTP and remains attacker-controlled throughout the entire flow can be captured as

CONCLUSION attacker-controlled-from-http = FLOW IO (http) -> DATA (attacker-controlled)

Both source and sink tags can be restricted, e.g., to capture flows from files to http, using the directive

CONCLUSION file-to-http = FLOW IO (file) -> IO (http)

Finally, to additionally restrict the flow to unencrypted data, use

CONCLUSION file-to-http-unencrypt = FLOW IO (file) -> DATA (NOT encrypted) -> IO (http)
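
One way to picture the last description is as a predicate over a flow that carries three sets of tags: those of its source, those of the data, and those of its sink. The Python sketch below is such a model and nothing more; the FlowSummary class, its field names and the function name are assumptions made for illustration and do not correspond to ShiftLeft data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSummary:
    """Hypothetical summary of one data flow."""
    source_io: frozenset  # IO tags of the method where the flow starts
    data: frozenset       # tags carried by the data itself
    sink_io: frozenset    # IO tags of the method where the flow ends


def matches_file_to_http_unencrypt(flow):
    """Model of: FLOW IO (file) -> DATA (NOT encrypted) -> IO (http)."""
    return (
        "file" in flow.source_io
        and "encrypted" not in flow.data
        and "http" in flow.sink_io
    )


plain = FlowSummary(frozenset({"file"}), frozenset(), frozenset({"http"}))
cipher = FlowSummary(frozenset({"file"}), frozenset({"encrypted"}), frozenset({"http"}))
print(matches_file_to_http_unencrypt(plain), matches_file_to_http_unencrypt(cipher))
# True False
```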

Recall the previous example

CONCLUSION unencrypted-pii-to-http = FLOW DATA (pii AND NOT encrypted) -> IO (http)

This pattern has a problem: Suppose we find a flow that picks up the "pii" attribute on the way. Then, any extension of this flow beyond its source will still be matched (as long as it is not encrypted)! In other words, rules that restrict DATA, i.e. the path of the flow, but fail to restrict either source or sink are prone to producing duplicate findings. Such rules are temporarily useful during development and debugging of policy sets, but a better rule would be:

CONCLUSION unencrypted-pii-to-http = FLOW DATASOURCE (pii) -> DATA (NOT encrypted) -> IO (http)

The directives FLOW IO and FLOW DATASOURCE behave similarly in that they specify the sources of matched flows. They differ in the domain of the tags they match against: For example, "pii" is often attached not at external program boundaries, but instead based on e.g. static type information, or even names of local variables. As such, DATASOURCE flows can originate in e.g. local variables or fields of classes.

Propagated descriptor flow tags are prefixed with $ to allow more granular matching:

CONCLUSION file-to-http = FLOW IO ($fileStream AND read) -> IO ($http AND write)

Upon matching a "CONCLUSION" statement, a message is emitted that characterizes the finding:

CONCLUSION file-to-http = FLOW IO (file) -> IO (http)
WHEN CONCLUSION file-to-http => EMIT {
title: "File to http",
description: "The contents of a file are sent out via http",
category: "a6-sensitive-data-exposure",
score: "1.0"
}

Taint Semantics Directives

For advanced use only.

Library methods may also simply propagate taint without performing any transformations on the data that change its security properties. For standard libraries, these propagation rules are already provided by the default Policy. However, for exotic libraries whose code is unavailable to ShiftLeft, these rules can also be specified manually via MAP directives. These directives specify how taint is propagated from the input parameters of a library method to its output parameters. MAP directives follow the form

MAP -[override|preserve] -d (RET | INST | PAR -i $i) -s (INST | PAR -i $i) METHOD -f "$fullName"

where $i is a parameter number and $fullName is the full name of a method.

MAP statements associate a source parameter of a given method with a destination parameter. As a result, ShiftLeft is informed that, if the source parameter is tainted, then the destination parameter is tainted after execution of the library method.

As an example, if the method "java.lang.String.concat:java.lang.String(java.lang.String)" of the Java standard library has a tainted instance parameter, then the return value of the library call is tainted as well. This is specified using the MAP directive

MAP -override -d RET -s INST METHOD -f "java.lang.String.concat:java.lang.String(java.lang.String)"

The directive indicates that, if the instance parameter (INST) is tainted, then the return parameter (RET) is tainted after execution of the method. The flag "-preserve" additionally adds a mapping from the destination to itself. For example,

MAP -preserve -d INST -s PAR -i 1 METHOD -f "java.lang.StringBuffer.append:java.lang.StringBuffer(java.lang.StringBuffer)"

is shorthand for

MAP -override -d INST -s PAR -i 1 METHOD -f "java.lang.StringBuffer.append:java.lang.StringBuffer(java.lang.StringBuffer)"
MAP -override -d INST -s INST METHOD -f "java.lang.StringBuffer.append:java.lang.StringBuffer(java.lang.StringBuffer)"
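
The difference between the two flags can be modeled with a few lines of Python. This is a sketch of one plausible reading, not ShiftLeft's taint engine: it assumes that a destination with at least one mapping is tainted after the call exactly when one of its mapped sources was tainted before the call, and that slots without any mapping keep their state. The function names expand_map and propagate, and the slot names "INST", "RET" and "PAR 1", are chosen for illustration.

```python
def expand_map(flag, dest, src):
    """Expand one MAP directive into (destination, source) pairs."""
    if flag == "override":
        return [(dest, src)]
    if flag == "preserve":
        # Shorthand: also map the destination to itself.
        return [(dest, src), (dest, dest)]
    raise ValueError(f"unknown flag: {flag!r}")


def propagate(tainted_before, mappings):
    """Return the set of tainted slots after the call (assumed semantics)."""
    before = set(tainted_before)
    sources_by_dest = {}
    for dest, src in mappings:
        sources_by_dest.setdefault(dest, set()).add(src)
    after = set(before)
    for dest, sources in sources_by_dest.items():
        if sources & before:
            after.add(dest)
        else:
            after.discard(dest)  # no tainted source: old taint is overridden
    return after


append_preserve = expand_map("preserve", "INST", "PAR 1")
append_override = expand_map("override", "INST", "PAR 1")
print(propagate({"INST"}, append_preserve))  # {'INST'}
print(propagate({"INST"}, append_override))  # set()
```

Under these assumptions, a StringBuffer that was already tainted stays tainted after append with -preserve, whereas -override alone would replace its taint state with that of the argument.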

In some cases it is necessary to provide more fine-grained MAP statements which describe taint flow from/to sub-elements of the destination or source in order to avoid overtainting. A common use case is combo objects which encapsulate multiple member variables and provide getter/setter methods to access them. As an example, imagine a Tuple class whose instances are created via its constructor new Tuple(x, y) and whose elements are accessed via getX and getY.

MAP -override -d INST -ap { . "x" } -s PAR -i 1 METHOD -f "Tuple.<init>(java.lang.Object,java.lang.Object)"
MAP -override -d INST -ap { . "y" } -s PAR -i 2 METHOD -f "Tuple.<init>(java.lang.Object,java.lang.Object)"
MAP -override -d RET -s INST -ap { . "x" } METHOD -f "Tuple.getX"
MAP -override -d RET -s INST -ap { . "y" } METHOD -f "Tuple.getY"

Furthermore, it is possible to describe indirect calls. Consider for example:

METHOD_MAP -n "apply" -s "java.lang.Object(java.lang.Object)" METHOD -f "scala.Option.map:scala.Option(scala.Function1)" PAR -i 1 { IN -d PAR -i 1 -s INST, OUT -d RET -s RET }

with associated code like myOption.map(barFunction). This directive indicates that the method "scala.Option.map:scala.Option(scala.Function1)" will cause a virtual function call that is dispatched on its first argument (the barFunction object, as described by the first PAR -i 1). This dynamically dispatched function call will use the binding (i.e. vtable entry) associated with the name "apply" and signature "java.lang.Object(java.lang.Object)". That will cause the dataflow engine to track back where barFunction comes from, in order to figure out its runtime type (this is often very easy, e.g. if barFunction is a static function; it is doable if barFunction is a local variable with simple data flow; and it can be undecidable for code that passes around lots of function pointers / Scala function objects / Java function objects).

The first argument to this dynamically dispatched call will be tainted from the instance parameter (myOption, described by IN -d PAR -i 1 -s INST) and its return-value will taint the return-value of the map call (OUT -d RET -s RET). In addition to the IN and OUT mappings specified in a METHOD_MAP, there exist two default mappings for Java and CSharp which map the argument on which the call will be dispatched to the instance parameter in the IN direction and the other way around for the OUT direction. So for our example the first default mapping is equivalent to IN -d INST -s PAR -i 1 and the second is equivalent to OUT -d PAR -i 1 -s INST.

Unicode escaping

Sometimes, full names of methods or regular expressions contain special characters that are either bothersome to type or read, or interact with the policy parser (e.g. newlines, the double-quote character or null bytes). In these cases, we offer Unicode escaping of all quoted entities. For example, to attach a semantic to the method with full name 'foo"', one can write

MAP -override -d INST -s PAR -i 1 METHOD -f "foo\U0022"

The exact regular expression pattern used for unescaping is "\U([0-9a-f]{4})". In the unlikely case that one tries to attach to a name that contains a matching substring, one can escape one of the characters. So if the desired method has fullname "\U1234", then the policy could be written as e.g. MAP -override -d INST -s PAR -i 1 METHOD -f "\U005cU1234" or alternatively as MAP -override -d INST -s PAR -i 1 METHOD -f "\U12\U00334".
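
The unescaping rule can be reproduced in a few lines of Python to check how a quoted name will be read. Only the pattern is taken from the text above; the function name unescape_policy_string is hypothetical, and the sketch assumes a single left-to-right pass in which the result of one replacement is not unescaped again.

```python
import re

# Pattern from the documentation: backslash, 'U', four lowercase hex digits.
_ESCAPE = re.compile(r"\\U([0-9a-f]{4})")


def unescape_policy_string(quoted):
    """Replace each \\Uxxxx escape by the character with that code point."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), quoted)


print(unescape_policy_string(r"foo\U0022"))     # foo"
print(unescape_policy_string(r"\U005cU1234"))   # \U1234
print(unescape_policy_string(r"\U12\U00334"))   # \U1234
```

Both spellings of the tricky name yield the literal text \U1234 because the characters produced by a replacement are not scanned again.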

We recommend using Unicode escapes sparingly: Policy files are supposed to be human readable and amenable to grep.

Type Assert Directives

For advanced use only. This feature is only enabled for a subset of supported languages (currently javascript and llvm). This feature intentionally does not support regular expressions.

The Code Property Graph contains information about the type hierarchy (return and parameter types of methods, inheritance, member variables, vtable entries used for dynamic dispatch). Normally, it is the job of the language frontend ($lang2cpg) to generate this information. However, it is also possible to annotate additional type information via policies. This is for example necessary if the frontend cannot access the implementation or headers of a used library.

When type-related statements are used in the policy, then the CPG is augmented with both the annotated type information, and additional nodes describing the possibly missing types: All annotated potential entities that are (transitively) reachable from the nodes present in the CPG get added to it. That is, "if $thing is present in the CPG" means "if $thing is either present in the CPG as emitted by the frontend, or if something caused us to ensure presence of $thing".

Possible type-annotations for methods are:

TYPEASSERT METHOD -f "myMethodFullyQualifiedName" PAR -i 3 SUBTYPE TYPE -f "declaredUpperBoundOnParam3"
TYPEASSERT METHOD -f "myMethodFullyQualifiedName" RET EXACT TYPE -f "actualConcreteReturnType"

These statements look up whether the method "myMethodFullyQualifiedName" exists in the CPG. If so, then (1) the third argument is declared to be of type "declaredUpperBoundOnParam3" or anything deriving from it (like Java instanceof); (2) the return type is declared to be exactly "actualConcreteReturnType" (in the sense of Java var.getClass() == T.class); and (3) both these types are ensured to be present in the CPG (if they are not present, they are created).

The main practical difference between SUBTYPE and EXACT affects the dataflow tracker: When we encounter a dynamically dispatched function call, or otherwise need to resolve possible run-time types of a variable, then EXACT terminates the search (just as e.g. constructors do). Multiple EXACT type annotations may be added to the same entity; we then assume that the exact runtime type is one of the annotated candidates. This somewhat paradoxical possibility would be catastrophically unsound if our goal was to design a type-safe language, but it is useful for modeling partial knowledge of type relations, especially for languages like javascript, where declared SUBTYPE information is virtually useless.

A special case is TYPEASSERT ... SUBTYPE TYPE -f "ANY": This typeassert does not overwrite existing subtype information in the CPG. The type ANY does not stand for the root of the type hierarchy; instead, it stands for missing type information.

Inheritance can be described by

INHERIT TYPEDECL -f "derivedTypeDecl" TYPE -f "superType"

If the derivedTypeDecl is present in the CPG, then we will ensure that superType is present. However, presence of superType will not cause us to add derivedTypeDecl to the CPG if it was not already present.

The last statement exhibits a split between TYPE and TYPEDECL. This split is modeled after the Java virtual machine: The TypeDecl is the erased type (i.e. the runtime type), while the TYPE is the parametrized type (i.e. the static type). Typically one does not need to care about this fine distinction; unless otherwise specified, it is assumed that each TYPE realizes a TYPEDECL with identical fullName.

In the exceedingly rare case of a mismatch between these entities, one can use

REALIZE TYPEDECL -f "myDecl" TYPE -f "myType"

in order to specify that the type myType realizes the typeDecl myDecl instead of the otherwise assumed typeDecl myType. Therefore statements like REALIZE TYPEDECL -f "someType" TYPE -f "someType" are redundant.

In order to attach members to a typeDecl, we simply assert their type:

TYPEASSERT MEMBER -n "memberName" TYPEDECL -f "typeDeclThatHasMember" EXACT TYPE -f "concreteMemberType"

If "typeDeclThatHasMember" exists in the CPG, then we ensure that it has a member with name "memberName", and concrete type concreteMemberType (and we ensure presence of concreteMemberType in the CPG). SUBTYPE assertions work analogously.

Finally, it is possible to attach methods to typeDecls. This corresponds to a VTable-entry / a method-binding. The syntax is

BIND -n "someName" -s "someSignature" METHOD -f "boundMethodFullName" TYPEDECL -f "typeDeclThatHasBinding"

If the typeDeclThatHasBinding exists, and if the pair (someName, someSignature) occurs in the CPG, then we ensure that boundMethodFullName exists, and we attach it to the dispatch table of typeDeclThatHasBinding.

One can see that these typing-related statements actually build a type-hierarchy of an ecosystem of libraries, a "shadow CPG" that is then partially linked into the actual CPG, depending on which parts of the "shadow CPG" are reachable from the actual as-emitted-by-frontend CPG.
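
The linking step described here amounts to a reachability computation. The Python sketch below models it under simplifying assumptions: every type-related statement is reduced to an edge "if A is present, ensure B is present", and the function name link_shadow_cpg and the dictionary representation are hypothetical, not part of ShiftLeft.

```python
from collections import deque


def link_shadow_cpg(present, ensures):
    """Return all entities present after linking.

    present: entities in the CPG as emitted by the frontend.
    ensures: maps an entity to the entities whose presence it ensures,
             e.g. a derived typeDecl ensures its super type.
    """
    linked = set(present)
    queue = deque(present)
    while queue:
        entity = queue.popleft()
        for target in ensures.get(entity, ()):
            if target not in linked:
                linked.add(target)
                queue.append(target)  # follow edges transitively
    return linked


# Edges as they would arise from two INHERIT statements.
ensures = {"derivedTypeDecl": ["superType"], "superType": ["baseType"]}
print(sorted(link_shadow_cpg({"derivedTypeDecl"}, ensures)))
# ['baseType', 'derivedTypeDecl', 'superType']
print(sorted(link_shadow_cpg({"superType"}, ensures)))
# ['baseType', 'superType']
```

The second call shows the asymmetry stated for INHERIT: the presence of superType does not pull derivedTypeDecl into the CPG.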

Caret-Paths

Full names that begin with the caret '^' are treated specially: These are actually parsed into a path through the type hierarchy. The exact handling of such names is currently experimental and unstable.

Dumping of type information

It is possible to dump the entire type hierarchy of a CPG into a policy file. This is useful in order to debug type information or learn about the typing-related parts of the policy language, either as generated by the frontend or as generated during linking from policy. This feature is accessible from Ocular via io.shiftleft.semanticcpgext.utils.TypePolicyPrinter.summarizeTypes(cpg: io.shiftleft.codepropertygraph.Cpg):String, or from the command-line via e.g. ./cpg2sp.sh -J-Xmx8G --out ./output.proto --cpg ./cpg.bin.zip --debug-summary-out ./types.policy --debug-summary-flags T