Recently, I've worked extensively with CodeQL — a powerful static analysis tool developed by GitHub — to roll out code scanning as part of CI/CD pipelines. At the core of this is the QL language, which is used to write queries that reason about code. Compared to its competitors, CodeQL excels at taint tracking and data flow analysis, which is a fancy way of saying that it's really good at highlighting how potentially malicious or insecure data can flow through your code and end up in a dangerous place that introduces a security vulnerability.

Now, do you *need* to understand the mathematics behind CodeQL to use it effectively? Not necessarily. But at its core, QL is a declarative logic programming
language, and thus it's built on a foundation of set theory and predicate logic. I like to understand the tools I use, so I wanted to write a little about
how CodeQL and data flow analysis work under the hood. This is by no means a comprehensive guide, but I hope it serves as a useful introduction for those who
are curious!

QL is a declarative language. This means that queries are written in a way that describes the desired result, rather than the steps to achieve it. This is in contrast to imperative languages like Python or Java, where you write code that describes the steps to achieve the desired result.

Here's what a typical CodeQL analysis workflow might look like:

- The source code is compiled into a database. This database contains the relational representation of the codebase, which includes information about the structure of the code, the control flow, and the data flow.
- The CodeQL engine runs QL queries against the database, similar to how one might run SQL queries against a relational database, to find patterns in the code that match the query.
- The results are exported into the SARIF format, which can be consumed by CI tools or custom integrations.

In this sense, CodeQL is a lot like SQL. One might even recognise the similarities in the basic syntax:
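A query along these lines might look like the following sketch (assuming the standard `javascript` library; the original snippet isn't shown here):

```ql
import javascript

// Find functions that are never the callee of any call expression.
from Function f
where not exists(CallExpr c | c.getCallee() = f)
select f, "This function is never called."
```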

This query, for example, finds all functions that are never called in a JavaScript codebase. Notice the `exists` existential quantifier, which checks if there is at least one `CallExpr` that calls the function `f`.

Notice also that QL is object-oriented. `CallExpr::getCallee()` returns an `Expr`, which could be a `FunctionExpr` (which is a `Function`). However, keep in mind that class objects in the traditional sense (allocating memory to hold the state of an instance of a class) don't exist in QL. Here, classes are more like abstract data types that describe sets of existing values.

Binding refers to the process of associating variables in a query with sets of values from the CodeQL database. A common mistake is to think of variables
in the imperative sense, where they can be assigned any value at any time. CodeQL works on relations between values, so variables are bound to **finite sets**
of values that already exist.

Let's walk through a simple example.

First, consider the domain of all elements in the CodeQL database, $D=\{d_1, d_2, \cdots, d_n\}$ where $n$ is finite.
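The predicate and query discussed below might be sketched as follows (an assumption; `Expr.getIntValue()` is taken from the JavaScript standard library, and the original code isn't shown here):

```ql
import javascript

// Holds if the integer values of x and y sum to 42.
predicate sumsTo42(Expr x, Expr y) {
  x.getIntValue() + y.getIntValue() = 42
}

from Expr x, Expr y
where sumsTo42(x, y)
select x, y
```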

We have introduced a predicate `sumsTo42` that takes two `Expr` objects and checks if their integer values sum to 42.

Let:

- $E \subseteq D = \{e \in D \mid e \text{ instanceof Expr}\}$.
- $C \subseteq D = \{c \in D \mid c \text{ instanceof ConstantExpr}\}$.

Then, the predicate `sumsTo42` evaluates to a set of two-tuples:

$\begin{align*} P = \{ (x, y) \in E \times E \mid \text{val}(x) + \text{val}(y) = 42 \} \end{align*}$

where $\text{val}(e)$ denotes the integer value of the expression $e$.

Now, the `from` clause binds the variables `x` and `y` to the set of all `Expr` objects in the database, i.e. $E$. The `where` clause then filters this set to only include those tuples that satisfy the predicate $P$. The result is

$\begin{align*} \{ (x, y) \in E \times E \mid (x, y) \in P \} \end{align*}$
Notice that the predicate $P$ is a finite set, because its arguments are of a finite type! (`Expr` refers to the finite set of all expressions in the CodeQL database.)
Because the query results rely on checking for membership in predicate sets, all predicates must evaluate to a finite set — otherwise it's impossible to evaluate
the query in finite time.

For example, this query won't work:
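A hypothetical predicate over unbounded integers illustrates the problem:

```ql
// Neither x nor y ranges over a finite type, so this relation would be
// an infinite subset of int × int — the compiler rejects it.
predicate sumsTo42(int x, int y) {
  x + y = 42
}
```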

because the predicate arguments aren't bound! Clearly, $x$ is bound iff $y$ is bound, and vice versa. Since there are no other operators in the predicate that could bind either variable, the predicate is not well-defined (attempting to evaluate an infinite subset of $\mathbb{Z} \times \mathbb{Z}$) and the compiler will helpfully throw an error.

The real power of CodeQL lies in its ability to track data flow through a codebase. This is particularly useful for security analysis, where one might want to find out how user input flows through the system and ends up in a dangerous place.

To do this, CodeQL constructs a data flow graph that represents how data flows through the code (e.g. passed between variables, functions, and expressions). Note that this is not the same as an abstract syntax tree, which represents the syntactic structure of the code. Here's an example:
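The original snippet isn't reproduced here, but a stand-in example consistent with the discussion below (where `x` ends up with possible values `0` and `p.f`) could be:

```javascript
function f(p) {
  var x = 0;   // data flow edge: the constant 0 -> x
  if (p) {
    x = p.f;   // data flow edge: the property read p.f -> x
  }
  return x;    // data flow edge: x -> the return value of f
}
```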

Notice how the assignments are represented as *edges*, and the conditionals are not present in the graph. We don't care that `x` is `0` if `p` is falsy — we just need to know that `x` has possible values of `0` and `p.f`.

Taint tracking takes it a step further by marking certain data as "tainted", and propagating this through derived data. For example, if `y = foo(p)`, and `p` is tainted, then `y` is also tainted, because it derives from `p`. This makes sense when analysing bugs and vulnerabilities, because if `p` is untrusted user input, then `y` must also be treated as untrusted.

This is a fundamentally recursive task: given some propagation rules that determine whether the result of some operation, given tainted inputs, is also tainted, explore the program's control flow until no more tainted data can be found.

In other words, find $\overline{X}=F(\overline{X})$ where $\overline{X}$ is a vector of "In's" and "Out's" representing the set of variables tainted before and after
each statement, and $F$ is a function that applies the propagation rules. We want the **least fixed point**, i.e. the smallest $\overline{X}$ such that $\overline{X}=F(\overline{X})$.
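To make this concrete, here's a toy worklist-style computation (illustrative only, not CodeQL's actual implementation): starting from the sources, repeatedly apply a single propagation rule (taint flows across assignments) until nothing changes. The example program below is hypothetical.

```javascript
// Hypothetical program, reduced to its assignment edges (rhs -> lhs).
const edges = [
  ["userInput", "a"], // a = userInput
  ["a", "b"],         // b = a
  ["b", "a"],         // a = b (a cycle, handled by iterating to a fixed point)
  ["c", "d"],         // d = c (c is never tainted)
];

// Compute the least fixed point of the propagation function F:
// X_0 = the sources, X_{i+1} = F(X_i), stopping when X = F(X).
function leastFixedPoint(sources, edges) {
  let tainted = new Set(sources);
  let changed = true;
  while (changed) {
    changed = false;
    for (const [from, to] of edges) {
      if (tainted.has(from) && !tainted.has(to)) {
        tainted.add(to); // a variable, once tainted, stays tainted
        changed = true;
      }
    }
  }
  return tainted;
}
```

Because each pass only ever adds variables and the program is finite, the loop must terminate, mirroring the fixed-point argument below.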

A partial order is a relation that is reflexive, transitive, and antisymmetric. We can define a partial order among vectors of sets:

$\begin{align*} \overline{X} \sqsubseteq \overline{X'} \iff \forall i \cdot X_i \subseteq X'_i \end{align*}$

It's easy to see that this is a partial order: it is reflexive (since $X_i \subseteq X_i$), transitive (since $X_i \subseteq X'_i \land X'_i \subseteq X''_i \implies X_i \subseteq X''_i$), and antisymmetric (since $X_i \subseteq X'_i \land X'_i \subseteq X_i \implies X_i = X'_i$).

A function $F$ is monotone over the partial order $\sqsubseteq$ if, for all $\overline{X}$ and $\overline{X'}$,

$\begin{align*} \overline{X} \sqsubseteq \overline{X'} \implies F(\overline{X}) \sqsubseteq F(\overline{X'}) \end{align*}$

What does this say? It means that "larger" or equal inputs lead to "larger" or equal outputs. Here, $F$ is monotone because applying the propagation rules to a "larger" or equal set of "current" tainted variables yields a "larger" or equal set of new tainted variables. To see this, consider that once a variable is marked as tainted, it remains tainted, while an untainted variable can become tainted with the discovery of new tainted variables.

This ensures that repeatedly applying $F$ progressively grows the set of tainted variables until it converges at the least fixed point, giving us a way to compute it.

Formally, this is Kleene's fixed-point theorem. The least fixed point of the monotonic function $F$ is the limit of the ascending chain obtained by applying $F$ repeatedly starting from the least element $\bot$ (here, the empty set):

$\begin{align*} \text{lfp}(F) = \bigvee_{n \geq 0} F^n(\bot) \end{align*}$

where $F^n$ is the $n$-fold composition of $F$ with itself, and $\bigvee$ denotes the least upper bound.

To see this, consider the sequence $\overline{X}_0 = \bot$, $\overline{X}_{i+1} = F(\overline{X}_i)$.

Since $F$ is monotone, we have $\overline{X}_0 \sqsubseteq \overline{X}_1 \sqsubseteq \cdots$. Because the domain is finite, this ascending chain cannot grow forever, so it stabilises at some $\overline{X}_n$ with $\overline{X}_n = F(\overline{X}_n)$. Therefore $\overline{X}_n$ is a fixed point.

Now consider an arbitrary fixed point $\overline{Y}$ of $F$. In the base case, $\overline{X}_0 = \bot \sqsubseteq \overline{Y}$. Also,

$\begin{align*} \overline{X}_i \sqsubseteq \overline{Y} \implies \overline{X}_{i+1} = F(\overline{X}_i) \sqsubseteq F(\overline{Y}) = \overline{Y} \end{align*}$

so by induction, $\overline{X}_i \sqsubseteq \overline{Y}$ for all $0 \leq i \leq n$. In particular, $\overline{X}_n \sqsubseteq \overline{Y}$, so $\overline{X}_n$ is at least as "small" as any other fixed point, and thus the least fixed point.

Let's walk through a CodeQL query that looks for basic DOM-based XSS vulnerabilities in a JavaScript codebase. To start, we need to define a configuration for taint tracking:
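The original query isn't reproduced here, but in the legacy `TaintTracking::Configuration` style it might be sketched as follows (the class name `DomXssConfig` is an assumption):

```ql
import javascript

class DomXssConfig extends TaintTracking::Configuration {
  DomXssConfig() { this = "DomXssConfig" }

  override predicate isSource(DataFlow::Node source) {
    none() // sources are defined next
  }

  override predicate isSink(DataFlow::Node sink) {
    none() // sinks are defined after that
  }
}
```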

Now, we need to define the sources. These are the places where untrusted data first enters the program. In this case, we're looking for `RemoteFlowSource` (data from e.g. a request parameter), `ClientRequest::Range` (data from an HTTP response), and `SomeOtherSource` (a custom source that we've defined elsewhere).
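A member predicate mirroring the three classes named above might look like this (a sketch; `SomeOtherSource` stands for the custom class defined elsewhere):

```ql
override predicate isSource(DataFlow::Node source) {
  source instanceof RemoteFlowSource or
  source instanceof ClientRequest::Range or
  source instanceof SomeOtherSource
}
```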

These are used in the first iteration, $F(\bot)$, to mark all variables directly tainted by these sources. In the following iterations, variables that are derived from these sources are also marked as tainted.

Next, we need to define the sinks. These are the places where tainted data can cause harm. Eventually, we want to find all paths from any source to any sink, and those will be the results of our query.
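One possible sketch of the sink predicate, as a member of the taint-tracking configuration (`DataFlow::PropWrite` and `DataFlow::CallNode` are standard library classes; the exact original code isn't shown):

```ql
override predicate isSink(DataFlow::Node sink) {
  // Writes to HTML-injection properties, e.g. elem.innerHTML = data
  exists(DataFlow::PropWrite pw |
    pw.getPropertyName() = ["innerHTML", "outerHTML", "srcdoc"] and
    sink = pw.getRhs()
  )
  or
  // Calls that interpret an argument as HTML, e.g. document.write(data)
  exists(DataFlow::CallNode call |
    call.getCalleeName() = ["insertAdjacentHTML", "write", "writeln", "parseFromString"] and
    sink = call.getAnArgument()
  )
}
```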

This defines which nodes in the data flow graph are considered sinks. In this case, we are identifying methods and properties that can be used to inject untrusted data into the DOM, such as `innerHTML`, `outerHTML`, `insertAdjacentHTML`, `srcdoc`, `document.write`, `document.writeln`, and `DOMParser.parseFromString`. When untrusted data reaches these sinks, it can be used to execute arbitrary JavaScript code in the context of the page.

After the least fixed point is computed (and we have the completed set of all tainted elements in the program), we can then evaluate the sinks against this set of tainted elements to determine whether any tainted data can reach a sink. If so, we have a potential vulnerability.

Sometimes, the default taint steps are insufficient to capture the desired propagation. One might extend this by overriding `isAdditionalTaintStep`:
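A sketch of what such an override might look like (assuming the standard `DataFlow::ArrayCreationNode` and `DataFlow::MethodCallNode` classes):

```ql
override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
  // A tainted element taints the array containing it: [a, tainted, c]
  exists(DataFlow::ArrayCreationNode arr |
    pred = arr.getAnElement() and succ = arr
  )
  or
  // Taint flows from a tainted array through its iteration methods
  exists(DataFlow::MethodCallNode call |
    call.getMethodName() = ["find", "filter", "some", "every", "map"] and
    pred = call.getReceiver() and
    succ = call
  )
}
```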

This propagates taint through array elements and array methods like `find`, `filter`, `some`, `every`, and `map`.

One might also want to stop propagation of taint at a sanitization step. When a node is a sanitizer, its output is considered untainted, even if the input is tainted. This can be done by overriding `isSanitizer`:
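A minimal sketch, assuming `DataFlow::globalVarRef` from the standard library:

```ql
override predicate isSanitizer(DataFlow::Node node) {
  // Anything returned by DOMPurify.sanitize(...) is considered clean
  node = DataFlow::globalVarRef("DOMPurify").getAMemberCall("sanitize")
}
```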

Here, we are considering the `sanitize` method from the `DOMPurify` library as a sanitizer.

These modify our propagation function $F$ to include or exclude desired nodes when propagating taint.

CodeQL has built-in support for transitive closures, which is pretty cool.

The transitive closure of a relation $R$ is the smallest relation $R^+$ that contains $R$ and is transitive. Formally, $R^+$ is the intersection of all transitive relations that contain $R$. This can be defined inductively:

$\begin{align*} R^1 &= R \\ R^{n+1} &= R^n \cup (R \circ R^n) \\ R^+ &= \bigcup_{n \geq 1} R^n \end{align*}$

The *reflexive* transitive closure is similar, but includes the identity relation:

$\begin{align*} R^* = R^+ \cup \{ (x, x) \mid x \in D \} \end{align*}$

where $D$ is the underlying domain.

In CodeQL, we can use the `+` operator to denote the transitive closure of a predicate, and the `*` operator to denote the reflexive transitive closure. For example:
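A sketch matching the description that follows (`ControlFlowNode.getASuccessor()` is taken from the standard library):

```ql
// Holds if `end` is reachable from `start` via zero or more CFG edges.
predicate reaches(ControlFlowNode start, ControlFlowNode end) {
  start.getASuccessor*() = end
}
```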

This predicate evaluates to the set of all (`start`, `end`) pairs such that `end` is reachable from `start` by following zero or more edges in the control flow graph.

This is my humble attempt to pen down my mental model of how CodeQL works under the hood, which helps me to visualize how my queries are executed and how the results are derived. It's a really interesting and powerful tool that, while lacking the user-friendliness of some other static analysis tools, makes up for it with its flexibility and expressiveness.

To give you a bit more context about myself, I discovered infosec 3 years ago when I was trying to get out of being sent to the infantry. As a Singaporean male, I had to serve in the army. It just so happened that there was a new scheme that allowed people to serve in "cybersecurity" units, and the prospect of spending my days in an air-conditioned office instead of in the middle of a jungle was very appealing to me.

So without any knowledge of security at all, I borrowed a textbook from my teacher at the time and did what I was good at as an A-level student: memorizing as much as I could a week before the exam. By a stroke of luck, I got in, and the rest was history.

My 18-year-old self, who at the time was pretty sure he wanted to study Physics in university, probably wouldn't have believed that his next 3 years would see him in Las Vegas, Lausanne, Stockholm, and other places around the world competing in CTF competitions, meeting some of the most talented people in the industry, and spending countless hours in front of a computer screen instead of solving differential equations.

I went through some structured training in the army with an emphasis on blue teaming. Then I tried my hand at CTFs for the first time — it was brutal. I could solve maybe 1 or 2 basic challenges, but the rest of the time I just sat there being confused. But realising just how much I didn't know was something that hooked me in, and I played CTFs almost every weekend for the next few months. We even formed our own team, Social Engineering Experts, which in 2022 and 2023 was the top CTF team in Singapore.

Eventually I found my niche in web security. CTF trains you really well for source code review and exploit development, so I started poking around open source projects and eventually got credited for 15 CVEs, most of which were from my deep dive into HTTP request smuggling.

I also started doing bug bounties because among other things, it was fun to hack my own government. I was invited to three private time-bound programs organised by the Singapore government, and was ranked 1st (by "signal", or impact) in each one among global participants.

It was really fun. I ended up playing CTFs with Blue Water, with whom I won 2nd place at the DEF CON CTF in 2023. At some point, a teammate reached out to me and asked if I wanted to join Electrovolt, a security consultancy that was partnering with Cure53. I started doing some work with them on a freelance basis, whenever I had the time. The work has been super cool, working with some really talented people for really big clients. This was also around the time I started my internship with TikTok, where I was doing security research at scale.

After some time trying to break everything I could get my hands on, I started to feel like I was just breaking things for the sake of breaking them. I was getting tired of the constant churn of finding bugs, writing reports, and moving on to the next target. I was tired of the constant pressure to be the best, to be the first, to be the most impactful.

But the biggest question simply boiled down to: "why don't they **just** fix it? It's not that hard." I couldn't understand why teams would ignore important vulnerabilities, or why something that was fixed would be broken again in the next release. I couldn't understand why we were still finding the same classes of vulnerabilities that have been known for years. It's like playing whack-a-mole, but the moles are the size of elephants and there's only one hammer.

But no time to dwell on that too much, because I was starting university. I took this as an opportunity to take a step back from living and breathing security to enjoy life and explore other interests.

I didn't expect my first year at Cambridge to be *this* busy. To be honest, I spent most of the year thinking I was going to fail. Everyone else seemed
to be doing so much better, and I was returning to a really difficult technical degree after years of not doing any math or theoretical computer science.
Cambridge certainly has a way of making you feel like you're not good enough. In the end, despite all that worry, I managed a first-class grade.

I was still doing *some* CTFs (I started playing with the Cambridge team, cheriPI) and freelance work on the side, but there was just a lot more
to try. I started doing hackathons and for the first time, I started thinking about what products people actually need.

A hackathon felt like a microcosm of a software shop. You have a team, a deadline, and a goal. No one cares about security when you're trying to get a product out the door.

I'm spending this summer at Open Government Products, where I now have much more time to think about how to solve security problems at scale, and in a way that lasts. I started to appreciate why security can be so hard, and why it's not just a matter of "just fixing it".

The truth is, security is hard because it's not just about fixing bugs. It's about fixing the process that led to the bug in the first place. The organisations that are having repeated security incidents are the ones that have a broken process, often because they grew too fast without thinking about security, and now they're playing catch-up. But it's hard to catch up when the processes you've put in place and the tools developers have gotten comfortable with need to be fundamentally changed.

Security is not prioritised enough because the cost of poor security is not immediately visible. It's hard to quantify the cost of a security incident that didn't happen, and it's hard to justify the cost of investing in security when you're not seeing any immediate benefits. There are no returns when investing in security, only losses when you don't.

And if the probability of an incident is low in the first place, then the expected cost of standing around doing nothing, waiting for an incident to happen, is lower than the cost of investing in security.

$\begin{align*} E[\text{cost of doing nothing}] &= P(\text{incident}) \times \text{cost of incident} \\ &< \text{cost of investing in security} \end{align*}$

And when the cost of doing nothing is lower than the cost of investing in security, the rational decision is to do nothing.

So to do good security, you probably need to reduce the cost of security. This means making it easy to do the right thing: in Ross Anderson's Software and Security Engineering course, there was a lot of emphasis on making the secure thing the easiest, or the default, thing to do without compromising on usability.

That means building tools that developers actually want to use, or something so low-friction that they don't even realise they're using it. It means building security into the process from the start, and not as an afterthought.

I don't think I can ever truly quit security. I still love doing it on the side with fun research projects and freelance pentesting work outside of my day job or school. Besides, how do you "quit" something so fundamentally important to everything else you might do in tech? But I'm ready to take a pause on popping shells all the time and start building things that last.

And there are so many things I take with me from my time in security. I've learnt how to think adversarially, which helps a lot in writing good software. I've learnt how to communicate complex ideas in a way that's easy to understand. I've picked up an attention to detail that I didn't have before. I've learnt how to learn.

I think this is probably a good balance. Doing security as both a hobby *and* a job is probably too much for me. I think the analogy "it's a marathon,
not a sprint" is very apt here. Unfortunately I'm at that age where I need to be a bit more realistic about what I want in my career, and I need to know
I will still enjoy what I'm doing 10 years down the line.

I'm still not *entirely* sure what I *do* want to do, but I know I've found a passion for being as close to the product as possible. I'm at my best
when I work on meaningful problems that serve a clear purpose in the world and where I can clearly see the impact of my work.

This has been a long and quite personal post, and I don't want to enforce any particular message on anyone. I just re-made my personal site and decided it made sense to write my first post on where I'm at right now. Wherever you are at, I hope you're doing well, and I hope you're doing what you love!

Oh, and please remember to touch grass.
