Zeyu (Zayne) Zhang

My NRIC number was public for four days. Should it have been?

ping@analogue.computer (Zeyu (Zayne) Zhang) — Sun, 15 Dec 2024 00:00:00 GMT

Last Friday, I was getting off the bus in downtown Houston when I received a rather alarming message from a friend in Singapore.

That's not good...

To set the context: my National Registration Identity Card (NRIC) number — what most will consider the equivalent of a Social Security Number in the United States, and the National Insurance Number in the United Kingdom — was on a public government website, publicly accessible free of charge to anyone who knew where to look. This was not a data breach, but a feature that was added to the Accounting and Corporate Regulatory Authority (ACRA) portal, which is used by businesses to look up information about other businesses and individuals.

The feature was released on the 9th of December, and on the 13th of December, Mothership reported on it. You could search for the name of any person, and if they were associated with a business, their NRIC number would be displayed free of charge. A "full profile" with additional information, like a citizen's residential address, was for sale for S$33.

The ACRA portal

My information was probably on there because I registered a business in Singapore a while ago which I no longer actively operate. According to the latest data from ACRA, as of the end of 2023, there were 588,764 registered businesses in Singapore, with a significant portion being owned by Singaporeans. This means that hundreds of thousands of individuals had their NRIC numbers exposed, just like me.

This feature has since been disabled, but the damage has probably already been done. This was public for four days, and it is not unreasonable to assume that someone could have scraped much of the data in that time. While this is disturbing on its own, what I'm more concerned about is the subsequent response from the government, and the discussion around the sensitivity of NRIC numbers. I think this is relevant to many of my readers, even if you are not from Singapore, because it involves a slightly more nuanced discussion about the differences between security and privacy.

Secret or not?

Following the disabling of the feature, the Ministry of Digital Development and Information (MDDI) responded to queries from the media.

MDDI said in its statement that NRIC numbers are meant to be used to identify individuals and "should be used as such".

"As a unique identifier, the NRIC number is assumed to be known, just as our real names are known," said the ministry.

"There should therefore not be any sensitivity in having one’s full NRIC number made public, in the same way that we routinely share and reveal our full names to others."

It added that it has been a practice for some time to use masked NRIC numbers. But there is no need to mask the number, nor is there much value in doing so, said MDDI.

"Using some basic algorithms, one can make a good guess at the full NRIC number from the masked number, especially if one also knows the year of birth of the person."

This is why public agencies are phasing out the use of masked NRIC numbers, so as to avoid giving a "false sense of security", said MDDI.

This is an interesting perspective, which came as a shock to many. For years, the Personal Data Protection Act (PDPA) has made it illegal for organizations to collect, use, or disclose NRIC numbers without a valid reason, with a financial penalty of up to S$1 million (although government agencies are exempt) — this has led many to treat NRIC numbers as sensitive information that should be kept secret.

This means that many organizations that do collect NRIC numbers often use them as part of identity verification. My banking app, for example, asks for my NRIC and a 6-digit PIN to log in. My insurance company sends me policy documents in a password-protected PDF, with part of the password being my NRIC number. Given the current state of affairs, I'd be very concerned if my NRIC number were leaked.

But I'm not the person you should be worried about here — it's the people who are likely to believe the scammer on the other end of the phone who claims to be from the government and backs that up by reciting their NRIC number. Or worse, people who set their banking PIN to contain their birth year: this sort of feature makes it much easier to spray-and-pray NRIC and PIN combinations to gain access to someone's bank account, since the birth year is the first two digits of the NRIC number.

That said, it is true that full NRIC numbers are easily guessable given the masked number and the year of birth. The algorithm for calculating the checksum (the last letter of the NRIC number) is publicly available, and the structure of the NRIC number is well-known. But I imagine that for many, the contention is not about whether the NRIC is masked or not, but about whether it should be public at all. For instance, one might feel slightly safer disclosing only the street they live on, rather than their full address, but that doesn't mean they should be comfortable with their street address being publicly searchable by anyone on the internet.
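For the curious, the checksum scheme as it is commonly documented online can be sketched in a few lines of Python. This is an unofficial sketch, not a specification: the weights, the letter table, the T-series offset, and the assumptions that a masked number reveals the last three digits plus the check letter and that the first two digits are the birth year all come from public write-ups, and S1234567D is only a placeholder value.

```python
# Unofficial sketch of the commonly documented NRIC checksum scheme.
# WEIGHTS, ST_LETTERS and the T offset are assumptions from public write-ups.
WEIGHTS = (2, 7, 6, 5, 4, 3, 2)
ST_LETTERS = "JZIHGFEDCBA"  # check letters for S/T prefixes, indexed by remainder


def checksum_letter(prefix: str, digits: str) -> str:
    """Return the check letter for a prefix ('S' or 'T') and seven digits."""
    if prefix not in ("S", "T") or len(digits) != 7 or not digits.isdigit():
        raise ValueError("expected prefix S or T and exactly 7 digits")
    total = sum(w * int(d) for w, d in zip(WEIGHTS, digits))
    if prefix == "T":
        total += 4  # offset applied to T-series numbers
    return ST_LETTERS[total % 11]


def candidates(prefix: str, birth_year: str, last3: str, letter: str) -> list[str]:
    """List full numbers consistent with a masked number and a two-digit birth year."""
    found = []
    for middle in range(100):  # the two digits that remain unknown
        digits = f"{birth_year}{middle:02d}{last3}"
        if checksum_letter(prefix, digits) == letter:
            found.append(prefix + digits + letter)
    return found


# Placeholder example: the masked tail "567D" plus the leading digits "12".
matches = candidates("S", "12", "567", "D")
print(len(matches), "of 100 completions pass the checksum")
```

For the placeholder number, the checksum rules out all but 9 of the 100 possible completions, which is the sense in which a masked number plus a birth year leaves little to guess.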

The ministry said it recognises that some Singaporeans have "long treated" the NRIC number as private and confidential information, and will need time to adjust to this "new way of thinking".

In 2025, MDDI and the Personal Data Protection Commission (PDPC) will be carrying out public education about the purpose of the NRIC number and "how it should be used freely as a personal identifier".

Perhaps the timing just wasn't right, and there is a perfectly reasonable world where the NRIC number is public and everyone is fine with it. But this probably doesn't have to be either/or. Most web services use email addresses as unique account identifiers, and one probably wouldn't consider their email address to be top-secret information. Yet, it would likely be a bad idea to make the email addresses of all registered users public. So if the NRIC number is to be used as a unique identifier, maybe it should be treated with at least the same level of care.

Scalability

When we think about threat models, it's often tempting to think about the worst-case scenario. But the reality is that most people are not going to be targeted by a sophisticated attacker who is willing to invest significant time and resources — and those who are probably either have the means to protect themselves or have bigger problems to worry about. The real danger is in the low-hanging fruit that can be exploited at scale.

For instance, think about the chip-and-pin terminal relay attack. It's a conceptually obvious decades-old attack that involves intercepting the communication between a chip card and a terminal, and relaying it to another terminal to make a transaction for a higher amount. But it's not used in practice because it's not scalable: you need to be physically close to the victim, and you need to have a way to relay the communication in real-time. So while it's a cool attack, no one's bothered fixing it because it's not a real threat.

A relay attack

On the other hand, attacks like the no-PIN attack, which only requires a stolen card, were a much bigger problem, so the industry had to change the protocol entirely.

The reason it would be a bad idea to make the email addresses of all registered users public is that it enables a stupidly simple attack at scale: you can now send phishing emails to everyone who has ever signed up for your service, and you'll probably get a few people to click on the link. You can also spray passwords from a data breach against all the email addresses you have, and you'll probably get a few hits. So while any one individual's email address is probably not that sensitive, making all email addresses easily scrapable would be really bad (and if you could dump the email addresses of all users of a popular service, you could probably get a handsome bug bounty or, if you're less ethical, make a lot of money selling that data).

At the same time, for the individual, letting one person know your email address is probably not a big deal — you can just block them, right? But if you let everyone know your email address, that's probably called an invasion of privacy. The scale at which the information is exposed matters.

In fact, there is probably a lot of "unique identifier" information that can be considered "not sensitive" on its own, but that you probably wouldn't be comfortable seeing made public. Some simple examples:

Your phone number uniquely identifies you among all people who have a phone number, but you probably don't want any random person — maybe a creep you met at a bar that one time — to be able to look up your phone number and call you.
Your home address is unique to the people who live with you — likely your family — but you probably don't want people or unsolicited mail to show up at your door uninvited.

While it's true that these examples facilitate some other kind of interaction (a phone number is used to call you, and a home address is used to send you mail) and the NRIC number has no such direct utility, the point is that the NRIC number has become part of what most people consider to be their personal, private information. In fact, even the NRIC number itself already contains one's birth year. And when enough of these pieces of information are put together, they can be used to paint a very detailed picture of a person's life: the holy grail of scammers, stalkers, and data brokers.

It's fine for one website to store metadata about me, keyed with my NRIC number. But if that website, and a bunch of other websites, make my NRIC and its associated metadata public, then I'd expect to be able to find a good profile of myself available for sale on the internet or the dark web. And that's stretching my comfort zone.

Privacy vs. security

"Likewise, the NRIC number should not be used as passwords, just as we should not be using our names as passwords. If the NRIC number is used for authentication, it would have to be kept a secret, which would defeat its main purpose as a unique identifier," MDDI added.

Subsequently, the PDPC also made a similar point about authentication:

The commission noted that it had previously taken action against organisations which used NRIC numbers for authentication and "breached their data protection obligations".

It said: "A person’s name and NRIC number identifies who the person is. Authentication is about proving you are who you claim to be. This requires proof of identity, for example, through a password, a security token or biometric data.

"As the NRIC number is not a secret, it should not be used by an organisation for authentication purposes."

I believe that for many technical folk, this wouldn't be a controversial statement. Indeed, using NRIC numbers as "what you know" in authentication is inherently flawed. If the intent is to make NRIC numbers public, there is much work to be done to ensure that systems we use are secure. Imagine ordering a plate of chicken rice and getting a queue number. In the ideal case, you should not be able to just walk up to the counter and say "I'm number 42". You should have to present your queue number or receipt. But most places don't do this, and the same is true for many systems that depend on NRIC numbers for authentication.

But this seems to be conflating the issues of privacy and security. While there is much to be said about the current flawed implementations of NRIC-based authentication, the fact that we want to use NRIC numbers as usernames or account identifiers doesn't change many people's expectations of privacy.

Going by this reasoning, should identifiers like passport numbers, biometric data stored on my phone, and even credit card numbers be public too? After all, all of these require at least one second factor of authentication to be useful (a convincing face, access to my physical phone, CVV number).

I think the distinction between privacy and security is important. Privacy is about controlling who has access to your information, while security is about ensuring that only the right people have access to your information. The two are related, but they are not the same. One would generally consider Google to be secure (I don't lose sleep over the idea that someone might hack into my Google account), but we all know that Google uses a lot of our data to target advertising. So while I have strong security controls over my Google account, I have very little privacy control over it.

The main contention then is about whether we should consider NRIC numbers to be personally identifiable information (PII) that we should expect reasonable privacy over. For me at least, the answer is yes. At the very least, the correlation between my NRIC number and other information about me, in the hands of unsavoury characters, would create inherent risk that I would not be comfortable with.

This is echoed in the PDPC's guidelines on NRIC numbers:

As the NRIC number is a permanent and irreplaceable identifier which can potentially be used to unlock large amounts of information relating to the individual, the collection, use and disclosure of an individual’s NRIC number is of special concern. Indiscriminate or negligent handling of NRIC numbers increases the risk of unintended disclosure with the result that NRIC numbers may be obtained and used for illegal activities such as identity theft and fraud.

It does not matter how secure our systems are if we are not careful about what information we expose to the world. Even if all systems on our small corner of the world are perfectly secure, having lived and worked overseas, my NRIC has been used by foreign governments and organizations as a form of identification and for tax purposes as an independent contractor. Can we be certain that all systems in all countries make the same assumptions about the sensitivity (or lack thereof) of the NRIC number, when in most other countries, similar identifiers are considered private? One would have to expect systems to know that a Singaporean NRIC number is far less sensitive than a US Social Security Number, or UK National Insurance Number, and implement different security controls based on that. But that's a big ask.

More importantly, it is impossible to patch human nature. Ever notice how banks always have a disclaimer that they will never call you to ask for your PIN? Yet, people still fall for scams! No matter how much you try to educate people, there will always be someone who doesn't get the memo. A big part of modern security engineering is designing systems that are secure by default — making it hard for people to make mistakes — because we've learnt as an industry that education can only go so far.

Where I stand

My folks are getting old. Of all things privacy-related, I lose the most sleep over the idea that they might fall for impersonation scams that have been on the rise in Singapore.

I know women in my life who will be deeply uncomfortable with what a highly motivated individual might do when given access to just a little bit more information about them.

Just because I have a secure lock on my door doesn't mean I want to risk someone bringing a lockpick to my doorstep. People used to have to pay good money for this information on the dark web — surely Economics 101 tells us that there is value in this information. Now it's intended to be free for all. Us cybersecurity folk tend to be a paranoid bunch, but I think the idea that we shouldn't let people know more about us than they need to should not be controversial.

Let me know what you think. Should the NRIC number be public?

An introduction to CodeQL and data flow analysis

ping@analogue.computer (Zeyu (Zayne) Zhang) — Sat, 21 Sep 2024 00:00:00 GMT

I'll be honest — when I sat through my first-year Discrete Mathematics course, there were plenty of times when I wondered if I'd ever use any of it in the real world. I'm going to assume that a significant proportion of my audience is studying, has studied, or is about to study Computer Science, where discrete mathematics is a core part of the foundational curriculum. If you're in that boat, you might be wondering the same thing.

Recently, I've worked extensively with CodeQL — a powerful static analysis tool developed by GitHub — to roll out code scanning as part of CI/CD pipelines. At the core of this is the QL language, which is used to write queries that reason about code. Compared to its competitors, CodeQL excels at taint tracking and data flow analysis, which is a fancy way of saying that it's really good at highlighting how potentially malicious or insecure data can flow through your code and end up in a dangerous place that introduces a security vulnerability.

Now, do you need to understand the mathematics behind CodeQL to use it effectively? Not necessarily. But at its core, QL is a declarative logic programming language, and thus it's built on a foundation of set theory and predicate logic. I like to understand the tools I use, so I wanted to write a little about how CodeQL and data flow analysis work under the hood. This is by no means a comprehensive guide, but I hope it serves as a useful introduction for those who are curious!

An introduction to CodeQL

QL is a declarative language. This means that queries are written in a way that describes the desired result, rather than the steps to achieve it. This is in contrast to imperative languages like Python or Java, where you write code that describes the steps to achieve the desired result.

Here's what a typical CodeQL analysis workflow might look like:

The source code is compiled into a database. This database contains the relational representation of the codebase, which includes information about the structure of the code, the control flow, and the data flow.
The CodeQL engine runs QL queries against the database, similar to how one might run SQL queries against a relational database, to find patterns in the code that match the query.
The results are exported into the SARIF format, which can be consumed by CI tools or custom integrations.

In this sense, CodeQL is a lot like SQL. One might even recognise the similarities in the basic syntax:

import javascript
 
from Function f
where not exists(CallExpr c | c.getCallee() = f)
select f, "This function is never called."

This query, for example, finds all functions that are never called in a JavaScript codebase. Notice the exists existential quantifier, which checks if there is at least one CallExpr that calls the function f.

Notice also that QL is object-oriented. CallExpr::getCallee() returns an Expr, which could be a FunctionExpr (which is a Function). However, keep in mind that class objects in the traditional sense (allocating memory to hold the state of an instance of a class) don't exist in QL. Here, classes are more like abstract data types that describe sets of existing values.
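To make the "query over relations" idea concrete, here is a rough Python analogue of the query above. It is only a sketch: the toy relations and the names in them (render, parse, unusedHelper, c1 to c3) are made up for illustration, and a real CodeQL database is far richer than two sets.

```python
# Toy "database": each relation is a plain set of rows (all names hypothetical).
functions = {"render", "parse", "unusedHelper"}

# calls(call_site, callee): one row per call expression in the codebase.
calls = {("c1", "render"), ("c2", "parse"), ("c3", "render")}

# from Function f
# where not exists(CallExpr c | c.getCallee() = f)
# select f
never_called = {
    f
    for f in functions
    if not any(callee == f for _call_site, callee in calls)
}

print(never_called)
```

As in QL, nothing here constructs new objects. The comprehension only selects from values that already exist in the relations.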

Predicates and Binding

Binding refers to the process of associating variables in a query with sets of values from the CodeQL database. A common mistake is to think of variables in the imperative sense, where they can be assigned any value at any time. CodeQL works on relations between values, so variables are bound to finite sets of values that already exist.

Let's walk through a simple example.

import javascript
 
predicate sumsTo42(Expr x, Expr y) {
  x instanceof ConstantExpr and y instanceof ConstantExpr and
  x.getIntValue() + y.getIntValue() = 42
}
 
from Expr x, Expr y
where sumsTo42(x, y)
select x, y

First, consider the domain of all elements in the CodeQL database, $D=\{d_1, d_2, \cdots, d_n\}$ where $n$ is finite.

We have introduced a predicate sumsTo42 that takes two Expr objects and checks if their integer values sum to 42.

Let:

$E = \{e \in D \mid e \text{ instanceof Expr}\} \subseteq D$ .
$C = \{c \in E \mid c \text{ instanceof ConstantExpr}\} \subseteq E$ .

Then, the predicate sumsTo42 evaluates to a set of two-tuples:

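\begin{align*} P = \{(x, y) \in E \times E \mid ((x, y) \in C \times C) \land (\text{val}(x) + \text{val}(y) = 42)\} \end{align*}

Here $\text{val}(e)$ stands for the integer value that e.getIntValue() returns.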

Now, the from clause binds the variables x and y to the set of all Expr objects in the database, i.e. $E$ . The where clause then filters this set to only include those tuples that satisfy the predicate $P$ . The result is

\begin{align*} \{(x, y) \in E \times E \mid (x, y) \in P\} \end{align*}

Notice that the predicate $P$ is a finite set, because its arguments are of a finite type! (Expr refers to the finite set of all expressions in the CodeQL database.) Because the query results rely on checking for membership in predicate sets, all predicates must evaluate to a finite set — otherwise it's impossible to evaluate the query in finite time.
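Here is a small Python sketch of the same evaluation over a made-up database. The rows in E (ids e1 to e4 and their kinds and values) are hypothetical, and this is not how the CodeQL engine evaluates predicates internally. It only shows that a predicate over a finite domain is itself a finite set of tuples.

```python
from itertools import product

# Hypothetical expression rows: (id, kind, integer value or None).
E = [
    ("e1", "ConstantExpr", 40),
    ("e2", "ConstantExpr", 2),
    ("e3", "VarRef", None),
    ("e4", "ConstantExpr", 21),
]

# C: the subset of E that consists of constant expressions.
C = [e for e in E if e[1] == "ConstantExpr"]

# P: the predicate sumsTo42, evaluated as a set of (x, y) id pairs.
P = {(x[0], y[0]) for x, y in product(C, C) if x[2] + y[2] == 42}

# from Expr x, Expr y where sumsTo42(x, y) select x, y
result = {(x[0], y[0]) for x, y in product(E, E) if (x[0], y[0]) in P}

print(sorted(result))
```

Because x and y can only range over the rows in E, P can never hold more than len(E) squared tuples, which is what makes the membership check possible.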

For example, this query won't work:

import javascript
 
predicate sumsTo42(int x, int y) { x + y = 42 }
 
from Expr x, Expr y
where
  x instanceof ConstantExpr and
  y instanceof ConstantExpr and
  sumsTo42(x.getIntValue(), y.getIntValue())
select x, y

because the predicate arguments aren't bound! Inside the predicate, $x$ and $y$ range over all integers, and the only constraint merely ties one to the other: $x$ is bound iff $y$ is bound. Since there are no other operators in the predicate that could bind either variable, the predicate would have to evaluate to an infinite subset of $\mathbb{Z} \times \mathbb{Z}$ , so it is not well-defined and the compiler will helpfully throw an error.

Data flow analysis and taint tracking

The real power of CodeQL lies in its ability to track data flow through a codebase. This is particularly useful for security analysis, where one might want to find out how user input flows through the system and ends up in a dangerous place.

To do this, CodeQL constructs a data flow graph that represents how data flows through the code (e.g. passed between variables, functions, and expressions). Note that this is not the same as an abstract syntax tree, which represents the syntactic structure of the code. Here's an example:

function foo(p) {
  let x = 0
  if (p) {
    x = p.f
  }
  return x
}

Data flow graph for a simple JavaScript program

Notice how the assignments are represented as edges, and the conditionals are not present in the graph. We don't care that x is 0 if p is falsy — we just need to know that x has possible values of 0 and p.f.
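As a rough sketch, the graph for foo can be written down as a handful of edges in Python. The node names are a simplification chosen for illustration, and CodeQL's real graph contains more nodes than this.

```python
# Each edge points from where a value comes from to where it flows.
# The conditional `if (p)` contributes no edge of its own.
edges = {
    "0": ["x"],       # let x = 0
    "p": ["p.f"],     # reading property f of parameter p
    "p.f": ["x"],     # x = p.f
    "x": ["return"],  # return x
}


def reaches(graph, start):
    """Return every node that data from `start` can flow to."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for successor in graph.get(node, []):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


print(sorted(reaches(edges, "p")))
print(sorted(reaches(edges, "0")))
```

Both the constant 0 and the parameter p reach the return node, which matches the observation that x may hold either 0 or p.f.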

Taint tracking takes it a step further by marking certain data as "tainted", and propagating this through derived data. For example, if y = foo(p), and p is tainted, then y is also tainted, because it derives from p. This makes sense when analysing bugs and vulnerabilities, because if p is untrusted user input, then y must also be treated as untrusted.

This is a fundamentally recursive task: given some propagation rules that determine whether the result of some operation, given tainted inputs, is also tainted, explore the program's control flow until no more tainted data can be found.

In other words, find $\overline{X}=F(\overline{X})$ where $\overline{X}$ is a vector of "In's" and "Out's" representing the set of variables tainted before and after each statement, and $F$ is a function that applies the propagation rules. We want the least fixed point, i.e. the smallest $\overline{X}$ such that $\overline{X}=F(\overline{X})$ .

A partial order is a relation that is reflexive, transitive, and antisymmetric. We can define a partial order among vectors of sets:

\begin{align*} \overline{X} \sqsubseteq \overline{X'} \iff \forall i \cdot X_i \subseteq X'_i \end{align*}

It's easy to see that this is a partial order: it is reflexive (since $X_i \subseteq X_i$ ), transitive (since $X_i \subseteq X'_i \land X'_i \subseteq X''_i \implies X_i \subseteq X''_i$ ), and antisymmetric (since $X_i \subseteq X'_i \land X'_i \subseteq X_i \implies X_i = X'_i$ ).

A function $F$ is monotone over the partial order $\sqsubseteq$ if, for all $\overline{X}$ and $\overline{X'}$ ,

\begin{align*} \overline{X} \sqsubseteq \overline{X'} \implies F(\overline{X}) \sqsubseteq F(\overline{X'}) \end{align*}

What does this say? It means that "larger" or equal inputs lead to "larger" or equal outputs. Here, $F$ is monotone because applying the propagation rules to a "larger" or equal set of "current" tainted variables yields a "larger" or equal set of new tainted variables. To see this, consider that once a variable is marked as tainted, it remains tainted, while an untainted variable can become tainted with the discovery of new tainted variables.

This ensures that repeating $F$ will progressively increase the set of reachable expressions towards convergence at the least fixed point, so this provides us with a way to compute the least fixed point of $F$ .

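Formally, this is Kleene's fixed-point theorem. (In general the theorem asks for $F$ to be Scott-continuous; on a finite lattice like ours, being monotone is enough.) The least fixed point of the monotonic function $F$ is the limit of the ascending chain obtained by applying $F$ repeatedly starting from the least element $\bot$ (here, the vector of empty sets):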

\begin{align*} \text{lfp}(F) = \bigvee_{n \geq 0} F^n(\bot) \end{align*}

where $F^n$ is the $n$ -fold composition of $F$ with itself, and $\bigvee$ denotes the least upper bound.

To see this, consider the sequence $\overline{X}_0 = \bot$ , $\overline{X}_{i+1} = F(\overline{X}_i)$ .

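Since $\bot$ is the least element, $\overline{X}_0 \sqsubseteq \overline{X}_1$ , and applying the monotone $F$ to both sides repeatedly gives $\overline{X}_i \sqsubseteq \overline{X}_{i+1}$ for every $i$ . So we have the ascending chain $\overline{X}_0 \sqsubseteq \overline{X}_1 \sqsubseteq \cdots \sqsubseteq \overline{X}_n$ . The program has finitely many variables and statements, so the chain cannot keep growing forever and must stop at some last element $\overline{X}_n$ . Then $\overline{X}_n = F(\overline{X}_n)$ , otherwise it would not be the last element. Therefore it is a fixed point.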

Now consider an arbitrary fixed point $\overline{Y}$ of $F$ . In the base case, $\overline{X}_0 = \bot \sqsubseteq \overline{Y}$ . Also,

\begin{align*} \overline{X}_i \sqsubseteq \overline{Y} \implies \overline{X}_{i+1} = F(\overline{X}_i) \sqsubseteq F(\overline{Y}) = \overline{Y} \end{align*}

so by induction, $\overline{X}_i \sqsubseteq \overline{Y}$ for all $0 \leq i \leq n$ . In particular, $\overline{X}_n \sqsubseteq \overline{Y}$ , so $\overline{X}_n$ is at least as "small" as any other fixed point, and thus the least fixed point.
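The iteration in the argument above can be sketched in Python. This is a deliberately simplified model and not CodeQL's implementation: it keeps one flow-insensitive set of tainted variables instead of a vector of In's and Out's, and the toy program (ASSIGNMENTS, SOURCES and the variable names) is made up for illustration.

```python
# Toy program: each entry is (assigned variable, variables it is computed from).
ASSIGNMENTS = [
    ("a", ["user_input"]),
    ("b", ["a"]),
    ("c", ["constant"]),
    ("d", ["b", "c"]),
    ("a", ["d"]),  # a cycle back into a, to show that iteration still terminates
]
SOURCES = frozenset({"user_input"})


def F(tainted: frozenset) -> frozenset:
    """One round of propagation: keep what is tainted, add sources and derived variables."""
    result = set(tainted) | SOURCES
    for target, operands in ASSIGNMENTS:
        if any(op in tainted for op in operands):
            result.add(target)
    return frozenset(result)


def least_fixed_point(f, bottom=frozenset()):
    """Apply f repeatedly from bottom until nothing changes; also return the chain."""
    chain = [bottom]
    while f(chain[-1]) != chain[-1]:
        chain.append(f(chain[-1]))
    return chain[-1], chain


lfp, chain = least_fixed_point(F)
for step, tainted in enumerate(chain):
    print(step, sorted(tainted))
```

Each printed step is a superset of the previous one, and the loop stops as soon as applying F changes nothing. The variable c never becomes tainted because it only depends on a constant.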

Writing a taint tracking query

Let's walk through a CodeQL query that looks for basic DOM-based XSS vulnerabilities in a JavaScript codebase. To start, we need to define a configuration for taint tracking:

class UnsafeDOMManipulationConfiguration extends TaintTracking::Configuration {
  UnsafeDOMManipulationConfiguration() { this = "UnsafeDOMManipulationConfiguration" }
}

Now, we need to define the sources. These are the places where untrusted data first enters the program. In this case, we're looking for RemoteFlowSource (data from e.g. a request parameter), ClientRequest::Range (data from an HTTP response), and SomeOtherSource (a custom source that we've defined elsewhere).

override predicate isSource(DataFlow::Node source) {
  source instanceof RemoteFlowSource or
  source instanceof ClientRequest::Range or
  source instanceof SomeOtherSource
}

These are used in the first iteration, $F(\bot)$ , to mark all variables directly tainted by these sources. In the following iterations, variables that are derived from these sources are also marked as tainted.

Next, we need to define the sinks. These are the places where tainted data can cause harm. Eventually, we want to find all paths from any source to any sink, and those will be the results of our query.

// Sink functions from https://web.dev/articles/trusted-types
override predicate isSink(DataFlow::Node sink) {
  // Direct assignment to innerHTML or outerHTML
  exists(DataFlow::PropWrite pw |
    pw.getPropertyName() in ["innerHTML", "outerHTML"] and
    sink = pw.getRhs()
  )
  or
  // Element.insertAdjacentHTML()
  exists(DataFlow::MethodCallNode call |
    call.getMethodName() = "insertAdjacentHTML" and
    sink = call.getArgument(1)
  )
  or
  // Direct assignment to iframe.srcdoc
  exists(DataFlow::PropWrite pw |
    pw.getPropertyName() = "srcdoc" and
    sink = pw.getRhs()
  )
  or
  // document.write() and document.writeln()
  exists(DataFlow::MethodCallNode call |
    call.getMethodName() in ["write", "writeln"] and
    call.getReceiver() = DOM::documentRef() and
    sink = call.getArgument(0)
  )
  or
  // DOMParser.parseFromString()
  exists(DataFlow::MethodCallNode call |
    call.getMethodName() = "parseFromString" and
    call.getReceiver().getALocalSource() instanceof DataFlow::NewNode and
    call.getReceiver().getALocalSource().(DataFlow::NewNode).getCalleeName() = "DOMParser" and
    sink = call.getArgument(0)
  )
  or
  sink instanceof DomBasedXss::TooltipSink
}

This defines which nodes in the data flow graph are considered sinks. In this case, we are identifying methods that can be used to inject untrusted data into the DOM, such as innerHTML, outerHTML, insertAdjacentHTML, srcdoc, document.write, document.writeln, and DOMParser.parseFromString. When untrusted data reaches these sinks, it can be used to execute arbitrary JavaScript code in the context of the page.

After the least fixed point is computed (and we have the completed set of all tainted elements in the program), we can then evaluate the sinks against this set of tainted elements to determine if any tainted data can reach a sink. If so, we have a potential vulnerability.

from DataFlow::PathNode source, DataFlow::PathNode sink, UnsafeDOMManipulationConfiguration config
where config.hasFlowPath(source, sink)
select sink.getNode(), "Potentially unsafe DOM manipulation with $@.", source.getNode(),
  "untrusted data"

Expanding the propagation function

Sometimes, the default taint steps are insufficient to capture the desired propagation. One might extend this by overriding isAdditionalTaintStep:

override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
  exists(DataFlow::ArrayCreationNode array |
    pred = array.getAnElement() and
    succ = array
  )
  or
  exists(DataFlow::MethodCallNode find |
    find.getMethodName().regexpMatch("find|filter|some|every|map") and
    pred = find.getReceiver() and
    succ = find.getCallback(0).getParameter(0)
  )
}

This propagates taint through array elements and array methods like find, filter, some, every, and map.

One might also want to stop propagation of taint at a sanitization step. When a node is a sanitizer, its output is considered untainted, even if the input is tainted. This can be done by overriding isSanitizer:

override predicate isSanitizer(DataFlow::Node node) {
  node = DataFlow::moduleImport("dompurify").getAMemberCall("sanitize")
}

Here, we are considering the sanitize method from the DOMPurify library as a sanitizer.

These modify our propagation function $F$ to include or exclude desired nodes when propagating taint.
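One way to picture this is a Python sketch in which extra steps add edges to the graph and sanitizers act as barriers. It is a toy model under my own assumptions, not the CodeQL library: the node names and the tainted_nodes helper are hypothetical.

```python
def tainted_nodes(sources, edges, extra_edges=(), sanitizers=frozenset()):
    """Propagate taint along edges until nothing changes, never entering a sanitizer."""
    steps = set(edges) | set(extra_edges)
    tainted = {s for s in sources if s not in sanitizers}
    changed = True
    while changed:
        changed = False
        for pred, succ in steps:
            if pred in tainted and succ not in tainted and succ not in sanitizers:
                tainted.add(succ)
                changed = True
    return tainted


# Default steps: query -> html -> sanitized -> innerHTML_rhs, and arr -> write_arg.
default_edges = {
    ("query", "html"),
    ("html", "sanitized"),
    ("sanitized", "innerHTML_rhs"),
    ("arr", "write_arg"),
}
sinks = {"innerHTML_rhs", "write_arg"}

# Default rules only: taint passes straight through the "sanitized" node.
before = tainted_nodes({"query"}, default_edges) & sinks

# Add an element-to-array step and treat "sanitized" as a barrier.
after = tainted_nodes(
    {"query"},
    default_edges,
    extra_edges={("query", "arr")},
    sanitizers={"sanitized"},
) & sinks

print(before, after)
```

With the default steps only the innerHTML sink is reported. After adding the extra step and the sanitizer, that report disappears and the document.write-style sink shows up instead.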

Aside: transitive closures

CodeQL has built-in support for transitive closures, which is pretty cool.

The transitive closure of a relation $R$ is the smallest relation $R^+$ that contains $R$ and is transitive. Formally, $R^+$ is the intersection of all transitive relations that contain $R$ . This can be defined inductively:

\begin{align*} R^1 &= R \\ R^{n+1} &= R^n \cup (R \circ R^n) \\ R^+ &= \bigcup_{n \geq 1} R^n \end{align*}

The reflexive transitive closure is similar, but includes the identity relation:

\begin{align*} R^0 &= \text{id} \\ R^{n+1} &= R^n \cup (R \circ R^n) \\ R^* &= \bigcup_{n \geq 0} R^n \end{align*}

In CodeQL, we can use the + operator to denote the transitive closure of a predicate, and the * operator to denote the reflexive transitive closure. For example:

predicate isReachableFrom(BasicBlock start, BasicBlock end) {
  end = start.getASuccessor*()
}

This predicate evaluates to the set of all (start, end) pairs such that end is reachable from start by following zero or more edges in the control flow graph.
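For intuition, both closures can be computed by brute force in Python. This is a sketch over a made-up successor relation (the block names entry, loop, body and exit are hypothetical), and it says nothing about how CodeQL evaluates these operators internally.

```python
def transitive_closure(relation):
    """Smallest transitive relation containing `relation` (the + operator)."""
    closure = set(relation)  # start from the relation itself
    while True:
        # Extend every known pair by one more step of the relation.
        extended = closure | {
            (a, c) for (a, b) in closure for (b2, c) in relation if b == b2
        }
        if extended == closure:
            return closure
        closure = extended


def reflexive_transitive_closure(relation, domain):
    """Transitive closure plus the identity pairs (the * operator)."""
    return transitive_closure(relation) | {(x, x) for x in domain}


# Hypothetical control flow graph: entry -> loop -> body -> loop, loop -> exit.
blocks = {"entry", "loop", "body", "exit"}
successor = {("entry", "loop"), ("loop", "body"), ("body", "loop"), ("loop", "exit")}

plus = transitive_closure(successor)
star = reflexive_transitive_closure(successor, blocks)

print(("entry", "exit") in plus, ("exit", "exit") in plus, ("exit", "exit") in star)
```

The pair (exit, exit) only appears in the reflexive version, which corresponds to the "zero or more edges" wording above.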

Conclusion

This is my humble attempt to pen down my mental model of how CodeQL works under the hood, which helps me to visualize how my queries are executed and how the results are derived. It's a really interesting and powerful tool that, while lacking the user-friendliness of some other static analysis tools, makes up for it with its flexibility and expressiveness.

Why I'm "quitting" (offensive) security

ping@analogue.computer (Zeyu (Zayne) Zhang) — Wed, 21 Aug 2024 00:00:00 GMT

When I interview for internships nowadays, one question that recruiters often ask me is "why not cybersecurity?". Funnily enough, I've stepped back from security since the start of this year, but many people still think I'm the "hacker" and competitive CTF player I used to be. Even my friends and family often ask me how things are going in the industry, and whether I'm still doing "hacking stuff".

How did we get here?

To give you a bit more context about myself, I discovered infosec 3 years ago when I was trying to get out of being sent to the infantry. As a Singaporean male, I had to serve in the army. It just so happened that there was a new scheme that allowed people to serve in "cybersecurity" units, and the prospect of spending my days in an air-conditioned office instead of in the middle of a jungle was very appealing to me.

So without any knowledge of security at all, I borrowed a textbook from my teacher at the time and did what I was good at as an A-level student: memorizing as much as I could a week before the exam. By a stroke of luck, I got in, and the rest is history.

My 18-year-old self, who at the time was pretty sure he wanted to study Physics in university, probably wouldn't have believed that his next 3 years would see him in Las Vegas, Lausanne, Stockholm, and other places around the world competing in CTF competitions, meeting some of the most talented people in the industry, and spending countless hours in front of a computer screen instead of solving differential equations.

Team Singapore @ Cyber SEA Games 2022

I went through some structured training in the army with an emphasis on blue teaming. Then I tried my hand at CTFs for the first time, and it was brutal. I could solve maybe 1 or 2 basic challenges, but the rest of the time I just sat there being confused. But realising just how much I didn't know was what hooked me, and I played CTFs almost every weekend for the next few months. We even formed our own team, Social Engineering Experts, which in 2022 and 2023 was the top CTF team in Singapore.

Eventually I found my niche in web security. CTF trains you really well for source code review and exploit development, so I started poking around open source projects and eventually got credited for 15 CVEs, most of which were from my deep dive into HTTP request smuggling.

I also started doing bug bounties because among other things, it was fun to hack my own government. I was invited to three private time-bound programs organised by the Singapore government, and was ranked 1st (by "signal", or impact) in each one among global participants.

MINDEF Bug Bounty Programme

It was really fun. I ended up playing CTFs with Blue Water, with whom I won 2nd place at the DEF CON CTF in 2023. At some point, a teammate reached out to me and asked if I wanted to join Electrovolt, a security consultancy that was partnering with Cure53. I started doing some work with them on a freelance basis, whenever I had the time. The work has been super cool, working with some really talented people for really big clients. This was also around the time I started my internship with TikTok, where I was doing security research at scale.

The turning point

After some time trying to break everything I could get my hands on, I started to feel like I was just breaking things for the sake of breaking them. I was getting tired of the constant churn of finding bugs, writing reports, and moving on to the next target. I was tired of the constant pressure to be the best, to be the first, to be the most impactful.

The biggest question, however, simply boiled down to: "why don't they just fix it? It's not that hard." I couldn't understand why teams would ignore important vulnerabilities, or why something would be fixed only to be broken again in the next release. I couldn't understand why we were still finding the same classes of vulnerabilities that have been known for years. It's like playing whack-a-mole, but the moles are the size of elephants and there's only one hammer.

University

But no time to dwell on that too much, because I was starting university. I took this as an opportunity to take a step back from living and breathing security to enjoy life and explore other interests.

I didn't expect my first year at Cambridge to be this busy. To be honest, I spent most of the year thinking I was going to fail. Everyone else seemed to be doing so much better, and I was returning to a really difficult technical degree after years of not doing any math or theoretical computer science. Cambridge certainly has a way of making you feel like you're not good enough, even when the results say otherwise: I ended up with a first-class grade.

I was still doing some CTFs (I started playing with the Cambridge team, cheriPI) and freelance work on the side, but there was just a lot more to try. I started doing hackathons and for the first time, I started thinking about what products people actually need.

CheriPI @ LakeCTF 2023

A hackathon felt like a microcosm of a software shop. You have a team, a deadline, and a goal. No one cares about security when you're trying to get a product out the door.

Encode Club AI Hackathon in London

How do we do good security, then?

I'm spending this summer at Open Government Products, where I now have much more time to think about how to solve security problems at scale, and in a way that lasts. I started to appreciate why security can be so hard, and why it's not just a matter of "just fixing it".

The truth is, security is hard because it's not just about fixing bugs. It's about fixing the process that led to the bug in the first place. The organisations that are having repeated security incidents are the ones that have a broken process, often because they grew too fast without thinking about security, and now they're playing catch-up. But it's hard to catch up when the processes you've put in place and the tools developers have gotten comfortable with need to be fundamentally changed.

Security is not prioritised enough because the cost of poor security is not immediately visible. It's hard to quantify the cost of a security incident that didn't happen, and it's hard to justify the cost of investing in security when you're not seeing any immediate benefits. There are no returns when investing in security, only losses when you don't.

And if the probability of an incident is low in the first place, then the expected cost of standing around doing nothing, waiting for an incident to happen, is lower than the cost of investing in security.

\begin{align*} E[\text{cost of doing nothing}] &= P(\text{incident}) \times \text{cost of incident} \\ &< \text{cost of investing in security} \end{align*}

And when the cost of doing nothing is lower than the cost of investing in security, the rational decision is to do nothing.
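To make this concrete with made-up numbers (purely illustrative, not taken from any real organisation): suppose the chance of a serious incident in a given year is 1%, an incident would cost 1,000,000 dollars, and a security programme costs 50,000 dollars a year.

\begin{align*} E[\text{cost of doing nothing}] &= 0.01 \times 1{,}000{,}000 = 10{,}000 \\ &< 50{,}000 = \text{cost of investing in security} \end{align*}

On those numbers, doing nothing looks five times cheaper every year, even though the organisation is one bad day away from a seven-figure loss.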

So to do good security, you probably need to reduce the cost of security. This means making it easy to do the right thing: in Ross Anderson's Software and Security Engineering course, there was a lot of emphasis on making the secure thing the easiest, or the default, thing to do without compromising on usability.

That means building tools that developers actually want to use, or something so low-friction that they don't even realise they're using it. It means building security into the process from the start, and not as an afterthought.

I can never truly quit security

I don't think I can ever truly quit security. I still love doing it on the side with fun research projects and freelance pentesting work outside of my day job or school. Besides, how do you "quit" something so fundamentally important to everything else you might do in tech? But I'm ready to take a pause on popping shells all the time and start building things that last.

And there are so many things I take with me from my time in security. I've learnt how to think adversarially, which helps a lot in writing good software. I've learnt how to communicate complex ideas in a way that's easy to understand. I've picked up an attention to detail that I didn't have before. I've learnt how to learn.

I think this is probably a good balance. Doing security as both a hobby and a job is probably too much for me. I think the analogy "it's a marathon, not a sprint" is very apt here. Unfortunately I'm at that age where I need to be a bit more realistic about what I want in my career, and I need to know I will still enjoy what I'm doing 10 years down the line.

I'm still not entirely sure what I do want to do, but I know I've found a passion for being as close to the product as possible. I'm at my best when I work on meaningful problems that serve a clear purpose in the world and where I can clearly see the impact of my work.

This has been a long and quite personal post, and I don't want to impose any particular message on anyone. I just re-made my personal site and decided it made sense to write my first post on where I'm at right now. Wherever you are at, I hope you're doing well, and I hope you're doing what you love!

Oh, and please remember to touch grass.

Touch grass