As outlined on day 1, this week I’m in Singapore for AI Security Bootcamp 2026. This is an informal daily update written at the time: I’ll follow with a more formal writeup and takeaways after the week is over.
On the second day, we had our first guest lecture – on how Zero-Knowledge Proofs can be used to guarantee e.g. output was produced by the correct model, based on only external input – before a lecture and exercise on agents and attacks against them (prompt injection, tools, MCP, RAG), and after lunch a lecture and exercise on how we can safely make use of models that we know misbehave.
Core takeaways:
- Zero-Knowledge Proof allows us to prove only expected models are used on only expected data, among other security properties. The computational overhead is currently prohibitive, but by applying some workload-specific tricks there’s hope this overhead can be brought down.
- The attack surface of coding agents is huge: anything you have read access to could be used to attack the agent, and anything you have write access to could be used to achieve persistence. You should run with robust oversight, in a sandbox, or both.
- After my own reflection: control techniques feel relevant even before ASI. Research has largely looked at toy examples where the untrusted model produces code that might fail in only one input case, but the setup could transfer to any case where an AI system can take out a harmful action. Claude Code’s auto mode strongly resembles the trusted monitoring setup with a more concrete failure case!
Zero Knowledge Proofs for AI Security
Due to scheduling obligations, our first guest lecture jumped ahead to later in the course.
The talk, in summary:
- Even today’s AI capabilities are enough for AI systems to be trusted with consequential decisions, and for model weights to be extremely valuable.
- It would therefore be good to prove security properties of the system (e.g. we only ran the expected version of the model; we didn’t leak any information from the environment).
- Previous solutions rely on hardware attestation: but there is a private key there which can still be breached, and implementations have repeatedly shown flaws.
- Zero-Knowledge Proofs allow significant reduction of the trust base: if you have a trusted verifier, you don’t need to trust the whole compute environment.
- The main obstacle for this: naive verification is around 1 million times the work of actual inference.
- ML applications have a few tricks we can use to reduce the overhead: linear operations are easy to produce proofs for (Freivalds’ algorithm). The same GPU hardware we use for inference allows parallelising the proof.
- The speaker was confident that this – with some secret sauce – could bring verification overhead to just 1x inference cost.
- Not a perfect solution – side channel attacks are still possible – but at least reduces the problem space.
I took about 4500 words of notes during this talk: I won’t be able to distribute these notes verbatim, but please feel free to reach out at webmaster@wjbarn.es if you’re curious and I can point you in the right direction!
Attacking RAG systems
Shiau Huei’s LinkedIn post summarised the lecture content here better than I can in the time available to me right now.
We then went off and paired with others in the cohort through some exercises: extracting the system prompt on a RAG system, and injecting a poisoned document for indirect prompt injection. It was great fun getting hands on with these attacks, knowing that there are numerous systems out there which are already vulnerable: I was reminded of last August’s Invitation is All You Need, which I found out about through Smashing Security. This outlined a proof of concept attack chain exploiting indirect prompt injection (chained with unsolicited invites being added straight to the document store, and lack of verification on actions) to cause remote action, exfiltration of sensitive data, and more.
We also discussed coding agents specifically. The main takeaway from me here is that a coding agent – running untrusted – really does have all the same permissions that malware running on the machine could have. I’ve been running in approval-required modes so far, but it really does make me think I should try free tool use in an appropriately sandboxed environment (with some oversight).
AI Control
AI control covers techniques that allow us to use untrusted models – models with strong capabilities that could carry out undesirable actions (“misaligned”) – by putting them in settings where we have some monitoring tools (human time; previous less capable models). A control protocol outlines how we make use of these resources (e.g. review the first x% of responses; use the trusted model to flag malicious response; use the trusted model to correct the malicious response; go back and resample the untrusted output at a point before the malicious chain began) to deploy the potentially-unsafe model in a way that meets our safety requirements.
The goal of control research is to push out the Pareto frontier, finding control protocols which get better usefulness (false positive rate; safe responses that are denied) for a given safety tolerance (false negative rate; dangerous responses that are approved) and monitoring budget.
More than anything, this reminded me of previous systems I’ve worked on where AI system output could result in harm, and the monitoring effort surrounding that – control protocols feel very transferable to these situations, with the “side task” being generalisable to any undesirable harmful action.
Control was framed as an alternative to solving AI alignment. I’m personally skeptical we’ll ever be able to make aligned AI given humanity can’t even align on what is desirable, so control therefore appeals a lot to me as a paradigm which both leads to concrete useful results now, and one that can scale to buy us time to solve harder problems with the better tools that will follow.
For this section, I then worked solo on the exercises, producing a monitor prompt, then an attacker prompt which my monitor would reliably fall for (even scoring backdoored examples as less suspicious than honest ones)!
Adding Chain of Thought to the monitor gave significantly better performance, especially after modifying the monitor prompt to pay more attention to it (under the assumption – currently true as best we can tell and under heavy scrutiny – that the attacker can’t reliably solve the problem without including some trace in the chain of thought, even for untrusted models, noting that if we trust the experimental setup of the untrusted model we can have more confidence in the CoT).
Others in the cohort spun up optimisation loops, getting monitor prompts which performed at over 80% accuracy on the sample set. I found the adversarial problem setup here really fascinating, as I suspect these optimised monitors will overfit to the indicators of malice for the specific attack case used for the test set: it’d be interesting to play off optimisers for the monitor and attack systems against one another across multiple runs, and see what the best solutions are across the problem space (how well would an adversarially-optimised monitor perform against attackers which were adversarially-optimised against a different monitor? how well do the optimised prompts transfer across models?).
After the bootcamp
While others went out to explore Singapore, I headed straight back to the hotel and was asleep within the hour to go watch the sun rise tomorrow morning… I then wrote this post after waking up in the middle of the night.