As outlined on day 1, this week I’m in Singapore for AI Security Bootcamp 2026. This is an informal daily update written at the time: I’ll follow with a more formal writeup and takeaways after the week is over.
On the second day, we had our first guest lecture – on how Zero-Knowledge Proofs can guarantee, for example, that an output was produced by the correct model given only externally visible inputs. This was followed by a lecture and exercise on agents and attacks against them (prompt injection, tools, MCP, RAG), and, after lunch, a lecture and exercise on how we can safely make use of models that we know misbehave.
Core takeaways:
- Zero-Knowledge Proofs allow us to prove that only expected models were run on only expected data, among other security properties. The computational overhead is currently prohibitive, but by applying workload-specific tricks there’s hope it can be brought down.
- The attack surface of coding agents is huge: anything the agent has read access to could be used to attack it, and anything it has write access to could be used to achieve persistence. You should run such agents under robust oversight, in a sandbox, or both.
- My own reflection: control techniques feel relevant even before ASI. Research has largely looked at toy examples where the untrusted model produces code that might fail on only one input, but the setup could transfer to any case where an AI system can take a harmful action. Claude Code’s auto mode strongly resembles the trusted monitoring setup, with a more concrete failure case!
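To make the trusted monitoring setup concrete, here is a minimal sketch in Python. All function names and the threshold are hypothetical stand-ins I chose for illustration: an untrusted (capable) model proposes code, a trusted (weaker) model scores how suspicious it looks, and above a suspicion threshold we fall back to the trusted model instead of using the untrusted output.

```python
SUSPICION_THRESHOLD = 7  # hypothetical cutoff on a 0-10 suspicion scale

def untrusted_generate(task: str) -> str:
    """Stand-in for the capable but untrusted model writing code for a task."""
    return f"def solve():\n    ...  # untrusted solution for: {task}"

def trusted_monitor(code: str) -> int:
    """Stand-in for the weaker trusted model rating suspicion (0 = benign, 10 = attack)."""
    return 2  # placeholder score; a real monitor would inspect the code

def trusted_generate(task: str) -> str:
    """Stand-in for the trusted model writing a (likely worse) solution itself."""
    return f"def solve():\n    ...  # trusted solution for: {task}"

def run_with_trusted_monitoring(task: str) -> str:
    """Trusted monitoring: use untrusted output unless the monitor flags it."""
    candidate = untrusted_generate(task)
    if trusted_monitor(candidate) >= SUSPICION_THRESHOLD:
        # Suspicious output: defer to the trusted model
        # (real protocols might instead escalate to a human audit).
        return trusted_generate(task)
    return candidate
```

The analogy to a coding agent's auto mode: the agent proposes an action, a monitor (or permission system) decides whether it runs automatically or gets escalated.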