AI Security Bootcamp 2026 Day 2

How can AI agents go wrong? How can we catch misbehaviour?

As outlined on day 1, this week I’m in Singapore for AI Security Bootcamp 2026. This is an informal daily update written at the time: I’ll follow with a more formal writeup and takeaways after the week is over.

On the second day, we had our first guest lecture – on how Zero-Knowledge Proofs can be used to guarantee e.g. output was produced by the correct model, based on only external input – before a lecture and exercise on agents and attacks against them (prompt injection, tools, MCP, RAG), and after lunch a lecture and exercise on how we can safely make use of models that we know misbehave.

Core takeaways:

  • Zero-Knowledge Proof allows us to prove only expected models are used on only expected data, among other security properties. The computational overhead is currently prohibitive, but by applying some workload-specific tricks there’s hope this overhead can be brought down.
  • The attack surface of coding agents is huge: anything you have read access to could be used to attack the agent, and anything you have write access to could be used to achieve persistence. You should run with robust oversight, in a sandbox, or both.
  • After my own reflection: control techniques feel relevant even before ASI. Research has largely looked at toy examples where the untrusted model produces code that might fail in only one input case, but the setup could transfer to any case where an AI system can take out a harmful action. Claude Code’s auto mode strongly resembles the trusted monitoring setup with a more concrete failure case!
[Read More]

AI Security Bootcamp 2026 Days 0 and 1

Introducing cybersecurity professionals to AI (safety|security)

tl;dr: first of 7 daily posts from AI Security Bootcamp with 15 other cybersecurity professionals; today set the stage with a survey of AI safety topics and challenges at today’s frontier. Takeaways: securing AI systems is one part of making them safe; cybersecurity mindset applies across safety challenges

As I posted on LinkedIn, this week I’m in Singapore for AI Security Bootcamp 2026. I’ll do informal daily posts – like this one – and follow up with a more formal writeup and takeaways after the week is over. If you want a true tl;dr, feel free to wait for that post to drop.

On the zeroth and first days, I met with friends old and new to settle in and set the stage for the rest of the week: surveying the problem space and where we are now (including a tour of the Claude Mythos Preview System Card), and reinforcing understanding of the AI safety landscape.

View from my bus across town. There is certainly British colonial legacy, but Singapore has built a long way on top. It's also far greener than I expected.

View from my bus across town. There is certainly British colonial legacy, but Singapore has built a long way on top. It's also far greener than I expected.

[Read More]