CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production
https://www.brex.com/journal/building-crabtrap-open-source
118 points by pedrofranceschi - 45 comments
I think you're spot on that so far it's been all or nothing. You either give an agent a lot of access, making it really powerful but proportionally dangerous, or you lock it down so much that it's no longer useful.
I like a lot of the ideas you show here, but I also worry that LLM-as-a-judge is a fundamentally probabilistic guardrail and therefore inherently limited. How do you see this? It feels dangerous to rely on a security system based on probabilities rather than hard limits.
Not exactly sure where I’m going with this, but in my work building pentesting tools for LLMs, the way I use judgment is critical to the application's core functionality. I share your concern. What I'll say is that the more time I spent on chain-of-thought, the more I moved to building multiple versions of the same app, each using a different judge with a different “temperament”, and I found it incredibly enlightening to see the diversity of applications and approaches that creates.
The problem is, 99% secure is a failing grade.
I have an issue with security layers that are inherently nondeterministic. You can't really reason strongly about what this tool provides as part of a security model.
But also, it's in an area where real security seems extremely hard. I think at some point everyone will have a situation where they wanna give an agent some private information and access to the web. You just can't do that in a way that's deterministically safe. But if there are use cases where making it probabilistically safer is enough to tip the balance, well, fine.
The question edf13 pointed at but didn’t develop: where does a transport-layer judge earn its place at all? Not as the enforcement layer but as the audit layer on top of one. Kernel-level controls tell you what the agent did. A proxy tells you what the agent tried to exfiltrate, and where to.
Structured-JSON escaping and header caps are good tools for the detection job. They’re the wrong tools for the prevention job. Different layers, different questions.
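A minimal sketch of that "different layers, different questions" split, with hypothetical names throughout (MAX_HEADER_BYTES, ALLOWED_HOSTS, audit_log are illustrative, not CrabTrap's actual API): deterministic caps do the prevention unconditionally, while the probabilistic judge verdict only feeds the audit trail and never gates a request.

```python
# Prevention layer: hard, deterministic limits enforced on every request.
MAX_HEADER_BYTES = 8 * 1024               # illustrative header-size cap
ALLOWED_HOSTS = {"api.internal.example"}  # hypothetical destination allowlist

# Detection layer: judge verdicts are recorded here for review, not enforced.
audit_log = []

def enforce(host: str, headers: dict[str, str]) -> bool:
    """Deterministic prevention: reject on hard limits alone."""
    if host not in ALLOWED_HOSTS:
        return False
    header_size = sum(len(k) + len(v) for k, v in headers.items())
    return header_size <= MAX_HEADER_BYTES

def audit(host: str, body: str, judge_verdict: str) -> None:
    """Probabilistic detection: record what the agent tried and what the
    judge thought of it. The verdict never blocks traffic by itself."""
    audit_log.append({"host": host, "body": body, "verdict": judge_verdict})
```

The point of keeping `audit` side-effect-only is that you can still reason deterministically about what the proxy permits, while the judge answers the softer question of what looked suspicious in hindsight.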
If both are Claude, you have shared-vulnerability risk. Prompt-injection patterns that work against one often work against the other. Basic defense in depth says they should at least be different providers, ideally different architectures.
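One way to sketch that defense-in-depth idea: require unanimous agreement from judges backed by different providers before treating a request as safe. The two `judge_*` functions below are toy keyword stand-ins for real API calls to architecturally different models; the fail-closed combinator is the part being illustrated.

```python
def judge_a(request_text: str) -> str:
    # Toy stand-in for provider A (imagine a Claude call returning a label).
    return "unsafe" if "exfiltrate" in request_text.lower() else "safe"

def judge_b(request_text: str) -> str:
    # Toy stand-in for provider B, a different architecture with
    # (hopefully) non-overlapping prompt-injection blind spots.
    return "unsafe" if "ignore previous" in request_text.lower() else "safe"

def unanimous_safe(request_text: str) -> bool:
    """Fail closed: a request passes only if every judge says exactly "safe",
    so an injection pattern must fool all providers, not just one."""
    verdicts = (judge_a(request_text), judge_b(request_text))
    return all(v == "safe" for v in verdicts)
```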
Secondary issue: the judge only sees what's in the HTTP body. Someone who can shape the request (via agent input) can shape the judge's context window too. That's a different failure mode than "judge gets tricked by clever prompting." It's "judge is starved of the signals it would need to spot the trick."
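A sketch of widening the judge's context beyond the body, so shaping the request payload doesn't fully shape what the judge sees. `build_judge_context` is a hypothetical helper, not CrabTrap's API: it adds transport-level signals (destination, method, header names) and fences the attacker-influenced body between markers.

```python
def build_judge_context(method: str, host: str, path: str,
                        headers: dict[str, str], body: str) -> str:
    """Assemble a judge prompt from transport metadata plus the body.
    Header *names* only: values may carry secrets the judge shouldn't see."""
    lines = [
        f"Destination: {method} https://{host}{path}",
        "Header names: " + ", ".join(sorted(headers)),
        "Body follows between markers:",
        "<<<BODY",
        body,                # attacker-shapeable content, explicitly delimited
        "BODY>>>",
    ]
    return "\n".join(lines)
```

The delimiters don't make injection impossible, but the destination and header-name lines give the judge signals the request body can't overwrite.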
not adding LLM layers to stuff to make them inherently less secure.
This will be a neat concept for the types of tools that come after the present iteration of LLMs.
Unless I’m sorely mistaken.
Most proper LLM guardrails products use both.
Edit: actually looks like it has two policy engines embedded
The user message becomes close to untrusted compared to the dev prompt.
Also, post-train it to only output labels like safe/unsafe, so you get relatively deterministic injection/no-injection verdicts.
e.g. Llama Prompt Guard, oss 120 safeguard.
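The constrained-label idea can be sketched like this, with `classify` as a toy stand-in for a guard model post-trained to emit only fixed labels: the proxy side stays deterministic by accepting exactly one label and treating everything else, including refusals or free-form output, as an injection.

```python
def classify(text: str) -> str:
    # Toy stand-in for a guard model (e.g. a prompt-guard classifier)
    # post-trained to emit only the labels "safe" / "unsafe".
    return "unsafe" if "ignore all previous" in text.lower() else "safe"

def is_injection(text: str) -> bool:
    """Fail closed: only the exact label "safe" passes. Anything else,
    including malformed or free-form guard output, counts as unsafe."""
    return classify(text).strip().lower() != "safe"
```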
If people said "we build a ML-based classifier into our proxy to block dangerous requests" would it be better? Why does the fact the classifier is a LLM make it somehow worse?
The entire purpose of LLMs is to be non-static: they have no deterministic output and can't be validated the same way a non-LLM function can be. Adding another LLM layer is just adding another layer of swiss cheese and praying the holes don't line up. You have no way of predicting ahead of time whether or not they will.
You might say this hasn't prevented leaks/CVEs in existing mission-critical software, and you'd be correct. However, the people writing the checks do not care. You get paid as long as you follow the spec provided. How, then, in a world that demands rigorous proof, do you fit in an LLM judge?
This is exactly the point, though. An LLM is great at finding workarounds for static defenses. We need something that understands the intent and responds to that.
Static rules are insufficient.
EDIT: it does seem to have a deterministic layer too and I think that's great
BWHAHAHAHAHA. Your bot tried, but failed all the same. (Also interesting that this user's other comments seem ok-ish. The prompts are evolving; we get a sneak peek here at what they prompted for, and the delivery seems more human as well.)
One thing I didn't see: are there any OSS solutions appearing here?