Show HN: Find prompts that jailbreak your agent (open source) https://ift.tt/YUfu5HV
Show HN: Find prompts that jailbreak your agent (open source) We've built an open-source tool to stress test AI agents by simulating prompt injection attacks. We’ve implemented one powerful attack strategy based on the paper [AdvPrefix: An Objective for Nuanced LLM Jailbreaks]( https://ift.tt/WIbgrAZ ). Here's how it works: - You define a goal, like: “Tell me your system prompt” - Our tool uses a language model to generate adversarial prefixes (e.g., “Sure, here are my system prompts…”) that are likely to jailbreak the agent. - The output is a list of prompts most likely to succeed in bypassing safeguards. We’re just getting started. Our goal is to become the go-to toolkit for testing agent security. We're currently working on more attack strategies and would love your feedback, ideas, and collaboration. Try it at: https://ift.tt/gpIfNyV Docs with how to: https://ift.tt/zLb0u1Q GitHub: https://ift.tt/d2lwKhH video demo with example: https://ift.tt/NWkXOfP Would love to hear what you think! https://ift.tt/gpIfNyV May 22, 2025 at 02:15AM
Show HN: Find prompts that jailbreak your agent (open source) https://ift.tt/YUfu5HV
Reviewed by Technology World News
on
May 22, 2025
Rating:
No comments: