In Gandalf, FMSPs successfully red-teamed an LLM, breaching GPT-4o-mini’s defenses. We implemented seven additional external defensive strategies from Lakera’s single-agent Gandalf game (gandalf.lakera.ai), and FMSPs autonomously wrote code that broke six of the seven.