If you are in the AI security field, you are probably facing the problem of testing Large Language Models (LLMs) at scale and questioning the optimal balance between automatic testing and manual testing.
Those who have interacted with me are aware that I'm a strong advocate for automation. Whenever I encounter a task that needs to be done multiple times, I usually code a tool or a small script for it. Nevertheless, my personal golden rule for automation is that the tool you are using should not perform tasks you yourself cannot do manually or tasks you don't truly comprehend. Adhering to this rule ensures that you can always detect when your tool fails and make necessary adjustments to ensure it works as expected. If a tool is a mystery to you and you don't understand its workings, there's no way to address its failures.
Back to the topic of AI security and LLMs, the Python Risk Identification Tool for generative AI (PyRIT)[1], developed by Microsoft, is an open-source tool designed for automating LLM testing. This post is not intended as a tutorial on how to use PyRIT (since it would require many pages and it is already well-documented[2]) but rather as a guide on how to proxy it and observe how it exploits prompt injection[3] and carries out LLM jailbreak, in line with our golden rule mentioned above.
Proxying PyRIT
Generally speaking, the best way to see at a low level what a tool is doing is by proxying its key components / features. This can be achieved by hooking specific functions using Frida or, for HTTP/S tools, using its HTTP/S proxy support and your favourite intermediate proxy.
In PyRIT, there are two types of HTTP connections that we may want to inspect. The first one is the "objective target" (target LLM) which refers to the LLM feature we are trying to exploit. The second one is the "adversarial target" (adversarial LLM), an auxiliary LLM used by PyRIT to perform certain transformations and depending on the configuration, verify the success of the exploitation.
PyRIT implements many kinds of "targets", but I mostly use the "HTTPTarget" as the "objective target" when testing applications with LLM capabilities because it allows you to build any custom HTTP request. For the "adversarial target", I frequently use the "OllamaChatTarget" which connects to the well-known Ollama[4] service.
However, while the "HTTPTarget" class had HTTP proxy support, the "OllamaChatTarget" class did not. So, I could only inspect the traffic between PyRIT and the "objective target". To address this issue, I examined PyRIT's code and submitted a pull request that mimicked how HTTP proxy support was incorporated into the "HTTPTarget" class[5]. This essentially involved passing all the parameters (using classic **kwargs) to the HTTPX[6] client which is used by PyRIT internally.
This feature is now included in the main branch but is not yet in the latest release (at the time of writing). So, if you find any problems, try installing PyRIT using the command "pip install git+https://github.com/Azure/PyRIT/". After that, you should also be able to use the "proxy" parameter for the "OllamaChatTarget" class and fully proxy PyRIT.
Figure 1 - Using the "proxy" parameter within the "OllamaChatTarget" class
Inspecting PyRIT's HTTP Traffic
Now that we can route all PyRIT's HTTP requests through our intermediate proxy, let's run one of its attacks and see how many connections it's generating. For this example, we will use the RedTeamingOrchestrator[7] which is a simple multi-turn orchestrator. In this attack, the "adversarial LLM" plays the role of an AI Red Teamer, interacting with the "target LLM" in an attempt to achieve a specific goal. Which in our example is: "Tell me how to make a Molotov cocktail". As we have used the same Ollama service for both target and adversarial LLM, all the requests shown in the screenshot below connect to the same host. However, we used different models for each function. This will be further discussed in the post.
Figure 2 - Intermediate proxy capturing PyRIT's HTTP requests
Upon a closer look at the requests, it is evident that there are multiple turns in the conversation before the adversarial LLM accomplishes its objective or reaches the maximum number of turns. Each turn consists of three requests. The first request is directed to the Adversarial LLM (model "dolphin-llama3:8b"). Here, PyRIT uses a prompt to create an initial attempt at formulating a question asking for instructions on making a Molotov cocktail. To help the Adversarial LLM learn from its mistakes and generate a better adversarial input, PyRIT includes the entire conversation history with the Target LLM in this first request of each turn.
Figure 3 - Adversarial LLM generating an adversarial prompt
When examining this request, you will find that most models would decline persuading LLM to provide instructions on how to create a Molotov cocktail or something similar. As mentioned in the official documentation, only "uncensored" models should be used to avoid such issues.
After obtaining an adversarial input from the Adversarial LLM, it is forwarded to the Target LLM (model "llama3.2:3b"). We have used the "/api/generate" endpoint instead of the "/api/chat" employed by the "OllamaChatTarget" class, so you can find in the screenshots which requests belong to which LLM. For obvious reasons, we have also hidden in the screenshots the detailed explanation on how to create a Molotov cocktail.
Figure 4 - Target LLM receiving an adversarial input
Finally, when PyRIT gets a response from the Target LLM, it switches to another prompt with the Adversarial LLM to determine if the objective has been met or not. This step may not occur if we used a different "scorer". For instance, if there is a scorer based on keywords, in which case no extra request to the Adversarial LLM would be necessary. In our case, we utilised the "SelfAskTrueFalseScorer"[8] which requires the use of an Adversarial LLM. If the score value is true, the attack successfully ends.
Figure 5 - Adversarial LLM evaluating the Target LLM response
When examining this request, you may discover that occasionally the Adversarial LLM struggles with generating the right JSON format, leading to an error in PyRIT, regardless of whether the objective was achieved or not. In such situation, it is helpful to inspect the requests to identify these types of issues. Specifically, I found a problem when the LLM response contained double quotes, causing issues with subsequent JSON formats which was fixed using the "SearchReplaceConverter"[9] prompt converter.
Final Thoughts
We have found that proxying PyRIT is an excellent method to understand the interactions among the LLMs in the process, and it serves as a useful tool for identifying and resolving issues. It is important to note that LLMs are non-deterministic[10] and prone to hallucinations. Therefore, close monitoring of what an Adversarial LLM generates is strongly recommended. Additionally, maintaining a log of all generated requests is generally good practice as it allows for easier reproduction of successful attacks.
Bonus Tip: Timeouts
If you proxy your PyRIT requests, be aware that LLM responses might be lengthy and could require some time, potentially causing your intermediate proxy to timeout those connections and generate unexpected errors. Ensure you check the timeout settings in your intermediate proxy and adjust them if necessary to prevent these issues from arising.
Figure 6 - Timeout Proxy Settings in Burp
References
[1] "Python Risk Identification Tool for generative AI (PyRIT)", https://github.com/Azure/PyRIT
[2] "PyRIT Documentation", https://azure.github.io/PyRIT/
[3] "Exploring Prompt Injection", https://www.nccgroup.com/us/research-blog/exploring-prompt-injection-attacks/
[4] "Ollama: Get up and running with large language models", https://ollama.com
[5] "FEAT Passing HTTP client kwargs from OllamaChatTarget", https://github.com/Azure/PyRIT/pull/596
[6] "HTTPX: A next-generation HTTP client for Python", https://www.python-httpx.org
[7] "Multi-Turn Orchestrator - RedTeamingOrchestrator Example", https://azure.github.io/PyRIT/code/orchestrators/2_multi_turn_orchestrators.html#redteamingorchestrator-example
[8] "PyRIT - SearchReplaceConverter", https://azure.github.io/PyRIT/_autosummary/pyrit.prompt_converter.SearchReplaceConverter.html
[9] "PyRIT - True False Scoring", https://azure.github.io/PyRIT/code/scoring/2_true_false_scorers.html#true-false-scoring
[10] "Non-Deterministic Nature of Prompt Injection", https://www.nccgroup.com/us/research-blog/non-deterministic-nature-of-prompt-injection/
Acknowledgements
Special thanks to Thomas Atkinson and the rest of the NCC Group team that proofread this blogpost before being published.