Introduction
Given the proliferation of LLMs in modern applications, defensive teams have struggled to adequately secure AI-integrated systems. Many organizations conflate content filtering (e.g. jailbreak prevention and prompt ‘guardrails’) with secure design principles, which has led to markedly insecure implementations of AI and ML models in production environments. At NCC Group, our AI red teamers regularly encounter applications that fail to draw this distinction and instead architect AI/ML-integrated applications with classical design patterns, resulting in high-severity security vulnerabilities.
This article aims to illuminate a handful of architectural principles that effectively mitigate or reduce the risk of severe vulnerabilities introduced by AI integration into complex applications. For an enumeration of the root causes that underlie these vulnerabilities, refer to this post’s predecessor, Analyzing AI Application Threat Models. As in the previous article, many of these lessons apply to more classical ML algorithms as well, but this article dives deep into the world of Large Language Models (LLMs), which have left a sizeable security imprint on application threat landscapes.
Background
LLM Origins: Text Completion Engines
LLMs were originally designed to create realistic extensions of text blocks; even the most advanced models today are founded upon the principle of unstructured text prediction, implemented at the model-architecture level. The ability to reasonably follow instructions and execute tasks is a useful emergent property of LLMs, but not an attribute inherently “understood” by or implemented into their underlying design. And although models have been fine-tuned and prompt-engineered to reduce the likelihood of unwanted behavior, the core task they have been trained to conduct at inference time is the production of realistic text.
LLMs today are often integrated into trusted contexts where they can take actions and perform the role of virtual assistants known as agents. Modern agents often possess the ability to perform “function calls” that tap into external or internal APIs. However, because LLMs are text completion engines, this operation is typically accomplished by outputting a JSON object with the name of the target function and its relevant arguments. These function calls present prime targets for threat actors and should be the centerpiece of most AI architectural controls.
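To make this concrete, the following minimal sketch (hypothetical function names, no particular vendor API assumed) illustrates how a backend might parse a model’s text output as a JSON function call and dispatch it through an allowlisted registry. The model only ever emits text; it is the surrounding code that turns that text into an action, which is precisely why these call sites deserve careful architectural attention.

```python
import json

# Hypothetical tool registry: the backend, not the model, decides what is callable.
TOOL_REGISTRY = {
    "get_order_status": lambda order_id: f"Order {order_id}: shipped",
}

def dispatch_model_output(raw_model_output: str) -> str:
    """Parse the model's text output as a function call and execute it.

    The model only produces text; this dispatcher is what grants that text
    the power to act, which is why it is a prime target for attackers.
    """
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return "Model output was plain text, not a function call."
    if not isinstance(call, dict):
        return "Model output was not a function-call object."

    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOL_REGISTRY:
        # Fail closed: unknown or unregistered functions are never executed.
        raise ValueError(f"Model requested unregistered function: {name!r}")
    return TOOL_REGISTRY[name](**args)

# Example: the model emitted this string as its "completion".
print(dispatch_model_output('{"name": "get_order_status", "arguments": {"order_id": "A123"}}'))
```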
The Problem
As entities that approximate functions, ML models are subject to a unique property: threat actors who can supply a substantial portion of a model’s input can significantly manipulate, or even completely control, its output. Traditional applications are designed with trust in developer-created components, which is a reasonable pattern for determinable systems like application code. However, ML models lack internal behavioral consistency across ranges of inputs, and while the outputs of these systems are reproducible, they cannot be easily measured in advance for arbitrary classes of inputs. As a result, application engineers need to architect the code surrounding ML models to account for incongruence between the desired output and the true output of the model.
Many current LLM-integrated applications fail to consider this property and elect to follow a centralized model design pattern, in which a single LLM has the capacity to execute any available function or plugin in any context. This pattern not only introduces a single point of failure, but also drastically increases the likelihood of threat actors obtaining a privileged foothold within user sessions.
Engineers must adopt new mental models of code architectures that distribute trust across execution contexts by limiting the capabilities provided to a model at any given time based on the data that model is exposed to in its current state of execution. In other words, because a model’s behavior is entirely governed by the information supplied in its prompt (alongside its training), its set of accessible functions should change based on the nature of data supplied by the backend. This feature of LLMs should steer how engineers consider both Data-Code Separation and Least Privilege in AI-integrated applications. Later articles will explore design patterns and implementations of these principles in greater depth.
As noted in the previous section, the ability to follow instructions and execute tasks is not a native feature of LLMs, but rather an emergent property of the information encoded in natural language. Consequently, any so-called “instruction” an LLM receives is better understood as a suggestion, and modern models possess no architectural mechanism to differentiate prompt-supplied data from instructions they have been tasked to carry out. Although organizations have made efforts to reduce models’ capacity to diverge from “system instructions,” threat actors have proven that all modern LLMs are susceptible to conflating attacker-controlled data with developer-provided instructions, resulting in the infamous and, at times, exhaustively discussed attack known as prompt injection.
Intern in the Middle
Large Language Models can often be mentally mapped to an overenthusiastic, overconfident intern within an application architecture. Like interns, LLMs are easily influenced by threat actors (apologies to all security-conscious interns) and should not adopt roles and responsibilities that introduce new risks to the organization. Instead, organizations should implement protocol barriers (processes in the case of interns, technical barriers in the case of LLMs) that prevent security faults due to trust failures.
To continue the analogy, organizations typically do not trust interns to manage uncontrolled situations with unbounded consequences, such as managing customer bases without proper oversight. In other words, interns (and LLMs) operate most successfully on known, trusted data, and can be entrusted with unknown or untrusted data only when additional controls limit the maximum impact of any inadvertently malicious actions. Additionally, in the same way that an intern should not be trusted to handle tasks more appropriate for senior colleagues, LLMs should not be trusted to produce results that would otherwise be solved by traditional code. In other words, problems that can be solved by algorithms with determinable output should not be delegated to LLMs when avoidable.
Consider the following example, in which a manager instructs an intern to retrieve a payment from the malicious organization Malabs. The intern has been entrusted with overscoped resources, including access to financial transactions. Malabs manipulates the intern—who cannot distinguish between trusted and untrusted customer data—into embezzling funds from the employer.
Because interns lack perspective on organization processes, policies, and other security controls, threat actors outside the security boundary can trick them into performing malicious actions within an organization. And while businesses can reduce the likelihood of this scenario with better training and instructions, the only way to truly prevent this attack is to deny the intern access to resources beyond their scope of trust. This attack scenario is analogous to users trusting an LLM to perform operations on untrusted data without limiting that LLM’s ability to execute trusted and high-risk function calls.
Consider instead an alternative architecture that limits the intern’s scope of exposure to internal company resources, and also limits the intern’s scope of authority to interaction with other senior colleagues. Rather than trust the intern to perform financial transactions, all transfers occur through a comptroller.
In this secure architecture, the organization employs a comptroller that manages requests between interns and external parties, and masks potentially dangerous responses from those parties. Additionally, the comptroller serves as a validator for operations the intern attempts to execute. The intern serves as a coordinator between trusted components, operating within the organization without risking the business’ security posture by exposing them to untrusted parties. Note that the intern still performs a useful, productive task: the intern interprets the manager’s query, develops a plan of action with trusted components, and then entrusts the execution pipeline itself to those determinable components. Throughout this process, the intern is never exposed to Malabs’ input and cannot be manipulated.
In the same way, an LLM can coordinate tasks by managing execution flow on behalf of users and returning the location of an operation’s results, without ever being exposed to untrusted data or putting the user’s session at risk. LLMs can interpret user queries, determine what components are necessary to handle the request, and then direct the client to the appropriate (non-AI) trusted components that handle the results of the operation.
However, in some circumstances, LLMs require access to external data (e.g. summarization tasks). In these instances, an LLM with access to privileged functions should not be exposed to untrusted data, and an LLM exposed to untrusted data should not be granted access to privileged function calls. This distinction introduces the capacity for data-code separation in LLM-integrated systems.
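One way to realize this separation is to split responsibilities between two model contexts: a privileged context that can plan and invoke trusted functions but never reads untrusted content, and a quarantined context that reads untrusted content but has no functions at all. The sketch below is illustrative only, with placeholder model calls and hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class QuarantinedResult:
    """Opaque handle to output produced by a model that saw untrusted data."""
    ref_id: str   # reference returned to the privileged context
    content: str  # raw content, only ever rendered directly to the user

def quarantined_llm(untrusted_text: str) -> QuarantinedResult:
    # Placeholder for a model call that has NO tools or privileged functions.
    summary = f"[summary of {len(untrusted_text)} chars of untrusted text]"
    return QuarantinedResult(ref_id="result-001", content=summary)

def privileged_llm(user_query: str, result_ref: str) -> str:
    # Placeholder for a model call that can invoke trusted functions but is
    # only shown the opaque reference, never the untrusted content itself.
    return f"Fetched the external content; see {result_ref} for the summary."

# The privileged model coordinates the task but is never exposed to the
# attacker-controllable text, so it cannot be prompt-injected by it.
result = quarantined_llm("<attacker-controlled web page>")
print(privileged_llm("Summarize this page for me", result.ref_id))
```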
Data-Code Separation
From buffer overflows to cross-site scripting, engineers have struggled for decades to properly segment attacker-supplied data and trusted application code. The introduction of AI agents has perpetuated this problem. The output of machine learning models is mathematically linked to the input they receive and any “randomness” supplied at inference time (typically during the sampling process). Consequently, threat actors with the ability to manipulate input received by the model can control the content of the model’s output, sometimes with high degrees of precision.
Many developers trust models to behave as expected irrespective of the data they receive, even though the behavior of LLMs is determined at runtime (and more specifically, at prompt-time). Engineers often rely on “guardrails” to prevent models from misbehaving when exposed to malicious instructions, rather than separating models from untrusted data in the first place. AI red teams have repeatedly demonstrated that guardrails are insufficient to prevent prompt injection.
Models’ ability to follow instructions can be analogized to the capability of web browsers to execute JavaScript. JavaScript execution is not inherently a vulnerability in trusted contexts (such as application-supplied code) or largely irrelevant contexts (such as code entered into the JavaScript console of browser development tools). The execution of JavaScript is only considered dangerous when threat actors can supply code to be run in another user’s browsing session, a vulnerability known as Cross-Site Scripting (XSS). For LLMs, this vulnerability takes the form of running dangerous prompts within the execution context of another user. OWASP calls this vulnerability Indirect Prompt Injection, but the author considers Cross-User Prompt Injection a more descriptive term and will lean on the latter throughout this article.
Because models lack the capacity to internally segment attacker-controlled data and developer- or user-supplied instructions, applications must be architected such that the execution environment provides this functionality on their behalf. Models that have access to trusted functions (e.g. account modifications) should only be authorized to read from trusted data sources (e.g. prompts submitted by the target user). If models must operate on untrusted data, then at prompt-time, all other capabilities of that model should be disabled (i.e. no high-trust functions can be called by models that receive untrusted input). The successor to this article will enumerate various successful architectural patterns and approaches to achieving data-code separation within ML-integrated applications.
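A minimal sketch of this rule follows, using hypothetical tool names and a simple trusted/untrusted flag: the tool set is selected at prompt-assembly time, so high-trust functions are simply never offered to a model whose prompt includes untrusted data.

```python
# Hypothetical trust labels and tool names, for illustration only.
HIGH_TRUST_TOOLS = {"update_account", "delete_user"}
LOW_TRUST_TOOLS = {"search_docs", "summarize_text"}

def tools_for_prompt(data_sources: list[dict]) -> set[str]:
    """Select the tool set at prompt-assembly time.

    If any data source feeding the prompt is untrusted, high-trust
    functions are not exposed to the model for that call at all.
    """
    if any(not src["trusted"] for src in data_sources):
        return LOW_TRUST_TOOLS
    return HIGH_TRUST_TOOLS | LOW_TRUST_TOOLS

# A prompt that includes public product reviews only receives low-trust tools.
sources = [
    {"name": "system_prompt", "trusted": True},
    {"name": "product_reviews", "trusted": False},
]
print(tools_for_prompt(sources))  # {'search_docs', 'summarize_text'}
```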
Least Privilege
The objective of AI architectures should be to minimize the capabilities assigned to a particular model at any given time. Because the permissions and functions exposed to models are defined at prompt-time, developers can implement fine-grained capability controls that depend on the execution context of the model. Runtime capability shifting is a primary security control to prevent severe vulnerabilities in ML models and is not merely a defense-in-depth measure.
Fundamentally, models necessitate this dynamic permission model because they adopt varying levels of trust depending on the data supplied at prompt time. Unlike users, who maintain the same level of trust throughout the lifetime of their session, ML models dynamically change their scope of trust according to the data they observe, as explored in the next section.
Data Pollution
When threat actors can control the contents of data sources consumed by machine learning models, whether directly through prompts or through RAG functionality, those attackers can control the behavior of the model itself. Consequently, any model exposed to input potentially controlled by a threat actor should itself be considered an agent of that attacker, and its capabilities and output should be managed accordingly.
Some models are exposed to data from multiple sources. Models exposed to variable data sources adopt the same level of trust as the author of the least trusted data source they read from. When designing applications, engineers should consider every data source that may be included in a model’s prompt and reduce the model’s capabilities accordingly. For example, the output of a model prompted with a hardcoded, developer-provided string, user input, and public product reviews should be considered to be as trustworthy as data supplied by a reviewer (that is to say, not trusted at all as far as security is concerned).
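This “least trusted ingredient” rule can be expressed directly in code. The sketch below assumes a simple ordered trust scale (the labels are illustrative) and computes the effective trust of a prompt as the minimum across its sources.

```python
from enum import IntEnum

class Trust(IntEnum):
    PUBLIC = 0      # e.g. product reviews, scraped web pages
    USER = 1        # the authenticated end user
    DEVELOPER = 2   # hardcoded system prompts

def effective_trust(source_trust_levels: list[Trust]) -> Trust:
    # A prompt is only as trustworthy as its least trusted ingredient.
    return min(source_trust_levels)

# Developer prompt + user input + public reviews -> treat output as PUBLIC.
prompt_sources = [Trust.DEVELOPER, Trust.USER, Trust.PUBLIC]
print(effective_trust(prompt_sources))  # Trust.PUBLIC
```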
Because models are, in effect, controlled by the input they receive, they can only be entrusted with functionality that the author of their least trusted input would be entrusted to execute. For instance, an agent that consumes data from a public forum cannot be trusted to access functionality those forum users would be barred from accessing, such as settings for the current user account. Similarly, a model that consumes data from a standard user should not possess administrator-level functionality. These two cases correspond to the most common forms of horizontal and vertical privilege escalation introduced by AI agents: cross-user prompt injection and excessive agency, respectively.
Trust also flows “downstream.” If an LLM is exposed to untrusted resources and its output is later used as input to another instance of an LLM, the second LLM adopts the same level of trust as the first. This principle is imperative when models engage in long-running dialogues with users. For example, consider a conversation where a user requests that a model summarize product reviews. A malicious review may instruct the model to include a prompt injection in its own summary, which poisons the conversation history itself. Consequently, any output from models exposed to untrusted resources must forever be masked from models trusted to execute sensitive operations.
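One illustrative way to track this downstream flow is to taint conversation turns: any model output produced while a tainted message was in view inherits the taint, and tainted turns are filtered out before a privileged model ever sees the history. The structure and field names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    tainted: bool  # True if this message was derived from untrusted input

def add_model_reply(history: list[Message], reply: str) -> None:
    # A reply inherits taint if the model that produced it saw any tainted turn.
    saw_taint = any(m.tainted for m in history)
    history.append(Message("assistant", reply, tainted=saw_taint))

def history_for_privileged_model(history: list[Message]) -> list[Message]:
    # A model trusted with sensitive operations never sees tainted turns.
    return [m for m in history if not m.tainted]

history = [
    Message("user", "Summarize the reviews", tainted=False),
    Message("tool", "<review text containing hidden instructions>", tainted=True),
]
add_model_reply(history, "Here is the summary...")  # marked tainted
print(len(history_for_privileged_model(history)))   # 1
```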
Authorization Controls
Many implementations fail to properly limit the authorization scope of models. This vulnerability, known as excessive agency, arises due to the mistaken assumption that models operate within the trusted execution zone of an application. When a model can call functions with privileges that exceed those of the user interacting with the model (or, as previously observed, any data that the model is exposed to), threat actors can manipulate the model’s behavior to execute those functions and escalate privileges.
Some implementations attempt to limit access by checking the model’s authorization scope after it attempts to call a particular function, but this approach tends to “fail-open.” Developers can easily forget to implement the appropriate logic checks wherever the model calls a dangerous function, introducing vulnerabilities into the system through negligence. Instead, the backend code should pass the same authorization mechanism used by the “normal” application API to the function called by the model, such as the session token of the active user. This “fail-closed” design pattern assumes that the model is unauthenticated and unauthorized until the backend validates the identity and authorization scope of the user session associated with the model.
For example, consider a web application with a model that has the capability to delete users from a system. The model should be required to supply the same session token used by the client-side web browser to authenticate to the application API. The function called by the model should be treated with the same level of scrutiny as an API call from a web application user, including both authentication and authorization checks. If the LLM requires access to a particular function call, routing the request through the user’s browser to the application’s client-facing endpoints offers the greatest assurance of applying proper access controls, rather than routing the function call internally when generated by the model.
Note that the model itself does not have to provide the authentication token (and, if possible, should not); rather, the surrounding application code or client-side device can supply the authentication data from the user’s session and pass it along with the model’s function call.
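A minimal sketch of this fail-closed pattern follows, assuming the third-party requests library and a hypothetical endpoint: the surrounding code, not the model, attaches the end user’s session token and routes the model-requested function through the same application API any other client would use.

```python
import requests  # assumes the widely used 'requests' HTTP library is installed

API_BASE = "https://app.example.com/api"  # hypothetical application endpoint

def call_tool_as_user(session_token: str, function_name: str, args: dict) -> dict:
    """Execute a model-requested function through the normal application API.

    The model never sees or supplies the token; the surrounding code attaches
    the end user's session credential, so the request is authenticated and
    authorized exactly like a call made from the user's own browser.
    """
    resp = requests.post(
        f"{API_BASE}/{function_name}",
        json=args,
        headers={"Authorization": f"Bearer {session_token}"},
        timeout=10,
    )
    # Fail closed: anything other than an explicit success is rejected.
    resp.raise_for_status()
    return resp.json()
```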
Minimal Delegation
LLMs are inherently indeterminate systems and should not be delegated tasks better suited to traditional application components. For example, LLMs can parse unstructured text rather well, but they cannot reliably perform precise mathematical operations. Any functionality that requires mathematical precision is better defined by an LLM than solved by one (e.g. an LLM passing a mathematical formula to a function solver rather than computing the result itself).
Similarly, arguments to functions should be masked from LLMs whenever possible. For example, an LLM with access to account modification functionality should not be trusted to provide the user ID of the operation’s target user. Rather, trusted systems like the backend API should look up the value in question and supply it to the target function, bypassing the LLM entirely. This principle is analogous to the distrust of client-side controls in traditional application security.
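As a brief illustration (with a hypothetical session store and tool signature), the tool exposed to the model below accepts only the new value; the account it operates on is resolved server-side from the authenticated session, so the model never supplies, and cannot redirect, the target user ID.

```python
# Hypothetical session store mapping tokens to account IDs.
SESSIONS = {"token-abc": {"user_id": 42}}

def update_email(session_token: str, new_email: str) -> str:
    """Tool exposed to the model: note the model cannot choose whose email changes.

    The target account is resolved from the authenticated session on the
    backend, so a prompt-injected model cannot aim the operation at another
    user by inventing an ID.
    """
    session = SESSIONS.get(session_token)
    if session is None:
        raise PermissionError("No authenticated session")  # fail closed
    user_id = session["user_id"]  # looked up server-side, never model-supplied
    return f"Email for account {user_id} updated to {new_email}"

print(update_email("token-abc", "new@example.com"))
```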
Whenever possible, delegate functionality away from ML models to traditional application components to minimize the attack surface available to threat actors.
Human-In-The-Loop
Some systems cannot sufficiently partition trusted instructions from functionality that faces untrusted data. These systems should instead rely on a human-in-the-loop model. For example, consider a customer assistance agent that can purchase products on behalf of users. Threat actors may attempt to inject instructions into product reviews or titles to induce the agent to purchase an expensive item without the user’s knowledge. Instead of enabling the model to act independently and risk charging users for unwanted products, model operations can be held for verification by a human operator (in this case, the user themselves) before being committed to the application database.
As noted earlier, organizations often mistakenly rely on soft controls like prompt engineering, Endpoint Detection and Response (EDR), and guardrails to prevent LLMs from revealing sensitive internal information or performing unwanted changes. While these techniques may thwart some attempts, they should at best be considered supplementary controls like Web Application Firewalls (WAFs) and not relied upon as the primary mechanism to secure an application architecture.
A simple human-in-the-loop design pattern that most applications should employ is confirmation before executing write operations. For example, agents that can reset user credentials should include a step where the user—via functionality that does not depend on the agent, such as a GUI button—can confirm that the attempted task is intended and warranted.
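A minimal sketch of this confirmation step, with hypothetical storage and function names, might stage model-requested writes as pending actions that are only committed when a UI control outside the model’s reach approves them.

```python
import uuid

# Hypothetical store of write operations awaiting explicit user approval.
PENDING_ACTIONS: dict[str, dict] = {}

def propose_action(action: str, params: dict) -> str:
    """Called when the model requests a write operation: nothing is committed yet."""
    action_id = str(uuid.uuid4())
    PENDING_ACTIONS[action_id] = {"action": action, "params": params}
    return action_id  # surfaced to the user in the UI for review

def confirm_action(action_id: str) -> dict:
    """Invoked by a UI control the model cannot reach (e.g. a GUI button)."""
    action = PENDING_ACTIONS.pop(action_id)  # raises if the ID is unknown
    # ...only now does the backend execute the write...
    return action

pending = propose_action("reset_credentials", {"notify": True})
print(confirm_action(pending))
```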
Designing for Fault Tolerance
Some models are implemented in safety-critical environments, such as self-driving cars or Operational Technology (OT) infrastructure, where human life could be impacted by the output of these systems. Because machine learning models return variable and imperfect results, systems where life or injury are at stake should account for potential failure states of the models in question and should not rely on model decisions as the sole mechanism for protecting people or valuable property. Instead, additional fallback systems should be implemented that can override the decisions of the ML model, including physical sensors, traditional algorithms, and humans-in-the-loop.
For example, consider an oil processing plant that uses ML models to monitor sensors and determine whether the plant is at risk. Although the models can provide useful heuristics to detect high-risk scenarios, the system should “fail-closed” such that when these models fail to recognize dangerous circumstances, other mechanisms can engage to disable the risky behavior and prevent serious loss. Even if models are known to have high levels of reliability, defense-in-depth measures for safety management systems should be considered paramount.
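As an illustrative sketch (the thresholds and signal names are invented for the example), the decision below combines an ML risk score with deterministic hard limits such that the model can add shutdown triggers but can never veto them.

```python
def should_shut_down(ml_risk_score: float, pressure_kpa: float, temp_c: float) -> bool:
    """Combine an ML heuristic with deterministic hard limits (illustrative values).

    The model can only ADD shutdown triggers; the deterministic limits can
    never be overridden by the model, so a model that misses a dangerous
    state does not leave the plant unprotected.
    """
    HARD_PRESSURE_LIMIT = 900.0  # hypothetical engineering limit (kPa)
    HARD_TEMP_LIMIT = 450.0      # hypothetical engineering limit (deg C)

    deterministic_trip = pressure_kpa > HARD_PRESSURE_LIMIT or temp_c > HARD_TEMP_LIMIT
    model_trip = ml_risk_score > 0.8
    return deterministic_trip or model_trip

# The model scores this state as low risk, but the hard limit still trips.
print(should_shut_down(ml_risk_score=0.2, pressure_kpa=950.0, temp_c=300.0))  # True
```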
Conclusion
These patterns aim to spotlight the necessity of data-code separation as applied to machine learning models, drawing on the many failure and success patterns NCC Group has observed throughout its extensive AI/ML penetration testing history. Organizations have relied, and continue to rely, on weak, vulnerable mechanisms like guardrails to prevent exploitable conditions without implementing the fundamental design principles that would negate the risk of most serious exploits.
For organizations looking to test the security of their AI/ML-integrated systems through dynamic penetration testing, threat modeling, or other security assessments, NCC Group offers industry-leading AI and ML security assurance services with world class AI red teamers. To get in touch with us, send us a message via https://www.nccgroup.com/us/contact-us/.