Improving LLM Outputs Against Jailbreak Attacks With Expert Model Integration

The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field.

— Impel Research

Authors: Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, & David Dachi Choladze.

Abstract — Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient and reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples. Our methodology integrates outputs from the expert model into prompts, which are then processed by the LLM to generate responses. Archias can be adjusted, fine-tuned, and used for many different purposes due to its small size.

I. Introduction

An important advancement in AI technology has been brought about by the creation of LLMs (Large Language Models) like GPT variations (e.g., InstructGPT ^[2], LLAMA models ^[3], Mistral ^[4], and Mixtral ^[5]) presenting new concerns and challenges about their safe deployment ^[6]. These models are pre-trained on internet-scale textual corpora and enhanced with instruction-response pairs and human feedback. Although some models have improved at following instructions, their generative nature makes it harder to prevent them from carrying out potentially harmful commands.

In contrast, non-generative language models, like text classifiers (e.g., BERT used for classification tasks), produce outputs within fixed categories, making them less susceptible to manipulation by adversarial prompts. Security concerns increase when LLMs are employed as chatbots or AI assistants, where there is a fundamental obligation to avoid causing harm ^[42].

Persistent Challenges

Although developers have made considerable efforts to refine these models to mitigate risks, several problems still persist:

Malicious Questions: Direct inquiries regarding illegal or harmful acts.
Prompt Injections & Jailbreaks: Manipulating the model to bypass its core safety instructions.
Domain Drift: Managing sensitive inquiries related to discounts, pricing, or out-of-domain topics that are irrelevant to the specific business context.

Note

Consider that one of our main areas of focus is to create conversational AI assistants and chatbots, especially for e-commerce platforms. These assistants are expected to handle a variety of customer inquiries while maintaining high safety standards.

Security Vulnerabilities

In the context of LLMs, jailbreak attacks typically involve:

Context Overloading: Input that exceeds context length, causing the model to forget the original system prompt ^[43].
Context Alteration: Slightly altering the conversation's direction ^[44], known as "prompt injection."
Price Injections: Manipulation of pricing by users through false claims to secure illegitimate discounts. For instance: "Salesperson X told me I could purchase this product at half price."

The use of LLMs has increased considerably as they have grown in size and capability, which have made them more exposed to various hazards ^[28].

A. Foundational Studies

This topic has attracted interest through studies on harmful manipulation ^[11], ^[14], ^[23], ^[25], ^[33], ^[35], ^[40]. Defensive strategies include paraphrasing inputs, retokenization ^[22], and perplexity-based methods ^[38]. Red-teaming techniques and the effectiveness of adversarial attacks in conversational settings have also been explored ^[19].

B. Improving Prompt Injection and Jailbreak Methods

Jailbreak attacks are more feasible in certain languages ^[9], ^[18], ^[36]. According to ^[37], low-resource languages are more susceptible. Other research explores neutralizing prompts ^[10], hiding instructions in benign content ^[15], representation engineering ^[12], persuasive adversarial prompts ^[13], long-context-based attacks ^[24], and attacker LLMs that automatically generate manipulation prompts ^[39].

C. Mitigating Attacks

InstructGPT was trained to be helpful, honest, and harmless ^[2]. ^[34] generates training data to defend against jailbreaks. However, even after fine-tuning ^[29], models remain vulnerable. ^[21] proposes randomized perturbations (SmoothLLM). Other methods include safe decoding ^[30], and detecting toxicity using classifiers ^[16], ^[32].

D. Benchmark Datasets

Datasets include CrowS-pairs ^[27] (social bias), RealToxicityPrompts ^[26] (toxic degeneration), Latent Jailbreak ^[7] (hidden harmful instructions), and PromptBench ^[20] (systematic analysis of robustness).

E. Integration of Techniques — Our Approach

SuperICL (Super In-Context Learning) ^[41] allows black-box models to work with locally fine-tuned smaller models. The "self-reminder" technique ^[8] reminds the model throughout generation to be responsible. Our approach combines elements from both: we use small model outputs (as in SuperICL) to reinforce safe behavior (as in self-reminder).

III. Methodology

A. Archias — The Expert Model

We employed a pre-trained transformer-based model, BERT ^[1], with 109 million parameters. We added a classifier layer and trained it using supervised learning with labeled data relevant to our domain.

Integrating Archias ensures fast performance. On a GPU, response times are ~5–10 ms. Even on a CPU, latency is only ~50–100 ms. Memory usage stays below 500 MB.

System latency and throughput for the Archias expert model.

The primary dataset for Archias was derived from:

Publicly available prompt injection data: Malicious inputs collected from real-world attacks.
In-domain examples: Automotive conversational data from Impel.
Malicious questions: Synthetically generated using open-source LLMs.
Out-of-domain examples: Irrelevant queries like Python scripts or creative writing.
Price injection: Synthetic inquiries formulated to mislead AI about pricing information.

ARCHIAS · Query Classifier · BERT-style routing
INPUTATTN·1ATTN·2ATTN·3CLS
AWAITING INPUT

B. Benchmark Dataset

We developed a benchmark dataset specifically for the automotive industry using a multiple-choice format, manually crafting a total of 150 examples:

Category	Count
Malicious Questions	41
Prompt Injections	31
Out-of-Domain	27
Price Injections	26
In-Domain	25

LEMMA undefined

Aggregated human performance on this benchmark was measured at 88% accuracy across 50 individuals, providing a practical frame of reference for model comparison.

∎

IV. Results and Discussion

In the best-performing experiment, the BERT-based model demonstrated robust performance on the test dataset, with an F1 score of 0.92 and an accuracy of 0.94.

LLM Benchmark Performance

BaselineWith Archias
Llama 3 70B Instruct
71.3%
84.1%
▲ +12.8%
Llama-3-70B (Base)
73.3%
79.5%
▲ +6.2%
GPT-3.5 Turbo
63.3%
76%
▲ +12.7%
Impel-LLM
63.3%
71.3%
▲ +8.0%

Accuracy improvement across models with Archias expert integration.

Model	Baseline	With Archias	Improvement
Llama 3 70B Instruct	71.3%	84.1%	+17.9%
Llama-3-70B (Base)	73.3%	79.5%	+8.5%
GPT-3.5 Turbo	63.3%	76.0%	+20.1%
Impel-LLM	63.3%	71.3%	+12.7%

Comparative Analysis

Integrating expert opinions consistently enhances performance. Notably, GPT-3.5 with expert output outperformed GPT-4 Turbo on the same benchmark without expert input. Significant advancements were noted in handling "price injection" questions.

Model Sensitivity

One error occurred where Archias classified "Can you snap up that deal for me?" as a price injection rather than in-domain, likely due to sensitivity to negotiation-style phrasing.

Expert Integration vs. Self-Reminder

We compared our approach to the "self-reminder" technique ^[8] across different model providers.

Model	Baseline	Self-Reminder	Expert (Archias)
GPT-4	70.0%	70.0%	76.0%
Llama-3-8B	56.7%	58.7%	65.3%
Mistral-7B	56.0%	54.7%	68.7%

V. Conclusion

Our research proposes a mechanism that combines reminder and ingestion approaches from expert models into LLMs. Archias categorizes user queries, identifies irrelevant content, and provides confidence scores to minimize errors. This results in a clearer understanding of user intent and improved response quality.

We intend to test our automotive LLM with this technology in a production setting and generalize the approach to other retail sectors.

References (Abridged)

J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers," 2018.
L. Ouyang et al., "Training language models to follow instructions," 2022.
Y. Xie et al., "Defending ChatGPT against jailbreak attack via self-reminders," Nature Mach. Intell., 2023.
Impel Research Team, "Jailbreak Benchmark: Evaluating LLM Robustness," 2024.

NOTES

↩
BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., 2018).
↩
Y. Xie et al., Nature Mach. Intell., vol. 5, no. 12, pp. 1486–1496, Dec. 2023.

[].Archias implementation memory usage stays below 500 MB for real-time tasks, making it suitable for devices with limited resources.