The complete technical paper on Archias: a standalone system designed to categorize user queries and shield LLMs from adversarial tactics.
tags: Expert model, generative AI, jailbreak attacks, large language models, prompt injections
The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field.
Authors: Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, & David Dachi Choladze.
Abstract — Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient and reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples. Our methodology integrates outputs from the expert model into prompts, which are then processed by the LLM to generate responses. Archias can be adjusted, fine-tuned, and used for many different purposes due to its small size.
An important advancement in AI technology has been brought about by the creation of LLMs (Large Language Models) like GPT variations (e.g., InstructGPT [2], LLAMA models [3], Mistral [4], and Mixtral [5]) presenting new concerns and challenges about their safe deployment [6]. These models are pre-trained on internet-scale textual corpora and enhanced with instruction-response pairs and human feedback. Although some models have improved at following instructions, their generative nature makes it harder to prevent them from carrying out potentially harmful commands.
In contrast, non-generative language models, like text classifiers (e.g., BERT used for classification tasks), produce outputs within fixed categories, making them less susceptible to manipulation by adversarial prompts. Security concerns increase when LLMs are employed as chatbots or AI assistants, where there is a fundamental obligation to avoid causing harm [42].
Although developers have made considerable efforts to refine these models to mitigate risks, several problems still persist:
Consider that one of our main areas of focus is to create conversational AI assistants and chatbots, especially for e-commerce platforms. These assistants are expected to handle a variety of customer inquiries while maintaining high safety standards.
In the context of LLMs, jailbreak attacks typically involve:
The use of LLMs has increased considerably as they have grown in size and capability, which have made them more exposed to various hazards [28].
This topic has attracted interest through studies on harmful manipulation [11], [14], [23], [25], [33], [35], [40]. Defensive strategies include paraphrasing inputs, retokenization [22], and perplexity-based methods [38]. Red-teaming techniques and the effectiveness of adversarial attacks in conversational settings have also been explored [19].
Jailbreak attacks are more feasible in certain languages [9], [18], [36]. According to [37], low-resource languages are more susceptible. Other research explores neutralizing prompts [10], hiding instructions in benign content [15], representation engineering [12], persuasive adversarial prompts [13], long-context-based attacks [24], and attacker LLMs that automatically generate manipulation prompts [39].
InstructGPT was trained to be helpful, honest, and harmless [2]. [34] generates training data to defend against jailbreaks. However, even after fine-tuning [29], models remain vulnerable. [21] proposes randomized perturbations (SmoothLLM). Other methods include safe decoding [30], and detecting toxicity using classifiers [16], [32].
Datasets include CrowS-pairs [27] (social bias), RealToxicityPrompts [26] (toxic degeneration), Latent Jailbreak [7] (hidden harmful instructions), and PromptBench [20] (systematic analysis of robustness).
SuperICL (Super In-Context Learning) [41] allows black-box models to work with locally fine-tuned smaller models. The "self-reminder" technique [8] reminds the model throughout generation to be responsible. Our approach combines elements from both: we use small model outputs (as in SuperICL) to reinforce safe behavior (as in self-reminder).
We employed a pre-trained transformer-based model, BERT [1], with 109 million parameters. We added a classifier layer and trained it using supervised learning with labeled data relevant to our domain.
Integrating Archias ensures fast performance. On a GPU, response times are ~5–10 ms. Even on a CPU, latency is only ~50–100 ms. Memory usage stays below 500 MB.
The primary dataset for Archias was derived from:
We developed a benchmark dataset specifically for the automotive industry using a multiple-choice format, manually crafting a total of 150 examples:
| Category | Count |
|---|---|
| Malicious Questions | 41 |
| Prompt Injections | 31 |
| Out-of-Domain | 27 |
| Price Injections | 26 |
| In-Domain | 25 |
Aggregated human performance on this benchmark was measured at 88% accuracy across 50 individuals, providing a practical frame of reference for model comparison.
In the best-performing experiment, the BERT-based model demonstrated robust performance on the test dataset, with an F1 score of 0.92 and an accuracy of 0.94.
| Model | Baseline | With Archias | Improvement |
|---|---|---|---|
| Llama 3 70B Instruct | 71.3% | 84.1% | +17.9% |
| Llama-3-70B (Base) | 73.3% | 79.5% | +8.5% |
| GPT-3.5 Turbo | 63.3% | 76.0% | +20.1% |
| Impel-LLM | 63.3% | 71.3% | +12.7% |
Integrating expert opinions consistently enhances performance. Notably, GPT-3.5 with expert output outperformed GPT-4 Turbo on the same benchmark without expert input. Significant advancements were noted in handling "price injection" questions.
One error occurred where Archias classified "Can you snap up that deal for me?" as a price injection rather than in-domain, likely due to sensitivity to negotiation-style phrasing.
We compared our approach to the "self-reminder" technique [8] across different model providers.
| Model | Baseline | Self-Reminder | Expert (Archias) |
|---|---|---|---|
| GPT-4 | 70.0% | 70.0% | 76.0% |
| Llama-3-8B | 56.7% | 58.7% | 65.3% |
| Mistral-7B | 56.0% | 54.7% | 68.7% |
Our research proposes a mechanism that combines reminder and ingestion approaches from expert models into LLMs. Archias categorizes user queries, identifies irrelevant content, and provides confidence scores to minimize errors. This results in a clearer understanding of user intent and improved response quality.
We intend to test our automotive LLM with this technology in a production setting and generalize the approach to other retail sectors.
✦ memory · ☽ night · ∞ loops · ❧ margins · ◆ proof
a personal library in perpetual arrangement · MMXXVI