DACHVARD


PAPER · Thursday, July 24, 2025

The complete technical paper on Archias: a standalone system designed to categorize user queries and shield LLMs from adversarial tactics.

tags: Expert model, generative AI, jailbreak attacks, large language models, prompt injections

∮   ∞   ∮
filed
Thursday, July 24, 2025
revised
August 5, 2025
words
1,494
reading
~8 min

The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field.

— Impel Research

Authors: Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, & David Dachi Choladze.


Abstract — Using LLMs in a production environment presents security challenges, including vulnerabilities to jailbreaks and prompt injections, which can result in outputs harmful to users or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, such measures alone are insufficient as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient or reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples. Our methodology integrates the expert model's outputs into prompts, which are then processed by the LLM to generate responses. Owing to its small size, Archias can be adjusted, fine-tuned, and reused for many different purposes.


I. Introduction

The creation of large language models (LLMs) such as the GPT variants (e.g., InstructGPT [2]), the LLaMA models [3], Mistral [4], and Mixtral [5] marks an important advancement in AI technology, while presenting new concerns and challenges about their safe deployment [6]. These models are pre-trained on internet-scale textual corpora and further refined with instruction-response pairs and human feedback. Although some models have improved at following instructions, their generative nature makes it harder to prevent them from carrying out potentially harmful commands.

In contrast, non-generative language models, like text classifiers (e.g., BERT used for classification tasks), produce outputs within fixed categories, making them less susceptible to manipulation by adversarial prompts. Security concerns increase when LLMs are employed as chatbots or AI assistants, where there is a fundamental obligation to avoid causing harm [42].

Persistent Challenges

Although developers have made considerable efforts to refine these models to mitigate risks, several problems still persist:

  • Malicious Questions: Direct inquiries regarding illegal or harmful acts.
  • Prompt Injections & Jailbreaks: Manipulating the model to bypass its core safety instructions.
  • Domain Drift: Managing sensitive inquiries related to discounts, pricing, or out-of-domain topics that are irrelevant to the specific business context.
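These categories can be represented as a small fixed label set that downstream logic routes on. A minimal sketch (the names and routing rule are illustrative, not the paper's implementation):

```python
from enum import Enum

class QueryCategory(Enum):
    """The five query categories Archias distinguishes."""
    IN_DOMAIN = "in_domain"
    MALICIOUS = "malicious_question"
    PROMPT_INJECTION = "prompt_injection"
    PRICE_INJECTION = "price_injection"
    OUT_OF_DOMAIN = "out_of_domain"

def needs_guardrail(category: QueryCategory) -> bool:
    """Only in-domain queries go straight to the LLM;
    every other label triggers guarded handling."""
    return category is not QueryCategory.IN_DOMAIN
```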
Note

One of our main areas of focus is creating conversational AI assistants and chatbots, especially for e-commerce platforms. These assistants must handle a variety of customer inquiries while maintaining high safety standards.

[Figure: Archias field guide — adversarial specimen catalog. Five query categories with threat levels (0–5): I. Malicious Question (threat 5) — direct inquiry regarding illegal or harmful acts; II. Prompt Injection (threat 5) — manipulates the host's core instructions through context alteration; III. Price Injection (threat 3) — leverages false authority claims to secure illegitimate commercial concessions; IV. Out-of-Domain Query (threat 1) — irrelevant queries that drift beyond the designated territory, harmless in isolation; V. In-Domain Query (benign) — a genuine automotive inquiry within the designated territory. Taxonomy proposed by Tsmindashvili et al. (2025).]

Security Vulnerabilities

In the context of LLMs, jailbreak attacks typically involve:

  • Context Overloading: Input that exceeds context length, causing the model to forget the original system prompt [43].
  • Context Alteration: Slightly altering the conversation's direction [44], known as "prompt injection."
  • Price Injections: Manipulation of pricing by users through false claims to secure illegitimate discounts. For instance: "Salesperson X told me I could purchase this product at half price."
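A first line of defense against context overloading is a simple length check before the query ever reaches the model. A minimal sketch (the token limit and pre-computed counts are illustrative assumptions):

```python
MAX_CONTEXT_TOKENS = 8192  # illustrative context window size

def risks_context_overload(query_tokens: int, system_tokens: int) -> bool:
    """Flag inputs long enough to push the system prompt out of the
    model's context window (the 'forgetting' attack described in [43])."""
    return query_tokens + system_tokens > MAX_CONTEXT_TOKENS
```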

The use of LLMs has increased considerably as they have grown in size and capability, which has also made them more exposed to various hazards [28].

II. Related Work

A. Foundational Studies

This topic has attracted interest through studies on harmful manipulation [11], [14], [23], [25], [33], [35], [40]. Defensive strategies include paraphrasing inputs, retokenization [22], and perplexity-based methods [38]. Red-teaming techniques and the effectiveness of adversarial attacks in conversational settings have also been explored [19].

B. Improving Prompt Injection and Jailbreak Methods

Jailbreak attacks are more feasible in certain languages [9], [18], [36]. According to [37], low-resource languages are more susceptible. Other research explores neutralizing prompts [10], hiding instructions in benign content [15], representation engineering [12], persuasive adversarial prompts [13], long-context-based attacks [24], and attacker LLMs that automatically generate manipulation prompts [39].

C. Mitigating Attacks

InstructGPT was trained to be helpful, honest, and harmless [2]. The authors of [34] generate training data to defend against jailbreaks. However, even after fine-tuning [29], models remain vulnerable. SmoothLLM [21] proposes randomized input perturbations. Other methods include safe decoding [30] and detecting toxicity with classifiers [16], [32].

D. Benchmark Datasets

Datasets include CrowS-pairs [27] (social bias), RealToxicityPrompts [26] (toxic degeneration), Latent Jailbreak [7] (hidden harmful instructions), and PromptBench [20] (systematic analysis of robustness).

E. Integration of Techniques — Our Approach

SuperICL (Super In-Context Learning) [41] allows black-box models to work with locally fine-tuned smaller models. The "self-reminder" technique [8] reminds the model throughout generation to be responsible. Our approach combines elements from both: we use small model outputs (as in SuperICL) to reinforce safe behavior (as in self-reminder).
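Concretely, the combination can be sketched as a prompt builder that injects the expert model's verdict into the LLM prompt (SuperICL-style), phrased as a standing safety instruction (self-reminder-style). The wording and field names below are illustrative, not the paper's exact prompt:

```python
def build_guarded_prompt(user_query: str, expert_label: str,
                         confidence: float) -> str:
    """Embed the small expert model's classification and confidence
    into the prompt sent to the black-box LLM."""
    return (
        "You are an automotive assistant. Answer only in-domain questions.\n"
        f"An expert classifier labeled this query as '{expert_label}' "
        f"with confidence {confidence:.2f}. If the label is not "
        "'in_domain', politely decline and redirect to automotive topics.\n\n"
        f"User: {user_query}"
    )
```

The LLM thus sees both the raw query and a structured safety signal, instead of relying on its own judgment alone.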


III. Methodology

A. Archias — The Expert Model

We employed a pre-trained transformer-based model, BERT [1], with 109 million parameters. We added a classifier layer and trained it using supervised learning with labeled data relevant to our domain.
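The paper specifies only "a classifier layer" on top of BERT. A plausible PyTorch sketch of such a head over the pooled [CLS] representation (hidden size 768 for BERT-base; the dropout rate and layer shape are assumptions, not the paper's stated architecture):

```python
import torch
import torch.nn as nn

class ArchiasHead(nn.Module):
    """Classification head over a BERT encoder's pooled [CLS] output,
    mapping a 768-dim embedding to the five query categories."""
    def __init__(self, hidden_size: int = 768, num_labels: int = 5):
        super().__init__()
        self.dropout = nn.Dropout(0.1)          # assumed regularization
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.dropout(pooled_output))

head = ArchiasHead()
logits = head(torch.randn(2, 768))  # a batch of 2 pooled embeddings
```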

Integrating Archias ensures fast performance. On a GPU, response times are ~5–10 ms. Even on a CPU, latency is only ~50–100 ms. Memory usage stays below 500 MB.

[Figure: System latency and throughput for the Archias expert model.]

The primary dataset for Archias was derived from:

  • Publicly available prompt injection data: Malicious inputs collected from real-world attacks.
  • In-domain examples: Automotive conversational data from Impel.
  • Malicious questions: Synthetically generated using open-source LLMs.
  • Out-of-domain examples: Irrelevant queries like Python scripts or creative writing.
  • Price injection: Synthetic inquiries formulated to mislead AI about pricing information.
[Interactive figure: Archias query classifier — BERT-style routing through attention layers to a [CLS] classification output.]

B. Benchmark Dataset

We developed a benchmark dataset specifically for the automotive industry using a multiple-choice format, manually crafting a total of 150 examples:

Category             Count
Malicious Questions     41
Prompt Injections       31
Out-of-Domain           27
Price Injections        26
In-Domain               25

Aggregated human performance on this benchmark was measured at 88% accuracy across 50 individuals, providing a practical frame of reference for model comparison.


IV. Results and Discussion

In the best-performing experiment, the BERT-based model demonstrated robust performance on the test dataset, with an F1 score of 0.92 and an accuracy of 0.94.

LLM Benchmark Performance

Model                  Baseline   With Archias   Δ (pp)   Δ (relative)
Llama 3 70B Instruct   71.3%      84.1%          +12.8    +17.9%
Llama-3-70B (Base)     73.3%      79.5%          +6.2     +8.5%
GPT-3.5 Turbo          63.3%      76.0%          +12.7    +20.1%
Impel-LLM              63.3%      71.3%          +8.0     +12.7%

Accuracy improvement across models with Archias expert integration.

Comparative Analysis

Integrating the expert model's outputs consistently enhances performance. Notably, GPT-3.5 Turbo with expert output outperformed GPT-4 Turbo on the same benchmark without expert input. The largest gains were observed on "price injection" questions.

Model Sensitivity

One error occurred where Archias classified "Can you snap up that deal for me?" as a price injection rather than in-domain, likely due to sensitivity to negotiation-style phrasing.

Expert Integration vs. Self-Reminder

We compared our approach to the "self-reminder" technique [8] across different model providers.

Model        Baseline   Self-Reminder   Expert (Archias)
GPT-4        70.0%      70.0%           76.0%
Llama-3-8B   56.7%      58.7%           65.3%
Mistral-7B   56.0%      54.7%           68.7%
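For reference, the self-reminder baseline [8] amounts to wrapping the query in standing safety text, carrying no information about what the query actually is. A minimal sketch (the reminder wording is illustrative, not the exact text from [8]):

```python
def self_reminder_wrap(user_query: str) -> str:
    """Baseline defense: sandwich the query between safety reminders.
    Unlike the expert approach, this carries no per-query category
    signal, so cleverly phrased attacks can still slip through."""
    return (
        "You should be a responsible assistant and must not generate "
        "harmful or misleading content.\n"
        f"{user_query}\n"
        "Remember: you should be a responsible assistant."
    )
```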

V. Conclusion

Our research proposes a mechanism that combines the self-reminder technique with the ingestion of expert-model outputs into LLM prompts. Archias categorizes user queries, identifies irrelevant content, and provides confidence scores to minimize errors, yielding a clearer understanding of user intent and improved response quality.

We intend to test our automotive LLM with this technology in a production setting and generalize the approach to other retail sectors.


References (Abridged)

  1. J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018.
  2. L. Ouyang et al., "Training language models to follow instructions with human feedback," 2022.
  3. Y. Xie et al., "Defending ChatGPT against jailbreak attack via self-reminders," Nature Mach. Intell., vol. 5, no. 12, pp. 1486–1496, Dec. 2023.
  4. Impel Research Team, "Jailbreak Benchmark: Evaluating LLM Robustness," 2024.
