Architecture

How the QuilrAI LLM Gateway processes every request - from your application to the LLM provider and back.

Your Application

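from openai import OpenAI

# Point the OpenAI SDK at the gateway instead of the provider
client = OpenAI(
  base_url='https://guardrails-usa-2.quilr.ai/openai_compatible/',
  api_key='sk-quilr-xxx'  # placeholder gateway key
)
response = client.chat.completions.create(
  model='gpt-4o',
  messages=[{'role': 'user', 'content': 'Hello!'}]
)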

QuilrAI LLM Gateway

Validate

Identity & Auth: JWT / header validation; domain allowlist; per-user tracking

Rate Limits: req/min, hr, day limits; token budgets; key expiration

Scan

PII / PHI / PCI: contextual detection; block / redact / anonymize

Adversarial Detection: prompt injection; jailbreak detection; social engineering

Custom Intents: user-defined categories; example-trained classifier

Guardian Agent: dependency review; task adherence; monitor / nudge / block

Transform

Prompt Store: centralized prompts; combine refs + inline text; template variables; require store reference

Token Saving: JSON compression; HTML/MD to text; input-only, text compression

Route

Request Routing: weighted load balancing; automatic failover; multi-provider groups

Logging · Cost Tracking · Analytics · Red Team Testing

LLM Providers

OpenAI · Anthropic · Azure OpenAI · AWS Bedrock · Vertex AI · Custom Endpoints

Pipeline Stages

Every API request flows through these stages in order. Each stage is independently configurable per API key from the dashboard.
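As a rough illustration of ordered stages with per-key settings, the sketch below models the pipeline as a fixed list and filters it by what a key has switched on. The stage names are those listed on this page; `KeyConfig`, its `enabled` set, and `run_pipeline` are hypothetical names used only for illustration, and the assumption that a stage can be switched off entirely is ours, not a statement about the gateway's actual configuration schema.

```python
from dataclasses import dataclass, field

# Stage names as listed on this page, in processing order.
STAGES = [
    "Identity & Auth",
    "Rate Limits",
    "Security Guardrails",
    "Custom Intents",
    "Guardian Agent",
    "Prompt Store",
    "Token Saving",
    "Request Routing",
]


@dataclass
class KeyConfig:
    """Hypothetical per-API-key settings: which stages are switched on."""
    enabled: set = field(default_factory=lambda: set(STAGES))


def run_pipeline(config: KeyConfig) -> list:
    """Return the stages that would run for this key, preserving pipeline order."""
    return [stage for stage in STAGES if stage in config.enabled]


# Example: a key with only auth, guardrails, and routing enabled.
minimal = KeyConfig(enabled={"Request Routing", "Identity & Auth", "Security Guardrails"})
print(run_pipeline(minimal))
```

Whatever order the settings are supplied in, the stages still run in pipeline order, which is the property the paragraph above describes.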

Stage	Description	Details
Identity & Auth	Validates request identity via JWT, JWKS, or header. Enforces domain restrictions.	Identity Aware →
Rate Limits	Enforces request rates, token budgets, and key expiration before reaching the provider.	Rate Limits →
Security Guardrails	Detects PII, PHI, PCI, and financial data. Catches prompt injection, jailbreak, and social engineering.	Security Guardrails →
Custom Intents	User-defined detection categories trained with positive and negative examples.	Custom Intents →
Guardian Agent	Adds dependency-safety guidance, reviews generated dependency output, and keeps agent requests aligned to the system prompt.	Guardian Agent →
Prompt Store	Resolves one or more centralized system prompts by ID, allows inline instructions alongside references, and substitutes template variables.	Prompt Store →
Token Saving	Compresses input tokens - JSON to TOON, HTML/Markdown to plain text, and verbose prose compression. Responses unchanged.	Token Saving →
Request Routing	Routes to the optimal provider using weighted load balancing with automatic failover.	Request Routing →
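The Request Routing row mentions weighted load balancing with automatic failover. One common way to combine the two, shown here purely as a sketch and not as the gateway's actual algorithm, is to draw providers in weighted random order and move to the next one when a call fails. The function names, the provider labels, and the weights are all hypothetical.

```python
import random


def pick_order(weights: dict, rng: random.Random) -> list:
    """Return providers in try-order: weighted random draws without replacement."""
    remaining = dict(weights)
    order = []
    while remaining:
        names = list(remaining)
        choice = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        order.append(choice)
        del remaining[choice]
    return order


def route(weights: dict, call, rng=None):
    """Try providers in weighted order; fail over to the next one on error."""
    rng = rng or random.Random()
    last_error = None
    for provider in pick_order(weights, rng):
        try:
            return provider, call(provider)
        except Exception as exc:  # a real router would catch narrower errors
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


def fake_call(provider: str) -> str:
    """Stand-in for a provider request; 'primary' is simulated as unavailable."""
    if provider == "primary":
        raise ConnectionError("primary is down")
    return f"response from {provider}"


provider, result = route({"primary": 80, "backup": 20}, fake_call, random.Random(0))
print(provider, result)
```

Because `primary` always fails in this toy setup, the call ends up on `backup` regardless of which provider is drawn first.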

Response Path

Responses from the LLM provider pass back through the security guardrails for output scanning before being returned to your application. The same detection categories and configurable actions (block, redact, anonymize, monitor) apply to both requests and responses. When Guardian Agent coding helpers are enabled, non-streaming responses can also be reviewed for dependency vulnerabilities and outdated exact pins before final delivery.
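To make the four actions concrete, here is a minimal sketch that applies one action to a piece of text, whether that text is a request prompt or a provider response. It uses a simple email regex as a stand-in detector; the regex, the placeholder formats, and the exact meaning assigned to each action are assumptions for illustration, not the gateway's real detection logic.

```python
import re

# Toy detector: a simple email pattern standing in for a real PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def apply_action(text: str, action: str):
    """Return (text_to_forward, blocked) for one of the four actions named above."""
    if not EMAIL.search(text):
        return text, False
    if action == "block":
        return "", True
    if action == "redact":
        return EMAIL.sub("[REDACTED]", text), False
    if action == "anonymize":
        # Replace each distinct value with a stable placeholder.
        seen = {}

        def placeholder(match):
            return seen.setdefault(match.group(), f"<EMAIL_{len(seen) + 1}>")

        return EMAIL.sub(placeholder, text), False
    if action == "monitor":
        # Pass through unchanged; the detection would only be logged.
        return text, False
    raise ValueError(f"unknown action: {action}")


sample = "Mail alice@example.com and bob@example.org, then alice@example.com again"
for name in ("block", "redact", "anonymize", "monitor"):
    print(name, apply_action(sample, name))
```

The same function can be called once on the outgoing prompt and once on the returned completion, mirroring how the same categories and actions apply in both directions.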

The full request -> scan -> forward -> scan -> return pipeline applies to non-streaming chat completions (including provider-native models reached through OpenAI-compatible translations such as Bedrock Converse, Vertex AI Gemini generateContent, and Anthropic Messages), native Anthropic Messages, AWS Bedrock Runtime boto3 converse and supported invoke_model calls, native Vertex/Gemini generateContent, and the OpenAI Responses API. For streaming responses (SSE), request-side scanning runs as usual but response-side scanning is skipped, so chunks pass straight through; request-side prediction results are still logged. AWS Bedrock Runtime converse_stream follows the same request-scan / response-passthrough pattern for AWS EventStream responses. Realtime websocket sessions are currently a raw passthrough: neither request-side nor response-side DLP runs on live Realtime events, though session-level logs are still recorded.
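The coverage described above can be summarized as a small lookup table. The sketch below encodes it and adds a helper a client might use to decide whether to stream when output scanning is required. The traffic-type labels and both function names are hypothetical; only the True/False values are taken from the paragraph above.

```python
# (request_side_scan, response_side_scan) per traffic type, as described above.
SCAN_COVERAGE = {
    "non_streaming": (True, True),
    "sse_streaming": (True, False),
    "bedrock_converse_stream": (True, False),
    "realtime_websocket": (False, False),
}


def response_is_scanned(traffic_type: str) -> bool:
    """True if responses of this traffic type pass through output scanning."""
    return SCAN_COVERAGE[traffic_type][1]


def choose_stream(prefer_stream: bool, require_output_scan: bool) -> bool:
    """Client-side choice: stream only when output scanning is not required."""
    return prefer_stream and not require_output_scan


print(response_is_scanned("non_streaming"))
print(choose_stream(prefer_stream=True, require_output_scan=True))
```

In practice this means an application that depends on response-side guardrails would send non-streaming requests, at the cost of waiting for the full completion.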

Copilot Studio is different from LLM proxy routes: Copilot calls QuilrAI before tool execution, QuilrAI scans the user context and proposed tool input values, and the response is only an allow/block decision.
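The pre-tool-execution check can be pictured as a function that looks at the user context plus every proposed tool input value and returns a single decision. The sketch below is a hypothetical local model of that flow, not the actual Copilot Studio integration contract; `pre_tool_decision`, its parameters, and the toy detector are assumptions for illustration.

```python
from typing import Callable, Mapping


def pre_tool_decision(
    user_context: str,
    tool_inputs: Mapping[str, object],
    is_violation: Callable[[str], bool],
) -> str:
    """Scan the user context and each proposed tool input value; return 'allow' or 'block'."""
    candidates = [user_context] + [str(value) for value in tool_inputs.values()]
    return "block" if any(is_violation(text) for text in candidates) else "allow"


def toy_detector(text: str) -> bool:
    """Toy detector: flags any text that mentions a card number field."""
    return "card_number" in text.lower()


print(pre_tool_decision("Book a flight", {"destination": "Oslo"}, toy_detector))
print(pre_tool_decision("Pay the invoice", {"note": "card_number=4111..."}, toy_detector))
```

Unlike the proxy routes, nothing is rewritten or forwarded here: the only output is the allow/block decision.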

Observability

Every request is logged with cost, latency, token counts, and guardrail actions. Use the Logs tab to review request history, the LLM Gateway Log Export API to export logs programmatically, and the Red Team Testing tool to validate your guardrail configuration against adversarial prompts.
