
Architecture

How the QuilrAI LLM Gateway processes every request, from your application to the LLM provider and back.

Your Application
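from openai import OpenAI

# Point the OpenAI SDK client at the gateway's OpenAI-compatible endpoint.
client = OpenAI(
  base_url='https://guardrails-usa-2.quilr.ai/openai_compatible/',
  api_key='sk-quilr-xxx'
)
# Requests are written exactly as they would be for the provider directly.
client.chat.completions.create(
  model='gpt-4o',
  messages=[{'role': 'user', 'content': 'Hello!'}]
)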

QuilrAI LLM Gateway

Validate
- Identity & Auth: JWT / header validation, domain allowlist, per-user tracking
- Rate Limits: request limits per minute, hour, and day; token budgets; key expiration

Scan
- PII / PHI / PCI: contextual detection; block / redact / anonymize
- Adversarial Detection: prompt injection, jailbreak detection, social engineering
- Custom Intents: user-defined categories, example-trained classifier
- Guardian Agent: dependency review, task adherence; monitor / nudge / block

Transform
- Prompt Store: centralized prompts, template variables, enforce prompt-only mode
- Token Saving: JSON compression, HTML/MD stripping; input-only, same accuracy

Route
- Request Routing: weighted load balancing, automatic failover, multi-provider groups

Logging · Cost Tracking · Analytics · Red Team Testing

LLM Providers
OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Custom Endpoints
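
The Route stage in the diagram pairs weighted load balancing with automatic failover. As a rough illustration of that idea only (the gateway's actual selection logic is not described on this page), the sketch below draws a weighted random order over a provider group and falls through to the next provider when a call fails. `Provider`, `pick_order`, `route`, and the weights are hypothetical names and values, and weights are assumed to be positive.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Provider:
    name: str
    weight: float                # assumed positive; higher means picked first more often
    send: Callable[[str], str]   # hypothetical: forwards a prompt, may raise on failure


def pick_order(providers: List[Provider], rng: random.Random) -> List[Provider]:
    """Weighted random order without replacement: primary first, fallbacks after."""
    remaining = list(providers)
    order: List[Provider] = []
    while remaining:
        choice = rng.choices(remaining, weights=[p.weight for p in remaining], k=1)[0]
        order.append(choice)
        remaining.remove(choice)
    return order


def route(prompt: str, providers: List[Provider],
          rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Try providers in weighted order; return (provider name, reply) from the first success."""
    rng = rng or random.Random()
    last_error: Optional[Exception] = None
    for provider in pick_order(providers, rng):
        try:
            return provider.name, provider.send(prompt)
        except Exception as exc:  # any failure triggers failover to the next provider
            last_error = exc
    raise RuntimeError("all providers in the group failed") from last_error
```

Drawing the whole order up front keeps the failover path deterministic for a given random seed, which makes this kind of logic easier to test.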

Pipeline Stages

Every API request flows through these stages in order. Each stage is independently configurable per API key from the dashboard.

| Stage | Description | Details |
| --- | --- | --- |
| Identity & Auth | Validates request identity via JWT, JWKS, or header. Enforces domain restrictions. | Identity Aware → |
| Rate Limits | Enforces request rates, token budgets, and key expiration before reaching the provider. | Rate Limits → |
| Security Guardrails | Detects PII, PHI, PCI, and financial data. Catches prompt injection, jailbreak, and social engineering. | Security Guardrails → |
| Custom Intents | User-defined detection categories trained with positive and negative examples. | Custom Intents → |
| Guardian Agent | Adds dependency-safety guidance, reviews generated dependency output, and keeps agent requests aligned to the system prompt. | Guardian Agent → |
| Prompt Store | Resolves centralized system prompts by ID with template variable substitution. | Prompt Store → |
| Token Saving | Compresses input tokens: JSON to TOON, HTML/Markdown to plain text. Responses unchanged. | Token Saving → |
| Request Routing | Routes to the optimal provider using weighted load balancing with automatic failover. | Request Routing → |
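
To make the ordering concrete, here is a minimal sketch of an ordered pipeline in which each stage can be switched off per API key. It is an illustration under stated assumptions, not the gateway's implementation: the three toy stage functions, the `KEY_CONFIG` dictionary, the `sk-example` key, and the `provider-a` name are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, object]
Stage = Tuple[str, Callable[[Request], Request]]


class Blocked(Exception):
    """Raised by a stage to stop the request before it reaches the provider."""


def identity_auth(req: Request) -> Request:
    if not req.get("api_key"):
        raise Blocked("identity_auth: missing API key")
    return req


def token_saving(req: Request) -> Request:
    # Toy stand-in for input compression: collapse runs of whitespace.
    return {**req, "prompt": " ".join(str(req["prompt"]).split())}


def request_routing(req: Request) -> Request:
    return {**req, "provider": "provider-a"}  # hypothetical provider name


# Stages run in list order, mirroring the top-to-bottom order of the table.
PIPELINE: List[Stage] = [
    ("identity_auth", identity_auth),
    ("token_saving", token_saving),
    ("request_routing", request_routing),
]

# Hypothetical per-key settings; a stage missing from the dict defaults to enabled.
KEY_CONFIG: Dict[str, Dict[str, bool]] = {
    "sk-example": {"token_saving": False},
}


def run_pipeline(req: Request) -> Request:
    enabled = KEY_CONFIG.get(str(req.get("api_key")), {})
    for name, stage in PIPELINE:
        if enabled.get(name, True):
            req = stage(req)
    return req
```

Keeping the stage list separate from the per-key settings is one simple way to get a fixed order with independent on/off switches.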

Response Path

Responses from the LLM provider pass back through the security guardrails for output scanning before being returned to your application. The same detection categories and configurable actions (block, redact, anonymize, monitor) apply to both requests and responses. When Guardian Agent coding helpers are enabled, non-streaming responses can also be reviewed for dependency vulnerabilities and outdated exact pins before final delivery.
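
The symmetry between request and response scanning can be sketched as one function applied on both sides of the provider call. This is a toy example: the email regex stands in for the real detectors, and the behaviour shown for each action (in particular the difference between redact and anonymize) is an assumption made for illustration. `apply_action` and `guarded_call` are hypothetical names.

```python
import re
from typing import Callable, Dict, List, Tuple

# Toy detector: a simple email pattern standing in for PII detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class Blocked(Exception):
    """Raised when the configured action is 'block' and something was detected."""


def apply_action(text: str, action: str) -> Tuple[str, List[str]]:
    """Return (possibly rewritten text, findings) for one configured action."""
    findings = EMAIL.findall(text)
    if not findings or action == "monitor":
        return text, findings                      # monitor: record only, text unchanged
    if action == "block":
        raise Blocked(f"{len(findings)} finding(s)")
    if action == "redact":
        return EMAIL.sub("[REDACTED]", text), findings
    if action == "anonymize":
        # Assumed meaning: a stable placeholder per distinct value.
        mapping: Dict[str, str] = {}

        def repl(match: "re.Match[str]") -> str:
            return mapping.setdefault(match.group(0), f"<EMAIL_{len(mapping) + 1}>")

        return EMAIL.sub(repl, text), findings
    raise ValueError(f"unknown action: {action}")


def guarded_call(prompt: str, llm: Callable[[str], str], action: str) -> str:
    safe_prompt, _ = apply_action(prompt, action)   # request-side scan
    reply = llm(safe_prompt)                        # forward to the provider
    safe_reply, _ = apply_action(reply, action)     # response-side scan
    return safe_reply
```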

Non-streaming chat completions (including Bedrock models reached through OpenAI-compatible chat via Converse), Anthropic Messages, AWS Bedrock Runtime boto3 converse and supported invoke_model calls, Vertex/Gemini generateContent, and the OpenAI Responses API all follow the full request → scan → forward → scan → return pipeline.

For streaming responses (SSE), request-side scanning runs as usual, but response-side scanning is skipped so chunks pass straight through; request-side prediction results are still logged. AWS Bedrock Runtime converse_stream follows the same request-scan / response-passthrough pattern for AWS EventStream responses.

Realtime websocket sessions are a raw passthrough today: neither request-side nor response-side DLP runs on live Realtime events, though session-level logs are still recorded.
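
The coverage rules above can be summarized as a small lookup table, which may be handy when deciding whether an application needs its own output checks. The mode labels and the helper name are hypothetical; only the true/false values come from the description above.

```python
from typing import Dict, NamedTuple


class ScanCoverage(NamedTuple):
    request_scan: bool
    response_scan: bool


# Mode labels are illustrative, not identifiers used by the gateway.
COVERAGE: Dict[str, ScanCoverage] = {
    "non_streaming": ScanCoverage(request_scan=True, response_scan=True),
    "sse_streaming": ScanCoverage(request_scan=True, response_scan=False),
    "bedrock_converse_stream": ScanCoverage(request_scan=True, response_scan=False),
    "realtime_websocket": ScanCoverage(request_scan=False, response_scan=False),
}


def needs_own_output_check(mode: str) -> bool:
    """True when the gateway does not scan responses for this mode."""
    return not COVERAGE[mode].response_scan
```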

Copilot Studio is different from LLM proxy routes: Copilot calls QuilrAI before tool execution, QuilrAI scans the user context and proposed tool input values, and the response is only an allow/block decision.
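
The shape of that pre-tool check can be sketched as a function that scans the user context and each proposed tool input and returns only a decision. The payload shape, the `Decision` type, the `pre_tool_check` name, and the SSN-style regex used as a stand-in detector are all assumptions for illustration; the real request and response formats are not described on this page.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Toy detector: a US SSN-like pattern standing in for the real scanners.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Decision:
    allow: bool
    reasons: List[str]  # which inputs triggered the block, if any


def pre_tool_check(user_context: str, tool_inputs: Dict[str, object]) -> Decision:
    """Scan the user context and proposed tool input values before the tool runs."""
    reasons: List[str] = []
    if SSN.search(user_context):
        reasons.append("user_context")
    for name, value in tool_inputs.items():
        if SSN.search(str(value)):
            reasons.append(f"tool_input:{name}")
    return Decision(allow=not reasons, reasons=reasons)
```

Unlike the proxy routes, nothing is rewritten or forwarded here: the caller receives a decision and chooses whether to run the tool.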

Observability

Every request is logged with cost, latency, token counts, and guardrail actions. Use the Logs tab to review request history, the LLM Gateway Log Export API to export logs programmatically, and the Red Team Testing tool to validate your guardrail configuration against adversarial prompts.
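
Once logs have been exported, a summary over the logged fields is straightforward to compute. The sketch below assumes a record shape with cost, latency, token counts, and guardrail actions; the `LogRecord` field names and the `summarize` helper are hypothetical and do not reflect the Log Export API's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogRecord:
    # Hypothetical field names; check the export schema before relying on them.
    cost_usd: float
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    guardrail_actions: List[str] = field(default_factory=list)


def summarize(records: List[LogRecord]) -> Dict[str, object]:
    """Aggregate request count, cost, latency, tokens, and guardrail action counts."""
    n = len(records)
    actions = Counter(a for r in records for a in r.guardrail_actions)
    return {
        "requests": n,
        "total_cost_usd": round(sum(r.cost_usd for r in records), 6),
        "avg_latency_ms": sum(r.latency_ms for r in records) / n if n else 0.0,
        "total_tokens": sum(r.prompt_tokens + r.completion_tokens for r in records),
        "actions": dict(actions),
    }
```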