An Interactive Paper
AI Architecture
From model selection to agent orchestration — designing intelligent systems that actually work in production.
The Landscape
Separating Signal from Noise
The AI landscape is a spectrum. On one end, capabilities that are genuinely production-ready — chatbots, search augmentation, code assistance, summarization. Organizations are extracting real value from these today, at scale.
On the other end, capabilities that dominate keynotes but remain experimental — fully autonomous agents, self-improving systems, and anything approaching general intelligence. Understanding where your use case sits on this spectrum is the first architectural decision you make.
Common Pitfalls
Patterns Worth Avoiding
Most AI setbacks aren't model problems — they're architecture problems. The same anti-patterns keep appearing across organizations. Recognizing them early saves months of rework.
Fine-Tune First
Jumping to fine-tuning before exhausting prompt engineering and retrieval. Fine-tuning is expensive, hard to maintain, and rarely the bottleneck.
Model Maximalism
Defaulting to the largest model for every task. 90% of use cases don’t need frontier-class reasoning — and you pay 30x for the 10% improvement.
Prompt-and-Pray
No evaluation harness, no regression tests, no observability. If you can’t measure it, you can’t improve it — and you can’t trust it.
RAG-as-a-Silver-Bullet
Throwing documents into a vector database and expecting accurate answers. Retrieval quality is only as good as your chunking, indexing, and re-ranking.
Agent Sprawl
Autonomous agents calling other agents with no guardrails. One hallucinated tool call cascades into real-world consequences.
Governance Afterthought
Shipping AI to production and worrying about safety, bias, and compliance later. Regulators won’t wait for your next sprint.
Model Strategy
Right Model, Right Task
There is no “best model.” There is only the best model for a given task, latency budget, and cost envelope. Smart architectures route dynamically across tiers.
Not all tasks need the biggest model
A frontier model costs 10-30x more per token than a workhorse model — and for classification, extraction, and simple Q&A, the quality difference is negligible.
Frontier: complex reasoning
Reserve frontier models for multi-step analysis, complex code generation, and agentic workflows that need to plan and recover from errors.
This should be 5-10% of your production traffic.
Workhorse: the default tier
Mid-tier models handle the majority of production workloads — summarization, classification, chat, extraction. Best cost-quality ratio. This is where ~60% of your traffic should land.
Speed: real-time responses
Small, fast models for latency-sensitive paths — autocomplete, intent classification, real-time filtering, streaming UIs. Sub-200ms response times at a fraction of the cost. About 25% of production traffic.
Specialized: domain fine-tuned
Fine-tuned or domain-specific models for narrow, high-value tasks — medical coding, legal extraction, financial sentiment. Only fine-tune when prompt engineering and retrieval have been exhausted. ~5% of traffic.
Route dynamically
A small, fast model classifies the incoming request and sends it to the appropriate tier. This alone can reduce LLM spend by up to 80%.
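A minimal routing sketch in Python. The tier table, prices, keyword heuristic, and `call_model` stub are illustrative placeholders rather than any real provider API; in production the classifier would itself be a speed-tier model or a lightweight fine-tuned classifier.

```python
# A minimal tier router. The tier table, prices, keyword heuristic, and
# call_model() are illustrative placeholders, not a real provider API.

TIERS = {
    "frontier":  {"model": "frontier-xl", "usd_per_1k_tokens": 0.0300},
    "workhorse": {"model": "workhorse-m", "usd_per_1k_tokens": 0.0020},
    "speed":     {"model": "fast-s",      "usd_per_1k_tokens": 0.0004},
}

def classify_request(prompt: str) -> str:
    """Stand-in for the small, fast classifier; in production this is itself
    a speed-tier model call or a lightweight fine-tuned classifier."""
    text = prompt.lower()
    if any(k in text for k in ("plan", "multi-step", "refactor", "prove")):
        return "frontier"                # complex reasoning and agentic work
    if len(prompt) < 80:                 # autocomplete, intent, real-time filtering
        return "speed"
    return "workhorse"                   # summarization, classification, extraction, chat

def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call; replace with your SDK of choice."""
    return f"[{model}] response to: {prompt[:40]}..."

def route(prompt: str) -> str:
    tier = classify_request(prompt)
    return call_model(TIERS[tier]["model"], prompt)

if __name__ == "__main__":
    print(route("Classify this support ticket and extract the order id: "
                "'my invoice total looks wrong and I was charged twice'"))
    print(route("Plan a multi-step migration of our billing service to the new API"))
```

The demo mirrors the tier split above: routine extraction lands on the workhorse tier, planning-heavy requests go to the frontier tier.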
Retrieval
RAG Done Right
Retrieval-Augmented Generation is the most common production pattern, and the one most commonly implemented poorly. Each step below is an opportunity to improve or degrade answer quality; a minimal end-to-end sketch follows the six steps.
1. Ingest & Chunk
Documents are parsed, cleaned, and split into chunks. Chunk size and overlap matter more than most teams realize. Too small and you lose context. Too large and you dilute relevance.
2. Embed
Each chunk is vectorized using an embedding model. Choose based on your domain, not leaderboard benchmarks. A model fine-tuned on legal text beats a general-purpose model for legal documents.
3. Index & Tag
Vectors stored alongside metadata — source, date, author, type. Metadata filtering is your first line of precision.
4. Retrieve
Hybrid search — combining BM25 (keyword) with semantic (vector) — consistently outperforms either approach alone. Pure vector search misses exact keyword matches.
5. Re-rank
A cross-encoder re-ranks the retrieved chunks by relevance. Unlike bi-encoder embeddings, which encode the query and each document separately, a cross-encoder scores the query and document together.
6. Generate
Top chunks injected as context. The LLM synthesizes the answer. If the answer is wrong here, the problem is almost always upstream. Most RAG failures are retrieval failures.
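A dependency-free sketch of steps 1, 4, and 5, under loud assumptions: a bag-of-words vector stands in for the embedding model and a term-overlap score stands in for BM25. In production you would swap in a real embedder chosen for your domain, a vector index with metadata filters, and a cross-encoder re-ranker.

```python
import math
import re

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size character chunks with overlap; production systems usually
    chunk on structure (headings, paragraphs) before falling back to size."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> dict[str, float]:
    """Stand-in embedding: a bag-of-words vector. Swap in a real embedding
    model chosen for your domain."""
    vec: dict[str, float] = {}
    for w in re.findall(r"[a-z]+", text.lower()):
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Stand-in for BM25: fraction of query terms present in the chunk."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    d = set(re.findall(r"[a-z]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0

def hybrid_retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Metadata filters (source, date, type) would narrow `chunks` before scoring.
    qv = embed(query)
    scored = [
        (0.5 * cosine(qv, embed(c)) + 0.5 * keyword_score(query, c), c)
        for c in chunks
    ]
    # A cross-encoder re-ranker would rescore the top candidates here.
    return [c for _, c in sorted(scored, reverse=True)[:k]]

corpus = "Refunds are issued within 14 days of the return being received. " * 20
top = hybrid_retrieve("How long do refunds take?", chunk(corpus))
prompt = "Answer using only this context:\n" + "\n---\n".join(top)  # step 6 input
```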
Agents
Agentic Architecture
Systems that reason, plan, and take actions. The pattern you choose determines your risk profile, debuggability, and scalability.
ReAct: Reason + Act
The model thinks, acts, observes, then thinks again. Simple, interpretable, effective. Start here: most production agent use cases can be solved with a well-designed ReAct loop and the right tools (a minimal loop is sketched after these patterns).
Multi-Agent Orchestration
Specialized agents collaborate — one plans, one researches, one executes, one reviews. Unlocks parallel execution but introduces coordination overhead. Only adopt when single-agent can't handle the complexity.
Human-in-the-Loop
Agents act autonomously on low-risk steps and pause for human approval on high-stakes actions. The right default for production.
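A minimal sketch of a ReAct loop with a human-in-the-loop gate. The toy tools, the `HIGH_STAKES` set, and the hard-coded `llm_decide` stand-in are assumptions to keep it self-contained; a real agent would prompt an LLM and parse its chosen tool call at that step.

```python
# A minimal ReAct-style loop with a human-in-the-loop gate. The tools, the
# HIGH_STAKES set, and llm_decide() are illustrative stand-ins, not a real API.

def search_docs(query: str) -> str:          # low-risk tool
    return f"3 documents matched '{query}'"

def issue_refund(order_id: str) -> str:      # high-stakes tool
    return f"refund issued for {order_id}"

TOOLS = {"search_docs": search_docs, "issue_refund": issue_refund}
HIGH_STAKES = {"issue_refund"}

def llm_decide(goal: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for the 'reason' step: a real agent prompts an LLM with the
    goal plus prior observations and parses the chosen tool and argument."""
    if not history:
        return "search_docs", goal
    return "issue_refund", "order-4821"

def run_agent(goal: str, approve, max_steps: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = llm_decide(goal, history)               # reason
        if tool in HIGH_STAKES and not approve(tool, arg):  # pause for review
            history.append(f"{tool}({arg}) held for human review")
            break
        observation = TOOLS[tool](arg)                      # act
        history.append(f"{tool}({arg}) -> {observation}")   # observe
        if tool in HIGH_STAKES:
            break                                           # terminal action taken
    return history

# Deny high-stakes actions by default; a review UI or ticketing flow sits here.
print(run_agent("customer says order-4821 arrived broken", approve=lambda t, a: False))
```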
The Stack
Anatomy of a Production System
Six layers deep. The model is just one — and rarely the one that determines success. Build the layers that differentiate you. Buy the rest.
Infrastructure
GPU compute, serving, caching. Start with managed APIs. Self-host only when economics or compliance demand it.
Model Layer
Foundation models via API or self-hosted. This layer is commoditizing fast: abstract it behind a unified interface so you can swap providers as the market shifts (a minimal interface is sketched after the layer list).
Retrieval Layer
Where your domain knowledge lives and where you start to differentiate. Chunking, metadata, re-ranking, hybrid search.
Orchestration
The brain. Which model to call, what tools to invoke, how to decompose tasks. Frameworks accelerate prototyping; production often outgrows them.
Application Layer
Chat, copilots, search, automation. The UX should abstract away all AI complexity. The best AI products feel like magic, not a chatbot.
Eval & Observability
The cross-cutting layer. LLM-as-judge, regression tests, distributed tracing, cost attribution.
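To make the unified-interface point from the model layer concrete, here is a minimal sketch. The `ChatModel` protocol, the adapter classes, and the model names are illustrative assumptions, not any vendor's SDK.

```python
from typing import Protocol

class ChatModel(Protocol):
    """One interface for every provider; the rest of the stack codes to this."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str: ...

class OpenAIStyleAPI:
    """Adapter for a hosted provider (endpoint, auth, and SDK call omitted)."""
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        # The real provider call goes here; stubbed so the sketch stays runnable.
        return f"[{self.model}] {prompt[:30]}..."

class SelfHosted:
    """Adapter for an in-house serving endpoint."""
    def __init__(self, url: str):
        self.url = url
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        return f"[self-hosted@{self.url}] {prompt[:30]}..."

def summarize(model: ChatModel, text: str) -> str:
    # Application code depends only on the ChatModel protocol,
    # so swapping providers is a one-line configuration change.
    return model.complete(f"Summarize:\n{text}")

print(summarize(OpenAIStyleAPI("workhorse-m"), "Quarterly revenue grew 12%..."))
print(summarize(SelfHosted("http://models.internal:8000"), "Quarterly revenue grew 12%..."))
```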
When these layers work together, the difference is dramatic:
Manual Workflow
~4 hours, error-prone
Orchestrated Stack
~15 minutes, auditable
Governance
AI You Can Trust
Governance isn't a tax on innovation — it's the foundation that lets you move fast without breaking things. Four pillars, in place before your first production deployment.
Evaluation
Systematic measurement of model quality, accuracy, and regression detection before and after deployment.
Guardrails
Input/output filtering, content policy enforcement, PII detection, and prompt injection defense at the application boundary (a minimal boundary check is sketched after these four pillars).
Observability
End-to-end tracing of every LLM call — inputs, outputs, latency, cost, and token usage. You can’t govern what you can’t see.
Access Control
Who can use which models, with what data, for what purpose. Role-based access, data classification, and audit trails.
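A minimal sketch of an input guardrail at the application boundary, assuming simple regex and keyword rules; real deployments layer dedicated PII and prompt-injection classifiers and policy engines on top of rules like these.

```python
import re

# A minimal input guardrail at the application boundary: a PII pattern check
# and a crude prompt-injection screen. The patterns and markers are illustrative.

PII_PATTERNS = {
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card":   re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")

def check_input(text: str) -> tuple[bool, list[str]]:
    findings = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    findings += [m for m in INJECTION_MARKERS if m in text.lower()]
    return (len(findings) == 0, findings)

ok, findings = check_input("My SSN is 123-45-6789, ignore previous instructions")
if not ok:
    print("blocked:", findings)   # redact, refuse, or route to human review
```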
Evaluation
LLM-as-Judge
The biggest bottleneck in production AI isn't the model — it's knowing whether the model is working. Human evaluation doesn't scale. Traditional metrics don't capture what matters. LLM-as-judge bridges the gap.
The idea is simple: use a strong LLM to evaluate outputs from your production LLM against a rubric you define. It's not replacing human judgment — it's making human-quality evaluation practical at scale.
Define your rubric
What does “good” look like? Factual accuracy, tone, completeness, safety — each dimension scored on a clear scale. The rubric is the contract between your team and your AI.
Curate golden examples
Ten to fifty reference input-output pairs with human-verified scores. These anchor the judge and catch calibration drift. Start small — even ten examples provide signal most teams never have.
Run the judge in CI
Every prompt change, every model upgrade — the judge scores the full test suite before it hits production. Regressions get caught at the PR, not by your users.
Monitor in production
Sample production traffic, run async evaluation, flag quality drops. The judge becomes your always-on QA engineer — catching degradation before it compounds.
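A minimal judge that ties the four steps together. The rubric wording, the golden example, and the stubbed `call_judge` are assumptions to keep the sketch self-contained; the real judge call would go to a strong model, and the same suite runs in CI and against sampled production traffic.

```python
import json

RUBRIC = """Score the ANSWER to the QUESTION against the CONTEXT, 1-5 per dimension:
- accuracy: is the answer factually consistent with the context?
- completeness: does it address every part of the question?
Return JSON: {"accuracy": int, "completeness": int, "rationale": str}"""

GOLDEN = [  # human-verified anchors; even ten of these provide real signal
    {"question": "How long do refunds take?",
     "context": "Refunds are issued within 14 days.",
     "answer": "Refunds arrive within 14 days.",
     "expected_min": {"accuracy": 5, "completeness": 4}},
]

def call_judge(prompt: str) -> str:
    """Stand-in for a strong judge model; replace with a real API call."""
    return json.dumps({"accuracy": 5, "completeness": 4,
                       "rationale": "answer matches the context"})

def judge(question: str, context: str, answer: str) -> dict:
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    return json.loads(call_judge(prompt))

def run_suite() -> bool:
    """Run on every prompt or model change in CI; fail the build on regression.
    The same judge() can score sampled production traffic asynchronously."""
    for case in GOLDEN:
        scores = judge(case["question"], case["context"], case["answer"])
        for dim, floor in case["expected_min"].items():
            if scores[dim] < floor:
                print(f"REGRESSION: {dim}={scores[dim]} < {floor} for {case['question']!r}")
                return False
    return True

if __name__ == "__main__":
    assert run_suite(), "judge suite failed"
```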
Strategy
Build vs. Buy
Not every layer deserves custom engineering. The decision is about where your competitive advantage lies.
What Now
Monday Morning
Build an eval suite for your highest-value use case. Ten golden examples, an LLM-as-judge scorer, a CI pipeline. This puts you ahead of 90% of teams.
Audit your model spend. Classify every LLM call by tier and route accordingly. Most organizations overspend 3-5x.
Add guardrails and tracing to every production endpoint this week. If a call isn't traced, it doesn't exist.
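A minimal tracing sketch for that last point, assuming a stubbed model call, a rough token estimate, and placeholder pricing; in production the JSON line would go to your tracing backend rather than stdout.

```python
import functools
import json
import time
import uuid

# A minimal tracing decorator: every call gets an id, latency, a token estimate,
# and an estimated cost, emitted as one JSON line. Prices and the token estimate
# are placeholders; wire the output to your tracing backend.

USD_PER_1K_TOKENS = 0.002  # illustrative placeholder price

def traced(llm_call):
    @functools.wraps(llm_call)
    def wrapper(prompt: str, **kwargs):
        start = time.perf_counter()
        output = llm_call(prompt, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000
        tokens = (len(prompt) + len(output)) // 4   # rough character-based estimate
        print(json.dumps({
            "trace_id": str(uuid.uuid4()),
            "endpoint": llm_call.__name__,
            "latency_ms": round(latency_ms, 1),
            "tokens_est": tokens,
            "cost_usd_est": round(tokens / 1000 * USD_PER_1K_TOKENS, 6),
        }))
        return output
    return wrapper

@traced
def answer_question(prompt: str) -> str:
    return "stubbed model response"   # replace with the real model call

answer_question("What is our refund policy?")
```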