An Interactive Paper
AI Architecture
From model selection to agent orchestration — designing intelligent systems that actually work in production.
The Landscape
Separating Signal from Noise
The AI landscape is a spectrum. On one end, capabilities that are genuinely production-ready — chatbots, search augmentation, code assistance, summarization. Organizations are extracting real value from these today, at scale.
On the other end, capabilities that dominate keynotes but remain experimental — fully autonomous agents, self-improving systems, and anything approaching general intelligence. Understanding where your use case sits on this spectrum is the first architectural decision you make.
Common Pitfalls
Patterns Worth Avoiding
Most AI setbacks aren't model problems — they're architecture problems. The same anti-patterns keep appearing across organizations. Recognizing them early saves months of rework.
Fine-Tune First
Jumping to fine-tuning before exhausting prompt engineering and retrieval. Fine-tuning is expensive, hard to maintain, and rarely the bottleneck.
Model Maximalism
Defaulting to the largest model for every task. 90% of use cases don’t need frontier-class reasoning — and you pay 30x for the 10% improvement.
Prompt-and-Pray
No evaluation harness, no regression tests, no observability. If you can’t measure it, you can’t improve it — and you can’t trust it.
RAG-as-a-Silver-Bullet
Throwing documents into a vector database and expecting accurate answers. Retrieval quality is only as good as your chunking, indexing, and re-ranking.
Agent Sprawl
Autonomous agents calling other agents with no guardrails. One hallucinated tool call cascades into real-world consequences.
Governance Afterthought
Shipping AI to production and worrying about safety, bias, and compliance later. Regulators won’t wait for your next sprint.
Model Strategy
Right Model, Right Task
There is no “best model.” There is only the best model for a given task, latency budget, and cost envelope. Smart architectures route dynamically across tiers.
Not all tasks need the biggest model
A frontier model costs 10-30x more per token than a workhorse model — and for classification, extraction, and simple Q&A, the quality difference is negligible.
Frontier: complex reasoning
Reserve frontier models for multi-step analysis, complex code generation, and agentic workflows that need to plan and recover from errors.
This should be 5-10% of your production traffic.
Workhorse: the default tier
Mid-tier models handle the majority of production workloads — summarization, classification, chat, extraction. Best cost-quality ratio. This is where ~60% of your traffic should land.
Speed: real-time responses
Small, fast models for latency-sensitive paths — autocomplete, intent classification, real-time filtering, streaming UIs. Sub-200ms response times at a fraction of the cost. About 25% of production traffic.
Specialized: domain fine-tuned
Fine-tuned or domain-specific models for narrow, high-value tasks — medical coding, legal extraction, financial sentiment. Only fine-tune when prompt engineering and retrieval have been exhausted. ~5% of traffic.
Route dynamically
A small, fast model classifies the incoming request and sends it to the appropriate tier. This alone can reduce LLM spend by up to 80%.
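A minimal routing sketch in Python. The tier table, prices, keyword heuristic, and `call_model` stub are illustrative placeholders rather than any real provider API; in production the classifier would itself be a speed-tier model or a lightweight fine-tuned classifier.

```python
# A minimal tier router. The tier table, prices, keyword heuristic, and
# call_model() are illustrative placeholders, not a real provider API.

TIERS = {
    "frontier":  {"model": "frontier-xl", "usd_per_1k_tokens": 0.0300},
    "workhorse": {"model": "workhorse-m", "usd_per_1k_tokens": 0.0020},
    "speed":     {"model": "fast-s",      "usd_per_1k_tokens": 0.0004},
}

def classify_request(prompt: str) -> str:
    """Stand-in for the small, fast classifier; in production this is itself
    a speed-tier model call or a lightweight fine-tuned classifier."""
    text = prompt.lower()
    if any(k in text for k in ("plan", "multi-step", "refactor", "prove")):
        return "frontier"                # complex reasoning and agentic work
    if len(prompt) < 80:                 # autocomplete, intent, real-time filtering
        return "speed"
    return "workhorse"                   # summarization, classification, extraction, chat

def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call; replace with your SDK of choice."""
    return f"[{model}] response to: {prompt[:40]}..."

def route(prompt: str) -> str:
    tier = classify_request(prompt)
    return call_model(TIERS[tier]["model"], prompt)

if __name__ == "__main__":
    print(route("Classify this support ticket and extract the order id: "
                "'my invoice total looks wrong and I was charged twice'"))
    print(route("Plan a multi-step migration of our billing service to the new API"))
```

The demo mirrors the tier split above: routine extraction lands on the workhorse tier, planning-heavy requests go to the frontier tier.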
Retrieval
RAG Done Right
Retrieval-Augmented Generation is the most common production pattern, and the one most commonly implemented poorly. Each step below is an opportunity to improve or degrade answer quality; a minimal end-to-end sketch follows the six steps.
1. Ingest & Chunk
Documents are parsed, cleaned, and split into chunks. Chunk size and overlap matter more than most teams realize. Too small and you lose context. Too large and you dilute relevance.
2. Embed
Each chunk is vectorized using an embedding model. Choose based on your domain, not leaderboard benchmarks. A model fine-tuned on legal text beats a general-purpose model for legal documents.
3. Index & Tag
Vectors stored alongside metadata — source, date, author, type. Metadata filtering is your first line of precision.
4. Retrieve
Hybrid search — combining BM25 (keyword) with semantic (vector) — consistently outperforms either approach alone. Pure vector search misses exact keyword matches.
5. Re-rank
A cross-encoder re-ranks the retrieved chunks by relevance. Unlike bi-encoder embeddings, which encode the query and each document separately, a cross-encoder scores the query and document together.
6. Generate
Top chunks injected as context. The LLM synthesizes the answer. If the answer is wrong here, the problem is almost always upstream. Most RAG failures are retrieval failures.
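A dependency-free sketch of steps 1, 4, and 5, under loud assumptions: a bag-of-words vector stands in for the embedding model and a term-overlap score stands in for BM25. In production you would swap in a real embedder chosen for your domain, a vector index with metadata filters, and a cross-encoder re-ranker.

```python
import math
import re

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Fixed-size character chunks with overlap; production systems usually
    chunk on structure (headings, paragraphs) before falling back to size."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> dict[str, float]:
    """Stand-in embedding: a bag-of-words vector. Swap in a real embedding
    model chosen for your domain."""
    vec: dict[str, float] = {}
    for w in re.findall(r"[a-z]+", text.lower()):
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Stand-in for BM25: fraction of query terms present in the chunk."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    d = set(re.findall(r"[a-z]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0

def hybrid_retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Metadata filters (source, date, type) would narrow `chunks` before scoring.
    qv = embed(query)
    scored = [
        (0.5 * cosine(qv, embed(c)) + 0.5 * keyword_score(query, c), c)
        for c in chunks
    ]
    # A cross-encoder re-ranker would rescore the top candidates here.
    return [c for _, c in sorted(scored, reverse=True)[:k]]

corpus = "Refunds are issued within 14 days of the return being received. " * 20
top = hybrid_retrieve("How long do refunds take?", chunk(corpus))
prompt = "Answer using only this context:\n" + "\n---\n".join(top)  # step 6 input
```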
Agents
Agentic Architecture
Systems that reason, plan, and take actions. The pattern you choose determines your risk profile, debuggability, and scalability.
ReAct: Reason + Act
The model thinks, acts, observes, then thinks again. Simple, interpretable, effective. Start here: most production agent use cases can be solved with a well-designed ReAct loop and the right tools (a minimal loop is sketched after these patterns).
Multi-Agent Orchestration
Specialized agents collaborate — one plans, one researches, one executes, one reviews. Unlocks parallel execution but introduces coordination overhead. Only adopt when single-agent can't handle the complexity.
Human-in-the-Loop
Agents act autonomously on low-risk steps and pause for human approval on high-stakes actions. The right default for production.
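A minimal sketch of a ReAct loop with a human-in-the-loop gate. The toy tools, the `HIGH_STAKES` set, and the hard-coded `llm_decide` stand-in are assumptions to keep it self-contained; a real agent would prompt an LLM and parse its chosen tool call at that step.

```python
# A minimal ReAct-style loop with a human-in-the-loop gate. The tools, the
# HIGH_STAKES set, and llm_decide() are illustrative stand-ins, not a real API.

def search_docs(query: str) -> str:          # low-risk tool
    return f"3 documents matched '{query}'"

def issue_refund(order_id: str) -> str:      # high-stakes tool
    return f"refund issued for {order_id}"

TOOLS = {"search_docs": search_docs, "issue_refund": issue_refund}
HIGH_STAKES = {"issue_refund"}

def llm_decide(goal: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for the 'reason' step: a real agent prompts an LLM with the
    goal plus prior observations and parses the chosen tool and argument."""
    if not history:
        return "search_docs", goal
    return "issue_refund", "order-4821"

def run_agent(goal: str, approve, max_steps: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = llm_decide(goal, history)               # reason
        if tool in HIGH_STAKES and not approve(tool, arg):  # pause for review
            history.append(f"{tool}({arg}) held for human review")
            break
        observation = TOOLS[tool](arg)                      # act
        history.append(f"{tool}({arg}) -> {observation}")   # observe
        if tool in HIGH_STAKES:
            break                                           # terminal action taken
    return history

# Deny high-stakes actions by default; a review UI or ticketing flow sits here.
print(run_agent("customer says order-4821 arrived broken", approve=lambda t, a: False))
```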
The Stack
Anatomy of a Production System
Six layers deep. The model is just one — and rarely the one that determines success. Build the layers that differentiate you. Buy the rest.
Infrastructure
GPU compute, serving, caching. Start with managed APIs. Self-host only when economics or compliance demand it.
Model Layer
Foundation models via API or self-hosted. This layer is commoditizing fast: abstract it behind a unified interface so you can swap providers as the market shifts (a minimal interface is sketched after the layer list).
Retrieval Layer
Where your domain knowledge lives and where you start to differentiate. Chunking, metadata, re-ranking, hybrid search.
Orchestration
The brain. Which model to call, what tools to invoke, how to decompose tasks. Frameworks accelerate prototyping; production often outgrows them.
Application Layer
Chat, copilots, search, automation. The UX should abstract away all AI complexity. The best AI products feel like magic, not a chatbot.
Eval & Observability
The cross-cutting layer. LLM-as-judge, regression tests, distributed tracing, cost attribution.
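To make the unified-interface point from the model layer concrete, here is a minimal sketch. The `ChatModel` protocol, the adapter classes, and the model names are illustrative assumptions, not any vendor's SDK.

```python
from typing import Protocol

class ChatModel(Protocol):
    """One interface for every provider; the rest of the stack codes to this."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str: ...

class OpenAIStyleAPI:
    """Adapter for a hosted provider (endpoint, auth, and SDK call omitted)."""
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        # The real provider call goes here; stubbed so the sketch stays runnable.
        return f"[{self.model}] {prompt[:30]}..."

class SelfHosted:
    """Adapter for an in-house serving endpoint."""
    def __init__(self, url: str):
        self.url = url
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        return f"[self-hosted@{self.url}] {prompt[:30]}..."

def summarize(model: ChatModel, text: str) -> str:
    # Application code depends only on the ChatModel protocol,
    # so swapping providers is a one-line configuration change.
    return model.complete(f"Summarize:\n{text}")

print(summarize(OpenAIStyleAPI("workhorse-m"), "Quarterly revenue grew 12%..."))
print(summarize(SelfHosted("http://models.internal:8000"), "Quarterly revenue grew 12%..."))
```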
When these layers work together, the difference is dramatic:
Manual Workflow
~4 hours, error-prone
Orchestrated Stack
~15 minutes, auditable
Governance
AI You Can Trust
Governance isn't a tax on innovation — it's the foundation that lets you move fast without breaking things. Four pillars, in place before your first production deployment.
Evaluation
Systematic measurement of model quality, accuracy, and regression detection before and after deployment.
Guardrails
Input/output filtering, content policy enforcement, PII detection, and prompt injection defense at the application boundary (a minimal boundary check is sketched after these four pillars).
Observability
End-to-end tracing of every LLM call — inputs, outputs, latency, cost, and token usage. You can’t govern what you can’t see.
Access Control
Who can use which models, with what data, for what purpose. Role-based access, data classification, and audit trails.
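A minimal sketch of an input guardrail at the application boundary, assuming simple regex and keyword rules; real deployments layer dedicated PII and prompt-injection classifiers and policy engines on top of rules like these.

```python
import re

# A minimal input guardrail at the application boundary: a PII pattern check
# and a crude prompt-injection screen. The patterns and markers are illustrative.

PII_PATTERNS = {
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card":   re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")

def check_input(text: str) -> tuple[bool, list[str]]:
    findings = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    findings += [m for m in INJECTION_MARKERS if m in text.lower()]
    return (len(findings) == 0, findings)

ok, findings = check_input("My SSN is 123-45-6789, ignore previous instructions")
if not ok:
    print("blocked:", findings)   # redact, refuse, or route to human review
```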
Evaluation
LLM-as-Judge
The biggest bottleneck in production AI isn't the model — it's knowing whether the model is working. Human evaluation doesn't scale. Traditional metrics don't capture what matters. LLM-as-judge bridges the gap.
The idea is simple: use a strong LLM to evaluate outputs from your production LLM against a rubric you define. It's not replacing human judgment — it's making human-quality evaluation practical at scale.
Define your rubric
What does “good” look like? Factual accuracy, tone, completeness, safety — each dimension scored on a clear scale. The rubric is the contract between your team and your AI.
Curate golden examples
Ten to fifty reference input-output pairs with human-verified scores. These anchor the judge and catch calibration drift. Start small — even ten examples provide signal most teams never have.
Run the judge in CI
Every prompt change, every model upgrade — the judge scores the full test suite before it hits production. Regressions get caught at the PR, not by your users.
Monitor in production
Sample production traffic, run async evaluation, flag quality drops. The judge becomes your always-on QA engineer — catching degradation before it compounds.
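A minimal judge that ties the four steps together. The rubric wording, the golden example, and the stubbed `call_judge` are assumptions to keep the sketch self-contained; the real judge call would go to a strong model, and the same suite runs in CI and against sampled production traffic.

```python
import json

RUBRIC = """Score the ANSWER to the QUESTION against the CONTEXT, 1-5 per dimension:
- accuracy: is the answer factually consistent with the context?
- completeness: does it address every part of the question?
Return JSON: {"accuracy": int, "completeness": int, "rationale": str}"""

GOLDEN = [  # human-verified anchors; even ten of these provide real signal
    {"question": "How long do refunds take?",
     "context": "Refunds are issued within 14 days.",
     "answer": "Refunds arrive within 14 days.",
     "expected_min": {"accuracy": 5, "completeness": 4}},
]

def call_judge(prompt: str) -> str:
    """Stand-in for a strong judge model; replace with a real API call."""
    return json.dumps({"accuracy": 5, "completeness": 4,
                       "rationale": "answer matches the context"})

def judge(question: str, context: str, answer: str) -> dict:
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    return json.loads(call_judge(prompt))

def run_suite() -> bool:
    """Run on every prompt or model change in CI; fail the build on regression.
    The same judge() can score sampled production traffic asynchronously."""
    for case in GOLDEN:
        scores = judge(case["question"], case["context"], case["answer"])
        for dim, floor in case["expected_min"].items():
            if scores[dim] < floor:
                print(f"REGRESSION: {dim}={scores[dim]} < {floor} for {case['question']!r}")
                return False
    return True

if __name__ == "__main__":
    assert run_suite(), "judge suite failed"
```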
Strategy
Build vs. Buy
Not every layer deserves custom engineering. The decision is about where your competitive advantage lies.
What Now
Monday Morning
Build an eval suite for your highest-value use case. Ten golden examples, an LLM-as-judge scorer, a CI pipeline. This puts you ahead of 90% of teams.
Audit your model spend. Classify every LLM call by tier and route accordingly. Most organizations overspend 3-5x.
Add guardrails and tracing to every production endpoint this week. If a call isn't traced, it doesn't exist.
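A minimal tracing sketch for that last point, assuming a stubbed model call, a rough token estimate, and placeholder pricing; in production the JSON line would go to your tracing backend rather than stdout.

```python
import functools
import json
import time
import uuid

# A minimal tracing decorator: every call gets an id, latency, a token estimate,
# and an estimated cost, emitted as one JSON line. Prices and the token estimate
# are placeholders; wire the output to your tracing backend.

USD_PER_1K_TOKENS = 0.002  # illustrative placeholder price

def traced(llm_call):
    @functools.wraps(llm_call)
    def wrapper(prompt: str, **kwargs):
        start = time.perf_counter()
        output = llm_call(prompt, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000
        tokens = (len(prompt) + len(output)) // 4   # rough character-based estimate
        print(json.dumps({
            "trace_id": str(uuid.uuid4()),
            "endpoint": llm_call.__name__,
            "latency_ms": round(latency_ms, 1),
            "tokens_est": tokens,
            "cost_usd_est": round(tokens / 1000 * USD_PER_1K_TOKENS, 6),
        }))
        return output
    return wrapper

@traced
def answer_question(prompt: str) -> str:
    return "stubbed model response"   # replace with the real model call

answer_question("What is our refund policy?")
```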