A proposal for deploying AI agents as containerized, single-role workers — each defined by one prompt, scoped to one job — composed into an agentic workforce that automates business operations.
Motivation
Every organization has work that follows a pattern: receive input, apply judgment, produce output. Invoice processing, customer onboarding, compliance review, content generation, data reconciliation — these are not engineering problems. They are judgment problems that currently require humans because software has historically been too brittle to handle ambiguity.
Large language models change this. An LLM can read an unstructured email, understand what's being asked, and draft a response. But a single LLM prompt in a notebook is not a deployment. The gap between "this works in a demo" and "this runs the accounts payable process" is infrastructure.
This RFC proposes a deployment model built on three ideas:
| # | Principle | Rationale |
|---|---|---|
| P1 | One agent, one role, one container | Agents are scoped like microservices — single responsibility, independently deployable |
| P2 | Agents will make mistakes | Errors are not bugs to eliminate; they are a certainty to design around |
| P3 | Quality is layered | Validation, verification, and human oversight catch errors before they reach the business |
The result is an agentic workforce — a fleet of containerized agents, each with a defined role, that collectively automate business processes the way a team of specialists would.
The question is not "can AI do this perfectly?" It's "can AI do this well enough, with review, faster and cheaper than the current process?" The answer, for a growing number of business functions, is yes.
The Container-Per-Agent Model
Each agent is deployed as an independent container. The container runs a lightweight runtime that loads a system prompt — the agent's role definition — and connects to the tools that role requires. Nothing else.
Why Containers
The same reasons you containerize microservices apply to agents:
| Benefit | For Microservices | For Agents |
|---|---|---|
| Isolation | Process and memory boundaries | Prompt and tool boundaries — one agent can't access another's context |
| Independent scaling | Scale hot services independently | Scale busy agents (e.g. invoice processor) without scaling idle ones |
| Independent deployment | Ship one service without redeploying the monolith | Update one agent's prompt or model without touching the fleet |
| Resource limits | CPU/memory caps | Token budgets, timeout limits, cost ceilings per container |
| Observability | Per-service metrics and logs | Per-agent traces, token usage, error rates |
Agent Container Anatomy
Every agent container has four components:
```yaml
agent:
  name: "invoice-processor"
  model: "claude-sonnet-4-5-20250929"

  # 1. The role prompt — this IS the agent
  system_prompt: |
    You are an invoice processing specialist. You receive scanned
    invoices and extract: vendor name, invoice number, date, line
    items, totals, and payment terms. You output structured JSON.
    Flag anything ambiguous for human review.

  # 2. The tools this role can access
  tools:
    - ocr_extract
    - vendor_lookup
    - output_json
    - flag_for_review

  # 3. Operational constraints
  limits:
    max_tokens_per_request: 4096
    max_requests_per_minute: 30
    timeout_seconds: 60
    cost_ceiling_daily: 50.00

  # 4. Output destination
  output:
    queue: "invoice-results"
    dead_letter: "invoice-errors"
```
The system prompt is the most important part of this configuration. It defines what the agent is, what it does, and what it doesn't do. A well-scoped prompt is the difference between an agent that reliably processes invoices and one that hallucinates line items.
One prompt, one job. Don't build a "general assistant" container that handles invoices, customer emails, and compliance checks. Build three containers. The prompt stays focused, the tools stay minimal, the failure modes stay predictable.
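To make the anatomy concrete, here is a minimal sketch of the runtime such a container might run, assuming the configuration above is mounted as agent.yaml and using the Anthropic Python SDK. The queue helpers and the input queue name are placeholders for whatever broker the platform uses; tool invocation, rate limiting, and cost ceilings are omitted for brevity.

```python
"""Minimal agent runtime sketch: load the role config, poll the input queue,
call the model, publish the result or dead-letter the message."""
import time

import anthropic   # pip install anthropic
import yaml        # pip install pyyaml

INPUT_QUEUE = "invoice-intake"   # assumed; the input queue is not part of the config above

def read_message(queue: str) -> str | None:
    """Placeholder: pull one message body from the broker, or None if the queue is empty."""
    raise NotImplementedError

def write_message(queue: str, body: str) -> None:
    """Placeholder: publish one message body to the broker."""
    raise NotImplementedError

def run() -> None:
    cfg = yaml.safe_load(open("agent.yaml"))["agent"]
    client = anthropic.Anthropic()   # API key comes from the environment

    while True:
        body = read_message(INPUT_QUEUE)
        if body is None:
            time.sleep(1)
            continue
        try:
            response = client.messages.create(
                model=cfg["model"],
                max_tokens=cfg["limits"]["max_tokens_per_request"],
                system=cfg["system_prompt"],
                messages=[{"role": "user", "content": body}],
            )
            write_message(cfg["output"]["queue"], response.content[0].text)
        except Exception:
            # Unrecoverable failures go to the dead-letter queue for human triage.
            write_message(cfg["output"]["dead_letter"], body)

if __name__ == "__main__":
    run()
```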
Composing the Agentic Workforce
Individual agents are useful. Composed agents automate entire business processes. The composition model is simple: agents communicate through message queues, each doing their part of a larger workflow.
Example: Order-to-Cash Pipeline
```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Email   │    │  Order   │    │ Invoice  │    │ Payment  │
│  Intake  │───▶│ Validator│───▶│Generator │───▶│ Matcher  │
│  Agent   │    │  Agent   │    │  Agent   │    │  Agent   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
```
Each agent in the pipeline:
- Reads from an input queue
- Performs its role (extract, validate, generate, match)
- Writes to an output queue
The next agent in the chain also acts as an implicit check on the previous agent's output: an order validator that receives malformed data from the intake agent will flag it rather than pass it downstream.
Workforce Topology Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Pipeline | A → B → C, sequential handoff | Order processing, document workflows |
| Fan-out | One input, multiple agents process in parallel | Multi-criteria analysis, parallel extraction |
| Fan-in | Multiple agents' outputs merged by an aggregator | Research synthesis, multi-source reconciliation |
| Router | Classifier agent routes to specialist agents | Customer support (billing vs. technical vs. sales) |
| Loop | Agent output feeds back for iterative refinement | Content generation with revision cycles |
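As one illustration, here is a minimal sketch of the router pattern from the table: a cheap classifier model assigns each incoming support message to a specialist agent's queue. The category labels, queue names, and the write_message helper are illustrative placeholders, not part of the proposal.

```python
"""Router pattern sketch: a classifier agent reads a support message and
forwards it to the matching specialist agent's input queue."""
import anthropic

ROUTES = {
    "billing": "billing-agent-in",
    "technical": "technical-agent-in",
    "sales": "sales-agent-in",
}

ROUTER_PROMPT = (
    "You are a routing classifier. Reply with exactly one word: "
    "billing, technical, or sales."
)

def write_message(queue: str, body: str) -> None:
    raise NotImplementedError   # placeholder for the broker client

def route(client: anthropic.Anthropic, message: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",   # routing needs no deep reasoning, so use a cheap model
        max_tokens=10,
        system=ROUTER_PROMPT,
        messages=[{"role": "user", "content": message}],
    )
    label = response.content[0].text.strip().lower()
    queue = ROUTES.get(label, "human-triage")   # unknown labels escalate to a human queue
    write_message(queue, message)
    return queue
```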
Workforce Registry
Every agent in the workforce is registered in a central manifest. This is the source of truth for what's deployed, what role each agent plays, and how they connect.
```yaml
workforce:
  - name: "email-intake"
    role: "Extract structured data from incoming customer emails"
    model: "claude-haiku-4-5-20251001"
    replicas: 3
    input: "incoming-emails"
    output: "parsed-emails"

  - name: "order-validator"
    role: "Validate order details against product catalog and customer history"
    model: "claude-sonnet-4-5-20250929"
    replicas: 2
    input: "parsed-emails"
    output: "validated-orders"

  - name: "invoice-generator"
    role: "Generate invoices from validated orders per billing rules"
    model: "claude-sonnet-4-5-20250929"
    replicas: 2
    input: "validated-orders"
    output: "draft-invoices"
```
Treat agents like employees. Each one has a job title (name), a job description (prompt), and tools they're authorized to use. The workforce registry is the org chart.
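Because the registry is the source of truth for how agents connect, it is worth sanity-checking at deploy time. A sketch, assuming the manifest above lives in workforce.yaml and that externally fed queues are listed explicitly:

```python
"""Registry wiring check sketch: verify that every agent's input queue is fed
by another agent's output or by a known external entry point, so broken
pipeline wiring is caught before deploy."""
import sys

import yaml   # pip install pyyaml

ENTRY_POINTS = {"incoming-emails"}   # queues fed from outside the workforce (assumed)

def check_wiring(path: str = "workforce.yaml") -> list[str]:
    agents = yaml.safe_load(open(path))["workforce"]
    outputs = {agent["output"] for agent in agents}
    problems = []
    for agent in agents:
        if agent["input"] not in outputs | ENTRY_POINTS:
            problems.append(f'{agent["name"]}: nothing feeds its input queue "{agent["input"]}"')
    return problems

if __name__ == "__main__":
    issues = check_wiring()
    for issue in issues:
        print("WIRING ERROR:", issue)
    sys.exit(1 if issues else 0)
```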
Quality Assurance
Agents will make mistakes. LLMs hallucinate, misinterpret edge cases, and occasionally produce confident nonsense. The solution is not to make agents perfect — it's to layer multiple quality checks so errors are caught before they matter.
Verification Layers
| Layer | Mechanism | Cost |
|---|---|---|
| Schema validation | Deterministic check — does the output match the expected structure? | Near zero |
| Business rules | Code-based checks — are calculations correct, are required fields present? | Near zero |
| Downstream agents | The next agent in the pipeline rejects malformed input naturally | Included in pipeline cost |
| Agent-based review | A dedicated verification agent spot-checks output using a different model | LLM cost per review |
| Human spot-checks | Sample-based human review (e.g. 5–10% of output) | Labor cost |
The key insight is that most errors can be caught cheaply. Schema validation and business rule checks are deterministic and effectively free at runtime. Only the ambiguous cases (semantic correctness, edge cases, novel inputs) need LLM-based or human review.
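A sketch of what the two cheap layers look like for the invoice-processor output, using pydantic for the schema check and plain code for one business rule. The field names mirror the role prompt but are assumptions, as is the rounding tolerance.

```python
"""Verification layer sketch: schema validation (structure) plus a code-level
business rule (line items must sum to the stated total)."""
from pydantic import BaseModel, ValidationError   # pip install pydantic

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float

class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    date: str
    line_items: list[LineItem]
    total: float
    payment_terms: str

def validate_output(raw_json: str) -> tuple[bool, str]:
    # Layer 1: schema validation. Is the output structurally what we expect?
    try:
        invoice = Invoice.model_validate_json(raw_json)
    except ValidationError as err:
        return False, f"schema: {len(err.errors())} field error(s)"

    # Layer 2: business rule. Do the line items sum to the stated total?
    computed = sum(item.quantity * item.unit_price for item in invoice.line_items)
    if abs(computed - invoice.total) > 0.01:   # tolerance is an assumption
        return False, f"rule: line items sum to {computed:.2f}, total says {invoice.total:.2f}"

    return True, "ok"
```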
Error Budget
Not every mistake needs to be caught. The goal is to bring the error rate below the threshold at which the business impact becomes unacceptable, not to drive it to zero.
| Metric | Target | Action |
|---|---|---|
| Agent accuracy (raw output) | > 85% | Below this, the prompt or model needs rework |
| Post-validation accuracy | > 99% | Below this, add more verification layers or increase human review |
| Human escalation rate | < 5% | Above this, the agents aren't ready for this task |
| False positive rate (over-flagging) | < 10% | Above this, validation is too aggressive — wasting throughput |
Measure the pipeline, not the individual agent. An agent with 88% accuracy feeding into a schema validator that catches structural errors and a business rule check that catches calculation errors can produce a pipeline with 99%+ end-to-end accuracy. That's better than most manual processes.
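The arithmetic behind that claim, with illustrative per-layer catch rates (the numbers below are assumptions, not measurements):

```python
"""Illustrative arithmetic only: how layered checks lift an 88%-accurate agent
above 99% end-to-end accuracy."""
raw_error = 0.12            # agent is right 88% of the time
caught_by_schema = 0.60     # share of remaining errors that are structural
caught_by_rules = 0.70      # share of remaining errors that break a business rule
caught_by_review = 0.50     # share of remaining errors caught downstream or in spot-checks

residual = raw_error
for catch_rate in (caught_by_schema, caught_by_rules, caught_by_review):
    residual *= (1 - catch_rate)

print(f"post-validation error rate: {residual:.3%}")   # ~0.7%, i.e. ~99.3% accurate
```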
Cost / Benefit
The business case for an agentic workforce is not "AI is exciting." It's arithmetic. A back-office worker processing invoices costs money, works fixed hours, and processes at a fixed rate. An agent container costs cents per transaction, runs 24/7, and scales horizontally. The question is whether the math still works after you account for infrastructure, quality checks, human escalations, and error costs.
Unit Economics: Invoice Processing
Take a concrete example — a four-person team processing 200 invoices per day manually.
Manual baseline:
| Item | Value |
|---|---|
| Team size | 4 FTEs |
| Fully loaded cost per FTE | $75,000 / year |
| Total annual labor cost | $300,000 |
| Invoices per day | 200 |
| Working days per year | 250 |
| Invoices per year | 50,000 |
| Cost per invoice | $6.00 |
| Processing time per invoice | 8–12 minutes |
| Error rate (industry average) | 1–3% |
Agentic workforce:
| Item | Value |
|---|---|
| Agent model | Haiku for extraction, Sonnet for complex reasoning |
| Tokens per invoice | ~1,500 input + ~800 output |
| LLM cost per invoice | ~$0.005 |
| Validation (schema + business rules) | ~$0.00 (deterministic) |
| Human escalation rate | 5% (at $3.00 per escalation) |
| Blended marginal cost per invoice (LLM + escalations) | ~$0.16 |
| Infrastructure (containers, queues, gateway) | ~$1,500 / month |
| Annual LLM cost (50,000 invoices) | ~$250 |
| Annual infrastructure | ~$18,000 |
| Annual human escalation (2,500 cases) | ~$7,500 |
| Total annual cost | ~$25,750 |
| Fully loaded cost per invoice (total annual cost / 50,000) | ~$0.52 |
| Processing time per invoice | 10–30 seconds |
| Post-validation error rate | < 0.5% |
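A back-of-the-envelope check of those numbers, assuming roughly $1 / $5 per million input / output tokens for the Haiku-class extraction model; small differences from the table's totals are rounding.

```python
"""Unit economics check for the agentic invoice pipeline. Token prices are
assumptions; the volume, escalation rate, and infrastructure cost come from
the tables above."""
llm_cost = 1_500 / 1e6 * 1.00 + 800 / 1e6 * 5.00      # ~$0.0055 LLM cost per invoice
marginal = llm_cost + 0.05 * 3.00                      # + 5% escalations at $3 each -> ~$0.16
annual = 50_000 * llm_cost + 18_000 + 2_500 * 3.00     # LLM + infrastructure + escalations
fully_loaded = annual / 50_000                          # ~$0.52 per invoice

print(f"marginal ${marginal:.2f}/invoice, annual ~${annual:,.0f}, fully loaded ${fully_loaded:.2f}/invoice")
```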
Comparison
| Metric | Manual | Agentic | Delta |
|---|---|---|---|
| Annual cost | $300,000 | $25,750 | 91% reduction |
| Cost per invoice | $6.00 | $0.52 | 91% reduction |
| Processing time | 8–12 min | 10–30 sec | ~30x faster |
| Error rate | 1–3% | < 0.5% | 2–6x lower error rate |
| Availability | Business hours | 24/7 | Continuous |
| Scalability | Hire + train (weeks) | Add replicas (minutes) | Elastic |
Break-Even
| Cost Item | One-Time | Recurring |
|---|---|---|
| Platform build (containers, messaging, LLM gateway) | $30,000–50,000 | — |
| Prompt engineering + eval suite (per workflow) | $10,000–20,000 | — |
| Ongoing infrastructure | — | $18,000 / year |
| LLM API costs | — | Scales with volume |
| Human oversight (escalations + spot-checks) | — | $7,500 / year |
At $300,000/year in manual labor costs and ~$25,750/year in agentic costs, the annual savings are ~$274,000. With a one-time investment of $40,000–70,000 (platform build plus the first workflow's prompt engineering and eval suite), the break-even point is roughly two to three months after the first workflow goes live.
What Changes With Scale
The economics improve with volume. The platform cost is fixed — adding a second workflow doesn't require rebuilding the container orchestration, LLM gateway, or observability stack. Each new workflow adds only its prompt engineering cost and marginal LLM usage.
| Workflows automated | Annual manual cost displaced | Annual agentic cost | Net savings |
|---|---|---|---|
| 1 (invoice processing) | $300,000 | $25,750 | $274,000 |
| 3 (+ email triage, order validation) | $750,000 | $62,000 | $688,000 |
| 5 (+ compliance checks, data reconciliation) | $1,200,000 | $100,000 | $1,100,000 |
The first workflow is the most expensive. It carries the full platform build cost. Every subsequent workflow is incremental — just prompts, validation rules, and eval suites. This is where the container-per-agent model pays off: the infrastructure is reusable, only the roles change.
Infrastructure
The container platform is standard — Kubernetes, ECS, or Cloud Run. The agent-specific infrastructure is the messaging layer, the LLM gateway, and the observability stack.
Container Orchestration
| Component | Purpose | Implementation |
|---|---|---|
| Container runtime | Run agent containers | Kubernetes / ECS / Cloud Run |
| Message broker | Inter-agent communication | RabbitMQ / SQS / Cloud Pub/Sub |
| Dead letter queue | Capture failed messages for investigation | Per-agent DLQ |
| LLM gateway | Route requests, enforce rate limits, track costs | LiteLLM / custom proxy |
| Secret store | API keys, connection strings | Vault / cloud-native KMS |
| Config store | Agent prompts and tool definitions (version-controlled) | Git + ConfigMap / Parameter Store |
Scaling
Agents scale like any stateless container workload — horizontally, based on queue depth.
| Signal | Action |
|---|---|
| Queue depth > threshold | Scale up worker replicas |
| Queue depth = 0 for 5 min | Scale down to minimum |
| Error rate > 10% | Halt scaling, alert on-call |
| Cost ceiling reached | Stop processing, queue backpressure |
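In practice a queue-driven autoscaler such as KEDA, or an HPA on an external metric, would implement these rules; the sketch below shows the core loop using SQS and the Kubernetes API. The queue URL, deployment name, and thresholds are assumptions, and the error-rate and cost-ceiling halts from the table are omitted for brevity.

```python
"""Queue-depth autoscaling sketch: scale the agent deployment with the backlog."""
import time

import boto3                               # pip install boto3
from kubernetes import client, config      # pip install kubernetes

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/invoice-intake"  # assumed
DEPLOYMENT, NAMESPACE = "invoice-processor", "agents"                           # assumed
MIN_REPLICAS, MAX_REPLICAS, MSGS_PER_REPLICA = 1, 10, 50                        # assumed

def desired_replicas(depth: int) -> int:
    # Ceiling division on queue depth, clamped to the configured bounds.
    return max(MIN_REPLICAS, min(MAX_REPLICAS, -(-depth // MSGS_PER_REPLICA)))

def main() -> None:
    sqs = boto3.client("sqs")
    config.load_incluster_config()          # running inside the cluster
    apps = client.AppsV1Api()

    while True:
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
        )
        depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
        apps.patch_namespaced_deployment_scale(
            DEPLOYMENT, NAMESPACE, body={"spec": {"replicas": desired_replicas(depth)}}
        )
        time.sleep(60)

if __name__ == "__main__":
    main()
```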
Prompts are config, not code. Store system prompts in version-controlled config. Updating an agent's behavior is a config rollout, not a container rebuild, which keeps iteration fast: prompt changes still pass the eval and canary gates described under Deployment & Lifecycle, but they never require a new image build.
Observability
Every agent interaction produces a trace. Because agents communicate through queues, you need end-to-end tracing across the entire workflow — not just per-agent metrics.
Per-Agent Metrics
| Metric | Description |
|---|---|
| Throughput | Messages processed per minute |
| Latency | Time from message received to output produced |
| Token usage | Input + output tokens per request |
| Cost | Dollar cost per request (computed from token usage) |
| Error rate | Failed requests / total requests |
| Validation pass rate | Percentage of outputs passing schema + business rule checks |
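Cost is derived rather than reported directly, so each agent computes it from the token counts in the model response. A sketch follows; the per-million-token rates are assumptions to be replaced with current provider pricing, and the emit target (stdout) stands in for Langfuse, Prometheus, or whatever backend the platform uses.

```python
"""Per-request metrics sketch: latency, token usage, and derived dollar cost."""
import json
import time

PRICE_PER_MTOK = {                               # assumed rates, $ per million tokens
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-5-20250929": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICE_PER_MTOK[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

def emit(agent: str, model: str, started: float, input_tokens: int, output_tokens: int) -> None:
    record = {
        "agent": agent,
        "model": model,
        "latency_s": round(time.time() - started, 3),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(request_cost(model, input_tokens, output_tokens), 6),
    }
    print(json.dumps(record))                    # replace with the observability backend
```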
Workflow Metrics
| Metric | Description |
|---|---|
| End-to-end latency | Time from initial input to final output |
| Pipeline accuracy | Post-validation accuracy across the full workflow |
| Human escalation rate | Percentage of items requiring human intervention |
| Cost per transaction | Total LLM + infrastructure cost for one complete workflow run |
| Queue depth | Backlog per agent — indicates bottlenecks |
Alerts
| Condition | Severity | Response |
|---|---|---|
| Agent error rate > 10% | SEV-2 | Pause agent, investigate prompt or input quality |
| Validation failure rate > 30% | SEV-2 | Agent quality degraded — review prompt, check model version |
| End-to-end latency > 2x baseline | SEV-3 | Identify bottleneck agent, check queue depth, scale up |
| Daily cost > 150% of budget | SEV-2 | Circuit breaker, review for retry loops or prompt regression |
| Human escalation rate > 15% | SEV-3 | Agents are not resolving enough cases on their own; broaden prompt coverage or add few-shot examples |
Dashboard the workforce like you'd dashboard a team. Who's busy, who's idle, who's making mistakes, what's the cost per completed task. The agent workforce should be as transparent as a Kanban board.
Deployment & Lifecycle
Deploying an agent is deploying a container with a prompt. The CI/CD pipeline reflects this — standard container build plus agent-specific behavioral validation.
Pipeline
| Stage | Gate | Description |
|---|---|---|
| Prompt lint | Schema validation | Ensure prompt follows template, tools are registered |
| Unit test | Tool mocks | Verify tool integration and output parsing |
| Eval suite | Golden set | Run agent against 50–100 curated inputs, compare to expected outputs |
| Integration test | Pipeline validation | Run full workflow end-to-end, verify outputs pass all validation layers |
| Shadow mode | Parallel execution | New agent runs alongside current, outputs compared but not acted on |
| Canary | 10% traffic | Route fraction of queue messages to new version |
| Promote | Full rollout | After 24h canary with no regression |
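A sketch of the golden-set eval gate, written as the "custom harness" option from the tooling table. The golden.jsonl format, exact-match scoring, and the 95% promotion threshold are assumptions; real suites usually mix exact-match, rule-based, and LLM-graded checks.

```python
"""Golden-set eval harness sketch: run a candidate prompt/model pair against
curated inputs and gate promotion on the pass rate."""
import json

import anthropic   # pip install anthropic

PASS_THRESHOLD = 0.95   # assumed promotion gate

def run_eval(system_prompt: str, model: str, golden_path: str = "golden.jsonl") -> float:
    client = anthropic.Anthropic()
    passed = total = 0
    for line in open(golden_path):
        case = json.loads(line)                 # each line: {"input": ..., "expected": {...}}
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}],
        )
        try:
            got = json.loads(response.content[0].text)
        except json.JSONDecodeError:
            got = None
        passed += int(got == case["expected"])  # exact match; fuzzier scoring is a refinement
        total += 1
    return passed / total

if __name__ == "__main__":
    rate = run_eval(open("prompt.txt").read(), "claude-sonnet-4-5-20250929")
    print(f"golden-set pass rate: {rate:.1%}")
    raise SystemExit(0 if rate >= PASS_THRESHOLD else 1)
```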
Prompt Versioning
Every prompt change is versioned and traceable. A prompt change is functionally equivalent to a code change — it alters agent behavior.
| Policy | Rule |
|---|---|
| All prompts in Git | Version-controlled, reviewed, auditable |
| Prompt + model pinned together | Changing the model version requires re-running evals |
| Rollback-ready | Previous prompt version deployable in < 5 minutes |
| A/B testing | Route percentage of traffic to prompt variant, measure accuracy difference |
Prompt engineering is not ad-hoc. It's a deployment artifact with a version, an eval suite, and a rollback plan. Treat it like code.
Team & Getting Started
Building an agentic workforce is not an AI research project. It's a platform engineering effort with domain expertise.
Team
| Role | Count | Scope |
|---|---|---|
| Platform Engineer | 1–2 | Container orchestration, messaging, LLM gateway, observability |
| Prompt Engineer | 1–2 | Agent role design, eval suite curation, validation rule authoring |
| Domain Expert | 1 per workflow | Define what "correct" looks like, review escalations, validate accuracy targets |
| FinOps / Cost Analyst | 0.5 | Model selection for cost optimization, budget configuration |
Getting Started
Don't automate everything at once. Start with one workflow, prove the model, then expand.
| Phase | Duration | Scope |
|---|---|---|
| Pilot | Weeks 1–4 | One workflow, 2–3 agents, shadow mode only |
| Validation | Weeks 5–8 | Measure accuracy, cost, latency against manual baseline |
| Production | Weeks 9–12 | Promote to live traffic, human spot-checks at 10% sample rate |
| Expansion | Ongoing | Add workflows, add agents, reduce human review as confidence grows |
Tooling
| Function | Tool |
|---|---|
| Container orchestration | Kubernetes / ECS / Cloud Run |
| Message broker | RabbitMQ / SQS / Pub/Sub |
| LLM gateway | LiteLLM / Portkey |
| Observability | Langfuse / Arize Phoenix |
| Eval framework | promptfoo / custom harness |
| Prompt storage | Git + ConfigMap |
| Cost tracking | Provider dashboards + aggregation |
Start with the boring workflow — the one that's high-volume, well-understood, and low-risk. Invoice processing, email classification, data entry validation. Prove that the container-per-agent model works, that quality checks catch errors, and that the cost math holds. Then scale.