
RFC-003


One Container, One Role, One Prompt — The Agentic Workforce

A proposal for deploying AI agents as containerized, single-role workers — each defined by one prompt, scoped to one job — composed into an agentic workforce that automates business operations.


Motivation

Every organization has work that follows a pattern: receive input, apply judgment, produce output. Invoice processing, customer onboarding, compliance review, content generation, data reconciliation — these are not engineering problems. They are judgment problems that currently require humans because software has historically been too brittle to handle ambiguity.

Large language models change this. An LLM can read an unstructured email, understand what's being asked, and draft a response. But a single LLM prompt in a notebook is not a deployment. The gap between "this works in a demo" and "this runs the accounts payable process" is infrastructure.

This RFC proposes a deployment model built on three ideas:

| # | Principle | Rationale |
| --- | --- | --- |
| P1 | One agent, one role, one container | Agents are scoped like microservices — single responsibility, independently deployable |
| P2 | Agents will make mistakes | Errors are not bugs to eliminate; they are a certainty to design around |
| P3 | Quality is layered | Validation, verification, and human oversight catch errors before they reach the business |

The result is an agentic workforce — a fleet of containerized agents, each with a defined role, that collectively automate business processes the way a team of specialists would.

The question is not "can AI do this perfectly?" It's "can AI do this well enough, with review, faster and cheaper than the current process?" The answer, for a growing number of business functions, is yes.


The Container-Per-Agent Model

Each agent is deployed as an independent container. The container runs a lightweight runtime that loads a system prompt — the agent's role definition — and connects to the tools that role requires. Nothing else.

Why Containers

The same reasons you containerize microservices apply to agents:

| Benefit | For Microservices | For Agents |
| --- | --- | --- |
| Isolation | Process and memory boundaries | Prompt and tool boundaries — one agent can't access another's context |
| Independent scaling | Scale hot services independently | Scale busy agents (e.g. invoice processor) without scaling idle ones |
| Independent deployment | Ship one service without redeploying the monolith | Update one agent's prompt or model without touching the fleet |
| Resource limits | CPU/memory caps | Token budgets, timeout limits, cost ceilings per container |
| Observability | Per-service metrics and logs | Per-agent traces, token usage, error rates |

Agent Container Anatomy

Every agent container has four components:

agent:
  name: "invoice-processor"
  model: "claude-sonnet-4-5-20250929"

  # 1. The role prompt — this IS the agent
  system_prompt: |
    You are an invoice processing specialist. You receive scanned
    invoices and extract: vendor name, invoice number, date, line
    items, totals, and payment terms. You output structured JSON.
    Flag anything ambiguous for human review.

  # 2. The tools this role can access
  tools:
    - ocr_extract
    - vendor_lookup
    - output_json
    - flag_for_review

  # 3. Operational constraints
  limits:
    max_tokens_per_request: 4096
    max_requests_per_minute: 30
    timeout_seconds: 60
    cost_ceiling_daily: 50.00

  # 4. Output destination
  output:
    queue: "invoice-results"
    dead_letter: "invoice-errors"

The system prompt is the most important field in this configuration. It defines what the agent is, what it does, and what it doesn't do. A well-scoped prompt is the difference between an agent that reliably processes invoices and one that hallucinates line items.

One prompt, one job. Don't build a "general assistant" container that handles invoices, customer emails, and compliance checks. Build three containers. The prompt stays focused, the tools stay minimal, the failure modes stay predictable.
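
To make the runtime concrete, here is a minimal sketch of how a container might load this definition and apply it to one message. It is an illustration, not a prescribed implementation: `llm_client` is a stand-in for whatever client the platform's LLM gateway provides, and its `complete` signature is assumed.

```python
# Minimal agent-runtime sketch. Assumes PyYAML for config parsing and a generic
# `llm_client` injected by the platform (interface is illustrative, not a real SDK).
import json
import yaml


def load_agent_config(path: str) -> dict:
    """Read the agent definition: name, model, system_prompt, tools, limits, output."""
    with open(path) as f:
        return yaml.safe_load(f)["agent"]


def handle_message(cfg: dict, llm_client, message: dict) -> dict:
    """Apply the agent's single role to one input message."""
    response = llm_client.complete(
        model=cfg["model"],
        system=cfg["system_prompt"],                         # the role prompt IS the agent
        tools=cfg["tools"],                                  # only the tools this role needs
        max_tokens=cfg["limits"]["max_tokens_per_request"],
        input=json.dumps(message),
    )
    return {"agent": cfg["name"], "output": response}
```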


Composing the Agentic Workforce

Individual agents are useful. Composed agents automate entire business processes. The composition model is simple: agents communicate through message queues, each doing their part of a larger workflow.

Example: Order-to-Cash Pipeline

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Email   │    │  Order   │    │ Invoice  │    │ Payment  │
│  Intake  │───▶│ Validator│───▶│Generator │───▶│ Matcher  │
│  Agent   │    │  Agent   │    │  Agent   │    │  Agent   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘

Each agent in the pipeline:

  1. Reads from an input queue
  2. Performs its role (extract, validate, generate, match)
  3. Writes to an output queue

The next agent in the chain implicitly validates the previous agent's output — an order validator that receives malformed data from the intake agent will flag it. A minimal worker loop for a single stage is sketched below.
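
This sketch reuses `handle_message` from the runtime example above; `queue_client` is again a stand-in with assumed `receive`, `publish`, and `ack` methods, to be adapted to whichever broker (SQS, RabbitMQ, Pub/Sub) the deployment uses, and the config keys mirror the agent anatomy shown earlier.

```python
# Worker-loop sketch for one pipeline stage: read, process, publish, dead-letter on failure.
def run_worker(cfg: dict, queue_client, llm_client) -> None:
    while True:
        msg = queue_client.receive(cfg["input"], timeout=30)
        if msg is None:
            continue                                     # queue empty; poll again
        try:
            result = handle_message(cfg, llm_client, msg.body)
            queue_client.publish(cfg["output"]["queue"], result)
        except Exception as exc:                         # any failure: park it for investigation
            queue_client.publish(cfg["output"]["dead_letter"],
                                 {"input": msg.body, "error": str(exc)})
        finally:
            queue_client.ack(msg)                        # remove the message from the input queue
```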

Workforce Topology Patterns

| Pattern | Description | Use Case |
| --- | --- | --- |
| Pipeline | A → B → C, sequential handoff | Order processing, document workflows |
| Fan-out | One input, multiple agents process in parallel | Multi-criteria analysis, parallel extraction |
| Fan-in | Multiple agents' outputs merged by an aggregator | Research synthesis, multi-source reconciliation |
| Router | Classifier agent routes to specialist agents | Customer support (billing vs. technical vs. sales) |
| Loop | Agent output feeds back for iterative refinement | Content generation with revision cycles |

Workforce Registry

Every agent in the workforce is registered in a central manifest. This is the source of truth for what's deployed, what role each agent plays, and how they connect.

workforce:
  - name: "email-intake"
    role: "Extract structured data from incoming customer emails"
    model: "claude-haiku-4-5-20251001"
    replicas: 3
    input: "incoming-emails"
    output: "parsed-emails"

  - name: "order-validator"
    role: "Validate order details against product catalog and customer history"
    model: "claude-sonnet-4-5-20250929"
    replicas: 2
    input: "parsed-emails"
    output: "validated-orders"

  - name: "invoice-generator"
    role: "Generate invoices from validated orders per billing rules"
    model: "claude-sonnet-4-5-20250929"
    replicas: 2
    input: "validated-orders"
    output: "draft-invoices"

Treat agents like employees. Each one has a job title (name), a job description (prompt), and tools they're authorized to use. The workforce registry is the org chart.


Quality Assurance

Agents will make mistakes. LLMs hallucinate, misinterpret edge cases, and occasionally produce confident nonsense. The solution is not to make agents perfect — it's to layer multiple quality checks so errors are caught before they matter.

Verification Layers

| Layer | Mechanism | Cost |
| --- | --- | --- |
| Schema validation | Deterministic check — does the output match the expected structure? | Near zero |
| Business rules | Code-based checks — are calculations correct, are required fields present? | Near zero |
| Downstream agents | The next agent in the pipeline rejects malformed input naturally | Included in pipeline cost |
| Agent-based review | A dedicated verification agent spot-checks output using a different model | LLM cost per review |
| Human spot-checks | Sample-based human review (e.g. 5–10% of output) | Labor cost |

The key insight is that most errors can be caught cheaply. Schema validation and business rule checks are deterministic and free. Only the ambiguous cases — semantic correctness, edge cases, novel inputs — need LLM-based or human review.
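A sketch of the first two layers for the invoice agent, using pydantic for the schema check. The field names and the rounding tolerance are illustrative; the real schema comes from the role prompt and the rules from the domain expert.

```python
# Deterministic validation layers: schema (structure/types) + business rules (arithmetic).
from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float


class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    line_items: list[LineItem]
    total: float


def validate(raw: dict) -> tuple[bool, str]:
    # Layer 1: schema validation rejects structurally malformed output.
    try:
        inv = Invoice(**raw)
    except ValidationError as exc:
        return False, f"schema: {exc}"
    # Layer 2: business rule check, e.g. the stated total must match the line items.
    computed = sum(i.quantity * i.unit_price for i in inv.line_items)
    if abs(computed - inv.total) > 0.01:
        return False, f"rule: total {inv.total} != line-item sum {computed:.2f}"
    return True, "ok"
```

Outputs that fail either layer can be retried or flagged for review; only those that pass the cheap checks proceed to the more expensive layers.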

Error Budget

Not every mistake needs to be caught. The goal is to reduce the error rate below the threshold where the business impact is acceptable — not to reach zero.

| Metric | Target | Action |
| --- | --- | --- |
| Agent accuracy (raw output) | > 85% | Below this, the prompt or model needs rework |
| Post-validation accuracy | > 99% | Below this, add more verification layers or increase human review |
| Human escalation rate | < 5% | Above this, the agents aren't ready for this task |
| False positive rate (over-flagging) | < 10% | Above this, validation is too aggressive — wasting throughput |

Measure the pipeline, not the individual agent. An agent with 88% accuracy feeding into a schema validator that catches structural errors and a business rule check that catches calculation errors can produce a pipeline with 99%+ end-to-end accuracy. That's better than most manual processes.
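The arithmetic behind that claim, with an assumed catch rate for the deterministic layers (the 95% figure is illustrative, not a measurement):

```python
# Illustrative pipeline-accuracy arithmetic.
raw_error = 0.12            # the agent alone is 88% accurate
caught = 0.95               # assume schema + business-rule checks catch 95% of those errors
residual_error = raw_error * (1 - caught)
print(f"end-to-end accuracy ≈ {1 - residual_error:.1%}")   # -> 99.4%
```

Errors the checks catch are not silently fixed; they are rerouted to retry or review, which is what the human escalation budget above accounts for.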


Cost / Benefit

The business case for an agentic workforce is not "AI is exciting." It's arithmetic. A back-office worker processing invoices costs money, works fixed hours, and processes at a fixed rate. An agent container costs fractions of a cent per transaction, runs 24/7, and scales horizontally. The question is whether the math works after you account for infrastructure, quality checks, human escalations, and error costs.

Unit Economics: Invoice Processing

Take a concrete example — a four-person team processing 200 invoices per day manually.

Manual baseline:

| Item | Value |
| --- | --- |
| Team size | 4 FTEs |
| Fully loaded cost per FTE | $75,000 / year |
| Total annual labor cost | $300,000 |
| Invoices per day | 200 |
| Working days per year | 250 |
| Invoices per year | 50,000 |
| Cost per invoice | $6.00 |
| Processing time per invoice | 8–12 minutes |
| Error rate (industry average) | 1–3% |

Agentic workforce:

| Item | Value |
| --- | --- |
| Agent model | Haiku for extraction, Sonnet for complex reasoning |
| Tokens per invoice | ~1,500 input + ~800 output |
| LLM cost per invoice | ~$0.005 |
| Validation (schema + business rules) | ~$0.00 (deterministic) |
| Human escalation rate | 5% (at $3.00 per escalation) |
| Blended cost per invoice (LLM + escalation) | ~$0.16 |
| Infrastructure (containers, queues, gateway) | ~$1,500 / month |
| Annual LLM cost (50,000 invoices) | ~$250 |
| Annual infrastructure | ~$18,000 |
| Annual human escalation (2,500 cases) | ~$7,500 |
| Total annual cost | ~$25,750 |
| All-in cost per invoice | ~$0.52 |
| Processing time per invoice | 10–30 seconds |
| Post-validation error rate | < 0.5% |
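
The per-invoice and annual figures follow directly from the inputs in the table; the arithmetic, reproduced for inspection:

```python
# Reproducing the table's arithmetic.
invoices_per_year = 50_000
llm = 0.005                                   # LLM cost per invoice
escalation = 0.05 * 3.00                      # 5% of invoices at $3.00 each = $0.15
infra = 1_500 * 12 / invoices_per_year        # $18,000/year spread over volume = $0.36

all_in_per_invoice = llm + escalation + infra                       # ≈ $0.52
annual_total = (llm + escalation) * invoices_per_year + 18_000      # ≈ $25,750
```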

Comparison

| Metric | Manual | Agentic | Delta |
| --- | --- | --- | --- |
| Annual cost | $300,000 | $25,750 | 91% reduction |
| Cost per invoice | $6.00 | $0.52 | 91% reduction |
| Processing time | 8–12 min | 10–30 sec | ~30x faster |
| Error rate | 1–3% | < 0.5% | 2–6x more accurate |
| Availability | Business hours | 24/7 | Continuous |
| Scalability | Hire + train (weeks) | Add replicas (minutes) | Elastic |

Break-Even

| Cost Item | One-Time | Recurring |
| --- | --- | --- |
| Platform build (containers, messaging, LLM gateway) | $30,000–50,000 | — |
| Prompt engineering + eval suite (per workflow) | $10,000–20,000 | — |
| Ongoing infrastructure | — | $18,000 / year |
| LLM API costs | — | Scales with volume |
| Human oversight (escalations + spot-checks) | — | $7,500 / year |

At $300,000/year in manual labor costs and ~$25,750/year in agentic costs, the annual savings are ~$274,000. With a one-time platform investment of $50,000–70,000, the break-even point is roughly 3 months after the first workflow goes live.

What Changes With Scale

The economics improve with volume. The platform cost is fixed — adding a second workflow doesn't require rebuilding the container orchestration, LLM gateway, or observability stack. Each new workflow adds only its prompt engineering cost and marginal LLM usage.

| Workflows automated | Annual manual cost displaced | Annual agentic cost | Net savings |
| --- | --- | --- | --- |
| 1 (invoice processing) | $300,000 | $25,750 | $274,000 |
| 3 (+ email triage, order validation) | $750,000 | $62,000 | $688,000 |
| 5 (+ compliance checks, data reconciliation) | $1,200,000 | $100,000 | $1,100,000 |

The first workflow is the most expensive. It carries the full platform build cost. Every subsequent workflow is incremental — just prompts, validation rules, and eval suites. This is where the container-per-agent model pays off: the infrastructure is reusable, only the roles change.


Infrastructure

The container platform is standard — Kubernetes, ECS, or Cloud Run. The agent-specific infrastructure is the messaging layer, the LLM gateway, and the observability stack.

Container Orchestration

| Component | Purpose | Implementation |
| --- | --- | --- |
| Container runtime | Run agent containers | Kubernetes / ECS / Cloud Run |
| Message broker | Inter-agent communication | RabbitMQ / SQS / Cloud Pub/Sub |
| Dead letter queue | Capture failed messages for investigation | Per-agent DLQ |
| LLM gateway | Route requests, enforce rate limits, track costs | LiteLLM / custom proxy |
| Secret store | API keys, connection strings | Vault / cloud-native KMS |
| Config store | Agent prompts and tool definitions (version-controlled) | Git + ConfigMap / Parameter Store |

Scaling

Agents scale like any stateless container workload — horizontally, based on queue depth.

| Signal | Action |
| --- | --- |
| Queue depth > threshold | Scale up worker replicas |
| Queue depth = 0 for 5 min | Scale down to minimum |
| Error rate > 10% | Halt scaling, alert on-call |
| Cost ceiling reached | Stop processing, queue backpressure |
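
The same policy expressed as a decision function, for illustration only; in practice this would be a KEDA or HPA rule driven by queue depth rather than hand-rolled code, and the threshold names are assumptions.

```python
# Scaling-policy sketch mirroring the table above.
def desired_replicas(queue_depth: int, error_rate: float, cost_today: float,
                     current: int, cfg: dict) -> int:
    if cost_today >= cfg["cost_ceiling_daily"]:
        return 0                                    # circuit breaker: stop and let the queue back up
    if error_rate > 0.10:
        return current                              # halt scaling; alerting is handled separately
    if queue_depth > cfg["scale_up_depth"]:
        return min(current + 1, cfg["max_replicas"])
    if queue_depth == 0:
        return cfg["min_replicas"]                  # applied after the 5-minute idle window
    return current
```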

Prompts are config, not code. Store system prompts in version-controlled config. Updating an agent's behavior is a config change, not a code deployment. This enables rapid iteration without CI/CD overhead for prompt tuning.


Observability

Every agent interaction produces a trace. Because agents communicate through queues, you need end-to-end tracing across the entire workflow — not just per-agent metrics.

Per-Agent Metrics

| Metric | Description |
| --- | --- |
| Throughput | Messages processed per minute |
| Latency | Time from message received to output produced |
| Token usage | Input + output tokens per request |
| Cost | Dollar cost per request (computed from token usage) |
| Error rate | Failed requests / total requests |
| Validation pass rate | Percentage of outputs passing schema + business rule checks |
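
One low-friction way to produce these metrics is to emit a single structured record per agent call and aggregate downstream. The field names here are illustrative; a real deployment might emit to Langfuse or an OpenTelemetry pipeline rather than raw logs.

```python
# Per-request telemetry sketch: one JSON record per agent call.
import json
import logging
import time

log = logging.getLogger("agent.telemetry")


def record(agent: str, started: float, usage: dict, ok: bool, valid: bool) -> None:
    log.info(json.dumps({
        "agent": agent,
        "latency_ms": round((time.monotonic() - started) * 1000),
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cost_usd": usage.get("cost_usd", 0.0),    # ideally computed by the LLM gateway
        "ok": ok,                                  # request completed without error
        "valid": valid,                            # passed schema + business-rule checks
    }))
```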

Workflow Metrics

| Metric | Description |
| --- | --- |
| End-to-end latency | Time from initial input to final output |
| Pipeline accuracy | Post-validation accuracy across the full workflow |
| Human escalation rate | Percentage of items requiring human intervention |
| Cost per transaction | Total LLM + infrastructure cost for one complete workflow run |
| Queue depth | Backlog per agent — indicates bottlenecks |

Alerts

| Condition | Severity | Response |
| --- | --- | --- |
| Agent error rate > 10% | SEV-2 | Pause agent, investigate prompt or input quality |
| Validation failure rate > 30% | SEV-2 | Agent quality degraded — review prompt, check model version |
| End-to-end latency > 2x baseline | SEV-3 | Identify bottleneck agent, check queue depth, scale up |
| Daily cost > 150% of budget | SEV-2 | Circuit breaker, review for retry loops or prompt regression |
| Human escalation rate > 15% | SEV-3 | Agents are not handling enough cases on their own — expand prompts or add training data |

Dashboard the workforce like you'd dashboard a team. Who's busy, who's idle, who's making mistakes, what's the cost per completed task. The agent workforce should be as transparent as a Kanban board.


Deployment & Lifecycle

Deploying an agent is deploying a container with a prompt. The CI/CD pipeline reflects this — standard container build plus agent-specific behavioral validation.

Pipeline

| Stage | Gate | Description |
| --- | --- | --- |
| Prompt lint | Schema validation | Ensure prompt follows template, tools are registered |
| Unit test | Tool mocks | Verify tool integration and output parsing |
| Eval suite | Golden set | Run agent against 50–100 curated inputs, compare to expected outputs |
| Integration test | Pipeline validation | Run full workflow end-to-end, verify outputs pass all validation layers |
| Shadow mode | Parallel execution | New agent runs alongside current, outputs compared but not acted on |
| Canary | 10% traffic | Route fraction of queue messages to new version |
| Promote | Full rollout | After 24h canary with no regression |
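
A minimal version of the golden-set gate, assuming expected outputs can be compared exactly (real suites usually need field-level or fuzzy comparison, which tools like promptfoo handle); `run_agent` is a stand-in for whatever invokes the agent under test.

```python
# Golden-set eval sketch: fail the pipeline gate if accuracy drops below threshold.
import json


def run_eval(golden_path: str, run_agent, threshold: float = 0.85) -> bool:
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]          # each line: {"input": ..., "expected": ...}
    passed = sum(1 for c in cases if run_agent(c["input"]) == c["expected"])
    accuracy = passed / len(cases)
    print(f"eval: {passed}/{len(cases)} passed ({accuracy:.1%})")
    return accuracy >= threshold
```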

Prompt Versioning

Every prompt change is versioned and traceable. A prompt change is functionally equivalent to a code change — it alters agent behavior.

| Policy | Rule |
| --- | --- |
| All prompts in Git | Version-controlled, reviewed, auditable |
| Prompt + model pinned together | Changing the model version requires re-running evals |
| Rollback-ready | Previous prompt version deployable in < 5 minutes |
| A/B testing | Route percentage of traffic to prompt variant, measure accuracy difference |
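
One way to make the prompt + model pairing traceable is to stamp every output with the version that produced it. This is a sketch under that assumption; the field names are illustrative.

```python
# Version-stamping sketch: record which prompt revision and model produced each output.
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentVersion:
    agent: str         # e.g. "invoice-processor"
    prompt_sha: str    # git commit of the prompt file
    model: str         # pinned model version string
    eval_run: str      # id of the eval run that approved this prompt + model pair


def stamp(output: dict, version: AgentVersion) -> dict:
    return {**output, "agent_version": asdict(version)}
```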

Prompt engineering is not ad-hoc. It's a deployment artifact with a version, an eval suite, and a rollback plan. Treat it like code.


Team & Getting Started

Building an agentic workforce is not an AI research project. It's a platform engineering effort with domain expertise.

Team

| Role | Count | Scope |
| --- | --- | --- |
| Platform Engineer | 1–2 | Container orchestration, messaging, LLM gateway, observability |
| Prompt Engineer | 1–2 | Agent role design, eval suite curation, validation rule authoring |
| Domain Expert | 1 per workflow | Define what "correct" looks like, review escalations, validate accuracy targets |
| FinOps / Cost Analyst | 0.5 | Model selection for cost optimization, budget configuration |

Getting Started

Don't automate everything at once. Start with one workflow, prove the model, then expand.

| Phase | Duration | Scope |
| --- | --- | --- |
| Pilot | Weeks 1–4 | One workflow, 2–3 agents, shadow mode only |
| Validation | Weeks 5–8 | Measure accuracy, cost, latency against manual baseline |
| Production | Weeks 9–12 | Promote to live traffic, human spot-checks at 10% sample rate |
| Expansion | Ongoing | Add workflows, add agents, reduce human review as confidence grows |

Tooling

| Function | Tool |
| --- | --- |
| Container orchestration | Kubernetes / ECS / Cloud Run |
| Message broker | RabbitMQ / SQS / Pub/Sub |
| LLM gateway | LiteLLM / Portkey |
| Observability | Langfuse / Arize Phoenix |
| Eval framework | promptfoo / custom harness |
| Prompt storage | Git + ConfigMap |
| Cost tracking | Provider dashboards + aggregation |

Start with the boring workflow — the one that's high-volume, well-understood, and low-risk. Invoice processing, email classification, data entry validation. Prove that the container-per-agent model works, that quality checks catch errors, and that the cost math holds. Then scale.
