FAQ (comprehensive guide to metrics, dashboards, and production operations)
Pilots & Scaling
How complex should a pilot be?
A pilot should be small in scope but high in fidelity. Use the same critical integrations, data types, models, policies, and monitoring as production.
Start with 3 scenarios, 1 region, 1 core system, 1 channel. Expand gradually: +2 scenarios, +1 region, +1 integration in v2; realistic volume + real operations in v3.
What is pilot fidelity?
Pilot fidelity is the structural representativeness of the pilot relative to production. A high-fidelity pilot uses the same architecture as production (same integrations,
same messy data, same model family, same policies, same monitoring) but with limited scope. Low-fidelity pilots (mock data, simplified integrations) create false confidence
and don't reveal real production risks.
Why do pilots fail in production?
Pilots fail due to failure cascades: messy inputs → parsing errors → retries → tool load spikes → queues → failed escalations → frustration → bypass → trust loss → value collapse.
Low-fidelity pilots don't expose these cascades. Also common: scope creep that changes success criteria, legacy integration issues (rate limits, permissions, data quality),
and change management failures (users don't adopt because they don't understand "why").
Multi-Agent & Prompting
When should I split an agent into multiple agents?
Split when you have conflicting requirements: e.g., conversational empathy vs strict validation. Use a frontstage agent (conversational, handles questions/clarifications)
and a backstage agent (command parser, strict JSON validation, deterministic rules). An orchestrator handles handoffs, retries, and escalations. This separation lets you optimize each layer independently.
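As a rough sketch of the frontstage/backstage/orchestrator split (Python; all names are illustrative, and the frontstage is stubbed where a real system would call an LLM):

```python
import json

def frontstage(user_text: str) -> str:
    """Conversational layer (stubbed): turns free-form user text into a
    structured command string. In production this would be an LLM call."""
    return '{"action": "cancel_order", "order_id": "123"}'

def backstage(command_json: str) -> dict:
    """Strict layer: parse JSON and validate against deterministic rules."""
    cmd = json.loads(command_json)  # raises on malformed JSON
    if cmd.get("action") not in {"cancel_order", "refund"}:
        raise ValueError(f"unknown action: {cmd.get('action')}")
    return cmd

def orchestrate(user_text: str, max_retries: int = 2) -> dict:
    """Orchestrator: retries the frontstage->backstage handoff, escalates on failure."""
    for _attempt in range(max_retries + 1):
        try:
            return backstage(frontstage(user_text))
        except (ValueError, json.JSONDecodeError):
            continue
    return {"action": "escalate", "reason": "validation failed after retries"}
```

Because each layer has a single job, you can tune the frontstage prompt for tone and the backstage validator for strictness without the two fighting each other.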
Persona vs task-definition prompting: what's the difference?
Persona prompting ("You are a helpful assistant") focuses on tone and style. Task-definition prompting ("Extract these fields: name, date, amount") focuses on structure and validation.
Use persona for frontstage (conversation), task-definition for backstage (execution). Don't mix them—it creates ambiguity.
How do I detect brittle prompts early?
Monitor: (1) retry rate (high retries = prompt ambiguity), (2) clarification rate (agent asking for help = unclear instructions),
(3) tool error rate (invalid args = prompt not enforcing schema), (4) offline eval regression (small changes break many cases = brittle).
Use versioned prompts and A/B test changes on a gold set before production.
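A minimal sketch of computing those brittleness signals from interaction logs (Python; the event schema and threshold values are assumptions, not a standard):

```python
def prompt_health(events: list[dict], thresholds: dict) -> dict:
    """Compute brittleness signals from interaction logs. Hypothetical schema:
    each event carries boolean flags 'retried', 'asked_clarification', 'tool_error'."""
    n = len(events)
    rates = {
        "retry_rate": sum(e["retried"] for e in events) / n,
        "clarification_rate": sum(e["asked_clarification"] for e in events) / n,
        "tool_error_rate": sum(e["tool_error"] for e in events) / n,
    }
    # Flag any rate above its configured threshold for investigation.
    rates["alerts"] = [k for k, v in rates.items() if v > thresholds.get(k, 1.0)]
    return rates
```

Run this per prompt version so an A/B comparison on the gold set shows which version drives the alerts.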
QA/Dev/Prod
Can I use a cheaper model in QA?
Yes, for rapid iteration during development. But final gates must run on production-equivalent model. Strategy: unit tests with cheap model →
nightly regression with sampling on prod model → pre-release full suite on prod model. Track QA/Prod divergence rate (% of cases with different outcomes).
Target: <5% divergence. Higher divergence indicates capability mismatch.
How do I balance cost and realism?
Use a tiered testing strategy: (1) Fast/cheap models for unit tests and rapid iteration, (2) Production model with sampling for nightly regression
(e.g., 10% of test cases), (3) Full production model for pre-release validation. This balances speed (cheap model) with confidence (prod model validation).
Cost control: sample strategically, cache results, use cheaper models for non-critical paths.
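For the nightly-regression tier, a seeded sample keeps the prod-model run cheap and reproducible; a sketch (Python):

```python
import random

def nightly_sample(gold_set: list, fraction: float = 0.10, seed: int = 0) -> list:
    """Deterministically sample a fraction of the gold set for the nightly
    prod-model regression run. Seeding makes failures reproducible."""
    rng = random.Random(seed)
    k = max(1, round(len(gold_set) * fraction))
    return rng.sample(gold_set, k)
```

Rotate the seed daily (e.g. derive it from the date) if you want coverage to drift across the full suite over a week rather than re-testing the same 10%.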
KPIs & Business
How do KPIs map to cost or revenue?
Map every KPI to financial impact: Containment rate ↑ → cost-to-serve ↓ (fewer human tickets), Speed-to-lead ↓ → conversion ↑ → revenue ↑,
Cycle time ↓ → cost avoided (hours saved × rate), AOV lift → revenue ↑ (basket size × orders). Create an executive scorecard with: Cost saved ($/month),
Revenue lift ($/month), ROI (%), Payback period (months).
What's the minimum executive scorecard?
Four metrics: (1) Cost saved ($/month run rate), (2) Revenue lift ($/month attributable ARR), (3) ROI (% year 1),
(4) Payback period (months to recover investment). Add context: trend (↑/↓), target vs actual, and narrative (what changed this month).
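The four scorecard metrics reduce to a few lines of arithmetic; a sketch (Python; the input figures are placeholders, and "investment" is assumed to mean total year-1 cost):

```python
def scorecard(cost_saved_monthly: float, revenue_lift_monthly: float,
              investment: float) -> dict:
    """Minimal executive scorecard: the four headline metrics.
    Monthly figures are run rates; investment is total year-1 cost."""
    monthly_benefit = cost_saved_monthly + revenue_lift_monthly
    annual_benefit = 12 * monthly_benefit
    return {
        "cost_saved_per_month": cost_saved_monthly,
        "revenue_lift_per_month": revenue_lift_monthly,
        "roi_year1_pct": 100 * (annual_benefit - investment) / investment,
        "payback_months": investment / monthly_benefit,
    }
```

For example, $10k/month saved plus $5k/month lifted against a $60k investment gives 200% year-1 ROI and a 4-month payback.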
People & Change
How do I measure adoption and trust?
Use ADKAR metrics: Awareness (% understand "why"), Desire (activated users, champions), Knowledge (training completion, quiz scores),
Ability (task completion, escalation rate), Reinforcement (retention, bypass rate trend, NPS). Also track: trust erosion rate (% going from positive to negative),
bypass rate (usage vs workaround), time-to-recover-trust (after incidents).
How does ADKAR translate to measurable signals?
Awareness → Survey: "I understand why we're using this agent" (% agree). Desire → Activated users, champion nominations, voluntary usage rate.
Knowledge → Training completion %, quiz pass rate, help desk tickets (knowledge gaps). Ability → Task completion rate, escalation rate, time-to-proficiency.
Reinforcement → Retention (30/60/90 days), bypass rate trend, advocacy score (NPS). Measure monthly. If any stage stalls for 2 months, investigate blockers.
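The "stalled for 2 months" rule can be automated over the monthly readings; a sketch (Python; the 0-1 stage scores and tolerance are assumptions):

```python
def stalled_stages(history: dict[str, list[float]], months: int = 2,
                   tolerance: float = 0.01) -> list[str]:
    """Flag ADKAR stages whose monthly metric has not improved over the
    last `months` readings. Values are hypothetical 0-1 stage scores."""
    flagged = []
    for stage, series in history.items():
        if len(series) > months and series[-1] - series[-1 - months] <= tolerance:
            flagged.append(stage)
    return flagged
```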
Governance & Risk
What is Shadow AI, and why do bans fail?
Shadow AI is unofficial AI usage that bypasses security, compliance, and governance. It happens when official tools are slow, hard to use, or restricted.
Bans fail because users find workarounds (personal accounts, corporate SaaS licenses). Instead: (1) Clear policies (what's allowed/restricted and why),
(2) Training (educate on risks), (3) "Safe by default" tools (make approved tools easier than Shadow AI), (4) Monitor exfiltration/DLP (detect and block unauthorized sharing).
What data risks vary by license/provider?
Personal account: High risk (data exfiltration, no audit trail). Corporate SaaS license: Medium risk (some controls, but data leaves org).
Private tenant/self-hosted: Low risk (full control, but more operational overhead). Trade-off: convenience vs security vs data sovereignty.
Use DLP tools to monitor and block unauthorized data sharing regardless of provider.
Scope & Delivery
Is scope creep always bad?
No. Accept scope creep if: (1) Adds reusable value for future clients (makes product more sellable), (2) Reduces future costs (e.g., eliminates future integration),
(3) Doesn't break business case (ROI still positive), (4) Has sponsor approval to absorb trade-offs. Block if: (1) Changes pilot outcome (unclear success criteria),
(2) Increases risk without funding, (3) No sponsor to absorb trade-off, (4) Creates unrecoverable technical debt.
How do I arbitrate scope disputes?
Use a decision framework: (1) Create "A plan / B plan" with costs, risks, timelines (present side-by-side),
(2) Escalate to decision makers when there's no agreement (sponsor or steering committee with clear recommendation),
(3) Document decision, rationale, and trade-offs (update scope doc, communicate to team). Rule: "No" is the start of negotiation—present alternatives, don't just reject.
General Metrics & Operations
How many KPIs should I start with?
Start with 10–12 stable KPIs across value, quality, risk, reliability, and cost. Add more only when you have owners and actions.
Recommended set: Activated users, Task completion, Task success, User helpfulness, Override rate, Citation accuracy (RAG), p95 latency, Tool error rate,
Escalation rate, Policy violations, $/task, Incident MTTR.
What's the fastest way to prove business value?
Pick one workflow with measurable cycle time or cost (support tickets, claims, procurement requests). Instrument timestamps and run an A/B or phased rollout.
Focus on a single outcome metric (e.g., "time to resolution" or "cost per case") and show clear improvement within 4–6 weeks.
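A sketch of the instrumentation (Python; the ticket schema with ISO timestamps is an assumption):

```python
from datetime import datetime
from statistics import median

def median_cycle_time_hours(tickets: list[dict]) -> float:
    """Median time-to-resolution from 'opened'/'closed' ISO timestamps."""
    return median(
        (datetime.fromisoformat(t["closed"]) - datetime.fromisoformat(t["opened"]))
        .total_seconds() / 3600
        for t in tickets
    )

def improvement_pct(control: list[dict], treatment: list[dict]) -> float:
    """A/B comparison: % reduction in median cycle time with the agent."""
    before = median_cycle_time_hours(control)
    after = median_cycle_time_hours(treatment)
    return 100 * (before - after) / before
```

Median (not mean) keeps one pathological ticket from swamping the headline number.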
How do I measure "hallucinations"?
Use a mix: (1) Offline eval on a gold set (human rubric + sampling), (2) Online user feedback + escalations (thumbs down + reason codes),
(3) Citation coverage/accuracy for RAG (% claims supported by cited sources). Track trends and severity. Set thresholds: e.g., >5% hallucination rate = investigate.
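A sketch of the RAG citation-coverage signal and the investigation threshold (Python; the per-claim 'supported' judgment, whether human or automated, is an assumption):

```python
def citation_coverage(claims: list[dict]) -> float:
    """% of answer claims supported by a cited source. Hypothetical schema:
    each claim carries a boolean 'supported' judgment."""
    return 100 * sum(c["supported"] for c in claims) / len(claims)

def needs_investigation(claims: list[dict], threshold_pct: float = 95.0) -> bool:
    """Mirror of the '>5% hallucination rate' rule: investigate when
    supported-claim coverage drops below 95%."""
    return citation_coverage(claims) < threshold_pct
```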
What if my agent uses multiple tools?
Measure per tool: error rate, p95 latency, retries, and compensation actions. Create a "tool health" panel so you can disable a single tool without killing the entire agent.
Track tool-specific metrics: calls per tool, success rate per tool, retry depth per tool, cost per tool.
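A sketch of aggregating per-tool health from call logs (Python; the call-log schema and the naive p95 index are assumptions, not a production percentile estimator):

```python
from collections import defaultdict

def tool_health(calls: list[dict]) -> dict:
    """Per-tool panel from call logs. Hypothetical schema: each call has
    'tool', 'ok', 'latency_ms', 'retries', 'cost'."""
    by_tool = defaultdict(list)
    for c in calls:
        by_tool[c["tool"]].append(c)
    panel = {}
    for tool, rows in by_tool.items():
        lat = sorted(r["latency_ms"] for r in rows)
        panel[tool] = {
            "calls": len(rows),
            "error_rate": sum(not r["ok"] for r in rows) / len(rows),
            "p95_latency_ms": lat[min(len(lat) - 1, int(0.95 * len(lat)))],
            "retries": sum(r["retries"] for r in rows),
            "cost": sum(r["cost"] for r in rows),
        }
    return panel
```

Keying everything by tool name is what makes "disable one tool, keep the agent" possible: the panel tells you which key to flip.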
Do I need a kill switch even for internal agents?
Yes. Internal incidents still create operational and compliance risk. A kill switch is often cheaper than a redeploy and reduces MTTR.
Test kill switches regularly ("rollback just because" drills) to ensure they work under pressure.
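A minimal kill-switch sketch (Python; in production the disabled set would live in a shared feature-flag store, not in-process state):

```python
class KillSwitch:
    """Per-agent/per-tool kill switch with a guarded execution path."""

    def __init__(self):
        self._disabled: set[str] = set()

    def disable(self, name: str) -> None:
        self._disabled.add(name)

    def enable(self, name: str) -> None:
        self._disabled.discard(name)

    def guard(self, name: str, action, fallback):
        """Run `action` unless `name` is disabled; otherwise run `fallback`
        (e.g. escalate to a human queue)."""
        return fallback() if name in self._disabled else action()
```

The drills in the answer above amount to calling `disable()` in production and verifying traffic actually lands on the fallback path.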
Where should metrics live?
Operational metrics: observability stack (Grafana/Datadog/New Relic). Product/value metrics: warehouse + BI (Power BI/Looker).
Governance: factsheet registry with links to dashboards and incidents. Keep them connected: link dashboards to factsheets, link incidents to metrics.