The Lab-to-Production Gap
An 80% success rate in the lab does not mean 80% success in production. This is the single most important lesson in robot policy deployment, and it surprises nearly every team the first time.
Production environments differ from the lab in ways that are invisible until they cause failures: novel lighting conditions (different times of day, seasonal light angle, overhead fixture replacement), wear-induced drift (joint backlash increases after 50,000 cycles, gripper pad wear changes grasp mechanics), object variation (supplier changes product packaging slightly, objects arrive in non-canonical orientations), and context drift (the workspace gets slightly reorganized, a background object is moved). Each of these individually degrades policy performance by 5–20%. Combined, an 80% lab policy can drop to 40% production performance within 3 months.
The solution is not better lab performance — it is building systems that detect degradation, fail safely, and recover automatically.
Pre-Deployment Checklist
Before any policy enters production, it must pass a structured evaluation in conditions designed to probe generalization:
- 3 novel lighting conditions: Test with overhead only, natural light + overhead, and desk lamp positioned differently from training. The policy must achieve ≥70% success in each condition.
- 5 novel object positions: Place target objects at positions not seen during training, including near workspace boundaries. Any position within the declared workspace boundary must be handled.
- 10 distractor objects: Add objects not present during training to the workspace. A well-trained policy should maintain ≥85% of its base success rate with distractors present.
- 100 consecutive trial evaluation: Run 100 trials autonomously overnight. This catches intermittent failures (jammed gripper after 30 cycles, thermal throttling after 45 minutes) that short evaluations miss. Target ≥85% over 100 trials to enter production.
- Edge case scenarios: Explicitly test: object slightly outside nominal pose, gripper partially occluded, arm joint at near-limit position, camera partially obstructed.
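The checklist thresholds above can be encoded as an automated gate so that no one has to eyeball the numbers. This is a minimal sketch; `check_gate` and its argument names are hypothetical, and the rates would come from your own evaluation harness:

```python
# Hypothetical pre-deployment gate encoding the thresholds above.
# `check_gate` is illustrative, not part of any real SVRC API.

def check_gate(lighting_rates, base_rate, distractor_rate, overnight_rate):
    """Return (passed, reasons) against the pre-deployment checklist."""
    reasons = []
    if any(r < 0.70 for r in lighting_rates):
        reasons.append("a novel lighting condition fell below 70% success")
    if distractor_rate < 0.85 * base_rate:
        reasons.append("distractor run fell below 85% of base success rate")
    if overnight_rate < 0.85:
        reasons.append("100-trial overnight run fell below 85% success")
    return (len(reasons) == 0, reasons)

passed, reasons = check_gate(
    lighting_rates=[0.78, 0.74, 0.71],  # three novel lighting conditions
    base_rate=0.88,                     # nominal-condition success rate
    distractor_rate=0.80,               # success with 10 distractors present
    overnight_rate=0.86,                # 100 consecutive autonomous trials
)
```

A policy that fails any single criterion stays out of production; the `reasons` list tells you which evaluation to rerun after a fix.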
Model Serving Infrastructure
Inference latency directly affects robot control rate. For a policy running at 10 Hz (100 ms control period), your inference must complete in <80 ms to leave margin for communication overhead.
- TorchServe: Deploy PyTorch policies as a model archive (.mar). Provides HTTP and gRPC inference endpoints, batching, model versioning, and metrics. Suitable for policies with >50 ms inference time where a dedicated model server is warranted.
- TensorRT: Convert your policy to a TensorRT engine for 3–5× inference speedup on NVIDIA GPUs. An ACT policy that takes 80 ms in PyTorch typically runs in 18–25 ms with TensorRT FP16. Use `trtexec --onnx=policy.onnx --saveEngine=policy.trt --fp16` to convert.
- Latency target: p99 inference latency must be <100 ms. p99 (the 99th percentile) matters more than the mean because the 1% worst-case latency determines your control loop's worst-case jitter. Profile with `torch.profiler` under simulated production load.
- Health check endpoint: Expose a `GET /health` endpoint that runs a dummy inference pass and returns 200 OK with a latency measurement. The robot controller should poll this endpoint at startup and reject deployment if p99 >100 ms.
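The p99 gate can be sketched as a simple timing loop. Here `infer` stands in for your policy's forward pass, and the nearest-rank percentile method is an assumption; in production you would time real inference under representative load:

```python
import time

# Sketch of a p99 latency measurement; `infer` is a placeholder for the
# policy forward pass, and names here are illustrative.

def p99_latency_ms(infer, n_samples=200):
    """Time n_samples inference calls; return the 99th-percentile latency in ms."""
    samples = []
    for _ in range(n_samples):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank index for the 99th percentile.
    idx = max(0, int(round(0.99 * n_samples)) - 1)
    return samples[idx]

# Deployment gate: reject if p99 exceeds the 100 ms budget, e.g.
# assert p99_latency_ms(my_policy_infer) < 100.0
```

Run the measurement with the GPU under the same concurrent load it will see in production; an idle-GPU p99 is usually optimistic.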
Monitoring Strategy
A policy in production without monitoring is a time bomb. Build monitoring from day one, not after the first incident.
- Per-episode success rate: Log success/failure for every episode. Track 7-day rolling average. Alert when the rolling average drops >5% from the deployment baseline.
- Failure classification: When an episode fails, classify the failure mode: grasp failure, placement failure, collision, timeout, or other. Different failure modes indicate different root causes and different fixes.
- Telemetry logging: Log joint positions, velocities, forces, policy confidence scores, and inference latency for every episode. Store for 90 days minimum. This data is essential for root cause analysis and retraining.
- Human review queue: Flag every failed episode for human review within 24 hours. A 5-minute human review per failure catches systematic issues (new object variant, mounting drift) before they cascade.
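The rolling-average alert above reduces to a small amount of state per policy. A minimal sketch, assuming one success/failure record per episode; the window size, class name, and 5% threshold wiring are illustrative:

```python
from collections import deque

# Sketch of the rolling success-rate alert; all names are illustrative.

class SuccessRateMonitor:
    def __init__(self, baseline, window=500, max_drop=0.05):
        self.baseline = baseline            # success rate at deployment time
        self.window = deque(maxlen=window)  # recent episode outcomes (True/False)
        self.max_drop = max_drop            # alert when drop exceeds 5%

    def record(self, success: bool) -> bool:
        """Record one episode; return True if the alert should fire."""
        self.window.append(success)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.window) / len(self.window)
        return (self.baseline - rate) > self.max_drop
```

A count-based window is a stand-in for the 7-day window in the text; if episode volume varies day to day, key the window on timestamps instead.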
Graceful Degradation
A production robot must fail safely. The worst outcome is silent failure — a robot that continues operating while producing bad outputs.
- Confidence score threshold: Many policies output a confidence or certainty estimate alongside the action. If confidence <0.7, pause the robot and alert the operator before proceeding. This prevents catastrophic grasps in novel situations the policy is not confident about.
- Pause and alert: When a pause trigger fires, move the arm to a safe home position, turn on a visual indicator (red status light), and send an alert via the platform to the operator's dashboard and mobile device.
- Fallback to teleop: For high-value or high-risk tasks, implement a teleop fallback where a remote operator takes control via a VR headset or web interface when the policy triggers a pause. The operator completes the episode manually, and the data is logged for retraining.
- Maximum consecutive failure limit: If 5 consecutive episodes fail, automatically suspend the policy and escalate to a senior operator. Do not let a failing policy cycle indefinitely.
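The two triggers above (confidence threshold and consecutive-failure limit) fit in one small guard object. This is a sketch under the thresholds stated in the text; the class and method names are hypothetical:

```python
# Sketch of the graceful-degradation guard: pause on low confidence,
# suspend after 5 consecutive failures. Names are illustrative.

class DegradationGuard:
    RUN, PAUSE, SUSPEND = "run", "pause", "suspend"

    def __init__(self, conf_threshold=0.7, max_consecutive_failures=5):
        self.conf_threshold = conf_threshold
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def on_action(self, confidence: float) -> str:
        """Check before executing each policy action."""
        return self.PAUSE if confidence < self.conf_threshold else self.RUN

    def on_episode_end(self, success: bool) -> str:
        """Check after each episode; escalate after repeated failures."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.max_failures:
            return self.SUSPEND  # escalate to a senior operator
        return self.RUN
```

On `PAUSE` the robot moves to home and alerts the operator; on `SUSPEND` the policy is taken out of rotation entirely until a human clears it.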
Version Management
Treat policy versions like software releases — with staged rollouts and rollback capability.
- A/B testing: When deploying a new policy version, route 10% of tasks to the new version and 90% to the current production version. Compare success rates over 200+ episodes before full rollout. This requires task routing logic in your platform dashboard.
- Canary rollout: After A/B testing shows improvement, roll out to 25% → 50% → 100% of traffic at weekly intervals, with automated rollback if success rate drops >5% at any stage.
- Rollback procedure: Maintain the last 3 production policy versions as deployable artifacts. Rollback to the previous version must be executable in <5 minutes, ideally via a single button in the fleet dashboard.
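The 10/90 A/B split above needs deterministic routing so the same task always hits the same version (otherwise mid-episode version switches corrupt the comparison). A minimal sketch using a hash bucket; the function name and version labels are illustrative:

```python
import hashlib

# Sketch of deterministic A/B routing: hash the task ID into [0, 1) so the
# same task always routes to the same version. Names are illustrative.

def route_version(task_id: str, candidate_fraction=0.10) -> str:
    digest = hashlib.sha256(task_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return "candidate" if bucket < candidate_fraction else "production"
```

Raising `candidate_fraction` from 0.10 to 0.25, 0.50, and 1.0 implements the canary stages without any other routing change.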
Retraining Triggers
Retraining is not a one-time event — it is a continuous process driven by data from production.
- >5% success rate drop: Investigate root cause. If caused by distributional shift (new object, changed workspace), collect 50–200 demonstrations covering the new conditions and fine-tune.
- New task variants: When the business introduces a new SKU, product variant, or workflow change, trigger a data collection campaign before the variant reaches production volume.
- Quarterly refresh: Even without a specific trigger, retrain quarterly incorporating all production failure episodes. This prevents gradual drift accumulation.
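The three triggers above can be checked in priority order by a scheduled job. A sketch with hypothetical names; the thresholds mirror the text:

```python
from datetime import date, timedelta

# Sketch of the retraining-trigger check; `should_retrain` and its
# arguments are illustrative, not a real scheduler API.

def should_retrain(baseline_rate, current_rate, new_variant_pending, last_retrain):
    """Return a reason string if retraining is warranted, else None."""
    if (baseline_rate - current_rate) > 0.05:
        return "success-rate drop: investigate, then fine-tune on 50-200 new demos"
    if new_variant_pending:
        return "new task variant: run a data collection campaign first"
    if date.today() - last_retrain > timedelta(days=90):
        return "quarterly refresh: retrain on accumulated failure episodes"
    return None
```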
Incident Runbook Template
| Phase | Actions | Owner | Time Target |
|---|---|---|---|
| Detect | Alert fires (success rate drop or consecutive failures) | Automated | <5 min |
| Classify | Review failure clips, classify failure mode | On-call operator | <30 min |
| Contain | Suspend affected policy, route tasks to manual/teleop | On-call operator | <15 min |
| Diagnose | Identify root cause: hardware drift, distributional shift, infrastructure issue | ML engineer | <4 hr |
| Resolve | Deploy fix: rollback, hotfix, or retrain | ML engineer | <24 hr |
| Post-mortem | Document cause, impact, fix, and prevention measures | Team lead | <1 week |
Model Serving Comparison: Which Framework to Use
| Framework | Typical Inference (ACT) | Versioning | Rollback | Best For |
|---|---|---|---|---|
| Raw PyTorch | 60-100 ms (RTX 4070) | Manual (file paths) | Manual | Prototyping, single robot |
| TorchServe | 70-110 ms | Built-in model store | API-driven | Multi-model serving, A/B testing |
| TensorRT FP16 | 18-30 ms (RTX 4070) | Manual (engine files) | Manual | Low-latency production, Jetson edge |
| Triton Inference Server | 20-35 ms (with TRT backend) | Model repository | API-driven | Fleet-scale, multi-GPU, mixed models |
| FastAPI + ONNX Runtime | 35-60 ms | Custom | Custom | Simple REST integration, CPU fallback |
| ROS2 Service Node | 65-100 ms | Launch file | Node restart | Native ROS2 integration, single robot |
For a single robot in production, start with TensorRT FP16 conversion for the lowest latency. For fleets of 5+ robots, invest in Triton Inference Server: its model repository and dynamic batching features justify the setup complexity.
Deployment Architecture: Single Robot vs. Fleet
Single robot deployment: The policy runs on the robot's onboard GPU (Jetson Orin, RTX 4060 workstation). Observations flow from cameras and joint encoders directly to the inference process. No network latency in the control loop. This is the simplest and most reliable architecture.
Edge-cloud hybrid (recommended for fleets): Low-level control (safety, joint servo, e-stop) runs on the robot's onboard computer. Policy inference runs on an edge server (one per 5-10 robots) with a high-end GPU. Communication is via a dedicated 1 Gbps LAN with <5 ms latency. The edge server also handles monitoring, logging, and model updates.
Cloud-based inference (not recommended for manipulation): Policy inference runs on a cloud GPU. Network latency adds 20-100 ms to the control loop, making it unsuitable for contact-rich manipulation at 10+ Hz. Only viable for mobile robot navigation or very slow pick-and-place tasks.
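The three architectures differ mainly in how much of the control period the network consumes. A back-of-envelope feasibility check, using figures in the same ranges as above; the numbers and the 20 ms margin are illustrative assumptions:

```python
# Back-of-envelope control-loop budget check for the three architectures.
# Latency figures below are illustrative assumptions, not measurements.

def loop_feasible(control_hz, inference_ms, network_ms, margin_ms=20.0):
    """True if inference + network round-trip fits the control period with margin."""
    period_ms = 1000.0 / control_hz
    return inference_ms + network_ms + margin_ms <= period_ms

# A 10 Hz policy has a 100 ms control period:
onboard = loop_feasible(10, inference_ms=25, network_ms=0)   # onboard GPU
edge    = loop_feasible(10, inference_ms=25, network_ms=5)   # edge server, <5 ms LAN
cloud   = loop_feasible(10, inference_ms=25, network_ms=60)  # cloud round-trip
```

The cloud case fails the budget even with fast inference, which is why cloud inference is only viable for slow, non-contact tasks.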
Rollback Procedure: Step-by-Step
- Step 1 -- Detect degradation: Monitoring alerts on success rate drop >5% or >3 consecutive failures.
- Step 2 -- Suspend current policy: Via the platform dashboard or CLI: `svrc policy suspend --robot arm-001 --policy v2.3`. The robot enters a safe idle state.
- Step 3 -- Activate previous version: `svrc policy activate --robot arm-001 --policy v2.2`. The previous version (stored locally on the robot) loads in <30 seconds.
- Step 4 -- Verify: Run 10 automated trials on the previous version. If the success rate returns to baseline, confirm the rollback is complete.
- Step 5 -- Root cause analysis: Investigate why v2.3 failed. Common causes: training data distribution shift, hyperparameter regression, or infrastructure change (camera calibration drift).
Total rollback time target: <5 minutes from detection to resumed operation on the previous version. Practice the rollback procedure monthly even when no incident occurs.
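The Step 4 verification can be automated. A sketch where `run_trial` is a hypothetical callable that executes one episode and returns success or failure; the 10% tolerance band is an assumption to absorb the noise of a 10-trial estimate:

```python
# Sketch of rollback verification: run a few trials on the rolled-back
# version and compare against baseline. `run_trial` is hypothetical.

def verify_rollback(run_trial, baseline_rate, n_trials=10, tolerance=0.10):
    """True if the rolled-back policy's success rate is near baseline."""
    successes = sum(1 for _ in range(n_trials) if run_trial())
    rate = successes / n_trials
    # With only 10 trials the estimate is noisy, so allow a tolerance band.
    return rate >= baseline_rate - tolerance
```

If verification fails on the previous version too, suspect the environment (camera drift, hardware wear) rather than the policy, and escalate to diagnosis.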
Related Guides
- Remote Fleet Management -- monitoring infrastructure and OTA update procedures
- Curriculum Design for Robot Learning -- training policies that generalize better to production
- Data Collection Service Buyer's Guide -- sourcing high-quality training data for retraining
- Robot Safety Risk Assessment -- safety requirements for production robot deployments
- Warehouse Deployment Checklist -- end-to-end production deployment planning
Work with SVRC
SVRC helps teams bridge the lab-to-production gap with infrastructure, monitoring, and ongoing data collection support.
- Data Platform -- policy monitoring dashboards, A/B testing, fleet management, and automated rollback
- Data Collection Services -- rapid retraining data collection when production policies degrade
- Robot Leasing -- production-ready robot systems with integrated monitoring and maintenance
- Contact Us -- schedule a deployment readiness review with our engineering team