What Mobile ALOHA Is
Mobile ALOHA, developed at Stanford by Zipeng Fu, Tony Zhao, and Chelsea Finn, extends the original ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) by mounting the bimanual arm system on a differential-drive mobile base. This enables whole-body teleoperation: an operator controls both arms and the base simultaneously, demonstrating tasks that require locomotion alongside manipulation, such as opening doors, pushing carts, cleaning tables while moving, and navigating between rooms.
The key research contribution of Mobile ALOHA is demonstrating that co-training on diverse static ALOHA data plus a small number of mobile demonstrations (as few as 50 episodes) produces policies that generalize surprisingly well to mobile manipulation tasks. This means the expensive mobile demonstrations are supplemented by cheaper static bimanual data, making the approach more practical than it initially appears.
Mobile ALOHA has become one of the most replicated research platforms in robot learning. Multiple labs, companies, and maker groups have built variants, and the original hardware design is fully open source. However, the build process involves more complexity than the paper suggests, and this guide covers the practical details that the academic publication omits.
Hardware Bill of Materials
A complete Mobile ALOHA system requires four categories of hardware: the mobile base, the manipulation arms (leader and follower), the camera system, and the compute stack. Here is the detailed BOM with 2026 pricing.
Mobile base:
- AgileX Tracer differential-drive base: $4,500-5,500 depending on configuration. This is the platform used in the original paper. Alternatives include the AgileX Scout Mini ($6,800) for higher payload or the Clearpath Jackal ($20,000+) for research-grade odometry, but the Tracer is the standard choice for cost-constrained builds.
- Custom mounting frame (aluminum extrusion 80/20 or similar): $300-600 for materials, plus machining. The frame must rigidly couple the arm bases to the mobile platform and provide mounting points for cameras and the compute box.
Follower arms (the robot arms that execute tasks):
- 2x Trossen Robotics ViperX 300 S2 6-DOF arms: $4,800 each, $9,600 total. These are the standard follower arms for ALOHA builds. They use Dynamixel XM/XH series servos with position, velocity, and current (torque) feedback. Payload is 750g at full extension, which limits the weight of objects the system can manipulate.
- 2x custom gripper assemblies: $200-400 each. The standard ALOHA gripper is a simple parallel-jaw gripper with a Dynamixel XL330 servo. 3D-printed finger pads are adequate for most tasks.
Leader arms (held by the operator during teleoperation):
- 2x Trossen Robotics WidowX 250 S 6-DOF arms: $3,100 each, $6,200 total. The WidowX is lighter (0.53 kg) and shorter-reach than the ViperX, making it comfortable for the seated or standing operator to hold during multi-hour data collection sessions. Same Dynamixel servo family ensures transparent kinematic mapping.
- Leader arm mounting brackets: $100-200. Mounted at waist height on the mobile platform frame so the operator walks behind the platform while holding the leader arms.
Camera system:
- 2x Intel RealSense D405 wrist cameras: $300 each, $600 total. Mounted on the follower arm wrists for close-range manipulation views.
- 1x Intel RealSense D435 overhead camera: $350. Mounted on the frame mast for a top-down workspace view.
- Camera mounts and USB cables: $100-150.
Compute:
- Onboard workstation: Intel NUC 13 Pro or equivalent mini-PC with i7, 32GB RAM, 1TB NVMe SSD: $800-1,200. This handles real-time teleoperation control, camera capture, and data recording. It does not need a GPU; training happens offline.
- Training workstation (separate, not on the robot): Any desktop or cloud instance with an NVIDIA RTX 4090 or A100 for ACT/Diffusion Policy training. Budget $2,000-3,000 for a local training machine, or use cloud GPU instances at $1-4/hour.
Total Cost Breakdown
| Category | Cost Range |
|---|---|
| Mobile base (AgileX Tracer) | $4,500-5,500 |
| Follower arms (2x ViperX 300 S2) | $9,600 |
| Leader arms (2x WidowX 250 S) | $6,200 |
| Grippers and mounting | $700-1,200 |
| Camera system (3x RealSense) | $1,050-1,100 |
| Onboard compute | $800-1,200 |
| Frame, cables, misc hardware | $500-800 |
| Total (robot only) | $23,350-25,600 |
| Training workstation (separate) | $2,000-3,000 |
| Total (complete system) | $25,350-28,600 |
This is the total cost to build one Mobile ALOHA system from scratch. The original Stanford paper cited approximately $32,000 for their specific configuration; the difference reflects 2026 component pricing and using the base Tracer rather than the higher-end configurations. Note that this does not include operator labor for data collection, which is the dominant ongoing cost in any imitation learning project.
Software Stack: ROS2, ACT, and LeRobot
The Mobile ALOHA software stack has three layers, each with a specific role.
Real-time control layer (ROS2 Humble on Ubuntu 22.04). The low-level control runs as ROS2 nodes: a Dynamixel driver node for each arm, a base driver node for the AgileX platform, and camera driver nodes for each RealSense camera. The critical requirement is that all nodes share a synchronized clock (use chrony or PTP) and that the leader-to-follower command loop runs at 50 Hz with less than 10 ms latency. The Interbotix ROS2 driver packages provide the arm drivers; the AgileX ROS2 package provides the base driver.
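The 50 Hz / sub-10 ms budget is easy to sanity-check outside ROS2 with a plain-Python timing harness. A minimal sketch, where `step_fn` is a stand-in for one leader-read/follower-write cycle (the real loop runs inside rclpy nodes):

```python
import time

def run_fixed_rate(step_fn, rate_hz=50.0, n_cycles=100):
    """Call step_fn at a fixed rate; return per-cycle deadline overshoot in ms."""
    period = 1.0 / rate_hz
    start = time.monotonic()
    overshoots_ms = []
    for i in range(n_cycles):
        step_fn()                                  # one read-leader/write-follower cycle
        deadline = start + (i + 1) * period
        now = time.monotonic()
        overshoots_ms.append(max(0.0, (now - deadline) * 1000.0))
        if now < deadline:
            time.sleep(deadline - now)
    return overshoots_ms
```

If the maximum overshoot regularly exceeds roughly 10 ms once your real read/write cycle is plugged in, the loop is overloaded and the operator will feel the lag.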
Teleoperation and recording layer. A recording node subscribes to all joint state topics and camera image topics, timestamps them against a shared clock, and writes synchronized episodes to HDF5 files. Each episode contains: 14-DOF joint positions (7 per arm) at 50 Hz, 14-DOF joint velocities, gripper apertures, three camera streams at 30 fps, base velocity commands, and episode metadata (task label, success flag, operator ID). LeRobot from Hugging Face provides standardized recording scripts for ALOHA-style hardware that handle this synchronization correctly.
Training layer (offline, on the training workstation). ACT (Action Chunking with Transformers) is the standard training algorithm for ALOHA data. ACT predicts a chunk of future actions (typically 100 timesteps) from the current observation, using a transformer encoder-decoder architecture with a CVAE (conditional variational autoencoder) for action prediction. Training takes 4-8 hours on a single RTX 4090 for a 200-episode dataset. LeRobot provides the training pipeline with sensible defaults for ACT, Diffusion Policy, and TDMPC2.
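At inference time, ACT's overlapping chunk predictions are usually blended with a temporal ensemble (the `temporal_agg` option in ACT-style pipelines). A minimal numpy sketch of that blending, using an ACT-style exponential weighting in which the oldest prediction for a timestep gets the highest weight; `m` is the decay hyperparameter:

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every chunk prediction that covers timestep t.

    chunks: list of (t_start, actions) pairs, where actions has shape (K, dim).
    """
    covering = [(t - t0, a[t - t0]) for t0, a in chunks if 0 <= t - t0 < len(a)]
    covering.sort(key=lambda pair: -pair[0])        # oldest prediction first
    acts = np.stack([a for _, a in covering])
    w = np.exp(-m * np.arange(len(acts)))           # oldest gets highest weight
    w /= w.sum()
    return (acts * w[:, None]).sum(axis=0)
```

With `m=0` this reduces to a plain average of all predictions covering timestep `t`; larger `m` trusts earlier predictions more, which smooths the executed trajectory.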
First Tasks to Learn
Start with these tasks in order of difficulty. Each builds skills needed for the next.
Task 1: Stationary bimanual handover. The robot picks up an object with one arm and hands it to the other arm. Base remains stationary. This validates bimanual calibration and coordination without adding base motion complexity. Target: 50 demonstrations, 60-70% policy success rate on first training run.
Task 2: Table bussing (clearing a table). The robot picks up objects from a table and places them in a bin on the robot platform, then drives to a drop-off location. This introduces base motion alongside manipulation. The co-training technique matters here: augment your 50 mobile demonstrations with 200-300 static bimanual demonstrations from the handover task. Target: 50 mobile demonstrations plus co-training data, 50-60% success rate.
Task 3: Opening doors. Approach a door, grasp the handle, pull/push while coordinating base motion. This is a canonical Mobile ALOHA task that demonstrates whole-body coordination: the base must move in sync with the arm as the door swings. This is substantially harder than table bussing because the contact dynamics change throughout the trajectory. Target: 100 demonstrations, 40-50% success rate initially.
Task 4: Object handover to a human. The robot navigates to a person, extends an arm, and releases an object when the person grasps it. This requires detecting human presence and timing the release. Target: 75 demonstrations with varied human positions, 45-55% success rate.
Common Setup Mistakes
These are the mistakes SVRC sees most frequently in labs building their first Mobile ALOHA system.
- Skipping leader arm gravity compensation. Without gravity compensation, the leader arms feel heavy to the operator. Operator fatigue sets in after 30 minutes, and data quality degrades severely. Configure Dynamixel current limits to 30-50% of rated torque in gravity compensation mode. Test by holding the leader arm at various poses; it should feel nearly weightless.
- WiFi for leader-follower communication. Routing the leader-follower control loop through WiFi introduces 20-100 ms of variable latency. The system feels sluggish and the operator overcompensates, producing jerky demonstrations. Use direct USB-to-Dynamixel connections on the onboard computer. The leader and follower arms must be on the same physical machine.
- Ignoring camera synchronization. If wrist cameras and the overhead camera are not synchronized, the recorded observations contain temporal misalignment. At 30 fps, a 2-frame misalignment is 66 ms, which is significant for fast manipulation. Use hardware trigger synchronization (RealSense multi-camera sync module, $50) or at minimum timestamp-based alignment during data loading.
- Not locking the base during static tasks. If you are collecting manipulation-only data to augment mobile demonstrations (the co-training approach), engage the base motor brake or use software velocity limits to prevent the base from drifting. Base drift during static collection adds noise to your dataset without providing useful mobility information.
- Insufficient cable management. Loose cables get caught on furniture, catch on arm joints during motion, and occasionally disconnect during episodes. A disconnected camera mid-episode corrupts the entire episode. Use cable chains or spiral wrap on all moving cables and verify cable routing at the start of every collection session.
- Collecting data with inconsistent task definitions. "Clean the table" is too vague. "Pick up the blue cup from position A and place it in the gray bin" is specific enough. Inconsistent demonstrations within a task label confuse the policy. Write a task specification document before collecting a single episode.
Step-by-Step Assembly Guide
This section covers the physical assembly sequence. Budget 2-3 full days for assembly if you have experience with robotics hardware, or 4-5 days if this is your first build.
Step 1: Base platform preparation (2-4 hours). Unbox the AgileX Tracer and verify all components against the packing list. Charge the battery fully before proceeding (4-6 hours from empty). Mount the Tracer on a flat surface and verify drive motion (forward, reverse, and in-place rotation) by running the AgileX diagnostic tool. Common issue: the Tracer ships with firmware that may need updating before the ROS2 driver works. Flash the latest firmware from the AgileX GitHub repository before proceeding.
Step 2: Mounting frame construction (4-8 hours). Cut aluminum extrusion (80/20 series 10 or equivalent) to the dimensions specified in the Stanford ALOHA CAD drawings. The frame has three critical mounting surfaces: two arm base plates (one left, one right, spaced 40 cm apart center-to-center) and one vertical mast for the overhead camera. All arm mounting surfaces must be co-planar to within 1 mm. Use a machinist's level during assembly. Torque all frame bolts to 8-12 Nm. Under-torqued bolts allow frame flex during arm motion, which introduces variable kinematic offsets that degrade data quality.
Step 3: Follower arm installation (3-4 hours per arm). Mount each ViperX 300 S2 to its base plate using the provided M6 bolts. Torque to 6 Nm. Connect the U2D2 (USB-to-Dynamixel adapter) to the arm's daisy-chained servo bus. Run the Dynamixel Wizard to verify communication with all 7 servos per arm (6 arm joints + 1 gripper). If any servo fails to respond, check the TTL wiring for loose connectors at each joint. Set each servo to current-based position control mode. Configure the PID gains to the Trossen-recommended defaults: P=800, I=0, D=0 for position control; adjust only after verifying basic operation.
Step 4: Leader arm installation (2-3 hours per arm). Mount each WidowX 250 S on the leader arm brackets at waist height. The leader arms must be positioned so the operator can comfortably hold them while walking behind the platform. Typical mounting height: 90-100 cm from ground. Configure the leader arm servos in Current-Based Position mode for gravity compensation. Set current limits to 40% of rated torque for comfortable operation. Test gravity compensation by releasing each leader arm at various poses -- it should remain roughly stationary, drifting less than 5 degrees over 30 seconds.
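The 40% current limit translates to a Goal Current register value. A sketch of the conversion -- the 2.69 mA/LSB unit is the XM-series convention, but `RATED_CURRENT_MA` below is a placeholder assumption; look up the actual rated current for your specific servo model in its datasheet:

```python
CURRENT_UNIT_MA = 2.69           # Goal Current LSB for Dynamixel XM-series servos
RATED_CURRENT_MA = 2300          # placeholder -- replace with your servo's datasheet value

def goal_current_register(torque_fraction):
    """Goal Current register value for a given fraction of rated torque
    (torque is roughly proportional to current for these servos)."""
    return int(torque_fraction * RATED_CURRENT_MA / CURRENT_UNIT_MA)
```

For example, 40% of a 2300 mA rated current maps to a register value of 342.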
Step 5: Gripper assembly (1-2 hours per gripper). Assemble the parallel-jaw grippers from the provided parts or 3D-print custom finger pads. The gripper servo (Dynamixel XL330) connects to the same bus as the arm servos. Set gripper operating range: 0 (fully closed) to 1200 (fully open) in Dynamixel position units. 3D-print finger pads in TPU (flexible filament) for improved grip on smooth objects. Pad thickness: 2-3 mm provides good compliance without excessive deformation.
Step 6: Camera installation (2-3 hours). Mount the two D405 wrist cameras on the follower arm wrists using the provided or custom brackets. The D405 has a minimum depth range of 7 cm, making it suitable for close-range manipulation. Aim each wrist camera 30-45 degrees downward relative to the gripper plane for optimal coverage of the grasp zone. Mount the D435 overhead camera on the mast at 80-120 cm above the workspace. Angle it 45-60 degrees from vertical for a clear view of both arm workspaces. Secure all camera USB cables with cable clamps at 15-20 cm intervals along the arm.
Step 7: Compute installation (1-2 hours). Mount the Intel NUC or equivalent on the frame base plate, secured with anti-vibration mounts (essential -- base motion transmits vibration that can loosen USB connections). Connect all USB devices: 2x U2D2 (follower arms), 2x U2D2 (leader arms), 3x RealSense cameras, 1x AgileX base. Use a powered USB 3.0 hub for the cameras to ensure sufficient bandwidth. Verify all devices enumerate correctly with lsusb.
ROS2 Software Stack Setup
The software stack requires Ubuntu 22.04 with ROS2 Humble. Do not use Ubuntu 24.04 or ROS2 Iron/Jazzy -- the Interbotix packages have not been fully validated on newer versions as of 2026.
Base OS installation:
```bash
# Install ROS2 Humble (follow official docs, then):
sudo apt install ros-humble-desktop python3-colcon-common-extensions

# Install Interbotix ROS2 packages for arm control
curl 'https://raw.githubusercontent.com/Interbotix/interbotix_ros_manipulators/main/interbotix_ros_xsarms/install/amd64/xsarm_amd64_install.sh' > xsarm_install.sh
chmod +x xsarm_install.sh && ./xsarm_install.sh -d humble

# Install RealSense SDK and ROS2 wrapper
sudo apt install ros-humble-librealsense2* ros-humble-realsense2-camera

# Install AgileX ROS2 driver
cd ~/colcon_ws/src
git clone https://github.com/agilexrobotics/agx_sdk_ros2.git
cd ~/colcon_ws && colcon build --symlink-install

# Install LeRobot for data recording and training
pip install lerobot
```
Configuration and calibration:
- Set Dynamixel baud rate to 1M (1000000) on all servos using Dynamixel Wizard. The default 57600 baud is too slow for 50 Hz control of 7 servos per arm.
- Calibrate servo zero positions: move each arm to its mechanical home position and record the encoder offsets. These offsets go in the Interbotix YAML config file for each arm.
- Configure the leader-follower mapping: each leader joint maps 1:1 to the corresponding follower joint. The Interbotix teleoperation example provides this mapping out of the box, but verify that all 7 joints (including gripper) track correctly before collecting data.
- Set up time synchronization: install chrony and configure all nodes to use the same system clock. Camera timestamps must be within 5 ms of joint state timestamps for clean data. Run `ros2 topic echo /clock` to verify synchronization.
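The baud-rate requirement above can be sanity-checked with rough serial-bus arithmetic. A sketch, assuming an average Dynamixel Protocol 2.0 packet of about 25 bytes per transaction (an illustrative figure, not a measured one) and 10 wire bits per byte for 8N1 framing:

```python
def bus_cycle_ms(n_servos, bytes_per_txn=25, baud=57600):
    """Rough time to exchange one read/write packet pair per servo, in ms."""
    bits_on_wire = n_servos * 2 * bytes_per_txn * 10   # 10 wire bits per byte (8N1)
    return bits_on_wire / baud * 1000.0
```

At 57600 baud, polling 7 servos takes roughly 60 ms per cycle -- three times the 20 ms budget at 50 Hz. At 1 Mbaud the same traffic takes about 3.5 ms, and sync read/write instructions reduce it further.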
Camera Calibration Procedure
Camera calibration is not optional. Un-calibrated cameras produce spatial misalignment in the recorded data that degrades policy performance by 10-25%.
Intrinsic calibration: The RealSense cameras ship with factory calibration, but this calibration can degrade over time or after physical impacts. Verify intrinsic calibration by running the RealSense self-calibration tool. If reprojection error exceeds 0.5 pixels, re-run on-chip calibration. For higher accuracy, use the camera_calibration ROS2 package with a 9x6 checkerboard (25mm squares) to compute custom intrinsics. Collect at least 30 checkerboard images covering the full field of view.
Extrinsic calibration (camera-to-base transform): The position and orientation of each camera relative to the robot base frame must be measured accurately. For the wrist cameras, this is primarily determined by the arm's forward kinematics plus the camera mounting offset. Measure the camera mounting offset with calipers (position accurate to 1mm, orientation accurate to 2 degrees). For the overhead camera, use an ArUco marker placed at a known position in the robot base frame and solve the PnP problem to compute the camera-to-base transform. Verify calibration by commanding the arm to a known position and checking that the camera projection of the end-effector matches the expected pixel location to within 5 pixels.
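The final verification step amounts to a pinhole projection. A numpy sketch -- the intrinsic matrix `K` and the transform `T_cam_from_base` in the usage example are illustrative values, not a calibration for any real camera:

```python
import numpy as np

def project_to_pixels(p_base, T_cam_from_base, K):
    """Project a 3D point in the robot base frame into pixel coordinates."""
    p_cam = T_cam_from_base @ np.append(p_base, 1.0)     # homogeneous transform
    u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]
    v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
    return np.array([u, v])
```

Command the end-effector to a known base-frame position, detect it in the image, and compare against the projected pixel; a gap beyond ~5 pixels indicates a stale extrinsic.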
Multi-camera synchronization: Connect the RealSense D435 and D405 cameras using the Intel Multi-Camera Sync Cable (available separately, approximately $50). Configure one camera as the master and the others as slaves. This ensures all cameras capture frames simultaneously, eliminating the temporal misalignment that otherwise causes 30-60 ms jitter between camera streams. Without hardware sync, you must use timestamp-based alignment in your data loading pipeline, which is less reliable.
First Data Collection: The Validation Task
Before attempting any real research task, collect a validation dataset to verify the entire pipeline works end-to-end.
Task definition: Pick up a tennis ball from the center of the table and place it in a bowl 30 cm to the right. This is the simplest bimanual task: one arm picks up the ball while the other arm holds the bowl steady. If your system cannot achieve 80% success on this task with 50 demonstrations, there is a hardware or calibration issue that must be fixed before proceeding.
Collection procedure:
- Place the tennis ball at the same position for the first 20 demonstrations (fixed-position validation).
- Vary the ball position within a 20 cm radius for the next 30 demonstrations (position diversity).
- Each demonstration should take 15-30 seconds of active teleoperation. Reject episodes where the ball is dropped, the placement misses the bowl, or the operator makes a corrective movement longer than 2 seconds.
- Record all data using LeRobot's `record` script: `python lerobot/scripts/control_robot.py record --robot-path lerobot/configs/robot/aloha.yaml --fps 50 --repo-id your-username/aloha-validation`
Training and evaluation: Train ACT on the 50 demonstrations using LeRobot defaults. Training should complete in 1-2 hours on a single RTX 4090. Deploy the policy and run 20 evaluation trials with the ball at positions seen during training. Target: 60%+ success rate. If below 40%, check the following before collecting more data: camera calibration accuracy, leader-follower tracking latency, timestamp synchronization, and data recording completeness (no dropped frames).
Torque Specifications and Safety
Overtightened bolts strip threads in aluminum extrusion. Undertightened bolts allow the frame to shift during motion. Follow these specifications:
| Connection | Torque (Nm) | Notes |
|---|---|---|
| Frame extrusion joints (M8) | 10-12 | Use threadlocker on vibrating joints |
| Arm base mounting (M6) | 6 | Check monthly for loosening |
| Camera mount (M4) | 2-3 | Easy to strip; use calibrated driver |
| Gripper finger pads (M3) | 1-1.5 | 3D-printed parts; low torque to avoid cracking |
| Base-to-frame (M8) | 12-15 | Critical for base stability; double-check after first drive test |
Safety considerations: The ViperX arms have sufficient torque to cause injury. Always implement software joint limits (set in the Interbotix YAML config) to prevent the arms from contacting the operator, the frame, or each other. Set velocity limits to 1.0 rad/s during initial testing and increase to 1.5 rad/s only after verifying collision-free operation. Implement an emergency stop button (hardware e-stop connected to the Dynamixel power supply) that cuts servo power immediately. Test the e-stop before every data collection session.
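Software limits reduce to a clamp applied to every outgoing command before it reaches the servo. A minimal per-joint sketch (at 50 Hz, a 1.0 rad/s velocity limit corresponds to a per-tick step of 0.02 rad):

```python
def clamp_command(q_cmd, q_min, q_max, q_prev, max_step):
    """Apply a position limit, then a per-tick velocity limit, to one joint."""
    q = min(max(q_cmd, q_min), q_max)              # software joint limit
    step = min(max(q - q_prev, -max_step), max_step)
    return q_prev + step                           # rate-limited target
```

For example, a runaway command of 2.0 rad against a 1.5 rad limit, starting from 0.0 rad, advances only 0.02 rad this tick.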
Maintenance and Troubleshooting
Regular maintenance prevents data collection downtime. Follow this schedule:
- Before each session (5 min): Check cable connections, verify camera feeds, test leader-follower tracking on 3 poses, check battery level (minimum 40% charge for a 2-hour session).
- Weekly: Check all frame bolts for loosening. Clean camera lenses with microfiber cloth. Verify servo temperatures after a 1-hour run are below 60 °C (check via Dynamixel Wizard). Back up collected data to a second drive.
- Monthly: Recalibrate camera extrinsics (the overhead camera mount can drift 1-2 mm over a month of operation). Inspect Dynamixel servo cables for wear at joints. Update ROS2 packages if security patches are available.
- Common troubleshooting: If a servo stops responding mid-session, it has likely overheated. Wait 10 minutes for cooling. If the issue persists, check the TTL cable at that joint. If camera frames are dropping, check USB bandwidth -- three RealSense cameras require USB 3.0, and running through a USB 2.0 hub will cause frame drops.
Related Reading
Imitation Learning Guide · ACT vs. Diffusion Policy · Robot Camera Setup · LeRobot Getting Started · Cost Per Demonstration Analysis · Data Annotation Guide · Data Services
Data Recording Configuration
Correct data recording configuration determines whether your collected demonstrations will actually train a usable policy. Misconfigured recording produces data that looks fine during review but fails during training due to synchronization errors, incorrect action spaces, or missing modalities.
Recording frequency. Record joint states at 50 Hz (the standard for ALOHA-class systems). Camera frames at 30 fps. If using force-torque sensors (optional but recommended for contact-rich tasks), record at 500 Hz and downsample to 50 Hz during data loading to match the joint state frequency. LeRobot's recording scripts handle these frequencies by default when configured for ALOHA hardware.
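Downsampling a 500 Hz force-torque stream to 50 Hz is a 10:1 block average. A sketch, assuming the FT data arrives as an `(N, 6)` array (three force plus three torque channels):

```python
import numpy as np

def downsample_ft(ft, factor=10):
    """Block-average an (N, 6) force-torque stream by `factor` (500 Hz -> 50 Hz)."""
    n = (len(ft) // factor) * factor            # drop any trailing partial block
    return ft[:n].reshape(-1, factor, ft.shape[-1]).mean(axis=1)
```

Block averaging also acts as a crude anti-aliasing filter, which plain decimation (`ft[::10]`) does not.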
Action space configuration. ACT expects absolute joint position targets as actions. Each action vector is 14-dimensional: 7 joints per arm (6 arm joints + 1 gripper). The joint positions are in radians, and the gripper aperture is in Dynamixel position units (0-4095, mapped to the physical range). Ensure that the action recorded at timestep t corresponds to the joint position the follower arm was commanded to at timestep t, not the position it was measured at. This distinction matters because servo tracking lag means the measured position is always slightly behind the commanded position.
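The effect is easy to see with a toy first-order servo model -- the tracking gain `alpha` here is an arbitrary illustration, not a measured ViperX parameter. The measured trajectory visibly lags the commanded one, so labeling actions with measured positions would bake that lag into the training targets:

```python
import numpy as np

cmd = np.sin(np.linspace(0, 2 * np.pi, 100))    # commanded joint trajectory
meas = np.zeros_like(cmd)
alpha = 0.3                                     # per-tick tracking gain (illustrative)
for t in range(1, len(cmd)):
    meas[t] = meas[t - 1] + alpha * (cmd[t] - meas[t - 1])

lag_error = float(np.abs(cmd - meas).max())     # peak commanded-vs-measured gap
```

With these toy numbers the peak gap is on the order of 0.1-0.2 rad -- far larger than the precision a grasping policy needs.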
Episode metadata. Each HDF5 episode file should contain:
- `/action`: shape (T, 14), dtype float32 -- commanded joint positions at each timestep
- `/observations/qpos`: shape (T, 14), dtype float32 -- measured joint positions
- `/observations/qvel`: shape (T, 14), dtype float32 -- measured joint velocities
- `/observations/images/top`: shape (T, 480, 640, 3), dtype uint8 -- overhead camera
- `/observations/images/left_wrist`: shape (T, 480, 640, 3), dtype uint8
- `/observations/images/right_wrist`: shape (T, 480, 640, 3), dtype uint8
- `/observations/effort`: shape (T, 14), dtype float32 -- servo current readings (optional, useful for contact-aware policies)
- `/base_action`: shape (T, 2), dtype float32 -- base linear and angular velocity commands
- Attributes: `task_name` (string), `success` (bool), `operator_id` (string), `timestamp` (ISO format)
```python
# Verify a recorded episode for completeness
import h5py
import numpy as np

# Arm joint columns only: dims 6 and 13 are the grippers, which may use a
# different encoding (e.g. raw Dynamixel units) than radians.
ARM_JOINT_COLS = list(range(6)) + list(range(7, 13))

def verify_episode(filepath):
    with h5py.File(filepath, 'r') as f:
        T = f['/action'].shape[0]
        actions = f['/action'][:]
        checks = {
            'action_shape': f['/action'].shape == (T, 14),
            'qpos_shape': f['/observations/qpos'].shape == (T, 14),
            'images_top': f['/observations/images/top'].shape[0] == T,
            'images_lwrist': f['/observations/images/left_wrist'].shape[0] == T,
            'images_rwrist': f['/observations/images/right_wrist'].shape[0] == T,
            'no_nan_actions': not np.any(np.isnan(actions)),
            'joint_limits': bool(np.all(np.abs(actions[:, ARM_JOINT_COLS]) < 3.14)),
            'episode_length': 50 < T < 3000,  # 1-60 sec at 50 Hz
        }
    for check, passed in checks.items():
        status = 'PASS' if passed else 'FAIL'
        print(f'  {check}: {status}')
    return all(checks.values())
```
Common recording errors:
- Timestamp drift: If using software timestamps rather than hardware sync, verify that the camera frame timestamps and joint state timestamps remain aligned within 10 ms across the entire episode. Drift accumulates over long episodes. Run a synchronization check after every 100 episodes.
- Dropped camera frames: USB bandwidth contention causes frame drops when three RealSense cameras share a USB 3.0 hub. Verify that the camera frame count equals the expected count (episode_duration * 30fps +/- 1 frame). If frames are missing, the HDF5 episode will have misaligned observation-action pairs that corrupt training.
- Gripper action encoding errors: The gripper action must use the same encoding as the training pipeline. ACT expects gripper position in the same coordinate space as the arm joints (radians equivalent or normalized 0-1). A mismatch between recording and training encoding is the single most common cause of "the policy works but the gripper never closes."
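When hardware sync is unavailable, timestamp-based alignment reduces to nearest-neighbor matching. A numpy sketch that pairs each 50 Hz joint-state timestamp with the closest 30 fps camera frame and flags pairs that fall outside the 10 ms tolerance:

```python
import numpy as np

def align_frames(joint_ts, cam_ts, tol_s=0.010):
    """Index of the nearest camera frame for each joint timestamp, plus a
    mask marking pairs within tolerance. Both inputs must be sorted."""
    idx = np.searchsorted(cam_ts, joint_ts)
    idx = np.clip(idx, 1, len(cam_ts) - 1)
    prev = idx - 1
    use_prev = (joint_ts - cam_ts[prev]) < (cam_ts[idx] - joint_ts)
    nearest = np.where(use_prev, prev, idx)
    ok = np.abs(cam_ts[nearest] - joint_ts) <= tol_s
    return nearest, ok
```

Note that at 30 fps the worst-case gap to the nearest frame is ~16.7 ms, so even perfect nearest-neighbor matching leaves some pairs outside a 10 ms tolerance -- which is exactly why hardware sync is preferred.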
Training Your First Policy
After collecting your validation dataset, training an ACT policy with LeRobot follows a straightforward process. Here is the annotated workflow.
```bash
# Step 1: Push your dataset to HuggingFace Hub (or train locally)
# Assumes data was recorded with LeRobot's record script
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir data/aloha_validation \
    --repo-id your-username/aloha-validation \
    --raw-format aloha_hdf5

# Step 2: Train ACT policy
python lerobot/scripts/train.py \
    policy=act \
    dataset_repo_id=your-username/aloha-validation \
    env=aloha \
    training.num_epochs=2000 \
    training.batch_size=8 \
    policy.chunk_size=100 \
    policy.kl_weight=10 \
    policy.n_obs_steps=1 \
    wandb.enable=true

# Step 3: Evaluate on the real robot
python lerobot/scripts/control_robot.py record \
    --robot-path lerobot/configs/robot/aloha.yaml \
    --fps 50 \
    --policy-path outputs/train/act_aloha_validation/checkpoints/last/pretrained_model \
    --warmup-time-s 2 \
    --episode-time-s 30 \
    --num-episodes 20
```
Training hyperparameters that matter most:
- `chunk_size=100`: Predicts 100 future timesteps (2 seconds at 50 Hz). Larger chunks reduce compounding error but require the task trajectory to be consistent across demonstrations. Start with 100 and reduce to 50 if the policy oscillates.
- `kl_weight=10`: Controls the CVAE regularization. Higher values (50-100) produce smoother but less precise actions. Lower values (1-5) produce sharper actions but risk mode collapse. Sweep [1, 5, 10, 50] on your first task.
- `batch_size=8`: Standard for single-GPU training on an RTX 4090. Increase to 16-32 if using an A100 with 80 GB VRAM for faster convergence.
- `num_epochs=2000`: ACT typically converges around 1000-1500 epochs. Watch the validation loss -- if it plateaus for 200+ epochs, training is done.
Expected training metrics: Action MSE loss should decrease from ~0.01 to ~0.001 over the first 500 epochs, then gradually to ~0.0005 by convergence. If loss does not decrease below 0.005, check for data quality issues (inconsistent demonstrations, misaligned timestamps). KL divergence loss should stabilize at 1-5 nats. If KL divergence is near zero, increase the KL weight -- the CVAE is collapsing to a single mode.
Scaling Up: From Validation to Research Tasks
Once your validation task (tennis ball pick-and-place) achieves >60% success, you can confidently scale to more complex tasks. Key principles for scaling:
- Increase diversity, not just volume. 200 demonstrations with varied object positions (5 cm grid across the workspace) and 3-5 different objects will train a better policy than 500 demonstrations with a single object in a fixed position. Design your collection protocol around diversity targets before starting.
- Use co-training for mobile tasks. Collect 50-100 mobile demonstrations and combine them with 200-500 static bimanual demonstrations during training. The static data teaches manipulation skills; the mobile data teaches base coordination. LeRobot supports multi-dataset training through the `dataset_repo_id` parameter accepting a list of datasets.
- Monitor per-phase success. When a task has multiple phases (approach, grasp, transport, place), track where failures occur. If 80% of failures happen during grasping, collect HG-DAgger corrections specifically for the grasp phase rather than re-collecting full episodes.
- Budget for iteration. The typical cycle is: collect 50 demos, train, evaluate, identify failure modes, collect 50 more demos targeting those failures, retrain. Budget 3-5 collection-training cycles per task. SVRC's managed data collection services can accelerate this iteration cycle -- our operators are trained to collect targeted corrective demonstrations based on policy failure analysis. See our data services page for details, starting at $2,500 for a pilot collection.
Troubleshooting Common Training Failures
When your ACT policy does not work on the real robot, use this diagnostic guide before re-collecting data.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Robot does not move | Action normalization mismatch | Verify action stats (mean/std) match between training and deployment config |
| Arms collide with each other | Demonstrations include a wide range of arm positions; policy interpolates between modes | Add joint limit safety checks in deploy script; increase KL weight to 50 to reduce mode averaging |
| Policy drifts after 2-3 seconds | Control frequency mismatch: training at 50Hz, deploying at 30Hz | Match --fps between record and deploy commands exactly |
| Gripper never closes | Gripper action polarity inverted or normalized incorrectly | Check gripper min/max in the YAML config; verify open=0.0, closed=1.0 convention |
| Works on day 1, fails on day 2 | Camera bumped or lighting changed | Rigidly mount cameras with Loctite; run calibration check at start of each session |
| Jerky, oscillating motion near objects | Inconsistent operator strategies across demonstrations | Use a single operator; filter episodes by trajectory smoothness; increase temporal_agg weight |
Mobile Base Integration: Coordinating Locomotion and Manipulation
The "mobile" in Mobile ALOHA adds substantial complexity compared to static ALOHA. The AgileX Tracer base (or Tracer Mini) must coordinate with the dual arms during whole-body teleoperation. Key integration details:
Base velocity control. The Tracer base accepts velocity commands (linear x, angular z) at 50Hz via CAN bus. The leader robot's base position (tracked by odometry) maps to velocity commands: the offset between the leader's current position and the follower base's position generates proportional velocity commands. Maximum safe speeds during manipulation: 0.3 m/s linear, 0.5 rad/s angular. Exceeding these risks tipping the arm assembly or exceeding joint velocity limits.
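A sketch of that proportional mapping with the safe-speed caps applied -- the gains `kp_lin` and `kp_ang` are illustrative values, not tuned parameters:

```python
def base_cmd(dx, dtheta, kp_lin=1.5, kp_ang=2.0, vmax=0.3, wmax=0.5):
    """Map leader/follower base offset (m, rad) to a clipped (v, w) command."""
    v = min(max(kp_lin * dx, -vmax), vmax)
    w = min(max(kp_ang * dtheta, -wmax), wmax)
    return v, w
```

Large offsets saturate at the safe-speed caps rather than commanding an aggressive catch-up motion.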
Coordinated action space. The full action space for Mobile ALOHA is 16-dimensional: 7 joints per arm (14 total) + 2 base velocity commands. During ACT training, all 16 dimensions are predicted jointly, which allows the policy to learn coordinated whole-body motions (e.g., leaning forward while reaching). However, the base velocity dimensions should be normalized differently than joint positions because their units and ranges differ -- normalize base velocities to [-1, 1] based on the maximum safe speed.
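A sketch of that split normalization -- z-scoring the joint dims with dataset statistics while scaling the base dims by the maximum safe speeds:

```python
import numpy as np

ARM_DIMS = 14                     # 7 joints per arm
MAX_LIN, MAX_ANG = 0.3, 0.5       # max safe base speeds (m/s, rad/s)

def normalize_action(a16, joint_mean, joint_std):
    """Z-score the 14 joint dims; scale base (v, w) to [-1, 1] by max safe speed."""
    out = np.asarray(a16, dtype=np.float64).copy()
    out[:ARM_DIMS] = (out[:ARM_DIMS] - joint_mean) / joint_std
    out[ARM_DIMS] /= MAX_LIN
    out[ARM_DIMS + 1] /= MAX_ANG
    return out
```

The inverse mapping must be applied at deployment before sending commands; a mismatch between the two is a common cause of a policy that moves the arms but never the base.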
When to use mobility. Not every task benefits from mobility. Static bimanual tasks (folding clothes on a table, assembly at a fixed workstation) should be collected with the base locked to avoid introducing unnecessary base motion noise into the training data. Enable mobility only when the task genuinely requires it: fetching objects from different locations, navigating between workstations, or reaching objects beyond the static arm workspace.
Alternatives to Building Your Own
Building a Mobile ALOHA system takes 2-4 weeks of focused effort for an experienced roboticist, or 6-8 weeks for a team without prior Dynamixel and ROS2 experience. If your primary goal is to collect mobile manipulation data rather than to understand the hardware, consider these alternatives.
UMI (Universal Manipulation Interface): A much cheaper approach ($2,000-3,000 total) that uses a hand-held gripper with a GoPro camera to collect demonstrations without any robot hardware. UMI demonstrations can train manipulation policies that deploy on various robot arms. UMI cannot capture base mobility data, so it is not a replacement for Mobile ALOHA if your tasks require locomotion, but it is an excellent starting point for manipulation-only data collection.
SVRC managed data collection: If you need mobile manipulation data but do not want to build and maintain the hardware, SVRC operates Mobile ALOHA systems and other bimanual platforms at our Mountain View facility. Our operators are trained in whole-body teleoperation tasks. You define the task specification and object set; we collect, annotate, and deliver the dataset in LeRobot or RLDS format. This eliminates the hardware build entirely and gives you professionally collected data from the start. See our data services for details and pricing.
Leasing a system: SVRC's robot leasing program can provide a fully assembled and calibrated Mobile ALOHA system for monthly lease. This lets you collect data in your own environment without the upfront hardware investment. Contact us to discuss leasing availability and configuration options.