Why Dataset Format Matters
Robot training data format is not a detail you can defer. The format you choose on day one determines three things that will affect your project for months:
1. Training Framework Compatibility
Each major training framework expects data in a specific format. ACT and Diffusion Policy read HDF5 natively. Octo and the Open X-Embodiment data mix scripts expect RLDS/TFRecord. The LeRobot training library reads LeRobot Parquet. If your data is in the wrong format, you are writing conversion scripts before you can train — and conversion scripts are where subtle data corruption bugs hide.
2. Storage Efficiency and Access Patterns
A 500-episode dataset with 3 cameras at 30 fps occupies 25-50 GB in raw HDF5, 15-30 GB in compressed HDF5, 3-8 GB in LeRobot (MP4 video), or 20-40 GB in RLDS/TFRecord. The storage difference matters for cloud hosting costs, download times, and training data loading speed. But storage efficiency trades off against data fidelity: LeRobot's MP4 compression is lossy, while HDF5 and RLDS preserve exact pixel values.
3. Community and Sharing
If you want to share your dataset publicly, LeRobot format gives you one-command upload to Hugging Face Hub with built-in web visualization. RLDS gives you compatibility with the Open X-Embodiment ecosystem (50+ datasets, 22 robot types). HDF5 gives you maximum flexibility but no standardized sharing platform.
Our recommendation: Use HDF5 as your source-of-truth collection and storage format, then convert to LeRobot for sharing and to RLDS for cross-embodiment training. You get the best of all three ecosystems without locking yourself into any single format.
HDF5: The Gold Standard for Robot Data Storage
HDF5 (Hierarchical Data Format 5) stores data in a filesystem-like hierarchy of groups (directories) and datasets (arrays). It was originally developed for scientific computing and has become the de facto standard for robot demonstration data thanks to its flexibility, mature tooling, and efficient random access.
Episode Structure
The standard ACT/ALOHA HDF5 layout stores each episode as a top-level group with observations, actions, and metadata attributes:
/episode_0/
    observations/
        images/
            cam_high            # uint8 [T x 480 x 640 x 3] overhead camera
            cam_wrist_left      # uint8 [T x 480 x 640 x 3] left wrist camera
            cam_wrist_right     # uint8 [T x 480 x 640 x 3] right wrist camera
        qpos                    # float32 [T x 14] joint positions (7 per arm)
        qvel                    # float32 [T x 14] joint velocities
    action                      # float32 [T x 14] leader arm positions (supervision signal)
    attrs:
        task = "pick_cube_bimanual"
        operator_id = "op_03"
        success = True
        num_timesteps = 450
        timestamp = "2026-04-10T14:32:00Z"
Reading HDF5 with Python
Reading episodes with h5py is straightforward. Here is a complete example that loads an episode's observations and actions:
import h5py
import numpy as np

# Open a single episode file
with h5py.File("episode_0.hdf5", "r") as f:
    # Read joint positions and actions
    qpos = f["/observations/qpos"][:]        # shape: [T, 14]
    action = f["/action"][:]                 # shape: [T, 14]

    # Read a specific camera frame (random access)
    frame_100 = f["/observations/images/cam_high"][100]   # shape: [480, 640, 3]

    # Read all frames for a camera
    all_frames = f["/observations/images/cam_high"][:]    # shape: [T, 480, 640, 3]

    # Read metadata
    task = f.attrs.get("task", "unknown")
    success = f.attrs.get("success", False)

print(f"Task: {task}, Success: {success}")
print(f"Episode length: {qpos.shape[0]} timesteps")
print(f"Joint positions range: [{qpos.min():.3f}, {qpos.max():.3f}]")
HDF5 Best Practices
- Chunking: Always chunk datasets along the time axis. Use chunk_size=1 for random access (debugging, visualization) or chunk_size=32 for sequential read efficiency (training). Never store unchunked image data — it loads as a single monolithic block.
- Compression: Use LZF for image data (3-5x faster than GZIP at similar ratios for camera frames). Use GZIP level 4 for joint trajectories (higher ratio, speed not critical). Do not compress images at collection time — apply compression in the final archive after QA validation.
- Metadata attributes: Store episode metadata as HDF5 group attributes: `episode.attrs['success']`, `episode.attrs['task']`, `episode.attrs['operator_id']`, `episode.attrs['robot_serial']`. Include a `schema_version` attribute on the file root to track format changes.
- One file per episode vs. one file per dataset: For datasets under 1,000 episodes, one HDF5 file per episode is simpler for parallel processing and partial re-collection. For larger datasets, consider packing 50-100 episodes per file to reduce filesystem overhead.
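The chunking and compression guidance above can be sketched in h5py. This is an illustrative write path: shapes and key names follow the ALOHA-style layout shown earlier, and the random data is a stand-in for real sensor streams.

```python
# Sketch: writing one episode with per-frame image chunks, LZF on images,
# and GZIP level 4 on joint trajectories. Shapes are illustrative.
import h5py
import numpy as np

T = 50  # timesteps (small, for illustration)
images = np.random.randint(0, 256, size=(T, 480, 640, 3), dtype=np.uint8)
qpos = np.random.randn(T, 14).astype(np.float32)

with h5py.File("episode_0.hdf5", "w") as f:
    # One frame per chunk: efficient random access for debugging/visualization.
    f.create_dataset(
        "observations/images/cam_high",
        data=images,
        chunks=(1, 480, 640, 3),
        compression="lzf",
    )
    # Joint trajectories are small; GZIP level 4 favors ratio over speed.
    f.create_dataset(
        "observations/qpos",
        data=qpos,
        chunks=(32, 14),
        compression="gzip",
        compression_opts=4,
    )
    f.attrs["task"] = "pick_cube_bimanual"
    f.attrs["success"] = True
    f.attrs["schema_version"] = "1.0"
```

LZF ships with h5py, so this adds no dependencies; for sequential training reads, widen the image chunk along the time axis as discussed above.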
HDF5 Pros and Cons
Pros:
- Mature library support (h5py, HDFView, Julia, C++)
- Efficient random access to any frame
- Flexible schema — add custom sensor types freely
- Lossless storage preserves exact pixel values
- Native to ACT, ALOHA, and Diffusion Policy
- Human-inspectable with HDFView GUI

Cons:
- No built-in versioning or provenance tracking
- Not cloud-streamable (must download full file)
- Large file sizes without video compression
- Schema inconsistencies between labs
- No standardized sharing platform
- Concurrent writes require careful locking
RLDS: The Open X-Embodiment Standard
RLDS (Reinforcement Learning Datasets) is the episode-serialization format used by the Open X-Embodiment collection, the largest aggregation of robot manipulation data, spanning 22 robot types across 50+ datasets. It serializes each episode as a sequence of steps in TFRecord files, read and written via TensorFlow Datasets (TFDS).
RLDS Schema
Each RLDS dataset is defined by a TensorFlow DatasetBuilder that specifies the features schema. Episodes are represented as sequences of steps, where each step contains:
# Standard RLDS step structure
step = {
    "observation": {
        "image": tf.uint8,        # shape: [H, W, C]
        "state": tf.float32,      # shape: [D] (joint positions + gripper)
        "wrist_image": tf.uint8,  # shape: [H, W, C] (optional)
    },
    "action": tf.float32,               # shape: [D]
    "reward": tf.float32,               # scalar
    "discount": tf.float32,             # scalar (typically 1.0)
    "is_terminal": tf.bool,             # True on terminal state
    "is_first": tf.bool,                # True on first step
    "is_last": tf.bool,                 # True on last step
    "language_instruction": tf.string,  # natural language task description
}
Loading RLDS Data
import tensorflow_datasets as tfds

# Load an Open X-Embodiment dataset
dataset = tfds.load("berkeley_autolab_ur5", split="train")

# Iterate over episodes
for episode in dataset.take(5):
    steps = episode["steps"]
    for step in steps:
        image = step["observation"]["image"].numpy()   # [H, W, 3]
        state = step["observation"]["state"].numpy()   # [D]
        action = step["action"].numpy()                # [D]
        instruction = step["language_instruction"].numpy().decode()
        print(f"Instruction: {instruction}")
        print(f"State shape: {state.shape}, Action shape: {action.shape}")
        break  # just the first step of each episode
RLDS Pros and Cons
Pros:
- Standardized schema enables cross-dataset training
- Efficient streaming via tf.data pipelines
- Cloud-native (stream from GCS/S3 without download)
- 50+ datasets available in compatible format
- Native to Octo, RT-2, and OXE data mix
- Built-in language instruction field

Cons:
- TensorFlow dependency (heavy for PyTorch teams)
- Sequential access only (no efficient random frame)
- Rigid schema — custom sensors need DatasetBuilder
- Writing a DatasetBuilder takes 2-4 hours
- Inspection requires TF tooling
- Less intuitive than HDF5 for debugging
LeRobot: The Hugging Face Ecosystem
LeRobot, developed by Hugging Face, uses Parquet files for tabular data (joint positions, actions, metadata) and MP4 video files for camera observations. It is designed for the open-source research workflow: collect locally, push to Hugging Face Hub, train with the LeRobot library, share results with the community.
LeRobot Dataset Structure
A LeRobot dataset on Hugging Face Hub contains:
my_dataset/
    data/
        train-00000-of-00001.parquet    # tabular data (all episodes)
    videos/
        observation.images.cam_high/
            episode_000000.mp4          # overhead camera video
            episode_000001.mp4
        observation.images.cam_wrist/
            episode_000000.mp4          # wrist camera video
            episode_000001.mp4
    meta/
        info.json                       # dataset metadata, features schema
        episodes.jsonl                  # per-episode metadata
        stats.json                      # per-feature mean/std/min/max
The Parquet file contains one row per timestep with columns for episode_index, frame_index, timestamp, observation.state (joint positions), action, and references to the corresponding video frame index.
Loading LeRobot Data
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
# Load a dataset from Hugging Face Hub
dataset = LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")
# Access a single frame (returns a dict of tensors)
frame = dataset[0]
print(f"State: {frame['observation.state'].shape}") # [D]
print(f"Action: {frame['action'].shape}") # [D]
print(f"Image: {frame['observation.images.cam_high'].shape}") # [C, H, W]
# Get episode-level info
print(f"Number of episodes: {dataset.num_episodes}")
print(f"Number of frames: {dataset.num_frames}")
print(f"FPS: {dataset.fps}")
LeRobot Pros and Cons
Pros:
- One-command upload to Hugging Face Hub
- Built-in web visualization at hf.co/datasets/
- Compact storage (MP4 video 5-10x smaller than raw)
- 300+ public datasets and growing fast
- Native ACT and Diffusion Policy training support
- Statistics (mean/std) computed automatically

Cons:
- MP4 compression is lossy — not source-of-truth quality
- Video decoding adds latency during training
- Parquet not ideal for variable-length episodes
- Schema changes require full dataset rebuild
- Newer format with evolving tooling
- No random frame access without decoding video
Format Comparison Table
| Feature | HDF5 | RLDS / TFRecord | LeRobot / Parquet |
|---|---|---|---|
| Native frameworks | ACT, Diffusion Policy, custom | Octo, RT-2, OXE data mix | LeRobot, ACT (via lib), DP (via lib) |
| Storage size (500 eps, 3 cams) | 15-30 GB (compressed) | 20-40 GB | 3-8 GB (MP4) |
| Image fidelity | Lossless (raw uint8) | Lossless (raw uint8) | Lossy (MP4 H.264/H.265) |
| Random frame access | Efficient (chunked) | Inefficient (sequential) | Requires video decode |
| Cloud streaming | No (download required) | Yes (tf.data from GCS/S3) | Yes (HF Hub streaming) |
| Schema flexibility | High (any structure) | Low (fixed DatasetBuilder) | Medium (Parquet columns) |
| Sharing platform | None (manual hosting) | TFDS catalog | Hugging Face Hub |
| Community datasets | Many (no central catalog) | 50+ (Open X-Embodiment) | 300+ (Hugging Face Hub) |
| Python tooling | h5py (mature, lightweight) | tensorflow-datasets (heavy) | lerobot, datasets (growing) |
| Recommended for | Primary storage, ACT/DP training | Cross-embodiment, Octo training | Sharing, community, quick start |
Converting Between Formats
You will eventually need data in multiple formats. Here is the practical guide to conversion, with the tools and estimated effort for each path.
HDF5 to LeRobot
The LeRobot library provides native conversion for ALOHA-style HDF5 datasets:
# Convert ALOHA HDF5 to LeRobot format and push to Hub
python -m lerobot.scripts.push_dataset_to_hub \
    --raw-dir /path/to/hdf5/episodes \
    --raw-format aloha_hdf5 \
    --repo-id your-org/dataset-name \
    --push-to-hub 1
For custom HDF5 schemas (not ALOHA), you need to write a small adapter function that maps your key names to LeRobot's expected schema. This typically takes 30-60 minutes.
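One low-risk way to write that adapter is as a preprocessing pass that rewrites your custom HDF5 layout into the ALOHA-style key names, so the native converter can ingest the result unchanged. A minimal sketch; the source key names (`joint_pos`, `camera/top`, etc.) are hypothetical placeholders for your own schema:

```python
# Sketch: remap a custom HDF5 episode into ALOHA-style key names.
import h5py

# Hypothetical source keys -> ALOHA-style destination keys. Edit to taste.
KEY_MAP = {
    "joint_pos": "observations/qpos",
    "joint_vel": "observations/qvel",
    "camera/top": "observations/images/cam_high",
    "commanded_pos": "action",
}

def remap_episode(src_path: str, dst_path: str) -> None:
    """Copy datasets under new key names and carry over all attributes."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for old_key, new_key in KEY_MAP.items():
            if old_key in src:
                dst.create_dataset(new_key, data=src[old_key][:])
        for k, v in src.attrs.items():
            dst.attrs[k] = v
```

Keeping the remap as a separate, inspectable step (rather than patching the converter) also gives you a natural place to validate shapes and dtypes before upload.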
HDF5 to RLDS
Converting to RLDS requires writing a custom TensorFlow DatasetBuilder. This is the most labor-intensive conversion (2-4 hours for a new schema) but is a one-time cost per dataset format:
# Skeleton RLDS DatasetBuilder (simplified)
import h5py
import tensorflow as tf
import tensorflow_datasets as tfds

class MyRobotDataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "steps": tfds.features.Dataset({
                    "observation": tfds.features.FeaturesDict({
                        "image": tfds.features.Image(shape=(480, 640, 3)),
                        "state": tfds.features.Tensor(shape=(14,), dtype=tf.float32),
                    }),
                    "action": tfds.features.Tensor(shape=(14,), dtype=tf.float32),
                    "is_terminal": tf.bool,
                    "is_first": tf.bool,
                    "is_last": tf.bool,
                    "language_instruction": tfds.features.Text(),
                }),
            }),
        )

    def _generate_examples(self, path):
        # Read from your HDF5 files and yield one (key, episode) pair each
        for episode_path in sorted(path.glob("*.hdf5")):
            with h5py.File(episode_path, "r") as f:
                # Map HDF5 fields to the RLDS step schema
                steps_list = []  # build per-step dicts from f here
                yield episode_path.stem, {"steps": steps_list}
RLDS to LeRobot
LeRobot provides a built-in converter for RLDS datasets, including all Open X-Embodiment datasets:
# Convert any RLDS dataset to LeRobot format
python -m lerobot.scripts.push_dataset_to_hub \
    --raw-dir /path/to/rlds/dataset \
    --raw-format rlds \
    --repo-id your-org/converted-dataset \
    --push-to-hub 1
LeRobot to HDF5
There is no official tool for this direction, but it is straightforward to write (30-60 minutes):
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import h5py
import numpy as np

dataset = LeRobotDataset("your-org/dataset-name")

# Group frames by episode in a single pass over the dataset
# (avoids re-scanning every frame once per episode)
episodes = {}
for i in range(dataset.num_frames):
    frame = dataset[i]
    episodes.setdefault(int(frame["episode_index"]), []).append(frame)

for ep_idx, ep_frames in episodes.items():
    with h5py.File(f"episode_{ep_idx:05d}.hdf5", "w") as f:
        qpos = np.stack([fr["observation.state"].numpy() for fr in ep_frames])
        action = np.stack([fr["action"].numpy() for fr in ep_frames])
        f.create_dataset("observations/qpos", data=qpos, chunks=(1, qpos.shape[1]))
        f.create_dataset("action", data=action, chunks=(1, action.shape[1]))
        # Decode and store video frames as image arrays
        # ... (video decode step adds complexity)
Important caveat: Converting from LeRobot back to HDF5 cannot recover the original pixel-level fidelity because LeRobot stores images as lossy MP4 video. The converted HDF5 will contain decoded MP4 frames, not the original raw images.
Conversion Summary Table
| From → To | Tool | Effort | Notes |
|---|---|---|---|
| HDF5 → LeRobot | lerobot.scripts.push_dataset_to_hub | 30 min | Native ALOHA support; custom schemas need adapter |
| HDF5 → RLDS | Custom DatasetBuilder | 2-4 hours | One-time per schema; requires TF knowledge |
| RLDS → LeRobot | lerobot.scripts.push_dataset_to_hub --raw-format rlds | 15 min | Works for all OXE datasets |
| LeRobot → HDF5 | Custom script | 30-60 min | Lossy: MP4 frames, not original raw images |
| Any → Any | SVRC Platform | 5 min | Upload once, export to any format via UI |
How SVRC Delivers Your Data
When you engage SVRC for a data collection campaign, here is how we handle format delivery:
Collection Format
We always collect in HDF5 as our source of truth. Raw sensor data is stored losslessly with per-frame timestamps, full metadata, and chunked datasets for efficient access. This master copy is retained for the duration of your project.
Delivery Format
You specify your target format in the project brief. We support:
- HDF5: Direct delivery of the source-of-truth files. Includes schema documentation and a Python example script for loading.
- RLDS / TFRecord: Converted with a custom DatasetBuilder matched to your schema. Includes the DatasetBuilder source code so you can re-run the conversion yourself.
- LeRobot / Parquet: Pushed to a private Hugging Face Hub repository under your organization. Includes dataset card with full metadata, statistics, and visualization.
- Custom formats: ROS bag, CSV, JSON-lines, or proprietary schemas. We write the export adapter and include it in the delivery.
What Is Included
Every dataset delivery includes:
- The dataset files in your requested format
- A data manifest (JSON) listing all episodes with metadata, quality scores, and statistics
- Schema documentation describing every field, data type, and unit
- A Python example script that loads one episode and prints shapes and ranges
- Per-feature statistics (mean, std, min, max) for normalization during training
- QA report summarizing quality metrics across the full dataset
SVRC Platform Export
If you use the SVRC Fearless Platform, you can upload datasets in any format and export to any other format through the web UI. The platform handles schema normalization, statistics computation, and format-specific encoding (MP4 for LeRobot, TFRecord for RLDS) automatically. Upload once, export as many times as you need.
Frequently Asked Questions
Which format should I use if I am just getting started?
Start with HDF5. It has the simplest tooling (just h5py), the most flexible schema, and is native to the most popular training frameworks (ACT, Diffusion Policy). You can always convert to LeRobot or RLDS later. If you want to share your dataset immediately on Hugging Face Hub, use LeRobot from the start — but keep the raw HDF5 as your backup.
Is LeRobot's MP4 compression a problem for training?
For most manipulation tasks, no. The visual artifacts from H.264 compression at reasonable quality settings (CRF 20-23) are below the noise level of typical camera sensors. However, for tasks where pixel-level accuracy matters — visual servoing to sub-millimeter targets, detecting thin wires or threads, or research that analyzes compression artifacts — use lossless HDF5 as your training source. The LeRobot team is exploring lossless video codecs (FFV1) for future versions.
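If you want to check this on your own footage, PSNR between a raw frame and its round-tripped decode is a quick proxy. As a rule of thumb (not a guarantee), values above roughly 40 dB sit below typical consumer camera sensor noise; the helper below assumes uint8 frames:

```python
import numpy as np

def psnr(raw: np.ndarray, decoded: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two uint8 image arrays, in dB."""
    mse = np.mean((raw.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Run it over a sample of frames from your raw HDF5 against the same frames decoded from the LeRobot MP4s; if the distribution stays comfortably above your tolerance, the lossy path is fine for that task.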
Can I mix datasets from different formats for training?
Yes, but you need to normalize them to a common format first. The most practical approach is to convert everything to a single format before training. If you are training with Octo or doing cross-embodiment experiments, convert everything to RLDS. If you are training with the LeRobot library, convert everything to LeRobot format. The SVRC Platform can normalize and export mixed-format uploads into a unified dataset.
How do I version-control my robot datasets?
For LeRobot datasets on Hugging Face Hub, versioning is built in via git-lfs. For HDF5, use a data manifest file (JSON) alongside your HDF5 files that records schema_version, creation date, episodes list, and statistics. Bump the schema version when you change sensor configuration. For production workflows, the SVRC Platform provides full dataset versioning with rollback.
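A minimal manifest along those lines can be a few lines of stdlib Python. The field names here follow the conventions described above, not any fixed standard; adapt them to your pipeline:

```python
import json
from datetime import datetime, timezone

# Sketch: manifest.json written alongside the HDF5 files. Episode entries
# and stats would normally be computed from the data, not hard-coded.
manifest = {
    "schema_version": "1.2",
    "created": datetime.now(timezone.utc).isoformat(),
    "episodes": [
        {"file": "episode_00000.hdf5", "timesteps": 450, "success": True},
        {"file": "episode_00001.hdf5", "timesteps": 388, "success": False},
    ],
    "stats": {"qpos_mean": None, "qpos_std": None},  # fill from your data
}

with open("manifest.json", "w") as fp:
    json.dump(manifest, fp, indent=2)
```

Commit the manifest to regular git (it is small and diffable) even if the HDF5 files themselves live in object storage; bumping `schema_version` then leaves an auditable history of sensor-configuration changes.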
What about ROS bag format?
ROS bag (rosbag2 in ROS2) is excellent for data recording during collection because it captures all ROS topics with timestamps natively. However, it is not well-suited as a training format because it requires ROS2 libraries to read, has no random access, and stores data in a format optimized for replay rather than ML training. The standard workflow is: record in ROS bag during collection, then convert to HDF5 (or LeRobot/RLDS) for training and sharing. This conversion step also serves as a data cleaning and validation checkpoint.