pipeline is really slow - consulting [D]

Slow Training Pipeline for Imitation Learning in Robotics

I am encountering significant training slowness with a pipeline designed for imitation learning in robotics. My goal is to create models that can learn from observed robot actions and apply them to new situations.

Model inputs: A ResNet18-based encoder processes observations from four RGB cameras, each at a resolution of 128x128x3. The robot state is a small 14-dimensional vector of joint velocities. The policy input is the concatenation of the image embeddings and the state vector.
Architecture: A single ResNet18 encoder is shared across all four camera images and produces a 128-dimensional embedding per image. The DiT (Diffusion Transformer) backbone has 8 layers, a hidden dimension of 512, and 8 attention heads, for a total of about 50M parameters.
Policy: The policy is a diffusion policy that predicts actions in chunks of approximately 50 steps, using four diffusion timesteps. This allows flexible modeling of sequential actions within the environment.
Data Storage and Access: The dataset is stored using Zarr, with indexed access to avoid loading large chunks into RAM. The train/val split remains contiguous without shuffling, ensuring consistency in training data distribution.
Hardware / Software: Training runs on an NVIDIA A4500 GPU with 48GB of RAM and an SSD for storage. The CUDA version is 12.8 and the PyTorch version is 2.9. Mixed precision (bf16) is used throughout.
Data Loading: The dataloader employs a batch size of 2 with 8 persistent workers and pinned memory enabled to optimize data transfer efficiency.
Data Preprocessing: Minimal preprocessing steps are applied, including normalization and conversion to floating-point values. These operations occur within the multimodal encoder on the GPU for better performance.
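
As a sanity check on the "about 50M parameters" figure, here is a rough back-of-envelope count. It is only a sketch under assumptions the post does not confirm: an MLP expansion ratio of 4, optional adaLN-style modulation (one linear layer from d to 6d per block, as in the original DiT design), and biases, norms, embeddings, and output heads ignored. `transformer_block_params` is a hypothetical helper, not part of any library.

```python
def transformer_block_params(d_model: int, mlp_ratio: int = 4, adaln: bool = True) -> int:
    """Approximate weight count of one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model    # two linear layers: d -> r*d -> d
    # Assumption: adaLN-style conditioning adds one d -> 6*d linear layer per block.
    modulation = 6 * d_model * d_model if adaln else 0
    return attention + mlp + modulation


D_MODEL, N_LAYERS = 512, 8       # values given in the post
RESNET18_PARAMS = 11_689_512     # stock torchvision resnet18, including its 1000-class head

with_adaln = N_LAYERS * transformer_block_params(D_MODEL, adaln=True)
without_adaln = N_LAYERS * transformer_block_params(D_MODEL, adaln=False)

print(f"DiT blocks with adaLN:    {with_adaln / 1e6:.1f}M")
print(f"DiT blocks without adaLN: {without_adaln / 1e6:.1f}M")
print(f"With adaLN plus ResNet18: {(with_adaln + RESNET18_PARAMS) / 1e6:.1f}M")
```

Under these assumptions the blocks plus a stock ResNet18 land close to 50M, so the stated size looks plausible for the whole model rather than the backbone alone. Either way, a model of this size at batch size 2 is a small workload for the GPU.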

The encoder is currently frozen during training to avoid unnecessary computation. Profiler results indicate that most of the time (80%) is spent in the training_step function, with a significant portion (~25%) going to the backward pass and optimizer steps.
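
One way to read this profiler split is through Amdahl's law, which bounds the overall speedup from accelerating only part of the runtime. The sketch below assumes both percentages are fractions of total wall-clock time, which the post does not state explicitly; `max_speedup` is a hypothetical helper.

```python
def max_speedup(fraction_improved: float, factor: float = float("inf")) -> float:
    """Amdahl's law: overall speedup when `fraction_improved` of the runtime
    becomes `factor` times faster and the rest is unchanged."""
    remaining = 1.0 - fraction_improved
    return 1.0 / (remaining + fraction_improved / factor)


# If 80% of the time is inside training_step, everything outside it is 20%.
# Removing that 20% entirely (factor = infinity) gives the upper bound.
outside_step_bound = max_speedup(0.20)

# Making the ~25% spent in backward + optimizer twice as fast.
backward_2x = max_speedup(0.25, factor=2.0)

print(f"Bound from eliminating all time outside training_step: {outside_step_bound:.2f}x")
print(f"Speedup from 2x faster backward/optimizer: {backward_2x:.2f}x")
```

If those assumptions hold, neither the time outside training_step nor the backward pass alone can explain a large gap, which points at the rest of training_step (for example the forward pass or per-step overhead) as the place to look.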

Current Behavior: CPU utilization consistently hovers around 100%, while GPU utilization fluctuates between ~20–30%. GPU utilization stays below the expected level even with synthetic data. VRAM usage remains relatively low, at about 5GB.
Throughput: The system achieves approximately 10 iterations per second. An epoch of around 50k samples takes roughly 30 minutes to complete. Increasing the batch size does not significantly reduce the total time required for an epoch.
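
The throughput figures can be cross-checked with simple arithmetic. This sketch uses only the numbers quoted in the post (batch size 2, about 10 iterations per second, about 50k samples, about 30 minutes per epoch); the function names are made up for illustration.

```python
def epoch_minutes(num_samples: int, batch_size: int, iters_per_sec: float) -> float:
    """Minutes per epoch at a given batch size and iteration rate."""
    samples_per_sec = batch_size * iters_per_sec
    return num_samples / samples_per_sec / 60.0


def implied_samples_per_sec(num_samples: int, epoch_min: float) -> float:
    """Samples per second implied by a measured epoch duration."""
    return num_samples / (epoch_min * 60.0)


# Batch size 2 at 10 it/s is 20 samples/s, i.e. about 41.7 minutes for 50k samples.
print(f"Predicted epoch time: {epoch_minutes(50_000, 2, 10.0):.1f} min")

# A 30-minute epoch over 50k samples implies about 27.8 samples/s.
print(f"Implied throughput:   {implied_samples_per_sec(50_000, 30.0):.1f} samples/s")
```

The two estimates agree only roughly, which is expected given that every figure in the post is approximate. Both put throughput in the range of 20 to 30 samples per second. If the epoch time does not drop when the batch size grows, the cost per sample is staying roughly constant, which is what a bottleneck that scales with the number of samples would produce.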

Despite these optimizations, training remains excessively slow. Freezing the encoder did not noticeably improve performance, and replacing dataset samples with synthetic/random tensors increased throughput by only about 50%. These observations suggest that there may be inefficiencies in either the model architecture or the data processing steps.
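
To separate the two suspects, one option is to time how long each iteration waits for the next batch versus how long it spends in the step itself. The harness below is a minimal, framework-agnostic sketch: `split_wait_and_step_time`, `slow_loader`, and `fast_step` are hypothetical names, and the sleeps merely stand in for a real dataloader and training step.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def split_wait_and_step_time(
    batches: Iterable[Any], step: Callable[[Any], None], max_iters: int = 100
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent inside step)."""
    wait = compute = 0.0
    iterator = iter(batches)
    for _ in range(max_iters):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)   # blocks while the loader produces a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, kernels launch asynchronously: the real step would need to
        # call torch.cuda.synchronize() before returning for this to be accurate.
        step(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n: int, delay: float = 0.02):
    """Stand-in for a data-bound loader: each batch takes `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield i


def fast_step(batch: Any, cost: float = 0.002) -> None:
    """Stand-in for a cheap training step."""
    time.sleep(cost)


wait_s, compute_s = split_wait_and_step_time(slow_loader(5), fast_step)
print(f"waiting on data: {wait_s:.3f}s, inside step: {compute_s:.3f}s")
```

In the real pipeline, a large wait share would point at data loading, while a large step share despite low GPU utilization would point at CPU-side overhead within the step.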

For context, other research papers report training times of around 10 hours on an RTX 4090 GPU. My NVIDIA A4500 setup falls well short of that, which suggests there is room for improvement in the pipeline design or execution strategy.

Could anyone provide insights into what might be causing this slowness? Any suggestions on how to further investigate or optimize the training process would be greatly appreciated.

Key Takeaways

The encoder is currently frozen, but its impact on performance was minimal.
Synthetic data did not significantly improve throughput compared to real-world samples.
Increasing batch size did not reduce the epoch duration noticeably.
