A practical engineering series on optimizing the SAMURAI/EfficientTAM video segmentation pipeline for real-time performance. It follows the journey from research-grade PyTorch code that barely runs at 1 FPS, through ONNX export, NVIDIA TensorRT, and Apple Silicon CoreML, to ultimately a Rust runtime.
## Posts in this series
- Part 1: From Research Checkpoint to ONNX Runtime — getting the model into ONNX, which sounds simple until you try
- Part 2: CUDA, TensorRT, and the FP16 Softmax Overflow — from 80 FPS to 97 FPS on an RTX 4090, including a deep dive into a TensorRT fusion bug
- Part 3: CoreML, Partitions, and the 13x Mac Speedup — running on Apple Silicon, from CoreML being slower than CPU to hitting 28 FPS
- Part 4: Rust Runtime — coming soon