A practical engineering series about optimizing the SAMURAI/EfficientTAM video segmentation pipeline for real-time performance. The series starts from research-grade PyTorch code that barely runs at 1 FPS and works through ONNX export, NVIDIA TensorRT, Apple Silicon CoreML, and ultimately a Rust runtime.
Posts in this series
- Optimizing PyTorch Models for Production: ONNX Export with SAM 2 - Getting the model into ONNX, which sounds simple until you try
- TensorRT Optimization: 5x Faster Inference with FP16 Precision - From 80 FPS to 97 FPS on an RTX 4090, including a deep dive into a TensorRT fusion bug
- CoreML Deployment on Apple Silicon: Real-Time Vision Models - Running on Apple Silicon: from CoreML being slower than CPU to hitting 28 FPS
- Part 4: Rust Runtime - Coming soon