A practical engineering series on optimizing the SAMURAI/EfficientTAM video segmentation pipeline for real-time performance. It follows the journey from research-grade PyTorch code that barely runs at 1 FPS, through ONNX export, NVIDIA TensorRT, and Apple Silicon CoreML, to ultimately a Rust runtime.
## Posts in this series
- Part 1: From Research Checkpoint to ONNX Runtime — getting the model into ONNX, which sounds simple until you try
- Part 2: CUDA, TensorRT, and the FP16 Softmax Overflow — from 80 FPS to 97 FPS on an RTX 4090, including a deep dive into a TensorRT fusion bug
- Part 3: CoreML, Partitions, and the 13x Mac Speedup — running on Apple Silicon, from CoreML being slower than CPU to hitting 28 FPS
- Part 4: Rust Runtime — coming soon