We experimented with different backbones, camera pose representations, scalability, and attention mechanisms. Our evaluation spans hundreds of full-length videos across various metrics, without aligning the predicted trajectory to the ground truth, to simulate a real-world application
š½ļø Check out Visual Odometry Transformer! VoT is an end-to-end model for getting accurate metric camera poses from monocular videos.
vladimiryugay.github.io/vot/
Thanks to the team, Kien Nguyen, Theo Gevers, @cgmsnoek.bsky.social, and @martin-r-oswald.bsky.social from the University of Amsterdam!
VoT does not require calibration or post-optimization and operates in real-time, capable of processing thousands of frames. It is trained on a vast amount of real-world indoor data, but can work just fine in outdoor scenarios. It uses only camera poses as supervision, making it broadly accessible