VoT does not require calibration or post-optimization and operates in real-time, capable of processing thousands of frames. It is trained on a vast amount of real-world indoor data, but can work just fine in outdoor scenarios. It uses only camera poses as supervision, making it broadly accessible