Innovations and Limitations of VGGT: An Implementation Comparison with MASt3R

Introduction

VGGT (Visual Geometry Grounded Transformer) was announced in March 2025 and subsequently received the prestigious Best Paper Award at CVPR 2025.

The original paper claims that VGGT holds several advantages over its predecessor, MASt3R, in accuracy, processing speed, and architectural efficiency:

1. Overwhelming Speed and Efficiency

2. Improved 3D Reconstruction and Camera Estimation

3. Architectural Flexibility and Comprehensiveness


Experimental Verification

The paper emphasizes the benefit of processing all images in a single batch, claiming that up to 200 images can be handled at once. However, that figure was obtained on NVIDIA H100 GPUs, and batch processing requires VRAM roughly proportional to the number of images.
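The roughly linear relationship between image count and memory can be sketched with a back-of-the-envelope estimate. The constants below (patch grid, hidden dimension, bytes per value) are illustrative assumptions, not measured values from the paper:

```python
# Illustrative estimate of per-layer activation memory when VGGT
# processes all images in one batch. All constants are assumptions
# chosen for illustration, not figures from the VGGT paper.

PATCHES_PER_IMAGE = 37 * 37   # e.g. a 518x518 input with 14-px patches (assumed)
HIDDEN_DIM = 1024             # assumed transformer width
BYTES_PER_VALUE = 2           # fp16 activations

def activation_bytes(num_images: int) -> int:
    """Token activations for one layer when all images share one batch."""
    tokens = num_images * PATCHES_PER_IMAGE
    return tokens * HIDDEN_DIM * BYTES_PER_VALUE

# Doubling the image count doubles the token activations,
# which is why VRAM demand scales with batch size.
assert activation_bytes(40) == 2 * activation_bytes(20)
```

Attention score matrices in the global-attention layers grow even faster than this, so the estimate above is a lower bound on the scaling behavior.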

I ran both VGGT and MASt3R in a standard GPU environment (such as a T4) to compare their 3D reconstruction performance under real-world constraints.

Case 1: Cyprus (30 images)

This dataset contains few images, which makes 3D reconstruction difficult for traditional SfM. VGGT handled all 30 images in a single batch, and its reconstruction quality was clearly superior. In contrast, MASt3R could reconstruct only a small portion of the scene.

Case 2: Stone Steps (20 frames)

I attempted 3D reconstruction using frames extracted from a video at 1-second intervals. VGGT succeeded, while MASt3R failed to obtain valid 3D points and produced no usable reconstruction.
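The 1-second sampling step can be expressed as a small index calculation. The frame rate and frame count below are made-up example numbers, not properties of the actual stone-steps clip:

```python
# Sketch of choosing frame indices at roughly 1-second intervals from a
# video, given its frame rate. Decoding those frames (e.g. with OpenCV
# or ffmpeg) is omitted; only the sampling logic is shown.

def sample_indices(total_frames: int, fps: float, interval_s: float = 1.0) -> list[int]:
    """Return frame indices spaced `interval_s` seconds apart."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

# Example: a 20-second clip at 30 fps yields one frame per second.
indices = sample_indices(total_frames=600, fps=30.0)
print(len(indices))  # 20
```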

Case 3: Grand Place (40 images)

With two T4 GPUs, VGGT could process at most 40 images in a single batch; attempting more resulted in an out-of-memory (OOM) error.


Attempts to Handle Large Image Sets in VGGT

I explored whether VGGT could handle sequences of 100+ images despite memory limits.
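One candidate workaround, sketched here as an assumption rather than an official VGGT feature, is to split a long sequence into overlapping chunks that each fit in VRAM, run VGGT per chunk, and align the chunks afterwards via the shared frames. Only the chunking logic is shown:

```python
# Hypothetical chunking scheme for feeding a 100+ image sequence to a
# model whose batch size is capped by VRAM. Chunk alignment/merging is
# a separate problem and is not shown here.

def make_chunks(num_images: int, chunk_size: int, overlap: int) -> list[range]:
    """Cover indices 0..num_images with windows sharing `overlap` frames."""
    assert 0 <= overlap < chunk_size
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < num_images:
        end = min(start + chunk_size, num_images)
        chunks.append(range(start, end))
        if end == num_images:
            break
        start += step
    return chunks

# 120 images split so that no chunk exceeds the 40-image limit
# observed on the two-T4 setup above.
chunks = make_chunks(num_images=120, chunk_size=40, overlap=5)
assert all(len(c) <= 40 for c in chunks)
```

The overlap frames give each pair of neighboring chunks common geometry, which a subsequent alignment step (e.g. a similarity transform between the overlapping point sets) could exploit.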

Summary

Within the range of images that can be processed in a single batch, VGGT demonstrates 3D reconstruction performance superior to MASt3R. However, because VGGT is designed on the premise of processing all images simultaneously, its VRAM requirement grows almost linearly with the number of images. In scenarios involving many images, it is prone to OOM errors, revealing a clear limitation in scalability.

On the other hand, MASt3R allows users to control the number of pairs processed, ensuring the computation stays within the limits of the GPU memory. This gives MASt3R a practical advantage in terms of stability when dealing with large-scale image sets.
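Because MASt3R operates on image pairs, bounding memory amounts to bounding which pairs are formed. The sliding-window pair builder below is an illustrative sketch, not the pair-generation utility shipped with the MASt3R repository; the window size is the knob that trades coverage for memory:

```python
# Sketch of a bounded pair-selection strategy for a pairwise model like
# MASt3R: each image is matched only with its nearest successors, so the
# number of pairs grows linearly with the image count instead of
# quadratically, keeping the workload within a fixed GPU-memory budget.

def sliding_window_pairs(num_images: int, window: int = 3) -> list[tuple[int, int]]:
    """Pair each image with up to `window` subsequent images."""
    pairs = []
    for i in range(num_images):
        for j in range(i + 1, min(i + 1 + window, num_images)):
            pairs.append((i, j))
    return pairs

# 100 images -> 294 pairs, versus 4950 for exhaustive all-pairs matching.
print(len(sliding_window_pairs(100)))  # 294
```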

At present, the most realistic choice is to use both models selectively based on the number of images and the available computational environment.
