Innovations and Limitations of VGGT: An Implementation Comparison with MASt3R
Introduction
VGGT (Visual Geometry Grounded Transformer) was released in March 2025 and went on to win the Best Paper Award at CVPR 2025.
The original paper claims several advantages over its predecessor, MASt3R, in accuracy, processing speed, and architectural efficiency.
1. Overwhelming Speed and Efficiency
- Inference Time: For a set of 10 images, MASt3R takes approximately 9 seconds, whereas VGGT completes the task in about 0.2 seconds.
- Batch Processing Capability: MASt3R can only process two images (a pair) at a time, requiring a high-cost post-processing step called "Global Alignment" to integrate multiple images. In contrast, VGGT can reconstruct anywhere from one to hundreds of images directly in a single feed-forward pass, eliminating the need for complex post-processing.
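The structural difference above can be sketched numerically: a pairwise model like MASt3R must, in the worst case, process every unordered image pair and then fuse the results with global alignment, while VGGT makes one forward pass over the whole set. A minimal sketch (pure counting, not either model's actual pipeline):

```python
from math import comb

def mast3r_pair_count(n_images: int) -> int:
    # Worst case for a pairwise model: every unordered image pair,
    # followed by a global-alignment step to fuse them.
    return comb(n_images, 2)

def vggt_pass_count(n_images: int) -> int:
    # A single feed-forward pass handles the whole set at once.
    return 1

for n in (2, 10, 100):
    print(n, mast3r_pair_count(n), vggt_pass_count(n))
# 10 images -> 45 pairs for the pairwise route, 1 pass for VGGT
```

In practice MASt3R pipelines often use a sparse pair graph rather than all pairs, but the cost still grows with the number of pairs, whereas VGGT's cost is a single (large) batched pass.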
2. Improved 3D Reconstruction and Camera Estimation
- Point Cloud Accuracy: Evaluations on the ETH3D dataset show that VGGT significantly outperforms MASt3R in point cloud generation accuracy.
- Camera Pose Estimation: On datasets such as RealEstate10K and CO3Dv2, VGGT recorded higher AUC@30 scores, outperforming MASt3R "by a large margin."
- Generalization Performance: VGGT demonstrates superior performance even on unseen datasets, proving its robust versatility.
3. Architectural Flexibility and Comprehensiveness
- Direct Attribute Output: While MASt3R primarily estimates point maps, VGGT simultaneously outputs camera parameters, depth maps, point maps, and 3D point tracks from a single network.
- Single-View Support: MASt3R requires duplicating an image to create a "pair," but VGGT natively supports reconstruction from a single image with impressive results.
- Bundle Adjustment (BA) Compatibility: Since VGGT’s initial predictions are highly accurate, they serve as excellent starting values for additional Bundle Adjustment, allowing optimization to finish much faster and more accurately than traditional SfM (Structure from Motion) methods.
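The "one network, many attributes" design above can be pictured as a single call returning several aligned predictions. The mock below uses hypothetical key names and random values purely to illustrate the output structure; it is not the real VGGT API:

```python
import numpy as np

def vggt_forward_mock(images: np.ndarray) -> dict:
    """Mock of a single feed-forward pass over (S, H, W, 3) images.

    Key names and shapes are illustrative placeholders, not VGGT's API.
    """
    s, h, w, _ = images.shape
    rng = np.random.default_rng(0)
    return {
        "camera_params": rng.standard_normal((s, 9)),        # pose + intrinsics per frame
        "depth_maps":    rng.random((s, h, w)),              # one depth map per frame
        "point_maps":    rng.standard_normal((s, h, w, 3)),  # per-pixel 3D points
        "point_tracks":  rng.standard_normal((s, 128, 2)),   # tracked 2D points
    }

preds = vggt_forward_mock(np.zeros((4, 32, 32, 3)))
print({k: v.shape for k, v in preds.items()})
```

The point is that all four attributes come from the same pass, so they are mutually consistent by construction, which is also why they make good initial values for Bundle Adjustment.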
Experimental Verification
While the paper emphasizes the benefits of processing all images in a single batch, and reports handling up to 200 images, that result was obtained on NVIDIA H100 GPUs. Batch processing requires VRAM roughly proportional to the number of images.
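Assuming roughly linear VRAM growth, a back-of-the-envelope model predicts the batch limits observed in the experiments below. The base and per-image costs here are hypothetical placeholders chosen to match my observations, not published figures; profile your own GPU:

```python
def max_batch_size(vram_gb: float, base_gb: float = 4.0,
                   per_image_gb: float = 0.6) -> int:
    """Largest image count that fits, assuming linear VRAM growth.

    base_gb (model weights + overhead) and per_image_gb are
    illustrative guesses, not measured constants.
    """
    if vram_gb <= base_gb:
        return 0
    return int((vram_gb - base_gb) / per_image_gb)

for gpu, vram in [("T4", 16.0), ("2x T4", 32.0), ("H100", 80.0)]:
    print(gpu, max_batch_size(vram))
```

With these placeholder constants, a 16 GB T4 tops out around 20 images, which matches the OOM limit I hit in practice; the linear term is what makes large sets infeasible on commodity GPUs.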
I implemented both VGGT and MASt3R in a more common GPU environment (an NVIDIA T4) to compare their 3D reconstruction performance under real-world constraints.
Case 1: Cyprus (30 images)
This dataset contains relatively few images, a setting where traditional SfM struggles. VGGT handled all 30 images in a single batch and produced a far more complete reconstruction; MASt3R, by contrast, recovered only a small portion of the scene.
Case 2: Stone Steps (20 frames)
I attempted 3D reconstruction from frames extracted from a video at 1-second intervals. VGGT succeeded, while MASt3R failed to obtain valid 3D points and produced no usable reconstruction.
Case 3: Grand Place (40 images)
Using two T4 GPUs, VGGT was able to process up to 40 images in a single batch. Attempting to process more than this resulted in an OOM (Out of Memory) error.
Attempts to Handle Large Image Sets in VGGT
I explored whether VGGT could handle sequences of 100+ images despite memory limits.
- VGGT Batch Processing: I set the batch size to 20 (the limit to avoid OOM) and ran 5 batches for a total of 100 images. Each batch constructed its own independent 3D coordinate system, so the merged result placed the chunks at arbitrary, disconnected positions.
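This failure mode is structural rather than a bug: each batch's reconstruction lives in its own arbitrary coordinate frame (up to an unknown rotation, translation, and, in practice, scale), so naive concatenation cannot line the chunks up. A minimal NumPy illustration of the same effect:

```python
import numpy as np

rng = np.random.default_rng(42)
true_points = rng.standard_normal((100, 3))  # one scene, split into 2 "batches"

def random_rigid_frame(pts: np.ndarray, rng) -> np.ndarray:
    # Each independently reconstructed batch comes back in an
    # arbitrary frame: unknown rotation R and translation t.
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    r = q * np.sign(np.linalg.det(q))  # make it a proper rotation
    t = rng.standard_normal(3) * 5.0
    return pts @ r.T + t

batch_a = random_rigid_frame(true_points[:50], rng)
batch_b = random_rigid_frame(true_points[50:], rng)

# Naive merge: stacking the two frames leaves the chunks at
# disconnected positions, exactly as observed in the experiment.
merged = np.vstack([batch_a, batch_b])
print(merged.shape)
```

Recovering a consistent scene would require estimating the inter-batch transforms from overlapping content, i.e. reintroducing the kind of global alignment step that VGGT's single-pass design was meant to avoid.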
- VGGT Frame-by-Frame Mode: This resulted in a messy overlap of individual frames without global consistency.
Summary
Within the range of images that can be processed in a single batch, VGGT demonstrates 3D reconstruction performance superior to MASt3R. However, because VGGT is designed on the premise of processing all images simultaneously, its VRAM requirement grows almost linearly with the number of images. In scenarios involving many images, it is prone to OOM errors, revealing a clear limitation in scalability.
On the other hand, MASt3R allows users to control the number of pairs processed, ensuring the computation stays within the limits of the GPU memory. This gives MASt3R a practical advantage in terms of stability when dealing with large-scale image sets.
At present, the most realistic choice is to use both models selectively based on the number of images and the available computational environment.
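In practice, that selection reduces to a simple gate on image count versus the largest batch the GPU can hold. The threshold below is the single-T4 limit observed in my experiments; it is an empirical setting to tune per GPU, not a property of either model:

```python
def choose_model(n_images: int, max_vggt_batch: int = 20) -> str:
    """Pick a reconstruction model for the available GPU.

    max_vggt_batch is the empirical single-batch OOM limit
    (20 on one T4 in these experiments); adjust per hardware.
    """
    if n_images <= max_vggt_batch:
        return "VGGT"    # single feed-forward pass, better quality
    return "MASt3R"      # pairwise + global alignment, bounded memory

print(choose_model(10), choose_model(100))
```

On larger GPUs the threshold rises (raise `max_vggt_batch` accordingly), which is exactly why the H100 results in the paper do not transfer directly to a T4.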