What This Notebook Does
This notebook reconstructs a 3D map of a pedestrian scene from a short MP4 video, using a state-of-the-art neural SLAM system called VGGT-SLAM developed at MIT SPARK. The pipeline extracts frames, runs neural pose estimation and point-cloud generation, and exports the result in multiple formats (PLY, COLMAP) for downstream 3D tools.
The key engineering challenge is VRAM pressure: VGGT's 1-billion-parameter transformer nearly fills a 16 GB T4 by itself. The solution is to pin SLAM to GPU 0 and run all post-processing (outlier removal, PLY export, visualization) on GPU 1 in parallel.
Streaming Pipeline — Stage by Stage
The overall flow has five stages. Stages 2–4 form the inner loop that repeats for every batch of frames. Stage 1 (frame extraction) and Stage 5 (final export) each run only once.
Stage 1: Frame Extraction & Resize (runs once)
ffmpeg extracts frames at 10 fps (capped at 400 frames). Each frame is then
resized to a width of 1036 px via ffmpeg's -vf scale filter, preserving the aspect ratio
and rounding the height to a multiple of 8, as required by the VGGT model's patch architecture.
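A minimal sketch of that step, driving ffmpeg from Python (the file paths and exact filter flags here are illustrative, not necessarily the notebook's):

```python
import subprocess

# Sample at ~10 fps, cap at 400 frames, resize to width 1036 while keeping aspect ratio;
# the '-8' in scale makes ffmpeg round the computed height to a multiple of 8.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",          # input path is a placeholder
    "-vf", "fps=10,scale=1036:-8",
    "-frames:v", "400",
    "frames/%05d.png",                    # output pattern is a placeholder
], check=True)
```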
Stage 2: Keyframe Selection (per batch)
VGGT-SLAM selects keyframes from the batch using DINOv2-based descriptor matching (SALAD checkpoint). This decides which frames are informative enough to build a new submap — non-redundant views with sufficient scene overlap.
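Conceptually this is a greedy "keep only sufficiently novel frames" filter over global image descriptors. A toy sketch of the idea (not the actual SALAD/DINOv2 matching code; the threshold and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def select_keyframes(descriptors: torch.Tensor, max_keyframes=7, sim_threshold=0.85):
    """Greedily keep frames whose descriptor is dissimilar to all kept frames.
    descriptors: (N, D), one global descriptor per candidate frame."""
    desc = F.normalize(descriptors, dim=1)      # unit-norm so dot product = cosine similarity
    kept = [0]                                  # always keep the first frame
    for i in range(1, desc.shape[0]):
        sims = desc[i] @ desc[kept].T           # similarity to every kept keyframe
        if sims.max() < sim_threshold:          # novel enough view -> new keyframe
            kept.append(i)
        if len(kept) == max_keyframes:
            break
    return kept
```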
Stage 3: VGGT Neural Inference (per batch)
The selected keyframes (typically 6–7) are fed to the 1B-parameter VGGT transformer in a single forward pass. It outputs camera poses (extrinsics + intrinsics) and per-pixel depth/point-cloud estimates simultaneously. On a T4 this takes ~8–9 seconds per batch.
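For reference, a single-batch forward pass following the usage pattern in the upstream VGGT repository. The notebook drives this indirectly through VGGT-SLAM's main.py, so the exact calls may differ; the keyframe path list below is a placeholder:

```python
import glob
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

device = "cuda"                                            # GPU 0 inside the SLAM subprocess
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device).eval()

keyframe_paths = sorted(glob.glob("frames/*.png"))[:7]     # placeholder: 6-7 selected keyframes
images = load_and_preprocess_images(keyframe_paths).to(device)

with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.float16):   # fp16 on a T4
    predictions = model(images)                            # one forward pass for the whole batch

# Decode camera parameters and grab the per-pixel geometry
extrinsic, intrinsic = pose_encoding_to_extri_intri(predictions["pose_enc"], images.shape[-2:])
depth = predictions["depth"]
```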
Stage 4: Backend Optimization & Loop Closure (per batch)
GTSAM solves the pose graph to enforce global consistency across submaps. Scale factors are estimated between consecutive submaps (e.g. 0.82×, 0.45×, 3.31×) to align coordinate frames. Loop closures (zero found here) would add inter-submap constraints.
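As a conceptual illustration of what the backend solve looks like in GTSAM, here is a deliberately simplified pose graph; it is not VGGT-SLAM's actual factor graph, and relative_poses is a hypothetical stand-in for the inter-submap transforms:

```python
import numpy as np
import gtsam

# Hypothetical 4x4 inter-submap transforms; in the pipeline these come from VGGT's poses
relative_poses = [np.eye(4) for _ in range(3)]

graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 1e-6))               # anchor submap 0
between_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05] * 3 + [0.02] * 3))

graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(), prior_noise))
initial.insert(0, gtsam.Pose3())

for i, T_rel in enumerate(relative_poses):                  # consecutive-submap constraints
    graph.add(gtsam.BetweenFactorPose3(i, i + 1, gtsam.Pose3(T_rel), between_noise))
    initial.insert(i + 1, initial.atPose3(i).compose(gtsam.Pose3(T_rel)))
    # a detected loop closure would add one more BetweenFactor between non-adjacent submaps

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
```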
Stage 5: Export & Visualization (runs once)
The final point cloud is assembled from all batch NPZ files, outliers are removed on GPU 1,
and the result is written as a colored ASCII PLY (slam_combined_color.ply).
A COLMAP binary reconstruction (cameras.bin, images.bin,
points3D.bin) and an incremental GIF animation are also produced.
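A minimal sketch of the colored ASCII PLY export (the notebook's writer may differ in details such as extra per-vertex fields):

```python
import numpy as np

def write_colored_ply(path, points, colors):
    """points: (N, 3) float positions, colors: (N, 3) uint8 RGB."""
    header = "\n".join([
        "ply", "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for (x, y, z), (r, g, b) in zip(points, colors):
            f.write(f"{x:.6f} {y:.6f} {z:.6f} {int(r)} {int(g)} {int(b)}\n")

# e.g. write_colored_ply("slam_combined_color.ply", points, colors)
```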
Why Split Across Two GPUs?
VGGT-SLAM's neural inference is extremely memory-hungry. The 1B-parameter model plus activations for 7 frames nearly saturates a 16 GB T4. Attempting post-processing on the same GPU causes Out-of-Memory (OOM) crashes.
SLAM Inference (GPU 0)
Runs main.py via subprocess with
CUDA_VISIBLE_DEVICES=0 so the process
cannot even see GPU 1. All VGGT forward
passes, keyframe matching, and GTSAM solving happen here.
Post-Processing (GPU 1)
Loads NPZ point clouds from disk, performs GPU-accelerated
quantile-based outlier removal with PyTorch tensors,
and random downsampling via torch.randperm.
Runs in parallel — zero VRAM conflict.
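A sketch of that GPU-1 cleanup step (the NPZ file name, array keys, and thresholds are assumptions; the bounds are computed on a random subsample to keep torch.quantile cheap):

```python
import numpy as np
import torch

dev = torch.device("cuda:1")                              # post-processing GPU
data = np.load("submap_000.npz")                          # placeholder batch file / keys
pts = torch.from_numpy(data["points"]).float().to(dev)    # (N, 3) positions
rgb = torch.from_numpy(data["colors"]).to(dev)            # (N, 3) colors

# Per-axis 1%/99% bounds -> drop far-flung outlier points
sample = pts[torch.randperm(pts.shape[0], device=dev)[:100_000]]
lo, hi = torch.quantile(sample, torch.tensor([0.01, 0.99], device=dev), dim=0)
mask = ((pts >= lo) & (pts <= hi)).all(dim=1)
pts, rgb = pts[mask], rgb[mask]

# Random downsampling to a fixed point budget
keep = torch.randperm(pts.shape[0], device=dev)[:1_000_000]
pts, rgb = pts[keep], rgb[keep]
```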
The env['CUDA_VISIBLE_DEVICES'] = '0' trick is crucial.
It doesn't just set a device preference: it makes GPU 1 invisible
to the SLAM subprocess entirely, so the subprocess can never accidentally
allocate memory on the post-processing GPU and contend with the work running there.
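The launch pattern looks roughly like this (the exact main.py arguments are omitted here):

```python
import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"          # the SLAM subprocess can only enumerate GPU 0

cmd = ["python", "main.py"]                # plus the notebook's actual VGGT-SLAM arguments
slam = subprocess.Popen(cmd, env=env)

# ...post-processing runs here in the parent process, on GPU 1, while SLAM is busy...
slam.wait()
```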
If you hit OOM errors, reduce RESIZE_WIDTH: try 1036 → 776 → 518.
Alternatively, reduce SUBMAP_SIZE from 6 to 4 to limit frames
per inference batch. Both reduce peak VRAM on GPU 0.
Design Decisions & Implementation Notes
Several engineering choices make this notebook robust and efficient. Here are the most important ones:
The notebook writes a full COLMAP sparse reconstruction without using the COLMAP binary.
It manually encodes cameras.bin, images.bin, and
points3D.bin in little-endian binary using Python's struct module,
following the COLMAP file format spec exactly. This output can be fed directly into
tools like InstantNGP, nerfstudio, or Gaussian Splatting trainers.
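As an illustration, here is how cameras.bin can be written with struct following COLMAP's documented binary layout (the dict keys are my own; images.bin and points3D.bin are handled in the same spirit):

```python
import struct

def write_cameras_bin(path, cameras):
    """cameras: list of dicts with 'id', 'width', 'height', and PINHOLE 'params' (fx, fy, cx, cy)."""
    PINHOLE_MODEL_ID = 1                                   # COLMAP model ids: 0=SIMPLE_PINHOLE, 1=PINHOLE
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(cameras)))           # number of cameras, uint64
        for cam in cameras:
            f.write(struct.pack("<iiQQ", cam["id"], PINHOLE_MODEL_ID,
                                cam["width"], cam["height"]))
            f.write(struct.pack("<4d", *cam["params"]))    # fx, fy, cx, cy as little-endian doubles
```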
Pose conversion from VGGT's extrinsic matrices to COLMAP's quaternion-plus-translation convention is done using a pure NumPy implementation of Shepperd's rotation-matrix-to-quaternion algorithm, with no scipy dependency required (important for Kaggle environments).
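A self-contained version of that conversion (the standard branch-on-largest-diagonal formulation; the notebook's implementation may differ in detail), returning COLMAP's [qw, qx, qy, qz] ordering:

```python
import numpy as np

def rotmat_to_quat(R: np.ndarray) -> np.ndarray:
    """3x3 rotation matrix -> unit quaternion [w, x, y, z], picking the most stable branch."""
    m00, m01, m02 = R[0]; m10, m11, m12 = R[1]; m20, m21, m22 = R[2]
    tr = m00 + m11 + m22
    if tr > 0:
        s = 2.0 * np.sqrt(tr + 1.0)
        w, x, y, z = 0.25 * s, (m21 - m12) / s, (m02 - m20) / s, (m10 - m01) / s
    elif m00 > m11 and m00 > m22:
        s = 2.0 * np.sqrt(1.0 + m00 - m11 - m22)
        w, x, y, z = (m21 - m12) / s, 0.25 * s, (m01 + m10) / s, (m02 + m20) / s
    elif m11 > m22:
        s = 2.0 * np.sqrt(1.0 + m11 - m00 - m22)
        w, x, y, z = (m02 - m20) / s, (m01 + m10) / s, 0.25 * s, (m12 + m21) / s
    else:
        s = 2.0 * np.sqrt(1.0 + m22 - m00 - m11)
        w, x, y, z = (m10 - m01) / s, (m02 + m20) / s, (m12 + m21) / s, 0.25 * s
    q = np.array([w, x, y, z])
    return q / np.linalg.norm(q)
```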
How VGGT-SLAM Builds the Map
Rather than processing all 294 frames at once, VGGT-SLAM divides the video into
overlapping submaps. Each submap is built from SUBMAP_SIZE=6 keyframes
selected by descriptor similarity, and consecutive submaps share a boundary keyframe,
which is why each inference batch contains 6–7 frames; a minimal windowing sketch appears after the table.
| Submap ID | Keyframes Used | Inference Time | Scale Factor |
|---|---|---|---|
| 0 | 1, 11, 50, 68, 71, 74, 77 | 8.46 s | — (first) |
| 7 | 77, 86, 95, 112, 121, 132, 139 | 8.86 s | 0.8246× |
| 14 | 139, 142, 147, 150, 154, 161, 168 | 8.35 s | 0.4465× |
| 21 | 168, 201, 262, 280, 287, 294 | 6.34 s | 3.3073× |
Note how the last submap jumps from frame 168 to 201 to 262 — VGGT's keyframe selector detected that intermediate frames were visually redundant (too similar to already-seen views) and skipped them. The varying scale factors reflect the relative motion magnitude between consecutive submap coordinate systems before alignment.
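The shared boundary keyframes visible in the table (77, 139, 168) are what make consecutive submaps alignable. A minimal windowing sketch of that overlap (illustrative only, not the VGGT-SLAM partitioning code):

```python
def split_into_submaps(keyframes, submap_size=6):
    """Split an ordered keyframe list into windows of up to submap_size + 1 frames,
    where consecutive windows share exactly one boundary keyframe."""
    submaps, start = [], 0
    while start < len(keyframes) - 1:
        end = min(start + submap_size + 1, len(keyframes))
        submaps.append(keyframes[start:end])
        start = end - 1                  # reuse the last keyframe as the next window's first
    return submaps

# split_into_submaps([1, 11, 50, 68, 71, 74, 77, 86, ...]) -> [[1, ..., 77], [77, ...], ...]
```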
What Gets Produced
The COLMAP output can be fed directly into 3D Gaussian Splatting,
NeRF trainers (nerfstudio, instant-ngp), or MVS pipelines.
The PLY file is ready for mesh reconstruction in MeshLab or CloudCompare.
The poses.txt format is compatible with standard SLAM benchmarks.
Timing Breakdown
| Operation | Time / Frame (avg) | % of Total |
|---|---|---|
| VGGT inference | 1.350 s | 69.4% |
| Keyframe selection | 0.321 s | 16.5% |
| Loop closure check | 0.083 s | 4.3% |
| Backend (GTSAM) | 0.003 s | 0.2% |
| Total | 1.947 s | 100% (≈0.51 FPS) |
VGGT inference dominates at ~70% of runtime. The backend (GTSAM pose graph solver) is surprisingly cheap at just 3 ms/frame — the heavy lifting is all in the neural forward pass. Peak VRAM on GPU 0 reached only 605 MB of the available 15,360 MB because only 6–7 frames are processed per batch (not all 294 simultaneously).
With 0 loop closures detected, the trajectory is open-loop — accuracy degrades over
longer sequences. For real deployments, increase MAX_LOOPS and ensure
enough scene revisitation for loop closure detection to trigger.