What This Notebook Does
This notebook reconstructs a 3D map of a pedestrian scene from a short MP4 video, using a state-of-the-art neural SLAM system called VGGT-SLAM developed at MIT SPARK. The pipeline extracts frames, runs neural pose estimation and point-cloud generation, and exports the result in multiple formats (PLY, COLMAP) for downstream 3D tools.
The key engineering challenge is VRAM pressure: VGGT's 1-billion-parameter transformer nearly fills a 16 GB T4 by itself. The solution is to pin SLAM to GPU 0 and run all post-processing (outlier removal, PLY export, visualization) on GPU 1 in parallel.
Streaming Pipeline — Stage by Stage
The overall flow has five stages. Stages 2–4 form the inner loop that repeats for every batch of frames. Stage 1 (frame extraction) and Stage 5 (final export) each run only once.
Stage 1: Frame Extraction & Resize (runs once)
ffmpeg extracts frames at 10 fps (capped at 400 frames). Each frame is then
resized to a width of 1036 px via ffmpeg's -vf scale filter, preserving the aspect ratio
and rounding the height to a multiple of 8, as required by the VGGT model's patch architecture.
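A minimal sketch of that step, driving ffmpeg from Python (the file paths and exact filter flags here are illustrative, not necessarily the notebook's):

```python
import subprocess

# Sample at ~10 fps, cap at 400 frames, resize to width 1036 while keeping aspect ratio;
# the '-8' in scale makes ffmpeg round the computed height to a multiple of 8.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",          # input path is a placeholder
    "-vf", "fps=10,scale=1036:-8",
    "-frames:v", "400",
    "frames/%05d.png",                    # output pattern is a placeholder
], check=True)
```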
Stage 2: Keyframe Selection (per batch)
VGGT-SLAM selects keyframes from the batch using DINOv2-based descriptor matching (SALAD checkpoint). This decides which frames are informative enough to build a new submap — non-redundant views with sufficient scene overlap.
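Conceptually this is a greedy "keep only sufficiently novel frames" filter over global image descriptors. A toy sketch of the idea (not the actual SALAD/DINOv2 matching code; the threshold and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def select_keyframes(descriptors: torch.Tensor, max_keyframes=7, sim_threshold=0.85):
    """Greedily keep frames whose descriptor is dissimilar to all kept frames.
    descriptors: (N, D), one global descriptor per candidate frame."""
    desc = F.normalize(descriptors, dim=1)      # unit-norm so dot product = cosine similarity
    kept = [0]                                  # always keep the first frame
    for i in range(1, desc.shape[0]):
        sims = desc[i] @ desc[kept].T           # similarity to every kept keyframe
        if sims.max() < sim_threshold:          # novel enough view -> new keyframe
            kept.append(i)
        if len(kept) == max_keyframes:
            break
    return kept
```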
Stage 3: VGGT Neural Inference (per batch)
The selected keyframes (typically 6–7) are fed to the 1B-parameter VGGT transformer in a single forward pass. It outputs camera poses (extrinsics + intrinsics) and per-pixel depth/point-cloud estimates simultaneously. On a T4 this takes ~8–9 seconds per batch.
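For reference, a single-batch forward pass following the usage pattern in the upstream VGGT repository. The notebook drives this indirectly through VGGT-SLAM's main.py, so the exact calls may differ; the keyframe path list below is a placeholder:

```python
import glob
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

device = "cuda"                                            # GPU 0 inside the SLAM subprocess
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device).eval()

keyframe_paths = sorted(glob.glob("frames/*.png"))[:7]     # placeholder: 6-7 selected keyframes
images = load_and_preprocess_images(keyframe_paths).to(device)

with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.float16):   # fp16 on a T4
    predictions = model(images)                            # one forward pass for the whole batch

# Decode camera parameters and grab the per-pixel geometry
extrinsic, intrinsic = pose_encoding_to_extri_intri(predictions["pose_enc"], images.shape[-2:])
depth = predictions["depth"]
```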
Stage 4: Backend Optimization & Loop Closure (per batch)
GTSAM solves the pose graph to enforce global consistency across submaps. Scale factors are estimated between consecutive submaps (e.g. 0.82×, 0.45×, 3.31×) to align coordinate frames. Loop closures (zero found here) would add inter-submap constraints.
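As a conceptual illustration of what the backend solve looks like in GTSAM, here is a deliberately simplified pose graph; it is not VGGT-SLAM's actual factor graph, and relative_poses is a hypothetical stand-in for the inter-submap transforms:

```python
import numpy as np
import gtsam

# Hypothetical 4x4 inter-submap transforms; in the pipeline these come from VGGT's poses
relative_poses = [np.eye(4) for _ in range(3)]

graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 1e-6))               # anchor submap 0
between_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05] * 3 + [0.02] * 3))

graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(), prior_noise))
initial.insert(0, gtsam.Pose3())

for i, T_rel in enumerate(relative_poses):                  # consecutive-submap constraints
    graph.add(gtsam.BetweenFactorPose3(i, i + 1, gtsam.Pose3(T_rel), between_noise))
    initial.insert(i + 1, initial.atPose3(i).compose(gtsam.Pose3(T_rel)))
    # a detected loop closure would add one more BetweenFactor between non-adjacent submaps

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
```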
Stage 5: Export & Visualization (runs once)
The final point cloud is assembled from all batch NPZ files, outliers are removed on GPU 1,
and the result is written as a colored ASCII PLY (slam_combined_color.ply).
A COLMAP binary reconstruction (cameras.bin, images.bin,
points3D.bin) and an incremental GIF animation are also produced.
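A minimal sketch of the colored ASCII PLY export (the notebook's writer may differ in details such as extra per-vertex fields):

```python
import numpy as np

def write_colored_ply(path, points, colors):
    """points: (N, 3) float positions, colors: (N, 3) uint8 RGB."""
    header = "\n".join([
        "ply", "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for (x, y, z), (r, g, b) in zip(points, colors):
            f.write(f"{x:.6f} {y:.6f} {z:.6f} {int(r)} {int(g)} {int(b)}\n")

# e.g. write_colored_ply("slam_combined_color.ply", points, colors)
```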
Why Split Across Two GPUs?
VGGT-SLAM's neural inference is extremely memory-hungry. The 1B-parameter model plus activations for 7 frames nearly saturates a 16 GB T4. Attempting post-processing on the same GPU causes Out-of-Memory (OOM) crashes.
SLAM Inference (GPU 0)
Runs main.py via subprocess with
CUDA_VISIBLE_DEVICES=0 so the process
cannot even see GPU 1. All VGGT forward
passes, keyframe matching, and GTSAM solving happen here.
Post-Processing (GPU 1)
Loads NPZ point clouds from disk, performs GPU-accelerated
quantile-based outlier removal with PyTorch tensors,
and random downsampling via torch.randperm.
Runs in parallel — zero VRAM conflict.
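A sketch of that GPU-1 cleanup step (the NPZ file name, array keys, and thresholds are assumptions; the bounds are computed on a random subsample to keep torch.quantile cheap):

```python
import numpy as np
import torch

dev = torch.device("cuda:1")                              # post-processing GPU
data = np.load("submap_000.npz")                          # placeholder batch file / keys
pts = torch.from_numpy(data["points"]).float().to(dev)    # (N, 3) positions
rgb = torch.from_numpy(data["colors"]).to(dev)            # (N, 3) colors

# Per-axis 1%/99% bounds -> drop far-flung outlier points
sample = pts[torch.randperm(pts.shape[0], device=dev)[:100_000]]
lo, hi = torch.quantile(sample, torch.tensor([0.01, 0.99], device=dev), dim=0)
mask = ((pts >= lo) & (pts <= hi)).all(dim=1)
pts, rgb = pts[mask], rgb[mask]

# Random downsampling to a fixed point budget
keep = torch.randperm(pts.shape[0], device=dev)[:1_000_000]
pts, rgb = pts[keep], rgb[keep]
```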
The env['CUDA_VISIBLE_DEVICES'] = '0' trick is crucial.
It doesn't just set a device preference: it makes GPU 1 invisible
to the SLAM subprocess entirely, so the subprocess can never accidentally
allocate memory on the post-processing GPU and contend with the work running there.
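The launch pattern looks roughly like this (the exact main.py arguments are omitted here):

```python
import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"          # the SLAM subprocess can only enumerate GPU 0

cmd = ["python", "main.py"]                # plus the notebook's actual VGGT-SLAM arguments
slam = subprocess.Popen(cmd, env=env)

# ...post-processing runs here in the parent process, on GPU 1, while SLAM is busy...
slam.wait()
```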
If you hit OOM errors, reduce RESIZE_WIDTH: try 1036 → 776 → 518.
Alternatively, reduce SUBMAP_SIZE from 6 to 4 to limit frames
per inference batch. Both reduce peak VRAM on GPU 0.
Design Decisions & Implementation Notes
Several engineering choices make this notebook robust and efficient. Here are the most important ones:
The notebook writes a full COLMAP sparse reconstruction without using the COLMAP binary.
It manually encodes cameras.bin, images.bin, and
points3D.bin in little-endian binary using Python's struct module,
following the COLMAP file format spec exactly. This output can be fed directly into
tools like InstantNGP, nerfstudio, or Gaussian Splatting trainers.
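As an illustration, here is how cameras.bin can be written with struct following COLMAP's documented binary layout (the dict keys are my own; images.bin and points3D.bin are handled in the same spirit):

```python
import struct

def write_cameras_bin(path, cameras):
    """cameras: list of dicts with 'id', 'width', 'height', and PINHOLE 'params' (fx, fy, cx, cy)."""
    PINHOLE_MODEL_ID = 1                                   # COLMAP model ids: 0=SIMPLE_PINHOLE, 1=PINHOLE
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(cameras)))           # number of cameras, uint64
        for cam in cameras:
            f.write(struct.pack("<iiQQ", cam["id"], PINHOLE_MODEL_ID,
                                cam["width"], cam["height"]))
            f.write(struct.pack("<4d", *cam["params"]))    # fx, fy, cx, cy as little-endian doubles
```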
Pose conversion from VGGT's extrinsic matrices to COLMAP's quaternion-plus-translation convention is done using a pure NumPy implementation of Shepperd's rotation-matrix-to-quaternion algorithm, with no scipy dependency required (important for Kaggle environments).
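A self-contained version of that conversion (the standard branch-on-largest-diagonal formulation; the notebook's implementation may differ in detail), returning COLMAP's [qw, qx, qy, qz] ordering:

```python
import numpy as np

def rotmat_to_quat(R: np.ndarray) -> np.ndarray:
    """3x3 rotation matrix -> unit quaternion [w, x, y, z], picking the most stable branch."""
    m00, m01, m02 = R[0]; m10, m11, m12 = R[1]; m20, m21, m22 = R[2]
    tr = m00 + m11 + m22
    if tr > 0:
        s = 2.0 * np.sqrt(tr + 1.0)
        w, x, y, z = 0.25 * s, (m21 - m12) / s, (m02 - m20) / s, (m10 - m01) / s
    elif m00 > m11 and m00 > m22:
        s = 2.0 * np.sqrt(1.0 + m00 - m11 - m22)
        w, x, y, z = (m21 - m12) / s, 0.25 * s, (m01 + m10) / s, (m02 + m20) / s
    elif m11 > m22:
        s = 2.0 * np.sqrt(1.0 + m11 - m00 - m22)
        w, x, y, z = (m02 - m20) / s, (m01 + m10) / s, 0.25 * s, (m12 + m21) / s
    else:
        s = 2.0 * np.sqrt(1.0 + m22 - m00 - m11)
        w, x, y, z = (m10 - m01) / s, (m02 + m20) / s, (m12 + m21) / s, 0.25 * s
    q = np.array([w, x, y, z])
    return q / np.linalg.norm(q)
```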
How VGGT-SLAM Builds the Map
Rather than processing all 294 frames at once, VGGT-SLAM divides the video into
overlapping submaps. Each submap is built from SUBMAP_SIZE=6 keyframes
selected by descriptor similarity, and consecutive submaps share a boundary keyframe,
which is why each inference batch contains 6–7 frames; a minimal windowing sketch appears after the table.
| Submap ID | Keyframes Used | Inference Time | Scale Factor |
|---|---|---|---|
| 0 | 1, 11, 50, 68, 71, 74, 77 | 8.46 s | — (first) |
| 7 | 77, 86, 95, 112, 121, 132, 139 | 8.86 s | 0.8246× |
| 14 | 139, 142, 147, 150, 154, 161, 168 | 8.35 s | 0.4465× |
| 21 | 168, 201, 262, 280, 287, 294 | 6.34 s | 3.3073× |
Note how the last submap jumps from frame 168 to 201 to 262 — VGGT's keyframe selector detected that intermediate frames were visually redundant (too similar to already-seen views) and skipped them. The varying scale factors reflect the relative motion magnitude between consecutive submap coordinate systems before alignment.
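The shared boundary keyframes visible in the table (77, 139, 168) are what make consecutive submaps alignable. A minimal windowing sketch of that overlap (illustrative only, not the VGGT-SLAM partitioning code):

```python
def split_into_submaps(keyframes, submap_size=6):
    """Split an ordered keyframe list into windows of up to submap_size + 1 frames,
    where consecutive windows share exactly one boundary keyframe."""
    submaps, start = [], 0
    while start < len(keyframes) - 1:
        end = min(start + submap_size + 1, len(keyframes))
        submaps.append(keyframes[start:end])
        start = end - 1                  # reuse the last keyframe as the next window's first
    return submaps

# split_into_submaps([1, 11, 50, 68, 71, 74, 77, 86, ...]) -> [[1, ..., 77], [77, ...], ...]
```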
What Gets Produced
The COLMAP output can be fed directly into 3D Gaussian Splatting,
NeRF trainers (nerfstudio, instant-ngp), or MVS pipelines.
The PLY file is ready for mesh reconstruction in MeshLab or CloudCompare.
The poses.txt format is compatible with standard SLAM benchmarks.
Timing Breakdown
| Operation | Time / Frame (avg) | % of Total |
|---|---|---|
| VGGT inference | 1.350 s | 69.4% |
| Keyframe selection | 0.321 s | 16.5% |
| Loop closure check | 0.083 s | 4.3% |
| Backend (GTSAM) | 0.003 s | 0.2% |
| Total | 1.947 s | 100% (≈0.51 FPS) |
VGGT inference dominates at ~70% of runtime. The backend (GTSAM pose graph solver) is surprisingly cheap at just 3 ms/frame — the heavy lifting is all in the neural forward pass. Peak VRAM on GPU 0 reached only 605 MB of the available 15,360 MB because only 6–7 frames are processed per batch (not all 294 simultaneously).
With 0 loop closures detected, the trajectory is open-loop — accuracy degrades over
longer sequences. For real deployments, increase MAX_LOOPS and ensure
enough scene revisitation for loop closure detection to trigger.