Kaggle Notebook Deep Dive

Pedestrian Movie SLAM with Dual GPUs

3D Scene Reconstruction using VGGT-SLAM on two Tesla T4 GPUs — neural inference on GPU 0, post-processing on GPU 1, producing colored point clouds and COLMAP exports.

Model: VGGT-1B
Hardware: 2× Tesla T4
Frames: 294
SLAM Time: 1.7 min
Point Cloud: 2.2 M pts
01 — Overview

What This Notebook Does

This notebook reconstructs a 3D map of a pedestrian scene from a short MP4 video, using a state-of-the-art neural SLAM system called VGGT-SLAM developed at MIT SPARK. The pipeline extracts frames, runs neural pose estimation and point-cloud generation, and exports the result in multiple formats (PLY, COLMAP) for downstream 3D tools.

The key engineering challenge is VRAM pressure: VGGT's 1-billion-parameter transformer nearly fills a 16 GB T4 by itself. The solution is to pin SLAM to GPU 0 and run all post-processing (outlier removal, PLY export, visualization) on GPU 1 in parallel.

294 Input Frames @ 10 fps
4 Submaps Created
2.5M 3D Points Exported
02 — Pipeline

Streaming Pipeline — Stage by Stage

The overall flow has five stages. Stages 2–4 form the inner loop that repeats for every batch of frames. Stage 1 (frame extraction) and Stage 5 (final export) each run only once.

Stage 01 — Frame Extraction & Resize (runs once)

ffmpeg extracts frames at 10 fps (capped at 400 frames). Each frame is then resized to a width of 1036 px via ffmpeg -vf scale, preserving the aspect ratio and rounding the height to a multiple of 8, as required by the VGGT model's patch architecture.
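As a minimal sketch, both steps can be combined into one ffmpeg pass driven from Python (the input path and output pattern are illustrative, not the notebook's exact command):

import subprocess

FPS, MAX_FRAMES, RESIZE_WIDTH = 10, 400, 1036  # values used by this notebook

# Extract at 10 fps and resize in one pass: 'scale=1036:-8' keeps the
# aspect ratio and rounds the computed height to a multiple of 8.
subprocess.run([
    'ffmpeg', '-i', 'pedestrian.mp4',
    '-vf', f'fps={FPS},scale={RESIZE_WIDTH}:-8',
    '-frames:v', str(MAX_FRAMES),
    'frames/frame_%05d.png',
], check=True)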

Stage 02 — Keyframe Selection (per batch)

VGGT-SLAM selects keyframes from the batch using DINOv2-based descriptor matching (SALAD checkpoint). This decides which frames are informative enough to build a new submap — non-redundant views with sufficient scene overlap.
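The selector itself lives in the VGGT-SLAM repo; as a rough illustration of the idea, greedy selection on L2-normalized global descriptors might look like this (the function name and threshold are ours, purely illustrative):

import numpy as np

def select_keyframes(descs, sim_thresh=0.85):
    # descs: (N, D) L2-normalized per-frame descriptors (e.g. DINOv2 + SALAD).
    # Keep a frame only if it is dissimilar enough from the last kept keyframe.
    keep = [0]
    for i in range(1, len(descs)):
        if float(descs[keep[-1]] @ descs[i]) < sim_thresh:  # cosine similarity
            keep.append(i)
    return keep

# Toy usage with random unit-norm descriptors
d = np.random.randn(20, 256)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(select_keyframes(d))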

Stage 03 — VGGT Neural Inference (per batch)

The selected keyframes (typically 6–7) are fed to the 1B-parameter VGGT transformer in a single forward pass. It outputs camera poses (extrinsics + intrinsics) and per-pixel depth/point-cloud estimates simultaneously. On a T4 this takes ~8–9 seconds per batch.
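Conceptually, the per-batch forward pass looks like the sketch below. The import path, checkpoint name, and usage follow the public VGGT repo, but treat them as assumptions that may differ across versions:

import torch
from vggt.models.vggt import VGGT  # module path assumed from the VGGT repo

device = 'cuda:0'  # the SLAM GPU
model = VGGT.from_pretrained('facebook/VGGT-1B').to(device).eval()

keyframes = torch.rand(7, 3, 392, 1036)  # dummy (S, 3, H, W) submap batch
with torch.no_grad(), torch.autocast('cuda', dtype=torch.float16):
    preds = model(keyframes.to(device)[None])  # add a batch dimension
# preds bundles camera parameters and per-pixel geometry for all S frames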

Stage 04 — Backend Optimization & Loop Closure (per batch)

GTSAM solves the pose graph to enforce global consistency across submaps. Scale factors are estimated between consecutive submaps (e.g. 0.82×, 0.45×, 3.31×) to align their coordinate frames. Loop closures (none were detected in this run) would add inter-submap constraints.
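For flavor, here is a minimal GTSAM pose-graph sketch with two submap poses; VGGT-SLAM's real factor graph also estimates the inter-submap scale, which plain Pose3 factors do not capture, so this is illustrative only:

import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 0.1))  # rot + trans sigmas

# Illustrative relative pose between submap 0 and submap 1
rel_pose_01 = gtsam.Pose3(gtsam.Rot3(), np.array([0.5, 0.0, 0.1]))

# Anchor the first submap, then chain a relative constraint; a loop closure
# would simply add another BetweenFactor between non-consecutive submaps.
graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(), noise))
graph.add(gtsam.BetweenFactorPose3(0, 1, rel_pose_01, noise))

initial = gtsam.Values()
initial.insert(0, gtsam.Pose3())
initial.insert(1, rel_pose_01)

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose3(1).translation())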

Stage 05 — Export & Visualization (runs once)

The final point cloud is assembled from all batch NPZ files, outliers are removed on GPU 1, and the result is written as a colored ASCII PLY (slam_combined_color.ply). COLMAP binary format (cameras.bin, images.bin, points3D.bin) and an incremental GIF animation are also produced.

03 — Dual GPU Strategy

Why Split Across Two GPUs?

VGGT-SLAM's neural inference is extremely memory-hungry. The 1B-parameter model plus activations for 7 frames nearly saturates a 16 GB T4. Attempting post-processing on the same GPU causes Out-of-Memory (OOM) crashes.

Physical GPU 0 — Tesla T4: SLAM Inference

Runs main.py via subprocess with CUDA_VISIBLE_DEVICES=0, so the process cannot even see GPU 1. All VGGT forward passes, keyframe matching, and GTSAM solving happen here.

Peak VRAM used: 605 / 15,360 MB

Physical GPU 1 — Tesla T4: Post-Processing

Loads NPZ point clouds from disk, performs GPU-accelerated quantile-based outlier removal with PyTorch tensors, and randomly downsamples via torch.randperm. It runs in parallel with the SLAM subprocess, with zero VRAM conflict.

Peak VRAM used: 147 / 15,360 MB
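The random downsampling step mentioned above is essentially one line with torch.randperm; a sketch with an illustrative point cap:

import torch

pts = torch.rand(5_000_000, 3, device='cuda:1')  # stand-in for a loaded cloud
MAX_PTS = 2_500_000  # illustrative cap, not necessarily the notebook's value
if pts.shape[0] > MAX_PTS:
    idx = torch.randperm(pts.shape[0], device='cuda:1')[:MAX_PTS]
    pts = pts[idx]  # random subset; original order is not preserved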
⚡ Key Insight

The env['CUDA_VISIBLE_DEVICES'] = '0' trick is crucial. It doesn't just set device preference — it makes GPU 1 invisible to the SLAM subprocess entirely, preventing any accidental cross-GPU allocation that would fragment VRAM on the primary inference device.

🔧 OOM Troubleshooting

If you hit OOM errors, reduce RESIZE_WIDTH: try 1036 → 776 → 518. Alternatively, reduce SUBMAP_SIZE from 6 to 4 to limit frames per inference batch. Both reduce peak VRAM on GPU 0.
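In the notebook's config terms (RESIZE_WIDTH and SUBMAP_SIZE are its actual knobs; these particular values are just a conservative starting point):

RESIZE_WIDTH = 776  # down from 1036; drop further to 518 if OOM persists
SUBMAP_SIZE = 4     # down from 6: fewer keyframes per forward pass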

04 — Key Code Patterns

Design Decisions & Implementation Notes

Several engineering choices make this notebook robust and efficient. Here are the most important ones:

Pinning SLAM to GPU 0 via subprocess environment
import os
import subprocess

# Copy the current environment and override GPU visibility
env = os.environ.copy()
env['CUDA_VISIBLE_DEVICES'] = '0'  # GPU 1 becomes invisible to the child process

# INPUT_DIR is defined in the notebook's config cell; '...' stands for the
# remaining CLI flags elided here
proc = subprocess.Popen(
    ['python3', 'main.py', '--image_folder', INPUT_DIR, ...],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
    env=env,  # ← critical: pass the modified env to the subprocess
)
GPU-accelerated outlier removal on GPU 1 (cuda:1)
import glob

import numpy as np
import torch

npz_files = sorted(glob.glob('batches/*.npz'))  # per-batch dumps (illustrative path)
all_xyz = []

for npz_path in npz_files:
    npz = np.load(npz_path)
    pts = npz['pointcloud'][npz['mask']]
    # Move to GPU 1 — no conflict with SLAM on GPU 0
    pts_t = torch.tensor(pts, dtype=torch.float32, device='cuda:1')
    # Keep the central 96% along each axis (2%/98% quantile clipping)
    for axis in range(3):
        col = pts_t[:, axis]
        lo = torch.quantile(col, 0.02)
        hi = torch.quantile(col, 0.98)
        pts_t = pts_t[(col >= lo) & (col <= hi)]
    all_xyz.append(pts_t.cpu().numpy())
PLY accumulation strategy — always-valid output file
# Rewrite the full ASCII PLY on each update: header first, then one vertex
# per line, so the file on disk is always a valid, openable PLY.
with open(filename, 'w') as f:
    f.write("ply\n")
    f.write("format ascii 1.0\n")
    f.write(f"element vertex {len(points)}\n")
    f.write("property float x\nproperty float y\nproperty float z\n")
    f.write("property uchar red\nproperty uchar green\nproperty uchar blue\n")
    f.write("end_header\n")
    for i in range(len(points)):
        x, y, z = points[i]
        r, g, b = int(colors[i][0]), int(colors[i][1]), int(colors[i][2])
        f.write(f"{x:.6f} {y:.6f} {z:.6f} {r} {g} {b}\n")
📐 COLMAP Export

The notebook writes a full COLMAP sparse reconstruction without using the COLMAP binary. It manually encodes cameras.bin, images.bin, and points3D.bin in little-endian binary using Python's struct module, following the COLMAP file format spec exactly. This output can be fed directly into tools like InstantNGP, nerfstudio, or Gaussian Splatting trainers.
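As an example of how little machinery this needs, a minimal cameras.bin writer for PINHOLE cameras, consistent with the COLMAP format spec, might look like this (the helper name and tuple layout are ours, not the notebook's):

import struct

def write_cameras_bin(path, cameras):
    # cameras: list of (camera_id, width, height, fx, fy, cx, cy) tuples
    with open(path, 'wb') as f:
        f.write(struct.pack('<Q', len(cameras)))  # number of cameras (uint64)
        for cam_id, w, h, fx, fy, cx, cy in cameras:
            # camera_id and model_id=1 (PINHOLE) as int32; width, height as uint64
            f.write(struct.pack('<iiQQ', cam_id, 1, w, h))
            f.write(struct.pack('<dddd', fx, fy, cx, cy))  # params as float64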

🔄 Quaternion Conversion (no scipy)

Pose conversion from VGGT's world-to-camera quaternion format to COLMAP's camera-to-world convention is done using a pure NumPy implementation of Shepperd's algorithm — no scipy dependency required, important for Kaggle environments.
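A pure-NumPy rotation-matrix-to-quaternion conversion in the same branch-selecting family as Shepperd's method is sketched below; this is a generic implementation, not the notebook's exact function:

import numpy as np

def rotmat_to_quat(R):
    # Returns (qw, qx, qy, qz); picks the numerically safest of four branches.
    tr = np.trace(R)
    if tr > 0:
        s = 2.0 * np.sqrt(tr + 1.0)                            # s = 4*qw
        q = [0.25 * s, (R[2, 1] - R[1, 2]) / s,
             (R[0, 2] - R[2, 0]) / s, (R[1, 0] - R[0, 1]) / s]
    elif R[0, 0] > R[1, 1] and R[0, 0] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[0, 0] - R[1, 1] - R[2, 2])   # s = 4*qx
        q = [(R[2, 1] - R[1, 2]) / s, 0.25 * s,
             (R[0, 1] + R[1, 0]) / s, (R[0, 2] + R[2, 0]) / s]
    elif R[1, 1] > R[2, 2]:
        s = 2.0 * np.sqrt(1.0 + R[1, 1] - R[0, 0] - R[2, 2])   # s = 4*qy
        q = [(R[0, 2] - R[2, 0]) / s, (R[0, 1] + R[1, 0]) / s,
             0.25 * s, (R[1, 2] + R[2, 1]) / s]
    else:
        s = 2.0 * np.sqrt(1.0 + R[2, 2] - R[0, 0] - R[1, 1])   # s = 4*qz
        q = [(R[1, 0] - R[0, 1]) / s, (R[0, 2] + R[2, 0]) / s,
             (R[1, 2] + R[2, 1]) / s, 0.25 * s]
    return np.asarray(q)

print(rotmat_to_quat(np.eye(3)))  # → [1. 0. 0. 0.]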

05 — Submap Processing

How VGGT-SLAM Builds the Map

Rather than processing all 294 frames at once, VGGT-SLAM divides the video into overlapping submaps. Each submap contains SUBMAP_SIZE=6 keyframes selected by descriptor similarity, and consecutive submaps share a boundary keyframe (frames 77, 139, and 168 each appear in two rows below) so their coordinate frames can be aligned.

Submap ID | Keyframes Used                     | Inference Time | Scale Factor
0         | 1, 11, 50, 68, 71, 74, 77          | 8.46 s         | — (first)
7         | 77, 86, 95, 112, 121, 132, 139     | 8.86 s         | 0.8246×
14        | 139, 142, 147, 150, 154, 161, 168  | 8.35 s         | 0.4465×
21        | 168, 201, 262, 280, 287, 294       | 6.34 s         | 3.3073×

Note how the last submap jumps from frame 168 to 201 to 262 — VGGT's keyframe selector detected that intermediate frames were visually redundant (too similar to already-seen views) and skipped them. The varying scale factors reflect the relative motion magnitude between consecutive submap coordinate systems before alignment.

06 — Results & Output Files

What Gets Produced

🖼️ input_preview.png (563 KB): 4-frame montage of sampled input frames for a quick sanity check
📊 slam_result.png (382 KB): static 3D matplotlib visualization from three viewpoints (perspective, top-down, front)
🎬 slam_animation_color.gif (1.6 MB): incremental SLAM build-up animation; each frame adds a new keyframe's colored cloud
🗂️ slam_combined_color.ply (91.1 MB): full colored point cloud (ASCII PLY); open in MeshLab or CloudCompare
📄 poses.txt (2.5 KB): camera poses, one line per keyframe, as frame_id tx ty tz qx qy qz qw (27 keyframes)
🔩 slam_colmap/sparse/0/: COLMAP binary sparse reconstruction (cameras.bin, images.bin, points3D.bin; 2.5M pts, 24 cams)
💡 Downstream Usage

The COLMAP output can be fed directly into 3D Gaussian Splatting, NeRF trainers (nerfstudio, instant-ngp), or MVS pipelines. The PLY file is ready for mesh reconstruction in MeshLab or CloudCompare. The poses.txt format is compatible with standard SLAM benchmarks.
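For a quick look at the PLY without leaving Python, Open3D (assuming it is installed in the environment) can load and render it:

import open3d as o3d  # pip install open3d

pcd = o3d.io.read_point_cloud('slam_combined_color.ply')
print(pcd)  # reports the point count
o3d.visualization.draw_geometries([pcd])  # interactive viewer; needs a display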

07 — Performance Summary

Timing Breakdown

Operation          | Time / Frame (avg) | % of Total
VGGT inference     | 1.350 s            | 69.4%
Keyframe selection | 0.321 s            | 16.5%
Loop closure check | 0.083 s            | 4.3%
Backend (GTSAM)    | 0.003 s            | 0.2%
Total              | 1.947 s            | 0.51 FPS

VGGT inference dominates at ~70% of runtime, and the four listed stages together account for only about 90% of the 1.947 s average; the remainder is untracked overhead. The backend (GTSAM pose graph solver) is surprisingly cheap at just 3 ms/frame — the heavy lifting is all in the neural forward pass. Peak VRAM on GPU 0 reached only 605 MB of the available 15,360 MB because only 6–7 frames are processed per batch, not all 294 simultaneously.

📈 Scaling Note

With 0 loop closures detected, the trajectory is open-loop — accuracy degrades over longer sequences. For real deployments, increase MAX_LOOPS and ensure enough scene revisitation for loop closure detection to trigger.