Building VFTCam

A Zero-Build 360° Photosphere Capture Web App


VFTCam is a browser-based Progressive Web App that creates 360° panoramic photospheres using structured camera capture and GPU-accelerated stitching. Built as a spiritual successor to Google's discontinued Street View Camera app, it runs entirely in the browser with no build tools, no bundler, and no server-side components; the handful of libraries it uses are loaded directly as ES modules.

The Challenge

When Google discontinued their Street View Camera app in 2023, educators lost a valuable tool for creating immersive virtual field trips. A replacement was needed that could:

  • Run entirely in mobile browsers without native app installation
  • Adopt a privacy-first approach with no user data collection
  • Work offline in remote field locations
  • Handle mobile device memory constraints (especially iOS Safari's ~1.4GB limit)
  • Provide real-time guidance for capturing aligned photos
  • Stitch 36 high-resolution images into equirectangular panoramas

Architecture Overview

VFTCam is built as a "zero-build" application using pure ES6 modules. This decision prioritizes maintainability and debuggability over bundle size optimization.

Core Technologies
  • ES6 Modules (no bundler)
  • WebGL2 for GPU processing
  • Three.js for 3D visualization
  • IndexedDB for image storage
Device APIs
  • WebRTC getUserMedia
  • DeviceOrientationEvent
  • DeviceMotionEvent
  • Geolocation API
PWA Features
  • Service Worker (Workbox)
  • Web App Manifest
  • OPFS for panoramas
  • Web Share API

The Capture Pattern

The heart of VFTCam is its structured 36-point capture pattern, arranged in three rows:

Upper Row (pitch +45°):  12 points at 30° yaw intervals
Equator (pitch 0°):      12 points at 30° yaw intervals
Lower Row (pitch -45°):  12 points at 30° yaw intervals

Total Coverage: 36 overlapping images covering full sphere
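In code, the pattern reduces to a nested loop over rows and yaw steps. This sketch is illustrative; the function and field names are not VFTCam's actual API:

```javascript
// Generate the 36-point capture pattern: three rows of 12 points each.
function generateCapturePoints() {
    const points = [];
    const pitches = [45, 0, -45];  // upper row, equator, lower row
    for (const pitch of pitches) {
        for (let i = 0; i < 12; i++) {
            points.push({ pitch, yaw: i * 30 });  // 30° yaw intervals
        }
    }
    return points;
}
```

Each { pitch, yaw } pair becomes a target marker placed on the capture sphere.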

Inside the Sphere

Unlike traditional panorama apps that show a flat grid or compass interface, VFTCam places the user inside a Three.js wireframe sphere. This transforms the abstract concept of spherical capture into an intuitive, spatial experience.

Key Innovation: By visualizing capture points as physical locations on a sphere surrounding the user, VFTCam makes the complex mathematics of spherical projection tangible. Users can literally "see" the photosphere they're building from the inside out.
// scene.js - Creating the capture sphere environment
import * as THREE from './vendor/three.module.js';  // path depends on how Three.js is vendored

export class Scene {
    constructor(canvas) {
        const aspect = canvas.clientWidth / canvas.clientHeight;
        this.scene = new THREE.Scene();
        this.camera = new THREE.PerspectiveCamera(75, aspect, 0.1, 1000);

        // Create wireframe sphere (user is INSIDE looking out)
        const sphereGeometry = new THREE.SphereGeometry(100, 24, 16);
        const sphereMaterial = new THREE.MeshBasicMaterial({
            color: 0x666666,
            wireframe: true,
            opacity: 0.3,
            transparent: true,
            side: THREE.BackSide  // CRITICAL: render the inside of the sphere
        });

        this.sphere = new THREE.Mesh(sphereGeometry, sphereMaterial);
        this.scene.add(this.sphere);
    }
}

Understanding Field of View

FOV calculation was one of the trickiest aspects. Getting it wrong means gaps in coverage or wasted overlap.

Challenge: WebRTC's getUserMedia captures video streams, not photos. The FOV for video mode is often different from the device's photo capture mode—typically narrower due to cropping and stabilization.
Solution: Through extensive empirical testing across devices, we found 44° horizontal FOV provided the best balance. With 30° spacing between capture points, this gives ~14° overlap (~32%).
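The arithmetic behind those numbers is straightforward:

```javascript
// Overlap between adjacent capture points at 44° FOV and 30° yaw spacing.
const fovDeg = 44;
const spacingDeg = 30;
const overlapDeg = fovDeg - spacingDeg;       // 14° of shared coverage
const overlapFraction = overlapDeg / fovDeg;  // ≈ 0.32, i.e. ~32%
```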

WebGL2 Best-Pixel Stitching

The stitching algorithm runs entirely on the GPU using WebGL2 fragment shaders. Each output pixel samples from all overlapping source images and selects the "best" one based on multiple quality metrics:

  • Distance from center: Pixels near image center have less lens distortion
  • Sharpness: Laplacian-based edge detection identifies blur
  • Angular distance: Pixels closest to their source image's center direction
  • Exposure consistency: Weighted toward median brightness across images
// Fragment shader for best-pixel selection (GLSL ES 3.00).
// Source images are bound as a sampler2DArray, since GLSL ES 3.00
// forbids indexing an array of samplers with a loop variable.
// equirectToSphere, projectToImage, isValidUV and calculateQuality
// are defined elsewhere in the shader.
void main() {
    vec2 equirect = gl_FragCoord.xy / resolution;
    vec3 sphereDir = equirectToSphere(equirect);

    float bestScore = -1.0;
    vec4 bestColor = vec4(0.0);

    for (int i = 0; i < NUM_IMAGES; i++) {
        vec2 imgUV = projectToImage(sphereDir, imageOrientations[i]);
        if (isValidUV(imgUV)) {
            float score = calculateQuality(imgUV, i);
            if (score > bestScore) {
                bestScore = score;
                bestColor = texture(images, vec3(imgUV, float(i)));
            }
        }
    }

    fragColor = bestColor;
}

Memory Management

Mobile Safari has strict memory limits (~1.4GB). With 36 high-resolution images plus WebGL textures, memory management is critical.

Texture Streaming

Only 4-6 images loaded as GPU textures at once. Images swap in/out based on which output region is being processed.
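A minimal sketch of such a policy, assuming least-recently-used eviction. The class and callback names are hypothetical; `upload` and `dispose` stand in for the actual WebGL texture creation and deletion calls:

```javascript
// Keep at most `maxResident` images resident as GPU textures,
// evicting the least recently used when the budget is exceeded.
class TexturePool {
    constructor(maxResident, upload, dispose) {
        this.max = maxResident;
        this.upload = upload;     // imageId -> texture handle
        this.dispose = dispose;   // texture handle -> void
        this.cache = new Map();   // Map insertion order doubles as LRU order
    }

    acquire(imageId) {
        if (this.cache.has(imageId)) {
            const tex = this.cache.get(imageId);
            this.cache.delete(imageId);    // re-insert to mark as
            this.cache.set(imageId, tex);  // most recently used
            return tex;
        }
        if (this.cache.size >= this.max) {
            // Evict the least recently used entry (first in the Map)
            const [oldId, oldTex] = this.cache.entries().next().value;
            this.cache.delete(oldId);
            this.dispose(oldTex);
        }
        const tex = this.upload(imageId);
        this.cache.set(imageId, tex);
        return tex;
    }
}
```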

Tiled Rendering

Output renders in 512×512 tiles. Each tile completes and copies to CPU before the next starts, preventing GPU memory accumulation.
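The tiling itself reduces to a small iterator. A sketch (tile counts naturally vary with output size and tile dimensions; the render-and-readback step is elided):

```javascript
// Split a width x height output into tileSize x tileSize tiles,
// clamping the final row/column at the image edges.
function* tiles(width, height, tileSize) {
    for (let y = 0; y < height; y += tileSize) {
        for (let x = 0; x < width; x += tileSize) {
            yield {
                x, y,
                w: Math.min(tileSize, width - x),
                h: Math.min(tileSize, height - y)
            };
        }
    }
}
```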

Aggressive Cleanup

Explicit WebGL resource deletion after each tile. Force garbage collection hints between major operations.

IndexedDB Offloading

Source images stored in IndexedDB immediately after capture. Only loaded into memory when needed for stitching.

Crash Recovery

Mobile browsers can terminate apps at any time. VFTCam implements "stitch jobs" - resumable processing that survives app crashes:

// Stitch job persistence
const stitchJob = {
    id: crypto.randomUUID(),
    createdAt: Date.now(),
    status: 'pending',
    progress: { currentTile: 0, totalTiles: 64 },
    imageIds: [...capturedImageIds],
    outputTiles: []  // Completed tiles stored as blobs
};

// Save after each tile completes
async function saveTileProgress(tileIndex, tileBlob) {
    stitchJob.outputTiles[tileIndex] = await blobToBase64(tileBlob);
    stitchJob.progress.currentTile = tileIndex + 1;
    await db.put('stitchJobs', stitchJob);
}
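Resuming then amounts to finding the first tile without a stored blob. A sketch with a hypothetical helper, not VFTCam's actual code:

```javascript
// Given a persisted stitch job, return the index of the first
// unrendered tile, or null if every tile is already complete.
function nextTileIndex(job) {
    for (let i = 0; i < job.progress.totalTiles; i++) {
        if (job.outputTiles[i] === undefined) return i;
    }
    return null;  // all tiles rendered; job is complete
}
```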

VR Viewing

Completed photospheres can be viewed in VR mode using Google Cardboard or similar headsets. The viewer uses WebXR when available, with a fallback to gyroscope-controlled stereoscopic rendering.
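The fallback chain can be sketched as a pure decision function. The capability flags would be gathered at startup, e.g. from navigator.xr.isSessionSupported('immersive-vr') and a DeviceOrientationEvent check; the final 'flat' drag-to-look mode is an assumption added for completeness, not described above:

```javascript
// Pick a viewer mode from runtime capabilities: prefer WebXR,
// fall back to gyroscope-driven stereo, then to a flat viewer.
function chooseViewerMode({ hasWebXR, hasGyro }) {
    if (hasWebXR) return 'webxr';       // native headset support
    if (hasGyro) return 'stereo-gyro';  // Cardboard-style stereo rendering
    return 'flat';                      // non-immersive fallback
}
```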

Privacy by Design

VFTCam collects zero user data:

  • No analytics or tracking scripts
  • No server-side processing - everything runs locally
  • No account required
  • GPS data is optional and only embedded in exported files
  • Images never leave the device unless explicitly shared

Performance

Metric                     iPhone 13   Pixel 6   iPad Pro
Capture time (36 photos)   2-3 min     2-3 min   2-3 min
Stitch time (4K output)    ~45 sec     ~60 sec   ~30 sec
Memory peak                ~800 MB     ~900 MB   ~1.1 GB

Output resolution (all devices): 4096 × 2048 (4K equirectangular)

Lessons Learned

  1. Test on real devices early. The iOS Safari memory limit shaped the entire architecture. Emulators don't expose these constraints.
  2. WebGL debugging is hard. GPU errors are often silent. Build extensive logging and visualization tools for shader development.
  3. Device orientation APIs vary wildly. The same code produces different results on iOS vs Android. Abstract these early.
  4. Users will hold their phones wrong. Design for imperfection. Capture the actual orientation and compensate in software.
  5. Zero-build has real benefits. No webpack config to debug. No node_modules. Just open the HTML file.

External Libraries

  • Three.js - 3D visualization and WebGL abstraction
  • Pannellum - Panorama viewing
  • Workbox - Service worker and PWA caching
  • GyroNorm.js - Cross-platform device orientation normalization

Conclusion

VFTCam demonstrates that sophisticated image processing applications can run entirely in the browser, without native code, server infrastructure, or complex build systems. By embracing web standards and working within browser constraints rather than fighting them, it's possible to build tools that are both powerful and accessible.

The app continues to be used by educators worldwide to create virtual field trips, bringing remote locations into classrooms and enabling students to explore places they could never physically visit.