Role: Tech lead

Led a team of computer-vision researchers building hand tracking for the Apple Vision Pro. Hand tracking is the primary input method on the headset (there are no controllers), so the system has to recover joint-accurate 3D hand pose from the headset cameras at AR latency, in any lighting, for any hand.

Training data at scale

The single biggest lever for hand-tracking quality is data. We built an auto-annotation stack that produced labels for more than a billion images by combining pose estimation, mesh reconstruction, and object segmentation across multi-view captures. Real footage was paired with synthetic renders so that corner cases (gloves, occlusion, underrepresented skin tones) still had ground truth.

Multi-view semi-supervised learning

The headset sees the same hand from multiple cameras at once. We exploited that with semi-supervised algorithms that treat cross-view consistency as a supervisory signal, so the model can absorb huge volumes of unlabeled real-world capture in addition to the smaller manually labeled set.
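
As a rough illustration of what a cross-view consistency signal can look like (a minimal sketch, not the production method): assume each camera's model predicts 3D joints in its own camera frame, and that camera-to-rig rotations and translations are known from calibration. The function name, argument names, and the choice of a mean-based consensus are all hypothetical.

```python
import numpy as np

def cross_view_consistency(joints_cam, rotations, translations):
    """Measure how well per-view 3D joint predictions agree in a shared frame.

    joints_cam:   (V, J, 3) predicted joints, one set per camera frame.
    rotations:    (V, 3, 3) camera-to-rig rotations (assumed known).
    translations: (V, 3)    camera-to-rig translations (assumed known).
    Returns the consensus joints (J, 3) and a scalar disagreement loss.
    """
    # Map every view into the rig frame: X_rig = R @ X_cam + t.
    joints_rig = np.einsum("vij,vkj->vki", rotations, joints_cam)
    joints_rig = joints_rig + translations[:, None, :]

    # The per-joint mean across views acts as the pseudo-label.
    consensus = joints_rig.mean(axis=0)

    # Mean squared distance of each view's prediction from the consensus.
    loss = np.mean(np.sum((joints_rig - consensus) ** 2, axis=-1))
    return consensus, loss
```

No manual label appears anywhere in the loss, which is why a signal of this kind can be computed on unlabeled multi-camera capture. In a training setup it would be written in an autodiff framework and combined with a supervised loss on the labeled set.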

Real-time 2D-to-3D lifting

On device, the runtime model produces 2D keypoint and segmentation outputs per camera, then lifts them into a single 3D hand pose using the headset's camera rig calibration. The lift has to be deterministic and keep pace with the headset's render cadence, because every millisecond of lag shows up as drift in the user's interaction.
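
One textbook way to lift calibrated multi-view 2D keypoints to 3D is linear (DLT) triangulation. The sketch below shows that technique only as an illustration; it is not necessarily the algorithm that shipped. It assumes each camera is described by a 3x4 projection matrix derived from the rig calibration, and the function names are hypothetical.

```python
import numpy as np

def triangulate_keypoint(projections, pixels):
    """Triangulate one 3D point from its 2D detections in several cameras.

    projections: (V, 3, 4) camera projection matrices (assumed calibrated).
    pixels:      (V, 2)    detected (u, v) location in each camera.
    Returns the 3D point (3,) in the frame the projections are defined in.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)

    # The least-squares solution is the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def lift_hand(projections, keypoints_2d):
    """Lift (V, J, 2) per-camera keypoints into (J, 3) hand joints."""
    num_joints = keypoints_2d.shape[1]
    return np.stack([
        triangulate_keypoint(projections, keypoints_2d[:, j])
        for j in range(num_joints)
    ])
```

The solve is closed-form with a fixed amount of work per joint, which suits a deterministic, latency-bound runtime. A real system would additionally have to handle joints that are occluded or low-confidence in some views.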

Technical Highlights:

  • Auto-annotation of 1 billion+ images using pose estimation, mesh reconstruction, and object segmentation.
  • Semi-supervised multi-view learning algorithms using real and synthetic data.
  • Design and implementation of 2D-to-3D lifting algorithms for real-time 3D hand tracking.