Role: Tech lead
Led a team of computer-vision researchers building hand tracking for the Apple Vision Pro. Hand tracking is the primary input method on the headset (there are no controllers), so the system has to recover joint-accurate 3D hand pose from the headset cameras within AR latency budgets, in any lighting, for any user's hand.
Training data at scale
The single biggest lever for hand-tracking quality is data. We built an auto-annotation stack that produced labels for more than a billion images by combining pose estimation, mesh reconstruction, and object segmentation across multi-view captures. Real footage was paired with synthetic renders so that corner cases (gloves, occlusion, underrepresented skin tones) had ground truth.
Multi-view semi-supervised learning
The headset sees the same hand from multiple cameras at once. We exploited that with semi-supervised algorithms that treat cross-view consistency as a source of pseudo-labels, so the model can absorb huge volumes of unlabeled real-world capture in addition to the (smaller) manually labeled set.
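As a rough illustration of the cross-view idea (not the production method), the sketch below assumes each camera independently predicts a 3D joint position in its own coordinate frame and that the rig extrinsics (a rotation `R` and translation `t` per camera) are known. Predictions are mapped into a shared rig frame, and their mean is accepted as a pseudo-label only when every view agrees within a threshold. The names `to_rig_frame`, `pseudo_label`, and `max_spread` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def to_rig_frame(p_cam: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Map a point from camera coordinates to the shared rig frame: R @ p + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def pseudo_label(preds_rig: List[Vec3], max_spread: float) -> Optional[Vec3]:
    """Return the cross-view mean as a pseudo-label if all views agree.

    Agreement means every prediction lies within max_spread (same units as
    the points) of the mean. Returns None when the views disagree, so the
    sample can be skipped instead of training on a noisy label.
    """
    n = len(preds_rig)
    mean = tuple(sum(p[i] for p in preds_rig) / n for i in range(3))
    spread = max(
        sum((p[i] - mean[i]) ** 2 for i in range(3)) ** 0.5 for p in preds_rig
    )
    return mean if spread <= max_spread else None
```

In a training loop, accepted pseudo-labels would typically supervise each single-view prediction, while rejected samples are dropped or down-weighted; the threshold trades label volume against label noise.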
Real-time 2D-to-3D lifting
On-device, the runtime model produces 2D keypoint and segmentation outputs per camera, then lifts them into a single 3D hand pose using the headset's camera rig calibration. The lift has to be deterministic and fit within the headset's render cadence, because every millisecond of lag shows up as drift in the user's interaction.
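One textbook way to lift matched 2D keypoints from two calibrated cameras to 3D is midpoint triangulation, sketched below as an illustration only; the source does not say which lifting algorithm was used. The sketch assumes a pinhole model with intrinsics `fx, fy, cx, cy`, known camera centers in the rig frame, and ray directions already rotated into the rig frame. All function names are hypothetical.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_ray(u: float, v: float, fx: float, fy: float,
                 cx: float, cy: float) -> Vec3:
    """Back-project a pixel to a ray direction under a pinhole model.

    The direction is in the camera frame; rotating it into the rig frame
    with the calibration extrinsics is left to the caller.
    """
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangulate_midpoint(o1: Vec3, d1: Vec3,
                         o2: Vec3, d2: Vec3) -> Optional[Vec3]:
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.

    Closed form with no iteration, so the cost and result are deterministic.
    Returns None when the rays are (near-)parallel.
    """
    w = (o1[0] - o2[0], o1[1] - o2[1], o1[2] - o2[2])
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    # Closest point on each ray, then average the two.
    return tuple(
        0.5 * ((o1[i] + s * d1[i]) + (o2[i] + t * d2[i])) for i in range(3)
    )
```

With more than two cameras, a least-squares formulation over all rays is the usual generalization; the two-ray case is shown here because it is closed-form and easy to verify by hand.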
Technical Highlights:
- Auto-annotation of 1 billion+ images involving pose estimation, mesh reconstruction, and object segmentation.
- Semi-supervised multi-view algorithms involving real and synthetic data.
- Designed and implemented 2D-to-3D lifting algorithms for real-time 3D hand tracking.
