Hand Tracking For Robot Arm Teleoperation
An impartial research summary on camera-based hand tracking, XR hand skeletons, monocular 3D reconstruction, and the limits of using hand pose as an interface for robot-arm teleoperation.
- Published: May 9, 2026
- Reading: 19 min
- Author: Christopher Lyon
- Filed: Research

The current generation of hand tracking is good enough to make a robot arm feel reachable through camera vision. It is not yet good enough to make the camera the safety system.
That is the useful line through the literature. The strongest systems now recover both hands, individual finger joints, handedness, pinch state, and a hand-relative 3D skeleton in real time. Meta's Quest runtime shows what this feels like when the hardware, cameras, tracking model, interaction layer, and coordinate system are designed together. MediaPipe shows that a lighter version can run locally from ordinary RGB video. Apple, Ultraleap, OpenXR, Android XR, and recent 3D hand reconstruction papers all point in the same direction: the hand is becoming a usable software interface, not just a gesture trigger.[1][2][3][4][5]
The hard part is the translation from "we can see the hand" to "the robot should move there." Robot teleoperation needs a stable command frame, predictable latency, recovery from tracking loss, clear mode changes, and independent safety. A finger pose estimate can express intent. It should not be trusted as raw motion authority for a high-force machine.
The Control Idea
The interface under examination is natural because it mirrors the task. The right hand supplies the spatial command: where the robot end effector should go, how the wrist should turn, and how the tool should be oriented. The distance between thumb and index finger supplies an analog gripper signal, from closed to open. The left hand supplies a speed or precision layer, potentially with a second wrist-orientation command.
That mapping is easy to understand:
| Human signal | Robot signal | Main uncertainty |
|---|---|---|
| Right-hand position | End-effector target | Camera-frame depth, jitter, calibration |
| Right wrist orientation | Tool or wrist orientation | Forearm visibility, joint limits, singularities |
| Thumb-index distance | Gripper aperture | Hand-size normalization, occlusion, false pinches |
| Left-hand open/closed state | Speed or precision multiplier | Mode confusion, fatigue, accidental commands |
| Left wrist rotation | Tool-roll offset | Separating rotation from translation |
| Low-confidence tracking | Hold or stop | Recovery without unexpected motion |
The appeal is clear. A gripper wants an aperture signal, and the human hand already provides one. A robot wrist wants a 3D orientation, and a hand can express it without a joystick, teach pendant, or master arm. But this is also where the danger sits. A camera-based hand estimate is probabilistic. A robot arm is physical.
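Read as software, that table is a command schema: a small, typed message from the perception side to the control side. A minimal sketch in Python makes the boundary explicit; the field names are this article's invention, not any SDK's:

```python
from dataclasses import dataclass
from enum import Enum


class TrackingState(Enum):
    OK = "ok"              # landmarks above confidence threshold
    DEGRADED = "degraded"  # low confidence: hold last safe command
    LOST = "lost"          # no hand: robot must hold or decay to stop


@dataclass
class HandCommand:
    """One frame of operator intent derived from hand tracking.

    Field names are illustrative, not from any tracking SDK.
    """
    # Right hand: spatial command
    target_position_m: tuple[float, float, float]   # end-effector target, robot frame
    target_orientation_quat: tuple[float, float, float, float]  # wrist/tool orientation
    gripper_aperture: float        # 0.0 closed .. 1.0 open, from thumb-index distance

    # Left hand: mode layer
    speed_scale: float             # precision multiplier from open/closed state
    tool_roll_offset_rad: float    # from left wrist rotation

    # Trust metadata: the controller, not the tracker, decides what to do with it
    state: TrackingState
    confidence: float              # minimum landmark confidence this frame
    timestamp_ns: int              # for latency and staleness checks
```

The point of the `state` and `confidence` fields is that the downstream controller decides what a degraded frame is allowed to do; the tracker only reports.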
What Exists Now
The model landscape breaks into four practical groups.
First are landmark trackers. MediaPipe Hands remains the most useful public baseline because it was built for real-time, on-device tracking. Its pipeline uses palm detection followed by hand landmark estimation, and the current Hand Landmarker task returns image landmarks, world landmarks, and handedness.[6][2] The web version exposes the same basic capability through JavaScript.[7] TensorFlow.js made browser-side hand pose visible early, with 21 3D landmarks and local execution rather than cloud video upload.[8][9]
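As a concrete anchor, here is a minimal sketch of the Hand Landmarker task in Python, following the documented MediaPipe Tasks API. The model-bundle path and the single-image stand-in for a camera loop are assumptions of the sketch:

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# The .task model bundle must be downloaded separately; the path is an assumption.
options = vision.HandLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="hand_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO,  # stateful tracking across frames
    num_hands=2,
    min_hand_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

with vision.HandLandmarker.create_from_options(options) as landmarker:
    # In a real loop, frames would come from a camera (e.g. OpenCV, converted to RGB)
    # with monotonically increasing timestamps.
    frame = mp.Image.create_from_file("hand.jpg")
    result = landmarker.detect_for_video(frame, timestamp_ms=0)

    # The three outputs the article relies on:
    #   result.hand_landmarks        -> 21 image-space landmarks per hand (normalized)
    #   result.hand_world_landmarks  -> hand-relative 3D landmarks in meters
    #   result.handedness            -> left/right classification with a score
    for hand, handed in zip(result.hand_world_landmarks, result.handedness):
        tip = hand[8]  # index fingertip in MediaPipe's landmark indexing
        print(handed[0].category_name, tip.x, tip.y, tip.z)
```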
Second are platform APIs. Apple Vision exposes hand-pose detection through VNDetectHumanHandPoseRequest, while ARKit and visionOS expose richer hand-skeleton concepts for spatial computing.[3][10][11] Meta's Quest documentation exposes hand tracking, gestures, Interaction SDK abstractions, Fast Motion Mode, Wide Motion Mode, and OpenXR skeleton support.[12][13][14][15] OpenXR gives this work a common language: XR_EXT_hand_tracking defines runtime-provided hand joints, and the standard hand-joint count is 26.[5][16] That matters because a robot-control layer should not be tied to one vendor's landmark names.
Third are 3D reconstruction systems. MANO remains the common parametric hand model, and datasets such as FreiHAND, InterHand2.6M, HO-3D, DexYCB, BigHand2.2M, and RHD form much of the evidence base around hand pose, hand shape, two-hand interaction, depth sensing, and hand-object grasping.[17][18][19][20][21][22][23] Newer systems such as HaMeR, WiLoR, MobRecon, and Hamba show the direction of travel: stronger monocular 3D hand recovery, more robust in-the-wild localization, and mesh estimates rather than only sparse keypoints.[24][25][26][27]
Fourth are dedicated hand-tracking sensors. Ultraleap is the clearest comparison point because it uses stereo infrared sensing and provides hands, digits, bones, frames, and a millimeter coordinate system through its developer APIs.[4][28] It is not the same product class as a plain RGB camera, but it is valuable as a practical benchmark: if a general camera struggles with depth or occlusion, a dedicated hand sensor shows what better sensing buys.
Why Quest-Class Tracking Feels Different
Quest hand tracking is often discussed as though the model is the product. The public documentation suggests a different reading. The product is the runtime.
Meta controls the camera geometry, exposure behavior, headset coordinate frame, hand interaction design, and application APIs. Fast Motion Mode exists because fast hands are difficult. Wide Motion Mode exists because hands leave the field of view. The Interaction SDK exists because raw skeletons are not enough to build usable interaction.[13][14][12]
That distinction matters for robot teleoperation. A headset runtime can fuse body, head, hand, controller, and scene context in ways a single application-level model cannot. Meta also documents the failure cases: occlusion, low light, noisy tracking, and low-confidence states.[29] The lesson is not that Quest has solved robot control. The lesson is that convincing hand input requires the whole sensing and interaction stack to be engineered around hands.
The 3D Problem
The central vision problem is not detecting fingers. It is estimating where the hand is in a coordinate frame that a robot can use.
Single-camera RGB models can infer a plausible 3D hand because hands are constrained objects. Fingers have known joints, palms have recognizable proportions, and training data gives the model strong priors. But plausible hand geometry is not the same as metric 3D position. The same hand can appear small and close or larger and farther away. Lens distortion, hand size, wrist visibility, lighting, motion blur, and self-occlusion all affect the estimate.
For robot control, the coordinate chain has to be explicit:
| Layer | Meaning | Failure mode |
|---|---|---|
| Image landmarks | 2D points in the camera frame | Good overlay, weak depth |
| Hand-relative 3D | Skeleton around the hand itself | Useful pose, uncertain global position |
| Camera-relative 3D | Hand root relative to the camera | Needs calibration and scale assumptions |
| Operator control volume | Bounded space where hand motion becomes command intent | Must prevent jumps and impossible targets |
| Robot target frame | End-effector pose command | Needs transform, limits, and filtering |
| Robot joint frame | Motor or joint targets from inverse kinematics | Exposed to singularities, collisions, and joint limits |
This is where many hand-tracking demos overstate the result. Drawing a 3D skeleton over a video feed is not the same as driving a robot endpoint. A robot needs a stable transform, not a visual impression.
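A sketch of the middle of that chain shows how much sits outside the tracking model. It assumes a camera-to-robot transform measured offline and a fixed operator control volume, both of which are inputs to the system rather than outputs of any hand tracker:

```python
import numpy as np

# Assumptions, not outputs of a tracking model:
# T_robot_cam is a 4x4 homogeneous transform from camera frame to robot base
# frame, measured offline (e.g. with a calibration target). Identity here is
# a placeholder value.
T_robot_cam = np.eye(4)

# Operator control volume in the robot frame (meters); illustrative bounds.
VOLUME_MIN = np.array([0.20, -0.30, 0.05])
VOLUME_MAX = np.array([0.60, 0.30, 0.50])

MAX_STEP_M = 0.03  # per-frame translation limit; rejects re-acquisition jumps


def hand_root_to_robot_target(hand_root_cam_m, previous_target_m):
    """Map a camera-relative hand root (meters) to a bounded robot-frame target.

    Returns None when the command should be rejected; the caller holds pose.
    """
    p = T_robot_cam @ np.append(hand_root_cam_m, 1.0)
    target = p[:3]

    # Reject targets outside the control volume instead of clamping silently:
    # clamping hides calibration drift against the volume edge.
    if np.any(target < VOLUME_MIN) or np.any(target > VOLUME_MAX):
        return None

    # Reject per-frame jumps larger than the robot should ever be asked to make.
    if previous_target_m is not None:
        if np.linalg.norm(target - previous_target_m) > MAX_STEP_M:
            return None

    return target
```

Returning None rather than a clamped pose is deliberate: a rejected frame becomes a hold, which is the safe default the rest of the article argues for.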
Gripper Mapping Is The Strongest Signal
Thumb-index pinch is the cleanest part of the interface. It is local to the hand, easy to normalize, and directly related to a claw or parallel gripper. It does not require the system to know the hand's exact global position.
A robust mapping would use thumb-tip and index-tip distance normalized against hand scale, then add hysteresis, confidence gating, and smoothing before it reaches the robot. OpenXR's hand interaction work is relevant here because it standardizes hand-oriented input paths such as pinch and poke poses, but a physical robot should still compute and validate its own gripper command.[30]
The details matter; the code sketch after the table walks through the same steps:
| Step | Purpose |
|---|---|
| Measure thumb-tip to index-tip distance | Raw aperture signal |
| Normalize against palm or finger scale | Handles different hand sizes |
| Gate on landmark confidence | Prevents false closure during occlusion |
| Add hysteresis | Stops flicker near open/closed thresholds |
| Smooth over time | Removes visual jitter |
| Clamp gripper velocity and force | Keeps the physical tool inside limits |
This is the strongest near-term use of hand tracking for robot control: not free-space arm motion first, but a dexterous analog command that existing interfaces often handle poorly.
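Here is that pipeline as a sketch, with illustrative thresholds rather than validated constants, and MediaPipe's 21-landmark indexing assumed for the fingertip and wrist points:

```python
import math


class PinchEstimator:
    """Thumb-index aperture with normalization, gating, hysteresis, smoothing.

    Thresholds and the smoothing factor are illustrative starting points.
    """

    def __init__(self):
        self.aperture = 1.0      # 0.0 closed .. 1.0 open
        self.is_closed = False
        self.alpha = 0.3         # exponential smoothing factor
        self.close_below = 0.25  # hysteresis: close threshold
        self.open_above = 0.40   # hysteresis: reopen threshold
        self.min_confidence = 0.7

    @staticmethod
    def _dist(a, b):
        return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

    def update(self, landmarks, confidence):
        """landmarks: 21 hand-relative 3D points (MediaPipe indexing assumed)."""
        if confidence < self.min_confidence:
            # Gate on confidence: hold the last good state instead of closing
            # the gripper on an occluded, badly estimated hand.
            return self.aperture, self.is_closed

        # Normalize thumb-tip (4) to index-tip (8) distance by a hand-scale
        # reference: wrist (0) to index MCP (5), which is stable across poses.
        raw = self._dist(landmarks[4], landmarks[8])
        scale = self._dist(landmarks[0], landmarks[5])
        if scale < 1e-6:
            return self.aperture, self.is_closed
        normalized = min(raw / scale, 1.5) / 1.5  # clamp, map to 0..1

        # Smooth over time to suppress landmark jitter.
        self.aperture += self.alpha * (normalized - self.aperture)

        # Hysteresis stops flicker near the open/closed boundary.
        if self.is_closed and self.aperture > self.open_above:
            self.is_closed = False
        elif not self.is_closed and self.aperture < self.close_below:
            self.is_closed = True
        return self.aperture, self.is_closed
```

Velocity and force clamping are intentionally absent: the table's last row belongs in the robot-side controller, which should enforce its own limits whatever the vision stack sends.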
Arm Motion Is Harder
End-effector position is more difficult because it involves depth, scale, and frame alignment. Absolute mapping looks intuitive: move the hand right, the robot moves right; move forward, the robot moves forward. In practice, absolute mapping is brittle. Tracking loss, hand reacquisition, camera movement, or a bad depth estimate can create jumps.
Teleoperation literature points toward more conservative mappings. Workspace-level control of underwater manipulators uses operator commands that are translated into robot joint motion by inverse kinematics.[31] Bilateral and shared-control work treats direct control, supervisory control, and shared control as a spectrum, with shared control reducing workload while preserving human authority.[32]
For a hand-tracked arm, that implies four possible mappings:
| Mapping | Strength | Weakness |
|---|---|---|
| Absolute hand pose to end-effector pose | Intuitive and visually direct | Vulnerable to jumps and calibration drift |
| Relative hand movement while enabled | Safer and easier to clutch | Less natural at first |
| Rate control from displacement around neutral | Stable over larger motions | Less precise for point placement |
| Shared control around task constraints | Strongest for real work | Requires a task model and environment sensing |
The most defensible reading of the evidence is that raw one-to-one mapping belongs in demonstrations and constrained tests. Work near valuable hardware, subsea equipment, hazardous material, or patients needs a control layer that can reject bad targets.
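Of the four, the relative mapping is the easiest to make concrete: hand motion moves the robot only while an explicit enable (a clutch) is engaged, and both frames re-anchor on every engage, so tracking loss and re-acquisition cannot produce a jump. A minimal sketch, with scaling and bounds chosen for illustration:

```python
import numpy as np


class ClutchedRelativeMapper:
    """Relative hand-to-robot mapping behind an explicit enable signal.

    While the clutch is engaged, hand displacement from the engage point is
    added (scaled) to the robot pose captured at engage time. Disengaging and
    re-engaging re-anchors both frames, like lifting and re-placing a mouse.
    """

    def __init__(self, scale=0.5, max_offset_m=0.25):
        self.scale = scale              # <1 trades reach for precision
        self.max_offset_m = max_offset_m
        self._hand_anchor = None        # hand position at engage (meters)
        self._robot_anchor = None       # end-effector position at engage

    def engage(self, hand_pos, robot_pos):
        self._hand_anchor = np.asarray(hand_pos, dtype=float)
        self._robot_anchor = np.asarray(robot_pos, dtype=float)

    def disengage(self):
        self._hand_anchor = None
        self._robot_anchor = None

    def target(self, hand_pos):
        """Return a robot target, or None when the clutch is not engaged."""
        if self._hand_anchor is None:
            return None
        offset = self.scale * (np.asarray(hand_pos, dtype=float) - self._hand_anchor)
        # Bound the reachable offset per engagement; longer moves require
        # re-clutching, which keeps each motion segment small and auditable.
        norm = np.linalg.norm(offset)
        if norm > self.max_offset_m:
            offset *= self.max_offset_m / norm
        return self._robot_anchor + offset
```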
Browser, Local, And Dedicated Runtime Paths
The software path is no longer speculative. MediaPipe Hand Landmarker for Web can run local hand tracking in a browser.[7] ONNX Runtime Web offers browser inference paths through WebAssembly, WebGPU, WebGL, and WebNN where supported.[33][34][35] WebNN is emerging as a lower-level browser API for neural-network graph execution on available hardware backends.[36]
The question is not whether local inference can run. It can. The question is which work should happen where.
| Runtime path | Best fit | Constraint |
|---|---|---|
| Browser landmarks | Low-friction hand pose, pinch, visualization, logging | Limited control over heavy models and hardware integration |
| Local service | Heavier reconstruction, ROS 2 bridge, simulation, logging, replay | More installation and support burden |
| Native platform API | Access to system hand-pose frameworks | Less portable across devices |
| Dedicated sensor SDK | Better depth and hand-specific tracking | Extra hardware and integration cost |
MoveIt 2 and moveit_kinematics are relevant downstream because inverse kinematics is the bridge from a desired end-effector pose to robot joint targets.[37][38] But MoveIt is not the safety controller. It is a manipulation and planning framework. Industrial motion still belongs behind robot controllers, safety-rated stops, cell limits, and risk assessment.
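On the ROS 2 side this reduces to a node that publishes already-validated end-effector targets and leaves kinematics to the planning stack. A minimal rclpy sketch; the topic name, frame name, and 30 Hz rate are assumptions of this sketch, not MoveIt conventions:

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped


class HandTargetPublisher(Node):
    """Publishes validated end-effector targets from the hand-tracking pipeline.

    The topic '/hand_teleop/target_pose' is this sketch's invention; a
    MoveIt-based consumer (e.g. a planning node or MoveIt Servo configured for
    pose tracking) would be set up to read it. Safety stops stay in the robot
    controller, not in this node.
    """

    def __init__(self):
        super().__init__("hand_target_publisher")
        self.pub = self.create_publisher(PoseStamped, "/hand_teleop/target_pose", 10)
        self.timer = self.create_timer(1.0 / 30.0, self.tick)  # 30 Hz command rate

    def tick(self):
        target = self.latest_validated_target()  # gated, clamped, filtered upstream
        if target is None:
            return  # hold: publish nothing rather than extrapolate
        msg = PoseStamped()
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.header.frame_id = "base_link"  # robot base frame; name is an assumption
        msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = target
        msg.pose.orientation.w = 1.0  # identity orientation as a placeholder
        self.pub.publish(msg)

    def latest_validated_target(self):
        # Placeholder for the control-volume and rate-limit checks sketched above.
        return None


def main():
    rclpy.init()
    rclpy.spin(HandTargetPublisher())


if __name__ == "__main__":
    main()
```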
What Industrial Interfaces Teach
Industrial human-robot interaction research has been moving toward multimodal interfaces: gestures, speech, vision, augmented displays, haptics, and supervisory control.[39][40] Leap Motion-era literature is useful because it made the same promise earlier: low-cost 3D hand input for robot motion. Reviews of Leap Motion interaction note the common pattern of mapping hand positions or gestures into robot motion and then resolving the robot's joints through inverse kinematics.[41]
The older studies also show the trap. Contactless interfaces remove hardware, but they also remove tactile grounding. Physical masters, joysticks, teach pendants, pedals, and haptic devices persist because they give repeatable frames, mechanical clutching, muscle memory, and known failure modes. Weichert and colleagues measured Leap Motion accuracy against an industrial robot reference setup, which is the right mindset: the input device has to be measured, not merely experienced.[42]
The result is not a simple replacement story. Hand tracking competes poorly with physical controls when the operator needs force feedback, detents, or long-duration comfort. It competes well when the operator needs dexterous finger intent, fast mode switching, low-contamination interaction, or a software-defined interface that can change by task.
Lessons From Subsea, Nuclear, And Surgery
Subsea teleoperation is a useful comparison because operators already work through cameras and remote manipulators. Recent ROV studies describe a control environment shaped by limited visibility, turbulence, localization problems, poor depth perception, and missing sensory feedback.[43] Body-motion mapping, visual-haptic feedback, and VR-haptic systems are being explored because ordinary video and joysticks do not solve the perception problem on their own.[44][45][46]
Nuclear remote handling is less forgiving. Remote operation is required around radioactive and corrosive samples, and the literature still leans heavily on master-slave manipulators and haptic feedback.[47] New haptic master devices are being developed for nuclear and aerospace tasks precisely because dexterous remote manipulation benefits from force, posture, and joint correspondence.[48]
Surgical robotics offers the most mature public example of professional teleoperation. Systems such as da Vinci use purpose-built master consoles, not casual free-space gestures.[49] Reviews of telesurgery focus on latency, communication reliability, and haptic feedback.[50] Raven-II shows the same point from the research side: serious teleoperation platforms are instrumented systems, not input demos.[51]
Across these domains, hand tracking reads best as an input layer. It does not remove the need for haptics, shared control, redundancy, or task constraints. It adds a new way to express intent.
Safety Boundary
A hand tracker is not a safety device.
That sentence carries most of the industrial implication. Robot standards and safety guidance are built around machinery risk, not interface novelty. OSHA's robotics guidance points to industrial robot standards while also noting that tele-operated manipulators, undersea robots, and medical robots can fall outside direct industrial-robot scope.[52] ISO/TS 15066 covers collaborative industrial robot systems, while ISO 10218-1:2025 covers industrial robot safety requirements.[53][54] EU machinery regulation and safety-case standards such as UL 4600 add another reminder: software that influences motion has to be argued through hazards, evidence, and lifecycle controls.[55][56]
The practical risk table is straightforward:
| Failure | Robot-control consequence |
|---|---|
| Hand disappears | Motion should hold or decay, not extrapolate blindly |
| Landmarks jitter | Target motion should be filtered and rate limited |
| Left and right hands swap | Mode and speed commands can become unsafe |
| Pinch is falsely detected | Gripper may close unexpectedly |
| Camera moves | Calibration is no longer valid |
| User leaves the frame | The robot needs an unambiguous hold state |
| IK hits a singularity | The target should be rejected or re-planned |
| Operator becomes fatigued | Mid-air control quality degrades |
None of those failures is exotic. They are normal operating conditions for camera-based tracking.
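Most of those rows collapse into one design requirement: an explicit, time-based command-state machine between the tracker and the robot, so that "the hand came back" never means "move now." A minimal sketch with illustrative thresholds:

```python
import time
from enum import Enum


class CommandState(Enum):
    ACTIVE = "active"  # validated commands pass through, filtered and rate limited
    HOLD = "hold"      # robot holds its current pose; no extrapolation
    STOP = "stop"      # robot decays to a safe stop; explicit re-enable required


class TrackingWatchdog:
    """Turns per-frame tracking confidence into an explicit command state.

    Thresholds are illustrative. A short dropout degrades ACTIVE to HOLD; a
    long one forces STOP. Leaving STOP requires fresh confident frames AND an
    explicit operator re-enable, so a reacquired hand can never command motion
    by itself.
    """

    HOLD_AFTER_S = 0.15
    STOP_AFTER_S = 1.00
    MIN_CONFIDENCE = 0.7

    def __init__(self):
        self.state = CommandState.STOP
        self._last_good_s = None

    def update(self, confidence, reenable_pressed=False, now_s=None):
        now_s = time.monotonic() if now_s is None else now_s
        if confidence >= self.MIN_CONFIDENCE:
            self._last_good_s = now_s
        since_good = (float("inf") if self._last_good_s is None
                      else now_s - self._last_good_s)

        if since_good > self.STOP_AFTER_S:
            self.state = CommandState.STOP
        elif since_good > self.HOLD_AFTER_S:
            if self.state is CommandState.ACTIVE:
                self.state = CommandState.HOLD
        else:  # tracking is currently good
            if self.state is CommandState.HOLD:
                self.state = CommandState.ACTIVE     # short dropout: resume
            elif self.state is CommandState.STOP and reenable_pressed:
                self.state = CommandState.ACTIVE     # long dropout: re-enable
        return self.state
```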
What Should Be Measured
The evidence points to a measurement program before any strong claims about robot-arm control. The relevant numbers are not only model frame rate or landmark count.
| Metric | Why it matters |
|---|---|
| End-to-end latency | Delay changes teleoperation stability |
| Frame-to-frame jitter | Small visual noise can become robot vibration |
| Pinch repeatability | Analog gripper control needs predictable scaling |
| Dropout frequency | Occlusion is common during hand work |
| Reacquisition behavior | Returning hands should not create command jumps |
| Handedness stability | Left/right roles can be safety-relevant |
| Depth error | Single-camera z motion is the hardest axis |
| Operator fatigue | Free-space control can become uncomfortable |
| Task completion time | Input methods should be compared against physical baselines |
| Error rate | Overshoot, wrong grasp, and missed stops matter more than demo smoothness |
NIST's robot test-method work is relevant less for a specific hand benchmark than for the discipline it implies: define repeatable tests, measure the input and robot together, and separate one-off success from operational reliability.[57] NIOSH's workplace robotics work points in the same direction from the worker side: the interface has to be evaluated as part of a work system.[58]
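Several of those metrics need nothing more than a logged landmark stream. A sketch for two of them, frame-to-frame jitter and dropout frequency; the metric definitions are this sketch's conveniences, not a standard's:

```python
import numpy as np


def jitter_and_dropouts(positions, confidences, min_confidence=0.7):
    """Compute simple stream metrics from logged per-frame hand-root positions.

    positions: (N, 3) array of hand-root estimates in meters (NaN when untracked).
    confidences: (N,) per-frame confidence scores.
    Returns RMS frame-to-frame displacement over confident frames, and the
    number of dropout episodes (transitions from tracked to untracked).
    """
    positions = np.asarray(positions, dtype=float)
    good = np.asarray(confidences) >= min_confidence

    # Jitter: RMS displacement between consecutive confident frames. Measured
    # on a nominally still hand, this is pure estimator noise, which the robot
    # would otherwise reproduce as vibration.
    pair_ok = good[1:] & good[:-1]
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)[pair_ok]
    rms_jitter_m = float(np.sqrt(np.mean(steps**2))) if steps.size else float("nan")

    # Dropouts: count tracked-to-untracked transitions, one per episode.
    dropout_count = int(np.sum(good[:-1] & ~good[1:]))
    return rms_jitter_m, dropout_count
```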
Bottom Line
Camera-based hand tracking has crossed the threshold from novelty to usable interface technology. The model and runtime ecosystem is mature enough to track fingers, recover hand skeletons, estimate pinch, and feed a robot-control stack with human intent. Quest-class runtimes show the high end of integrated hand interaction. MediaPipe and related tools show that lighter local tracking is broadly accessible. 3D reconstruction research is improving the quality of hand pose and mesh estimates year by year.
The evidence is weaker at the point where software intent becomes physical motion. Depth ambiguity, occlusion, latency, fatigue, missing haptics, calibration drift, and safety certification remain the real limits. Thumb-index gripper control is the strongest near-term signal because it is local, analog, and naturally matched to the tool. Full 3D arm teleoperation is possible only when hand tracking is treated as one sensor feeding a constrained controller.
The impartial conclusion is therefore conservative: hand tracking is credible for robot-arm teleoperation research and constrained demonstrations, especially for gripper aperture and high-level end-effector intent. It is not, by itself, a replacement for physical master controls in high-consequence work. The serious version is hybrid: visual hand tracking for dexterous intent, conventional safety hardware for authority, and a robot-control layer that assumes the camera will sometimes be wrong.
Footnotes
1. Meta Horizon OS Developers. Hand Tracking Overview for Meta Quest in Unity. https://developers.meta.com/horizon/documentation/unity/unity-handtracking-overview/
2. Google AI Edge. Hand landmarks detection guide. https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker
3. Apple Developer Documentation. VNDetectHumanHandPoseRequest. https://developer.apple.com/documentation/vision/vndetecthumanhandposerequest
4. Ultraleap Documentation. Ultraleap Hand Tracking Overview. https://docs.ultraleap.com/hand-tracking/index.html
5. Khronos Group. OpenXR specification, XR_EXT_hand_tracking. https://registry.khronos.org/OpenXR/specs/1.1/html/xrspec.html#XR_EXT_hand_tracking
6. Zhang, F. et al. MediaPipe Hands: On-device Real-time Hand Tracking. CV4ARVR, 2020. https://research.google/pubs/mediapipe-hands-on-device-real-time-hand-tracking/
7. Google AI Edge. Hand landmarks detection guide for Web. https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker/web_js
8. TensorFlow Blog. Face and hand tracking in the browser with MediaPipe and TensorFlow.js. https://blog.tensorflow.org/2020/03/face-and-hand-tracking-in-browser-with-mediapipe-and-tensorflowjs.html
9. TensorFlow Blog. 3D Hand Pose with MediaPipe and TensorFlow.js. https://blog.tensorflow.org/2021/11/3D-handpose.html
10. Apple Developer Documentation. HandSkeleton. https://developer.apple.com/documentation/arkit/handskeleton
11. Apple Developer Documentation. Tracking and visualizing hand movement. https://developer.apple.com/documentation/visionOS/tracking-and-visualizing-hand-movement
12. Meta Horizon OS Developers. Interaction SDK. https://developers.meta.com/horizon/documentation/unity/unity-isdk-interaction-sdk-overview/
13. Meta Horizon OS Developers. Use Fast Motion Mode. https://developers.meta.com/horizon/documentation/unity/fast-motion-mode/
14. Meta Horizon OS Developers. Use Wide Motion Mode. https://developers.meta.com/horizon/documentation/unity/unity-wide-motion-mode/
15. Meta Horizon OS Developers. OpenXR Hand Skeleton in Interaction SDK. https://developers.meta.com/horizon/documentation/unity/unity-isdk-openxr-hand/
16. Khronos Group. XR_HAND_JOINT_COUNT_EXT manual page. https://registry.khronos.org/OpenXR/specs/1.1/man/html/XR_HAND_JOINT_COUNT_EXT.html
17. Romero, J., Tzionas, D., and Black, M. J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM TOG, 2017. https://mano.is.tue.mpg.de/
18. Zimmermann, C. et al. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images. ICCV, 2019. https://lmb.informatik.uni-freiburg.de/Publications/2019/ZAB19/
19. Moon, G. et al. InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image. ECCV, 2020. https://mks0601.github.io/InterHand2.6M/
20. Hampali, S. et al. HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation. ICCV, 2019. https://www.tugraz.at/index.php?id=40231
21. Chao, Y. W. et al. DexYCB: A Benchmark for Capturing Hand Grasping of Objects. CVPR, 2021. https://dex-ycb.github.io/
22. Yuan, S. et al. BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis. CVPR, 2017. https://arxiv.org/abs/1704.02612
23. Zimmermann, C. and Brox, T. Learning to Estimate 3D Hand Pose from Single RGB Images. ICCV, 2017. https://lmb.informatik.uni-freiburg.de/projects/hand3d/
24. Pavlakos, G. et al. Reconstructing Hands in 3D with Transformers. CVPR, 2024. https://openaccess.thecvf.com/content/CVPR2024/html/Pavlakos_Reconstructing_Hands_in_3D_with_Transformers_CVPR_2024_paper.html
25. Potamias, R. A. et al. WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild. arXiv:2409.12259, 2024. https://arxiv.org/abs/2409.12259
26. Chen, X. et al. MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image. CVPR, 2022. https://arxiv.org/abs/2112.02753
27. Dong, H. et al. Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba. NeurIPS, 2024. https://publications.ri.cmu.edu/hamba-single-view-3d-hand-reconstruction-with-graph-guided-bi-scanning-mamba
28. Ultraleap Documentation. Leap Concepts. https://docs.ultraleap.com/api-reference/tracking-api/leapc-guide/leap-concepts.html
29. Meta Horizon OS Developers. Troubleshooting and Limitations. https://developers.meta.com/horizon/documentation/unity/unity-handtracking-troubleshooting-limitations/
30. Khronos Group. OpenXR specification, XR_EXT_hand_interaction. https://registry.khronos.org/OpenXR/specs/1.1/html/xrspec.html#XR_EXT_hand_interaction
31. Yuh, J. et al. Workspace control system of underwater tele-operated manipulators on an ROV. Ocean Engineering, 2010. https://www.sciencedirect.com/science/article/abs/pii/S0029801810000892
32. Xu, X. et al. Bilateral teleoperation with object-adaptive mapping. Complex and Intelligent Systems, 2022. https://link.springer.com/article/10.1007/s40747-021-00546-z
33. ONNX Runtime. How to add machine learning to your web application with ONNX Runtime. https://onnxruntime.ai/docs/tutorials/web/
34. ONNX Runtime. Using the WebGPU Execution Provider. https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html
35. ONNX Runtime. Web get started support matrix. https://onnxruntime.ai/docs/get-started/with-javascript/web.html
36. WebNN. Web Neural Network API FAQ. https://webnn.io/en/faq/api
37. MoveIt. MoveIt 2 documentation. https://moveit.picknik.ai/main/index.html
38. ROS Documentation. moveit_kinematics package. https://docs.ros.org/en/ros2_packages/rolling/api/moveit_kinematics/
39. Berg, J. and Lu, S. Review of Interfaces for Industrial Human-Robot Interaction. Current Robotics Reports, 2020. https://link.springer.com/article/10.1007/s43154-020-00005-6
40. Qi, J. et al. Computer vision-based hand gesture recognition for human-robot interaction: a review. Complex and Intelligent Systems, 2024. https://link.springer.com/article/10.1007/s40747-023-01173-6
41. Guna, J. et al. Review of Three-Dimensional Human-Computer Interaction with Focus on the Leap Motion Controller. Sensors, 2018. https://www.mdpi.com/1424-8220/18/7/2194
42. Weichert, F. et al. Analysis of the Accuracy and Robustness of the Leap Motion Controller. Sensors, 2013. https://www.mdpi.com/1424-8220/13/5/6380
43. Chen, R. et al. Sensory augmentation for subsea robot teleoperation. Computers in Industry, 2023. https://www.sciencedirect.com/science/article/pii/S0166361522002329
44. Chen, R. et al. ROV teleoperation via human body motion mapping: Design and experiment. Computers in Industry, 2023. https://www.sciencedirect.com/science/article/abs/pii/S0166361523001094
45. Chen, R. et al. Visual-haptic feedback for ROV subsea navigation control. Automation in Construction, 2023. https://www.sciencedirect.com/science/article/pii/S0926580523002479
46. Chen, R. et al. SubSense: VR-Haptic and Motor Feedback for Immersive Control in Subsea Telerobotics. arXiv:2510.02594, 2025. https://arxiv.org/abs/2510.02594
47. Li, J. et al. Application of isomorphic remote manipulators with haptic feedback in spent fuel reprocessing sample analysis. International Journal of Advanced Nuclear Reactor Design and Technology, 2025. https://www.sciencedirect.com/science/article/pii/S2468605025000869
48. Liu, Y. et al. Novel Haptic Device and Control Strategy for Manipulator Teleoperation in Nuclear and Aerospace Tasks. Chinese Journal of Mechanical Engineering, 2025. https://link.springer.com/article/10.1186/s10033-025-01351-2
49. Intuitive. da Vinci surgical systems. https://www.intuitive.com/en-us/products-and-services/da-vinci/systems
50. Choi, P. J. et al. Telesurgery: Past, Present, and Future. Cureus, 2018. https://pmc.ncbi.nlm.nih.gov/articles/PMC6067812/
51. Hannaford, B. et al. Raven-II: An Open Platform for Surgical Robotics Research. IEEE Transactions on Biomedical Engineering, 2013. https://pmc.ncbi.nlm.nih.gov/articles/PMC4813138/
52. OSHA. Robotics standards. https://www.osha.gov/robotics/standards
53. International Organization for Standardization. ISO/TS 15066:2016 Robots and robotic devices - Collaborative robots. https://www.iso.org/standard/62996.html
54. International Organization for Standardization. ISO 10218-1:2025 Robotics - Safety requirements - Part 1: Industrial robots. https://www.iso.org/standard/73933.html
55. European Union. Regulation (EU) 2023/1230 on machinery. https://eur-lex.europa.eu/eli/reg/2023/1230/oj
56. UL Standards and Engagement. UL 4600, Standard for Safety for the Evaluation of Autonomous Products. https://ulse.org/ul-standards/safety/autonomous-products
57. NIST. Mobile robots and robot test methods. https://www.nist.gov/el/intelligent-systems-division-73500/robotic-systems-smart-manufacturing-program/mobile-robots
58. NIOSH. Center for Occupational Robotics Research. https://www.cdc.gov/niosh/robotics/