Hand Tracking For Robot Arm Teleoperation
An impartial research summary on camera-based hand tracking, XR hand skeletons, monocular 3D reconstruction, and the limits of using hand pose as an interface for robot-arm teleoperation.
- Published: May 9, 2026
- Reading: 19 min
- Author: Christopher Lyon
- Filed: Research

The current generation of hand tracking is good enough to make a robot arm feel reachable through camera vision. It is not yet good enough to make the camera the safety system.
That is the useful line through the literature. The strongest systems now recover both hands, individual finger joints, handedness, pinch state, and a hand-relative 3D skeleton in real time. Meta's Quest runtime shows what this feels like when the hardware, cameras, tracking model, interaction layer, and coordinate system are designed together. MediaPipe shows that a lighter version can run locally from ordinary RGB video. Apple, Ultraleap, OpenXR, Android XR, and recent 3D hand reconstruction papers all point in the same direction: the hand is becoming a usable software interface, not just a gesture trigger.[1][2][3][4][5]
The hard part is the translation from "we can see the hand" to "the robot should move there." Robot teleoperation needs a stable command frame, predictable latency, recovery from tracking loss, clear mode changes, and independent safety. A finger pose estimate can express intent. It should not be trusted as raw motion authority for a high-force machine.
The Control Idea
The interface under examination is natural because it mirrors the task. The right hand supplies the spatial command: where the robot end effector should go, how the wrist should turn, and how the tool should be oriented. The distance between thumb and index finger supplies an analog gripper signal, from closed to open. The left hand supplies a speed or precision layer, potentially with a second wrist-orientation command.
That mapping is easy to understand:
| Human signal | Robot signal | Main uncertainty |
|---|---|---|
| Right-hand position | End-effector target | Camera-frame depth, jitter, calibration |
| Right wrist orientation | Tool or wrist orientation | Forearm visibility, joint limits, singularities |
| Thumb-index distance | Gripper aperture | Hand-size normalization, occlusion, false pinches |
| Left-hand open/closed state | Speed or precision multiplier | Mode confusion, fatigue, accidental commands |
| Left wrist rotation | Tool-roll offset | Separating rotation from translation |
| Low-confidence tracking | Hold or stop | Recovery without unexpected motion |
The appeal is clear. A gripper wants an aperture signal, and the human hand already provides one. A robot wrist wants a 3D orientation, and a hand can express it without a joystick, teach pendant, or master arm. But this is also where the danger sits. A camera-based hand estimate is probabilistic. A robot arm is physical.
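Read as software, that table is a command schema: a small, typed message from the perception side to the control side. A minimal sketch in Python makes the boundary explicit; the field names are this article's invention, not any SDK's:

```python
from dataclasses import dataclass
from enum import Enum


class TrackingState(Enum):
    OK = "ok"              # landmarks above confidence threshold
    DEGRADED = "degraded"  # low confidence: hold last safe command
    LOST = "lost"          # no hand: robot must hold or decay to stop


@dataclass
class HandCommand:
    """One frame of operator intent derived from hand tracking.

    Field names are illustrative, not from any tracking SDK.
    """
    # Right hand: spatial command
    target_position_m: tuple[float, float, float]   # end-effector target, robot frame
    target_orientation_quat: tuple[float, float, float, float]  # wrist/tool orientation
    gripper_aperture: float        # 0.0 closed .. 1.0 open, from thumb-index distance

    # Left hand: mode layer
    speed_scale: float             # precision multiplier from open/closed state
    tool_roll_offset_rad: float    # from left wrist rotation

    # Trust metadata: the controller, not the tracker, decides what to do with it
    state: TrackingState
    confidence: float              # minimum landmark confidence this frame
    timestamp_ns: int              # for latency and staleness checks
```

The point of the `state` and `confidence` fields is that the downstream controller decides what a degraded frame is allowed to do; the tracker only reports.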
What Exists Now
The model landscape breaks into four practical groups.
First are landmark trackers. MediaPipe Hands remains the most useful public baseline because it was built for real-time, on-device tracking. Its pipeline uses palm detection followed by hand landmark estimation, and the current Hand Landmarker task returns image landmarks, world landmarks, and handedness.[6][2] The web version exposes the same basic capability through JavaScript.[7] TensorFlow.js made browser-side hand pose visible early, with 21 3D landmarks and local execution rather than cloud video upload.[8][9]
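As a concrete anchor, here is a minimal sketch of the Hand Landmarker task in Python, following the documented MediaPipe Tasks API. The model-bundle path and the single-image stand-in for a camera loop are assumptions of the sketch:

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# The .task model bundle must be downloaded separately; the path is an assumption.
options = vision.HandLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="hand_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO,  # stateful tracking across frames
    num_hands=2,
    min_hand_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

with vision.HandLandmarker.create_from_options(options) as landmarker:
    # In a real loop, frames would come from a camera (e.g. OpenCV, converted to RGB)
    # with monotonically increasing timestamps.
    frame = mp.Image.create_from_file("hand.jpg")
    result = landmarker.detect_for_video(frame, timestamp_ms=0)

    # The three outputs the article relies on:
    #   result.hand_landmarks        -> 21 image-space landmarks per hand (normalized)
    #   result.hand_world_landmarks  -> hand-relative 3D landmarks in meters
    #   result.handedness            -> left/right classification with a score
    for hand, handed in zip(result.hand_world_landmarks, result.handedness):
        tip = hand[8]  # index fingertip in MediaPipe's landmark indexing
        print(handed[0].category_name, tip.x, tip.y, tip.z)
```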
Second are platform APIs. Apple Vision exposes hand-pose detection through VNDetectHumanHandPoseRequest, while ARKit and visionOS expose richer hand-skeleton concepts for spatial computing.[3][10][11] Meta's Quest documentation exposes hand tracking, gestures, Interaction SDK abstractions, Fast Motion Mode, Wide Motion Mode, and OpenXR skeleton support.[12][13][14][15] OpenXR gives this work a common language: XR_EXT_hand_tracking defines runtime-provided hand joints, and the standard hand-joint count is 26.[5][16] That matters because a robot-control layer should not be tied to one vendor's landmark names.
Third are 3D reconstruction systems. MANO remains the common parametric hand model, and datasets such as FreiHAND, InterHand2.6M, HO-3D, DexYCB, BigHand2.2M, and RHD form much of the evidence base around hand pose, hand shape, two-hand interaction, depth sensing, and hand-object grasping.[17][18][19][20][21][22][23] Newer systems such as HaMeR, WiLoR, MobRecon, and Hamba show the direction of travel: stronger monocular 3D hand recovery, more robust in-the-wild localization, and mesh estimates rather than only sparse keypoints.[24][25][26][27]
Fourth are dedicated hand-tracking sensors. Ultraleap is the clearest comparison point because it uses stereo infrared sensing and provides hands, digits, bones, frames, and a millimeter coordinate system through its developer APIs.[4][28] It is not the same product class as a plain RGB camera, but it is valuable as a practical benchmark: if a general camera struggles with depth or occlusion, a dedicated hand sensor shows what better sensing buys.
Why Quest-Class Tracking Feels Different
Quest hand tracking is often discussed as though the model is the product. The public documentation suggests a different reading. The product is the runtime.
Meta controls the camera geometry, exposure behavior, headset coordinate frame, hand interaction design, and application APIs. Fast Motion Mode exists because fast hands are difficult. Wide Motion Mode exists because hands leave the field of view. The Interaction SDK exists because raw skeletons are not enough to build usable interaction.[13][14][12]
That distinction matters for robot teleoperation. A headset runtime can fuse body, head, hand, controller, and scene context in ways a single application-level model cannot. Meta also documents the failure cases: occlusion, low light, noisy tracking, and low-confidence states.[29] The lesson is not that Quest has solved robot control. The lesson is that convincing hand input requires the whole sensing and interaction stack to be engineered around hands.
The 3D Problem
The central vision problem is not detecting fingers. It is estimating where the hand is in a coordinate frame that a robot can use.
Single-camera RGB models can infer a plausible 3D hand because hands are constrained objects. Fingers have known joints, palms have recognizable proportions, and training data gives the model strong priors. But plausible hand geometry is not the same as metric 3D position. The same hand can appear small and close or larger and farther away. Lens distortion, hand size, wrist visibility, lighting, motion blur, and self-occlusion all affect the estimate.
For robot control, the coordinate chain has to be explicit:
| Layer | Meaning | Failure mode |
|---|---|---|
| Image landmarks | 2D points in the camera frame | Good overlay, weak depth |
| Hand-relative 3D | Skeleton around the hand itself | Useful pose, uncertain global position |
| Camera-relative 3D | Hand root relative to the camera | Needs calibration and scale assumptions |
| Operator control volume | Bounded space where hand motion becomes command intent | Must prevent jumps and impossible targets |
| Robot target frame | End-effector pose command | Needs transform, limits, and filtering |
| Robot joint frame | Motor or joint targets from inverse kinematics | Exposed to singularities, collisions, and joint limits |
This is where many hand-tracking demos overstate the result. Drawing a 3D skeleton over a video feed is not the same as driving a robot endpoint. A robot needs a stable transform, not a visual impression.
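A sketch of the middle of that chain shows how much sits outside the tracking model. It assumes a camera-to-robot transform measured offline and a fixed operator control volume, both of which are inputs to the system rather than outputs of any hand tracker:

```python
import numpy as np

# Assumptions, not outputs of a tracking model:
# T_robot_cam is a 4x4 homogeneous transform from camera frame to robot base
# frame, measured offline (e.g. with a calibration target). Identity here is
# a placeholder value.
T_robot_cam = np.eye(4)

# Operator control volume in the robot frame (meters); illustrative bounds.
VOLUME_MIN = np.array([0.20, -0.30, 0.05])
VOLUME_MAX = np.array([0.60, 0.30, 0.50])

MAX_STEP_M = 0.03  # per-frame translation limit; rejects re-acquisition jumps


def hand_root_to_robot_target(hand_root_cam_m, previous_target_m):
    """Map a camera-relative hand root (meters) to a bounded robot-frame target.

    Returns None when the command should be rejected; the caller holds pose.
    """
    p = T_robot_cam @ np.append(hand_root_cam_m, 1.0)
    target = p[:3]

    # Reject targets outside the control volume instead of clamping silently:
    # clamping hides calibration drift against the volume edge.
    if np.any(target < VOLUME_MIN) or np.any(target > VOLUME_MAX):
        return None

    # Reject per-frame jumps larger than the robot should ever be asked to make.
    if previous_target_m is not None:
        if np.linalg.norm(target - previous_target_m) > MAX_STEP_M:
            return None

    return target
```

Returning None rather than a clamped pose is deliberate: a rejected frame becomes a hold, which is the safe default the rest of the article argues for.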
Gripper Mapping Is The Strongest Signal
Thumb-index pinch is the cleanest part of the interface. It is local to the hand, easy to normalize, and directly related to a claw or parallel gripper. It does not require the system to know the hand's exact global position.
A robust mapping would use thumb-tip and index-tip distance normalized against hand scale, then add hysteresis, confidence gating, and smoothing before it reaches the robot. OpenXR's hand interaction work is relevant here because it standardizes hand-oriented input paths such as pinch and poke poses, but a physical robot should still compute and validate its own gripper command.[30]
The details matter; the code sketch after the table walks through the same steps:
| Step | Purpose |
|---|---|
| Measure thumb-tip to index-tip distance | Raw aperture signal |
| Normalize against palm or finger scale | Handles different hand sizes |
| Gate on landmark confidence | Prevents false closure during occlusion |
| Add hysteresis | Stops flicker near open/closed thresholds |
| Smooth over time | Removes visual jitter |
| Clamp gripper velocity and force | Keeps the physical tool inside limits |
This is the strongest near-term use of hand tracking for robot control: not free-space arm motion first, but a dexterous analog command that existing interfaces often handle poorly.
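Here is that pipeline as a sketch, with illustrative thresholds rather than validated constants, and MediaPipe's 21-landmark indexing assumed for the fingertip and wrist points:

```python
import math


class PinchEstimator:
    """Thumb-index aperture with normalization, gating, hysteresis, smoothing.

    Thresholds and the smoothing factor are illustrative starting points.
    """

    def __init__(self):
        self.aperture = 1.0      # 0.0 closed .. 1.0 open
        self.is_closed = False
        self.alpha = 0.3         # exponential smoothing factor
        self.close_below = 0.25  # hysteresis: close threshold
        self.open_above = 0.40   # hysteresis: reopen threshold
        self.min_confidence = 0.7

    @staticmethod
    def _dist(a, b):
        return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

    def update(self, landmarks, confidence):
        """landmarks: 21 hand-relative 3D points (MediaPipe indexing assumed)."""
        if confidence < self.min_confidence:
            # Gate on confidence: hold the last good state instead of closing
            # the gripper on an occluded, badly estimated hand.
            return self.aperture, self.is_closed

        # Normalize thumb-tip (4) to index-tip (8) distance by a hand-scale
        # reference: wrist (0) to index MCP (5), which is stable across poses.
        raw = self._dist(landmarks[4], landmarks[8])
        scale = self._dist(landmarks[0], landmarks[5])
        if scale < 1e-6:
            return self.aperture, self.is_closed
        normalized = min(raw / scale, 1.5) / 1.5  # clamp, map to 0..1

        # Smooth over time to suppress landmark jitter.
        self.aperture += self.alpha * (normalized - self.aperture)

        # Hysteresis stops flicker near the open/closed boundary.
        if self.is_closed and self.aperture > self.open_above:
            self.is_closed = False
        elif not self.is_closed and self.aperture < self.close_below:
            self.is_closed = True
        return self.aperture, self.is_closed
```

Velocity and force clamping are intentionally absent: the table's last row belongs in the robot-side controller, which should enforce its own limits whatever the vision stack sends.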
Arm Motion Is Harder
End-effector position is more difficult because it involves depth, scale, and frame alignment. Absolute mapping looks intuitive: move the hand right, the robot moves right; move forward, the robot moves forward. In practice, absolute mapping is brittle. Tracking loss, hand reacquisition, camera movement, or a bad depth estimate can create jumps.
Teleoperation literature points toward more conservative mappings. Workspace-level control of underwater manipulators uses operator commands that are translated into robot joint motion by inverse kinematics.[31] Bilateral and shared-control work treats direct control, supervisory control, and shared control as a spectrum, with shared control reducing workload while preserving human authority.[32]
For a hand-tracked arm, that implies four possible mappings:
| Mapping | Strength | Weakness |
|---|---|---|
| Absolute hand pose to end-effector pose | Intuitive and visually direct | Vulnerable to jumps and calibration drift |
| Relative hand movement while enabled | Safer and easier to clutch | Less natural at first |
| Rate control from displacement around neutral | Stable over larger motions | Less precise for point placement |
| Shared control around task constraints | Strongest for real work | Requires a task model and environment sensing |
The most defensible reading of the evidence is that raw one-to-one mapping belongs in demonstrations and constrained tests. Work near valuable hardware, subsea equipment, hazardous material, or patients needs a control layer that can reject bad targets.
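Of the four, the relative mapping is the easiest to make concrete: hand motion moves the robot only while an explicit enable (a clutch) is engaged, and both frames re-anchor on every engage, so tracking loss and re-acquisition cannot produce a jump. A minimal sketch, with scaling and bounds chosen for illustration:

```python
import numpy as np


class ClutchedRelativeMapper:
    """Relative hand-to-robot mapping behind an explicit enable signal.

    While the clutch is engaged, hand displacement from the engage point is
    added (scaled) to the robot pose captured at engage time. Disengaging and
    re-engaging re-anchors both frames, like lifting and re-placing a mouse.
    """

    def __init__(self, scale=0.5, max_offset_m=0.25):
        self.scale = scale              # <1 trades reach for precision
        self.max_offset_m = max_offset_m
        self._hand_anchor = None        # hand position at engage (meters)
        self._robot_anchor = None       # end-effector position at engage

    def engage(self, hand_pos, robot_pos):
        self._hand_anchor = np.asarray(hand_pos, dtype=float)
        self._robot_anchor = np.asarray(robot_pos, dtype=float)

    def disengage(self):
        self._hand_anchor = None
        self._robot_anchor = None

    def target(self, hand_pos):
        """Return a robot target, or None when the clutch is not engaged."""
        if self._hand_anchor is None:
            return None
        offset = self.scale * (np.asarray(hand_pos, dtype=float) - self._hand_anchor)
        # Bound the reachable offset per engagement; longer moves require
        # re-clutching, which keeps each motion segment small and auditable.
        norm = np.linalg.norm(offset)
        if norm > self.max_offset_m:
            offset *= self.max_offset_m / norm
        return self._robot_anchor + offset
```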
Browser, Local, And Dedicated Runtime Paths
The software path is no longer speculative. MediaPipe Hand Landmarker for Web can run local hand tracking in a browser.[7] ONNX Runtime Web offers browser inference paths through WebAssembly, WebGPU, WebGL, and WebNN where supported.[33][34][35] WebNN is emerging as a lower-level browser API for neural-network graph execution on available hardware backends.[36]
The question is not whether local inference can run. It can. The question is which work should happen where.
| Runtime path | Best fit | Constraint |
|---|---|---|
| Browser landmarks | Low-friction hand pose, pinch, visualization, logging | Limited control over heavy models and hardware integration |
| Local service | Heavier reconstruction, ROS 2 bridge, simulation, logging, replay | More installation and support burden |
| Native platform API | Access to system hand-pose frameworks | Less portable across devices |
| Dedicated sensor SDK | Better depth and hand-specific tracking | Extra hardware and integration cost |
MoveIt 2 and moveit_kinematics are relevant downstream because inverse kinematics is the bridge from a desired end-effector pose to robot joint targets.[37][38] But MoveIt is not the safety controller. It is a manipulation and planning framework. Industrial motion still belongs behind robot controllers, safety-rated stops, cell limits, and risk assessment.
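On the ROS 2 side this reduces to a node that publishes already-validated end-effector targets and leaves kinematics to the planning stack. A minimal rclpy sketch; the topic name, frame name, and 30 Hz rate are assumptions of this sketch, not MoveIt conventions:

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped


class HandTargetPublisher(Node):
    """Publishes validated end-effector targets from the hand-tracking pipeline.

    The topic '/hand_teleop/target_pose' is this sketch's invention; a
    MoveIt-based consumer (e.g. a planning node or MoveIt Servo configured for
    pose tracking) would be set up to read it. Safety stops stay in the robot
    controller, not in this node.
    """

    def __init__(self):
        super().__init__("hand_target_publisher")
        self.pub = self.create_publisher(PoseStamped, "/hand_teleop/target_pose", 10)
        self.timer = self.create_timer(1.0 / 30.0, self.tick)  # 30 Hz command rate

    def tick(self):
        target = self.latest_validated_target()  # gated, clamped, filtered upstream
        if target is None:
            return  # hold: publish nothing rather than extrapolate
        msg = PoseStamped()
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.header.frame_id = "base_link"  # robot base frame; name is an assumption
        msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = target
        msg.pose.orientation.w = 1.0  # identity orientation as a placeholder
        self.pub.publish(msg)

    def latest_validated_target(self):
        # Placeholder for the control-volume and rate-limit checks sketched above.
        return None


def main():
    rclpy.init()
    rclpy.spin(HandTargetPublisher())


if __name__ == "__main__":
    main()
```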
What Industrial Interfaces Teach
Industrial human-robot interaction research has been moving toward multimodal interfaces: gestures, speech, vision, augmented displays, haptics, and supervisory control.[39][40] Leap Motion-era literature is useful because it made the same promise earlier: low-cost 3D hand input for robot motion. Reviews of Leap Motion interaction note the common pattern of mapping hand positions or gestures into robot motion and then resolving the robot's joints through inverse kinematics.[41]
The older studies also show the trap. Contactless interfaces remove hardware, but they also remove tactile grounding. Physical masters, joysticks, teach pendants, pedals, and haptic devices persist because they give repeatable frames, mechanical clutching, muscle memory, and known failure modes. Weichert and colleagues measured Leap Motion accuracy against an industrial robot reference setup, which is the right mindset: the input device has to be measured, not merely experienced.[42]
The result is not a simple replacement story. Hand tracking competes poorly with physical controls when the operator needs force feedback, detents, or long-duration comfort. It competes well when the operator needs dexterous finger intent, fast mode switching, low-contamination interaction, or a software-defined interface that can change by task.
Lessons From Subsea, Nuclear, And Surgery
Subsea teleoperation is a useful comparison because operators already work through cameras and remote manipulators. Recent ROV studies describe a control environment shaped by limited visibility, turbulence, localization problems, poor depth perception, and missing sensory feedback.[43] Body-motion mapping, visual-haptic feedback, and VR-haptic systems are being explored because ordinary video and joysticks do not solve the perception problem on their own.[44][45][46]
Nuclear remote handling is less forgiving. Remote operation is required around radioactive and corrosive samples, and the literature still leans heavily on master-slave manipulators and haptic feedback.[47] New haptic master devices are being developed for nuclear and aerospace tasks precisely because dexterous remote manipulation benefits from force, posture, and joint correspondence.[48]
Surgical robotics offers the most mature public example of professional teleoperation. Systems such as da Vinci use purpose-built master consoles, not casual free-space gestures.[49] Reviews of telesurgery focus on latency, communication reliability, and haptic feedback.[50] Raven-II shows the same point from the research side: serious teleoperation platforms are instrumented systems, not input demos.[51]
Across these domains, hand tracking reads best as an input layer. It does not remove the need for haptics, shared control, redundancy, or task constraints. It adds a new way to express intent.
Safety Boundary
A hand tracker is not a safety device.
That sentence carries most of the industrial implication. Robot standards and safety guidance are built around machinery risk, not interface novelty. OSHA's robotics guidance points to industrial robot standards while also noting that tele-operated manipulators, undersea robots, and medical robots can fall outside direct industrial-robot scope.[52] ISO/TS 15066 covers collaborative industrial robot systems, while ISO 10218-1:2025 covers industrial robot safety requirements.[53][54] EU machinery regulation and safety-case standards such as UL 4600 add another reminder: software that influences motion has to be argued through hazards, evidence, and lifecycle controls.[55][56]
The practical risk table is straightforward:
| Failure | Robot-control consequence |
|---|---|
| Hand disappears | Motion should hold or decay, not extrapolate blindly |
| Landmarks jitter | Target motion should be filtered and rate limited |
| Left and right hands swap | Mode and speed commands can become unsafe |
| Pinch is falsely detected | Gripper may close unexpectedly |
| Camera moves | Calibration is no longer valid |
| User leaves the frame | The robot needs an unambiguous hold state |
| IK hits a singularity | The target should be rejected or re-planned |
| Operator becomes fatigued | Mid-air control quality degrades |
None of those failures is exotic. They are normal operating conditions for camera-based tracking.
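Most of those rows collapse into one design requirement: an explicit, time-based command-state machine between the tracker and the robot, so that "the hand came back" never means "move now." A minimal sketch with illustrative thresholds:

```python
import time
from enum import Enum


class CommandState(Enum):
    ACTIVE = "active"  # validated commands pass through, filtered and rate limited
    HOLD = "hold"      # robot holds its current pose; no extrapolation
    STOP = "stop"      # robot decays to a safe stop; explicit re-enable required


class TrackingWatchdog:
    """Turns per-frame tracking confidence into an explicit command state.

    Thresholds are illustrative. A short dropout degrades ACTIVE to HOLD; a
    long one forces STOP. Leaving STOP requires fresh confident frames AND an
    explicit operator re-enable, so a reacquired hand can never command motion
    by itself.
    """

    HOLD_AFTER_S = 0.15
    STOP_AFTER_S = 1.00
    MIN_CONFIDENCE = 0.7

    def __init__(self):
        self.state = CommandState.STOP
        self._last_good_s = None

    def update(self, confidence, reenable_pressed=False, now_s=None):
        now_s = time.monotonic() if now_s is None else now_s
        if confidence >= self.MIN_CONFIDENCE:
            self._last_good_s = now_s
        since_good = (float("inf") if self._last_good_s is None
                      else now_s - self._last_good_s)

        if since_good > self.STOP_AFTER_S:
            self.state = CommandState.STOP
        elif since_good > self.HOLD_AFTER_S:
            if self.state is CommandState.ACTIVE:
                self.state = CommandState.HOLD
        else:  # tracking is currently good
            if self.state is CommandState.HOLD:
                self.state = CommandState.ACTIVE     # short dropout: resume
            elif self.state is CommandState.STOP and reenable_pressed:
                self.state = CommandState.ACTIVE     # long dropout: re-enable
        return self.state
```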
What Should Be Measured
The evidence points to a measurement program before any strong claims about robot-arm control. The relevant numbers are not only model frame rate or landmark count.
| Metric | Why it matters |
|---|---|
| End-to-end latency | Delay changes teleoperation stability |
| Frame-to-frame jitter | Small visual noise can become robot vibration |
| Pinch repeatability | Analog gripper control needs predictable scaling |
| Dropout frequency | Occlusion is common during hand work |
| Reacquisition behavior | Returning hands should not create command jumps |
| Handedness stability | Left/right roles can be safety-relevant |
| Depth error | Single-camera z motion is the hardest axis |
| Operator fatigue | Free-space control can become uncomfortable |
| Task completion time | Input methods should be compared against physical baselines |
| Error rate | Overshoot, wrong grasp, and missed stops matter more than demo smoothness |
NIST's robot test-method work is relevant less for a specific hand benchmark than for the discipline it implies: define repeatable tests, measure the input and robot together, and separate one-off success from operational reliability.[57] NIOSH's workplace robotics work points in the same direction from the worker side: the interface has to be evaluated as part of a work system.[58]
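Several of those metrics need nothing more than a logged landmark stream. A sketch for two of them, frame-to-frame jitter and dropout frequency; the metric definitions are this sketch's conveniences, not a standard's:

```python
import numpy as np


def jitter_and_dropouts(positions, confidences, min_confidence=0.7):
    """Compute simple stream metrics from logged per-frame hand-root positions.

    positions: (N, 3) array of hand-root estimates in meters (NaN when untracked).
    confidences: (N,) per-frame confidence scores.
    Returns RMS frame-to-frame displacement over confident frames, and the
    number of dropout episodes (transitions from tracked to untracked).
    """
    positions = np.asarray(positions, dtype=float)
    good = np.asarray(confidences) >= min_confidence

    # Jitter: RMS displacement between consecutive confident frames. Measured
    # on a nominally still hand, this is pure estimator noise, which the robot
    # would otherwise reproduce as vibration.
    pair_ok = good[1:] & good[:-1]
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)[pair_ok]
    rms_jitter_m = float(np.sqrt(np.mean(steps**2))) if steps.size else float("nan")

    # Dropouts: count tracked-to-untracked transitions, one per episode.
    dropout_count = int(np.sum(good[:-1] & ~good[1:]))
    return rms_jitter_m, dropout_count
```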
Bottom Line
Camera-based hand tracking has crossed the threshold from novelty to usable interface technology. The model and runtime ecosystem is mature enough to track fingers, recover hand skeletons, estimate pinch, and feed a robot-control stack with human intent. Quest-class runtimes show the high end of integrated hand interaction. MediaPipe and related tools show that lighter local tracking is broadly accessible. 3D reconstruction research is improving the quality of hand pose and mesh estimates year by year.
The evidence is weaker at the point where software intent becomes physical motion. Depth ambiguity, occlusion, latency, fatigue, missing haptics, calibration drift, and safety certification remain the real limits. Thumb-index gripper control is the strongest near-term signal because it is local, analog, and naturally matched to the tool. Full 3D arm teleoperation is possible only when hand tracking is treated as one sensor feeding a constrained controller.
The impartial conclusion is therefore conservative: hand tracking is credible for robot-arm teleoperation research and constrained demonstrations, especially for gripper aperture and high-level end-effector intent. It is not, by itself, a replacement for physical master controls in high-consequence work. The serious version is hybrid: visual hand tracking for dexterous intent, conventional safety hardware for authority, and a robot-control layer that assumes the camera will sometimes be wrong.
Footnotes
1. Meta Horizon OS Developers. Hand Tracking Overview for Meta Quest in Unity. https://developers.meta.com/horizon/documentation/unity/unity-handtracking-overview/
2. Google AI Edge. Hand landmarks detection guide. https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker
3. Apple Developer Documentation. VNDetectHumanHandPoseRequest. https://developer.apple.com/documentation/vision/vndetecthumanhandposerequest
4. Ultraleap Documentation. Ultraleap Hand Tracking Overview. https://docs.ultraleap.com/hand-tracking/index.html
5. Khronos Group. OpenXR specification, XR_EXT_hand_tracking. https://registry.khronos.org/OpenXR/specs/1.1/html/xrspec.html#XR_EXT_hand_tracking
6. Zhang, F. et al. MediaPipe Hands: On-device Real-time Hand Tracking. CV4ARVR, 2020. https://research.google/pubs/mediapipe-hands-on-device-real-time-hand-tracking/
7. Google AI Edge. Hand landmarks detection guide for Web. https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker/web_js
8. TensorFlow Blog. Face and hand tracking in the browser with MediaPipe and TensorFlow.js. https://blog.tensorflow.org/2020/03/face-and-hand-tracking-in-browser-with-mediapipe-and-tensorflowjs.html
9. TensorFlow Blog. 3D Hand Pose with MediaPipe and TensorFlow.js. https://blog.tensorflow.org/2021/11/3D-handpose.html
10. Apple Developer Documentation. HandSkeleton. https://developer.apple.com/documentation/arkit/handskeleton
11. Apple Developer Documentation. Tracking and visualizing hand movement. https://developer.apple.com/documentation/visionOS/tracking-and-visualizing-hand-movement
12. Meta Horizon OS Developers. Interaction SDK. https://developers.meta.com/horizon/documentation/unity/unity-isdk-interaction-sdk-overview/
13. Meta Horizon OS Developers. Use Fast Motion Mode. https://developers.meta.com/horizon/documentation/unity/fast-motion-mode/
14. Meta Horizon OS Developers. Use Wide Motion Mode. https://developers.meta.com/horizon/documentation/unity/unity-wide-motion-mode/
15. Meta Horizon OS Developers. OpenXR Hand Skeleton in Interaction SDK. https://developers.meta.com/horizon/documentation/unity/unity-isdk-openxr-hand/
16. Khronos Group. XR_HAND_JOINT_COUNT_EXT manual page. https://registry.khronos.org/OpenXR/specs/1.1/man/html/XR_HAND_JOINT_COUNT_EXT.html
17. Romero, J., Tzionas, D., and Black, M. J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM TOG, 2017. https://mano.is.tue.mpg.de/
18. Zimmermann, C. et al. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images. ICCV, 2019. https://lmb.informatik.uni-freiburg.de/Publications/2019/ZAB19/
19. Moon, G. et al. InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image. ECCV, 2020. https://mks0601.github.io/InterHand2.6M/
20. Hampali, S. et al. HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation. ICCV, 2019. https://www.tugraz.at/index.php?id=40231
21. Chao, Y. W. et al. DexYCB: A Benchmark for Capturing Hand Grasping of Objects. CVPR, 2021. https://dex-ycb.github.io/
22. Yuan, S. et al. BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis. CVPR, 2017. https://arxiv.org/abs/1704.02612
23. Zimmermann, C. and Brox, T. Learning to Estimate 3D Hand Pose from Single RGB Images. ICCV, 2017. https://lmb.informatik.uni-freiburg.de/projects/hand3d/
24. Pavlakos, G. et al. Reconstructing Hands in 3D with Transformers. CVPR, 2024. https://openaccess.thecvf.com/content/CVPR2024/html/Pavlakos_Reconstructing_Hands_in_3D_with_Transformers_CVPR_2024_paper.html
25. Potamias, R. A. et al. WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild. arXiv:2409.12259, 2024. https://arxiv.org/abs/2409.12259
26. Chen, X. et al. MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image. CVPR, 2022. https://arxiv.org/abs/2112.02753
27. Dong, H. et al. Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba. NeurIPS, 2024. https://publications.ri.cmu.edu/hamba-single-view-3d-hand-reconstruction-with-graph-guided-bi-scanning-mamba
28. Ultraleap Documentation. Leap Concepts. https://docs.ultraleap.com/api-reference/tracking-api/leapc-guide/leap-concepts.html
29. Meta Horizon OS Developers. Troubleshooting and Limitations. https://developers.meta.com/horizon/documentation/unity/unity-handtracking-troubleshooting-limitations/
30. Khronos Group. OpenXR specification, XR_EXT_hand_interaction. https://registry.khronos.org/OpenXR/specs/1.1/html/xrspec.html#XR_EXT_hand_interaction
31. Yuh, J. et al. Workspace control system of underwater tele-operated manipulators on an ROV. Ocean Engineering, 2010. https://www.sciencedirect.com/science/article/abs/pii/S0029801810000892
32. Xu, X. et al. Bilateral teleoperation with object-adaptive mapping. Complex and Intelligent Systems, 2022. https://link.springer.com/article/10.1007/s40747-021-00546-z
33. ONNX Runtime. How to add machine learning to your web application with ONNX Runtime. https://onnxruntime.ai/docs/tutorials/web/
34. ONNX Runtime. Using the WebGPU Execution Provider. https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html
35. ONNX Runtime. Web get started support matrix. https://onnxruntime.ai/docs/get-started/with-javascript/web.html
36. WebNN. Web Neural Network API FAQ. https://webnn.io/en/faq/api
37. MoveIt. MoveIt 2 documentation. https://moveit.picknik.ai/main/index.html
38. ROS Documentation. moveit_kinematics package. https://docs.ros.org/en/ros2_packages/rolling/api/moveit_kinematics/
39. Berg, J. and Lu, S. Review of Interfaces for Industrial Human-Robot Interaction. Current Robotics Reports, 2020. https://link.springer.com/article/10.1007/s43154-020-00005-6
40. Qi, J. et al. Computer vision-based hand gesture recognition for human-robot interaction: a review. Complex and Intelligent Systems, 2024. https://link.springer.com/article/10.1007/s40747-023-01173-6
41. Guna, J. et al. Review of Three-Dimensional Human-Computer Interaction with Focus on the Leap Motion Controller. Sensors, 2018. https://www.mdpi.com/1424-8220/18/7/2194
42. Weichert, F. et al. Analysis of the Accuracy and Robustness of the Leap Motion Controller. Sensors, 2013. https://www.mdpi.com/1424-8220/13/5/6380
43. Chen, R. et al. Sensory augmentation for subsea robot teleoperation. Computers in Industry, 2023. https://www.sciencedirect.com/science/article/pii/S0166361522002329
44. Chen, R. et al. ROV teleoperation via human body motion mapping: Design and experiment. Computers in Industry, 2023. https://www.sciencedirect.com/science/article/abs/pii/S0166361523001094
45. Chen, R. et al. Visual-haptic feedback for ROV subsea navigation control. Automation in Construction, 2023. https://www.sciencedirect.com/science/article/pii/S0926580523002479
46. Chen, R. et al. SubSense: VR-Haptic and Motor Feedback for Immersive Control in Subsea Telerobotics. arXiv:2510.02594, 2025. https://arxiv.org/abs/2510.02594
47. Li, J. et al. Application of isomorphic remote manipulators with haptic feedback in spent fuel reprocessing sample analysis. International Journal of Advanced Nuclear Reactor Design and Technology, 2025. https://www.sciencedirect.com/science/article/pii/S2468605025000869
48. Liu, Y. et al. Novel Haptic Device and Control Strategy for Manipulator Teleoperation in Nuclear and Aerospace Tasks. Chinese Journal of Mechanical Engineering, 2025. https://link.springer.com/article/10.1186/s10033-025-01351-2
49. Intuitive. da Vinci surgical systems. https://www.intuitive.com/en-us/products-and-services/da-vinci/systems
50. Choi, P. J. et al. Telesurgery: Past, Present, and Future. Cureus, 2018. https://pmc.ncbi.nlm.nih.gov/articles/PMC6067812/
51. Hannaford, B. et al. Raven-II: An Open Platform for Surgical Robotics Research. IEEE Transactions on Biomedical Engineering, 2013. https://pmc.ncbi.nlm.nih.gov/articles/PMC4813138/
52. OSHA. Robotics standards. https://www.osha.gov/robotics/standards
53. International Organization for Standardization. ISO/TS 15066:2016 Robots and robotic devices - Collaborative robots. https://www.iso.org/standard/62996.html
54. International Organization for Standardization. ISO 10218-1:2025 Robotics - Safety requirements - Part 1: Industrial robots. https://www.iso.org/standard/73933.html
55. European Union. Regulation (EU) 2023/1230 on machinery. https://eur-lex.europa.eu/eli/reg/2023/1230/oj
56. UL Standards and Engagement. UL 4600, Standard for Safety for the Evaluation of Autonomous Products. https://ulse.org/ul-standards/safety/autonomous-products
57. NIST. Mobile robots and robot test methods. https://www.nist.gov/el/intelligent-systems-division-73500/robotic-systems-smart-manufacturing-program/mobile-robots
58. NIOSH. Center for Occupational Robotics Research. https://www.cdc.gov/niosh/robotics/