Visual articulated tracking in cluttered environments
This thesis is concerned with the state estimation of an articulated robotic manipulator during interaction with its environment. Traditionally, robot state estimation has relied on proprioceptive sensors as the single source of information about the internal state. In this thesis, we are motivated to shift the focus from proprioceptive to exteroceptive sensing, which is capable to represent a holistic interpretation of the entire manipulation scene. When visually observing grasping tasks, the tracked manipulator is subject to visual distractions caused by the background, the manipulated object and by occlusions from other objects present in the environment. The aim of this thesis is to investigate and develop methods for the robust visual state estimation of articulated kinematic chains in cluttered environments which suffer from partial occlusions. To make these methods widely applicable to a variety of kinematic setups and unseen environments, we intentionally refrain from using prior information about the internal state of the articulated kinematic chain, and we do not explicitly model visual distractions such as the background and manipulated objects in the environment. We approach this problem with model-fitting methods, in which an articulated model is associated to the observed data using discriminative information. We explore model-fitting objectives that are robust to occlusions and unseen environments, methods to generate synthetic training data for data-driven discriminative methods, and robust optimisers to minimise the tracking objective. This thesis contributes (1) an automatic colour and depth image synthesis pipeline for data-driven learning without depending on a real articulated robot; (2) a training strategy for discriminative model-fitting objectives with an implicit representation of objects; (3) a tracking objective that is able to track occluded parts of a kinematic chain; and finally (4) a robust multi-hypotheses optimiser. These contributions are evaluated on two robotic platforms in different environments and with different manipulated and occluding objects. We demonstrate that our image synthesis pipeline generalises well to colour and depth observations of the real robot without requiring real ground truth labelled images. While this synthesis approach introduces a visual simulation-to-reality gap, the combination of our robust tracking objective and optimiser enables stable tracking of an occluded end-effector during manipulation tasks.