|dc.description.abstract||A key goal for humans and artificial intelligence systems is to develop an accurate and unified picture of the outside world based on the data from any sense(s) that may be available. The availability of multiple senses presents the perceptual system with new opportunities to fulfil this goal, but exploiting these opportunities first requires the solution of two related tasks. The first is how to make the best use of any redundant information from the sensors to produce the most accurate percept of the state of the world. The second is how to interpret the relationship between observations in each modality; for example, the correspondence problem of whether or not they originate from the same source.
This thesis investigates these questions using ideal Bayesian observers as the underlying theoretical approach. In particular, the latter correspondence task is treated as a problem of Bayesian model selection or structure inference in Bayesian networks. This approach provides a unified and principled way of representing and understanding the perceptual problems faced by humans and machines and their commonality.
In the domain of machine intelligence, we exploit the developed theory for practical benefit, developing a model to represent audio-visual correlations. Unsupervised learning in this model provides automatic calibration and user appearance learning, without human intervention. Inference in the model involves explicit reasoning about the association between latent sources and observations. This provides audio-visual tracking through occlusion with improved accuracy compared to standard techniques. It also provides detection, verification and speech segmentation, ultimately allowing the machine to understand ``who said what, where?'' in multi-party conversations.
In the domain of human neuroscience, we show how a variety of recent results in multimodal perception can be understood as the consequence of probabilistic reasoning about the causal structure of multimodal observations. We show this for a localisation task in audio-visual psychophysics, which is very similar to the task solved by our machine learning system. We also use the same theory to understand results from experiments in the completely different paradigm of oddity detection using visual and haptic modalities. These results begin to suggest that the human perceptual system performs -- or at least approximates -- sophisticated probabilistic reasoning about the causal structure of observations under the hood.||en