Evaluating the impact of variation in automatically generated embodied object descriptions
Foster, Mary Ellen
The primary task for any system that aims to automatically generate human-readable output is choice: the input to the system is usually well-specified, but there can be a wide range of options for creating a presentation based on that input. When designing such a system, an important decision is which aspects of the output to hard-wire and which to allow to vary dynamically. Supporting dynamic choice requires additional representation and processing effort in the system, so it is important to ensure that incorporating variation has a positive effect on the generated output. In this thesis, we concentrate on two types of output generated by a multimodal dialogue system: linguistic descriptions of objects drawn from a database, and conversational facial displays of an embodied talking head. In a series of experiments, we add different types of variation to one of these types of output. The impact of each implementation is then assessed through a user evaluation in which human judges compare outputs generated by the basic version of the system to those generated by the modified version; in some cases, we also use automated metrics to compare the versions of the generated output. This series of implementations and evaluations allows us to address three related issues. First, we explore the circumstances under which users perceive and appreciate variation in generated output. Second, we compare two methods of incorporating variation into the output of a corpus-based generation system. Third, we compare human judgements of output quality to the predictions of a range of automated metrics. The results of the thesis are as follows. The judges generally preferred output that incorporated variation, except for a small number of cases where other aspects of the output obscured it or the variation was not marked. In general, the output of systems that chose the majority option was judged worse than that of systems that chose from a wider range of outputs.
However, the results for non-verbal displays were mixed: users mildly preferred agent outputs where the facial displays were generated using stochastic techniques to those where a simple rule was used, but the stochastic facial displays decreased users’ ability to identify contextual tailoring in speech while the rule-based displays did not. Finally, automated metrics based on simple corpus similarity favour generation strategies that do not diverge far from the average corpus examples, which are exactly the strategies that human judges tend to dislike. Automated metrics that measure other properties of the generated output correspond more closely to users’ preferences.