|dc.description.abstract||The primary task for any system that aims to automatically generate human-readable output
is choice: the input to the system is usually well-specified, but there can be a wide range of
options for creating a presentation based on that input. When designing such a system, an
important decision is which aspects of the output are hard-wired and which allow
for dynamic variation. Supporting dynamic choice requires additional representation and
processing effort in the system, so it is important to ensure that incorporating variation has a
positive effect on the generated output.
In this thesis, we concentrate on two types of output generated by a multimodal dialogue
system: linguistic descriptions of objects drawn from a database, and conversational facial
displays of an embodied talking head. In a series of experiments, we add different types of
variation to one of these types of output. The impact of each implementation is then assessed
through a user evaluation in which human judges compare outputs generated by the basic
version of the system to those generated by the modified version; in some cases, we also use
automated metrics to compare the versions of the generated output.
This series of implementations and evaluations allows us to address three related issues. First,
we explore the circumstances under which users perceive and appreciate variation in generated
output. Second, we compare two methods of incorporating variation into the output of a
corpus-based generation system. Third, we compare human judgements of output quality to
the predictions of a range of automated metrics.
The results of the thesis are as follows. The judges generally preferred output that incorporated
variation, except in a small number of cases where the variation was not marked or was
obscured by other aspects of the output. In general, the output of systems that chose the majority
option was judged worse than that of systems that chose from a wider range of outputs.
However, the results for non-verbal displays were mixed: users mildly preferred agent outputs
where the facial displays were generated using stochastic techniques to those where a simple
rule was used, but the stochastic facial displays decreased users’ ability to identify contextual
tailoring in speech while the rule-based displays did not. Finally, automated metrics based on
simple corpus similarity favour generation strategies that do not diverge far from the average
corpus examples, which are exactly the strategies that human judges tend to dislike. Automated
metrics that measure other properties of the generated output correspond more closely
to users’ preferences.||en