The effectiveness of the stylometry of function words in discriminating between Shakespeare and Fletcher
Horton, Thomas Bolton
A number of recent successful authorship studies have relied on a statistical analysis of language features based on function words. However, stylometry has not been extensively applied to Elizabethan and Jacobean dramatic questions. To determine the effectiveness of such an approach in this field, language features are studied in twenty-four plays by Shakespeare and eight by Fletcher. The goal is to develop procedures that might be used to determine the authorship of individual scenes in The Two Noble Kinsmen and Henry VIII. Homonyms, spelling variants and contracted forms in old-spelling dramatic texts present problems for a computer analysis. A program that uses a system of pre-edit codes and replacement/expansion lists was developed to prepare versions of the texts in which all forms of common words can be recognized automatically. To evaluate some procedures for determining authorship developed by A. Q. Morton and his colleagues, occurrences of 30 common collocations and 5 proportional pairs are analyzed in the texts. Within-author variation for these features is greater than had been found in previous studies. Univariate chi-square tests are shown to be of limited usefulness because of the statistical distribution of these textual features and correlation between pairs of features. The best of the collocations do not discriminate as well as most of the individual words from which they are composed. Turning to the rate of occurrence of individual words and groups of words, distinctiveness ratios and t-tests are used to select variables that best discriminate between Shakespeare and Fletcher. Variation due to date of composition and genre within the Shakespeare texts is examined. A multivariate and distribution-free discriminant analysis procedure (using kernel estimation) is introduced. The classifiers based on the best marker words and the kernel method are not greatly affected by characterization and perform well for samples as short as 500 words.
When the final procedure is used to assign the 459 scenes of known authorship (containing at least 500 words), almost 95% are assigned to the correct author. Only two scenes are incorrectly classified, and 4.8% of the scenes cannot be assigned to either author by the procedure. When applied to individual scenes of at least 500 words in The Two Noble Kinsmen and Henry VIII, the procedure indicates that both plays are collaborations and generally supports the usual division. However, the marker words in a number of scenes often attributed to Fletcher are very much closer to Shakespeare's pattern of use. These scenes include TNK IV.iii and H8 I.iii, IV.i-ii and V.iv.
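
As a rough illustration of the kind of procedure described above, the following Python sketch computes a marker-word rate, a distinctiveness ratio, and a kernel-based classification that can leave a sample unassigned. It is not the procedure used in the thesis: the function names, the Gaussian product kernel, the fixed bandwidth, the equal prior probabilities and the 0.9 posterior threshold are all assumptions made for the example.

```python
import math
import re
from statistics import mean


def rate_per_1000(text, word):
    """Occurrences of `word` per 1000 running words in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word) / len(tokens)


def distinctiveness_ratio(rates_a, rates_b):
    """Mean rate in author A's samples divided by the mean rate in author B's."""
    mean_b = mean(rates_b)
    return math.inf if mean_b == 0 else mean(rates_a) / mean_b


def kernel_density(x, training, bandwidth):
    """Gaussian product-kernel density estimate at point `x` (assumed kernel)."""
    dims = len(x)
    norm = (bandwidth * math.sqrt(2 * math.pi)) ** dims
    total = 0.0
    for t in training:
        sq_dist = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return total / (len(training) * norm)


def classify(x, samples_a, samples_b, bandwidth=1.0, threshold=0.9):
    """Return 'A', 'B' or 'unassigned', assuming equal prior probabilities.

    A sample is assigned only when one author's posterior probability
    reaches `threshold` (a hypothetical cut-off, not taken from the thesis).
    """
    f_a = kernel_density(x, samples_a, bandwidth)
    f_b = kernel_density(x, samples_b, bandwidth)
    if f_a + f_b == 0:
        return "unassigned"
    post_a = f_a / (f_a + f_b)
    if post_a >= threshold:
        return "A"
    if 1 - post_a >= threshold:
        return "B"
    return "unassigned"
```

In this sketch each sample is a tuple of marker-word rates, one per selected word, and `classify` compares the two estimated densities at that point. The reject option mirrors the abstract's category of scenes that cannot be assigned to either author, although the actual rejection rule in the thesis may differ.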