Abstract
This thesis arose from a problem in the analysis of data from the
Edinburgh Lead Study. The data were to be used to estimate the
influence of children's blood lead levels on their mental abilities,
controlling for other factors which might confound this relationship.
The other factors were summarised as a set of covariate scores, and the
question arose as to which of these scores should be included in a
multiple regression whose purpose was to estimate the coefficient of
blood-lead. This problem has arisen in other studies of the influence
of lead on ability, and a variety of solutions have been implemented.
The statistical and epidemiological literature offers little guidance.
The problem is formalised by proposing regression models with
various assumptions. Expressions are derived for the mean-square-error
of the parameter of special interest (here the blood-lead coefficient)
in terms of quantities which can be calculated from the data. Various
stepwise procedures are proposed for selecting a sub-set of covariates
to include in the regression equation. These include the usual
stepwise procedures, as well as new ones based on the various
mean-square-error criteria and on changes in the coefficient of interest.
These procedures are studied for the data from the Edinburgh Lead Study
and evaluated by simulation in different ways.
The potential for variance reduction from sub-models, compared to
including all covariates, is a function of the multiple correlation
between the variable of special interest and the variables which could
be omitted from the model. The results suggest that, unless this
correlation exceeds 0.2, inferences should be based on a regression
with the full set of covariates. The greatest benefit is obtained from
sub-set selection procedures when the multiple correlation is increased
as a result of a decrease in the residual degrees of freedom. In these
circumstances the multiple correlation will be high, but its value
will fall when the usual adjustment for degrees of freedom is applied.
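The dependence on the multiple correlation described above can be illustrated with standard least-squares results. The notation here is an assumption made for illustration and is not taken from the thesis: x is the variable of special interest, S_xx its corrected sum of squares, q the number of covariates, n the number of observations, and R the sample multiple correlation between x and the covariates. The comparison assumes, as a simplification, that the same residual variance applies to both models and that omitting the covariates introduces no bias; the expressions derived in the thesis may differ.

```latex
% Variance of the coefficient of interest with all q covariates included
\operatorname{Var}\bigl(\hat\beta_x \mid \text{full}\bigr)
  = \frac{\sigma^2}{S_{xx}\,(1 - R^2)}

% Ratio to the variance when all q covariates are omitted
\frac{\operatorname{Var}\bigl(\hat\beta_x \mid \text{full}\bigr)}
     {\operatorname{Var}\bigl(\hat\beta_x \mid \text{reduced}\bigr)}
  = \frac{1}{1 - R^2},
\qquad R = 0.2 \;\Rightarrow\; \frac{1}{1 - 0.04} \approx 1.04

% Usual adjustment of R^2 for degrees of freedom
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - q - 1}

% Expected sample R^2 when the population multiple correlation is zero
% (normally distributed data)
\operatorname{E}\bigl[R^2\bigr] = \frac{q}{n - 1}
```

Under these assumptions a multiple correlation of 0.2 allows a variance reduction of only about 4 per cent, while the last expression shows how the unadjusted correlation grows as q approaches n even when the population correlation is zero.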
The simulation results suggest that sub-set selection will be
beneficial when the residual degrees of freedom for the full model are
less than three times the number of covariates.
The method which performed best was to select, at each step, the
variable which made the largest change in the coefficient of interest.
Stopping rules for this criterion are proposed. This method was less
prone than the other methods to underestimate the variance of the
coefficient of interest, when this is evaluated in the usual way for
the final model. However, it performed badly, underestimating this
variance, for artificial data in which the population multiple correlation
between the variable of special interest and the covariates was high.
This suggests that sub-set selection should not be used when the
estimated multiple correlation adjusted for degrees of freedom is high.
These criteria applied to the Lead Study data would suggest that
the effect of lead on ability should be assessed by adjusting for all
the covariate scores.
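The selection rule based on changes in the coefficient of interest can be sketched as follows. This is a minimal illustration, not the procedure implemented in the thesis: it assumes a forward search, the function names are hypothetical, and the relative tolerance `rel_tol` is a hypothetical stopping rule standing in for the stopping rules the thesis proposes, which are not given in this abstract.

```python
import numpy as np


def coef_of_interest(y, x, Z, cols):
    """OLS coefficient of x when y is regressed on an intercept, x and Z[:, cols]."""
    design = np.column_stack([np.ones(len(y)), x] + [Z[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]  # column 0 is the intercept, column 1 is x


def change_in_estimate_forward(y, x, Z, rel_tol=0.05):
    """Forward selection by largest change in the coefficient of x.

    At each step, add the covariate whose inclusion changes the coefficient
    of x the most. Stop (hypothetical rule, assumed for illustration) when
    the largest change is below rel_tol times the current coefficient.
    Returns the selected column indices and the final coefficient of x.
    """
    selected = []
    remaining = list(range(Z.shape[1]))
    current = coef_of_interest(y, x, Z, selected)
    while remaining:
        # Absolute change in the coefficient of x for each candidate covariate
        changes = {
            j: abs(coef_of_interest(y, x, Z, selected + [j]) - current)
            for j in remaining
        }
        best = max(changes, key=changes.get)
        if changes[best] < rel_tol * abs(current):
            break
        selected.append(best)
        remaining.remove(best)
        current = coef_of_interest(y, x, Z, selected)
    return selected, current
```

Calling `change_in_estimate_forward(y, x, Z)` with a response vector `y`, the variable of interest `x` and a covariate matrix `Z` returns the indices of the chosen covariates and the resulting coefficient of `x`. The variance reported for that coefficient from the final model does not allow for the selection step, which is the underestimation the abstract warns about.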