Predictive modelling of soil lead in urban environments
Date: 17/05/2023
Author: Donoghue, Sarah L.
Abstract
Lead (Pb) is a naturally occurring, potentially toxic element that can cause adverse health effects. In particular, Pb is a neurotoxin and can cause neurological damage (for example, lower IQ scores) even at low exposure levels. Young children, under six years old, are at particular risk. An important possible exposure pathway is ingestion or inhalation of contaminated urban soil. Locating soils with high Pb concentrations traditionally involves costly and time-consuming sampling, laboratory analysis, and mapping. Machine learning techniques that utilise existing data from similar urban areas offer a potentially valuable short-cut.
This research analyses three soil Pb datasets at different scales in Greater Glasgow: the city-scale British Geological Survey's (BGS) Geochemical Baseline of the Environment (G-BASE) dataset, and two new neighbourhood-scale datasets from Paisley and Bishopbriggs. All samples were collected using a non-probabilistic grid sampling strategy. Consequently, spatial dependence is present in all datasets and model-based analysis is required to account for spatial error. Spatial variability in each dataset was assessed and mapped using local Moran’s I analysis, and using semivariograms fitted by restricted maximum likelihood (REML) with ordinary kriging. Hotspots and coldspots of soil Pb concentrations were visible in all study areas.
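As an illustration of the spatial autocorrelation idea, the sketch below computes the global form of Moran's I, which is simpler than the local variant named in the abstract and is not the thesis's own workflow. The binary distance-band weights, the function name `global_morans_i`, and the toy transect values are all assumptions made for the example.

```python
from math import dist


def global_morans_i(coords, values, band):
    """Global Moran's I with binary distance-band weights.

    w_ij = 1 when two distinct points lie within `band` of each other,
    otherwise 0. This weighting scheme is an assumption for illustration.
    """
    n = len(values)
    if n != len(coords) or n < 2:
        raise ValueError("need matching coords and values, at least 2 points")
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    weight_sum = 0.0                        # sum of all w_ij
    cross = 0.0                             # sum of w_ij * z_i * z_j
    for i in range(n):
        for j in range(n):
            if i != j and dist(coords[i], coords[j]) <= band:
                weight_sum += 1.0
                cross += z[i] * z[j]
    sum_sq = sum(d * d for d in z)
    if weight_sum == 0 or sum_sq == 0:
        raise ValueError("no neighbours within band, or constant values")
    return (n / weight_sum) * (cross / sum_sq)


# Hypothetical transect of six sample points one unit apart.
points = [(float(x), 0.0) for x in range(6)]
clustered = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]    # similar values sit together
alternating = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]  # neighbours always differ

print(global_morans_i(points, clustered, band=1.0))
print(global_morans_i(points, alternating, band=1.0))
```

Positive values indicate that similar concentrations cluster in space; negative values indicate that neighbouring samples tend to differ.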
Spatial heterogeneity in soil Pb concentrations was explored by examining the relationship between soil Pb concentrations and possible covariates. Potential covariates were selected from the literature and collated in GIS using nearest neighbour joins. The relationship between each covariate and soil Pb concentrations was assessed using model-based geostatistics, i.e. linear mixed models (LMM) and Wald tests. Additionally, each covariate was included when mapping soil Pb concentrations using the empirical best linear unbiased predictor (E-BLUP). Overall, only covariates which showed an almost significant (p < 0.06) and justifiable relationship with soil Pb concentrations were input into the machine learning models. However, geographical scale and location can affect the influence of a covariate on soil Pb.
In the Glasgow city dataset, six covariates (soil organic matter (OM) percentage, land use, building age, and historic industry type, age, and density) were used to train a random forest (RF) model. This was combined with quantile regression forests (QRF) to predict soil Pb concentrations for a range of percentiles and for different soil Pb groups, for example only high samples containing ≥200 mg kg⁻¹ of soil Pb. The accuracy of these predictions was determined using 10-fold cross validation and was reasonably good: the concordance correlation coefficient (CCC) for all samples at the 50th percentile was 0.33, and the Pearson’s R² value was 0.51. Higher percentiles (i.e. the 75th or 95th percentile) more accurately predicted samples containing high Pb concentrations. This knowledge could be used to more accurately locate soil Pb hotspots.
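For readers unfamiliar with the CCC, the following minimal sketch computes Lin's concordance correlation coefficient using population (1/n) moments. The function name `concordance_correlation` and the sample values are hypothetical; the thesis's own validation code is not shown in the abstract.

```python
from statistics import fmean


def concordance_correlation(measured, predicted):
    """Lin's concordance correlation coefficient.

    CCC = 2 * cov / (var_m + var_p + (mean_m - mean_p)^2),
    with variances and covariance divided by n (population moments).
    """
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_m = fmean(measured)
    mean_p = fmean(predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((m - mean_m) * (p - mean_p)
              for m, p in zip(measured, predicted)) / n
    return 2 * cov / (var_m + var_p + (mean_m - mean_p) ** 2)


# Hypothetical values: predictions perfectly correlated with the
# measurements but offset by a constant bias of one unit.
measured = [1.0, 2.0, 3.0]
biased = [2.0, 3.0, 4.0]

print(concordance_correlation(measured, measured))  # perfect agreement
print(concordance_correlation(measured, biased))    # penalised for the bias
```

Unlike Pearson's correlation, the CCC drops when predictions are systematically offset from the measurements, which is why a model can show a moderate correlation yet a lower CCC.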
QRF was also used to predict areas of the city at higher risk (≥30% probability) of containing high soil Pb concentrations (≥200 mg kg⁻¹). This probability was mapped for a prediction grid covering the city to give a city-wide map of soil Pb risk, with hotspots from Getis-Ord Gi* cluster analysis added as contour lines. The prediction accuracy was assessed by comparing measured and predicted soil Pb groups using a confusion matrix, where the overall accuracy was good at 74% and the user’s and producer’s accuracies for the high group were both >50%.
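The accuracy measures named above can be sketched as follows, assuming a two-class split at the 200 mg kg⁻¹ and 30% probability thresholds stated in the abstract. The function name `accuracy_summary` and the five sample values are hypothetical and are not data from the thesis.

```python
PB_THRESHOLD = 200.0    # mg/kg; "high" soil Pb threshold from the abstract
PROB_THRESHOLD = 0.30   # probability at or above which a site is flagged


def accuracy_summary(measured_pb, prob_high):
    """Overall, producer's and user's accuracy for the 'high' class."""
    if len(measured_pb) != len(prob_high) or not measured_pb:
        raise ValueError("need two equal-length, non-empty sequences")
    tp = fp = fn = tn = 0
    for pb, prob in zip(measured_pb, prob_high):
        actual_high = pb >= PB_THRESHOLD
        predicted_high = prob >= PROB_THRESHOLD
        if actual_high and predicted_high:
            tp += 1
        elif predicted_high:
            fp += 1     # flagged high, measured low
        elif actual_high:
            fn += 1     # measured high, not flagged
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "overall": (tp + tn) / total,
        # Producer's accuracy: share of measured-high samples that were flagged.
        "producers_high": tp / (tp + fn) if tp + fn else None,
        # User's accuracy: share of flagged samples that were measured high.
        "users_high": tp / (tp + fp) if tp + fp else None,
    }


# Hypothetical measured concentrations (mg/kg) and predicted probabilities.
measured_pb = [250.0, 90.0, 300.0, 120.0, 50.0]
prob_high = [0.60, 0.40, 0.10, 0.05, 0.20]

summary = accuracy_summary(measured_pb, prob_high)
print(summary)
```

Producer's accuracy matters most when the aim is to avoid missing contaminated sites, while user's accuracy indicates how much follow-up sampling would be spent on false alarms.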
In the neighbourhood datasets, different covariates were used to train an RF model: soil OM, land use, building age, building distance, and road distance in Paisley, and soil OM, land use, and historic industry distance and type in Bishopbriggs. Validation statistics indicated that the models were reasonably accurate, i.e. the CCC for all samples at the 50th percentile was 0.35 and 0.56 for Bishopbriggs and Paisley respectively (Pearson’s R² = 0.59 and 0.62).
Although the Paisley model had better overall accuracy, the Bishopbriggs model was more accurate at predicting samples with high soil Pb concentrations using the 75th percentile (CCC = 0.75). The Paisley and Bishopbriggs RF models were also used with QRF to predict the probability of an unknown location containing high soil Pb concentrations. These probability predictions showed very good overall accuracy (83% and 84% for Bishopbriggs and Paisley respectively), with 98% producer’s accuracy for high samples in Paisley.
This research is unique in testing the accuracy of an RF model in cities on which it was not trained. When the Glasgow QRF model was used to predict locations at high risk from high soil Pb concentrations in Belfast and Leicester, the overall accuracy was 71% and 80%, and the producer’s accuracy for known high Pb soil samples was 53% and 47%, respectively. Consequently, the machine learning model can help locate the areas of a city most at risk from high soil Pb concentrations, to target subsequent soil sampling and eventual remediation.