Species distribution models are a useful tool for evaluating and monitoring forest resources, as they facilitate management planning under a changing climate. A large number of algorithms have recently been proposed for this purpose, making it difficult to select the most appropriate one. Evaluating the performance and predictive stability of these models can help to resolve this problem. Distribution data for 17 pine species of high economic importance in Mexico were collected and used to fit distribution models. We carried out a pre-modeling design to select the predictor variables (climatic, edaphic and topographic), after which nine algorithms and an ensemble model were compared. The true skill statistic (TSS) and the area under the curve (AUC) were used to evaluate the predictive performance of the models, and the coefficient of variation (CV) of the predictions was used to evaluate their stability. The number of predictor variables in the final models ranged from 6 to 12; the mean diurnal range and the maximum temperature of the warmest month were included in the models for most species. Random forests, the ensemble model, generalized additive models and MaxEnt best described the distribution of the species (AUC >0.92 and TSS >0.72), whereas Bioclim and Domain performed worst (AUC <0.75 and <0.82, and TSS <0.5 and <0.55, respectively). Support vector machine, Mahalanobis distance, generalized linear models and boosted regression trees showed intermediate performance. The coefficient of variation indicated that Bioclim, Domain and support vector machine have low predictive stability (CV >0.055), whereas MaxEnt and the ensemble model attained high predictive stability (CV <0.015). The ensemble model obtained greater performance and predictive stability in predicting the distribution of the 17 pine species.
The differences found in performance and predictive stability of the algorithms suggest that the ensemble model has the potential to model the distribution of tree species.
Species distribution models (SDM) are statistical methods or machine learning algorithms used to model and map past, present or future species distributions (
SDM are built following three different approaches: correlative, mechanistic, and process-oriented or hybrid (
The algorithms available to implement SDM under the correlative approach are diverse (
The genus
Recently, efforts have been made in Mexico to model the distribution areas of some pine species by means of the MaxEnt algorithm (
The seventeen pine species that most contribute to timber production in Mexico (up to 70 %) were considered in the study:
The databases for each species were entered into the Diva-GIS program ver. 7.5 (
The environmental variables that characterize the areas where species presences were recorded were obtained from the WorldClim version 2 repository, with an approximate resolution of 1 km² (
The selection of the environmental variables to be used in SDM is an important aspect in the modeling process, since they have a significant effect on the predictive performance of the models (
In the process of building species distribution models, the definition of the accessible area for the species is a critical factor for the result of the calibration, evaluation and comparison of the model (
The modeling process for each species was carried out with the following configuration: 20,000 pseudo-absences were randomly generated and the models were evaluated with the bootstrap resampling method with 10 repetitions (
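As an illustration of the resampling scheme, the following minimal Python sketch draws bootstrap samples of record indices (presences plus pseudo-absences) and keeps the records that were not drawn for evaluation. It is not the implementation used in the study: the function name `bootstrap_splits`, the seed, and the out-of-bag evaluation rule are assumptions made for this example.

```python
import random

def bootstrap_splits(n_records, n_reps=10, seed=0):
    """Yield (train_idx, test_idx) index lists for bootstrap evaluation.

    Hypothetical helper: each repetition samples n_records indices with
    replacement for training; records never drawn (out-of-bag) form the
    evaluation set. The rule used by the original software may differ.
    """
    rng = random.Random(seed)
    for _ in range(n_reps):
        train = [rng.randrange(n_records) for _ in range(n_records)]
        in_bag = set(train)
        test = [i for i in range(n_records) if i not in in_bag]
        yield train, test

# Example with 100 records and the 10 repetitions mentioned in the text.
splits = list(bootstrap_splits(n_records=100, n_reps=10))
for rep, (train, test) in enumerate(splits, start=1):
    print(f"repetition {rep}: {len(set(train))} unique training records, "
          f"{len(test)} evaluation records")
```

Each repetition leaves roughly one third of the records out of the training sample, which gives an independent set for computing evaluation statistics.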
The performance of the algorithms was evaluated through the TSS (
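Both statistics can be sketched in a few lines of Python. TSS is sensitivity plus specificity minus one, computed after converting predicted suitability into presence/absence with a threshold. AUC is the probability that a randomly chosen presence receives a higher score than a randomly chosen pseudo-absence. The 0.5 threshold, the function names and the toy data below are assumptions for illustration only, not values taken from the study.

```python
def tss(observed, scores, threshold=0.5):
    """True skill statistic: sensitivity + specificity - 1.

    observed: 1 for presence, 0 for (pseudo-)absence.
    threshold: assumed cut-off for turning scores into binary predictions.
    """
    tp = fn = tn = fp = 0
    for obs, score in zip(observed, scores):
        predicted_presence = score >= threshold
        if obs == 1:
            if predicted_presence:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_presence:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

def auc(observed, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: three presences followed by three pseudo-absences.
observed = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(round(tss(observed, scores), 3))  # 0.333
print(round(auc(observed, scores), 3))  # 0.889
```

TSS ranges from -1 to 1 and depends on the chosen threshold, whereas AUC ranges from 0 to 1 and is threshold-independent, which is one reason the two are often reported together.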
The pre-modeling and modeling processes were carried out using the “sdm” package (
The models were fitted with different numbers of presence records. In this regard,
The evaluation statistics of predictive performance differed among the algorithms and the ensemble model. Six of the analyzed algorithms generated lower AUC (<0.90) and TSS (<0.70) values than the rest of the algorithms (
The nine algorithms and the ensemble model differed in predicting spatial ranges for all species. For most species DOM predicted a larger distribution area (
The coefficient of variation indicated that BIO, DOM and SVM have the lowest stability in predictions (CV >0.055); their error bars are also large, indicating that CV values differ widely between species (
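The coefficient of variation is the standard deviation divided by the mean, so lower values indicate more stable predictions. The sketch below assumes that stability is summarized by computing the CV of each map cell across replicate predictions and then averaging over cells; the exact aggregation used in the study is not stated here, and the function names and toy values are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean of the values."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m

def mean_cv(replicate_predictions):
    """Average per-cell CV across replicates (assumed aggregation).

    replicate_predictions: list of replicates, each a list with one
    predicted suitability value per map cell, in the same cell order.
    """
    per_cell = zip(*replicate_predictions)  # regroup values by cell
    return mean(coefficient_of_variation(cell) for cell in per_cell)

# Toy example: three replicates, two cells each.
stable = [[0.80, 0.50], [0.81, 0.49], [0.79, 0.51]]
unstable = [[0.90, 0.30], [0.60, 0.60], [0.70, 0.50]]
print(mean_cv(stable) < mean_cv(unstable))  # True
```

Under this reading, an algorithm with CV below 0.015 produces nearly identical suitability values across replicates, while one above 0.055 changes noticeably from one resample to the next.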
The SDMs with their correlative algorithms are the result of the projection of an ecological niche model created in the environmental space with different variables (
In our study, final models for all species used different combinations of variables, although it was observed that bio2 and bio5 were important variables for many species. This suggests that the distribution of the pine species considered in the present study is explained by variables related to temperature, which coincides with
Testing several algorithms to perform SDM helps to have a broader view of the advantages and disadvantages of the use of each single algorithm (
Regarding the remaining evaluated algorithms, it was observed that SVM, MD, GLM and BRT obtained a medium predictive performance. BIO and DOM were the algorithms with the lowest predictive performance. These results are similar to those of
The predicted distribution areas for each of the 17 analyzed pine species (Fig. S1 in Supplementary material) coincided with the distribution reported by
The different algorithms showed significant differences in the predicted spatial distribution areas (
The results indicated that three of the tested algorithms have high prediction variability, whereas EM and MAX showed the lowest variability (
The ensemble model showed the highest predictive performance in modeling the spatial distribution of 17 pine species in Mexico, although RF, MAX and GAM also provided good predictions. The remaining algorithms (BRT, GLM, MD, SVM, DOM and BIO) were less accurate. BIO, DOM and SVM showed the greatest variability in predictions, MAX and EM the smallest, and the other algorithms an intermediate variability in the predicted distribution areas. For most of the species (16), the ensemble model showed the best predictive performance, although for
Assessing the performance of algorithms for predicting the spatial distribution of species is an important step in the modeling process and must be carried out carefully, since selecting one algorithm over another could lead to different results and therefore to different conclusions. Because the predictive differences between the algorithms are relatively large, the choice of one over another should be based on the study objectives. The results derived from this research suggest that it is not advisable to choose an algorithm
The first author thanks CONACYT (
True skill statistic (TSS) and area under the curve (AUC) of nine algorithms and ensemble model for predicting the geographic range of 17 pine species. (BIO): Bioclim; (DOM): Domain; (SVM): Support Vector Machine; (MD): Mahalanobis distance; (GLM): Generalized linear models; (BRT): Boosted regression trees; (GAM): Generalized additive models; (MAX): MaxEnt; (RF): Random forests; (EM): Ensemble model.
Distribution area predicted by nine algorithms and ensemble model for the modeling of 17 pine species in Mexico. (BIO): Bioclim; (BRT): Boosted regression trees; (EM): Ensemble model; (MD): Mahalanobis distance; (DOM): Domain; (GAM): Generalized additive models; (GLM): Generalized linear models; (MAX): MaxEnt; (RF): Random forests; (SVM): Support vector machine.
Coefficient of variation and TSS of nine algorithms and ensemble model for the modeling of the geographic distribution of 17 pine species in Mexico. (BIO): Bioclim; (DOM): Domain; (SVM): Support vector machine; (MD): Mahalanobis distance; (GLM): Generalized linear models; (BRT): Boosted regression trees; (GAM): Generalized additive models; (MAX): MaxEnt; (RF): Random forests; (EM): Ensemble model.
Algorithms evaluated to build the spatial distribution models of 17 species of pines. *For these analyses, 20,000 pseudo-absences were selected at random for each evaluated species.
| Statistical approach - General description | Data | Algorithm | Reference |
|---|---|---|---|
| Distance - climate envelope algorithms that calculate the similarity between candidate pixels and the selected presence records. | Presence | Bioclim (BIO) | |
| | | Domain (DOM) | |
| | | Mahalanobis distance (MD) | |
| Regression - algorithms that model the mean of a response variable as a function of predictor variables; they use the logit link function to relate the expected value of the response variable to the included predictors. | Presence/absence* | Generalized linear models (GLM) | |
| | | Generalized additive models (GAM) | |
| Machine learning - within SDM, these algorithms focus on classification. Their main objective is to automatically improve the classification of training data until a better model is obtained. | Presence/absence* | Support vector machine (SVM) | |
| | | Boosted regression trees (BRT) | |
| | | Random forests (RF) | |
| | Presence/background* | MaxEnt (MAX) | |
Number of occurrences by species used to estimate the distribution area of 17 species of pines. (NP): Number of presences.
| Species | NP | Species | NP |
|---|---|---|---|
| | 1165 | | 370 |
| | 234 | | 560 |
| | 1846 | | 2182 |
| | 540 | | 479 |
| | 394 | | 1667 |
| | 1426 | | 1430 |
| | 448 | | 117 |
| | 803 | | 1786 |
| | 2461 | - | - |
Description of climatic, edaphic and topographic variables employed to build the spatial distribution models of 17 species of pines.
| Category | Variable | Key |
|---|---|---|
| Climatic | Annual mean temperature (°C) | bio1 |
| | Mean diurnal range (°C) | bio2 |
| | Maximum temperature of warmest month (°C) | bio5 |
| | Minimum temperature of coldest month (°C) | bio6 |
| | Mean temperature of warmest quarter (°C) | bio10 |
| | Mean temperature of coldest quarter (°C) | bio11 |
| | Annual precipitation (mm) | bio12 |
| | Precipitation of wettest month (mm) | bio13 |
| | Precipitation of driest month (mm) | bio14 |
| | Precipitation of wettest quarter (mm) | bio16 |
| | Precipitation of driest quarter (mm) | bio17 |
| Soil properties | Sand (g kg-1) | sa |
| | Cation exchange capacity (mmol(c) kg-1) | cec |
| | Clay content (g kg-1) | clc |
| | Organic carbon soil (dg kg-1) | ocs |
| | Bulk density (cg cm-3) | bd |
| | Organic carbon density (g dm-3) | ocd |
| | Silt (g kg-1) | sil |
| | Nitrogen (cg kg-1) | nit |
| | pH water (pH·100) | pH |
| | Soil organic carbon stock (t ha-1) | socs |
| Topographic | Altitude (m a.s.l.) | alt |
| | Slope (°) | slo |
| | Diurnal anisotropic heat | dah |
| | Convergence index | ci |
| | Terrain ruggedness index (m) | tri |
| | Topographic wetness index | twi |
Important and uncorrelated variables used in the spatial distribution models of 17 pine species in Mexico. (bio1): annual mean temperature (°C); (bio2): mean diurnal range (°C); (bio5): maximum temperature of warmest month (°C); (bio6): minimum temperature of coldest month (°C); (bio10): mean temperature of warmest quarter (°C); (bio11): mean temperature of coldest quarter (°C); (bio12): annual precipitation (mm); (bio13): precipitation of wettest month (mm); (bio14): precipitation of driest month (mm); (bio16): precipitation of wettest quarter (mm); (bio17): precipitation of driest quarter (mm); (sa): sand (g kg-1); (cec): cation exchange capacity (mmol (c) kg-1); (clc): clay content (g kg-1); (ocs): organic carbon soil (dg kg-1); (bd): bulk density (cg cm-3); (ocd): organic carbon density (g dm-3); (sil): silt (g kg-1); (nit): nitrogen (cg kg-1); (pH): pH water (pH·100); (socs): soil organic carbon stock (t ha-1); (alt): altitude (m); (slo): slope (°); (dah): diurnal anisotropic heat; (ci): convergence index; (tri): terrain ruggedness index (m); (twi): topographic wetness index.
Variable | | | | | | | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bio1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
bio2 | - | x | x | x | x | x | x | x | - | x | x | - | x | x | x | x | x |
bio5 | - | x | x | x | - | x | x | x | - | x | x | x | x | x | - | x | x |
bio6 | - | - | x | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
bio10 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
bio11 | - | - | - | - | - | - | - | - | x | - | - | - | - | - | - | - | x |
bio12 | - | - | x | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
bio13 | x | x | - | x | x | x | - | x | - | x | - | x | x | - | x | - | x |
bio14 | x | x | x | - | - | x | x | - | - | x | - | x | x | x | - | - | - |
bio16 | - | - | - | - | - | - | - | - | - | - | x | - | - | - | - | x | - |
bio17 | - | - | - | x | x | - | - | x | - | - | - | - | - | - | x | - | - |
sa | - | x | - | x | - | - | - | - | - | - | - | x | - | - | - | x | - |
cec | - | - | x | x | - | - | x | - | x | - | x | x | - | x | - | x | x |
clc | - | x | x | - | - | - | - | - | - | - | - | x | - | - | - | x | - |
ocs | - | - | - | - | - | x | - | x | - | - | - | - | - | x | - | - | x |
bd | x | x | - | x | - | - | x | - | - | x | x | - | x | x | x | x | - |
ocd | - | - | - | - | - | - | - | - | x | - | - | - | - | - | - | - | - |
sil | - | - | - | x | - | - | x | - | - | x | x | x | x | - | - | x | - |
nit | - | - | - | x | x | - | - | x | - | x | - | - | - | - | - | x | x |
pH | - | - | x | x | x | x | x | x | x | - | - | - | - | - | - | - | x |
socs | x | x | - | - | - | - | x | - | - | - | x | - | x | - | x | x | - |
alt | x | - | - | - | x | - | - | - | x | - | - | - | - | - | x | - | - |
slo | x | x | - | x | - | - | x | - | - | - | - | - | - | - | x | - | - |
dah | - | - | - | x | x | - | - | - | - | x | x | x | - | - | - | - | - |
ci | - | - | - | - | - | - | - | - | - | - | - | - | x | - | - | - | - |
tri | - | - | x | - | - | - | - | - | - | - | - | x | x | x | - | - | x |
twi | - | - | - | - | x | - | x | x | x | - | - | - | - | x | - | - | - |
Fig. S1 - Binary prediction of the ensemble model for 17 species of pines.
Fig. S2 - True skill statistic and AUC of nine algorithms and ensemble model for modeling 17 species of pines.
Fig. S3 - Kruskal-Wallis non-parametric test and classification of algorithms.
Tab. S1 - Relative importance of the variables (auc_test) obtained in the pre-modelling process of 17 pine species.
Tab. S2 - Variance inflation factor of the variables included in final models of spatial distribution of 17 species of pines.