Species distribution modeling for invasive species

Assessing the current and potential distribution of invasive species is critical to identify areas of the landscape vulnerable to invasion and develop informed management strategies. Species Distribution Models (SDM) have been used to develop spatially explicit maps of current and potential distribution using environmental variables -often at coarse resolutions (~1km2)-.

One of the main caveats when modeling the distribution of invasive species is that the invader is not at equilibrium, that is, it might have not yet occupy all the suitable habitat available, thus the importance of carefully selecting absences at areas outside the suitable habitats (see the sampling design tutorial) to increase the accuracy of the distribution model. The aim of this section is to present an iterative approach to select the set of absences that provide higher accuracy values for a number of SDMs approaches, and use those to generate current and potential species distribution maps.

In this tutorial we assume you understand the paramount importance of the knowledge of the species ecology, the informed and careful selection of environmental variables to be included in the model, the estimation of sources of bias in your presence/absence data and the selection of the modelling approach that best suits your data, and the aim of your study.

The dataset

The ingredients necessary to build SDMs are the species occurrence records (presence/absence), the respective probabilities of contingent absences (see the sampling design tutorial to learn how to calculate them), and the environmental variables values at each occurrence record.

The dataset with species occurrences can be download from HERE together with the environmental variables derived from LiDAR. Note that the resolution of the raster files (50 cm) has been reduce (10 meters) to decrease the storage size while increasing the speed of the process.

The tutorial will be divided in:

  1. Data preparation
  2. Absences selection using an SDM itterative approach
  3. Outcome - Potential distribution map

Preparing the data for SDM

The data required for SDM must include for each occurrence (presence/absence): the x,y coordinates, the likelihood of contingent absence (LAC) and a set of column with values at each environmental variable.

Let’s start by installing/loading the necessary packages:

# install.packages('devtools')

library(iSDM)
library(sp)
library(raster)
library(dismo)
library(biomod2)

setwd("/home/garzonc/Desktop/DIARS")

Set the working directory the raster files with LiDAR-derived environmental variables of the study area. The species absences/presences will be generated using the package virtualspecies, and limiting the absence/presence data to mimic ground sampling in a realized distribution scenario. Note that the number of absence should be at least equal to the number of presences to have a balance presence/absence dataset,thus including a bit more absences in the likelihood evaluation increases the probability of non contingent absences.

library(raster)
files <- list.files("/home/garzonc/Desktop/DIARS/Rasters1/", full.names = TRUE)
names(files) = paste("var", 1:5, sep = "")
envData <- stack(files)


library(virtualspecies)
# Virtual species response to environmental data
potential.dist <- generateRandomSp(raster.stack = envData, relations = c("gaussian"))

# limit the potential distribution to an specific area with the realized
# distribution (non equilibrium with the environment)
library(virtualspecies)
realized.dist <- limitDistribution(x = potential.dist$suitab.raster, area = extent(453000, 
    465500, 6080000, 6095000))

Pres <- coordinates(realized.dist$occupied.area)
P <- as.data.frame(Pres[which(values(realized.dist$occupied.area) > 0.5), ])
Presence <- P[sample(100), ]


Abse <- coordinates(realized.dist$occupied.area)
A <- as.data.frame(Abse[which(values(realized.dist$occupied.area) < 0.5), ])
Absence <- A[sample(100), ]

occData <- as.data.frame(rbind(cbind(Presence, SP = rep(1, 100)), cbind(Absence, 
    SP = rep(0, 100))))

occData <- na.omit(occData)
coordinates(occData) <- ~x + y
proj4string(occData) <- proj4string(envData)



plot(envData[["var1"]], col = "gray", legend = FALSE)
plot(occData, add = TRUE, pch = 1, cex = 0.1, col = ifelse(occData$SP == 0, 
    "red", "blue"))

library(iSDM)
likelihood <- pDLA(occData = occData, envData = envData, longlat = FALSE)

occsPDLA <- as.data.frame(cbind(occData, likelihood@data[, "PDLA"]))
occsPDLA <- occsPDLA[, c(3, 4, 1, 2)]
names(occsPDLA) <- c("x", "y", "occs", "PDLA")
head(occsPDLA)
##         x       y occs PDLA
## 23 458170 6091410    1    0
## 67 454650 6080210    1    0
## 20 458550 6092430    1    0
## 1  459010 6094930    1    0
## 46 458830 6086370    1    0
## 6  460270 6094750    1    0

Explore the result of the likelihood calculation by generating a histogram that will show distribution of the absences along the likelihood values.

hist(occsPDLA$PDLA, xlab = "PDLA", main = "Histogram")

# To plot the resulting map using ggplot2:

library(ggplot2)
envMap <- rasterToPoints(envData[["var1"]])
Map <- data.frame(envMap)
colnames(Map) <- c("X", "Y", "MAP")
ggplot(data = Map, aes(y = Y, x = X)) + geom_raster(fill = "gray") + geom_point(data = occsPDLA, 
    aes(x = x, y = y, color = PDLA)) + theme_bw() + coord_equal() + scale_color_gradient("PDLA", 
    low = "dark green", high = "red") + theme(axis.title.x = element_blank(), 
    axis.title.y = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
    legend.position = "right", legend.key = element_blank())

Absences selection using an SDM iterative approach

Run the modeling approach for different values of likelihood of contingent absence (PCA) and select the PCA values (for each modelling approach) that gives the highest True Skill Statistic (TSS) values. In this example we start with a PCA value 0.23 as it is the minimum value of PCA to allow having at least one absence in the dataset, check what is the smallest value in your likelihood calculation and run the script. We will use Random Forest (RF) modelling approach but you could run this iterative approach with multiple approaches (visit Biomod2 package tutorial to learn how to include other SDM models and customize the script to your data and objectives)

library(biomod2)

valuesTSS <- matrix(, nrow = 0, ncol = 2)


library(emdbook)
for (i in lseq(0.23, 1, 15)) {
    temp <- subset(occsPDLA, occsPDLA$PDLA < i)
    
    myRespName <- "campi"
    myResp <- as.numeric(temp[, 3])
    myRespCoord <- temp[, 1:2]
    myExpl <- envData
    
    myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl, 
        resp.xy = myRespCoord, resp.name = myRespName)
    
    # 2. Defining Models Options using default options.
    myBiomodOption <- BIOMOD_ModelingOptions()
    
    # 3. Doing Modelisation
    myBiomodModelOut <- BIOMOD_Modeling(myBiomodData, models = c("GLM"), models.options = myBiomodOption, 
        NbRunEval = 5, DataSplit = 70, VarImport = 0, models.eval.meth = c("TSS", 
            "ROC"), SaveObj = TRUE, do.full.models = FALSE, modeling.id = paste(myRespName, 
            i, sep = "_"))
    
    myBiomodModelEval <- get_evaluations(myBiomodModelOut)
    ModelEvalTSS <- myBiomodModelEval["TSS", "Testing.data", , , ]
    fooData.T <- t(ModelEvalTSS)
    PDLA <- c(i)
    temptss <- cbind(PDLA, mean(fooData.T[, 1]))
    valuesTSS <- as.data.frame(rbind(valuesTSS, temptss))
}

Explore the outcome and select the dataset with the PCA value that have the highest TSS score.

valuesTSS
##         PDLA    V2
## 1  0.2300000 1.000
## 2  0.2554575 1.000
## 3  0.2837328 1.000
## 4  0.3151378 1.000
## 5  0.3500188 1.000
## 6  0.3887605 1.000
## 7  0.4317905 1.000
## 8  0.4795832 1.000
## 9  0.5326658 1.000
## 10 0.5916238 1.000
## 11 0.6571077 1.000
## 12 0.7298395 0.933
## 13 0.8106218 1.000
## 14 0.9003454 0.967
## 15 1.0000000 0.967
plot(valuesTSS[, 1], valuesTSS[, 2], xlim = c(1, 0), xlab = "PDLA", ylab = "TSS")

Outcome - Potential distribution map

The distribution map using the highest TSS values and the Random Forest approach (RF) is:

par(mfrow = c(1, 1))
SetRF <- subset(occsPDLA, occsPDLA$PDLA < 0.3618639)

myRespName <- "campi"
myResp <- as.numeric(SetRF[, 3])
myRespCoord <- SetRF[, 1:2]



myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl, resp.xy = myRespCoord, 
    resp.name = myRespName)

# 2. Defining Models Options using default options.
myBiomodOption <- BIOMOD_ModelingOptions()

# 3. Doing Modelisation
myBiomodModelOut <- BIOMOD_Modeling(myBiomodData, models = c("RF"), models.options = myBiomodOption, 
    NbRunEval = 1, DataSplit = 70, models.eval.meth = c("TSS", "ROC"), do.full.models = FALSE)

myBiomodModelEval <- get_evaluations(myBiomodModelOut)
myBiomodModelEval

# projection over the globe under current conditions
myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl, 
    proj.name = "current", selected.models = "all", compress = FALSE, build.clamping.mask = FALSE)
library(biomod2)
mod_proj <- get_predictions(myBiomodProj)
mod_proj
plot(mod_proj, main = "Predicted potential distribution - RF")