Gaussian Processes for spatial modeling and bias control
In this tutorial I will go over the basics of how to implement and optimize Gaussian Processes (GPs) for areal control and spatial modeling in typology. This will not be a mathematical introduction; that has already been done much better than I possibly could. If you are interested in all the mathematical underpinnings of GPs, I highly recommend reading Betancourt's explanation.
If you find this tutorial useful, please cite:
Guzmán Naranjo, Matías and Becker, Laura. “Statistical bias control in typology” Linguistic Typology, vol. 26, no. 3, 2022, pp. 605-670. https://doi.org/10.1515/lingty-2021-0002
Introduction
In typology we usually are faced with the issue of spatial autocorrelation in our data. Spatial autocorrelation can arise for different reasons, but most commonly it is due to contact. I will exclusively focus on contact-like phenomena in this tutorial and will leave other more complex spatial structures for a future blogpost.
While there are several different techniques that have been proposed as a means of controlling for spatial autocorrelation, this tutorial will focus on GPs. GPs are (very roughly speaking) a method to add non-linear structures to a generalized linear model. The intuition is that observations which are close together will be similar to each other, while observations further apart will not. The key aspect of GPs is that the similarity decreases non-linearly, and can go down to zero very quickly. For linguistic purposes, this assumption makes sense. Depending on topography and other factors, two languages 200 km apart may be very similar due to contact, but two languages 400 km apart may already be too far away from each other to have experienced any significant amount of contact.
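To get a feel for this non-linear decay, here is a minimal sketch using the exponential quadratic correlation function; the 150 km length-scale is made up purely for illustration and is not an estimate from any model.

```r
## correlation under an exponential quadratic kernel
## the length-scale of 150 km is a hypothetical value, not an estimate
corr_at <- function(d, lscale) exp(-d^2 / (2 * lscale^2))
round(corr_at(c(100, 200, 400), lscale = 150), 3)
## 0.801 0.411 0.029
```

Note how the correlation does not drop linearly: doubling the distance from 200 km to 400 km takes it from around 0.41 to almost zero.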
Using brms
I will skip one-dimensional examples and jump right into a two-dimensional case study. We first need some data with spatial structure. We will use PHOIBLE as a test case, but any type of linguistic data should work. Specifically, we will look at the distribution of /t/ in Australia.
library(tidyverse)
## ── Attaching core tidyverse packages ───────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(geodist)
library(rnaturalearth)
library(rstan)
## Loading required package: StanHeaders
##
## rstan version 2.32.6 (Stan version 2.32.2)
##
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
##
##
## Attaching package: 'rstan'
##
## The following object is masked from 'package:tidyr':
##
## extract
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.20.1). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
##
## Attaching package: 'brms'
##
## The following object is masked from 'package:rstan':
##
## loo
##
## The following object is masked from 'package:stats':
##
## ar
library(sf)
## Linking to GEOS 3.12.2, GDAL 3.9.1, PROJ 9.4.1; sf_use_s2() is TRUE
## devtools::install_github('Mikata-Project/ggthemr')
library(ggthemr)
ggthemr("fresh")
setwd("~/Documentos/Uni/blog/content/post/2024-07-07-gaussian-process-for-spatial-modeling/")
Sys.setenv(PROJ_DATA = "/home/matias/miniforge3/envs/env-phylogenetic-blog/share/proj")
phoibl <- read_csv("https://raw.githubusercontent.com/phoible/dev/master/data/phoible.csv")
## Rows: 105484 Columns: 49
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (47): Glottocode, ISO6393, LanguageName, SpecificDialect, GlyphID, Phone...
## dbl (1): InventoryID
## lgl (1): Marginal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glottolog <- read_csv("https://cdstar.eva.mpg.de//bitstreams/EAEA0-CFBC-1A89-0C8C-0/languages_and_dialects_geo.csv")
## Rows: 22111 Columns: 7
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): glottocode, name, isocodes, level, macroarea
## dbl (2): latitude, longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aus_glotto <- filter(glottolog, macroarea == "Australia") %>% pull(glottocode)
phoibl_aus <- filter(phoibl, Glottocode %in% aus_glotto)
set.seed(1234)
phoibl_aus_wide <-
phoibl_aus %>% select(Glottocode, InventoryID, Phoneme) %>%
mutate(value = 1) %>%
pivot_wider(names_from = Phoneme, values_from = value, values_fill = 0) %>%
select(Glottocode, InventoryID, t) %>%
group_by(Glottocode) %>%
slice_sample(n = 1) %>%
ungroup()
phoibl_aus_wide <-
left_join(phoibl_aus_wide, select(glottolog, glottocode, longitude, latitude)
, by = c("Glottocode" = "glottocode"))
We can plot the distribution of /t/ in the data.
australia_shp <- ne_countries(country = "Australia")
ggplot() +
geom_sf(data = australia_shp)+
geom_point(data = phoibl_aus_wide
, aes(x = longitude, y = latitude, shape = ifelse(t == 1, "yes", "no"))
, color = "#E84646", size = 2) +
scale_shape(name = "has /t/")
## Warning: Removed 41 rows containing missing values or values outside the scale
## range (`geom_point()`).
At first sight, it does look like there are some areal structures in the data, with the languages in the north-east and east lacking /t/, and the languages in the west consistently containing /t/ in their inventories.
However, because GPs can be very slow, we will work with a subset of the data, but the same could be achieved with the whole dataset.
set.seed(1234)
phoibl_aus_wide_s <- phoibl_aus_wide %>% na.omit() %>% slice_sample(n = 100)
phoibl_aus_wide_s
## # A tibble: 100 × 5
## Glottocode InventoryID t longitude latitude
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 yand1253 2754 0 141. -28.5
## 2 kala1377 475 1 142. -10.7
## 3 kawa1290 2883 0 142. -16.1
## 4 lowe1436 2709 1 135. -26.3
## 5 jand1248 2728 0 153. -27.6
## 6 kala1380 2767 1 144. -28.1
## 7 punt1240 2758 1 143 -27.3
## 8 gure1255 2938 0 151. -25.4
## 9 gugu1253 2819 0 144. -18.5
## 10 wira1265 2934 1 134 -32
## # ℹ 90 more rows
The simplest way to build a GP is to use brms. The advantage of brms is that it is easy to use, and it has some useful optimizations in the generated Stan code. The disadvantage is that it only allows exponential quadratic kernels and Euclidean distances. The kernel of a GP is what transforms the distance between observations into a covariance matrix, effectively achieving the non-linear relations.
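To make the kernel idea concrete, here is a minimal sketch (with three made-up locations) of how an exponential quadratic kernel turns pairwise distances into a covariance matrix:

```r
## three hypothetical locations; coordinates are arbitrary
pts <- data.frame(lon = c(0, 1, 5), lat = c(0, 0, 0))
d <- as.matrix(dist(pts))   # Euclidean distances
sdgp <- 1                   # marginal sd of the GP
lscale <- 1                 # length-scale
K <- sdgp^2 * exp(-d^2 / (2 * lscale^2))
round(K, 3)
```

The two nearby locations end up with a covariance of about 0.61, while both pairs involving the distant third location get a covariance of essentially zero.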
To fit a GP to the Australian data with brms we can do the following:
bm_gp_1 <- brm(t ~ gp(longitude, latitude)
, data = phoibl_aus_wide_s
, family = bernoulli
, prior = c(prior(normal(0, 2), class = Intercept)
, prior(normal(0, 2), class = sdgp))
, normalize = FALSE
, iter = 2000
, warmup = 1000
, cores = 4
, chains = 4
, seed = 1234)
write_rds(bm_gp_1, "./gp-1.rds")
The model samples, but it has low ESS for the length-scale (lscale). This can sometimes happen with bad priors, or if there are additional confounds missing from the model specification. In this case, the posterior of the length-scale seems to be difficult to explore. For this tutorial it is not a serious issue, so we will work with these samples.
There are two parameters of interest in a GP: the length-scale (lscale) and the sd (sdgp). The lscale parameter controls how far apart two observations need to be before they cease to significantly influence each other. This parameter is on the scale of the distance. Since brms scales the x and y (longitude and latitude) values by default (this can be changed with scale = FALSE, but I do not recommend that), it is very difficult to interpret this parameter by just looking at it.
The alternative is to try to plot the correlation structure and see what is going on. This is what we will do here. First, we need to build the correlation matrix from the distance and parameters of the gp.
## Matérn 5/2 kernel, which we will use later
cov_mat52 <- function(d, lscale, sdgp) {
m52 <- (sqrt(5)*d)/lscale
sdgp^2 * (1 + m52 + (m52^2)/3)* exp(-m52)
}
## exponential quadratic
## this is the default in brms
cov_exp2 <- function(d, lscale, sdgp) {
sdgp^2 * exp(-d^2 / (2 * lscale^2))
}
lscale <- 0.23 ## the mean estimate
sdgp <- 2.87 ## the mean estimate
## brms does some scaling so that the maximum Eucl distance is 1
## to do this, we calculate Eucl distance
## and then divide lon and lat by the maximum
d_mat <- dist(tibble(lon = phoibl_aus_wide_s$longitude
, lat = phoibl_aus_wide_s$latitude)) %>%
as.matrix()
d_max <- (max(d_mat))
d_mat <- dist(tibble(lon = phoibl_aus_wide_s$longitude/d_max
, lat = phoibl_aus_wide_s$latitude/d_max)) %>%
as.matrix()
## we only want the spatial correlation
## so we transform the covariance to a correlation matrix
cov_mat <- cov2cor(cov_exp2(d_mat, lscale, sdgp))
round(cov_mat[1:5,1:5], 2)
## 1 2 3 4 5
## 1 1.00 0.57 0.69 0.91 0.68
## 2 0.57 1.00 0.92 0.58 0.54
## 3 0.69 0.92 1.00 0.69 0.61
## 4 0.91 0.58 0.69 1.00 0.57
## 5 0.68 0.54 0.61 0.57 1.00
Next, we need to turn the correlation matrix into a format that is fit for plotting, and which includes the locations of the observations. This is just some data wrangling.
cov_mat_df <-
cov_mat %>%
as.data.frame() %>%
as_tibble() %>%
set_names(phoibl_aus_wide_s$Glottocode) %>%
mutate(Glottocode = phoibl_aus_wide_s$Glottocode)
cov_mat_df_long <-
pivot_longer(cov_mat_df, -Glottocode) %>%
rowwise() %>%
mutate(name_2 = str_c(pmin(Glottocode, name), pmax(Glottocode, name), sep = "_")) %>%
select(name_2, value) %>%
separate(name_2, into = c("l1", "l2"), remove = FALSE) %>%
distinct() %>%
filter(l1 != l2) %>%
left_join(select(phoibl_aus_wide_s, Glottocode, longitude, latitude)
, by = c("l1" = "Glottocode")) %>%
left_join(select(phoibl_aus_wide_s, Glottocode, longitude, latitude)
, by = c("l2" = "Glottocode"))
cov_mat_df_long
## # A tibble: 4,950 × 8
## name_2 l1 l2 value longitude.x latitude.x longitude.y latitude.y
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 kala1377_yan… kala… yand… 0.570 142. -10.7 141. -28.5
## 2 kawa1290_yan… kawa… yand… 0.691 142. -16.1 141. -28.5
## 3 lowe1436_yan… lowe… yand… 0.912 135. -26.3 141. -28.5
## 4 jand1248_yan… jand… yand… 0.680 153. -27.6 141. -28.5
## 5 kala1380_yan… kala… yand… 0.966 144. -28.1 141. -28.5
## 6 punt1240_yan… punt… yand… 0.979 143 -27.3 141. -28.5
## 7 gure1255_yan… gure… yand… 0.737 151. -25.4 141. -28.5
## 8 gugu1253_yan… gugu… yand… 0.747 144. -18.5 141. -28.5
## 9 wira1265_yan… wira… yand… 0.855 134 -32 141. -28.5
## 10 ngar1296_yan… ngar… yand… 0.520 119. -20.2 141. -28.5
## # ℹ 4,940 more rows
Finally, we can plot this matrix.
ggplot() +
geom_sf(data = australia_shp)+
geom_segment(data = cov_mat_df_long
, aes(x = longitude.x, y = latitude.x
, xend = longitude.y, yend = latitude.y
, linewidth = (value))
, alpha = 0.2) +
geom_point(data = phoibl_aus_wide_s
, aes(x = longitude, y = latitude, shape = ifelse(t == 1, "yes", "no"))
, size = 2, color = "#E84646") +
scale_shape(name = "has /t/") +
scale_linewidth(range = c(0, 0.5))
Using Stan
We have now seen how to build a GP using brms. As mentioned above, brms has several drawbacks. To solve those, we will be using Stan directly. First, we will replicate the results we got from brms but with Stan.
model_gp_ceq <- "
functions {
matrix cov_exp_quad_custom(matrix D, real sdgp, real lscale) {
int N = dims(D)[1];
matrix[N, N] K;
for (i in 1:(N - 1)) {
K[i, i] = sdgp^2;
for (j in (i + 1):N) {
K[i, j] = sdgp^2 * exp((D[i,j]^2)/(-2*(lscale^2)));
K[j, i] = K[i, j];
}
}
K[N, N] = sdgp^2;
return add_diag(K, 1e-12);
}
}
data {
int N; // number of rows (observations)
matrix[N, N] dist_mat; // distance matrix between obs
array[N] int<lower=0,upper=1> Y; // dependent variable
}
parameters {
real intercept; // intercepts
real<lower=0> lscale; // correlation factor
real<lower=0> sdgp; // correlation factor
vector[N] zgp; // latent variable
}
transformed parameters {
vector[N] mu;
matrix[N, N] gp_mat = cholesky_decompose(cov_exp_quad_custom(dist_mat, sdgp, lscale));
mu = intercept + gp_mat * zgp;
}
model {
sdgp ~ normal(0, 2); // generic prior on sigma
intercept ~ normal(0, 2); // generic prior on intercept
lscale ~ inv_gamma(1.5, 0.057); // prior on lscale, can be hard to set, we use the one in brms
zgp ~ std_normal(); // non-centered, so needs to be Normal(0, 1)
Y ~ bernoulli_logit(mu); // actual likelihood
}"
m_gp_ceq <- stan_model(model_code = model_gp_ceq)
write_rds(m_gp_ceq, "./model-ceq.rds")
In the previous model we name the function cov_exp_quad_custom because Stan already contains a cov_exp_quad function. We do not use the built-in function because it takes the coordinates themselves and builds the distance matrix from them; we cannot pass a precomputed distance matrix to it.
We can now fit the model to the data.
data_ceq <- list(
N = nrow(phoibl_aus_wide_s)
, dist_mat = d_mat
, Y = phoibl_aus_wide_s$t
)
s_gp_ceq <- sampling(m_gp_ceq
, data = data_ceq
, iter = 2000
, warmup = 1000
, cores = 4
, chains = 4
, seed = 1234
, control = list(adapt_delta = 0.9))
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
print(s_gp_ceq, c("intercept", "lscale", "sdgp"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## intercept 0.98 0.06 1.32 -1.98 0.21 1.11 1.85 3.31 512 1.01
## lscale 0.23 0.01 0.17 0.04 0.08 0.19 0.33 0.64 189 1.02
## sdgp 2.90 0.03 0.94 1.45 2.20 2.78 3.46 5.11 1306 1.00
##
## Samples were drawn using NUTS(diag_e) at Tue Jul 16 15:28:55 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(bm_gp_1)
## Family: bernoulli
## Links: mu = logit
## Formula: t ~ gp(longitude, latitude)
## Data: phoibl_aus_wide_s (Number of observations: 100)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Gaussian Process Terms:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sdgp(gplongitudelatitude) 2.87 0.91 1.45 5.05 1.00 1171
## lscale(gplongitudelatitude) 0.23 0.17 0.04 0.62 1.03 161
## Tail_ESS
## sdgp(gplongitudelatitude) 2017
## lscale(gplongitudelatitude) 382
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.90 1.32 -2.07 3.21 1.01 466 1409
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
As you can see, the estimates are very close to the brms model estimates. I am not sure whether the remaining variation is due to the poor sampling performance or to some other reason.
Using Stan and some custom distance and kernel function
In both previous examples we used Euclidean distances and the exponential quadratic kernel. While both choices are acceptable, and even necessary in some cases, they are not the only ones. If you want to learn more about kernels you can read this wonderful explanation. For a deep dive into other distance metrics you can read [this paper](https://open-research-europe.ec.europa.eu/articles/3-104). For now, I will introduce the Matérn 5/2 kernel and geodesic distances.
Geodesic distances are the distances along the surface of the earth between two points. Since the earth is round, the shortest path along its surface is a curved line, not a straight line. These are rather easy to compute:
d_mat_geo <- geodist(tibble(lon = phoibl_aus_wide_s$longitude
, lat = phoibl_aus_wide_s$latitude), measure = "geodesic")
round(d_mat_geo[1:5, 1:5], 2)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.0 1977696.1 1384391.6 566378.2 1269947
## [2,] 1977696.1 0.0 601116.4 1871428.8 2210940
## [3,] 1384391.6 601116.4 0.0 1347870.1 1700065
## [4,] 566378.2 1871428.8 1347870.1 0.0 1797724
## [5,] 1269946.8 2210940.5 1700064.7 1797724.0 0
Because estimating the lscale parameter depends on the scale of the distances, we need to scale these to something manageable. We could take the brms route and divide by the largest distance, making all distances fall between 0 and 1. Alternatively, we can divide by a fixed number like 100000, which makes the distances easier to interpret (1 = 100 km).
d_mat_geo <- d_mat_geo/100000
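The particular divisor only changes the units in which lscale is expressed, not the model itself: dividing all distances by a constant c is equivalent to dividing lscale by c. A quick sketch to verify this with the Matérn 5/2 kernel (the distances and length-scales below are made up for illustration):

```r
## Matérn 5/2 correlation (sdgp fixed at 1)
m52 <- function(d, lscale) {
  x <- sqrt(5) * d / lscale
  (1 + x + x^2 / 3) * exp(-x)
}
d <- c(50000, 200000, 1000000)  # hypothetical distances in metres
c_scale <- 100000               # 1 unit = 100 km
## identical correlations, lscale just changes units
all.equal(m52(d, lscale = 250000), m52(d / c_scale, lscale = 2.5))
## TRUE
```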
Next, we want to rewrite the previous code to use the Matérn 5/2 covariance kernel.
model_gp_mat52 <- "
functions {
matrix cov_matern52(matrix D, real sdgp, real lscale) {
int N = dims(D)[1];
matrix[N, N] K;
for (i in 1:(N - 1)) {
K[i, i] = sdgp^2;
for (j in (i + 1):N) {
real x52 = (sqrt(5) * D[i,j])/lscale;
K[i, j] = sdgp^2 *
(1 + x52 + (x52^2)/3)* exp(-x52);
K[j, i] = K[i, j];
}
}
K[N, N] = sdgp^2;
return add_diag(K, 1e-12);
}
}
data {
int N; // number of rows (observations)
matrix[N, N] dist_mat; // distance matrix between obs
array[N] int<lower=0,upper=1> Y; // dependent variable
}
parameters {
real intercept; // intercepts
real<lower=0> lscale; // correlation factor
real<lower=0> sdgp; // correlation factor
vector[N] zgp; // latent variable
}
transformed parameters {
vector[N] mu;
matrix[N, N] cov_mat = cov_matern52(dist_mat, sdgp, lscale);
matrix[N, N] gp_mat = cholesky_decompose(cov_mat);
mu = intercept + gp_mat * zgp;
}
model {
sdgp ~ normal(0, 2); // generic prior on sigma
intercept ~ normal(0, 2); // generic prior on intercept
lscale ~ inv_gamma(4, 4); // prior on lscale
zgp ~ std_normal(); // non-centered, so needs to be Normal(0, 1)
Y ~ bernoulli_logit(mu); // actual likelihood
}"
m_gp_mat52 <- stan_model(model_code = model_gp_mat52)
Pretty easy. Now for the sampling.
data_mat52 <- list(
N = nrow(phoibl_aus_wide_s)
, dist_mat = d_mat_geo
, Y = phoibl_aus_wide_s$t
)
s_gp_mat52 <- sampling(m_gp_mat52
, data = data_mat52
, iter = 1000
, warmup = 500
, cores = 4
, chains = 4
, seed = 1234
, control = list(adapt_delta = 0.8))
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
The ESS is low, but for this tutorial it does not matter much.
print(s_gp_mat52, c("intercept", "lscale", "sdgp"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## intercept 1.71 0.03 0.93 -0.04 1.09 1.69 2.30 3.63 872 1.00
## lscale 3.27 0.18 2.35 1.33 1.97 2.55 3.61 10.01 166 1.02
## sdgp 3.28 0.04 1.02 1.61 2.56 3.15 3.86 5.60 656 1.00
##
## Samples were drawn using NUTS(diag_e) at Tue Jul 16 15:30:25 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(s_gp_ceq , c("intercept", "lscale", "sdgp"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## intercept 0.98 0.06 1.32 -1.98 0.21 1.11 1.85 3.31 512 1.01
## lscale 0.23 0.01 0.17 0.04 0.08 0.19 0.33 0.64 189 1.02
## sdgp 2.90 0.03 0.94 1.45 2.20 2.78 3.46 5.11 1306 1.00
##
## Samples were drawn using NUTS(diag_e) at Tue Jul 16 15:28:55 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The model samples very well, much better than the previous model. Comparing the lscale and sdgp parameters does not necessarily make sense because of the different distance scales, but we do see a change in the intercept. This shows that the choice of spatial component can affect model results.
We can now visualize the spatial correlation structure.
lscale_2 <- 3.27
sdgp_2 <- 3.28
cov_mat_geo <- cov2cor(cov_mat52(d_mat_geo, lscale_2, sdgp_2))
cov_mat_geo_df <-
cov_mat_geo %>%
as.data.frame() %>%
as_tibble() %>%
set_names(phoibl_aus_wide_s$Glottocode) %>%
mutate(Glottocode = phoibl_aus_wide_s$Glottocode)
cov_mat_geo_df_long <-
pivot_longer(cov_mat_geo_df, -Glottocode) %>%
rowwise() %>%
mutate(name_2 = str_c(pmin(Glottocode, name), pmax(Glottocode, name), sep = "_")) %>%
select(name_2, value) %>%
separate(name_2, into = c("l1", "l2"), remove = FALSE) %>%
distinct() %>%
filter(l1 != l2) %>%
left_join(select(phoibl_aus_wide_s, Glottocode, longitude, latitude)
, by = c("l1" = "Glottocode")) %>%
left_join(select(phoibl_aus_wide_s, Glottocode, longitude, latitude)
, by = c("l2" = "Glottocode"))
ggplot() +
geom_sf(data = australia_shp)+
geom_point(data = phoibl_aus_wide_s
, aes(x = longitude, y = latitude, shape = ifelse(t == 1, "yes", "no"))
, size = 2, color = "#E84646") +
geom_segment(data = cov_mat_geo_df_long
, aes(x = longitude.x, y = latitude.x
, xend = longitude.y, yend = latitude.y
, linewidth = (value))
, alpha = 0.2) +
scale_shape(name = "has /t/") +
scale_linewidth(range = c(0, 0.5))
There are two interesting things to observe in this plot. First, the spatial correlation at longer distances is weaker in the Matérn 5/2 model. In the previous model we saw a significant correlation between almost all observations, while in this model that correlation quickly decays to zero. The second observation is that the maximum correlation is also somewhat weaker in this model.
This is an extremely important point: the choice in spatial structure (covariance kernel, distance metric, priors) can have a very deep impact on your results. You need to think very carefully about these issues.
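To see how much the kernel choice alone can matter, we can compare the two correlation functions at the same length-scale; the distances and the length-scale of 2 here are arbitrary illustration values, not model estimates.

```r
## exponential quadratic and Matérn 5/2 correlations (sdgp = 1)
expq  <- function(d, lscale) exp(-d^2 / (2 * lscale^2))
mat52 <- function(d, lscale) {
  x <- sqrt(5) * d / lscale
  (1 + x + x^2 / 3) * exp(-x)
}
d <- c(1, 2, 4, 8)  # distances in arbitrary units
round(rbind(exp_quad = expq(d, 2), matern52 = mat52(d, 2)), 3)
##           [,1]  [,2]  [,3]  [,4]
## exp_quad 0.882 0.607 0.135 0.000
## matern52 0.829 0.524 0.139 0.005
```

At short range the exponential quadratic kernel keeps correlations slightly higher, but its tail dies off much faster; the Matérn 5/2 kernel retains a small but non-zero correlation at long range.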
Now, how should we think of the lscale? This is always a difficult question, but luckily we can get a sense of it by evaluating the kernel at different distances and seeing what we get. For this experiment we set sdgp to 1 so that we are dealing with correlations rather than covariances.
tibble(
`distance (100km)` = c(seq(0.1, 1, length.out = 10), 1:10),
correlation = round(cov_mat52(c(seq(0.1, 1, length.out = 10), 1:10), lscale_2, 1), 3)
)
## # A tibble: 20 × 2
## `distance (100km)` correlation
## <dbl> <dbl>
## 1 0.1 0.999
## 2 0.2 0.997
## 3 0.3 0.993
## 4 0.4 0.988
## 5 0.5 0.981
## 6 0.6 0.973
## 7 0.7 0.964
## 8 0.8 0.953
## 9 0.9 0.941
## 10 1 0.928
## 11 1 0.928
## 12 2 0.762
## 13 3 0.573
## 14 4 0.404
## 15 5 0.272
## 16 6 0.177
## 17 7 0.112
## 18 8 0.069
## 19 9 0.042
## 20 10 0.025
Recall that the distance is in units of 100 km, meaning that 0.1 is 10 km. According to this, at an lscale of 3.27 (as per the model), two languages in Australia 10 km apart are expected to be extremely similar. At 100 km the similarity barely drops. At 500 km apart, the correlation is already only about 0.27, and at 1000 km the correlation is effectively 0. In my experience this is a reasonable range for the lscale. If yours is giving you extremely strong correlations at 10,000 km, something is likely very wrong.
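Another way to summarize a fitted lscale is to ask at what distance the correlation drops below some threshold, say 0.5. A sketch using uniroot with the posterior mean lscale of 3.27 from the Matérn 5/2 model (distances in units of 100 km):

```r
## Matérn 5/2 correlation (sdgp = 1)
mat52_cor <- function(d, lscale) {
  x <- sqrt(5) * d / lscale
  (1 + x + x^2 / 3) * exp(-x)
}
lscale_hat <- 3.27  # posterior mean from the Matérn 5/2 model
## distance (in 100 km units) at which the correlation falls to 0.5
half_d <- uniroot(function(d) mat52_cor(d, lscale_hat) - 0.5,
                  interval = c(0.01, 100))$root
half_d * 100  # in km
```

With this lscale, the correlation halves at roughly 340 km, consistent with the table above.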
Visualizing spatial effects
The final piece I will discuss here is how to visualize the spatial structure of the model. To do this we need to build a grid of points for the region and make predictions for each point. We can then display the predictions the model makes for those points.
The first step is to define a grid of points.
max_lat <- max(phoibl_aus_wide_s$latitude) +1
max_lon <- max(phoibl_aus_wide_s$longitude) +1
min_lat <- min(phoibl_aus_wide_s$latitude) -1
min_lon <- min(phoibl_aus_wide_s$longitude)-1
seq_lon <- seq(min_lon, max_lon, by = 0.3)
seq_lat <- seq(min_lat, max_lat, by = 0.3)
new_obs <- tibble(
lon = rep(seq_lon, length(seq_lat))
, lat = rep(seq_lat, each = length(seq_lon)))
new_obs
## # A tibble: 13,800 × 2
## lon lat
## <dbl> <dbl>
## 1 113. -39.6
## 2 114. -39.6
## 3 114. -39.6
## 4 114. -39.6
## 5 114. -39.6
## 6 115. -39.6
## 7 115. -39.6
## 8 115. -39.6
## 9 116. -39.6
## 10 116. -39.6
## # ℹ 13,790 more rows
Next, we calculate the geodesic distance from each point in the grid to each observation we used in the model.
d_mat_geo_cross <- geodist(
new_obs
, tibble(lon = phoibl_aus_wide_s$longitude
, lat = phoibl_aus_wide_s$latitude)
, measure = "geodesic"
)
d_mat_geo_cross <- d_mat_geo_cross/100000
Notice we also divide the new distances by 100000. Next, we need a function that calculates the new covariance structure so we can make predictions. This is relatively complex to understand, but it works. As you can see, part of the function is commented out: it is the part that adds the posterior sd of the spatial structure to the predictions. Leaving it out is a practical choice, and since we only plot the posterior mean rather than the posterior sd, the results are almost indistinguishable.
You can give it a try if you want, with new_dist_full being equal to geodist(new_obs)/100000. It will take forever because rmvnorm is very slow on large matrices.
## new_dist_full <- geodist(new_obs, measure = "geodesic")/100000
predictor_gp <- function(new_dist_cross
, yL
, cov_fun
, pars
, L_Sigma
## , new_dist_full
) {
L_Sigma_inverse <- solve(L_Sigma)
K_div_yL <- L_Sigma_inverse %*% yL
K_div_yL <- t(t(K_div_yL) %*% L_Sigma_inverse)
k_x_x_new <- t(cov_fun(d = new_dist_cross, lscale = pars$lscale, sdgp = pars$sdgp))
mu_yL_new <- as.numeric(t(k_x_x_new) %*% K_div_yL)
mu_yL_new
## lx_new <- nrow(new_dist_full)
## v_new <- L_Sigma_inverse %*% k_x_x_new
## cov_yL_new <- cov_fun(d = new_dist_full, lscale = pars$lscale, sdgp = pars$sdgp) -
## t(v_new) %*% v_new + diag(rep(1e-12, lx_new), lx_new, lx_new)
## yL_new <- mvtnorm::rmvnorm(1, mean = mu_yL_new, sigma = cov_yL_new)
## yL_new
}
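The algebra inside predictor_gp is easier to follow in a toy one-dimensional setting (all data here are made up): the predictive mean at new locations is k_new' K^{-1} y, computed via the Cholesky factor of K.

```r
## toy 1-D GP conditional mean, mirroring the algebra in predictor_gp
m52 <- function(d, lscale, sdgp) {
  x <- sqrt(5) * d / lscale
  sdgp^2 * (1 + x + x^2 / 3) * exp(-x)
}
x_obs <- c(0, 1, 2); y_obs <- c(0, 1, 0)  # hypothetical observations
x_new <- c(0, 0.5, 1.5)                   # prediction locations
K     <- m52(as.matrix(dist(x_obs)), lscale = 1, sdgp = 1) + diag(1e-12, 3)
k_new <- m52(abs(outer(x_obs, x_new, "-")), lscale = 1, sdgp = 1)
L       <- t(chol(K))                     # lower-triangular Cholesky factor
K_inv_y <- solve(t(L), solve(L, y_obs))   # K^{-1} y via two triangular solves
mu_new  <- as.numeric(t(k_new) %*% K_inv_y)
round(mu_new, 3)
```

Because the toy data are noise-free, the predictive mean at x = 0 reproduces the observed value there; the other two predictions are smooth interpolations between the observations.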
Then, we need a prediction function with the model structure.
predict_model <- function(old_dist
, new_dist_cross
, cov_fun
, n_samples
, samples) {
pars <- extract(samples)
s_lscale <- pars$lscale
s_sdgp <- pars$sdgp
s_intercept <- pars$intercept
s_zgp <- pars$zgp
N <- nrow(new_dist_cross)
n_samples2 <- sample(1:nrow(s_lscale), n_samples)
sapply(n_samples2, function(j) {
tmp_mat_r <- cov_fun(d = old_dist, lscale = s_lscale[j], sdgp = s_sdgp[j])
diag(tmp_mat_r) <- diag(tmp_mat_r)+1e-12
s_GP_mat <- t(chol(tmp_mat_r))
mu <- s_intercept[j] +
predictor_gp(
new_dist_cross = new_dist_cross,
yL = s_GP_mat %*% s_zgp[j,],
cov_fun = cov_fun,
pars = list(
sdgp = s_sdgp[j]
, lscale = s_lscale[j]
),
L_Sigma = s_GP_mat
)
## this is an alternative if you want to include posterior uncertainty
## sapply(brms::inv_logit_scaled(mu), function(p) rbinom(1,1,p))
brms::inv_logit_scaled(mu)
})
}
preds_52 <- predict_model(
old_dist = d_mat_geo
, new_dist_cross = d_mat_geo_cross
, cov_mat52
, 100 ## number of posterior predictions
, s_gp_mat52)
## we take the mean for each location for all 100 predictions
preds_52_mean <- rowMeans(preds_52)
With this, we can now make a conditional effect plot map.
ggplot() +
geom_sf(data = australia_shp) +
geom_tile(data = mutate(new_obs, pred = preds_52_mean)
, aes(x = lon, y = lat, fill = pred)) +
geom_point(data = phoibl_aus_wide_s
, aes(x = longitude, y = latitude, shape = ifelse(t == 1, "yes", "no"))
, size = 4
, color = "#E84646") +
scale_shape(name = "has /t/")
## Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
In this case, the spatial conditional effects are exactly what we would expect. The regions without /t/ are predicted to have a low probability of /t/ and the other way around for regions with /t/.
We can do the same with brms; however, because brms does include all posterior uncertainty, we need the following trick.
## split into groups for predictions
## predicting everything at the same time takes forever and crashes
groups_pred <- split((1:nrow(new_obs)), ceiling(seq_along((1:nrow(new_obs)))/100))
preds_bm <- lapply(groups_pred, function(gr) {
tmp_preds <- predict(bm_gp_1
, newdata = rename(new_obs, latitude = lat, longitude = lon)[gr,], ndraws = 100)
tmp_preds[,1]
})
preds_bm_1 <- unlist(preds_bm)
And then we plot it. You can notice two things. First, we see the posterior uncertainty in the brms predictions, and second, the areal patterns are smoother and less sharp than those we find with the Matérn 5/2 kernel.
ggplot() +
geom_sf(data = australia_shp) +
geom_tile(data = mutate(new_obs, pred = preds_bm_1)
, aes(x = lon, y = lat, fill = pred)) +
geom_point(data = phoibl_aus_wide_s
, aes(x = longitude, y = latitude, shape = ifelse(t == 1, "yes", "no"))
, size = 4, color = "#E84646") +
scale_shape(name = "has /t/")
That is it for this tutorial.