1. Introduction

This document outlines the steps taken to split the Lake Ioannina MIS 7-9 diatom record into diatom assemblage zones, and it contains the R code used to do so. The reason for splitting the record into zones is to delineate sections of the record that share a similar diatom assemblage, helping to identify major changes in the record and making it easier to describe (the results can be written up by zone).

Clustering algorithm choice

A constrained cluster analysis is required in order to group samples together that are stratigraphically adjacent. The specific cluster analysis chosen is the Constrained Incremental Sum of Squares (CONISS) method, as outlined by Grimm (1987).

First, a matrix of dissimilarities between samples is computed. The CONISS algorithm then computes a statistic known as the sum of squares* between each pair of adjacent samples (each sample can be considered a cluster of just one sample at this stage). The pair with the smallest sum of squares is joined into a cluster, and then the sum of squares is recalculated for all samples with these newly joined samples receiving one sum of squares value for their cluster. The clusters with the smallest sum of squares is joined and the sum of squares recalculated. This process continues, clustering samples into successively larger groups (it is therefore an agglomerative technique).

*The sum of squares is the squared difference between the value of a taxon in one sample of a cluster divided by the average value of that taxon across all samples in that cluster, which is then summed for each taxon, which is then summed for each sample in the cluster.

2. Import packages and data

library(tidyverse)
library(vegan) # designdist, decostand
library(rioja) # chclust
library(ggdendro) # dendro_data

source("scripts/borders-for-ggplot2.R") # custom theme to remove borders

imported_counts <- read.csv("data/csv/imported-counts.csv")
taxa_life_modes <- read.csv("data/csv/imported-taxa_life_modes.csv") %>%
  transmute(
    taxon = taxon,
    type = as.factor(type)
  )

3. Prepare Data

The following steps are taken to prepare the data specifically for the analyses in this document:

Remove samples with no diatoms and an outlier, which contains some diatoms but is poorly preserved.
Discard rare taxa. These analyses are performed on taxa that are present at ≥4% in at least one sample. This decision was taken based on the following review of the literature.

Gordon and Birks (1972) suggest that the cluster analysis should only be run on taxa present at ≥5% in at least one sample as the low abundance taxa are of little numerical importance. Grimm (1987) eliminated those present at <3% at every level and noted that eliminating rare taxa has little effect on the analysis. Bennett (1996) also uses only taxa present at >5% in at least one sample and then goes on to assess the effect of decreasing and increasing the threshold for taxa inclusion. In an example using the CONISS algorithm, Bennett (1996) identified 6 zones with the threshold set at 0-5%, 1%, 2% or 5%, and identified 5 zones with the threshold set at 10% or 20%—a difference of only one zone.

Eliminating the rare taxa doesn’t seem to make much difference, however Birks (1986) explains that although rare taxa could be of ecological importance, the counts of rare taxa are associated with a high relative error so are poorly estimated numerically unless very large counts are made.

Obviously the work cited so far is quite old so I have looked around to see if there has been any progression on this. Birks (2012:357) states that the basic principles “remain largely unchanged” since they were established in Gordon and Birks (1972) and Birks and Gordon (1985).
Convert raw diatom valve counts into relative proportions (out of 1.0).
Square-root transform the data.

# Get names of taxa present at ≥4%.
abundant_taxa <- imported_counts %>%
  column_to_rownames("depth") %>%
  decostand(method = "total", na.rm = TRUE) %>%
  select_if(~any(. >= 0.04)) %>%
  select(-contains("spp")) %>%
  colnames()

# Filter and transform data.  
coniss_data <- imported_counts %>%
  filter(rowSums(select(., !matches("depth"))) > 0 & depth != 175.62) %>%
  column_to_rownames("depth") %>%
  select(all_of(abundant_taxa)) %>%
  decostand(method = "hellinger", na.rm = "TRUE") # sqrt of rel. proportions

4. Cluster analysis

Creating the zones involves three steps:

Calculate a distance matrix.
Run the cluster analysis.
Determine the number of significant zones.

First, a distance matrix of the squared Euclidean distance between samples is calculated. This dissimilarity index is the one used by the Tilia program. The CONISS cluster analysis is then performed on this distance matrix.

A broken stick model (Bennett, 1996) is used to determine the number of significant zones in the record. The plot in Figure 1 shows that after splitting the record into 12 groups (zones), the observed reduction in within-group sum of squares (black line) drops (and remains) below that expected from the broken stick model (red line). Therefore, there are 12 significant zones in this record.

dist_matrix <- designdist(coniss_data,
                          method = "A+B-2*J",
                          terms = "quadratic")

coniss_results <- chclust(dist_matrix, method = "coniss")

bstick(coniss_results, ng = 30)

Comparison of the CONISS results with a broken stick model. The observed reduction in within-group sum of squares is represented by the black line and the model is represented by the red line.

Figure 1: Comparison of the CONISS results with a broken stick model. The observed reduction in within-group sum of squares is represented by the black line and the model is represented by the red line.

5. Plot results

The results of the cluster analysis can be plotted as a dendrogram (Figure 2).

# Extract data for plotting
ddata<- dendro_data(coniss_results, type = "rectangle")
# Modify x values so leaves are plotted by depth in core rather than in 
# sequential order
new_x <- approxfun(ddata$labels$x, 
                   as.numeric(as.character(ddata$labels$label))) 
ddata$segments$x <- new_x(ddata$segments$x)
ddata$segments$xend <- new_x(ddata$segments$xend)

# Create plot
ggplot(segment(ddata)) +
    geom_vline(xintercept = c(156.02, 163.22, 177.62, 190.42, 199.22, 215.22, 
                              233.62, 253.62, 258.42, 275.22, 279.22),
               colour = "lightgrey") +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  coord_flip() +
  scale_y_continuous() +
  scale_x_reverse(breaks = seq(135, 285, 5)) +
  labs(x = "Depth (m)",
       y = "Total sum of squares") +
  theme_bw(10) +
  theme(aspect.ratio = 3,
        legend.position = "none",
        panel.grid = element_blank(),
        panel.border = theme_border(c("left", "bottom")),
        panel.background = element_rect(fill = "transparent"),
        plot.background = element_rect(fill = "transparent")) +
  geom_hline(yintercept = 14,
               linetype = 2)

Dendrogram of CONISS results. The vertical dashed line indicates the total sum of squares value that splits the record into the 12 significant zone. The horizonal grey lines delineate the resulting zones.

Figure 2: Dendrogram of CONISS results. The vertical dashed line indicates the total sum of squares value that splits the record into the 12 significant zone. The horizonal grey lines delineate the resulting zones.

References

Bennett, K. D. (1996) Determination of the number of zones in a biostratigraphical sequence. New Phytologist, 132, 155–170. https://doi.org/10.1111/j.1469-8137.1996.tb04521.x

Birks, H. J. B. (1986) Numerical zonation, comparison and correlation of Quaternary pollen-stratigraphical data. In Berglund, B. E. (ed) Handbook of Palaeoecology and Palaeohydrology. Chichester: John Wiley & Sons.

Birks, H. J. B. (2012) Analysis of stratigraphical data. In Birks, H. J. B., Lotter, A. F., Juggins, S. & Smol, J. P. (eds) Tracking Environmental Change using Lake Sediments. Volume 5: Data Handling and Numerical Techniques. Dordrecht: Springer. https://doi.org/10.1007/978-94-007-2745-8_11

Birks, H. J. B. & Gordon, A. D. (1985) Numerical methods in Quaternary pollen analysis. London: Academic Press.

Gordon, A. D. & Birks, H. J. B. (1972) Numerical methods in Quaternary palaeoecology. I. Zonation of pollen diagrams. New Phytologist, 71, 961–979. https://doi.org/10.1111/j.1469-8137.1972.tb01976.x

Grimm, E. C. (1987) CONISS: A Fortran 77 program for stratigraphically constrained cluster analysis by the method of incremental sum of squares. Computers & Geosciences, 13, 13–35. https://doi.org/10.1016/0098-3004(87)90022-7

Grimm, E. C. (2011) Tilia version 1.7.16 [Software]. Illinois State Museum, Springfield. https://www.tiliait.com