How do we scientifically estimate trends in wastewater time series?
How can we detect deviations from the trend at new locations during times of concern?
How can we ensure our methods are easy for others to apply?
Data processing steps
identify observations below the limit of detection and handle them statistically
align all observations to Mondays
transform copies per L to a log10 scale
average replicates to give one weekly measurement per week per location
only use locations where the primary WWTP has at least 85% coverage and that have observations within 1 month of the last date in the series
ensure there is a row for each week, even if the observation is missing
create an indicator of missing values
remove irrelevant features/variables (a minimal code sketch of these steps follows)
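A minimal dplyr/tidyr sketch of these steps, assuming a hypothetical raw data frame raw with columns date, location, copies_per_L, and lod (illustrative names, not the project's actual schema; the location-level filters above are not shown, and a common LOD/\(\sqrt{2}\) substitution stands in for the project's statistical handling of values below the limit of detection):

Code
## sketch only: `raw` and its column names are placeholders
library(dplyr)
library(tidyr)
library(lubridate)

weekly <- raw %>%
  # substitute values below the limit of detection (LOD/sqrt(2) convention
  # stands in here for the statistical handling used in the project)
  mutate(copies_per_L = if_else(copies_per_L < lod, lod / sqrt(2), copies_per_L)) %>%
  # align all observations to the Monday of their week
  mutate(date = floor_date(date, unit = "week", week_start = 1)) %>%
  # transform copies per L to the log10 scale
  mutate(value = log10(copies_per_L)) %>%
  # average replicates to give one weekly measurement per week per location
  group_by(location, date) %>%
  summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # ensure there is a row for each week, even if the observation is missing
  complete(location, date = seq(min(date), max(date), by = "week")) %>%
  # create an indicator of missing values, then drop irrelevant variables
  mutate(ts_missing = is.na(value)) %>%
  select(date, location, value, ts_missing)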
# A tibble: 475 × 6
   dates      name           value ts_missing name_orig        colors 
   <date>     <chr>          <dbl> <lgl>      <chr>            <chr>  
 1 2021-05-24 WWTP            3.40 FALSE      69TH STREET      #332288
 2 2021-05-24 Lift station C  3.42 FALSE      AFTON VILLAGE    #AA4499
 3 2021-05-24 Lift station D  4.11 FALSE      HEIGHTS          #DDCC77
 4 2021-05-24 Lift station A  3.92 FALSE      CLINTON DRIVE #2 #44AA99
 5 2021-05-24 Lift station B  3.87 FALSE      DE PRIEST        #88CCEE
 6 2021-05-31 WWTP            4.14 FALSE      69TH STREET      #332288
 7 2021-05-31 Lift station C  3.56 FALSE      AFTON VILLAGE    #AA4499
 8 2021-05-31 Lift station D    NA TRUE       HEIGHTS          #DDCC77
 9 2021-05-31 Lift station A    NA TRUE       CLINTON DRIVE #2 #44AA99
10 2021-05-31 Lift station B    NA TRUE       DE PRIEST        #88CCEE
11 2021-06-07 WWTP            3.70 FALSE      69TH STREET      #332288
12 2021-06-07 Lift station C  3.76 FALSE      AFTON VILLAGE    #AA4499
13 2021-06-07 Lift station D  3.79 FALSE      HEIGHTS          #DDCC77
14 2021-06-07 Lift station A    NA TRUE       CLINTON DRIVE #2 #44AA99
15 2021-06-07 Lift station B  3.61 FALSE      DE PRIEST        #88CCEE
16 2021-06-14 WWTP            3.64 FALSE      69TH STREET      #332288
17 2021-06-14 Lift station C    NA TRUE       AFTON VILLAGE    #AA4499
18 2021-06-14 Lift station D    NA TRUE       HEIGHTS          #DDCC77
19 2021-06-14 Lift station A    NA TRUE       CLINTON DRIVE #2 #44AA99
20 2021-06-14 Lift station B  3.82 FALSE      DE PRIEST        #88CCEE
# ℹ 455 more rows
Observation variance \(\sigma_v^2\): sampling and lab variation
State variance \(\sigma_w^2\): inherent process variability
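Reading the structure off the matrices in KFAS_state_space_spline below (our transcription of that code, not an additional model), the smoothing spline trend model is

\[
y_t = \mu_t + v_t, \qquad v_t \sim N(0, \sigma_v^2),
\]
\[
\mu_t = 2\mu_{t-1} - \mu_{t-2} + w_t, \qquad w_t \sim N(0, \sigma_w^2),
\]

where \(y_t\) is the observed \(\log_{10}\) concentration in week \(t\) and \(\mu_t\) is the latent trend.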
State space models
Pros
Can specify many model structures
Separates process variability from observation error
Handles missing data
Many R packages available
Cons
Can be difficult to specify model structure
R packages vary in ease of use for applied researchers
Software
KFAS package
“Convenience” wrapper functions:
Code
KFAS_state_space_spline <- function(ts_obs, name, ts.missing, ts_dates, init_par){
  ## requires: library(KFAS); library(dplyr)  (for KFS, rstandard, bind_rows)
  ## Specify model structure (smoothing spline trend in state space form)
  A   <- matrix(c(1, 0), 1)         # observation matrix: y_t = mu_t + v_t
  Phi <- matrix(c(2, 1, -1, 0), 2)  # transition: mu_t = 2*mu_{t-1} - mu_{t-2} + w_t
  mu1 <- matrix(0, 2)               # initial state mean
  P1  <- diag(1, 2)                 # initial state covariance
  v   <- matrix(NA)                 # observation variance sigma_v^2, to be estimated
  R   <- matrix(c(1, 0), 2, 1)      # state disturbance loading
  w   <- matrix(NA)                 # state variance sigma_w^2, to be estimated

  # function for updating the variance parameters during optimization
  update_model <- function(pars, model) {
    model["H"][1] <- pars[1]
    model["Q"][1] <- pars[2]
    model
  }
  # check that variances are non-negative
  check_model <- function(model) {
    (model["H"] > 0 && min(model["Q"]) > 0)
  }

  # Specify the model
  mod <- KFAS::SSModel(ts_obs ~ -1 + SSMcustom(Z = A, T = Phi, R = R, Q = w,
                                               a1 = mu1, P1 = P1), H = v)
  # Fit the model by maximum likelihood
  fit_mod <- KFAS::fitSSM(mod, inits = init_par, method = "BFGS",
                          updatefn = update_model, checkfn = check_model,
                          hessian = TRUE, control = list(trace = FALSE, REPORT = 1))

  ## Format for output: run the Kalman filter/smoother once and reuse the results
  kfs <- KFS(fit_mod$model)
  ts_len <- length(ts_obs)
  smoothers <- data.frame(est = kfs$alphahat[, 1],
                          lwr = kfs$alphahat[, 1] - 1.96 * sqrt(kfs$V[1, 1, ]),
                          upr = kfs$alphahat[, 1] + 1.96 * sqrt(kfs$V[1, 1, ]),
                          ts_missing = ts.missing,
                          name = rep(name[1], times = ts_len),
                          fit = rep("smoother", times = ts_len),
                          date = ts_dates,
                          sigv = rep(fit_mod$optim.out$par[1], times = ts_len),
                          sigw = rep(fit_mod$optim.out$par[2], times = ts_len),
                          obs = ts_obs,
                          resid = rstandard(kfs, type = "recursive"),
                          conv = fit_mod$optim.out$convergence)
  filters <- data.frame(est = kfs$att[, 1],
                        lwr = kfs$att[, 1] - 1.96 * sqrt(kfs$Ptt[1, 1, ]),
                        upr = kfs$att[, 1] + 1.96 * sqrt(kfs$Ptt[1, 1, ]),
                        ts_missing = ts.missing,
                        name = rep(name[1], times = ts_len),
                        fit = rep("filter", times = ts_len),
                        date = ts_dates,
                        sigv = rep(fit_mod$optim.out$par[1], times = ts_len),
                        sigw = rep(fit_mod$optim.out$par[2], times = ts_len),
                        obs = ts_obs,
                        resid = rstandard(kfs, type = "recursive"),
                        conv = fit_mod$optim.out$convergence)
  ## combine it all for output
  kfas_fits_out <- list(fits = bind_rows(smoothers, filters), model = fit_mod$model)
  return(kfas_fits_out)
}
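A hypothetical call, assuming the processed weekly tibble shown earlier is stored as dat and that c(0.1, 0.01) is a reasonable pair of starting variances (both assumptions are ours, for illustration):

Code
## sketch only: `dat` and the starting values are placeholders
wwtp <- dplyr::filter(dat, name == "WWTP")
wwtp_fit <- KFAS_state_space_spline(ts_obs = wwtp$value,
                                    name = wwtp$name,
                                    ts.missing = wwtp$ts_missing,
                                    ts_dates = wwtp$dates,
                                    init_par = c(0.1, 0.01))  # starting (sigma_v^2, sigma_w^2)
head(dplyr::filter(wwtp_fit$fits, fit == "smoother"))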
Code
KFAS_rolling_estimation <- function(init_vals_roll, ts_obs_roll, ts_name_roll,
                                    dates_roll, init.par_roll, ts.missing_roll){
  ## perform initial fit on "burn-in" of first init_vals_roll time points
  model_out <- KFAS_state_space_spline(ts_obs = ts_obs_roll[1:init_vals_roll],
                                       name = ts_name_roll,
                                       ts.missing = ts.missing_roll[1:init_vals_roll],
                                       ts_dates = dates_roll[1:init_vals_roll],
                                       init_par = init.par_roll)
  ## just keep filter estimates for dates in the burn-in; the smoother need not be kept
  fits_rolling <- dplyr::filter(model_out$fits,
                                date %in% dates_roll[1:init_vals_roll],
                                fit == "filter")
  ## use variance estimates from the burn-in fit to initialize the model for the next time point
  next.par <- c(fits_rolling$sigv[init_vals_roll], fits_rolling$sigw[init_vals_roll])

  ## perform rolling estimation for each remaining time point
  for(i in (init_vals_roll + 1):length(ts_obs_roll)){
    # use only data up to the current time point
    ts_partial <- ts_obs_roll[1:i]
    # fit the model through time point i
    ith_fit_out <- KFAS_state_space_spline(ts_obs = ts_partial,
                                           name = ts_name_roll,
                                           ts.missing = ts.missing_roll[1:i],
                                           ts_dates = dates_roll[1:i],
                                           init_par = next.par)
    fitmod <- ith_fit_out$model
    ith_fit <- ith_fit_out$fits
    # save results of model fit
    if(exists("ith_fit")){
      fits_rolling <- rbind(fits_rolling,
                            dplyr::filter(ith_fit, date == dates_roll[i], fit == "filter"))
      # get updated variance estimates for observation and state
      next.par <- c(ith_fit$sigv[nrow(ith_fit)], ith_fit$sigw[nrow(ith_fit)])
      ## compute smoother at final time point
      if(i == length(ts_obs_roll)){
        fits_rolling <- rbind(fits_rolling, dplyr::filter(ith_fit, fit == "smoother"))
      }
      rm(ith_fit)
    } else {
      ## I don't know how to do error handling, feel free to do a pull request
      print(rep("FAIL", times = 100))
    }
  }
  ## give the user an update once each series' estimation is complete
  print(paste("Model fit complete: ", ts_name_roll[1]))
  output <- list(fits_rolling = fits_rolling, model = fitmod)
  return(output)
}
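A hypothetical rolling fit for the same series, using the first 10 observations as the burn-in (consistent with the "at least 10 + \(n\) observations" input noted in the summary at the end):

Code
## sketch only: reuses the placeholder `wwtp` data frame from above
roll_wwtp <- KFAS_rolling_estimation(init_vals_roll = 10,
                                     ts_obs_roll = wwtp$value,
                                     ts_name_roll = wwtp$name,
                                     dates_roll = wwtp$dates,
                                     init.par_roll = c(0.1, 0.01),
                                     ts.missing_roll = wwtp$ts_missing)
head(roll_wwtp$fits_rolling)  # one filter estimate per date after the burn-in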
Time series modeling assumption
The model's residuals should show no autocorrelation.
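One quick check, assuming the hypothetical wwtp_fit from the sketch above; the standardized recursive residuals returned by the wrapper should look like white noise:

Code
## sketch of a residual diagnostic for a fitted series
res <- dplyr::filter(wwtp_fit$fits, fit == "smoother")$resid
acf(res, na.action = na.pass, main = "ACF of standardized recursive residuals")
Box.test(na.omit(res), lag = 10, type = "Ljung-Box")  # small p-value signals leftover autocorrelation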
Do lift station series give different information than the WWTP?
custom function ww_ewma to compute standardized difference
## creates an EWMA chart for the series you input; make sure to remove the burn-in period
## region1 is the lift station, region2 is the WWTP
## requires: library(dplyr); library(ggplot2); library(ggnewscale); library(qcc)
ww_ewma <- function(region1, region2, title.char,
                    ylab.label = "Standardized Difference in Series", events = NULL){
  ## observed series may have missing values; use the estimated value from the
  ## WWTP state space model (est.y after the join) as the imputed value
  region1[region1$ts_missing, "obs"] <- left_join(region1[region1$ts_missing, ],
                                                  region2, by = "date") %>%
    pull(est.y)
  # if you have enough data to model region1 you can fill in with the estimated filter:
  # region1[region1$ts_missing, "obs"] <- region1 %>% dplyr::filter(ts_missing) %>%
  #   mutate(obs = est) %>% pull(obs)
  series1 <- region1 %>% pull(obs)
  series2 <- region2 %>% pull(est)

  # compute the difference and recover variances from the interval widths
  diff <- series1 - series2
  var1 <- region1 %>% dplyr::mutate(var_est = ((upr - est)/2)^2) %>% dplyr::select(var_est)
  var2 <- region2 %>% dplyr::mutate(var_est = ((upr - est)/2)^2) %>% dplyr::select(var_est)
  series1_missing <- region1 %>% dplyr::select(ts_missing)
  series2_missing <- region2 %>% dplyr::select(ts_missing)
  vals_missing <- (series1_missing + series2_missing) > 0

  # use correlation and variances to compute the covariance
  cor_estimate <- cor(series1, series2)
  cov_estimate <- as.numeric(cor_estimate) * sqrt(var1) * sqrt(var2)
  # estimate the variance of the difference series using the approximation
  var_est <- var1 + var2 - 2 * cov_estimate
  # create the standardized difference series
  standardized_diff <- diff / sqrt(var_est)
  # compute lag 1 autocorrelation of the standardized difference series
  lag1_est <- acf(standardized_diff, plot = FALSE, na.action = na.pass)$acf[2]  ## we could do something fancier

  # use the qcc package to make the EWMA chart
  out <- qcc::ewma(standardized_diff, center = 0, std.dev = 1, lambda = lag1_est,
                   nsigmas = 3, sizes = 1, plot = FALSE)
  ## put NAs where we had missing values for either series
  out$y[vals_missing] <- NA
  out$data[vals_missing, 1] <- NA

  # create plot; row 20 of out$limits holds the (steady-state) control limits
  dat <- data.frame(x = region1$date, ewma = out$y, y = out$data[, 1],
                    col = out$x %in% out$violations,
                    lwr = out$limits[20, 1], upr = out$limits[20, 2])
  obs_dat <- data.frame(x = dat$x, y = out$data[, 1], col = "black")
  p <- ggplot(dat, aes(x = x, y = ewma)) +
    geom_vline(xintercept = events, col = "darkgrey", lwd = 1) +
    geom_line() +
    geom_point(aes(col = col), size = 3) +
    scale_color_manual(values = c(1, 2), labels = c("No", "Yes"), name = "Separation?") +
    new_scale_color() +
    geom_point(data = obs_dat, aes(x = x, y = y, col = col), shape = 3) +
    scale_color_manual(values = "black", labels = "Observed \nStandardized \nDifference", name = "") +
    geom_hline(aes(yintercept = out$limits[20, 1]), lty = 2) +
    geom_hline(aes(yintercept = out$limits[20, 2]), lty = 2) +
    geom_hline(aes(yintercept = 0), lty = 1) +
    ggtitle(title.char) +
    xlab("Date") + ylab(ylab.label) +
    theme_minimal()
  return(p)
}
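A hypothetical call, assuming roll_lift and roll_wwtp hold fits_rolling data frames (filter rows only, burn-in removed, same dates) for a lift station and the WWTP:

Code
## sketch only: `roll_lift` and `roll_wwtp` are placeholder rolling fits
p <- ww_ewma(region1 = roll_lift,  # lift station estimates
             region2 = roll_wwtp,  # WWTP estimates over the same dates
             title.char = "Lift station A vs. WWTP")
p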
Summary of EWMA chart procedure
Inputs: At least 10 + \(n\) WWTP observations, \(n \geq 1\) sub-sewershed observations
Read in cleaned WWTP series and obtain online trend estimates through the date of the first sub-sewershed observation.
Replace any missing sub-sewershed observations with the WWTP online trend estimate for the corresponding date.
Create the difference time series: sub-sewershed observed copies/L (\(\log_{10}\)) minus the WWTP online trend estimate.
Standardize the difference series by dividing by its approximate standard deviation (see the formula after this list).
Construct EWMA chart for the standardized difference series.
Inspect EWMA chart for separation.
Outputs: EWMA chart for determining separation of the sub-sewershed from the centralized WWTP.
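In symbols, writing \(y_t\) for the sub-sewershed observation (imputed from the WWTP trend when missing) and \(\hat{\mu}_t\) for the WWTP online trend estimate, the difference and standardization steps chart

\[
D_t = \frac{y_t - \hat{\mu}_t}{\sqrt{\widehat{\mathrm{Var}}(y_t) + \widehat{\mathrm{Var}}(\hat{\mu}_t) - 2\,\widehat{\mathrm{Cov}}(y_t, \hat{\mu}_t)}},
\]

where the variances are recovered from the state space interval widths and the covariance is approximated through the sample correlation of the two series, as in ww_ewma above.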