Online trend estimation and detection of trend deviations in sub-sewershed time series of SARS-CoV-2 RNA measured in wastewater

Cal Poly Humboldt Department of Mathematics Colloquium

Dr. Julia C. Schedler

2023-10-16

Background and motivation

Routine monitoring of wastewater

https://hou-wastewater-epi.org/

Our team

Motivating questions

  1. How do we scientifically estimate trends in wastewater time series?
  2. How can we detect deviations from the trend at new locations during times of concern?
  3. How can we ensure our methods are easy for others to apply?

Data processing steps

  • identify observations below the level of detection using statistical analysis

  • align all observations to Mondays

  • transform copies per L to a log10 scale

  • average replicates to give one weekly measurement per week per location

  • Only use locations where the primary WWTP has at least 85% coverage and observations within 1 month of last date

  • ensure there is a row for each week, even if the observation is missing

  • create an indicator of missing values

  • remove irrelevant features/variables

# A tibble: 475 × 4
   dates      name           value ts_missing
   <date>     <chr>          <dbl> <lgl>     
 1 2021-05-24 WWTP            3.40 FALSE     
 2 2021-05-24 Lift station A  3.42 FALSE     
 3 2021-05-24 Lift station D  4.11 FALSE     
 4 2021-05-24 Lift station B  3.92 FALSE     
 5 2021-05-24 Lift station C  3.87 FALSE     
 6 2021-05-31 WWTP            4.14 FALSE     
 7 2021-05-31 Lift station A  3.56 FALSE     
 8 2021-05-31 Lift station D NA    TRUE      
 9 2021-05-31 Lift station B NA    TRUE      
10 2021-05-31 Lift station C NA    TRUE      
11 2021-06-07 WWTP            3.70 FALSE     
12 2021-06-07 Lift station A  3.76 FALSE     
13 2021-06-07 Lift station D  3.79 FALSE     
14 2021-06-07 Lift station B NA    TRUE      
15 2021-06-07 Lift station C  3.61 FALSE     
16 2021-06-14 WWTP            3.64 FALSE     
17 2021-06-14 Lift station A NA    TRUE      
18 2021-06-14 Lift station D NA    TRUE      
19 2021-06-14 Lift station B NA    TRUE      
20 2021-06-14 Lift station C  3.82 FALSE     
# ℹ 455 more rows

Trend estimation

Previous method: smoothing spline

Code
WWTP = all_ts_observed %>% dplyr::filter(name == "WWTP")
WWTP$y.smooth <- smooth.spline(WWTP$dates, zoo::na.approx(WWTP$value))$y

ggplot(data = WWTP, aes(x = dates, y = value)) + 
         geom_line(col = "#332288") + geom_point(col = "#332288")+
         geom_line(aes(x = dates, y = y.smooth), col = "#33228850", lwd = 2) +
        theme_minimal() + 
          scale_color_manual(values = c("#33228850", "#332288")) +
        ylab("Log 10 Copies/L") + xlab("Date") + 
  theme(legend.position = "none", 
                      title = element_blank(), 
                      axis.title.x = element_text(size = 20),
                      axis.title.y = element_text(size = 20),
                      axis.text.x =element_text(size = 20), 
                      axis.text.y = element_text(size = 20))

Spline methods

Pros

  • Gives a nice, smooth trend estimate
  • Functions available in base R

Cons

Time series data require time series methods

Or else you will under-estimate standard errors!

State space models

A flexible modeling framework that specifies a model in two levels:

\[\begin{align} \text{data model (observation equation): } y_{t} &= A_t\mu_t + v_t\\ \text{process model (state equation): } \mu_{t+1} &= \Phi_{t} \mu_{t} + R_t w_t. \\ \end{align} \]

The choice of \(A_t, \Phi_t, R_t\) allow for a wide range of models to be specified.

State space spline

\[\begin{align} \text{Observation equation: }& y_{t} = \mu_t + v_{t}\\ \text{State equation: }& (\mu_t - \mu_{t-1}) = (\mu_{t-1} - \mu_{t-2}) + w_t.\\ \text{ OR}\\ \begin{pmatrix} \mu_t\\\mu_{t-1}\end{pmatrix} &= \begin{pmatrix} 2 & -1\\ 1 &0 \end{pmatrix} \begin{pmatrix}\mu_{t-1}\\ \mu_{t-2}\end{pmatrix} + \begin{pmatrix}1\\0\end{pmatrix}w_t. \\ \text{Initial condition: }& \mu_0 \sim N(c_0, m_0). \end{align} \] Parameters are \(\sigma^2_v, \sigma^2_w, c_0, m_0\)

See this example for a more mathematical treatment.

Several ways to obtain trend estimates using this model…

Types of estimation

  • Prediction: forecasting future values of the state
    • E.g. one-step-ahead forecast
  • Online (Filtering): best estimate of the current value of the state
    • Estimate \(\mathbb{E}(\mu_t | y_1, \dots, y_{t})\)
  • Retrospective (Smoothing): best estimates of past values of the state given all observations
    • For example: \(\mathbb{E}(\mu_{t-1}|y_1, \dots, y_n)\)

Estimates of error accompany all of these!

Variance comparison

  • Observation variance: \(\sigma_v^2\)– sampling and lab variation
  • State variance: \(\sigma_w^2\)– inherent process variability

State space models

Pros

  • Can specify any model structure
  • Separate process variability from observation error
  • Handles missing data
  • Many R packages available

Cons

  • Can be difficult to specify model structure
  • Varying ease of use for applied researchers with R packages

Software

  • KFAS package
  • Convenience wrapper functions:
Code
KFAS_state_space_spline <- function(ts_obs, name, ts.missing, ts_dates, init_par){
  
## Specify model structure
A = matrix(c(1,0),1)
Phi = matrix(c(2,1,-1,0),2)
mu1 = matrix(0,2) 
P1 = diag(1,2)
v = matrix(NA) 
R = matrix(c(1,0),2,1)
w = matrix(NA) 

#function for updating the model
update_model <- function(pars, model) {
  model["H"][1] <- pars[1]
  model["Q"][1] <- pars[2]
  model
}

#check that variances are non-negative
check_model <- function(model) {
  (model["H"] > 0 && min(model["Q"]) > 0)
}

# Specify the model
mod <- KFAS::SSModel(ts_obs ~ -1 +
                 SSMcustom(Z = A, T = Phi, R = R, Q = w, a1 = mu1, P1 = P1), H = v)

# Fit the model
fit_mod <- KFAS::fitSSM(mod, inits = init_par, method = "BFGS",
                  updatefn = update_model, checkfn = check_model, hessian=TRUE,
                  control=list(trace=FALSE,REPORT=1))

## Format for output
ts_len <- length(ts_obs)
smoothers <- data.frame(est = KFS(fit_mod$model)$alphahat[,1],
                  lwr = KFS(fit_mod$model)$alphahat[,1]- 1.96*sqrt(KFS(fit_mod$model)$V[1,1,]),
                  upr = KFS(fit_mod$model)$alphahat[,1]+ 1.96*sqrt(KFS(fit_mod$model)$V[1,1,]),
                  ts_missing = ts.missing,
                  name = rep(name[1], times = ts_len),
                  fit = rep("smoother", times = ts_len),
                  date = ts_dates,
                  sigv = rep(fit_mod$optim.out$par[1], times = ts_len),
                  sigw = rep(fit_mod$optim.out$par[2], times = ts_len), 
                  obs = ts_obs,
                  resid = rstandard(KFS(fit_mod$model), type = "recursive"),
                  conv = fit_mod$optim.out$convergence)

filters <- data.frame(est = KFS(fit_mod$model)$att[,1],
           lwr = KFS(fit_mod$model)$att[,1]- 1.96*sqrt(KFS(fit_mod$model)$Ptt[1,1,]),
           upr = KFS(fit_mod$model)$att[,1]+ 1.96*sqrt(KFS(fit_mod$model)$Ptt[1,1,]),
           ts_missing = ts.missing,
           name = rep(name[1], times = ts_len),
           fit = rep("filter", times = ts_len),
           date = ts_dates,
           sigv = rep(fit_mod$optim.out$par[1], times = ts_len),
           sigw = rep(fit_mod$optim.out$par[2], times = ts_len), 
           obs = ts_obs, 
           resid = rstandard(KFS(fit_mod$model), type = "recursive"),
           conv = fit_mod$optim.out$convergence)
## combine it all for output.
kfas_fits_out <- list(fits = bind_rows(smoothers,filters), model = fit_mod$model)
return(kfas_fits_out)
}
Code
KFAS_rolling_estimation <- function(init_vals_roll, 
                               ts_obs_roll,  
                               ts_name_roll,
                               dates_roll,
                               init.par_roll,
                               ts.missing_roll){
  
  ## perform initial fit on "burnin" of first init_vals_roll time points 
  model_out <- KFAS_state_space_spline(ts_obs = ts_obs_roll[1:init_vals_roll],
                                    name = ts_name_roll,                                     
                                    ts.missing = ts.missing_roll[1:init_vals_roll], 
                                    ts_dates = dates_roll[1:init_vals_roll], 
                                    init_par = init.par_roll)

  #just keep estimates for dates in burnins
  #smoother need not be kept
  fits_rolling <- dplyr::filter(model_out$fits,
                         date == dates_roll[1:init_vals_roll],
                         fit == "filter")

  
  # use variance estimates from burnin fit to initialize model for next time point
  next.par <- c(fits_rolling$sigv[init_vals_roll], fits_rolling$sigw[init_vals_roll])

  ## perform rolling estimation for each time point
  for(i in (init_vals_roll +1):length(ts_obs_roll)){
    # just looking to current time point
    ts_partial <- ts_obs_roll[1:i]

    # fit the model for the next time point
    ith_fit <- KFAS_state_space_spline(ts_obs = ts_partial,
                                      name = ts_name_roll,
                                      ts.missing = ts.missing_roll[1:i],
                                      ts_dates = dates_roll[1:i],
                                      init_par = next.par)$fits
    # save results of model fit
    if(exists("ith_fit")){
      fits_rolling <- rbind(fits_rolling, dplyr::filter(ith_fit, date == dates_roll[i], fit == "filter"))
      # get updated variance estimates for observation and state
      next.par <- c(ith_fit$sigv[nrow(ith_fit)], ith_fit$sigw[nrow(ith_fit)])
      ## compute smoother at final time point
      if(i == length(ts_obs_roll)){
        fits_rolling <- rbind(fits_rolling, dplyr::filter(ith_fit, fit == "smoother"))
      }
      rm(ith_fit)
    }else{ ## I don't know how to to error handling, feel free to do a pull request
      print(rep("FAIL", times = 100))
    }
  }
  ## give the user an update once each series' estimation is complete
  print(paste("Model fit complete: ", ts_name_roll[1]))
  return(fits_rolling)

}

Time series modeling assumption

Autocorrelation of model’s residuals should not be present.

Do lift station series give different information than the WWTP?

Retrospective comparison

Summary of trend estimation

Inputs: Raw lab values

  1. Process the raw data
  2. Initialize the model by estimating parameters using the first 10 weeks of data.
  3. Compute online trend estimates and confidence limits and re-estimating parameters with each additional data point
  4. Compute retrospective trend estimates and confidence limits.
  5. Verify convergence of estimates and time series modeling assumptions.
  6. Compute table of state and observation variances for each time series.
  7. Compare visualizations of retrospective estimates of WWTP and sub-sewershed trends to determine whether a difference was present.

Outputs: Online and retrospective trend estimates, estimates of trend variability and measurement/sampling variability.

How can we integrate information from times of concern into the wastewater surveillance system?

Lift stations DO give different information

How can we determine when we are seeing trend deviations?

(Online) Detection of deviations

Is there another tool for which we can use the online (filter) estimates?

Control charts

When is a process “out of control” or experiencing a mean shift/structural break?

Traditional applications:

  • ensuring a given percentage of on-time deliveries to a client
  • speed and consistency of service quality in a bank
  • loading passengers onto an airplane

Lots of options:

  • Shewhart chart
  • CUSUM chart
  • Exponentially weighted moving average chart

EWMA chart

For a time series \(d_t\):

\[ z_t = \lambda d_t + (1- \lambda)z_{t-1} \]

  • \(z_t\) is a weighted average of all past values of \(d_t\)
  • contribution of each past value is controlled by \(\lambda\)
    • we use lag 1 autocorrelation estimate for \(d_t\).

Reasoning for choice of EWMA chart:

  • can use with just 1 value
  • can detect changes in correlated series

Which series do we want to monitor?

Standardized difference series

Formally, using the previous notation, the standardized difference at time point \(t\), for lift station \(i=1,\ldots,4\) is given by:

\[ d_{i,t} = \frac{y_{i,t} - \hat{\mu}_t}{\tilde{\sigma}_d} = \frac{\text{WWTP state estimate - LS observed}}{\text{std dev.}}, \] where \(\tilde{\sigma}_d^2 = {\rm Var}(y_{i,t} - \hat{\mu}_t)\). We approximate this variance by: \[ \tilde{\sigma}_d^2 \approx \hat{\sigma}_{v_t}^2 + \hat{\sigma}_{w_t}^2 -2{\rm Corr}(y_{i}, \hat{\mu})\cdot \hat{\sigma}_{v_t} \cdot \hat{\sigma}_{w_t}, \] where \({\rm Corr}(y_{i}, \hat{\mu})\) is the Pearson correlation.

Goal: create an EWMA chart for difference bewteen estimated WWTP state and observed LS values.

Charts

Software

  • qcc package

  • custom function ww_ewma to compute standardized difference

## creates chart for the series you input, make sure to remove burnin period.
## region 1 is the lift station
## region 2 is the WWTP
ww_ewma <- function(region1, region2, title.char, ylab.label="Standardized Difference in Series", events = NULL){
  
  
  ## observed series may have missing values. Use the estimated value from the WWTP state space model as imputed value.
  region1[region1$ts_missing, "obs"] <- left_join(region1[region1$ts_missing,], region2, by = "date") %>% pull(est.y)
  
  
  # if you have enough data to model region1 you can fill in with the estimated filter
  #region1[region1$ts_missing, "obs"] <- region1 %>% dplyr::filter(ts_missing) %>% mutate(obs = est) %>% pull(obs)
  
 
  series1 <- region1 %>% pull(obs)
  series2 <- region2 %>% pull(est)
  
  # compute the difference
  diff <- series1 - series2
  
  var1 <- region1 %>% 
    dplyr::mutate(var_est = ((upr-est)/2)^2) %>% 
    dplyr::select(var_est)
  var2 <- region2 %>%
    dplyr::mutate(var_est = ((upr-est)/2)^2) %>% 
    dplyr::select(var_est)
  
  
  series1_missing <- region1 %>% dplyr::select(ts_missing)
  series2_missing <- region2 %>% dplyr::select(ts_missing)
  
  
  vals_missing <- (series1_missing + series2_missing) > 0
 
  
  # use correlation and variances to compute the covariance
  cor_estimate <- cor(series1, series2)
  cov_estimate <- cov_est <- as.numeric(cor_estimate)*sqrt(var1)*sqrt(var2)
  
  # create estimate of variance of the difference series using approximation
  var_est <- var1 + var2 -2*cov_estimate
  
  # create standardized difference series
  standardized_diff <- diff/sqrt(var_est)
  
  # compute lag 1 autocorrelation of standardized difference series
  lag1_est <- acf(standardized_diff, plot=F, na.action = na.pass)$acf[2] ## we could do something fancier
  
  # use qcc package to make ewma plot
  out <- qcc::ewma(standardized_diff, center = 0, sd = 1, 
                   lambda = lag1_est, nsigmas = 3, sizes = 1, plot = F)
  
  ## put NAs where we had missing values for either series
  #out$x[vals_missing] <-NA
  out$y[vals_missing] <- NA
  out$data[vals_missing,1]<- NA
  
  # create plot
  dat <- data.frame(x = region1$date, 
                    ewma = out$y, 
                    y = out$data[,1], 
                    col = out$x %in% out$violations, 
                    lwr = out$limits[20,1],
                    upr = out$limits[20,2])
  obs_dat <- data.frame(x = dat$x, y = out$data[,1], col = "black")
  p <-ggplot(dat, aes(x = x, y = ewma)) + 
    geom_vline(xintercept = events, 
               col = "darkgrey", 
               lwd = 1)+
    geom_line()+
    geom_point(aes(col = dat$col), size = 3) +
    scale_color_manual(values = c(1,2), label = c("No", "Yes"), name = "Separation?") +
    new_scale_color() +
    geom_point(data = obs_dat, aes(x = x, y = y, col =col), shape = 3) +
    scale_color_manual(values = "black", label = "Observed \nStandardized \nDifference", name = "") +
    geom_hline(aes(yintercept = out$limits[20,1]), lty = 2) + 
    geom_hline(aes(yintercept = out$limits[20,2]), lty = 2) +
    geom_hline(aes(yintercept = 0), lty = 1) + 
    ggtitle(paste(title.char))+
    xlab("Date") + ylab(ylab.label) +
    theme_minimal()

  return(p)
}

Summary of EWMA chart procedure

Inputs: At least 10 + \(n\) WWTP observations, \(n \geq 1\) sub-sewershed observations

  1. Read in cleaned WWTP series and obtain online trend estimates through the date of the first sub-sewershed observation.
  2. Replace any missing sub-sewershed observations with WWTP online trend estimate for corresponding date.
  3. Create difference time series of sub-sewershed observed copies/liter (\(\log10\)) - WWTP Online Trend Estimate.
  4. Standardize the difference series by dividing by the standard deviation computed approximation
  5. Construct EWMA chart for the standardized difference series.
  6. Inspect EWMA chart for separation.

Outputs: WMA chart for determining separation of sub-sewershed from centralized WWTP.

Future work

  • Joint modeling?
  • Spatial component
  • Time-varying model structure

\[\begin{align} \text{data model (observation equation): } y_{t} &= A_t\mu_t + v_t\\ \text{process model (state equation): } \mu_{t+1} &= \Phi_{t} \mu_{t} + R_t w_t. \\ \end{align} \]

Resources:

  • Assumed knowledge + resources to fill gaps:

Acknowledgements

This work was supported by the:

  • CDC Foundation (project no. 1085.46)

  • Centers for Disease Control and Prevention (ELC-ED grant no. 6NU50CK000557-01-05 and ELC-CORE grant no. NU50CK000557).

  • Rockefeller Foundation

Licensing note

For license information, please refer to our GitHub– coming soon!

Thanks!

https://juliaschedler.com

https://github.com/hou-wastewater-epi-org

https://hou-wastewater-epi.org/

Email me: [email protected]