How do we scientifically estimate trends in wastewater time series?
How can we detect deviations from the trend at new locations during times of concern?
How can we ensure our methods are easy for others to apply?
Data processing steps
identify observations below the limit of detection and handle them statistically
align all observations to Mondays
transform copies per L to a log10 scale
average replicates to give one weekly measurement per week per location
only use locations where the primary WWTP has at least 85% coverage and that have observations within 1 month of the last date in the series
ensure there is a row for each week, even if the observation is missing
create an indicator of missing values
remove irrelevant features/variables (a minimal code sketch of these steps follows)
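A minimal dplyr/tidyr sketch of these steps, assuming a hypothetical raw data frame raw with columns date, location, copies_per_L, and lod (illustrative names, not the project's actual schema; the location-level filters above are not shown, and a common LOD/\(\sqrt{2}\) substitution stands in for the project's statistical handling of values below the limit of detection):

Code
## sketch only: `raw` and its column names are placeholders
library(dplyr)
library(tidyr)
library(lubridate)

weekly <- raw %>%
  # substitute values below the limit of detection (LOD/sqrt(2) convention
  # stands in here for the statistical handling used in the project)
  mutate(copies_per_L = if_else(copies_per_L < lod, lod / sqrt(2), copies_per_L)) %>%
  # align all observations to the Monday of their week
  mutate(date = floor_date(date, unit = "week", week_start = 1)) %>%
  # transform copies per L to the log10 scale
  mutate(value = log10(copies_per_L)) %>%
  # average replicates to give one weekly measurement per week per location
  group_by(location, date) %>%
  summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # ensure there is a row for each week, even if the observation is missing
  complete(location, date = seq(min(date), max(date), by = "week")) %>%
  # create an indicator of missing values, then drop irrelevant variables
  mutate(ts_missing = is.na(value)) %>%
  select(date, location, value, ts_missing)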
# A tibble: 475 × 6
   dates      name           value ts_missing name_orig        colors 
   <date>     <chr>          <dbl> <lgl>      <chr>            <chr>  
 1 2021-05-24 WWTP            3.40 FALSE      69TH STREET      #332288
 2 2021-05-24 Lift station C  3.42 FALSE      AFTON VILLAGE    #AA4499
 3 2021-05-24 Lift station D  4.11 FALSE      HEIGHTS          #DDCC77
 4 2021-05-24 Lift station A  3.92 FALSE      CLINTON DRIVE #2 #44AA99
 5 2021-05-24 Lift station B  3.87 FALSE      DE PRIEST        #88CCEE
 6 2021-05-31 WWTP            4.14 FALSE      69TH STREET      #332288
 7 2021-05-31 Lift station C  3.56 FALSE      AFTON VILLAGE    #AA4499
 8 2021-05-31 Lift station D    NA TRUE       HEIGHTS          #DDCC77
 9 2021-05-31 Lift station A    NA TRUE       CLINTON DRIVE #2 #44AA99
10 2021-05-31 Lift station B    NA TRUE       DE PRIEST        #88CCEE
11 2021-06-07 WWTP            3.70 FALSE      69TH STREET      #332288
12 2021-06-07 Lift station C  3.76 FALSE      AFTON VILLAGE    #AA4499
13 2021-06-07 Lift station D  3.79 FALSE      HEIGHTS          #DDCC77
14 2021-06-07 Lift station A    NA TRUE       CLINTON DRIVE #2 #44AA99
15 2021-06-07 Lift station B  3.61 FALSE      DE PRIEST        #88CCEE
16 2021-06-14 WWTP            3.64 FALSE      69TH STREET      #332288
17 2021-06-14 Lift station C    NA TRUE       AFTON VILLAGE    #AA4499
18 2021-06-14 Lift station D    NA TRUE       HEIGHTS          #DDCC77
19 2021-06-14 Lift station A    NA TRUE       CLINTON DRIVE #2 #44AA99
20 2021-06-14 Lift station B  3.82 FALSE      DE PRIEST        #88CCEE
# ℹ 455 more rows
Observation variance \(\sigma_v^2\): sampling and lab variation
State variance \(\sigma_w^2\): inherent process variability
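Reading the structure off the matrices in KFAS_state_space_spline below (our transcription of that code, not an additional model), the smoothing spline trend model is

\[
y_t = \mu_t + v_t, \qquad v_t \sim N(0, \sigma_v^2),
\]
\[
\mu_t = 2\mu_{t-1} - \mu_{t-2} + w_t, \qquad w_t \sim N(0, \sigma_w^2),
\]

where \(y_t\) is the observed \(\log_{10}\) concentration in week \(t\) and \(\mu_t\) is the latent trend.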
State space models
Pros
Can specify many model structures
Separates process variability from observation error
Handles missing data
Many R packages available
Cons
Can be difficult to specify model structure
R packages vary in ease of use for applied researchers
Software
KFAS package
“Convenience” wrapper functions:
Code
KFAS_state_space_spline <- function(ts_obs, name, ts.missing, ts_dates, init_par){
  ## requires: library(KFAS); library(dplyr)  (for KFS, rstandard, bind_rows)
  ## Specify model structure (smoothing spline trend in state space form)
  A   <- matrix(c(1, 0), 1)         # observation matrix: y_t = mu_t + v_t
  Phi <- matrix(c(2, 1, -1, 0), 2)  # transition: mu_t = 2*mu_{t-1} - mu_{t-2} + w_t
  mu1 <- matrix(0, 2)               # initial state mean
  P1  <- diag(1, 2)                 # initial state covariance
  v   <- matrix(NA)                 # observation variance sigma_v^2, to be estimated
  R   <- matrix(c(1, 0), 2, 1)      # state disturbance loading
  w   <- matrix(NA)                 # state variance sigma_w^2, to be estimated

  # function for updating the variance parameters during optimization
  update_model <- function(pars, model) {
    model["H"][1] <- pars[1]
    model["Q"][1] <- pars[2]
    model
  }
  # check that variances are non-negative
  check_model <- function(model) {
    (model["H"] > 0 && min(model["Q"]) > 0)
  }

  # Specify the model
  mod <- KFAS::SSModel(ts_obs ~ -1 + SSMcustom(Z = A, T = Phi, R = R, Q = w,
                                               a1 = mu1, P1 = P1), H = v)
  # Fit the model by maximum likelihood
  fit_mod <- KFAS::fitSSM(mod, inits = init_par, method = "BFGS",
                          updatefn = update_model, checkfn = check_model,
                          hessian = TRUE, control = list(trace = FALSE, REPORT = 1))

  ## Format for output: run the Kalman filter/smoother once and reuse the results
  kfs <- KFS(fit_mod$model)
  ts_len <- length(ts_obs)
  smoothers <- data.frame(est = kfs$alphahat[, 1],
                          lwr = kfs$alphahat[, 1] - 1.96 * sqrt(kfs$V[1, 1, ]),
                          upr = kfs$alphahat[, 1] + 1.96 * sqrt(kfs$V[1, 1, ]),
                          ts_missing = ts.missing,
                          name = rep(name[1], times = ts_len),
                          fit = rep("smoother", times = ts_len),
                          date = ts_dates,
                          sigv = rep(fit_mod$optim.out$par[1], times = ts_len),
                          sigw = rep(fit_mod$optim.out$par[2], times = ts_len),
                          obs = ts_obs,
                          resid = rstandard(kfs, type = "recursive"),
                          conv = fit_mod$optim.out$convergence)
  filters <- data.frame(est = kfs$att[, 1],
                        lwr = kfs$att[, 1] - 1.96 * sqrt(kfs$Ptt[1, 1, ]),
                        upr = kfs$att[, 1] + 1.96 * sqrt(kfs$Ptt[1, 1, ]),
                        ts_missing = ts.missing,
                        name = rep(name[1], times = ts_len),
                        fit = rep("filter", times = ts_len),
                        date = ts_dates,
                        sigv = rep(fit_mod$optim.out$par[1], times = ts_len),
                        sigw = rep(fit_mod$optim.out$par[2], times = ts_len),
                        obs = ts_obs,
                        resid = rstandard(kfs, type = "recursive"),
                        conv = fit_mod$optim.out$convergence)
  ## combine it all for output
  kfas_fits_out <- list(fits = bind_rows(smoothers, filters), model = fit_mod$model)
  return(kfas_fits_out)
}
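A hypothetical call, assuming the processed weekly tibble shown earlier is stored as dat and that c(0.1, 0.01) is a reasonable pair of starting variances (both assumptions are ours, for illustration):

Code
## sketch only: `dat` and the starting values are placeholders
wwtp <- dplyr::filter(dat, name == "WWTP")
wwtp_fit <- KFAS_state_space_spline(ts_obs = wwtp$value,
                                    name = wwtp$name,
                                    ts.missing = wwtp$ts_missing,
                                    ts_dates = wwtp$dates,
                                    init_par = c(0.1, 0.01))  # starting (sigma_v^2, sigma_w^2)
head(dplyr::filter(wwtp_fit$fits, fit == "smoother"))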
Code
KFAS_rolling_estimation <- function(init_vals_roll, ts_obs_roll, ts_name_roll,
                                    dates_roll, init.par_roll, ts.missing_roll){
  ## perform initial fit on "burn-in" of first init_vals_roll time points
  model_out <- KFAS_state_space_spline(ts_obs = ts_obs_roll[1:init_vals_roll],
                                       name = ts_name_roll,
                                       ts.missing = ts.missing_roll[1:init_vals_roll],
                                       ts_dates = dates_roll[1:init_vals_roll],
                                       init_par = init.par_roll)
  ## just keep filter estimates for dates in the burn-in; the smoother need not be kept
  fits_rolling <- dplyr::filter(model_out$fits,
                                date %in% dates_roll[1:init_vals_roll],
                                fit == "filter")
  ## use variance estimates from the burn-in fit to initialize the model for the next time point
  next.par <- c(fits_rolling$sigv[init_vals_roll], fits_rolling$sigw[init_vals_roll])

  ## perform rolling estimation for each remaining time point
  for(i in (init_vals_roll + 1):length(ts_obs_roll)){
    # use only data up to the current time point
    ts_partial <- ts_obs_roll[1:i]
    # fit the model through time point i
    ith_fit_out <- KFAS_state_space_spline(ts_obs = ts_partial,
                                           name = ts_name_roll,
                                           ts.missing = ts.missing_roll[1:i],
                                           ts_dates = dates_roll[1:i],
                                           init_par = next.par)
    fitmod <- ith_fit_out$model
    ith_fit <- ith_fit_out$fits
    # save results of model fit
    if(exists("ith_fit")){
      fits_rolling <- rbind(fits_rolling,
                            dplyr::filter(ith_fit, date == dates_roll[i], fit == "filter"))
      # get updated variance estimates for observation and state
      next.par <- c(ith_fit$sigv[nrow(ith_fit)], ith_fit$sigw[nrow(ith_fit)])
      ## compute smoother at final time point
      if(i == length(ts_obs_roll)){
        fits_rolling <- rbind(fits_rolling, dplyr::filter(ith_fit, fit == "smoother"))
      }
      rm(ith_fit)
    } else {
      ## I don't know how to do error handling, feel free to do a pull request
      print(rep("FAIL", times = 100))
    }
  }
  ## give the user an update once each series' estimation is complete
  print(paste("Model fit complete: ", ts_name_roll[1]))
  output <- list(fits_rolling = fits_rolling, model = fitmod)
  return(output)
}
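A hypothetical rolling fit for the same series, using the first 10 observations as the burn-in (consistent with the "at least 10 + \(n\) observations" input noted in the summary at the end):

Code
## sketch only: reuses the placeholder `wwtp` data frame from above
roll_wwtp <- KFAS_rolling_estimation(init_vals_roll = 10,
                                     ts_obs_roll = wwtp$value,
                                     ts_name_roll = wwtp$name,
                                     dates_roll = wwtp$dates,
                                     init.par_roll = c(0.1, 0.01),
                                     ts.missing_roll = wwtp$ts_missing)
head(roll_wwtp$fits_rolling)  # one filter estimate per date after the burn-in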
Time series modeling assumption
The model's residuals should show no autocorrelation.
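One quick check, assuming the hypothetical wwtp_fit from the sketch above; the standardized recursive residuals returned by the wrapper should look like white noise:

Code
## sketch of a residual diagnostic for a fitted series
res <- dplyr::filter(wwtp_fit$fits, fit == "smoother")$resid
acf(res, na.action = na.pass, main = "ACF of standardized recursive residuals")
Box.test(na.omit(res), lag = 10, type = "Ljung-Box")  # small p-value signals leftover autocorrelation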
Do lift station series give different information than the WWTP?
custom function ww_ewma to compute standardized difference
## creates an EWMA chart for the series you input; make sure to remove the burn-in period
## region1 is the lift station, region2 is the WWTP
## requires: library(dplyr); library(ggplot2); library(ggnewscale); library(qcc)
ww_ewma <- function(region1, region2, title.char,
                    ylab.label = "Standardized Difference in Series", events = NULL){
  ## observed series may have missing values; use the estimated value from the
  ## WWTP state space model (est.y after the join) as the imputed value
  region1[region1$ts_missing, "obs"] <- left_join(region1[region1$ts_missing, ],
                                                  region2, by = "date") %>%
    pull(est.y)
  # if you have enough data to model region1 you can fill in with the estimated filter:
  # region1[region1$ts_missing, "obs"] <- region1 %>% dplyr::filter(ts_missing) %>%
  #   mutate(obs = est) %>% pull(obs)
  series1 <- region1 %>% pull(obs)
  series2 <- region2 %>% pull(est)

  # compute the difference and recover variances from the interval widths
  diff <- series1 - series2
  var1 <- region1 %>% dplyr::mutate(var_est = ((upr - est)/2)^2) %>% dplyr::select(var_est)
  var2 <- region2 %>% dplyr::mutate(var_est = ((upr - est)/2)^2) %>% dplyr::select(var_est)
  series1_missing <- region1 %>% dplyr::select(ts_missing)
  series2_missing <- region2 %>% dplyr::select(ts_missing)
  vals_missing <- (series1_missing + series2_missing) > 0

  # use correlation and variances to compute the covariance
  cor_estimate <- cor(series1, series2)
  cov_estimate <- as.numeric(cor_estimate) * sqrt(var1) * sqrt(var2)
  # estimate the variance of the difference series using the approximation
  var_est <- var1 + var2 - 2 * cov_estimate
  # create the standardized difference series
  standardized_diff <- diff / sqrt(var_est)
  # compute lag 1 autocorrelation of the standardized difference series
  lag1_est <- acf(standardized_diff, plot = FALSE, na.action = na.pass)$acf[2]  ## we could do something fancier

  # use the qcc package to make the EWMA chart
  out <- qcc::ewma(standardized_diff, center = 0, std.dev = 1, lambda = lag1_est,
                   nsigmas = 3, sizes = 1, plot = FALSE)
  ## put NAs where we had missing values for either series
  out$y[vals_missing] <- NA
  out$data[vals_missing, 1] <- NA

  # create plot; row 20 of out$limits holds the (steady-state) control limits
  dat <- data.frame(x = region1$date, ewma = out$y, y = out$data[, 1],
                    col = out$x %in% out$violations,
                    lwr = out$limits[20, 1], upr = out$limits[20, 2])
  obs_dat <- data.frame(x = dat$x, y = out$data[, 1], col = "black")
  p <- ggplot(dat, aes(x = x, y = ewma)) +
    geom_vline(xintercept = events, col = "darkgrey", lwd = 1) +
    geom_line() +
    geom_point(aes(col = col), size = 3) +
    scale_color_manual(values = c(1, 2), labels = c("No", "Yes"), name = "Separation?") +
    new_scale_color() +
    geom_point(data = obs_dat, aes(x = x, y = y, col = col), shape = 3) +
    scale_color_manual(values = "black", labels = "Observed \nStandardized \nDifference", name = "") +
    geom_hline(aes(yintercept = out$limits[20, 1]), lty = 2) +
    geom_hline(aes(yintercept = out$limits[20, 2]), lty = 2) +
    geom_hline(aes(yintercept = 0), lty = 1) +
    ggtitle(title.char) +
    xlab("Date") + ylab(ylab.label) +
    theme_minimal()
  return(p)
}
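A hypothetical call, assuming roll_lift and roll_wwtp hold fits_rolling data frames (filter rows only, burn-in removed, same dates) for a lift station and the WWTP:

Code
## sketch only: `roll_lift` and `roll_wwtp` are placeholder rolling fits
p <- ww_ewma(region1 = roll_lift,  # lift station estimates
             region2 = roll_wwtp,  # WWTP estimates over the same dates
             title.char = "Lift station A vs. WWTP")
p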
Summary of EWMA chart procedure
Inputs: At least 10 + \(n\) WWTP observations, \(n \geq 1\) sub-sewershed observations
Read in cleaned WWTP series and obtain online trend estimates through the date of the first sub-sewershed observation.
Replace any missing sub-sewershed observations with the WWTP online trend estimate for the corresponding date.
Create the difference time series: sub-sewershed observed copies/L (\(\log_{10}\)) minus the WWTP online trend estimate.
Standardize the difference series by dividing by its approximate standard deviation (see the formula after this list).
Construct EWMA chart for the standardized difference series.
Inspect EWMA chart for separation.
Outputs: EWMA chart for determining separation of the sub-sewershed from the centralized WWTP.
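In symbols, writing \(y_t\) for the sub-sewershed observation (imputed from the WWTP trend when missing) and \(\hat{\mu}_t\) for the WWTP online trend estimate, the difference and standardization steps chart

\[
D_t = \frac{y_t - \hat{\mu}_t}{\sqrt{\widehat{\mathrm{Var}}(y_t) + \widehat{\mathrm{Var}}(\hat{\mu}_t) - 2\,\widehat{\mathrm{Cov}}(y_t, \hat{\mu}_t)}},
\]

where the variances are recovered from the state space interval widths and the covariance is approximated through the sample correlation of the two series, as in ww_ewma above.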