A brief guide to changepointGA

Introduction

The changepointGA package applies Genetic Algorithms (GAs) to detect both single and multiple changepoints in time series data. By encoding each candidate changepoint as an integer, changepointGA dynamically adjusts the model parameter vector to accommodate varying numbers of changepoints and optimizes the model fitness function accordingly. Additionally, it also provides the functionalities of estimating the model hyperparameters, parameters, and changepoint configurations simultaneously.

library(changepointGA)

Generate a time series data with changepoints

This package includes the time series simulation function ts.sim() with the ability to introduce changepoint effects. The time series could be simulated from a class of time series models,

Z_t = μ_t + e_t.

The model could be specified by the arguments of phi and theta, representing the autoregressive (AR) coefficients of and the moving-average (MA) coefficients for the autoregressive moving-average (ARMA) model. The default values of phi and theta are NULL, generating time series with e_t that are independent and identically N(0, σ²) distributed. If we assign values to phi and/or theta, the e_t will be generated from an ARMA(p,q) process, where p equals the number of values in phi and q equals the number of values in theta. For the above model, μ_t denotes time-varying mean function and the mean changepoint effects could be introduced through indicator functions to μ_t. Note that the changepointGA could also work with other time series model as long as the fitness function is appropriately specified. To illustrate the package usage, we will generate the e_t following an ARMA(p = 1,q = 1) process. The AR coefficient is ϕ = 0.5 and the MA coefficient is θ = 0.8. The time varying mean values are generated through the following equation,

μ_t = β₀ + Δ₁I_{[t > τ₁]} + Δ₂I_{[t > τ₂]},

including the intercept term with β₀ = 0.5 and two changepoints at τ₁ = 125 and τ₂ = 375 with Δ₁ = 2 and Δ₂ = −2. The users can also include additional covariates into matrix XMatT for other possible model dynamics, such as tread and seasonalities.

Ts = 200
betaT = c(0.5) # intercept
XMatT = matrix(rep(1, Ts), ncol=1)
colnames(XMatT) = c("intercept")
sigmaT = 1
phiT = c(0.5)
thetaT = c(0.8)
DeltaT = c(2, -2)
CpLocT = c(50, 150)

myts = ts.sim(beta=betaT, XMat=XMatT, sigma=sigmaT, phi=phiT, theta=thetaT, Delta=DeltaT, CpLoc=CpLocT, seed=1234)
str(myts)
#>  Time-Series [1:200, 1] from 1 to 200: -2.006 -2.328 -1.47 0.526 1.17 ...
#>  - attr(*, "DesignX")= num [1:200, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:3] "intercept" "" ""
#>  - attr(*, "mu")= num [1:200, 1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#>  - attr(*, "theta")= num [1:3] 0.5 2 -2
#>  - attr(*, "CpEff")= num [1:2] 2 -2
#>  - attr(*, "CpLoc")= num [1:2] 50 150
#>  - attr(*, "arEff")= num 0.5
#>  - attr(*, "maEff")= num 0.8
#>  - attr(*, "seed")= num 1234

The simulated time series could be plotted in the figure below. The vertical dashed blue lines indicate the true changepoint location and the horizontal red dashed segments represent the time series segment sample mean. We can clearly observe the mean changing effects from the introduced changepoints τ₁ = 250 and τ₂ = 500. The changepointGA package will be used to detected these two changepoints.

TsPlotCheck(Y=myts, tau=CpLocT)

The objective function for optimization

Considering the model used in the data simulation, we could use the R function, ARIMA.BIC.Order(), included in the changepointGA package. This function inputs include the

chromosome, consists of the number of changepoints, the order of AR part, the order of MA part, the changepoint locations, and a value of time series sample size plus 1 (N + 1) indicating the end of the chromosome;
plen represent the number of model order hyperparameters. We have plen=2 for this example representing one hyperparameter for AR order and one hyperparameter for MA order;
XMat requires the design matrix, including the of-interested covariates;
Xt requires the simulated time series data objects.

Give the specific chromosome, including the ARMA model AR and MA orders and the changepoint configuration information, the function ARIMA.BIC.Order() returns the model BIC value, which represents the configuration’s fitness.

ARIMA.BIC.Order(chromosome=c(2, 1, 1, 50, 150, Ts+1), plen=2, XMat=XMatT, Xt=myts)
#> [1] 644.0464

Genetic Algorithm

The genetic algorithm requires users set up a list object, including all the GA hyperparameters. As shown below, we will have 40 individual chromosomes in the population for each generation. p.range is a list object containing integer values ranges for parameters phi and theta. For example, in the code snippet below, p.range specifies that the AR and MA orders can vary from the set {0, 1, 2}. Details for each hyperparameter definition can be found via the package help documentation.

N = Ts
p.range = list(ar=c(0,2), ma=c(0,2))

GA_param = list(
  popsize      = 80,
  Pcrossover   = 0.95,
  Pmutation    = 0.3,
  Pchangepoint = 10/N,
  minDist      = 1,
  mmax         = N/2 - 1,
  lmax         = 2 + N/2 - 1,
  maxgen       = 5000,
  maxconv      = 100,
  option       = "both",
  monitoring   = FALSE,
  parallel     = FALSE,
  nCore        = NULL,
  tol          = 1e-5,
  seed         = 1234
)

Population chromosome initialization

The default argument GA_operators for the GA() function requires four operators, population, selection, crossover, and mutation. The population operator will provide a matrix like R object that store the randomly generated chromosome representations as the initial population. As with any optimization problem, better initial values can lead to improved and faster-converging results. For convenience, users can specify their preferred solutions in a matrix R object to serve as the initialized population and replace the default operator in operators, but the matrix dimensions should be lmax rows and popsize columns. Each column from this matrix represents an individual chromosome.

pop = matrix(0, nrow=2 + N/2 - 1, ncol=80)
# put the two desired candidate solutions into pop matrix
pop[1:6,1] = c(2, 1, 1, 50, 150, N+1)
pop[1:7,2] = c(3, 1, 1, 50, 100, 150, N+1)

# fill the remain 38 individuals slot from default generation
tmppop = random_population(popsize=80, prange=p.range, N=N, minDist=1, Pb=10/N, 
                           mmax=N/2 - 1, 2 + N/2 - 1)
pop[,3:80] = tmppop[,3:80]
pop[1:10, 1:10]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,]    2    3   12   16    9    9    7   11    9     7
#>  [2,]    1    1    0    2    2    0    2    0    1     0
#>  [3,]    1    1    1    0    0    0    1    1    1     0
#>  [4,]   50   50    6   13   46   15   82   19    9    27
#>  [5,]  150  100   37   28   65   22   83   42   20    32
#>  [6,]  201  150   42   48  109   23   87   67   55    41
#>  [7,]    0  201   67   68  110  112  120   81   67    61
#>  [8,]    0    0   69   73  140  130  156  107   73   155
#>  [9,]    0    0   85   76  144  140  183  117   82   162
#> [10,]    0    0  112   84  156  146  184  168   99   181

operators.self = list(population = pop,
                      selection  = "selection_linearrank",
                      crossover  = "uniformcrossover",
                      mutation   = "mutation")

Since we are trying to estimate the changepoint configurations, it is necessary to set the XMat to include all other covariates except the changepoint indicators, as if we are under a changepoint free model.

XMatEst = matrix(1, nrow=N, ncol=1)

Then we call function GA() for changepoint searching and simultaneously estimate the best fitted model AR and MA orders. The estimated number of changepoints and corresponding changepoint locations could be extracted as below.

res.changepointGA = suppressWarnings(GA(ObjFunc=ARIMA.BIC.Order, N=N, GA_param, GA_operators = operators.self, 
                                        p.range=p.range, XMat=XMatEst, Xt=myts))
fit.changepointGA = res.changepointGA$overbestfit
fit.changepointGA
#> [1] 614.0761
tau.changepointGA = res.changepointGA$overbestchrom
tau.changepointGA
#> [1]   4   1   1  50  54 149 155 201

Island version of Genetic Algorithm

The island GA model can be implemented by calling R function IslandGA() with similar arguments as in GA() function. The argument of IslandGA_param would be slightly different. Related, the initialized population is a list object, the length of such list should equal to the number of request islands. Here, we use the default operator to initialize the population, but users can also insert in their “reasonable” and “better” initial solutions to the list object as in previous section for improved and faster-converging results. The computational time is also showing below.

IslandGA_param = list(
  subpopsize   = 80,
  Islandsize   = 2,
  Pcrossover   = 0.95,
  Pmutation    = 0.15,
  Pchangepoint = 10/N,
  minDist      = 1,
  mmax         = N/2 - 1,
  lmax         = 2 + N/2 - 1,
  maxMig       = 1000,
  maxgen       = 50,
  maxconv      = 20,
  option       = "both",
  monitoring   = FALSE,
  parallel     = FALSE, ###
  nCore        = NULL,
  tol          = 1e-5,
  seed         = 1234
)
tim1 = Sys.time()
res.Island.changepointGA = suppressWarnings(IslandGA(ObjFunc=ARIMA.BIC.Order, N=N, IslandGA_param, 
                                                     p.range=p.range, XMat=XMatEst, Xt=myts))
tim2 = Sys.time()
tim2 - tim1
#> Time difference of 53.02986 secs
fit.Island.changepointGA = res.Island.changepointGA$overbestfit
fit.Island.changepointGA
#> [1] 587.0658
tau.Island.changepointGA = res.Island.changepointGA$overbestchrom
tau.Island.changepointGA
#> [1]   2   1   1  45 149 201

Parallel computing

The default setting for both GA() and IslandGA() implementations is to evaluate each changepoint configuration sequentially and the initialized population fitness evaluation is computational expensive. In some case, we could use the parallel computing techniques to speed up the fitness evaluation, which requires the R packages foreach and doParallel. The parallel computing will be more efficient when the island size ad the number of islands is larger (a larger starting population). We then could implement the parallel computing easily by setting parallel=TRUE and apply nCore=2 threads on a single computer as below (set nCore equals to the number of island is recommended).

IslandGA_param = list(
  subpopsize   = 80,
  Islandsize   = 2,
  Pcrossover   = 0.95,
  Pmutation    = 0.15,
  Pchangepoint = 10/N,
  minDist      = 1,
  mmax         = N/2 - 1,
  lmax         = 2 + N/2 - 1,
  maxMig       = 1000,
  maxgen       = 50,
  maxconv      = 20,
  option       = "both",
  monitoring   = FALSE,
  parallel     = TRUE, ###
  nCore        = 2,
  tol          = 1e-5,
  seed         = 1234
)

tim3 = Sys.time()
res.Island.changepointGA = suppressWarnings(IslandGA(ObjFunc=ARIMA.BIC.Order, N=N, IslandGA_param, 
                                                     p.range=p.range, XMat=XMatEst, Xt=myts))

tim4 = Sys.time()
tim4 - tim3
#> Time difference of 35.48881 secs
fit.Island.changepointGA = res.Island.changepointGA$overbestfit
fit.Island.changepointGA
#> [1] 587.0658
tau.Island.changepointGA = res.Island.changepointGA$overbestchrom
tau.Island.changepointGA
#> [1]   2   1   1  45 149 201

Changepoint configuration distance calculation

It is important to measure how different the estimated changepoint configuration away from the true configuration used in data generation. The pairwise distance metric proposed by Shi et al. (2022) can help quantify such differences. We implemented and included the method via the cpDist() function in this package.

true.tau = c(50, 150)
est.tau = c(tau.Island.changepointGA[4:(4+tau.Island.changepointGA[1]-1)])
cptDist(tau1=true.tau, tau2=est.tau, N=N)
#> [1] 0.03

References

Shi, X., Gallagher, C., Lund, R., & Killick, R. (2022). A comparison of single and multiple changepoint techniques for time series data. Computational Statistics & Data Analysis, 170, 107433.

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] changepointGA_0.1.0 rmarkdown_2.29     
#> 
#> loaded via a namespace (and not attached):
#>  [1] doParallel_1.0.17      cli_3.6.4              knitr_1.49            
#>  [4] rlang_1.1.5            xfun_0.51              clue_0.3-66           
#>  [7] jsonlite_1.9.0         buildtools_1.0.0       htmltools_0.5.8.1     
#> [10] maketools_1.3.2        sys_3.4.3              sass_0.4.9            
#> [13] evaluate_1.0.3         jquerylib_0.1.4        fastmap_1.2.0         
#> [16] yaml_2.3.10            foreach_1.5.2          lifecycle_1.0.4       
#> [19] cluster_2.1.8          compiler_4.4.2         codetools_0.2-20      
#> [22] Rcpp_1.0.14            digest_0.6.37          R6_2.6.1              
#> [25] parallel_4.4.2         bslib_0.9.0            tools_4.4.2           
#> [28] RcppArmadillo_14.2.3-1 iterators_1.0.14       cachem_1.1.0