The changepointGA package applies Genetic Algorithms (GAs) to detect both single and multiple changepoints in time series data. By encoding each candidate changepoint as an integer, changepointGA dynamically adjusts the model parameter vector to accommodate varying numbers of changepoints and optimizes the model fitness function accordingly. Additionally, it also provides the functionalities of estimating the model hyperparameters, parameters, and changepoint configurations simultaneously.
This package includes the time series simulation function
ts.sim()
with the ability to introduce changepoint effects.
The time series could be simulated from a class of time series
models,
Zt = μt + et.
The model could be specified by the arguments of phi
and
theta
, representing the autoregressive (AR) coefficients of
and the moving-average (MA) coefficients for the autoregressive
moving-average (ARMA) model. The default values of phi
and
theta
are NULL
, generating time series with
et that
are independent and identically N(0, σ2)
distributed. If we assign values to phi
and/or
theta
, the et will be
generated from an ARMA(p,q) process, where p equals the number of values in
phi
and q equals
the number of values in theta
. For the above model, μt denotes
time-varying mean function and the mean changepoint effects could be
introduced through indicator functions to μt. Note that
the changepointGA could also work with other time
series model as long as the fitness function is appropriately specified.
To illustrate the package usage, we will generate the et following an
ARMA(p = 1,q = 1) process. The AR coefficient
is ϕ = 0.5 and the MA
coefficient is θ = 0.8. The
time varying mean values are generated through the following
equation,
μt = β0 + Δ1I[t > τ1] + Δ2I[t > τ2],
including the intercept term with β0 = 0.5 and two
changepoints at τ1 = 125 and τ2 = 375 with Δ1 = 2 and Δ2 = −2. The users can
also include additional covariates into matrix XMatT
for
other possible model dynamics, such as tread and seasonalities.
Ts = 200
betaT = c(0.5) # intercept
XMatT = matrix(rep(1, Ts), ncol=1)
colnames(XMatT) = c("intercept")
sigmaT = 1
phiT = c(0.5)
thetaT = c(0.8)
DeltaT = c(2, -2)
CpLocT = c(50, 150)
myts = ts.sim(beta=betaT, XMat=XMatT, sigma=sigmaT, phi=phiT, theta=thetaT, Delta=DeltaT, CpLoc=CpLocT, seed=1234)
str(myts)
#> Time-Series [1:200, 1] from 1 to 200: -2.006 -2.328 -1.47 0.526 1.17 ...
#> - attr(*, "DesignX")= num [1:200, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:3] "intercept" "" ""
#> - attr(*, "mu")= num [1:200, 1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#> - attr(*, "theta")= num [1:3] 0.5 2 -2
#> - attr(*, "CpEff")= num [1:2] 2 -2
#> - attr(*, "CpLoc")= num [1:2] 50 150
#> - attr(*, "arEff")= num 0.5
#> - attr(*, "maEff")= num 0.8
#> - attr(*, "seed")= num 1234
The simulated time series could be plotted in the figure below. The vertical dashed blue lines indicate the true changepoint location and the horizontal red dashed segments represent the time series segment sample mean. We can clearly observe the mean changing effects from the introduced changepoints τ1 = 250 and τ2 = 500. The changepointGA package will be used to detected these two changepoints.
Considering the model used in the data simulation, we could use the R
function, ARIMA.BIC.Order()
, included in the
changepointGA package. This function inputs include
the
chromosome
, consists of the number of changepoints, the
order of AR part, the order of MA part, the changepoint locations, and a
value of time series sample size plus 1 (N + 1) indicating the end of the
chromosome;plen
represent the number of model order
hyperparameters. We have plen=2
for this example
representing one hyperparameter for AR order and one hyperparameter for
MA order;XMat
requires the design matrix, including the
of-interested covariates;Xt
requires the simulated time series data
objects.Give the specific chromosome, including the ARMA model AR and MA
orders and the changepoint configuration information, the function
ARIMA.BIC.Order()
returns the model BIC value, which
represents the configuration’s fitness.
The genetic algorithm requires users set up a list object, including
all the GA hyperparameters. As shown below, we will have 40 individual
chromosomes in the population for each generation. p.range
is a list object containing integer values ranges for parameters
phi
and theta
. For example, in the code
snippet below, p.range
specifies that the AR and MA orders
can vary from the set {0, 1, 2}.
Details for each hyperparameter definition can be found via the package
help documentation.
N = Ts
p.range = list(ar=c(0,2), ma=c(0,2))
GA_param = list(
popsize = 80,
Pcrossover = 0.95,
Pmutation = 0.3,
Pchangepoint = 10/N,
minDist = 1,
mmax = N/2 - 1,
lmax = 2 + N/2 - 1,
maxgen = 5000,
maxconv = 100,
option = "both",
monitoring = FALSE,
parallel = FALSE,
nCore = NULL,
tol = 1e-5,
seed = 1234
)
The default argument GA_operators
for the
GA()
function requires four operators,
population
, selection
, crossover
,
and mutation
. The population
operator will
provide a matrix like R object that store the randomly generated
chromosome representations as the initial population. As with any
optimization problem, better initial values can lead to improved and
faster-converging results. For convenience, users can specify their
preferred solutions in a matrix R object to serve as the initialized
population and replace the default operator in operators
,
but the matrix dimensions should be lmax
rows and
popsize
columns. Each column from this matrix represents an
individual chromosome.
pop = matrix(0, nrow=2 + N/2 - 1, ncol=80)
# put the two desired candidate solutions into pop matrix
pop[1:6,1] = c(2, 1, 1, 50, 150, N+1)
pop[1:7,2] = c(3, 1, 1, 50, 100, 150, N+1)
# fill the remain 38 individuals slot from default generation
tmppop = random_population(popsize=80, prange=p.range, N=N, minDist=1, Pb=10/N,
mmax=N/2 - 1, 2 + N/2 - 1)
pop[,3:80] = tmppop[,3:80]
pop[1:10, 1:10]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 2 3 12 16 9 9 7 11 9 7
#> [2,] 1 1 0 2 2 0 2 0 1 0
#> [3,] 1 1 1 0 0 0 1 1 1 0
#> [4,] 50 50 6 13 46 15 82 19 9 27
#> [5,] 150 100 37 28 65 22 83 42 20 32
#> [6,] 201 150 42 48 109 23 87 67 55 41
#> [7,] 0 201 67 68 110 112 120 81 67 61
#> [8,] 0 0 69 73 140 130 156 107 73 155
#> [9,] 0 0 85 76 144 140 183 117 82 162
#> [10,] 0 0 112 84 156 146 184 168 99 181
operators.self = list(population = pop,
selection = "selection_linearrank",
crossover = "uniformcrossover",
mutation = "mutation")
Since we are trying to estimate the changepoint configurations, it is
necessary to set the XMat
to include all other covariates
except the changepoint indicators, as if we are under a changepoint free
model.
Then we call function GA()
for changepoint searching and
simultaneously estimate the best fitted model AR and MA orders. The
estimated number of changepoints and corresponding changepoint locations
could be extracted as below.
res.changepointGA = suppressWarnings(GA(ObjFunc=ARIMA.BIC.Order, N=N, GA_param, GA_operators = operators.self,
p.range=p.range, XMat=XMatEst, Xt=myts))
fit.changepointGA = res.changepointGA$overbestfit
fit.changepointGA
#> [1] 614.0761
tau.changepointGA = res.changepointGA$overbestchrom
tau.changepointGA
#> [1] 4 1 1 50 54 149 155 201
The island GA model can be implemented by calling R function
IslandGA()
with similar arguments as in GA()
function. The argument of IslandGA_param
would be slightly
different. Related, the initialized population is a list object, the
length of such list should equal to the number of request islands. Here,
we use the default operator to initialize the population, but users can
also insert in their “reasonable” and “better” initial solutions to the
list object as in previous section for improved and faster-converging
results. The computational time is also showing below.
IslandGA_param = list(
subpopsize = 80,
Islandsize = 2,
Pcrossover = 0.95,
Pmutation = 0.15,
Pchangepoint = 10/N,
minDist = 1,
mmax = N/2 - 1,
lmax = 2 + N/2 - 1,
maxMig = 1000,
maxgen = 50,
maxconv = 20,
option = "both",
monitoring = FALSE,
parallel = FALSE, ###
nCore = NULL,
tol = 1e-5,
seed = 1234
)
tim1 = Sys.time()
res.Island.changepointGA = suppressWarnings(IslandGA(ObjFunc=ARIMA.BIC.Order, N=N, IslandGA_param,
p.range=p.range, XMat=XMatEst, Xt=myts))
tim2 = Sys.time()
tim2 - tim1
#> Time difference of 53.02986 secs
fit.Island.changepointGA = res.Island.changepointGA$overbestfit
fit.Island.changepointGA
#> [1] 587.0658
tau.Island.changepointGA = res.Island.changepointGA$overbestchrom
tau.Island.changepointGA
#> [1] 2 1 1 45 149 201
The default setting for both GA()
and
IslandGA()
implementations is to evaluate each changepoint
configuration sequentially and the initialized population fitness
evaluation is computational expensive. In some case, we could use the
parallel computing techniques to speed up the fitness evaluation, which
requires the R packages foreach and
doParallel. The parallel computing will be more
efficient when the island size ad the number of islands is larger (a
larger starting population). We then could implement the parallel
computing easily by setting parallel=TRUE
and apply
nCore=2
threads on a single computer as below (set
nCore
equals to the number of island is recommended).
IslandGA_param = list(
subpopsize = 80,
Islandsize = 2,
Pcrossover = 0.95,
Pmutation = 0.15,
Pchangepoint = 10/N,
minDist = 1,
mmax = N/2 - 1,
lmax = 2 + N/2 - 1,
maxMig = 1000,
maxgen = 50,
maxconv = 20,
option = "both",
monitoring = FALSE,
parallel = TRUE, ###
nCore = 2,
tol = 1e-5,
seed = 1234
)
tim3 = Sys.time()
res.Island.changepointGA = suppressWarnings(IslandGA(ObjFunc=ARIMA.BIC.Order, N=N, IslandGA_param,
p.range=p.range, XMat=XMatEst, Xt=myts))
tim4 = Sys.time()
tim4 - tim3
#> Time difference of 35.48881 secs
fit.Island.changepointGA = res.Island.changepointGA$overbestfit
fit.Island.changepointGA
#> [1] 587.0658
tau.Island.changepointGA = res.Island.changepointGA$overbestchrom
tau.Island.changepointGA
#> [1] 2 1 1 45 149 201
It is important to measure how different the estimated changepoint
configuration away from the true configuration used in data generation.
The pairwise distance metric proposed by Shi et al. (2022) can help
quantify such differences. We implemented and included the method via
the cpDist()
function in this package.
Shi, X., Gallagher, C., Lund, R., & Killick, R. (2022). A comparison of single and multiple changepoint techniques for time series data. Computational Statistics & Data Analysis, 170, 107433.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] changepointGA_0.1.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] doParallel_1.0.17 cli_3.6.4 knitr_1.49
#> [4] rlang_1.1.5 xfun_0.51 clue_0.3-66
#> [7] jsonlite_1.9.0 buildtools_1.0.0 htmltools_0.5.8.1
#> [10] maketools_1.3.2 sys_3.4.3 sass_0.4.9
#> [13] evaluate_1.0.3 jquerylib_0.1.4 fastmap_1.2.0
#> [16] yaml_2.3.10 foreach_1.5.2 lifecycle_1.0.4
#> [19] cluster_2.1.8 compiler_4.4.2 codetools_0.2-20
#> [22] Rcpp_1.0.14 digest_0.6.37 R6_2.6.1
#> [25] parallel_4.4.2 bslib_0.9.0 tools_4.4.2
#> [28] RcppArmadillo_14.2.3-1 iterators_1.0.14 cachem_1.1.0