Title: | Multivariate Time Series Data Imputation |
---|---|
Description: | This is an EM-algorithm-based method for imputation of missing values in multivariate normal time series. The imputation algorithm accounts for both spatial and temporal correlation structures. Temporal patterns can be modeled using an ARIMA(p,d,q) model, optionally with seasonal components, a non-parametric cubic spline, or generalized additive models with exogenous covariates. This algorithm is specially tailored for climate data with missing measurements from several monitors across a given region. |
Authors: | Washington Junger |
Maintainer: | Washington Junger |
License: | GPL (>= 2) |
Version: | 0.3.6 |
Built: | 2025-02-14 05:26:19 UTC |
Source: | https://github.com/wjunger/mtsdi |
Prepare the dataset for exploratory data analysis
    edaprep(dataset)
dataset | dataset with missing observations |
It replaces missing observations with the vector mean.
It returns dataset with the NA values filled in.
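The mean replacement described above can be illustrated with a minimal base R sketch. The helper `fill_with_column_mean` and the toy data are hypothetical and are not part of the package; the sketch assumes that "vector" refers to each column of the dataset.

```r
# Hypothetical helper (not the package's code): replace each NA by the
# mean of the observed values in the same column.
fill_with_column_mean <- function(dataset) {
  filled <- as.data.frame(dataset)
  for (j in seq_along(filled)) {
    col_mean <- mean(filled[[j]], na.rm = TRUE)
    filled[[j]][is.na(filled[[j]])] <- col_mean
  }
  filled
}

# Toy data with one missing value per column
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))
filled <- fill_with_column_mean(toy)
print(filled)
```

A fill of this kind is convenient for exploratory plots but shrinks the variance of each column, so it is not a substitute for the model-based imputation.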
Washington Junger and Antonio Ponce de Leon
    data(miss)
    c <- edaprep(miss)
Compute the elapsed time between start time and end time
    elapsedtime(st, et)
st | starting time |
et | ending time |
It returns the time the process took to run.
String of the form hh:mm:ss
It is not intended to be called directly.
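As an illustration of the hh:mm:ss output format, here is a small base R sketch. The function `format_elapsed` is a hypothetical stand-in, and it assumes that the start and end times are `POSIXct` values; the package's internal function may work differently.

```r
# Hypothetical stand-in: format the difference between two POSIXct
# times as an hh:mm:ss string.
format_elapsed <- function(st, et) {
  secs <- as.integer(round(as.numeric(difftime(et, st, units = "secs"))))
  sprintf("%02d:%02d:%02d",
          secs %/% 3600L,            # whole hours
          (secs %% 3600L) %/% 60L,   # remaining whole minutes
          secs %% 60L)               # remaining seconds
}

st <- as.POSIXct("2025-01-01 00:00:00", tz = "UTC")
et <- st + 3725  # 1 hour, 2 minutes and 5 seconds later
elapsed <- format_elapsed(st, et)
print(elapsed)
```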
Washington Junger and Antonio Ponce de Leon
Estimate the row mean from an mtsdi object regarding a fixed number of imputed values
    getmean(object, weighted=TRUE, mincol=1, maxconsec=3)
object | imputation object |
weighted | If |
mincol | integer for the minimum number of valid values by row |
maxconsec | integer for the maximum number of consecutive missing values in a column |
It is useful when one wants the row means estimated. If a log transformation was used, the mean is adjusted accordingly.
A vector of row means with length n, where n is the number of observations.
Washington Junger and Antonio Ponce de Leon
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    m <- getmean(i,2)
A small sample dataset for the tutorial on data imputation
    data(miss)
A data frame with 24 observations on the following 5 variables.
c31: a numeric vector with 1 missing observation
c32: a numeric vector with 1 missing observation
c33: a numeric vector with 6 missing observations
c34: a numeric vector with 3 missing observations
c35: a numeric vector with 3 missing observations
    data(miss)
Create a data matrix from Johnson & Wichern's book
    mkjnw()
This function creates a data matrix from Johnson & Wichern's book.
It returns a data matrix.
Washington Junger and Antonio Ponce de Leon
Johnson, R., Wichern, D. (1998) Applied Multivariate Statistical Analysis. Prentice Hall.
    d <- mkjnw()
Perform the modified EM algorithm imputation on a normal multivariate dataset
    mnimput(formula, dataset, by = NULL, log = FALSE, log.offset = 1,
            eps = 1e-3, maxit = 1e2, ts = TRUE, method = "spline",
            sp.control = list(df = NULL, weights = NULL),
            ar.control = list(order = NULL, period = NULL),
            ga.control = list(formula, weights = NULL),
            f.eps = 1e-6, f.maxit = 1e3, ga.bf.eps = 1e-6, ga.bf.maxit = 1e3,
            verbose = FALSE, digits = getOption("digits"))
formula | formula indicating the missing data frame, for instance, |
dataset | data with missing values to be imputed |
by | factor for variance windows. Default is NULL |
log | logical. If |
log.offset | If |
eps | stop criterion |
maxit | maximum number of iterations |
ts | logical. |
method | method for univariate time series filtering. It may be "spline", "arima" or "gam". See Details |
sp.control | list for spline smoothing control. See Details |
ar.control | list for ARIMA fitting control. See Details |
ga.control | list for GAM fitting control. See Details |
f.eps | convergence criterion for the ARIMA filter. See arima |
f.maxit | maximum number of iterations for the ARIMA filter. See arima |
ga.bf.eps | convergence criterion for the backfitting algorithm of GAM models. See gam |
ga.bf.maxit | maximum number of iterations for the backfitting algorithm of GAM models. See gam |
verbose | if |
digits | an integer indicating the decimal places. If not supplied, it is taken from getOption("digits") |
This is a modified version of the EM algorithm for imputation of missing values. It is also applicable to time series data. When the time series attribute is made explicit through the argument ts, missing values are estimated accounting for both the correlation between the time series and the time structure of each series itself. Several filters can be used for prediction of the mean vector in the E-step.
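To make the role of the mean vector in the E-step concrete, the sketch below shows the textbook conditional expectation of the missing entries of one multivariate normal observation given its observed entries. It is only an illustration under that assumption: the helper `impute_row` and the toy parameters are hypothetical, and the package's modified algorithm additionally replaces the constant mean by the level predicted by the chosen time series filter.

```r
# Hypothetical helper: fill the NA entries of one observation x with
# E[x_miss | x_obs] = mu_m + S_mo %*% solve(S_oo) %*% (x_obs - mu_o)
impute_row <- function(x, mu, sigma) {
  miss <- is.na(x)
  if (!any(miss)) return(x)   # nothing to impute
  if (all(miss)) return(mu)   # no information: fall back to the mean
  s_mo <- sigma[miss, !miss, drop = FALSE]   # cov(missing, observed)
  s_oo <- sigma[!miss, !miss, drop = FALSE]  # cov(observed, observed)
  x[miss] <- mu[miss] +
    as.numeric(s_mo %*% solve(s_oo, x[!miss] - mu[!miss]))
  x
}

# Toy parameters: two series with unit variance and correlation 0.5
mu <- c(0, 0)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
imputed <- impute_row(c(2, NA), mu, sigma)
print(imputed)
```

With these toy values the second entry is imputed as 0 + 0.5 * (2 - 0) = 1, which shows how the correlation between series carries information from the observed monitor to the missing one.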
One can select the method for the univariate time series filtering through the argument method. The default method is "spline". In this case a smoothing spline is fitted to each of the time series at each iteration. Some parameters can be passed to smooth.spline through sp.control. df is a vector as long as the number of columns in dataset holding the fixed degrees of freedom of the splines. If NULL, the degrees of freedom of each spline are chosen by cross-validation. If df has length 1, this value is recycled for all the covariates. weights must be a matrix of the same size as dataset with the weights for smooth.spline. If NULL, all the observations will have equal weights.
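The effect of a fixed df can be seen by calling smooth.spline directly on a single series. This is a sketch with simulated data, not the package's internal code; the series, the positions of the missing values and the choice df = 7 are assumptions made for the illustration.

```r
set.seed(1)
t_idx <- 1:48
# Simulated monthly series with an annual cycle and two missing values
y <- sin(2 * pi * t_idx / 12) + rnorm(48, sd = 0.2)
y[c(10, 25)] <- NA

# smooth.spline does not accept NA, so fit on the observed points only
obs <- !is.na(y)
fit <- smooth.spline(t_idx[obs], y[obs], df = 7)

# Predict the smooth level at every time point, including the gaps
level <- predict(fit, t_idx)$y
print(round(level[c(10, 25)], 3))
```

A larger df lets the level follow the series more closely, while a smaller df gives a smoother level.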
Another possibility for time series filtering is to fit an ARIMA model to each of the time series by setting method to "arima". The ARIMA models must, nonetheless, be identified before using this function. The arima function can be partially controlled through ar.control. Each column of order must hold the corresponding (p, d, q) parameters for each univariate time series if period is NULL. If period is not NULL, order must also hold the multiplicative seasonality parameters, so each column of order takes the form (p, d, q, P, D, Q). period is the multiplicative seasonality period. f.eps and f.maxit control the convergence of the ARIMA fitting algorithm. Convergence problems due to non-stationarity may arise when using this option.
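The following sketch shows how one column of order, written as (p, d, q, P, D, Q), could be mapped onto a call to stats::arima. The simulated series, the vector `spec` and the value of `period` are hypothetical, and the mapping is an assumption about how the parameters are used rather than the package's actual code.

```r
set.seed(42)
# Simulated AR(1) series with two missing values; arima() handles NA
x <- as.numeric(arima.sim(list(ar = 0.6), n = 120))
x[c(15, 60)] <- NA

# One hypothetical column of 'order': (p, d, q, P, D, Q)
spec <- c(1, 0, 0, 0, 0, 0)
period <- 12

fit <- arima(x,
             order = spec[1:3],                   # non-seasonal (p, d, q)
             seasonal = list(order = spec[4:6],   # seasonal (P, D, Q)
                             period = period))
print(coef(fit))
```

Identifying each model beforehand, for instance by inspecting autocorrelation functions, helps avoid the convergence problems mentioned above.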
Last but not least, a very interesting approach to modelling temporal patterns is to use a full-fledged regression model. It is possible to use generalised additive (or linear) models with exogenous variates for proper filtering of time patterns. One must set method to "gam" and supply a vector of formulas in ga.control. One must supply one formula for each covariate. Using covariates that are part of the formula of the imputation model may yield some collinearity among the variates. See gam and glm for details. In order to use regression models for the level, set method to "gam".
Simulations have shown that the algorithm is stable and yields good results on imputation of normal data.
The function returns an object of class mtsdi containing
call | function call |
dataset | imputed dataset |
muhat | estimated mean vector |
sigmahat | estimated covariance matrix |
missings | vector holding the number of missing values on each row |
iterations | number of iterations until convergence or until maxit is reached |
convergence | convergence value. See Details |
converged | a logical indicating if the algorithm converged |
time | elapsed time of the process |
Washington Junger and Antonio Ponce de Leon
Junger, W.L. and Ponce de Leon, A. (2015) Imputation of Missing Data in Time Series for Air Pollutants. Atmospheric Environment, 102, 96-104.
Johnson, R., Wichern, D. (1998) Applied Multivariate Statistical Analysis. Prentice Hall.
Dempster, A., Laird, N., Rubin, D. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
McLachlan, G. J., Krishnan, T. (1997) The EM algorithm and extensions. John Wiley and Sons.
Box, G., Jenkins, G., Reinsel, G. (1994) Time Series Analysis: Forecasting and Control. 3 ed. Prentice Hall.
Hastie, T. J., Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.
mnimput, predict.mtsdi, edaprep
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    ## one-window covariance
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    summary(i)
    ## two-window covariances
    b <- c(rep("year1",12), rep("year2",12))
    ii <- mnimput(f, miss, by=b, eps=1e-3, ts=TRUE, method="spline",
                  sp.control=list(df=c(7,7,7,7,7)))
    summary(ii)
Compute some statistics from the incomplete dataset
    mstats(dataset)
dataset | dataset with missing values to be described |
This function computes the proportion of missing observations in a given dataset by rows and columns.
A list containing
rows | number of missing in each row |
columns | number of missing in each column |
pattern | the pattern of the missing values |
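The row and column summaries described above can be reproduced in spirit with base R. This is a sketch on hypothetical toy data, not the package's implementation, and the exact form of the pattern element returned by the package is not assumed here.

```r
# Toy data: 4 observations on 2 variables with some missing values
toy <- data.frame(a = c(1, NA, 3, NA), b = c(NA, 2, 3, 4))

na_mask <- is.na(toy)          # logical matrix marking missing cells
rows <- rowSums(na_mask)       # number of missing values in each row
columns <- colSums(na_mask)    # number of missing values in each column
col_prop <- colMeans(na_mask)  # proportion of missing values by column

print(rows)
print(columns)
print(col_prop)
```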
Washington Junger and Antonio Ponce de Leon
    data(miss)
    mstats(miss)
This function produces a plot with imputed values and the estimated level for each of the columns in the imputed matrix.
    ## S3 method for class 'mtsdi'
    plot(x, vars = "all", overlay = TRUE, level = TRUE, points = FALSE,
         leg.loc = "topright", horiz = FALSE, at.once = FALSE, ...)
x | an object of the class mtsdi |
vars | a vector with the variables to plot |
overlay | logical. If |
level | logical. If |
points | logical. If |
leg.loc | a list with |
horiz | logical. If |
at.once | logical. If |
... | further options for function |
The leg.loc option may also be specified by setting one of the following quoted strings: "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", or "center". This places the legend on the inside of the plot frame at the given location with the orientation set by horiz. See legend for further details.
Washington Junger and Antonio Ponce de Leon
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    plot(i)
Extract imputed dataset from an mtsdi object
    ## S3 method for class 'mtsdi'
    predict(object, ...)
object | imputation object |
... | further options passed to the generic function |
If a log transformation was used, the dataset is back-transformed accordingly.
A vector of row means with length n, where n is the number of observations.
Washington Junger and Antonio Ponce de Leon
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    predict(i)
Printing method for the imputation model
    ## S3 method for class 'mtsdi'
    print(x, digits = getOption("digits"), ...)
x | an object of class mtsdi |
digits | an integer indicating the decimal places. If not supplied, it is taken from getOption("digits") |
... | further options passed to |
This function does not return a value.
Washington Junger and Antonio Ponce de Leon
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    print(i)
Printing method for the summary
    ## S3 method for class 'summary.mtsdi'
    print(x, digits = getOption("digits"), print.models = TRUE, ...)
x | an object of class summary.mtsdi |
print.models | a logical indicating that time filtering models should also be printed |
digits | an integer indicating the decimal places. If not supplied, it is taken from getOption("digits") |
... | further options passed from |
This function does not return a value.
Washington Junger and Antonio Ponce de Leon
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    summary(i)
Print summary information on the imputation object
    ## S3 method for class 'mtsdi'
    summary(object, ...)
object | an object of class mtsdi |
... | further options passed to |
The function returns a list containing
call | function call |
muhat | estimated mean vector |
sigmahat | estimated covariance matrix |
iterations | number of iterations used |
convergence | relative difference of covariance determinant reached |
time | time used in the process |
models | details on the models used for time filtering |
log | a logical indicating that data are log transformed |
log.offset | offset used in the log transformation in order to avoid zeros |
Washington Junger and Antonio Ponce de Leon
    data(miss)
    f <- ~c31+c32+c33+c34+c35
    i <- mnimput(f, miss, eps=1e-3, ts=TRUE, method="spline",
                 sp.control=list(df=c(7,7,7,7,7)))
    summary(i)