Title: | Inference for Released Plug-in Sampling Single Synthetic Dataset |
---|---|
Description: | Considering the singly imputed synthetic data generated via plug-in sampling under the multivariate normal model, draws inference procedures including the generalized variance, the sphericity test, the test for independence between two subsets of variables, and the test for the regression of one set of variables on the other. For more details see Klein et al. (2021) <doi:10.1007/s13571-019-00215-9>. |
Authors: | Ricardo Moura [aut, cre, cph] , Mina Norouzirad [aut] , Vítor Augusto [aut] , Miguel Fonseca [ctb] , FCT, I.P. [fnd] (under the scope of the projects UIDB/00297/2020 and UIDP/00297/2020 (NovaMath)) |
Maintainer: | Ricardo Moura <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.1 |
Built: | 2024-11-19 05:40:28 UTC |
Source: | https://github.com/ricardomourarpm/psinference |
This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures for the regression of one subset of variables on the other based on the released Single Synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.
canodist(part, nsample, pvariates, iterations)
canodist(part, nsample, pvariates, iterations)
part |
Number of variables in the first subset. |
nsample |
Sample size. |
pvariates |
Number of variables. |
iterations |
Number of iterations for simulating values from the distribution and finding the quantiles. Default is |
We define
where ,
is the
th observation of the synthetic dataset,
considering
partitioned as
For ,
where
is partitioned the same way as
its distribution is stochastic equivalent to
where ,
and
partitioned in the same way as
.
To test
, compute the value
of
,
, with the observed
values and reject the null hypothesis if
for
-significance level, where
is the
th percentile of
.
a vector of length iterations
that recorded the empirical distribution's values.
Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.
# generate original data library(MASS) n_sample = 100 p = 4 mu <- c(1,2,3,4) Sigma = matrix(c(1, 0.5, 0.1, 0.7, 0.5, 2, 0.4, 0.9, 0.1, 0.4, 3, 0.2, 0.7, 0.9, 0.2, 4), nr = 4, nc = 4, byrow = TRUE) df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # generate synthetic data df_s = simSynthData(df) #Decompose Sigma and Sstar part = 2 Sigma_12 = partition(Sigma,nrows = part, ncol = part)[[2]] Sigma_22 = partition(Sigma,nrows = part, ncol = part)[[4]] Delta0 = Sigma_12 %*% solve(Sigma_22) Sstar = cov(df_s) Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]] Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]] Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]] Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]] DeltaEst = Sstar_12 %*% solve(Sstar_22) Sstar11_2 = Sstar_11 - Sstar_12 %*% solve(Sstar_22) %*% Sstar_21 T4_obs = det((DeltaEst-Delta0)%*%Sstar_22%*%t(DeltaEst-Delta0))/det(Sstar11_2) T4 <- canodist(part = part, nsample = n_sample, pvariates = p, iterations = 10000) q95 <- quantile(T4, 0.95) T4_obs > q95 #False means that we don't have statistical evidences to reject Delta0 print(T4_obs) print(q95) # When the observed value is smaller than the 95% quantile, # we don't have statistical evidences to reject the Sphericity property. # # Note that the value is very close to zero
# generate original data library(MASS) n_sample = 100 p = 4 mu <- c(1,2,3,4) Sigma = matrix(c(1, 0.5, 0.1, 0.7, 0.5, 2, 0.4, 0.9, 0.1, 0.4, 3, 0.2, 0.7, 0.9, 0.2, 4), nr = 4, nc = 4, byrow = TRUE) df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # generate synthetic data df_s = simSynthData(df) #Decompose Sigma and Sstar part = 2 Sigma_12 = partition(Sigma,nrows = part, ncol = part)[[2]] Sigma_22 = partition(Sigma,nrows = part, ncol = part)[[4]] Delta0 = Sigma_12 %*% solve(Sigma_22) Sstar = cov(df_s) Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]] Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]] Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]] Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]] DeltaEst = Sstar_12 %*% solve(Sstar_22) Sstar11_2 = Sstar_11 - Sstar_12 %*% solve(Sstar_22) %*% Sstar_21 T4_obs = det((DeltaEst-Delta0)%*%Sstar_22%*%t(DeltaEst-Delta0))/det(Sstar11_2) T4 <- canodist(part = part, nsample = n_sample, pvariates = p, iterations = 10000) q95 <- quantile(T4, 0.95) T4_obs > q95 #False means that we don't have statistical evidences to reject Delta0 print(T4_obs) print(q95) # When the observed value is smaller than the 95% quantile, # we don't have statistical evidences to reject the Sphericity property. # # Note that the value is very close to zero
This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures for the Generalized Variance of the released Single Synthetic dataset generated under Plug-in Sampling, assuming that the original distribution is normally distributed.
GVdist(nsample, pvariates, iterations = 10000)
GVdist(nsample, pvariates, iterations = 10000)
nsample |
Sample size. |
pvariates |
Number of variables. |
iterations |
Number of iterations for simulating values from the distribution and finding the quantiles. Default is |
We define
where ,
is the population covariance matrix
and
is the
th observation of the synthetic dataset.
Its distribution is stochastic equivalent to
where are all independent chi-square random variables.
The
level confidence interval for
is given by
where is the observed value of
and
is the
th percentile of
.
a vector of length iterations
that recorded the empirical distribution's values.
Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.
# Original data creation library(MASS) mu <- c(1,2,3,4) Sigma <- matrix(c(1, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 1), nrow = 4, ncol = 4, byrow = TRUE) seed = 1 n_sample = 100 # Create original simulated dataset df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # Synthetic data created df_s = simSynthData(df) # Gather the 0.025 and 0.975 quantiles and construct confident interval for sigma^2 # Check that sigma^2 is inside in both cases p = dim(df_s)[2] T <- GVdist(100, p, 10000) q975 <- quantile(T, 0.975) q025 <- quantile(T, 0.025) left <- (n_sample-1)^p * det(cov(df_s)*(n_sample-1))/q975 right <- (n_sample-1)^p * det(cov(df_s)*(n_sample-1))/q025 cat(left,right,'\n') print(det(Sigma)) # The synthetic value is inside the confidence interval of GV
# Original data creation library(MASS) mu <- c(1,2,3,4) Sigma <- matrix(c(1, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 1), nrow = 4, ncol = 4, byrow = TRUE) seed = 1 n_sample = 100 # Create original simulated dataset df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # Synthetic data created df_s = simSynthData(df) # Gather the 0.025 and 0.975 quantiles and construct confident interval for sigma^2 # Check that sigma^2 is inside in both cases p = dim(df_s)[2] T <- GVdist(100, p, 10000) q975 <- quantile(T, 0.975) q025 <- quantile(T, 0.025) left <- (n_sample-1)^p * det(cov(df_s)*(n_sample-1))/q975 right <- (n_sample-1)^p * det(cov(df_s)*(n_sample-1))/q025 cat(left,right,'\n') print(det(Sigma)) # The synthetic value is inside the confidence interval of GV
This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures and test the independence of two subsets of variables based on the released Single Synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.
Inddist(part, nsample, pvariates, iterations)
Inddist(part, nsample, pvariates, iterations)
part |
Number of variables in the first subset. |
nsample |
Sample size. |
pvariates |
Number of variables. |
iterations |
Number of iterations for simulating values from the distribution and finding the quantiles. Default is |
We define
where ,
is the
th observation of the synthetic dataset,
considering
partitioned as
Under the assumption that ,
its distribution is stochastic equivalent to
where ,
and
partitioned in the same way as
.
To test
,
compute the value of
,
,
with the observed values and reject the null hypothesis if
for
-significance level, where
is the
th percentile of
.
a vector of length iterations
that recorded the empirical distribution's values.
Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.
#generate original data with two independent subsets of variables library(MASS) n_sample = 100 p = 4 mu <- c(1,2,3,4) Sigma = matrix(c(1, 0.5, 0, 0, 0.5, 2, 0, 0, 0, 0, 3, 0.2, 0, 0, 0.2, 4), nr = 4, nc = 4, byrow = TRUE) df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # generate synthetic data df_s = simSynthData(df) #Decompose Sstar in 4 parts part = 2 Sstar = cov(df_s) Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]] Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]] Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]] Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]] #Compute observed T3_star T3_obs = det(Sstar)/(det(Sstar_11)*det(Sstar_22)) alpha = 0.05 # colect the quantile from the distribution assuming independence between the two subsets T3 <- Inddist(part = part, nsample = n_sample, pvariates = p, iterations = 10000) q5 <- quantile(T3, alpha) T3_obs < q5 #False means that we don't have statistical evidences to reject independence print(T3_obs) print(q5) # Note that the value of the observed T3_obs is close to one as expected
#generate original data with two independent subsets of variables library(MASS) n_sample = 100 p = 4 mu <- c(1,2,3,4) Sigma = matrix(c(1, 0.5, 0, 0, 0.5, 2, 0, 0, 0, 0, 3, 0.2, 0, 0, 0.2, 4), nr = 4, nc = 4, byrow = TRUE) df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # generate synthetic data df_s = simSynthData(df) #Decompose Sstar in 4 parts part = 2 Sstar = cov(df_s) Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]] Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]] Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]] Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]] #Compute observed T3_star T3_obs = det(Sstar)/(det(Sstar_11)*det(Sstar_22)) alpha = 0.05 # colect the quantile from the distribution assuming independence between the two subsets T3 <- Inddist(part = part, nsample = n_sample, pvariates = p, iterations = 10000) q5 <- quantile(T3, alpha) T3_obs < q5 #False means that we don't have statistical evidences to reject independence print(T3_obs) print(q5) # Note that the value of the observed T3_obs is close to one as expected
This function split a matrix into a list of blocks (either by rows and columns).
partition(Matrix, nrows, ncols)
partition(Matrix, nrows, ncols)
Matrix |
a matrix to split . |
nrows |
positive integer indicating the number of rows blocks. |
ncols |
positive integer indicating the number of columns blocks. |
a list of partitioned submatrices
Mat = matrix(c(1,0.5,0,0, 0.5,2,0,0, 0,0,3,0.2, 0, 0, 0.2,4), nrow = 4, ncol = 4, byrow = TRUE) partition(Matrix = Mat, nrows = 2, ncols = 2)
Mat = matrix(c(1,0.5,0,0, 0.5,2,0,0, 0,0,3,0.2, 0, 0, 0.2,4), nrow = 4, ncol = 4, byrow = TRUE) partition(Matrix = Mat, nrows = 2, ncols = 2)
This function is used to generate a single synthetic version of the original data via Plug-in Sampling.
simSynthData(X, n_imp = dim(X)[1])
simSynthData(X, n_imp = dim(X)[1])
X |
matrix or dataframe |
n_imp |
sample size |
Assume that is the original data, assumed to be normally distributed,
we compute
as the sample mean and
as the sample covariance matrix,
where
is the sample Wishart matrix.
We generate
, by drawing
a matrix of generated dataset
Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.
library(MASS) n_sample = 1000 mu=c(0,0,0,0) Sigma=diag(1,4,4) # Create original simulated dataset df_o = mvrnorm(n_sample, mu, Sigma) # Create singly imputed synthetic dataset df_s = simSynthData(df_o) #Estimators synthetic mean_s <- colMeans(df_s) S_s <- (t(df_s)- mean_s) %*% t(t(df_s)- mean_s) # careful about this computation # mean_o is a column vector and if you are thinking as n X p matrices and # row vectors you should be aware of this. print(mean_s) print(S_s/(dim(df_s)[1]-1))
library(MASS) n_sample = 1000 mu=c(0,0,0,0) Sigma=diag(1,4,4) # Create original simulated dataset df_o = mvrnorm(n_sample, mu, Sigma) # Create singly imputed synthetic dataset df_s = simSynthData(df_o) #Estimators synthetic mean_s <- colMeans(df_s) S_s <- (t(df_s)- mean_s) %*% t(t(df_s)- mean_s) # careful about this computation # mean_o is a column vector and if you are thinking as n X p matrices and # row vectors you should be aware of this. print(mean_s) print(S_s/(dim(df_s)[1]-1))
This function calculates the empirical distribution of the pivotal random
variable that can be used to perform the Sphericity test of the population covariance matrix
that is
,
based on the released Single Synthetic data generated under Plug-in Sampling,
assuming that the original dataset is normally distributed.
Sphdist(nsample, pvariates, iterations)
Sphdist(nsample, pvariates, iterations)
nsample |
Sample size. |
pvariates |
Number of variables. |
iterations |
Number of iterations for simulating values from the
distribution and finding the quantiles. Default is |
We define
where ,
is the
th observation of the synthetic dataset.
For
, its distribution is
stochastic equivalent to
where and
are
Wishart random variables,
is independent of
.
To test
, compute the observed value of
,
, with the observed values
and reject the null hypothesis if
for
-significance level, where
is the
th percentile of
.
a vector of length iterations
that recorded the empirical distribution's values.
Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.
# Original data created library(MASS) mu <- c(1,2,3,4) Sigma <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1), nrow = 4, ncol = 4, byrow = TRUE) seed = 1 n_sample = 100 # Create original simulated dataset df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # Synthetic data created df_s = simSynthData(df) # Gather the 0.95 quantile p = dim(df_s)[2] T_sph <- Sphdist(nsample = n_sample, pvariates = p, iterations = 10000) q95 <- quantile(T_sph, 0.95) # Compute the observed value of T from the synthetic dataset S_star = cov(df_s*(n_sample-1)) T_obs = (det(S_star)^(1/p))/(sum(diag(S_star))/p) print(q95) print(T_obs) #Since the observed value is bigger than the 95% quantile, #we don't have statistical evidences to reject the Sphericity property. # #Note that the value is very close to one
# Original data created library(MASS) mu <- c(1,2,3,4) Sigma <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1), nrow = 4, ncol = 4, byrow = TRUE) seed = 1 n_sample = 100 # Create original simulated dataset df = mvrnorm(n_sample, mu = mu, Sigma = Sigma) # Synthetic data created df_s = simSynthData(df) # Gather the 0.95 quantile p = dim(df_s)[2] T_sph <- Sphdist(nsample = n_sample, pvariates = p, iterations = 10000) q95 <- quantile(T_sph, 0.95) # Compute the observed value of T from the synthetic dataset S_star = cov(df_s*(n_sample-1)) T_obs = (det(S_star)^(1/p))/(sum(diag(S_star))/p) print(q95) print(T_obs) #Since the observed value is bigger than the 95% quantile, #we don't have statistical evidences to reject the Sphericity property. # #Note that the value is very close to one