Package 'PSinference'

Title: Inference for Released Plug-in Sampling Single Synthetic Dataset
Description: Considering the singly imputed synthetic data generated via plug-in sampling under the multivariate normal model, draws inference procedures including the generalized variance, the sphericity test, the test for independence between two subsets of variables, and the test for the regression of one set of variables on the other. For more details see Klein et al. (2021) <doi:10.1007/s13571-019-00215-9>.
Authors: Ricardo Moura [aut, cre, cph] , Mina Norouzirad [aut] , Vítor Augusto [aut] , Miguel Fonseca [ctb] , FCT, I.P. [fnd] (under the scope of the projects UIDB/00297/2020 and UIDP/00297/2020 (NovaMath))
Maintainer: Ricardo Moura <[email protected]>
License: GPL (>= 2)
Version: 0.2.1
Built: 2024-11-19 05:40:28 UTC
Source: https://github.com/ricardomourarpm/psinference

Help Index


Canonical Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures for the regression of one subset of variables on the other based on the released Single Synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.

Usage

canodist(part, nsample, pvariates, iterations)

Arguments

part

Number of variables in the first subset.

nsample

Sample size.

pvariates

Number of variables.

iterations

Number of iterations for simulating values from the distribution and finding the quantiles. Default is 10000.

Details

We define

T4Δ=(S12(S22)1Δ)S22(S12)(S22)1Δ)S11.2T_4^\star|\boldsymbol{\Delta} = \frac{(|\boldsymbol{S}^{\star}_{12} (\boldsymbol{S}^{\star}_{22})^{-1}-\boldsymbol{\Delta}) \boldsymbol{S}^{\star}_{22}(\boldsymbol{S}^{\star}_{12}) (\boldsymbol{S}^{\star}_{22})^{-1}-\boldsymbol{\Delta})^\top|} {|\boldsymbol{S}^{\star}_{11.2}|}

where S=i=1n(vivˉ)(vivˉ)\boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}, viv_i is the iith observation of the synthetic dataset, considering S\boldsymbol{S}^\star partitioned as

S=[S11S12S21S22].\boldsymbol{S}^{\star}=\left[\begin{array}{lll} \boldsymbol{S}^{\star}_{11}& \boldsymbol{S}^{\star}_{12}\\ \boldsymbol{S}^{\star}_{21} & \boldsymbol{S}^{\star}_{22} \end{array}\right].

For Δ=Σ12Σ221\Delta = \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}, where Σ\boldsymbol{\Sigma} is partitioned the same way as S\boldsymbol{S}^{\star} its distribution is stochastic equivalent to

Ω12Ω221Ω21Ω11Ω12Ω221Ω21\frac{|\boldsymbol{\Omega}_{12}\boldsymbol{\Omega}_{22}^{-1} \boldsymbol{\Omega}_{21}|}{|\boldsymbol{\Omega}_{11}-\boldsymbol{\Omega}_{12} \boldsymbol{\Omega}_{22}^{-1}\boldsymbol{\Omega}_{21}|}

where ΩWp(n1,Wn1)\boldsymbol{\Omega} \sim \mathcal{W}_p(n-1, \frac{\boldsymbol{W}}{n-1}), WWp(n1,Ip)\boldsymbol{W} \sim \mathcal{W}_p(n-1, \mathbf{I}_p) and Ω\boldsymbol{\Omega} partitioned in the same way as S\boldsymbol{S}^{\star}. To test H0:Δ=Δ0\mathcal{H}_0: \boldsymbol{\Delta} =\boldsymbol{\Delta}_0, compute the value of T4T_{4}^\star, T4~\widetilde{T_{4}^\star}, with the observed values and reject the null hypothesis if T4~>t4,1α\widetilde{T_{4}^\star}>t^\star_{4,1-\alpha} for α\alpha-significance level, where t4,γt^\star_{4,\gamma} is the γ\gammath percentile of T4T_4^\star.

Value

a vector of length iterations that recorded the empirical distribution's values.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

# generate original data
library(MASS)
n_sample = 100
p = 4
mu <- c(1,2,3,4)
Sigma = matrix(c(1,   0.5, 0.1, 0.7,
                 0.5,   2, 0.4, 0.9,
                 0.1, 0.4,   3, 0.2,
                 0.7, 0.9, 0.2,   4), nr = 4, nc = 4, byrow = TRUE)

df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)
# generate synthetic data
df_s = simSynthData(df)
#Decompose Sigma and Sstar
part = 2
Sigma_12 = partition(Sigma,nrows = part, ncol = part)[[2]]
Sigma_22 = partition(Sigma,nrows = part, ncol = part)[[4]]
Delta0 = Sigma_12 %*% solve(Sigma_22)

Sstar = cov(df_s)
Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]]
Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]]
Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]]
Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]]


DeltaEst = Sstar_12 %*% solve(Sstar_22)
Sstar11_2 = Sstar_11 - Sstar_12 %*% solve(Sstar_22) %*% Sstar_21


T4_obs = det((DeltaEst-Delta0)%*%Sstar_22%*%t(DeltaEst-Delta0))/det(Sstar11_2)

T4 <- canodist(part = part, nsample = n_sample, pvariates = p, iterations = 10000)
q95 <- quantile(T4, 0.95)

T4_obs > q95 #False means that we don't have statistical evidences to reject Delta0
print(T4_obs)
print(q95)
# When the observed value is smaller than the 95% quantile,
# we don't have statistical evidences to reject the Sphericity property.
#
# Note that the value is very close to zero

Generalized Variance Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures for the Generalized Variance of the released Single Synthetic dataset generated under Plug-in Sampling, assuming that the original distribution is normally distributed.

Usage

GVdist(nsample, pvariates, iterations = 10000)

Arguments

nsample

Sample size.

pvariates

Number of variables.

iterations

Number of iterations for simulating values from the distribution and finding the quantiles. Default is 10000.

Details

We define

T1=(n1)SΣ,T_1^\star = (n-1)\frac{|\boldsymbol{S}^*|}{|\boldsymbol{\Sigma}|},

where S=i=1n(vivˉ)(vivˉ)\boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}, Σ\boldsymbol{\Sigma} is the population covariance matrix and viv_i is the iith observation of the synthetic dataset. Its distribution is stochastic equivalent to

i=1nχni2i=1pχni2\prod_{i=1}^n \chi_{n-i}^2 \prod_{i=1}^p \chi_{n-i}^2

where χni2\chi_{n-i}^2 are all independent chi-square random variables. The (1α)(1-\alpha) level confidence interval for Σ|\boldsymbol{\Sigma}| is given by

((n1)pS~t1,1α/2,(n1)pS~t1,α/2)\left(\frac{(n-1)^p|\tilde{\boldsymbol{S}}^\star|}{t^\star_{1,1-\alpha/2}}, \frac{(n-1)^p|\tilde{\boldsymbol{S}}^\star|}{t^\star_{1,\alpha/2}} \right)

where S~\tilde{\boldsymbol{S}}^\star is the observed value of S\boldsymbol{S}^\star and t1,γt^\star_{1,\gamma} is the γ\gammath percentile of T1T_1.

Value

a vector of length iterations that recorded the empirical distribution's values.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

# Original data creation
library(MASS)
mu <- c(1,2,3,4)
Sigma <- matrix(c(1, 0.5, 0.5, 0.5,
                  0.5, 1, 0.5, 0.5,
                  0.5, 0.5, 1, 0.5,
                  0.5, 0.5, 0.5, 1), nrow = 4, ncol = 4, byrow = TRUE)
seed = 1
n_sample = 100
# Create original simulated dataset
df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)

# Synthetic data created

df_s = simSynthData(df)


# Gather the 0.025 and 0.975 quantiles and construct confident interval for sigma^2
# Check that sigma^2 is inside in both cases
p = dim(df_s)[2]

T <- GVdist(100, p, 10000)
q975 <- quantile(T, 0.975)
q025 <- quantile(T, 0.025)

left <- (n_sample-1)^p * det(cov(df_s)*(n_sample-1))/q975
right <- (n_sample-1)^p * det(cov(df_s)*(n_sample-1))/q025

cat(left,right,'\n')
print(det(Sigma))
# The synthetic value is inside the confidence interval of GV

Independence Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable that can be used to perform inferential procedures and test the independence of two subsets of variables based on the released Single Synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.

Usage

Inddist(part, nsample, pvariates, iterations)

Arguments

part

Number of variables in the first subset.

nsample

Sample size.

pvariates

Number of variables.

iterations

Number of iterations for simulating values from the distribution and finding the quantiles. Default is 10000.

Details

We define

T3=SS11S22T_3^\star = \frac{|\boldsymbol{S}^{\star}|} {|\boldsymbol{S}^{\star}_{11}||\boldsymbol{S}^{\star}_{22}|}

where S=i=1n(vivˉ)(vivˉ)\boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}, viv_i is the iith observation of the synthetic dataset, considering S\boldsymbol{S}^\star partitioned as

S=[S11S12S21S22].\boldsymbol{S}^{\star}=\left[\begin{array}{lll} \boldsymbol{S}^{\star}_{11}& \boldsymbol{S}^{\star}_{12}\\ \boldsymbol{S}^{\star}_{21} & \boldsymbol{S}^{\star}_{22} \end{array}\right].

Under the assumption that Σ12=0\boldsymbol{\Sigma}_{12} = \boldsymbol{0}, its distribution is stochastic equivalent to

ΩΩ11Ω22\frac{|\boldsymbol{\Omega}|}{|\boldsymbol{\Omega}_{11}||\boldsymbol{\Omega}_{22}|}

where ΩWp(n1,Wn1)\boldsymbol{\Omega} \sim \mathcal{W}_p(n-1, \frac{\boldsymbol{W}}{n-1}), WWp(n1,Ip)\boldsymbol{W} \sim \mathcal{W}_p(n-1, \mathbf{I}_p) and Ω\boldsymbol{\Omega} partitioned in the same way as S\boldsymbol{S}^{\star}. To test H0:Σ12=0\mathcal{H}_0: \boldsymbol{\Sigma}_{12} = \boldsymbol{0}, compute the value of T3T_{3}^\star, T3~\widetilde{T_{3}^\star}, with the observed values and reject the null hypothesis if T3~<t3,α\widetilde{T_{3}^\star}<t^\star_{3,\alpha} for α\alpha-significance level, where t3,γt^\star_{3,\gamma} is the γ\gammath percentile of T3T_3^\star.

Value

a vector of length iterations that recorded the empirical distribution's values.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

#generate original data with two independent subsets of variables
library(MASS)
n_sample = 100
p = 4
mu <- c(1,2,3,4)
Sigma = matrix(c(1,   0.5,   0,     0,
                 0.5,   2,   0,     0,
                 0,     0,   3,   0.2,
                 0,     0,   0.2,   4), nr = 4, nc = 4, byrow = TRUE)
df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)
# generate synthetic data
df_s = simSynthData(df)

#Decompose Sstar in 4 parts
part = 2

Sstar = cov(df_s)
Sstar_11 = partition(Sstar,nrows = part, ncol = part)[[1]]
Sstar_12 = partition(Sstar,nrows = part, ncol = part)[[2]]
Sstar_21 = partition(Sstar,nrows = part, ncol = part)[[3]]
Sstar_22 = partition(Sstar,nrows = part, ncol = part)[[4]]

#Compute observed T3_star
T3_obs = det(Sstar)/(det(Sstar_11)*det(Sstar_22))

alpha = 0.05

# colect the quantile from the distribution assuming independence between the two subsets
T3 <- Inddist(part = part, nsample = n_sample, pvariates = p, iterations = 10000)
q5 <- quantile(T3, alpha)

T3_obs < q5 #False means that we don't have statistical evidences to reject independence
print(T3_obs)
print(q5)
# Note that the value of the observed T3_obs is close to one as expected

Split a matrix into blocks

Description

This function split a matrix into a list of blocks (either by rows and columns).

Usage

partition(Matrix, nrows, ncols)

Arguments

Matrix

a matrix to split .

nrows

positive integer indicating the number of rows blocks.

ncols

positive integer indicating the number of columns blocks.

Value

a list of partitioned submatrices

Examples

Mat = matrix(c(1,0.5,0,0,
                  0.5,2,0,0,
                  0,0,3,0.2,
                  0, 0, 0.2,4), nrow = 4, ncol = 4, byrow = TRUE)
partition(Matrix = Mat, nrows = 2, ncols = 2)

Plug-in Sampling Single Synthetic Dataset Generation

Description

This function is used to generate a single synthetic version of the original data via Plug-in Sampling.

Usage

simSynthData(X, n_imp = dim(X)[1])

Arguments

X

matrix or dataframe

n_imp

sample size

Details

Assume that X=(x1,,xn)\mathbf{X}=\left(\mathbf{x}_1, \dots, \mathbf{x}_n\right) is the original data, assumed to be normally distributed, we compute xˉ\bar{\mathbf{x}} as the sample mean and Σ^=S/(n1)\hat{\boldsymbol{\Sigma}}=\mathbf{S}/(n-1) as the sample covariance matrix, where S\mathbf{S} is the sample Wishart matrix. We generate V=(v1,,vn)\mathbf{V}=\left(\mathbf{v}_1, \dots, \mathbf{v}_n\right), by drawing

vii.i.d.Np(xˉ,Σ^).\mathbf{v}_i\stackrel{i.i.d.}{\sim}N_p(\bar{\mathbf{x}},\hat{\boldsymbol{\Sigma}}).

Value

a matrix of generated dataset

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

library(MASS)
n_sample = 1000
mu=c(0,0,0,0)
Sigma=diag(1,4,4)
# Create original simulated dataset
df_o = mvrnorm(n_sample, mu, Sigma)
# Create singly imputed synthetic dataset
df_s = simSynthData(df_o)
#Estimators synthetic
mean_s <- colMeans(df_s)
S_s <- (t(df_s)- mean_s) %*% t(t(df_s)- mean_s)
# careful about this computation
# mean_o is a column vector and if you are thinking as n X p matrices and
# row vectors you should be aware of this.
print(mean_s)
print(S_s/(dim(df_s)[1]-1))

Spherical Empirical Distribution

Description

This function calculates the empirical distribution of the pivotal random variable that can be used to perform the Sphericity test of the population covariance matrix Σ\boldsymbol{\Sigma} that is Σ=σ2Ip\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}_p, based on the released Single Synthetic data generated under Plug-in Sampling, assuming that the original dataset is normally distributed.

Usage

Sphdist(nsample, pvariates, iterations)

Arguments

nsample

Sample size.

pvariates

Number of variables.

iterations

Number of iterations for simulating values from the distribution and finding the quantiles. Default is 10000.

Details

We define

T2=S1ptr(S)/pT_2^\star = \frac{|\boldsymbol{S}^{\star}|^{\frac{1}{p}}}{tr(\boldsymbol{S}^{\star})/p}

where S=i=1n(vivˉ)(vivˉ)\boldsymbol{S}^\star = \sum_{i=1}^n (v_i - \bar{v})(v_i - \bar{v})^{\top}, viv_i is the iith observation of the synthetic dataset. For Σ=σ2Ip\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}_p, its distribution is stochastic equivalent to

Ω1Ω21ptr(Ω1Ω2)/p\frac{|\boldsymbol{\Omega}_{1}\boldsymbol{\Omega}_{2}|^{\frac{1}{p}}}{tr(\boldsymbol{\Omega}_{1}\boldsymbol{\Omega}_{2})/p}

where Ω1\boldsymbol{\Omega}_1 and Ω2\boldsymbol{\Omega}_2 are Wishart random variables, Ω1Wp(n1,Ipn1)\boldsymbol{\Omega}_1 \sim \mathcal{W}_p(n-1, \frac{\mathbf{I}_p}{n-1}) is independent of Ω2Wp(n1,Ip)\boldsymbol{\Omega}_2 \sim \mathcal{W}_p(n-1, \mathbf{I}_p). To test H0:Σ=σ2Ip\mathcal{H}_0: \boldsymbol{\Sigma} = \sigma^2 \mathbf{I}_p, compute the observed value of T2T_{2}^\star, T2~\widetilde{T_{2}^\star}, with the observed values and reject the null hypothesis if T2~>t2,α\widetilde{T_{2}^\star}>t^\star_{2,\alpha} for α\alpha-significance level, where t2,γt^\star_{2,\gamma} is the γ\gammath percentile of T2T_2^\star.

Value

a vector of length iterations that recorded the empirical distribution's values.

References

Klein, M., Moura, R. and Sinha, B. (2021). Multivariate Normal Inference based on Singly Imputed Synthetic Data under Plug-in Sampling. Sankhya B 83, 273–287.

Examples

# Original data created
library(MASS)
mu <- c(1,2,3,4)
Sigma <- matrix(c(1, 0, 0, 0,
                  0, 1, 0, 0,
                  0, 0, 1, 0,
                  0, 0, 0, 1), nrow = 4, ncol = 4, byrow = TRUE)
seed = 1
n_sample = 100
# Create original simulated dataset
df = mvrnorm(n_sample, mu = mu, Sigma = Sigma)

# Synthetic data created

df_s = simSynthData(df)


# Gather the 0.95 quantile

p = dim(df_s)[2]

T_sph <- Sphdist(nsample = n_sample, pvariates = p, iterations = 10000)
q95 <- quantile(T_sph, 0.95)

# Compute the observed value of T from the synthetic dataset
S_star = cov(df_s*(n_sample-1))

T_obs = (det(S_star)^(1/p))/(sum(diag(S_star))/p)

print(q95)
print(T_obs)

#Since the observed value is bigger than the 95% quantile,
#we don't have statistical evidences to reject the Sphericity property.
#
#Note that the value is very close to one