Conditional Independence Testing

Conditional independence testing is similar to independence testing but adds a third, conditioning variable. Consider random variables \(X\), \(Y\), and \(Z\) with distributions \(F_X\), \(F_Y\), and \(F_Z\). When performing conditional independence testing, we evaluate whether \(F_{X, Y|Z} = F_{X|Z}F_{Y|Z}\). Specifically, we are testing

\[\begin{split}H_0 &: X \perp \!\!\! \perp Y \mid Z \\ H_A &: X \not\!\perp\!\!\!\perp Y \mid Z\end{split}\]
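To make the factorization concrete, here is a minimal sketch on a hypothetical toy distribution (the probabilities below are our own choice, not taken from hyppo). It builds a joint table for binary \(X\), \(Y\), and \(Z\) using exact fractions and checks that \(H_0\) holds even though \(X\) and \(Y\) are not marginally independent:

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy distribution: Z is a fair coin, and X, Y are drawn
# independently given Z with P(X=1|Z=z) = P(Y=1|Z=z) = p_one[z].
p_z = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_one = {0: Fraction(1, 4), 1: Fraction(3, 4)}


def bernoulli(p, value):
    """Probability that a Bernoulli(p) variable equals value (0 or 1)."""
    return p if value == 1 else 1 - p


# Joint table P(X=x, Y=y, Z=z).
joint = {
    (x, y, z): p_z[z] * bernoulli(p_one[z], x) * bernoulli(p_one[z], y)
    for x, y, z in product((0, 1), repeat=3)
}


def marginal(keep):
    """Sum the joint table down to the coordinates named in keep, e.g. "xz"."""
    table = {}
    for (x, y, z), prob in joint.items():
        values = {"x": x, "y": y, "z": z}
        key = tuple(values[name] for name in keep)
        table[key] = table.get(key, 0) + prob
    return table


p_xz, p_yz, p_only_z = marginal("xz"), marginal("yz"), marginal("z")
p_xy, p_x, p_y = marginal("xy"), marginal("x"), marginal("y")

# H_0: P(x, y | z) == P(x | z) * P(y | z) for every cell of the table.
conditionally_independent = all(
    joint[(x, y, z)] / p_only_z[(z,)]
    == (p_xz[(x, z)] / p_only_z[(z,)]) * (p_yz[(y, z)] / p_only_z[(z,)])
    for x, y, z in product((0, 1), repeat=3)
)

# Marginal independence would need P(x, y) == P(x) * P(y) for every cell.
marginally_independent = all(
    p_xy[(x, y)] == p_x[(x,)] * p_y[(y,)] for x, y in product((0, 1), repeat=2)
)

print(conditionally_independent, marginally_independent)  # prints: True False
```

Here \(P(X=1, Y=1) = 5/16\) while \(P(X=1)P(Y=1) = 1/4\), so an unconditional independence test could reject even though \(H_0\) above is true.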

Like all other tests in hyppo, each class has a statistic method and a test method. The test method returns the test statistic and p-value, among other outputs, and is the one used most often in the examples and tutorials.
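The split between the two methods can be sketched with a toy class. Everything below is hypothetical and is not hyppo code: `ToyPartialCorrTest` is an invented name, its statistic is a simple residual-based partial correlation, and its p-value comes from an approximate permutation scheme that we assume is adequate for illustration only.

```python
import random


def _residuals(values, z):
    """Residuals of a least-squares fit of values on z (with intercept)."""
    n = len(z)
    mean_v, mean_z = sum(values) / n, sum(z) / n
    sxy = sum((zi - mean_z) * (vi - mean_v) for zi, vi in zip(z, values))
    sxx = sum((zi - mean_z) ** 2 for zi in z)
    slope = sxy / sxx
    return [vi - mean_v - slope * (zi - mean_z) for vi, zi in zip(values, z)]


def _pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


class ToyPartialCorrTest:
    """Hypothetical stand-in that mimics the statistic/test split."""

    def statistic(self, x, y, z):
        # Correlate what is left of x and y after removing the linear effect of z.
        return _pearson(_residuals(x, z), _residuals(y, z))

    def test(self, x, y, z, reps=200, seed=0):
        # Returns (statistic, p-value); the p-value shuffles the y residuals.
        rng = random.Random(seed)
        res_x, res_y = _residuals(x, z), _residuals(y, z)
        observed = _pearson(res_x, res_y)
        exceed = 0
        for _ in range(reps):
            shuffled = res_y[:]
            rng.shuffle(shuffled)
            if abs(_pearson(res_x, shuffled)) >= abs(observed):
                exceed += 1
        return observed, (1 + exceed) / (1 + reps)


# Simulated data in which y depends on x even after accounting for z.
rng = random.Random(1)
n = 500
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + 0.5 * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]

stat, pvalue = ToyPartialCorrTest().test(x, y, z)
print("Statistic: ", stat)
print("p-value: ", pvalue)
```

Calling `statistic` alone is cheaper because it skips the resampling loop, which is why the two methods are kept separate in this sketch.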

Specifics about how the test statistic is calculated for each test in hyppo.conditional can be found in the docstring of the respective test. Here, we give an overview of some of the conditional tests offered in hyppo and the parameters unique to those tests.

Now, let's look at unique properties of some of the tests in hyppo.conditional:

Fast Conditional Independence Test (FCIT)

The Fast Conditional Independence Test (FCIT) is a non-parametric conditional independence test. The test is based on a weak assumption: if the null hypothesis of conditional independence is true, then predicting one variable from the conditioning variable alone should be just as accurate as predicting it from the conditioning variable together with the other variable. More details can be found in hyppo.conditional.FCIT.
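The assumption can be illustrated with a small simulation. This is only a sketch of the idea and not the FCIT procedure itself: we swap in ordinary least squares for the regressor, use scalar variables, and compare held-out mean squared errors directly. The helper names (`cross`, `heldout_mse_z`, `heldout_mse_xz`, `accuracy_gap`) are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cross(a, b):
    """Sum of centered cross-products of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


def heldout_mse_z(y, z, split):
    """Held-out squared error when predicting y from z alone."""
    zt, yt = z[:split], y[:split]
    slope = cross(zt, yt) / cross(zt, zt)
    intercept = mean(yt) - slope * mean(zt)
    return mean(
        [(yi - intercept - slope * zi) ** 2 for yi, zi in zip(y[split:], z[split:])]
    )


def heldout_mse_xz(x, y, z, split):
    """Held-out squared error when predicting y from both x and z."""
    xt, yt, zt = x[:split], y[:split], z[:split]
    sxx, szz, sxz = cross(xt, xt), cross(zt, zt), cross(xt, zt)
    sxy, szy = cross(xt, yt), cross(zt, yt)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = sxx * szz - sxz ** 2
    bx = (sxy * szz - szy * sxz) / det
    bz = (szy * sxx - sxy * sxz) / det
    intercept = mean(yt) - bx * mean(xt) - bz * mean(zt)
    return mean(
        [
            (yi - intercept - bx * xi - bz * zi) ** 2
            for xi, yi, zi in zip(x[split:], y[split:], z[split:])
        ]
    )


def accuracy_gap(x_effect, n=2000, seed=0):
    """Error using z alone minus error using x and z; near zero under the null."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + x_effect * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]
    split = n // 2  # first half trains the regressions, second half scores them
    return heldout_mse_z(y, z, split) - heldout_mse_xz(x, y, z, split)


null_gap = accuracy_gap(x_effect=0.0)  # y depends on z only
alt_gap = accuracy_gap(x_effect=1.0)   # y also depends on x
print("Gap under the null: ", null_gap)
print("Gap under the alternative: ", alt_gap)
```

When \(Y\) depends on \(Z\) only, adding \(X\) barely changes the held-out error; when \(Y\) also depends on \(X\), the error drops noticeably. A test built on this idea needs a rule for deciding when the gap is large enough to reject.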

Note

This algorithm is currently under review; a preprint is available on arXiv.

Note

Pros
  • Very fast on high-dimensional data due to parallel processes

Cons
  • Heuristic method; above assumption, though weak, is not always true

The test uses a regression model to predict the target variable. By default, the regressor is a decision tree regressor, but the user can specify another regressor along with a set of hyperparameters to be tuned using cross-validation. Below is an example where the null hypothesis is true:

import numpy as np
from hyppo.conditional import FCIT
from sklearn.tree import DecisionTreeRegressor
np.random.seed(1234)
dim = 2
n = 100000
# Conditioning variable z and two random mixing matrices
z1 = np.random.multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim), size=n)
A1 = np.random.normal(loc=0, scale=1, size=dim * dim).reshape(dim, dim)
B1 = np.random.normal(loc=0, scale=1, size=dim * dim).reshape(dim, dim)
# x and y each depend on z plus independent noise, so they are
# conditionally independent given z (the null hypothesis holds)
x1 = A1 @ z1.T + np.random.multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim), size=n).T
y1 = B1 @ z1.T + np.random.multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim), size=n).T
# Regressor and the hyperparameter grid searched by cross-validation
model = DecisionTreeRegressor()
cv_grid = {"min_samples_split": [2, 8, 64, 512, 1e-2, 0.2, 0.4]}
# x1 and y1 have shape (dim, n); transpose so that rows are samples
stat, pvalue = FCIT(model=model, cv_grid=cv_grid).test(x1.T, y1.T, z1)
print("Statistic: ", stat)
print("p-value: ", pvalue)

Out:

Statistic:  -3.620087209954849
p-value:  0.9957453952769224

Kernel Conditional Independence Test (KCI)

The Kernel Conditional Independence Test (KCI) is a conditional independence test based on RBF kernels computed separately for each sample of data. The kernels are then normalized and multiplied together, and the test statistic is the trace of the matrix product. The p-value is obtained from a gamma approximation based on the mean and variance of the independent sample kernel values. More details can be found in hyppo.conditional.KCI.
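The kernel-product statistic can be sketched in a few lines. This is an illustration under our own assumptions, not the hyppo implementation: we use a fixed kernel width of 1, take double-centering as the normalization step, work with two scalar samples only, and leave out the gamma approximation for the p-value. The function names are hypothetical.

```python
import math
import random


def rbf_kernel(values, width=1.0):
    """RBF (Gaussian) kernel matrix for a list of scalar samples."""
    return [
        [math.exp(-((a - b) ** 2) / (2 * width ** 2)) for b in values]
        for a in values
    ]


def center(kernel):
    """Double-center a symmetric kernel matrix (H K H with H = I - 11^T / n)."""
    n = len(kernel)
    # The matrix is symmetric, so row means double as column means.
    row_means = [sum(row) / n for row in kernel]
    grand_mean = sum(row_means) / n
    return [
        [kernel[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def trace_of_product(a, b):
    """trace(A @ B), computed as the sum over i, j of A[i][j] * B[j][i]."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def kernel_statistic(x, y):
    return trace_of_product(center(rbf_kernel(x)), center(rbf_kernel(y)))


rng = random.Random(0)
n = 80
x = [rng.gauss(0, 1) for _ in range(n)]
y_dependent = [xi + 0.1 * rng.gauss(0, 1) for xi in x]
y_independent = [rng.gauss(0, 1) for _ in range(n)]

stat_dependent = kernel_statistic(x, y_dependent)
stat_independent = kernel_statistic(x, y_independent)
print("Dependent pair: ", stat_dependent)
print("Independent pair: ", stat_independent)
```

The statistic is much larger for the dependent pair. Turning it into a p-value requires a reference distribution, which is the role of the gamma approximation in the description above.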

Note

Pros
  • Very fast on high-dimensional data due to simplicity and approximation

Cons
  • The literature disagrees on the ideal theta value, and accuracy degrades on very large datasets

Below is a linear example where we reject the null hypothesis:

import numpy as np
from hyppo.conditional import KCI
from hyppo.tools import linear
np.random.seed(123456789)
x, y = linear(100, 1)
stat, pvalue = KCI().test(x, y)
print("Statistic: ", stat)
print("p-value: ", pvalue)

Out:

Statistic:  544.691148251223
p-value:  0.0

Partial Correlation (PCorr) and Partial Distance Correlation (PDcorr)

Partial Correlation (PCorr) and Partial Distance Correlation (PDcorr) are conditional independence tests that are extensions of Pearson's Correlation and Distance Correlation, respectively. Partial distance correlation introduces a new Hilbert space where the squared distance covariance is the inner product. More details can be found in hyppo.conditional.PartialCorr and hyppo.conditional.PartialDcorr.
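The inner-product view of partial distance correlation can be sketched as follows. This is a minimal illustration for scalar samples, written from our understanding of the bias-corrected formulation based on U-centered distance matrices; it is an assumption that this matches hyppo's implementation in every detail, and all function names are hypothetical.

```python
import random


def u_centered_distances(values):
    """U-centered pairwise distance matrix of scalar samples (zero diagonal)."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_sums = [sum(row) for row in dist]
    total = sum(row_sums)
    return [
        [
            0.0
            if i == j
            else (
                dist[i][j]
                - row_sums[i] / (n - 2)
                - row_sums[j] / (n - 2)
                + total / ((n - 1) * (n - 2))
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def inner(a, b):
    """Inner product of two U-centered matrices over off-diagonal entries."""
    n = len(a)
    off_diagonal = sum(
        a[i][j] * b[i][j] for i in range(n) for j in range(n) if i != j
    )
    return off_diagonal / (n * (n - 3))


def bias_corrected_dcor(a, b):
    """Bias-corrected (squared) distance correlation from U-centered matrices."""
    return inner(a, b) / (inner(a, a) * inner(b, b)) ** 0.5


def partial_dcor(x, y, z):
    """Partial distance correlation of x and y with z removed."""
    a, b, c = (u_centered_distances(v) for v in (x, y, z))
    r_xy = bias_corrected_dcor(a, b)
    r_xz = bias_corrected_dcor(a, c)
    r_yz = bias_corrected_dcor(b, c)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(0)
n = 200
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y_null = [zi + rng.gauss(0, 1) for zi in z]       # related to x only through z
y_dependent = [xi + rng.gauss(0, 1) for xi in x]  # depends on x directly

pd_null = partial_dcor(x, y_null, z)
pd_dependent = partial_dcor(x, y_dependent, z)
print("Through z only: ", pd_null)
print("Direct dependence: ", pd_dependent)
```

The formula has the same shape as the classical partial correlation formula, with Pearson correlations replaced by inner products in the space of U-centered matrices. Because the estimator is bias-corrected, values slightly below zero are possible.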

Note

Pros
  • Simplest extension of Pearson's Correlation and Distance Correlation

Cons
  • Literature may suggest that this is not actually a dependence measure

  • Partial correlation makes strong linearity assumptions about the data

Below is a linear example where we reject the null hypothesis:

import numpy as np
from hyppo.conditional import PartialDcorr
from hyppo.tools import correlated_normal
np.random.seed(123456789)
x, y, z = correlated_normal(100, 1)
stat, pvalue = PartialDcorr().test(x, y, z)
print("Statistic: ", stat)
print("p-value: ", pvalue)

Out:

Statistic:  0.16077271247537103
p-value:  0.000999000999000999

Conditional Distance Correlation (CDcorr)

Conditional Distance Correlation (CDcorr) is a nonparametric measure of conditional dependence for multivariate random variables. The sample version takes the same statistical form as Dcorr but is conditioned on a third variable. It also has strong guarantees regarding convergence and asymptotic normality. More details can be found in hyppo.conditional.ConditionalDcorr.

Note

Pros
  • Has stronger theoretical guarantees than PCorr and PDcorr

Cons
  • Computationally expensive on very large datasets

Below is a linear example where we reject the null hypothesis:

import numpy as np
from hyppo.conditional import ConditionalDcorr
from hyppo.tools import correlated_normal
np.random.seed(123456789)
x, y, z = correlated_normal(100, 1)
stat, pvalue = ConditionalDcorr().test(x, y, z)
print("Statistic: ", stat)
print("p-value: ", pvalue)

Out:

Statistic:  0.0036449407249212347
p-value:  0.000999000999000999

