PartialDcorr
- class hyppo.conditional.PartialDcorr(compute_distance='euclidean', use_cov=True, **kwargs)
Partial Distance Covariance/Correlation (PDcov/PDcorr) test statistic and p-value.
PDcorr is a measure of dependence between two paired random matrices, given a third random matrix, where the three matrices need not have equal dimensions [1].
- Parameters
compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:
From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"].
From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
use_cov (bool, default: True) -- If True, the statistic computed is the covariance (PDcov) rather than the correlation (PDcorr).
**kwargs -- Arbitrary keyword arguments for compute_distance.
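As an illustration of the custom-function form metric(x, **kwargs) described above, the following is a minimal sketch of a user-defined metric. The name minkowski_metric and its keyword p are hypothetical and not part of hyppo; the only assumption taken from the parameter description is that the function receives one data matrix and returns its (n, n) matrix of pairwise distances.

```python
import numpy as np


def minkowski_metric(x, p=3):
    """Hypothetical custom metric: pairwise Minkowski distances of order p.

    x is an (n, d) data matrix; the result is an (n, n) distance matrix.
    """
    x = np.asarray(x, dtype=float)
    # Absolute coordinate differences between every pair of rows: shape (n, n, d).
    diff = np.abs(x[:, None, :] - x[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)
```

Going by the parameter description, such a function would be supplied as compute_distance, with p passed through the extra keyword arguments.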
Notes
The statistic can be derived as follows:
Let \(x\), \(y\), and \(z\) be \((n, p)\), \((n, q)\), and \((n, r)\) samples of random variables \(X\), \(Y\), and \(Z\), respectively. Let \(D^x\) be the \(n \times n\) distance matrix of \(x\), \(D^y\) be the \(n \times n\) distance matrix of \(y\), and \(D^z\) be the \(n \times n\) distance matrix of \(z\). Let \(C^x\), \(C^y\), and \(C^z\) be the unbiased centered distance matrices (see hyppo.independence.Dcorr for more details). The partial distance covariance is defined as
\[\mathrm{PDcov}_n (x, y; z) = \frac{1}{n(n-3)} \sum_{i\neq j}^n \left(P_{z^\perp}(x)\right)_{i,j} \left(P_{z^\perp}(y)\right)_{i,j}\]
where
\[P_{z^\perp}(x) = C^x - \frac{(C^x\cdot C^z)}{(C^z \cdot C^z)} C^z\]
is the orthogonal projection of \(C^x\) onto the subspace orthogonal to \(C^z\). Here \(A \cdot B = \frac{1}{n(n-3)} \sum_{i\neq j} A_{i,j} B_{i,j}\) denotes the inner product of two centered matrices and \(|A| = \sqrt{A \cdot A}\) the corresponding norm. The partial distance correlation is defined as
\[\mathrm{PDcorr}_n (x, y; z) = \frac{P_{z^\perp}(x)\cdot P_{z^\perp}(y)}{|P_{z^\perp}(x)| |P_{z^\perp}(y)|}\]
Equivalently, the partial distance correlation can also be defined as
\[\mathrm{PDcorr}_n (x, y; z) = \frac{R_{xy} - R_{xz} R_{yz}}{\sqrt{(1 - R_{xz}^2)(1 - R_{yz}^2)}}\]
where \(R_{xy}\) is the unbiased distance correlation between \(x\) and \(y\), and \(R_{xz}\) and \(R_{yz}\) are defined analogously.
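The derivation above can be sketched in NumPy as follows. This is an illustrative reimplementation under stated assumptions, not hyppo's own code: the helper names are hypothetical, Euclidean distances are used, and the unbiased centering follows the U-centering of Székely and Rizzo [1], which requires n > 3.

```python
import numpy as np


def pairwise_euclidean(a):
    """(n, n) matrix of Euclidean distances between the rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def u_center(d):
    """Unbiased (U-)centering of a distance matrix (assumed form, n > 3)."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    c = d - row - col + total
    np.fill_diagonal(c, 0.0)  # diagonal entries are defined to be zero
    return c


def inner(a, b):
    """Inner product A . B = 1/(n(n-3)) * sum over i != j of A_ij * B_ij."""
    n = a.shape[0]
    # Diagonals are zero, so the full sum equals the off-diagonal sum.
    return (a * b).sum() / (n * (n - 3))


def project_out(cx, cz):
    """Projection of cx onto the subspace orthogonal to cz."""
    return cx - inner(cx, cz) / inner(cz, cz) * cz


def centered(x, y, z):
    return tuple(u_center(pairwise_euclidean(m)) for m in (x, y, z))


def partial_stats(x, y, z):
    """Return (PDcov, PDcorr) computed from the projection formulas."""
    cx, cy, cz = centered(x, y, z)
    px, py = project_out(cx, cz), project_out(cy, cz)
    pdcov = inner(px, py)
    pdcorr = pdcov / np.sqrt(inner(px, px) * inner(py, py))
    return pdcov, pdcorr


def pdcorr_from_correlations(x, y, z):
    """PDcorr computed from the pairwise correlations R_xy, R_xz, R_yz."""
    cx, cy, cz = centered(x, y, z)

    def r(a, b):
        return inner(a, b) / np.sqrt(inner(a, a) * inner(b, b))

    r_xy, r_xz, r_yz = r(cx, cy), r(cx, cz), r(cy, cz)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

On any sample with n > 3, the second value returned by partial_stats and the value returned by pdcorr_from_correlations should agree up to floating-point error, which gives a quick numerical check of the equivalence of the two definitions.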
References
- 1
Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, December 2014. doi:10.1214/14-AOS1255.
Methods Summary
statistic(x, y, z) | Helper function that calculates the PDcov/PDcorr test statistic.
test(x, y, z[, reps, workers, random_state]) | Calculates the PDcov/PDcorr test statistic and p-value.
- PartialDcorr.statistic(x, y, z)
Helper function that calculates the PDcov/PDcorr test statistic.
- Parameters
x, y, z (ndarray of float) -- Input data matrices. x, y, and z must have the same number of samples. That is, the shapes must be (n, p), (n, q), and (n, r), where n is the number of samples and p, q, and r are the numbers of dimensions. Alternatively, x and y can be distance matrices and z can be a similarity matrix, where the shapes must be (n, n).
- Returns
stat (float) -- The computed PDcov/PDcorr statistic.
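A minimal usage sketch, assuming hyppo is installed and using only the constructor and statistic signatures documented here. The helper names make_data and pdcov_and_pdcorr are hypothetical, and the simulated data (x and y both driven by z plus independent noise) is chosen purely for illustration.

```python
import numpy as np


def make_data(n=50, seed=0):
    """Simulate x and y that depend on each other only through z."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 1))
    x = z + 0.5 * rng.normal(size=(n, 1))
    y = z + 0.5 * rng.normal(size=(n, 1))
    return x, y, z


def pdcov_and_pdcorr(x, y, z):
    """Compute both statistics with the class documented above."""
    # Imported inside the function so the helpers can be defined
    # even in an environment where hyppo is not installed.
    from hyppo.conditional import PartialDcorr

    pdcov = PartialDcorr(use_cov=True).statistic(x, y, z)
    pdcorr = PartialDcorr(use_cov=False).statistic(x, y, z)
    return pdcov, pdcorr
```

Calling pdcov_and_pdcorr(*make_data()) returns the covariance and correlation forms of the statistic for the simulated sample.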
- PartialDcorr.test(x, y, z, reps=1000, workers=1, random_state=None)
Calculates the PDcov/PDcorr test statistic and p-value.
- Parameters
x, y, z (ndarray of float) -- Input data matrices. x, y, and z must have the same number of samples. That is, the shapes must be (n, p), (n, q), and (n, r), where n is the number of samples and p, q, and r are the numbers of dimensions. Alternatively, x and y can be distance matrices and z can be a similarity matrix, where the shapes must be (n, n).
reps (int, default: 1000) -- The number of replications used to estimate the null distribution in the permutation test that calculates the p-value.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.
random_state (int, default: None) -- The random state used for permutation testing, which can be fixed for reproducibility.
- Returns