Independence

Multiscale Graph Correlation (MGC)

class hyppo.independence.MGC(compute_distance=<function euclidean>)[source]

Class for calculating the MGC test statistic and p-value.

Specifically, for each point, MGC finds the \(k\)-nearest neighbors for one property (e.g. cloud density), and the \(l\)-nearest neighbors for the other property (e.g. grass wetness) [1]. This pair \((k, l)\) is called the "scale". A priori, however, it is not known which scales will be most informative. So, MGC computes all distance pairs, and then efficiently computes the distance correlations for all scales. The local correlations illustrate which scales are relatively informative about the relationship. The key, therefore, to successfully discovering and deciphering relationships between disparate data modalities is to adaptively determine which scales are the most informative, and the geometric implication of the most informative scales. Doing so not only provides an estimate of whether the modalities are related, but also provides insight into how the determination was made. This is especially important in high-dimensional data, where simple visualizations do not reveal relationships to the unaided human eye. Characterizations of this implementation in particular have been derived from and benchmarked in [2].

Parameters:

compute_distance : callable(), optional (default: euclidean)

A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix before-hand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.
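As an illustration of the `compute_distance(x)` form described above, the following is a minimal sketch of a custom distance function. The name `manhattan` and the choice of metric are hypothetical and used only for this example; the only requirement taken from the description is that the function maps a data matrix to its matrix of pairwise distances.

```python
import numpy as np


def manhattan(x):
    """Return the (n, n) matrix of pairwise Manhattan (L1) distances.

    Hypothetical example of a function with the compute_distance(x) form.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    # Broadcast to an (n, n, p) array of coordinate differences, then sum.
    return np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)


x = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
d = manhattan(x)
```

Alternatively, per the description above, the matrix `d` could be computed beforehand and passed in directly with compute_distance set to None.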

See also

Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.
Dcorr
Distance correlation test statistic and p-value.

Notes

A description of the process of MGC and applications on neuroscience data can be found in [1]. It is performed using the following steps:

Let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) samples of random variables \(X\) and \(Y\). Let \(D^x\) be the \(n \times n\) distance matrix of \(x\) and \(D^y\) be the \(n \times n\) distance matrix of \(y\). \(D^x\) and \(D^y\) are modified to be mean zero columnwise. This results in two \(n \times n\) distance matrices \(A\) and \(B\) (the centering and unbiased modification) [3].

  1. For all values \(k\) and \(l\) from \(1, ..., n\),
    • The \(k\)-nearest neighbor and \(l\)-nearest neighbor graphs are calculated for each property. Here, \(G_k (i, j)\) indicates the \(k\) smallest values of the \(i\)-th row of \(A\) and \(H_l (i, j)\) indicates the \(l\) smallest values of the \(i\)-th row of \(B\).
    • Let \(\circ\) denote the entry-wise matrix product; all matrix products inside the sums below are entry-wise. The local correlations are then summed and normalized using the following statistic:
\[c^{kl} = \frac{\sum_{ij} A G_k B H_l} {\sqrt{\sum_{ij} A^2 G_k \times \sum_{ij} B^2 H_l}}\]
  2. The MGC test statistic is the smoothed optimal local correlation of \(\{ c^{kl} \}\). Denote the smoothing operation as \(R(\cdot)\) (which essentially sets all isolated large correlations to 0 and keeps connected large correlations the same as before; see [3]). MGC is,
\[MGC_n (x, y) = \max_{(k, l)} R \left(c^{kl} \left( x_n, y_n \right) \right)\]

The test statistic takes a value between \(-1\) and \(1\) since it is normalized.
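The local correlation \(c^{kl}\) from step 1 can be sketched in NumPy as follows. This is a simplified illustration, not the library's implementation: it uses plain column-wise mean centering in place of the unbiased modification of [3], ranks neighbors on the raw distances, and omits the smoothing operation \(R(\cdot)\). All function names are hypothetical.

```python
import numpy as np


def pairwise_euclidean(x):
    """Return the (n, n) Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_indicator(d, k):
    """Mark the k smallest entries in each row of d with 1, the rest with 0."""
    ranks = d.argsort(axis=1).argsort(axis=1)  # 0 = smallest entry in the row
    return (ranks < k).astype(float)


def local_correlation(x, y, k, l):
    """Local correlation c^{kl} for a single scale (k, l), without smoothing."""
    dx, dy = pairwise_euclidean(x), pairwise_euclidean(y)
    # Assumption: simple column-wise mean centering stands in for the
    # centering and unbiased modification described in [3].
    a = dx - dx.mean(axis=0, keepdims=True)
    b = dy - dy.mean(axis=0, keepdims=True)
    # Assumption: neighbors are ranked on the raw distances.
    g, h = knn_indicator(dx, k), knn_indicator(dy, l)
    num = np.sum(a * g * b * h)  # all products are entry-wise
    den = np.sqrt(np.sum(a ** 2 * g) * np.sum(b ** 2 * h))
    return num / den if den > 0 else 0.0


x = np.arange(20, dtype=float)
y = x ** 2
n = len(x)
# Map of local correlations over all scales (k, l); MGC would smooth this
# map and then take its maximum.
c_map = np.array([[local_correlation(x, y, k, l) for l in range(1, n + 1)]
                  for k in range(1, n + 1)])
```

Because the indicators are binary, the Cauchy-Schwarz inequality bounds every entry of `c_map` by 1 in absolute value, which matches the normalization described above.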

The p-value returned is calculated using a permutation test. This process is completed by first randomly permuting \(y\) to estimate the null distribution and then calculating the probability of observing a test statistic, under the null, at least as extreme as the observed test statistic.
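The permutation procedure described above can be sketched generically as follows. This is an illustrative sketch rather than the library's code: the helper names are hypothetical, the absolute Pearson correlation stands in for the MGC statistic, and the add-one correction in the p-value is an assumption (it is consistent with the value 0.001 reported for reps=1000 in the examples below).

```python
import numpy as np


def permutation_pvalue(x, y, statistic, reps=1000, seed=0):
    """Estimate a p-value by permuting y to simulate the null distribution."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    # Each replication breaks the pairing between x and y.
    null = np.array([statistic(x, rng.permutation(y)) for _ in range(reps)])
    # Assumption: add-one correction, so the p-value is never exactly 0.
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return float(observed), float(pvalue)


def abs_pearson(x, y):
    """Stand-in test statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


x = np.arange(100, dtype=float)
stat, pvalue = permutation_pvalue(x, x, abs_pearson, reps=1000)
```

Under this convention the smallest attainable p-value is 1 / (reps + 1), so more replications allow smaller p-values to be resolved.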

MGC requires at least 5 samples to run with reliable results. It can also handle high-dimensional data sets.

References

[1] Vogelstein, J. T., Bridgeford, E. W., Wang, Q., Priebe, C. E., Maggioni, M., & Shen, C. (2019). Discovering and deciphering relationships across disparate data modalities. eLife.
[2] Panda, S., Palaniappan, S., Xiong, J., Swaminathan, A., Ramachandran, S., Bridgeford, E. W., ... Vogelstein, J. T. (2019). mgcpy: A Comprehensive High Dimensional Independence Testing Python Package. arXiv:1907.02088 [cs, stat].
[3] Shen, C., Priebe, C. E., & Vogelstein, J. T. (2019). From distance correlation to multiscale graph correlation. Journal of the American Statistical Association.

test(x, y, reps=1000, workers=1)[source]

Calculates the MGC test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:

stat : float

The computed MGC statistic.

pvalue : float

The computed MGC p-value.

mgc_dict : dict

Contains additional useful returns containing the following keys:

  • mgc_map : ndarray
    A 2D representation of the latent geometry of the relationship.
  • opt_scale : (int, int)
    The estimated optimal scale as a (x, y) pair.

Examples

>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue, _ = MGC().test(x, y)
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.001'

Increasing the number of replications gives a finer-grained p-value estimate, because the smallest p-value a permutation test can report shrinks as reps grows.

>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.arange(100)
>>> y = x
>>> stat, pvalue, _ = MGC().test(x, y, reps=10000)
>>> '%.1f, %.3f' % (stat, pvalue)
'1.0, 0.000'

In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import MGC
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> mgc = MGC(compute_distance=None)
>>> stat, pvalue, _ = mgc.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Distance Correlation (Dcorr)

class hyppo.independence.Dcorr(compute_distance=<function euclidean>, bias=False)[source]

Class for calculating the Dcorr test statistic and p-value.

Dcorr is a measure of dependence between two paired random matrices of not necessarily equal dimensions. The coefficient is 0 if and only if the matrices are independent. It is an example of an energy distance.

Parameters:

compute_distance : callable(), optional (default: euclidean)

A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix before-hand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.

bias : bool (default: False)

Whether to use the biased or the unbiased test statistic.

See also

Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.
HHG
Heller Heller Gorfine test statistic and p-value.

Notes

The statistic can be derived as follows:

Let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) samples of random variables \(X\) and \(Y\). Let \(D^x\) be the \(n \times n\) distance matrix of \(x\) and \(D^y\) be the \(n \times n\) distance matrix of \(y\). The distance covariance is,

\[\mathrm{Dcov}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (D^x H D^y H)\]

where \(\mathrm{tr} (\cdot)\) is the trace operator and \(H\) is defined as \(H = I - (1/n) J\) where \(I\) is the identity matrix and \(J\) is a matrix of ones. The normalized version of this covariance is Dcorr [4] and is

\[\mathrm{Dcorr}_n (x, y) = \frac{\mathrm{Dcov}_n (x, y)} {\sqrt{\mathrm{Dcov}_n (x, x) \mathrm{Dcov}_n (y, y)}}\]
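The two formulas above translate almost directly into NumPy. The following is a minimal sketch under the assumption of Euclidean distances; the function names are hypothetical and this is not the library's implementation.

```python
import numpy as np


def distance_matrix(x):
    """Return the (n, n) Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dcov_biased(dx, dy):
    """Dcov_n = tr(D^x H D^y H) / n^2 with H = I - (1/n) J."""
    n = dx.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(dx @ h @ dy @ h) / n ** 2


def dcorr_biased(x, y):
    """Dcov_n(x, y) normalized by sqrt(Dcov_n(x, x) * Dcov_n(y, y))."""
    dx, dy = distance_matrix(x), distance_matrix(y)
    denom = np.sqrt(dcov_biased(dx, dx) * dcov_biased(dy, dy))
    return dcov_biased(dx, dy) / denom if denom > 0 else 0.0


x = np.arange(7)
stat_same = dcorr_biased(x, x)        # identical inputs
stat_scaled = dcorr_biased(x, 2 * x)  # rescaling y leaves the ratio unchanged
```

Both calls evaluate to 1 up to floating-point error, since the normalization cancels any positive rescaling of the distances.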

The version above is biased. An unbiased version of distance correlation is defined using the following centering process, where \(\mathbb{1}(\cdot)\) is the indicator function:

\[C^x_{ij} = \left[ D^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n D^x_{it} - \frac{1}{n-2} \sum_{s=1}^n D^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n D^x_{st} \right] \mathbb{1}_{i \neq j}\]

and similarly for \(C^y\). Then, the unbiased Dcov is,

\[\mathrm{UDcov}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)\]

The normalized version of this covariance [5] is

\[\mathrm{UDcorr}_n (x, y) = \frac{\mathrm{UDcov}_n (x, y)} {\sqrt{\mathrm{UDcov}_n (x, x) \mathrm{UDcov}_n (y, y)}}\]
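The centering process and the resulting normalized statistic can be sketched as follows. This is an illustrative NumPy transcription of the formulas above with hypothetical function names, assuming Euclidean distances and \(n > 3\) so that the factor \(n (n-3)\) is positive.

```python
import numpy as np


def distance_matrix(x):
    """Return the (n, n) Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def u_center(d):
    """Apply the centering C_ij defined above; the diagonal is set to 0."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)   # (1/(n-2)) sum_t D_it
    col = d.sum(axis=0, keepdims=True) / (n - 2)   # (1/(n-2)) sum_s D_sj
    total = d.sum() / ((n - 1) * (n - 2))          # grand-sum term
    c = d - row - col + total
    np.fill_diagonal(c, 0.0)                       # indicator 1_{i != j}
    return c


def udcov(cx, cy):
    """UDcov_n = tr(C^x C^y) / (n (n - 3))."""
    n = cx.shape[0]
    return np.trace(cx @ cy) / (n * (n - 3))


def udcorr(x, y):
    """UDcov_n(x, y) normalized by sqrt(UDcov_n(x, x) * UDcov_n(y, y))."""
    cx, cy = u_center(distance_matrix(x)), u_center(distance_matrix(y))
    denom = np.sqrt(udcov(cx, cx) * udcov(cy, cy))
    return udcov(cx, cy) / denom if denom > 0 else 0.0


x = np.arange(7)
stat = udcorr(x, x)
```

Unlike the biased statistic, this version can take negative values for weakly related samples.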

References

[4] Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769-2794.
[5] Székely, G. J., & Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6), 2382-2412.

test(x, y, reps=1000, workers=1, auto=True, bias=False, perm_blocks=None)[source]

Calculates the Dcorr test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

auto : bool (default: True)

If True and the sample size is greater than 20, a fast chi2 approximation is used to compute the p-value instead of the permutation test. The parameters reps and workers are ignored in this case.

perm_blocks : list or ndarray, optional

Provides a hierarchy of dependencies that restricts permutations. Columns provide labels for each sample and recursively partition the samples. Groups at each partition are exchangeable under a permutation, but remain fixed if the label is negative.

Returns:

stat : float

The computed Dcorr statistic.

pvalue : float

The computed Dcorr p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a finer-grained p-value estimate, because the smallest p-value a permutation test can report shrinks as reps grows.

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import Dcorr
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> dcorr = Dcorr(compute_distance=None)
>>> stat, pvalue = dcorr.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Hilbert-Schmidt Independence Criterion (Hsic)

class hyppo.independence.Hsic(compute_kernel=<function gaussian>, bias=False)[source]

Class for calculating the Hsic test statistic and p-value.

Hsic is a kernel-based independence test and is a way to measure multivariate nonlinear associations given a specified kernel [6]. The default choice is the Gaussian kernel with the median distance as the bandwidth. It is a characteristic kernel, which guarantees that Hsic is a consistent test [6] [7].

Parameters:

compute_kernel : callable(), optional (default: gaussian)

A function that computes the similarity among the samples within each data matrix. Set to None if x and y are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form compute_kernel(x) where x is the data matrix for which pairwise similarities are calculated.

bias : bool (default: False)

Whether to use the biased or the unbiased test statistic.

See also

Dcorr
Distance correlation test statistic and p-value.
HHG
Heller Heller Gorfine test statistic and p-value.

Notes

The statistic can be derived as follows [6]:

Let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) samples of random variables \(X\) and \(Y\). Let \(K^x\) be the \(n \times n\) kernel similarity matrix of \(x\) and \(K^y\) be the \(n \times n\) kernel similarity matrix of \(y\). The Hsic statistic is,

\[\mathrm{Hsic}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (K^x H K^y H)\]

where \(\mathrm{tr} (\cdot)\) is the trace operator and \(H\) is defined as \(H = I - (1/n) J\) where \(I\) is the identity matrix and \(J\) is a matrix of ones. The normalized version of Hsic [4] is

\[\mathrm{Hsic}_n (x, y) = \frac{\mathrm{Hsic}_n (x, y)} {\sqrt{\mathrm{Hsic}_n (x, x) \mathrm{Hsic}_n (y, y)}}\]
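The biased Hsic statistic and its normalized version can be sketched in NumPy as follows. This is an illustration with hypothetical function names, not the library's code. In particular, the bandwidth convention used here, \(\exp(-d^2 / (2 \sigma^2))\) with \(\sigma\) set to the median of the nonzero pairwise distances, is an assumption about what "median distance as the bandwidth" means.

```python
import numpy as np


def gaussian_kernel(x):
    """Gaussian kernel matrix with a median-distance bandwidth (assumed form)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    sigma = np.median(d[d > 0])  # assumption: median of nonzero distances
    return np.exp(-d ** 2 / (2 * sigma ** 2))


def hsic_biased(kx, ky):
    """Hsic_n = tr(K^x H K^y H) / n^2 with H = I - (1/n) J."""
    n = kx.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(kx @ h @ ky @ h) / n ** 2


def hsic_normalized(x, y):
    """Hsic_n(x, y) divided by sqrt(Hsic_n(x, x) * Hsic_n(y, y))."""
    kx, ky = gaussian_kernel(x), gaussian_kernel(y)
    denom = np.sqrt(hsic_biased(kx, kx) * hsic_biased(ky, ky))
    return hsic_biased(kx, ky) / denom if denom > 0 else 0.0


x = np.arange(7)
stat = hsic_normalized(x, x)
```

A custom kernel would follow the same pattern: any function that maps a data matrix to an \(n \times n\) similarity matrix could replace `gaussian_kernel` in this sketch.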

An unbiased version of Hsic is defined using the following centering process, where \(\mathbb{1}(\cdot)\) is the indicator function:

\[C^x_{ij} = \left[ K^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n K^x_{it} - \frac{1}{n-2} \sum_{s=1}^n K^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n K^x_{st} \right] \mathbb{1}_{i \neq j}\]

and similarly for \(C^y\). Then, the unbiased Hsic is,

\[\mathrm{UHsic}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)\]

The normalized version of this covariance [5] is

\[\mathrm{UHsic}_n (x, y) = \frac{\mathrm{UHsic}_n (x, y)} {\sqrt{\mathrm{UHsic}_n (x, x) \mathrm{UHsic}_n (y, y)}}\]

References

[6] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J. (2008). A kernel statistical test of independence. In Advances in neural information processing systems (pp. 585-592).
[7] Gretton, A., & Györfi, L. (2010). Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11(Apr), 1391-1423.

test(x, y, reps=1000, workers=1, auto=True)[source]

Calculates the Hsic test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

auto : bool (default: True)

If True and the sample size is greater than 20, a fast chi2 approximation is used to compute the p-value instead of the permutation test. The parameters reps and workers are ignored in this case.

Returns:

stat : float

The computed Hsic statistic.

pvalue : float

The computed Hsic p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a finer-grained p-value estimate, because the smallest p-value a permutation test can report shrinks as reps grows.

>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

In addition, the inputs can be kernel similarity matrices. Usage is the same as before, except that the compute_kernel parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import Hsic
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hsic = Hsic(compute_kernel=None)
>>> stat, pvalue = hsic.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Heller Heller Gorfine (HHG)

class hyppo.independence.HHG(compute_distance=<function euclidean>)[source]

Class for calculating the HHG test statistic and p-value.

This is a powerful test for independence based on calculating pairwise Euclidean distances and associations between these distance matrices. The test statistic is a function of the ranks of these distances, and the test is consistent against all dependent alternatives [8]. It can also operate on multivariate data [8].

Parameters:

compute_distance : callable(), optional (default: euclidean)

A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix before-hand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.

See also

Dcorr
Distance correlation test statistic and p-value.
Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.

Notes

The statistic can be derived as follows [8]:

Let \(x\) and \(y\) be \((n, p)\) and \((n, q)\) samples of random variables \(X\) and \(Y\). For every sample \(j \neq i\), calculate the pairwise distances in \(x\) and \(y\) and denote these as \(d_x(x_i, x_j)\) and \(d_y(y_i, y_j)\). The indicator function is denoted as \(\mathbb{1} \{ \cdot \}\). The cross-classification between these two random variables can be calculated as

\[A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}\]

and \(A_{12}\), \(A_{21}\), and \(A_{22}\) are defined similarly. This is organized within the following table:

|  | \(d_y(y_i, \cdot) \leq d_y(y_i, y_j)\) | \(d_y(y_i, \cdot) > d_y(y_i, y_j)\) |  |
| \(d_x(x_i, \cdot) \leq d_x(x_i, x_j)\) | \(A_{11} (i,j)\) | \(A_{12} (i,j)\) | \(A_{1 \cdot} (i,j)\) |
| \(d_x(x_i, \cdot) > d_x(x_i, x_j)\) | \(A_{21} (i,j)\) | \(A_{22} (i,j)\) | \(A_{2 \cdot} (i,j)\) |
|  | \(A_{\cdot 1} (i,j)\) | \(A_{\cdot 2} (i,j)\) | \(n - 2\) |

Here, \(A_{\cdot 1}\) and \(A_{\cdot 2}\) are the column sums, \(A_{1 \cdot}\) and \(A_{2 \cdot}\) are the row sums, and \(n - 2\) is the table total (the number of samples other than \(i\) and \(j\)). From this table, we can calculate Pearson's chi-squared test statistic using,

\[S(i, j) = \frac{(n-2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}\]

and the HHG test statistic is then,

\[\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)\]
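The cross-classification counts and the resulting statistic can be sketched with a direct double loop. This is an unoptimized illustration with hypothetical function names. It assumes that a pair \((i, j)\) whose table has a zero row or column sum contributes 0 to the statistic, which is a convention chosen for this sketch.

```python
import numpy as np


def distance_matrix(x):
    """Return the (n, n) Euclidean distance matrix of the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def hhg_statistic(dx, dy):
    """Sum Pearson's chi-squared statistic S(i, j) over all pairs i != j."""
    n = dx.shape[0]
    stat = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            others = np.ones(n, dtype=bool)
            others[[i, j]] = False  # sum over k != i, j
            x_close = dx[i, others] <= dx[i, j]
            y_close = dy[i, others] <= dy[i, j]
            a11 = int(np.sum(x_close & y_close))
            a12 = int(np.sum(x_close & ~y_close))
            a21 = int(np.sum(~x_close & y_close))
            a22 = int(np.sum(~x_close & ~y_close))
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            # Assumption: degenerate tables contribute nothing.
            if denom > 0:
                stat += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return stat


x = np.arange(7)
d = distance_matrix(x)
stat = hhg_statistic(d, d)
```

For identical inputs every non-degenerate pair contributes exactly \(n - 2\). With seven equally spaced points, 32 of the 42 ordered pairs are non-degenerate, giving \(32 \times 5 = 160\).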

References

[8] Heller, R., Heller, Y., & Gorfine, M. (2012). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510.

test(x, y, reps=1000, workers=1)[source]

Calculates the HHG test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:

stat : float

The computed HHG statistic.

pvalue : float

The computed HHG p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'

Increasing the number of replications gives a finer-grained p-value estimate, because the smallest p-value a permutation test can report shrinks as reps grows.

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'

In addition, the inputs can be distance matrices. Usage is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from hyppo.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Canonical Correlation Analysis (CCA)

class hyppo.independence.CCA[source]

Class for calculating the CCA test statistic and p-value.

This test can be thought of as inferring information from cross-covariance matrices [9]. It has been thought that virtually all parametric tests of significance can be treated as a special case of CCA [10]. The method was first introduced by Harold Hotelling in 1936 [11].

See also

RV
RV test statistic and p-value.

Notes

The statistic can be derived as follows [12]:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). We can center \(x\) and \(y\) and then calculate the sample covariance matrix \(\hat{\Sigma}_{xy} = x^T y\); the variance matrices for \(x\) and \(y\) are defined similarly. Then, the CCA test statistic is found by calculating vectors \(a \in \mathbb{R}^p\) and \(b \in \mathbb{R}^q\) that maximize

\[\mathrm{CCA}_n (x, y) = \max_{a \in \mathbb{R}^p, b \in \mathbb{R}^q} \frac{a^T \hat{\Sigma}_{xy} b} {\sqrt{a^T \hat{\Sigma}_{xx} a} \sqrt{b^T \hat{\Sigma}_{yy} b}}\]
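The maximization above has a closed-form solution. One standard way to compute it, sketched below, is to take orthonormal bases of the centered data via QR decompositions; the largest singular value of the product of the two bases is then the maximum of the ratio. This is an illustrative sketch with a hypothetical function name, not the library's implementation, and it assumes both centered matrices have full column rank.

```python
import numpy as np


def first_canonical_correlation(x, y):
    """Largest canonical correlation between the columns of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    if y.ndim == 1:
        y = y[:, None]
    x = x - x.mean(axis=0)  # center each column
    y = y - y.mean(axis=0)
    # Assumption: x and y have full column rank after centering.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    singular_values = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(min(singular_values[0], 1.0))  # clip rounding error


x = np.arange(7)
stat_same = first_canonical_correlation(x, x)
# (x - 3) ** 2 is uncorrelated with x on this symmetric grid.
stat_uncorrelated = first_canonical_correlation(x, (x - 3) ** 2)
```

The second call shows the known limitation of a linear measure: a perfect quadratic relationship yields a canonical correlation of 0.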

References

[9] Härdle, W. K., & Simar, L. (2015). Canonical correlation analysis. In Applied multivariate statistical analysis (pp. 443-454). Springer, Berlin, Heidelberg.
[10] Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance-testing system. Psychological Bulletin, 85(2), 410.
[11] Hotelling, H. (1992). Relations between two sets of variates. In Breakthroughs in statistics (pp. 162-190). Springer, New York, NY.
[12] Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12), 2639-2664.

test(x, y, reps=1000, workers=1)[source]

Calculates the CCA test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:

stat : float

The computed CCA statistic.

pvalue : float

The computed CCA p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a finer-grained p-value estimate, because the smallest p-value a permutation test can report shrinks as reps grows.

>>> import numpy as np
>>> from hyppo.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

RV

class hyppo.independence.RV[source]

Class for calculating the RV test statistic and p-value.

RV is the multivariate generalization of the squared Pearson correlation coefficient [13]. The RV coefficient can be thought to be closely related to principal component analysis (PCA), canonical correlation analysis (CCA), multivariate regression, and statistical classification [13].

See also

CCA
CCA test statistic and p-value.

Notes

The statistic can be derived as follows [13] [14]:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). We can center \(x\) and \(y\) and then calculate the sample covariance matrix \(\hat{\Sigma}_{xy} = x^T y\) and the variance matrices for \(x\) and \(y\) are defined similarly. Then, the RV test statistic is found by calculating

\[\mathrm{RV}_n (x, y) = \frac{\mathrm{tr} \left( \hat{\Sigma}_{xy} \hat{\Sigma}_{yx} \right)} {\sqrt{\mathrm{tr} \left( \hat{\Sigma}_{xx}^2 \right) \mathrm{tr} \left( \hat{\Sigma}_{yy}^2 \right)}}\]

where \(\mathrm{tr} (\cdot)\) is the trace operator.
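The RV coefficient can be sketched in a few lines of NumPy. The sketch uses the standard definition of the coefficient, with a square root over the product of traces in the denominator, so that identical inputs give a value of 1. The function name is hypothetical and this is not the library's implementation.

```python
import numpy as np


def rv_coefficient(x, y):
    """RV coefficient of two data matrices with the same number of rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    if y.ndim == 1:
        y = y[:, None]
    x = x - x.mean(axis=0)  # center each column
    y = y - y.mean(axis=0)
    sxy = x.T @ y  # cross-covariance (unnormalized)
    sxx = x.T @ x
    syy = y.T @ y
    denom = np.sqrt(np.trace(sxx @ sxx) * np.trace(syy @ syy))
    return float(np.trace(sxy @ sxy.T) / denom) if denom > 0 else 0.0


x = np.arange(7)
stat_same = rv_coefficient(x, x)
# (x - 3) ** 2 is uncorrelated with x on this symmetric grid.
stat_uncorrelated = rv_coefficient(x, (x - 3) ** 2)
```

For one-dimensional inputs the expression reduces to the squared Pearson correlation, consistent with the description of RV as its multivariate generalization.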

References

[13] Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3), 257-265.
[14] Escoufier, Y. (1973). Le traitement des variables vectorielles. Biometrics, 751-760.

test(x, y, reps=1000, workers=1)[source]

Calculates the RV test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

Returns:

stat : float

The computed RV statistic.

pvalue : float

The computed RV p-value.

Examples

>>> import numpy as np
>>> from hyppo.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a finer-grained p-value estimate, because the smallest p-value a permutation test can report shrinks as reps grows.

>>> import numpy as np
>>> from hyppo.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'