DISCO

class hyppo.ksample.DISCO(compute_distance='euclidean', bias=False, **kwargs)

Distance Components (DISCO) test statistic and p-value.

DISCO is a powerful multivariate k-sample test. It operates on pairwise distance matrices, in the same way as tests such as distance correlation (Dcorr). In fact, the DISCO statistic is equivalent to our 2-sample formulation of nonparametric MANOVA via independence testing, i.e. hyppo.ksample.KSample, and to hyppo.independence.Dcorr, hyppo.ksample.Energy, hyppo.independence.Hsic, and hyppo.ksample.MMD 1 2.

Parameters

compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances:
- From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"]
- From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"] See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if the inputs are already distance matrices. To use a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
bias (bool, default: False) -- Whether to use the biased (True) or unbiased (False) test statistic.
**kwargs -- Arbitrary keyword arguments for compute_distance.
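As a rough illustration of the metric(x, **kwargs) form described above, the following sketch builds a pairwise distance matrix with NumPy. The function name chebyshev_metric and the scale keyword are hypothetical and exist only for this example; they are not part of hyppo.

```python
import numpy as np


def chebyshev_metric(x, **kwargs):
    """Pairwise Chebyshev (max-coordinate) distances between the rows of x.

    Hypothetical example of a callable in the ``metric(x, **kwargs)`` form.
    The optional ``scale`` keyword is an illustration-only extra argument.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    scale = kwargs.get("scale", 1.0)
    # Broadcast to an (n, n, p) array of absolute coordinate differences.
    diffs = np.abs(x[:, None, :] - x[None, :, :])
    return scale * diffs.max(axis=-1)


x = np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 1.0]])
distx = chebyshev_metric(x, scale=2.0)  # (3, 3) symmetric distance matrix
```

Following the parameter description, such a callable would be supplied as compute_distance, with scale forwarded through **kwargs.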

Notes

Traditionally, the formulation for the DISCO statistic is as follows 3:

Define \(\{ u^i_1 \stackrel{iid}{\sim} F_{U_1},\ i = 1, ..., n_1 \}\) up to \(\{ u^j_k \stackrel{iid}{\sim} F_{U_k},\ j = 1, ..., n_k \}\) as k groups of samples drawn from possibly different distributions with the same dimensionality. If \(d(\cdot, \cdot)\) is a distance metric (e.g. Euclidean), \(N = \sum_{i = 1}^k n_i\), and \(\mathrm{Energy}\) is the Energy test statistic from hyppo.ksample.Energy then,

\[\mathrm{DISCO}_N(\mathbf{u}_1, \ldots, \mathbf{u}_k) = \sum_{1 \leq i < j \leq k} \frac{n_i n_j}{2N} \mathrm{Energy}_{n_i + n_j} (\mathbf{u}_i, \mathbf{u}_j)\]
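The weighted sum over pairs of groups can be sketched directly in NumPy. This is an illustration of the formula only, not hyppo's implementation: it assumes the biased two-sample energy statistic (twice the mean between-group distance minus the two mean within-group distances) with Euclidean distance, and the helper names pairwise_dist, energy_stat, and disco_stat are hypothetical. Values from hyppo may differ, for example through the bias setting or a scaling factor.

```python
from itertools import combinations

import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def energy_stat(u, v):
    """Biased two-sample energy statistic (assumed form, see lead-in)."""
    return (2.0 * pairwise_dist(u, v).mean()
            - pairwise_dist(u, u).mean()
            - pairwise_dist(v, v).mean())


def disco_stat(*groups):
    """Weighted sum of pairwise energy statistics over all pairs of groups."""
    # Coerce each group to shape (n_i, p); 1-D inputs become (n_i, 1).
    groups = [np.asarray(g, dtype=float).reshape(len(g), -1) for g in groups]
    n_total = sum(len(g) for g in groups)
    total = 0.0
    for u, v in combinations(groups, 2):
        weight = len(u) * len(v) / (2.0 * n_total)
        total += weight * energy_stat(u, v)
    return total


x = np.array([0.0, 1.0])
y = np.array([2.0, 3.0])
stat = disco_stat(x, y)  # weight 2*2/(2*4) = 0.5, energy 2*2 - 0.5 - 0.5 = 3
```

For identical groups every pairwise energy term vanishes, so this sketch returns 0; groups of different sizes are handled through the per-pair weights.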

The implementation in the hyppo.ksample.KSample class (using hyppo.independence.Dcorr) is in fact equivalent to this formulation: the p-values agree, and the statistics agree up to a scaling factor 1.

The p-value returned is calculated with a permutation test using hyppo.tools.perm_test. The fast version of the test uses hyppo.tools.chi2_approx.
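The idea behind a permutation test for a k-sample statistic can be sketched as follows. This is a generic illustration and not the code in hyppo.tools.perm_test: the names permutation_pvalue and mean_gap are hypothetical, and the toy statistic (the spread of the group means) merely stands in for DISCO.

```python
import numpy as np


def mean_gap(groups):
    """Toy k-sample statistic: spread of the group means (illustration only)."""
    means = [g.mean() for g in groups]
    return max(means) - min(means)


def permutation_pvalue(stat_fn, groups, reps=1000, seed=0):
    """Estimate a p-value by shuffling group labels over the pooled samples."""
    rng = np.random.default_rng(seed)
    groups = [np.asarray(g, dtype=float) for g in groups]
    pooled = np.concatenate(groups)
    # Indices at which to cut the shuffled pool back into k groups.
    cuts = np.cumsum([len(g) for g in groups])[:-1]
    observed = stat_fn(groups)
    exceed = 0
    for _ in range(reps):
        shuffled = rng.permutation(pooled)  # shuffles along the sample axis
        if stat_fn(np.split(shuffled, cuts)) >= observed:
            exceed += 1
    # Count the observed statistic itself so the estimate is never exactly 0.
    return (1 + exceed) / (1 + reps)


x = np.arange(10.0)
y = x + 100.0
pvalue = permutation_pvalue(mean_gap, [x, y], reps=200, seed=0)
```

Under the null hypothesis the group labels are exchangeable, so the fraction of shuffled statistics at least as large as the observed one estimates the p-value; more replications give a finer estimate at higher cost.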

References

1(1,2): Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, and Joshua T. Vogelstein. Universally consistent K-sample tests via dependence measures. Statistics & Probability Letters, 216:110278, January 2025. doi:10.1016/j.spl.2024.110278.
2: Cencheng Shen and Joshua T. Vogelstein. The exact equivalence of distance and kernel methods in hypothesis testing. AStA Advances in Statistical Analysis, September 2020. doi:10.1007/s10182-020-00378-1.
3: Gábor J. Székely and Maria L. Rizzo. Testing for equal distributions in high dimensions. InterStat, 2004.

Methods Summary

`DISCO.statistic`(*args)	Calculates the DISCO test statistic.
`DISCO.test`(*args[, reps, workers, auto, ...])	Calculates the DISCO test statistic and p-value.

DISCO.statistic(*args)

Calculates the DISCO test statistic.

Parameters: *args (ndarray of float) -- Variable length input data matrices. All inputs must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
Returns: stat (float) -- The computed DISCO statistic.

DISCO.test(*args, reps=1000, workers=1, auto=True, random_state=None)

Calculates the DISCO test statistic and p-value.

Parameters

*args (ndarray of float) -- Variable length input data matrices. All inputs must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
reps (int, default: 1000) -- The number of replications used to estimate the null distribution when the p-value is computed with the permutation test.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto (bool, default: True) -- Automatically uses the fast approximation when the sample size n is greater than 20. If True and the sample size is greater than 20, hyppo.tools.chi2_approx is run, and the parameters reps and workers are ignored. Otherwise, hyppo.tools.perm_test is run.

Returns

stat (float) -- The computed DISCO statistic.
pvalue (float) -- The computed DISCO p-value.

Examples

>>> import numpy as np
>>> from hyppo.ksample import DISCO
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = DISCO().test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'-1.566, 1.0'

Examples using `hyppo.ksample.DISCO`

MMD

MANOVA
