KMERF¶

class hyppo.independence.KMERF(forest='regressor', ntrees=500, **kwargs)¶

Kernel Mean Embedding Random Forest (KMERF) test statistic and p-value.

The KMERF test statistic is a kernel method for calculating independence by using a random forest induced similarity matrix as an input, and has been shown to have especially high gains in finite sample testing power in high dimensional settings 1.

Parameters

forest ("regressor", "classifier", default: "regressor") -- Type of forest used when running the independence test. If the y input in test is categorial, use the "classifier" keyword.
ntrees (int, default: 500) -- The number of trees used in the random forest.
compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples for y. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances,
- From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"] See the documentation for scipy.spatial.distance for details on these metrics.
- From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"] See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if y is already a distance matrices. To call a custom function, either create the distance matrix before-hand or create a function of the form metric(x, **kwargs) where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguements to send to your custom function.
distance_kwargs (dict) -- Arbitrary keyword arguments for compute_distance.
**kwargs -- Additional arguments used for the forest (see sklearn.ensemble.RandomForestClassifier or sklearn.ensemble.RandomForestRegressor)

Notes

Note

This algorithm is currently under review at a peer-reviewed conference.

A description of KMERF in greater detail can be found in 1. It is computed using the following steps:

Let \(x\) and \(y\) be \((n, p)\) and \((n, 1)\) samples of random variables \(X\) and \(Y\).

Run random forest with \(m\) trees. Independent bootstrap samples of size \(n_{b} \leq n\) are drawn to build a tree each time; each tree structure within the forest is denoted as \(\phi_w \in \mathbf{P}\), \(w \in \{ 1, \ldots, m \}\); \(\phi_w(x_i)\) denotes the partition assigned to \(x_i\).
Calculate the proximity kernel:

\[\mathbf{K}^{\mathbf{x}}_{ij} = \frac{1}{m} \sum_{w = 1}^{m} I(\phi_w(x_i) = \phi_w(x_j))\]

where \(I(\cdot)\) is the indicator function for how often two observations lie in the same partition.
Compute the induced kernel correlation: Let

\[\begin{split}\mathbf{L}^{\mathbf{x}}_{ij}= \begin{cases} \mathbf{K}^{\mathbf{x}}_{ij} - \frac{1}{n-2} \sum_{t=1}^{n} \mathbf{K}^{\mathbf{x}}_{it} - \frac{1}{n-2} \sum_{s=1}^{n} \mathbf{K}^{\mathbf{x}}_{sj} + \frac{1}{(n-1)(n-2)} \sum_{s,t=1}^{n} \mathbf{K}^{\mathbf{x}}_{st} & \mbox{when} \ i \neq j \\ 0 & \mbox{ otherwise} \end{cases}\end{split}\]
Then let \(\mathbf{K}^{\mathbf{y}}\) be the Euclidean distance induced kernel, and similarly compute \(\mathbf{L}^{\mathbf{y}}\) from \(\mathbf{K}^{\mathbf{y}}\). The unbiased kernel correlation equals

\[\mathrm{KMERF}_n(\mathbf{x}, \mathbf{y}) = \frac{1}{n(n-3)} \mathrm{tr} \left( \mathbf{L}^{\mathbf{x}} \mathbf{L}^{\mathbf{y}} \right)\]

The p-value returned is calculated using a permutation test using hyppo.tools.perm_test.

References

1(1,2): Cencheng Shen, Sambit Panda, and Joshua T. Vogelstein. Learning Interpretable Characteristic Kernels via Decision Forests. arXiv:1812.00029 [cs, stat], September 2020. arXiv:1812.00029.

Methods Summary

`KMERF.statistic`(x, y)	Helper function that calculates the KMERF test statistic.
`KMERF.test`(x, y[, reps, workers, auto, ...])	Calculates the KMERF test statistic and p-value.

KMERF.statistic(x, y)¶

Helper function that calculates the KMERF test statistic.

Parameters: x,y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Returns: stat (float) -- The computed KMERF statistic.

KMERF.test(x, y, reps=1000, workers=1, auto=True, random_state=None)¶

Calculates the KMERF test statistic and p-value.

Parameters

x,y (ndarray of float) -- Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
reps (int, default: 1000) -- The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto (bool, default: True) -- Automatically uses fast approximation when n and size of array is greater than 20. If True, and sample size is greater than 20, then hyppo.tools.chi2_approx will be run. Parameters reps and workers are irrelevant in this case. Otherwise, hyppo.tools.perm_test will be run.

Returns

stat (float) -- The computed KMERF statistic.
pvalue (float) -- The computed KMERF p-value.
kmerf_dict (dict) --

Contains additional useful returns containing the following keys:
- feat_importancendarray of float
  An array containing the importance of each dimension

Examples

>>> import numpy as np
>>> from hyppo.independence import KMERF
>>> x = np.arange(100)
>>> y = x
>>> '%.1f, %.2f' % KMERF().test(x, y)[:1] 
'1.0, 0.001'

Examples using `hyppo.independence.KMERF`¶

MaxMargin

MGC

KMERF¶

Examples using hyppo.independence.KMERF¶

Examples using `hyppo.independence.KMERF`¶