FriedmanRafsky

class hyppo.independence.FriedmanRafsky(**kwargs)

Friedman-Rafsky (FR) test statistic and p-value. This is a multivariate extension of the Wald-Wolfowitz runs test for randomness. The usual concept of a 'run' is replaced by a minimum spanning tree (MST) computed over the points of the combined data sets, with each edge weighted by the Euclidean distance between the two points it joins. Once the MST has been determined, every edge whose two nodes do not belong to the same class is severed, and the number of resulting disjoint trees is counted. This test is consistent against similar tests.

Notes

The statistic can be derived as follows [1]:

Let \(x\) be the combined sample formed by stacking an \((n, p)\) sample and an \((m, p)\) sample of random variables \(X\), and let \(y\) be an \((n+m, 1)\) array of class labels \(Y\). We then build a graph in which every point in \(x\) is connected to every other point in \(x\) by an edge weighted by the Euclidean distance between the two points. The minimum spanning tree of this graph is computed, and every edge whose endpoints carry different labels in \(y\) is removed. The number of disjoint subtrees that remain is the uncorrected statistic for the test.
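The construction above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not hyppo's implementation: the helper names mst_edges and count_runs are hypothetical, Prim's algorithm is one of several ways to obtain an MST, and ties between equal distances are broken arbitrarily.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the n - 1 MST edges as (i, j) index pairs.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link into the tree
    best_from = [-1] * n         # tree node that provides that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest links for the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def count_runs(points, labels):
    """Uncorrected statistic: subtrees left after cutting mixed-label edges."""
    cut = sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])
    # Removing k edges from a tree leaves k + 1 connected pieces.
    return cut + 1


# Two well-separated clusters: only one MST edge joins different labels.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(count_runs(separated, [0, 0, 0, 1, 1, 1]))

# Perfectly interleaved points on a line: every MST edge is cut.
print(count_runs([(0,), (1,), (2,), (3,)], [0, 1, 0, 1]))
```

The two examples land on the extremes of the statistic's range: the separated clusters give the smallest count for two classes, and the interleaved points give a count equal to the number of points.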

The p-value and null distribution for the corrected statistic are calculated via a permutation test using hyppo.tools.perm_test.

References

[1] Jerome Friedman and Lawrence Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist., 7(4):697–717, July 1979. doi:10.1214/aos/1176344722.

Methods Summary

FriedmanRafsky.statistic(x, y)

Helper function that calculates the Friedman Rafsky test statistic.

FriedmanRafsky.test(x, y[, reps, workers, ...])

Calculates the Friedman Rafsky test statistic and p-value.


FriedmanRafsky.statistic(x, y)

Helper function that calculates the Friedman Rafsky test statistic.

Parameters

x, y (ndarray of float) -- Input data matrices. x and y must have the same number of rows. That is, the shapes must be (n, p) and (n, 1), where n is the number of combined samples and p is the number of dimensions. y is the array of labels indicating which of the two samples each row of x belongs to.

Returns

stat (float) -- The computed (uncorrected) Friedman Rafsky statistic. A value between 2 and n.

FriedmanRafsky.test(x, y, reps=1000, workers=1, random_state=None)

Calculates the Friedman Rafsky test statistic and p-value.

Parameters
  • x, y (ndarray of float) -- Input data matrices. x and y must have the same number of rows. That is, the shapes must be (n, p) and (n, 1), where n is the number of combined samples and p is the number of dimensions. y is the array of labels indicating which of the two samples each row of x belongs to.

  • reps (int, default: 1000) -- The number of replications used to estimate the null distribution in the permutation test that calculates the p-value.

  • workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the process.

  • random_state (int, default: None) -- Seed that fixes the random state of the permutation test for reproducibility.

Returns

  • stat (float) -- The computed (corrected) Friedman Rafsky statistic.

  • pvalue (float) -- The computed Friedman Rafsky p-value.

  • uncor_stat (float) -- The computed (uncorrected) Friedman Rafsky statistic.