{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n\n# Independence Testing\n\nFinding trends in data is pretty hard. Independence testing is a fundamental tool that\nhelps us find trends in data and make further decisions based on these results. This\ntesting is desirable to answer many question such as: does brain connectivity effect\ncreativity? Does gene expression effect cancer? Does something effect something else?\n\nIf you are interested in questions of this mold, this module of the package is for you!\nAll our tests can be found in :mod:hyppo.independence, and will be elaborated in\ndetail below. But before that, let's look at the mathematical formulations:\n\nConsider random variables $X$ and $Y$ with distributions $F_X$ and\n$F_Y$ respectively, and joint distribution is $F_{XY} = F_{Y|X} F_X$.\nWhen performing independence testing, we are seeing whether or not\n$F_{Y|X} = F_Y$. That is, we are testing\n\n\\begin{align}H_0 &: F_{XY} = F_X F_Y \\\\\n H_A &: F_{XY} \\neq F_X F_Y\\end{align}\n\nLike all the other tests within hyppo, each method has a :func:statistic and\n:func:test method. The :func:test method is the one that returns the test statistic\nand p-values, among other outputs, and is the one that is used most often in the\nexamples, tutorials, etc.\nThe p-value returned is calculated using a permutation test using\n:meth:hyppo.tools.perm_test unless otherwise specified.\n\nSpecifics about how the test statistics are calculated for each in\n:class:hyppo.independence can be found the docstring of the respective test. 
Here,\nwe overview subsets of the types of independence tests we offer in hyppo, and special\nparameters unique to those tests.\n\nNow, let's look at unique properties of some of the tests in :mod:hyppo.independence:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pearson's Correlation Multivariate Variants\n\n**Cannonical correlation (CCA)** and **Rank value (RV)** are multivariate analogues\nof Pearson's correlation.\nMore details can be found in :class:hyppo.independence.CCA and\n:class:hyppo.independence.RV.\nThe following applies to both:\n\n

#### Note

**Pros:**\n - Very fast\n - Similar to tests found in scientific literature\n\n **Cons:**\n - Not accurate when compared to other tests in most situations\n - Makes dangerous variance assumptions about the data, among others (similar to the assumptions of Pearson's correlation)

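Since CCA and RV are covariance-based, the flavor of their statistics can be sketched directly with NumPy. The `rv_coefficient` helper below is a hypothetical, simplified illustration of the RV statistic built from (cross-)scatter matrices, not hyppo's exact implementation:

```python
import numpy as np

def rv_coefficient(x, y):
    # Simplified RV statistic (illustration only, not hyppo's code):
    # a multivariate analogue of squared Pearson correlation.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    covxy = x.T @ y  # cross-scatter matrix between x and y
    varx = np.trace((x.T @ x) @ (x.T @ x))
    vary = np.trace((y.T @ y) @ (y.T @ y))
    return np.trace(covxy @ covxy.T) / np.sqrt(varx * vary)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))
y_ind = rng.standard_normal((100, 3))
print(rv_coefficient(x, x))      # identical inputs give 1
print(rv_coefficient(x, y_ind))  # near 0 under independence
```

Identical inputs give a statistic of exactly 1, while independent inputs give a value near 0, which is the basic behavior the real tests share.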
\n\nNeither of these tests is distance based, and so they do not have a compute_distance\nparameter.\nOtherwise, these tests run like any other test.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distance (and Kernel) Based Tests\n\nA number of tests within :mod:hyppo.independence use the concept of inter-sample\ndistance, or kernel similarity, to generate powerful independence tests.\n\n**Heller Heller Gorfine (HHG)** is a powerful multivariate independence test that\ncompares inter-sample distances and computes a Pearson statistic.\nMore details can be found in :class:hyppo.independence.HHG.\n\n

#### Note

**Pros:**\n - Very accurate in certain settings\n\n **Cons:**\n - Very slow (computationally complex)\n - Test statistic not very interpretable, and not bounded between -1 and 1

\n\nThis test runs like any other test.\n\n------------\n\n**Distance Correlation (Dcorr)** is a powerful multivariate independence test based on\nenergy distance.\n**Hilbert Schmidt Independence Criterion (Hsic)** is a kernel-based analogue to Dcorr\nthat uses a Gaussian median kernel by default.\nMore details can be found in :class:hyppo.independence.Dcorr and\n:class:hyppo.independence.Hsic.\nThe following applies to both:\n\n

#### Note

**Pros:**\n - Accurate, powerful independence test for multivariate and nonlinear data\n - Has enticing empirical properties (the foundation of some of the other tests in the package)\n - Has fast implementations (fastest test in the package)\n\n **Cons:**\n - Slightly less accurate than the above tests

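To build intuition for Dcorr, here is a sketch of the biased distance correlation statistic for 1-D data using NumPy. The `biased_dcorr` helper is illustrative only; hyppo defaults to the unbiased variant and handles multivariate data:

```python
import numpy as np

def biased_dcorr(x, y):
    # Biased distance correlation sketch for 1-D inputs.
    def centered(z):
        d = np.abs(z[:, None] - z[None, :])  # pairwise distance matrix
        # double-center: subtract row/column means, add grand mean
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    a, b = centered(x), centered(y)
    dcov2 = (a * b).mean()  # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((a * a).mean() * (b * b).mean()))

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
print(biased_dcorr(x, x))       # perfect dependence: 1.0
print(biased_dcorr(x, x ** 2))  # nonlinear dependence is still detected
```

Note that the statistic stays well above zero for the quadratic relationship, which Pearson's correlation would miss entirely.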
\n\nFor Hsic, kernels are used instead of distances with the compute_kernel parameter.\nIn addition, if the bias variant of the test statistic is required, then the bias\nparameter can be set to True. In general, we do not recommend doing this.\nOtherwise, these tests run like any other test.\nSince Dcorr and Hsic are implemented similarly, let's look at Dcorr.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import timeit\n\nsetup_code = \"\"\"\nfrom hyppo.independence import Dcorr\nfrom hyppo.tools import w_shaped\nx1, y1 = w_shaped(n=100, p=3, noise=True)\nx2, y2 = w_shaped(n=100, p=1, noise=True)\n\"\"\"\n\nt_perm = timeit.Timer(stmt=\"Dcorr().test(x1, y1, auto=False)\", setup=setup_code)\nt_chisq = timeit.Timer(stmt=\"Dcorr().test(x1, y1, auto=True)\", setup=setup_code)\nt_fast_perm = timeit.Timer(stmt=\"Dcorr().test(x2, y2, auto=False)\", setup=setup_code)\nt_fast_chisq = timeit.Timer(stmt=\"Dcorr().test(x2, y2, auto=True)\", setup=setup_code)\n\nperm_time = t_perm.timeit(number=1)  # permutation Dcorr, p=3\nchisq_time = t_chisq.timeit(number=1000) / 1000  # chi-square Dcorr, p=3\nfast_perm_time = t_fast_perm.timeit(number=1)  # permutation Dcorr, p=1\nfast_chisq_time = t_fast_chisq.timeit(number=1000) / 1000  # chi-square Dcorr, p=1 (fast statistic)\n\nprint(\"Permutation time: {0:.3g}s\".format(perm_time))\nprint(\"Fast time (chi-square): {0:.3g}s\".format(chisq_time))\nprint(\"Permutation time (fast statistic): {0:.3g}s\".format(fast_perm_time))\nprint(\"Fast time (fast statistic chi-square): {0:.3g}s\".format(fast_chisq_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the speedup when using the fast test!\nThe fast test approximates the null distribution as a chi-squared random variable,\nand so is far faster than the permutation method.\nTo call it, simply set auto to True, which is the default, and if your data\nhas a sample size 
greater than 20, then the fast test will run.\nIn the case where the data is 1-dimensional and Euclidean, an even faster version is\nrun.\n\n------------\n\n**Multiscale graph correlation (MGC)** combines the power of Dcorr\nwith k-nearest neighbors to create an efficient and powerful independence test.\nMore details can be found in :class:hyppo.independence.MGC.\n\n

#### Note

We recently added the majority of the source code of this algorithm to\n :func:scipy.stats.multiscale_graphcorr.\n This class serves as a wrapper for that implementation.

\n\n

#### Note

**Pros:**\n - Highly accurate, powerful independence test for multivariate and nonlinear data\n - Gives information about the geometric nature of the dependency\n\n **Cons:**\n - Slightly slower than similar tests in this section

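Because the core implementation lives in SciPy, MGC can also be called directly from there. A minimal sketch, assuming SciPy >= 1.4 is installed (this bypasses hyppo's wrapper entirely):

```python
import numpy as np
from scipy.stats import multiscale_graphcorr

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = x ** 2 + 0.1 * rng.standard_normal(50)  # nonlinear dependence

# a modest number of permutation replicates keeps this example quick;
# use more (e.g. the default 1000) for a reliable p-value
stat, pvalue, mgc_dict = multiscale_graphcorr(x, y, reps=100, random_state=0)
print(stat, pvalue)
```

The returned mgc_dict carries the same MGC map and optimal scale discussed below.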
\n\nMGC has some specific differences outlined below, but creating an instance of the\nclass was demonstrated in the Overview.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from hyppo.independence import MGC\nfrom hyppo.tools import w_shaped\n\n# 100 samples, 1D x and 1D y, noise\nx, y = w_shaped(n=100, p=1, noise=True)\n\n# get the MGC map and optimal scale (only need the statistic)\n_, _, mgc_dict = MGC().test(x, y, reps=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A unique property of MGC is the MGC map, which is an (n, n) map (or (n, k), where\nk is less than n, if there are repeated values). It shows the test statistic\nat all of these different \"scales\". The optimal scale is an ordered pair indicating\nthe numbers of nearest neighbors at which the test statistic is maximized, and is marked by a red \"X\".\nThe test statistic MGC reports is the one at this optimal scale.\nThe heat map gives insights into the nature of the dependency between x and y.\nOtherwise, this test runs like any other test.\nWe can plot it below:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport matplotlib.ticker as ticker\nimport seaborn as sns\n\n# make plots look pretty\nsns.set(color_codes=True, style=\"white\", context=\"talk\", font_scale=1)\n\nmgc_map = mgc_dict[\"mgc_map\"]\nopt_scale = mgc_dict[\"opt_scale\"] # i.e. 
maximum smoothed test statistic\nprint(\"Optimal Scale:\", opt_scale)\n\n# create figure\nfig, (ax, cax) = plt.subplots(\n ncols=2, figsize=(9, 8.5), gridspec_kw={\"width_ratios\": [1, 0.05]}\n)\n\n# draw heatmap and colorbar (the heatmap mesh is the first child artist)\nax = sns.heatmap(mgc_map, cmap=\"YlGnBu\", ax=ax, cbar=False)\nfig.colorbar(ax.get_children()[0], cax=cax, orientation=\"vertical\")\nax.invert_yaxis()\n\n# mark the optimal scale with a red X\nax.scatter(opt_scale[0], opt_scale[1], marker=\"X\", s=200, color=\"red\")\n\n# make plots look nice\nax.set_title(\"MGC Map\")\nax.xaxis.set_major_locator(ticker.MultipleLocator(10))\nax.xaxis.set_major_formatter(ticker.ScalarFormatter())\nax.yaxis.set_major_locator(ticker.MultipleLocator(10))\nax.yaxis.set_major_formatter(ticker.ScalarFormatter())\nax.set_xlabel(\"Neighbors for x\")\nax.set_ylabel(\"Neighbors for y\")\nax.set_xticks([0, 50, 100])\nax.set_yticks([0, 50, 100])\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For linear data, we would expect the optimal scale to be at the maximum nearest\nneighbor pair. Since we consider nonlinear data, this is not the case.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forest Based Tests\n\nRandom-forest based tests exploit the theoretical properties of decision-tree based\nclassifiers to create highly accurate tests (especially in cases of high-dimensional\ndata sets).\n\n**Kernel mean embedding random forest (KMERF)** is one such test, which uses\nsimilarity matrices produced by a random forest to generate a test statistic and\np-value. More details can be found in :class:hyppo.independence.KMERF.\n\n

#### Note

**Pros:**\n - Highly accurate, powerful independence test for multivariate and nonlinear data\n - Gives information about relative dimension (i.e. feature) importance\n\n **Cons:**\n - Very slow (requires training a random forest for each permutation)

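The similarity matrix at the heart of KMERF can be sketched with scikit-learn (assumed installed): two samples are similar when they land in the same leaf across many trees. This is an illustration of the idea, not hyppo's exact kernel construction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.standard_normal((60, 3))
y = x[:, 0] + 0.1 * rng.standard_normal(60)  # y depends on the first dimension

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x, y)
leaves = forest.apply(x)  # (n_samples, n_trees) leaf index per tree

# similarity = fraction of trees in which two samples share a leaf
sim = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
print(sim.shape)           # (60, 60)
print(np.diag(sim).min())  # every sample shares all leaves with itself: 1.0
```

This (n, n) similarity matrix plays the role that a distance or kernel matrix plays in the tests above.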
\n\nUnlike other tests, there is no compute_distance\nparameter. Instead, the number of trees can be set explicitly, and the type of\nforest can be set (\"classifier\" in the case where the response $y$ is\ncategorical, and \"regressor\" when that is not the case). Check out\n:class:sklearn.ensemble.RandomForestClassifier and\n:class:sklearn.ensemble.RandomForestRegressor for additional parameters to change.\nOtherwise, this test runs like any other test.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from hyppo.independence import KMERF\nfrom hyppo.tools import cubic\n\n# 100 samples, 5D sim\nx, y = cubic(n=100, p=5)\n\n# get the feature importances (only need the statistic)\n_, _, kmerf_dict = KMERF(ntrees=5000).test(x, y, reps=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because this test is random-forest based, we can get feature importances. This gives\nus the relative importance of each dimension (i.e. feature). KMERF returns a normalized\nversion of this parameter, and we can plot these importances:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "importances = kmerf_dict[\"feat_importance\"]\ndims = range(1, 6) # range of dimensions of simulation\n\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\n# make plots look pretty\nsns.set(color_codes=True, style=\"white\", context=\"talk\", font_scale=1)\n\n# plot the feature importances\nplt.figure(figsize=(9, 6.5))\nplt.plot(dims, importances)\nplt.xlim([1, 5])\nplt.xticks([1, 2, 3, 4, 5])\nplt.xlabel(\"Dimensions\")\nplt.ylim([0, 1])\nplt.yticks([])\nplt.ylabel(\"Normalized Feature Importance\")\nplt.title(\"Feature Importances\")\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that feature importance decreases as dimension increases. 
This is true for\nmost of the simulations in :mod:hyppo.tools.\n\n## Maximal Margin Correlation\n\n**Maximal Margin Correlation** takes the independence tests in\n:mod:hyppo.independence and computes the maximal correlation of pairwise comparisons\nbetween each dimension of x and y.\nMore details can be found in :class:hyppo.independence.MaxMargin.\n\n

#### Note

**Pros:**\n - As powerful as some of the tests within this module\n - Minimal decrease in testing power as dimension increases\n\n **Cons:**\n - Adds computational complexity, so can be slow

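The idea behind the maximal margin statistic can be sketched as taking the maximum of a base statistic over all 1-D dimension pairs. The `max_margin_stat` helper below is a hypothetical simplification of :class:hyppo.independence.MaxMargin, using absolute Pearson correlation as a stand-in base statistic (the real class wraps any hyppo independence test):

```python
import numpy as np

def max_margin_stat(x, y, stat):
    # maximize the base statistic over all pairs of 1-D dimensions
    return max(
        stat(x[:, i], y[:, j])
        for i in range(x.shape[1])
        for j in range(y.shape[1])
    )

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))
# only y's first column depends on x (via x's first column)
y = np.column_stack([x[:, 0] + 0.1 * rng.standard_normal(100),
                     rng.standard_normal(100)])

abs_pearson = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
mm = max_margin_stat(x, y, abs_pearson)
print(mm)  # driven by the strongly dependent (x[:, 0], y[:, 0]) pair
```

Because only 1-D pairs are ever compared, the statistic degrades slowly as extra noise dimensions are added, which is the source of the "minimal decrease in testing power" noted above.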
\n\nThese tests have an indep_test parameter corresponding to the desired independence\ntest to be run. All the parameters from the above tests can also be modified; see\nthe relevant section of the reference documentation in :mod:hyppo.independence for more\ninformation.\nThese tests run like any other test.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }