Mercurial > hg > de.mpg.mpiwg.itgroup.digilib.plugin
diff libs/commons-math-2.1/docs/userguide/stat.html @ 10:5f2c5fb36e93
commons-math-2.1 added
author | dwinter |
---|---|
date | Tue, 04 Jan 2011 10:00:53 +0100 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/libs/commons-math-2.1/docs/userguide/stat.html Tue Jan 04 10:00:53 2011 +0100 @@ -0,0 +1,1227 @@ +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + + + + + + + + + + + +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title>Math - The Commons Math User Guide - Statistics</title> + <style type="text/css" media="all"> + @import url("../css/maven-base.css"); + @import url("../css/maven-theme.css"); + @import url("../css/site.css"); + </style> + <link rel="stylesheet" href="../css/print.css" type="text/css" media="print" /> + <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> + </head> + <body class="composite"> + <div id="banner"> + <span id="bannerLeft"> + + Commons Math User Guide + + </span> + <div class="clear"> + <hr/> + </div> + </div> + <div id="breadcrumbs"> + + + + + + + + + <div class="xright"> + + + + + + + + </div> + <div class="clear"> + <hr/> + </div> + </div> + <div id="leftColumn"> + <div id="navcolumn"> + + + + + + + + + <h5>User Guide</h5> + <ul> + + <li class="none"> + <a href="../userguide/index.html">Contents</a> + </li> + + <li class="none"> + <a href="../userguide/overview.html">Overview</a> + </li> + + <li class="none"> + <strong>Statistics</strong> + </li> + + <li class="none"> + <a href="../userguide/random.html">Data Generation</a> + </li> + + <li class="none"> + <a href="../userguide/linear.html">Linear Algebra</a> + </li> + + <li class="none"> + <a href="../userguide/analysis.html">Numerical Analysis</a> + </li> + + <li class="none"> + <a href="../userguide/special.html">Special Functions</a> + </li> + + <li class="none"> + <a href="../userguide/utilities.html">Utilities</a> + </li> + + <li class="none"> + <a href="../userguide/complex.html">Complex Numbers</a> + </li> + + <li class="none"> + <a href="../userguide/distribution.html">Distributions</a> + </li> + + <li class="none"> + <a 
href="../userguide/fraction.html">Fractions</a> + </li> + + <li class="none"> + <a href="../userguide/transform.html">Transform Methods</a> + </li> + + <li class="none"> + <a href="../userguide/geometry.html">3D Geometry</a> + </li> + + <li class="none"> + <a href="../userguide/optimization.html">Optimization</a> + </li> + + <li class="none"> + <a href="../userguide/ode.html">Ordinary Differential Equations</a> + </li> + + <li class="none"> + <a href="../userguide/genetics.html">Genetic Algorithms</a> + </li> + </ul> + <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"> + <img alt="Built by Maven" src="../images/logos/maven-feather.png"></img> + </a> + + + + + + + + + </div> + </div> + <div id="bodyColumn"> + <div id="contentBox"> + <div class="section"><h2><a name="a1_Statistics"></a>1 Statistics</h2> +<div class="section"><h3><a name="a1.1_Overview"></a>1.1 Overview</h3> +<p> + The statistics package provides frameworks and implementations for + basic Descriptive statistics, frequency distributions, bivariate regression, + and t-, chi-square and ANOVA test statistics. 
+ </p> +<p><a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br /> +</br><a href="#a1.3_Frequency_distributions">Frequency distributions</a><br /> +</br><a href="#a1.4_Simple_regression">Simple Regression</a><br /> +</br><a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br /> +</br><a href="#a1.6_Rank_transformations">Rank transformations</a><br /> +</br><a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br /> +</br><a href="#a1.8_Statistical_tests">Statistical Tests</a><br /> +</br></p> +</div> +<div class="section"><h3><a name="a1.2_Descriptive_statistics"></a>1.2 Descriptive statistics</h3> +<p> + The stat package includes a framework and default implementations for + the following Descriptive statistics: + <ul><li>arithmetic and geometric means</li> +<li>variance and standard deviation</li> +<li>sum, product, log sum, sum of squared values</li> +<li>minimum, maximum, median, and percentiles</li> +<li>skewness and kurtosis</li> +<li>first, second, third and fourth moments</li> +</ul> +</p> +<p> + With the exception of percentiles and the median, all of these + statistics can be computed without maintaining the full list of input + data values in memory. The stat package provides interfaces and + implementations that do not require value storage as well as + implementations that operate on arrays of stored values. + </p> +<p> + The top level interface is + <a href="../apidocs/org/apache/commons/math/stat/descriptive/UnivariateStatistic.html"> + org.apache.commons.math.stat.descriptive.UnivariateStatistic.</a> + This interface, implemented by all statistics, consists of + <code>evaluate()</code> methods that take double[] arrays as arguments + and return the value of the statistic. 
This interface is extended by + <a href="../apidocs/org/apache/commons/math/stat/descriptive/StorelessUnivariateStatistic.html"> + StorelessUnivariateStatistic</a>, which adds <code>increment(),</code><code>getResult()</code> and associated methods to support + "storageless" implementations that maintain counters, sums or other + state information as values are added using the <code>increment()</code> + method. + </p> +<p> + Abstract implementations of the top level interfaces are provided in + <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractUnivariateStatistic.html"> + AbstractUnivariateStatistic</a> and + <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractStorelessUnivariateStatistic.html"> + AbstractStorelessUnivariateStatistic</a> respectively. + </p> +<p> + Each statistic is implemented as a separate class, in one of the + subpackages (moment, rank, summary) and each extends one of the abstract + classes above (depending on whether or not value storage is required to + compute the statistic). There are several ways to instantiate and use statistics. + Statistics can be instantiated and used directly, but it is generally more convenient + (and efficient) to access them using the provided aggregates, + <a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html"> + DescriptiveStatistics</a> and + <a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html"> + SummaryStatistics.</a></p> +<p><code>DescriptiveStatistics</code> maintains the input data in memory + and has the capability of producing "rolling" statistics computed from a + "window" consisting of the most recently added values. + </p> +<p><code>SummaryStatistics</code> does not store the input data values + in memory, so the statistics included in this aggregate are limited to those + that can be computed in one pass through the data without access to + the full array of values. 
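To make the one-pass idea concrete, here is a minimal sketch of a storeless accumulator for the mean and variance. It is not the library's implementation: the class name RunningStats and its methods are hypothetical, and the update rule is Welford's algorithm, chosen here only for illustration.

```java
// Hypothetical illustration class; not part of Commons Math.
class RunningStats {
    private long n = 0;         // number of values seen so far
    private double mean = 0.0;  // running mean
    private double m2 = 0.0;    // running sum of squared deviations from the mean

    // Update the counters with one new value (Welford's update rule).
    void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    // NaN until at least one value has been added.
    double getMean() {
        return n > 0 ? mean : Double.NaN;
    }

    // Bias-corrected (n - 1) variance; NaN with fewer than two values.
    double getVariance() {
        return n > 1 ? m2 / (n - 1) : Double.NaN;
    }

    double getStandardDeviation() {
        return Math.sqrt(getVariance());
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.addValue(v); // no value is retained after this call
        }
        System.out.println("n = " + stats.getN());
        System.out.println("mean = " + stats.getMean());
        System.out.println("variance = " + stats.getVariance());
    }
}
```

Because only three fields are kept, memory use does not grow with the number of values. A median or percentile cannot be obtained this way, which is why those statistics need stored values.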
+ </p> +<p><table class="bodyTable"><tr class="a"><th>Aggregate</th> +<th>Statistics Included</th> +<th>Values stored?</th> +<th>"Rolling" capability?</th> +</tr> +<tr class="b"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html"> + DescriptiveStatistics</a></td> +<td>min, max, mean, geometric mean, n, + sum, sum of squares, standard deviation, variance, percentiles, skewness, + kurtosis, median</td> +<td>Yes</td> +<td>Yes</td> +</tr> +<tr class="a"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html"> + SummaryStatistics</a></td> +<td>min, max, mean, geometric mean, n, + sum, sum of squares, standard deviation, variance</td> +<td>No</td> +<td>No</td> +</tr> +</table> +</p> +<p><code>SummaryStatistics</code> can be aggregated using + <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html"> + AggregateSummaryStatistics.</a> This class can be used to concurrently gather statistics for multiple + datasets as well as for a combined sample including all of the data. + </p> +<p><code>MultivariateSummaryStatistics</code> is similar to <code>SummaryStatistics</code> + but handles n-tuple values instead of scalar values. It can also compute the + full covariance matrix for the input data. + </p> +<p> + Neither <code>DescriptiveStatistics</code> nor <code>SummaryStatistics</code> is + thread-safe. <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedDescriptiveStatistics.html"> + SynchronizedDescriptiveStatistics</a> and + <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedSummaryStatistics.html"> + SynchronizedSummaryStatistics</a>, respectively, provide thread-safe versions for applications that + require concurrent access to statistical aggregates by multiple threads. 
+ <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedMultiVariateSummaryStatistics.html"> + SynchronizedMultivariateSummaryStatistics</a> provides a thread-safe version of <code>MultivariateSummaryStatistics.</code></p> +<p> + There is also a utility class, + <a href="../apidocs/org/apache/commons/math/stat/StatUtils.html"> + StatUtils</a>, that provides static methods for computing statistics + directly from double[] arrays. + </p> +<p> + Here are some examples showing how to compute descriptive statistics. + <dl><dt>Compute summary statistics for a list of double values</dt> +<br /> +</br><dd>Using the <code>DescriptiveStatistics</code> aggregate + (values are stored in memory): + <div class="source"><pre> +// Get a DescriptiveStatistics instance +DescriptiveStatistics stats = new DescriptiveStatistics(); + +// Add the data from the array +for (int i = 0; i < inputArray.length; i++) { + stats.addValue(inputArray[i]); +} + +// Compute some statistics +double mean = stats.getMean(); +double std = stats.getStandardDeviation(); +double median = stats.getPercentile(50); + </pre> +</div> +</dd> +<dd>Using the <code>SummaryStatistics</code> aggregate (values are + <strong>not</strong> stored in memory): + <div class="source"><pre> +// Get a SummaryStatistics instance +SummaryStatistics stats = new SummaryStatistics(); + +// Read data from an input stream, +// adding values and updating sums, counters, etc. 
+while ((line = in.readLine()) != null) { + stats.addValue(Double.parseDouble(line.trim())); +} +in.close(); + +// Compute the statistics +double mean = stats.getMean(); +double std = stats.getStandardDeviation(); +//double median = stats.getPercentile(50); <-- NOT AVAILABLE + </pre> +</div> +</dd> +<dd>Using the <code>StatUtils</code> utility class: + <div class="source"><pre> +// Compute statistics directly from the array +// assume values is a double[] array +double mean = StatUtils.mean(values); +double std = Math.sqrt(StatUtils.variance(values)); +double median = StatUtils.percentile(values, 50); + +// Compute the mean of the first three values in the array +mean = StatUtils.mean(values, 0, 3); + </pre> +</div> +</dd> +<dt>Maintain a "rolling mean" of the most recent 100 values from + an input stream</dt> +<br /> +</br><dd>Use a <code>DescriptiveStatistics</code> instance with + window size set to 100 + <div class="source"><pre> +// Create a DescriptiveStatistics instance and set the window size to 100 +DescriptiveStatistics stats = new DescriptiveStatistics(); +stats.setWindowSize(100); + +// Read data from an input stream, +// displaying the mean of the most recent 100 observations +// after every 100 observations +long nLines = 0; +while ((line = in.readLine()) != null) { + stats.addValue(Double.parseDouble(line.trim())); + nLines++; + if (nLines == 100) { + nLines = 0; + System.out.println(stats.getMean()); + } +} +in.close(); + </pre> +</div> +</dd> +<dt>Compute statistics in a thread-safe manner</dt> +<br /> +<dd>Use a <code>SynchronizedDescriptiveStatistics</code> instance + <div class="source"><pre> +// Create a SynchronizedDescriptiveStatistics instance and +// use as any other DescriptiveStatistics instance +DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics(); + </pre> +</div> +</dd> +<dt>Compute statistics for multiple samples and overall statistics concurrently</dt> +<br /> +<dd>There are two ways to do this using <code>AggregateSummaryStatistics.</code> + The 
first is to use an <code>AggregateSummaryStatistics</code> instance to accumulate + overall statistics contributed by <code>SummaryStatistics</code> instances created using + <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#createContributingStatistics()"> + AggregateSummaryStatistics.createContributingStatistics()</a>: + <div class="source"><pre> +// Create an AggregateSummaryStatistics instance to accumulate the overall statistics +// and AggregatingSummaryStatistics for the subsamples +AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics(); +SummaryStatistics setOneStats = aggregate.createContributingStatistics(); +SummaryStatistics setTwoStats = aggregate.createContributingStatistics(); +// Add values to the subsample aggregates +setOneStats.addValue(2); +setOneStats.addValue(3); +setTwoStats.addValue(2); +setTwoStats.addValue(4); +... +// Full sample data is reported by the aggregate +double totalSampleSum = aggregate.getSum(); + </pre> +</div> + + The above approach has two disadvantages: the <code>addValue</code> calls must be synchronized on the + <code>SummaryStatistics</code> instance maintained by the aggregate, and each value addition updates the + aggregate as well as the subsample. For applications that can wait to do the aggregation until all values + have been added, a static + <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#aggregate(java.util.Collection)"> + aggregate</a> method is available, as shown in the following example. + This method should be used when aggregation needs to be done across threads. 
+ <div class="source"><pre> +// Create SummaryStatistics instances for the subsample data +SummaryStatistics setOneStats = new SummaryStatistics(); +SummaryStatistics setTwoStats = new SummaryStatistics(); +// Add values to the subsample SummaryStatistics instances +setOneStats.addValue(2); +setOneStats.addValue(3); +setTwoStats.addValue(2); +setTwoStats.addValue(4); +... +// Aggregate the subsample statistics +Collection<SummaryStatistics> aggregate = new ArrayList<SummaryStatistics>(); +aggregate.add(setOneStats); +aggregate.add(setTwoStats); +StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate); + +// Full sample data is reported by aggregatedStats +double totalSampleSum = aggregatedStats.getSum(); + </pre> +</div> +</dd> +</dl> +</p> +</div> +<div class="section"><h3><a name="a1.3_Frequency_distributions"></a>1.3 Frequency distributions</h3> +<p><a href="../apidocs/org/apache/commons/math/stat/Frequency.html"> + org.apache.commons.math.stat.Frequency</a> + provides a simple interface for maintaining counts and percentages of discrete + values. + </p> +<p> + Strings, integers, longs and chars are all supported as value types, + as well as instances of any class that implements <code>Comparable.</code> + The ordering of values used in computing cumulative frequencies is by + default the <i>natural ordering,</i> but this can be overridden by supplying a + <code>Comparator</code> to the constructor. Adding values that are not + comparable to those that have already been added results in an + <code>IllegalArgumentException.</code></p> +<p> + Here are some examples. 
+ <dl><dt>Compute a frequency distribution based on integer values</dt> +<br /> +</br><dd>Mixing integers, longs, Integers and Longs: + <div class="source"><pre> + Frequency f = new Frequency(); + f.addValue(1); + f.addValue(new Integer(1)); + f.addValue(new Long(1)); + f.addValue(2); + f.addValue(new Integer(-1)); + System.out.println(f.getCount(1)); // displays 3 + System.out.println(f.getCumPct(0)); // displays 0.2 + System.out.println(f.getPct(new Integer(1))); // displays 0.6 + System.out.println(f.getCumPct(-2)); // displays 0 + System.out.println(f.getCumPct(10)); // displays 1 + </pre> +</div> +</dd> +<dt>Count string frequencies</dt> +<br /> +</br><dd>Using case-sensitive comparison, alpha sort order (natural comparator): + <div class="source"><pre> +Frequency f = new Frequency(); +f.addValue("one"); +f.addValue("One"); +f.addValue("oNe"); +f.addValue("Z"); +System.out.println(f.getCount("one")); // displays 1 +System.out.println(f.getCumPct("Z")); // displays 0.5 +System.out.println(f.getCumPct("Ot")); // displays 0.25 + </pre> +</div> +</dd> +<dd>Using case-insensitive comparator: + <div class="source"><pre> +Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER); +f.addValue("one"); +f.addValue("One"); +f.addValue("oNe"); +f.addValue("Z"); +System.out.println(f.getCount("one")); // displays 3 +System.out.println(f.getCumPct("z")); // displays 1 + </pre> +</div> +</dd> +</dl> +</p> +</div> +<div class="section"><h3><a name="a1.4_Simple_regression"></a>1.4 Simple regression</h3> +<p><a href="../apidocs/org/apache/commons/math/stat/regression/SimpleRegression.html"> + org.apache.commons.math.stat.regression.SimpleRegression</a> + provides ordinary least squares regression with one independent variable, + estimating the linear model: + </p> +<p><code> y = intercept + slope * x </code></p> +<p> + Standard errors for <code>intercept</code> and <code>slope</code> are + available as well as ANOVA, r-square and Pearson's r statistics. 
+ </p> +<p> + Observations (x,y pairs) can be added to the model one at a time or they + can be provided in a 2-dimensional array. The observations are not stored + in memory, so there is no limit to the number of observations that can be + added to the model. + </p> +<p><strong>Usage Notes</strong>: <ul><li> When there are fewer than two observations in the model, or when + there is no variation in the x values (i.e. all x values are the same), + all statistics return <code>NaN</code>. At least two observations with + different x coordinates are required to estimate a bivariate regression + model.</li> +<li> Getters for the statistics always compute values based on the current + set of observations -- i.e., you can get statistics, then add more data + and get updated statistics without using a new instance. There is no + "compute" method that updates all statistics. Each of the getters performs + the necessary computations to return the requested statistic.</li> +</ul> +</p> +<p><strong>Implementation Notes</strong>: <ul><li> As observations are added to the model, the sum of x values, y values, + cross products (x times y), and squared deviations of x and y from their + respective means are updated using updating formulas defined in + "Algorithms for Computing the Sample Variance: Analysis and + Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. + 1983, American Statistician, vol. 37, pp. 242-247, referenced in + Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985. All regression + statistics are computed from these sums.</li> +<li> Inference statistics (confidence intervals, parameter significance levels) + are based on the assumption that the observations included in the model are + drawn from a <a href="http://mathworld.wolfram.com/BivariateNormalDistribution.html" class="externalLink"> + Bivariate Normal Distribution</a>.</li> +</ul> +</p> +<p> + Here are some examples. 
+ <dl><dt>Estimate a model based on observations added one at a time</dt> +<br /> +</br><dd>Instantiate a regression instance and add data points + <div class="source"><pre> +SimpleRegression regression = new SimpleRegression(); +regression.addData(1d, 2d); +// At this point, with only one observation, +// all regression statistics will return NaN + +regression.addData(3d, 3d); +// With only two observations, +// slope and intercept can be computed +// but inference statistics will return NaN + +regression.addData(3d, 3d); +// Now all statistics are defined. + </pre> +</div> +</dd> +<dd>Compute some statistics based on observations added so far + <div class="source"><pre> +System.out.println(regression.getIntercept()); +// displays intercept of regression line + +System.out.println(regression.getSlope()); +// displays slope of regression line + +System.out.println(regression.getSlopeStdErr()); +// displays slope standard error + </pre> +</div> +</dd> +<dd>Use the regression model to predict the y value for a new x value + <div class="source"><pre> +System.out.println(regression.predict(1.5d)); +// displays predicted y value for x = 1.5 + </pre> +</div> + + More data points can be added and subsequent getXxx calls will incorporate + additional data in statistics. 
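The updating approach described in the implementation notes can be sketched in a few lines. This is an illustrative sketch only, not the code of SimpleRegression: the class name TinyRegression is hypothetical, and only the slope, intercept and prediction are shown (no standard errors or inference statistics).

```java
// Hypothetical illustration class; not part of Commons Math.
class TinyRegression {
    private long n = 0;         // number of observations
    private double xbar = 0.0;  // running mean of x
    private double ybar = 0.0;  // running mean of y
    private double sumXX = 0.0; // running sum of squared deviations of x
    private double sumXY = 0.0; // running sum of cross products of deviations

    // Add one (x, y) observation; the observation itself is not stored.
    void addData(double x, double y) {
        n++;
        double dx = x - xbar;       // deviation from the previous mean of x
        double dy = y - ybar;       // deviation from the previous mean of y
        double w = (n - 1.0) / n;   // weight of the updating formula
        sumXX += dx * dx * w;
        sumXY += dx * dy * w;
        xbar += dx / n;
        ybar += dy / n;
    }

    // NaN with fewer than two observations or no variation in x.
    double getSlope() {
        if (n < 2 || sumXX == 0.0) {
            return Double.NaN;
        }
        return sumXY / sumXX;
    }

    double getIntercept() {
        return ybar - getSlope() * xbar;
    }

    double predict(double x) {
        return getIntercept() + getSlope() * x;
    }

    public static void main(String[] args) {
        TinyRegression regression = new TinyRegression();
        double[][] data = {{1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11}};
        for (double[] point : data) {
            regression.addData(point[0], point[1]);
        }
        System.out.println("slope = " + regression.getSlope());
        System.out.println("intercept = " + regression.getIntercept());
    }
}
```

Each getter recomputes its value from the current sums, so statistics can be read at any time and more observations added afterwards.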
+ </dd> +<dt>Estimate a model from a double[][] array of data points</dt> +<br /> +</br><dd>Instantiate a regression object and load dataset + <div class="source"><pre> +double[][] data = { {1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11} }; +SimpleRegression regression = new SimpleRegression(); +regression.addData(data); + </pre> +</div> +</dd> +<dd>Estimate regression model based on data + <div class="source"><pre> +System.out.println(regression.getIntercept()); +// displays intercept of regression line + +System.out.println(regression.getSlope()); +// displays slope of regression line + +System.out.println(regression.getSlopeStdErr()); +// displays slope standard error + </pre> +</div> + + More data points -- even another double[][] array -- can be added and subsequent + getXxx calls will incorporate additional data in statistics. + </dd> +</dl> +</p> +</div> +<div class="section"><h3><a name="a1.5_Multiple_linear_regression"></a>1.5 Multiple linear regression</h3> +<p><a href="../apidocs/org/apache/commons/math/stat/regression/MultipleLinearRegression.html"> + org.apache.commons.math.stat.regression.MultipleLinearRegression</a> + provides ordinary least squares regression with a generic multiple variable linear model, which + in matrix notation can be expressed as: + </p> +<p><code> y=X*b+u </code></p> +<p> + where y is an <code>n-vector</code> <b>regressand</b>, X is a <code>[n,k]</code> matrix whose <code>k</code> columns are called + <b>regressors</b>, b is a <code>k-vector</code> of <b>regression parameters</b> and <code>u</code> is an <code>n-vector</code> + of <b>error terms</b> or <b>residuals</b>. The notation is quite standard in the literature; + see, e.g., <a href="http://www.econ.queensu.ca/ETM" class="externalLink">Davidson and MacKinnon, Econometrics Theory and Methods, 2004</a>. 
+ </p> +<p> + Two implementations are provided: <a href="../apidocs/org/apache/commons/math/stat/regression/OLSMultipleLinearRegression.html"> + org.apache.commons.math.stat.regression.OLSMultipleLinearRegression</a> and + <a href="../apidocs/org/apache/commons/math/stat/regression/GLSMultipleLinearRegression.html"> + org.apache.commons.math.stat.regression.GLSMultipleLinearRegression</a>.</p> +<p> + Observations (x,y and covariance data matrices) can be added to the model via the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method. + The observations are stored in memory until the next time the addData method is invoked. + </p> +<p><strong>Usage Notes</strong>: <ul><li> Data is validated when invoking the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method and + an <code>IllegalArgumentException</code> is thrown when the data is invalid. + </li> +<li> Only the GLS regression requires the covariance matrix, so in the OLS regression it is ignored and can safely be + passed as <code>null</code>.</li> +</ul> +</p> +<p> + Here are some examples. 
+ <dl><dt>OLS regression</dt> +<br /> +</br><dd>Instantiate an OLS regression object and load dataset + <div class="source"><pre> +MultipleLinearRegression regression = new OLSMultipleLinearRegression(); +double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; +double[][] x = new double[6][]; +x[0] = new double[]{1.0, 0, 0, 0, 0, 0}; +x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0}; +x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0}; +x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0}; +x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0}; +x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0}; +regression.addData(y, x, null); // we don't need covariance + </pre> +</div> +</dd> +<dd>Estimation of the regression values uses the <code>MultipleLinearRegression</code> interface: + <div class="source"><pre> +double[] beta = regression.estimateRegressionParameters(); + +double[] residuals = regression.estimateResiduals(); + +double[][] parametersVariance = regression.estimateRegressionParametersVariance(); + +double regressandVariance = regression.estimateRegressandVariance(); + </pre> +</div> +</dd> +<dt>GLS regression</dt> +<br /> +</br><dd>Instantiate a GLS regression object and load dataset + <div class="source"><pre> +MultipleLinearRegression regression = new GLSMultipleLinearRegression(); +double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; +double[][] x = new double[6][]; +x[0] = new double[]{1.0, 0, 0, 0, 0, 0}; +x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0}; +x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0}; +x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0}; +x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0}; +x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0}; +double[][] omega = new double[6][]; +omega[0] = new double[]{1.1, 0, 0, 0, 0, 0}; +omega[1] = new double[]{0, 2.2, 0, 0, 0, 0}; +omega[2] = new double[]{0, 0, 3.3, 0, 0, 0}; +omega[3] = new double[]{0, 0, 0, 4.4, 0, 0}; +omega[4] = new double[]{0, 0, 0, 0, 5.5, 0}; +omega[5] = new double[]{0, 0, 0, 0, 0, 6.6}; +regression.addData(y, x, omega); // we do need covariance + 
</pre> +</div> +</dd> +<dd>Estimation of the regression values uses the same <code>MultipleLinearRegression</code> interface as + the OLS regression. + </dd> +</dl> +</p> +</div> +<div class="section"><h3><a name="a1.6_Rank_transformations"></a>1.6 Rank transformations</h3> +<p> + Some statistical algorithms require that input data be replaced by ranks. + The <a href="../apidocs/org/apache/commons/math/stat/ranking/package-summary.html"> + org.apache.commons.math.stat.ranking</a> package provides rank transformation. + <a href="../apidocs/org/apache/commons/math/stat/ranking/RankingAlgorithm.html"> + RankingAlgorithm</a> defines the interface for ranking. + <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html"> + NaturalRanking</a> provides an implementation that has two configuration options. + <ul><li><a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html"> + Ties strategy</a> determines how ties in the source data are handled by the ranking.</li> +<li><a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html"> + NaN strategy</a> determines how NaN values in the source data are handled.</li> +</ul> +</p> +<p> + Examples: + <div class="source"><pre> +NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL, +TiesStrategy.MAXIMUM); +double[] exampleData = { 20, 17, 30, 42.3, 17, 50, + Double.NaN, Double.NEGATIVE_INFINITY, 17 }; +double[] ranks = ranking.rank(exampleData); + </pre> +</div> + + results in <code>ranks</code> containing <code>{6, 5, 7, 8, 5, 9, 2, 2, 5}.</code><div class="source"><pre> +new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData); + </pre> +</div> + + returns <code>{5, 2, 6, 7, 3, 8, 1, 4}.</code></p> +<p> + The default <code>NaNStrategy</code> is <code>NaNStrategy.MAXIMAL</code>. This makes <code>NaN</code> + values larger than any other value (including <code>Double.POSITIVE_INFINITY</code>). 
The + default <code>TiesStrategy</code> is <code>TiesStrategy.AVERAGE,</code> which assigns tied + values the average of the ranks applicable to the sequence of ties. See the + <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html"> + NaturalRanking</a> for more examples and <a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html"> + TiesStrategy</a> and <a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html">NaNStrategy</a> + for details on these configuration options. + </p> +</div> +<div class="section"><h3><a name="a1.7_Covariance_and_correlation"></a>1.7 Covariance and correlation</h3> +<p> + The <a href="../apidocs/org/apache/commons/math/stat/correlation/package-summary.html"> + org.apache.commons.math.stat.correlation</a> package computes covariances + and correlations for pairs of arrays or columns of a matrix. + <a href="../apidocs/org/apache/commons/math/stat/correlation/Covariance.html"> + Covariance</a> computes covariances, + <a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html"> + PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients and + <a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html"> + SpearmansCorrelation</a> computes Spearman's rank correlation. + </p> +<p><strong>Implementation Notes</strong><ul><li> + Unbiased covariances are given by the formula <br /> +</br><code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code> + where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code> + is the mean of the <code>Y</code> values. 
Non-bias-corrected estimates use + <code>n</code> in place of <code>n - 1.</code> Whether or not covariances are + bias-corrected is determined by the optional parameter, "biasCorrected," which + defaults to <code>true.</code></li> +<li><a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html"> + PearsonsCorrelation</a> computes correlations defined by the formula <br /> +</br><code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br /> + + where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code> + and <code>s(X)</code>, <code>s(Y)</code> are standard deviations. + </li> +<li><a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html"> + SpearmansCorrelation</a> applies a rank transformation to the input data and computes Pearson's + correlation on the ranked data. The ranking algorithm is configurable. By default, + <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html"> + NaturalRanking</a> with default strategies for handling ties and NaN values is used. 
+ </li> +</ul> +</p> +<p><strong>Examples:</strong><dl><dt><strong>Covariance of 2 arrays</strong></dt> +<br /> +</br><dd>To compute the unbiased covariance between 2 double arrays, + <code>x</code> and <code>y</code>, use: + <div class="source"><pre> +new Covariance().covariance(x, y) + </pre> +</div> + + For non-bias-corrected covariances, use + <div class="source"><pre> +covariance(x, y, false) + </pre> +</div> +</dd> +<br /> +</br><dt><strong>Covariance matrix</strong></dt> +<br /> +</br><dd> A covariance matrix over the columns of a source matrix <code>data</code> + can be computed using + <div class="source"><pre> +new Covariance().computeCovarianceMatrix(data) + </pre> +</div> + + The i-jth entry of the returned matrix is the unbiased covariance of the ith and jth + columns of <code>data.</code> As above, to get non-bias-corrected covariances, + use + <div class="source"><pre> +computeCovarianceMatrix(data, false) + </pre> +</div> +</dd> +<br /> +</br><dt><strong>Pearson's correlation of 2 arrays</strong></dt> +<br /> +</br><dd>To compute the Pearson's product-moment correlation between two double arrays + <code>x</code> and <code>y</code>, use: + <div class="source"><pre> +new PearsonsCorrelation().correlation(x, y) + </pre> +</div> +</dd> +<br /> +</br><dt><strong>Pearson's correlation matrix</strong></dt> +<br /> +</br><dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code> + can be computed using + <div class="source"><pre> +new PearsonsCorrelation().computeCorrelationMatrix(data) + </pre> +</div> + + The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the + ith and jth columns of <code>data.</code></dd> +<br /> +</br><dt><strong>Pearson's correlation significance and standard errors</strong></dt> +<br /> +</br><dd> To compute standard errors and/or significances of correlation coefficients + associated with Pearson's correlation coefficients, start by creating a + 
<code>PearsonsCorrelation</code> instance + <div class="source"><pre> +PearsonsCorrelation correlation = new PearsonsCorrelation(data); + </pre> +</div> + + where <code>data</code> is either a rectangular array or a <code>RealMatrix.</code> + Then the matrix of standard errors is + <div class="source"><pre> +correlation.getCorrelationStandardErrors(); + </pre> +</div> + + The formula used to compute the standard error is <br /> +<code>SE<sub>r</sub> = ((1 - r<sup>2</sup>) / (n - 2))<sup>1/2</sup></code><br /> + + where <code>r</code> is the estimated correlation coefficient and + <code>n</code> is the number of observations in the source dataset.<br /> +<br /> +<strong>p-values</strong> for the (2-sided) null hypotheses that elements of + a correlation matrix are zero populate the RealMatrix returned by + <div class="source"><pre> +correlation.getCorrelationPValues() + </pre> +</div> +<code>getCorrelationPValues().getEntry(i,j)</code> is the + probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes + a value with absolute value greater than or equal to <br /> +</br><code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>, + where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth + columns of the source array or RealMatrix. This is sometimes referred to as the + <i>significance</i> of the coefficient.<br /> +<br /> + + For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then + <div class="source"><pre> +new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1) + </pre> +</div> + + is the significance of the Pearson's correlation coefficient between the two columns + of <code>data</code>. If this value is less than .01, we can say that the correlation + between the two columns of data is significant at the 99% level. 
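<p>The standard-error formula and the t-statistic that the p-values are based on can be checked by hand. The following is a minimal, library-free sketch rather than the Commons Math implementation; the class name <code>CorrelationStandardErrorSketch</code> and its method names are hypothetical. It stops at the t-statistic because the JDK has no t-distribution function for the final p-value step.</p>

```java
// Hypothetical helper class, not part of Commons Math.
final class CorrelationStandardErrorSketch {

    /** Pearson's r. The (n - 1) factors in cor(X, Y) cancel, leaving sxy / sqrt(sxx * syy). */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** SE_r = ((1 - r^2) / (n - 2))^(1/2). */
    static double standardError(double r, int n) {
        return Math.sqrt((1.0 - r * r) / (n - 2));
    }

    /** |r| * ((n - 2) / (1 - r^2))^(1/2), the value compared against t with n - 2 degrees of freedom. */
    static double tStatistic(double r, int n) {
        return Math.abs(r) * Math.sqrt((n - 2) / (1.0 - r * r));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {1, 3, 2, 4};
        double r = pearson(x, y);
        System.out.println("r  = " + r);
        System.out.println("SE = " + standardError(r, x.length));
        System.out.println("t  = " + tStatistic(r, x.length));
    }
}
```

<p>Note that the t-statistic is simply <code>|r|</code> divided by the standard error, so the two matrices returned by <code>PearsonsCorrelation</code> carry closely related information.</p>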
+ </dd> +<br /> +</br><dt><strong>Spearman's rank correlation coefficient</strong></dt> +<br /> +</br><dd>To compute the Spearman's rank-moment correlation between two double arrays + <code>x</code> and <code>y</code>: + <div class="source"><pre> +new SpearmansCorrelation().correlation(x, y) + </pre> +</div> + + This is equivalent to + <div class="source"><pre> +RankingAlgorithm ranking = new NaturalRanking(); +new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y)) + </pre> +</div> +</dd> +<br /> +</br></dl> +</p> +</div> +<div class="section"><h3><a name="a1.8_Statistical_tests"></a>1.8 Statistical tests</h3> +<p> + The interfaces and implementations in the + <a href="../apidocs/org/apache/commons/math/stat/inference/"> + org.apache.commons.math.stat.inference</a> package provide + <a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc22.htm" class="externalLink"> + Student's t</a>, + <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm" class="externalLink"> + Chi-Square</a> and + <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm" class="externalLink"> + One-Way ANOVA</a> test statistics as well as + <a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue" class="externalLink"> + p-values</a> associated with <code>t-</code>, + <code>Chi-Square</code> and <code>One-Way ANOVA</code> tests. 
The + interfaces are + <a href="../apidocs/org/apache/commons/math/stat/inference/TTest.html"> + TTest</a>, + <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTest.html"> + ChiSquareTest</a>, and + <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnova.html"> + OneWayAnova</a> with provided implementations + <a href="../apidocs/org/apache/commons/math/stat/inference/TTestImpl.html"> + TTestImpl</a>, + <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTestImpl.html"> + ChiSquareTestImpl</a> and + <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnovaImpl.html"> + OneWayAnovaImpl</a>, respectively. + The + <a href="../apidocs/org/apache/commons/math/stat/inference/TestUtils.html"> + TestUtils</a> class provides static methods to get test instances or + to compute test statistics directly. The examples below all use the + static methods in <code>TestUtils</code> to execute tests. To get + test object instances, either use e.g., + <code>TestUtils.getTTest()</code> or use the implementation constructors + directly, e.g., + <code>new TTestImpl()</code>. + </p> +<p><strong>Implementation Notes</strong><ul><li>Both one- and two-sample t-tests are supported. Two-sample tests + can be either paired or unpaired and the unpaired two-sample tests can + be conducted under the assumption of equal subpopulation variances or + without this assumption. When equal variances are assumed, a pooled + variance estimate is used to compute the t-statistic and the degrees + of freedom used in the t-test equal the sum of the sample sizes minus 2. + When equal variances are not assumed, the t-statistic uses both sample + variances and the + <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/gifs/nu3.gif" class="externalLink"> + Welch-Satterthwaite approximation</a> is used to compute the degrees + of freedom. 
Methods to return t-statistics and p-values are provided in each + case, as well as boolean-valued methods to perform fixed significance + level tests. The names of test or test-statistic methods that assume equal + subpopulation variances always start with "homoscedastic." Test or + test-statistic methods that just start with "t" do not assume equal + variances. See the examples below and the API documentation for + more details.</li> +<li>The validity of the p-values returned by the t-test depends on the + assumptions of the parametric t-test procedure, as discussed + <a href="http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html" class="externalLink"> + here</a>.</li> +<li>p-values returned by t-, chi-square and ANOVA tests are exact, based + on numerical approximations to the t-, chi-square and F distributions in the + <code>distribution</code> package. </li> +<li>p-values returned by t-tests are for two-sided tests and the boolean-valued + methods supporting fixed significance level tests assume that the hypotheses + are two-sided. One-sided tests can be performed by dividing returned p-values + (resp. 
critical values) by 2.</li> +<li>Degrees of freedom for chi-square tests are integral values, based on the + number of observed or expected counts (number of observed counts - 1) + for the goodness-of-fit tests and (number of columns - 1) * (number of rows - 1) + for independence tests.</li> +</ul> +</p> +<p><strong>Examples:</strong><dl><dt><strong>One-sample <code>t</code> tests</strong></dt> +<br /> +</br><dd>To compare the mean of a double[] array to a fixed value: + <div class="source"><pre> +double[] observed = {1d, 2d, 3d}; +double mu = 2.5d; +System.out.println(TestUtils.t(mu, observed)); + </pre> +</div> + + The code above will display the t-statistic associated with a one-sample + t-test comparing the mean of the <code>observed</code> values against + <code>mu.</code></dd> +<dd>To compare the mean of a dataset described by a + <a href="../apidocs/org/apache/commons/math/stat/descriptive/StatisticalSummary.html"> + org.apache.commons.math.stat.descriptive.StatisticalSummary</a> to a fixed value: + <div class="source"><pre> +double[] observed = {1d, 2d, 3d}; +double mu = 2.5d; +SummaryStatistics sampleStats = new SummaryStatistics(); +for (int i = 0; i < observed.length; i++) { + sampleStats.addValue(observed[i]); +} +System.out.println(TestUtils.t(mu, sampleStats)); +</pre> +</div> +</dd> +<dd>To compute the p-value associated with the null hypothesis that the mean + of a set of values equals a point estimate, against the two-sided alternative that + the mean is different from the target value: + <div class="source"><pre> +double[] observed = {1d, 2d, 3d}; +double mu = 2.5d; +System.out.println(TestUtils.tTest(mu, observed)); + </pre> +</div> + + The snippet above will display the p-value associated with the null + hypothesis that the mean of the population from which the + <code>observed</code> values are drawn equals <code>mu.</code></dd> +<dd>To perform the test using a fixed significance level, use: + <div class="source"><pre> +TestUtils.tTest(mu, 
observed, alpha); + </pre> +</div> + + where <code>0 < alpha < 0.5</code> is the significance level of + the test. The boolean value returned will be <code>true</code> iff the + null hypothesis can be rejected with confidence <code>1 - alpha</code>. + To test, for example, at the 95% level of confidence, use + <code>alpha = 0.05</code>.</dd> +<br /> +</br><dt><strong>Two-Sample t-tests</strong></dt> +<br /> +</br><dd><strong>Example 1:</strong> Paired test evaluating + the null hypothesis that the mean difference between corresponding + (paired) elements of the <code>double[]</code> arrays + <code>sample1</code> and <code>sample2</code> is zero. + + To compute the t-statistic: + <div class="source"><pre> +TestUtils.pairedT(sample1, sample2); + </pre> +</div> +<p> + To compute the p-value: + <div class="source"><pre> +TestUtils.pairedTTest(sample1, sample2); + </pre> +</div> +</p> +<p> + To perform a fixed significance level test with alpha = .05: + <div class="source"><pre> +TestUtils.pairedTTest(sample1, sample2, .05); + </pre> +</div> +</p> + + The last example will return <code>true</code> iff the p-value + returned by <code>TestUtils.pairedTTest(sample1, sample2)</code> + is less than <code>.05</code>.</dd> +<dd><strong>Example 2: </strong> Unpaired, two-sided, two-sample t-test using + <code>StatisticalSummary</code> instances, without assuming that + subpopulation variances are equal. + + First create the <code>StatisticalSummary</code> instances. Both + <code>DescriptiveStatistics</code> and <code>SummaryStatistics</code> + implement this interface. Assume that <code>summary1</code> and + <code>summary2</code> are <code>SummaryStatistics</code> instances, + each of which has had at least 2 values added to the (virtual) dataset that + it describes. The sample sizes do not have to be the same -- all that is required + is that both samples have at least 2 elements. 
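<p>Summary instances are sufficient here because the unequal-variance t-statistic and its approximate degrees of freedom depend only on each sample's size, mean and variance. The sketch below shows the textbook Welch formulas in plain Java; the class <code>WelchSketch</code> and its methods are hypothetical, and the library's implementation may differ in detail.</p>

```java
// Hypothetical helper class, not part of Commons Math.
final class WelchSketch {

    /** t = (m1 - m2) / sqrt(v1/n1 + v2/n2), using both sample variances. */
    static double t(double m1, double v1, long n1, double m2, double v2, long n2) {
        return (m1 - m2) / Math.sqrt(v1 / n1 + v2 / n2);
    }

    /** Welch-Satterthwaite approximation to the degrees of freedom. */
    static double degreesOfFreedom(double v1, long n1, double v2, long n2) {
        double a = v1 / n1;
        double b = v2 / n2;
        return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
    }

    public static void main(String[] args) {
        // Two samples of size 4 with means 5 and 3 and a common sample variance of 2.
        System.out.println("t  = " + t(5.0, 2.0, 4, 3.0, 2.0, 4));
        System.out.println("df = " + degreesOfFreedom(2.0, 4, 2.0, 4));
    }
}
```

<p>With equal sample sizes and equal variances the approximation reduces to the sum of the sample sizes minus 2, which matches the degrees of freedom used by the pooled-variance test.</p>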
+ <p><strong>Note:</strong> The <code>SummaryStatistics</code> class does + not store the dataset that it describes in memory, but it does compute all + statistics necessary to perform t-tests, so this method can be used to + conduct t-tests with very large samples. One-sample tests can also be + performed this way. + (See <a href="#1.2 Descriptive statistics">Descriptive statistics</a> for details + on the <code>SummaryStatistics</code> class.) + </p> +<p> + To compute the t-statistic: + <div class="source"><pre> +TestUtils.t(summary1, summary2); + </pre> +</div> +</p> +<p> + To compute the p-value: + <div class="source"><pre> +TestUtils.tTest(summary1, summary2); + </pre> +</div> +</p> +<p> + To perform a fixed significance level test with alpha = .05: + <div class="source"><pre> +TestUtils.tTest(summary1, summary2, .05); + </pre> +</div> +</p> +<p> + In each case above, the test does not assume that the subpopulation + variances are equal. To perform the tests under this assumption, + replace "t" at the beginning of the method name with "homoscedasticT". + </p> +</dd> +<br /> +</br><dt><strong>Chi-square tests</strong></dt> +<br /> +</br><dd>To compute a chi-square statistic measuring the agreement between a + <code>long[]</code> array of observed counts and a <code>double[]</code> + array of expected counts, use: + <div class="source"><pre> +long[] observed = {10, 9, 11}; +double[] expected = {10.1, 9.8, 10.3}; +System.out.println(TestUtils.chiSquare(expected, observed)); + </pre> +</div> + + The value displayed will be + <code>sum((expected[i] - observed[i])^2 / expected[i])</code>.</dd> +<dd> To get the p-value associated with the null hypothesis that + <code>observed</code> conforms to <code>expected</code> use: + <div class="source"><pre> +TestUtils.chiSquareTest(expected, observed); + </pre> +</div> +</dd> +<dd> To test the null hypothesis that <code>observed</code> conforms to + <code>expected</code> with <code>alpha</code> significance level + (equiv. 
<code>100 * (1-alpha)%</code> confidence) where <code> + 0 < alpha < 1 </code> use: + <div class="source"><pre> +TestUtils.chiSquareTest(expected, observed, alpha); + </pre> +</div> + + The boolean value returned will be <code>true</code> iff the null hypothesis + can be rejected with confidence <code>1 - alpha</code>. + </dd> +<dd>To compute a chi-square statistic associated with a + <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm" class="externalLink"> + chi-square test of independence</a> based on a two-dimensional (long[][]) + <code>counts</code> array viewed as a two-way table, use: + <div class="source"><pre> +TestUtils.chiSquare(counts); + </pre> +</div> + + The rows of the 2-way table are + <code>counts[0], ... , counts[counts.length - 1]. </code><br /> +</br> + The chi-square statistic returned is + <code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code> + where the sum is taken over all table entries and + <code>expected[i][j]</code> is the product of the row and column sums at + row <code>i</code>, column <code>j</code> divided by the total count. + </dd> +<dd>To compute the p-value associated with the null hypothesis that + the classifications represented by the counts in the columns of the input 2-way + table are independent of the rows, use: + <div class="source"><pre> +TestUtils.chiSquareTest(counts); + </pre> +</div> +</dd> +<dd>To perform a chi-square test of independence with <code>alpha</code> + significance level (equiv. <code>100 * (1-alpha)%</code> confidence) + where <code>0 < alpha < 1 </code> use: + <div class="source"><pre> +TestUtils.chiSquareTest(counts, alpha); + </pre> +</div> + + The boolean value returned will be <code>true</code> iff the null + hypothesis can be rejected with confidence <code>1 - alpha</code>. 
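<p>The independence statistic described above can be reproduced by hand, which is a useful check on how the expected counts are formed. This is a minimal plain-Java sketch under the formula given in this section, not the library code; the class <code>ChiSquareIndependenceSketch</code> and its methods are hypothetical.</p>

```java
// Hypothetical helper class, not part of Commons Math.
final class ChiSquareIndependenceSketch {

    /** sum((counts[i][j] - expected[i][j])^2 / expected[i][j]) over a rectangular two-way table. */
    static double chiSquare(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double sum = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count: row sum times column sum, divided by the total count.
                double expected = rowSum[i] * colSum[j] / total;
                double diff = counts[i][j] - expected;
                sum += diff * diff / expected;
            }
        }
        return sum;
    }

    /** (number of columns - 1) * (number of rows - 1). */
    static int degreesOfFreedom(long[][] counts) {
        return (counts[0].length - 1) * (counts.length - 1);
    }

    public static void main(String[] args) {
        long[][] counts = {{10, 20}, {30, 40}};
        System.out.println("chi-square = " + chiSquare(counts));
        System.out.println("df         = " + degreesOfFreedom(counts));
    }
}
```

<p>For the table in <code>main</code>, the expected counts are 12, 18, 28 and 42, so the statistic works out to 50/63 with 1 degree of freedom.</p>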
+ </dd> +<br /> +</br><dt><strong>One-Way ANOVA tests</strong></dt> +<br /> +</br><dd>To conduct a One-Way Analysis of Variance (ANOVA) to evaluate the + null hypothesis that the means of a collection of univariate datasets + are the same, start by loading the datasets into a collection, e.g. + <div class="source"><pre> +double[] classA = + {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 }; +double[] classB = + {99.0, 92.0, 102.0, 100.0, 102.0, 89.0 }; +double[] classC = + {110.0, 115.0, 111.0, 117.0, 128.0, 117.0 }; +List classes = new ArrayList(); +classes.add(classA); +classes.add(classB); +classes.add(classC); + </pre> +</div> + + Then you can compute ANOVA F- or p-values associated with the + null hypothesis that the class means are all the same + using a <code>OneWayAnova</code> instance or <code>TestUtils</code> + methods: + <div class="source"><pre> +double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value +double pValue = TestUtils.oneWayAnovaPValue(classes); // P-value + </pre> +</div> + + To perform a One-Way ANOVA test with the significance level set at 0.01 + (so the test will, provided its assumptions are met, reject the null + hypothesis incorrectly only about one in 100 times), use + <div class="source"><pre> +TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean + // true means reject null hypothesis + </pre> +</div> +</dd> +</dl> +</p> +</div> +</div> + + </div> + </div> + <div class="clear"> + <hr/> + </div> + <div id="footer"> + <div class="xright">© + 2003-2010 + + + + + + + + + + </div> + <div class="clear"> + <hr/> + </div> + </div> + </body> +</html>