comparison libs/commons-math-2.1/docs/userguide/stat.html @ 10:5f2c5fb36e93

commons-math-2.1 added
author dwinter
date Tue, 04 Jan 2011 10:00:53 +0100
9:e63a64652f4d 10:5f2c5fb36e93
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2
3
4
5
6
7
8
9
10
11
12
13 <html xmlns="http://www.w3.org/1999/xhtml">
14 <head>
15 <title>Math - The Commons Math User Guide - Statistics</title>
16 <style type="text/css" media="all">
17 @import url("../css/maven-base.css");
18 @import url("../css/maven-theme.css");
19 @import url("../css/site.css");
20 </style>
21 <link rel="stylesheet" href="../css/print.css" type="text/css" media="print" />
22 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
23 </head>
24 <body class="composite">
25 <div id="banner">
26 <span id="bannerLeft">
27
28 Commons Math User Guide
29
30 </span>
31 <div class="clear">
32 <hr/>
33 </div>
34 </div>
35 <div id="breadcrumbs">
36
37
38
39
40
41
42
43
44 <div class="xright">
45
46
47
48
49
50
51
52 </div>
53 <div class="clear">
54 <hr/>
55 </div>
56 </div>
57 <div id="leftColumn">
58 <div id="navcolumn">
59
60
61
62
63
64
65
66
67 <h5>User Guide</h5>
68 <ul>
69
70 <li class="none">
71 <a href="../userguide/index.html">Contents</a>
72 </li>
73
74 <li class="none">
75 <a href="../userguide/overview.html">Overview</a>
76 </li>
77
78 <li class="none">
79 <strong>Statistics</strong>
80 </li>
81
82 <li class="none">
83 <a href="../userguide/random.html">Data Generation</a>
84 </li>
85
86 <li class="none">
87 <a href="../userguide/linear.html">Linear Algebra</a>
88 </li>
89
90 <li class="none">
91 <a href="../userguide/analysis.html">Numerical Analysis</a>
92 </li>
93
94 <li class="none">
95 <a href="../userguide/special.html">Special Functions</a>
96 </li>
97
98 <li class="none">
99 <a href="../userguide/utilities.html">Utilities</a>
100 </li>
101
102 <li class="none">
103 <a href="../userguide/complex.html">Complex Numbers</a>
104 </li>
105
106 <li class="none">
107 <a href="../userguide/distribution.html">Distributions</a>
108 </li>
109
110 <li class="none">
111 <a href="../userguide/fraction.html">Fractions</a>
112 </li>
113
114 <li class="none">
115 <a href="../userguide/transform.html">Transform Methods</a>
116 </li>
117
118 <li class="none">
119 <a href="../userguide/geometry.html">3D Geometry</a>
120 </li>
121
122 <li class="none">
123 <a href="../userguide/optimization.html">Optimization</a>
124 </li>
125
126 <li class="none">
127 <a href="../userguide/ode.html">Ordinary Differential Equations</a>
128 </li>
129
130 <li class="none">
131 <a href="../userguide/genetics.html">Genetic Algorithms</a>
132 </li>
133 </ul>
134 <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
135 <img alt="Built by Maven" src="../images/logos/maven-feather.png"></img>
136 </a>
137
138
139
140
141
142
143
144
145 </div>
146 </div>
147 <div id="bodyColumn">
148 <div id="contentBox">
149 <div class="section"><h2><a name="a1_Statistics"></a>1 Statistics</h2>
150 <div class="section"><h3><a name="a1.1_Overview"></a>1.1 Overview</h3>
151 <p>
152 The statistics package provides frameworks and implementations for
153 basic descriptive statistics, frequency distributions, bivariate regression,
154 and t-, chi-square and ANOVA test statistics.
155 </p>
156 <p><a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br />
157 <a href="#a1.3_Frequency_distributions">Frequency distributions</a><br />
158 <a href="#a1.4_Simple_regression">Simple regression</a><br />
159 <a href="#a1.5_Multiple_linear_regression">Multiple linear regression</a><br />
160 <a href="#a1.6_Rank_transformations">Rank transformations</a><br />
161 <a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br />
162 <a href="#a1.8_Statistical_tests">Statistical Tests</a><br />
163 </p>
164 </div>
165 <div class="section"><h3><a name="a1.2_Descriptive_statistics"></a>1.2 Descriptive statistics</h3>
166 <p>
167 The stat package includes a framework and default implementations for
168 the following descriptive statistics:
169 <ul><li>arithmetic and geometric means</li>
170 <li>variance and standard deviation</li>
171 <li>sum, product, log sum, sum of squared values</li>
172 <li>minimum, maximum, median, and percentiles</li>
173 <li>skewness and kurtosis</li>
174 <li>first, second, third and fourth moments</li>
175 </ul>
176 </p>
177 <p>
178 With the exception of percentiles and the median, all of these
179 statistics can be computed without maintaining the full list of input
180 data values in memory. The stat package provides interfaces and
181 implementations that do not require value storage as well as
182 implementations that operate on arrays of stored values.
183 </p>
184 <p>
185 The top level interface is
186 <a href="../apidocs/org/apache/commons/math/stat/descriptive/UnivariateStatistic.html">
187 org.apache.commons.math.stat.descriptive.UnivariateStatistic</a>.
188 This interface, implemented by all statistics, consists of
189 <code>evaluate()</code> methods that take double[] arrays as arguments
190 and return the value of the statistic. This interface is extended by
191 <a href="../apidocs/org/apache/commons/math/stat/descriptive/StorelessUnivariateStatistic.html">
192 StorelessUnivariateStatistic</a>, which adds <code>increment()</code>, <code>getResult()</code> and associated methods to support
193 &quot;storeless&quot; implementations that maintain counters, sums or other
194 state information as values are added using the <code>increment()</code>
195 method.
196 </p>
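<p>
The storeless idea described above can be sketched with a small running-moments class.
This is an illustration only, not the Commons Math implementation: the class name
<code>RunningMoments</code> and its methods are hypothetical, and the update rule used
here is the well-known Welford recurrence, chosen for this sketch.
</p>

```java
// Illustrative sketch only: RunningMoments is a hypothetical class,
// not part of Commons Math. It keeps a count, a mean and a sum of
// squared deviations, and never stores the input values.
final class RunningMoments {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Folds one value into the running state (Welford's update). */
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    /** Mean of the values seen so far; NaN when no values were added. */
    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Bias-corrected sample variance; NaN for no values, 0 for one value. */
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        if (n == 1) {
            return 0.0;
        }
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        RunningMoments stats = new RunningMoments();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.increment(v);
        }
        System.out.println("mean = " + stats.getMean());
        System.out.println("variance = " + stats.getVariance());
    }
}
```

<p>
Because only three fields are kept, memory use is constant regardless of how many values
are added. The median cannot be maintained this way, which is why it needs stored values.
</p>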
197 <p>
198 Abstract implementations of the top level interfaces are provided in
199 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractUnivariateStatistic.html">
200 AbstractUnivariateStatistic</a> and
201 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractStorelessUnivariateStatistic.html">
202 AbstractStorelessUnivariateStatistic</a> respectively.
203 </p>
204 <p>
205 Each statistic is implemented as a separate class, in one of the
206 subpackages (moment, rank, summary) and each extends one of the abstract
207 classes above (depending on whether or not value storage is required to
208 compute the statistic). There are several ways to instantiate and use statistics.
209 Statistics can be instantiated and used directly, but it is generally more convenient
210 (and efficient) to access them using the provided aggregates,
211 <a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html">
212 DescriptiveStatistics</a> and
213 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html">
214 SummaryStatistics.</a></p>
215 <p><code>DescriptiveStatistics</code> maintains the input data in memory
216 and has the capability of producing &quot;rolling&quot; statistics computed from a
217 &quot;window&quot; consisting of the most recently added values.
218 </p>
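<p>
To make the &quot;window&quot; idea concrete, here is a minimal sketch of a rolling mean over the
most recently added values. <code>RollingMean</code> is a hypothetical class written for this
guide section; it is not how <code>DescriptiveStatistics</code> is implemented, and it keeps
a running sum, which can accumulate rounding error over very long streams.
</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: RollingMean is a hypothetical class,
// not part of Commons Math.
final class RollingMean {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<Double>();
    private double sum;

    RollingMean(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds a value, discarding the oldest one once the window is full. */
    void addValue(double x) {
        window.addLast(x);
        sum += x;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
    }

    /** Mean of the values currently in the window; NaN when empty. */
    double getMean() {
        return window.isEmpty() ? Double.NaN : sum / window.size();
    }

    public static void main(String[] args) {
        RollingMean rolling = new RollingMean(3);
        for (double v : new double[] {1, 2, 3, 4}) {
            rolling.addValue(v);
        }
        // Only 2, 3 and 4 remain in the window at this point.
        System.out.println(rolling.getMean());
    }
}
```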
219 <p><code>SummaryStatistics</code> does not store the input data values
220 in memory, so the statistics included in this aggregate are limited to those
221 that can be computed in one pass through the data without access to
222 the full array of values.
223 </p>
224 <p><table class="bodyTable"><tr class="a"><th>Aggregate</th>
225 <th>Statistics Included</th>
226 <th>Values stored?</th>
227 <th>&quot;Rolling&quot; capability?</th>
228 </tr>
229 <tr class="b"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html">
230 DescriptiveStatistics</a></td>
231 <td>min, max, mean, geometric mean, n,
232 sum, sum of squares, standard deviation, variance, percentiles, skewness,
233 kurtosis, median</td>
234 <td>Yes</td>
235 <td>Yes</td>
236 </tr>
237 <tr class="a"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html">
238 SummaryStatistics</a></td>
239 <td>min, max, mean, geometric mean, n,
240 sum, sum of squares, standard deviation, variance</td>
241 <td>No</td>
242 <td>No</td>
243 </tr>
244 </table>
245 </p>
246 <p><code>SummaryStatistics</code> can be aggregated using
247 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html">
248 AggregateSummaryStatistics.</a> This class can be used to concurrently gather statistics for multiple
249 datasets as well as for a combined sample including all of the data.
250 </p>
251 <p><code>MultivariateSummaryStatistics</code> is similar to <code>SummaryStatistics</code>
252 but handles n-tuple values instead of scalar values. It can also compute the
253 full covariance matrix for the input data.
254 </p>
255 <p>
256 Neither <code>DescriptiveStatistics</code> nor <code>SummaryStatistics</code> is
257 thread-safe. <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedDescriptiveStatistics.html">
258 SynchronizedDescriptiveStatistics</a> and
259 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedSummaryStatistics.html">
260 SynchronizedSummaryStatistics</a>, respectively, provide thread-safe versions for applications that
261 require concurrent access to statistical aggregates by multiple threads.
262 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedMultiVariateSummaryStatistics.html">
263 SynchronizedMultivariateSummaryStatistics</a> provides a thread-safe version of <code>MultivariateSummaryStatistics</code>.</p>
264 <p>
265 There is also a utility class,
266 <a href="../apidocs/org/apache/commons/math/stat/StatUtils.html">
267 StatUtils</a>, that provides static methods for computing statistics
268 directly from double[] arrays.
269 </p>
270 <p>
271 Here are some examples showing how to compute descriptive statistics.
272 <dl><dt>Compute summary statistics for a list of double values</dt>
273 <br />
274 </br><dd>Using the <code>DescriptiveStatistics</code> aggregate
275 (values are stored in memory):
276 <div class="source"><pre>
277 // Get a DescriptiveStatistics instance
278 DescriptiveStatistics stats = new DescriptiveStatistics();
279
280 // Add the data from the array
281 for( int i = 0; i &lt; inputArray.length; i++) {
282 stats.addValue(inputArray[i]);
283 }
284
285 // Compute some statistics
286 double mean = stats.getMean();
287 double std = stats.getStandardDeviation();
288 double median = stats.getMedian();
289 </pre>
290 </div>
291 </dd>
292 <dd>Using the <code>SummaryStatistics</code> aggregate (values are
293 <strong>not</strong> stored in memory):
294 <div class="source"><pre>
295 // Get a SummaryStatistics instance
296 SummaryStatistics stats = new SummaryStatistics();
297
298 // Read data from an input stream (in is assumed to be a BufferedReader),
299 // adding values and updating sums, counters, etc.
300 String line;
301 while ((line = in.readLine()) != null) {
302 stats.addValue(Double.parseDouble(line.trim()));
303 }
304 in.close();
305
306 // Compute the statistics
307 double mean = stats.getMean();
308 double std = stats.getStandardDeviation();
309 //double median = stats.getMedian(); &lt;-- NOT AVAILABLE
310 </pre>
311 </div>
312 </dd>
313 <dd>Using the <code>StatUtils</code> utility class:
314 <div class="source"><pre>
315 // Compute statistics directly from the array
316 // assume values is a double[] array
317 double mean = StatUtils.mean(values);
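318 double std = Math.sqrt(StatUtils.variance(values)); // standard deviation is the square root of the variance
319 double median = StatUtils.percentile(values, 50); // the median is the 50th percentile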
320
321 // Compute the mean of the first three values in the array
322 mean = StatUtils.mean(values, 0, 3);
323 </pre>
324 </div>
325 </dd>
326 <dt>Maintain a &quot;rolling mean&quot; of the most recent 100 values from
327 an input stream</dt>
328 <br />
329 </br><dd>Use a <code>DescriptiveStatistics</code> instance with
330 window size set to 100
331 <div class="source"><pre>
332 // Create a DescriptiveStatistics instance and set the window size to 100
333 DescriptiveStatistics stats = new DescriptiveStatistics();
334 stats.setWindowSize(100);
335
336 // Read data from an input stream (in is assumed to be a BufferedReader),
337 // displaying the mean of the most recent 100 observations
338 // after every 100 observations
339 long nLines = 0;
340 String line;
341 while ((line = in.readLine()) != null) {
342 stats.addValue(Double.parseDouble(line.trim()));
343 if (++nLines == 100) {
344 nLines = 0;
345 System.out.println(stats.getMean());
346 }
347 }
348 in.close();
349 </pre>
350 </div>
351 </dd>
352 <dt>Compute statistics in a thread-safe manner</dt>
353 <br />
354 <dd>Use a <code>SynchronizedDescriptiveStatistics</code> instance
355 <div class="source"><pre>
356 // Create a SynchronizedDescriptiveStatistics instance and
357 // use as any other DescriptiveStatistics instance
358 DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics();
359 </pre>
360 </div>
361 </dd>
362 <dt>Compute statistics for multiple samples and overall statistics concurrently</dt>
363 <br />
364 <dd>There are two ways to do this using <code>AggregateSummaryStatistics.</code>
365 The first is to use an <code>AggregateSummaryStatistics</code> instance to accumulate
366 overall statistics contributed by <code>SummaryStatistics</code> instances created using
367 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#createContributingStatistics()">
368 AggregateSummaryStatistics.createContributingStatistics()</a>:
369 <div class="source"><pre>
370 // Create an AggregateSummaryStatistics instance to accumulate the overall statistics
371 // and contributing SummaryStatistics instances for the subsamples
372 AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics();
373 SummaryStatistics setOneStats = aggregate.createContributingStatistics();
374 SummaryStatistics setTwoStats = aggregate.createContributingStatistics();
375 // Add values to the subsample aggregates
376 setOneStats.addValue(2);
377 setOneStats.addValue(3);
378 setTwoStats.addValue(2);
379 setTwoStats.addValue(4);
380 ...
381 // Full sample data is reported by the aggregate
382 double totalSampleSum = aggregate.getSum();
383 </pre>
384 </div>
385
386 The above approach has the disadvantages that the <code>addValue</code> calls must be synchronized on the
387 <code>SummaryStatistics</code> instance maintained by the aggregate and each value addition updates the
388 aggregate as well as the subsample. For applications that can wait to do the aggregation until all values
389 have been added, a static
390 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#aggregate(java.util.Collection)">
391 aggregate</a> method is available, as shown in the following example.
392 This method should be used when aggregation needs to be done across threads.
393 <div class="source"><pre>
394 // Create SummaryStatistics instances for the subsample data
395 SummaryStatistics setOneStats = new SummaryStatistics();
396 SummaryStatistics setTwoStats = new SummaryStatistics();
397 // Add values to the subsample SummaryStatistics instances
398 setOneStats.addValue(2);
399 setOneStats.addValue(3);
400 setTwoStats.addValue(2);
401 setTwoStats.addValue(4);
402 ...
403 // Aggregate the subsample statistics
404 Collection&lt;SummaryStatistics&gt; aggregate = new ArrayList&lt;SummaryStatistics&gt;();
405 aggregate.add(setOneStats);
406 aggregate.add(setTwoStats);
407 StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate);
408
409 // Full sample data is reported by aggregatedStats
410 double totalSampleSum = aggregatedStats.getSum();
411 </pre>
412 </div>
413 </dd>
414 </dl>
415 </p>
416 </div>
417 <div class="section"><h3><a name="a1.3_Frequency_distributions"></a>1.3 Frequency distributions</h3>
418 <p><a href="../apidocs/org/apache/commons/math/stat/Frequency.html">
419 org.apache.commons.math.stat.Frequency</a>
420 provides a simple interface for maintaining counts and percentages of discrete
421 values.
422 </p>
423 <p>
424 Strings, integers, longs and chars are all supported as value types,
425 as well as instances of any class that implements <code>Comparable</code>.
426 The ordering of values used in computing cumulative frequencies is by
427 default the <i>natural ordering</i>, but this can be overridden by supplying a
428 <code>Comparator</code> to the constructor. Adding values that are not
429 comparable to those that have already been added results in an
430 <code>IllegalArgumentException</code>.</p>
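<p>
The relationship between counts, percentages and cumulative percentages can be sketched
with a sorted map. <code>CountTable</code> below is a hypothetical, integer-only stand-in
written for illustration; it is not the <code>Frequency</code> class and makes no claim
about how that class is implemented.
</p>

```java
import java.util.TreeMap;

// Illustrative sketch only: CountTable is a hypothetical class,
// not part of Commons Math. Keys are kept in natural (ascending) order.
final class CountTable {
    private final TreeMap<Integer, Long> counts = new TreeMap<Integer, Long>();
    private long total;

    void addValue(int v) {
        Long old = counts.get(v);
        counts.put(v, old == null ? 1L : old + 1L);
        total++;
    }

    /** Number of times v was added. */
    long getCount(int v) {
        Long c = counts.get(v);
        return c == null ? 0L : c;
    }

    /** Proportion of all values equal to v. */
    double getPct(int v) {
        return total == 0 ? Double.NaN : (double) getCount(v) / total;
    }

    /** Proportion of all values less than or equal to v. */
    double getCumPct(int v) {
        if (total == 0) {
            return Double.NaN;
        }
        long cumulative = 0;
        for (long c : counts.headMap(v, true).values()) {
            cumulative += c;
        }
        return (double) cumulative / total;
    }

    public static void main(String[] args) {
        CountTable t = new CountTable();
        for (int v : new int[] {1, 1, 1, 2, -1}) {
            t.addValue(v);
        }
        System.out.println(t.getCount(1));
        System.out.println(t.getCumPct(0));
        System.out.println(t.getPct(1));
    }
}
```

<p>
The cumulative percentage depends on the ordering of the keys, which is why supplying a
different comparator changes the cumulative results but not the individual counts.
</p>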
431 <p>
432 Here are some examples.
433 <dl><dt>Compute a frequency distribution based on integer values</dt>
434 <br />
435 </br><dd>Mixing integers, longs, Integers and Longs:
436 <div class="source"><pre>
437 Frequency f = new Frequency();
438 f.addValue(1);
439 f.addValue(new Integer(1));
440 f.addValue(new Long(1));
441 f.addValue(2);
442 f.addValue(new Integer(-1));
443 System.out.println(f.getCount(1)); // displays 3
444 System.out.println(f.getCumPct(0)); // displays 0.2
445 System.out.println(f.getPct(new Integer(1))); // displays 0.6
446 System.out.println(f.getCumPct(-2)); // displays 0
447 System.out.println(f.getCumPct(10)); // displays 1
448 </pre>
449 </div>
450 </dd>
451 <dt>Count string frequencies</dt>
452 <br />
453 </br><dd>Using case-sensitive comparison, alpha sort order (natural comparator):
454 <div class="source"><pre>
455 Frequency f = new Frequency();
456 f.addValue(&quot;one&quot;);
457 f.addValue(&quot;One&quot;);
458 f.addValue(&quot;oNe&quot;);
459 f.addValue(&quot;Z&quot;);
460 System.out.println(f.getCount(&quot;one&quot;)); // displays 1
461 System.out.println(f.getCumPct(&quot;Z&quot;)); // displays 0.5
462 System.out.println(f.getCumPct(&quot;Ot&quot;)); // displays 0.25
463 </pre>
464 </div>
465 </dd>
466 <dd>Using case-insensitive comparator:
467 <div class="source"><pre>
468 Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
469 f.addValue(&quot;one&quot;);
470 f.addValue(&quot;One&quot;);
471 f.addValue(&quot;oNe&quot;);
472 f.addValue(&quot;Z&quot;);
473 System.out.println(f.getCount(&quot;one&quot;)); // displays 3
474 System.out.println(f.getCumPct(&quot;z&quot;)); // displays 1
475 </pre>
476 </div>
477 </dd>
478 </dl>
479 </p>
480 </div>
481 <div class="section"><h3><a name="a1.4_Simple_regression"></a>1.4 Simple regression</h3>
482 <p><a href="../apidocs/org/apache/commons/math/stat/regression/SimpleRegression.html">
483 org.apache.commons.math.stat.regression.SimpleRegression</a>
484 provides ordinary least squares regression with one independent variable,
485 estimating the linear model:
486 </p>
487 <p><code> y = intercept + slope * x </code></p>
488 <p>
489 Standard errors for <code>intercept</code> and <code>slope</code> are
490 available as well as ANOVA, r-square and Pearson's r statistics.
491 </p>
492 <p>
493 Observations (x,y pairs) can be added to the model one at a time or they
494 can be provided in a 2-dimensional array. The observations are not stored
495 in memory, so there is no limit to the number of observations that can be
496 added to the model.
497 </p>
498 <p><strong>Usage Notes</strong>: <ul><li> When there are fewer than two observations in the model, or when
499 there is no variation in the x values (i.e. all x values are the same)
500 all statistics return <code>NaN</code>. At least two observations with
501 different x coordinates are required to estimate a bivariate regression
502 model.</li>
503 <li> getters for the statistics always compute values based on the current
504 set of observations -- i.e., you can get statistics, then add more data
505 and get updated statistics without using a new instance. There is no
506 &quot;compute&quot; method that updates all statistics. Each of the getters performs
507 the necessary computations to return the requested statistic.</li>
508 </ul>
509 </p>
510 <p><strong>Implementation Notes</strong>: <ul><li> As observations are added to the model, the sum of x values, y values,
511 cross products (x times y), and squared deviations of x and y from their
512 respective means are updated using updating formulas defined in
513 &quot;Algorithms for Computing the Sample Variance: Analysis and
514 Recommendations&quot;, Chan, T.F., Golub, G.H., and LeVeque, R.J.
515 1983, American Statistician, vol. 37, pp. 242-247, referenced in
516 Weisberg, S. &quot;Applied Linear Regression&quot;. 2nd Ed. 1985. All regression
517 statistics are computed from these sums.</li>
518 <li> Inference statistics (confidence intervals, parameter significance levels)
519 are based on the assumption that the observations included in the model are
520 drawn from a <a href="http://mathworld.wolfram.com/BivariateNormalDistribution.html" class="externalLink">
521 Bivariate Normal Distribution</a>.</li>
522 </ul>
523 </p>
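<p>
The notes above say that the regression statistics are computed from running sums rather
than stored observations. A minimal sketch of that idea follows. <code>RunningRegression</code>
is a hypothetical class written for illustration, the one-pass update of the deviation sums
is a standard recurrence chosen for this sketch, and the code should not be read as the
actual <code>SimpleRegression</code> source.
</p>

```java
// Illustrative sketch only: RunningRegression is a hypothetical class,
// not part of Commons Math. It estimates y = intercept + slope * x
// from running sums, without storing the observations.
final class RunningRegression {
    private long n;
    private double xbar;
    private double ybar;
    private double sxx; // sum of squared deviations of x from its mean
    private double sxy; // sum of cross products of x and y deviations

    void addData(double x, double y) {
        n++;
        double dx = x - xbar; // deviation from the previous x mean
        double dy = y - ybar; // deviation from the previous y mean
        xbar += dx / n;
        ybar += dy / n;
        sxx += dx * (x - xbar);
        sxy += dx * (y - ybar);
    }

    /** Slope estimate; NaN with fewer than two points or no variation in x. */
    double getSlope() {
        if (n < 2 || sxx == 0.0) {
            return Double.NaN;
        }
        return sxy / sxx;
    }

    double getIntercept() {
        return ybar - getSlope() * xbar;
    }

    double predict(double x) {
        return getIntercept() + getSlope() * x;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11} };
        RunningRegression regression = new RunningRegression();
        for (double[] point : data) {
            regression.addData(point[0], point[1]);
        }
        System.out.println("slope = " + regression.getSlope());
        System.out.println("intercept = " + regression.getIntercept());
    }
}
```

<p>
Since each getter derives its result from the current sums, more data can be added at any
time and the next call simply reflects the larger sample.
</p>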
524 <p>
525 Here are some examples.
526 <dl><dt>Estimate a model based on observations added one at a time</dt>
527 <br />
528 </br><dd>Instantiate a regression instance and add data points
529 <div class="source"><pre>
530 SimpleRegression regression = new SimpleRegression();
531 regression.addData(1d, 2d);
532 // At this point, with only one observation,
533 // all regression statistics will return NaN
534
535 regression.addData(3d, 3d);
536 // With only two observations,
537 // slope and intercept can be computed
538 // but inference statistics will return NaN
539
540 regression.addData(3d, 3d);
541 // Now all statistics are defined.
542 </pre>
543 </div>
544 </dd>
545 <dd>Compute some statistics based on observations added so far
546 <div class="source"><pre>
547 System.out.println(regression.getIntercept());
548 // displays intercept of regression line
549
550 System.out.println(regression.getSlope());
551 // displays slope of regression line
552
553 System.out.println(regression.getSlopeStdErr());
554 // displays slope standard error
555 </pre>
556 </div>
557 </dd>
558 <dd>Use the regression model to predict the y value for a new x value
559 <div class="source"><pre>
560 System.out.println(regression.predict(1.5d));
561 // displays predicted y value for x = 1.5
562 </pre>
563 </div>
564
565 More data points can be added and subsequent getXxx calls will incorporate
566 additional data in statistics.
567 </dd>
568 <dt>Estimate a model from a double[][] array of data points</dt>
569 <br />
570 </br><dd>Instantiate a regression object and load dataset
571 <div class="source"><pre>
572 double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }};
573 SimpleRegression regression = new SimpleRegression();
574 regression.addData(data);
575 </pre>
576 </div>
577 </dd>
578 <dd>Estimate regression model based on data
579 <div class="source"><pre>
580 System.out.println(regression.getIntercept());
581 // displays intercept of regression line
582
583 System.out.println(regression.getSlope());
584 // displays slope of regression line
585
586 System.out.println(regression.getSlopeStdErr());
587 // displays slope standard error
588 </pre>
589 </div>
590
591 More data points -- even another double[][] array -- can be added and subsequent
592 getXxx calls will incorporate additional data in statistics.
593 </dd>
594 </dl>
595 </p>
596 </div>
597 <div class="section"><h3><a name="a1.5_Multiple_linear_regression"></a>1.5 Multiple linear regression</h3>
598 <p><a href="../apidocs/org/apache/commons/math/stat/regression/MultipleLinearRegression.html">
599 org.apache.commons.math.stat.regression.MultipleLinearRegression</a>
600 provides ordinary least squares regression with a generic multiple variable linear model, which
601 in matrix notation can be expressed as:
602 </p>
603 <p><code> y=X*b+u </code></p>
604 <p>
605 where <code>y</code> is an <code>n-vector</code> <b>regressand</b>, <code>X</code> is a <code>[n,k]</code> matrix whose <code>k</code> columns are called
606 <b>regressors</b>, <code>b</code> is a <code>k-vector</code> of <b>regression parameters</b> and <code>u</code> is an <code>n-vector</code>
607 of <b>error terms</b> or <b>residuals</b>. The notation is standard in the literature;
608 see, e.g., <a href="http://www.econ.queensu.ca/ETM" class="externalLink">Davidson and MacKinnon, Econometric Theory and Methods, 2004</a>.
609 </p>
610 <p>
611 Two implementations are provided: <a href="../apidocs/org/apache/commons/math/stat/regression/OLSMultipleLinearRegression.html">
612 org.apache.commons.math.stat.regression.OLSMultipleLinearRegression</a> and
613 <a href="../apidocs/org/apache/commons/math/stat/regression/GLSMultipleLinearRegression.html">
614 org.apache.commons.math.stat.regression.GLSMultipleLinearRegression</a>.</p>
615 <p>
616 Observations (x,y and covariance data matrices) can be added to the model via the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method.
617 The observations are stored in memory until the next time the addData method is invoked.
618 </p>
619 <p><strong>Usage Notes</strong>: <ul><li> Data is validated when invoking the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method and
620 an <code>IllegalArgumentException</code> is thrown when the data is invalid.
621 </li>
622 <li> Only the GLS regression requires the covariance matrix, so in the OLS regression it is ignored and can safely be
623 passed as <code>null</code>.</li>
624 </ul>
625 </p>
626 <p>
627 Here are some examples.
628 <dl><dt>OLS regression</dt>
629 <br />
630 </br><dd>Instantiate an OLS regression object and load dataset
631 <div class="source"><pre>
632 MultipleLinearRegression regression = new OLSMultipleLinearRegression();
633 double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
634 double[][] x = new double[6][];
635 x[0] = new double[]{1.0, 0, 0, 0, 0, 0};
636 x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0};
637 x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0};
638 x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0};
639 x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0};
640 x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0};
641 regression.addData(y, x, null); // we don't need covariance
642 </pre>
643 </div>
644 </dd>
645 <dd>Estimate of regression values honours the <code>MultipleLinearRegression</code> interface:
646 <div class="source"><pre>
647 double[] beta = regression.estimateRegressionParameters();
648
649 double[] residuals = regression.estimateResiduals();
650
651 double[][] parametersVariance = regression.estimateRegressionParametersVariance();
652
653 double regressandVariance = regression.estimateRegressandVariance();
654 </pre>
655 </div>
656 </dd>
657 <dt>GLS regression</dt>
658 <br />
659 <dd>Instantiate a GLS regression object and load dataset
660 <div class="source"><pre>
661 MultipleLinearRegression regression = new GLSMultipleLinearRegression();
662 double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
663 double[][] x = new double[6][];
664 x[0] = new double[]{1.0, 0, 0, 0, 0, 0};
665 x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0};
666 x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0};
667 x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0};
668 x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0};
669 x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0};
670 double[][] omega = new double[6][];
671 omega[0] = new double[]{1.1, 0, 0, 0, 0, 0};
672 omega[1] = new double[]{0, 2.2, 0, 0, 0, 0};
673 omega[2] = new double[]{0, 0, 3.3, 0, 0, 0};
674 omega[3] = new double[]{0, 0, 0, 4.4, 0, 0};
675 omega[4] = new double[]{0, 0, 0, 0, 5.5, 0};
676 omega[5] = new double[]{0, 0, 0, 0, 0, 6.6};
677 regression.addData(y, x, omega); // we do need covariance
678 </pre>
679 </div>
680 </dd>
681 <dd>Estimate of regression values honours the same <code>MultipleLinearRegression</code> interface as
682 the OLS regression.
683 </dd>
684 </dl>
685 </p>
686 </div>
687 <div class="section"><h3><a name="a1.6_Rank_transformations"></a>1.6 Rank transformations</h3>
688 <p>
689 Some statistical algorithms require that input data be replaced by ranks.
690 The <a href="../apidocs/org/apache/commons/math/stat/ranking/package-summary.html">
691 org.apache.commons.math.stat.ranking</a> package provides rank transformation.
692 <a href="../apidocs/org/apache/commons/math/stat/ranking/RankingAlgorithm.html">
693 RankingAlgorithm</a> defines the interface for ranking.
694 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html">
695 NaturalRanking</a> provides an implementation that has two configuration options.
696 <ul><li><a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html">
697 Ties strategy</a> determines how ties in the source data are handled by the ranking.</li>
698 <li><a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html">
699 NaN strategy</a> determines how NaN values in the source data are handled.</li>
700 </ul>
701 </p>
702 <p>
703 Examples:
704 <div class="source"><pre>
705 NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL,
706 TiesStrategy.MAXIMUM);
707 double[] exampleData = { 20, 17, 30, 42.3, 17, 50,
708 Double.NaN, Double.NEGATIVE_INFINITY, 17 };
709 double[] ranks = ranking.rank(exampleData);
710 </pre>
711 </div>
712
713 results in <code>ranks</code> containing <code>{6, 5, 7, 8, 5, 9, 2, 2, 5}</code>. <div class="source"><pre>
714 new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData);
715 </pre>
716 </div>
717
718 returns <code>{5, 2, 6, 7, 3, 8, 1, 4}</code>.</p>
719 <p>
720 The default <code>NaNStrategy</code> is <code>NaNStrategy.MAXIMAL</code>. This makes <code>NaN</code>
721 values larger than any other value (including <code>Double.POSITIVE_INFINITY</code>). The
722 default <code>TiesStrategy</code> is <code>TiesStrategy.AVERAGE</code>, which assigns tied
723 values the average of the ranks applicable to the sequence of ties. See the
724 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html">
725 NaturalRanking</a> documentation for more examples and <a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html">
726 TiesStrategy</a> and <a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html">NaNStrategy</a>
727 for details on these configuration options.
728 </p>
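<p>
As a rough illustration of what &quot;average of the ranks applicable to the sequence of ties&quot;
means, the sketch below ranks an array with tied values sharing the mean of their positions.
<code>AverageRanks</code> is a hypothetical helper, not the <code>NaturalRanking</code>
implementation. It relies on <code>Double.compare</code>, which orders <code>NaN</code> after
every other value, so it only approximates the default strategies described above.
</p>

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only: AverageRanks is a hypothetical helper,
// not part of Commons Math.
final class AverageRanks {

    /** Returns 1-based ranks; tied values receive the average of their ranks. */
    static double[] rank(final double[] data) {
        // Sort the indices of data by the values they point to.
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(data[a], data[b]);
            }
        });

        double[] ranks = new double[data.length];
        int i = 0;
        while (i < order.length) {
            // Extend j to the end of the run of values equal to data[order[i]].
            int j = i;
            while (j + 1 < order.length
                    && Double.compare(data[order[j + 1]], data[order[i]]) == 0) {
                j++;
            }
            // Sorted positions i..j are 0-based; ranks are 1-based.
            double average = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = average;
            }
            i = j + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] exampleData = { 20, 17, 30, 42.3, 17, 50,
            Double.NaN, Double.NEGATIVE_INFINITY, 17 };
        System.out.println(Arrays.toString(rank(exampleData)));
    }
}
```

<p>
For the example data used earlier in this section, the three values of 17 occupy sorted
positions 2, 3 and 4, so each receives rank 3 in this sketch.
</p>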
729 </div>
730 <div class="section"><h3><a name="a1.7_Covariance_and_correlation"></a>1.7 Covariance and correlation</h3>
731 <p>
732 The <a href="../apidocs/org/apache/commons/math/stat/correlation/package-summary.html">
733 org.apache.commons.math.stat.correlation</a> package computes covariances
734 and correlations for pairs of arrays or columns of a matrix.
735 <a href="../apidocs/org/apache/commons/math/stat/correlation/Covariance.html">
736 Covariance</a> computes covariances,
737 <a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html">
738 PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients and
739 <a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html">
740 SpearmansCorrelation</a> computes Spearman's rank correlation.
741 </p>
742 <p><strong>Implementation Notes</strong><ul><li>
743 Unbiased covariances are given by the formula <br />
744 <code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code>
745 where <code>E(X)</code> is the mean of the <code>X</code> values and <code>E(Y)</code>
746 is the mean of the <code>Y</code> values. Non-bias-corrected estimates use
747 <code>n</code> in place of <code>n - 1</code>. Whether or not covariances are
748 bias-corrected is determined by the optional parameter <code>biasCorrected</code>, which
749 defaults to <code>true</code>.</li>
750 <li><a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html">
751 PearsonsCorrelation</a> computes correlations defined by the formula <br />
752 <code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br />
753
754 where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code>
755 and <code>s(X)</code>, <code>s(Y)</code> are standard deviations.
756 </li>
757 <li><a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html">
758 SpearmansCorrelation</a> applies a rank transformation to the input data and computes Pearson's
759 correlation on the ranked data. The ranking algorithm is configurable. By default,
760 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html">
761 NaturalRanking</a> with default strategies for handling ties and NaN values is used.
762 </li>
763 </ul>
764 </p>
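<p>The two formulas above can be sketched in plain Java as follows. This is only an
illustration of the arithmetic, not the library's implementation; the class name
<code>CovarianceSketch</code> and its methods are hypothetical and use no Commons Math classes.</p>
<div class="source"><pre>
```java
/** Illustrative sketch of the covariance and Pearson formulas (hypothetical class, not library code). */
final class CovarianceSketch {

    /** Arithmetic mean E(V). */
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** sum[(x_i - E(X))(y_i - E(Y))] divided by n - 1, or by n when biasCorrected is false. */
    static double covariance(double[] x, double[] y, boolean biasCorrected) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double mx = mean(x);
        double my = mean(y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - mx) * (y[i] - my);
        }
        return sum / (biasCorrected ? x.length - 1 : x.length);
    }

    /** cor(X, Y) = cov(X, Y) / (s(X) s(Y)), with s the bias-corrected standard deviation. */
    static double pearson(double[] x, double[] y) {
        double sx = Math.sqrt(covariance(x, x, true));
        double sy = Math.sqrt(covariance(y, y, true));
        return covariance(x, y, true) / (sx * sy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 6, 8, 10};
        System.out.println(covariance(x, y, true));   // divides by n - 1
        System.out.println(covariance(x, y, false));  // divides by n
        System.out.println(pearson(x, y));
    }
}
```
</pre>
</div>
<p>Because <code>y</code> is an exact linear function of <code>x</code> in this made-up data,
the correlation printed last should be 1 up to rounding.</p>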
765 <p><strong>Examples:</strong><dl><dt><strong>Covariance of 2 arrays</strong></dt>
766 <br />
767 </br><dd>To compute the unbiased covariance between 2 double arrays,
768 <code>x</code> and <code>y</code>, use:
769 <div class="source"><pre>
770 new Covariance().covariance(x, y)
771 </pre>
772 </div>
773
774 For non-bias-corrected covariances, use
775 <div class="source"><pre>
776 covariance(x, y, false)
777 </pre>
778 </div>
779 </dd>
780 <br />
781 </br><dt><strong>Covariance matrix</strong></dt>
782 <br />
783 </br><dd> A covariance matrix over the columns of a source matrix <code>data</code>
784 can be computed using
785 <div class="source"><pre>
786 new Covariance().computeCovarianceMatrix(data)
787 </pre>
788 </div>
789
790 The i-jth entry of the returned matrix is the unbiased covariance of the ith and jth
791 columns of <code>data.</code> As above, to get non-bias-corrected covariances,
792 use
793 <div class="source"><pre>
794 computeCovarianceMatrix(data, false)
795 </pre>
796 </div>
797 </dd>
798 <br />
799 </br><dt><strong>Pearson's correlation of 2 arrays</strong></dt>
800 <br />
801 </br><dd>To compute the Pearson's product-moment correlation between two double arrays
802 <code>x</code> and <code>y</code>, use:
803 <div class="source"><pre>
804 new PearsonsCorrelation().correlation(x, y)
805 </pre>
806 </div>
807 </dd>
808 <br />
809 </br><dt><strong>Pearson's correlation matrix</strong></dt>
810 <br />
811 </br><dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code>
812 can be computed using
813 <div class="source"><pre>
814 new PearsonsCorrelation().computeCorrelationMatrix(data)
815 </pre>
816 </div>
817
818 The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the
819 ith and jth columns of <code>data.</code></dd>
820 <br />
821 </br><dt><strong>Pearson's correlation significance and standard errors</strong></dt>
822 <br />
823 </br><dd> To compute standard errors and/or significance levels of
824 Pearson's correlation coefficients, start by creating a
825 <code>PearsonsCorrelation</code> instance
826 <div class="source"><pre>
827 PearsonsCorrelation correlation = new PearsonsCorrelation(data);
828 </pre>
829 </div>
830
831 where <code>data</code> is either a rectangular array or a <code>RealMatrix.</code>
832 Then the matrix of standard errors is
833 <div class="source"><pre>
834 correlation.getCorrelationStandardErrors();
835 </pre>
836 </div>
837
838 The formula used to compute the standard error is <br />
839 <code>SE<sub>r</sub> = ((1 - r<sup>2</sup>) / (n - 2))<sup>1/2</sup></code><br />
840
841 where <code>r</code> is the estimated correlation coefficient and
842 <code>n</code> is the number of observations in the source dataset.<br />
843 <br />
844 <strong>p-values</strong> for the (2-sided) null hypotheses that elements of
845 a correlation matrix are zero populate the RealMatrix returned by
846 <div class="source"><pre>
847 correlation.getCorrelationPValues()
848 </pre>
849 </div>
850 <code>getCorrelationPValues().getEntry(i,j)</code> is the
851 probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes
852 a value with absolute value greater than or equal to <br />
853 </br><code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>,
854 where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth
855 columns of the source array or RealMatrix. This is sometimes referred to as the
856 <i>significance</i> of the coefficient.<br />
857 <br />
858
859 For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then
860 <div class="source"><pre>
861 new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1)
862 </pre>
863 </div>
864
865 is the significance of the Pearson's correlation coefficient between the two columns
866 of <code>data</code>. If this value is less than .01, we can say that the correlation
867 between the two columns of data is significant at the 99% level.
868 </dd>
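<dd>
<p>As a rough illustration of the two formulas quoted above, the following plain-Java sketch
computes the standard error and the t-statistic for a single coefficient. The class
<code>CorrelationInferenceSketch</code> and the values of <code>r</code> and <code>n</code> are
hypothetical; turning the t-statistic into a p-value still requires the t distribution with
<code>n - 2</code> degrees of freedom, which this sketch does not implement.</p>
<div class="source"><pre>
```java
/** Illustrative sketch of SE_r and the t-statistic for a correlation coefficient (hypothetical class). */
final class CorrelationInferenceSketch {

    /** SE_r = sqrt((1 - r^2) / (n - 2)). */
    static double standardError(double r, int n) {
        if (n < 3) {
            throw new IllegalArgumentException("need at least 3 observations");
        }
        return Math.sqrt((1.0 - r * r) / (n - 2));
    }

    /** |r| * sqrt((n - 2) / (1 - r^2)), the value compared against the t distribution with n - 2 df. */
    static double tStatistic(double r, int n) {
        if (n < 3) {
            throw new IllegalArgumentException("need at least 3 observations");
        }
        return Math.abs(r) * Math.sqrt((n - 2) / (1.0 - r * r));
    }

    public static void main(String[] args) {
        double r = 0.6;  // made-up correlation estimate
        int n = 10;      // made-up number of observations
        System.out.println(standardError(r, n));
        System.out.println(tStatistic(r, n));
    }
}
```
</pre>
</div>
<p>Note that the t-statistic is simply <code>|r|</code> divided by the standard error.</p>
</dd>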
869 <br />
870 </br><dt><strong>Spearman's rank correlation coefficient</strong></dt>
871 <br />
872 </br><dd>To compute the Spearman's rank correlation between two double arrays
873 <code>x</code> and <code>y</code>:
874 <div class="source"><pre>
875 new SpearmansCorrelation().correlation(x, y)
876 </pre>
877 </div>
878
879 This is equivalent to
880 <div class="source"><pre>
881 RankingAlgorithm ranking = new NaturalRanking();
882 new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
883 </pre>
884 </div>
885 </dd>
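<dd>
<p>The rank-then-correlate idea can be sketched without the library as shown below. The sketch
assumes that tied values share the average of the ranks they occupy and that the data contain no
NaN values; whether this matches the default <code>NaturalRanking</code> strategies should be
checked against the API documentation. The class <code>SpearmanSketch</code> is hypothetical.</p>
<div class="source"><pre>
```java
/** Illustrative sketch: Spearman's correlation as Pearson's correlation of average ranks (hypothetical class). */
final class SpearmanSketch {

    /** Ranks start at 1; tied values receive the average of the ranks they span. NaN is not handled. */
    static double[] averageRanks(double[] v) {
        double[] ranks = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int smaller = 0;
            int equal = 0; // includes v[i] itself
            for (int j = 0; j < v.length; j++) {
                if (v[j] < v[i]) {
                    smaller++;
                } else if (v[j] == v[i]) {
                    equal++;
                }
            }
            ranks[i] = smaller + (equal + 1) / 2.0;
        }
        return ranks;
    }

    /** Pearson's product-moment correlation computed from sums of deviations. */
    static double pearson(double[] x, double[] y) {
        double mx = 0.0;
        double my = 0.0;
        for (int i = 0; i < x.length; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= x.length;
        my /= y.length;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    static double spearman(double[] x, double[] y) {
        return pearson(averageRanks(x), averageRanks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {5, 6, 7, 8, 7};  // the two 7s share rank 3.5
        System.out.println(spearman(x, y));
    }
}
```
</pre>
</div>
</dd>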
886 <br />
887 </br></dl>
888 </p>
889 </div>
890 <div class="section"><h3><a name="a1.8_Statistical_tests"></a>1.8 Statistical tests</h3>
891 <p>
892 The interfaces and implementations in the
893 <a href="../apidocs/org/apache/commons/math/stat/inference/">
894 org.apache.commons.math.stat.inference</a> package provide
895 <a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc22.htm" class="externalLink">
896 Student's t</a>,
897 <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm" class="externalLink">
898 Chi-Square</a> and
899 <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm" class="externalLink">
900 One-Way ANOVA</a> test statistics as well as
901 <a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue" class="externalLink">
902 p-values</a> associated with <code>t-</code>,
903 <code>Chi-Square</code> and <code>One-Way ANOVA</code> tests. The
904 interfaces are
905 <a href="../apidocs/org/apache/commons/math/stat/inference/TTest.html">
906 TTest</a>,
907 <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTest.html">
908 ChiSquareTest</a>, and
909 <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnova.html">
910 OneWayAnova</a> with provided implementations
911 <a href="../apidocs/org/apache/commons/math/stat/inference/TTestImpl.html">
912 TTestImpl</a>,
913 <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTestImpl.html">
914 ChiSquareTestImpl</a> and
915 <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnovaImpl.html">
916 OneWayAnovaImpl</a>, respectively.
917 The
918 <a href="../apidocs/org/apache/commons/math/stat/inference/TestUtils.html">
919 TestUtils</a> class provides static methods to get test instances or
920 to compute test statistics directly. The examples below all use the
921 static methods in <code>TestUtils</code> to execute tests. To get
922 test object instances, either use e.g.,
923 <code>TestUtils.getTTest()</code> or use the implementation constructors
924 directly, e.g.,
925 <code>new TTestImpl()</code>.
926 </p>
927 <p><strong>Implementation Notes</strong><ul><li>Both one- and two-sample t-tests are supported. Two sample tests
928 can be either paired or unpaired and the unpaired two-sample tests can
929 be conducted under the assumption of equal subpopulation variances or
930 without this assumption. When equal variances are assumed, a pooled
931 variance estimate is used to compute the t-statistic, and the number of degrees
932 of freedom used in the t-test equals the sum of the sample sizes minus 2.
933 When equal variances are not assumed, the t-statistic uses both sample
934 variances and the
935 <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/gifs/nu3.gif" class="externalLink">
936 Welch-Satterthwaite approximation</a> is used to compute the degrees
937 of freedom. Methods to return t-statistics and p-values are provided in each
938 case, as well as boolean-valued methods to perform fixed significance
939 level tests. The names of test or test-statistic methods that assume equal
940 subpopulation variances always start with &quot;homoscedastic&quot;. Test or
941 test-statistic methods that just start with &quot;t&quot; do not assume equal
942 variances. See the examples below and the API documentation for
943 more details.</li>
944 <li>The validity of the p-values returned by the t-test depends on the
945 assumptions of the parametric t-test procedure, as discussed
946 <a href="http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html" class="externalLink">
947 here</a>.</li>
948 <li>p-values returned by t-, chi-square and ANOVA tests are exact, based
949 on numerical approximations to the t-, chi-square and F distributions in the
950 <code>distribution</code> package. </li>
951 <li>p-values returned by t-tests are for two-sided tests and the boolean-valued
952 methods supporting fixed significance level tests assume that the hypotheses
953 are two-sided. One sided tests can be performed by dividing returned p-values
954 (resp. critical values) by 2.</li>
955 <li>Degrees of freedom for chi-square tests are integral values, based on the
956 number of observed or expected counts (number of observed counts - 1)
957 for the goodness-of-fit tests and (number of columns -1) * (number of rows - 1)
958 for independence tests.</li>
959 </ul>
960 </p>
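<p>The difference between the two unpaired two-sample variants described in the first note can be
sketched in plain Java using the standard textbook formulas. This is an illustration under that
assumption, not the library's code; the class <code>TwoSampleTSketch</code> and its method names
are hypothetical. Inputs are sample means (<code>m</code>), bias-corrected sample variances
(<code>v</code>) and sample sizes (<code>n</code>).</p>
<div class="source"><pre>
```java
/** Illustrative sketch of unpaired two-sample t-statistics and degrees of freedom (hypothetical class). */
final class TwoSampleTSketch {

    /** t-statistic without the equal-variance assumption: uses both sample variances. */
    static double t(double m1, double m2, double v1, double v2, double n1, double n2) {
        return (m1 - m2) / Math.sqrt(v1 / n1 + v2 / n2);
    }

    /** Welch-Satterthwaite approximation to the degrees of freedom. */
    static double welchDf(double v1, double v2, double n1, double n2) {
        double a = v1 / n1;
        double b = v2 / n2;
        return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
    }

    /** t-statistic under the equal-variance assumption: uses a pooled variance estimate. */
    static double homoscedasticT(double m1, double m2, double v1, double v2, double n1, double n2) {
        double pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2);
        return (m1 - m2) / Math.sqrt(pooled * (1 / n1 + 1 / n2));
    }

    /** Degrees of freedom under the equal-variance assumption: sum of the sample sizes minus 2. */
    static double homoscedasticDf(double n1, double n2) {
        return n1 + n2 - 2;
    }

    public static void main(String[] args) {
        // Made-up summary values: unequal variances and unequal sample sizes.
        System.out.println(t(10, 8, 1, 9, 4, 9));
        System.out.println(welchDf(1, 9, 4, 9));
        System.out.println(homoscedasticT(10, 8, 1, 9, 4, 9));
        System.out.println(homoscedasticDf(4, 9));
    }
}
```
</pre>
</div>
<p>When the two samples have equal sizes and equal variances, both variants give the same
t-statistic and the same degrees of freedom.</p>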
961 <p><strong>Examples:</strong><dl><dt><strong>One-sample <code>t</code> tests</strong></dt>
962 <br />
963 </br><dd>To compare the mean of a double[] array to a fixed value:
964 <div class="source"><pre>
965 double[] observed = {1d, 2d, 3d};
966 double mu = 2.5d;
967 System.out.println(TestUtils.t(mu, observed));
968 </pre>
969 </div>
970
971 The code above will display the t-statistic associated with a one-sample
972 t-test comparing the mean of the <code>observed</code> values against
973 <code>mu</code>.</dd>
974 <dd>To compare the mean of a dataset described by a
975 <a href="../apidocs/org/apache/commons/math/stat/descriptive/StatisticalSummary.html">
976 org.apache.commons.math.stat.descriptive.StatisticalSummary</a> to a fixed value:
977 <div class="source"><pre>
978 double[] observed = {1d, 2d, 3d};
979 double mu = 2.5d;
980 SummaryStatistics sampleStats = new SummaryStatistics();
981 for (int i = 0; i &lt; observed.length; i++) {
982     sampleStats.addValue(observed[i]);
983 }
984 System.out.println(TestUtils.t(mu, sampleStats)); // pass the summary, not the raw array
985 </pre>
986 </div>
987 </dd>
988 <dd>To compute the p-value associated with the null hypothesis that the mean
989 of a set of values equals a point estimate, against the two-sided alternative that
990 the mean is different from the target value:
991 <div class="source"><pre>
992 double[] observed = {1d, 2d, 3d};
993 double mu = 2.5d;
994 System.out.println(TestUtils.tTest(mu, observed));
995 </pre>
996 </div>
997
998 The snippet above will display the p-value associated with the null
999 hypothesis that the mean of the population from which the
1000 <code>observed</code> values are drawn equals <code>mu</code>.</dd>
1001 <dd>To perform the test using a fixed significance level, use:
1002 <div class="source"><pre>
1003 TestUtils.tTest(mu, observed, alpha);
1004 </pre>
1005 </div>
1006
1007 where <code>0 &lt; alpha &lt; 0.5</code> is the significance level of
1008 the test. The boolean value returned will be <code>true</code> iff the
1009 null hypothesis can be rejected with confidence <code>1 - alpha</code>.
1010 To test at the 95% level of confidence, for example, use
1011 <code>alpha = 0.05</code>.</dd>
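<dd>
<p>For reference, the one-sample t-statistic can be sketched in plain Java with the standard
formula <code>(mean - mu) / sqrt(variance / n)</code>, using the bias-corrected sample variance.
The class <code>OneSampleTSketch</code> is hypothetical, and the sign convention (sample mean
minus <code>mu</code>) is an assumption about what the library returns.</p>
<div class="source"><pre>
```java
/** Illustrative sketch of the one-sample t-statistic (hypothetical class, not library code). */
final class OneSampleTSketch {

    /** (mean - mu) / sqrt(variance / n), with the variance divided by n - 1. */
    static double t(double mu, double[] observed) {
        int n = observed.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least 2 observations");
        }
        double sum = 0.0;
        for (double v : observed) {
            sum += v;
        }
        double mean = sum / n;
        double squares = 0.0;
        for (double v : observed) {
            squares += (v - mean) * (v - mean);
        }
        double variance = squares / (n - 1);
        return (mean - mu) / Math.sqrt(variance / n);
    }

    public static void main(String[] args) {
        double[] observed = {1d, 2d, 3d};
        double mu = 2.5d;
        System.out.println(t(mu, observed));
    }
}
```
</pre>
</div>
</dd>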
1012 <br />
1013 </br><dt><strong>Two-Sample t-tests</strong></dt>
1014 <br />
1015 </br><dd><strong>Example 1:</strong> Paired test evaluating
1016 the null hypothesis that the mean difference between corresponding
1017 (paired) elements of the <code>double[]</code> arrays
1018 <code>sample1</code> and <code>sample2</code> is zero.
1019
1020 To compute the t-statistic:
1021 <div class="source"><pre>
1022 TestUtils.pairedT(sample1, sample2);
1023 </pre>
1024 </div>
1025 <p>
1026 To compute the p-value:
1027 <div class="source"><pre>
1028 TestUtils.pairedTTest(sample1, sample2);
1029 </pre>
1030 </div>
1031 </p>
1032 <p>
1033 To perform a fixed significance level test with alpha = .05:
1034 <div class="source"><pre>
1035 TestUtils.pairedTTest(sample1, sample2, .05);
1036 </pre>
1037 </div>
1038 </p>
1039
1040 The last example will return <code>true</code> iff the p-value
1041 returned by <code>TestUtils.pairedTTest(sample1, sample2)</code>
1042 is less than <code>.05</code>.</dd>
1043 <dd><strong>Example 2: </strong> unpaired, two-sided, two-sample t-test using
1044 <code>StatisticalSummary</code> instances, without assuming that
1045 subpopulation variances are equal.
1046
1047 First create the <code>StatisticalSummary</code> instances. Both
1048 <code>DescriptiveStatistics</code> and <code>SummaryStatistics</code>
1049 implement this interface. Assume that <code>summary1</code> and
1050 <code>summary2</code> are <code>SummaryStatistics</code> instances,
1051 each of which has had at least 2 values added to the (virtual) dataset that
1052 it describes. The sample sizes do not have to be the same -- all that is required
1053 is that both samples have at least 2 elements.
1054 <p><strong>Note:</strong> The <code>SummaryStatistics</code> class does
1055 not store the dataset that it describes in memory, but it does compute all
1056 statistics necessary to perform t-tests, so this method can be used to
1057 conduct t-tests with very large samples. One-sample tests can also be
1058 performed this way.
1059 (See <a href="#a1.2_Descriptive_statistics">Descriptive statistics</a> for details
1060 on the <code>SummaryStatistics</code> class.)
1061 </p>
1062 <p>
1063 To compute the t-statistic:
1064 <div class="source"><pre>
1065 TestUtils.t(summary1, summary2);
1066 </pre>
1067 </div>
1068 </p>
1069 <p>
1070 To compute the p-value:
1071 <div class="source"><pre>
1072 TestUtils.tTest(summary1, summary2);
1073 </pre>
1074 </div>
1075 </p>
1076 <p>
1077 To perform a fixed significance level test with alpha = .05:
1078 <div class="source"><pre>
1079 TestUtils.tTest(summary1, summary2, .05);
1080 </pre>
1081 </div>
1082 </p>
1083 <p>
1084 In each case above, the test does not assume that the subpopulation
1085 variances are equal. To perform the tests under this assumption,
1086 replace &quot;t&quot; at the beginning of the method name with &quot;homoscedasticT&quot; (for example, <code>homoscedasticTTest</code> instead of <code>tTest</code>).
1087 </p>
1088 </dd>
1089 <br />
1090 </br><dt><strong>Chi-square tests</strong></dt>
1091 <br />
1092 </br><dd>To compute a chi-square statistic measuring the agreement between a
1093 <code>long[]</code> array of observed counts and a <code>double[]</code>
1094 array of expected counts, use:
1095 <div class="source"><pre>
1096 long[] observed = {10, 9, 11};
1097 double[] expected = {10.1, 9.8, 10.3};
1098 System.out.println(TestUtils.chiSquare(expected, observed));
1099 </pre>
1100 </div>
1101
1102 The value displayed will be
1103 <code>sum((expected[i] - observed[i])^2 / expected[i])</code>.</dd>
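<dd>
<p>The sum above can be written out in plain Java as a quick check of the arithmetic. The class
<code>ChiSquareSketch</code> is hypothetical and does no input validation beyond a length check.
The made-up counts used here have equal observed and expected totals, so the sketch does not
depend on how the library treats totals that differ.</p>
<div class="source"><pre>
```java
/** Illustrative sketch of the chi-square goodness-of-fit statistic (hypothetical class). */
final class ChiSquareSketch {

    /** sum((expected[i] - observed[i])^2 / expected[i]). */
    static double chiSquare(double[] expected, long[] observed) {
        if (expected.length != observed.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double dev = expected[i] - observed[i];
            sum += dev * dev / expected[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] observed = {10, 9, 11};
        double[] expected = {10.0, 10.0, 10.0};
        System.out.println(chiSquare(expected, observed));
        // Degrees of freedom for the goodness-of-fit test: number of counts - 1.
        System.out.println(observed.length - 1);
    }
}
```
</pre>
</div>
</dd>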
1104 <dd> To get the p-value associated with the null hypothesis that
1105 <code>observed</code> conforms to <code>expected</code> use:
1106 <div class="source"><pre>
1107 TestUtils.chiSquareTest(expected, observed);
1108 </pre>
1109 </div>
1110 </dd>
1111 <dd> To test the null hypothesis that <code>observed</code> conforms to
1112 <code>expected</code> with <code>alpha</code> significance level
1113 (equiv. <code>100 * (1-alpha)%</code> confidence) where <code>
1114 0 &lt; alpha &lt; 1 </code> use:
1115 <div class="source"><pre>
1116 TestUtils.chiSquareTest(expected, observed, alpha);
1117 </pre>
1118 </div>
1119
1120 The boolean value returned will be <code>true</code> iff the null hypothesis
1121 can be rejected with confidence <code>1 - alpha</code>.
1122 </dd>
1123 <dd>To compute a chi-square statistic associated with a
1124 <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm" class="externalLink">
1125 chi-square test of independence</a> based on a two-dimensional (long[][])
1126 <code>counts</code> array viewed as a two-way table, use:
1127 <div class="source"><pre>
1128 TestUtils.chiSquare(counts);
1129 </pre>
1130 </div>
1131
1132 The rows of the 2-way table are
1133 <code>counts[0], ... , counts[counts.length - 1]. </code><br />
1134 </br>
1135 The chi-square statistic returned is
1136 <code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code>
1137 where the sum is taken over all table entries and
1138 <code>expected[i][j]</code> is the product of the row and column sums at
1139 row <code>i</code>, column <code>j</code> divided by the total count.
1140 </dd>
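<dd>
<p>A plain-Java sketch of the statistic described above, with <code>expected[i][j]</code> computed
as (row sum * column sum) / total. The class <code>IndependenceSketch</code> and the 2 x 2 table
are hypothetical, and the sketch assumes a rectangular table with positive row and column sums.</p>
<div class="source"><pre>
```java
/** Illustrative sketch of the chi-square statistic for a two-way table (hypothetical class). */
final class IndependenceSketch {

    /** sum((counts[i][j] - expected[i][j])^2 / expected[i][j]) over all table entries. */
    static double chiSquare(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double sum = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                double dev = counts[i][j] - expected;
                sum += dev * dev / expected;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] counts = {{10, 20}, {30, 40}};
        System.out.println(chiSquare(counts));
        // Degrees of freedom: (number of columns - 1) * (number of rows - 1).
        System.out.println((counts[0].length - 1) * (counts.length - 1));
    }
}
```
</pre>
</div>
</dd>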
1141 <dd>To compute the p-value associated with the null hypothesis that
1142 the classifications represented by the counts in the columns of the input 2-way
1143 table are independent of the rows, use:
1144 <div class="source"><pre>
1145 TestUtils.chiSquareTest(counts);
1146 </pre>
1147 </div>
1148 </dd>
1149 <dd>To perform a chi-square test of independence with <code>alpha</code>
1150 significance level (equiv. <code>100 * (1-alpha)%</code> confidence)
1151 where <code>0 &lt; alpha &lt; 1 </code> use:
1152 <div class="source"><pre>
1153 TestUtils.chiSquareTest(counts, alpha);
1154 </pre>
1155 </div>
1156
1157 The boolean value returned will be <code>true</code> iff the null
1158 hypothesis can be rejected with confidence <code>1 - alpha</code>.
1159 </dd>
1160 <br />
1161 </br><dt><strong>One-Way Anova tests</strong></dt>
1162 <br />
1163 </br><dd>To conduct a One-Way Analysis of Variance (ANOVA) to evaluate the
1164 null hypothesis that the means of a collection of univariate datasets
1165 are the same, start by loading the datasets into a collection, e.g.
1166 <div class="source"><pre>
1167 double[] classA =
1168 {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 };
1169 double[] classB =
1170 {99.0, 92.0, 102.0, 100.0, 102.0, 89.0 };
1171 double[] classC =
1172 {110.0, 115.0, 111.0, 117.0, 128.0, 117.0 };
1173 List&lt;double[]&gt; classes = new ArrayList&lt;double[]&gt;();
1174 classes.add(classA);
1175 classes.add(classB);
1176 classes.add(classC);
1177 </pre>
1178 </div>
1179
1180 Then you can compute ANOVA F- or p-values associated with the
1181 null hypothesis that the class means are all the same
1182 using a <code>OneWayAnova</code> instance or <code>TestUtils</code>
1183 methods:
1184 <div class="source"><pre>
1185 double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value
1186 double pValue = TestUtils.oneWayAnovaPValue(classes); // P-value
1187 </pre>
1188 </div>
1189
1190 To test perform a One-Way Anova test with signficance level set at 0.01
1191 (so the test will, assuming assumptions are met, reject the null
1192 hypothesis incorrectly only about one in 100 times), use
1193 <div class="source"><pre>
1194 TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean
1195 // true means reject null hypothesis
1196 </pre>
1197 </div>
1198 </dd>
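<dd>
<p>For orientation, the F-value of a one-way ANOVA can be sketched in plain Java with the
standard formula: the between-group mean square divided by the within-group mean square. The
class <code>AnovaSketch</code> is hypothetical and is not the library's implementation; it takes
the groups as a <code>double[][]</code> rather than a collection.</p>
<div class="source"><pre>
```java
/** Illustrative sketch of the one-way ANOVA F-value (hypothetical class, not library code). */
final class AnovaSketch {

    /** F = (SSB / (k - 1)) / (SSW / (N - k)) for k groups with N values in total. */
    static double fValue(double[][] groups) {
        int total = 0;
        double grandSum = 0.0;
        for (double[] g : groups) {
            for (double v : g) {
                grandSum += v;
                total++;
            }
        }
        double grandMean = grandSum / total;
        double ssBetween = 0.0;
        double ssWithin = 0.0;
        for (double[] g : groups) {
            double sum = 0.0;
            for (double v : g) {
                sum += v;
            }
            double mean = sum / g.length;
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            for (double v : g) {
                ssWithin += (v - mean) * (v - mean);
            }
        }
        int dfBetween = groups.length - 1;
        int dfWithin = total - groups.length;
        return (ssBetween / dfBetween) / (ssWithin / dfWithin);
    }

    public static void main(String[] args) {
        double[][] classes = {
            {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0},
            {99.0, 92.0, 102.0, 100.0, 102.0, 89.0},
            {110.0, 115.0, 111.0, 117.0, 128.0, 117.0}
        };
        System.out.println(fValue(classes));
    }
}
```
</pre>
</div>
<p>A large F-value means the group means differ by more than the within-group scatter would
suggest; the p-value is then read from the F distribution with <code>k - 1</code> and
<code>N - k</code> degrees of freedom.</p>
</dd>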
1199 </dl>
1200 </p>
1201 </div>
1202 </div>
1203
1204 </div>
1205 </div>
1206 <div class="clear">
1207 <hr/>
1208 </div>
1209 <div id="footer">
1210 <div class="xright">&#169;
1211 2003-2010
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221 </div>
1222 <div class="clear">
1223 <hr/>
1224 </div>
1225 </div>
1226 </body>
1227 </html>