Mercurial > hg > de.mpg.mpiwg.itgroup.digilib.core
comparison libs/commons-math-2.1/docs/userguide/stat.html @ 10:5f2c5fb36e93
commons-math-2.1 added
author | dwinter |
---|---|
date | Tue, 04 Jan 2011 10:00:53 +0100 |
parents | |
children |
9:e63a64652f4d | 10:5f2c5fb36e93 |
---|---|
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 <html xmlns="http://www.w3.org/1999/xhtml"> | |
14 <head> | |
15 <title>Math - The Commons Math User Guide - Statistics</title> | |
16 <style type="text/css" media="all"> | |
17 @import url("../css/maven-base.css"); | |
18 @import url("../css/maven-theme.css"); | |
19 @import url("../css/site.css"); | |
20 </style> | |
21 <link rel="stylesheet" href="../css/print.css" type="text/css" media="print" /> | |
22 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> | |
23 </head> | |
24 <body class="composite"> | |
25 <div id="banner"> | |
26 <span id="bannerLeft"> | |
27 | |
28 Commons Math User Guide | |
29 | |
30 </span> | |
31 <div class="clear"> | |
32 <hr/> | |
33 </div> | |
34 </div> | |
35 <div id="breadcrumbs"> | |
36 | |
37 | |
38 | |
39 | |
40 | |
41 | |
42 | |
43 | |
44 <div class="xright"> | |
45 | |
46 | |
47 | |
48 | |
49 | |
50 | |
51 | |
52 </div> | |
53 <div class="clear"> | |
54 <hr/> | |
55 </div> | |
56 </div> | |
57 <div id="leftColumn"> | |
58 <div id="navcolumn"> | |
59 | |
60 | |
61 | |
62 | |
63 | |
64 | |
65 | |
66 | |
67 <h5>User Guide</h5> | |
68 <ul> | |
69 | |
70 <li class="none"> | |
71 <a href="../userguide/index.html">Contents</a> | |
72 </li> | |
73 | |
74 <li class="none"> | |
75 <a href="../userguide/overview.html">Overview</a> | |
76 </li> | |
77 | |
78 <li class="none"> | |
79 <strong>Statistics</strong> | |
80 </li> | |
81 | |
82 <li class="none"> | |
83 <a href="../userguide/random.html">Data Generation</a> | |
84 </li> | |
85 | |
86 <li class="none"> | |
87 <a href="../userguide/linear.html">Linear Algebra</a> | |
88 </li> | |
89 | |
90 <li class="none"> | |
91 <a href="../userguide/analysis.html">Numerical Analysis</a> | |
92 </li> | |
93 | |
94 <li class="none"> | |
95 <a href="../userguide/special.html">Special Functions</a> | |
96 </li> | |
97 | |
98 <li class="none"> | |
99 <a href="../userguide/utilities.html">Utilities</a> | |
100 </li> | |
101 | |
102 <li class="none"> | |
103 <a href="../userguide/complex.html">Complex Numbers</a> | |
104 </li> | |
105 | |
106 <li class="none"> | |
107 <a href="../userguide/distribution.html">Distributions</a> | |
108 </li> | |
109 | |
110 <li class="none"> | |
111 <a href="../userguide/fraction.html">Fractions</a> | |
112 </li> | |
113 | |
114 <li class="none"> | |
115 <a href="../userguide/transform.html">Transform Methods</a> | |
116 </li> | |
117 | |
118 <li class="none"> | |
119 <a href="../userguide/geometry.html">3D Geometry</a> | |
120 </li> | |
121 | |
122 <li class="none"> | |
123 <a href="../userguide/optimization.html">Optimization</a> | |
124 </li> | |
125 | |
126 <li class="none"> | |
127 <a href="../userguide/ode.html">Ordinary Differential Equations</a> | |
128 </li> | |
129 | |
130 <li class="none"> | |
131 <a href="../userguide/genetics.html">Genetic Algorithms</a> | |
132 </li> | |
133 </ul> | |
134 <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy"> | |
135 <img alt="Built by Maven" src="../images/logos/maven-feather.png"></img> | |
136 </a> | |
137 | |
138 | |
139 | |
140 | |
141 | |
142 | |
143 | |
144 | |
145 </div> | |
146 </div> | |
147 <div id="bodyColumn"> | |
148 <div id="contentBox"> | |
149 <div class="section"><h2><a name="a1_Statistics"></a>1 Statistics</h2> | |
150 <div class="section"><h3><a name="a1.1_Overview"></a>1.1 Overview</h3> | |
151 <p> | |
152 The statistics package provides frameworks and implementations for | |
153 basic Descriptive statistics, frequency distributions, bivariate regression, | |
154 and t-, chi-square and ANOVA test statistics. | |
155 </p> | |
156 <p><a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br /> | |
157 </br><a href="#a1.3_Frequency_distributions">Frequency distributions</a><br /> | |
158 </br><a href="#a1.4_Simple_regression">Simple Regression</a><br /> | |
159 </br><a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br /> | |
160 </br><a href="#a1.6_Rank_transformations">Rank transformations</a><br /> | |
161 </br><a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br /> | |
162 </br><a href="#a1.8_Statistical_tests">Statistical Tests</a><br /> | |
163 </br></p> | |
164 </div> | |
165 <div class="section"><h3><a name="a1.2_Descriptive_statistics"></a>1.2 Descriptive statistics</h3> | |
166 <p> | |
167 The stat package includes a framework and default implementations for | |
168 the following Descriptive statistics: | |
169 <ul><li>arithmetic and geometric means</li> | |
170 <li>variance and standard deviation</li> | |
171 <li>sum, product, log sum, sum of squared values</li> | |
172 <li>minimum, maximum, median, and percentiles</li> | |
173 <li>skewness and kurtosis</li> | |
174 <li>first, second, third and fourth moments</li> | |
175 </ul> | |
176 </p> | |
177 <p> | |
178 With the exception of percentiles and the median, all of these | |
179 statistics can be computed without maintaining the full list of input | |
180 data values in memory. The stat package provides interfaces and | |
181 implementations that do not require value storage as well as | |
182 implementations that operate on arrays of stored values. | |
183 </p> | |
184 <p> | |
185 The top level interface is | |
186 <a href="../apidocs/org/apache/commons/math/stat/descriptive/UnivariateStatistic.html"> | |
187 org.apache.commons.math.stat.descriptive.UnivariateStatistic.</a> | |
188 This interface, implemented by all statistics, consists of | |
189 <code>evaluate()</code> methods that take double[] arrays as arguments | |
190 and return the value of the statistic. This interface is extended by | |
191 <a href="../apidocs/org/apache/commons/math/stat/descriptive/StorelessUnivariateStatistic.html"> | |
192 StorelessUnivariateStatistic</a>, which adds <code>increment()</code>, <code>getResult()</code> and associated methods to support | |
193 "storageless" implementations that maintain counters, sums or other | |
194 state information as values are added using the <code>increment()</code> | |
195 method. | |
196 </p> | |
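<p>To make the idea of a "storageless" statistic concrete, here is a minimal, self-contained sketch of a running mean and variance that are updated one value at a time. The class <code>RunningStats</code> and its method names are hypothetical; they only mirror the <code>increment()</code> / <code>getResult()</code> pattern described above. This is not the library's implementation, and the use of Welford's update is an assumption made for illustration.</p>

```java
/** Hypothetical illustration of a storeless statistic; not part of Commons Math. */
final class RunningStats {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Updates the internal state with one more value (Welford's update). */
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    /** Returns the mean, or NaN if no values have been added. */
    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Returns the bias-corrected variance: NaN for no values, 0 for one value. */
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        return n == 1 ? 0.0 : m2 / (n - 1);
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double v : new double[] {1.0, 2.0, 3.0, 4.0}) {
            stats.increment(v);
        }
        System.out.println(stats.getMean());
        System.out.println(stats.getVariance());
    }
}
```

<p>Only three numbers are kept in memory regardless of how many values are added, which is why statistics of this kind do not need the full input array.</p>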
197 <p> | |
198 Abstract implementations of the top level interfaces are provided in | |
199 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractUnivariateStatistic.html"> | |
200 AbstractUnivariateStatistic</a> and | |
201 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractStorelessUnivariateStatistic.html"> | |
202 AbstractStorelessUnivariateStatistic</a> respectively. | |
203 </p> | |
204 <p> | |
205 Each statistic is implemented as a separate class, in one of the | |
206 subpackages (moment, rank, summary) and each extends one of the abstract | |
207 classes above (depending on whether or not value storage is required to | |
208 compute the statistic). There are several ways to instantiate and use statistics. | |
209 Statistics can be instantiated and used directly, but it is generally more convenient | |
210 (and efficient) to access them using the provided aggregates, | |
211 <a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html"> | |
212 DescriptiveStatistics</a> and | |
213 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html"> | |
214 SummaryStatistics.</a></p> | |
215 <p><code>DescriptiveStatistics</code> maintains the input data in memory | |
216 and has the capability of producing "rolling" statistics computed from a | |
217 "window" consisting of the most recently added values. | |
218 </p> | |
219 <p><code>SummaryStatistics</code> does not store the input data values | |
220 in memory, so the statistics included in this aggregate are limited to those | |
221 that can be computed in one pass through the data without access to | |
222 the full array of values. | |
223 </p> | |
224 <p><table class="bodyTable"><tr class="a"><th>Aggregate</th> | |
225 <th>Statistics Included</th> | |
226 <th>Values stored?</th> | |
227 <th>"Rolling" capability?</th> | |
228 </tr> | |
229 <tr class="b"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html"> | |
230 DescriptiveStatistics</a></td> | |
231 <td>min, max, mean, geometric mean, n, | |
232 sum, sum of squares, standard deviation, variance, percentiles, skewness, | |
233 kurtosis, median</td> | |
234 <td>Yes</td> | |
235 <td>Yes</td> | |
236 </tr> | |
237 <tr class="a"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html"> | |
238 SummaryStatistics</a></td> | |
239 <td>min, max, mean, geometric mean, n, | |
240 sum, sum of squares, standard deviation, variance</td> | |
241 <td>No</td> | |
242 <td>No</td> | |
243 </tr> | |
244 </table> | |
245 </p> | |
246 <p><code>SummaryStatistics</code> can be aggregated using | |
247 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html"> | |
248 AggregateSummaryStatistics.</a> This class can be used to concurrently gather statistics for multiple | |
249 datasets as well as for a combined sample including all of the data. | |
250 </p> | |
251 <p><code>MultivariateSummaryStatistics</code> is similar to <code>SummaryStatistics</code> | |
252 but handles n-tuple values instead of scalar values. It can also compute the | |
253 full covariance matrix for the input data. | |
254 </p> | |
255 <p> | |
256 Neither <code>DescriptiveStatistics</code> nor <code>SummaryStatistics</code> is | |
257 thread-safe. <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedDescriptiveStatistics.html"> | |
258 SynchronizedDescriptiveStatistics</a> and | |
259 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedSummaryStatistics.html"> | |
260 SynchronizedSummaryStatistics</a>, respectively, provide thread-safe versions for applications that | |
261 require concurrent access to statistical aggregates by multiple threads. | |
262 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedMultiVariateSummaryStatistics.html"> | |
263 SynchronizedMultivariateSummaryStatistics</a> provides a thread-safe version of <code>MultivariateSummaryStatistics</code>.</p> | |
264 <p> | |
265 There is also a utility class, | |
266 <a href="../apidocs/org/apache/commons/math/stat/StatUtils.html"> | |
267 StatUtils</a>, that provides static methods for computing statistics | |
268 directly from double[] arrays. | |
269 </p> | |
270 <p> | |
271 Here are some examples showing how to compute Descriptive statistics. | |
272 <dl><dt>Compute summary statistics for a list of double values</dt> | |
273 <br /> | |
274 </br><dd>Using the <code>DescriptiveStatistics</code> aggregate | |
275 (values are stored in memory): | |
276 <div class="source"><pre> | |
277 // Get a DescriptiveStatistics instance | |
278 DescriptiveStatistics stats = new DescriptiveStatistics(); | |
279 | |
280 // Add the data from the array | |
281 for( int i = 0; i < inputArray.length; i++) { | |
282 stats.addValue(inputArray[i]); | |
283 } | |
284 | |
285 // Compute some statistics | |
286 double mean = stats.getMean(); | |
287 double std = stats.getStandardDeviation(); | |
288 double median = stats.getMedian(); | |
289 </pre> | |
290 </div> | |
291 </dd> | |
292 <dd>Using the <code>SummaryStatistics</code> aggregate (values are | |
293 <strong>not</strong> stored in memory): | |
294 <div class="source"><pre> | |
295 // Get a SummaryStatistics instance | |
296 SummaryStatistics stats = new SummaryStatistics(); | |
297 | |
298 // Read data from an input stream, | |
299 // adding values and updating sums, counters, etc. | |
300 String line; | |
301 while ((line = in.readLine()) != null) { | |
302 stats.addValue(Double.parseDouble(line.trim())); | |
303 } | |
304 in.close(); | |
305 | |
306 // Compute the statistics | |
307 double mean = stats.getMean(); | |
308 double std = stats.getStandardDeviation(); | |
309 //double median = stats.getMedian(); <-- NOT AVAILABLE | |
310 </pre> | |
311 </div> | |
312 </dd> | |
313 <dd>Using the <code>StatUtils</code> utility class: | |
314 <div class="source"><pre> | |
315 // Compute statistics directly from the array | |
316 // assume values is a double[] array | |
317 double mean = StatUtils.mean(values); | |
318 double std = Math.sqrt(StatUtils.variance(values)); | |
319 double median = StatUtils.percentile(values, 50); | |
320 | |
321 // Compute the mean of the first three values in the array | |
322 mean = StatUtils.mean(values, 0, 3); | |
323 </pre> | |
324 </div> | |
325 </dd> | |
326 <dt>Maintain a "rolling mean" of the most recent 100 values from | |
327 an input stream</dt> | |
328 <br /> | |
329 </br><dd>Use a <code>DescriptiveStatistics</code> instance with | |
330 window size set to 100 | |
331 <div class="source"><pre> | |
332 // Create a DescriptiveStats instance and set the window size to 100 | |
333 DescriptiveStatistics stats = new DescriptiveStatistics(); | |
334 stats.setWindowSize(100); | |
335 | |
336 // Read data from an input stream, | |
337 // displaying the mean of the most recent 100 observations | |
338 // after every 100 observations | |
339 long nLines = 0; | |
340 String line; | |
341 while ((line = in.readLine()) != null) { | |
342 stats.addValue(Double.parseDouble(line.trim())); | |
343 if (++nLines == 100) { | |
344 nLines = 0; | |
345 System.out.println(stats.getMean()); | |
346 } | |
347 } | |
348 in.close(); | |
349 </pre> | |
350 </div> | |
351 </dd> | |
352 <dt>Compute statistics in a thread-safe manner</dt> | |
353 <br /> | |
354 <dd>Use a <code>SynchronizedDescriptiveStatistics</code> instance | |
355 <div class="source"><pre> | |
356 // Create a SynchronizedDescriptiveStatistics instance and | |
357 // use as any other DescriptiveStatistics instance | |
358 DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics(); | |
359 </pre> | |
360 </div> | |
361 </dd> | |
362 <dt>Compute statistics for multiple samples and overall statistics concurrently</dt> | |
363 <br /> | |
364 <dd>There are two ways to do this using <code>AggregateSummaryStatistics.</code> | |
365 The first is to use an <code>AggregateSummaryStatistics</code> instance to accumulate | |
366 overall statistics contributed by <code>SummaryStatistics</code> instances created using | |
367 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#createContributingStatistics()"> | |
368 AggregateSummaryStatistics.createContributingStatistics()</a>: | |
369 <div class="source"><pre> | |
370 // Create an AggregateSummaryStatistics instance to accumulate the overall statistics | |
371 // and contributing SummaryStatistics instances for the subsamples | |
372 AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics(); | |
373 SummaryStatistics setOneStats = aggregate.createContributingStatistics(); | |
374 SummaryStatistics setTwoStats = aggregate.createContributingStatistics(); | |
375 // Add values to the subsample aggregates | |
376 setOneStats.addValue(2); | |
377 setOneStats.addValue(3); | |
378 setTwoStats.addValue(2); | |
379 setTwoStats.addValue(4); | |
380 ... | |
381 // Full sample data is reported by the aggregate | |
382 double totalSampleSum = aggregate.getSum(); | |
383 </pre> | |
384 </div> | |
385 | |
386 The above approach has the disadvantages that the <code>addValue</code> calls must be synchronized on the | |
387 <code>SummaryStatistics</code> instance maintained by the aggregate and each value addition updates the | |
388 aggregate as well as the subsample. For applications that can wait to do the aggregation until all values | |
389 have been added, a static | |
390 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#aggregate(java.util.Collection)"> | |
391 aggregate</a> method is available, as shown in the following example. | |
392 This method should be used when aggregation needs to be done across threads. | |
393 <div class="source"><pre> | |
394 // Create SummaryStatistics instances for the subsample data | |
395 SummaryStatistics setOneStats = new SummaryStatistics(); | |
396 SummaryStatistics setTwoStats = new SummaryStatistics(); | |
397 // Add values to the subsample SummaryStatistics instances | |
398 setOneStats.addValue(2); | |
399 setOneStats.addValue(3); | |
400 setTwoStats.addValue(2); | |
401 setTwoStats.addValue(4); | |
402 ... | |
403 // Aggregate the subsample statistics | |
404 Collection<SummaryStatistics> aggregate = new ArrayList<SummaryStatistics>(); | |
405 aggregate.add(setOneStats); | |
406 aggregate.add(setTwoStats); | |
407 StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate); | |
408 | |
409 // Full sample data is reported by aggregatedStats | |
410 double totalSampleSum = aggregatedStats.getSum(); | |
411 </pre> | |
412 </div> | |
413 </dd> | |
414 </dl> | |
415 </p> | |
416 </div> | |
417 <div class="section"><h3><a name="a1.3_Frequency_distributions"></a>1.3 Frequency distributions</h3> | |
418 <p><a href="../apidocs/org/apache/commons/math/stat/Frequency.html"> | |
419 org.apache.commons.math.stat.Frequency</a> | |
420 provides a simple interface for maintaining counts and percentages of discrete | |
421 values. | |
422 </p> | |
423 <p> | |
424 Strings, integers, longs and chars are all supported as value types, | |
425 as well as instances of any class that implements <code>Comparable.</code> | |
426 The ordering of values used in computing cumulative frequencies is by | |
427 default the <i>natural ordering,</i> but this can be overridden by supplying a | |
428 <code>Comparator</code> to the constructor. Adding values that are not | |
429 comparable to those that have already been added results in an | |
430 <code>IllegalArgumentException.</code></p> | |
431 <p> | |
432 Here are some examples. | |
433 <dl><dt>Compute a frequency distribution based on integer values</dt> | |
434 <br /> | |
435 </br><dd>Mixing integers, longs, Integers and Longs: | |
436 <div class="source"><pre> | |
437 Frequency f = new Frequency(); | |
438 f.addValue(1); | |
439 f.addValue(new Integer(1)); | |
440 f.addValue(new Long(1)); | |
441 f.addValue(2); | |
442 f.addValue(new Integer(-1)); | |
443 System.out.println(f.getCount(1)); // displays 3 | |
444 System.out.println(f.getCumPct(0)); // displays 0.2 | |
445 System.out.println(f.getPct(new Integer(1))); // displays 0.6 | |
446 System.out.println(f.getCumPct(-2)); // displays 0 | |
447 System.out.println(f.getCumPct(10)); // displays 1 | |
448 </pre> | |
449 </div> | |
450 </dd> | |
451 <dt>Count string frequencies</dt> | |
452 <br /> | |
453 </br><dd>Using case-sensitive comparison, alpha sort order (natural comparator): | |
454 <div class="source"><pre> | |
455 Frequency f = new Frequency(); | |
456 f.addValue("one"); | |
457 f.addValue("One"); | |
458 f.addValue("oNe"); | |
459 f.addValue("Z"); | |
460 System.out.println(f.getCount("one")); // displays 1 | |
461 System.out.println(f.getCumPct("Z")); // displays 0.5 | |
462 System.out.println(f.getCumPct("Ot")); // displays 0.25 | |
463 </pre> | |
464 </div> | |
465 </dd> | |
466 <dd>Using case-insensitive comparator: | |
467 <div class="source"><pre> | |
468 Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER); | |
469 f.addValue("one"); | |
470 f.addValue("One"); | |
471 f.addValue("oNe"); | |
472 f.addValue("Z"); | |
473 System.out.println(f.getCount("one")); // displays 3 | |
474 System.out.println(f.getCumPct("z")); // displays 1 | |
475 </pre> | |
476 </div> | |
477 </dd> | |
478 </dl> | |
479 </p> | |
480 </div> | |
481 <div class="section"><h3><a name="a1.4_Simple_regression"></a>1.4 Simple regression</h3> | |
482 <p><a href="../apidocs/org/apache/commons/math/stat/regression/SimpleRegression.html"> | |
483 org.apache.commons.math.stat.regression.SimpleRegression</a> | |
484 provides ordinary least squares regression with one independent variable, | |
485 estimating the linear model: | |
486 </p> | |
487 <p><code> y = intercept + slope * x </code></p> | |
488 <p> | |
489 Standard errors for <code>intercept</code> and <code>slope</code> are | |
490 available as well as ANOVA, r-square and Pearson's r statistics. | |
491 </p> | |
492 <p> | |
493 Observations (x,y pairs) can be added to the model one at a time or they | |
494 can be provided in a 2-dimensional array. The observations are not stored | |
495 in memory, so there is no limit to the number of observations that can be | |
496 added to the model. | |
497 </p> | |
498 <p><strong>Usage Notes</strong>: <ul><li> When there are fewer than two observations in the model, or when | |
499 there is no variation in the x values (i.e. all x values are the same) | |
500 all statistics return <code>NaN</code>. At least two observations with | |
501 different x coordinates are required to estimate a bivariate regression | |
502 model.</li> | |
503 <li> getters for the statistics always compute values based on the current | |
504 set of observations -- i.e., you can get statistics, then add more data | |
505 and get updated statistics without using a new instance. There is no | |
506 "compute" method that updates all statistics. Each of the getters performs | |
507 the necessary computations to return the requested statistic.</li> | |
508 </ul> | |
509 </p> | |
510 <p><strong>Implementation Notes</strong>: <ul><li> As observations are added to the model, the sum of x values, y values, | |
511 cross products (x times y), and squared deviations of x and y from their | |
512 respective means are updated using updating formulas defined in | |
513 "Algorithms for Computing the Sample Variance: Analysis and | |
514 Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. | |
515 1983, American Statistician, vol. 37, pp. 242-247, referenced in | |
516 Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985. All regression | |
517 statistics are computed from these sums.</li> | |
518 <li> Inference statistics (confidence intervals, parameter significance levels) | |
519 are based on the assumption that the observations included in the model are | |
520 drawn from a <a href="http://mathworld.wolfram.com/BivariateNormalDistribution.html" class="externalLink"> | |
521 Bivariate Normal Distribution</a></li> | |
522 </ul> | |
523 </p> | |
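<p>The updating scheme described in the implementation notes can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: the class <code>IncrementalRegression</code> is hypothetical, it tracks only the quantities needed for slope and intercept, and it is not the source of <code>SimpleRegression</code>. It shows how the means and the sums of squared deviations and cross products can be kept current without storing the observations.</p>

```java
/** Hypothetical sketch of one-pass simple regression; not the Commons Math source. */
final class IncrementalRegression {
    private long n;        // number of observations added
    private double xbar;   // running mean of x
    private double ybar;   // running mean of y
    private double sumXX;  // sum of squared deviations of x from its mean
    private double sumXY;  // sum of cross products of deviations of x and y

    /** Adds one (x, y) observation and updates the running sums. */
    void addData(double x, double y) {
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double dx = x - xbar;
            double dy = y - ybar;
            double w = (double) n / (n + 1);
            sumXX += dx * dx * w;
            sumXY += dx * dy * w;
            xbar += dx / (n + 1);
            ybar += dy / (n + 1);
        }
        n++;
    }

    /** Returns the slope, or NaN with fewer than two observations or no variation in x. */
    double getSlope() {
        return (n < 2 || sumXX == 0.0) ? Double.NaN : sumXY / sumXX;
    }

    /** Returns the intercept; NaN whenever the slope is NaN. */
    double getIntercept() {
        return ybar - getSlope() * xbar;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11} };
        IncrementalRegression regression = new IncrementalRegression();
        for (double[] point : data) {
            regression.addData(point[0], point[1]);
        }
        System.out.println(regression.getSlope());
        System.out.println(regression.getIntercept());
    }
}
```

<p>Because every getter reads the current sums, statistics requested after further <code>addData</code> calls reflect all observations added so far, which matches the behaviour described in the usage notes.</p>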
524 <p> | |
525 Here are some examples. | |
526 <dl><dt>Estimate a model based on observations added one at a time</dt> | |
527 <br /> | |
528 </br><dd>Instantiate a regression instance and add data points | |
529 <div class="source"><pre> | |
530 SimpleRegression regression = new SimpleRegression(); | |
531 regression.addData(1d, 2d); | |
532 // At this point, with only one observation, | |
533 // all regression statistics will return NaN | |
534 | |
535 regression.addData(3d, 3d); | |
536 // With only two observations, | |
537 // slope and intercept can be computed | |
538 // but inference statistics will return NaN | |
539 | |
540 regression.addData(3d, 3d); | |
541 // Now all statistics are defined. | |
542 </pre> | |
543 </div> | |
544 </dd> | |
545 <dd>Compute some statistics based on observations added so far | |
546 <div class="source"><pre> | |
547 System.out.println(regression.getIntercept()); | |
548 // displays intercept of regression line | |
549 | |
550 System.out.println(regression.getSlope()); | |
551 // displays slope of regression line | |
552 | |
553 System.out.println(regression.getSlopeStdErr()); | |
554 // displays slope standard error | |
555 </pre> | |
556 </div> | |
557 </dd> | |
558 <dd>Use the regression model to predict the y value for a new x value | |
559 <div class="source"><pre> | |
560 System.out.println(regression.predict(1.5d)); | |
561 // displays predicted y value for x = 1.5 | |
562 </pre> | |
563 </div> | |
564 | |
565 More data points can be added and subsequent getXxx calls will incorporate | |
566 additional data in statistics. | |
567 </dd> | |
568 <dt>Estimate a model from a double[][] array of data points</dt> | |
569 <br /> | |
570 </br><dd>Instantiate a regression object and load dataset | |
571 <div class="source"><pre> | |
572 double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }}; | |
573 SimpleRegression regression = new SimpleRegression(); | |
574 regression.addData(data); | |
575 </pre> | |
576 </div> | |
577 </dd> | |
578 <dd>Estimate regression model based on data | |
579 <div class="source"><pre> | |
580 System.out.println(regression.getIntercept()); | |
581 // displays intercept of regression line | |
582 | |
583 System.out.println(regression.getSlope()); | |
584 // displays slope of regression line | |
585 | |
586 System.out.println(regression.getSlopeStdErr()); | |
587 // displays slope standard error | |
588 </pre> | |
589 </div> | |
590 | |
591 More data points -- even another double[][] array -- can be added and subsequent | |
592 getXxx calls will incorporate additional data in statistics. | |
593 </dd> | |
594 </dl> | |
595 </p> | |
596 </div> | |
597 <div class="section"><h3><a name="a1.5_Multiple_linear_regression"></a>1.5 Multiple linear regression</h3> | |
598 <p><a href="../apidocs/org/apache/commons/math/stat/regression/MultipleLinearRegression.html"> | |
599 org.apache.commons.math.stat.regression.MultipleLinearRegression</a> | |
600 provides ordinary least squares regression with a generic multiple variable linear model, which | |
601 in matrix notation can be expressed as: | |
602 </p> | |
603 <p><code> y=X*b+u </code></p> | |
604 <p> | |
605 where y is an <code>n-vector</code> <b>regressand</b>, X is a <code>[n,k]</code> matrix whose <code>k</code> columns are called | |
606 <b>regressors</b>, b is a <code>k-vector</code> of <b>regression parameters</b> and <code>u</code> is an <code>n-vector</code> | |
607 of <b>error terms</b> or <b>residuals</b>. The notation is standard in the literature; | |
608 see, e.g., <a href="http://www.econ.queensu.ca/ETM" class="externalLink">Davidson and MacKinnon, Econometric Theory and Methods, 2004</a>. | |
609 </p> | |
610 <p> | |
611 Two implementations are provided: <a href="../apidocs/org/apache/commons/math/stat/regression/OLSMultipleLinearRegression.html"> | |
612 org.apache.commons.math.stat.regression.OLSMultipleLinearRegression</a> and | |
613 <a href="../apidocs/org/apache/commons/math/stat/regression/GLSMultipleLinearRegression.html"> | |
614 org.apache.commons.math.stat.regression.GLSMultipleLinearRegression</a></p> | |
615 <p> | |
616 Observations (x,y and covariance data matrices) can be added to the model via the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method. | |
617 The observations are stored in memory until the next time the addData method is invoked. | |
618 </p> | |
619 <p><strong>Usage Notes</strong>: <ul><li> Data is validated when the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method is invoked, and an | |
620 <code>IllegalArgumentException</code> is thrown if the data is invalid. | |
621 </li> | |
622 <li> Only the GLS regression requires the covariance matrix; the OLS regression ignores it, so it can safely be | |
623 passed as <code>null</code>.</li> | |
624 </ul> | |
625 </p> | |
626 <p> | |
627 Here are some examples. | |
628 <dl><dt>OLS regression</dt> | |
629 <br /> | |
630 </br><dd>Instantiate an OLS regression object and load dataset | |
631 <div class="source"><pre> | |
632 MultipleLinearRegression regression = new OLSMultipleLinearRegression(); | |
633 double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; | |
634 double[][] x = new double[6][]; | |
635 x[0] = new double[]{1.0, 0, 0, 0, 0, 0}; | |
636 x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0}; | |
637 x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0}; | |
638 x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0}; | |
639 x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0}; | |
640 x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0}; | |
641 regression.addData(y, x, null); // we don't need covariance | |
642 </pre> | |
643 </div> | |
644 </dd> | |
645 <dd>Estimate of regression values honours the <code>MultipleLinearRegression</code> interface: | |
646 <div class="source"><pre> | |
647 double[] beta = regression.estimateRegressionParameters(); | |
648 | |
649 double[] residuals = regression.estimateResiduals(); | |
650 | |
651 double[][] parametersVariance = regression.estimateRegressionParametersVariance(); | |
652 | |
653 double regressandVariance = regression.estimateRegressandVariance(); | |
654 </pre> | |
655 </div> | |
656 </dd> | |
657 <dt>GLS regression</dt> | |
658 <br /> | |
659 </br><dd>Instantiate a GLS regression object and load dataset | |
660 <div class="source"><pre> | |
661 MultipleLinearRegression regression = new GLSMultipleLinearRegression(); | |
662 double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; | |
663 double[][] x = new double[6][]; | |
664 x[0] = new double[]{1.0, 0, 0, 0, 0, 0}; | |
665 x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0}; | |
666 x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0}; | |
667 x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0}; | |
668 x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0}; | |
669 x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0}; | |
670 double[][] omega = new double[6][]; | |
671 omega[0] = new double[]{1.1, 0, 0, 0, 0, 0}; | |
672 omega[1] = new double[]{0, 2.2, 0, 0, 0, 0}; | |
673 omega[2] = new double[]{0, 0, 3.3, 0, 0, 0}; | |
674 omega[3] = new double[]{0, 0, 0, 4.4, 0, 0}; | |
675 omega[4] = new double[]{0, 0, 0, 0, 5.5, 0}; | |
676 omega[5] = new double[]{0, 0, 0, 0, 0, 6.6}; | |
677 regression.addData(y, x, omega); // we do need covariance | |
678 </pre> | |
679 </div> | |
680 </dd> | |
681 <dd>Estimate of regression values honours the same <code>MultipleLinearRegression</code> interface as | |
682 the OLS regression. | |
683 </dd> | |
684 </dl> | |
685 </p> | |
686 </div> | |
687 <div class="section"><h3><a name="a1.6_Rank_transformations"></a>1.6 Rank transformations</h3> | |
688 <p> | |
689 Some statistical algorithms require that input data be replaced by ranks. | |
690 The <a href="../apidocs/org/apache/commons/math/stat/ranking/package-summary.html"> | |
691 org.apache.commons.math.stat.ranking</a> package provides rank transformation. | |
692 <a href="../apidocs/org/apache/commons/math/stat/ranking/RankingAlgorithm.html"> | |
693 RankingAlgorithm</a> defines the interface for ranking. | |
694 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html"> | |
695 NaturalRanking</a> provides an implementation that has two configuration options. | |
696 <ul><li><a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html"> | |
697 Ties strategy</a> determines how ties in the source data are handled by the ranking</li> | |
698 <li><a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html"> | |
699 NaN strategy</a> determines how NaN values in the source data are handled.</li> | |
700 </ul> | |
701 </p> | |
702 <p> | |
703 Examples: | |
704 <div class="source"><pre> | |
705 NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL, | |
706 TiesStrategy.MAXIMUM); | |
707 double[] exampleData = { 20, 17, 30, 42.3, 17, 50, | |
708 Double.NaN, Double.NEGATIVE_INFINITY, 17 }; | |
709 double[] ranks = ranking.rank(exampleData); | |
710 </pre> | |
711 </div> | |
712 | |
713 results in <code>ranks</code> containing <code>{6, 5, 7, 8, 5, 9, 2, 2, 5}.</code><div class="source"><pre> | |
714 new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData); | |
715 </pre> | |
716 </div> | |
717 | |
718 returns <code>{5, 2, 6, 7, 3, 8, 1, 4}.</code></p> | |
719 <p> | |
720 The default <code>NaNStrategy</code> is <code>NaNStrategy.MAXIMAL</code>. This makes <code>NaN</code> | |
721 values larger than any other value (including <code>Double.POSITIVE_INFINITY</code>). The | |
722 default <code>TiesStrategy</code> is <code>TiesStrategy.AVERAGE,</code> which assigns tied | |
723 values the average of the ranks applicable to the sequence of ties. See the | |
724 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html"> | |
725 NaturalRanking</a> for more examples and <a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html"> | |
726 TiesStrategy</a> and <a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html">NaNStrategy</a> | |
727 for details on these configuration options. | |
728 </p> | |
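<p>
The effect of average-rank tie handling can be sketched in plain Java. This is a minimal illustration only, not the <code>NaturalRanking</code> implementation: the class and method names are hypothetical, and the sketch assumes the input contains no <code>NaN</code> values.
</p>

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Commons Math.
class AverageRankSketch {

    /** Returns 1-based ascending ranks; tied values share the average of their ranks. Assumes no NaN. */
    static double[] rank(double[] data) {
        int n = data.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the indices by the values they point to.
        Arrays.sort(order, (a, b) -> Double.compare(data[a], data[b]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            // Find the end of the run of values equal to data[order[start]].
            int end = start;
            while (end + 1 < n && data[order[end + 1]] == data[order[start]]) {
                end++;
            }
            // Positions start..end correspond to ranks start+1..end+1.
            double average = (start + 1 + end + 1) / 2.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = average;
            }
            start = end + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] data = { 20, 17, 30, 42.3, 17, 50, Double.NEGATIVE_INFINITY, 17 };
        // prints [5.0, 3.0, 6.0, 7.0, 3.0, 8.0, 1.0, 3.0]
        System.out.println(Arrays.toString(rank(data)));
    }
}
```

<p>
The three values equal to 17 occupy ranks 2, 3 and 4, so each receives the average rank 3.
</p>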
729 </div> | |
730 <div class="section"><h3><a name="a1.7_Covariance_and_correlation"></a>1.7 Covariance and correlation</h3> | |
731 <p> | |
732 The <a href="../apidocs/org/apache/commons/math/stat/correlation/package-summary.html"> | |
733 org.apache.commons.math.stat.correlation</a> package computes covariances | |
734 and correlations for pairs of arrays or columns of a matrix. | |
735 <a href="../apidocs/org/apache/commons/math/stat/correlation/Covariance.html"> | |
736 Covariance</a> computes covariances, | |
737 <a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html"> | |
738 PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients and | |
739 <a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html"> | |
740 SpearmansCorrelation</a> computes Spearman's rank correlation. | |
741 </p> | |
742 <p><strong>Implementation Notes</strong><ul><li> | |
743 Unbiased covariances are given by the formula <br /> | |
744 </br><code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code> | |
745 where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code> | |
746 is the mean of the <code>Y</code> values. Non-bias-corrected estimates use | |
747 <code>n</code> in place of <code>n - 1.</code> Whether or not covariances are | |
748 bias-corrected is determined by the optional parameter, "biasCorrected," which | |
749 defaults to <code>true.</code></li> | |
750 <li><a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html"> | |
751 PearsonsCorrelation</a> computes correlations defined by the formula <br /> | |
752 </br><code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br /> | |
753 | |
754 where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code> | |
755 and <code>s(X)</code>, <code>s(Y)</code> are standard deviations. | |
756 </li> | |
757 <li><a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html"> | |
758 SpearmansCorrelation</a> applies a rank transformation to the input data and computes Pearson's | |
759 correlation on the ranked data. The ranking algorithm is configurable. By default, | |
760 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html"> | |
761 NaturalRanking</a> with default strategies for handling ties and NaN values is used. | |
762 </li> | |
763 </ul> | |
764 </p> | |
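<p>
The covariance and Pearson correlation formulas above can be written out directly. The following is a hedged sketch in plain Java with hypothetical class and method names; it mirrors the formulas in the notes and is not the library's own code.
</p>

```java
// Hypothetical illustration class; not part of Commons Math.
class CorrelationFormulaSketch {

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** cov(X, Y) = sum[(x_i - E(X))(y_i - E(Y))] / (n - 1), or / n when not bias-corrected. */
    static double covariance(double[] x, double[] y, boolean biasCorrected) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("arrays must have equal length of at least 2");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (biasCorrected ? x.length - 1 : x.length);
    }

    /** cor(X, Y) = cov(X, Y) / (s(X) s(Y)), using bias-corrected estimates throughout. */
    static double correlation(double[] x, double[] y) {
        double sx = Math.sqrt(covariance(x, x, true));
        double sy = Math.sqrt(covariance(y, y, true));
        return covariance(x, y, true) / (sx * sy);
    }

    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4, 5 };
        double[] y = { 2, 4, 6, 8, 10 };
        System.out.println(covariance(x, y, true));   // unbiased estimate: 20 / 4
        System.out.println(covariance(x, y, false));  // non-bias-corrected: 20 / 5
        System.out.println(correlation(x, y));        // perfectly linear data
    }
}
```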
765 <p><strong>Examples:</strong><dl><dt><strong>Covariance of 2 arrays</strong></dt> | |
766 <br /> | |
767 </br><dd>To compute the unbiased covariance between 2 double arrays, | |
768 <code>x</code> and <code>y</code>, use: | |
769 <div class="source"><pre> | |
770 new Covariance().covariance(x, y) | |
771 </pre> | |
772 </div> | |
773 | |
774 For non-bias-corrected covariances, use | |
775 <div class="source"><pre> | |
776 new Covariance().covariance(x, y, false) | |
777 </pre> | |
778 </div> | |
779 </dd> | |
780 <br /> | |
781 </br><dt><strong>Covariance matrix</strong></dt> | |
782 <br /> | |
783 </br><dd> A covariance matrix over the columns of a source matrix <code>data</code> | |
784 can be computed using | |
785 <div class="source"><pre> | |
786 new Covariance().computeCovarianceMatrix(data) | |
787 </pre> | |
788 </div> | |
789 | |
790 The i-jth entry of the returned matrix is the unbiased covariance of the ith and jth | |
791 columns of <code>data.</code> As above, to get non-bias-corrected covariances, | |
792 use | |
793 <div class="source"><pre> | |
794 computeCovarianceMatrix(data, false) | |
795 </pre> | |
796 </div> | |
797 </dd> | |
798 <br /> | |
799 </br><dt><strong>Pearson's correlation of 2 arrays</strong></dt> | |
800 <br /> | |
801 </br><dd>To compute the Pearson's product-moment correlation between two double arrays | |
802 <code>x</code> and <code>y</code>, use: | |
803 <div class="source"><pre> | |
804 new PearsonsCorrelation().correlation(x, y) | |
805 </pre> | |
806 </div> | |
807 </dd> | |
808 <br /> | |
809 </br><dt><strong>Pearson's correlation matrix</strong></dt> | |
810 <br /> | |
811 </br><dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code> | |
812 can be computed using | |
813 <div class="source"><pre> | |
814 new PearsonsCorrelation().computeCorrelationMatrix(data) | |
815 </pre> | |
816 </div> | |
817 | |
818 The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the | |
819 ith and jth columns of <code>data.</code></dd> | |
820 <br /> | |
821 </br><dt><strong>Pearson's correlation significance and standard errors</strong></dt> | |
822 <br /> | |
823 </br><dd> To compute standard errors and/or significance levels for | |
824 Pearson's correlation coefficients, start by creating a | |
825 <code>PearsonsCorrelation</code> instance | |
826 <div class="source"><pre> | |
827 PearsonsCorrelation correlation = new PearsonsCorrelation(data); | |
828 </pre> | |
829 </div> | |
830 | |
831 where <code>data</code> is either a rectangular array or a <code>RealMatrix.</code> | |
832 Then the matrix of standard errors is | |
833 <div class="source"><pre> | |
834 correlation.getCorrelationStandardErrors(); | |
835 </pre> | |
836 </div> | |
837 | |
838 The formula used to compute the standard error is <br /> | |
839 <code>SE<sub>r</sub> = ((1 - r<sup>2</sup>) / (n - 2))<sup>1/2</sup></code><br /> | |
840 | |
841 where <code>r</code> is the estimated correlation coefficient and | |
842 <code>n</code> is the number of observations in the source dataset.<br /> | |
843 <br /> | |
844 <strong>p-values</strong> for the (2-sided) null hypotheses that elements of | |
845 a correlation matrix are zero populate the RealMatrix returned by | |
846 <div class="source"><pre> | |
847 correlation.getCorrelationPValues() | |
848 </pre> | |
849 </div> | |
850 <code>getCorrelationPValues().getEntry(i,j)</code> is the | |
851 probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes | |
852 a value with absolute value greater than or equal to <br /> | |
853 </br><code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>, | |
854 where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth | |
855 columns of the source array or RealMatrix. This is sometimes referred to as the | |
856 <i>significance</i> of the coefficient.<br /> | |
857 <br /> | |
858 | |
859 For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then | |
860 <div class="source"><pre> | |
861 new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1) | |
862 </pre> | |
863 </div> | |
864 | |
865 is the significance of the Pearson's correlation coefficient between the two columns | |
866 of <code>data</code>. If this value is less than .01, we can say that the correlation | |
867 between the two columns of data is significant at the 99% level. | |
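<p>
As a rough check on the formulas above, the standard error and the t-statistic compared against the <code>t<sub>n-2</sub></code> distribution can be computed by hand. The sketch below uses hypothetical names and stops short of the p-value itself, which requires the t distribution's cumulative distribution function.
</p>

```java
// Hypothetical illustration class; not part of Commons Math.
class CorrelationSignificanceSketch {

    /** SE_r = ((1 - r^2) / (n - 2))^(1/2) */
    static double standardError(double r, int n) {
        return Math.sqrt((1 - r * r) / (n - 2));
    }

    /** |r| ((n - 2) / (1 - r^2))^(1/2), the value compared against t with n - 2 degrees of freedom. */
    static double tStatistic(double r, int n) {
        return Math.abs(r) * Math.sqrt((n - 2) / (1 - r * r));
    }

    public static void main(String[] args) {
        double r = 0.6; // assumed example correlation
        int n = 10;     // assumed number of observations
        System.out.println(standardError(r, n)); // sqrt(0.64 / 8), about 0.2828
        System.out.println(tStatistic(r, n));    // 0.6 * sqrt(8 / 0.64), about 2.1213
    }
}
```

<p>
Note that the t-statistic equals <code>|r| / SE<sub>r</sub></code>, so the two quantities carry the same information.
</p>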
868 </dd> | |
869 <br /> | |
870 </br><dt><strong>Spearman's rank correlation coefficient</strong></dt> | |
871 <br /> | |
872 </br><dd>To compute Spearman's rank correlation between two double arrays | |
873 <code>x</code> and <code>y</code>: | |
874 <div class="source"><pre> | |
875 new SpearmansCorrelation().correlation(x, y) | |
876 </pre> | |
877 </div> | |
878 | |
879 This is equivalent to | |
880 <div class="source"><pre> | |
881 RankingAlgorithm ranking = new NaturalRanking(); | |
882 new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y)) | |
883 </pre> | |
884 </div> | |
885 </dd> | |
886 <br /> | |
887 </br></dl> | |
888 </p> | |
889 </div> | |
890 <div class="section"><h3><a name="a1.8_Statistical_tests"></a>1.8 Statistical tests</h3> | |
891 <p> | |
892 The interfaces and implementations in the | |
893 <a href="../apidocs/org/apache/commons/math/stat/inference/"> | |
894 org.apache.commons.math.stat.inference</a> package provide | |
895 <a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc22.htm" class="externalLink"> | |
896 Student's t</a>, | |
897 <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm" class="externalLink"> | |
898 Chi-Square</a> and | |
899 <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm" class="externalLink"> | |
900 One-Way ANOVA</a> test statistics as well as | |
901 <a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue" class="externalLink"> | |
902 p-values</a> associated with <code>t-</code>, | |
903 <code>Chi-Square</code> and <code>One-Way ANOVA</code> tests. The | |
904 interfaces are | |
905 <a href="../apidocs/org/apache/commons/math/stat/inference/TTest.html"> | |
906 TTest</a>, | |
907 <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTest.html"> | |
908 ChiSquareTest</a>, and | |
909 <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnova.html"> | |
910 OneWayAnova</a> with provided implementations | |
911 <a href="../apidocs/org/apache/commons/math/stat/inference/TTestImpl.html"> | |
912 TTestImpl</a>, | |
913 <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTestImpl.html"> | |
914 ChiSquareTestImpl</a> and | |
915 <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnovaImpl.html"> | |
916 OneWayAnovaImpl</a>, respectively. | |
917 The | |
918 <a href="../apidocs/org/apache/commons/math/stat/inference/TestUtils.html"> | |
919 TestUtils</a> class provides static methods to get test instances or | |
920 to compute test statistics directly. The examples below all use the | |
921 static methods in <code>TestUtils</code> to execute tests. To get | |
922 test object instances, either use e.g., | |
923 <code>TestUtils.getTTest()</code> or use the implementation constructors | |
924 directly, e.g., | |
925 <code>new TTestImpl()</code>. | |
926 </p> | |
927 <p><strong>Implementation Notes</strong><ul><li>Both one- and two-sample t-tests are supported. Two sample tests | |
928 can be either paired or unpaired and the unpaired two-sample tests can | |
929 be conducted under the assumption of equal subpopulation variances or | |
930 without this assumption. When equal variances is assumed, a pooled | |
931 variance estimate is used to compute the t-statistic and the degrees | |
932 of freedom used in the t-test equals the sum of the sample sizes minus 2. | |
933 When equal variances is not assumed, the t-statistic uses both sample | |
934 variances and the | |
935 <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/gifs/nu3.gif" class="externalLink"> | |
936 Welch-Satterthwaite approximation</a> is used to compute the degrees | |
937 of freedom. Methods to return t-statistics and p-values are provided in each | |
938 case, as well as boolean-valued methods to perform fixed significance | |
939 level tests. The names of test or test-statistic methods that assume equal | |
940 subpopulation variances always start with "homoscedastic." Test or | |
941 test-statistic methods that just start with "t" do not assume equal | |
942 variances. See the examples below and the API documentation for | |
943 more details.</li> | |
944 <li>The validity of the p-values returned by the t-test depends on the | |
945 assumptions of the parametric t-test procedure, as discussed | |
946 <a href="http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html" class="externalLink"> | |
947 here</a>.</li> | |
948 <li>p-values returned by t-, chi-square and Anova tests are exact, based | |
949 on numerical approximations to the t-, chi-square and F distributions in the | |
950 <code>distribution</code> package. </li> | |
951 <li>p-values returned by t-tests are for two-sided tests and the boolean-valued | |
952 methods supporting fixed significance level tests assume that the hypotheses | |
953 are two-sided. One-sided tests can be performed by dividing returned p-values | |
954 (resp. critical values) by 2.</li> | |
955 <li>Degrees of freedom for chi-square tests are integral values, based on the | |
956 number of observed or expected counts (number of observed counts - 1) | |
957 for the goodness-of-fit tests and (number of columns -1) * (number of rows - 1) | |
958 for independence tests.</li> | |
959 </ul> | |
960 </p> | |
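<p>
The statistics described in these notes can be sketched in plain Java. The class below is an illustration under stated assumptions, with hypothetical names: it computes the one-sample t-statistic, the unpaired two-sample t-statistic without the equal-variances assumption, and the Welch-Satterthwaite degrees of freedom. It does not compute p-values.
</p>

```java
// Hypothetical illustration class; not part of Commons Math.
class TStatisticSketch {

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Bias-corrected sample variance (divides by n - 1). */
    static double variance(double[] values) {
        double m = mean(values);
        double sum = 0;
        for (double v : values) {
            sum += (v - m) * (v - m);
        }
        return sum / (values.length - 1);
    }

    /** One-sample statistic: (mean - mu) / sqrt(variance / n). */
    static double oneSampleT(double mu, double[] sample) {
        return (mean(sample) - mu) / Math.sqrt(variance(sample) / sample.length);
    }

    /** Two-sample statistic using both sample variances (no equal-variances assumption). */
    static double welchT(double[] s1, double[] s2) {
        double a = variance(s1) / s1.length;
        double b = variance(s2) / s2.length;
        return (mean(s1) - mean(s2)) / Math.sqrt(a + b);
    }

    /** Welch-Satterthwaite approximation to the degrees of freedom. */
    static double welchDegreesOfFreedom(double[] s1, double[] s2) {
        double a = variance(s1) / s1.length;
        double b = variance(s2) / s2.length;
        return (a + b) * (a + b)
                / (a * a / (s1.length - 1) + b * b / (s2.length - 1));
    }

    public static void main(String[] args) {
        double[] observed = { 1d, 2d, 3d };
        double[] other = { 2d, 4d, 6d };
        System.out.println(oneSampleT(2.5d, observed));             // about -0.8660
        System.out.println(welchT(observed, other));                // about -1.5492
        System.out.println(welchDegreesOfFreedom(observed, other)); // 50 / 17, about 2.9412
    }
}
```

<p>
The degrees of freedom are generally not an integer under this approximation, which is why the unequal-variances case differs from the pooled case, where they equal the sum of the sample sizes minus 2.
</p>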
961 <p><strong>Examples:</strong><dl><dt><strong>One-sample <code>t</code> tests</strong></dt> | |
962 <br /> | |
963 </br><dd>To compare the mean of a double[] array to a fixed value: | |
964 <div class="source"><pre> | |
965 double[] observed = {1d, 2d, 3d}; | |
966 double mu = 2.5d; | |
967 System.out.println(TestUtils.t(mu, observed)); | |
968 </pre> | |
969 </div> | |
970 | |
971 The code above will display the t-statistic associated with a one-sample | |
972 t-test comparing the mean of the <code>observed</code> values against | |
973 <code>mu.</code></dd> | |
974 <dd>To compare the mean of a dataset described by a | |
975 <a href="../apidocs/org/apache/commons/math/stat/descriptive/StatisticalSummary.html"> | |
976 org.apache.commons.math.stat.descriptive.StatisticalSummary</a> to a fixed value: | |
977 <div class="source"><pre> | |
978 double[] observed = {1d, 2d, 3d}; | |
979 double mu = 2.5d; | |
980 SummaryStatistics sampleStats = new SummaryStatistics(); | |
981 for (int i = 0; i < observed.length; i++) { | |
982 sampleStats.addValue(observed[i]); | |
983 } | |
984 System.out.println(TestUtils.t(mu, sampleStats)); | |
985 </pre> | |
986 </div> | |
987 </dd> | |
988 <dd>To compute the p-value associated with the null hypothesis that the mean | |
989 of a set of values equals a point estimate, against the two-sided alternative that | |
990 the mean is different from the target value: | |
991 <div class="source"><pre> | |
992 double[] observed = {1d, 2d, 3d}; | |
993 double mu = 2.5d; | |
994 System.out.println(TestUtils.tTest(mu, observed)); | |
995 </pre> | |
996 </div> | |
997 | |
998 The snippet above will display the p-value associated with the null | |
999 hypothesis that the mean of the population from which the | |
1000 <code>observed</code> values are drawn equals <code>mu.</code></dd> | |
1001 <dd>To perform the test using a fixed significance level, use: | |
1002 <div class="source"><pre> | |
1003 TestUtils.tTest(mu, observed, alpha); | |
1004 </pre> | |
1005 </div> | |
1006 | |
1007 where <code>0 < alpha < 0.5</code> is the significance level of | |
1008 the test. The boolean value returned will be <code>true</code> iff the | |
1009 null hypothesis can be rejected with confidence <code>1 - alpha</code>. | |
1010 To test, for example at the 95% level of confidence, use | |
1011 <code>alpha = 0.05</code>.</dd> | |
1012 <br /> | |
1013 </br><dt><strong>Two-Sample t-tests</strong></dt> | |
1014 <br /> | |
1015 </br><dd><strong>Example 1:</strong> Paired test evaluating | |
1016 the null hypothesis that the mean difference between corresponding | |
1017 (paired) elements of the <code>double[]</code> arrays | |
1018 <code>sample1</code> and <code>sample2</code> is zero. | |
1019 | |
1020 To compute the t-statistic: | |
1021 <div class="source"><pre> | |
1022 TestUtils.pairedT(sample1, sample2); | |
1023 </pre> | |
1024 </div> | |
1025 <p> | |
1026 To compute the p-value: | |
1027 <div class="source"><pre> | |
1028 TestUtils.pairedTTest(sample1, sample2); | |
1029 </pre> | |
1030 </div> | |
1031 </p> | |
1032 <p> | |
1033 To perform a fixed significance level test with alpha = .05: | |
1034 <div class="source"><pre> | |
1035 TestUtils.pairedTTest(sample1, sample2, .05); | |
1036 </pre> | |
1037 </div> | |
1038 </p> | |
1039 | |
1040 The last example will return <code>true</code> iff the p-value | |
1041 returned by <code>TestUtils.pairedTTest(sample1, sample2)</code> | |
1042 is less than <code>.05</code>.</dd> | |
1043 <dd><strong>Example 2: </strong> unpaired, two-sided, two-sample t-test using | |
1044 <code>StatisticalSummary</code> instances, without assuming that | |
1045 subpopulation variances are equal. | |
1046 | |
1047 First create the <code>StatisticalSummary</code> instances. Both | |
1048 <code>DescriptiveStatistics</code> and <code>SummaryStatistics</code> | |
1049 implement this interface. Assume that <code>summary1</code> and | |
1050 <code>summary2</code> are <code>SummaryStatistics</code> instances, | |
1051 each of which has had at least 2 values added to the (virtual) dataset that | |
1052 it describes. The sample sizes do not have to be the same -- all that is required | |
1053 is that both samples have at least 2 elements. | |
1054 <p><strong>Note:</strong> The <code>SummaryStatistics</code> class does | |
1055 not store the dataset that it describes in memory, but it does compute all | |
1056 statistics necessary to perform t-tests, so this method can be used to | |
1057 conduct t-tests with very large samples. One-sample tests can also be | |
1058 performed this way. | |
1059 (See <a href="#a1.2_Descriptive_statistics">Descriptive statistics</a> for details | |
1060 on the <code>SummaryStatistics</code> class.) | |
1061 </p> | |
1062 <p> | |
1063 To compute the t-statistic: | |
1064 <div class="source"><pre> | |
1065 TestUtils.t(summary1, summary2); | |
1066 </pre> | |
1067 </div> | |
1068 </p> | |
1069 <p> | |
1070 To compute the p-value: | |
1071 <div class="source"><pre> | |
1072 TestUtils.tTest(summary1, summary2); | |
1073 </pre> | |
1074 </div> | |
1075 </p> | |
1076 <p> | |
1077 To perform a fixed significance level test with alpha = .05: | |
1078 <div class="source"><pre> | |
1079 TestUtils.tTest(summary1, summary2, .05); | |
1080 </pre> | |
1081 </div> | |
1082 </p> | |
1083 <p> | |
1084 In each case above, the test does not assume that the subpopulation | |
1085 variances are equal. To perform the tests under the equal-variances assumption, | |
1086 replace "t" at the beginning of the method name with "homoscedasticT". | |
1087 </p> | |
1088 </dd> | |
1089 <br /> | |
1090 </br><dt><strong>Chi-square tests</strong></dt> | |
1091 <br /> | |
1092 </br><dd>To compute a chi-square statistic measuring the agreement between a | |
1093 <code>long[]</code> array of observed counts and a <code>double[]</code> | |
1094 array of expected counts, use: | |
1095 <div class="source"><pre> | |
1096 long[] observed = {10, 9, 11}; | |
1097 double[] expected = {10.1, 9.8, 10.3}; | |
1098 System.out.println(TestUtils.chiSquare(expected, observed)); | |
1099 </pre> | |
1100 </div> | |
1101 | |
1102 The value displayed will be | |
1103 <code>sum((expected[i] - observed[i])^2 / expected[i])</code></dd> | |
1104 <dd> To get the p-value associated with the null hypothesis that | |
1105 <code>observed</code> conforms to <code>expected</code> use: | |
1106 <div class="source"><pre> | |
1107 TestUtils.chiSquareTest(expected, observed); | |
1108 </pre> | |
1109 </div> | |
1110 </dd> | |
1111 <dd> To test the null hypothesis that <code>observed</code> conforms to | |
1112 <code>expected</code> with <code>alpha</code> significance level | |
1113 (equiv. <code>100 * (1-alpha)%</code> confidence) where <code> | |
1114 0 < alpha < 1 </code> use: | |
1115 <div class="source"><pre> | |
1116 TestUtils.chiSquareTest(expected, observed, alpha); | |
1117 </pre> | |
1118 </div> | |
1119 | |
1120 The boolean value returned will be <code>true</code> iff the null hypothesis | |
1121 can be rejected with confidence <code>1 - alpha</code>. | |
1122 </dd> | |
1123 <dd>To compute a chi-square statistic associated with a | |
1124 <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm" class="externalLink"> | |
1125 chi-square test of independence</a> based on a two-dimensional (long[][]) | |
1126 <code>counts</code> array viewed as a two-way table, use: | |
1127 <div class="source"><pre> | |
1128 TestUtils.chiSquare(counts); | |
1129 </pre> | |
1130 </div> | |
1131 | |
1132 The rows of the 2-way table are | |
1133 <code>counts[0], ... , counts[counts.length - 1]</code>.<br /> | |
1134 </br> | |
1135 The chi-square statistic returned is | |
1136 <code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code> | |
1137 where the sum is taken over all table entries and | |
1138 <code>expected[i][j]</code> is the product of the row and column sums at | |
1139 row <code>i</code>, column <code>j</code> divided by the total count. | |
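<p>
The expected counts and the statistic described above can be worked through by hand. The following plain-Java sketch uses hypothetical names and an assumed 2 x 2 table; it illustrates the formula only and is not the library's implementation.
</p>

```java
// Hypothetical illustration class; not part of Commons Math.
class ChiSquareIndependenceSketch {

    /** sum((counts[i][j] - expected[i][j])^2 / expected[i][j]) over a rectangular two-way table. */
    static double chiSquare(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += counts[i][j];
                colSums[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double statistic = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count under independence: row sum * column sum / total.
                double expected = rowSums[i] * colSums[j] / total;
                double diff = counts[i][j] - expected;
                statistic += diff * diff / expected;
            }
        }
        return statistic;
    }

    public static void main(String[] args) {
        long[][] counts = { { 10, 20 }, { 30, 40 } }; // assumed example table
        // Expected counts are 12, 18, 28, 42, so the statistic is
        // 4/12 + 4/18 + 4/28 + 4/42, about 0.7937.
        System.out.println(chiSquare(counts));
    }
}
```

<p>
For this table the degrees of freedom are (2 - 1) * (2 - 1) = 1.
</p>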
1140 </dd> | |
1141 <dd>To compute the p-value associated with the null hypothesis that | |
1142 the classifications represented by the counts in the columns of the input 2-way | |
1143 table are independent of the rows, use: | |
1144 <div class="source"><pre> | |
1145 TestUtils.chiSquareTest(counts); | |
1146 </pre> | |
1147 </div> | |
1148 </dd> | |
1149 <dd>To perform a chi-square test of independence with <code>alpha</code> | |
1150 significance level (equiv. <code>100 * (1-alpha)%</code> confidence) | |
1151 where <code>0 < alpha < 1 </code> use: | |
1152 <div class="source"><pre> | |
1153 TestUtils.chiSquareTest(counts, alpha); | |
1154 </pre> | |
1155 </div> | |
1156 | |
1157 The boolean value returned will be <code>true</code> iff the null | |
1158 hypothesis can be rejected with confidence <code>1 - alpha</code>. | |
1159 </dd> | |
1160 <br /> | |
1161 </br><dt><strong>One-Way Anova tests</strong></dt> | |
1162 <br /> | |
1163 </br><dd>To conduct a One-Way Analysis of Variance (ANOVA) to evaluate the | |
1164 null hypothesis that the means of a collection of univariate datasets | |
1165 are the same, start by loading the datasets into a collection, e.g. | |
1166 <div class="source"><pre> | |
1167 double[] classA = | |
1168 {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 }; | |
1169 double[] classB = | |
1170 {99.0, 92.0, 102.0, 100.0, 102.0, 89.0 }; | |
1171 double[] classC = | |
1172 {110.0, 115.0, 111.0, 117.0, 128.0, 117.0 }; | |
1173 List classes = new ArrayList(); | |
1174 classes.add(classA); | |
1175 classes.add(classB); | |
1176 classes.add(classC); | |
1177 </pre> | |
1178 </div> | |
1179 | |
1180 Then you can compute ANOVA F- or p-values associated with the | |
1181 null hypothesis that the class means are all the same | |
1182 using a <code>OneWayAnova</code> instance or <code>TestUtils</code> | |
1183 methods: | |
1184 <div class="source"><pre> | |
1185 double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value | |
1186 double pValue = TestUtils.oneWayAnovaPValue(classes); // P-value | |
1187 </pre> | |
1188 </div> | |
1189 | |
1190 To perform a One-Way Anova test with significance level set at 0.01 | |
1191 (so the test will, assuming assumptions are met, reject the null | |
1192 hypothesis incorrectly only about one in 100 times), use | |
1193 <div class="source"><pre> | |
1194 TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean | |
1195 // true means reject null hypothesis | |
1196 </pre> | |
1197 </div> | |
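<p>
For reference, the F-value in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below spells this out in plain Java with hypothetical names and a small assumed dataset; it does not compute the p-value, which requires the F distribution.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Commons Math.
class AnovaFSketch {

    /** F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups and N total observations. */
    static double fValue(List<double[]> groups) {
        int totalCount = 0;
        double totalSum = 0;
        for (double[] group : groups) {
            for (double v : group) {
                totalSum += v;
                totalCount++;
            }
        }
        double grandMean = totalSum / totalCount;

        double ssBetween = 0;
        double ssWithin = 0;
        for (double[] group : groups) {
            double sum = 0;
            for (double v : group) {
                sum += v;
            }
            double groupMean = sum / group.length;
            ssBetween += group.length * (groupMean - grandMean) * (groupMean - grandMean);
            for (double v : group) {
                ssWithin += (v - groupMean) * (v - groupMean);
            }
        }
        int dfBetween = groups.size() - 1;
        int dfWithin = totalCount - groups.size();
        return (ssBetween / dfBetween) / (ssWithin / dfWithin);
    }

    public static void main(String[] args) {
        List<double[]> groups = new ArrayList<double[]>();
        groups.add(new double[] { 1, 2, 3 });
        groups.add(new double[] { 2, 3, 4 });
        groups.add(new double[] { 3, 4, 5 });
        // SS_between = 6 with 2 degrees of freedom, SS_within = 6 with 6, so F = 3.0
        System.out.println(fValue(groups));
    }
}
```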
1198 </dd> | |
1199 </dl> | |
1200 </p> | |
1201 </div> | |
1202 </div> | |
1203 | |
1204 </div> | |
1205 </div> | |
1206 <div class="clear"> | |
1207 <hr/> | |
1208 </div> | |
1209 <div id="footer"> | |
1210 <div class="xright">© | |
1211 2003-2010 | |
1212 | |
1213 | |
1214 | |
1215 | |
1216 | |
1217 | |
1218 | |
1219 | |
1220 | |
1221 </div> | |
1222 <div class="clear"> | |
1223 <hr/> | |
1224 </div> | |
1225 </div> | |
1226 </body> | |
1227 </html> |