1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
2
|
|
3
|
|
4
|
|
5
|
|
6
|
|
7
|
|
8
|
|
9
|
|
10
|
|
11
|
|
12
|
|
13 <html xmlns="http://www.w3.org/1999/xhtml">
|
|
14 <head>
|
|
15 <title>Math - The Commons Math User Guide - Statistics</title>
|
|
16 <style type="text/css" media="all">
|
|
17 @import url("../css/maven-base.css");
|
|
18 @import url("../css/maven-theme.css");
|
|
19 @import url("../css/site.css");
|
|
20 </style>
|
|
21 <link rel="stylesheet" href="../css/print.css" type="text/css" media="print" />
|
|
22 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
|
|
23 </head>
|
|
24 <body class="composite">
|
|
25 <div id="banner">
|
|
26 <span id="bannerLeft">
|
|
27
|
|
28 Commons Math User Guide
|
|
29
|
|
30 </span>
|
|
31 <div class="clear">
|
|
32 <hr/>
|
|
33 </div>
|
|
34 </div>
|
|
35 <div id="breadcrumbs">
|
|
36
|
|
37
|
|
38
|
|
39
|
|
40
|
|
41
|
|
42
|
|
43
|
|
44 <div class="xright">
|
|
45
|
|
46
|
|
47
|
|
48
|
|
49
|
|
50
|
|
51
|
|
52 </div>
|
|
53 <div class="clear">
|
|
54 <hr/>
|
|
55 </div>
|
|
56 </div>
|
|
57 <div id="leftColumn">
|
|
58 <div id="navcolumn">
|
|
59
|
|
60
|
|
61
|
|
62
|
|
63
|
|
64
|
|
65
|
|
66
|
|
67 <h5>User Guide</h5>
|
|
68 <ul>
|
|
69
|
|
70 <li class="none">
|
|
71 <a href="../userguide/index.html">Contents</a>
|
|
72 </li>
|
|
73
|
|
74 <li class="none">
|
|
75 <a href="../userguide/overview.html">Overview</a>
|
|
76 </li>
|
|
77
|
|
78 <li class="none">
|
|
79 <strong>Statistics</strong>
|
|
80 </li>
|
|
81
|
|
82 <li class="none">
|
|
83 <a href="../userguide/random.html">Data Generation</a>
|
|
84 </li>
|
|
85
|
|
86 <li class="none">
|
|
87 <a href="../userguide/linear.html">Linear Algebra</a>
|
|
88 </li>
|
|
89
|
|
90 <li class="none">
|
|
91 <a href="../userguide/analysis.html">Numerical Analysis</a>
|
|
92 </li>
|
|
93
|
|
94 <li class="none">
|
|
95 <a href="../userguide/special.html">Special Functions</a>
|
|
96 </li>
|
|
97
|
|
98 <li class="none">
|
|
99 <a href="../userguide/utilities.html">Utilities</a>
|
|
100 </li>
|
|
101
|
|
102 <li class="none">
|
|
103 <a href="../userguide/complex.html">Complex Numbers</a>
|
|
104 </li>
|
|
105
|
|
106 <li class="none">
|
|
107 <a href="../userguide/distribution.html">Distributions</a>
|
|
108 </li>
|
|
109
|
|
110 <li class="none">
|
|
111 <a href="../userguide/fraction.html">Fractions</a>
|
|
112 </li>
|
|
113
|
|
114 <li class="none">
|
|
115 <a href="../userguide/transform.html">Transform Methods</a>
|
|
116 </li>
|
|
117
|
|
118 <li class="none">
|
|
119 <a href="../userguide/geometry.html">3D Geometry</a>
|
|
120 </li>
|
|
121
|
|
122 <li class="none">
|
|
123 <a href="../userguide/optimization.html">Optimization</a>
|
|
124 </li>
|
|
125
|
|
126 <li class="none">
|
|
127 <a href="../userguide/ode.html">Ordinary Differential Equations</a>
|
|
128 </li>
|
|
129
|
|
130 <li class="none">
|
|
131 <a href="../userguide/genetics.html">Genetic Algorithms</a>
|
|
132 </li>
|
|
133 </ul>
|
|
134 <a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
|
|
135 <img alt="Built by Maven" src="../images/logos/maven-feather.png"></img>
|
|
136 </a>
|
|
137
|
|
138
|
|
139
|
|
140
|
|
141
|
|
142
|
|
143
|
|
144
|
|
145 </div>
|
|
146 </div>
|
|
147 <div id="bodyColumn">
|
|
148 <div id="contentBox">
|
|
149 <div class="section"><h2><a name="a1_Statistics"></a>1 Statistics</h2>
|
|
150 <div class="section"><h3><a name="a1.1_Overview"></a>1.1 Overview</h3>
|
|
151 <p>
|
|
152 The statistics package provides frameworks and implementations for
|
|
153 basic descriptive statistics, frequency distributions, bivariate regression,
|
|
154 and t-, chi-square and ANOVA test statistics.
|
|
155 </p>
|
|
156 <p><a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br />
|
|
157 <a href="#a1.3_Frequency_distributions">Frequency distributions</a><br />
|
|
158 <a href="#a1.4_Simple_regression">Simple regression</a><br />
|
|
159 <a href="#a1.5_Multiple_linear_regression">Multiple linear regression</a><br />
|
|
160 <a href="#a1.6_Rank_transformations">Rank transformations</a><br />
|
|
161 <a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br />
|
|
162 <a href="#a1.8_Statistical_tests">Statistical tests</a><br />
|
|
163 </p>
|
|
164 </div>
|
|
165 <div class="section"><h3><a name="a1.2_Descriptive_statistics"></a>1.2 Descriptive statistics</h3>
|
|
166 <p>
|
|
167 The stat package includes a framework and default implementations for
|
|
168 the following descriptive statistics:
|
|
169 <ul><li>arithmetic and geometric means</li>
|
|
170 <li>variance and standard deviation</li>
|
|
171 <li>sum, product, log sum, sum of squared values</li>
|
|
172 <li>minimum, maximum, median, and percentiles</li>
|
|
173 <li>skewness and kurtosis</li>
|
|
174 <li>first, second, third and fourth moments</li>
|
|
175 </ul>
|
|
176 </p>
|
|
177 <p>
|
|
178 With the exception of percentiles and the median, all of these
|
|
179 statistics can be computed without maintaining the full list of input
|
|
180 data values in memory. The stat package provides interfaces and
|
|
181 implementations that do not require value storage as well as
|
|
182 implementations that operate on arrays of stored values.
|
|
183 </p>
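<p>The storeless idea described above can be sketched in a few lines of plain Java. The class below is a hypothetical illustration, not the Commons Math implementation: the name <code>RunningMoments</code> and its methods are assumptions made for this example. It keeps only a count, a running mean and a running sum of squared deviations, and updates them with Welford's recurrence each time a value is added.</p>

```java
// Illustrative sketch only; RunningMoments is a hypothetical class,
// not part of Commons Math.
final class RunningMoments {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one value into the running state (Welford's update). */
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    /** Returns NaN when no values have been added. */
    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Bias-corrected sample variance; NaN when empty, 0 for a single value. */
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        return n == 1 ? 0.0 : m2 / (n - 1);
    }
}
```

<p>Because no values are retained, memory use stays constant however many values are added. The price is that order statistics such as the median cannot be recovered from this state.</p>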
|
|
184 <p>
|
|
185 The top level interface is
|
|
186 <a href="../apidocs/org/apache/commons/math/stat/descriptive/UnivariateStatistic.html">
|
|
187 org.apache.commons.math.stat.descriptive.UnivariateStatistic</a>.
|
|
188 This interface, implemented by all statistics, consists of
|
|
189 <code>evaluate()</code> methods that take double[] arrays as arguments
|
|
190 and return the value of the statistic. This interface is extended by
|
|
191 <a href="../apidocs/org/apache/commons/math/stat/descriptive/StorelessUnivariateStatistic.html">
|
|
192 StorelessUnivariateStatistic</a>, which adds <code>increment()</code>, <code>getResult()</code> and associated methods to support
|
|
193 "storeless" implementations that maintain counters, sums or other
|
|
194 state information as values are added using the <code>increment()</code>
|
|
195 method.
|
|
196 </p>
|
|
197 <p>
|
|
198 Abstract implementations of the top level interfaces are provided in
|
|
199 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractUnivariateStatistic.html">
|
|
200 AbstractUnivariateStatistic</a> and
|
|
201 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractStorelessUnivariateStatistic.html">
|
|
202 AbstractStorelessUnivariateStatistic</a> respectively.
|
|
203 </p>
|
|
204 <p>
|
|
205 Each statistic is implemented as a separate class, in one of the
|
|
206 subpackages (moment, rank, summary) and each extends one of the abstract
|
|
207 classes above (depending on whether or not value storage is required to
|
|
208 compute the statistic). There are several ways to instantiate and use statistics.
|
|
209 Statistics can be instantiated and used directly, but it is generally more convenient
|
|
210 (and efficient) to access them using the provided aggregates,
|
|
211 <a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html">
|
|
212 DescriptiveStatistics</a> and
|
|
213 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html">
|
|
214 SummaryStatistics.</a></p>
|
|
215 <p><code>DescriptiveStatistics</code> maintains the input data in memory
|
|
216 and has the capability of producing "rolling" statistics computed from a
|
|
217 "window" consisting of the most recently added values.
|
|
218 </p>
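<p>A "rolling" window can be pictured as a bounded queue: once the window is full, each new value pushes out the oldest one. The following is a minimal sketch under that assumption, written in plain Java. <code>RollingWindow</code> is a hypothetical name chosen for this example and says nothing about how <code>DescriptiveStatistics</code> stores its values internally.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; RollingWindow is a hypothetical class,
// not part of Commons Math.
final class RollingWindow {
    private final int windowSize;
    private final Deque<Double> values = new ArrayDeque<>();

    RollingWindow(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds a value, discarding the oldest one when the window is full. */
    void addValue(double v) {
        if (values.size() == windowSize) {
            values.removeFirst();
        }
        values.addLast(v);
    }

    /** Mean of the values currently in the window; NaN when empty. */
    double getMean() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }
}
```

<p>Since the window holds the actual values, statistics that need the data itself (the median, for example) remain computable, which is not the case for a storeless aggregate.</p>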
|
|
219 <p><code>SummaryStatistics</code> does not store the input data values
|
|
220 in memory, so the statistics included in this aggregate are limited to those
|
|
221 that can be computed in one pass through the data without access to
|
|
222 the full array of values.
|
|
223 </p>
|
|
224 <p><table class="bodyTable"><tr class="a"><th>Aggregate</th>
|
|
225 <th>Statistics Included</th>
|
|
226 <th>Values stored?</th>
|
|
227 <th>"Rolling" capability?</th>
|
|
228 </tr>
|
|
229 <tr class="b"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html">
|
|
230 DescriptiveStatistics</a></td>
|
|
231 <td>min, max, mean, geometric mean, n,
|
|
232 sum, sum of squares, standard deviation, variance, percentiles, skewness,
|
|
233 kurtosis, median</td>
|
|
234 <td>Yes</td>
|
|
235 <td>Yes</td>
|
|
236 </tr>
|
|
237 <tr class="a"><td><a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html">
|
|
238 SummaryStatistics</a></td>
|
|
239 <td>min, max, mean, geometric mean, n,
|
|
240 sum, sum of squares, standard deviation, variance</td>
|
|
241 <td>No</td>
|
|
242 <td>No</td>
|
|
243 </tr>
|
|
244 </table>
|
|
245 </p>
|
|
246 <p><code>SummaryStatistics</code> can be aggregated using
|
|
247 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html">
|
|
248 AggregateSummaryStatistics.</a> This class can be used to concurrently gather statistics for multiple
|
|
249 datasets as well as for a combined sample including all of the data.
|
|
250 </p>
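<p>Combining per-dataset summaries into an overall summary does not require revisiting the data. The sketch below shows one well-known way to do it, the pairwise update for counts, means and sums of squared deviations. It is an illustration only: <code>MergeableSummary</code> and its methods are hypothetical names, and nothing here is claimed to be what <code>AggregateSummaryStatistics</code> does internally.</p>

```java
// Illustrative sketch only; MergeableSummary is a hypothetical class,
// not part of Commons Math.
final class MergeableSummary {
    final long n;       // number of values
    final double mean;  // arithmetic mean
    final double m2;    // sum of squared deviations from the mean

    MergeableSummary(long n, double mean, double m2) {
        this.n = n;
        this.mean = mean;
        this.m2 = m2;
    }

    /** Builds a summary from raw values with a simple two-pass computation. */
    static MergeableSummary of(double... values) {
        if (values.length == 0) {
            return new MergeableSummary(0, 0.0, 0.0);
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        double m2 = 0;
        for (double v : values) {
            m2 += (v - mean) * (v - mean);
        }
        return new MergeableSummary(values.length, mean, m2);
    }

    /** Combines two summaries as if their underlying samples were pooled. */
    static MergeableSummary merge(MergeableSummary a, MergeableSummary b) {
        if (a.n == 0) {
            return b;
        }
        if (b.n == 0) {
            return a;
        }
        long n = a.n + b.n;
        double delta = b.mean - a.mean;
        double mean = a.mean + delta * b.n / n;
        double m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n;
        return new MergeableSummary(n, mean, m2);
    }

    double getSum() {
        return n * mean;
    }

    /** Bias-corrected sample variance; NaN when empty, 0 for a single value. */
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        return n == 1 ? 0.0 : m2 / (n - 1);
    }
}
```

<p>For example, merging the summaries of the samples {2, 3} and {2, 4} yields the same count, sum and variance as summarizing {2, 3, 2, 4} directly.</p>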
|
|
251 <p><code>MultivariateSummaryStatistics</code> is similar to <code>SummaryStatistics</code>
|
|
252 but handles n-tuple values instead of scalar values. It can also compute the
|
|
253 full covariance matrix for the input data.
|
|
254 </p>
|
|
255 <p>
|
|
256 Neither <code>DescriptiveStatistics</code> nor <code>SummaryStatistics</code> is
|
|
257 thread-safe. <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedDescriptiveStatistics.html">
|
|
258 SynchronizedDescriptiveStatistics</a> and
|
|
259 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedSummaryStatistics.html">
|
|
260 SynchronizedSummaryStatistics</a>, respectively, provide thread-safe versions for applications that
|
|
261 require concurrent access to statistical aggregates by multiple threads.
|
|
262 <a href="../apidocs/org/apache/commons/math/stat/descriptive/SynchronizedMultiVariateSummaryStatistics.html">
|
|
263 SynchronizedMultivariateSummaryStatistics</a> provides a thread-safe version of <code>MultivariateSummaryStatistics</code>.</p>
|
|
264 <p>
|
|
265 There is also a utility class,
|
|
266 <a href="../apidocs/org/apache/commons/math/stat/StatUtils.html">
|
|
267 StatUtils</a>, that provides static methods for computing statistics
|
|
268 directly from double[] arrays.
|
|
269 </p>
|
|
270 <p>
|
|
271 Here are some examples showing how to compute descriptive statistics.
|
|
272 <dl><dt>Compute summary statistics for a list of double values</dt>
|
|
273 <br />
|
|
274 <dd>Using the <code>DescriptiveStatistics</code> aggregate
|
|
275 (values are stored in memory):
|
|
276 <div class="source"><pre>
|
|
277 // Get a DescriptiveStatistics instance
|
|
278 DescriptiveStatistics stats = new DescriptiveStatistics();
|
|
279
|
|
280 // Add the data from the array
|
|
281 for( int i = 0; i < inputArray.length; i++) {
|
|
282 stats.addValue(inputArray[i]);
|
|
283 }
|
|
284
|
|
285 // Compute some statistics
|
|
286 double mean = stats.getMean();
|
|
287 double std = stats.getStandardDeviation();
|
|
288 double median = stats.getMedian();
|
|
289 </pre>
|
|
290 </div>
|
|
291 </dd>
|
|
292 <dd>Using the <code>SummaryStatistics</code> aggregate (values are
|
|
293 <strong>not</strong> stored in memory):
|
|
294 <div class="source"><pre>
|
|
295 // Get a SummaryStatistics instance
|
|
296 SummaryStatistics stats = new SummaryStatistics();
|
|
297
|
|
298 // Read data from an input stream (in is assumed to be a BufferedReader),
|
|
299 // adding values and updating sums, counters, etc.
|
|
300 String line;
|
|
301 while ((line = in.readLine()) != null) {
|
|
302 stats.addValue(Double.parseDouble(line.trim()));
|
|
303 }
|
|
304 in.close();
|
|
305
|
|
306 // Compute the statistics
|
|
307 double mean = stats.getMean();
|
|
308 double std = stats.getStandardDeviation();
|
|
309 //double median = stats.getMedian(); <-- NOT AVAILABLE
|
|
310 </pre>
|
|
311 </div>
|
|
312 </dd>
|
|
313 <dd>Using the <code>StatUtils</code> utility class:
|
|
314 <div class="source"><pre>
|
|
315 // Compute statistics directly from the array
|
|
316 // assume values is a double[] array
|
|
317 double mean = StatUtils.mean(values);
|
|
318 double std = Math.sqrt(StatUtils.variance(values));
|
|
319 double median = StatUtils.percentile(values, 50);
|
|
320
|
|
321 // Compute the mean of the first three values in the array
|
|
322 mean = StatUtils.mean(values, 0, 3);
|
|
323 </pre>
|
|
324 </div>
|
|
325 </dd>
|
|
326 <dt>Maintain a "rolling mean" of the most recent 100 values from
|
|
327 an input stream</dt>
|
|
328 <br />
|
|
329 <dd>Use a <code>DescriptiveStatistics</code> instance with
|
|
330 window size set to 100
|
|
331 <div class="source"><pre>
|
|
332 // Create a DescriptiveStatistics instance and set the window size to 100
|
|
333 DescriptiveStatistics stats = new DescriptiveStatistics();
|
|
334 stats.setWindowSize(100);
|
|
335
|
|
336 // Read data from an input stream,
|
|
337 // displaying the mean of the most recent 100 observations
|
|
338 // after every 100 observations
|
|
339 long nLines = 0;
|
|
340 for (String line = in.readLine(); line != null; line = in.readLine()) {
|
|
341 stats.addValue(Double.parseDouble(line.trim()));
|
|
342 nLines++;
|
|
343 if (nLines == 100) {
|
|
344 nLines = 0;
|
|
345 System.out.println(stats.getMean());
|
|
346 }
|
|
347 }
|
|
348 in.close();
|
|
349 </pre>
|
|
350 </div>
|
|
351 </dd>
|
|
352 <dt>Compute statistics in a thread-safe manner</dt>
|
|
353 <br />
|
|
354 <dd>Use a <code>SynchronizedDescriptiveStatistics</code> instance
|
|
355 <div class="source"><pre>
|
|
356 // Create a SynchronizedDescriptiveStatistics instance and
|
|
357 // use as any other DescriptiveStatistics instance
|
|
358 DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics();
|
|
359 </pre>
|
|
360 </div>
|
|
361 </dd>
|
|
362 <dt>Compute statistics for multiple samples and overall statistics concurrently</dt>
|
|
363 <br />
|
|
364 <dd>There are two ways to do this using <code>AggregateSummaryStatistics</code>.
|
|
365 The first is to use an <code>AggregateSummaryStatistics</code> instance to accumulate
|
|
366 overall statistics contributed by <code>SummaryStatistics</code> instances created using
|
|
367 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#createContributingStatistics()">
|
|
368 AggregateSummaryStatistics.createContributingStatistics()</a>:
|
|
369 <div class="source"><pre>
|
|
370 // Create an AggregateSummaryStatistics instance to accumulate the overall statistics
|
|
371 // and contributing SummaryStatistics instances for the subsamples
|
|
372 AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics();
|
|
373 SummaryStatistics setOneStats = aggregate.createContributingStatistics();
|
|
374 SummaryStatistics setTwoStats = aggregate.createContributingStatistics();
|
|
375 // Add values to the subsample aggregates
|
|
376 setOneStats.addValue(2);
|
|
377 setOneStats.addValue(3);
|
|
378 setTwoStats.addValue(2);
|
|
379 setTwoStats.addValue(4);
|
|
380 ...
|
|
381 // Full sample data is reported by the aggregate
|
|
382 double totalSampleSum = aggregate.getSum();
|
|
383 </pre>
|
|
384 </div>
|
|
385
|
|
386 The above approach has the disadvantages that the <code>addValue</code> calls must be synchronized on the
|
|
387 <code>SummaryStatistics</code> instance maintained by the aggregate and each value addition updates the
|
|
388 aggregate as well as the subsample. For applications that can wait to do the aggregation until all values
|
|
389 have been added, a static
|
|
390 <a href="../apidocs/org/apache/commons/math/stat/descriptive/AggregateSummaryStatistics.html#aggregate(java.util.Collection)">
|
|
391 aggregate</a> method is available, as shown in the following example.
|
|
392 This method should be used when aggregation needs to be done across threads.
|
|
393 <div class="source"><pre>
|
|
394 // Create SummaryStatistics instances for the subsample data
|
|
395 SummaryStatistics setOneStats = new SummaryStatistics();
|
|
396 SummaryStatistics setTwoStats = new SummaryStatistics();
|
|
397 // Add values to the subsample SummaryStatistics instances
|
|
398 setOneStats.addValue(2);
|
|
399 setOneStats.addValue(3);
|
|
400 setTwoStats.addValue(2);
|
|
401 setTwoStats.addValue(4);
|
|
402 ...
|
|
403 // Aggregate the subsample statistics
|
|
404 Collection<SummaryStatistics> aggregate = new ArrayList<SummaryStatistics>();
|
|
405 aggregate.add(setOneStats);
|
|
406 aggregate.add(setTwoStats);
|
|
407 StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate);
|
|
408
|
|
409 // Full sample data is reported by aggregatedStats
|
|
410 double totalSampleSum = aggregatedStats.getSum();
|
|
411 </pre>
|
|
412 </div>
|
|
413 </dd>
|
|
414 </dl>
|
|
415 </p>
|
|
416 </div>
|
|
417 <div class="section"><h3><a name="a1.3_Frequency_distributions"></a>1.3 Frequency distributions</h3>
|
|
418 <p><a href="../apidocs/org/apache/commons/math/stat/Frequency.html">
|
|
419 org.apache.commons.math.stat.Frequency</a>
|
|
420 provides a simple interface for maintaining counts and percentages of discrete
|
|
421 values.
|
|
422 </p>
|
|
423 <p>
|
|
424 Strings, integers, longs and chars are all supported as value types,
|
|
425 as well as instances of any class that implements <code>Comparable</code>.
|
|
426 The ordering of values used in computing cumulative frequencies is by
|
|
427 default the <i>natural ordering</i>, but this can be overridden by supplying a
|
|
428 <code>Comparator</code> to the constructor. Adding values that are not
|
|
429 comparable to those that have already been added results in an
|
|
430 <code>IllegalArgumentException</code>.</p>
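<p>The role the ordering plays in cumulative frequencies can be illustrated with a sorted map: the cumulative percentage of a value is the share of observations whose keys sort at or before it. The class below is a hedged sketch in plain Java, not the <code>Frequency</code> implementation. <code>FrequencyTable</code> is a hypothetical name, and the choice of a <code>TreeMap</code> is an assumption made for this example.</p>

```java
import java.util.Comparator;
import java.util.TreeMap;

// Illustrative sketch only; FrequencyTable is a hypothetical class,
// not part of Commons Math.
final class FrequencyTable<T> {
    private final TreeMap<T, Long> counts;
    private long total;

    /** The comparator decides both which values are "equal" and their order. */
    FrequencyTable(Comparator<? super T> order) {
        counts = new TreeMap<>(order);
    }

    void addValue(T v) {
        counts.merge(v, 1L, Long::sum);
        total++;
    }

    long getCount(T v) {
        return counts.getOrDefault(v, 0L);
    }

    /** Share of observations that sort at or before v; NaN when empty. */
    double getCumPct(T v) {
        if (total == 0) {
            return Double.NaN;
        }
        long cumulative = 0;
        for (long c : counts.headMap(v, true).values()) {
            cumulative += c;
        }
        return (double) cumulative / total;
    }
}
```

<p>With natural ordering, upper-case letters sort before lower-case ones, so "One", "oNe" and "one" are three distinct keys. Passing <code>String.CASE_INSENSITIVE_ORDER</code> instead collapses them into a single key.</p>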
|
|
431 <p>
|
|
432 Here are some examples.
|
|
433 <dl><dt>Compute a frequency distribution based on integer values</dt>
|
|
434 <br />
|
|
435 <dd>Mixing integers, longs, Integers and Longs:
|
|
436 <div class="source"><pre>
|
|
437 Frequency f = new Frequency();
|
|
438 f.addValue(1);
|
|
439 f.addValue(new Integer(1));
|
|
440 f.addValue(new Long(1));
|
|
441 f.addValue(2);
|
|
442 f.addValue(new Integer(-1));
|
|
443 System.out.println(f.getCount(1)); // displays 3
|
|
444 System.out.println(f.getCumPct(0)); // displays 0.2
|
|
445 System.out.println(f.getPct(new Integer(1))); // displays 0.6
|
|
446 System.out.println(f.getCumPct(-2)); // displays 0
|
|
447 System.out.println(f.getCumPct(10)); // displays 1
|
|
448 </pre>
|
|
449 </div>
|
|
450 </dd>
|
|
451 <dt>Count string frequencies</dt>
|
|
452 <br />
|
|
453 <dd>Using case-sensitive comparison, alpha sort order (natural comparator):
|
|
454 <div class="source"><pre>
|
|
455 Frequency f = new Frequency();
|
|
456 f.addValue("one");
|
|
457 f.addValue("One");
|
|
458 f.addValue("oNe");
|
|
459 f.addValue("Z");
|
|
460 System.out.println(f.getCount("one")); // displays 1
|
|
461 System.out.println(f.getCumPct("Z")); // displays 0.5
|
|
462 System.out.println(f.getCumPct("Ot")); // displays 0.25
|
|
463 </pre>
|
|
464 </div>
|
|
465 </dd>
|
|
466 <dd>Using case-insensitive comparator:
|
|
467 <div class="source"><pre>
|
|
468 Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
|
|
469 f.addValue("one");
|
|
470 f.addValue("One");
|
|
471 f.addValue("oNe");
|
|
472 f.addValue("Z");
|
|
473 System.out.println(f.getCount("one")); // displays 3
|
|
474 System.out.println(f.getCumPct("z")); // displays 1
|
|
475 </pre>
|
|
476 </div>
|
|
477 </dd>
|
|
478 </dl>
|
|
479 </p>
|
|
480 </div>
|
|
481 <div class="section"><h3><a name="a1.4_Simple_regression"></a>1.4 Simple regression</h3>
|
|
482 <p><a href="../apidocs/org/apache/commons/math/stat/regression/SimpleRegression.html">
|
|
483 org.apache.commons.math.stat.regression.SimpleRegression</a>
|
|
484 provides ordinary least squares regression with one independent variable,
|
|
485 estimating the linear model:
|
|
486 </p>
|
|
487 <p><code> y = intercept + slope * x </code></p>
|
|
488 <p>
|
|
489 Standard errors for <code>intercept</code> and <code>slope</code> are
|
|
490 available as well as ANOVA, r-square and Pearson's r statistics.
|
|
491 </p>
|
|
492 <p>
|
|
493 Observations (x,y pairs) can be added to the model one at a time or they
|
|
494 can be provided in a 2-dimensional array. The observations are not stored
|
|
495 in memory, so there is no limit to the number of observations that can be
|
|
496 added to the model.
|
|
497 </p>
|
|
498 <p><strong>Usage Notes</strong>: <ul><li> When there are fewer than two observations in the model, or when
|
|
499 there is no variation in the x values (i.e., all x values are the same),
|
|
500 all statistics return <code>NaN</code>. At least two observations with
|
|
501 different x coordinates are required to estimate a bivariate regression
|
|
502 model.</li>
|
|
503 <li> Getters for the statistics always compute values based on the current
|
|
504 set of observations -- i.e., you can get statistics, then add more data
|
|
505 and get updated statistics without using a new instance. There is no
|
|
506 "compute" method that updates all statistics. Each of the getters performs
|
|
507 the necessary computations to return the requested statistic.</li>
|
|
508 </ul>
|
|
509 </p>
|
|
510 <p><strong>Implementation Notes</strong>: <ul><li> As observations are added to the model, the sum of x values, y values,
|
|
511 cross products (x times y), and squared deviations of x and y from their
|
|
512 respective means are updated using updating formulas defined in
|
|
513 "Algorithms for Computing the Sample Variance: Analysis and
|
|
514 Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J.
|
|
515 1983, American Statistician, vol. 37, pp. 242-247, referenced in
|
|
516 Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985. All regression
|
|
517 statistics are computed from these sums.</li>
|
|
518 <li> Inference statistics (confidence intervals, parameter significance levels)
|
|
519 are based on the assumption that the observations included in the model are
|
|
520 drawn from a <a href="http://mathworld.wolfram.com/BivariateNormalDistribution.html" class="externalLink">
|
|
521 Bivariate Normal Distribution</a>.</li>
|
|
522 </ul>
|
|
523 </p>
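<p>To make the first implementation note concrete, here is a minimal sketch of a regression that keeps only running means and sums of squared deviations and cross products, updating them as each (x, y) pair arrives. It is written under stated assumptions: <code>RunningRegression</code> is a hypothetical class, and the update shown is one standard form of such formulas, not a transcription of the <code>SimpleRegression</code> source.</p>

```java
// Illustrative sketch only; RunningRegression is a hypothetical class,
// not part of Commons Math.
final class RunningRegression {
    private long n;        // number of observations
    private double xbar;   // running mean of x
    private double ybar;   // running mean of y
    private double sumXX;  // sum of squared deviations of x from xbar
    private double sumXY;  // sum of cross products of deviations

    /** Adds one observation without storing it. */
    void addData(double x, double y) {
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double dx = x - xbar;
            double dy = y - ybar;
            double weight = n / (n + 1.0);
            sumXX += dx * dx * weight;
            sumXY += dx * dy * weight;
            xbar += dx / (n + 1.0);
            ybar += dy / (n + 1.0);
        }
        n++;
    }

    /** NaN with fewer than two observations or no variation in x. */
    double getSlope() {
        if (n < 2 || sumXX == 0.0) {
            return Double.NaN;
        }
        return sumXY / sumXX;
    }

    /** NaN whenever the slope is NaN. */
    double getIntercept() {
        return ybar - getSlope() * xbar;
    }
}
```

<p>Each getter recomputes its result from the current sums, which mirrors the usage note above: more observations can be added at any time and later calls reflect them.</p>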
|
|
524 <p>
|
|
525 Here are some examples.
|
|
526 <dl><dt>Estimate a model based on observations added one at a time</dt>
|
|
527 <br />
|
|
528 <dd>Instantiate a regression instance and add data points
|
|
529 <div class="source"><pre>
|
|
530 SimpleRegression regression = new SimpleRegression();
|
|
531 regression.addData(1d, 2d);
|
|
532 // At this point, with only one observation,
|
|
533 // all regression statistics will return NaN
|
|
534
|
|
535 regression.addData(3d, 3d);
|
|
536 // With only two observations,
|
|
537 // slope and intercept can be computed
|
|
538 // but inference statistics will return NaN
|
|
539
|
|
540 regression.addData(3d, 3d);
|
|
541 // Now all statistics are defined.
|
|
542 </pre>
|
|
543 </div>
|
|
544 </dd>
|
|
545 <dd>Compute some statistics based on observations added so far
|
|
546 <div class="source"><pre>
|
|
547 System.out.println(regression.getIntercept());
|
|
548 // displays intercept of regression line
|
|
549
|
|
550 System.out.println(regression.getSlope());
|
|
551 // displays slope of regression line
|
|
552
|
|
553 System.out.println(regression.getSlopeStdErr());
|
|
554 // displays slope standard error
|
|
555 </pre>
|
|
556 </div>
|
|
557 </dd>
|
|
558 <dd>Use the regression model to predict the y value for a new x value
|
|
559 <div class="source"><pre>
|
|
560 System.out.println(regression.predict(1.5d));
|
|
561 // displays predicted y value for x = 1.5
|
|
562 </pre>
|
|
563 </div>
|
|
564
|
|
565 More data points can be added and subsequent getXxx calls will incorporate
|
|
566 additional data in statistics.
|
|
567 </dd>
|
|
568 <dt>Estimate a model from a double[][] array of data points</dt>
|
|
569 <br />
|
|
570 <dd>Instantiate a regression object and load dataset
|
|
571 <div class="source"><pre>
|
|
572 double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }};
|
|
573 SimpleRegression regression = new SimpleRegression();
|
|
574 regression.addData(data);
|
|
575 </pre>
|
|
576 </div>
|
|
577 </dd>
|
|
578 <dd>Estimate regression model based on data
|
|
579 <div class="source"><pre>
|
|
580 System.out.println(regression.getIntercept());
|
|
581 // displays intercept of regression line
|
|
582
|
|
583 System.out.println(regression.getSlope());
|
|
584 // displays slope of regression line
|
|
585
|
|
586 System.out.println(regression.getSlopeStdErr());
|
|
587 // displays slope standard error
|
|
588 </pre>
|
|
589 </div>
|
|
590
|
|
591 More data points -- even another double[][] array -- can be added and subsequent
|
|
592 getXxx calls will incorporate additional data in statistics.
|
|
593 </dd>
|
|
594 </dl>
|
|
595 </p>
|
|
596 </div>
|
|
597 <div class="section"><h3><a name="a1.5_Multiple_linear_regression"></a>1.5 Multiple linear regression</h3>
|
|
598 <p><a href="../apidocs/org/apache/commons/math/stat/regression/MultipleLinearRegression.html">
|
|
599 org.apache.commons.math.stat.regression.MultipleLinearRegression</a>
|
|
600 provides ordinary least squares regression with a generic multiple variable linear model, which
|
|
601 in matrix notation can be expressed as:
|
|
602 </p>
|
|
603 <p><code> y=X*b+u </code></p>
|
|
604 <p>
|
|
605 where y is an <code>n-vector</code> <b>regressand</b>, X is a <code>[n,k]</code> matrix whose <code>k</code> columns are called
|
|
606 <b>regressors</b>, b is a <code>k-vector</code> of <b>regression parameters</b> and <code>u</code> is an <code>n-vector</code>
|
|
607 of <b>error terms</b> or <b>residuals</b>. The notation is quite standard in the literature;
|
|
608 see, e.g., <a href="http://www.econ.queensu.ca/ETM" class="externalLink">Davidson and MacKinnon, Econometrics Theory and Methods, 2004</a>.
|
|
609 </p>
|
|
610 <p>
|
|
611 Two implementations are provided: <a href="../apidocs/org/apache/commons/math/stat/regression/OLSMultipleLinearRegression.html">
|
|
612 org.apache.commons.math.stat.regression.OLSMultipleLinearRegression</a> and
|
|
613 <a href="../apidocs/org/apache/commons/math/stat/regression/GLSMultipleLinearRegression.html">
|
|
614 org.apache.commons.math.stat.regression.GLSMultipleLinearRegression</a>.</p>
|
|
615 <p>
|
|
616 Observations (x,y and covariance data matrices) can be added to the model via the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method.
|
|
617 The observations are stored in memory until the next time the addData method is invoked.
|
|
618 </p>
|
|
619 <p><strong>Usage Notes</strong>: <ul><li> Data is validated when invoking the <code>addData(double[] y, double[][] x, double[][] covariance)</code> method and
|
|
620 <code>IllegalArgumentException</code> is thrown when inappropriate.
|
|
621 </li>
|
|
622 <li> Only the GLS regression requires the covariance matrix; the OLS regression ignores it, so it can safely be
|
|
623 passed as <code>null</code>.</li>
|
|
624 </ul>
|
|
625 </p>
|
|
626 <p>
|
|
627 Here are some examples.
|
|
628 <dl><dt>OLS regression</dt>
|
|
629 <br />
|
|
630 <dd>Instantiate an OLS regression object and load dataset
|
|
631 <div class="source"><pre>
|
|
632 MultipleLinearRegression regression = new OLSMultipleLinearRegression();
|
|
633 double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
|
|
634 double[][] x = new double[6][];
|
|
635 x[0] = new double[]{1.0, 0, 0, 0, 0, 0};
|
|
636 x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0};
|
|
637 x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0};
|
|
638 x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0};
|
|
639 x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0};
|
|
640 x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0};
|
|
641 regression.addData(y, x, null); // we don't need covariance
|
|
642 </pre>
|
|
643 </div>
|
|
644 </dd>
|
|
645 <dd>Estimate of regression values honours the <code>MultipleLinearRegression</code> interface:
|
|
646 <div class="source"><pre>
|
|
647 double[] beta = regression.estimateRegressionParameters();
|
|
648
|
|
649 double[] residuals = regression.estimateResiduals();
|
|
650
|
|
651 double[][] parametersVariance = regression.estimateRegressionParametersVariance();
|
|
652
|
|
653 double regressandVariance = regression.estimateRegressandVariance();
|
|
654 </pre>
|
|
655 </div>
|
|
656 </dd>
|
|
657 <dt>GLS regression</dt>
|
|
658 <br />
|
|
659 <dd>Instantiate a GLS regression object and load dataset
|
|
660 <div class="source"><pre>
|
|
661 MultipleLinearRegression regression = new GLSMultipleLinearRegression();
|
|
662 double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
|
|
663 double[][] x = new double[6][];
|
|
664 x[0] = new double[]{1.0, 0, 0, 0, 0, 0};
|
|
665 x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0};
|
|
666 x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0};
|
|
667 x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0};
|
|
668 x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0};
|
|
669 x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0};
|
|
670 double[][] omega = new double[6][];
|
|
671 omega[0] = new double[]{1.1, 0, 0, 0, 0, 0};
|
|
672 omega[1] = new double[]{0, 2.2, 0, 0, 0, 0};
|
|
673 omega[2] = new double[]{0, 0, 3.3, 0, 0, 0};
|
|
674 omega[3] = new double[]{0, 0, 0, 4.4, 0, 0};
|
|
675 omega[4] = new double[]{0, 0, 0, 0, 5.5, 0};
|
|
676 omega[5] = new double[]{0, 0, 0, 0, 0, 6.6};
|
|
677 regression.addData(y, x, omega); // we do need covariance
|
|
678 </pre>
|
|
679 </div>
|
|
680 </dd>
|
|
681 <dd>Estimate of regression values honours the same <code>MultipleLinearRegression</code> interface as
|
|
682 the OLS regression.
|
|
683 </dd>
|
|
684 </dl>
|
|
685 </p>
|
|
686 </div>
|
|
687 <div class="section"><h3><a name="a1.6_Rank_transformations"></a>1.6 Rank transformations</h3>
|
|
688 <p>
|
|
689 Some statistical algorithms require that input data be replaced by ranks.
|
|
690 The <a href="../apidocs/org/apache/commons/math/stat/ranking/package-summary.html">
|
|
691 org.apache.commons.math.stat.ranking</a> package provides rank transformation.
|
|
692 <a href="../apidocs/org/apache/commons/math/stat/ranking/RankingAlgorithm.html">
|
|
693 RankingAlgorithm</a> defines the interface for ranking.
|
|
694 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html">
|
|
695 NaturalRanking</a> provides an implementation that has two configuration options.
|
|
696 <ul><li><a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html">
|
|
697 Ties strategy</a> determines how ties in the source data are handled by the ranking.</li>
|
|
698 <li><a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html">
|
|
699 NaN strategy</a> determines how NaN values in the source data are handled.</li>
|
|
700 </ul>
|
|
701 </p>
|
|
702 <p>
|
|
703 Examples:
|
|
704 <div class="source"><pre>
|
|
705 NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL,
|
|
706 TiesStrategy.MAXIMUM);
|
|
707 double[] exampleData = { 20, 17, 30, 42.3, 17, 50,
|
|
708 Double.NaN, Double.NEGATIVE_INFINITY, 17 };
|
|
709 double[] ranks = ranking.rank(exampleData);
|
|
710 </pre>
|
|
711 </div>
|
|
712
|
|
713 results in <code>ranks</code> containing <code>{6, 5, 7, 8, 5, 9, 2, 2, 5}</code>.<div class="source"><pre>
|
|
714 new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData);
|
|
715 </pre>
|
|
716 </div>
|
|
717
|
|
718 returns <code>{5, 2, 6, 7, 3, 8, 1, 4}</code>.</p>
|
|
719 <p>
|
|
720 The default <code>NaNStrategy</code> is <code>NaNStrategy.MAXIMAL</code>. This makes <code>NaN</code>
|
|
721 values larger than any other value (including <code>Double.POSITIVE_INFINITY</code>). The
|
|
722 default <code>TiesStrategy</code> is <code>TiesStrategy.AVERAGE</code>, which assigns tied
|
|
723 values the average of the ranks applicable to the sequence of ties. See the
|
|
724 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html">
|
|
725 NaturalRanking</a> API documentation for more examples and <a href="../apidocs/org/apache/commons/math/stat/ranking/TiesStrategy.html">
|
|
726 TiesStrategy</a> and <a href="../apidocs/org/apache/commons/math/stat/ranking/NaNStrategy.html">NaNStrategy</a>
|
|
727 for details on these configuration options.
|
|
728 </p>
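<p>
To make the default <code>TiesStrategy.AVERAGE</code> behavior concrete, here is a minimal plain-Java sketch of average-rank assignment. It is not the library's implementation: the class name <code>AverageRankSketch</code> and its <code>rank</code> method are hypothetical, and the sketch assumes the input contains no <code>NaN</code> values.
</p>

```java
import java.util.Arrays;

// Illustrative only: a hypothetical helper, not part of Commons Math.
class AverageRankSketch {

    /** Returns 1-based ranks; tied values share the average of their positions. Assumes no NaN. */
    static double[] rank(double[] data) {
        int n = data.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the indices by the values they point to.
        Arrays.sort(order, (a, b) -> Double.compare(data[a], data[b]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            // Extend [start, end] over a run of equal values.
            int end = start;
            while (end + 1 < n && data[order[end + 1]] == data[order[start]]) {
                end++;
            }
            // Sorted positions start+1 .. end+1 all receive their average.
            double average = (start + 1 + end + 1) / 2.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = average;
            }
            start = end + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] data = { 20, 17, 30, 42.3, 17, 50, Double.NEGATIVE_INFINITY, 17 };
        System.out.println(Arrays.toString(rank(data)));
    }
}
```

<p>
For the example data with the <code>NaN</code> entry removed, the three values equal to 17 occupy sorted positions 2, 3 and 4, so each receives rank 3, and the sketch yields <code>{5, 3, 6, 7, 3, 8, 1, 3}</code>.
</p>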
|
|
729 </div>
|
|
730 <div class="section"><h3><a name="a1.7_Covariance_and_correlation"></a>1.7 Covariance and correlation</h3>
|
|
731 <p>
|
|
732 The <a href="../apidocs/org/apache/commons/math/stat/correlation/package-summary.html">
|
|
733 org.apache.commons.math.stat.correlation</a> package computes covariances
|
|
734 and correlations for pairs of arrays or columns of a matrix.
|
|
735 <a href="../apidocs/org/apache/commons/math/stat/correlation/Covariance.html">
|
|
736 Covariance</a> computes covariances,
|
|
737 <a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html">
|
|
738 PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients and
|
|
739 <a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html">
|
|
740 SpearmansCorrelation</a> computes Spearman's rank correlation.
|
|
741 </p>
|
|
742 <p><strong>Implementation Notes</strong><ul><li>
|
|
743 Unbiased covariances are given by the formula <br />
|
|
744 </br><code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code>
|
|
745 where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code>
|
|
746 is the mean of the <code>Y</code> values. Non-bias-corrected estimates use
|
|
747 <code>n</code> in place of <code>n - 1.</code> Whether or not covariances are
|
|
748 bias-corrected is determined by the optional parameter, "biasCorrected," which
|
|
749 defaults to <code>true.</code></li>
|
|
750 <li><a href="../apidocs/org/apache/commons/math/stat/correlation/PearsonsCorrelation.html">
|
|
751 PearsonsCorrelation</a> computes correlations defined by the formula <br />
|
|
752 </br><code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br />
|
|
753
|
|
754 where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code>
|
|
755 and <code>s(X)</code>, <code>s(Y)</code> are standard deviations.
|
|
756 </li>
|
|
757 <li><a href="../apidocs/org/apache/commons/math/stat/correlation/SpearmansCorrelation.html">
|
|
758 SpearmansCorrelation</a> applies a rank transformation to the input data and computes Pearson's
|
|
759 correlation on the ranked data. The ranking algorithm is configurable. By default,
|
|
760 <a href="../apidocs/org/apache/commons/math/stat/ranking/NaturalRanking.html">
|
|
761 NaturalRanking</a> with default strategies for handling ties and NaN values is used.
|
|
762 </li>
|
|
763 </ul>
|
|
764 </p>
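<p>
The covariance and Pearson formulas in the implementation notes can be written out directly. The following is a small plain-Java sketch of those two formulas, not the library code; the class <code>CovarianceFormulaSketch</code> and its method names are hypothetical and were chosen only for this illustration.
</p>

```java
// Illustrative only: hypothetical helper that mirrors the formulas in the notes above.
class CovarianceFormulaSketch {

    /** Arithmetic mean, written E(X) in the formulas. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** sum[(x_i - E(X))(y_i - E(Y))] divided by (n - 1) if biasCorrected, else by n. */
    static double covariance(double[] x, double[] y, boolean biasCorrected) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("arrays must have equal length of at least 2");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (biasCorrected ? x.length - 1 : x.length);
    }

    /** cor(X, Y) = cov(X, Y) / (s(X) s(Y)), using bias-corrected estimates throughout. */
    static double correlation(double[] x, double[] y) {
        double sx = Math.sqrt(covariance(x, x, true)); // s(X)
        double sy = Math.sqrt(covariance(y, y, true)); // s(Y)
        return covariance(x, y, true) / (sx * sy);
    }

    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4, 5 };
        double[] y = { 2, 4, 6, 8, 10 };
        System.out.println(covariance(x, y, true));  // divides by n - 1
        System.out.println(covariance(x, y, false)); // divides by n
        System.out.println(correlation(x, y));
    }
}
```

<p>
With <code>x = {1, 2, 3, 4, 5}</code> and <code>y = {2, 4, 6, 8, 10}</code> the sum of cross-products is 20, so the bias-corrected covariance is 20 / 4 = 5 and the uncorrected one is 20 / 5 = 4. Because <code>y</code> is an exact linear function of <code>x</code>, the correlation is 1 up to rounding.
</p>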
|
|
765 <p><strong>Examples:</strong><dl><dt><strong>Covariance of 2 arrays</strong></dt>
|
|
766 <br />
|
|
767 </br><dd>To compute the unbiased covariance between 2 double arrays,
|
|
768 <code>x</code> and <code>y</code>, use:
|
|
769 <div class="source"><pre>
|
|
770 new Covariance().covariance(x, y)
|
|
771 </pre>
|
|
772 </div>
|
|
773
|
|
774 For non-bias-corrected covariances, use
|
|
775 <div class="source"><pre>
|
|
776 new Covariance().covariance(x, y, false)
|
|
777 </pre>
|
|
778 </div>
|
|
779 </dd>
|
|
780 <br />
|
|
781 </br><dt><strong>Covariance matrix</strong></dt>
|
|
782 <br />
|
|
783 </br><dd> A covariance matrix over the columns of a source matrix <code>data</code>
|
|
784 can be computed using
|
|
785 <div class="source"><pre>
|
|
786 new Covariance().computeCovarianceMatrix(data)
|
|
787 </pre>
|
|
788 </div>
|
|
789
|
|
790 The i-jth entry of the returned matrix is the unbiased covariance of the ith and jth
|
|
791 columns of <code>data.</code> As above, to get non-bias-corrected covariances,
|
|
792 use
|
|
793 <div class="source"><pre>
|
|
794 computeCovarianceMatrix(data, false)
|
|
795 </pre>
|
|
796 </div>
|
|
797 </dd>
|
|
798 <br />
|
|
799 </br><dt><strong>Pearson's correlation of 2 arrays</strong></dt>
|
|
800 <br />
|
|
801 </br><dd>To compute the Pearson's product-moment correlation between two double arrays
|
|
802 <code>x</code> and <code>y</code>, use:
|
|
803 <div class="source"><pre>
|
|
804 new PearsonsCorrelation().correlation(x, y)
|
|
805 </pre>
|
|
806 </div>
|
|
807 </dd>
|
|
808 <br />
|
|
809 </br><dt><strong>Pearson's correlation matrix</strong></dt>
|
|
810 <br />
|
|
811 </br><dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code>
|
|
812 can be computed using
|
|
813 <div class="source"><pre>
|
|
814 new PearsonsCorrelation().computeCorrelationMatrix(data)
|
|
815 </pre>
|
|
816 </div>
|
|
817
|
|
818 The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the
|
|
819 ith and jth columns of <code>data.</code></dd>
|
|
820 <br />
|
|
821 </br><dt><strong>Pearson's correlation significance and standard errors</strong></dt>
|
|
822 <br />
|
|
823 </br><dd> To compute standard errors and/or significances of correlation coefficients
|
|
824 associated with Pearson's correlation coefficients, start by creating a
|
|
825 <code>PearsonsCorrelation</code> instance
|
|
826 <div class="source"><pre>
|
|
827 PearsonsCorrelation correlation = new PearsonsCorrelation(data);
|
|
828 </pre>
|
|
829 </div>
|
|
830
|
|
831 where <code>data</code> is either a rectangular array or a <code>RealMatrix.</code>
|
|
832 Then the matrix of standard errors is
|
|
833 <div class="source"><pre>
|
|
834 correlation.getCorrelationStandardErrors();
|
|
835 </pre>
|
|
836 </div>
|
|
837
|
|
838 The formula used to compute the standard error is <br />
|
|
839 <code>SE<sub>r</sub> = ((1 - r<sup>2</sup>) / (n - 2))<sup>1/2</sup></code><br />
|
|
840
|
|
841 where <code>r</code> is the estimated correlation coefficient and
|
|
842 <code>n</code> is the number of observations in the source dataset.<br />
|
|
843 <br />
|
|
844 <strong>p-values</strong> for the (2-sided) null hypotheses that elements of
|
|
845 a correlation matrix are zero populate the RealMatrix returned by
|
|
846 <div class="source"><pre>
|
|
847 correlation.getCorrelationPValues()
|
|
848 </pre>
|
|
849 </div>
|
|
850 <code>getCorrelationPValues().getEntry(i,j)</code> is the
|
|
851 probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes
|
|
852 a value with absolute value greater than or equal to <br />
|
|
853 </br><code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>,
|
|
854 where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth
|
|
855 columns of the source array or RealMatrix. This is sometimes referred to as the
|
|
856 <i>significance</i> of the coefficient.<br />
|
|
857 <br />
|
|
858
|
|
859 For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then
|
|
860 <div class="source"><pre>
|
|
861 new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1)
|
|
862 </pre>
|
|
863 </div>
|
|
864
|
|
865 is the significance of the Pearson's correlation coefficient between the two columns
|
|
866 of <code>data</code>. If this value is less than .01, we can say that the correlation
|
|
867 between the two columns of data is significant at the 99% level.
|
|
868 </dd>
|
|
869 <br />
|
|
870 </br><dt><strong>Spearman's rank correlation coefficient</strong></dt>
|
|
871 <br />
|
|
872 </br><dd>To compute Spearman's rank correlation between two double arrays
|
|
873 <code>x</code> and <code>y</code>:
|
|
874 <div class="source"><pre>
|
|
875 new SpearmansCorrelation().correlation(x, y)
|
|
876 </pre>
|
|
877 </div>
|
|
878
|
|
879 This is equivalent to
|
|
880 <div class="source"><pre>
|
|
881 RankingAlgorithm ranking = new NaturalRanking();
|
|
882 new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
|
|
883 </pre>
|
|
884 </div>
|
|
885 </dd>
|
|
886 <br />
|
|
887 </br></dl>
|
|
888 </p>
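<p>
The standard error and the test statistic behind the p-values described above are simple functions of <code>r</code> and <code>n</code>. The sketch below evaluates both in plain Java. The class and method names are hypothetical. It stops at the t statistic, because turning it into a p-value needs the cumulative distribution function of the <code>t<sub>n-2</sub></code> distribution, which the standard Java library does not provide.
</p>

```java
// Illustrative only: hypothetical helper that evaluates the two formulas quoted above.
class CorrelationSignificanceSketch {

    /** SE_r = sqrt((1 - r^2) / (n - 2)). */
    static double standardError(double r, int n) {
        return Math.sqrt((1 - r * r) / (n - 2));
    }

    /** |r| * sqrt((n - 2) / (1 - r^2)), the value compared against the t distribution with n - 2 degrees of freedom. */
    static double tStatistic(double r, int n) {
        return Math.abs(r) * Math.sqrt((n - 2) / (1 - r * r));
    }

    public static void main(String[] args) {
        double r = 0.6; // assumed sample correlation
        int n = 18;     // assumed number of observations
        System.out.println(standardError(r, n));
        System.out.println(tStatistic(r, n));
    }
}
```

<p>
For <code>r = 0.6</code> and <code>n = 18</code>, <code>1 - r<sup>2</sup> = 0.64</code>, so the standard error is <code>(0.64 / 16)<sup>1/2</sup> = 0.2</code> and the statistic is <code>0.6 * (16 / 0.64)<sup>1/2</sup> = 3</code>. The statistic is simply <code>|r|</code> divided by the standard error.
</p>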
|
|
889 </div>
|
|
890 <div class="section"><h3><a name="a1.8_Statistical_tests"></a>1.8 Statistical tests</h3>
|
|
891 <p>
|
|
892 The interfaces and implementations in the
|
|
893 <a href="../apidocs/org/apache/commons/math/stat/inference/">
|
|
894 org.apache.commons.math.stat.inference</a> package provide
|
|
895 <a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc22.htm" class="externalLink">
|
|
896 Student's t</a>,
|
|
897 <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm" class="externalLink">
|
|
898 Chi-Square</a> and
|
|
899 <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm" class="externalLink">
|
|
900 One-Way ANOVA</a> test statistics as well as
|
|
901 <a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue" class="externalLink">
|
|
902 p-values</a> associated with <code>t-</code>,
|
|
903 <code>Chi-Square</code> and <code>One-Way ANOVA</code> tests. The
|
|
904 interfaces are
|
|
905 <a href="../apidocs/org/apache/commons/math/stat/inference/TTest.html">
|
|
906 TTest</a>,
|
|
907 <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTest.html">
|
|
908 ChiSquareTest</a>, and
|
|
909 <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnova.html">
|
|
910 OneWayAnova</a> with provided implementations
|
|
911 <a href="../apidocs/org/apache/commons/math/stat/inference/TTestImpl.html">
|
|
912 TTestImpl</a>,
|
|
913 <a href="../apidocs/org/apache/commons/math/stat/inference/ChiSquareTestImpl.html">
|
|
914 ChiSquareTestImpl</a> and
|
|
915 <a href="../apidocs/org/apache/commons/math/stat/inference/OneWayAnovaImpl.html">
|
|
916 OneWayAnovaImpl</a>, respectively.
|
|
917 The
|
|
918 <a href="../apidocs/org/apache/commons/math/stat/inference/TestUtils.html">
|
|
919 TestUtils</a> class provides static methods to get test instances or
|
|
920 to compute test statistics directly. The examples below all use the
|
|
921 static methods in <code>TestUtils</code> to execute tests. To get
|
|
922 test object instances, either use e.g.,
|
|
923 <code>TestUtils.getTTest()</code> or use the implementation constructors
|
|
924 directly, e.g.,
|
|
925 <code>new TTestImpl()</code>.
|
|
926 </p>
|
|
927 <p><strong>Implementation Notes</strong><ul><li>Both one- and two-sample t-tests are supported. Two sample tests
|
|
928 can be either paired or unpaired and the unpaired two-sample tests can
|
|
929 be conducted under the assumption of equal subpopulation variances or
|
|
930 without this assumption. When equal variances are assumed, a pooled
|
|
931 variance estimate is used to compute the t-statistic and the degrees
|
|
932 of freedom used in the t-test equals the sum of the sample sizes minus 2.
|
|
933 When equal variances are not assumed, the t-statistic uses both sample
|
|
934 variances and the
|
|
935 <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/gifs/nu3.gif" class="externalLink">
|
|
936 Welch-Satterthwaite approximation</a> is used to compute the degrees
|
|
937 of freedom. Methods to return t-statistics and p-values are provided in each
|
|
938 case, as well as boolean-valued methods to perform fixed significance
|
|
939 level tests. The names of test or test-statistic methods that assume equal
|
|
940 subpopulation variances always start with "homoscedastic." Test or
|
|
941 test-statistic methods that just start with "t" do not assume equal
|
|
942 variances. See the examples below and the API documentation for
|
|
943 more details.</li>
|
|
944 <li>The validity of the p-values returned by the t-test depends on the
|
|
945 assumptions of the parametric t-test procedure, as discussed
|
|
946 <a href="http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html" class="externalLink">
|
|
947 here</a>.</li>
|
|
948 <li>p-values returned by t-, chi-square and Anova tests are exact, based
|
|
949 on numerical approximations to the t-, chi-square and F distributions in the
|
|
950 <code>distribution</code> package.</li>
|
|
951 <li>p-values returned by t-tests are for two-sided tests and the boolean-valued
|
|
952 methods supporting fixed significance level tests assume that the hypotheses
|
|
953 are two-sided. One-sided tests can be performed by dividing returned p-values
|
|
954 (resp. critical values) by 2.</li>
|
|
955 <li>Degrees of freedom for chi-square tests are integral values, based on the
|
|
956 number of observed or expected counts (number of observed counts - 1)
|
|
957 for the goodness-of-fit tests and (number of columns -1) * (number of rows - 1)
|
|
958 for independence tests.</li>
|
|
959 </ul>
|
|
960 </p>
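<p>
The difference between the two unpaired variants described in the first note comes down to how the variance term and the degrees of freedom are formed. The plain-Java sketch below works from summary values (means, variances, sample sizes) and is only an illustration of the textbook formulas. The class <code>TwoSampleTSketch</code> and its method names are hypothetical and are not the library's API.
</p>

```java
// Illustrative only: hypothetical helper showing the two unpaired two-sample variants.
class TwoSampleTSketch {

    /** t-statistic without assuming equal variances: both sample variances are used. */
    static double welchT(double m1, double m2, double v1, double v2, double n1, double n2) {
        return (m1 - m2) / Math.sqrt(v1 / n1 + v2 / n2);
    }

    /** Welch-Satterthwaite approximation to the degrees of freedom. */
    static double welchDegreesOfFreedom(double v1, double v2, double n1, double n2) {
        double a = v1 / n1;
        double b = v2 / n2;
        return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
    }

    /** t-statistic under the equal-variance assumption, using a pooled variance estimate. */
    static double pooledT(double m1, double m2, double v1, double v2, double n1, double n2) {
        double pooledVariance = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2);
        return (m1 - m2) / Math.sqrt(pooledVariance * (1 / n1 + 1 / n2));
    }

    /** Degrees of freedom under the equal-variance assumption: sum of sample sizes minus 2. */
    static double pooledDegreesOfFreedom(double n1, double n2) {
        return n1 + n2 - 2;
    }

    public static void main(String[] args) {
        // Assumed summary values: means 5 and 3, variances 4 and 4, sample sizes 8 and 8.
        System.out.println(welchT(5, 3, 4, 4, 8, 8));
        System.out.println(pooledT(5, 3, 4, 4, 8, 8));
        System.out.println(welchDegreesOfFreedom(4, 4, 10, 10));
        System.out.println(pooledDegreesOfFreedom(10, 10));
    }
}
```

<p>
When the two samples have the same size and the same variance, both variants give the same t-statistic and the Welch-Satterthwaite value coincides with <code>n1 + n2 - 2</code>. The results diverge as the variances or sample sizes become unequal.
</p>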
|
|
961 <p><strong>Examples:</strong><dl><dt><strong>One-sample <code>t</code> tests</strong></dt>
|
|
962 <br />
|
|
963 </br><dd>To compare the mean of a double[] array to a fixed value:
|
|
964 <div class="source"><pre>
|
|
965 double[] observed = {1d, 2d, 3d};
|
|
966 double mu = 2.5d;
|
|
967 System.out.println(TestUtils.t(mu, observed));
|
|
968 </pre>
|
|
969 </div>
|
|
970
|
|
971 The code above will display the t-statistic associated with a one-sample
|
|
972 t-test comparing the mean of the <code>observed</code> values against
|
|
973 <code>mu.</code></dd>
|
|
974 <dd>To compare the mean of a dataset described by a
|
|
975 <a href="../apidocs/org/apache/commons/math/stat/descriptive/StatisticalSummary.html">
|
|
976 org.apache.commons.math.stat.descriptive.StatisticalSummary</a> to a fixed value:
|
|
977 <div class="source"><pre>
|
|
978 double[] observed ={1d, 2d, 3d};
|
|
979 double mu = 2.5d;
|
|
980 SummaryStatistics sampleStats = new SummaryStatistics();
|
|
981 for (int i = 0; i < observed.length; i++) {
|
|
982 sampleStats.addValue(observed[i]);
|
|
983 }
|
|
984 System.out.println(TestUtils.t(mu, sampleStats));
|
|
985 </pre>
|
|
986 </div>
|
|
987 </dd>
|
|
988 <dd>To compute the p-value associated with the null hypothesis that the mean
|
|
989 of a set of values equals a point estimate, against the two-sided alternative that
|
|
990 the mean is different from the target value:
|
|
991 <div class="source"><pre>
|
|
992 double[] observed = {1d, 2d, 3d};
|
|
993 double mu = 2.5d;
|
|
994 System.out.println(TestUtils.tTest(mu, observed));
|
|
995 </pre>
|
|
996 </div>
|
|
997
|
|
998 The snippet above will display the p-value associated with the null
|
|
999 hypothesis that the mean of the population from which the
|
|
1000 <code>observed</code> values are drawn equals <code>mu.</code></dd>
|
|
1001 <dd>To perform the test using a fixed significance level, use:
|
|
1002 <div class="source"><pre>
|
|
1003 TestUtils.tTest(mu, observed, alpha);
|
|
1004 </pre>
|
|
1005 </div>
|
|
1006
|
|
1007 where <code>0 < alpha < 0.5</code> is the significance level of
|
|
1008 the test. The boolean value returned will be <code>true</code> iff the
|
|
1009 null hypothesis can be rejected with confidence <code>1 - alpha</code>.
|
|
1010 To test, for example at the 95% level of confidence, use
|
|
1011 <code>alpha = 0.05</code>.</dd>
|
|
1012 <br />
|
|
1013 </br><dt><strong>Two-Sample t-tests</strong></dt>
|
|
1014 <br />
|
|
1015 </br><dd><strong>Example 1:</strong> Paired test evaluating
|
|
1016 the null hypothesis that the mean difference between corresponding
|
|
1017 (paired) elements of the <code>double[]</code> arrays
|
|
1018 <code>sample1</code> and <code>sample2</code> is zero.
|
|
1019
|
|
1020 To compute the t-statistic:
|
|
1021 <div class="source"><pre>
|
|
1022 TestUtils.pairedT(sample1, sample2);
|
|
1023 </pre>
|
|
1024 </div>
|
|
1025 <p>
|
|
1026 To compute the p-value:
|
|
1027 <div class="source"><pre>
|
|
1028 TestUtils.pairedTTest(sample1, sample2);
|
|
1029 </pre>
|
|
1030 </div>
|
|
1031 </p>
|
|
1032 <p>
|
|
1033 To perform a fixed significance level test with alpha = .05:
|
|
1034 <div class="source"><pre>
|
|
1035 TestUtils.pairedTTest(sample1, sample2, .05);
|
|
1036 </pre>
|
|
1037 </div>
|
|
1038 </p>
|
|
1039
|
|
1040 The last example will return <code>true</code> iff the p-value
|
|
1041 returned by <code>TestUtils.pairedTTest(sample1, sample2)</code>
|
|
1042 is less than <code>.05</code>.</dd>
|
|
1043 <dd><strong>Example 2: </strong> unpaired, two-sided, two-sample t-test using
|
|
1044 <code>StatisticalSummary</code> instances, without assuming that
|
|
1045 subpopulation variances are equal.
|
|
1046
|
|
1047 First create the <code>StatisticalSummary</code> instances. Both
|
|
1048 <code>DescriptiveStatistics</code> and <code>SummaryStatistics</code>
|
|
1049 implement this interface. Assume that <code>summary1</code> and
|
|
1050 <code>summary2</code> are <code>SummaryStatistics</code> instances,
|
|
1051 each of which has had at least 2 values added to the (virtual) dataset that
|
|
1052 it describes. The sample sizes do not have to be the same -- all that is required
|
|
1053 is that both samples have at least 2 elements.
|
|
1054 <p><strong>Note:</strong> The <code>SummaryStatistics</code> class does
|
|
1055 not store the dataset that it describes in memory, but it does compute all
|
|
1056 statistics necessary to perform t-tests, so this method can be used to
|
|
1057 conduct t-tests with very large samples. One-sample tests can also be
|
|
1058 performed this way.
|
|
1059 (See <a href="#a1.2_Descriptive_statistics">Descriptive statistics</a> for details
|
|
1060 on the <code>SummaryStatistics</code> class.)
|
|
1061 </p>
|
|
1062 <p>
|
|
1063 To compute the t-statistic:
|
|
1064 <div class="source"><pre>
|
|
1065 TestUtils.t(summary1, summary2);
|
|
1066 </pre>
|
|
1067 </div>
|
|
1068 </p>
|
|
1069 <p>
|
|
1070 To compute the p-value:
|
|
1071 <div class="source"><pre>
|
|
1072 TestUtils.tTest(summary1, summary2);
|
|
1073 </pre>
|
|
1074 </div>
|
|
1075 </p>
|
|
1076 <p>
|
|
1077 To perform a fixed significance level test with alpha = .05:
|
|
1078 <div class="source"><pre>
|
|
1079 TestUtils.tTest(summary1, summary2, .05);
|
|
1080 </pre>
|
|
1081 </div>
|
|
1082 </p>
|
|
1083 <p>
|
|
1084 In each case above, the test does not assume that the subpopulation
|
|
1085 variances are equal. To perform the tests under this assumption,
|
|
1086 replace "t" at the beginning of the method name with "homoscedasticT".
|
|
1087 </p>
|
|
1088 </dd>
|
|
1089 <br />
|
|
1090 </br><dt><strong>Chi-square tests</strong></dt>
|
|
1091 <br />
|
|
1092 </br><dd>To compute a chi-square statistic measuring the agreement between a
|
|
1093 <code>long[]</code> array of observed counts and a <code>double[]</code>
|
|
1094 array of expected counts, use:
|
|
1095 <div class="source"><pre>
|
|
1096 long[] observed = {10, 9, 11};
|
|
1097 double[] expected = {10.1, 9.8, 10.3};
|
|
1098 System.out.println(TestUtils.chiSquare(expected, observed));
|
|
1099 </pre>
|
|
1100 </div>
|
|
1101
|
|
1102 the value displayed will be
|
|
1103 <code>sum((expected[i] - observed[i])^2 / expected[i])</code></dd>
|
|
1104 <dd> To get the p-value associated with the null hypothesis that
|
|
1105 <code>observed</code> conforms to <code>expected</code> use:
|
|
1106 <div class="source"><pre>
|
|
1107 TestUtils.chiSquareTest(expected, observed);
|
|
1108 </pre>
|
|
1109 </div>
|
|
1110 </dd>
|
|
1111 <dd> To test the null hypothesis that <code>observed</code> conforms to
|
|
1112 <code>expected</code> with <code>alpha</code> significance level
|
|
1113 (equiv. <code>100 * (1-alpha)%</code> confidence) where <code>
|
|
1114 0 < alpha < 1 </code> use:
|
|
1115 <div class="source"><pre>
|
|
1116 TestUtils.chiSquareTest(expected, observed, alpha);
|
|
1117 </pre>
|
|
1118 </div>
|
|
1119
|
|
1120 The boolean value returned will be <code>true</code> iff the null hypothesis
|
|
1121 can be rejected with confidence <code>1 - alpha</code>.
|
|
1122 </dd>
|
|
1123 <dd>To compute a chi-square statistic associated with a
|
|
1124 <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm" class="externalLink">
|
|
1125 chi-square test of independence</a> based on a two-dimensional (long[][])
|
|
1126 <code>counts</code> array viewed as a two-way table, use:
|
|
1127 <div class="source"><pre>
|
|
1128 TestUtils.chiSquare(counts);
|
|
1129 </pre>
|
|
1130 </div>
|
|
1131
|
|
1132 The rows of the 2-way table are
|
|
1133 <code>counts[0], ... , counts[counts.length - 1]. </code><br />
|
|
1134 </br>
|
|
1135 The chi-square statistic returned is
|
|
1136 <code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code>
|
|
1137 where the sum is taken over all table entries and
|
|
1138 <code>expected[i][j]</code> is the product of the row and column sums at
|
|
1139 row <code>i</code>, column <code>j</code> divided by the total count.
|
|
1140 </dd>
|
|
1141 <dd>To compute the p-value associated with the null hypothesis that
|
|
1142 the classifications represented by the counts in the columns of the input 2-way
|
|
1143 table are independent of the rows, use:
|
|
1144 <div class="source"><pre>
|
|
1145 TestUtils.chiSquareTest(counts);
|
|
1146 </pre>
|
|
1147 </div>
|
|
1148 </dd>
|
|
1149 <dd>To perform a chi-square test of independence with <code>alpha</code>
|
|
1150 significance level (equiv. <code>100 * (1-alpha)%</code> confidence)
|
|
1151 where <code>0 < alpha < 1 </code> use:
|
|
1152 <div class="source"><pre>
|
|
1153 TestUtils.chiSquareTest(counts, alpha);
|
|
1154 </pre>
|
|
1155 </div>
|
|
1156
|
|
1157 The boolean value returned will be <code>true</code> iff the null
|
|
1158 hypothesis can be rejected with confidence <code>1 - alpha</code>.
|
|
1159 </dd>
|
|
1160 <br />
|
|
1161 </br><dt><strong>One-Way Anova tests</strong></dt>
|
|
1162 <br />
|
|
1163 </br><dd>To conduct a One-Way Analysis of Variance (ANOVA) to evaluate the
|
|
1164 null hypothesis that the means of a collection of univariate datasets
|
|
1165 are the same, start by loading the datasets into a collection, e.g.
|
|
1166 <div class="source"><pre>
|
|
1167 double[] classA =
|
|
1168 {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 };
|
|
1169 double[] classB =
|
|
1170 {99.0, 92.0, 102.0, 100.0, 102.0, 89.0 };
|
|
1171 double[] classC =
|
|
1172 {110.0, 115.0, 111.0, 117.0, 128.0, 117.0 };
|
|
1173 List classes = new ArrayList();
|
|
1174 classes.add(classA);
|
|
1175 classes.add(classB);
|
|
1176 classes.add(classC);
|
|
1177 </pre>
|
|
1178 </div>
|
|
1179
|
|
1180 Then you can compute ANOVA F- or p-values associated with the
|
|
1181 null hypothesis that the class means are all the same
|
|
1182 using a <code>OneWayAnova</code> instance or <code>TestUtils</code>
|
|
1183 methods:
|
|
1184 <div class="source"><pre>
|
|
1185 double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value
|
|
1186 double pValue = TestUtils.oneWayAnovaPValue(classes); // P-value
|
|
1187 </pre>
|
|
1188 </div>
|
|
1189
|
|
1190 To perform a One-Way ANOVA test with the significance level set at 0.01
|
|
1191 (so the test will, assuming assumptions are met, reject the null
|
|
1192 hypothesis incorrectly only about one in 100 times), use
|
|
1193 <div class="source"><pre>
|
|
1194 TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean
|
|
1195 // true means reject null hypothesis
|
|
1196 </pre>
|
|
1197 </div>
|
|
1198 </dd>
|
|
1199 </dl>
|
|
1200 </p>
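<p>
The two chi-square statistics quoted in the examples can be evaluated by hand. The plain-Java sketch below implements the formulas exactly as they are written in this section. It is not the library code, so it may not reproduce library output in every case (for example, it applies no rescaling when the observed and expected totals differ). The class <code>ChiSquareFormulaSketch</code> and its method names are hypothetical.
</p>

```java
// Illustrative only: hypothetical helper that evaluates the formulas as written above.
class ChiSquareFormulaSketch {

    /** Goodness of fit: sum((expected[i] - observed[i])^2 / expected[i]). */
    static double goodnessOfFit(double[] expected, long[] observed) {
        double sum = 0;
        for (int i = 0; i < observed.length; i++) {
            double diff = expected[i] - observed[i];
            sum += diff * diff / expected[i];
        }
        return sum;
    }

    /** Independence: expected[i][j] = rowSum[i] * colSum[j] / total, summed over all cells. */
    static double independence(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double sum = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                double diff = counts[i][j] - expected;
                sum += diff * diff / expected;
            }
        }
        return sum;
    }

    /** Degrees of freedom for the independence test: (columns - 1) * (rows - 1). */
    static int independenceDegreesOfFreedom(long[][] counts) {
        return (counts[0].length - 1) * (counts.length - 1);
    }

    public static void main(String[] args) {
        System.out.println(goodnessOfFit(new double[] { 15, 15 }, new long[] { 10, 20 }));
        long[][] table = { { 10, 20 }, { 20, 10 } };
        System.out.println(independence(table));
        System.out.println(independenceDegreesOfFreedom(table));
    }
}
```

<p>
In the 2-by-2 table <code>{{10, 20}, {20, 10}}</code> every row and column sums to 30 and the total is 60, so each expected count is 15 and each cell contributes 25 / 15 to the statistic, giving 100 / 15 with one degree of freedom.
</p>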
|
|
1201 </div>
|
|
1202 </div>
|
|
1203
|
|
1204 </div>
|
|
1205 </div>
|
|
1206 <div class="clear">
|
|
1207 <hr/>
|
|
1208 </div>
|
|
1209 <div id="footer">
|
|
1210 <div class="xright">©
|
|
1211 2003-2010
|
|
1212
|
|
1213
|
|
1214
|
|
1215
|
|
1216
|
|
1217
|
|
1218
|
|
1219
|
|
1220
|
|
1221 </div>
|
|
1222 <div class="clear">
|
|
1223 <hr/>
|
|
1224 </div>
|
|
1225 </div>
|
|
1226 </body>
|
|
1227 </html>
|