Mercurial > hg > LGDataverses
comparison DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html @ 6:1b2188262ae9
adding the installer.
author | "jurzua <jurzua@mpiwg-berlin.mpg.de>" |
---|---|
date | Wed, 13 May 2015 11:50:21 +0200 |
parents | |
children |
5:dd9adfc73390 | 6:1b2188262ae9 |
---|---|
1 | |
2 | |
3 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" | |
4 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> | |
5 | |
6 | |
7 <html xmlns="http://www.w3.org/1999/xhtml"> | |
8 <head> | |
9 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> | |
10 | |
11 <title>Ingest of R (.RData) files — The Harvard Dataverse Network 3.6.1 documentation</title> | |
12 | |
13 <link rel="stylesheet" href="_static/agogo.css" type="text/css" /> | |
14 <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> | |
15 | |
16 <script type="text/javascript"> | |
17 var DOCUMENTATION_OPTIONS = { | |
18 URL_ROOT: './', | |
19 VERSION: '3.6.1', | |
20 COLLAPSE_INDEX: false, | |
21 FILE_SUFFIX: '.html', | |
22 HAS_SOURCE: true | |
23 }; | |
24 </script> | |
25 <script type="text/javascript" src="_static/jquery.js"></script> | |
26 <script type="text/javascript" src="_static/underscore.js"></script> | |
27 <script type="text/javascript" src="_static/doctools.js"></script> | |
28 <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> | |
29 <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" /> | |
30 </head> | |
31 <body> | |
32 <div class="header-wrapper"> | |
33 <div class="header"> | |
34 <div class="headertitle"><a | |
35 href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div> | |
36 <div class="rel"> | |
37 <a href="genindex.html" title="General Index" | |
38 accesskey="I">index</a> | |
39 </div> | |
40 </div> | |
41 </div> | |
42 | |
43 <div class="content-wrapper"> | |
44 <div class="content"> | |
45 <div class="document"> | |
46 | |
47 <div class="documentwrapper"> | |
48 <div class="bodywrapper"> | |
49 <div class="body"> | |
50 | |
51 <div class="section" id="ingest-of-r-rdata-files"> | |
52 <h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1> | |
53 <div class="section" id="overview"> | |
54 <h2>Overview.<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2> | |
55 <p>Support for ingesting R data files was added in version 3.5. R | |
56 has become increasingly popular in the research/academic community, | |
57 owing to the fact that it is free and open-source (unlike SPSS and | |
58 Stata). Consequently, more and more data is becoming available | |
59 exclusively as R data files. This long-awaited feature makes it | |
60 possible to ingest such data into DVN as “subsettable” files.</p> | |
61 </div> | |
62 <div class="section" id="requirements"> | |
63 <h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2> | |
64 <p>R ingest relies on R having been installed, configured and made | |
65 available to the DVN application via Rserve (see the Installers | |
66 Guide). This is in contrast to SPSS and Stata ingest, which can | |
67 be performed without R present (though R is still needed to perform | |
68 most subsetting/analysis tasks on the resulting data files).</p> | |
69 <p>The data must be formatted as an R data frame (data.frame()). If an | |
70 .RData file contains multiple data frames, only the first one will be | |
71 ingested.</p> | |
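<p>Since only the first data frame in a file is ingested, it may help to save exactly one data frame per .RData file. The following is a minimal sketch, not part of the DVN itself; the object name <em>survey</em>, its columns and the file name are made up for illustration.</p>

```r
# A small data frame; the name, columns and values are illustrative only.
survey <- data.frame(
  id     = 1:3,
  age    = c(34L, NA, 51L),
  income = c(42000.5, 38100, NaN),
  stringsAsFactors = FALSE
)

# Save only this one object, so the file holds exactly one data frame.
rdata_path <- file.path(tempdir(), "survey.RData")
save(survey, file = rdata_path)

# Verify the contents by loading the file into a fresh environment.
check_env <- new.env()
loaded <- load(rdata_path, envir = check_env)
is_df <- vapply(
  loaded,
  function(n) is.data.frame(get(n, envir = check_env)),
  logical(1)
)
print(loaded[is_df])  # names of the data frames found in the file
```

<p>Loading into a separate environment keeps the check from overwriting objects in the current workspace.</p>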
72 </div> | |
73 <div class="section" id="data-types-compared-to-other-supported-formats-stat-spss"> | |
74 <h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2> | |
75 <div class="section" id="integers-doubles-character-strings"> | |
76 <h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3> | |
77 <p>The handling of these types is intuitive and straightforward. The | |
78 resulting tab file columns, summary statistics and UNF signatures | |
79 should be identical to those produced by ingesting the same vectors | |
80 from SPSS and Stata.</p> | |
81 <p><strong>A couple of things that are unique to R/new in DVN:</strong></p> | |
82 <p>R explicitly supports Missing Values for all of the types above; | |
83 Missing Values encoded in R vectors will be recognized and preserved | |
84 in TAB files (as ‘NA’), and counted in the generated summary statistics | |
85 and data analysis.</p> | |
86 <p>In addition to Missing Values, R recognizes “Not a Number” (NaN) and | |
87 positive and negative infinity for floating point variables. These | |
88 are now properly supported by the DVN.</p> | |
89 <p>Also note that, unlike Stata, which recognizes “float” and “double” | |
90 as distinct data types, R stores all floating point values in | |
91 double precision.</p> | |
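<p>The distinction between NA, NaN and the infinities can be checked in R before uploading. This short sketch uses only base R functions; the vectors are made-up examples and it says nothing about how the DVN processes them internally.</p>

```r
# A double vector mixing a regular value, NA, NaN and both infinities.
x <- c(1.5, NA, NaN, Inf, -Inf)

is.na(x)        # TRUE for NA and also for NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE only for Inf and -Inf

# Missing values exist for the other basic types as well.
is.na(c(10L, NA))    # integer
is.na(c("a", NA))    # character
is.na(c(TRUE, NA))   # logical

# R has a single floating point type: double.
typeof(x)
```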
92 </div> | |
93 <div class="section" id="r-factors"> | |
94 <h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3> | |
95 <p>These are ingested as “Categorical Values” in the DVN.</p> | |
96 <p>One thing to keep in mind: in both Stata and SPSS, the actual value of | |
97 a categorical variable can be both character and numeric. In R, all | |
98 factor values are strings, even if they are string representations of | |
99 numbers. So the values of the resulting categoricals in the DVN will | |
100 always be of string type too.</p> | |
101 <div class="line-block"> | |
102 <div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an “Ordered Categorical” - a categorical value where an explicit order is assigned to the list of value labels.</div> | |
103 </div> | |
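<p>A brief sketch of the two points above, using base R only: factor levels are always strings, and an ordered factor carries an explicit order for its levels. The variable names and values are made up for illustration.</p>

```r
# Levels are stored as character strings, even when they look numeric.
codes <- factor(c("1", "2", "2", "10"))
levels(codes)
is.character(levels(codes))

# An ordered factor: the order of the levels is given explicitly.
rating <- factor(
  c("low", "high", "medium"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)
is.ordered(rating)
is.ordered(codes)

# Comparisons follow the declared level order, not alphabetical order.
rating < "high"
```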
104 </div> | |
105 <div class="section" id="new-boolean-values"> | |
106 <h3>(New!) Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3> | |
107 <p>R Boolean (logical) values are supported.</p> | |
108 </div> | |
109 <div class="section" id="limitations-of-r-as-compared-to-spss-and-stata"> | |
110 <h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3> | |
111 <p>Most noticeably, R lacks a standard mechanism for defining descriptive | |
112 labels for the data frame variables. In the DVN, similarly to | |
113 both Stata and SPSS, variables have distinct names and labels; with | |
114 the latter reserved for longer, descriptive text. | |
115 With variables ingested from R data frames the variable name will be | |
116 used for both the “name” and the “label”.</p> | |
117 <div class="line-block"> | |
118 <div class="line"><em>Optional R packages exist for providing descriptive variable labels; | |
119 in one of the future versions support may be added for such a | |
120 mechanism. It would of course work only for R files that were | |
121 created with such optional packages</em>.</div> | |
122 </div> | |
123 <p>Similarly, R categorical values (factors) lack descriptive labels too. | |
124 <strong>Note:</strong> This is potentially confusing, since R factors do | |
125 actually have “labels”. This is a matter of terminology - an R | |
126 factor’s label is in fact the same thing as the “value” of a | |
127 categorical variable in SPSS or Stata and DVN; it contains the actual | |
128 meaningful data for the given observation. It is NOT a field reserved | |
129 for explanatory, human-readable text, as is the case with the | |
130 SPSS/Stata “label”.</p> | |
131 <p>Ingesting an R factor with the level labels “MALE” and “FEMALE” will | |
132 produce a categorical variable with “MALE” and “FEMALE” as both the | |
133 values and the labels.</p> | |
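<p>To illustrate the terminology, here is a small sketch of how such a factor might be built in R. The numeric input codes are an assumption made for the example; once the factor is created, the strings given as <em>labels</em> are the actual values it holds.</p>

```r
# Build a factor from made-up numeric codes, attaching "labels" to the levels.
sex <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("MALE", "FEMALE"))

levels(sex)        # the R "labels" have become the factor's levels
as.character(sex)  # the data values themselves are these strings
as.integer(sex)    # internal level indices, not the original codes as such
```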
134 </div> | |
135 </div> | |
136 <div class="section" id="time-values-in-r"> | |
137 <h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2> | |
138 <p>This warrants a dedicated section of its own, because of some unique | |
139 ways in which time values are handled in R.</p> | |
140 <p>R makes an effort to treat a time value as a real time instance. This | |
141 is in contrast with either SPSS or Stata, where time value | |
142 representations such as “Sep-23-2013 14:57:21” are allowed; note that | |
143 in the absence of an explicitly defined time zone, this value cannot | |
144 be mapped to an exact point in real time. R handles times in the | |
145 “Unix-style” way: the value is converted to the | |
146 “seconds-since-the-Epoch” Greenwich time (GMT or UTC) and the | |
147 resulting numeric value is stored in the data file; time zone | |
148 adjustments are made in real time as needed.</p> | |
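<p>The following sketch shows the stored numeric value and how the same instant is rendered for different time zones. It assumes the zone name “Asia/Tokyo” is available in the time zone database of the system where R runs, which is common but not guaranteed.</p>

```r
# Create a time value with an explicit UTC time zone.
t_utc <- as.POSIXct("2013-09-23 14:57:21", tz = "UTC")

# The underlying value is a number of seconds since 1970-01-01 00:00:00 UTC.
secs <- as.numeric(t_utc)
print(secs)

# The same instant, rendered for two different time zones.
format(t_utc, tz = "UTC", usetz = TRUE)
format(t_utc, tz = "Asia/Tokyo", usetz = TRUE)

# A wall-clock time given in Tokyo that refers to the same instant.
t_tokyo <- as.POSIXct("2013-09-23 23:57:21", tz = "Asia/Tokyo")
as.numeric(t_tokyo) == secs
```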
149 <p>Things still get ambiguous and confusing when R <strong>displays</strong> this time | |
150 value: unless the time zone was explicitly defined, R will adjust the | |
151 value to the current time zone. The resulting behavior is often | |
152 counter-intuitive: if you create a time value, for example:</p> | |
153 <blockquote> | |
154 <div>timevalue <- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS")</div></blockquote> | |
155 <p>on a computer configured for the San Francisco time zone, the value | |
156 will be displayed differently on computers in different time zones; | |
157 for example, as “12:57 PST” while still on the West Coast, but as | |
158 “15:57 EST” in Boston.</p> | |
159 <p>If it is important that the values are always displayed the same way, | |
160 regardless of the current time zones, it is recommended that the time | |
161 zone is explicitly defined. For example:</p> | |
162 <blockquote> | |
163 <div>attr(timevalue, "tzone") <- "PST"</div></blockquote> | |
164 <dl class="docutils"> | |
165 <dt>or</dt> | |
166 <dd>timevalue <- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST")</dd> | |
167 </dl> | |
168 <p>Now the value will always be displayed as “12:57 PST”, regardless of | |
169 the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS | |
170 where R is installed actually understands the time zone “PST”, which | |
171 is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong> | |
172 the stored GMT value to <strong>the current time zone</strong>, yet it will still | |
173 display it with the “PST” tag attached! One way to rephrase this is | |
174 that R does a fairly decent job <strong>storing</strong> time values in a | |
175 non-ambiguous, platform-independent manner - but gives you no guarantee that | |
176 the values will be displayed in any way that is predictable or intuitive.</p> | |
177 <p>In practical terms, it is recommended to use the long/descriptive | |
178 forms of time zones, as they are more likely to be properly recognized | |
179 on most computers. For example, “Japan” instead of “JST”. Another possible | |
180 solution is to explicitly use GMT or UTC (since it is very likely to be | |
181 properly recognized on any system), or the “UTC+<OFFSET>” notation. Still, none of the above | |
182 <strong>guarantees</strong> proper, non-ambiguous handling of time values in R data | |
183 sets. The fact that R <strong>quietly</strong> modifies time values when it doesn’t | |
184 recognize the supplied timezone attribute, yet still appends it to the | |
185 <strong>changed</strong> time value does make it quite difficult. (These issues are | |
186 discussed in depth on R-related forums, and no attempt is made to | |
187 summarize it all in any depth here; this is just to make you aware of | |
188 this being a potentially complex issue!)</p> | |
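<p>One way to reduce the risk of a silently ignored time zone is to check the name against the list that the local R installation knows about before using it. This is only a sketch: <em>safe_tz</em> is a hypothetical helper written for this example, and the set of names returned by <em>OlsonNames()</em> depends on the system, so results may differ between computers.</p>

```r
# Which of these names does this R installation recognize?
candidates <- c("PST", "America/Los_Angeles", "Japan", "UTC")
setNames(candidates %in% OlsonNames(), candidates)

# Hypothetical helper: stop early instead of letting R silently
# fall back to another zone when the name is not recognized.
safe_tz <- function(tz) {
  if (!(tz %in% OlsonNames())) {
    stop(sprintf("time zone '%s' is not recognized on this system", tz))
  }
  tz
}

# Use a long, descriptive zone name, validated before use.
timevalue <- as.POSIXct(
  "03/19/2013 12:57:00",
  format = "%m/%d/%Y %H:%M:%OS",
  tz = safe_tz("America/Los_Angeles")
)
format(timevalue, usetz = TRUE)
```

<p>Failing loudly on an unknown name is a design choice for this sketch; it mirrors the DVN's decision to reject files with unrecognized time zones rather than guess.</p>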
189 <p>An important thing to keep in mind, in connection with the DVN ingest | |
190 of R files, is that it will <strong>reject</strong> an R data file with any time | |
191 values that have time zones that the DVN cannot recognize. This is done in | |
192 order to avoid some of the potential issues outlined above.</p> | |
193 <p>It is also recommended that any vectors containing time values | |
194 ingested into the DVN are reviewed, and the resulting entries in the | |
195 TAB files are compared against the original values in the R data | |
196 frame, to make sure they have been ingested as expected.</p> | |
197 <p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. The way the UNF | |
198 algorithm works, the same date/time values with and without the | |
199 timezone (e.g. “12:45” vs. “12:45 EST”) <strong>produce different | |
200 UNFs</strong>. Considering that time values in Stata/SPSS do not have time | |
201 zones, but ALL time values in R do (yes, they all do - if the timezone | |
202 wasn’t defined explicitly, it implicitly becomes a time value in the | |
203 “UTC” zone!), this means that it is <strong>impossible</strong> to have 2 time | |
204 value vectors, in Stata/SPSS and R, that produce the same UNF.</p> | |
205 <div class="line-block"> | |
206 <div class="line"><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of</div> | |
207 </div> | |
208 <p>the same data set that result in the same UNF when ingested, you may | |
209 define the time variables as <strong>strings</strong> in the R data frame, and use | |
210 the “YYYY-MM-DD HH:mm:ss” formatting notation. This is the formatting used by the UNF | |
211 algorithm to normalize time values, so doing the above will result in | |
212 the same UNF as the vector of the same time values in Stata.</p> | |
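<p>A minimal sketch of the string conversion described in the tip, using base R. The choice of UTC for rendering and the column names are assumptions made for the example; the strings should match whatever wall-clock values the SPSS/Stata version of the data set contains.</p>

```r
# Made-up time values with an explicit time zone.
times <- as.POSIXct(
  c("2013-03-19 12:57:00", "2013-09-23 14:57:21"),
  tz = "UTC"
)

# Store the times as plain strings in "YYYY-MM-DD HH:mm:ss" notation,
# so that the data frame column is character rather than POSIXct.
events <- data.frame(
  id   = 1:2,
  when = format(times, "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  stringsAsFactors = FALSE
)

str(events)
```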
213 <p>Note: date values (dates only, without time) should be handled the | |
214 exact same way as those in SPSS and Stata, and should produce the same | |
215 UNFs.</p> | |
216 </div> | |
217 </div> | |
218 | |
219 | |
220 </div> | |
221 </div> | |
222 </div> | |
223 </div> | |
224 <div class="sidebar"> | |
225 <h3>Table Of Contents</h3> | |
226 <ul> | |
227 <li class="toctree-l1"><a class="reference internal" href="dataverse-user-main.html">User Guide</a></li> | |
228 <li class="toctree-l1"><a class="reference internal" href="dataverse-installer-main.html">Installers Guide</a></li> | |
229 <li class="toctree-l1"><a class="reference internal" href="dataverse-developer-main.html">DVN Developers Guide</a></li> | |
230 <li class="toctree-l1"><a class="reference internal" href="dataverse-api-main.html">APIs Guide</a></li> | |
231 </ul> | |
232 | |
233 <h3 style="margin-top: 1.5em;">Search</h3> | |
234 <form class="search" action="search.html" method="get"> | |
235 <input type="text" name="q" /> | |
236 <input type="submit" value="Go" /> | |
237 <input type="hidden" name="check_keywords" value="yes" /> | |
238 <input type="hidden" name="area" value="default" /> | |
239 </form> | |
240 <p class="searchtip" style="font-size: 90%"> | |
241 Enter search terms. | |
242 </p> | |
243 </div> | |
244 <div class="clearer"></div> | |
245 </div> | |
246 </div> | |
247 | |
248 <div class="footer-wrapper"> | |
249 <div class="footer"> | |
250 <div class="left"> | |
251 <a href="genindex.html" title="General Index" | |
252 >index</a> | |
253 <br/> | |
254 <a href="_sources/dataverse-R-ingest.txt" | |
255 rel="nofollow">Show Source</a> | |
256 </div> | |
257 | |
258 <div class="right"> | |
259 | |
260 <div class="footer"> | |
261 © Copyright 1997-2013, President & Fellows Harvard University. | |
262 Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1. | |
263 </div> | |
264 </div> | |
265 <div class="clearer"></div> | |
266 </div> | |
267 </div> | |
268 | |
269 </body> | |
270 </html> |