comparison DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html @ 6:1b2188262ae9

adding the installer.
author "jurzua <jurzua@mpiwg-berlin.mpg.de>"
date Wed, 13 May 2015 11:50:21 +0200
5:dd9adfc73390 6:1b2188262ae9
1
2
3 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
4 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
5
6
7 <html xmlns="http://www.w3.org/1999/xhtml">
8 <head>
9 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
10
11 <title>Ingest of R (.RData) files &mdash; The Harvard Dataverse Network 3.6.1 documentation</title>
12
13 <link rel="stylesheet" href="_static/agogo.css" type="text/css" />
14 <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
15
16 <script type="text/javascript">
17 var DOCUMENTATION_OPTIONS = {
18 URL_ROOT: './',
19 VERSION: '3.6.1',
20 COLLAPSE_INDEX: false,
21 FILE_SUFFIX: '.html',
22 HAS_SOURCE: true
23 };
24 </script>
25 <script type="text/javascript" src="_static/jquery.js"></script>
26 <script type="text/javascript" src="_static/underscore.js"></script>
27 <script type="text/javascript" src="_static/doctools.js"></script>
28 <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
29 <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" />
30 </head>
31 <body>
32 <div class="header-wrapper">
33 <div class="header">
34 <div class="headertitle"><a
35 href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div>
36 <div class="rel">
37 <a href="genindex.html" title="General Index"
38 accesskey="I">index</a>
39 </div>
40 </div>
41 </div>
42
43 <div class="content-wrapper">
44 <div class="content">
45 <div class="document">
46
47 <div class="documentwrapper">
48 <div class="bodywrapper">
49 <div class="body">
50
51 <div class="section" id="ingest-of-r-rdata-files">
52 <h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1>
53 <div class="section" id="overview">
54 <h2>Overview.<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
55 <p>Support for ingesting R data files was added in version 3.5. R
56 has become increasingly popular in the research/academic community,
57 owing to the fact that it is free and open-source (unlike SPSS and
58 Stata). Consequently, more and more data is becoming available
59 exclusively as R data files. This long-awaited feature makes it
60 possible to ingest such data into DVN as &#8220;subsettable&#8221; files.</p>
61 </div>
62 <div class="section" id="requirements">
63 <h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2>
64 <p>R ingest relies on R having been installed, configured and made
65 available to the DVN application via Rserve (see the Installers
66 Guide). This is in contrast to the SPSS and Stata ingest, which can
67 be performed without R present (though R is still needed to perform
68 most subsetting/analysis tasks on the resulting data files).</p>
69 <p>The data must be formatted as an R data frame (data.frame()). If an
70 .RData file contains multiple data frames, only the first one will be
71 ingested.</p>
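<p>As a minimal sketch of preparing such a file (the variable names and the
file name survey.RData are illustrative assumptions, not part of the DVN
itself), a single data frame can be saved on its own, so that there is no
ambiguity about which object gets ingested:</p>
<div class="highlight"><pre>
# Illustrative data only; the column names are assumptions.
survey &lt;- data.frame(
  id     = 1:3,
  income = c(42000.5, NA, 51000),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Save only that data frame, so the .RData file holds a single object.
rdata_file &lt;- file.path(tempdir(), "survey.RData")
save(survey, file = rdata_file)

# Optional check: reload into a clean environment and list the contents.
check &lt;- new.env()
load(rdata_file, envir = check)
print(ls(check))
print(is.data.frame(check$survey))
</pre></div>
<p>Reloading the file into a separate environment is a quick way to confirm
that it contains exactly one data frame before uploading it.</p>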
72 </div>
73 <div class="section" id="data-types-compared-to-other-supported-formats-stat-spss">
74 <h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2>
75 <div class="section" id="integers-doubles-character-strings">
76 <h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3>
77 <p>The handling of these types is intuitive and straightforward. The
78 resulting tab file columns, summary statistics and UNF signatures
79 should be identical to those produced by ingesting the same vectors
80 from SPSS and Stata.</p>
81 <p><strong>A couple of things that are unique to R/new in DVN:</strong></p>
82 <p>R explicitly supports Missing Values for all of the types above;
83 Missing Values encoded in R vectors will be recognized, preserved
84 in TAB files (as &#8216;NA&#8217;), and counted in the generated summary statistics
85 and data analysis.</p>
86 <p>In addition to Missing Values, R recognizes &#8220;Not a Number&#8221; (NaN) and
87 positive and negative infinity for floating point variables. These
88 are now properly supported by the DVN.</p>
89 <p>Also note that, unlike Stata, which recognizes &#8220;float&#8221; and &#8220;double&#8221;
90 as distinct data types, R stores all floating point values in
91 double precision.</p>
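<p>The special values mentioned above can be inspected directly in R before
a file is prepared for ingest. The following is only an illustrative
sketch using base R functions; the vector names are assumptions:</p>
<div class="highlight"><pre>
# A double vector holding a missing value, NaN and both infinities.
x &lt;- c(1.5, NA, NaN, Inf, -Inf)

print(is.double(x))    # floating point values in R are doubles
print(is.na(x))        # flags NA and also NaN
print(is.nan(x))       # flags NaN only
print(is.infinite(x))  # flags Inf and -Inf

# Missing values also exist for integers and character strings.
i &lt;- c(1L, NA_integer_)
s &lt;- c("a", NA_character_)
print(is.na(i))
print(is.na(s))
</pre></div>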
92 </div>
93 <div class="section" id="r-factors">
94 <h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3>
95 <p>These are ingested as &#8220;Categorical Values&#8221; in the DVN.</p>
96 <p>One thing to keep in mind: in both Stata and SPSS, the actual value of
97 a categorical variable can be either character or numeric. In R, all
98 factor values are strings, even if they are string representations of
99 numbers. So the values of the resulting categoricals in the DVN will
100 always be of string type too.</p>
101 <div class="line-block">
102 <div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an &#8220;Ordered Categorical&#8221; - a categorical value where an explicit order is assigned to the list of value labels.</div>
103 </div>
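<p>Both points can be seen in a short R session. This is an illustrative
sketch only (the variable names are assumptions); it shows that factor
levels are stored as strings even when built from numbers, and how an
ordered factor declares its order:</p>
<div class="highlight"><pre>
# Factor built from numbers: the levels are character strings.
grade &lt;- factor(c(3, 1, 2, 3))
print(levels(grade))
print(is.character(levels(grade)))

# Ordered factor: the levels carry an explicit order.
size &lt;- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
print(is.ordered(size))

# Comparisons follow the declared level order, not alphabetical order.
print(size[1] &lt; size[2])
</pre></div>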
104 </div>
105 <div class="section" id="new-boolean-values">
106 <h3>(New!) Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3>
107 <p>R Boolean (logical) values are supported.</p>
108 </div>
109 <div class="section" id="limitations-of-r-as-compared-to-spss-and-stata">
110 <h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3>
111 <p>Most noticeably, R lacks a standard mechanism for defining descriptive
112 labels for the data frame variables. In the DVN, similarly to
113 both Stata and SPSS, variables have distinct names and labels, with
114 the latter reserved for longer, descriptive text.
115 With variables ingested from R data frames, the variable name will be
116 used for both the &#8220;name&#8221; and the &#8220;label&#8221;.</p>
117 <div class="line-block">
118 <div class="line"><em>Optional R packages exist for providing descriptive variable labels;
119 support for such a mechanism may be added in a future version.
120 It would of course work only for R files that were
121 created with such optional packages</em>.</div>
122 </div>
123 <p>Similarly, R categorical values (factors) lack descriptive labels too.
124 <strong>Note:</strong> This is potentially confusing, since R factors do
125 actually have &#8220;labels&#8221;. This is a matter of terminology: an R
126 factor&#8217;s label is in fact the same thing as the &#8220;value&#8221; of a
127 categorical variable in SPSS or Stata and DVN; it contains the actual
128 meaningful data for the given observation. It is NOT a field reserved
129 for explanatory, human-readable text, as is the case with the
130 SPSS/Stata &#8220;label&#8221;.</p>
131 <p>Ingesting an R factor with the level labels &#8220;MALE&#8221; and &#8220;FEMALE&#8221; will
132 produce a categorical variable with &#8220;MALE&#8221; and &#8220;FEMALE&#8221; as
133 both the values and the labels.</p>
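<p>The terminology point can be checked in R itself. In this illustrative
sketch (the numeric codes 1 and 2 are assumptions), the factor is built
from numeric codes, yet only the level labels survive as the data:</p>
<div class="highlight"><pre>
# Numeric codes mapped to level labels; R keeps only the labels as levels.
sex &lt;- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("MALE", "FEMALE"))

print(levels(sex))        # the level labels
print(as.character(sex))  # the per-observation values are those same strings
</pre></div>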
134 </div>
135 </div>
136 <div class="section" id="time-values-in-r">
137 <h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2>
138 <p>Time values warrant a dedicated section of their own, because of some unique
139 ways in which they are handled in R.</p>
140 <p>R makes an effort to treat a time value as a real time instant. This
141 is in contrast with both SPSS and Stata, where time value
142 representations such as &#8220;Sep-23-2013 14:57:21&#8221; are allowed; note that
143 in the absence of an explicitly defined time zone, this value cannot
144 be mapped to an exact point in real time. R handles times in the
145 &#8220;Unix-style&#8221; way: the value is converted to the
146 &#8220;seconds-since-the-Epoch&#8221; Greenwich time (GMT or UTC) and the
147 resulting numeric value is stored in the data file; time zone
148 adjustments are made in real time as needed.</p>
149 <p>Things still get ambiguous and confusing when R <strong>displays</strong> this time
150 value: unless the time zone was explicitly defined, R will adjust the
151 value to the current time zone. The resulting behavior is often
152 counter-intuitive: if you create a time value, for example:</p>
153 <blockquote>
154 <div>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS");</div></blockquote>
155 <p>on a computer configured for the San Francisco time zone, the value
156 will be displayed differently on computers in different time zones;
157 for example, as &#8220;12:57 PDT&#8221; while still on the West Coast, but as
158 &#8220;15:57 EDT&#8221; in Boston (daylight saving time is in effect on that date).</p>
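<p>This behavior can be reproduced on a single machine by asking R to
format the same instant for two zones. The sketch below is illustrative
only; it assumes the system time zone database knows the names
America/Los_Angeles and America/New_York, and it names the zone explicitly
at creation time so that the result does not depend on the local
settings:</p>
<div class="highlight"><pre>
# Create the value in an explicitly named zone (an assumption made here
# so that the example does not depend on the machine's own time zone).
timevalue &lt;- as.POSIXct("03/19/2013 12:57:00",
                        format = "%m/%d/%Y %H:%M:%S",
                        tz = "America/Los_Angeles")

# The stored value is a single number: seconds since the Epoch (UTC).
print(as.numeric(timevalue))

# The same instant rendered for two different zones.
print(format(timevalue, "%H:%M %Z", tz = "America/Los_Angeles"))
print(format(timevalue, "%H:%M %Z", tz = "America/New_York"))
</pre></div>
<p>The stored number is identical in both cases; only the rendering
changes. The zone abbreviation printed depends on whether daylight saving
time applies on the given date.</p>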
159 <p>If it is important that the values are always displayed the same way,
160 regardless of the current time zone, it is recommended that the time
161 zone be explicitly defined. For example:</p>
162 <blockquote>
163 <div>attr(timevalue, "tzone") &lt;- "PST"</div></blockquote>
164 <dl class="docutils">
165 <dt>or</dt>
166 <dd>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST");</dd>
167 </dl>
168 <p>Now the value will always be displayed as &#8220;12:57 PST&#8221;, regardless of
169 the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS
170 where R is installed actually understands the time zone &#8220;PST&#8221;, which
171 is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong>
172 the stored GMT value to <strong>the current time zone</strong>, yet it will still
173 display it with the &#8220;PST&#8221; tag attached! One way to rephrase this is
174 that R does a fairly decent job <strong>storing</strong> time values in a
175 non-ambiguous, platform-independent manner, but gives you no guarantee that
176 the values will be displayed in any way that is predictable or intuitive.</p>
177 <p>In practical terms, it is recommended to use the long/descriptive
178 forms of time zones, as they are more likely to be properly recognized
179 on most computers. For example, use &#8220;Japan&#8221; instead of &#8220;JST&#8221;. Another possible
180 solution is to explicitly use GMT or UTC (since it is very likely to be
181 properly recognized on any system), or the &#8220;UTC+&lt;OFFSET&gt;&#8221; notation. Still, none of the above
182 <strong>guarantees</strong> proper, non-ambiguous handling of time values in R data
183 sets. The fact that R <strong>quietly</strong> modifies time values when it doesn&#8217;t
184 recognize the supplied timezone attribute, yet still appends it to the
185 <strong>changed</strong> time value, does make it quite difficult. (These issues are
186 discussed in depth on R-related forums, and no attempt is made to
187 summarize them in any depth here; this is just to make you aware of
188 this being a potentially complex issue!)</p>
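<p>One way to reduce the risk on the R side is to check a zone name against
the list the local system knows before using it. The following is a rough
sketch, not a description of how the DVN itself validates time zones;
checked_tz is a hypothetical helper, and OlsonNames() is only available in
reasonably recent versions of R:</p>
<div class="highlight"><pre>
# Hypothetical helper: stop early if the zone name is unknown on this system.
checked_tz &lt;- function(tz) {
  if (!(tz %in% OlsonNames())) {
    stop("time zone not recognized on this system: ", tz)
  }
  tz
}

timevalue &lt;- as.POSIXct("2013-03-19 12:57:00", tz = checked_tz("UTC"))
print(format(timevalue, "%Y-%m-%d %H:%M:%S", usetz = TRUE))

# Short abbreviations such as "PST" are typically absent from the list.
print("PST" %in% OlsonNames())
</pre></div>
<p>Failing loudly on an unknown name avoids the situation described above,
where an unrecognized zone is silently replaced while its tag is still
displayed.</p>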
189 <p>An important thing to keep in mind, in connection with the DVN ingest
190 of R files, is that it will <strong>reject</strong> an R data file with any time
191 values that have time zones that it cannot recognize. This is done in
192 order to avoid some of the potential issues outlined above.</p>
193 <p>It is also recommended that any vectors containing time values
194 ingested into the DVN are reviewed, and the resulting entries in the
195 TAB files are compared against the original values in the R data
196 frame, to make sure they have been ingested as expected.</p>
197 <p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. The way the UNF
198 algorithm works, the same date/time values with and without the
199 timezone (e.g. &#8220;12:45&#8221; vs. &#8220;12:45 EST&#8221;) <strong>produce different
200 UNFs</strong>. Considering that time values in Stata/SPSS do not have time
201 zones, but ALL time values in R do (yes, they all do: if the timezone
202 wasn&#8217;t defined explicitly, it implicitly becomes a time value in the
203 &#8220;UTC&#8221; zone!), this means that it is <strong>impossible</strong> to have two time
204 value vectors, in Stata/SPSS and R, that produce the same UNF.</p>
205 <div class="line-block">
206 <div class="line"><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of
207 the same data set that result in the same UNF when ingested, you may
208 define the time variables as <strong>strings</strong> in the R data frame, and use
209 the &#8220;YYYY-MM-DD HH:mm:ss&#8221; formatting notation. This is the formatting used by the UNF
210 algorithm to normalize time values, so doing the above will result in
211 the same UNF as the vector of the same time values in Stata.</div>
212 </div>
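<p>The conversion to strings suggested in the tip could look roughly like
this in R. It is an illustrative sketch only; the column names and the
choice of UTC for the original value are assumptions:</p>
<div class="highlight"><pre>
timevalue &lt;- as.POSIXct("2013-03-19 12:57:00", tz = "UTC")

# Store the time as a plain character column in "YYYY-MM-DD HH:mm:ss" form.
df &lt;- data.frame(
  id = 1L,
  event_time = format(timevalue, "%Y-%m-%d %H:%M:%S"),
  stringsAsFactors = FALSE
)

print(is.character(df$event_time))
print(df$event_time)
</pre></div>
<p>Setting stringsAsFactors = FALSE keeps the column as character strings
rather than a factor, which matters on older versions of R where
conversion to factors was the default.</p>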
213 <p>Note: date values (dates only, without time) should be handled the
214 exact same way as those in SPSS and Stata, and should produce the same
215 UNFs.</p>
216 </div>
217 </div>
218
219
220 </div>
221 </div>
222 </div>
223 </div>
224 <div class="sidebar">
225 <h3>Table Of Contents</h3>
226 <ul>
227 <li class="toctree-l1"><a class="reference internal" href="dataverse-user-main.html">User Guide</a></li>
228 <li class="toctree-l1"><a class="reference internal" href="dataverse-installer-main.html">Installers Guide</a></li>
229 <li class="toctree-l1"><a class="reference internal" href="dataverse-developer-main.html">DVN Developers Guide</a></li>
230 <li class="toctree-l1"><a class="reference internal" href="dataverse-api-main.html">APIs Guide</a></li>
231 </ul>
232
233 <h3 style="margin-top: 1.5em;">Search</h3>
234 <form class="search" action="search.html" method="get">
235 <input type="text" name="q" />
236 <input type="submit" value="Go" />
237 <input type="hidden" name="check_keywords" value="yes" />
238 <input type="hidden" name="area" value="default" />
239 </form>
240 <p class="searchtip" style="font-size: 90%">
241 Enter search terms.
242 </p>
243 </div>
244 <div class="clearer"></div>
245 </div>
246 </div>
247
248 <div class="footer-wrapper">
249 <div class="footer">
250 <div class="left">
251 <a href="genindex.html" title="General Index"
252 >index</a>
253 <br/>
254 <a href="_sources/dataverse-R-ingest.txt"
255 rel="nofollow">Show Source</a>
256 </div>
257
258 <div class="right">
259
260 <div class="footer">
261 &copy; Copyright 1997-2013, President &amp; Fellows Harvard University.
262 Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1.
263 </div>
264 </div>
265 <div class="clearer"></div>
266 </div>
267 </div>
268
269 </body>
270 </html>