<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
5
|
|
6
|
|
7 <html xmlns="http://www.w3.org/1999/xhtml">
|
|
8 <head>
|
|
9 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
|
10
|
|
11 <title>Ingest of R (.RData) files — The Harvard Dataverse Network 3.6.1 documentation</title>
|
|
12
|
|
13 <link rel="stylesheet" href="_static/agogo.css" type="text/css" />
|
|
14 <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
|
|
15
|
|
16 <script type="text/javascript">
|
|
17 var DOCUMENTATION_OPTIONS = {
|
|
18 URL_ROOT: './',
|
|
19 VERSION: '3.6.1',
|
|
20 COLLAPSE_INDEX: false,
|
|
21 FILE_SUFFIX: '.html',
|
|
22 HAS_SOURCE: true
|
|
23 };
|
|
24 </script>
|
|
25 <script type="text/javascript" src="_static/jquery.js"></script>
|
|
26 <script type="text/javascript" src="_static/underscore.js"></script>
|
|
27 <script type="text/javascript" src="_static/doctools.js"></script>
|
|
28 <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
|
|
29 <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" />
|
|
30 </head>
|
|
31 <body>
|
|
32 <div class="header-wrapper">
|
|
33 <div class="header">
|
|
34 <div class="headertitle"><a
|
|
35 href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div>
|
|
36 <div class="rel">
|
|
37 <a href="genindex.html" title="General Index"
|
|
38 accesskey="I">index</a>
|
|
39 </div>
|
|
40 </div>
|
|
41 </div>
|
|
42
|
|
43 <div class="content-wrapper">
|
|
44 <div class="content">
|
|
45 <div class="document">
|
|
46
|
|
47 <div class="documentwrapper">
|
|
48 <div class="bodywrapper">
|
|
49 <div class="body">
|
|
50
|
|
51 <div class="section" id="ingest-of-r-rdata-files">
|
|
52 <h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1>
|
|
53 <div class="section" id="overview">
|
|
54 <h2>Overview.<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
|
|
<p>Support for ingesting R data files was added in version 3.5. R
has become increasingly popular in the research and academic community
because it is free and open source (unlike SPSS and Stata).
Consequently, more and more data is becoming available
exclusively as R data files. This long-awaited feature makes it
possible to ingest such data into the DVN as “subsettable” files.</p>
|
|
61 </div>
|
|
62 <div class="section" id="requirements">
|
|
63 <h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2>
|
|
<p>R ingest relies on R having been installed, configured and made
available to the DVN application via Rserve (see the Installers
Guide). This is in contrast to the SPSS and Stata ingest, which can
be performed without R present (though R is still needed to perform
most subsetting and analysis tasks on the resulting data files).</p>
<p>The data must be formatted as an R data frame (data.frame()). If an
.RData file contains multiple data frames, only the first one will be
ingested.</p>
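<p>As a minimal sketch of preparing a file that meets these requirements,
the R code below builds one data frame and saves only that object to an
.RData file. The object name “survey”, its columns and the file name are
made up for illustration; they are not part of the DVN.</p>
<div class="highlight-r"><div class="highlight"><pre>
# Build a small data frame; the column names are hypothetical.
survey &lt;- data.frame(
  id     = 1:3,
  income = c(41250.5, NA, 58000),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Save only this one object, so the file holds a single data frame.
rdata_path &lt;- file.path(tempdir(), "survey.RData")
save(survey, file = rdata_path)

# Reload into a clean environment and check what the file contains.
check_env &lt;- new.env()
loaded &lt;- load(rdata_path, envir = check_env)
print(loaded)
print(is.data.frame(check_env$survey))
</pre></div></div>
<p>Saving a single, explicitly named data frame avoids having to rely on
which object the ingest treats as the first one.</p>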
|
|
72 </div>
|
|
73 <div class="section" id="data-types-compared-to-other-supported-formats-stat-spss">
|
|
<h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2>
|
|
75 <div class="section" id="integers-doubles-character-strings">
|
|
76 <h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3>
|
|
<p>The handling of these types is intuitive and straightforward. The
resulting tab file columns, summary statistics and UNF signatures
should be identical to those produced by ingesting the same vectors
from SPSS and Stata.</p>
<p><strong>A couple of things that are unique to R and new in the DVN:</strong></p>
<p>R explicitly supports Missing Values for all of the types above;
Missing Values encoded in R vectors will be recognized and preserved
in TAB files (as ‘NA’), and counted in the generated summary statistics
and data analysis.</p>
<p>In addition to Missing Values, R recognizes “Not a Number” (NaN) and
positive and negative infinity for floating point variables. These
are now properly supported by the DVN.</p>
<p>Also note that, unlike Stata, which recognizes “float” and “double”
as distinct data types, all floating point values in R are in fact
double precision.</p>
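<p>The distinctions above can be checked directly in R. The short sketch
below uses only base R functions; the vector names are arbitrary.</p>
<div class="highlight-r"><div class="highlight"><pre>
# One numeric vector holding a missing value, NaN and both infinities.
x &lt;- c(1.5, NA, NaN, Inf, -Inf)

typeof(x)        # "double": R has no separate single-precision float type
is.na(x)         # FALSE TRUE TRUE FALSE FALSE (NaN also counts as NA)
is.nan(x)        # FALSE FALSE TRUE FALSE FALSE
is.infinite(x)   # FALSE FALSE FALSE TRUE TRUE

# Missing values are also allowed in integer and character vectors.
n &lt;- c(1L, NA, 3L)
s &lt;- c("a", NA)
typeof(n)        # "integer"
typeof(s)        # "character"
</pre></div></div>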
|
|
92 </div>
|
|
93 <div class="section" id="r-factors">
|
|
94 <h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3>
|
|
95 <p>These are ingested as “Categorical Values” in the DVN.</p>
|
|
<p>One thing to keep in mind: in both Stata and SPSS, the actual value of
a categorical variable can be either character or numeric. In R, all
factor values are strings, even if they are string representations of
numbers. So the values of the resulting categoricals in the DVN will
always be of string type too.</p>
|
|
101 <div class="line-block">
|
|
102 <div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an “Ordered Categorical” - a categorical value where an explicit order is assigned to the list of value labels.</div>
|
|
103 </div>
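<p>To illustrate both points on the R side, the sketch below builds a
factor from numbers and an ordered factor. The variable names and
values are made up for illustration.</p>
<div class="highlight-r"><div class="highlight"><pre>
# A factor built from numbers still stores its levels as strings.
grade &lt;- factor(c(3, 1, 2, 3))
levels(grade)                 # "1" "2" "3"
is.character(levels(grade))   # TRUE

# An ordered factor carries an explicit order for its levels.
size &lt;- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
is.ordered(size)              # TRUE
size[1] &lt; size[2]             # TRUE: "small" sorts before "large"
</pre></div></div>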
|
|
104 </div>
|
|
105 <div class="section" id="new-boolean-values">
|
|
106 <h3>(New!) Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3>
|
|
107 <p>R Boolean (logical) values are supported.</p>
|
|
108 </div>
|
|
109 <div class="section" id="limitations-of-r-as-compared-to-spss-and-stata">
|
|
<h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3>
|
|
<p>Most notably, R lacks a standard mechanism for defining descriptive
labels for the data frame variables. In the DVN, as in
both Stata and SPSS, variables have distinct names and labels, with
the latter reserved for longer, descriptive text.
For variables ingested from R data frames, the variable name will be
used for both the “name” and the “label”.</p>
|
|
117 <div class="line-block">
|
|
118 <div class="line"><em>Optional R packages exist for providing descriptive variable labels;
|
|
119 in one of the future versions support may be added for such a
|
|
120 mechanism. It would of course work only for R files that were
|
|
121 created with such optional packages</em>.</div>
|
|
122 </div>
|
|
<p>Similarly, R categorical values (factors) lack descriptive labels too.
<strong>Note:</strong> This is potentially confusing, since R factors do
actually have “labels”. This is a matter of terminology - an R
factor’s label is in fact the same thing as the “value” of a
categorical variable in SPSS or Stata and DVN; it contains the actual
meaningful data for the given observation. It is NOT a field reserved
for explanatory, human-readable text, as is the case with the
SPSS/Stata “label”.</p>
<p>Ingesting an R factor with the level labels “MALE” and “FEMALE” will
produce a categorical variable with “MALE” and “FEMALE” as both the
values and the labels.</p>
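<p>The R side of this example can be sketched as follows. The variable
name “sex” is arbitrary; the point is that the level labels are the only
text a factor carries, next to its internal integer codes.</p>
<div class="highlight-r"><div class="highlight"><pre>
sex &lt;- factor(c("MALE", "FEMALE", "FEMALE"))

levels(sex)         # "FEMALE" "MALE" (sorted alphabetically by default)
as.character(sex)   # "MALE" "FEMALE" "FEMALE": the actual data values
as.integer(sex)     # 2 1 1: internal codes indexing into levels(sex)
</pre></div></div>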
|
|
134 </div>
|
|
135 </div>
|
|
136 <div class="section" id="time-values-in-r">
|
|
137 <h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2>
|
|
<p>This warrants a dedicated section of its own, because of some unique
ways in which time values are handled in R.</p>
<p>R makes an effort to treat a time value as a real time instant. This
is in contrast with both SPSS and Stata, where time value
representations such as “Sep-23-2013 14:57:21” are allowed; note that
in the absence of an explicitly defined time zone, this value cannot
be mapped to an exact point in real time. R handles times in the
“Unix-style” way: the value is converted to
“seconds since the Epoch” in Greenwich Mean Time (GMT or UTC), and the
resulting numeric value is stored in the data file; time zone
adjustments are made in real time as needed.</p>
|
|
<p>Things still get ambiguous and confusing when R <strong>displays</strong> this time
value: unless the time zone was explicitly defined, R will adjust the
value to the current time zone. The resulting behavior is often
counter-intuitive: if you create a time value, for example:</p>
<blockquote>
<div>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS")</div></blockquote>
<p>on a computer configured for the San Francisco time zone, the value
will be displayed differently on computers in different time zones;
for example, as “12:57 PDT” while still on the West Coast, but as
“15:57 EDT” in Boston (daylight saving time is in effect on that date).</p>
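<p>The effect can be reproduced on a single machine by asking R to render
one stored value for two zones. This is only a sketch: it assumes the
Olson names “America/Los_Angeles” and “America/New_York” are available in
the local time zone database, and it sets the zone explicitly instead of
relying on the computer's configuration.</p>
<div class="highlight-r"><div class="highlight"><pre>
# Create the value as if on a machine set to US Pacific time.
timevalue &lt;- as.POSIXct("03/19/2013 12:57:00",
                        format = "%m/%d/%Y %H:%M:%OS",
                        tz = "America/Los_Angeles")

# The stored value is a plain count of seconds since the Epoch (UTC).
as.numeric(timevalue)

# The same instant, rendered for two different time zones.
format(timevalue, "%H:%M %Z", tz = "America/Los_Angeles")
format(timevalue, "%H:%M %Z", tz = "America/New_York")
</pre></div></div>
<p>The number printed by as.numeric() does not change with the display
zone; only the formatted text does.</p>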
|
|
<p>If it is important that the values are always displayed the same way,
regardless of the current time zone, it is recommended that the time
zone be explicitly defined. For example:</p>
<blockquote>
<div>attr(timevalue, "tzone") &lt;- "PST"</div></blockquote>
<dl class="docutils">
<dt>or</dt>
<dd>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST")</dd>
</dl>
|
|
<p>Now the value will always be displayed as “12:57 PST”, regardless of
the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS
where R is installed actually understands the time zone “PST”, which
is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong>
the stored GMT value to <strong>the current time zone</strong>, yet it will still
display it with the “PST” tag attached! One way to rephrase this is
that R does a fairly decent job of <strong>storing</strong> time values in a
non-ambiguous, platform-independent manner, but gives you no guarantee that
the values will be displayed in any way that is predictable or intuitive.</p>
|
|
<p>In practical terms, it is recommended to use the long, descriptive
forms of time zones, as they are more likely to be properly recognized
on most computers. For example, “Japan” instead of “JST”. Another possible
solution is to explicitly use GMT or UTC (since it is very likely to be
properly recognized on any system), or the “UTC+&lt;OFFSET&gt;” notation. Still, none of the above
<strong>guarantees</strong> proper, non-ambiguous handling of time values in R data
sets. The fact that R <strong>quietly</strong> modifies time values when it doesn’t
recognize the supplied timezone attribute, yet still appends it to the
<strong>changed</strong> time value, does make it quite difficult. (These issues are
discussed in depth on R-related forums, and no attempt is made to
summarize it all here; this is just to make you aware of
this being a potentially complex issue!)</p>
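<p>One way to check a zone name before relying on it is to look it up in
the list that the local R installation reports. The helper below is
hypothetical and not part of R or the DVN; it assumes an R version that
provides OlsonNames(), and the set of zones the DVN itself recognizes may
differ from this list.</p>
<div class="highlight-r"><div class="highlight"><pre>
# Hypothetical helper: TRUE if this R installation lists the zone name.
is_known_tz &lt;- function(tz) {
  tz %in% c("UTC", "GMT", OlsonNames())
}

is_known_tz("America/Los_Angeles")
is_known_tz("PST")
is_known_tz("Not/AZone")   # FALSE: no such zone exists
</pre></div></div>
<p>A name that fails this check is likely to be mishandled when the value
is displayed, so it is safer to replace it before saving the data frame.</p>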
|
|
<p>An important thing to keep in mind, in connection with the DVN ingest
of R files, is that it will <strong>reject</strong> an R data file with any time
values that have time zones that the DVN cannot recognize. This is done in
order to avoid some of the potential issues outlined above.</p>
<p>It is also recommended that any vectors containing time values
ingested into the DVN are reviewed, and the resulting entries in the
TAB files are compared against the original values in the R data
frame, to make sure they have been ingested as expected.</p>
|
|
<p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. The way the UNF
algorithm works, the same date/time values with and without the
timezone (e.g. “12:45” vs. “12:45 EST”) <strong>produce different
UNFs</strong>. Considering that time values in Stata/SPSS do not have time
zones, but ALL time values in R do (yes, they all do - if the timezone
wasn’t defined explicitly, it implicitly becomes a time value in the
“UTC” zone!), this means that it is <strong>impossible</strong> to have two time
value vectors, in Stata/SPSS and R, that produce the same UNF.</p>
|
|
<p><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of
the same data set that result in the same UNF when ingested, you may
define the time variables as <strong>strings</strong> in the R data frame, and use
the “YYYY-MM-DD HH:mm:ss” formatting notation. This is the formatting used by the UNF
algorithm to normalize time values, so doing the above will result in
the same UNF as the vector of the same time values in Stata.</p>
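<p>A minimal sketch of this tip in R: the time values are converted to
character strings in the “YYYY-MM-DD HH:mm:ss” layout before the data
frame is saved. The data frame and column names are made up for
illustration, and UTC is chosen here only to make the example
reproducible.</p>
<div class="highlight-r"><div class="highlight"><pre>
# Example time values; the zone is set explicitly to UTC.
times &lt;- as.POSIXct(c("2013-03-19 12:57:00", "2013-09-23 14:57:21"),
                    tz = "UTC")

# Store them as plain strings instead of POSIXct values.
events &lt;- data.frame(id = 1:2, stringsAsFactors = FALSE)
events$event_time &lt;- format(times, "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(events$event_time)   # "character"
events$event_time[1]       # "2013-03-19 12:57:00"
</pre></div></div>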
|
|
213 <p>Note: date values (dates only, without time) should be handled the
|
|
214 exact same way as those in SPSS and Stata, and should produce the same
|
|
215 UNFs.</p>
|
|
216 </div>
|
|
217 </div>
|
|
218
|
|
219
|
|
220 </div>
|
|
221 </div>
|
|
222 </div>
|
|
223 </div>
|
|
|
|
244 <div class="clearer"></div>
|
|
245 </div>
|
|
246 </div>
|
|
247
|
|
248 <div class="footer-wrapper">
|
|
249 <div class="footer">
|
|
250 <div class="left">
|
|
251 <a href="genindex.html" title="General Index"
|
|
252 >index</a>
|
|
253 <br/>
|
|
254 <a href="_sources/dataverse-R-ingest.txt"
|
|
255 rel="nofollow">Show Source</a>
|
|
256 </div>
|
|
257
|
|
258 <div class="right">
|
|
259
|
|
260 <div class="footer">
|
|
© Copyright 1997-2013, President &amp; Fellows Harvard University.
|
|
262 Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1.
|
|
263 </div>
|
|
264 </div>
|
|
265 <div class="clearer"></div>
|
|
266 </div>
|
|
267 </div>
|
|
268
|
|
269 </body>
|
|
270 </html> |