author | "jurzua <jurzua@mpiwg-berlin.mpg.de>" |
---|---|
date | Wed, 13 May 2015 11:50:21 +0200 |
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Ingest of R (.RData) files — The Harvard Dataverse Network 3.6.1 documentation</title> <link rel="stylesheet" href="_static/agogo.css" type="text/css" /> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: './', VERSION: '3.6.1', COLLAPSE_INDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" /> </head> <body> <div class="header-wrapper"> <div class="header"> <div class="headertitle"><a href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div> <div class="rel"> <a href="genindex.html" title="General Index" accesskey="I">index</a> </div> </div> </div> <div class="content-wrapper"> <div class="content"> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="ingest-of-r-rdata-files"> <h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1> <div class="section" id="overview"> <h2>Overview.<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2> <p>Support for ingesting R data files has been added in version 3.5. 
R has become increasingly popular in the research/academic community because it is free and open-source (unlike SPSS and Stata). Consequently, more and more data is becoming available exclusively as R data files. This long-awaited feature makes it possible to ingest such data into the DVN as “subsettable” files.</p> </div> <div class="section" id="requirements"> <h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2> <p>R ingest relies on R having been installed, configured and made available to the DVN application via Rserve (see the Installers Guide). This is in contrast to the SPSS and Stata ingest, which can be performed without R present (though R is still needed to perform most subsetting/analysis tasks on the resulting data files).</p> <p>The data must be formatted as an R data frame (data.frame()). If an .RData file contains multiple data frames, only the first one will be ingested.</p> </div> <div class="section" id="data-types-compared-to-other-supported-formats-stat-spss"> <h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2> <div class="section" id="integers-doubles-character-strings"> <h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3> <p>The handling of these types is intuitive and straightforward. 
The resulting tab file columns, summary statistics and UNF signatures should be identical to those produced by ingesting the same vectors from SPSS and Stata.</p> <p><strong>A couple of things that are unique to R/new in the DVN:</strong></p> <p>R explicitly supports Missing Values for all of the types above; Missing Values encoded in R vectors will be recognized, preserved in TAB files (as ‘NA’), and counted in the generated summary statistics and data analysis.</p> <p>In addition to Missing Values, R recognizes “Not a Number” (NaN) and positive and negative infinity for floating point variables. These are now properly supported by the DVN.</p> <p>Also note that, unlike Stata, which recognizes “float” and “double” as distinct data types, R stores all floating point values in double precision.</p> </div> <div class="section" id="r-factors"> <h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3> <p>These are ingested as “Categorical Values” in the DVN.</p> <p>One thing to keep in mind: in both Stata and SPSS, the actual value of a categorical variable can be either character or numeric. In R, all factor values are strings, even if they are string representations of numbers. So the values of the resulting categoricals in the DVN will always be of string type too.</p> <div class="line-block"> <div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an “Ordered Categorical” - a categorical value where an explicit order is assigned to the list of value labels.</div> </div> </div> <div class="section" id="new-boolean-values"> <h3>(New!) 
Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3> <p>R Boolean (logical) values are supported.</p> </div> <div class="section" id="limitations-of-r-as-compared-to-spss-and-stata"> <h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3> <p>Most notably, R lacks a standard mechanism for defining descriptive labels for the data frame variables. In the DVN, as in both Stata and SPSS, variables have distinct names and labels, with the latter reserved for longer, descriptive text. For variables ingested from R data frames, the variable name will be used for both the “name” and the “label”.</p> <div class="line-block"> <div class="line"><em>Optional R packages exist for providing descriptive variable labels; support for such a mechanism may be added in a future version. It would of course work only for R files that were created with such optional packages</em>.</div> </div> <p>Similarly, R categorical values (factors) lack descriptive labels too. <strong>Note:</strong> This is potentially confusing, since R factors do actually have “labels”. This is a matter of terminology - an R factor’s label is in fact the same thing as the “value” of a categorical variable in SPSS, Stata and the DVN; it contains the actual meaningful data for the given observation. 
It is NOT a field reserved for explanatory, human-readable text, as is the case with the SPSS/Stata “label”.</p> <p>Ingesting an R factor with the level labels “MALE” and “FEMALE” will produce a categorical variable with “MALE” and “FEMALE” as both the values and the labels.</p> </div> </div> <div class="section" id="time-values-in-r"> <h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2> <p>This warrants a dedicated section of its own, because of some unique ways in which time values are handled in R.</p> <p>R makes an effort to treat a time value as a real instant in time. This is in contrast with both SPSS and Stata, where time value representations such as “Sep-23-2013 14:57:21” are allowed; note that in the absence of an explicitly defined time zone, this value cannot be mapped to an exact point in real time. R handles times in the “Unix-style” way: the value is converted to “seconds-since-the-Epoch” Greenwich time (GMT or UTC) and the resulting numeric value is stored in the data file; time zone adjustments are made in real time as needed.</p> <p>Things still get ambiguous and confusing when R <strong>displays</strong> this time value: unless the time zone was explicitly defined, R will adjust the value to the current time zone. The resulting behavior is often counter-intuitive: if you create a time value, for example:</p> <blockquote> <div>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS");</div></blockquote> <p>on a computer configured for the San Francisco time zone, the value will be displayed differently on computers in different time zones; for example, as “12:57 PST” while still on the West Coast, but as “15:57 EST” in Boston.</p> <p>If it is important that the values are always displayed the same way, regardless of the current time zone, it is recommended that the time zone is explicitly defined. 
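</p> <p>The display behavior described above can be sketched in R as follows. This is only an illustration, not part of the DVN: it assumes the Olson zone names “America/Los_Angeles” and “America/New_York” are known to the local system, and it passes tz to format() merely to simulate viewing the same stored value on computers in two different time zones.</p>
<div class="highlight-r"><div class="highlight"><pre># Illustrative only: one stored instant, rendered for two zones.
# The zone names are assumptions; check OlsonNames() on your system.
timevalue &lt;- as.POSIXct("03/19/2013 12:57:00",
                        format = "%m/%d/%Y %H:%M:%OS",
                        tz = "America/Los_Angeles")

# What is stored: seconds since 1970-01-01 00:00:00 UTC.
as.numeric(timevalue)

# What is shown depends on the zone used for display.
format(timevalue, "%H:%M %Z", tz = "America/Los_Angeles")
format(timevalue, "%H:%M %Z", tz = "America/New_York")
</pre></div></div>
<p>Both format() calls render the same stored instant; only the wall-clock text differs. The time zone can also be attached to the value itself, in either of two ways. 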
For example:</p> <blockquote> <div>attr(timevalue, "tzone") &lt;- "PST"</div></blockquote> <p>or</p> <blockquote> <div>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST");</div></blockquote> <p>Now the value will always be displayed as “12:57 PST”, regardless of the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS where R is installed actually understands the time zone “PST”, which is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong> the stored GMT value to <strong>the current time zone</strong>, yet it will still display it with the “PST” tag attached! One way to rephrase this is that R does a fairly decent job <strong>storing</strong> time values in a non-ambiguous, platform-independent manner - but gives you no guarantee that the values will be displayed in any way that is predictable or intuitive.</p> <p>In practical terms, it is recommended to use the long/descriptive forms of time zones, as they are more likely to be properly recognized on most computers. For example, “Japan” instead of “JST”. Another possible solution is to explicitly use GMT or UTC (since it is very likely to be properly recognized on any system), or the “UTC+&lt;OFFSET&gt;” notation. Still, none of the above <strong>guarantees</strong> proper, non-ambiguous handling of time values in R data sets. The fact that R <strong>quietly</strong> modifies time values when it doesn’t recognize the supplied timezone attribute, yet still appends that attribute to the <strong>changed</strong> time value, makes this quite difficult. (These issues are discussed in depth on R-related forums, and no attempt is made to summarize them here; this is just to make you aware that this is a potentially complex issue!)</p> <p>An important thing to keep in mind, in connection with the DVN ingest of R files, is that it will <strong>reject</strong> an R data file with any time values that have time zones that the DVN cannot recognize. 
This is done in order to avoid (some of) the potential issues outlined above.</p> <p>It is also recommended that any vectors containing time values ingested into the DVN are reviewed, and the resulting entries in the TAB files are compared against the original values in the R data frame, to make sure they have been ingested as expected.</p> <p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. The way the UNF algorithm works, the same date/time values with and without the time zone (e.g. “12:45” vs. “12:45 EST”) <strong>produce different UNFs</strong>. Considering that time values in Stata/SPSS do not have time zones, but ALL time values in R do (yes, they all do - if the time zone wasn’t defined explicitly, it implicitly becomes a time value in the “UTC” zone!), this means that it is <strong>impossible</strong> to have two time value vectors, in Stata/SPSS and R, that produce the same UNF.</p> <p><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of the same data set that result in the same UNF when ingested, you may define the time variables as <strong>strings</strong> in the R data frame, and use the “YYYY-MM-DD HH:mm:ss” formatting notation. 
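</p> <p>A minimal sketch of that conversion in R follows. The data frame df and its column visit_time are hypothetical names used only for illustration, and UTC is an assumed choice of zone for rendering the strings; pick whichever zone matches the wall-clock values in the Stata/SPSS version of the data.</p>
<div class="highlight-r"><div class="highlight"><pre># Hypothetical data frame with one POSIXct (time) column.
df &lt;- data.frame(id = 1:2)
df$visit_time &lt;- as.POSIXct(c("2013-03-19 12:57:00", "2013-09-23 14:57:21"),
                            tz = "UTC")

# Replace the time column with plain text in the "YYYY-MM-DD HH:mm:ss" layout.
# The tz passed to format() decides which wall-clock time ends up in the text.
df$visit_time &lt;- format(df$visit_time, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# The column is now character, not POSIXct and not a factor.
str(df)
</pre></div></div>
<p>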
This is the formatting used by the UNF algorithm to normalize time values, so doing the above will result in the same UNF as the vector of the same time values in Stata.</p> <p>Note: date values (dates only, without time) should be handled the exact same way as those in SPSS and Stata, and should produce the same UNFs.</p> </div> </div> </div> </div> </div> </div> <div class="sidebar"> <h3>Table Of Contents</h3> <ul> <li class="toctree-l1"><a class="reference internal" href="dataverse-user-main.html">User Guide</a></li> <li class="toctree-l1"><a class="reference internal" href="dataverse-installer-main.html">Installers Guide</a></li> <li class="toctree-l1"><a class="reference internal" href="dataverse-developer-main.html">DVN Developers Guide</a></li> <li class="toctree-l1"><a class="reference internal" href="dataverse-api-main.html">APIs Guide</a></li> </ul> <h3 style="margin-top: 1.5em;">Search</h3> <form class="search" action="search.html" method="get"> <input type="text" name="q" /> <input type="submit" value="Go" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> <p class="searchtip" style="font-size: 90%"> Enter search terms. </p> </div> <div class="clearer"></div> </div> </div> <div class="footer-wrapper"> <div class="footer"> <div class="left"> <a href="genindex.html" title="General Index" >index</a> <br/> <a href="_sources/dataverse-R-ingest.txt" rel="nofollow">Show Source</a> </div> <div class="right"> <div class="footer"> © Copyright 1997-2013, President & Fellows Harvard University. Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1. </div> </div> <div class="clearer"></div> </div> </div> </body> </html>