diff DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html @ 6:1b2188262ae9
adding the installer.
author | "jurzua <jurzua@mpiwg-berlin.mpg.de>" |
---|---|
date | Wed, 13 May 2015 11:50:21 +0200 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html Wed May 13 11:50:21 2015 +0200 @@ -0,0 +1,270 @@ + + +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + + +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> + + <title>Ingest of R (.RData) files — The Harvard Dataverse Network 3.6.1 documentation</title> + + <link rel="stylesheet" href="_static/agogo.css" type="text/css" /> + <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> + + <script type="text/javascript"> + var DOCUMENTATION_OPTIONS = { + URL_ROOT: './', + VERSION: '3.6.1', + COLLAPSE_INDEX: false, + FILE_SUFFIX: '.html', + HAS_SOURCE: true + }; + </script> + <script type="text/javascript" src="_static/jquery.js"></script> + <script type="text/javascript" src="_static/underscore.js"></script> + <script type="text/javascript" src="_static/doctools.js"></script> + <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> + <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" /> + </head> + <body> + <div class="header-wrapper"> + <div class="header"> + <div class="headertitle"><a + href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div> + <div class="rel"> + <a href="genindex.html" title="General Index" + accesskey="I">index</a> + </div> + </div> + </div> + + <div class="content-wrapper"> + <div class="content"> + <div class="document"> + + <div class="documentwrapper"> + <div class="bodywrapper"> + <div class="body"> + + <div class="section" id="ingest-of-r-rdata-files"> +<h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1> +<div class="section" id="overview"> +<h2>Overview.<a 
class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2> +<p>Support for ingesting R data files was added in version 3.5. R +has become increasingly popular in the research/academic community, +because it is free and open-source (unlike SPSS and +Stata). Consequently, more and more data is becoming available +exclusively as R data files. This long-awaited feature makes it +possible to ingest such data into DVN as “subsettable” files.</p> +</div> +<div class="section" id="requirements"> +<h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2> +<p>R ingest relies on R having been installed, configured and made +available to the DVN application via Rserve (see the Installers +Guide). This is in contrast to the SPSS and Stata ingest, which can +be performed without R present (though R is still needed to perform +most subsetting/analysis tasks on the resulting data files).</p> +<p>The data must be formatted as an R data frame (data.frame()). If an +.RData file contains multiple data frames, only the first one will be +ingested.</p> +</div> +<div class="section" id="data-types-compared-to-other-supported-formats-stat-spss"> +<h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2> +<div class="section" id="integers-doubles-character-strings"> +<h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3> +<p>The handling of these types is intuitive and straightforward. 
The +resulting tab file columns, summary statistics and UNF signatures +should be identical to those produced by ingesting the same vectors +from SPSS and Stata.</p> +<p><strong>A couple of things that are unique to R/new in DVN:</strong></p> +<p>R explicitly supports Missing Values for all of the types above; +Missing Values encoded in R vectors will be recognized and preserved +in TAB files (as ‘NA’), and counted in the generated summary statistics +and data analysis.</p> +<p>In addition to Missing Values, R recognizes “Not a Number” (NaN) and +positive and negative infinity for floating point variables. These +are now properly supported by the DVN.</p> +<p>Also note that, unlike Stata, which does recognize “float” and “double” +as distinct data types, all floating point values in R are in fact +double precision.</p> +</div> +<div class="section" id="r-factors"> +<h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3> +<p>These are ingested as “Categorical Values” in the DVN.</p> +<p>One thing to keep in mind: in both Stata and SPSS, the actual value of +a categorical variable can be either character or numeric. In R, all +factor values are strings, even if they are string representations of +numbers. So the values of the resulting categoricals in the DVN will +always be of string type too.</p> +<div class="line-block"> +<div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an “Ordered Categorical” - a categorical value where an explicit order is assigned to the list of value labels.</div> +</div> +</div> +<div class="section" id="new-boolean-values"> +<h3>(New!) 
Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3> +<p>R Boolean (logical) values are supported.</p> +</div> +<div class="section" id="limitations-of-r-as-compared-to-spss-and-stata"> +<h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3> +<p>Most noticeably, R lacks a standard mechanism for defining descriptive +labels for the data frame variables. In the DVN, similarly to +both Stata and SPSS, variables have distinct names and labels, with +the latter reserved for longer, descriptive text. +With variables ingested from R data frames, the variable name will be +used for both the “name” and the “label”.</p> +<div class="line-block"> +<div class="line"><em>Optional R packages exist for providing descriptive variable labels; +support for such a mechanism may be added in a future +version. It would of course work only for R files that were +created with such optional packages</em>.</div> +</div> +<p>Similarly, R categorical values (factors) lack descriptive labels too. +<strong>Note:</strong> This is potentially confusing, since R factors do +actually have “labels”. This is a matter of terminology - an R +factor’s label is in fact the same thing as the “value” of a +categorical variable in SPSS or Stata and DVN; it contains the actual +meaningful data for the given observation. 
It is NOT a field reserved +for explanatory, human-readable text, as is the case with the +SPSS/Stata “label”.</p> +<p>Ingesting an R factor with the level labels “MALE” and “FEMALE” will +produce a categorical variable with “MALE” and “FEMALE” as both the +values and the labels.</p> +</div> +</div> +<div class="section" id="time-values-in-r"> +<h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2> +<p>This warrants a dedicated section of its own, because of some unique +ways in which time values are handled in R.</p> +<p>R makes an effort to treat a time value as a real instant in time. This +is in contrast with both SPSS and Stata, where time value +representations such as “Sep-23-2013 14:57:21” are allowed; note that +in the absence of an explicitly defined time zone, this value cannot +be mapped to an exact point in real time. R handles times in the +“Unix-style” way: the value is converted to the +“seconds-since-the-Epoch” Greenwich time (GMT or UTC) and the +resulting numeric value is stored in the data file; time zone +adjustments are made in real time as needed.</p> +<p>Things still get ambiguous and confusing when R <strong>displays</strong> this time +value: unless the time zone was explicitly defined, R will adjust the +value to the current time zone. The resulting behavior is often +counter-intuitive: if you create a time value, for example:</p> +<blockquote> +<div>timevalue<-as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS");</div></blockquote> +<p>on a computer configured for the San Francisco time zone, the value +will be displayed differently on computers in different time zones; +for example, as “12:57 PST” while still on the West Coast, but as +“15:57 EST” in Boston.</p> +<p>If it is important that the values are always displayed the same way, +regardless of the current time zones, it is recommended that the time +zone is explicitly defined. 
For example:</p> +<blockquote> +<div>attr(timevalue,"tzone")<-"PST"</div></blockquote> +<dl class="docutils"> +<dt>or</dt> +<dd>timevalue<-as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz="PST");</dd> +</dl> +<p>Now the value will always be displayed as “12:57 PST”, regardless of +the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS +where R is installed actually understands the time zone “PST”, which +is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong> +the stored GMT value to <strong>the current time zone</strong>, yet it will still +display it with the “PST” tag attached! One way to rephrase this is +that R does a fairly decent job <strong>storing</strong> time values in a +non-ambiguous, platform-independent manner - but gives you no guarantee that +the values will be displayed in any way that is predictable or intuitive.</p> +<p>In practical terms, it is recommended to use the long/descriptive +forms of time zones, as they are more likely to be properly recognized +on most computers. For example, “Japan” instead of “JST”. Another possible +solution is to explicitly use GMT or UTC (since it is very likely to be +properly recognized on any system), or the “UTC+<OFFSET>” notation. Still, none of the above +<strong>guarantees</strong> proper, non-ambiguous handling of time values in R data +sets. The fact that R <strong>quietly</strong> modifies time values when it doesn’t +recognize the supplied timezone attribute, yet still appends it to the +<strong>changed</strong> time value does make this quite difficult. 
(These issues are +discussed in depth on R-related forums, and no attempt is made to +summarize it all in any depth here; this is just to make you aware of +this being a potentially complex issue!)</p> +<p>An important thing to keep in mind, in connection with the DVN ingest +of R files, is that it will <strong>reject</strong> an R data file with any time +values that have time zones that we can’t recognize. This is done in +order to avoid some of the potential issues outlined above.</p> +<p>It is also recommended that any vectors containing time values +ingested into the DVN are reviewed, and the resulting entries in the +TAB files are compared against the original values in the R data +frame, to make sure they have been ingested as expected.</p> +<p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. Because of the way the UNF +algorithm works, the same date/time values with and without the +timezone (e.g. “12:45” vs. “12:45 EST”) <strong>produce different +UNFs</strong>. Considering that time values in Stata/SPSS do not have time +zones, but ALL time values in R do (yes, they all do - if the timezone +wasn’t defined explicitly, it implicitly becomes a time value in the +“UTC” zone!), this means that it is <strong>impossible</strong> to have two time +value vectors, in Stata/SPSS and R, that produce the same UNF.</p> +<div class="line-block"> +<div class="line"><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of</div> +</div> +<p>the same data set that result in the same UNF when ingested, you may +define the time variables as <strong>strings</strong> in the R data frame, and use +the “YYYY-MM-DD HH:mm:ss” formatting notation. 
This is the formatting used by the UNF +algorithm to normalize time values, so doing the above will result in +the same UNF as the vector of the same time values in Stata.</p> +<p>Note: date values (dates only, without time) should be handled the +exact same way as those in SPSS and Stata, and should produce the same +UNFs.</p> +</div> +</div> + + + </div> + </div> + </div> + </div> + <div class="sidebar"> + <h3>Table Of Contents</h3> + <ul> +<li class="toctree-l1"><a class="reference internal" href="dataverse-user-main.html">User Guide</a></li> +<li class="toctree-l1"><a class="reference internal" href="dataverse-installer-main.html">Installers Guide</a></li> +<li class="toctree-l1"><a class="reference internal" href="dataverse-developer-main.html">DVN Developers Guide</a></li> +<li class="toctree-l1"><a class="reference internal" href="dataverse-api-main.html">APIs Guide</a></li> +</ul> + + <h3 style="margin-top: 1.5em;">Search</h3> + <form class="search" action="search.html" method="get"> + <input type="text" name="q" /> + <input type="submit" value="Go" /> + <input type="hidden" name="check_keywords" value="yes" /> + <input type="hidden" name="area" value="default" /> + </form> + <p class="searchtip" style="font-size: 90%"> + Enter search terms. + </p> + </div> + <div class="clearer"></div> + </div> + </div> + + <div class="footer-wrapper"> + <div class="footer"> + <div class="left"> + <a href="genindex.html" title="General Index" + >index</a> + <br/> + <a href="_sources/dataverse-R-ingest.txt" + rel="nofollow">Show Source</a> + </div> + + <div class="right"> + + <div class="footer"> + © Copyright 1997-2013, President & Fellows Harvard University. + Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1. + </div> + </div> + <div class="clearer"></div> + </div> + </div> + + </body> +</html> \ No newline at end of file