view DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html @ 6:1b2188262ae9

adding the installer.
author "jurzua <jurzua@mpiwg-berlin.mpg.de>"
date Wed, 13 May 2015 11:50:21 +0200



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Ingest of R (.RData) files &mdash; The Harvard Dataverse Network 3.6.1 documentation</title>
    
    <link rel="stylesheet" href="_static/agogo.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '3.6.1',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
    <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" /> 
  </head>
  <body>
    <div class="header-wrapper">
      <div class="header">
        <div class="headertitle"><a
          href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div>
        <div class="rel">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a>
        </div>
       </div>
    </div>

    <div class="content-wrapper">
      <div class="content">
        <div class="document">
            
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="ingest-of-r-rdata-files">
<h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1>
<div class="section" id="overview">
<h2>Overview.<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
<p>Support for ingesting R data files was added in version 3.5. R
has become increasingly popular in the research and academic community,
largely because it is free and open-source (unlike SPSS and
Stata). Consequently, more and more data is becoming available
exclusively as R data files. This long-awaited feature makes it
possible to ingest such data into the DVN as &#8220;subsettable&#8221; files.</p>
</div>
<div class="section" id="requirements">
<h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2>
<p>R ingest relies on R having been installed, configured and made
available to the DVN application via Rserve (see the Installers
Guide). This is in contrast to the SPSS and Stata ingest, which can
be performed without R present (though R is still needed to perform
most subsetting/analysis tasks on the resulting data files).</p>
<p>The data must be formatted as an R data frame (data.frame()). If an
.RData file contains multiple data frames, only the first one will be
ingested.</p>
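<p>As a minimal sketch of preparing such a file, one might save exactly one
data frame per .RData file and then check what the file contains. The object
name &#8220;survey&#8221;, its columns and the file name are hypothetical and
chosen only for illustration:</p>
<div class="highlight-r"><div class="highlight"><pre># Hypothetical example data; names and values are illustrative only.
survey &lt;- data.frame(
  id    = 1:3,
  score = c(2.5, 3.0, 4.25),
  name  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Save a single data frame, so there is no ambiguity about which
# object in the file will be ingested.
path &lt;- file.path(tempdir(), "survey.RData")
save(survey, file = path)

# Verify the contents by loading the file into a clean environment.
e &lt;- new.env()
loaded &lt;- load(path, envir = e)
loaded                   # names of the objects stored in the file
is.data.frame(e$survey)  # TRUE
</pre></div>
</div>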
</div>
<div class="section" id="data-types-compared-to-other-supported-formats-stat-spss">
<h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2>
<div class="section" id="integers-doubles-character-strings">
<h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3>
<p>The handling of these types is intuitive and straightforward. The
resulting tab file columns, summary statistics and UNF signatures
should be identical to those produced by ingesting the same vectors
from SPSS and Stata.</p>
<p><strong>A couple of things that are unique to R/new in DVN:</strong></p>
<p>R explicitly supports Missing Values for all of the types above;
Missing Values encoded in R vectors will be recognized and preserved
in TAB files (as &#8216;NA&#8217;), and counted in the generated summary statistics
and data analysis.</p>
<p>In addition to Missing Values, R recognizes &#8220;Not a Number&#8221; (NaN) and
positive and negative infinity for floating point variables. These
are now properly supported by the DVN.</p>
<p>Also note that, unlike Stata, which recognizes &#8220;float&#8221; and &#8220;double&#8221;
as distinct data types, R stores all floating point values in
double precision.</p>
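<p>The distinctions above can be seen directly in an R session. The following
is only an illustrative sketch with made-up vectors; it shows how R itself
classifies these values, not how the DVN processes them internally:</p>
<div class="highlight-r"><div class="highlight"><pre># One vector of each basic type, each containing a missing value (NA).
ints  &lt;- c(1L, NA, 3L)
chars &lt;- c("a", NA, "c")
nums  &lt;- c(1.5, NA, NaN, Inf, -Inf)

typeof(ints)   # "integer"
typeof(chars)  # "character"
typeof(nums)   # "double"

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
is.na(nums)        # FALSE  TRUE  TRUE FALSE FALSE
is.nan(nums)       # FALSE FALSE  TRUE FALSE FALSE
is.infinite(nums)  # FALSE FALSE FALSE  TRUE  TRUE
</pre></div>
</div>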
</div>
<div class="section" id="r-factors">
<h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3>
<p>These are ingested as &#8220;Categorical Values&#8221; in the DVN.</p>
<p>One thing to keep in mind: in both Stata and SPSS, the actual value of
a categorical variable can be either character or numeric. In R, all
factor values are strings, even if they are string representations of
numbers. So the values of the resulting categoricals in the DVN will
always be of string type too.</p>
<div class="line-block">
<div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an &#8220;Ordered Categorical&#8221; - a categorical value where an explicit order is assigned to the list of value labels.</div>
</div>
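<p>A short sketch of both points, using made-up vectors (the names
&#8220;codes&#8221; and &#8220;rating&#8221; are illustrative only): factor levels are
character strings even when built from numbers, and an ordered factor carries
an explicit ordering of its levels.</p>
<div class="highlight-r"><div class="highlight"><pre># A factor built from numbers still has character levels.
codes &lt;- factor(c(10, 20, 10))
levels(codes)                    # "10" "20"
is.character(levels(codes))      # TRUE
as.numeric(as.character(codes))  # 10 20 10 (recovering the numbers)

# An ordered factor: the order of the levels is stated explicitly.
rating &lt;- factor(c("low", "high", "medium"),
                 levels  = c("low", "medium", "high"),
                 ordered = TRUE)
is.ordered(rating)         # TRUE
rating[1] &lt; rating[2]   # TRUE, because "low" precedes "high"
</pre></div>
</div>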
</div>
<div class="section" id="new-boolean-values">
<h3>(New!) Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3>
<p>R Boolean (logical) values are supported.</p>
</div>
<div class="section" id="limitations-of-r-as-compared-to-spss-and-stata">
<h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3>
<p>Most noticeably, R lacks a standard mechanism for defining descriptive
labels for the data frame variables.  In the DVN, similarly to
both Stata and SPSS, variables have distinct names and labels; with
the latter reserved for longer, descriptive text.
For variables ingested from R data frames, the variable name will be
used as both the &#8220;name&#8221; and the &#8220;label&#8221;.</p>
<div class="line-block">
<div class="line"><em>Optional R packages exist for providing descriptive variable labels;
support for such a mechanism may be added in a future version.
It would of course work only for R files that were
created with such optional packages</em>.</div>
</div>
<p>Similarly, R categorical values (factors) lack descriptive labels too.
<strong>Note:</strong> This is potentially confusing, since R factors do
actually have &#8220;labels&#8221;.  This is a matter of terminology - an R
factor&#8217;s label is in fact the same thing as the &#8220;value&#8221; of a
categorical variable in SPSS or Stata and DVN; it contains the actual
meaningful data for the given observation. It is NOT a field reserved
for explanatory, human-readable text, as is the case with the
SPSS/Stata &#8220;label&#8221;.</p>
<p>Ingesting an R factor with the level labels &#8220;MALE&#8221; and &#8220;FEMALE&#8221; will
produce a categorical variable with &#8220;MALE&#8221; and &#8220;FEMALE&#8221; as both the
values and the labels.</p>
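<p>To illustrate the terminology, here is a small sketch with made-up data
(the variable name &#8220;sex&#8221; and the codes 1 and 2 are hypothetical). Once
the factor is built, its &#8220;labels&#8221; are the values R stores as levels;
the original numeric codes are not kept as a separate field:</p>
<div class="highlight-r"><div class="highlight"><pre># The "labels" argument supplies the values that the factor will hold.
sex &lt;- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("MALE", "FEMALE"))

levels(sex)        # "MALE" "FEMALE"
as.character(sex)  # "MALE" "FEMALE" "FEMALE" "MALE"
</pre></div>
</div>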
</div>
</div>
<div class="section" id="time-values-in-r">
<h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2>
<p>This warrants a dedicated section of its own, because of some unique
ways in which time values are handled in R.</p>
<p>R makes an effort to treat a time value as a real time instance. This
is in contrast with either SPSS or Stata, where time value
representations such as &#8220;Sep-23-2013 14:57:21&#8221; are allowed; note that
in the absence of an explicitly defined time zone, this value cannot
be mapped to an exact point in real time.  R handles times in the
&#8220;Unix-style&#8221; way: the value is converted to
&#8220;seconds-since-the-Epoch&#8221; Greenwich time (GMT or UTC) and the
resulting numeric value is stored in the data file; time zone
adjustments are made in real time as needed.</p>
<p>Things still get ambiguous and confusing when R <strong>displays</strong> this time
value: unless the time zone was explicitly defined, R will adjust the
value to the current time zone. The resulting behavior is often
counter-intuitive: if you create a time value, for example:</p>
<blockquote>
<div>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS");</div></blockquote>
<p>on a computer configured for the San Francisco time zone, the value
will be displayed differently on computers in different time zones;
for example, as &#8220;12:57 PST&#8221; while still on the West Coast, but as
&#8220;15:57 EST&#8221; in Boston.</p>
<p>If it is important that the values are always displayed the same way,
regardless of the current time zone, it is recommended that the time
zone be explicitly defined. For example:</p>
<blockquote>
<div>attr(timevalue, "tzone") &lt;- "PST"</div></blockquote>
<dl class="docutils">
<dt>or</dt>
<dd>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST");</dd>
</dl>
<p>Now the value will always be displayed as &#8220;12:57 PST&#8221;, regardless of
the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS
where R is installed actually understands the time zone &#8220;PST&#8221;, which
is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong>
the stored GMT value to <strong>the current time zone</strong>, yet it will still
display it with the &#8220;PST&#8221; tag attached! One way to rephrase this is
that R does a fairly decent job <strong>storing</strong> time values in a
non-ambiguous, platform-independent manner - but gives you no guarantee that
the values will be displayed in any way that is predictable or intuitive.</p>
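<p>The difference between what is stored and what is displayed can be
sketched as follows. This is an illustration only: the long-form zone names
&#8220;America/Los_Angeles&#8221; and &#8220;America/New_York&#8221; are assumptions
standing in for the West Coast and Boston, and they must be known to the
system running the code.</p>
<div class="highlight-r"><div class="highlight"><pre># Parse the example value in an explicitly named zone, so the result does
# not depend on the time zone of the machine running this sketch.
timevalue &lt;- as.POSIXct("03/19/2013 12:57:00",
                        format = "%m/%d/%Y %H:%M:%OS",
                        tz = "America/Los_Angeles")

# What is stored: seconds since the Epoch, independent of any time zone.
stored &lt;- as.numeric(timevalue)

# What is displayed depends on the zone used for formatting.
format(timevalue, tz = "America/Los_Angeles", usetz = TRUE)
format(timevalue, tz = "America/New_York", usetz = TRUE)  # 3 hours later on the clock
format(timevalue, tz = "UTC", usetz = TRUE)

# Changing the "tzone" attribute changes the display only,
# never the stored number.
shifted &lt;- timevalue
attr(shifted, "tzone") &lt;- "UTC"
as.numeric(shifted) == stored  # TRUE
</pre></div>
</div>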
<p>In practical terms, it is recommended to use the long/descriptive
forms of time zones, as they are more likely to be properly recognized
on most computers. For example, &#8220;Japan&#8221; instead of &#8220;JST&#8221;.  Another possible
solution is to explicitly use GMT or UTC (since it is very likely to be
properly recognized on any system), or the &#8220;UTC+&lt;OFFSET&gt;&#8221; notation. Still, none of the above
<strong>guarantees</strong> proper, non-ambiguous handling of time values in R data
sets. The fact that R <strong>quietly</strong> modifies time values when it doesn&#8217;t
recognize the supplied timezone attribute, yet still appends it to the
<strong>changed</strong> time value, makes this quite difficult. (These issues are
discussed in depth on R-related forums, and no attempt is made to
summarize them in any depth here; this is just to make you aware that
this is a potentially complex issue!)</p>
<p>An important thing to keep in mind, in connection with the DVN ingest
of R files, is that it will <strong>reject</strong> an R data file with any time
values that have time zones that we can&#8217;t recognize. This is done in
order to avoid (some of) the potential issues outlined above.</p>
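<p>Before uploading, it may help to check locally whether a time zone name is
known to your own R installation. The sketch below is only a rough pre-check:
the helper &#8220;tz_is_known&#8221; is hypothetical, and the set of names the DVN
recognizes is not documented here and may differ from what your system
reports.</p>
<div class="highlight-r"><div class="highlight"><pre># Hypothetical helper: TRUE if the name appears in the time zone database
# of the local R installation, as listed by OlsonNames().
tz_is_known &lt;- function(tz) {
  tz %in% OlsonNames()
}

# Results vary between systems, which is exactly the problem described above.
candidates &lt;- c("UTC", "Japan", "Asia/Tokyo", "PST", "JST")
data.frame(tz = candidates, known = tz_is_known(candidates))
</pre></div>
</div>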
<p>It is also recommended that any vectors containing time values
ingested into the DVN are reviewed, and the resulting entries in the
TAB files are compared against the original values in the R data
frame, to make sure they have been ingested as expected.</p>
<p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. The way the UNF
algorithm works, the same date/time values with and without the
timezone (e.g. &#8220;12:45&#8221; vs. &#8220;12:45 EST&#8221;) <strong>produce different
UNFs</strong>. Considering that time values in Stata/SPSS do not have time
zones, but ALL time values in R do (yes, they all do - if the timezone
wasn&#8217;t defined explicitly, it implicitly becomes a time value in the
&#8220;UTC&#8221; zone!), this means that it is <strong>impossible</strong> to have two time
value vectors, in Stata/SPSS and R, that produce the same UNF.</p>
<p><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of
the same data set that result in the same UNF when ingested, you may
define the time variables as <strong>strings</strong> in the R data frame, and use
the &#8220;YYYY-MM-DD HH:mm:ss&#8221; formatting notation. This is the formatting used by the UNF
algorithm to normalize time values, so doing the above will result in
the same UNF as the vector of the same time values in Stata.</p>
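<p>A minimal sketch of storing time values as strings in that notation. The
data frame &#8220;events&#8221; and its columns are made up for illustration, and
the format string &#8220;%Y-%m-%d %H:%M:%S&#8221; is assumed to be the R equivalent
of &#8220;YYYY-MM-DD HH:mm:ss&#8221;:</p>
<div class="highlight-r"><div class="highlight"><pre># Hypothetical time values, parsed in UTC so the sketch gives the same
# result on any machine.
times &lt;- as.POSIXct(c("2013-03-19 12:57:00", "2013-09-23 14:57:21"), tz = "UTC")

# Store the times as plain character strings rather than as POSIXct values.
events &lt;- data.frame(
  id   = 1:2,
  when = format(times, "%Y-%m-%d %H:%M:%S"),
  stringsAsFactors = FALSE
)

is.character(events$when)  # TRUE
events$when                # "2013-03-19 12:57:00" "2013-09-23 14:57:21"
</pre></div>
</div>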
<p>Note: date values (dates only, without time) should be handled the
exact same way as those in SPSS and Stata, and should produce the same
UNFs.</p>
</div>
</div>


          </div>
        </div>
      </div>
        </div>
        <div class="sidebar">
          <h3>Table Of Contents</h3>
          <ul>
<li class="toctree-l1"><a class="reference internal" href="dataverse-user-main.html">User Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="dataverse-installer-main.html">Installers Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="dataverse-developer-main.html">DVN Developers Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="dataverse-api-main.html">APIs Guide</a></li>
</ul>

          <h3 style="margin-top: 1.5em;">Search</h3>
          <form class="search" action="search.html" method="get">
            <input type="text" name="q" />
            <input type="submit" value="Go" />
            <input type="hidden" name="check_keywords" value="yes" />
            <input type="hidden" name="area" value="default" />
          </form>
          <p class="searchtip" style="font-size: 90%">
            Enter search terms.
          </p>
        </div>
        <div class="clearer"></div>
      </div>
    </div>

    <div class="footer-wrapper">
      <div class="footer">
        <div class="left">
          <a href="genindex.html" title="General Index"
             >index</a>
            <br/>
            <a href="_sources/dataverse-R-ingest.txt"
               rel="nofollow">Show Source</a>
        </div>

        <div class="right">
          
    <div class="footer">
        &copy; Copyright 1997-2013, President &amp; Fellows Harvard University.
      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1.
    </div>
        </div>
        <div class="clearer"></div>
      </div>
    </div>

  </body>
</html>