diff DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html @ 6:1b2188262ae9

adding the installer.
author "jurzua <jurzua@mpiwg-berlin.mpg.de>"
date Wed, 13 May 2015 11:50:21 +0200
parents
children
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/DVN-web/installer/dvninstall/doc/guides/dataverse-R-ingest.html	Wed May 13 11:50:21 2015 +0200
@@ -0,0 +1,270 @@
+
+
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+    
+    <title>Ingest of R (.RData) files &mdash; The Harvard Dataverse Network 3.6.1 documentation</title>
+    
+    <link rel="stylesheet" href="_static/agogo.css" type="text/css" />
+    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
+    
+    <script type="text/javascript">
+      var DOCUMENTATION_OPTIONS = {
+        URL_ROOT:    './',
+        VERSION:     '3.6.1',
+        COLLAPSE_INDEX: false,
+        FILE_SUFFIX: '.html',
+        HAS_SOURCE:  true
+      };
+    </script>
+    <script type="text/javascript" src="_static/jquery.js"></script>
+    <script type="text/javascript" src="_static/underscore.js"></script>
+    <script type="text/javascript" src="_static/doctools.js"></script>
+    <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
+    <link rel="top" title="The Harvard Dataverse Network 3.6.1 documentation" href="index.html" /> 
+  </head>
+  <body>
+    <div class="header-wrapper">
+      <div class="header">
+        <div class="headertitle"><a
+          href="index.html">The Harvard Dataverse Network 3.6.1 documentation</a></div>
+        <div class="rel">
+          <a href="genindex.html" title="General Index"
+             accesskey="I">index</a>
+        </div>
+       </div>
+    </div>
+
+    <div class="content-wrapper">
+      <div class="content">
+        <div class="document">
+            
+      <div class="documentwrapper">
+        <div class="bodywrapper">
+          <div class="body">
+            
+  <div class="section" id="ingest-of-r-rdata-files">
+<h1>Ingest of R (.RData) files<a class="headerlink" href="#ingest-of-r-rdata-files" title="Permalink to this headline">¶</a></h1>
+<div class="section" id="overview">
+<h2>Overview.<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
+<p>Support for ingesting R data files was added in version 3.5. R
+has become increasingly popular in the research/academic community,
+largely because it is free and open-source (unlike SPSS and
+Stata). Consequently, more and more data is becoming available
+exclusively as R data files. This long-awaited feature makes it
+possible to ingest such data into the DVN as &#8220;subsettable&#8221; files.</p>
+</div>
+<div class="section" id="requirements">
+<h2>Requirements.<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2>
+<p>R ingest relies on R having been installed, configured and made
+available to the DVN application via Rserve (see the Installers
+Guide). This is in contrast to the SPSS and Stata ingest, which can
+be performed without R present (though R is still needed to perform
+most subsetting/analysis tasks on the resulting data files).</p>
+<p>The data must be formatted as an R data frame (data.frame()). If an
+.RData file contains multiple data frames, only the first one will be
+ingested.</p>
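+<p>As an illustration of the requirement above, the following minimal sketch
+builds a data frame, saves it as the only object in an .RData file, and
+reloads it to check what the file contains. The object name, column names and
+file name are hypothetical and chosen only for this example.</p>
+<div class="highlight-r"><div class="highlight"><pre># Hypothetical example data: integer, double and character columns.
+survey &lt;- data.frame(
+  id     = 1:3,
+  income = c(41250.5, NA, 38900),
+  region = c("north", "south", "north"),
+  stringsAsFactors = FALSE
+)
+
+# Save the data frame as the only object in the file.
+rdata_path &lt;- file.path(tempdir(), "survey.RData")
+save(survey, file = rdata_path)
+
+# Reload into a separate environment and list what the file contains.
+env &lt;- new.env()
+loaded &lt;- load(rdata_path, envir = env)
+print(loaded)
+print(is.data.frame(env[[loaded[1]]]))
+</pre></div>
+</div>
+<p>Keeping a single data frame per file avoids any doubt about which object
+will be ingested.</p>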
+</div>
+<div class="section" id="data-types-compared-to-other-supported-formats-stat-spss">
+<h2>Data Types, compared to other supported formats (Stata, SPSS)<a class="headerlink" href="#data-types-compared-to-other-supported-formats-stat-spss" title="Permalink to this headline">¶</a></h2>
+<div class="section" id="integers-doubles-character-strings">
+<h3>Integers, Doubles, Character strings<a class="headerlink" href="#integers-doubles-character-strings" title="Permalink to this headline">¶</a></h3>
+<p>The handling of these types is intuitive and straightforward. The
+resulting tab file columns, summary statistics and UNF signatures
+should be identical to those produced by ingesting the same vectors
+from SPSS and Stata.</p>
+<p><strong>A couple of things that are unique to R/new in DVN:</strong></p>
+<p>R explicitly supports Missing Values for all of the types above;
+Missing Values encoded in R vectors will be recognized and preserved
+in TAB files (as &#8216;NA&#8217;), and counted in the generated summary statistics
+and data analysis.</p>
+<p>In addition to Missing Values, R recognizes &#8220;Not a Number&#8221; (NaN) and
+positive and negative infinity for floating point variables. These
+are now properly supported by the DVN.</p>
+<p>Also note that, unlike Stata, which recognizes &#8220;float&#8221; and &#8220;double&#8221;
+as distinct data types, R stores all floating point values in
+double precision.</p>
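+<p>The distinction between these special values can be checked directly in R.
+The short sketch below uses only base R functions; the vector names are
+arbitrary and not part of the DVN.</p>
+<div class="highlight-r"><div class="highlight"><pre># A double vector holding a missing value, NaN and both infinities.
+x &lt;- c(1.5, NA, NaN, Inf, -Inf)
+
+print(is.na(x))        # TRUE for NA and also for NaN
+print(is.nan(x))       # TRUE only for NaN
+print(is.infinite(x))  # TRUE for Inf and -Inf
+print(typeof(x))       # floating point values are stored as "double"
+
+# Integer vectors can carry missing values as well.
+n &lt;- c(1L, NA, 3L)
+print(typeof(n))
+print(is.na(n))
+</pre></div>
+</div>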
+</div>
+<div class="section" id="r-factors">
+<h3>R Factors<a class="headerlink" href="#r-factors" title="Permalink to this headline">¶</a></h3>
+<p>These are ingested as &#8220;Categorical Values&#8221; in the DVN.</p>
+<p>One thing to keep in mind: in both Stata and SPSS, the actual value of
+a categorical variable can be either character or numeric. In R, all
+factor values are strings, even if they are string representations of
+numbers. So the values of the resulting categoricals in the DVN will
+always be of string type too.</p>
+<div class="line-block">
+<div class="line"><strong>New:</strong> To properly handle <em>ordered factors</em> in R, the DVN now supports the concept of an &#8220;Ordered Categorical&#8221; - a categorical value where an explicit order is assigned to the list of value labels.</div>
+</div>
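+<p>Both points can be seen in a small base R sketch. The variable names and
+level names below are made up for illustration.</p>
+<div class="highlight-r"><div class="highlight"><pre># Factor values are strings, even when they look like numbers.
+codes &lt;- factor(c("1", "2", "1", "3"))
+print(levels(codes))
+print(is.character(levels(codes)))
+
+# An ordered factor carries an explicit order for its levels.
+rating &lt;- factor(
+  c("low", "high", "medium", "low"),
+  levels  = c("low", "medium", "high"),
+  ordered = TRUE
+)
+print(is.ordered(rating))
+
+# Comparisons follow the declared level order, not alphabetical order.
+print(rating &lt; "high")
+</pre></div>
+</div>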
+</div>
+<div class="section" id="new-boolean-values">
+<h3>(New!) Boolean values<a class="headerlink" href="#new-boolean-values" title="Permalink to this headline">¶</a></h3>
+<p>R Boolean (logical) values are supported.</p>
+</div>
+<div class="section" id="limitations-of-r-as-compared-to-spss-and-stata">
+<h3>Limitations of R, as compared to SPSS and Stata.<a class="headerlink" href="#limitations-of-r-as-compared-to-spss-and-stata" title="Permalink to this headline">¶</a></h3>
+<p>Most noticeably, R lacks a standard mechanism for defining descriptive
+labels for the data frame variables. In the DVN, as in
+both Stata and SPSS, variables have distinct names and labels, with
+the latter reserved for longer, descriptive text.
+For variables ingested from R data frames, the variable name will be
+used for both the &#8220;name&#8221; and the &#8220;label&#8221;.</p>
+<div class="line-block">
+<div class="line"><em>Optional R packages exist for providing descriptive variable labels;
+support for such a mechanism may be added in a future version.
+It would of course work only for R files that were
+created with such optional packages</em>.</div>
+</div>
+<p>Similarly, R categorical values (factors) lack descriptive labels too.
+<strong>Note:</strong> This is potentially confusing, since R factors do
+actually have &#8220;labels&#8221;. This is a matter of terminology: an R
+factor&#8217;s label is in fact the same thing as the &#8220;value&#8221; of a
+categorical variable in SPSS or Stata and the DVN; it contains the actual
+meaningful data for the given observation. It is NOT a field reserved
+for explanatory, human-readable text, as is the case with the
+SPSS/Stata &#8220;label&#8221;.</p>
+<p>Ingesting an R factor with the level labels &#8220;MALE&#8221; and &#8220;FEMALE&#8221; will
+produce a categorical variable with &#8220;MALE&#8221; and &#8220;FEMALE&#8221; as
+both the values and the labels.</p>
+</div>
+</div>
+<div class="section" id="time-values-in-r">
+<h2>Time values in R<a class="headerlink" href="#time-values-in-r" title="Permalink to this headline">¶</a></h2>
+<p>This warrants a dedicated section of its own, because of some unique
+ways in which time values are handled in R.</p>
+<p>R makes an effort to treat a time value as a real instant in time. This
+is in contrast with both SPSS and Stata, where time value
+representations such as &#8220;Sep-23-2013 14:57:21&#8221; are allowed; note that
+in the absence of an explicitly defined time zone, this value cannot
+be mapped to an exact point in real time. R handles times in the
+&#8220;Unix-style&#8221; way: the value is converted to the
+&#8220;seconds-since-the-Epoch&#8221; Greenwich time (GMT or UTC) and the
+resulting numeric value is stored in the data file; time zone
+adjustments are made in real time as needed.</p>
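+<p>The storage model described above can be inspected with base R. This is a
+minimal sketch; it assumes that the operating system recognizes the zone name
+&#8220;America/Los_Angeles&#8221;, which is common but not guaranteed.</p>
+<div class="highlight-r"><div class="highlight"><pre># A time value with an explicit time zone, so the example does not
+# depend on the local settings of the machine.
+timevalue &lt;- as.POSIXct("2013-03-19 12:57:00", tz = "UTC")
+
+# The stored value is a plain number: seconds since 1970-01-01 00:00:00 UTC.
+seconds &lt;- as.numeric(timevalue)
+print(seconds)
+
+# The same stored number, rendered for two different zones.
+print(format(timevalue, tz = "UTC", usetz = TRUE))
+print(format(timevalue, tz = "America/Los_Angeles", usetz = TRUE))
+</pre></div>
+</div>
+<p>Only the displayed text changes between the two calls to format(); the
+underlying number stays the same.</p>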
+<p>Things still get ambiguous and confusing when R <strong>displays</strong> this time
+value: unless the time zone was explicitly defined, R will adjust the
+value to the current time zone. The resulting behavior is often
+counter-intuitive: if you create a time value, for example:</p>
+<blockquote>
+<div>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS")</div></blockquote>
+<p>on a computer configured for the San Francisco time zone, the value
+will be displayed differently on computers in different time zones;
+for example, as &#8220;12:57 PST&#8221; while still on the West Coast, but as
+&#8220;15:57 EST&#8221; in Boston.</p>
+<p>If it is important that the values are always displayed the same way,
+regardless of the current time zones, it is recommended that the time
+zone is explicitly defined. For example:</p>
+<blockquote>
+<div>attr(timevalue, "tzone") &lt;- "PST"</div></blockquote>
+<dl class="docutils">
+<dt>or</dt>
+<dd>timevalue &lt;- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST")</dd>
+</dl>
+<p>Now the value will always be displayed as &#8220;12:57 PST&#8221;, regardless of
+the time zone that is current for the OS ... <strong>BUT ONLY</strong> if the OS
+where R is installed actually understands the time zone &#8220;PST&#8221;, which
+is not by any means guaranteed! Otherwise, it will <strong>quietly adjust</strong>
+the stored GMT value to <strong>the current time zone</strong>, yet it will still
+display it with the &#8220;PST&#8221; tag attached! One way to rephrase this is
+that R does a fairly decent job <strong>storing</strong> time values in a
+non-ambiguous, platform-independent manner, but gives you no guarantee that
+the values will be displayed in any way that is predictable or intuitive.</p>
+<p>In practical terms, it is recommended to use the long/descriptive
+forms of time zones, as they are more likely to be properly recognized
+on most computers. For example, &#8220;Japan&#8221; instead of &#8220;JST&#8221;. Another possible
+solution is to explicitly use GMT or UTC (since it is very likely to be
+properly recognized on any system), or the &#8220;UTC+&lt;OFFSET&gt;&#8221; notation. Still, none of the above
+<strong>guarantees</strong> proper, non-ambiguous handling of time values in R data
+sets. The fact that R <strong>quietly</strong> modifies time values when it doesn&#8217;t
+recognize the supplied timezone attribute, yet still appends it to the
+<strong>changed</strong> time value, makes this quite difficult. (These issues are
+discussed in depth on R-related forums, and no attempt is made to
+summarize them here; this is just to make you aware that
+this is a potentially complex issue!)</p>
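+<p>One way to see which zone names a given R installation knows about is the
+base R function OlsonNames(). The sketch below is only an illustration: the
+helper is_known_tz is hypothetical, the result depends on the time zone
+database available on the machine, and this is not necessarily the check that
+the DVN ingest itself performs.</p>
+<div class="highlight-r"><div class="highlight"><pre># Hypothetical helper: TRUE when the zone name is listed in the time zone
+# database that this R installation uses.
+is_known_tz &lt;- function(tz) {
+  tz %in% OlsonNames()
+}
+
+# Report each candidate name; results vary between systems.
+candidates &lt;- c("UTC", "Japan", "America/New_York", "PST")
+for (tz in candidates) {
+  status &lt;- if (is_known_tz(tz)) "known" else "NOT known"
+  cat(sprintf("%-18s %s\n", tz, status))
+}
+</pre></div>
+</div>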
+<p>An important thing to keep in mind, in connection with the DVN ingest
+of R files, is that it will <strong>reject</strong> an R data file with any time
+values that have time zones that the DVN cannot recognize. This is done in
+order to avoid some of the potential issues outlined above.</p>
+<p>It is also recommended that any vectors containing time values
+ingested into the DVN are reviewed, and the resulting entries in the
+TAB files are compared against the original values in the R data
+frame, to make sure they have been ingested as expected.</p>
+<p>Another <strong>potential issue</strong> here is the <strong>UNF</strong>. The way the UNF
+algorithm works, the same date/time values with and without the
+timezone (e.g. &#8220;12:45&#8221; vs. &#8220;12:45 EST&#8221;) <strong>produce different
+UNFs</strong>. Considering that time values in Stata/SPSS do not have time
+zones, but ALL time values in R do (yes, they all do: if the timezone
+wasn&#8217;t defined explicitly, it implicitly becomes a time value in the
+&#8220;UTC&#8221; zone!), this means that it is <strong>impossible</strong> to have two time
+value vectors, in Stata/SPSS and R, that produce the same UNF.</p>
+<div class="line-block">
+<div class="line"><strong>A pro tip:</strong> if it is important to produce SPSS/Stata and R versions of
+the same data set that result in the same UNF when ingested, you may
+define the time variables as <strong>strings</strong> in the R data frame, and use
+the &#8220;YYYY-MM-DD HH:mm:ss&#8221; formatting notation. This is the formatting used by the UNF
+algorithm to normalize time values, so doing the above will result in
+the same UNF as the vector of the same time values in Stata.</div>
+</div>
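+<p>A minimal sketch of this approach in base R follows. The data frame and
+column names are hypothetical, and UTC is used here only so that the example
+does not depend on the local time zone.</p>
+<div class="highlight-r"><div class="highlight"><pre># Hypothetical time values, created with an explicit time zone.
+times &lt;- as.POSIXct(
+  c("2013-03-19 12:57:00", "2013-09-23 14:57:21"),
+  tz = "UTC"
+)
+
+# Store them as character strings in the "YYYY-MM-DD HH:mm:ss" layout.
+events &lt;- data.frame(id = 1:2, stringsAsFactors = FALSE)
+events$when &lt;- format(times, "%Y-%m-%d %H:%M:%S", tz = "UTC")
+
+print(events)
+print(is.character(events$when))
+</pre></div>
+</div>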
+<p>Note: date values (dates only, without time) should be handled the
+exact same way as those in SPSS and Stata, and should produce the same
+UNFs.</p>
+</div>
+</div>
+
+
+          </div>
+        </div>
+      </div>
+        </div>
+        <div class="sidebar">
+          <h3>Table Of Contents</h3>
+          <ul>
+<li class="toctree-l1"><a class="reference internal" href="dataverse-user-main.html">User Guide</a></li>
+<li class="toctree-l1"><a class="reference internal" href="dataverse-installer-main.html">Installers Guide</a></li>
+<li class="toctree-l1"><a class="reference internal" href="dataverse-developer-main.html">DVN Developers Guide</a></li>
+<li class="toctree-l1"><a class="reference internal" href="dataverse-api-main.html">APIs Guide</a></li>
+</ul>
+
+          <h3 style="margin-top: 1.5em;">Search</h3>
+          <form class="search" action="search.html" method="get">
+            <input type="text" name="q" />
+            <input type="submit" value="Go" />
+            <input type="hidden" name="check_keywords" value="yes" />
+            <input type="hidden" name="area" value="default" />
+          </form>
+          <p class="searchtip" style="font-size: 90%">
+            Enter search terms.
+          </p>
+        </div>
+        <div class="clearer"></div>
+      </div>
+    </div>
+
+    <div class="footer-wrapper">
+      <div class="footer">
+        <div class="left">
+          <a href="genindex.html" title="General Index"
+             >index</a>
+            <br/>
+            <a href="_sources/dataverse-R-ingest.txt"
+               rel="nofollow">Show Source</a>
+        </div>
+
+        <div class="right">
+          
+    <div class="footer">
+        &copy; Copyright 1997-2013, President &amp; Fellows Harvard University.
+      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1.
+    </div>
+        </div>
+        <div class="clearer"></div>
+      </div>
+    </div>
+
+  </body>
+</html>
\ No newline at end of file