LGDataverses: DVN-web/installer/dvninstall/doc/guides/_sources/dataverse-R-ingest.txt comparison

comparison DVN-web/installer/dvninstall/doc/guides/_sources/dataverse-R-ingest.txt @ 6:1b2188262ae9

adding the installer.

author	"jurzua <jurzua@mpiwg-berlin.mpg.de>"
date	Wed, 13 May 2015 11:50:21 +0200

-:dd9adfc73390
+:1b2188262ae9
+==========================
+Ingest of R (.RData) files
+==========================
+Overview.
+=========
+Support for ingesting R data files was added in version 3.5. R
+has become increasingly popular in the research/academic community,
+owing to the fact that it is free and open-source (unlike SPSS and
+Stata). Consequently, more and more data is becoming available
+exclusively as R data files. This long-awaited feature makes it
+possible to ingest such data into DVN as "subsettable" files.
+Requirements.
+=============
+R ingest relies on R having been installed, configured and made
+available to the DVN application via Rserve (see the Installers
+Guide). This is in contrast to SPSS and Stata ingest, which can
+be performed without R present (though R is still needed to perform
+most subsetting/analysis tasks on the resulting data files).
+The data must be formatted as an R data frame (data.frame()). If an
+.RData file contains multiple data frames, only the first one will be
+ingested.
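
As a minimal sketch of the requirement above, the following R code builds a small data frame and saves it, alone, to an .RData file. The object name `survey`, its columns and the file name are made up for illustration; nothing here is prescribed by the DVN.

```r
# Hypothetical example data; all names are illustrative only.
survey <- data.frame(
  id    = 1:3,
  score = c(1.5, NA, 3.25),
  city  = c("Boston", "Berlin", "Tokyo"),
  stringsAsFactors = FALSE
)

# Save a single data frame, so there is no ambiguity about which one is ingested.
rdata_path <- file.path(tempdir(), "survey.RData")
save(survey, file = rdata_path)

# Sanity check: reload into a clean environment and list the data frames found.
env <- new.env()
loaded <- load(rdata_path, envir = env)
is_df <- vapply(loaded, function(n) is.data.frame(get(n, envir = env)), logical(1))
print(loaded[is_df])
```

Keeping exactly one data frame per file avoids relying on which object counts as the "first" one.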
+Data Types, compared to other supported formats (Stata, SPSS)
+=============================================================
+Integers, Doubles, Character strings
+------------------------------------
+The handling of these types is intuitive and straightforward. The
+resulting tab file columns, summary statistics and UNF signatures
+should be identical to those produced by ingesting the same vectors
+from SPSS and Stata.
+**A couple of things that are unique to R/new in DVN:**
+R explicitly supports Missing Values for all of the types above;
+Missing Values encoded in R vectors will be recognized and preserved
+in TAB files (as 'NA'), and accounted for in the generated summary
+statistics and data analysis.
+In addition to Missing Values, R recognizes "Not a Number" (NaN) and
+positive and negative infinity for floating point variables. These
+are now properly supported by the DVN.
+Also note that, unlike Stata, which recognizes "float" and "double"
+as distinct data types, all floating point values in R are in fact
+double precision.
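
The special values described above can be inspected directly in R. This is only an illustrative sketch using base R functions; the vector `x` is a made-up example.

```r
# A numeric vector holding a regular value, a missing value, NaN and both infinities.
x <- c(1.5, NA, NaN, Inf, -Inf)

missing_flags  <- is.na(x)        # TRUE for NA and also for NaN
nan_flags      <- is.nan(x)       # TRUE only for NaN
infinite_flags <- is.infinite(x)  # TRUE for Inf and -Inf

# R has no separate single-precision type: floating point vectors are "double".
float_type   <- typeof(x)
integer_type <- typeof(1L)

print(data.frame(x, missing_flags, nan_flags, infinite_flags))
```

Note that `is.na()` does not distinguish NA from NaN; use `is.nan()` when the difference matters.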
+R Factors
+---------
+These are ingested as "Categorical Values" in the DVN.
+One thing to keep in mind: in both Stata and SPSS, the actual value of
+a categorical variable can be both character and numeric. In R, all
+factor values are strings, even if they are string representations of
+numbers. So the values of the resulting categoricals in the DVN will
+always be of string type too.
+| **New:** To properly handle *ordered factors* in R, the DVN now supports the concept of an "Ordered Categorical" - a categorical value where an explicit order is assigned to the list of value labels.
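
The following sketch illustrates both points on the R side: factor levels are always character strings, and an ordered factor carries an explicit ordering of its levels. The variable names and values are invented for the example.

```r
# Levels are stored as strings even when they look like numbers.
f <- factor(c("10", "20", "10"))
level_values <- levels(f)
levels_are_strings <- is.character(level_values)

# An ordered factor: the order of the levels is given explicitly.
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
sizes_ordered <- is.ordered(sizes)

# Comparisons follow the declared level order, not alphabetical order.
low_before_high <- sizes[1] < sizes[2]

print(level_values)
print(sizes)
```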
+(New!) Boolean values
+---------------------
+R Boolean (logical) values are supported.
+Limitations of R, as compared to SPSS and Stata.
+------------------------------------------------
+Most noticeably, R lacks a standard mechanism for defining descriptive
+labels for the data frame variables. In the DVN, similarly to
+both Stata and SPSS, variables have distinct names and labels, with
+the latter reserved for longer, descriptive text.
+With variables ingested from R data frames, the variable name will be
+used for both the "name" and the "label".
+| *Optional R packages exist for providing descriptive variable labels;
+  support for such a mechanism may be added in a future version.
+  It would of course work only for R files that were
+  created with such optional packages*.
+Similarly, R categorical values (factors) lack descriptive labels too.
+**Note:** This is potentially confusing, since R factors do
+actually have "labels". This is a matter of terminology - an R
+factor's label is in fact the same thing as the "value" of a
+categorical variable in SPSS or Stata and DVN; it contains the actual
+meaningful data for the given observation. It is NOT a field reserved
+for explanatory, human-readable text, as is the case with the
+SPSS/Stata "label".
+Ingesting an R factor with the level labels "MALE" and "FEMALE" will
+produce a categorical variable with "MALE" and "FEMALE" as both the
+values and the labels.
+Time values in R
+================
+This warrants a dedicated section of its own, because of some unique
+ways in which time values are handled in R.
+R makes an effort to treat a time value as an actual instant in time. This
+is in contrast with both SPSS and Stata, where time value
+representations such as "Sep-23-2013 14:57:21" are allowed; note that
+in the absence of an explicitly defined time zone, this value cannot
+be mapped to an exact point in real time. R handles times in the
+"Unix-style" way: the value is converted to the
+"seconds-since-the-Epoch" Greenwich time (GMT or UTC) and the
+resulting numeric value is stored in the data file; time zone
+adjustments are made in real time as needed.
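
A small sketch of this storage model, assuming the example time above is meant as UTC (an assumption made here only so that the value is unambiguous): the stored quantity is a plain number of seconds, and the time zone only affects how it is formatted. "Asia/Tokyo" is used simply as an example of a second zone.

```r
# Parse the example time as UTC (assumed for illustration).
tv <- as.POSIXct("2013-09-23 14:57:21", tz = "UTC")

# The underlying value is the number of seconds since 1970-01-01 00:00:00 UTC.
secs <- as.numeric(tv)

# The same stored instant, rendered for two different time zones.
utc_text   <- format(tv, "%Y-%m-%d %H:%M:%S", tz = "UTC")
tokyo_text <- format(tv, "%Y-%m-%d %H:%M:%S", tz = "Asia/Tokyo")

print(secs)
print(c(utc = utc_text, tokyo = tokyo_text))
```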
+Things still get ambiguous and confusing when R **displays** this time
+value: unless the time zone was explicitly defined, R will adjust the
+value to the current time zone. The resulting behavior is often
+counter-intuitive: if you create a time value, for example:
+		   timevalue<-as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS");
+on a computer configured for the San Francisco time zone, the value
+will be displayed differently on computers in different time zones;
+for example, as "12:57 PST" while still on the West Coast, but as
+"15:57 EST" in Boston.
+If it is important that the values are always displayed the same way,
+regardless of the current time zones, it is recommended that the time
+zone is explicitly defined. For example:
+		   attr(timevalue,"tzone")<-"PST"
+or
+		   timevalue<-as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz="PST");
+Now the value will always be displayed as "12:57 PST", regardless of
+the time zone that is current for the OS ... **BUT ONLY** if the OS
+where R is installed actually understands the time zone "PST", which
+is not by any means guaranteed! Otherwise, it will **quietly adjust**
+the stored GMT value to **the current time zone**, yet it will still
+display it with the "PST" tag attached! One way to rephrase this is
+that R does a fairly decent job of **storing** time values in a
+non-ambiguous, platform-independent manner, but gives you no guarantee that
+the values will be displayed in any way that is predictable or intuitive.
+In practical terms, it is recommended to use the long/descriptive
+forms of time zones, as they are more likely to be properly recognized
+on most computers. For example, "Japan" instead of "JST". Another possible
+solution is to explicitly use GMT or UTC (since it is very likely to be
+properly recognized on any system), or the "UTC+<OFFSET>" notation. Still, none of the above
+**guarantees** proper, non-ambiguous handling of time values in R data
+sets. The fact that R **quietly** modifies time values when it doesn't
+recognize the supplied timezone attribute, yet still appends it to the
+**changed** time value, does make this quite difficult. (These issues are
+discussed in depth on R-related forums, and no attempt is made to
+summarize them in any depth here; this is just to make you aware of
+this being a potentially complex issue!)
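
One way to reduce the risk described above is to check a time zone name against the names known to the local R installation before using it. This is a sketch only: `OlsonNames()` is a base R function, but the choice of "America/Los_Angeles" as the long-form name for the West Coast example is an assumption made here, and a name being known locally does not guarantee it is known on every other system.

```r
# Long-form zone name chosen for illustration.
tz_name <- "America/Los_Angeles"

# Refuse to continue if this system does not know the zone name.
known <- tz_name %in% OlsonNames()
stopifnot(known)

# Create the time value with the zone defined explicitly.
tv <- as.POSIXct("03/19/2013 12:57:00",
                 format = "%m/%d/%Y %H:%M:%S", tz = tz_name)

# The zone travels with the value as an attribute ...
zone_attr <- attr(tv, "tzone")

# ... and the stored instant can be checked by rendering it in UTC.
utc_text <- format(tv, "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(zone_attr)
print(utc_text)

# Short abbreviations can be checked the same way before relying on them.
print("PST" %in% OlsonNames())
```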
+An important thing to keep in mind, in connection with the DVN ingest
+of R files, is that it will **reject** an R data file with any time
+values that have time zones that the DVN cannot recognize. This is done in
+order to avoid some of the potential issues outlined above.
+It is also recommended that any vectors containing time values
+ingested into the DVN are reviewed, and the resulting entries in the
+TAB files are compared against the original values in the R data
+frame, to make sure they have been ingested as expected.
+Another **potential issue** here is the **UNF**. The way the UNF
+algorithm works, the same date/time values with and without the
+timezone (e.g. "12:45" vs. "12:45 EST") **produce different
+UNFs**. Considering that time values in Stata/SPSS do not have time
+zones, but ALL time values in R do (yes, they all do - if the timezone
+wasn't defined explicitly, it implicitly becomes a time value in the
+"UTC" zone!), this means that it is **impossible** to have two time
+value vectors, in Stata/SPSS and R, that produce the same UNF.
+| **A pro tip:** if it is important to produce SPSS/Stata and R versions of
+  the same data set that result in the same UNF when ingested, you may
+  define the time variables as **strings** in the R data frame, and use
+  the "YYYY-MM-DD HH:mm:ss" formatting notation. This is the formatting used by the UNF
+  algorithm to normalize time values, so doing the above will result in
+  the same UNF as the vector of the same time values in Stata.
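
A minimal sketch of the tip above: the time values are converted to character strings in the "YYYY-MM-DD HH:mm:ss" layout before being stored in the data frame. The data frame `events`, its column names and the values are hypothetical.

```r
# Example time values; UTC is used here only to keep the example unambiguous.
when <- as.POSIXct(c("2013-03-19 12:57:00", "2013-09-23 14:57:21"), tz = "UTC")

# Store the times as plain strings rather than as POSIXct values.
events <- data.frame(
  id   = 1:2,
  time = format(when, "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  stringsAsFactors = FALSE
)

str(events)
```

The trade-off is that the column is then ingested as a character variable, not as a time value.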
+Note: date values (dates only, without time) should be handled the
+exact same way as those in SPSS and Stata, and should produce the same
+UNFs.
