comparison DVN-web/installer/dvninstall/doc/guides/_sources/dataverse-R-ingest.txt @ 6:1b2188262ae9

adding the installer.
author "jurzua <jurzua@mpiwg-berlin.mpg.de>"
date Wed, 13 May 2015 11:50:21 +0200
5:dd9adfc73390 6:1b2188262ae9
==========================
Ingest of R (.RData) files
==========================

Overview.
=========

Support for ingesting R data files was added in version 3.5. R has
become increasingly popular in the research/academic community,
owing to the fact that it is free and open-source (unlike SPSS and
Stata). Consequently, more and more data is becoming available
exclusively as R data files. This long-awaited feature makes it
possible to ingest such data into DVN as "subsettable" files.

Requirements.
=============

R ingest relies on R having been installed, configured and made
available to the DVN application via RServe (see the Installers
Guide). This is in contrast to the SPSS and Stata ingest, which can
be performed without R present (though R is still needed to perform
most subsetting/analysis tasks on the resulting data files).

The data must be formatted as an R data frame (data.frame()). If an
.RData file contains multiple data frames, only the first one will be
ingested.
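
A minimal sketch of preparing a file that meets this requirement, assuming a plain R session. The object name "survey" and the file name are illustrative only; they are not names the DVN expects:

```r
# Build a small data frame and save it as the only object in an .RData file.
survey <- data.frame(
  id     = 1:3,
  income = c(42000.5, NA, 58100.0),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

rdata_path <- file.path(tempdir(), "survey.RData")  # hypothetical file name
save(survey, file = rdata_path)

# Reload into a separate environment to confirm what the file contains.
check_env <- new.env()
loaded <- load(rdata_path, envir = check_env)
is_df <- vapply(loaded,
                function(n) is.data.frame(get(n, envir = check_env)),
                logical(1))
print(loaded[is_df])  # names of the data frames stored in the file
```

Saving exactly one data frame per file removes any doubt about which one would be picked up as the first.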

Data Types, compared to other supported formats (Stata, SPSS)
=============================================================

Integers, Doubles, Character strings
------------------------------------

The handling of these types is intuitive and straightforward. The
resulting tab file columns, summary statistics and UNF signatures
should be identical to those produced by ingesting the same vectors
from SPSS and Stata.

**A couple of things that are unique to R/new in DVN:**

R explicitly supports Missing Values for all of the types above;
Missing Values encoded in R vectors will be recognized and preserved
in TAB files (as 'NA'), and counted in the generated summary statistics
and data analysis.

In addition to Missing Values, R recognizes "Not a Number" (NaN) and
positive and negative infinity for floating point variables. These
are now properly supported by the DVN.

Also note that, unlike Stata, which recognizes "float" and "double"
as distinct data types, all floating point values in R are in fact
double precision.
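
The special values described above can be inspected directly in R. The following is a small sketch using only base R functions; the variable names are illustrative:

```r
# One double vector holding a regular value, a missing value,
# a not-a-number value, and both infinities.
x <- c(1.5, NA, NaN, Inf, -Inf)

missing_or_nan <- is.na(x)        # TRUE for NA and also for NaN
nan_only       <- is.nan(x)       # TRUE for NaN only
infinite       <- is.infinite(x)  # TRUE for Inf and -Inf

# R has a single floating point storage type.
float_type <- typeof(x)           # "double"

# Missing values are also available for integers and character strings.
int_type <- typeof(c(1L, NA))     # "integer"
chr_type <- typeof(c("a", NA))    # "character"

print(data.frame(x, missing_or_nan, nan_only, infinite))
```

Note that is.na() does not distinguish NA from NaN; is.nan() is needed to tell them apart when reviewing a vector before ingest.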

R Factors
---------

These are ingested as "Categorical Values" in the DVN.

One thing to keep in mind: in both Stata and SPSS, the actual value of
a categorical variable can be either character or numeric. In R, all
factor values are strings, even if they are string representations of
numbers. So the values of the resulting categoricals in the DVN will
always be of string type too.

| **New:** To properly handle *ordered factors* in R, the DVN now supports the concept of an "Ordered Categorical" - a categorical value where an explicit order is assigned to the list of value labels.
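
A short sketch of both points on the R side, using base R only (the variable names are made up for illustration): factor values are strings even when built from numbers, and an ordered factor carries an explicit ordering of its levels.

```r
# Factor built from numbers: the levels are still character strings.
codes <- factor(c(10, 20, 10))
code_levels <- levels(codes)                    # "10" "20"
internal    <- as.integer(codes)                # 1 2 1 (positions of the levels)
recovered   <- as.numeric(as.character(codes))  # 10 20 10

# Ordered factor: the order of the levels is part of the data.
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)
below_high <- rating < "high"                   # TRUE FALSE TRUE

print(code_levels)
print(is.ordered(rating))
```

Calling as.integer() on a factor returns level positions rather than the original numbers, which is why the conversion goes through as.character() first.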

(New!) Boolean values
---------------------

R Boolean (logical) values are supported.


Limitations of R, as compared to SPSS and Stata.
------------------------------------------------

Most noticeably, R lacks a standard mechanism for defining descriptive
labels for the data frame variables. In the DVN, similarly to
both Stata and SPSS, variables have distinct names and labels, with
the latter reserved for longer, descriptive text.
With variables ingested from R data frames, the variable name will be
used for both the "name" and the "label".

| *Optional R packages exist for providing descriptive variable labels;
  in one of the future versions support may be added for such a
  mechanism. It would of course work only for R files that were
  created with such optional packages*.

Similarly, R categorical values (factors) lack descriptive labels too.
**Note:** This is potentially confusing, since R factors do
actually have "labels". This is a matter of terminology - an R
factor's label is in fact the same thing as the "value" of a
categorical variable in SPSS or Stata and DVN; it contains the actual
meaningful data for the given observation. It is NOT a field reserved
for explanatory, human-readable text, as is the case with the
SPSS/Stata "label".

Ingesting an R factor with the level labels "MALE" and "FEMALE" will
produce a categorical variable with "MALE" and "FEMALE" as both the
values and the labels.
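
The terminology point can be seen in a small base R sketch (the variable names are illustrative): whatever is passed through the "labels" argument of factor() becomes the factor's actual values.

```r
# Raw codes 1 and 2, with "MALE" and "FEMALE" supplied via the 'labels' argument.
sex <- factor(c(1, 2, 1), levels = c(1, 2), labels = c("MALE", "FEMALE"))

sex_levels <- levels(sex)        # "MALE" "FEMALE"
sex_values <- as.character(sex)  # "MALE" "FEMALE" "MALE"

print(sex_levels)
print(sex_values)
```

There is no separate descriptive text attached to each level here, which is why the ingested categorical ends up with the same string in both places.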


Time values in R
================

This warrants a dedicated section of its own, because of some unique
ways in which time values are handled in R.

R makes an effort to treat a time value as a real time instant. This
is in contrast with either SPSS or Stata, where time value
representations such as "Sep-23-2013 14:57:21" are allowed; note that
in the absence of an explicitly defined time zone, this value cannot
be mapped to an exact point in real time. R handles times in the
"Unix-style" way: the value is converted to the
"seconds-since-the-Epoch" Greenwich time (GMT or UTC) and the
resulting numeric value is stored in the data file; time zone
adjustments are made in real time as needed.
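
The stored representation can be checked with a short sketch, assuming the time is entered explicitly as UTC and that the system's time zone database knows "Asia/Tokyo" (the variable names are illustrative):

```r
# A time value defined explicitly in UTC.
t_utc <- as.POSIXct("2013-09-23 14:57:21", tz = "UTC")

# The underlying value is the number of seconds since 1970-01-01 00:00:00 UTC.
secs <- as.numeric(t_utc)  # 1379948241

# Changing the display time zone leaves the stored number untouched.
t_tokyo <- t_utc
attr(t_tokyo, "tzone") <- "Asia/Tokyo"
same_instant <- as.numeric(t_tokyo) == secs

# The same instant can be rebuilt from the number alone.
rebuilt <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

print(secs)
print(same_instant)
```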

Things still get ambiguous and confusing when R **displays** this time
value: unless the time zone was explicitly defined, R will adjust the
value to the current time zone. The resulting behavior is often
counter-intuitive: if you create a time value, for example:

    timevalue <- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS")

on a computer configured for the San Francisco time zone, the value
will be displayed differently on computers in different time zones;
for example, as "12:57 PST" while still on the West Coast, but as
"15:57 EST" in Boston.
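
The effect can be reproduced on a single machine by asking R to format the same instant for two zones. This sketch pins the input to the West Coast with the zone name "America/Los_Angeles" instead of relying on the computer's setting, and assumes both zone names exist in the system's time zone database:

```r
# The same instant, entered as West Coast wall-clock time.
timevalue <- as.POSIXct("03/19/2013 12:57:00",
                        format = "%m/%d/%Y %H:%M:%S",
                        tz = "America/Los_Angeles")

# Display it as a West Coast and as a Boston computer would.
west <- format(timevalue, "%H:%M", tz = "America/Los_Angeles")  # "12:57"
east <- format(timevalue, "%H:%M", tz = "America/New_York")     # "15:57"

print(c(west = west, east = east))
```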

If it is important that the values are always displayed the same way,
regardless of the current time zone, it is recommended that the time
zone is explicitly defined. For example:

    attr(timevalue, "tzone") <- "PST"

or

    timevalue <- as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz = "PST")

Now the value will always be displayed as "12:57 PST", regardless of
the time zone that is current for the OS ... **BUT ONLY** if the OS
where R is installed actually understands the time zone "PST", which
is not by any means guaranteed! Otherwise, it will **quietly adjust**
the stored GMT value to **the current time zone**, yet it will still
display it with the "PST" tag attached! One way to rephrase this is
that R does a fairly decent job **storing** time values in a
non-ambiguous, platform-independent manner - but gives you no guarantee that
the values will be displayed in any way that is predictable or intuitive.

In practical terms, it is recommended to use the long/descriptive
forms of time zones, as they are more likely to be properly recognized
on most computers. For example, "Japan" instead of "JST". Another possible
solution is to explicitly use GMT or UTC (since it is very likely to be
properly recognized on any system), or the "UTC+<OFFSET>" notation. Still, none of the above
**guarantees** proper, non-ambiguous handling of time values in R data
sets. The fact that R **quietly** modifies time values when it doesn't
recognize the supplied timezone attribute, yet still appends it to the
**changed** time value, does make this quite difficult. (These issues are
discussed in depth on R-related forums, and no attempt is made to
summarize it all in any depth here; this is just to make you aware of
this being a potentially complex issue!)
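
One way to reduce the guesswork is to check a zone name against the list R reports for the local system before using it. The sketch below uses OlsonNames() from base R; the helper "known_tz" is hypothetical, and passing this check says nothing about which zones the DVN itself will accept:

```r
# Hypothetical helper: TRUE when the name appears in this system's
# time zone database as reported by R.
known_tz <- function(tz) tz %in% OlsonNames()

long_ok  <- known_tz("America/Los_Angeles")
short_ok <- known_tz("PST")  # often FALSE; depends on the system

if (long_ok) {
  t <- as.POSIXct("03/19/2013 12:57:00",
                  format = "%m/%d/%Y %H:%M:%S",
                  tz = "America/Los_Angeles")
  # Cross-check the stored instant by displaying it in UTC.
  in_utc <- format(t, "%Y-%m-%d %H:%M:%S", tz = "UTC")
  print(in_utc)
}
```

Displaying the value in UTC after construction is a cheap way to confirm that the zone was actually applied instead of being silently ignored.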

An important thing to keep in mind, in connection with the DVN ingest
of R files, is that it will **reject** an R data file with any time
values that have time zones that the DVN cannot recognize. This is done in
order to avoid some of the potential issues outlined above.

It is also recommended that any vectors containing time values
ingested into the DVN are reviewed, and the resulting entries in the
TAB files are compared against the original values in the R data
frame, to make sure they have been ingested as expected.

Another **potential issue** here is the **UNF**. The way the UNF
algorithm works, the same date/time values with and without the
timezone (e.g. "12:45" vs. "12:45 EST") **produce different
UNFs**. Considering that time values in Stata/SPSS do not have time
zones, but ALL time values in R do (yes, they all do - if the timezone
wasn't defined explicitly, it implicitly becomes a time value in the
"UTC" zone!), this means that it is **impossible** to have 2 time
value vectors, in Stata/SPSS and R, that produce the same UNF.

| **A pro tip:** if it is important to produce SPSS/Stata and R versions of
  the same data set that result in the same UNF when ingested, you may
  define the time variables as **strings** in the R data frame, and use
  the "YYYY-MM-DD HH:mm:ss" formatting notation. This is the formatting used by the UNF
  algorithm to normalize time values, so doing the above will result in
  the same UNF as the vector of the same time values in Stata.
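
A sketch of the tip in R, assuming that the "YYYY-MM-DD HH:mm:ss" notation corresponds to the R format string "%Y-%m-%d %H:%M:%S"; the data frame and column names are made up for illustration:

```r
# Time values held as POSIXct, defined explicitly in UTC.
times <- as.POSIXct(c("2013-03-19 12:57:00", "2013-09-23 14:57:21"), tz = "UTC")

# Store them in the data frame as plain strings instead of time objects.
events <- data.frame(id = 1:2, stringsAsFactors = FALSE)
events$event_time <- format(times, "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(events)
print(is.character(events$event_time))
```

Passing tz explicitly to format() keeps the resulting strings independent of the time zone of the computer on which the file is prepared.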

Note: date values (dates only, without time) should be handled the
exact same way as those in SPSS and Stata, and should produce the same
UNFs.