Mercurial > hg > LGDataverses
comparison DVN-web/installer/dvninstall/doc/guides/_sources/dataverse-R-ingest.txt @ 6:1b2188262ae9
adding the installer.
author | "jurzua <jurzua@mpiwg-berlin.mpg.de>" |
---|---|
date | Wed, 13 May 2015 11:50:21 +0200 |
parents | |
children |
5:dd9adfc73390 | 6:1b2188262ae9 |
---|---|
1 ========================== | |
2 Ingest of R (.RData) files | |
3 ========================== | |
4 | |
5 Overview. | |
6 ========= | |
7 | |
8 Support for ingesting R data files was added in version 3.5. R | |
9 has become increasingly popular in the research/academic community, | |
10 owing to the fact that it is free and open-source (unlike SPSS and | |
11 Stata). Consequently, more and more data is becoming available | |
12 exclusively as R data files. This long-awaited feature makes it | |
13 possible to ingest such data into DVN as "subsettable" files. | |
14 | |
15 Requirements. | |
16 ============= | |
17 | |
18 R ingest relies on R having been installed, configured and made | |
19 available to the DVN application via Rserve (see the Installers | |
20 Guide). This is in contrast to the SPSS and Stata ingest, which can | |
21 be performed without R present (though R is still needed to perform | |
22 most subsetting/analysis tasks on the resulting data files). | |
23 | |
24 The data must be formatted as an R data frame (data.frame()). If an | |
25 .RData file contains multiple data frames, only the first one will be | |
26 ingested. | |
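
As a minimal sketch of preparing such a file, a single data frame can be written out with base R's ``save()`` and checked by loading it back. The data frame name ``survey``, its columns and the temporary file location are made up for illustration:

```r
# Hypothetical example data; names and values are illustrative only.
survey <- data.frame(
  id     = 1:3,
  income = c(42000.5, NA, 38250),
  city   = c("Boston", "Berlin", "Tokyo"),
  stringsAsFactors = FALSE
)

# Save exactly one data frame, so there is no ambiguity about which one
# would be picked up.
rdata_file <- file.path(tempdir(), "survey.RData")
save(survey, file = rdata_file)

# Load into a clean environment to confirm what the file contains.
check_env <- new.env()
loaded <- load(rdata_file, envir = check_env)
print(loaded)
print(is.data.frame(check_env$survey))
```

Keeping one data frame per file sidesteps the question of which object counts as the first one.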
27 | |
28 Data Types, compared to other supported formats (Stata, SPSS) | |
29 ============================================================= | |
30 | |
31 Integers, Doubles, Character strings | |
32 ------------------------------------ | |
33 | |
34 The handling of these types is intuitive and straightforward. The | |
35 resulting tab file columns, summary statistics and UNF signatures | |
36 should be identical to those produced by ingesting the same vectors | |
37 from SPSS and Stata. | |
38 | |
39 **A couple of things that are unique to R/new in DVN:** | |
40 | |
41 R explicitly supports Missing Values for all of the types above; | |
42 Missing Values encoded in R vectors will be recognized and preserved | |
43 in TAB files (as 'NA'), and accounted for in the generated summary | |
44 statistics and data analysis. | |
45 | |
46 In addition to Missing Values, R recognizes "Not a Number" (NaN) and | |
47 positive and negative infinity for floating point variables. These | |
48 are now properly supported by the DVN. | |
49 | |
50 Also note that, unlike Stata, which recognizes "float" and "double" | |
51 as distinct data types, all floating point values in R are in fact | |
52 double precision. | |
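
The distinctions above can be checked in an R session before saving a file. The following is only a sketch with made-up values, using base R predicates; it illustrates how R itself classifies the values, not how the DVN processes them:

```r
# Illustrative vector; the values are made up.
x <- c(1.5, NA, NaN, Inf, -Inf)

print(typeof(x))          # storage type of the numeric vector
print(is.na(x))           # TRUE for both NA and NaN
print(is.nan(x))          # TRUE only for NaN
print(is.infinite(x))     # TRUE for Inf and -Inf
print(typeof(c(1L, NA)))  # integer vectors can carry missing values too
```

Note that ``is.na()`` also reports ``NaN`` as missing, so ``is.nan()`` is needed to tell the two apart.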
53 | |
54 R Factors | |
55 --------- | |
56 | |
57 These are ingested as "Categorical Values" in the DVN. | |
58 | |
59 One thing to keep in mind: in both Stata and SPSS, the actual value of | |
60 a categorical variable can be both character and numeric. In R, all | |
61 factor values are strings, even if they are string representations of | |
62 numbers. So the values of the resulting categoricals in the DVN will | |
63 always be of string type too. | |
64 | |
65 | **New:** To properly handle *ordered factors* in R, the DVN now supports the concept of an "Ordered Categorical" - a categorical value where an explicit order is assigned to the list of value labels. | |
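
A short sketch of both points, using made-up data: factor levels are character strings even when they look numeric, and an ordered factor carries an explicit ordering of its levels.

```r
# Illustrative data only.
grade <- factor(c("10", "20", "10"))
print(levels(grade))                # the levels are strings, not numbers
print(is.character(levels(grade)))

rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)
print(is.ordered(rating))
print(rating[1] < rating[2])        # comparison follows the declared level order
```

The ``levels`` argument is what fixes the order; without it R would sort the levels alphabetically.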
66 | |
67 (New!) Boolean values | |
68 --------------------- | |
69 | |
70 R Boolean (logical) values are supported. | |
71 | |
72 | |
73 Limitations of R, as compared to SPSS and STATA. | |
74 ------------------------------------------------ | |
75 | |
76 Most noticeably, R lacks a standard mechanism for defining descriptive | |
77 labels for the data frame variables. In the DVN, similarly to | |
78 both Stata and SPSS, variables have distinct names and labels; with | |
79 the latter reserved for longer, descriptive text. | |
80 With variables ingested from R data frames the variable name will be | |
81 used for both the "name" and the "label". | |
82 | |
83 | *Optional R packages exist for providing descriptive variable labels; | |
84 in one of the future versions support may be added for such a | |
85 mechanism. It would of course work only for R files that were | |
86 created with such optional packages*. | |
87 | |
88 Similarly, R categorical values (factors) lack descriptive labels too. | |
89 **Note:** This is potentially confusing, since R factors do | |
90 actually have "labels". This is a matter of terminology - an R | |
91 factor's label is in fact the same thing as the "value" of a | |
92 categorical variable in SPSS or Stata and DVN; it contains the actual | |
93 meaningful data for the given observation. It is NOT a field reserved | |
94 for explanatory, human-readable text, as is the case with the | |
95 SPSS/Stata "label". | |
96 | |
97 Ingesting an R factor with the level labels "MALE" and "FEMALE" will | |
98 produce a categorical variable with "MALE" and "FEMALE" as both the | |
99 values and the labels. | |
100 | |
101 | |
102 Time values in R | |
103 ================ | |
104 | |
105 This warrants a dedicated section of its own, because of some unique | |
106 ways in which time values are handled in R. | |
107 | |
108 R makes an effort to treat a time value as a real instant in time. This | |
109 is in contrast with either SPSS or Stata, where time value | |
110 representations such as "Sep-23-2013 14:57:21" are allowed; note that | |
111 in the absence of an explicitly defined time zone, this value cannot | |
112 be mapped to an exact point in real time. R handles times in the | |
113 "Unix-style" way: the value is converted to the | |
114 "seconds-since-the-Epoch" Greenwich time (GMT or UTC) and the | |
115 resulting numeric value is stored in the data file; time zone | |
116 adjustments are made in real time as needed. | |
117 | |
118 Things still get ambiguous and confusing when R **displays** this time | |
119 value: unless the time zone was explicitly defined, R will adjust the | |
120 value to the current time zone. The resulting behavior is often | |
121 counter-intuitive: if you create a time value, for example: | |
122 | |
123 timevalue<-as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS"); | |
124 | |
125 on a computer configured for the San Francisco time zone, the value | |
126 will be displayed differently on computers in different time zones; | |
127 for example, as "12:57 PST" while still on the West Coast, but as | |
128 "15:57 EST" in Boston. | |
129 | |
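
The effect can be reproduced on a single machine by asking R to render the same stored instant for two zones. This is only a sketch: the zone names ``America/Los_Angeles`` and ``America/New_York`` are assumptions standing in for "San Francisco" and "Boston", and the creation zone is pinned so that the result does not depend on the local settings.

```r
# Pin the creation zone so the sketch does not depend on the local machine.
timevalue <- as.POSIXct("03/19/2013 12:57:00",
                        format = "%m/%d/%Y %H:%M:%S",
                        tz = "America/Los_Angeles")

# The stored value is a plain number of seconds since the Epoch.
print(as.numeric(timevalue))

# The same instant, rendered for two different zones.
print(format(timevalue, "%H:%M", tz = "America/Los_Angeles"))
print(format(timevalue, "%H:%M", tz = "America/New_York"))
```

The numeric value is identical in both cases; only the rendering changes.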
130 If it is important that the values are always displayed the same way, | |
131 regardless of the current time zones, it is recommended that the time | |
132 zone is explicitly defined. For example: | |
133 | |
134 attr(timevalue,"tzone")<-"PST" | |
135 or | |
136 timevalue<-as.POSIXct("03/19/2013 12:57:00", format = "%m/%d/%Y %H:%M:%OS", tz="PST"); | |
137 | |
138 Now the value will always be displayed as "12:57 PST", regardless of | |
139 the time zone that is current for the OS ... **BUT ONLY** if the OS | |
140 where R is installed actually understands the time zone "PST", which | |
141 is not by any means guaranteed! Otherwise, it will **quietly adjust** | |
142 the stored GMT value to **the current time zone**, yet it will still | |
143 display it with the "PST" tag attached! One way to rephrase this is | |
144 that R does a fairly decent job **storing** time values in a | |
145 non-ambiguous, platform-independent manner - but gives you no guarantee that | |
146 the values will be displayed in any way that is predictable or intuitive. | |
147 | |
148 In practical terms, it is recommended to use the long/descriptive | |
149 forms of time zones, as they are more likely to be properly recognized | |
150 on most computers. For example, "Japan" instead of "JST". Another possible | |
151 solution is to explicitly use GMT or UTC (since it is very likely to be | |
152 properly recognized on any system), or the "UTC+<OFFSET>" notation. Still, none of the above | |
153 **guarantees** proper, non-ambiguous handling of time values in R data | |
154 sets. The fact that R **quietly** modifies time values when it doesn't | |
155 recognize the supplied timezone attribute, yet still appends it to the | |
156 **changed** time value does make it quite difficult. (These issues are | |
157 discussed in depth on R-related forums, and no attempt is made to | |
158 summarize them here; this is just to make you aware that | |
159 this is a potentially complex issue!) | |
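
One way to reduce the risk on the R side is to check a zone name against the list that the local R installation reports before using it. The helper ``is_known_tz`` below is hypothetical, and the check only reflects what this particular R installation knows; it says nothing about which zones the DVN ingest itself accepts.

```r
# Hypothetical helper: TRUE if this R installation lists the zone name.
is_known_tz <- function(tz) tz %in% OlsonNames()

print(is_known_tz("UTC"))
print(is_known_tz("Asia/Tokyo"))
print(is_known_tz("Not/AZone"))

# Only attach the zone if it is known locally.
tz_name <- "Asia/Tokyo"
if (is_known_tz(tz_name)) {
  timevalue <- as.POSIXct("2013-03-19 12:57:00", tz = tz_name)
  print(format(timevalue, "%Y-%m-%d %H:%M:%S %Z"))
}
```

``OlsonNames()`` is part of base R; the set of names it returns depends on the time zone database available on the system.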
160 | |
161 An important thing to keep in mind, in connection with the DVN ingest | |
162 of R files, is that it will **reject** an R data file with any time | |
163 values that have time zones that we can't recognize. This is done in | |
164 order to avoid (some of) the potential issues outlined above. | |
165 | |
166 It is also recommended that any vectors containing time values | |
167 ingested into the DVN are reviewed, and the resulting entries in the | |
168 TAB files are compared against the original values in the R data | |
169 frame, to make sure they have been ingested as expected. | |
170 | |
171 Another **potential issue** here is the **UNF**. The way the UNF | |
172 algorithm works, the same date/time values with and without the | |
173 timezone (e.g. "12:45" vs. "12:45 EST") **produce different | |
174 UNFs**. Considering that time values in Stata/SPSS do not have time | |
175 zones, but ALL time values in R do (yes, they all do - if the timezone | |
176 wasn't defined explicitly, it implicitly becomes a time value in the | |
177 "UTC" zone!), this means that it is **impossible** to have two time | |
178 value vectors, in Stata/SPSS and R, that produce the same UNF. | |
179 | |
180 | **A pro tip:** if it is important to produce SPSS/Stata and R versions of | |
181 the same data set that result in the same UNF when ingested, you may | |
182 define the time variables as **strings** in the R data frame, and use | |
183 the "YYYY-MM-DD HH:mm:ss" formatting notation. This is the formatting used by the UNF | |
184 algorithm to normalize time values, so doing the above will result in | |
185 the same UNF as the vector of the same time values in Stata. | |
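
A sketch of the tip above, with made-up values: the time vector is rendered with ``format()`` into "YYYY-MM-DD HH:mm:ss" strings and stored in the data frame as a character column. The column names and the choice of UTC are assumptions for illustration only.

```r
# Illustrative values; the zone is pinned to UTC so the output does not
# depend on the machine running the code.
times <- as.POSIXct(c("2013-03-19 12:57:00", "2013-09-23 14:57:21"), tz = "UTC")

# Store the times as plain character strings in the data frame.
df <- data.frame(
  id   = 1:2,
  when = format(times, "%Y-%m-%d %H:%M:%S"),
  stringsAsFactors = FALSE
)
print(df$when)
print(class(df$when))
```

``stringsAsFactors = FALSE`` matters here: on older R versions the strings would otherwise be converted to a factor rather than kept as character strings.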
186 | |
187 Note: date values (dates only, without time) should be handled the | |
188 exact same way as those in SPSS and Stata, and should produce the same | |
189 UNFs. |