annotate CHANGES.txt @ 0:3b37d71af924 default tip

iniitial
author dwinter
date Tue, 26 Feb 2013 15:50:30 +0100
parents
children
rev   line source
0
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1 Nutch Change Log
2
3 (trunk): Current Development
4
5 * NUTCH-1274 Fix [cast] javac warnings (Tejas Patil via lewismc)
6
7 * NUTCH-1494 RSS feed plugin seems broken (Sourajit Basak, Tejas Patil and lewismc)
8
9 * NUTCH-1127 JUnit test for urlfilter-validator (Tejas Patil via lewismc)
10
11 * NUTCH-1119 JUnit test for index-static (Tejas Patil via lewismc)
12
13 * NUTCH-1510 Upgrade to Hadoop 1.1.1 (markus)
14
15 * NUTCH-1118 JUnit test for index-basic (Tejas Patil via lewismc)
16
17 * NUTCH-1331 limit crawler to defined depth (jnioche)
18
19 Release 1.6 - 23/11/2012
20
21 * NUTCH-1370 Expose exact number of urls injected @runtime (snagel via lewismc)
22
23 * NUTCH-1117 JUnit test for index-anchor (lewismc)
24
25 * NUTCH-1451 Upgrade automaton jar to 1.11-8 (lewismc)
26
27 * NUTCH-1488 bin/nutch to run junit from any directory (snagel via lewismc)
28
29 * NUTCH-1493 Error adding field 'contentLength'='' during solrindex using index-more (Nathan Gass via lewismc)
30
31 * NUTCH-1491 Strip UTF-8 non-character codepoints in title (Nathan Gass via markus)
32
33 * NUTCH-1421 RegexURLNormalizer to only skip rules with invalid patterns (snagel)
34
35 * NUTCH-1341 NotModified time set to now but page not modified (markus)
36
37 * NUTCH-1215 UpdateDB should not require segment as input (markus)
38
39 * NUTCH-1383 IndexingFiltersChecker to show error message instead of null pointer exception (snagel)
40
41 * NUTCH-1476 SegmentReader getStats should set parsed = -1 if no parsing took place (snagel)
42
43 * NUTCH-1252 SegmentReader -get shows wrong data (snagel)
44
45 * NUTCH-1344 BasicURLNormalizer to normalize https same as http (snagel)
46
47 * NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId" (Meghna Kukreja via snagel)
48
49 * NUTCH-1415 release packages to contain top level folder apache-nutch-x.x (snagel)
50
51 * NUTCH-1441 AnchorIndexingFilter should use plain HashSet (ferdy via lewismc)
52
53 * NUTCH-1470 Ensure test files are included for runtime testing (lewismc)
54
55 * NUTCH-1434 Indexer to delete robots noindex (markus)
56
57 * NUTCH-1443 Solr schema version is invalid (markus)
58
59 * NUTCH-1417 Remove o.a.n.metadata.Office (lewismc)
60
61 * NUTCH-1376 Add description parameter to every ant task (lewismc)
62
63 * NUTCH-1440 reconfigure non-existent stopwords_en.txt in schema-solr4.xml (shekhar sharma via lewismc)
64
65 * NUTCH-1439 Define boost field as type float in schema-solr4.xml (shekhar sharma via lewismc)
66
67 * NUTCH-1433 Upgrade to Tika 1.2 (jnioche)
68
69 * NUTCH-1388 Optionally maintain custom fetch interval despite AdaptiveFetchSchedule (markus)
70
71 * NUTCH-1430 Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule (markus)
72
73 * NUTCH-1087 Deprecate crawl command and replace with example script (jnioche)
74
75 * NUTCH-1306 Add option to not commit and clarify existing solr.commit.size (ferdy)
76
77 * NUTCH-1405 Allow to overwrite CrawlDatum's with injected entries (markus)
78
79 * NUTCH-1412 Upgrade commons lang (markus)
80
81 * NUTCH-1251 SolrDedup to use proper Lucene catch-all query (Arkadi Kosmynin via markus)
82
83 * NUTCH-1407 BasicIndexingFilter to optionally add domain field (markus)
84
85 * NUTCH-1408 RobotRulesParser main doesn't take URL's (markus)
86
87 * NUTCH-1300 Indexer to filter normalize URL's (markus)
88
89 * NUTCH-1330 WebGraph OutlinkDB to preserve back up (markus)
90
91 * NUTCH-1319 HostNormalizer plugin (markus)
92
93 * NUTCH-1386 Headings filter not to add empty values (markus)
94
95 * NUTCH-1356 ParseUtil use ExecutorService instead of manual thread handling (ferdy via markus)
96
97 * NUTCH-1352 Improve regex urlfilters/normalizers synchronization (ferdy via markus)
98
99 * NUTCH-1024 Dynamically set fetchInterval by MIME-type (markus)
100
101 * NUTCH-1364 Add a counter in Generator for malformed urls (lewismc)
102
103 * NUTCH-1262 Map `duplicating` content-types to a single type (markus)
104
105 * NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Andy Xue via markus)
106
107 * NUTCH-1336 Optionally not index db_notmodified pages (markus)
108
109 * NUTCH-1346 Follow outlinks to ignore external (markus)
110
111 * NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (markus)
112
113 * NUTCH-1351 DomainStatistics to aggregate by TLD (markus)
114
115 * NUTCH-1381 Allow to override default subcollection field name (markus)
116
117 * NUTCH-XX Commit to add configuration for separation of ant distribution targets (lewismc + jnioche)
118
119 Release 1.5.1 - 07/10/2012
120
121 * NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, jnioche)
122
123 * NUTCH-1415 release packages to contain top level folder apache-nutch-x.x (snagel via lewismc)
124
125 * NUTCH-1400 Remove developer -core option for bin/nutch (jnioche)
126
127 * NUTCH-1384 Typo in ParseSegment's run-method (Matthias Agethle via markus)
128
129 * NUTCH-1398 Upgrade to Hadoop 1.0.3 (jnioche)
130
131 Release 1.5 - 04/15/2012
132
133 * NUTCH-1208 Don't include KEYS file in bin distribution (jnioche)
134
135 * NUTCH-1234 Upgrade to Tika 1.1 (jnioche, markus)
136
137 * NUTCH-809 Parse-metatags plugin (jnioche)
138
139 * NUTCH-1310 Nutch to send HTTP-accept header (markus)
140
141 * NUTCH-1305 Domain(blacklist)URLFilter to trim entries (markus)
142
143 * NUTCH-1307 Improve formatting of ant targets for clearer project help (lewismc)
144
145 * NUTCH-1299 LinkRank inverter to ignore records without Node (markus)
146
147 * NUTCH-1258 MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata (jnioche, markus)
148
149 * NUTCH-1293 IndexingFiltersChecker to store detected content type in crawldatum metadata (markus)
150
151 * NUTCH-1291 Fetcher to stringify exception on // unexpected exception (markus)
152
153 * NUTCH-965 Skip parsing for truncated documents (alexis, lewismc, ferdy)
154
155 * NUTCH-1210 DomainBlacklistFilter (markus)
156
157 * NUTCH-1193 Incorrect url transform to lowercase: parameter solr (Eduardo dos Santos Leggiero via lewismc)
158
159 * NUTCH-1272 Wrong property name for index-static in nutch-default.xml (Daniel Baur via jnioche)
160
161 * NUTCH-1259 Store detected content-type in crawldatum metadata (jnioche, markus)
162
163 * NUTCH-1266 Subcollection to optionally write to configured fields (markus)
164
165 * NUTCH-1005 Parse headings plugin (markus)
166
167 * NUTCH-1264 Configurable indexing plugin index-metadata (jnioche)
168
169 * NUTCH-1242 Allow disabling of URL Filters in ParseSegment (Edward Drapkin via markus)
170
171 * NUTCH-1256 WebGraph to dump host + score (markus)
172
173 * NUTCH-1260 Fetcher should log fetching of redirects (Sebastian Nagel via markus)
174
175 * NUTCH-1255 Change ivy.xml of all plugins to remove "nutch.root" property (ferdy)
176
177 * NUTCH-1248 Generator to select on status (markus)
178
179 * NUTCH-1177 Generator to select on retry interval (markus)
180
181 * NUTCH-1246 Upgrade to Hadoop 1.0.0 (jnioche)
182
183 * NUTCH-1139 Indexer to delete gone documents (markus)
184
185 * NUTCH-1244 CrawlDBDumper to filter by regex (markus)
186
187 * NUTCH-1237 Improve javac arguments for more verbose output (lewismc)
188
189 * NUTCH-1236 Add link to site documentation to download older versions of Nutch (lewismc)
190
191 * NUTCH-1146 Prevent generation of _SUCCESS files in output (jnioche)
192
193 * NUTCH-1232 Remove site field from index-basic (markus)
194
195 * NUTCH-1239 Webgraph should remove deleted pages from segment input (markus)
196
197 * NUTCH-1238 Fetcher throughput threshold must start before feeder finished (markus)
198
199 * NUTCH-1138 remove LogUtil from trunk and nutch gora (lewismc)
200
201 * NUTCH-1231 Upgrade to Tika 1.0 (markus)
202
203 * NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0 (markus)
204
205 * NUTCH-1235 Upgrade to new Hadoop 0.20.205.0 (markus)
206
207 * NUTCH-1217 Update NOTICE.txt to drop some copyrights (lewismc)
208
209 * NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j (markus)
210
211 * NUTCH-1184 Fetcher to parse and follow Nth degree outlinks (markus)
212
213 * NUTCH-1221 Migrate DomainStatistics to MapReduce API (markus)
214
215 * NUTCH-1216 Add trivial comment to lib/native/README.txt (lewismc)
216
217 * NUTCH-1214 DomainStats tool should be named for what it's doing (markus)
218
219 * NUTCH-1213 Pass additional SolrParams when indexing to Solr (ab)
220
221 * NUTCH-1211 URLFilterChecker command line help doesn't inform user of
222 STDIN requirements (mattmann)
223
224 * NUTCH-1209 Output from ParserChecker Url missing a newline (mattmann)
225
226 * NUTCH-1207 ParserChecker to output signature (markus)
227
228 * NUTCH-1090 InvertLinks should inform when ignoring internal links (Marek Backmann via markus)
229
230 * NUTCH-1174 Outlinks are not properly normalized (markus)
231
232 * NUTCH-1203 ParseSegment to show number of milliseconds per parse (markus)
233
234 * NUTCH-1185 Decrease solr.commit.size to 250 (markus)
235
236 * NUTCH-1180 UpdateDB to backup previous CrawlDB (markus)
237
238 * NUTCH-1173 DomainStats doesn't count db_not_modified (markus)
239
240 * NUTCH-1155 Host/domain limit in generator is generate.max.count+1 (markus)
241
242 * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex (markus)
243
244 * NUTCH-1178 Incorrect CSV header CrawlDatumCsvOutputFormat (markus)
245
246 * NUTCH-1142 Normalization and filtering in WebGraph (markus)
247
248 * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file (markus)
249
250 Release 1.4 - 11/4/2011
251
252 * NUTCH-1195 Add Solr 4x (trunk) example schema (ab)
253
254 * NUTCH-1192 Add '/runtime' to svn ignore (ferdy)
255
256 * NUTCH-1097 application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml (Ferdy via lewismc)
257
258 * NUTCH-797 Fix parse-tika and parse-html to use relative URL resolution per RFC-3986
259 (Robert Hohman, ab)
260
261 * NUTCH-1154 Upgrade to Tika 0.10. NOTE: Tika's new RTF parser may ignore more
262 text in malformed documents than previously - see TIKA-748 for details. (ab)
263
264 * NUTCH-1109 Add Sonar targets to Ant build.xml (lewismc)
265
266 * NUTCH-1152 Upgrade SolrJ to version 3.4.0 (ab)
267
268 * NUTCH-1136 Ant pmd target is broken (lewismc)
269
270 * NUTCH-1058 Upgrade Solr schema to version 1.4 (markus)
271
272 * NUTCH-1137 LinkDB invertlinks other options ignored when using -dir option (Sebastian Nagel, markus)
273
274 * NUTCH-1141 Configurable Fetcher queue depth (jnioche)
275
276 * NUTCH-1091 Remove commons logging dependency from Nutch branch and trunk (lewismc)
277
278 * NUTCH-672 allow unit tests to be run from bin/nutch (Todd Lipton via lewismc)
279
280 * NUTCH-937 Put plugins in classes/plugins in job file (Claudio Martella, Ferdy Galema, jnioche)
281
282 * NUTCH-623 Change plugin source directory "languageidentifier" to "language-identifier" (lewismc)
283
284 * NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count (Robert Thomson via markus)
285
286 * NUTCH-1078 Upgrade all instances of commons logging to slf4j (with log4j backend) (lewismc)
287
288 * NUTCH-1115 Option to disable fixing embedded URL parameters in DomContentUtils (markus)
289
290 * NUTCH-1114 Attr file missing in domain filter (markus)
291
292 * NUTCH-1067 Configure minimum throughput for fetcher (markus)
293
294 * NUTCH-1102 Fetcher to rely on fetcher.parse directive (markus)
295
296 * NUTCH-1110 UpdateDB must not write _success file (markus)
297
298 * NUTCH-1105 Max content length option for index-basic (markus)
299
300 * NUTCH-940 static field plugin (Claudio Martella via lewismc)
301
302 * NUTCH-914 Implement Apache Project Branding Requirements (lewismc)
303
304 * NUTCH-1095 remove i18n from Nutch site to archive and legacy section of wiki (lewismc)
305
306 * NUTCH-1101 Option to purge db_gone records with updatedb (markus)
307
308 * NUTCH-1096 Empty (not null) ContentLength results in failure of fetch (Ferdy Galema via jnioche)
309
310 * NUTCH-1073 Rename parameters 'fetcher.threads.per.host.by.ip' and 'fetcher.threads.per.host' (jnioche)
311
312 * NUTCH-1089 Short compressed pages caused exception in protocol-httpclient (Simone Frenzel via jnioche)
313
314 * NUTCH-1085 Nutch script does not require HADOOP_HOME (jnioche)
315
316 * NUTCH-1075 Delegate language identification to Tika (jnioche)
317
318 * NUTCH-1049 Add classes to bin/nutch script (markus)
319
320 * NUTCH-1051 Export WebGraph node scores for Solr.ExternalFileField (markus)
321
322 * NUTCH-1083 ParserChecker implements Tools (jnioche)
323
324 * NUTCH-1082 IndexingFiltersChecker utility does not list multi valued fields (markus)
325
326 * NUTCH-1004 Do not index empty values for title field (markus)
327
328 * NUTCH-914 Implement Apache Project Branding Requirements (lewismc via jnioche)
329
330 * NUTCH-1069 Readlinkdb broken on Hadoop > 0.20 (markus)
331
332 * NUTCH-1044 Redirected URLs and possibly all of their outlinked URLs have invalid scores (jnioche)
333
334 * NUTCH-1028 Log urls when parsing (markus)
335
336 * NUTCH-1065 New mvn.template (lewismc)
337
338 * NUTCH-1072 Display number and size of queues in Fetcher status (jnioche)
339
340 * NUTCH-1071 Crawldb update displays total number of URLs per status (jnioche)
341
342 * NUTCH-1045 MimeUtil to rely on default config provided by Tika (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
343
3b37d71af924 iniitial
dwinter
parents:
diff changeset
344 * NUTCH-1057 Fetcher thread time out configurable (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
345
3b37d71af924 iniitial
dwinter
parents:
diff changeset
346 * NUTCH-1037 Option to deduplicate anchors prior to indexing (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
347
3b37d71af924 iniitial
dwinter
parents:
diff changeset
348 * NUTCH-1050 Add segmentDir option to WebGraph (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
349
3b37d71af924 iniitial
dwinter
parents:
diff changeset
350 * NUTCH-1055 upgrade package.html file in language identifier plugin (lewismc)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
351
3b37d71af924 iniitial
dwinter
parents:
diff changeset
352 * NUTCH-1059 Remove convdb command from /bin/nutch (lewismc)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
353
3b37d71af924 iniitial
dwinter
parents:
diff changeset
354 * NUTCH-1019 Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy (lewismc)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
355
3b37d71af924 iniitial
dwinter
parents:
diff changeset
356 * NUTCH-1023 Trivial error in error message for org.apache.nutch.crawl.LinkDbReader (lewismc)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
357
3b37d71af924 iniitial
dwinter
parents:
diff changeset
358 * NUTCH-1043 Add pattern for filtering .js in default url filters (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
359
3b37d71af924 iniitial
dwinter
parents:
diff changeset
360 * NUTCH-1054 LinkDB optional during indexing (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
361
3b37d71af924 iniitial
dwinter
parents:
diff changeset
362 * NUTCH-1029 Readdb throws EOFException (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
363
3b37d71af924 iniitial
dwinter
parents:
diff changeset
364 * NUTCH-1036 Solr jobs should increment counters in Reporter (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
365
3b37d71af924 iniitial
dwinter
parents:
diff changeset
366 * NUTCH-987 Support HTTP auth for Solr communication (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
367
3b37d71af924 iniitial
dwinter
parents:
diff changeset
368 * NUTCH-1027 Degrade log level of `can't find rules for scope` (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
369
3b37d71af924 iniitial
dwinter
parents:
diff changeset
370 * NUTCH-783 IndexingFiltersChecker utility (jnioche via markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
371
3b37d71af924 iniitial
dwinter
parents:
diff changeset
372 * NUTCH-1030 WebgraphDB program requires manually added directories (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
373
3b37d71af924 iniitial
dwinter
parents:
diff changeset
374 * NUTCH-1011 Normalize duplicate slashes in URLs (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
375
3b37d71af924 iniitial
dwinter
parents:
diff changeset
376 * NUTCH-993 NullPointerException at FetcherOutputFormat.checkOutputSpecs (Christian Guegi via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
377
3b37d71af924 iniitial
dwinter
parents:
diff changeset
378 * NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
379
3b37d71af924 iniitial
dwinter
parents:
diff changeset
380 * NUTCH-1016 Strip UTF-8 non-character codepoints and add logging for SolrWriter (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
381
3b37d71af924 iniitial
dwinter
parents:
diff changeset
382 * NUTCH-1012 Cannot handle illegal charset $charset (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
383
3b37d71af924 iniitial
dwinter
parents:
diff changeset
384 * NUTCH-1022 Upgrade version number of Nutch agent in conf (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
385
3b37d71af924 iniitial
dwinter
parents:
diff changeset
386 * NUTCH-295 Description for fetcher.threads.fetch property (kubes via markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
387
3b37d71af924 iniitial
dwinter
parents:
diff changeset
388 * NUTCH-1000 Add option not to commit to Solr (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
389
3b37d71af924 iniitial
dwinter
parents:
diff changeset
390 * NUTCH-1006 MetaEquiv with single quotes not accepted (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
391
3b37d71af924 iniitial
dwinter
parents:
diff changeset
392 * NUTCH-1010 ContentLength not trimmed (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
393
3b37d71af924 iniitial
dwinter
parents:
diff changeset
394 Release 1.3 - 6/4/2011
3b37d71af924 iniitial
dwinter
parents:
diff changeset
395
3b37d71af924 iniitial
dwinter
parents:
diff changeset
396 * NUTCH-995 Generate POM file using the Ivy makepom task (mattmann, jnioche, Gabriele Kahlout)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
397
3b37d71af924 iniitial
dwinter
parents:
diff changeset
398 * NUTCH-1003 task 'package' does not reflect the new organisation of the code (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
399
3b37d71af924 iniitial
dwinter
parents:
diff changeset
400 * NUTCH-994 Fine tune Solr schema (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
401
3b37d71af924 iniitial
dwinter
parents:
diff changeset
402 * NUTCH-997 IndexingFilters to store Date objects instead of Strings (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
403
3b37d71af924 iniitial
dwinter
parents:
diff changeset
404 * NUTCH-996 Indexer adds solr.commit.size+1 docs (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
405
3b37d71af924 iniitial
dwinter
parents:
diff changeset
406 * NUTCH-983 Upgrade SolrJ to 3.1 (markus, jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
407
3b37d71af924 iniitial
dwinter
parents:
diff changeset
408 * NUTCH-989 Index-basic plugin and Solr schema now use date fieldType for tstamp field (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
409
3b37d71af924 iniitial
dwinter
parents:
diff changeset
410 * NUTCH-888 Remove parse-rss and add tests for rss to parse-tika (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
411
3b37d71af924 iniitial
dwinter
parents:
diff changeset
412 * NUTCH-991 SolrDedup must issue a commit (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
413
3b37d71af924 iniitial
dwinter
parents:
diff changeset
414 * NUTCH-986 SolrDedup fails due to incorrect date format (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
415
3b37d71af924 iniitial
dwinter
parents:
diff changeset
416 * NUTCH-977 SolrMappingReader uses hardcoded configuration parameter name for mapping file (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
417
3b37d71af924 iniitial
dwinter
parents:
diff changeset
418 * NUTCH-976 Rename properties solrindex.* to solr.* (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
419
3b37d71af924 iniitial
dwinter
parents:
diff changeset
420 * NUTCH-890 Fix IllegalAccessError with slf4j used in Solrj (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
421
3b37d71af924 iniitial
dwinter
parents:
diff changeset
422 * NUTCH-891 Subcollection plugin won't require blacklist any more (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
423
3b37d71af924 iniitial
dwinter
parents:
diff changeset
424 * NUTCH-972 CrawlDbMerger doesn't break on non-existent input (Gabriele Kahlout via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
425
3b37d71af924 iniitial
dwinter
parents:
diff changeset
426 * NUTCH-967 Upgrade to Tika 0.9 (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
427
3b37d71af924 iniitial
dwinter
parents:
diff changeset
428 * NUTCH-975 Fix missing/wrong headers in source files (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
429
3b37d71af924 iniitial
dwinter
parents:
diff changeset
430 * NUTCH-963 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (Claudio Martella, markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
431
3b37d71af924 iniitial
dwinter
parents:
diff changeset
432 * NUTCH-825 Publish nutch artifacts to central maven repository (mattmann, jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
433
3b37d71af924 iniitial
dwinter
parents:
diff changeset
434 * NUTCH-962 max. redirects not handled correctly: fetcher stops at max-1 redirects (Sebastian Nagel via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
435
3b37d71af924 iniitial
dwinter
parents:
diff changeset
436 * NUTCH-921 Reduce dependency of Nutch on config files (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
437
3b37d71af924 iniitial
dwinter
parents:
diff changeset
438 * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
439
3b37d71af924 iniitial
dwinter
parents:
diff changeset
440 * NUTCH-872 Change the default fetcher.parse to FALSE (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
441
3b37d71af924 iniitial
dwinter
parents:
diff changeset
442 * NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
443
3b37d71af924 iniitial
dwinter
parents:
diff changeset
444 * NUTCH-964 Upgraded Xerces to 2.9.1, ERROR conf.Configuration - Failed to set setXIncludeAware (markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
445
3b37d71af924 iniitial
dwinter
parents:
diff changeset
446 * NUTCH-927 Fetcher.timelimit.mins is invalid when depth is greater than 1 (Wade Lau via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
447
3b37d71af924 iniitial
dwinter
parents:
diff changeset
448 * NUTCH-824 Crawling - File Error 404 when fetching file with a hexadecimal character in the file name (Michela Becchi via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
449
3b37d71af924 iniitial
dwinter
parents:
diff changeset
450 * NUTCH-954 Strict application of Content-Length limit for http protocols (Alexis Detreglode via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
451
3b37d71af924 iniitial
dwinter
parents:
diff changeset
452 * NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
453
3b37d71af924 iniitial
dwinter
parents:
diff changeset
454 * NUTCH-935 basicurlnormalizer removes unnecessary /./ in URLs (Stondet via markus)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
455
3b37d71af924 iniitial
dwinter
parents:
diff changeset
456 * NUTCH-912 MoreIndexingFilter does not parse docx and xlsx date formats (Markus Jelsma, jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
457
3b37d71af924 iniitial
dwinter
parents:
diff changeset
458 * NUTCH-886 A .gitignore file for Nutch (dogacan)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
459
3b37d71af924 iniitial
dwinter
parents:
diff changeset
460 * NUTCH-930 Remove remaining dependencies on Lucene API (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
461
3b37d71af924 iniitial
dwinter
parents:
diff changeset
462 * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
463
3b37d71af924 iniitial
dwinter
parents:
diff changeset
464 * NUTCH-936 LanguageIdentifier should not set empty lang field on NutchDocument (Markus Jelsma via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
465
3b37d71af924 iniitial
dwinter
parents:
diff changeset
466 * NUTCH-787 ScoringFilters should not override the injected score (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
467
3b37d71af924 iniitial
dwinter
parents:
diff changeset
468 * NUTCH-949 Conflicting ANT jars in classpath (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
469
3b37d71af924 iniitial
dwinter
parents:
diff changeset
470 * NUTCH-863 Benchmark and a testbed proxy server (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
471
3b37d71af924 iniitial
dwinter
parents:
diff changeset
472 * NUTCH-844 Improve NutchConfiguration (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
473
3b37d71af924 iniitial
dwinter
parents:
diff changeset
474 * NUTCH-845 Native hadoop libs not available through maven (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
475
3b37d71af924 iniitial
dwinter
parents:
diff changeset
476 * NUTCH-843 Separate the build and runtime environments (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
477
3b37d71af924 iniitial
dwinter
parents:
diff changeset
478 * NUTCH-821 Use ivy in nutch builds (Enis Soztutar, jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
479
3b37d71af924 iniitial
dwinter
parents:
diff changeset
480 * NUTCH-837 Remove search servers and Lucene dependencies (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
481
3b37d71af924 iniitial
dwinter
parents:
diff changeset
482 * NUTCH-836 Remove deprecated parse plugins (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
483
3b37d71af924 iniitial
dwinter
parents:
diff changeset
484 * NUTCH-939 Added -dir command line option to SolrIndexer (Claudio Martella via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
485
3b37d71af924 iniitial
dwinter
parents:
diff changeset
486 * NUTCH-948 Remove Lucene dependencies (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
487
3b37d71af924 iniitial
dwinter
parents:
diff changeset
488 Release 1.2 - 09/18/2010
3b37d71af924 iniitial
dwinter
parents:
diff changeset
489
3b37d71af924 iniitial
dwinter
parents:
diff changeset
490 * NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
491
3b37d71af924 iniitial
dwinter
parents:
diff changeset
492 * NUTCH-908 Infinite Loop and Null Pointer Bugs in Searching (kubes via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
493
3b37d71af924 iniitial
dwinter
parents:
diff changeset
494 * NUTCH-906 Nutch OpenSearch sometimes raises DOMExceptions (Asheesh Laroia via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
495
3b37d71af924 iniitial
dwinter
parents:
diff changeset
496 * NUTCH-862 HttpClient null pointer exception (Sebastian Nagel via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
497
3b37d71af924 iniitial
dwinter
parents:
diff changeset
498 * NUTCH-905 Configurable file protocol parent directory crawling (Thorsten Scherler, mattmann, ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
499
3b37d71af924 iniitial
dwinter
parents:
diff changeset
500 * NUTCH-877 Allow setting of slop values for non-quote phrase queries on query-basic plugin (kubes via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
501
3b37d71af924 iniitial
dwinter
parents:
diff changeset
502 * NUTCH-716 Make subcollection index field multivalued (Dmitry Lihachev via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
503
3b37d71af924 iniitial
dwinter
parents:
diff changeset
504 * NUTCH-878 ScoringFilters should not override the injected score
3b37d71af924 iniitial
dwinter
parents:
diff changeset
505
3b37d71af924 iniitial
dwinter
parents:
diff changeset
506 * NUTCH-870 Injector should add the metadata before calling injectedScore (jnioche via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
507
3b37d71af924 iniitial
dwinter
parents:
diff changeset
508 * NUTCH-858 No longer able to set per-field boosts on lucene documents (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
509
3b37d71af924 iniitial
dwinter
parents:
diff changeset
510 * NUTCH-869 Add parse-html back (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
511
3b37d71af924 iniitial
dwinter
parents:
diff changeset
512 * NUTCH-871 MoreIndexingFilter missing date format (Max Lynch via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
513
3b37d71af924 iniitial
dwinter
parents:
diff changeset
514 * NUTCH-696 Timeout for Parser (ab, jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
515
3b37d71af924 iniitial
dwinter
parents:
diff changeset
516 * NUTCH-857 DistributedBeans should not close their RPC counterparts (kubes)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
517
3b37d71af924 iniitial
dwinter
parents:
diff changeset
518 * NUTCH-855 ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags
3b37d71af924 iniitial
dwinter
parents:
diff changeset
519 and their subsequent indexing (Scott Gonyea via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
520
3b37d71af924 iniitial
dwinter
parents:
diff changeset
521 * NUTCH-677 Segment merge filtering based on segment content (Marcin Okraszewski via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
522
3b37d71af924 iniitial
dwinter
parents:
diff changeset
523 * NUTCH-774 Retry interval in crawl date is set to 0 (Reinhard Schwab via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
524
3b37d71af924 iniitial
dwinter
parents:
diff changeset
525 * NUTCH-697 Generate log output for solr indexer and dedup (Dmitry Lihachev, Jeroen van Vianen via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
526
3b37d71af924 iniitial
dwinter
parents:
diff changeset
527 * NUTCH-850 SolrDeleteDuplicates needs to clone the SolrRecord objects (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
528
3b37d71af924 iniitial
dwinter
parents:
diff changeset
529 * NUTCH-838 Add timing information to all Tool classes (Jeroen van Vianen, mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
530
3b37d71af924 iniitial
dwinter
parents:
diff changeset
531 * NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
532
3b37d71af924 iniitial
dwinter
parents:
diff changeset
533 * NUTCH-831 Allow configuration of how fields crawled by Nutch are stored / indexed /
3b37d71af924 iniitial
dwinter
parents:
diff changeset
534 tokenized (Jeroen van Vianen via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
535
3b37d71af924 iniitial
dwinter
parents:
diff changeset
536 * NUTCH-278 Fetcher-status might need clarification: kbit/s instead of kb/s shown (Alex McLintock via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
537
3b37d71af924 iniitial
dwinter
parents:
diff changeset
538 * NUTCH-833 Website is still Lucene branded (mattmann, Alex McLintock)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
539
3b37d71af924 iniitial
dwinter
parents:
diff changeset
540 * NUTCH-832 Website menu has lots of broken links - in particular the API docs (Alex McLintock via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
541
3b37d71af924 iniitial
dwinter
parents:
diff changeset
542 Release 1.1 - 2010-06-06
3b37d71af924 iniitial
dwinter
parents:
diff changeset
543
3b37d71af924 iniitial
dwinter
parents:
diff changeset
544 * NUTCH-819 Included Solr schema.xml and solrindex-mapping.xml don't play together (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
545
3b37d71af924 iniitial
dwinter
parents:
diff changeset
546 * NUTCH-818 Bugfix : Parse-tika uses minorCodes instead of majorCodes in ParseStatus (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
547
3b37d71af924 iniitial
dwinter
parents:
diff changeset
548 * NUTCH-816 Add zip target to build.xml (mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
549
3b37d71af924 iniitial
dwinter
parents:
diff changeset
550 * NUTCH-732 Subcollection plugin not working (Filipe Antunes, ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
551
3b37d71af924 iniitial
dwinter
parents:
diff changeset
552 * NUTCH-815 Invalid blank line before If-Modified-Since header (Pascal Dimassimo via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
553
3b37d71af924 iniitial
dwinter
parents:
diff changeset
554 * NUTCH-814 SegmentMerger bug (Rob Bradshaw, ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
555
3b37d71af924 iniitial
dwinter
parents:
diff changeset
556 * NUTCH-812 Crawl.java incorrectly uses the Generator API resulting in NPE (Phil Barnett via mattmann and ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
557
3b37d71af924 iniitial
dwinter
parents:
diff changeset
558 * NUTCH-810 Upgrade to Tika 0.7 (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
559
3b37d71af924 iniitial
dwinter
parents:
diff changeset
560 * NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call scfilters.initialScore on newly created URL (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
561
3b37d71af924 iniitial
dwinter
parents:
diff changeset
562 * NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
563
3b37d71af924 iniitial
dwinter
parents:
diff changeset
564 * NUTCH-784 CrawlDBScanner (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
565
3b37d71af924 iniitial
dwinter
parents:
diff changeset
566 * NUTCH-762 Generator can generate several segments in one parse of the crawlDB (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
567
3b37d71af924 iniitial
dwinter
parents:
diff changeset
568 * NUTCH-740 Configuration option to override default language for fetched pages (Marcin Okraszewski via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
569
3b37d71af924 iniitial
dwinter
parents:
diff changeset
570 * NUTCH-803 Upgrade to Hadoop 0.20.2 (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
571
3b37d71af924 iniitial
dwinter
parents:
diff changeset
572 * NUTCH-787 Upgrade Lucene to 3.0.1. (Dawid Weiss via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
573
3b37d71af924 iniitial
dwinter
parents:
diff changeset
574 * NUTCH-796 Zero results problems difficult to troubleshoot due to lack of logging (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
575
3b37d71af924 iniitial
dwinter
parents:
diff changeset
576 * NUTCH-801 Remove RTF and MP3 parse plugins (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
577
3b37d71af924 iniitial
dwinter
parents:
diff changeset
578 * NUTCH-798 Upgrade to SOLR1.4 and its dependencies (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
579
3b37d71af924 iniitial
dwinter
parents:
diff changeset
580 * NUTCH-799 SOLRIndexer to commit once all reducers have finished (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
581
3b37d71af924 iniitial
dwinter
parents:
diff changeset
582 * NUTCH-782 Ability to order htmlparsefilters (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
583
3b37d71af924 iniitial
dwinter
parents:
diff changeset
584 * NUTCH-719 fetchQueues.totalSize incorrect in Fetcher (Steven Denny via jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
585
3b37d71af924 iniitial
dwinter
parents:
diff changeset
586 * NUTCH-790 Some external javadoc links are broken (siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
587
3b37d71af924 iniitial
dwinter
parents:
diff changeset
588 * NUTCH-766 Tika parser (jnioche via mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
589
3b37d71af924 iniitial
dwinter
parents:
diff changeset
590 * NUTCH-786 Improvement to the list of suffix domains (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
591
3b37d71af924 iniitial
dwinter
parents:
diff changeset
592 * NUTCH-775 Enhance searcher interface (siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
593
3b37d71af924 iniitial
dwinter
parents:
diff changeset
594 * NUTCH-781 Update Tika to v0.6 (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
595
3b37d71af924 iniitial
dwinter
parents:
diff changeset
596 * NUTCH-269 CrawlDbReducer: OOME because no upper-bound on inlinks count (stack + jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
597
3b37d71af924 iniitial
dwinter
parents:
diff changeset
598 * NUTCH-655 Injecting Crawl metadata (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
599
3b37d71af924 iniitial
dwinter
parents:
diff changeset
600 * NUTCH-658 Use counters to report fetching and parsing status (jnioche)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
601
3b37d71af924 iniitial
dwinter
parents:
diff changeset
602 * NUTCH-777 Upgrading to jetty6 broke unit tests (mattmann)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
603
3b37d71af924 iniitial
dwinter
parents:
diff changeset
604 * NUTCH-767 Update Tika to v0.5 for the MimeType detection (Julien Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
605
3b37d71af924 iniitial
dwinter
parents:
diff changeset
606 * NUTCH-769 Fetcher to skip queues for URLS getting repeated exceptions
3b37d71af924 iniitial
dwinter
parents:
diff changeset
607 (Julien Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
608
3b37d71af924 iniitial
dwinter
parents:
diff changeset
609 * NUTCH-768 - Upgrade Nutch 1.0 to use Hadoop 0.20.1, also upgrades Xerces to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
610 version 2.9.1. (kubes)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
611
3b37d71af924 iniitial
dwinter
parents:
diff changeset
612 * NUTCH-712 ParseOutputFormat should catch java.net.MalformedURLException
3b37d71af924 iniitial
dwinter
parents:
diff changeset
613 coming from normalizers (Julien Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
614
3b37d71af924 iniitial
dwinter
parents:
diff changeset
615 * NUTCH-741 Job file includes multiple copies of nutch config files
3b37d71af924 iniitial
dwinter
parents:
diff changeset
616 (Kirby Bohling via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
617
3b37d71af924 iniitial
dwinter
parents:
diff changeset
618 * NUTCH-739 SolrDeleteDuplications too slow when using hadoop (Dmitry Lihachev via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
619
3b37d71af924 iniitial
dwinter
parents:
diff changeset
620 * NUTCH-738 Close SegmentUpdater when FetchedSegments is closed
3b37d71af924 iniitial
dwinter
parents:
diff changeset
621 (Martina Koch, Kirby Bohling via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
622
3b37d71af924 iniitial
dwinter
parents:
diff changeset
623 * NUTCH-746 NutchBeanConstructor does not close NutchBean upon contextDestroyed,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
624 causing resource leak in the container. (Kirby Bohling via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
625
3b37d71af924 iniitial
dwinter
parents:
diff changeset
626 * NUTCH-772 Upgrade Nutch to use Lucene 2.9.1 (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
627
3b37d71af924 iniitial
dwinter
parents:
diff changeset
628 * NUTCH-760 Allow field mapping from Nutch to Solr index (David Stuart, ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
629
3b37d71af924 iniitial
dwinter
parents:
diff changeset
630 * NUTCH-761 Avoid cloning CrawlDatum in CrawlDbReducer (Julien Nioche, ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
631
3b37d71af924 iniitial
dwinter
parents:
diff changeset
632 * NUTCH-753 Prevent new Fetcher from retrieving the robots twice (Julien Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
633
3b37d71af924 iniitial
dwinter
parents:
diff changeset
634 * NUTCH-773 - Some minor bugs in AbstractFetchSchedule (Reinhard Schwab via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
635
3b37d71af924 iniitial
dwinter
parents:
diff changeset
636 * NUTCH-765 - Allow Crawl class to call either Solr or Lucene Indexer (kubes)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
637
3b37d71af924 iniitial
dwinter
parents:
diff changeset
638 * NUTCH-735 - crawl-tool.xml must be read before nutch-site.xml when
3b37d71af924 iniitial
dwinter
parents:
diff changeset
639 invoked using crawl command (Susam Pal via dogacan)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
640
3b37d71af924 iniitial
dwinter
parents:
diff changeset
641 * NUTCH-721 - Fetcher2 Slow (Julien Nioche via dogacan)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
642
3b37d71af924 iniitial
dwinter
parents:
diff changeset
643 * NUTCH-702 - Lazy Instantiation of Metadata in CrawlDatum (Julien Nioche via dogacan)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
644
3b37d71af924 iniitial
dwinter
parents:
diff changeset
645 * NUTCH-707 - Generation of multiple segments in multiple runs returns only 1 segment
3b37d71af924 iniitial
dwinter
parents:
diff changeset
646 (Michael Chen, ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
647
3b37d71af924 iniitial
dwinter
parents:
diff changeset
648 * NUTCH-730 - NPE in LinkRank if no nodes with which to create the WebGraph
3b37d71af924 iniitial
dwinter
parents:
diff changeset
649 (Dennis Kubes via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
650
3b37d71af924 iniitial
dwinter
parents:
diff changeset
651 * NUTCH-731 - Redirection of robots.txt in RobotRulesParser (Julien Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
652
3b37d71af924 iniitial
dwinter
parents:
diff changeset
653 * NUTCH-757 - RequestUtils getBooleanParameter() always returns false
3b37d71af924 iniitial
dwinter
parents:
diff changeset
654 (Niall Pemberton via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
655
3b37d71af924 iniitial
dwinter
parents:
diff changeset
656 * NUTCH-754 - Use GenericOptionsParser instead of FileSystem.parseArgs() (Julien
3b37d71af924 iniitial
dwinter
parents:
diff changeset
657 Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
658
3b37d71af924 iniitial
dwinter
parents:
diff changeset
659 * NUTCH-756 - CrawlDatum.set() does not reset Metadata if it is null (Julien Nioche
3b37d71af924 iniitial
dwinter
parents:
diff changeset
660 via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
661
3b37d71af924 iniitial
dwinter
parents:
diff changeset
662 * NUTCH-679 - Fetcher2 implementing Tool (Julien Nioche via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
663
3b37d71af924 iniitial
dwinter
parents:
diff changeset
664 * NUTCH-758 - Set subversion eol-style to "native" (Niall Pemberton via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
665
3b37d71af924 iniitial
dwinter
parents:
diff changeset
666 Release 1.0 - 2009-03-23
3b37d71af924 iniitial
dwinter
parents:
diff changeset
667
1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)

2. NUTCH-443 - Allow parsers to return multiple Parse objects.
   (Dogacan Guney et al, via ab)

3. NUTCH-393 - Indexer should handle null documents returned by filters.
   (Eelco Lempsink via ab)

4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)

5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
   bots in robots.txt (Dogacan Guney via siren)

6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)

7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
   (siren)

8. NUTCH-161 - Change Plain text parser to use
   parser.character.encoding.default property for fall back encoding
   (KuroSaka TeruHiko, siren)

9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
   unmodified content. (ab)

10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
    (cutting via ab)

11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)

12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed
    up the RSS parser (dogacan via mattmann). This update is a fix and semantics
    change from the original patch for NUTCH-443. The original patch did not tell
    the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
    instead of pushing an empty content, it filters the null content.

13. NUTCH-485 - Change HtmlParseFilters to return ParseResult object instead of
    Parse object. (Gal Nitzan via dogacan)

14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
    some query parameters. (Emmanuel Joke via dogacan)

15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
    (Ilya Vishnevsky via dogacan)

16. NUTCH-444 - Possibly use a different library to parse RSS feed for improved
    performance and compatibility. This patch introduced a new plugin, feed,
    that includes an index filter and a parse plugin for feeds that uses ROME.
    There was discussion of removing parse-rss in light of the feed plugin;
    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)

17. NUTCH-471 - Fix synchronization in NutchBean creation.
    (Enis Soztutar via dogacan)

18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)

19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
    once. (dogacan)

20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)

21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
    DomContentUtils...Spider Trap. (kubes)

22. NUTCH-434 - Replace usage of ObjectWritable with something based on
    GenericWritable. (dogacan)

23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)

24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
    (Espen Amble Kolstad via dogacan)

25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
    (Emmanuel Joke via dogacan)

26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
    (Vishal Shah via dogacan)

27. NUTCH-505 - Outlink urls should be validated. (dogacan)

28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)

29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)

30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)

30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)

31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).

32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
    (Enis Soztutar via dogacan)

33. NUTCH-516 - Next fetch time is not set when it is a
    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)

34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)

35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
    (dogacan) Note: There is a bigger problem, i.e., how to deal
    with redirected pages, and this issue can be considered as a band-aid
    for the time being. See NUTCH-273 and NUTCH-353 for more details.

36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
    inlinks list. (Emmanuel Joke via dogacan)

37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
    parse. (dogacan)

38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)

39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)

40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
    domain-related utilities. (Enis Soztutar via dogacan)

41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
    release (2.1). (Dawid Weiss via dogacan)

42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
    request. (Dawid Weiss via dogacan)

43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
    (Emmanuel Joke via dogacan)

44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)

45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan)

46. NUTCH-554 - Generator throws IOException on invalid urls.
    (Brian Whitman via ab)

47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
    (Emmanuel Joke via dogacan)

48. NUTCH-25 - needs 'character encoding' detector.
    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)

49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)

50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
    (mattmann)

51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
    list. (Emmanuel Joke, Marcin Okraszewski via kubes)

52. NUTCH-501 - Implement a different caching mechanism for objects cached in
    configuration. (dogacan)

53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)

54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)

55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
    (dogacan, kubes via dogacan)

56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
    (Emmanuel Joke via dogacan)

57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)

58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)

59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
    search results. Created index-anchor plugin, removed functionality from
    index-basic plugin. For backwards compatibility, add index-anchor plugin to
    nutch-site.xml plugin.includes. (kubes)

60. NUTCH-581 - DistributedSearch does not update search servers added to
    search-servers.txt on the fly. (Rohan Mehta via kubes)

61. NUTCH-586 - Add option to run compiled classes without job file
    (enis via ab)

62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
    server. (Susam Pal via dogacan)

63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)

64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
    (Emmanuel Joke via ab)

65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)

66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)

67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)

68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)

69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)

70. NUTCH-602 - Allow configurable number of handlers for search servers
    (hartbecke via kubes)

71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)

72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)

73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)

74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)

75. NUTCH-603 - Add more default url normalizations (kubes)

76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)

77. NUTCH-44 - Too many search results, limits max results returned from a
    single search. (Emilijan Mirceski and Susam Pal via kubes)

78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
    updated to 1.2 version. (dogacan)

79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)

80. NUTCH-612 - URL filtering was disabled in Generator when invoked
    from Crawl (Susam Pal via ab)

81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)

82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)

83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)

84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
    Guard against reprUrl being null. (Emmanuel Joke, ab)

85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful
    (Emmanuel Joke, ab)

86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)

87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)

88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
    (Emmanuel Joke, dogacan, ab)

89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
    single slash. (Mark DeSpain via ab)

90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
    (Emmanuel Joke via kubes)

91. NUTCH-596 - ParseSegments parse content even if it is not
    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)

92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann, kubes)

93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
    Ritter, ab)

94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)

95. NUTCH-645 - Parse-swf unit test failing (ab)

96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)

97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
    private to _public_ (Guillaume Smet via dogacan)

98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
    tracking. (dogacan)

99. NUTCH-375 - Add support for Content-Encoding: deflated
    (Pascal Beis, ab)

100. NUTCH-633 - ParseSegment no longer allows reparsing.
     (dogacan)

101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)

102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)

103. NUTCH-654 - urlfilter-regex's main does not work.
     (dogacan)

104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
     (dogacan)

105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)

106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)

107. NUTCH-647 - Resolve URLs tool (kubes)

108. NUTCH-665 - Search Load Testing Tool (kubes)

109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
     (kubes)

110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)

111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)

112. NUTCH-668 - Domain URL Filter. (kubes)

113. NUTCH-594 - Serve Nutch search results in multiple formats including
     XML and JSON. (kubes)

114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)

115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
     fetch interval correctly. (dogacan)

116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)

117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
     (julien nioche via dogacan)

118. NUTCH-681 - parse-mp3 compilation problem.
     (Wildan Maulana via dogacan)

119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
     (dogacan)

120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
     digest. (dogacan)

121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
     (Joseph Chen, dogacan)

122. NUTCH-682 - SOLR indexer does not set boost on the document.
     (julien nioche via dogacan)

123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)

124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)

125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)

126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
     (Curtis d'Entremont, ab)

127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)

128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
     (Stefan Will, siren)

129. NUTCH-691 - Update jakarta poi jars to the most relevant version
     (Dmitry Lihachev via siren)

130. NUTCH-563 - Include custom fields in BasicQueryFilter
     (Julien Nioche via siren)

131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
     (Dmitry Lihachev via siren)

132. NUTCH-694 - Distributed Search Server fails (siren)

133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
     set at cross domain redirects (Remco Verhoef, dogacan via siren)

3b37d71af924 iniitial
dwinter
parents:
diff changeset
1023 134. NUTCH-247 - Robot parser to restrict (kubes, siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1024
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1025 135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
1026 via siren)
1027
1028 136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
1029 Dmitry Lihachev via siren)
1030
1031 137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
1032
1033 138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
1034 Doug Cook via ab)
1035
1036 139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
1037
1038 140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
1039
1040 141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
1041
1042 142. NUTCH-684 - Dedup support for Solr. (dogacan)
1043
1044 143. NUTCH-715 - Subcollection plugin doesn't work with default
1045 subcollections.xml file (Dmitry Lihachev via siren)
1046
1047 144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
1048
1049 Release 0.9 - 2007-04-02
1050
1051 1. Changed log4j configuration to log to stdout on commandline
1052 tools (siren)
1053
1054 2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)
1055
1056 3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
1057 siren)
1058
1059 4. Optionally skip pages with abnormally large values of Crawl-Delay
1060 (Dennis Kubes via ab)
1061
1062 5. Change readdb -stats to use CombiningCollector (ab)
1063
1064 6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
1065 Schneider and Stefan Groschupf via ab)
1066
1067 7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
1068 dependent jars (siren)
1069
1070 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
1071 in parse-plugins.xml (Chris A. Mattmann via siren)
1072
1073 9. NUTCH-105 - Network error during robots.txt fetch causes file to
1074 be ignored (Greg Kim via siren)
1075
1076 10. NUTCH-367 - DistributedSearch throws ClassCastException (siren)
1077
1078 11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
1079 to the current page (e.g. anchors). (Stefan Groschupf via ab)
1080
1081 12. NUTCH-365 - Flexible URL normalization (ab)
1082
1083 13. NUTCH-336 - Differentiate between newly discovered pages and newly
1084 injected pages (Chris Schneider via ab) NOTE: this changes the
1085 scoring API, filter implementations need to be updated.
1086
1087 14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
1088 via ab)
1089
1090 15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
1091 (Stefan Groschupf via ab)
1092
1093 16. NUTCH-374 - when http.content.limit is set to -1 and
1094 Response.CONTENT_ENCODING is gzip or x-gzip, it cannot fetch anything
1095 (King Kong via pkosiorowski)
1096
1097 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
1098
1099 ****************************** WARNING !!! ********************************
1100 * This upgrade breaks data format compatibility. A tool 'convertdb' *
1101 * was added to migrate existing CrawlDb-s to the new format. Segment data *
1102 * can be partially migrated using 'mergesegs', however segments will *
1103 * require re-parsing (and consequently re-indexing). *
1104 ****************************** WARNING !!! ********************************
1105
1106 18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
1107 the algorithm. (ab)
1108
1109 19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
1110 find parser (siren)
1111
1112 20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
1113 ParserFactory (Chris A. Mattmann via siren)
1114
1115 21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
1116 partition. (ab)
1117
1118 22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)
1119
1120 23. NUTCH-395 - Increase fetching speed (siren)
1121
1122 24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
1123 (reported by Jared Dunne)
1124
1125 25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)
1126
1127 26. NUTCH-403 - Make URL filtering optional in Generator (siren)
1128
1129 27. NUTCH-405 - Content object is not properly initialized in map method
1130 of ParseSegment (siren)
1131
1132 28. NUTCH-362 - Remove parse-text from unsupported filetypes in
1133 parse-plugins.xml (siren)
1134
1135 29. NUTCH-305 - Update crawl and url filter lists to exclude
1136 jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
1137 Neufeind) is also updated (siren)
1138
1139 30. NUTCH-406 - Metadata tries to write null values (mattmann)
1140
1141 31. NUTCH-415 - Generator should mark selected records in CrawlDb.
1142 Due to increased resource consumption this step is optional.
1143 Application-level locking has been added to prevent concurrent
1144 modification of databases. (ab)
1145
1146 32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
1147 now possible to correctly update CrawlDb from multiple segments.
1148 Introduce new status codes for temporary and permanent
1149 redirection. (ab)
1150
1151 33. NUTCH-322 - Fix Fetcher to store redirected pages and to store
1152 protocol-level status. This also should fix NUTCH-273. (ab)
1153
1154 34. Change default Fetcher behavior not to follow redirects immediately.
1155 Instead Fetcher will record redirects as new pages to be added to CrawlDb.
1156 This also partially addresses NUTCH-273. (ab)
1157
1158 35. Detect and report when Generator creates 0-sized segments. (ab)
1159
1160 36. Fix Injector to preserve already existing CrawlDatum if the seed list
1161 being injected also contains such a URL. (ab)
1162
1163 37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
1164 skipping bad URLs. (Michael Stack via ab)
1165
1166 38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
1167 Filters that are not in plugin.includes (Stefan Groschupf, siren)
1168
1169 39. NUTCH-421 - Allow predetermined running order of indexing filters
1170 (Alan Tanaman, siren)
1171
1172 40. When indexing pages with redirection, drop all intermediate pages and
1173 index only the final page. (ab)
1174
1175 41. Upgrade to Hadoop 0.10.1. (ab)
1176
1177 42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
1178 order in which IndexDoc-s are processed. (Dogacan Guney via ab)
1179
1180 43. NUTCH-428 - NullPointerException thrown when agent name is not
1181 configured properly. Changed to throw RuntimeException instead.
1182 (siren)
1183
1184 44. NUTCH-430 - Integer overflow in HashComparator.compare (siren)
1185
1186 45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)
1187
1188 46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
1189 or indexing from hadoop.io.DataOutputBuffer (siren)
1190
1191 47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)
1192
1193 48. NUTCH-390 - Javadoc warnings (mattmann)
1194
1195 49. NUTCH-449 - Make junit output format configurable. (nigel via cutting)
1196
1197 50. NUTCH-432 - Fix a bug where platform name with spaces would break the
1198 bin/nutch script. (Brian Whitman via ab)
1199
1200 51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)
1201
1202 52. NUTCH-167 - Observation of robots "noarchive" directive. (ab)
1203
1204 53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
1205 framework to operate properly (Heiko Dietze via mattmann)
1206
1207 54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
1208 Groschupf via kubes)
1209
1210 55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
1211 path is empty (kubes)
1212
1213 56. Upgrade to Hadoop 0.12.1 release. (ab)
1214
1215 57. NUTCH-246 - Incorrect segment size being generated due to time
1216 synchronization issue (Stefan Groschupf via ab)
1217
1218 58. Upgrade to Hadoop 0.12.2 release. (ab)
1219
1220 59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
1221 Stack and Dogacan Guney via kubes)
1222
1223 Release 0.8 - 2006-07-25
1224
1225 0. Totally new architecture, based on hadoop
1226 [http://lucene.apache.org/hadoop] (cutting)
1227
1228 1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).
1229
1230 2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
1231 (Rod Taylor via cutting)
1232
1233 3. NUTCH-88 - Enhance ParserFactory plugin selection policy
1234 (jerome)
1235
1236 4. NUTCH-124 - Protocol-httpclient does not follow redirects when
1237 fetching robots.txt (cutting)
1238
1239 5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
1240 (stack@archive.org, cutting)
1241
1242 6. NUTCH-114 - Getting number of urls and links from crawldb
1243 (Stefan Groschupf via ab)
1244
1245 7. NUTCH-112 - Link in cached.jsp page to cached content is an
1246 absolute link (Chris A. Mattmann via jerome)
1247
1248 8. NUTCH-135 - Http header meta data are case insensitive in the
1249 real world (Stefan Groschupf via jerome)
1250
1251 9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
1252 to UTF-8 BOM (KuroSaka TeruHiko via siren)
1253
1254 10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)
1255
1256 11. Added support for OpenSearch (cutting)
1257
1258 12. NUTCH-142 - NutchConf should use the thread context classloader
1259 (Mike Cannon-Brookes via pkosiorowski)
1260
1261 13. NUTCH-160 - Use standard Java Regex library rather than
1262 org.apache.oro.text.regex (Rod Taylor via cutting)
1263
1264 14. NUTCH-151 - CommandRunner can hang after the main thread exec is
1265 finished and has inefficient busy loop (Paul Baclace via cutting)
1266
1267 15. NUTCH-174 - Problem encountered with ant during compilation
1268
1269 16. NUTCH-190 - ParseUtil drops reason for failed parse
1270 (stack@archive.org via ab)
1271
1272 17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)
1273
1274 18. NUTCH-194 - Nutch-169 introduced two tiny bugs (Marko Bauhardt via ab)
1275
1276 19. NUTCH-178 - session creation must be "false" in search.jsp
1277 (YourSoft via siren)
1278
1279 20. NUTCH-200 - OpenSearch Servlet is broken
1280 (Marko Bauhardt via siren)
1281
1282 21. NUTCH-81 - Webapp only works when deployed in root
1283 (AJ Banck, Michael Nebel via siren)
1284
1285 22. NUTCH-139 - Standard metadata property names in the ParseData
1286 metadata (Chris A. Mattmann, jerome)
1287
1288 23. NUTCH-192 - Meta data support for CrawlDatum
1289 (Stefan Groschupf via ab)
1290
1291 24. NUTCH-52 - Parser plugin for MS Excel files
1292 (Rohit Kulkarni via jerome)
1293
1294 25. NUTCH-53 - Parser plugin for Zip files
1295 (Rohit Kulkarni via jerome)
1296
1297 26. NUTCH-137 - footer is not displayed in search result page
1298 (KuroSaka TeruHiko via siren)
1299
1300 27. NUTCH-118 - FAQ link points to invalid URL
1301 (Steve Betts via siren)
1302
1303 28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
1304 translation (Ivan Sekulovic via siren)
1305
1306 29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
1307 via cutting)
1308
1309 30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
1310 allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)
1311
1312 31. NUTCH-214 - Added Links to web site to search mailing list
1313 (Jake Vanderdray via jerome)
1314
1315 32. NUTCH-204 - Multiple field values in HitDetails
1316 (Stefan Groschupf via jerome)
1317
1318 33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
1319 to -1 to be consistent with http (jerome)
1320
1321 34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)
1322
1323 35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
1324 pkosiorowski)
1325
1326 36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
1327 jerome)
1328
1329 37. NUTCH-229 - Improved handling of plugin folder configuration
1330 (Stefan Groschupf via ab)
1331
1332 38. NUTCH-206 - Search server throws InstantiationException (ab)
1333
1334 39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
1335 via ab)
1336
1337 40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)
1338
1339 41. Update to lucene 1.9.1 (cutting)
1340
1341 42. NUTCH-235 - Duplicate Inlink values (ab)
1342
1343 43. NUTCH-234 - Clustering extension code cleanups and a real
1344 JUnit test case for the current implementation (Dawid Weiss via ab)
1345
1346 44. NUTCH-210 - Context.xml file for Nutch web application
1347 (Chris A. Mattmann via jerome)
1348
1349 45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)
1350
1351 46. NUTCH-232 - Search.jsp has multiple search forms creating
1352 invalid html / incorrect focus function (jerome)
1353
1354 47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)
1355
1356 48. NUTCH-244 - Inconsistent handling of property values
1357 boundaries / unable to set db.max.outlinks.per.page to
1358 infinite (jerome)
1359
1360 49. NUTCH-245 - DTD for plugin.xml configuration files
1361 (Chris A. Mattmann via jerome)
1362
1363 50. NUTCH-250 - Generate to log truncation caused by
1364 generate.max.per.host (Rod Taylor via cutting)
1365
1366 51. NUTCH-125 - OpenOffice Parser plugin (ab)
1367
1368 52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
1369 (cutting)
1370
1371 53. NUTCH-240 - Scoring API: extension point, scoring filters and
1372 an OPIC plugin (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1373
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1374 54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1375
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1376 55. NUTCH-268 - Generator and lib-http use different definitions of
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1377 "unique host" (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1378
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1379 56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1380 via siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1381
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1382 57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1383 (Dennis Kubes via ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1384
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1385 58. NUTCH-201 - Add support for subcollections
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1386 (siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1387
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1388 59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1389 (Stefan Groschupf via jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1390
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1391 60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1392
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1393 61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1394 (Stefan Groschupf via jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1395
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1396 62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1397 (stack@archive.org via siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1398
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1399 63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1400 (Stefan Neufeind via siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1401
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1402 64. NUTCH-307 - Wrong configured log4j.properties (jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1403
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1404 65. NUTCH-303 - Logging improvements (jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1405
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1406 66. NUTCH-308 - Maximum search time limit (ab)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1407
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1408 67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1409 problem (Grant Glouser via siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1410
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1411 68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1412
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1413 69. NUTCH-317 - Clarify what the queryLanguage argument of
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1414 Query.parse(...) means (jerome)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1415
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1416 70. Added alternative experimental web gui in contrib containing
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1417 extensions like subcollection, keymatch, user preferences,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1418 caching, implemented mainly using tiles and jstl (siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1419
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1420 71. NUTCH-320 DmozParser does not output list of urls to stdout
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1421 but to a log file instead. Original functionality restored.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1422
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1423 72. NUTCH-271 - Add ability to limit crawling to the set of initially
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1424 injected hosts (db.ignore.external.links) (Philippe Eugene,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1425 Stefan Neufeind via ab)
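
One way to picture what db.ignore.external.links does is as a filter over a page's outlinks. The sketch below is not Nutch code; it assumes that "external" means "on a different host than the page the link was found on", and the function name is hypothetical.

```python
from urllib.parse import urlsplit
from typing import List


def filter_outlinks(page_url: str, outlinks: List[str]) -> List[str]:
    """Keep only outlinks on the same host as the page they came from.

    Assumption: a link is "external" when its host differs from the
    host of the source page (case-insensitive comparison).
    """
    page_host = (urlsplit(page_url).hostname or "").lower()
    kept = []
    for link in outlinks:
        link_host = (urlsplit(link).hostname or "").lower()
        if link_host == page_host:
            kept.append(link)
    return kept
```

With a filter like this applied to every fetched page, a crawl that starts from a set of injected URLs never leaves the hosts of those URLs.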
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1426
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1427 73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)
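
As an illustration of what Crawl-Delay support involves, the sketch below reads the directive from a robots.txt body using Python's standard urllib.robotparser. It is not the Nutch implementation; the helper name and the fallback-to-default behaviour are assumptions.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt: str, agent: str, default: float) -> float:
    """Return the Crawl-delay (seconds) that applies to `agent`.

    Falls back to `default` (for example a configured server delay)
    when robots.txt does not specify a delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return default if delay is None else float(delay)
```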
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1428
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1429 74. NUTCH-327 - Fixed logging directory on cygwin (siren)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1430
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1431 Release 0.7 - 2005-08-17
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1432
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1433 1. Added support for "type:" in queries. Search results are limited/qualified
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1434 by mimetype or its primary type or sub type. For example,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1435 (1) searching with "type:application/pdf" restricts results
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1436 to pages which were identified to be of mimetype "application/pdf".
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1437 (2) with "type:application", nutch will return pages of
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1438 primary type "application".
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1439 (3) with "type:pdf", only pages of sub type "pdf" will be listed.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1440 (John Xing, 20050120)
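
The three cases above can be sketched as a single matching rule. This is only an illustration of the described semantics, not the Nutch query code; the function name is hypothetical.

```python
def matches_type(query_type: str, mimetype: str) -> bool:
    """Return True if a document's MIME type satisfies a "type:" term.

    A term containing "/" must equal the full MIME type; otherwise it
    may match either the primary type or the sub type.
    """
    primary, _, sub = mimetype.lower().partition("/")
    wanted = query_type.lower()
    if "/" in wanted:
        return wanted == f"{primary}/{sub}"
    return wanted in (primary, sub)
```

For example, matches_type("application", "application/msword") is true, while matches_type("pdf", "text/html") is false.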
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1441
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1442 2. Added support for "date:" in queries. Last-Modified is indexed.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1443 Search results are restricted by lower and upper date (inclusive)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1444 as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1445 only returns pages with Last-Modified in year 2004.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1446 (John Xing, 20050122)
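
A minimal sketch of the date:yyyymmdd-yyyymmdd semantics, assuming both bounds are inclusive as stated above. The helper names are hypothetical and the code is not taken from Nutch.

```python
from datetime import date, datetime
from typing import Tuple


def parse_date_range(term: str) -> Tuple[date, date]:
    """Parse "date:yyyymmdd-yyyymmdd" into (lower, upper) dates."""
    prefix, _, value = term.partition(":")
    low, sep, high = value.partition("-")
    if prefix != "date" or not sep:
        raise ValueError(f"not a date range term: {term!r}")
    fmt = "%Y%m%d"
    return (datetime.strptime(low, fmt).date(),
            datetime.strptime(high, fmt).date())


def in_range(last_modified: date, term: str) -> bool:
    """True if last_modified falls within the term, bounds inclusive."""
    low, high = parse_date_range(term)
    return low <= last_modified <= high
```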
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1447
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1448 3. Add URLFilter plugin interface and convert existing url filters into
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1449 plugins. (John Xing, 20050206)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1450
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1451 4. Add UpdateSegmentsFromDb tool, which updates the scores and
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1452 anchors of existing segments with the current values in the web
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1453 db. This is used by CrawlTool, so that pages are now only fetched
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1454 once per crawl. (Doug Cutting, 20050221)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1455
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1456 5. Moved code into org.apache.nutch sub-packages. Changed license to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1457 Apache 2.0. Removed jar files whose licenses do not permit
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1458 redistribution by Apache. Disabled compilation of plugins which
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1459 require these libraries. (Doug Cutting 20050301)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1460
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1461 6. Index host and title in separate fields. Host was indexed
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1462 previously only as a part of the URL. Title was indexed as an
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1463 anchor. Now boosts for matching these fields may be adjusted
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1464 separately from boosts for matching anchors and url. Also: move
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1465 site indexing to index-basic plugin to minimize the number of
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1466 times the URL needs to be parsed; and, stop using anchor analyzer
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1467 for anything but anchors. (Piotr Kosiorowski via Doug Cutting
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1468 20050323)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1469
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1470 7. Add servlet Cached.java that serves cached Content of any mime type.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1471 web.xml and cached.jsp are slightly modified.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1472 (John Xing, 20050401)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1473
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1474 8. Add skipCompressedByteArray() to WritableUtils.java.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1475 (John Xing, 20050402)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1476
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1477 9. Fixes to jsp and static web pages. These now use relative links,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1478 so that the Nutch webapp file can be used in places other than at
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1479 the root. Also fixed links to the about and help pages. Bug #32.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1480 (Jerome Charron via cutting, 20050404)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1481
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1482 10. Added some features to DistributedSearch: new segments can be added
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1483 to searchservers without restarting the frontend, defective search
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1484 servers are not queried until they come back online, watchdog keeps
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1485 an eye on your searchservers and writes simple statistics.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1486 (Sami Siren, 20050407)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1487
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1488 11. Fix for bug #4 - Unbalanced quote in query eats all resources.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1489 (Piotr Kosiorowski, Sami Siren, 20050407)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1490
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1491 12. Close Issue #33 - MIME content type detector (using magic char sequences).
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1492 (Jerome Charron and Hari Kodungallur via John Xing, 20050416)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1493
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1494 13. Add a servlet that implements A9's OpenSearch RSS web service.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1495 (cutting, 20050418)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1496
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1497 14. Remove references to link analysis from tutorial, and enable
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1498 scoring by link count when generating fetchlists and searching.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1499 (cutting, 20040419)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1500
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1501 15. Make query boosts for host, title, anchor and phrase matches
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1502 configurable. (Piotr Kosiorowski via cutting, 20050419)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1503
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1504 16. Add support for sorting search results and search-time deduping by
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1505 fields other than site.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1506
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1507 17. Automatically convert range queries into cached range filters.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1508 This improves the performance and scalability of, e.g., date range
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1509 searching.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1510
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1511 18. Several methods have been renamed due to misspellings. The old
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1512 methods have been deprecated and will be removed before the 1.0
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1513 release.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1514
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1515
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1516 Release 0.6
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1517
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1518 1. Added clustering-carrot2 plugin, together with introduction of clustering
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1519 api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1520
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1521 2. Make a number of changes to NDFS (Nutch Distributed File System)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1522 to fix bugs, add admin tools, etc.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1523
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1524 Also, modify all command line tools so you can indicate whether to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1525 use NDFS or the local filesystem. If you indicate nothing, then
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1526 it defaults to the local fs.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1527
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1528 I've used this to do a 35m page crawl via NDFS, distributed over a
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1529 dozen machines. (Mike Cafarella)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1530
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1531 3. Add support for BASE tags in HTML. Outlinks are now correctly
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1532 extracted when a BASE tag is present. (cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1533
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1534 4. Fix two bugs in result pagination. When the last hit on a page
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1535 was the last hit overall, the "next" button was sometimes shown
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1536 when the "show all" button should be shown instead. Also, in
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1537 certain cases, the "show all" button would be shown when the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1538 "next" button should have been shown. (cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1539
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1540 5. Add config parameter "indexer.max.tokens" that determines the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1541 maximum number of tokens indexed per field. (Andy Hedges via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1542
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1543 6. Add parser for mp3 files. (Andy Hedges via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1544
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1545 7. Add RegexUrlNormalizer. This is useful for things like stripping
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1546 out session IDs from URLs. To use it, add values for
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1547 urlnormalizer.class and urlnormalizer.regex.file to your
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1548 nutch-site.xml. The RegexUrlNormalizer class extends the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1549 BasicUrlNormalizer, and does basic normalization as well.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1550 (Luke Baker via cutting)
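
To illustrate the kind of rewriting a regex-based normalizer performs, the sketch below applies an ordered list of (pattern, substitution) rules to a URL. The rule shown, which strips a ";jsessionid=..." path parameter, is a hypothetical example and is not taken from any rule file shipped with Nutch.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical rule set: each entry is (compiled pattern, substitution).
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r";jsessionid=[0-9A-Za-z]+", re.IGNORECASE), ""),
]


def normalize(url: str, rules: List[Tuple[Pattern[str], str]] = RULES) -> str:
    """Apply every rule in order and return the rewritten URL."""
    for pattern, substitution in rules:
        url = pattern.sub(substitution, url)
    return url
```

Two URLs that differ only in their session ID then normalize to the same string, so the page is not fetched twice.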
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1551
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1552 8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1553
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1554 9. Added Polish translation (Andrzej Bialecki, 20040911)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1555
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1556 10. Added 3 more language profiles to language identifier (ru,hu,pl).
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1557 Other changes to language identifier: Profiles converted to utf8,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1558 added some test cases, changed the similarity calculation.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1559 (Sami Siren, 20040925)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1560
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1561 11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1562
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1563 12. Added plugin index-more and more.jsp (John Xing, 20041003)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1564
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1565 13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1566 in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1567
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1568 14. Fixed a bug that fails cached.jsp, explain.jsp, anchors.jsp and text.jsp
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1569 (but not search.jsp) with NullPointerException in distributed search.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1570 It seems that this bug appeared after the "hits per site" code was added.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1571 The fix is done in Hit.java, making sure String site is never null.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1572 Hopefully this fix has no bad effect on the "hits per site" code.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1573 (John Xing, 20041006)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1574
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1575 15. Fixed a bug that fails fullyDelete() in FileUtil.java for
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1576 LocalFileSystem.java. This bug also exposes possible incompleteness
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1577 of NDFSFile.java, where a few methods are not supported, including
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1578 delete(). Nothing changed in NDFSFile.java though. Leave it for future
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1579 improvement (John Xing, 20041022).
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1580
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1581 16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1582 A new status code CANT_PARSE is added to FetcherOutput.java.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1583 Without option -noParsing, there is no change in fetcher behavior. With
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1584 option -noParsing, the fetcher only crawls; no parsing is carried out.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1585 Then, ParseSegment.java should be used to parse in a separate pass.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1586 (John Xing, 20041025)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1587
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1588 17. Added ontology plugin. Currently it is used for query refinement, as
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1589 exemplified in refine-query-init.jsp and refine-query.jsp. By default,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1590 query refinement is disabled in search.jsp. Please check
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1591 ./src/plugin/ontology/README.txt for further description.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1592 Ontology plugin certainly can be used for many other things.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1593 (Michael J. Pan via John Xing, 20041129)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1594
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1595 18. Changed fetcher.server.delay to be a float, so that sub-second
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1596 delays can be specified. (cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1597
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1598 19. Added plugin.includes config parameter that determines which
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1599 plugins are included. By default now only http, html and basic
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1600 indexing and search plugins are enabled, rather than all plugins.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1601 This should make default performance more predictable and reliable
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1602 going forward. (cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1603
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1604 20. Cleaned up some filesystem code, including:
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1605
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1606 - Replaced BufferedRandomAccessFile with two simpler utilities,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1607 NFSDataInputStream and NFSDataOutputStream.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1608
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1609 - Fixed the bug where SequenceFiles were no longer flushed when
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1610 created, so that, when fetches crashed, segments were
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1611 unreadable. Now segments are always readable after crashes.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1612 Only the contents of the last buffer are lost.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1613
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1614 - Simplified the FSOutputStream API to not include seek(). We
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1615 should never need that functionality.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1616
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1617 - Simplified LocalFileSystem's implementations of FSInputStream
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1618 and FSOutputStream and optimized FSInputStream.seek().
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1619
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1620 (cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1621
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1622 21. Fixed BasicUrlNormalizer to better handle relative urls. The file
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1623 part of a URL is normalized in the following manner:
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1624
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1625 1. "/aa/../" will be replaced by "/" This is done step by step until
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1626 the url doesn't change anymore. So we ensure that
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1627 "/aa/bb/../../" will be replaced by "/", too
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1628
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1629 2. leading "/../" will be replaced by "/"
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1630
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1631 (Sven Wende via cutting)
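
The two rules above can be sketched as follows. This is an illustration of the described behaviour, not the BasicUrlNormalizer source; the function name and the exact regular expression are assumptions.

```python
import re

# Matches "/segment/../" where segment is not itself "..".
_PARENT = re.compile(r"/(?!\.\./)[^/]+/\.\./")


def normalize_path(path: str) -> str:
    """Normalize the file part of a URL using the two rules above."""
    # Rule 1: replace "/aa/../" by "/" one step at a time until the
    # path no longer changes, so nested cases collapse as well.
    while True:
        collapsed = _PARENT.sub("/", path, count=1)
        if collapsed == path:
            break
        path = collapsed
    # Rule 2: replace any leading "/../" by "/".
    while path.startswith("/../"):
        path = path[3:]
    return path
```

Repeating rule 1 until the path is stable is what makes "/aa/bb/../../" reduce first to "/aa/../" and then to "/".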
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1632
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1633 22. Fix Page constructors so that next fetch date is less likely to be
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1634 misconstrued as a float. This patches a problem in WebDBInjector,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1635 where new pages were added to the db with nextScore set to the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1636 intended nextFetch date. This, in turn, confused link analysis.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1637
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1638 23. In ndfs code, replace addLocalFile(), putToLocalFile() with
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1639 copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1640 moveToLocalFile(). (John Xing, 20041217)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1641
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1642 24. Added new config parameter fetcher.threads.per.host. This is used
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1643 by the Http protocol. When this is one, behavior is as before.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1644 When this is greater than one then multiple threads are permitted
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1645 to access a host at once. Note that fetcher.server.delay is no
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1646 longer consistently observed when this is greater than one.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1647 (Luke Baker via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1648
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1649 Release 0.5
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1650
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1651 1. Changed plugin directory to be a list of directories.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1652
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1653 2. Permit Plugin to be the default plugin implementation.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1654
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1655 3. Added pluggable interface for network protocols in new package
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1656 net.nutch.protocol. Moved http code from core into a plugin.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1657
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1658 4. Added pluggable interface for content parsing in new package
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1659 net.nutch.parse. Moved html parsing code from core into a
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1660 plugin.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1661
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1662 5. Fixed a bug in NutchAnalysis where 16-bit characters were not
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1663 processed correctly.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1664
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1665 6. Fixed bug #971731: random summaries on result page.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1666 (Daniel Naber via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1667
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1668 7. Made Nutch logo transparent. (Daniel Naber via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1669
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1670 8. Added file protocol plugin. (John Xing via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1671
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1672 9. Added ftp protocol plugin. (John Xing via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1673
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1674 10. Added pdf and msword parser plugins. (John Xing via cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1675
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1676 11. Added pluggable indexing interface. By default, url, content,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1677 anchors and title are indexed, as before, but now one can easily
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1678 alter this to, e.g., index metadata. A demonstration is provided
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1679 which extracts and indexes Creative Commons license urls. (cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1680
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1681 12. Add language identification plugin.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1682
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1683 The process of identification is as follows:
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1684
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1685 1. html (html only, HTML 4.0 "lang" attribute)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1686 2. meta tags (html only, http-equiv, dc.language)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1687 3. http header (Content-Language)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1688 4. if all above fail "statistical analysis"
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1689
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1690 1 & 2 are run during the fetching phase and 3 & 4 are run during the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1691 indexing phase.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1692
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1693 Currently supported languages (in "statistical analysis") are
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1694 da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1695 from http://www.isi.edu/~koehn/europarl/ and the profiles were
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1696 built with the tool supplied in the patch.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1697
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1698 After indexing, the language can be found in the field named "lang".
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1699
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1700 It's not 100% accurate but it's a start.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1701 (Sami Siren)
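
A minimal sketch of the fallback order described in entry 12, assuming each
source (html "lang" attribute, meta tags, Content-Language header, statistical
analysis) is modelled as a supplier that may or may not yield a language code.
The class and method names are hypothetical and are not taken from the plugin.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Supplier;

final class LanguageFallbackSketch {
    /**
     * Tries each source in the given order and returns the first
     * non-blank language code, normalised to lower case.
     * Hypothetical helper, not the plugin's real API.
     */
    static Optional<String> identify(List<Supplier<Optional<String>>> sources) {
        for (Supplier<Optional<String>> source : sources) {
            Optional<String> lang = source.get();
            if (lang.isPresent() && !lang.get().trim().isEmpty()) {
                return Optional.of(lang.get().trim().toLowerCase(Locale.ROOT));
            }
        }
        // No source produced a usable value.
        return Optional.empty();
    }
}
```

Because the sources are suppliers, a later and more expensive step such as
statistical analysis is only evaluated when every earlier step came back empty.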
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1702
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1703 13. Added SegmentMergeTool and "mergesegs" command, to remove
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1704 duplicated or otherwise unused content from several segments and
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1705 join them together into a single new segment. The tool also
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1706 optionally performs several other steps required for proper
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1707 operation of Nutch - such as indexing segments, deleting
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1708 duplicates, merging indices, and indexing the new single segment.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1709 (Andrzej Bialecki)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1710
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1711 14. Add the ability to retrieve ParseData of a search hit. ParseData
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1712 contains many valuable properties of a search hit.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1713
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1714 This is required (among other things) to properly display the cached
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1715 content because it's not possible to determine the character
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1716 encoding from the output of the getContent() method (which returns
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1717 byte[]). The symptoms are that for HTML pages using non-latin1 or
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1718 non-UTF8 encodings the cached preview will almost certainly look
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1719 broken. Using the attached patch it is possible to determine the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1720 character encoding from the ParseData (for HTTP: Content-Type
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1721 metadata), and encode the content accordingly. (Andrzej Bialecki)
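
The idea in entry 14 can be sketched as follows, assuming the Content-Type
value has already been read from the ParseData metadata as a plain string. The
class name, the method names and the ISO-8859-1 fallback are assumptions made
for illustration; they are not the patch's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class CachedContentDecoderSketch {
    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes raw content bytes using the charset named in the Content-Type value. */
    static String decode(byte[] content, String contentType) {
        // Assumed fallback: ISO-8859-1, which maps every byte to a character.
        return new String(content, charsetFromContentType(contentType, StandardCharsets.ISO_8859_1));
    }
}
```

For example, decoding UTF-8 bytes with "text/html; charset=UTF-8" yields the
original text, while the same bytes without a charset parameter are read as
ISO-8859-1 and look broken, which is the symptom the entry describes.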
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1722
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1723 15. Add a pluggable query interface. By default, the content, anchor
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1724 and url fields are searched as before. A sample plugin indexes
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1725 the host name and adds a "site:" keyword to query parsing.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1726
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1727 16. Added support for "lang:" in queries. For example, searching with
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1728 "lang:en" restricts results to pages which were identified to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1729 be in English.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1730
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1731 17. Automatically optimize field queries to use cached Lucene filters.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1732 This makes, for example, searches restricted by languages or sites
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1733 that are very common much faster.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1734
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1735 18. Improved charset handling in jsp pages. (jshin by cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1736
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1737 19. Permit topic filtering when injecting DMOZ pages. (jshin by cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1738
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1739 20. When parsing crawled pages, interpret charset specifications in
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1740 html meta tags. (jshin by cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1741
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1742 21. Added support for "cc:licensed" in queries, which searches for documents
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1743 released under Creative Commons licenses. Attributes of the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1744 license may also be queried, with, e.g., "cc:by" for
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1745 attribution-required licenses, "cc:nc" for non-commercial
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1746 licenses, etc.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1747
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1748 22. Relative paths named in plugin.folders are now searched for on the
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1749 classpath. This makes, e.g., deployment in a war file much simpler.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1750
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1751 23. Modifications to Fetcher.java.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1752
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1753 1. Make sure it works properly with regard to creation and initialization
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1754 of plugin instances. The problem was that multiple threads race to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1755 startUp() or shutDown() plugin instances. It was solved by synchronizing
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1756 certain code sections in PluginRepository.java and Extension.java.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1757 (Stefan Groschupf via John Xing)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1758
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1759 2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1760 may never return (quit) if there are still data or other structures
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1761 (e.g., persistent socket connections) associated with plugins. (John Xing)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1762
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1763 3. Fixed one type of Fetcher "hang" problem by monitoring named
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1764 FetcherThreads. If all FetcherThreads are gone (finished),
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1765 Fetcher.java is considered done. The problem was: there could be
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1766 runaway threads started by external libs via FetcherThreads.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1767 Those threads never return, thus keeping Fetcher from exiting normally.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1768 (John Xing)
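
The monitoring approach in item 3 might look roughly like the sketch below: only
threads whose names carry a known prefix count as workers, so a runaway helper
thread started by an external library no longer blocks completion. The thread
group, the name prefix and the polling interval are assumptions; the change log
does not say how Fetcher.java actually tracks its threads.

```java
final class FetcherThreadMonitorSketch {
    /** Counts live threads in the group whose name starts with the given prefix. */
    static int liveWorkers(ThreadGroup group, String prefix) {
        // activeCount() is only an estimate, so leave some headroom.
        Thread[] threads = new Thread[group.activeCount() + 16];
        int n = group.enumerate(threads, false);
        int live = 0;
        for (int i = 0; i < n; i++) {
            if (threads[i].isAlive() && threads[i].getName().startsWith(prefix)) {
                live++;
            }
        }
        return live;
    }

    /**
     * Polls until no named worker is alive. Returns true when all workers
     * have finished, false if the calling thread was interrupted first.
     */
    static boolean awaitWorkers(ThreadGroup group, String prefix, long pollMillis) {
        while (liveWorkers(group, prefix) > 0) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

Threads without the prefix are ignored on purpose, which is what lets the
caller treat the fetch as done even while such threads are still running.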
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1769
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1770 24. Eliminate excessive hits from sites. This is done efficiently by
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1771 adding the site name to Hit instances, and, when needed,
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1772 re-querying with too-frequent sites prohibited in the query.
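
As a rough illustration of entry 24, the sketch below caps the number of hits
kept per site and collects the sites that exceeded the cap, which are the ones
a follow-up query would prohibit. The Hit and Result classes here are
simplified stand-ins invented for the example, not Nutch's own classes, and the
re-query step itself is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SiteLimitSketch {
    /** Simplified stand-in for a search hit that carries its site name. */
    static final class Hit {
        final String url;
        final String site;

        Hit(String url, String site) {
            this.url = url;
            this.site = site;
        }
    }

    /** Hits kept after capping, plus the sites that had more hits than allowed. */
    static final class Result {
        final List<Hit> kept = new ArrayList<>();
        final Set<String> tooFrequentSites = new LinkedHashSet<>();
    }

    /** Keeps at most maxPerSite hits per site, preserving the ranked order. */
    static Result limitPerSite(List<Hit> rankedHits, int maxPerSite) {
        Result result = new Result();
        Map<String, Integer> counts = new HashMap<>();
        for (Hit hit : rankedHits) {
            int seen = counts.merge(hit.site, 1, Integer::sum);
            if (seen <= maxPerSite) {
                result.kept.add(hit);
            } else {
                // Candidate for exclusion in a follow-up query.
                result.tooFrequentSites.add(hit.site);
            }
        }
        return result;
    }
}
```

Carrying the site name on each hit keeps this pass linear in the number of
hits, with no extra lookups per result.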
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1773
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1774
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1775 Release 0.4
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1776
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1777 1. Http class refactored. (Kevin Smith via Tom Pierce)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1778
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1779 2. Add Finnish translation. (Sampo Syreeni via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1780
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1781 3. Added Japanese translation. (Yukio Andoh via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1782
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1783 4. Updated Dutch translation. (Ype Kingma via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1784
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1785 5. Initial version of Distributed DB code. (Mike Cafarella)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1786
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1787 6. Make things more tolerant of crashed fetcher output files.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1788 (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1789
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1790 7. New skin for website. (Frank Henze via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1791
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1792 8. Added Spanish translation. (Diego Basch via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1793
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1794 9. Add FTP support to fetcher. (John Xing via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1795
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1796 10. Added Thai translation. (Pichai Ongvasith via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1797
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1798 11. Added Robots.txt & throttling support to Fetcher.java. (Mike
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1799 Cafarella)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1800
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1801 12. Added nightly build. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1802
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1803 13. Default all link scores to 1.0. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1804
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1805 14. Permit one to keep internal links. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1806
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1807 15. Fixed dedup to select shortest URL. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1808
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1809 16. Changed index merger so that merged index is written to named
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1810 directory, rather than to a generated name in that directory.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1811 (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1812
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1813 17. Disable coordination weighting of query clauses and other minor
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1814 scoring improvements. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1815
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1816 18. Added a new command, crawl, that constructs a database, injects a
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1817 url file and performs a few rounds of generate/fetch/updatedb.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1818 This simplifies use for intranet sites. Changed some defaults to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1819 be more intranet friendly. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1820
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1821 19. Fixed a bug where Fetcher.java didn't construct correct relative
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1822 links when a page was redirected. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1823
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1824 20. Fixed a query parser problem with lookahead over plusses and minuses.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1825 (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1826
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1827 21. Add support for HTTP proxy servers. (Sami Siren via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1828
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1829 22. Permit searching while fetching and/or indexing.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1830 (Sami Siren via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1831
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1832 23. Fix a bug when throttling is disabled. (Sami Siren via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1833
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1834 24. Updated Bahasa Malaysia translation. (Michael Lim via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1835
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1836 25. Added Catalan translation. (Xavier Guardiola via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1837
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1838 26. Added Brazilian Portuguese translation.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1839 (A. Moreir via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1840
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1841 27. Added a French translation. (Julien Nioche via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1842
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1843 28. Updated to Lucene 1.4RC3. (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1844
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1845 29. Add capability to boost by link count & use it in crawl tool.
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1846 (Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1847
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1848 30. Added plugin system. (Stefan Groschupf via Doug Cutting)
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1849
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1850 31. Add this change log file, for recording significant changes to
3b37d71af924 iniitial
dwinter
parents:
diff changeset
1851 Nutch. Populate it with changes from the last few months.