Apache Nutch was built on top of Apache Lucene, a powerful Java search engine.
Nutch developers adapted the data-agnostic Lucene codebase into a project dedicated specifically to searching data on the Web.
This technology can be used to search your own Web pages as a built-in search server, or to crawl the Web looking for data to parse and scrape into your database.
Nutch can run on a single machine, but works better in Hadoop clusters.
Various plugins are available to extend its functionality.
What is new in this release:
- Ensure duplicate tags do not exist in microformat-reltag tag set.
- A better fallback value for the date field.
- Get rid of the dreaded.
- Upgrade to Hadoop 1.2.0.
- Upgrade to Tika 1.3.
What is new in version 2.0:
- Renamed HTMLParseFilter to ParseFilter.
- Remove remaining robots/IP blocking code in lib-http.
- Port logging to slf4j.
- External parser supports encoding attribute.
- Ivy configuration settings don't include Gora.
- Injector should add the metadata before calling injectedScore.
- Port Nutch benchmark to Nutchbase.
- Add parse-html back.
- MoreIndexingFilter missing date format.
- Timeout for Parser.
- Retry interval in crawl date is set to 0.
- Generate log output for solr indexer and dedup.
- Improved NutchConfiguration.
- SolrDeleteDuplicates needs to clone the SolrRecord objects.
- Native hadoop libs not available through maven.
- Separate the build and runtime environments.
What is new in version 1.5:
- This release includes upgrades of several major components, including Tika 1.1 and Hadoop 1.0.0, improvements to the LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering, and parsing, to name a few.
What is new in version 1.4:
- Added Solr 4x (trunk) example schema.
- Added '/runtime' to svn ignore.
- Application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml.
- Fixed parse-tika and parse-html to use relative URL resolution per RFC-3986.
- Upgraded to Tika 0.10. NOTE: Tika's new RTF parser may ignore more text in malformed documents than previously - see TIKA-748 for details.
- Added Sonar targets to Ant build.xml.
- Upgraded SolrJ to version 3.4.0.
- Ant pmd target is broken.
- Upgraded Solr schema to version 1.4.
What is new in version 1.3:
- This release includes several improvements: improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification, and a source release tarball that is an order of magnitude smaller (only about 2MB).
What is new in version 1.2:
- Make index-more plug-in configurable.
- Configurable file protocol parent directory crawling.
- Timeout for Parser.
- Website is still Lucene branded.
- Retry interval in crawl date is set to 0.
What is new in version 1.0:
- Allow parsers to return multiple Parse objects.
- Removed redundant commons-logging jar from ontology plugin.
- Bug in SegmentReader causes infinite loop.
- Scoring filter should distribute score to all outlinks at once.
- Reduce number of warnings in nutch core.