Apache Nutch

Software Details:
Version: 2.3
Upload Date: 1 Mar 15
Distribution Type: Freeware
Downloads: 36

Rating: 3.0/5 (Total Votes: 1)

Apache Nutch was built on top of Apache Lucene, a powerful Java search library.

Nutch developers adapted the data-agnostic Lucene codebase into a project dedicated specifically to searching data on the Web.

It can be used as a built-in search server for your own Web pages, or to crawl the Web for data to parse and store in your database.

Nutch can run on a single machine, but works better in Hadoop clusters.

Various plugins are available to extend its functionality.

What is new in this release:

  • Ensure duplicate tags do not exist in microformat-reltag tag set.
  • A better fallback value for date field.
  • Get rid of the dreaded.
  • Upgrade to Hadoop 1.2.0.
  • Upgrade to Tika 1.3.

What is new in version 2.0:

  • Renamed HTMLParseFilter into ParseFilter.
  • Remove remaining robots/IP blocking code in lib-http.
  • Port logging to slf4j.
  • External parser supports encoding attribute.
  • Ivy configuration settings don't include Gora.
  • Injector should add the metadata before calling injectedScore.
  • Port Nutch benchmark to Nutchbase.
  • Add parse-html back.
  • MoreIndexingFilter missing date format.
  • Timeout for Parser.
  • Retry interval in crawl date is set to 0.
  • Generate log output for solr indexer and dedup.
  • Improved NutchConfiguration.
  • SolrDeleteDuplicates needs to clone the SolrRecord objects.
  • Native hadoop libs not available through maven.
  • Separate the build and runtime environments.

What is new in version 1.5:

  • This release brings several improvements: upgrades of major components such as Tika 1.1 and Hadoop 1.0.0, improvements to the LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering, and parsing, among others.

What is new in version 1.4:

  • Added Solr 4x (trunk) example schema.
  • Added '/runtime' to svn ignore.
  • Application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml.
  • Fixed parse-tika and parse-html to use relative URL resolution per RFC-3986.
  • Upgraded to Tika 0.10. NOTE: Tika's new RTF parser may ignore more text in malformed documents than previously - see TIKA-748 for details.
  • Added Sonar targets to Ant build.xml.
  • Upgraded SolrJ to version 3.4.0.
  • Ant pmd target is broken.
  • Upgraded Solr schema to version 1.4.
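
The relative URL resolution mentioned in the 1.4 notes can be illustrated with a short sketch. It does not use Nutch's own Java code. It relies on Python's standard urllib.parse.urljoin, and the base URL is the one used in the worked examples of RFC 3986, section 5.4. The resolve helper is a hypothetical name chosen for illustration.

```python
from urllib.parse import urljoin

# Base URL borrowed from the worked examples in RFC 3986, section 5.4.
BASE = "http://a/b/c/d;p?q"


def resolve(base: str, reference: str) -> str:
    """Resolve a possibly relative reference against a base URL.

    Hypothetical helper for illustration; it simply delegates to urljoin.
    """
    return urljoin(base, reference)


if __name__ == "__main__":
    # A few reference forms a crawler commonly meets in outlinks.
    for ref in ["g", "../g", "?y", "#s", "//g"]:
        print(f"{ref:6} -> {resolve(BASE, ref)}")
```

A query-only reference such as "?y" keeps the full base path and replaces only the query, which is one of the cases where resolvers following older rules can differ.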

What is new in version 1.3:

  • This release includes several improvements (improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball -- only about 2MB!).

What is new in version 1.2:

  • Make index-more plug-in configurable.
  • Configurable file protocol parent directory crawling.
  • Timeout for Parser.
  • Website is still Lucene branded.
  • Retry interval in crawl date is set to 0.

What is new in version 1.0:

  • Allow parsers to return multiple Parse objects.
  • Removed redundant commons-logging jar from ontology plugin.
  • Bug in SegmentReader causes infinite loop.
  • Scoring filter should distribute score to all outlinks at once.
  • Reduce number of warnings in nutch core.

