Apache Nutch

Software Details:
Version: 2.3
Upload Date: 1 Mar 15
Distribution Type: Freeware

Apache Nutch is built on top of Apache Lucene, a powerful Java search library.

Nutch developers adapted the data-agnostic Lucene codebase into a project dedicated specifically to searching data on the Web.

It can serve as a built-in search server for your own Web pages, or crawl the Web for data to parse and scrape into your database.
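
To give a rough idea of what "crawling the Web for data to parse" involves, here is a minimal sketch of a breadth-first crawl loop in Python. It is not Nutch code and does not use any Nutch API; the names `LinkExtractor`, `crawl`, `fetch` and the in-memory `PAGES` dictionary are all hypothetical, chosen so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=10):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}              # URLs already queued, to avoid revisits
    frontier = deque([seed])   # URLs waiting to be fetched
    visited = []               # URLs fetched successfully, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
}

order = crawl("http://example.com/", PAGES.get)
print(order)
```

A real crawler adds much more on top of this loop, such as politeness delays, robots.txt handling, scoring and persistent storage.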

Nutch can run on a single machine, but it scales better on a Hadoop cluster.

Various plugins are available to extend its functionality.

What is new in this release:

  • Ensure duplicate tags do not exist in microformat-reltag tag set.
  • A better fallback value for the date field.
  • Get rid of the dreaded.
  • Upgrade to Hadoop 1.2.0.
  • Upgrade to Tika 1.3.

What is new in version 2.0:

  • Renamed HTMLParseFilter to ParseFilter.
  • Remove remaining robots/IP blocking code in lib-http.
  • Port logging to slf4j.
  • External parser supports encoding attribute.
  • Ivy configuration settings don't include Gora.
  • Injector should add the metadata before calling injectedScore.
  • Port Nutch benchmark to Nutchbase.
  • Add parse-html back.
  • MoreIndexingFilter missing date format.
  • Timeout for Parser.
  • Retry interval in crawl date is set to 0.
  • Generate log output for solr indexer and dedup.
  • Improved NutchConfiguration.
  • SolrDeleteDuplicates needs to clone the SolrRecord objects.
  • Native hadoop libs not available through maven.
  • Separate the build and runtime environments.

What is new in version 1.5:

  • This release brings upgrades to several major components, including Tika 1.1 and Hadoop 1.0.0, improvements to the LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering and parsing, to name a few.

What is new in version 1.4:

  • Added Solr 4x (trunk) example schema.
  • Added '/runtime' to svn ignore.
  • The application/xhtml+xml MIME type should be enabled in plugin.xml of parse-html; allow multiple MIME types in plugin.xml.
  • Fixed parse-tika and parse-html to use relative URL resolution per RFC-3986.
  • Upgraded to Tika 0.10. NOTE: Tika's new RTF parser may ignore more text in malformed documents than previously - see TIKA-748 for details.
  • Added Sonar targets to Ant build.xml.
  • Upgraded SolrJ to version 3.4.0.
  • Ant pmd target is broken.
  • Upgraded Solr schema to version 1.4.
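
The version 1.4 notes above mention relative URL resolution per RFC-3986. As an illustration of what that resolution looks like, the sketch below runs a few of the reference examples from RFC 3986 (section 5.4.1) through Python's standard `urllib.parse.urljoin`. This is not the parse-tika or parse-html code, only a stand-in showing the expected behavior; the variable names are made up for the example.

```python
from urllib.parse import urljoin

# Base URI used by the examples in RFC 3986, section 5.4.1.
BASE = "http://a/b/c/d;p?q"

# Relative references taken from the same section of the RFC.
examples = ["g", "./g", "../g", "/g", "?y", "#s", "//g"]

# Resolve each reference against the base URI.
resolved = {ref: urljoin(BASE, ref) for ref in examples}

for ref, target in resolved.items():
    print(f"{ref!r:8} -> {target}")
```

Note that a query-only reference such as `?y` keeps the full base path (`http://a/b/c/d;p?y`), a case that older resolution rules handled differently.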

What is new in version 1.3:

  • This release includes several improvements (improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball -- only about 2MB!).

What is new in version 1.2:

  • Make index-more plug-in configurable.
  • Configurable file protocol parent directory crawling.
  • Timeout for Parser.
  • Website is still Lucene branded.
  • Retry interval in crawl date is set to 0.

What is new in version 1.0:

  • Allow parsers to return multiple Parse objects.
  • Removed redundant commons-logging jar from ontology plugin.
  • Bug in SegmentReader causes infinite loop.
  • Scoring filter should distribute score to all outlinks at once.
  • Reduce number of warnings in nutch core.
