Methabot

Software Details:
Version: 1.6.0.1
Upload Date: 3 Jun 15
Developer: Emil Romanus
Distribution Type: Freeware
Downloads: 9

Rating: n/a (Total Votes: 0)

The Methabot software is a speed-optimized, scriptable and highly configurable web, FTP and local file system crawler. It supports scripted filetype parsing and a wide variety of customization options, and is easily configured to fit anyone's particular needs.

With the module system and scripting language, users are able to take full or partial control of the crawling process and decide how Methabot should store web data, statistics and much more.

Just by running Methabot from the command line you are able to configure custom filetypes, filtering expressions, behaviour and much more, so you don't have to be a scripter!
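
For example, a basic crawl can be started straight from the shell. The sketch below is only an illustration: it assumes the executable is installed as "methabot" and uses the --silent, --base-url and --type options mentioned in the changelog further down this page; the exact argument syntax is an assumption, so check the built-in help or the --examples output for the authoritative forms.

    # Hedged sketch (assumed syntax): crawl a site starting from a base URL,
    # treat the start page as the "html" filetype and suppress verbose output.
    methabot --silent --base-url http://example.com/ --type html

    # The example commands built into the program show real invocations:
    methabot --examples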

Features:

  • It's fast, designed from the ground up with speed optimization in mind.
  • Scriptable through JavaScript with E4X
  • User-defined filetype filtering (according to MIME type, file extension or UMEX expression)
  • Multi-threaded
  • Highly configurable from command line
  • Extensible module system, supporting custom data parsers and filters.
  • Simple yet powerful filtering of URLs through UMEX.
  • Automated downloading
  • Support for automatic cookie handling when running over HTTP
  • Reliable, fault-tolerant networking
  • Portable, tested with success on 32-bit/64-bit Linux 2.6, 32-bit/64-bit FreeBSD 6.x/7.0, Windows XP and Mac OS X. Should work on almost any Unix-like OS.

What is new in this release:

  • Bugfix, the depth limit was handled incorrectly when external-peek was used.
  • Memory usage cleanup fixes
  • dynamic-url option is no longer set to lookup by default, since it slows down the crawling significantly
  • Build system now creates and installs some header files that modules can use when linking
  • metha-config tool added
  • lmm_mysql moved outside of this package

What is new in version 1.5.0:

  • Changes and new features:
  • Support for reading initial buffer from stdin
  • --type and --base-url command line options added, along with the initial_filetype option in configuration files
  • Cookies and DNS info are now properly shared between workers when running multithreaded
  • Added some example usage commands to --examples
  • Big improvements to the inter-thread communication, now faster and more organized
  • Added support for 'init' functions to scripts. Read more about init functions at http://bithack.se/projects/methabot/docs/e4x/init_functions.html
  • libmetha no longer freezes when doing multiple concurrent HTTP HEAD requests. The freezes were caused by a bug in libcurl which is now fixed; workarounds have also been added to libmetha to prevent the freezes from occurring with the defective libcurl versions as well.
  • Support for older libcurl versions 7.17.x and 7.16.x
  • New information is available in the "this" object of javascript parsers, content-type and transfer status code. Read more at http://bithack.se/projects/methabot/docs/e4x/this.html
  • --verbose option replaced with --silent, since verbose mode is now default
  • Initial support for FTP crawling and the ftp_dir_url crawler option
  • Depth limiting is now crawler-specific
  • Added the command line options --crawler and --filetype
  • Support for extending and overriding already defined crawlers and filetypes
  • Support for the copy keyword in configuration files
  • Support for dynamically switching the active crawler, which lets you crawl different websites in completely different ways in one crawling session. Read more about crawler switching at http://bithack.se/projects/methabot/docs/crawler_switching.html
  • libev version upgrade to 3.51
  • The include directive in configuration files now makes sure the included configuration file hasn't already been loaded, to prevent include-loops and multiple filetype/crawler definitions.
  • Various SpiderMonkey garbage collection fixes; libmetha no longer crashes when cleaning up after a multithreaded session
  • Added some extra information to the --info option
  • The 'external' option is now fixed and enabled again
  • New option --spread-workers
  • New libmetha API function lmetha_global_setopt() allows changing the global error/message/warning reporter
  • Added initial implementation of a test suite for developers
  • Better error reporting when loading configuration files
  • Bugfix when an HTTP server didn't return a Content-Type header after a HEAD request
  • Bugfix when sorting URLs after multiple HTTP HEAD requests
  • Bugfix in the html to xml converter when the HTML page did not have an <html> tag
  • Bugfix, the extless-url option did not work
  • Bugfix, html to xml converter no longer chokes on byte-order marks or other text before the actual HTML
  • Bugfix, prevented libmetha from trying to access URLs of protocols that are not supported
  • Bugfix when shutting down after an error.
  • Bugfix, unresolvable URLs did not break out of the retry loop after three retries
  • Very experimental and unstable support for Win32, mainly intended for developers
  • New configuration files:
  • google.conf, performs Google searches
  • youtube.conf, performs YouTube searches
  • meta.conf, prints meta information such as keywords and description for HTML pages
  • title.conf, prints the title of HTML pages
  • ftp.conf, crawls FTP servers

What is new in version 1.4.1:

  • Configure could not find jsapi.h on some systems; this should be fixed now.
  • Configuration files are now able to modify crawler and filetype flags; the 'external' and 'external_peek' options were added
  • Bugfix, Methabot would sometimes crash when cleaning up empty URLs after multiple HTTP HEAD requests
  • Fixed a crash that occurred when running synchronously.
  • Build system include fix when jsconfig.h could not be found.

Requirements:

  • SpiderMonkey headers
  • cURL
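
Since the changelog above refers to a Configure script and to a build system that installs module headers and the metha-config tool, a conventional source build presumably looks roughly like the sketch below; the exact steps are an assumption, so the project's own build instructions take precedence.

    # Hedged sketch of a typical source build (autotools-style steps assumed).
    # The SpiderMonkey headers (jsapi.h) and cURL must be installed first.
    ./configure
    make
    make install   # installs the metha-config tool and the module headers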

Similar Software

  • Web-FTP (3 Jun 15)
  • Gistpy (20 Feb 15)
  • MindTerm (14 Apr 15)
  • Serv-U (14 Apr 15)
