Free Download reppy for Linux ::: Software

reppy

Software Screenshot:

Software Details:

Version: 0.1.0

Upload Date: 11 May 15

Developer: Dan Lecocq

Distribution Type: Freeware

Downloads: 5

Download

Currently nan/5
1
2
3
4
5

Rating: nan/5 (Total Votes: 0)

reppy started out of a lack of memoization support in other robots.txt parsers encountered, and the lack of support for Crawl-delay and Sitemap in the built-in robotparser.

Matching

This package supports the 1996 RFC, as well as additional commonly-implemented features, like wildcard matching, crawl-delay, and sitemaps. There are varying approaches to matching Allow and Disallow. One approach is to use the longest match. Another is to use the most specific. This package chooses to follow the directive that is longest, the assumption being that it's the one that is most specific -- a term that is a little difficult to define in this context.

Usage

The easiest way to use reppy is to just ask if a url or urls is/are allowed:

import reppy
# This implicitly fetches example.com's robot.txt
reppy.allowed('http://example.com/howdy')
# => True
# Now, it's cached based on when it should expire (read more in `Expiration`)
reppy.allowed('http://example.com/hello')
# => True
# It also supports batch queries
reppy.allowed(['http://example.com/allowed1', 'http://example.com/allowed2', 'http://example.com/disallowed'])
# => ['http://example.com/allowed1', 'http://example.com/allowed2']
# Batch queries are even supported accross several domains (though fetches are not done in parallel)
reppy.allowed(['http://a.com/allowed', 'http://b.com/allowed', 'http://b.com/disallowed'])
# => ['http://a.com/allowed', 'http://b.com/allowed']

It's pretty easy to use. The default behavior is to fetch it for you with urllib2

import reppy
# Make a reppy object associated with a particular domain
r = reppy.fetch('http://example.com/robots.txt')

but you can just as easily parse a string that you fetched.

import urllib2
data = urllib2.urlopen('http://example.com/robots.txt').read()
r = reppy.parse(data)

Expiration

The main advantage of having reppy fetch the robots.txt for you is that it can automatically refetch after its data has expired. It's completely transparent to you, so you don't even have to think about it -- just keep using it as normal. Or, if you'd prefer, you can set your own time-to-live, which takes precedence:

import reppy
r = reppy.fetch('http://example.com/robots.txt')
r.ttl
# => 10800 (How long to live?)
r.expired()
# => False (Has it expired?)
r.remaining()
# => 10798 (How long until it expires)
r = reppy.fetch('http://example.com/robots.txt', ttl=1)
# Wait 2 seconds
r.expired()
# => True

Queries

Reppy tries to keep track of the host so that you don't have to. This is done automatically when you use fetch, or you can optionally provide the url you fetched it from with parse. Doing so allows you to provide just the path when querying. Otherwise, you must provide the whole url:

# This is doable
r = reppy.fetch('http://example.com/robots.txt')
r.allowed('/')
r.allowed(['/hello', '/howdy'])
# And so is this
data = urllib2.urlopen('http://example.com/robots.txt').read()
r = reppy.parse(data, url='http://example.com/robots.txt')
r.allowed(['/', '/hello', '/howdy'])
# However, we don't implicitly know which domain these are from
reppy.allowed(['/', '/hello', '/howdy'])

Crawl-Delay and Sitemaps

Reppy also exposes the non-RFC, but widely-used Crawl-Delay and Sitemaps attributes. The crawl delay is considered on a per-user agent basis, but the sitemaps are considered global. If they are not specified, the crawl delay is None, and sitemaps is an empty list. For example, if this is my robots.txt:

User-agent: *
Crawl-delay: 1
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap2.xml

Then these are accessible:

with file('myrobots.txt', 'r') as f:
r = reppy.parse(f.read())
r.sitemaps
# => ['http://example.com/sitemap.xml', 'http://example.com/sitemap2.xml']
r.crawlDelay
# => 1

User-Agent Matching

You can provide a user agent of your choosing for fetching robots.txt, and then the user agent string we match is defaulted to what appears before the first /. For example, if you provide the user agent as 'MyCrawler/1.0', then we'll use 'MyCrawler' as the string to match against User-agent. Comparisons are case-insensitive, and we do not support wildcards in User-Agent. If this default doesn't suit you, you can provide an alternative:

# This will match against 'myuseragent' by default
r = reppy.fetch('http://example.com/robots.txt', userAgent='MyUserAgent/1.0')
# This will match against 'someotheragent' instead
r = reppy.fetch('http://example.com/robots.txt', userAgent='MyUserAgent/1.0', userAgentString='someotheragent')

Path-Matching

Path matching supports both * and $

Features:

Memoization of fetched robots.txt
Expiration taken from the Expires header
Batch queries
Configurable user agent for fetching robots.txt
Automatic refetching basing on expiration
Support for Crawl-delay
Support for Sitemaps
Wildcard matching

Search by Category

reppy

Other Software of Developer Dan Lecocq

aws-trade-in

asis

Comments to reppy

Comments not found

Add Comment

Search by Category

Search by Category

Popular software

Elive 20 Jan 18

antiX MX 1 Dec 17

Elastix 2 Oct 16

LaTeX::BibTeX 14 Apr 15

Kismet 17 Feb 15

Tiny Core Linux 2 Sep 17

Android-x86 22 Jun 18

reppy

Other Software of Developer Dan Lecocq

aws-trade-in

asis

Comments to reppy

Comments not found

Add Comment

Search by Category

Popular software

AirSnort 3 Jun 15

Xandros Desktop OS 3 Jun 15

DHIS 2 17 Feb 15

CrossOver 16 Aug 18

pfSense 22 Jun 18

Zuma Deluxe 20 Feb 15

PlayOnLinux 9 Dec 15