reppy grew out of the lack of memoization support in the other robots.txt parsers we encountered, and the lack of support for Crawl-delay and Sitemap in the built-in robotparser.
Matching
This package supports the 1996 RFC, as well as additional commonly-implemented features, like wildcard matching, crawl-delay, and sitemaps. There are varying approaches to matching Allow and Disallow. One approach is to use the longest match. Another is to use the most specific. This package chooses to follow the directive that is longest, the assumption being that it's the one that is most specific -- a term that is a little difficult to define in this context.
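To make the longest-match rule concrete, here is a minimal sketch. It is not reppy's implementation: the function name `longest_match_allowed`, the `(directive, prefix)` rule format, and the choice to let the first rule win a tie are all assumptions made for illustration, and it handles plain prefixes only (no wildcards).

```python
def longest_match_allowed(path, rules):
    """Return True if `path` is allowed under a longest-match policy.

    `rules` is a list of (directive, prefix) pairs, where directive is
    'allow' or 'disallow'. This format is hypothetical, not reppy's own.
    """
    best = None
    for directive, prefix in rules:
        # An empty prefix (e.g. a bare "Disallow:") matches nothing here.
        if prefix and path.startswith(prefix):
            # Keep the rule with the longest matching prefix.
            if best is None or len(prefix) > len(best[1]):
                best = (directive, prefix)
    # With no matching rule at all, the path is allowed.
    return best is None or best[0] == 'allow'


rules = [('disallow', '/private'), ('allow', '/private/public')]
print(longest_match_allowed('/private/public/page', rules))
print(longest_match_allowed('/private/secret', rules))
```

The first query is allowed because `/private/public` is a longer match than `/private`; the second falls under the shorter Disallow rule only.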
Usage
The easiest way to use reppy is to ask whether a URL, or a list of URLs, is allowed:
import reppy
# This implicitly fetches example.com's robots.txt
reppy.allowed('http://example.com/howdy')
# => True
# Now, it's cached based on when it should expire (read more in `Expiration`)
reppy.allowed('http://example.com/hello')
# => True
# It also supports batch queries
reppy.allowed(['http://example.com/allowed1', 'http://example.com/allowed2', 'http://example.com/disallowed'])
# => ['http://example.com/allowed1', 'http://example.com/allowed2']
# Batch queries are even supported across several domains (though fetches are not done in parallel)
reppy.allowed(['http://a.com/allowed', 'http://b.com/allowed', 'http://b.com/disallowed'])
# => ['http://a.com/allowed', 'http://b.com/allowed']
It's pretty easy to use. The default behavior is to fetch robots.txt for you with urllib2:
import reppy
# Make a reppy object associated with a particular domain
r = reppy.fetch('http://example.com/robots.txt')
but you can just as easily parse a string that you fetched yourself:
import urllib2
data = urllib2.urlopen('http://example.com/robots.txt').read()
r = reppy.parse(data)
Expiration
The main advantage of having reppy fetch the robots.txt for you is that it can automatically refetch after its data has expired. It's completely transparent to you, so you don't even have to think about it -- just keep using it as normal. Or, if you'd prefer, you can set your own time-to-live, which takes precedence:
import reppy
r = reppy.fetch('http://example.com/robots.txt')
r.ttl
# => 10800 (How long to live?)
r.expired()
# => False (Has it expired?)
r.remaining()
# => 10798 (How long until it expires)
r = reppy.fetch('http://example.com/robots.txt', ttl=1)
# Wait 2 seconds
r.expired()
# => True
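The expiration bookkeeping can be sketched roughly as follows. This is an illustration under stated assumptions, not reppy's code: `ttl_from_expires`, `CacheEntry`, and the `default_ttl` fallback of 3600 seconds are hypothetical, and the current time is passed in explicitly so the logic is easy to follow and test.

```python
from email.utils import parsedate_to_datetime


def ttl_from_expires(expires_header, now, default_ttl=3600):
    """Seconds between `now` (a Unix timestamp) and the Expires header.

    Falls back to `default_ttl` (an assumed value) when the header is
    missing or cannot be parsed.
    """
    if not expires_header:
        return default_ttl
    try:
        expires = parsedate_to_datetime(expires_header).timestamp()
    except (TypeError, ValueError):
        return default_ttl
    return max(0, int(expires - now))


class CacheEntry(object):
    """Remembers when a robots.txt was fetched and how long it may live."""

    def __init__(self, ttl, fetched_at):
        self.ttl = ttl
        self.fetched_at = fetched_at

    def remaining(self, now):
        # Never report a negative lifetime.
        return max(0, self.ttl - (now - self.fetched_at))

    def expired(self, now):
        return self.remaining(now) <= 0
```

A caller that supplies its own time-to-live would simply construct the entry with that value instead of the one derived from the header, which is how an explicit `ttl` can take precedence.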
Queries
Reppy tries to keep track of the host so that you don't have to. This is done automatically when you use fetch, or you can optionally provide the url you fetched it from with parse. Doing so allows you to provide just the path when querying. Otherwise, you must provide the whole url:
# This is doable
r = reppy.fetch('http://example.com/robots.txt')
r.allowed('/')
r.allowed(['/hello', '/howdy'])
# And so is this
data = urllib2.urlopen('http://example.com/robots.txt').read()
r = reppy.parse(data, url='http://example.com/robots.txt')
r.allowed(['/', '/hello', '/howdy'])
# However, we don't implicitly know which domain these are from
reppy.allowed(['/', '/hello', '/howdy'])
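One way to picture the host tracking is as URL resolution against the location of the robots.txt. The sketch below uses only the standard library; the helper name `resolve` and the decision to raise `ValueError` when no host is known are assumptions for illustration, and reppy's actual behavior in that case may differ.

```python
from urllib.parse import urljoin, urlparse


def resolve(query, robots_url=None):
    """Turn a path-only query into a full URL using the robots.txt URL."""
    if urlparse(query).netloc:
        # Already a full URL, so the host is known.
        return query
    if robots_url is None:
        raise ValueError('cannot resolve %r without a known host' % query)
    return urljoin(robots_url, query)


print(resolve('/hello', 'http://example.com/robots.txt'))
```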
Crawl-Delay and Sitemaps
Reppy also exposes the non-RFC, but widely-used Crawl-Delay and Sitemaps attributes. The crawl delay is considered on a per-user agent basis, but the sitemaps are considered global. If they are not specified, the crawl delay is None, and sitemaps is an empty list. For example, if this is my robots.txt:
User-agent: *
Crawl-delay: 1
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap2.xml
Then these are accessible:
with open('myrobots.txt', 'r') as f:
    r = reppy.parse(f.read())
r.sitemaps
# => ['http://example.com/sitemap.xml', 'http://example.com/sitemap2.xml']
r.crawlDelay
# => 1
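For readers curious how per-agent crawl delays and global sitemaps might be pulled out of a file, here is a simplified standalone sketch. It is not reppy's parser: `parse_extras` is a hypothetical helper, it ignores Allow and Disallow rules entirely, and falling back to the `*` group when the requested agent has no delay is an assumption.

```python
def parse_extras(text, agent='*'):
    """Return (crawl_delay, sitemaps) for `agent` from robots.txt text."""
    sitemaps, delays = [], {}
    group, group_open = [], False
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments
        if ':' not in line:
            continue
        key, value = (part.strip() for part in line.split(':', 1))
        key = key.lower()
        if key == 'sitemap':
            # Sitemaps are global: they belong to no user-agent group.
            sitemaps.append(value)
        elif key == 'user-agent':
            # Consecutive User-agent lines share one group of rules.
            if not group_open:
                group, group_open = [], True
            group.append(value.lower())
        else:
            group_open = False
            if key == 'crawl-delay':
                try:
                    delay = float(value)
                except ValueError:
                    continue
                for name in group:
                    delays[name] = delay
    return delays.get(agent.lower(), delays.get('*')), sitemaps


robots = """User-agent: *
Crawl-delay: 1
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap2.xml
"""
print(parse_extras(robots))
```

When neither directive is present, the sketch returns `None` and an empty list, mirroring the defaults described above.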
User-Agent Matching
You can provide a user agent of your choosing for fetching robots.txt. By default, the string we match against User-agent lines is whatever appears before the first /. For example, if you provide the user agent as 'MyCrawler/1.0', then we'll use 'MyCrawler' as the string to match against User-agent. Comparisons are case-insensitive, and we do not support wildcards in User-Agent. If this default doesn't suit you, you can provide an alternative:
# This will match against 'myuseragent' by default
r = reppy.fetch('http://example.com/robots.txt', userAgent='MyUserAgent/1.0')
# This will match against 'someotheragent' instead
r = reppy.fetch('http://example.com/robots.txt', userAgent='MyUserAgent/1.0', userAgentString='someotheragent')
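The default derivation described above amounts to a split on the first slash plus lowercasing. A minimal sketch follows; the function name `match_token` and its `override` parameter are hypothetical stand-ins for the behavior of `userAgent` and `userAgentString`, not reppy's internals.

```python
def match_token(user_agent, override=None):
    """Return the lowercased string to compare against User-agent lines."""
    if override is not None:
        # An explicit override replaces the derived token.
        return override.lower()
    # Default: everything before the first '/', compared case-insensitively.
    return user_agent.split('/', 1)[0].strip().lower()


print(match_token('MyUserAgent/1.0'))
print(match_token('MyUserAgent/1.0', override='someotheragent'))
```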
Path-Matching
Path matching supports both * (which matches any sequence of characters) and $ (which anchors the pattern to the end of the path).
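One common way to implement this kind of matching is to translate each robots.txt pattern into a regular expression. The sketch below takes that approach; it is an illustration rather than reppy's own code, and the names `pattern_to_regex` and `path_matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any characters.
    body = '.*'.join(re.escape(piece) for piece in pattern.split('*'))
    # Without '$' the pattern is a prefix match; with it, a full match.
    return re.compile('^' + body + (r'\Z' if anchored else ''))


def path_matches(pattern, path):
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches('/*.php$', '/index.php'))
print(path_matches('/*.php$', '/index.php?page=1'))
```

The first call matches because the path ends in `.php`; the second does not, since the query string follows the point where `$` requires the path to end.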
Features:
- Memoization of fetched robots.txt
- Expiration taken from the Expires header
- Batch queries
- Configurable user agent for fetching robots.txt
- Automatic refetching based on expiration
- Support for Crawl-delay
- Support for Sitemaps
- Wildcard matching
Requirements:
- Python