Trying to run the code piece by piece led me to the conclusion that I should first read the entire Scrapy documentation. It contains many working examples. I still know Python poorly, so it makes more sense to study working code with good documentation than to puzzle over exercises by unknown authors.
In [1]:
%load /media/usb0/w8/GitHub_2/scrapy-proxies-master/scrapy-proxies-master/README.md
In []:
Random proxy middleware for Scrapy (http://scrapy.org/)
=======================================================
Processes Scrapy requests using a random proxy from a list, to avoid IP bans and
improve crawling speed.
Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste into text file
and reformat to http://host:port format)
settings.py
-----------
    # Retry many times since proxies often fail
    RETRY_TIMES = 10
    # Retry on most error codes since proxies fail for different reasons
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
        # Fix path to this module
        'yourspider.randomproxy.RandomProxy': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    }

    # Proxy list containing entries like
    # http://host1:port
    # http://username:password@host2:port
    # http://host3:port
    # ...
    PROXY_LIST = '/path/to/proxy/list.txt'
Your spider
-----------
In each callback ensure that the proxy /really/ returned your target page by
checking for the site logo or some other significant element.
If not, retry the request with dont_filter=True:
    if not hxs.select('//get/site/logo'):
        yield Request(url=response.url, dont_filter=True)
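A minimal sketch of that "verify the page, otherwise re-queue it" pattern as a whole spider. Only the dont_filter=True retry idea comes from the README; the spider name, URL and XPath are made-up placeholders, and Selector(response).xpath is simply the newer spelling of the README's hxs.select check.
In []:
# Sketch only: re-schedule the same URL when the expected element is missing.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector

class CheckLogoSpider(Spider):
    name = 'checklogo'                      # hypothetical name
    start_urls = ['http://example.com/']    # hypothetical start URL

    def parse(self, response):
        # If the logo is missing, the proxy probably served a ban/captcha
        # page: re-queue the URL, bypassing the duplicate filter so Scrapy
        # downloads it again (through another random proxy).
        if not Selector(response).xpath('//img[@id="logo"]'):
            yield Request(response.url, dont_filter=True)
            return
        # ... normal parsing goes here ...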
In [2]:
%load '/media/usb0/w8/GitHub_2/scrapy-proxies-master/scrapy-proxies-master/randomproxy.py'
In []:
# Copyright (C) 2013 by Aivars Kalvans <aivars.kalvans@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import re
import random
import base64
from scrapy import log
class RandomProxy(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')
        fin = open(self.proxy_list)

        self.proxies = {}
        for line in fin.readlines():
            parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)

            # Cut trailing @ from the credentials group, if present
            if parts.group(2):
                user_pass = parts.group(2)[:-1]
            else:
                user_pass = ''
            self.proxies[parts.group(1) + parts.group(3)] = user_pass

        fin.close()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Don't overwrite with a random one (server-side state for IP)
        if 'proxy' in request.meta:
            return

        proxy_address = random.choice(self.proxies.keys())
        proxy_user_pass = self.proxies[proxy_address]

        request.meta['proxy'] = proxy_address
        if proxy_user_pass:
            basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass)
            request.headers['Proxy-Authorization'] = basic_auth

    def process_exception(self, request, exception, spider):
        proxy = request.meta['proxy']
        log.msg('Removing failed proxy <%s>, %d proxies left' % (
            proxy, len(self.proxies)))
        try:
            del self.proxies[proxy]
        except KeyError:  # deleting from a dict raises KeyError, not ValueError
            pass
In [3]:
import re
import random
import base64
from scrapy import log
In [4]:
fin=['http://host1:port','http://username:password@host2:port','http://host3:port']
In [6]:
proxies = {}
In [14]:
for line in fin:
    parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)
    print parts.group(0)
    print ' 1=', parts.group(1)
    print ' 2=', parts.group(2)
In []:
# Cut trailing @ (rest of the parsing loop, not executed here)
if parts.group(2):
    user_pass = parts.group(2)[:-1]
else:
    user_pass = ''
proxies[parts.group(1) + parts.group(3)] = user_pass
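Putting the pieces together, the whole parsing loop can be run on the sample list above; the result is a dict keyed by scheme + host:port with the credentials (if any) as the value. A small self-contained recap:
In []:
# Self-contained recap of the experiment: build the proxies dict the same
# way RandomProxy.__init__ does, from the sample list used above.
import re

fin = ['http://host1:port',
       'http://username:password@host2:port',
       'http://host3:port']

proxies = {}
for line in fin:
    parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)
    if parts.group(2):
        user_pass = parts.group(2)[:-1]   # cut the trailing '@'
    else:
        user_pass = ''
    proxies[parts.group(1) + parts.group(3)] = user_pass

print proxies
# expected contents (dict order may vary):
# {'http://host1:port': '', 'http://host2:port': 'username:password',
#  'http://host3:port': ''}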
In [15]:
import scrapy
In [19]:
help(scrapy)
In []:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware
The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy
DOWNLOADER_MIDDLEWARES_BASE
{
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}
And in the example code the order is overridden:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    # Fix path to this module
    'yourspider.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
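The two dicts are merged by key, and the numbers only decide the order within the merged result; the docs also say a built-in middleware can be switched off by assigning it None. A hedged settings.py sketch of both points (disabling CookiesMiddleware here is only an illustration of the None rule, not advice for this project):

# settings.py sketch: custom middleware added, one built-in disabled.
DOWNLOADER_MIDDLEWARES = {
    'yourspider.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    # None removes the middleware from the merged DOWNLOADER_MIDDLEWARES_BASE
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': None,
}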
Documentation: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware#module-scrapy.contrib.downloadermiddleware.httpproxy
HttpProxyMiddleware
New in version 0.8.
class scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
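So there are two ways to feed this middleware: let it pick up the standard proxy environment variables, or set request.meta['proxy'] yourself, which is exactly what RandomProxy does (at a lower priority number, so it runs first). A small sketch of the per-request variant; the spider name, URL and proxy address are made-up placeholders:

# Per-request proxy via the same 'proxy' meta key that
# HttpProxyMiddleware and RandomProxy work with.
from scrapy.spider import Spider
from scrapy.http import Request

class OneProxySpider(Spider):
    name = 'oneproxy'

    def start_requests(self):
        yield Request('http://example.com/',
                      meta={'proxy': 'http://10.0.0.1:3128'})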
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware#module-scrapy.contrib.downloadermiddleware.retry
RetryMiddleware
class scrapy.contrib.downloadermiddleware.retry.RetryMiddleware
A middleware to retry failed requests that are potentially caused by temporary problems such as a connection timeout or HTTP 500 error.
Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.
The RetryMiddleware can be configured through the following settings (see the settings documentation for more info):
RETRY_ENABLED
RETRY_TIMES
RETRY_HTTP_CODES
About HTTP errors to consider:
You may want to remove 400 from RETRY_HTTP_CODES, if you stick to the HTTP protocol. It’s included by default because it’s a common code used to indicate server overload, which would be something we want to retry.
If Request.meta contains the dont_retry key, the request will be ignored by this middleware.
RetryMiddleware Settings
RETRY_ENABLED
New in version 0.13.
Default: True
Whether the Retry middleware will be enabled.
RETRY_TIMES
Default: 2
Maximum number of times to retry, in addition to the first download.
RETRY_HTTP_CODES
Default: [500, 502, 503, 504, 400, 408]
Which HTTP response codes to retry. Other errors (DNS lookup issues, connections lost, etc) are always retried.
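These are exactly the knobs the README turns: it raises RETRY_TIMES to 10 and extends RETRY_HTTP_CODES, because with flaky proxies almost every error class is worth another attempt. A single request can also opt out via the dont_retry meta key. A short settings sketch (the values are illustrative, not a recommendation):

# settings.py sketch: the retry settings discussed above.
RETRY_ENABLED = True
RETRY_TIMES = 10                                   # default is 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]  # default list from the docs

# and in a spider, one request can skip retrying entirely:
# yield Request(url, meta={'dont_retry': True})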
C:\Users\kiss\Documents\GitHub\scrapy\scrapy\contrib\downloadermiddleware\httpproxy.py
import base64
from urllib import getproxies, unquote, proxy_bypass
from urllib2 import _parse_proxy
from urlparse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured


class HttpProxyMiddleware(object):

    def __init__(self):
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

        if not self.proxies:
            raise NotConfigured

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user and password:
            user_pass = '%s:%s' % (unquote(user), unquote(password))
            creds = base64.b64encode(user_pass).strip()
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = 'Basic ' + creds
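To see what _get_proxy is working with, urllib2's private _parse_proxy helper can be called directly; it splits a proxy URL into (scheme, user, password, host:port). A quick check with a made-up address (private API, so this is just an exploratory poke, not a supported interface):

# What _parse_proxy returns for a proxy URL with credentials.
from urllib2 import _parse_proxy

print _parse_proxy('http://joe:secret@10.0.0.1:3128')
# expected: ('http', 'joe', 'secret', '10.0.0.1:3128')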
"""
ОтветитьУдалитьAn extension to retry failed requests that are potentially caused by temporary
problems such as a connection timeout or HTTP 500 error.
You can change the behaviour of this middleware by modifying the scraping settings:
RETRY_TIMES - how many times to retry a failed page
RETRY_HTTP_CODES - which HTTP response codes to retry
Failed pages are collected on the scraping process and rescheduled at the end,
once the spider has finished crawling all regular (non failed) pages. Once
there are no more failed pages to retry, this middleware sends a signal
(retry_complete), so other extensions could connect to that signal.
About HTTP errors to consider:
- You may want to remove 400 from RETRY_HTTP_CODES, if you stick to the HTTP
protocol. It's included by default because it's a common code used to
indicate server overload, which would be something we want to retry
"""
from twisted.internet.defer import TimeoutError as UserTimeoutError
from twisted.internet.error import TimeoutError as ServerTimeoutError, \
        DNSLookupError, ConnectionRefusedError, ConnectionDone, ConnectError, \
        ConnectionLost, TCPTimedOutError

from scrapy import log
from scrapy.exceptions import NotConfigured
from scrapy.utils.response import response_status_message
from scrapy.xlib.tx import ResponseFailed


class RetryMiddleware(object):

    # IOError is raised by the HttpCompression middleware when trying to
    # decompress an empty response
    EXCEPTIONS_TO_RETRY = (ServerTimeoutError, UserTimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, ResponseFailed,
                           IOError)

    def __init__(self, settings):
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        if 'dont_retry' in request.meta:
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and 'dont_retry' not in request.meta:
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request,
                    retries=retries, reason=reason)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request,
                    retries=retries, reason=reason)
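From the code it is clear that every retried request carries a retry_times counter in its meta, and that dont_retry disables the middleware per request; both can be used from a spider. A hedged sketch (spider name, URLs and callback logic are made-up placeholders):

# Reading the retry counter and opting out of retries for one request.
from scrapy.spider import Spider
from scrapy.http import Request

class RetryAwareSpider(Spider):
    name = 'retryaware'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # how many times RetryMiddleware re-scheduled this request
        retries = response.meta.get('retry_times', 0)
        self.log('downloaded %s after %d retries' % (response.url, retries))
        # and a request that the middleware must never retry:
        yield Request('http://example.com/unstable', meta={'dont_retry': True})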
http://doc.scrapy.org/en/latest/topics/settings.html#topics-settings
The Scrapy settings allows you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).
Populating the settings
Settings can be populated using different mechanisms, each of which having a different precedence. Here is the list of them in decreasing order of precedence:
Global overrides (most precedence)
Project settings module
Default settings per-command
Default global settings (less precedence)
These mechanisms are described in more detail below.
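In the code above the settings object always arrives through from_crawler(crawler) as crawler.settings; outside a running crawler the project settings can also be loaded explicitly. A sketch of both ways of reading the PROXY_LIST key (assumes the script runs inside a Scrapy project so that settings.py can be found):

# 1) Inside a middleware/extension: Scrapy hands the settings over itself.
class SomeExtension(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

# 2) In a standalone script: load the project settings explicitly.
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
print settings.get('PROXY_LIST')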
In what order Python searches for modules
https://docs.python.org/2/tutorial/modules.html#the-module-search-path
6.1.2. The Module Search Path
When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:
the directory containing the input script (or the current directory).
PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
the installation-dependent default.
After initialization, Python programs can modify sys.path.
The directory containing the script being run is placed at the beginning of the search path, ahead of the standard library path.
This means that scripts in that directory will be loaded instead of modules of the same name in the library directory.
This is an error unless the replacement is intended. See section Standard Modules for more information.
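This matters for the 'yourspider.randomproxy.RandomProxy' entry in DOWNLOADER_MIDDLEWARES: the randomproxy module has to be importable from the project. The search path is easy to inspect and, if needed, extend; the extra directory below is just a placeholder:

# Where Python will look for modules such as yourspider.randomproxy.
import sys

for p in sys.path[:5]:
    print p

# the path can also be extended at runtime (placeholder directory):
sys.path.insert(0, '/path/to/extra/modules')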
What is @classmethod
https://docs.python.org/2/glossary.html?highlight=classmethod
decorator
A function returning another function, usually applied as a function transformation using the @wrapper syntax. Common examples for decorators are classmethod() and staticmethod().
It is merely syntactic sugar:
The decorator syntax is merely syntactic sugar, the following two function definitions are semantically equivalent:
def f(...):
    ...
f = staticmethod(f)

@staticmethod
def f(...):
    ...
The same concept exists for classes, but is less commonly used there. See the documentation for function definitions and class definitions for more about decorators.
https://docs.python.org/2/library/functions.html?highlight=classmethod#classmethod
classmethod(function)
Return a class method for function.
A class method receives the class as implicit first argument, just like an instance method receives the instance. To declare a class method, use this idiom:
class C(object):
    @classmethod
    def f(cls, arg1, arg2, ...):
        ...
The @classmethod form is a function decorator – see the description of function definitions in Function definitions for details.
It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class. If a class method is called for a derived class, the derived class object is passed as the implied first argument.
Class methods are different than C++ or Java static methods. If you want those, see staticmethod() in this section.
For more information on class methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.
New in version 2.2.
Changed in version 2.4: Function decorator syntax added.
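This is exactly the role from_crawler plays in the middlewares above: an alternative constructor that receives the class itself (cls) and builds an instance from a crawler/settings object. The same pattern reduced to plain Python (FakeSettings is a made-up stub for the example):

# A classmethod used as an alternative constructor, like from_crawler.
class FakeSettings(object):
    def get(self, name):
        return '/path/to/proxy/list.txt'

class MyMiddleware(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')

    @classmethod
    def from_settings(cls, settings):
        # cls is MyMiddleware itself (or a subclass), so subclasses
        # constructed this way get the right type automatically
        return cls(settings)

print MyMiddleware.from_settings(FakeSettings()).proxy_list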
What does the following mean?
def func(*args, **kwargs): ...
https://docs.python.org/2/glossary.html?highlight=kwargs
parameter
A named entity in a function (or method) definition that specifies an argument (or in some cases, arguments) that the function can accept. There are four types of parameters:
positional-or-keyword: specifies an argument that can be passed either positionally or as a keyword argument. This is the default kind of parameter, for example foo and bar in the following:
def func(foo, bar=None): ...
positional-only: specifies an argument that can be supplied only by position. Python has no syntax for defining positional-only parameters. However, some built-in functions have positional-only parameters (e.g. abs()).
var-positional: specifies that an arbitrary sequence of positional arguments can be provided (in addition to any positional arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with *, for example args in the following:
def func(*args, **kwargs): ...
var-keyword: specifies that arbitrarily many keyword arguments can be provided (in addition to any keyword arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with **, for example kwargs in the example above.
Parameters can specify both optional and required arguments, as well as default values for some optional arguments.
See also the argument glossary entry, the FAQ question on the difference between arguments and parameters, and the Function definitions section.
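A quick sanity check of how the two parameter kinds collect arguments (the function and argument names are arbitrary):

# How *args and **kwargs collect what the caller passes in.
def func(*args, **kwargs):
    print 'args   =', args      # tuple of positional arguments
    print 'kwargs =', kwargs    # dict of keyword arguments

func(1, 2, 3, proxy='http://host1:port', retries=10)
# args   = (1, 2, 3)
# kwargs = {'proxy': 'http://host1:port', 'retries': 10}  (dict order may vary)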