
Thursday, May 22, 2014

Continuing to study randomproxy (but deciding that I should start with the simple examples from the Scrapy documentation)

Trying to run the code piece by piece led me to the conclusion that I should first read the entire Scrapy documentation: it contains plenty of working examples. I still know Python poorly, so it is better to work through working, well-documented code than to puzzle over exercises by unknown authors.
In [1]:
%load /media/usb0/w8/GitHub_2/scrapy-proxies-master/scrapy-proxies-master/README.md 
In []:
Random proxy middleware for Scrapy (http://scrapy.org/)
=======================================================

Processes Scrapy requests using a random proxy from list to avoid IP ban and
improve crawling speed.

Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste into text file
and reformat to http://host:port format)

settings.py
-----------

    # Retry many times since proxies often fail
    RETRY_TIMES = 10
    # Retry on most error codes since proxies fail for different reasons
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
        # Fix path to this module
        'yourspider.randomproxy.RandomProxy': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    }

    # Proxy list containing entries like
    # http://host1:port
    # http://username:password@host2:port
    # http://host3:port
    # ...
    PROXY_LIST = '/path/to/proxy/list.txt'


Your spider
-----------

In each callback ensure that proxy /really/ returned your target page by
checking for site logo or some other significant element.
If not - retry request with dont_filter=True

    if not hxs.select('//get/site/logo'):
        yield Request(url=response.url, dont_filter=True)
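
For context, a fuller version of that callback could look like the sketch below (the spider name, start URL and the logo XPath are hypothetical placeholders, written against the Scrapy 0.x API used in this notebook):

    # Minimal sketch of the "retry if the proxy lied" pattern from the README.
    # Spider name, start URL and the logo XPath are placeholders.
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        name = 'example'
        start_urls = ['http://example.com/']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            if not hxs.select('//img[@id="logo"]'):
                # The proxy returned something else (captcha, error page, ...),
                # so re-request the same URL; dont_filter bypasses the dupe filter.
                yield Request(url=response.url, callback=self.parse,
                              dont_filter=True)
                return
            # ... normal item extraction would go here ...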
In [2]:
%load '/media/usb0/w8/GitHub_2/scrapy-proxies-master/scrapy-proxies-master/randomproxy.py'
In []:
# Copyright (C) 2013 by Aivars Kalvans <aivars.kalvans@gmail.com>
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

import re
import random
import base64
from scrapy import log

class RandomProxy(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')
        fin = open(self.proxy_list)

        self.proxies = {}
        for line in fin.readlines():
            parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)

            # Cut the trailing '@' off the optional 'user:pass@' group
            if parts.group(2):
                user_pass = parts.group(2)[:-1]
            else:
                user_pass = None

            self.proxies[parts.group(1) + parts.group(3)] = user_pass

        fin.close()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Don't overwrite with a random one (server-side state for IP)
        if 'proxy' in request.meta:
            return

        proxy_address = random.choice(self.proxies.keys())
        proxy_user_pass = self.proxies[proxy_address]

        request.meta['proxy'] = proxy_address
        if proxy_user_pass:
            basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass)
            request.headers['Proxy-Authorization'] = basic_auth

    def process_exception(self, request, exception, spider):
        proxy = request.meta['proxy']
        log.msg('Removing failed proxy <%s>, %d proxies left' % (
                    proxy, len(self.proxies)))
        try:
            del self.proxies[proxy]
        except KeyError:
            # already removed by another failed request
            pass
In [3]:
import re
import random
import base64
from scrapy import log
In [4]:
fin=['http://host1:port','http://username:password@host2:port','http://host3:port']
In [6]:
proxies = {}
In [14]:
for line in fin:
            parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)
            print parts.group(0)
            print '    1=',parts.group(1)
            print '    2=',parts.group(2)
http://host1:port
    1= http://
    2= None
http://username:password@host2:port
    1= http://
    2= username:password@
http://host3:port
    1= http://
    2= None

In []:
            # Cut the trailing '@' off the optional 'user:pass@' group (group 2)
            if parts.group(2):
                user_pass = parts.group(2)[:-1]
            else:
                user_pass = None

            proxies[parts.group(1) + parts.group(3)] = user_pass
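
Continuing the experiment, here is a hedged sketch (Python 2, to match this notebook) of what process_request then does with the filled proxies dict: pick a random proxy and, if it carries credentials, prepare the Proxy-Authorization header. The request lines are shown as comments because there is no real Request object in this cell:

    import base64
    import random

    # proxies was filled by the loop above, e.g.
    # {'http://host2:port': 'username:password', 'http://host1:port': None, ...}
    proxy_address = random.choice(proxies.keys())
    proxy_user_pass = proxies[proxy_address]

    # request.meta['proxy'] = proxy_address      # done inside the middleware
    if proxy_user_pass:
        # b64encode, unlike base64.encodestring, does not append a newline
        basic_auth = 'Basic ' + base64.b64encode(proxy_user_pass)
        # request.headers['Proxy-Authorization'] = basic_auth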
In [15]:
import scrapy
In [19]:
help(scrapy)
Help on package scrapy:

NAME
    scrapy - Scrapy - a screen scraping framework written in Python

FILE
    /usr/lib/python2.7/dist-packages/scrapy/__init__.py

PACKAGE CONTENTS
    cmdline
    command
    commands (package)
    conf
    contrib (package)
    contrib_exp (package)
    core (package)
    crawler
    dupefilter
    exceptions
    extension
    http (package)
    interfaces
    item
    link
    linkextractor
    log
    logformatter
    mail
    middleware
    project
    resolver
    responsetypes
    selector (package)
    settings (package)
    shell
    signals
    spider
    spidermanager
    squeue
    stats
    statscol
    telnet
    tests (package)
    utils (package)
    webservice
    xlib (package)

SUBMODULES
    twisted_250_monkeypatches
    urlparse_monkeypatches

DATA
    __version__ = '0.14.4'
    optional_features = set(['boto', 'ssl'])
    version_info = (0, 14, 4)

VERSION
    0.14.4




14 comments:

  1. downloader-middleware
    http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware

    The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy
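
    A hedged sketch of what that merge means in settings.py (module paths as in the README; the UserAgentMiddleware line is only an example of the documented "assign None to disable" rule):

        # settings.py -- merged with DOWNLOADER_MIDDLEWARES_BASE by priority;
        # lower numbers run process_request() earlier
        DOWNLOADER_MIDDLEWARES = {
            'yourspider.randomproxy.RandomProxy': 100,
            'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
            # example: disable a built-in middleware by assigning None
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        }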

    Replies
    1. DOWNLOADER_MIDDLEWARES_BASE

      {
      'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
      'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
      'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
      'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
      'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
      'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
      'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
      'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
      'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
      'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
      'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
      'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
      'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
      'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
      }

    2. And the example code overrides this ordering:

      DOWNLOADER_MIDDLEWARES = {
      'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
      # Fix path to this module
      'yourspider.randomproxy.RandomProxy': 100,
      'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
      }

    3. Documentation: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware#module-scrapy.contrib.downloadermiddleware.httpproxy

      HttpProxyMiddleware
      New in version 0.8.

      class scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware
      This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects.

      Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

      http_proxy
      https_proxy
      no_proxy
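
      The same proxy meta value can also be set per request from a spider; a hedged sketch (the URL and proxy address are placeholders):

          from scrapy.http import Request

          request = Request('http://example.com/')
          request.meta['proxy'] = 'http://host1:8080'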

    4. http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware#module-scrapy.contrib.downloadermiddleware.retry

      RetryMiddleware
      class scrapy.contrib.downloadermiddleware.retry.RetryMiddleware
      A middleware to retry failed requests that are potentially caused by temporary problems such as a connection timeout or HTTP 500 error.

      Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.

      The RetryMiddleware can be configured through the following settings (see the settings documentation for more info):

      RETRY_ENABLED
      RETRY_TIMES
      RETRY_HTTP_CODES
      About HTTP errors to consider:

      You may want to remove 400 from RETRY_HTTP_CODES, if you stick to the HTTP protocol. It’s included by default because it’s a common code used to indicate server overload, which would be something we want to retry.

      If Request.meta contains the dont_retry key, the request will be ignored by this middleware.

      RetryMiddleware Settings
      RETRY_ENABLED
      New in version 0.13.

      Default: True

      Whether the Retry middleware will be enabled.

      RETRY_TIMES
      Default: 2

      Maximum number of times to retry, in addition to the first download.

      RETRY_HTTP_CODES
      Default: [500, 502, 503, 504, 400, 408]

      Which HTTP response codes to retry. Other errors (DNS lookup issues, connections lost, etc) are always retried.
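
      A hedged sketch of how these knobs appear in practice (the values are just examples; the settings lines go in settings.py, the Request line in a spider):

          # settings.py -- example values only
          RETRY_ENABLED = True
          RETRY_TIMES = 10
          RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

          # in a spider callback: opt a single request out of retrying
          from scrapy.http import Request
          req = Request('http://example.com/', meta={'dont_retry': True})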

  2. C:\Users\kiss\Documents\GitHub\scrapy\scrapy\contrib\downloadermiddleware\httpproxy.py

    import base64
    from urllib import getproxies, unquote, proxy_bypass
    from urllib2 import _parse_proxy
    from urlparse import urlunparse

    from scrapy.utils.httpobj import urlparse_cached
    from scrapy.exceptions import NotConfigured


    class HttpProxyMiddleware(object):

        def __init__(self):
            self.proxies = {}
            for type, url in getproxies().items():
                self.proxies[type] = self._get_proxy(url, type)

            if not self.proxies:
                raise NotConfigured

        def _get_proxy(self, url, orig_type):
            proxy_type, user, password, hostport = _parse_proxy(url)
            proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

            if user and password:
                user_pass = '%s:%s' % (unquote(user), unquote(password))
                creds = base64.b64encode(user_pass).strip()
            else:
                creds = None

            return creds, proxy_url

        def process_request(self, request, spider):
            # ignore if proxy is already seted
            if 'proxy' in request.meta:
                return

            parsed = urlparse_cached(request)
            scheme = parsed.scheme

            # 'no_proxy' is only supported by http schemes
            if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
                return

            if scheme in self.proxies:
                self._set_proxy(request, scheme)

        def _set_proxy(self, request, scheme):
            creds, proxy = self.proxies[scheme]
            request.meta['proxy'] = proxy
            if creds:
                request.headers['Proxy-Authorization'] = 'Basic ' + creds

  3. """
    An extension to retry failed requests that are potentially caused by temporary
    problems such as a connection timeout or HTTP 500 error.

    You can change the behaviour of this middleware by modifing the scraping settings:
    RETRY_TIMES - how many times to retry a failed page
    RETRY_HTTP_CODES - which HTTP response codes to retry

    Failed pages are collected on the scraping process and rescheduled at the end,
    once the spider has finished crawling all regular (non failed) pages. Once
    there is no more failed pages to retry this middleware sends a signal
    (retry_complete), so other extensions could connect to that signal.

    About HTTP errors to consider:

    - You may want to remove 400 from RETRY_HTTP_CODES, if you stick to the HTTP
    protocol. It's included by default because it's a common code used to
    indicate server overload, which would be something we want to retry
    """

    from twisted.internet.defer import TimeoutError as UserTimeoutError
    from twisted.internet.error import TimeoutError as ServerTimeoutError, \
    DNSLookupError, ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError

    from scrapy import log
    from scrapy.exceptions import NotConfigured
    from scrapy.utils.response import response_status_message
    from scrapy.xlib.tx import ResponseFailed


    class RetryMiddleware(object):

    # IOError is raised by the HttpCompression middleware when trying to
    # decompress an empty response
    EXCEPTIONS_TO_RETRY = (ServerTimeoutError, UserTimeoutError, DNSLookupError,
    ConnectionRefusedError, ConnectionDone, ConnectError,
    ConnectionLost, TCPTimedOutError, ResponseFailed,
    IOError)

    def __init__(self, settings):
    if not settings.getbool('RETRY_ENABLED'):
    raise NotConfigured
    self.max_retry_times = settings.getint('RETRY_TIMES')
    self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
    self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
    return cls(crawler.settings)

    def process_response(self, request, response, spider):
    if 'dont_retry' in request.meta:
    return response
    if response.status in self.retry_http_codes:
    reason = response_status_message(response.status)
    return self._retry(request, reason, spider) or response
    return response

    def process_exception(self, request, exception, spider):
    if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
    and 'dont_retry' not in request.meta:
    return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
    retries = request.meta.get('retry_times', 0) + 1

    if retries <= self.max_retry_times:
    log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)
    retryreq = request.copy()
    retryreq.meta['retry_times'] = retries
    retryreq.dont_filter = True
    retryreq.priority = request.priority + self.priority_adjust
    return retryreq
    else:
    log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
    level=log.DEBUG, spider=spider, request=request, retries=retries, reason=reason)

  4. http://doc.scrapy.org/en/latest/topics/settings.html#topics-settings

    The Scrapy settings allows you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.

    The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.

    Replies
    1. The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).

    2. Populating the settings
      Settings can be populated using different mechanisms, each of which having a different precedence. Here is the list of them in decreasing order of precedence:

      - Global overrides (most precedence)
      - Project settings module
      - Default settings per-command
      - Default global settings (less precedence)

      These mechanisms are described in more detail below.

  5. The order in which Python searches for modules

    https://docs.python.org/2/tutorial/modules.html#the-module-search-path

    6.1.2. The Module Search Path
    When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:

    the directory containing the input script (or the current directory).
    PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
    the installation-dependent default.

    After initialization, Python programs can modify sys.path.
    The directory containing the script being run is placed at the beginning of the search path, ahead of the standard library path.
    This means that scripts in that directory will be loaded instead of modules of the same name in the library directory.
    This is an error unless the replacement is intended. See section Standard Modules for more information.
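
    A quick way to inspect (and, if needed, extend) that search path from this notebook; the extra directory below is just the folder the scrapy-proxies checkout lives in here, used as an example:

        import sys

        # show the current search order
        for p in sys.path:
            print p

        # prepend a directory so a local module wins over an installed one
        sys.path.insert(0, '/media/usb0/w8/GitHub_2')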

  6. What is @classmethod?

    https://docs.python.org/2/glossary.html?highlight=classmethod

    decorator
    A function returning another function, usually applied as a function transformation using the @wrapper syntax. Common examples for decorators are classmethod() and staticmethod().
    It is merely syntactic sugar:
    The decorator syntax is merely syntactic sugar, the following two function definitions are semantically equivalent:

    def f(...):
        ...
    f = staticmethod(f)

    @staticmethod
    def f(...):
        ...
    The same concept exists for classes, but is less commonly used there. See the documentation for function definitions and class definitions for more about decorators.

    Replies
    1. https://docs.python.org/2/library/functions.html?highlight=classmethod#classmethod

      classmethod(function)
      Return a class method for function.

      A class method receives the class as implicit first argument, just like an instance method receives the instance. To declare a class method, use this idiom:

      class C(object):
          @classmethod
          def f(cls, arg1, arg2, ...):
              ...
      The @classmethod form is a function decorator – see the description of function definitions in Function definitions for details.

      It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class. If a class method is called for a derived class, the derived class object is passed as the implied first argument.

      Class methods are different than C++ or Java static methods. If you want those, see staticmethod() in this section.

      For more information on class methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.

      New in version 2.2.

      Changed in version 2.4: Function decorator syntax added.
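
      This is exactly the pattern from_crawler() uses in randomproxy.py: a classmethod acts as an alternate constructor that receives the class itself. A stripped-down sketch of that idea (not the full middleware):

          class RandomProxy(object):
              def __init__(self, settings):
                  self.proxy_list = settings.get('PROXY_LIST')

              @classmethod
              def from_crawler(cls, crawler):
                  # cls is RandomProxy (or a subclass); Scrapy calls this
                  # hook to build the middleware from the crawler's settings
                  return cls(crawler.settings)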

  7. What does this mean:
    def func(*args, **kwargs): ...

    https://docs.python.org/2/glossary.html?highlight=kwargs

    parameter
    A named entity in a function (or method) definition that specifies an argument (or in some cases, arguments) that the function can accept. There are four types of parameters:

    positional-or-keyword: specifies an argument that can be passed either positionally or as a keyword argument. This is the default kind of parameter, for example foo and bar in the following:

    def func(foo, bar=None): ...
    positional-only: specifies an argument that can be supplied only by position. Python has no syntax for defining positional-only parameters. However, some built-in functions have positional-only parameters (e.g. abs()).

    var-positional: specifies that an arbitrary sequence of positional arguments can be provided (in addition to any positional arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with *, for example args in the following:

    def func(*args, **kwargs): ...
    var-keyword: specifies that arbitrarily many keyword arguments can be provided (in addition to any keyword arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with **, for example kwargs in the example above.

    Parameters can specify both optional and required arguments, as well as default values for some optional arguments.

    See also the argument glossary entry, the FAQ question on the difference between arguments and parameters, and the Function definitions section.
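
    A tiny demonstration of the two variadic forms (the names and values are arbitrary):

        def func(*args, **kwargs):
            # args is a tuple of positional arguments, kwargs a dict of keyword ones
            print args, kwargs

        func(1, 2, retries=3)    # prints: (1, 2) {'retries': 3}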
