Trying to run the code piece by piece led me to the conclusion that I should first read the entire Scrapy documentation. It contains many working examples. I still know Python poorly, so it makes more sense to study working code with good documentation than to puzzle over exercises by unknown authors.
In [1]:
%load /media/usb0/w8/GitHub_2/scrapy-proxies-master/scrapy-proxies-master/README.md
In []:
Random proxy middleware for Scrapy (http://scrapy.org/)
=======================================================
Processes Scrapy requests using a random proxy from a list, to avoid IP bans and
improve crawling speed.
Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste into text file
and reformat to http://host:port format)
settings.py
-----------
    # Retry many times since proxies often fail
    RETRY_TIMES = 10
    # Retry on most error codes since proxies fail for different reasons
    RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
        # Fix path to this module
        'yourspider.randomproxy.RandomProxy': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    }

    # Proxy list containing entries like
    # http://host1:port
    # http://username:password@host2:port
    # http://host3:port
    # ...
    PROXY_LIST = '/path/to/proxy/list.txt'
Your spider
-----------
In each callback ensure that the proxy /really/ returned your target page by
checking for the site logo or some other significant element.
If not, retry the request with dont_filter=True:
    if not hxs.select('//get/site/logo'):
        yield Request(url=response.url, dont_filter=True)
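A minimal sketch of that "verify the page, otherwise re-queue it" pattern as a whole spider. Only the dont_filter=True retry idea comes from the README; the spider name, URL and XPath are made-up placeholders, and Selector(response).xpath is simply the newer spelling of the README's hxs.select check.
In []:
# Sketch only: re-schedule the same URL when the expected element is missing.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector

class CheckLogoSpider(Spider):
    name = 'checklogo'                      # hypothetical name
    start_urls = ['http://example.com/']    # hypothetical start URL

    def parse(self, response):
        # If the logo is missing, the proxy probably served a ban/captcha
        # page: re-queue the URL, bypassing the duplicate filter so Scrapy
        # downloads it again (through another random proxy).
        if not Selector(response).xpath('//img[@id="logo"]'):
            yield Request(response.url, dont_filter=True)
            return
        # ... normal parsing goes here ...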
In [2]:
%load '/media/usb0/w8/GitHub_2/scrapy-proxies-master/scrapy-proxies-master/randomproxy.py'
In []:
# Copyright (C) 2013 by Aivars Kalvans <aivars.kalvans@gmail.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import re
import random
import base64
from scrapy import log
class RandomProxy(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')
        fin = open(self.proxy_list)

        self.proxies = {}
        for line in fin.readlines():
            parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)

            # Cut trailing @ from the credentials group, if present
            if parts.group(2):
                user_pass = parts.group(2)[:-1]
            else:
                user_pass = ''
            self.proxies[parts.group(1) + parts.group(3)] = user_pass

        fin.close()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Don't overwrite with a random one (server-side state for IP)
        if 'proxy' in request.meta:
            return

        proxy_address = random.choice(self.proxies.keys())
        proxy_user_pass = self.proxies[proxy_address]

        request.meta['proxy'] = proxy_address
        if proxy_user_pass:
            basic_auth = 'Basic ' + base64.encodestring(proxy_user_pass)
            request.headers['Proxy-Authorization'] = basic_auth

    def process_exception(self, request, exception, spider):
        proxy = request.meta['proxy']
        log.msg('Removing failed proxy <%s>, %d proxies left' % (
            proxy, len(self.proxies)))
        try:
            del self.proxies[proxy]
        except KeyError:  # deleting from a dict raises KeyError, not ValueError
            pass
In [3]:
import re
import random
import base64
from scrapy import log
In [4]:
fin=['http://host1:port','http://username:password@host2:port','http://host3:port']
In [6]:
proxies = {}
In [14]:
for line in fin:
    parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)
    print parts.group(0)
    print ' 1=', parts.group(1)
    print ' 2=', parts.group(2)
In []:
# Cut trailing @ (rest of the parsing loop, not executed here)
if parts.group(2):
    user_pass = parts.group(2)[:-1]
else:
    user_pass = ''
proxies[parts.group(1) + parts.group(3)] = user_pass
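Putting the pieces together, the whole parsing loop can be run on the sample list above; the result is a dict keyed by scheme + host:port with the credentials (if any) as the value. A small self-contained recap:
In []:
# Self-contained recap of the experiment: build the proxies dict the same
# way RandomProxy.__init__ does, from the sample list used above.
import re

fin = ['http://host1:port',
       'http://username:password@host2:port',
       'http://host3:port']

proxies = {}
for line in fin:
    parts = re.match('(\w+://)(\w+:\w+@)?(.+)', line)
    if parts.group(2):
        user_pass = parts.group(2)[:-1]   # cut the trailing '@'
    else:
        user_pass = ''
    proxies[parts.group(1) + parts.group(3)] = user_pass

print proxies
# expected contents (dict order may vary):
# {'http://host1:port': '', 'http://host2:port': 'username:password',
#  'http://host3:port': ''}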
In [15]:
import scrapy
In [19]:
help(scrapy)
In []:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware
The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy
DOWNLOADER_MIDDLEWARES_BASE
{
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}
And in the example code the order is overridden:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    # Fix path to this module
    'yourspider.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
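The two dicts are merged by key, and the numbers only decide the order within the merged result; the docs also say a built-in middleware can be switched off by assigning it None. A hedged settings.py sketch of both points (disabling CookiesMiddleware here is only an illustration of the None rule, not advice for this project):

# settings.py sketch: custom middleware added, one built-in disabled.
DOWNLOADER_MIDDLEWARES = {
    'yourspider.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    # None removes the middleware from the merged DOWNLOADER_MIDDLEWARES_BASE
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': None,
}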
Documentation: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware#module-scrapy.contrib.downloadermiddleware.httpproxy
HttpProxyMiddleware
New in version 0.8.
class scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value to Request objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
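So there are two ways to feed this middleware: let it pick up the standard proxy environment variables, or set request.meta['proxy'] yourself, which is exactly what RandomProxy does (at a lower priority number, so it runs first). A small sketch of the per-request variant; the spider name, URL and proxy address are made-up placeholders:

# Per-request proxy via the same 'proxy' meta key that
# HttpProxyMiddleware and RandomProxy work with.
from scrapy.spider import Spider
from scrapy.http import Request

class OneProxySpider(Spider):
    name = 'oneproxy'

    def start_requests(self):
        yield Request('http://example.com/',
                      meta={'proxy': 'http://10.0.0.1:3128'})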
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retrymiddleware#module-scrapy.contrib.downloadermiddleware.retry
RetryMiddleware
class scrapy.contrib.downloadermiddleware.retry.RetryMiddleware
A middleware to retry failed requests that are potentially caused by temporary problems such as a connection timeout or HTTP 500 error.
Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.
The RetryMiddleware can be configured through the following settings (see the settings documentation for more info):
RETRY_ENABLED
RETRY_TIMES
RETRY_HTTP_CODES
About HTTP errors to consider:
You may want to remove 400 from RETRY_HTTP_CODES, if you stick to the HTTP protocol. It’s included by default because it’s a common code used to indicate server overload, which would be something we want to retry.
If Request.meta contains the dont_retry key, the request will be ignored by this middleware.
RetryMiddleware Settings
RETRY_ENABLED
New in version 0.13.
Default: True
Whether the Retry middleware will be enabled.
RETRY_TIMES
Default: 2
Maximum number of times to retry, in addition to the first download.
RETRY_HTTP_CODES
Default: [500, 502, 503, 504, 400, 408]
Which HTTP response codes to retry. Other errors (DNS lookup issues, connections lost, etc) are always retried.
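These are exactly the knobs the README turns: it raises RETRY_TIMES to 10 and extends RETRY_HTTP_CODES, because with flaky proxies almost every error class is worth another attempt. A single request can also opt out via the dont_retry meta key. A short settings sketch (the values are illustrative, not a recommendation):

# settings.py sketch: the retry settings discussed above.
RETRY_ENABLED = True
RETRY_TIMES = 10                                   # default is 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]  # default list from the docs

# and in a spider, one request can skip retrying entirely:
# yield Request(url, meta={'dont_retry': True})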
C:\Users\kiss\Documents\GitHub\scrapy\scrapy\contrib\downloadermiddleware\httpproxy.py
import base64
from urllib import getproxies, unquote, proxy_bypass
from urllib2 import _parse_proxy
from urlparse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured


class HttpProxyMiddleware(object):

    def __init__(self):
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

        if not self.proxies:
            raise NotConfigured

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user and password:
            user_pass = '%s:%s' % (unquote(user), unquote(password))
            creds = base64.b64encode(user_pass).strip()
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = 'Basic ' + creds
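To see what _get_proxy is working with, urllib2's private _parse_proxy helper can be called directly; it splits a proxy URL into (scheme, user, password, host:port). A quick check with a made-up address (private API, so this is just an exploratory poke, not a supported interface):

# What _parse_proxy returns for a proxy URL with credentials.
from urllib2 import _parse_proxy

print _parse_proxy('http://joe:secret@10.0.0.1:3128')
# expected: ('http', 'joe', 'secret', '10.0.0.1:3128')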
"""
ОтветитьУдалитьAn extension to retry failed requests that are potentially caused by temporary
problems such as a connection timeout or HTTP 500 error.
You can change the behaviour of this middleware by modifying the scraping settings:
RETRY_TIMES - how many times to retry a failed page
RETRY_HTTP_CODES - which HTTP response codes to retry
Failed pages are collected on the scraping process and rescheduled at the end,
once the spider has finished crawling all regular (non failed) pages. Once
there are no more failed pages to retry, this middleware sends a signal
(retry_complete), so other extensions could connect to that signal.
About HTTP errors to consider:
- You may want to remove 400 from RETRY_HTTP_CODES, if you stick to the HTTP
protocol. It's included by default because it's a common code used to
indicate server overload, which would be something we want to retry
"""
from twisted.internet.defer import TimeoutError as UserTimeoutError
from twisted.internet.error import TimeoutError as ServerTimeoutError, \
        DNSLookupError, ConnectionRefusedError, ConnectionDone, ConnectError, \
        ConnectionLost, TCPTimedOutError

from scrapy import log
from scrapy.exceptions import NotConfigured
from scrapy.utils.response import response_status_message
from scrapy.xlib.tx import ResponseFailed


class RetryMiddleware(object):

    # IOError is raised by the HttpCompression middleware when trying to
    # decompress an empty response
    EXCEPTIONS_TO_RETRY = (ServerTimeoutError, UserTimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, ResponseFailed,
                           IOError)

    def __init__(self, settings):
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        if 'dont_retry' in request.meta:
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and 'dont_retry' not in request.meta:
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            log.msg(format="Retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request,
                    retries=retries, reason=reason)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            log.msg(format="Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                    level=log.DEBUG, spider=spider, request=request,
                    retries=retries, reason=reason)
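From the code it is clear that every retried request carries a retry_times counter in its meta, and that dont_retry disables the middleware per request; both can be used from a spider. A hedged sketch (spider name, URLs and callback logic are made-up placeholders):

# Reading the retry counter and opting out of retries for one request.
from scrapy.spider import Spider
from scrapy.http import Request

class RetryAwareSpider(Spider):
    name = 'retryaware'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # how many times RetryMiddleware re-scheduled this request
        retries = response.meta.get('retry_times', 0)
        self.log('downloaded %s after %d retries' % (response.url, retries))
        # and a request that the middleware must never retry:
        yield Request('http://example.com/unstable', meta={'dont_retry': True})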
http://doc.scrapy.org/en/latest/topics/settings.html#topics-settings
The Scrapy settings allows you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).
Populating the settings
Settings can be populated using different mechanisms, each of which having a different precedence. Here is the list of them in decreasing order of precedence:
Global overrides (most precedence)
Project settings module
Default settings per-command
Default global settings (less precedence)
These mechanisms are described in more detail below.
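In the code above the settings object always arrives through from_crawler(crawler) as crawler.settings; outside a running crawler the project settings can also be loaded explicitly. A sketch of both ways of reading the PROXY_LIST key (assumes the script runs inside a Scrapy project so that settings.py can be found):

# 1) Inside a middleware/extension: Scrapy hands the settings over itself.
class SomeExtension(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

# 2) In a standalone script: load the project settings explicitly.
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
print settings.get('PROXY_LIST')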
In what order Python searches for modules
https://docs.python.org/2/tutorial/modules.html#the-module-search-path
6.1.2. The Module Search Path
When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:
the directory containing the input script (or the current directory).
PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
the installation-dependent default.
After initialization, Python programs can modify sys.path.
The directory containing the script being run is placed at the beginning of the search path, ahead of the standard library path.
This means that scripts in that directory will be loaded instead of modules of the same name in the library directory.
This is an error unless the replacement is intended. See section Standard Modules for more information.
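This matters for the 'yourspider.randomproxy.RandomProxy' entry in DOWNLOADER_MIDDLEWARES: the randomproxy module has to be importable from the project. The search path is easy to inspect and, if needed, extend; the extra directory below is just a placeholder:

# Where Python will look for modules such as yourspider.randomproxy.
import sys

for p in sys.path[:5]:
    print p

# the path can also be extended at runtime (placeholder directory):
sys.path.insert(0, '/path/to/extra/modules')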
What is @classmethod
https://docs.python.org/2/glossary.html?highlight=classmethod
decorator
A function returning another function, usually applied as a function transformation using the @wrapper syntax. Common examples for decorators are classmethod() and staticmethod().
It is merely syntactic sugar:
The decorator syntax is merely syntactic sugar, the following two function definitions are semantically equivalent:
def f(...):
    ...
f = staticmethod(f)

@staticmethod
def f(...):
    ...
The same concept exists for classes, but is less commonly used there. See the documentation for function definitions and class definitions for more about decorators.
https://docs.python.org/2/library/functions.html?highlight=classmethod#classmethod
classmethod(function)
Return a class method for function.
A class method receives the class as implicit first argument, just like an instance method receives the instance. To declare a class method, use this idiom:
class C(object):
    @classmethod
    def f(cls, arg1, arg2, ...):
        ...
The @classmethod form is a function decorator – see the description of function definitions in Function definitions for details.
It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class. If a class method is called for a derived class, the derived class object is passed as the implied first argument.
Class methods are different than C++ or Java static methods. If you want those, see staticmethod() in this section.
For more information on class methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.
New in version 2.2.
Changed in version 2.4: Function decorator syntax added.
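This is exactly the role from_crawler plays in the middlewares above: an alternative constructor that receives the class itself (cls) and builds an instance from a crawler/settings object. The same pattern reduced to plain Python (FakeSettings is a made-up stub for the example):

# A classmethod used as an alternative constructor, like from_crawler.
class FakeSettings(object):
    def get(self, name):
        return '/path/to/proxy/list.txt'

class MyMiddleware(object):
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')

    @classmethod
    def from_settings(cls, settings):
        # cls is MyMiddleware itself (or a subclass), so subclasses
        # constructed this way get the right type automatically
        return cls(settings)

print MyMiddleware.from_settings(FakeSettings()).proxy_list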
What does the following mean?
def func(*args, **kwargs): ...
https://docs.python.org/2/glossary.html?highlight=kwargs
parameter
A named entity in a function (or method) definition that specifies an argument (or in some cases, arguments) that the function can accept. There are four types of parameters:
positional-or-keyword: specifies an argument that can be passed either positionally or as a keyword argument. This is the default kind of parameter, for example foo and bar in the following:
def func(foo, bar=None): ...
positional-only: specifies an argument that can be supplied only by position. Python has no syntax for defining positional-only parameters. However, some built-in functions have positional-only parameters (e.g. abs()).
var-positional: specifies that an arbitrary sequence of positional arguments can be provided (in addition to any positional arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with *, for example args in the following:
def func(*args, **kwargs): ...
var-keyword: specifies that arbitrarily many keyword arguments can be provided (in addition to any keyword arguments already accepted by other parameters). Such a parameter can be defined by prepending the parameter name with **, for example kwargs in the example above.
Parameters can specify both optional and required arguments, as well as default values for some optional arguments.
See also the argument glossary entry, the FAQ question on the difference between arguments and parameters, and the Function definitions section.
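A quick sanity check of how the two parameter kinds collect arguments (the function and argument names are arbitrary):

# How *args and **kwargs collect what the caller passes in.
def func(*args, **kwargs):
    print 'args   =', args      # tuple of positional arguments
    print 'kwargs =', kwargs    # dict of keyword arguments

func(1, 2, 3, proxy='http://host1:port', retries=10)
# args   = (1, 2, 3)
# kwargs = {'proxy': 'http://host1:port', 'retries': 10}  (dict order may vary)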