
Tuesday, July 15, 2014

Continuing to master the Scrapy shell: trying to understand and remember what pops up after typing . and pressing TAB in the IPython console

Here we learn to work with the console. First, using dirbot as an example (scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"), we look at the request, response, settings... objects and print things like settings.overrides or spider.settings.overrides. Then we find the settings package and print its __init__.py, where all the methods of the Settings class are defined.

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Change to the project's root folder and start the download from the console. Even though we have not touched the project files directly, Scrapy uses them to create the objects (response, ...); examples are below.
In [*]:
C:\Users\kiss\Anaconda>cd C:\Users\kiss\Documents\GitHub\dirbot

C:\Users\kiss\Documents\GitHub\dirbot>scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
2014-07-12 08:31:52+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapybot)
2014-07-12 08:31:52+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-12 08:31:52+0400 [scrapy] DEBUG: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders'], 'LOGSTATS_INTERVAL': 0}
2014-07-12 08:31:56+0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\pipeline\__init__.py:21: ScrapyDeprecationWarning: ITEM_PIPELINES defined as a list or a set is deprecated, switch to a dict
  category=ScrapyDeprecationWarning, stacklevel=1)
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Enabled item pipelines: FilterWordsPipeline
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-12 08:31:59+0400 [dmoz] INFO: Spider opened
2014-07-12 08:32:00+0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   sel        <Selector xpath=None data=u'<html lang="en">\r\n<head>\r\n<meta http-equ'>
[s]   settings   <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s]   spider     <DmozSpider 'dmoz' at 0x4931f60>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
C:\Users\kiss\Anaconda\lib\site-packages\IPython\frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated.
All its subpackages have been moved to the top `IPython` level.
  warn("The top-level `frontend` package has been deprecated. "
In []:

If you type a dot (.) after request and press the TAB key:

In []:
In [58]: request.
request.body        request.copy        request.errback     request.method      request.url
request.callback    request.dont_filter request.headers     request.priority
request.cookies     request.encoding    request.meta        request.replace

In [58]: request.headers
Out[58]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'x-gzip,gzip,deflate',
 'Accept-Language': 'en',
 'User-Agent': 'Scrapy/0.20.1 (+http://scrapy.org)'}

In [59]: request.cookies
Out[59]: {}

In [60]: request.url
Out[60]: 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'

In [61]: request.meta
Out[61]:
{'depth': 0,
 'download_latency': 0.7599999904632568,
 'download_slot': 'www.dmoz.org',
 'download_timeout': 180,
 'handle_httpstatus_all': True}
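The keys shown above (depth, download_latency, download_slot, download_timeout, handle_httpstatus_all) were put there by Scrapy's own middlewares and downloader. The same dict is also the standard place for passing your own data from a request to the callback that handles its response. A small sketch of that pattern (the spider name, URLs and callback are illustrative, not part of dirbot):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MetaDemoSpider(BaseSpider):
    name = 'metademo'
    start_urls = ['http://www.dmoz.org/Computers/']

    def parse(self, response):
        # attach arbitrary user data to the next request via meta
        yield Request('http://www.dmoz.org/Computers/Programming/',
                      meta={'section': 'Programming'},
                      callback=self.parse_section)

    def parse_section(self, response):
        # the same dict comes back as response.meta in the callback
        self.log('section: %s' % response.meta['section'])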

In [62]: request.headers.
request.headers.appendlist     request.headers.has_key        request.headers.normvalue      request.headers.update
request.headers.clear          request.headers.items          request.headers.pop            request.headers.values
request.headers.copy           request.headers.iteritems      request.headers.popitem        request.headers.viewitems
request.headers.encoding       request.headers.iterkeys       request.headers.setdefault     request.headers.viewkeys
request.headers.fromkeys       request.headers.itervalues     request.headers.setlist        request.headers.viewvalues
request.headers.get            request.headers.keys           request.headers.setlistdefault
request.headers.getlist        request.headers.normkey        request.headers.to_string
In []:

In []:
In [1]: response.
response.body            response.encoding        response.meta            response.status
response.body_as_unicode response.flags           response.replace         response.url
response.copy            response.headers         response.request
In []:
In [1]: response.url
Out[1]: 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'

This is how you can look at the response headers

In []:
In [2]: response.headers
Out[2]:
{'Content-Language': 'en',
 'Content-Type': 'text/html;charset=UTF-8',
 'Cteonnt-Length': '33758',
 'Date': 'Sat, 12 Jul 2014 04:32:05 GMT',
 'Server': 'Apache',
 'Set-Cookie': 'JSESSIONID=1B121BE4D4FF05F76826CBB0D1491910; Path=/'}

In [3]: request.headers
Out[3]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'x-gzip,gzip,deflate',
 'Accept-Language': 'en',
 'User-Agent': 'Scrapy/0.20.1 (+http://scrapy.org)'}

Let's look at the headers once more (in the request object they should be the same)

In []:
 request.headers.getlist
<bound method Headers.getlist of {'Accept-Language': ['en'], 'Accept-Encoding': ['x-gzip,gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Scrapy/0.20.1 (+http://scrapy.org)']}>
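Note that Headers keeps each header as a list of values: getlist() returns the whole list, while get() and plain indexing return only the first one (the values below match the output above):

In []: request.headers.getlist('Accept-Encoding')
Out[]: ['x-gzip,gzip,deflate']

In []: request.headers.get('Accept-Encoding')
Out[]: 'x-gzip,gzip,deflate'

In []: request.headers['User-Agent']      # key lookup is case-insensitive
Out[]: 'Scrapy/0.20.1 (+http://scrapy.org)'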
Now, what if I want to change a header, for example 'User-Agent'... Obviously I can go to the (request) object in the cache, but can I go straight to the original settings instead?
In []:
In [5]: settings.
settings.defaults        settings.getdict         settings.getlist         settings.settings_module
settings.get             settings.getfloat        settings.global_defaults settings.values
settings.getbool         settings.getint          settings.overrides
In []:
In [73]: settings['USER_AGENT']
Out[73]: 'Scrapy/0.20.1 (+http://scrapy.org)'
In []:
In []:
In [24]: settings.getlist
Out[24]: <bound method CrawlerSettings.getlist of <scrapy.settings.CrawlerSettings object at 0x00000000040C0A20>>

In [25]: settings.getlist. # press TAB again
settings.getlist.im_class settings.getlist.im_func  settings.getlist.im_self

In [25]: settings.getlist.im_self # pick one at random
Out[25]: <scrapy.settings.CrawlerSettings at 0x40c0a20>

In [26]: settings.getlist.im_self.
settings.getlist.im_self.defaults        settings.getlist.im_self.getfloat        settings.getlist.im_self.overrides
settings.getlist.im_self.get             settings.getlist.im_self.getint          settings.getlist.im_self.settings_module
settings.getlist.im_self.getbool         settings.getlist.im_self.getlist         settings.getlist.im_self.values
settings.getlist.im_self.getdict         settings.getlist.im_self.global_defaults

In [26]: settings.getlist.im_self.defaults # and pick at random once more
Out[26]: {'KEEP_ALIVE': True, 'LOGSTATS_INTERVAL': 0}
Obviously this is a tedious business, so let's turn to the Requests and Responses documentation. That is where to look up redirects, flags... and everything related to a page load (request) that has already happened.
In []:

In []:
In [27]: shelp()
[s] Available Scrapy objects:
[s]   _25        <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s]   __         <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   sel        <Selector xpath=None data=u'<html lang="en">\r\n<head>\r\n<meta http-equ'>
[s]   settings   <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s]   spider     <DmozSpider 'dmoz' at 0x4931f60>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Since I already know there is a fetch command... can I attach a new header (User-Agent) to it? Or should I first set the new header in settings and only then call fetch?
In [*]:
In [29]: fetch.
fetch.im_class fetch.im_func  fetch.im_self
In []:
# Trying settings.overrides
In [29]: settings.overrides.
settings.overrides.clear      settings.overrides.items      settings.overrides.pop        settings.overrides.viewitems
settings.overrides.copy       settings.overrides.iteritems  settings.overrides.popitem    settings.overrides.viewkeys
settings.overrides.fromkeys   settings.overrides.iterkeys   settings.overrides.setdefault settings.overrides.viewvalues
settings.overrides.get        settings.overrides.itervalues settings.overrides.update
settings.overrides.has_key    settings.overrides.keys       settings.overrides.values
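One way to answer the question above (a sketch only; these exact commands were not run in this session): instead of editing the cached request field by field, build a modified copy with replace() and hand it to fetch(). Setting settings.overrides['USER_AGENT'] is also possible, but the UserAgentMiddleware was already configured when the shell started, so that change would only take effect in a new crawler process. The agent string below is made up for illustration:

In []: req = request.replace(headers={'User-Agent': 'MyCustomAgent/1.0'})   # a copy of the request with a new header
In []: fetch(req)                                  # re-downloads and refreshes request/response/sel in the shell
In []: response.request.headers['User-Agent']
Out[]: 'MyCustomAgent/1.0'

This works because the UserAgentMiddleware and DefaultHeadersMiddleware only call setdefault() on the headers, so an explicitly set User-Agent survives the download.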

Obviously the project's settings.py file contains many other settings (see Populating the settings). What exactly is best to change in the settings files?

Settings can be populated using different mechanisms, each of which having a different precedence. Here is the list of them in decreasing order of precedence:
  1. Command line options (most precedence)
  2. Project settings module
  3. Default settings per-command
  4. Default global settings (less precedence)
The population of these settings sources is taken care of internally, but a manual handling is possible using API calls. See the Settings API topic for reference.
These mechanisms are described in more detail below.

  1. Command line options

Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option.
Example:
scrapy crawl myspider -s LOG_FILE=scrapy.log

  2. Project settings module

The project settings module is the standard configuration file for your Scrapy project. It’s where most of your custom settings will be populated. For example: myproject.settings.

  3. Default settings per-command

Each Scrapy tool command can have its own default settings, which override the global default settings. Those custom command settings are specified in the default_settings attribute of the command class.

  4. Default global settings

The global defaults are located in the scrapy.settings.default_settings module and documented in the Built-in settings reference section.
In [1]:
# An example settings file for rotating a proxy list (not working yet)
%load dirbot\settings.py
In []:
# Scrapy settings for dirbot project

SPIDER_MODULES = ['dirbot.spiders']
NEWSPIDER_MODULE = 'dirbot.spiders'
DEFAULT_ITEM_CLASS = 'dirbot.items.Website'

ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']
##############################################
import pdb; pdb.set_trace()
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    # Fix path to this module
    'dirbot.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = 'dirbot/list.txt'
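For context, the dirbot.randomproxy.RandomProxy entry in DOWNLOADER_MIDDLEWARES above points to a separate module that is not shown in this post. Conceptually, such a middleware only has to pick a line from PROXY_LIST and put it into request.meta['proxy'], which the standard HttpProxyMiddleware then uses. A rough illustration, not the actual dirbot/randomproxy.py:

# a minimal sketch of a rotating-proxy downloader middleware
import random

class RandomProxy(object):
    def __init__(self, settings):
        # read the proxy list path from settings.py (PROXY_LIST defined above)
        with open(settings.get('PROXY_LIST')) as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # the built-in HttpProxyMiddleware picks this value up
        request.meta['proxy'] = random.choice(self.proxies)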
At the same time, an attempt to print the dictionary of settings values shows that it is empty. Is that because I did not launch a spider, but simply used the shell from the project folder?
In []:
In [39]: settings.values
Out[39]: {}
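Judging by the CrawlerSettings source quoted near the end of this post, this is expected: the shell wraps the project's settings module in a CrawlerSettings object, so the values dict stays empty and every lookup falls through overrides, then the settings module, then defaults, then the global defaults. For instance (the values match the log and dirbot/settings.py above):

In []: settings['DEFAULT_ITEM_CLASS']   # defined in dirbot/settings.py (the settings_module)
Out[]: 'dirbot.items.Website'

In []: settings['USER_AGENT']           # absent from the project file, comes from the global defaults
Out[]: 'Scrapy/0.20.1 (+http://scrapy.org)'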
The item object does not show anything either
In []:
In [43]: item.
item.clear      item.fields     item.has_key    item.iteritems  item.itervalues item.pop        item.setdefault item.values
item.copy       item.get        item.items      item.iterkeys   item.keys       item.popitem    item.update
In []:
The spider object, on the contrary, returns information from the file C:\Users\kiss\Documents\GitHub\dirbot_se1\dirbot\spiders\dmoz.py even though I never ran it
In []:
In [45]: spider.
spider.allowed_domains        spider.log                    spider.parse                  spider.start_requests
spider.crawler                spider.make_requests_from_url spider.set_crawler            spider.start_urls
spider.handles_request        spider.name                   spider.settings               spider.state
In []:
In [52]: spider.allowed_domains
Out[52]: ['dmoz.org']

In [53]: spider.start_urls
Out[53]:
['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
 'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']
In []:
In [48]: spider.settings.
spider.settings.defaults        spider.settings.getdict         spider.settings.getlist         spider.settings.settings_module
spider.settings.get             spider.settings.getfloat        spider.settings.global_defaults spider.settings.values
spider.settings.getbool         spider.settings.getint          spider.settings.overrides
In []:
In [49]: spider.settings.defaults
Out[49]: {'KEEP_ALIVE': True, 'LOGSTATS_INTERVAL': 0}

In [50]: spider.settings.values
Out[50]: {}

In [51]: spider.settings.global_defaults
Out[51]: <module 'scrapy.settings.default_settings' from 'C:\Users\kiss\Anaconda\lib\site-packages\scrapy\settings\default_settings.pyc'>
In []:
In [57]: spider.settings.overrides.
    
spider.settings.overrides.clear      spider.settings.overrides.iteritems  spider.settings.overrides.setdefault
spider.settings.overrides.copy       spider.settings.overrides.iterkeys   spider.settings.overrides.update
spider.settings.overrides.fromkeys   spider.settings.overrides.itervalues spider.settings.overrides.values
spider.settings.overrides.get        spider.settings.overrides.keys       spider.settings.overrides.viewitems
spider.settings.overrides.has_key    spider.settings.overrides.pop        spider.settings.overrides.viewkeys
spider.settings.overrides.items      spider.settings.overrides.popitem    spider.settings.overrides.viewvalues
So the spider's settings expose almost the same reconfiguration methods as the settings object.

Where the Settings object comes from

In the folder C:\Users\kiss\Anaconda\Lib\site-packages\scrapy\settings\ there is an __init__.py file that defines the Settings class with all its methods... and the constants are imported from default_settings.py in the same folder.
In [1]:
%load "C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\scrapy\\settings\\__init__.py"
In []:
import json
from . import default_settings


class Settings(object):

    def __init__(self, values=None):
        self.values = values.copy() if values else {}
        self.global_defaults = default_settings

    def __getitem__(self, opt_name):
        if opt_name in self.values:
            return self.values[opt_name]
        return getattr(self.global_defaults, opt_name, None)

    def get(self, name, default=None):
        return self[name] if self[name] is not None else default

    def getbool(self, name, default=False):
        """
        True is: 1, '1', True
        False is: 0, '0', False, None
        """
        return bool(int(self.get(name, default)))

    def getint(self, name, default=0):
        return int(self.get(name, default))

    def getfloat(self, name, default=0.0):
        return float(self.get(name, default))

    def getlist(self, name, default=None):
        value = self.get(name)
        if value is None:
            return default or []
        elif hasattr(value, '__iter__'):
            return value
        else:
            return str(value).split(',')

    def getdict(self, name, default=None):
        value = self.get(name)
        if value is None:
            return default or {}
        if isinstance(value, basestring):
            value = json.loads(value)
        if isinstance(value, dict):
            return value
        raise ValueError("Cannot convert value for setting '%s' to dict: '%s'" % (name, value))

class CrawlerSettings(Settings):

    def __init__(self, settings_module=None, **kw):
        super(CrawlerSettings, self).__init__(**kw)
        self.settings_module = settings_module
        self.overrides = {}
        self.defaults = {}

    def __getitem__(self, opt_name):
        if opt_name in self.overrides:
            return self.overrides[opt_name]
        if self.settings_module and hasattr(self.settings_module, opt_name):
            return getattr(self.settings_module, opt_name)
        if opt_name in self.defaults:
            return self.defaults[opt_name]
        return super(CrawlerSettings, self).__getitem__(opt_name)

    def __str__(self):
        return "<CrawlerSettings module=%r>" % self.settings_module


def iter_default_settings():
    """Return the default settings as an iterator of (name, value) tuples"""
    for name in dir(default_settings):
        if name.isupper():
            yield name, getattr(default_settings, name)

def overridden_settings(settings):
    """Return a dict of the settings that have been overridden"""
    for name, defvalue in iter_default_settings():
        value = settings[name]
        if not isinstance(defvalue, dict) and value != defvalue:
            yield name, value
In []:
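A minimal standalone sketch of the lookup order implemented by CrawlerSettings above (the fake module and its values are made up for illustration):

from scrapy.settings import CrawlerSettings

class FakeSettingsModule(object):
    BOT_NAME = 'demo-bot'              # stands in for a project's settings.py

s = CrawlerSettings(settings_module=FakeSettingsModule)
print s['BOT_NAME']                    # 'demo-bot'    - found in the settings module
print s['USER_AGENT']                  # falls through to scrapy.settings.default_settings
s.overrides['BOT_NAME'] = 'another-bot'
print s['BOT_NAME']                    # 'another-bot' - overrides take precedence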




1 comment:

  1. Invoking the shell from spiders to inspect responses
    http://doc.scrapy.org/en/latest/topics/shell.html#invoking-the-shell-from-spiders-to-inspect-responses

    This can be achieved by using the scrapy.shell.inspect_response function.

    Here’s an example of how you would call it from your spider:

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = [
            "http://example.com",
            "http://example.org",
            "http://example.net",
        ]

        def parse(self, response):
            # We want to inspect one specific response.
            if ".org" in response.url:
                from scrapy.shell import inspect_response
                inspect_response(response)

            # Rest of parsing code.
