Here we learn to work with the shell. First, using dirbot as the example (scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"), we look at the request, response and settings objects and print things like settings.overrides or spider.settings.overrides. Then we find the settings package and print its __init__.py, where all the methods of the Settings class are defined.
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"¶
We go to the project's root folder and start the fetch from the shell. Even though we did not touch the project files explicitly, Scrapy uses them to create its objects (response, ...); examples are below.
In [*]:
C:\Users\kiss\Anaconda>cd C:\Users\kiss\Documents\GitHub\dirbot
C:\Users\kiss\Documents\GitHub\dirbot>scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
2014-07-12 08:31:52+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapybot)
2014-07-12 08:31:52+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-12 08:31:52+0400 [scrapy] DEBUG: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders'], 'LOGSTATS_INTERVAL': 0}
2014-07-12 08:31:56+0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\pipeline\__init__.py:21: ScrapyDeprecationWarning: ITEM_PIPELINES defined as a list or a set is deprecated, switch to a dict
  category=ScrapyDeprecationWarning, stacklevel=1)
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Enabled item pipelines: FilterWordsPipeline
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-12 08:31:59+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-12 08:31:59+0400 [dmoz] INFO: Spider opened
2014-07-12 08:32:00+0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] sel <Selector xpath=None data=u'<html lang="en">\r\n<head>\r\n<meta http-equ'>
[s] settings <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s] spider <DmozSpider 'dmoz' at 0x4931f60>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
C:\Users\kiss\Anaconda\lib\site-packages\IPython\frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated.
All its subpackages have been moved to the top `IPython` level.
warn("The top-level `frontend` package has been deprecated. "
In []:
If you type a dot (.) after request and press the TAB key
In []:
In [58]: request.
request.body request.copy request.errback request.method request.url
request.callback request.dont_filter request.headers request.priority
request.cookies request.encoding request.meta request.replace
In [58]: request.headers
Out[58]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'x-gzip,gzip,deflate',
'Accept-Language': 'en',
'User-Agent': 'Scrapy/0.20.1 (+http://scrapy.org)'}
In [59]: request.cookies
Out[59]: {}
In [60]: request.url
Out[60]: 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
In [61]: request.meta
Out[61]:
{'depth': 0,
'download_latency': 0.7599999904632568,
'download_slot': 'www.dmoz.org',
'download_timeout': 180,
'handle_httpstatus_all': True}
In [62]: request.headers.
request.headers.appendlist request.headers.has_key request.headers.normvalue request.headers.update
request.headers.clear request.headers.items request.headers.pop request.headers.values
request.headers.copy request.headers.iteritems request.headers.popitem request.headers.viewitems
request.headers.encoding request.headers.iterkeys request.headers.setdefault request.headers.viewkeys
request.headers.fromkeys request.headers.itervalues request.headers.setlist request.headers.viewvalues
request.headers.get request.headers.keys request.headers.setlistdefault
request.headers.getlist request.headers.normkey request.headers.to_string
In []:
In [1]: response.
response.body response.encoding response.meta response.status
response.body_as_unicode response.flags response.replace response.url
response.copy response.headers response.request
In []:
In [1]: response.url
Out[1]: 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
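Besides request and response, the shell also exposes the sel selector listed among the available objects above. A minimal sketch of how it could be used on this page; the XPath expressions and results are illustrative, not taken from the actual session:

# Run inside the same shell session; `sel` wraps the downloaded response.
sel.xpath('//title/text()').extract()    # page <title> as a list of unicode strings
sel.xpath('//a/@href').extract()[:5]     # the first few link targets on the page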
This is how you can look at the response headers
In []:
In [2]: response.headers
Out[2]:
{'Content-Language': 'en',
'Content-Type': 'text/html;charset=UTF-8',
'Cteonnt-Length': '33758',
'Date': 'Sat, 12 Jul 2014 04:32:05 GMT',
'Server': 'Apache',
'Set-Cookie': 'JSESSIONID=1B121BE4D4FF05F76826CBB0D1491910; Path=/'}
In [3]: request.headers
Out[3]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'x-gzip,gzip,deflate',
'Accept-Language': 'en',
'User-Agent': 'Scrapy/0.20.1 (+http://scrapy.org)'}
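Individual header values can also be read with get(); a quick sketch, with the expected values taken from the dumps above:

response.headers.get('Content-Type')    # 'text/html;charset=UTF-8'
request.headers.get('User-Agent')       # 'Scrapy/0.20.1 (+http://scrapy.org)'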
Let's look at the headers again (in the request object they should be the same)
In []:
request.headers.getlist
<bound method Headers.getlist of {'Accept-Language': ['en'], 'Accept-Encoding': ['x-gzip,gzip,deflate'], 'Accept':
application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Scrapy/0.20.1 (+http://scrapy.org)']}>
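getlist() returns every value of a (possibly multi-valued) header as a list; a short sketch based on the dump above (the second header name is made up to show the miss case):

request.headers.getlist('Accept-Language')    # ['en']
request.headers.getlist('No-Such-Header')     # [] -- a missing header should give an empty list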
In []:
In [5]: settings.
settings.defaults settings.getdict settings.getlist settings.settings_module
settings.get settings.getfloat settings.global_defaults settings.values
settings.getbool settings.getint settings.overrides
In []:
In [73]: settings['USER_AGENT']
Out[73]: 'Scrapy/0.20.1 (+http://scrapy.org)'
In []:
settings.getlist
In []:
In [24]: settings.getlist
Out[24]: <bound method CrawlerSettings.getlist of <scrapy.settings.CrawlerSettings object at 0x00000000040C0A20>>
In [25]: settings.getlist.   # press TAB again
settings.getlist.im_class settings.getlist.im_func settings.getlist.im_self
In [25]: settings.getlist.im_self   # pick one at random
Out[25]: <scrapy.settings.CrawlerSettings at 0x40c0a20>
In [26]: settings.getlist.im_self.
settings.getlist.im_self.defaults settings.getlist.im_self.getfloat settings.getlist.im_self.overrides
settings.getlist.im_self.get settings.getlist.im_self.getint settings.getlist.im_self.settings_module
settings.getlist.im_self.getbool settings.getlist.im_self.getlist settings.getlist.im_self.values
settings.getlist.im_self.getdict settings.getlist.im_self.global_defaults
In [26]: settings.getlist.im_self.defaults   # and pick at random again
Out[26]: {'KEEP_ALIVE': True, 'LOGSTATS_INTERVAL': 0}
Obviously this is tedious, so we turn to the Requests and Responses documentation. Clearly, this is where to look at redirects, flags, and everything else related to a page fetch (request) that has already happened.
In []:
In [27]: shelp()
[s] Available Scrapy objects:
[s] _25 <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s] __ <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] sel <Selector xpath=None data=u'<html lang="en">\r\n<head>\r\n<meta http-equ'>
[s] settings <CrawlerSettings module=<module 'dirbot.settings' from 'dirbot\settings.pyc'>>
[s] spider <DmozSpider 'dmoz' at 0x4931f60>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Since I already know there is a fetch command... can a new header (User-Agent) be attached to it? Or should the new header first be set in settings, and only then fetch?
In [*]:
In [29]: fetch.
fetch.im_class fetch.im_func fetch.im_self
In []:
# Trying settings.overrides
In [29]: settings.overrides.
settings.overrides.clear settings.overrides.items settings.overrides.pop settings.overrides.viewitems
settings.overrides.copy settings.overrides.iteritems settings.overrides.popitem settings.overrides.viewkeys
settings.overrides.fromkeys settings.overrides.iterkeys settings.overrides.setdefault settings.overrides.viewvalues
settings.overrides.get settings.overrides.itervalues settings.overrides.update
settings.overrides.has_key settings.overrides.keys settings.overrides.values
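Both routes from the question above can be tried right in the shell. A sketch, not verified against this session: overriding USER_AGENT in settings.overrides may not be picked up by the already-initialized UserAgentMiddleware, so attaching the header to the Request itself looks like the more reliable path; the User-Agent string used here is just an example.

from scrapy.http import Request

# Option 1: override the setting, then re-fetch.
settings.overrides['USER_AGENT'] = 'Mozilla/5.0 (compatible; test-bot)'
fetch(request.url)

# Option 2: attach the header to the request itself and fetch that request.
req = Request(request.url, headers={'User-Agent': 'Mozilla/5.0 (compatible; test-bot)'})
fetch(req)
request.headers.get('User-Agent')    # check what was actually sent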
Obviously, the project file settings.py holds many other settings (see "Populating the settings" in the docs). What exactly is best to change in the settings files?
Settings can be populated using different mechanisms, each of which has a different precedence. Here is the list of them in decreasing order of precedence:
The population of these settings sources is taken care of internally, but manual handling is possible using API calls; see the Settings API topic for reference.
- Command line options (most precedence)
- Project settings module
- Default settings per-command
- Default global settings (less precedence)
These mechanisms are described in more detail below.
Command line options
Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option.
Example:
scrapy crawl myspider -s LOG_FILE=scrapy.log
Project settings module
The project settings module is the standard configuration file for your Scrapy project. It’s where most of your custom settings will be populated. For example:: myproject.settings.
Default settings per-command
Each Scrapy tool command can have its own default settings, which override the global default settings. Those custom command settings are specified in the default_settings attribute of the command class.
Default global settings
The global defaults are located in the scrapy.settings.default_settings module and documented in the Built-in settings reference section.
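To make the precedence concrete, a hypothetical check (the agent string is made up): a value passed with -s should win over the project settings.py, which in turn wins over scrapy.settings.default_settings.

# scrapy shell -s USER_AGENT="my-test-agent" "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# Inside the shell that opens, the command-line value should be the effective one:
settings['USER_AGENT']    # 'my-test-agent' instead of the default 'Scrapy/0.20.1 (+http://scrapy.org)'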
In [1]:
# Here is an example settings file for rotating a proxy list (not working yet)
%load dirbot\settings.py
In []:
# Scrapy settings for dirbot project

SPIDER_MODULES = ['dirbot.spiders']
NEWSPIDER_MODULE = 'dirbot.spiders'
DEFAULT_ITEM_CLASS = 'dirbot.items.Website'
ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']

##############################################
# NB: this breakpoint pauses Scrapy every time the settings module is imported
import pdb; pdb.set_trace()

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    # Fix path to this module
    'dirbot.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = 'dirbot/list.txt'
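Incidentally, the startup log above warned that ITEM_PIPELINES defined as a list is deprecated. A minimal sketch of the dict form it asks for; the order value 300 is just a conventional example, not something taken from the project:

# ITEM_PIPELINES as a dict: the value is the pipeline order (lower numbers run first).
ITEM_PIPELINES = {
    'dirbot.pipelines.FilterWordsPipeline': 300,
}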
At the same time, an attempt to print the dictionary of settings values shows that it is empty. Is that because I did not run the spider, but used the shell from the project folder?
In []:
In [39]: settings.values
Out[39]: {}
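Judging by the CrawlerSettings source quoted further below, values only holds settings passed in explicitly; the project settings live in settings_module and are reached through item access or get(). A quick sketch, with expected results taken from the "Overridden settings" line of the startup log:

settings.settings_module           # <module 'dirbot.settings' ...>
settings['DEFAULT_ITEM_CLASS']     # 'dirbot.items.Website'
settings.get('NEWSPIDER_MODULE')   # 'dirbot.spiders'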
The item object also shows nothing yet
In []:
In [43]: item.
item.clear item.fields item.has_key item.iteritems item.itervalues item.pop item.setdefault item.values
item.copy item.get item.items item.iterkeys item.keys item.popitem item.update
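item is an empty instance of the project's default item class (DEFAULT_ITEM_CLASS = 'dirbot.items.Website'), so it prints as {} until fields are assigned. A sketch, assuming Website declares the usual dirbot fields (name, url, description):

item.fields                   # the fields declared on the Website item
item['url'] = response.url    # assign a declared field
item                          # now prints as a populated dict-like object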
In []:
The spider object, on the other hand, returns information from the file C:\Users\kiss\Documents\GitHub\dirbot_se1\dirbot\spiders\dmoz.py even though I never ran it
In []:
In [45]: spider.
spider.allowed_domains spider.log spider.parse spider.start_requests
spider.crawler spider.make_requests_from_url spider.set_crawler spider.start_urls
spider.handles_request spider.name spider.settings spider.state
In []:
In [52]: spider.allowed_domains
Out[52]: ['dmoz.org']
In [53]: spider.start_urls
Out[53]:
['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']
In []:
In [48]: spider.settings.
spider.settings.defaults spider.settings.getdict spider.settings.getlist spider.settings.settings_module
spider.settings.get spider.settings.getfloat spider.settings.global_defaults spider.settings.values
spider.settings.getbool spider.settings.getint spider.settings.overrides
In []:
In [49]: spider.settings.defaults
Out[49]: {'KEEP_ALIVE': True, 'LOGSTATS_INTERVAL': 0}
In [50]: spider.settings.values
Out[50]: {}
In [51]: spider.settings.global_defaults
Out[51]: <module 'scrapy.settings.default_settings' from 'C:\Users\kiss\Anaconda\lib\site-packages\scrapy\settings\default_settings.
pyc'>
In []:
In [57]: spider.settings.overrides.
spider.settings.overrides.clear spider.settings.overrides.iteritems spider.settings.overrides.setdefault
spider.settings.overrides.copy spider.settings.overrides.iterkeys spider.settings.overrides.update
spider.settings.overrides.fromkeys spider.settings.overrides.itervalues spider.settings.overrides.values
spider.settings.overrides.get spider.settings.overrides.keys spider.settings.overrides.viewitems
spider.settings.overrides.has_key spider.settings.overrides.pop spider.settings.overrides.viewkeys
spider.settings.overrides.items spider.settings.overrides.popitem spider.settings.overrides.viewvalues
So, the spider's settings expose almost the same override methods as the settings object.
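That similarity is probably no coincidence: spider.settings should be the very same CrawlerSettings object the shell exposes as settings, though I have not verified it in this session. A one-line check:

spider.settings is settings    # expected True if both point at the crawler's settings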
Where the Settings object comes from
In the folder C:\Users\kiss\Anaconda\Lib\site-packages\scrapy\settings there is a file __init__.py, which defines the Settings class with all its methods... and the constants are imported from default_settings.py in the same folder.
In [1]:
%load "C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\scrapy\\settings\\__init__.py"
In []:
import json

from . import default_settings


class Settings(object):

    def __init__(self, values=None):
        self.values = values.copy() if values else {}
        self.global_defaults = default_settings

    def __getitem__(self, opt_name):
        if opt_name in self.values:
            return self.values[opt_name]
        return getattr(self.global_defaults, opt_name, None)

    def get(self, name, default=None):
        return self[name] if self[name] is not None else default

    def getbool(self, name, default=False):
        """
        True is: 1, '1', True
        False is: 0, '0', False, None
        """
        return bool(int(self.get(name, default)))

    def getint(self, name, default=0):
        return int(self.get(name, default))

    def getfloat(self, name, default=0.0):
        return float(self.get(name, default))

    def getlist(self, name, default=None):
        value = self.get(name)
        if value is None:
            return default or []
        elif hasattr(value, '__iter__'):
            return value
        else:
            return str(value).split(',')

    def getdict(self, name, default=None):
        value = self.get(name)
        if value is None:
            return default or {}
        if isinstance(value, basestring):
            value = json.loads(value)
        if isinstance(value, dict):
            return value
        raise ValueError("Cannot convert value for setting '%s' to dict: '%s'" % (name, value))


class CrawlerSettings(Settings):

    def __init__(self, settings_module=None, **kw):
        super(CrawlerSettings, self).__init__(**kw)
        self.settings_module = settings_module
        self.overrides = {}
        self.defaults = {}

    def __getitem__(self, opt_name):
        if opt_name in self.overrides:
            return self.overrides[opt_name]
        if self.settings_module and hasattr(self.settings_module, opt_name):
            return getattr(self.settings_module, opt_name)
        if opt_name in self.defaults:
            return self.defaults[opt_name]
        return super(CrawlerSettings, self).__getitem__(opt_name)

    def __str__(self):
        return "<CrawlerSettings module=%r>" % self.settings_module


def iter_default_settings():
    """Return the default settings as an iterator of (name, value) tuples"""
    for name in dir(default_settings):
        if name.isupper():
            yield name, getattr(default_settings, name)


def overridden_settings(settings):
    """Return a dict of the settings that have been overridden"""
    for name, defvalue in iter_default_settings():
        value = settings[name]
        if not isinstance(defvalue, dict) and value != defvalue:
            yield name, value
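The lookup order implemented in CrawlerSettings.__getitem__ above (overrides first, then the project settings module, then defaults, then the global defaults) can be exercised on its own. A standalone sketch, assuming it is run from the project root so that dirbot.settings is importable; the 'dirbot.items.Other' value is made up:

from scrapy.settings import CrawlerSettings
import dirbot.settings as project_settings

s = CrawlerSettings(settings_module=project_settings)
s['BOT_NAME']               # no override and not in the project module -> global default 'scrapybot'
s['DEFAULT_ITEM_CLASS']     # found in the project module: 'dirbot.items.Website'

s.overrides['DEFAULT_ITEM_CLASS'] = 'dirbot.items.Other'
s['DEFAULT_ITEM_CLASS']     # overrides win: 'dirbot.items.Other'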
In []:
Invoking the shell from spiders to inspect responses
http://doc.scrapy.org/en/latest/topics/shell.html#invoking-the-shell-from-spiders-to-inspect-responses
This can be achieved by using the scrapy.shell.inspect_response function.
Here’s an example of how you would call it from your spider:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response)

        # Rest of parsing code.