In the previous post we figured out how the methods of the Settings class are defined. In this one we will see that the object can be indexed like a dictionary: settings['USER_AGENT']... We will read how the default settings DEFAULT_REQUEST_HEADERS and DOWNLOADER_MIDDLEWARES_BASE are overridden, move on to middleware..., and at the end copy-paste the code from the article Using random user agent in Scrapy.
In []:
In [73]: settings['USER_AGENT']
Out[73]: 'Scrapy/0.20.1 (+http://scrapy.org)'
In []:
Default:
{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
The default headers used for Scrapy HTTP Requests. They're populated in the DefaultHeadersMiddleware.
In [2]:
%load "C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\scrapy\\contrib\\downloadermiddleware\\useragent.py"
In []:
"""Set User-Agent header per spider or use a default value from settings"""
from scrapy import signals
class UserAgentMiddleware(object):
"""This middleware allows spiders to override the user_agent"""
def __init__(self, user_agent='Scrapy'):
self.user_agent = user_agent
@classmethod
def from_crawler(cls, crawler):
o = cls(crawler.settings['USER_AGENT'])
crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
return o
def spider_opened(self, spider):
self.user_agent = getattr(spider, 'user_agent', self.user_agent)
def process_request(self, request, spider):
if self.user_agent:
request.headers.setdefault('User-Agent', self.user_agent)
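The spider_opened/process_request logic above can be re-enacted without Scrapy itself. FakeSpider and FakeRequest below are hypothetical stand-ins, just to show how a per-spider user_agent attribute wins over the settings value:

```python
# Re-enactment of the UserAgentMiddleware logic above, without Scrapy;
# FakeSpider and FakeRequest are hypothetical stand-ins for illustration.
class FakeSpider:
    user_agent = 'MyBot/1.0'  # per-spider override, as spider_opened expects


class FakeRequest:
    def __init__(self):
        self.headers = {}


default_ua = 'Scrapy/0.20.1 (+http://scrapy.org)'  # what settings['USER_AGENT'] gave us

# spider_opened step: the spider attribute, if present, replaces the default
ua = getattr(FakeSpider, 'user_agent', default_ua)

# process_request step: setdefault fills the header only if it is not set yet
req = FakeRequest()
req.headers.setdefault('User-Agent', ua)
print(req.headers['User-Agent'])  # MyBot/1.0
```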
In []:
class scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware
This middleware sets all default request headers specified in the DEFAULT_REQUEST_HEADERS setting.
In [3]:
%load "C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\scrapy\\contrib\\downloadermiddleware\\defaultheaders.py"
In []:
"""
DefaultHeaders downloader middleware
See documentation in docs/topics/downloader-middleware.rst
"""
class DefaultHeadersMiddleware(object):
def __init__(self, headers):
self._headers = headers
@classmethod
def from_crawler(cls, crawler):
return cls(crawler.settings.get('DEFAULT_REQUEST_HEADERS').items())
def process_request(self, request, spider):
for k, v in self._headers:
request.headers.setdefault(k, v)
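Note that process_request uses setdefault, so a header set explicitly on a Request is never overwritten by the defaults. A plain-dict sketch of that behaviour:

```python
# Default headers, in the shape DEFAULT_REQUEST_HEADERS.items() returns them
defaults = [
    ('Accept-Language', 'en'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]

# A request that already carries its own Accept-Language
headers = {'Accept-Language': 'de'}

for k, v in defaults:
    headers.setdefault(k, v)  # fills in only the missing keys

print(headers['Accept-Language'])  # de -- the explicit value survived
print('Accept' in headers)         # True -- the missing default was added
```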
Let's take settings.get('DEFAULT_REQUEST_HEADERS').items() from the code above and execute it
In []:
In [64]: settings.get('DEFAULT_REQUEST_HEADERS').items()
Out[64]:
[('Accept-Language', 'en'),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
Let's remember this command and try it with 'DOWNLOADER_MIDDLEWARES_BASE'
In []:
In [65]: settings.get('DOWNLOADER_MIDDLEWARES_BASE').items()
Out[65]:
[('scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware', 400),
('scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware', 700),
('scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware', 350),
('scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware', 600),
('scrapy.contrib.downloadermiddleware.retry.RetryMiddleware', 500),
('scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware', 300),
('scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware', 100),
('scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware', 900),
('scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware', 830),
('scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware', 750),
('scrapy.contrib.downloadermiddleware.stats.DownloaderStats', 850),
('scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware', 590),
('scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware', 580),
('scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware', 550)]
In []:
In [71]: settings.get('DOWNLOADER_MIDDLEWARES').items()
Out[71]: []
The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the first middleware is the one closer to the engine and the last is the one closer to the downloader.
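The merge-then-sort step can be sketched with plain dicts (the names below are only a subset; DOWNLOADER_MIDDLEWARES_BASE has many more entries, as listed above):

```python
# A subset of DOWNLOADER_MIDDLEWARES_BASE (middleware path -> order)
base = {
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
}
# The project's DOWNLOADER_MIDDLEWARES: adds one middleware, disables one with None
user = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

merged = dict(base)
merged.update(user)  # project settings override the base

# Drop the None entries, then sort: the lowest order is closest to the engine
enabled = sorted(
    (name for name, order in merged.items() if order is not None),
    key=merged.get,
)
for name in enabled:
    print(merged[name], name)
```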
To decide which order to assign to your middleware see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
If you want to disable a built-in middleware (the ones defined in DOWNLOADER_MIDDLEWARES_BASE and enabled by default) you must define it in your project’s DOWNLOADER_MIDDLEWARES setting and assign None as its value. For example, if you want to disable the user-agent middleware:
In []:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
Don't set 'USER_AGENT' in settings.py; instead, add the following to it.
In []:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Note: Crawler is the project name, comm is a folder in the root directory.
Create rotate_useragent.py and put it into the comm folder.
In []:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
    # For more user-agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
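A quick sanity check that random.choice really rotates over the list (short placeholder strings stand in for the full user-agent list):

```python
import random

# Placeholder agents instead of the full list above
user_agent_list = ["UA-Chrome", "UA-Firefox", "UA-Opera"]

# Every pick must come from the list, and over 200 draws we expect
# to see more than one distinct agent.
picks = {random.choice(user_agent_list) for _ in range(200)}
print(picks <= set(user_agent_list))  # True
print(len(picks) > 1)                 # True (overwhelmingly likely)
```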
Tried it on the spider at C:\Users\kiss\Documents\GitMyScrapy\scrapy_xml_1\XMLFeedSpider. Everything works.
Added to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'XMLFeedSpider.middleware.RotateUserAgentMiddleware': 400,
}
Created a middleware.py module and copied into it the code from http://tangww.com/2013/06/UsingRandomAgent/
Ran it from the console:
scrapy crawl proxylists
with the debugger breakpoint line in proxylists.py:
ipdb.set_trace()
Found the request headers in the debugger:
ipdb> response.request.headers
{'Accept-Language': ['en'],
 'Accept-Encoding': ['x-gzip,gzip,deflate'],
 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
 'User-Agent': ['Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5']}
ipdb>
Then once more:
ipdb> response.request.headers
{'Accept-Language': ['en'],
 'Accept-Encoding': ['x-gzip,gzip,deflate'],
 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
 'User-Agent': ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3']}
Still have to finish testing... the random choice...