Yesterday, first thing in the morning, I decided to simply "finish reading" the Scrapy documentation. Here I want to dwell on two "discoveries" that helped me feel that I am starting to understand something: the first is the architecture of the Scrapy engine, the second is the folder structure of Scrapy projects. All the files from those folders are collected here.
Links and primary sources
[Scrapy 0.22 documentation](http://doc.scrapy.org/en/latest/)
[Architecture overview](http://doc.scrapy.org/en/latest/topics/architecture.html)
[Creating a project](http://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project)
[Default structure of Scrapy projects](http://doc.scrapy.org/en/latest/topics/commands.html#default-structure-of-scrapy-projects)
[Downloader Middleware](http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#topics-downloader-middleware)
[DOWNLOADER_MIDDLEWARES_BASE settings](http://doc.scrapy.org/en/latest/topics/settings.html#std:setting-DOWNLOADER_MIDDLEWARES_BASE)
[Selectors](http://doc.scrapy.org/en/latest/topics/selectors.html)
[Installing Python Modules](https://docs.python.org/2/install/)
[Writing the Setup Script](https://docs.python.org/2.7/distutils/setupscript.html)
[Modules](https://docs.python.org/2/tutorial/modules.html)
Architecture overview
To understand the terminology you need the picture below. On the whole it seems simple: for example, middlewares come in two kinds, and at the moment we are interested in plugging in a proxy server, i.e. in a "Downloader Middleware". A diligent reading of the manual, however, loads the brain with endless details without giving a clear picture of how to actually "plug in" a proxy; a minimal sketch of one possible approach follows the diagram.
In [5]:
from IPython.display import Image,HTML
In [9]:
Image('http://doc.scrapy.org/en/latest/_images/scrapy_architecture.png')
Out[9]:
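While the manual does not spell out a proxy recipe in one place, the architecture itself suggests where it goes: a downloader middleware sees every request before the downloader does. Below is only a minimal sketch of that idea, not code from dirbot or the docs; the class name, module path and proxy address are all made up, and it is Scrapy's built-in HttpProxyMiddleware that actually honours request.meta['proxy'].
In []:
# A hypothetical downloader middleware (not part of dirbot or the tutorial).
# Setting request.meta['proxy'] lets Scrapy's HttpProxyMiddleware route the
# request through the given proxy.


class CustomProxyMiddleware(object):
    proxy_url = 'http://127.0.0.1:8118'  # placeholder address

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url

# To enable it, something like this would go into settings.py
# (the number only defines the middleware ordering):
# DOWNLOADER_MIDDLEWARES = {'tutorial.middlewares.CustomProxyMiddleware': 543}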
Scrapy project folder structure (Creating a project)
Pick (any) folder on the computer, open it in the console, and run the command:
In []:
scrapy startproject tutorial
And we get a ready-made folder structure. Then we edit the files a little; an example follows here.
Creating a project
Default structure of Scrapy projects
In []:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
In []:
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
These are basically:
scrapy.cfg: the project configuration file
tutorial/: the project’s python module, you’ll later import your code from here.
tutorial/items.py: the project’s items file.
tutorial/pipelines.py: the project’s pipelines file.
tutorial/settings.py: the project’s settings file.
tutorial/spiders/: a directory where you’ll later put your spiders.
scrapy.cfg
The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name of the python module that defines the project settings. Here is an example:
In []:
[settings]
default = myproject.settings
Folder structure of the dirbot project (I downloaded the example code from GitHub):
In [26]:
!chcp 65001
!dir C:\Users\kiss\Documents\GitHub\dirbot
In [11]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\scrapy.cfg'
In []:
[settings]
default = dirbot.settings
So what is this setup.py file?
In [12]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\setup.py'
In []:
from setuptools import setup, find_packages

setup(
    name='dirbot',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = dirbot.settings']},
)
This is the file for installing the package (its modules). Installing Python Modules
Writing the Setup Script
You open the folder in the console and run "python setup.py install"... But we do not need that process here. Why install modules at all? Probably so that they can be imported from anywhere (they have to be on the PYTHONPATH)... Something to think about later... and indeed, the "How installation works" section says exactly that:
If you don’t choose an installation directory—i.e., if you just run setup.py install—then the install command installs to the standard location for third-party Python modules.
This location varies by platform and by how you built/installed Python itself. On Unix (and Mac OS X, which is also Unix-based), it also depends on whether the module distribution being installed is pure Python or contains extensions (“non-pure”):
| Platform | Standard installation location | Default value | Notes |
|---|---|---|---|
| Unix (pure) | prefix/lib/pythonX.Y/site-packages | /usr/local/lib/pythonX.Y/site-packages | (1) |
| Unix (non-pure) | exec-prefix/lib/pythonX.Y/site-packages | /usr/local/lib/pythonX.Y/site-packages | (1) |
| Windows | prefix\Lib\site-packages | C:\PythonXY\Lib\site-packages | (2) |
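If the point of installation is simply that the package ends up on the interpreter's search path, this is easy to check: import it and see where it was loaded from. A small sketch of that check (it assumes dirbot really has been installed with "python setup.py install"):
In []:
# A sketch: after "python setup.py install" the package is importable from
# anywhere, because it now sits in one of the sys.path directories.
import sys
import dirbot  # assumes dirbot has actually been installed

print dirbot.__file__  # e.g. ...\site-packages\dirbot\__init__.pyc
print [p for p in sys.path if 'site-packages' in p]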
The (package) root directory also contains readme.rst
In [16]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\readme.rst'
In []:
======
dirbot
======
This is a Scrapy project to scrape websites from public web directories.
This project is only meant for educational purposes.
Items
=====
The items scraped by this project are websites, and the item is defined in the
class::
    dirbot.items.Website
See the source code for more details.
Spiders
=======
This project contains one spider called ``dmoz`` that you can see by running::
    scrapy list
Spider: dmoz
------------
The ``dmoz`` spider scrapes the Open Directory Project (dmoz.org), and it's
based on the dmoz spider described in the `Scrapy tutorial`_
This spider doesn't crawl the entire dmoz.org site but only a few pages by
default (defined in the ``start_pages`` attribute). These pages are:
* http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
* http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
So, if you run the spider regularly (with ``scrapy crawl dmoz``) it will scrape
only those two pages.
.. _Scrapy tutorial: http://doc.scrapy.org/intro/tutorial.html
Pipelines
=========
This project uses a pipeline to filter out websites containing certain
forbidden words in their description. This pipeline is defined in the class::
    dirbot.pipelines.FilterWordsPipeline
In the dirbot folder all the files are named exactly as in the manual
In [27]:
!dir C:\Users\kiss\Documents\GitHub\dirbot\dirbot
In [28]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\dirbot\\items.py'
In []:
from scrapy.item import Item, Field


class Website(Item):
    name = Field()
    description = Field()
    url = Field()
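As far as I understand the Items docs, a Website instance behaves like a dict restricted to the three declared fields; here is a tiny sketch with made-up values, just to keep that in mind:
In []:
# A sketch: an Item is created and read back like a dict, but only the
# declared fields (name, description, url) are accepted.
from dirbot.items import Website

item = Website(name=[u'Example site'], url=[u'http://example.com'])
item['description'] = [u'a made-up description']

print item['name'], item['url']
print dict(item)  # converts cleanly to a plain dict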
In [29]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\dirbot\\pipelines.py'
In []:
from scrapy.exceptions import DropItem


class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            return item
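To make sure I read process_item correctly (raise DropItem on a forbidden word, otherwise return the item once the loop finishes), here is a sketch that calls the pipeline by hand, outside of any crawl; both sample items are invented:
In []:
# A sketch: exercising FilterWordsPipeline directly, without running a spider.
from scrapy.exceptions import DropItem
from dirbot.items import Website
from dirbot.pipelines import FilterWordsPipeline

pipeline = FilterWordsPipeline()
good = Website(name=[u'ok'], description=[u'a page about snakes'], url=[u'http://example.com'])
bad = Website(name=[u'no'], description=[u'a page about politics'], url=[u'http://example.org'])

print pipeline.process_item(good, spider=None)  # comes back unchanged
try:
    pipeline.process_item(bad, spider=None)
except DropItem as e:
    print e  # Contains forbidden word: politics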
In [30]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\dirbot\\settings.py'
In []:
# Scrapy settings for dirbot project
SPIDER_MODULES = ['dirbot.spiders']
NEWSPIDER_MODULE = 'dirbot.spiders'
DEFAULT_ITEM_CLASS = 'dirbot.items.Website'
ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']
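These values are what the scrapy command line picks up once scrapy.cfg (or the setup.py entry point) has pointed it at dirbot.settings. A minimal sketch of reading them programmatically, assuming the code is run from somewhere inside the project directory:
In []:
# A sketch: loading the project settings the same way the scrapy command does.
# Assumes the working directory is inside the dirbot project, so that
# scrapy.cfg can be found by walking up the directory tree.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print settings['NEWSPIDER_MODULE']  # 'dirbot.spiders'
print settings['ITEM_PIPELINES']    # ['dirbot.pipelines.FilterWordsPipeline']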
And now let's look at the spiders in the spiders/ folder.
In [31]:
!dir C:\Users\kiss\Documents\GitHub\dirbot\dirbot\spiders
In [32]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\dirbot\\spiders\\dmoz.py'
In []:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dirbot.items import Website


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.select('a/text()').extract()
            item['url'] = site.select('a/@href').extract()
            item['description'] = site.select('text()').re('-\s([^\n]*?)\\n')
            items.append(item)

        return items
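The real work in parse() is done by the XPath expressions, so here is a sketch that applies exactly the same selectors to a small hand-written HTML snippet wrapped in a fake response; the snippet only imitates the dmoz markup and is entirely made up:
In []:
# A sketch: the spider's XPath expressions applied to a fake HtmlResponse.
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

body = """
<ul class="directory-url">
  <li><a href="http://example.com/book">Some Python Book</a> - A made-up description.
  </li>
</ul>
"""
response = HtmlResponse(url='http://www.dmoz.org/fake', body=body, encoding='utf-8')

hxs = HtmlXPathSelector(response)
for site in hxs.select('//ul[@class="directory-url"]/li'):
    print site.select('a/text()').extract()  # [u'Some Python Book']
    print site.select('a/@href').extract()   # [u'http://example.com/book']
    print site.select('text()').re('-\s([^\n]*?)\\n')  # [u'A made-up description.']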
In [33]:
%load 'C:\\Users\\kiss\\Documents\\GitHub\\dirbot\\dirbot\\spiders\\__init__.py'
In []:
# Place here all your scrapy spiders