iPython R Rapid Miner: Returning to the study of Scrapy. Starting with the shell

Tuesday, May 20, 2014

Returning to the study of Scrapy. Starting with the shell

Here is a 10-minute video and a link to the documentation. Reading the documentation is the better option: everything there is laid out briefly and clearly.

For almost three weeks I did not let myself be distracted by anything except the problem of rotating IP addresses. Proxychains was good in every respect except that it would not install on Windows.
I wrote a fair number of drafts during that time; I will publish some of them... later... probably.
In the end I decided, for now, to write the modules for cycling through proxy lists myself, by slightly reworking what I found. These will be learning exercises, though, so the task becomes secondary.

Starting today, we begin systematic exercises with Scrapy. It turns out that I never wrote up my exercises with the shell, so I am filling that gap in my drafts. Below is a video that annoys me (you have to squint...), but I could not find a better one.
And here is a link to the excellent documentation: Scrapy shell
I will also need a reference on XPath: XPath Tutorial

The last link above is about XPath. In addition, in any browser it is best to use a plugin that shows the XPath of a selected fragment (from the right-click menu). On Kali I installed XPath Checker. It builds the query string from element ids and tag positions...
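To make "a query built from an id and tag positions" concrete, here is a minimal sketch. It assumes a small, well-formed page fragment invented for the illustration, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` rather than Scrapy's own selectors, so it runs without Scrapy installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment. Real pages are rarely well-formed XML,
# which is why Scrapy relies on a tolerant HTML parser instead.
PAGE = """
<html>
  <body>
    <div id="menu"><a href="/doc/">Docs</a></div>
    <div id="content">
      <ul>
        <li>first</li>
        <li>second</li>
        <li>third</li>
      </ul>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Anchor on the element id, then walk down by tag name and 1-based position,
# much like the path a browser plugin generates for a selected fragment.
second_item = root.find(".//div[@id='content']/ul/li[2]")
print(second_item.text)
```

Paths anchored on an id tend to survive small layout changes better than purely positional ones, which is probably why such plugins prefer them when an id is available.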

Why is such a console needed at all?

First, for debugging selectors. As we know, Scrapy has three kinds of selectors of its own (XPath, CSS, RE); regular expressions, by the way, are the fastest.
In addition, the console fully imitates the scraping process. To do so, it creates the corresponding objects, which you can experiment with.
Load a page once, then try cutting out of it whatever you want, however you want.
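The "load once, extract many times" workflow can be sketched without any network access. The snippet below is an assumption-laden stand-in: the page text is invented, and the standard library's `ElementTree` and `re` modules play the roles that XPath selectors and the regular-expression step play in the Scrapy shell.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page that was downloaded once.
PAGE = """<html><body>
<h1>Price list</h1>
<p class="item">Widget: 12 USD</p>
<p class="item">Gadget: 30 USD</p>
</body></html>"""

# "Load" the page a single time...
tree = ET.fromstring(PAGE)

# ...then try as many extractions as needed against the same parsed tree.
title = tree.findtext(".//h1")
items = [p.text for p in tree.findall(".//p[@class='item']")]

# A regular expression applied to text that was already selected,
# similar in spirit to narrowing a selection down with a regex step.
prices = []
for text in items:
    match = re.search(r"(\d+) USD", text)
    if match:
        prices.append(int(match.group(1)))

print(title, items, prices)
```

Each new attempt costs only a re-query of `tree`, which is the main convenience the shell offers while a selector is still being tuned.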

Available Shortcuts

shelp() - print a help with the list of available objects and shortcuts
fetch(request_or_url) - fetch a new response from the given request or URL and update all related objects accordingly.
view(response) - open the given response in your local web browser, for inspection. This will add a <base> tag to the response body in order for external links (such as images and style sheets) to display properly. Note, however, that this will create a temporary file in your computer, which won’t be removed automatically.
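What view(response) is described as doing can be approximated in a few lines. This is only a sketch under stated assumptions: `save_for_viewing` is a hypothetical helper, not Scrapy's implementation, and the HTML body is invented. It shows the two effects mentioned above: a base tag is added so that relative links resolve, and a temporary file is left on disk.

```python
import tempfile


def save_for_viewing(body: str, url: str) -> str:
    """Write body to a temporary .html file and return the file path."""
    # Insert a <base> tag right after <head> so that relative links
    # (images, style sheets) resolve against the original URL.
    if "<base" not in body:
        body = body.replace("<head>", '<head><base href="%s">' % url, 1)
    # delete=False keeps the file after it is closed so a browser can
    # read it; nothing removes it automatically afterwards.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(body)
        return handle.name


path = save_for_viewing(
    "<html><head><title>t</title></head><body><img src='logo.png'></body></html>",
    "http://scrapy.org/doc/",
)
print(path)
# To display it, one could run:
#   import pathlib, webbrowser
#   webbrowser.open(pathlib.Path(path).as_uri())
```

Because the file is never cleaned up, repeated calls accumulate files in the temp directory, which matches the warning in the shortcut description.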

The Scrapy shell automatically creates some convenient objects from the downloaded page, like the Response object and the Selector objects (for both HTML and XML content).

Those objects are:
crawler - the current Crawler object.
spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL
request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut.
response - a Response object containing the last fetched page
sel - a Selector object constructed with the last response fetched
settings - the current Scrapy settings
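The note that a request can be modified "using replace()" describes a copy-with-changes pattern: the original object stays as it was and a new one is returned. The sketch below illustrates only that pattern; `SketchRequest` is a hypothetical dataclass, not Scrapy's Request, and its fields are chosen for the example.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SketchRequest:
    """Hypothetical stand-in for a request object; not Scrapy's Request."""

    url: str
    method: str = "GET"
    headers: dict = field(default_factory=dict)


original = SketchRequest(url="http://scrapy.org/doc/")

# replace() builds a new object: the named fields change,
# every other field is copied from the original.
modified = replace(original, method="HEAD", headers={"Accept-Language": "en"})

print(original)
print(modified)
```

In the shell the analogous step would presumably be to call replace() on the request object and pass the result to fetch(), so that response and sel are rebuilt from the new request.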

Questions still to be answered

What if several spiders crawl the same page? (I came across such an example in the PDF version of the manual.)
Can item field processing commands (Items.field...) be tried out (simulated)? How is a loaded (fetched) page cached? If I start another spider, will it load the data from the cache?


18 comments:

Sergey Borisovich, July 10, 2014 at 14:21
E:\w8\IPython Notebooks\2014_07>scrapy shell "http://scrapy.org/doc/"
2014-07-10 14:08:06+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapybot)
2014-07-10 14:08:06+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-10 14:08:06+0400 [scrapy] DEBUG: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-07-10 14:08:15+0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-10 14:08:21+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-10 14:08:22+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-10 14:08:22+0400 [scrapy] DEBUG: Enabled item pipelines:
2014-07-10 14:08:22+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-10 14:08:22+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-10 14:08:22+0400 [default] INFO: Spider opened
2014-07-10 14:08:23+0400 [default] DEBUG: Crawled (200) <GET http://scrapy.org/doc/> (referer: None)
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET http://scrapy.org/doc/>
[s]   response   <200 http://scrapy.org/doc/>
[s]   sel        <Selector xpath=None data=u'<html>\n <head>\n <meta charset="utf-8'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x4788860>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
C:\Users\kiss\Anaconda\lib\site-packages\IPython\frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated.
All its subpackages have been moved to the top `IPython` level.
warn("The top-level `frontend` package has been deprecated. "
Sergey Borisovich, July 10, 2014 at 14:23
In [1]: shelp()
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET http://scrapy.org/doc/>
[s]   response   <200 http://scrapy.org/doc/>
[s]   sel        <Selector xpath=None data=u'<html>\n <head>\n <meta charset="utf-8'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x4788860>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Sergey Borisovich, July 10, 2014 at 14:25
If you type a dot after an object name and press Tab, the available completions appear below:
In [2]: request.
request.body request.copy request.errback request.method request.url
request.callback request.dont_filter request.headers request.priority
request.cookies request.encoding request.meta request.replace
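Outside IPython, roughly the same list can be obtained with `dir()`. The sketch below assumes a standard-library `urllib.request.Request` as a stand-in object (it is not the Scrapy request shown above, so the names differ) and a hypothetical helper `public_names` that hides underscore-prefixed attributes.

```python
from urllib.request import Request


def public_names(obj):
    """Return the attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Building a urllib Request object does not open a connection.
req = Request("http://scrapy.org/doc/")
names = public_names(req)
print(names)
```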
Sergey Borisovich, July 10, 2014 at 14:26
In [2]: request.headers
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'x-gzip,gzip,deflate',
'Accept-Language': 'en',
'User-Agent': 'Scrapy/0.20.1 (+http://scrapy.org)'}
Sergey Borisovich, July 10, 2014 at 14:41
After the command
In [3]: view(response)
Out[3]: True
the following file opened in the browser:

file:///C:/Users/kiss/AppData/Local/Temp/tmpj7_oon.html
Sergey Borisovich, July 10, 2014 at 14:43
Available Scrapy objects
The Scrapy shell automatically creates some convenient objects from the downloaded page, like the Response object and the Selector objects (for both HTML and XML content).
Those objects are:
• crawler - the current Crawler object.
• spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL
• request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut.
• response - a Response object containing the last fetched page
• sel - a Selector object constructed with the last response fetched
• settings - the current Scrapy settings
Sergey Borisovich, July 10, 2014 at 14:45
Typing a dot and pressing Tab:

In [4]: settings.
settings.defaults settings.getdict settings.getlist settings.settings_module
settings.get settings.getfloat settings.global_defaults settings.values
settings.getbool settings.getint settings.overrides
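The completion list suggests a family of typed getters (get, getbool, getint, getfloat, getlist). A minimal sketch of how such getters could behave is given below. `SketchSettings` is a hypothetical class written for this note, not Scrapy's settings object, and the stored values are illustrative; the real methods may differ in defaults and edge cases.

```python
class SketchSettings:
    """Hypothetical read-only settings store; not Scrapy's settings class."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name, default=None):
        return self._values.get(name, default)

    def getbool(self, name, default=False):
        # Accepts real booleans as well as "0"/"1" strings from a command line.
        return bool(int(self.get(name, default)))

    def getint(self, name, default=0):
        return int(self.get(name, default))

    def getfloat(self, name, default=0.0):
        return float(self.get(name, default))

    def getlist(self, name, default=None):
        value = self.get(name, default or [])
        # A comma-separated string is split; other iterables are copied.
        if isinstance(value, str):
            return value.split(",")
        return list(value)


# Illustrative values only; strings mimic what a command line would pass.
sketch = SketchSettings({
    "LOGSTATS_INTERVAL": "0",
    "TELNETCONSOLE_ENABLED": "1",
    "SPIDER_MODULES": "myproject.spiders,shared.spiders",
})
print(sketch.getfloat("LOGSTATS_INTERVAL"), sketch.getbool("TELNETCONSOLE_ENABLED"))
```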
Sergey Borisovich, July 10, 2014 at 16:37
The shell command is one of Scrapy's command-line tools, so use Scrapy's own help, like this:

C:\Users\kiss\Anaconda>scrapy -h
Scrapy 0.20.1 - no active project

Usage:
scrapy <command> [options] [args]

Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

C:\Users\kiss\Anaconda>
Sergey Borisovich, July 10, 2014 at 16:39
C:\Users\kiss\Anaconda>scrapy fetch -h
Usage
=====
scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h show this help message and exit
--spider=SPIDER use this spider
--headers print response HTTP headers instead of body

Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--lsprof=FILE write lsprof profiling stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure

C:\Users\kiss\Anaconda>
Sergey Borisovich, July 10, 2014 at 16:42
C:\Users\kiss\Anaconda>scrapy shell -h
Usage
=====
scrapy shell [url|file]

Interactive console for scraping the given url

Options
=======
--help, -h show this help message and exit
-c CODE evaluate the code in the shell, print the result and
exit
--spider=SPIDER use this spider

Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--lsprof=FILE write lsprof profiling stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure

C:\Users\kiss\Anaconda>
Sergey Borisovich, July 10, 2014 at 16:46
There are two kinds of commands, those that only work from inside a Scrapy project (Project-specific commands) and those that also work without an active Scrapy project (Global commands), though they may behave slightly different when running from inside a project (as they would use the project overridden settings).
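Why a global command may behave differently inside a project can be pictured as layered lookups: command-line overrides first, then project settings if a project is active, then built-in defaults. The sketch below uses `collections.ChainMap` to model that idea. The layer names, the project value and the default for LOGSTATS_INTERVAL are assumptions made for the illustration, and the real resolution logic in Scrapy may differ.

```python
from collections import ChainMap

# Illustrative layers; the values are assumptions, not read from Scrapy.
defaults = {"BOT_NAME": "scrapybot", "LOGSTATS_INTERVAL": 60.0}
project = {"BOT_NAME": "myproject"}        # hypothetical project settings
command_line = {"LOGSTATS_INTERVAL": 0}    # e.g. passed with -s NAME=VALUE

# Lookups walk the maps from left to right and stop at the first hit.
outside_project = ChainMap(command_line, defaults)
inside_project = ChainMap(command_line, project, defaults)

print(outside_project["BOT_NAME"], inside_project["BOT_NAME"])
print(inside_project["LOGSTATS_INTERVAL"])
```

The same command therefore sees "scrapybot" when no project is active and the project's name otherwise, while an explicit override wins in both cases.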