We create our first project for parsing CSV files and record every step along the way. Once an absolute path was specified, and to a new "correct" local CSV file at that, the console messages confirmed that the rows were being parsed properly. That is as far as this post goes...
In []:
[Scrapy creating-projects](http://doc.scrapy.org/en/latest/topics/commands.html#creating-projects)
[xmlfeedspider](http://doc.scrapy.org/en/latest/topics/spiders.html#xmlfeedspider)
[csvfeedspider](http://doc.scrapy.org/en/latest/topics/spiders.html#csvfeedspider)
[settings](http://doc.scrapy.org/en/latest/topics/settings.html#settings)
[Scrapy shell](http://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)
Create the project folders and main files
In []:
#cd C:\Users\kiss\Documents\GitHub_2\
scrapy startproject scrapy_csv_1
In []:
#Next, you go inside the new project directory
cd C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1
# And you’re ready to use the scrapy command to manage and control your project from there
In [4]:
!chcp 65001
!dir C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1
The project root is the folder where the .cfg file lives. All other files in this folder were added by me after the script ran (.ipynb is this very notebook, nissan...csv is the experiment file from the csv.lnk folder).
In [6]:
!dir C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1\scrapy_csv_1
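Both dir listings boil down to the standard layout that scrapy startproject generates (a sketch; only the generator-created files are shown):
scrapy_csv_1/
    scrapy.cfg            # deploy configuration, marks the project root
    scrapy_csv_1/         # the project's Python package
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py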
Print out some of the project files
In [5]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\settings.py"
In []:
# Scrapy settings for scrapy_csv_1 project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'scrapy_csv_1'
SPIDER_MODULES = ['scrapy_csv_1.spiders']
NEWSPIDER_MODULE = 'scrapy_csv_1.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_csv_1 (+http://www.yourdomain.com)'
In [9]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\pipelines.py"
In []:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class ScrapyCsv1Pipeline(object):
    def process_item(self, item, spider):
        return item
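The generated pipeline just passes items through, and it is not registered anywhere yet: the crawl logs below show an empty "Enabled item pipelines:" line. If it were needed, a minimal sketch of enabling it in settings.py would look like this (in this Scrapy version, 0.20, ITEM_PIPELINES is a list of class paths; newer releases use a dict with priorities):
In []:
# a sketch: enabling the generated pipeline in settings.py (not actually done in this project)
ITEM_PIPELINES = [
    'scrapy_csv_1.pipelines.ScrapyCsv1Pipeline',
]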
In [13]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\items.py"
In []:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field
class ScrapyCsv1Item(Item):
    # define the fields for your item here like:
    # name = Field()
    N = Field()
    N100 = Field()
    purl = Field()
In []:
# This is the initial version, as generated by the script
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field
class ScrapyCsv1Item(Item):
    # define the fields for your item here like:
    # name = Field()
    pass
In [12]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\spiders\\mail_csv.py"
In []:
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_1.items import ScrapyCsv1Item
class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv'
    #allowed_domains = ['mail,ru']
    start_urls = ['nissan_9_1_00.csv']
    headers = ['N', 'N100', 'purl']
    delimiter = ';'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        #log.msg('Hi, this is a row!: %r' % row)
        i = ScrapyCsv1Item()
        i['N'] = row['N']
        i['N100'] = row['N100']
        i['purl'] = row['purl']
        return i
In []:
# This is the initial version, as generated by the script
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_1.items import ScrapyCsv1Item
class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv'
    allowed_domains = ['mail,ru']
    start_urls = ['http://www.mail,ru/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = ScrapyCsv1Item()
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i
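Judging by the placeholder fields, this skeleton came from the spider template generator; it was presumably produced with something like the following (the csvfeed template ships with Scrapy; the exact domain argument here is my guess):
In []:
scrapy genspider -t csvfeed mail_csv mail.ru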
Example from the documentation: csvfeedspider-example
In []:
from scrapy import log
from scrapy.contrib.spiders import CSVFeedSpider
from myproject.items import TestItem
class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        log.msg('Hi, this is a row!: %r' % row)
        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
First run the checks, then the spider itself
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy check -l
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy list
mail_csv
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy crawl mail_csv
2014-07-21 21:33:59+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapy_csv_1)
2014-07-21 21:33:59+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-21 21:33:59+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_csv_1.spiders', 'SPIDER_MODULES': ['scrapy_csv_1.spiders'], 'BOT_NAME': 'scrapy_csv_1'}
2014-07-21 21:34:01+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Enabled item pipelines:
2014-07-21 21:34:02+0400 [mail_csv] INFO: Spider opened
2014-07-21 21:34:02+0400 [mail_csv] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-21 21:34:02+0400 [mail_csv] ERROR: Obtaining request from start requests
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 1192, in run
self.mainLoop()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\core\engine.py", line 111, in _next_request
request = next(slot.start_requests)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\spider.py", line 50, in start_requests
yield self.make_requests_from_url(url)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\spider.py", line 53, in make_requests_from_url
return Request(url, dont_filter=True)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 26, in __init__
self._set_url(url)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 61, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: nissan_9_1_00.csv
2014-07-21 21:34:02+0400 [mail_csv] INFO: Closing spider (finished)
2014-07-21 21:34:02+0400 [mail_csv] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 7, 21, 17, 34, 2, 335000),
'log_count/DEBUG': 6,
'log_count/ERROR': 1,
'log_count/INFO': 3,
'start_time': datetime.datetime(2014, 7, 21, 17, 34, 2, 269000)}
2014-07-21 21:34:02+0400 [mail_csv] INFO: Spider closed (finished)
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>
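The traceback pinpoints the problem: start_urls = ['nissan_9_1_00.csv'] is a bare file name, and Request() rejects any URL without a scheme (http://, file://, ...). A local file therefore has to be given as an explicit file:// URL. A small sketch of building such a URL from a Windows path (assuming Python 2, which this Scrapy 0.20 setup runs on):
In []:
# a sketch: turn a local Windows path into a file:// URL suitable for start_urls
import urllib

path = r'C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1\nissan_.csv'
file_url = 'file:' + urllib.pathname2url(path)
print file_url  # something like file:///C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv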
The script only ran successfully with a simpler CSV file
In [15]:
# The first line here is the header row, just as in the original files
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\nissan_.csv"
In []:
"N";"N100";"purl"
"7371";"39,46";"http://auto.mail.ru/catalogue/nissan/"
"1416";"7,58";"http://auto.mail.ru/catalogue/nissan/qashqai/"
"1179";"6,31";"http://auto.mail.ru/catalogue/nissan/x-trail/"
In [16]:
# Load the code of the working spider
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\spiders\\mail_csv.py"
In []:
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_1.items import ScrapyCsv1Item
class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv'
    #allowed_domains = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/']
    #start_urls = ['nissan_9_1_00.csv']
    headers = ['N', 'N100', 'purl']
    delimiter = ';'
    start_urls = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv']

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        #log.msg('Hi, this is a row!: %r' % row)
        i = ScrapyCsv1Item()
        i['N'] = row['N']
        i['N100'] = row['N100']
        i['purl'] = row['purl']
        return i
In []:
# To repeat: everything worked after the start_urls line was replaced with
start_urls = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv']
# and the column headers were matched to the file
The spider finally produced something sensible:
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy crawl mail_csv
2014-07-21 22:03:40+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapy_csv_1)
2014-07-21 22:03:40+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-21 22:03:40+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_csv_1.spiders', 'SPIDER_MODULES': ['scrapy_csv_1.spiders'], 'BOT_NAME': 'scrapy_csv_1'}
2014-07-21 22:03:41+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Enabled item pipelines:
2014-07-21 22:03:42+0400 [mail_csv] INFO: Spider opened
2014-07-21 22:03:42+0400 [mail_csv] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Crawled (200) <GET file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv> (referer: None)
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
{'N': u'N', 'N100': u'N100', 'purl': u'purl'}
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
{'N': u'7371',
'N100': u'39,46',
'purl': u'http://auto.mail.ru/catalogue/nissan/'}
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
{'N': u'1416',
'N100': u'7,58',
'purl': u'http://auto.mail.ru/catalogue/nissan/qashqai/'}
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
{'N': u'1179',
'N100': u'6,31',
'purl': u'http://auto.mail.ru/catalogue/nissan/x-trail/'}
2014-07-21 22:03:42+0400 [mail_csv] INFO: Closing spider (finished)
2014-07-21 22:03:42+0400 [mail_csv] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 218,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 7, 21, 18, 3, 42, 559000),
'item_scraped_count': 4,
'log_count/DEBUG': 11,
'log_count/INFO': 3,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 7, 21, 18, 3, 42, 408000)}
2014-07-21 22:03:42+0400 [mail_csv] INFO: Spider closed (finished)
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>
Above is a copy of the console output (stdout)... You can see that all three data rows of the test file were scraped into the corresponding row dictionaries. Congratulations to everyone on that.
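One small wrinkle remains: the stats show item_scraped_count: 4, because the first line of nissan_.csv repeats the names already declared in headers and is scraped as an extra item ({'N': u'N', ...} above). A possible guard in parse_row (a sketch; to my understanding returning None from parse_row simply skips that row):
In []:
# a sketch: inside MailCsvSpider, ignore the row that merely repeats the headers
def parse_row(self, response, row):
    if row['N'] == 'N':   # the first line of nissan_.csv duplicates the declared headers
        return None       # returning None should simply skip this row
    i = ScrapyCsv1Item()
    i['N'] = row['N']
    i['N100'] = row['N100']
    i['purl'] = row['purl']
    return i
Alternatively, the header line can simply be removed from the file, since headers already defines the column names; and the scraped items can be written straight to a file with the feed exporter, e.g. scrapy crawl mail_csv -o items.json -t json.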