We add assignment statements for the new fields to the spider file, inside def parse_row(self, response, row). The spider now parses the csv file into 6 fields, but this is not the final version.
In [1]:
%load "C:\\Users\\kiss\\Documents\\GitMyScrapy\\scrapy_csv_2\\scrapy_csv_2\\spiders\\mail_csv_2_1.py"
In []:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_2.items import ScrapyCsv1Item
from scrapy import log


class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv_2_1'
    #allowed_domains = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_2/']
    #start_urls = ['nissan_9_1_00.csv']
    headers = ['N', 'N100', 'purl']
    delimiter = ';'
    start_urls = ['file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv']

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    # I'm going to use url splitting
    # proj_fromurl = response.url.split("/")[-2]
    def parse_row(self, response, row):
        i = ScrapyCsv1Item()
        i['N'] = row['N']
        i['N100'] = row['N100']
        i['purl'] = row['purl']
        i['purlplit'] = i['purl'].split("/")[-1]  # empty for URLs ending in "/", see note 3 below
        i['bodyurl'] = response.url
        i['proj_name'] = response.url.split("/")[3]
        log.msg('Hi, this is a row!: %r' % row)
        return i
1. To get the Cyrillic in the comments (!!!) recognized, scrapy required declaring the module encoding (# -*- coding: utf-8 -*-)
2. Every new field assigned in def parse_row(...) must be defined in the items module (from scrapy_csv_2.items import ScrapyCsv1Item); see the items.py sketch right after this list
3. The line "i['purlplit'] = i['purl'].split("/")[-1]" yields an empty string (see below); this will have to be sorted out, and the demo after the CSV sample shows why it happens
4. The line i['bodyurl'] = response.url yields and records the URL string
5. That URL string can be split into several Item fields, e.g. i['proj_name'] = response.url.split("/")[3]
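For reference, here is what a matching items.py could look like. The actual file is not shown in this post, so this is only a minimal sketch assuming plain Field declarations for all six fields:
In []:
# scrapy_csv_2/items.py -- a minimal sketch, the real file may differ
from scrapy.item import Item, Field

class ScrapyCsv1Item(Item):
    N = Field()          # first csv column
    N100 = Field()       # second csv column
    purl = Field()       # third csv column: the page URL
    purlplit = Field()   # last path segment of purl
    bodyurl = Field()    # response.url (the local csv file)
    proj_name = Field()  # fourth "/"-separated segment of response.url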
These are the rows that were written to the resulting file items_2_2.csv
In []:
bodyurl,purlplit,N100,purl,proj_name,N
file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv,,"39,46",http://auto.mail.ru/catalogue/nissan/,Users,7371
file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv,,"7,58",http://auto.mail.ru/catalogue/nissan/qashqai/,Users,1416
file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv,,"6,31",http://auto.mail.ru/catalogue/nissan/x-trail/,Users,1179
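A quick standalone illustration of notes 3 and 5 above (a hypothetical demo snippet, not part of the spider): the catalogue URLs end with "/", so split("/")[-1] returns the empty element after the last slash, while index 3 of the file:// response URL happens to be 'Users'.
In []:
# hypothetical demo, runnable on its own (Python 2, matching the Scrapy 0.20 setup)
purl = 'http://auto.mail.ru/catalogue/nissan/'
print purl.split("/")        # ['http:', '', 'auto.mail.ru', 'catalogue', 'nissan', '']
print purl.split("/")[-1]    # '' -- the trailing "/" leaves an empty last element
print purl.rstrip("/").split("/")[-1]   # 'nissan' -- one possible fix for purlplit

bodyurl = 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv'
print bodyurl.split("/")[3]  # 'Users' -- which is why proj_name == 'Users'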
Here are the console messages, or how the file items_2_2.csv was produced
In []:
C:\Users\kiss\Documents\GitMyScrapy\scrapy_csv_2>scrapy crawl mail_csv_2_1 -o items_2_2.csv -t csv
2014-07-27 16:56:04+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapy_csv_2)
2014-07-27 16:56:04+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-27 16:56:04+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_csv_2.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['scrapy_csv_2.spiders'], 'FEED_URI': 'items_2_2.csv', 'BOT_NAME': 'scrapy_csv_2'}
2014-07-27 16:56:07+0400 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Enabled item pipelines: ScrapyCsv1Pipeline
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Spider opened
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Crawled (200) <GET file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv> (referer: None)
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'85837', 'purl': u'http://auto.mail.ru', 'N': u'\u0410\u0432\u0442\u043e@Mail.Ru'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] ERROR: Error processing {'N': u'\u0410\u0432\u0442\u043e@Mail.Ru',
'N100': u'85837',
'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
'proj_name': 'Users',
'purl': u'http://auto.mail.ru',
'purlplit': u'auto.mail.ru'}
Traceback (most recent call last):
  File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\middleware.py", line 62, in _process_chain
    return process_chain(self.methods[methodname], obj, *args)
  File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 65, in process_chain
    d.callback(input)
  File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "scrapy_csv_2\pipelines.py", line 17, in process_item
    raise DropItem("Contains forbidden word: %s" % word)
exceptions.NameError: global name 'DropItem' is not defined
2014-07-27 16:56:09+0400 [scrapy] WARNING: ignoring row 2 (length: 0, should be: 3)
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'%', 'purl': u'\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b', 'N': u'\u041f\u0440\u043e\u0441\u043c\u043e\u0442\u0440\u044b'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] ERROR: Error processing {'N': u'\u041f\u0440\u043e\u0441\u043c\u043e\u0442\u0440\u044b',
'N100': u'%',
'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
'proj_name': 'Users',
'purl': u'\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b',
'purlplit': u'\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b'}
Traceback (most recent call last):
  File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\middleware.py", line 62, in _process_chain
    return process_chain(self.methods[methodname], obj, *args)
  File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 65, in process_chain
    d.callback(input)
  File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "scrapy_csv_2\pipelines.py", line 17, in process_item
    raise DropItem("Contains forbidden word: %s" % word)
exceptions.NameError: global name 'DropItem' is not defined
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'39,46', 'purl': u'http://auto.mail.ru/catalogue/nissan/', 'N': u'7371'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv>
{'N': u'7371',
'N100': u'39,46',
'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
'proj_name': 'Users',
'purl': u'http://auto.mail.ru/catalogue/nissan/',
'purlplit': u''}
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'7,58', 'purl': u'http://auto.mail.ru/catalogue/nissan/qashqai/', 'N': u'1416'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv>
{'N': u'1416',
'N100': u'7,58',
'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
'proj_name': 'Users',
'purl': u'http://auto.mail.ru/catalogue/nissan/qashqai/',
'purlplit': u''}
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'6,31', 'purl': u'http://auto.mail.ru/catalogue/nissan/x-trail/', 'N': u'1179'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv>
{'N': u'1179',
'N100': u'6,31',
'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
'proj_name': 'Users',
'purl': u'http://auto.mail.ru/catalogue/nissan/x-trail/',
'purlplit': u''}
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Closing spider (finished)
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Stored csv feed (3 items) in: items_2_2.csv
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 265,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 294,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 7, 27, 12, 56, 9, 693000),
'item_scraped_count': 3,
'log_count/DEBUG': 10,
'log_count/ERROR': 2,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 7, 27, 12, 56, 9, 519000)}
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Spider closed (finished)
C:\Users\kiss\Documents\GitMyScrapy\scrapy_csv_2>
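Note the two ERROR tracebacks above: pipelines.py raises DropItem on line 17 without importing it, hence exceptions.NameError: global name 'DropItem' is not defined. The fix is the import from scrapy.exceptions. Since pipelines.py itself is not listed in this post, the sketch below is only an assumption reconstructed from the traceback:
In []:
# scrapy_csv_2/pipelines.py -- hypothetical reconstruction; only the missing
# import is certain, the filtering logic is guessed from the error message
from scrapy.exceptions import DropItem  # this import was missing, hence the NameError

class ScrapyCsv1Pipeline(object):
    words_to_filter = ['forbidden']  # assumed word list

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            # drop any item whose purl contains a forbidden word
            if word in unicode(item.get('purl', '')):
                raise DropItem("Contains forbidden word: %s" % word)
        return item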