
Saturday, July 26, 2014

Trying to add fields to the mail_csv_2_1 spider. Parsing i['proj_name'] = response.url.split("/")[3]

We add assignment statements for the new fields inside def parse_row(self, response, row) in the spider file. The spider now parses the CSV file into 6 fields, though this is not the final version yet.
In [1]:
%load "C:\\Users\\kiss\\Documents\\GitMyScrapy\\scrapy_csv_2\\scrapy_csv_2\\spiders\\mail_csv_2_1.py"
In []:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_2.items import ScrapyCsv1Item
from scrapy import log


class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv_2_1'
    #allowed_domains = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_2/']
    #start_urls = ['nissan_9_1_00.csv']

    headers = ['N', 'N100', 'purl']
    delimiter = ';'
    start_urls = ['file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv']

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    # I'm going to use URL splitting, e.g.:
    # proj_fromurl = response.url.split("/")[-2]

    def parse_row(self, response, row):
        # copy the three CSV columns into the item
        i = ScrapyCsv1Item()
        i['N'] = row['N']
        i['N100'] = row['N100']
        i['purl'] = row['purl']
        # derived fields: last URL segment, source URL, and its 4th segment
        i['purlplit'] = i['purl'].split("/")[-1]
        i['bodyurl'] = response.url
        i['proj_name'] = response.url.split("/")[3]
        log.msg('Hi, this is a row!: %r' % row)
        return i
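
The import at the top only works because all six fields already exist in scrapy_csv_2/items.py. A minimal sketch of how that module presumably looks (the class name comes from the import, the field set from parse_row; everything else is an assumption):

In []:
# -*- coding: utf-8 -*-
from scrapy.item import Item, Field

class ScrapyCsv1Item(Item):
    # the three columns read from the CSV file
    N = Field()
    N100 = Field()
    purl = Field()
    # the fields derived in parse_row
    purlplit = Field()
    bodyurl = Field()
    proj_name = Field()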
1. To make Scrapy accept Cyrillic in the comments (!!!), it required declaring the module encoding (# -*- coding: utf-8 -*-).
2. Every new field assigned in def parse_row(...) must be defined in the items module (from scrapy_csv_2.items import ScrapyCsv1Item); see the sketch above.
3. The line i['purlplit'] = i['purl'].split("/")[-1] yields an empty string (see the output below); this will need sorting out. The cause is shown in the snippet right after this list: purl ends with a trailing slash, so the element after the last "/" is empty.
4. The line i['bodyurl'] = response.url produces and stores the URL string.
5. That URL string can then be split across several Item fields, e.g. i['proj_name'] = response.url.split("/")[3].
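
Regarding point 3, a quick interpreter check shows where the empty string comes from (a minimal sketch using one of the purl values from the file):

In []:
u'http://auto.mail.ru/catalogue/nissan/qashqai/'.split("/")
# [u'http:', u'', u'auto.mail.ru', u'catalogue', u'nissan', u'qashqai', u'']
u'http://auto.mail.ru/catalogue/nissan/qashqai/'.split("/")[-1]
# u'' - the trailing slash leaves an empty last element, hence purlplit is empty
u'http://auto.mail.ru/catalogue/nissan/qashqai/'.split("/")[-2]
# u'qashqai' - index [-2] would give the last real path segment instead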

These are the rows that were written to the output file items_2_2.csv:

In []:
bodyurl,purlplit,N100,purl,proj_name,N
file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv,,"39,46",http://auto.mail.ru/catalogue/nissan/,Users,7371
file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv,,"7,58",http://auto.mail.ru/catalogue/nissan/qashqai/,Users,1416
file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv,,"6,31",http://auto.mail.ru/catalogue/nissan/x-trail/,Users,1179
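
Note the proj_name column: every row got 'Users'. The spider crawls a local file, so response.url is the file:// URL above, and element [3] of its split is the Users directory rather than anything project-related:

In []:
'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv'.split("/")
# ['file:', '', 'C:', 'Users', 'kiss', 'Documents', 'GitMyScrapy', 'scrapy_csv_2', 'nissan_2.csv']
# index [3] is 'Users'; to extract a segment of the page URL,
# the same split would have to be applied to i['purl'] instead of response.url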

And here are the console messages, i.e. how items_2_2.csv came about:

In []:
C:\Users\kiss\Documents\GitMyScrapy\scrapy_csv_2>scrapy crawl mail_csv_2_1 -o items_2_2.csv -t csv
2014-07-27 16:56:04+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapy_csv_2)
2014-07-27 16:56:04+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-27 16:56:04+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_csv_2.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['scrapy_csv_2.spiders'], 'FEED_URI': 'items_2_2.csv', 'BOT_NAME': 'scrapy_csv_2'}
2014-07-27 16:56:07+0400 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Enabled item pipelines: ScrapyCsv1Pipeline
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Spider opened
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-27 16:56:09+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Crawled (200) <GET file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv> (referer: None)
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'85837', 'purl': u'http://auto.mail.ru', 'N': u'\u0410\u0432\u0442\u043e@Mail.Ru'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] ERROR: Error processing {'N': u'\u0410\u0432\u0442\u043e@Mail.Ru',
         'N100': u'85837',
         'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
         'proj_name': 'Users',
         'purl': u'http://auto.mail.ru',
         'purlplit': u'auto.mail.ru'}
        Traceback (most recent call last):
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\middleware.py", line 62, in _process_chain
            return process_chain(self.methods[methodname], obj, *args)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 65, in process_chain
            d.callback(input)
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
            self._startRunCallbacks(result)
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "scrapy_csv_2\pipelines.py", line 17, in process_item
            raise DropItem("Contains forbidden word: %s" % word)
        exceptions.NameError: global name 'DropItem' is not defined

2014-07-27 16:56:09+0400 [scrapy] WARNING: ignoring row 2 (length: 0, should be: 3)
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'%', 'purl': u'\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b', 'N': u'\u041f\u0440\u043e\u0441\u043c\u043e\u0442\u0440\u044b'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] ERROR: Error processing {'N': u'\u041f\u0440\u043e\u0441\u043c\u043e\u0442\u0440\u044b',
         'N100': u'%',
         'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
         'proj_name': 'Users',
         'purl': u'\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b',
         'purlplit': u'\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b'}
        Traceback (most recent call last):
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\middleware.py", line 62, in _process_chain
            return process_chain(self.methods[methodname], obj, *args)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 65, in process_chain
            d.callback(input)
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 382, in callback
            self._startRunCallbacks(result)
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "scrapy_csv_2\pipelines.py", line 17, in process_item
            raise DropItem("Contains forbidden word: %s" % word)
        exceptions.NameError: global name 'DropItem' is not defined

2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'39,46', 'purl': u'http://auto.mail.ru/catalogue/nissan/', 'N': u'7371'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv>
        {'N': u'7371',
         'N100': u'39,46',
         'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
         'proj_name': 'Users',
         'purl': u'http://auto.mail.ru/catalogue/nissan/',
         'purlplit': u''}
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'7,58', 'purl': u'http://auto.mail.ru/catalogue/nissan/qashqai/', 'N': u'1416'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv>
        {'N': u'1416',
         'N100': u'7,58',
         'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
         'proj_name': 'Users',
         'purl': u'http://auto.mail.ru/catalogue/nissan/qashqai/',
         'purlplit': u''}
2014-07-27 16:56:09+0400 [scrapy] INFO: Hi, this is a row!: {'N100': u'6,31', 'purl': u'http://auto.mail.ru/catalogue/nissan/x-trail/', 'N': u'1179'}
2014-07-27 16:56:09+0400 [mail_csv_2_1] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv>
        {'N': u'1179',
         'N100': u'6,31',
         'bodyurl': 'file://C:/Users/kiss/Documents/GitMyScrapy/scrapy_csv_2/nissan_2.csv',
         'proj_name': 'Users',
         'purl': u'http://auto.mail.ru/catalogue/nissan/x-trail/',
         'purlplit': u''}
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Closing spider (finished)
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Stored csv feed (3 items) in: items_2_2.csv
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 265,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 294,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 7, 27, 12, 56, 9, 693000),
         'item_scraped_count': 3,
         'log_count/DEBUG': 10,
         'log_count/ERROR': 2,
         'log_count/INFO': 9,
         'log_count/WARNING': 1,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 7, 27, 12, 56, 9, 519000)}
2014-07-27 16:56:09+0400 [mail_csv_2_1] INFO: Spider closed (finished)

C:\Users\kiss\Documents\GitMyScrapy\scrapy_csv_2>
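
Two things in this log deserve a comment. The WARNING about row 2 (length: 0, should be: 3) just means nissan_2.csv contains an empty line. The two ERROR tracebacks are more interesting: pipelines.py raises DropItem on line 17 without ever importing it, hence exceptions.NameError: global name 'DropItem' is not defined. Judging by the message "Contains forbidden word: %s", the pipeline follows the word-filter example from the Scrapy docs; a sketch of the fixed module, where the word list and the checked field are assumptions:

In []:
# -*- coding: utf-8 -*-
from scrapy.exceptions import DropItem  # this import was missing

class ScrapyCsv1Pipeline(object):
    # hypothetical list; the real one lives in scrapy_csv_2/pipelines.py
    words_to_filter = ['forbidden']

    def process_item(self, item, spider):
        # drop the item if any forbidden word occurs in its purl field
        for word in self.words_to_filter:
            if word in item['purl']:
                raise DropItem("Contains forbidden word: %s" % word)
        return item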

