Tuesday, July 22, 2014

A step-by-step guide to creating a Scrapy CSV project

We create our first project for parsing CSV files and record every step. Only after an absolute path was specified (and, what's more, to a new "well-formed" local CSV file) did the console messages confirm that the rows were parsed correctly. That is as far as we go here...
In []:
[Scrapy creating-projects](http://doc.scrapy.org/en/latest/topics/commands.html#creating-projects)
[xmlfeedspider](http://doc.scrapy.org/en/latest/topics/spiders.html#xmlfeedspider)
[csvfeedspider](http://doc.scrapy.org/en/latest/topics/spiders.html#csvfeedspider)
[settings](http://doc.scrapy.org/en/latest/topics/settings.html#settings)
[Scrapy shell](http://doc.scrapy.org/en/latest/topics/shell.html#topics-shell)

Create the project folders and main files

In []:
#cd C:\Users\kiss\Documents\GitHub_2\
scrapy startproject scrapy_csv_1
In []:
#Next, you go inside the new project directory
cd C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1

# And you’re ready to use the scrapy command to manage and control your project from there
In [4]:
!chcp 65001
!dir C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1
Active code page: 65001
 Volume in drive C has no label.
 Volume Serial Number is 6017-2A0B

 Directory of C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1

21.07.2014  15:15    <DIR>          .
21.07.2014  15:15    <DIR>          ..
10.07.2014  19:52               809 csv.lnk
30.09.2013  14:27            15 275 nissan_9_1_00.csv
10.07.2014  19:27               266 scrapy.cfg
10.07.2014  19:30    <DIR>          scrapy_csv_1
21.07.2014  15:27             3 338 scrapy_csv_1.ipynb
               4 File(s)         19 688 bytes
               3 Dir(s)  394 077 331 456 bytes free

The main project folder is the one that contains the .cfg file. All the other files in this folder were added by me after the script had run (.ipynb is this notebook, nissan...csv is a file for experiments taken from the folder that the csv.lnk shortcut points to).
In [6]:
!dir C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1\scrapy_csv_1
 Volume in drive C has no label.
 Volume Serial Number is 6017-2A0B

 Directory of C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1\scrapy_csv_1

10.07.2014  19:30    <DIR>          .
10.07.2014  19:30    <DIR>          ..
10.07.2014  19:27               271 items.py
10.07.2014  19:27               264 pipelines.py
10.07.2014  19:27               488 settings.py
10.07.2014  19:30               231 settings.pyc
10.07.2014  19:30    <DIR>          spiders
09.12.2013  15:38                 0 __init__.py
10.07.2014  19:30               111 __init__.pyc
               6 File(s)          1 365 bytes
               3 Dir(s)  394 076 495 872 bytes free

Let's print out some of the project files

In [5]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\settings.py"
In []:
# Scrapy settings for scrapy_csv_1 project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'scrapy_csv_1'

SPIDER_MODULES = ['scrapy_csv_1.spiders']
NEWSPIDER_MODULE = 'scrapy_csv_1.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_csv_1 (+http://www.yourdomain.com)'
In [9]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\pipelines.py"
In []:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class ScrapyCsv1Pipeline(object):
    def process_item(self, item, spider):
        return item
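The pipeline above is just the generated stub and is never switched on: the crawl logs further down show an empty "Enabled item pipelines:" line. A minimal sketch (my addition, assuming the dict form of ITEM_PIPELINES used since Scrapy 0.20) of how it could be enabled in settings.py:
In []:
# Sketch only: enable the stub pipeline in scrapy_csv_1/settings.py.
# The number is an ordering priority (lower values run earlier).
ITEM_PIPELINES = {
    'scrapy_csv_1.pipelines.ScrapyCsv1Pipeline': 300,
}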
In [13]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\items.py"
In []:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class ScrapyCsv1Item(Item):
    # define the fields for your item here like:
    # name = Field()
    N = Field()
    N100 = Field()
    purl = Field()
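A ScrapyCsv1Item behaves like a dict restricted to its declared fields. A quick illustration (my addition, runnable for example from the Scrapy shell with the project on the path):
In []:
# Illustration only: the Item subclass works like a dict with a fixed schema.
from scrapy_csv_1.items import ScrapyCsv1Item

item = ScrapyCsv1Item(N='7371', N100='39,46',
                      purl='http://auto.mail.ru/catalogue/nissan/')
print(item['N'])    # '7371'
print(dict(item))   # a plain dict with the three declared fields
# item['foo'] = 1   # would raise KeyError: only declared fields are allowed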
In []:
# This is the first version, generated by the script

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class ScrapyCsv1Item(Item):
    # define the fields for your item here like:
    # name = Field()
    pass
In [12]:
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\spiders\\mail_csv.py"
In []:
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_1.items import ScrapyCsv1Item

class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv'
    #allowed_domains = ['mail,ru']
    start_urls = ['nissan_9_1_00.csv']
    headers = ['N', 'N100', 'purl']
    delimiter = ';'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        #log.msg('Hi, this is a row!: %r' % row)
        i = ScrapyCsv1Item()
        i['N'] = row['N']
        i['N100'] = row['N100']
        i['purl'] = row['purl']
        return i
In []:
# This is the first version, generated by the script

from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_1.items import ScrapyCsv1Item

class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv'
    allowed_domains = ['mail,ru']
    start_urls = ['http://www.mail,ru/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = ScrapyCsv1Item()
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i
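The stub above comes from a spider template rather than being written by hand. A plausible reconstruction (my assumption; the actual command is not shown in the post) is the csvfeed template of scrapy genspider, and the stray comma in 'mail,ru' suggests the domain was mistyped when it was run:
In []:
# Hypothetical reconstruction of how the stub was generated (not shown in the post)
scrapy genspider -t csvfeed mail_csv mail.ru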

Example from the documentation: csvfeedspider-example

In []:
from scrapy import log
from scrapy.contrib.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        log.msg('Hi, this is a row!: %r' % row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item

First we run the checks, and then the spider itself

In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy check -l
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy list
mail_csv
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy crawl mail_csv
2014-07-21 21:33:59+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapy_csv_1)
2014-07-21 21:33:59+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-21 21:33:59+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_csv_1.spiders', 'SPIDER_MODULES': ['scrap
y_csv_1.spiders'], 'BOT_NAME': 'scrapy_csv_1'}
2014-07-21 21:34:01+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderStat
e
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMid
dleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMid
dleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlL
engthMiddleware, DepthMiddleware
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Enabled item pipelines:
2014-07-21 21:34:02+0400 [mail_csv] INFO: Spider opened
2014-07-21 21:34:02+0400 [mail_csv] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-21 21:34:02+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-21 21:34:02+0400 [mail_csv] ERROR: Obtaining request from start requests
        Traceback (most recent call last):
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 1192, in run
            self.mainLoop()
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
            return self._func(*self._a, **self._kw)
        --- <exception caught here> ---
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\core\engine.py", line 111, in _next_request
            request = next(slot.start_requests)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\spider.py", line 50, in start_requests
            yield self.make_requests_from_url(url)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\spider.py", line 53, in make_requests_from_url
            return Request(url, dont_filter=True)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 26, in __init__
            self._set_url(url)
          File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 61, in _set_url
            raise ValueError('Missing scheme in request url: %s' % self._url)
        exceptions.ValueError: Missing scheme in request url: nissan_9_1_00.csv

2014-07-21 21:34:02+0400 [mail_csv] INFO: Closing spider (finished)
2014-07-21 21:34:02+0400 [mail_csv] INFO: Dumping Scrapy stats:
        {'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 7, 21, 17, 34, 2, 335000),
         'log_count/DEBUG': 6,
         'log_count/ERROR': 1,
         'log_count/INFO': 3,
         'start_time': datetime.datetime(2014, 7, 21, 17, 34, 2, 269000)}
2014-07-21 21:34:02+0400 [mail_csv] INFO: Spider closed (finished)

C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>

The script could only be run successfully against a simpler CSV file. The error above (exceptions.ValueError: Missing scheme in request url: nissan_9_1_00.csv) occurs because start_urls must contain full URLs with an explicit scheme (http://, file://, ...), not bare file names.
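For reference, here is a minimal sketch (my addition, not part of the post) of how such a file:// URL with an explicit scheme can be built from a local Windows path in Python 2.7, which is roughly what the fix further down does by hand:
In []:
# Sketch only: turn a local Windows path into a URL that Scrapy's Request() accepts.
import urllib

path = r'C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1\nissan_.csv'
# On Windows pathname2url() returns '///C:/Users/...', so prefixing 'file:'
# yields a well-formed 'file:///C:/...' URL.
file_url = 'file:' + urllib.pathname2url(path)
print(file_url)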

In [15]:
# Here the first line is the header row, just as in the source files
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\nissan_.csv"
In []:
"N";"N100";"purl"
"7371";"39,46";"http://auto.mail.ru/catalogue/nissan/"
"1416";"7,58";"http://auto.mail.ru/catalogue/nissan/qashqai/"
"1179";"6,31";"http://auto.mail.ru/catalogue/nissan/x-trail/"
In [16]:
# Load the code of the working spider
%load "C:\\Users\\kiss\\Documents\\GitHub_2\\scrapy_csv_1\\scrapy_csv_1\\spiders\\mail_csv.py"
In []:
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy_csv_1.items import ScrapyCsv1Item

class MailCsvSpider(CSVFeedSpider):
    name = 'mail_csv'
    #allowed_domains = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/']
    #start_urls = ['nissan_9_1_00.csv']

    headers = ['N', 'N100', 'purl']
    delimiter = ';'
    start_urls = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv']

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        #log.msg('Hi, this is a row!: %r' % row)
        i = ScrapyCsv1Item()
        i['N'] = row['N']
        i['N100'] = row['N100']
        i['purl'] = row['purl']
        return i
In []:
# To repeat: the spider started working after I replaced the start URL with
start_urls = ['file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv']
# and matched the headers attribute to the column names in the file.
The spider finally produced something sensible:
In []:
C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>scrapy crawl mail_csv
2014-07-21 22:03:40+0400 [scrapy] INFO: Scrapy 0.20.1 started (bot: scrapy_csv_1)
2014-07-21 22:03:40+0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-07-21 22:03:40+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_csv_1.spiders', 'SPIDER_MODULES': ['scrap
y_csv_1.spiders'], 'BOT_NAME': 'scrapy_csv_1'}
2014-07-21 22:03:41+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderStat
e
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMid
dleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMid
dleware, ChunkedTransferMiddleware, DownloaderStats
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlL
engthMiddleware, DepthMiddleware
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Enabled item pipelines:
2014-07-21 22:03:42+0400 [mail_csv] INFO: Spider opened
2014-07-21 22:03:42+0400 [mail_csv] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-07-21 22:03:42+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Crawled (200) <GET file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv> (ref
erer: None)
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
        {'N': u'N', 'N100': u'N100', 'purl': u'purl'}
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
        {'N': u'7371',
         'N100': u'39,46',
         'purl': u'http://auto.mail.ru/catalogue/nissan/'}
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
        {'N': u'1416',
         'N100': u'7,58',
         'purl': u'http://auto.mail.ru/catalogue/nissan/qashqai/'}
2014-07-21 22:03:42+0400 [mail_csv] DEBUG: Scraped from <200 file://C:/Users/kiss/Documents/GitHub_2/scrapy_csv_1/nissan_.csv>
        {'N': u'1179',
         'N100': u'6,31',
         'purl': u'http://auto.mail.ru/catalogue/nissan/x-trail/'}
2014-07-21 22:03:42+0400 [mail_csv] INFO: Closing spider (finished)
2014-07-21 22:03:42+0400 [mail_csv] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 261,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 218,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 7, 21, 18, 3, 42, 559000),
         'item_scraped_count': 4,
         'log_count/DEBUG': 11,
         'log_count/INFO': 3,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 7, 21, 18, 3, 42, 408000)}
2014-07-21 22:03:42+0400 [mail_csv] INFO: Spider closed (finished)

C:\Users\kiss\Documents\GitHub_2\scrapy_csv_1>
Above is a copy of the console output (stdout). You can see that all three data rows of the test file were scraped into corresponding item dictionaries; note that the header line also came through as a fourth item, because the first row of the file is not skipped when the headers attribute is set (item_scraped_count: 4). Congratulations all around.
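If that header item is unwanted, one option (my addition, not something the post does) is to filter it out in parse_row; another is to drop the headers attribute so that CSVFeedSpider takes the column names from the first line of the file itself. A minimal sketch of the first option:
In []:
# Sketch only (a replacement for parse_row inside MailCsvSpider): skip the
# header line of nissan_.csv, which otherwise comes through as an extra item
# because the headers attribute is set and the file still has its own header row.
def parse_row(self, response, row):
    if row['N'] == 'N':       # the first line just repeats the column names
        return None           # returning None produces no item for this row
    i = ScrapyCsv1Item()
    i['N'] = row['N']
    i['N100'] = row['N100']
    i['purl'] = row['purl']
    return i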

