The code in feedexport.py looked unusual to me: super(CSVkwItemExporter, self).__init__(*args, **kwargs). Below is an example of a spider that builds a CSV table with two fields in a fixed order, but the launch line has to specify scrapy crawl njit -o 13nov.csv -t csv
C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example>tree /F
Folder structure
Volume serial number: 00000095 6017:2A0B
C:.
│ ''.csv
│ courses.csv
│ scrapy.cfg
│
└───example
│ feedexport.py
│ feedexport.pyc
│ items.py
│ items.pyc
│ items.py~
│ pipelines.py
│ settings.py
│ settings.pyc
│ settings.py~
│ __init__.py
│ __init__.pyc
│
└───spiders
test.py
test.pyc
test.py~
__init__.py
__init__.pyc
C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example>
1. Let's look at the settings file **example\settings.py**
%load C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example\example\settings.py
# Scrapy settings for example project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
BOT_NAME = 'example'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['example.spiders']
NEWSPIDER_MODULE = 'example.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
FEED_EXPORTERS = {
    'csv': 'example.feedexport.CSVkwItemExporter'
}

EXPORT_FIELDS = [
    'title',
    'year',
]
2. Now the file **example\feedexport.py**
%load C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example\example\feedexport.py
"""
The standard CSVItemExporter class does not pass the kwargs through to the
CSV writer, resulting in EXPORT_FIELDS and EXPORT_ENCODING being ignored
(EXPORT_EMPTY is not used by CSV).
"""
from scrapy.conf import settings
from scrapy.contrib.exporter import CsvItemExporter


class CSVkwItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        kwargs['fields_to_export'] = settings.getlist('EXPORT_FIELDS') or None
        kwargs['encoding'] = settings.get('EXPORT_ENCODING', 'utf-8')

        super(CSVkwItemExporter, self).__init__(*args, **kwargs)
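To see what those two kwargs actually change, here is a minimal standalone sketch (my own demo, not part of the project; it has to be run from the project directory so that example.items imports). fields_to_export controls both which columns are written and their order:

from io import BytesIO
from scrapy.contrib.exporter import CsvItemExporter  # scrapy.exporters in newer Scrapy
from example.items import ExampleItem

buf = BytesIO()
exporter = CsvItemExporter(buf, fields_to_export=['title', 'year'],
                           encoding='utf-8')
exporter.start_exporting()   # the header row is actually written with the first item
item = ExampleItem()
item['year'] = u'Effective From: Fall 2010'
item['title'] = u'Acct 115 - Fundamentals of Financial Accounting (3-0-3)'
exporter.export_item(item)
exporter.finish_exporting()
print(buf.getvalue())
# title,year
# Acct 115 - Fundamentals of Financial Accounting (3-0-3),Effective From: Fall 2010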
Time to fire up the debugger, because there are questions:
1. What is this from scrapy.conf import settings?
2. Aren't the EXPORT_FIELDS visible in the spider without it?
3. ...
Here I did run the debugger, and had to copy part of the output into a separate post. For now let's just note that yes, there is a Settings object, which I had read about and then "forgot", but here "remembered"...
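A partial answer to the questions above, as far as I can tell: EXPORT_FIELDS is not a built-in Scrapy setting but a name invented in this project, so Scrapy never passes it anywhere by itself; the exporter has to read it from the Settings object explicitly, and from scrapy.conf import settings is the old (now deprecated) module-level way to reach that object. A sketch of the non-deprecated pattern (my assumption, not project code, with a hypothetical component name):

class FieldsAwareComponent(object):
    """Hypothetical Scrapy component (pipeline, extension, etc.) that
    receives settings explicitly instead of importing the singleton."""

    def __init__(self, fields):
        self.fields = fields  # ['title', 'year'] inside this project

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings has the same getlist()/get() API used in feedexport.py
        return cls(crawler.settings.getlist('EXPORT_FIELDS'))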
And let's not forget the spider code
%load C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example\example\spiders\test.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from example.items import ExampleItem


class MySpider(BaseSpider):
    name = "njit"
    allowed_domains = ["catalog.njit.edu"]
    start_urls = ["http://catalog.njit.edu/courses/cs.php#gradcourses",
                  "http://catalog.njit.edu/courses/acct.php#gradcourses",
                  "http://catalog.njit.edu/courses/arch.php#gradcourses",
                  "http://catalog.njit.edu/courses/bnfo.php#gradcourses",
                  "http://catalog.njit.edu/courses/biol.php#gradcourses",
                  "http://catalog.njit.edu/courses/bme.php#gradcourses",
                  "http://catalog.njit.edu/courses/bio.php#gradcourses",
                  "http://catalog.njit.edu/courses/che.php#gradcourses"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for titles in titles:
            item = ExampleItem()
            item['year'] = titles.select("span/text()").extract()[0]
            item['title'] = titles.select("a/b/text()").extract()[0]
            yield item
I don't see anything special here, except that yield item is unfamiliar to me. Also, what is items = [] for? It is never used. And for titles in titles: quietly rebinds the loop variable over the very list it iterates, which works but is easy to misread. On yield, quoting the textbook (a toy example follows the quotes):
Since neither of these constructs builds the entire result list at once, they save memory and allow the computation to be spread across the individual requests for results.
As we will see later, both constructs support this on-demand production of results by implementing the iteration protocol we studied in Chapter 14.
...it is possible to write a function that can return a value and later resume its work from the point where it was suspended. Such functions are known as generator functions, because they generate a sequence of values over time.
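A toy illustration of that quote (my own snippet, not project code): the list version materializes everything up front, while the generator version produces values one at a time, which is exactly what the spider's parse() does with yield item.

def squares_list(n):
    result = []
    for i in range(n):
        result.append(i * i)
    return result              # the whole list exists in memory at once

def squares_gen(n):
    for i in range(n):
        yield i * i            # each value is produced on demand

print(squares_list(5))         # [0, 1, 4, 9, 16]
print(list(squares_gen(5)))    # same values, but generated lazily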
The line hxs.select("//p") is deprecated; the log at the bottom should confirm this...
Some pages are parsed with errors, details below...
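Both issues can be handled together. Here is a sketch of the same parse() method (my rewrite, not the author's code): .xpath() replaces the deprecated .select(), Selector (which needs from scrapy.selector import Selector at the top of test.py) replaces HtmlXPathSelector, and empty extract() results are skipped instead of raising the IndexError seen in the log:

def parse(self, response):
    sel = Selector(response)              # replaces HtmlXPathSelector(response)
    for p in sel.xpath("//p"):            # .xpath() replaces the deprecated .select()
        year = p.xpath("span/text()").extract()
        title = p.xpath("a/b/text()").extract()
        if not year or not title:         # some <p> lack <span> or <a><b> children
            continue
        item = ExampleItem()
        item['year'] = year[0]
        item['title'] = title[0]
        yield item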
To launch the spider I typed scrapy crawl njit -o ''.csv -t csv" in the console (a sloppy line, with a stray trailing quote and the odd ''.csv file name, but it worked)
As the help output below shows, the -o switch is --output=FILE, and -t is --output-format=FORMAT.
5.1.22 Simplest way to dump all my scraped items into a JSON/CSV/XML file?
To dump into a JSON file: scrapy crawl myspider -o items.json
To dump into a CSV file: scrapy crawl myspider -o items.csv
To dump into a XML file: scrapy crawl myspider -o items.xml
...> scrapy -h
Options
=======
--help, -h show this help message and exit
-a NAME=VALUE set spider argument (may be repeated)
--output=FILE, -o FILE dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
format to use for dumping items with -o (default:
jsonlines)
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--lsprof=FILE write lsprof profiling stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
The full launch line is at the end of this post, in the long console copy-paste. There are errors in it that must not be forgotten: some parts of the pages are not parsed, and errors are raised instead... I will look at them in later posts, not here.
Here let's just note that a file named ''.csv appeared in the project folder, and it matches in size the courses.csv that was already there (see the tree command above). So the author of this project got the same errors.
And here it is, the file ''.csv: only 223 lines including the title,year header (222 items, matching item_scraped_count in the stats at the bottom)
%load C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example\''.csv
title,year
Acct 115 - Fundamentals of Financial Accounting (3-0-3),Effective From: Fall 2010
Acct 116 - Principles of Accounting II (3-0-3),Effective Until: Spring 2010
Acct 117 - Survey of Accounting (3-0-3),Effective From: Spring 2011
Acct 215 - Managerial Accounting I (3-0-3),Effective From: Fall 2010
CS 100 - Roadmap to Computing (3-0-3),Effective From: Fall 2010
Acct 315 - Accounting for Managerial Decision Making (3-0-3),Effective Until: Spring 2014
CS 101 - Computer Programming and Problem Solving (3-0-3),Effective From: Fall 2009
....
CS 337 - Performance Modeling in Computing (3-0-3),Effective From: Fall 2012
Biol 491 - Research and Independent Study (0-3-3),Effective From: Fall 2012
.....
CS 651 - Data Communications (3 credits),Effective From: Fall 2006
"CS 652 - Computer Networks-Architectures, Protocols and Standards (3 Credits)",Effective From: Fall 2006
CS 653 - Microcomputers and Applications (3 credits),Effective From: Fall 2006 Until: Spring 2009
CS 654 - Telecommunication Networks Performance Analysis (3 credits),Effective From: Fall 2006 Until: Spring 2009
CS 656 - Internet and Higher-Layer Protocols (3 credits),Effective From: Spring 2010
CS 657 - Principles of Interactive Computer Graphics (3 credits),Effective From: Fall 2006
...
The author's file also has 223 lines. I assume the rows are the same (I only checked a couple), but their order is different, which is what you would expect: the requests are downloaded concurrently and responses are processed as they arrive, so the interleaving varies from run to run. A nice illustration of Twisted...
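A quick way to check that assumption properly (a sketch; assumes both files sit in the project folder where this is run):

# compare the two exports ignoring row order
with open("''.csv") as mine, open("courses.csv") as theirs:
    same = sorted(mine.read().splitlines()) == sorted(theirs.read().splitlines())
print(same)  # True only if the files differ in row order alone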
And here is the console copy-paste; I deleted most of the repetitive lines... I hope all the error records are still here.
C:\Users\kiss\Anaconda>cd C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example
C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example>scrapy crawl njit -o ''.csv -t csv"
C:\Users\kiss\Anaconda\lib\site-packages\scrapy\settings\deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
BOT_VERSION: no longer used (user agent defaults to Scrapy now)
warnings.warn(msg, ScrapyDeprecationWarning)
2014-11-11 19:39:11+0300 [scrapy] INFO: Scrapy 0.20.1 started (bot: example)
2014-11-11 19:39:11+0300 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-11-11 19:39:11+0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'example.spiders', 'FEED_URI': "''.csv", 'SPIDER_MODULES': ['example.spiders'], 'BOT_NAME': 'example', 'USER_AGENT': 'example/1.0', 'FEED_FORMAT': 'csv'}
2014-11-11 19:39:14+0300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-11 19:39:16+0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-11 19:39:16+0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-11 19:39:16+0300 [scrapy] DEBUG: Enabled item pipelines:
2014-11-11 19:39:16+0300 [njit] INFO: Spider opened
2014-11-11 19:39:16+0300 [njit] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-11-11 19:39:16+0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-11-11 19:39:16+0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-11-11 19:39:18+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/acct.php#gradcourses> (referer: None)
C:\Users\kiss\Anaconda\lib\site-packages\scrapy\selector\lxmlsel.py:20: ScrapyDeprecationWarning: HtmlXPathSelector is deprecated, instanciate scrapy.selector.Selector instead
category=ScrapyDeprecationWarning, stacklevel=1)
example\spiders\test.py:12: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
titles = hxs.select("//p")
example\spiders\test.py:16: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
item['year'] = titles.select("span/text()").extract()[0]
example\spiders\test.py:17: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
item['title'] = titles.select("a/b/text()").extract()[0]
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 115 - Fundamentals of Financial Accounting (3-0-3)',
'year': u'Effective From: Fall 2010'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 116 - Principles of Accounting II (3-0-3)',
'year': u'Effective Until: Spring 2010'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/cs.php#gradcourses> (referer: None)
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 117 - Survey of Accounting (3-0-3)',
'year': u'Effective From: Spring 2011'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 215 - Managerial Accounting I (3-0-3)',
'year': u'Effective From: Fall 2010'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 100 - Roadmap to Computing (3-0-3)',
'year': u'Effective From: Fall 2010'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/arch.php#gradcourses> (referer: None)
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 315 - Accounting for Managerial Decision Making (3-0-3)',
'year': u'Effective Until: Spring 2014'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 101 - Computer Programming and Problem Solving (3-0-3)',
'year': u'Effective From: Fall 2009'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/bnfo.php#gradcourses> (referer: None)
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 317 - Managerial Accounting (3-0-3)',
'year': u'Effective Until: Spring 2010'}
2014-11-11 19:39:18+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 102', 'year': u'Effective From: Fall 2006'}
2014-11-11 19:39:19+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/arch.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:19+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/biol.php#gradcourses> (referer: None)
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 325 - Intermediate Accounting I (3-0-3)',
'year': u'Effective From: Spring 2014'}
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 103', 'year': u'Effective From: Fall 2012'}
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/bnfo.php>
{'title': u'BNFO 135 - Programming for Bioinformatics (3-0-3)',
'year': u'Effective From: Spring 2009'}
.....
.....
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 206 - Foundations of Biology: Ecology and Evolution Lab (0-3-1)',
'year': u'Effective From: Spring 2014'}
2014-11-11 19:39:19+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/che.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:19+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/bio.php#gradcourses> (referer: None)
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/acct.php>
{'title': u'Acct 435 - Intermediate Accounting II (3-0-3)',
'year': u'Effective From: Fall 2010'}
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 110 - Introduction to Computer Science IA (3-0-3)',
'year': u'Effective From: Fall 2006 Until: Spring 2012'}
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/bnfo.php>
{'title': u'BNFO 240 - Data Analysis for Bioinformatics I (3-0-3)',
'year': u'Effective From: Spring 2014 Until: Spring 2014'}
2014-11-11 19:39:19+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 222 - Evolution (3-0-3)',
'year': u'Effective From: Spring 2014'}
2014-11-11 19:39:20+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/acct.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 110A - CS 110A Computer Science Lab for CS 111 ((0-1.5-0))',
'year': u'Effective From: Fall 2006 Until: Spring 2012'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/bnfo.php>
{'title': u'BNFO 330 - Data Analysis for Bioinformatics I (3-0-3)',
'year': u'Effective From: Spring 2015'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 225 - Insects and Human Society (3-0-3)',
'year': u'Effective From: Spring 2010'}
2014-11-11 19:39:20+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/bio.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:20+0300 [njit] DEBUG: Crawled (200) <GET http://catalog.njit.edu/courses/bme.php#gradcourses> (referer: None)
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 111 - Introduction to Computer Science IB (3-0-3)',
'year': u'Effective From: Fall 2006 Until: Spring 2012'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/bnfo.php>
{'title': u'BNFO 340 - Data Analysis for Bioinformatics II (3-0-3)',
'year': u'Effective From: Spring 2014'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 250 - Biology of Neotropical Habitats: Ecuador and Galapagos Islands (2-2-3)',
'year': u'Effective From: Spring 2014'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 111A - CS111A Computer Science Lab for CS 111 ((0-1.5-0))',
'year': u'Effective From: Fall 2006 Until: Spring 2012'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/bnfo.php>
{'title': u'BNFO 482 - Databases and Data Mining in Bioinformatics (3-0-3)',
'year': u'Effective From: Spring 2015'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 310 - Research and Independent Study (3-0-3)',
'year': u'Effective From: Spring 2013'}
2014-11-11 19:39:20+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/bme.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 113 - Introduction to Computer Science (3-0-3)',
'year': u'Effective From: Fall 2012'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/bnfo.php>
{'title': u'BNFO 491 - Computer Science Project (3-0-3)',
'year': u'Effective From: Spring 2011'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 315 - Principles of Neurobiology (3-0-3)',
'year': u'Effective From: Fall 2013'}
.............
.............
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 114A - Lab (0-1.5-0)',
'year': u'Effective From: Fall 2006 Until: Spring 2012'}
2014-11-11 19:39:20+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/bnfo.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 340 - Mammalian Physiology (3-3-4)',
'year': u'Effective From: Spring 2014'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 114H - Honors Introduction to Computer Science II (3-0-3)',
'year': u'Effective Until: Fall 2006'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/biol.php>
{'title': u'Biol 341 - Introduction to Neurophysiology (3-0-3)',
'year': u'Effective From: Spring 2015'}
2014-11-11 19:39:20+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 115 - Intro. to CS I in C++ (3-0-3)',
'year': u'Effective From: Fall 2006'}
{'title': u'Biol 495 - Honors Seminar in Biology (3-0-3)',
'year': u'Effective From: Fall 2014'}
2014-11-11 19:39:21+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 341H - Honors Introduction to Logic and Automata (3-0-3)',
'year': u'Effective From: Fall 2006'}
2014-11-11 19:39:21+0300 [njit] ERROR: Spider error processing <GET http://catalog.njit.edu/courses/biol.php#gradcourses>
Traceback (most recent call last):
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Users\kiss\Anaconda\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 23, in process_spider_out
put
for x in result:
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kiss\Anaconda\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "example\spiders\test.py", line 16, in parse
item['year'] = titles.select("span/text()").extract()[0]
exceptions.IndexError: list index out of range
2014-11-11 19:39:21+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 345 - Web Search (3-0-3)',
'year': u'Effective From: Spring 2012'}
2014-11-11 19:39:21+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 352 - Parallel Computers and Programming (3-1-3)',
'year': u'Effective From: Fall 2006'}
2014-11-11 19:39:21+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 353 - Advanced Computer Organization (3-0-3)',
'year': u'Effective From: Fall 2006'}
..........
..........
2014-11-11 19:39:21+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 488H - Honors Independent Study in Computer Science/Information Systems (3-0-3)',
'year': u'Effective From: Fall 2006'}
2014-11-11 19:39:22+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 792 - Pre-Doctoral Research (3 credits)',
'year': u'Effective From: Fall 2006'}
2014-11-11 19:39:22+0300 [njit] DEBUG: Scraped from <200 http://catalog.njit.edu/courses/cs.php>
{'title': u'CS 794 - Computer Science/Information Systems Colloquium (Non-credit)',
'year': u'Effective From: Fall 2006'}
2014-11-11 19:39:23+0300 [njit] INFO: Closing spider (finished)
2014-11-11 19:39:23+0300 [njit] INFO: Stored csv feed (222 items) in: ''.csv
2014-11-11 19:39:23+0300 [njit] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1715,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 426500,
'downloader/response_count': 8,
'downloader/response_status_count/200': 8,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 11, 11, 16, 39, 23, 100000),
'item_scraped_count': 222,
'log_count/DEBUG': 236,
'log_count/ERROR': 7,
'log_count/INFO': 4,
'response_received_count': 8,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'spider_exceptions/IndexError': 7,
'start_time': datetime.datetime(2014, 11, 11, 16, 39, 16, 939000)}
2014-11-11 19:39:23+0300 [njit] INFO: Spider closed (finished)
C:\Users\kiss\Documents\GitHub_2\Python-Web-Crawler-master\Python-Web-Crawler-master\example>