
Tuesday, May 26, 2015

Fixing the code from the video "Web Scraping in Node.js". The rule: video - documentation - stackoverflow

Three days ago I published the post "Watched Web Scraping in Node.js ...". Back then I could not get the code from the video to run. It turned out that the command var url = this.attr('href'); now has to be written as var url = $(this).attr("href");. All of this is well documented on the Cheerio page, and there are examples on Stack Overflow as well. This post contains not only the links but also a record of how I hunted for the error, the reasoning behind the rule "video - documentation - stackoverflow", and a little about jQuery...

I spent far too much time hunting for the error

What was I thinking? I had to learn a lot of new things: the debugger, the REPL, new objects. That is me trying to make excuses. I am used to a simple rule: "just repeat the code from the video". When that did not work, I decided to drop everything and learn the debugger. I watched a video about debuggers... and went wrong there too: I started installing a debugger that required the Chrome browser... On the one hand, I am picking up the Node.js infrastructure quickly and successfully; on the other, the way I tackle concrete problems amounts to the headlong charges of an overconfident newbie... My brain probably "does not want to lose momentum": reading the documentation feels like "it takes too long"..., so I look for a solution on Stack Overflow first, when I should simply have run a search on the Cheerio page.

About jQuery

A callback function is used in the .each() construct... I assume the change in Cheerio's syntax was not accidental: "The scope object this is not a jQuery object by default". After all, we already know that any JS code can simply be "injected" into a page, including genuine jQuery... The documentation says that Cheerio is several times faster. I do not see any other advantages yet..., but I assume that is down to my ignorance...
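
To see why this.attr(...) fails inside .each() while $(this).attr(...) works, here is a toy model in plain JavaScript. It is not Cheerio's source code: makeCollection and wrap are hypothetical names invented for the illustration, and the node shape ({ attribs: {...} }) mirrors the objects Cheerio prints in the REPL session later in this post. The only point is that each() calls the callback with this set to a raw node, which has no attr() method until it is wrapped.

```javascript
// Toy model only: NOT Cheerio's implementation.
// Raw nodes are plain data objects, like the ones printed in the REPL.
var nodes = [
  { type: 'tag', name: 'a', attribs: { href: 'http://i.imgur.com/Ghn6HZZ.gif' } },
  { type: 'tag', name: 'a', attribs: { href: 'http://i.imgur.com/CpXWD4A.jpg' } }
];

// wrap() plays the role of $(node): it returns an object that has attr().
function wrap(node) {
  return {
    attr: function (name) { return node.attribs[name]; }
  };
}

// makeCollection() plays the role of $('a.title', '#siteTable').
function makeCollection(items) {
  return {
    each: function (fn) {
      items.forEach(function (node, i) {
        // The callback gets the raw node as `this` (and as the 2nd argument).
        fn.call(node, i, node);
      });
      return this;
    }
  };
}

var rawAttrTypes = [];
var hrefs = [];

makeCollection(nodes).each(function () {
  rawAttrTypes.push(typeof this.attr);   // the raw node has no attr() method
  hrefs.push(wrap(this).attr('href'));   // wrapping first makes attr() available
});

console.log(rawAttrTypes);
console.log(hrefs);
```

In the real library the wrapping step is $(this), which is exactly the change that made the code from the video work.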

I watched "Web Scraping in Node.js" by Smitha Milli and copied the code from the video here
cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
htmlparser2, Parser options ... A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.
Object #<Object> has no method 'attr' when scraping with Cheerio and NodeJS - The scope object this is not a jQuery object by default.
call back on cheerio node.js - You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive on how you traverse it.
How come a globally defined array that gets data pushed to it with in a callback function is empty in the global scope?
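
The last question in this list matters for the script in this post: urls is declared at the top of the file, but request only calls its callback once the response has arrived, so code that reads urls straight after the request(...) call sees an empty array. Below is a minimal sketch of that effect. fakeRequest is a hypothetical stand-in built on setTimeout, not the real request module, and the pushed value is made up.

```javascript
var urls = [];

// Hypothetical stand-in for request(url, callback): it calls back asynchronously.
function fakeRequest(url, callback) {
  setTimeout(function () {
    callback(null, { statusCode: 200 }, '<a href="/example">link</a>');
  }, 0);
}

fakeRequest('http://www.reddit.com', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    urls.push('/example');
    // Work with the collected data here, inside the callback.
    console.log('inside the callback:', urls.length);
  }
});

// This line runs before the callback above, so the array is still empty.
var lengthRightAfterCall = urls.length;
console.log('right after the call:', lengthRightAfterCall);
```

The practical consequence is that anything that needs the collected links has to live inside the callback, or be called from it.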

Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

❁ Incredibly flexible: Cheerio wraps around @FB55's forgiving htmlparser2. Cheerio can parse nearly any HTML or XML document.

Create the following file and write it to disk

In [4]:
%%file /home/kiss/Desktop/scr/r01.js
var request = require('request'),
    cheerio = require('cheerio'),
    urls = [];
    
request('http://www.reddit.com', function(error, response, body){
    if(!error && response.statusCode == 200){  
        var $ = cheerio.load(body);
        console.log(body);
        debugger;
        // console.log($('a.title', '#siteTable').attr('title'));
    }
});
Overwriting /home/kiss/Desktop/scr/r01.js

Run the file under the debugger

In [ ]:
kiss@kali:~/Desktop/scr$ node debug /home/kiss/Desktop/scr/r01.js
< debugger listening on port 5858
connecting... ok
break in r01.js:1
  1 var request = require('request'),
  2     cheerio = require('cheerio'),
  3     urls = [];
In [ ]:
debug> c
<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>reddit: 
        the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, 
        comment, submit " /><meta name="description" content="reddit: the front page of the internet" />
        <meta name="referrer" content="always"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.reddit.com/" />
        <meta name="viewport" content="width=1024"><link rel='icon' href="//www.redditstatic.com/icon.png" ....
# ...followed by the source of the whole page

... 34:33.608551+00:00 running 7d6cd40 country code: RU.</span></p>
            <script type="text/javascript" src="//www.redditstatic.com/reddit.en.ltYFZGLPMrg.js"></script>
</body></html>
In [ ]:
break in r01.js:9
  7         var $ = cheerio.load(body);
  8         console.log(body);
  9         debugger;
 10         // console.log($('a.title', '#siteTable').attr('title'));
 11     }
debug> repl

Only now can we check how Cheerio works

In [ ]:
> $('a.title', '#siteTable')
{ '0': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/Ghn6HZZ.gif',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
    .......
# This is only the beginning; the full output is in the appendix

Next, get the attribute value for the first element in the set (as in jQuery)

In [ ]:
> $('a.title', '#siteTable').attr('href');
'http://i.imgur.com/Ghn6HZZ.gif'
> 
In [ ]:
$('a.title', '#siteTable').each(function(){
            var url = this.attr('href');   // the code from the video: fails, `this` has no attr()
            // urls.push(url);
});
In [ ]:
$('a.title', '#siteTable').each(function(){ var url =  this.href; return url; });
In [ ]:
$('a.title', '#siteTable').each(function(){ var url =  this.attr('href'); return url; });
In [ ]:
$('a.title', '#siteTable').each(function(){ var url = $(this).attr("href"); });
In [ ]:
$('a.title', '#siteTable').each(function(){ var url = this.getAttribute("href"); return url; });

Object #<Object> has no method 'attr' when scraping with Cheerio and NodeJS

In [ ]:
> $('a.title', '#siteTable').each(function(){ var url = $(this).attr("href"); console.log(url); });
< http://i.imgur.com/Ghn6HZZ.gif
< http://i.imgur.com/CpXWD4A.jpg
< /r/AskReddit/comments/37a1oz/doctors_of_reddit_what_was_the_most_incorrect/
< https://www.youtube.com/watch?v=YLZNwxFtNhk
< http://i.imgur.com/oBGuvvI.jpg
< https://en.wikipedia.org/wiki/Zone_rouge
< http://i.imgur.com/TqWPpRe.jpg
< /r/Jokes/comments/379q7a/im_not_saying_its_a_mistake_letting_my_girlfriend/
< /r/movies/comments/379vj5/john_wick_and_mad_max_fury_road_signaling_the/
< /r/Showerthoughts/comments/379m5r/luke_skywalker_turning_his_targeting_computer_off/
< http://i.imgur.com/iIAw2mq.jpg
< http://gfycat.com/ConcernedReflectingFrillneckedlizard
< http://www.psypost.org/2015/05/ecstasy-may-soon-be-a-treatment-for-social-anxiety-among-autistic-adults-34602
< http://imgur.com/wPd6fVI
< /r/explainlikeimfive/comments/379tb9/eli5_how_can_a_candy_company_jelly_belly_create/
< /r/askscience/comments/379krr/why_is_forest_height_on_mountain_ranges_so_uniform/
< http://imgur.com/BlfmyKZ
< http://www.campusreform.org/?ID=6527
< http://africanspotlight.com/2015/05/23/kenyan-lawyer-offers-obama-50-cows-70-sheep-30-goats-to-marry-his-daughter-malia/
< http://imgur.com/i0eQHx8
< http://www.cbc.ca/news/technology/blood-turned-into-nerve-cells-by-canadian-researchers-1.3082288
< /r/IAmA/comments/3789vc/i_am_voice_actress_grey_delislegriffin_you_might/
< http://i.imgur.com/ZL2FX1r.jpg
< /r/LifeProTips/comments/378bxv/lpt_the_lesser_known_f6_key_will_highlight_the/
< http://www.londonlovesbusiness.com/business-news/london-2012-olympics/this-graph-shows-the-sickening-extent-of-the-qatar-world-cup-deaths/8120.article
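
The list mixes absolute URLs with site-relative paths such as /r/AskReddit/.... If absolute links are needed, one option (not part of the video) is the URL class built into Node.js. The sketch below assumes a recent Node.js version where URL is available as a global; the base address is the one requested in r01.js, and the two sample links are copied from the output above.

```javascript
// Assumes a Node.js version where URL is available as a global.
var base = 'http://www.reddit.com';  // the address requested in r01.js

var hrefs = [
  'http://i.imgur.com/Ghn6HZZ.gif',
  '/r/AskReddit/comments/37a1oz/doctors_of_reddit_what_was_the_most_incorrect/'
];

// new URL(href, base) leaves absolute URLs alone and resolves
// relative paths against the base address.
var absolute = hrefs.map(function (href) {
  return new URL(href, base).href;
});

console.log(absolute);
```

Inside the .each() callback the same call could be applied to the value returned by $(this).attr("href") before it is stored.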

Appendix: how Cheerio works

In [ ]:
> $('a.title', '#siteTable')
{ '0': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/Ghn6HZZ.gif',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '1': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/CpXWD4A.jpg',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '2': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/AskReddit/comments/37a1oz/doctors_of_reddit_what_was_the_most_incorrect/',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '3': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'https://www.youtube.com/watch?v=YLZNwxFtNhk',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '4': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/oBGuvvI.jpg',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '5': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'https://en.wikipedia.org/wiki/Zone_rouge',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '6': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/TqWPpRe.jpg',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '7': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/Jokes/comments/379q7a/im_not_saying_its_a_mistake_letting_my_girlfriend/',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '8': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/movies/comments/379vj5/john_wick_and_mad_max_fury_road_signaling_the/',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { type: 'tag',
        name: 'span',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '9': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/Showerthoughts/comments/379m5r/luke_skywalker_turning_his_targeting_computer_... (length: 84)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '10': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/iIAw2mq.jpg',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '11': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://gfycat.com/ConcernedReflectingFrillneckedlizard',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '12': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://www.psypost.org/2015/05/ecstasy-may-soon-be-a-treatment-for-social-anxiet... (length: 109)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '13': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://imgur.com/wPd6fVI',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '14': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/explainlikeimfive/comments/379tb9/eli5_how_can_a_candy_company_jelly_belly_cr... (length: 85)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '15': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/askscience/comments/379krr/why_is_forest_height_on_mountain_ranges_so_uniform... (length: 81)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { type: 'tag',
        name: 'span',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '16': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://imgur.com/BlfmyKZ',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '17': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://www.campusreform.org/?ID=6527',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '18': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://africanspotlight.com/2015/05/23/kenyan-lawyer-offers-obama-50-cows-70-she... (length: 120)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { type: 'tag',
        name: 'span',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '19': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://imgur.com/i0eQHx8',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '20': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://www.cbc.ca/news/technology/blood-turned-into-nerve-cells-by-canadian-rese... (length: 97)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { type: 'tag',
        name: 'span',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '21': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/IAmA/comments/3789vc/i_am_voice_actress_grey_delislegriffin_you_might/',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { type: 'tag',
        name: 'span',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '22': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://i.imgur.com/ZL2FX1r.jpg',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { type: 'tag',
        name: 'span',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '23': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: '/r/LifeProTips/comments/378bxv/lpt_the_lesser_known_f6_key_will_highlight_the/',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  '24': 
   { type: 'tag',
     name: 'a',
     attribs: 
      { class: 'title may-blank ',
        href: 'http://www.londonlovesbusiness.com/business-news/london-2012-olympics/this-graph... (length: 150)',
        tabindex: '1' },
     children: [ [Object] ],
     next: 
      { data: ' ',
        type: 'text',
        next: [Object],
        prev: [Object],
        parent: [Object] },
     prev: null,
     parent: 
      { type: 'tag',
        name: 'p',
        attribs: [Object],
        children: [Object],
        next: [Object],
        prev: null,
        parent: [Object] } },
  options: 
   { withDomLvl1: true,
     normalizeWhitespace: false,
     xmlMode: false,
     decodeEntities: true },
  _root: 
   { '0': 
      { type: 'root',
        name: 'root',
        attribs: {},
        children: [Object],
        next: null,
        prev: null,
        parent: null },
     options: 
      { withDomLvl1: true,
        normalizeWhitespace: false,
        xmlMode: false,
        decodeEntities: true },
     length: 1,
     _root: 
      { '0': [Object],
        options: [Object],
        length: 1,
        _root: [Object] } },
  length: 25,
  prevObject: 
   { '0': 
      { type: 'root',
        name: 'root',
        attribs: {},
        children: [Object],
        next: null,
        prev: null,
        parent: null },
     options: 
      { withDomLvl1: true,
        normalizeWhitespace: false,
        xmlMode: false,
        decodeEntities: true },
     length: 1,
     _root: 
      { '0': [Object],
        options: [Object],
        length: 1,
        _root: [Object] } } }



