I considered the article Scraping One Million Points A Day, Here's How (TUTORIAL) important enough to copy it here piece by piece and add my own comments.
But after reading it I decided that this post would contain only the first part, with all the links to services and software.
(BEGINNER) COMPONENTS OF A SCRAPE
Before diving into the software, here are some helpful definitions for the process and technology.
"INPUT"
Establish a target website/keyword and generate a "seed list". A seed list is an array or string (a set) of websites you are pulling data from. For example, if you are getting all of the cars from carsoup.com, you would have a list of all the URLs you are pulling the data from.
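For illustration, a seed list can be as simple as a Python list of URLs; the domain and query parameters below are hypothetical placeholders, not from the article.

# A seed list is just an ordered collection of the URLs you plan to pull data from.
seed_list = [
    "http://example.com/cars?page=1",
    "http://example.com/cars?page=2",
    "http://example.com/cars?page=3",
]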
"SELECTORS"
A selector is the method or tool you use to grab data from a website. In this tutorial I will use XPath, one language of many for selecting items on a page. You can also use regex, CSS3 selectors, JSDOM, or jQuery.
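As a minimal sketch of an XPath selector in Python, here is an example using the lxml library; the HTML snippet and the XPath expression are invented for illustration and are not from the article.

from lxml import html

# Illustrative HTML; in a real scrape this would come from an HTTP response.
page = html.fromstring("""
<ul>
  <li class="car"><a href="/car/1">Sedan</a></li>
  <li class="car"><a href="/car/2">Coupe</a></li>
</ul>
""")

# XPath: take the link text of every list item with class "car".
names = page.xpath('//li[@class="car"]/a/text()')
print(names)  # ['Sedan', 'Coupe']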
"PARSING"
Parsing is more a process than an actual function. It is the set of steps taken to change the organizational structure of a data set so that it can be served directly to a front-end or a database. A real-life example is parsing data from an XML feed into a MySQL table or converting a CSV file to Excel.
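A minimal sketch of that XML-feed-to-table kind of parsing, using only Python's standard library; the feed structure and field names are illustrative, not from the article.

import xml.etree.ElementTree as ET

# Illustrative XML feed; a real one would be downloaded from the target site.
feed = """
<cars>
  <car><name>Sedan</name><price>12000</price></car>
  <car><name>Coupe</name><price>15000</price></car>
</cars>
"""

root = ET.fromstring(feed)
# Flatten each <car> element into a (name, price) row ready for a SQL INSERT.
rows = [(car.findtext("name"), int(car.findtext("price"))) for car in root.findall("car")]
print(rows)  # [('Sedan', 12000), ('Coupe', 15000)]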
"OUTPUT"
This is the static, structured data that comes out of your scrape. This is very important because many of the scraping tools, modules, code languages, and databases play nicer (or less mean) with certain types of data. If a human being is reviewing it, you want your data to come out as CSV. If, for example, MongoDB is reading the data, you want it to come out as JSON.
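To make the CSV-for-humans versus JSON-for-databases point concrete, here is a small sketch that writes the same records both ways; the records and file names are made up for the example.

import csv
import json

records = [
    {"name": "Sedan", "price": 12000},
    {"name": "Coupe", "price": 15000},
]

# CSV output: easy for a person to open in a spreadsheet and review.
with open("cars.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)

# JSON output: what a document store such as MongoDB expects to ingest.
with open("cars.json", "w") as f:
    json.dump(records, f, indent=2)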
"STORAGE"
You need a storage schema and a storage location. Maybe you are writing directly to your local machine (which is fine to start), but more often you will be saving your results directly to a database or a web service like Rackspace/AWS, etc. My storage schema is outlined below in Toolset.
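As one possible storage step (not the author's own schema, which he describes in his Toolset), here is a hedged sketch of pushing scraped records into MongoDB with pymongo; the connection URI, database, and collection names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["scrapes"]["cars"]   # database "scrapes", collection "cars"

records = [
    {"name": "Sedan", "url": "http://example.com/car/1"},
    {"name": "Coupe", "url": "http://example.com/car/2"},
]
collection.insert_many(records)   # each dict becomes one JSON-like document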
"PROXIES"
A proxy is a web address your computer or remote machine goes through before initiating a scrape, so requests appear to come from the proxy's IP rather than your own.
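A minimal sketch of sending a request through a proxy with the Python requests library; the proxy address and credentials below are placeholders, substitute entries from your own proxy list.

import requests

# Placeholder proxy address/credentials; use entries from your own proxy list.
proxies = {
    "http": "http://user:password@203.0.113.10:8080",
    "https": "http://user:password@203.0.113.10:8080",
}

# httpbin.org/ip echoes back the IP the server saw; it should be the proxy's.
response = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.text)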
"USER AGENT"
A user agent lets a web server know what is accessing its data or making a request. Changing user agents to mobile versions, or simply rotating them, helps increase your anonymity during scrapes; it does not guarantee it.
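A minimal sketch of user agent rotation with the requests library; the User-Agent strings are examples only, not an exhaustive or current list.

import random
import requests

# Example User-Agent strings; in practice keep a longer, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46",
    "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
]

# Pick a different agent per request so the target sees varied clients.
headers = {"User-Agent": random.choice(USER_AGENTS)}
response = requests.get("http://example.com", headers=headers, timeout=10)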
"BOT"
This is by far my favorite part: it's time to choose a framework to build your code on--there are literally thousands of ways to absorb data, and choosing the right bot is critical because they all interact with the web differently. My bot list is outlined below in Toolset.
So far I fully agree with the author (whose name, by the way, is Patrick McConlogue): any attempt at classification is arbitrary but useful, so here he spells out (formalizes) his own version. Further on, in the first "beginner" section, Patrick links to the description of the paid program scrapebox.com. Both the description and the program are well thought out, there are videos... in particular, it offers proxy rotators, verifiers, and so on.
While reading the description I remembered the programs (... vietspider ...) that I had already downloaded and tried. I need to learn how to use them through a proxy... I tried the query "proxy list checker software"... and googled up a lot of interesting things; here, for example, is proxyfire.net, which also has videos.
Continuing the topic of proxy-list checking services: I also googled up GatherProxy Scraper, "a small tool developed by Gatherproxy.com"... and, on the same site, "FREE PROXY SOFTWARE Visitor Generator Software - Website traffic view bot - Visitors bot for your website PingBox - Website Pinger - Mass Rank Checker - Mass Backlink verifier".
SELENIUM - Browser Automation Uses your actual browser to scrape items in a full DOM environment. This means that no matter what the structure of the page is, Selenium can load it. If you are looking for something that can still scrape tricky pages, JavaScript-heavy or otherwise, at the beginner level this is great. Keep in mind, Selenium is like trying to paint the Mona Lisa with crayons--it's easy, but you have to grow past it because the next item is sexy.
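A minimal sketch using Selenium's Python bindings; it assumes Firefox with geckodriver is installed, and the URL is a placeholder.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes geckodriver is installed and on PATH; any supported browser works.
driver = webdriver.Firefox()
try:
    driver.get("http://example.com")
    # The full DOM is available here, including anything rendered by JavaScript.
    for link in driver.find_elements(By.XPATH, "//a"):
        print(link.text, link.get_attribute("href"))
finally:
    driver.quit()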
SCRAPY Scrapy is by far the industry bread and butter for scraping at this point. Written in Python, Scrapy is an open-source project which uses XPath/CSS3 selectors to pull "items". It is relatively simple to install. Scrapy has a built-in JSON output option, which is important to note.
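A minimal sketch of a Scrapy spider; the domain, start URL, and XPath expressions are placeholders matching the illustrative markup used earlier. Saving it as cars_spider.py and running scrapy runspider cars_spider.py -o cars.json uses the built-in JSON output mentioned above.

import scrapy

class CarsSpider(scrapy.Spider):
    name = "cars"
    # Placeholder seed URL; in practice this would be your seed list.
    start_urls = ["http://example.com/cars"]

    def parse(self, response):
        # Placeholder XPath expressions; yield one item dict per matched element.
        for car in response.xpath('//li[@class="car"]'):
            yield {
                "name": car.xpath("a/text()").get(),
                "url": response.urljoin(car.xpath("a/@href").get()),
            }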
HERITRIX - Heritrix 3.0 and 3.1 User Guide For the savvy scrapers reading this: if they have heard of Heritrix they may cringe, close the window, navigate to another topic, etc. Before you judge me too quickly, Heritrix includes a basic web interface to control scrapes without using the command line -> this is huge if non-technical people are looking for data or you want to demonstrate how cool scraping can be.
NUTCH - Welcome to Apache Nutch Hailing from Apache, Nutch is a web crawler, not necessarily a scraper. What this means is that Nutch is closer to Googlebot: it is interested in everything and is generally designed to download entire websites and follow links (aka threads) to more pages, which it subsequently downloads. Using Nutch for a one-off scrape of a website is like aiming a tank at a mouse.
BEAUTIFULSOUP - We called him Tortoise because he taught us. A Python module for selecting items on the page, BeautifulSoup (namely bs4) is ideal for quickly choosing items. However, unlike Scrapy, BeautifulSoup assumes you are pretty aware of the scraping procedure and comfortable with some basic coding in Python. BeautifulSoup combined with Celery and RabbitMQ can be a beast, but due to the blocking nature of Python it is slower than some of the advanced alternatives.
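A minimal sketch with BeautifulSoup (bs4) plus requests; the URL and the CSS selector are placeholders.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; pair this with your seed list in a real scrape.
response = requests.get("http://example.com/cars", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# CSS selector instead of XPath: every link inside a list item with class "car".
for link in soup.select("li.car a"):
    print(link.get_text(strip=True), link.get("href"))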
CASPERJS/PHANTOMJS Casper and Phantom are key to scraping any website because they load a full environment for each page. This means that the target believes you are a full web browser. I have used Casper often, but its speed is abysmal if your code is not optimized or asynchronous. That being said, Casper (like Selenium) allows you to click, save, take screenshots, and fill out login forms.
NODE - distributed data scraping and processing for node.js [These links don't work anymore. The GitHub page is now https://github.com/chriso/node.io and the npm page is https://npmjs.org/package/node.io] Currently, I am obsessed with Node, queue-based programming, and asynchronous design--more on that in the section below describing USE CASES. It's hot, so very hot.
(3/3) TOOLSET: WEB SERVICES
2MHOST - A basic hosting service that is cheap, reliable, and includes a domain. I have used them for six years and will continue to do so. Keep in mind it is a shared host without root access, so it is only for front-end projects or FTP storage of scrapes. Honestly, I don't know that it applies in this tutorial, but it's a great host--take it or leave it.
AWS - Compute, Storage, Database Regarding scraping, there is not much you can't do with the Free Usage Tier provided by Amazon with an EC2 instance. If you are new or a beginner, I would stop reading here and learn how to perform basic things with EC2/S3/IAM (focus on Security Groups and Elastic IPs). My only caution is to be extremely careful with billing, as I have made the mistake of turning on services I didn't need and receiving an unfriendly bill. That being said, AND I DON'T PROMISE THIS FOR YOU, AWS did refund one of my two huge bills because I guess they are awesome and love startups.
HEROKU I use Heroku for quick app testing and unique IP address assignments. Heroku launches an application or app package from your computer's command line. This means that you can scope a server out anonymously with a free Heroku app space before beginning a full scrape.
WHATSMYUSERAGENT A website I use to make sure my proxy rotation and user agent rotation are working effectively. While not 100% accurate like Loga, it is easier to simply scrape it and test (Loga has a bothersome redirect).
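In the same spirit, a quick self-check sketch: ask a header-echoing service what User-Agent it actually received. The httpbin.org/headers endpoint and the User-Agent string here are just examples, not what the author used.

import requests

# Example User-Agent; rotate these as described in the USER AGENT section above.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) MyScraper/0.1"}

# httpbin.org/headers echoes the request headers back as JSON.
response = requests.get("http://httpbin.org/headers", headers=headers, timeout=10)
print(response.json()["headers"]["User-Agent"])   # should match what we sent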
GITHUB - www.github.com Every script of mine is committed to private repos on GitHub. My new Coin Works startup will be available publicly. GitHub lets you back up and version-control your scripts; more importantly, it is helpful in team-based scenarios where multiple people may be editing or submitting scripts.
While browsing the links I couldn't resist and found the Feature Guide: Amazon EC2 Elastic IP Addresses.
LANGUAGES
Personally, I am a Python/Node/Java fan, in that order, but if you want to learn more front-end technology I would suggest sticking to the PHP/Ruby/Java route. Code Academy is a waste of time; set a goal and start learning how to do each part of the code from GitHub/Stackoverflow.
So, all of the most important links are listed above, and the material is very extensive. What should I do with it?
Even taking into account how much I already know, it is hard for me to choose my own toolset. More precisely, I have already chosen it and am working with it. It would be foolish to drop everything and start studying Amazon's services... But obviously they do have to be mastered... just think of it: your own servers... without the headache of security, maintenance, and configuration... And Node.js fits in nicely here too...
Obviously I need to look at all the options..., but first I should get the Python programs working... It is the links and software from this article that I will use later to search for the corresponding conceptual articles and videos.
The author of this article barely mentions proxies, yet that is exactly what I am working on right now - learning to work with proxies - and that will be the topic of the next post. Yesterday I downloaded and installed Heritrix from the list above (Selenium... seemed too cumbersome a couple of years ago..., and in general Java and I have cooled toward each other for now... because of my romance with snakes). I should probably try Heritrix in parallel with Scrapy (it's in Python, it even has its own console..., and I skimmed Scrapy before I sat down to learn Python).