First practice sessions (curl -v -i -I -D): how to download a page, view the outgoing and incoming HTTP headers, redirect everything to a file, and replace the "User-Agent" header

In []:

-A, --user-agent STRING  User-Agent to send to server (H)

-O, --remote-name        Write output to a file named as the remote file
    --remote-name-all    Use the remote file name for all URLs

-I, --head               Show document info only    
    
-i, --include            Include protocol headers in the output (H/F)

-D, --dump-header FILE   Write the headers to this file

The first stage is a first acquaintance with the library: find the vendor documentation and an overview video. The second stage is to find someone's tutorial, preferably a series of videos, or a sensible article. Besides studying the library, you also need a clear picture of the sequence of actions (processes) and you need to set up the infrastructure (tooling) for later work. For example, how will I view HTTP headers? Videos are valuable here too, because they show these tools in passing. The third stage is mastering the options through short articles by different authors with code examples.

Linux and Unix curl command
Using cURL to automate HTTP jobs

Modifying user-agent in curl or wget ERROR 403: Forbidden
Change User Agent with curl to Get URL Source Code as Different OS & Browser
Custom User-Agent String
User agent: the User-Agent string is one of the criteria by which Web crawlers may be excluded from accessing certain parts of a Web site using the Robots Exclusion Standard (robots.txt file).
Robots exclusion standard

Get HTTP Header Info from Web Sites Using curl
curl.1 the man page

In [1]:

!curl -h

Usage: curl [options...] <url>

Options: (H) means HTTP/HTTPS only, (F) means FTP only

     --anyauth       Pick "any" authentication method (H)

 -a, --append        Append to target file when uploading (F/SFTP)

     --basic         Use HTTP Basic Authentication (H)

     --cacert FILE   CA certificate to verify peer against (SSL)

     --capath DIR    CA directory to verify peer against (SSL)

 -E, --cert CERT[:PASSWD] Client certificate file and password (SSL)

     --cert-type TYPE Certificate file type (DER/PEM/ENG) (SSL)

     --ciphers LIST  SSL ciphers to use (SSL)

     --compressed    Request compressed response (using deflate or gzip)

 -K, --config FILE   Specify which config file to read

     --connect-timeout SECONDS  Maximum time allowed for connection

 -C, --continue-at OFFSET  Resumed transfer offset

 -b, --cookie STRING/FILE  String or file to read cookies from (H)

 -c, --cookie-jar FILE  Write cookies to this file after operation (H)

     --create-dirs   Create necessary local directory hierarchy

     --crlf          Convert LF to CRLF in upload

     --crlfile FILE  Get a CRL list in PEM format from the given file

 -d, --data DATA     HTTP POST data (H)

     --data-ascii DATA  HTTP POST ASCII data (H)

     --data-binary DATA  HTTP POST binary data (H)

     --data-urlencode DATA  HTTP POST data url encoded (H)

     --delegation STRING GSS-API delegation permission

     --digest        Use HTTP Digest Authentication (H)

     --disable-eprt  Inhibit using EPRT or LPRT (F)

     --disable-epsv  Inhibit using EPSV (F)

 -D, --dump-header FILE  Write the headers to this file

     --egd-file FILE  EGD socket path for random data (SSL)

     --engine ENGINGE  Crypto engine (SSL). "--engine list" for list

 -f, --fail          Fail silently (no output at all) on HTTP errors (H)

 -F, --form CONTENT  Specify HTTP multipart POST data (H)

     --form-string STRING  Specify HTTP multipart POST data (H)

     --ftp-account DATA  Account data string (F)

     --ftp-alternative-to-user COMMAND  String to replace "USER [name]" (F)

     --ftp-create-dirs  Create the remote dirs if not present (F)

     --ftp-method [MULTICWD/NOCWD/SINGLECWD] Control CWD usage (F)

     --ftp-pasv      Use PASV/EPSV instead of PORT (F)

 -P, --ftp-port ADR  Use PORT with given address instead of PASV (F)

     --ftp-skip-pasv-ip Skip the IP address for PASV (F)

     --ftp-pret      Send PRET before PASV (for drftpd) (F)

     --ftp-ssl-ccc   Send CCC after authenticating (F)

     --ftp-ssl-ccc-mode ACTIVE/PASSIVE  Set CCC mode (F)

     --ftp-ssl-control Require SSL/TLS for ftp login, clear for transfer (F)

 -G, --get           Send the -d data with a HTTP GET (H)

 -g, --globoff       Disable URL sequences and ranges using {} and []

 -H, --header LINE   Custom header to pass to server (H)

 -I, --head          Show document info only

 -h, --help          This help text

     --hostpubmd5 MD5  Hex encoded MD5 string of the host public key. (SSH)

 -0, --http1.0       Use HTTP 1.0 (H)

     --ignore-content-length  Ignore the HTTP Content-Length header

 -i, --include       Include protocol headers in the output (H/F)

 -k, --insecure      Allow connections to SSL sites without certs (H)

     --interface INTERFACE  Specify network interface/address to use

 -4, --ipv4          Resolve name to IPv4 address

 -6, --ipv6          Resolve name to IPv6 address

 -j, --junk-session-cookies Ignore session cookies read from file (H)

     --keepalive-time SECONDS  Interval between keepalive probes

     --key KEY       Private key file name (SSL/SSH)

     --key-type TYPE Private key file type (DER/PEM/ENG) (SSL)

     --krb LEVEL     Enable Kerberos with specified security level (F)

     --libcurl FILE  Dump libcurl equivalent code of this command line

     --limit-rate RATE  Limit transfer speed to this rate

 -l, --list-only     List only names of an FTP directory (F)

     --local-port RANGE  Force use of these local port numbers

 -L, --location      Follow redirects (H)

     --location-trusted like --location and send auth to other hosts (H)

 -M, --manual        Display the full manual

     --mail-from FROM  Mail from this address

     --mail-rcpt TO  Mail to this receiver(s)

     --mail-auth AUTH  Originator address of the original email

     --max-filesize BYTES  Maximum file size to download (H/F)

     --max-redirs NUM  Maximum number of redirects allowed (H)

 -m, --max-time SECONDS  Maximum time allowed for the transfer

     --negotiate     Use HTTP Negotiate Authentication (H)

 -n, --netrc         Must read .netrc for user name and password

     --netrc-optional Use either .netrc or URL; overrides -n

     --netrc-file FILE  Set up the netrc filename to use

 -N, --no-buffer     Disable buffering of the output stream

     --no-keepalive  Disable keepalive use on the connection

     --no-sessionid  Disable SSL session-ID reusing (SSL)

     --noproxy       List of hosts which do not use proxy

     --ntlm          Use HTTP NTLM authentication (H)

 -o, --output FILE   Write output to <file> instead of stdout

     --pass PASS     Pass phrase for the private key (SSL/SSH)

     --post301       Do not switch to GET after following a 301 redirect (H)

     --post302       Do not switch to GET after following a 302 redirect (H)

     --post303       Do not switch to GET after following a 303 redirect (H)

 -#, --progress-bar  Display transfer progress as a progress bar

     --proto PROTOCOLS  Enable/disable specified protocols

     --proto-redir PROTOCOLS  Enable/disable specified protocols on redirect

 -x, --proxy [PROTOCOL://]HOST[:PORT] Use proxy on given port

     --proxy-anyauth Pick "any" proxy authentication method (H)

     --proxy-basic   Use Basic authentication on the proxy (H)

     --proxy-digest  Use Digest authentication on the proxy (H)

     --proxy-negotiate Use Negotiate authentication on the proxy (H)

     --proxy-ntlm    Use NTLM authentication on the proxy (H)

 -U, --proxy-user USER[:PASSWORD]  Proxy user and password

     --proxy1.0 HOST[:PORT]  Use HTTP/1.0 proxy on given port

 -p, --proxytunnel   Operate through a HTTP proxy tunnel (using CONNECT)

     --pubkey KEY    Public key file name (SSH)

 -Q, --quote CMD     Send command(s) to server before transfer (F/SFTP)

     --random-file FILE  File for reading random data from (SSL)

 -r, --range RANGE   Retrieve only the bytes within a range

     --raw           Do HTTP "raw", without any transfer decoding (H)

 -e, --referer       Referer URL (H)

 -J, --remote-header-name Use the header-provided filename (H)

 -O, --remote-name   Write output to a file named as the remote file

     --remote-name-all Use the remote file name for all URLs

 -R, --remote-time   Set the remote file's time on the local output

 -X, --request COMMAND  Specify request command to use

     --resolve HOST:PORT:ADDRESS  Force resolve of HOST:PORT to ADDRESS

     --retry NUM   Retry request NUM times if transient problems occur

     --retry-delay SECONDS When retrying, wait this many seconds between each

     --retry-max-time SECONDS  Retry only within this period

 -S, --show-error    Show error. With -s, make curl show errors when they occur

 -s, --silent        Silent mode. Don't output anything

     --socks4 HOST[:PORT]  SOCKS4 proxy on given host + port

     --socks4a HOST[:PORT]  SOCKS4a proxy on given host + port

     --socks5 HOST[:PORT]  SOCKS5 proxy on given host + port

     --socks5-hostname HOST[:PORT] SOCKS5 proxy, pass host name to proxy

     --socks5-gssapi-service NAME  SOCKS5 proxy service name for gssapi

     --socks5-gssapi-nec  Compatibility with NEC SOCKS5 server

 -Y, --speed-limit RATE  Stop transfers below speed-limit for 'speed-time' secs

 -y, --speed-time SECONDS  Time for trig speed-limit abort. Defaults to 30

     --ssl           Try SSL/TLS (FTP, IMAP, POP3, SMTP)

     --ssl-reqd      Require SSL/TLS (FTP, IMAP, POP3, SMTP)

 -2, --sslv2         Use SSLv2 (SSL)

 -3, --sslv3         Use SSLv3 (SSL)

     --ssl-allow-beast Allow security flaw to improve interop (SSL)

     --stderr FILE   Where to redirect stderr. - means stdout

     --tcp-nodelay   Use the TCP_NODELAY option

 -t, --telnet-option OPT=VAL  Set telnet option

     --tftp-blksize VALUE  Set TFTP BLKSIZE option (must be >512)

 -z, --time-cond TIME  Transfer based on a time condition

 -1, --tlsv1         Use TLSv1 (SSL)

     --trace FILE    Write a debug trace to the given file

     --trace-ascii FILE  Like --trace but without the hex output

     --trace-time    Add time stamps to trace/verbose output

     --tr-encoding   Request compressed transfer encoding (H)

 -T, --upload-file FILE  Transfer FILE to destination

     --url URL       URL to work with

 -B, --use-ascii     Use ASCII/text transfer

 -u, --user USER[:PASSWORD]  Server user and password

     --tlsuser USER  TLS username

     --tlspassword STRING TLS password

     --tlsauthtype STRING  TLS authentication type (default SRP)

 -A, --user-agent STRING  User-Agent to send to server (H)

 -v, --verbose       Make the operation more talkative

 -V, --version       Show version number and quit

 -w, --write-out FORMAT  What to output after completion

     --xattr        Store metadata in extended file attributes

 -q                 If used as the first parameter disables .curlrc

In []:

--user-agent. Below are examples from the first and second links. Basic syntax of the option: curl -A "UserAgentString" http://url.com

In []:

wget --user-agent="User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" -c http://yourwebaddresshere/filetodownload.txt

curl -A "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" -O http://yourwebaddresshere/filetodownload.txt

In []:

curl -A "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" http://www.apple.com

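Note that there are two variants of the string here: with "User-Agent:" at the start and without. The second variant is preferable: -A (like wget's --user-agent) takes only the header value and the tool adds the header name itself, so the first variant sends "User-Agent: User-Agent: Mozilla/5.0 ...".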
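To make the difference concrete, here is a minimal sketch of a helper that strips a stray "User-Agent:" label before the value is handed to -A. The function name strip_ua_prefix is made up for this note and is not part of curl or wget; the demo runs offline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of curl): drop a leading "User-Agent:" label
# so that only the header value is passed to -A.
strip_ua_prefix() {
  ua=$1
  ua=${ua#User-Agent:}        # remove the label if it is present
  ua=${ua#"${ua%%[! ]*}"}     # remove the spaces that followed the label
  printf '%s\n' "$ua"
}

# Offline demo with the string from the wget/curl example above.
strip_ua_prefix "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12"

# With network access the cleaned value could be used like this:
# curl -A "$(strip_ua_prefix "$raw")" -O http://yourwebaddresshere/filetodownload.txt
```

A string that has no label passes through unchanged, so the helper is safe to apply to both variants.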

Custom User-Agent String

How do you check whether a "UserAgentString" works? The simplest way is to just try it in the browser. For this I installed the Custom User-Agent String plugin.

Robots exclusion standard

Is it worth pretending to be Mozilla, or is it better to identify as a bot? How should robots.txt be used? It is too early to answer these questions. But there is an excellent Wiki resource whose articles I liked a lot: The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol.

Next, examples from "Get HTTP Header Info from Web Sites Using curl"

The easiest way to get HTTP header information from any website is by using the command line tool curl.

In [2]:

!curl -I www.google.com

HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Location: http://www.google.ru/?gfe_rd=cr&ei=wX-yU9GvBsyCZIPageAK
Content-Length: 256
Date: Tue, 01 Jul 2014 09:30:41 GMT
Server: GFE/2.0
Alternate-Protocol: 80:quic
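Header output like this is easy to post-process. The sketch below pulls out the status code and the redirect target with awk; the function name status_and_location is an assumption of this note, and the demo feeds it a shortened copy of the response above instead of calling the network.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the status code and the Location value
# from header text supplied on standard input.
status_and_location() {
  awk '
    { sub(/\r$/, "") }                                    # headers arrive with CRLF line endings
    NR == 1 { print "status: " $2 }                       # first line looks like "HTTP/1.1 302 Found"
    tolower($1) == "location:" { print "location: " $2 }  # header names are case-insensitive
  '
}

# Offline demo on the response shown above.
status_and_location <<'EOF'
HTTP/1.1 302 Found
Cache-Control: private
Location: http://www.google.ru/?gfe_rd=cr&ei=wX-yU9GvBsyCZIPageAK
Content-Length: 256
EOF

# With network access: curl -sI www.google.com | status_and_location
```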

To also see the headers sent to the server, combine -v with -I (--verbose --head)

In []:

kiss@kali:~/Desktop/curl_wget$ curl -v -I www.google.com
* About to connect() to www.google.com port 80 (#0)
*   Trying 64.233.164.103...
* connected
* Connected to www.google.com (64.233.164.103) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.26.0
> Host: www.google.com
> Accept: */*
> 
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 302 Found
HTTP/1.1 302 Found
< Cache-Control: private
Cache-Control: private
< Content-Type: text/html; charset=UTF-8
Content-Type: text/html; charset=UTF-8
< Location: http://www.google.ru/?gfe_rd=cr&ei=y4CyU-nGL8bGZKiDgYgD
Location: http://www.google.ru/?gfe_rd=cr&ei=y4CyU-nGL8bGZKiDgYgD
< Content-Length: 256
Content-Length: 256
< Date: Tue, 01 Jul 2014 09:35:07 GMT
Date: Tue, 01 Jul 2014 09:35:07 GMT
< Server: GFE/2.0
Server: GFE/2.0
< Alternate-Protocol: 80:quic
Alternate-Protocol: 80:quic

< 
* Connection #0 to host www.google.com left intact
* Closing connection #0
kiss@kali:~/Desktop/curl_wget$
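In the verbose trace, lines starting with ">" are the request headers curl sent, lines starting with "<" are the response headers, and lines starting with "*" are curl's own notes. A minimal sketch for keeping only the sent headers follows; sent_headers is a name invented here, and the demo uses an excerpt of the trace above rather than a live request.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the request headers ("> " lines)
# from a verbose curl trace read on standard input.
sent_headers() {
  awk '{ sub(/\r$/, "") } /^> ./ { print substr($0, 3) }'
}

# Offline demo on an excerpt of the trace above.
sent_headers <<'EOF'
* Connected to www.google.com (64.233.164.103) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.26.0
> Host: www.google.com
> Accept: */*
< HTTP/1.1 302 Found
EOF

# curl writes the -v trace to stderr, so with network access it has to be
# redirected into the pipe while stdout is discarded:
# curl -sv -I www.google.com 2>&1 >/dev/null | sent_headers
```

The same filter is a quick way to confirm which User-Agent value actually went out when -A is used.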

An easy way to get around all the HTML, Javascript, and CSS nonsense is to use the -D flag to download the header itself into a separate file, and then open that file in your preferred text editor:

In []:

curl -iD httpheader.txt www.apple.com && open httpheader.txt
#
# -D, --dump-header FILE  Write the headers to this file
# -i, --include       Include protocol headers in the output (H/F)

In []:

After running the command (curl -iD httpheader.txt www.apple.com && vim httpheader.txt), vim opens immediately with this content:

In []:

HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html; charset=UTF-8
Cache-Control: max-age=129
Expires: Tue, 01 Jul 2014 10:33:13 GMT
Date: Tue, 01 Jul 2014 10:31:04 GMT
Content-Length: 9783
Connection: keep-alive

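And after closing it, it turns out that the whole page was printed to the console: -D only writes a copy of the headers to the file, while -i still sends the headers and the body to stdout: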

In []:

kiss@kali:~/Desktop/curl_wget$ curl -iD httpheader.txt www.apple.com && vim httpheader.txt
HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html; charset=UTF-8
Cache-Control: max-age=129
Expires: Tue, 01 Jul 2014 10:33:13 GMT
Date: Tue, 01 Jul 2014 10:31:04 GMT
Content-Length: 9783
Connection: keep-alive

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
<head>
....
....
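If the combined -i output has already been captured, it can be split afterwards at the first empty line. This is only a sketch: split_response and the file names are assumptions of this note, and it handles a single header block (a response preceded by redirects or "100 Continue" would need more care).

```shell
#!/usr/bin/env bash
# Hypothetical helper: split "curl -i" style output (stdin) into
# a header file ($1) and a body file ($2) at the first empty line.
split_response() {
  awk -v hdr="$1" -v body="$2" '
    in_body { print $0 > body; next }          # everything after the blank line is body
    { line = $0; sub(/\r$/, "", line) }        # strip the CR of CRLF header lines
    line == "" { in_body = 1; next }           # the blank line ends the headers
    { print line > hdr }
  '
}

# Offline demo with a tiny fake response.
tmp=$(mktemp -d)
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n<!DOCTYPE html>\n<html>\n' |
  split_response "$tmp/httpheader.txt" "$tmp/body.html"
cat "$tmp/httpheader.txt"
rm -r "$tmp"

# With network access curl can do the separation itself:
# curl -s -D httpheader.txt -o body.html www.apple.com
```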

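An attempt to add -v by gluing it onto the cluster does not work: in -iDv the letter v is taken as the file name for -D, so httpheader.txt is then treated as a URL: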

In []:

kiss@kali:~/Desktop/curl_wget$ curl -iDv httpheader.txt www.apple.com && vim httpheader.txt
curl: (6) Couldn't resolve host 'httpheader.txt'
HTTP/1.1 200 OK

ETag: "KXAGGALFMKVQKNR"

    Server: Apache
Content-Type: text/html; charset=UTF-8
Cache-Control: max-age=398
Expires: Tue, 01 Jul 2014 10:40:08 GMT
Date: Tue, 01 Jul 2014 10:33:30 GMT
Content-Length: 9783
Connection: keep-alive

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
<head>
 <meta charset="utf-8" />
 <meta name="Author" content="Apple Inc." />
 <meta name="viewport" content="width=1024" />
....
....

</body>
</html>
kiss@kali:~/Desktop/curl_wget$
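The failure comes from how clustered short options are read: an option that takes an argument (here -D) uses the rest of the cluster as that argument. The sketch below is only an analogy. It uses the shell's getopts built-in, not curl's own parser, and parse_cluster is a made-up name, but getopts follows the same convention, so the two command lines from this note can be compared offline.

```shell
#!/usr/bin/env bash
# Illustration only: getopts is the shell's option parser, not curl's,
# but it applies the same rule to clustered short options.
parse_cluster() {
  OPTIND=1                                   # restart parsing on every call
  while getopts "iD:v" opt "$@"; do
    case $opt in
      D) echo "-D takes argument: $OPTARG" ;;
      i|v) echo "flag: -$opt" ;;
    esac
  done
  shift $((OPTIND - 1))
  echo "left over as URLs: $*"
}

parse_cluster -iDv httpheader.txt www.apple.com
parse_cluster -v -iD httpheader.txt www.apple.com
```

In the first call the argument of -D is "v" and httpheader.txt is left over as a URL, which matches the "Couldn't resolve host 'httpheader.txt'" error above. In the second call -D gets httpheader.txt and only www.apple.com remains.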

Here is the correct combination of options

In []:

kiss@kali:~/Desktop/curl_wget$ curl -v -iD httpheader.txt www.apple.com 
* About to connect() to www.apple.com port 80 (#0)
*   Trying 23.9.216.182...
* connected
* Connected to www.apple.com (23.9.216.182) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.26.0
> Host: www.apple.com
> Accept: */*
> 
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Server: Apache
Server: Apache
< Content-Type: text/html; charset=UTF-8
Content-Type: text/html; charset=UTF-8
< Cache-Control: max-age=338
Cache-Control: max-age=338
< Expires: Tue, 01 Jul 2014 10:57:39 GMT
Expires: Tue, 01 Jul 2014 10:57:39 GMT
< Date: Tue, 01 Jul 2014 10:52:01 GMT
Date: Tue, 01 Jul 2014 10:52:01 GMT
< Content-Length: 9783
Content-Length: 9783
< Connection: keep-alive
Connection: keep-alive

< 
<!DOCTYPE html>
...
...
</body>
</html>
* Connection #0 to host www.apple.com left intact
* Closing connection #0
kiss@kali:~/Desktop/curl_wget$

Replacing the User-Agent header. Example with the options -v (--verbose) and -I (--head, show document info only)

In [3]:

!curl -v -I -A "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140611 Firefox/24.0 Iceweasel/24.6.0" www.apple.com

* About to connect() to www.apple.com port 80 (#0)
*   Trying 172.227.93.15...
* connected
* Connected to www.apple.com (172.227.93.15) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140611 Firefox/24.0 Iceweasel/24.6.0
> Host: www.apple.com
> Accept: */*
> 
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Server: Apache
Server: Apache
< Content-Type: text/html; charset=UTF-8
Content-Type: text/html; charset=UTF-8
< Cache-Control: max-age=199
Cache-Control: max-age=199
< Expires: Tue, 01 Jul 2014 11:53:14 GMT
Expires: Tue, 01 Jul 2014 11:53:14 GMT
< Date: Tue, 01 Jul 2014 11:49:55 GMT
Date: Tue, 01 Jul 2014 11:49:55 GMT
< Connection: keep-alive
Connection: keep-alive
* no chunk, no close, no size. Assume close to signal end

< 
* Closing connection #0

Replacing the User-Agent header. Example of saving the page http://www.apple.com/index.html to a file (index.html) in the current directory (~/Desktop/curl_wget)

In []:

kiss@kali:~/Desktop/curl_wget$ curl -v -A "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140611 Firefox/24.0 Iceweasel/24.6.0" -O http://www.apple.com/index.html
* About to connect() to www.apple.com port 80 (#0)
*   Trying 172.227.93.15...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connected
* Connected to www.apple.com (172.227.93.15) port 80 (#0)
> GET /index.html HTTP/1.1
> User-Agent: Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140611 Firefox/24.0 Iceweasel/24.6.0
> Host: www.apple.com
> Accept: */*
> 
* additional stuff not fine transfer.c:1037: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 200 OK
< Server: Apache
< Content-Type: text/html; charset=UTF-8
< Cache-Control: max-age=1
< Expires: Tue, 01 Jul 2014 11:51:02 GMT
< Date: Tue, 01 Jul 2014 11:51:01 GMT
< Content-Length: 9783
< Connection: keep-alive
< 
{ [data not shown]
100  9783  100  9783    0     0  48438      0 --:--:-- --:--:-- --:--:-- 78895
* Connection #0 to host www.apple.com left intact
* Closing connection #0
kiss@kali:~/Desktop/curl_wget$


Wednesday, 2 July 2014

The third stage of learning curl: articles found with queries like "curl user agent"
