Снова читаю urllib, снова не хватает времени, потому делаю заготовки для последующего воспроизведения примеров из документации. Здесь же и ссылки на каждый раздел и пример из документации

Читаем статью "Проксирование в Scrapy" ...
Из документации urllib

In [3]:

%load "C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\scrapy\\contrib\\downloadermiddleware\\httpproxy.py"

In []:

import base64
from urllib import getproxies, unquote, proxy_bypass
from urllib2 import _parse_proxy
from urlparse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured


class HttpProxyMiddleware(object):

    def __init__(self):
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

        if not self.proxies:
            raise NotConfigured

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user and password:
            user_pass = '%s:%s' % (unquote(user), unquote(password))
            creds = base64.b64encode(user_pass).strip()
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already seted
        if 'proxy' in request.meta:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = 'Basic ' + creds

In [2]:

import urllib2
help(urllib2._parse_proxy)

Help on function _parse_proxy in module urllib2:

_parse_proxy(proxy)
    Return (scheme, user, password, host/port) given a URL or an authority.
    
    If a URL is supplied, it must have an authority (host:port) component.
    According to RFC 3986, having an authority component means the URL must
    have two slashes after the scheme:
    
    >>> _parse_proxy('file:/ftp.example.com/')
    Traceback (most recent call last):
    ValueError: proxy URL with no authority: 'file:/ftp.example.com/'
    
    The first three items of the returned tuple may be None.
    
    Examples of authority parsing:
    
    >>> _parse_proxy('proxy.example.com')
    (None, None, None, 'proxy.example.com')
    >>> _parse_proxy('proxy.example.com:3128')
    (None, None, None, 'proxy.example.com:3128')
    
    The authority component may optionally include userinfo (assumed to be
    username:password):
    
    >>> _parse_proxy('joe:password@proxy.example.com')
    (None, 'joe', 'password', 'proxy.example.com')
    >>> _parse_proxy('joe:password@proxy.example.com:3128')
    (None, 'joe', 'password', 'proxy.example.com:3128')
    
    Same examples, but with URLs instead:
    
    >>> _parse_proxy('http://proxy.example.com/')
    ('http', None, None, 'proxy.example.com')
    >>> _parse_proxy('http://proxy.example.com:3128/')
    ('http', None, None, 'proxy.example.com:3128')
    >>> _parse_proxy('http://joe:password@proxy.example.com/')
    ('http', 'joe', 'password', 'proxy.example.com')
    >>> _parse_proxy('http://joe:password@proxy.example.com:3128')
    ('http', 'joe', 'password', 'proxy.example.com:3128')
    
    Everything after the authority is ignored:
    
    >>> _parse_proxy('ftp://joe:password@proxy.example.com/rubbish:3128')
    ('ftp', 'joe', 'password', 'proxy.example.com')
    
    Test for no trailing '/' case:
    
    >>> _parse_proxy('http://joe:password@proxy.example.com')
    ('http', 'joe', 'password', 'proxy.example.com')

urlliburlliburlliburlliburlliburlliburlliburlliburlliburlliburlliburllib

Из документации urllib ¶

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, or ftp_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter. For example (the '%' is the command prompt):

In []:

% http_proxy="http://www.someproxy.com:3128"
% export http_proxy
% python
...

The no_proxy environment variable can be used to specify hosts which shouldn’t be reached via proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.

In a Windows environment, if no proxy environment variables are set, proxy settings are obtained from the registry’s Internet Settings section.

In a Mac OS X environment, urlopen() will retrieve proxy information from the OS X System Configuration Framework, which can be managed with Network System Preferences panel.

Alternatively, the optional proxies argument may be used to explicitly specify proxies. It must be a dictionary mapping scheme names to proxy URLs, where an empty dictionary causes no proxies to be used, and None (the default value) causes environmental proxy settings to be used as discussed above. For example:

In []:

# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)

urllib.getproxies()¶

This helper function returns a dictionary of scheme to proxy server URL mappings. It scans the environment for variables named _proxy, in case insensitive way, for all operating systems first, and when it cannot find it, looks for proxy information from Mac OSX System Configuration for Mac OS X and Windows Systems Registry for Windows.

20.5.3. URL Opener objects ¶

In []:

class urllib.URLopener([proxies[, context[, **x509]]])

By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number. Applications can define their own User-Agent header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.

The optional proxies parameter should be a dictionary mapping scheme names to proxy URLs, where an empty dictionary turns proxies off completely. Its default value is None, in which case environmental proxy settings will be used if present, as discussed in the definition of urlopen(), above.

The following example uses an explicitly specified HTTP proxy, overriding environment settings ¶

In []:

>>> import urllib
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()

The following example uses no proxies at all, overriding environment settings:

In []:

>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()

urlliburlliburlliburlliburlliburlliburlliburlliburlliburlliburlliburllib

urllib2urllib2urllib2urllib2urllib2urllib2urllib2urllib2urllib2urllib2ur

Из документации urllib2 ¶

The urllib2 module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

The urllib2 module has been split across several modules in Python 3 named urllib.request and urllib.error.

... In addition, if proxy settings are detected (for example, when a #_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.

urllib2.build_opener([handler, ...])¶

Return an OpenerDirector instance, which chains the handlers in the order given. handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case it must be possible to call the constructor without any parameters).
Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them:
ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

class urllib2.ProxyHandler([proxies])¶

Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables _proxy.

If no proxy environment variables are set, then in a Windows environment proxy settings are obtained from the registry’s Internet Settings section, and in a Mac OS X environment proxy information is retrieved from the OS X System Configuration Framework.

To disable autodetected proxy pass an empty dictionary.

Request.set_proxy(host, type)¶

Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance’s selector will be the original URL given in the constructor.

20.6.6. ProxyHandler Objects ¶

ProxyHandler.protocol_open(request) (“protocol” is to be replaced by the protocol name.)

The ProxyHandler will have a method protocol_open for every protocol which has a proxy in the proxies dictionary given in the constructor. The method will modify requests to go through the proxy, by calling request.set_proxy(), and call the next handler in the chain to actually execute the protocol.

20.6.21. Examples urllib2 ¶

In []:

#This example gets the python.org main page and displays the first 100 bytes of it:

>>>
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html

In []:

#Here we are sending a data-stream to the stdin of a CGI and reading the data it returns to us. 
#Note that this example will only work when the Python installation supports SSL.

>>> import urllib2
>>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
...                       data='This data is passed to stdin of the CGI')
>>> f = urllib2.urlopen(req)
>>> print f.read()
#Got Data: "This data is passed to stdin of the CGI"

In []:

#The code for the sample CGI used in the above example is:

#!/usr/bin/env python
import sys
data = sys.stdin.read()
print 'Content-type: text-plain\n\nGot Data: "%s"' % data

In []:

#Use of Basic HTTP Authentication:

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

build_opener() provides many handlers by default, including a ProxyHandler. By default, ProxyHandler uses the environment variables named _proxy, where is the URL scheme involved. For example, the http_proxy environment variable is read to obtain the HTTP proxy’s URL.

replaces the default ProxyHandler¶

In []:

#This example replaces the default ProxyHandler with one that uses programmatically-supplied proxy URLs,
#and adds proxy authorization support with ProxyBasicAuthHandler.

proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')

Adding HTTP headers¶

In []:

#Use the headers argument to the Request constructor, or:

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)

In []:

#OpenerDirector automatically adds a User-Agent header to every Request. To change this:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Also, remember that a few standard headers (Content-Length, Content-Type and Host) are added when the Request is passed to urlopen() (or OpenerDirector.open())

urllib2urllib2urllib2urllib2urllib2urllib2urllib2urllib2urllib2urllib2url

Посты чуть ниже также могут вас заинтересовать

iPython R Rapid Miner

Поиск по блогу

Страницы

воскресенье, 4 января 2015 г.

Копипаст примеров и фрагментов из документации urllib Python

Из документации urllib ¶

urllib.getproxies()¶

20.5.3. URL Opener objects ¶

The following example uses an explicitly specified HTTP proxy, overriding environment settings ¶

Из документации urllib2 ¶

urllib2.build_opener([handler, ...])¶

class urllib2.ProxyHandler([proxies])¶

Request.set_proxy(host, type)¶

20.6.6. ProxyHandler Objects ¶

20.6.21. Examples urllib2 ¶

replaces the default ProxyHandler¶

Adding HTTP headers¶

Комментариев нет:

Отправить комментарий

Поиск по блогу

Страницы

воскресенье, 4 января 2015 г.

Копипаст примеров и фрагментов из документации urllib Python

Из документации urllib¶

Из документации urllib2¶

replaces the default ProxyHandler¶

Adding HTTP headers¶

Комментариев нет:

Отправить комментарий

воскресенье, 4 января 2015 г.

Из документации urllib ¶

Из документации urllib2 ¶