
Friday, October 17, 2014

Digging into what Scrapy's "downloadermiddleware\httpproxy.py" actually does

In the help for urllib.getproxies() I found: "It scans the environment for variables named <scheme>_proxy ...and when it cannot find it... looks for proxy information from ... Windows Systems Registry ..." and things finally clicked. That is essentially all httpproxy.py does. Since the middleware is enabled by default, the thing to try is "set http_proxy" ("export http_proxy" on Unix).
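A quick check of that claim (a Python 3 sketch; in the Python 2 world of this post the same function lives in urllib): setting http_proxy in the process environment is enough for the scan to pick it up. The address 127.0.0.1:3128 is just a placeholder.

```python
import os
from urllib.request import getproxies_environment  # urllib.getproxies_environment in Python 2

# Equivalent of "set http_proxy=..." (Windows) / "export http_proxy=..." (Unix),
# but only for the current process
os.environ["http_proxy"] = "http://127.0.0.1:3128"

proxies = getproxies_environment()
# the scan maps the "http_proxy" variable to the "http" scheme key
assert proxies["http"] == "http://127.0.0.1:3128"
```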

In [9]:
%load "C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\scrapy\\contrib\\downloadermiddleware\\httpproxy.py"
In []:
import base64
from urllib import getproxies, unquote, proxy_bypass
from urllib2 import _parse_proxy
from urlparse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured


class HttpProxyMiddleware(object):

    def __init__(self):
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

        if not self.proxies:
            raise NotConfigured

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user and password:
            user_pass = '%s:%s' % (unquote(user), unquote(password))
            creds = base64.b64encode(user_pass).strip()
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = 'Basic ' + creds
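To watch `_get_proxy` work in isolation, here is a small Python 3 port (a sketch, not Scrapy's code: `_parse_proxy` moved to urllib.request, `b64encode` now wants bytes, and the names `get_proxy`/`creds`/`proxy_url` at module level are mine):

```python
import base64
from urllib.request import _parse_proxy          # urllib2._parse_proxy in Python 2
from urllib.parse import unquote, urlunparse     # urllib.unquote / urlparse.urlunparse in Python 2

def get_proxy(url, orig_type):
    # Same steps as HttpProxyMiddleware._get_proxy: split the proxy URL,
    # rebuild it without the userinfo part, and prepare Basic-auth credentials
    proxy_type, user, password, hostport = _parse_proxy(url)
    proxy_url = urlunparse((proxy_type or orig_type, hostport, "", "", "", ""))
    if user and password:
        user_pass = "%s:%s" % (unquote(user), unquote(password))
        creds = base64.b64encode(user_pass.encode("latin-1")).strip()
    else:
        creds = None
    return creds, proxy_url

creds, proxy_url = get_proxy("http://joe:secret@proxy.example.com:3128", "http")
```

The credentials are dropped from the URL and moved into the `Proxy-Authorization: Basic ...` header by `_set_proxy` above.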

Next, let's go through what is imported from urllib and urllib2. These libraries clearly change a lot: in Python 3 they have been replaced by different modules (see the link below), and I could no longer find proxy_bypass or _parse_proxy in the current documentation.

In [4]:
import urllib
In []:
urllib.getproxies()

This helper function returns a dictionary of scheme to proxy server URL mappings. 
It scans the environment for variables named <scheme>_proxy, in case insensitive way, for all operating systems first, 
and when it cannot find it, looks for proxy information from Mac OSX System Configuration for Mac OS X and 
Windows Systems Registry for Windows.
In [22]:
help(urllib.getproxies)
Help on function getproxies in module urllib:

getproxies()
    Return a dictionary of scheme -> proxy server URL mappings.
    
    Returns settings gathered from the environment, if specified,
    or the registry.


In [23]:
urllib.getproxies()
Out[23]:
{}

Scanning the environment variables could not be simpler, but how does it get at the system settings? Here is the relevant fragment from urllib:

In []:
def getproxies_environment():
    """Здесь справка, которую я уже распечатал выше, потому здесь убрал
    """
    proxies = {}
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    return proxies

Later on it would be worth combining the command-line facilities (!set ...) with os.environ, e.g. for switching a proxy on and off by hand.
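A minimal sketch of such a manual switch (the helper name `set_http_proxy` is mine; unlike `set`/`setx` in the shell it only affects the current process and its children):

```python
import os
from urllib.request import getproxies_environment  # urllib.getproxies_environment in Python 2

def set_http_proxy(url=None):
    """Enable (url given) or disable (url=None) the http proxy for this process."""
    if url is None:
        # remove both spellings so getproxies_environment sees neither
        os.environ.pop("http_proxy", None)
        os.environ.pop("HTTP_PROXY", None)
    else:
        os.environ["http_proxy"] = url
    return getproxies_environment()

proxies_on = set_http_proxy("http://127.0.0.1:8080")
proxies_off = set_http_proxy()
```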

In [26]:
from os import environ
In [28]:
environ.items()
Out[28]:
[('TMP', 'C:\\Users\\kiss\\AppData\\Local\\Temp'),
 ('COMPUTERNAME', 'WEB-UNIVERSUM'),
 ('USERDOMAIN', 'WEB-UNIVERSUM'),
 ('PYTHON', 'C:\\Users\\kiss\\Anaconda\\python.exe'),
 ('PSMODULEPATH', 'C:\\WINDOWS\\system32\\WindowsPowerShell\\v1.0\\Modules\\'),
 ('COMMONPROGRAMFILES', 'C:\\Program Files\\Common Files'),
 ('PROCESSOR_IDENTIFIER', 'AMD64 Family 20 Model 2 Stepping 0, AuthenticAMD'),
 ('PROGRAMFILES', 'C:\\Program Files'),
 ('PROCESSOR_REVISION', '0200'),
 ('PATH',
  'C:\\Users\\kiss\\Anaconda\\lib\\site-packages\\numpy\\core;C:\\Program Files\\ImageMagick-6.8.8-Q8;C:\\Program Files (x86)\\AMD APP\\bin\\x86_64;C:\\Program Files (x86)\\AMD APP\\bin\\x86;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files (x86)\\ATI Technologies\\ATI.ACE\\Core-Static;C:\\Program Files (x86)\\Windows Live\\Shared;C:\\Program Files\\Java\\jdk1.7.0_21\\bin;C:\\Program Files\\Microsoft SQL Server\\110\\Tools\\Binn\\;C:\\HashiCorp\\Vagrant\\bin;C:\\Users\\kiss\\Anaconda\\Scripts;C:\\Program Files\\cURL\\bin;C:\\Program Files (x86)\\MiKTeX 2.9\\miktex\\bin\\;C:\\Program Files\\HDF_Group\\HDF5\\1.8.12\\bin;C:\\Program Files\\nodejs\\;C:\\Users\\kiss\\Anaconda;C:\\Users\\kiss\\Anaconda\\Scripts;C:\\Users\\kiss\\AppData\\Local\\Pandoc\\;C:\\Program Files (x86)\\Google\\google_appengine\\;C:\\Users\\kiss\\AppData\\Roaming\\npm;C:\\Program Files (x86)\\Nmap'),
 ('SYSTEMROOT', 'C:\\WINDOWS'),
 ('CLICOLOR', '1'),
 ('AMDAPPSDKROOT', 'C:\\Program Files (x86)\\AMD APP\\'),
 ('PROGRAMFILES(X86)', 'C:\\Program Files (x86)'),
 ('LANG', 'RU'),
 ('TERM', 'xterm-color'),
 ('TEMP', 'C:\\Users\\kiss\\AppData\\Local\\Temp'),
 ('COMMONPROGRAMFILES(X86)', 'C:\\Program Files (x86)\\Common Files'),
 ('PROCESSOR_ARCHITECTURE', 'AMD64'),
 ('ALLUSERSPROFILE', 'C:\\ProgramData'),
 ('LOCALAPPDATA', 'C:\\Users\\kiss\\AppData\\Local'),
 ('HOMEPATH', '\\Users\\kiss'),
 ('USERDOMAIN_ROAMINGPROFILE', 'WEB-UNIVERSUM'),
 ('VS120COMNTOOLS',
  'C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\Common7\\Tools\\'),
 ('JAVA_HOME', 'C:\\Program Files\\Java\\jdk1.7.0_21'),
 ('PROGRAMW6432', 'C:\\Program Files'),
 ('USERNAME', 'kiss'),
 ('LOGONSERVER', '\\\\MicrosoftAccount'),
 ('PROMPT', '$P$G'),
 ('COMSPEC', 'C:\\WINDOWS\\system32\\cmd.exe'),
 ('PROGRAMDATA', 'C:\\ProgramData'),
 ('PYTHONPATH',
  'C:\\Users\\kiss\\Documents\\Python-Django;C:\\Users\\kiss\\Anaconda\\Lib\\site-packages\\django\\bin'),
 ('GIT_PAGER', 'cat'),
 ('SESSIONNAME', 'Console'),
 ('PATHEXT', '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC'),
 ('FP_NO_HOST_CHECK', 'NO'),
 ('WINDIR', 'C:\\WINDOWS'),
 ('OPENSSL_CONF', 'C:\\OpenSSL-Win64\\bin\\openssl.cfg'),
 ('APPDATA', 'C:\\Users\\kiss\\AppData\\Roaming'),
 ('HOMEDRIVE', 'C:'),
 ('PAGER', 'cat'),
 ('SYSTEMDRIVE', 'C:'),
 ('NUMBER_OF_PROCESSORS', '2'),
 ('VBOX_INSTALL_PATH', 'C:\\Program Files\\Oracle\\VirtualBox\\'),
 ('PROCESSOR_LEVEL', '20'),
 ('COMMONPROGRAMW6432', 'C:\\Program Files\\Common Files'),
 ('OS', 'Windows_NT'),
 ('PUBLIC', 'C:\\Users\\Public'),
 ('USERPROFILE', 'C:\\Users\\kiss')]
In [31]:
help(environ)
Help on instance of _Environ in module os:

class _Environ(UserDict.IterableUserDict)
 |  # But we store them as upper case
 |  
 |  Method resolution order:
 |      _Environ
 |      UserDict.IterableUserDict
 |      UserDict.UserDict
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, key)
 |  
 |  __delitem__(self, key)
 |  
 |  __getitem__(self, key)
 |  
 |  __init__(self, environ)
 |  
 |  __setitem__(self, key, item)
 |  
 |  clear(self)
 |  
 |  copy(self)
 |  
 |  get(self, key, failobj=None)
 |  
 |  has_key(self, key)
 |  
 |  pop(self, key, *args)
 |  
 |  update(self, dict=None, **kwargs)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from UserDict.IterableUserDict:
 |  
 |  __iter__(self)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from UserDict.UserDict:
 |  
 |  __cmp__(self, dict)
 |  
 |  __len__(self)
 |  
 |  __repr__(self)
 |  
 |  items(self)
 |  
 |  iteritems(self)
 |  
 |  iterkeys(self)
 |  
 |  itervalues(self)
 |  
 |  keys(self)
 |  
 |  popitem(self)
 |  
 |  setdefault(self, key, failobj=None)
 |  
 |  values(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from UserDict.UserDict:
 |  
 |  fromkeys(cls, iterable, value=None) from __builtin__.classobj
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from UserDict.UserDict:
 |  
 |  __hash__ = None


In []:
FancyURLopener subclasses URLopener, providing default handling for the following HTTP response codes: 301, 302, 303, 307 and 401. 
For the 30x response codes listed above, the Location header is used to fetch the actual URL. 
For 401 response codes (authentication required), basic HTTP authentication is performed. 
For the 30x response codes, recursion is bounded by the value of the maxtries attribute, which defaults to 10.

For all other response codes, the method http_error_default() is called which you can override in subclasses to handle 
the error appropriately.

Note: According to the letter of RFC 2616, 301 and 302 responses to POST requests must not be automatically redirected 
without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing 
the POST to a GET, and urllib reproduces this behaviour.

The parameters to the constructor are the same as those for URLopener.
In []:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.
In [8]:
print urllib.unquote('/%7Econnolly/') 
/~connolly/
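This is why `_get_proxy` unquotes the credentials: a username or password containing "@" or ":" must travel percent-encoded inside the proxy URL, or `_parse_proxy` would mis-split the user:password@host:port authority. A quick round trip (Python 3 `urllib.parse`; the same pair is `quote`/`unquote` in Python 2's urllib):

```python
from urllib.parse import quote, unquote

# '@' and ':' would confuse the user:password@host:port split,
# so they travel percent-encoded and are unquoted afterwards
raw = "p@ss:word"
encoded = quote(raw, safe="")
decoded = unquote(encoded)
assert encoded == "p%40ss%3Aword"
assert decoded == raw
```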

In [12]:
help(urllib.proxy_bypass)
Help on function proxy_bypass in module urllib:

proxy_bypass(host)
    Return a dictionary of scheme -> proxy server URL mappings.
    
    Returns settings gathered from the environment, if specified,
    or the registry.


In [13]:
import urllib2
In [14]:
help(urllib2._parse_proxy)
Help on function _parse_proxy in module urllib2:

_parse_proxy(proxy)
    Return (scheme, user, password, host/port) given a URL or an authority.
    
    If a URL is supplied, it must have an authority (host:port) component.
    According to RFC 3986, having an authority component means the URL must
    have two slashes after the scheme:
    
    >>> _parse_proxy('file:/ftp.example.com/')
    Traceback (most recent call last):
    ValueError: proxy URL with no authority: 'file:/ftp.example.com/'
    
    The first three items of the returned tuple may be None.
    
    Examples of authority parsing:
    
    >>> _parse_proxy('proxy.example.com')
    (None, None, None, 'proxy.example.com')
    >>> _parse_proxy('proxy.example.com:3128')
    (None, None, None, 'proxy.example.com:3128')
    
    The authority component may optionally include userinfo (assumed to be
    username:password):
    
    >>> _parse_proxy('joe:password@proxy.example.com')
    (None, 'joe', 'password', 'proxy.example.com')
    >>> _parse_proxy('joe:password@proxy.example.com:3128')
    (None, 'joe', 'password', 'proxy.example.com:3128')
    
    Same examples, but with URLs instead:
    
    >>> _parse_proxy('http://proxy.example.com/')
    ('http', None, None, 'proxy.example.com')
    >>> _parse_proxy('http://proxy.example.com:3128/')
    ('http', None, None, 'proxy.example.com:3128')
    >>> _parse_proxy('http://joe:password@proxy.example.com/')
    ('http', 'joe', 'password', 'proxy.example.com')
    >>> _parse_proxy('http://joe:password@proxy.example.com:3128')
    ('http', 'joe', 'password', 'proxy.example.com:3128')
    
    Everything after the authority is ignored:
    
    >>> _parse_proxy('ftp://joe:password@proxy.example.com/rubbish:3128')
    ('ftp', 'joe', 'password', 'proxy.example.com')
    
    Test for no trailing '/' case:
    
    >>> _parse_proxy('http://joe:password@proxy.example.com')
    ('http', 'joe', 'password', 'proxy.example.com')


In [15]:
import urlparse
In [17]:
from scrapy.utils.httpobj import urlparse_cached
In [19]:
help(urlparse_cached)
Help on function urlparse_cached in module scrapy.utils.httpobj:

urlparse_cached(request_or_response)
    Return urlparse.urlparse caching the result, where the argument can be a
    Request or Response object


In [16]:
help(urlparse.urlunparse)
Help on function urlunparse in module urlparse:

urlunparse(data)
    Put a parsed URL back together again.  This may result in a
    slightly different, but equivalent URL, if the URL that was parsed
    originally had redundant delimiters, e.g. a ? with an empty query
    (the draft states that these are equivalent).
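This is exactly how the middleware rebuilds a bare proxy URL: only the scheme and netloc slots are filled, the other four components are left empty (Python 3 spelling below; `urlparse.urlunparse` in Python 2):

```python
from urllib.parse import urlparse, urlunparse

# (scheme, netloc, path, params, query, fragment) with everything but
# the first two left empty -- the _get_proxy pattern
proxy_url = urlunparse(("http", "proxy.example.com:3128", "", "", "", ""))
assert proxy_url == "http://proxy.example.com:3128"

# round trip: urlparse recovers the pieces
parts = urlparse(proxy_url)
assert (parts.hostname, parts.port) == ("proxy.example.com", 3128)
```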


In [20]:
from scrapy.exceptions import NotConfigured
In [21]:
help(NotConfigured)
Help on class NotConfigured in module scrapy.exceptions:

class NotConfigured(exceptions.Exception)
 |  Indicates a missing configuration situation
 |  
 |  Method resolution order:
 |      NotConfigured
 |      exceptions.Exception
 |      exceptions.BaseException
 |      __builtin__.object
 |  
 |  Data descriptors defined here:
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from exceptions.Exception:
 |  
 |  __init__(...)
 |      x.__init__(...) initializes x; see help(type(x)) for signature
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from exceptions.Exception:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from exceptions.BaseException:
 |  
 |  __delattr__(...)
 |      x.__delattr__('name') <==> del x.name
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __getslice__(...)
 |      x.__getslice__(i, j) <==> x[i:j]
 |      
 |      Use of negative indices is not supported.
 |  
 |  __reduce__(...)
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __setattr__(...)
 |      x.__setattr__('name', value) <==> x.name = value
 |  
 |  __setstate__(...)
 |  
 |  __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  __unicode__(...)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from exceptions.BaseException:
 |  
 |  __dict__
 |  
 |  args
 |  
 |  message


In []:
[Hypertext Transfer Protocol -- HTTP/1.1 rfc2616](http://tools.ietf.org/html/rfc2616.html)
In []:
 10.2  Successful 2xx
   10.2.1   200 OK
   10.2.2   201 Created
   10.2.3   202 Accepted
   10.2.4   203 Non-Authoritative Information
   10.2.5   204 No Content
   10.2.6   205 Reset Content
   10.2.7   206 Partial Content

 10.3  Redirection 3xx
   10.3.1   300 Multiple Choices
   10.3.2   301 Moved Permanently
   10.3.3   302 Found
   10.3.4   303 See Other
   10.3.5   304 Not Modified
   10.3.6   305 Use Proxy
   10.3.7   306 (Unused)
   10.3.8   307 Temporary Redirect

 10.4  Client Error 4xx
   10.4.1    400 Bad Request
   10.4.2    401 Unauthorized
   10.4.3    402 Payment Required
   10.4.4    403 Forbidden
   10.4.5    404 Not Found
   10.4.6    405 Method Not Allowed
   10.4.7    406 Not Acceptable
   10.4.8    407 Proxy Authentication Required
   10.4.9    408 Request Timeout
   10.4.10   409 Conflict
   10.4.11   410 Gone
   10.4.12   411 Length Required
   10.4.13   412 Precondition Failed
   10.4.14   413 Request Entity Too Large
   10.4.15   414 Request-URI Too Long
   10.4.16   415 Unsupported Media Type
   10.4.17   416 Requested Range Not Satisfiable
   10.4.18   417 Expectation Failed

 10.5  Server Error 5xx
   10.5.1   500 Internal Server Error
   10.5.2   501 Not Implemented
   10.5.3   502 Bad Gateway
   10.5.4   503 Service Unavailable
   10.5.5   504 Gateway Timeout
   10.5.6   505 HTTP Version Not Supported

