
Sunday, 23 November 2014

How best to parse, clean, and join strings like 'http://127.127.0.1:8080'

My first thought was find/replace. Earlier I had come across split, and I "remembered" that every string is an (ordered) list, so it can be manipulated by position with indices: s[i], s[i:j]. That turned out to be wrong: a string behaves not like a list but like a TUPLE, that is, it is an immutable sequence. So a substring cannot be changed by simple assignment.
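
Besides the manual approaches listed above, the standard library can take such a URL apart directly. This is a minimal sketch, assuming Python 3 (in Python 2 the same function lives in the urlparse module); the variable names are only illustrative.

```python
from urllib.parse import urlsplit  # Python 3; the Python 2 module is named urlparse

url = 'http://127.127.12.7:8080'
parts = urlsplit(url)

scheme = parts.scheme    # 'http'
host = parts.hostname    # '127.127.12.7'
port = parts.port        # 8080 as an int, or None if the URL has no port

print(scheme, host, port)
```

The manual string methods explored in the rest of this post are still worth knowing, since not every string to be cleaned is a well-formed URL.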

I still need to refresh my memory about Unicode strings, but for now just a quick reminder:

In [19]:
print u'Гыыыы', u'\u0413\u044b\u044b\u044b\u044b'
Гыыыы Гыыыы

In [18]:
u'Гыыыы'
Out[18]:
u'\u0413\u044b\u044b\u044b\u044b'
The point is that Cyrillic strings written with the u prefix are unicode objects, that is, a different type from plain str. Now let us turn to ordinary string (str) objects.
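
For comparison, here is a small sketch of the same text/bytes distinction, assuming Python 3, where str is already Unicode and the byte form is a separate bytes type. The variable names are hypothetical.

```python
text = 'Гыыыы'
same_text = '\u0413\u044b\u044b\u044b\u044b'  # the same characters written as escapes

raw = text.encode('utf-8')   # bytes: each of these Cyrillic letters takes two bytes in UTF-8

print(text == same_text)             # True
print(len(text), len(raw))           # 5 10
print(raw.decode('utf-8') == text)   # True
```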
In [2]:
sstring='http://127.127.12.7:8080'
ss= sstring.split(':')  #[0] --> 'http'
In [3]:
ss
Out[3]:
['http', '//127.127.12.7', '8080']
The result is a list of strings, and all the list methods naturally apply to it.
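
As a sketch of what that means in practice (assuming Python 3; the replacement port '9090' is made up for illustration), the list can be modified in place and then glued back together with join:

```python
ss = 'http://127.127.12.7:8080'.split(':')   # ['http', '//127.127.12.7', '8080']

ss[2] = '9090'       # lists are mutable, so replacing an element works
ss.append('extra')   # list methods such as append and pop are available
ss.pop()             # drop the element that was just appended

rebuilt = ':'.join(ss)   # glue the pieces back together with the same separator
print(rebuilt)           # http://127.127.12.7:9090
```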
In [7]:
'How can we remove "//" from "%s"' % ss[1]
Out[7]:
'How can we remove "//" from "//127.127.12.7"'

A string is not a list but, like a tuple (!!!), an immutable sequence of characters

In [11]:
ss[1][1],ss[1][2],ss[1][3],ss[1][0:2],
Out[11]:
('/', '1', '2', '//')
In [13]:
print "Before: %s" % ss
# The assignment below does not work
ss[1][0:2]=''
print "After:", ss
Before: ['http', '//127.127.12.7', '8080']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-ec471e6cc8fd> in <module>()
      1 print "Before: %s" % ss
----> 2 ss[1][0:2]=''
      3 print "After:", ss

TypeError: 'str' object does not support item assignment
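
Since a string cannot be edited in place, the way around the error is to build a new string and store it back into the list. A minimal sketch, assuming Python 3, with three interchangeable ways of dropping the leading "//":

```python
ss = ['http', '//127.127.12.7', '8080']

by_slice = ss[1][2:]                      # drop the first two characters
by_lstrip = ss[1].lstrip('/')             # drop every leading '/'
by_replace = ss[1].replace('//', '', 1)   # replace the first '//' with nothing

ss[1] = by_slice   # assigning to the list element is fine: the list itself is mutable
print(ss)          # ['http', '127.127.12.7', '8080']
```

All three give the same result here; slicing assumes the prefix is exactly two characters long, while lstrip removes however many slashes there are.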
In [27]:
help(sstring.split)
Help on built-in function split:

split(...)
    S.split([sep [,maxsplit]]) -> list of strings
    
    Return a list of the words in the string S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are removed
    from the result.
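
The last sentence of the help text is easy to overlook, so here is a short sketch (assuming Python 3; the sample string is made up) of how split() with no separator differs from split(' '):

```python
line = '  alpha   beta  '

by_whitespace = line.split()   # runs of whitespace are one separator, empty strings dropped
by_space = line.split(' ')     # every single space is a separator, empty strings kept

print(by_whitespace)   # ['alpha', 'beta']
print(by_space)        # ['', '', 'alpha', '', '', 'beta', '', '']
```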


In [69]:
sss='11--фыва2--фыва3--фыва4--фыва5--фыва'
sss.split('фы',3)
Out[69]:
['11--',
 '\xd0\xb2\xd0\xb02--',
 '\xd0\xb2\xd0\xb03--',
 '\xd0\xb2\xd0\xb04--\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb05--\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0']
In [71]:
decode(sss.split('фы',3)[2])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-71-e8efdeb0a3ca> in <module>()
----> 1 decode(sss.split('фы',3)[2])

NameError: name 'decode' is not defined
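
The NameError appears because decode is not a built-in function; it is a method of the byte string itself (it is listed in dir(sstring) further down). A sketch of the same experiment, assuming Python 3, where the byte strings have to be created explicitly with encode:

```python
sss = '11--фыва2--фыва3--фыва4--фыва5--фыва'.encode('utf-8')  # bytes, like a Python 2 str

parts = sss.split('фы'.encode('utf-8'), 3)   # at most 3 splits, so 4 pieces
piece = parts[2].decode('utf-8')             # decode is called on the bytes object

print(len(parts))   # 4
print(piece)        # ва3--
```

In the Python 2 session above, the corresponding call would be sss.split('фы',3)[2].decode('utf-8'), assuming the notebook source is UTF-8 encoded, as the escaped bytes in Out[69] suggest.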
In [28]:
help(sstring.rsplit)
Help on built-in function rsplit:

rsplit(...)
    S.rsplit([sep [,maxsplit]]) -> list of strings
    
    Return a list of the words in the string S, using sep as the
    delimiter string, starting at the end of the string and working
    to the front.  If maxsplit is given, at most maxsplit splits are
    done. If sep is not specified or is None, any whitespace string
    is a separator.
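
For the URL from the beginning of the post, the difference between split and rsplit with maxsplit=1 looks like this (a small sketch, assuming Python 3):

```python
url = 'http://127.127.12.7:8080'

from_left = url.split(':', 1)     # splits at the first ':'
from_right = url.rsplit(':', 1)   # splits at the last ':'

print(from_left)    # ['http', '//127.127.12.7:8080']
print(from_right)   # ['http://127.127.12.7', '8080']

base, port = from_right
```

Splitting from the right is a convenient way to peel off the port, provided the string is known to end with one.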


In [36]:
help(sstring.splitlines)
Help on built-in function splitlines:

splitlines(...)
    S.splitlines(keepends=False) -> list of strings
    
    Return a list of the lines in S, breaking at line boundaries.
    Line breaks are not included in the resulting list unless keepends
    is given and true.


In [64]:
sss='qwerqwe1\nrqwe2\n111фыва3\nфывафы4\nвафывафыва'
sss.splitlines() # keepends defaults to False (0 works as well)
Out[64]:
['qwerqwe1',
 'rqwe2',
 '111\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb03',
 '\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b4',
 '\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0']
In [67]:
sss='qwerqwe1\nrqwe2\n111фыва3\nфывафы4\nвафывафыва'
sss.splitlines(True) # or any nonzero number --> the "\n" line endings are kept
Out[67]:
['qwerqwe1\n',
 'rqwe2\n',
 '111\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb03\n',
 '\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b4\n',
 '\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0']
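
It may also help to see how splitlines differs from split('\n'). This sketch assumes Python 3 and uses a made-up sample with mixed line endings and a trailing newline:

```python
text = 'first\nsecond\r\nthird\n'

lines = text.splitlines()         # understands both \n and \r\n, no empty tail
naive = text.split('\n')          # leaves '\r' in place and adds an empty last item
with_ends = text.splitlines(True) # keepends=True keeps the line endings

print(lines)       # ['first', 'second', 'third']
print(naive)       # ['first', 'second\r', 'third', '']
print(with_ends)   # ['first\n', 'second\r\n', 'third\n']
```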
In [29]:
help(sstring.strip)
Help on built-in function strip:

strip(...)
    S.strip([chars]) -> string or unicode
    
    Return a copy of the string S with leading and trailing
    whitespace removed.
    If chars is given and not None, remove characters in chars instead.
    If chars is unicode, S will be converted to unicode before stripping


In [57]:
sss='qwerqwerqwe111фывафывафывафывафыва'
sss.strip('qw')
Out[57]:
'erqwerqwe111\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0'
In [58]:
sss='qwerqwerqwe111фывафывафывафывафыва'
sss.strip('фыва')
Out[58]:
'qwerqwerqwe111'
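
One thing these two cells hint at: the argument of strip is a set of characters, not a prefix or suffix. A sketch assuming Python 3 (the 'http://host' string is made up) that shows the pitfall and a safer way to drop a known prefix:

```python
stripped = 'qwerqwe111'.strip('wq')       # order of the characters does not matter
pitfall = 'http://host'.lstrip('http://') # the 'h' of 'host' is stripped too

url = 'http://host'
prefix = 'http://'
# Remove the prefix only if it is really there, by slicing off its length
cleaned = url[len(prefix):] if url.startswith(prefix) else url

print(stripped)   # erqwe111
print(pitfall)    # ost
print(cleaned)    # host
```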
In [34]:
help(sstring.lstrip)
Help on built-in function lstrip:

lstrip(...)
    S.lstrip([chars]) -> string or unicode
    
    Return a copy of the string S with leading whitespace removed.
    If chars is given and not None, remove characters in chars instead.
    If chars is unicode, S will be converted to unicode before stripping


In [30]:
help(sstring.find)
Help on built-in function find:

find(...)
    S.find(sub [,start [,end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.
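
There is no example of find in this notebook, so here is a minimal sketch (assuming Python 3; the marker name is hypothetical) that locates "://" and slices off everything up to and including it:

```python
url = 'http://127.127.12.7:8080'
marker = '://'

pos = url.find(marker)   # index of the first character of the marker
# find returns -1 when nothing is found, so check before slicing
rest = url[pos + len(marker):] if pos != -1 else url
missing = url.find('ftp')

print(pos, rest)   # 4 127.127.12.7:8080
print(missing)     # -1
```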


In [31]:
help(sstring.replace)
Help on built-in function replace:

replace(...)
    S.replace(old, new[, count]) -> string
    
    Return a copy of string S with all occurrences of substring
    old replaced by new.  If the optional argument count is
    given, only the first count occurrences are replaced.


In [56]:
sss='qwerqwerqwe111rqwerqwerqwer'
sss.replace('er', 'ER', 2)
Out[56]:
'qwERqwERqwe111rqwerqwerqwer'
In [32]:
help(sstring.zfill)
Help on built-in function zfill:

zfill(...)
    S.zfill(width) -> string
    
    Pad a numeric string S with zeros on the left, to fill a field
    of the specified width.  The string S is never truncated.


In [55]:
sstring='123'
sstring.zfill(5)
Out[55]:
'00123'
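
A few more cases of zfill, as a sketch assuming Python 3: a leading sign stays in front of the zeros, a string that is already long enough is returned unchanged, and format gives the same padding for integers.

```python
padded = '123'.zfill(5)
signed = '-42'.zfill(5)          # the sign is kept before the zeros
too_long = '123456'.zfill(5)     # never truncated
formatted = '{:05d}'.format(123) # equivalent padding for an int

print(padded, signed, too_long, formatted)   # 00123 -0042 123456 00123
```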
In [33]:
help(sstring.partition)
Help on built-in function partition:

partition(...)
    S.partition(sep) -> (head, sep, tail)
    
    Search for the separator sep in S, and return the part before it,
    the separator itself, and the part after it.  If the separator is not
    found, return S and two empty strings.


In [53]:
sss='qwerqw111erqwrrqwr'
sss.partition('1')
Out[53]:
('qwerqw', '1', '11erqwrrqwr')
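
Applied to the URL from the start of the post, two partition calls are enough to take it apart. This is only a sketch, assuming Python 3 and assuming the string always has the form scheme://host:port:

```python
url = 'http://127.127.12.7:8080'

scheme, _, rest = url.partition('://')   # 'http', '://', '127.127.12.7:8080'
host, _, port = rest.partition(':')      # '127.127.12.7', ':', '8080'
not_found = 'no-separator'.partition(':')

print(scheme, host, port)   # http 127.127.12.7 8080
print(not_found)            # ('no-separator', '', '')
```

Unlike split, partition always returns exactly three items, so the unpacking does not fail when the separator is missing.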
In [35]:
help(sstring.ljust)
Help on built-in function ljust:

ljust(...)
    S.ljust(width[, fillchar]) -> string
    
    Return S left-justified in a string of length width. Padding is
    done using the specified fill character (default is a space).
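
A small sketch of ljust, assuming Python 3; the rows and the '.' fill character are made up for illustration:

```python
rows = [('host', '127.127.12.7'), ('port', '8080')]

# Pad each name to 6 characters with dots so the values line up
lines = [name.ljust(6, '.') + value for name, value in rows]
unchanged = 'abc'.ljust(2)   # a width smaller than the string changes nothing

for line in lines:
    print(line)
# host..127.127.12.7
# port..8080
```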


In [37]:
help(sstring.translate)
Help on built-in function translate:

translate(...)
    S.translate(table [,deletechars]) -> string
    
    Return a copy of the string S, where all characters occurring
    in the optional argument deletechars are removed, and the
    remaining characters have been mapped through the given
    translation table, which must be a string of length 256 or None.
    If the table argument is None, no translation is applied and
    the operation simply removes the characters in deletechars.


In [44]:
sss='qwerqwerqwrrqwr'
sss.translate(None,'qr')
Out[44]:
'weweww'
In [50]:
sss='qwerqwerqwrrqwr'
sss.translate(None,'qrw')
Out[50]:
'ee'
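
The two-argument form translate(None, deletechars) shown above exists only for Python 2 byte strings. A sketch of the Python 3 equivalent, where the deletion table is built with str.maketrans:

```python
sss = 'qwerqwerqwrrqwr'

drop_qr = str.maketrans('', '', 'qr')    # third argument: characters to delete
drop_qrw = str.maketrans('', '', 'qrw')

without_qr = sss.translate(drop_qr)
without_qrw = sss.translate(drop_qrw)

print(without_qr)    # weweww
print(without_qrw)   # ee
```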
In [24]:
list(dir(sstring))
Out[24]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getslice__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_formatter_field_name_split',
 '_formatter_parser',
 'capitalize',
 'center',
 'count',
 'decode',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'index',
 'isalnum',
 'isalpha',
 'isdigit',
 'islower',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
In [23]:
dir(ss)
Out[23]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__delslice__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getslice__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__setslice__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']
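
The two long listings are easier to compare as sets. A sketch assuming Python 3 (where the exact method lists differ slightly from the Python 2 output above):

```python
str_only = set(dir(str)) - set(dir(list))    # methods strings have and lists do not
list_only = set(dir(list)) - set(dir(str))   # methods lists have and strings do not

print('split' in str_only)          # True
print('append' in list_only)        # True
print('__setitem__' in list_only)   # True: only the list supports item assignment
```

The absence of __setitem__ on str is exactly what the TypeError earlier in the post was reporting.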
In [26]:
help(string)
no Python documentation found for 'http://127.127.12.7:8080'

