Shall we take a swing at UTF-8 and the rest of Unicode? Puzzles with ASCII, Latin-1, cp1251, !chcp 65001, encode('cp1251'), unicodedata, codecs.open(filepath, encoding='cp1251')... plus line.encoding, repr() to peek at the encoding, ord(), chr(), hex(), bin()

Two weeks ago (only two!) I ran into a problem: when calling Windows console commands from an IPython Notebook, I got mojibake instead of Cyrillic.
Back then I knew I was still only learning... I did not want to derail my study plan, so I settled for the minimum acceptable result and solemnly promised myself to study Unicode later.

And now, encoding problems again... And again, an attempt to sort out encodings "quickly" led nowhere. I ended up sinking two days into reading (listed in the order I read them):
1. An article on Habr (link below the post): "Unicode for Dummies" / Habrahabr
2. Wikipedia (links below the post): Unicode
3. Chapter 36, "Unicode and Byte Strings", from M. Lutz
4. The Python documentation: Unicode HOWTO
5. The Python documentation: Standard Encodings

Was it worth spending so much time? Yes, because encoding problems can pop up whenever you parse files... In these two weeks alone I ran into the following:
1. Mojibake in the IDLE and IPython consoles when printing Cyrillic
2. Mojibake when importing a file containing Cyrillic into IPython
3. Strings such as "022@Mail.Ru" or "41043244243e@Mail.Ru" when reading lines from a file (or files)

Meanwhile my head was... not exactly blissfully empty...; rather, I was fooling myself that I knew what an encoding is... Clearly, one should start with a model...
Usually, though, people start with the history of encodings. But the historical and the logical do not line up here..., so I will try starting with a model.

The "Typesetter in a Print Shop" Model¶

So, I am a typesetter. In front of me is a type case (a box) of letters. Each letter sits in its own cell. Above each cell is a label, but the label shows the cell number rather than the letter. For example, the letter "w" has number 119

In [11]:

ord("a"), ord("w")  # "w" is the byte with integer value 119 in ASCII

Out[11]:

(97, 119)

In [14]:

chr(119)  # the letter in cell 119 is w

Out[14]:

'w'

But for a number of reasons (details below), hexadecimal rather than decimal notation is used. Let us convert the ordinal numbers (of the cells in the "boxes", i.e. the type cases) to hexadecimal form:

In [12]:

hex(97), hex(119)  # the number 119 in hexadecimal form is '0x77'

Out[12]:

('0x61', '0x77')

In [17]:

'0x77', 0x77  # if we write 0x77 without quotes, 119 is printed

Out[17]:

('0x77', 119)

In [19]:

chr(119), chr(0x77)  # the letter in cell 119 is w... and in cell 0x77 too

Out[19]:

('w', 'w')

Everything is really in binary (in the computer's memory), so why does everyone use hexadecimal notation?¶

In [27]:

bin(1),bin(2),bin(3),bin(4),bin(7),bin(8),bin(127),bin(255)

Out[27]:

('0b1', '0b10', '0b11', '0b100', '0b111', '0b1000', '0b1111111', '0b11111111')

In the memory of a physical computer, numbers are stored as sequences of zeros and ones. This is binary code. To practise converting numbers between representations, I remembered that Windows has a Calculator program (with a Programmer mode).
The example line above shows that the values 1, 3, 7, 127 and 255 correspond to sequences consisting only of ones.
If we recall that 1 bit is a cell holding either a zero or a one, we can say (about 127 and 255) that the largest number that can be written with 7 bits is 127, and the largest value for 8 bits is 255.

In [29]:

hex(127), hex(0b1111111),hex(255), hex(0b11111111)

Out[29]:

('0x7f', '0x7f', '0xff', '0xff')

A few facts from the life of numerologists:
A byte has 8 bits ('0b11111111'), each bit takes one of two symbols [0, 1], and there are only 256 possible combinations of zeros and ones in an 8-bit string.
Hexadecimal notation ('0xff') has 16 symbols [0-9, a, b, c, d, e, f]; just two such digits are enough to write every number from 0 to 255 (256 values in total).
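To convince myself of this arithmetic, here is a small sketch (written for Python 3; the variable names are mine and purely illustrative). It suggests why hexadecimal is so convenient: one hex digit covers exactly four bits, so any byte fits into two hex digits.

```python
# Each extra bit doubles the number of distinct values.
values_in_7_bits = 2 ** 7      # 128 values: 0..127
values_in_8_bits = 2 ** 8      # 256 values: 0..255

# One hex digit stands for exactly four bits,
# so a byte can always be written with two hex digits.
byte = 0b11111111
high_nibble = byte >> 4        # upper four bits
low_nibble = byte & 0x0F       # lower four bits

print(values_in_7_bits, values_in_8_bits)
print(format(byte, '08b'), format(byte, '02x'))
print(hex(high_nibble), hex(low_nibble))
```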

Along the way we have learned the functions ord(), chr(), hex() and bin(). Let us also remember the formats: '23' is decimal, '0x7f' is hexadecimal, '0b1...' is binary. Note that numbers in any of these formats are passed to these functions without quotes.

So, to start encoding we need just three things:
1. A set of characters (letters, digits, signs)
2. A box with numbered cells (there are about a hundred such "different boxes")
3. An agreement on which character goes in which cell (mapping tables).
All together this can be called..., for example, the encoding 'cp1251' (the name of the "box").
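A quick sketch of the "different boxes" idea, assuming Python 3 and the codec names cp1251, koi8-r, cp866 and utf-8 from Python's Standard Encodings list. The same letter lands in a different cell depending on the box; the names letter and cells are only for illustration.

```python
letter = u"\u0436"  # the Cyrillic small letter "zhe", written as an escape

cells = {}
for box in ("cp1251", "koi8-r", "cp866", "utf-8"):
    # encode() looks the character up in the chosen box
    # and returns the number(s) of its cell as bytes
    cells[box] = letter.encode(box)
    print(box, [hex(b) for b in cells[box]])
```

The three single-byte boxes each give one byte, but a different one, while utf-8 needs two bytes for the same letter.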

If encoding is just like typesetting, where do all the difficulties come from?¶

There are plenty of difficulties. It all comes from the fact that at first 1 byte was allotted for encoding (a box with 256 cells, of which only 128 were used at first), then two bytes came into use, ... and eventually things reached 4 bytes (4 x 8 = 32 bits). And people wanted to replace the compact little boxes with one big box for everyone... But it was not that simple... (details in Wikipedia)

Or take this example from the Python Unicode HOWTO. The first encoding you might think of is an array of 32-bit integers. In this representation, the string “Python” would look like this:

   P           y           t           h           o           n

0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00

   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

In [36]:

hex(ord('P')),  hex(ord('y')),  hex(ord('t')),  hex(ord('h')),  hex(ord('o')),  hex(ord('n'))

Out[36]:

('0x50', '0x79', '0x74', '0x68', '0x6f', '0x6e')

Here 4 bytes are used instead of 1, so many unnecessary zeros... Could byte sequences of different lengths be used for encoding? For example, a single-byte encoding for Latin letters and a two-byte one for Cyrillic?
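To see the difference in size, here is a rough comparison, assuming Python 3. I use the utf-32-le codec because it writes the bytes in the same order as the table above and adds no byte order mark; the variable names are illustrative.

```python
word = u"Python"
fixed = word.encode("utf-32-le")   # four bytes per character
compact = word.encode("utf-8")     # one byte per ASCII character

cyrillic = u"\u043f\u0438\u0442\u043e\u043d"   # five Cyrillic letters
mixed = cyrillic.encode("utf-8")               # two bytes per Cyrillic letter

print(len(fixed), len(compact), len(mixed))
print([hex(b) for b in fixed[:4]])             # the letter "P" padded with zeros
```

So a variable-length scheme is possible: UTF-8 does exactly what the question asks, one byte for Latin letters and two for Cyrillic.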

In technical terms, conversions between byte sequences and strings are described by two terms:
• Encoding: the process of converting a string of characters into a sequence of raw bytes according to the desired encoding.
• Decoding: the process of converting a sequence of bytes into a string of characters according to the desired encoding.
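The two terms can be tried out directly. This is a minimal round trip, assuming Python 3 (where text is str and raw data is bytes); in Python 2 the same calls would be made on a unicode object and a str. The flag wrong_codec_failed is just an illustrative name.

```python
text = u"\u043a\u043e\u0434"        # a three-letter Cyrillic word, written with escapes
raw = text.encode("cp1251")          # encoding: characters -> bytes
restored = raw.decode("cp1251")      # decoding: bytes -> characters

print(type(text).__name__, type(raw).__name__)
print(len(text), len(raw), restored == text)

# Decoding with the wrong encoding either fails or silently yields other characters.
try:
    raw.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```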

Encoding rules for UTF-8¶

For some encodings the conversion is trivially simple: in ASCII and Latin-1, for example, each character corresponds to a single byte, so effectively no conversion is required. For other encodings the mapping can be much more complex and produce several bytes per character. Here is the description of UTF-8 from the Python documentation:

UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.)
UTF-8 uses the following rules:
If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

At first glance the rules seem rather odd. Why use only bytes 128 through 255 for two-byte sequences, rather than all of them?
The answer comes quickly: so that the first byte alone tells you whether the following byte(s) belong to it, i.e. so that single-byte sequences can be told apart from two-byte (multi-byte) ones.
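To check my understanding of the rules, I tried encoding a code point by hand. This is only a sketch, assuming Python 3 and the standard UTF-8 bit layout; utf8_encode is my own helper, not a library function, and it skips the validation a real codec performs.

```python
def utf8_encode(code_point):
    """Encode one code point by hand, following the UTF-8 bit layout.

    Illustration only: surrogates and out-of-range values are not rejected.
    """
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        parts = [code_point]
    elif code_point < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        parts = [0xC0 | (code_point >> 6),
                 0x80 | (code_point & 0x3F)]
    elif code_point < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        parts = [0xE0 | (code_point >> 12),
                 0x80 | ((code_point >> 6) & 0x3F),
                 0x80 | (code_point & 0x3F)]
    else:                        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        parts = [0xF0 | (code_point >> 18),
                 0x80 | ((code_point >> 12) & 0x3F),
                 0x80 | ((code_point >> 6) & 0x3F),
                 0x80 | (code_point & 0x3F)]
    return bytes(bytearray(parts))


for cp in (0x77, 0x436, 0x20AC, 0x1F600):
    print(hex(cp), [hex(b) for b in bytearray(utf8_encode(cp))])
```

Every byte of a multi-byte sequence has its top bit set, which is why all of them fall between 128 and 255.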

UTF-8 has several convenient properties:
It can handle any Unicode code point.
A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
A string of ASCII text is also valid UTF-8 text.
UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
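The resynchronization property can be illustrated with a small sketch, assuming Python 3 (where indexing bytes yields integers). The helpers is_continuation and resync are hypothetical names of mine; they rely only on the fact that continuation bytes have the form 10xxxxxx.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


def resync(data, start):
    """Return the index of the first character boundary at or after start."""
    i = start
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


data = u"\u0436w\u20ac".encode("utf-8")   # 2-byte, 1-byte and 3-byte characters
damaged = data[1:]                         # pretend the first byte was lost
boundary = resync(damaged, 0)
print(boundary, damaged[boundary:].decode("utf-8") == u"w\u20ac")
```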

Unicode Literals in Python Source Code¶

Before moving on to concrete examples of working with the 'unicode' object, we need to learn the new notation for this encoding.

In Python source code, Unicode literals are written as strings prefixed with the ‘u’ or ‘U’ character: u'abcdefghijk'.
Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.
Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x only takes two hex digits so it can’t express an arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.

>>> s = u"a\xac\u1234\u20ac\U00008000"

... #      ^^^^ two-digit hex escape

... #          ^^^^^^ four-digit Unicode escape

... #                      ^^^^^^^^^^ eight-digit Unicode escape

>>> for c in s:  print ord(c),

...

97 172 4660 8364 32768
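The title of this post mentions unicodedata, so here is how the same literal could be inspected with it. A sketch assuming Python 3 (print is a function there); unicodedata.name() and unicodedata.category() are standard-library calls, and the "<unnamed>" fallback string is my own choice.

```python
import unicodedata

s = u"a\xac\u1234\u20ac\U00008000"
code_points = [ord(ch) for ch in s]

for ch in s:
    # the second argument of name() is returned if a code point has no name
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))
```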

Python's single-byte encoding: ASCII¶

Let us run an experiment with the Python interpreter. By default it uses a 7-bit encoding (!!!) for strings. That is, it can address only 128 cells (numbers 0 to 127), in which characters are laid out in ASCII order. But memory is allocated in bytes, not bits... So the remaining cells of the byte must be free...?

In [82]:

chr(31), chr(32), chr(48), chr(58), chr(64), chr(65), chr(90), chr(91), chr(94), chr(96), chr(97), chr(122), chr(126), chr(127)

Out[82]:

('\x1f', ' ', '0', ':', '@', 'A', 'Z', '[', '^', '`', 'a', 'z', '~', '\x7f')

As we established, one byte (8 bits) can number (encode) 256 characters. The examples above show that in the default encoding only numbers 32 through 126 hold printable characters; numbers 0 through 31 and 127 are control codes.

Naturally, if we ask for the first address in the second byte, we get an error:

In [68]:

chr(256)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-3010ef827cbd> in <module>()
----> 1 chr(256)

ValueError: chr() arg not in range(256)

Everyone wanted to fill the empty slots in the byte, so variants of the standard appeared. Details can be found in Wikipedia: ASCII
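What happens when a character has no cell in the ASCII box? A small sketch, assuming Python 3; 'replace' and 'backslashreplace' are standard error handlers of encode(), while the variable names are mine.

```python
text = u"w\u0436"   # an ASCII letter followed by a Cyrillic one

try:
    text.encode("ascii")          # strict mode is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("ascii", "replace")           # unknown characters become "?"
escaped = text.encode("ascii", "backslashreplace")   # or a visible escape sequence
print(strict_failed, replaced, escaped)
```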

Python's single-byte encoding: Latin-1¶

The encoding name has synonyms: "...‘latin-1’, ‘iso_8859_1’ and ‘8859’ are all synonyms for the same encoding"...
As the name suggests, the encoding was aimed at covering European languages and was the default encoding in old versions of Python:
"Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals"
It is apparently what comes into play by default when the interpreter handles hexadecimal addresses outside the ASCII range... (More precisely, a Python 2 str simply keeps the raw byte, and the first 256 Unicode code points coincide with Latin-1.)

This brings out an important aspect of how the interpreter works:
1. When the interpreter reads hexadecimal (hex) numbers such as 0x1d, it treats each one as a single (hex) number, i.e. a single character.
2. But in two-byte encodings a character is encoded with two bytes, written as two (hex) numbers (for example: 0x0d 0x1c).
3. Cyrillic is encoded with two bytes (in UTF-8)..., while the interpreter reads byte by byte (half a letter at a time).
That is how mojibake comes about...
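The "half a letter at a time" effect is easy to reproduce. A sketch assuming Python 3: UTF-8 bytes are deliberately decoded with the single-byte latin-1 codec, so every byte becomes its own character. The names garbled and repaired are illustrative.

```python
word = u"\u043f\u0438\u0442\u043e\u043d"   # five Cyrillic letters
utf8_bytes = word.encode("utf-8")           # two bytes per Cyrillic letter

# Wrong box: every byte is treated as a separate Latin-1 character.
garbled = utf8_bytes.decode("latin-1")
print(len(word), len(utf8_bytes), len(garbled))
print(ascii(garbled))

# The damage is reversible as long as no byte was dropped or replaced.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == word)
```

Five letters turn into ten unrelated characters, which is the kind of garbage I kept seeing in the console.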

Many forums claim that Python 2.x does not support encoding/decoding operations on strings. That is not accurate. The encoding of a "string" object can be changed; keep in mind, though, that str works byte by byte, so in a multi-byte encoding one character occupies several elements of the string. Here is the proof: the methods of the str object include decode(...) and encode(...)

In [147]:

help(str)

Help on class str in module __builtin__:

class str(basestring)
 |  str(object='') -> string
 |  
 |  Return a nice string representation of the object.
 |  If the argument is a string, the return value is the same object.
 |  
 |  Method resolution order:
 |      str
 |      basestring
 |      object
 |  
 |  Methods defined here:
 |  
 |  __add__(...)
 |      x.__add__(y) <==> x+y
 |  
 |  __contains__(...)
 |      x.__contains__(y) <==> y in x
 |  
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> string
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __getnewargs__(...)
 |  
 |  __getslice__(...)
 |      x.__getslice__(i, j) <==> x[i:j]
 |      
 |      Use of negative indices is not supported.
 |  
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |  
 |  __hash__(...)
 |      x.__hash__() <==> hash(x)
 |  
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |  
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |  
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |  
 |  __mod__(...)
 |      x.__mod__(y) <==> x%y
 |  
 |  __mul__(...)
 |      x.__mul__(n) <==> x*n
 |  
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __rmod__(...)
 |      x.__rmod__(y) <==> y%x
 |  
 |  __rmul__(...)
 |      x.__rmul__(n) <==> n*x
 |  
 |  __sizeof__(...)
 |      S.__sizeof__() -> size of S in memory, in bytes
 |  
 |  __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  capitalize(...)
 |      S.capitalize() -> string
 |      
 |      Return a copy of the string S with only its first character
 |      capitalized.
 |  
 |  center(...)
 |      S.center(width[, fillchar]) -> string
 |      
 |      Return S centered in a string of length width. Padding is
 |      done using the specified fill character (default is a space)
 |  
 |  count(...)
 |      S.count(sub[, start[, end]]) -> int
 |      
 |      Return the number of non-overlapping occurrences of substring sub in
 |      string S[start:end].  Optional arguments start and end are interpreted
 |      as in slice notation.
 |  
 |  decode(...)
 |      S.decode([encoding[,errors]]) -> object
 |      
 |      Decodes S using the codec registered for encoding. encoding defaults
 |      to the default encoding. errors may be given to set a different error
 |      handling scheme. Default is 'strict' meaning that encoding errors raise
 |      a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
 |      as well as any other name registered with codecs.register_error that is
 |      able to handle UnicodeDecodeErrors.
 |  
 |  encode(...)
 |      S.encode([encoding[,errors]]) -> object
 |      
 |      Encodes S using the codec registered for encoding. encoding defaults
 |      to the default encoding. errors may be given to set a different error
 |      handling scheme. Default is 'strict' meaning that encoding errors raise
 |      a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
 |      'xmlcharrefreplace' as well as any other name registered with
 |      codecs.register_error that is able to handle UnicodeEncodeErrors.
 |  
 |  endswith(...)
 |      S.endswith(suffix[, start[, end]]) -> bool
 |      
 |      Return True if S ends with the specified suffix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      suffix can also be a tuple of strings to try.
 |  
 |  expandtabs(...)
 |      S.expandtabs([tabsize]) -> string
 |      
 |      Return a copy of S where all tab characters are expanded using spaces.
 |      If tabsize is not given, a tab size of 8 characters is assumed.
 |  
 |  find(...)
 |      S.find(sub [,start [,end]]) -> int
 |      
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |      
 |      Return -1 on failure.
 |  
 |  format(...)
 |      S.format(*args, **kwargs) -> string
 |      
 |      Return a formatted version of S, using substitutions from args and kwargs.
 |      The substitutions are identified by braces ('{' and '}').
 |  
 |  index(...)
 |      S.index(sub [,start [,end]]) -> int
 |      
 |      Like S.find() but raise ValueError when the substring is not found.
 |  
 |  isalnum(...)
 |      S.isalnum() -> bool
 |      
 |      Return True if all characters in S are alphanumeric
 |      and there is at least one character in S, False otherwise.
 |  
 |  isalpha(...)
 |      S.isalpha() -> bool
 |      
 |      Return True if all characters in S are alphabetic
 |      and there is at least one character in S, False otherwise.
 |  
 |  isdigit(...)
 |      S.isdigit() -> bool
 |      
 |      Return True if all characters in S are digits
 |      and there is at least one character in S, False otherwise.
 |  
 |  islower(...)
 |      S.islower() -> bool
 |      
 |      Return True if all cased characters in S are lowercase and there is
 |      at least one cased character in S, False otherwise.
 |  
 |  isspace(...)
 |      S.isspace() -> bool
 |      
 |      Return True if all characters in S are whitespace
 |      and there is at least one character in S, False otherwise.
 |  
 |  istitle(...)
 |      S.istitle() -> bool
 |      
 |      Return True if S is a titlecased string and there is at least one
 |      character in S, i.e. uppercase characters may only follow uncased
 |      characters and lowercase characters only cased ones. Return False
 |      otherwise.
 |  
 |  isupper(...)
 |      S.isupper() -> bool
 |      
 |      Return True if all cased characters in S are uppercase and there is
 |      at least one cased character in S, False otherwise.
 |  
 |  join(...)
 |      S.join(iterable) -> string
 |      
 |      Return a string which is the concatenation of the strings in the
 |      iterable.  The separator between elements is S.
 |  
 |  ljust(...)
 |      S.ljust(width[, fillchar]) -> string
 |      
 |      Return S left-justified in a string of length width. Padding is
 |      done using the specified fill character (default is a space).
 |  
 |  lower(...)
 |      S.lower() -> string
 |      
 |      Return a copy of the string S converted to lowercase.
 |  
 |  lstrip(...)
 |      S.lstrip([chars]) -> string or unicode
 |      
 |      Return a copy of the string S with leading whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |      If chars is unicode, S will be converted to unicode before stripping
 |  
 |  partition(...)
 |      S.partition(sep) -> (head, sep, tail)
 |      
 |      Search for the separator sep in S, and return the part before it,
 |      the separator itself, and the part after it.  If the separator is not
 |      found, return S and two empty strings.
 |  
 |  replace(...)
 |      S.replace(old, new[, count]) -> string
 |      
 |      Return a copy of string S with all occurrences of substring
 |      old replaced by new.  If the optional argument count is
 |      given, only the first count occurrences are replaced.
 |  
 |  rfind(...)
 |      S.rfind(sub [,start [,end]]) -> int
 |      
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |      
 |      Return -1 on failure.
 |  
 |  rindex(...)
 |      S.rindex(sub [,start [,end]]) -> int
 |      
 |      Like S.rfind() but raise ValueError when the substring is not found.
 |  
 |  rjust(...)
 |      S.rjust(width[, fillchar]) -> string
 |      
 |      Return S right-justified in a string of length width. Padding is
 |      done using the specified fill character (default is a space)
 |  
 |  rpartition(...)
 |      S.rpartition(sep) -> (head, sep, tail)
 |      
 |      Search for the separator sep in S, starting at the end of S, and return
 |      the part before it, the separator itself, and the part after it.  If the
 |      separator is not found, return two empty strings and S.
 |  
 |  rsplit(...)
 |      S.rsplit([sep [,maxsplit]]) -> list of strings
 |      
 |      Return a list of the words in the string S, using sep as the
 |      delimiter string, starting at the end of the string and working
 |      to the front.  If maxsplit is given, at most maxsplit splits are
 |      done. If sep is not specified or is None, any whitespace string
 |      is a separator.
 |  
 |  rstrip(...)
 |      S.rstrip([chars]) -> string or unicode
 |      
 |      Return a copy of the string S with trailing whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |      If chars is unicode, S will be converted to unicode before stripping
 |  
 |  split(...)
 |      S.split([sep [,maxsplit]]) -> list of strings
 |      
 |      Return a list of the words in the string S, using sep as the
 |      delimiter string.  If maxsplit is given, at most maxsplit
 |      splits are done. If sep is not specified or is None, any
 |      whitespace string is a separator and empty strings are removed
 |      from the result.
 |  
 |  splitlines(...)
 |      S.splitlines(keepends=False) -> list of strings
 |      
 |      Return a list of the lines in S, breaking at line boundaries.
 |      Line breaks are not included in the resulting list unless keepends
 |      is given and true.
 |  
 |  startswith(...)
 |      S.startswith(prefix[, start[, end]]) -> bool
 |      
 |      Return True if S starts with the specified prefix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      prefix can also be a tuple of strings to try.
 |  
 |  strip(...)
 |      S.strip([chars]) -> string or unicode
 |      
 |      Return a copy of the string S with leading and trailing
 |      whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |      If chars is unicode, S will be converted to unicode before stripping
 |  
 |  swapcase(...)
 |      S.swapcase() -> string
 |      
 |      Return a copy of the string S with uppercase characters
 |      converted to lowercase and vice versa.
 |  
 |  title(...)
 |      S.title() -> string
 |      
 |      Return a titlecased version of S, i.e. words start with uppercase
 |      characters, all remaining cased characters have lowercase.
 |  
 |  translate(...)
 |      S.translate(table [,deletechars]) -> string
 |      
 |      Return a copy of the string S, where all characters occurring
 |      in the optional argument deletechars are removed, and the
 |      remaining characters have been mapped through the given
 |      translation table, which must be a string of length 256 or None.
 |      If the table argument is None, no translation is applied and
 |      the operation simply removes the characters in deletechars.
 |  
 |  upper(...)
 |      S.upper() -> string
 |      
 |      Return a copy of the string S converted to uppercase.
 |  
 |  zfill(...)
 |      S.zfill(width) -> string
 |      
 |      Pad a numeric string S with zeros on the left, to fill a field
 |      of the specified width.  The string S is never truncated.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T
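The decode(...) and encode(...) entries in the listing above are enough to move text from one encoding to another. A minimal sketch, written so that it should also run under Python 3 (where the b"..." literal is a bytes object rather than a str); the byte values are simply a six-letter Cyrillic word as it would be stored in cp1251.

```python
cp1251_bytes = b"\xef\xf0\xe8\xe2\xe5\xf2"   # six Cyrillic letters stored as cp1251

text = cp1251_bytes.decode("cp1251")          # bytes -> characters
utf8_bytes = text.encode("utf-8")             # characters -> bytes in another encoding

print(len(cp1251_bytes), len(text), len(utf8_bytes))
```

The intermediate step through characters matters: there is no direct "cp1251 to utf-8" operation on bytes, only decode followed by encode.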

The unicode object in Python 2.x¶

I suppose this is the price for Python representing strings as tuple-like sequences of single characters... But what about two-byte encodings and Unicode? For those, not the "string" object but the "unicode" object is used. At first this solution even seems elegant... But in Python 3 the "string" and "unicode" objects were merged, and the "bytes" and "bytearray" objects were introduced. They say this removed almost all problems with encodings and mojibake... Still, we need to learn how to handle the "unicode" object.
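For comparison, this is roughly how the split between text and bytes looks in Python 3. It is only a sketch under that assumption, not something that runs the same way in the Python 2.x used in this post; the variable names are illustrative.

```python
text = u"\u044e\u043d\u0438\u043a\u043e\u0434"   # six Cyrillic letters
data = text.encode("utf-8")                       # immutable bytes

print(len(text), len(data))        # characters versus bytes

first_char = text[0]               # indexing text yields a one-character string
first_byte = data[0]               # indexing bytes yields an integer

buf = bytearray(data)              # mutable sequence of bytes
buf[0:2] = u"\u042e".encode("utf-8")   # swap in the capital form of the first letter

# Text and bytes no longer mix silently.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
print(mixing_allowed)
```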

In [3]:

help(unicode)

Help on class unicode in module __builtin__:

class unicode(basestring)
 |  unicode(object='') -> unicode object
 |  unicode(string[, encoding[, errors]]) -> unicode object
 |  
 |  Create a new Unicode object from the given encoded string.
 |  encoding defaults to the current default string encoding.
 |  errors can be 'strict', 'replace' or 'ignore' and defaults to 'strict'.
 |  
 |  Method resolution order:
 |      unicode
 |      basestring
 |      object
 |  
 |  Methods defined here:
 |  
 |  __add__(...)
 |      x.__add__(y) <==> x+y
 |  
 |  __contains__(...)
 |      x.__contains__(y) <==> y in x
 |  
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> unicode
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __getnewargs__(...)
 |  
 |  __getslice__(...)
 |      x.__getslice__(i, j) <==> x[i:j]
 |      
 |      Use of negative indices is not supported.
 |  
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |  
 |  __hash__(...)
 |      x.__hash__() <==> hash(x)
 |  
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |  
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |  
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |  
 |  __mod__(...)
 |      x.__mod__(y) <==> x%y
 |  
 |  __mul__(...)
 |      x.__mul__(n) <==> x*n
 |  
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __rmod__(...)
 |      x.__rmod__(y) <==> y%x
 |  
 |  __rmul__(...)
 |      x.__rmul__(n) <==> n*x
 |  
 |  __sizeof__(...)
 |      S.__sizeof__() -> size of S in memory, in bytes
 |  
 |  __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  capitalize(...)
 |      S.capitalize() -> unicode
 |      
 |      Return a capitalized version of S, i.e. make the first character
 |      have upper case and the rest lower case.
 |  
 |  center(...)
 |      S.center(width[, fillchar]) -> unicode
 |      
 |      Return S centered in a Unicode string of length width. Padding is
 |      done using the specified fill character (default is a space)
 |  
 |  count(...)
 |      S.count(sub[, start[, end]]) -> int
 |      
 |      Return the number of non-overlapping occurrences of substring sub in
 |      Unicode string S[start:end].  Optional arguments start and end are
 |      interpreted as in slice notation.
 |  
 |  decode(...)
 |      S.decode([encoding[,errors]]) -> string or unicode
 |      
 |      Decodes S using the codec registered for encoding. encoding defaults
 |      to the default encoding. errors may be given to set a different error
 |      handling scheme. Default is 'strict' meaning that encoding errors raise
 |      a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
 |      as well as any other name registered with codecs.register_error that is
 |      able to handle UnicodeDecodeErrors.
 |  
 |  encode(...)
 |      S.encode([encoding[,errors]]) -> string or unicode
 |      
 |      Encodes S using the codec registered for encoding. encoding defaults
 |      to the default encoding. errors may be given to set a different error
 |      handling scheme. Default is 'strict' meaning that encoding errors raise
 |      a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
 |      'xmlcharrefreplace' as well as any other name registered with
 |      codecs.register_error that can handle UnicodeEncodeErrors.
 |  
 |  endswith(...)
 |      S.endswith(suffix[, start[, end]]) -> bool
 |      
 |      Return True if S ends with the specified suffix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      suffix can also be a tuple of strings to try.
 |  
 |  expandtabs(...)
 |      S.expandtabs([tabsize]) -> unicode
 |      
 |      Return a copy of S where all tab characters are expanded using spaces.
 |      If tabsize is not given, a tab size of 8 characters is assumed.
 |  
 |  find(...)
 |      S.find(sub [,start [,end]]) -> int
 |      
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |      
 |      Return -1 on failure.
 |  
 |  format(...)
 |      S.format(*args, **kwargs) -> unicode
 |      
 |      Return a formatted version of S, using substitutions from args and kwargs.
 |      The substitutions are identified by braces ('{' and '}').
 |  
 |  index(...)
 |      S.index(sub [,start [,end]]) -> int
 |      
 |      Like S.find() but raise ValueError when the substring is not found.
 |  
 |  isalnum(...)
 |      S.isalnum() -> bool
 |      
 |      Return True if all characters in S are alphanumeric
 |      and there is at least one character in S, False otherwise.
 |  
 |  isalpha(...)
 |      S.isalpha() -> bool
 |      
 |      Return True if all characters in S are alphabetic
 |      and there is at least one character in S, False otherwise.
 |  
 |  isdecimal(...)
 |      S.isdecimal() -> bool
 |      
 |      Return True if there are only decimal characters in S,
 |      False otherwise.
 |  
 |  isdigit(...)
 |      S.isdigit() -> bool
 |      
 |      Return True if all characters in S are digits
 |      and there is at least one character in S, False otherwise.
 |  
 |  islower(...)
 |      S.islower() -> bool
 |      
 |      Return True if all cased characters in S are lowercase and there is
 |      at least one cased character in S, False otherwise.
 |  
 |  isnumeric(...)
 |      S.isnumeric() -> bool
 |      
 |      Return True if there are only numeric characters in S,
 |      False otherwise.
 |  
 |  isspace(...)
 |      S.isspace() -> bool
 |      
 |      Return True if all characters in S are whitespace
 |      and there is at least one character in S, False otherwise.
 |  
 |  istitle(...)
 |      S.istitle() -> bool
 |      
 |      Return True if S is a titlecased string and there is at least one
 |      character in S, i.e. upper- and titlecase characters may only
 |      follow uncased characters and lowercase characters only cased ones.
 |      Return False otherwise.
 |  
 |  isupper(...)
 |      S.isupper() -> bool
 |      
 |      Return True if all cased characters in S are uppercase and there is
 |      at least one cased character in S, False otherwise.
 |  
 |  join(...)
 |      S.join(iterable) -> unicode
 |      
 |      Return a string which is the concatenation of the strings in the
 |      iterable.  The separator between elements is S.
 |  
 |  ljust(...)
 |      S.ljust(width[, fillchar]) -> int
 |      
 |      Return S left-justified in a Unicode string of length width. Padding is
 |      done using the specified fill character (default is a space).
 |  
 |  lower(...)
 |      S.lower() -> unicode
 |      
 |      Return a copy of the string S converted to lowercase.
 |  
 |  lstrip(...)
 |      S.lstrip([chars]) -> unicode
 |      
 |      Return a copy of the string S with leading whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |      If chars is a str, it will be converted to unicode before stripping
 |  
 |  partition(...)
 |      S.partition(sep) -> (head, sep, tail)
 |      
 |      Search for the separator sep in S, and return the part before it,
 |      the separator itself, and the part after it.  If the separator is not
 |      found, return S and two empty strings.
 |  
 |  replace(...)
 |      S.replace(old, new[, count]) -> unicode
 |      
 |      Return a copy of S with all occurrences of substring
 |      old replaced by new.  If the optional argument count is
 |      given, only the first count occurrences are replaced.
 |  
 |  rfind(...)
 |      S.rfind(sub [,start [,end]]) -> int
 |      
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |      
 |      Return -1 on failure.
 |  
 |  rindex(...)
 |      S.rindex(sub [,start [,end]]) -> int
 |      
 |      Like S.rfind() but raise ValueError when the substring is not found.
 |  
 |  rjust(...)
 |      S.rjust(width[, fillchar]) -> unicode
 |      
 |      Return S right-justified in a Unicode string of length width. Padding is
 |      done using the specified fill character (default is a space).
 |  
 |  rpartition(...)
 |      S.rpartition(sep) -> (head, sep, tail)
 |      
 |      Search for the separator sep in S, starting at the end of S, and return
 |      the part before it, the separator itself, and the part after it.  If the
 |      separator is not found, return two empty strings and S.
 |  
 |  rsplit(...)
 |      S.rsplit([sep [,maxsplit]]) -> list of strings
 |      
 |      Return a list of the words in S, using sep as the
 |      delimiter string, starting at the end of the string and
 |      working to the front.  If maxsplit is given, at most maxsplit
 |      splits are done. If sep is not specified, any whitespace string
 |      is a separator.
 |  
 |  rstrip(...)
 |      S.rstrip([chars]) -> unicode
 |      
 |      Return a copy of the string S with trailing whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |      If chars is a str, it will be converted to unicode before stripping
 |  
 |  split(...)
 |      S.split([sep [,maxsplit]]) -> list of strings
 |      
 |      Return a list of the words in S, using sep as the
 |      delimiter string.  If maxsplit is given, at most maxsplit
 |      splits are done. If sep is not specified or is None, any
 |      whitespace string is a separator and empty strings are
 |      removed from the result.
 |  
 |  splitlines(...)
 |      S.splitlines(keepends=False) -> list of strings
 |      
 |      Return a list of the lines in S, breaking at line boundaries.
 |      Line breaks are not included in the resulting list unless keepends
 |      is given and true.
 |  
 |  startswith(...)
 |      S.startswith(prefix[, start[, end]]) -> bool
 |      
 |      Return True if S starts with the specified prefix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      prefix can also be a tuple of strings to try.
 |  
 |  strip(...)
 |      S.strip([chars]) -> unicode
 |      
 |      Return a copy of the string S with leading and trailing
 |      whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |      If chars is a str, it will be converted to unicode before stripping
 |  
 |  swapcase(...)
 |      S.swapcase() -> unicode
 |      
 |      Return a copy of S with uppercase characters converted to lowercase
 |      and vice versa.
 |  
 |  title(...)
 |      S.title() -> unicode
 |      
 |      Return a titlecased version of S, i.e. words start with title case
 |      characters, all remaining cased characters have lower case.
 |  
 |  translate(...)
 |      S.translate(table) -> unicode
 |      
 |      Return a copy of the string S, where all characters have been mapped
 |      through the given translation table, which must be a mapping of
 |      Unicode ordinals to Unicode ordinals, Unicode strings or None.
 |      Unmapped characters are left untouched. Characters mapped to None
 |      are deleted.
 |  
 |  upper(...)
 |      S.upper() -> unicode
 |      
 |      Return a copy of S converted to uppercase.
 |  
 |  zfill(...)
 |      S.zfill(width) -> unicode
 |      
 |      Pad a numeric string S with zeros on the left, to fill a field
 |      of the specified width. The string S is never truncated.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T

The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors.

Taken literally: "The first argument is converted to Unicode using the specified encoding"... The wording is not very clear to me; perhaps the Habr article (link at the top of the post) threw me off?
So let's start from the end (the file-reading commands are at the end of the post). There everything is clear: the file has some encoding, you have to name it, and the bytes are decoded into a unicode object (which the interpreter stores internally as UCS-2 or UCS-4, depending on the build). But here we have printed a string on the screen; what encoding is it in? If it is ASCII, why do we see Cyrillic?
Here I concatenated a Latin letter with a Cyrillic one... and both IPython Notebook and Spyder display it:

In [167]:

sstr='q'+'ы'
print sstr

qы

In [168]:

sstr, type(sstr)

Out[168]:

('q\xd1\x8b', str)
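
To see why the character on screen depends on who decodes the bytes, here is a small sketch in Python 3 syntax (the notebook itself runs Python 2.7, where the same idea is spelled str.decode). It takes the three bytes from Out[168] and reads them with three encodings; the variable names are mine, not from the original session.

```python
# Python 3 sketch: one byte sequence, several possible readings.
raw = b"q\xd1\x8b"  # the bytes shown in Out[168]

as_utf8 = raw.decode("utf-8")      # 'q' followed by U+044B
as_cp1251 = raw.decode("cp1251")   # two unrelated characters instead of one
as_cp866 = raw.decode("cp866")     # a different kind of garbage again

for label, text in [("utf-8", as_utf8), ("cp1251", as_cp1251), ("cp866", as_cp866)]:
    # ascii() escapes non-ASCII characters, so this prints safely on any console
    print(label, len(text), ascii(text))
```

Only the UTF-8 reading yields two characters; the single-byte code pages turn the pair \xd1\x8b into two separate symbols, which is exactly what mojibake looks like.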

Apparently, whoever draws the text on screen has the final say... That is why we see "ы". One way or another, the interpreter receives the byte string 'q\xd1\x8b': the Latin 'q' is a single ASCII byte, while 'ы' arrived as the two bytes \xd1\x8b, which is its UTF-8 encoding. print hands those bytes over unchanged, and the front end (the notebook, or the console under Windows) decodes them with its own encoding while displaying them...
So the encoding to pass to unicode() may depend on the settings and quirks of IDLE, on the Windows settings, and... on whether we call the command line from the IPython console, for example "!dir"... All of these questions still have to be sorted out; here let us just note that the documentation says in black and white that the first argument, a byte string, is converted to Unicode (and to nothing else), and the second argument is the encoding of that string...

In [164]:

s=unicode(sstr,'utf-8')

In [165]:

s

Out[165]:

u'q\u044b'

In [166]:

print s

qы

What did we just do? We took the "string" object 'q\xd1\x8b'
and converted it into the "unicode" object u'q\u044b'.
The interpreter handles these two sets of characters differently. The notation differs as well: in the first case the \xNN escapes stand for individual bytes, in the second the \uNNNN escapes stand for Unicode code points (details are in the documentation).
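
The same conversion can be sketched in Python 3 syntax, where unicode(sstr, 'utf-8') is spelled bytes.decode('utf-8'). The example also shows what the documentation quote warns about: decoding with ASCII fails on bytes above 127. The names raw, text and lenient are illustrative only.

```python
raw = b"q\xd1\x8b"

text = raw.decode("utf-8")      # Python 3 spelling of unicode(sstr, 'utf-8')
assert text == "q\u044b"
assert len(raw) == 3 and len(text) == 2   # three bytes, two characters

try:
    raw.decode("ascii")          # roughly what unicode(sstr) without an encoding attempts
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("ascii decoding failed:", exc.reason)

# errors= chooses what to do with undecodable bytes instead of raising
lenient = raw.decode("ascii", errors="replace")
print(ascii(lenient))
```

With errors="replace" every undecodable byte becomes U+FFFD, so the information is lost; naming the right encoding is the only real fix.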

Let's see how Unicode characters can be converted and printed

In [182]:

ord(u'\u044b')

Out[182]:

1099

In [189]:

unichr(1099)

Out[189]:

u'\u044b'

In [188]:

print unichr(1099)

ы
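
As a rough Python 3 counterpart of the cells above (chr() there plays the role of unichr()), the sketch below ties together the code point, its hexadecimal and binary forms, and the number of bytes it takes in a few encodings. The choice of encodings in the dictionary is mine.

```python
ch = "\u044b"                 # the letter from the cells above

code_point = ord(ch)          # 1099
as_hex = hex(code_point)      # '0x44b'
as_bin = bin(code_point)
back = chr(code_point)        # Python 3 chr() plays the role of unichr()

# The same code point occupies a different number of bytes in each encoding.
sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "cp1251", "utf-16-le", "utf-32-le")}
print(code_point, as_hex, as_bin, sizes)
```

The code point 1099 is a property of the character itself; the byte counts belong to the encodings, which is the distinction the typesetter model is meant to convey.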

There is also this library; details are in the Unicode HOWTO documentation

In [190]:

import unicodedata

u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)

# Get numeric value of second character
print unicodedata.numeric(u[1])

0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0

Besides the main functions, this is a nice example of using enumerate(u)... And of course it is interesting to see what else the unicodedata module can do

In [191]:

help(unicodedata)

Help on module unicodedata:

NAME
    unicodedata

FILE
    c:\users\kiss\anaconda\dlls\unicodedata.pyd

DESCRIPTION
    This module provides access to the Unicode Character Database which
    defines character properties for all Unicode characters. The data in
    this database is based on the UnicodeData.txt file version
    5.2.0 which is publically available from ftp://ftp.unicode.org/.
    
    The module uses the same names and symbols as defined by the
    UnicodeData File Format 5.2.0 (see
    http://www.unicode.org/reports/tr44/tr44-4.html).

CLASSES
    __builtin__.object
        UCD
    
    class UCD(__builtin__.object)
     |  Methods defined here:
     |  
     |  __getattribute__(...)
     |      x.__getattribute__('name') <==> x.name
     |  
     |  bidirectional(...)
     |      bidirectional(unichr)
     |      
     |      Returns the bidirectional class assigned to the Unicode character
     |      unichr as string. If no such value is defined, an empty string is
     |      returned.
     |  
     |  category(...)
     |      category(unichr)
     |      
     |      Returns the general category assigned to the Unicode character
     |      unichr as string.
     |  
     |  combining(...)
     |      combining(unichr)
     |      
     |      Returns the canonical combining class assigned to the Unicode
     |      character unichr as integer. Returns 0 if no combining class is
     |      defined.
     |  
     |  decimal(...)
     |      decimal(unichr[, default])
     |      
     |      Returns the decimal value assigned to the Unicode character unichr
     |      as integer. If no such value is defined, default is returned, or, if
     |      not given, ValueError is raised.
     |  
     |  decomposition(...)
     |      decomposition(unichr)
     |      
     |      Returns the character decomposition mapping assigned to the Unicode
     |      character unichr as string. An empty string is returned in case no
     |      such mapping is defined.
     |  
     |  digit(...)
     |      digit(unichr[, default])
     |      
     |      Returns the digit value assigned to the Unicode character unichr as
     |      integer. If no such value is defined, default is returned, or, if
     |      not given, ValueError is raised.
     |  
     |  east_asian_width(...)
     |      east_asian_width(unichr)
     |      
     |      Returns the east asian width assigned to the Unicode character
     |      unichr as string.
     |  
     |  lookup(...)
     |      lookup(name)
     |      
     |      Look up character by name.  If a character with the
     |      given name is found, return the corresponding Unicode
     |      character.  If not found, KeyError is raised.
     |  
     |  mirrored(...)
     |      mirrored(unichr)
     |      
     |      Returns the mirrored property assigned to the Unicode character
     |      unichr as integer. Returns 1 if the character has been identified as
     |      a "mirrored" character in bidirectional text, 0 otherwise.
     |  
     |  name(...)
     |      name(unichr[, default])
     |      Returns the name assigned to the Unicode character unichr as a
     |      string. If no name is defined, default is returned, or, if not
     |      given, ValueError is raised.
     |  
     |  normalize(...)
     |      normalize(form, unistr)
     |      
     |      Return the normal form 'form' for the Unicode string unistr.  Valid
     |      values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
     |  
     |  numeric(...)
     |      numeric(unichr[, default])
     |      
     |      Returns the numeric value assigned to the Unicode character unichr
     |      as float. If no such value is defined, default is returned, or, if
     |      not given, ValueError is raised.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  unidata_version

FUNCTIONS
    bidirectional(...)
        bidirectional(unichr)
        
        Returns the bidirectional class assigned to the Unicode character
        unichr as string. If no such value is defined, an empty string is
        returned.
    
    category(...)
        category(unichr)
        
        Returns the general category assigned to the Unicode character
        unichr as string.
    
    combining(...)
        combining(unichr)
        
        Returns the canonical combining class assigned to the Unicode
        character unichr as integer. Returns 0 if no combining class is
        defined.
    
    decimal(...)
        decimal(unichr[, default])
        
        Returns the decimal value assigned to the Unicode character unichr
        as integer. If no such value is defined, default is returned, or, if
        not given, ValueError is raised.
    
    decomposition(...)
        decomposition(unichr)
        
        Returns the character decomposition mapping assigned to the Unicode
        character unichr as string. An empty string is returned in case no
        such mapping is defined.
    
    digit(...)
        digit(unichr[, default])
        
        Returns the digit value assigned to the Unicode character unichr as
        integer. If no such value is defined, default is returned, or, if
        not given, ValueError is raised.
    
    east_asian_width(...)
        east_asian_width(unichr)
        
        Returns the east asian width assigned to the Unicode character
        unichr as string.
    
    lookup(...)
        lookup(name)
        
        Look up character by name.  If a character with the
        given name is found, return the corresponding Unicode
        character.  If not found, KeyError is raised.
    
    mirrored(...)
        mirrored(unichr)
        
        Returns the mirrored property assigned to the Unicode character
        unichr as integer. Returns 1 if the character has been identified as
        a "mirrored" character in bidirectional text, 0 otherwise.
    
    name(...)
        name(unichr[, default])
        Returns the name assigned to the Unicode character unichr as a
        string. If no name is defined, default is returned, or, if not
        given, ValueError is raised.
    
    normalize(...)
        normalize(form, unistr)
        
        Return the normal form 'form' for the Unicode string unistr.  Valid
        values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
    
    numeric(...)
        numeric(unichr[, default])
        
        Returns the numeric value assigned to the Unicode character unichr
        as float. If no such value is defined, default is returned, or, if
        not given, ValueError is raised.

DATA
    ucd_3_2_0 = <unicodedata.UCD object>
    ucnhash_CAPI = <capsule object "unicodedata.ucnhash_CAPI">
    unidata_version = '5.2.0'
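
Two of the functions listed in the help output, lookup() and normalize(), are not exercised anywhere in the post. A minimal sketch in Python 3 syntax, assuming nothing beyond the standard unicodedata module:

```python
import unicodedata

yeru = unicodedata.lookup("CYRILLIC SMALL LETTER YERU")   # name -> character
assert unicodedata.name(yeru) == "CYRILLIC SMALL LETTER YERU"
assert unicodedata.category(yeru) == "Ll"

composed = "\u00e9"                       # e with acute as one code point
decomposed = unicodedata.normalize("NFD", composed)       # 'e' + combining acute
recomposed = unicodedata.normalize("NFC", decomposed)
print(len(composed), len(decomposed), composed == recomposed)
```

Normalization matters when comparing strings: the composed and decomposed forms look identical on screen but are different sequences of code points.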

Next I should have looked through the encoding tables... But you cannot embrace the boundless. The topic of converting files matters more.

How do we convert files?

As an example, take a file that was written by a Python parser program. Reading the file gives a list of lines; the first and third contain Cyrillic characters, but in their place we find hexadecimal escape sequences:

In [174]:

filepath="C:\\Users\\kiss\\Documents\\IPython Notebooks\\web\\Spyder\\mail\\mail_pages\\0.csv" 
f = open(filepath)

In [175]:

f.readlines()

Out[175]:

['"\xc0\xe2\xf2\xee@Mail.Ru";85837;"http://auto.mail.ru"\n',
 '\n',
 '"\xcf\xe5\xf0\xe5\xf5\xee\xe4\xfb";"%";"\xd1\xf2\xf0\xe0\xed\xe8\xf6\xfb"\n',
 '"327";"8,11";"http://cars.mail.ru/reviews/mercedes-benz/vito"\n',
 '"283";"7,02";"http://cars.mail.ru/reviews/mercedes-benz/c"\n',
 '"257";"6,37";"http://cars.mail.ru/reviews/mercedes-benz/glk"\n',
 '"247";"6,12";"http://cars.mail.ru/reviews/mercedes-benz/a"\n',
 '"242";"6,00";"http://cars.mail.ru/catalog/mercedes-benz"\n',
 '"222";"5,50";"http://cars.mail.ru/reviews/mercedes-benz/e"\n',
 '"218";"5,40";"http://cars.mail.ru/reviews/mercedes-benz/gl"\n',
 '"199";"4,93";"http://cars.mail.ru/reviews/mercedes-benz"\n',
 '"192";"4,76";"http://cars.mail.ru/reviews/mercedes-benz/m"\n',
 '"178";"4,41";"http://cars.mail.ru/reviews/mercedes-benz/s"\n',
 '"134";"3,32";"http://cars.mail.ru/reviews/mercedes-benz/g"\n',
 '"134";"3,32";"http://cars.mail.ru/reviews/mercedes-benz/sprinter"\n',
 '"104";"2,58";"http://cars.mail.ru/reviews/mercedes-benz/190"\n',
 '"58";"1,44";"http://cars.mail.ru/reviews/mercedes-benz/230"\n',
 '"44";"1,09";"http://cars.mail.ru/reviews/mercedes-benz/b"\n',
 '"43";"1,07";"http://cars.mail.ru/reviews/mercedes-benz/200"\n',
 '"39";"0,97";"http://cars.mail.ru/reviews/mercedes-benz/glk/2012/54385"\n',
 '"39";"0,97";"http://cars.mail.ru/reviews/mercedes-benz/viano"\n',
 '"38";"0,94";"http://cars.mail.ru/reviews/mercedes-benz/cls"\n',
 '"33";"0,82";"http://cars.mail.ru/reviews/mercedes-benz/gl/2013/55075"\n',
 '\n']
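
The escapes in Out[175] are simply the raw bytes of the file. A short sketch in Python 3 syntax shows that they read as Cyrillic under cp1251 and are not valid UTF-8 at all; the variable names are illustrative.

```python
first_field = b"\xc0\xe2\xf2\xee@Mail.Ru"   # bytes copied from Out[175]

decoded = first_field.decode("cp1251")
print(ascii(decoded))

try:
    first_field.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False      # the byte 0xC0 never occurs in well-formed UTF-8
```

Trying a strict UTF-8 decode first is a cheap sanity check: a failure rules UTF-8 out, although success alone does not prove which encoding was used.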

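Querying the encoding returns "None", an attempt at conversion ends in an error... And on top of that the object cannot be read. Why? f.encoding is None because a plain open() returns byte strings ("str" objects, one byte per element) and knows nothing about their encoding. And the second readlines() is empty simply because the call in In [175] has already read the file to the end; f.seek(0) would rewind it.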

In [179]:

f.encoding, f.readlines()

Out[179]:

(None, [])
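
The empty list can be reproduced without the original file. The sketch below (Python 3 syntax) writes a small temporary file as a hypothetical stand-in for 0.csv and reads it twice, then rewinds with seek(0).

```python
import os
import tempfile

# Hypothetical stand-in for 0.csv: a small temporary file with cp1251 bytes.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b'"\xc0\xe2\xf2\xee@Mail.Ru";85837\n"327";"8,11"\n')

with open(path, "rb") as f:
    first = f.readlines()     # reads everything, the position is now at EOF
    second = f.readlines()    # nothing left to read, hence []
    f.seek(0)                 # rewind to the beginning
    third = f.readlines()     # the same lines as in `first`

os.remove(path)
print(len(first), len(second), len(third))
```

So the empty result says nothing about encodings; it is ordinary file-position behaviour.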

But there is a magic library, codecs, that decodes the bytes into 'unicode' objects as it reads the file

In [180]:

import codecs
f1 = codecs.open(filepath, encoding='cp1251')
for line in f1:
    print line, repr(line)

"Авто@Mail.Ru";85837;"http://auto.mail.ru"
u'"\u0410\u0432\u0442\u043e@Mail.Ru";85837;"http://auto.mail.ru"\n'

u'\n'
"Переходы";"%";"Страницы"
u'"\u041f\u0435\u0440\u0435\u0445\u043e\u0434\u044b";"%";"\u0421\u0442\u0440\u0430\u043d\u0438\u0446\u044b"\n'
"327";"8,11";"http://cars.mail.ru/reviews/mercedes-benz/vito"
u'"327";"8,11";"http://cars.mail.ru/reviews/mercedes-benz/vito"\n'
"283";"7,02";"http://cars.mail.ru/reviews/mercedes-benz/c"
u'"283";"7,02";"http://cars.mail.ru/reviews/mercedes-benz/c"\n'
"257";"6,37";"http://cars.mail.ru/reviews/mercedes-benz/glk"
u'"257";"6,37";"http://cars.mail.ru/reviews/mercedes-benz/glk"\n'
"247";"6,12";"http://cars.mail.ru/reviews/mercedes-benz/a"
u'"247";"6,12";"http://cars.mail.ru/reviews/mercedes-benz/a"\n'
"242";"6,00";"http://cars.mail.ru/catalog/mercedes-benz"
u'"242";"6,00";"http://cars.mail.ru/catalog/mercedes-benz"\n'
"222";"5,50";"http://cars.mail.ru/reviews/mercedes-benz/e"
u'"222";"5,50";"http://cars.mail.ru/reviews/mercedes-benz/e"\n'
"218";"5,40";"http://cars.mail.ru/reviews/mercedes-benz/gl"
u'"218";"5,40";"http://cars.mail.ru/reviews/mercedes-benz/gl"\n'
"199";"4,93";"http://cars.mail.ru/reviews/mercedes-benz"
u'"199";"4,93";"http://cars.mail.ru/reviews/mercedes-benz"\n'
"192";"4,76";"http://cars.mail.ru/reviews/mercedes-benz/m"
u'"192";"4,76";"http://cars.mail.ru/reviews/mercedes-benz/m"\n'
"178";"4,41";"http://cars.mail.ru/reviews/mercedes-benz/s"
u'"178";"4,41";"http://cars.mail.ru/reviews/mercedes-benz/s"\n'
"134";"3,32";"http://cars.mail.ru/reviews/mercedes-benz/g"
u'"134";"3,32";"http://cars.mail.ru/reviews/mercedes-benz/g"\n'
"134";"3,32";"http://cars.mail.ru/reviews/mercedes-benz/sprinter"
u'"134";"3,32";"http://cars.mail.ru/reviews/mercedes-benz/sprinter"\n'
"104";"2,58";"http://cars.mail.ru/reviews/mercedes-benz/190"
u'"104";"2,58";"http://cars.mail.ru/reviews/mercedes-benz/190"\n'
"58";"1,44";"http://cars.mail.ru/reviews/mercedes-benz/230"
u'"58";"1,44";"http://cars.mail.ru/reviews/mercedes-benz/230"\n'
"44";"1,09";"http://cars.mail.ru/reviews/mercedes-benz/b"
u'"44";"1,09";"http://cars.mail.ru/reviews/mercedes-benz/b"\n'
"43";"1,07";"http://cars.mail.ru/reviews/mercedes-benz/200"
u'"43";"1,07";"http://cars.mail.ru/reviews/mercedes-benz/200"\n'
"39";"0,97";"http://cars.mail.ru/reviews/mercedes-benz/glk/2012/54385"
u'"39";"0,97";"http://cars.mail.ru/reviews/mercedes-benz/glk/2012/54385"\n'
"39";"0,97";"http://cars.mail.ru/reviews/mercedes-benz/viano"
u'"39";"0,97";"http://cars.mail.ru/reviews/mercedes-benz/viano"\n'
"38";"0,94";"http://cars.mail.ru/reviews/mercedes-benz/cls"
u'"38";"0,94";"http://cars.mail.ru/reviews/mercedes-benz/cls"\n'
"33";"0,82";"http://cars.mail.ru/reviews/mercedes-benz/gl/2013/55075"
u'"33";"0,82";"http://cars.mail.ru/reviews/mercedes-benz/gl/2013/55075"\n'

u'\n'

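Each line is printed twice here because that is what we asked for: print line, repr(line)

But what a surprise: the file on disk turned out to be in 'cp1251'. For now this is puzzling, since a little earlier, when converting the string from the screen, we used UTF-8. (Note that f1.encoding only reports the name we passed to codecs.open ourselves; nothing is detected from the file.)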

In [181]:

f1.encoding

Out[181]:

'cp1251'
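
The section title asks how to convert files, so here is one possible sketch of the full round trip in Python 3 syntax: decode with the source encoding while reading, encode with the target encoding while writing. The function convert_file and the temporary file names are hypothetical; in Python 2.7 the same pattern should work with io.open or codecs.open, which accept an encoding argument.

```python
import os
import shutil
import tempfile

def convert_file(src_path, dst_path, src_encoding="cp1251", dst_encoding="utf-8"):
    """Decode src_path with src_encoding and write it back out as dst_encoding."""
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)

# Hypothetical demo data; the real 0.csv is not needed to try this out.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in_cp1251.csv")
dst_path = os.path.join(workdir, "out_utf8.csv")
with open(src_path, "wb") as f:
    f.write(b'"\xcf\xe5\xf0\xe5\xf5\xee\xe4\xfb";"%"\r\n')

convert_file(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()

shutil.rmtree(workdir)
print(len(converted))
```

Processing line by line keeps memory use flat for large files; the source encoding still has to be known in advance, because it cannot be reliably read from the file itself.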

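In essence, what makes the difference is the u prefix: it tells the interpreter to resolve the \uXXXX escapes in a literal into single characters, whereas in a plain str literal each of them stays as six ordinary characters: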

In [199]:

# the first line with Cyrillic from Out[180], read from the 'cp1251' file
print '"\u0410\u0432\u0442\u043e@Mail.Ru";85837;"http://auto.mail.ru"\n'

"\u0410\u0432\u0442\u043e@Mail.Ru";85837;"http://auto.mail.ru"

Now let's turn the string into a unicode object by adding 'u'

In [200]:

# the first line with Cyrillic from Out[180], read from the 'cp1251' file
print u'"\u0410\u0432\u0442\u043e@Mail.Ru";85837;"http://auto.mail.ru"\n'

"Авто@Mail.Ru";85837;"http://auto.mail.ru"

So almost everything is clear; what remains, perhaps, is the question of how to read Cyrillic from the Windows console... But more on that another time, ...though let's leave a small head start here for studying sys and os... we will need them not only to get information about encodings..., but also to intercept the input and output streams.

In [122]:

import sys
sys.getfilesystemencoding()

Out[122]:

'mbcs'

Windows uses a configurable encoding; on Windows, Python uses the name “mbcs” to refer to whatever the currently configured encoding is. (Unicode HOWTO)

In [48]:

!chcp

’ҐЄгй п Є®¤®ў п бва Ёж : 866
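
The garbled line above can be explained with a short sketch in Python 3 syntax. It assumes that chcp actually printed the usual Russian message "Текущая кодовая страница: 866" in code page 866 and that the notebook then read those bytes as cp1251; both are my assumptions, although the result matches the output above.

```python
original = "Текущая кодовая страница: 866"   # presumed real output of chcp

console_bytes = original.encode("cp866")      # what the console program emits
mojibake = console_bytes.decode("cp1251")     # what a cp1251 reader makes of it
repaired = mojibake.encode("cp1251").decode("cp866")

print(ascii(mojibake))
assert repaired == original
```

Because both code pages are single-byte and every byte involved has a cp1251 meaning, the damage here is reversible: re-encoding with the wrong codec and decoding with the right one restores the text.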

In [140]:

sys.getdefaultencoding()

Out[140]:

'ascii'
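
For comparison, the same questions can be put to a Python 3 interpreter, where the default string encoding is always 'utf-8' rather than 'ascii'. The helper encoding_report is a hypothetical name; the other values depend on the machine and are printed without any claim about what they will be.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings the interpreter reports for the current environment."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }

report = encoding_report()
for key, value in sorted(report.items()):
    print(key, value)
```

Seeing all four values side by side makes it easier to spot which layer (interpreter, file system, console or locale) disagrees with the others.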

In [141]:

help(sys)

Help on built-in module sys:

NAME
    sys

FILE
    (built-in)

MODULE DOCS
    http://docs.python.org/library/sys

DESCRIPTION
    This module provides access to some objects used or maintained by the
    interpreter and to functions that interact strongly with the interpreter.
    
    Dynamic objects:
    
    argv -- command line arguments; argv[0] is the script pathname if known
    path -- module search path; path[0] is the script directory, else ''
    modules -- dictionary of loaded modules
    
    displayhook -- called to show results in an interactive session
    excepthook -- called to handle any uncaught exception other than SystemExit
      To customize printing in an interactive session or to install a custom
      top-level exception handler, assign other functions to replace these.
    
    exitfunc -- if sys.exitfunc exists, this routine is called when Python exits
      Assigning to sys.exitfunc is deprecated; use the atexit module instead.
    
    stdin -- standard input file object; used by raw_input() and input()
    stdout -- standard output file object; used by the print statement
    stderr -- standard error object; used for error messages
      By assigning other file objects (or objects that behave like files)
      to these, it is possible to redirect all of the interpreter's I/O.
    
    last_type -- type of last uncaught exception
    last_value -- value of last uncaught exception
    last_traceback -- traceback of last uncaught exception
      These three are only available in an interactive session after a
      traceback has been printed.
    
    exc_type -- type of exception currently being handled
    exc_value -- value of exception currently being handled
    exc_traceback -- traceback of exception currently being handled
      The function exc_info() should be used instead of these three,
      because it is thread-safe.
    
    Static objects:
    
    float_info -- a dict with information about the float inplementation.
    long_info -- a struct sequence with information about the long implementation.
    maxint -- the largest supported integer (the smallest is -maxint-1)
    maxsize -- the largest supported length of containers.
    maxunicode -- the largest supported character
    builtin_module_names -- tuple of module names built into this interpreter
    version -- the version of this interpreter as a string
    version_info -- version information as a named tuple
    hexversion -- version information encoded as a single integer
    copyright -- copyright notice pertaining to this interpreter
    platform -- platform identifier
    executable -- absolute path of the executable binary of the Python interpreter
    prefix -- prefix used to find the Python library
    exec_prefix -- prefix used to find the machine-specific Python library
    float_repr_style -- string indicating the style of repr() output for floats
    dllhandle -- [Windows only] integer handle of the Python DLL
    winver -- [Windows only] version number of the Python DLL
    __stdin__ -- the original stdin; don't touch!
    __stdout__ -- the original stdout; don't touch!
    __stderr__ -- the original stderr; don't touch!
    __displayhook__ -- the original displayhook; don't touch!
    __excepthook__ -- the original excepthook; don't touch!
    
    Functions:
    
    displayhook() -- print an object to the screen, and save it in __builtin__._
    excepthook() -- print an exception and its traceback to sys.stderr
    exc_info() -- return thread-safe information about the current exception
    exc_clear() -- clear the exception state for the current thread
    exit() -- exit the interpreter by raising SystemExit
    getdlopenflags() -- returns flags to be used for dlopen() calls
    getprofile() -- get the global profiling function
    getrefcount() -- return the reference count for an object (plus one :-)
    getrecursionlimit() -- return the max recursion depth for the interpreter
    getsizeof() -- return the size of an object in bytes
    gettrace() -- get the global debug tracing function
    setcheckinterval() -- control how often the interpreter checks for events
    setdlopenflags() -- set the flags to be used for dlopen() calls
    setprofile() -- set the global profiling function
    setrecursionlimit() -- set the max recursion depth for the interpreter
    settrace() -- set the global debug tracing function

FUNCTIONS
    __displayhook__ = displayhook(...)
        displayhook(object) -> None
        
        Print an object to sys.stdout and also save it in __builtin__._
    
    __excepthook__ = excepthook(...)
        excepthook(exctype, value, traceback) -> None
        
        Handle an exception by displaying it with a traceback on sys.stderr.
    
    call_tracing(...)
        call_tracing(func, args) -> object
        
        Call func(*args), while tracing is enabled.  The tracing state is
        saved, and restored afterwards.  This is intended to be called from
        a debugger from a checkpoint, to recursively debug some other code.
    
    callstats(...)
        callstats() -> tuple of integers
        
        Return a tuple of function call statistics, if CALL_PROFILE was defined
        when Python was built.  Otherwise, return None.
        
        When enabled, this function returns detailed, implementation-specific
        details about the number of function calls executed. The return value is
        a 11-tuple where the entries in the tuple are counts of:
        0. all function calls
        1. calls to PyFunction_Type objects
        2. PyFunction calls that do not create an argument tuple
        3. PyFunction calls that do not create an argument tuple
           and bypass PyEval_EvalCodeEx()
        4. PyMethod calls
        5. PyMethod calls on bound methods
        6. PyType calls
        7. PyCFunction calls
        8. generator calls
        9. All other calls
        10. Number of stack pops performed by call_function()
    
    exc_clear(...)
        exc_clear() -> None
        
        Clear global information on the current exception.  Subsequent calls to
        exc_info() will return (None,None,None) until another exception is raised
        in the current thread or the execution stack returns to a frame where
        another exception is being handled.
    
    exc_info(...)
        exc_info() -> (type, value, traceback)
        
        Return information about the most recent exception caught by an except
        clause in the current stack frame or in an older stack frame.
    
    exit(...)
        exit([status])
        
        Exit the interpreter by raising SystemExit(status).
        If the status is omitted or None, it defaults to zero (i.e., success).
        If the status is numeric, it will be used as the system exit status.
        If it is another kind of object, it will be printed and the system
        exit status will be one (i.e., failure).
    
    getcheckinterval(...)
        getcheckinterval() -> current check interval; see setcheckinterval().
    
    getdefaultencoding(...)
        getdefaultencoding() -> string
        
        Return the current default string encoding used by the Unicode 
        implementation.
    
    getfilesystemencoding(...)
        getfilesystemencoding() -> string
        
        Return the encoding used to convert Unicode filenames in
        operating system filenames.
    
    getprofile(...)
        getprofile()
        
        Return the profiling function set with sys.setprofile.
        See the profiler chapter in the library manual.
    
    getrecursionlimit(...)
        getrecursionlimit()
        
        Return the current value of the recursion limit, the maximum depth
        of the Python interpreter stack.  This limit prevents infinite
        recursion from causing an overflow of the C stack and crashing Python.
    
    getrefcount(...)
        getrefcount(object) -> integer
        
        Return the reference count of object.  The count returned is generally
        one higher than you might expect, because it includes the (temporary)
        reference as an argument to getrefcount().
    
    getsizeof(...)
        getsizeof(object, default) -> int
        
        Return the size of object in bytes.
    
    gettrace(...)
        gettrace()
        
        Return the global debug tracing function set with sys.settrace.
        See the debugger chapter in the library manual.
    
    getwindowsversion(...)
        getwindowsversion()
        
        Return information about the running version of Windows as a named tuple.
        The members are named: major, minor, build, platform, service_pack,
        service_pack_major, service_pack_minor, suite_mask, and product_type. For
        backward compatibility, only the first 5 items are available by indexing.
        All elements are numbers, except service_pack which is a string. Platform
        may be 0 for win32s, 1 for Windows 9x/ME, 2 for Windows NT/2000/XP/Vista/7,
        3 for Windows CE. Product_type may be 1 for a workstation, 2 for a domain
        controller, 3 for a server.
    
    setcheckinterval(...)
        setcheckinterval(n)
        
        Tell the Python interpreter to check for asynchronous events every
        n instructions.  This also affects how often thread switches occur.
    
    setprofile(...)
        setprofile(function)
        
        Set the profiling function.  It will be called on each function call
        and return.  See the profiler chapter in the library manual.
    
    setrecursionlimit(...)
        setrecursionlimit(n)
        
        Set the maximum depth of the Python interpreter stack to n.  This
        limit prevents infinite recursion from causing an overflow of the C
        stack and crashing Python.  The highest possible limit is platform-
        dependent.
    
    settrace(...)
        settrace(function)
        
        Set the global debug tracing function.  It will be called on each
        function call.  See the debugger chapter in the library manual.

DATA
    __stderr__ = <open file '<stderr>', mode 'w'>
    __stdin__ = <open file '<stdin>', mode 'r'>
    __stdout__ = <open file '<stdout>', mode 'w'>
    api_version = 1013
    argv = ['-c', '-f', r'C:\Users\kiss\.ipython\profile_default\security\...
    builtin_module_names = ('__builtin__', '__main__', '_ast', '_bisect', ...
    byteorder = 'little'
    copyright = 'Copyright (c) 2001-2013 Python Software Foundati...ematis...
    displayhook = <IPython.kernel.zmq.displayhook.ZMQShellDisplayHook obje...
    dllhandle = 503316480L
    dont_write_bytecode = False
    exc_value = TypeError("<module 'sys' (built-in)> is a built-in module"...
    exec_prefix = r'C:\Users\kiss\Anaconda'
    executable = r'C:\Users\kiss\Anaconda\python.exe'
    flags = sys.flags(debug=0, py3k_warning=0, division_warn...unicode=0, ...
    float_info = sys.float_info(max=1.7976931348623157e+308, max_...epsilo...
    float_repr_style = 'short'
    hexversion = 34014704
    last_value = NameError("name 'f' is not defined",)
    long_info = sys.long_info(bits_per_digit=30, sizeof_digit=4)
    maxint = 2147483647
    maxsize = 9223372036854775807L
    maxunicode = 65535
    meta_path = []
    modules = {'ConfigParser': <module 'ConfigParser' from 'C:\Users\kiss\...
    path = ['', r'C:\Users\kiss\Anaconda\python27.zip', r'C:\Users\kiss\An...
    path_hooks = [<type 'zipimport.zipimporter'>]
    path_importer_cache = {'': None, r'C:\Users\kiss\.ipython\extensions':...
    platform = 'win32'
    prefix = r'C:\Users\kiss\Anaconda'
    ps1 = 'In : '
    ps2 = '...: '
    ps3 = 'Out: '
    py3kwarning = False
    stderr = <IPython.kernel.zmq.iostream.OutStream object>
    stdin = <open file '<stdin>', mode 'r'>
    stdout = <IPython.kernel.zmq.iostream.OutStream object>
    subversion = ('CPython', '', '')
    version = '2.7.5 |Anaconda 1.8.0 (64-bit)| (default, Jul  1 2013, 12:3...
    version_info = sys.version_info(major=2, minor=7, micro=5, releaseleve...
    warnoptions = []
    winver = '2.7'
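
Of everything in this dump, only a few values bear directly on encodings. maxunicode = 65535 means this interpreter is a "narrow" Unicode build, where code points above U+FFFF are stored as surrogate pairs, and byteorder = 'little' is the byte order the 'utf-16' codec falls back on when none is named explicitly. Below is a minimal sketch that gathers these values in one place. The helper name describe_unicode_build is made up for illustration (it is not part of sys), and the values mentioned in the comments are what one would typically expect, not guaranteed output.

```python
import sys


def describe_unicode_build():
    """Collect the sys attributes that matter when debugging encodings."""
    return {
        # 65535 on a narrow build, 1114111 on wide builds and on Python 3.3+
        "maxunicode": sys.maxunicode,
        "narrow_build": sys.maxunicode == 0xFFFF,
        # 'little' or 'big'
        "byteorder": sys.byteorder,
        # 'ascii' on Python 2, 'utf-8' on Python 3
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # may be None when stdout is redirected or replaced (as in a Notebook)
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


info = describe_unicode_build()
for name in sorted(info):
    print("%s = %s" % (name, info[name]))
```

If stdout_encoding comes back as None or as something other than the console code page, that is one likely source of garbled Cyrillic when printing.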

3 comments:

Sergey Borisovich, February 3, 2014 at 15:28
I sank about five days into this post. At first I was proud of having read two hundred pages in two languages in just two days... But then it turned out that those pages still had to settle in my head, otherwise it ends up like the old joke: "the Chukchi is not a reader, the Chukchi is a writer."
Overall, writing the post helped. I did not simply take notes on the learning process: I first took poor notes and then rewrote everything as a clean copy..., but it still fell short of a good article...
A good article takes mastery, and I am still learning..., so what remains is a set of notes..., though with practical exercises.
I should probably write the notes as a clean copy from the start and open a second Notebook file for experiments.
Unknown, August 10, 2018 at 07:04
Nice blog, it is informative, thank you for sharing. Python Online Training
Unknown, February 18, 2019 at 20:51
Great post, thank you very much.

Monday, February 3, 2014

UTF-8 encoding, or how to make friends with Str and Unicode objects in Python 2.x
