[Python] A very easy question about a unicode manipulation case

Wed 28 Jan 2015 18:15:58 CET

Hi everyone,

I put some possible symbol substitutions, to be applied within a text,
into a JSON file:
    "to_substitute":{
        "“": "'",
        "”": "'",
        "—": "-",
        "’": "'",
        "è": "e'",
        "é": "e'"
    }

By doing this:
self.extracted_text_u = self.extracted_text_u.replace("“",
    self.textManipulation["to_substitute"][unicode("“", "utf-8")])
self.extracted_text_u = self.extracted_text_u.replace("”",
    self.textManipulation["to_substitute"][unicode("”", "utf-8")])
self.extracted_text_u = self.extracted_text_u.replace("—",
    self.textManipulation["to_substitute"][unicode("—", "utf-8")])
self.extracted_text_u = self.extracted_text_u.replace("’",
    self.textManipulation["to_substitute"][unicode("’", "utf-8")])
self.extracted_text_u = self.extracted_text_u.replace("è",
    self.textManipulation["to_substitute"][unicode("è", "utf-8")])
self.extracted_text_u = self.extracted_text_u.replace("é",
    self.textManipulation["to_substitute"][unicode("é", "utf-8")])

I replace some symbols within a text with the corresponding
replacement stored in the JSON file.

This solution works, but it is not great,
so I loaded the symbol/replacement pairs into a dictionary:
self.to_substitute_dictionary = self.textManipulation["to_substitute"]
{u'\xe9': "e'", u'\xe8': "e'", u'\u2014': '-', u'\u2019': "'",
u'\u201d': "'", u'\u201c': "'"}

for k, v in self.to_substitute_dictionary.iteritems():
        print k, v
é e'
è e'
— -
’ '
” '
“ '
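
For reference, a minimal sketch of the dictionary-driven replacement in
isolation. It assumes the text is already a unicode string and that the
mapping comes straight from json.loads; SUBSTITUTIONS_JSON,
load_substitutions and substitute are hypothetical names used only for
illustration, not taken from the original script.

```python
import json

# Hypothetical stand-in for the JSON file; \u escapes keep the source ASCII-only.
SUBSTITUTIONS_JSON = r'''
{
    "to_substitute": {
        "\u201c": "'",
        "\u201d": "'",
        "\u2014": "-",
        "\u2019": "'",
        "\u00e8": "e'",
        "\u00e9": "e'"
    }
}
'''


def load_substitutions(raw_json):
    """Parse the JSON text and return the symbol -> replacement mapping."""
    return json.loads(raw_json)["to_substitute"]


def substitute(text, substitutions):
    """Apply every symbol -> replacement pair to a unicode text."""
    for symbol, replacement in substitutions.items():
        text = text.replace(symbol, replacement)
    return text


# Sample text built from the symbols in the mapping above.
sample = u"\u201cperch\u00e9\u201d \u2014 cio\u00e8"
print(substitute(sample, load_substitutions(SUBSTITUTIONS_JSON)))
```

With a unicode text, unicode keys and unicode values, no implicit
encoding or decoding should be needed inside the loop.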

But when I do:
for k, v in self.to_substitute_dictionary.iteritems():
      self.extracted_text_u = self.extracted_text_u.replace(k, v)

  File "extract_sentences.py", line 56, in sentences_extraction_meth
    self.extracted_text_u = self.extracted_text_u.replace(k, v)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

And if I do:
for k, v in self.to_substitute_dictionary.iteritems():
     v_u = unicode(v, "utf-8")
     self.extracted_text_u = self.extracted_text_u.replace(k, v_u)

   self.extracted_text_u = self.extracted_text_u.replace(k, v_u)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)

>>> for k, v in to_substitute_dictionary.iteritems():
...     v_u = unicode(v, "utf-8")
...     print "(type(k), type(v_u))= ", (type(k), type(v_u))
...
(type(k), type(v_u))=  (<type 'unicode'>, <type 'unicode'>)
(type(k), type(v_u))=  (<type 'unicode'>, <type 'unicode'>)
(type(k), type(v_u))=  (<type 'unicode'>, <type 'unicode'>)
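
The check above covers k and v_u but not the text being modified. In
Python 2, mixing a byte string with unicode arguments in replace()
triggers an implicit ASCII decode, and byte 0xc3 is a typical lead byte
of UTF-8 encoded accented letters. A minimal sketch under the
assumption (not verified here) that the text is a UTF-8 byte string
despite its _u suffix; to_text and apply_substitutions are hypothetical
helper names.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding bytes if needed."""
    if isinstance(value, bytes):  # in Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value


def apply_substitutions(text, substitutions, encoding="utf-8"):
    """Decode everything explicitly, then apply each replacement pair."""
    text = to_text(text, encoding)
    for old, new in substitutions.items():
        text = text.replace(to_text(old, encoding), to_text(new, encoding))
    return text


# Assumed input: UTF-8 encoded bytes, as if read from a file without decoding.
raw = b"caf\xc3\xa9 \xe2\x80\x9cok\xe2\x80\x9d"
pairs = {u"\xe9": "e'", u"\u201c": "'", u"\u201d": "'"}
print(apply_substitutions(raw, pairs))
```

Printing type(self.extracted_text_u) before the loop would confirm or
rule out this assumption.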

It must be something silly... any advice?

Marco