[Python] A very easy question about a unicode manipulation case

Thu 29 Jan 2015 08:09:40 CET

> On 28 Jan 2015, at 18:15, Marco Ippolito <ippolito.marco at gmail.com> wrote:
> 
> Hi everyone,
> 
> I have put some possible symbol substitutions, to be applied to a
> text, in a JSON file:
>    "to_substitute":{
>        "“": "'",
>        "”": "'",
>        "—": "-",
>        "’": "'",
>        "è": "e'",
>        "é": "e'"
>    }
> 
> Doing it like this:
> self.extracted_text_u = self.extracted_text_u.replace("“",
> self.textManipulation["to_substitute"][unicode("“", "utf-8")])
> self.extracted_text_u = self.extracted_text_u.replace("”",
> self.textManipulation["to_substitute"][unicode("”", "utf-8")])
> self.extracted_text_u = self.extracted_text_u.replace("—",
> self.textManipulation["to_substitute"][unicode("—", "utf-8")])
> self.extracted_text_u = self.extracted_text_u.replace("’",
> self.textManipulation["to_substitute"][unicode("’", "utf-8")])
> self.extracted_text_u = self.extracted_text_u.replace("è",
> self.textManipulation["to_substitute"][unicode("è", "utf-8")])
> self.extracted_text_u = self.extracted_text_u.replace("é",
> self.textManipulation["to_substitute"][unicode("é", "utf-8")])
> 
> I replace some symbols in a text with the corresponding value
> from the JSON file.
> 
> This solution works, but it is not great,
> so I loaded the symbol/replacement pairs into a dictionary:
> self.to_substitute_dictionary = self.textManipulation["to_substitute"]
> {u'\xe9': "e'", u'\xe8': "e'", u'\u2014': '-', u'\u2019': "'",
> u'\u201d': "'", u'\u201c': "'"}
> 
> for k, v in self.to_substitute_dictionary.iteritems():
>        print k, v
> é e'
> è e'
> — -
> ’ '
> ” '
> “ '
> 
> But when I do this instead:
> for k, v in self.to_substitute_dictionary.iteritems():
>      self.extracted_text_u = self.extracted_text_u.replace(k, v)
> 
>  File "extract_sentences.py", line 56, in sentences_extraction_meth
>    self.extracted_text_u = self.extracted_text_u.replace(k, v)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)
> 
> And if I do:
> for k, v in self.to_substitute_dictionary.iteritems():
>     v_u = unicode(v, "utf-8")
>     self.extracted_text_u = self.extracted_text_u.replace(k, v_u)
> 
>   self.extracted_text_u = self.extracted_text_u.replace(k, v_u)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16: ordinal not in range(128)
> 
>>>> for k, v in to_substitute_dictionary.iteritems():
> ...     v_u = unicode(v, "utf-8")
> ...     print "(type(k), type(v_u))= ", (type(k), type(v_u))
> ...
> (type(k), type(v_u))=  (<type 'unicode'>, <type 'unicode'>)
> (type(k), type(v_u))=  (<type 'unicode'>, <type 'unicode'>)
> (type(k), type(v_u))=  (<type 'unicode'>, <type 'unicode'>)
> 
> It must be something silly... any advice?
> 
> Marco
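
A note on the traceback in the question: 0xc3 is the first byte of the UTF-8 encoding of characters such as "è" and "é", so the error is consistent with extracted_text_u holding a UTF-8 byte string rather than a unicode object, despite its name. In Python 2, calling .replace() on a byte string with a unicode argument makes the interpreter decode the byte string with the ascii codec first, which would also explain why converting v to unicode changes nothing. This is an assumption, since the thread does not show how the text is read. A minimal sketch of that failure and of the usual remedy (decode once, where the text enters the program), written for Python 3, where the decoding step has to be spelled out; raw, text and fixed are names made up for illustration:

```python
# Hypothetical input: UTF-8 bytes, e.g. read from a file opened in binary mode.
raw = "non so perché".encode("utf-8")

# Python 2 runs this ASCII decode implicitly when a byte string meets a
# unicode object; in Python 3 it has to be written out to see the failure.
error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
print(error)

# Decode once with the real encoding, then work on text only.
text = raw.decode("utf-8")
fixed = text.replace("é", "e'")
print(fixed)
```

In Python 2 the equivalent of the last step would be something like unicode(raw, "utf-8") applied to the text before any replacement is attempted.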

Here is a quick solution, off the top of my head:

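# -*- coding: UTF-8 -*-
import re

# Map each typographic or accented character to its ASCII replacement.
chardict = {u"“": u"'",
            u"”": u"'",
            u"—": u"-",
            u"’": u"'",
            u"è": u"e'",
            u"é": u"e'"}

# Character class matching any key. No "|" separator here: inside [...]
# it would be matched literally and the lookup below would raise KeyError.
SUBS = re.compile(u"[%s]" % u"".join(re.escape(k) for k in chardict))

s = u"Mario disse “non so perché non so chi è” "

# Replace every matched character with its entry in chardict.
print(SUBS.sub(lambda m: chardict[m.group(0)], s))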

Output: Mario disse 'non so perche' non so chi e''
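
The same idea can be applied to the table kept in the JSON file from the original question. What follows is only a sketch under a few assumptions: Python 3 (where json.loads returns text strings, so no unicode() calls are needed), an inline JSON string standing in for the real file, and the hypothetical names CONFIG, build_substituter and substitute. It uses alternation rather than a character class, so keys longer than one character would also work; re.escape protects against keys that are regex metacharacters.

```python
import json
import re

# Stand-in for the contents of the JSON file described in the question.
CONFIG = json.loads("""
{
    "to_substitute": {
        "“": "'",
        "”": "'",
        "—": "-",
        "’": "'",
        "è": "e'",
        "é": "e'"
    }
}
""")


def build_substituter(table):
    """Return a function that applies every replacement in `table` in one pass."""
    if not table:
        # Nothing to replace: an empty pattern would match everywhere.
        return lambda text: text
    # Longest keys first, so a multi-character key is not shadowed by a prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


substitute = build_substituter(CONFIG["to_substitute"])
print(substitute("Mario disse “non so perché non so chi è” "))
```

Because the text is scanned once, the result does not depend on the dictionary's iteration order, and replacement text is never scanned again for further matches.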

Hope this helps.

Bye

G