[Python] Help with a RegEx

Sat 1 Mar 2014 09:04:55 CET

Hi everyone,

I am using NLTK to try to correctly split '5,300 full-time employees'
into: '5,300', 'full-time', 'employees'

As you can see below, I have tried several solutions, but the result for
the number '5,300' is always wrong. It gets split into '5', ',', '300', so I get:
'5', ',', '300', 'full-time', 'employees' instead of
'5,300', 'full-time', 'employees'

from nltk.tokenize import RegexpTokenizer
pattern = r'''(?x)              # set flag to allow verbose regexps
    ([A-Z]\.)+                  # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*                # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?          # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                    # ellipsis
    | [][.,;"'?():-_']          # these are separate tokens
    | \d+([\d,]?\d)*(\.\d+)?    # number,number
    # further attempts at matching a number with a thousands comma
    | \d+(\,\d+)
    | \[0-9]+\,\[0-9]+
    | /[1-9](?:\d{0,2})(?:,\d{3})*(?:\.\d*[1-9])?|0?\.\d*[1-9]|0/
'''
tokenizer = RegexpTokenizer(pattern)
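
To make the problem easier to reproduce without NLTK, here is a minimal sketch using only the standard `re` module. It assumes that the tokenizer essentially applies `re.findall` with the pattern (I have not checked this for every NLTK version), and it keeps only a reduced form of the word and punctuation alternatives, rewritten with non-capturing groups so that `findall` returns whole matches. The names `reduced_pattern` and `text` are just for this example.

```python
import re

# Reduced form of the pattern above (an assumption made for illustration):
# only the word alternative and a simplified punctuation alternative.
reduced_pattern = r'''(?x)
      \w+(?:-\w+)*        # words with optional internal hyphens
    | [.,;"'?():_-]       # punctuation as separate tokens
'''

text = '5,300 full-time employees'

# \w also matches digits, so the word alternative consumes '5' on its own,
# then ',' is matched as punctuation, then '300' as another "word".
tokens = re.findall(reduced_pattern, text)
print(tokens)
```

With this reduced pattern the number is already split in the same wrong way, so the number alternatives further down in my full pattern never seem to get a chance to match.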

Do you have any advice on how to get '5', ',', '300' joined
into '5,300'?

Thank you.
Marco