[Pycon] [new paper] "Albertus Kelvin" - Creating Part Of Speech Tagger with Hidden Markov Model and Viterbi Algorithm

info a pycon.it
Wed 3 Jan 2018 18:14:35 CET


Title: Creating Part Of Speech Tagger with Hidden Markov Model and Viterbi Algorithm
Duration: 45 minutes (includes Q&A)
Q&A Session: 15 minutes
Language: en
Type: Talk

Abstract: Each word in a sentence belongs to a word class. In Natural Language Processing (NLP), a word's class is also known as its Part Of Speech (POS). Examples of word classes are nouns, verbs, adverbs, and adjectives. The word class denotes the role a word plays in a sentence, and the sequence of classes builds the sentence's structure. For instance, many simple sentences follow a general noun-verb-noun sequence.

This talk discusses an approach for predicting the most likely sequence of parts of speech for a given sentence (a sequence of words). In probabilistic terms, this can be written as P(Y | X), where Y is the sequence of parts of speech and X is the sequence of words. Since probability and statistics are part of this topic, any knowledge of machine learning and statistics will be useful.
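
As a rough sketch of this formulation (the helper name and the toy probability tables below are my own illustration, not material from the talk), a first-order HMM relates P(Y | X) to a product of per-step transition and emission probabilities, so a candidate tag sequence can be scored like this:

    # Joint probability P(X, Y) under a first-order HMM: a product of
    # transition P(tag | previous tag) and emission P(word | tag) terms.
    # Toy numbers, for illustration only.
    transition = {("<s>", "N"): 0.6, ("N", "V"): 0.5, ("V", "N"): 0.4}
    emission = {("she", "N"): 0.01, ("speaks", "V"): 0.02,
                ("language", "N"): 0.005}

    def score(words, tags):
        """P(X, Y): probability of a word sequence paired with a tag sequence."""
        prob = 1.0
        prev = "<s>"  # sentence-start marker
        for word, tag in zip(words, tags):
            prob *= transition.get((prev, tag), 0.0) * emission.get((word, tag), 0.0)
            prev = tag
        return prob

    print(score(["she", "speaks", "language"], ["N", "V", "N"]))  # 1.2e-07

The tag sequence maximizing P(Y | X) for a fixed sentence X is the same one that maximizes this joint probability, which is what the Viterbi algorithm described below searches for.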

The model is built by implementing a Hidden Markov Model (HMM) and the Viterbi algorithm. The building process estimates the transition probability and the emission probability from every word found in the training data. The transition probability denotes the probability of one part of speech occurring after another: for instance, P(V | N) is the probability of the Verb class occurring after the Noun class. The emission probability denotes the probability of a word occurring given a certain part of speech: for instance, the emission probability P(language | N) is the probability of the word 'language' occurring with Noun as its part of speech.
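
A minimal sketch of how both tables can be estimated by relative-frequency counting over a tagged corpus (the toy corpus and the function names are assumptions for illustration; the talk's actual training data and code will differ):

    from collections import Counter

    # Toy tagged corpus of (word, tag) pairs; illustrative only.
    tagged_sentences = [
        [("she", "N"), ("speaks", "V"), ("a", "DET"), ("language", "N")],
        [("language", "N"), ("evolves", "V")],
    ]

    transition_counts = Counter()  # (prev_tag, tag) bigrams
    emission_counts = Counter()    # (word, tag) pairs
    context_counts = Counter()     # how often each tag appears as the previous tag
    tag_counts = Counter()         # how often each tag appears at all

    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            transition_counts[(prev, tag)] += 1
            context_counts[prev] += 1
            emission_counts[(word, tag)] += 1
            tag_counts[tag] += 1
            prev = tag

    def transition_prob(prev, tag):
        """P(tag | prev), e.g. P(V | N): how often a verb follows a noun."""
        return transition_counts[(prev, tag)] / context_counts[prev]

    def emission_prob(word, tag):
        """P(word | tag), e.g. P('language' | N)."""
        return emission_counts[(word, tag)] / tag_counts[tag]

    print(transition_prob("N", "V"))       # P(V | N)  -> 1.0 on this toy corpus
    print(emission_prob("language", "N"))  # P(language | N) -> 0.666...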

To apply the built model to a new sentence, the Viterbi algorithm is used. This algorithm searches for the best sequence of parts of speech and consists of two steps, a forward step and a backward step. The forward step computes, for each word position, the probability of the best partial path ending in each part of speech, whereas the backward step walks back through that information to regain the best path.
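
A minimal Viterbi sketch along these lines, reusing the transition_prob and emission_prob helpers assumed above (smoothing for unseen words and tags is omitted): the forward step records, per position and tag, the best partial-path probability together with a backpointer, and the backward step follows the backpointers to recover the tag sequence.

    def viterbi(words, tags, transition_prob, emission_prob):
        # Forward step: best[i][tag] = probability of the most likely
        # path over words[0..i] that ends with this tag.
        best = [{}]
        backptr = [{}]
        for tag in tags:
            best[0][tag] = transition_prob("<s>", tag) * emission_prob(words[0], tag)
            backptr[0][tag] = None
        for i in range(1, len(words)):
            best.append({})
            backptr.append({})
            for tag in tags:
                # Best previous tag to transition from.
                prob, prev = max(
                    (best[i - 1][p] * transition_prob(p, tag), p) for p in tags
                )
                best[i][tag] = prob * emission_prob(words[i], tag)
                backptr[i][tag] = prev
        # Backward step: pick the best final tag, then follow backpointers.
        last = max(best[-1], key=best[-1].get)
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(backptr[i][path[-1]])
        return list(reversed(path))

    print(viterbi(["she", "speaks"], ["N", "V", "DET"],
                  transition_prob, emission_prob))  # -> ['N', 'V']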

Tags: Machine Learning, Data Mining, Statistical Learning, nlp, computer-science, text-analysis, bigdata, data-science, Text-Mining, computational-linguistics, Artificial Intelligence


More information about the Pycon mailing list