[Pycon] [new paper] "Parul Sethi" - Next generation of word embeddings in gensim

Mon 25 Dec 2017 18:43:02 CET

Title: Next generation of word embeddings in gensim
Duration: 45 (includes Q&A)
Q&A Session: 0
Language: en
Type: Talk

Abstract: Python has many natural language processing tools. In particular, anyone who wants to implement a recommender or a document classifier faces the problem of choosing among the many open source word embeddings available. I will highlight the differences between three popular word embeddings (Word2Vec, FastText and WordRank) and show how the choice of embedding can directly affect downstream NLP tasks, especially those related to similarity. I'll also discuss how to deal with the common issues of rare, frequent and out-of-vocabulary words.
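
To make the out-of-vocabulary point concrete, here is a minimal sketch of the character n-gram idea that FastText-style models rely on: a word is wrapped in boundary markers and split into overlapping substrings, so an unseen word can still share pieces with known words. The function names and the 3 to 6 character range are assumptions chosen for illustration; this is not gensim's implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b, min_n=3, max_n=6):
    """Jaccard overlap between the n-gram sets of two words (hypothetical helper)."""
    ga = set(char_ngrams(a, min_n, max_n))
    gb = set(char_ngrams(b, min_n, max_n))
    union = ga | gb
    if not union:
        return 0.0
    return len(ga & gb) / len(union)


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))
    # A word missing from the vocabulary still overlaps with a related known word.
    print(ngram_overlap("embedding", "embeddings"))
    print(ngram_overlap("embedding", "tensor"))
```

A subword model can compose a vector for an unseen word from the vectors of its n-grams, which is why a pure word-level model such as Word2Vec has no answer for the same word.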

Visualizations are also a crucial part of data analysis, helping to reveal the structure and underlying patterns held within the data, so I'll cover visualizing word embeddings using TensorBoard and gensim.

Outline:

1. What are word embeddings and why are they useful
2. Examples of some popular word embeddings
3. Why you need to choose carefully between those different embeddings, with examples of their different similarity results
5. Benchmark performance overview: how the different embeddings perform on word similarity and analogy data
6. Visualizations: PCA, t-SNE (using TensorBoard)
7. Relation between word frequency and embedding performance
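
The similarity comparisons in the outline come down to ranking words by the cosine of the angle between their vectors. The sketch below shows that ranking on a tiny, made-up set of 3-dimensional vectors; the numbers are invented for illustration and do not come from any trained model, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for this example only.
toy_vectors = {
    "king": [0.90, 0.10, 0.40],
    "queen": [0.85, 0.20, 0.45],
    "apple": [0.10, 0.90, 0.20],
}

if __name__ == "__main__":
    for other, score in most_similar("king", toy_vectors):
        print(other, round(score, 3))
```

Two embeddings trained on the same corpus can produce different rankings for the same query word, which is what makes the choice between them matter for similarity tasks.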

Tags: data-visualization, Machine Learning, nlp