[Pycon] [new paper] "Mirko Mazzoleni" - Sentiment analysis of emotions via word embeddings: from data collection & analysis to visualization

Sun 7 Jan 2018 13:51:38 CET

Title: Sentiment analysis of emotions via word embeddings: from data collection & analysis to visualization
Duration: 60 (includes Q&A)
Q&A Session: 15
Language: en
Type: Talk

Abstract: ## Audience level 

The audience level required for the talk is **beginner**. The talk focuses on the logical flow of the data analysis and on the intuitions behind the algorithms employed, so a detailed understanding of Python code is not required. Basic knowledge of machine learning, linear algebra and NoSQL databases such as MongoDB is a useful background for understanding the overall context.

## Abstract

This talk presents a novel unsupervised approach for the detection of emotional states (Anger, Disgust, Sadness, Happiness, Fear, Surprise) from textual data. The method, based on word embeddings (word2vec algorithm), has been tested on a collection of Twitter messages and on the SemEval 2007 news headlines dataset. After an introduction to the field of Natural Language Processing (NLP) and the importance of sentiment analysis, a simple Django webapp for crawling specific tweets is presented. The retrieved data are saved in a MongoDB database. The sentences are then processed with libraries such as *gensim*, *nltk*, *pyenchant* and *scikit-learn*, in order to assign to each sentence a percentage for each of the aforementioned emotions. Finally, after a review of the fundamental principles of data visualization, the obtained results are displayed in a web page using *node.js* and the libraries *d3.js*, *dc.js* and *crossfilter.js*. The presentation covers all the steps of the journey, from the research idea, to data collection, processing and algorithm development, to data visualization, showing the main tools used to implement the research ideas.
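
As a rough illustration of how a sentence could be assigned a percentage for each emotion from word embeddings, the following minimal sketch averages the word vectors of a sentence and compares the result with the vector of each emotion word by cosine similarity. This scoring rule is an assumption made for illustration and is not necessarily the method presented in the talk. The names `TOY_EMBEDDINGS`, `sentence_vector` and `emotion_percentages` are hypothetical, and the hand-made 6-dimensional vectors stand in for a trained word2vec model.

```python
import math

EMOTIONS = ["anger", "disgust", "sadness", "happiness", "fear", "surprise"]

# Hypothetical toy embeddings: each emotion word gets its own axis.
# A real experiment would look the vectors up in a trained word2vec model.
TOY_EMBEDDINGS = {
    emotion: [1.0 if i == j else 0.0 for j in range(len(EMOTIONS))]
    for i, emotion in enumerate(EMOTIONS)
}
TOY_EMBEDDINGS["furious"] = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0]
TOY_EMBEDDINGS["crying"] = [0.0, 0.0, 0.8, 0.0, 0.2, 0.0]


def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def emotion_percentages(tokens, embeddings, emotions=EMOTIONS):
    """Return a dict mapping each emotion to its share (0-100) for a sentence."""
    sentence = sentence_vector(tokens, embeddings)
    if sentence is None:
        return {emotion: 0.0 for emotion in emotions}
    # Assumption: negative similarities are clipped to zero so shares stay non-negative.
    sims = [max(cosine(sentence, embeddings[e]), 0.0) for e in emotions]
    total = sum(sims)
    if total == 0:
        return {emotion: 0.0 for emotion in emotions}
    return {e: 100.0 * s / total for e, s in zip(emotions, sims)}


# Tokens without an embedding ("zzz") are simply ignored.
scores = emotion_percentages(["furious", "zzz"], TOY_EMBEDDINGS)
```

With these toy vectors, `scores` puts most of the weight on `anger`. No labelled training data is involved, which is what makes an approach of this kind unsupervised.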

## Description

The talk is divided into 3 major parts, each of which shows snippets of code. However, because of the amount of code that makes up the whole application, it will not be possible to directly re-implement all of the presented material. The aim is therefore to give all the instruments needed to reproduce the analysis.

- **Introduction (20 minutes)**: in this part, we introduce the theoretical foundation of the talk. We review the motivation behind the developed experiment, describing the different purposes of Natural Language Processing (NLP) algorithms, with particular attention to the field of sentiment analysis and its standard approaches. We also describe the concepts of supervised and unsupervised machine learning, along with the word embedding representation of words in a text.

- **Application development (15 minutes)**: in this part, we describe the developed sentiment analysis workflow. The first section is devoted to the deployment of a simple Django application, whose aim is to collect tweets matching a specific keyword. We describe how the MongoDB database is used to store the collected tweets. The second section concerns the heart of the sentiment analysis algorithm: all the defined pre-processing steps (such as stopword removal, emoticon handling and stemming) are described. Finally, in the third section, once the cleaned sentences have been represented by word embeddings, the results are presented on a custom tweets dataset and on a benchmark one, based on the SemEval 2007 news headlines competition. The SemEval 2007 task on “Affective text” focused on the emotion classification of news headlines extracted from news web sites. Headlines are suitable for these experiments because they are typically intended to express emotions, in order to draw the readers’ attention.

- **Interactive data visualization (15 minutes)**: in the last part, the fundamental principles of data visualization are briefly reviewed. This section takes inspiration from the works of Tufte and Cleveland, and also gives references to more recent books and researchers on the topic. The results from the sentiment analysis of the tweets are then displayed on a web page with the use of JavaScript libraries.

- **Question and answer session (10 minutes)**: the last minutes are dedicated to questions from the audience and to preparation for the next talk.
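
The pre-processing steps named in the application development part (stopword removal, emoticon handling, stemming) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the code of the talk: the stopword set, the emoticon-to-word mapping and the `crude_stem` suffix stripper are hypothetical stand-ins for what libraries such as *nltk* provide.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "am", "i", "to", "of", "and", "so"}
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Turn a raw sentence into a list of cleaned, stemmed tokens."""
    tokens = []
    for raw in text.lower().split():
        # Assumption: emoticons are replaced by a word describing them.
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
            continue
        # Keep letters only, then drop empty tokens and stopwords.
        word = re.sub(r"[^a-z]", "", raw)
        if not word or word in STOPWORDS:
            continue
        tokens.append(crude_stem(word))
    return tokens


tokens = preprocess("The movie is amazing, loved the ending :)")
# tokens == ["movie", "amaz", "lov", "end", "happy"]
```

The resulting tokens are what would then be looked up in the word embedding model. Only emoticons separated from the surrounding words by whitespace are recognised in this sketch.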

Tags: machine-learning, sentiment-analysis, data-visualization, nlp, word_embedding