[Pycon] [new paper] "Leif Uwe Vogelsang" - Lessons learned while analysing 30 million news articles

Sun 7 Jan 2018 17:14:27 CET

Title: Lessons learned while analysing 30 million news articles
Duration: 45 (includes Q&A)
Q&A Session: 15
Language: en
Type: Talk

Abstract: Many NLP tutorials online and in books are too shallow and high-level to give you the information you need to go from running small examples to doing real work. In this talk I will show how you can analyse a fairly large dataset of 30 million news articles using a single computer or server and open source tools like spaCy and Pandas.

This talk will show you how to tackle the analysis of a fairly large text dataset. The talk centers around a set of lessons learned from analysing 30 million news articles:

Have a clear understanding of the questions you want the data to answer. The questions should guide the analysis and must be clear and well defined, or the work will suffer.

Hold off on coding and investigate your data first. Can it answer the questions, or do you need more or different types of data?

Start small when building your workflow: use a sample of your dataset. The sample should be representative of the contents and shape of the complete dataset.
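
One way to draw such a sample without loading the whole dataset is reservoir sampling, which keeps a fixed-size uniform random sample while streaming over the data once. The sketch below is only an illustration under assumptions: the function name `reservoir_sample` and the idea of streaming articles one at a time are not from the talk, and a uniform sample is representative only in the statistical sense (rare categories may still need stratified sampling).

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are held in memory, so this
    also works when the dataset does not fit in RAM (hypothetical helper).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Illustrative usage on stand-in data; a real run might iterate over
# the lines of an article file instead.
article_ids = range(1_000_000)
small_sample = reservoir_sample(article_ids, k=1000)
```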

Change perspectives and dig deeper. How do the answers change when the perspective changes from yearly to monthly or daily?
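
As a rough illustration of changing perspective, the same list of publication dates can be counted at yearly, monthly, or daily granularity. This is a minimal sketch using only the standard library; the names `FORMATS` and `count_by_period` and the sample dates are made up for the example, and with Pandas a similar result could be obtained by grouping on a date column.

```python
from collections import Counter
from datetime import date

# Map each granularity to the strftime pattern used as the grouping key.
FORMATS = {"yearly": "%Y", "monthly": "%Y-%m", "daily": "%Y-%m-%d"}


def count_by_period(dates, granularity):
    """Count how many dates fall into each period of the given granularity."""
    fmt = FORMATS[granularity]
    return Counter(d.strftime(fmt) for d in dates)


# Invented publication dates, for illustration only.
published = [
    date(2017, 1, 3),
    date(2017, 1, 3),
    date(2017, 2, 10),
    date(2018, 1, 1),
]

for granularity in ("yearly", "monthly", "daily"):
    print(granularity, dict(count_by_period(published, granularity)))
```

A pattern that looks flat per year can show clear spikes per month or per day, which is why it helps to rerun the same count at several granularities.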

When the dataset is large enough, all operations will be slow. Profile your code, remove bottlenecks, and use multiprocessing whenever possible.
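
A minimal sketch of that workflow, using only the standard library: profile a sequential run with `cProfile` to find the slow step, then spread that step over worker processes with `multiprocessing.Pool`. The function names and the token-counting step are placeholders chosen for the example, not the pipeline used in the talk.

```python
import cProfile
import io
import pstats
from collections import Counter
from multiprocessing import Pool


def count_tokens(text):
    """Stand-in for an expensive per-article step: count lowercase tokens."""
    return Counter(text.lower().split())


def count_sequential(texts):
    """Baseline: process every article in a single process."""
    total = Counter()
    for text in texts:
        total.update(count_tokens(text))
    return total


def profile_report(func, *args, top=5):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def count_parallel(texts, workers=2, chunksize=1000):
    """Same result as count_sequential, computed across worker processes."""
    total = Counter()
    with Pool(processes=workers) as pool:
        # Chunking reduces inter-process overhead for many small items.
        for partial in pool.imap_unordered(count_tokens, texts, chunksize):
            total.update(partial)
    return total
```

When run as a script, `count_parallel` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Multiprocessing pays off only when the per-item work outweighs the cost of sending data between processes, so it is worth profiling again after the change.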

Tags: Text-Mining, datamining, data, nlp, text-analysis