[Pycon] [new paper] "Pacham Pacham Sri Srinivasan" - Association Rules Mining Using Python Generators and Pandas to Handle Large Datasets
info a pycon.it
Thu 17 Jan 2019 03:07:27 CET
Title: Association Rules Mining Using Python Generators and Pandas to Handle Large Datasets
Duration: 45 (includes Q&A)
Q&A Session: 15
Language: en
Type: Talk
Abstract: Association rule mining with the Apriori algorithm is a standard approach to deriving association rules. Basic pandas implementations of the algorithm, which split the data into multiple subsets, are not suitable for large datasets because they use too much RAM, and the algorithm fails to execute. Python generators, however, make it possible to process one value at a time, discard it when finished, and move on to the next value. This makes generators well suited to creating item pairs, counting how often they co-occur, and determining the association rules.
A generator is a special type of function that returns an iterator producing its items lazily, unlike a regular function, which returns all of its values at once (e.g. every element of a list). A generator yields one value at a time. To get the next value in the sequence, we must ask for it, either explicitly by passing the generator to the built-in next() function or implicitly via a for loop. This is a great property of generators because it means that we don't have to store all of the values in memory at once.
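As a small illustration of the two ways of asking for values described above, here is a minimal sketch. The function name count_up_to is hypothetical and is not taken from the talk's source code.

```python
def count_up_to(limit):
    """Yield the integers 1..limit one at a time instead of building a list."""
    n = 1
    while n <= limit:
        yield n          # execution pauses here until the next value is requested
        n += 1

gen = count_up_to(3)

# Explicit request: next() resumes the generator and returns one value.
first = next(gen)

# Implicit requests: iterating (here via list) consumes the remaining values.
rest = list(gen)

print(first, rest)
```

Only one value exists at a time while the generator runs; the list built at the end is there purely to show what was left to consume.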
This efficient implementation is tested on a Market Basket Analysis dataset for various minimum support thresholds.
The designed implementation with pandas can handle large datasets at lower minimum support thresholds, with reduced execution time.
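The pair-generation and counting step mentioned in the abstract could look roughly like the sketch below. It is an illustration only, not the author's implementation: it uses the standard library rather than pandas, the names item_pairs and pair_support and the sample baskets are made up, and support is assumed to mean the fraction of baskets containing both items.

```python
from collections import Counter
from itertools import combinations

def item_pairs(baskets):
    """Yield every unordered item pair of every basket, one pair at a time."""
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives each pair a stable order
        for pair in combinations(sorted(set(basket)), 2):
            yield pair

def pair_support(baskets, min_support):
    """Return {pair: support} for pairs whose support is >= min_support."""
    n_baskets = 0

    def counted():
        # Count baskets while streaming them, so the input can itself be a generator.
        nonlocal n_baskets
        for basket in baskets:
            n_baskets += 1
            yield basket

    # Counter consumes the pairs lazily; only the per-pair counts are kept in memory.
    counts = Counter(item_pairs(counted()))
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }

# Made-up example data (hypothetical, not the dataset used in the talk).
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs"],
    ["milk", "bread"],
]
result = pair_support(baskets, min_support=0.5)
print(result)
```

The list of all pairs is never materialised; memory use grows with the number of distinct pairs rather than with the total number of pair occurrences.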
Source code: https://github.com/srivignessh/Association-Rules-Mining
Tags: data-structures, information-retrieval, statistics, mathematical-modelling, optimization, marketing, Algorithms