Data matching and deduplication with Python
Teacher: Simone Marzola — Oval Money
In the era of multi-tiered big-data infrastructures, data is commonly spread across multiple data sources, and duplicates are everywhere. As a data scientist you'll need to consolidate that data to improve its quality and build comprehensive data assets, through a process called data deduplication.
This session aims to show how data analysis tools for Python, like Pandas and NumPy, can be used to solve the deduplication problem in very large datasets. The proposed method includes data preprocessing and cleaning, indexing, comparison and classification.
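As a rough illustration of those four steps, here is a minimal sketch on a toy table. The column names, the blocking key (city), the use of difflib for string similarity and the 0.85 threshold are all assumptions made for the example, not the method taught in the session.

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical toy records; the schema is an assumption for illustration.
records = pd.DataFrame(
    {
        "name": ["Mario Rossi", "mario  rossi ", "Anna Bianchi", "Ana Bianchi", "Luca Verdi"],
        "city": ["Torino", "torino", "Milano", "Milano", "Roma"],
    }
)

# 1. Preprocessing and cleaning: lowercase and collapse whitespace.
clean = records.assign(
    name=records["name"].str.lower().str.split().str.join(" "),
    city=records["city"].str.lower().str.strip(),
)
clean["id"] = clean.index

# 2. Indexing (blocking): only pair up records that share a city,
#    which avoids comparing every record with every other record.
pairs = clean.merge(clean, on="city", suffixes=("_a", "_b"))
pairs = pairs[pairs["id_a"] < pairs["id_b"]].copy()

# 3. Comparison: a string similarity score in [0, 1] for each candidate pair.
pairs["name_sim"] = [
    SequenceMatcher(None, a, b).ratio()
    for a, b in zip(pairs["name_a"], pairs["name_b"])
]

# 4. Classification: a simple threshold rule (the value is an arbitrary assumption).
THRESHOLD = 0.85
pairs["is_duplicate"] = pairs["name_sim"] >= THRESHOLD
duplicates = pairs.loc[pairs["is_duplicate"], ["id_a", "id_b"]]
print(duplicates)
```

Blocking on an exact key keeps the number of candidate pairs manageable on large tables, at the cost of missing duplicates whose key differs. A real pipeline would likely use a more forgiving key and a trained classifier instead of a fixed threshold.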
We will use an anonymized subset of Oval Money user transactions to match duplicates and detect recurring transactions.
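One possible way to flag recurring transactions with Pandas is sketched below. The schema (user_id, merchant, amount, date), the sample rows and the tolerance values are hypothetical and do not come from the Oval Money dataset; the rule used here (at least three payments to the same merchant, with near-constant gaps and amounts) is only one simple heuristic.

```python
import pandas as pd

# Hypothetical transactions; the schema and values are assumptions for illustration.
tx = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 1, 1],
        "merchant": ["gym", "gym", "gym", "cafe", "cafe", "cafe"],
        "amount": [30.0, 30.0, 30.0, 2.5, 4.0, 3.0],
        "date": pd.to_datetime(
            ["2023-01-05", "2023-02-05", "2023-03-05",
             "2023-01-07", "2023-01-09", "2023-02-20"]
        ),
    }
)

# Days elapsed since the previous transaction of the same user at the same merchant.
tx = tx.sort_values(["user_id", "merchant", "date"])
tx["gap_days"] = tx.groupby(["user_id", "merchant"])["date"].diff().dt.days

# Summarise each (user, merchant) series.
summary = tx.groupby(["user_id", "merchant"]).agg(
    n=("date", "size"),
    gap_min=("gap_days", "min"),
    gap_max=("gap_days", "max"),
    amount_min=("amount", "min"),
    amount_max=("amount", "max"),
)

# Heuristic rule; all tolerances are arbitrary assumptions.
MIN_OCCURRENCES = 3
MAX_GAP_SPREAD_DAYS = 5
MAX_AMOUNT_SPREAD = 1.0
summary["recurring"] = (
    (summary["n"] >= MIN_OCCURRENCES)
    & ((summary["gap_max"] - summary["gap_min"]) <= MAX_GAP_SPREAD_DAYS)
    & ((summary["amount_max"] - summary["amount_min"]) <= MAX_AMOUNT_SPREAD)
)
print(summary[["n", "recurring"]])
```

The gap tolerance absorbs the difference between 28-day and 31-day months; weekly or yearly patterns would need different tolerances.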
Duration: Half day.
Prerequisites: Python, Pandas, NumPy.
#bigdata #deduplication #classification #finance