A project by C. Barbagianni, S. Basso, A. Corradi, A. Klimont and E. Osti
About the dataset
The project was based on a dataset covering:
/ 6 months of transactions (May-October 2014) in 4 stores;
/ Item quantities;
/ The assortment catalogue, with prices and categories.
The dataset was 2 GB in size and mixed several data types (strings, dates and floats).
Challenge provided by the Data Sponsor
The Data Sponsor is moving towards a sales and marketing strategy centred on the end customer. This change requires better knowledge of customer habits and a clear differentiation among the stores: a city store probably has different sales trends from a store in the countryside. The Data Sponsor is already aware of these differences but is interested in a data-driven, evidence-based approach and in a scientific model that confirms them.
Execution by Divers
The Divers' work can be split into two stages:
The analysis (based on standard deviation and confidence intervals, predictive models and a network approach to pattern analysis) investigates whether any features make it possible to distinguish the stores and classify them into four groups.
In the dataviz dashboard, the sales of the 4 stores are displayed by product family. Each circle is a family, and each cluster of circles represents a department.
The data representation can be switched between “values” and “differences”. In the “values” view, the circle size is proportional to the normalized quantities sold, and the best-sellers in each store are highlighted. In the “differences” view, the difference between the sales in one store and the overall average is shown: blue circles mean more than the average, red ones less.
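The two dashboard views can be illustrated with a minimal Python sketch. It is not the team's code: the store and family names, the quantities, and the choice of normalization (each family's share of its store's total quantity) are assumptions made for illustration, since the page does not say how quantities were normalized.

```python
from collections import defaultdict

# Hypothetical toy records: (store, product_family, quantity_sold).
sales = [
    ("store_A", "pasta", 120.0),
    ("store_A", "wine", 30.0),
    ("store_B", "pasta", 80.0),
    ("store_B", "wine", 70.0),
]

def normalized_shares(records):
    """'Values' view: each family's share of its store's total quantity.

    Assumption: normalization means dividing by the store total.
    """
    totals = defaultdict(float)
    by_store = defaultdict(lambda: defaultdict(float))
    for store, family, qty in records:
        totals[store] += qty
        by_store[store][family] += qty
    return {
        store: {family: qty / totals[store] for family, qty in families.items()}
        for store, families in by_store.items()
    }

def differences_from_average(shares):
    """'Differences' view: store share minus the average share across stores."""
    families = {f for per_store in shares.values() for f in per_store}
    average = {
        f: sum(per_store.get(f, 0.0) for per_store in shares.values()) / len(shares)
        for f in families
    }
    return {
        store: {f: per_store.get(f, 0.0) - average[f] for f in families}
        for store, per_store in shares.items()
    }

def colour(diff):
    """Blue above the average, red below; 'neutral' for exactly zero is an assumption."""
    if diff > 0:
        return "blue"
    if diff < 0:
        return "red"
    return "neutral"

shares = normalized_shares(sales)
diffs = differences_from_average(shares)
```

With this toy data, `colour(diffs["store_A"]["pasta"])` is blue because store A sells proportionally more pasta than the average of the two stores, while the same family in store B is red.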
The predictive model, developed by the team as an extra task, aims to classify the type of store a generic transaction comes from. It reached 95% accuracy in testing.
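The page does not say which model the team used. As a rough illustration of the task only, here is a nearest-centroid classifier in plain Python that assigns a transaction to a store type and measures accuracy on a held-out set. The feature vectors (hypothetical shares of two product groups in a basket), the labels and all function names are assumptions.

```python
import math
from collections import defaultdict

def fit_centroids(samples):
    """samples: list of (feature_vector, store_type). Returns {store_type: centroid}."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, vec):
    """Return the store type whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

def accuracy(centroids, labelled):
    """Fraction of labelled transactions classified correctly."""
    correct = sum(predict(centroids, vec) == label for vec, label in labelled)
    return correct / len(labelled)

# Hypothetical transactions: [share of fresh food, share of packaged goods].
train = [
    ([0.7, 0.3], "city"),
    ([0.8, 0.2], "city"),
    ([0.3, 0.7], "countryside"),
    ([0.2, 0.8], "countryside"),
]
test = [
    ([0.75, 0.25], "city"),
    ([0.25, 0.75], "countryside"),
]

centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)
```

The toy data is trivially separable, so the accuracy here says nothing about the 95% reported above. The sketch only shows how such a figure is computed: predictions on transactions the model has not seen, divided by the number of those transactions.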
The results show that differences do exist, but they cannot be called statistically significant. A broader dataset (a longer timeframe and more shops) would probably make the results more reliable.
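One simple way to reason about significance with confidence intervals is sketched below. It is an illustration under stated assumptions, not the team's procedure: the daily quantities are invented, and the interval uses the normal approximation (z = 1.96 for 95%), which is optimistic for small samples, where a t-based interval would be wider.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - z * standard_error, mean + z * standard_error)

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical daily quantities sold for one product family in two stores.
city_store = [100, 110, 90, 105, 95]
countryside_store = [60, 65, 55, 70, 50]

city_ci = mean_confidence_interval(city_store)
countryside_ci = mean_confidence_interval(countryside_store)
clearly_different = not intervals_overlap(city_ci, countryside_ci)
```

Non-overlapping intervals are a conservative sign of a real difference. Overlapping intervals do not prove there is none, which is one reason more days and more shops would help: the standard error shrinks as the sample grows.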
/ CHIARA BARBAGIANNI #ComputationalLinguist
/ SIMONE BASSO #Developer
/ ANDREA CORRADI #InformationDesigner
/ ADAM KLIMONT #DataScientist
/ ELENA OSTI #Analyst
/ Apache Spark
/ SQL in Zeppelin