A project by R. Catini, I. David, A. Favenza, G.E. Melon
About the dataset
The project was based on three datasets related to loan requests for eCommerce purchases. In particular:
/ 22,358 records, including a few fraud cases;
/ Fraud is encoded as a dummy (binary) variable;
/ The dataset contained different types of features: loan database, anonymized customer information, credit bureau alerts.
Challenge provided by the Data Sponsor
As online commerce becomes more common, fraud is an increasingly important concern. Automatic detection of hidden frauds that elude the standard control process is needed in order to evolve the standard detection model.
The challenges are to reduce human effort and to achieve an “optimal” trade-off in total cost (Total cost = Opportunity loss + Operation cost + Fraud loss).
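As a minimal sketch of the total cost formula above, the three components can be summed and compared across candidate detection policies. The class name, the policy names and all amounts below are hypothetical and chosen only for illustration; the source does not define how each component is measured.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CostBreakdown:
    """The three cost components named in the challenge (amounts are illustrative)."""
    opportunity_loss: float  # assumed: value lost on legitimate loans wrongly refused
    operation_cost: float    # assumed: cost of the human checks
    fraud_loss: float        # assumed: value lost on frauds that were not caught

    @property
    def total(self) -> float:
        # Total cost = Opportunity loss + Operation cost + Fraud loss
        return self.opportunity_loss + self.operation_cost + self.fraud_loss


def cheapest(candidates: Dict[str, CostBreakdown]) -> str:
    """Return the name of the candidate policy with the lowest total cost."""
    return min(candidates, key=lambda name: candidates[name].total)


# Two hypothetical policies with made-up costs.
candidates = {
    "strict": CostBreakdown(opportunity_loss=900.0, operation_cost=400.0, fraud_loss=100.0),
    "lenient": CostBreakdown(opportunity_loss=200.0, operation_cost=150.0, fraud_loss=800.0),
}
best = cheapest(candidates)
```

A stricter policy typically lowers fraud loss while raising opportunity loss and operation cost, which is why the comparison is made on the sum rather than on any single component.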
Execution by Divers
The proposed approach was to perform classification with machine learning techniques: Random Forest, stratified cross-validation, and feature selection by ROC.
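One common way to select features by ROC is to score each feature on its own by the area under its ROC curve (AUC) and rank the features by how far that AUC is from 0.5. The sketch below assumes this interpretation; the source does not describe the exact procedure, and the feature names and values are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of (fraud, non-fraud)
    pairs in which the fraud has the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_features(columns, labels):
    """Rank feature names by |AUC - 0.5|, so that features that are strongly
    but inversely related to fraud are also ranked high."""
    strength = {name: abs(auc(values, labels) - 0.5) for name, values in columns.items()}
    return sorted(strength, key=strength.get, reverse=True)


# Toy data: 1 marks a fraud, 0 a regular loan (hypothetical values).
labels = [0, 0, 0, 0, 1, 1]
columns = {
    "loan_amount": [100, 200, 150, 300, 900, 800],
    "customer_age": [30, 45, 52, 38, 41, 36],
    "bureau_alerts": [0, 1, 0, 0, 1, 2],
}
ranking = rank_features(columns, labels)
```

The pairwise loop is quadratic in the number of records, which is fine for a sketch but would usually be replaced by a rank-based computation on a dataset of realistic size.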
After a short exploration, the team ran into some problems. First, they discovered a strong imbalance between regular loans and identified frauds, which is a classic problem in binary classification. Second, Divers had to identify the variables that were relevant to the model. Finally, some potentially important features were missing from the initial version of the database, so the Data Sponsor provided additional data during the project.
Divers developed an algorithm based on a weighted Random Forest, in which the important features were selected through an automatic procedure.
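The source does not say which weighting scheme the team used. As an illustration only, the sketch below computes the widely used "balanced" class weights, n_samples / (n_classes * class_count), which is also the heuristic scikit-learn applies when `class_weight="balanced"` is passed to `RandomForestClassifier`, and builds stratified folds that keep the fraud rate roughly equal in every fold. The helper names and the toy labels are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    the rare class (fraud) counts as much as the common one overall."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def stratified_folds(labels, n_folds):
    """Split record indices into folds, dealing each class out round-robin
    so every fold gets a similar share of frauds (no shuffling in this sketch)."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy labels: 16 regular loans and 4 frauds (hypothetical proportions).
labels = [0] * 16 + [1] * 4
weights = balanced_class_weights(labels)
folds = stratified_folds(labels, n_folds=4)
```

With these toy labels each fold holds four regular loans and one fraud, so a model evaluated on any fold always sees at least one positive case. In practice the records would be shuffled before being dealt into folds.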
A sample of the dataset was used for testing, and two thresholds were defined to evaluate the quality of the algorithm. The trade-off was between correctly classifying non-frauds and correctly classifying frauds.
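The source does not explain how the two thresholds were used. One plausible reading, assumed here, is a three-way triage on the predicted fraud probability: requests below the lower threshold are accepted, requests at or above the upper threshold are rejected, and everything in between goes to an operator for review. The function names, threshold values and scores below are hypothetical.

```python
def triage(score, low, high):
    """Map a fraud probability to an action using two thresholds (assumed usage)."""
    if score < low:
        return "accept"
    if score >= high:
        return "reject"
    return "review"


def summarize(scores, labels, low, high):
    """Count how many frauds and legitimate requests fall under each action."""
    counts = {}
    for score, y in zip(scores, labels):
        key = (triage(score, low, high), "fraud" if y == 1 else "legit")
        counts[key] = counts.get(key, 0) + 1
    return counts


# Toy test sample: predicted fraud probabilities and true labels (1 = fraud).
scores = [0.05, 0.10, 0.30, 0.45, 0.60, 0.85, 0.95, 0.20]
labels = [0, 0, 0, 1, 0, 1, 1, 0]
counts = summarize(scores, labels, low=0.25, high=0.80)
```

Under this reading, moving the lower threshold down sends more legitimate requests to manual review, while moving the upper threshold down rejects more legitimate requests outright; the counts make that trade-off visible for a given pair of thresholds.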
Finally, some data visualizations were developed to represent the frauds in an understandable and quick-to-identify way. The first visualization was based on maps; the second was a dashboard (fraud probability vs. time, and correlation with loan amounts) with some advanced filters.
The model developed by the Divers works quite well, and the project could be a good first step toward a day-to-day tool that may help operators identify frauds. However, the team's work revealed the need for more data and more features to train the algorithm better.
/ ROBERTO CATINI #DataScientist
/ ISABELLA DAVID #DataScientist
/ ALFREDO FAVENZA #Researcher
/ GIORGIO ETTORE MELON #DataScientist