TEACHING MODULES

June 18th to July 13th 2018 | Turin, Italy

 

A journey of the unexpected: Big data, humanities data

Teacher: Renato Gabriele — Oohmm

A journey through years of advances in big data analysis applied to computational propaganda, crises, disasters, social events and general elections, in search of new models and of analyses that cover data and metadata even when they are “polluted” by multiple kinds of interests and sources.

Renato Gabriele will talk about a 5-year research project on social and web metadata, and about the innovation required to connect data analysis with digital humanities in order to better understand complex big data.

We’ll end with an open discussion on the matrix of common mistakes made when approaching critical data analysis, and on the unexpected effects they produce.

Duration: Talk.

#bigdata #propaganda #research #digitalhumanities

Data Visualization with D3.js

Teacher: Fabio Franchino — TODO

Immersive lecture on the key elements and concepts behind data visualization.

The workshop is an immersive tutorial about how to use the JavaScript open source library D3.js to represent data and to create customized and animated diagrams and charts.

Duration: 1.5 days.

Prerequisites: HTML, CSS, previous experience with JavaScript is welcome.

#dataviz #datavisualization #d3js #javascript

Crash course in Python and data science libraries

Teacher: Stefania Delprete — TOP-IX

Interactive lessons using Jupyter Notebooks on Python and its most used libraries for data science: NumPy, Pandas, Matplotlib, and an initial Scikit-learn exposure. Plus you’ll get clear on what’s inside the Anaconda and SciPy ecosystems.

This session will include insights into the history and future of the open source libraries, and into how to contribute and take part in community events.

Stickers for all participants are provided by the Python Software Foundation and NumFOCUS.

Duration: 1.5 days.

Prerequisites: Exposure to Python and Jupyter Notebooks.

#datascience #python #numpy #pandas #matplotlib #scipy #scikitlearn

Real Time Ingestion and Analysis of data with MongoDB and Python

Teacher: Alessandro Molina — AXANT

Nowadays more and more data is generated by companies and software products; in the IoT world especially, records are saved at a throughput of thousands per second.

That requires solutions able to scale writes and to clean up and analyze thousands of records per second in real time. MongoDB is becoming widely used in those environments in the role commonly called the “speed layer”, performing fast analytics over the most recent data and adapting or cleaning up incoming records.

This session aims to show how MongoDB can be used as the primary storage for your data, scaling to thousands of records and thousands of writes per second, while also acting as a real-time analysis and visualization channel thanks to change streams, and as a flexible analytics tool thanks to the aggregation pipeline and MapReduce.

Duration: 1 day.

Prerequisites: Python, JavaScript.

#mongodb #realtime #scaling #mapreduce

Data Analysis with Spark Streaming

Teacher: Nicolò Bidotti — AgileLab

Big Data analysis is a hot trend and one of its major roles is to give new value to enterprise data. However, data and information lose value as they age, so in many contexts it is important to analyze incoming data flows in near real time. Apache Spark is a major player in the big data landscape, and its Streaming module aims to solve the main challenges of real-time data processing at scale in distributed environments.

This session aims to show the potential of streaming data analysis and how to leverage Apache Spark with Structured Streaming to extract value from it, without having to deal with the common problems of stream processing at scale that Apache Spark already solves.

Duration: 2 days.

Prerequisites: Python.

#bigdata #dataengineering #dataframework #apachespark

Excursus: Agent-based modelling and synthetic populations

Teachers: Sarah Wolf, Andreas Geiges – Global Climate Forum

To understand possible transitions of complex systems (e.g. societies, markets, systems of socio-technical co-evolution), pure data analysis might not be sufficient, because such transitions often imply substantial shifts that can hardly be described by pure statistical extrapolation of data. Modelling activities can therefore be a useful complement to data analysis.

This workshop introduces an agent-based model, which is based on synthetic populations, for the global challenge of how to make mobility more sustainable. It illustrates the methodological approach of agent-based modelling, discusses how the process of model development can be accompanied with stakeholder dialogues, explores the interaction between such an agent-based model and the relevant data science tools, and provides some hands-on exercises.

Duration: 2 days.

Prerequisites: Basic knowledge of Python.

#datascience #complexsystems #agentbased #mobility #sustainability

Data Citizenship and NetScience: technology for data-culture

Teachers: Salvatore Iaconesi, Oriana Persico — Human Ecosystems Relazioni

We constantly generate data, whether we realize it or not, whether we want it or not, and a very limited number of subjects has access to all of this data. This is a very serious condition, with enormous implications for our fundamental rights and freedoms, and for our opportunities to prosper, create, express, relate and live a just, inclusive, constructive life.

In this session we explore technologies for cultural acceleration through data: Human Ecosystems to create large scale, participatory data collection processes; Ubiquitous Commons for distributed, blockchain supported data-rights and evolved data-ownership patterns; Generative Open Data as accessibility layer for shared data commons.

This is a hands-on session in which profound theoretical concepts emerge from the technological architectures themselves and from the ways in which we will use them. It will mainly focus on Network Science and on how we can use it to gain a better understanding of the city’s Relational Ecosystem between people, organizations, network-connected objects, sensors and more.

We will see and understand how to use the platforms, and explore a practical case study: Bologna’s TDays, the limited-traffic weekends in the historical center of Bologna. Together we will figure out possible ways to transform them into a data-driven, inclusive, engaging opportunity for participatory citizenship, by using the platforms, social networks, art and design.

Duration: 1 day.

Prerequisites: Some familiarity with databases, to browse and export data from them, Python and/or JavaScript.

#networkscience #socialscience #territory #city #citizenship

Data matching and deduplication with Python

Teacher: Simone Marzola — Oval Money

In the era of multi-tiered big-data infrastructures, data is commonly spread across multiple data sources and duplicates are everywhere. As a data scientist you’ll need to focus on consolidating data to improve its quality and build comprehensive data assets, through a process called data deduplication.

This session aims to show how data analysis tools for Python, like Pandas and NumPy, can be used to solve the deduplication problem in very large datasets. The proposed method includes data preprocessing and cleaning, indexing, comparison and classification.

We will use an anonymized subset of Oval Money user transactions to match duplicates and detect recurring transactions.
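
The steps named above (preprocessing and cleaning, indexing, comparison, classification) can be illustrated with a minimal sketch. It is not the workshop's material: it uses only the Python standard library rather than Pandas and NumPy, the records are invented for illustration, and the blocking key (same amount) and the 0.9 similarity threshold are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records, invented for illustration: (id, description, amount).
records = [
    (1, "ACME Coffee Shop ", 3.50),
    (2, "acme coffee-shop", 3.50),
    (3, "Acme Coffee Shop", 12.00),
    (4, "City Transport", 1.70),
]


def clean(text):
    """Preprocessing: lowercase, replace punctuation with spaces, collapse spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def find_duplicates(rows, threshold=0.9):
    """Return pairs of record ids classified as duplicates."""
    # Indexing (blocking): only records with the same amount are compared,
    # which avoids comparing every record with every other one.
    blocks = defaultdict(list)
    for rec_id, description, amount in rows:
        blocks[amount].append((rec_id, clean(description)))

    matches = []
    for block in blocks.values():
        for (id_a, text_a), (id_b, text_b) in combinations(block, 2):
            # Comparison: string similarity score between 0 and 1.
            score = SequenceMatcher(None, text_a, text_b).ratio()
            # Classification: a fixed threshold separates matches from non-matches.
            if score >= threshold:
                matches.append((id_a, id_b))
    return matches


duplicate_pairs = find_duplicates(records)
print(duplicate_pairs)
```

Blocking on an exact amount is a deliberately crude choice; on real transaction data a looser key (for example date plus rounded amount) would probably be needed.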

Duration: Half day.

Prerequisites: Python, Pandas, NumPy.

#bigdata #deduplication #classification #finance

Data Visualization using the open source KNOWAGE suite

Teachers: Isabella Iennaco, Paolo Raineri — KNOWAGE (Engineering S.p.A)

Business analytics lecture on a real KNOWAGE use case of predictive maintenance with an Open Source full stack!

The lecture is a journey through KNOWAGE’s data visualization and data discovery capabilities and how they work in practice. The teachers will guide you towards a comprehensive understanding of the KNOWAGE suite and let you explore a large Industry 4.0 business project.

Duration: Half day.

Prerequisites: Python, JavaScript.

#mongodb #realtime #scaling #mapreduce #predictivemaintenance

Machine Learning and Deep Learning for Computer Vision

Teachers: Andrè Panisson, Alan Perotti — ISI Foundation

This in-depth part of the course lets you build an appealing and diversified Machine Learning portfolio. It starts with an introduction to Machine Learning and its application with Scikit-learn, and continues with lectures on Neural Networks and backpropagation, where you’ll start exploring Computer Vision techniques on a dataset of images.

It then moves on to Deep Learning methods: you’ll be challenged to use TensorFlow and Keras on real image classification cases (such as distracted drivers, healthcare or plant diseases). The workshop ends with lessons on Transfer Learning and one last project in which you build your own data set by scraping Google Images and practice everything you learned.

Duration: 3.5 days.

Prerequisites: Python, Pandas, Statistics, exposure to Machine Learning is welcome.

#machinelearning #deeplearning #neuralnetworks #scikitlearn #tensorflow

Voice Recognition models in DeepSpeech and Common Voice

Teacher: Alexandre Lissy — Mozilla

DeepSpeech is an open source Speech-To-Text engine that uses a model trained with machine learning techniques, based on Baidu’s Deep Speech research paper.

You will learn how the model works and how it was implemented using TensorFlow. The workshop will cover how we went from a PoC hack to a model that we are trying to make usable in production, and how we leverage the distributed training system. We’ll explore how the inference-specific model is built, the code around it that makes it run on several devices, and the TensorFlow tooling we explored to try to speed things up.

We will also present the Common Voice project, which aims to collect an open dataset for machine learning, and more specifically for voice-targeted machine learning.

You’ll learn how to contribute to both projects: how to train your own model for DeepSpeech, how to use DeepSpeech as a “blackbox”, how to hack into DeepSpeech, and how to contribute to Common Voice.

Duration: Half day.

Prerequisites: Python, shell, exposure to C++ is welcome.

#machinelearning #deepspeech #voicerecognition #tensorflow

From local to glocal using community data

Teacher: Maurizio Napolitano — Fondazione Bruno Kessler

The workshop starts with an introduction to the GIS world, the geospatial protocols and the available geodata resources.

It continues by diving into the OpenStreetMap ecosystem, where we explore how it can be used as a great tool for data scientists. After the examples of analysis on real cases, you’ll be challenged to make your own geospatial project supervised by the expert Maurizio Napolitano.

Duration: 1 day.

Prerequisites: Python, previous experience with OpenStreetMap is welcome.

#geospatial #map #opendata #osm

SEVENTH EDITION DIVERS LIST


/ MUHAMMAD AL READEAN
/ LUCA BARBATI
/ STEFANO CALDERAN
/ OMJYOTI DUTTA
/ PIER PAOLO GRASSI
/ MAREK KUFEL
/ ENRICO LOMBARDO
/ MARCO MARTELLACCI
/ STEFANO MENOZZI
/ DANIELE MORANO
/ MICHELE MORELLO
/ IVAN NARDINI
/ GABRIELE PECE
/ ELISA REALE
/ MASSIMO SANTOLI
/ MARCO SEBASTIANELLI
/ CHRISTIAN TORRERO

TEACHERS & SPEAKERS


Renato Gabriele – oohmm.info
Fabio Franchino – todo.to.it
Stefania Delprete – top-ix.org
Alessandro Molina – axant.it
Nicolò Bidotti – agilelab.it
Sarah Wolf, Andreas Geiges – globalclimateforum.org
Salvatore Iaconesi, Oriana Persico – he-r.it
Simone Marzola – ovalmoney.com
Isabella Iennaco, Paolo Ranieri – knowage-suite.com
Andrè Panisson, Alan Perotti  – isi.it
Alexandre Lissy – mozilla.org
Maurizio Napolitano – fbk.eu

Guest speakers
(in collaboration with ISI Foundation)
Roberta Sinatra – robertasinatra.com
Michael Szell – michael.szell.net

DATA SPONSORS


Vem Solutions
vemsolutions.it

SmartData@PoliTO
smartdata.polito.it/

 

OTHER DATA PROVIDERS


INRIM
inrim.eu

NIST
comune.torino.it

SMART DATA NET
smartdatanet.it

5T Torino
5t.torino.it

FINAL PROJECTS

Group 1) “The Group”
Data by: Vem Solutions

 

Team Group: E. Lombardo, C. Torrero, P. P. Grassi, L. Barbati, S. Menozzi
Team Composition: Developers: x2 | Data Scientists: x2 | Domain Expert: x1

The question behind the project work: considering 2% of the vehicle population, is it possible to extract key trends in parking and points of interest?

The group explored mobility in Turin using the dataset provided by Vem. They started by visualizing the flows both statically and dynamically. Then they focused on anomalous situations in the vehicle distribution, trying to link them to mainstream city events.

After this analysis the team stated that Vem blackboxes provide a complete characterization of the points of interest around the city.

Tools: MySQL, Flask, Power BI, QGIS, Jupyter Notebook, deck.gl framework
Data Science Methods: Kernel Density Estimation, Grid Clustering

Group 2) “M2O”
Data by: Vem Solutions, INRIM

Team Group: M. Morello, O. Dutta, M. Al Readean
Team Composition: Developer: x1 | Data Scientist: x1 | Researcher: x1

The question behind the project work: is it possible to correlate “noise” (optical fiber oscillation) with the traffic distribution over the city?

Recent studies proved that measurements of optical fiber oscillation can be used to detect earthquakes. This is particularly relevant for submarine seismic events, where it is difficult to use traditional seismographs. Using a similar approach and crossing data from Vem and INRIM, the group found an interesting correlation between the fiber noise and the general behavior of vehicles around the city.
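
As a rough illustration of the two methods listed for this project (moving averages and linear regression), here is a minimal standard-library sketch. It is not the team's code: the function names are hypothetical, and the two series are synthetic numbers made up for the example, not Vem or INRIM measurements.

```python
from math import sqrt


def moving_average(values, window):
    """Smooth a series with a simple moving average of the given window size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic hourly series, invented for illustration only.
vehicles_moving = [10, 12, 20, 35, 50, 48, 30, 15]
fiber_noise = [1.1, 1.0, 2.2, 3.4, 5.1, 4.9, 2.8, 1.6]

# Smooth both series with the same window, then regress noise on traffic.
smooth_traffic = moving_average(vehicles_moving, window=3)
smooth_noise = moving_average(fiber_noise, window=3)
slope, intercept, r = linear_fit(smooth_traffic, smooth_noise)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

Smoothing before regressing reduces short-term fluctuations, but it also inflates the apparent correlation between two series, so the value of r from such a sketch should be read with care.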

Tools: QGIS, Power BI, Python and R, MS Access, Jupyter Notebook
Data Science Methods: Moving Averages, Linear Regression

Group 3) “TOnnect”
Data by: SmartData@PoliTO

Team Group: G. Pece, I. Nardini, E. Reale, M. Sebastianelli, M. Santoli
Team Composition: Data Scientists: x2 | Researchers: x2 | Developer: x1

Identification of behaviors and trends in car-sharing mobility between Turin and Caselle Airport, and between Milan and Linate Airport, by analyzing data provided by SmartData@PoliTO.

The team proceeded by developing geospatial clustering and time series prediction of the flow-in and flow-out of a specific cluster. To improve the user experience, the team proposes a mobile app that organizes and visualizes, in real time, the different options available to reach the airport.

Tools: Python and libraries, Jupyter Notebook, deck.gl framework, Adobe Illustrator
Data Science Methods: DBSCAN and HDBSCAN algorithms, ARIMA, AUTOARIMA, PROPHET

Group 4) “Four Pandas”
Data by: Vem Solutions

Team Group: S. Calderan, D. Morano, M. Kufel, M. Martellacci
Team Composition: Developers: x2 | Data Scientist: x1 | Researcher: x1

Traffic level prediction in Turin, based on the Vem Solutions mobility dataset and on the analysis of its metrics in specific areas of the city.

After researching a metric for traffic level and frequency, the team created a grid dividing the city into 500 zones and carefully chose the algorithm to optimise their traffic prediction.
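
A grid of 500 zones like the one described above could be built as in the following sketch. Everything specific in it is an assumption made for illustration: the bounding box, the 20 x 25 layout and the sample points are not taken from the project.

```python
from collections import Counter

# Hypothetical bounding box roughly around Turin, in degrees (an assumption).
LAT_MIN, LAT_MAX = 45.00, 45.14
LON_MIN, LON_MAX = 7.58, 7.78
ROWS, COLS = 20, 25  # 20 * 25 = 500 zones; the real layout is not known


def zone_of(lat, lon):
    """Return the zone index (0..499) of a point, or None if outside the box."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * ROWS)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * COLS)
    # Guard against floating-point rounding at the upper edges.
    row = min(row, ROWS - 1)
    col = min(col, COLS - 1)
    return row * COLS + col


# Invented sample positions (lat, lon); the last one falls outside the box.
points = [(45.0703, 7.6869), (45.0705, 7.6871), (45.03, 7.60), (46.0, 7.6)]

# Count observations per zone: a simple starting point for a traffic metric.
zone_counts = Counter(
    zone for zone in (zone_of(lat, lon) for lat, lon in points)
    if zone is not None
)
print(zone_counts)
```

Per-zone counts computed per time slot could then serve as features or targets for a regressor such as those listed under Data Science Methods.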

Tools: Python and libraries, Jupyter Notebook, JS
Data Science Methods: Extremely Randomized Trees Regressor, Extreme Gradient Boosting (XGBoost)