educating data scientists: the sobigdata master experience
Social Mining & Big Data Ecosystem
Educating Data Scientists:
the SoBigData master experience
www.sobigdata.eu
Fosca Giannotti, Valerio Grossi
ISTI-CNR Pisa
H2020-INFRAIA-2014-2015 Grant Agreement N. 654024
Modern science is data-intensive, multidisciplinary, collaborative and global
– efficiency of data management (noSQL paradigms and cloud computing play an important role here) and of curation, search, sharing and transfer.
– managing the complexity of the analytical process is a key issue (scalable distributed analytical methods and Visual Analytics are crucial here).
Firenze, 14 Nov 2016
Big Data Analytics process
[Figure: Big Data Analytics process. Labels: Data (demographic data, geographic data, movement data, transport data); Models (T-Clustering, T-Patterns); Forecasts; Validation.]
Interdisciplinary and collaborative
• for sharing data/models/processes and results of experiments (different levels of interoperability and semantic enrichment)
• to realize experiments by combining resources (data, methods and results) belonging to different communities.
– This calls for tools that facilitate the governance of complex analytical processes in a workflow style or through mega-modeling.
– This also calls for sophisticated search that supports resource discovery.
Data scientist
A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.
Four core points of a data scientist
• Data Procurement and Curation
• Making sense of Data
• Story-telling
• Answering, step by step, for technical correctness and for legal and ethical issues
SoBigData is…
A Multidisciplinary European Infrastructure for Big Data and Social
Data Mining providing an integrated ecosystem for ethically
sensitive scientific discoveries and advanced applications of social
data mining on the various dimensions of social life, as recorded by
“big data”.
Social Mining - Answer to:
• Who will win the US elections? What are the electors' current voting intentions? How reliable are they?
• What are the indicators of social well-being (beyond GDP) and how can they be computed and monitored?
• How effectively is the aging population helped by social participation in digital community services?
• What is the link between media ownership and media content? Is there bias in news reporting? And in content reviews?
• Is an infectious disease emerging? What is its diffusion model?
Estimating traffic fluxes on road network with mobile phone data
[Figure labels: A, B, C, HW]
Predicting Success
“Football is a simple game: 22 men chase a ball for 90 minutes and at the end, the Germans always win” -- Gary Lineker (after Italy 1990 Final)
Managing Data: what does this mean?
Managing data does not mean only:
– Support discovery
– Provide access
– Verify the quality of data
– Clean errors, outliers, anomalies
– Transform data into a format suitable for specific data analytical tools
It must include support for
• legal interoperability
– copyright management
– licensing of single and derivative products
– terms of use
• fine-grained policies
– attribution
– citation policy
– provenance management
• Ethics issues
Metadata in the SoBigData RI experience
• Huge datasets often describe human activities, which implies privacy and ethical issues
• As a Research Infrastructure, FAIRness is one of our main targets
– The success of the RI is directly connected to the fact that datasets are Findable, Accessible, Interoperable and Reusable
– The intellectual property has to be considered
– The design of a highly structured metadata schema allows the RI to automatically grant or deny access to a dataset, to force the acceptance of terms of use or signing NDAs…
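To make the last point concrete, here is a minimal sketch of how an access decision could be derived from metadata alone. It is an illustration under stated assumptions, not the SoBigData implementation: the class names `AccessPolicy` and `User`, their fields, and the function `decide_access` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical access section of a dataset's metadata record."""
    requires_terms_of_use: bool  # user must accept the terms of use first
    requires_nda: bool           # user must have signed an NDA first


@dataclass(frozen=True)
class User:
    """Hypothetical view of what the infrastructure knows about a requester."""
    name: str
    accepted_terms: bool = False
    signed_nda: bool = False


def decide_access(policy: AccessPolicy, user: User) -> tuple[bool, list[str]]:
    """Return (granted, missing_steps) for one user and one dataset policy."""
    missing = []
    if policy.requires_terms_of_use and not user.accepted_terms:
        missing.append("accept terms of use")
    if policy.requires_nda and not user.signed_nda:
        missing.append("sign NDA")
    # Access is granted only when no required step is still missing.
    return (not missing, missing)
```

Because the decision depends only on structured fields, the infrastructure could tell a requester exactly which step is still missing (for example, accepting the terms of use) instead of simply denying access.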
SoBigData metadata structure
• A highly structured and detailed metadata structure has been designed in order to provide information about:
– Description of the dataset (to make it Findable)
– How the dataset has been produced
– Intellectual Property
– Privacy issues
– Who can access the data and how (terms of use, NDA…)
• Mainly based on the DataCite standard
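As an illustration of the sections listed on this slide, a metadata record could be sketched as follows. This is an assumption-laden example, not the actual SoBigData schema: the descriptive field names loosely follow DataCite properties (identifier, creator, title, publisher, publication year), while every other field name, the helper functions, and all example values are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetMetadata:
    # Descriptive part, loosely following DataCite properties (Findable)
    identifier: str
    creators: list[str]
    title: str
    publisher: str
    publication_year: int
    # Hypothetical field names for the remaining sections of the slide
    provenance: str = ""                   # how the dataset has been produced
    license: str = ""                      # intellectual property
    contains_personal_data: bool = False   # privacy issues
    access_conditions: list[str] = field(default_factory=list)  # terms of use, NDA


def missing_sections(record: DatasetMetadata) -> list[str]:
    """List the optional sections that are still empty in this record."""
    checks = {
        "provenance": record.provenance,
        "license": record.license,
        "access_conditions": record.access_conditions,
    }
    return [name for name, value in checks.items() if not value]


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record so it can be stored or exchanged."""
    return json.dumps(asdict(record), indent=2)


# Placeholder values only; this is not a real dataset or identifier.
example = DatasetMetadata(
    identifier="10.0000/example",
    creators=["Example Lab"],
    title="Example mobility dataset",
    publisher="Example Research Infrastructure",
    publication_year=2016,
    contains_personal_data=True,
)
```

Calling `missing_sections(example)` reports which sections a curator would still need to fill in before the record is complete.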
The ethics of SoBigData
• Gathering large quantities of data has serious consequences that SoBigData is trying to address. These consequences range from personal harm, to issues of autonomy, injustice and inequality.
• In order to deal with these problems, SoBigData adheres to a value-sensitive design approach. This approach consists in using design solutions to overcome ethical dilemmas, in this case those between the utility of the data gathered and the protection of the individuals subject to the research.
• In order to make the ideals of SoBigData successful, scientific methods also need to be developed in order to embed moral principles in practice.
Ethics: the challenge for SoBigData
• How do we create an infrastructure in which such data and methods can be disseminated and improved upon?
1. A Massive Open Online Course (MOOC) which instructs all prospective researchers about the legal and ethical dangers of big data research and the steps they can take to minimise these;
2. A set of workflows that outline the steps researchers can take when designing their approach;
3. Information pop-ups which redirect researchers to state-of-the-art ethical methods.
Metadata definition: Ethics
Metadata definition: Intellectual Property
Master in Big Data Analytics & Social Mining
http://www.sobigdata.eu/master/bigdata
Education
• Big Data Sensing
• Big Data Mining
• Big Data Story Telling
• Big Data Technology
• Big Data for Social Good
• Big Data Ethics
Students: their studies
[Figure: chart with series for 2015 and 2016; axis from 0 to 8]
Gender distribution
[Figure: chart of M and F students for 2014-2015 and 2015-2016; axis from 0 to 25]