about data from a machine learning perspective
TRANSCRIPT
![Page 1: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/1.jpg)
Oriol Pujol
ABOUT DATA FROM A MACHINE LEARNING
PERSPECTIVE
![Page 2: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/2.jpg)
TWO FLAVOURS AND ONE PRESENTATION
Share experiences in openness from the machine learning community
To propose potential solutions to the problem of sharing sensible data
![Page 3: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/3.jpg)
A COMMUNITY SHIFT
![Page 4: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/4.jpg)
THE ROLE OF DATA IN MACHINE LEARNING
• Machine learning is sometimes called learning from data. Thus we are talking about building/coding algorithms that learn from data.
• Data is the key component in most research, but it is obviously crucial in machine learning.
![Page 5: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/5.jpg)
Automatic captioning
EXAMPLES OF TASKS IN MACHINE LEARNING
![Page 6: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/6.jpg)
CURRENT MACHINE LEARNING RESEARCH FOCUS
• Improving results in terms of generalization, robustness, speed, and interpretability. • Comparison with other techniques is fundamental. • Many different data sets are needed to assess each property we are tackling in the research, e.g. we need large datasets for speed and big data.• Sometimes, their application largely depends on the domain, e.g., financial products forecasting, targeted gene identification.
![Page 7: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/7.jpg)
AS A RESEARCHER
DISEMINATION: REPORTING PROCESS, CLAIMING AUTHORSHIP
QUALITY: COMMUNITY VALIDITY ASSESSMENT, PEER REVIEW (KPI USED BY ORGANISATIONS FOR RESEARCH QUALITY)
We pursue to contribute to advance science: reproducibility of experimentation that helps to eliminate individual biases.
![Page 8: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/8.jpg)
FOR THIS REASON WE USE PUBLISHING
DISSEMINATIONQUALITY ASSESSMENT
![Page 9: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/9.jpg)
FOR THIS REASON WE USE PUBLISHINGLONG PROCESS > 1 YEARNOT A GOOD DISEMINATION SOURCE, OR ATTRIBUTION PROC.
PEER QUALITY ASSESSMENT CRITICAL FEEDBACKGUARD AGAINST ERRORSMAJOR SOURCE FOR RANKING RESEARCHERS
NON-‐PUBLISHED RESEARCH IS NOT RESEARCH
![Page 10: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/10.jpg)
FOR THIS REASON WE USE PUBLISHING
FAIRER PROCESS > 3-‐6 month
QUALITY ASSESSMENTRANKING RESEARCHERS
NON-‐PUBLISHED RESEARCH IS NOT RESEARCH
![Page 11: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/11.jpg)
THE NIPS CASE: WHEN CONFERENCE PAPERS ARE OLD NEWS
IN MACHINE LEARNING 3-‐6 MONTHS IS TOO LONG
DECEMBER 5TH – 10TH , 2016
![Page 12: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/12.jpg)
THE NIPS CASE: WHEN CONFERENCE PAPERS ARE OLD NEWS
![Page 13: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/13.jpg)
24/7 PUBLISHING CYCLE
IMMEDIACY, 24/7 PUBLISHING CYCLE
IMMEDIATE ATTRIBUTION
NO QUALITY ASSESSMENTRANKING RESEARCHERSHOW TO DISTINGUISH GOOD RESEARCH FROM FRAUD
![Page 14: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/14.jpg)
HOWEVER, IN REALITY
![Page 15: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/15.jpg)
TOWARDS REPRODUCIBILITY: WE NEED DATA
![Page 16: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/16.jpg)
AN EXAMPLE: PLOS ONE
FOCUS SHIFT:
METHODOLOGICAL CORRECTNESS
DATA
![Page 17: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/17.jpg)
MACHINE LEARNING DATASET REPOSITORIES
![Page 18: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/18.jpg)
MACHINE LEARNING DATASET REPOSITORIES
![Page 19: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/19.jpg)
Data Sets -‐ Raw data as a collection of similarily structured objects.Material and Methods -‐Descriptions of the computational pipeline.Learning Tasks -‐ Learning tasks defined on raw data.Challenges -‐ Collections of tasks which have a particular theme.
MACHINE LEARNING DATASET REPOSITORIES
![Page 20: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/20.jpg)
• The U.S. Securities and Exchange Commission (SEC) 2010 proposal, financial products must include Pythoncode.
WHEN DESCRIPTION AND DATA IS NOT ENOUGH
![Page 21: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/21.jpg)
WHEN DESCRIPTION AND DATA IS NOT ENOUGH
![Page 22: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/22.jpg)
GITHUB PROJECTS
![Page 23: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/23.jpg)
BUT … DEVIL IS IN THE DETAILS
![Page 24: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/24.jpg)
DOCUMENTING THE DETAILS
![Page 25: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/25.jpg)
JUPYTER NOTEBOOKS
EXECUTABLE CODELATEXMARKDOWNHTML+CSSHYPERLINKSEMBEDDED FRAMES… and more.
![Page 26: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/26.jpg)
PUTTING ALL TOGETHER
![Page 27: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/27.jpg)
![Page 28: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/28.jpg)
BOTTOM LINE
1. SCIENCE IS MORE THAN A PUBLICATION
2. AT ITS CORE LIES REPRODUCIBILITY…THIS INVOLVES DATA
3. BUT DATA IS NOT EVEN ENOUGH… CODE AND PRECISE EXPERIMENT REPRODUCIBILITY
4. DATA AND PROGRAMMING LITERACY IS NEEDED
![Page 29: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/29.jpg)
BUT … SOMETHING ELSE ABOUT QUALITY ASSESSMENT: OPEN REVIEW CASE
![Page 30: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/30.jpg)
THE RULES
• Submissions posted on arXiv before being submitted to the conference• Program committee assigns blind reviewers (as always) marked as designated reviewers• Anyone can ask the program chairs for permission to become an anonymous designated reviewer (open bidding).• Open commenters will have to use their real names, linked with their Google Scholar profiles.• Papers that are presented in the workshop track or are not accepted will be considered non-‐archival, and may be submitted elsewhere (modified or not), although the ICLR site will maintain the reviews, the comments, and the links to the arXiv versions
![Page 31: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/31.jpg)
SOME POINTS ABOUT OPEN REVIEW
• Reading the answers from others is valuable
• Rejected papers have influence
•May become a popularity contest
![Page 32: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/32.jpg)
SHARING DATA IN THE AGE OF DEEP LEARNING
![Page 33: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/33.jpg)
dis
clo
sure
?
![Page 34: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/34.jpg)
Rel
ativ
e m
orta
lity
Privacy budget
Dis
clos
ure
ris
k
More privateLess private
![Page 35: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/35.jpg)
CLAS
SIFIER
![Page 36: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/36.jpg)
CLAS
SIFIER
LEAR
NER
![Page 37: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/37.jpg)
![Page 38: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/38.jpg)
![Page 39: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/39.jpg)
CLAS
SIFIER
![Page 40: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/40.jpg)
![Page 41: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/41.jpg)
Research is based on reproducibility:documentation, data, code, and details.
Machine learning methods can help in the challenge for democratizing data.
CONCLUSIONS
![Page 42: About Data From A Machine Learning Perspective](https://reader034.vdocument.in/reader034/viewer/2022042706/58aba05d1a28abdf3c8b4833/html5/thumbnails/42.jpg)
@oriolpujolvila