Principles and practice of Open Science


Upload: thecontentmine

Post on 20-Jan-2017


Category:

Science



TRANSCRIPT

Page 1: Principles and practice of Open Science

Open Science

Peter Murray-Rust, ContentMine.org, and University of Cambridge

OpenCon 2015, Bologna, IT, 2015-11-18

What is “Open”?

Why is it essential?

Open Data

Content Mining – a battle we must win

Young researchers are the present (Mike Eisen)

Page 2: Principles and practice of Open Science

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Page 3: Principles and practice of Open Science

My European Heroes

Young People (ContentMine)

NEELIE KROES

Page 4: Principles and practice of Open Science

Messages

• The system is completely broken
• We are at war with major publishers
• Students have the power to change the world
• Universities need help from students
• Open is a state of mind
• The opposite of Open is broken [1]
• Friction destroys Open
• Don’t buy it, build it …
• … TOGETHER

[1] (John Wilbanks)

Page 5: Principles and practice of Open Science

@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:

"Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ …

#opencon #TDM

Breaking news: Elsevier stopped me doing my research

Chris Hartgerink

Page 6: Principles and practice of Open Science

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

Chris Hartgerink’s blog post
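The server-load figures quoted in the blog post can be checked with a few lines of arithmetic, and the "9 papers per minute" limit can be sketched as a simple throttle. This is only an illustration: the function names `server_load` and `throttled` are invented here, and nothing below is taken from the author's actual download script.

```python
import time


def server_load(total_gb, days):
    """Convert a total download volume into average rates."""
    per_day = total_gb / days
    per_hour = per_day / 24
    per_min = per_hour / 60
    return {"GB/day": per_day, "GB/h": per_hour, "GB/min": per_min}


def throttled(items, per_minute, fetch, sleep=time.sleep):
    """Call fetch(item) for each item, pausing between calls so that
    no more than per_minute calls are started in any minute.

    fetch is a caller-supplied function (hypothetical here); sleep is
    injectable so the pacing can be inspected without really waiting.
    """
    delay = 60.0 / per_minute
    results = []
    for i, item in enumerate(items):
        if i > 0:
            sleep(delay)  # wait before every call except the first
        results.append(fetch(item))
    return results


# Figures from the blog post: about 30 GB in about 10 days.
load = server_load(30, 10)
print(load["GB/day"], load["GB/h"], round(load["GB/min"], 4))
```

At 9 papers per minute the pause between requests is 60/9, a little under 7 seconds.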

Page 7: Principles and practice of Open Science

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 8: Principles and practice of Open Science

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 9: Principles and practice of Open Science

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 10: Principles and practice of Open Science

After AMI2 processing…..

… AMI2 has detected a square

Page 11: Principles and practice of Open Science
Page 12: Principles and practice of Open Science

CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature

[Diagram. Recoverable labels: sources (EuPMC, arXiv, CORE, HAL, university repositories; ToC services; publisher sites: PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …); tools getpapers (query), quickscrape (crawl, DailyCrawl; scrapers, queries), norma (Normalizer, Structurer, Semantic Tagger; taggers) and ami (search, Lookup; plugins: Chem, Phylo, Trials, Crystal, Plants); input formats PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs; Semantic ScholarlyHTML (Text, Data, Figures; abstract, methods, references, Captioned Figures, Fig. 1, HTML tables); outputs: Facts, catalogue, Visualization and Analysis; CONTENT MINING COMMUNITY; 30,000 pages/day]

Page 14: Principles and practice of Open Science

Refs: Erriquez_Daniela_tesi, Fiorentina_Elena_tesi, Gou_Qian_Tesi, mbarontini_tesid, terracciano_maria_tesi

BagOfWords for Italian Theses
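The slide shows only the label "BagOfWords"; how the ContentMine tools compute it is not stated here. As a minimal sketch of the general technique (count word occurrences, ignoring order), assuming simple lowercase tokenization and a caller-supplied stopword list, something like the following would do. The sample sentence and stopwords are invented for illustration and are not taken from the theses.

```python
import re
from collections import Counter


def bag_of_words(text, stopwords=frozenset()):
    """Return word counts for text, ignoring case, digits and punctuation."""
    # [^\W\d_]+ matches runs of letters, including accented ones such as "è".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Invented sample text (not from the theses on the slide).
sample = "La sintesi chimica e la struttura: sintesi e analisi."
bag = bag_of_words(sample, stopwords={"la", "e"})
print(bag.most_common(1))
```

Comparing such counts across documents gives a crude but order-independent fingerprint of each thesis's vocabulary.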

Page 15: Principles and practice of Open Science

Copyright and Mining

• UK (“Hargreaves”) 2014 legislation:
  – “personal” “non-commercial*” “research” “data analytics”
  – legitimizes copying (?to disk), but not publishing
• PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright.

Page 16: Principles and practice of Open Science

Massive political activity in Europe

REDA Publisher-influenced

Page 17: Principles and practice of Open Science

Elsevier wants to control Open Data

[asked by Michelle Brook]

Page 18: Principles and practice of Open Science

Scholarly infrastructure becomes closed

No accountability for monitoring and control

Page 19: Principles and practice of Open Science

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)

Vera Mussah (director of county health services)

Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 20: Principles and practice of Open Science

The Publisher-Academic complex [1]

[Diagram. Nodes: Publishers, Academia, Taxpayer, Student, Researcher. Flow labels: “Glory+?”, “$$”, “MS”, “review”, “in-kind”]

[1] The Military-Industrial-Academic complex (1961) (Dwight D Eisenhower, US President)

Page 21: Principles and practice of Open Science

[Wikipedia:] On the steps of Sproul Hall [Student] Mario Savio gave a famous speech

... But we're a bunch of raw materials that don't mean … to end up being bought by some clients of the University, be they the government, be they industry, be they organized labor, be they anyone! We're human beings! ... There's a time when the operation of the machine becomes so odious — makes you so sick at heart — that you can't take part. You can't even passively take part. And you've got to put your bodies upon the gears and upon the wheels, upon the levers, upon all the apparatus, and you've got to make it stop. And you've got to indicate to the people who run it, to the people who own it, that unless you're free, the machine will be prevented from working at all. [1]

Univ California, Berkeley 1964

The Free Speech Movement

Page 23: Principles and practice of Open Science

Flower Power 1967

Berkeley 2010 “Flowerpoint”

Page 24: Principles and practice of Open Science

“How We Stopped SOPA”:

This bill ... shut down whole websites. Essentially, it stopped Americans from communicating entirely with certain groups.... I called all my friends, and we stayed up all night setting up a website for this new group, Demand Progress, with an online petition opposing this noxious bill.... We [got] ... 300,000 signers.... We met with the staff of members of Congress and pleaded with them.... And then it passed unanimously.... And then, suddenly, the process stopped. Senator Ron Wyden ... put a hold on the bill.

He added, “We won this fight because everyone made themselves the hero of their own story. Everyone took it as their job to save this crucial freedom.”

Robert Swartz: “Aaron was killed by the government, and MIT betrayed all of its basic principles.”

Aaron Swartz

Page 25: Principles and practice of Open Science

Rules for Revolutionaries

• Be publicly clear about your public aims.
• Gather whole-hearted allies.
• Choose your moment/s carefully.
• Be prominent – blogs, talks, papers.
• Be bold – and probably brave.
• Write Liberation Software.
• Create slogans, warcries, mantras.

Page 26: Principles and practice of Open Science

Take the fight to publishers. Hold them accountable for the near-criminal business models they operate on, and the stranglehold they have had on academia for too long.

Extending this, I need your help. I want to know if we initiate a formal investigation into the practices of publishers, in terms of the fact that they operate within an unregulated market and enjoy enormous profits to commit immoral acts (creating knowledge inequality). …. I want to know what we can do, and if such an investigation is even feasible, and whether or not we have a legal case supporting us.

Don’t sacrifice your career.. [PMR] said it best, that for any revolution blood will be spilled. If you’re making someone angry, you’re probably doing it right. But when you’re ‘advocating’ for open access, maintain one simple rule: don’t be a dick…. (and lots more)

Jon Tennant 2014-11-25

http://blogs.egu.eu/palaeoblog/2014/11/25/open-access-wins-all-of-the-arguments-all-of-the-time/

Page 27: Principles and practice of Open Science

The Right to Read

is

The Right to Roam

The Right to Mine

Kinder Mass Trespass used without permission but with love and thanks

Page 28: Principles and practice of Open Science

How can we achieve Freedom?

• Change the law to allow ContentMining
  – Hard, tedious, but necessary
  – Requires evidence, campaigning, making yourselves a pain in the arse…
• Make all outputs Open
  – Requires culture change in researchers
  – Tools: Open Notebook Science, Github, Open source, Social media.
  – Needs support from funders, learned societies, universities

Page 29: Principles and practice of Open Science

Four Freedoms (Richard Stallman)

The freedom to:

0 run the program as you wish, for any purpose
1 study how the program works, and change it
2 redistribute copies
3 distribute copies of your modified program

Most other “Opens” follow these principles, including CC-BY material. However, “Green Open Access” is incompatible with Freedoms 2 and 3.

Page 30: Principles and practice of Open Science

The Open Definition

“Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).”

Page 31: Principles and practice of Open Science

http://www.budapestopenaccessinitiative.org/read

… an unprecedented public good. …

… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …

… Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge.

(Budapest Open Access Initiative, 2002)

Page 32: Principles and practice of Open Science

Panton Principles for Open Data in science (2010)

• PUBLISH YOUR DATA OPENLY
• …make an explicit and robust statement of your wishes.
• Use a recognized waiver or license that is appropriate for data.
• open as defined by the Open Knowledge/Data Definition (… NOT non-commercial)
• Explicit dedication of data … into the public domain via PDDL or CCZero

Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John Wilbanks

Page 33: Principles and practice of Open Science

Panton Authors and Fellows

Page 34: Principles and practice of Open Science

Bjorn Brembs enhanced by OpenData

http://bjoern.brembs.net/2015/11/dont-be-afraid-of-open-data/

This is a response to Dorothy Bishop’s post “Who’s afraid of open data?“. After we had published a paper on how Drosophila strains that are referred to by the same name in the literature (Canton S), but came from different laboratories behaved completely different in a particular behavioral experiment, Casey Bergman from Manchester contacted me, asking if we shouldn’t sequence the genomes of these five fly strains to find out how they differ. So I went and behaviorally tested each of the strains again, extracted the DNA from the 100 individuals I had just tested and sent the material to him. I also published the behavioral data immediately on our GitHub project page.

Casey then sequenced the strains and made the sequences available, as well. A few weeks later, both Casey and I were contacted by Nelson Lau at Brandeis, showing us his bioinformatics analyses of our genome data. Importantly, his analyses wasn’t even close to what we had planned. On the contrary, he had looked at something I (not being a bioinformatician) would have considered orthogonal (Casey may disagree). So there we had a large chunk of work we would have never done on the data we hadn’t even started analyzing, yet. I was so thrilled! I learned so much from Nelson’s work, this was fantastic! Nelson even asked us to be co-author, to which I quickly protested and suggested, if anything, I might be mentioned in the acknowledgments for “technical assistance” – after all, I had only extracted the DNA.

However, after some back-and-forth, he persuaded me with the argument that he wanted to have us as co-authors to set an example. He wanted to show everyone that sharing data is something that can bring you direct rewards in publications. He wanted us to be co-authors as a reward for posting our data and as incentive for others to let go of their fears and also post their data online.

Page 35: Principles and practice of Open Science

Arguments for Open

• Open Science:
  – is Better Science
  – can reach and involve everyone
  – moves more quickly
  – challenges injustice
  – helps the world

It also happens to:
  – Promote the careers of scientists
  – Save money

Page 36: Principles and practice of Open Science

Jean-Claude Bradley

Jean-Claude Bradley was one of the most influential open scientists of our time. He was an innovator in all that he did, from Open Education to bleeding edge Open Science; in 2006, he coined the phrase Open Notebook Science. His loss is felt deeply by friends and colleagues around the world.

On Monday July 14, 2014 we shall gather at Cambridge University to honour his memory and the legacy he leaves behind with a highly distinguished set of invited speakers to revisit and build upon the ideas which inspired and defined his life’s work.

Wikipedia CC BY-SA

Page 37: Principles and practice of Open Science

Traditional Research and Publication

[Diagram. Inside the “Walls of academia”: “Lab” work, paper/thesis, Write, rewrite, Re-experiment, publish. Marked with question marks: “???”, “Validation??”, DATA. Annotations: output “belongs” to publisher; process “belongs” to publisher]

Page 38: Principles and practice of Open Science

Free/Open Software Development

[Diagram. A CODE REPOSITORY shared with the World community: CODE is rewritten, validated, forked and re-used. Labels: Github, BitBucket, StackOverflow, Apache; OSI; “inspires”; BORN-OPEN-SOURCE; NO WALLS]

Example: ContentMine at http://github.com/ContentMine/quickscrape

Page 39: Principles and practice of Open Science

Open Notebook Science

[Diagram. An Open engineered repository shared with the World community. Labels: TOOLS, INSTRUMENT, MODEL, CODE, DATA, knowledge; validate, merge, calibrate; CC-BY]

Problems are solved communally; Nothing is needlessly duplicated; “publication” is continuous

Machines and humans working together

Page 40: Principles and practice of Open Science

Mat Todd (Sydney) and MANY collaborators

http://opensourcemalaria.org/ (Chrome)

Page 41: Principles and practice of Open Science
Page 42: Principles and practice of Open Science
Page 43: Principles and practice of Open Science
Page 44: Principles and practice of Open Science
Page 45: Principles and practice of Open Science
Page 46: Principles and practice of Open Science

University of Southampton, BSD-like Open

Page 47: Principles and practice of Open Science

Open Source and Open Data

www.crystallography.net

Page 48: Principles and practice of Open Science
Page 49: Principles and practice of Open Science
Page 50: Principles and practice of Open Science

OPEN CLOSED

Zenodo Figshare

Git

Dat

OpenOffice Word, PPT

LabTrove, cheminfo.org Chemdraw

CrystallographyOpenDB Cambridge Cryst data Centre

WriteLatex / Overleaf

ReadCube, Symplectic,

Page 51: Principles and practice of Open Science

From Wikipedia CC BY-SA

Crowdsourcing

Page 52: Principles and practice of Open Science

Young people

Jenny Molloy

Ross Mounce, Sam Moore, Peter Kraker, Rosie Gray, Sophie Kay

Sophie: 3rd yr Grad students train 1st year students

PANTON ARMS

Panton Fellows

Page 53: Principles and practice of Open Science

Sophie Kershaw, Panton Fellow, Training PhD Students

Page 54: Principles and practice of Open Science

Rotation-Based Learning (RBL)

Phase 1: Initiator
• No communication permitted between groups
• Attempt to reproduce existing literature
• Deliver a coherent research story by the end of Phase 1

Phase 2: Successor
• Communication between groups still prohibited
• Validate and develop the inherited research story
• Critique your predecessors

• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?

Throughout Phases 1 & 2:
• Daily lectures on open science culture & techniques
• First-hand application to own research work
• Version control using GitHub
• Daily group supervision

Page 55: Principles and practice of Open Science

“Do you think you would be more confident in the future about trying to apply Open techniques to your work..?”

• 50% Yes, by myself
• 41% Yes, with help/guidance
• 9% No opinion/neutral
• 0% No

Page 56: Principles and practice of Open Science

Some Children of the Digital Enlightenment

• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• TheContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay

Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded. (WP)

J-C promoted these ideas with UNDERGRADUATE scientists.

[1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge

Sophie Kay

Page 57: Principles and practice of Open Science

More Thoughts

• Don’t negotiate with walled gardens, make them change or make them obsolete

• Building on top of non-Open is very fragile, unpredictable and usually bad engineering

Page 58: Principles and practice of Open Science

Protecting innovation

• Many start-ups get acquired and lose their mission

• “Embrace, extend, exterminate” (Microsoft)

• Consider adding “Open Lock” clauses to articles of incorporation