Fran Berman, Data and Society, CSCI 4967/6963
Data and Society Lecture 4: Data and the Health Sciences
2/27/15
Announcements
• Paper preparation -- No class next week (March 6)
• No office hours next week (March 6)
• Lecture 5 – Data and Entertainment -- March 13
• L4 Data Roundtable March 13
• Section 2 Paper / Mini-proposal due April 3 (Details in syllabus on website)
Today (2/27/15)
• Exam 1
• Lecture 4: Data and the Health Sciences
• No Data Roundtable
You are here
Date | First “half” | Second “half”
Section 1: The Data Ecosystem -- Fundamentals
January 30 | Class introduction; Digital data in the 21st Century (L1) | Data Roundtable / Fran
February 6 | Data Stewardship and Preservation (L2) | L1 Data Roundtable / 5 students
February 13 | Data and Computing (L3) | L2 Data Roundtable / 6 students
February 20 | Colin Bodel, Time Inc. CTO Guest Lecture and Q&A | L3 Data Roundtable / 5 students
Section 2: Data and Innovation – How has data transformed science and society?
February 27 | Section 1 Exam | Data and the Health Sciences (L4)
March 6 | Paper preparation / no class |
March 13 | Data and Entertainment (L5) | L4 Data Roundtable / 6 students
March 20 | Big Data Applications (L6) | L5 Data Roundtable / 5 students
Section 3: Data and Community – Social infrastructure for a data-driven world
April 3 | Data in the Global Landscape (L7) | Section 2 paper due; L6 Data Roundtable / 6 students
April 10 | Bulent Yener Guest Lecture, Data Privacy / Bad guys on the Internet (L8) | L7 Data Roundtable / 5 students
April 17 | Data and the Workforce (L9) | L8 Data Roundtable / 6 students
April 24 | Mike Schroepfer, Facebook CTO Guest Lecture and Q&A |
May 1 | Data Futures (L10) | L9 Data Roundtable / 5 students
May 8 | Section 3 Exam | L10 Data Roundtable / 5 students
Lecture 4: Data and the Health Sciences
Information technologies have revolutionized the Health Sciences
• Electronic Medical Records
– Greater data accessibility, analysis
• Better disease diagnosis and treatment
– Disease modeling and analysis, IT-guided surgery, electronic monitoring, etc.
• Better understanding of biological structure and function and fundamental science
– Mapping of the human genome, characterization of biological function and disease trajectory
• Personalized medicine
– Monitoring, early prediction of risk and diagnosis
• Public health
– Better prediction of and response to epidemics
– Better characterization of health risk and mitigation
• Etc., etc., etc.
Health Sciences Research: Data a fundamental driver for greater discovery
• What genes are associated with cancer?
• What parts of the brain are responsible for Alzheimer’s?
• How do coral reefs evolve over time?
• Who is at greatest risk for sickle cell anemia?
• …
[Figure: Integration across multiple scales in the biosciences. Users reach disciplinary databases through data access and use; data integration spans scales (atoms, biopolymers, organelles, cells, organs, organisms) and disciplines (medicinal chemistry, genomics, proteomics, cell biology, physiology, anatomy). Image courtesy of Mark Miller]
Potential of data-driven discovery coupled with many societal challenges
Regulation and policy
• Who owns health information? Who has a right to see it?
• Can health information be used to make decisions (insurance premiums, job eligibility)?
• What regulations should apply to mobile health applications?
Culture and practice
• How should clinicians team effectively with technology for better diagnosis, treatment and care?
• How and by whom should personalized health information be used?
Ethics
• Under what conditions is stem cell research, cloning, euthanasia, etc. acceptable? Does the availability of relevant data change this?
• Who is responsible for a misleading / incorrect data-driven diagnosis?
• How can sensitive data be used?
Etc., etc., etc.
Lecture Outline: A sampling of data-driven efforts in the Health Sciences
• Data infrastructure as a resource for discovery:
– The Protein Data Bank
• Community practice around health data
– The Alzheimer’s Disease Neuroimaging Initiative (ADNI)
• Biomedical Research in an Information-rich World
– Atul Butte, TedMed 2014
The Protein Data Bank
Some information in this section courtesy of Phil Bourne and Helen Berman
What is the Protein Data Bank (PDB)?
• International repository and archival database for information about the 3D structure of large biological molecules (such as proteins and nucleic acids)
• Most major scientific journals and some funding agencies (including the NIH) now require scientists to submit their structure data to the PDB.
• Provides free worldwide public access 24/7 to accurate protein data
About PDB
• PDB data downloaded > 350,000,000 times as of 2012
• PDB supports the development of standards for the representation, annotation, and validation of structural data collected by different experimental methods.
• World-wide PDB (wwPDB.org) is a consortium of groups that host deposition, annotation, and distribution centers for PDB data and collaborate on PDB projects:
– RCSB [Research Collaboratory for Structural Bioinformatics] PDB (US: Rutgers and SDSC/UCSD)
– PDB Europe (PDBe, UK)
– PDB Japan
– BioMagResBank (US)
• U.S. RCSB team includes computer scientists, biologists, chemists, educators.
Broad use, new innovation and discovery
PDB has enabled
• Safe storage of protein data
• Molecular replacement models for structure determination
• “Parts list” for modeling
• Structure based drug design
• Protein structure classification
• Protein structure prediction
Enhancements in 2013 and prior include
• New system for deposition and annotation
• New annotation modules and tools
• Expansion of reference dictionaries
• New community validation task forces for PDB data collections (NMR data, Small angle scattering data)
• Website and system improvement
• Mobile access
• Better search algorithms
• Expanded visualizations and views
PDB as a vehicle for education and outreach
Molecule of the Month / February 2015
http://www.rcsb.org/pdb/home/home.do
www.wwpdb.org www.pdbe.org www.pdbj.org
Insulin Receptor February 2015 Molecule of the Month by David Goodsell doi: 10.2210/rcsb_pdb/mom_2015_2
“Introduction
Cells throughout the body are fueled largely by glucose that is delivered through the bloodstream. A complex signaling system is used to control the process, ensuring that glucose is delivered when needed and stored when there is a surplus. Two hormones, insulin and glucagon, are at the center of this signaling system. When blood glucose levels drop, alpha cells in the pancreas release glucagon, which then stimulates liver cells to release glucose into the circulation. When blood glucose levels rise, on the other hand, beta cells in the pancreas release insulin, which promotes uptake of glucose for metabolism and storage. Both hormones are small proteins that are recognized by receptors on the surface of cells. …”
From the PDB website: http://www.rcsb.org/pdb/101/motm.do?momID=182
From the 2013 PDB Annual Report http://www.rcsb.org/pdb/general_information/news_publications/annual_reports/annual_report_year_2013.pdf
Rich global, multi-sector resource: 100,000+ PDB searchable structures in 2014 (~10K added in 2013)
PDB data workflow – two views
Figures from 2013 PDB Annual Report and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC102472/; Used for teaching under fair use policy http://www.ncbi.nlm.nih.gov/pmc/about/copyright/
Who uses RCSB PDB?
PDB History and Current Status
1970s
• Community discussions about how to establish an archive of protein structures
• Cold Spring Harbor meeting in protein crystallography
• PDB established at Brookhaven (Oct 1971; 7 structures)
1980s
• Number of structures increases as technology improves
• Community discussions about requiring depositions
• IUCr guidelines established
• Number of structures deposited increases
1990s
• Structural genomics begins
• PDB moves to RCSB PDB
2000s
• WWPDB formed
• 50,000th structure released (April 2008)
2010s
• 40th Anniversary of PDB (2011)
• 10th Anniversary of WWPDB (2013)
• 2013 / rcsb.org:
– ~286,000 unique visitors per month from 190 countries
– 1,000,000 downloads of data from PDB archive per day
– 1.3 TB per month transferred
– 10,000 downloads of mobile app
Information courtesy of Helen Berman and 2013 PDB Annual Report
PDB and Data Sharing
“A very important factor in the growth of the PDB has been the change in attitudes regarding data sharing.
In 1971, the incentives to deposit data in the PDB were very practical: by putting data in the archive, depositors would ensure that the data would not get lost. The task of distributing data resident on magnetic tapes to interested parties located around the world became the job of the PDB. In spite of these conveniences, it was not the norm to deposit data.
It wasn't until the 1980s that several community groups began to establish guidelines for data sharing. Once published, the funding agencies and the journals began to adopt these guidelines.
Today, structure deposition into the PDB is a prerequisite for publication in virtually every journal. These scientific, technological, and cultural changes have driven the continual growth of the PDB.”
From: “The Future of the Protein Data Bank” http://onlinelibrary.wiley.com/doi/10.1002/bip.22132/full
PDB Business Model
• Funding models for PDB vary for each PDB center in WWPDB
– Different funding cycles
– Different funding criteria
– Multiple agencies involved
• Costs
– Cost for structures in 2013 is roughly $750M (~$75K/structure)
– Cost for archiving is roughly $10,000,000
• Sustainability: Multiple models being explored
– Current models: multiple-agency, multinational funds
– Potential models:
• Journal model – pay per structure
• Congressional appropriation (NCBI)
• International funding structure with strong community oversight, rolling tenure
– Charitable WWPDB Foundation recently formed to support education, outreach, continued collaboration with respect to standards, and community meetings
[Figure: funders by WWPDB center, as grouped on the slide: Wellcome Trust, EU, CCP4, BBSRC, MRC, EMBL; BIRD-JST, MEXT; NSF, NIGMS, DOE, NLM, NCI, NINDS, NIDDK; NLM]
Information courtesy of Helen Berman and Phil Bourne.
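The cost figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes the "~10K added in 2013" count from the annual-report slide is the number of structures behind the $750M figure, and that the archiving cost covers the same period; neither is stated on the slide. Variable names are illustrative.

```python
# Figures quoted on the slides (all approximate).
structures_added_2013 = 10_000      # "~10K added in 2013"
cost_per_structure_usd = 75_000     # "~$75K/structure"
archiving_cost_usd = 10_000_000     # "Cost for archiving is roughly $10,000,000"

# Assumption: total structure cost = structures added x cost per structure.
structure_cost_usd = structures_added_2013 * cost_per_structure_usd

# Archiving cost relative to the cost of producing the structures
# (assumes both figures cover the same period).
archiving_share = archiving_cost_usd / structure_cost_usd

print(f"Structure cost: ${structure_cost_usd / 1e6:.0f}M")
print(f"Archiving cost as share of structure cost: {archiving_share:.1%}")
```

Under these assumptions the product reproduces the $750M on the slide, and archiving comes to a little over 1% of the cost of determining the structures.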
The Alzheimer’s Disease Neuroimaging Initiative (ADNI)
ADNI
• Purpose:
– Define biomarkers for and determine the best way to measure the treatment effects of AD therapeutics.
– Provide a public domain research resource to facilitate the scientific evaluation of neuroimaging and other biomarkers for the onset and progression of mild cognitive impairment
• Patient Database: ~1700 participants in US and Canada: patients with normal memory, patients with significant memory concerns, patients with mild cognitive impairment, patients with Alzheimer’s
• “Social” innovations:
– Public-private partnership to accelerate understanding and treatment of the disease.
• Project began in 2004 with funding from NIH, pharmaceutical companies and foundations.
– Open approach to data sharing (challenging in the health sciences)
Alzheimer’s: A type of dementia that causes problems with memory, thinking and behavior. Symptoms develop slowly and get worse over time, becoming severe enough to interfere with daily tasks
• Affects almost 50% of those over 85
• 6th leading cause of death in US
Many different kinds of relevant data
http://www.adni-info.org/
ADNI Accomplishments
ADNI has
• Developed methods for early detection of AD
• Developed standardized methods for clinical tests
• Demonstrated the feasibility and value of multicenter PET amyloid imaging
• Discovered unanticipated observations and new findings
• Stimulated the development of the world-wide Alzheimer’s community
• Created a globally valuable open access dataset around Alzheimer’s
ADNI Data Workflow
From: The Informatics Core of Alzheimer’s Disease Neuroimaging Initiative, Alzheimer’s Dement., May 2010, Author’s Manuscript
ADNI Data Infrastructure
From: The Informatics Core of Alzheimer’s Disease Neuroimaging Initiative, Alzheimer’s Dement., May 2010, Author’s Manuscript
Project Evolution
• ADNI1 (2004, 6 years, 400 subjects, $67M): Develop clinical, imaging, genetic and biochemical biomarkers for early detection and tracking of Alzheimer’s
• ADNI GO (2009, existing ADNI1 cohort + 200 EMCI participants): Extension examines biomarkers in an earlier stage of the disease.
• ADNI2 (2011, existing (ADNI1 + ADNI GO) cohort + 150 elderly controls, 100 EMCI, 150 LMCI, 150 MCI patients, $67M): Continuation of study with broader and more refined patient population
From http://adni.loni.usc.edu/study-design/
MCI: Mild cognitive impairment; EMCI: Early MCI; LMCI: Late MCI
Community Data
• Data in the Image Data Archive (IDA) in the Laboratory for Neuroimaging (LONI) at USC
– Clinical data, MR and PET image data, Image analysis results, Chemical biomarker data
– Participant demographic metadata includes gender, race, ethnicity, age, education, diagnostic categories, etc.
• ADNI provides (free) rapid public access to all raw and processed data. New data are quarantined for a maximum of 30 days for quality control review prior to posting.
• An ADNI data-use agreement is a prerequisite for obtaining data, and a user table lists everyone who is accessing ADNI data. All qualified investigators have equal access.
Authorization: Investigators must sign a “data use” agreement. If allowed access to data,
– Investigators must use the data for the purpose of scientific investigation, teaching or planning of clinical research studies.
– They and their collaborators must agree to a) refrain from contacting or trying to identify subjects, b) respond to requests for information about data use or results, c) cite data source, ADNI and ADNI funding sources
Image from http://adni.loni.usc.edu/data-samples/
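The access rules on this slide (a 30-day quarantine cap and a signed data-use agreement) can be sketched as a small policy check. This is a hypothetical model for illustration only, not ADNI's actual system: the `Investigator` class, the function names, and the rule that data are posted as soon as quality control finishes are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# From the slide: new data are quarantined for at most 30 days for QC review.
QUARANTINE_MAX_DAYS = 30


@dataclass
class Investigator:
    """Hypothetical record for a person requesting data."""
    name: str
    signed_data_use_agreement: bool


def latest_posting_date(received: date) -> date:
    """Last day by which data received on `received` should be posted."""
    return received + timedelta(days=QUARANTINE_MAX_DAYS)


def is_posted(received: date, today: date, qc_complete: bool) -> bool:
    """Assumed rule: posted once QC finishes, or when the 30-day cap is reached."""
    return qc_complete or today >= latest_posting_date(received)


def may_access(user: Investigator, received: date, today: date,
               qc_complete: bool) -> bool:
    """Access requires both a signed agreement and posted data."""
    return user.signed_data_use_agreement and is_posted(received, today, qc_complete)


alice = Investigator("Alice", signed_data_use_agreement=True)
bob = Investigator("Bob", signed_data_use_agreement=False)
received = date(2015, 2, 1)

print(latest_posting_date(received))
print(may_access(alice, received, date(2015, 2, 10), qc_complete=False))
print(may_access(alice, received, date(2015, 2, 10), qc_complete=True))
print(may_access(bob, received, date(2015, 3, 10), qc_complete=True))
```

Keeping the agreement check separate from the posting check mirrors the slide, where equal access for all qualified investigators is stated independently of the quarantine rule.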
ADNI and Big Data
• New effort announced in 2012 extends ADNI project to include whole genome sequences of 800 individuals from ADNI GO and ADNI2
• Sequencing will result in 165+ TB of data
• Raw data will be made available to qualified scientists to mine for novel targets for risk assessment, new therapies, etc.
• New data will enable researchers to explore correlations between gene sequence data and ADNI data about disease markers, indicators and changes
Image: ADNI PET Core
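To give a sense of scale, the figures above imply a rough per-person data volume. The sketch below treats "165+ TB" as a lower bound and assumes decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
# Figures quoted on the slide.
total_tb = 165        # "165+ TB", treated here as a lower bound
individuals = 800     # whole genome sequences from ADNI GO and ADNI2

GB_PER_TB = 1000      # assumption: decimal units

per_individual_gb = total_tb * GB_PER_TB / individuals
print(f"At least ~{per_individual_gb:.0f} GB of sequence data per individual")
```

That is roughly 200 GB per participant, which is why the slide frames whole genome sequencing as a "big data" extension of the project.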
Business Model
• Innovative Public – Private Partnership. Originally funded for $60M
– $40M from the National Institute on Aging and the National Institute of Biomedical Imaging and Bioengineering
– $20M from the pharmaceutical industry and several foundations, donated to the Foundation for the NIH
– ADNI member VA Medical Center, SF also supported by VA Office of R&D
• New phases of project bring in new funding (“relay” model)
ADNI Funding: Public, private, foundations, US, Canada
• NIA and NIBIB from the NIH
• $20M from the pharmaceutical industry donated to the Foundation for the NIH
• Private Partner Scientific Board
Data-driven Biomedical research
Atul Butte – Biomedical research in a data-rich world / TedMed 2014
https://www.youtube.com/watch?v=dtNMA46YgX4
A brave new (data-driven) world …
• Opportunities
– New forms of data input (fitbit, cell phone, Internet of Things)
– Expanding notion of relevant data (environmental, behavioral, social, etc.)
– New approaches to prevention, prediction, monitoring, disease exploration and treatment, etc.
– New players – crowd-sourcing, commercial medical devices and analysis, etc.
• Challenges
– Appropriate legal and policy constraints
• What can be shared and with whom?
• What is a medical device?
• What must be private?
• Who owns / has control?
– Interoperability
• ICU = “Internet of Things” – how to integrate distinct data sources
• Multiple record-keeping / SW approaches – electronic records, cancer centers
– Valid analysis
• What models are valid representations?
• How can data be best used for useful results?
Lecture Materials (in addition to those designated on slides)
• Lecture Materials (all linked on course website)
– 2013 Protein Data Bank Annual Report http://www.rcsb.org/pdb/general_information/news_publications/annual_reports/annual_report_year_2013.pdf
– The Protein Data Bank, Nucleic Acids Research, 2000 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC102472/
– The Future of the Protein Data Bank, Biopolymers, volume 99, http://onlinelibrary.wiley.com/doi/10.1002/bip.22132/full
– “The Protein Data Bank and lessons in Data Management,” Briefings in Bioinformatics, http://www.sdsc.edu/pb/papers/briefings04.pdf
– “The informatics core of the Alzheimer’s Disease Neuroimaging Initiative”, A. Toga, K. Crawford and the ADNI Laboratory of NeuroImaging, Alzheimer’s Dement. 2010 May: 6(3) [See Fran for author’s copy]
– PDB: www.rcsb.org
– ADNI: http://www.adni-info.org/
– Atul Butte TedMed 2014 talk https://www.youtube.com/watch?v=dtNMA46YgX4
Two weeks: L4 Data Roundtable March 13
• “M-health: Health and appiness”, The Economist, http://www.economist.com/news/business/21595461-those-pouring-money-health-related-mobile-gadgets-and-apps-believe-they-can-work (Charles Hathaway)
• “The Parable of Google Flu: Traps in Big Data Analysis”, Science (March, 2014), http://gking.harvard.edu/publications/parable-google-flu%C2%A0traps-big-data-analysis (Philip Cioni)
• “Open Science Champion of Change” Sage Bionetworks’ Stephen Friend: Honored at White House Event”, Business Wire, http://www.businesswire.com/news/home/20130624006494/en/%E2%80%9COpen-Science-Champion-Change%E2%80%9D-Sage-Bionetworks%E2%80%99-Stephen (Alex Karcher)
• “Can big data cure cancer?”, Fortune, http://fortune.com/2014/07/24/can-big-data-cure-cancer/ (Robert Stephens)
• “I had my DNA picture taken, with varying results,” The New York Times, http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html?pagewanted=all&_r=0 (Kate McGuire)
March 20: L5 (Data and Entertainment) Data Roundtable
• “Management Secrets of the Grateful Dead”, The Atlantic http://www.theatlantic.com/magazine/archive/2010/03/management-secrets-of-the-grateful-dead/307918/ (Karl Appel)
• “The Shazam Effect”, The Atlantic, http://www.theatlantic.com/magazine/archive/2014/12/the-shazam-effect/382237/?single_page=true (Sumit Munshi)
• “At Disney Parks, a Bracelet Meant to Build Loyalty (and Sales)”, The New York Times, http://www.nytimes.com/2013/01/07/business/media/at-disney-parks-a-bracelet-meant-to-build-loyalty-and-sales.html?pagewanted=all (Yusri Jamaluddin)
• “Dancing Data”, re/code, http://recode.net/2014/01/28/dancing-data/ (Miguel Inoa-Lantigua)
• “Here’s How Piracy Hurts Indie Film”, Indiewire, http://www.indiewire.com/article/guest-post-heres-how-piracy-hurts-indie-film-20140711 (Oskari Rautiainen)