Big Data Supporting Drug Discovery
Cautionary Tales from the World of Chemistry for Translational Informatics
Valery Tkachenko
RSC-CSIR/OSDD meeting
Pune, India
February 3rd 2014
• Big Data
• Chemical Space
• Drug Discovery pipeline
• Machine learning
• Training sets
• RSC/ChemSpider platforms
• RSC/Archive
• Research data management
• Data quality, crowdsourcing and AltMetrics
• Building Global Chemistry Network
Chemical space - 10^60
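To give a sense of scale, the short sketch below compares the commonly quoted estimate of about 10^60 drug-like molecules with the roughly 30 million compounds that this talk later quotes for ChemSpider. The function name `coverage_fraction` is a hypothetical helper introduced only for this illustration, and both figures are order-of-magnitude estimates rather than exact counts.

```python
from fractions import Fraction

# Order-of-magnitude figures: 10^60 is the chemical-space estimate from the
# slide; ~30 million is the ChemSpider size quoted later in the talk.
CHEMICAL_SPACE_SIZE = 10**60
CHEMSPIDER_COMPOUNDS = 30_000_000


def coverage_fraction(known: int, space: int) -> Fraction:
    """Return the exact fraction of the space covered by the known compounds."""
    return Fraction(known, space)


if __name__ == "__main__":
    fraction = coverage_fraction(CHEMSPIDER_COMPOUNDS, CHEMICAL_SPACE_SIZE)
    # Print in scientific notation so the exponent is easy to read.
    print(f"Fraction of chemical space catalogued: {float(fraction):.1e}")
```

Even a database of tens of millions of structures covers a vanishingly small part of the estimated space, which is why navigation and prioritisation matter more than exhaustive enumeration.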
Navigation in chemical space
Structure-based Drug Design
Ligand-based Drug Design
Machine learning
Applied machine learning
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure centric hub for web-searching
ChemSpider
Properties - experimental
Properties - ACD/Labs
Properties - EPI Suite
Properties - ChemAxon
Literature references
Patent references
Books
Classification
Chemical vendors and data sources
Multimedia
ChemSpider Reactions
ChemSpider Spectra
ChemSpider Databases
ChemSpider Compounds
ChemSpider Reactions
ChemSpider Spectra
ChemSpider Crystals
ChemSpider Materials
ChemSpider Assays
ChemSpider Algorithms
Research data inflow
Deposition Gateway
Staging databases
Compounds
Reactions
Spectra
Materials
Articles / CSSP
Compounds Module
Spectra Module
Reactions Module
Materials Module
Text-mining Module
!͙Module
Web UI for unified depositions
Dropbox, Google Drive, SkyDrive, etc.
LabTrove and other templated data
Documents
API, FTP, etc.
Raw data
Validated data
Staging databases
All databases are sliced by data source or data collection and have a simple security model in which each data slice/source is private, public, or embargoed.
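The three-state security model described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the RSC implementation: the names `Visibility`, `DataSlice`, and `can_read` are hypothetical, and the rule that an embargoed slice becomes readable to everyone once its embargo date has passed is an assumption that the slide does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    """One data source / data collection within a database (hypothetical model)."""
    name: str
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED slices


def can_read(data_slice: DataSlice, user: str, today: date) -> bool:
    """Decide whether `user` may read records from `data_slice` on `today`."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if user == data_slice.owner:
        # Owners always see their own private or embargoed data.
        return True
    if (data_slice.visibility is Visibility.EMBARGOED
            and data_slice.embargo_until is not None):
        # Assumption: an embargo simply lifts on the given date.
        return today >= data_slice.embargo_until
    return False


if __name__ == "__main__":
    thesis = DataSlice("thesis-spectra", "alice", Visibility.EMBARGOED,
                       embargo_until=date(2015, 1, 1))
    print(can_read(thesis, "bob", date(2014, 2, 3)))
    print(can_read(thesis, "bob", date(2015, 6, 1)))
```

Attaching visibility to the slice rather than to individual records keeps the access check to a single lookup per data source, which matches the "simple security model" wording on the slide.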
Research data outflow
Compounds Reactions Spectra Materials Documents
Compounds API
Reactions API
Spectra API
Materials API
Documents API
Compounds Widgets
Reactions Widgets
Spectra Widgets
Materials Widgets
Documents Widgets
Data tier
Data access tier
User interface components tier
User interface tier (examples)
Analytical Laboratory application
Electronic Laboratory Notebook
Paid 3rd party integrations (various platforms – SharePoint, Google, etc.)
Chemical Inventory application
RSC Archive – since 1841
DERA - Digitally Enabling RSC Archive
Semantic mark-up of articles
It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known pathways?
Working on now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
– Automated quality control system
DrugBank dataset (6516 records)
J. Brecher, IUPAC, Graphical Representation of Stereochemical Configuration, Section ST-1.1.10
DB06287
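The idea of an automated quality control system can be sketched as a set of small rules run over every record in a dataset. The sketch below is a minimal illustration, not CVSP itself: the record layout (`id`, `smiles`), the rule names, and the two checks are all hypothetical, and real validation of structures and stereochemistry needs a cheminformatics toolkit rather than string checks.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, str]
Finding = Tuple[str, str]                      # (severity, message)
Rule = Callable[[Record], Optional[Finding]]


def missing_structure(rec: Record) -> Optional[Finding]:
    """Flag records that carry no structure string at all."""
    if not rec.get("smiles", "").strip():
        return ("error", "no structure supplied")
    return None


def unbalanced_brackets(rec: Record) -> Optional[Finding]:
    """Flag structure strings whose () or [] pairs do not match up."""
    smiles = rec.get("smiles", "")
    for opener, closer in ("()", "[]"):
        depth = 0
        for ch in smiles:
            if ch == opener:
                depth += 1
            elif ch == closer:
                depth -= 1
            if depth < 0:          # a closer appeared before its opener
                break
        if depth != 0:
            return ("error", f"unbalanced {opener}{closer} in structure string")
    return None


# Hypothetical rule set; a real system would have many more rules.
RULES: Tuple[Rule, ...] = (missing_structure, unbalanced_brackets)


def validate(records: Iterable[Record],
             rules: Tuple[Rule, ...] = RULES) -> List[Tuple[str, str, str]]:
    """Run every rule over every record; return (record id, severity, message)."""
    report = []
    for rec in records:
        for rule in rules:
            finding = rule(rec)
            if finding is not None:
                report.append((rec["id"], finding[0], finding[1]))
    return report


if __name__ == "__main__":
    sample = [  # made-up records, not taken from DrugBank
        {"id": "REC-1", "smiles": "CC(=O)O"},
        {"id": "REC-2", "smiles": "CC(=O"},
        {"id": "REC-3", "smiles": ""},
    ]
    for issue in validate(sample):
        print(issue)
```

Keeping each rule as an independent function means new checks can be added without touching the loop, and the same report format can be produced for any dataset.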
Research data management
University 1
Data Hub
Workstations
University 2
Data Hub
Workstations
Company 3
Data Hub
Workstations
Data Repository indexed storage
Data Repository provided data storage
Chemically intelligent services
Indexes
Data
External clients
Publishers
Scientists
Funding bodies
Crowdsourcing
AltMetrics
RSC/Rewards and Recognition
Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.
The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
Visualization
Visualization and navigation
We are a part of a larger world
ChemSpider APIs
National Chemistry Database
http://www.openphacts.org
Open PHACTS is an Innovative Medicines Initiative (IMI) project, aiming to reduce the barriers to drug discovery in industry, academia and for small businesses.
The Semantic Web is one of the cornerstones
OSDD