Finding our way in information space
Phil Ashworth
Phil Scordis
UCB: The Next Generation Biopharmaceutical Leader
R&D activities at 10 global sites
R&D Headcount = 2,100 (August 2007)
Braine (Be)
Atlanta (US)
Bulle (CH)
Tokyo (Jap)
Slough & Cambridge (UK)
RTP (US)
Rochester (US)
Shannon (Ire)
Monheim (De)
Global biopharmaceutical company with specialist focus: Neurology, Inflammation and Oncology
Proven sales and marketing – creating global brands
• Keppra®, Xyzal®, Zyrtec®
Revenues of €3.5 billion in 2006 (pro forma)
Successfully transformed with:
• Celltech acquisition in 2004
• Integration of SCHWARZ PHARMA in September 2007
Over 10,000 employees across more than 40 countries
Listed on EURONEXT (Brussels); current market cap of €7.5 bn
Apology
Health Warning
• We are still in the middle of all of this, I don’t have all of the answers
History
Research and Development in UCB
• Comes from integration of Schwarz Pharma, Celltech, OGS, Chiroscience, Darwin
Variety of data source issues
• Silos, vendor systems, structured, un-structured etc.
Data integration
• A mess of legacy approaches and many situations where no attempt has been made.
• To warehouse or not to warehouse?
  • After a rollout of a research warehouse, at least two distinct examples of different working practice “break” the model
• Difficult to extend and rebuild warehouses. – Just another rigid system
Principles and Ideals of the Semantic Web
“The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [Tim Berners-Lee et al 2001]
Ideal environment
• Starting from scratch, building connectivity
• Start defining the problem space from a blank page
How applicable is this attractive approach to us?
Let's find out…
The Dream
What did we want?
• Getting UCB’s pipeline to market faster
• Better ROI: an environment in which investment in data generation can be exploited to the full.
• Breaking down data boundaries
Major Areas for Improvement
• Operational Orchestration
• Data Integration
• Knowledge discovery and creation
The fantasy
• Legacy systems remain in place where appropriate
• Data integration is seamless and facilitates aggregation and querying based on the meaning of the data
• Facilitated exploration of data and exploitation of connections
Starting the journey
Heard of others oscillating around the semantic vs warehouse question
• Large investment in both technologies, building components, rolling out home built solutions
Our initial investment
• Minimal resource
• Limited to vendor applications (best of breed) rather than building our own
  • But not the all-or-nothing approach offered by some
Our learning curve has been steep
• Made many mistakes
• Visited many dead ends
• Experienced limitations first hand
• Had many frustrations
Data Integration was our key goal
Where to start
Principles of the Semantic Web
• Understanding the concepts of semantics – so much reading.
Semantic Technologies
• Differences between the semantic and OO mindsets
Academia
• Some nice projects, but not enterprise oriented
Data Integration
• RDF
  • Has desirable flexibility and inherent potential for integration
• OWL
  • Builds on top of RDF: potential for a rich descriptive framework, plus the power of DL to facilitate knowledge discovery through reasoning
  • Making connections
• But our data is in relational systems!
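As a rough illustration of the flexibility claimed for RDF above, the sketch below treats a graph as a plain set of (subject, predicate, object) tuples, so that combining two sources is just a set union. The source names, identifiers and the `match` helper are invented for this example and do not come from any UCB system or RDF library.

```python
# Triples are plain (subject, predicate, object) tuples; a graph is a set of them.
# All identifiers below are made up for illustration.
assay_db = {
    ("ex:compound/42", "ex:testedIn", "ex:assay/7"),
    ("ex:assay/7", "ex:target", "ex:protein/P1"),
}
chemistry_db = {
    ("ex:compound/42", "ex:name", "Compound 42"),
    ("ex:compound/42", "ex:molWeight", 312.4),
}

# Integration is a set union: the shared subject identifier links the two
# sources without any schema having to be redesigned first.
merged = assay_db | chemistry_db


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (
            t
            for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ),
        key=str,  # objects can be strings or numbers, so sort on their text form
    )


# Everything known about one compound, regardless of which source said it.
about_compound = match(merged, s="ex:compound/42")
```

In practice the hard part is agreeing on the shared identifiers; the union itself is the easy step.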
How to integrate: Getting RDF from RDB
RDF from RDB
• D2RQ
  • Offered the ability to read/query relational databases as RDF
  • Limitations
    • Open source.
    • Didn’t work on real world databases in our hands
    • Concerns about query speed when using multiple data sources. Wanted an asynchronous distributed environment
    • Reasoning very slow across multiple data sources (forward chaining)
• Cerebra server
  • Tantalising prospect. A dead-end? Recent changes within the company meant that the direction for the tool was uncertain.
• SDS – Interesting prospect (www.insilicodiscovery.com)
• Integrated query environment across a variety of data sources (relational, excel, web services etc.)
• Distributed asynchronous computing model
• No RDF!
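To make "RDF from RDB" concrete, here is a minimal sketch that turns each non-null cell of a relational table into one triple. It is not how D2RQ, Cerebra or SDS work internally; the `rows_to_triples` helper, the `compound` table and the `http://example.org/` namespace are all assumptions made for the example.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not a real UCB one


def rows_to_triples(conn, table, key_column):
    """Yield one (subject, predicate, object) triple per non-null cell.

    The row's key becomes the subject URI, each other column becomes a
    predicate, and the cell value becomes the object.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            if column != key_column and value is not None:
                yield (subject, f"{BASE}{table}#{column}", value)


# A toy in-memory database standing in for a legacy relational system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [(1, "cmpd-A", "P1"), (2, "cmpd-B", None)],
)
triples = list(rows_to_triples(conn, "compound", "id"))
```

Because the function is a generator, triples are produced on demand rather than copied into a second store, which is the general idea behind reading a relational database "as RDF".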
How to integrate: RDF Stores / Warehouse
Triple stores
• AllegroGraph – Franz.
• Sesame
Problems
• Immature technology
  • Data volumes are limited relative to life science data volumes
• Security and backup – primitive
• Limited integration with other tools.
  • Needed tighter integration: queries not being carried out directly in RDF stores. Again slow queries & reasoning from tools due to forward chaining.
• Still have data duplication issues and requirements for ETL processes
One step forward, two steps back!
How to integrate: Development Tools
Few professional development and deployment environments
• Roll your own vs the use of open source
Protégé
• Great for model development but lacked integration with other tools (when we looked)
TopBraidComposer - TopQuadrant
• Excellent functionality out of the box. Easy interface, file imports, navigation etc.
• Integrated with a variety of third party systems.
  • D2RQ, AllegroGraph, Sesame, Jena, Oracle
• But still could not do everything we wanted it to.
• TopQuadrant supported our limited resource to enhance our understanding and knowledge.
• TopBraidLive: one of the first development -> deployment applications
Reasoners
• Looked at several; each had its quirks
• None behaved as we expected or wanted at the data volumes we had.
• Used rules to achieve what we needed.
  • Isn’t this cheating?
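A small sketch of what rule-based forward chaining amounts to may help explain both points above: hand-written rules can stand in for a full reasoner, and the approach gets slow because every pass rescans all known facts. The `forward_chain` and `transitive` helpers and the toy class hierarchy are invented for illustration and are not taken from any of the reasoners evaluated.

```python
def forward_chain(triples, rules):
    """Apply every rule repeatedly until no new triple appears (a fixpoint)."""
    inferred = set(triples)
    while True:
        new = set()
        for rule in rules:
            new |= rule(inferred)
        new -= inferred  # keep only facts we did not already have
        if not new:
            return inferred
        inferred |= new


def transitive(predicate):
    """Build a rule: (a p b) and (b p c) together imply (a p c)."""
    def rule(graph):
        edges = [(s, o) for s, p, o in graph if p == predicate]
        # Naive pairwise join over all edges: quadratic work on every pass.
        return {(a, predicate, d) for a, b in edges for c, d in edges if b == c}
    return rule


# Toy hierarchy; the class names are illustrative only.
facts = {
    ("kinase", "subClassOf", "enzyme"),
    ("enzyme", "subClassOf", "protein"),
    ("protein", "subClassOf", "macromolecule"),
}
closure = forward_chain(facts, [transitive("subClassOf")])
```

All inferred facts are materialised up front, so query time is cheap but the closure has to be recomputed whenever the underlying data changes, which fits the slow reasoning seen across multiple data sources.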
Stop the journey – we are getting off
We have tried to achieve data integration chasing several avenues
• RDF from RDB
• RDF warehouse
  • Via RDB data -> txt -> RDF -> RDF Store
• Semantic SOA, another approach
  • Pragmatic semantics
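The "txt -> RDF" step of the warehouse route listed above can be pictured as writing exported rows out in a line-based RDF syntax such as N-Triples, ready for bulk loading into a store. This is a simplified sketch, not the ETL process actually used: `nt_term`, `to_ntriples` and the example URIs are hypothetical, every non-URI value is written as a plain string literal, and URIs are assumed to be well formed already.

```python
def nt_term(value):
    """Format a value as an N-Triples term.

    Simplification for this sketch: any string starting with http:// or
    https:// is treated as an IRI; everything else becomes a plain literal.
    """
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return f"<{value}>"
    text = str(value)
    # Escape the characters that N-Triples requires to be escaped in literals.
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return f'"{text}"'


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "".join(
        f"{nt_term(s)} {nt_term(p)} {nt_term(o)} .\n" for s, p, o in triples
    )


# Rows as they might come out of a relational export (values are invented).
rows = [
    ("http://example.org/compound/1", "http://example.org/compound#name", "cmpd-A"),
    ("http://example.org/compound/1", "http://example.org/compound#target",
     "http://example.org/protein/P1"),
]
ntriples_text = to_ntriples(rows)
```

Every hop in such a pipeline produces another copy of the data, which is where the data duplication and ETL concerns come from.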
Now we understand the messages others have been trying to pass
• Blowing hot and cold on the whole idea
• Wavering over semantic vs conventional warehousing
• Heavy investment in home brew technology or enterprise environment
Is this a dead end?
The end
Thanks for coming …
Hang on, we are not giving up yet
RDF Stores
Ontologies
Data Integration Tools
Delivery Tools
Development Tools
Visualisation
We decided to persevere
• But we still don’t have a large amount of resource to throw at this
• We need to take a different path
  • Community action
  • Collaboration
• There is a vibrant and active community out there
  • W3C …
  • Involved in direction and calling for standards
So where are we today?
Driving change
TopBraidComposer - A semantic development environment using open source and limited data integration tools.
• Help with SDS
• Tighter integration with RDF stores
  • TQ also had to drive other vendors to provide functionality for them
• Many other changes as we pushed the boundaries of the tool
• TopBraidLive looks very promising as an easy deployment environment
SDS - A data integration platform, enterprise ready, lacking a semantic direction
• SPARQL integration (not just RDF from RDB, but RDF from RDB, Excel and web services)
  • We believe this is key to our future strategy
• Changes to their interfaces, tools and capabilities
• Integration with TBC
UCB is driving collaborative development
• Helping bring companies together (A big thank you to TQ and ISD)
• Helping drive the community
In Summary
The semantic wave is too large to surf alone
• Too unpredictable to control
There are some big hurdles to overcome
• Integration, tools, enterprise solutions, visualisation, orchestration
However we are committed to helping make things happen
• Always on the lookout for open-minded enthusiasts
• Committed to contribute to the community
Still believe that Semantic Technologies are part of the solution
• But it is not just something we can adopt (at the moment)
• It is still something we have to help forge so others can be adopters.
Thank you
Any advice? Questions?