Worldwide Protein Data Bank
www.wwpdb.org
wwPDB Common D&A Project January 28, 2010
Steering Committee
Project Update
Worldwide Protein Data Bank
Common D&A Project January 2010 Update
Update report
Status of D&A initial production deliverable:
– Sequence Editor tool development
– Integration within existing pipelines
Status of WF infrastructure initial implementation:
– Sequence Processing components (external search, internal analysis, etc.) integrated by WF engine and manager into the “new” Sequence Processing Module
– Integration of Sequence Processing Module into existing pipeline
RECONSIDER Timeline Estimate and Strategy
Next Phase
– Ligand Processing: Planning
Overview of deliverable status for: Sequence Editor tool
Deliverable timelines have been extended to enable a full response to user testing input (expanded requirements) and to ensure development to the agreed-upon design.
Completion of Interface with additional prioritized requirements - projected Feb 15
Integration within current production pipelines
– Initial implementation of Master Format and format conversion support
In use by annotators by Feb 25
Sequence Editor Tool Technologies and Standards
Model View Controller (MVC) Design
– Separates data/application from presentation as much as possible
Client/Server protocol
– AJAX using JSON protocol
– REST style service definitions
Server
– Apache with embedded WSGI (mod_wsgi)
Application
– Python with C++ extensions (Boost/Python)
All the good acronyms!
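To illustrate how the pieces on this slide could fit together, here is a minimal sketch of a WSGI application that answers a REST-style GET with JSON. The URL shape `/sequence/<entry_id>`, the in-memory `SEQUENCES` dictionary, and the entry ID `demo1` are assumptions invented for this example; they are not the project's actual service definitions.

```python
import json

# Hypothetical in-memory store standing in for the real sequence data store.
SEQUENCES = {"demo1": "MKTAYIAKQR"}


def application(environ, start_response):
    """WSGI entry point; mod_wsgi looks for a callable named `application` by default."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    is_get = environ.get("REQUEST_METHOD") == "GET"
    if is_get and len(parts) == 2 and parts[0] == "sequence":
        sequence = SEQUENCES.get(parts[1])
        if sequence is not None:
            status, payload = "200 OK", {"id": parts[1], "sequence": sequence}
        else:
            status, payload = "404 Not Found", {"error": "unknown entry"}
    else:
        status, payload = "400 Bad Request", {"error": "unsupported request"}
    # Serialize the response as JSON so browser-side AJAX code can parse it.
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

Under Apache, mod_wsgi would load a script exposing this callable, and the browser-side view would only ever see JSON, which is one way to keep data and presentation apart.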
Sequence Editor Tool Architecture for Current and Future Deployment
[Diagram. Components shown: Sequence Data Store; Current DP Pipeline; Future Workflow DP Pipeline; WFE/WFM; Sequence Editor Tool; Annotated Sequence Data. Data formats shown: PDB/FASTA, PDBx/PreBlast, PDB/PDBx.]
Accomplishments
Annotator graphical interface for Sequence Editing
– Prototype evaluation and prioritization of additional requirements by Annotators at all sites completed Jan 12
– Expanded functionality development expected to be completed and available for user testing Feb 15, including:
• Capability to incrementally undo a process step (UNDO)
• Summarization of sequence conflicts
• Global editing features
Integration of this Sequence Editor tool (interface) into the existing data processing pipelines (Feb 26)
– Input accepts existing sequence data files at PDBe and RCSB (e.g. PDBx + Blast report or PDB + FASTA)
– Output integration via intermediate file to be integrated via MAXIT
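The incremental UNDO capability listed above can be pictured as a stack of earlier states. The following is only a sketch under that assumption; the class name `SequenceEditSession` and its methods are hypothetical and do not describe the actual Sequence Editor code.

```python
class SequenceEditSession:
    """Tracks single-residue edits so they can be undone one step at a time."""

    def __init__(self, sequence):
        self.sequence = sequence
        self._history = []  # stack of earlier sequence states

    def replace_residue(self, index, residue):
        """Replace the residue at a 0-based index, remembering the prior state."""
        if not 0 <= index < len(self.sequence):
            raise IndexError("residue index out of range")
        self._history.append(self.sequence)
        self.sequence = self.sequence[:index] + residue + self.sequence[index + 1:]

    def undo(self):
        """Revert the most recent edit; return False if nothing is left to undo."""
        if not self._history:
            return False
        self.sequence = self._history.pop()
        return True


session = SequenceEditSession("MKTAYI")
session.replace_residue(1, "R")  # first edit
session.replace_residue(5, "L")  # second edit
session.undo()                   # reverts only the second edit
```

Storing whole states keeps the sketch short; a real editor handling long sequences might store the individual changes instead.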
Accomplishments
Master Format implementation (for current data model)
– PDB to Master Format translation working with MAXIT
Final Test at PDBe
– Validation and testing at all sites
– PDBj creation of new tool for Master Format Validation with extended diagnostics
– Issues with Master Format will be ongoing, with evolution of the PDB format, Hybrid methods, etc.
Sequence Editor Tool Development: Lessons Learned
Iterative development and active Annotator involvement are essential, and they take time.
Addressing integration issues with existing systems in terms of modularity, process ordering and data availability poses significant challenges.
Agile process of development and planning supports adaptation to evolving requirements.
We will need to further consider the most efficient level of granularity for the deployment of new functionality in existing systems in future planning.
Design Convergence Accomplishments
Master Format, API, WFM, WFE, UI
Accomplishments: WF infrastructure - Integration of Sequence Processing
Tracking and Status DB developed and installed at RCSB and PDBe for development purposes.
Work Flow Manager (WFM)
– Prototype user testing ongoing
– Requirements refined and prototype updated
– Infrastructure complete, to be deployed for testing this week
Work Flow Manager User Interface (WFM UI)
– User prototype created, input received and prototype enhanced
– Initial Level 1 annotator interface signed off by annotators
– Level 2/3/4 interfaces prototyped and under review
– Level 3/4 under further development
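As a rough idea of what a tracking and status store does, the sketch below records and reads back the state of one deposition. It uses Python's built-in `sqlite3` module as a stand-in for the project's database server, and the table name, column names, status values, and the ID `D_0001` are all assumptions made for illustration.

```python
import sqlite3


def open_tracking_db():
    """Create an in-memory database with a hypothetical status table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deposition_status ("
        " dep_id TEXT PRIMARY KEY,"
        " module TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def set_status(conn, dep_id, module, status):
    """Insert or overwrite the current status of one deposition."""
    conn.execute(
        "INSERT OR REPLACE INTO deposition_status (dep_id, module, status)"
        " VALUES (?, ?, ?)",
        (dep_id, module, status),
    )
    conn.commit()


def get_status(conn, dep_id):
    """Return (module, status) for a deposition, or None if it is unknown."""
    return conn.execute(
        "SELECT module, status FROM deposition_status WHERE dep_id = ?",
        (dep_id,),
    ).fetchone()


conn = open_tracking_db()
set_status(conn, "D_0001", "sequence", "waiting_for_annotator")
set_status(conn, "D_0001", "sequence", "finished")  # overwrites the earlier row
```

A workflow manager interface could poll a table like this to present progress to annotators.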
PDBe resource
Workflow XML
– Luana/Tom: 1 day total to complete annotator requirements
WFE component supporting Sequence Processing
– Tom: 1-2 days per week ongoing, estimating 5-6 days (3 actual weeks) to complete after all APIs are in place
WFM
– Luana: currently full time; work is being prioritised to define the subset of requirements to be delivered in March
Web resources: interfaces and WFM
– External services: technology requirements have been defined. Timeline tbd. Critical Path.
Other resources
– Wim: Python expertise
– Swanand: Python expertise (after 13th Feb) – fall-back
RCSB Resources
Web Tools
– Currently supporting development and alpha-testing sites
– Will add production site for Feb deployment
Database Support
– MySQL database server for status and tracking database
Application Support
– Project SVN code repository
– JIRA issue tracking system
– Project documentation and information site (Drupal)
– Automated build system for API and application tools
People
– Vladimir: API and build system (Python/C++)
– Li: DB system and status and tracking API (Python/SQL)
– Rahip: Sequence Editor Tool (JavaScript/CSS)
– Zukang/Raul/John: DP applications (C++/Python)
Updated Timeline Summary
Sequence Processing
1. Sequence Editor Tool
– Completion of Interface with prioritized additional requirements and beginning of final user testing - projected Feb 15
– Integration with current pipelines using Master Format - in test by annotators by Feb 25
– In production - best estimate early March
2. Integration of Sequence processing components with new architecture (WFE/API and WFM)
– User testing - April
3. Integration of module into Pipeline
– Plan by end of March
Competing/Complementary Priorities
Address ongoing data quality issues and remediation
Three Validation task forces
– Implementation of recommendations
New PDB Format – within the next 6 months?
De-programming Kim
– For Ligand Processing: timeline end of March – early April
Other strategic considerations
Stakeholders
– Stress testing of new solutions against expectations and existing solutions must be managed and will take some time.
Next Phase - Timeline: Ligand Processing
Requirements
– Plans in place for Annotator exchange
– March: requirements consolidation, initial design plan
– March: create overview plan and initial timeline
Kick off development
Deployment
– Strategy to be defined based on current and ongoing lessons learned.
Things that have kept us up at night
These are cornerstone deliverables requiring intense study and design consideration, beyond the proof of concept.
– Organization of data, communication protocols, etc.
– Clear consensus of design features has required an evolution of understanding, requiring wetting of hands
Ramp up of skill sets: Python, mmCIF (PDBe)
EBI External services: web-service set up
Site specific integration challenges
Resource issues
BACK UP SLIDES
Data and Application API Design
Unified Python language implementation
Provides all access to data and applications for the workflow manager and workflow engine
Subcomponents of the API provide access to:
– Data objects and data values
– Applications and tools
– Tracking and status information
– Site level configuration information
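One way to picture an API with these four subcomponents is a single facade object that the workflow manager and engine both go through. The sketch below is purely illustrative: every class, method, and value name in it (`WorkflowAPI`, `DataAccess`, `ToolAccess`, `StatusAccess`, `example-site`, `D_0001`) is hypothetical and not taken from the project's code.

```python
class DataAccess:
    """Hypothetical access to data objects and their values."""

    def __init__(self):
        self._objects = {}

    def put(self, name, value):
        self._objects[name] = value

    def get(self, name):
        return self._objects[name]


class ToolAccess:
    """Hypothetical registry of applications and tools, stored as callables."""

    def __init__(self):
        self._tools = {}

    def register(self, name, func):
        self._tools[name] = func

    def run(self, name, *args):
        return self._tools[name](*args)


class StatusAccess:
    """Hypothetical tracking and status log."""

    def __init__(self):
        self._events = []

    def record(self, dep_id, state):
        self._events.append((dep_id, state))

    def latest(self, dep_id):
        # Walk the log backwards to find the most recent state for this ID.
        for event_id, state in reversed(self._events):
            if event_id == dep_id:
                return state
        return None


class WorkflowAPI:
    """Single entry point grouping the four kinds of access named on the slide."""

    def __init__(self, site_config):
        self.data = DataAccess()
        self.tools = ToolAccess()
        self.status = StatusAccess()
        self.config = dict(site_config)  # site level configuration


api = WorkflowAPI({"site": "example-site"})
api.data.put("sequence", "mktayi")
api.tools.register("uppercase", str.upper)
api.data.put("sequence", api.tools.run("uppercase", api.data.get("sequence")))
api.status.record("D_0001", "sequence_processed")
```

Routing everything through one object means callers never touch files, tools, or the database directly, which is the property the slide describes.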
Deliverable update: WFM Design
Functional Architectural design
Will present progress and tracking information
Will start/stop and restart the workflow engine in executing data processing tasks
Will work in a fully distributed web-based mode
Will provide a launch point for tasks requiring interactive or graphical interactions. Two modes defined:
• Immediate mode – all processing occurs in a single session (simple case).
• Deferred mode – requests for input are registered with the workflow manager for later processing by annotator
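The difference between the two modes can be sketched with a toy manager: in immediate mode the request is answered within the same call, while in deferred mode it is queued until an annotator works through it. The class `WorkflowManager`, its method names, and the task names are assumptions for illustration only.

```python
from collections import deque


class WorkflowManager:
    """Toy manager illustrating immediate and deferred handling of input requests."""

    def __init__(self):
        self.pending = deque()  # deferred requests awaiting an annotator

    def request_input(self, task, answer_now=None):
        """Resolve a request at once if an answer callback is given, else defer it."""
        if answer_now is not None:
            return answer_now(task)  # immediate mode: handled in the same session
        self.pending.append(task)    # deferred mode: registered for later
        return None

    def process_pending(self, answer):
        """Let an annotator work through all deferred requests in arrival order."""
        results = {}
        while self.pending:
            task = self.pending.popleft()
            results[task] = answer(task)
        return results


manager = WorkflowManager()
immediate = manager.request_input("confirm-sequence", answer_now=lambda task: "accepted")
manager.request_input("resolve-conflict")   # deferred
manager.request_input("choose-reference")   # deferred
deferred = manager.process_pending(lambda task: "done:" + task)
```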
Process Overview
With GO BACK functionality