biothings presentation

Biothings.api https://github.com/SuLab/biothings.api Generalizing MyGene and MyVariant

Upload: cyrus-afrasiabi

Post on 12-Feb-2017


TRANSCRIPT

Page 1: Biothings presentation

Biothings.api
https://github.com/SuLab/biothings.api

Generalizing MyGene and MyVariant

Page 2: Biothings presentation

History

http://biogps.org/ is a user-defined and user-extensible tool to analyze genes. Given a gene of interest, different people are interested in different data about that gene. BioGPS lets you select and display the data you are interested in about your gene.

To power the backend queries, a database of gene information was abstracted from BioGPS. The database contained information about all genes, aggregated on Entrez Gene ID and kept up to date with weekly updates.

To access the data seamlessly from BioGPS, a REST API was implemented, providing an annotation lookup service (/gene/) and a full-text query service (/query/). The combination of these (data aggregation plus API front end) became MyGene.info.

Page 3: Biothings presentation

MyGene.info

• MyGene.info provides simple-to-use REST web services to query/retrieve gene annotation data, aggregated on Entrez Gene ID.

• Examples:
– http://mygene.info/v2/gene/1017
– http://mygene.info/v2/query?q=cdk*&fields=pdb
– http://mygene.info/v2/metadata

• Hosted entirely on AWS cloud computers (three 8 GB, 2-core data nodes and two 4 GB, 2-core web nodes). Currently serves millions of requests per month.
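
As a rough illustration of how the example URLs on this slide are put together, the sketch below builds the annotation and query URLs with the Python standard library. The helper names gene_url and query_url are hypothetical and not part of any MyGene.info client; no request is actually sent.

```python
from urllib.parse import urlencode

BASE = "http://mygene.info/v2"  # base URL taken from the slide's examples


def gene_url(gene_id):
    """Annotation lookup: /gene/<entrez id> (hypothetical helper)."""
    return f"{BASE}/gene/{gene_id}"


def query_url(q, fields=None):
    """Full-text query: /query?q=...&fields=... (hypothetical helper)."""
    params = {"q": q}
    if fields:
        params["fields"] = fields
    # safe="*" keeps the wildcard readable instead of encoding it as %2A
    return f"{BASE}/query?{urlencode(params, safe='*')}"


print(gene_url(1017))
print(query_url("cdk*", fields="pdb"))
```

The two printed URLs match the first two examples on the slide.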

Page 4: Biothings presentation

MyVariant.info

• MyVariant.info provides simple-to-use REST web services to query/retrieve variant annotation data, aggregated from many popular data resources. Aggregated on HGVS ID.

• Examples:
– http://myvariant.info/v1/variant/chr6:g.152708291G%3EA
– http://myvariant.info/v1/query?q=clinvar.chrom:10&fields=clinvar
– http://myvariant.info/v1/query?q=chr1:69000-70000&fields=dbnsfp,dbsnp
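
HGVS IDs contain a ">" character, which is why the first example URL carries "%3E". A minimal sketch of that encoding step, assuming a hypothetical helper named variant_url (not part of any MyVariant.info client):

```python
from urllib.parse import quote


def variant_url(hgvs_id, base="http://myvariant.info/v1"):
    """Build an annotation URL for one HGVS ID (hypothetical helper)."""
    # Keep ':' literal, as in the slide's example; '>' becomes %3E.
    return f"{base}/variant/{quote(hgvs_id, safe=':')}"


print(variant_url("chr6:g.152708291G>A"))
```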

Page 5: Biothings presentation

Biothings.api - abstracting web front end

From the point of view of the front end, the nature of the document is inconsequential, i.e., whether we serve documents about genes, variants, or chemicals isn't particularly important => How much can we abstract out of MyGene and MyVariant and apply it to

Page 6: Biothings presentation

Motivation

• Isolate the common aspects of MyGene and MyVariant codebases and make them available in a separate framework: biothings.api

• Allows easier development of additional biothings APIs (Disease, Drug/Chemical, GO, Species… -> JSON, aggregate on a single field)

• Allows easier maintenance and development of current biothings (gene, variant).

Page 7: Biothings presentation

System Overview

• The Tornado HTTP server consists of handlers that contain the code to run when a particular URL pattern is matched, e.g. /variant/ or /metadata.

• The biothings codebase essentially connects the appropriate Tornado HTTP request handler for a request to the elasticsearch query that executes that request. Conceptually, this is very similar to a model-controller framework, where the model is the elasticsearch python box and the controller is the Tornado HTTP server.
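
The pattern-to-handler-to-query flow can be sketched in plain Python. This is a stand-in, not Tornado's or biothings' actual API: the route table, handler functions, and the in-memory FAKE_INDEX that plays the role of elasticsearch are all assumptions made for illustration.

```python
import re

# Stand-in for the Elasticsearch side (the "model"): a tiny in-memory index.
FAKE_INDEX = {"chr6:g.152708291G>A": {"_id": "chr6:g.152708291G>A"}}


def annotation_handler(bid):
    """Runs when /variant/<id> matches; looks the document up by ID."""
    return FAKE_INDEX.get(bid, {"error": "not found"})


def metadata_handler():
    """Runs when /metadata matches."""
    return {"doc_count": len(FAKE_INDEX)}


# URL pattern -> handler table, playing the role of the "controller".
ROUTES = [
    (re.compile(r"^/variant/(.+)$"), annotation_handler),
    (re.compile(r"^/metadata/?$"), metadata_handler),
]


def dispatch(path):
    """Call the first handler whose pattern matches the request path."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(*match.groups())
    return {"error": "no handler for " + path}


print(dispatch("/variant/chr6:g.152708291G>A"))
print(dispatch("/metadata"))
```

In the real system, Tornado owns the pattern matching and the handlers call into elasticsearch; the sketch only shows how the two halves are wired together.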

Page 8: Biothings presentation

Biothings – HTTP Handling

• tornado.web.RequestHandler: base tornado class for HTTP request handling. Important class methods: get/post, get_arguments, write

• biothings.www.helper.BaseHandler: contains methods common to all biothings RequestHandlers. Important class methods: get_query_params, return_json

• biothings.www.api.handlers.QueryHandler: contains methods to implement the biothings query endpoint. Important class methods: get, post, _examine_kwargs

• biothings.www.api.handlers.BiothingHandler: contains methods to implement the biothings annotation endpoint. Important class methods: get, post, _examine_kwargs

• biothings.www.api.handlers.MetaDataHandler: contains methods to implement the metadata endpoint

• biothings.www.api.handlers.StatusHandler: contains methods to implement a status endpoint for AWS ELB
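
To make the layering concrete, here is a minimal stand-in for the hierarchy, assuming simplified signatures. It borrows the method names listed on this slide (get_query_params, return_json, get), but the bodies and arguments are illustrative guesses, not the real biothings or Tornado code (a real Tornado handler reads arguments from the request object rather than from a string).

```python
import json
from urllib.parse import parse_qs


class BaseHandler:
    """Stand-in for the shared base class; not the real biothings code."""

    def get_query_params(self, query_string):
        # Flatten ?a=1&b=2 into {"a": "1", "b": "2"} (last value wins).
        return {k: v[-1] for k, v in parse_qs(query_string).items()}

    def return_json(self, data):
        return json.dumps(data, sort_keys=True)


class QueryHandler(BaseHandler):
    """Stand-in for the /query endpoint handler."""

    def get(self, query_string):
        kwargs = self.get_query_params(query_string)
        q = kwargs.pop("q", None)
        if q is None:
            return self.return_json({"success": False, "error": "missing q"})
        # A real handler would pass q and kwargs on to the Elasticsearch layer.
        return self.return_json({"q": q, "options": kwargs})


print(QueryHandler().get("q=cdk*&fields=pdb"))
```

Shared concerns (argument parsing, JSON output) live in the base class, so each endpoint handler only adds its own request logic.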

Page 9: Biothings presentation

Biothings – HTTP Handling

• biothings.www.api.handlers.BiothingHandler:

– GET request (e.g. /variant/chr6:g.152708291G>A)

– POST request (e.g. /variant/)

Page 10: Biothings presentation

Biothings – HTTP Handling

• biothings.www.api.handlers.QueryHandler:

– GET request (e.g. /query?q=_exists_:dbsnp)

– POST request (e.g. /query/)
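
The POST forms of the annotation and query endpoints take their input in the request body rather than in the URL. The sketch below builds such requests with the standard library without sending them. The form parameter names "ids", "q", and "fields" are assumptions (the slides do not name them), and batch_request is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def batch_request(url, **params):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(params).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


# Parameter names below are assumptions, not taken from the slides.
annotation = batch_request(
    "http://myvariant.info/v1/variant/",
    ids="chr6:g.152708291G>A",  # more IDs could be comma-separated here
    fields="dbsnp",
)
query = batch_request("http://myvariant.info/v1/query/", q="_exists_:dbsnp")

print(annotation.get_method(), annotation.full_url)
print(annotation.data.decode())
print(query.data.decode())
```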

Page 11: Biothings presentation

Biothings – Elasticsearch query

• biothings.www.api.es.ESQuery – contains the python code for constructing the elasticsearch query and formatting the resulting data
– query(q, **kwargs) – Contains the elasticsearch query to run with data obtained from a GET or POST to the /query/ endpoint.
– get_biothing(bid, **kwargs) – Contains the elasticsearch query to run with data obtained from a GET to the /annotation/ endpoint.
– mget_biothings(bid_list, **kwargs) – Contains the elasticsearch query to run with data obtained from a POST to the /annotation/ endpoint.
– _cleaned_res(res) – Contains the code to format the return object for get_biothing and mget_biothings.
– _cleaned_res2(res) – Contains the code to format the return object for query.
– _get_biothingdoc(hit) – Contains the code to format a single biothing object from any elasticsearch query. Called by _cleaned_res and _cleaned_res2.
– _modify_biothingdoc(doc) – Contains the code to modify a biothing_doc. Called in _get_biothingdoc. Currently empty -> for overriding.
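
The formatting chain (_cleaned_res and _cleaned_res2 calling _get_biothingdoc, which calls the overridable _modify_biothingdoc) might look roughly like the sketch below. It reuses the method names from this slide, but the bodies are guesses, the class ESQuerySketch is hypothetical, and no elasticsearch client is involved. The raw response is a hand-written dict in the general shape of an elasticsearch search result.

```python
class ESQuerySketch:
    """Formatting half of an ESQuery-like class; no Elasticsearch client."""

    def _modify_biothingdoc(self, doc):
        # Hook for subclasses; the base version returns the doc unchanged.
        return doc

    def _get_biothingdoc(self, hit):
        # Flatten one Elasticsearch hit into a single document.
        doc = dict(hit.get("_source", {}))
        doc["_id"] = hit["_id"]
        if hit.get("_score") is not None:
            doc["_score"] = hit["_score"]
        return self._modify_biothingdoc(doc)

    def _cleaned_res(self, res):
        # Annotation lookups: return the documents themselves.
        docs = [self._get_biothingdoc(h) for h in res["hits"]["hits"]]
        if not docs:
            return None
        return docs[0] if len(docs) == 1 else docs

    def _cleaned_res2(self, res):
        # Queries: keep the hit count alongside the formatted hits.
        hits = res["hits"]
        return {
            "total": hits["total"],
            "hits": [self._get_biothingdoc(h) for h in hits["hits"]],
        }


class VariantQuery(ESQuerySketch):
    def _modify_biothingdoc(self, doc):
        # Project-specific tweak: hypothetical example of the override hook.
        doc["kind"] = "variant"
        return doc


raw = {"hits": {"total": 1, "hits": [
    {"_id": "chr6:g.152708291G>A", "_score": 1.0, "_source": {"dbsnp": {}}}
]}}
print(VariantQuery()._cleaned_res2(raw))
```

Because every result passes through _get_biothingdoc, a project only needs to override _modify_biothingdoc to change how all of its documents are returned.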

Page 12: Biothings presentation

Biothings - Settings

• Problem: Until now, we have left out the question of how to refer to things that MUST be project-specific (e.g., the name of the elasticsearch index to search, the type of the document, etc.). How do we do this?

• Solution: We make a settings module in biothings that all code within biothings refers to. That module looks for an environment variable called BIOTHING_SETTINGS, which holds the name of an importable module that sets the project-specific variables.
– export BIOTHING_SETTINGS='biothings.config'

• Similar to Django's DJANGO_SETTINGS_MODULE.
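
A minimal sketch of the environment-variable lookup, assuming a hypothetical load_settings helper; the real biothings settings module may be organized differently. The setting names ES_INDEX_NAME and ES_DOC_TYPE and the demo module are made up for the example.

```python
import importlib
import os
import sys
import types


def load_settings(env_var="BIOTHING_SETTINGS", default="biothings.config"):
    """Import the module named by the environment variable."""
    module_name = os.environ.get(env_var, default)
    return importlib.import_module(module_name)


# Demo only: register a throwaway module so the example runs anywhere.
demo = types.ModuleType("demo_variant_config")
demo.ES_INDEX_NAME = "myvariant_current"  # hypothetical setting name/value
demo.ES_DOC_TYPE = "variant"              # hypothetical setting name/value
sys.modules["demo_variant_config"] = demo

os.environ["BIOTHING_SETTINGS"] = "demo_variant_config"
settings = load_settings()
print(settings.ES_INDEX_NAME, settings.ES_DOC_TYPE)
```

Framework code can then read project-specific values from the loaded module without knowing which project it is running in.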

Page 13: Biothings presentation

Biothings - Settings

Page 14: Biothings presentation

Biothings – Project template

• At this point, we have the tools necessary to easily create and subclass 4 types of biothings handlers (BiothingHandler, QueryHandler, MetaDataHandler, StatusHandler), and the elasticsearch query class (ESQuery).

• Could definitely stop here and have a useful tool, but we wanted to make it even easier to create a new project (this also enforces a uniform project structure across all biothings APIs).

• To do this we have a project template folder containing the project directory structure and some skeleton code:
– config.py
– URL patterns to Handlers connection
– Handlers to ESQuery connection

Page 15: Biothings presentation

Biothings - Project template

• To create the actual project directory from the template, we wrote a small script: biothings-admin.py
– Usage: biothings-admin.py <path-to-project-directory> <biothing-object-name>
– Example: biothings-admin.py ~ variant

• Any folder or file in the template directory is created in the project directory. The contents of each file are passed through Python's string.Template substitution before the file is written to the project directory.
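
The copy-and-substitute step can be sketched as below. This is not the actual biothings-admin.py: render_project is a hypothetical function, the "$annotation_endpoint" placeholder is made up, and the use of safe_substitute (which leaves unknown placeholders untouched) rather than substitute is an assumption.

```python
import os
import tempfile
from string import Template


def render_project(template_dir, project_dir, **context):
    """Copy template_dir to project_dir, substituting $placeholders."""
    for root, _dirs, files in os.walk(template_dir):
        rel = os.path.relpath(root, template_dir)
        target_root = os.path.normpath(os.path.join(project_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            with open(os.path.join(root, name), encoding="utf-8") as src:
                text = Template(src.read()).safe_substitute(context)
            target = os.path.join(target_root, name)
            with open(target, "w", encoding="utf-8") as dst:
                dst.write(text)


# Demo with a one-file template; "$annotation_endpoint" is a made-up placeholder.
with tempfile.TemporaryDirectory() as tmp:
    template_dir = os.path.join(tmp, "template")
    os.makedirs(template_dir)
    with open(os.path.join(template_dir, "config.py"), "w", encoding="utf-8") as f:
        f.write("ANNOTATION_ENDPOINT = '$annotation_endpoint'\n")

    project_dir = os.path.join(tmp, "project")
    render_project(template_dir, project_dir, annotation_endpoint="variant")
    with open(os.path.join(project_dir, "config.py"), encoding="utf-8") as f:
        rendered = f.read()

print(rendered)
```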

Page 16: Biothings presentation

Small project structure review

Page 17: Biothings presentation

Recreating MyVariant.info using biothings.api

• Recreated the current MyVariant.info service using the biothings.api framework
– Very little extra code required (~100 lines)
– Less than a day to create the web front end from scratch
– https://github.com/SuLab/myvariant.info/tree/biothings.variant

• Seems disingenuous to gauge the utility of a tool by recreating a codebase if that tool was itself created from the codebase => Should try implementing other APIs, especially MyGene.info (has more varied gene specific query options), and modify biothings as needed.

Page 18: Biothings presentation

MyGene.info v3

• Sebastien reimplemented MyGene using the biothings framework

• Currently live at mygene.info/v3 for testing purposes

• Some structural changes to the data as well

• Examples:
– http://mygene.info/v3/gene/1017
– http://mygene.info/v3/query?q=cdk*&fields=pdb

Page 19: Biothings presentation

Small Biothing Cluster

• With biothings, new front end frameworks are very easy to set up => We are limited only by our ability to parse, aggregate, index etc. new data.

• For small ES indices (<1 or 2 GB), we set up a small biothings cluster with 1 m4.large data node serving all search requests, and 1 t2.micro web node per biothing.

• Currently, this consists of:
– small biothing data/master: m4.large
– Taxonomy: t2.micro
– Chemical: t2.micro

Page 20: Biothings presentation

Taxonomy biothing

• Using a taxonomy parser written by Greg. Aggregated on NCBI taxonomy ID.

• Currently live at http://52.34.211.113

• Examples:
– http://52.34.211.113/v1/species/9606
– http://52.34.211.113/v1/query?q=human

• Soon to become http://s.biothings.io

Page 21: Biothings presentation

Chemicals biothing

• Data from several chemical databases aggregated by Julee on InChIKey (hash of string representation of chemical) https://en.wikipedia.org/wiki/International_Chemical_Identifier#InChIKey

• Currently live at: http://52.38.192.121/

• Examples:
– http://52.38.192.121/v1/drug/CHEMBL1201666
– http://52.38.192.121/v1/query?q=chembl.pref_name:neo*&fields=chembl.pref_name

• Soon to become http://c.biothings.io

Page 22: Biothings presentation

Future work

• Integrate data load and data index functions into biothings (WIP, large project)

• Documentation! – Projects like this need very good documentation to be of any use to an API developer (on the level of tornado's excellent documentation: http://www.tornadoweb.org/en/stable/web.html) (also WIP)

• Host API services for external users' data (essentially possible already without too much work).

• Auto-generate clients (python client, R client)

• Auto-generate ansible-playbook to create cluster hardware on AWS

• One-click API…