
Google Genomics Documentation
Release v1beta2

Cassie

March 16, 2015


Contents

1 Discover Public Data
  1.1 Google Genomics Public Data
  1.2 Annotate Variants with Tute Genomics
  1.3 PGP data in Google Cloud Storage

2 Load Data into Google Genomics
  2.1 Loading Genomic Variants
  2.2 Troubleshooting Job failures

3 Browse Genomic Data

4 Quality Control

5 Annotate Variants
  5.1 Annotate Variants with BioConductor
  5.2 Annotate Variants with Tute Genomics
  5.3 Annotate Variants with Google Genomics

6 Analyze Variants

7 Compute Principal Coordinate Analysis

8 Compute Identity By State

9 Build your own Google Genomics API Client
  9.1 Important constants and links
  9.2 Common API workflows
  9.3 API authorization requirements
  9.4 The java client
  9.5 The python client
  9.6 The R client
  9.7 Migrating from v1beta to v1beta2

10 The mailing list


Here you will find task-oriented documentation. What do you want to do today?


CHAPTER 1

Discover Public Data

1.1 Google Genomics Public Data

See https://cloud.google.com/genomics/public-data

1.2 Annotate Variants with Tute Genomics

Tute Genomics has made available to the community annotations for all hg19 SNPs as a BigQuery table.

See Tute’s documentation for more details about the annotation databases included and sample queries on public data.

To make use of this with your own data:

1. Load Data into Google Genomics

2. Use the BigQuery JOIN command to join the Tute table with your variants and materialize the result to a new table.

TODO: actual example with bq tool
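In the meantime, here is a minimal sketch of what such a join might look like with the bq tool. The destination table, variants table, annotation table, and join columns below are placeholders rather than the actual Tute schema; substitute the real table ID and column names from Tute’s documentation:

bq query --destination_table=YOUR_BIGQUERY_DATASET.annotated_variants \
    "SELECT var.reference_name, var.start, var.reference_bases, anno.ANNOTATION_COLUMN
     FROM [YOUR_BIGQUERY_DATASET.YOUR_VARIANTS_TABLE] AS var
     JOIN EACH [TUTE_PROJECT:TUTE_DATASET.TUTE_HG19_TABLE] AS anno
     ON var.reference_name = anno.CHROMOSOME_COLUMN AND var.start = anno.START_COLUMN"

The --destination_table flag is what materializes the joined result as a new table, and JOIN EACH is the legacy BigQuery SQL join form typically needed when both sides of the join are large.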

1.3 PGP data in Google Cloud Storage

Google is hosting a copy of the PGP Harvard data in Google Cloud Storage. All of the data is in this bucket: gs://pgp-harvard-data-public

If you wish to browse the data you will need to install gsutil.

Once installed, you can run the ls command on the pgp bucket:

$ gsutil ls gs://pgp-harvard-data-public
gs://pgp-harvard-data-public/cgi_disk_20130601_00C68/
gs://pgp-harvard-data-public/hu011C57/
gs://pgp-harvard-data-public/hu016B28/
....lots more....

The sub folders are PGP IDs, so if we ls a specific one:

$ gsutil ls gs://pgp-harvard-data-public/hu011C57/
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/


And then keep diving down through the structure, you can end up here:

$ gsutil ls gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/dbSNPAnnotated-GS000015172-ASM.tsv.bz2
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/gene-GS000015172-ASM.tsv.bz2
... and more ...

Your genome data is located at: gs://pgp-harvard-data-public/{YOUR_PGP_ID}
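For example, to pull down a local copy of your own data with gsutil (a sketch; substitute your actual PGP ID and make sure you have enough local disk space for the download):

gsutil -m cp -R gs://pgp-harvard-data-public/YOUR_PGP_ID .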

If you do not see the data you are looking for, you should contact PGP directly through your web profile.


CHAPTER 2

Load Data into Google Genomics

2.1 Loading Genomic Variants

Contents

• Loading Genomic Variants
  – Prerequisites
  – Step 1: Upload variants to Google Cloud Storage
    * Transfer the data.
    * Check the data.
  – Step 2. Import variants to Google Genomics
    * Create a Google Genomics dataset to hold your data.
    * Import your VCFs from Google Cloud Storage to your Google Genomics Dataset.
    * Check the import job for completion.
  – Step 3. Export variants to Google BigQuery
    * Create a BigQuery dataset in the web UI to hold the data.
    * Export variants to BigQuery.
    * Check the export job for completion.

2.1.1 Prerequisites

1. Sign up for Google Genomics by doing all the steps in Google Genomics: Try it now.

2. Sign up for Google Cloud Storage by doing all the steps in Google Cloud Storage: Try it now.


2.1.2 Step 1: Upload variants to Google Cloud Storage

For the purposes of this example, let’s assume you have a local copy of the Illumina Platinum Genomes variants that you would like to load.

Note: Google Genomics cannot load compressed VCFs, so for now be sure to uncompress them prior to uploading them to Cloud Storage. We expect to support compressed VCFs soon.

Transfer the data.

To transfer a glob of files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://YOUR_BUCKET/platinum-genomes/vcf/

Or to transfer a directory tree of files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R YOUR_DIRECTORY_OF_VCFS \
    gs://YOUR_BUCKET/platinum-genomes/

If any failures occur due to temporary network issues, re-run with the no-clobber flag (-n) to transmit just the missing files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R YOUR_DIRECTORY_OF_VCFS \
    gs://YOUR_BUCKET/platinum-genomes/

For more detail, see the gsutil cp command.

Check the data.

When you are done, the bucket will have contents similar to this but with your own bucket’s name:

$ gsutil ls gs://genomics-public-data/platinum-genomes/vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12877_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12878_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12879_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12880_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12881_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12882_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12883_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12884_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12885_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12886_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12887_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12888_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12889_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12890_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12891_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12892_S1.genome.vcf
gs://genomics-public-data/platinum-genomes/vcf/NA12893_S1.genome.vcf

For more detail, see the gsutil ls command.


2.1.3 Step 2. Import variants to Google Genomics

Create a Google Genomics dataset to hold your data.

• YOUR_DATASET_NAME: This can be any name you like such as “My Copy of Platinum Genomes”.

• YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER: You can find your Google Cloud Platform project number towards the top of the Google Developers Console page.

$ java -jar genomics-tools-client-java-v1beta2.jar createdataset --name YOUR_DATASET_NAME \
    --project_number YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER
{
  "id" : "THE_NEW_DATASET_ID",
  "isPublic" : false,
  "name" : "YOUR_DATASET_NAME",
  "projectNumber" : "YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER"
}

For more detail, see managing datasets.

Import your VCFs from Google Cloud Storage to your Google Genomics Dataset.

• THE_NEW_DATASET_ID: This was returned in the output of the prior command.

$ java -jar genomics-tools-client-java-v1beta2.jar importvariants \
    --variant_set_id THE_NEW_DATASET_ID \
    --vcf_file gs://YOUR_BUCKET/platinum-genomes/vcf/*.vcf
Import job: {
  "id" : "THE_NEW_IMPORT_JOB_ID",
  "status" : "pending"
}

For more detail, see managing variants.

Check the import job for completion.

• THE_NEW_IMPORT_JOB_ID: This was returned in the output of the prior command.

$ java -jar genomics-tools-client-java-v1beta2.jar getjob --poll --job_id THE_NEW_IMPORT_JOB_ID
Waiting for job: job_id...
{
  "status" : "success",
  "importedIds" : ["call_set_id", "call_set_id"],
  "warnings" : []
}

2.1.4 Step 3. Export variants to Google BigQuery

Create a BigQuery dataset in the web UI to hold the data.

1. Open the BigQuery web UI.

2. Click the down arrow icon next to your project name in the navigation, then click Create new dataset.

3. Input a dataset ID.
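If you prefer the command line and have the Google Cloud SDK installed, the same dataset can be created with the bq tool (a sketch; the dataset ID is whatever you want to call it):

bq mk YOUR_BIGQUERY_DATASET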


Export variants to BigQuery.

• THE_NEW_DATASET_ID: This was returned in the output of the createdataset command.

• YOUR_BIGQUERY_DATASET: This is the dataset ID you created in the prior step.

• YOUR_BIGQUERY_TABLE: This can be any ID you like such as “platinum_genomes_variants”.

$ java -jar genomics-tools-client-java-v1beta2.jar exportvariants \
    --project_id YOUR_GOOGLE_CLOUD_PLATFORM_PROJECT_NUMBER \
    --variant_set_id THE_NEW_DATASET_ID \
    --bigquery_dataset YOUR_BIGQUERY_DATASET \
    --bigquery_table YOUR_BIGQUERY_TABLE
Export job: {
  "id" : "THE_NEW_EXPORT_JOB_ID",
  "status" : "pending"
}

For more detail, see variant exports.

Check the export job for completion.

• THE_NEW_EXPORT_JOB_ID: This was returned in the output of the prior command.

$ java -jar genomics-tools-client-java-v1beta2.jar getjob --poll --job_id THE_NEW_EXPORT_JOB_ID
Waiting for job: job_id...
{
  "status" : "success",
  "importedIds" : ["call_set_id", "call_set_id"],
  "warnings" : []
}

Now you are ready to start querying your variants!
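For example, a quick sanity check is to count variants per reference with the bq tool. This sketch assumes the exported table includes a reference_name column; check the table schema in the BigQuery web UI if yours differs:

bq query "SELECT reference_name, COUNT(reference_name) AS variant_count
          FROM [YOUR_BIGQUERY_DATASET.YOUR_BIGQUERY_TABLE]
          GROUP BY reference_name
          ORDER BY variant_count DESC"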

2.2 Troubleshooting Job failures

If you were redirected to this page from a Job failure, that means your Job failed for an unknown reason.

Either the failure was transient (which occasionally happens) and the Job should be retried, or there is a bug in our implementation which is causing an unexpected exception.

Rest assured that we keep track of all failed Jobs, and will track down the bug if there is one. In a perfect world, you would never need to see this page.

Because you are here though, please try the following:

• Re-launch your Job once more.

• If the Job fails a second time, please email [email protected] with both of your JobIDs.

Sorry for the failure - we’ll do better next time.


CHAPTER 3

Browse Genomic Data

TODO: GABROWSE and IGV

To browse your own data . . .


CHAPTER 4

Quality Control

TODO: qc codelab

To run this on your own data . . .


CHAPTER 5

Annotate Variants

5.1 Annotate Variants with BioConductor

TODO: point to annotation vignette on BioConductor for an example on public data

To annotate your own variants . . .

5.2 Annotate Variants with Tute Genomics

Tute Genomics has made available to the community annotations for all hg19 SNPs as a BigQuery table.

See Tute’s documentation for more details about the annotation databases included and sample queries on public data.

To make use of this with your own data:

1. Load Data into Google Genomics

2. Use the BigQuery JOIN command to join the Tute table with your variants and materialize the result to a new table.

TODO: actual example with bq tool

5.3 Annotate Variants with Google Genomics

TODO: command line for new AnnotateVariants Dataflow job


CHAPTER 6

Analyze Variants

TODO: All Modalities Codelab

To run this on your own data . . .


CHAPTER 7

Compute Principal Coordinate Analysis

TODO: spark and dataflow instructions


CHAPTER 8

Compute Identity By State

TODO: dataflow instructions


CHAPTER 9

Build your own Google Genomics API Client

The tools for working with the Google Genomics API are all open source and available on GitHub.

This documentation covers how to get started with the available tools as well as how you might build your own code which uses the API.

All improvements to these docs are welcome! You can file an issue or submit a pull request.

9.1 Important constants and links

Google’s base API url is: https://www.googleapis.com/genomics/v1beta2

More information on the API can be found at: http://cloud.google.com/genomics and http://ga4gh.org

To test Google’s compliance with the GA4GH API, you can use the compliance tests: http://ga4gh.org/#/compliance

To get a list of public datasets that can be used with Google’s API calls, you can use the APIs explorer or Google Genomics Public Data.

9.2 Common API workflows

There are many genomics-related APIs documented at cloud.google.com/genomics/v1beta2/reference.

Of the available calls, there are some very common patterns that can be useful when developing your own code.

The following sections describe these workflows using plain URLs and simplified request bodies. Each step should map 1-1 with all of the auto-generated client libraries.

9.2.1 Browsing read data

• GET /datasets

List all available datasets that the current user has access to (or all public datasets when not using OAuth). Choose one datasetId from the result.

Note: Currently, this call only returns public datasets! It is not able to return any private datasets. For now, you may need to ask a user for a datasetId or readGroupSetId directly.

• POST /readgroupsets/search {datasetIds: [<datasetId>]}

Search for read group sets in a particular dataset. Choose one readGroupSetId from the result.


Note: This is a good place to use a partial request to only ask for the id and name fields on a read group set. Then you can follow up with a GET /readgroupsets/<readGroupSetId> call to get the complete read group set data.

• GET /readgroupsets/<readGroupSetId>/coveragebuckets

Get coverage information for a particular read group set. This will tell you where the read data is located, and which referenceNames should be used in the next step.

• POST /reads/search {readGroupSetIds: [<readGroupSetId>]}

Get reads for a particular read group set.

Note: The call also requires referenceName, start and end. The referenceName can be chosen from the coverage buckets by the user, along with the start and end coordinates they wish to view. The API uses 0-based coordinates.
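To make the two search calls concrete, here is a rough sketch using curl and an API key. The dataset ID is the public example dataset used later in this document, the fields parameter is the partial request mentioned above, and the read group set ID and coordinates are placeholders you would take from the earlier responses:

curl -H "Content-Type: application/json" \
    -d '{"datasetIds": ["10473108253681171589"]}' \
    "https://www.googleapis.com/genomics/v1beta2/readgroupsets/search?fields=readGroupSets(id,name)&key=YOUR_API_KEY"

curl -H "Content-Type: application/json" \
    -d '{"readGroupSetIds": ["YOUR_READGROUPSET_ID"], "referenceName": "14", "start": 25419886, "end": 25419986}' \
    "https://www.googleapis.com/genomics/v1beta2/reads/search?key=YOUR_API_KEY"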

9.2.2 Map reducing over read data within a readset

• GET /readgroupsets/<readGroupSetId>/coveragebuckets

First get coverage information for the read group set you are working with.

Iterate over the coverageBuckets array. For each bucket, there is a field range.end. Using this field, and the number of shards you wish to have, you can calculate sharding bounds.

Let’s say there are 23 references, and you want 115 shards. The easiest math would have us creating 5 shards per reference, each with a start of i * range.end/5 and an end of min(range.end, start + range.end/5).

• POST /reads/search {readGroupSetId: x, referenceName: shard.refName, start: shard.start, end: shard.end}

Once you have your shard bounds, each shard will then do a reads search to get data. (Don’t forget to use a partial request.)

9.2.3 Map reducing over variant data

• GET /variantsets/<datasetId>

First get a summary of the variants you are working with. This includes the references that have data, as well as their upper bounds.

Iterate over the referenceBounds array. For each reference, there is a field upperBound. Using this field, and the number of shards you wish to have, you can calculate sharding bounds.

Let’s say there are 23 references, and you want 115 shards. The easiest math would have us creating 5 shards per reference, each with a start of i * referenceBound.upperBound/5 and an end of min(referenceBound.upperBound, start + referenceBound.upperBound/5).

• POST /variants/search {variantSetIds: [x], referenceName: shard.refName, start: shard.start, end: shard.end}

Once you have your shard bounds, each shard will then do a variants search to get data. (Don’t forget to use a partial request.)

If you only want to look at certain call sets, you can include the callSetIds: ["id1", "id2"] field on the search request. Only call information for those call sets will be returned. Variants without any of the requested call sets won’t be included at all.
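For example, a single shard’s request could be issued with curl like this (a sketch; the variant set ID, call set ID, and region are placeholders, and private data would need OAuth rather than just an API key):

curl -H "Content-Type: application/json" \
    -d '{"variantSetIds": ["YOUR_VARIANT_SET_ID"], "referenceName": "14", "start": 25419000, "end": 25420000, "callSetIds": ["YOUR_CALL_SET_ID"]}' \
    "https://www.googleapis.com/genomics/v1beta2/variants/search?key=YOUR_API_KEY"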


9.3 API authorization requirements

Calls to the Google Genomics API can be made with OAuth or with an API key.

• To access private data or to make any write calls, an API request needs to be authenticated with OAuth.

• Read-only calls to public data only require an API key to identify the calling project. (OAuth will also work)

Some APIs are still in the testing phase. The following lays out where each API call stands and also indicates whether a call supports requests without OAuth.

9.3.1 Available APIs

API method                                        OAuth required
Get, List and Search methods (except on Jobs)     False
Create, Delete, Patch and Update methods          True
Import and Export methods                         True
All Job methods                                   True

9.3.2 APIs in testing

API method                 OAuth required
genomics.experimental.*    True

9.4 The java client

The api-client-java project provides a command line interface for API queries in Java.

9.4.1 Command line options for api-client-java

The command line is now the best place for help. Executing without any parameters:

java -jar target/genomics-tools-client-java-v1beta2.jar

will print out all the available commands. To get help on a specific command, append the command followed by help. For example, to get help on the searchreads command:

java -jar target/genomics-tools-client-java-v1beta2.jar searchreads help

All the request types map to Genomics API calls. You can read the API documentation for more information about the various objects, and what each method is doing.

The custom command

If you wish to call an API method that doesn’t have a pre-defined request type, or if you wish to pass in additional JSON fields that aren’t supported with the existing options, then you can issue a fully custom request with the following parameters:

--custom_endpoint Required. The API endpoint to query. This is relative to the base URL and shouldn’t start with a /. Example: readgroupsets/search.


--custom_method The HTTP method to query with. Defaults to POST. Other valid examples are GET, PATCH, DELETE.

--custom_body If the API endpoint you are hitting requires an HTTP body, use this parameter to pass in a JSON object as a string. It should look something like {"key":"value"}.

Putting these pieces together, if you wanted to do a readsets search with name filtering (which isn’t supported through the other options) you could do so with this query:

java -jar target/genomics-tools-client-java-v1beta2.jar custom --custom_endpoint "readgroupsets/search" --custom_body ’{"datasetIds": ["10473108253681171589"], "name": "NA1287"}’ --fields "readGroupSets(id,name)" --pretty_print

If instead you wanted to make a GET call, your custom request could look like this:

java -jar target/genomics-tools-client-java-v1beta2.jar custom --custom_endpoint "readgroupsets/CMvnhpKTFhD04eLE-q2yxnU" --custom_method "GET" --fields "id,name" --pretty_print

9.4.2 Clearing stored credentials

The first time the Java client makes an API request, it authenticates the caller with OAuth and stores the resulting credentials for all future API calls.

If you wish to remove these stored credentials (to authenticate with a different client secrets file, or as a different user, etc.), you will need to remove the storage directory with this command:

rm ~/.store/genomics_java_client/StoredCredential

The next request made to the Java client will then require a browser to open the OAuth pages.

The java client uses Google’s java client library to get data from the Google Genomics APIs. See the java docs for more details.

9.5 The python client

The api-client-python project provides a simple genome browser that pulls data from the Genomics API.

9.5.1 Setting up the python client on Windows

• In order to setup Python 2.7 for Windows, first download it from https://www.python.org/downloads/

• After installing Python, add to your PATH the location of the Python directory and the Scripts directory within it.

For example, if Python is installed in C:\Python27, proceed by right-clicking on My Computer on the Start Menu and select “Properties”. Select “Advanced system settings” and then click on the “Environment Variables” button. In the window that comes up, append the following to the system variable PATH (if you chose a different installation location, change this path accordingly):

;C:\Python27\;C:\Python27\Scripts\

• Get the api-client-python code onto your machine by cloning the repository:

git clone https://github.com/googlegenomics/api-client-python.git


Running the client with App Engine

Only follow the instructions in this section if you want to run the python client with App Engine.

• Download the “Google App Engine SDK for Python” for Windows from https://developers.google.com/appengine/downloads and install it.

• From within the api-client-python directory that you cloned, run the dev_appserver.py script. If we assume the installation directory for your app engine SDK was C:\Google\google_appengine, then you would run the following command:

python C:\Google\google_appengine\dev_appserver.py .

If you get an error like google.appengine.tools.devappserver2.wsgi_server.BindError: Unable to bind localhost:8000, try specifying a specific port with this command:

python C:\Google\google_appengine\dev_appserver.py --admin_port=12000 .

• To view your running server, open your browser to localhost:8080.

Running the client without App Engine

Only follow the instructions in this section if you do not want to use App Engine. See the section above for App Engine instructions.

• First you will need to download Pip from https://raw.github.com/pypa/pip/master/contrib/get-pip.py

• To install Pip, open up a cmd.exe window by selecting Start->Run->cmd and type the following (replace directory_of_get-pip.py with the location where get-pip.py resides):

cd directory_of_get-pip.py
python get-pip.py

• Afterwards in the same command window, type the following command to update your Python environment with the required modules:

pip install WebOb Paste webapp2 jinja2

• You should then be able to run the localserver with the following commands:

cd api-client-python
python localserver.py

Enabling the Google API provider

If you want to pull in data from the Google Genomics API you will need to set API_KEY in main.py to a valid Google API key.

• First apply for access to the Genomics API by following the instructions at https://developers.google.com/genomics/

• Then create a project in the Google Developers Console or select an existing one.

• On the APIs & auth tab, select APIs and turn the Genomics API to ON

• On the Credentials tab, click create new key under the Public API access section.

• Select Server key in the dialog that pops up, and then click Create. (You don’t need to enter anything in the text box.)


• Copy the API key field value that now appears in the Public API access section into the top of the main.py file inside of your api-client-python directory. It should look something like this:

API_KEY = "abcdef12345abcdef"

Note: You can also reuse an existing API key if you have one. Just make sure the Genomics API is turned on.

• Run your server as before, and view your server at localhost:8080.

• Google should now show up as an option in the Readset choosing dialog.

9.5.2 GABrowse URL format

The genome browser code supports direct linking to specific backends, readsets, and genomic positions.

These parameters are set using the hash. The format is very simple, with only 3 supported key=value pairs separated by &:

• backend

The backend to use for API calls. example: GOOGLE or NCBI

• readsetId

The ID of the readset that should be loaded. See Important constants and links for more information.

• location

The genomic position to display at. Takes the form of <chromosome>:<base pair position>, for example: 14:25419886. This can also be an RS ID or a string that will be searched on SNPedia.

As you navigate in the browser (either locally or at http://gabrowse.appspot.com), the hash will automatically populate to include these parameters. But you can also manually create a direct link without having to go through the UI.

Putting all the pieces together, here is what a valid url looks like:

http://gabrowse.appspot.com/#backend=GOOGLE&readsetId=CPHG3MzoCRDY5IrcqZq8hMIB&location=14:25419886

The python client does not currently use Google’s python client library. If you want to use the client library, the method documentation for genomics can be very useful.

9.6 The R client

The api-client-r project provides an R package with methods to search for Reads and Variants stored in the Google Genomics API. Additionally, it provides converters to BioConductor datatypes such as GAlignments, GRanges, and VRanges.

9.7 Migrating from v1beta to v1beta2

The v1beta2 version of the Google Genomics API is now available and all client code should migrate to it by the end of 2014.

If you are using the genomics-tools-client-java jar from the command line - upgrading is as easy as downloading a new jar. (Or running git pull; mvn package from your git client.)

For all other integrations: v1beta2 matches the GA4GH API v0.5.1, which means that there are quite a few method and field renames to deal with. This page summarizes all the changes necessary to move to the latest API.


9.7.1 new version notes

General

• maxResults is now pageSize, and is an integer

Datasets and Jobs

• All usages of projectId should be replaced by projectNumber

• job.description is now job.detailedStatus

Variants

• The variant objects have not changed.

• The import and export methods have slightly different URLs. /variants/import is now /variantsets/<variantSetId>/importVariants and /variants/export is /variantsets/<variantSetId>/export. These affect the generated client libraries slightly.

Readsets/Readgroupsets

• readset has now been renamed to readgroupset. This is mostly a straightforward replacement of the term.

• readset.fileData[0].fileUri is now readgroupset.filename

• readset.fileData[0].refSequences is replaced by readgroupset.referenceSetId

• The rest of the readset.fileData field has been replaced by information within the readgroupset.readgroups array.

Reads

• All read positions are now 0-based longs, just like the variant positions.

• originalBases is now alignedSequence

• alignedBases (originalBases with the cigar applied) has been removed

• baseQuality is now an int array called alignedQuality. You no longer need to subtract 33 or deal with ASCII conversion.

• name is now fragmentName

• templateLength is now fragmentLength

• tags is now info

• position is now alignment.position.position. The alignment object now contains all alignment-related information - including the cigar, reference name, and whether the read is on the reverse strand.

• The old cigar string is now the structured field alignment.cigar. To get an old-style cigar string, iterate over each element in the array, and concat the operationLength with a mapped version of operation. Pseudocode:

cigar_enums = {ALIGNMENT_MATCH: "M", CLIP_HARD: "H", CLIP_SOFT: "S", DELETE: "D",
               INSERT: "I", PAD: "P", SEQUENCE_MATCH: "=", SEQUENCE_MISMATCH: "X", SKIP: "N"}

cigar_string = ''.join([str(c.operationLength) + cigar_enums[c.operation] for c in read.alignment.cigar])

• The old flags integer is now represented by many different first class fields. To reconstruct a flags value, you need code similar to this pseudocode:


flags = 0
flags += read.numberReads == 2 ? 1 : 0                      # read_paired
flags += read.properPlacement ? 2 : 0                       # read_proper_pair
flags += read.alignment.position.position == null ? 4 : 0   # read_unmapped
flags += read.nextMatePosition.position == null ? 8 : 0     # mate_unmapped
flags += read.alignment.position.reverseStrand ? 16 : 0     # read_reverse_strand
flags += read.nextMatePosition.reverseStrand ? 32 : 0       # mate_reverse_strand
flags += read.readNumber == 0 ? 64 : 0                      # first_in_pair
flags += read.readNumber == 1 ? 128 : 0                     # second_in_pair
flags += read.secondaryAlignment ? 256 : 0                  # secondary_alignment
flags += read.failedVendorQualityChecks ? 512 : 0           # failed_quality_check
flags += read.duplicateFragment ? 1024 : 0                  # duplicate_read
flags += read.supplementaryAlignment ? 2048 : 0             # supplementary_alignment

reads/search

• sequenceName is now referenceName

• sequenceStart is now start

• sequenceEnd is now end

• The response from reads/search now returns a field called alignments rather than reads


CHAPTER 10

The mailing list

The Google Genomics Discuss mailing list is a good way to sync up with other people who use genomics-tools, including the core developers. You can subscribe by sending an email to [email protected] or just post using the web forum page.

All improvements to these docs are welcome! You can file an issue or submit a pull request.
