accessing(publicdata(( (fromncbi)(using(bioconductor( · 2011. 7. 29. · sradb’and’ncbisra’...
TRANSCRIPT
![Page 1: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/1.jpg)
Accessing Public Data (from NCBI) Using Bioconductor
![Page 2: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/2.jpg)
Na#onal Ins#tutes of Health
![Page 3: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/3.jpg)
Objec#ves
• Navigate NCBI GEO website • Understand NCBI GEO data en<<es and rela#onships
between them • Use GEOquery package to import data from NCBI GEO into
R • Convert GEOquery data structures into R data structures • Use GEOmetadb to find data in NCBI GEO • Know rela#onship between NCBI GEO and NCBI SRA • Understand how to use the SRAdb package to query SRA
metadata • Use SRAdb package to control the Integrated Genome
Viewer (IGV) from R
![Page 4: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/4.jpg)
![Page 5: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/5.jpg)
MIAME-‐compliant Data • The raw data for each hybridiza#on (e.g., CEL or GPR files) • The final processed (normalized) data for the set of hybridiza#ons in the
experiment (study) (e.g., the gene expression data matrix used to draw the conclusions from the study)
• The essen#al sample annota#on including experimental factors and their values (e.g., compound and dose in a dose response experiment)
• The experimental design including sample data rela#onships (e.g., which raw data file relates to which sample, which hybridiza#ons are technical, which are biological replicates)
• Sufficient annota#on of the array (e.g., gene iden#fiers, genomic coordinates, probe oligonucleo#de sequences or reference commercial array catalog number)
• The essen#al laboratory and data processing protocols (e.g., what normaliza#on method has been used to obtain the final processed data)
![Page 6: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/6.jpg)
What’s in GEO
• Gene expression profiling by microarray or next-‐genera#on sequencing
• Non-‐coding RNA profiling by microarray or next-‐genera#on sequencing
• Chroma#n immunoprecipita#on (ChIP) profiling by microarray or next-‐genera#on sequencing
• Genome methyla#on profiling by microarray or next-‐genera#on sequencing
• Genome varia#on profiling by array (arrayCGH) • SNP arrays • Serial Analysis of Gene Expression (SAGE) • Protein arrays
![Page 7: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/7.jpg)
GEO data en##es
• GEO Samples (GSM) • GEO PlaYorm (GPL) • GEO Series (GSE)
– Collec#ons of related GSM and GPL along with free-‐text study metadata
– Two data “flavors” • GSE • GSEMatrix
• GEO Dataset (GDS) – Only data en#ty that is curated by NCBI GEO staff – Typically, samples are divided into sta#s#cally and biologically relevant groups upon which we can compute
![Page 8: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/8.jpg)
NCBI GEO website
• h^p://www.ncbi.nlm.nih.gov/geo/
![Page 9: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/9.jpg)
GEO SOFT Format
![Page 10: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/10.jpg)
GEO SOFT Format
![Page 11: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/11.jpg)
GEOquery
• Singular goal: Get data from NCBI GEO and parse into R objects
• Secondary goals: – Get supplemental files from NCBI GEO – Allow large-‐scale data mining by doing all-‐of-‐the-‐above in a lossless manner
![Page 12: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/12.jpg)
GEOquery Data Structures (GSM, GPL, GDS)
![Page 13: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/13.jpg)
GEOquery Data Structures (GSE)
![Page 14: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/14.jpg)
GEOquery Walkthrough
• h^p://watson.nci.nih.gov/~sdavis/ – Click on Tutorials and then on “Accessing public data….”
• Download the R script for cut-‐and-‐paste to follow along
![Page 15: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/15.jpg)
GEOmetadb
• Finding data in NCBI GEO can be challenging • Some inves#gators need large-‐scale, computable access to GEO metadata
• Given one GEO en#ty type, it may be useful to find all rela#onships with other en#ty types (eg., find all hgu133a arrays in GEO)
![Page 16: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/16.jpg)
GEOmetadb
• What is GEOmetadb? – A Bioconductor package that offers an alterna#ve to eU#ls data mining of GEO metadata
– A SQLite database which stores all the GEO metadata in a rela#onal format for easy querying, par#cularly in bulk
• What is GEOmetadb NOT? – We have not a^empted to alter or standardize in any way the data from GEO
![Page 17: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/17.jpg)
GEOmetadb Schema
![Page 18: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/18.jpg)
![Page 19: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/19.jpg)
GEOmetadb Walkthrough
![Page 20: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/20.jpg)
SRAdb
• Similar goals to GEOmetadb, but with SRA data
• Data for SRA (ENA, DRA) are mirrored, but exact policies are a bit unclear (to me)
• Accessing to the data also provided by SRAdb package, but further processing (SRA SDK) now needed to get FASTQ files
![Page 21: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/21.jpg)
SRAdb and NCBI SRA Recently, NCBI announced that due to budget constraints, it would be discon#nuing its Sequence Read Archive (SRA) and Trace Archive repositories for high-‐throughput sequence data. However, NIH has since commi^ed interim funding for SRA in its current form un#l October 1, 2011. In addi#on, NCBI has been working with staff from other NIH Ins#tutes and NIH grantees to develop an approach to con#nue archiving a widely used subset of next genera#on sequencing data ajer October 1, 2011.
We now plan to con#nue handling sequencing data associated with:
• RNA-‐Seq, ChIP-‐Seq, and epigenomic data that are submi^ed to GEO • Genomic and Transcriptomic assemblies that are submi^ed to GenBank • 16S ribosomal RNA data associated with metagenomics that are submi^ed to GenBank
In addi#on, NCBI will con#nue to provide access to exis#ng SRA and Trace Archive data for the foreseeable future. NCBI is also con#nuing to discuss with NIH Ins#tutes approaches for handling other next-‐genera#on sequencing data associated with specific large-‐scale studies.
![Page 22: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/22.jpg)
SRAdb and IGV Walkthrough
• Download and start IGV: – h^p://www.broadins#tute.org/sojware/igv/download
![Page 23: Accessing(PublicData(( (fromNCBI)(Using(Bioconductor( · 2011. 7. 29. · SRAdb’and’NCBISRA’ Recently,’NCBIannounced’thatdue’to’budgetconstraints,’itwould’be’discon#nuing’its’](https://reader035.vdocument.in/reader035/viewer/2022071006/5fc3ab7c18f01f29fe776055/html5/thumbnails/23.jpg)