Welcome to incoming bioinformatics students at UCSF
TRANSCRIPT
Biological & Medical Informatics: the beginning
Daniel Himmelstein, September 24, 2014
Hand Drawn Map of SF by Jenni Sparks
Before the Money Came, Bettye LaVette
challenge
Hand Drawn Map of SF by Jenni Sparks
review article
The New England Journal of Medicine
N Engl J Med 369;5, nejm.org, August 1, 2013, p. 448
Global Health
Measuring the Global Burden of Disease
Christopher J.L. Murray, M.D., D.Phil., and Alan D. Lopez, Ph.D.
From the Institute for Health Metrics and Evaluation, University of Washington, Seattle (C.J.L.M.); and the University of Melbourne, School of Population and Global Health, Carlton, VIC, Australia (A.D.L.). Address reprint requests to Dr. Murray at the Institute for Health Metrics and Evaluation, 2301 Fifth Ave., Suite 600, Seattle, WA 98121, or at [email protected].
N Engl J Med 2013;369:448-57. DOI: 10.1056/NEJMra1201534. Copyright © 2013 Massachusetts Medical Society.
It is difficult to deliver effective and high-quality care to patients without knowing their diagnoses; likewise, for health systems to be effective, it is necessary to understand the key challenges in efforts to improve population health and how these challenges are changing. Before the early 1990s, there was no comprehensive and internally consistent source of information on the global burden of diseases, injuries, and risk factors. To close this gap, the World Bank and the World Health Organization launched the Global Burden of Disease (GBD) Study in 1991.1 Although assessments of selected diseases, injuries, and risk factors in selected populations are published each year (e.g., the annual assessments of the human immunodeficiency virus [HIV] epidemic2), the only comprehensive assessments of the state of health in the world have been the various revisions of the GBD Study for 1990, 1999–2002, and 2004.1,3-10 The advantage of the GBD approach is that consistent methods are applied to critically appraise available information on each condition, make this information comparable and systematic, estimate results from countries with incomplete data, and report on the burden of disease with the use of standardized metrics.
The most recent assessment of the global burden of disease is the 2010 study (GBD 2010), which provides results for 1990, 2005, and 2010. Several hundred investigators collaborated to report summary results for the world and 21 epidemiologic regions in December 2012.11-18 Regions based on levels of adult mortality, child mortality, and geographic contiguity were defined. GBD 2010 addressed a number of major limitations of previous analyses, including the need to strengthen the statistical methods used for estimation.11 The list of causes of the disease burden was broadened to cover 291 diseases and injuries. Data on 1160 sequelae of these causes (e.g., diabetic retinopathy, diabetic neuropathy, amputations due to diabetes, and chronic kidney disease due to diabetes) have been evaluated separately. The mortality and burden attributable to 67 risk factors or clusters of risk factors were also assessed.
GBD 2010, which provides critical information for guiding prevention efforts, was based on data from 187 countries for the period from 1990 through 2010. It includes a complete reassessment of the burden of disease for 1990 as well as an estimation for 2005 and 2010 based on the same definitions and methods; this facilitated meaningful comparisons of trends. The prevalence of coexisting conditions was also estimated according to the year, age, sex, and country. Detailed results from global and regional data have been published previously.11-18
The internal validity of the results is an important aspect of the GBD approach. For example, demographic data on all-cause mortality according to the year, country, age, and sex were combined with data on cause-specific mortality to ensure that the sum of the number of deaths due to each disease and injury equaled the number of deaths from all causes. Similar internal-validity checks were used for
Global Burden of Disease (2010)
Disease                  Years Lost (million)
Ischemic heart disease   129.8
HIV-AIDS                 81.5
Respiratory cancers      46.9
A disability-adjusted life year (DALY) is a measure of overall disease burden, expressed as the number of years lost due to ill-health, disability or early death.
Murray et al. NEJM. 2013. DOI: 10.1056/NEJMra1201534
100 Million Pennies
http://www.kokogiak.com/megapenny
US Life Expectancy
Gregg Easterbrook (September 17, 2014) What Happens When We All Live to 100? The Atlantic
Calico: 500 Million USD
Increasing R&D Spending per New Drug Approval
[Figure, first panel: Spending per Drug* by year, 1950 to 2010, with an exponential model (Td = 9.44).]
[Figure, second panel: log10(Spending per Drug*) by Year, 1950 to 2010, with a linear model, confidence interval, and prediction interval (R² = 0.95).]
*Spending in Billions of 2008 Dollars; data from doi:10.1038/nrd3681
Himmelstein, Daniel; Baranzini, Sergio (2014): Increasing R&D Spending per New Drug Approval. figshare. http://dx.doi.org/10.6084/m9.figshare.937004
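The two panels of this figure describe the same relationship: a straight line on a log10 scale corresponds to exponential growth, and a doubling time follows from the slope of that line. The sketch below illustrates the idea in Python. It assumes that Td on the slide denotes a doubling time in years, and it uses a made-up series that doubles every 10 years rather than the figshare data; the function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def doubling_time(slope_log10):
    """Time needed for the quantity to double, given a log10 slope per year."""
    return math.log10(2) / slope_log10

# Hypothetical series that doubles exactly every 10 years (illustration only).
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
spending = [0.01 * 2 ** ((y - 1950) / 10) for y in years]

# Fit a line to log10(spending) and convert its slope to a doubling time.
intercept, slope = fit_line(years, [math.log10(s) for s in spending])
td = doubling_time(slope)
print(f"log10 slope per year: {slope:.4f}, doubling time: {td:.2f} years")
```

Under the same reading, a Td of 9.44 would correspond to a log10 slope of roughly 0.032 per year.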
Physarum polycephalum: Slime Mold
Plasmodium:
• vegetative state
• acellular
• multinuclear
• protoplasmic veins (tubules)
• locomotion by pulsation
  - surface tension
http://youtu.be/MX2Fo4k6pxE
Signature habitat: shady, cool, moist
the present
Hand Drawn Map of SF by Jenni Sparks
The exponential rise of ‘omics’
Andrew Su on Twitter
‘omics’ — collective characterization and quantification of biomolecules
Data Scientist:
Data Scientist: The Sexiest Job of the 21st Century
Meet the people who can coax treasure out of messy, unstructured data. by Thomas H. Davenport and D.J. Patil
ARTWORK Tamar Cohen, Andrew J Buboltz 2011, silk screen on a page from a high school yearbook, 8.5" x 12"
Spotlight
When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink—and you probably leave early.”
70 Harvard Business Review October 2012
SPOTLIGHT ON BIG DATA
Artwork: Tamar Cohen, Andrew J Buboltz, 2011
Definition (Wikipedia): the study of the generalizable extraction of knowledge from data
comparison with maternal grandmother
The Dawn of Personalized Genomics
NHGRI GWAS Catalog
Open Source Explosion
Audio from: Let’s Talk Bitcoin! #134 Disruptive Leaps Andreas Antonopoulos & Jeffrey Tucker
Science graphic from: http://nisd.net/academics/elementary-science
the past
Hand Drawn Map of SF by Jenni Sparks
• Aggregate microbial rDNA content of a seawater sample
• richness of operational taxonomic units (OTUs)
• species distribution modeling
Diversity of the Marine Metagenome
Ladau et al. (2013) ISME doi:10.1038/ismej.2013.37
Katie Pollard
[Map: sampling locations by data set (MICROBIS, FUHRMAN2008, POMMIER2007, GOS); longitude -180° to 180°, latitude -90° to 90°.]
Figure S1: Sampling locations for data used in constructing maps. Models with zero to eight parameters were fitted using MICROBIS data. Predictive performance of the models was evaluated using both internal measures of model performance (AIC, BIC, and PRESS) and three independent data sets, collected at the locations shown in red, green, and yellow (see Table S1). Analyses were based on 377 samples (234 MICROBIS, 30 GOS, 9 POMMIER2007, 103 FUHRMAN2008) collected from 164 distinct locations.
Diversity in June
Ladau et al. (2013) ISME doi:10.1038/ismej.2013.37
linear model at a rarefaction depth of 4266 sequences, with de novo sequence classification. To estimate ranges of individual taxa, we used SDMs with a logistic regression model (Franklin and Miller, 2009). Data used for model fitting are available in Supplementary File 3.
We performed 15 analyses, labeled Analyses I–XV, to check the robustness of the diversity maps and to model the distributions of different taxa and groups of taxa (Supplementary Tables S1 and S5). For Analyses I–XI, we log-transformed richness and Shannon diversity.
Robustness analyses
Analyses I–V checked the robustness of overall diversity patterns that we report. Analysis I used a linear model, with OTUs identified using de novo clustering, and a rarefaction depth of 4266 sequences. Analysis II checked whether the patterns are affected by the classification method. It was the same as Analysis I, but used OTUs identified by the Ribosomal Database Project (RDP) classifier, a reference-based procedure. We ran the RDP classifier with and without a 50% bootstrap threshold. Using the bootstrap threshold introduces significant bias to the data set, because sequences with high similarity to known bacterial genera are not evenly distributed across latitudes. Without a bootstrap threshold, the relative diversity patterns of RDP classified genera are very similar to those from de novo OTUs (Supplementary Figure S2). Analysis III checked for effects of rarefaction depth. It was the same as Analysis I, but used a rarefaction depth of 150 sequences rather than 4266 sequences. Analysis IV checked whether using a linear model affected our results. It implemented a nonlinear, multiple adaptive regression splines model (MARS) in lieu of the linear model, but was otherwise like Analysis I. Analysis V checked whether our patterns were dependent on the diversity metric used. It was the same as Analysis I, but used Shannon diversity instead of OTU richness. The results of all five analyses were qualitatively alike (Figure 1, Supplementary Figures S2–5), so in the main text we focus on the results from Analysis I.
Additional diversity maps
Analyses VI–XI mapped the distribution of richness of OTUs within certain phyla. Analyses XII–XV mapped the distributions of select genera of marine bacteria.
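The excerpt above relies on three quantities: rarefaction to a fixed sequencing depth, OTU richness, and Shannon diversity. The following is a minimal Python sketch of those three ideas on a made-up sample; the OTU labels, the sample composition, the depth of 200, and the function names are all assumptions for illustration and do not reproduce the authors' pipeline.

```python
import math
import random
from collections import Counter

def rarefy(otu_labels, depth, seed=0):
    """Subsample `depth` sequences without replacement so samples are comparable."""
    if depth > len(otu_labels):
        raise ValueError("sample has fewer sequences than the rarefaction depth")
    return random.Random(seed).sample(otu_labels, depth)

def richness(otu_labels):
    """Number of distinct OTUs observed."""
    return len(set(otu_labels))

def shannon(otu_labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over OTU relative abundances."""
    counts = Counter(otu_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical sample: one OTU label per sequenced read (illustration only).
sample = ["otu_a"] * 500 + ["otu_b"] * 300 + ["otu_c"] * 150 + ["otu_d"] * 50
subsample = rarefy(sample, depth=200)

print("richness after rarefaction:", richness(subsample))
print("Shannon diversity after rarefaction:", round(shannon(subsample), 3))
print("log10 richness of the full sample:", round(math.log10(richness(sample)), 3))
```

Rarefying every sample to the same depth matters because richness grows with the number of reads; without it, deeply sequenced samples would look more diverse for purely technical reasons.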
[Figure 1 panels: maps colored by Log10(OTU Richness), scale 2.05 to 2.65, each with a marginal plot of Log10(OTU Richness) by Latitude, -90 to 90.]
Figure 1 Maps of predicted global marine bacterial diversity. Color scale shows relative richness of marine surface waters as predicted by SDM. Samples were rarefied to 4266 rDNA sequences to enable accurate estimation of relative richness patterns on a global scale from data sets with different sequencing depths. True richness is expected to exceed estimated values. (a) In December, OTU richness peaks in temperate and higher latitudes in the Northern Hemisphere. (b) In June, OTU richness peaks in temperate latitudes in the Southern Hemisphere. Predicted richness during the spring and fall is intermediate, with roughly globally uniform richness near the equinoxes (movie available in Supplementary File 2). Predicted richness patterns remain qualitatively the same regardless of the taxonomic classification method (Supplementary Figure S2), modeling method (Supplementary Figure S3), choice of environmental predictors (Supplementary Figure S4) and sequencing depth (Supplementary Figure S5). Error rates for the predictions are generally low, as indicated by 95% confidence intervals on the marginal plots (right panels, shaded gray) and maps of standard errors (Supplementary Figure S6). Grayed regions on the maps are areas where environmental raster data and, hence, predictions are unavailable. Richness estimates in most regions are interpolated rather than extrapolated (Supplementary Figure S7).
Global marine bacterial diversity, J Ladau et al., The ISME Journal, p. 1671
Diversity in December
Ladau et al. (2013) ISME doi:10.1038/ismej.2013.37
Slime Mold & the Greater Tokyo Rail System
Tero et al (2010) Science DOI: 10.1126/science.1177894
http://youtu.be/GwKuFREOgmo
• 17 cm (7 in) agar-filled petri dish
• plasmodium for Tokyo
• quaker oats for cities
• vegetate for a day
• decentralized, distributed planning
Tero et al (2010) Science DOI: 10.1126/science.1177894
aftermath: no illumination
aftermath: geographic constraint using illumination
The SlimeNet was comparable or preferable to the RealNet in terms of:
• efficiency
• fault tolerance
• cost
Actual Rail Network
Slime Tubule Network
Tero et al (2010) Science DOI: 10.1126/science.1177894
Human Evolution & Population Genetics
John Novembre
Ryan Hernandez
• 3,192 Europeans
• 500,568 SNPs
• Reduced to 2D (PCA)
Veeramah & Hammer (2014) Nat Rev Genet doi:10.1038/nrg3625
out-of-Africa bottleneck
• Europeans have less genetic diversity than Africans
Novembre et al (2008) Nature doi:10.1038/nature07331
Genes mirror geography within Europe
Novembre et al (2008) Nature doi:10.1038/nature07331
• Despite the low diversity in Europeans, 500 thousand common variants discriminate population diversity with high resolution.
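To make the "reduced to 2D (PCA)" step more concrete, here is a minimal pure-Python sketch on a tiny made-up genotype matrix (rows are individuals, columns are SNPs coded as 0, 1, or 2 copies of an allele). The data, the function names, and the use of power iteration are assumptions for illustration; the slide does not describe the cited study's pipeline, and an analysis of roughly 500,000 SNPs would rely on dedicated software.

```python
import math
import random

def center_columns(matrix):
    """Subtract each SNP's mean genotype so PCA captures variation between individuals."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[g - m for g, m in zip(row, means)] for row in matrix]

def top_components(matrix, k=2, iterations=200):
    """Top-k principal component scores via power iteration on the Gram matrix X X^T."""
    x = center_columns(matrix)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)] for i in range(n)]
    rng = random.Random(0)
    scores = []
    for _ in range(k):
        v = [rng.uniform(-1, 1) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in w))
            v = [c / norm for c in w]
        eigenvalue = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n)) for i in range(n))
        scores.append([c * math.sqrt(max(eigenvalue, 0.0)) for c in v])
        # Deflate: remove the component just found before searching for the next one.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return scores

# Hypothetical genotypes: three individuals from each of two populations.
genotypes = [
    [0, 0, 1, 0, 2, 1],  # population A
    [0, 1, 0, 0, 1, 2],
    [1, 0, 0, 0, 2, 2],
    [2, 2, 1, 2, 1, 2],  # population B
    [2, 1, 2, 2, 2, 1],
    [1, 2, 2, 2, 1, 1],
]
pc1, pc2 = top_components(genotypes)
for i, (a, b) in enumerate(zip(pc1, pc2)):
    print(f"individual {i}: PC1 = {a:+.2f}, PC2 = {b:+.2f}")
```

In this toy example the two populations fall on opposite sides of the first component, which is the same kind of structure that lets genetic coordinates mirror geography.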
Medical Informatics - An invited segment by Antoine Lizée -
How to build intelligence around
patient medical records
Adriana Karembeu & Antoine Lizee at Sandler Neurosciences Center, UCSF
4500 visits - 600 patients – 10th year (UCSF EPIC STUDY)
Images (~200MB/visit)
• Brain MRI: T1, T2, proton density
• Processed MRI: Cortical Thickness, Myelin
• Overlays: CT, Myelin, Anatomical labels
Genotypes (~1MB/patient)
• GWAS: 500,000+ SNPs
• HLA: A, B, C, DRB1, DQB1
(Para-) Clinical Data (~250 variables/visit)
• Patient data: Age, sex, history, etc.
• Clinical data: Clinical Scores, treatments
• Patient reported: Quality of Life questionnaires
• Processed data: MRI-based
Reference Data
Visits
[Figure labels: hometown, kin, college, debate, camp, Guatemala, UCSF, research, Dartmouth]
The Friendship Network of Daniel Himmelstein
Learn more online at: http://dhimmel.com
• 1,278 nodes • 40,255 edges
[Legend: Highschool, Camp, College, Kin, UCSF, Research, Debate]
1,278 nodes (1 type) 40,255 edges (1 type)
http://dhimmel.com
Facebook Friends
[Network diagram node types: Genes, Diseases, Pathophysiologies, Tissues, Genomic Positions, Perturbations, Canonical Pathways, BioCarta, KEGG, Reactome, miRNA, TFBS, Cancer Hoods, Cancer Modules, GO: BP, GO: MF, GO: CC, Oncogenic, Immunologic]
Complex Diseases
29,241 nodes (19 types)
1,608,168 edges (20 types)
http://het.io
G T De l
G G Di a
a aG D G Da
G Da
MetaPaths
GT
De
lG
GD
ia
Multiple SclerosisIRF1 IL2RA4 1 1 4
Multiple SclerosisIRF1 IRF84 1 1 4
Multiple SclerosisIRF1 CXCR44 2 1 4
Multiple SclerosisIRF1 Leukocyte
2 1 1 1
metapath paths pathdegreeproduct
degreeweighted
path count
0.707
0.25
0.25
0.177
0.677
0.707
ITCH
Lung
SUMO1
Multiple Sclerosis
IRF1
Leukocyte
Crohn’s Disease
IL2RAIRF8
CXCR4
STAT3
expression
interaction asso
ciatio
n
loca
lizat
ion
asso
ciat
ion
asso
ciat
ion
association
inter
actio
n
Graph Subset
a mG D P Dm
i iG G G Da
a lG D T Dl
e eG T G Da
MetaGraph
MSigDBCollection
Disease
Tissue
Gene
expr
essio
n localization
association
Patho-physiology
mem
bers
hip
mem
bership
inte
raction
A
B
C
D
PDP (path) =Y
d2Dpath
d�wm mG M G Dam mG M G Dam mG M G Dam mG M G Dam mG M G Dam mG M G Dam mG M G Da
m mG M G Dam mG M G Dam mG M G Dam mG M G Dam mG M G Dam mG M G Dam mG M G Da
metaedge-specific degrees
Network
Figure (panels A-D): MetaGraph, Graph Subset, MetaPaths, metaedge-specific degrees
MetaGraph node types: Gene, Disease, Tissue, Pathophysiology, MSigDB Collection
MetaGraph edge types: expression, localization, association, membership, interaction
Graph Subset nodes: ITCH, SUMO1, IRF1, IL2RA, IRF8, CXCR4, STAT3, Lung, Leukocyte, Multiple Sclerosis, Crohn’s Disease
MetaPaths: GaD, GiGaD, GeTlD, GaDaGaD, GaDmPmD, GiGiGaD, GaDlTlD, GeTeGaD, GmMmGaD (×14)

metapath | path | metaedge-specific degrees | path degree product
GiGaD | IRF1 - IL2RA - Multiple Sclerosis | 4, 1, 1, 4 | 0.25
GiGaD | IRF1 - IRF8 - Multiple Sclerosis | 4, 1, 1, 4 | 0.25
GiGaD | IRF1 - CXCR4 - Multiple Sclerosis | 4, 2, 1, 4 | 0.177
GeTlD | IRF1 - Leukocyte - Multiple Sclerosis | 2, 1, 1, 1 | 0.707

degree-weighted path count: GiGaD = 0.677, GeTlD = 0.707

$\mathrm{PDP}(path) = \prod_{d \in D_{path}} d^{-w}$
$\mathrm{DWPC}(metapath) = \sum_{path \in Paths} \mathrm{PDP}(path)$
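The path degree product and degree-weighted path count on this slide can be checked with a few lines of Python. This is a minimal sketch, not the author's implementation: the function names `pdp` and `dwpc` are hypothetical, and the damping exponent `w = 0.5` is an assumption inferred from the slide's numbers (for example, 16 raised to -0.5 gives the 0.25 shown).

```python
from math import prod

def pdp(degrees, w=0.5):
    """Path degree product: each metaedge-specific degree raised to -w, multiplied."""
    return prod(d ** -w for d in degrees)

def dwpc(paths, w=0.5):
    """Degree-weighted path count: sum of PDPs over all paths of one metapath."""
    return sum(pdp(degrees, w) for degrees in paths)

# Degrees along each IRF1 -> Multiple Sclerosis path, as read off the slide.
gigad_paths = [(4, 1, 1, 4), (4, 1, 1, 4), (4, 2, 1, 4)]  # via IL2RA, IRF8, CXCR4
getld_paths = [(2, 1, 1, 1)]                              # via Leukocyte

print(round(dwpc(gigad_paths), 3))  # 0.677
print(round(dwpc(getld_paths), 3))  # 0.707
```

Raising degrees to a negative power downweights paths that run through highly connected nodes, so a path through a hub such as CXCR4 (degree 2 on one metaedge) contributes less than a path through IL2RA.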
Feature Computation
Figure: Standardized Coefficient (x-axis) per feature, by method
Features: GaD (any gene), GaD (any disease), GiGaD, GeTlD, GiGeTlD, GeTeGaD, GiGiGaD, GaDaGaD, GaDlTlD, GaDmPmD, {Positional}, {Cancer Hood}, {Cancer Module}, {BioCarta}, {KEGG}, {Reactome}, {GO Process}, {GO Function}, {GO Component}, {miRNA Target}, {TF Target}, {Oncogenic}, {Immunologic}, {Perturbation}
Method (AUROC): ridge (0.829), lasso (0.823)
Machine Learning
Figure: ROC curve (x-axis: False Positive Rate, y-axis: Recall). Partition (AUROC): Testing (0.829), Training (0.810)
Figure: Ridge precision-recall curve (x-axis: Recall, y-axis: Precision; legend: Prediction Threshold). AUPRC = 0.062
Performance
Figure: density of Meta2.5 P-values (x-axis: P-value, 0.0 to 1.0; y-axis: Density)
Combine Predictions & Statistical Evidence
Gene | Meta2.5 | HNLP | WTCCC2
JAK2 | 0.047 | 0.102 | 0.0015
REL | 0.001 | 0.040 | 0.0003
SH2B3 | 0.012 | 0.034 | 0.0130
RUNX3 | 0.016 | 0.025 | 0.0073
Table 5. Multiple sclerosis gene discovery.

Statistic | Value
Prediction Threshold | 0.024
False Positive Rate | 0.001
Recall | 0.108
Precision | 0.133
Lift | 68.4
Novel & Meta2.5-nominal Total | 1211
Discovered | 4
Bonferroni Cutoff | 0.0125
Discovered < Bonferroni | 3
Total < Bonferroni | 199
Replication p-value | 0.015
Table 6. Multiple sclerosis gene discovery statistics.
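The Bonferroni numbers in Table 6 can be reproduced from the WTCCC2 column of Table 5. The sketch below assumes a family-wise significance level of 0.05 split across the four discovered genes; the slide does not state that level, but 0.05 / 4 matches the 0.0125 cutoff shown. The variable names are illustrative only.

```python
# WTCCC2 values for the four discovered genes, copied from Table 5.
replication_p = {"JAK2": 0.0015, "REL": 0.0003, "SH2B3": 0.0130, "RUNX3": 0.0073}

alpha = 0.05  # assumption: not stated on the slide, chosen because 0.05 / 4 = 0.0125
cutoff = alpha / len(replication_p)

# Genes whose replication p-value falls below the Bonferroni cutoff.
passing = sorted(gene for gene, p in replication_p.items() if p < cutoff)

print(cutoff)   # 0.0125
print(passing)  # ['JAK2', 'REL', 'RUNX3']
```

Three of the four genes pass, which agrees with the "Discovered < Bonferroni" row; SH2B3 at 0.0130 narrowly misses the cutoff.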
Discover Novel Susceptibility Genes
Interactive Web Browser - http://het.io
Mechanisms of Pathogenesis
Figure: AUROC (y-axis, 0.4 to 1.0) per feature, with Lasso and Ridge shown for reference. Legend: Pathophysiology (degenerative, immunologic, metabolic, neoplastic, psychiatric, unspecific)
Panel: Gene-{MSigDB Collection}-Gene-Disease DWPC Model. Features: Positional, Cancer Hood, BioCarta, GO Component, miRNA Target, GO Function, Reactome, Oncogenic, TF Target, KEGG, GO Process, Cancer Module, Immunologic, Perturbation
Panel: Degree-Weighted Path Count / Path Count Model. Features: GiGaD, GeTeGaD, GeTlD, GiGeTlD, GaDaGaD, GaDmPmD, GiGiGaD, GaDlTlD, GaD (any gene), GaD (any disease)
Sergio Baranzini
Ryan Hernandez
John Witte
Andrej Sali
Katie Pollard
Patsy Babbitt
Decreased Lung Cancer at High Elevations
Figure, panel A: Incidence (age-adjusted cases per 100,000) vs. Elevation (km), four subplots: β = −9.167, R² = 0.202; β = −7.781, R² = 0.109; β = −2.072, R² = 0.059; β = 3.545, R² = 0.011
Figure, panel B: Incidence Residual vs. Elevation Residual, four subplots: β = −7.234, R² = 0.252; β = −4.056, R² = 0.04; β = 0.648, R² = 0.006; β = 4.861, R² = 0.015
Axis labels: A, Lung Cancer Incidence vs. Elevation (km); B, Lung Cancer Incidence Residual vs. Residual Elevation
Bivariate Plot
Partial Regression Plot
• Counties of the American West
• Lung cancer versus elevation
• Publicly-available data
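For readers new to partial regression plots, the idea can be sketched in plain Python: regress both the outcome and the predictor of interest on a covariate, then regress the two sets of residuals on each other. Everything below is illustrative. The numbers are synthetic (not the county data from the slides), the single covariate `smoking` is a hypothetical stand-in for whatever the real model adjusted for, and the helper names are made up for this example.

```python
def ols_fit(x, y):
    """Simple least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(x, y):
    """Part of y left unexplained after regressing it on x."""
    slope, intercept = ols_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Synthetic data: incidence built exactly as 60 - 7*elevation + 2*smoking.
elevation = [0.1, 0.5, 1.0, 1.5, 2.0, 0.3, 1.2, 1.8]
smoking = [22.0, 18.0, 20.0, 15.0, 17.0, 25.0, 16.0, 19.0]
incidence = [60.0 - 7.0 * e + 2.0 * s for e, s in zip(elevation, smoking)]

# Remove the covariate from both variables, then relate what is left.
elevation_resid = residuals(smoking, elevation)
incidence_resid = residuals(smoking, incidence)
partial_slope, _ = ols_fit(elevation_resid, incidence_resid)

# Should recover the -7.0 elevation effect used to build the data.
print(round(partial_slope, 6))
```

The slope of the residual-on-residual fit equals the elevation coefficient from the full multiple regression, which is why panel B can display an adjusted association in two dimensions.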
Figure: Standardized Elevation Coefficient (y-axis, -0.6 to 0.4) by Subset Size (x-axis, 2 to 8) for Lung, Breast, Colorectal, and Prostate cancer; color scale: Model BIC (500 to 700)
Association specific to lung cancer
Kamen Simeonov
• Inhaled carcinogen
• Oxygen concentration decreases by ~11% for every 1000 meter rise in elevation
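As a rough worked example of the ~11% figure, the sketch below compounds the drop per 1,000 meters. Treating the decline as compounding rather than linear is an assumption made here for illustration, and `relative_oxygen` is a hypothetical helper, not something from the talk.

```python
def relative_oxygen(elevation_m, drop_per_km=0.11):
    """Fraction of sea-level oxygen remaining, compounding the per-1,000 m drop."""
    return (1.0 - drop_per_km) ** (elevation_m / 1000.0)

for meters in (0, 1000, 2000, 3000):
    print(meters, round(relative_oxygen(meters), 3))
```

Under this assumption, a county at 2,000 meters has roughly 79% of the oxygen available at sea level.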
the future
Hand Drawn Map of SF by Jenni Sparks
Subscription Publishing
Health science journal subscription costs are skyrocketing
© Association of Research Libraries, 2013
Figure: Library and University Expenditure Trends (Time-Series), 1982 to 2011
Series: Library Expenditure as % of Total University Expenditure (Average of Select US ARL Libraries), axis 1.50% to 3.70%; Total University Expenditure (Average of Select US ARL Libraries), axis $0 to $20,000, labeled "In ten thousands (multiply values by 100000)"
Figure: % of university budgets for libraries by year (x-axis: 1982 to 2011; y-axis: 1.7% to 3.7%)
Library budgets are nosediving
http://www.library.ucsf.edu/services/scholpub/journalcosts
• Libraries are canceling subscriptions
• Research is paywalled, inaccessible to those who could benefit
• Scientists desire their findings to be widely applied
• Research funding is public
• Very small percentage of individuals have institutional access
• Academia doesn’t succeed in a vacuum; innovation grows from diverse and plentiful inputs
Audio from: Let’s Talk Bitcoin! #134, Disruptive Leaps, Andreas Antonopoulos
Per Article Cost from "Open Access: Market Size, Share, Forecast, and Trends" Outsell. January 31, 2013
Subscription: $4,000.00
Open Access: $950.00
UCSF Open Access Fund http://www.library.ucsf.edu/services/scholpub/oa/fund/eligibility
Fully OA Journal: $2,000 Hybrid OA: $1,000
• PeerJ — Lifetime publishing plan for $99
• eLife — currently no APC, “pain free publication”
• PLOS, BMC, Specialty Pubs
• F1000 Research, pre-review publication
• preprints, arXiv & bioRxiv
Article-level metrics
doi:10.1371/journal.pone.0013636.g005
Open Access increases Citations
Gargouri et al. PLOS One. 2010
• Alternative to journal impact factor
• Citations, downloads, views, social media
• Accelerates science — impact factor = rejection
• Expands the audience evaluating article importance and quality
• Already used: h-index
Grow in importance
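Since the h-index is named as an article-level metric already in use, here is a short sketch of how it is computed: an author has index h if h of their papers have at least h citations each. The function name and the citation counts are hypothetical, chosen only to illustrate the definition.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h

# Hypothetical citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

Because it is built from per-article citation counts, the h-index depends on the articles themselves rather than on the impact factor of the journals that published them.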
Public Data increases Citations
citations
Piwowar & Vision (2013) DOI: 10.7717/peerj.175
• 10,555 microarray studies
• Classified studies by data availability
• 8 categories of covariates
Availability & Reuse
• only applies to original research articles
• journals often withhold the typeset version
• does not affect reuse
Creative Commons Attribution Alone
Mandatory Archiving
NIH: PubMed Central
UC: eScholarship
• subscription journals require the transfer of article ownership
• enforce the article copyright
• require licensing for reuse
Tools for Efficiency & Reproducibility
Version control:
Online code repositories:
Interactive programming environments:
IPython Notebook
Personal Website
Clint Cario clintcario.com
Daniel Himmelstein dhimmel.com
Brian O’Donovan iambrianodonovan.com
Kieran Mace mace.co
Andrew Sczesnak andrewsczesnak.com
Kyle Barlow kylebarlow.com
your beginning
Hand Drawn Map of SF by Jenni Sparks