  • Statistical inference of phylogenetic networks

    Claudia Solís-Lemus 1

    Joint work with Cécile Ané 1 2

    1Department of Statistics, UW-Madison

    2Department of Botany, UW-Madison

    October 28, 2015

  • Motivation: Tree of life

    Solis-Lemus networks October 28, 2015 2 / 32

  • Data

  • Phylogenetic tree

    (a) Rooted (b) Unrooted

    Figure: Binary phylogenetic tree

  • Estimated tree

    Alligator

    Emu

    Kiwi

    Ostrich

    Swan

    Goose

    Chicken

    Falcon

    Finch

    Osprey

    Woodpecker

    Ibis

    Stork

    Vulture

    Penguin

  • Gene tree reconstruction

    Human

    Chimpanzee

    Gorilla

    Orangutan

    Markov model for sequence evolution

  • Markov model for sequence evolution: A,C,G,T

    $$L = \sum_{w} \sum_{y} \sum_{x} \pi(w)\, P_{t_6}(w, y)\, P_{t_5}(w, G)\, P_{t_3}(y, x)\, P_{t_4}(y, C)\, P_{t_2}(x, C)\, P_{t_1}(x, A)$$
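
    The triple sum above can be evaluated directly once a substitution model is chosen. The sketch below assumes the Jukes-Cantor (JC69) model, which the slides do not name, with uniform π and branch lengths in expected substitutions per site; the branch-length values are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69(t, i, j):
    """Jukes-Cantor transition probability P_t(i, j) for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(leaves, t):
    """Sum over internal states w, y, x for the four-leaf tree in the formula.

    leaves = (a1, a2, a4, a5): observed bases at the tips reached by
    branches t1 and t2 (children of x), t4 (child of y) and t5 (child of w).
    t maps the branch index 1..6 to its length.
    """
    a1, a2, a4, a5 = leaves
    total = 0.0
    for w, y, x in product(BASES, repeat=3):
        total += (0.25  # pi(w): uniform stationary distribution under JC69
                  * jc69(t[6], w, y) * jc69(t[5], w, a5)
                  * jc69(t[3], y, x) * jc69(t[4], y, a4)
                  * jc69(t[2], x, a2) * jc69(t[1], x, a1))
    return total

# Hypothetical branch lengths; the slide does not give numerical values.
branch_lengths = {k: 0.1 for k in range(1, 7)}

# Tip states from the formula: A on t1, C on t2, C on t4, G on t5.
L = site_likelihood(("A", "C", "C", "G"), branch_lengths)
print(L)
```

    Brute-force enumeration is only practical for a tree this small; a pruning algorithm would be used for larger trees.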

  • Gene tree reconstruction

    Numerical optimization for branch lengths: t

    Heuristic optimization for tree topology

    Available software:

    MrBayes (Huelsenbeck and Ronquist, 2001)

    RAxML (Stamatakis, 2014)

  • Challenge: big space of trees

    # Species   # Unrooted trees   # Rooted trees
    1           1                  1
    2           1                  1
    3           1                  3
    4           3                  15
    5           15                 105
    6           105                945
    7           945                10,395
    8           10,395             135,135
    9           135,135            2,027,025
    10          2,027,025          34,459,425
    11          34,459,425         654,729,075
    12          654,729,075        13,749,310,575
    13          13,749,310,575     316,234,143,225
    ...         ...                ...
    52          > # atoms in universe
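
    The counts in the table follow the standard double-factorial formulas for binary trees on n labeled tips: (2n-5)!! unrooted and (2n-3)!! rooted topologies. A short sketch that reproduces the rows (the function names are illustrative):

```python
def double_factorial(k):
    """Return k!! = k * (k-2) * (k-4) * ...; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted(n):
    """Number of unrooted binary tree topologies on n labeled tips."""
    return double_factorial(2 * n - 5)

def n_rooted(n):
    """Number of rooted binary tree topologies on n labeled tips."""
    return double_factorial(2 * n - 3)

for n in range(1, 14):
    print(n, n_unrooted(n), n_rooted(n))
```

    Each row's rooted count equals the next row's unrooted count, because rooting an unrooted tree is equivalent to attaching one extra tip.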

  • Challenge: gene tree discordance

    76% Human

    Chimpanzee

    Gorilla

    Orangutan

    12% Human

    Gorilla

    12% Human

    Orangutan

    Species tree and gene trees can be different!

  • Species tree vs Gene tree

    Human Chimpanzee Gorilla

  • From gene trees to species tree

    ?

  • Multiple reasons for gene tree discordance

    1. gene tree reconstruction error

    2. horizontal gene transfer (HGT)

    3. incomplete lineage sorting (ILS)

  • 2. Horizontal Gene Transfer (HGT)

    Psst! Hey kid! Wanna be a Superbug...?

    Stick some of this into your genome...

    Even penicillin won't be able to harm you...

    www.nearingzero.net

    www.quora.net

  • 2. HGT => Phylogenetic network

    (c) Rooted (d) Unrooted

    Figure: Binary phylogenetic network

  • 3. Incomplete lineage sorting (ILS)

    Gene trees differ inside a species tree

  • Multispecies coalescent model on a network (ILS+HGT)

    P(gene tree | species network, t, γ)

    Meng & Kubatko (2009); Yu, Degnan & Nakhleh (2012)
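
    For intuition, here is a minimal sketch of the simplest special case only: a three-taxon species tree with no hybridization (so no γ), where the coalescent gives the concordant topology probability 1 - (2/3)e^{-t} and each discordant topology (1/3)e^{-t}, with t the internal branch length in coalescent units. The taxon labels A, B, C are placeholders. The general network probability in the cited papers is more involved and is not reproduced here.

```python
import math

def triplet_probabilities(t):
    """Gene tree topology probabilities under the coalescent for the
    species tree ((A,B),C) with internal branch length t (coalescent units)."""
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

# Branch length at which each discordant topology has probability 0.12,
# as in the 76% / 12% / 12% split shown on the discordance slide.
t = -math.log(3 * 0.12)
probs = triplet_probabilities(t)
print(t, probs)
```

    Shorter internal branches push all three probabilities toward 1/3, which is why ILS alone can produce substantial discordance.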

  • Maximum likelihood estimation

    Model:

    Data: multiple gene alignments → multiple gene trees

    $$L(\text{network}, t, \gamma) = \prod_{g} P(g \mid \text{network}, t, \gamma)$$

    Numerical optimization for t, γ

    Heuristic optimization for network topology

    PhyloNet Maximum Likelihood: Yu, Dong, Liu & Nakhleh (2014)
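
    The numerical optimization step can be illustrated on a toy problem. The sketch below is not PhyloNet's method: it fixes a three-taxon species tree ((A,B),C) without hybridization, uses hypothetical gene tree counts, and maximizes the log of the product over gene trees in the single parameter t with a hand-written golden-section search.

```python
import math

def log_likelihood(t, counts):
    """log L(t) = sum over gene trees g of log P(g | tree, t).

    counts = (concordant, discordant_1, discordant_2) gene tree counts for
    the species tree ((A,B),C); t is in coalescent units.
    """
    d = math.exp(-t) / 3.0
    probs = (1.0 - 2.0 * d, d, d)
    return sum(c * math.log(p) for c, p in zip(counts, probs))

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0

counts = (76, 12, 12)  # hypothetical counts out of 100 gene trees
t_hat = golden_section_max(lambda t: log_likelihood(t, counts), 1e-6, 10.0)
print(t_hat)
```

    In this toy case the maximum has the closed form t = -log(1.5 × discordant fraction), which gives a check on the optimizer. The topology search has no such shortcut, hence the heuristic search named on the slide.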

  • Challenge: huge space of networks

  • Challenge: expensive computation of likelihood

    Gene tree: ((a, c), (b1, b2))

    Figure: Yu, Dong, Liu, Nakhleh (2014)

  • Maximum pseudolikelihood estimation

    Quartet-based inference:

    $$\tilde{L}(\text{network}, t, \gamma) \propto \prod_{q \in Q(\text{network})} \text{Likelihood}(q, t, \gamma)$$

    Numerical optimization for t, γ

    Heuristic optimization for network topology

    SNaQ Maximum Pseudolikelihood: S.-L., Ané (2015)
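
    A rough sketch of how such a product over quartets can be accumulated on the log scale. It assumes that each quartet term is a multinomial likelihood of the observed counts of the three quartet resolutions (dropping the multinomial constant, consistent with ∝), and it uses tree-like expected frequencies with a single hypothetical branch length for every quartet. The taxon names, counts and helper names are all illustrative; this is not the PhyloNetworks API.

```python
import math
from itertools import combinations

def quartet_probabilities(t):
    """Expected frequencies of the three resolutions of a tree-like 4-taxon
    set whose internal branch has length t (coalescent units)."""
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)

def log_pseudolikelihood(observed, expected):
    """Sum over 4-taxon sets q of sum_i count_i(q) * log p_i(q)."""
    total = 0.0
    for q, counts in observed.items():
        total += sum(c * math.log(p)
                     for c, p in zip(counts, expected[q]) if c > 0)
    return total

taxa = ["a", "b", "c", "d", "e"]                 # hypothetical taxon names
quartets = list(combinations(taxa, 4))           # Q: all 4-taxon subsets
observed = {q: (76, 12, 12) for q in quartets}   # hypothetical quartet counts
expected = {q: quartet_probabilities(1.0) for q in quartets}

score = log_pseudolikelihood(observed, expected)
print(len(quartets), score)
```

    The number of terms grows as n choose 4 in the number of taxa, and each term involves only four taxa, which is what keeps the computation cheap compared with the full likelihood.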

  • SNaQ (S.-L., Ané, 2015)

    Maximum pseudolikelihood estimation of unrooted networks

    Quartet-based inference

    Input: unrooted gene trees

    Fast, scalable, easy pseudolikelihood computation

    www.github.com/crsl4/PhyloNetworks

  • SNaQ (S.-L., Ané, 2015)

    Figure: Running time of PhyloNet and SNaQ against the number of genes (10, 30, 100, 300, 1k, 3k). Panels: n=6, h=1; n=6, h=2; n=10, h=2; n=15, h=3. Time axes are in minutes or hours, depending on the panel.

  • Identifiability issues: flat pseudolikelihood

  • Xiphophorus fish data (Cui et al., 2013)

    1183 genes, 24 swordtails and platyfish, 5 hybridizations

    X.alvarezi

    X.mayae

    X.hellerii

    X.signum

    X.clemenciae

    X.monticolus

    X.multilineatus

    X.nigrensis

    X.pygmaeus

    X.continens

    X.cortezi

    X.nezahualcoyotl

    X.montezumae

    X.malinche

    X.birchmanni

    X.milleri

    X.evelynae

    X.variatus

    X.couchianus

    X.gordoni

    X.meyeri

    X.xiphidium

    X.andersi

    X.maculatus

    Group labels on the figure: SS, NS, NP, SP

    Legend (two distinct markers on the tree): loss of sword; loss of ability to produce sword

  • Many challenges

    Huge space of networks
    - How to search efficiently?

    No clear idea of the shape of the pseudolikelihood function
    - Flatness => Identifiability issues:
      · Identify hybrid node in cycle?
    - Many local maxima

    Properties of the pseudolikelihood estimator

  • Final remark

    (SNaQ uses NLopt)

  • Acknowledgements

    Joint work with Cécile Ané

    Thanks:

    Doug Bates

    Noah Stenz

    Sarah Friedrich (for cool SNaQ logo)

    DEB 0949121

    DEB 1354793

    www.github.com/crsl4/PhyloNetworks
