http://cs273a.stanford.edu [Bejerano Spr06/07]

TTh 11:00-12:15 in Clark S361
Profs: Serafim Batzoglou, Gill Bejerano
TAs: George Asimenos, Cory McLean

Lecture 18

Chains & Nets
Non-coding Transcripts

Chaining Alignments
• Chaining bridges the gulf between syntenic blocks and base-by-base alignments.
• Local alignments tend to break at transposon insertions, inversions, duplications, etc.
• Global alignments tend to force non-homologous bases to align.
• Chaining is a rigorous way of joining together local alignments into larger structures.
[Jim Kent’s slides]

Chains join together related local alignments

Protease Regulatory Subunit 3

Chains
• A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
• Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
• Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
• Not just orthologs, but paralogs too, can result in good chains. But that's useful!
• Chains should be symmetrical: e.g. swap human-mouse chains to mouse-human, and you should get approximately the same chains as if you had chained swapped mouse-human blastz alignments.
• Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
• Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
[Angie Hinrichs, UCSC wiki]
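The chain properties listed above (gapless blocks, no overlaps, non-decreasing coordinates, double-sided gaps) can be sketched as a small data structure. This is only an illustration: the `Block` type, the half-open coordinate convention, and the function names are assumptions made for the example, not the UCSC implementation.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One gapless aligned block (hypothetical type); coordinates are half-open."""
    t_start: int  # start in the target sequence
    q_start: int  # start in the query sequence
    size: int     # gapless, so the length is the same on both sides

    @property
    def t_end(self) -> int:
        return self.t_start + self.size

    @property
    def q_end(self) -> int:
        return self.q_start + self.size


def is_valid_chain(blocks: List[Block]) -> bool:
    """True if consecutive blocks never overlap or go backwards in target or query."""
    for prev, cur in zip(blocks, blocks[1:]):
        if cur.t_start < prev.t_end or cur.q_start < prev.q_end:
            return False
    return True


def gap_kind(prev: Block, cur: Block) -> str:
    """Classify the gap between two consecutive blocks of a valid chain."""
    dt = cur.t_start - prev.t_end  # unaligned bases in the target
    dq = cur.q_start - prev.q_end  # unaligned bases in the query
    if dt > 0 and dq > 0:
        return "double-sided"
    if dt > 0:
        return "target-only"
    if dq > 0:
        return "query-only"
    return "none"


chain = [Block(0, 0, 10), Block(15, 10, 5), Block(30, 40, 8)]
print(is_valid_chain(chain))
print([gap_kind(a, b) for a, b in zip(chain, chain[1:])])
```

The second gap in the example skips bases in both sequences at once, which is the double-sided case the slide describes.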

Affine penalties are too harsh for long gaps

Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
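To see why an affine penalty is harsh on long gaps, the sketch below compares an affine cost with a concave cost. The logarithmic form and every parameter value here are assumptions chosen only for illustration; they are not the gap costs used by the chaining program.

```python
import math


def affine_gap_cost(length: int, gap_open: float = 400.0, gap_extend: float = 30.0) -> float:
    """Affine cost: a fixed opening charge plus a constant charge per gapped base."""
    return gap_open + gap_extend * length


def log_gap_cost(length: int, gap_open: float = 400.0, scale: float = 300.0) -> float:
    """Concave cost (hypothetical): each extra base costs less as the gap gets longer."""
    return gap_open + scale * math.log(length)


# Gap lengths must be >= 1 for the logarithm to be defined.
for length in (1, 10, 100, 1000, 10000):
    print(length, affine_gap_cost(length), round(log_gap_cost(length), 1))
```

With these made-up parameters the affine cost of a 10,000-base gap is roughly a hundred times the concave cost, so a transposon-sized gap would break an affine-scored alignment that a concave score could bridge.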

Before and After Chaining

Chaining Algorithm

• Input: blocks of gapless alignments from blastz
• Dynamic program based on the recurrence relationship:

$$\mathrm{score}(B_i) = \max_{j<i}\,\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$$

• Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Timing is O(N log N), where N is the number of blocks (in the hundreds of thousands).
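The recurrence can be made concrete with a deliberately naive version. The sketch below is quadratic in the number of blocks and does not use the KD-tree speedup mentioned on the slide; the block tuple layout and the linear `gap_cost` are assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical block layout: (t_start, q_start, size, match_score)
Block = Tuple[int, int, int, float]


def gap_cost(prev: Block, cur: Block) -> float:
    """Assumed gap cost: one unit per skipped base in target plus query."""
    dt = cur[0] - (prev[0] + prev[2])
    dq = cur[1] - (prev[1] + prev[2])
    return 1.0 * (dt + dq)


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Naive O(N^2) version of score(Bi) = max_j(score(Bj) + match(Bi) - gap(Bi, Bj))."""
    blocks = sorted(blocks)            # by target start, so predecessors come first
    n = len(blocks)
    if n == 0:
        return 0.0, []
    score = [b[3] for b in blocks]     # a block may also start a new chain on its own
    back = [-1] * n
    for i in range(n):
        cur = blocks[i]
        for j in range(i):
            prev = blocks[j]
            # prev may precede cur only if it ends before cur starts on both sequences
            if cur[0] >= prev[0] + prev[2] and cur[1] >= prev[1] + prev[2]:
                cand = score[j] + cur[3] - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:                   # walk the back-pointers to recover the chain
        chain.append(blocks[end])
        end = back[end]
    chain.reverse()
    return best, chain


blocks = [(0, 0, 10, 100.0), (20, 15, 10, 100.0), (12, 500, 5, 50.0), (40, 30, 10, 100.0)]
best_score, chain = best_chain(blocks)
print(best_score, chain)
```

In this toy input the block at query position 500 is left out of the best chain because the gap needed to reach it costs more than the block is worth.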

Netting Alignments

• Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.
• Net finds the best mouse match for each human region.
• Highest scoring chains are used first.
• Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
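The idea of using the highest scoring chains first and letting lower scoring chains fill the remaining gaps can be sketched as a greedy pass over the target. This is a much simplified illustration, not the netting program: it tracks only target coverage, it does not record nesting levels, and the input layout `(score, name, target_blocks)` is an assumption.

```python
from typing import List, Tuple

# Hypothetical chain layout: (score, name, [(t_start, t_end), ...]) with half-open blocks
Chain = Tuple[float, str, List[Tuple[int, int]]]


def net_target(chains: List[Chain]) -> List[Tuple[int, int, str]]:
    """Assign each target base to the best-scoring chain that has a block covering it."""
    claimed: List[Tuple[int, int, str]] = []
    for _score, name, blocks in sorted(chains, key=lambda c: c[0], reverse=True):
        existing = sorted(claimed)     # coverage claimed by higher-scoring chains
        pieces = []
        for start, end in blocks:
            pos = start
            for c_start, c_end, _ in existing:
                if c_end <= pos:
                    continue           # claimed interval lies entirely to the left
                if c_start >= end:
                    break              # claimed interval lies entirely to the right
                if c_start > pos:
                    pieces.append((pos, c_start, name))  # free stretch before it
                pos = max(pos, c_end)
            if pos < end:
                pieces.append((pos, end, name))          # free stretch at the end
        claimed.extend(pieces)
    return sorted(claimed)


chains = [
    (1000.0, "chainA", [(0, 100), (200, 300)]),
    (400.0, "chainB", [(80, 150), (160, 220)]),
]
net = net_target(chains)
print(net)
```

Here chainB survives only inside the 100-200 gap of chainA, so every target base ends up with at most one match.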

Net Focuses on Ortholog

Nets

• A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels.
• A net is single-coverage for target but not for query.
• Because it's single-coverage in the target, it's no longer symmetrical.
• The netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
• Nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.

[Angie Hinrichs, UCSC wiki]

"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process. Same-species liftOver chains are generated by a series of scripts that use blat -fastMap as the alignment method.
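As a rough illustration of what a chain lets you do with coordinates, the sketch below maps a target position to the query through a list of `(size, dt, dq)` triples: an aligned block length followed by the gap to the next block in target and in query. It assumes both sequences are on the plus strand, and the function name and return convention are hypothetical; it is not the liftOver program.

```python
from typing import List, Optional, Tuple


def lift_position(pos: int, t_start: int, q_start: int,
                  blocks: List[Tuple[int, int, int]]) -> Optional[int]:
    """Map a target position to the query, or return None if it falls outside any block."""
    t, q = t_start, q_start
    for size, dt, dq in blocks:
        if t <= pos < t + size:
            return q + (pos - t)   # same offset inside the gapless block
        t += size + dt             # skip the block and the target-side gap
        q += size + dq             # skip the block and the query-side gap
    return None


# Hypothetical chain: three blocks; the final block has no trailing gap.
blocks = [(100, 50, 0), (200, 0, 30), (50, 0, 0)]
print(lift_position(1010, 1000, 5000, blocks))
print(lift_position(1120, 1000, 5000, blocks))
```

A position that lands in a gap has no counterpart in the other sequence, which is why the sketch returns `None` rather than guessing a nearby coordinate.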

[Angie Hinrichs, UCSC wiki]

Before and After Chaining

Net highlights rearrangements

A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Useful in finding pseudogenes

Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics

Number of rearrangements of given type per megabase, excluding known transposons.

A Rearrangement Hot Spot

Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

Cautionary Note 1

Cautionary Note 2

Same Region…

same in all the other fish

Orthology vs. Paralogy

Non-coding Transcripts

Human Specific Rapid Evolution

[Figure labels: hmr, hmr c; 100%id, 100%id; maximally changed]

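Nearest Neighbor Model for RNA Secondary Structure Free Energy at 37 °C:

[Figure: RNA hairpin (sequence labels as extracted: C G U U U G G GUU / CACAAACG) annotated with nearest-neighbor terms -2.0, -2.1, -0.9, -0.9, -1.8, -1.6 and +5.0 kcal/mol]

$$\Delta G_{\text{helix}} = \Delta G_{\mathrm{CGGC}} + \Delta G_{\mathrm{GUCA}} + 2\,\Delta G_{\mathrm{UUAA}} + \Delta G_{\mathrm{UGAC}} = -2.0 - 2.1 + 2\times(-0.9) - 1.8 = -7.7\ \text{kcal/mol}$$

$$\Delta G_{\text{hairpin loop}} = \Delta G_{\text{initiation}}(6\ \text{nucleotides}) + \Delta G_{\text{mismatch}}(\mathrm{GGCA}) = 5.0 - 1.6 = 3.4\ \text{kcal/mol}$$

$$\Delta G_{\text{total}} = \Delta G_{\text{hairpin}} + \Delta G_{\text{helix}} = 3.4 - 7.7 = -4.3\ \text{kcal/mol}$$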

Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.
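The arithmetic on this slide can be checked with a few lines of code. The sketch below only sums the terms quoted on the slide; the dictionary keys reuse the slide's four-letter labels, and the variable names are assumptions for illustration. It does not look anything up in the full nearest-neighbor parameter tables.

```python
# Stacking terms quoted on the slide, in kcal/mol, keyed by the slide's labels.
stack_energy = {"CGGC": -2.0, "GUCA": -2.1, "UUAA": -0.9, "UGAC": -1.8}
# How many times each stack occurs in the helix (UUAA appears twice).
stack_count = {"CGGC": 1, "GUCA": 1, "UUAA": 2, "UGAC": 1}

hairpin_initiation = 5.0   # initiation term for a 6-nucleotide loop
terminal_mismatch = -1.6   # mismatch term labeled GGCA on the slide

dg_helix = sum(stack_energy[k] * stack_count[k] for k in stack_energy)
dg_hairpin = hairpin_initiation + terminal_mismatch
dg_total = dg_hairpin + dg_helix

# Round to one decimal place to hide floating-point noise in the sums.
print(round(dg_helix, 1), round(dg_hairpin, 1), round(dg_total, 1))
```

The favorable stacking energy of the helix outweighs the unfavorable cost of closing the loop, which is why the total comes out negative.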

Transcripts, transcripts everywhere

Human Genome

Transcribed (Tx)

Tx from both strands

Leaky tx?

Functional?
