Post on 21-Dec-2015
Spaghetti Code, Soupy Logic
Adventures in Gene Expression & Genome Annotation
Jim Kent
University of California Santa Cruz
A Challenge Every Speaker Faces:
• Who is the audience?
• Bioinformaticians:
– Biologists with bigger, better databases?
– Geeks trading bits for bases?
– Leading edge interdisciplinary super scientists?
Top 5 Reasons Biologists Go Into Bioinformatics
• 5 - Microscopes and biochemistry are so 20th century.
• 4 - Got started purifying proteins, but it turns out the cold room is really COLD.
• 3 - After 23 years of school wanted to make MORE than 23,000/year in a postdoc.
• 2 - Like to swear, @ttracted to $_ Perl #!!
• 1 - Getting carpal tunnel from pipetting
Top 5 Reasons Computer People go into Bioinformatics
• 5 - Bio courses have some females.
• 4 - Human genome stabler than Windows XP
• 3 - Having mastered binary trees, quad trees, and parse trees, ready for phylogenetic trees.
• 2 - Missing heady froth of the internet bubble.
• 1 - Must augment humanity to defeat evil artificial intelligent robots.
The Paradox of Genomics
How does a long, static, one dimensional string of DNA turn into the remarkably complex, dynamic, and three dimensional human body?
GTTTGCCATCTTTTGCTGCTCTAGGGAATCCAGCAGCTGTCACCATGTAAACAAGCCCAGGCTAGACCAGTTACCCTCATCATCTTAGCTGATAGCCAGCCAGCCACCACAGGCATGAGT
Models and Metaphors
• When trying to understand something we like to build up metaphors and models.
• Computer programs are complex systems that ultimately are built up of 0’s and 1’s; perhaps they are a model for a genome built of A, C, G and T?
• Human genome lacks documentation, has accumulated 3 billion years of cruft, and does not believe in local variables.
• Therefore we must look to less than straightforward software programs as guides.
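To make the 0’s-and-1’s comparison concrete: each of the four bases fits in two bits. The sketch below packs a DNA string into an integer and back. The bit assignment and the function names are assumptions chosen for illustration, not something taken from the talk.

```python
# Hypothetical 2-bit code for the four bases (the assignment is arbitrary).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(dna):
    """Pack a DNA string into one integer, two bits per base."""
    value = 0
    for base in dna.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from a packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))
```

The length has to be carried separately, because leading A’s pack to zero bits and would otherwise be lost.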
Bioperl CORBA module
sub new {
    my ( $class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    my ( $idl, $ior, $orbname ) = $self->_rearrange( [ qw(IDL IOR ORBNAME)], @args);
    $self->{'_ior'} = $ior || 'biocorba.ior';
    $self->{'_idl'} = $idl || $ENV{BIOCORBAIDL} || 'biocorba.idl';
    $self->{'_orbname'} = $orbname || 'orbit-local-orb';
    $CORBA::ORBit::IDL_PATH = $self->{'_idl'};
    my $orb = CORBA::ORB_init($orbname);
    my $root_poa = $orb->resolve_initial_references("RootPOA");
    $self->{'_orb'} = $orb;
    $self->{'_rootpoa'} = $root_poa;
    return $self;
}
Obfuscated C
#define c(n,s)case n:s;continue
char x[]="((((((((((((((((((((((",w[]="\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b";
char r[]={92,124,47},l[]={2,3,1,0};
char*T[]={" |"," |","%\\|/%"," %%%",""};
char d=1,p=40,o=40,k=0,*a,y,z,g=-1,G,X,**P=&T[4],f=0;
unsigned int s=0;
void u(int i){int n;printf("\233;%uH\233L%c\233;%uH%c\233;%uH%s\23322;%uH@\23323;%uH \n",*x-*w,r[d],*x+*w,r[d],X,*P,p+=k,o);if(abs(p-x[21])>=w[21])exit(0);if(g!=G){struct itimerval t={0,0,0,0};g+=((g<G)<<1)-1;t.it_interval.tv_usec=t.it_value.tv_usec=72000/((g>>3)+1);setitimer(0,&t,0);f&&printf("\e[10;%u]",g+24);}f&&putchar(7);s+=(9-w[21])*((g>>3)+1);o=p;m(x);m(w);(n=rand())&255||--*w||++*w;if(!(**P&&P++||n&7936)){while(abs((X=rand()%76)-*x+2)-*w<6);++X;P=T;}(n=rand()&31)<3&&(d=n);!d&&--*x<=*w&&(++*x,++d)||d==2&&++*x+*w>79&&(--*x,--d);signal(i,u);}
void e(){signal(14,SIG_IGN);printf("\e[0q\ecScore: %u\n",s);system("stty echo -cbreak");}
int main(int C,char**V){atexit(e);(C<2||*V[1]!=113)&&(f=(C=*(int*)getenv("TERM"))==(int)0x756E696C||C==(int)0x6C696E75);srand(getpid());system("stty -echo cbreak");h(0);u(14);for(;;)switch(getchar()){case 113:return 0;case 91:case 98:c(44,k=-1);case 32:case 110:c(46,k=0);case 93:case 109:c(47,k=1);c(49,h(0));c(50,h(1));c(51,h(2));c(52,h(3));}}
Microsoft Windows
mouse, keyboard, network → elaborate proprietary process → blue screen of death
Looks like metaphor not enough, must study actual cells & DNA
How DNA is Used by the Cell
Promoter Tells Where to Begin
Different promoters activate different genes in different parts of the body.
A Computer in Soup
Idealized promoter for a gene involved in making hair. Proteins that bind to specific DNA sequences in the promoter region together turn a gene on or off. These proteins are themselves regulated by their own promoters, leading to a gene regulatory network with many of the same properties as a neural network.
Genes can be transcription factors that activate
or repress other genes, leading to regulatory networks
such as this one from the development of the central
nervous system. (Image from D’Haeseleer Somogyi 1999)
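The network idea can be sketched as a toy Boolean model in which each gene is either on or off and its next state is a logical function of the genes that regulate it. The gene names and the wiring below are hypothetical, invented only to show the mechanism; real regulatory networks are neither this small nor strictly Boolean.

```python
# Toy regulatory network: each gene is ON (True) or OFF (False).
# Gene names and wiring are hypothetical, chosen only for illustration.
RULES = {
    "activator": lambda s: not s["repressor"],
    "repressor": lambda s: s["hair_gene"],
    "hair_gene": lambda s: s["activator"] and not s["repressor"],
}


def step(state):
    """Update every gene at once from the previous state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def simulate(state, n_steps):
    """Return the list of states visited, starting with the initial one."""
    history = [state]
    for _ in range(n_steps):
        state = step(state)
        history.append(state)
    return history


start = {"activator": True, "repressor": False, "hair_gene": False}
for state in simulate(start, 5):
    print(state)
```

With this wiring the negative feedback from hair_gene through the repressor makes the network cycle rather than settle: after five steps it is back at its starting state.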
The Decisions of a Cell
• When to reproduce?
• When to migrate and where?
• What to differentiate into?
• When to secrete something?
• When to make an electrical signal?
The more rapid decisions usually are via the cell membrane and 2nd messengers. The longer acting decisions are usually made in the nucleus.
Nucleus Used to Appear Simple
• Cheek cells stained with basic dyes. Nuclei are readily visible.
Mammalian Nuclei Stained in Various Ways
Image from Tom Misteli lab
Artist’s rendition of nucleus
Image from nuclear protein database
Chromatin
Turning on a gene:
• Getting DNA into the right compartment of the nucleus (may involve very diffuse signals in DNA over very long distances)
• Loosening up chromatin structure (this involves activators and repressors which can act over relatively long distances)
• Attracting RNA Polymerase II to the transcription start site (these involve relatively close factors both upstream and downstream of transcription start).
Methods for Studying Transcription
• Genetics in model organisms
• Promoters hooked to reporter genes
• Gel shifts and DNase footprinting.
• Phylogenetic footprinting
• Motif searches in clusters of coregulated genes.
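The last method in the list can be sketched in its simplest form: look for short words (k-mers) that turn up in the promoters of every gene in a coregulated cluster. This is an illustration under simplifying assumptions, not the method used in the talk. The function name, the `min_fraction` parameter, and the example sequences are made up, and practical motif finders use statistical models rather than exact matches.

```python
from collections import Counter


def shared_kmers(promoters, k, min_fraction=1.0):
    """Return k-mers present in at least min_fraction of the sequences."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        # Count each k-mer once per sequence, however often it repeats.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    needed = min_fraction * len(promoters)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Made-up promoter fragments that all happen to contain TATAA.
promoters = ["AATATAAGG", "CCTATAACC", "GTATAAT"]
print(shared_kmers(promoters, 5))
```

Lowering `min_fraction` below 1.0 tolerates clusters where some genes lack the motif, at the cost of more chance matches.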
Drosophila Genetics
normal / Antennapedia mutant
Reporter Gene Constructs
promoter to study + easily seen gene
Drosophila embryo transfected with ftz promoter hooked up to lacZ reporter gene, creating stripes where ftz promoter is active.
Txn factor footprint
Gel showing selective protection of DNA from nuclease digestion where transcription factor is bound.
Biochemical Footprinting Assays
Pseudogenes
Creative Chaos & Genome
Finding Transcription Start
Phylogenetic Footprinting
Mouse Paints Some Promoters
RefSeq
Spliced EST
Mouse
Fish
Repeat
Crystallin - a gene expressed in the eye. Coding regions are very similar to crystallins in the liver, but the promoter is different.
Normalized eScores
Mouse/Human Chrom 7 Synteny
Motifs in Coregulated Genes
Conservation Levels of Regulatory Regions
Transition from Private Research Interests to Role in Genome Project
Assembly War Story
Building a Better Browser
Pretty Adventurous Programming
Genome Browser
BLAT
Gene Sorter
Table Browser
Service Organization
Parasol and Kilo Cluster
• UCSC cluster has 1000 CPUs running Linux
• 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment
• We wrote Parasol job scheduler to keep up.
– Very fast and free.
– Jobs are organized into batches.
– Error checking at job and at batch level.
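A quick back-of-the-envelope check on those figures shows why a fast scheduler mattered. The calculation assumes all 1000 CPUs were busy for the whole 25 hours, which the slide does not state, so the per-job figure is an upper bound on the average.

```python
cpus = 1000
jobs = 1_000_000
wall_hours = 25

wall_seconds = wall_hours * 3600                 # 90,000 seconds of wall time
jobs_per_second = jobs / wall_seconds            # rate the scheduler must sustain
cpu_seconds_per_job = cpus * wall_seconds / jobs # average if every CPU stays busy

print(f"{jobs_per_second:.1f} jobs dispatched per second")
print(f"{cpu_seconds_per_job:.0f} CPU-seconds per job on average")
```

At roughly 11 job starts per second, sustained for a full day, scheduling overhead of even a second or so per job would leave most of the cluster idle.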
Acknowledgements
Institutions
NHGRI, The Wellcome Trust, HHMI, Taxpayers in the US and worldwide.
Whitehead, Sanger, Wash U, Baylor, Stanford, DOE, and the international sequencing centers.
NCBI, Ensembl, Genoscope, The SNP Consortium, UCSC, Softberry, Affymetrix.
Individuals
David Haussler, Chuck Sugnet
Francis Collins, Bob Waterston, Eric Lander, John Sulston, Richard Gibbs
Lincoln Stein, Sean Eddy, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, Greg Schuler, Deanna Church, Asif Chinwalla, Kim Worley, the Gene Cats.
Everyone else!
THE END
gctcgttcaggggtaaaggtgtattctagatCCACAACAAGCCCCGTGGTCTAGCACAGC AAAGAGAAAAAAAGAGAACACGAAAATGCCCTTGCTCCCCTCCGGGGGCCCCTTTTGTGC GGTTCTTGCCAACGCAGCAGCCCTCCTGCTATATAGCCCGCCGCGCCgCAGCCCCACCCG CTCAGCGCCGCCGCCCCACCAGCTCAGCACCGCCGTGCGCCCAGCCAGCCATGGGGAAGG TGAGCCCAGCCTGCGCCCCGGGACCCCGGAGCTTCCTCCATCGCGGGGGCCAGAGACTGG GGCAGGAGCAGGCCTGTGAGACCTCGCCTTGTCCCGCCTTGCCTTGCAGATCACCCTCTA CGAGGACCGGGGCTTCCAGGGCCGCCACTATGAATGCAGCAGCGACCACCCCAACCTGCA GCCCTACTTGAGCCGCTGCAACTCGGCGCGCGTGGACAGCGGCTGCTGGATGCTCTATGA GCAGCCCAACTACTCGGGCCTCCAGTACTTCCTGCGCCGCGGCGACTATGCCGACCACCA GCAGTGGATGGGCCTCAGCGACTCGGTCCGCTCCTGCCGCCTCATCCCCCACGTGAGTAC ATCCTCAAGTCAGGACCCAGGCCCTCAGGACACTCACTGGAtgGTTTCAAGCAAAAGTTA AACATTAGAAGTAGTGATCAGTcacaataaCTGAGAGTGGACAAAAGATGAACTATAGTG GATTAAGTCAATAGagttTGCTCCCCACATAAGCAAAGTATTACCCAGACAcCAGTTAAT caCAATTAATCCACAAATATGTATTGAGTAGGAATGTGTCTCCTGCCctAGGGGTTGTAT
Coloring CRYGD Start
Trends in Society & Biology
Decade   Society           Biology
50’s     Cars are good     Mitochondria and metabolism
60’s     Recording         DNA as recording media of genes
70’s     Birth control     Working out the cell cycle
80’s     Yuppies           Start of serious genetic engineering
90’s     Microsoft rules   Incyte, Celera race to patent genome
2000’s
(The NEED for Bioinformatics)
• ~200 million bases of DNA are sequenced every day.
– Not much use without assembly.
• Protein and non-sequence data also being generated at a prodigious rate.
– How to store it and find the parts you want?
• Making models that are simple enough to understand, but rich enough to reflect the biology.
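The point about assembly can be illustrated with the core operation of overlap-based assembly: two reads are joined when the end of one matches the start of the other. This is a toy sketch with hypothetical function names and a made-up minimum overlap. Real assemblers must also cope with sequencing errors, repeats, and reads from either strand.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


# Two overlapping fragments cut from the start of the sequence shown earlier.
read1 = "GTTTGCCATCTTTTGC"
read2 = "CTTTTGCTGCTCTAGG"
print(overlap(read1, read2))
print(merge(read1, read2))
```

Here the reads share seven bases (CTTTTGC), so merging them recovers the first 25 bases of the original string.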
(My Road to a Bio PhD)
• Liked bio, but too many prerequisites!
• Had fun doing graphics/animation programming in 80’s & early 90’s.
• Bored of endlessly shifting Microsoft APIs
• Community college, UC extension to get bio BA equivalent in 97 & 98.
• UC Santa Cruz bio grad school 1999
• Interested in developmental biology and how a cell makes decisions.
Perhaps Must Study Actual Cells
Spaghetti Code or Soupy Logic
Steaming fresh modules in sourceforge.net
Combinatorial assembly of transcription factors in cell.