CANDID: a candidate gene identification tool. Janna Hutz, March 19, 2007
Candidate genes
• Positional
– Linkage evidence
– Deletion syndrome
– Loss of heterozygosity
– Disease-related amplification
– Association
• Biological
– Pathways
– Phenotypic characteristics
ACT[A/G]GGA
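A case study: acd
[Figure: genetic map running from 0 cM to ~82 cM. Labels as extracted: acd, Os, Es-1, ~31 cM; markers D8Mit5, D8Mit79, D8Mit13; positions 25 cM, 38.7 cM, 67 cM; acd, 51 cM.]
[Figure: fractions shown on the map: 1/145, 3/145, 0/145, 0/145, 1/145.]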
Which gene is acd?
Prioritization tools
• Endocrinologist/Geneticist
• Ensembl
• RT-PCR
• Sequencing
BINGO!…two years later.
How can we improve this?
Improve our tools
• Clinician
– Has memorized information about many disorders; can name some relevant genes
– Gets his/her information from…
PubMed
PubMed
• How do we use PubMed to analyze our candidates?
– Enter our phenotypic keywords into PubMed. Read the papers that come up in the results. Make a list of genes.
– Do PubMed searches for all the candidates. Read the papers that come up in the results. Rate the candidates.
Better: Don’t do it yourself…
PubMed
• Each publication has a PubMed ID
• Each gene has a Gene ID
• Wouldn’t it be nice if we could link Gene IDs and PubMed IDs?
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
– gene2pubmed.gz
– TaxonomyID; GeneID; PubMedID
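A minimal sketch of how such a file could be loaded, assuming the decompressed gene2pubmed file is tab-delimited with the three columns listed above and a header line starting with `#`. The sample rows and the function name `load_gene2pubmed` are made up for illustration; this is not CANDID's own code.

```python
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout: tax_id <TAB> GeneID <TAB> PubMed_ID.
SAMPLE = (
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1\t1000001\n"
    "9606\t1\t1000002\n"
    "9606\t2\t1000002\n"
    "10090\t3\t1000003\n"
)

def load_gene2pubmed(handle, tax_id=None):
    """Map each Gene ID to the set of PubMed IDs linked to it."""
    gene_to_pmids = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        tax, gene, pmid = line.rstrip("\n").split("\t")
        if tax_id is not None and tax != str(tax_id):
            continue  # keep only the requested species
        gene_to_pmids[gene].add(pmid)
    return dict(gene_to_pmids)

# 9606 is the NCBI taxonomy ID for human.
human_links = load_gene2pubmed(io.StringIO(SAMPLE), tax_id=9606)
print(human_links)
```

For the real download, `gzip.open(path, "rt")` would give a text handle that can be passed to the same function.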
Who makes that file? (1)
• From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
• Links between Gene and PubMed are the result of the following:
• 1. Manual curation within NCBI. Part of the process of generating a REVIEWED RefSeq is an analysis of the current literature. Papers that are seminal in defining the gene, its sequence, and its function are added to the record at that time. Alert users point out gaps or errors in papers associated with a Gene record. These messages are reviewed and implemented as required.
Who makes that file? (2)
• 2. Integration of information from other public databases. Gene integrates gene-citation from resources external to NCBI such as model organism-specific databases, Gene Ontology (GO), groups curating interactions, and sequence databases. The assumption in using these sources is that they report citations specific to a gene in a known species. Gene does not process citations from OMIM automatically, because many of the citations in OMIM refer to studies of genes in species other than human.
Example 1
[Figure labels: pancreatic cancer; sequence candidates; $]
Help Sally.
• Use CANDID’s literature criterion
http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html
User: workshop Password: perl031907
Help Sally.
• Look for genes that are involved with pancreatic cancer.
• What are some keywords we can use?
A measure of relevancy
• Find relevant publications
• Is Gene X linked to these publications?
• How many publications match?
• What percent of Gene X’s publications match?
By the numbers…
• Literature scores run from 0 to 1.
• The score is…
(Number of gene’s publications that match) / (Number of gene’s publications)
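The ratio above can be sketched in a few lines. This is an illustration only: the function name, the PubMed IDs, and the choice to return 0 for a gene with no linked publications are all assumptions, not details taken from CANDID.

```python
def literature_score(gene_pmids, relevant_pmids):
    """Fraction of a gene's publications that also appear in the relevant set."""
    if not gene_pmids:
        return 0.0  # assumption: a gene with no linked publications scores 0
    return len(gene_pmids & relevant_pmids) / len(gene_pmids)

relevant = {"101", "102", "103"}        # hypothetical keyword-search hits
gene_x = {"101", "102", "200", "201"}   # hypothetical publications linked to Gene X

score_x = literature_score(gene_x, relevant)
print(score_x)  # 2 of 4 publications match -> 0.5
```

Because the denominator is the gene's own publication count, a heavily studied gene is not rewarded simply for having many papers.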
Matching
• Every publication has a “Text Words” field that includes, when available, …
– Title
– Abstract
– Other abstract
– MeSH terms
– MeSH subheadings
– Publication types
– Substance names
– Personal name as subject
– MEDLINE secondary source
– Other terms
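As a rough picture of what matching against such a combined field involves, the sketch below joins whichever fields a record has and looks for keywords in the result. The record layout, field names, and sample records are hypothetical, and plain substring matching is a simplification chosen for the example, not a description of how PubMed itself searches.

```python
FIELDS = ("title", "abstract", "mesh_terms", "substance_names", "other_terms")

def text_words(record):
    """Join whichever fields are present into one lowercase search string."""
    return " ".join(record[f] for f in FIELDS if record.get(f)).lower()

def matches(record, keywords):
    """True if any keyword occurs anywhere in the record's combined text."""
    text = text_words(record)
    return any(k.lower() in text for k in keywords)

# Hypothetical records keyed by made-up PubMed IDs.
records = {
    "101": {"title": "A study of pancreatic cancer", "mesh_terms": "Pancreatic Neoplasms"},
    "102": {"title": "Insulin secretion in beta cells", "abstract": "Islet physiology."},
}

hits = {pmid for pmid, rec in records.items()
        if matches(rec, ["pancreatic", "adenocarcinoma"])}
print(hits)
```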
Summary
Results
Exporting to Excel
• Output file is a comma-separated file
• Download it and change the file extension from .output to .csv.
• If Excel doesn’t open it automatically when you click on it, paste the data into a new sheet and use the Text Import Wizard to separate the columns.
Drawbacks
• What if a gene isn’t associated with any publications?
– It’s not important
– It’s not yet characterized
What about those genes?
Analyzing the “other genes”
• We don’t have literature data.
• We don’t have expression data.
• All we have is a sequence.
Fun with sequences
• DNA
– Cross-species conservation
• RNA (cDNA)
– Cross-species conservation
– Protein sequence prediction
• Protein conservation
• Protein domain prediction
Protein domains
• InterPro
• Conserved Domain Database (NCBI)
• Wouldn’t it be nice if we could link Gene IDs and protein domains?
InterPro
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
Who makes those links?
• From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
• Links between Entrez Gene and Conserved Domain Database (CDD) are calculated from the domains annotated by the CDD group on Reference Sequence proteins.
How can we use this?
• The CDD domains have descriptions.
• These descriptions can be searched…
1. CANDID finds domains containing our keywords.
2. If a gene has one of those domains, it gets a score of 1.
…just like when we searched PubMed!
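The two steps above might look like this in code. It is a sketch under assumptions: the domain identifiers and descriptions are invented, and case-insensitive substring matching stands in for whatever search CANDID actually performs.

```python
def domain_score(gene_domains, domain_descriptions, keywords):
    """Return 1 if any of the gene's domains has a description containing a keyword, else 0."""
    lowered = [k.lower() for k in keywords]
    # Step 1: find domains whose description contains one of the keywords.
    matching = {
        dom for dom, desc in domain_descriptions.items()
        if any(k in desc.lower() for k in lowered)
    }
    # Step 2: the gene scores 1 if it carries at least one of those domains.
    return 1 if gene_domains & matching else 0

# Hypothetical domain IDs and descriptions.
descriptions = {
    "dom_A": "Protein kinase catalytic domain",
    "dom_B": "Zinc finger, C2H2 type",
}

print(domain_score({"dom_A"}, descriptions, ["kinase"]))  # 1
print(domain_score({"dom_B"}, descriptions, ["kinase"]))  # 0
```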
How far back does our gene go?
• Is our gene in mammals?
• Fish?
• Bacteria?
More sequence fun
• Many measures of conservation
– Nucleotide similarity (percentage, pairwise)
– Amino acid similarity (percentage, pairwise)
– etc., etc.
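One of the simplest pairwise percentage measures is percent identity. The sketch below assumes the two sequences have already been aligned to the same length (with `-` for gaps); the function name is made up, and the example pair is the two alleles of the ACT[A/G]GGA sequence shown earlier.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions where both sequences have the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where both sequences have a gap.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if not (a == "-" and b == "-")]
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if a == b)
    return 100.0 * same / len(pairs)

identity = percent_identity("ACTAGGA", "ACTGGGA")
print(round(identity, 1))  # 6 of 7 positions agree -> 85.7
```

The same function works for amino acid sequences, since it only compares characters position by position.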
HomoloGene
• Gets sequences
• Uses amino acid AND nucleotide similarity measures
• Plus lots more math, equals…
• A label that answers our question
Labels used in CANDID
• Homo sapiens
• Primates (chimp, gorilla)
• Rodents (rat, mouse)
• Eutherian mammals (dog, cow, cat)
• Amniota (chicken)
• Insects (mosquito, bee)
• Bilateria (C. elegans)
• Fungi
• Eukaryotes
HIGHER SCORE
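A label like this can be turned into a number by ranking the labels. The sketch below makes two assumptions that the slides do not state: that the score rises toward the bottom of the list (the further back a gene goes, the higher its score), and that the steps are evenly spaced between 0 and 1. The values are illustrative, not CANDID's.

```python
# Labels in the order listed above; assumed to run from least to most deeply conserved.
LABELS = [
    "Homo sapiens", "Primates", "Rodents", "Eutherian mammals", "Amniota",
    "Insects", "Bilateria", "Fungi", "Eukaryotes",
]

def conservation_score(label):
    """Evenly spaced score in [0, 1] based on the label's position (an assumption)."""
    return LABELS.index(label) / (len(LABELS) - 1)

print(conservation_score("Homo sapiens"))  # 0.0
print(conservation_score("Amniota"))       # 0.5
print(conservation_score("Eukaryotes"))    # 1.0
```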
Example 2
[Figure labels: pancreas: tumor tissue; pancreas: normal tissue; custom microarray]
Known and unknown genes
Array candidates
• Let’s increase the number of CANDID results we got in Example 1…
Weighting system
• Prioritize genes of known or unknown function
• Modify weights for each category
• Well-characterized genes: higher literature weight
• Uncharacterized genes: higher domains, conservation weights
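One way to picture the weighting is as a weighted average of the per-category scores. That formula is an assumption made for this sketch (the slides do not give CANDID's actual combination rule), and the weights and scores below are invented.

```python
def combined_score(scores, weights):
    """Weighted average of category scores; categories without a weight count as 0."""
    total_weight = sum(weights.get(cat, 0) for cat in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(cat, 0) for cat, score in scores.items())
    return weighted / total_weight

# Hypothetical well-characterized gene: emphasize literature.
known = combined_score(
    {"literature": 0.5, "domains": 1, "conservation": 0.5},
    {"literature": 2, "domains": 1, "conservation": 1},
)

# Hypothetical uncharacterized gene: rely on domains and conservation only.
unknown = combined_score(
    {"literature": 0.0, "domains": 1, "conservation": 1.0},
    {"literature": 0, "domains": 1, "conservation": 1},
)

print(known, unknown)  # 0.625 1.0
```

Setting the literature weight to 0 keeps an uncharacterized gene from being penalized for having no publications.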
Example 3
• Make up your own example!
• Use literature, domains, and/or conservation criteria.
Next week
• Expression data
• Linkage data
• Association data
• CANDID’s efficiency
• Anything else?