
CANDID: A candidate gene identification tool

Janna Hutz

[email protected]

March 19, 2007

Candidate genes

• Positional
– Linkage evidence
– Deletion syndrome
– Loss of heterozygosity
– Disease-related amplification
– Association

• Biological
– Pathways
– Phenotypic characteristics

ACT[A/G]GGA

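A case study: acd

[Figure: genetic map running from 0 cM to ~82 cM, showing the loci acd, Os, and Es-1, the markers D8Mit5, D8Mit79, and D8Mit13, and the labeled positions ~31 cM, 25 cM, 38.7 cM, 67 cM, and 51 cM (acd)]

[Figure: fractions shown along the map: 1/145, 3/145, 0/145, 0/145, 1/145]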

Which gene is acd?

Prioritization tools

• Endocrinologist/Geneticist

• Ensembl

• RT-PCR

• Sequencing

BINGO!…two years later.

How can we improve this?

Improve our tools

• Clinician
– Has memorized information about many disorders; can name some relevant genes
– Gets his/her information from…

PubMed

PubMed

• How do we use PubMed to analyze our candidates?

– Enter our phenotypic keywords into PubMed. Read the papers that come up in the results. Make a list of genes.

– Do PubMed searches for all the candidates. Read the papers that come up in the results. Rate the candidates.

Better: Don’t do it yourself…

PubMed

• Each publication has a PubMed ID

• Each gene has a Gene ID

• Wouldn’t it be nice if we could link Gene IDs and PubMed IDs?

– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
– gene2pubmed.gz
– TaxonomyID; GeneID; PubMedID
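As a minimal sketch of how such a link file could be used, the Python below groups PubMed IDs by Gene ID. It assumes the tab-separated column order named above (taxonomy ID, Gene ID, PubMed ID) with a header line starting with `#`. The sample rows and the function name are made up for illustration and are not taken from CANDID.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the gene2pubmed column order (not real data).
SAMPLE = (
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1001\t111\n"
    "9606\t1001\t222\n"
    "9606\t1002\t222\n"
    "10090\t2001\t333\n"
)

def load_gene_to_pubmed(handle, tax_id=None):
    """Map each Gene ID to the set of PubMed IDs linked to it."""
    links = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the '#'-prefixed header line.
        if not row or row[0].startswith("#"):
            continue
        tax, gene_id, pubmed_id = row[:3]
        # Optionally keep only one species (e.g. 9606 for human).
        if tax_id is not None and tax != str(tax_id):
            continue
        links[gene_id].add(pubmed_id)
    return dict(links)

human_links = load_gene_to_pubmed(io.StringIO(SAMPLE), tax_id=9606)
print(human_links)
```

For the downloaded file, the same function could be given a handle from `gzip.open(path, "rt")` instead of the in-memory sample.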


Who makes that file? (1)

• From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

• Links between Gene and PubMed are the result of the following:

• 1. Manual curation within NCBI. Part of the process of generating a REVIEWED RefSeq is an analysis of the current literature. Papers that are seminal in defining the gene, its sequence, and its function are added to the record at that time. Alert users point out gaps or errors in papers associated with a Gene record. These messages are reviewed and implemented as required.


Who makes that file? (2)

• 2. Integration of information from other public databases. Gene integrates gene-citation from resources external to NCBI such as model organism-specific databases, Gene Ontology (GO), groups curating interactions, and sequence databases. The assumption in using these sources is that they report citations specific to a gene in a known species. Gene does not process citations from OMIM automatically, because many of the citations in OMIM refer to studies of genes in species other than human.

Example 1

pancreatic cancer sequence candidates

$

Help Sally.

• Use CANDID’s literature criterion

http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html

User: workshop Password: perl031907

Help Sally.

• Look for genes that are involved with pancreatic cancer.

• What are some keywords we can use?

A measure of relevancy

• Find relevant publications

• Is Gene X linked to these publications?

• How many publications match?

• What percent of Gene X’s publications match?

By the numbers…

• Literature scores run from 0 to 1.

• The score is…

(Number of gene’s publications that match) / (Number of gene’s publications)
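The ratio above can be sketched in a few lines of Python. The IDs are hypothetical, and treating a gene with no linked publications as scoring 0 is an assumption made here so the division is always defined.

```python
def literature_score(gene_pubmed_ids, matching_pubmed_ids):
    """Fraction of a gene's publications that match the keyword search."""
    gene_pubs = set(gene_pubmed_ids)
    if not gene_pubs:
        # Assumption: a gene with no linked publications scores 0.
        return 0.0
    matched = gene_pubs & set(matching_pubmed_ids)
    return len(matched) / len(gene_pubs)

# Hypothetical IDs: Gene X has four publications, two of which match.
gene_x_pubs = {"111", "222", "333", "444"}
keyword_hits = {"222", "444", "999"}
print(literature_score(gene_x_pubs, keyword_hits))  # 0.5
```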

Matching

• Every publication has a “Text Words” field that includes, when available, …
– Title
– Abstract
– Other abstract
– MeSH terms
– MeSH subheadings
– Publication types
– Substance names
– Personal name as subject
– MEDLINE secondary source
– Other terms
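One way to restrict a keyword search to this field is PubMed's `[tw]` search tag. The sketch below only builds an E-utilities `esearch` URL and does not contact NCBI. Whether CANDID queries PubMed this way is an assumption, and the helper name, the OR-combination of keywords, and the `retmax` value are illustrative choices.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def text_word_query_url(keywords, retmax=100):
    """Build an esearch URL that ORs the keywords within Text Words."""
    # Quote each keyword so multi-word phrases are searched as phrases.
    term = " OR ".join(f'"{kw}"[tw]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

url = text_word_query_url(["pancreatic cancer", "pancreatic neoplasms"])
print(url)
```

The PubMed IDs returned by such a query would be the "relevant publications" that each gene's linked publications are compared against.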

Summary

Results

Exporting to Excel

• Output file is a comma-separated file

• Download it and change the .output extension to .csv.

• If Excel doesn’t open it automatically when you click on it, paste the data into a new sheet and use the Text Import Wizard to separate the columns.
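The same rename-and-open steps can be scripted. This is a small sketch with a hypothetical file name and made-up column names; the real CANDID output columns may differ.

```python
import csv
import io
from pathlib import Path

def csv_name(output_path):
    """Return the same path with its extension changed to .csv."""
    return Path(output_path).with_suffix(".csv")

# Hypothetical file name.
print(csv_name("results.output"))  # results.csv

# Made-up comma-separated content standing in for a downloaded file.
sample = "gene,score\nGENE_A,0.5\nGENE_B,0.25\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["gene"], rows[0]["score"])  # GENE_A 0.5
```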

Drawbacks

• What if a gene isn’t associated with any publications?
– It’s not important
– It’s not yet characterized

What about those genes?

Analyzing the “other genes”

• We don’t have literature data.

• We don’t have expression data.

• All we have is a sequence.

Fun with sequences

• DNA
– Cross-species conservation

• RNA (cDNA)
– Cross-species conservation
– Protein sequence prediction

• Protein conservation

• Protein domain prediction

Protein domains

• InterPro

• Conserved Domain Database (NCBI)

• Wouldn’t it be nice if we could link Gene IDs and protein domains?

InterPro

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA

Who makes those links?

• From http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

• Links between Entrez Gene and Conserved Domain Database (CDD) are calculated from the domains annotated by the CDD group on Reference Sequence proteins.

How can we use this?

• The CDD domains have descriptions.

• These descriptions can be searched…

1. CANDID finds domains containing our keywords.

2. If a gene has one of those domains, it gets a score of 1.

…just like when we searched PubMed!
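The two steps can be sketched as follows. The domain IDs and descriptions are made up, and both the case-insensitive substring matching and the score of 0 for genes without a matching domain are assumptions, since the slides only state that a match scores 1.

```python
def matching_domains(domain_descriptions, keywords):
    """Return IDs of domains whose description contains any keyword."""
    lowered = [kw.lower() for kw in keywords]
    return {
        domain_id
        for domain_id, text in domain_descriptions.items()
        if any(kw in text.lower() for kw in lowered)
    }

def domain_score(gene_domains, hits):
    """1 if the gene has at least one matching domain, otherwise 0."""
    return 1 if set(gene_domains) & hits else 0

# Made-up domain IDs and descriptions, for illustration only.
descriptions = {
    "dom1": "Protein kinase catalytic domain",
    "dom2": "Zinc finger, DNA binding",
    "dom3": "Insulin-like growth factor binding",
}
hits = matching_domains(descriptions, ["kinase", "insulin"])
print(sorted(hits))                          # ['dom1', 'dom3']
print(domain_score(["dom2", "dom3"], hits))  # 1
print(domain_score(["dom2"], hits))          # 0
```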

How far back does our gene go?

• Is our gene in mammals?

• Fish?

• Bacteria?

More sequence fun

• Many measures of conservation
– Nucleotide similarity (percentage, pairwise)
– Amino acid similarity (percentage, pairwise)
– etc., etc.

HomoloGene

• Gets sequences

• Uses amino acid AND nucleotide similarity measures

• Plus lots more math, equals…

• A label that answers our question

Labels used in CANDID

• Homo sapiens
• Primates (chimp, gorilla)
• Rodents (rat, mouse)
• Eutherian mammals (dog, cow, cat)
• Amniota (chicken)
• Insects (mosquito, bee)
• Bilateria (C. elegans)
• Fungi
• Eukaryotes

HIGHER SCORE
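A rough sketch of turning these labels into a number is shown below. The slides give only the ordering of the labels, so two things here are assumptions: that the score rises toward the end of the list (genes conserved across broader groups score higher), and that the values are evenly spaced between 0 and 1.

```python
# Labels in slide order, from the narrowest group to the broadest.
LABELS = [
    "Homo sapiens", "Primates", "Rodents", "Eutherian mammals",
    "Amniota", "Insects", "Bilateria", "Fungi", "Eukaryotes",
]

def conservation_score(label):
    """Evenly spaced score in [0, 1]; later labels score higher.

    Assumption: the direction and the even spacing are illustrative,
    not CANDID's actual values.
    """
    return LABELS.index(label) / (len(LABELS) - 1)

print(conservation_score("Homo sapiens"))  # 0.0
print(conservation_score("Amniota"))       # 0.5
print(conservation_score("Eukaryotes"))    # 1.0
```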

Example 2

pancreas: tumor tissue

pancreas: normal tissue

custom microarray

Known and unknown genes

Array candidates

• Let’s increase the number of CANDID results we got in Example 1…

Weighting system

• Prioritize genes of known or unknown function

• Modify weights for each category

• Well-characterized genes: higher literature weight

• Uncharacterized genes: higher domain and conservation weights
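To illustrate how weights shift the ranking, here is a sketch that combines the per-criterion scores as a weighted average. The slides do not give CANDID's actual combination formula, so the weighted average, the weight values, and the example scores are all assumptions.

```python
def combined_score(scores, weights):
    """Weighted average of per-criterion scores (each in 0..1)."""
    total_weight = sum(weights.get(name, 0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(scores[name] * weights.get(name, 0) for name in scores)
    return weighted / total_weight

# Hypothetical scores for one poorly characterized gene.
scores = {"literature": 0.0, "domains": 1.0, "conservation": 0.5}

# Two illustrative weightings (values are made up).
known_function = {"literature": 3, "domains": 1, "conservation": 1}
unknown_function = {"literature": 1, "domains": 3, "conservation": 3}

print(combined_score(scores, known_function))
print(combined_score(scores, unknown_function))
```

With these made-up numbers, the gene ranks higher under the second weighting, which favors domains and conservation over literature.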

Example 3

• Make up your own example!

• Use literature, domains, and/or conservation criteria.

Next week

• Expression data

• Linkage data

• Association data

• CANDID’s efficiency

• Anything else?
