Phenotype Information Retrieval for Existing GWAS Studies Neda Alipanah, Ph.D. University of California San Diego March 2013


2013 Summit on Clinical Research Informatics



Motivation

•  The database of Genotypes and Phenotypes (dbGaP) archives the results of different Genome-Wide Association Studies (GWAS).
•  Phenotype variables are not harmonized across studies.
•  The same phenotype appears under redundant identifiers.
•  dbGaP lacks semantic relations among its variables.
•  Search on phenotypes is not accurate.

Goals

•  Standardize dbGaP information to allow accurate, reusable, and quick retrieval of information.

Problem Statement (Example of Redundant Variables)

dbGaP Structure

dbGaP Study phs000284.v1.pht001903.v1.CFS_CARe_ECG.data_dict_2011_02_07

id="phv00122015", Description="Age at time of Study", name="age", version="1", Logical Max="65", Logical Minimum="18", unit="Years", type="decimal"

dbGaP Study phs000007.v18.p7

id="phv00003636.v1", Description="HEART: HYPERTENSIVE HEART DISEASE", name="FK414", version="1", Logical Max="--", Logical Minimum="--", unit="--", type="string"

id="phv00008678.v3", Description="CDI: HYPERTENSIVE HEART DISEASE", name="C334", version="3", Logical Max="--", Logical Minimum="--", unit="--", type="text"

N. Alipanah, H. Kim, L. Ohno-Machado: Building an Ontology of Phenotypes for Existing GWAS Studies. 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology (HISB), p. 111, 27-28 Sept. 2012.

Problem Statement (Example of Semantic Relation)

dbGaP Study phs000284.v1.p1

id="phv00123020.v1", Description="CVD: self report of MD dx of cvd", name="cvd", version="1", value="Yes, No, Not assessed"

id="phv00123021.v1", Description="CVD: self report of MD dx of cvd (missing recoded as no)", name="cvdx", version="1", value="Yes, No, Not assessed"

Proposed Solution

•  Build an information model
   ◦  Index the phenotype variables semantically
   ◦  No redundancy

Example (figure): a concept hierarchy containing Heart Disease and Cardiovascular Disease, with the variables phv00124261.v1, phv00008678.v3, phv00123021.v1, phv00123020.v1, ……. attached as instances.

Methods

I. String-based Variables Distance Calculation
II. Semantic Hierarchy Extraction on Revised Clusters
III. Classification and Ontology Creation
IV. sdGaP Information Retrieval

I. String-based Variables Distance Calculation

1- Property Extraction: Name, Description, Type, Unit, …, and (Max-Min) values
2- UMLS Expansion: expand the variable Description with MetaMap
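The property-extraction step can be sketched as follows. This is a minimal illustration that parses the flattened `key="value"` notation used on these slides with straight quotes; the function names `parse_variable` and `logical_range` are hypothetical, and the real dbGaP data dictionaries may need a different parser.

```python
import re

# Matches key="value" pairs; keys may contain a space (e.g. "Logical Max").
PAIR = re.compile(r'([A-Za-z][A-Za-z ]*?)\s*=\s*"\s*(.*?)\s*"')

def parse_variable(record: str) -> dict:
    """Turn one flattened variable record into a dict with lower-cased keys."""
    return {key.strip().lower(): value for key, value in PAIR.findall(record)}

def logical_range(props: dict):
    """Return (min, max) as floats, or None when the range is missing or '--'."""
    try:
        return float(props["logical minimum"]), float(props["logical max"])
    except (KeyError, ValueError):
        return None

record = ('id="phv00122015", Description="Age at time of Study", name="age", '
          'version="1", Logical Max="65", Logical Minimum="18", '
          'unit="Years", type="decimal"')
props = parse_variable(record)
print(props["name"], props["unit"], logical_range(props))
```

Variables whose logical bounds are given as "--" yield `None` for the range, so later steps can treat the range as unspecified.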

I. String-based Variables Distance Calculation (continued)

3- Distance Computation
   Description: Vector Space Model matching
   Name similarity: Jaro-Winkler string matcher
   Type: exact string match
   Unit: exact string match
   (Max-Min) values: subset matching
4- Build Distance Matrix: compute the distance between every pair of variables.
5- Cluster Based on Distance Matrix: variables with the same distance to other variables are clustered together.
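A rough sketch of steps 3 and 4, under stated assumptions: the slides name the matcher used for each property but not how the five scores are combined, so the equal weighting below is an assumption, as are the dictionary layout, the function names, and the rule that two unspecified ranges count as a match.

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Vector space model: cosine similarity of term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = half_transpositions = 0
    for i in range(len1):          # compare matched characters in order
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def range_subset(r1, r2) -> float:
    """1.0 if one (min, max) range lies inside the other; assumption: two
    unspecified ranges also count as a match."""
    if r1 is None or r2 is None:
        return 1.0 if r1 is r2 else 0.0
    (lo1, hi1), (lo2, hi2) = r1, r2
    inside = (lo2 <= lo1 and hi1 <= hi2) or (lo1 <= lo2 and hi2 <= hi1)
    return 1.0 if inside else 0.0

def distance(v1: dict, v2: dict) -> float:
    sims = [
        cosine(v1["description"], v2["description"]),
        jaro_winkler(v1["name"].lower(), v2["name"].lower()),
        1.0 if v1["type"] == v2["type"] else 0.0,
        1.0 if v1["unit"] == v2["unit"] else 0.0,
        range_subset(v1["range"], v2["range"]),
    ]
    return 1.0 - sum(sims) / len(sims)   # equal weights are an assumption

def distance_matrix(variables):
    n = len(variables)
    return [[distance(variables[i], variables[j]) for j in range(n)] for i in range(n)]

variables = [
    {"description": "HEART: HYPERTENSIVE HEART DISEASE", "name": "FK414",
     "type": "string", "unit": "--", "range": None},
    {"description": "CDI: HYPERTENSIVE HEART DISEASE", "name": "C334",
     "type": "text", "unit": "--", "range": None},
    {"description": "Age at time of Study", "name": "age",
     "type": "decimal", "unit": "Years", "range": (18.0, 65.0)},
]
matrix = distance_matrix(variables)
```

With these inputs the two hypertensive heart disease variables end up closer to each other than either is to the age variable, which is the behavior the clustering step relies on.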

II. Semantic Hierarchy Extraction on Revised Clusters

1.  Build a string-based distance matrix for the variables in a single assigned cluster.
2.  Sub-cluster the variables and calculate the semantically relevant (similar) variables.
3.  Assign labels to sub-clusters based on the relevant UMLS Concept Unique Identifier (CUI).
4.  Perform re-clustering to find smaller groups of relevant variables.
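One way the labeling step could work is a majority vote over the CUIs that MetaMap returns for the variables in a sub-cluster. The slides do not say how the "relevant" CUI is chosen, so the voting rule, the function name, and the placeholder CUI strings below are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

def label_subcluster(variable_cuis: Dict[str, List[str]]) -> Optional[str]:
    """Pick the CUI shared by the most variables; ties break alphabetically."""
    counts = Counter(cui for cuis in variable_cuis.values() for cui in set(cuis))
    if not counts:
        return None
    top = max(counts.values())
    return min(cui for cui, n in counts.items() if n == top)

# Placeholder CUIs, not real UMLS identifiers.
subcluster = {
    "phv00003636.v1": ["CUI_HEART_DISEASE", "CUI_HYPERTENSION"],
    "phv00008678.v3": ["CUI_HEART_DISEASE"],
}
print(label_subcluster(subcluster))
```

In practice a domain expert would still review the chosen label, in line with the semi-automated nature of the approach.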

III. Classification and (sdGaP) Ontology Creation

1.  Start with the UMLS Semantic Network.
2.  Extract the corresponding subclass (PAR/CHD) hierarchy using the UMLS hierarchy table (MRREL table).
3.  Instantiate the phenotype variables to the UMLS CUIs (not for higher levels).
4.  Populate the related constraints in sdGaP.
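Steps 2 and 3 can be sketched as below, assuming MRREL rows in the pipe-delimited RRF layout (CUI1 in the first field, REL in the fourth, CUI2 in the fifth) and reading REL as the relationship of the second concept to the first. The CUIs and atom identifiers are placeholders, and the variable-to-concept assignment is illustrative rather than taken from the study.

```python
from collections import defaultdict

def build_hierarchy(rrf_lines):
    """Map each parent CUI to the set of its child CUIs using PAR/CHD rows."""
    children = defaultdict(set)
    for line in rrf_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 5:
            continue
        cui1, rel, cui2 = fields[0], fields[3], fields[4]
        if rel == "CHD":        # CUI2 is a child of CUI1
            children[cui1].add(cui2)
        elif rel == "PAR":      # CUI2 is a parent of CUI1
            children[cui2].add(cui1)
    return children

# Placeholder rows in MRREL.RRF column order; not real UMLS content.
rows = [
    "C_HEART|A1|AUI|PAR|C_CVD|A2|AUI||R1||SRC|SRC|||N||",
    "C_CVD|A2|AUI|CHD|C_HEART|A1|AUI||R2||SRC|SRC|||N||",
    "C_CVD|A2|AUI|RO|C_OTHER|A3|AUI||R3||SRC|SRC|||N||",
]
hierarchy = build_hierarchy(rows)

# Instantiate variables at the lower-level concepts (illustrative assignment).
instances = defaultdict(list)
instances["C_HEART"].append("phv00008678.v3")
instances["C_CVD"].append("phv00123020.v1")
```

Rows with other relationship types, such as the `RO` row above, are ignored because only the parent/child links contribute to the subclass hierarchy.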

IV. sdGaP Information Retrieval

1.  Use the sdGaP ontology structure to expand the query.
    •  Density Measure (DM)

Figure: example density values for the nodes of two hierarchies.
First hierarchy: Density(A)=3, Density(B)=0, Density(D)=0
Second hierarchy: Density(A)=2, Density(D)=1, Density(B)=1, Density(E)=1, Density(C)=0
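The slides do not define the density measure. The values shown are consistent with counting each concept's direct subclasses, so the sketch below uses that reading as an assumption; the hierarchy itself is a hypothetical tree chosen to reproduce the first set of values.

```python
def density(concept: str, subclasses: dict) -> int:
    """Assumed definition: number of direct subclasses of a concept."""
    return len(subclasses.get(concept, ()))

# Hypothetical hierarchy: A has the three leaf subclasses B, C and D.
subclasses = {"A": ["B", "C", "D"]}

for node in ("A", "B", "D"):
    print(f"Density({node})={density(node, subclasses)}")
```

Under this reading, a concept with higher density offers more subclass terms with which a query can be expanded.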

Results

•  Dataset: Cleveland Family Study (CFS), phs000284.v1.p1, with 5 data sets and 2,339 phenotype variables.
•  X-means clustering was run with the Weka tool.
•  X-means clustering produced 35 clusters of relevant variables.
•  Domain expert reviewers reorganized these into 23 clusters.

Results of Concept-based Retrieval (Improvement from Subclass Expansion)

•  Query = "Cardiovascular Disease"
•  Query expansion = {Heart} in the "Disease Cluster"
•  Recall improved from 2/45 = 0.04 to 18/45 = 0.4

Figure: Cardiovascular Disease with its subclass Heart Disease and the retrieved variables phv00123021.v1, phv00122274.v1, phv00122277.v1, phv00123020.v1, phv00122280.v1, phv00122281.v1, phv00122283.v1, phv00122284.v1, phv00122285.v1, phv00122286.v1, ……
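The subclass expansion and the recall figures reported above can be reproduced with a short sketch. The `expand_query` helper and the one-edge hierarchy are illustrative assumptions; only the counts 2, 18, and 45 come from the slide.

```python
def expand_query(term: str, subclasses: dict) -> list:
    """Return the query term followed by all of its transitive subclasses."""
    expanded, stack = [term], [term]
    while stack:
        for child in subclasses.get(stack.pop(), []):
            if child not in expanded:
                expanded.append(child)
                stack.append(child)
    return expanded

def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Fraction of the relevant variables that the query retrieved."""
    return relevant_retrieved / relevant_total

hierarchy = {"Cardiovascular Disease": ["Heart Disease"]}
terms = expand_query("Cardiovascular Disease", hierarchy)

before = recall(2, 45)    # original query only
after = recall(18, 45)    # query expanded with the subclass
print(terms, round(before, 2), round(after, 2))
```

Recall here is the share of the 45 relevant variables that were retrieved, so the expansion raises it roughly ninefold.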

Conclusion

•  Extracting a standard, reusable information model from UMLS
•  Improving information retrieval by organizing phenotype variables and instantiating them in the data model

Limitations

•  Clustering based on distance calculation is a semi-automated computation.
•  Instantiating variables to the lower levels of the hierarchy needs domain expert review.
•  Only instances at the lower levels of the hierarchies are considered in ontology building.
•  For large data sets, distance calculation and clustering need more advanced algorithms.

Acknowledgement

•  Supported by grants
   ◦  UH2HL108785 (NHLBI)
   ◦  R01HS019913 (AHRQ)
•  Supervision of Dr. Lucila Ohno-Machado.