Data and Text Mining for Computational Biology
Introduction
Course information
• CS 6365
• Data and Text Mining for Computational Biology
• Meets Tuesday and Thursday 7:00-8:15 pm at ECSS 2.412
Instructor
• Vasileios Hatzivassiloglou
• Associate Professor, Computer Science
• Founding Professor, Bioengineering
• Research focus: Discover knowledge from massive amounts of raw data
– data not the same as information
– information overload
Research Interests
• Text analysis, machine learning, intelligent information retrieval, summarization, question answering, bioinformatics, medical informatics
Contact information
• Office hours: Tuesday and Thursday 6:00-7:00pm and by appointment
• Office location: ECSS 3.406
• Phone: (972) 883-4342
• Teaching Assistant: TBA
Course goals
• Introduce the field of bioinformatics
• Discuss primary techniques used for data mining
• Introduce text mining and additional issues it brings to data mining methods
• Use examples from computational biology
Intended audience
• For both computer scientists and biologists
• Not an easy task to balance the two
• Focus on data and text mining algorithms and applications
– Coverage of machine learning background
– No extensive algorithmic analysis / computational complexity
– Medium level of programming
Prerequisites
• Officially CS 6325 – Introduction to Bioinformatics
• Waived for this offering of the course
• You should
– Know basic data structures (multidimensional arrays, hash tables, binary trees)
– Know one high-level programming language and be able to adapt to a new one as needed
– Be able to install and use external software packages
You need not know
• Molecular biology
• Machine learning
• Data mining (in general)
• Text analysis / natural language processing
• Information retrieval
• Artificial intelligence
Course level
• Introductory graduate course (MS or first-year PhD)
• Maturity in programming and data structures at the level of a Computer Science senior
• Ability to access (and interest in accessing) the primary literature in a guided fashion
Course structure
• 6 lectures on biological background and bioinformatics in general
• 6 lectures on data similarity
• 8 lectures on data mining methods
• 3 lectures on text mining and knowledge mining methods
• student presentations of research papers (3-4 sessions)
Expected workload
• Two homework sets given in mid-to-late September and mid-to-late October
• Two weeks to turn in each homework set
• Mid-term exam in early October
• Each student selects two or three research papers to review in late October
• Student presentations of research papers in the last week of November / first week of December
• Final exam
Course project
• In lieu of the research papers and presentation, students may elect to work on a project in teams of two or three
• Project is chosen by the students with the advice and consent of the instructor
• Project investigation/implementation should be approximately 1.5-2 times the work required for a regular homework
Programming
• Each student selects their own programming language (must be available at UTD and accessible to TA)
• Examples: C, C++, Java, Perl, Python
• Can also use a package/programming environment specifically tailored to bioinformatics
One likely package
• R (http://www.r-project.org/)
• R is the free alternative to S-Plus, the commercial implementation of the S language developed at AT&T Bell Laboratories
• S-Plus is the extensible, programmable alternative to statistical packages like SAS and SPSS
• If you know C, you will be right at home with R
Another likely package
• BioPerl (http://bio.perl.org/)
• A collection of library modules in Perl written by and for bioinformaticians
• Perl supports high-level operations such as hashes as a basic data structure, string matching, and regular expressions
• Perl is really bad at OOP and efficiency
• Easy to learn
Grading
• Class participation: 20%
• Homework assignments: 30% (total)
• Midterm: 10%
• Research paper presentation or project: 20%
• Final exam: 20%
Textbooks
• No good integrated textbook on data mining from a computational biology perspective
• We will use one textbook covering bioinformatics algorithms and another on data mining in general, plus additional chapters from other books and research articles
• Copies of chapters / research articles will be provided
Recommended textbook #1
• “An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)”, by Neil C. Jones and Pavel A. Pevzner, MIT Press, 2004.
• ISBN 0262101068
• 448 pages
• Available on Amazon.com for $41, Barnes and Noble for $60
Recommended textbook #2
• “Data Mining: Concepts and Techniques” by Jiawei Han and Micheline Kamber, Elsevier, second edition, 2006.
• ISBN 1558609016
• 800 pages
• Available on Amazon.com for $52, Barnes and Noble for $65
Supplementary textbooks
• “Bioinformatics: The Machine Learning Approach” by Pierre Baldi and Soren Brunak, 2nd edition, 2001.
• “Data Mining: Multimedia, Soft Computing, and Bioinformatics” by Sushmita Mitra and Tinku Acharya, 2003.
• Both of the above are available as full-text eBooks via http://library.utdallas.edu.
Background reading
• Biology: “Molecular Biology of the Cell” by Bruce Alberts et al., 4th edition, 2002.
• Machine learning: “Machine Learning” by Tom Mitchell, 1997.
Background reading (II)
• Statistics: “The elements of statistical learning: data mining, inference, and prediction” by Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2001.
• Data structures and algorithms: “Introduction to Algorithms” by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, 2nd edition, 2001.
So what is it all about?
• Three parts:
– Bioinformatics / computational biology
– Data mining
– Text mining
Bioinformatics
• A fast developing discipline
• We will discuss
– basic concepts of molecular biology
– databases of biological data
– structure and function of DNA, RNA, proteins
– sequence searching (BLAST)
– sequence similarity and comparison
– protein structure (2D and 3D)
– protein motifs and patterns
– microarrays
– phylogenetics
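As a rough illustration of what "sequence similarity and comparison" can mean in practice, the sketch below computes the edit (Levenshtein) distance between two sequences with dynamic programming. This is only one simple similarity measure, chosen here for illustration; it is not BLAST and not necessarily the method covered in the course, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Illustrative DNA fragments, not real data.
    print(edit_distance("ACGT", "AGT"))
```

A smaller distance means the two sequences are more similar; the table only keeps two rows at a time, so memory grows with the length of the second sequence rather than with the product of both lengths.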
Data mining
• Given a large amount of data of known types, extract useful information
• We will discuss
– data cleanup and outliers
– model construction
– data and dimensionality reduction
– classification
– prediction / probability estimation
– clustering
– measuring performance
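To give a feel for two of the items above, classification and measuring performance, here is a minimal sketch of a 1-nearest-neighbor classifier evaluated by accuracy on held-out points. The choice of 1-NN, the function names, and the toy two-feature data are all assumptions made for illustration; none of them come from the course materials.

```python
import math


def nearest_neighbor(train, query):
    """Return the label of the training point closest to query (1-NN)."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(nearest_neighbor(train, x) == y for x, y in test)
    return correct / len(test)


# Made-up points: (feature vector, label). Purely illustrative.
TOY_TRAIN = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 7.5), "high"),
]
TOY_TEST = [
    ((1.2, 1.1), "low"), ((8.5, 8.2), "high"), ((5.5, 5.5), "low"),
]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(TOY_TRAIN, TOY_TEST):.2f}")
```

Keeping the test points separate from the training points is the key design choice: measuring accuracy on the training data itself would always give a perfect score for 1-NN and say nothing about how the model generalizes.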
Text mining
• Not only do we have a large amount of raw data, but we also don’t know what each item means
• We will discuss:
– tokenization and basics of text processing
– recognition of terms and entities
– classification
– dictionary creation
– relationship learning and extraction
– document level clustering and information retrieval
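The first two items in this list, tokenization and term recognition, can be sketched in a few lines. The example below assumes a very simple regular-expression tokenizer that keeps hyphenated words together and a case-insensitive dictionary lookup; the pattern, the function names, the sample sentence, and the tiny dictionary are hypothetical, and real biomedical text usually needs more careful handling.

```python
import re

# Assumed token shape: runs of letters/digits, optionally joined by hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split raw text into tokens, dropping punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)


def find_terms(tokens, dictionary):
    """Return the tokens that appear in the dictionary, ignoring case."""
    lowered = {term.lower() for term in dictionary}
    return [tok for tok in tokens if tok.lower() in lowered]


if __name__ == "__main__":
    # Illustrative sentence and dictionary, not taken from the course.
    sentence = "BRCA1 interacts with p53 in DNA-repair pathways."
    tokens = tokenize(sentence)
    print(tokens)
    print(find_terms(tokens, {"brca1", "P53"}))
```

Dictionary lookup only finds terms that are already known, which is one reason the list above treats dictionary creation and entity recognition as separate problems.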