1 introduction
CpE 615: Machine learning
Dr. Mohammad A. Alzubaidi
Department of Computer Engineering
Yarmouk University
Brief Introduction
Dr. Mohammad A. Alzubaidi
Assistant Professor at CE Dept.
Research interests: machine learning, data mining, image processing, and their applications to bioinformatics
Outline of lecture
Course information
Introduction to Machine Learning (ML)
Tentative Course schedule
Survey
Course Information
Instructor: Dr. Mohammad A. Alzubaidi
Office: Assistant Dean Office, H-205
Phone: 02/7211111 x4440
Email: [email protected]
Web: elearning.yu.edu.jo
Time: Wed 5:00pm – 8:00pm
Office hours: Sun – Thu 8:00am – 4:00pm
Location: HN-401
Course textbook: No textbook is required. (Materials will be available at the class web page.)
Topics: Data types and representation, classification, evaluation, preprocessing, clustering, semi-supervised learning, advanced topics, etc.
Reference books
Introduction to Data Mining. Tan, et al., 2005.
Pattern Classification. Duda, et al., 2000.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, et al., 2001.
Kernel Methods in Computational Biology. Schölkopf, et al., editors. 2004.
Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini, 2004.
Grading
Midterm Exam: 30%
Project, class participation, and seminars: 30%
  Two to three students form a group to carry out a small research project. Possible project types:
    A survey of the state of the art in an area related to this course
    Machine learning techniques for specific applications
    A comparative study of several well-known algorithms
    Design of a novel algorithm related to this course
  Students are required to attend the lectures and participate in class discussion.
  Students might be asked to give a seminar.
Final Exam: 40%
Programming language
Matlab
Tutorials:
http://www.math.ufl.edu/help/matlab-tutorial/
http://www.math.mtu.edu/~msgocken/intro/node1.html
R language
Or other languages
What is machine learning?
Machine learning is the study of computer systems that improve their performance through experience.
Learn existing and known structures and rules.
Discover new findings and structures.
Face recognition
Bioinformatics
Supervised learning vs. unsupervised learning
Semi-supervised learning
Machine learning versus data mining
Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, images.
It is also described as the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
The two fields share many topics: clustering, classification, etc.
Machine learning versus data mining
Different focuses:
ML focuses more on theory (statistics)
DM focuses more on applications
In this course I will try to strike a balance between the two.
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
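The two criteria above can be checked numerically. The following is a minimal sketch, not a clustering algorithm: it only scores a given grouping. The points, the two groupings, and the function names are hypothetical and chosen for illustration; Euclidean distance is assumed.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    pairs = [dist(p, q)
             for points in clusters
             for p, q in combinations(points, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a
             for q in b]
    return sum(pairs) / len(pairs)


# Hypothetical 2-D points forming two visually separate groups.
good = [[(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
        [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]]
# The same six points grouped badly: each cluster mixes both groups.
bad = [[(1.0, 1.0), (8.0, 8.0), (2.0, 1.5)],
       [(1.5, 2.0), (8.5, 9.0), (9.0, 8.5)]]

for name, grouping in [("good", good), ("bad", bad)]:
    intra = mean_intra_cluster_distance(grouping)
    inter = mean_inter_cluster_distance(grouping)
    print(f"{name}: intra={intra:.2f}, inter={inter:.2f}")
```

The good grouping has the smaller intra-cluster distance and the larger inter-cluster distance, which is the sense in which it is the better clustering.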
Applications of Cluster Analysis
Understanding: Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Summarization: Reduce the size of large data sets
Clustering precipitation in Australia
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
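The train/test procedure can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the toy data, the 30% test fraction, the seed, and the one-attribute threshold model are all hypothetical.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Learn a one-attribute model: predict class 1 when x >= threshold.

    The threshold is the training value with the fewest training errors.
    """
    candidates = sorted(x for x, _ in train)

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in train)

    return min(candidates, key=errors)


def accuracy(threshold, records):
    """Fraction of records whose class the threshold model predicts correctly."""
    correct = sum((x >= threshold) == bool(y) for x, y in records)
    return correct / len(records)


# Hypothetical toy data: one attribute x, class 1 when x is 50 or more.
data = [(x, int(x >= 50)) for x in range(0, 100, 5)]

train, test = train_test_split(data)
model = fit_threshold(train)          # built from the training set only
test_accuracy = accuracy(model, test)  # validated on unseen records
print(f"threshold={model}, test accuracy={test_accuracy:.2f}")
```

The model never sees the test records while it is being fitted, so the test accuracy is an estimate of how it behaves on previously unseen records.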
Classification Example
Training set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (class unknown):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Diagram: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set.
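To make the example concrete, the sketch below encodes the training and test records from the slide and applies one possible model to them. The model is a hand-written decision tree, which is an assumption made for illustration: the slide does not give a model, and the 80K income threshold is a hypothetical choice (any cut between 70K and 85K fits the training records equally well).

```python
# Training records from the slide: (refund, marital_status, income_k, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the slide: the class is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k):
    """Hand-written decision tree (an assumption, not learned by an algorithm)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Hypothetical threshold: 80K separates the remaining training records.
    return "Yes" if income_k >= 80 else "No"


training_accuracy = sum(
    predict(r, m, i) == cheat for r, m, i, cheat in TRAINING
) / len(TRAINING)
predictions = [predict(*record) for record in TEST]

print(f"training accuracy: {training_accuracy:.0%}")
for record, label in zip(TEST, predictions):
    print(record, "->", label)
```

A learning algorithm such as decision tree induction would search for rules like these automatically. The tree here only shows what a fitted model looks like and how it assigns a class to each unseen record.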
Classification: Application
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes (when does a customer buy, what does he buy, how often does he pay on time, etc.).
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Character Recognition
Given a digit representation, what is its class?
Inputs are 28x28 greyscale images.
Researchers have used Neural Networks, Support Vector Machines, etc.
Other applications
Face recognition
Protein function prediction
Cancer detection
Document categorization
Data representation
Traditional algorithms work on vectors.
Images can be represented as matrices or vectors.
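As a small illustration of the matrix and vector views, the sketch below flattens a 28x28 image (stored as a list of rows) into a 784-element vector in row-major order and restores it. The synthetic gradient image and the function names are hypothetical; a real digit image would be loaded from a data set.

```python
ROWS, COLS = 28, 28


def flatten(image):
    """Turn a matrix (list of rows) into one vector, row by row."""
    return [pixel for row in image for pixel in row]


def unflatten(vector, cols):
    """Rebuild the matrix from a row-major vector."""
    return [vector[i:i + cols] for i in range(0, len(vector), cols)]


# Hypothetical greyscale image: a synthetic gradient, not a real digit.
image = [[(r * COLS + c) % 256 for c in range(COLS)] for r in range(ROWS)]

vector = flatten(image)
print(len(vector))  # number of features seen by a vector-based algorithm
```

Pixel `(r, c)` of the matrix ends up at position `r * COLS + c` of the vector, so no information is lost, although the vector form no longer makes the 2-D neighbourhood of a pixel explicit.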
Data integration
mRNA expression data
protein-protein interaction data
hydrophobicity data
sequence data (gene, protein)
Genome-wide data
Curse of dimensionality
Large sample size is required for high-dimensional data.
Query accuracy and efficiency degrade rapidly as the dimension increases.
Strategies: feature reduction, feature selection, kernel learning
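One way to see why queries degrade is to measure how much farther the farthest point is than the nearest one. The sketch below does this for uniformly random points; the sample size, the seed, the chosen dimensions, and the name `relative_contrast` are assumptions made for illustration.

```python
import random
from math import dist


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = tuple(rng.random() for _ in range(dim))
    distances = [
        dist(query, tuple(rng.random() for _ in range(dim)))
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 500):
    print(f"dim={dim:4d}  relative contrast={relative_contrast(dim):.2f}")
```

In low dimensions the nearest neighbour is much closer than the farthest point. In high dimensions the distances concentrate, so the nearest and farthest points become hard to tell apart, and a fixed-size sample covers the space ever more sparsely.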
Model selection
Choose the best model from a set of different models to fit to the data, e.g., Support Vector Machines (SVM), Linear Discriminant Analysis (LDA)
Models are specified by certain parameters. How to choose the best parameters?
Cross-validation (leave-one-out, k-fold CV)
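A minimal sketch of k-fold cross-validation for choosing a parameter follows. It uses a 1-D nearest-neighbour classifier as a stand-in model because it is short to write; the data, the noisy label, the candidate values, and all function names are hypothetical. Leave-one-out is the special case where the number of folds equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote (labels 0/1) among the n_neighbors closest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return int(2 * votes > len(nearest))


def cv_accuracy(xs, ys, n_neighbors, k):
    """Fraction of samples predicted correctly while held out of training."""
    correct = 0
    for train, validation in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in validation
        )
    return correct / len(xs)


# Hypothetical data: class 1 when x >= 10, plus one deliberately noisy label.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]
ys[3] = 1

candidates = [1, 3, 5]  # parameter values to compare
scores = {n: cv_accuracy(xs, ys, n, k=5) for n in candidates}
best = max(candidates, key=scores.get)
loo = cv_accuracy(xs, ys, best, k=len(xs))  # leave-one-out
print(scores, "best:", best, "leave-one-out accuracy:", loo)
```

Every sample is used for validation exactly once, so each candidate parameter is scored on data the model did not see during fitting. The same loop applies to SVM or LDA parameters by swapping the prediction function.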
Machine learning applications
Computer vision, information retrieval, image processing, bioinformatics, text mining, web mining, etc.
Course schedule
Weeks 1 – 6: Introduction, Data Types, Classification, Evaluation, Preprocessing
Week 7: Midterm Exam
Weeks 8 – 11: Clustering, Semi-supervised Learning, Advanced Topics
Weeks 12 – 14: Presentations
Week 15: Final Exam