Smoothing, Sampling, and Simulation
Vasileios Hatzivassiloglou
University of Texas at Dallas
Back to motif finding
• Apply MLE to the profile data
• Note that we already used MLE when calculating each cell
• Now θ is the set of choices for each letter
• Because each choice is independent of the others, the MLE is
  – Choose at each position j the letter $\arg\max_i A_{ij}$
• The algorithm takes O(kn) time
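The per-position choice can be sketched in a few lines of Python. This is only an illustration, assuming k is the number of aligned sequences, n is the motif length, and the alphabet is DNA; the function names `build_profile` and `consensus` and the toy motifs are hypothetical and not taken from the lecture.

```python
from collections import Counter

def build_profile(sequences, alphabet="ACGT"):
    """Return profile[letter][j] = relative frequency of letter at position j."""
    k = len(sequences)       # number of aligned sequences (assumed meaning of k)
    n = len(sequences[0])    # motif length (assumed meaning of n)
    profile = {a: [0.0] * n for a in alphabet}
    for j in range(n):
        counts = Counter(seq[j] for seq in sequences)
        for a in alphabet:
            profile[a][j] = counts[a] / k   # MLE of one profile cell
    return profile

def consensus(profile):
    """Pick, independently at each position j, the letter with the largest A_ij."""
    n = len(next(iter(profile.values())))
    return "".join(max(profile, key=lambda a: profile[a][j]) for j in range(n))

# Toy data, made up for illustration
motifs = ["ACGT", "ACGA", "TCGT", "ACCT"]
profile = build_profile(motifs)
print(consensus(profile))
```

Each of the k sequences is read once at each of the n positions, which matches the O(kn) bound when the alphabet size is treated as a constant.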
Representing profiles
• Usually stored as $\log A_{ij}$ values
  – historically for ease of calculation
  – with computers for maintaining accuracy
• Smoothing
  – estimated values can be 0
  – this will affect calculations, sometimes leading to serious problems (e.g., no solution)
  – smoothing increases 0 probabilities
  – it has to reduce other estimated probabilities to account for this
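Both points can be illustrated with a small Python sketch. It assumes a profile stored as a dictionary from letters to lists of probabilities; the function names and the toy profile are hypothetical. The direct product of many small probabilities underflows to 0.0 in floating point, while the sum of logs stays representable, and a single zero cell makes the log score undefined (treated here as minus infinity).

```python
import math

def sequence_probability(profile, seq):
    """Direct product of the per-position probabilities."""
    p = 1.0
    for j, letter in enumerate(seq):
        p *= profile[letter][j]
    return p

def sequence_log_probability(profile, seq):
    """Sum of log A_ij; returns -inf as soon as one cell is 0."""
    total = 0.0
    for j, letter in enumerate(seq):
        a = profile[letter][j]
        if a == 0.0:
            return float("-inf")   # log 0 is undefined; this is what smoothing avoids
        total += math.log(a)
    return total

# Made-up two-letter profile: "A" has probability 0.01 at each of 200 positions
n = 200
toy_profile = {"A": [0.01] * n, "C": [0.99] * n}
seq = "A" * n
print(sequence_probability(toy_profile, seq))      # underflows to 0.0
print(sequence_log_probability(toy_profile, seq))  # roughly -921.03
```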
Additive smoothing
• Replace each probability
  $A_{ij} = \frac{c_{ij}}{k}$
with
  $A_{ij} = \frac{c_{ij} + \delta}{k + \delta\,|\Sigma|}$
where $\delta$ is a small number (such as 0.001)
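A minimal sketch of additive smoothing for one profile column follows. It assumes that c_ij is the count of letter i at position j, that k is the total count in the column (the number of sequences), and that the alphabet size is the number of letters in the column; the function name `smooth_column`, the parameter name `delta`, and the example counts are hypothetical.

```python
def smooth_column(counts, delta=0.001):
    """Additive smoothing of one profile column.

    counts: dict mapping each letter to its count c_ij at a fixed position j
    delta:  small constant added to every count (name assumed for illustration)
    """
    k = sum(counts.values())   # total count in the column
    size = len(counts)         # alphabet size
    return {a: (c + delta) / (k + delta * size) for a, c in counts.items()}

# Made-up counts for one position across 4 sequences
column = {"A": 3, "C": 0, "G": 1, "T": 0}
smoothed = smooth_column(column)
print(smoothed)
print(sum(smoothed.values()))
```

The zero counts become small positive probabilities, the non-zero ones shrink slightly, and the column still sums to 1 because the denominator grows by the total amount added to the numerators.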
Student presentations
• Scheduled for December 2 and December 4
• Each student gets 10 minutes (7 minutes for presentation, 3 minutes for questions)
• Select project or topic and papers in consultation with the instructor by November 13
Potential presentation topics
• Similarity
• Statistical, predictive, and generative models
• Simulation
• Estimation
• Classification
• Clustering
• Text mining and knowledge discovery
Statistical sampling
• A very general method for solving difficult problems with many variables that cannot be solved directly, but where partial solutions can be “guessed” and improved
• Commonly known as “Monte Carlo” methods (after the Monte Carlo casino in Monaco) because one of the pioneers of the technique liked gambling
Famous MC applications
• Buffon’s needle (18th century)
• Enrico Fermi’s study of the neutron (1930)
• The Manhattan project (1944)
• Currently used in
  – aerodynamics
  – video games and computer-generated films
  – share pricing
  – bioinformatics
Buffon’s needle
• How to calculate π?
• Consider throwing a needle of length l at random onto a floor with parallel boards of width w (w > l). Then it can be shown that the probability p of the needle crossing a line between boards is
  $p = \frac{2l}{\pi w}$
• By estimating p (experimentally through MLE) one can then calculate π = 2l/(pw)
• Using this, the estimate 355/113 was obtained (accurate to 6 decimal places)
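The experiment can be simulated as a rough Python sketch. It assumes the needle's centre falls uniformly between two lines and that its angle to the lines is uniform; the function name `estimate_pi` and its defaults are hypothetical. The simulation itself uses `math.pi` to draw the angle, so it only illustrates the estimator and is not an independent way of computing π.

```python
import math
import random

def estimate_pi(throws, l=1.0, w=2.0, seed=0):
    """Estimate pi from simulated needle throws (requires w > l)."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(throws):
        # distance from the needle's centre to the nearest line, uniform on [0, w/2]
        d = rng.uniform(0.0, w / 2)
        # acute angle between the needle and the lines, uniform on [0, pi/2]
        theta = rng.uniform(0.0, math.pi / 2)
        if d <= (l / 2) * math.sin(theta):
            crossings += 1
    p_hat = crossings / throws     # MLE of the crossing probability p
    return 2 * l / (p_hat * w)     # invert p = 2l / (pi * w)

print(estimate_pi(200_000))
```

The estimate converges slowly: the error shrinks roughly with the square root of the number of throws, so each additional correct digit costs about 100 times more throws.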
The classification problem
• Given examples from two or more different classes of objects, and a description of a new object, which class does the new object come from?
• A lot of variation depending on what kind of description we have available
Example classification problems
• Given samples of spam and non-spam email messages, classify an incoming message as spam or non-spam
• Given samples of paying and non-paying credit card holders, accept or reject a credit card application
• Given samples of patients who entered a hospital, predict whether a given patient will exit the hospital alive