Random Forest Photometric Redshift Estimation
Samuel Carliles¹
Tamas Budavari², Sebastien Heinis², Carey Priebe³, Alex Szalay²
Johns Hopkins University
¹Dept. of Computer Science, ²Dept. of Physics & Astronomy, ³Dept. of Applied Mathematics & Statistics
Photometric Redshifts
» You know what they are
» I did it on SDSS DR6 colors: $z_{\mathrm{spec}} = f(u-g,\, g-r,\, r-i,\, i-z)$
» $\hat{z}_{\mathrm{phot}} = \hat{f}(u-g,\, g-r,\, r-i,\, i-z)$
» $\epsilon = \hat{z}_{\mathrm{phot}} - z_{\mathrm{spec}}$
» I did it with Random Forests
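To make the setup concrete, here is a minimal R sketch of the feature construction with synthetic stand-in data (all names and values are illustrative, not from the original code):

    # Toy stand-in for an SDSS DR6 training sample: magnitudes u, g, r, i, z
    # plus the spectroscopic redshift zspec (synthetic values, for illustration).
    set.seed(1)
    n <- 1000
    mags <- data.frame(u = runif(n, 17, 22), g = runif(n, 16, 21),
                       r = runif(n, 15, 20), i = runif(n, 15, 20),
                       z = runif(n, 15, 20))
    zspec <- runif(n, 0, 0.6)

    # The four colors that form the input space.
    X <- with(mags, data.frame(ug = u - g, gr = g - r, ri = r - i, iz = i - z))

    # Given a fitted estimator f (a random forest below), the per-object error
    # is eps = zphot - zspec, e.g. eps <- predict(f, X) - zspec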
Regression Trees
» A binary tree
» It partitions input training data into clusters of similar objects
» Each new test object is matched with the cluster to which it is “closest” in the input space
» The output value is the mean of the output values of training objects in its cluster (see the prediction sketch below)
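As a sketch of the prediction step (the node structure is a hypothetical representation, not from the talk): walk from the root, comparing one input dimension against a split point at each internal node, until reaching the leaf, which stores the mean output of its training cluster.

    # A node is a list: either a leaf holding the mean output of its training
    # cluster, or an internal node with a split dimension, point, and children.
    predict_tree <- function(node, x) {
      if (node$is_leaf) return(node$mean)    # mean of the cluster's outputs
      if (x[[node$dim]] <= node$split)       # which side of the split is x on?
        predict_tree(node$left, x)
      else
        predict_tree(node$right, x)
    }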
Building a Regression Tree
Starting at the root node, choose a dimension on which to split
Choose the point which “best” distinguishes clusters in that dimension
Points to the left go in the left child, points to the right go in the right child
Repeat the process in each child node until all objects are in their own leaf node (a minimal sketch follows)
[Diagram: an example tree splitting successively on x1, x2, x3]
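A minimal recursive sketch of that procedure, ignoring degenerate cases such as exactly tied points (best_split, which picks the dimension and split point, is sketched after the resubstitution-error slide below; the dims argument anticipates the randomized variant):

    # Grow a regression tree: split on the best (dimension, point), send
    # points left or right, and recurse until every object is in its own leaf.
    build_tree <- function(X, y, dims = seq_len(ncol(X))) {
      if (nrow(X) <= 1) return(list(is_leaf = TRUE, mean = mean(y)))
      s <- best_split(X, y, dims)            # sketched below
      go_left <- X[, s$dim] <= s$split
      list(is_leaf = FALSE, dim = s$dim, split = s$split,
           left  = build_tree(X[go_left,  , drop = FALSE], y[go_left],  dims),
           right = build_tree(X[!go_left, , drop = FALSE], y[!go_left], dims))
    }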
How Do You Choose the Dimension and Split Point?
The best split point in a dimension is the one which minimizes resubstitution error in that dimension
The best dimension is the one with the lowest best resubstitution error
What’s Resubstitution Error?
• For a candidate split point, there are points left of it and points right of it
• $E = \frac{1}{N_L}\sum_{x \in L}(x - \bar{x}_L)^2 + \frac{1}{N_R}\sum_{x \in R}(x - \bar{x}_R)^2$
• That’s the resubstitution error
• Minimize it (see the search sketch below)
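A sketch of the corresponding exhaustive search. Reading the x's in the formula above as the output values of the points on each side (the standard regression-tree criterion), and taking midpoints between consecutive sorted values as the candidate split points:

    # Resubstitution error of a candidate split: per-side mean squared
    # deviation from the side's own mean, summed over left and right.
    resub_error <- function(v, y, split) {
      mse <- function(s) if (length(s) == 0) 0 else mean((s - mean(s))^2)
      mse(y[v <= split]) + mse(y[v > split])
    }

    # Try every candidate point in every candidate dimension and keep the
    # (dimension, point) pair with the lowest resubstitution error.
    best_split <- function(X, y, dims = seq_len(ncol(X))) {
      best <- list(err = Inf)
      for (d in dims) {
        v <- sort(unique(X[, d]))
        for (s in (head(v, -1) + tail(v, -1)) / 2) {   # midpoints
          e <- resub_error(X[, d], y, s)
          if (e < best$err) best <- list(err = e, dim = d, split = s)
        }
      }
      best
    }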
Randomizing a Regression Tree
Train it on a bootstrap sample: a sample of N objects chosen uniformly at random with replacement from the complete training set
Instead of choosing the best dimension to split on, choose the best from among a random subset of input dimensions (both shown in the sketch below)
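A sketch of both randomizations together (mtry, the subset size, is a tuning knob; the subset is drawn once per tree here for brevity, though drawing a fresh subset at every node is the more common variant):

    # One randomized tree: (1) train on a bootstrap sample, i.e. N draws
    # uniformly at random with replacement from the training set; (2) restrict
    # splits to a random subset of the input dimensions.
    random_tree <- function(X, y, mtry = max(1, floor(ncol(X) / 3))) {
      idx  <- sample(nrow(X), replace = TRUE)   # bootstrap sample
      dims <- sample(ncol(X), mtry)             # random dimension subset
      build_tree(X[idx, , drop = FALSE], y[idx], dims)
    }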
Random Forest
• An ensemble of “randomized” Regression Trees
• The ensemble estimate is the mean of the individual tree estimates
• This gives a distribution of i.i.d. estimation errors
• The Central Limit Theorem gives the distribution of their mean
• Their mean is exactly $\hat{z}_{\mathrm{phot}} - z_{\mathrm{spec}}$
• That means we have the error distribution for that object!
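Putting it together, a sketch of the ensemble built from the pieces above (the CLT-motivated width of the mean's error is one simple choice, not necessarily the talk's exact estimator):

    # A forest is a list of randomized trees. The ensemble estimate for an
    # object is the mean of its tree estimates; by the CLT the error of that
    # mean is approximately Gaussian, with width estimable from the spread.
    random_forest <- function(X, y, ntree = 100) {
      lapply(seq_len(ntree), function(i) random_tree(X, y))
    }

    predict_forest <- function(forest, x) {     # e.g. predict_forest(f, X[1, ])
      per_tree <- sapply(forest, predict_tree, x = x)
      list(zphot = mean(per_tree),                          # ensemble estimate
           sigma = sd(per_tree) / sqrt(length(per_tree)))   # width of mean's error
    }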
Implemented in R
◊ More training data -> better estimates
◊ Forests converge pretty quickly in forest size
◊ Training set size and input space are constrained by memory in the R implementation
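In practice this corresponds to the CRAN randomForest package; the talk does not show its exact calls, so the parameters below are illustrative (X and zspec are from the first sketch above, and in real use one would predict on held-out objects):

    library(randomForest)

    # Fit zspec as a function of the four colors. predict.all = TRUE also
    # returns the individual tree estimates, from which per-object error
    # widths follow as in the hand-rolled sketch.
    rf    <- randomForest(x = X, y = zspec, ntree = 100)
    pred  <- predict(rf, newdata = X, predict.all = TRUE)
    zphot <- pred$aggregate                                  # ensemble estimates
    sigma <- apply(pred$individual, 1, sd) / sqrt(rf$ntree)  # CLT width per object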
Results
RMS Error = 0.023
Training set size = 80,000
Error Distribution
Standardized Error Distribution: since we know the error distribution* for each object, we can standardize the errors, and the results should be standard normal over all test objects. Like in this plot! :)
If the standardized errors are standard normal, then we can predict how many of the errors fall between the tails of the distribution for different tail sizes. Like in this plot! (mostly)
[Plots: standardized error distribution against a standard normal; predicted vs. observed tail coverage]
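A sketch of that check in base R, using the per-object zphot and sigma from the previous sketch:

    # Standardize each object's error by its own predicted width; if the
    # per-object error model is right, eps_std should look standard normal.
    eps_std <- (zphot - zspec) / sigma
    hist(eps_std, breaks = 50, freq = FALSE)   # compare against the N(0,1) curve
    curve(dnorm(x), add = TRUE)
    qqnorm(eps_std); qqline(eps_std)           # tail behavior via quantile plot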
Summary
Random Forest estimates come with Gaussian error distributions
0.023 RMS error is competitive with other methodologies
This makes Random Forests good
Future Work
The CRLB (Cramér-Rao lower bound) says bigger N gives better estimates from the same estimator
80,000 objects is good, but we have far more than that available
Random Forests in R are extremely memory-inefficient (and therefore time-inefficient), I believe due to the FORTRAN implementation
So I’m writing a C# implementation