statistical data analysis - stony brook...
01/23/07 PHY310: Statistical Data Analysis 1
Road Map
Class “Administrative” stuff
  Contacts, grading, expectations
  Moving the class time
What is Data Analysis?
Data Presentation
PHY310: Lecture 1
Statistical Data Analysis
By the end of the semester, I'd like you to be ready to ace the analysis part of PHY445, and be ready for any basic analysis you'll need to do in the future.
01/23/07 PHY310: Statistical Data Analysis 2
Administrative Details
My contact information
  Office: Physics D-134
  Phone: 631-632-8299
  Email: [email protected]
  Web page: nngroup.physics.sunysb.edu/~mcgrew/phy310/spring07.html
Official Class Time
  Tues/Thurs. at 8am in Physics D-122
  If there is a time we can agree on, I would like to move the class to after 10am and before 3pm on Tuesday/Thursday, or Monday/Wednesday.
Computer Access
  You will need computer access (with broadband)
  I will use Unix, but it's not required
  Unix accounts can be made available at the Math SINC site
  I'll be using C++ for most things; I may also use Python for some examples.
01/23/07 PHY310: Statistical Data Analysis 3
Course Details
PHY310 is mostly about practical data analysis, so most of the grading will be based on projects
Grading: The homework, midterm exam, and final project will count equally
Homework
  There will be an equivalent of one assignment per week. Some assignments will be cumulative so that work from one week is required for the next assignment. A lot of the homework will require access to a computer and a minor amount of programming.
Midterm Exam
  The midterm exam will cover probability and other background material.
Final Project
  The final will be an actual data analysis project. Due on the day of the scheduled final.
01/23/07 PHY310: Statistical Data Analysis 4
Survey Questions
Class time
Email addresses
My wife is about to deliver, so I will not be here for one or two classes
  There will be VERY SHORT NOTICE.
  Options to make up lectures
    I can arrange a substitute.
    We can schedule extra lectures later in the semester (difficult)
Computer language skills
  There are several languages that are well suited for data analysis
  Which ones do you know?
    C++, Python, Ruby
01/23/07 PHY310: Statistical Data Analysis 5
What is Data Analysis?
Data analysis is a set of tools used to quantify knowledge gained from observations.
Data Presentation
  Graphical presentation, data descriptors, data summary
Numerical Methods
  Pattern recognition, numeric optimization, function approximation, &c
Probability and Statistics
  Distributions, discriminants, tests, estimators, confidence intervals, &c
  A physicist's treatment: We use statistics like we use math.
    We make mathematicians cringe
    We make statisticians cringe
A good data analysis will combine all of these tools using
  Well posed questions
    A well posed question has a unique, easily interpreted answer
  Clear reasoning
    The logic should be straightforward: A → B → C
01/23/07 PHY310: Statistical Data Analysis 6
Data: The Starting Point
Defining your data set (remember “data” is plural!)
How was it collected?
  What are the underlying assumptions?
  Are there any experimental limitations?
  Can the measurements be repeated?
    Is it a one-time “measurement of opportunity”?
  Taking a phrase from “CSI”: Can you establish a “chain of control”?
    When were the data taken?
    Where were the data taken?
    What were the “run” conditions?
What data are you using?
  Are all of the data of the same quality?
  Should all the data be used in your analysis?
    Golden Samples, Silver Samples, and Bronze Samples
Do you understand the data set?
01/23/07 PHY310: Statistical Data Analysis 7
Keeping Track of the Data
To use data in an analysis, you should know exactly where they came from
  Record keeping is vital: you must be able to track the chain of evidence
  This means keeping a good log book
Things to record
  Date, time, and place the data set was taken
  Location where the data were stored
It's a really good idea to use consistent data naming.
  e.g. Name your data files like: data-001.dat data-002.dat
  Keep track of what the data files are in your log book.
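The naming advice can be sketched in a few lines of Python. This is only an illustration: the helper names `data_file_name` and `log_entry`, the storage path, and the run conditions are all hypothetical, and the log format (one tab-separated line per file) is an assumption, not something the course prescribes.

```python
from datetime import datetime

def data_file_name(run_number, prefix="data", width=3, extension="dat"):
    """Return a zero-padded file name such as data-001.dat."""
    if run_number < 0:
        raise ValueError("run number must be non-negative")
    return f"{prefix}-{run_number:0{width}d}.{extension}"

def log_entry(run_number, taken_at, place, stored_at, conditions):
    """Return one tab-separated log book line describing a data file."""
    return "\t".join([
        data_file_name(run_number),
        taken_at.isoformat(timespec="minutes"),  # date and time the data were taken
        place,                                   # where the data were taken
        stored_at,                               # where the file is stored
        conditions,                              # the "run" conditions
    ])

names = [data_file_name(n) for n in (1, 2, 10)]
print(names)  # ['data-001.dat', 'data-002.dat', 'data-010.dat']

# Hypothetical path and conditions, for illustration only.
entry = log_entry(1, datetime(2007, 1, 23, 8, 0), "Physics D-122",
                  "/hypothetical/data/dir", "hypothetical run conditions")
print(entry)
```

Zero padding keeps the files in run order when a directory listing is sorted alphabetically.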
01/23/07 PHY310: Statistical Data Analysis 8
Types of Data: “Counted”
We live in a quantum world, so fundamentally, all experiments are counting things
Things that are counted
  Blocks in a box
  Radioactive decays
  Most things in particle physics (beware, I'm a particle physicist)
The measurements are very binary
  “It” is there or not there. You record
    The number of times “it” happened
    Or the times at which “it” happened.
You can run into situations where you have a very small sample
  e.g. Charge 1/3 candidates passing through a super-conducting coil (one has been observed)
Statistical treatment is easy (it's all Poissonian)
  Statistics is easy, but there is controversy about the “right question” when you deal with very small samples
    Return to this later in the semester
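As a rough illustration of "it's all Poissonian", the sketch below evaluates the Poisson probability of seeing k counts and applies the common square-root rule for the spread of a count. The function names are made up for this example, and the square-root rule is assumed to be adequate only for reasonably large counts; it is not a treatment of the small-sample question raised above.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k counts when `mean` counts are expected."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def count_with_uncertainty(n):
    """Return (n, sqrt(n)): a count and the usual Poisson estimate of its spread."""
    return n, math.sqrt(n)

print(count_with_uncertainty(100))  # (100, 10.0)

# With one expected event, seeing nothing at all is still quite likely.
p_zero = poisson_pmf(0, 1.0)
print(round(p_zero, 3))  # 0.368
```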
01/23/07 PHY310: Statistical Data Analysis 9
Types of Data: “Analog”
Most everyday things are analog
  Time
  Voltage
  Current
  Temperature
The measurements are continuous
  Values can be positive or negative
Statistical treatment is hard
  With counting, you know the underlying probability distribution
  With analog, you don't know the underlying distribution
    The “Central Limit Theorem” says the distribution of an average of many independent measurements approaches a Gaussian (more in a couple weeks)
Practical data sets are a mix of counting and analog
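The tendency of averages toward a Gaussian (the central limit theorem) can be checked numerically. The sketch below is an assumption-laden toy: it averages uniform random numbers, which are clearly not Gaussian, and measures the fraction of averages that fall within one standard deviation of their mean. A flat distribution gives about 0.58; a Gaussian gives about 0.68. The function names and the seed are arbitrary choices for this example.

```python
import random

def sample_means(n_values, n_trials, seed=310):
    """Return n_trials averages, each over n_values uniform random numbers."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_values)) / n_values
            for _ in range(n_trials)]

def fraction_within_one_sigma(values):
    """Fraction of values within one standard deviation of their mean."""
    n = len(values)
    mean = sum(values) / n
    sigma = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(1 for v in values if abs(v - mean) <= sigma) / n

single = fraction_within_one_sigma(sample_means(1, 20000))     # flat: about 0.58
averaged = fraction_within_one_sigma(sample_means(12, 20000))  # close to 0.68
print(single, averaged)
```

Averaging only twelve values is already enough to move the one-sigma fraction close to the Gaussian expectation.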
01/23/07 PHY310: Statistical Data Analysis 10
Data Collection: Instrumentation
You need to understand the devices that collected your data.
01/23/07 PHY310: Statistical Data Analysis 11
Example: A Generic Data Set
Most modern devices are digital (or digitized)
  Data come as long columns of numbers
  Sometimes called an ntuple, or a database
Pulling results out of large databases is called data mining
Example: Muon decay times in Super-Kamiokande
0.099772 1.28129 4.17491 5.01797 8.57598 8.26794 2.9651 0.606887 5.58385 1.36029 1.31314 5.19564 2.97803 0.370979 0.809409 6.36955 0.678251 1.36112 1.42312 3.11666 9.50238 1.51145 3.41122 8.38516 1.53669 2.33519 2.88627 0.558051 6.08279 0.188279 1.88294 2.28317 2.07205 0.686567 0.769862 3.84807 0.041207 1.01861 1.17904 4.3337 1.18469 0.685796 0.893843 4.7592 4.87444 1.67218 1.19606 1.57121 3.15519 2.08465 2.15251 0.814932 1.52103 0.4395 0.977286 1.71409 1.40942 1.92409 3.81955 1.94511 1.72183 1.3747 0.475826 6.4247 7.36909 1.42768 0.353709 1.54796 6.39251 2.60078 4.22522 2.91898 1.54727 ...
If you don't know what it means, it's not very meaningful.
01/23/07 PHY310: Statistical Data Analysis 12
Pitfalls of Digital Data
Almost all modern data are digitized, and you need to be careful of the effects that digitization might introduce
Left and right show the same data, but the right side is digitized.
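One such effect can be sketched with simulated numbers. The example below assumes a hypothetical device that rounds each reading to the nearest multiple of a fixed step; the Gaussian input, the step of 0.5, and the name `digitize` are illustrative choices, not the data shown on the slide.

```python
import random

def digitize(value, step):
    """Round value to the nearest multiple of step, like a device with that resolution."""
    return round(value / step) * step

rng = random.Random(310)
analog = [rng.gauss(5.0, 0.3) for _ in range(1000)]  # simulated continuous readings
digital = [digitize(v, 0.5) for v in analog]         # the same readings, digitized

# The digitized readings collapse onto a handful of distinct values.
print(sorted(set(digital)))

# Each reading moves by at most half a step.
largest_shift = max(abs(a - d) for a, d in zip(analog, digital))
print(largest_shift <= 0.25)  # True
```

If the digitized values are histogrammed with bins narrower than the step, the plot shows spikes and empty bins that are not present in the underlying data.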
01/23/07 PHY310: Statistical Data Analysis 13
Data in this Class
Since most lab courses concentrate on collecting the data, for PHY310 we'll assume that it was collected more or less correctly. We'll start with a pile of data, and figure out what to do with it.
We start where most lab classes stop.
Questions to think about
  What was measured?
  How does the measurement relate to a physical quantity?
  What is the intrinsic accuracy of the measurement device?
  What are the limitations of the measurement device?
  What is the statistical accuracy of the measurement?
  Were there any external circumstances that affected the measurement?
01/23/07 PHY310: Statistical Data Analysis 14
Data Presentation: Single Values
When you make a measurement, you will have a value and an uncertainty.
Both should be presented:
Super-Kamiokande Muon Decay Lifetime: 2272 ns ± 21 ns
Units for each value
Reasonable number of digits for accuracy
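A minimal sketch of one way to get a "reasonable number of digits": round the uncertainty to two significant digits and the value to the same decimal place. Two digits is a common convention assumed here, not a rule stated in the lecture, and the unrounded inputs below are invented so that the output matches the lifetime quoted above.

```python
import math

def format_measurement(value, uncertainty, units, sig_digits=2):
    """Format value ± uncertainty, keeping sig_digits significant digits of the uncertainty."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the leading digit of the uncertainty.
    exponent = math.floor(math.log10(uncertainty))
    decimals = sig_digits - 1 - exponent  # may be negative for large uncertainties
    value_rounded = round(value, decimals)
    uncertainty_rounded = round(uncertainty, decimals)
    places = max(decimals, 0)             # digits printed after the decimal point
    return (f"{value_rounded:.{places}f} {units} ± "
            f"{uncertainty_rounded:.{places}f} {units}")

# Hypothetical unrounded inputs, chosen only for illustration.
print(format_measurement(2271.8374, 21.217, "ns"))  # 2272 ns ± 21 ns
print(format_measurement(0.37123, 0.0213, "mA"))    # 0.371 mA ± 0.021 mA
```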
01/23/07 PHY310: Statistical Data Analysis 15
Data Presentation: Graphs
Graphs are a good way to present two related measurements
Example: Voltage and Current for Ohm's Law

Voltage (V)    Current (mA)
1 ± 0.02       0.37 ± 0.02
2 ± 0.04       0.71 ± 0.04
4 ± 0.08       1.4 ± 0.07
5 ± 0.10       1.9 ± 0.10
7 ± 0.14       2.41 ± 0.12
10 ± 0.20      3.39 ± 0.17
01/23/07 PHY310: Statistical Data Analysis 16
Elements of a Good Graph
Title on the Graph
Axis Labels (one per axis)
Error Bars
Adding a date is a good idea. It can be removed for publication.
01/23/07 PHY310: Statistical Data Analysis 17
Data Presentation: Histograms
Lists of single values are a very common type of data.
  Usually interested in how often each value occurs (the frequency)
  Example: The time data from (about) Slide 10
01/23/07 PHY310: Statistical Data Analysis 18
Elements of a Good Histogram
Title on the Graph
Axis Label
Axis Label with the bin size
Number of entries, underflows, and overflows
Error Bars
“Small” number of bins (this histogram has 50)
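The bookkeeping behind these elements can be sketched without any plotting library. The example below bins the first ten muon decay times from the generic data set; the range 0 to 8, the five bins, and the name `histogram` are assumptions made for illustration (the units of the times are not stated in the slides), and the square root of each count is used as a simple error bar.

```python
import math

def histogram(values, n_bins, low, high):
    """Count values into n_bins equal bins on [low, high), tracking under/overflows."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    underflow = overflow = 0
    for v in values:
        if v < low:
            underflow += 1
        elif v >= high:
            overflow += 1
        else:
            # min() guards against rounding pushing a value into a bin past the last.
            counts[min(int((v - low) / width), n_bins - 1)] += 1
    return counts, underflow, overflow, width

# First ten decay times from the example data set.
times = [0.099772, 1.28129, 4.17491, 5.01797, 8.57598,
         8.26794, 2.9651, 0.606887, 5.58385, 1.36029]

counts, under, over, width = histogram(times, n_bins=5, low=0.0, high=8.0)
print(f"entries={len(times)} underflows={under} overflows={over} bin size={width}")
for i, n in enumerate(counts):
    low_edge = i * width
    print(f"[{low_edge:.1f}, {low_edge + width:.1f})  {n} ± {math.sqrt(n):.1f}")
```

Reporting the entries, underflows, and overflows alongside the bins makes it obvious when some of the data fall outside the plotted range, as two of these ten values do.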
01/23/07 PHY310: Statistical Data Analysis 19
Finally
Data Analysis is a combination of
  Understanding the data collection
  Clearly presenting the collected data
  Applying various numeric and pattern reconstruction techniques
  Using Statistics to summarize what is learned
Above all, data analysis is about pragmatism, and not mathematical purity
  The goal is to increase our (incomplete, but) useful knowledge
  We'll talk about how to make reasonable approximations, and how to account for any biases that are introduced
On Wednesday, we'll start talking about the computer programming required in data analysis
  How to read data sets
  How to make graphs and histograms
The End