statistical data analysis - stony brook...



01/23/07 PHY310: Statistical Data Analysis 1

Road Map

Class “Administrative” stuff
  Contacts, grading, expectations
  Moving the class time
What is Data Analysis?
Data Presentation

PHY310: Lecture 1

Statistical Data Analysis

By the end of the semester, I'd like you to be ready to ace the analysis part of PHY445, and be ready for any basic analysis you'll need to do in the future.

01/23/07 PHY310: Statistical Data Analysis 2

Administrative Details

My contact information
  Office: Physics D-134
  Phone: 631-632-8299
  Email: [email protected]
  Web page: nngroup.physics.sunysb.edu/~mcgrew/phy310/spring07.html

Official Class Time
  Tues/Thurs. at 8am in Physics D-122
  If there is a time we can agree on, I would like to move the class to after 10am and before 3pm on Tuesday/Thursday or Monday/Wednesday.

Computer Access
  You will need computer access (with broadband)
    I will use Unix, but it's not required
    Unix accounts can be made available at the Math SINC site
    I'll be using C++ for most things; I may also use Python for some examples.

01/23/07 PHY310: Statistical Data Analysis 3

Course Details

PHY310 is mostly about practical data analysis, so most of the grading will be based on projects
  Grading: The homework, midterm exam, and final project will count equally

Homework
  There will be the equivalent of one assignment per week. Some assignments will be cumulative, so that work from one week is required for the next assignment. A lot of the homework will require access to a computer and a minor amount of programming.

Midterm Exam
  The midterm exam will cover probability and other background material.

Final Project
  The final will be an actual data analysis project. Due on the day of the scheduled final.

01/23/07 PHY310: Statistical Data Analysis 4

Survey Questions

Class time
Email addresses
My wife is about to deliver, so I will not be here for one or two classes
  There will be VERY SHORT NOTICE.
  Options to make up lectures
    I can arrange a substitute.
    We can schedule extra lectures later in the semester (difficult)
Computer language skills
  There are several languages that are well suited for data analysis
  Which ones do you know?
    C++, Python, Ruby

01/23/07 PHY310: Statistical Data Analysis 5

What is Data Analysis?

Data analysis is a set of tools used to quantify knowledge gained from observations.

Data Presentation
  Graphical presentation, data descriptors, data summary
Numerical Methods
  Pattern recognition, numeric optimization, function approximation, &c
Probability and Statistics
  Distributions, discriminants, tests, estimators, confidence intervals, &c
  A physicist's treatment: We use statistics like we use math.
    We make mathematicians cringe
    We make statisticians cringe
A good data analysis will combine all of these tools using
  Well posed questions
    A well posed question has a unique, easily interpreted answer
  Clear reasoning
    The logic should be straightforward, A → B → C

01/23/07 PHY310: Statistical Data Analysis 6

Data: The Starting Point

Defining your data set (remember “data” is plural!)
  How was it collected?
    What are the underlying assumptions?
    Are there any experimental limitations?
  Can the measurements be repeated?
    Is it a one-time “measurement of opportunity”?
  Taking a phrase from “CSI”: Can you establish a “chain of control”?
    When were the data taken?
    Where were the data taken?
    What were the “run” conditions?
What data are you using?
  Are all of the data of the same quality?
  Should all the data be used in your analysis?
    Golden Samples, Silver Samples, and Bronze Samples
  Do you understand the data set?

01/23/07 PHY310: Statistical Data Analysis 7

Keeping Track of the Data

To use data in an analysis, you should know exactly where it came from

Record keeping is vital: you must be able to track the chain of evidence
  This means keeping a good log book
Things to record
  Date, time and place the data set was taken
  Location where the data are stored
It's a really good idea to use consistent data naming.
  e.g. Name your data files like: data-001.dat data-002.dat
  Keep track of what the data files are in your log book.
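The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name `data_file_name` and its parameters are hypothetical, and the three-digit zero padding is taken from the data-001.dat example. Zero padding keeps an alphabetical listing in run order.

```python
def data_file_name(run_number, prefix="data", extension="dat", digits=3):
    """Build a zero-padded file name such as data-001.dat (hypothetical helper)."""
    return f"{prefix}-{run_number:0{digits}d}.{extension}"

# Names for the first twelve runs; sorting them alphabetically keeps run order
names = [data_file_name(run) for run in range(1, 13)]
print(names[0], names[1], names[-1])
```

Without the padding, "data-10.dat" would sort before "data-2.dat", which makes a directory listing harder to match against the log book.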

01/23/07 PHY310: Statistical Data Analysis 8

Types of Data: “Counted”
We live in a quantum world, so fundamentally, all experiments are counting things
Things that are counted
  Blocks in a box
  Radioactive decays
  Most things in particle physics (beware, I'm a particle physicist)
The measurements are very binary
  “It” is there or not there. You record
    The number of times “it” happened
    Or the times at which “it” happened.
You can run into situations where you have a very small sample
  e.g. Charge 1/3 candidates passing through a super-conducting coil (one has been observed)
Statistical treatment is easy (it's all Poissonian)
  Statistics is easy, but there is controversy about the “right question” when you deal with very small samples
    Return to this later in the semester
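As a rough illustration of what "it's all Poissonian" means in practice, the sketch below uses two standard properties of the Poisson distribution: the probability of seeing k counts when the expected number is μ is e^(-μ) μ^k / k!, and the standard deviation is √μ, commonly estimated by √N for an observed count N. The function names are hypothetical and are not from the course materials.

```python
import math

def count_with_uncertainty(n):
    """Return (n, sqrt(n)): an observed count and its estimated Poisson standard deviation."""
    if n < 0:
        raise ValueError("a count cannot be negative")
    return n, math.sqrt(n)

def poisson_probability(k, mean):
    """Probability of observing exactly k counts when the expected number is mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

count, sigma = count_with_uncertainty(100)
print(f"{count} ± {sigma:.0f} counts")  # relative uncertainty shrinks like 1/sqrt(N)

# For a very small sample (one observed candidate), many different expected
# values are still compatible with the observation:
for mean in (0.1, 1.0, 3.0):
    print(mean, poisson_probability(1, mean))
```

The last loop hints at why very small samples are controversial: the arithmetic is simple, but deciding which question to ask of a single event is not.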

01/23/07 PHY310: Statistical Data Analysis 9

Types of Data: “Analog”

Most everyday things are analog
  Time
  Voltage
  Current
  Temperature
The measurements are continuous
  Values can be positive or negative
Statistical treatment is hard
  With counting, you know the underlying probability distribution
  With analog, you don't know the underlying distribution
    The “Central Limit Theorem” says the distribution approaches a Gaussian (more in a couple of weeks)
Practical data sets are a mix of counting and analog
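The statement that distributions approach a Gaussian can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the lecture: it averages uniform random numbers (an arbitrary choice of a clearly non-Gaussian starting distribution) and measures the fraction of averages that fall within one standard deviation of their mean. For a Gaussian that fraction is about 0.683. The function names and the fixed seed are hypothetical choices.

```python
import random
import statistics

def sample_means(n_per_mean, n_means, seed=1):
    """Average n_per_mean uniform random numbers, repeated n_means times."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_per_mean)) / n_per_mean
            for _ in range(n_means)]

def fraction_within_one_sigma(values):
    """Fraction of values closer to the sample mean than one standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mu) < sigma)
    return inside / len(values)

for n in (1, 2, 12):
    frac = fraction_within_one_sigma(sample_means(n, 20000))
    print(f"averaging {n:2d} values: {frac:.3f} within one sigma (Gaussian: 0.683)")
```

A single uniform value gives a fraction near 0.58, and the fraction moves toward the Gaussian value as more values are averaged.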

01/23/07 PHY310: Statistical Data Analysis 10

Data Collection: Instrumentation
You need to understand the devices that collected your data.

01/23/07 PHY310: Statistical Data Analysis 11

Example: A Generic Data Set
Most modern devices are digital (or digitized)
  Data come as long columns of numbers
  Sometimes called an ntuple, or a database
    Pulling results out of large databases is called data mining
Example: Muon decay times in Super-Kamiokande

0.099772 1.28129 4.17491 5.01797 8.57598 8.26794 2.9651 0.606887 5.58385 1.36029 1.31314 5.19564 2.97803 0.370979 0.809409 6.36955 0.678251 1.36112 1.42312 3.11666 9.50238 1.51145 3.41122 8.38516 1.53669 2.33519 2.88627 0.558051 6.08279 0.188279 1.88294 2.28317 2.07205 0.686567 0.769862 3.84807 0.041207 1.01861 1.17904 4.3337 1.18469 0.685796 0.893843 4.7592 4.87444 1.67218 1.19606 1.57121 3.15519 2.08465 2.15251 0.814932 1.52103 0.4395 0.977286 1.71409 1.40942 1.92409 3.81955 1.94511 1.72183 1.3747 0.475826 6.4247 7.36909 1.42768 0.353709 1.54796 6.39251 2.60078 4.22522 2.91898 1.54727 ...

If you don't know what it means, it's not very meaningful.

01/23/07 PHY310: Statistical Data Analysis 12

Pitfalls of Digital Data
Almost all modern data are digitized, and you need to be careful of effects that the digitization might introduce

Left and right show the same data, but the right side is digitized.
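One digitization effect can be reproduced with a short sketch. It assumes an idealized digitizer that rounds each value to the nearest multiple of a fixed step; the step of 0.5, the uniform test data, and the function names are all hypothetical choices for illustration. When the histogram bins are narrower than the digitization step, the digitized data leave some bins empty and pile up in others, even though the underlying values are smooth.

```python
import math
import random

def digitize(value, step):
    """Round value to the nearest multiple of step, as an idealized digitizer would."""
    return round(value / step) * step

def bin_counts(values, low, high, n_bins):
    """Count values in n_bins equal bins covering [low, high); others are ignored."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = math.floor((v - low) / width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

rng = random.Random(7)
analog = [rng.uniform(0.0, 10.0) for _ in range(10000)]
digital = [digitize(v, 0.5) for v in analog]   # assumed step of 0.5

# 40 bins of width 0.25: twice as fine as the digitization step
print(bin_counts(analog, 0.0, 10.0, 40))
print(bin_counts(digital, 0.0, 10.0, 40))
```

Choosing bin edges that match the digitization step (or a multiple of it) avoids the alternating pattern.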

01/23/07 PHY310: Statistical Data Analysis 13

Data in this Class

Since most lab courses concentrate on collecting the data, for PHY310 we'll assume that it was collected more or less correctly. We'll start with a pile of data, and figure out what to do with it.

We start where most lab classes stop.
Questions to think about
  What was measured?
    How does the measurement relate to a physical quantity?!
  What is the intrinsic accuracy of the measurement device?
    What are the limitations of the measurement device?
  What is the statistical accuracy of the measurement?
    Were there any external circumstances that affected the measurement?

01/23/07 PHY310: Statistical Data Analysis 14

Data Presentation: Single Values

When you make a measurement, you will have a value and an uncertainty.

Both should be presented:

Super-Kamiokande Muon Decay Lifetime: 2272 ns ± 21 ns

Units for each value

Reasonable number of digits for accuracy
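One common way to pick a "reasonable number of digits" is to round the uncertainty to two significant figures and round the value to the same decimal place. That convention is an assumption here, not something the slide prescribes, and the helper name `format_measurement` is hypothetical. A minimal sketch:

```python
import math

def format_measurement(value, uncertainty, unit, sig_figs=2):
    """Round the uncertainty to about sig_figs significant figures and the
    value to the same decimal place, then attach the unit to both."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    exponent = math.floor(math.log10(uncertainty))  # position of the leading digit
    decimals = sig_figs - 1 - exponent              # may be negative for large uncertainties
    value_r = round(value, decimals)
    unc_r = round(uncertainty, decimals)
    places = max(decimals, 0)
    return f"{value_r:.{places}f} {unit} ± {unc_r:.{places}f} {unit}"

# Unrounded inputs chosen for illustration; they are not measured values
print(format_measurement(2272.3141, 21.27, "ns"))
print(format_measurement(3.14159, 0.0123, "V"))
```

Quoting 2272.3141 ns ± 21.27 ns would suggest a precision the measurement does not have; the rounded form carries the same information.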

01/23/07 PHY310: Statistical Data Analysis 15

Data Presentation: Graphs

Graphs are a good way to present two related measurements
Example: Voltage and Current for Ohm's Law

Voltage (V)    Current (mA)
1 ± 0.02       0.37 ± 0.02
2 ± 0.04       0.71 ± 0.04
4 ± 0.08       1.4 ± 0.07
5 ± 0.10       1.9 ± 0.10
7 ± 0.14       2.41 ± 0.12
10 ± 0.20      3.39 ± 0.17

01/23/07 PHY310: Statistical Data Analysis 16

Elements of a Good Graph
  Title on the Graph
  Axis Label (one on each axis)
  Error Bars
  Adding a date is a good idea. It can be removed for publication.
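The elements listed here can be put together in a short plotting sketch. It assumes the third-party matplotlib library is installed (the course itself does not name a plotting tool), uses the Ohm's Law data from the earlier slide, and the function name `plot_ohms_law` and the title text are hypothetical. The import is placed inside the function so the data definitions can be used without matplotlib.

```python
import datetime

# Ohm's Law example data: values and their uncertainties
VOLTAGE_V = [1, 2, 4, 5, 7, 10]
VOLTAGE_ERR = [0.02, 0.04, 0.08, 0.10, 0.14, 0.20]
CURRENT_MA = [0.37, 0.71, 1.4, 1.9, 2.41, 3.39]
CURRENT_ERR = [0.02, 0.04, 0.07, 0.10, 0.12, 0.17]

def plot_ohms_law(filename, date_text=None):
    """Save a graph with a title, labeled axes, error bars, and a date."""
    import matplotlib
    matplotlib.use("Agg")              # draw to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Error bars on both axes; fmt="o" draws markers without a connecting line
    ax.errorbar(VOLTAGE_V, CURRENT_MA, xerr=VOLTAGE_ERR, yerr=CURRENT_ERR, fmt="o")
    ax.set_title("Current vs. Voltage (Ohm's Law)")
    ax.set_xlabel("Voltage (V)")       # axis labels include the units
    ax.set_ylabel("Current (mA)")
    if date_text is None:
        date_text = datetime.date.today().isoformat()
    # Small date stamp in the corner; easy to drop for publication
    fig.text(0.99, 0.01, date_text, ha="right", va="bottom", fontsize="small")
    fig.savefig(filename)
    plt.close(fig)
```

Calling `plot_ohms_law("ohms_law.png")` writes the graph to a file in the current directory.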

01/23/07 PHY310: Statistical Data Analysis 17

Data Presentation: Histograms
Lists of single values are a very common type of data.
  Usually interested in how often each value occurs (the frequency)
Example: The time data from (about) Slide 10

01/23/07 PHY310: Statistical Data Analysis 18

Elements of a Good Histogram
  Title on the Graph
  Axis Label
  Axis Label with the bin size
  Number of entries, underflows, and overflows
  Error Bars
  “Small” number of bins (this histogram has 50)
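The bookkeeping behind these elements can be sketched without any plotting library. The function below is a hypothetical helper, not course code: it fills equal-width bins, counts entries, underflows, and overflows, and assigns each bin the usual √N counting uncertainty for its error bar. The sample values are the first eight decay times from the example data set, and the range and bin count are arbitrary choices for illustration.

```python
import math

def make_histogram(values, low, high, n_bins):
    """Fill n_bins equal bins on [low, high) and track underflows and overflows."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    entries = underflow = overflow = 0
    for v in values:
        entries += 1
        if v < low:
            underflow += 1
        elif v >= high:
            overflow += 1
        else:
            # min() guards against floating-point rounding at the upper edge
            counts[min(int((v - low) / width), n_bins - 1)] += 1
    return {
        "bin_width": width,                          # belongs in the axis label
        "counts": counts,
        "errors": [math.sqrt(c) for c in counts],    # sqrt(N) error bars
        "entries": entries,
        "underflow": underflow,
        "overflow": overflow,
    }

times = [0.099772, 1.28129, 4.17491, 5.01797, 8.57598, 8.26794, 2.9651, 0.606887]
hist = make_histogram(times, 0.0, 8.0, 4)
print(hist["counts"], hist["entries"], hist["underflow"], hist["overflow"])
```

Reporting the underflow and overflow counts alongside the number of entries lets a reader confirm that no data were silently dropped by the choice of axis range.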

01/23/07 PHY310: Statistical Data Analysis 19

Finally
Data Analysis is a combination of
  Understanding the data collection
  Clearly presenting the collected data
  Applying various numeric and pattern reconstruction techniques
  Using Statistics to summarize what is learned
Above all, data analysis is about pragmatism, and not mathematical purity
  The goal is to increase our (incomplete, but) useful knowledge
  We'll talk about how to make reasonable approximations, and how to account for any biases that are introduced
On Wednesday, we'll start talking about the computer programming required in data analysis
  How to read data sets
  How to make graphs and histograms

The End