
Page 1: The Pan-STARRS Data Challenge Jim Heasley Institute for Astronomy University of Hawaii

The Pan-STARRS Data Challenge

Jim Heasley

Institute for Astronomy

University of Hawaii

Page 2:

IDIES09

What is Pan-STARRS?

• Pan-STARRS is a new telescope facility
• Four smallish (1.8 m) telescopes, but with an extremely wide field of view
• Can scan the sky rapidly and repeatedly, and can detect very faint objects
  – Unique time-resolution capability
• Project led by the IfA with help from the Air Force, the Maui High Performance Computing Center, and MIT’s Lincoln Lab
• The prototype, PS1, will be operated by an international consortium

Page 3:

Pan-STARRS Overview

• Time domain astronomy
  – Transient objects
  – Moving objects
  – Variable objects
• Static sky science
  – Enabled by stacking repeated scans to form a collection of ultra-deep static sky images
• Pan-STARRS observatory specifications
  – Four 1.8 m R-C telescopes + corrector
  – 7 square degree FOV, 1.4-gigapixel cameras
  – Sited in Hawaii
  – Étendue AΩ = 50 m²·deg²
  – R ~ 24 in a 30 s integration
  – → 7000 square degrees per night
  – All-sky + deep-field surveys in g, r, i, z, y
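The quoted nightly coverage follows from the field of view and integration time. A back-of-the-envelope check (the overhead-free observing assumption is mine, not from the slides):

```python
FOV_SQ_DEG = 7.0      # field of view per exposure (from the slide)
EXPOSURE_S = 30.0     # integration time per field (from the slide)

# Exposures per hour, ignoring slew/readout/filter-change overhead
# (an assumption made for this plausibility check).
fields_per_hour = 3600.0 / EXPOSURE_S             # 120 fields/hour
coverage_per_hour = fields_per_hour * FOV_SQ_DEG  # 840 sq deg/hour

# Hours needed to reach the quoted ~7000 square degrees per night.
hours_needed = 7000.0 / coverage_per_hour
print(coverage_per_hour, round(hours_needed, 1))  # 840.0 8.3
```

About 8.3 hours, roughly one long observing night, so the quoted rate is plausible even before accounting for overheads.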

Page 4:

The Published Science Products Subsystem

[Diagram: photons from the Telescope reach the Gigapixel Camera; images flow to the Image Processing Pipeline (IPP) and the Moving Object Processing System (MOPS); their detection records feed the Published Science Products Subsystem (PSPS), comprising the Object Data Manager (ODM) and the Solar System Data Manager (SSDM); the Data Retrieval Layer (DRL) and Web-Based Interface (WBI) deliver results to end users.]

Page 5:

Page 6:

Page 7:

Front of the Wave

• Pan-STARRS is only the first of a new generation of astronomical data programs that will generate such large volumes of data:
  – SkyMapper, a southern-hemisphere optical survey
  – VISTA, a southern-hemisphere IR survey
  – LSST, an all-sky survey like Pan-STARRS
• Eventually, these data sets will be useful for data mining.

Page 8:

Page 9:

PS1 Data Products

• Detections—measurements obtained directly from processed image frames
  – Detection catalogs
  – Source catalogs from “stacks” of the sky images
  – Difference catalogs
    • High significance (> 5σ transient events)
    • Low significance (transients between 3σ and 5σ)
  – Other image stacks (Medium Deep Survey)
• Objects—aggregates derived from detections

Page 10:

What’s the Challenge?

• At first blush, this looks pretty much like the Sloan Digital Sky Survey…
• BUT
  – Size: over its 3-year mission, PS1 will record over 150 billion detections for approximately 5.5 billion sources
  – Dynamic nature: new data will always be coming into the database system, both for things we’ve seen before and for new discoveries
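The size figures imply a sustained per-object load; a quick check (the mission totals are from the slide, the derived averages are mine):

```python
DETECTIONS = 150e9  # detections over the 3-year PS1 mission (from the slide)
SOURCES = 5.5e9     # distinct sources (from the slide)
YEARS = 3

per_source = DETECTIONS / SOURCES  # average detections per source
per_year = DETECTIONS / YEARS      # sustained ingest per year
print(round(per_source, 1))  # 27.3
```

Roughly 27 detections per source over the mission, with about 50 billion new detections arriving every year, which is why the ingest pipeline, not just the total size, is the hard part.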

Page 11:

How to Approach This Challenge

• There are many possible approaches to deal with this data challenge.
• Shared what?
  – Memory
  – Disk
  – Nothing
• Not all of these approaches are created equal, in cost and/or performance (DeWitt & Gray, 1992, “Parallel Database Systems: The Future of High Performance Database Processing”).

Page 12:

Conversation with the Pan-STARRS Project Manager

• Jim: Tom, what are we going to do if the solution proposed by TBJD is more than you can afford?
• Tom: Jim, I’m sure you’ll think of something!
• Not long after that, TBJD did give us a hardware/software plan we couldn’t afford. Not long after, Tom resigned from the project to pursue other activities…
• The Pan-STARRS project teamed up with Alex and his database team at JHU.

Page 13:

Building upon the SDSS Heritage

• In teaming with the group at JHU we hoped to build upon the experience and software developed for the SDSS.
• A key question was how we could scale the system to deal with the volume of data expected from PS1 (more than 10× SDSS in the first year alone).
• The second key question was whether the system could keep up with the data flow.
• The heritage is more one of philosophy than of recycled software; to deal with the challenges posed by PS1 we’ve had to write a great deal of new code.

Page 14:

The Object Data Manager

• The Object Data Manager (ODM) was considered the “long pole” in the development of the PS1 PSPS.
• Parallel database systems can provide both data redundancy and the ability to spread very large tables that can’t fit on a single machine across multiple storage volumes.

• For PS1 (and beyond) we need both.

Page 15:

Distributed Architecture

• The bigger tables will be spatially partitioned across servers called Slices
• Using slices improves system scalability
• Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges
• ObjectID boundaries are selected so that each slice has a similar number of objects
• Distributed Partitioned Views “glue” the data together
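The boundary-selection rule above amounts to equal-count (quantile) cuts over the sorted ObjectIDs rather than equal-width ID ranges. A minimal sketch, not the project's actual code (`slice_boundaries` and the synthetic IDs are hypothetical):

```python
import bisect
import random

def slice_boundaries(sorted_ids, n_slices):
    """Pick ObjectID cut points so each slice holds a similar number of
    objects: equal-count (quantile) cuts, not equal-width ID ranges."""
    n = len(sorted_ids)
    return [sorted_ids[(i * n) // n_slices] for i in range(1, n_slices)]

# Synthetic ObjectIDs standing in for the real catalog.
random.seed(42)
ids = sorted(random.sample(range(10**12), 16000))
cuts = slice_boundaries(ids, 8)

# Assign each object to a slice by binary search over the cut points.
counts = [0] * 8
for objid in ids:
    counts[bisect.bisect_right(cuts, objid)] += 1
print(counts)  # every slice holds exactly 16000 // 8 = 2000 objects
```

Because the cuts track the actual ID distribution, the slices stay balanced even when objects cluster on the sky (e.g., toward the Galactic plane).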

Page 16:

Page 17:

Design Decisions: ObjID

• Objects have their positional information encoded in their objID
  – fGetPanObjID (ra, dec, zoneH)
  – ZoneID is the most significant part of the ID
  – objID is the primary key
• Objects are organized (clustered index) so that objects nearby in the sky are stored nearby on disk as well
• This gives good search performance, spatial functionality, and scalability
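A toy illustration of the encoding idea (the real `fGetPanObjID` differs; the zone height, digit layout, and function name below are invented for illustration):

```python
ZONES_PER_DEG = 120  # hypothetical: 30-arcsec declination zones

def toy_pan_obj_id(ra_deg, dec_deg):
    """Toy objID: the zone (a declination stripe) occupies the most
    significant digits and an RA offset the least significant, so a
    clustered index on objID stores sky-adjacent objects near each
    other on disk."""
    zone_id = int((dec_deg + 90.0) * ZONES_PER_DEG)  # 0 .. 21599
    ra_key = int(ra_deg * 1_000_000)                 # RA in microdegrees
    return zone_id * 1_000_000_000 + ra_key          # zone dominates the sort

# Nearby objects get nearby IDs; objects in distant zones sort far apart.
a = toy_pan_obj_id(180.000, 10.000)
b = toy_pan_obj_id(180.001, 10.001)
c = toy_pan_obj_id(180.000, -60.000)
print(c < a < b)  # True: ordered by declination zone, then RA within a zone
```

Sorting by such an ID orders the table by declination stripe and then by RA, which is what makes cone searches mostly sequential reads.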

Page 18:

Pan-STARRS Data Flow

[Diagram: the Pan-STARRS Science Cloud. Behind the cloud, Data Valet workflows handle ingest: the Telescope feeds the Image Processing Pipeline (IPP), which emits CSV files; Load workflows build Load DBs, which populate Cold slice DBs; Merge workflows update Warm slice DBs; and Flip workflows promote them to Hot slice DBs served through the MainDB Distributed View. On the user-facing side, the CASJobs Query Service and per-user MyDBs serve the data consumers (astronomers) with queries and workflows. Data flows in one direction →, except for error recovery via the Slice Fault Recovery workflow; validation exceptions raise notifications. Admin & load-merge machines sit behind the cloud, separate from the user-facing production machines.]

Page 19:

Pan-STARRS Data Layout

[Diagram: sixteen slice partitions (S1–S16) are spread across eight slice nodes (Slice 1–Slice 8), two partitions per node, with each partition also replicated on a different node for redundancy. CSV files from the Image Pipeline land on six load-merge nodes (L1/L2 load databases), which merge new data into the cold, warm, and hot copies of the slices. Two head nodes (Head 1, Head 2) host the Main Distributed View that glues the slices together.]

Page 20:

The ODM Infrastructure

• Much of our software development has gone into extending the ingest pipeline developed for SDSS.
• Unlike SDSS, we don’t have “campaign” loads but a steady flow of data from the telescope through the Image Processing Pipeline to the ODM.
• We have constructed data workflows to deal with both the regular data flow into the ODM and anticipated failure modes (a lost disk, a RAID volume, or any of the various server nodes).

Page 21:

Pan-STARRS Object Data Manager Subsystem

[Diagram: Pan-STARRS Cloud Services for Astronomers. The Pan-STARRS Telescope feeds the Image Processing Pipeline, which extracts objects like stars and galaxies from telescope images (~1 TB input/week). Loaded astronomy databases (~70 TB transfer/week) become deployed astronomy databases (~70 TB storage/year), queried through the Query Manager (science queries, with MyDB for results). System & administration workflows orchestrate all cluster changes, such as data loading or fault tolerance, supported by system operation, system health monitor, and query performance UIs; configuration, health & performance monitoring; cluster deployment and operations; internal data flow and state logging; and tools for workflow authoring and execution. Arrows distinguish data flow from control flow.]

Page 22:

What Next?

• Will this approach scale to our needs?
  – PS1: yes. But we already see the need for better parallel-processing query plans.
  – PS4: unclear! Even though I’m not from Missouri, “show me!” One year of PS4 will produce a greater data volume than the entire 3-year PS1 mission!
• Cloud computing?
  – How can we test issues like scalability without actually building the system?
  – Does each project really need its own data center?
  – Having these databases “in the cloud” may greatly facilitate data sharing/mining.

Page 23:

Finally, Thanks

• To Alex for stepping in, hosting the development system at JHU, and building up his core team to construct the ODM, especially
  – Maria Nieto-Santisteban
  – Richard Wilton
  – Susan Werner
• And at Microsoft to
  – Michael Thomassy
  – Yogesh Simmhan
  – Catharine van Ingen