Tools for Thinking Big: Big Data

Andrew S. Jones
Colorado State University
Cooperative Institute for Research in the Atmosphere (CIRA)

CSU/CIRA Dr. Andrew S. Jones ([email protected]) Food Systems Workshop, Sep. 28, 2015



Background

◆ Physics background
◆ Satellite Weather Data + Physical Modeling => Better Results
◆ Need for Operational Decision Aids
◆ Large-scale Transdisciplinary Collaborations (multi-institutional, multi-agency)
◆ Results in ACTIONS in the field
    ▼ Save lives, reduce pain and suffering
    ▼ Inform folks to enable better decisions


Some New Institutional Tools

◆ Rocky Mountain Consortium for Global Development (RMCGD)
    ▼ Large-scale multi-institutional Public/Private/Foundation/Univ. collaboration for food security in water-limited environments (all hands on deck)
◆ CSU Innovation Center for Sustainable Agriculture (ICSA)
    ▼ Advancing soil health and new practices to feed more people


Big Data: What is it?

◆ What is Big Data?
◆ It is now a $33B industry and rapidly growing
◆ It is seen by many as the next phase of the internet’s development
◆ Adoption of pervasive cloud computing
◆ Consolidation and migration of operations to the cloud


Data Progression

◆ “Big is relative”
◆ It all starts with “0”s and “1”s.
◆ 1 page of text = ~60 lines = 5 KB
◆ 1 book = 300 pages = 1.5 MB
◆ 1 library (Library of Congress) = 37M books = ~55 TB; 12,000 books added daily = 18 GB/day
◆ My workload is about 50 GB/day
    ▼ Peaks of 4 TB/day sustained over 3 weeks
    ▼ I serve 1, Wikipedia serves ~20M users
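The size progression can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 1000 bytes) and takes the page, book, and library counts from the slide; the variable names are illustrative only. With these figures the library works out to roughly 55 TB.

```python
# Decimal storage units (an assumption; the slide does not say which convention it uses).
KB, MB, GB, TB = 1e3, 1e6, 1e9, 1e12

page_bytes = 5 * KB                       # 1 page of text, about 60 lines
book_bytes = 300 * page_bytes             # 1 book = 300 pages
library_bytes = 37e6 * book_bytes         # 37M books
daily_growth_bytes = 12_000 * book_bytes  # 12,000 books added per day

print(f"1 book:       {book_bytes / MB:.1f} MB")
print(f"1 library:    {library_bytes / TB:.1f} TB")
print(f"daily growth: {daily_growth_bytes / GB:.0f} GB/day")
```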


Data Progression

◆ Your desktop may have a 1 TB drive
◆ My work would fill it up in 3 weeks (6 h at peak)
◆ The Federal Govt. collects about 30 TB of weather data every day
◆ Your desktop would fill up in 48 min.
◆ These are small data users compared to truly large Big Data users
◆ Infrastructure-limited capacity, e.g., “town-scale”
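The fill times quoted above follow directly from the data rates. The sketch below assumes a constant data rate and a completely empty 1 TB drive; `hours_to_fill` is a hypothetical helper written for this illustration.

```python
def hours_to_fill(drive_tb, rate_tb_per_day):
    """Hours needed to fill a drive at a constant data rate."""
    return 24.0 * drive_tb / rate_tb_per_day

DRIVE_TB = 1.0  # desktop drive size from the slide

# Data rates in TB/day, taken from the slides.
typical_hours = hours_to_fill(DRIVE_TB, 0.05)   # 50 GB/day workload
peak_hours = hours_to_fill(DRIVE_TB, 4.0)       # 4 TB/day at peak
federal_hours = hours_to_fill(DRIVE_TB, 30.0)   # 30 TB/day of weather data

print(f"typical workload: {typical_hours / 24:.0f} days")
print(f"peak workload:    {peak_hours:.0f} hours")
print(f"federal weather:  {federal_hours * 60:.0f} minutes")
```

Twenty days is the "3 weeks" on the slide, and the peak and federal rates give the 6 hours and 48 minutes.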


Major Data Issues

◆ Storage Capacity
◆ I/O “Bandwidth” / Network interactions
◆ Access, Number of users, Scalability
◆ Provisioning of data in a meaningful and useful manner (2 examples)
    ▼ GIS data archives nX larger than their user base
    ▼ User databases many times larger than their database – multiple database replications
◆ Curation (quality control, documentation)


Data and Models

◆ Using weather as an example
◆ Progression of data/model use
    1. Most simple (random toss)
    2. Rules of thumb / Empirical Statistics
    3. Physical models (e.g., f(x))
    4. Optimized models (tuned by data)
    5. Artificial Intelligence inferences (can be highly automated)


New Major Trends

◆  “Internet of things” – Routers that think and consolidate – we no longer ask for data item x at location y.

◆  Instead we ask the thing to consolidate the information and preprocess to allow us to have data item x. We no longer care how the “thing” found it. Just do it.

◆ Possible huge cost savings, and a shifting of responsibilities to the “thing”.


What does this enable?

◆ Infrastructure that “assists” with the heavy and often tedious duties
◆ Allows lightweight applications to do heavyweight thinking, as coordination and massive-scale data jobs are handled elsewhere
◆ Frees high-end application developers to use their time more effectively
◆ But… we have to walk before we run…


Where are we now?

◆ Large-scale computing is here now
◆ Devices in the field are recording massive amounts of individual data sets
◆ Companies are creating vertically integrated information “towers”
◆ Consortia are likewise forming rapidly to create new public/private shared information cores (usually around themed topical research/appl. ideas)


How Big do we need to go?

◆ Earth’s surface is ~ 500 000 000 km^2
    ▼ in 1 km^2 we’d like 30 m grid resolution: 1 000 000 m^2 / (30 m)^2 = 1111 grid pts
    ▼ ~ 100+ vertical layers
    ▼ ~ 75 model variables in each element
    ▼ 14 day simulation at 10 second intervals is a time series of ~120K elements
◆ 5E8 * 1E3 * 1E2 * 75 * 1.2E5 = 4.5E20
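The element count can be reproduced as below. This is a sketch of the slide's arithmetic only; the variable names are illustrative, and it shows both the rounded product used on the slide and the product of the unrounded factors, which comes out slightly larger.

```python
surface_km2 = 5e8                      # Earth's surface, about 500 000 000 km^2
points_per_km2 = 1_000_000 / 30 ** 2   # 30 m grid cells in 1 km^2, about 1111
vertical_layers = 100
model_variables = 75
time_steps = 14 * 24 * 3600 // 10      # 14 days at 10 second intervals

# Product with the rounded factors used on the slide.
rounded_elements = 5e8 * 1e3 * 1e2 * 75 * 1.2e5

# Product with the unrounded factors.
unrounded_elements = (surface_km2 * points_per_km2 * vertical_layers
                      * model_variables * time_steps)

print(f"time steps:         {time_steps}")
print(f"rounded elements:   {rounded_elements:.2e}")
print(f"unrounded elements: {unrounded_elements:.2e}")
```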


Oops….

◆ To optimize the system, we need to know the cross-correlation errors and behaviors. Only then can we create a real solution that can prognosticate, use model errors and data errors, and return the best possible statistically optimized solution
◆ Skip a bunch of math.
◆ It goes as O(N^2), or about 2E41.
◆ 200 Duodecillion compute elements
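A quick check of the O(N^2) figure, assuming N is the 4.5E20 element count from the previous slide and that the cost is simply N squared (one entry for every pair of elements, as in a full cross-correlation matrix). The short-scale definition of a duodecillion, 10^39, is used.

```python
N = 4.5e20            # model elements, from the previous slide
DUODECILLION = 1e39   # short-scale definition

pair_count = N ** 2   # one entry per pair of elements

print(f"N^2 = {pair_count:.3e}")
print(f"    = {pair_count / DUODECILLION:.1f} duodecillion")
```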


In practice…

◆ Also, more than 7 M radiative absorption lines exist across the spectra used in remote sensing to detect molecular effects
◆ Thus, physical approximations are made: lower spatial and time resolution, approximations for solvers, diagonalization of matrices, leveraging of similar-behavior elements
◆ We can get down to 1E7-1E8


Why 1E8?

◆ 1 GB of memory is 1E9 bytes. It fits!
◆ So models are reduced until the problem is solvable while retaining the major behaviors. We are experts at this.
◆ Can we likewise simplify relationships between Food Systems components, use lower order models, and build key high priority components with the necessary “parts” linking the whole?
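The "It fits!" claim can be illustrated with a small check. The bytes-per-value figure is an assumption (64-bit floating point); the slide does not say how each element is stored, and `fits_in_memory` is a hypothetical helper.

```python
BYTES_PER_VALUE = 8   # assumption: each element is a 64-bit float
MEMORY_BYTES = 1e9    # 1 GB of memory

def fits_in_memory(n_elements, bytes_per_value=BYTES_PER_VALUE,
                   memory_bytes=MEMORY_BYTES):
    """True if n_elements values fit in the given amount of memory."""
    return n_elements * bytes_per_value <= memory_bytes

for n in (1e7, 1e8, 4.5e20):
    verdict = "fits" if fits_in_memory(n) else "does not fit"
    print(f"{n:.1e} elements: {verdict} in 1 GB")
```

Under this assumption the reduced problem of 1E7 to 1E8 elements fits in 1 GB, while the full 4.5E20-element problem is far out of reach.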


Key Questions

◆ Are all of the important variables represented?
    ▼ Are the variable sensitivities understood?
    ▼ Are the variable errors understood?
    ▼ Is each “represented” at the right scale? Do scales interact?
◆ Gap filling -> prioritization by end need
    ▼ Analytics: Cluster analysis, conditionality, knowledge inference and association


More Key Questions

◆ What are the true models of inter-relationships that underlie the data-model relationships?
◆ Do the data support the models?
◆ To what degree? Is it adequate for the end user decision making tasks?
◆ Are our results reproducible and transparent to others for additional testing? Can new data be used later?


More Key Questions

◆  How are models and data coordinated within a large community with different terms of reference? (technical lang.)

◆  What about modality of model conditions? What if distributions are more complex, multi-modal?

◆  How is this linked to other conditionally dependent systems that may be unimodal?

◆ Is a data distribution analysis needed?
◆ Non-Gaussian weather data assimilation is advancing… it will impact other fields very soon


Big Factors affecting Big Data

◆ Climate Smart Agriculture
◆ Federal Data Sharing mandates…


Climate Smart Ag.

◆ Where Food Security and Climate Meet
◆ Key Aspects: 1) Global Current State, 2) Global Future State, 3) NET change impacts, 4) Variability Risks
◆ Leads to Predictability and Monitoring Requirements
◆ Need to increase Resilience and improve Risk Mitigation Strategies
◆ Answer practical questions, e.g., “If I am currently doing ‘x’, what happens if I do ‘y’?”, given conditional probabilities of future states
◆ Next… I’ll show you the climate/wx spaces we work in, the tools used, and sources that help meet your needs


The Current Global Climate State

◆ In general, it’s warmer and more changeable
◆ *Some interesting footnotes!
◆ We don’t understand it all – by any means
◆ But there are numerous indicators that climate is changing


[Figure: Predictability Errors. Max, mean, and min error (scale 0 to 7) versus time of interest: 1 hour, 1 day, 1 week, 1 month, 1 year, 10 years.]


[Figure: Predictability Errors. Max, mean, and min error (scale 0 to 7) versus time of interest (1 hour to 10 years), annotated with forecast types: Nowcasts, Wx Fcsts, Wx Outlooks, Seasonal Fcsts, Climate Fcsts.]


Reducing Model Errors

◆ This is called “Data Assimilation”
    1. Models start from an initial set of conditions (our best guess of conditions)
    2. Observations are compared to the model results, and the initial conditions are improved to match “reality”
    3. Models are rerun using improved starting conditions to generate forecasts that better match the observed reality
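The three steps can be sketched on a toy problem. Everything here is an assumption made for illustration: a one-variable model that grows by a fixed factor each step, made-up observation errors, and a simple least-squares update that balances the first guess against the observations. It is not the operational system described in this talk.

```python
def run_model(x0, steps, growth=1.05):
    """Toy forecast model (hypothetical): the state grows by a fixed factor per step."""
    states = [x0]
    for _ in range(steps):
        states.append(states[-1] * growth)
    return states

def misfit(x0, observations, growth=1.05):
    """Sum of squared differences between the model run and the observations."""
    forecast = run_model(x0, len(observations) - 1, growth)
    return sum((f - o) ** 2 for f, o in zip(forecast, observations))

def assimilate(first_guess, observations, growth=1.05, background_weight=0.1):
    """Improve the initial condition by least squares.

    Minimizes background_weight * (x0 - first_guess)^2 + misfit(x0); for this
    linear toy model the minimum has a closed form.
    """
    weights = [growth ** t for t in range(len(observations))]
    numerator = background_weight * first_guess + sum(
        w * o for w, o in zip(weights, observations))
    denominator = background_weight + sum(w * w for w in weights)
    return numerator / denominator

# Step 1: start from a best guess of the initial conditions.
first_guess = 8.0

# Step 2: compare with observations (true start 10.0 plus made-up errors)
# and improve the initial condition.
true_start = 10.0
errors = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
observations = [s + e for s, e in zip(run_model(true_start, 5), errors)]
analysis = assimilate(first_guess, observations)

# Step 3: rerun the model from the improved starting condition.
forecast = run_model(analysis, 10)

print(f"first guess: {first_guess:.2f}, misfit {misfit(first_guess, observations):.3f}")
print(f"analysis:    {analysis:.2f}, misfit {misfit(analysis, observations):.3f}")
```

The `background_weight` parameter controls how strongly the update trusts the first guess relative to the observations; a real system would derive such weights from model and data error statistics.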


Data Assimilation Pros/Cons

◆ Models can be much improved
◆ Works for models which have predictable components
◆ Many climate signatures are long-term features, thus they are forced using “boundary conditions” and outputs are analyzed statistically (in aggregate)

◆  You need 1) a good model, 2) sensitivities to the data within the model, and 3) good data

◆  Many national centers collaborate with CSU on the development of these systems


What I specialize in…

◆ Multisatellite Data Assimilation and Blended Satellite Data Products

◆ Focus is on dynamic real-time intercalibration of many satellite sensors (US and International)

◆ End users are operational DoD and NOAA National Weather Service users

◆ ALSO new agricultural industry users…


NOAA examples

◆ Global Blended Water Vapor
◆ The CSU Global Rainfall Data will be shown later
    ▼ 5+ year high resolution (~9 km) database > 1.5M data files
    ▼ > 10 satellites


New Soil Moisture Satellite Data and Techniques

◆  Global Satellite Soil Moisture data (DoD/NASA)

◆ CSU methods to downscale soil moisture to 10 meter resolution (crop scale) (Prof. Niemann)

[Figure: 18 ft. wide; DoD R&D for parallel Ag Use]

Plus more sensors in use and under design. What’s new is combining, downscaling, and sharing it all (in near real-time)


Daily Aggregated Satellite-based Rainfall Data (showing land regions only)

CSU Blended Rain Rate total rainfall (mm) for 2013-10-03


aWhere Platform – Used for CSU 4H educational purposes


Conclusions

◆ Big Data is really big and transforming
◆ Communication and prioritization are critical needs to ensure that the right big data is there
◆ Working groups and new coordinating projects/committees could shape how this is most effectively done
◆ Big Data goes with interpretive models
◆ Advanced mathematics and new analytics techniques are needed to reach scalability objectives
◆ It takes transdisciplinary teamwork to make it all work