CSU/CIRA Dr. Andrew S. Jones ([email protected]) Food Systems Workshop, Sep. 28, 2015
Andrew S. Jones Colorado State University
Cooperative Institute for Research in the Atmosphere (CIRA)
Tools for Thinking Big: Big Data
Background
◆ Physics background
◆ Satellite Weather Data + Physical Modeling => Better Results
◆ Need for Operational Decision Aids
◆ Large-scale Transdisciplinary Collaborations (multi-institutional, multi-agency)
◆ Results in ACTIONS in the field
▼ Save lives, reduce pain and suffering
▼ Inform folks to enable better decisions
Some New Institutional Tools
◆ Rocky Mountain Consortium for Global Development (RMCGD)
▼ Large-scale multi-institutional Public/Private/Foundation/Univ. collaboration for food security in water-limited environments (all hands on deck)
◆ CSU Innovation Center for Sustainable Agriculture (ICSA)
▼ Advancing soil health and new practices to feed more people
Big Data: What is it?
◆ What is Big Data?
◆ It is now a $33B industry and rapidly growing
◆ It is seen by many as the next phase of the internet’s development
◆ Adoption of pervasive cloud computing
◆ Consolidation and migration of operations to the cloud
Data Progression
◆ “Big is relative”
◆ It all starts with “0”s and “1”s.
◆ 1 page of text = ~60 lines = 5 KB
◆ 1 book = 300 pages = 1.5 MB
◆ 1 library (Lib. of Cong.) = 37M books = 55 TB; 12,000 books added daily = 18 GB/day
◆ My work load is about 50 GB/day
▼ Peaks of 4 TB/day sustained over 3 weeks
▼ I serve 1 user, Wikipedia serves ~20M users
Data Progression
◆ Your desktop may have a 1 TB drive
◆ My work would fill it up in 3 weeks (6 h at peak)
◆ The Federal Govt. collects about 30 TB of weather data every day
◆ Your desktop would fill up in 48 min.
◆ These are small data users compared to truly large Big Data users
◆ Infrastructure-limited capacity, e.g., “town-scale”
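The fill times quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes decimal units (1 TB = 1000 GB), which the slides do not specify, and the helper name `days_to_fill` is made up for illustration.

```python
# Back-of-the-envelope check of the drive fill times quoted on the slide.
# Assumption: decimal units (1 TB = 1000 GB); the slides do not say which convention is used.
GB_PER_TB = 1000


def days_to_fill(drive_tb: float, rate_gb_per_day: float) -> float:
    """Days needed to fill a drive of `drive_tb` terabytes at a steady daily rate."""
    return drive_tb * GB_PER_TB / rate_gb_per_day


typical_days = days_to_fill(1, 50)                             # typical load: 50 GB/day
peak_hours = days_to_fill(1, 4 * GB_PER_TB) * 24               # peak load: 4 TB/day
federal_minutes = days_to_fill(1, 30 * GB_PER_TB) * 24 * 60    # federal weather data: 30 TB/day

print(f"typical load: {typical_days:.0f} days (about 3 weeks)")
print(f"peak load: {peak_hours:.0f} hours")
print(f"federal weather data: {federal_minutes:.0f} minutes")
```

Under these assumptions the three results come out at 20 days, 6 hours, and 48 minutes, which matches the slide's "3 weeks", "6h at peak", and "48min".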
Major Data Issues
◆ Storage Capacity
◆ I/O “Bandwidth” / Network interactions
◆ Access, Number of users, Scalability
◆ Provisioning of data in a meaningful and useful manner (2 examples)
▼ GIS data archives nX larger than their user base
▼ User databases many times larger than their database – multiple database replications
◆ Curation (quality control, documentation)
Data and Models
◆ Using weather as an example
◆ Progression of data/model use
1. Most simple (random toss)
2. Rules of thumb / Empirical Statistics
3. Physical models (e.g., f(x))
4. Optimized models (tuned by data)
5. Artificial Intelligence inferences (can be highly automated)
New Major Trends
◆ “Internet of things” – Routers that think and consolidate – we no longer ask for data item x at location y.
◆ Instead we ask the thing to consolidate the information and preprocess to allow us to have data item x. We no longer care how the “thing” found it. Just do it.
◆ Possible huge cost savings, and a shifting of responsibilities to the “thing”.
What does this enable?
◆ Infrastructure that “assists” with the heavy and many times tedious duties
◆ Allows for light weight applications to do heavy weight thinking - as coordination and massive scale data jobs are handled elsewhere
◆ Frees high-end application developers to use their time more effectively
◆ But… we have to walk before we run…
Where are we now?
◆ Large scale computing is here now
◆ Devices in the field are recording massive amounts of individual data sets
◆ Companies are creating vertically integrated information “towers”
◆ Consortia are rapidly forming to likewise create new public/private shared information cores (usually around themed topical research/appl. ideas)
How Big do we need to go?
◆ Earth’s surface is ~500,000,000 km^2
▼ in 1 km^2 we’d like 30 m grid resolution: 1,000,000 m^2 / (30 m)^2 = 1111 grid pts
▼ ~100+ vertical layers
▼ ~75 model variables in each element
▼ 14 day simulation at 10 second intervals is a time series of ~120K elements
◆ 5E8 * 1E3 * 1E2 * 75 * 1.2E5 = 4.5E20
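The element count above is a product of five rounded factors. As a rough check, the sketch below recomputes it twice: once with the slide's rounded factors and once with the unrounded grid-point and time-step counts. All variable names are illustrative; the numbers are the ones quoted on the slide.

```python
# Rough count of model state elements, using only the figures quoted on the slide.
surface_km2 = 5e8                   # Earth's surface, ~500,000,000 km^2
pts_per_km2 = 1_000_000 / 30**2     # 30 m grid cells inside 1 km^2
layers = 100                        # ~100+ vertical layers
variables = 75                      # ~75 model variables per element
steps = 14 * 24 * 3600 // 10        # 14 days sampled every 10 seconds

# Product of the unrounded factors.
unrounded = surface_km2 * pts_per_km2 * layers * variables * steps

# Product of the rounded factors exactly as written on the slide.
rounded = 5e8 * 1e3 * 1e2 * 75 * 1.2e5

print(f"grid points per km^2: {pts_per_km2:.0f}")
print(f"time steps: {steps}")
print(f"unrounded: {unrounded:.2e}")
print(f"rounded as on the slide: {rounded:.1e}")
```

Both versions land within about 10% of each other, near 5E20, so the rounding on the slide does not change the order of magnitude.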
Oops….
◆ To optimize the system, we need to know the cross-correlated errors and behaviors, so that a real solution can prognosticate, account for both model errors and data errors, and return the best possible statistically optimized solution
◆ Skip a bunch of math.
◆ It goes as O(N^2). With N = 4.5E20, that is ~2E41.
◆ 200 Duodecillion compute elements
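The slide skips the math, so the following is only a plausible reading, not the author's derivation: if every element's error can be correlated with every other element's error, a full error covariance has N x N entries. The sketch squares the element count from the previous slide (N = 4.5E20) using exact integers and converts the result to short-scale duodecillions (10^39).

```python
# Assumption: the O(N^2) cost comes from pairwise error correlations (an N x N matrix).
N = 45 * 10**19              # 4.5E20 elements, from the previous slide
DUODECILLION = 10**39        # short-scale (US) naming

pairs = N * N                # exact integer arithmetic, no floating-point rounding

print(f"N^2 = {float(pairs):.3e}")
print(f"about {pairs // DUODECILLION} duodecillion")
```

The exact product is 2.025E41, which the slide rounds to ~2E41, or roughly 200 duodecillion.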
In practice…
◆ Also, more than 7 M radiative absorption lines exist in the spectra used by remote sensing to detect molecular effects
◆ Thus, physical approximations are made: lower spatial and time resolution, approximations for solvers, diagonalization of matrices, leveraging of similar-behavior elements
◆ We can get down to 1E7-1E8
Why 1E8?
◆ 1 GB of memory is 1E9. It fits!
◆ So models are reduced until the problem is solvable while retaining the major behaviors. We are experts at this.
◆ Can we likewise simplify relationships between Food Systems components, use lower order models, and build key high priority components with the necessary “parts” linking the whole?
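The "It fits!" remark on this slide can be made concrete with one assumption that the slide does not state: each element is stored as an 8-byte floating-point value. Under that assumption, the sketch below compares the reduced problem sizes (1E7 to 1E8) and the original 4.5E20 against 1 GB, taken as 1E9 bytes as on the slide. The function name `fits_in_memory` is hypothetical.

```python
# Assumption: one 8-byte (float64) value per element; the slide does not state a precision.
BYTES_PER_VALUE = 8
GB = 1e9                     # the slide's "1 GB of memory is 1E9" (bytes)


def fits_in_memory(n_elements: float, memory_bytes: float = GB) -> bool:
    """True if n_elements values of BYTES_PER_VALUE bytes fit in memory_bytes."""
    return n_elements * BYTES_PER_VALUE <= memory_bytes


for n in (1e7, 1e8, 4.5e20):
    size_gb = n * BYTES_PER_VALUE / GB
    print(f"{n:.1e} elements -> {size_gb:.2g} GB, fits in 1 GB: {fits_in_memory(n)}")
```

With 8-byte values, 1E8 elements take 0.8 GB, so the reduced problem sits just under the 1 GB mark while the unreduced one is far out of reach.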
Key Questions
◆ Are all of the important variables represented?
▼ Are the variable sensitivities understood?
▼ Are the variable errors understood?
▼ Is each “represented” at the right scales? Do scales interact?
◆ Gap filling -> prioritization by end need
▼ Analytics: Cluster analysis, conditionality, knowledge inference and association
More Key Questions
◆ What are the true models of inter-relationships that underlie the data-model relationships?
◆ Do the data support the models?
◆ To what degree? Is it adequate for the end user decision making tasks?
◆ Are our results recreatable and transparent to others for additional testing? Can new data be used later?
More Key Questions
◆ How are models and data coordinated within a large community with different terms of reference? (technical lang.)
◆ What about modality of model conditions? What if distributions are more complex, multi-modal?
◆ How is this linked to other conditionally dependent systems that may be unimodal?
◆ Is a data distribution analysis needed?
◆ Non-Gaussian weather data assimilation is advancing… it will impact other fields very soon
Big Factors affecting Big Data
◆ Climate Smart Agriculture
◆ Federal Data Sharing mandates…
Climate Smart Ag.
◆ Where Food Security and Climate Meet
◆ Key Aspects: 1) Global Current State, 2) Global Future State, 3) NET change impacts, 4) Variability Risks
◆ Leads to Predictability and Monitoring Requirements
◆ Need to increase Resilience and improve Risk Mitigation Strategies
◆ Answer practical questions: e.g., “If I am currently doing ‘x’, what happens if I do ‘y’?”, given conditional probabilities of future states
◆ Next… I’ll show you the climate/wx spaces we work in, the tools used, and sources that help meet your needs
The Current Global Climate State
◆ In general, it’s warmer and more changeable
◆ *Some interesting footnotes!
◆ We don’t understand it all – by any means
◆ But there are numerous indicators that climate is changing
Predictability Errors
[Figure: Max Error, Mean Error, and Min Error (scale 0 to 7) plotted against Time of Interest: 1 hour, 1 day, 1 week, 1 month, 1 year, 10 years]
Predictability Errors
[Figure: Max Error, Mean Error, and Min Error (scale 0 to 7) plotted against Time of Interest (1 hour, 1 day, 1 week, 1 month, 1 year, 10 years), annotated with forecast types: Nowcasts, Wx Fcsts, Wx Outlooks, Seasonal Fcsts, Climate Fcsts]
Reducing Model Errors
◆ This is called “Data Assimilation”
1. Models start from an initial set of conditions (our best guess of conditions)
2. Observations are compared to the model results, and the initial conditions are improved to match “reality”
3. Models are rerun using improved starting conditions to generate forecasts that better match the observed reality
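The three steps above can be illustrated with a deliberately tiny example. This is a toy sketch, not the operational method used at CSU or any weather center: the model is a one-variable linear growth model, the observations are made-up numbers, and the "improvement" is an ordinary least-squares fit of the initial condition. Real data assimilation systems also weight the fit by model and observation error statistics, which this sketch omits.

```python
# Toy data assimilation loop. The model, growth factor, and observations are all illustrative.

def run_model(x0: float, growth: float, n_steps: int) -> list[float]:
    """Toy linear model: each step multiplies the state by `growth`."""
    states = [x0]
    for _ in range(n_steps):
        states.append(states[-1] * growth)
    return states


def improve_initial_condition(observations: list[float], growth: float) -> float:
    """Least-squares estimate of x0, given observations y_t ~ x0 * growth**t."""
    numerator = sum(y * growth**t for t, y in enumerate(observations))
    denominator = sum(growth ** (2 * t) for t in range(len(observations)))
    return numerator / denominator


def total_squared_error(x0: float, observations: list[float], growth: float) -> float:
    """Sum of squared differences between the model run and the observations."""
    modeled = run_model(x0, growth, len(observations) - 1)
    return sum((m - y) ** 2 for m, y in zip(modeled, observations))


growth = 1.1
observations = [2.05, 2.18, 2.45, 2.64]   # made-up "reality", observed at steps 0..3

# Step 1: start from a best-guess initial condition.
first_guess = 1.5
error_before = total_squared_error(first_guess, observations, growth)

# Step 2: compare with observations and improve the initial condition.
improved = improve_initial_condition(observations, growth)
error_after = total_squared_error(improved, observations, growth)

# Step 3: rerun the model from the improved start to produce a forecast.
forecast = run_model(improved, growth, 6)

print(f"first guess x0 = {first_guess}, squared error = {error_before:.3f}")
print(f"improved x0 = {improved:.3f}, squared error = {error_after:.3f}")
print("forecast:", [round(x, 2) for x in forecast])
```

The design point is that the observations never overwrite the model; they only adjust its starting state, and the forecast beyond the observed window still comes from the model itself.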
Data Assimilation Pros/Cons
◆ Models can be much improved
◆ Works for models which have predictable components
◆ Many climate signatures are long-term features, thus they are forced using “boundary conditions” and outputs are analyzed statistically (in aggregate)
◆ You need 1) a good model, 2) sensitivities to the data within the model, and 3) good data
◆ Many national centers collaborate with CSU on the development of these systems
What I specialize in…
◆ Multisatellite Data Assimilation and Blended Satellite Data Products
◆ Focus is on dynamic real-time intercalibration of many satellite sensors (US and International)
◆ End users are operational DoD and NOAA National Weather Service users
◆ ALSO new agricultural industry users…
NOAA examples
◆ Global Blended Water Vapor
◆ The CSU Global Rainfall Data will be shown later
▼ 5+ year high resolution (~9 km) database > 1.5M data files
▼ > 10 satellites
New Soil Moisture Satellite Data and Techniques
◆ Global Satellite Soil Moisture data (DoD/NASA)
◆ CSU Methods to downscale soil moisture to 10 meter resolutions (crop-scales) (Prof. Neimann)
18 ft. wide DoD R&D for parallel Ag Use
Plus more sensors in use and under design. What’s new is combining, downscaling, and sharing it all (in near real-time)
Daily Aggregated Satellite-based Rainfall Data (showing land regions only)
CSU Blended Rain Rate total rainfall (mm) for 2013-10-03
aWhere Platform – Used for CSU 4H educational purposes
Conclusions
◆ Big Data is really big and transforming
◆ Communication and prioritization are critical needs to ensure that the right big data is there
◆ Working groups and new coordinating projects/committees could shape how this is most effectively done
◆ Big Data goes with interpretive models
◆ Use of advanced mathematics and new analytics techniques is needed to reach scalability objectives
◆ It takes transdisciplinary teamwork to make it all work