Data Science for USDA Big Data Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community Semantic Community Data Science Big Data Science for Precision Farming Business August 19, 2015 1

Upload: diane-atkinson

Post on 31-Dec-2015


Page 1

Data Science for USDA Big Data

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist

Semantic Community

Data Science: Big Data Science for Precision Farming Business

August 19, 2015

Page 2

The Journey

• USDA Data Sources:
  • Open Data
  • Innovation Challenge
  • Data Driven Precision Farming Online Course

• Data Audit (See Next Slide for Details):
  • Mine
  • Science
  • Questions
  • Publish

• Results:
  • Open Data: Problem Reading Catalog Was Fixed and USDA Data Science MOOC Was Created
  • Innovation Challenge: Problems with Farm Data Dashboard Data Sets and Went Back to Original Data Sets
  • Data Driven Precision Farming Online Course: Problems Understanding Multiple Soils Data Sets Being Sorted Out

Page 3

Data Mining - Science - Questions - Publication Process

• Data Mining Process:
  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

• Data Science Process:
  • Data Preparation
  • Data Ecosystem
  • Data Story

• Data Science Questions:
  • How was the data collected?
  • Where is the data stored?
  • What are the data results? and
  • Why should we believe the data results?

• Data Science Data Publication:
  • Knowledge Base
  • Spreadsheet Index
  • Web & PDF Tables to Spreadsheet
  • Data Browser
  • Dynamically Linked Adjacent Visualizations
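The six data mining steps listed on this slide are the phases of CRISP-DM, and the later soil file inventory slide labels Week 4 of the online course as Modeling and Week 5 as Evaluation. A minimal Python sketch of that week-to-phase mapping follows. It assumes one phase per week in the listed order, which the slides confirm only for weeks 4 and 5; the names `PHASES` and `phase_for_week` are illustrative.

```python
# Phases in the order the slide lists them (the standard CRISP-DM order).
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


def phase_for_week(week: int) -> str:
    """Return the phase for a 1-based course week.

    Assumption: week N covers the Nth phase. The slides state this only
    for week 4 (Modeling) and week 5 (Evaluation).
    """
    if not 1 <= week <= len(PHASES):
        raise ValueError(f"week must be between 1 and {len(PHASES)}")
    return PHASES[week - 1]


if __name__ == "__main__":
    for week in (4, 5):
        print(f"Week {week}: {phase_for_week(week)}")
```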

Page 4

NIH Data Commons

• FAIR Principles:
  • Findable
  • Accessible
  • Interoperable
  • Reusable

• Cloud:
  • Data
  • Software
  • Results

• Federal Science Policy:
  • OSTP Public Access to Scientific Data Memo (February 2013)
  • New Program: Big-Data-to-Knowledge (2013)
  • New Position: Associate Director of Data Science (2014)
  • Digital Enterprise (2015): Data Commons
    • Metadata
    • Open APIs
    • Digital Objects
    • Containers

Federal Big Data Working Group Meetup, August 17, 2015: A NIH – Semantic Medline Data Science Data Publication Commons

Page 5

OSTP/NSF Data Science Meetup of Meetups

• Week of November 2nd:
  • NSF Data Science/Big Data Principal Investigators (About 300)
  • NSF Data Hubs (4)
  • Organizers of Largest Data Science/Big Data Meetups (About 65)

• Pipeline for Return on Investment:
  • PIs put their data, tools, and research results in the Data Hubs
  • Data Hubs provide those data, tools, and research results to the world, but especially to the Data Science/Big Data Meetups
  • Data Science/Big Data Meetups collaborate with PIs and Data Hubs to increase usage and feedback

Page 6

We Already Do This!

• Semantic Community:
  • Provides a Community Sandbox that is like a GitHub, Data Hub, Data Commons, etc.
    • Metadata (MindTouch)
    • Open APIs (MindTouch)
    • Digital Objects (MindTouch)
    • Containers (Spotfire)
  • Organizes the Federal Big Data Working Group Meetup
  • Supports Agencies and Programs in Crowdsourcing Their Data Sets
  • Mentors Data Scientists (Tutorials and MOOCs) and Entrepreneurs (Eastern Foundry)

• Federal Big Data Working Group Meetup:
  • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
  • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
  • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products; and
  • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like MOOCs (Massive Open Online Courses), now embraced by the White House.

Page 7

http://semanticommunity.info/Data_Science/Big_Data_Science_for_Precision_Farming_Business#Story_2

Page 8

USDA Big Data in Spotfire

• USDAMOOC-Spotfire: 1.7 GB
  • Web Player

• USDANASS-Spotfire: 730 MB
  • Web Player

• FarmDataDashboard1-Spotfire: 521 MB
  • Web Player

• FarmDataDashboard2-Spotfire: 1.2 GB
  • In Process

• FarmData-Spotfire: 15 MB
  • Web Player

• NCSSSoilSurvey-Spotfire: 235 MB
  • Web Player

• NCSSSoilCharacterizationDatabase6302015GDB-Spotfire: 144 MB
  • Web Player
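To get a rough sense of the combined footprint of these Spotfire files, the sizes on this slide can be totaled. The Python sketch below is illustrative only: it assumes decimal units (1 GB = 1000 MB), which the slide does not specify, and the helper names are made up.

```python
import re

# Sizes exactly as listed on the slide.
SIZES = {
    "USDAMOOC-Spotfire": "1.7 GB",
    "USDANASS-Spotfire": "730 MB",
    "FarmDataDashboard1-Spotfire": "521 MB",
    "FarmDataDashboard2-Spotfire": "1.2 GB",
    "FarmData-Spotfire": "15 MB",
    "NCSSSoilSurvey-Spotfire": "235 MB",
    "NCSSSoilCharacterizationDatabase6302015GDB-Spotfire": "144 MB",
}

# Assumption: decimal units. Use 1024 here if binary units are intended.
_MB_PER_UNIT = {"MB": 1, "GB": 1000}


def to_mb(text: str) -> float:
    """Convert a string such as '1.7 GB' or '730 MB' to megabytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(MB|GB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _MB_PER_UNIT[match.group(2)]


def total_mb(sizes: dict) -> float:
    """Sum all sizes in megabytes."""
    return sum(to_mb(value) for value in sizes.values())


if __name__ == "__main__":
    print(f"Total: about {total_mb(SIZES) / 1000:.1f} GB")
```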

Page 13

Extra Slides on USDA Soils Data Sets

• File Inventory for Weeks 4 and 5 of Online Course
• Multiple Data Sets in Multiple Formats in Multiple Places
• The Newest Gridded Soil Survey Geographic (gSSURGO) Database Requires Advanced Tool to Convert from GDB to SHP
• 32 SHP Files with Attributes But No Location and Many Access Data Sets to Export to Excel
• Spotfire for Data Relationships (Statistical and Linking)
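Files that carry attributes but no location are usually made mappable by linking them to a spatial layer on a shared key. In SSURGO, tabular records carry a map unit key (`mukey`) that also appears on the map unit polygons. The Python sketch below shows that linking step with invented rows. The values, the `centroid` field, and the function name are hypothetical, and the slides use Spotfire rather than code for this step.

```python
def link_on_key(attribute_rows, spatial_rows, key="mukey"):
    """Attach location fields to attribute rows that share `key`.

    Returns (linked, unmatched). Linked rows are merged dicts; unmatched
    rows are attribute rows whose key has no spatial counterpart.
    """
    by_key = {row[key]: row for row in spatial_rows}
    linked, unmatched = [], []
    for row in attribute_rows:
        match = by_key.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            linked.append({**match, **row})
    return linked, unmatched


# Hypothetical sample rows, not real survey values.
attributes = [
    {"mukey": "1001", "muname": "Example silt loam"},
    {"mukey": "1002", "muname": "Example clay"},
    {"mukey": "1003", "muname": "Example sand"},
]
spatial = [
    {"mukey": "1001", "centroid": (-96.1, 40.6)},
    {"mukey": "1002", "centroid": (-96.2, 40.7)},
]

linked, unmatched = link_on_key(attributes, spatial)

if __name__ == "__main__":
    print(f"{len(linked)} rows linked, {len(unmatched)} without a location")
```

Reporting the unmatched rows separately makes it easy to see which attribute records still lack a location after the link.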

Page 14

Weeks 4 and 5 Soil File Inventory

• Week 4: Modeling
  • NCSS_Soil_Characterization_Database: GDB and Access (NCSSSoilCharacterizationDatabase6302015GDB-Spotfire)
  • SoilDataAvailabilityShapefile: Shape (NCSSSoilSurvey-Spotfire)
  • Otoe County, Nebraska: PDF, GIF, and PNG (FarmData-Spotfire)
  • Master_query 72 MB Excel and NCSS_Site_Location 7 MB Excel (NCSSSoilSurvey-Spotfire) and (FarmData-Spotfire)

• Week 5: Evaluation
  • wss_SSA_NE131_soildb_NE_2003_[2014-09-02] 19 MB Shape and 12 MB Access (NCSSSoilCharacterizationDatabase6302015GDB-Spotfire)
  • wss_gsmsoil_NE_[2006-07-06] 8 MB Shape and 11 MB Access (NCSSSoilCharacterizationDatabase6302015GDB-Spotfire)
  • nrcs142p2_052440 28 MB Shape and Excel and PDF Image Map (NCSSSoilSurvey-Spotfire) and (FarmData-Spotfire)
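Two of the Week 5 file names end in a bracketed suffix that looks like an ISO date. When building a file inventory like this one, that suffix can be split off so files sort by date. The Python sketch below assumes the bracketed text is a `YYYY-MM-DD` stamp; the slides do not say what the date represents, and `split_stamp` is an illustrative name.

```python
import re
from datetime import date

# Matches a trailing "_[YYYY-MM-DD]" suffix on a file name.
_STAMP = re.compile(r"^(?P<stem>.+)_\[(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})\]$")


def split_stamp(name: str):
    """Return (stem, date) for a stamped name, or (name, None) otherwise."""
    match = _STAMP.match(name)
    if match is None:
        return name, None
    stamp = date(int(match.group("y")), int(match.group("m")), int(match.group("d")))
    return match.group("stem"), stamp


if __name__ == "__main__":
    for name in (
        "wss_SSA_NE131_soildb_NE_2003_[2014-09-02]",
        "wss_gsmsoil_NE_[2006-07-06]",
        "nrcs142p2_052440",
    ):
        print(split_stamp(name))
```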

Page 15

Digital Soil Geographic Databases (GIS-ready)

• Land Resource Regions (LRR) and Major Land Resource Areas (MLRA):
  • http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053624

• Common Resource Areas (CRA):
  • http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053635

• U.S. General Soil Map (STATSGO2):
  • http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053629

• Soil Survey Geographic (SSURGO) Database:
  • http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053627

• Gridded Soil Survey Geographic (gSSURGO) Database (See Previous Slides):
  • http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053628

• National Cooperative Soil Survey Soil Characterization Database (Pedons):
  • http://ncsslabdatamart.sc.egov.usda.gov/

Page 16

Geospatial Data Gateway

Page 17

| Map Layer | Status | Map/Spotfire | Source | Format | Projection |
|---|---|---|---|---|---|
| Major Land Resource Areas by State | Link | Yes | USDA | ESRI Shape, ESRI File GeoDataBase | Geographic, UTM, State Plane |
| Common Resource Areas by State | Link | State Only? | USDA | ESRI Shape, ESRI File GeoDataBase | Geographic, UTM, State Plane |
| Soil Survey Spatial and Tabular Data (SSURGO 2.2) | Link | Yes | USDA | ESRI Shape | WGS84 Geographic |
| Raster Soil Survey | Link | GDB-to-SHP, Not Yet | USDA | ESRI File GeoDataBase | AutoUTM to county |
| U.S. General Soil Map (STATSGO2) by State | Link | Nebraska, Yes | USDA | ESRI Shape | WGS84 Geographic |
| Gridded Soil Survey Geographic (gSSURGO) by State or Conterminous U.S. | Link | GDB-to-SHP, Yes | USDA | ESRI File GeoDataBase | Albers |

My Note: These Links Do Not Go To Data Download

Page 18

http://websoilsurvey.sc.egov.usda.gov/DSD/Download/Cache/STATSGO2/wss_gsmsoil_US_[2006-07-06].zip

My Note: Did Not Download
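The slide does not say why this download failed. One thing that is cheap to rule out is the literal square brackets in the file name: RFC 3986 reserves them for IPv6 host literals, so they are normally percent-encoded elsewhere in a URL, and tools such as curl treat unencoded brackets as range patterns. A minimal Python sketch using the standard library's `urllib.parse.quote` follows. The function name `encode_path` is illustrative, and whether the server still hosts this file is not something the slides confirm.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(url: str) -> str:
    """Percent-encode unsafe characters in the path component only.

    Note: a path that is already percent-encoded would be encoded twice.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


URL = (
    "http://websoilsurvey.sc.egov.usda.gov/DSD/Download/Cache/"
    "STATSGO2/wss_gsmsoil_US_[2006-07-06].zip"
)

if __name__ == "__main__":
    print(encode_path(URL))
```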

Page 19

http://www.nrcs.usda.gov/wps/portal/nrcs/site/soils/home/

FY2015 gSSURGO Database Release

The FY2015 Gridded Soil Survey Geographic (gSSURGO) Database was released on February 23, 2015. These data are derived from a December 1, 2014, snapshot of the Soil Data Mart database. These new data are available in both state-wide tiles and the Conterminous U.S. (CONUS).

Page 20

Description of Gridded Soil Survey Geographic (gSSURGO) Database

• Gridded SSURGO (gSSURGO) is similar to the standard USDA-NRCS Soil Survey Geographic (SSURGO) Database product but in the format of an Environmental Systems Research Institute, Inc. (ESRI®) file geodatabase. A file geodatabase has the capacity to store much more data and thus greater spatial extents than the traditional SSURGO product. This makes it possible to offer these data in statewide or even Conterminous United States (CONUS) tiles. gSSURGO contains all of the original soil attribute tables in SSURGO. All spatial data are stored within the geodatabase instead of externally as separate shapefiles. Both SSURGO and gSSURGO are considered products of the National Cooperative Soil Survey (NCSS) partnership.

• The gridded SSURGO (gSSURGO) dataset was created for use in national, regional, and statewide resource planning and analysis of soils data. The raster map layer data can be readily combined with other national, regional, and local raster layers, including the National Land Cover Database (NLCD), the National Agricultural Statistics Service (NASS) Crop Data Layer (CDL), and the National Elevation Dataset (NED).
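Combining the gSSURGO raster with another raster layer amounts to comparing cells that share a grid position. The pure-Python sketch below cross-tabulates two tiny aligned grids. The cell values (stand-ins for map unit keys and land cover codes) are invented, and the sketch assumes both layers already share extent, cell size, and projection; real layers may need resampling or reprojection first.

```python
from collections import Counter


def cross_tabulate(soil_grid, other_grid):
    """Count cells for each (soil value, other value) pair."""
    same_shape = len(soil_grid) == len(other_grid) and all(
        len(a) == len(b) for a, b in zip(soil_grid, other_grid)
    )
    if not same_shape:
        raise ValueError("grids must have the same shape")
    counts = Counter()
    for soil_row, other_row in zip(soil_grid, other_grid):
        counts.update(zip(soil_row, other_row))
    return counts


# Hypothetical 2 x 3 grids, not real data.
soil = [
    [1001, 1001, 1002],
    [1001, 1002, 1002],
]
cover = [
    [1, 1, 1],
    [5, 5, 5],
]

table = cross_tabulate(soil, cover)

if __name__ == "__main__":
    for (soil_value, cover_value), cells in sorted(table.items()):
        print(soil_value, cover_value, cells)
```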

Page 21

Page 22

32 SHP Files with Attributes But No Location

Access Database

Page 24