greenbacker open analyticsdc
Open Source Software for Geospatial Analytics on Unstructured Big Data
Charlie Greenbacker, Principal Data Scientist
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.
Background
About Me:
Data Scientist
Natural Language Processing
Unstructured Text Information
Berico Technologies:
Veteran-owned Small Business
Big Data Analytics in the Cloud
Defense & Intel Community
The Problem: geotagging unstructured text
Growing demand for geospatial analytics
Most of human knowledge remains “trapped” in text
Existing solutions are expensive and don’t scale
Need an open source solution
The Solution: an open source geoparser
1. Data Ingestion
Input: unstructured text
2. Entity Extraction
Named entity recognition
Find location names in text
3. Entity Resolution
Match against a gazetteer
“The Springfield Problem”
4. Data Enrichment
Output: structured geo data
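The four stages above can be sketched end to end in a few lines. This is a toy illustration, not CLAVIN's implementation: the capitalized-word matcher stands in for real named entity recognition, and the `GAZETTEER` contents, the `ResolvedLocation` type, and all function names are assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Toy gazetteer: name -> (latitude, longitude). Approximate, illustrative values.
GAZETTEER = {
    "Reston": (38.9586, -77.3570),
    "Arlington": (38.8816, -77.0910),
}

@dataclass
class ResolvedLocation:
    name: str
    lat: float
    lon: float
    offset: int  # character position of the mention in the input text

def extract_candidates(text):
    """Step 2 stand-in: treat capitalized words as candidate location names."""
    return [(m.group(), m.start()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]

def resolve(candidates):
    """Step 3: keep only the candidates that appear in the gazetteer."""
    return [ResolvedLocation(name, *GAZETTEER[name], offset)
            for name, offset in candidates if name in GAZETTEER]

def geoparse(text):
    """Steps 1 to 4: unstructured text in, structured geo records out."""
    return resolve(extract_candidates(text))

records = geoparse("The meetup moved from Reston to Arlington last year.")
for record in records:
    print(record)
```

A real extractor would use a trained NER model, and a real resolver would have to choose among many gazetteer entries that share a name.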
Data Ingestion: unstructured text
photo: Flickr user NS Newsflash
Entity Extraction: named entity recognition
Entity Resolution: match against a gazetteer
Data Enrichment: structured geo data
“The Springfield Problem”
Dealing with Ambiguity
Intelligent Context-based Heuristics
First: rank by population
Next: look for other locations mentioned in the same document
“Springfield” + “Chicago” = Illinois
“Springfield” + “Boston” = Massachusetts
Soon: calculate distance based on lat/lons
Resolve alternate names to same geospatial entity
“Ivory Coast” = “Cote d’Ivoire”
Use fuzzy matching to capture misspelled place names
Including both phonetic spelling & typographical errors
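A minimal sketch of how these heuristics might combine, assuming a tiny in-memory gazetteer. The `PLACES` rows (with rough populations), the `ALIASES` table, and both function names are hypothetical. Fuzzy matching here uses the Python standard library's `difflib`, which handles typographical errors only; it is not necessarily the technique CLAVIN uses, and phonetic matching is not covered.

```python
import difflib

# Hypothetical gazetteer rows: (name, admin region, rough population).
PLACES = [
    ("Springfield", "Illinois", 115_000),
    ("Springfield", "Massachusetts", 155_000),
    ("Chicago", "Illinois", 2_700_000),
    ("Boston", "Massachusetts", 650_000),
    ("Cote d'Ivoire", "Cote d'Ivoire", 26_000_000),
]
# Alternate names mapped to the canonical gazetteer name.
ALIASES = {"Ivory Coast": "Cote d'Ivoire"}

def normalize(name):
    """Apply aliases, then fuzzy-match against known names to absorb typos."""
    name = ALIASES.get(name, name)
    known = sorted({place[0] for place in PLACES})
    close = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
    return close[0] if close else None

def resolve(mention, other_mentions):
    """Choose one gazetteer row for `mention`, using the rest of the document."""
    name = normalize(mention)
    candidates = [place for place in PLACES if place[0] == name]
    if not candidates:
        return None
    # Collect the regions of other places mentioned in the same document.
    context_regions = set()
    for other in other_mentions:
        other_name = normalize(other)
        context_regions.update(p[1] for p in PLACES if p[0] == other_name)
    # Prefer a candidate in a region seen in context; break ties by population.
    return max(candidates, key=lambda p: (p[1] in context_regions, p[2]))

print(resolve("Springfield", ["Chicago"]))  # context points to Illinois
print(resolve("Springfield", ["Boston"]))   # context points to Massachusetts
print(resolve("Sprinfield", []))            # typo, falls back to population
```

Sorting by the tuple `(context match, population)` lets document context override the population ranking while still using population as the fallback when the document offers no other clues.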
CLAVIN: an open source geoparser
System Architecture
Live Demonstration
Live Demonstration
What can I do with this data?
Map Visualizations
Hierarchical Geospatial Search
Virginia
Arlington
Reston
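One way to read this slide: a search for "Virginia" should also return documents that only mention places inside Virginia. The sketch below assumes a simple child-to-parent map; the `PARENT` table, the `DOC_PLACES` index, and the document ids are all made up for illustration.

```python
# Hypothetical containment hierarchy: child place -> enclosing place.
PARENT = {
    "Arlington": "Virginia",
    "Reston": "Virginia",
    "Virginia": "United States",
}

# Hypothetical index: document id -> places resolved in that document.
DOC_PLACES = {
    "doc-1": {"Arlington"},
    "doc-2": {"Reston"},
    "doc-3": {"Boston"},
}

def ancestors(place):
    """Yield the place itself, then each enclosing region up the hierarchy."""
    while place is not None:
        yield place
        place = PARENT.get(place)

def search(region):
    """Return ids of documents mentioning the region or any place inside it."""
    return sorted(doc for doc, places in DOC_PLACES.items()
                  if any(region in ancestors(place) for place in places))

print(search("Virginia"))  # documents tagged with Arlington or Reston
print(search("Reston"))    # only the Reston document
```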
Geospatial Bounding Box Search
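Once place names are resolved to coordinates, a bounding box search reduces to two range checks per point. A minimal sketch, assuming approximate coordinates and a hypothetical `BoundingBox` type; boxes that cross the antimeridian are not handled.

```python
from typing import NamedTuple

class BoundingBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (no antimeridian handling)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east

# Approximate coordinates for places resolved from text (illustrative values).
POINTS = {
    "Reston": (38.96, -77.36),
    "Arlington": (38.88, -77.09),
    "Boston": (42.36, -71.06),
}

# A hypothetical box drawn loosely around Northern Virginia.
northern_virginia = BoundingBox(south=38.5, west=-77.8, north=39.2, east=-76.9)
hits = sorted(name for name, (lat, lon) in POINTS.items()
              if northern_virginia.contains(lat, lon))
print(hits)
```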
Geospatial Analytics on Unstructured Text
Performance Metrics & Features
Accurate: 0.75 F-measure
Fast: 100 locations per second per CPU
Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster
Smart: natural language processing, context-based heuristics, & fuzzy matching
Easy to use: simple Java-based API
Open source: Apache License
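For a sense of scale, the cluster figure works out to about 278 documents per second overall, or roughly 31 per second per node. The snippet below does that arithmetic and also shows how a balanced F-measure (F1) is computed. The slide does not say which F-measure variant was used, so F1 is an assumption, and the precision and recall values passed to `f1` are hypothetical inputs, not reported results.

```python
# Throughput implied by "1 million documents in 1 hour on a 9-node cluster".
docs, hours, nodes = 1_000_000, 1, 9
docs_per_sec = docs / (hours * 3600)         # cluster-wide rate
docs_per_sec_per_node = docs_per_sec / nodes

def f1(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(docs_per_sec), round(docs_per_sec_per_node))
print(f1(0.75, 0.75))  # hypothetical inputs; equal precision and recall
```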
CLAVIN
“Cartographic Location And Vicinity INdexer”
clavin.bericotechnologies.com
Charlie Greenbacker
@greenbacker
meetup.com/DC-NLP
@DCNLP