greenbacker open analyticsdc

Post on 26-Jan-2015

Open Source Software for Geospatial Analytics on Unstructured Big Data

Charlie Greenbacker, Principal Data Scientist

All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.

Background

About Me:

Data Scientist

Natural Language Processing

Unstructured Text Information

Berico Technologies:

Veteran-owned Small Business

Big Data Analytics in the Cloud

Defense & Intel Community

The Problem: geotagging unstructured text

Growing demand for geospatial analytics

Most of human knowledge remains “trapped” in text

Existing solutions are expensive and don’t scale

Need an open source solution

The Solution: an open source geoparser

1. Data Ingestion

Input: unstructured text

2. Entity Extraction

Named entity recognition

Find location names in text

3. Entity Resolution

Match against a gazetteer

“The Springfield Problem”

4. Data Enrichment

Output: structured geo data
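The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not CLAVIN's implementation: the gazetteer is a hypothetical two-entry dictionary with approximate values, and a capitalized-word regex stands in for a real named entity recognizer.

```python
import re

# Hypothetical toy gazetteer (approximate, illustrative values only).
GAZETTEER = {
    "chicago": [{"name": "Chicago", "admin1": "Illinois",
                 "lat": 41.88, "lon": -87.63, "population": 2700000}],
    "boston": [{"name": "Boston", "admin1": "Massachusetts",
                "lat": 42.36, "lon": -71.06, "population": 620000}],
}

def extract_location_names(text):
    """Step 2 (stand-in for NER): capitalized words with their character offsets."""
    return [(m.group(), m.start()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]

def resolve(name):
    """Step 3: look the name up in the gazetteer; keep the most populous candidate."""
    candidates = GAZETTEER.get(name.lower(), [])
    return max(candidates, key=lambda c: c["population"]) if candidates else None

def geoparse(text):
    """Steps 1-4: unstructured text in, structured geo records out."""
    records = []
    for name, offset in extract_location_names(text):
        place = resolve(name)
        if place is not None:
            records.append({"matched_text": name, "offset": offset, **place})
    return records

records = geoparse("The shipment left Chicago on Monday and reached Boston by Friday.")
for record in records:
    print(record)
```

Words such as "Monday" pass the naive extraction step but are dropped at resolution because the gazetteer has no entry for them; a trained entity recognizer would filter them earlier.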

Data Ingestion: unstructured text

photo: Flickr user NS Newsflash

Entity Extraction: named entity recognition

Entity Resolution: match against a gazetteer

Data Enrichment: structured geo data

“The Springfield Problem”

Dealing with Ambiguity

Intelligent Context-based Heuristics

First: rank by population

Next: look for other locations mentioned in the same document

“Springfield” + “Chicago” = Illinois

“Springfield” + “Boston” = Massachusetts

Soon: calculate distance based on lat/lons

Resolve alternate names to same geospatial entity

“Ivory Coast” = “Cote d’Ivoire”

Use fuzzy matching to capture misspelled place names

Including both phonetic spelling & typographical errors
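The heuristics on this slide can be illustrated with a small sketch. Everything here is an assumption made for illustration: the gazetteer entries and populations are rough made-up values, `canonical_name` and `resolve` are hypothetical helpers, and Python's `difflib` similarity stands in for whatever fuzzy matcher the real system uses (it catches typographical errors only, not phonetic ones).

```python
import difflib

# Hypothetical toy gazetteer; populations are rough illustrative values.
PLACES = [
    {"name": "Springfield", "admin1": "Illinois", "population": 116000},
    {"name": "Springfield", "admin1": "Massachusetts", "population": 153000},
    {"name": "Springfield", "admin1": "Missouri", "population": 160000},
    {"name": "Chicago", "admin1": "Illinois", "population": 2700000},
    {"name": "Boston", "admin1": "Massachusetts", "population": 620000},
    {"name": "Cote d'Ivoire", "admin1": None, "population": 20000000},
]
ALTERNATE_NAMES = {"ivory coast": "Cote d'Ivoire"}

def canonical_name(mention):
    """Map alternate names and near-miss spellings onto a gazetteer name."""
    key = mention.strip().lower()
    if key in ALTERNATE_NAMES:
        return ALTERNATE_NAMES[key]
    lookup = {p["name"].lower(): p["name"] for p in PLACES}
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=0.8)
    return lookup[close[0]] if close else None

def resolve(mention, context_mentions=()):
    """Pick one gazetteer entry, using co-mentioned places as context."""
    name = canonical_name(mention)
    if name is None:
        return None
    candidates = [p for p in PLACES if p["name"] == name]
    # Collect the regions of unambiguous places mentioned in the same document.
    context_regions = set()
    for other in context_mentions:
        other_name = canonical_name(other)
        matches = [p for p in PLACES if p["name"] == other_name]
        if len(matches) == 1 and matches[0]["admin1"]:
            context_regions.add(matches[0]["admin1"])
    # Context agreement wins; population breaks ties (and decides with no context).
    return max(candidates,
               key=lambda p: (p["admin1"] in context_regions, p["population"]))

print(resolve("Springfield", ["Chicago"]))
print(resolve("Springfield", ["Boston"]))
print(resolve("Ivory Coast"))
```

With no context the sketch falls back to the most populous candidate, which mirrors the "first: rank by population" rule; the planned distance-based step would replace the shared-region test with a comparison of coordinates.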

CLAVIN: an open source geoparser

System Architecture

Live Demonstration

What can I do with this data?

Map Visualizations

Hierarchical Geospatial Search

Virginia

Arlington, Reston
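One way to picture hierarchical search: once each resolved place carries a link to its containing region, a query for "Virginia" can also return documents that only mention Arlington or Reston. The sketch below assumes hypothetical `PARENT` links and hypothetical geotagged documents; it is not taken from the presentation.

```python
# Hypothetical containment links: place -> containing region (None at the top).
PARENT = {
    "Arlington": "Virginia",
    "Reston": "Virginia",
    "Virginia": "United States",
    "Boston": "Massachusetts",
    "Massachusetts": "United States",
    "United States": None,
}

# Hypothetical geotagged documents: id -> places resolved in that document.
DOC_PLACES = {
    "doc1": {"Arlington"},
    "doc2": {"Reston", "Boston"},
    "doc3": {"Boston"},
}

def ancestors(place):
    """Return the place itself followed by every region that contains it."""
    chain = []
    while place is not None:
        chain.append(place)
        place = PARENT.get(place)
    return chain

def search(region):
    """Ids of documents that mention `region` or any place inside it."""
    return sorted(
        doc for doc, places in DOC_PLACES.items()
        if any(region in ancestors(p) for p in places)
    )

print(search("Virginia"))
```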

Geospatial Bounding Box Search
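A bounding box search reduces to a range check on the latitude and longitude attached to each resolved location. The following is a minimal sketch under stated assumptions: the records and coordinates are hypothetical and approximate, and the box is assumed not to cross the antimeridian.

```python
from typing import NamedTuple

class BoundingBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (no antimeridian crossing)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east

# Hypothetical geotagged records with approximate coordinates.
RECORDS = [
    {"doc": "doc1", "name": "Arlington", "lat": 38.88, "lon": -77.10},
    {"doc": "doc2", "name": "Reston", "lat": 38.96, "lon": -77.36},
    {"doc": "doc3", "name": "Boston", "lat": 42.36, "lon": -71.06},
]

def search_box(box):
    """Return ids of documents with at least one location inside the box."""
    return [r["doc"] for r in RECORDS if box.contains(r["lat"], r["lon"])]

northern_virginia = BoundingBox(south=38.5, west=-78.0, north=39.5, east=-76.5)
print(search_box(northern_virginia))
```

At scale the linear scan would normally be replaced by a spatial index, but the containment test itself stays the same.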

Geospatial Analytics on Unstructured Text

Performance Metrics & Features

Accurate: 0.75 F-measure

Fast: 100 locations per second per CPU

Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster

Smart: natural language processing, context-based heuristics, & fuzzy matching

Easy to use: simple Java-based API

Open source: Apache License
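For reference, the accuracy and scalability figures above can be unpacked with a little arithmetic. The slide does not say which F-measure variant, test corpus, or node size was used, so the balanced F1 score (the harmonic mean of precision P and recall R) is assumed here:

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
\frac{10^{6}\ \text{documents}}{3600\ \text{s}} \approx 278\ \text{documents/s}, \qquad
\frac{278\ \text{documents/s}}{9\ \text{nodes}} \approx 31\ \text{documents/s per node}
\]
```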

CLAVIN

“Cartographic Location And Vicinity INdexer”

clavin.bericotechnologies.com

Charlie Greenbacker, @greenbacker

meetup.com/DC-NLP

@DCNLP
