
Upload: duongnhan

Post on 07-Mar-2018


Page 1: OpenKE - Technology

Unstructured Text, Analytics, and Summarization

●  Document Handling: Text Extraction (Apache Tika / POI)
●  Diffbot Extraction
●  Analytics
    ○  Text Summarization
    ○  Topic Modelling
    ○  WordCloud
    ○  Publish Date
    ○  Voyant Experiment
    ○  RASOR/Olympics Indications and Warnings Demo
●  ElasticSearch and Kibana Analysis
    ○  Annotated Field Search
    ○  Visualizations (Tuning/Optimize Crawl, OpenKE Usage)
●  Future Analytic Framework

Interfaces with External Tools

●  IBM Watson Content Analytics
●  LAS Instrumentation
●  PNNL Knowledge Graph
    ○  https://github.com/streaming-graphs/NOUS
●  SAS
●  Voyant

OpenKE - Technology Open Source Knowledge Enrichment Database

Laboratory for Analytic Sciences [email protected]

Knowledge Graphs

Domain Learning and Discovery

Focused Web Crawling

●  Tunable and Configurable Web Crawler
●  Page Data Model
    ○  Provenance Capture
    ○  Metadata Capture
    ○  Policy Support Data Header
●  Multiple Source Types (Web, Forums, Search APIs)
●  Web Crawling Configuration (Depth, Breadth, Relevancy, Site)
●  Leveraging Structured Data Within Pages
●  Policy (robots.txt, data header)
●  Javascript and HTML challenges
●  Access and Auditing Concepts for Policy
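The robots.txt policy item above can be illustrated with a small sketch. It uses Python's standard `urllib.robotparser` rather than OpenKE's own (Java) crawler code, and the robots.txt content, the `openke-sketch` user agent, and the example URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 20
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(policy: RobotFileParser, url: str,
              user_agent: str = "openke-sketch") -> bool:
    """Return True when the policy allows this user agent to fetch the URL."""
    return policy.can_fetch(user_agent, url)


policy = build_policy(ROBOTS_TXT)
print(may_crawl(policy, "https://example.com/products/drone"))
print(may_crawl(policy, "https://example.com/private/report"))
print(policy.crawl_delay("openke-sketch"))  # delay in seconds, if declared
```

A crawler would typically run a check like this before queuing each URL and record the decision with the page's provenance data, so that access can be audited later.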

Current OpenKE Capabilities - Yellow Current External Capabilities - Green Future Capabilities - Blue

●  Dictionary and PESTLE Annotations
●  Regular Expressions / Relevancy Tuning
●  Domain Discovery Capability
    ○  Search APIs and “Session” Result Comparison
    ○  Indexing Session Corpus via Text Rank
    ○  Annotation Analysis
    ○  Topic Modeling (LDA)
●  Data Source Learning
    ○  Page Crawl Progression (Page History)
    ○  Dynamic Content Challenge
    ○  Source Data Freshness
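As a rough illustration of dictionary-based PESTLE annotation, the sketch below counts dictionary hits per category in a piece of text. The six categories are the standard PESTLE ones (Political, Economic, Social, Technological, Legal, Environmental); the term lists, the sample sentence, and the `annotate` function are hypothetical and not taken from OpenKE.

```python
import re
from collections import Counter

# Hypothetical term dictionary; a real deployment would load a much
# larger, curated dictionary per category.
PESTLE_TERMS = {
    "Political": ["election", "government"],
    "Economic": ["inflation", "budget"],
    "Social": ["protest", "crowd"],
    "Technological": ["drone", "quadcopter"],
    "Legal": ["regulation", "lawsuit"],
    "Environmental": ["pollution", "water quality"],
}


def annotate(text: str) -> Counter:
    """Count whole-word dictionary matches per PESTLE category."""
    counts = Counter()
    lowered = text.lower()
    for category, terms in PESTLE_TERMS.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = len(re.findall(pattern, lowered))
            if hits:
                counts[category] += hits
    return counts


sample = ("New regulation limits drone flights near the stadium "
          "after a protest over water quality.")
print(annotate(sample))
```

Per-category counts like these could be stored as annotated fields on each crawled page and then searched or charted, which is one plausible reading of how annotations feed the analysis views.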

{
  "extractArea": [
    { "selector": "#productTitle", "title": "Title" },
    { "selector": "#feature-bullets", "title": "" },
    { "selector": "#prodDetails", "title": "Details" },
    { "selector": "#productDescription", "title": "Description" },
    { "selector": "#detail-bullets", "title": "Details - Bullets" },
    { "selector": "#aplus-product-description_feature_div", "title": "Manufacturer Info" },
    { "selector": "#aplusProductDescription", "title": "Manufacturer Info" },
    { "selector": "#technical-specs_feature_div", "title": "Technical Specifications" }
  ],
  "allowSingleHopFromReferrer": true,
  "relevantRegExp": "drone|quadcopter",
  "limitToDomain": true,
  "webCrawler": {
    "politenessDelay": 20000,
    "maxDepthOfCrawling": 1
  }
}

OpenKE Web Crawling Rio Olympic I&W Tuning

OpenKE Domain Discovery Index View

OpenKE Web Crawling Job Config Example
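To make the job config more concrete, here is a minimal sketch of how a crawler might interpret three of its fields. The field names come from the config example; their semantics here (case-insensitive relevancy matching, host-equality domain limiting, depth counted from the seed at 0) are assumptions for illustration, and `is_relevant`, `should_follow`, and the URLs are hypothetical helpers, not OpenKE's implementation.

```python
import json
import re
from urllib.parse import urlparse

# Subset of the job config shown in the example.
CONFIG = json.loads("""
{
  "relevantRegExp": "drone|quadcopter",
  "limitToDomain": true,
  "webCrawler": { "politenessDelay": 20000, "maxDepthOfCrawling": 1 }
}
""")


def is_relevant(page_text: str, config: dict) -> bool:
    """Assumed rule: a page is relevant if the regex matches anywhere in it."""
    return re.search(config["relevantRegExp"], page_text, re.IGNORECASE) is not None


def should_follow(link: str, seed_url: str, depth: int, config: dict) -> bool:
    """Assumed rule: respect the depth limit, then the domain limit."""
    if depth > config["webCrawler"]["maxDepthOfCrawling"]:
        return False
    if config["limitToDomain"]:
        return urlparse(link).netloc == urlparse(seed_url).netloc
    return True


seed = "https://shop.example.com/"
print(is_relevant("Best Quadcopter deals this week", CONFIG))
print(should_follow("https://shop.example.com/item/2", seed, 1, CONFIG))
print(should_follow("https://other.example.org/item/2", seed, 1, CONFIG))
```

Tuning a crawl then amounts to adjusting the regular expression and the depth and domain limits, and observing how many fetched pages pass the relevancy check.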

●  “Holistic”
    ○  Facts
    ○  Events
    ○  Causal
    ○  Connections
    ○  Meta-data
●  Intelligence / Analytic Tasks
    ○  Discovery
    ○  Behavioral Modeling
    ○  Network Discovery
●  Analytic Support
    ○  Data / Evidence Gathering
    ○  Prediction
    ○  Model Generation: AI Planning
    ○  Ontology Development
    ○  Knowledge Base

OpenKE Technical Framework

●  Platform:
    ○  Java
    ○  Hortonworks Data Platform
●  Storage:
    ○  Accumulo
    ○  ElasticSearch
    ○  HDFS
    ○  OrientDB
    ○  PostgreSQL
●  Open Source Libraries:
    ○  Crawler4J
    ○  jsoup
    ○  Apache Tika
    ○  Apache POI
    ○  Tabula
    ○  Stanford CoreNLP
    ○  Python NLTK
    ○  Python Gensim
    ○  University of Washington: OpenIE
    ○  d3.js
●  Other:
    ○  Docker
    ○  Kibana
    ○  Apache Spark
    ○  Apache Zeppelin
    ○  Tor2Web