Unstructured Text, Analytics, and Summarization
Interfaces with External Tools
● Document Handling Text Extraction (Apache Tika / POI) ● Diffbot Extraction ● Analytics ○ Text Summarization ○ Topic Modelling ○ WordCloud ○ Publish Date ○ Voyant Experiment ○ RASOR/Olympics Indications and Warnings Demo
● ElasticSearch and Kibana Analysis ○ Annotated Field Search ○ Visualizations (Tuning/Optimize Crawl, OpenKE Usage)
● Future Analytic Framework
● IBM Watson Content Analytics ● LAS Instrumentation ● PNNL Knowledge Graph
○ https://github.com/streaming-graphs/NOUS ● SAS ● Voyant
OpenKE - Technology Open Source Knowledge Enrichment Database
Laboratory for Analytic Sciences [email protected]
Knowledge Graphs
Domain Learning and Discovery
Focused Web Crawling ● Tunable and Configurable Web Crawler ● Page Data Model
○ Provenance Capture ○ Metadata Capture ○ Policy Support Data Header
● Multiple Source Types (Web, Forums, Search APIs) ● Web Crawling Configuration (Depth, Breadth, Relevancy, Site) ● Leveraging Structured Data Within Pages ● Policy (robots.txt, data header) ● JavaScript and HTML Challenges ● Access and Auditing Concepts for Policy
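The robots.txt policy item above can be illustrated with a minimal sketch using Python's standard `urllib.robotparser`. This is not OpenKE's implementation (the deck lists Java and Crawler4J as the platform); the robots.txt content, the URLs, and the `example-crawler` user agent are all hypothetical values chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, not taken from any real site.
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 20
"""

policy = build_policy(ROBOTS)
agent = "example-crawler"  # hypothetical user-agent name

# Check two hypothetical URLs against the parsed policy.
allowed = policy.can_fetch(agent, "https://example.com/products/drone")
blocked = policy.can_fetch(agent, "https://example.com/private/report")

# Crawl-delay (in seconds) requested by the site, if any.
delay = policy.crawl_delay(agent)
```

A crawler that honors policy would consult `can_fetch` before each request and wait at least `delay` seconds between requests to the same host.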
Current OpenKE Capabilities - Yellow ● Current External Capabilities - Green ● Future Capabilities - Blue
● Dictionary and PESTLE Annotations ● Regular Expressions / Relevancy Tuning ● Domain Discovery Capability ○ Search APIs and “Session” Result Comparison ○ Indexing Session Corpus via Text Rank ○ Annotation Analysis ○ Topic Modeling (LDA)
● Data Source Learning ○ Page Crawl Progression (Page History) ○ Dynamic Content Challenge ○ Source Data Freshness
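The "Indexing Session Corpus via Text Rank" item can be sketched as follows. This is a simplified, pure-Python illustration of the general TextRank idea (a word co-occurrence graph scored with PageRank-style iteration), not the code OpenKE uses; the stopword list, window size, damping factor, and iteration count are assumptions.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption, not OpenKE's list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}


def textrank_keywords(text, window=3, damping=0.85, iterations=50):
    """Rank words by PageRank-style score over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link each word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update for a fixed number of rounds.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest-scoring words first.
    return sorted(scores, key=scores.get, reverse=True)


ranked = textrank_keywords("drone camera drone battery drone motor drone propeller")
```

In this toy input, "drone" co-occurs with every other term, so it ranks first. A real session corpus would need proper tokenization and a fuller stopword list.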
{
  "extractArea": [
    { "selector": "#productTitle", "title": "Title" },
    { "selector": "#feature-bullets", "title": "" },
    { "selector": "#prodDetails", "title": "Details" },
    { "selector": "#productDescription", "title": "Description" },
    { "selector": "#detail-bullets", "title": "Details - Bullets" },
    { "selector": "#aplus-product-description_feature_div", "title": "Manufacturer Info" },
    { "selector": "#aplusProductDescription", "title": "Manufacturer Info" },
    { "selector": "#technical-specs_feature_div", "title": "Technical Specifications" }
  ],
  "allowSingleHopFromReferrer": true,
  "relevantRegExp": "drone|quadcopter",
  "limitToDomain": true,
  "webCrawler": { "politenessDelay": 20000, "maxDepthOfCrawling": 1 }
}
OpenKE Web Crawling Rio Olympic I&W Tuning
OpenKE Domain Discovery Index View
OpenKE Web Crawling Job Config Example
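To show how a job config like the example might drive crawl decisions, here is a hedged Python sketch. The field names come from the config; their exact semantics are not stated in the deck, so the interpretations below are assumptions: `relevantRegExp` is matched case-insensitively against page text, `limitToDomain` compares host names, and `maxDepthOfCrawling` counts hops from the seed. `is_relevant` and `should_follow` are hypothetical helper names. The `extractArea` selectors and `allowSingleHopFromReferrer` are not modeled here.

```python
import json
import re
from urllib.parse import urlparse

# Subset of the job config shown in the slides.
JOB_CONFIG = json.loads("""
{
  "allowSingleHopFromReferrer": true,
  "relevantRegExp": "drone|quadcopter",
  "limitToDomain": true,
  "webCrawler": {"politenessDelay": 20000, "maxDepthOfCrawling": 1}
}
""")


def is_relevant(page_text, config):
    """Assumed semantics: a page is relevant if the regex matches anywhere."""
    return re.search(config["relevantRegExp"], page_text, re.IGNORECASE) is not None


def should_follow(link, seed_url, depth, config):
    """Assumed semantics: respect the depth limit and stay on the seed host."""
    if depth > config["webCrawler"]["maxDepthOfCrawling"]:
        return False
    if config["limitToDomain"]:
        return urlparse(link).netloc == urlparse(seed_url).netloc
    return True
```

The `politenessDelay` and `maxDepthOfCrawling` names resemble Crawler4J's crawl configuration settings, where the politeness delay is given in milliseconds; whether OpenKE passes them through unchanged is not confirmed by the slides.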
● “Holistic” ○ Facts ○ Events ○ Causal ○ Connections ○ Meta-data
● Intelligence/Analytic Tasks ○ Discovery ○ Behavioral Modeling ○ Network Discovery
● Analytic Support
○ Data/Evidence Gathering ○ Prediction ○ Model Generation: AI Planning ○ Ontology Development ○ Knowledge Base
OpenKE Technical Framework
● Platform: ○ Java ○ Hortonworks Data Platform
● Storage:
○ Accumulo ○ ElasticSearch ○ HDFS ○ OrientDB ○ PostgreSQL
● Open Source Libraries:
○ Crawler4J ○ jsoup ○ Apache Tika ○ Apache POI ○ Tabula ○ Stanford CoreNLP ○ Python NLTK ○ Python Gensim ○ University of Washington: OpenIE ○ d3.js
● Other:
○ Docker ○ Kibana ○ Apache Spark ○ Apache Zeppelin ○ Tor2Web