
Behemoth: Large scale document processing with Hadoop. Julien Nioche ([email protected]). Bristol Hadoop Workshop, 10/03/10

Upload: steve-loughran

Posted on 22-Jan-2015


DESCRIPTION

Talk about NLP processing on top of Hadoop using Behemoth, by Julien Nioche of DigitalPebble

TRANSCRIPT

  • 1. Behemoth: Large scale document processing with Hadoop. Julien Nioche, [email protected]. Bristol Hadoop Workshop, 10/03/10

2. DigitalPebble
- Bristol-based consultancy
- Specialised in Text Engineering: Natural Language Processing, Web Crawling, Information Retrieval, Data Mining
- Strong focus on Open Source & the Apache ecosystem
- User | Contributor | Committer: Lucene, SOLR, Nutch, Tika, Mahout, GATE, UIMA

3. Open Source Frameworks for NLP
- Apache UIMA: http://incubator.apache.org/uima/
- GATE: http://gate.ac.uk/
- Pipeline of annotators; stand-off annotations
- Collection of resources (tokenisers, POS taggers, ...)
- GUIs; community
- Both very popular

4. Demo: GATE

5. Web-scale document processing
- GATE: http://gatecloud.net/ - closed-source, limited access; otherwise DIY
- UIMA AS: http://incubator.apache.org/uima/doc-uimaas-what.html

6. UIMA AS
- Low-latency throughput?
- Storage & replication: DIY
- Ease of configuration? Especially when mixing different types of Service Instances
- Post-processing scalability (e.g. aggregating info across documents): DIY

7. Cometh Behemoth...
- Behemoth as depicted in the 'Dictionnaire Infernal'

8. The Master and Margarita, M. Bulgakov

9. Behemoth
- Hosted on Google Code (http://code.google.com/p/behemoth-pebble/), Apache License
- Large-scale document analysis based on Apache Hadoop
- Deploy UIMA- or GATE-based apps on a cluster
- Provides adapters for common inputs
- Encourages code reuse (sandbox)
- Runs on Hadoop 0.18 / 0.19 / 0.20

10. Typical Workflow
- Load input into HDFS
- Convert the input format into the Behemoth document format
  - Inputs supported: standard files on the local file system, WARC, Nutch segments
  - Use Apache Tika to identify the MIME type and extract text and metadata
  - Generates a SequenceFile
- Put GATE/UIMA resources on HDFS
  - Zipped GATE plugins + GAPP file
  - UIMA PEAR package

11. Typical Workflow (cont.)
- Process Behemoth docs with UIMA / GATE
  - Use the Distributed Cache to send GATE/UIMA resources to the slaves
  - Load the application and do the processing in the map step; no reducers
  - Generates another SequenceFile
- Post-process
  - Do whatever we want with the annotations, and it scales thanks to MapReduce
- Can do things differently, e.g. use reducers for post-processing, or convert the input inside the map step; illustrated by examples in the Sandbox
- Reuse modules, e.g. GATEProcessor

12. Document implementation

class Document
  String url;
  String contentType;
  String text;
  byte[] content;
  MapWritable metadata;
  List<Annotation> annotations;

class Annotation
  String type;
  long start;
  long end;
  Map<String, String> features;

13. Example of document

./hadoop fs -libjars /data/behemoth-pebble/build/behemoth-0.1-snapshot.job -text textcorpusANNIE/part*

url: file:/data/behemoth-pebble/src/test/data/docs/droitshomme.txt
contentType: text/plain
metadata: null
Content: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres (…)
Text: Préambule Considérant que la reconnaissance de la dignité inhérente à tous les membres (…)
Annotations:
Token 0 9 string=Préambule
Token 11 22 string=Considérant
Token 23 26 string=que
Token 27 29 string=la
Token 30 44 string=reconnaissance
Token 45 47 string=de

14. Advantages
- Used as a common ground between UIMA and GATE
- Deliberately simple document representation => fine for most applications; feature names and values are Strings
- Potentially not restricted to Java annotators: Hadoop Pipes for C++ annotators (needs a C++ implementation of BehemothDocument, unless we use AVRO; more on that later)
- Harnesses multiple cores / CPUs; worth using even on a single machine
- Easy configuration: a custom BehemothConfiguration (behemoth-default & behemoth-site.xml) controls which annotations to transfer from the GATE / UIMA docs and which features to keep
- Benefits from the Hadoop ecosystem
- Lets you focus on the use of annotations and custom code

15. Sandbox
- Reuse basic blocks: conversion / GATE-UIMA wrappers / ...
- Extend: add custom reducers for specific tasks
- Share: open to contributions; separate from the core

16. Quick demo
- Do we have 5 more minutes?

17. Future developments
- Cascading: Tap / Pipe / Sink
- HBase: avoid multiplying SequenceFiles
- AVRO: facilitate annotators in languages != Java
- Sandbox examples:
  - SOLR: use named entities (Person, Location, …) for faceting
  - MAHOUT: generate vectors for document clustering
- Better documentation, pretty pictures, etc.
- Needs to be used on a very large scale: anyone with a good use case?
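The document model on slide 12 can be sketched as plain Java. This is a minimal, self-contained version: the field names and types follow the slide, while the `coveredText` and `token` helpers and the `main` demo are illustrative additions, not Behemoth's actual API (the real classes are Hadoop `Writable`s).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the Behemoth document model from slide 12.
// Field names follow the slide; helpers below are illustrative only.
public class BehemothDocSketch {

    public static class Annotation {
        public String type;   // e.g. "Token"
        public long start;    // start offset into the text
        public long end;      // end offset into the text
        public Map<String, String> features = new LinkedHashMap<>();
    }

    public static class Document {
        public String url;
        public String contentType;
        public String text;
        public byte[] content;
        public Map<String, String> metadata = new LinkedHashMap<>();
        public List<Annotation> annotations = new ArrayList<>();
    }

    // Stand-off annotations point into the text instead of copying it.
    public static String coveredText(Document doc, Annotation a) {
        return doc.text.substring((int) a.start, (int) a.end);
    }

    // Convenience factory for Token annotations (illustrative).
    public static Annotation token(long start, long end) {
        Annotation a = new Annotation();
        a.type = "Token";
        a.start = start;
        a.end = end;
        return a;
    }

    // Rebuilds the beginning of the sample document from slide 13.
    public static Document sampleDoc() {
        Document d = new Document();
        d.url = "file:/data/behemoth-pebble/src/test/data/docs/droitshomme.txt";
        d.contentType = "text/plain";
        d.text = "Préambule\n\nConsidérant que la reconnaissance";
        d.annotations.add(token(0, 9));    // Préambule
        d.annotations.add(token(11, 22));  // Considérant
        d.annotations.add(token(23, 26));  // que
        return d;
    }

    public static void main(String[] args) {
        Document d = sampleDoc();
        for (Annotation a : d.annotations) {
            System.out.println(a.type + " " + a.start + " " + a.end
                    + " string=" + coveredText(d, a));
        }
    }
}
```

Running the demo prints the same `Token start end string=...` lines shown in the slide 13 dump, which is exactly what makes the stand-off representation convenient: annotations carry only offsets, so they stay cheap to serialize regardless of document size.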
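The processing pattern of slides 11 and 15 (annotate each document independently in the map step, then aggregate information across documents in a reduce step) can be sketched without a cluster. The whitespace tokenizer below is a toy stand-in for a real GATE/UIMA application, and the parallel stream stands in for Hadoop's mappers; this is a sketch of the pattern, not Behemoth code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of the map/reduce processing pattern from slides 11 and 15:
// annotate documents independently ("map"), aggregate across them
// ("reduce"). Illustrative only; not Behemoth's actual classes.
public class MapReduceSketch {

    static final Pattern WORD = Pattern.compile("\\S+");

    // "Map" step: a toy annotator that emits token strings for one document.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // "Reduce" step: count token occurrences across the whole collection.
    static Map<String, Long> countTokens(List<String> docs) {
        return docs.parallelStream()                 // one "mapper" per document
                .flatMap(d -> tokenize(d).stream())
                .collect(Collectors.groupingBy(
                        t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a b b", "b c");
        System.out.println(countTokens(docs));  // prints {a=1, b=3, c=1}
    }
}
```

Because each document is processed in isolation, the same code shape scales from a single multi-core machine (slide 14's point) to a cluster where the reduce phase is a real Hadoop reducer.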
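Slide 14 mentions that BehemothConfiguration follows Hadoop's convention of a behemoth-default.xml overridden by a behemoth-site.xml, selecting which annotations and features to transfer. A hypothetical site file might look like the sketch below; note that the property names here are invented for illustration and are not Behemoth's real configuration keys.

```xml
<?xml version="1.0"?>
<!-- Hypothetical behemoth-site.xml. The file name comes from slide 14;
     the property names below are invented for illustration and are
     NOT Behemoth's actual configuration keys. -->
<configuration>
  <property>
    <name>gate.annotations.filter</name>
    <value>Token,Sentence,Person</value>
    <description>Annotation types to copy from the GATE document
    back into the Behemoth document.</description>
  </property>
  <property>
    <name>gate.features.filter</name>
    <value>string,category</value>
    <description>Feature names to keep on each transferred annotation.</description>
  </property>
</configuration>
```

Reusing Hadoop's `Configuration` mechanism means per-job overrides work the usual way: anything set on the job's configuration object wins over the site file, which wins over the defaults.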