Content Identification Using HBase

CONTENT IDENTIFICATION USING HBASE Daniel Nelson 3-10-2014

Upload: hbasecon

Post on 10-May-2015

DESCRIPTION

Speaker: Daniel Nelson (Nielsen). The motivation behind content identification is to determine what media people are consuming (TV shows, movies, or streaming). Nielsen collects that data via its Fingerprints system, which generates significant amounts of structured data that is stored in HBase. This presentation reviews the options a developer has for querying HBase and retrieving hash data. It also covers the use of wire protocols (Protocol Buffers) and how they can improve network efficiency and throughput, especially when combined with an HBase coprocessor.

TRANSCRIPT

  • 1. CONTENT IDENTIFICATION USING HBASE Daniel Nelson 3-10-2014

  • 2. TV MEASUREMENT HISTORY: Traditional TV; Video On Demand; Time Shifted Viewing; Internet-Based Content. What are people watching? (Copyright 2013 The Nielsen Company. Confidential and proprietary.)

  • 3. WHY IDENTIFY CONTENT? Broadcasters: What are people watching? Advertisers: How can I focus my spend? Movies; Commercials; Programs; Streaming.

  • 4. WHAT ARE AUDIO FINGERPRINTS? Program Audio; Audio Fingerprints.

  • 5. BUILDING A LIBRARY OF CONTENT: Nielsen Remote Media Monitoring Sites; Audio (Movies, Commercials, Programs, Streaming); Fingerprint Generator; Fingerprints; HBase Region Servers.

  • 6. NEED FOR HBASE: Rapidly Growing Content; Monolithic Limitations; Storage Scalability; Distributed Computations.

  • 7. COLLECTING VIEWING DATA

  • 8. CONTENT IDENTIFICATION: The Matching Process identifies Content by comparing Unknown Fingerprints (Left) against Reference Fingerprints (Right). Diagram labels: Match; Unknown Fingerprints; Reference Fingerprints; ESPN; QVC; SyFy; BNN.

  • 9. FINGERPRINTS AND HBASE: Think of HBase as One Big Hash Table. Fingerprints Fit Nicely into HBase as the Key. Keys Are Not 100% Unique. Collisions Without Loss? Near Constant Time Lookups. Hash Table Load Factor will impact this (n/k).

  • 10. REFERENCE HOUSE KEEPING: Maintaining a Moving Window of Relevant Data. Broadcast Reference Fingerprints Expire after 8 Days. Managing 20+ Billion Hash Keys is no Small Task. HBase TTL (Time To Live): Places an Expiration Date on All Table Data; Hides Expired Data From Queries; Purged on Next Compaction Cycle.
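The ideas on the "Fingerprints and HBase" and "Reference House Keeping" slides can be sketched with a small in-memory model. This is not the HBase API and not Nielsen's implementation: the class `ReferenceStore`, its method names, and the fingerprint value `0xBEEF` are hypothetical, chosen only to illustrate two points from the slides. First, a non-unique key can hold several entries, so a collision loses nothing. Second, entries older than the 8-day window are hidden from reads at once and physically removed only at the next compaction.

```python
DAY = 24 * 60 * 60
TTL_SECONDS = 8 * DAY  # the 8-day reference window named on the slide


class ReferenceStore:
    """In-memory stand-in for a fingerprint table (hypothetical, not HBase)."""

    def __init__(self, ttl_seconds=TTL_SECONDS):
        self.ttl = ttl_seconds
        # One fingerprint key maps to a list of entries, so a colliding
        # key appends a value instead of overwriting the existing one.
        self.rows = {}

    def put(self, fingerprint, station, written_at):
        self.rows.setdefault(fingerprint, []).append((station, written_at))

    def get(self, fingerprint, now):
        # Expired entries are filtered out of reads even though they
        # are still stored until compact() runs.
        entries = self.rows.get(fingerprint, [])
        return [s for s, t in entries if now - t < self.ttl]

    def compact(self, now):
        # Physically drop expired entries, as a compaction cycle would.
        for key in list(self.rows):
            live = [(s, t) for s, t in self.rows[key] if now - t < self.ttl]
            if live:
                self.rows[key] = live
            else:
                del self.rows[key]

    def entry_count(self):
        return sum(len(entries) for entries in self.rows.values())


store = ReferenceStore()
store.put(0xBEEF, "ESPN", written_at=0)
store.put(0xBEEF, "QVC", written_at=5 * DAY)  # same key, second station
now = 9 * DAY
print(store.get(0xBEEF, now))  # ['QVC']: the ESPN entry is 9 days old
print(store.entry_count())     # 2: expired entry is hidden, not yet purged
store.compact(now)
print(store.entry_count())     # 1
```

In a real table the expiry would be a TTL setting on the column family rather than application code. The model only shows why reads stay correct between compactions.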
  • 11. OPTIMIZING USE OF HBASE: Network: Fastest Network; Bonded 1Gig Ethernet; Reduce Data Volume in Network Transfers. Protobufs: Google Protocol Buffers. HBase Co-Processors: Your Code Running on Region Servers; Computations, Advanced Filtering, Transformations.

  • 12. HBASE ENDPOINT CO-PROCESSORS: Push your Business Logic into the Coprocessor. Keeping Co-Processor Code Simple. Loading your Co-Processor. Diagram labels: HBase Client Application; Query/Response; HBase Region Servers; Co-Processor (x4).

  • 13. QUERYING HBASE: Table.scan(); Table.get(); Table.get(List keyList). Query using a Co-Processor: coprocessorExec( yourProtocolClass, null, null, ); coprocessorExec( yourProtocolClass, startKey, endKey, ).

  • 14. STORING MILLIONS OF FILES: Have a lot of Files to Store, use SAN/NAS/HDFS Right? SAN/NAS: Costly; More Hardware to Buy/Maintain. HDFS: Limited File Count. Sequence Files: Immutable; Inefficient to Retrieve, Delete, Modify or Add Files... There is another way.

  • 15. LEVERAGING WHAT ALREADY WORKS: Example File To Store: /foo/bar/myFile.bin. HBase Key = File Path: /foo/bar/. Qualifier = File Name: myFile.bin. Value = Your File Data (Serialized Object, Text, etc). List Files: Scan using the built-in KeyOnlyFilter. Wrap this with an API and your Application can use HBase FS. File Retrieval, Delete, Modification, Updates, Versions. Apply TTL to Purge Files.

  • 16. THANK YOU
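The file layout from the "Leveraging What Already Works" slide (row key = directory path, qualifier = file name, value = file bytes) can be sketched as a small in-memory model. It is an illustration under stated assumptions, not the HBase client API and not the speaker's code: the class `HBaseStyleFileStore`, its method names, and the second file `other.txt` are hypothetical. Listing returns names only, which is the role the slide assigns to KeyOnlyFilter in a real scan.

```python
import posixpath


class HBaseStyleFileStore:
    """In-memory model of the row-key/qualifier layout (hypothetical)."""

    def __init__(self):
        # row key (directory path) -> {qualifier (file name): value (bytes)}
        self.rows = {}

    @staticmethod
    def split(path):
        # "/foo/bar/myFile.bin" -> ("/foo/bar/", "myFile.bin")
        directory, name = posixpath.split(path)
        return directory.rstrip("/") + "/", name

    def write(self, path, data):
        row, qualifier = self.split(path)
        self.rows.setdefault(row, {})[qualifier] = data

    def read(self, path):
        row, qualifier = self.split(path)
        return self.rows[row][qualifier]  # KeyError if the file is absent

    def delete(self, path):
        row, qualifier = self.split(path)
        del self.rows[row][qualifier]

    def list_files(self, directory):
        # Return file names only and never touch the stored bytes.
        row = directory.rstrip("/") + "/"
        return sorted(self.rows.get(row, {}))


fs = HBaseStyleFileStore()
fs.write("/foo/bar/myFile.bin", b"\x00\x01")
fs.write("/foo/bar/other.txt", b"hello")
print(fs.split("/foo/bar/myFile.bin"))  # ('/foo/bar/', 'myFile.bin')
print(fs.list_files("/foo/bar/"))       # ['myFile.bin', 'other.txt']
print(fs.read("/foo/bar/other.txt"))    # b'hello'
```

Keeping every file of a directory in one row makes a listing a single-row lookup. The trade-off is that a very large directory becomes one very wide row.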