Distributed Processing of Stream Text Mining

by Wenyi Lu and Li Miao

Prepared as an assignment for CS410: Text Information Systems in Spring 2016

Uploaded by li-miao on 15-Apr-2017 (Category: Data & Analytics)

TRANSCRIPT

Page 1: Distributed Processing of Stream Text Mining


Distributed Processing of Stream Text Mining

by Wenyi Lu and Li Miao

Prepared as an assignment for CS410: Text Information Systems in Spring 2016

Page 2: Distributed Processing of Stream Text Mining

Overview

• Motivation: what is distributed stream processing? Concepts, key requirements, fault tolerance

• General processing frameworks for text mining: MapReduce, Spark, YARN, STREAM

• Distributed stream-processing engines for text mining: Haystack, Samza, Dremel, S4, Storm

• In practice: SAGE geo-distributed streaming data, stream join processing, frequent itemset mining, k-nearest neighbor queries

Page 3: Distributed Processing of Stream Text Mining

What is BIG DATA anyway?

• Big size
– Panama Papers: 2.6 TB
– Google processes more than 20 PB of data every day

• Flexible & dynamic
– Walmart collects more than 2.5 PB each day.
– Facebook collects more than 7 PB of data each day.

Page 4: Distributed Processing of Stream Text Mining

BI Intelligence projects that 34 billion devices will be connected by 2020.

Page 5: Distributed Processing of Stream Text Mining

What is distributed stream processing?

• Large amounts of data generated in external environments are pushed to servers for real-time processing.

• Examples: sensor networks, stock trading, web traffic processing, network monitoring, and so on.

• Data generated by these applications can be seen as streams of events or tuples.

• A new class of applications, distributed stream processing systems (DSPS), has emerged to facilitate such large-scale real-time data analytics.

Page 6: Distributed Processing of Stream Text Mining

Fault Tolerance

• In large-scale distributed computing, failures can happen due to node failures, network failures, software bugs, and resource limitations.

• Recovery without any information loss and recovery with minimum latency are the two extremes among recovery methods.

• Three categories of recovery methods:
– Precise recovery
– Rollback recovery
– Gap recovery
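Rollback recovery, the middle category above, can be sketched in a few lines: the operator periodically checkpoints its state, and on failure it restores the last checkpoint and replays the input seen since then. This is a minimal single-process illustration; all names here are made up, not from any real engine.

```python
# Rollback recovery sketch: checkpoint state periodically; on failure,
# restore the snapshot and replay the tuples logged since it was taken.
import copy

class CheckpointingCounter:
    def __init__(self, checkpoint_every=3):
        self.state = {"count": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.replay_log = []          # tuples seen since the last checkpoint
        self.checkpoint_every = checkpoint_every

    def process(self, item):
        self.state["count"] += 1
        self.replay_log.append(item)
        if len(self.replay_log) >= self.checkpoint_every:
            self.checkpoint = copy.deepcopy(self.state)   # durable snapshot
            self.replay_log = []

    def recover(self):
        # Restore the last snapshot, then replay the logged input.
        log = self.replay_log
        self.state = copy.deepcopy(self.checkpoint)
        self.replay_log = []
        for item in log:
            self.process(item)

op = CheckpointingCounter()
for i in range(7):
    op.process(i)
op.state["count"] = -999   # simulate state lost to a node failure
op.recover()
print(op.state["count"])   # state rebuilt from checkpoint + replay log
```

Precise recovery would additionally hide the failure from downstream operators, while gap recovery would simply skip the lost tuples.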

Page 7: Distributed Processing of Stream Text Mining

Key requirements of DSPS

• A good stream-processing framework revolves around two important attributes:
– Latency of the system
– High availability of the system

• It is important to lower latency and increase availability, but these are competing requirements: one cannot be improved without sacrificing the other.

Page 8: Distributed Processing of Stream Text Mining

General Processing Model (MapReduce)

• MapReduce, one of the most popular parallel computing paradigms for big data processing, has been widely used in both industry and academia.

• Data model
– Data must be mappable and reducible.
– Distributed file systems hold the (unstructured) data.
– This reduces the data-model overhead compared to an RDBMS.

Page 9: Distributed Processing of Stream Text Mining

Data Model & Architecture

[Figures for the data model and architecture not preserved in this transcript]

Page 10: Distributed Processing of Stream Text Mining

General Processing Model (Spark)

• Spark was created for iterative jobs, especially those that reuse a working set of data across multiple parallel operations.

• Spark is up to 20× faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40×, and can be used interactively to scan a 1 TB dataset with 5–7 s latency.

Page 11: Distributed Processing of Stream Text Mining

Data Model & Architecture

[Figures for the data model and architecture not preserved in this transcript]

Page 12: Distributed Processing of Stream Text Mining

Word Count Example
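The figure on this slide is not preserved in the transcript. As a stand-in, here is word count expressed as the classic map, shuffle, and reduce stages in plain Python; a real MapReduce or Spark job distributes exactly these stages across a cluster, but the logic is the same.

```python
# Word count as map -> shuffle -> reduce, on a single machine.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs: the "mappable" form of the data.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the grouped counts per word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be", "to stream or to batch"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["to"])   # "to" appears four times across both lines
```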

Page 13: Distributed Processing of Stream Text Mining

General Processing Model (YARN)

• MapReduce v2 (YARN)

• Reliability
• Availability
• Scalability: clusters of 10,000 nodes and 200,000 cores

• The fundamental idea of MRv2 is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical MapReduce sense or a DAG of jobs.

Page 14: Distributed Processing of Stream Text Mining

General Processing Model (STREAM)

• Long-running, continuous queries over unbounded data streams are needed by many modern applications in the big data era.

• STREAM addresses data management and query processing for this case: a general-purpose prototype Data Stream Management System that supports a large class of declarative continuous queries over both continuous streams and traditional stored data sets.

• The STREAM prototype targets environments where streams may be rapid, stream characteristics and query loads may vary over time, and system resources may be limited.

Page 15: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Aurora)

• Aurora: an early stream processing system developed by Brown University and MIT that has the modern distributed stream-processing network structure.

Page 16: Distributed Processing of Stream Text Mining

Sample in Text Mining

• Monitoring toxins in water.
• The stream data are fish behavior (e.g., breathing rate) and water quality (e.g., temperature).
• When the fish behave abnormally, an alarm is sounded.
• The water data use 1-, 2-, and 4-hour sliding windows.
• Ease of developing stream applications:
– Aurora proved very convenient for sliding-window calculation.
– Aurora's GUI proved invaluable.
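The sliding-window calculation Aurora handles can be sketched as a fixed-size window over a sensor stream: here a moving average of fish breathing rates that raises an alarm when the average drifts from a baseline. The window size, baseline, and readings are made-up illustrative values, not from the study.

```python
# Sliding-window anomaly check over a stream of sensor readings.
from collections import deque

def sliding_alarms(readings, window_size, baseline, tolerance):
    window = deque(maxlen=window_size)   # oldest reading evicted automatically
    alarms = []
    for t, rate in enumerate(readings):
        window.append(rate)
        if len(window) == window_size:
            avg = sum(window) / window_size
            if abs(avg - baseline) > tolerance:
                alarms.append(t)         # window average left the normal band
    return alarms

readings = [60, 61, 59, 60, 95, 97, 99, 60]
print(sliding_alarms(readings, window_size=3, baseline=60, tolerance=5))
```

A streaming engine keeps many such windows (here, 1-, 2-, and 4-hour spans) open concurrently and evaluates them as each tuple arrives.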

Page 17: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Storm)

• Storm: a real-time stream-processing framework built by Twitter and now available as an Apache project.
– Design goals of Storm: guaranteed message processing is a key design goal. A message cannot be lost due to node failures, and at-least-once processing is guaranteed. Robustness of the system is critical.

Page 18: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Storm)

• Storm architecture
– Nimbus is the touch point between the user and the Storm system and is the main server; it also keeps track of the workers. All coordination between Nimbus and the Supervisors is done using ZooKeeper, which Storm uses to keep track of state information. A Supervisor runs on each Storm node; it receives assignments from Nimbus and spawns workers accordingly. Each worker process runs several executors inside a JVM.
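Storm's at-least-once guarantee can be sketched as follows: the spout keeps each tuple pending until the topology acknowledges it, and re-emits any tuple whose ack never arrives. This toy simulation (all names illustrative, not Storm's API) drops the first delivery of one tuple and shows that it is still processed, just later and possibly out of order.

```python
# At-least-once delivery sketch: unacked tuples are replayed by the spout.
def run_at_least_once(tuples, fail_first_delivery_of):
    pending = list(tuples)        # emitted but not yet acked
    processed = []
    failed_once = set()
    while pending:
        tup = pending.pop(0)
        # Simulate a worker crash on the first delivery of this tuple.
        if tup == fail_first_delivery_of and tup not in failed_once:
            failed_once.add(tup)
            pending.append(tup)   # no ack arrives, so the spout replays it
            continue
        processed.append(tup)     # the bolt succeeds and acks the tuple
    return processed

print(run_at_least_once(["a", "b", "c"], fail_first_delivery_of="b"))
```

Note the consequence of at-least-once semantics: a replayed tuple may be processed twice in a real failure, so bolts with side effects must be idempotent.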

Page 19: Distributed Processing of Stream Text Mining

Sample in Text Mining

• Build a flow on Storm that makes it easy to monitor tweet sentiment in real time.

• It retrieves tweets originating from the US and continuously computes sentiment scores per state.
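The core of such a flow can be sketched as a bolt that keeps a running average sentiment per US state. The tiny lexicon and tweets below are made-up stand-ins for a real sentiment model and a live Twitter feed.

```python
# Running per-state sentiment aggregation, as a Storm bolt might keep it.
LEXICON = {"love": 1, "great": 1, "hate": -1, "awful": -1}

def score(text):
    # Sum word polarities; unknown words score 0.
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

class StateSentiment:
    def __init__(self):
        self.totals = {}          # state -> [sum_of_scores, tweet_count]

    def on_tweet(self, state, text):
        s = self.totals.setdefault(state, [0, 0])
        s[0] += score(text)
        s[1] += 1

    def average(self, state):
        total, n = self.totals[state]
        return total / n

agg = StateSentiment()
agg.on_tweet("IL", "I love this great weather")
agg.on_tweet("IL", "awful traffic today")
agg.on_tweet("CA", "hate the fog")
print(agg.average("IL"))   # (2 + -1) / 2 tweets
```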

Page 20: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (S4)

• S4 stands for Simple Scalable Streaming System; it was developed by Yahoo and donated to the Apache Software Foundation in 2011.
– S4 is a fully distributed real-time stream processing framework. It employs the Actors model for computation, and its processing model is inspired by MapReduce, using a key-based programming model as in MapReduce.
– S4 fills the gap between complex proprietary systems and batch-oriented open-source computing platforms. S4 aims to be a high-performance computing platform that hides the complexity inherent in parallel processing from the application programmer.
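S4's key-based actor model can be sketched as one processing element (PE) instance per distinct key: events are routed by key, and each PE keeps its own state, much as a reducer owns one key in MapReduce. The class names below are illustrative, not S4's actual API.

```python
# Key-based PE dispatch sketch: one stateful PE instance per key.
class WordCountPE:
    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self):
        self.count += 1

class Dispatcher:
    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}       # one PE instance per distinct key

    def route(self, key):
        # Create the PE for this key on first use, then deliver the event.
        pe = self.instances.setdefault(key, self.pe_class(key))
        pe.on_event()

d = Dispatcher(WordCountPE)
for word in "the quick the lazy the".split():
    d.route(word)
print(d.instances["the"].count)
```

In a cluster, the dispatcher hashes the key to a node, so all events for one key reach the same PE instance without shared state.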

Page 21: Distributed Processing of Stream Text Mining

Sample in Text Mining

Page 22: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Samza)

• Apache Samza: a stream-processing framework developed by LinkedIn and donated to the Apache Software Foundation.

• Samza gives developers the tools they need to process messages at a very high rate while still maintaining fault tolerance.

Page 23: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Samza)

• Samza architecture
– Samza relies on Apache YARN for distributed resource allocation and scheduling, and on Apache Kafka for distributed message brokering.
– Samza provides an API for creating and running stream tasks on a cluster managed by YARN. In this cluster Samza runs Kafka brokers, and stream tasks act as consumers and producers of Kafka streams.
– Kafka provides a distributed brokering system with persistence for message streams; it is optimized for handling large messages and provides file-system persistence for messages.

Page 24: Distributed Processing of Stream Text Mining

Sample in Text Mining

• If a recruiter searches for a "software engineer", LinkedIn may also want to show them people who describe themselves as "computer programmer", "developer", "code artist", or "ninja rockstar coder", since these all mean roughly the same thing (to a first approximation). However, the recruiter probably doesn't want to see the kind of developer who develops real estate, nor the kind of rockstar who plays guitar. This example shows that dealing with synonyms in job titles is not straightforward; it helps to apply a little standardization behind the scenes.
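The standardization step described above can be sketched as a lookup that maps synonym titles to one canonical title before indexing, so a search for "software engineer" also matches the variants while the false friends stay separate. The table is a tiny illustrative stand-in for LinkedIn's real, data-derived standardization.

```python
# Canonicalize job titles before indexing so synonyms match at query time.
CANONICAL = {
    "computer programmer": "software engineer",
    "developer": "software engineer",
    "code artist": "software engineer",
    "ninja rockstar coder": "software engineer",
    "real estate developer": "real estate developer",  # NOT a synonym
}

def standardize(title):
    # Unknown titles pass through unchanged (lowercased).
    return CANONICAL.get(title.lower(), title.lower())

profiles = ["Developer", "Code Artist", "Real Estate Developer"]
print([standardize(t) for t in profiles])
```

The hard part in practice is building the table itself (from co-occurrence and transition data), which is exactly the kind of batch-plus-streaming job Samza runs.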

Page 25: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Haystack)

• Facebook's photo storage system.

• Data scale: Facebook stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (~60 terabytes) each week, and Facebook serves over one million images per second at peak.

• At Facebook's photo scale there is a long-tail effect. Haystack needs only about 10 B to cache the metadata of each photo, while the XFS filesystem needs 536 B per file; on a cache/CDN miss, each read then costs 1 I/O instead of 2 I/Os.
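A quick back-of-the-envelope check of the figures above shows why the 10 B-per-photo index matters: it is the difference between a metadata index that fits in a few machines' RAM and one that does not.

```python
# Metadata footprint: 10 B per photo (Haystack index) vs ~536 B per file (XFS).
PHOTOS = 260e9                    # 260 billion stored images

haystack_bytes = PHOTOS * 10      # in-memory index entry per photo
xfs_bytes = PHOTOS * 536          # per-file filesystem metadata

print(round(haystack_bytes / 1e12, 1))   # TB needed by Haystack's index
print(round(xfs_bytes / 1e12, 1))        # TB needed with per-file metadata
```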

Page 26: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Haystack)

[Figures: Haystack vs. the early architecture; not preserved in this transcript]

Page 27: Distributed Processing of Stream Text Mining

Sample in Photo Retrieval

Page 28: Distributed Processing of Stream Text Mining

Distributed Stream Processing Engines (Dremel)

• Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data.

• Amazing efficiency: analyzing 1 PB in 3 seconds.

• A novel columnar storage format for nested data, with algorithms for dissecting nested records into columns and reassembling them.

• By combining multi-level execution trees with the columnar data layout, it can run aggregation queries over trillion-row tables in seconds.
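Dremel's columnar idea in miniature: store each field of the records as its own array, so an aggregation query reads only the one column it needs instead of deserializing whole records. (Dremel additionally handles nested, repeated fields with repetition and definition levels, which this sketch omits.)

```python
# "Shred" row-oriented records into one array per column, then aggregate
# by touching only the column the query needs.
records = [
    {"url": "a.com", "clicks": 3},
    {"url": "b.com", "clicks": 7},
    {"url": "c.com", "clicks": 5},
]

# Columnar layout: field name -> array of values, in record order.
columns = {field: [r[field] for r in records] for field in records[0]}

# An aggregation reads only the 'clicks' column; 'url' is never touched.
print(sum(columns["clicks"]))
```

At petabyte scale this access pattern, plus compression within each column, is what makes the second-scale scan times on the previous bullet possible.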

Page 29: Distributed Processing of Stream Text Mining

Data Model

Page 30: Distributed Processing of Stream Text Mining

Sample in Text Mining

❏ Analysis of crawled web documents
❏ Tracking install data for applications on Android Market
❏ Crash reporting for Google products
❏ OCR results from Google Books
❏ Spam analysis
❏ Debugging of map tiles on Google Maps
❏ Tablet migrations in managed Bigtable instances
❏ Results of tests run on Google's distributed build system
❏ Disk I/O statistics for hundreds of thousands of disks
❏ Resource monitoring for jobs run in Google's data centers
❏ Symbols and dependencies in Google's codebase

Page 31: Distributed Processing of Stream Text Mining

In Practice

• SAGE: geo-distributed streaming data
– Designed for public clouds as a decentralized and autonomous system, unlike Aurora (single node) or Medusa (single administrative entity).
– Key benefits: high availability, exploitation of the geographical proximity of data centers to data sources (i.e., data locality), and the ability to scale beyond the infrastructure capacity of a single site.
– Real-world scenario: the stock market, a multi-source generator of streaming data.

Page 32: Distributed Processing of Stream Text Mining

In Practice

• Scalable distributed stream join processing
– BiStream
– ContBand routing
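A scalable stream join in this spirit can be sketched as a symmetric hash join: each arriving tuple is stored in its own side's hash table and immediately probed against the opposite side's table, so join results are produced incrementally as either stream advances. This is a generic single-machine sketch, not BiStream's actual partitioning scheme.

```python
# Symmetric hash join over two interleaved streams of (side, key, value).
from collections import defaultdict

def symmetric_hash_join(events):
    # events: sequence of (side, key, value), where side is "L" or "R".
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    results = []
    for side, key, value in events:
        other = "R" if side == "L" else "L"
        for match in tables[other][key]:      # probe the opposite side
            pair = (value, match) if side == "L" else (match, value)
            results.append((key, pair))
        tables[side][key].append(value)       # then store on our own side
    return results

events = [("L", 1, "l1"), ("R", 2, "r2"), ("R", 1, "r1"), ("L", 2, "l2")]
print(symmetric_hash_join(events))
```

Distributing this join is mostly a question of how tuples are routed to machines (e.g., by key, or with replication for non-equi joins), which is what the routing schemes above address.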

Page 33: Distributed Processing of Stream Text Mining

In Practice

[Figure not preserved in this transcript]

Page 34: Distributed Processing of Stream Text Mining

In Practice

• Frequent itemset mining
– The KAAL algorithm is used for finding frequent itemsets over a data stream.
– The data structure used is a trie, with a tilted time window embedded in each node. Logarithmic merging is applied over fixed time periods: as results age, they are merged with older results, so most storage is devoted to the transactions in the most recent window while older, already-processed data takes less space.
– Step 1: Process the current batch.
– Step 2: Update the summary structure.
– Step 3: Scan the summary (DFS).
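The logarithmic merging of a tilted time window can be sketched like a binary-counter (or LSM-style) merge: each new batch count enters at span 1, and whenever two windows of equal span become adjacent they are merged, so old history is kept at exponentially coarser granularity in O(log n) slots. This sketch tracks a single count per batch, not full itemset summaries.

```python
# Tilted-time window via logarithmic merging of equal-span neighbors.
def add_batch(windows, count):
    windows.append({"count": count, "span": 1})   # newest batch, finest grain
    # Merge equal-span neighbors from newest toward oldest.
    while len(windows) >= 2 and windows[-1]["span"] == windows[-2]["span"]:
        newer = windows.pop()
        windows[-1]["count"] += newer["count"]
        windows[-1]["span"] *= 2

windows = []
for batch_count in [5, 3, 7, 2, 4, 1, 6, 8]:
    add_batch(windows, batch_count)

print([w["span"] for w in windows])   # spans grow coarser toward the past
print(sum(w["count"] for w in windows))   # no counts are lost by merging
```

After n batches the structure holds at most ⌈log₂ n⌉ + 1 windows, which is what keeps the summary small while recent data stays fine-grained.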

Page 35: Distributed Processing of Stream Text Mining

Frequent Itemset Mining

Page 36: Distributed Processing of Stream Text Mining

In Practice

• K-nearest neighbor queries
– Consider the problem of processing k-nearest neighbor (KNN) queries over large datasets where the index is jointly maintained by a set of machines in a computing cluster.
– The indexing builds several hash tables with different LSH functions to increase the probability of collision for close points.
– Evaluated on synthetic and Flickr datasets.
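The multi-table LSH idea can be sketched on a single machine: each table hashes points by which side of a few hyperplanes they fall on, so nearby points tend to share a bucket, and a query compares only against the union of its own buckets instead of the whole dataset. The planes and points below are hand-picked for illustration; a real index draws the planes at random and shards the tables across machines.

```python
# KNN via multiple LSH tables (random-hyperplane hashing for 2-D points).
TABLE_PLANES = [
    [(1.0, 0.0), (0.0, 1.0)],     # hash planes for table 0
    [(1.0, 1.0), (1.0, -1.0)],    # table 1
    [(2.0, 1.0), (-1.0, 2.0)],    # table 2
]

def lsh_hash(point, planes):
    # One bit per hyperplane: which side of the plane the point falls on.
    x, y = point
    return tuple(int(a * x + b * y >= 0) for a, b in planes)

def build_index(points):
    index = [{} for _ in TABLE_PLANES]
    for p in points:
        for planes, table in zip(TABLE_PLANES, index):
            table.setdefault(lsh_hash(p, planes), []).append(p)
    return index

def knn(query, index, k):
    candidates = set()
    for planes, table in zip(TABLE_PLANES, index):
        candidates.update(table.get(lsh_hash(query, planes), []))
    # Rank only the colliding candidates by true distance.
    return sorted(candidates,
                  key=lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2)[:k]

points = [(1.0, 1.0), (1.1, 0.9), (-5.0, -5.0), (6.0, -6.0)]
index = build_index(points)
print(knn((1.05, 1.0), index, k=2))   # the two points nearest the query
```

Using several tables with different hash functions raises the chance that a true near neighbor collides with the query in at least one of them; a distant point that happens to collide in one table is harmlessly filtered out by the final distance ranking.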

Page 37: Distributed Processing of Stream Text Mining

Reference

• [1] Evangelopoulos, Xenophon, et al. "Evaluating information retrieval using document popularity: An implementation on MapReduce." Engineering Applications of Artificial Intelligence (2016).

• [2] Lin, Jimmy, et al. "Low-latency, high-throughput access to static global resources within the Hadoop framework." University of Maryland, Tech. Rep (2009).

• [3] Lin, Jimmy. "The curse of zipf and limits to parallelization: A look at the stragglers problem in mapreduce." 7th Workshop on Large-Scale Distributed Systems for Information Retrieval. Vol. 1. 2009.

• [4] Berberich, Klaus, and Srikanta Bedathur. "Computing n-gram statistics in MapReduce." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.

Page 38: Distributed Processing of Stream Text Mining

Reference

• [5] Lin, Jimmy. "Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce." Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2009.

• [6] De Francisci Morales, Gianmarco, Aristides Gionis, and Mauro Sozio. "Social content matching in mapreduce." Proceedings of the VLDB Endowment 4.7 (2011): 460-469.

• [7] Liu, Chao, et al. "Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce." Proceedings of the 19th international conference on World wide web. ACM, 2010.

• [8] Zhai, Ke, et al. "Mr. LDA: A flexible large scale topic modeling package using variational inference in mapreduce." Proceedings of the 21st international conference on World Wide Web. ACM, 2012.

Page 39: Distributed Processing of Stream Text Mining

Reference

• [9] Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011.

• [10] Stupar, Aleksandar, Sebastian Michel, and Ralf Schenkel. "RankReduce-processing k-nearest neighbor queries on top of MapReduce." Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval. 2010.

Page 40: Distributed Processing of Stream Text Mining

Q & A

Page 41: Distributed Processing of Stream Text Mining


Thanks!