On mixing high-speed updates and in-memory queries: A big-data architecture for real-time analytics
Tao Zhong Kshitij A. Doshi
Xi Tang Ting Lou
Zhongyan Lu Hong Li
Presented by: Raminder Kaur Wayne State University
- Introduction
- Motivation and Background
- Architecture Framework
- Result
- Future work
- Conclusion
- Index terms
- References
This paper describes:
- A few key additional requirements that result from having to support in-memory processing of data while updates proceed concurrently.
- RAF (Real-time Analytics Foundation)
- Two RAF-based solutions (discussed further)
A few examples of information in motion that may just be seconds old, and
not yet well categorized or linked to other data:
- GPS-based navigation: to reduce wasted energy, accidents, delays and emergencies.
- A credit card company: to detect and intercept suspicious transactions.
- A metropolitan or regional power grid: to modulate power generation, perform load-balancing, direct repair actions, and take policy enforcement steps.
An essential feature in the above examples is the need to integrate new transactions into analysis results within a very short time—sometimes as short as a few tens of milliseconds.
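As a rough illustration of what integrating new transactions into analysis results can look like for the credit card example, the sketch below keeps a per-card running total over a sliding time window and updates it on every transaction, so a query never has to rescan history. The class name `SlidingWindowSpend`, the window length, and the alert threshold are all hypothetical; nothing here is taken from RAF itself.

```python
from collections import defaultdict, deque


class SlidingWindowSpend:
    """Keeps a per-card running total over the last `window_s` seconds."""

    def __init__(self, window_s=60.0, limit=1000.0):
        self.window_s = window_s            # hypothetical window length
        self.limit = limit                  # hypothetical alert threshold
        self._events = defaultdict(deque)   # card -> deque of (timestamp, amount)
        self._totals = defaultdict(float)   # card -> total inside the window

    def add(self, card, amount, ts):
        """Integrate one transaction; return True if the card looks suspicious."""
        events = self._events[card]
        events.append((ts, amount))
        self._totals[card] += amount
        # Evict transactions that have fallen out of the window.
        while events and events[0][0] <= ts - self.window_s:
            _, old_amount = events.popleft()
            self._totals[card] -= old_amount
        return self._totals[card] > self.limit
```

Each call to `add` does a constant amount of work per evicted or inserted transaction, which is the property that makes very short update-to-result delays plausible.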
RDDs make in-memory solutions less failure-prone. RAF extends the RDD
approach so that resiliency is blended with a few additional characteristics,
listed below:
• Efficient allocation and control of memory resources
• Resilient update of information at much finer resolution
• Flexible and highly efficient concurrency control
• Replication and partitioning of data transparent to clients
Architecturally, RAF elevates memory across an entire cluster to a first-class
storage entity and defines high-level mechanisms by which applications on RAF
can orchestrate distributed actions upon objects stored in cluster memory.
To promote responsible and transparent use of memory, RAF opts for a programming language such as C or C++ over mixed-language environments in which garbage collection is opaque.
Data has a lot of value when mined. As data continues to compound at brisk
rates, institutions need to grapple with two broad demands:
- accumulating, processing, synopsizing and utilizing information in a timely manner
- storing the refined data resiliently while keeping the data accessible at high speed
The term Big Data itself is elastic and serves well as a description of the scale
or volume of these solutions, but does not define a constraining principle for
organizing storage.
Requirements for low-latency and high-throughput analytics on datasets:
- In-memory structures and storage
- Resiliency
- Sharing data through memory
- Uniform interaction with storage
- Minimizing memory recycling
- Efficient integration of CRUD
- Synchronizing efficiently
- Searching efficiently
Translation of the eight requirements into five design elements:
- C and C++ based programming for efficient sharing of data through memory
- Resilient storing of new content
- Efficient concurrency
- Processing information in motion
- Fast, general, ad-hoc searches
This framework targets the execution of complex queries at very low latency.
The information on which queries operate may already be available on some storage medium, or it may be generated dynamically as a result of ongoing transactional activity.
RAF provides a distributed computing environment integrated with a memory-centric, distributed storage system, in which one application can pass data to another so that the data is shared in memory.
RDD: a resilient distributed dataset stores information in the memory of one or more machines in such a way that, if one or more machines fail, the RDD can be reconstructed.
Transformations: operations on an RDD that generate new datasets. RAF transformations include join, map, union, etc.
Filter: a particular type of transformation that produces a dataset whose contents satisfy a specified condition.
Delegate: a bridging module. Its purpose is to create a version of the datastore as of a particular time and present it as a memory-resident RDD.
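The relationship between these four terms can be sketched in a few lines of Python. This is a minimal illustration, not RAF's implementation (which, per the talk, is written in C/C++). The class names `Dataset` and `DataStore`, the `lose_memory` helper, and the choice of recomputation from parent datasets as the recovery mechanism are all assumptions made for the example.

```python
import copy


class Dataset:
    """A read-only dataset that remembers how it was derived."""

    def __init__(self, compute, parents=()):
        self._compute = compute          # function: list of parent row lists -> rows
        self._parents = tuple(parents)
        self._cache = None               # in-memory copy; may be lost on failure

    def rows(self):
        """Return the rows, rebuilding them from the parents if needed."""
        if self._cache is None:
            inputs = [parent.rows() for parent in self._parents]
            self._cache = list(self._compute(inputs))
        return self._cache

    def lose_memory(self):
        """Simulate a machine failure that drops the in-memory copy."""
        self._cache = None

    # Transformations: each one yields a new Dataset and leaves the source intact.
    def map(self, fn):
        return Dataset(lambda ins: (fn(r) for r in ins[0]), [self])

    def filter(self, pred):
        return Dataset(lambda ins: (r for r in ins[0] if pred(r)), [self])

    def union(self, other):
        return Dataset(lambda ins: ins[0] + ins[1], [self, other])


class DataStore:
    """A mutable key-value store that keeps receiving updates."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delegate(self):
        """Freeze the store's current contents and expose them as a Dataset."""
        snapshot = sorted(copy.deepcopy(self._data).items())
        return Dataset(lambda ins: snapshot)
```

A query built on `store.delegate()` keeps seeing the contents as of the moment the snapshot was taken, while `put` calls made afterwards only affect later snapshots.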
- Efficient storage sharing using Delegate
- Memory-centric storage operation
  - Reliability
- Data and storage types
  - Structured data
  - Storage types (replicated store and partitioned store)
- Distributed execution of analytics tasks
  - Analytics tasks interface
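The two storage types named above can be contrasted with a small sketch. It follows the usual meaning of the terms (a partitioned store places each key on one node; a replicated store keeps a full copy on every node) and is an assumption about the general idea, not a description of how RAF implements either. The class names, the CRC32-based placement, and the `fail` helper are hypothetical.

```python
import zlib


class PartitionedStore:
    """Each key lives on exactly one node, chosen by hashing the key."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # CRC32 is used only because it is stable across runs.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)


class ReplicatedStore:
    """Every live node holds a full copy; any live node can serve a read."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count

    def put(self, key, value):
        for node, up in zip(self.nodes, self.alive):
            if up:
                node[key] = value

    def get(self, key):
        for node, up in zip(self.nodes, self.alive):
            if up:
                return node.get(key)
        raise RuntimeError("no live replica")

    def fail(self, index):
        """Simulate the loss of one node."""
        self.alive[index] = False
```

Both classes expose the same `put`/`get` interface, so a client does not need to know which placement policy sits behind it.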
Unit testing:
- Scalability testing results (how well update operations scale)
- Latency relative to Hive/HDFS (how long it takes to complete a query)
NOTE: These unit test results show the advantage of the in-memory, distributed-processing-oriented design of RAF.
Solution-level implementation and testing:
- Telecommunications subscriber management
- Safe City solution
Motivated by the high degree of familiarity that many developers have with database interfaces, we are incrementally introducing SQL-92/JDBC/ODBC-like interfaces on top of RAF. A number of optimizations are also being added.
These optimizations include:
- application-requested indexing, to accelerate searches
- blending in column-store capabilities where appropriate (for example, for rarely-written data)
- compression, in order to reduce data transported between nodes
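Two of these optimizations, indexing and compression, are easy to illustrate in isolation. The sketch below builds an index over one field of a list of rows so that equality searches avoid a full scan, and compresses a batch of rows before it would be sent to another node. The function names and the JSON-plus-zlib encoding are assumptions chosen for the example; the talk does not say which index structure or compression scheme RAF uses.

```python
import json
import zlib
from collections import defaultdict


def build_index(rows, field):
    """Map each value of `field` to the positions of the rows that hold it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[field]].append(pos)
    return index


def lookup(rows, index, value):
    """Return the rows whose indexed field equals `value`, without scanning."""
    return [rows[pos] for pos in index.get(value, [])]


def pack(rows):
    """Serialize and compress a batch of rows for transport."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def unpack(blob):
    """Reverse `pack` on the receiving side."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The index trades extra memory and update work for faster searches, which is presumably why it is built only when an application requests it.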
We discussed:
- RAF, an architectural approach that meshes memory-centric, non-relational query processing for low-latency analytics with memory-centric update processing to accommodate high volumes of updates.
- Delegate, which participates as a special type of content transformer in a hierarchy of RDD transformations.
- Protocol buffers, which RAF uses to obtain data abstraction and efficient conveyance among applications, giving applications a high degree of independence in the location, representation, and transmission of data.
- A lightweight but expressive interface for RAF.
- Unit tests that show high cluster scaling capability for transactions and an order-of-magnitude latency improvement for query processing.
- Two real-world usage scenarios in which RAF is being used.
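To give a feel for why protocol buffers are a compact way to convey data between applications, the sketch below hand-codes one small piece of the protobuf wire format: base-128 varints and a varint-typed field. It deliberately does not use the protobuf library, and the function names are hypothetical. How RAF defines its own messages is not covered in the talk.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low seven bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode a varint starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number, value):
    """Encode one integer field: a tag (field number, wire type 0) then the value."""
    tag = (field_number << 3) | 0    # wire type 0 means varint
    return encode_varint(tag) + encode_varint(value)
```

For example, field number 1 holding the value 150 takes three bytes, `08 96 01`, and small values take a single byte after the tag.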
- RDD: Resilient Distributed Dataset
- RAF: Real-time Analytics Foundation
- CRUD: Create/Retrieve/Update/Delete
- HDFS: Hadoop Distributed File System
- Apache Hadoop: http://hadoop.apache.org/
- Apache HBase: http://hbase.apache.org/
- Memcached: http://www.memcached.org/
- Oracle Coherence: http://www.oracle.com/technetwork/middleware/coherence/
- H. Plattner, A. Zeier, In-Memory Data Management.
- Protobuf: http://code.google.com/p/protobuf/
- Redis: http://www.redis.io/
- SQLStream: http://www.sqlstream.com/
- Vertica: http://www.vertica.com/
- VoltDB: http://www.voltdb.com
Thanks !!!