Introduction to Apache Ignite and GridGainMandhir Gidda
ROME 24-25 MARCH 2017
© 2015 The Apache Software Foundation. Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are trademarks of The Apache Software Foundation.
In-Memory Computing Platform
Built on
MANDHIR GIDDAEMEA Solutions Architect
Agenda• GridGain & Apache Ignite Project• Ignite In-Memory Computing
Platform• Introduction to Clustering• Data Grid• Compute Grid, Service Grid &
Streaming• Hadoop & Spark Integration
• GridGain Roadmap• Q & A
© 2016 GridGain Systems, Inc.
What is Apache Ignite
High-performance distributed in-memory platform for computing and transacting on large-scale data sets in near real-time.
© 2016 GridGain Systems, Inc.
Apache Ignite Project - Recap• 2007: First version of GridGain (compute
grid)• Oct. 2014: GridGain contributes Ignite to
ASF• Aug. 2015: Ignite is the second fastest
project to graduate after Spark
• Today vs. Feb. 2016: • 28 more contributors: 88+ contributors• Huge development momentum -
Estimated 248 years of effort since the first commit in February, 2014 vs 192 year last Feb. [Openhub]
• 200k more SLOC and 2.5k more commits: 900k+ SLOC & more than 18.5k commits
February 2017
© 2016 GridGain Systems, Inc.
• What is GridGain Enterprise Edition?• Is a binary build of Apache Ignite™ created by
GridGain• Added enterprise features for enterprise
deployments• Earlier features and bug fixes by a few weeks• Heavily tested
© 2016 GridGain Systems, Inc.
Customer Use CasesAutomated Trading
SystemsReal time analysis of trading positions & market risk. High volume transactions, ultra low latencies.
Financial ServicesFraud Detection, Risk Analysis, Insurance rating and modelling.
Online & Mobile AdvertisingStream processing, geo-targeting & personalisation & segmentation
Big Data/Visual AnalyticsCustomer 360 view, real-time analysis of KPIs, up-to-the-second operational BI.
Online Gaming Real-time back-ends for mobile and massively parallel games.
SaaS Platforms & AppsHigh performance next-generation architectures for Software as a Service Application vendors.
Travel & E-CommerceHigh performance next-generation architectures for online hotel booking.
© 2016 GridGain Systems, Inc.
What In-Memory Capabilities are Supported?
‣ HPC‣ Machine learning‣ Risk analysis‣ Grid computing
‣ HA API Services‣ Scalable
Middleware
‣ Web-session clustering
‣ Distributed caching‣ In-Memory SQL
‣ Real-time Analytics‣ Big Data‣ Monitoring tools
‣ Big Data‣ Realtime Analytics‣ Batch processing
‣ Distributed In-Memory File System
‣ Node2Node & Topic-based Messaging
‣ Fault Tolerance‣ Multiple backups‣ Cluster groups‣ Auto Rebalancing
‣ Complex event processing
‣ Event driven design
‣ Distributed queues‣ Atomic variables‣ Dist. Semaphore
© 2016 GridGain Systems, Inc.
Introduction to Clustering
© 2016 GridGain Systems, Inc.
Definitions and TerminologyAn Ignite cluster is a group of Ignite nodes working together to accomplish tasks like distributed compute and caching
An Ignite node is a single Ignite process running in a JVM
Many Ignite nodes can live on one physical server or JVM
Ignite nodes can be Clients or Servers . Nodes can be named and logically grouped into Cluster Groups
Server/VM/Container Server/VM/Container
JVM
Ignite
JVM
Ignite
Ignite
JVM
Ignite…
© 2016 GridGain Systems, Inc.
Shared-nothing architecture involves multiple identical nodes forming a cluster with no single master or coordinator
All nodes in a shared-nothing cluster run the exact same processes
Nodes communicate using message passing as Peer to Peer
Unlike Master-Slave, No single point of failure or bottleneck = Linearly scalable & highly available
Operational efficiency at scale
Ignite Clustering
Server/VM/Container
JVMIgnite
Server/VM/Container
JVMIgnite
Server/VM/Container
JVMIgnite
Server/VM/Container
JVMIgnite
© 2016 GridGain Systems, Inc.
An Ignite node can be started as a client or a server.
Server nodes participate in caching and computations. Client nodes can also participate in computations.
Client nodes are used for IgniteAPI operations from the client side such as cache operations, issuing SQL, transactions, and data streaming.
Ignite Clients & Servers
Server Server
Server
Server
ClientClient Client
Data & Compute Nodes
Client Connectors
© 2016 GridGain Systems, Inc.
– Distributed Key-Value Store– Pluggable SPI Design
– E.g FailoverSPI, LoadBalancerSPI– Fault Tolerance and Scalability
– CacheMode, DiscoverySPI – SQL Queries (ANSI 99)– ACID Transactions– In-Memory Indexes– RDBMS / NoSQL Integration– 100% JCache Compliant (JSR 107)
High-level Architecture
© 2016 GridGain Systems, Inc.
In-Memory Data Grid
© 2016 GridGain Systems, Inc.
Data Grid: Cache Modes & Horizontal Scaling
Replicated Cache
All data replicated to each node
Highest availability topology
Every data update propagated to all nodes, scaling can become an issue
Best for scenarios where dataset is small, and high read activity > 80%
© 2016 GridGain Systems, Inc.
Data Grid: Cache Modes & Horizontal Scaling
Partitioned Cache
Most scalable distributed topology
Dataset is divided into partitions, and distributed equally amongst nodes (x TB)
Updates are cheap, 1x Primary partition and optionally 1+ backup partition
Reads expensive if data is not collocated
Near-Cache use is more relevant
Best for large datasets, frequent updates
© 2016 GridGain Systems, Inc.
• Vertical Scale – OnHeap, OffHeap, OffHeap_Values,
Swap Space• Avoid Java GC Collection
Pauses• Off-Heap Indexes
– OffHeap, OffHeap_Values• Full RAM Utilization• I/O + Network Efficiency
• OffHeap to disk or network -> zero copies. OS optimized
• Simple Configuration
Data Grid: Off-Heap Memory
© 2016 GridGain Systems, Inc.
Data Grid: External Persistence
• Read-through & Write-through
• Support for Write-behind• Configurable eviction
policies• LRU, FIFO, Sorted, TTL
• DB schema mapping wizard:• Generates all the XML configuration
and Java POJOs
© 2016 GridGain Systems, Inc.
Data Grid: Cache APIs• Predicate-based Scan Queries• Text Queries based on Lucene
indexing• Query configuration using annotations,
Spring XML or simple Java code• SQL Queries: ANSI-99 Compliant• Memcached (PHP, Java, Python,
Ruby)• HTTP REST API• JDBC & ODBC
© 2016 GridGain Systems, Inc.
• ANSI-99 SQL• In-Memory Indexes (On and
Off-Heap)• Automatic Group By,
Aggregations, Sorting• Cross-Cache Joins, Unions• Use local H2 engine
Data Grid: SQL Support (ANSI 99)
© 2017 GridGain Systems, Inc. GridGain Company Confidential
In-Memory SQL Grid (aka. IMDB)• JDBC and ODBC as a connection point• SQL for data access• DML for Data Modification• DDL for Cache and Index Management• Advantages
– GridGain as a distributed SQL based storage– No need to rewrite application logic– Support of variety of tools and languages– Ad-hoc queries (GeoSpatial)
• Available in Apache Ignite (open source)
Application Layer
GridGain Cluster ANSI SQL-99 DML DDL
ODBC/JDBC
ACID Transactions
Data Layer
Persistent Storage
© 2016 GridGain Systems, Inc.
Data Grid: Transactions• Fully ACID• Support for Transactional &
Atomic • Cross-cache transactions• Optimistic and Pessimistic
concurrency modes with multiple isolation levels
• Deadlock protection (Serializable)• JTA Integration
© 2016 GridGain Systems, Inc.
Distributed Java Structures
Use of java.util.Concurrent• Distributed Map (cache)• Distributed Set• Distributed Queue• CountDownLatch• AtomicLong• AtomicSequence• AtomicReference• Distributed
ExecutorService
© 2016 GridGain Systems, Inc.
Continuous Queries• Execute a query and get
notified on data changes captured in the filter
• Remote filter to evaluate event and local listener to receive notification
• Guarantees exactly once delivery of an event
© 2016 GridGain Systems, Inc.
• IgniteDataStreamer - Collocation & Indexing, millions of objects/sec
• StreamReceiver/StreamVisitor• Sliding Windows for
CEP/Continuous Query,• JMS, Kafka, MQTT, Flume,
Camel data streamer integrations
• Real-time visual analytics/BI
Streaming and CEP
© 2016 GridGain Systems, Inc.
• Create chains of event processors & transform an object through various states• Synchronous or asynchronous execution of remote filters & listeners with thread control
Payment Validator Payment Verifier Payment Processor
Ignite Cache
Event Processing using Ignite (SEDA)
© 2016 GridGain Systems, Inc.
Event Processing using Ignite
© 2016 GridGain Systems, Inc.
Data Grid: Tiered Memory & Local Store
• Tiered Memory• On-Heap -> Off-Heap -> Swap
(volatile)• Persistent On-Disk Store• Fast Recovery• Local Data Reload
• Eliminate Network and Db impacts when reloading in-memory store
© 2017 GridGain Systems, Inc. GridGain Company Confidential
Ultimate EditionGridGain Disk Storage
© 2017 GridGain Systems, Inc. GridGain Company Confidential
Current GridGain Model
Application LayerMobile Cloud/SaaS Social IoT Enterprise Applications
Data LayerRDBMS NoSQL Hadoop
In-Memory Computing Layer Data Grid Compute Grid Streaming Service Grid ANSI SQL-99 File System Spark Integration
Unified API
ACID Transactions
© 2017 GridGain Systems, Inc. GridGain Company Confidential
Current GridGain Model
Application LayerMobile Cloud/SaaS Social IoT Enterprise Applications
Data LayerRDBMS NoSQL Hadoop
In-Memory Computing Layer Data Grid Compute Grid Streaming Service Grid ANSI SQL-99 File System DML/DDL Spark Integration
Unified API
ACID Transactions
Data Layer Disk Storage
© 2017 GridGain Systems, Inc. GridGain Company Confidential
• All the features of Enterprise Edition Including
– Persistent Data Store– Ability to query memory and
disk together – Instantaneous Restarts– Full and Incremental Cluster
Snapshots– Point-In-Time Recovery
• Special license is required
Ultimate Edition: Features Set
© 2016 GridGain Systems, Inc.
• Multiple (up to 32) Data Centres
• Complex Replication Technologies
• Active-Active & Active-Passive
• Smart Conflict Resolution• Durable Persistent Queues• Automatic Throttling• GridGain Enterprise
Data Grid: DC Replication
© 2016 GridGain Systems, Inc.
In-Memory Compute Grid & the rest
© 2016 GridGain Systems, Inc.
Client-Server vs. Affinity Colocation
12
43 Data 1
Job 1
2
3Data 2
Job 2
ProcessingNode 1
ProcessingNode 2
ClientNode
Data Node 1
Data Node 2
ProcessingNode 1
1
3
4Data 1
Data 2
2
2
1. Initial Request2. Fetch data from remote
nodes3. Process entire data-set4. Return to client
1. Initial Request2. Co-locating processing with data3. Return partial result4. Reduce & return to client
© 2016 GridGain Systems, Inc.
• Direct API for MapReduce (ComputeTask, map(), result(), reduce() )
• Cron-like Task Scheduling• State Checkpoints• Load Balancing
• Round-robin• Random & weighted• Probe
• Automatic Failover - FailoverSPI• Per-node Shared State• Zero Deployment
• P2P distributed class loading• Inter-node bytecode transfer
In-Memory Compute Grid
© 2016 GridGain Systems, Inc.
• Distributed Closures• Java lambda expressions• Java Runnable(s)/Callable(s)
• ExecutorService (JDK)• Distributed, Fault Tolerant,
Load Balanced• Sync or Async• Task Deployment (GAR)
In-Memory Compute Grid
© 2016 GridGain Systems, Inc.
Messaging & Events
• Topic-based messaging• Ordered & Unordered
messages• Local & Remote message
listeners• Local & Remote event
listeners• Trigger actions from any cluster
events or operations• Query events via
IgniteEvents API• Full Events logging not best
practice
© 2016 GridGain Systems, Inc.
• Deploy arbitrary user-defined services on cluster
• Control (SLA) how many service instances are deployed on each cluster or node
• Automatically ensure deployment and fault tolerance
• Singletons on the Cluster– Cluster Singleton– Node Singleton– Key Singleton
• Guaranteed Availability– Auto Redeployment in Case of
Failures
In-Memory Service Grid
© 2016 GridGain Systems, Inc.
• Resilience - Build an in-memory resilient service layer between your client application and the grid
• Shielding- Only expose application APIs and not direct grid APIs
• Continuations - Call services internally via compute tasks to create service chains
In-Memory Service Grid
© 2016 GridGain Systems, Inc.
Hadoop & Spark Integration
© 2016 GridGain Systems, Inc.
• Ignite In-Memory File System (IGFS)
– Hadoop-compliant– Easy to Install– On-Heap and Off-Heap– Caching Layer for HDFS– Write-through and Read-through
HDFS– Any Hadoop distribution– Performance Boost
IGFS: In-Memory File System
MR HIVE PIG
In-Memory MapReduce
IGFS
HDFS
IGFS
YARN }Any Hadoop Distro
© 2016 GridGain Systems, Inc.
Hadoop Accelerator: Map Reduce
• In-Memory Performance• Zero Code Change
• Use existing MR code• Use existing Hive queries
• No Name Node• No Network Noise• In-Process Data Colocation• Eager Push Scheduling
User Application
Hadoop Client
Ignite Client
Hadoop Jobtracker
Hadoop Name Node
Hadoop Tasktracker
Hadoop Tasktracker
Ignite Data Node (IGFS)
Ignite Data Node (IGFS)
Hadoop Data Node
(HDFS)
Hadoop Data Node
(HDFS)Ignite PathHadoop Path
© 2016 GridGain Systems, Inc.
• IgniteRDD – Share RDD across jobs on
the host– Share RDD across jobs in
the application– Share RDD globally
• Faster SQL– In-Memory Indexes– SQL on top of Shared RDD
Spark & Ignite IntegrationSpark Application
Spark Worker
Spark Job
Spark Job
Ignite Node
Yarn Mesos Docker Cloud
Server
Spark Worker
Spark Job
Spark Job
Ignite Node
Server
Spark Worker
Spark Job
Spark Job
Ignite Node
Server
In-Memory Shared RDDs
© 2016 GridGain Systems, Inc.
• Docker• Amazon AWS• Azure Marketplace• Google Cloud• Apache JClouds• Mesos• YARN• Apache Karaf (OSGi)
Deployment
© 2016 GridGain Systems, Inc.
Thank You!
www.gridgain.comhttps://ignite.apache.org
@gridgain #gridgainThank you for joining us. Follow the
conversation.
Author: Mandhir Gidda