Logging with Log4j and log aggregation with Apache Flume

By Arivoli K (MDS201903), Naveen Kumar Reddy (MDS201909), Saager Babu NG (MDS201917), Suman Polley (MDS201935), Avinash Kumar (MDS201907)


Page 1:

Logging with Log4j and log aggregation with Apache flume

By

Arivoli.K,MDS201903

Naveen Kumar Reddy,MDS201909

Saager Babu NG,MDS201917

Suman Polley,MDS201935

Avinash Kumar, MDS201907

Page 2:

Why is logging necessary?

Page 3:

Here comes log4j

Page 4:

Overview

• Log4j is a reliable, fast, and flexible logging framework (APIs) written in Java, which is distributed under the Apache Software License.

• Log4j has been ported to the C, C++, C#, Perl, Python, Ruby, and Eiffel languages.

• It views the logging process in terms of priority levels.

Page 5:

Components

Log4j has three main components:

• Loggers: Responsible for capturing logging information.

• Appenders: Responsible for publishing logging information to various preferred destinations.

• Layouts: Responsible for formatting logging information in different styles.
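A minimal log4j 1.x configuration wires the three components together. The sketch below is illustrative: the appender name CONSOLE and the conversion pattern are assumptions, not part of the original slides.

```properties
# Logger: the root logger captures events at DEBUG and above,
# and sends them to the appender named CONSOLE
log4j.rootLogger=DEBUG, CONSOLE

# Appender: publishes log events to standard output
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender

# Layout: formats each event before it is written
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n
```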

Page 6:

History

• Started in early 1996 as a tracing API for the E.U. SEMPER (Secure Electronic Marketplace for Europe) project.

• After countless enhancements and several incarnations, the initial API has evolved to become log4j, a popular logging package for Java.

• The package is distributed under the Apache Software License, a full-fledged open source license certified by the Open Source Initiative.

Page 7:

Features

• It is thread-safe.

• It is optimized for speed.

• It is based on a named logger hierarchy.

• It supports multiple output appenders per logger.

• It is designed to be fail-stop; however, log4j does not guarantee that every log statement will be delivered to its destination.

• And many more!!!

Page 8:

Pros of logging

• Quick debugging

• Easy maintenance

• Structured storage of an application's runtime information.

Page 9:

Cons of logging

• Slows down an application.

• If too verbose, it can cause scrolling blindness.

To alleviate these concerns, log4j is designed to be reliable, fast, and extensible.

Page 10:

Logger object

• The top-level layer of the framework is the Logger layer, which provides the Logger object.

• The Logger object is responsible for capturing logging information; Logger objects are stored in a namespace hierarchy.

Page 11:

Appender object

• This is a lower-level layer which provides Appender objects.

• The Appender object is responsible for publishing logging information to various preferred destinations such as a database, file, console, UNIX Syslog, etc.

Page 12:

Layout object

• The Layout layer provides objects which are used to format logging information in different styles.

• It provides support to appender objects before publishing logging information.

• Layout objects play an important role in publishing logging information in a way that is human-readable and reusable.

Page 13:

Framework of log4j

Page 14:

Brief overview of the support objects

1) Level Object:

The Level object defines the granularity and priority of any logging information.

There are seven levels of logging defined within the API, in decreasing order of severity: OFF, FATAL, ERROR, WARN, INFO, DEBUG, and ALL.
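A logger's level acts as a threshold: a logger set to WARN passes WARN, ERROR, and FATAL messages and discards INFO and DEBUG. A sketch of this in configuration (the logger name com.example.dao is illustrative):

```properties
# Only WARN and above reach the root logger's appender
log4j.rootLogger=WARN, CONSOLE

# One noisy subsystem is opened up to DEBUG;
# its child loggers inherit this level through the named hierarchy
log4j.logger.com.example.dao=DEBUG

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.SimpleLayout
```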

Page 15:

Logging levels

Page 16:

Support objects

2) Filter Object:

The Filter object is used to analyze logging information and make further decisions on whether that information should be logged or not.

3) ObjectRenderer:

The ObjectRenderer object is specialized in providing a String representation of different objects passed to the logging framework.

4) LogManager:

The LogManager object manages the logging framework.

Page 17:

Syntax

Page 18:

Types of appenders

• RollingFileAppender

• SMTPAppender

• SocketAppender

• SocketHubAppender

• AppenderSkeleton

• AsyncAppender

• ConsoleAppender

Page 19:

Layout

• We have used PatternLayout with our appender.

• All the possible options are:

DateLayout, HTMLLayout, PatternLayout, SimpleLayout, XMLLayout
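PatternLayout formats events using a printf-style conversion pattern. A sketch (the file name app.log and the pattern itself are just one common choice):

```properties
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
# app.log is an illustrative file name
log4j.appender.FILE.File=app.log
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
# %d = date, %t = thread, %-5p = padded level, %c = logger name, %m = message, %n = newline
log4j.appender.FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} [%t] %-5p %c - %m%n
```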

Page 20:

Logging methods

• The Logger class provides a variety of methods to handle logging activities.

• The Logger class does not allow us to instantiate a new Logger instance, but it provides two static methods for obtaining a Logger object:

public static Logger getRootLogger();
public static Logger getLogger(String name);

For example:

static Logger log = Logger.getLogger(log4jExample.class.getName());

Page 21:

Logging methods

• Once we obtain an instance of a named logger, we can use several methods of the logger to log messages.

• The Logger class provides a printing method for each level, such as debug(), info(), warn(), error(), and fatal(), as shown in the next two slides.

Page 22:

Logging methods

Page 23:

Logging methods

Page 24:

Log4j appenders

• Flume provides two log4j appenders that can be plugged into your application:

1) One that writes data to exactly one Flume agent.

2) Another that chooses one of many configured Flume agents in a round-robin or random order.
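The first of these is org.apache.flume.clients.log4jappender.Log4jAppender. A minimal sketch of the log4j.properties entries; the hostname and port are assumptions pointing at a local Avro source:

```properties
log4j.rootLogger=INFO, flume

log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
# Host and port of the Flume agent's Avro source (values are illustrative)
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41414
```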

Page 25:

Flume appender

Page 26:

Load balancing log4j appender

• Log4j appenders can be configured to load balance between multiple Flume agents, using a round-robin or random strategy.

These appenders come bundled with Flume and require no custom code, which is another reason Flume is so popular.
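The load-balancing variant takes a space-separated list of host:port pairs, and the Selector property picks the strategy. A sketch with illustrative host names:

```properties
log4j.rootLogger=INFO, flume

log4j.appender.flume=org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
# Space-separated host:port pairs of the candidate Flume agents (illustrative)
log4j.appender.flume.Hosts=agent1:41414 agent2:41414
# ROUND_ROBIN (the default) or RANDOM
log4j.appender.flume.Selector=RANDOM
```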

Page 27:

Page 28:

Apache Flume and log aggregation

● Introduction
● Philosophy
● Apache Flume in the HDFS ecosystem
● Pros and cons

- Suman Polley, MDS201935

Page 29:

INTRODUCTION:

● Apache Flume is a distributed system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Page 30:

● The use of Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages, and pretty much any data source possible.

Page 31:

PHILOSOPHY:

● Distributed pipeline architecture.

● Pushing data into HDFS through an intermediate system is a common use case. Flume acts as a buffer between source and destination, smoothing out any inconsistency in the data flow.

● Low cost of installation, operation, and maintenance.

● Highly customizable and extendable.

Page 32:

Flume's position in the Hadoop ecosystem:

Page 33:

ARCHITECTURE:

Page 34:

PROS:

● RELIABILITY & RECOVERABILITY:

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This ensures reliable data transfer and recoverability.

Page 35:

PROS:

● DOWNSTREAMING:

There could be hundreds or even thousands of sources, while HDFS requires that exactly one client write to a file at a time. This is a problem, as it creates severe stress on the destination server.

Page 36:

PROS: Solution:

By connecting multiple agents to each other, Flume creates a data pipeline.

It is possible to scale down the number of servers that write to HDFS by adding intermediate Flume agents. This structure has its own problems:

If the nth tier carries the same volume as the (n-1)th tier, the nth tier can easily overflow, creating back-pressure in the flow.

Page 37:

Points to remember:

● Event volume is least in the outermost tier.

● Event volume increases as flow converges.

● Event volume is the greatest in the innermost tier.

Page 38:

PROS:

● HANDLING AGENT FAILURE:

If a Flume agent goes down, all the flows hosted on that agent are aborted. Once the agent is restarted, the flows will resume.

Page 39:

PROS:

A flow using the file channel or another stable channel will resume processing events where it left off.

Page 40:

CONS:

● Channels in Flume act as buffers at various hops. These buffers have a fixed capacity, and once that capacity is full they create back-pressure on earlier points in the flow. If this pressure propagates to the source of the flow, Flume becomes unavailable and may lose data.

Rule of thumb:

Channel capacity must cover the worst-case (maximum) data ingestion rate sustained over the worst-case downstream outage interval.

Page 41:

A BETTER SOLUTION: What if a single node goes down?

Adding another Flume agent balances the load and improves handling of downstream failures.

Page 42:

Summary:

● All of the above points make Apache Flume a great real-time log aggregator.

● Although it was created for log aggregation, it has since evolved to handle many types of streaming data.

● Weak ordering and proneness to duplication hinder Flume's application beyond logging (e.g., IoT, instant messaging services).

Page 43:

NAME NODE CLOGGING

What if all the web servers collecting log data tried to connect to HDFS and write at the same time?

[Diagram: many web servers writing directly to the HDFS name node, which also serves MapReduce, Spark, and Impala.]

Page 44:

FLUME: EVENT

An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination.

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.

● The payload is opaque to Flume.
● Headers are specified as an unordered collection of string key-value pairs.
● These headers help in contextual routing.

Page 45:

WHAT IS AVRO?

Avro is a row-based data format and data serialization system released by the Hadoop working group in 2009. The data schema is stored as JSON in the header, while the rest of the data is stored in binary format. One shining point of Avro is its robust support for schema evolution.

Row-based data formats are overall better for storing write-intensive data, because appending new records is easier.

An Avro Object Container File consists of:

● a file header, followed by
● one or more file data blocks.

A file header consists of:

● four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number, which is 1 (0x01) (binary values 0x4F 0x62 0x6A 0x01);
● file metadata, including the schema definition;
● the 16-byte, randomly generated sync marker for this file.

Ref: https://en.wikipedia.org/wiki/Apache_Avro

Page 46:

FLUME: CLIENT

An entity that generates events and sends them to one or more Agents.

● In our use case of log aggregation, this is done by the log4j appender, as discussed previously.
● Only properties need to be changed for this to be used (no code needed).
● Configure log4j to use the Flume log4j class.
● The aim is to decouple the application from the Flume agent and the machine on which the agent runs.

Page 47:

FLUME: AGENT

A container for hosting sources, channels, sinks, and the other components that enable the transportation of events from one place to another. It is a self-contained JVM process.

Connecting multiple Flume agents to each other establishes a flow. This flow moves data.

Each Flume agent has three components: the source, the channel, and the sink. The source is responsible for getting events into the Flume agent, while the sink is responsible for removing the events from the agent and forwarding them to the next agent in the topology, or to HDFS. The channel is a buffer that stores data that the source has received, until a sink has successfully written the data out to the next hop or the eventual destination.

Page 48:

Source

An active component that receives events from a specialized location or mechanism and places them on one or more channels.

Sources are active components that receive data from some other application that is producing the data. There are sources that produce data themselves, though such sources are mostly used for testing purposes. Sources can listen to one or more network ports to receive data or can read data from the local file system. Each source must be connected to at least one channel. A source can write to several channels, replicating the events to all or some of the channels, based on some criteria.

Flume's primary RPC source is the Avro Source. The Avro Source is designed to be a highly scalable RPC server that accepts data into a Flume agent, from another Flume agent's Avro Sink or from a client application that uses Flume's SDK to send data. The Avro Source together with the Avro Sink represents Flume's internal communication mechanism (between Flume agents). With the scalability of the Avro Source combined with the channels that act as a buffer, Flume agents can handle significant load spikes.

Page 49:

Source configuration

A source named usingFlumeSource of type avro, running in an agent started with the name usingFlume, would be configured with a file that looks like:
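A sketch of that configuration; the bind address, port, and channel name are assumptions:

```properties
usingFlume.sources = usingFlumeSource
usingFlume.channels = memChannel
usingFlume.channels.memChannel.type = memory

usingFlume.sources.usingFlumeSource.type = avro
# Bind address and port are illustrative
usingFlume.sources.usingFlumeSource.bind = 0.0.0.0
usingFlume.sources.usingFlumeSource.port = 4353
# Every source must be connected to at least one channel
usingFlume.sources.usingFlumeSource.channels = memChannel
```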

Page 50:

CHANNELS

Channels are passive components that buffer data that has been received by the agent, but not yet written out to another agent or to a storage system.

Channels behave like queues, with sources writing to them and sinks reading from them. Multiple sources can write to the same channel safely, and multiple sinks can read from the same channel. Each sink, though, can read from exactly one channel.

Having a channel operating as a buffer between sources and sinks has several advantages. It allows them to operate at different rates, since the writes happen at the tail of the buffer and reads happen off the head. This also allows the Flume agents to handle "peak hour" loads from the sources, even if the sinks are unable to drain the channels immediately.

Channels are transactional in nature. Each write to a channel and each read from a channel happens within the context of a transaction. Only once a write transaction is committed will the events from that transaction be readable by any sinks. Also, if a sink has successfully taken an event, the event is not available for other sinks to take until and unless the sink rolls back the transaction.

Page 51:

TRANSACTIONS IN CHANNEL

ALL OR NOTHING


Page 52:

CHANNEL CONFIGURATION

The following configuration shows a Memory Channel configured to hold up to 100,000 events, with each transaction being able to hold up to 1,000 events. The total memory occupied by all events in the channel can be a maximum of approximately 5 GB. Of this 5 GB, the channel considers 10% to be reserved for event headers (as defined by the byteCapacityBufferPercentage parameter), making 4.5 GB available for event bodies:
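A sketch of that configuration; the agent and channel names are illustrative, and byteCapacity is specified in bytes:

```properties
agent.channels = memChannel
agent.channels.memChannel.type = memory
# Up to 100,000 events in the channel, 1,000 per transaction
agent.channels.memChannel.capacity = 100000
agent.channels.memChannel.transactionCapacity = 1000
# ~5 GB total; 10% of it reserved for event headers
agent.channels.memChannel.byteCapacity = 5368709120
agent.channels.memChannel.byteCapacityBufferPercentage = 10
```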

Page 53:

SINK

The component that removes data from a Flume agent and writes it to another agent or a data store or some other system is called a sink. To facilitate this process, Flume allows the user to configure the sink, which could be one of the sinks that come bundled with Flume.

Sinks are the components in a Flume agent that keep draining the channel, so that the sources can continue receiving events and writing to the channel. Sinks continuously poll the channel for events and remove them in batches. These batches of events are either written out to a storage or indexing system, or sent to another Flume agent.

Sinks are fully transactional. Each sink starts a transaction with the channel before removing events in batches from it. Once the batch of events is successfully written out to storage or to the next Flume agent, the sink commits the transaction with the channel. Once the transaction is committed, the channel removes the events from its own internal buffers.

Page 54:

FLOW EXAMPLE 1

Multi-agent flow

Where the data goes through multiple agents or hops: the sink of the previous agent and the source of the current hop need to be of avro type, with the sink pointing to the hostname (or IP address) and port of the source.
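A sketch of the two sides of such a hop; the agent names, hostname, port, and channel names are assumptions:

```properties
# Agent "first": avro sink pointing at the next hop
first.sinks = avroSink
first.sinks.avroSink.type = avro
first.sinks.avroSink.hostname = collector.example.com
first.sinks.avroSink.port = 4545
first.sinks.avroSink.channel = memChannel

# Agent "second": avro source listening on the same port
second.sources = avroSource
second.sources.avroSource.type = avro
second.sources.avroSource.bind = 0.0.0.0
second.sources.avroSource.port = 4545
second.sources.avroSource.channels = memChannel
```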

Page 55:

FLOW EXAMPLE 2

Typical scenario of log aggregation: multiple web servers producing log data. Logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.

A number of first-tier agents are configured with an avro sink, all pointing to the avro source of a single agent. This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination.

Page 56:

FLOW EXAMPLE 3

Multiplexing flow

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

The example above shows a source from agent "foo" fanning out the flow to three different channels. This fan out can be replicating or multiplexing.

Page 57:

FLUME PROPERTIES


Page 58:

REFERENCES

1. Using Flume by Hari Shreedharan
2. Flume User Guide: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
3. Real-Time Data Ingest into Hadoop using Flume: https://www.youtube.com/watch?v=SR__hkCINNc