Ingesting Log4J Application Logs Into HDFS Using Flume In Realtime



Log file analysis is a popular use case in the Big Data world. Log files contain evidence of the historical events that an application witnessed in its execution environment. Monitoring applications aim to find traces of the actual events that occurred during program execution. Many analysis use cases are possible, from simply counting occurrences of some event to more specific processing.

Log file analysis has two main parts, where “log processing” is preceded by “log ingestion”. During log ingestion, log files are read from the source and copied/transferred into HDFS. Log processing later reads and processes that data from HDFS.

The easiest way to read logs from a log-producing machine is to configure an agent on the machine itself. Agent configuration usually offers a lot of flexibility. The agent reads the log files and transfers the logs to a log server or a log aggregator service that runs on a non-production box, from where the data eventually reaches HDFS.

Production Systems

Agents usually also allow configuring a local queue, either in-memory or on-disk, to store log events read from the log file. Since writing to the destination is often slower than reading from the source, either because the log server runs on a non-production box or because of network delays, this queue is very useful.

However, memory consumption (in the case of an in-memory queue) or disk I/O (in the case of a file-based queue) can quickly grow, taking a toll on the main server application’s resource availability. Low resources on a production system can make the main application slow or, in the worst case, even unstable. For these reasons, system administrators prefer not to run agents on production boxes.

Syslog

Syslog is a logging standard (RFC 5424) that has been used in computing for decades. In the Unix world, syslog is part of the common installation, and most logging software and solutions support it. Other operating systems, and even networking devices like routers and switches, also support the syslog standard. The standard promotes the idea of keeping message originators separate from message servers and message analyzers. Syslog also defines a network protocol for transferring log messages over both TCP and UDP. Use of syslog is pretty standard in the industry.
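
To make the wire format concrete, here is a minimal sketch (not part of the original setup) of sending a single BSD-style syslog message over UDP from plain Java; the hostname, port and message text are made-up values used only for illustration.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class SyslogUdpSketch {
    public static void main(String[] args) throws Exception {
        // Priority = facility * 8 + severity; LOCAL7 (23) and INFO (6) give 190.
        String message = "<190>Jun 10 10:15:32 apphost MyApp: user 42 logged in";
        byte[] payload = message.getBytes(StandardCharsets.US_ASCII);

        // 514 is the conventional syslog port; any UDP listener, such as a
        // Flume syslogudp source, could be substituted here.
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("syslog.example.com"), 514));
        }
    }
}

Any receiver that understands this simple text format, Flume included, can act as the message server.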

Log4J 1.2

Apache Log4J is a popular logging utility that is widely used by Java developers. The Log4J design consists of loggers, appenders and layouts. Log4J allows an application to emit log messages through Logger objects. The log messages can be passed to one or more output objects called appenders, and there can be as many appenders as you wish. Every appender writes the message to its defined output type, e.g. file appenders write the message to file(s), console appenders send the output to the console, socket appenders write the messages to sockets, etc. Before outputting the message through an appender, Log4J can use a layout to format the output. Layouts use a formatting string that defines the pattern in which each message should be written. Log4J also defines log levels and, based on these levels, allows or blocks a message from reaching the output. A real power of Log4J is that the logging configuration can be changed on the fly using an XML or a properties file.
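
As a quick illustration of these pieces (a minimal sketch, not taken from the post’s setup, with arbitrary class and message names), the snippet below wires a Logger to a ConsoleAppender with a PatternLayout programmatically and logs at two different levels.

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class Log4jBasics {
    public static void main(String[] args) {
        // A layout defines how each message is formatted ...
        PatternLayout layout = new PatternLayout("%5p %d{HH:mm:ss,SSS} %m%n");

        // ... an appender defines where formatted messages go (here, the console) ...
        ConsoleAppender console = new ConsoleAppender(layout);

        // ... and a logger is what the application code talks to.
        Logger logger = Logger.getLogger(Log4jBasics.class);
        logger.addAppender(console);
        logger.setLevel(Level.INFO);

        logger.info("Order processed");          // passes the INFO threshold
        logger.debug("Cache miss for key x");    // suppressed by the INFO threshold
    }
}

The same wiring is normally done through a properties or XML file, as in the configuration shown later.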

Agent-Less Ingestion

Log4J and Flume both support features that can be used to set up agent-less data ingestion. Log4J offers appenders like SocketAppender, SyslogAppender, etc. that can write logs over the wire. On the other hand, Flume offers data sources like netcat, syslogtcp, syslogudp, etc. that are capable of reading data from the wire. Using this combination, we can set up a data ingestion channel without installing any agent software on the production box. In the following example configuration, we set up Log4J and Flume on separate hosts and transmit logs over the network.

Configuration

This is our Log4J-side configuration; it simply adds an appender that writes the log data over the network in the standard syslog format.

Log4J

log4j.rootLogger=INFO,stdout,R,slg
log4j.appender.slg=org.apache.log4j.net.SyslogAppender
log4j.appender.slg.syslogHost=172.18.227.212:1976
log4j.appender.slg.Header=true
log4j.appender.slg.Facility=LOCAL7
log4j.appender.slg.layout=org.apache.log4j.PatternLayout
log4j.appender.slg.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n
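
To exercise this configuration, a small driver along the following lines can be used. The class name, the path to the properties file and the message loop are assumptions made for illustration, and the stdout and R appenders referenced on the root logger are assumed to be defined elsewhere in the same properties file.

import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;

public class SyslogEmitter {
    private static final Logger LOG = Logger.getLogger(SyslogEmitter.class);

    public static void main(String[] args) throws InterruptedException {
        // Load the log4j.properties shown above (the path is an assumption).
        PropertyConfigurator.configure("log4j.properties");

        // Emit a few messages; the slg appender ships each one to the
        // Flume syslogudp source configured in the next section.
        for (int i = 0; i < 10; i++) {
            LOG.info("Test event number " + i);
            Thread.sleep(1000);
        }
    }
}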

The Flume agent is configured on a separate machine across the network. This is our Flume-side configuration; it reads data from the network using a syslog data source over UDP. The agent also configures an HDFS sink that will write to HDFS. The source and the sink are joined using a queue, called a channel. This channel can hold up to 500 events, which helps aggregate data before writing to HDFS.

Flume

sgnfa.sources = slg
sgnfa.channels = memChannel
sgnfa.sinks = hdfsSink

#-----------------------------------------------------------
# Define a channel for use, i.e. a queue between the source and the sink
#-----------------------------------------------------------
sgnfa.channels.memChannel.type = memory
sgnfa.channels.memChannel.capacity = 500
#-----------------------------------------------------------



#-----------------------------------------------------------
# Source Definition: SYSLOG
#-----------------------------------------------------------
sgnfa.sources.slg.type = syslogudp
sgnfa.sources.slg.host = 172.18.227.212
sgnfa.sources.slg.port = 1976
sgnfa.sources.slg.selector.type = replicating
# sgnfa.sources.slg.interceptors = dataSelector
# sgnfa.sources.slg.interceptors.dataSelector.type = regex_filter
# sgnfa.sources.slg.interceptors.dataSelector.regex = ".*"

## Should the regex identify lines to exclude instead of lines to include?
# sgnfa.sources.slg.interceptors.dataSelector.excludeEvents = false
#-----------------------------------------------------------



#-----------------------------------------------------------
# Sink Definition: HDFS
#-----------------------------------------------------------
sgnfa.sinks.hdfsSink.type = hdfs
sgnfa.sinks.hdfsSink.hdfs.path = hdfs://namenode.hadoop.domain.com:9000/user/vpathak/FlmData

# No time-based roll over
sgnfa.sinks.hdfsSink.hdfs.rollInterval = 0

# Size-based roll over: 10 GB max file size
sgnfa.sinks.hdfsSink.hdfs.rollSize = 10737418240

# No event-count-based roll over
sgnfa.sinks.hdfsSink.hdfs.rollCount = 0

# Close inactive files after 1 hour
sgnfa.sinks.hdfsSink.hdfs.idleTimeout = 3600

# Compress the output file
# sgnfa.sinks.hdfsSink.hdfs.codeC = lzo

sgnfa.sinks.hdfsSink.hdfs.fileType = DataStream
#-----------------------------------------------------------


#-----------------------------------------------------------
# Connecting the source and sink using the channel ...
#-----------------------------------------------------------
sgnfa.sinks.hdfsSink.channel = memChannel
sgnfa.sources.slg.channels = memChannel
#-----------------------------------------------------------

There is, however, also a downside to using network-based appenders. High-bandwidth connectivity, e.g. a LAN, is expected between the appender and the destination server, since Log4J 1.2 appenders are synchronous in nature. A slow network will block the Logger until the appender returns after writing the data.
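
If you must stay on 1.2, one commonly used mitigation (not part of the configuration above, and sketched here only as an illustration with assumed host, port and buffer values) is to wrap the network appender in Log4J 1.2’s AsyncAppender, which hands events to a background dispatcher thread. The properties format cannot express this wrapping directly, so the sketch attaches it programmatically.

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.net.SyslogAppender;

public class AsyncSyslogSetup {
    public static void main(String[] args) {
        // The network appender that would otherwise block the calling thread.
        SyslogAppender syslog = new SyslogAppender(
                new PatternLayout("%5p %d{HH:mm:ss,SSS} %m%n"),
                "172.18.227.212:1976",
                SyslogAppender.LOG_LOCAL7);

        // AsyncAppender buffers events and forwards them on its own thread,
        // so a slow network no longer stalls the application's Logger calls.
        AsyncAppender async = new AsyncAppender();
        async.setBufferSize(512);
        async.addAppender(syslog);

        Logger.getRootLogger().addAppender(async);
        Logger.getRootLogger().info("Logging is now decoupled from the network");
    }
}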

The configuration above uses Log4J 1.2. The newer 2.x makeover of Log4J supports all the features of 1.2 and adds a dedicated appender for Apache Flume. Log4J 2.x also introduces asynchronous appenders that will not block the Logger even over a slow connection between the appender and Flume.

Happy Data Ingestion …


2 responses to “Ingesting Log4J Application Logs Into HDFS Using Flume In Realtime”

  1. Hi Vipul,

    Did you install Hadoop on your remote flume agent machine in order to talk to HDFS? The Flume documentation simply says that “Using this sink requires hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster.”

    thanks,

    Jan


    1. Hello Jan,

      You don’t need to install Hadoop on the remote machine. Flume just needs to connect to HDFS to pump data. Flume has that functionality in the HDFS sink JAR file that comes with the Flume installation.


