# Chukwa usage with Hadoop 2 (2.4.0) and HBase 0.98

If you want to use Chukwa with Hadoop 2.4.0 and HBase 0.98, you need to exchange some jars, because the trunk version is built against older versions of Hadoop and HBase (at the time of writing).

To be sure that I had the newest version, I cloned the sources from the GitHub repository

https://github.com/apache/chukwa

using the git command

In the local chukwa directory, the build is easily done with Maven:

After a successful Maven build, the fresh Chukwa tarball should be available in the target subdirectory.

Now we can untar the tarball in a place of our choice and configure Chukwa as described in the tutorial. The HBase schema file has some small typos, which I corrected in a fork.

https://github.com/woopi/chukwa/commit/c962186667700e04fe0ed5c040322ee77f3042da

Before starting Chukwa we have to exchange some jars. First, disable the jars that are meant for older Hadoop / HBase versions.

Then copy the following jars from the Hadoop and/or HBase distribution into the same Chukwa directory.

Now I started a Chukwa collector on my namenode and an agent on a Linux box. After a few minutes the first log data can be seen in HBase:

If writing to HDFS, you will see something like:

# Apache Hadoop 2.4.0 binary build for 64-bit Debian Linux

At the time of writing this blog, the Apache Hadoop builds from the Apache website are compiled for 32-bit platforms. If you use them on a 64-bit platform (with 64-bit Java), you might get error messages regarding the native shared library

I have now compiled Hadoop 2.4.0 on Debian 7.4 (wheezy) 64-bit.

With this version I no longer get this error message.

The error message is a bit misleading, but some other blogs, as well as a small hint on the Native library documentation page, pointed me towards compiling Hadoop from the source tarball. The build was straightforward with the command

mvn package -Pdist,native -DskipTests=true -Dtar

after installing the necessary packages that were missing on my box:

For those who want to start directly with a tarball compiled for 64-bit platforms, here is my Hadoop 2.4.0 bundle:

# Lambda Architecture vs. Java 8 Lambdas

Currently, the buzzword Lambda appears often in IT publications. Some readers may get confused and put an article in the wrong context, because there are two completely different topics that share the same name: Lambda (λ)

# 1. Java SE 8 Lambdas

Lambdas, introduced with the new Java SE 8, target functional programming aspects.

http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html

With the introduction of lambdas, Java now has the ability to pass functions as method parameters. Languages commonly classed as functional programming languages include Haskell, Clojure and Lisp.

Here is an example taken from the Java SE documentation page:

Call a method and pass functions as parameters:
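A minimal sketch of this idea, written in the style of the tutorial rather than copied from it. The `Person` class, the sample roster, and the age range are hypothetical and exist only for illustration; `Predicate`, `Function` and `Consumer` are the standard interfaces from `java.util.function`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class LambdaSketch {

    // Hypothetical data class, used only for this illustration.
    static class Person {
        final String name;
        final int age;
        final String email;

        Person(String name, int age, String email) {
            this.name = name;
            this.age = age;
            this.email = email;
        }
    }

    // A method that receives three functions as parameters:
    // a filter, a mapping, and an action applied to each mapped value.
    static <X, Y> void processElements(Iterable<X> source,
                                       Predicate<X> tester,
                                       Function<X, Y> mapper,
                                       Consumer<Y> block) {
        for (X element : source) {
            if (tester.test(element)) {
                block.accept(mapper.apply(element));
            }
        }
    }

    public static void main(String[] args) {
        List<Person> roster = Arrays.asList(
                new Person("Ann", 20, "ann@example.org"),
                new Person("Bob", 40, "bob@example.org"));

        // Each argument after the roster is a lambda expression.
        processElements(
                roster,
                p -> p.age >= 18 && p.age <= 25,
                p -> p.email,
                email -> System.out.println(email));
    }
}
```

Before Java SE 8, each of the three arguments would have needed an anonymous inner class; the lambdas express the same behavior in one line each.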

# 2. Lambda Architecture (Big Data)

The Lambda Architecture was introduced by Nathan Marz. Roughly speaking, it describes a design in the big data area that combines a batch layer of data processing (with higher latency) with a speed layer that uses stream processing tools like Storm to produce real-time views. The user gets data combined from both layers and can therefore see current data in real time. The batch layer typically uses Hadoop’s MapReduce to aggregate/transform the raw input data into batch views, which can be served with low latency by a database such as ElephantDB. The layer between the batch layer and the user is called the serving layer.

http://www.manning.com/marz/BDmeapch1.pdf
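To make the combination of the two layers more concrete, here is a minimal sketch of a query that merges a batch view with a real-time view. It assumes a page-view counter as the use case and models both views as plain in-memory maps; in a real system the batch view would live in a serving-layer database and the real-time view would be maintained by the speed layer. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class LambdaArchitectureSketch {

    // Query side: combine the precomputed batch view with the speed
    // layer's view of the data that arrived after the last batch run.
    static long pageViews(Map<String, Long> batchView,
                          Map<String, Long> realtimeView,
                          String url) {
        return batchView.getOrDefault(url, 0L)
                + realtimeView.getOrDefault(url, 0L);
    }

    public static void main(String[] args) {
        // Assumed result of the last (high latency) batch run.
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("/index.html", 1000L);

        // Assumed counts for events that arrived since that run.
        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("/index.html", 7L);
        realtimeView.put("/new.html", 3L);

        System.out.println(pageViews(batchView, realtimeView, "/index.html")); // prints 1007
        System.out.println(pageViews(batchView, realtimeView, "/new.html"));   // prints 3
    }
}
```

Once the next batch run has absorbed the recent events, the corresponding entries can be dropped from the real-time view, so the speed layer only ever has to cover a short time window.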

# Hadoop WebHDFS usage in combination with HAR (hadoop archive) from PHP

Hadoop is not very efficient at storing a large number of small files.

If you nevertheless need to access a lot of small files, you can use HAR to work around the small-files problem. Here are the steps I took to get access to the files from PHP.

# 2. Create a hadoop archive

Let Hadoop create one single HAR file named hadoop-api.har from the whole directory /tmp/har/ (HDFS)

This command will start a MapReduce job that creates the HAR file without deleting the original small files in HDFS.

# 5. HAR file content (HDFS filesystem)

The HAR file is not really a single file but a directory containing a couple of files. Let’s have a look at it with raw hdfs commands

# 6. Structure of the HAR file index (how to access the individual files)

hdfs dfs -cat /har/hadoop-api.har/_index
...snip...

--------
...snip...


Each row of the index file contains several space-separated columns:

• The URL-encoded path in the HAR file
• The type of the entry, i.e. file or dir
• The HDFS file which contains the content
• The offset in the HDFS content file
• The length of the file
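The column layout above is enough to locate a single file inside the archive. The following is a minimal sketch, in Java rather than PHP, that parses one index row and builds a WebHDFS `OPEN` request for exactly the byte range of that file. The sample row, the host name and the port are assumptions for illustration; `offset` and `length` are parameters of the WebHDFS `OPEN` operation, and only the five columns listed above are read.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class HarIndexSketch {

    // One row of the _index file, reduced to the five columns listed above.
    static class Entry {
        final String path;      // decoded path inside the archive
        final String type;      // "file" or "dir"
        final String partFile;  // HDFS file that holds the content
        final long offset;      // start of the content in the part file
        final long length;      // number of bytes

        Entry(String path, String type, String partFile, long offset, long length) {
            this.path = path;
            this.type = type;
            this.partFile = partFile;
            this.offset = offset;
            this.length = length;
        }
    }

    // Parse a space-separated index row; extra trailing columns are ignored.
    static Entry parse(String row) {
        String[] cols = row.trim().split(" ");
        if (cols.length < 5) {
            throw new IllegalArgumentException("expected at least 5 columns: " + row);
        }
        try {
            return new Entry(URLDecoder.decode(cols[0], "UTF-8"), cols[1], cols[2],
                    Long.parseLong(cols[3]), Long.parseLong(cols[4]));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    // Build a WebHDFS URL that reads only this entry's bytes from the part file.
    static String webHdfsUrl(String nameNode, String harPath, Entry entry) {
        return "http://" + nameNode + "/webhdfs/v1" + harPath + "/" + entry.partFile
                + "?op=OPEN&offset=" + entry.offset + "&length=" + entry.length;
    }

    public static void main(String[] args) {
        // Hypothetical index row, not taken from a real archive.
        Entry entry = parse("%2Fdocs%2Findex.html file part-0 100 2048");
        // Host and port are placeholders for the NameNode's HTTP address.
        System.out.println(entry.path);
        System.out.println(webHdfsUrl("namenode.example.org:50070", "/har/hadoop-api.har", entry));
    }
}
```

The same steps carry over to PHP directly: split the row on spaces, `urldecode` the first column, and request the resulting URL with an HTTP client that follows the redirect from the NameNode to a DataNode.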