# Chukwa usage with Hadoop 2 (2.4.0) and HBase 0.98

If you want to use chukwa with hadoop 2.4.0 and hbase 0.98 you need to exchange some jars, because the trunk version uses older versions of hadoop and hbase (at the time of writing).

To be sure that I had the newest version, I cloned the sources with git from the GitHub repository:

https://github.com/apache/chukwa

In the local chukwa directory the build is easily done using Maven:

After the successful maven build the fresh tarball of chukwa should be available in the target subdirectory.

Now we can untar the tarball in a place of our choice and configure chukwa as described in the tutorial. The HBase schema file has some small typos, which I corrected in a fork:

https://github.com/woopi/chukwa/commit/c962186667700e04fe0ed5c040322ee77f3042da

Before starting chukwa we have to exchange some jars. First, disable the jars that are meant for older Hadoop / HBase versions.

Copy the following jars from the hadoop and/or hbase distribution to the same chukwa directory.

Now I started a chukwa collector on my namenode and an agent on a Linux box. After some minutes the first log data can be seen in HBase:

If writing to hdfs you will see something like:

# 0. Motivation

In several situations it is important to have a deeper understanding of the framework to write mapreduce programs that are more complex than the typical WordCount examples.

None of the publications and books that I have seen explain those details, so I had to work them out by myself. I used the Hadoop 2.3.0 source code for my analysis.

My goal was to get answers to the following questions:

1. Where does the serialization take place?
2. Where are the key/value pairs generated from the HDFS file data?
3. Which class makes the loop over all key/value pairs?
4. How is Avro serialization detected/chosen and why are the special Avro map reduce classes necessary?

# 1. Method call stack

To get started with the analysis, I wrote an example MapReduce application and I included the following dummy Exception to get the stack trace:
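
The technique can be tried without a cluster. The following is only a minimal sketch, not the snippet from my MapReduce job: the class `StackTraceProbe`, the interface `Task` and the methods `frameworkRun`, `frameworkDispatch` and `userCode` are hypothetical stand-ins for the framework and the custom mapper. It throws a dummy exception in the innermost method, catches it right away and lists the frames of the call stack.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; nothing here is part of the Hadoop API.
class StackTraceProbe {

    /** Plays the role of the user code that a framework calls back. */
    interface Task {
        List<String> execute();
    }

    // Two "framework" layers, so that the trace has something to show.
    static List<String> frameworkRun(Task task) {
        return frameworkDispatch(task);
    }

    static List<String> frameworkDispatch(Task task) {
        return task.execute();
    }

    /** Throws a dummy exception and turns its stack trace into readable rows. */
    static List<String> userCode() {
        List<String> frames = new ArrayList<>();
        try {
            throw new RuntimeException("dummy exception, only thrown to expose the call stack");
        } catch (RuntimeException e) {
            // getStackTrace() lists the innermost frame first.
            for (StackTraceElement el : e.getStackTrace()) {
                frames.add(el.getClassName() + " " + el.getMethodName() + " " + el.getLineNumber());
            }
        }
        return frames;
    }

    static List<String> probe() {
        return frameworkRun(new Task() {
            @Override
            public List<String> execute() {
                return userCode();
            }
        });
    }

    public static void main(String[] args) {
        int no = 1;
        for (String frame : probe()) {
            System.out.println(no++ + " " + frame);
        }
    }
}
```

In a real job the exception would be thrown inside the custom map method instead, and the trace would show up in the task logs.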

The resulting stack trace yields the following information:

| No | Class name | Method name | Source code row |
|----|------------|-------------|-----------------|
| 3 | javax.security.auth.Subject | doAs | 415 |
| 4 | java.security.AccessController | doPrivileged | - |
| 5 | org.apache.hadoop.mapred.YarnChild$2 | run | 168 |
| 6 | org.apache.hadoop.mapred.MapTask | run | 340 |
| 7 | org.apache.hadoop.mapred.MapTask | runNewMapper | 764 |
| 8 | org.apache.hadoop.mapreduce.Mapper | run | 145 |
| 9 | MyMapper | map | ... |

# 2. Remarks to the different steps

## 2.1 – 2.5 Dispatching Job context

In steps 1 – 5 the job context is dispatched, together with some more general initialization steps that are not so relevant for my actual interests in this blog.

## 2.6 and 2.7 runNewMapper

Step 6 just calls step 7, using the umbilical that connects the application master of the job with the worker nodes that execute the tasks (map and reduce).

First the mapper and input format classes are created using the information from the job/task context.

Next the RecordReader is created using the now available input format instance.

NewTrackingRecordReader is an inner class defined in MapTask. Its constructor creates an instance of a mapreduce.RecordReader. This RecordReader is used to get the key/value pairs.

In my example the RecordReader was the one from the SequenceFile:

As we can see, the method nextKeyValue internally calls the corresponding methods of SequenceFile.Reader, which is associated with the SequenceFileInputFormat I used in my job definition.

The keyDeserializer is created in the init method of SequenceFile.Reader using the SerializationFactory. How this works is explained next.

The serializer and deserializer are retrieved from the SerializationFactory, which makes use of the Serializations registered in the JobContext / Configuration.

First the constructor fills a List with all registered Serializations.

The serialization used to serialize and deserialize an instance of a class is determined in the method getSerialization, which loops over all serializations and checks whether the accept method indicates that the serialization can be used for the given class.

## 2.8 Mapper delegation to custom map method

The custom Mapper extends the framework Mapper.
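
The relationship between the framework Mapper and a custom subclass can be tried out without Hadoop on the classpath. The following is only a minimal sketch under my own assumptions: `MiniContext`, `MiniMapper` and `UpperCaseMapper` are hypothetical stand-ins, not Hadoop classes. Only the shape of the `run` loop (call `nextKeyValue()`, then hand the current key and value to `map`) follows what the framework Mapper does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in classes; nothing here is part of the Hadoop API.
class MapperLoopSketch {

    /** Iterates over the input records and collects whatever the mapper writes. */
    static class MiniContext<K, V> {
        private final Iterator<Map.Entry<K, V>> records;
        private Map.Entry<K, V> current;
        final List<String> output = new ArrayList<>();

        MiniContext(List<Map.Entry<K, V>> input) {
            this.records = input.iterator();
        }

        /** Advances to the next record; returns false when the input is exhausted. */
        boolean nextKeyValue() {
            if (!records.hasNext()) {
                return false;
            }
            current = records.next();
            return true;
        }

        K getCurrentKey() { return current.getKey(); }

        V getCurrentValue() { return current.getValue(); }

        void write(Object key, Object value) { output.add(key + "=" + value); }
    }

    /** "Framework" mapper: owns the loop and offers an identity map as default. */
    static class MiniMapper<K, V> {
        protected void map(K key, V value, MiniContext<K, V> context) {
            context.write(key, value); // identity: pass the record through unchanged
        }

        public void run(MiniContext<K, V> context) {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
    }

    /** "Custom" mapper: overrides map only, the loop stays in the base class. */
    static class UpperCaseMapper extends MiniMapper<Integer, String> {
        @Override
        protected void map(Integer key, String value, MiniContext<Integer, String> context) {
            context.write(key, value.toUpperCase());
        }
    }

    static List<String> runWith(MiniMapper<Integer, String> mapper) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        input.add(new SimpleEntry<>(1, "alpha"));
        input.add(new SimpleEntry<>(2, "beta"));
        MiniContext<Integer, String> context = new MiniContext<>(input);
        mapper.run(context);
        return context.output;
    }

    public static void main(String[] args) {
        System.out.println(runWith(new MiniMapper<>()));     // [1=alpha, 2=beta]
        System.out.println(runWith(new UpperCaseMapper()));  // [1=ALPHA, 2=BETA]
    }
}
```

The point of the sketch is that the subclass never touches the loop: it only replaces the per-record method that the inherited `run` calls.
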
The framework calls a default identity implementation of the map method, which is typically overridden by the custom Mapper implementation.

## 2.9 Custom Mapper implementation

In this way the run method delegates the request to the custom map implementation.

# 3. Answers to the questions

## 3.1 Where does the serialization take place?

The configured InputFormat creates a RecordReader/Reader, which determines the serialization that is usable for the configured key / value class. In my example I used a SequenceFile as input, where the key and value classes of the input are read from the SequenceFile header block (before the real data are loaded). From the RecordReader perspective, deserialized key or value instances are loaded from the HDFS file (split).

## 3.2 Where are the key/value pairs generated from the HDFS file data?

The inner class Reader of SequenceFile calls the methods of the Serialization implementation to deserialize the data and delivers the object instances to its callers.

## 3.3 Which class makes the loop over all key/value pairs?

The run method of class org.apache.hadoop.mapreduce.Mapper loops as long as context.nextKeyValue() returns true. With the retrieved key and value, the Mapper calls the custom implementation of the map method.

## 3.4 How is Avro serialization detected/chosen and why are the special Avro map reduce classes necessary?

The class SerializationFactory checks which Serialization accepts the requested class. If the Avro Serialization accepts this class (that is, if the class is an Avro class), this Serialization is used for serializing and deserializing the data.

The second part of this question is not answered by my analysis (yet). To be continued.

# Hadoop / YARN / Tez

# 0. Introductory Remarks

Here is my extended version of the installation / deploy documentation of Tez. The original version can be seen on the Tez webpage.

I have used Hadoop 2.3.0 to try the Tez installation, so I assume here that your Hadoop cluster is already up and running.
Look at the Hadoop webpage for installation instructions if you do not already have a running Hadoop cluster. My version of Hadoop is running on Debian Wheezy.

# 1. Download the Tez tarball

You can get the source tarball of Tez from its Apache incubator page.

# 2. Compile the sources using maven

Before starting Maven to build Tez, we have to change the pom.xml file in the root directory of the unpacked Tez directory. In my case I changed the Hadoop version property hadoop.version from 2.2.0 to 2.3.0.

After starting Maven with the following command, usually some missing jars are downloaded from the central repository.

# 3. Adapting configuration files

## 3.1 Add TEZ variables to .bashrc

As the hadoop user, edit ~/.bashrc and append the following lines:

## 3.2 Add Tez jars to the hadoop environment shell script

Edit ${HADOOP_INSTALL}/etc/hadoop/hadoop-env.sh

Right after the lines

## 3.6 Start or restart yarn

    ${HADOOP_INSTALL}/sbin/stop-yarn.sh
    ${HADOOP_INSTALL}/sbin/start-yarn.sh

# 4. Start a tez example

Several examples are included in the Tez tarball. Let’s start the orderedwordcount example.

Create an HDFS directory /tests/tez-examples/in and copy some text files of your choice to it.

Then execute the command:

The output should look similar to mine here:
