Inverse DNS resolution with HBase and Phoenix

Recently I did some quick experiments with HBase (1.1.2) and Phoenix (4.6.0). As a reasonably large dataset I used the DNS data available in the dnscensus 2013 archive. This dataset contains DNS data collected by extracting port 53 traffic from network captures at some central internet routers. Here are some examples:

To get the inverse map of all collected type A DNS requests, the Phoenix table should have a key starting with the IP address. Don't confuse a reverse DNS lookup (PTR) with the inverse of a type A DNS request!

The table create script for Phoenix looks like this:
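The original create script is not reproduced here; the following is a minimal sketch in which the table and column names (DNSA, IP, DNSNAME) are my assumptions, not the original schema. Phoenix's psql.py utility can execute such a SQL file:

```shell
# Sketch only: DNSA, IP and DNSNAME are assumed names, not the original schema.
cat > dns_create.sql <<'EOF'
CREATE TABLE IF NOT EXISTS DNSA (
  IP      VARCHAR NOT NULL,
  DNSNAME VARCHAR NOT NULL
  CONSTRAINT pk PRIMARY KEY (IP, DNSNAME));
EOF
psql.py localhost dns_create.sql
```

Because the row key starts with the IP address, a lookup by IP becomes a range scan over the leading part of the key instead of a full table scan.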

To get the original data into the right order I used sed:

A bit tricky were the line endings of the original file data, which is why I used .* to get rid of all kinds of whitespace after the IP addresses. The tail command just removes the header line, which is not needed when the data gets imported into Phoenix.
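The exact invocation is not shown above; the following sketch illustrates the idea, assuming a hypothetical name-TAB-address input layout with CR line endings (the real dnscensus layout may differ):

```shell
# Hypothetical input: a header line, then "<dnsname>\t<ip>" with CR line endings.
printf 'query\tanswer\r\nspiegel.de\t62.138.116.7\r\nheise.de\t193.99.144.80\r\n' > dnscensus.txt

# Swap the columns so the IP comes first; the trailing .* eats any kind of
# whitespace (including the CR) after the IP. tail -n +2 drops the header line.
sed -E 's/^([^\t]+)\t([0-9.]+).*$/\2,\1/' dnscensus.txt | tail -n +2 > dnscensus.csv
cat dnscensus.csv
# 62.138.116.7,spiegel.de
# 193.99.144.80,heise.de
```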

How many rows do we have?
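A simple way to count the rows before the import (assuming the prepared CSV from the previous step is called dnscensus.csv):

```shell
# Count the prepared CSV rows; the file name is an assumption.
wc -l < dnscensus.csv
```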

Loading the CSV data into Phoenix is quite simple with the bulk load utility:
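The concrete command is not shown above; it presumably looked similar to this sketch, where the jar name, table name and paths are my assumptions:

```shell
# Phoenix MapReduce bulk loader; adjust jar name, table, input path and ZK quorum.
hadoop jar phoenix-4.6.0-HBase-1.1-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table DNSA \
  --input /tmp/dnscensus.csv \
  --zookeeper localhost
```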


The result is a list of DNS names which resolve to the IP of Google's public DNS server!
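The query itself is not reproduced here; with the hypothetical schema sketched above it would be something like:

```shell
# Assumed table/column names; the WHERE condition hits the leading
# row-key column, so Phoenix executes a fast range scan.
cat > inverse.sql <<'EOF'
SELECT DNSNAME FROM DNSA WHERE IP = '8.8.8.8';
EOF
psql.py localhost inverse.sql
```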

The select on a table with almost 1 billion rows took less than a tenth of a second! Nice 🙂

Hadoop 2.6.0 API Javadoc (with private classes/methods)

If you have ever written a MapReduce program using SequenceFile, you may have asked yourself which methods exactly are available on the class created by the createWriter() methods. The standard javadoc on Apache's website doesn't show this information:


This is why I regenerated the javadoc from the sources with private classes / methods enabled.


You can view the javadocs online or you can download the tarball:

Hadoop 2.6.0 javadoc with private classes/methods (view online)

Hadoop 2.6.0 javadoc with private classes/methods (tarball)

How to get SAP data into the hadoop data lake

It is possible to get a lot of information using SAP standard remote-enabled function modules, but there are some issues that can make the data retrieval process painful. If you take a closer look at the RFC_READ_TABLE function module, for example, you will find that even some simply structured standard tables like EDIDC (IDoc control records) cannot be read with all columns at once. Furthermore, if the given delimiter appears in the flat-file-structured (CSV-like) response, you run into trouble parsing the results. If you get a little deeper into real-life use cases, you recognize that WHERE conditions, table joins, etc. are needed. And if you want to retrieve a lot of datasets from a table, the rowcount/skip-rows mechanism is not very performant and, even worse, can lead to inconsistent results if changes are made between two requests.

Because of those issues we started to develop some function modules and classes which remove some of the described pain points. Starting with an improved version of RFC_READ_TABLE, we quickly found other needs and added further function modules that offer great features for retrieving interesting information from SAP systems (not only ERP).

Less visible to the user, but even more noticeable when working with real-life payloads, are the performance-related optimizations. For example, there are ways to get rid of the additional transfer volume caused by the SOAP encoding. MTOM is not (easily?) usable in SAP services [maybe if you use additional systems like XI, which also has a Java stack 🙂]. Another internal improvement is the XML serialization implementation, which serializes even data containing hex 0 characters. This is important because the XML serialization available in the SAP standard, the asXML transformation, produces dumps if such hex 0 characters appear in the datasets to be serialized.

All improvements developed on the ABAP side, as well as some useful Java classes which help you get the data into Hadoop, are available in what we call the Woopi SAP Adapter. Woopi is a registered trademark of Dr. Menzel.

1. Reasons to bring data into hadoop

  1. Hadoop storage is cheap, typical enterprise SAN storage is expensive (Cost)
  2. Hadoop storage is easy to extend, without a relevant upper limit and without system or structural changes. You just add more nodes and rebalance the cluster. (Volume)
  3. Processing of data can be 1000 times faster than with traditional systems (Velocity)
  4. You can process unstructured or semi-structured data too. (Variety)
  5. Calculations that you never thought about become possible now (new possibilities)
  6. Don't waste capacity of your productive system; let Hadoop do all calculations which need not be done in SAP. (Cost + performance)

If we look at the Google Trends chart for Hadoop, we see that it is continuously attracting more interest.



2. The Woopi SAP Adapter Modules

2.1. RFC function module to read table data

Our function module Z_WP_RFC_READ_TABLE has the following features:

  1. multiple joined SAP tables
  2. selected columns, ALL columns per table, or ALL columns globally
  3. XML serialization of the results
  4. zip compression option
  5. WHERE conditions
  6. limitation of the number of results
  7. metadata export from the data dictionary
  8. reading data from cluster tables
  9. asynchronous mode to export huge amounts of data per query

2.2. RFC function module to read ABAP source code

If you extract the changed ABAP sources, e.g. daily, you can build an ABAP source repository with history. This way you can check which source code was active at any timestamp you want to examine. This can be very useful if you have to analyze errors, or in cases of partly finished transport orders. With this data basis you can also detect code inconsistencies caused by transport orders that arrived in the productive system in the wrong order.

The software only delivers source code that has changed since your last data retrieval run. This way the source code extraction is not very time consuming and you can repeat it quite often.


2.3. RFC function module to read JOBLOGs, JOBSPOOLs, JOBINFOs

Job information can be listed using transaction code SM37. In productive systems you typically have a lot of jobs in this list every day. Often you have to check the job logs or job step spools for errors to guarantee that there are no disruptions or errors in your business processes. The standard function modules which can read job log and job step spool information


are both not remote-enabled. Our SAP adapter module for job logs delivers all job information since the last data retrieval at once, based on information in SAP tables.


2.4. RFC function module to read BDOCs as XML

BDocs are the XML documents that are exchanged between ERP and CRM systems to keep their business content synchronized. With the SAP adapter module for BDocs it is quite easy to continuously pipe the BDoc messages as XML documents into a Hadoop SequenceFile.

Because all information about your business content is stored in one of the BDocs, you can use Hadoop to parse a huge number of BDocs in a short time. This way you can run analyses that answer business questions which come up long after the data was exchanged between the systems. Imagine, for example, a question about deleted sales orders: without custom changes in the SAP system it is very hard to get this information out of it. You can also answer questions like in which system certain changes were made.

2.5. RFC function module to read IDOCs as XML

IDocs are important because they are one of the most frequently used ways to exchange data between SAP systems, or between an SAP system and external systems. The most important use cases are sales order import and the exchange of warehouse transport orders with external warehouse systems.

The arguments for saving IDoc copies in Hadoop are similar to those for BDocs, even if the business questions are different.

2.6. ADK archiving with a copy into Hadoop

With minimal changes you can reuse your archiving reports to write an additional copy of all data into Hadoop.

All necessary code is available in the class Z_WP_ARCHIVING_HDP, which has methods that are identical in name and parameters to the ADK function modules. So your migration steps are:

  1. Create an instance of Z_WP_ARCHIVING_HDP at the beginning of your write report
  2. Replace all ADK function module calls with the corresponding method calls of the wrapper class instance created in step 1.


The following ADK function modules need to be replaced by the corresponding methods:



2.7. XI Java Stack Database Table Reader

The SAP Java stacks have their own database, which is not directly accessible from the ABAP stack. Most XI adapters are implemented in Java and executed on the Java stack. Communication with other SAP systems takes place on the ABAP stack, so messages have to be exchanged between the two stacks internally. The exchanged messages are stored in the database, some of them in the Java database, others in the ABAP database. You can monitor the messages using the runtime workbench. In case of (communication) problems between the two stacks, you have to check the messages in both stacks. If messages are missing or lost in one of the stacks, it is not easy to find them.

Because of this situation we developed an EAR for the Java stack application server which offers the possibility to access the Java database generically (similar to the ABAP read table function module) over HTTP.

2.7.1. XI Java Stack Message Reader

As a special use case we can fetch complete XI messages (metadata and payload) directly from the Java stack database and write them, as usual, into SequenceFiles in Hadoop.

3. Woopi SAP Adapter on the Java / hadoop side

On the Java side, the Woopi SAP Adapter has the necessary client classes to pull the data from the SAP systems. The current state is persisted locally, so the software knows which data to fetch next time.

There are MapReduce jobs which transform data retrieved from SAP tables, automatically generate (if necessary) Hive or Phoenix tables in Hadoop, and afterwards import the data into those Hadoop databases.

If you write MapReduce jobs that go beyond the HelloWorld level, you quickly need to handle serialized data (Avro, JSON, etc.) using special comparators. Creating grouping comparators that bring together data from different input sources, with corresponding tags in their keys, can be very time consuming and error prone. We have created some helper classes that make it possible to generate comparators by just defining the necessary fields in the (extended) Avro schema.

4. Use Cases

4.1. Example use cases

  1. statistical monitoring
    warnings when measured actual data deviates by more than the statistical standard deviation
  2. long-term KPI calculations
  3. end-to-end (multi-system) business process monitoring, e.g. sales orders external system -> sales orders ESB system -> sales orders CRM -> sales orders ERP -> transport orders (warehouse) -> delivery confirmation (UPS, DHL, Transoflex etc.)
  4. customer segmentation
  5. order recommendations
  6. warehouse: stock-keeping optimizations
  7. searches / analysis over CDHDR/CDPOS

4.2. Let us know about your use cases

Many other use cases are imaginable. If you have interesting use cases from your business, we would appreciate hearing about them. Just send us an e-mail.

Why you should monitor the SAP XI/PI adapter engine

1. Architecture overview

XI versions older than 7.3 are dual-stack systems. The adapter engine with most of the adapters (JMS, HTTP, SOAP, …) runs on the Java stack, while the adapters connecting XI to SAP systems run on the ABAP stack.

Besides the adapter engine, you also find the SLD (System Landscape Directory) and the RWB (Runtime Workbench) on the Java side.

The Java and ABAP stacks communicate with each other using technical users. If those users get locked (for example by security features which lock users when unauthorized access with a wrong password is detected), the internal communication between the Java and ABAP stacks is broken.

The Java and ABAP stacks each have their own separate database, and it is not possible to access the Java database directly from ABAP, or vice versa.

2. Implications of broken internal communication

If the internal communication was broken for a certain time, some of the messages may not have been processed correctly. I detected such a situation, for example, for inbound wholesaler orders.

If you are processing a lot of messages, it is not easy to find out which orders were not processed correctly. Depending on the adapter used, you may have to look at every message payload via the runtime workbench.

3. Automatic Adapter Engine Message Monitoring

To make the monitoring easier, I implemented a Java Enterprise application (EAR/WAR) for XI which has access to the Java database where the runtime workbench messages are stored. By providing a web-based technical interface to the Java database tables, all messages can be continuously downloaded and processed in an external monitoring tool, so that unprocessed order numbers can be displayed as alerts to the responsible persons. The relevant database table is XI_AF_MSG.

Attention: before you can access the Java database of your XI system, you have to define a datasource in the Visual Administrator.

If you want to centralize your data in a data lake, it is now easy to write the data flow to Hadoop, where XI data and ERP/CRM/BW data can be stored so that you can analyze them together!



chukwa usage with hadoop 2 (2.4.0) and hbase 0.98

If you want to use Chukwa with Hadoop 2.4.0 and HBase 0.98, you need to exchange some jars, because the trunk version (at the time of writing) builds against older versions of Hadoop and HBase.

To be sure I had the newest version, I cloned the sources from the GitHub repository

using the git command
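The clone itself is a one-liner (repository location as mirrored on GitHub at the time of writing):

```shell
git clone https://github.com/apache/chukwa.git
cd chukwa
```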

In the local chukwa directory the build is easily done using Maven:
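A plain package build, skipping the tests, is enough to produce the tarball:

```shell
mvn clean package -DskipTests
```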

After a successful Maven build, the fresh Chukwa tarball should be available in the target subdirectory.

Now we can untar the tarball in a place of our choice and configure Chukwa as described in the tutorial. The HBase schema file has some small typos, which I corrected in a fork.

Before starting Chukwa we have to exchange some jars. First, disable the jars that were built for older Hadoop/HBase versions.

Copy the following jars from the hadoop and/or hbase distribution to the same chukwa directory.
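The concrete jar names depend on your distributions; the procedure is roughly as follows (all file names here are examples, not the original list):

```shell
cd $CHUKWA_HOME/share/chukwa/lib

# disable the bundled jars built against older Hadoop/HBase versions
mv hadoop-common-2.2.0.jar hadoop-common-2.2.0.jar.off    # example name

# copy the matching jars from the installed Hadoop/HBase distributions
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-2.4.0.jar .
cp $HBASE_HOME/lib/hbase-client-0.98.*.jar .
```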

Now I started a Chukwa collector on my namenode and an agent on a Linux box. After a few minutes the first log data can be seen in HBase:
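To peek at the data you can scan one of the tables created by the Chukwa HBase schema (the table name here is an assumption; use whatever your schema file created):

```shell
# scan a Chukwa table; 'SystemMetrics' is an assumed table name
echo "scan 'SystemMetrics', {LIMIT => 2}" | hbase shell
```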

If writing to hdfs you will see something like:


Apache Hadoop 2.4.0 binary build for 64bit debian linux

At the time of writing this blog, the Apache Hadoop builds from the Apache website are compiled for 32-bit platforms. If you use one of these on a 64-bit platform (with 64-bit Java), you might get some error messages regarding the shared native library.
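The message in question is the well-known native-code warning, printed by almost every hadoop command when libhadoop.so cannot be loaded:

```shell
hadoop fs -ls /
# WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
```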

I have now compiled hadoop 2.4.0 on debian 7.4 (wheezy) 64 bit.

With this version I didn't get this error message again.

The error message is a bit misleading, but some other blogs, as well as a small hint on the documentation page

Native library documentation

directed me to compiling Hadoop from the source tarball. The build was straightforward with the command

mvn package -Pdist,native -DskipTests=true -Dtar

after installing some necessary packages that were missing on my box:
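The original package list is not shown; on a stock Debian wheezy box the usual suspects for a Hadoop native build are (my reconstruction, not the author's exact list):

```shell
apt-get install build-essential cmake zlib1g-dev libssl-dev pkg-config maven
# Hadoop 2.4.0 needs protoc 2.5.0; wheezy ships an older protobuf,
# so the protobuf compiler may have to be built from source.
```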

For those who want to start directly with a tarball compiled for 64-bit platforms, here is my Hadoop 2.4.0 bundle:





Lambda Architecture vs. Java 8 Lambdas

Currently the Lambda buzzword appears often in IT publications. Some readers may get confused and put an article in the wrong context, because there are two completely different topics with a similarity in the title: Lambda (λ)

1. Java SE 8 Lambdas

Lambdas, introduced with the new Java SE 8, target functional programming aspects.

With the introduction of lambdas, Java now has the ability to handle functions as method parameters. Other programming languages belonging to the class of functional languages are Haskell, Clojure, and Lisp.

Here is an example taken from the Java SE documentation page:

Call a method and pass functions as parameters:


2. Lambda Architecture (Big Data)

The Lambda Architecture was introduced by Nathan Marz. Roughly speaking, it describes a design in the big data area which combines a batch layer of data processing (with higher latency) with a speed layer that uses stream processing tools like Storm to produce real-time views. The user gets data combined from both layers, so that he sees up-to-date data in real time. Real-time views from the batch layer, which typically uses Hadoop's MapReduce to aggregate/transform raw input data, can be achieved by using ElephantDB. The layer between the batch layer and the user is called the serving layer.

Have a look at Nathan Marz's book, chapter 1, section 1.7, "Summary of the Lambda Architecture", to get more information about this.


Hadoop WebHDFS usage in combination with HAR (hadoop archive) from PHP

Hadoop is not very efficient at storing a lot of small files.

If you nevertheless need to access a lot of files, you can use HAR to get around the small files problem. Here are the steps I took to access the files from PHP.

1. Copy the files from local filesystem to HDFS
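For example (the local directory name is an assumption):

```shell
hdfs dfs -mkdir -p /tmp/har
hdfs dfs -put api /tmp/har/
```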

2. Create a hadoop archive

Let hadoop create ONE single HAR file with the name hadoop-api.har from the whole HDFS directory /tmp/har/.
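The standard archive tool syntax for that is (-p names the parent directory whose content is archived, the last argument is the destination directory):

```shell
hadoop archive -archiveName hadoop-api.har -p /tmp/har /har
```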

This command starts a MapReduce job that creates the HAR file without deleting the original small files in HDFS.

3. Delete the small files from HDFS
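Once the archive exists, the originals can go:

```shell
hdfs dfs -rm -r /tmp/har
```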

4. HAR file content (HAR filesystem)
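Through the har:// filesystem the archive looks like the original directory tree:

```shell
hdfs dfs -ls -R har:///har/hadoop-api.har | head
```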

5. HAR file content (HDFS filesystem)

The HAR file is not really just one file but a directory containing a couple of files. Let's have a look at it with raw HDFS commands:
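For example:

```shell
# shows the _index, _masterindex and part-* files that make up the archive
hdfs dfs -ls /har/hadoop-api.har
```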

6. Structure of the HAR file index (how to access the single files)

hdfs dfs -cat /har/hadoop-api.har/_index
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2Fserializer%2Fclass-use%2FJavaSerialization.html file part-0 17439924 4592 1401786436896+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Frecord%2Fmeta%2FUtils.html file part-0 86374093 9779 1401786547239+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fhttp%2Flib%2FStaticUserWebFilter.StaticUserFilter.html file part-0 12578713 14088 1401786409718+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Ffs%2Fftp%2FFTPFileSystem.html file part-0 33753102 56570 1401786511587+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Frecord%2Fcompiler%2Fclass-use%2FConsts.html file part-0 23911123 4493 1401786471791+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fsecurity%2Fproto%2Fclass-use%2FSecurityProtos.GetDelegationTokenResponseProto.html file part-0 27203013 22194 1401786486455+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fha%2FFenceMethod.html file part-0 9642822 11698 1401786398308+420+hadoop+supergroup

%2Fapi%2Forg%2Fapache%2Fhadoop%2Fservice%2Fclass-use%2FService.html file part-0 27995653 13268 1401786490938+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fha%2Fproto%2Fclass-use%2FHAServiceProtocolProtos.MonitorHealthRequestProto.html file part-0 11807109 26499 1401786404845+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Ffs%2Fpermission%2Fclass-use%2FAccessControlException.html file part-0 8820869 4647 1401786392428+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Fipc%2Fprotobuf%2FRpcHeaderProtos.RpcRequestHeaderProto.OperationProto.html file part-0 76175228 529335 1401786535978+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Ffs%2FHardLink.LinkStats.html file part-0 6370459 15816 1401786381184+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2FMapFile.Reader.html file part-0 13487270 42673 1401786413666+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2FRawComparator.html file part-0 13752188 12048 1401786415087+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Frecord%2Fcompiler%2Fclass-use%2FJByte.html file part-0 23924635 4482 1401786472012+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Futil%2Fclass-use%2FShell.OSType.html file part-0 29576158 7120 1401786501833+420+hadoop+supergroup
%2Fapi%2Fsrc-html%2Forg%2Fapache%2Fhadoop%2Fha%2Fproto%2FHAServiceProtocolProtos.TransitionToActiveRequestProtoOrBuilder.html file part-0 46601135 515881 1401786515905+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fnet%2Funix%2Fpackage-summary.html file part-0 23093614 4293 1401786467275+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fipc%2Fprotobuf%2Fclass-use%2FRpcHeaderProtos.RpcResponseHeaderProtoOrBuilder.html file part-0 20622979 7652 1401786447960+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Fio%2Fretry%2Fclass-use%2FAtMostOnce.html file part-0 17192339 4502 1401786434497+420+hadoop+supergroup
%2Fapi%2Forg%2Fapache%2Fhadoop%2Ftools%2Fproto%2Fclass-use%2FGetUserMappingsProtocolProtos.GetUserMappingsProtocolService.html file part-0 28531720 7322 1401786494023+420+hadoop+supergroup

Each row of the index file contains several space-separated columns:

  • The URL-encoded path within the HAR file
  • The type of the entry, i.e. file or dir
  • The HDFS part file which contains the content
  • The offset of the content within the part file
  • The length of the file
  • Modification time, permissions, owner and group, joined by '+'
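The lookup logic a client has to implement can be sketched in a few lines of shell, using the real FenceMethod.html row from the dump above:

```shell
line='%2Fapi%2Forg%2Fapache%2Fhadoop%2Fha%2FFenceMethod.html file part-0 9642822 11698 1401786398308+420+hadoop+supergroup'

# split the space-separated columns
set -- $line
path=$1; type=$2; part=$3; offset=$4; length=$5

# URL-decode the %2F escapes back to slashes
decoded=$(printf '%s' "$path" | sed 's|%2F|/|g')

echo "$decoded -> $part bytes $offset..$((offset + length - 1))"
# /api/org/apache/hadoop/ha/FenceMethod.html -> part-0 bytes 9642822..9654519
```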

7. Example access using curl
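With the offset and length from the index row, a single byte range of the part file can be fetched over WebHDFS. The namenode host below is a placeholder; op=OPEN with offset/length are the real WebHDFS request parameters:

```shell
NN=namenode.example.com:50070    # placeholder host:port
url="http://$NN/webhdfs/v1/har/hadoop-api.har/part-0?op=OPEN&offset=9642822&length=11698"
echo "$url"
# curl -L "$url"   # -L follows the redirect to the serving datanode
```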

8. PHP access to WebHDFS

In a simple PHP script, the HAR index file is loaded, parsed, and used to construct the URL for downloading the content of a file inside the HAR, where the local/relative path is appended to the PHP script URL:

The part after index.php [/api/org/apache/hadoop/ha/proto/class-use/HAServiceProtocolProtos.MonitorHealthRequestProto.html] is an example HTML file that is included in the HAR file.

The php file’s source is

9. Remarks

It would be great if the WebHDFS implementation allowed accessing a HAR filesystem directly.

[1] WebHdfs

Hadoop’s mapreduce on YARN: What the framework does before your Mapper/Reducer methods are called

0. Motivation

In several situations it is important to have a deeper understanding of the framework in order to write MapReduce programs that are more complex than the typical WordCount examples.

None of the publications and books I have seen explain these details, so I had to work them out myself. I used the Hadoop 2.3.0 source code for my analysis.

My goal was to get answers to the following questions:

  1. Where does the serialization take place?
  2. Where are the key/value pairs generated from the HDFS file data?
  3. Which class runs the loop over all key/value pairs?
  4. How is Avro serialization detected/chosen, and why are the special Avro MapReduce classes necessary?

1. Method call stack

To get started with the analysis, I wrote an example MapReduce application and included the following dummy exception to get the stack trace:

The resulting stack trace delivers the following information:

No  class name                        method name   source code row
6   org.apache.hadoop.mapred.MapTask  run           340
7   org.apache.hadoop.mapred.MapTask  runNewMapper  764

2. Remarks to the different steps

2.1 – 2.5 Dispatching Job context

In steps 1 – 5 the job context is dispatched, with some more general initialization steps that are not so relevant for the actual interests of this blog.

2.6 and 2.7 runNewMapper

Step 6 just calls step 7 using the umbilical that connects the application master of the job with the worker nodes that execute the (map and reduce) tasks.

First the mapper and input format classes are created using the information from the job/task context.

Next the RecordReader is created using the now available input format instance. NewTrackingRecordReader is an inner class defined in MapTask. The constructor creates an instance of a mapreduce.RecordReader using

This RecordReader is used to get the key/value pairs. In my example the RecordReader was the one from the SequenceFile:

As we can see, the method nextKeyValue internally calls the corresponding methods of SequenceFile.Reader, which is associated with the SequenceFileInputFormat I used in my job definition.

The keyDeserializer is created in the init method of SequenceFile.Reader using the SerializationFactory. How this works is explained below.

The serializer and deserializer are retrieved from the SerializationFactory, which makes use of the Serializations registered in the JobContext/Configuration. First the constructor fills a list with all registered Serializations. The Serialization used to serialize and deserialize an instance of a class is determined in the method getSerialization, which loops over all Serializations and checks whether the accept method indicates that the Serialization can be used for the given class.

2.8 Mapper delegation to the custom map method

The custom Mapper extends the framework Mapper. The framework calls a default identity implementation of the map method, which is typically overridden by the custom Mapper implementation.

2.9 Custom Mapper implementation

In this way the run method delegates the request to the custom map implementation.

3. Answers to the questions

3.1 Where does the serialization take place?

The configured InputFormat creates a RecordReader/Reader, which determines the Serialization usable for the configured key/value class. In my example I used a SequenceFile as input, where the key and value classes of the input are read from the SequenceFile header block (before real data is loaded).

From the RecordReader's perspective, deserialized key and value instances are loaded from the HDFS file (split).

3.2 Where are the key/value pairs generated from the HDFS file data?

The inner class Reader of SequenceFile calls the methods of the Serialization implementation to deserialize the data and delivers the object instances to the callers.

3.3 Which class runs the loop over all key/value pairs?

The run method of the class org.apache.hadoop.mapreduce.Mapper loops as long as context.nextKeyValue() is true. With the retrieved keys and values the Mapper calls the custom implementation of the map method.

3.4 How is Avro serialization detected/chosen and why are the special Avro MapReduce classes necessary?

The class SerializationFactory checks which Serialization accepts the requested class. If the Avro Serialization accepts the class (i.e. if it is an Avro class), this Serialization is used for serializing and deserializing the data.

The second part of this question is not answered by my analysis (yet)… to be continued.


Avro reuse of custom defined types

It took me some time to find out that Avro allows us to reuse custom types, so that we can build complex types composed of other complex types we defined ourselves.

Look at the following example schema definition:

This kind of schema definition is allowed if there is another schema definition file which defines the type org.woopi.avro.CompoundSubTypeExtended.

I used the following example for my tests:

If you compile the files with the Avro schema compiler that is included in the Avro tools jar file avro-tools-1.7.6.jar (which can be downloaded from an Avro Java jar mirror), it is important that the compiler gets all schema files at once, so that it can resolve types defined in one file and used in another.

It is possible to use the java command with multiple schema files and directories as input:
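For example (the schema file names are assumptions based on the types mentioned above):

```shell
# pass all .avsc files in one call, so cross-file type references resolve
java -jar avro-tools-1.7.6.jar compile schema \
  CompoundSubTypeExtended.avsc CompoundType.avsc src/main/java/
```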

Or for ANT fans: