How to get SAP data into the Hadoop data lake

It is possible to get a lot of information using SAP's standard remote-enabled function modules, but there are some issues that can make the data retrieval process painful. If you take a closer look at the RFC_READ_TABLE function module, for example, you will find that even some simply structured standard tables like EDIDC (IDOC control records) cannot be read with all columns at once. Furthermore, if the chosen delimiter appears in the flat, CSV-like response structure, parsing the results becomes troublesome. Once you get a little deeper into real-life use cases, you realize that WHERE conditions, table joins etc. are needed. And if you want to retrieve a large number of rows from a table, the rowcount/skip-rows mechanism is not very performant and, even worse, can lead to inconsistent results if the data changes between two requests.

Because of these issues we started to develop function modules and classes that remove some of the described pain points. Starting with an improved version of RFC_READ_TABLE, we quickly identified further needs and added more function modules that make it easy to retrieve interesting information from SAP systems (not only ERP).

Less visible to the user, but very noticeable with real-life payloads, are the performance-related optimizations. For example, there are ways to avoid the additional data volume that SOAP encoding adds to the transfer; MTOM is not easy (if at all possible) to use in SAP services [perhaps if you add systems like XI, which also has a Java stack 🙂]. Another internal improvement is the XML serialization implementation, which can serialize even data containing hex 0 characters. This is important because the XML serialization available in the SAP standard, the asXML transformation, produces dumps if such hex 0 characters appear in the data to be serialized.

All improvements developed on the ABAP side, as well as some useful Java classes that help you get the data into Hadoop, are available in what we call the Woopi SAP Adapter. Woopi is a registered trademark of Dr. Menzel.

1. Reasons to bring data into Hadoop

  1. Hadoop storage is cheap, typical enterprise SAN storage is expensive. (Cost)
  2. Hadoop storage is easy to extend, with no relevant upper limit and without system or structural changes: you just add more nodes and rebalance the cluster. (Volume)
  3. Processing of data can be 1000 times faster than with traditional systems. (Velocity)
  4. You can process unstructured or semi-structured data too. (Variety)
  5. Calculations you never even considered before become possible. (New possibilities)
  6. Don’t waste capacity of your productive system: let Hadoop do all calculations that do not need to be done in SAP. (Cost + performance)

If we have a look at the Google Trends chart for Hadoop, we see that it is continuously getting more interest.

[Figure: Google Trends chart for Hadoop]

[Figure: Woopi SAP Adapter logo]

2. The Woopi SAP Adapter Modules

2.1. RFC function module to read table data

Our function module Z_WP_RFC_READ_TABLE has the following features (see the call sketch after the list):

  1. reading from multiple joined SAP tables
  2. selected columns, all columns per table, or all columns globally
  3. XML serialization of the results
  4. optional zip compression
  5. WHERE conditions
  6. limiting the number of results
  7. metadata export from the data dictionary
  8. reading data from cluster tables
  9. an asynchronous mode to export huge amounts of data per query
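
For illustration, a call from the Java side via SAP JCo could look like the following minimal sketch. The destination name and the parameter names used here (QUERY_TABLE, WHERE_CLAUSE, DATA_XML) are assumptions made for this example, not the documented interface of Z_WP_RFC_READ_TABLE, which offers more than shown (joins, compression, asynchronous mode etc.).

    // Minimal sketch: calling the table reader RFC from Java via SAP JCo.
    // The destination name and the parameter names (QUERY_TABLE, WHERE_CLAUSE,
    // DATA_XML) are illustrative assumptions, not the adapter's documented interface.
    import com.sap.conn.jco.JCoDestination;
    import com.sap.conn.jco.JCoDestinationManager;
    import com.sap.conn.jco.JCoException;
    import com.sap.conn.jco.JCoFunction;

    public class ReadTableExample {

        public static void main(String[] args) throws JCoException {
            JCoDestination dest = JCoDestinationManager.getDestination("SAP_PROD"); // assumed destination name
            JCoFunction fn = dest.getRepository().getFunction("Z_WP_RFC_READ_TABLE");
            if (fn == null) {
                throw new IllegalStateException("Z_WP_RFC_READ_TABLE not found in repository");
            }
            // Hypothetical parameters: table name and a WHERE condition.
            fn.getImportParameterList().setValue("QUERY_TABLE", "EDIDC");
            fn.getImportParameterList().setValue("WHERE_CLAUSE", "CREDAT >= '20240101'");
            fn.execute(dest);
            // Assumed export parameter carrying the XML-serialized result.
            String xml = fn.getExportParameterList().getString("DATA_XML");
            System.out.println(xml);
        }
    }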

2.2. RFC function module to read ABAP source code

If you extract the changed ABAP sources regularly, e.g. daily, you can build an ABAP source repository with history. That way you can check which source code was active at any timestamp you want to examine. This can be very useful when you have to analyze errors or deal with partly finished transport orders. With this data basis you can also detect code inconsistencies caused by transport orders that arrived in the productive system in the wrong order.

The software only delivers the sources that have changed since your last data retrieval run. That keeps the source code extraction fast, so you can repeat it quite often.
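
For illustration only, here is a minimal sketch of how the extracted sources could be stored with history on the Hadoop side; the HDFS path layout used here is an assumption for this example, not the adapter's actual storage format.

    // Minimal sketch: storing extracted ABAP sources in HDFS with a history.
    // The path layout /sap/abap-sources/<program>/<timestamp>.abap is an assumption
    // made for this example; the adapter may organize the data differently.
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AbapSourceArchiver {

        private final FileSystem fs;

        public AbapSourceArchiver(Configuration conf) throws Exception {
            this.fs = FileSystem.get(conf);
        }

        /** Writes one version of a program; older versions stay untouched, giving a history. */
        public void store(String programName, String changeTimestamp, String sourceCode) throws Exception {
            Path target = new Path("/sap/abap-sources/" + programName + "/" + changeTimestamp + ".abap");
            try (FSDataOutputStream out = fs.create(target, false)) { // fail if this version already exists
                out.write(sourceCode.getBytes(StandardCharsets.UTF_8));
            }
        }
    }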


2.3. RFC function module to read JOBLOGs, JOBSPOOLs, JOBINFOs

Job information can be listed using transaction SM37. In productive systems this list typically contains a lot of jobs every day. Often you have to check the job logs or job step spools for errors to make sure there are no disruptions or errors in your business processes. The standard function modules that can read job log and job step spool information

  • RSPO_RETURN_ABAP_SPOOLJOB
  • BP_JOBLOG_READ

are both not remote-enabled. Our SAP adapter module for job logs delivers all job information since the last data retrieval at once, based on the information in the SAP tables

  • TBTCO
  • TBTC_SPOOLID

2.4. RFC function module to read BDOCs as XML

BDOCs are the XML documents exchanged between ERP and CRM systems to keep their business content synchronous. With the SAP adapter module for BDOCs it is quite easy to continuously pipe the BDOC messages as XML documents into a Hadoop SequenceFile.
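
On the Java side, appending the XML documents to a SequenceFile can be pictured roughly as in the following sketch; using the BDOC id as key and the XML payload as value is an assumption made for this example, not necessarily the adapter's actual layout.

    // Minimal sketch: appending BDOC XML documents to a Hadoop SequenceFile.
    // Using the BDOC id as key and the XML payload as value is an assumption made
    // for this example; the Woopi adapter may use a different key/value layout.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BdocSequenceFileWriter implements AutoCloseable {

        private final SequenceFile.Writer writer;

        public BdocSequenceFileWriter(Configuration conf, String file) throws Exception {
            this.writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(file)),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class));
        }

        public void append(String bdocId, String bdocXml) throws Exception {
            writer.append(new Text(bdocId), new Text(bdocXml));
        }

        @Override
        public void close() throws Exception {
            writer.close();
        }
    }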

Because all information about your business content is contained in one of the BDOCs, you can use Hadoop to parse a huge number of BDOCs in a short time. This lets you run analyses that answer business questions coming up long after the data was exchanged between the systems. Imagine, for example, a question about deleted sales orders: without custom changes in the SAP system it is very hard to get this information out of it. You can also answer questions such as in which system certain changes were made.

2.5. RFC function module to read IDOCs as XML

IDOCs are important because they are one of the most frequently used ways to exchange data between SAP systems or between an SAP system and external systems. The most important use cases are sales order import and the exchange of warehouse transport orders with external warehouse systems.

The arguments for saving IDOC copies in Hadoop are similar to those for BDOCs, even if the business questions are different.

2.6. ADK archiving with a copy into Hadoop

With minimal changes you can reuse your archiving reports to write an additional copy of all data into Hadoop.

All necessary code is available in the class Z_WP_ARCHIVING_HDP, which has methods whose names and parameters match the ADK function modules. So your migration steps are:

  1. Create an instance of Z_WP_ARCHIVING_HDP at the beginning of your write report.
  2. Replace all ADK function module calls with the corresponding method calls on the wrapper class instance you just created.


The following ADK function modules need to be replaced by the corresponding methods:

  1. ARCHIVE_OPEN_FOR_WRITE
  2. ARCHIVE_NEW_OBJECT
  3. ARCHIVE_PUT_RECORD
  4. ARCHIVE_CLOSE_FILE
  5. ARCHIVE_SAVE_OBJECT


2.7. XI Java Stack Database Table Reader

The SAP Java stacks have their own database, which is not directly accessible from the ABAP stack. Most XI adapters are implemented in Java and executed in the Java stack, while the communication with other SAP systems takes place in the ABAP stack, so messages have to be exchanged between the two stacks internally. The exchanged messages are stored in the database, some of them in the Java database and others in the ABAP database. You can monitor the messages using the Runtime Workbench. In case of (communication) problems between the two stacks, you have to check the messages in both stacks, and if messages are missing or lost in one of the stacks, they are not easy to find.

Because of this situation we developed an EAR for the Java stack application server which offers the possibility to access the Java database generically (similar to the ABAP read table function module) over HTTP.
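
A client call could then look roughly like the sketch below; the host, servlet path and query parameters are purely hypothetical, since the actual interface of the EAR is not described here.

    // Minimal sketch: querying the Java stack database reader over HTTP.
    // The URL, servlet path and query parameters are hypothetical; they only
    // illustrate the idea of a generic, RFC_READ_TABLE-like access over HTTP.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JavaStackTableClient {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Hypothetical endpoint and parameters: table name plus an optional condition.
            URI uri = URI.create(
                    "http://xi-host:50000/woopi-table-reader/read"
                    + "?table=BC_MSG&where=STATUS%3D%27FAILED%27");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // e.g. an XML-serialized result set
        }
    }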

2.7.1. XI Java Stack Message Reader

As a special use case, we can read complete XI messages (metadata and payload) directly from the Java stack database and, as usual, write them into SequenceFiles in Hadoop.

3. Woopi SAP Adapter on the Java / Hadoop side

On the Java side the Woopi SAP Adapter has the necessary client classes to pull the data from the SAP systems. The current state is persisted locally, so that the software knows which data to fetch next time.
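
A minimal sketch of such local state handling, assuming a simple properties file keyed by extractor name (the adapter's real persistence mechanism may differ):

    // Minimal sketch: persisting the extraction state locally so the next run
    // only fetches data changed since the last run. Storing the state in a
    // properties file keyed by extractor name is an assumption for this example.
    import java.io.IOException;
    import java.io.Reader;
    import java.io.Writer;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Properties;

    public class ExtractionState {

        private final Path stateFile;
        private final Properties props = new Properties();

        public ExtractionState(Path stateFile) throws IOException {
            this.stateFile = stateFile;
            if (Files.exists(stateFile)) {
                try (Reader in = Files.newBufferedReader(stateFile)) {
                    props.load(in);
                }
            }
        }

        /** Returns the last extraction timestamp for an extractor, or a default for the first run. */
        public String lastTimestamp(String extractor) {
            return props.getProperty(extractor, "19700101000000");
        }

        /** Records the new timestamp after a successful extraction run. */
        public void update(String extractor, String timestamp) throws IOException {
            props.setProperty(extractor, timestamp);
            try (Writer out = Files.newBufferedWriter(stateFile)) {
                props.store(out, "Woopi SAP Adapter extraction state (example)");
            }
        }
    }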

There are MapReduce jobs that transform the data retrieved from SAP tables, automatically generate (if necessary) Hive or Phoenix tables in Hadoop, and afterwards import the data into those Hadoop databases.
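
The automatic table generation can be pictured roughly as in the following sketch, which maps a few data dictionary types to Hive types and builds a CREATE TABLE statement; the ColumnMeta record and the type mapping are simplifications made up for this example.

    // Minimal sketch: deriving a Hive DDL statement from SAP data dictionary metadata.
    // The ColumnMeta record and the DDIC-to-Hive type mapping are simplified
    // assumptions for this example, not the adapter's actual implementation.
    import java.util.List;
    import java.util.Map;

    public class HiveDdlGenerator {

        record ColumnMeta(String name, String ddicType) {}

        // Very small illustrative mapping of DDIC data types to Hive types.
        private static final Map<String, String> TYPE_MAP = Map.of(
                "CHAR", "STRING",
                "NUMC", "STRING",
                "DATS", "STRING",
                "INT4", "INT",
                "DEC",  "DECIMAL(18,4)");

        public static String createTable(String tableName, List<ColumnMeta> columns) {
            StringBuilder ddl = new StringBuilder("CREATE TABLE IF NOT EXISTS " + tableName.toLowerCase() + " (");
            for (int i = 0; i < columns.size(); i++) {
                ColumnMeta col = columns.get(i);
                String hiveType = TYPE_MAP.getOrDefault(col.ddicType(), "STRING");
                ddl.append(col.name().toLowerCase()).append(" ").append(hiveType);
                if (i < columns.size() - 1) {
                    ddl.append(", ");
                }
            }
            return ddl.append(") STORED AS ORC").toString();
        }

        public static void main(String[] args) {
            System.out.println(createTable("EDIDC",
                    List.of(new ColumnMeta("DOCNUM", "NUMC"), new ColumnMeta("CREDAT", "DATS"))));
        }
    }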

If you write MapReduce jobs that go beyond the Hello World level, you quickly need to handle serialized data (Avro, JSON, etc.) with special comparators. Writing grouping comparators that bring data from different input sources, tagged in their keys, together in one reducer call can be very time consuming and error prone. We have created helper classes that make it possible to generate such comparators just by defining the necessary fields in the (extended) Avro schema.
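
To illustrate what such a grouping comparator does, here is a hand-written sketch for the simple case of a Text key of the form "<joinKey><TAB><sourceTag>"; the generated comparators mentioned above work on Avro data instead, so this only shows the underlying pattern.

    // Minimal sketch of the grouping-comparator pattern for a reduce-side join:
    // keys have the form "<joinKey>\t<sourceTag>", sorting still sees the full key,
    // but grouping ignores the tag so all sources meet in one reduce call.
    // This hand-written Text-based version only illustrates the pattern; the
    // adapter's generated comparators work on (extended) Avro schemas instead.
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class JoinKeyGroupingComparator extends WritableComparator {

        public JoinKeyGroupingComparator() {
            super(Text.class, true); // create Text instances for comparison
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            String keyA = a.toString();
            String keyB = b.toString();
            // Compare only the join key in front of the tab, ignoring the source tag.
            return keyA.split("\t", 2)[0].compareTo(keyB.split("\t", 2)[0]);
        }
    }

In a job setup it would be registered with job.setGroupingComparatorClass(JoinKeyGroupingComparator.class), so that records from all tagged sources sharing the same join key arrive in one reduce call.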

4. Use Cases

4.1. Example use cases

  1. statistical monitoring:
    warnings when measured actual values deviate from the expected value by more than the statistical standard deviation (see the sketch after this list)
  2. long-term KPI calculations
  3. end-to-end (multi-system) business process monitoring, e.g. sales orders external system -> sales orders ESB system -> sales orders CRM -> sales orders ERP -> transport orders (warehouse) -> delivery confirmation (UPS, DHL, Transoflex etc.)
  4. customer segmentation
  5. order recommendations
  6. warehouse: stock-keeping optimizations
  7. searches / analysis over CDHDR/CDPOS
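
As a toy illustration of the first use case, such a deviation check boils down to the following; computing the statistics in memory and using a fixed threshold of k standard deviations are of course simplifications of what would run over the historical data in Hadoop.

    // Toy sketch of the statistical-monitoring idea: flag an actual value that
    // deviates from the historical mean by more than k standard deviations.
    public class DeviationMonitor {

        public static boolean isOutlier(double[] history, double actual, double k) {
            double mean = 0.0;
            for (double v : history) {
                mean += v;
            }
            mean /= history.length;

            double variance = 0.0;
            for (double v : history) {
                variance += (v - mean) * (v - mean);
            }
            double stdDev = Math.sqrt(variance / history.length);

            return Math.abs(actual - mean) > k * stdDev;
        }

        public static void main(String[] args) {
            double[] dailyOrders = {100, 104, 98, 101, 97, 103};
            System.out.println(isOutlier(dailyOrders, 130, 1.0)); // true: raise a warning
        }
    }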

4.2. Let us know about your use cases

Many other use cases are imaginable. If you have interesting use cases from your business, we would appreciate hearing about them. Just send us an e-mail.