Hadoop / YARN / Tez

0. Introductory Remarks

Here is my extended version of the installation / deployment documentation for Tez. The original version can be seen on the Tez webpage.

I have used Hadoop 2.3.0 to try the Tez installation, so I assume here that your Hadoop cluster is already up and running. If you do not have a running Hadoop cluster yet, look at the Hadoop webpage for installation instructions.

My version of Hadoop is running on Debian Wheezy.

1. Download the Tez tarball

You can get the source tarball of Tez from its Apache incubator page.

2. Compile the sources using maven

Before starting Maven to build Tez, we have to change the pom.xml file in the root directory of the unpacked Tez directory. In my case I changed the Hadoop version property hadoop.version from 2.2.0 to 2.3.0.
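
In the <properties> section of pom.xml the entry should end up looking like this (a sketch; all other properties in that section stay untouched):

  <!-- Hadoop version the Tez build compiles against; set it to match your cluster -->
  <hadoop.version>2.3.0</hadoop.version>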

After starting Maven with the following command, some missing jars are usually downloaded from the central repository.
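
A sketch of the build call, run from the root of the unpacked Tez directory (the skip flags are optional, but they speed up the build considerably):

mvn clean install -DskipTests=true -Dmaven.javadoc.skip=true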

3. Adapting configuration files

3.1 Add TEZ variables to .bashrc

As the hadoop user, edit ~/.bashrc and append the following lines:
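
A minimal sketch, assuming Tez was unpacked and built under /home/hadoop/tez-0.3.0-incubating and that HADOOP_INSTALL is already set (adjust the paths and version to your setup):

# Tez environment; the paths below are assumptions, adjust them to your installation
export TEZ_INSTALL=/home/hadoop/tez-0.3.0-incubating
# directory that will hold tez-site.xml (see section 3.4)
export TEZ_CONF_DIR=${HADOOP_INSTALL}/etc/hadoop
# classpath entries covering the Tez jars and their dependencies
export TEZ_JARS=${TEZ_INSTALL}/*:${TEZ_INSTALL}/lib/*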

3.2 Add Tez jars to the hadoop environment shell script

Edit ${HADOOP_INSTALL}/etc/hadoop/hadoop-env.sh

Right after the existing lines that build up HADOOP_CLASSPATH, add the following new lines to include the Tez jars:
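
A minimal sketch using the variables from section 3.1; the point is simply to prepend the Tez configuration directory and jars to HADOOP_CLASSPATH:

# make the Tez configuration and jars visible to the Hadoop / YARN processes
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}:${HADOOP_CLASSPATH}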

3.3 Upload Tez jars to HDFS
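
YARN pulls the Tez jars from HDFS at runtime, and tez-site.xml (section 3.4) will point to this location. A sketch, assuming /apps/tez-0.3.0 as the target directory (any HDFS path readable by your jobs should work):

# create the target directories in HDFS (the path is an assumption)
${HADOOP_INSTALL}/bin/hdfs dfs -mkdir -p /apps/tez-0.3.0/lib
# upload the Tez jars and their dependencies
${HADOOP_INSTALL}/bin/hdfs dfs -copyFromLocal ${TEZ_INSTALL}/*.jar /apps/tez-0.3.0/
${HADOOP_INSTALL}/bin/hdfs dfs -copyFromLocal ${TEZ_INSTALL}/lib/*.jar /apps/tez-0.3.0/lib/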

3.4 Create a Tez configuration file

Create or edit the file tez-site.xml in ${HADOOP_INSTALL}/etc/hadoop/
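
At minimum the file needs the property tez.lib.uris, which tells YARN where the Tez jars were uploaded in section 3.3. A minimal sketch, assuming the /apps/tez-0.3.0 path from above:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <!-- HDFS directories holding the Tez jars; adjust to the path chosen in 3.3 -->
    <value>${fs.default.name}/apps/tez-0.3.0,${fs.default.name}/apps/tez-0.3.0/lib</value>
  </property>
</configuration>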

3.5 Configure MapReduce to use yarn-tez instead of yarn

Create or edit mapred-site.xml in ${HADOOP_INSTALL}/etc/hadoop/
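
The only property that has to change is the framework name; with yarn-tez, classic MapReduce jobs are submitted through Tez instead of the plain MapReduce runtime (keep any other properties you already have in this file):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>
</configuration>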

3.6 Start or restart YARN

${HADOOP_INSTALL}/sbin/stop-yarn.sh

${HADOOP_INSTALL}/sbin/start-yarn.sh

4. Start a Tez example

Several examples are included in the Tez tarball. Let's start the orderedwordcount example.

Create an HDFS directory /tests/tez-examples/in and copy some text files of your choice into it.
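
For example (the local file names are just placeholders):

${HADOOP_INSTALL}/bin/hdfs dfs -mkdir -p /tests/tez-examples/in
${HADOOP_INSTALL}/bin/hdfs dfs -copyFromLocal mytext1.txt mytext2.txt /tests/tez-examples/in/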

Then execute the command:
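
A sketch of the call; the exact jar name depends on the Tez version you built (here I assume 0.3.0-incubating), and the output directory must not exist yet:

${HADOOP_INSTALL}/bin/hadoop jar ${TEZ_INSTALL}/tez-mapreduce-examples-0.3.0-incubating.jar \
    orderedwordcount /tests/tez-examples/in /tests/tez-examples/out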

The output should look similar to mine here:


Google shows us the way to scalability

Google started to write a series of publications about scalable software that changed the way of thinking in the entire IT world!

  1. The distributed file system:
    In 2003 they published an article about the Google distributed file system.
    With the information in this article it was possible to implement an open source version of a distributed file system, included in ASF's Hadoop.
  2. MapReduce:
    In 2004 the MapReduce paper made the next important step for scalable software systems. With MapReduce it is now possible to perform calculations on huge amounts of data using clusters of commodity hardware. The open source implementation is also included in Hadoop.
  3. BigTable:
    As a logical next step, Google added a database system that scales well and stores schema-less data in tables with millions of columns and even more rows. They called this system BigTable. Apache's answer / implementation is called HBase and is a subproject of Hadoop.
  4. Dremel:
    In 2010 Google published a paper about its system for interactive analysis of web-scale datasets => Dremel
    Apache is currently incubating an open source version of Dremel, which is called Drill.
  5. Pregel:
    Influenced by the strong presence of social networks, Google published the internals of its scalable graph processing system Pregel. Apache implemented this system under the name Giraph.
  6. Kafka:
    The missing part in the scalable software puzzle is a scalable message queue system that runs on a shared-nothing hardware cluster. Apache offers a scalable system called Kafka, which delivers topics and queues combined in one technological implementation. It seems that the basics are well designed, even if some features for enterprise use are not yet available (e.g. authorization / access control). Up to now I have not found an analogue in a Google publication. If someone knows of one, I would be pleased if you leave me a comment.

As we can see, the IT world is changing, and the time when knowledge of relational databases and GUIs or web frontends was enough for IT professionals seems to be over.

Some people may be afraid to lose their role as knowledge leaders; others (hopefully most of us) are just curious about all the possibilities these new techniques will bring us.

Let’s scale out …..