Configuring PDI for Use with Hadoop 1.2.0

Purpose #

In this post I have captured my work configuring Pentaho Data Integration (PDI) for use with Hadoop. It is written as a tutorial on setting up PDI 4.4 with Hadoop 1.2.0.

Prerequisites #

  • Java 1.6 or later (not OpenJDK, which is not compatible with this version of PDI)
  • Pentaho Data Integration 4.4
  • A running Hadoop 1.2.0 cluster

Configuring Pentaho #

Pentaho comes preconfigured for use with Hadoop 0.20.2, which is great…unless you want to use a different version of Hadoop. The versions of Hadoop supported by Pentaho are outlined in the support matrix in the Pentaho InfoCenter. Since I was using Hadoop 1.2.0, I needed to create my own Hadoop configuration for Pentaho. I augmented the instructions in the Pentaho InfoCenter on “Creating a New Hadoop Configuration” with the instructions found in this article. The updated instructions are below, followed by a consolidated shell sketch of the whole procedure:

  1. Go into the $PDI_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations directory.

  2. Make a copy of the hadoop-20 folder and rename it to hadoop-120. The folder name becomes the name of your new configuration.

  3. Copy the following JAR files from your Hadoop NameNode into the $PDI_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-120/lib/client directory:

    • commons-codec-1.4.jar
    • commons-configuration-1.6.jar
    • hadoop-core-1.2.0.jar
  4. Remove these files after copying in the updated libraries:

    • commons-codec-1.3.jar
    • hadoop-core-0.20.2.jar
  5. Update the active.hadoop.configuration property, which tells PDI which Hadoop distribution to use when communicating with the cluster, in the $PDI_HOME/plugins/pentaho-big-data-plugin/plugin.properties file so that it reads:

    active.hadoop.configuration=hadoop-120
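
For convenience, here is the whole procedure as a shell sketch. It assumes $PDI_HOME points at your PDI 4.4 install and $HADOOP_HOME at the Hadoop 1.2.0 install on the NameNode; the exact JAR locations under $HADOOP_HOME are placeholders and may differ in your environment.

    # Steps 1–2: clone the stock hadoop-20 configuration as hadoop-120
    cd "$PDI_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations"
    cp -r hadoop-20 hadoop-120

    # Step 3: copy the Hadoop 1.2.0 client JARs from the NameNode
    # (paths under $HADOOP_HOME are assumptions; adjust for your install)
    cp "$HADOOP_HOME/lib/commons-codec-1.4.jar" \
       "$HADOOP_HOME/lib/commons-configuration-1.6.jar" \
       "$HADOOP_HOME/hadoop-core-1.2.0.jar" \
       hadoop-120/lib/client/

    # Step 4: remove the libraries they replace
    rm hadoop-120/lib/client/commons-codec-1.3.jar \
       hadoop-120/lib/client/hadoop-core-0.20.2.jar

    # Step 5: point PDI at the new configuration
    sed -i 's/^active.hadoop.configuration=.*/active.hadoop.configuration=hadoop-120/' \
        ../plugin.properties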

Once the steps above are complete, you can start PDI and begin using the Big Data steps, such as Hadoop Copy Files and MapReduce.
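
For example, assuming a Linux workstation, you can launch the PDI GUI (Spoon) and confirm the new configuration is picked up (on Windows the equivalent launcher is Spoon.bat):

    cd "$PDI_HOME"
    ./spoon.sh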