In this post I have captured my work configuring Pentaho Data Integration (PDI) for use with Hadoop. It is formatted as a tutorial on how to set up PDI 4.4 with Hadoop 1.2.0.
- Java 1.6 or later (Not the OpenJDK distro as it is not compatible with this version of PDI)
- Pentaho Data Integration 4.4
- An up and running Hadoop 1.2.0 Cluster
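A quick way to check the Java requirement is to look at what `java -version` reports. The sketch below is only illustrative: `is_openjdk` is a hypothetical helper name, and it assumes that OpenJDK builds identify themselves with the string "OpenJDK" somewhere in that output.

```shell
# Hypothetical helper: succeeds (exit 0) when the given `java -version`
# text appears to come from an OpenJDK build.
is_openjdk() {
  printf '%s\n' "$1" | grep -qi 'openjdk'
}

# `java -version` writes to stderr, so redirect it before capturing.
if command -v java >/dev/null 2>&1; then
  version_text="$(java -version 2>&1)"
  if is_openjdk "$version_text"; then
    echo "OpenJDK detected: install a different JDK for PDI 4.4"
  else
    echo "Java found: $(printf '%s\n' "$version_text" | head -n 1)"
  fi
else
  echo "java not found on PATH"
fi
```

The check only looks at the vendor string; confirming that the version is 1.6 or later is still a manual step.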
Configuring Pentaho
Pentaho comes preconfigured for use with Hadoop 0.20, which is great unless you want to use a different version of Hadoop. The supported versions of Hadoop are outlined in the support matrix in the Pentaho InfoCenter. In my case I was using Hadoop 1.2.0, so I had to create my own Hadoop configuration for Pentaho. I augmented the InfoCenter instructions on “Creating a New Hadoop Configuration” with the instructions found in this article. See the updated instructions below:
Go into the $PDI_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations directory.
Make a copy of the hadoop-20 folder and rename it to hadoop-120. The folder name is the name of your new configuration.
Copy the following JAR files from your Hadoop NameNode into the $PDI_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-120/lib/client directory:
Remove these files after copying in the updated libraries:
Update the active.hadoop.configuration property, which configures the distribution of Hadoop that PDI will use when communicating with the cluster, in the $PDI_HOME/plugins/pentaho-big-data-plugin/plugin.properties file so that it names the new configuration folder: active.hadoop.configuration=hadoop-120
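The folder copy and the property update described in these steps can also be scripted. Here is a minimal sketch, assuming a POSIX shell and the directory layout named above; `create_hadoop_120_config` is a hypothetical function name, and the JAR copy and removal steps are deliberately left out because they depend on your cluster.

```shell
# Hypothetical helper: clone the stock hadoop-20 configuration as hadoop-120
# and point plugin.properties at it. $1 is the PDI install directory.
create_hadoop_120_config() {
  pdi_home="$1"
  plugin_dir="$pdi_home/plugins/pentaho-big-data-plugin"
  configs_dir="$plugin_dir/hadoop-configurations"
  props="$plugin_dir/plugin.properties"

  # Refuse to overwrite an existing configuration folder.
  if [ -e "$configs_dir/hadoop-120" ]; then
    echo "hadoop-120 already exists" >&2
    return 1
  fi
  cp -R "$configs_dir/hadoop-20" "$configs_dir/hadoop-120" || return 1

  # Rewrite the active.hadoop.configuration line through a temp file
  # (avoids `sed -i`, whose syntax differs between GNU and BSD sed).
  sed 's/^active\.hadoop\.configuration[[:space:]]*=.*/active.hadoop.configuration=hadoop-120/' \
    "$props" > "$props.tmp" && mv "$props.tmp" "$props"
}
```

With PDI_HOME set in your environment, you would call it as `create_hadoop_120_config "$PDI_HOME"`, then copy in and remove the JAR files by hand before starting PDI.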
Once the steps above have been completed, you can start PDI and begin using the Big Data steps like Hadoop Copy Files and MapReduce.