Tuesday, 18 December 2012

Kettle Integration for BigData - My first ETL

I got introduced to Pentaho Kettle very recently and was immediately excited to get my hands dirty. I have been playing around with the Hadoop ecosystem for quite a while now, and Pentaho for BigData drew my attention.

It took me quite a while to get the hang of the concepts (well, not that long, just 2 days!!!).

I already had a Hadoop-1.1.1 cluster running, and my job now is to integrate Kettle with it. The Spoon UI that is bundled with Kettle helps to design jobs and transformations. I had a really hard time getting the Spoon UI to open on my AWS EC2 Ubuntu instance. I do not want to go into those issues here, as they are still not solved :-(. It may be worth a separate post once I have the solution.
In short, I suspect that it could be a video graphics driver issue.

Spoon would have taken care of my needs end to end, from designing the jobs and transformations to running them. But unfortunately the Ubuntu issue forced me to use Spoon on Windows and then use the generated kjb and ktr files on Ubuntu. Luckily, Kettle comes with very useful scripts to run jobs and transformations (kitchen and pan respectively). Cool, at least I could integrate Kettle with HDFS and Hive successfully.
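
A rough sketch of running the generated files from the Ubuntu command line: Kitchen takes a job (.kjb) and Pan takes a transformation (.ktr), both through the -file option. The KETTLE_HOME location, the run_kettle helper and the DRY_RUN switch are assumptions made for illustration; they are not something Kettle ships.

```shell
#!/usr/bin/env bash
# Assumed install location of Kettle (the data-integration folder); adjust as needed.
KETTLE_HOME="${KETTLE_HOME:-/opt/data-integration}"

# Hypothetical helper: pick kitchen.sh for jobs (.kjb) and pan.sh for transformations (.ktr).
run_kettle() {
  local file="$1" tool
  case "$file" in
    *.kjb) tool="kitchen.sh" ;;
    *.ktr) tool="pan.sh" ;;
    *) echo "unsupported file type: $file" >&2; return 1 ;;
  esac
  if [ -n "${DRY_RUN:-}" ]; then
    # Only print the command so it can be inspected before running it.
    echo "$KETTLE_HOME/$tool -file=$file -level=Basic"
  else
    "$KETTLE_HOME/$tool" -file="$file" -level=Basic
  fi
}
```

For example, DRY_RUN=1 run_kettle /home/ubuntu/etl/load.kjb prints the Kitchen command without executing it (the file path here is made up).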

Using Spoon from Windows has its caveats. Certain design steps try to connect to the running instance of Hadoop, HBase etc., which in my case is not possible as my Windows PC resides in a private network.

After 2-3 days of struggle, I am at least happy that a few things worked.


HBase Master startup issue

I had a tough time trying to solve an issue that occurred when HBase started. The HBase master failed to start with the error "host name cant be null".

HBase was earlier started as a 2-node cluster (2 region servers) pointing to a specific HDFS folder. When data was inserted back then, HBase kept references to the hostnames in the cluster.

Now when I start HBase with just one node, it looks at the data that was there earlier and fails to find one of the nodes, which is no longer in use.

The solution is to delete the HDFS folder that was used earlier (if you do not mind the loss of data).
I restarted HBase after deleting this folder and it worked fine.

Another option would be to make HBase point to a different HDFS folder (the hbase.rootdir property in hbase-site.xml).
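
The delete-and-restart fix could be scripted roughly as below. This is only a sketch: the HBASE_HOME default and the /hbase folder are assumptions (use whatever hbase.rootdir points to in your setup), and run and reset_hbase_root are made-up helper names. hadoop fs -rmr is the recursive delete in Hadoop 1.x.

```shell
#!/usr/bin/env bash
# Assumed locations; the real values depend on your installation and hbase.rootdir.
HBASE_HOME="${HBASE_HOME:-/usr/local/hbase}"
HBASE_ROOT_DIR="${HBASE_ROOT_DIR:-/hbase}"

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# WARNING: this deletes all HBase data stored under HBASE_ROOT_DIR.
reset_hbase_root() {
  # Stopping may fail if the master never came up; carry on regardless.
  run "$HBASE_HOME/bin/stop-hbase.sh" || true
  run hadoop fs -rmr "$HBASE_ROOT_DIR" || return 1
  run "$HBASE_HOME/bin/start-hbase.sh"
}
```

Running DRY_RUN=1 reset_hbase_root first shows the three commands without touching any data.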

Hope this helps!!

Friday, 14 December 2012

Sqoop-Hadoop integration


When Sqoop is integrated with an already running Hadoop cluster, you might face several issues, including the following. I faced these when I tried to import data from MySQL into my Hadoop instance.

Keep these points in mind:
1. Install the JDK; the JRE alone is not enough.

2. You need the JDBC driver for the database from which the data is to be imported. Copy the driver jar to $SQOOP_HOME/lib

3. The hadoop jar that is bundled with sqoop may not be compatible with your hadoop cluster. Replace the hadoop jar bundled in sqoop with the hadoop core jar from your hadoop installation.

4. Make sure that sqoop is using the right hadoop installation. If not, you may have to tweak the $SQOOP_HOME/usr/lib/sqoop/bin/configure-sqoop file.

Exception caused by points 3 and 4:
ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:403)
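
A minimal sketch of the jar swap from point 3, assuming bash. The replace_hadoop_jar helper is a made-up name, and it keeps the bundled jar with a .bak suffix so the change can be undone; where exactly the cluster's hadoop-core jar lives depends on your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the hadoop-core jar bundled with Sqoop by the cluster's own.
# Arguments: <sqoop lib dir> <path to the cluster's hadoop-core jar>
replace_hadoop_jar() {
  local sqoop_lib="$1" cluster_jar="$2" old
  [ -f "$cluster_jar" ] || { echo "missing jar: $cluster_jar" >&2; return 1; }
  # Keep the bundled jar(s) with a .bak suffix so the change can be undone.
  for old in "$sqoop_lib"/hadoop-core*.jar; do
    [ -e "$old" ] && mv "$old" "$old.bak"
  done
  cp "$cluster_jar" "$sqoop_lib"/
}
```

It could be called as replace_hadoop_jar "$SQOOP_HOME/lib" "$HADOOP_HOME/hadoop-core-1.1.1.jar", where the jar path is an assumption for a Hadoop 1.1.1 install. After that, an import along the lines of sqoop import --connect jdbc:mysql://<<host>>/<<database>> --username root -P --table <<table>> -m 1 should no longer hit the version mismatch; -m 1 runs a single mapper, which avoids the need for a split column when the table has no primary key.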

Thursday, 13 December 2012

Ubuntu mysql installation



MySQL Server Installation

sudo apt-get install mysql-server

This would prompt for a root user password.

To verify the installation:

sudo netstat -tap | grep mysql

tcp        0      0 ip6-localhost:mysql     *:*                     LISTEN      14731/mysqld

Then type:
sudo mysql -u root -p

It prompts for the password, after which it should take you to the mysql shell.

mysql> CREATE DATABASE sqoopDB;

mysql> USE sqoopDB;

mysql> CREATE TABLE sample (name VARCHAR(10), age VARCHAR(10));

mysql> DESCRIBE sample;

mysql> SHOW TABLES;

mysql> INSERT INTO sample VALUES ('COGN1', '25');

That's it.

Wednesday, 12 December 2012

Apache flume - HDFS sink

Points to keep in mind when Flume is configured to use the HDFS sink.

Problem: When I tried to point the HDFS sink to my already running HDFS instance, I got the following exception.
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.SequenceFile$CompressionType

Solution: I copied the hadoop-core and commons-configuration jars from $HADOOP_INSTALL/lib to $FLUME_INSTALL/lib and it worked.
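
The copy step can be sketched as a small bash function. copy_hdfs_sink_jars is a made-up helper name, and the jar name patterns are assumptions about how the files are named in your Hadoop version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the jars the HDFS sink needs from Hadoop's lib folder into Flume's.
# Arguments: <hadoop lib dir> <flume lib dir>
copy_hdfs_sink_jars() {
  local hadoop_lib="$1" flume_lib="$2" pattern jar found
  for pattern in 'hadoop-core*.jar' 'commons-configuration*.jar'; do
    found=0
    # $pattern is left unquoted on purpose so the shell expands the wildcard.
    for jar in "$hadoop_lib"/$pattern; do
      [ -f "$jar" ] || continue
      cp "$jar" "$flume_lib"/ && found=1
    done
    # Fail loudly if one of the two jars was not found.
    [ "$found" -eq 1 ] || { echo "no $pattern in $hadoop_lib" >&2; return 1; }
  done
}
```

Usage would be copy_hdfs_sink_jars "$HADOOP_INSTALL/lib" "$FLUME_INSTALL/lib", followed by a restart of the Flume agent.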


Monday, 10 December 2012

SSH Key - Permission Denied

Every time I create a new RSA key-pair and configure it on a Linux box, I get the Permission Denied (Public Key) issue and end up spending 10 minutes finding a solution.
So, after this post, I need not run around for the solution.

Make sure the permissions of the .ssh folder are 0700
Make sure the permissions of the authorized_keys file are 0600
Make sure the user owns the .ssh folder and contents 

Execute the following commands:
sudo chown -R <<username>>:<<usergroup>> /home/<<username>>/.ssh
sudo chmod 0700 /home/<<username>>/.ssh
sudo chmod 0600 /home/<<username>>/.ssh/authorized_keys
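
To confirm the result, a small check like the one below can help. It is only a sketch: check_ssh_perms is a made-up helper, it relies on GNU stat (as found on Ubuntu), and it compares the owner against the user running the check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report deviations from the permissions listed above.
# Argument: <path to the .ssh folder>. Returns 0 when everything matches, 1 otherwise.
check_ssh_perms() {
  local dir="$1" bad=0
  [ "$(stat -c %a "$dir")" = "700" ] || { echo "$dir should be 700"; bad=1; }
  if [ -f "$dir/authorized_keys" ]; then
    [ "$(stat -c %a "$dir/authorized_keys")" = "600" ] ||
      { echo "$dir/authorized_keys should be 600"; bad=1; }
  fi
  # Compare numeric owner id with the id of the user running this check.
  [ "$(stat -c %u "$dir")" = "$(id -u)" ] || { echo "$dir is owned by another user"; bad=1; }
  return "$bad"
}
```

Run it as the user in question, for example check_ssh_perms ~/.ssh; it prints nothing when all three conditions hold.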
A little background on SSH key-pairs:
Let's say that machine 1 (M1) wants to communicate with machine 2 (M2) using an SSH key-pair (private/public).
1. M1 uses its private key to communicate with M2. This private key is not known to anyone else. (Usually this private key is available in the /home/<<username>>/.ssh/id_rsa file.)
2. M2 should have M1's public key added to /home/<<username>>/.ssh/authorized_keys.
From M1, ssh <<username>>@M2 would use the id_rsa private key by default.
Alternatively, the private key can be specified using the -i option as follows:
ssh -i mykey <<username>>@M2
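
Step 2 of the background, done by hand on M2, might look like the sketch below. install_pubkey is a made-up helper; it appends the public key only if that exact line is not already there and sets the permissions mentioned earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper, to be run on M2: add M1's public key to a user's authorized_keys.
# Arguments: <public key file copied from M1> <.ssh folder of the target user>
install_pubkey() {
  local pubkey="$1" ssh_dir="$2"
  [ -f "$pubkey" ] || { echo "missing key file: $pubkey" >&2; return 1; }
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
  touch "$ssh_dir/authorized_keys" && chmod 600 "$ssh_dir/authorized_keys" || return 1
  # Append only if this exact key line is not already present.
  grep -qxFf "$pubkey" "$ssh_dir/authorized_keys" ||
    cat "$pubkey" >> "$ssh_dir/authorized_keys"
}
```

On M1 the pair itself can be created with ssh-keygen -t rsa -f mykey, which writes mykey (private) and mykey.pub (public). Copy mykey.pub to M2 and run install_pubkey mykey.pub /home/<<username>>/.ssh there.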