Geo IP client

Sometimes you might want to find out where your web traffic is coming from. In that case you can look at the access log generated by your web server, which gives you the IP address of each client. Once you have the client's IP you can use the GeoIP database to look up the city, state, and country for that IP. You can download the sample project from here. This is what a sample log line looks like

125.215.163.73 - - [29/Aug/2011:00:00:18 -0700] "GET /blog/geekery/xvfb-firefox.html HTTP/1.1" 200 9767 "http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBkQFjAA&url=http%3A%2F%2Fwww.semicomplete.com%2Fblog%2Fgeekery%2Fxvfb-firefox.html&rct=j&q=xvfb%20firefox%20fully%20loaded%20screenshot&ei=zDhbTvX6N-jmiAK5_7i4CQ&usg=AFQjCNEFaYxjYoKmLd5CLaG3SbQNStGkLg&sig2=oZnDDXKzFB8uwbDg5aNi2w&cad=rja" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.20) Gecko/20110803 Firefox/3.6.20 FirePHP/0.5"
Now you can download the GeoIP database from MaxMind and create a simple Java client that, given an IP address, prints its geo location. You can download the source code for my project from here.
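A minimal sketch of such a client, assuming the MaxMind GeoIP2 Java API (com.maxmind.geoip2) and a local copy of the GeoLite2-City.mmdb database; the database file name, the class name, and the list of addresses to look up are placeholders you would adjust for your own setup.

import java.io.File;
import java.net.InetAddress;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

public class GeoIPClient {
    public static void main(String[] args) throws Exception {
        // Open the GeoLite2 City database downloaded from MaxMind (path is a placeholder)
        DatabaseReader reader =
                new DatabaseReader.Builder(new File("GeoLite2-City.mmdb")).build();

        // Look up client IPs taken from the access log; this one is the address
        // from the sample log line above
        String[] clientIps = {"125.215.163.73"};
        for (String ip : clientIps) {
            CityResponse response = reader.city(InetAddress.getByName(ip));
            System.out.println("City " + response.getCity().getName());
            System.out.println("ZIP Code " + response.getPostal().getCode());
            System.out.println("Country " + response.getCountry().getName());
            System.out.println("Location " + response.getLocation());
        }
        reader.close();
    }
}

If you run it for a couple of addresses from your log, your output should look something like this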

City Tokyo
ZIP Code null
Country Japan
Location Location [latitude=35.685, longitude=139.7514, timeZone=Asia/Tokyo]
City Fremont
ZIP Code 94536
Country United States
Location Location [latitude=37.567, longitude=-121.9829, metroCode=807, timeZone=America/Los_Angeles]

Connecting to Hive using Beeline client

I like the Hive Beeline client better than the default hive command line interface. On my Cloudera VM 4.4.0-1 I can connect with the Beeline client by executing the following command: beeline -u jdbc:hive://

Connecting to Hive using JDBC client

I wanted to try out connecting to Hive using a JDBC driver, so I followed these steps. You can download the Maven project from this location
  1. First start HiveServer by executing the hive --service hiveserver command
  2. Make sure that Hive is listening by executing netstat -tnlp | grep 10000
  3. Now use Java code like the sketch below for connecting to Hive and executing a simple select query
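A minimal sketch of such a client, assuming the HiveServer1 JDBC driver (org.apache.hadoop.hive.jdbc.HiveDriver, which comes with the hive-jdbc dependency) listening on the default port 10000; the table name mytable is a placeholder for any table in your metastore.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJDBCClient {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver for HiveServer1
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to the HiveServer started with 'hive --service hiveserver'
        Connection con =
                DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // 'mytable' is a placeholder; use any table that exists in your Hive metastore
        ResultSet rs = stmt.executeQuery("SELECT * FROM mytable LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}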

MapReduce program that reads input files from S3 and writes output to S3

In the WordCount(HelloWorld) MapReduce program entry I talked about how to create a simple WordCount MapReduce program with Hadoop. I wanted to change it so that it reads input files from an Amazon S3 bucket and writes output back to an Amazon S3 bucket, so I built the S3MapReduce program, which you can download from here. I followed these steps
  1. First create two buckets in your Amazon S3 account, one for storing input and the other for storing output. The most important thing here is to make sure that you create your buckets in the US Standard region; if you don't, additional steps might be required for Hadoop to be able to access them. The name of the input bucket in my case is com.spnotes.hadoop.wordcount.books and the name of the output bucket is com.spnotes.hadoop.wordcount.output
  2. Upload a few .txt files that you want to use as input into your input bucket
  3. The next step is to create the MapReduce program. In my case one Java class has the code for the Mapper, the Reducer, and the driver. Most of the MapReduce code is the same; the only difference is that for working with S3 you have to add a few S3-specific properties like this. Basically you set your accessKey and secretAccessKey, which you can get from the AWS security console, and you also have to tell Hadoop to use s3n as the file system.
    
    //Replace this value
    job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
    //Replace this value
    job.getConfiguration().set("fs.s3n.awsSecretAccessKey","awssecretaccesskey");
    job.getConfiguration().set("fs.default.name","s3n://com.spnotes.hadoop.input.books");
    
  4. Now the last step is to execute the program. It takes two inputs; you can just right-click on your S3MapReduce program and run it with the following two parameters
    
    s3n://com.spnotes.hadoop.wordcount.books s3n://com.spnotes.hadoop.wordcount.output/output3
    
  5. Once the MapReduce job has finished you can check the output by going to the S3 console and looking at the content of com.spnotes.hadoop.wordcount.output
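For context, here is a minimal sketch of how those S3 settings and the two bucket arguments fit into the driver. The class names S3MapReduceDriver, WordCountMapper and WordCountReducer are placeholders: the Mapper and Reducer are the same as in the WordCount(HelloWorld) entry, and you replace the key values with your own AWS credentials.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3MapReduceDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "S3MapReduce");
        job.setJarByClass(S3MapReduceDriver.class);

        // Replace these two values with the keys from your AWS security console
        job.getConfiguration().set("fs.s3n.awsAccessKeyId", "awsaccesskey");
        job.getConfiguration().set("fs.s3n.awsSecretAccessKey", "awssecretaccesskey");
        // Tell Hadoop to use s3n as the file system
        job.getConfiguration().set("fs.default.name", "s3n://com.spnotes.hadoop.wordcount.books");

        // Same word-count Mapper and Reducer as in the WordCount(HelloWorld) entry
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] and args[1] are the s3n:// input and output paths shown in step 4
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}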

Hadoop MapReduce HTTP Notification

MapReduce programs tend to run for a long time, and you might want a way to get notified when a job is done so you can find out whether it was executed successfully or not. Hadoop provides a mechanism that lets you get a notification about the completion status of your MapReduce job. This is a two-step process
  1. First set the job.end.notification.url property to the URL of a web application that Hadoop should invoke when the job finishes; Hadoop substitutes the job id and final status into the URL
    
    job.getConfiguration().set("job.end.notification.url", "http://localhost:8080/mrnotification/MRNotification/$jobId?status=$jobStatus");
    
  2. Next create a web application that will receive the notification and print out the job id and job status, like the sketch below
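A minimal sketch of such a receiver, written here as a plain servlet; the class name and URL mapping are placeholders chosen to match the notification URL above.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Assumed to be mapped to /MRNotification/* so the job id arrives as extra path info
public class MRNotificationServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // For a request like /MRNotification/<jobId>?status=<jobStatus>
        String jobId = request.getPathInfo();
        String jobStatus = request.getParameter("status");
        System.out.println("Job " + jobId + " finished with status " + jobStatus);
        response.setStatus(HttpServletResponse.SC_OK);
    }
}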

Using the output of one MapReduce program as input to another MapReduce program - KeyValueTextInputFormat

In the WordCount(HelloWorld) MapReduce program entry I blogged about how to create a MapReduce program that takes a text file as input and generates output that tells you the frequency of each word in the input file. I wanted to take that a step further by reading the output generated by the first MapReduce job and figuring out which word is used most frequently and how many times it is used. So I developed the HadoopWordCountProcessor program to do that.
  1. First take a look at the output generated by the HadoopWordCount program. In that program I used TextOutputFormat as the output format class, which generates output with one key-value pair per line, separated by a tab character
    
    XXX 3
    YYY 3
    ZZZ 3
    aaa 10
    bbb 5
    ccc 5
    ddd 5
    eee 5
    fff 5
    ggg 5
    hhh 5
    iii 5
    
  2. Next create the WordCountProcessorMapper.java class. This class receives Text as both the key and the value; the only thing I am doing here is converting the Text value (the count) into an IntWritable and writing it to the output (see the combined sketch after this list).
  3. The reducer class is where I get all the words as keys and their frequencies as values. In this class I keep track of the highest-frequency word (you have to copy the key and value of the highest-occurring word into a local variable for this to work, because Hadoop reuses the key and value objects it sends to the reducer).
  4. The last step is to create the driver class. Note one thing about the driver: I am calling job.setInputFormatClass(KeyValueTextInputFormat.class);, which sets KeyValueTextInputFormat as the input format class. Once I do that, Hadoop takes care of reading the input, breaking each line into a key and a value, and passing them to my Mapper class.
  5. The next step is to execute the WordCountProcessor.java class with the output of the first MapReduce program as input, passing a couple of arguments like this
    
    file:///Users/gpzpati/hadoop/output/wordcount file:///Users/gpzpati/hadoop/output/wordcount2
    
    It will generate output like this, which says aaa is the most frequently used word and that it appeared 10 times
    
    aaa 10
    
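Here is a minimal combined sketch of the Mapper, Reducer and driver described in steps 2 to 4. The class names follow the entry; forcing a single reducer so that the maximum is global is my own assumption about how the job needs to be configured.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountProcessor {

    // KeyValueTextInputFormat hands the word to the Mapper as the key and the count
    // as the value, both as Text; the Mapper only converts the count to an IntWritable
    public static class WordCountProcessorMapper
            extends Mapper<Text, Text, Text, IntWritable> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, new IntWritable(Integer.parseInt(value.toString())));
        }
    }

    // Tracks the highest frequency across all reduce() calls and emits it in cleanup();
    // the key is copied because Hadoop reuses the objects it passes to reduce()
    public static class WordCountProcessorReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private Text maxWord = new Text();
        private int maxCount = -1;

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            for (IntWritable value : values) {
                if (value.get() > maxCount) {
                    maxCount = value.get();
                    maxWord.set(key); // copy instead of keeping a reference to the reused key
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws java.io.IOException, InterruptedException {
            context.write(maxWord, new IntWritable(maxCount));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "WordCountProcessor");
        job.setJarByClass(WordCountProcessor.class);

        // Tell Hadoop the input is already tab-separated key/value pairs
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(WordCountProcessorMapper.class);
        job.setReducerClass(WordCountProcessorReducer.class);
        job.setNumReduceTasks(1); // a single reducer so the maximum is global
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}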

Use MRUnit for testing your MapReduce program

In the WordCount(HelloWorld) MapReduce program entry I blogged about how to create the WordCount (HelloWorld) MapReduce program. You can use Apache MRUnit, a unit testing framework for MapReduce, to test it. I used MRUnit to develop unit tests for my WordCount program.
  1. First I created a unit test for my WordCountMapper class. The basic idea is that you set the input and the expected output on the test driver and then execute the test by calling mapDriver.runTest() (see the sketch after this list)
  2. Then I created a unit test for my WordCountReducer class in the same way
  3. The last part was to develop an end-to-end test; in this case you set up both the Mapper and the Reducer class and then run them end to end
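A minimal sketch of the three tests, assuming MRUnit's new-API drivers (org.apache.hadoop.mrunit.mapreduce) plus JUnit, and assuming the WordCountMapper and WordCountReducer classes behave as described in the WordCount(HelloWorld) entry; the sample inputs and expected outputs are placeholders.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
    private MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

    @Before
    public void setUp() {
        // WordCountMapper/WordCountReducer are the classes from the WordCount entry
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
        reduceDriver = ReduceDriver.newReduceDriver(new WordCountReducer());
        mapReduceDriver = MapReduceDriver.newMapReduceDriver(new WordCountMapper(), new WordCountReducer());
    }

    @Test
    public void testMapper() throws Exception {
        // One line of input in, one (word, 1) pair out for every word, in order
        mapDriver.withInput(new LongWritable(1), new Text("aaa bbb aaa"));
        mapDriver.withOutput(new Text("aaa"), new IntWritable(1));
        mapDriver.withOutput(new Text("bbb"), new IntWritable(1));
        mapDriver.withOutput(new Text("aaa"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() throws Exception {
        // The reducer sums the list of counts for a word
        reduceDriver.withInput(new Text("aaa"), Arrays.asList(new IntWritable(1), new IntWritable(1)));
        reduceDriver.withOutput(new Text("aaa"), new IntWritable(2));
        reduceDriver.runTest();
    }

    @Test
    public void testMapperReducer() throws Exception {
        // End to end: map then reduce a single line; keys come out sorted
        mapReduceDriver.withInput(new LongWritable(1), new Text("aaa bbb aaa"));
        mapReduceDriver.withOutput(new Text("aaa"), new IntWritable(2));
        mapReduceDriver.withOutput(new Text("bbb"), new IntWritable(1));
        mapReduceDriver.runTest();
    }
}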

Maven script for running Hadoop programs

In the WordCount(HelloWorld) MapReduce program entry I talked about how to create a WordCount MapReduce program. While developing a MapReduce program I follow a pattern in which I first develop it test-first with MRUnit, then execute it locally using the driver. But the last step for me is always to copy the program to my Cloudera VM and execute it there. I built this Maven script to help with that last part, which is to scp the .jar file to the Cloudera Hadoop VM and then execute it. This is the script I use. When I create a new MapReduce program I have to make a couple of changes in it, but I can always reuse most of it
  1. Change the value of scp.host to point to my Hadoop VM; if you changed the username and password on your VM you will have to change those too
  2. Next I have to change the value of the mainClass attribute to point to the correct driver class for the MapReduce program I am developing. In this case the name of the driver class is com.spnotes.hadoop.WordCountDriver
  3. Then I have to change the value of the command attribute in the sshexec element. The command is made up of different parts
    hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs
    In this command ${scp.dirCopyTo}/${project.build.finalName}.jar points to the .jar file that was scp'd to the VM, books/dickens.txt is the path of the input text file (in this case I am using HDFS as the input location, which points to hdfs://localhost/user/cloudera/books/dickens.txt), and the output of the MapReduce job will be generated in hdfs://localhost/user/cloudera/wordcount/outputs
You can run the maven antrun:run command to execute the Maven task that deploys the MapReduce jar to the Cloudera VM and executes it. You can download the full project from here. A sketch of the relevant part of the pom.xml is shown below.
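This is only a sketch, assuming the maven-jar-plugin for the mainClass setting and the maven-antrun-plugin with the Ant scp/sshexec tasks (provided by the ant-jsch and jsch dependencies); the scp.user, scp.password, scp.host and scp.dirCopyTo properties and the version numbers are placeholders you set for your own VM and project.

<build>
  <plugins>
    <!-- Put the driver class into the jar manifest so 'hadoop jar' can find it -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.spnotes.hadoop.WordCountDriver</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
    <!-- antrun target that copies the jar to the Hadoop VM and runs it there -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-antrun-plugin</artifactId>
      <version>1.7</version>
      <dependencies>
        <!-- give Ant the scp and sshexec tasks -->
        <dependency>
          <groupId>org.apache.ant</groupId>
          <artifactId>ant-jsch</artifactId>
          <version>1.9.4</version>
        </dependency>
        <dependency>
          <groupId>com.jcraft</groupId>
          <artifactId>jsch</artifactId>
          <version>0.1.51</version>
        </dependency>
      </dependencies>
      <configuration>
        <target>
          <scp file="${project.build.directory}/${project.build.finalName}.jar"
               todir="${scp.user}:${scp.password}@${scp.host}:${scp.dirCopyTo}"
               trust="true"/>
          <sshexec host="${scp.host}" username="${scp.user}" password="${scp.password}"
                   trust="true"
                   command="hadoop jar ${scp.dirCopyTo}/${project.build.finalName}.jar books/dickens.txt wordcount/outputs"/>
        </target>
      </configuration>
    </plugin>
  </plugins>
</build>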

WordCount(HelloWorld) MapReduce program

I am learning about MapReduce, and in order to experiment with it I created this simple program which takes a text file as input and generates output that shows how frequently each word appears in the file. You can download the source code for the program from here
  1. First I started by creating a simple Mapper which receives the content of the text file one line at a time. The Mapper takes care of splitting the line into words and then writes every word to the output with a frequency count of 1 by calling context.write(word, one). In this case the word becomes the key and the count becomes the value (see the combined sketch at the end of this entry)
  2. Next I had to develop a Reducer class, which receives a word as the key and a list of all the counts as the value. For example, if your input file is simple text like aaa bbb ccc aaa, then the reduce class will get called with aaa - [1, 1], bbb - [1] and ccc - [1] as input. The Hadoop framework takes care of collecting the output of the Mapper and converting it into the key - [value, value] format. In the reducer the only thing I had to do was iterate through all the values and come up with a total count. Once I have that I write it as the output of the Reducer by calling context.write(key, new IntWritable(sum));
  3. The last part is creating WordCountDriver.java, which is a Java program that sets up the Hadoop framework by defining the inputs and outputs and specifying the names of the Mapper and Reducer classes. After initializing Hadoop it calls job.waitForCompletion(true); this method takes care of passing control to the Hadoop framework and waiting for the job to complete
  4. Now you can either use an existing .txt file on your machine or create a text file like this
    
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa
    XXX YYY ZZZ
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa 
    XXX YYY ZZZ
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa 
    XXX YYY ZZZ
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa 
    hhh eee  iii bbb ccc fff ddd  ggg aaa aaa
    
  5. The last step is to run your Hadoop program. If you used Eclipse or some other IDE for developing your code, you can run the program directly by running WordCountDriver.java. The program takes two parameters; in my case, since the input file is on the local file system and I want the output stored on the local file system too, I pass the following two parameters
    
    file:///Users/sunil/hadoop/sorttest.txt file:///Users/sunil/hadoop/output/wordcount
    
  6. Once the program finishes successfully you should see a part-r-00000 file created on your local machine at /Users/sunil/hadoop/output/wordcount; if you open it you should see output like this
    
    XXX     3
    YYY     3
    ZZZ     3
    aaa     10
    bbb     5
    ccc     5
    ddd     5
    eee     5
    fff     5
    ggg     5
    hhh     5
    iii     5
    
If you want to run this program with a bigger text file you can download a few good classical books from the Algorithms site data section
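For reference, here is a minimal combined sketch of the Mapper, Reducer and driver described in steps 1 to 3. In the real project these would typically be separate classes (the driver being com.spnotes.hadoop.WordCountDriver); they are nested in one file here only for brevity.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Step 1: the Mapper gets one line at a time, splits it into words and
    // writes (word, 1) for every word
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Step 2: the Reducer gets (word, [1, 1, ...]) and sums the counts
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Step 3: the driver wires everything together and hands control to Hadoop
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "WordCount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}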