Installing Hadoop 0.20.2 in Ubuntu 11.04 x86 with Eclipse
This is the first post of a two-part blog detailing Hadoop, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce configuration and programming. I decided to write this because when I was required to learn Hadoop for a class about a month ago, there really wasn’t a lot on the web about Hadoop and how to get it working properly. Furthermore, all of the MapReduce programming examples I found had a lot of deprecated code in them, making them difficult to work with. I don’t claim to be a Hadoop expert, but I know enough to get around and write this blog.
So let’s begin. I’m going to assume that an Ubuntu 11.04 i386 installation has been completed successfully and you’re at the main desktop. Just to recap, Ubuntu is at www.ubuntu.com. A default install can quickly be done by anyone with half of a brain (or more), so I’m going to skip over that part :) After Ubuntu was installed, the first thing I did was open a terminal and type “sudo -s” to elevate myself to the root user (administrator), and then I typed “apt-get update” followed by “apt-get upgrade”. You can also use the graphical user interface (GUI) to update Ubuntu, although most admins use the command line in Linux. After everything was updated, I also installed Google Chrome, since I’m not a huge fan of Firefox anymore (it’s too slow!).
Before I begin, it is important to note that there are several distributions of Hadoop out in the wild, and I found this to be the most confusing concept to understand during my research. Cloudera, Yahoo, and Apache all make their own distributions of Hadoop (those are the ones I’m aware of). I had great difficulty using Cloudera with Eclipse, so I took a hint and used the Apache distribution. That is the one I’m going to install for this blog.
All of the distributions of Hadoop that I encountered required that I install Oracle’s version of Java (formerly Sun’s, who invented Java). The nice part about Ubuntu is that it has partner repositories that can be enabled to make this distribution of Java easy to obtain and install. The first step is to open a terminal and become root with “sudo -s”. Next, I’m a fan of vim, an editor that Ubuntu doesn’t include by default, so I installed it using the command “apt-get install vim”. With vim installed, I edited /etc/apt/sources.list and uncommented (removed the # in front of) the two lines that add software from Canonical’s partner repository. These two lines are:
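The listing itself is not shown above. As a hedged sketch: on a stock Ubuntu 11.04 (natty) install, the two partner entries normally look like the commented lines in the block below, and a small helper can uncomment them instead of editing by hand. The function name enable_partner_repo is made up for illustration, and the exact lines on your system may differ.

```shell
# Uncomment the Canonical partner "deb" and "deb-src" lines in a sources.list file.
# On Ubuntu 11.04 (natty) the two lines normally look like this while commented out:
#   # deb http://archive.canonical.com/ubuntu natty partner
#   # deb-src http://archive.canonical.com/ubuntu natty partner
# enable_partner_repo is a hypothetical helper name, not part of Ubuntu.
enable_partner_repo() {
    list_file="$1"
    cp "$list_file" "$list_file.bak"    # keep a backup of the original
    # Strip the leading "# " only from the partner deb/deb-src lines.
    sed -E 's|^#[[:space:]]*(deb(-src)?[[:space:]]+http://archive\.canonical\.com/ubuntu[[:space:]].*partner[[:space:]]*)$|\1|' \
        "$list_file.bak" > "$list_file"
}

# Example (as root, not run here):
#   enable_partner_repo /etc/apt/sources.list && apt-get update
```

Lines that start with “##” and entries for other repositories are left untouched, and the original file is kept next to it with a .bak suffix.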
Next, run “apt-get update” and then “apt-get install sun-java6-jdk”. You will see a blue screen with a license agreement that you will need to arrow through and accept.
The next step is to install Hadoop. This will be a bit tricky, since Hadoop is going to run in development (pseudo-distributed) mode for the purposes of this blog. Open up a web browser, navigate to http://hadoop.apache.org, and download hadoop-common’s release branch. At the time of this writing, I used hadoop-0.20.2.tar.gz from a mirror listed at http://www.apache.org/dyn/closer.cgi/hadoop/common/. While it downloaded, I also went to http://www.eclipse.org and downloaded the Eclipse IDE for Java Developers (Helios) for 32-bit Linux; the file I actually downloaded was eclipse-java-helios-SR2-linux-gtk.tar.gz. I also visited the hadoop-eclipse-plugin page hosted on Google Code at http://code.google.com/p/hadoop-eclipse-plugin/ and downloaded hadoop-0.20.1-eclipse-plugin.jar from there.
NOTE the version number: Hadoop just released version 0.20.203.0 within the past couple of weeks when this blog was written. I found that this distribution didn’t work with the Eclipse plugin that I really like because the new version expects a different protocol to be followed. This means that every time the Eclipse plugin contacts the Hadoop server, it causes the Hadoop server to crash; I experienced this same problem on Cloudera’s distribution also. Hopefully this is fixed within a few months!
After the files were downloaded, I moved them to my home directory and extracted the two tar.gz archives. Feel free to use either the gui or command line (“tar xzf
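For the command-line route, a minimal sketch is below. It assumes both archives sit in one directory (the home directory in this post), and the helper name extract_archives is made up for illustration. The commented cp line is also an assumption: Eclipse plugins are normally installed by copying the jar into Eclipse’s plugins directory, but the post does not say where the plugin jar was placed.

```shell
# Extract every .tar.gz archive found in a directory, in place.
# extract_archives is a hypothetical helper name.
extract_archives() {
    dir="$1"
    for archive in "$dir"/*.tar.gz; do
        [ -f "$archive" ] || continue    # skip if the glob matched nothing
        tar xzf "$archive" -C "$dir"
    done
}

# Example (not run here):
#   extract_archives "$HOME"
#   # Assumption: install the plugin by copying the jar into Eclipse's plugins directory.
#   cp "$HOME/hadoop-0.20.1-eclipse-plugin.jar" "$HOME/eclipse/plugins/"
```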
Next, navigate to hadoop-0.20.2/conf. In this directory, a few files need to be modified.
hadoop-env.sh:
Define JAVA_HOME by adding the line “export JAVA_HOME=/usr” near the existing JAVA_HOME line (which is commented out by default). If you installed Java differently than I did, you might need the command “which java” to get the path to the Java binary; JAVA_HOME should be the directory that contains bin/java.
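As a rough sketch of that rule, the helper below turns the path of a java binary into a JAVA_HOME value by going up two directory levels (so /usr/bin/java gives /usr) and appends the export line to hadoop-env.sh. Both function names are made up for illustration.

```shell
# Print the directory two levels above a java binary, e.g. /usr/bin/java -> /usr.
# java_home_for and set_java_home are hypothetical helper names.
java_home_for() {
    dirname "$(dirname "$1")"
}

# Append an "export JAVA_HOME=..." line to the given hadoop-env.sh.
set_java_home() {
    env_file="$1"
    java_bin="$2"
    printf 'export JAVA_HOME=%s\n' "$(java_home_for "$java_bin")" >> "$env_file"
}

# Example (not run here):
#   set_java_home "$HOME/hadoop-0.20.2/conf/hadoop-env.sh" "$(which java)"
```

Appending at the end of the file works because hadoop-env.sh is a plain shell script; a later export simply overrides any earlier one.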
core-site.xml:
hdfs-site.xml:
mapred-site.xml:
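The contents of these three files are not shown above. As a stand-in, here is a sketch that writes the standard single-node (pseudo-distributed) values from the Apache Hadoop 0.20 quick start guide. The values used in this post may have been different, and the helper names write_site_file and write_pseudo_conf are made up for illustration.

```shell
# Write a Hadoop site file containing a single property.
# Usage: write_site_file FILE NAME VALUE   (hypothetical helper)
write_site_file() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Write the three pseudo-distributed site files into a conf directory.
# Values follow the Apache quick start, not necessarily this post.
write_pseudo_conf() {
    conf_dir="$1"
    write_site_file "$conf_dir/core-site.xml"   fs.default.name    hdfs://localhost:9000
    write_site_file "$conf_dir/hdfs-site.xml"   dfs.replication    1
    write_site_file "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
}

# Example (not run here):
#   write_pseudo_conf "$HOME/hadoop-0.20.2/conf"
```

If these values are used, any client that talks to Hadoop, including the Eclipse plugin, would need to point at the same ports: 9000 for HDFS and 9001 for the JobTracker.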
Once these files are modified, we also need to set up SSH keys so that the Hadoop startup script can run properly. This needs to be done via the command line as the regular user, not as root!
Next, we need to enable the SSH server on Ubuntu for this to work.
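The exact commands for these two SSH steps are not shown above, so the following is only a sketch of one common way to do it: create an RSA key with an empty passphrase, authorize it for the same account, then install the OpenSSH server. The function name setup_ssh_key is made up for illustration, and the key type is an assumption (the Hadoop quick start uses a DSA key instead).

```shell
# Create a passphrase-less RSA key (if none exists) and authorize it for
# logins to the same account. Run as the regular user, not root.
# setup_ssh_key is a hypothetical helper name.
setup_ssh_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    [ -f "$ssh_dir/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa"
    # Append the public key only if it is not already authorized.
    touch "$ssh_dir/authorized_keys"
    grep -qxF "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys" \
        || cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
}

# Example (not run here):
#   setup_ssh_key                          # as the regular user
#   sudo apt-get install openssh-server    # installs and starts the SSH server
#   ssh localhost                          # should log in without a password prompt
```

The empty passphrase matters because the startup script logs in to localhost over SSH non-interactively to launch each daemon.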
Provided you didn’t get any errors, Hadoop is all set to go. To start Hadoop, open up an SSH window and enter the following commands (at least for my setup, this worked).
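The commands themselves are not shown above. A typical first start for Hadoop 0.20.x, sketched here as an assumption rather than the exact commands from this post, formats the HDFS namenode once and then runs the start-all.sh script from the Hadoop directory. The function name start_hadoop is made up for illustration.

```shell
# Format HDFS and start all Hadoop daemons from a Hadoop install directory.
# start_hadoop is a hypothetical helper name. Formatting is for the first run
# only: it wipes any existing HDFS metadata.
start_hadoop() {
    (
        cd "$1" || exit 1
        bin/hadoop namenode -format || exit 1
        bin/start-all.sh
    )
}

# Example (as the regular user, not run here):
#   start_hadoop "$HOME/hadoop-0.20.2"
```

The subshell keeps the directory change from leaking into the calling terminal session.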
At this point, provided there were no errors, Hadoop is now running. I’m going to note that the HDFS is physically in the temporary directory (/tmp). After any reboots of Ubuntu, all of the files in the HDFS will be lost! For me, this didn’t matter any, but it might for someone else.
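If losing the data does matter, one option (not something done in this post) is to set the hadoop.tmp.dir property in core-site.xml to a directory outside /tmp, since Hadoop 0.20 derives its default HDFS storage paths from that property. The sketch below only prints the property element to paste inside the configuration element; the function name and the example path are made up for illustration.

```shell
# Print a <property> element that points hadoop.tmp.dir at a given directory.
# tmp_dir_property is a hypothetical helper name.
tmp_dir_property() {
    cat <<EOF
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$1</value>
  </property>
EOF
}

# Example (not run here; the path is only an illustration):
#   tmp_dir_property "$HOME/hadoop-data"
```

After changing this property, the namenode would need to be formatted again, because the new directory starts out empty.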
The next step was to open up Eclipse. I used the default workspace location and opened up the workbench. Next, I went to the “Window” menu, chose “Open Perspective”, and clicked “Other”. I highlighted Map/Reduce and clicked OK. I clicked on “Map/Reduce Locations” at the bottom and then clicked the very tiny “New Hadoop location” button towards the top right of that pane. In the new window, the settings were set to:
The rest of the settings were fine at the default. I then clicked Finish. Finally, go to “Window” and click “Preferences”. Click on “Hadoop Map/Reduce” and click Browse. Find where you put the hadoop-0.20.2 directory and go into it. Finally, click OK twice. From Eclipse, you can modify files on the Hadoop filesystem drive and you can also run Hadoop jobs from within Eclipse!
NOTE: The Hadoop file system works only if you did everything right. If you didn’t get good results, here are a few tips. First, try re-running the start-all.sh script. If Hadoop is already running properly, the script will report, for each of the five daemons, that the process is already running and needs to be stopped first. Another tip is to look at the log files under the “logs” directory. Between these two things, I was able to get my Hadoop install working on a virtual machine in a little under an hour. In Eclipse, I wasn’t able to make “Run on Hadoop” work properly (due to the age of the plugin), but running as a Java Application was enough to let me debug everything locally.
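Another quick check, sketched here as a suggestion rather than something from the original setup, is the JDK’s jps tool, which lists running Java processes by class name. The five daemons that start-all.sh launches in Hadoop 0.20 are NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, and the helper below (the name missing_daemons is made up) prints whichever of them do not appear in the jps output.

```shell
# Print the names of Hadoop daemons that are absent from jps output.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    jps_output="$1"
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <ClassName>"; require a space before the name so
        # that SecondaryNameNode does not count as NameNode.
        printf '%s\n' "$jps_output" | grep -q " $daemon\$" || printf '%s\n' "$daemon"
    done
}

# Example (not run here):
#   missing_daemons "$(jps)"
```

An empty result means all five daemons are up; any name that is printed tells you which file under the “logs” directory to read first.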
My next blog will detail how to use Eclipse to develop a Hadoop solution for Dijkstra’s shortest path graph algorithm. I also will explain how Hadoop works a little more.
As always, thanks for reading!
Tags: apache, eclipse, hadoop, mapreduce, plugin, ubuntu
Posted in Programming
[...] blog thinks that I forgot about the 2nd part of my original blog post (Installing Hadoop, Located here). If you thought that I forgot, think again! Just as a recap, I left off with a fully [...]
Hi Phil, I followed the very same instructions in the blog above to set up Hadoop on my machine and configure it with Eclipse, but when I try to build, I always get errors about class files not found in the Eclipse plugin jar. Two or three times I found the reported jars on the net and added them to the build path of my project. Even after that, it is showing errors about other class files missing from the same Eclipse plugin.
Can you please provide a solution for this?
Thank you.
Sounds like you need to replicate my environment exactly. I know that Hadoop is a pain to get working because Hadoop is cutting edge technology and the plugins aren’t keeping up with it. If you use the exact versions I gave of everything, it theoretically should work, but I can’t really guarantee it since hadoop is so bleeding edge. I remember having a lot of problems when I wrote this tutorial too.
Thank you so much. I had a really hard time getting Hadoop to run with Eclipse. This really helped.
An internal error occurred during: “Map/Reduce location status updater”.
org/codehaus/jackson/map/JsonMappingException
I’m guessing that you’re running newer versions of everything, since this was very version specific. Good luck with getting it working!
In Eclipse, I wasn’t able to make “Run on Hadoop” work properly. How can I check whether the code runs properly or not?