Hadoop and Giraph Installation (Step by Step)

One of the text-mining sub-projects aims to develop a Recommender System for the OpenEdition platforms. We use a graph structure to store OpenEdition documents: each node represents a document (article, book, review, …) and each edge represents a specific relation between two documents (citation, similarity, …).

We chose the Hadoop and Giraph frameworks for graph processing.

In this document, we describe how to set up and configure both Hadoop 2.4.0 and Giraph 1.1.0 on Ubuntu or Debian 64-bit.

Prerequisites

Before starting the installation of Hadoop and Giraph, ensure that you have installed:

  • Java. Hadoop requires Java 7 or a late version of Java 6. Install OpenJDK 7 with the following command:
$ sudo apt-get install openjdk-7-jdk

To check the installed version:

$ java -version
  • SSH. If it is not installed, install it with the following command in a terminal:
$ sudo apt-get install ssh
  • Giraph 1.1.0. It can be cloned with the following command (see the section “Deploying Giraph” below):
$ git clone http://git-wip-us.apache.org/repos/asf/giraph.git
  • Maven 3 or higher. Giraph uses the munge plugin, which requires Maven 3, to support multiple versions of Hadoop. The web site plugin also requires Maven 3.

Steps for Hadoop Installation

1. Add a dedicated hadoop system user

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

2. SSH configuration

$ su - hduser
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

In this step, we generate an RSA key pair and enable passwordless SSH access to the local machine with the newly created key.

3. Download Hadoop 2.4.0

Download version 2.4.0 from the Apache Hadoop site and unpack it to a directory of your choice.

$ sudo tar xzf hadoop-2.4.0.tar.gz

Move the unpacked Hadoop to /usr/local/hadoop-2.4.0, create a convenience symlink /usr/local/hadoop, and make hduser:hadoop the owner:

$ sudo mv hadoop-2.4.0 /usr/local/
$ cd /usr/local
$ sudo ln -s hadoop-2.4.0 hadoop
$ sudo chown -R hduser:hadoop hadoop-2.4.0
$ sudo chown -R hduser:hadoop hadoop
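As a quick sanity check, you can ask the unpacked distribution for its version. The hadoop script needs to know where Java is installed, so export JAVA_HOME in your shell first if it is not already set (the OpenJDK 7 path below is the same one used in configuration step 4.1):

$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
$ /usr/local/hadoop/bin/hadoop version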

4. Hadoop Configuration

In Hadoop 2.x, there are two main components:

→ HDFS: a distributed file system.

→ YARN (Yet Another Resource Negotiator): a resource manager that allocates the containers in which jobs run, using the data stored in HDFS.

Hadoop can be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

To configure Hadoop, some files must be edited.

4.1. Edit etc/hadoop/hadoop-env.sh with the following

# The java implementation to use. Here, 64-bit OpenJDK 7 is used

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Hadoop configuration directory

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java runtime options. Disable IPv6 and set the library directory.

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.library.path=$HADOOP_PREFIX/lib"

# Hadoop native library directory

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

On a 64-bit system you may see the following warning, because the native libraries shipped with the binary distribution are 32-bit; it has been reported not to be a serious issue:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

To make this warning disappear completely, the native Hadoop libraries would have to be recompiled for 64-bit, which is not necessary here.
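If the warning clutters the console, one commonly used workaround (optional, and not something the rest of this guide depends on) is to raise the threshold of the corresponding logger in etc/hadoop/log4j.properties:

# Silence the native-code loader warning (optional)
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR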

4.2. Edit etc/hadoop/core-site.xml with the following

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
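Note that these properties (here and in the following sub-sections) must sit inside the <configuration> root element of each file. For reference, a complete core-site.xml would look roughly like this, assuming no other properties are needed:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
</configuration>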

The second property, hadoop.tmp.dir, sets the base temporary directory for the local file system and HDFS to /app/hadoop/tmp. We therefore have to create this directory and set the required ownership and permissions:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown -R hduser:hadoop /app/hadoop/tmp
# ... and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

4.3. Edit etc/hadoop/hdfs-site.xml with the following

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
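The NameNode and DataNode directories configured above do not exist yet. Creating them up front as hduser (the paths are the same as in the snippet above) avoids permission problems when the daemons start:

$ mkdir -p /home/hduser/mydata/hdfs/namenode
$ mkdir -p /home/hduser/mydata/hdfs/datanode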

4.4. In YARN versions of Hadoop, MapReduce is no longer the only computational model. Edit etc/hadoop/mapred-site.xml with the following
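Note that Hadoop 2.4.0 ships only a template for this file. If etc/hadoop/mapred-site.xml does not exist yet, create it from the template before editing it:

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml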

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

4.5. Edit etc/hadoop/yarn-site.xml with the following

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:8025</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:8030</value>
</property>

<property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8050</value>
</property>

5. Testing a Local MapReduce Job

5.1. Format the HDFS filesystem (run this and the following commands from the Hadoop installation directory, /usr/local/hadoop)

$ bin/hdfs namenode -format

5.2. Start the HDFS and YARN daemons

$ sbin/start-dfs.sh
$ sbin/start-yarn.sh

5.3. Check that all services have started using the ‘jps’ command, whose output should be similar to

$ jps

50831 Jps
65180 DataNode
65035 NameNode
65612 ResourceManager
65386 SecondaryNameNode
65731 NodeManager
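If one of the daemons is missing from this list, its log file is the first place to look. The logs live under the logs/ directory of the Hadoop installation and are named after the daemon and the host; for example (the exact file name will differ on your machine):

$ tail -n 50 logs/hadoop-hduser-namenode-*.log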

5.4. Make the HDFS directories required to execute MapReduce jobs

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hduser

5.5. Copy input files into HDFS

$ bin/hdfs dfs -put etc/hadoop /user/hduser/input

5.6. Run an example

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar grep /user/hduser/input/hadoop output 'dfs[a-z.]+'

5.7. Check the output using

$ bin/hdfs dfs -cat output/*

The output should be similar to:

6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
1 dfs.datanode.data.dir
1 dfs.namenode.name.dir
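Alternatively, the output directory can be copied from HDFS back to the local file system and inspected there (output_local is just an example name for the local copy):

$ bin/hdfs dfs -get output output_local
$ cat output_local/*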

Deploying Giraph

1. Git and Maven 3 or higher must be installed. If they are not, exit from hduser using the ‘exit’ command and run the following

$ sudo apt-get install git
$ sudo apt-get install maven
$ mvn -version
$ cd /usr/local/
$ sudo git clone https://github.com/apache/giraph.git
$ sudo chown -R hduser:hadoop giraph
$ su - hduser

2. Edit $HOME/.bashrc for the hduser account and add the following line

export GIRAPH_HOME=/usr/local/giraph
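Some commands later in this guide also reference $HADOOP_HOME, which has not been set so far. It is convenient to export it in the same file, pointing at the symlink created during the Hadoop installation (this line is an addition for convenience, not part of the Giraph documentation):

export HADOOP_HOME=/usr/local/hadoop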

3. Save and close the file, then run the following

$ source $HOME/.bashrc
$ cd $GIRAPH_HOME

4. To build Giraph 1.1.0 against Hadoop 2.4.0, you have two possibilities

→ if you use YARN:

$ mvn -Phadoop_yarn -Dhadoop.version=2.4.0 -DskipTests package

→ if you use just hadoop2:

$ mvn -Phadoop_2 -Dhadoop.version=2.4.0 -DskipTests package
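Either way, the build places the examples jar under giraph-examples/target/; the file with the -jar-with-dependencies suffix is the one used in the next section. You can check that it is there (the exact name depends on the profile and the Hadoop version you built against):

$ ls giraph-examples/target/giraph-examples-*-jar-with-dependencies.jar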

Running a Giraph job

(from: https://giraph.apache.org/quick_start.html)

1. Create an example graph in /tmp/tiny_graph.txt with the following content. Each line uses the JSON format expected by JsonLongDoubleFloatDoubleVertexInputFormat: [source_id, source_value, [[target_id, edge_value], …]].

[0,0,[[1,1],[3,3]]]

[1,0,[[0,1],[2,2],[3,1]]]

[2,0,[[1,2],[4,4]]]

[3,0,[[0,3],[1,1],[4,4]]]

[4,0,[[3,4],[2,4]]]

2. Save and close the file. Then copy the input file to HDFS

$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/tiny_graph.txt /user/hduser/input/tiny_graph.txt
$ $HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/input

You should see output similar to this:

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/09/30 13:43:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 2 items

drwxr-xr-x - hduser supergroup 0 2014-09-26 01:14 /user/hduser/input/hadoop

-rw-r--r-- 1 hduser supergroup 112 2014-09-28 01:03 /user/hduser/input/tiny_graph.txt 

3. Run the example as follows. GiraphRunner is given the computation class, the vertex input/output formats (-vif, -vof), the input path (-vip), the output path (-op), the number of workers (-w), and a custom argument (-ca); giraph.SplitMasterWorker=false lets the master and the worker run in the same process, which is needed here because everything runs on a single node.

$ bin/hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.4.0-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner \
    org.apache.giraph.examples.SimpleShortestPathsComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/hduser/input/tiny_graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/hduser/output/shortestpaths \
    -w 1 \
    -ca giraph.SplitMasterWorker=false

4. To check the output, use

$ bin/hadoop dfs -cat /user/hduser/output/shortestpaths/p* 

You should have:

DEPRECATED: Use of this script to execute hdfs command is deprecated.

Instead use the hdfs command for it.

14/09/28 01:59:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

0 1.0

2 2.0

1 0.0

3 1.0

4 5.0

Finally, we can stop all services and check with ‘jps’ that they have stopped:

$ cd /usr/local/hadoop
$ sbin/stop-all.sh
