How to Run Hadoop on CentOS

Updated: 04.07.2024

We often deploy various clustered systems, so a good set of instructions is worth its weight in gold. Today we offer a solid guide to deploying a Hadoop cluster suitable for development and for small clusters with no high-availability requirements.

Introduction

This article provides step-by-step instructions for installing a Hadoop cluster on CentOS 7. It is aimed at readers who are already familiar with Hadoop and Linux.

Deployment Topology

Preparing CentOS

Unless stated otherwise, all steps in this section are performed on every node of the configuration.

We will start from a minimal installation of CentOS 7. After installing the system, a few software packages need to be added:
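
The package list itself is not reproduced above; as a sketch, a typical preparation of a minimal install might look like this (the package selection is an assumption):

$ yum -y install wget rsync openssh-clients net-tools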

Set a hostname for each node. This step is optional, but it makes identifying the nodes much easier.

For example, on the master node the command is:
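
A sketch using hostnamectl (the exact node name is an assumption):

$ hostnamectl set-hostname master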

Log in again to see the result. Repeat this operation on every node, specifying the correct hostname for that node.

We will use OpenJDK 1.8, since this package is included in the standard CentOS 7 repository.

Create the file /etc/profile.d/java.sh with the following content:
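
A sketch of the profile script, assuming OpenJDK 1.8 was installed from the base repository (for example with yum -y install java-1.8.0-openjdk); the exact JRE path depends on the installed build:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin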

To make sure the configuration is correct, log out and log back in. The env command should show the new environment variables, and java -version should report the expected Java version.

Create a user and a group for Hadoop:
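
For example:

$ groupadd hadoop
$ useradd -g hadoop -m hadoop
$ passwd hadoop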

Edit the hosts file so that the nodes can resolve each other by name:
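
A sketch of /etc/hosts; the master address matches the web UI addresses used later in this article, while the slave addresses and node names are assumptions:

192.168.171.132  master
192.168.171.133  slave1
192.168.171.134  slave2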

Check that the nodes are resolved correctly:

Configure passwordless SSH access from every node to every node:
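
For example, as the hadoop user (node names are assumptions):

$ ssh-keygen -t rsa
$ ssh-copy-id hadoop@master
$ ssh-copy-id hadoop@slave1
$ ssh-copy-id hadoop@slave2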

Verify that all nodes can reach each other over SSH keys without being prompted for a password.

Stop and disable the firewall:
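
On CentOS 7 this is done with systemd:

$ systemctl stop firewalld
$ systemctl disable firewalld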

Installing Hadoop

All steps in this section are performed on the master node, as the hadoop user.

Download and unpack the distribution:
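
A sketch; the article targets a Hadoop 2.x release (it uses the slaves/masters files and port 50070), so the exact version, mirror URL and install path below are assumptions:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
$ tar -xzf hadoop-2.7.7.tar.gz
$ mv hadoop-2.7.7 /home/hadoop/hadoop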

Add the Hadoop environment variables to the bash session initialization script:
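
A minimal sketch, assuming ~/.bashrc is used as the initialization script and the install path chosen above:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin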

Apply the environment variables so that they take effect:
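
$ source ~/.bashrc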

Now let's edit the Hadoop configuration files for our three-node topology.

Add the slave node names to the file $HADOOP_HOME/etc/hadoop/slaves :

Add the name of the secondary node to the file $HADOOP_HOME/etc/hadoop/masters :

If you need to disable security checks in Hadoop, which is common during development, add the following section to the file:

Create the directories Hadoop needs:
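
For example (the layout is an assumption and must match the dfs.namenode.name.dir, dfs.datanode.data.dir and hadoop.tmp.dir values in your configuration files):

$ mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode /home/hadoop/tmp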

Copy the Hadoop tree and the environment files to the slave nodes:
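
For example (node names as above):

$ scp -r /home/hadoop/hadoop ~/.bashrc hadoop@slave1:/home/hadoop/
$ scp -r /home/hadoop/hadoop ~/.bashrc hadoop@slave2:/home/hadoop/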

Starting the Hadoop Cluster

Start the distributed file system (DFS):
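
Using the standard Hadoop script (on the PATH configured earlier):

$ start-dfs.sh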

Start the distributed compute layer (YARN):
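
$ start-yarn.sh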

To stop the Hadoop cluster, run:
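
$ stop-yarn.sh
$ stop-dfs.sh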

Checking the Cluster State

Run the jps command on each node and make sure it returns a healthy set of processes.

A successful jps response on the master node:

For detailed monitoring of the cluster state, use the Hadoop web interfaces:

  • 192.168.171.132:50070 — HDFS storage status.
  • 192.168.171.132:8088 — YARN resources and application status.

Conclusion

That is all you need to deploy a basic Hadoop cluster with data replication across three nodes.

This deployment uses Hadoop with the NameNode as a single point of failure. Even though a Secondary NameNode is used, the cluster is not fault-tolerant and should be used only for development or small installations. Large installations require a more complex deployment with highly available NameNodes. We will cover this in future articles.

If you find a mistake, some instructions are unclear, or you have suggestions for improving the article, we would be glad to hear from you. Good luck with Hadoop.

Hadoop is a free, open-source, Java-based software framework used for storing and processing large datasets on clusters of machines. It uses HDFS to store data and MapReduce to process it. It is an ecosystem of Big Data tools that are primarily used for data mining and machine learning. Its four major components are Hadoop Common, HDFS, YARN, and MapReduce.

In this guide, we will explain how to install Apache Hadoop on RHEL/CentOS 8.

Before starting, it is a good idea to disable SELinux on your system.

To disable SELinux, open the /etc/selinux/config file:

Change the following line:
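
The change the article refers to is setting the SELinux mode to disabled:

SELINUX=disabled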

Save the file when you are finished. Next, restart your system to apply the SELinux changes.

Hadoop is written in Java and supports only Java version 8. You can install OpenJDK 8 and ant using the DNF command as shown below:
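
For example:

$ sudo dnf install java-1.8.0-openjdk ant -y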

Once installed, verify the installed version of Java with the following command:

You should get the following output:

It is a good idea to create a separate user to run Hadoop for security reasons.

Run the following command to create a new user with name hadoop:

Next, set the password for this user with the following command:
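
A sketch of both steps:

$ useradd hadoop
$ passwd hadoop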

Provide and confirm the new password as shown below:

Next, you will need to configure passwordless SSH authentication for the local system.

First, change the user to hadoop with the following command:

Next, run the following command to generate Public and Private Key Pairs:

You will be asked to enter the filename. Just press Enter to complete the process:

Next, append the generated public keys from id_rsa.pub to authorized_keys and set proper permission:
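
A sketch of the key setup, run as the hadoop user:

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys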

Next, verify the passwordless SSH authentication with the following command:

You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes and hit Enter to authenticate the localhost:

First, change the user to hadoop with the following command:

Next, download the latest version of Hadoop using the wget command:

Once downloaded, extract the downloaded file:

Next, rename the extracted directory to hadoop:
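
A sketch of the download, extraction and rename; the exact release is not shown above, so the version (3.2.1) and the archive URL are assumptions:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ tar -xzf hadoop-3.2.1.tar.gz
$ mv hadoop-3.2.1 hadoop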

Next, you will need to configure Hadoop and Java Environment Variables on your system.

Next, open the ~/.bashrc file in your favorite text editor:

Append the following lines:
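
A typical set of variables for this kind of single-node setup; the original listing is not reproduced above, and the HADOOP_HOME path assumes the directory created in the previous step:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"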

Save and close the file. Then, activate the environment variables with the following command:

Next, open the Hadoop environment variable file:

Update the JAVA_HOME variable as per your Java installation path:
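
For example (the path depends on the installed OpenJDK build; you can check it with readlink -f $(which java)):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk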

Save and close the file when you are finished.

First, you will need to create the namenode and datanode directories inside the Hadoop home directory:

Run the following command to create both directories:
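
For example (the directory layout is an assumption and must match hdfs-site.xml below):

$ mkdir -p ~/hadoopdata/hdfs/namenode
$ mkdir -p ~/hadoopdata/hdfs/datanode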

Next, edit the core-site.xml file and update with your system hostname:

Change the following name as per your system hostname:
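
A minimal sketch of core-site.xml; the hostname is a placeholder to be replaced with your own:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop.example.com:9000</value>
  </property>
</configuration>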

Save and close the file. Then, edit the hdfs-site.xml file:

Change the NameNode and DataNode directory path as shown below:
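
A minimal sketch, assuming the directories created earlier:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>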

Save and close the file. Then, edit the mapred-site.xml file:

Make the following changes:
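
The key setting here is to run MapReduce on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>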

Save and close the file. Then, edit the yarn-site.xml file:

Make the following changes:
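
The key setting here is the shuffle auxiliary service:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>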

Save and close the file when you are finished.

Before starting the Hadoop cluster, you will need to format the Namenode as the hadoop user.

Run the following command to format the hadoop Namenode:
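
For example:

$ hdfs namenode -format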

You should get the following output:

After formatting the Namenode, run the following command to start the Hadoop cluster:
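
$ start-dfs.sh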

Once HDFS has started successfully, you should get the following output:

Next, start the YARN service as shown below:
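
$ start-yarn.sh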

You should get the following output:

You can now check the status of all Hadoop services using the jps command:

You should see all the running services in the following output:

Hadoop is now started and listening on ports 9870 and 8088. Next, you will need to allow these ports through the firewall.

Run the following command to allow Hadoop connections through the firewall:
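
For example, as root (or via sudo):

$ firewall-cmd --permanent --add-port=9870/tcp
$ firewall-cmd --permanent --add-port=8088/tcp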

Next, reload the firewalld service to apply the changes:
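
$ firewall-cmd --reload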

At this point, the Hadoop cluster is installed and configured. Next, we will create some directories in the HDFS filesystem to test Hadoop.

Next, run the following command to list the above directory:
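
The directory names used in the original are not shown; as a hypothetical example:

$ hdfs dfs -mkdir /test1
$ hdfs dfs -ls /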

You should get the following output:

You can also verify the above directory in the Hadoop Namenode web interface.

Go to the Namenode web interface and click Utilities => Browse the file system. You should see the directories you created earlier on the following screen:

You can also stop the Hadoop Namenode and YARN services at any time by running the stop-dfs.sh and stop-yarn.sh scripts as the hadoop user.

To stop the Hadoop Namenode service, run the following command as a hadoop user:

To stop the Hadoop Resource Manager service, run the following command:

Conclusion

In the above tutorial, you learned how to set up a Hadoop single-node cluster on CentOS 8. I hope you now have enough knowledge to install Hadoop in a production environment.


By Rahul, January 10, 2015. Updated: June 8, 2017

Hadoop on Linux

Step 1: Installing Java

Java is the primary requirement for setting up Hadoop on any system, so make sure Java is installed using the following command.

Step 2: Creating Hadoop User

We recommend creating a normal (not root) account for running Hadoop. Create a system account using the following command.

After creating the account, you also need to set up key-based ssh for it. Use the following commands to do this.

Step 3. Downloading Hadoop 2.6.5

Now download the Hadoop 2.6.0 source archive using the command below. You can also select an alternate download mirror to increase download speed.
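
For example, from the Apache archive (the mirror URL and target directory are assumptions):

$ cd ~
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar -xzf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0 hadoop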

Step 4. Configure Hadoop Pseudo-Distributed Mode

4.1. Setup Hadoop Environment Variables

First, we need to set the environment variables used by Hadoop. Edit the

~/.bashrc file and append the following values at the end of the file.

Now apply the changes in the current running environment:

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path as per the installation on your system.

4.2. Edit Configuration Files

Edit core-site.xml

Edit hdfs-site.xml

Edit mapred-site.xml

Edit yarn-site.xml
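
The property listings for these four files are not reproduced above; a minimal pseudo-distributed sketch (paths and values are assumptions, and each entry goes inside a <property> block with <name> and <value> tags) would be along these lines:

core-site.xml:   fs.default.name = hdfs://localhost:9000
hdfs-site.xml:   dfs.replication = 1
                 dfs.name.dir = file:///home/hadoop/hadoopdata/hdfs/namenode
                 dfs.data.dir = file:///home/hadoop/hadoopdata/hdfs/datanode
mapred-site.xml: mapreduce.framework.name = yarn
yarn-site.xml:   yarn.nodemanager.aux-services = mapreduce_shuffle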

4.3. Format Namenode

Now format the namenode using the following command; make sure that the storage directory is correct:

Step 5. Start Hadoop Cluster

Now start your Hadoop cluster using the scripts provided by Hadoop. Navigate to your Hadoop sbin directory and execute the scripts one by one.

Now run start-dfs.sh script.

Now run start-yarn.sh script.

Step 6. Access Hadoop Services in Browser

The Hadoop NameNode starts on port 50070 by default. Access your server on port 50070 in your favorite web browser.

(Screenshot: Hadoop single node NameNode web interface)

Now access port 8088 to get information about the cluster and all applications.

(Screenshot: Hadoop single node applications overview)

Access port 50090 to get details about the secondary namenode.

(Screenshot: Hadoop single node secondary namenode)

Access port 50075 to get details about the DataNode.


Step 7. Test Hadoop Single Node Setup


You can also check this tutorial to run a wordcount MapReduce job example using the command line.


Comments

Dear Mr Rahul
Could you kindly help me, how we can deploy the services of Single Node Cluster to multiple clients in a Lab environment.

Dear Mr. Rahul,
I am very thankful for your installation guide but could not understand how we can
Edit

the ~/.bashrc file to set up the Hadoop Environment Variables, could you kindly help us with more screen shots please

I was stuck up at these area
4.1. Setup Hadoop Environment Variables
4.3 Resolution of host name with IP, where should we set this IP please help me

Regards
Dinakar NK


Hi Dinakar, Simply edit the

~/.bashrc configuration file and copy the settings at the end of the file.


Thanks Ruchira for pointing this. I have updated tutorial accordingly.


Plz provide results of following commands.

$ telnet localhost 22
$ netstat -tulpn | grep 22


It looks OpenSSH server is not running on your system.

The programs included with the Kali GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

I am able to setup the hadoop multi node in ubuntu. but not able to setup the multi node in centos 6.6.
Where can i set the interfaces and hosts and hostname?
Can you please share the video link?

I am not able to follow step 2 because executing passwd hadoop, does ask for password.
I tried hitting just enter but then when I run ssh localhost, it keeps on asking password.

One think I noticed which is also different is the message (shows ssh2 instead of ssh):-
Public key saved to /home/hadoop/.ssh2/id_rsa_2048_a.pub

It just comes back saying -bash: yum: command not found


It looks /usr/bin and /usr/local/bin is not added in PATH environment variable. Please use below command to add it.

I want to change my CentOs code that means I want to add hadoop single node cluster to this and I need to share some other?
How can I do this ??

hdfs file not found

all went well except the last step: step 7:

I cannot fix error. Please help me
Thank you all !

Hey friend,
i am newbie about hadoop i configured hadoop on vagrant ubuntu machine.i wants access hadoop web ui on browser but i unable to do so.i tried changing the core-site.xml file for hadoop ui on browser by my machine ip and different ports for ui like 9000/8020 and 50075,50070 but nothing happens.
plz help so.
Thanks in advance.

have a question, could you please explain me, why have we created a user after installation of java, and how could the password less ssh work for ssh the localhost when i am already in the same system, or am i missing something in here.

Thank you so much for precise instructions which makes it simple and perfect !
Great help 🙂

can i create cluster with two different os (ubuntu and cygwin on windows ) in which hadoop (same version)is installed ?

great article.. works fine

It works on Centos 7 , JDK 8 & Hadoop 2.6
Thanks! a great tutorial.

Thanks. It worked with Centos7 and Hadoop 2.7.1

Thanks. It worked with fedora 22 and Hadoop 2.7.
The only Warning i get is below. I am not sure what it means.

I am trying to run my workflow on a new Yarn cluster via oozie. The job submits fine and as part of the workflow creates a scanner; the scanner is serialised and written to disk. Then during deserialising the string to a scan object I encounter the following error

I googled and checked for all kinds of config errors but all my configurations such as nodename, jobtracker, etc are correctly configured. Also, the google protobuf jar is consistent across all YARN components and my code. Wondering whats going wrong?
-Shashank

Hi,
I need to install Hadoop 2.6.0 multi node cluster with different os configurations. I am already having a master node and one slave node both at Ubuntu 12.04. I want to add one more slave node with CentOS.
I wanted to ask is it fine?

Thanks in advance!

First of all let me say THANK YOU for this tutorial. This is a very big help especially to a person like me who just started learning Hadoop / Bigdata.

2. If I rebooted my machine, do I need to run the start-dfs.sh and start-yarn.sh again?

Hi will you please help me to solve out this


Hi this is really a great post, I followed it and it works! I have a follow-up question: can you post another blog for how to install Spark on this single YARN cluster which can work with the the data on hdfs on this single machine?

Please could you suggest me some solution

export HADOOP_HOME=/home/hadoop/hadoop
I didnt understand the above statement. When you create a user hadoop, a folder will be created in home directory. What the purpose of second hadoop?

cd $HADOOP_HOME/etc/hadoop
There is no hadoop folder in etc directory.
I am confused because etc comes under the supervision of root user rather than hadoop user.

Hi, I am trying to run wordcount example. But it is getting stuck at ACCEPTED state.
It is not going into RUNNING state.
Any help appreciated. I have followed the tutorial exact. But using 2.6.0 instead of 2.4.0

Once it is done, just run `stop-dfs.sh` followed by & `start-dfs.sh`

Dear
How I change command from
-Old 64 bit
-1.7.0.51
-rhel-2.4.5.5.el7-x86_64 u51-b31
-(build 24.51-b03,mixed mode)
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65-2.5.1.2.el7_0.x86_64/jre
What command change dorectory
My VM ware java version
-1.7.0.45
-rhel-2.4.3.3.el6-i386 u45-b15
-build 24.45-b08,mixed mode,sharing
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45-2.4.3.3.el6_0.i386/jre
please help advisor because I am beginner CENT OS and HODOOP
Best regards

Dear Rahul
How I set again cloud you advisor


Please make sure you have configured JAVA_HOME correctly.

Hi, This one is a great article. I followed many other blogs for this problem. But none of them worked. This one simply worked with no error.

But i have a little problem.
I have installed hbase standalone mode.
Now i want hbase to use hdfs. So in hbase-site.xml file i added this:

But its giving error and not working. Any reason why its not working? I copied the same configuration of yours during my hadoop installation.

Thank for this article.

What could be the problem or what mistake I might have done? Please help.

However this got me a long way further.

Thanks for taking the time to write up this guide, it was very helpful.

Hi Rahul
Thank you for sharing this with us. I followed your steps in installing hadoop 2.4 on Centos Vritual Machine. I have an Hbase running on my mac machine, I get connection refused error when hbase tries to connect.
Here is my setting in core-site.xml

and on Hbase: hbase-site.xml

I can telnet onto the port 54310 from the VM but not from a remote machine, i.e my local macbook which is running the virtual machine. Looks like the port is closed to remote client. I have disabled firewall but it didn’t help.

thanks for the tutorial, why

cant be written?


Please check if hadoop user has proper privileges on this file

Thanks for all the steps. Please update the mapred-site.xml to mapred-site.xml.template.

Also, please update the testing the setup.

Very good article,
Two issues,
First exit; ssh localhost will not work for public/private key
Should be ssh localhost; exit


We have updated article accordingly.

Rahul, wonderful article.

Need to know few things and appreciate your feedback on this;

1. Used RHAT 6.3 with Java 1.7/Hadoop 2.6.0
2. Able to run the Name Node and Data Node

you can check your iptables and allow that port (54310, 9000, 50070). i try it and it works well.

Great article but here is a script that also install hbase, hdfs, and a number of other resources

When are you publishing next part of this article? I loved this and I am waiting to see how will you test your setup by running some example map reduce job.


By Rahul, November 12, 2015. Updated: April 3, 2019

Apache Hadoop 3.1 has noticeable improvements and many bug fixes over the previous stable 3.0 releases. This version has many improvements in HDFS and MapReduce. This how-to guide will help you to set up a Hadoop 3.1.0 Single-Node Cluster on CentOS/RHEL 7/6 and Fedora 29/28/27 systems. This article has been tested with CentOS 7.

This tutorial is for configuring a Hadoop Single-Node Cluster. You may be interested in Hadoop Multi-Node Cluster Setup on Linux systems.

Setup Hadoop on Linux

1. Prerequisites

2. Create Hadoop User

We recommend creating a normal (not root) account for running Hadoop. Create an account using the following command.

After creating the account, you also need to set up key-based ssh for it. Use the following commands to do this.

3. Download Hadoop 3.1 Archive

In this step, download the Hadoop 3.1 source archive using the command below. You can also select an alternate download mirror to increase download speed.

4. Setup Hadoop Pseudo-Distributed Mode

4.1. Setup Hadoop Environment Variables

First, we need to set the environment variables used by Hadoop. Edit the

~/.bashrc file and append the following values at the end of the file.

Now apply the changes in the current running environment:

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path as per the installation on your system. This path may vary depending on your operating system version and installation source, so make sure you are using the correct path.

4.2. Setup Hadoop Configuration Files

Edit core-site.xml

Edit hdfs-site.xml

Edit mapred-site.xml

Edit yarn-site.xml

4.3. Format Namenode

Now format the namenode using the following command; make sure that the storage directory is correct:

5. Start Hadoop Cluster

Now run start-dfs.sh script.

Now run start-yarn.sh script.

6. Access Hadoop Services in Browser

The Hadoop NameNode starts on port 9870 by default. Access your server on port 9870 in your favorite web browser.


Now access port 8042 to get information about the cluster and all applications.


Access port 9864 to get details about your Hadoop node.


7. Test Hadoop Single Node Setup

7.1. Make the required HDFS directories using the following commands.
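
For example, the standard per-user HDFS layout:

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hadoop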

7.3. Browse the Hadoop distributed file system by opening the URL below in your browser. You will see an apache2 folder in the list. Click on the folder name to open it, and you will find all the log files there.


You can also check this tutorial to run a wordcount MapReduce job example using the command line.


Comments

Out of so many i tried this one worked .. thanks a ton ..

i m getting this error : Error: Cannot find configuration directory: /etc/hadoop . any ideas ?

Starting resourcemanager
Starting nodemanagers

Thanks. This post is helpful in the datanode and namenode configuration which is missing in the guide of hadoop.

I have received an error
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Could you please help out me

!
How can I deal this under the error ?

WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
ERROR: Invalid HADOOP_YARN_HOME

bin/hdfs dfs -mkdir /user
When I run this, it says
-su: bin/hdfs: No such file or directory
Please help

For the steps showing here, does it work on Red Hat Linux 7.5?
I have tried. But not able to make it work. Here is the error:

Thank you in advance.



Hi Hema, Thanks for pointing out.

This tutorial has been updated to the latest version of Hadoop installation.

Regards
Dinakar N K

The information given is incomplete need to export more variables.

This is just awesome. I have had some issues during installation but your guide to installing a hadoop cluster is just great !!
I still used the vi editor for editing the .bashrc file and all other xml files. Just the lack of linux editor knowledge I did it. I used the 127.0.0.1 instead of localhost as the alias names are not getting resolved.

Do you have some idea ?


$ sudo apt-get install openssh-client

hi rahul i followed your setup here but unfortunately i ended up some errors like the one below.

Can anybody can tell me where is bin/hdfs in Step 7 please ?

This can be misleading. What it is saying is from the bin folder within the hadoop system folder.

$ cd $HADOOP_HOME/bin
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir/user/hadoop

bash: /home/hadoop/.bashrc: Permission denied

I guess that the CentOS 7/Hadoop 2.7.1 tutorial was very helpful until step 4, when by some reason the instructions just explained what to do with .bashrc without explaining how to get to it and how to edit it in first place. Thanks anyway, I just need to find a tutorial that explains with detail how to set up Hadoop.

Thanks. I read your article have successfully deployed my first single node hadoop deployment despite series of unsuccessful attempts in the past. thanks

Hi Rahul,
Please help me! I installed follow your guide. when i run jps that result below:
18118 Jps
18068 TaskTracker
17948 JobTracker
17861 SecondaryNameNode
17746 DataNode

however when I run stop-all.sh command that
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop
Can you explan for me? Thanks so much!

Can u plz tell me how to configure multiple datanodes on a single machine.I am using hadoop2.5

I am not able to start the hadoop services..getting an error like>

please help me to overcome this error.

Thanks a lot.
I made a few modifications but the the instructions are on the money!

Your artcile is simply super, I followed each and every step and installed hadoop.

I guess.. these export JAVA_HOME also need to include in some where else.

Thansk alot for your article buddy


Plz help me for following case
[FATAL ERROR] core-site.xml:10:2: The markup in the document following the root element must be well-formed.


Please provide your core-site.xml file content

localhost: [fatal Error] core-site.xml:10:2: The markup in the document following the root element must be well-formed.


Please make sure you have setup JAVA_HOME environment variable. Please provide output of following commands.

Hi Rahul,
I have successfully done the installation of single node cluster and able to see all daemons running.
But i am not able to run hadoop fs commands , for this should i install any thing else like jars??

How do we add data to the single node cluster.

Do you know what is the best site to download Pig and hive? I realized that I am unable to run and pig and hive. I thought it comes with the package just like while setting-up under cloudera.

Hi Rahul,
I just restarted the hadoop and JPS is working fine.

Hi Rahul,
jps is not showing under $JAVA_HOME/bin

It comes out with an error no such file or directory

Also, once I complete my tasks and comes out of linux, do I need to restart the hadoop?


Hi Raj,
Try following command for jps.

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
RHEL


Your systems hosts file entry looks incorrect. Please add entry like below

Hi Rahul!
Just to say your instructions worked like a dream.
In my hadoop-env.sh I used.
export JAVA_HOME=/usr/lib/jvm/jre-1.6.0 <- might help others its a vanilla Centos 6.5 install.
Cheers Paul

Thanks alot buddy 🙂

Keep doing such good works 🙂


i follow all your guide step by steps but encounter some problems while executing
this command:

Fixed these errors. Checked the logs and got the help from websites.

-bash-3.2$ jps
21984 NameNode
27080 DataNode
1638 ResourceManager
1929 NodeManager
5718 JobHistoryServer
6278 Jps

-bash-3.2$ jps
21984 NameNode
27080 DataNode
11037 Jps
1638 ResourceManager
1929 NodeManager

Where can I see the information as to why this stopped? Can you please suggest. Sorry for multiple posts. But I had to update and dont seem to find any help googling.
Thanks

Any ideas? I will test before I shut and restart
Thanks.

Hi
I downloaded and installed Apache hadoop 2.2 latest. Followed the above setup for single node ( First time setup.) RHEL 5.5
Name node, DataNode, ResourceManager, NodeManager started fine. Had some issue with datanode & had to update the IPtables for opening ports.
when I run
-bash-3.2$ sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /hadoop/hadoop-2.2.0/logs/mapred-hduser-historyserver-server.out
when I run jps, I dont see the JobHistoryServer listed . There are no errors in the out file above.

Can someone please assist?
Thanks
JKH

Hello Rahul,
I need your help , my cluster does not work because it must have something wrong in the configuration .

because when I put start-all.sh it does not initialize the package secondarynamenode .

shows this error

starting secondarynamenode , logging to / opt/hadoop/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-secondarynamenode-lbad012.out

lbad012 : at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress ( NameNode.java : 212 )

lbad012 : at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress ( NameNode.java : 244 )

lbad012 : at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress ( NameNode.java : 236 )

lbad012 : at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize ( SecondaryNameNode.java : 194 )

lbad012 : at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode . ( SecondaryNameNode.java : 150 )

lbad012 : at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main ( SecondaryNameNode.java : 676 )

heyy
please help.
wen i am doing the step 6 its asking ffor some password even though i havent set any password.please tell wat to do.


and my hue_safety_valve.ini looks as below

[hadoop]
[[mapred_clusters]]
[[[default]]]
jobtracker_host=servername
thrift_port=9290
jobtracker_port=8021
submit_to=True
hadoop_mapred_home=>
hadoop_bin=>
hadoop_conf_dir=>
security_enabled=false

Can you please suggest the steps to uninstall the apache hadoop . I am planning to test cloudera as well , if you can share steps for cloudera as well that would be awesome ! .


Are you configuring hadoop multinode cluster ?

I have found solutions for my issues, thanks rahul for your post it helped me a lot . .


Core xml file for reference


Hi Rahul,
sorry for my delay response.
I was able to resolve the above error ,By mistake i deleted the wrapper in the xml which caused the error , I have now kept the right data in xml and found below output after format.

16467 Resource Manager
15966 NameNode
16960 Jps
16255 SecondaryNameNode

Would appreciate any comment from your side on my queries here.


Yes, Please post the output of start-all.sh command with log files. But first plz empty your log files and them run start-all.sh after that post all outputs.

Also I prefer, if you post your question on our new forum, so it will be better to communicate.

Here is the output of start-all.sh script.

hello..
i also faced the same issue as mr Rakesh
while executing the format command


Install Apache Hadoop on CentOS 8

Step 1. First, let's start by making sure your system is up to date.

Step 2. Installing Java.

Apache Hadoop is written in Java and supports only Java version 8. You can install OpenJDK 8 with the following command:

Check the Java version:

Step 3. Installing Apache Hadoop on CentOS 8.

It is recommended to create a regular user to run Apache Hadoop; create the user with the following command:

Next, we need to configure passwordless SSH authentication for the local system:

Verify the passwordless ssh configuration with the command:

Next, download the latest stable version of Apache Hadoop. At the time of writing this article, it is version 3.2.1:
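
A sketch of the download and unpacking (the archive URL and target directory are assumptions):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ tar -xzf hadoop-3.2.1.tar.gz
$ mv hadoop-3.2.1 hadoop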

Then you will need to configure the Hadoop and Java environment variables on your system:

Now activate the environment variables with the following command:

Then open the Hadoop environment variable file:

Hadoop has many configuration files, which need to be adjusted to the requirements of your Hadoop infrastructure. Let's start with a basic single-node Hadoop cluster configuration:

Edit hdfs-site.xml:

Edit mapred-site.xml:

Now format the namenode with the following command; do not forget to check the storage directory:

Start the NameNode and DataNode daemons using the scripts provided by Hadoop:

Step 4. Configure the firewall.

Run the following command to allow Apache Hadoop connections through the firewall:

Step 5. Access Apache Hadoop.

Congratulations! You have successfully installed Apache Hadoop. Thank you for using this guide to install Hadoop on a CentOS 8 system. For additional help or useful information, we recommend visiting the official Apache Hadoop website.
