How to install Spark on Ubuntu

Updated: 05.07.2024

In this article, I will explain step by step how to install Apache Spark on Windows 7, 10, and later versions, and also how to start the history server and monitor your jobs using the Web UI.

Install Java 8 or Later

To install Apache Spark on Windows, you need Java 8 or a later version, so download Java from Oracle and install it on your system. If you prefer OpenJDK, you can download and use it instead.

After the download, double-click the downloaded .exe file ( jdk-8u201-windows-x64.exe ) to install it on your Windows system. Choose any custom directory or keep the default location.

Note: This article explains installing Apache Spark with Java 8; the same steps also work for Java 11 and Java 13.

Apache Spark Installation on Windows

Apache Spark ships as a compressed tar/zip file, so installation on Windows is not much of a deal: you just need to download and extract the file. Download Apache Spark from the Spark Download page and select the link under “Download Spark” (point 3 in the screenshot below).

If you want to use a different version of Spark & Hadoop, select it from the drop-down; the link at point 3 changes to the selected version and gives you an updated download link.

[Screenshot: Apache Spark download page]

After the download, extract the archive using 7-Zip or any other zip utility and copy the extracted directory spark-3.0.0-bin-hadoop2.7 to c:\apps\opt\spark-3.0.0-bin-hadoop2.7 .

Spark Environment Variables

After installing Java and Apache Spark on Windows, set the JAVA_HOME , SPARK_HOME , HADOOP_HOME and PATH environment variables. If you already know how to set environment variables on Windows, add the following.
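
The article does not list the exact values, so the ones below are only an illustration based on the install locations used above. The JDK path depends on the version you installed, and pointing HADOOP_HOME at the Spark directory is just one common choice (it only needs to be the folder whose bin subfolder will hold winutils.exe, described below):

  JAVA_HOME   = C:\Program Files\Java\jdk1.8.0_201
  SPARK_HOME  = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
  HADOOP_HOME = C:\apps\opt\spark-3.0.0-bin-hadoop2.7
  PATH        = %PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin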

Follow the steps below if you are not sure how to add or edit environment variables on Windows.

  1. Open the System Environment Variables window and select Environment Variables.
  2. On the Environment Variables screen, add SPARK_HOME , HADOOP_HOME and JAVA_HOME by selecting the New option.
  3. This opens the New User Variable window where you can enter the variable name and value.
  4. Now edit the PATH variable.
  5. Add the Spark, Java, and Hadoop bin locations by selecting the New option.

Spark with winutils.exe on Windows

To run Apache Spark on Windows, you need winutils.exe, because Spark relies on POSIX-like file access operations that winutils.exe provides on Windows through the Windows API.

winutils.exe enables Spark to use Windows-specific services, including running shell commands in a Windows environment.
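
A typical setup, assuming the HADOOP_HOME value from the previous section: download a winutils.exe built for your Hadoop version (community builds such as the steveloughran/winutils repository on GitHub are a common source, though the original article does not say where it got the file) and copy it into %HADOOP_HOME%\bin :

  :: adjust the source path to wherever you saved the downloaded winutils.exe
  copy %USERPROFILE%\Downloads\winutils.exe %HADOOP_HOME%\bin\winutils.exe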

Apache Spark shell

spark-shell is a CLI utility that ships with the Apache Spark distribution. Open a command prompt, change to the bin directory ( cd %SPARK_HOME%/bin ), and type spark-shell to start the Apache Spark shell. You should see something like the output below (ignore the error you see at the end).

On the spark-shell command line, you can run any Spark statements, such as creating an RDD, getting the Spark version, etc.
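
For example (the printed values and console line numbers are illustrative and will differ on your machine):

  scala> spark.version
  res0: String = 3.0.0

  scala> val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
  rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24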

This completes the installation of Apache Spark on Windows 7, 10, and later versions.

Where to go Next?

You can continue with the document below to see how to inspect logs using the Spark Web UI and enable the Spark history server, or follow the links as next steps.

Web UI on Windows

Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, resource consumption of Spark cluster, and Spark configurations. On Spark Web UI, you can see how the operations are executed.
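
While an application (for example a spark-shell session) is running, its Web UI is served on port 4040 by default, so on the local machine you can open:

  http://localhost:4040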

[Screenshot: Spark Web UI]

History Server

The history server keeps a log of all Spark applications you submit via spark-submit or spark-shell . You can enable Spark to collect the logs by adding the configs below to the spark-defaults.conf file, located in the %SPARK_HOME%/conf directory.
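
The article does not show the exact values; a typical configuration looks like the following. The log directory path is an assumption, so adapt it to your machine and create the folder beforehand:

  # enable event logging and point the history server at the same directory
  spark.eventLog.enabled           true
  spark.eventLog.dir               file:///c:/spark-logs
  spark.history.fs.logDirectory    file:///c:/spark-logs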

After setting the above properties, start the history server by running the command below.
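
On Windows there is no .sh helper script, so one way to do it (a sketch using the spark-class.cmd launcher that ships in %SPARK_HOME%\bin) is:

  %SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer

The history server UI then listens on port 18080 by default ( http://localhost:18080 ).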

[Screenshot: Spark history server Web UI]

By clicking on each App ID, you will get the details of the application in Spark web UI.

Conclusion

In summary, you have learned how to install Apache Spark on Windows, run sample statements in spark-shell , and start the Spark Web UI and history server.

If you have any issues setting it up, please leave a message in the comments section and I will try to respond with a solution.

Install Apache Spark on Debian 10 Buster

Step 1. Before starting the guide below, it is important to make sure your system is up to date by running the following apt commands in a terminal:
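
The original commands are not preserved here; presumably the usual pair:

  sudo apt update
  sudo apt upgrade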

Step 2. Installing Java.

Apache Spark requires Java to run, so let's make sure Java is installed on our Debian system:
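
A likely command, using Debian's default OpenJDK package:

  sudo apt install default-jdk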

Check the Java version with the command:
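
That is:

  java -version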

Step 3. Installing Scala.

Now install the Scala package on the Debian system:
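
Presumably via the distribution package:

  sudo apt install scala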

Check the Scala version:
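
That is:

  scala -version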

Step 4. Installing Apache Spark on Debian.

Now we can download the Apache Spark binaries:
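
The original download link is not preserved; fetching a specific release would look like this (substitute the version you want, for example from archive.apache.org, which keeps older releases):

  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz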

Then extract the Spark archive:
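
Matching the archive name from the previous step:

  tar xvzf spark-3.0.1-bin-hadoop2.7.tgz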

After that, set up the Spark environment:
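
The guide does not say which file it edits; opening ~/.bashrc is the usual choice:

  nano ~/.bashrc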

At the end of the file, add the following lines:
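
The exact lines are not preserved; assuming Spark was extracted into your home directory as above, something like:

  export SPARK_HOME=$HOME/spark-3.0.1-bin-hadoop2.7
  export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin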

Save the changes and close the editor. To apply the changes, run:
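
That is:

  source ~/.bashrc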

Now start Apache Spark with the following commands. The first one starts the cluster master:
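
With the sbin directory on your PATH as configured above, the master is started with:

  start-master.sh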

To view the Spark Web UI, as shown below, open a web browser and enter the localhost IP address on port 8080:
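
That is, browse to:

  http://127.0.0.1:8080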


In this standalone, single-server setup, we will start one slave server along with the master server. The start-slave.sh command is used to start a Spark worker process:
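
Assuming the master started above listens on the default port 7077:

  # <master-hostname> is a placeholder: use the hostname or IP from the spark:// URL shown in the master web UI
  start-slave.sh spark://<master-hostname>:7077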

Now that the worker is up and running, if you reload the Spark Master web UI, you should see it in the list:

Once the setup is complete and the master and slave servers are running, check whether the Spark shell works:
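
Simply:

  spark-shell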

Congratulations! You have successfully installed Spark. Thank you for using this guide to install the latest version of Apache Spark on a Debian system. For additional help or useful information, we recommend visiting the official Apache Spark website.

Though this article uses Ubuntu, you can follow these steps to install Spark on any Linux-based OS such as CentOS, Debian, etc. I followed the steps below to set up my Apache Spark cluster on an Ubuntu server.

Prerequisites:

  • A running Ubuntu server
  • Root access to the Ubuntu server
  • If you want to run Apache Spark on a Hadoop & YARN installation, please install and set up a Hadoop cluster and YARN on the cluster before proceeding with this article.

If you just want to run Spark in standalone mode, proceed with this article.

Java Installation On Ubuntu

Apache Spark is written in Scala, which runs on the Java Virtual Machine, so to run Spark you need Java installed. Since Oracle Java is commercially licensed, I am using OpenJDK here; if you want to use Java from Oracle or another vendor, feel free to do so. I will be using JDK 8.

After the JDK install, check whether it installed successfully by running java -version .
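
The install command itself is not shown in the text; on Ubuntu, installing OpenJDK 8 and verifying it typically looks like this:

  sudo apt-get update
  sudo apt-get -y install openjdk-8-jdk
  java -version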

Python Installation On Ubuntu

You can skip this section if you want to run Spark with only Scala & Java on the Ubuntu server.

Python installation is needed if you want to run PySpark examples (Spark with Python) on the Ubuntu server.
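
A typical install (python3 is usually already present on recent Ubuntu releases):

  sudo apt-get -y install python3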

Apache Spark Installation on Ubuntu

If you want to use a different version of Spark & Hadoop, select it from the drop-down (points 1 and 2); the link at point 3 changes to the selected version and gives you an updated download link.

Use the wget command to download Apache Spark to your Ubuntu server.
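
The URL below is only an example, for the Spark 3.0.1 / Hadoop 2.7 build; use the link for whichever version you selected on the download page (older releases live on archive.apache.org):

  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz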

Once your download is complete, extract the archive contents using the tar command ( tar is a file archiving tool). Once the untar completes, rename the folder to spark.
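
Assuming the archive name from the previous step:

  tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
  mv spark-3.0.1-bin-hadoop2.7 spark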

Spark Environment Variables

Add the Apache Spark environment variables to the .bashrc or .profile file. Open the file in the vi editor and add the variables below.
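
Assuming the folder was renamed to spark in your home directory, as above (adjust SPARK_HOME if you placed it elsewhere):

  export SPARK_HOME=$HOME/spark
  export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin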

Now load the environment variables into the current session by running the command below.
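
That is:

  source ~/.bashrc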

If you added them to the .profile file instead, restart your session by closing and re-opening it.

Test Spark Installation on Ubuntu

Here I will use the spark-submit command to compute the value of Pi by running the org.apache.spark.examples.SparkPi example with 10 partitions. You can find spark-submit in the $SPARK_HOME/bin directory.
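
A typical invocation; the examples jar name depends on the Scala and Spark versions of your download, so adjust it to what is actually in $SPARK_HOME/examples/jars :

  $SPARK_HOME/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.1.jar 10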

Spark Shell

The Apache Spark binary comes with an interactive spark-shell. To start a shell using the Scala language, go to your $SPARK_HOME/bin directory and type “ spark-shell “. This command loads Spark and displays which version of Spark you are using.

Note: In spark-shell you can run only Spark with Scala. To run PySpark, open the pyspark shell by running $SPARK_HOME/bin/pyspark . Make sure you have Python installed before running the pyspark shell.

By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects ready to use. Let's see some examples.
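
For instance (the res numbers and printed values are illustrative):

  scala> spark.version
  res0: String = 3.0.1

  scala> val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
  data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

  scala> data.count()
  res1: Long = 5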

[Screenshot: spark-shell output]

Spark Web UI

[Screenshot: Spark Web UI]
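
While spark-shell or another application is running, its Web UI is served on port 4040 by default, so you can open it at (replace the host with your server's address if browsing remotely):

  http://localhost:4040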

Spark History server

Create a $SPARK_HOME/conf/spark-defaults.conf file and add the configurations below.
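
The exact values are not preserved here; a typical setup (the /tmp/spark-events path is an assumption, matching the directory created in the next step) is:

  spark.eventLog.enabled           true
  spark.eventLog.dir               file:///tmp/spark-events
  spark.history.fs.logDirectory    file:///tmp/spark-events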

Create the Spark event log directory. Spark keeps logs for all the applications you submit.
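
Matching the path assumed above:

  mkdir -p /tmp/spark-events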

Run $SPARK_HOME/sbin/start-history-server.sh to start history server.

As per the configuration, the history server runs on port 18080 by default.

Run the Pi example again using the spark-submit command, then refresh the history server UI, which should show the recent run.

Conclusion

In summary, you have learned the steps involved in installing Apache Spark on a Linux-based Ubuntu server, and also how to start the history server and access the web UI.

Apache Spark is a framework used in cluster computing environments for analyzing big data. This platform became widely popular due to its ease of use and the improved data processing speeds over Hadoop.

Apache Spark is able to distribute a workload across a group of computers in a cluster to more effectively process large sets of data. This open-source engine supports a wide array of programming languages. This includes Java, Scala, Python, and R.

In this tutorial, you will learn how to install Spark on an Ubuntu machine. The guide will show you how to start a master and slave server and how to load Scala and Python shells. It also provides the most important Spark commands.

Prerequisites:

  • An Ubuntu system.
  • Access to a terminal or command line.
  • A user with sudo or root permissions.

Install Packages Required for Spark

Before downloading and setting up Spark, you need to install the necessary dependencies. This step includes installing the following packages: JDK, Scala, and Git.

Open a terminal window and run the following command to install all three packages at once:
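
Presumably something along these lines, using the stock Ubuntu packages for the default JDK, Scala, and Git (adjust if you prefer a specific JDK):

  sudo apt install default-jdk scala git -y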

You will see which packages will be installed.

[Screenshot: terminal output when installing the Spark dependencies]

Once the process completes, verify the installed dependencies by running these commands:
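
For example:

  java -version; javac -version; scala -version; git --version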

[Screenshot: terminal output when verifying the Java, Git and Scala versions]

The output prints the versions if the installation completed successfully for all packages.

Download and Set Up Spark on Ubuntu

Now, you need to download the version of Spark you want from the Spark website. We will go with Spark 3.0.1 with Hadoop 2.7, as it is the latest version at the time of writing this article.

Use the wget command and the direct link to download the Spark archive:
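
Since older releases are eventually removed from the main download mirrors, the archive.apache.org URL below is the more durable choice for this specific version:

  wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz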

When the download completes, you will see the saved message.

[Screenshot: output when saving Spark to your Ubuntu machine]

Note: If the URL does not work, please go to the Apache Spark download page to check for the latest version. Remember to replace the Spark version number in the subsequent commands if you change the download URL.

Now, extract the saved archive using tar:
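
Using the archive name from the download step:

  tar xvf spark-3.0.1-bin-hadoop2.7.tgz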

Let the process complete. The output shows the files that are being unpacked from the archive.

Finally, move the unpacked directory spark-3.0.1-bin-hadoop2.7 to the /opt/spark directory.

Use the mv command to do so:
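
Likely:

  sudo mv spark-3.0.1-bin-hadoop2.7 /opt/spark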

The terminal returns no response if it successfully moves the directory. If you mistype the name, you will get a message similar to:
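
For instance, mv reports an error of this form (the exact wording depends on what you typed):

  mv: cannot stat 'spark-3.0.1-bin-hadoop2.7': No such file or directory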

Configure Spark Environment

Before starting a master server, you need to configure environment variables. There are a few Spark home paths you need to add to the user profile.

Use the echo command to add these three lines to .profile:
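
Assuming Spark lives in /opt/spark as set up above; the PYSPARK_PYTHON line is only needed if you plan to use PySpark with python3:

  echo 'export SPARK_HOME=/opt/spark' >> ~/.profile
  echo 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' >> ~/.profile
  echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> ~/.profile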

You can also add the export paths by editing the .profile file in the editor of your choice, such as nano or vim.

For example, to use nano, enter:
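
That would be:

  nano ~/.profile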

When the profile loads, scroll to the bottom of the file.

[Screenshot: nano editor with the .profile file open to add the Spark variables]

Then, add these three lines:
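
The same three variables as in the echo approach above:

  export SPARK_HOME=/opt/spark
  export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin
  export PYSPARK_PYTHON=/usr/bin/python3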

Exit and save changes when prompted.

When you finish adding the paths, load the .profile file in the command line by typing:
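
That is:

  source ~/.profile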

Start Standalone Spark Master Server

Now that you have completed configuring your environment for Spark, you can start a master server.

In the terminal, type:
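
With the sbin directory on your PATH as configured above, the standard startup script is:

  start-master.sh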

To view the Spark Web user interface, open a web browser and enter the localhost IP address on port 8080.

The page shows your Spark URL, status information for workers, hardware resource utilization, etc.

[Screenshot: the home page of the Spark Web UI]

The URL for Spark Master is the name of your device on port 8080. In our case, this is ubuntu1:8080. So, there are three possible ways to load Spark Master’s Web UI:

  1. 127.0.0.1:8080
  2. localhost:8080
  3. deviceName:8080

Note: Learn how to automate the deployment of Spark clusters on Ubuntu servers by reading our Automated Deployment Of Spark Cluster On Bare Metal Cloud article.

Start Spark Slave Server (Start a Worker Process)

In this single-server, standalone setup, we will start one slave server along with the master server.

To do so, run the following command in this format:
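
The general form, where the master URL matches the spark:// URL shown at the top of the master's Web UI (7077 is the default port):

  start-slave.sh spark://master:port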

The master in the command can be an IP or hostname.

In our case it is ubuntu1 :
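
So, assuming the default port:

  start-slave.sh spark://ubuntu1:7077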

[Screenshot: terminal output when starting a slave server]

Now that a worker is up and running, if you reload Spark Master’s Web UI, you should see it on the list:

[Screenshot: Spark Web UI with one slave worker started]

Specify Resource Allocation for Workers

The default setting when starting a worker on a machine is to use all available CPU cores. You can specify the number of cores by passing the -c flag to the start-slave command.

For example, to start a worker and assign only one CPU core to it, enter this command:
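
Using the same master URL as before (replace ubuntu1 with your own hostname):

  start-slave.sh -c 1 spark://ubuntu1:7077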

Reload Spark Master’s Web UI to confirm the worker’s configuration.

[Screenshot: slave server CPU core configuration in the Web UI]

Similarly, you can assign a specific amount of memory when starting a worker. The default setting is to use whatever amount of RAM your machine has, minus 1GB.

To start a worker and assign it a specific amount of memory, add the -m option and a number. For gigabytes, use G and for megabytes, use M .

For example, to start a worker with 512MB of memory, enter this command:
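
Again against the same master URL:

  start-slave.sh -m 512M spark://ubuntu1:7077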

Reload the Spark Master Web UI to view the worker’s status and confirm the configuration.

[Screenshot: slave server RAM configuration in the Web UI]

Test Spark Shell

After you finish the configuration and start the master and slave server, test if the Spark shell works.

Load the shell by entering:
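
Simply:

  spark-shell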

You should get a screen with notifications and Spark information. Scala is the default interface, so that shell loads when you run spark-shell.

The ending of the output looks like this for the version we are using at the time of writing this guide:

[Screenshot: terminal output when launching the Spark shell on Ubuntu]

Type :q and press Enter to exit Scala.

Test Python in Spark

If you do not want to use the default Scala interface, you can switch to Python.

Make sure you quit Scala and then run this command:
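
That is:

  pyspark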

The resulting output looks similar to the previous one. Towards the bottom, you will see the version of Python.

[Screenshot: terminal output when the pyspark shell is launched]

To exit this shell, type quit() and hit Enter.

Basic Commands to Start and Stop Master Server and Workers

Below are the basic commands for starting and stopping the Apache Spark master server and workers. Since this setup is only for one machine, the scripts you run default to the localhost.

To start a master server instance on the current machine, run the command we used earlier in the guide:
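
As shown earlier, with the sbin scripts on your PATH:

  start-master.sh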

To stop the master instance started by executing the script above, run:
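
Presumably the matching stop script from Spark's sbin directory:

  stop-master.sh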

To stop a running worker process, enter this command:
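
Again using the standard sbin script:

  stop-slave.sh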

The Spark Master page, in this case, shows the worker status as DEAD.

[Screenshot: Spark Web UI showing a worker with status DEAD]

You can start both the master and worker instances by using the start-all command:
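
Using the combined script from Spark's sbin directory:

  start-all.sh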

Similarly, you can stop all instances by using the following command:
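
And its counterpart:

  stop-all.sh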

This tutorial showed you how to install Spark on an Ubuntu machine, as well as the necessary dependencies.

The setup in this guide enables you to perform basic tests before you start configuring a Spark cluster and performing advanced actions.

Our suggestion is to also learn more about what a Spark DataFrame is, its features, how to use it when collecting data, and how to create one.
