Midterm Prerequisites

Course: Big Data - IU S25
Date: Tuesday - March 13, 2025
Start time: 14:30
Total points: 100/4

Hadoop Spark Cluster using Docker

docker-spark-yarn-cluster

Agenda

Objectives

This tutorial is written to provide you with the requirements that you need to check and prepare before you come to the midterm. If you do not meet the requirements, then you need to satisfy them during the midterm and it will take from your time, so make sure that your PC meets the requirements. If you have issues related to the requirements, please contact your TA at least two days before the midterm.

Hardware requirements

CPU (4 cores or more)
Memory (8GB or more)
Storage (~50 GB)

Software requirements

You have an Ubuntu (e.g. 20.04 or 22.04) or Windows operating system with Ubuntu WSL (22.04). There is no guarantee that the tools will work with other Linux distro or MacOS but you have to check them with this document.
Install Docker and Docker Compose
Install Apache Hadoop
- Version 3.3.1
Install Apache Spark
- Version 3.5.4
Install PostgreSQL
Install Neo4j server - Enterprise edition.
- Do not install community edition.
Install MongoDB server
Install MongoDB Compass
Download MongoDB shell tool
Download MongoDB Database CLI tools

Install Docker and Docker compose

One of the requirements of the midterm is to have Docker and Docker compose on your machine. You need to make sure that Docker engine and Docker compose work on your machine. You can test the docker-compose.yml in the repository that is shared with you in lab 4. You can follow the instructions in the official website to install Docker and Docker compose.

Install Apache Hadoop

There are mainly two ways to install Hadoop.

Without Docker

You can install Hadoop on your local machine as follows. If you have Windows OS then you can use Ubuntu WSL which is the setup of my machine. Make sure that you have sudo access to the machine.

Install openssh-server.

sudo apt-get install -y openssh-server

Install Java 8, and any other packages you prefer.

sudo apt-get -y update
sudo apt-get install -y wget openjdk-8-jdk vim

You can check the version of the java after installation.

java -version

# Output
# openjdk version "17.0.14" 2025-01-21
# OpenJDK Runtime Environment (build 17.0.14+7-Ubuntu-122.04.1)
# OpenJDK 64-Bit Server VM (build 17.0.14+7-Ubuntu-122.04.1, mixed mode, sharing)

As you can see that the version is different from Java 8. To switch between installed java versions, use the update-java-alternatives command.

# List all java versions:
update-java-alternatives -l

# output
# java-1.11.0-openjdk-amd64      1111       /usr/lib/jvm/java-1.11.0-openjdk-amd64
# java-1.17.0-openjdk-amd64      1711       /usr/lib/jvm/java-1.17.0-openjdk-amd64
# java-1.8.0-openjdk-amd64       1081       /usr/lib/jvm/java-1.8.0-openjdk-amd64

Here you can see that I have Java 8, 11, and 17 installed. You can switch to Java 8 as follows.

sudo update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64

Then we can check the version again.

java -version
# openjdk version "1.8.0_442"
# OpenJDK Runtime Environment (build 1.8.0_442-8u442-b06~us1-0ubuntu1~22.04-b06)
# OpenJDK 64-Bit Server VM (build 25.442-b06, mixed mode)

Set JAVA_HOME environment variable.

echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc

source ~/.bashrc
echo $JAVA_HOME
# output
# /usr/lib/jvm/java-8-openjdk-amd64

Download Hadoop

sudo wget -O /hadoop.tar.gz http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz

Unzip it and store it in some folder (/usr/local/hadoop) which will be the hadoop home directory.

cd / && sudo tar xfz hadoop.tar.gz
sudo mv /hadoop-3.3.1 /usr/local/hadoop
sudo rm /hadoop.tar.gz

Set Hadoop home directory.

echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
source ~/.bashrc
echo $HADOOP_HOME
# output
# /usr/local/hadoop

Add Hadoop binaries to system PATH.

echo "export PATH='$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin'" >>  ~/.bashrc

source ~/.bashrc

Create directories for storing data in the namenode and datanodes.

mkdir -p $HADOOP_HOME/hdfs/namenode
mkdir -p $HADOOP_HOME/hdfs/datanode

Configuring Hadoop

<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <!--    Put here your username      -->
        <value>firasj</value>
    </property>
</configuration>

<!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Check that you can ssh to the localhost without a passphrase or prompting.

ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

# If you have an ssh key, skip the following line
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Add the public key to as an authorized key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# prevent other users to read this file
chmod 0600 ~/.ssh/authorized_keys

Also add the following to ~/.shh/config file.

Host localhost
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

You can now run the command ssh <Your-hostname> (ssh localhost) to check whether it will prompt or ask for passphrase. You can check the file /etc/hosts for host configuration.
11. Specify configuration for Hadoop binaries in the file ($HADOOP_HOME/etc/hadoop/hadoop-env.sh).

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop

# Make sure that you added your username for services.
export HDFS_NAMENODE_USER="firasj"
export HDFS_DATANODE_USER="firasj"
export HDFS_SECONDARYNAMENODE_USER="firasj"
export YARN_RESOURCEMANAGER_USER="firasj"
export YARN_NODEMANAGER_USER="firasj"

export HADOOP_SSH_OPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"

Format the HDFS file system (only for the first time installation). This will delete everything in HDFS.

hdfs namenode -format

Start namenode, Secondary namenodes and datanodes

start-dfs.sh

The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
14. Browse the web ui for the NameNode; by default it is available at:

NameNode - http://localhost:9870

Test some HDFS commands.

hdfs fsck /

When you are done, stop HDFS daemons.

stop-dfs.sh

Start YARN resource manager and node managers.

start-yarn.sh

Check if you can browse the web ui for the YARN resource manager. By default it is available at:

ResourceManager - http://localhost:8088

Start MapReduce history server. Make sure that HDFS is running before you run this.

mapred --daemon start historyserver

Check if you can browse the web ui for the MapReduce history server. By default it is available at:

MapReduce History Server - http://localhost:19888

The table below shows some of the default ports for the running services.

Service	Port
HDFS namenode	9870
HDFS datanode	9864
HDFS secondary namenode	9868
YARN resource manager	8088
YARN node manager	8042
MapReduce history server	19888

This was a single node installation for Hadoop and for fully-distributed installation, you need to use Docker in the next section.

Using Docker

Here you can follow instructions in the repository shared with you in lab4 to check the steps of installation for pseduo-distributed and fully-distributed modes.

https://github.com/firas-jolha/docker-spark-yarn-cluster

Install Spark and PySpark package

You can install Spark by following the steps shared in lab4 and make sure that you can run pyspark applications on local machine, on Spark cluster and on YARN cluster.

Install PostgreSQL

Without Docker

Ubuntu includes PostgreSQL by default. To install PostgreSQL on Ubuntu, use the apt (or other apt-driving) command:

apt install postgresql

You can check the installation by running:

psql -V
# output
# psql (PostgreSQL) 17.4 (Ubuntu 17.4-1.pgdg22.04+2)

Using Docker

You can install it via Docker as follows:

docker pull postgres

Then run a container.

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres

Install Neo4j server - Enterprise edition

Without Docker

You can download the server from the official website from here. You can also follow the direct download links below.

Before you install Neo4j server, stop/kill all Neo4j services if you have running ones.

The installation steps are very similar with slight difference in Windows. Please follow the installation approach according to your local operating system.

Windows user

You need to download the zipped folder, unzip it and you can access the server folder as shown below.

The server CLI tool is .\bin\neo4j.bat. In order to run this server, you need to install it as Windows service. You can do that by simply running the command as follows.

.\bin\neo4j.bat windows-service install

You can check that the service is installed by visiting the local Windows services services.msc.

Notice that you may have issues as follows:

This means that you are already have a neo4j service with the same name as shown below.

You can fix it by renaming the service name in .\conf\neo4j.conf to a different name as follows.

and install the service again with the new name.

If you successfully installed the service, the service will be added to services.msc and you would see such output Neo4j service installed..

Now the server is not running. We can change the configuration of the server in .\conf\neo4j.conf.

Before you start the server, you have to accept the evaluation license as follows:

.\bin\neo4j-admin server license --accept-evaluation

We can run the server as follows:

.\bin\neo4j.bat start

Check in services.msc that it is running. After running the server, it will give you an address where you can access the Neo4j DBMS instance. For the default settings, you can access the server on http://localhost:7474

You can connect to the server in the browser as follows (the default username and password are neo4j):

:server connect

Then it will ask you to change the default password.

You can set a password like neo4jneo4j, then you will be able to connect to the server.

We can stop the service as follows:

.\bin\neo4j.bat stop

For more information about this tool, you can run .\bin\neo4j.bat --help.

You can delete the service from your Windows system services.msc as follows:

C:\Windows\System32\sc.exe delete ServiceName

Where ServiceName is your Neo4j Windows service name neo4j2.

Ubuntu user (tested on Ubuntu 22.04.3)

You first need to unzip the compressed folder as follows:

tar zxf neo4j-enterprise-5.17.0-unix.tar.gz

You will get the server instance folder as follows:

Access that folder and you will see as follows:

The server CLI tool is .\bin\neo4j. Now the server is not running. We can change the configuration of the server in ./conf/neo4j.conf.

Make sure that you have Java 21 installed and your JAVA_HOME is set to the path of the home directory of Java.

You can install Java 21 on Ubuntu 22 as follows:

> sudo apt update
> sudo apt install openjdk-21-jdk

Before you start the server, you have to accept the evaluation license as follows:

./bin/neo4j-admin server license --accept-evaluation

We can run the server as follows:

./bin/neo4j start

After running the server, it will give you an address where you can access the Neo4j DBMS instance. For the default settings, you can access the server on http://localhost:7474

You can connect to the server in the browser as follows (the default username and password are neo4j):

:server connect

Then it will ask you to change the default password.

You can set a password like neo4jneo4j, then you will be able to connect to the server.

We can stop the service as follows:

./bin/neo4j stop

Or by killing the process whose pid is given when you started the server.

For instance, to kill the server running at the process (pid:5899), we can write as follows:

kill 5899

If it does not terminate with SIGTERM then send SIGKILL signal to force the termination as follows:

kill -9 5899

For more information about this tool, you can run ./bin/neo4j --help.

Using Docker

You can pull the docker image for enterprise edition of Neo4j server as follows.

docker pull neo4j:5.17.0-enterprise

Then run a container.

docker run --name neo4j_server_enterprise -p 7474:7474 -p 7687:7687 --env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes -v $HOME/neo4j/data:/data neo4j:5.17.0-enterprise

You can connect to the server in the browser as follows (the default username and password are neo4j):

:server connect

Then it will ask you to change the default password.

You can set a password like neo4jneo4j, then you will be able to connect to the server. You can stop the server by running the command docker stop neo4j_server_enterprise.

Install MongoDB tools

Before you install MongoDB server, stop/kill all MongoDB services if you have running ones.

Here we need to download and install four tools:

The server (Community edition is enough)
- Wihtout Docker
  - Download website
    - https://www.mongodb.com/try/download/community
    - Make sure that the package is tgz (which includes both the server and mongos)
  - Go to the directory where the server is installed and locate the bin folder. Add the path of this bin folder to PATH environment variable.
    
    Note: Make sure that you have both the database server mongod and mongos executables.
- Using Docker:
  - You can also install this server using Docker by pulling the image and creating a container from it.
```
docker run --name mongodb_server -p 27017:27017 -d mongodb/mongodb-community-server:latest
```
The compass
- You need to download it from official webite.
  - https://www.mongodb.com/try/download/compass
The shell (mongosh)
- You can download it from here and you do not need to install it .
- Go to the directory where the shell is downloaded and locate the bin folder. Add the path of this bin folder to PATH environment variable.
The Database CLI tools
- You can download these tools using the instructions provided in lab 3.

Self-check tasks

Check that you installed Docker and Docker compose.
Check that you installed all tools related to Hadoop HDFS, YARN and MapReduce.
Check that you installed all Spark and PySpark package.
Check that you installed PostgreSQL server.
Check that you installed all required tools for mongodb (The compass, the server, database CLI tools and the shell) and can access the databases.
Check that you can start a Neo4j server in Enterprise edition.
Check that you have a Google account to Colab notebooks.

Good luck in the midterm