Midterm Prerequisites 2026

Course: Big Data - IU S26
Date: Tuesday - March 12, 2026
Start time: 12:50
Total points: 100/4

I tested most of the tools here on WSL 2 installed on Windows 11. You can see below the version of the operating system that runs the services. Some UI tools like MongoDB Compass is installed directly on Windows 11.

> cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Hadoop Spark Cluster using Docker

docker-spark-yarn-cluster

Agenda

Midterm Prerequisites 2026
Hadoop Spark Cluster using Docker
Agenda
Objectives
Hardware requirements
Software requirements
Install Docker and Docker compose
Install Apache Hadoop
- Pesudo-distributed mode without using Docker
- Fully-distributed mode Using Docker
Install MapReduce
Install Spark and PySpark package
Install PostgreSQL
- Without Docker
- Using Docker
Install Memgraph
Install MongoDB tools
Self-check tasks

Objectives

This tutorial is written to provide you with the requirements that you need to check and prepare before you come to the midterm. If you do not meet the requirements, then you need to satisfy them during the midterm and it will take from your time, so make sure that your PC meets the requirements. If you have issues related to the requirements, please contact your TA at least three days before the midterm.

Hardware requirements

CPU (4 cores or more)
Memory (6-8GB or more)
- If you have 4GB then you can use the resources on (vm.innopolis.university) to install some services like Hadoop or PostgreSQL and others on your local machine.
- We do not need HDP Sandbox for labs anymore, so you can delete it.
Storage (~50 GB)

Software requirements

You have an Ubuntu (e.g. 22.04) or Windows operating system with Ubuntu WSL (22.04). There is no guarantee that the tools will work with other Linux distro or MacOS but you have to check them with this tutorial. Please do not check them during the exam, you will waste your time. You should come ready for the exam.
Install Docker and Docker Compose
Install Apache Hadoop 3
Install Apache Spark 3
Install PostgreSQL
Install Memgraph
Install MongoDB

Install Docker and Docker compose

One of the requirements of the midterm is to have Docker and Docker compose on your machine. You need to make sure that Docker engine and Docker compose work on your machine. You can test the docker-compose.yml in the repository that is shared with you in the labs. You can follow the instructions in the official website to install Docker and Docker compose.

Install Apache Hadoop

There are mainly two modes to install Hadoop.

You should test both modes and there is no optional modes here but the mode is determined during the midterm time.

Pesudo-distributed mode without using Docker

You can install Hadoop on your local machine as follows. If you have Windows OS then you can use Ubuntu WSL which is the setup of my machine. Make sure that you have sudo access to the machine.

Install openssh-server.

sudo apt-get install -y openssh-server

Install Java 8, and any other packages you prefer.

sudo apt-get -y update
sudo apt-get install -y wget openjdk-8-jdk vim

You can check the version of the java after installation.

java -version

# Output
# openjdk version "17.0.14" 2025-01-21
# OpenJDK Runtime Environment (build 17.0.14+7-Ubuntu-122.04.1)
# OpenJDK 64-Bit Server VM (build 17.0.14+7-Ubuntu-122.04.1, mixed mode, sharing)

As you can see that the version is different from Java 8. To switch between installed java versions, use the update-java-alternatives command.

# List all java versions:
update-java-alternatives -l

# output
# java-1.11.0-openjdk-amd64      1111       /usr/lib/jvm/java-1.11.0-openjdk-amd64
# java-1.17.0-openjdk-amd64      1711       /usr/lib/jvm/java-1.17.0-openjdk-amd64
# java-1.8.0-openjdk-amd64       1081       /usr/lib/jvm/java-1.8.0-openjdk-amd64

Here you can see that I have Java 8, 11, and 17 installed. You can switch to Java 8 as follows.

sudo update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64

Then we can check the version again.

java -version
# openjdk version "1.8.0_442"
# OpenJDK Runtime Environment (build 1.8.0_442-8u442-b06~us1-0ubuntu1~22.04-b06)
# OpenJDK 64-Bit Server VM (build 25.442-b06, mixed mode)

Set JAVA_HOME environment variable.

echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc

source ~/.bashrc
echo $JAVA_HOME
# output
# /usr/lib/jvm/java-8-openjdk-amd64

Download Hadoop

sudo wget -O /hadoop.tar.gz http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz

Unzip it and store it in some folder (/usr/local/hadoop) which will be the hadoop home directory.

cd / && sudo tar xfz hadoop.tar.gz
sudo mv /hadoop-3.3.1 /usr/local/hadoop
sudo rm /hadoop.tar.gz

Set Hadoop home directory.

echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
source ~/.bashrc
echo $HADOOP_HOME
# output
# /usr/local/hadoop

Add Hadoop binaries to system PATH.

echo "export PATH='$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin'" >>  ~/.bashrc

source ~/.bashrc

Create directories for storing data in the namenode and datanodes.

mkdir -p $HADOOP_HOME/hdfs/namenode
mkdir -p $HADOOP_HOME/hdfs/datanode

Configuring Hadoop

<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <!--    Put here your username      -->
        <value>firasj</value>
    </property>
</configuration>

<!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Check that you can ssh to the localhost without a passphrase or prompting.

ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

# If you have an ssh key, skip the following line
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Add the public key to as an authorized key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# prevent other users to read this file
chmod 0600 ~/.ssh/authorized_keys

Also add the following to ~/.shh/config file.

Host localhost
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

You can now run the command ssh <Your-hostname> (ssh localhost) to check whether it will prompt or ask for passphrase. You can check the file /etc/hosts for host configuration.
11. Specify configuration for Hadoop binaries in the file ($HADOOP_HOME/etc/hadoop/hadoop-env.sh).

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop

# Make sure that you added your username for services.
export HDFS_NAMENODE_USER="firasj"
export HDFS_DATANODE_USER="firasj"
export HDFS_SECONDARYNAMENODE_USER="firasj"
export YARN_RESOURCEMANAGER_USER="firasj"
export YARN_NODEMANAGER_USER="firasj"

export HADOOP_SSH_OPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"

Format the HDFS file system (only for the first time installation). This will delete everything in HDFS.

hdfs namenode -format

Start namenode, Secondary namenodes and datanodes

start-dfs.sh

The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
14. Browse the web ui for the NameNode; by default it is available at:

NameNode - http://localhost:9870

Test some HDFS commands.

hdfs fsck /

When you are done, stop HDFS daemons.

stop-dfs.sh

Start YARN resource manager and node managers.

start-yarn.sh

Check if you can browse the web ui for the YARN resource manager. By default it is available at:

ResourceManager - http://localhost:8088

Start MapReduce history server. Make sure that HDFS is running before you run this.

mapred --daemon start historyserver

Check if you can browse the web ui for the MapReduce history server. By default it is available at:

MapReduce History Server - http://localhost:19888

The table below shows some of the default ports for the running services.

Service	Port
HDFS namenode	9870
HDFS datanode	9864
HDFS secondary namenode	9868
YARN resource manager	8088
YARN node manager	8042
MapReduce history server	19888

This was a single node installation for Hadoop and for fully-distributed installation, you need to use Docker in the next section.

Fully-distributed mode Using Docker

Here you can follow instructions in the repository shared with you in the labs to check the steps of installation for pseduo-distributed and fully-distributed modes.

https://github.com/firas-jolha/docker-spark-yarn-cluster

Install MapReduce

If you successfully installed Hadoop, then you should be able to run MapReduce pipelines using mapred streaming command on YARN cluster.

Install Spark and PySpark package

You can install Spark by following the steps shared in the labs and make sure that you can run pyspark applications on local machine, on Spark cluster and on YARN cluster.

Install PostgreSQL

Without Docker

Ubuntu includes PostgreSQL by default, if not then use the apt (or other apt-driving) command to install PostgreSQL on Ubuntu:

apt install postgresql

You can check the installation by running:

psql -V
# output
# psql (PostgreSQL) 17.4 (Ubuntu 17.4-1.pgdg22.04+2)

Using Docker

You can install it via Docker as follows:

docker pull postgres

Then run a container.

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres

Install Memgraph

You can download the Docker image for Memgraph and run a container. You can find more detail in this official page. Memgraph comes with different Docker images. The main repositories that contain memgraph are:

memgraph/memgraph-mage - includes Memgraph database, command-line interface mgconsole and MAGE graph algorithms library. If tagged with cuGraph, it also includes NVIDIA cuGraph GPU-powered graph algorithms.
memgraph/memgraph - includes Memgraph database and command-line interface mgconsole.

There are also two additional standalone images that do not include the Memgraph:

memgraph/lab - includes a web interface Memgraph Lab that helps you explore the data stored in Memgraph.
memgraph/mgconsole - includes a command-line interface mgconsole that allows you to interact with Memgraph from the command line.

For this lab, you can just need two containers. One is the graph engine and another the web user interface. You can run a multi-container memgraph as follows:

services:
  memgraph:
    image: memgraph/memgraph-mage:latest
    container_name: memgraph-mage
    ports:
      - "7687:7687"
      - "7444:7444"
    command: ["--log-level=TRACE"]

  lab:
    image: memgraph/lab:latest
    container_name: memgraph-lab
    ports:
      - "3001:3000"
    depends_on:
      - memgraph
    environment:
      - QUICK_CONNECT_MG_HOST=memgraph
      - QUICK_CONNECT_MG_PORT=7687

You can just put this content in a file docker-compose.yaml and then run it using docker compose up -d. Then you can open the Memgraph lab (http://localhost:3001) in the browser to access the web UI as shown below.

You can also access the console of the graph if you selected to not use the Memgraph lab web UI as follows:

docker exec -it memgraph-mage mgconsole

This will open the console for Memgraph where you can write your queries in Cypher language.

Install MongoDB tools

Before you install MongoDB server, stop/kill all MongoDB services if you have running ones.

Here we need to download and install four tools:

The server (Community edition is enough)
- Without Docker
  - Download website
    - https://www.mongodb.com/try/download/community
  - Go to the directory where the server is installed and locate the bin folder. Add the path of this bin folder to PATH environment variable.
    
    Note: Make sure that you have all the executables as shown in the screenshot above.
- Using Docker:
  - You can also install this server using Docker by pulling the image and creating a container from it.
```
docker run --name mongodb_server -p 27017:27017 -d mongodb/mongodb-community-server:latest
```
The compass
- You need to download it from official webite.
  - https://www.mongodb.com/try/download/compass
The shell (mongosh)
- You can download it from here and you do not need to install it .
- Go to the directory where the shell is downloaded and locate the bin folder. Add the path of this bin folder to PATH environment variable.
The Database CLI tools
- You can download these tools using the instructions provided in the lab materials.

Self-check tasks

Check that you installed Docker and Docker compose.
Check that you installed all tools related to Hadoop HDFS, YARN and MapReduce.
Check that you installed all Spark and PySpark package.
Check that you installed PostgreSQL server.
Check that you installed all required tools for mongodb (The compass, the server, database CLI tools and the shell) and can access the databases.
Check that you can start a Memgraph server and access the UI.
Check that you have a Google account to Colab notebooks.

Good luck in the midterm