Midterm Prerequisites 2026

Course: Big Data - IU S26
Date: Tuesday - March 12, 2026
Start time: 12:50
Total points: 100/4

I tested most of the tools here on WSL 2 installed on Windows 11. The output below shows the version of the operating system that runs the services. Some UI tools, such as MongoDB Compass, are installed directly on Windows 11.

> cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Hadoop Spark Cluster using Docker

Agenda

Objectives

This tutorial lists the requirements that you need to check and prepare before you come to the midterm. If you do not meet them, you will have to satisfy them during the midterm itself, which will eat into your time, so make sure that your PC meets the requirements beforehand. If you have issues related to the requirements, please contact your TA at least three days before the midterm.

Hardware requirements

Software requirements

Install Docker and Docker Compose

One of the requirements of the midterm is to have Docker and Docker Compose on your machine. Make sure that both the Docker Engine and Docker Compose work. You can test them with the docker-compose.yml in the repository shared with you in the labs, and you can follow the instructions on the official website to install Docker and Docker Compose.
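A quick way to confirm both tools are available before the midterm; this is only a sketch and simply reports which of the two commands resolve, without assuming anything about how you installed them:

```shell
# Check that the Docker engine and the Compose plugin respond on this machine.
for tool in "docker --version" "docker compose version"; do
  if $tool >/dev/null 2>&1; then
    echo "OK: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```

If either line reports MISSING, revisit the installation instructions before the exam day.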

Install Apache Hadoop

There are two main modes for installing Hadoop.

You should test both modes; neither is optional, since the mode you will use is determined at midterm time.

Pseudo-distributed mode without using Docker

You can install Hadoop on your local machine as follows. If you run Windows, you can use Ubuntu on WSL, which is the setup of my machine. Make sure that you have sudo access to the machine.

  1. Install openssh-server.
sudo apt-get install -y openssh-server
  2. Install Java 8, and any other packages you prefer.
sudo apt-get -y update
sudo apt-get install -y wget openjdk-8-jdk vim

You can check the Java version after installation.

java -version

# Output
# openjdk version "17.0.14" 2025-01-21
# OpenJDK Runtime Environment (build 17.0.14+7-Ubuntu-122.04.1)
# OpenJDK 64-Bit Server VM (build 17.0.14+7-Ubuntu-122.04.1, mixed mode, sharing)

As you can see, the installed version is not Java 8. To switch between installed Java versions, use the update-java-alternatives command.

# List all java versions:
update-java-alternatives -l

# output
# java-1.11.0-openjdk-amd64      1111       /usr/lib/jvm/java-1.11.0-openjdk-amd64
# java-1.17.0-openjdk-amd64      1711       /usr/lib/jvm/java-1.17.0-openjdk-amd64
# java-1.8.0-openjdk-amd64       1081       /usr/lib/jvm/java-1.8.0-openjdk-amd64

Here you can see that I have Java 8, 11, and 17 installed. You can switch to Java 8 as follows.

sudo update-java-alternatives --set java-1.8.0-openjdk-amd64

Then we can check the version again.

java -version
# openjdk version "1.8.0_442"
# OpenJDK Runtime Environment (build 1.8.0_442-8u442-b06~us1-0ubuntu1~22.04-b06)
# OpenJDK 64-Bit Server VM (build 25.442-b06, mixed mode)
  3. Set the JAVA_HOME environment variable.
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc

source ~/.bashrc
echo $JAVA_HOME
# output
# /usr/lib/jvm/java-8-openjdk-amd64
  4. Download Hadoop.
sudo wget -O /hadoop.tar.gz http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz
  5. Extract it into a folder (/usr/local/hadoop), which will be the Hadoop home directory.
cd / && sudo tar xfz hadoop.tar.gz
sudo mv /hadoop-3.3.1 /usr/local/hadoop
sudo rm /hadoop.tar.gz
  6. Set the Hadoop home directory.
echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
source ~/.bashrc
echo $HADOOP_HOME
# output
# /usr/local/hadoop
  7. Add the Hadoop binaries to the system PATH. Note the single quotes: they keep the variables unexpanded in ~/.bashrc, so PATH is built fresh on every shell startup instead of being frozen at write time.
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc

source ~/.bashrc
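After sourcing ~/.bashrc, the Hadoop binaries should resolve from any directory. A small check, which only prints a hint if the PATH edit did not take effect:

```shell
# Confirm the hadoop binary is now on PATH; print a hint otherwise.
if command -v hadoop >/dev/null 2>&1; then
  hadoop version | head -n 1
else
  echo "hadoop not found on PATH; re-check the export lines in ~/.bashrc"
fi
```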
  8. Create directories for storing data in the namenode and datanodes.
mkdir -p $HADOOP_HOME/hdfs/namenode
mkdir -p $HADOOP_HOME/hdfs/datanode
  9. Configure Hadoop.
<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <!--    Put here your username      -->
        <value>firasj</value>
    </property>
</configuration>
<!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
  10. Check that you can ssh to localhost without a passphrase or prompting.
ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

# If you have an ssh key, skip the following line
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Add the public key as an authorized key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Prevent other users from reading this file
chmod 0600 ~/.ssh/authorized_keys

Also add the following to the ~/.ssh/config file.

Host localhost
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

You can now run ssh <Your-hostname> (here, ssh localhost) to check whether it still prompts or asks for a passphrase. You can check the file /etc/hosts for host configuration.
  11. Specify configuration for the Hadoop binaries in the file ($HADOOP_HOME/etc/hadoop/hadoop-env.sh).
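A non-interactive way to verify this: with BatchMode enabled, ssh fails rather than prompting, so the command below reports clearly whether passwordless login works (it assumes sshd is running on the machine):

```shell
# BatchMode forbids password/passphrase prompts, so a failure here means
# the key setup above is incomplete.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  echo "passwordless ssh to localhost: OK"
else
  echo "passwordless ssh to localhost: FAILED (check keys and sshd)"
fi
```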

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop

# Make sure that you added your username for services.
export HDFS_NAMENODE_USER="firasj"
export HDFS_DATANODE_USER="firasj"
export HDFS_SECONDARYNAMENODE_USER="firasj"
export YARN_RESOURCEMANAGER_USER="firasj"
export YARN_NODEMANAGER_USER="firasj"

export HADOOP_SSH_OPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"
  12. Format the HDFS file system (first-time installation only). This will delete everything in HDFS.
hdfs namenode -format
  13. Start the namenode, secondary namenode, and datanodes.
start-dfs.sh

The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
  14. Browse the web UI for the NameNode; by default it is available at http://localhost:9870.

  15. Test some HDFS commands.
hdfs fsck /
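Beyond fsck, a short round trip exercises both the write and read paths. This is a sketch assuming the daemons from start-dfs.sh are up; the file and directory names are arbitrary:

```shell
# Skip gracefully if Hadoop is not on PATH yet.
command -v hdfs >/dev/null 2>&1 || { echo "hdfs not on PATH; finish the steps above"; exit 0; }

hdfs dfs -mkdir -p /user/$USER          # create your HDFS home directory
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -put -f /tmp/hello.txt /user/$USER/
hdfs dfs -ls /user/$USER                # the file should be listed
hdfs dfs -cat /user/$USER/hello.txt     # should print the file contents back
```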
  16. When you are done, stop the HDFS daemons.
stop-dfs.sh
  17. Start the YARN resource manager and node managers.
start-yarn.sh

Check if you can browse the web UI for the YARN resource manager. By default it is available at http://localhost:8088.

  18. Start the MapReduce history server. Make sure that HDFS is running before you run this.
mapred --daemon start historyserver

Check if you can browse the web UI for the MapReduce history server. By default it is available at http://localhost:19888.

The table below shows some of the default ports for the running services.

Service                     Port
HDFS namenode               9870
HDFS datanode               9864
HDFS secondary namenode     9868
YARN resource manager       8088
YARN node manager           8042
MapReduce history server    19888

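With the daemons running, you can probe these ports from the shell without opening a browser. The loop below uses bash's built-in /dev/tcp and simply reports DOWN for anything not listening, so it is safe to run at any point:

```shell
# Probe the default Hadoop/YARN web UI ports on localhost (uses bash's /dev/tcp).
for port in 9870 9864 9868 8088 8042 19888; do
  if (exec 3<>/dev/tcp/localhost/$port) 2>/dev/null; then
    echo "port $port: UP"
  else
    echo "port $port: DOWN"
  fi
done
```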
This was a single-node installation of Hadoop; for a fully-distributed installation, use Docker as described in the next section.

Fully-distributed mode using Docker

Here you can follow the instructions in the repository shared with you in the labs to check the installation steps for the pseudo-distributed and fully-distributed modes.

Install MapReduce

If you successfully installed Hadoop, then you should be able to run MapReduce pipelines on the YARN cluster using the mapred streaming command.
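As a smoke test, the classic word count runs with plain shell utilities as mapper and reducer. This is a sketch assuming HDFS and YARN are up; the paths are arbitrary, and uniq -c works as a reducer because MapReduce sorts keys before the reduce phase:

```shell
# Skip gracefully if Hadoop is not installed yet.
command -v mapred >/dev/null 2>&1 || { echo "mapred not on PATH"; exit 0; }

hdfs dfs -mkdir -p /tmp/wc-in
echo "foo bar foo bar foo" | hdfs dfs -put -f - /tmp/wc-in/words.txt
hdfs dfs -rm -r -f /tmp/wc-out           # the output directory must not exist
mapred streaming \
  -input  /tmp/wc-in \
  -output /tmp/wc-out \
  -mapper  'tr " " "\n"' \
  -reducer 'uniq -c'
hdfs dfs -cat /tmp/wc-out/part-*         # counts for foo and bar
```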

Install Spark and PySpark package

You can install Spark by following the steps shared in the labs. Make sure that you can run PySpark applications on the local machine, on a Spark cluster, and on a YARN cluster.
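A minimal smoke test for local mode: submit a tiny PySpark job that sums 0..9. The script path is arbitrary; swap the --master value for your Spark cluster or YARN to test the other modes:

```shell
# Skip gracefully if Spark is not installed yet.
command -v spark-submit >/dev/null 2>&1 || { echo "spark-submit not on PATH"; exit 0; }

cat > /tmp/smoke.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10))
print("sum of 0..9 =", rdd.sum())
spark.stop()
EOF

spark-submit --master 'local[*]' /tmp/smoke.py
```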

Install PostgreSQL

Without Docker

Ubuntu includes PostgreSQL in its default repositories. If it is not already installed, use apt to install it:

sudo apt install postgresql

You can check the installation by running:

psql -V
# output
# psql (PostgreSQL) 17.4 (Ubuntu 17.4-1.pgdg22.04+2)

Using Docker

You can install it via Docker as follows:

docker pull postgres

Then run a container.

docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
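Once the container is up, you can verify that it accepts connections with pg_isready, which ships inside the official image. The container name and user follow the run command above; the fallback message is just a hint if Docker or the container is unavailable:

```shell
# Ask the server inside the container whether it accepts connections.
docker exec some-postgres pg_isready -U postgres 2>/dev/null \
  || echo "container not reachable; is some-postgres running?"
```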

Install Memgraph

You can download the Docker image for Memgraph and run a container. You can find more details on this official page. Memgraph comes with different Docker images. The main repositories that contain Memgraph are:

There are also two additional standalone images that do not include Memgraph:

For this lab, you need just two containers: one for the graph engine and another for the web user interface. You can run a multi-container Memgraph setup as follows:

services:
  memgraph:
    image: memgraph/memgraph-mage:latest
    container_name: memgraph-mage
    ports:
      - "7687:7687"
      - "7444:7444"
    command: ["--log-level=TRACE"]

  lab:
    image: memgraph/lab:latest
    container_name: memgraph-lab
    ports:
      - "3001:3000"
    depends_on:
      - memgraph
    environment:
      - QUICK_CONNECT_MG_HOST=memgraph
      - QUICK_CONNECT_MG_PORT=7687

You can just put this content in a file docker-compose.yaml and then run it using docker compose up -d. Then you can open Memgraph Lab (http://localhost:3001) in the browser to access the web UI.

If you chose not to use the Memgraph Lab web UI, you can access the graph console directly as follows:

docker exec -it memgraph-mage mgconsole

This will open the Memgraph console, where you can write your queries in the Cypher language.
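For a quick scripted check, you can also pipe Cypher into mgconsole. This is a sketch with an arbitrary example node, assuming the memgraph-mage container from the compose file above is running:

```shell
# Create a node and read it back; fall back to a hint if the container is down.
echo 'CREATE (:Person {name: "Ada"}); MATCH (p:Person) RETURN p.name;' \
  | docker exec -i memgraph-mage mgconsole 2>/dev/null \
  || echo "memgraph-mage not reachable; run docker compose up -d first"
```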

Install MongoDB tools

Before you install the MongoDB server, stop/kill any MongoDB services that are already running.

Here we need to download and install four tools:
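Once installed, you can confirm that the command-line pieces are on PATH; Compass is a GUI application and is verified by launching it instead. The binary names below are an assumption based on the self-check list (server, shell, and database CLI tools):

```shell
# Check the MongoDB command-line tools; Compass (GUI) is checked separately.
for tool in mongod mongosh mongodump mongoimport; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```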

Self-check tasks

  1. Check that you installed Docker and Docker compose.
  2. Check that you installed all tools related to Hadoop HDFS, YARN and MapReduce.
  3. Check that you installed the Spark and PySpark packages.
  4. Check that you installed PostgreSQL server.
  5. Check that you installed all required tools for MongoDB (Compass, the server, the database CLI tools, and the shell) and can access the databases.
  6. Check that you can start a Memgraph server and access the UI.
  7. Check that you have a Google account to access Colab notebooks.

Good luck in the midterm :blush: