Course: Big Data - IU S26
Date: Tuesday - March 12, 2026
Start time: 12:50
Total points: 100/4
I tested most of the tools here on WSL 2 installed on Windows 11. The output below shows the version of the operating system that runs the services. Some UI tools, like MongoDB Compass, are installed directly on Windows 11.
> cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
This tutorial lists the requirements that you need to check and prepare before you come to the midterm. If you do not meet them, you will have to satisfy them during the midterm itself, which will eat into your time, so make sure in advance that your PC meets the requirements. If you have issues related to the requirements, please contact your TA at least three days before the midterm.
One of the requirements of the midterm is to have Docker and Docker Compose on your machine. Make sure that both the Docker Engine and Docker Compose work. You can test them with the docker-compose.yml in the repository shared with you in the labs, and you can follow the instructions on the official website to install Docker and Docker Compose.
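If you want a quick standalone sanity check in addition to the labs file, a minimal compose file such as the following should come up cleanly with docker compose up -d. The service and container names here are placeholders of mine, not part of the midterm setup:

```yaml
# docker-compose.yml - minimal smoke test for Docker Engine + Docker Compose.
# The hello-world image and the names below are placeholders, not midterm material.
services:
  hello:
    image: hello-world
    container_name: compose-smoke-test
```

If docker compose up -d runs this without errors (the container prints its greeting and exits), your Docker installation is ready.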
There are mainly two modes to install Hadoop. You should test both modes; neither is optional, since the mode you will use is determined at midterm time.
You can install Hadoop on your local machine as follows. If you have Windows, you can use Ubuntu on WSL, which is the setup of my machine. Make sure that you have sudo access to the machine, then install openssh-server:
sudo apt-get install -y openssh-server
sudo apt-get -y update
sudo apt-get install -y wget openjdk-8-jdk vim
You can check the version of Java after installation.
java -version
# Output
# openjdk version "17.0.14" 2025-01-21
# OpenJDK Runtime Environment (build 17.0.14+7-Ubuntu-122.04.1)
# OpenJDK 64-Bit Server VM (build 17.0.14+7-Ubuntu-122.04.1, mixed mode, sharing)
As you can see, the active version differs from Java 8. To switch between installed Java versions, use the update-java-alternatives command.
# List all java versions:
update-java-alternatives -l
# output
# java-1.11.0-openjdk-amd64 1111 /usr/lib/jvm/java-1.11.0-openjdk-amd64
# java-1.17.0-openjdk-amd64 1711 /usr/lib/jvm/java-1.17.0-openjdk-amd64
# java-1.8.0-openjdk-amd64 1081 /usr/lib/jvm/java-1.8.0-openjdk-amd64
Here you can see that I have Java 8, 11, and 17 installed. You can switch to Java 8 as follows.
# --set takes the version name shown by the -l listing
sudo update-java-alternatives --set java-1.8.0-openjdk-amd64
Then we can check the version again.
java -version
# openjdk version "1.8.0_442"
# OpenJDK Runtime Environment (build 1.8.0_442-8u442-b06~us1-0ubuntu1~22.04-b06)
# OpenJDK 64-Bit Server VM (build 25.442-b06, mixed mode)
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
source ~/.bashrc
echo $JAVA_HOME
# output
# /usr/lib/jvm/java-8-openjdk-amd64
sudo wget -O /hadoop.tar.gz http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz
cd / && sudo tar xfz hadoop.tar.gz
sudo mv /hadoop-3.3.1 /usr/local/hadoop
sudo rm /hadoop.tar.gz
echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
source ~/.bashrc
echo $HADOOP_HOME
# output
# /usr/local/hadoop
echo "export PATH='$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin'" >> ~/.bashrc
source ~/.bashrc
# /usr/local/hadoop is owned by root after the move, so take ownership first
sudo chown -R $USER:$USER $HADOOP_HOME
mkdir -p $HADOOP_HOME/hdfs/namenode
mkdir -p $HADOOP_HOME/hdfs/datanode
<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<!-- Put here your username -->
<value>firasj</value>
</property>
</configuration>
<!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
# If you have an ssh key, skip the following line
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Add the public key as an authorized key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Prevent other users from reading this file
chmod 0600 ~/.ssh/authorized_keys
Also add the following to the ~/.ssh/config file.
Host localhost
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
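As a convenience, the steps above can be scripted so they are safe to re-run. This sketch is my addition, not part of the original instructions; the SSH_CONFIG variable (hypothetical) lets you point it at a different file for testing:

```shell
# Append the Host localhost entry to the SSH config only if it is not already there.
# SSH_CONFIG defaults to ~/.ssh/config; override it to test against another file.
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG")"
# -q: quiet, -s: no error if the file does not exist yet
if ! grep -qs "^Host localhost" "$SSH_CONFIG"; then
cat >> "$SSH_CONFIG" <<'EOF'
Host localhost
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR
EOF
fi
```

Because of the grep guard, running the snippet twice does not duplicate the entry.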
You can now run ssh <Your-hostname> (e.g. ssh localhost) to check whether it still prompts for a passphrase. You can check the file /etc/hosts for the host configuration.
Specify the configuration for the Hadoop binaries in the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
# Make sure that you added your username for services.
export HDFS_NAMENODE_USER="firasj"
export HDFS_DATANODE_USER="firasj"
export HDFS_SECONDARYNAMENODE_USER="firasj"
export YARN_RESOURCEMANAGER_USER="firasj"
export YARN_NODEMANAGER_USER="firasj"
export HADOOP_SSH_OPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"
hdfs namenode -format
start-dfs.sh
The Hadoop daemon log output is written to the directory given by $HADOOP_LOG_DIR (defaults to $HADOOP_HOME/logs).
Browse the web UI for the NameNode; by default it is available at http://localhost:9870. You can also check the health of the file system:
hdfs fsck /
stop-dfs.sh
start-yarn.sh
Check if you can browse the web UI for the YARN ResourceManager; by default it is available at http://localhost:8088.
mapred --daemon start historyserver
Check if you can browse the web UI for the MapReduce history server; by default it is available at http://localhost:19888.
The table below shows some of the default ports for the running services.
| Service | Port |
|---|---|
| HDFS namenode | 9870 |
| HDFS datanode | 9864 |
| HDFS secondary namenode | 9868 |
| YARN resource manager | 8088 |
| YARN node manager | 8042 |
| MapReduce history server | 19888 |
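To see at a glance which of these services are actually up, you can probe the ports from the shell. This loop is my addition; it relies on bash's /dev/tcp pseudo-device, so ports simply report as closed in shells without that feature:

```shell
#!/bin/bash
# Probe each default Hadoop/YARN/MapReduce port on localhost and report its state.
for port in 9870 9864 9868 8088 8042 19888; do
  if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
    echo "port $port: open"
  else
    echo "port $port: closed"
  fi
done
```

Any port listed as closed means the corresponding service is not reachable and its daemon likely did not start.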
This was a single-node installation of Hadoop; for a fully-distributed installation, you need Docker as described in the next section.
Here you can follow the instructions in the repository shared with you in the labs to check the installation steps for the pseudo-distributed and fully-distributed modes.
If you installed Hadoop successfully, you should be able to run MapReduce pipelines with the mapred streaming command on a YARN cluster.
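Before submitting to the cluster, you can dry-run a streaming pipeline's logic with plain shell pipes, since mapred streaming simply runs your mapper and reducer commands over the input. The word-count example below is mine, not from the labs:

```shell
# Simulate word count locally: mapper = split into words, shuffle = sort, reducer = count.
printf 'big data\nbig midterm\n' | tr ' ' '\n' | sort | uniq -c
# ->   2 big / 1 data / 1 midterm  (uniq -c pads counts with spaces)
# The equivalent cluster submission would look like this (assumes /input exists on HDFS):
# mapred streaming -input /input -output /output \
#     -mapper "tr ' ' '\n'" -reducer "uniq -c"
```

If the local pipe produces the counts you expect, the same mapper/reducer pair should behave identically on YARN.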
You can install Spark by following the steps shared in the labs; make sure that you can run PySpark applications on the local machine, on a Spark cluster, and on a YARN cluster.
Ubuntu includes PostgreSQL in its default repositories. If it is not already installed, use apt (or another apt frontend) to install it:
sudo apt install postgresql
You can check the installation by running:
psql -V
# output
# psql (PostgreSQL) 17.4 (Ubuntu 17.4-1.pgdg22.04+2)
You can install it via Docker as follows:
docker pull postgres
Then run a container.
docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
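If you prefer managing it with Compose, the same container can be expressed as a compose service. This fragment is my sketch of an equivalent to the docker run command above, using the same names, port, and password:

```yaml
# docker-compose.yaml - PostgreSQL service equivalent to the docker run command above
services:
  postgres:
    image: postgres
    container_name: some-postgres
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=mysecretpassword
```

Start it with docker compose up -d, just like the Memgraph setup later in this tutorial.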
You can download the Docker image for Memgraph and run a container; you can find more detail on the official page. Memgraph ships several Docker images. The main repositories that contain Memgraph are:
- memgraph/memgraph-mage - includes the Memgraph database, the command-line interface mgconsole, and the MAGE graph algorithms library. If tagged with cuGraph, it also includes NVIDIA cuGraph GPU-powered graph algorithms.
- memgraph/memgraph - includes the Memgraph database and the command-line interface mgconsole.
There are also two standalone images that do not include Memgraph itself:
- memgraph/lab - includes Memgraph Lab, a web interface that helps you explore the data stored in Memgraph.
- memgraph/mgconsole - includes the command-line interface mgconsole, which lets you interact with Memgraph from the command line.
For this lab, you need just two containers: one for the graph engine and one for the web user interface. You can run a multi-container Memgraph setup as follows:
services:
memgraph:
image: memgraph/memgraph-mage:latest
container_name: memgraph-mage
ports:
- "7687:7687"
- "7444:7444"
command: ["--log-level=TRACE"]
lab:
image: memgraph/lab:latest
container_name: memgraph-lab
ports:
- "3001:3000"
depends_on:
- memgraph
environment:
- QUICK_CONNECT_MG_HOST=memgraph
- QUICK_CONNECT_MG_PORT=7687
Put this content in a file docker-compose.yaml and run it using docker compose up -d. Then open Memgraph Lab (http://localhost:3001) in the browser to access the web UI as shown below.

You can also access the graph database's console directly if you chose not to use the Memgraph Lab web UI:
docker exec -it memgraph-mage mgconsole
This opens the Memgraph console, where you can write your queries in the Cypher language.
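For a quick smoke test of the console, you can create and read back a couple of nodes. This toy query is my own example; any valid Cypher works:

```cypher
// Create two nodes joined by a relationship, then read them back.
CREATE (s:Student {name: "Alice"})-[:TAKES]->(c:Course {title: "Big Data"});
MATCH (s:Student)-[:TAKES]->(c:Course) RETURN s.name, c.title;
```

The MATCH query should return one row with "Alice" and "Big Data".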
Before you install the MongoDB server, stop/kill all running MongoDB services.
Here we need to download and install four tools:
- The server (the Community edition is enough). It ships with a bin folder; add the path of this bin folder to the PATH environment variable.
- Compass, the GUI.
- The shell (mongosh). It also ships with a bin folder; add the path of this bin folder to the PATH environment variable.
- The Database CLI tools.
Note: Make sure that you have all the executables as shown in the screenshot above.
Alternatively, you can run the server in a Docker container:
docker run --name mongodb_server -p 27017:27017 -d mongodb/mongodb-community-server:latest
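Once the bin folders are on PATH, a quick loop can confirm the executables are reachable. The tool names below are the standard MongoDB binaries; adjust them to what your packages actually provide:

```shell
# Report, for each expected MongoDB executable, whether it is on PATH.
for tool in mongod mongosh mongodump mongorestore; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - check your PATH"
  fi
done
```

Any line marked MISSING means that tool's bin folder was not added to PATH (or the shell needs to be restarted to pick it up).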
Good luck in the midterm!