Course: Big Data - IU S25
Date: Tuesday - March 13, 2025
Start time: 14:30
Total points: 100/4
This tutorial lists the requirements that you need to check and prepare before you come to the midterm. If your PC does not meet them, you will have to set everything up during the midterm, which will cost you exam time, so make sure that your PC meets the requirements in advance. If you have issues related to the requirements, please contact your TA at least two days before the midterm.
One of the requirements of the midterm is to have Docker and Docker Compose on your machine. Make sure that both Docker Engine and Docker Compose work; you can test them with the docker-compose.yml in the repository that was shared with you in lab 4. To install Docker and Docker Compose, follow the instructions on the official website.
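A quick way to confirm both tools are present is a small shell check. The sketch below is only an illustration: the function names have_cmd and check_docker are made up for this example, and it assumes that Compose is available either as the docker compose plugin or as the standalone docker-compose binary.

```shell
# Return success if the given command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print the Docker Engine and Docker Compose versions, or report what is missing.
check_docker() {
  if ! have_cmd docker; then
    echo "docker: not found on PATH"
    return 1
  fi
  docker --version
  # Compose v2 is the "docker compose" plugin; older setups use standalone docker-compose.
  if docker compose version >/dev/null 2>&1; then
    docker compose version
  elif have_cmd docker-compose; then
    docker-compose --version
  else
    echo "Docker Compose: not found"
    return 1
  fi
}
```

After defining the functions, run check_docker. If it prints two version lines, the tools are installed; running the lab 4 docker-compose.yml is still the real test that containers can start.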
There are mainly two ways to install Hadoop.
You can install Hadoop on your local machine as follows. If you are on Windows, you can use Ubuntu on WSL, which is the setup of my machine. Make sure that you have sudo access to the machine.
Install openssh-server.
sudo apt-get install -y openssh-server
sudo apt-get -y update
sudo apt-get install -y wget openjdk-8-jdk vim
You can check the Java version after installation.
java -version
# Output
# openjdk version "17.0.14" 2025-01-21
# OpenJDK Runtime Environment (build 17.0.14+7-Ubuntu-122.04.1)
# OpenJDK 64-Bit Server VM (build 17.0.14+7-Ubuntu-122.04.1, mixed mode, sharing)
As you can see, the active version is not Java 8. To switch between installed Java versions, use the update-java-alternatives command.
# List all java versions:
update-java-alternatives -l
# output
# java-1.11.0-openjdk-amd64 1111 /usr/lib/jvm/java-1.11.0-openjdk-amd64
# java-1.17.0-openjdk-amd64 1711 /usr/lib/jvm/java-1.17.0-openjdk-amd64
# java-1.8.0-openjdk-amd64 1081 /usr/lib/jvm/java-1.8.0-openjdk-amd64
Here you can see that I have Java 8, 11, and 17 installed. You can switch to Java 8 as follows.
sudo update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64
Then we can check the version again.
java -version
# openjdk version "1.8.0_442"
# OpenJDK Runtime Environment (build 1.8.0_442-8u442-b06~us1-0ubuntu1~22.04-b06)
# OpenJDK 64-Bit Server VM (build 25.442-b06, mixed mode)
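If you want to script this check, the version string can be parsed instead of read by eye. The following is a minimal sketch; java_major and current_java_version are hypothetical helper names, and it assumes the version appears in double quotes on the first line of the java -version output, as in the listings above.

```shell
# Extract the major Java version from a string such as "1.8.0_442" or "17.0.14".
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.x means Java 8
    *)   echo "$1" | cut -d. -f1 ;;   # modern scheme: 17.0.x means Java 17
  esac
}

# Read the quoted version string from the first line of `java -version`
# (the JVM prints it on stderr, hence the 2>&1).
current_java_version() {
  java -version 2>&1 | head -n 1 | cut -d'"' -f2
}
```

For example, java_major "$(current_java_version)" should print 8 once the switch to Java 8 has taken effect.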
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
source ~/.bashrc
echo $JAVA_HOME
# output
# /usr/lib/jvm/java-8-openjdk-amd64
sudo wget -O /hadoop.tar.gz http://archive.apache.org/dist/hadoop/core/hadoop-3.3.1/hadoop-3.3.1.tar.gz
cd / && sudo tar xfz hadoop.tar.gz
sudo mv /hadoop-3.3.1 /usr/local/hadoop
sudo rm /hadoop.tar.gz
echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
source ~/.bashrc
echo $HADOOP_HOME
# output
# /usr/local/hadoop
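# Single quotes keep $PATH and $HADOOP_HOME unexpanded, so they are resolved each time ~/.bashrc is loaded
echo 'export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"' >> ~/.bashrc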
source ~/.bashrc
mkdir -p $HADOOP_HOME/hdfs/namenode
mkdir -p $HADOOP_HOME/hdfs/datanode
<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<!-- Put here your username -->
<value>firasj</value>
</property>
</configuration>
<!-- $HADOOP_HOME/etc/hadoop/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
# If you have an ssh key, skip the following line
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Add the public key as an authorized key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Prevent other users from reading this file
chmod 0600 ~/.ssh/authorized_keys
Also add the following to the ~/.ssh/config file.
Host localhost
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
You can now run ssh <Your-hostname> (for example, ssh localhost) and check that it no longer prompts for a password or passphrase. You can check the file /etc/hosts for the host configuration.
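The same check can be made non-interactive, which is useful because start-dfs.sh relies on ssh working without any prompt. This is a sketch under the assumption that key-based login is already configured; can_ssh_without_prompt is a name invented for this example.

```shell
# Succeed only if ssh can log in to the given host without asking for anything.
can_ssh_without_prompt() {
  # BatchMode=yes makes ssh fail instead of prompting for a password or passphrase.
  # ConnectTimeout=5 avoids hanging when the SSH server is not running.
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

For example, can_ssh_without_prompt localhost && echo "passwordless ssh works" prints the message only when the key setup is correct.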
Specify the configuration for the Hadoop binaries in the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
# Make sure that you added your username for services.
export HDFS_NAMENODE_USER="firasj"
export HDFS_DATANODE_USER="firasj"
export HDFS_SECONDARYNAMENODE_USER="firasj"
export YARN_RESOURCEMANAGER_USER="firasj"
export YARN_NODEMANAGER_USER="firasj"
export HADOOP_SSH_OPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"
hdfs namenode -format
start-dfs.sh
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
Browse the web UI for the NameNode; by default it is available at http://localhost:9870.
hdfs fsck /
stop-dfs.sh
start-yarn.sh
Check if you can browse the web UI for the YARN resource manager. By default it is available at http://localhost:8088.
mapred --daemon start historyserver
Check if you can browse the web UI for the MapReduce history server. By default it is available at http://localhost:19888.
The table below shows some of the default ports for the running services.
| Service | Port |
|---|---|
| HDFS namenode | 9870 |
| HDFS datanode | 9864 |
| HDFS secondary namenode | 9868 |
| YARN resource manager | 8088 |
| YARN node manager | 8042 |
| MapReduce history server | 19888 |
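Instead of opening each address in a browser, the ports in the table can be probed from the shell. The sketch below assumes curl is installed and that all services run on localhost; the names services, ui_url, check_ui, and check_all_uis are illustrative only.

```shell
# Service names and default web UI ports, taken from the table above.
services="namenode:9870 datanode:9864 secondarynamenode:9868 resourcemanager:8088 nodemanager:8042 historyserver:19888"

# Build the local web UI address for a port.
ui_url() {
  echo "http://localhost:$1"
}

# Succeed if the web UI on the given port answers without an HTTP error.
check_ui() {
  # -s hides progress, -f fails on HTTP errors, -o /dev/null discards the page body.
  curl -sf -o /dev/null --max-time 5 "$(ui_url "$1")"
}

# Print one status line per service.
check_all_uis() {
  for entry in $services; do
    name=${entry%%:*}
    port=${entry##*:}
    if check_ui "$port"; then
      echo "$name ($port): up"
    else
      echo "$name ($port): DOWN"
    fi
  done
}
```

Run check_all_uis after starting the daemons. A service reported as DOWN is either not started or listening on a non-default port.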
This was a single-node installation of Hadoop. For a fully-distributed installation, you need to use Docker, as described in the next section.
Here you can follow the instructions in the repository shared with you in lab 4 to check the installation steps for the pseudo-distributed and fully-distributed modes.
You can install Spark by following the steps shared in lab 4. Make sure that you can run PySpark applications on your local machine, on a Spark cluster, and on a YARN cluster.
PostgreSQL is available in the default Ubuntu repositories. To install it on Ubuntu, use the apt (or other apt-driving) command:
sudo apt install postgresql
You can check the installation by running:
psql -V
# output
# psql (PostgreSQL) 17.4 (Ubuntu 17.4-1.pgdg22.04+2)
You can install it via Docker as follows:
docker pull postgres
Then run a container.
docker run --name some-postgres -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres
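The container needs a few seconds before it accepts connections, so it helps to wait for it before using the database. The following sketch polls pg_isready, a client tool that ships with PostgreSQL, inside the container; wait_for_postgres is a hypothetical helper name and the default of 10 attempts is an arbitrary choice.

```shell
# Wait until PostgreSQL inside the given container accepts connections.
# Usage: wait_for_postgres <container-name> [attempts]
wait_for_postgres() {
  container=$1
  attempts=${2:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # pg_isready exits with 0 once the server accepts connections.
    if docker exec "$container" pg_isready -U postgres >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not ready"
  return 1
}
```

For the container started above, call wait_for_postgres some-postgres. Once it prints ready, docker exec -it some-postgres psql -U postgres opens an interactive session.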
You can download the Neo4j server from the official website.
Before you install the Neo4j server, stop or kill any Neo4j services that are already running.
The installation steps are very similar across operating systems, with slight differences on Windows. Please follow the installation approach for your local operating system.
On Windows, download the zipped folder and unzip it to get the server folder.

The server CLI tool is .\bin\neo4j.bat. To run the server, you need to install it as a Windows service. You can do that by running the following command.
.\bin\neo4j.bat windows-service install
You can check that the service is installed by opening the Windows Services console (services.msc).

Note that the installation may fail if you already have a Neo4j service with the same name. You can fix it by changing the service name in .\conf\neo4j.conf to a different name and installing the service again with the new name.
If the service is installed successfully, it is added to services.msc and you will see the output Neo4j service installed.

At this point the server is not running yet. We can change the configuration of the server in .\conf\neo4j.conf.

Before you start the server, you have to accept the evaluation license as follows:
.\bin\neo4j-admin server license --accept-evaluation
We can run the server as follows:
.\bin\neo4j.bat start
Check in services.msc that it is running. After running the server, it will give you an address where you can access the Neo4j DBMS instance. For the default settings, you can access the server on http://localhost:7474

You can connect to the server in the browser as follows (the default username and password are neo4j):
:server connect

Then it will ask you to change the default password.

You can set a password like neo4jneo4j, then you will be able to connect to the server.
We can stop the service as follows:
.\bin\neo4j.bat stop
For more information about this tool, you can run .\bin\neo4j.bat --help.
You can delete the service from your Windows system services.msc as follows:
C:\Windows\System32\sc.exe delete ServiceName
where ServiceName is the name of your Neo4j Windows service, for example neo4j2.
On Linux, you first need to extract the compressed archive as follows:
tar zxf neo4j-enterprise-5.17.0-unix.tar.gz
You will get the server instance folder. Enter that folder.

The server CLI tool is ./bin/neo4j. At this point the server is not running yet. We can change the configuration of the server in ./conf/neo4j.conf.

Make sure that you have Java 21 installed and that JAVA_HOME points to the Java installation directory.
You can install Java 21 on Ubuntu 22 as follows:
sudo apt update
sudo apt install openjdk-21-jdk
Before you start the server, you have to accept the evaluation license as follows:
./bin/neo4j-admin server license --accept-evaluation
We can run the server as follows:
./bin/neo4j start
After running the server, it will give you an address where you can access the Neo4j DBMS instance. For the default settings, you can access the server on http://localhost:7474

You can connect to the server in the browser as follows (the default username and password are neo4j):
:server connect

Then it will ask you to change the default password.

You can set a password like neo4jneo4j, then you will be able to connect to the server.
We can stop the service as follows:
./bin/neo4j stop
Alternatively, kill the process whose PID was printed when you started the server.

For instance, to kill the server running as process 5899, run:
kill 5899
If the process does not terminate after SIGTERM, send SIGKILL to force termination as follows:
kill -9 5899
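The two-step shutdown above (SIGTERM first, SIGKILL only if needed) can be wrapped in a small helper. This is a sketch, not part of the Neo4j tooling: stop_pid is a hypothetical function name, and the default 10-second grace period is an assumption.

```shell
# Ask a process to terminate, then force it after a grace period.
# Usage: stop_pid <pid> [grace-seconds]
stop_pid() {
  pid=$1
  timeout=${2:-10}
  # Plain kill sends SIGTERM; if it fails, the process is already gone.
  kill "$pid" 2>/dev/null || return 0
  i=0
  while [ "$i" -lt "$timeout" ]; do
    # kill -0 sends no signal; it only tests whether the process still exists.
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
    i=$((i + 1))
  done
  # Last resort: SIGKILL cannot be caught or ignored.
  kill -9 "$pid" 2>/dev/null
}
```

With the PID from the example, stop_pid 5899 gives the server up to 10 seconds to shut down cleanly before forcing it.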
For more information about this tool, you can run ./bin/neo4j --help.
You can pull the Docker image for the Enterprise edition of the Neo4j server as follows.
docker pull neo4j:5.17.0-enterprise
Then run a container.
docker run --name neo4j_server_enterprise -p 7474:7474 -p 7687:7687 --env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes -v $HOME/neo4j/data:/data neo4j:5.17.0-enterprise
You can connect to the server in the browser as follows (the default username and password are neo4j):
:server connect

Then it will ask you to change the default password.

You can set a password like neo4jneo4j, then you will be able to connect to the server. You can stop the server by running the command docker stop neo4j_server_enterprise.
Before you install the MongoDB server, stop or kill any MongoDB services that are already running.
Here we need to download and install four tools:
The server (Community edition is enough)
After installation, add the path of the server's bin folder to the PATH environment variable.
Note: Make sure that you have both the mongod (database server) and mongos executables.
Alternatively, you can run the server in Docker:
docker run --name mongodb_server -p 27017:27017 -d mongodb/mongodb-community-server:latest
MongoDB Compass
The shell (mongosh)
After installation, add the path of the shell's bin folder to the PATH environment variable.
The Database CLI tools
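If you run the server in the mongodb_server container from the docker run command above, you can check that it responds by sending a ping through mongosh. This is a minimal sketch; mongo_ping is a hypothetical helper name, and it assumes that mongosh is available inside the container image.

```shell
# Ping the MongoDB server inside the given container.
mongo_ping() {
  # db.runCommand({ ping: 1 }) returns a document whose ok field is 1 when the server responds.
  docker exec "$1" mongosh --quiet --eval 'db.runCommand({ ping: 1 }).ok'
}
```

Call it as mongo_ping mongodb_server. For a locally installed server, running mongosh --quiet --eval 'db.runCommand({ ping: 1 }).ok' directly performs the same check.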
Good luck on the midterm!