Course: Big Data - IU S25
Author: Firas Jolha
In this tutorial, we will install VirtualBox, install the Hortonworks Data Platform Sandbox, access the Hortonworks Data Platform via ssh, explore HDFS, and learn how to transfer files.

Every business is now a data business. Data is the organization's future and most valuable asset. The Hortonworks Data Platform (HDP)
is a security-rich, enterprise-ready open-source Hadoop
distribution based on a centralized architecture (YARN). Hortonworks Sandbox
is a single-node cluster and can be run as a Docker container installed on a virtual machine. HDP is a complete system for handling the processing and storage of big data. It is an open architecture used to store and process complex and large-scale data. It is composed of numerous Apache Software Foundation (ASF) projects, including Apache Hadoop, and is built specifically to meet enterprise demands. Hortonworks was a standalone company until 2019, when it merged with Cloudera; Hortonworks is now a subsidiary of Cloudera, Inc.
Hortonworks merged with Cloudera in 2019
At the beginning of this course, we aim to refresh your knowledge of relational databases before dealing with big data. In practice, we will use the relational database management system PostgreSQL. Fortunately, HDP comes with a pre-installed PostgreSQL database server.
lscpu
). Sometimes it is disabled in the BIOS.

HDP 2.5.0 is based on CentOS 6.8, which reached end of life (EOL) by 2021, so updating packages via yum is practically no longer possible. We recommend installing HDP 2.6.5 unless you have limited resources.
There are two common ways to install HDP Sandbox on your PC: either by using a hypervisor such as VirtualBox, which will pull the Docker image and run a container for your cluster, or by using Docker directly, in which case you manage resources via docker command-line options on Linux or, on Windows, via the WSL backend, where resources are configured in the .wslconfig file.
If you are not familiar with Docker, you can follow this approach, where configuring the cluster resources can be done via the hypervisor GUI. If you have limited resources, we recommend using Docker instead, so you do not consume additional resources running the guest virtual machine.
In this approach, you will run a virtual machine which in turn runs your cluster container, so the operating system of the virtual machine is different from the operating system of the container (the HDP Sandbox cluster). You can verify this by checking the contents of the /home directory or the version of the operating system with cat /etc/redhat-release.
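For example, a minimal sketch run on the virtual machine once the sandbox container is up (the container name sandbox-hdp is the default used by the sandbox):
cat /etc/redhat-release   # operating system of the virtual machine
docker exec sandbox-hdp cat /etc/redhat-release   # operating system inside the HDP Sandbox container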
We recommend VirtualBox as a hypervisor since it is supported on the most common operating systems (Linux, Windows, and macOS). Please follow the attached link in the following list to download your preferred hypervisor.
For installation instructions, you can use Google, but I share here a tutorial for installing VirtualBox on Windows 11. If you have an old version of the software, we recommend updating it to avoid issues when installing the virtual machines. On my PC, I installed VirtualBox 7.1.4 in January 2025.
Hortonworks Data Platform (HDP) is a straightforward, pre-configured, learning environment that contains the latest developments from Apache Hadoop. The Sandbox comes packaged in a virtual environment that can run in the cloud or on your personal machine. The Sandbox allows you to learn and explore HDP on your own.
You can find the download links of the Sandbox in .ova format with respect to the chosen hypervisor. If you are using VirtualBox, then download from here. For VMware users, the download link is here. You can also download them from the official website but it needs an account on Cloudera website. I share below the download links for all available versions of HDP Sandbox.
The download links of HDP Sandbox on VirtualBox:
The download links of HDP Sandbox on VMware:
I will show here the steps to install the HDP Sandbox on VirtualBox. First of all, you need to be sure that you have installed VirtualBox and it is ready to create VMs.
Oracle VM VirtualBox
Select File
from the top toolbar then choose Import Appliance...
from the drop-down list or press Ctrl+I
. The following window appears, which allows you to specify the file from which to import the virtual appliance. Here you need to select the path of the virtual appliance; it has the extension .ova.
Import Appliance window
As shown in the figure below, select the path of the .ova file, then press Next.
Import Appliance window
In the next window, you may need to change some settings. Make sure that you set the CPU cores to 4 and the RAM size to 8192 MB.
Appliance settings window
And wait for the appliance to be imported as shown in the figure below.
Progress of importing the virtual appliance
If you get a value of 0 for the Base Memory after importing the appliance (a bug in VirtualBox), please update the value as explained above and then start the virtual machine.
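If you prefer the command line, a minimal sketch using VBoxManage is shown below (the .ova filename is a placeholder for the file you downloaded); it imports the appliance with the recommended 4 CPUs and 8192 MB of RAM:
VBoxManage import "HDP_2.6.5_virtualbox.ova" --vsys 0 --cpus 4 --memory 8192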
The first boot of HDP Sandbox takes a lot of time, please take a break and wait till it finishes. Actually, during this time, the virtual machine is building the Docker image and then it starts to run a container for your cluster where you can access it from the host machine.
Booting HDP Sandbox
After finishing the extraction process, the system will run as shown below.
Running HDP Sandbox
After the boot operation is done, you will see the following screen, which gives the address to access the splash webpage of the platform: http://localhost:1080 or http://127.0.0.1:1080 for HDP 2.6.5.
HDP Sandbox 2.6.5
Now, you have finished installation and are ready to access the cluster.
Important note: Before you install the HDP Sandbox using the installation scripts, make sure that you have Ubuntu 18.04 or 20.04 LTS or CentOS 7.5. On macOS and some other Linux distributions such as Arch Linux, you will probably have issues related to the kernel. If you are one of these users, make sure that you are not getting any errors when executing the installation script in the section Installing HDP Sandbox. If you do get errors, install one of the Linux distros mentioned above in a virtual machine.
Here I will explain how you can install HDP Sandbox using Docker on Windows 11. For Linux users, installing Docker can be done via the CLI, and you can follow this tutorial.
You can download Docker Desktop and install it by following the instructions on the official website. The installation instructions are easy, and you can follow this tutorial.
Docker Desktop
Docker Desktop allows you to manage images and containers via GUI. It also allows you to run the cluster without the need to use CLI.
There are only two versions of HDP Sandbox on Docker Hub, of which HDP 2.6.5 is the one we need. First, you need to download the installation scripts from the Cloudera website:
The zip archive mainly contains .sh scripts, which include the instructions to pull the HDP Sandbox image from Docker Hub and run the containers for your cluster.
HDP Sandbox on Docker
For Linux users, it is straightforward to run .sh scripts, whereas Windows 10/11 users can use the C:\Windows\System32\bash.exe program. If you have an older version of Windows, you can install Git BASH to run .sh scripts.
We need to run only the script docker-deploy-hdp265.sh
from bash
shell as follows:
$ bash ./docker-deploy-hdp265.sh
Pulling HDP Sandbox 2.6.5 image from Docker Hub
Running the installation script will take a lot of time since it pulls about 15 GiB (HDP 2.6.5) from Docker Hub, so take a break. If you cannot see the progress of the download, you can open another terminal and run the same pull instruction yourself:
docker pull hortonworks/sandbox-hdp:2.6.5
After the installation is successfully done, you will see two images on Docker Desktop and also two running containers.
HDP Sandbox images on Docker Desktop
The HDP Sandbox cluster is now running on your PC.
You need to do these steps only the first time; afterwards you can stop/restart the cluster from Docker Desktop.
Limiting the resources of the containers is important to avoid freezing the host system due to lack of memory. The default behavior of the Docker engine is to give the containers as many resources as they request. Linux users can use command-line options, whereas Windows users can configure the WSL backend (installed by default in Windows 10/11, but do not forget that you need to install a Linux distribution such as Ubuntu; see the tutorial here) or the Hyper-V backend. Here I will explain how to configure the resources of the containers if you use the WSL backend on Windows.
You need to download .wslconfig
file or build a new one with the same structure as follows:
# NOTE: This should be stored in %UserProfile% folder which is your home folder c:/Users/<Username> in Windows
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Limits VM memory to use no more than 4 GB, this can be set as whole numbers using GB or MB
memory=4GB
# Sets the VM to use four virtual processors
processors=4
# Sets amount of swap storage space to 0, default is 25% of available RAM
swap=0
After every update of the .wslconfig file, you need to shut down WSL, wait a few seconds, then restart Docker Desktop. After that, you can start the containers to run the cluster. You can shut down WSL as follows:
wsl --shutdown
More info about WSL configuration can be found here.
You can stop the cluster by turning off the virtual machine if you are using a hypervisor. For Docker users, you only need to stop the containers sandbox-hdp and sandbox-proxy. Likewise, you can start the cluster in Docker Desktop by running the containers sandbox-hdp and sandbox-proxy, whereas you just need to start the virtual machine if you are using a hypervisor.
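If you prefer the command line over Docker Desktop, a minimal sketch follows (the container names are the defaults created by the deploy script):
docker stop sandbox-proxy sandbox-hdp   # stop the cluster
docker start sandbox-hdp sandbox-proxy   # start the cluster again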
Important: Do not delete the containers in Docker Desktop, otherwise the persisted data will be gone and you need to reinstall the cluster using installation scripts.
Whether you followed the first approach or the second one for installing the cluster, you end up here. The installed HDP Sandbox cluster is a single-node implementation of HDP. It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. You can access the splash webpage of the cluster via http://localhost:1080 for HDP 2.6.5 and http://localhost:8888 for HDP 2.5.0.
HDP Sandbox splash webpage
The Quick Links button takes you to a page of links from which you can access some services of the cluster.
HDP Sandbox quick links webpage
In order to see all services of the cluster, you need to access Ambari service at http://localhost:8080 where you can monitor and manage all services.
Ambari login page
You need to log in in order to access this service. You can use the credentials of the user maria_dev/maria_dev (username/password). HDP Sandbox comes with 4 default users with different roles in the cluster, and there is also an Ambari Admin who can manage the other users in the cluster.
Ambari homepage
As you can see on the Ambari homepage, most services show alerts since they have not started yet or because of some problems. You need to wait until the services start before you can access them. If you have allocated fewer resources than required, it is likely that most services cannot run, so you can stop the services you do not need to let the required ones run.
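Services can also be stopped from the command line through Ambari's REST API. The following is only a sketch, assuming the default cluster name Sandbox and an account with operator privileges (e.g., the admin user after resetting its password as shown in the note below); SERVICE_NAME and ADMIN_PASSWORD are placeholders:
curl -u admin:ADMIN_PASSWORD -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  http://localhost:8080/api/v1/clusters/Sandbox/services/SERVICE_NAME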
Note: You can reset the password of Ambari Admin by running the command ambari-admin-password-reset
via ssh
as follows:
[root@sandbox-hdp ~]# ambari-admin-password-reset
Resetting Ambari Admin credentials
Overview of HDP services
You can access the cluster via a web shell client (also called shell-in-a-box) by opening http://localhost:4200 in your browser.
For the very first time, the default credentials are root/hadoop
and you will be asked to reset the password. You need to set a strong password to pass the password-reset step. For example, I use the password hdpIU2025
.
Web Shell client for HDP Sandbox
You can also access the cluster via the ssh command in your preferred terminal. You need to ssh to port 2222 as the root user:
ssh root@localhost -p 2222
You can access HDFS files by selecting Files view
in Ambari homepage.
Ambari - Files View
You can see in the following screen the contents of HDFS on the cluster. The page allows you to upload and download files/folders from/to local file system and HDFS.
HDFS on HDP Sandbox cluster
You can also access HDFS via CLI using the command hdfs dfs
. For example, to list the content of the root directory /
in HDFS, you can write as follows:
[root@sandbox-hdp ~]# hdfs dfs -ls /
The single node of the cluster runs CentOS, which has an ext4 local file system, whereas the distributed data in the cluster is stored in HDFS. You also have a local file system on the host machine. We thus have multiple file systems between which data may need to be transferred. For example, in order to process data in the cluster, you need to store it in HDFS.
You can move data from the local file system in the host machine to the local file system of the cluster node by using command docker cp
. As an example, to move the file C:\Users\Admin\Desktop\hello.txt
on Windows to the root directory /
of the node, we run the following command on the host machine:
docker cp "C:\Users\Admin\Desktop\hello.txt" "sandbox-hdp:/"
sandbox-hdp
is the container name/id.
In the same way, you can move the file /hello2.txt
from the local file system of the node to the local file system of the host machine as follows (run the command on the host machine):
docker cp "sandbox-hdp:/hello2.txt" "C:\Users\Admin\Desktop"
Make sure that the containers of the cluster are running before running the commands.
Info: If you installed HDP using a hypervisor, then you can use the scp command to copy files directly from the source machine (where you are running the command) to the local file system of the cluster node.
For instance, to transfer the files in the data folder from the host machine to the folder /data on the cluster node, we run the following command in the terminal.
scp -P 2222 data/* root@localhost:/data
Note: If you get issues as shown in the figure below, then you need to open the file %USERPROFILE%/.ssh/known_hosts and remove the previous keys for that port (you can empty the file if none of the keys are important). The next scp connection will then record the node's new host key.
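Alternatively, a minimal sketch assuming OpenSSH's ssh-keygen is available on the host; it removes only the offending entry instead of editing the file by hand:
ssh-keygen -R "[localhost]:2222"   # drop the old host key stored for port 2222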
You can move data between the local file system of the host machine and HDFS via Ambari service as explained in section Access HDFS.
To move data between HDFS and the local file system of the cluster node, you can use the hdfs dfs command with the appropriate options, for example -put/-copyFromLocal to upload files to HDFS and -get/-copyToLocal to download them.
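A minimal sketch run on the cluster node (the file and folder names here are placeholders):
hdfs dfs -mkdir -p /data   # create a folder in HDFS
hdfs dfs -put /hello.txt /data/   # copy a local file into HDFS
hdfs dfs -get /data/hello.txt /tmp/   # copy a file from HDFS back to the local file system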
Docker does not read the parameters in the .wslconfig file on Windows.
The virtual machine does not boot up.
When you try to access the database by running the command psql -U postgres (e.g., [root@sandbox-hdp data]# psql -U postgres), you may get the following error:
psql: FATAL: Peer authentication failed for user “postgres”
Solution: add the line local all all trust at the beginning of the file /var/lib/pgsql/data/pg_hba.conf, then restart the PostgreSQL service by running the command systemctl restart postgresql as the root user.
NB: If you have other issues, please contact the TA.
Tasks:
- Transfer a file from the host machine to the cluster node using scp or docker cp.
- Access the cluster node via port 2222 in the terminal or port 4200 in the browser and create a folder data in the HDFS root folder (/). Move the file to the folder data. Send a copy of the folder to the local machine.

Questions:
- What is Apache Ambari?
- Who is the Ambari Admin?
- What is ssh?
- What is ambari in the PostgreSQL server?
- Who is the root user in the cluster?

To find information about the cluster Sandbox, execute the following command on the cluster container:
sandbox-version
Ambari 2.4 introduced the notion of Role-Based Access Control (RBAC) for the Ambari web interface. Ambari now includes additional cluster operation roles providing more granular division of control over the Ambari Dashboard and the various Ambari Views. Only the admin id has access to view or change these roles. You can learn more about the roles here.
TDP is an open-source and community-driven big data management platform that provides a comprehensive set of tools and services for storing, processing, and analyzing large datasets. It is built on top of the Hadoop ecosystem and other open-source projects, and it enables organizations to effectively manage their big data workloads.
Trunk Data Platform (TDP) is an open-source, free Hadoop distribution. It is built by EDF (the French electricity provider) and DGFiP (the tax office of the French Ministry of Finance) through an association called TOSIT (The Open Source I Trust). TDP is built from the source code of Apache projects.
Arenadata Hadoop (ADH) is a full-fledged enterprise distribution based on Apache Hadoop and developed by the Russian company Arenadata™. ADH is a big data platform designed for storing, processing, and analyzing large volumes of structured and unstructured data.
Arenadata Hadoop includes various tools and components that are part of the Hadoop ecosystem, such as the Hadoop Distributed File System (HDFS), MapReduce, YARN, and various other Apache projects. It also includes additional software components and tools that are designed to make it easier to deploy, manage, and use Hadoop in enterprise environments.