Course: Big Data - IU S25
Author: Firas Jolha
In this tutorial, we will install VirtualBox, install the Hortonworks Data Platform Sandbox, access the Hortonworks Data Platform via ssh, explore HDFS, and learn how to transfer files.

Every business is now a data business. Data is the organization's future and most valuable asset. The Hortonworks Data Platform (HDP)
is a security-rich, enterprise-ready open-source Hadoop
distribution based on a centralized architecture (YARN). Hortonworks Sandbox
is a single-node cluster and can be run as a Docker container installed on a virtual machine. HDP is a complete system for handling the processing and storage of big data. It is an open architecture used to store and process complex and large-scale data. It is composed of numerous Apache Software Foundation (ASF) projects, including Apache Hadoop, and is built specifically to meet enterprise demands. Hortonworks was a standalone company until 2019, when it merged with Cloudera; Hortonworks is now a subsidiary of Cloudera, Inc.
Hortonworks merged with Cloudera in 2019
At the beginning of this course, we aim to refresh your knowledge of relational databases before dealing with big data. In practice, we will use the relational database management system PostgreSQL. Fortunately, HDP comes with a pre-installed PostgreSQL database server.
lscpu
). Sometimes it is disabled in the BIOS.

HDP 2.5.0 is based on CentOS 6.8, which reached end of life (EOL) by 2021, so updating packages via yum is practically no longer possible. We recommend installing HDP 2.6.5 unless you have limited resources.
There are two common ways to install HDP Sandbox on your PC: either by using a hypervisor such as VirtualBox, which will pull the Docker image and run a container for your cluster, or by using Docker directly, in which case you manage resources via docker command-line options on Linux or, on Windows, via the WSL backend, where resources are configured in the .wslconfig file.
If you are not familiar with Docker, you can follow this approach, where configuring the cluster resources can be done via the hypervisor GUI. If you have limited resources, we recommend using Docker instead, so you do not consume additional resources running the guest virtual machine.
In this approach, you will run a virtual machine which in turn runs your cluster container, so the operating system of the virtual machine is different from the operating system of the container (the HDP Sandbox cluster). You can verify this by checking the contents of the /home directory or the version of the operating system with cat /etc/redhat-release.
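For example, a minimal sketch run on the virtual machine once the sandbox container is up (the container name sandbox-hdp is the default used by the sandbox):
cat /etc/redhat-release   # operating system of the virtual machine
docker exec sandbox-hdp cat /etc/redhat-release   # operating system inside the HDP Sandbox container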
We recommend VirtualBox as a hypervisor since it is supported on the most common operating systems (Linux, Windows, and macOS). Please follow the attached link in the following list to download your preferred hypervisor.
For installation instructions, you can use Google, but I share here a tutorial for installing VirtualBox on Windows 11. If you have an old version of the software, we recommend updating it to avoid issues when installing the virtual machines. On my PC, I installed VirtualBox 7.1.4 in January 2025.
Hortonworks Data Platform (HDP) is a straightforward, pre-configured, learning environment that contains the latest developments from Apache Hadoop. The Sandbox comes packaged in a virtual environment that can run in the cloud or on your personal machine. The Sandbox allows you to learn and explore HDP on your own.
You can find the download links of the Sandbox in .ova format with respect to the chosen hypervisor. If you are using VirtualBox, then download from here. For VMware users, the download link is here. You can also download them from the official website but it needs an account on Cloudera website. I share below the download links for all available versions of HDP Sandbox.
The download links of HDP Sandbox on VirtualBox:
The download links of HDP Sandbox on VMware:
I will show here the steps to install the HDP Sandbox on VirtualBox. First of all, you need to be sure that you have installed VirtualBox and it is ready to create VMs.
Oracle VM VirtualBox
Select File
from the top toolbar then choose Import Appliance...
from the drop-down list or press Ctrl+I
. The following window appears, which allows you to specify the file from which to import the virtual appliance. Here you need to select the path of the virtual appliance; it has the extension .ova.
Import Appliance window
As shown in the figure below, select the path of the .ova file, then press Next.
Import Appliance window
In the next window, you may need to change some settings. Make sure that you set the CPU cores to 4 and the RAM size to 8192 MB.
Appliance settings window
And wait for the appliance to be imported as shown in the figure below.
Progress of importing the virtual appliance
If you get a value of 0 for the Base Memory after importing the appliance (a bug in VirtualBox), please update the value as explained above and then start the virtual machine.
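If you prefer the command line, a minimal sketch using VBoxManage is shown below (the .ova filename is a placeholder for the file you downloaded); it imports the appliance with the recommended 4 CPUs and 8192 MB of RAM:
VBoxManage import "HDP_2.6.5_virtualbox.ova" --vsys 0 --cpus 4 --memory 8192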
The first boot of HDP Sandbox takes a lot of time, please take a break and wait till it finishes. Actually, during this time, the virtual machine is building the Docker image and then it starts to run a container for your cluster where you can access it from the host machine.
Booting HDP Sandbox
After finishing the extraction process, the system will run as shown below.
Running HDP Sandbox
After the boot operation is done, you will see the following screen, which gives the address to access the splash webpage of the platform: http://localhost:1080 or http://127.0.0.1:1080 for HDP 2.6.5.
HDP Sandbox 2.6.5
Now, you have finished installation and are ready to access the cluster.
Important note: Before you install the HDP Sandbox using the installation scripts, make sure that you have Ubuntu 18.04 or 20.04 LTS or CentOS 7.5. On macOS and some other Linux distributions such as Arch Linux, you will probably have issues related to the kernel. If you are one of these users, make sure that you are not getting any errors when executing the installation script in the section Installing HDP Sandbox. If you do get errors, install one of the Linux distros mentioned above in a virtual machine.
Here I will explain how you can install HDP Sandbox using Docker on Windows 11. For Linux users, installing Docker can be done via the CLI, and you can follow this tutorial.
You can download Docker Desktop and install it by following the instructions on the official website. The installation instructions are easy, and you can follow this tutorial.
Docker Desktop
Docker Desktop allows you to manage images and containers via GUI. It also allows you to run the cluster without the need to use CLI.
There are only two versions of HDP Sandbox on Docker Hub, of which HDP 2.6.5 is the one we need. First, you need to download the installation scripts from the Cloudera website:
The zip archive mainly contains .sh scripts, which include the instructions to pull the HDP Sandbox image from Docker Hub and run the containers for your cluster.
HDP Sandbox on Docker
For Linux users, it is straightforward to run .sh scripts, whereas Windows 10/11 users can use the C:\Windows\System32\bash.exe program. If you have an older version of Windows, you can install Git BASH to run .sh scripts.
We need to run only the script docker-deploy-hdp265.sh
from bash
shell as follows:
$ bash ./docker-deploy-hdp265.sh
Pulling HDP Sandbox 2.6.5 image from Docker Hub
Running the installation script will take a lot of time since it pulls about 15 GiB (HDP 2.6.5) from Docker Hub, so take a break. If you cannot see the progress of the download, you can open another terminal and run the same pull instruction yourself:
docker pull hortonworks/sandbox-hdp:2.6.5
After the installation is successfully done, you will see two images on Docker Desktop and also two running containers.
HDP Sandbox images on Docker Desktop
The HDP Sandbox cluster is now running on your PC.
You need to do these steps only the first time; afterwards you can stop/restart the cluster from Docker Desktop.
Limiting the resources of the containers is important to avoid freezing the host system due to lack of memory. The default behavior of the Docker engine is to give the containers as many resources as they request. Linux users can use command-line options, whereas Windows users can configure the WSL backend (installed by default in Windows 10/11, but do not forget that you need to install a Linux distribution such as Ubuntu; see the tutorial here) or the Hyper-V backend. Here I will explain how to configure the resources of the containers if you use the WSL backend on Windows.
You need to download .wslconfig
file or build a new one with the same structure as follows:
# NOTE: This should be stored in %UserProfile% folder which is your home folder c:/Users/<Username> in Windows
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Limits VM memory to use no more than 4 GB, this can be set as whole numbers using GB or MB
memory=4GB
# Sets the VM to use four virtual processors
processors=4
# Sets amount of swap storage space to 0, default is 25% of available RAM
swap=0
After every update of the .wslconfig file, you need to shut down WSL, wait a few seconds, then restart Docker Desktop. After that, you can start the containers to run the cluster. You can shut down WSL as follows:
wsl --shutdown
More info about WSL configuration can be found here.
You can stop the cluster by turning off the virtual machine if you are using a hypervisor. For Docker users, you only need to stop the containers sandbox-hdp and sandbox-proxy. Likewise, you can start the cluster in Docker Desktop by running the containers sandbox-hdp and sandbox-proxy, whereas you just need to start the virtual machine if you are using a hypervisor.
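If you prefer the command line over Docker Desktop, a minimal sketch follows (the container names are the defaults created by the deploy script):
docker stop sandbox-proxy sandbox-hdp   # stop the cluster
docker start sandbox-hdp sandbox-proxy   # start the cluster again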
Important: Do not delete the containers in Docker Desktop, otherwise the persisted data will be gone and you need to reinstall the cluster using installation scripts.
Whether you followed the first approach or the second one for installing the cluster, you end up here. The installed HDP Sandbox cluster is a single-node implementation of HDP. It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. You can access the splash webpage of the cluster via http://localhost:1080 for HDP 2.6.5 and http://localhost:8888 for HDP 2.5.0.
HDP Sandbox splash webpage
The Quick Links button takes you to a page of links from which you can access some services of the cluster.
HDP Sandbox quick links webpage
In order to see all services of the cluster, you need to access Ambari service at http://localhost:8080 where you can monitor and manage all services.
Ambari login page
You need to log in in order to access this service. You can use the credentials of the user maria_dev/maria_dev (username/password). HDP Sandbox comes with 4 default users with different roles in the cluster, and there is also an Ambari Admin who can manage the other users in the cluster.
Ambari homepage
As you can see on the Ambari homepage, most services show alerts since they have not started yet or because of some problems. You need to wait until the services start before you can access them. If you have allocated fewer resources than required, it is likely that most services cannot run, so you can stop the services you do not need to let the required ones run.
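Services can also be stopped from the command line through Ambari's REST API. The following is only a sketch, assuming the default cluster name Sandbox and an account with operator privileges (e.g., the admin user after resetting its password as shown in the note below); SERVICE_NAME and ADMIN_PASSWORD are placeholders:
curl -u admin:ADMIN_PASSWORD -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  http://localhost:8080/api/v1/clusters/Sandbox/services/SERVICE_NAME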
Note: You can reset the password of Ambari Admin by running the command ambari-admin-password-reset
via ssh
as follows:
[root@sandbox-hdp ~]# ambari-admin-password-reset
Resetting Ambari Admin credentials
Overview of HDP services
You can access the cluster via a web shell client (also called shell-in-a-box) by opening http://localhost:4200 in your browser.
For the very first time, the default credentials are root/hadoop
and you will be asked to reset the password. You need to set a strong password to pass the password-reset step. For example, I use the password hdpIU2025
.
Web Shell client for HDP Sandbox
You can also access the cluster via the ssh command in your preferred terminal. You need to ssh to port 2222 as the root user:
ssh root@localhost -p 2222
You can access HDFS files by selecting Files view
in Ambari homepage.
Ambari - Files View
You can see in the following screen the contents of HDFS on the cluster. The page allows you to upload and download files/folders from/to local file system and HDFS.
HDFS on HDP Sandbox cluster
You can also access HDFS via CLI using the command hdfs dfs
. For example, to list the content of the root directory /
in HDFS, you can write as follows:
[root@sandbox-hdp ~]# hdfs dfs -ls /
The single node of the cluster runs CentOS, which has an ext4 local file system, whereas the distributed data in the cluster is stored in HDFS. You also have a local file system on the host machine. We thus have multiple file systems between which data may need to be transferred. For example, in order to process data in the cluster, you need to store it in HDFS.
You can move data from the local file system in the host machine to the local file system of the cluster node by using command docker cp
. As an example, to move the file C:\Users\Admin\Desktop\hello.txt
on Windows to the root directory /
of the node, we run the following command on the host machine:
docker cp "C:\Users\Admin\Desktop\hello.txt" "sandbox-hdp:/"
sandbox-hdp
is the container name/id.
In the same way, you can move the file /hello2.txt
from the local file system of the node to the local file system of the host machine as follows (run the command on the host machine):
docker cp "sandbox-hdp:/hello2.txt" "C:\Users\Admin\Desktop"
Make sure that the containers of the cluster are running before running the commands.
Info: If you installed HDP using a hypervisor, then you can use the scp command to copy files directly from the source machine (where you are running the command) to the local file system of the cluster node.
For instance, to transfer the files in the data folder from the host machine to the folder /data on the cluster node, we run the following command in the terminal.
scp -P 2222 data/* root@localhost:/data
Note: If you get issues as shown in the figure below, then you need to open the file %USERPROFILE%/.ssh/known_hosts and remove the previous keys for that port (you can empty the file if none of the keys are important). The next scp connection will then record the node's new host key.
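Alternatively, a minimal sketch assuming OpenSSH's ssh-keygen is available on the host; it removes only the offending entry instead of editing the file by hand:
ssh-keygen -R "[localhost]:2222"   # drop the old host key stored for port 2222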
You can move data between the local file system of the host machine and HDFS via Ambari service as explained in section Access HDFS.
To move data between HDFS and the local file system of the cluster node, you can use the hdfs dfs command with the appropriate options, for example -put/-copyFromLocal to upload files to HDFS and -get/-copyToLocal to download them.
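A minimal sketch run on the cluster node (the file and folder names here are placeholders):
hdfs dfs -mkdir -p /data   # create a folder in HDFS
hdfs dfs -put /hello.txt /data/   # copy a local file into HDFS
hdfs dfs -get /data/hello.txt /tmp/   # copy a file from HDFS back to the local file system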
Docker does not read the parameters in the .wslconfig file on Windows.
The virtual machine does not boot up.
When you try to access the database by running the command psql -U postgres (e.g., [root@sandbox-hdp data]# psql -U postgres), you may get the following error:
psql: FATAL: Peer authentication failed for user “postgres”
Solution: add the line local all all trust at the beginning of the file /var/lib/pgsql/data/pg_hba.conf, then restart the PostgreSQL service by running the command systemctl restart postgresql as the root user.
NB: If you have other issues, please contact the TA.
Tasks:
- Transfer a file from the host machine to the cluster node using scp or docker cp.
- Access the cluster node via port 2222 in the terminal or port 4200 in the browser and create a folder data in the HDFS root folder (/). Move the file to the folder data. Send a copy of the folder to the local machine.

Questions:
- What is Apache Ambari?
- Who is the Ambari Admin?
- What is ssh?
- What is ambari in the PostgreSQL server?
- Who is the root user in the cluster?

To find information about the cluster Sandbox, execute the following command on the cluster container:
sandbox-version
Ambari 2.4 introduced the notion of Role-Based Access Control (RBAC) for the Ambari web interface. Ambari now includes additional cluster operation roles providing more granular division of control over the Ambari Dashboard and the various Ambari Views. Only the admin id has access to view or change these roles. You can learn more about the roles here.
TDP is an open-source and community-driven big data management platform that provides a comprehensive set of tools and services for storing, processing, and analyzing large datasets. It is built on top of the Hadoop ecosystem and other open-source projects, and it enables organizations to effectively manage their big data workloads.
Trunk Data Platform (TDP) is an open-source, free Hadoop distribution. It is built by EDF (the French electricity provider) and DGFiP (the tax office of the French Ministry of Finance) through an association called TOSIT (The Open Source I Trust). TDP is built from the source code of Apache projects.
Arenadata Hadoop (ADH) is a full-fledged enterprise distribution based on Apache Hadoop and developed by the Russian company Arenadata™. ADH is a big data platform designed for storing, processing, and analyzing large volumes of structured and unstructured data.
Arenadata Hadoop includes various tools and components that are part of the Hadoop ecosystem, such as the Hadoop Distributed File System (HDFS), MapReduce, YARN, and various other Apache projects. It also includes additional software components and tools that are designed to make it easier to deploy, manage, and use Hadoop in enterprise environments.