Course: Big Data - IU S25
Author: Firas Jolha
PostgreSQL database serverThe Structured Query Language (SQL) is the most extensively used database language. SQL is composed of a data definition language (DDL), which allows the specification of database schemas; a data manipulation language (DML), which supports operations to retrieve, store, modify and delete data. In this lab, we will practice how to retrieve structured data from relational databases using SQL. Then we will build a graph for the same dataset on Neo4j and return connected data via Cypher queries.
This dataset contains four tables about social media users.
| File name | Description | Fields |
|---|---|---|
| users.csv | A line is added to this file when a user is subscribed to the socia media along with the age and subscription date time | userid, surname, name, age, timestamp |
| posts.csv | The user can add multiple posts and for each post we have post type and post unix timestamp | postid, user, posttype, posttimestamp |
| friends.csv | The user can have friends and this relationship is mutual | friend1, friend2 |
In fact, HDP Sandbox comes with a pre-installed PostgreSQL server so you do not to install any additional software to use PostgreSQL. You can access it via psql as postgres user:
[root@sandbox-hdp ~]# psql -U postgres
This will open a CLI to interact with PostgreSQL. You can access the databases using \c command and the tables via \dt command. For example, you can access the database ambari (a default database on HDP Sandbox) by running the following command in postgreSQL CLI.
postgres=# \c ambari
You are now connected to database "ambari" as user "postgres".
The version of PostgreSQL in HDP 2.6.5 is 9.2.23 as shown below. So you should read the documentation for this version of PostgreSQL.

Anything you enter in psql that begins with an unquoted backslash is a psql meta-command that is processed by psql itself. These commands make psql more useful for administration or scripting. Meta-commands are often called slash or backslash commands.
dbname as a user username\c <dbname> <username>
\l
\dt
<table_name>\d <table_name>
<file.sql>\i <file.sql>
\?
\h CREATE TABLE
\timing
You use the same command \timing to turn it off.
\q
You can run the shell commands in psql CLI via \! . For instance, to print the current working directory we write:
\! pwd
Note: Explore the dataset using any data analysis tool (Python+Pandas, Excel, Google Sheet, …etc) you have before building the database.
Another note: You do not need to create a new role/user. Just use the current role postgres.
The dataset are csv files and need to be imported into PostgreSQL tables.
To copy the values from csv files into the table, you need to use Postgres COPY method od data-loading. Postgres’s COPY comes in two separate variants, COPY and \COPY: COPY is server based, \COPY is client based. COPY will be run by the PostgreSQL backend (user “postgres”). The backend user requires permissions to read & write to the data file in order to copy from/to it. You need to use an absolute pathname with COPY. \COPY on the other hand, runs under the current $USER, and with that users environment. And \COPY can handle relative pathnames. The psql \COPY is accordingly much easier to use if it handles what you need. So, you can use the meta-command \COPY to import the data from csv files into PostgreSQL tables but you need to make sure that the primary keys and foreign keys are set according to the given ER diagram. You can learn more about ER diagrams from here.
Notice that you need to create tables before importing their data from files. You can detect the datatype of the column from its values.
social-- Create Database
CREATE DATABASE social;
-- Switch to the Database
\c social;
-- add User table
CREATE TABLE IF NOT EXISTS Users(
userid INT PRIMARY KEY NOT NULL,
surname VARCHAR(50),
name VARCHAR(50),
age INT,
_timestamp timestamp,
temp BIGINT NOT NULL
);
-- add Post table
CREATE TABLE IF NOT EXISTS Posts(
postid INT PRIMARY KEY NOT NULL,
userid INT NOT NULL,
posttype VARCHAR(50),
posttimestamp timestamp,
temp BIGINT NOT NULL
);
-- add Friend table
CREATE TABLE IF NOT EXISTS Friends(
friend1 INT NOT NULL,
friend2 INT NOT NULL
);
-- add constrains
ALTER TABLE Friends
ADD CONSTRAINT fk_User_userid_friend1
FOREIGN KEY (friend1)
REFERENCES Users (userid);
ALTER TABLE Friends
ADD CONSTRAINT fk_User_userid_friend2
FOREIGN KEY (friend2)
REFERENCES Users (userid);
ALTER TABLE Posts
ADD CONSTRAINT fk_User_userid_userid
FOREIGN KEY (userid)
REFERENCES Users (userid);
\COPY Users(userid, surname, name, age, temp) FROM 'users.csv' DELIMITER ',' CSV HEADER;
UPDATE Users SET _timestamp = TO_TIMESTAMP(temp);
ALTER TABLE Users DROP COLUMN temp;
\COPY Posts(postid, userid, posttype, temp) FROM 'posts.csv' CSV HEADER;
UPDATE Posts SET posttimestamp = TO_TIMESTAMP(temp);
ALTER TABLE Posts DROP COLUMN temp;
\COPY Friends from 'friends.csv' csv header;
CREATE TABLE friends2 AS
SELECT friend1, friend2 FROM friends
UNION
SELECT friend2, friend1 FROM friends;
ALTER TABLE friends RENAME TO friends_old;
ALTER TABLE friends2 RENAMT TO friends;
In this part of the lab, we will practice on writing SQL queries to retrieve data from our database.
You can return data from a single table by using SELECT command. This kind of data retrieval is the simplest way to get data from a single table.
SELECT DISTINCT userid
FROM posts;
SELECT DISTINCT postid
FROM posts;
SELECT *
FROM posts
WHERE EXTRACT(YEAR FROM posttimestamp) > 2012;
SELECT MAX(posttimestamp), MIN(posttimestamp)
FROM posts;
Image?SELECT *
FROM posts
WHERE posttype='Image'
ORDER BY posttimestamp DESC
LIMIT 5;
Image or Video recently?SELECT userid, posttype
FROM posts
WHERE posttype IN ('Image', 'Video')
ORDER BY posttimestamp DESC
LIMIT 5;
SELECT MAX(age), MIN(age)
FROM users;
UPDATE users
SET age = CASE
WHEN age < 0 THEN -age
ELSE age
END;
In PostgreSQL, double quotes "" are used to indicate identifiers within the database, which are objects like tables, column names, and roles. In contrast, single quotes '' are used to indicate string literals.
Image posts?SELECT DISTINCT name, surname, age, posttype
FROM posts
JOIN users ON posts.userid = users.userid
WHERE posts.posttype='Image'
ORDER BY age ASC
LIMIT 1
OFFSET 1;
SELECT postid, posttype, age, users.userid
FROM posts
JOIN users ON posts.userid = users.userid
WHERE age <@ int4range(20, 30)
ORDER BY age;
WITH youngest AS(
SELECT userid
FROM users
ORDER BY age
ASC LIMIT 1
)
SELECT friend1 AS user, friend2 AS friend
FROM youngest, friends
JOIN users ON friends.friend1 = users.userid
WHERE users.userid = youngest.userid;
Thronton?SELECT COUNT(friend2)
FROM users
JOIN friends ON (users.userid = friends.friend1)
WHERE users.surname = 'Thronton';
SELECT users.userid, COUNT(*) as c
FROM users
JOIN posts ON (users.userid = posts.userid)
GROUP BY users.userid
ORDER BY c DESC;
Image?SELECT DISTINCT u2.userid, u2.name, u2.surname, u2.age, u.userid, u.name, u.surname
FROM users u
JOIN friends f ON f.friend1 = u.userid
JOIN posts ON posts.userid = u.userid
JOIN users u2 ON f.friend2 = u2.userid
WHERE posts.posttype = 'Image';
SELECT COUNT(DISTINCT f1.friend2) as c
FROM users u
JOIN friends f1 ON (u.userid = f1.friend1)
WHERE u.age < 20;
SELECT COUNT(DISTINCT f2.friend2) as c
FROM users u
JOIN friends f1 ON (u.userid = f1.friend1)
JOIN friends f2 ON (f1.friend2 = f2.friend1)
WHERE u.age < 20 AND f2.friend2 <> u.userid;
WITH users_10_posts AS (
SELECT posts.userid, COUNT(*) as c
FROM posts
GROUP BY posts.userid
HAVING COUNT(*) > 10
)
SELECT COUNT(*) as c
FROM users_10_posts;
WITH post_count AS (
SELECT COUNT(posts.postid) AS c
FROM users u
JOIN friends f ON f.friend1 = u.userid
JOIN posts ON posts.userid = f.friend1
JOIN users u2 ON f.friend1 = u2.userid
WHERE EXTRACT(HOUR FROM u2._timestamp) = 12 AND u.age > 40
)
SELECT AVG(c)
FROM post_count;
WITH post_count AS (
SELECT COUNT(posts.postid) AS c
FROM users u
JOIN friends f ON f.friend1 = u.userid
JOIN posts ON posts.userid = f.friend1
JOIN users u2 ON f.friend1 = u2.userid
WHERE EXTRACT(HOUR FROM u2._timestamp) = 12 AND u.age > 40) ,
post_total AS (
SELECT COUNT(*) as c2
FROM posts
)
SELECT 100 * c/c2 || '%'
FROM post_total, post_count;
A graph database stores nodes and relationships instead of tables, or documents. Data is stored just like you might sketch ideas on a whiteboard. Your data is stored without restricting it to a pre-defined model, allowing a very flexible way of thinking about and using it.
Graph databases address big challenges many of us tackle daily. Modern data problems often involve many-to-many relationships with heterogeneous data that sets up needs to:
Whether it’s a social network, payment networks, or road network you’ll find that everything is an interconnected graph of relationships. And when we want to ask questions about the real world, many questions are about the relationships rather than about the individual data elements.
Example of a simple graph database
Neo4j is an open-source, NoSQL, native graph database that provides an ACID-compliant transactional backend for your applications that has been publicly available since 2007. In Neo4j, information is organized as nodes, relationships, and properties. We can retrieve data from Neo4j via its query language called Cypher which is a declarative query language similar to SQL, but optimized for graphs.
Building blocks of the property graph model
The property graph model consists of mainly nodes and relationships. Nodes are the entities in the graph.
Whereas relationships provide directed, named, connections between two node entities (e.g. Person LOVES Person).
Neo4j can be installed as an interpreter in Zeppelin but in HDP 2.6.5, Zeppelin 0.7.3 is installed and does not support neo4j interpreter. Installing Neo4j on HDP 2.6.5 as a Zeppelin interpreter will not work. The latest version (in Feb. 2025) of Zeppelin 0.11.0 supports Neo4j but we do not need to install it. Indeed, HDP 3.0 has Zeppelin 0.8.0 installed and supports neo4j interpreter but it is not a big deal to install HDP 3.0 just for neo4j since our work on the sandbox will not be mostly on graph databases.
The easy approach is to install Neo4j Desktop on your local machine. You can find here (https://neo4j.com/docs/desktop-manual/current/installation/download-installation/) some information about installing Neo4j Desktop and you can find here (https://neo4j.com/deployment-center) download links. After you start downloading, the website will give a long key you need to use when you install the software as follows:

The software will prepare the environment for installation as follows.
Installing Neo4j Desktop
When you succeed installation, it will show the following dashboard.

The less easy approach is to install Neo4j on your cluster node via yum OR offline via rpm. You can find here (https://yum.neo4j.com) some information about installing neo4j via yum and here (https://neo4j.com/docs/operations-manual/3.5/installation/linux/rpm/#linux-yum) some information about offline installation. After that, you can forward the port of the server to some custom port 10015 to the host machine in order to access Neo4j Browser (change this setting server.http.listen_address to 10015 in neo4j.conf file of the Neo4j DBMS instance).
If you want to install Neo4j on the cluster, you need to know that the old versions of Neo4j can be installed. I suggest the version 3.5.35.1 and you can see below the cypher shell dependencies.


You can access the guest machine in VirtualBox by ssh into the HostSSH port specified in the port forwarding rules which can be reached from the network settings of the virtual machine. You can also add your rules to the list.
Neo4j has docker image in dockerhub for the Neo4j database server. You can follow this approach to run the server as a docker container and access the service from the browser. Make sure that you have installed Docker engine and docker daemon is working. You can pull the image as follows:
docker pull neo4j

You can run the Neo4j server as a container:
docker run --name neo4j_server --publish=7474:7474 --publish=7687:7687 --volume /neo4j/logs:/var/lib/neo4j/logs --volume /neo4j/import:/var/lib/neo4j/import --volume /neo4j/plugins:/var/lib/neo4j/plugins --volume /neo4j/data:/var/lib/neo4j/data --env NEO4J_AUTH=neo4j/neo4jneo4j neo4j
The command above will create and run a docker container neo4j with name neo4j_server and publish ports 7474 for http and 7687 for bolt protocols. It will also attach volumes for logs, import, plugins and data folders. The env option will assign the password neo4jneo4j for the default user neo4j. You can use -d or --detach option to let the server run in detached mode in the background.
You can stop the container when you are done as follow:
docker stop neo4j_server
Where neo4j_server is the container name or id if you did not assign a name.
You can start the container again as follows:
docker start neo4j_server
Where neo4j_server is the container name or id if you did not assign a name.
In case you want to change some of the settings in running the container then you can delete the container and create a new one. Your data inside can be stored in local file system by attaching volumes to folders you want to. You can delete the container as follows: (make sure that you stopped it)
docker rm neo4j_server
Where neo4j_server is the container name or id if you did not assign a name.
You can access the Neo4j Browser usually at http://localhost:7474.
If you can not install Neo4j due to some reason, you can open a free session of Neo4j Sandbox at https://neo4j.com/sandbox/ but you need proxy.
Cypher is Neo4j’s graph database query language that allows users to store and retrieve data from the graph database. It is a declarative, SQL-inspired language for describing visual patterns in graphs. The syntax provides a visual and logical way to match patterns of nodes and relationships in the graph. Cypher keywords are not case-sensitive but it is case-sensitive for variables.
All Neo4j servers contain a built-in database called system, which behaves differently than all other databases. The system database stores system data and you can not perform graph queries against it. A fresh installation of Neo4j includes two databases:
Cypher provides first class support for a number of data types. These fall into the following three categories: property, structural, and composite.
The following data types are included in the property types category: Integer, Float, String, Boolean, Point, Date, Time, LocalTime, DateTime, LocalDateTime, and Duration. For more details, visit the documentation.
The following data types are included in the structural types category: Node, Relationship, and Path.
Nodes, relationships, and paths are returned as a result of pattern matching. In Neo4j, all relationships have a direction. However, you can have the notion of undirected relationships at query time.
The following data types are included in the composite types category: List and Map.
RETURN [1,2,3,4,5], {id:1, msg: 'hello', age: 31}
When you reference parts of a pattern or a query, you do so by naming them. The names you give the different parts are called variables.
In this example:
MATCH (n)-->(b)
RETURN b
The variables are n and b.
// this is a comment
A comment begin with double slash (//) and continue to the end of the line. Comments do not execute, they are for humans to read.
WITH ['a', 'a', 4, 'b', 4] AS ps
UNWIND ps AS p
RETURN DISTINCT p
The output will have only ‘a’, 4, and ‘b’.
. statically access the property of a node or relationship[] used for filtering on a dynamically-computed property keyCREATE
(a:Restaurant {name: 'Hungry Jo', rating_hygiene: 10, rating_food: 7}),
(b:Restaurant {name: 'Buttercup Tea Rooms', rating_hygiene: 5, rating_food: 6}),
(c1:Category {name: 'hygiene'}),
(c2:Category {name: 'food'})
WITH a, b, c1, c2
MATCH (restaurant:Restaurant), (category:Category)
WHERE restaurant["rating_" + category.name] > 6
RETURN DISTINCT restaurant.name
= used for replacing all properties of a node or relationshipCREATE (a:Person {name: 'Jane', age: 20})
WITH a
MATCH (p:Person {name: 'Jane'})
SET p = {name: 'Ellen', livesIn: 'London'}
RETURN p.name, p.age, p.livesIn
The mathematical operators comprise:
The comparison operators comprise:
String-specific comparison operators comprise:
WITH ['mouse', 'chair', 'door', 'house'] AS wordlist
UNWIND wordlist AS word
WITH word
WHERE word =~ '.*ous.*'
RETURN word
The boolean operators — also known as logical operators — comprise: AND, OR, XOR, NOT.
The string operators comprise: concatenating strings: +
RETURN 'neo' + '4j' AS result
Check other operators from the documentation.
Patterns and pattern-matching are at the very heart of Cypher, so being effective with Cypher requires a good understanding of patterns.
Using patterns, you describe the shape of the data you are looking for. For example, in the MATCH clause you describe the shape with a pattern, and Cypher will figure out how to get that data for you.
The pattern describes the data using a form that is very similar to how one typically draws the shape of property graph data on a whiteboard: usually as circles (representing nodes) and arrows between them to represent relationships.
Patterns appear in multiple places in Cypher: in MATCH, CREATE and MERGE clauses, and in pattern expressions.
This simple pattern describes a single node, and names that node using the variable a.
(a)
This pattern describes a very simple data shape: two nodes, and a single relationship from one to the other.
(a)-->(b)
This manner of describing nodes and relationships can be extended to cover an arbitrary number of nodes and the relationships between them, for example:
(a)-->(b)<--(c)
Such a series of connected nodes and relationships is called a “path”.
The most simple attribute that can be described in the pattern is a label that the node must have. For example:
(a:User)-->(b)
One can also describe a node that has multiple labels:
(a:User:Admin)-->(b)
Properties can be expressed in patterns using a map-construct: curly brackets surrounding a number of key-expression pairs, separated by commas. E.g. a node with two properties on it would look like:
(a {name: 'Andy', sport: 'Brazilian Ju-Jitsu'})
A relationship with expectations on it is given by:
(a)-[{blocked: false}]->(b)
The simplest way to describe a relationship is by using the arrow between two nodes, as in the previous examples. Using this technique, you can describe that the relationship should exist and the directionality of it. If you don’t care about the direction of the relationship, the arrow head can be omitted, as exemplified by:
(a)--(b)
(a)-[r]->(b)
(a)-[r:REL_TYPE]->(b)
(a)-[r:TYPE1|TYPE2]->(b)
(a)-[:REL_TYPE]->(b)
Rather than describing a long path using a sequence of many node and relationship descriptions in a pattern, many relationships (and the intermediate nodes) can be described by specifying a length in the relationship description of a pattern. For example:
(a)-[*2]->(b)
This describes a graph of three nodes and two relationships, all in one path (a path of length 2). This is equivalent to:
(a)-->()-->(b)
A range of lengths can also be specified: such relationship patterns are called ‘variable length relationships’. For example:
(a)-[*3..5]->(b)
This is a minimum length of 3, and a maximum of 5. It describes a graph of either 4 nodes and 3 relationships, 5 nodes and 4 relationships or 6 nodes and 5 relationships, all connected together in a single path.
Either bound can be omitted. For example, to describe paths of length 3 or more, use:
(a)-[*3..]->(b)
To describe paths of length 5 or less, use:
(a)-[*..5]->(b)
Omitting both bounds is equivalent to specifying a minimum of 1, allowing paths of any positive length to be described:
(a)-[*]->(b)
Cypher allows paths to be named using an identifer, as exemplified by:
p = (a)-[*3..5]->(b)
In Neo4j Desktop, create a new project as follows:

and add a local DBMs to the new project. Then, start the database as follows:

As you can see in the following screenshot, the database is active and we can access it as a default user neo4j by opening a dashboard with Neo4j Browser.

In order to load csv files to the database, you need to add them to the import of the database (do not add them to the project directory) as follows:

After copying/pasting the csv files, the import directory will be as follows:

CREATE DATABASE socialmedia
:use socialmedia
/* Create Index for users on userid field */
CREATE INDEX FOR (u:User) ON (u.userid)
/* Create Index for posts on postid field */
CREATE INDEX FOR (u:Post) ON (u.postid)
/* Load users data */
LOAD CSV WITH HEADERS
FROM "file:///users.csv" AS row
FIELDTERMINATOR ','
CREATE (
:User {
userid: toInteger(row['userid']),
name:row['name'],
surname: row['surname'],
age: toInteger(row['age']),
timestamp: datetime({epochSeconds: toInteger(row['timestamp'])})}
);
/* Load posts data */
LOAD CSV WITH HEADERS
FROM "file:///posts.csv" AS row
FIELDTERMINATOR ','
MATCH (u:User{userid:toInteger(row['userid'])})
MERGE (
:Post {
postid: toInteger(row['postid']),
userid: toInteger(row['userid']),
posttype:row['posttype']
})<-[:ADD{
posttimestamp: datetime({epochSeconds: toInteger(row['posttimestamp'])})
}]-(u);
/* Retrieve some nodes */
MATCH (n)
RETURN n
LIMIT 10;
/* Load Friends */
:auto LOAD CSV WITH HEADERS
FROM "file:///friends.csv" AS row
FIELDTERMINATOR ','
CALL {
WITH row
MATCH (u:User{userid:toInteger(row['friend1'])})
MATCH (v:User{userid:toInteger(row['friend2'])})
MERGE (u)-[:FRIEND]->(v)-[:FRIEND]->(u)
} IN TRANSACTIONS OF 1000 ROWS
/* IN 3 CONCURRENT TRANSACTIONS OF 1000 ROWS */
/*When a write transaction occurs,
Neo4j takes locks to preserve data consistency while updating.
A deadlock happens when two transactions are blocked by
each other because they are attempting to concurrently
modify a node or a relationship that is locked by the
other transaction
*/
Cypher is Neo4j’s graph query language that lets you retrieve data from the graph. It is like SQL for graphs, and was inspired by SQL so it lets you focus on what data you want out of the graph (not how to go get it). It is the easiest graph language to learn by far because of its similarity to other languages and intiutiveness.
MATCH (u:User)-[:ADD]-()
RETURN u
LIMIT 10 // This is added just to not display all nodes
MATCH (p:Post)
RETURN p
LIMIT 10 // This is added just to not display all nodes
MATCH (p:Post)<-[a:ADD]-()
WHERE a.posttimestamp.year > 2012
RETURN p
LIMIT 10 // This is added just to not display all nodes
MATCH (p:Post)-[a:ADD]-()
RETURN MAX(a.posttimestamp), MIN(a.posttimestamp)
Image?MATCH (p:Post{posttype:'Image'})-[a:ADD]-()
RETURN p, a
ORDER BY a.posttimestamp DESC
LIMIT 5
Image or Video recently?MATCH (u:User)-[a:ADD]->(p:Post)
WHERE p.posttype IN ['Image', 'Video']
RETURN u, p
ORDER BY a.posttimestamp
LIMIT 5
MATCH (u:User)
RETURN MAX(u.age), MIN(u.age)
// Update the age of the person
MATCH (u:User)
SET u.age = CASE
WHEN u.age < 0 THEN -u.age
ELSE u.age
END
Image posts?MATCH (u:User)-[:ADD]->(p:Post{posttype:'Image'})
RETURN DISTINCT u.name, u.surname, u.age, p.posttype
ORDER BY u.age ASC
SKIP 1
LIMIT 1
MATCH (p:Post)<-[:ADD]-(u:User)
WHERE u.age IN range(20, 30)
RETURN p.postid, p.posttype, u.age, u.userid
ORDER BY u.age
CALL{
MATCH (u:User)
RETURN u.userid AS youngest
ORDER BY u.age ASC
LIMIT 1
}
WITH youngest
MATCH (u:User{userid:youngest})-[:FRIEND]-(f1:User)
RETURN DISTINCT u.userid AS user, f1.userid AS friend
Thronton?MATCH (u:User{surname:'Thronton'})-[:FRIEND]-(f1:User)
RETURN COUNT(DISTINCT f1)
MATCH (u:User)--(p:Post)
RETURN u.userid AS user, COUNT(*) AS c
ORDER BY c DESC
Image?MATCH (f:User)--(u:User)--(:Post{posttype:'Image'})
RETURN DISTINCT u.userid, u.name, u.surname, f.userid, f.name, f.surname, f.age
MATCH (u:User)--(f1:User)
WHERE u.age < 20
RETURN COUNT(DISTINCT f1.userid) AS c
MATCH (u:User)-[:FRIEND*2]-(f2:User)
WHERE u.age < 20 AND f2.userid <> u.userid
RETURN COUNT(DISTINCT f2.userid) AS c
psql -U postgres as [root@sandbox-hdp data]# psql -U postgres. You may get the following error:
psql: FATAL: Peer authentication failed for user “postgres”
local all all trust at the beginning of the file /var/lib/pgsql/data/pg_hba.conf then restart PostgreSQL service by running the command systemctl restart postgresql as root user.NB: If you have other issues, please contact the TA.