Lab 2 - Data Retrieval with SQL and Cypher

Course: Big Data - IU S25
Author: Firas Jolha

Dataset

Users and Posts

Agenda

Lab 2 - Data Retrieval with SQL and Cypher
Dataset
Agenda
Prerequisites
Objectives
Introduction
Dataset Descripion
Access PostgreSQL server
PostgreSQL Meta-commands
Build the Database
- Fix Friends table
Data Retrieval using SQL
Graph databases
- Download and Install Neo4j
  - Neo4j docker image
Cypher
Data Retrieval with Cypher
Common issues
References

Prerequisites

HDP Sandbox is installed
Familiarity with SQL

Objectives

Access PostgreSQL database server
Review CRUD operations on a relational database
Install Neo4j on HDP Sandbox
Perform some CRUD operations on a graph database

Introduction

The Structured Query Language (SQL) is the most extensively used database language. SQL is composed of a data definition language (DDL), which allows the specification of database schemas; a data manipulation language (DML), which supports operations to retrieve, store, modify and delete data. In this lab, we will practice how to retrieve structured data from relational databases using SQL. Then we will build a graph for the same dataset on Neo4j and return connected data via Cypher queries.

Dataset Descripion

This dataset contains four tables about social media users.

users. csv contains data about the users of the social net.
friends.csv contains data about their friendship status
posts.csv contains data about the posts they made

File name	Description	Fields
users.csv	A line is added to this file when a user is subscribed to the socia media along with the age and subscription date time	userid, surname, name, age, timestamp
posts.csv	The user can add multiple posts and for each post we have post type and post unix timestamp	postid, user, posttype, posttimestamp
friends.csv	The user can have friends and this relationship is mutual	friend1, friend2

Access PostgreSQL server

In fact, HDP Sandbox comes with a pre-installed PostgreSQL server so you do not to install any additional software to use PostgreSQL. You can access it via psql as postgres user:

[root@sandbox-hdp ~]#  psql -U postgres

This will open a CLI to interact with PostgreSQL. You can access the databases using \c command and the tables via \dt command. For example, you can access the database ambari (a default database on HDP Sandbox) by running the following command in postgreSQL CLI.

postgres=# \c ambari
You are now connected to database "ambari" as user "postgres".

The version of PostgreSQL in HDP 2.6.5 is 9.2.23 as shown below. So you should read the documentation for this version of PostgreSQL.

PostgreSQL Meta-commands

Anything you enter in psql that begins with an unquoted backslash is a psql meta-command that is processed by psql itself. These commands make psql more useful for administration or scripting. Meta-commands are often called slash or backslash commands.

Switch connection to a new database dbname as a user username

\c <dbname> <username>

List available databases

\l

List available tables

\dt

Describe a table <table_name>

\d <table_name>

Execute psql commands from a file <file.sql>

\i <file.sql>

Get help on psql commands

\?
\h CREATE TABLE

Turn on query execution time

\timing

You use the same command \timing to turn it off.

Quit psql

\q

You can run the shell commands in psql CLI via \! . For instance, to print the current working directory we write:

\! pwd

Build the Database

Note: Explore the dataset using any data analysis tool (Python+Pandas, Excel, Google Sheet, …etc) you have before building the database.

Another note: You do not need to create a new role/user. Just use the current role postgres.

The dataset are csv files and need to be imported into PostgreSQL tables.

To copy the values from csv files into the table, you need to use Postgres COPY method od data-loading. Postgres’s COPY comes in two separate variants, COPY and \COPY: COPY is server based, \COPY is client based. COPY will be run by the PostgreSQL backend (user “postgres”). The backend user requires permissions to read & write to the data file in order to copy from/to it. You need to use an absolute pathname with COPY. \COPY on the other hand, runs under the current $USER, and with that users environment. And \COPY can handle relative pathnames. The psql \COPY is accordingly much easier to use if it handles what you need. So, you can use the meta-command \COPY to import the data from csv files into PostgreSQL tables but you need to make sure that the primary keys and foreign keys are set according to the given ER diagram. You can learn more about ER diagrams from here.

Notice that you need to create tables before importing their data from files. You can detect the datatype of the column from its values.

Create the database social

-- Create Database
CREATE DATABASE social;

-- Switch to the Database
\c social;

-- add User table
CREATE TABLE IF NOT EXISTS Users(
userid INT PRIMARY KEY NOT NULL,
surname VARCHAR(50),
name VARCHAR(50),
age INT,
_timestamp timestamp,
temp BIGINT NOT NULL
);

-- add Post table
CREATE TABLE IF NOT EXISTS Posts(
postid INT PRIMARY KEY NOT NULL,
userid INT NOT NULL,
posttype VARCHAR(50),
posttimestamp timestamp,
temp BIGINT NOT NULL
);

-- add Friend table
CREATE TABLE IF NOT EXISTS Friends(
friend1 INT NOT NULL,
friend2 INT NOT NULL
);

-- add constrains
ALTER TABLE Friends
ADD CONSTRAINT fk_User_userid_friend1
FOREIGN KEY (friend1)
REFERENCES Users (userid);

ALTER TABLE Friends
ADD CONSTRAINT fk_User_userid_friend2
FOREIGN KEY (friend2)
REFERENCES Users (userid);

ALTER TABLE Posts
ADD CONSTRAINT fk_User_userid_userid
FOREIGN KEY (userid)
REFERENCES Users (userid);

Upload the csv data into tables


\COPY Users(userid, surname, name, age, temp)  FROM 'users.csv'  DELIMITER ','  CSV HEADER;
UPDATE Users SET _timestamp = TO_TIMESTAMP(temp);
ALTER TABLE Users DROP COLUMN temp;


\COPY Posts(postid, userid, posttype, temp) FROM 'posts.csv' CSV HEADER;
UPDATE Posts SET posttimestamp = TO_TIMESTAMP(temp);
ALTER TABLE Posts DROP COLUMN temp;


\COPY Friends from 'friends.csv' csv header;

Fix Friends table

CREATE TABLE friends2 AS
SELECT friend1, friend2 FROM friends
UNION
SELECT friend2, friend1 FROM friends;


ALTER TABLE friends RENAME TO friends_old;

ALTER TABLE friends2 RENAMT TO friends;

Data Retrieval using SQL

In this part of the lab, we will practice on writing SQL queries to retrieve data from our database.

Data Retrieval from a single table

You can return data from a single table by using SELECT command. This kind of data retrieval is the simplest way to get data from a single table.

Exercises

Who are the users who have posts?

SELECT DISTINCT userid
FROM posts;

What are the posts of the users?

SELECT DISTINCT postid
FROM posts;

Which posts published after 2012?

SELECT *
FROM posts
WHERE EXTRACT(YEAR FROM posttimestamp) > 2012;

What is the oldest and recent dates of the posts?

SELECT MAX(posttimestamp), MIN(posttimestamp)
FROM posts;

What are the top 5 most recent posts whose type is Image?

SELECT * 
FROM posts
WHERE posttype='Image'
ORDER BY posttimestamp DESC
LIMIT 5;

Which top 5 users who posted an Image or Video recently?

SELECT userid, posttype
FROM posts
WHERE posttype IN ('Image', 'Video')
ORDER BY posttimestamp DESC
LIMIT 5;

What is the age of the eldest and youngest persons?

SELECT MAX(age), MIN(age)
FROM users;

UPDATE users
SET age = CASE
           WHEN age < 0 THEN -age
           ELSE age
          END;

In PostgreSQL, double quotes "" are used to indicate identifiers within the database, which are objects like tables, column names, and roles. In contrast, single quotes '' are used to indicate string literals.

Data Retrieval from multiple tables (Joining tables)

Exercises

Who is the 2nd youngest person who published Image posts?

SELECT DISTINCT name, surname, age, posttype
FROM posts 
JOIN users ON posts.userid = users.userid
WHERE posts.posttype='Image'
ORDER BY age ASC
LIMIT 1
OFFSET 1;

What are the posts that are published by people who are between 20 and 30?

SELECT postid, posttype, age, users.userid
FROM posts 
JOIN users ON posts.userid = users.userid
WHERE age <@ int4range(20, 30)
ORDER BY age;

What are the friends of the youngest person?

WITH youngest AS(
    SELECT userid 
    FROM users 
    ORDER BY age 
    ASC LIMIT 1
)
SELECT friend1 AS user, friend2 AS friend
FROM youngest, friends
JOIN users ON friends.friend1 = users.userid
WHERE users.userid = youngest.userid;

How many friends for the surname Thronton?

SELECT COUNT(friend2)
FROM users
JOIN friends ON (users.userid = friends.friend1)
WHERE users.surname = 'Thronton';

Aggregation and grouping

Exercises

How many posts for each user?

SELECT users.userid, COUNT(*) as c
FROM users
JOIN posts ON (users.userid = posts.userid)
GROUP BY users.userid
ORDER BY c DESC;

Who are the friends of users who posted Image?

SELECT DISTINCT u2.userid, u2.name, u2.surname, u2.age, u.userid, u.name, u.surname
FROM users u
JOIN friends f ON f.friend1 = u.userid
JOIN posts ON posts.userid = u.userid
JOIN users u2 ON f.friend2 = u2.userid
WHERE posts.posttype = 'Image';

How many friends of users whose age is less than 20?

SELECT COUNT(DISTINCT f1.friend2) as c
FROM users u
JOIN friends f1 ON (u.userid = f1.friend1)
WHERE u.age < 20;

How many friends of friends of users whose age is less than 20?

SELECT COUNT(DISTINCT f2.friend2) as c
FROM users u
JOIN friends f1 ON (u.userid = f1.friend1)
JOIN friends f2 ON (f1.friend2 = f2.friend1)
WHERE u.age < 20 AND f2.friend2 <> u.userid;

How many users have more than 10 posts?

WITH users_10_posts AS (
    SELECT posts.userid, COUNT(*) as c
    FROM posts
    GROUP BY posts.userid
    HAVING COUNT(*) > 10
)
SELECT COUNT(*) as c
FROM users_10_posts;

What is the average number of posts published at noon and by the friends of the users whose age is more than 40? Calculate the percentage of posts.


WITH post_count AS (
    SELECT COUNT(posts.postid) AS c
    FROM users u
    JOIN friends f ON f.friend1 = u.userid
    JOIN posts ON posts.userid = f.friend1
    JOIN users u2 ON f.friend1 = u2.userid
    WHERE EXTRACT(HOUR FROM u2._timestamp) = 12 AND u.age > 40
)
SELECT AVG(c)
FROM post_count;

What is the percentage of posts published at noon and by the friends of the users whose age is more than 40?

WITH post_count AS (
    SELECT COUNT(posts.postid) AS c
    FROM users u
    JOIN friends f ON f.friend1 = u.userid
    JOIN posts ON posts.userid = f.friend1
    JOIN users u2 ON f.friend1 = u2.userid
    WHERE EXTRACT(HOUR FROM u2._timestamp) = 12 AND u.age > 40) ,
    post_total AS (
        SELECT COUNT(*) as c2
        FROM posts
    )
SELECT 100 * c/c2 || '%'
FROM post_total, post_count;

Graph databases

A graph database stores nodes and relationships instead of tables, or documents. Data is stored just like you might sketch ideas on a whiteboard. Your data is stored without restricting it to a pre-defined model, allowing a very flexible way of thinking about and using it.

Graph databases address big challenges many of us tackle daily. Modern data problems often involve many-to-many relationships with heterogeneous data that sets up needs to:

Navigate deep hierarchies,
Find hidden connections between distant items, and
Discover inter-relationships between items.

Whether it’s a social network, payment networks, or road network you’ll find that everything is an interconnected graph of relationships. And when we want to ask questions about the real world, many questions are about the relationships rather than about the individual data elements.

Example of a simple graph database

Neo4j is an open-source, NoSQL, native graph database that provides an ACID-compliant transactional backend for your applications that has been publicly available since 2007. In Neo4j, information is organized as nodes, relationships, and properties. We can retrieve data from Neo4j via its query language called Cypher which is a declarative query language similar to SQL, but optimized for graphs.

Building blocks of the property graph model

The property graph model consists of mainly nodes and relationships. Nodes are the entities in the graph.

Nodes can be tagged with labels, representing their different roles in your domain. (For example, Person).
Nodes can hold any number of key-value pairs, or properties. (For example, name).
Node labels may also attach metadata (such as index or constraint information) to certain nodes.

Whereas relationships provide directed, named, connections between two node entities (e.g. Person LOVES Person).

Relationships always have a direction, a type, a start node, and an end node, and they can have properties, just like nodes.
Nodes can have any number or type of relationships without sacrificing performance.
Although relationships are always directed, they can be navigated efficiently in any direction.

Download and Install Neo4j

Neo4j can be installed as an interpreter in Zeppelin but in HDP 2.6.5, Zeppelin 0.7.3 is installed and does not support neo4j interpreter. Installing Neo4j on HDP 2.6.5 as a Zeppelin interpreter will not work. The latest version (in Feb. 2025) of Zeppelin 0.11.0 supports Neo4j but we do not need to install it. Indeed, HDP 3.0 has Zeppelin 0.8.0 installed and supports neo4j interpreter but it is not a big deal to install HDP 3.0 just for neo4j since our work on the sandbox will not be mostly on graph databases.

The easy approach is to install Neo4j Desktop on your local machine. You can find here (https://neo4j.com/docs/desktop-manual/current/installation/download-installation/) some information about installing Neo4j Desktop and you can find here (https://neo4j.com/deployment-center) download links. After you start downloading, the website will give a long key you need to use when you install the software as follows:

The software will prepare the environment for installation as follows.

Installing Neo4j Desktop

When you succeed installation, it will show the following dashboard.

The less easy approach is to install Neo4j on your cluster node via yum OR offline via rpm. You can find here (https://yum.neo4j.com) some information about installing neo4j via yum and here (https://neo4j.com/docs/operations-manual/3.5/installation/linux/rpm/#linux-yum) some information about offline installation. After that, you can forward the port of the server to some custom port 10015 to the host machine in order to access Neo4j Browser (change this setting server.http.listen_address to 10015 in neo4j.conf file of the Neo4j DBMS instance).

If you want to install Neo4j on the cluster, you need to know that the old versions of Neo4j can be installed. I suggest the version 3.5.35.1 and you can see below the cypher shell dependencies.

You can access the guest machine in VirtualBox by ssh into the HostSSH port specified in the port forwarding rules which can be reached from the network settings of the virtual machine. You can also add your rules to the list.

Neo4j docker image

Neo4j has docker image in dockerhub for the Neo4j database server. You can follow this approach to run the server as a docker container and access the service from the browser. Make sure that you have installed Docker engine and docker daemon is working. You can pull the image as follows:

docker pull neo4j

You can run the Neo4j server as a container:

docker run --name neo4j_server  --publish=7474:7474 --publish=7687:7687 --volume /neo4j/logs:/var/lib/neo4j/logs --volume /neo4j/import:/var/lib/neo4j/import --volume /neo4j/plugins:/var/lib/neo4j/plugins  --volume /neo4j/data:/var/lib/neo4j/data --env NEO4J_AUTH=neo4j/neo4jneo4j neo4j

The command above will create and run a docker container neo4j with name neo4j_server and publish ports 7474 for http and 7687 for bolt protocols. It will also attach volumes for logs, import, plugins and data folders. The env option will assign the password neo4jneo4j for the default user neo4j. You can use -d or --detach option to let the server run in detached mode in the background.

You can stop the container when you are done as follow:

docker stop neo4j_server

Where neo4j_server is the container name or id if you did not assign a name.

You can start the container again as follows:

docker start neo4j_server

Where neo4j_server is the container name or id if you did not assign a name.

In case you want to change some of the settings in running the container then you can delete the container and create a new one. Your data inside can be stored in local file system by attaching volumes to folders you want to. You can delete the container as follows: (make sure that you stopped it)

docker rm neo4j_server

Where neo4j_server is the container name or id if you did not assign a name.

You can access the Neo4j Browser usually at http://localhost:7474.

If you can not install Neo4j due to some reason, you can open a free session of Neo4j Sandbox at https://neo4j.com/sandbox/ but you need proxy.

Cypher

Cypher is Neo4j’s graph database query language that allows users to store and retrieve data from the graph database. It is a declarative, SQL-inspired language for describing visual patterns in graphs. The syntax provides a visual and logical way to match patterns of nodes and relationships in the graph. Cypher keywords are not case-sensitive but it is case-sensitive for variables.

All Neo4j servers contain a built-in database called system, which behaves differently than all other databases. The system database stores system data and you can not perform graph queries against it. A fresh installation of Neo4j includes two databases:

system - the system database described above, containing meta-data on the DBMS and security configuration.
neo4j - the default database, named using the config option dbms.default_database=neo4j.

Cypher provides first class support for a number of data types. These fall into the following three categories: property, structural, and composite.

Property types

The following data types are included in the property types category: Integer, Float, String, Boolean, Point, Date, Time, LocalTime, DateTime, LocalDateTime, and Duration. For more details, visit the documentation.

Structural types

The following data types are included in the structural types category: Node, Relationship, and Path.

The Node data type includes: Id, Label(s), and Map (of properties). Note that labels are not values, but a form of pattern syntax.
The Relationship data type includes: Id, Type, Map (of properties), Id of start node, and Id of end node.
The Path data type is an alternating sequence of nodes and relationships.

Nodes, relationships, and paths are returned as a result of pattern matching. In Neo4j, all relationships have a direction. However, you can have the notion of undirected relationships at query time.

Composite types

The following data types are included in the composite types category: List and Map.

RETURN [1,2,3,4,5], {id:1, msg: 'hello', age: 31}

How to define a variable

When you reference parts of a pattern or a query, you do so by naming them. The names you give the different parts are called variables.
In this example:

MATCH (n)-->(b)
RETURN b

The variables are n and b.

// this is a comment

A comment begin with double slash (//) and continue to the end of the line. Comments do not execute, they are for humans to read.

Operators

DISTINCT (aggregation operator)

WITH ['a', 'a', 4, 'b', 4] AS ps
UNWIND ps AS p
RETURN DISTINCT p

The output will have only ‘a’, 4, and ‘b’.

Property operators

The operator `.` statically access the property of a node or relationship

The operator `[]` used for filtering on a dynamically-computed property key

CREATE
  (a:Restaurant {name: 'Hungry Jo', rating_hygiene: 10, rating_food: 7}),
  (b:Restaurant {name: 'Buttercup Tea Rooms', rating_hygiene: 5, rating_food: 6}),
  (c1:Category {name: 'hygiene'}),
  (c2:Category {name: 'food'})
WITH a, b, c1, c2
MATCH (restaurant:Restaurant), (category:Category)
WHERE restaurant["rating_" + category.name] > 6
RETURN DISTINCT restaurant.name

The operator `=` used for replacing all properties of a node or relationship

CREATE (a:Person {name: 'Jane', age: 20})
WITH a
MATCH (p:Person {name: 'Jane'})
SET p = {name: 'Ellen', livesIn: 'London'}
RETURN p.name, p.age, p.livesIn

Mathemtical operators

The mathematical operators comprise:

addition: +
subtraction or unary minus: -
multiplication: *
division: /
modulo division: %
exponentiation: ^

Comparison operators

The comparison operators comprise:

equality: =
inequality: <>
less than: <
greater than: >
less than or equal to: <=
greater than or equal to: >=
IS NULL
IS NOT NULL

String-specific comparison operators comprise:

STARTS WITH: perform case-sensitive prefix searching on strings
ENDS WITH: perform case-sensitive suffix searching on strings
CONTAINS: perform case-sensitive inclusion searching in strings
=~: regular expression for matching a pattern

WITH ['mouse', 'chair', 'door', 'house'] AS wordlist
UNWIND wordlist AS word
WITH word
WHERE word =~ '.*ous.*'
RETURN word

Boolean operators

The boolean operators — also known as logical operators — comprise: AND, OR, XOR, NOT.

String operators

The string operators comprise: concatenating strings: +

RETURN 'neo' + '4j' AS result

Check other operators from the documentation.

Patterns

Patterns and pattern-matching are at the very heart of Cypher, so being effective with Cypher requires a good understanding of patterns.
Using patterns, you describe the shape of the data you are looking for. For example, in the MATCH clause you describe the shape with a pattern, and Cypher will figure out how to get that data for you.

The pattern describes the data using a form that is very similar to how one typically draws the shape of property graph data on a whiteboard: usually as circles (representing nodes) and arrows between them to represent relationships.

Patterns appear in multiple places in Cypher: in MATCH, CREATE and MERGE clauses, and in pattern expressions.

Patterns for nodes

This simple pattern describes a single node, and names that node using the variable a.

(a)

This pattern describes a very simple data shape: two nodes, and a single relationship from one to the other.

(a)-->(b)

This manner of describing nodes and relationships can be extended to cover an arbitrary number of nodes and the relationships between them, for example:

(a)-->(b)<--(c)

Such a series of connected nodes and relationships is called a “path”.

Patterns for labels

The most simple attribute that can be described in the pattern is a label that the node must have. For example:

(a:User)-->(b)

One can also describe a node that has multiple labels:

(a:User:Admin)-->(b)

Specifying properties

Properties can be expressed in patterns using a map-construct: curly brackets surrounding a number of key-expression pairs, separated by commas. E.g. a node with two properties on it would look like:

(a {name: 'Andy', sport: 'Brazilian Ju-Jitsu'})

A relationship with expectations on it is given by:

(a)-[{blocked: false}]->(b)

Patterns for relationships

The simplest way to describe a relationship is by using the arrow between two nodes, as in the previous examples. Using this technique, you can describe that the relationship should exist and the directionality of it. If you don’t care about the direction of the relationship, the arrow head can be omitted, as exemplified by:

(a)--(b)

(a)-[r]->(b)

(a)-[r:REL_TYPE]->(b)
    
(a)-[r:TYPE1|TYPE2]->(b)

(a)-[:REL_TYPE]->(b)

Variable-length pattern matching

Rather than describing a long path using a sequence of many node and relationship descriptions in a pattern, many relationships (and the intermediate nodes) can be described by specifying a length in the relationship description of a pattern. For example:

(a)-[*2]->(b)

This describes a graph of three nodes and two relationships, all in one path (a path of length 2). This is equivalent to:

(a)-->()-->(b)

A range of lengths can also be specified: such relationship patterns are called ‘variable length relationships’. For example:

(a)-[*3..5]->(b)

This is a minimum length of 3, and a maximum of 5. It describes a graph of either 4 nodes and 3 relationships, 5 nodes and 4 relationships or 6 nodes and 5 relationships, all connected together in a single path.

Either bound can be omitted. For example, to describe paths of length 3 or more, use:

(a)-[*3..]->(b)

To describe paths of length 5 or less, use:

(a)-[*..5]->(b)

Omitting both bounds is equivalent to specifying a minimum of 1, allowing paths of any positive length to be described:

(a)-[*]->(b)

Assigning to path variables

Cypher allows paths to be named using an identifer, as exemplified by:

p = (a)-[*3..5]->(b)

Build the database

In Neo4j Desktop, create a new project as follows:

and add a local DBMs to the new project. Then, start the database as follows:

As you can see in the following screenshot, the database is active and we can access it as a default user neo4j by opening a dashboard with Neo4j Browser.

In order to load csv files to the database, you need to add them to the import of the database (do not add them to the project directory) as follows:

After copying/pasting the csv files, the import directory will be as follows:

Create a new database and access it.

CREATE DATABASE socialmedia

:use socialmedia

Create nodes and relationships
Each of the 3 CSV files needs to be loaded into Neo4j using the LOAD CSV command that reads each row of the file and then assigns the imported values to the nodes and edges of the graph.


/* Create Index for users on userid field */
CREATE INDEX FOR (u:User) ON (u.userid)
    
/* Create Index for posts on postid field */
CREATE INDEX FOR (u:Post) ON (u.postid)

/* Load users data */
LOAD CSV WITH HEADERS 
FROM "file:///users.csv" AS row 
FIELDTERMINATOR ','
CREATE (
    :User {
        userid: toInteger(row['userid']), 
        name:row['name'], 
        surname: row['surname'], 
        age: toInteger(row['age']), 
        timestamp: datetime({epochSeconds: toInteger(row['timestamp'])})}
);
    
    
/* Load posts data */
LOAD CSV WITH HEADERS 
FROM "file:///posts.csv" AS row 
FIELDTERMINATOR ','
MATCH (u:User{userid:toInteger(row['userid'])})
MERGE (
    :Post {
        postid: toInteger(row['postid']),  
        userid: toInteger(row['userid']), 
        posttype:row['posttype']
        })<-[:ADD{
    posttimestamp: datetime({epochSeconds: toInteger(row['posttimestamp'])})
    }]-(u);

/* Retrieve some nodes */
MATCH (n)
RETURN n
LIMIT 10;

/* Load Friends */

:auto LOAD CSV WITH HEADERS 
FROM "file:///friends.csv" AS row 
FIELDTERMINATOR ','
CALL {
WITH row
MATCH (u:User{userid:toInteger(row['friend1'])})
MATCH (v:User{userid:toInteger(row['friend2'])})
MERGE (u)-[:FRIEND]->(v)-[:FRIEND]->(u)
} IN TRANSACTIONS OF 1000 ROWS

/* IN 3 CONCURRENT TRANSACTIONS OF 1000 ROWS */
/*When a write transaction occurs, 
Neo4j takes locks to preserve data consistency while updating.

A deadlock happens when two transactions are blocked by 
each other because they are attempting to concurrently 
modify a node or a relationship that is locked by the 
other transaction
*/

Data Retrieval with Cypher

Cypher is Neo4j’s graph query language that lets you retrieve data from the graph. It is like SQL for graphs, and was inspired by SQL so it lets you focus on what data you want out of the graph (not how to go get it). It is the easiest graph language to learn by far because of its similarity to other languages and intiutiveness.

Exercises

Who are the users who have posts?

MATCH (u:User)-[:ADD]-()
RETURN u
LIMIT 10 // This is added just to not display all nodes

What are the posts of the users?

MATCH (p:Post)
RETURN p
LIMIT 10 // This is added just to not display all nodes

Which posts published after 2012?

MATCH (p:Post)<-[a:ADD]-()
WHERE a.posttimestamp.year > 2012
RETURN p
LIMIT 10 // This is added just to not display all nodes

What is the oldest and recent dates of the posts?

MATCH (p:Post)-[a:ADD]-()
RETURN MAX(a.posttimestamp), MIN(a.posttimestamp)

What are the top 5 most recent posts whose type is Image?

MATCH (p:Post{posttype:'Image'})-[a:ADD]-()
RETURN p, a
ORDER BY a.posttimestamp DESC
LIMIT 5

Which top 5 users who posted an Image or Video recently?

MATCH (u:User)-[a:ADD]->(p:Post)
WHERE p.posttype IN ['Image', 'Video']
RETURN u, p
ORDER BY a.posttimestamp
LIMIT 5

What is the age of the eldest and youngest persons?

MATCH (u:User)
RETURN MAX(u.age), MIN(u.age)

// Update the age of the person
MATCH (u:User)
SET u.age = CASE
           WHEN u.age < 0 THEN -u.age
           ELSE u.age
          END

Who is the 2nd youngest person who published Image posts?

MATCH (u:User)-[:ADD]->(p:Post{posttype:'Image'})
RETURN DISTINCT u.name, u.surname, u.age, p.posttype
ORDER BY u.age ASC
SKIP 1
LIMIT 1

What are the posts that are published by people who are between 20 and 30?

MATCH (p:Post)<-[:ADD]-(u:User)
WHERE u.age IN range(20, 30)
RETURN p.postid, p.posttype, u.age, u.userid
ORDER BY u.age

What are the friends of the youngest person?

CALL{
    MATCH (u:User)
    RETURN u.userid AS youngest
    ORDER BY u.age ASC
    LIMIT 1
}
WITH youngest
MATCH (u:User{userid:youngest})-[:FRIEND]-(f1:User)
RETURN DISTINCT u.userid AS user, f1.userid AS friend

How many friends for the surname Thronton?

MATCH (u:User{surname:'Thronton'})-[:FRIEND]-(f1:User)
RETURN COUNT(DISTINCT f1)

How many posts for each user?

MATCH (u:User)--(p:Post)
RETURN u.userid AS user, COUNT(*) AS c
ORDER BY c DESC

Who are the friends of users who posted Image?

MATCH (f:User)--(u:User)--(:Post{posttype:'Image'})
RETURN DISTINCT u.userid, u.name, u.surname, f.userid, f.name, f.surname, f.age

How many friends of users whose age is less than 20?

MATCH (u:User)--(f1:User)
WHERE u.age < 20
RETURN COUNT(DISTINCT f1.userid) AS c

How many friends of friends of users whose age is less than 20?

MATCH (u:User)-[:FRIEND*2]-(f2:User)
WHERE u.age < 20 AND f2.userid <> u.userid
RETURN COUNT(DISTINCT f2.userid) AS c

How many users have more than 10 posts?
What is the average number of posts published at noon and by the friends of the users whose age is more than 40? Calculate the percentage of posts.
What is the percentage of posts published at noon and by the friends of the users whose age is more than 40?

Common issues

When you try to access the database by running the command psql -U postgres as [root@sandbox-hdp data]# psql -U postgres. You may get the following error:

psql: FATAL: Peer authentication failed for user “postgres”
- Possible reasons:
  - The database is configured to allow access only through the credentials of the user.
- Possible solutions:
  - You can change this configuration to allow all users to access the database without the need to enter credentials. Add the line local all all trust at the beginning of the file /var/lib/pgsql/data/pg_hba.conf then restart PostgreSQL service by running the command systemctl restart postgresql as root user.

NB: If you have other issues, please contact the TA.

References

PostgreSQL cheatsheet
Yasin N. Silva, Isadora Almeida, and Michell Queiroz. 2016. SQL: From Traditional Databases to Big Data. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE '16). Association for Computing Machinery, New York, NY, USA, 413–418. https://doi.org/10.1145/2839509.2844560
Comprehensive PostgreSQL Tutorial
psql 9.2 Documentation
Cypher Manual

Lab 2 - Data Retrieval with SQL and Cypher

Dataset

Agenda

Prerequisites

Objectives

Introduction

Dataset Descripion

Access PostgreSQL server

PostgreSQL Meta-commands

Build the Database

Fix Friends table

Data Retrieval using SQL

Data Retrieval from a single table

Exercises

Data Retrieval from multiple tables (Joining tables)

Exercises

Aggregation and grouping

Exercises

Graph databases

Download and Install Neo4j

Neo4j docker image

Cypher

Property types

Structural types

Composite types

How to define a variable

Operators

DISTINCT (aggregation operator)

Property operators

The operator . statically access the property of a node or relationship

The operator [] used for filtering on a dynamically-computed property key

The operator = used for replacing all properties of a node or relationship

Mathemtical operators

Comparison operators

Boolean operators

String operators

Patterns

Patterns for nodes

Patterns for related nodes

Patterns for labels

Specifying properties

Patterns for relationships

Variable-length pattern matching

Assigning to path variables

Build the database

Data Retrieval with Cypher

Exercises

Common issues

References

The operator `.` statically access the property of a node or relationship

The operator `[]` used for filtering on a dynamically-computed property key

The operator `=` used for replacing all properties of a node or relationship