Lab - Graph databases (Memgraph/Neo4j)

Course: Big Data - IU S25
Author: Firas Jolha

Dataset

Users and Posts

Agenda

Graph databases

A graph database stores nodes and relationships instead of tables, or documents. Data is stored just like you might sketch ideas on a whiteboard. Your data is stored without restricting it to a pre-defined model, allowing a very flexible way of thinking about and using it.

Graph databases address big challenges many of us tackle daily. Modern data problems often involve many-to-many relationships with heterogeneous data that sets up needs to:

Navigate deep hierarchies,
Find hidden connections between distant items, and
Discover inter-relationships between items.

Whether it’s a social network, payment networks, or road network you’ll find that everything is an interconnected graph of relationships. And when we want to ask questions about the real world, many questions are about the relationships rather than about the individual data elements.

Example of a simple graph database

Neo4j is an open-source, NoSQL, native graph database that provides an ACID-compliant transactional backend for your applications that has been publicly available since 2007. In Neo4j, information is organized as nodes, relationships, and properties. We can retrieve data from Neo4j via its query language called Cypher which is a declarative query language similar to SQL, but optimized for graphs.

Building blocks of the property graph model

The property graph model consists of mainly nodes and relationships. Nodes are the entities in the graph.

Nodes can be tagged with labels, representing their different roles in your domain. (For example, Person).
Nodes can hold any number of key-value pairs, or properties. (For example, name).
Node labels may also attach metadata (such as index or constraint information) to certain nodes.

Whereas relationships provide directed, named, connections between two node entities (e.g. Person LOVES Person).

Relationships always have a direction, a type, a start node, and an end node, and they can have properties, just like nodes.
Nodes can have any number or type of relationships without sacrificing performance.
Although relationships are always directed, they can be navigated efficiently in any direction.

Download and Install Memgraph

You can download the Docker image for Memgraph and run a container. You can find more detail in this official page. Memgraph comes with different Docker images. The main repositories that contain memgraph are:

memgraph/memgraph-mage - includes Memgraph database, command-line interface mgconsole and MAGE graph algorithms library. If tagged with cuGraph, it also includes NVIDIA cuGraph GPU-powered graph algorithms.
memgraph/memgraph - includes Memgraph database and command-line interface mgconsole.

There are also two additional standalone images that do not include the Memgraph:

memgraph/lab - includes a web interface Memgraph Lab that helps you explore the data stored in Memgraph.
memgraph/mgconsole - includes a command-line interface mgconsole that allows you to interact with Memgraph from the command line.

For this lab, you can just need two containers. One is the graph engine and another the web user interface. You can run a multi-container memgraph as follows:

services:
  memgraph:
    image: memgraph/memgraph-mage:latest
    container_name: memgraph-mage
    ports:
      - "7687:7687"
      - "7444:7444"
    command: ["--log-level=TRACE"]

  lab:
    image: memgraph/lab:latest
    container_name: memgraph-lab
    ports:
      - "3001:3000"
    depends_on:
      - memgraph
    environment:
      - QUICK_CONNECT_MG_HOST=memgraph
      - QUICK_CONNECT_MG_PORT=7687

You can just put this content in a file docker-compose.yaml and then run it using docker compose up -d. Then you can open the Memgraph lab (http://localhost:3001) in the browser to access the web UI as shown below.

You can also access the console of the graph if you selected to not use the Memgraph lab web UI as follows:

docker exec -it memgraph-mage mgconsole

This will open the console for Memgraph where you can write your queries in Cypher language.

Download and Install Neo4j

Using Docker

The easy approach is to install Neo4j server using Docker. You can do that by simply running one container which will allow you to start the database engine and a web UI.

docker run --name neo4j_server -d --publish=7474:7474 --publish=7687:7687 --volume=$HOME/neo4j/data:/data neo4j

Make sure that all ports (7474 and 7687) are free. For instance, if you are running both Neo4j and Memgraph then the port 7687 is used by both so you either change this port number for one of them or stop one of them to start the other.

Without Docker

You also can install Neo4j Desktop on your local machine. You can find here (https://neo4j.com/docs/desktop-manual/current/installation/download-installation/) some information about installing Neo4j Desktop and you can find here (https://neo4j.com/download-center/) download links.

If you are downloading Neo4j Desktop from Russia, select other from country list and use some proxy server.

After you start downloading, the website will give a long key you need to use when you install the software as follows:

The software will prepare the environment for installation as follows.

Installing Neo4j Desktop

When you succeed installation, it will show the following dashboard.

You can access the Neo4j web UI usually at http://localhost:7474 in case you did not change the default ports.

If you can not install Neo4j due to some reason, you can open a free session of Neo4j Sandbox at https://neo4j.com/sandbox/ but you need proxy.

Cypher

Cypher is Neo4j’s graph database query language that allows users to store and retrieve data from the graph database. It is a declarative, SQL-inspired language for describing visual patterns in graphs. The syntax provides a visual and logical way to match patterns of nodes and relationships in the graph. Cypher keywords are not case-sensitive but it is case-sensitive for variables.

All Neo4j servers contain a built-in database called system, which behaves differently than all other databases. The system database stores system data and you can not perform graph queries against it. A fresh installation of Neo4j includes two databases:

system - the system database described above, containing meta-data on the DBMS and security configuration.
neo4j - the default database, named using the config option dbms.default_database=neo4j.

Cypher provides first class support for a number of data types. These fall into the following three categories: property, structural, and composite.

Memgraph also provides with a default database in community edition named memgraph. You can use the same Cypher language to write and run queries in Memgraph. The difference is in some commands (e.g. loading data) and the functions supported by the engine. Neo4j provides APOC library that allows to access a huge set of functions useful for different purposes (data integration, graph algorithms, and data conversion.). Memgraph also provides functions and you can list all available functions by running CALL mg.functions() YIELD *;. Memgraph has a library called MAGE graph algorithm library that allows to access a set of functions for graph algorithms. Neo4j has a similar library called Graph Data Science (GDS).

Property types

The following data types are included in the property types category: Integer, Float, String, Boolean, Point, Date, Time, LocalTime, DateTime, LocalDateTime, and Duration. For more details, visit the documentation.

Structural types

The following data types are included in the structural types category: Node, Relationship, and Path.

The Node data type includes: Id, Label(s), and Map (of properties). Note that labels are not values, but a form of pattern syntax.
The Relationship data type includes: Id, Type, Map (of properties), Id of start node, and Id of end node.
The Path data type is an alternating sequence of nodes and relationships.

Nodes, relationships, and paths are returned as a result of pattern matching. In Neo4j, all relationships have a direction. However, you can have the notion of undirected relationships at query time.

Composite types

The following data types are included in the composite types category: List and Map.

RETURN [1,2,3,4,5], {id:1, msg: 'hello', age: 31}

How to define a variable

When you reference parts of a pattern or a query, you do so by naming them. The names you give the different parts are called variables.
In this example:

MATCH (n)-->(b)
RETURN b

The variables are n and b.

// this is a comment

A comment begin with double slash (//) and continue to the end of the line. Comments do not execute, they are for humans to read.

Operators

DISTINCT (aggregation operator)

WITH ['a', 'a', 4, 'b', 4] AS ps
UNWIND ps AS p
RETURN DISTINCT p

The output will have only ‘a’, 4, and ‘b’.

Property operators

The operator `.` statically access the property of a node or relationship

The operator `[]` used for filtering on a dynamically-computed property key

CREATE
  (a:Restaurant {name: 'Hungry Jo', rating_hygiene: 10, rating_food: 7}),
  (b:Restaurant {name: 'Buttercup Tea Rooms', rating_hygiene: 5, rating_food: 6}),
  (c1:Category {name: 'hygiene'}),
  (c2:Category {name: 'food'})
WITH a, b, c1, c2
MATCH (restaurant:Restaurant), (category:Category)
WHERE restaurant["rating_" + category.name] > 6
RETURN DISTINCT restaurant.name

The operator `=` used for replacing all properties of a node or relationship

CREATE (a:Person {name: 'Jane', age: 20})
WITH a
MATCH (p:Person {name: 'Jane'})
SET p = {name: 'Ellen', livesIn: 'London'}
RETURN p.name, p.age, p.livesIn

Mathemtical operators

The mathematical operators comprise:

addition: +
subtraction or unary minus: -
multiplication: *
division: /
modulo division: %
exponentiation: ^

Comparison operators

The comparison operators comprise:

equality: =
inequality: <>
less than: <
greater than: >
less than or equal to: <=
greater than or equal to: >=
IS NULL
IS NOT NULL

String-specific comparison operators comprise:

STARTS WITH: perform case-sensitive prefix searching on strings
ENDS WITH: perform case-sensitive suffix searching on strings
CONTAINS: perform case-sensitive inclusion searching in strings
=~: regular expression for matching a pattern

WITH ['mouse', 'chair', 'door', 'house'] AS wordlist
UNWIND wordlist AS word
WITH word
WHERE word =~ '.*ous.*'
RETURN word

Boolean operators

The boolean operators — also known as logical operators — comprise: AND, OR, XOR, NOT.

String operators

The string operators comprise: concatenating strings: +

RETURN 'neo' + '4j' AS result

Check other operators from the documentation.

Patterns

Patterns and pattern-matching are at the very heart of Cypher, so being effective with Cypher requires a good understanding of patterns.
Using patterns, you describe the shape of the data you are looking for. For example, in the MATCH clause you describe the shape with a pattern, and Cypher will figure out how to get that data for you.

The pattern describes the data using a form that is very similar to how one typically draws the shape of property graph data on a whiteboard: usually as circles (representing nodes) and arrows between them to represent relationships.

Patterns appear in multiple places in Cypher: in MATCH, CREATE and MERGE clauses, and in pattern expressions.

Patterns for nodes

This simple pattern describes a single node, and names that node using the variable a.

(a)

This pattern describes a very simple data shape: two nodes, and a single relationship from one to the other.

(a)-->(b)

This manner of describing nodes and relationships can be extended to cover an arbitrary number of nodes and the relationships between them, for example:

(a)-->(b)<--(c)

Such a series of connected nodes and relationships is called a “path”.

Patterns for labels

The most simple attribute that can be described in the pattern is a label that the node must have. For example:

(a:User)-->(b)

One can also describe a node that has multiple labels:

(a:User:Admin)-->(b)

Specifying properties

Properties can be expressed in patterns using a map-construct: curly brackets surrounding a number of key-expression pairs, separated by commas. E.g. a node with two properties on it would look like:

(a {name: 'Andy', sport: 'Brazilian Ju-Jitsu'})

A relationship with expectations on it is given by:

(a)-[{blocked: false}]->(b)

Patterns for relationships

The simplest way to describe a relationship is by using the arrow between two nodes, as in the previous examples. Using this technique, you can describe that the relationship should exist and the directionality of it. If you don’t care about the direction of the relationship, the arrow head can be omitted, as exemplified by:

(a)--(b)

(a)-[r]->(b)

(a)-[r:REL_TYPE]->(b)
    
(a)-[r:TYPE1|TYPE2]->(b)

(a)-[:REL_TYPE]->(b)

Variable-length pattern matching

Rather than describing a long path using a sequence of many node and relationship descriptions in a pattern, many relationships (and the intermediate nodes) can be described by specifying a length in the relationship description of a pattern. For example:

(a)-[*2]->(b)

This describes a graph of three nodes and two relationships, all in one path (a path of length 2). This is equivalent to:

(a)-->()-->(b)

A range of lengths can also be specified: such relationship patterns are called ‘variable length relationships’. For example:

(a)-[*3..5]->(b)

This is a minimum length of 3, and a maximum of 5. It describes a graph of either 4 nodes and 3 relationships, 5 nodes and 4 relationships or 6 nodes and 5 relationships, all connected together in a single path.

Either bound can be omitted. For example, to describe paths of length 3 or more, use:

(a)-[*3..]->(b)

To describe paths of length 5 or less, use:

(a)-[*..5]->(b)

Omitting both bounds is equivalent to specifying a minimum of 1, allowing paths of any positive length to be described:

(a)-[*]->(b)

Assigning to path variables

Cypher allows paths to be named using an identifer, as exemplified by:

p = (a)-[*3..5]->(b)

Building the database and importing dataset files

Using Neo4j Desktop

In Neo4j Desktop, create a new project as follows:

and add a local DBMs to the new project. Then, start the database as follows:

As you can see in the following screenshot, the database is active and we can access it as a default user neo4j by opening a dashboard with Neo4j Browser.

In order to load csv files to the database, you need to add them to the import of the database (do not add them to the project directory) as follows:

After copying/pasting the csv files, the import directory will be as follows:

Using Neo4j installed via Docker

This image is related to the community edition of Neo4j and you cannot create a new database but you can load files as follows. You need to find the folder import inside the Neo4j server container where you have to put your files. By default, the path in the container is /var/lib/neo4j/import. You can copy your files there using docker cp.

Using Memgraph installed via Docker

Here you also cannot create more than the default database memgraph if you do not have the enterprise edition. You can import the dataset files into the database using docker cp and there is no specific folder to put the files in. So, I will assume that you will put the files in the directory /var/lib/memgraph/ or maybe you can access the container using bash and create a directory/repository for importing data like /var/lib/memgraph/data. Let’s follow the latter approach.

Access the container via bash.

docker exec  -it memgraph-mage bash

Create the folder

mkdir -p /var/lib/memgraph/data


# You can also run one line for the steps above
# docker exec -it memgraph-mage bash -c "mkdir -p /var/lib/memgraph/data"

Import the data using docker cp.

docker cp <path-to-your-file-n-local-machine> memgraph-mage:/var/lib/memgraph/data/

Data loading and creating nodes/relationships

Each of the 3 CSV files needs to be loaded into Neo4j using the LOAD CSV command that reads each row of the file and then assigns the imported values to the nodes and edges of the graph.

The instructions for Neo4j here are a little different than Memgraph but for the queries in the next sections it is the same.


// Memgraph
// Create Index for users on userid field
CREATE INDEX ON :User(userid);

// Neo4j
// Create Index for users on userid field
CREATE INDEX FOR (u:User) ON (u.userid)


// Memgraph
// Create Index for posts on postid field
CREATE INDEX ON :Post(postid);

// Neo4j
// Create Index for posts on postid field
CREATE INDEX FOR (u:Post) ON (u.postid)


// Memgraph
// Load users data
LOAD CSV FROM "/var/lib/memgraph/data/users.csv"
WITH HEADER DELIMITER "," AS row&nbsp;
WITH row, toInteger(row.timestamp) AS epoch_seconds&nbsp;
CALL date.format(epoch_seconds, "s", "%Y-%m-%dT%H:%M:%S") YIELD formatted
CREATE (:User {
userid: toInteger(row.userid),
name: row.name,
surname: row.surname,
age: toInteger(row.age),
timestamp: localDateTime(formatted)
});


// Neo4j
// Load users data
LOAD CSV WITH HEADERS 
FROM "file:///users.csv" AS row 
FIELDTERMINATOR ','
CREATE (
    :User {
        userid: toInteger(row['userid']), 
        name:row['name'], 
        surname: row['surname'], 
        age: toInteger(row['age']), 
        timestamp: datetime({epochSeconds: toInteger(row['timestamp'])})}
);


    


// Memgraph
// Load posts data
LOAD CSV FROM "/var/lib/memgraph/data/posts.csv" 
WITH HEADER DELIMITER "," AS row 
CALL date.format(toInteger(row['posttimestamp']),"s", "%Y-%m-%dT%H:%M:%S") YIELD formatted 
MATCH (u:User{userid:toInteger(row['userid'])})
MERGE (
    :Post {
        postid: toInteger(row['postid']),  
        userid: toInteger(row['userid']), 
        posttype:row['posttype']
        })<-[:ADD{
    posttimestamp: localDateTime(formatted)
    }]-(u);

    
// Neo4j
// Load posts data
LOAD CSV WITH HEADERS 
FROM "file:///posts.csv" AS row 
FIELDTERMINATOR ','
MATCH (u:User{userid:toInteger(row['userid'])})
MERGE (
    :Post {
        postid: toInteger(row['postid']),  
        userid: toInteger(row['userid']), 
        posttype:row['posttype']
        })<-[:ADD{
    posttimestamp: datetime({epochSeconds: toInteger(row['posttimestamp'])})
    }]-(u);




// Same for Neo4j/Memgraph
// Retrieve some nodes
MATCH (n)
RETURN n
LIMIT 10;


// Memgraph
// Load Friends
USING PERIODIC COMMIT 1000 // loads 1000 rows per batch
LOAD CSV FROM "/var/lib/memgraph/data/friends.csv"
WITH HEADER DELIMITER "," AS row 
MATCH (u:User{userid:toInteger(row['friend1'])})
MATCH (v:User{userid:toInteger(row['friend2'])})
MERGE (u)-[:FRIEND]->(v)-[:FRIEND]->(u)

// Neo4j
// Load Friends
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS 
FROM "file:///friends.csv" AS row 
FIELDTERMINATOR ','
MATCH (u:User{userid:toInteger(row['friend1'])})
MATCH (v:User{userid:toInteger(row['friend2'])})
MERGE (u)-[:FRIEND]->(v)-[:FRIEND]->(u)

Data Retrieval with Cypher

All Cypher queries here can run on both engines Neo4j and Memgraph.

Cypher is Neo4j’s graph query language that lets you retrieve data from the graph. It is like SQL for graphs, and was inspired by SQL so it lets you focus on what data you want out of the graph (not how to go get it). It is the easiest graph language to learn by far because of its similarity to other languages and intiutiveness.

Exercises

Who are the users who have posts?

// MATCH (u:User)-[:ADD]-()
MATCH (u:User)
RETURN u
LIMIT 10 // This is added just to not display all nodes

2. How many posts of the first 10 users?

MATCH (u:User)
WITH u LIMIT 10
MATCH path = (u)-[:ADD]->(p:Post)
WITH u, COUNT(p) AS posts
ORDER BY posts DESCENDING // 117
RETURN u, posts

This will return the number of posts added by the first 10 users sorted in descending order of the their number of posts. The output is displayed below.

3. Which posts published after 2012?

MATCH (p:Post)<-[a:ADD]-()
WHERE a.posttimestamp.year > 2012
RETURN p
LIMIT 10 // This is added just to not display all nodes

4. What is the oldest and recent dates of the posts?

MATCH (p:Post)-[a:ADD]-()
RETURN MAX(a.posttimestamp), MIN(a.posttimestamp)

5. What are the top 5 most recent posts whose type is Image?

MATCH (p:Post{posttype:'Image'})-[a:ADD]-()
RETURN p, a
ORDER BY a.posttimestamp DESC
LIMIT 5

6. Which top 5 users who posted an Image or Video recently?

MATCH (u:User)-[a:ADD]->(p:Post)
WHERE p.posttype IN ['Image', 'Video']
RETURN u, p
ORDER BY a.posttimestamp
LIMIT 5

7. What is the age of the eldest and youngest persons?

MATCH (u:User)
RETURN MAX(u.age), MIN(u.age)

// Update the age of the person
MATCH (u:User)
SET u.age = CASE
           WHEN u.age < 0 THEN -u.age
           ELSE u.age
          END

Who is the 2nd youngest person who published Image posts?

MATCH (u:User)-[:ADD]->(p:Post{posttype:'Image'})
RETURN DISTINCT u.name, u.surname, u.age, p.posttype
ORDER BY u.age ASC
SKIP 1
LIMIT 1

9. What are the posts that are published by people who are between 20 and 30?

MATCH (p:Post)<-[:ADD]-(u:User)
WHERE u.age IN range(20, 30)
RETURN p.postid, p.posttype, u.age, u.userid
ORDER BY u.age

10. What are the friends of the youngest person?

CALL{
    MATCH (u:User)
    RETURN u.userid AS youngest
    ORDER BY u.age ASC
    LIMIT 1
}
WITH youngest
MATCH (u:User{userid:youngest})-[:FRIEND]-(f1:User)
RETURN DISTINCT u.userid AS user, f1.userid AS friend

11. How many friends for the surname Thronton?

MATCH (u:User{surname:'Thronton'})-[:FRIEND]-(f1:User)
RETURN COUNT(DISTINCT f1)

12. How many posts for each user?

MATCH (u:User)--(p:Post)
RETURN u.userid AS user, COUNT(*) AS c
ORDER BY c DESC

13. Who are the friends of users who posted Image?

MATCH (f:User)--(u:User)--(:Post{posttype:'Image'})
RETURN DISTINCT u.userid, u.name, u.surname, f.userid, f.name, f.surname, f.age

14. How many friends of users whose age is less than 20?

MATCH (u:User)--(f1:User)
WHERE u.age < 20
RETURN COUNT(DISTINCT f1.userid) AS c

15. How many friends of friends of users whose age is less than 20?

MATCH (u:User)-[:FRIEND*2]-(f2:User)
WHERE u.age < 20 AND f2.userid <> u.userid
RETURN COUNT(DISTINCT f2.userid) AS c

16. How many users have more than 10 posts?
17. What is the average number of posts published at noon and by the friends of the users whose age is more than 40? Calculate the percentage of posts.
18. What is the percentage of posts published at noon and by the friends of the users whose age is more than 40?