Course: Big Data - IU
Author: Firas Jolha
Deadline: March 8, 23:59
This assignment provides hands-on experience in working with four distinct data management paradigms used for large-scale analytics: PostgreSQL, Citus Data, ScyllaDB, and MongoDB.
You will ingest a large, real-world dataset (global Foursquare check-ins) into each system, design an appropriate schema, execute a set of analytical queries, and rigorously benchmark their performance. The goal is to understand the trade-offs between data modeling flexibility, query expressiveness, and scalability.
You have to work on a check-in dataset collected from Foursquare, which has been widely used as a key data source for studying location-based services. Since a Foursquare user's check-ins can only be accessed from within the user's social circle, they are not publicly available. However, many Foursquare users choose to also share their check-ins or points of interest (POIs) via the X (Twitter) platform. User social relationships are collected from X: a friendship exists between two users if both users follow each other. The user data is anonymized for privacy purposes.

The dataset includes long-term (about 22 months, from April 2012 to January 2014) global-scale Foursquare check-in data, as well as two snapshots of user social networks before (in Mar. 2012) and after (in May 2014) the check-in data collection period. The check-in dataset includes 90,048,627 check-ins from 2,733,324 users across 11,180,160 venues. There are 363,704 (old) and 607,333 (new) friendships in the social network data.
Important Note: In this assignment, you are required to work only on a specific slice of the users dataframe. To know the slice assigned to you:
run the script select_my_users_slice.py. You must not modify the script. Just run it and enter your SID to get your slice of user IDs.

```python
# select_my_users_slice.py
import pandas as pd
import traceback

MAX_STD = 103
USERS_PERC = 0.25

SID = input(f"Enter your SID [0-{MAX_STD-1}]: ")
try:
    SID = int(SID)
    if SID < 0 or SID > MAX_STD - 1:
        raise ValueError(f"SID should be in the range [0-{MAX_STD-1}].")

    def select_random_data_with_seed(dataframe, num_samples, seed):
        random_selection = dataframe.sample(n=num_samples, random_state=seed)
        return random_selection

    print("Reading the file 'users.txt'...")
    df = pd.read_csv("users.txt")

    print("Selecting the slice of users...")
    users_num_to_select = int(USERS_PERC * len(df))
    selected_df = select_random_data_with_seed(df, users_num_to_select, seed=SID)
    print(f"Selected DataFrame rows with seed {SID}:\n{selected_df}")

    print("Persisting the selected slice of users...")
    file_path = 'my_users.csv'
    selected_df.to_csv(file_path, index=False)
    print(f"The selected slice is successfully saved to {file_path}")
except ValueError:
    print(f"SID = {SID} is not correct, valid SID should be in the range [0-{MAX_STD-1}]!")
except Exception:
    print(f"\033[91m{traceback.format_exc()}\033[0m")
```
For example, when SID=10, the script output will be as follows:

```
ubuntu> python select_my_users_slice.py
Enter your SID [0-102]: 10
Reading the file 'users.txt'...
Selecting the slice of users...
Selected DataFrame rows with seed 10:
          userid
760215    760216
1251679  1251680
1236500  1236501
1454359  1454360
2570231  2570232
...          ...
2202515  2202516
2050250  2050251
369611    369612
2610100  2610101
1974059  1974060

[683331 rows x 1 columns]
Persisting the selected slice of users...
The selected slice is successfully saved to my_users.csv
```
The dataset consists of 5 TSV files without headers. The table below describes each dataset file; use it to determine the column headers of the files.
| File Name | Description | Fields |
|---|---|---|
| checkins_anonymized.txt | Each line corresponds to a user id checked in a venue at time UTCTime with Timezone offset in minutes | user_id, venue_id, utc_time, timezone_offset_mins |
| friendship_before_old.txt | Each line corresponds to a user id at the first column mapped to the friend id at the second column and they have friendship before April 2012 | user_id, friend_id |
| friendship_after_new.txt | Each line corresponds to a user id at the first column mapped to the friend id at the second column and they still have friendship after January 2014 | user_id, friend_id |
| POIs.txt | Each line corresponds to the features of the venue including venue id, POI latitude, POI longitude, venue category name, country code (ISO 3166-1 alpha-2 two-letter country codes) | venue_id, latitude, longitude, category, country |
| users.txt | A file with a header where each line contains a user id, covering all 114,324 users | userid |
In this assignment, you need to prepare your slice of users data with corresponding users’ POIs, checkins and friendships as explained above. Afterwards, you can start solving the exercises in any order you prefer. For documentation, you need to prepare a report. The template of the report is attached as a link to this document.
Important Note: For each task, put all of its queries as text in the table, and include the queries with their corresponding outputs as screenshots below the task in the report. Each screenshot should show the query and the result you got. Screenshots must be full screen; any screenshot that is not full screen, or that is cut off, will not be evaluated. If the result of a single query is too long, you may limit the output to the size of your screen.
Regarding the technical stack and setup of the assignment, we have the following minimum requirements:
If you need/prefer to use any Python drivers, please use the following official drivers:
- psycopg2 for PostgreSQL and Citus Data.
- pymongo for MongoDB.
- scylla-driver/cassandra-driver for ScyllaDB.

You need to submit only the report (Assignment1.pdf) to Moodle; please do not compress or zip it. The Moodle submission link is attached at the beginning of this document. The total points for this assignment are 100 + 10 extra.
The assignment is divided into 4 main parts:
Warning: You have a large dataset here, and you are allowed to use any Python packages/tools for data preparation. However, be careful with tools that operate on a single machine, such as Pandas, since the large amount of data can consume all of your main memory and CPU. We therefore recommend reading the files incrementally in chunks, and also checking packages made for parallel processing, including Dask and Polars dataframes. Apache Spark and distributed dataframes would be the ideal solution for large datasets, but they are not allowed here; you have to use tools that operate on a single machine.
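As a sketch of the chunked approach recommended above, the helpers below (hypothetical names, assuming tab-separated files with the column names from the table above) filter check-ins and friendships down to an assigned slice of users; tiny in-memory data stands in for the real files:

```python
import io
import pandas as pd

# Column names taken from the dataset description table
CHECKIN_COLS = ["user_id", "venue_id", "utc_time", "timezone_offset_mins"]

def filter_checkins_chunked(source, keep_users, chunksize=1_000_000):
    """Stream a check-ins TSV in chunks, keeping only rows whose user_id
    belongs to the assigned slice (a set gives O(1) membership tests)."""
    parts = []
    for chunk in pd.read_csv(source, sep="\t", names=CHECKIN_COLS,
                             chunksize=chunksize):
        parts.append(chunk[chunk["user_id"].isin(keep_users)])
    return pd.concat(parts, ignore_index=True)

def filter_friendships(df, keep_users):
    """Keep only friendship rows where BOTH endpoints are in the slice."""
    mask = df["user_id"].isin(keep_users) & df["friend_id"].isin(keep_users)
    return df[mask].reset_index(drop=True)

# Tiny in-memory stand-ins for the real files, just to show the mechanics
demo_checkins = io.StringIO(
    "1\tv1\tTue Apr 03 18:00:09 +0000 2012\t240\n"
    "2\tv2\tTue Apr 03 18:00:25 +0000 2012\t-120\n"
    "1\tv3\tTue Apr 03 18:02:24 +0000 2012\t240\n"
)
my_users = {1}
my_checkins = filter_checkins_chunked(demo_checkins, my_users, chunksize=2)

friends = pd.DataFrame({"user_id": [1, 1, 3], "friend_id": [2, 4, 1]})
my_friends = filter_friendships(friends, {1, 2, 3})
```

For the real files, pass the file path instead of a StringIO; the same pattern works with Polars' `scan_csv` or Dask if Pandas chunks are still too slow.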
Given the whole 5 dataset files from the attached link, you need to prepare your custom data files. Specifically, you need to do as follows:
1. Select your slice of users from users.txt using the given Python script; it will be stored in my_users.csv. Add to the report: screenshots of the output of the script with the command to run the script.
2. Extract the check-ins of your users and store them in my_checkins_anonymized.tsv. Add to the report: screenshots of the code with its output.
3. Extract the friendships_before of your users and store them in my_frienship_before.tsv. Note that both friendid and userid should belong only to your slice of users. Add to the report: screenshots of the code with its output.
4. Extract the friendships_after of your users and store them in my_frienship_after.tsv. Note that both friendid and userid should belong only to your slice of users. Add to the report: screenshots of the code with its output.
5. Extract the POIs of your users and store them in my_POIs.tsv. Add to the report: screenshots of the code with its output.
6. Move users.csv and all other downloaded files that do not belong to you to the Recycle Bin (better to remove them completely).

A. Design a Database Schema: [17 points]
- Create a database named foursquaredb and create all required objects with necessary constraints. Add the database instructions to the report with screenshots of the schema and the created database objects.
- Add the database instructions to the report with screenshots of the schema and the created database objects. Add your justification to the report too.
- Create the tables in foursquaredb. Add to the report: the logical model, the physical model of the tables, the database instructions for creating the database objects, and screenshots of the created database objects.
- Create the collections in foursquaredb. Add to the report: screenshots of the document structures for each collection, the database instructions for creating the database objects, and your justification.
B. Ingest the Data: [14 points]
Write an ingestion script for each database system (e.g. ingest_mongodb.py, ingest_scylladb.py, etc.) to read the prepared dataset and insert it into that system. Crucially, measure and report the total ingestion time for each database system. Add to the report: screenshots of the script with its output for each system. Add to the report a table with the following structure.
| Database | Total Ingestion time | Database Setup | CPU cores allocated | Main Memory usage |
|---|---|---|---|---|
| PostgreSQL | ?? | A single server | ?? | ?? GiB |
| Citus Data | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| ScyllaDB | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| MongoDB | ?? | A replica set of ?? nodes | ?? | ?? GiB |
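The ingestion-time measurement can be sketched as follows; timed_ingest is a hypothetical helper, and the driver-specific insert call (e.g. psycopg2's executemany, pymongo's insert_many, or a prepared-statement loop for ScyllaDB) is abstracted behind a callback, with a plain list standing in for a real database here:

```python
import time

def timed_ingest(chunks, insert_chunk):
    """Feed prepared data chunks to a driver-specific insert function and
    return the total wall-clock ingestion time in seconds."""
    start = time.perf_counter()
    total_rows = 0
    for chunk in chunks:
        insert_chunk(chunk)  # e.g. cursor.executemany / collection.insert_many
        total_rows += len(chunk)
    elapsed = time.perf_counter() - start
    print(f"Ingested {total_rows} rows in {elapsed:.2f} s")
    return elapsed

# Demo with an in-memory sink standing in for a real database driver
sink = []
elapsed = timed_ingest([[1, 2], [3]], sink.extend)
```

Bulk-loading paths (PostgreSQL COPY, mongoimport, etc.) are usually much faster than row-by-row inserts; whichever you choose, time the whole load with the same wrapper so the numbers in the table are comparable.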
Execute the following analytical queries on each of the four systems. For each query, your Python script should capture and report the execution time. Run each query at least three times and report the average.
You may get an empty output for some of the queries. Such a case is fine, since you are working on a slice of the dataset. The main idea in this part is to learn how to analyze the data, and an empty output for a valid query can itself be an interesting analysis result, unless you are doing something wrong.
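A minimal repeat-and-average timing harness for the queries might look like this (names are illustrative; run_query would wrap the driver call for one system):

```python
import statistics
import time

def average_query_time(run_query, repeats=3):
    """Run a query callable `repeats` times and return the mean wall-clock
    time in seconds, as required for the benchmark tables."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return statistics.mean(times)

# Demo with a trivial stand-in for a database query
avg = average_query_time(lambda: sum(range(10_000)), repeats=3)
```

Consider also reporting the individual run times, since the first run often includes cold-cache effects that the later runs do not.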
Q1. [8 points] Find the top 10 countries with the highest total number of check-ins.
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
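Before writing the per-system queries for Q1, it can help to prototype the aggregation on a tiny sample in pandas; the frames below are made-up data, and the real query would join check-ins to POIs on venue_id, group by country, and take the top 10:

```python
import pandas as pd

# Made-up sample data; real columns come from the dataset description table
checkins = pd.DataFrame({"user_id": [1, 1, 2, 3],
                         "venue_id": ["v1", "v2", "v1", "v3"]})
pois = pd.DataFrame({"venue_id": ["v1", "v2", "v3"],
                     "country": ["US", "JP", "US"]})

# Join each check-in to its venue's country, count per country, take top 10
top_countries = (checkins.merge(pois, on="venue_id")
                 .groupby("country").size()
                 .sort_values(ascending=False)
                 .head(10))
```

The SQL, CQL, and aggregation-pipeline versions all follow the same join/group/sort/limit shape, adapted to each system's data model.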
Q2. [12 points] We want to locate the users who may prefer to check the point of interest shared by their friends. We need to display the list of POIs for these users on their home page. Explain how we can find these users and write and execute the appropriate queries for retrieving their corresponding POIs. We consider here only friendship relationships that did not change between the periods of two snapshots.
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
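One reading of "friendships that did not change" in Q2 is pairs present in both snapshots; a small pandas sketch of that intersection, on made-up data, is:

```python
import pandas as pd

# Made-up snapshots of the friendship data
before = pd.DataFrame({"user_id": [1, 2, 1], "friend_id": [2, 1, 3]})
after = pd.DataFrame({"user_id": [1, 2], "friend_id": [2, 1]})

# Friendships present in both snapshots, i.e. unchanged between the periods
stable = before.merge(after, on=["user_id", "friend_id"])
```

From the stable pairs, the real query would then look up the friends' check-ins/POIs to build the list to display, which is where the systems' join capabilities start to differ.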
Q3. [16 points] Venues can be shared as POIs by different users. Here, we want to find out which venues are more attractive than others, and in which countries, regions, or locations they are. Attractive venues are the ones shared by the most users (here we also count a user who shares the same venue multiple times at different timestamps). You need to write and implement the appropriate queries for retrieving the info of the attractive venues by country and location.
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
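A pandas prototype of the Q3 "attractive venues" aggregation on made-up data (count every check-in of a venue, including repeats by the same user, then attach the venue's location info):

```python
import pandas as pd

# Made-up sample data
checkins = pd.DataFrame({"user_id": [1, 1, 2],
                         "venue_id": ["v1", "v1", "v2"]})
pois = pd.DataFrame({"venue_id": ["v1", "v2"],
                     "latitude": [40.7, 35.6], "longitude": [-74.0, 139.7],
                     "country": ["US", "JP"]})

# Count shares per venue (repeats included), then attach location info
attractive = (checkins.groupby("venue_id").size().rename("share_count")
              .reset_index()
              .merge(pois, on="venue_id")
              .sort_values("share_count", ascending=False))
```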
Q4. [12 points] Here we are interested in attaching categories to venues. The category field in the POIs.txt file holds values related to subcategories, but we want more general custom categories. Specifically, we want to categorize the venues into ('Restaurant', 'Club', 'Museum', 'Shop', and 'Others') using the full text search techniques provided by the database. You need to write and implement the appropriate queries for retrieving the number of venues given a custom category (e.g. 'Club').
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
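Database-side full text search is required by the Q4 task, but the keyword-to-category mapping you encode there (e.g. in a PostgreSQL to_tsquery or a MongoDB text index query) can be prototyped in plain Python first; the keyword lists below are illustrative assumptions, not a prescribed taxonomy:

```python
# Illustrative keyword lists; extend them based on the subcategories
# you actually observe in your slice of POIs
CATEGORY_KEYWORDS = {
    "Restaurant": ["restaurant", "diner", "steakhouse", "pizzeria"],
    "Club": ["club", "nightclub"],
    "Museum": ["museum", "gallery"],
    "Shop": ["shop", "store", "boutique"],
}

def custom_category(venue_category):
    """Map a fine-grained venue category string to a custom category,
    falling back to 'Others' when no keyword matches."""
    text = venue_category.lower()
    for label, keywords in CATEGORY_KEYWORDS.items():
        if any(k in text for k in keywords):
            return label
    return "Others"
```

Once the mapping behaves as expected on sample category strings, translate each keyword list into the corresponding full-text predicate of each database.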
A. Create a Summary Table: [10 points] Create a table with rows for each database (MongoDB, PostgreSQL, Citus, ScyllaDB) and columns for each task (Q1, Q2, Q3, Q4). Populate it with the average times you measured.
Add to the report: screenshots of the script with its output for each system. Add to the report a table with the following structure.
| Database | Q1 | Q2 | Q3 | Q4 | Database Setup | CPU cores allocated | Main Memory usage |
|---|---|---|---|---|---|---|---|
| PostgreSQL | ?? | ?? | ?? | ?? | A single server | ?? | ?? GiB |
| Citus Data | ?? | ?? | ?? | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| ScyllaDB | ?? | ?? | ?? | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| MongoDB | ?? | ?? | ?? | ?? | A replica set of ?? nodes | ?? | ?? GiB |
B. Create a Visualization: [6 points] Generate a bar chart or a series of bar charts comparing the performance of the four databases across the four analytical queries.
Add to the report all figures of the charts with required annotations.
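A grouped bar chart of the summary table can be generated with matplotlib; all numbers below are placeholders to be replaced with your measured averages:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

databases = ["PostgreSQL", "Citus Data", "ScyllaDB", "MongoDB"]
queries = ["Q1", "Q2", "Q3", "Q4"]
# Placeholder average times in seconds -- substitute your benchmark results
times = np.array([
    [0.8, 1.2, 2.5, 0.9],   # PostgreSQL
    [0.5, 0.9, 1.6, 0.7],   # Citus Data
    [0.4, 1.5, 2.1, 1.1],   # ScyllaDB
    [0.6, 1.1, 1.9, 0.8],   # MongoDB
])

x = np.arange(len(queries))
width = 0.2
fig, ax = plt.subplots(figsize=(8, 4))
for i, db in enumerate(databases):
    ax.bar(x + i * width, times[i], width, label=db)
ax.set_xticks(x + 1.5 * width)
ax.set_xticklabels(queries)
ax.set_ylabel("Average execution time (s)")
ax.set_title("Query performance by database")
ax.legend()
fig.savefig("benchmark_comparison.png", dpi=150)
```

A log-scale y-axis (`ax.set_yscale("log")`) can help if the systems differ by orders of magnitude.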
C. Analyze: [10 points] Based on your results, write a concise analysis (1-2 paragraphs) for each query, explaining why a particular database performed well or poorly. Reference your data models, indexing strategies, and the fundamental architectural differences of each system.
Add to the report your analysis attached with screenshots when needed.
It is not allowed to use AI tools here to generate or analyze your results. You must write the analysis text in your own words based on your understanding. The originality of your text will be checked.