Course: Big Data - IU
Author: Firas Jolha
Deadline: March 8, 23:59
This assignment provides hands-on experience in working with four distinct data management paradigms used for large-scale analytics: PostgreSQL, Citus Data, ScyllaDB, and MongoDB.
You will ingest a large, real-world dataset (global Foursquare check-ins) into each system, design an appropriate schema, execute a set of analytical queries, and rigorously benchmark their performance. The goal is to understand the trade-offs between data modeling flexibility, query expressiveness, and scalability.
You have to work on a check-in dataset collected from Foursquare, which has been widely used as a key data source for studying location-based services. Since a Foursquare user's check-ins can only be accessed from within the user's social circle, they are not publicly available. However, many Foursquare users choose to also share their check-ins or points of interest (POIs) via the X (Twitter) platform. User social relationships are collected from X: a friendship exists between two users if both users follow each other. The user data is anonymized for privacy purposes.

The dataset includes long-term (about 22 months, from April 2012 to January 2014) global-scale Foursquare check-in data, as well as two snapshots of user social networks before (in Mar. 2012) and after (in May 2014) the check-in data collection period. The check-in dataset includes 90,048,627 check-ins from 2,733,324 users across 11,180,160 venues. There are 363,704 (old) and 607,333 (new) friendships in the social network data.
Important Note: In this assignment, you are required to work only on a specific slice of the users dataframe. To know the slice assigned to you:
run the script select_my_users_slice.py. You must not modify the script. Just run it and enter your SID to get your slice of user IDs.

```python
# select_my_users_slice.py
import pandas as pd
import traceback

MAX_STD = 103
USERS_PERC = 0.25

SID = input(f"Enter your SID [0-{MAX_STD-1}]: ")
try:
    SID = int(SID)
    if SID < 0 or SID > MAX_STD - 1:
        raise ValueError(f"SID should be in the range [0-{MAX_STD-1}].")

    def select_random_data_with_seed(dataframe, num_samples, seed):
        random_selection = dataframe.sample(n=num_samples, random_state=seed)
        return random_selection

    print("Reading the file 'users.txt'...")
    df = pd.read_csv("users.txt")

    print("Selecting the slice of users...")
    users_num_to_select = int(USERS_PERC * len(df))
    selected_df = select_random_data_with_seed(df, users_num_to_select, seed=SID)
    print(f"Selected DataFrame rows with seed {SID}:\n{selected_df}")

    print("Persisting the selected slice of users...")
    file_path = 'my_users.csv'
    selected_df.to_csv(file_path, index=False)
    print(f"The selected slice is successfully saved to {file_path}")
except ValueError:
    print(f"SID = {SID} is not correct, valid SID should be in the range [0-{MAX_STD-1}]!")
except Exception:
    print(f"\033[91m{traceback.format_exc()}\033[0m")
```
For example, when SID=10, the script output will be as follows:

```
ubuntu> python select_my_users_slice.py
Enter your SID [0-102]: 10
Reading the file 'users.txt'...
Selecting the slice of users...
Selected DataFrame rows with seed 10:
          userid
760215    760216
1251679  1251680
1236500  1236501
1454359  1454360
2570231  2570232
...          ...
2202515  2202516
2050250  2050251
369611    369612
2610100  2610101
1974059  1974060

[683331 rows x 1 columns]
Persisting the selected slice of users...
The selected slice is successfully saved to my_users.csv
```
The dataset consists of 5 TSV files without headers. The table below describes each dataset file; use it to determine the column headers of the files.
| File Name | Description | Fields |
|---|---|---|
| checkins_anonymized.txt | Each line corresponds to a user id checked in a venue at time UTCTime with Timezone offset in minutes | user_id, venue_id, utc_time, timezone_offset_mins |
| friendship_before_old.txt | Each line corresponds to a user id at the first column mapped to the friend id at the second column and they have friendship before April 2012 | user_id, friend_id |
| friendship_after_new.txt | Each line corresponds to a user id at the first column mapped to the friend id at the second column and they still have friendship after January 2014 | user_id, friend_id |
| POIs.txt | Each line corresponds to the features of the venue including venue id, POI latitude, POI longitude, venue category name, country code (ISO 3166-1 alpha-2 two-letter country codes) | venue_id, latitude, longitude, category, country |
| users.txt | A file with a header where each line contains a user id, covering all 114,324 users | userid |
In this assignment, you need to prepare your slice of users data with corresponding users’ POIs, checkins and friendships as explained above. Afterwards, you can start solving the exercises in any order you prefer. For documentation, you need to prepare a report. The template of the report is attached as a link to this document.
Important Note: For each task, put all of its queries as text in the table, and include the queries with their corresponding outputs as screenshots below the task in the report. Each screenshot should show the query and the result you got. Screenshots must be full screen; any screenshot that is not full screen, or that is cut off, will not be evaluated. If the result of a single query is too long, you may limit the output to the size of your screen.
Regarding the technical stack and setup of the assignment, we have the following minimum requirements:
If you need/prefer to use any Python drivers, please use the following official drivers:
- psycopg2 for PostgreSQL and Citus Data.
- pymongo for MongoDB.
- scylla-driver/cassandra-driver for ScyllaDB.

You need to submit only the report (Assignment1.pdf) to Moodle; please do not compress or zip it. The Moodle submission link is attached at the beginning of this document. The total points for this assignment are 100 + 10 extra.
The assignment is divided into 4 main parts:
Warning: You have a large dataset here, and you are allowed to use any Python packages/tools for data preparation. However, be careful with tools that operate on a single machine, such as Pandas, since the large amount of data can consume all of your main memory and CPU. We therefore recommend reading the files incrementally in chunks, and also checking packages made for parallel processing, including Dask and Polars dataframes. Apache Spark and distributed dataframes would be the ideal solution for large datasets, but they are not allowed here; you have to use tools that operate on a single machine.
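As a sketch of the chunked approach recommended above, the helpers below (hypothetical names, assuming tab-separated files with the column names from the table above) filter check-ins and friendships down to an assigned slice of users; tiny in-memory data stands in for the real files:

```python
import io
import pandas as pd

# Column names taken from the dataset description table
CHECKIN_COLS = ["user_id", "venue_id", "utc_time", "timezone_offset_mins"]

def filter_checkins_chunked(source, keep_users, chunksize=1_000_000):
    """Stream a check-ins TSV in chunks, keeping only rows whose user_id
    belongs to the assigned slice (a set gives O(1) membership tests)."""
    parts = []
    for chunk in pd.read_csv(source, sep="\t", names=CHECKIN_COLS,
                             chunksize=chunksize):
        parts.append(chunk[chunk["user_id"].isin(keep_users)])
    return pd.concat(parts, ignore_index=True)

def filter_friendships(df, keep_users):
    """Keep only friendship rows where BOTH endpoints are in the slice."""
    mask = df["user_id"].isin(keep_users) & df["friend_id"].isin(keep_users)
    return df[mask].reset_index(drop=True)

# Tiny in-memory stand-ins for the real files, just to show the mechanics
demo_checkins = io.StringIO(
    "1\tv1\tTue Apr 03 18:00:09 +0000 2012\t240\n"
    "2\tv2\tTue Apr 03 18:00:25 +0000 2012\t-120\n"
    "1\tv3\tTue Apr 03 18:02:24 +0000 2012\t240\n"
)
my_users = {1}
my_checkins = filter_checkins_chunked(demo_checkins, my_users, chunksize=2)

friends = pd.DataFrame({"user_id": [1, 1, 3], "friend_id": [2, 4, 1]})
my_friends = filter_friendships(friends, {1, 2, 3})
```

For the real files, pass the file path instead of a StringIO; the same pattern works with Polars' `scan_csv` or Dask if Pandas chunks are still too slow.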
Given the whole 5 dataset files from the attached link, you need to prepare your custom data files. Specifically, you need to do as follows:
1. Select your slice of users from users.txt using the given Python script; it will be stored in my_users.csv. Add to the report: screenshots of the output of the script with the command to run the script.
2. Extract the check-ins of your users and store them in my_checkins_anonymized.tsv. Add to the report: screenshots of the code with its output.
3. Extract the friendships_before of your users and store them in my_frienship_before.tsv. Note that both friendid and userid should belong only to your slice of users. Add to the report: screenshots of the code with its output.
4. Extract the friendships_after of your users and store them in my_frienship_after.tsv. Note that both friendid and userid should belong only to your slice of users. Add to the report: screenshots of the code with its output.
5. Extract the POIs of your users and store them in my_POIs.tsv. Add to the report: screenshots of the code with its output.
6. Move users.csv and all other downloaded files that do not belong to you to the Recycle Bin (better to remove them completely).

A. Design a Database Schema: [17 points]
- Create a database named foursquaredb and create all required objects with necessary constraints. Add the database instructions to the report with screenshots of the schema and the created database objects.
- Add the database instructions to the report with screenshots of the schema and the created database objects. Add your justification to the report too.
- Create the tables in foursquaredb. Add to the report: the logical model, the physical model of the tables, the database instructions for creating the database objects, and screenshots of the created database objects.
- Create the collections in foursquaredb. Add to the report: screenshots of the document structures for each collection, the database instructions for creating the database objects, and your justification.
B. Ingest the Data: [14 points]
Write an ingestion script for each database system (e.g. ingest_mongodb.py, ingest_scylladb.py, etc.) to read the prepared dataset and insert it into that system. Crucially, measure and report the total ingestion time for each database system. Add to the report: screenshots of the script with its output for each system. Add to the report a table with the following structure.
| Database | Total Ingestion time | Database Setup | CPU cores allocated | Main Memory usage |
|---|---|---|---|---|
| PostgreSQL | ?? | A single server | ?? | ?? GiB |
| Citus Data | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| ScyllaDB | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| MongoDB | ?? | A replica set of ?? nodes | ?? | ?? GiB |
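The ingestion-time measurement can be sketched as follows; timed_ingest is a hypothetical helper, and the driver-specific insert call (e.g. psycopg2's executemany, pymongo's insert_many, or a prepared-statement loop for ScyllaDB) is abstracted behind a callback, with a plain list standing in for a real database here:

```python
import time

def timed_ingest(chunks, insert_chunk):
    """Feed prepared data chunks to a driver-specific insert function and
    return the total wall-clock ingestion time in seconds."""
    start = time.perf_counter()
    total_rows = 0
    for chunk in chunks:
        insert_chunk(chunk)  # e.g. cursor.executemany / collection.insert_many
        total_rows += len(chunk)
    elapsed = time.perf_counter() - start
    print(f"Ingested {total_rows} rows in {elapsed:.2f} s")
    return elapsed

# Demo with an in-memory sink standing in for a real database driver
sink = []
elapsed = timed_ingest([[1, 2], [3]], sink.extend)
```

Bulk-loading paths (PostgreSQL COPY, mongoimport, etc.) are usually much faster than row-by-row inserts; whichever you choose, time the whole load with the same wrapper so the numbers in the table are comparable.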
Execute the following analytical queries on each of the four systems. For each query, your Python script should capture and report the execution time. Run each query at least three times and report the average.
You may get an empty output for some of the queries. Such a case is fine, since you are working on a slice of the dataset. The main idea in this part is to learn how to analyze the data, and an empty output for a valid query can itself be an interesting analysis result, unless you are doing something wrong.
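A minimal repeat-and-average timing harness for the queries might look like this (names are illustrative; run_query would wrap the driver call for one system):

```python
import statistics
import time

def average_query_time(run_query, repeats=3):
    """Run a query callable `repeats` times and return the mean wall-clock
    time in seconds, as required for the benchmark tables."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return statistics.mean(times)

# Demo with a trivial stand-in for a database query
avg = average_query_time(lambda: sum(range(10_000)), repeats=3)
```

Consider also reporting the individual run times, since the first run often includes cold-cache effects that the later runs do not.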
Q1. [8 points] Find the top 10 countries with the highest total number of check-ins.
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
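Before writing the per-system queries for Q1, it can help to prototype the aggregation on a tiny sample in pandas; the frames below are made-up data, and the real query would join check-ins to POIs on venue_id, group by country, and take the top 10:

```python
import pandas as pd

# Made-up sample data; real columns come from the dataset description table
checkins = pd.DataFrame({"user_id": [1, 1, 2, 3],
                         "venue_id": ["v1", "v2", "v1", "v3"]})
pois = pd.DataFrame({"venue_id": ["v1", "v2", "v3"],
                     "country": ["US", "JP", "US"]})

# Join each check-in to its venue's country, count per country, take top 10
top_countries = (checkins.merge(pois, on="venue_id")
                 .groupby("country").size()
                 .sort_values(ascending=False)
                 .head(10))
```

The SQL, CQL, and aggregation-pipeline versions all follow the same join/group/sort/limit shape, adapted to each system's data model.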
Q2. [12 points] We want to locate the users who may prefer to check the point of interest shared by their friends. We need to display the list of POIs for these users on their home page. Explain how we can find these users and write and execute the appropriate queries for retrieving their corresponding POIs. We consider here only friendship relationships that did not change between the periods of two snapshots.
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
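One reading of "friendships that did not change" in Q2 is pairs present in both snapshots; a small pandas sketch of that intersection, on made-up data, is:

```python
import pandas as pd

# Made-up snapshots of the friendship data
before = pd.DataFrame({"user_id": [1, 2, 1], "friend_id": [2, 1, 3]})
after = pd.DataFrame({"user_id": [1, 2], "friend_id": [2, 1]})

# Friendships present in both snapshots, i.e. unchanged between the periods
stable = before.merge(after, on=["user_id", "friend_id"])
```

From the stable pairs, the real query would then look up the friends' check-ins/POIs to build the list to display, which is where the systems' join capabilities start to differ.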
Q3. [16 points] Venues can be shared as POIs by different users. Here, we want to find out which venues are more attractive than others, and in which countries, regions, or locations they are. Attractive venues are the ones shared by the most users (here we also count a user who shares the same venue multiple times at different timestamps). You need to write and implement the appropriate queries for retrieving the info of the attractive venues by country and location.
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
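A pandas prototype of the Q3 "attractive venues" aggregation on made-up data (count every check-in of a venue, including repeats by the same user, then attach the venue's location info):

```python
import pandas as pd

# Made-up sample data
checkins = pd.DataFrame({"user_id": [1, 1, 2],
                         "venue_id": ["v1", "v1", "v2"]})
pois = pd.DataFrame({"venue_id": ["v1", "v2"],
                     "latitude": [40.7, 35.6], "longitude": [-74.0, 139.7],
                     "country": ["US", "JP"]})

# Count shares per venue (repeats included), then attach location info
attractive = (checkins.groupby("venue_id").size().rename("share_count")
              .reset_index()
              .merge(pois, on="venue_id")
              .sort_values("share_count", ascending=False))
```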
Q4. [12 points] Here we are interested in attaching categories to venues. The category field in the POIs.txt file holds values related to subcategories, but we want more general custom categories. Specifically, we want to categorize the venues into ('Restaurant', 'Club', 'Museum', 'Shop', and 'Others') using the full text search techniques provided by the database. You need to write and implement the appropriate queries for retrieving the number of venues given a custom category (e.g. 'Club').
For each system, add to the report: a short explanation (2-3 sentences) justifying the query you designed, screenshots of the query code you run and the output/results from the database.
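Database-side full text search is required by the Q4 task, but the keyword-to-category mapping you encode there (e.g. in a PostgreSQL to_tsquery or a MongoDB text index query) can be prototyped in plain Python first; the keyword lists below are illustrative assumptions, not a prescribed taxonomy:

```python
# Illustrative keyword lists; extend them based on the subcategories
# you actually observe in your slice of POIs
CATEGORY_KEYWORDS = {
    "Restaurant": ["restaurant", "diner", "steakhouse", "pizzeria"],
    "Club": ["club", "nightclub"],
    "Museum": ["museum", "gallery"],
    "Shop": ["shop", "store", "boutique"],
}

def custom_category(venue_category):
    """Map a fine-grained venue category string to a custom category,
    falling back to 'Others' when no keyword matches."""
    text = venue_category.lower()
    for label, keywords in CATEGORY_KEYWORDS.items():
        if any(k in text for k in keywords):
            return label
    return "Others"
```

Once the mapping behaves as expected on sample category strings, translate each keyword list into the corresponding full-text predicate of each database.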
A. Create a Summary Table: [10 points] Create a table with rows for each database (MongoDB, PostgreSQL, Citus, ScyllaDB) and columns for each task (Q1, Q2, Q3, Q4). Populate it with the average times you measured.
Add to the report: screenshots of the script with its output for each system. Add to the report a table with the following structure.
| Database | Q1 | Q2 | Q3 | Q4 | Database Setup | CPU cores allocated | Main Memory usage |
|---|---|---|---|---|---|---|---|
| PostgreSQL | ?? | ?? | ?? | ?? | A single server | ?? | ?? GiB |
| Citus Data | ?? | ?? | ?? | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| ScyllaDB | ?? | ?? | ?? | ?? | A cluster of ?? nodes | ?? | ?? GiB |
| MongoDB | ?? | ?? | ?? | ?? | A replica set of ?? nodes | ?? | ?? GiB |
B. Create a Visualization: [6 points] Generate a bar chart or a series of bar charts comparing the performance of the four databases across the four analytical queries.
Add to the report all figures of the charts with required annotations.
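A grouped bar chart of the summary table can be generated with matplotlib; all numbers below are placeholders to be replaced with your measured averages:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

databases = ["PostgreSQL", "Citus Data", "ScyllaDB", "MongoDB"]
queries = ["Q1", "Q2", "Q3", "Q4"]
# Placeholder average times in seconds -- substitute your benchmark results
times = np.array([
    [0.8, 1.2, 2.5, 0.9],   # PostgreSQL
    [0.5, 0.9, 1.6, 0.7],   # Citus Data
    [0.4, 1.5, 2.1, 1.1],   # ScyllaDB
    [0.6, 1.1, 1.9, 0.8],   # MongoDB
])

x = np.arange(len(queries))
width = 0.2
fig, ax = plt.subplots(figsize=(8, 4))
for i, db in enumerate(databases):
    ax.bar(x + i * width, times[i], width, label=db)
ax.set_xticks(x + 1.5 * width)
ax.set_xticklabels(queries)
ax.set_ylabel("Average execution time (s)")
ax.set_title("Query performance by database")
ax.legend()
fig.savefig("benchmark_comparison.png", dpi=150)
```

A log-scale y-axis (`ax.set_yscale("log")`) can help if the systems differ by orders of magnitude.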
C. Analyze: [10 points] Based on your results, write a concise analysis (1-2 paragraphs) for each query, explaining why a particular database performed well or poorly. Reference your data models, indexing strategies, and the fundamental architectural differences of each system.
Add to the report your analysis attached with screenshots when needed.
It is not allowed to use AI tools here to generate or analyze your results. You must write the analysis text in your own words based on your understanding. The originality of your text will be checked.