Assignment 1 : SQL & NoSQL Databases
Course: Big Data - IU S25
Author: Firas Jolha
Soft deadline: March 11, 23:59
Hard deadline: March 15, 23:59
Moodle submission link
Dataset
Prerequisites
- Installed PostgreSQL server
- Installed pandas package
- Installed Cassandra server
- Installed MongoDB server
- Installed Neo4j/Memgraph server
Description
This assignment is dedicated to practicing the SQL language for retrieving, analyzing, and manipulating data. It also includes practice working with CQL, MongoDB, and graph databases.
The objective of the assignment is to model and store relatively big data in different SQL and NoSQL databases, comparing them and finding the optimal one based on the storage requirements.
You have to work on a check-in dataset collected from Foursquare, which has been widely used as a key data source for studying location-based services. Since a Foursquare user's check-ins can be accessed only by the user's social circle, they are not publicly available. However, many Foursquare users also choose to share their check-ins or points of interest (POIs) via the X (Twitter) platform. User social relationships are collected from X: a friendship exists between two users if they follow each other. The user data is anonymized for privacy purposes.

The dataset includes long-term (about 22 months, from April 2012 to January 2014) global-scale Foursquare check-in data, as well as two snapshots of user social networks before (in Mar. 2012) and after (in May 2014) the check-in data collection period. The check-in dataset includes 90,048,627 check-ins from 2,733,324 users across 11,180,160 venues. There are 363,704 (old) and 607,333 (new) friendships in the social network data.
Important Note: In this assignment, you are required to work only on a specific slice of the users dataframe. To find the slice assigned to you:
- You need to know your serial number (sid) from the list published by the TA.
- Your slice of users data is users[sid-1::100]
For example, for sid=10, your slice is users[9::100].
```
         userid
9            10
109         110
209         210
309         310
409         410
...         ...
2732909 2732910
2733009 2733010
2733109 2733110
2733209 2733210
2733309 2733310

[27334 rows x 1 columns]
```
- You have to work only on the check-ins made by users from your slice. The same applies to friendship_before and friendship_after: all friend ids must also belong to your slice of users.
- Do not query or create database objects for users outside your slice; otherwise, your work will be penalized.
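The slicing rule above can be sketched in pandas. The frame below is a small synthetic stand-in for the real users.txt, and sid = 10 is a placeholder:

```python
import pandas as pd

# Synthetic stand-in for users.txt (a single-column file with header "userid").
users = pd.DataFrame({"userid": range(1, 1001)})

sid = 10  # your serial number from the TA's list (placeholder)

# users[sid-1::100] keeps every 100th row, starting at position sid-1,
# so users 10, 110, 210, ... end up in this slice.
my_users = users[sid - 1 :: 100]
print(my_users.head())
```

For the real file, the same slice would be taken after `pd.read_csv("users.txt")`.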
Data Description
The dataset consists of five TSV files without headers. The table below describes each file; use it to determine the column headers.
| File Name | Description | Fields |
| --- | --- | --- |
| checkins_anonymized.txt | Each line corresponds to a user checking in at a venue at `utc_time` (UTC) with a timezone offset in minutes | user_id, venue_id, utc_time, timezone_offset_mins |
| friendship_before_old.txt | Each line maps a user id (first column) to a friend id (second column); the two were friends before April 2012 | user_id, friend_id |
| friendship_after_new.txt | Each line maps a user id (first column) to a friend id (second column); the two were still friends after January 2014 | user_id, friend_id |
| POIs.txt | Each line holds the features of a venue: venue id, POI latitude, POI longitude, venue category name, and country code (ISO 3166-1 alpha-2 two-letter codes) | venue_id, latitude, longitude, category, country |
| users.txt | A file with a header; each line contains a user id, for all 2,733,324 users | userid |
Instructions
In this assignment, you need to prepare a Jupyter notebook foursquare.ipynb for extracting your slice of users data with the corresponding check-ins and friendships. Other formats, such as .py or .txt, are not accepted. The cells must already be executed and must not be edited during or after execution; I will use the execution order and execution time to check this. Any cell without an execution time or order will not be evaluated.
You also need to prepare a report. The report format is attached as a link to this document.
Important: For each task, put all of its queries as text in the table, and add screenshots of the queries with their corresponding outputs below the task in the report. Each screenshot must show both the query and its result and must be full screen; screenshots that are not full screen, or that are cut off, will not be evaluated. If the result of a single query is too long, you may limit the output to what fits on your screen.
We assume that the database servers are running on a single machine, so you are not asked to create a cluster of nodes.
You need to submit two files to Moodle: the report Assignment1.pdf and the Jupyter notebook foursquare.ipynb. Please do not compress or zip them into a folder. The Moodle submission link is https://moodle.innopolis.university/mod/assign/view.php?id=126457. The total for this assignment is 100 points + 15 extra.
Data Preparation tasks [10 points]
Given the whole dataset files from the attached link, you need to prepare your custom data files. Specifically, do the following:
- Obtain your sid from the TA.
- Extract your slice of users from users.txt.
- Prepare the check-ins of your users and store them in “my_checkins_anonymized.tsv”.
- Prepare the friendships_before of your users and store them in “my_friendship_before.tsv”. Note that both friendid and userid must belong only to your slice of users.
- Prepare the friendships_after of your users and store them in “my_friendship_after.tsv”. Note that both friendid and userid must belong only to your slice of users.
- Prepare the POIs of your users and store it in “my_POIs.tsv”.
- Move users.txt and all other downloaded files that do not belong to you to the Recycle Bin (better: delete them completely).
Hint: The dataset here is large, and you are allowed to use different Python packages that operate on a single machine for data preparation, such as Pandas. I also recommend checking other packages, including Dask and Polars dataframes. Using Apache Spark is not allowed here, so you must use tools that operate on a single machine only.
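The filtering steps above can be sketched with pandas boolean masks. The frames below are tiny synthetic stand-ins; the real files would be read with `pd.read_csv(..., sep="\t", header=None, names=[...])`:

```python
import pandas as pd

# Tiny synthetic stand-ins for the real TSV files.
my_users = pd.DataFrame({"userid": [10, 110, 210]})
checkins = pd.DataFrame({
    "user_id": [10, 11, 110, 500],
    "venue_id": ["v1", "v2", "v3", "v4"],
})
friendship = pd.DataFrame({
    "user_id": [10, 10, 110],
    "friend_id": [110, 500, 210],
})

slice_ids = set(my_users["userid"])

# Check-ins: keep only rows whose user_id belongs to the slice.
my_checkins = checkins[checkins["user_id"].isin(slice_ids)]

# Friendships: BOTH endpoints must belong to the slice.
my_friendship = friendship[
    friendship["user_id"].isin(slice_ids)
    & friendship["friend_id"].isin(slice_ids)
]

# The results would then be written out, e.g.:
# my_checkins.to_csv("my_checkins_anonymized.tsv", sep="\t", index=False)
```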
PostgreSQL tasks [25 points]
Complete the following six tasks:
- Build a relational database foursquare and create all required objects with the needed constraints.
- Load the data to your PostgreSQL database.
- We want to locate users who may prefer to check the points of interest shared by their friends, and display the list of those POIs on their home page. Explain how to find these users, then write and execute the appropriate queries for retrieving the corresponding POIs. Consider only friendship relationships that did not change between the two snapshots.
- We are interested in identifying the active users on the platform. An active user is defined as a user who shares the most distinct POIs. We want to retrieve statistical info (quantity, top 5, and average by distinct POI) about these users per year. Explain how to find these users, then write and execute the appropriate queries for retrieving their statistical information.
- Venues can be shared as POIs by different users. Here, we want to find out which venues are more attractive than others, and in which countries, regions, or locations. Attractive venues are those shared by the most users (a user who shares the same venue at several different timestamps is counted each time). Write and execute the appropriate queries for retrieving the info of the attractive venues by country and location.
- Here we are interested in classifying venues by category. The category field in the POIs.txt file holds subcategory values, but we want more general custom categories. Specifically, we want to categorize the venues into (‘Restaurant’, ‘Club’, ‘Museum’, ‘Shop’, and ‘Others’) using the full-text search techniques provided by the database. Write and execute the appropriate queries for retrieving the venues by custom category.
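As one possible direction for the last task, PostgreSQL's full-text search can drive the custom categorization through a CASE expression. The sketch below only builds the SQL string; the table name `pois`, the column names, and the keyword lists are illustrative assumptions, not a prescribed mapping:

```python
# Illustrative sketch: would be executed via psycopg2 or psql against the
# foursquare database. Table/column names and keywords are assumptions.
CUSTOM_CATEGORY_SQL = """
SELECT venue_id,
       CASE
         WHEN to_tsvector('english', category)
              @@ to_tsquery('english', 'restaurant | diner | cafe | food') THEN 'Restaurant'
         WHEN to_tsvector('english', category)
              @@ to_tsquery('english', 'club | bar | nightlife')           THEN 'Club'
         WHEN to_tsvector('english', category)
              @@ to_tsquery('english', 'museum | gallery | history')       THEN 'Museum'
         WHEN to_tsvector('english', category)
              @@ to_tsquery('english', 'shop | store | market | mall')     THEN 'Shop'
         ELSE 'Others'
       END AS custom_category
FROM pois;
"""
```

If this query is run repeatedly, a functional index on `to_tsvector('english', category)` would keep it cheap.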
Cassandra tasks [35 points]
The Cassandra database in this assignment is not provided and you have to build an efficient database solution.
Note: The queries here are the same as in the PostgreSQL section.
- Build a logical data model (Chebotko diagram) and add it to the report.
- Build the Cassandra database foursquare.
- Load all data to the database.
- We want to locate users who may prefer to check the points of interest shared by their friends, and display the list of those POIs on their home page. Explain how to find these users, then write and execute the appropriate queries for retrieving the corresponding POIs. Consider only friendship relationships that did not change between the two snapshots.
- We are interested in identifying the active users on the platform. An active user is defined as a user who shares the most distinct POIs. We want to retrieve statistical info (quantity, top 5, and average by distinct POI) about these users per year. Explain how to find these users, then write and execute the appropriate queries for retrieving their statistical information.
- Venues can be shared as POIs by different users. Here, we want to find out which venues are more attractive than others, and in which countries, regions, or locations. Attractive venues are those shared by the most users (a user who shares the same venue at several different timestamps is counted each time). Write and execute the appropriate queries for retrieving the info of the attractive venues by country and location.
- Here we are interested in classifying venues by category. The category field in the POIs.txt file holds subcategory values, but we want more general custom categories. Specifically, we want to categorize the venues into (‘Restaurant’, ‘Club’, ‘Museum’, ‘Shop’, and ‘Others’) using the full-text search techniques provided by the database. Write and execute the appropriate queries for retrieving the venues by custom category.
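Cassandra schemas are designed query-first: each query in the Chebotko diagram maps to its own table. As a sketch, a table answering "all check-ins of a given user, newest first" could look like the CQL below. Keyspace, table, and column names are assumptions; SimpleStrategy with replication factor 1 matches the single-machine setup:

```python
# CQL sketch; the strings would be executed with cassandra-driver's
# Session.execute(). All names here are illustrative assumptions.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS foursquare
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
"""

# Partition by user_id so one user's check-ins live together; cluster by
# utc_time DESC so the newest check-ins come back first.
CREATE_CHECKINS_BY_USER = """
CREATE TABLE IF NOT EXISTS foursquare.checkins_by_user (
    user_id              bigint,
    utc_time             timestamp,
    venue_id             text,
    timezone_offset_mins int,
    PRIMARY KEY ((user_id), utc_time, venue_id)
) WITH CLUSTERING ORDER BY (utc_time DESC, venue_id ASC);
"""
```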
MongoDB tasks [30 points]
Note: The queries here are the same as in the PostgreSQL section.
- Build the MongoDB database foursquare.
- Load all data to the database.
- We want to locate users who may prefer to check the points of interest shared by their friends, and display the list of those POIs on their home page. Explain how to find these users, then write and execute the appropriate queries for retrieving the corresponding POIs. Consider only friendship relationships that did not change between the two snapshots.
- We are interested in identifying the active users on the platform. An active user is defined as a user who shares the most distinct POIs. We want to retrieve statistical info (quantity, top 5, and average by distinct POI) about these users per year. Explain how to find these users, then write and execute the appropriate queries for retrieving their statistical information.
- Venues can be shared as POIs by different users. Here, we want to find out which venues are more attractive than others, and in which countries, regions, or locations. Attractive venues are those shared by the most users (a user who shares the same venue at several different timestamps is counted each time). Write and execute the appropriate queries for retrieving the info of the attractive venues by country and location.
- Here we are interested in classifying venues by category. The category field in the POIs.txt file holds subcategory values, but we want more general custom categories. Specifically, we want to categorize the venues into (‘Restaurant’, ‘Club’, ‘Museum’, ‘Shop’, and ‘Others’) using the full-text search techniques provided by the database. Write and execute the appropriate queries for retrieving the venues by custom category.
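In MongoDB, the active-users task naturally becomes an aggregation pipeline. The sketch below groups check-ins per user per year and counts distinct POIs via `$addToSet`; the collection and field names mirror the TSV headers and are assumptions (`utc_time` is assumed to be stored as a BSON date). Note this takes the top 5 overall; a per-year top 5 would need an extra grouping stage:

```python
# Aggregation pipeline sketch; would run as
# db.checkins.aggregate(active_users_pipeline) via pymongo.
active_users_pipeline = [
    # One bucket per (user, year), collecting that user's distinct venues.
    {"$group": {
        "_id": {"user": "$user_id", "year": {"$year": "$utc_time"}},
        "venues": {"$addToSet": "$venue_id"},
    }},
    # Distinct-POI count per user per year.
    {"$project": {"distinct_pois": {"$size": "$venues"}}},
    # Most active first, then keep the top 5.
    {"$sort": {"distinct_pois": -1}},
    {"$limit": 5},
]
```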
Memgraph/Neo4j tasks [15 extra points]
- Build a property graph model and add it to the report.
- Build the Memgraph/Neo4j database foursquare.
- Load all data to the database.
- We want to locate users who may prefer to check the points of interest shared by their friends, and display the list of those POIs on their home page. Explain how to find these users, then write and execute the appropriate queries for retrieving the corresponding POIs. Consider only friendship relationships that did not change between the two snapshots.
- We are interested in identifying the active users on the platform. An active user is defined as a user who shares the most distinct POIs. We want to retrieve statistical info (quantity, top 5, and average by distinct POI) about these users per year. Explain how to find these users, then write and execute the appropriate queries for retrieving their statistical information.
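With a property graph, the stable-friendship task reduces to a short Cypher pattern. The sketch below assumes labels :User and :Venue and relationship types :FRIEND_BEFORE, :FRIEND_AFTER, and :CHECKED_IN (all illustrative, not prescribed); it would run through the neo4j Python driver or a Memgraph client:

```python
# Cypher sketch (string only). Labels and relationship types are
# illustrative assumptions about the property graph model.
STABLE_FRIEND_POIS = """
MATCH (u:User)-[:FRIEND_BEFORE]-(f:User),
      (u)-[:FRIEND_AFTER]-(f)
MATCH (f)-[:CHECKED_IN]->(v:Venue)
RETURN u.userid AS user,
       collect(DISTINCT v.venue_id) AS friend_pois;
"""
```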
References
- https://www.postgresqltutorial.com/
- https://pandas.pydata.org/docs/index.html
- https://foursquare.com
- https://memgraph.com/
- https://www.mongodb.com/
- https://neo4j.com
- https://cassandra.apache.org
- Foursquare Datasets
- Dingqi Yang, Bingqing Qu, Jie Yang, Philippe Cudre-Mauroux, Revisiting User Mobility and Social Relationships in LBSNs: A Hypergraph Embedding Approach, In Proc. of The Web Conference (WWW’19). May. 2019, San Francisco, USA.
- Dingqi Yang, Bingqing Qu, Jie Yang, Philippe Cudre-Mauroux, LBSN2Vec++: Heterogeneous Hypergraph Embedding for Location-Based Social Networks, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2020.