Course: Big Data - IU S25
Author: Firas Jolha
Due at: March 3, 23:59
Important: This data folder contains 30 dataset folders with different dataset files. You should select only one dataset folder (one of the fxx where ) and use only one dataset folder. Make sure that your dataset folder contains five dataset files (friends.csv, messages.csv, events.csv, client_first_purchase_date.csv, campaigns.csv).
You can select the data folder by picking the dataset folder name in the sheet. Note that each dataset can be selected only once by one person.
A global e-commerce company, faces challenges managing exponential data growth from customer transactions, product catalogs, social networks, and inventory systems. The company wants to find an efficient way to store and analyze the data. They want to see which data model they can use to have fast data retrieval and query processing.
The company is struggling with three big data issues:
The primary objective of this assignment is to design a big data solution to store petabytes (the provided dataset is only a snapshot) of raw data using effective, optimal and scalabe data models that can handle growth and complexity without compromising quality, efficiency, or usability. The optimal design will enable analytics and transactions for real-time and batch data processing. Your task is to find the best data model for storing the big data and obtaining valuable insights from it by following the tasks below.
The given dataset consists of five csv files. The dataset has different snapshots and you need to select your snapshot of the dataset and work on it. The description of csv files are presented below.
client_id in client_first_purchase_date.csv and messages.csv is formulated by adding a unique identifier 151591562 as a prefix and user_device_id as a postfix to the user_id.
events.csvThis file contains behavior data for 3 months (from October 2019 to December 2019). Each row in the file represents an event. All events are related to products and users. Each event may happen multiple times between products and users, for instance, one product can be viewed by multiple users and one user can view multiple products. We can each line in the file as follows.
Given the following line of the file:
event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
2019-10-01 02:20:02 UTC,view,3701062,2053013565983425517,appliances.environment.vacuum,gorenje,90.07,550067820,b567743b-5fe0-4994-a3bd-949d5462e767
Or as a table.
| event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|
| 2019-10-01 02:20:02 UTC | view | 3701062 | 2053013565983425517 | appliances.environment.vacuum | gorenje | 90.07 | 550067820 | b567743b-5fe0-4994-a3bd-949d5462e767 |
We can read it as follows:
A user (*user_id=550067820*) during session (*user_session=b567743b-5fe0-4994-a3bd-949d5462e767*) viewed (property (*event_type*) is *view*) the product (*product_id=3701062*) of brand (*brand=gorenje*) of category (*category_id=2053013565983425517* and *category_code=appliances.environment.vacuum*) with price (*price=90.07*) at (*event_time=2019-10-01 02:20:02*)
campaigns.csvThis file contains information about marketing campaigns. Campaigns are messaging activities to a group of recipients. Channel is a kind of campaign like email, mobile push, web push and SMS. Campaign’s properties and metadata depends on the channel type. Emails
have subject, body (usually HTML with a lot of info, offers and personalization) and a lot of metrics like: click, spam complaint, hard bounce, unsubscribe. Mobile push and web push also have subject (first line of a message) and optional body (some browsers display it, some browsers don’t) and this channel don’t have some events like spam complaint, soft bounce or unsubscribe. SMS doesn’t have subject, only have body (text message) and has least number of events: deliver, click, purchase, hard bounce. Multichannel is a kind of transactional campaign when a retailer tries to deliver important information from the cheapest channel to the more expensive, depending on which channels recipient uses. For example the sequence can be like this:
There are 3 campaign types in dataset:
The difference between bulk and two other types is that Bulk campaign can be sent once whereas triggers and transactionals works forever.
Subject is the title of a message. In email it’s the bold line of text in your inbox. In push campaigns it’s the main line of text in push notification. SMS doesn’t have subject. Pre-header is the preview text in your inbox. Usually it’s have no accent. And it works for email only. Topic is the meaning (main goal) of a bulk campaign.

Other properties of campaigns are:
subject_length - the lenght of subject in symbols (sometimes open rate depends on subject length)subject_with_personalization - subject includes personalization. Example: “Hello, John Wick, there is the special delivery for you!”subject_with_deadline - subject includes some kind of deadline. Example: “Hey, John Wick! Only 4 days left when you can get your special delivery!”subject_with_emoji - subject includes emoji. Example: “Hello, John Wick, there is the special delivery for you 😵🍺”subject_with_bonuses - subject includes information about bonuses. Example: “Hey, John Wick! We added 300 gold coins to your bonus account!”subject_with_discount - subject includes information about discount. Example: “Hey, John Wick! 40% discount in Continental for a low season!”subject_with_saleout - subject announces some kind of saleout. Example: “Hey, John Wick! Black Friday is coming!”Only bulk campaigns have start date and finish date. Start date is the date/time in local timezone when a campaign was launched by marketing specialist. Finish date is the date/time when campaign did sent the final message.
Usually it takes from 30 minutes to 4 hours to send small to large campaigns. But some campaigns (warmup campaigns) can work days or weeks. Usually it’s done when retailer gets new recipients database and don’t want to be blocked by email providers due to high hard bounce or spam complaints rate (see Messages). In this case campaign has warmup_mode=True and hour_limit value which says how many messages can be sent every hour. The usual value of a hour limit is 500 or 1000. But sometimes when mailing domain is warmed up (got trust from email providers), limit can be increased to large values to finish the campaign. The column total_recipients says how many recipients was in bulk campaign when it was sent. Trigger and transactional campaigns don’t have this value.
Other service info including is_test - says it’s test campaign to a limited audience, and position - a position of trigger campaigns (only). Because different events can trigger different trigger campaigns, it’s important to prevent uncontrolled sending of trigger messages without limits (recipient will mark campaign as spam or unsubscribe from retailer). So there is an important sequence from most important triggers (less position is more important) to the less important (larger position number). If trigger with the smaller number can be sent, all other triggers are ignored. If can’t, next trigger is checked. If all triggers are checked and not match trigger conditions, trigger message is not sent.
An example line from campaigns.csv.
id,campaign_type,channel,topic,started_at,finished_at,total_count,ab_test,warmup_mode,hour_limit,subject_length,subject_with_personalization,subject_with_deadline,subject_with_emoji,subject_with_bonuses,subject_with_discount,subject_with_saleout,is_test,position
63,bulk,mobile_push,sale out,2021-04-30 07:22:36.615023,2021-04-30 07:23:41,48211,,False,,146.0,False,False,True,False,False,False,,
Or as a table:
| id | campaign_type | channel | topic | started_at | finished_at | total_count | ab_test | warmup_mode | hour_limit | subject_length | subject_with_personalization | subject_with_deadline | subject_with_emoji | subject_with_bonuses | subject_with_discount | subject_with_saleout | is_test | position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | bulk | mobile_push | sale out | 2021-04-30 07:22:36.615023 | 2021-04-30 07:23:41 | 48211 | False | 146.0 | False | False | True | False | False | False |
The description of the columns in campaigns.csv file are described in the table below.
| Column title | Description | Example |
|---|---|---|
| id | Campaign ID | 64 |
| campaign_type | Campaign type (bulk, trigger, transactional) | bulk |
| channel | Channel (email, mobile_push, web_push, sms) | |
| topic | Meaning of a campaign | happy birthday |
| started_at | Bulk campaign start datetime | 2021-04-30 07:22:36.615023 |
| finished_at | Bulk campaign finish datetime | 2021-04-30 07:23:41 |
| total_count | Total recipients in bulk campaign | 48211 |
| ab_test | Bulk campaign with A/B test mode | false/true |
| warmup_mode | Bulk campaign with warmup mode | false/true |
| hour_limit | Hour limit for a bulk campaign with warmup mode | 300.00 |
| subject_length | Email subject length | 97.0 |
| subject_with_personalization | Subject contains recipient’s name | false/true |
| subject_with_deadline | Subject has deadline meaning (like “4 days left for…”) | false/true |
| subject_with_emoji | Subject has emoji symbols | false/true |
| subject_with_bonuses | Subject mentions bonuses for actions | false/true |
| subject_with_discount | Subject mentions a discount | false/true |
| subject_with_saleout | Subject mentions a sale out | false/true |
| is_test | It’s a test campaign (bulk campaigns only) | false/true |
| position | Priority of trigger campaigns | 4 |
messages.csvHere the information about messages of campaigns are stored. All messages are related to some kind of campaigns:
Note that campaign_id is unique only for a specific campaign_type, so two campaigns with different campaign_type can have the same campaign_id. Message row is an individual message sent to a single recipient. It contains type, channel, recipient and some attributess, described below.
Campaign related info:
campaign_id - identifier from campaigns dataframe.message_type - Message type (bulk, trigger, transactional).channel - Message channel. Keep in mind: there is no multichannel message type as it was in campaigns dataframe, becase campaign can be multichannel (can send messages to different channels), but message belongs to a single channel.Recipient info:
client_id - recipients ID in REES46 database (unique per user).email_provider - a domain part of email (for email messages) like gmail.com or hotmail.com. Can be NA: data removed for unique personal or corporate domains with small number of messages which could cause person identification.Device related info
platform - platform type where a message was opened (if it was opened). This is defined automatically according to user-agent or other service info. For example smartphone, tablet, phablet.stream - string value of a device type (desktop, ios, android, etc) where a message was opened (if it was opened). This is defined by app developer on retailer side, when they integrate SDK.Behavior related info:
date - date in YYYY-MM-DD when a message was sent. It’s part of sent_at datetime property, helps to group messages by days.sent_at - Datetime when a message was sent.is_opened - Boolean flag if a message was opened by a recipient.opened_first_time_at - First time when a message was opened.opened_last_time_at - Last time when a message was opened (can be equal to opened_first_time_at, if the message was opened only once).is_clicked - Boolean flag if a message was clicked by a recipient.clicked_first_time_at - First time when a message was clicked.clicked_last_time_at - Last time when a message was clicked (can be equal to clicked_first_time_at, if the message was clicked only once).is_unsubscribed - Boolean flag if a recipient clicked unsubscribe link in a message.unsubscribed_at - Datetime when a recipient clicked unsubscribe link in a message.is_hard_bounced - Datetime when a message was “hard bounced”: permanent failure to deliver a message if recipient’s address does not exist (email is not exist or mobile push token is invalid).hard_bounced_at - Datetime when a message was hard bounced.is_soft_bounced - Datetime when a message was “soft bounced”: temporary failure to deliver a message if recipient’s address temporary can’t receive emails (for example mailbox is full, but exists and possibly can receive messages in the future).soft_bounced_at - Datetime when a message was soft bounced.is_complained - Boolean flag if a recipient clicked SPAM button in email client.complained_at - Datetime when a recipient clicked SPAM button in email client.is_blocked - Boolean flag if a delivery attempt was temporarily blocked by email provider (postmaster says "you act like spammer, I won’t deliver your email). It’s not permanent ban.blocked_at - Datetime when a delivery attempt was temporarily blocked by email provideris_purchased - Boolean flag if a recipient clicked any link in a message, opened a website or mobile app and made a purchasepurchased_at - Datetime when a recipient made a purchase after click on email or other message.client_first_purchase_date.csvThis file contains the date of the first purchase for the client.
friends.csvThis file stores the mutual friendship between users. Both columns contain ids belong to the user_id in other files.
Important: The reference paper for the report of this assignment is here. Download the paper and read it before starting the assignment.
Important: Before you start solving the tasks, create a new repository on Github for this assignment and submit only the repository link to Moodle. You can add the TA (firas-jolha) as contributor if your repository is private. There is no specific structure for the repository but you should minimally have:
/scripts folder to store all types of scripts and queries. SQL queries should have .sql extension, MongoDB queries should have .js extension, and Neo4J queries should have .cypher extension./screenshots folder to store the screenshots./output folder to put the results of the experiments.report.pdf is your final report.Hackolade is a data modeling tool for SQL and NoSQL databases, APIs, and storage formats. It is a graphical design tool for building data models. You can access the freemium Hackolade Studio on the browser and use it for building your data models. You can also download the software and run it on your machine using the links below. Use the community edition of the software.
scripts/clean_data.py file./scripts/load_data_psql.sql in your repository. The script will be executed from the command line as sudo psql -d ecommerce -f scripts/load_data_psql.sql (You can assume that there is an empty database ecommerce already created). Do not worry about authentication issues./scripts/load_data_mongodb.js in your repository. The script will be executed from the command line as sudo mongosh --file scripts/load_data_mongodb.js. Do not worry about authentication issues./scripts/load_data_neo4j.cypher in your repository. The script will be executed from the command line as sudo cat scripts/load_data_neo4j.cypher | cypher-shell. Do not worry about authentication issues. You do not create a new database here but use the default database neo4j./scripts./scripts./scripts/q1.sql, /scripts/q1.js, /scripts/q1.cypher. Add the analysis results to the report including the queries and your analysis approach./scripts/q2.sql, /scripts/q2.js, /scripts/q2.cypher. Add the analysis results to the report including the queries and your analysis approach.category_code of the products. Add the analysis results to the report including the queries and your analysis approach. Write the queries in scripts in files /scripts/q3.sql, /scripts/q3.js, /scripts/q3.cypher.time command here for Linux systems.query execution times similar to fig. 7 in the referece paper.For benchamarking, you can use hyperfine but do not forget to export the results in a specific format to add to the report.
The report should have a similar sturucture to the reference paper. Ensure that you have minimaly the following sections:
.pdf (docx, doc or any other formats are not accepted).It is allowed to use any support tools but not recommended to us GenAI tools for generating text or code. You can use these generative tools to proofread your report or check the syntax of the code. All solutions must be submitted individually. We will perform a plagiarism check on the code and report and you will be penalized if your submission is found to be plagiarized.