Final Project
Course: Big Data - IU S25
Project info submission
Repository Template
Prerequisites
- You have access to a Hadoop cluster of at least 3 nodes with all required services for the project stages.
- You have access to the IU Hadoop Cluster.
Technology Stack
PostgreSQL, Apache Sqoop, HDFS, Apache Hive (on Tez), Apache Spark (PySpark, Spark MLlib), Hadoop YARN, and Apache Superset.
Description
The aim of this project is to build an end-to-end big data pipeline that ingests large data from a database such as PostgreSQL into HDFS, analyzes the distributed data in batch using the Spark engine, and presents the results in a dashboard (Apache Superset). The stages of the pipeline are as follows (a minimal PySpark sketch of the storage and analysis steps is given after the list):
- Data Collection and Ingestion
- Collect the dataset and explore it.
- Build a relational model for the data and create a relational database in PostgreSQL
- Load the data to the relational database.
- Ingest the data from the relational database into HDFS using Sqoop, which runs on the Hadoop MapReduce engine.
- Data Storage/Preparation
- Create efficient Hive tables, compressed (e.g., Snappy, GZIP) and stored in big data storage formats (e.g., Avro, Parquet).
- Store the data in the Hive data warehouse for further analytics.
- Data Analysis
- Perform Exploratory Data Analysis (EDA) with HiveQL on Tez engine.
- Analyze the data using Spark DataFrame and SQL.
- Perform Predictive Data Analysis (PDA) via building distributed ML models using Spark MLlib.
- This part also includes feature extraction and data preprocessing.
- Presentation
- Present the analysis results in a dashboard using Apache Superset on defence day.
- Tell the story of the data on the dashboard.
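To make the stages concrete, here is a minimal PySpark sketch of the storage and analysis steps, assuming a SparkSession with Hive support running on the cluster. The HDFS path, database, table, and column names are placeholders for illustration, and reading Avro requires the `spark-avro` package; adapt everything to your own schema.

```python
from pyspark.sql import SparkSession

# Session with Hive support so saveAsTable() lands in the Hive warehouse.
spark = (
    SparkSession.builder
    .appName("team_pipeline")
    .master("yarn")              # run on the cluster, not locally
    .enableHiveSupport()
    .getOrCreate()
)

# Read the records Sqoop imported into HDFS (placeholder path and format).
df = spark.read.format("avro").load("/user/team/project/warehouse/trips")

# Stage II: persist as a Snappy-compressed Parquet Hive table.
(
    df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .format("parquet")
    .saveAsTable("projectdb.trips")
)

# Stage III: batch analysis with Spark SQL on the distributed table.
spark.sql("""
    SELECT passenger_count, COUNT(*) AS n_trips
    FROM projectdb.trips
    GROUP BY passenger_count
    ORDER BY n_trips DESC
""").show()
```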
The deliverables of this project are:
- The project repository
- You have to use only Python as the programming language for the PySpark application.
- You have to use and follow the project repository template attached to this tutorial. Do not fork it; duplicate it and start working on your project.
- The repository has a README file which includes the rules of using the template.
- The scripts should generate reproducible results. For instance, if I run the script `stage1.sh` a second time, it should not give me any errors, so you should make sure that you clear existing objects before creating new ones (see the sketch below).
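Continuing the sketch above, a re-runnable pattern guards every CREATE with a DROP and overwrites on write (the database and table names are still placeholders):

```python
# Idempotent pattern: safe on the first run and on every re-run.
spark.sql("CREATE DATABASE IF NOT EXISTS projectdb")
spark.sql("DROP TABLE IF EXISTS projectdb.trips")
df.write.mode("overwrite").format("parquet").saveAsTable("projectdb.trips")
```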
- For the EDA part
- You need to analyze the data and display data characteristics and at least 6 different data insights in the dashboard prior to data preprocessing (an example insight query is sketched after this part).
- Data insights should be valuable and help business stakeholders make better decisions.
- For each insight drawn from the data, you need to create charts and add a description/story for it.
- You need to explain the data insight (data storytelling).
- Data storytelling is the ability to effectively communicate insights from a dataset using narratives and visualizations. It can be used to put data insights into context for your audience and to inspire action.
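As an illustration of what one insight's query could look like, here is a hypothetical aggregation over the placeholder table from the sketch above; materializing the result as a table lets Superset use it as a chart's data source:

```python
# Hypothetical insight: how does trip volume vary by hour of day?
q1 = spark.sql("""
    SELECT HOUR(pickup_datetime) AS pickup_hour, COUNT(*) AS n_trips
    FROM projectdb.trips
    GROUP BY HOUR(pickup_datetime)
    ORDER BY pickup_hour
""")

# Save the result so the dashboard can read it.
q1.write.mode("overwrite").saveAsTable("projectdb.q1_trips_by_hour")
```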
- For the ML part
- You need to build three different kinds of models, e.g., Random Forest, Support Vector Machine, and Naive Bayes for a classification task.
- For the data split, use 70% training and 30% test. You may use a different ratio depending on the size of your dataset, but always make sure that you have at least 60% training data.
- Fine-tune the hyperparameters via grid search over 3 different values for each hyperparameter (at least 3 hyperparameters); optimizing over 27 combinations is enough. The number of epochs/iterations is not counted as a hyperparameter. One of the hyperparameters should be an algorithm hyperparameter and the other two model hyperparameters. A minimal grid-search sketch follows this part.
- You should report at least two metrics. You need to use the following performance metrics (reference):
- Accuracy and F1 for multiclass classification tasks
- Area Under ROC and Area Under PR for binary classification tasks.
- RMSE and R2 for regression tasks.
- Mean Average Precision and NDCG for recommendation tasks.
- Custom metrics for custom ML tasks.
- Perform k-fold cross-validation (where 2 < k < 5) and select the best model.
- Show the performance of the best model of each kind in the dashboard.
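A minimal sketch of grid search with cross-validation in Spark MLlib, assuming a DataFrame `data` with a binary `label` column and an assembled `features` vector; the model, the three hyperparameters, and their values are placeholders chosen to satisfy the 3 × 3 × 3 grid and the k-fold requirement (how you classify them as algorithm vs. model hyperparameters depends on your model):

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, test = data.randomSplit([0.7, 0.3], seed=42)   # 70/30 split

rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# 3 values for each of 3 hyperparameters -> 27 combinations.
grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50, 100])
    .addGrid(rf.maxDepth, [5, 10, 15])
    .addGrid(rf.minInstancesPerNode, [1, 5, 10])
    .build()
)

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=3,          # k-fold CV with 2 < k < 5
    parallelism=4,
)
best = cv.fit(train).bestModel

# Report at least two metrics on the held-out test set (binary task here).
preds = best.transform(test)
print("AUC-ROC:", evaluator.evaluate(preds))
print("AUC-PR:", BinaryClassificationEvaluator(metricName="areaUnderPR").evaluate(preds))
```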
- For the dashboard
- It is enough to create a single dashboard using Apache Superset.
- You have to display the data characteristics, such as the number of data instances, the number of features, feature names, etc.
- You need to display all the results of the analytics.
- You need to train the models on the cluster (Hadoop YARN), not on a local machine.
- Check the grading criteria to see the requirements needed to be fulfilled here.
- A report
- The report contains the details of the stages, activities, and findings of the project and must include at least the following sections:
- Title page
- Introduction
- Data Description
- Architecture of data pipeline
- The input and the output for each stage
- Data preparation
- ER diagram
- Some samples from the database
- Creating Hive tables and preparing the data for analysis
- Data analysis
- Analysis results
- Charts
- Interpretation
- ML modeling
- Feature extraction and data preprocessing
- Training and fine-tuning
- Evaluation
- Data presentation
- The description of the dashboard
- Description of each chart
- Your findings
- Conclusion
- Reflections on own work
- Challenges and difficulties
- Recommendations
- The table of contributions of each team member
Example:

| Project Tasks | Task description | fname lname 1 | fname lname 2 | fname lname 3 | fname lname 4 | Deliverables | Average hours spent |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Data extraction | Collect the data from the <link> and extract the representative sample for the project | 40% | 50% | 0% | 10% | sample_data.csv | 0.5 |
| Task 2 | | | | | | | |
| Task 3 | | | | | | | |
| …etc | | | | | | | |
Note: Make sure that the contributions in each row total 100%. You can fill in this table as follows:
- Add a task and the description of the task
- Assign people to work on the task
- Add estimated hours and a deadline to finish the task
- Regularly check the progress
- After the task is done,
- Add the average hours spent for the task by all teammates
- Add the deliverables of the task
- Go to the next task
- You can add tasks to work on simultaneously as a team unless a task is blocked by other teammates' tasks.
- The presentation
- You should explain the goal of your project; for instance, you are trying to predict users' ratings to build a good recommendation system. Expand on this sentence and add figures and images to explain the objective.
- The dataset characteristics.
- How you analyzed the data.
- Analysis results.
- The results for each stage of the project.
- What challenges you encountered.
- A short demo in which you present the dashboard and demonstrate the results you obtained.
Notes:
1. You do not need to store the dataset in GitHub. You can add it to `.gitignore` and keep it in the cluster.
2. This tutorial will be followed by four stages that should be completed as parts of the project. More instructions will be specified in the individual stages in the next labs.
Project submission
You should submit to Moodle the following files:
- A link to your repository.
- The report.
- The presentation in PDF format or link to your presentation on Google Slides.
Dataset Criteria
- Data is big in terms of volume for batch processing on the Spark engine:
- At least 500,000 data records.
- At least 400 MB in size.
- At least 8 explanatory variables (features). The features should not be anonymized or encoded, and they should be described. (A quick sanity-check sketch follows this list.)
- Either datetime or geospatial features must be present in the data (both can be). Features that store only dates are not counted here.
- The goal of predictive analysis with regard to the selected dataset should be determined (Regression, Classification, …etc).
- The dataset should not be one of the datasets you worked on during the labs or assignments, and you should not select a dataset already chosen by another team.
- The preferred format for the dataset files is CSV. JSON and Excel files are fine, but you need to convert them to CSV.
- The TA will verify your dataset, so you cannot start working on it until you get the verification.
- You should find a dataset before March 31.
- You should submit the dataset info to the submission sheet
- Each teammate should participate in the project and presentation.
- You have to work in teams of 4 people.
- For exceptional situations, you need to contact the TA.
- You should select a team before March 25.
- You should submit the team info to the submission sheet
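Before asking for verification, a quick sanity check of the volume criteria may save a round-trip; the path below is a placeholder, and the 400 MB threshold is easiest to confirm with `hdfs dfs -du -h` on the cluster:

```python
# Rough check of the dataset criteria before submitting the dataset info.
df = spark.read.csv("/user/team/project/data/raw.csv", header=True, inferSchema=True)

print("records:", df.count())        # expect >= 500,000
print("features:", len(df.columns))  # expect >= 8 described, non-anonymized features
df.printSchema()                     # are datetime/geospatial columns present?
```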
Grading Criteria
- [10 points] Stage I
- [5 points] You built all tables in the database. You added all constraints if any. You loaded the data to the tables.
- [5 points] You imported the data to HDFS via Sqoop.
- [15 points] Stage II
- [3 points] You created Hive Tables and stored the data.
- [12 points] EDA
- [2 points] for each insight provided (query + chart + story of the data)
- [23 points] Stage III
- [23 points] PDA
- [8 points] for the first model with hyperparameter optimization
- [8 points] for the second model with hyperparameter optimization
- [5 points] for the third model with hyperparameter optimization
- [2 points] for prediction of a specific data sample
- [15 points] Stage IV
- [15 points] for the dashboard
- [1 point] for showing the data characteristics
- [6 points] for the insights with provided descriptions
- [4 points] for showing the performance of the models
- [2 points] for prediction results
- [2 points] for visual quality of dashboard
- [15 points] Report
- [15 points] Project Defence
- [5 points] for presentation
- [10 points] for answering the questions
- [7 points] Files, repository documentation, and code quality (checked with pylint)
The final grade of the project is computed as follows:
final grade = round(30 × total points / 100)
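For example, a team earning 88 of the 100 available points receives round(30 × 88 / 100) = round(26.4) = 26.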
Possible cases for penalties:
- You did not follow the repo structure or did not follow the rules in the repository.
- Remove the points assigned for code quality
- Add penalty [-40%] for each other grading criterion.
- You did not attend the defence.
- Remove the points assigned for project defence
- Add penalty [-70%] for each other grading criterion.
- Late submissions
- Add penalty [-70 points].
- If the student submits after the defence day, their submission is not considered and they get zero for the late submission.
- The Spark application is not running on the YARN cluster
- Add penalty [-80%] for the ML part
- The student did not contribute to the project
- Add penalties to the student’s grade based on the submission sheet and contribution table
- The penalty depends on the situation and percentage of contribution