Final Project

Course: Big Data - IU S25

Project info submission

Repository Template

Agenda

Prerequisites

Technology Stack

Description

The aim of this project is to build an end-to-end big data pipeline that ingests large data from a relational database (PostgreSQL) into HDFS, analyzes the distributed data with batch processing on the Spark engine, and presents the results in a dashboard (Apache Superset). The stages of the pipeline are as follows:

  1. Data Collection and Ingestion
    • Collect the dataset and explore it.
    • Build a relational model for the data and create a relational database in PostgreSQL
    • Load the data to the relational database.
    • Ingest the data from the relational database to HDFS using Sqoop on the Hadoop MapReduce engine.
  2. Data Storage/Preparation
    • Create efficient Hive tables, compressed (e.g. Snappy, GZIP) and stored in big data storage formats (e.g. AVRO, PARQUET).
    • Store the data in the Hive data warehouse for further analytics.
  3. Data Analysis
    • Perform Exploratory Data Analysis (EDA) with HiveQL on Tez engine.
    • Analyze the data using Spark DataFrame and SQL.
    • Perform Predictive Data Analysis (PDA) via building distributed ML models using Spark MLlib.
      • This part also includes feature extraction and data preprocessing.
  4. Presentation
    • Present the analysis results in a dashboard using Apache Superset on defence day.
    • Tell the story of the data on the dashboard.
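As a toy illustration of the first stage (modeling the data relationally and loading it), the sketch below uses Python's built-in sqlite3 as a lightweight stand-in for PostgreSQL; the table, columns, and sample rows are made up:

```python
# Minimal sketch of the "load data into a relational database" step,
# using sqlite3 as a lightweight stand-in for PostgreSQL.
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for the real dataset CSV.
raw = io.StringIO("id,city,amount\n1,Kazan,10.5\n2,Innopolis,7.2\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")

# Load the CSV rows into the relational table.
reader = csv.DictReader(raw)
conn.executemany(
    "INSERT INTO sales (id, city, amount) VALUES (:id, :city, :amount)",
    list(reader),
)
conn.commit()

# Sanity check on the load, analogous to verifying row counts before
# ingesting the table into HDFS with Sqoop.
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2
```

In the real pipeline the same verification idea applies: compare the row count in PostgreSQL against the row count of the files Sqoop writes to HDFS.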

The deliverables of this project are:

  1. The project repository
    • You have to use only Python as the programming language for the PySpark application.
    • You have to use and follow the project repository template attached to this tutorial for your project. Do not fork but duplicate it and start working on your project.
      • The repository has a README file which includes the rules of using the template.
      • The scripts should generate reproducible results. For instance, if I run the script stage1.sh a second time, it should not give any errors, so you should make sure that you clear existing objects before creating new ones.
    • For EDA part
      • You need to analyze the data and display data characteristics and at least 6 different data insights in the dashboard prior to data preprocessing.
      • Data insights should be valuable and help business stakeholders make better decisions.
      • For each insight drawn from the data, you need to create charts and add a description/story for it.
        • You need to explain the data insight (data storytelling).
          • Data storytelling is the ability to effectively communicate insights from a dataset using narratives and visualizations. It can be used to put data insights into context for and inspire action from your audience.
    • For ML part
      • You need to build three different kinds of models, e.g. Random Forest, Support Vector Machine, and Naive Bayes for a classification task.
      • For the data split, use 70% training and 30% test. You could use different ratios based on the size of your dataset, but you should always make sure that you have at least 60% training data.
      • Fine-tune the hyperparameters via grid search over 3 different values for each hyperparameter (at least 3 hyperparameters); it is enough to optimize over 27 different combinations. The number of epochs/iterations is not counted as a hyperparameter. One of the hyperparameters should be an algorithm hyperparameter and the other two should be model hyperparameters.
      • You should report at least two metrics. You need to use the following performance metrics (reference):
        • Accuracy and F1 for multiclass classification tasks.
        • Area Under ROC and Area Under PR for binary classification tasks.
        • RMSE and R2 for regression tasks.
        • Mean Average Precision and NDCG for recommendation tasks.
        • Custom metrics for custom ML tasks.
      • Perform k-fold cross-validation (where 2 < k < 5) and select the best model.
      • Show the performance of the best model of each kind in the dashboard.
    • For the dashboard
      • It is enough to create a single dashboard using Apache Superset.
      • You have to display the data characteristics such as the number of data instances, number of features, feature names, etc.
      • You need to display all the results of the analytics.
      • You need to train the models on a cluster (Hadoop YARN), not on a local machine.
      • Check the grading criteria to see the requirements needed to be fulfilled here.
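The grid-search bookkeeping described above (3 values for each of at least 3 hyperparameters, plus k-fold cross-validation with 2 < k < 5) can be sketched in pure Python. In the actual project you would express the grid with Spark MLlib's ParamGridBuilder and CrossValidator; the hyperparameter names and values below are only illustrative:

```python
# Pure-Python sketch of the required hyperparameter grid:
# 3 values for each of 3 hyperparameters -> 27 combinations.
from itertools import product

param_grid = {
    "numTrees": [20, 50, 100],          # illustrative algorithm hyperparameter
    "maxDepth": [5, 10, 15],            # illustrative model hyperparameter
    "minInstancesPerNode": [1, 2, 4],   # illustrative model hyperparameter
}

# Enumerate every combination of the grid values.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
print(len(combinations))  # 27

# k-fold cross-validation with 2 < k < 5 leaves k = 3 or k = 4.
valid_k = [k for k in range(2, 6) if 2 < k < 5]
print(valid_k)  # [3, 4]
```

In Spark MLlib the same grid is built with ParamGridBuilder.addGrid(...) and handed to CrossValidator together with an estimator, an evaluator, and numFolds set to your chosen k.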
  2. A report
    • The report contains the details of the stages, activities and findings in the project and must include at least the following sections:
      • Title page
      • Introduction
        • Business objectives
      • Data Description
        • Data Characteristics
      • Architecture of data pipeline
        • The input and the output for each stage
      • Data preparation
        • ER diagram
        • Some samples from the database
        • Creating Hive tables and preparing the data for analysis
      • Data analysis
        • Analysis results
        • Charts
        • Interpretation
      • ML modeling
        • Feature extraction and data preprocessing
        • Training and fine-tuning
        • Evaluation
      • Data presentation
        • The description of the dashboard
        • Description of each chart
        • Your findings
      • Conclusion
        • Summary of the report
      • Reflections on own work
        • Challenges and difficulties
        • Recommendations
        • The table of contributions of each team member

Example:

Project Tasks | Task description | fname lname 1 | fname lname 2 | fname lname 3 | fname lname 4 | deliverables | Average hours spent
Data extraction | Collect the data from the <link> and extract the representative sample for the project | 40% | 50% | 0% | 10% | sample_data.csv | 0.5
Task 2 | | | | | | |
Task 3 | | | | | | |
…etc | | | | | | |

Note: Make sure that the contribution percentages in each row total 100%. You can fill in the table as shown in the example above.

  3. The presentation
    • You should explain the goal of your project, for instance that you are trying to predict user ratings to build a good recommendation system. Expand on this sentence and add some figures and images to explain the objective.
    • The dataset characteristics.
    • How you analyzed the data.
    • Analysis results.
    • The results for each stage of the project.
    • What challenges you encountered.
    • A short demo in which you present the dashboard and demonstrate the results you obtained.

Notes:

1. You do not need to store the dataset in GitHub. You can add it to .gitignore and keep it in the cluster.
2. This tutorial will be followed by four stages that should be completed as parts of the project. More instructions will be given in the individual stages in the next labs.

Project submission

You should submit to Moodle the following files:

  1. A link to your repository.
  2. The report.
  3. The presentation in PDF format or link to your presentation on Google Slides.

Dataset Criteria

  1. Data is big in terms of volume for batch processing on Spark engine.
    • At least 500,000 data records.
    • At least 400 MB in size.
  2. At least 8 explanatory variables (features). The features should be neither anonymized nor encoded, and they should be described.
  3. Either datetime or geospatial features (or both) must be present in the data. Features that store only dates do not count here.
  4. The goal of predictive analysis with regard to the selected dataset should be determined (Regression, Classification, …etc).
  5. The dataset should not be one of the datasets you worked on during the labs or assignments, and you should not select a dataset already chosen by another team.
  6. The preferred format for the dataset files is CSV. JSON and Excel files are fine, but you need to convert them to CSV.
  7. The TA will verify your dataset, so you cannot start working on it until you get the verification.
  8. You should find a dataset before March 31.
  9. You should submit the dataset info to the submission sheet.
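As a quick self-check before requesting verification, the volume criteria above can be tested locally. This is a hedged sketch: the function name and default thresholds mirror the criteria but are otherwise illustrative, and the row count assumes a single CSV file with a header line.

```python
# Sketch: check a candidate dataset against the volume criteria
# (>= 500,000 records, >= 400 MB) before asking the TA to verify it.
import csv
import os

def check_dataset(path, min_rows=500_000, min_bytes=400 * 1024 * 1024):
    """Return the row count and whether the size/row criteria are met."""
    size_ok = os.path.getsize(path) >= min_bytes
    with open(path, newline="") as f:
        # Subtract one for the header line.
        rows = sum(1 for _ in csv.reader(f)) - 1
    return {"rows": rows, "rows_ok": rows >= min_rows, "size_ok": size_ok}
```

For a dataset split across several CSV files you would sum the sizes and row counts over all files instead.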

Team Formation Criteria

  1. Each teammate should participate in the project and presentation.
  2. You have to work in teams of 4 people.
    • For exceptional situations, you need to contact the TA.
  3. You should select a team before March 25.
  4. You should submit the team info to the submission sheet.

Grading Criteria

The final grade of the project is computed as follows:

final grade = round(30 × total points / 100)
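Read as final grade = round(30 × total points / 100), the formula scales a 0-100 point total onto the 30-point project grade. A minimal sketch (assuming Python's round, which uses banker's rounding on exact .5 values):

```python
# Sketch of the grading formula: total points (0-100) scaled to a 0-30 grade.
def final_grade(total_points: float) -> int:
    return round(30 * total_points / 100)
```

For example, 90 total points yields a final grade of 27.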

Possible cases for penalties: