Final Project

Course: Big Data - IU S25
Authors: Armen Beklaryan & Firas Jolha

Project info submission

Project Report Overleaf Template

Agenda

Prerequisites

Objectives

The aim of this project is to analyze a business case study of your choice using the CRISP-DM framework. You need to investigate the business problem, analyze the provided data, and present the analysis results to stakeholders and managers to address the business problem. Your solution should have a data architecture that satisfies the business requirements, and the reproducibility of your work must be ensured. You should have at least one end-to-end big data and machine learning pipeline that automates data collection, storage, ingestion, analysis, reporting, and presentation. The primary stages of the project are:

  1. Find a business case study and its dataset, and explicitly define and implement your approach to data storage.
  2. Implement a data pipeline to ingest the stored data into a data warehouse such as Hive for analytics. Use efficient methods for storage and retrieval (see the sketch after this list).
  3. Perform Exploratory Data Analysis on the data in the data warehouse and store the analysis results in monolithic or distributed databases.
  4. Build and train ML models using SparkML.
  5. Build a dashboard to present the analysis results using Apache Superset.
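
As a rough illustration of stage 2, here is a minimal PySpark sketch that ingests stored CSV data into a partitioned, Parquet-backed Hive table. The input path, database, table, and partition column are hypothetical placeholders, not part of the assignment:

```python
# ingest_to_hive.py: a minimal sketch of stage 2 (Hive ingestion).
# All paths, database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ingest_to_hive")
    .enableHiveSupport()  # required to create Hive tables
    .getOrCreate()
)

# Read the collected data from HDFS.
df = spark.read.csv("hdfs:///user/team/data/sales.csv",
                    header=True, inferSchema=True)

# Store it as a partitioned, Parquet-backed Hive table: columnar storage
# and partition pruning make later analytical queries cheaper.
(df.write
   .mode("overwrite")
   .partitionBy("year")
   .format("parquet")
   .saveAsTable("projectdb.sales"))
```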

Description

Within the framework of this project, you must conduct a full cycle of big data analysis, from data loading to the presentation of the final results. Each student or group of students independently chooses a dataset that meets the following criteria:

As part of the labs, the software stack and the project stages will be analyzed. The lectures will discuss in detail the methodology, the necessary unit economics calculations for the project, the reporting documents, and the presentation.

During the project, you will not be monitored on passing each of the necessary stages. Self-control against the project's reference points is the main challenge: see how much of the work you can really do yourself, and how much your productivity increases when there are no tight deadlines. No extension of the final project submission deadline will be granted. From this day on, you have enough time, regardless of all your other parallel study activities; time management is completely up to you. In case of technical issues, schedule a consultation with the TA (it is advisable to collect a set of questions from everyone rather than scheduling individual consultations). For project management issues, as well as questions about the implementation methodology, please contact the lecturer.

Instructions on Project Artifacts

Important: Before you start working on the project, submit your project info in the sheet attached to this tutorial.

In this section, I will explain the requirements and criteria for each artifact of the project.

The repository

Here I will present the requirements and criteria related to the project repository.

The template

Your repository should contain all the scripts and files required to replicate the whole process, from data collection to presentation. The repository should be organized and documented, and it should have a README file explaining how to reproduce your work and obtain the analysis results. The data pipelines should be written as shell scripts to be executed in the bash shell on the CentOS 7.9.2009 Linux distro (on which the IU cluster is running).

All files and scripts in the repository should be documented; undocumented files will not be evaluated. Any file that fails to run due to errors will also not be evaluated.

You have to use only Python as a programming language. If you are using external packages, add them to the requirements file and add the installation commands to the scripts. If you need to install packages that require sudo permission, ask the TA.

The scripts should generate reproducible results. For instance, if I run the script stage1.sh a second time, it should not give any errors, so make sure that you drop existing objects before creating new ones. A minimal sketch of this pattern is shown below.
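
The sketch assumes a hypothetical Hive table and HDFS output path left over from a previous run; adapt the names to your project:

```python
# cleanup.py: a minimal sketch of clearing previous-run objects so that
# a stage can be re-executed without errors. All names are hypothetical.
import subprocess

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("stage1_cleanup")
    .enableHiveSupport()
    .getOrCreate()
)

# Drop the Hive table created by the previous run, if any.
spark.sql("DROP TABLE IF EXISTS projectdb.sales")

# Remove the HDFS output directory of the previous run;
# -f lets the command succeed even if the path does not exist.
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-f",
                "/user/team/output/eda_results"], check=True)
```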

Important: Your repository should have a main.sh script in the root directory of the repository to run the whole work in the project.

Note that any error you encounter on the cluster will also occur when I test your project, since I will evaluate it by running main.sh on the cluster using your username. Please add the location of your main.sh on the cluster to the report.

Data Analytics

Your data should be large enough to count as big data. Here we will discuss the criteria for the complexity of the data analysis.

  1. The performance metrics of the ML model should be linked to the business objectives in the CRISP-DM model.
    • It is one of the minimum requirements in this project.
  2. The analysis should include some degree of creativity, and this needs to be revealed in the pitch presentation. This criterion will be assessed based on the analysis outcomes. For instance, you might discover an interesting and useful insight in the data which helps increase the company's revenue, or develop an AI model which hits the business objectives.
    • The minimum requirement is to provide at least 5 different insights from your data in the EDA part.
  3. The analysis should consider and assess the potential risks after AI model deployment. For instance, if you are generating recommendations for Amazon products, one very likely risk is the cold-start issue. You have to:
    • consider different scenarios for the possible risks.
    • generate fake data for different scenarios
      • different data distributions
    • run prediction on the generated data
    • assess your AI model
    • discuss the potential risk mitigation strategies
    • The minimum requirement is to discuss a single potential risk.
  4. The analysis should involve building diverse AI models where comparison is possible. The assessment here will be based on:
    • Used feature selection/creation methods like PCA, SVD, etc. You can find some of these methods in the Spark API; feel free to use any library which supports the Spark DataFrame API and performs the processing in the distributed environment.
      • The minimum requirement is to use all features from your data.
    • Built different supervised models
      • Classical ML methods
        • Linear regression
        • SVM
        • Logistic Regression
        • Decision Trees
        • etc.
      • Non-classical ML methods
        • Feed-forward Neural networks
          • pyspark.ml.classification.MultilayerPerceptronClassifier
        • Boosting methods
          • Gradient-Boosted Trees
            • pyspark.ml.classification.GBTClassifier
          • CatBoost
          • etc.
        • Ensemble Learning
          • Random forests.
      • The minimum requirement is to train at least 2 different models (one classical and one non-classical).
    • ML model learning curve
      • The hyperparameters are optimized
      • The random generators are seeded
      • The code can generate reproducible results
      • The train/test split size used
      • The minimum requirement is to optimize 3 hyperparameters with k>2 folds using GridSearch for both models (see the first sketch after this list).
    • ML Model Interpretation
      • You can use SHAP or LIME for neural networks. Check here to see how you can parallelize the computation using Spark UDFs.
      • For classical methods, you can use the coefficients of the model. For instance, in a decision tree you can use the decisions which lead to the tree branches; the property model.toDebugString will return the learned tree model (see the second sketch after this list).
      • The interpretation should be connected to the business problem and aim to fulfill the objectives.
      • There is no minimum requirement for this item.
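
To make the learning-curve checklist concrete, here is a minimal sketch of a seeded train/test split and a grid search over 3 hyperparameters with 3-fold cross-validation, using a Random Forest as the example model. The table and column names are hypothetical placeholders; treat this as one possible layout under those assumptions, not the required implementation:

```python
# tune_model.py: a minimal sketch of seeded, reproducible grid search.
# Table and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = (
    SparkSession.builder
    .appName("model_tuning")
    .enableHiveSupport()
    .getOrCreate()
)

data = spark.table("projectdb.features")  # hypothetical Hive table

# Assemble the feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["f1", "f2", "f3"],  # hypothetical feature columns
    outputCol="features",
)
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            seed=42)  # seeded estimator
pipeline = Pipeline(stages=[assembler, rf])

# Seeded split so a re-run reproduces the same partitions.
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Grid over 3 hyperparameters, matching the minimum requirement.
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .addGrid(rf.maxBins, [32, 64])
        .build())

evaluator = BinaryClassificationEvaluator(labelCol="label")
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3,  # k > 2 folds
                    seed=42)

model = cv.fit(train)
print("Test AUC:", evaluator.evaluate(model.transform(test)))
```

Seeding both the split and the estimator is what makes a second run of the script produce the same metrics, which ties this checklist back to the reproducibility requirement above.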

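A companion sketch shows how to interpret a classical model, following the toDebugString suggestion above. Again, the table and column names are hypothetical:

```python
# interpret_tree.py: a minimal sketch of tree-model interpretation.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = (
    SparkSession.builder
    .appName("interpretation")
    .enableHiveSupport()
    .getOrCreate()
)

feature_cols = ["f1", "f2", "f3"]  # hypothetical feature columns
data = spark.table("projectdb.features")

assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(data)
dt_model = DecisionTreeClassifier(labelCol="label",
                                  featuresCol="features",
                                  seed=42).fit(assembled)

# Human-readable dump of the learned splits; in PySpark,
# toDebugString is a property, so no parentheses are needed.
print(dt_model.toDebugString)

# Rank features by importance and tie them back to column names,
# so the findings can be phrased in business terms.
ranked = sorted(zip(feature_cols, dt_model.featureImportances.toArray()),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```
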
The report

The report should follow the CRISP-DM model, and the attached Overleaf template needs to be used. The report should include the following key points:

The presentation

The recording of the presentation

The presentation is not live, so you can give your pitch presentation at any time before the deadline, record it, and submit the recording. The assessment will be done after the deadline.

Project submission

You should submit to Moodle the following files:

  1. A link to your repository.
  2. The report.
  3. The presentation in PDF format.
  4. The recording of the presentation.
    • The recording itself, if it is less than 100 MB,
      • OR
    • a YouTube link to your video.

Grading Criteria

The project is worth 100 points in total, and it will be assessed as follows.