Final Project
Course: Big Data - IU S25
Authors: Armen Beklaryan & Firas Jolha
Project info submission
Project Report Overleaf Template
Agenda
Prerequisites
- You have access to an IU Hadoop cluster of at least 3 nodes with all required services for the project stages
Objectives
The aim of this project is to analyze a business case study of your choice using the CRISP-DM framework. You need to investigate the business problem, analyze the provided data, and present the analysis results to stakeholders and managers to address that problem. Your solution should have a data architecture that satisfies the business requirements, and the reproducibility of your work must be ensured. You should have at least one end-to-end big data and machine learning pipeline that automates data collection, storage, ingestion, analysis, reporting, and presentation. The primary stages of the project are:
- Find a business case study, its dataset and explicitly define and implement your ways for data storage.
- Implement a data pipeline to ingest the stored data into a data warehouse such as Hive for analytics. Use efficient methods for storage and retrieval.
- Perform Exploratory Data Analysis (EDA) on the data in the data warehouse and store the analysis results in monolithic or distributed databases.
- Build and train ML models using SparkML.
- Build a dashboard to present the analysis results using Apache Superset.
Description
Within the framework of this project, it is necessary to conduct a full cycle of big data analysis, from data loading to the presentation of the final results. Each student or group of students independently chooses a dataset that meets the following criteria:
- At least 300k rows of data and at least 200 MB of dataset file size.
- At least 10 explanatory variables (there may be both initial features and those created during the analysis).
- Either time or geospatial features (or both) must be present in the data. Date-only features do not count here.
- The formulation of the problem, as well as the evaluation of the results, should be carried out according to the CRISP-DM methodology (will be explained at lectures) and have practical value.
- The final reporting documents are:
- Github repository with a project
- Detailed report
- Pitch presentation
- Recording of a speech with a presentation.
As part of the labs, the software stack and the project stages will be analyzed. The lectures will discuss in detail the methodology, the necessary unit-economics calculations for the project, the reporting documents, and the presentation.
In the course of the project, you will not be monitored on passing all the intermediate stages; self-control against the project milestones is the main challenge. See how much of the work you can really do yourself, and how much your productivity increases when there are no tight deadlines. No postponement of the final project submission deadline is provided: from today you have enough time, regardless of all your other parallel study activities, and time management is entirely your responsibility. In case of technical issues, schedule a consultation with the TA (it is advisable to collect a set of questions from everyone rather than book individual consultations). For project management issues, as well as the methodology of its implementation, please contact the lecturer.
Instructions on Project Artifacts
Important: Before you start working on the project, submit the info of your project in the sheet attached to this tutorial.
In this section, I will explain the requirements and criteria for each artifact of the project.
The repository
Here I will present the requirements and criteria related to the project repository.
The template
Your repository should contain all required scripts and files to replicate the whole process, from data collection to presentation. The repository should be organized and documented, and have a Readme file explaining how to reproduce your work and eventually obtain the analysis results. The data pipelines should be written as shell scripts to be executed in a bash shell on the CentOS 7.9.2009 Linux distribution (on which the IU cluster is running).
All files and scripts in the repository should be documented; undocumented files will not be evaluated. Any file that fails to run due to errors will also not be evaluated.
You have to use Python as the only programming language. If you are using external packages, please add them to the requirements file and add the installation commands to the scripts. If you need to install packages that require sudo permission, ask the TA.
The scripts should generate reproducible results. For instance, if I run the script stage1.sh a second time, it should not produce any errors, so make sure you clear existing objects before creating new ones.
Important: Your repository should have a main.sh script in the root directory of the repository to run the whole work in the project.
You can be sure that any error you encounter on the cluster will also occur when I test your project, since I will evaluate it by running main.sh on the cluster under your username. Please add the location of your main.sh on the cluster to the report.
Data Analytics
Your data should be large enough to qualify as big data. Here we discuss the criteria for the complexity of the data analysis.
- The performance metrics of the ML model should be linked to the business objectives in CRISP-DM model.
- It is one of the minimum requirements in this project.
- The analysis should include some degree of creativity, and this needs to be revealed in the pitch presentation. This criterion will be assessed based on the analysis outcomes. For instance, you discovered an interesting and useful insight from the data that helps increase the revenue of the company, or you developed an AI model that hits the business objectives.
- The minimum requirement is to provide at least 5 different insights from your data in the EDA part.
- The analysis should consider and assess the potential risks after AI model deployment. For instance, if you are generating recommendations for Amazon products, one very likely risk is the cold-start issue. You have to:
- consider different scenarios for the possible risks.
- generate fake data for different scenarios
- different data distributions
- run prediction on the generated data
- assess your AI model
- discuss the potential risk mitigation strategies
- The minimum requirement is to discuss a single potential risk.
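A minimal sketch of the fake-data idea above (the decision rule and distribution parameters are invented for illustration): generate data from a shifted distribution and observe how the model's behavior changes.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the scenario is reproducible

# "Training-time" distribution of a single feature.
train_x = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Risk scenario: after deployment the feature drifts to a new mean.
drift_x = rng.normal(loc=2.0, scale=1.0, size=10_000)

# A toy fitted decision rule standing in for a trained model.
def predict(x):
    return (x > 0.0).astype(int)

# Under drift the positive-prediction rate jumps from about 50% to
# nearly 98% -- a clear signal that the deployed model needs
# monitoring and a mitigation strategy.
train_rate = predict(train_x).mean()
drift_rate = predict(drift_x).mean()
```

The same pattern scales up: generate one synthetic dataset per scenario, score it with your real model, and compare the metrics against the training-time baseline.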
- The analysis should involve building diverse AI models where comparison is possible. The assessment here will be based on:
- Used feature selection/creation methods such as PCA, SVD, etc. You can find some methods in the Spark API; feel free to use any library that supports the Spark DataFrame API and performs the processing in the distributed environment.
- The minimum requirement is to use all features from your data.
- Built different supervised models
- Classical ML methods
- Linear regression
- SVM
- Logistic Regression
- Decision Trees
- …etc
- Non-classical ML methods
- Feed-forward Neural networks
pyspark.ml.classification.MultilayerPerceptronClassifier
- Boosting methods
- Gradient-Boosted Trees
pyspark.ml.classification.GBTClassifier
- CatBoost
- …etc
- Ensemble Learning
- The minimum requirement is to train at least 2 different models (classical and non-classical).
- ML model learning curve
- The hyperparameters are optimized
- The random generators are seeded
- The code can generate reproducible results
- Used test/train split size
- The minimum requirement is to optimize 3 hyperparameters with k > 2 folds using grid search for both models.
- ML Model Interpretation
- You can use SHAP or LIME for neural networks. Check here to see how you can parallelize the computation using Spark UDFs.
- For classical methods you can use the coefficients of the model. For instance, in a decision tree you can use the decisions that lead to the tree branches; the method model.toDebugString will return the learned tree model.
- The interpretation should be connected to the business problem and aim to fulfill the objectives.
- The minimum requirement is not included here.
The report
The report should follow the CRISP-DM model, and the attached Overleaf template must be used. The report includes the following key points:
- The business problem statement
- It should be clear and contain a formula for the dependency between the business objectives and the performance of the AI model (the best model performance is not always the best solution for the business problem).
- The minimum requirement is to provide a clear statement
- The appearance of the report should be well designed, including font size, margins, etc.
- A well-written, easy-to-read document has to be clear, uniform, and pleasing to look at.
- The minimum requirement is to follow the report template.
- The report should include multiple sections/chapters for covering the stages of CRISP-DM involving the project stages of the big data pipeline. The sections should be linked and the transition is smooth.
- The minimum requirement is to cover only core sections in the project report template.
- The report should include a table for the participation of each team member (student). The table includes:
- the percentage of participation for each student.
- the performed tasks individually.
- The minimum requirement is to include this table.
The presentation
- It should be a pitch presentation that summarizes your work and your report (not a lot of words on each slide).
- The minimum requirement is to deliver summary of your work.
- The presentation should be prepared to deliver it to the stakeholders of the project.
- The minimum requirement is to prepare to deliver the presentation to the TA.
- The presentation should involve the following key points:
- business problem
- business objectives
- descriptive data analysis (data characteristics)
- exploratory data analysis
- hyperparameters and performance of the models
- the discussion of your results
- the challenges you faced
- could be related to the hardware/software …etc
- the future prospects
- participation of each student in the project
- The minimum requirement is to cover the key points.
The recording of the presentation
The presentation is not live, so you can give your pitch presentation at any time before the deadline, record it, and submit the recording. The assessment will be done after the deadline.
- The participation of each team member will be assessed.
- The minimum requirement is that each team member participates in the presentation.
- The presentation should not exceed 15-20 minutes.
- The minimum requirement is to deliver a pitch presentation of at most 25 minutes.
- You can use targeted dress/background which can impress the stakeholders. No restriction on the creativity of the presentation.
- The minimum requirement is not included here.
- The grader can call the student for a meeting to ask about or confirm some topics related to the project.
- The minimum requirement is to come to the meeting.
Project submission
You should submit to Moodle the following files:
- A link to your repository.
- The report.
- The presentation in PDF format.
- The recording of the presentation, as either:
- the recording file itself, if it is less than 100 MB, or
- a YouTube link to your video.
Grading Criteria
The project has 100 total points and it will be assessed as follows.
- A successful project that fulfills the minimum requirements earns 50 points.
- If the project does not fulfill the minimum requirements, the total will be less than 50 points.
- A project that satisfies all requirements, includes creative solutions, addresses the business problem, and impresses the stakeholders with the presented pitch earns the full 100 points.
- All other projects will score between 50 and 100 points, graded according to the following individual criteria:
- 20 points for the repository and complexity of the analysis.
- 20 points for the report.
- 10 points for the presentation.