Capstone Project

Course: MLOps engineering
Author: Firas Jolha

Note: This document is subject to change. Any possible change will be announced to the targeted audience. The project instructions are not limited to this document.

Submission sheet

MLOps.project.2024

Agenda

Capstone Project
Submission sheet
Agenda
Technology Stack
Description
Project submission
Dataset Criteria
Team Formation Criteria
Grade distribution
Possible cases for penalties

Technology Stack

Description

The aim of this project is to apply the CRISP-ML process to a machine learning project, from business and data understanding through to model deployment and monitoring. The phases of the process are as follows:

Business and data understanding
- Elicit the project requirements and formulate the business problem
- Specify business goals and machine learning goals
- Determine the success criteria for business and machine learning modeling
- Analyse the risks and set mitigation approaches
Data engineering/Preparation
- Create ETL pipelines using Apache Airflow
- Perform data transformation
- Check the quality of the data and perform data cleaning
- Create ML-ready features and store them in a feature store such as Feast
Model engineering
- Select and build ML models
- Perform and track experiments using MLflow
- Optimize models and select best models
- Version models in the MLflow model registry
Model validation
- Prepare one model for production
- Check the success criteria of machine learning
- Check the business and machine learning modeling objectives
- The business stakeholders take part in this phase
- Check if deploying the model is feasible
- Check the quality of the model for production
- Select one model to be deployed
Model deployment
- Search for options available to serve the model
- Deploy the model
- Create a REST endpoint for your model prediction using Flask or FastAPI
- Create a UI for your model using Streamlit or pure HTML and JS
- Create a CI/CD pipeline for your model using GitHub Actions and Docker
Model monitoring and maintenance
- Check for drift using Evidently AI
- Generate some synthetic data and feed it to your ML project
- Check again for drift using Evidently AI
- Check if retraining the model is needed
- Check if more data engineering is needed
- Deploy a new model, commit and push the code
- Check that the CI/CD pipeline is working without issues
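
The monitoring steps above (check for drift, feed in synthetic data, check again) can be illustrated with a small standard-library sketch. It does not use Evidently AI, which is the tool the project asks for; instead it computes a Population Stability Index (PSI) by hand to show the underlying idea for a single numeric feature. The function name `psi`, the bin count, and the Gaussian sample data are all assumptions made for illustration.

```python
import math
import random


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Bin edges come from the reference sample's min/max. A small epsilon
    replaces empty-bin proportions so that log() is always defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the
            # first or last bin.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        eps = 1e-6
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


random.seed(0)
# Hypothetical feature values: a reference batch, a new batch from the same
# distribution, and a synthetic batch whose mean has been shifted.
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(1.5, 1.0) for _ in range(5000)]

psi_same = psi(reference, same_dist)
psi_shifted = psi(reference, shifted)
print(f"PSI, same distribution: {psi_same:.4f}")
print(f"PSI, shifted batch:     {psi_shifted:.4f}")
```

A commonly cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these thresholds are a convention rather than a requirement of this project. In the project itself, the drift reports would come from Evidently AI rather than a hand-written metric.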

The deliverables of this project are:

The project repository
The presentation
[Only for Master’s students] A report

Note: The requirements of these deliverables will be determined in labs and shared later.

Project submission

You should submit the following to Moodle:

A link to your repository.
- A public repo OR a private repo where you add me (https://github.com/firas-jolha) as a contributor.
The presentation in PDF format or a link to your presentation on Google Slides.
[Only for Master’s students] The report in PDF format.

Dataset Criteria

The dataset should be real (not synthetic) and collected from a real business.
The dataset should have at least 10 explanatory variables (features). The features should be neither anonymized nor encoded. They should be well documented, with detailed descriptions to support data understanding.
The dataset should have at least one categorical feature, one text feature and one time-related (date, time or datetime) feature. The objective here is to deal with different types of data.
The dataset should have at least 50K rows. The data will be partitioned into 5 batches (10K rows each for a 50K-row dataset), and one batch will be ingested into the pipeline every few minutes (for instance, every 2 minutes).
The goal of the ML task should be determined in advance (regression, classification, etc.). You will use the same DL/ML frameworks that you worked with in your ML courses. The choice of framework is up to you.
[Only for Master’s students] You must use a DL framework such as TensorFlow (Keras is OK) or PyTorch. Using sklearn or traditional ML models is not accepted.
The preferred format of the dataset files from the source is CSV. Other file formats are also fine.
The dataset can be static. Collecting dynamic data from data sources such as REST APIs is also welcome.
[Only for Master’s students] The dataset should contain null values.
The TA needs to verify your dataset.
You should find a dataset before the third lab (the end of the first phase of CRISP-ML).
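
The batching rule above (at least 50K rows, split into 5 batches) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_into_batches` is hypothetical, the rows are placeholder integers standing in for real dataset records, and the batches are taken in the order the rows appear. Scheduling the ingestion every few minutes would be handled by the pipeline (for example an Apache Airflow DAG) and is not shown here.

```python
def split_into_batches(rows, n_batches=5):
    """Split rows into n_batches contiguous batches of near-equal size.

    If len(rows) is not divisible by n_batches, the first few batches
    receive one extra row each.
    """
    size, extra = divmod(len(rows), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(rows[start:end])
        start = end
    return batches


# Placeholder rows: 50,000 integers standing in for real dataset records.
rows = list(range(50_000))
batches = split_into_batches(rows)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} rows")
```

Contiguous slices keep the original row order, which may matter if the dataset is ordered by its time-related feature; whether to sort or shuffle before splitting is a design choice the source does not specify.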

Team Formation Criteria

Each teammate should participate in the project and the presentation.
You have to work in teams of 3 people.
- Each team consists of three personas:
  - Data engineer
    - Mainly performs the tasks of phase 2 (data engineering/preparation) of CRISP-ML.
    - Creates data pipelines using Apache Airflow and works with databases.
    - Is responsible for building automated data pipelines for data scientists and ML engineers.
  - Data scientist
    - Mainly performs the tasks of phases 1 and 3 (data understanding, model engineering) of CRISP-ML.
    - Prepares the data for ML modeling.
    - Prepares production-ready models.
    - Runs and tracks experiments with MLflow.
    - Works with the ML engineer in phase 4 (model validation).
  - ML engineer
    - Focuses on the tasks of phases 5 and 6 (model deployment and model monitoring) of CRISP-ML.
    - Works with the data scientist in phase 4 (model validation).
    - Builds CI/CD pipelines.
    - Deploys ML models.
    - Monitors ML models and informs the data scientists in case of issues.
- All personas work on phase 1 (business and data understanding) of CRISP-ML.
- In exceptional situations, you need to contact the TA.
You should select a team before the second lab.
Submit your teams in the sheet provided at the beginning of this document.

Grade distribution

Bachelor’s students

[20] The presentation
[80] The repository
- 15 points for phase I.
- 15 points for phase II.
- 10 points for phase III.
- 10 points for phase IV.
- 15 points for phase V.
- 15 points for phase VI.

The project is worth 100 points in total. The final grade of the project is computed as follows:

$\text{final grade} = \operatorname{round}\left(60 \cdot \frac{\text{total points}}{100}\right)$
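
As a worked example of the formula above, the following sketch scales a project total to the 60-point final grade. The function name `final_grade` and the sample scores are made up for illustration. Note one assumption: the source does not say how exact halves are rounded, and Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rule the instructor applies.

```python
def final_grade(total_points):
    """Scale a 0-100 project total to the 60-point final grade."""
    if not 0 <= total_points <= 100:
        raise ValueError("total_points must be between 0 and 100")
    return round(60 * total_points / 100)


# Hypothetical Bachelor's scores: 18/20 for the presentation, 67/80 for the repository.
presentation_points = 18
repository_points = 67
total_points = presentation_points + repository_points  # 85

print(final_grade(total_points))  # 60 * 85 / 100 = 51.0, so this prints 51
```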

Master’s students

[15] The report
[15] The presentation
[70] The repository
- 10 points for phase I.
- 10 points for phase II.
- 10 points for phase III.
- 10 points for phase IV.
- 10 points for phase V.
- 10 points for phase VI.
- 10 points for additional efforts and creative projects.

The project is worth 100 points in total. The final grade of the project is computed as follows:

$\text{final grade} = \operatorname{round}\left(60 \cdot \frac{\text{total points}}{100}\right)$

Possible cases for penalties

The student did not attend the presentation day
- Remove the points assigned for the project defence.
- Apply a penalty of [-70%] to each of the other project deliverables.
Late submissions
- Apply a penalty of [-70 points].
- If the student submits after the defence day, the submission is not considered and receives zero points.
The student did not contribute to the project
- Add penalties to the student’s grade based on the contribution table
  - The penalty depends on the situation and the percentage of contribution

Note: The penalty cases are not limited to this list. You need to ask the instructor to check for penalties in other cases.