Capstone Project

Course: MLOps engineering
Author: Firas Jolha

Note: This document is subject to change. Any possible change will be announced to the targeted audience. The project instructions are not limited to this document.

Submission sheet

Technology Stack

Description

The aim of this project is to apply the CRISP-ML process to a machine learning project, from business and data understanding through model deployment and monitoring. The phases of the process are as follows:

[Figure: the CRISP-ML process cycle, with the phases Business and data understanding, Data preparation, Model engineering, Model validation, Model deployment, and Model monitoring and maintenance]
  1. Business and data understanding
    • Elicit the project requirements and formulate business problem
    • Specify business goals and machine learning goals
    • Determine the success criteria for business and machine learning modeling
    • Analyse the risks and set mitigation approaches
  2. Data engineering/Preparation
    • Create ETL pipelines using Apache Airflow
    • Perform data transformation
    • Check the quality of the data and perform data cleaning
    • Create ML-ready features and store them in a feature store such as Feast
  3. Model engineering
    • Select and build ML models
    • Perform and track experiments using MLflow
    • Optimize models and select best models
    • Version models in the MLflow model registry
  4. Model validation
    • Prepare one model for production
    • Check the success criteria of machine learning
    • Check the business and machine learning modeling objectives
    • The business stakeholders take part in this phase
    • Check if deploying the model is feasible
    • Check the quality of the model for production
    • Select one model to be deployed
  5. Model deployment
    • Search for options available to serve the model
    • Deploy the model
    • Create a REST endpoint for your model prediction using Flask or FastAPI
    • Create a UI for your model using Streamlit or pure HTML and JS
    • Create a CI/CD pipeline for your model using GitHub Actions and Docker
  6. Model monitoring and maintenance
    • Check for drift using Evidently AI
    • Generate some synthetic data and feed it to your ML project
    • Check again for drift using Evidently AI
    • Check if retraining the model is needed
    • Check if more data engineering is needed
    • Deploy a new model, commit and push the code
    • Check if CI/CD pipeline is working without issues
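
The monitoring steps in phase 6 (check for drift, feed in synthetic data, check again, decide whether to retrain) can be illustrated with a small standard-library sketch. This is not Evidently AI's API: it swaps in a hand-written Population Stability Index (PSI) as a stand-in drift metric. The function names `psi` and `needs_retraining`, the 10-bin layout, and the 0.2 threshold (a commonly quoted rule of thumb) are all assumptions made for illustration.

```python
import math
import random


def psi(reference, current, n_bins=10):
    """Population Stability Index between two numeric samples (illustrative stand-in)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


PSI_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff, not a course requirement


def needs_retraining(reference, current, threshold=PSI_THRESHOLD):
    """Flag drift on a single feature when PSI exceeds the threshold."""
    return psi(reference, current) > threshold


# Hypothetical data: one reference batch, one batch from the same
# distribution, and one synthetic batch with a shifted mean.
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]

print("same distribution drifted:", needs_retraining(reference, same_dist))
print("shifted batch drifted:", needs_retraining(reference, shifted))
```

In the project itself, the drift report would come from Evidently AI on the real feature set; the sketch only shows the shape of the decision that a drift score feeds into.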

The deliverables of this project are:

  1. The project repository
  2. The presentation
  3. [Only for Master’s students] A report

Note: The requirements of these deliverables will be determined in labs and shared later.

Project submission

You should submit to Moodle the following files:

  1. A link to your repository.
  2. The presentation in PDF format or a link to your presentation on Google Slides.
  3. [Only for Master’s students] The report in PDF format.

Dataset Criteria

  1. The dataset should be real (not synthetic) and collected from some real business.
  2. The dataset should have at least 10 explanatory variables (features). The features should be neither anonymized nor encoded. They should be well documented, with detailed descriptions that support data understanding.
  3. The dataset should have at least one categorical feature, one text feature, and one time-related feature (date, time, or datetime). The objective here is to deal with different types of data.
  4. The dataset should have at least 50K rows. The data will be partitioned into 5 batches (10K rows each for 50K rows), and a new batch will be ingested into the pipeline every few minutes (for instance, every 2 minutes).
  5. The ML task (regression, classification, etc.) should be determined in advance. You will use the same DL/ML frameworks that you worked with in your ML courses; the choice of framework is up to you.
  6. [Only for Master’s students] You have to work with a DL framework such as TensorFlow (Keras is OK) or PyTorch. Working with scikit-learn or traditional ML models is not accepted.
  7. Preferred format of the dataset files from the source is csv. Other types of dataset files are also fine.
  8. The dataset can be static. Collecting dynamic data from data sources such as REST APIs is good.
  9. [Only for Master’s students] The dataset should contain null values.
  10. The TA needs to verify your dataset.
  11. You should find a dataset before the third lab (end of the first phase of CRISP-ML).
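
The batch ingestion described in criterion 4 can be sketched in plain Python. This is only an illustration of the partitioning and timing logic, not an Airflow DAG; the names `split_into_batches` and `ingest_in_batches`, the `ingest` callback, and the injectable `sleep` parameter are hypothetical and introduced here so the example can run without waiting.

```python
import time


def split_into_batches(rows, n_batches=5):
    """Split rows into n_batches contiguous batches of near-equal size."""
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    base, extra = divmod(len(rows), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional row each.
        size = base + (1 if i < extra else 0)
        batches.append(rows[start:start + size])
        start += size
    return batches


def ingest_in_batches(rows, ingest, n_batches=5, interval_seconds=120,
                      sleep=time.sleep):
    """Pass each batch to `ingest`, pausing between consecutive batches."""
    batches = split_into_batches(rows, n_batches)
    for i, batch in enumerate(batches):
        ingest(i, batch)
        if i < len(batches) - 1:
            sleep(interval_seconds)


# Demo with 50K placeholder rows; the sleep is replaced by a no-op.
rows = list(range(50_000))
received = []
ingest_in_batches(rows, lambda i, batch: received.append(len(batch)),
                  sleep=lambda seconds: None)
print(received)  # [10000, 10000, 10000, 10000, 10000]
```

In the actual project the schedule would more naturally live in the orchestration tool, with each run picking up the next batch.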

Team Formation Criteria

  1. Each teammate should participate in the project and presentation.
  2. You have to work in teams of 3 people.
    • The team consists of three personas.
      • Data engineer
        • Mainly performs the tasks of phase 2 (data engineering/preparation) of CRISP-ML.
        • Creates data pipelines using Apache Airflow and deals with databases.
        • Responsible for building automated data pipelines for data scientists and ML engineers.
      • Data scientist
        • Mainly performs the tasks of phases 1 and 3 (data understanding, model engineering) of CRISP-ML.
        • Prepares the data for ML modeling.
        • Prepares production-ready models.
        • Runs and tracks experiments with MLflow.
        • Works with the ML engineer in phase 4 (model validation).
      • ML engineer
        • Focuses on performing the tasks of phases 5 and 6 (model deployment and model monitoring) of CRISP-ML.
        • Works with the data scientist in phase 4 (model validation).
        • Builds CI/CD pipelines.
        • Deploys ML models.
        • Monitors ML models and informs data scientists in case of issues.
    • All personas work on phase 1 (business and data understanding) of CRISP-ML.
    • For exceptional situations, you need to contact the TA.
  3. You should select a team before the second lab.
  4. Submit your teams in the sheet provided at the beginning of this document.

Grade distribution

Bachelor’s students

The project is worth a total of 100 points. The final grade of the project is computed as follows:

$$\text{final grade} = \operatorname{round}\left(60 \cdot \frac{\text{total points}}{100}\right)$$

Master’s students

The project is worth a total of 100 points. The final grade of the project is computed as follows:

$$\text{final grade} = \operatorname{round}\left(60 \cdot \frac{\text{total points}}{100}\right)$$
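
As a quick worked example of the grade formula, the sketch below assumes it reads as round(60 × total points / 100), so the project contributes up to 60 points to the final grade. The function name `final_grade` is hypothetical, and the document does not say how ties at .5 are rounded; Python's built-in `round` resolves them to the nearest even integer, which is an assumption here.

```python
def final_grade(total_points: float) -> int:
    """Scale project points (0 to 100) to the final grade, assuming round(60 * points / 100)."""
    if not 0 <= total_points <= 100:
        raise ValueError("total_points must be between 0 and 100")
    # Built-in round() sends .5 ties to the nearest even integer (assumed behavior).
    return round(60 * total_points / 100)


print(final_grade(85))   # 51
print(final_grade(100))  # 60
```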

Possible cases for penalties

Note: The penalty cases are not limited to this list. You need to ask the instructor to check for penalties in other cases.