Note: This document is subject to change. Any possible change will be announced to the targeted audience. The project instructions are not limited to this document.
The aim of this project is to apply the CRISP-ML process to a machine learning project, from business and data understanding through to model deployment and monitoring. The phases of the process are as follows:
Business and data understanding
Elicit the project requirements and formulate the business problem
Specify business goals and machine learning goals
Determine the success criteria for business and machine learning modeling
Analyse the risks and set mitigation approaches
Data engineering/Preparation
Create ETL pipelines using Apache Airflow
Perform data transformation
Check the quality of the data and perform data cleaning
Create ML-ready features and store them in a feature store such as Feast
Model engineering
Select and build ML models
Perform and track experiments using MLflow
Optimize models and select the best models
Version models in the MLflow model registry
Model validation
Prepare one model for production
Check the success criteria of machine learning
Check the business and machine learning modeling objectives
The business stakeholders take part in this phase
Check if deploying the model is feasible
Check the quality of the model for production
Select one model to be deployed
Model deployment
Search for options available to serve the model
Deploy the model
Create a REST endpoint for your model prediction using Flask or FastAPI
Create a UI for your model using Streamlit or pure HTML and JS
Create a CI/CD pipeline for your model using GitHub Actions and Docker
Model monitoring and maintenance
Check for drift using Evidently AI
Generate some synthetic data and feed it to your ML project
Check again for drift using Evidently AI
Check if retraining the model is needed
Check if more data engineering is needed
Deploy a new model, commit and push the code
Check if CI/CD pipeline is working without issues
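The monitoring steps above (check for drift, feed in new data, check again) can be illustrated without any library. The sketch below is not the Evidently AI API; it swaps in a hand-written two-sample Kolmogorov-Smirnov statistic on a single numeric feature. The function names `ks_statistic` and `has_drift`, the threshold of 0.1, and the generated Gaussian samples are all assumptions made for illustration only.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numeric samples."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drift(reference, current, threshold=0.1):
    """Flag drift when the KS statistic exceeds a hypothetical threshold."""
    return ks_statistic(reference, current) > threshold


# Illustrative data: a reference batch, a batch from the same distribution,
# and a synthetic batch whose mean has been shifted by one standard deviation.
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

print("same distribution drifts:", has_drift(reference, same_dist))
print("shifted batch drifts:", has_drift(reference, shifted))
```

In the project itself the drift report would come from the monitoring tool; a check of this shape is only meant to show what "drift detected" could mean when deciding whether retraining is needed.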
The deliverables of this project are:
The project repository
The presentation
[Only for Master’s students] A report
Note: The requirements of these deliverables will be determined in labs and shared later.
The presentation in PDF format or link to your presentation on Google Slides.
[Only for Master’s students] The report in PDF format.
Dataset Criteria
The dataset should be real (not synthetic) and collected from a real business.
The dataset should have at least 10 explanatory variables (features). The features should be neither anonymized nor encoded. They should be well documented, with detailed descriptions to support data understanding.
The dataset should have at least one categorical, one text and one time-related (date, time or datetime) feature. The objective here is to deal with different types of data.
The dataset should have at least 50K rows. The data will be partitioned into 5 batches, and each batch (10K rows) will be ingested into the pipeline every few minutes (for instance, every 2 minutes).
The goal of the ML task should be determined in advance (regression, classification, etc.). Here, you will use the same DL/ML frameworks that you worked with in your ML courses. The selection of the framework is up to you.
[Only for Master’s students] You have to use a DL framework such as TensorFlow (Keras is ok) or PyTorch. Using sklearn or traditional ML models is not accepted.
The preferred format of the dataset files from the source is CSV. Other types of dataset files are also fine.
The dataset can be static. Collecting dynamic data from data sources such as REST APIs is good.
[Only for Master’s students] The dataset should contain null values.
The TA needs to verify your dataset.
You should find a dataset before the third lab (end of the first phase of CRISP-ML).
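The batching requirement above (at least 50K rows, split into 5 batches of 10K) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper name `split_into_batches` is hypothetical, the rows are stand-in integers rather than real records, and keeping the batches contiguous (for example after sorting by the time-related feature) is an assumption.

```python
def split_into_batches(rows, n_batches=5):
    """Split rows into n_batches contiguous parts of near-equal size."""
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    base, extra = divmod(len(rows), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional row each.
        size = base + (1 if i < extra else 0)
        batches.append(rows[start:start + size])
        start += size
    return batches


# Stand-in for 50K dataset rows; in the project these would be real records.
rows = list(range(50_000))
batches = split_into_batches(rows)
print([len(b) for b in batches])  # [10000, 10000, 10000, 10000, 10000]
```

Each batch would then be handed to the ingestion pipeline at the chosen interval; the scheduling itself is left out of this sketch.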
Team Formation Criteria
Each teammate should participate in the project and presentation.
You have to work in teams of 3 people.
The team consists of three personas.
Data engineer
Mainly performs the tasks of phase 2 (data engineering/preparation) of CRISP-ML.
Creates data pipelines using Apache Airflow and deals with databases.
Responsible for building automated data pipelines for data scientists and ML engineers.
Data scientist
Mainly performs the tasks of phases 1 and 3 (data understanding, model engineering) of CRISP-ML.
Prepares the data for ML modeling.
Prepares production-ready models.
Runs and tracks experiments with MLflow.
Works with the ML engineer in phase 4 (model validation).
ML engineer
Focuses on performing the tasks of phases 5 and 6 (model deployment and model monitoring) of CRISP-ML.
Works with the data scientist in phase 4 (model validation).
Builds CI/CD pipelines.
Deploys ML models.
Monitors ML models and informs the data scientists in case of issues.
All personas work on phase 1 (business and data understanding) of CRISP-ML.
In exceptional situations, you need to contact the TA.
You should select a team before the second lab.
Submit your teams in the sheet provided at the beginning of this document.
Grade distribution
Bachelor’s students
[20] The presentation
[80] The repository
15 points for phase I.
15 points for phase II.
10 points for phase III.
10 points for phase IV.
15 points for phase V.
15 points for phase VI.
The total number of points in this project is 100. The final grade of the project is computed as follows:
$\text{finalgrade} = \operatorname{round}\left(60 \cdot \frac{\text{totalpoints}}{100}\right)$
Master’s students
[15] The report
[15] The presentation
[70] The repository
10 points for phase I.
10 points for phase II.
10 points for phase III.
10 points for phase IV.
10 points for phase V.
10 points for phase VI.
10 points for additional efforts and creative projects.
The total number of points in this project is 100. The final grade of the project is computed as follows:
$\text{finalgrade} = \operatorname{round}\left(60 \cdot \frac{\text{totalpoints}}{100}\right)$
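As a worked example, the formula can be read as scaling the 100 project points to at most 60 points of the final grade (60 times the total points, divided by 100, then rounded). The sketch below assumes that reading. The function name `final_grade` is hypothetical, and the document does not say how exact halves are rounded, so the use of Python's built-in `round` is an assumption.

```python
def final_grade(total_points):
    """Scale project points (0 to 100) to the final grade (0 to 60)."""
    if not 0 <= total_points <= 100:
        raise ValueError("total_points must be between 0 and 100")
    # Assumption: Python's round(), which rounds exact halves to the even integer.
    return round(60 * total_points / 100)


# Point breakdowns as listed above.
bachelor = {"presentation": 20, "repository": sum([15, 15, 10, 10, 15, 15])}
master = {"report": 15, "presentation": 15, "repository": sum([10] * 6) + 10}

print(sum(bachelor.values()), sum(master.values()))  # 100 100
print(final_grade(100))  # 60
print(final_grade(85))   # 51
```

For whole-number point totals the result never lands on an exact half, so the rounding assumption only matters if fractional points are awarded.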
Possible cases for penalties
You did not attend the presentation day.
Remove the points assigned for the project defence.
Apply a penalty of [-70%] to each of the other project deliverables.
Late submissions
Apply a penalty of [-70 points].
If a student submits after the defence day, the submission is not considered and receives zero.
The student did not contribute to the project
Apply penalties to the student’s grade based on the contribution table.
The penalty depends on the situation and the percentage of contribution.
Note: The penalty cases are not limited to this list. You need to ask the instructor to check for penalties in other cases.