Phase IV - Model testing and validation

Course: MLOps engineering
Author: Firas Jolha

Agenda

Description

Model training is followed by a model evaluation phase, also known as offline testing. During this phase, the performance of the trained model needs to be validated on a test set. Additionally, the model's robustness should be assessed using noisy or wrong input data. Furthermore, it is best practice to develop an explainable ML model to provide trust, meet regulatory requirements, and govern humans in ML-assisted decisions.

Finally, the model deployment decision should be made automatically based on success criteria or manually by domain and ML experts. As in the modeling phase, all outcomes of the evaluation phase need to be documented.

Giskard

Giskard is an open-source framework for testing ML models. Giskard provides an automatic scan functionality that is designed to detect a variety of risks associated with your ML model.

By conducting a Giskard scan, you can proactively identify and address these vulnerabilities to ensure the reliability, fairness, and robustness of your machine learning models. Check the Giskard documentation for details about the vulnerabilities of ML models it can detect.

Install Giskard

Install Giskard version 2.14.0 using pip:

pip install giskard==2.14.0

If this version introduces conflicts or issues, downgrade it. Be aware that the most recent version, 2.14.1, installs pandas 2.2.2, which could conflict with the requirements of other packages.

Note: In MLflow, Giskard is a model validation plugin and can be used as an evaluator (besides the default evaluator) for the mlflow.evaluate function. Before you run any mlflow.evaluate, make sure that you call import giskard, since mlflow.evaluate fetches the available evaluators and the Giskard evaluator is only registered once the package has been imported. You can list all available evaluators in MLflow by calling mlflow.models.list_evaluators(), as sketched below.
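
A minimal sanity check, assuming both packages are installed in the same environment (the exact list of evaluator names may differ between versions):

import giskard  # importing the package makes its evaluator available to MLflow
import mlflow

# Prints the names of the available evaluators, e.g. "default" and "giskard"
print(mlflow.models.list_evaluators())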

Giskard Scan

In the project, you should scan your model and validate it. Scan the ML model using Giskard as follows:

  1. Wrap your dataset
  2. Wrap your model
  3. Scan your model
  4. Create a test suite
  5. Add performance tests
  6. Run the test suite

Demo

Here we start from the raw dataset (before transformation) so that Giskard can analyze each column appropriately.

  1. Wrap your raw dataset
# src/validate.py

from data import extract_data # custom module
from transform_data import transform_data # custom module
from model import retrieve_model_with_alias # custom module
from utils import init_hydra # custom module
import giskard
import hydra
import mlflow


cfg = init_hydra()

version  = cfg.test_data_version

df, version = extract_data(version = version, cfg = cfg)

# Specify categorical columns and target column
TARGET_COLUMN = cfg.data.target_cols[0]

CATEGORICAL_COLUMNS = list(cfg.data.cat_cols) + list(cfg.data.bin_cols)

dataset_name = cfg.data.dataset_name


# Wrap your Pandas DataFrame with giskard.Dataset (validation or test set)
giskard_dataset = giskard.Dataset(
    df=df,  # A pandas.DataFrame containing raw data (before pre-processing) and including ground truth variable.
    target=TARGET_COLUMN,  # Ground truth variable
    name=dataset_name, # Optional: Give a name to your dataset
    cat_columns=CATEGORICAL_COLUMNS  # List of categorical columns. Optional, but improves quality of results if available.
)
  2. Wrap your model

model_name = cfg.model.best_model_name

# You can sweep over challenger aliases using Hydra
model_alias = cfg.model.best_model_alias

model: mlflow.pyfunc.PyFuncModel = retrieve_model_with_alias(model_name, model_alias = model_alias)  

client = mlflow.MlflowClient()

mv = client.get_model_version_by_alias(name = model_name, alias=model_alias)

model_version = mv.version

transformer_version = cfg.data_transformer_version

def predict(raw_df):
    X = transform_data(
                        df = raw_df, 
                        version = version, 
                        cfg = cfg, 
                        return_df = False, 
                        only_transform = True, 
                        transformer_version = transformer_version, 
                        only_X = True
                      )

    return model.predict(X)

predictions = predict(df[df.columns].head())
print(predictions)

The raw_df argument of the predict function contains only the input features and does not contain the target, even though we pass the full dataframe when we create the Giskard dataset.

NOTE: Make sure that the model.predict function returns probabilities and not labels as prediction results. For sklearn models, you can do that by specifying pyfunc_predict_fn as predict_proba when you log the MLflow model. In PyTorch, you need to add an output layer that outputs probabilities and not logits.
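
As an illustration, a hedged sketch of logging a scikit-learn classifier so that the resulting pyfunc model returns probabilities (the toy data and model below are placeholders, not part of the project code):

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data and model, only to illustrate the logging call
X_train, y_train = make_classification(n_samples=100, n_features=5, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=clf,
        artifact_path="model",
        pyfunc_predict_fn="predict_proba",  # pyfunc predict() now returns probabilities, not labels
    )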

giskard_model = giskard.Model(
  model=predict,
  model_type = "classification", # regression
  classification_labels=list(cfg.data.labels),  # The order MUST be identical to the prediction_function's output order
  feature_names = df.columns, # By default all columns of the passed dataframe
  name=model_name, # Optional: give it a name to identify it in metadata
  # classification_threshold=0.5, # Optional: Default: 0.5
)
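
Before scanning, it is a good idea to sanity-check the wrapping itself. A minimal check (the exact attributes of the returned prediction object may vary between Giskard versions):

# Optional: validate the wrapped model against the wrapped dataset
wrapped_results = giskard_model.predict(giskard_dataset)
print(wrapped_results.prediction)  # predicted labels for the wrapped dataset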
  3. Scan your model
scan_results = giskard.scan(giskard_model, giskard_dataset)

# Save the results in an HTML file
scan_results_path = f"reports/validation_results_{model_name}_{model_version}_{dataset_name}_{version}.html"
scan_results.to_html(scan_results_path)
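
Optionally, the HTML report can also be attached to an MLflow run as an artifact so that it is tracked alongside the model (a sketch, assuming the MLflow tracking setup from the previous phases):

# Optional: log the scan report to MLflow for traceability
with mlflow.start_run(run_name=f"validation_{model_name}_{model_version}"):
    mlflow.log_artifact(scan_results_path)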
  4. Create a test suite
suite_name = f"test_suite_{model_name}_{model_version}_{dataset_name}_{version}"
test_suite = giskard.Suite(name = suite_name)

# We can also generate a test suite from the scan results
# test_suite = scan_results.generate_test_suite(suite_name)
# We will not do this for simplicity
  5. Add performance tests
test1 = giskard.testing.test_f1(model = giskard_model, 
                                dataset = giskard_dataset,
                                threshold=cfg.model.f1_threshold)

test_suite.add_test(test1)

Check the Giskard documentation for the full list of built-in performance tests; a couple of additional examples are sketched below.
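
For illustration, other built-in performance tests can be added in the same way (the thresholds below are placeholders; verify the exact test names and signatures in the Giskard documentation):

# Illustrative extra tests; thresholds are placeholders
test2 = giskard.testing.test_accuracy(model = giskard_model,
                                      dataset = giskard_dataset,
                                      threshold = 0.7)

test3 = giskard.testing.test_precision(model = giskard_model,
                                       dataset = giskard_dataset,
                                       threshold = 0.6)

test_suite.add_test(test2)
test_suite.add_test(test3)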

  6. Run the test suite
test_results = test_suite.run()

if test_results.passed:
    print("Passed model validation!")
else:
    print("Model has vulnerabilities!")

Project tasks

Note: The project tasks are graded, and they form the practice part of the course. There are tasks for the repository as well as for the report (for Master's students only).

A. Repository

  1. Set the aliases of the models you selected in Phase III to challenger1, challenger2, etc. (see the sketch after this list). We assume that there is no champion model yet, since the champion will be selected after model validation.
  2. Retrieve all the challenger models from Phase III.
  3. Prepare a Giskard dataset and model to scan the challenger models. For the dataset, use the test set, which should be a different sample from your training sample. Test the models on one of the test data samples.
  4. Save the report in the reports folder. Name the reports f"test_suite_{model_name}_{model_version}_{dataset_name}_{testdata_version}.html", so that they include the model name, model version, dataset name, and test dataset version. Generate a new report for every new model version or new dataset version.
  5. Select an evaluation metric for your model based on the ML success criteria from Phase I. Create a Giskard test suite and add one performance test with the threshold you defined in the ML success criteria in Phase I, for instance, the F1 score should be at least 0.6. Set the threshold and run the test suite.
  6. Among the challenger models, select the model which passes the test suite and has the fewest (major or minor) issues.
  7. If no challenger model passes the tests (invalid model), you need to improve your models until you get a model which can pass the validation tests.
  8. When the selected challenger model passes all tests, tag this model as champion in the MLflow registry UI (or set the alias programmatically, as sketched after this list) and proceed to deploy it.
  9. Add your model validation code to src/validate.py and add an entry point validate to the MLproject file to run this module (python src/validate.py).
  10. Add another entry point transform to MLproject to run the ZenML pipeline.
  11. Add another entry point extract to MLproject to test the Airflow DAG you defined in Phase II. You can test an Airflow DAG <dag-id> as follows:
    airflow dags test <dag-id>
    
  12. Check that your entry points are running successfully without errors.
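
For tasks 1 and 8, the aliases can also be set programmatically instead of through the UI. A sketch with placeholder model names and versions:

from mlflow import MlflowClient

client = MlflowClient()

# Placeholders: replace with your registered model names and version numbers
client.set_registered_model_alias(name="my_model", alias="challenger1", version="1")
client.set_registered_model_alias(name="my_model", alias="challenger2", version="2")

# After validation, promote the selected challenger to champion
client.set_registered_model_alias(name="my_model", alias="champion", version="2")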

Note: For simplicity, we do not ask you to fix the vulnerabilities before deployment, but you are required to understand them and explain your model's vulnerabilities on the presentation day.

B. Report [Only for Master’s students]

Complete the following chapters:

References