Course: MLOps engineering
Author: Firas Jolha
Consequently, model training is followed by a model evaluation phase, also known as offline testing. During this phase, the performance of the trained model needs to be validated on a test set. Additionally, the model's robustness should be assessed using noisy or erroneous input data. Furthermore, it is best practice to develop an explainable ML model to provide trust, meet regulatory requirements, and govern humans in ML-assisted decisions.
Finally, the model deployment decision should be made automatically based on success criteria or manually by domain and ML experts. As in the modeling phase, all outcomes of the evaluation phase need to be documented.
Giskard is an open-source framework for testing ML models. Giskard provides an automatic scan functionality that is designed to detect a variety of risks associated with your ML model. By running a Giskard scan, you can proactively identify and address these vulnerabilities to ensure the reliability, fairness, and robustness of your Machine Learning models. Check the Giskard documentation for details about the vulnerabilities it can detect.
Install Giskard version 2.14.0 using pip:
pip install giskard==2.14.0
If this version introduces conflicts or issues, downgrade it. Be careful: the most recent version 2.14.1 installs pandas version 2.2.2, which could introduce conflicts with the requirements of other packages.
Note: In MLflow, giskard is a model validation plugin and can be used as an evaluator (besides the default evaluator) for the mlflow.evaluate function. Before you run any mlflow.evaluate call, make sure that you call import giskard, since mlflow.evaluate fetches the available evaluators and the giskard evaluator is only registered once the package has been imported. You can list all available evaluators in MLflow by calling mlflow.models.list_evaluators().
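For illustration, here is a minimal, self-contained sketch of this integration using a toy dataset and a toy sklearn model (none of these names come from the project code); it only demonstrates the call pattern of mlflow.evaluate with the giskard evaluator.

import giskard  # importing giskard registers its MLflow evaluator plugin
import mlflow
import pandas as pd
from sklearn.linear_model import LogisticRegression

print(mlflow.models.list_evaluators())  # should now include 'giskard' besides 'default'

# Toy data and model, for demonstration only
eval_df = pd.DataFrame({"x1": [0.1, 0.9, 0.2, 0.8], "x2": [1.0, 0.0, 1.0, 0.0], "y": [0, 1, 0, 1]})
clf = LogisticRegression().fit(eval_df[["x1", "x2"]], eval_df["y"])

with mlflow.start_run():
    model_info = mlflow.sklearn.log_model(clf, "model")
    mlflow.evaluate(
        model=model_info.model_uri,
        data=eval_df,
        targets="y",                 # name of the ground-truth column in eval_df
        model_type="classifier",
        evaluators=["giskard"],      # run the giskard evaluator instead of the default one
    )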
In the project, you should scan your model and validate it. Scan the ML model using Giskard as follows. Here we start from the raw dataset (before transformation) so that Giskard can analyze each column appropriately.
# src/validate.py
from data import extract_data # custom module
from transform_data import transform_data # custom module
from model import retrieve_model_with_alias # custom module
from utils import init_hydra # custom module
import giskard
import hydra
import mlflow
cfg = init_hydra()
version = cfg.test_data_version
df, version = extract_data(version = version, cfg = cfg)
# Specify categorical columns and target column
TARGET_COLUMN = cfg.data.target_cols[0]
CATEGORICAL_COLUMNS = list(cfg.data.cat_cols) + list(cfg.data.bin_cols)
dataset_name = cfg.data.dataset_name
# Wrap your Pandas DataFrame with giskard.Dataset (validation or test set)
giskard_dataset = giskard.Dataset(
    df=df,                            # A pandas.DataFrame containing raw data (before pre-processing) and including the ground truth variable
    target=TARGET_COLUMN,             # Ground truth variable
    name=dataset_name,                # Optional: give a name to your dataset
    cat_columns=CATEGORICAL_COLUMNS,  # List of categorical columns. Optional, but improves quality of results if available.
)
model_name = cfg.model.best_model_name
# You can sweep over challenger aliases using Hydra
model_alias = cfg.model.best_model_alias
model: mlflow.pyfunc.PyFuncModel = retrieve_model_with_alias(model_name, model_alias = model_alias)
client = mlflow.MlflowClient()
mv = client.get_model_version_by_alias(name = model_name, alias=model_alias)
model_version = mv.version
The <model>.predict function of the MLflow model accepts a features dataframe (the transformed dataframe), but here we will feed it the raw dataframe, so we need to write a custom predict function that transforms the data into features and then calls <model>.predict.
transformer_version = cfg.data_transformer_version
def predict(raw_df):
    X = transform_data(
        df=raw_df,
        version=version,
        cfg=cfg,
        return_df=False,
        only_transform=True,
        transformer_version=transformer_version,
        only_X=True,
    )
    return model.predict(X)
predictions = predict(df[df.columns].head())
print(predictions)
The argument raw_df of the predict function contains only the input features and does not contain the target, even though we pass the full dataframe when we create the Giskard dataset.
NOTE: Make sure that the model.predict function returns probabilities and not labels as prediction results. For sklearn models, you can do that by specifying pyfunc_predict_fn as predict_proba when you log the MLflow model. In pytorch, you need to add an output layer that outputs probabilities and not logits.
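For sklearn, a minimal sketch of this could look as follows; the toy data and model below are assumptions for illustration only, not the project's training code.

import mlflow
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data, for demonstration only
X = pd.DataFrame({"x1": [0.1, 0.9, 0.2, 0.8], "x2": [1.0, 0.0, 1.0, 0.0]})
y = [0, 1, 0, 1]
clf = LogisticRegression().fit(X, y)

with mlflow.start_run():
    # pyfunc_predict_fn makes the pyfunc wrapper call predict_proba instead of predict
    model_info = mlflow.sklearn.log_model(clf, "model", pyfunc_predict_fn="predict_proba")

loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(X))  # per-class probabilities, not hard labels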
giskard_model = giskard.Model(
    model=predict,                               # the custom prediction function defined above
    model_type="classification",                 # or "regression"
    classification_labels=list(cfg.data.labels), # The order MUST be identical to the prediction function's output order
    feature_names=df.columns,                    # Optional. By default, all columns of the passed dataframe
    name=model_name,                             # Optional: give it a name to identify it in metadata
    # classification_threshold=0.5,              # Optional. Default: 0.5
)
scan_results = giskard.scan(giskard_model, giskard_dataset)
# Save the results in `html` file
scan_results_path = f"reports/validation_results_{model_name}_{model_version}_{dataset_name}_{version}.html"
scan_results.to_html(scan_results_path)
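Optionally (this is not required by the task), you can also attach the HTML report to an MLflow run using the standard mlflow.log_artifact API; the run name below is just an example and reuses variables defined earlier in this script.

# Optional: attach the scan report to an MLflow run for traceability
with mlflow.start_run(run_name=f"validation_{model_name}_{model_version}"):
    mlflow.log_artifact(scan_results_path, artifact_path="validation_reports")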
suite_name = f"test_suite_{model_name}_{model_version}_{dataset_name}_{version}"
test_suite = giskard.Suite(name = suite_name)
# We can also generate a test suite from the scan results
# test_suite = scan_results.generate_test_suite(suite_name)
# We will not do this for simplicity
test1 = giskard.testing.test_f1(
    model=giskard_model,
    dataset=giskard_dataset,
    threshold=cfg.model.f1_threshold,
)
test_suite.add_test(test1)
Check the Giskard documentation for a list of available performance tests.
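For instance, you could extend the same suite with other performance tests from giskard.testing such as test_accuracy, test_precision, and test_recall; the threshold values below are placeholders, not values required by the project.

# Additional performance tests (threshold values are placeholders, tune them per project)
test_suite.add_test(giskard.testing.test_accuracy(model=giskard_model, dataset=giskard_dataset, threshold=0.7))
test_suite.add_test(giskard.testing.test_precision(model=giskard_model, dataset=giskard_dataset, threshold=0.6))
test_suite.add_test(giskard.testing.test_recall(model=giskard_model, dataset=giskard_dataset, threshold=0.6))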
test_results = test_suite.run()
if test_results.passed:
    print("Passed model validation!")
else:
    print("Model has vulnerabilities!")
Note: The project tasks are graded, and they form the practice part of the course. We have tasks for the repository as well as for the report (for Master's students).
1. Alias your registered model versions as challenger1, challenger2, …etc. We assume that we do not have a champion model yet, since the champion will be selected after model validation.
2. Store the validation reports in the reports folder. Name the reports as f"test_suite_{model_name}_{model_version}_{dataset_name}_{testdata_version}.html", so that each report includes the model name, model version, dataset name, and version of the test dataset. You should generate a new report for every new model version or new dataset version.
3. The f1 metric should be at least 0.6. Set the threshold and run the test suite.
4. If the model passes validation, promote it to champion and go and deploy it.
5. Write the module src/validate.py and add an entry point validate to the MLproject file to run this module (python src/validate.py); see the MLproject sketch after this list.
6. Add an entry point transform to MLproject to run the ZenML pipeline.
7. Add an entry point extract to MLproject to test the Airflow DAG you defined in phase 2. We can test an Airflow DAG <dag-id> as follows:
airflow dags test <dag-id>
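A minimal sketch of how these MLproject entry points might look is shown below; the project name and the script that launches the ZenML pipeline are assumptions and should be adapted to your repository.

# MLproject (sketch)
name: mlops_project                                # hypothetical project name

entry_points:
  validate:
    command: "python src/validate.py"
  transform:
    command: "python pipelines/transform_data.py"  # hypothetical script that runs the ZenML pipeline
  extract:
    command: "airflow dags test <dag-id>"          # replace <dag-id> with your DAG id from phase 2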
Note: For simplicity, we did not ask you to fix the vulnerabilities before deployment, but you are required to understand them and explain your model's vulnerabilities on the presentation day.
Complete the following chapters: