Course: MLOps engineering
Author: Firas Jolha
Machine learning models are complicated. Every use case has a long list of challenges at every stage, from concept to training to deployment.
Existing DevOps strategies, which help businesses streamline software development, do not always apply. Machine learning operations, or MLOps, follows the spirit of DevOps. It helps businesses use ML efficiently and reliably.
If you are already familiar with the concept of DevOps, you may think that MLOps is just DevOps for machine learning. While the concept and goals are similar, the details are significantly different.
If DevOps is a set of practices for building a traditional house, MLOps is the process for building a house on Mars. The builders still need to plan the basics: architecture, utilities, and labor. However, they also need critical input and feedback from experts in space logistics and the Martian environment. They need to get the right materials to Mars, plus they must ensure that the house can withstand the harsh Martian conditions.
While MLOps and DevOps both care about scalability and reproducibility, their scale, complexity, and methods can be significantly different.
Imagine a pizza restaurant franchise that wants to use ML to sell more pizza. The team responsible for this project might follow these steps:
Example: The team wants to make an ML model that suggests discounts, coupons, and other promotions. The team names it “Pizzatron”. Pizzatron runs on current sales data and sends promotion recommendations to store owners. It is simple, isn’t it?
With our simplified Pizzatron example, there are complexities at every phase of the machine learning development chain. To begin with, in practice, the “team” in the example usually consists of three or more teams. To do this analysis well, you need data scientists. Experts in building and deploying ML systems are known as ML engineers. Teams of software engineers work together to prepare data and construct models.
Even if we assume that everyone involved does their job well, there are bound to be mistakes.
Pizzatron requires constant validation. Any problems send Pizzatron right back to the development teams, who must go through the development steps, again. ML models can come back, again and again, to cause trouble for their creators.
If these teams are going to do complicated work together on a regular basis, they need a process. They need automation, defined stages, and workflows. This is exactly the structure that MLOps aims to provide. As explained by the authors of ml-ops.org:
“With [MLOps], we want to provide an end-to-end machine learning development process to design, build, and manage reproducible, testable, and evolvable ML-powered software.”
MLOps is a set of processes and automation for managing models, data and code to improve performance, stability and long-term efficiency in ML systems.
An optimal MLOps experience is one where Machine Learning assets are treated consistently with all other software assets within a CI/CD environment. Machine Learning models can be deployed alongside the services that wrap them and the services that consume them as part of a unified release process. In the following, we describe a set of important concepts in MLOps such as Iterative-Incremental Development, Automation, Continuous Deployment, Versioning, Testing, Reproducibility, and Monitoring.
The fastest way to build an ML product is to rapidly build, evaluate, and iterate on models. This includes three core phases:
The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process.
How do you know?
To adopt MLOps, there are three levels of automation, starting from the initial level with manual model training and deployment, up to running both ML and CI/CD pipelines automatically.
The goal of versioning is to treat ML training scripts, ML models and data sets for model training as first-class citizens in DevOps processes by tracking ML models and data sets with version control systems.
In contrast to the traditional software development process, in ML development multiple experiments on model training can be executed in parallel before deciding which model will be promoted to production.
The complete development pipeline includes three essential components, data pipeline, ML model pipeline, and application pipeline. In accordance with this separation we distinguish three scopes for testing in ML systems: tests for features and data, tests for model development, and tests for ML infrastructure.
The picture below shows that model monitoring can be implemented by tracking the precision, recall, and F1-score of the model predictions over time. A decrease in precision, recall, or F1-score triggers model retraining, which leads to model recovery.
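As a rough illustration (not part of the original material), a minimal monitoring check that triggers retraining could look like the sketch below; the threshold value and the trigger_training_pipeline() entry point are assumptions.
from sklearn.metrics import f1_score  # assuming scikit-learn is available

F1_THRESHOLD = 0.8  # hypothetical acceptance threshold

def should_retrain(y_true, y_pred, threshold=F1_THRESHOLD) -> bool:
    """Return True when the observed F1-score drops below the threshold."""
    return f1_score(y_true, y_pred) < threshold

# if should_retrain(production_labels, model_predictions):
#     trigger_training_pipeline()  # hypothetical entry point of the retraining job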
The “ML Test Score” measures the overall readiness of the ML system for production. The final ML Test Score is computed as follows:
Points | Description |
---|---|
0 | More of a research project than a productionized system. |
(0,1] | Not totally untested, but it is worth considering the possibility of serious holes in reliability. |
(1,2] | There has been a first pass at basic productionization, but additional investment may be needed. |
(2,3] | Reasonably tested, but it is possible that more of those tests and procedures may be automated. |
(3,5] | Strong level of automated testing and monitoring. |
>5 | Exceptional level of automated testing and monitoring. |
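The final score is usually reported as the minimum of the points accumulated in each of the four test sections (Data, Model, Infrastructure, Monitoring); a minimal sketch under that assumption, with made-up section totals:
# Hypothetical per-section point totals; in practice each automated test adds points.
section_points = {
    "data_tests": 3.0,
    "model_tests": 2.5,
    "infrastructure_tests": 4.0,
    "monitoring_tests": 3.5,
}
ml_test_score = min(section_points.values())  # the weakest section bounds the score
print(ml_test_score)  # 2.5 -> falls in the (2,3] row of the table above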
Reproducibility in a machine learning workflow means that every phase of data processing, ML model training, and ML model deployment should produce identical results given the same input.
These metrics are:
How often does your organization deploy code to production or release it to end-users?
How long does it take to go from code committed to code successfully running in production?
How long does it generally take to restore service when a service incident or a defect that impacts users occurs (e.g., unplanned outage or service impairment)?
What percentage of changes to production or released to users result in degraded service (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)?
These metrics have been found useful to measure and improve one's ML-based software delivery.
The complete ML development pipeline includes three levels where changes can occur: Data, ML Model, and Code. This means that in machine learning-based systems, the trigger for a build might be the combination of a code change, data change or model change.
Standardization of approaches and processes makes it possible to unify and scale the best practices of research and development management, including extending them to other domains. For example, CRISP-DM (Cross-Industry Standard Process for Data Mining), as the most common methodology for performing data science projects, describes their life cycle in 6 phases, each of which includes a number of tasks.
A similar approach for machine learning was later published under the name CRISP-ML(Q). It is an acronym for CRoss-Industry Standard Process model for the development of Machine Learning applications with Quality assurance methodology. Based on CRISP-DM, the CRISP-ML(Q) process model describes six steps:
Datasets themselves are a core part of the success of models. This is why data gathering, preparation, and labeling should be seen as an iterative process, just like modeling. Start with a simple dataset that you can gather immediately, and be open to improving it based on what you learn.
This iterative approach to data can seem confusing at first. In ML research, performance is often reported on standard datasets that the community uses as benchmarks and are thus immutable. In traditional software engineering, we write deterministic rules for our programs, so we treat data as something to receive, process, and store.
ML engineering combines engineering and ML in order to build products. Our dataset is thus just another tool to allow us to build products. In ML engineering, choosing an initial dataset, regularly updating it, and augmenting it is often the majority of the work. This difference in workflow between research and industry is illustrated in the figure above.
At a high level, ML problem framing consists of two distinct steps:
A. To understand the problem, perform the following tasks:
App | Goal |
---|---|
Weather app | Calculate precipitation in six-hour increments for a geographic region. |
Banking app | Identify fraudulent transactions. |
B. After verifying that your problem is best solved using either a predictive ML or a generative AI approach, you’re ready to frame your problem in ML terms. You frame a problem in ML terms by completing the following tasks:
I recommend completing this course from Google (https://developers.google.com/machine-learning/problem-framing).
Confirming the feasibility before setting up the ML project is a best practice in an industrial setting. Applying the Machine Learning Canvas (ML Canvas) framework would be a structured way to perform this task. The ML Canvas guides you through the prediction and learning phases of the ML application. In addition, it enables all stakeholders to specify data availability, regulatory constraints, and application requirements such as robustness, scalability, explainability, and resource demand.
The Machine Learning Canvas is a template for developing new or documenting existing predictive systems that are based on machine learning techniques. It is a visual chart with elements describing a question to predict answers to (with machine learning), objectives to reach in the application domain, ways to use predictions to reach these objectives, data sources used to learn a predictor, and performance evaluation methods.
The template can be downloaded from here.
There are not many case studies of CRISP-ML published on the Internet, but you can find a lot of studies for the CRISP-DM process. CRISP-ML combines the business and data understanding phases of CRISP-DM into one phase, but these phases have similar objectives. So, the case study I will present here is related to CRISP-DM, with some modifications. There is another case study of CRISP-ML mentioned in the References section.
CRISP-DM
CRISP-ML
The initial phase is concerned with tasks to define the business objectives and translate them to ML objectives, to collect and verify the data quality, and to finally assess the project feasibility.
Follow the attached notebook for more details.
Hydra is an open-source Python framework that simplifies the development of research and other complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line. Some features of Hydra:
OmegaConf is a YAML based hierarchical configuration system, with support for merging configurations from multiple sources (files, CLI argument, environment variables) providing a consistent API regardless of how the configuration was created. OmegaConf also offers runtime type safety via Structured Configs.
To get started using Hydra, you can watch this video
Here I will present some of the features of this tool and some of its use cases. The official website has good tutorials and guides, so follow them to learn more about the tool.
# It is a Python package.
pip install hydra-core
Note: A Python decorator is a function that takes another Python function and extends the behavior of the latter function without explicitly modifying it. Example:
def mydecorator(func):
def wrapper():
print("Before")
func()
print("After")
return wrapper
@mydecorator
def say_whee():
print("Welcome!")
Configuration file: configs/main.yaml
db:
driver: mysql
user: james
password: secret
Python file: src/app.py
import hydra
from omegaconf import DictConfig, OmegaConf
@hydra.main(version_base=None, config_path="configs", config_name="main")
def app(cfg : DictConfig = None) -> None:
print(OmegaConf.to_yaml(cfg))
print(cfg.db.password)
print(cfg['db']['password'])
if __name__ == "__main__":
app()
You can learn more about OmegaConf here later.
Note: Create a new virtual environment using venv or conda in VS Code.
We can run our simple application as:
python src/app.py
The output will be:
db:
driver: mysql
user: james
password: secret
secret
secret
We will have the Hydra output directory outputs and the application log file. Inside the configuration output directory we have:
- config.yaml: A dump of the user-specified configuration.
- hydra.yaml: A dump of the Hydra configuration.
- overrides.yaml: The command line overrides used.
And in the main output directory:
- app.log: A log file created for this run.
We can override our configuration as follows:
python app.py db.password=secret2 +db.dbname=project_db ++version=1.2.4 ~db.driver
The output:
db:
user: james
password: secret2
dbname: project_db
version: 1.2.4
secret2
secret2
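For reference, the override prefixes used in the command above follow Hydra's override grammar (a brief summary, worth double-checking against the Hydra docs):
# db.password=secret2    overrides an existing value
# +db.dbname=project_db  adds a key that does not exist in the config
# ++version=1.2.4        adds the key, or overrides it if it already exists
# ~db.driver             deletes the key from the composed config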
You may want to alternate between two different databases. To support this, create a config group named db, and place one config file for each alternative inside it. The directory structure of our application now looks like:
├── configs
│   ├── main.yaml
│   └── db
│       ├── mysql.yaml
│       └── postgresql.yaml
├── src
│   └── app.py
The primary configuration file: configs/main.yaml
defaults:
- db: mysql
db:
user: james
The configuration file: configs/db/mysql.yaml
driver: mysql
password: secret_mysql
user: jones
The configuration file: configs/db/postgresql.yaml
driver: psql
password: secret_psql
user: jones
defaults is a special directive telling Hydra to use db/mysql.yaml when composing the configuration object. The resulting cfg object is a composition of the configs from the defaults list with the configs specified in your main.yaml.
You can now choose which database configuration to use and override values from the command line:
python src/app.py db=postgresql
The output:
db:
driver: psql
password: secret_psql
user: james
secret_psql
secret_psql
Use the _self_ entry in the Defaults List, and check what changes in the output. The _self_ entry determines the relative position of this config in the Defaults List. If it is not specified, it is added automatically as the last item.
You can run your function multiple times with different configurations easily with the --multirun|-m flag.
python src/app.py --multirun db=mysql,postgresql
We can populate config values dynamically via value interpolation. We can specify missing values in configuration files using the ??? literal.
The primary configuration file: configs/main.yaml
db:
user: james
age: ???
current_age: ${db.age}
age_string: "This is my age ${db.current_age}"
defaults:
- db: mysql
- _self_
The script file: src/app.py
"""
My application
"""
import hydra
from omegaconf import DictConfig, OmegaConf
@hydra.main(config_path="../configs", config_name="main", version_base=None)
def app(cfg: DictConfig = None):
"""
Entry function of my application
"""
print(OmegaConf.to_yaml(cfg))
print(cfg.db.password)
print(cfg['db']['password'])
print(cfg.db.age)
print(cfg.db.current_age)
print(cfg.db.age_string)
if __name__ == "__main__":
app()
The command line:
python src/app.py db.age=10
The output:
db:
driver: mysql
password: secret_mysql
user: james
age: 10
current_age: ${db.age}
age_string: This is my age ${db.current_age}
secret_mysql
secret_mysql
10
10
This is my age 10
Sometimes a config file does not belong in any config group. You can still load it by default. Here is an example for some_file.yaml.
defaults:
- some_file
Config files that are not part of a config group will always be loaded. They cannot be overridden. Prefer using a config group.
The Compose API can compose a config similarly to @hydra.main() anywhere in the code. The Compose API is useful when @hydra.main() is not applicable.
Prior to calling compose(), you have to initialize Hydra. This can be done by using the standard @hydra.main() or by calling one of the initialization methods listed below:
- initialize(): Initialize with a config path relative to the caller.
- initialize_config_module(): Initialize with a config_module (absolute name).
- initialize_config_dir(): Initialize with a config_dir on the file system (absolute path).
All 3 initialization ways can be used as methods or contexts. When used as methods, they initialize Hydra globally and should only be called once. When used as contexts, they initialize Hydra within the context and can be used multiple times.
Example:
from hydra import compose, initialize
from omegaconf import OmegaConf
if __name__ == "__main__":
# context initialization
with initialize(version_base=None, config_path="configs", job_name="test_app"):
cfg = compose(config_name="main", overrides=["db=mysql", "db.user=me"])
print(OmegaConf.to_yaml(cfg))
# global initialization
initialize(version_base=None, config_path="configs", job_name="test_app")
cfg = compose(config_name="main", overrides=["db=mysql", "db.user=me"])
print(OmegaConf.to_yaml(cfg))
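For completeness, a hedged sketch of the initialize_config_dir variant; it requires an absolute path, and here we assume the configs folder sits inside the current working directory (adapt the path to your layout).
from pathlib import Path
from hydra import compose, initialize_config_dir
from omegaconf import OmegaConf

# Build an absolute path to the configs folder (assumption: it is in the CWD)
config_dir = str(Path.cwd() / "configs")

with initialize_config_dir(version_base=None, config_dir=config_dir, job_name="test_app"):
    cfg = compose(config_name="main", overrides=["db=postgresql"])
    print(OmegaConf.to_yaml(cfg))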
Three Hydra initializers:
- initialize: adds the config_path to the config search path. The config_path is relative to the parent of the caller.
- initialize_config_module: adds the config_module to the config search path. The config module must be importable (an __init__.py must exist at its top level).
- initialize_config_dir: the config_dir is always a path on the file system and must be an absolute path. Relative paths will result in an error.
Structured Configs use Python dataclasses to describe your configuration structure and types. We will not cover this section here, but you can learn about it on your own from here.
Data Version Control (DVC) is a free, open-source tool for data management. It allows you to track large datasets and version your data.
💫 DVC is your “Git for data”!
pip install dvc
We will use a local directory as a remote storage for versioning data files. You can check the documentation if you want to use a different type of remote storage.
dvc init
There is a VS code extension for DVC, you can install it from here.
This will initialize the DVC repository. A few internal files are created that should be added to Git.
We should commit this change to Git as follows:
git add .
git commit -m "initialize DVC"
DVC is technically not a version control system by itself! It manipulates .dvc files, whose contents define the data file versions. Git is already used to version your code, and now it can also version your data alongside it.
The internal files:
- .dvc files (“dot DVC files”) are placeholders to track data files and directories.
- .dvc/config: This is the default DVC configuration file. It can be edited by hand or with the dvc config command.
- .dvc/.gitignore: This is a .gitignore file to untrack the folders config.local, /tmp, /cache.
- .dvc/config.local: This is an optional Git-ignored configuration file that will overwrite options in .dvc/config. This is useful when you need to specify sensitive values (secrets, passwords) which should not reach the Git repo (credentials, private locations, etc). This config file can also be edited by hand or with dvc config --local.
- .dvc/cache: Default location of the cache directory. The cache stores the project data in a special structure. The data files and directories in the workspace will only contain links to the data files in the cache.
- .dvc/tmp: Directory for miscellaneous temporary files.
The metafiles of DVC are typically versioned with Git, as DVC does not replace Git's version control features, but rather extends them.
We will use a local directory as a remote storage but feel free to use the supported remote storage options if you have sufficient access. DVC remotes are similar to Git remotes (e.g. GitHub or GitLab hosting), but for cached data instead of code.
- Create a datastore folder in the root directory of the project repository.
- Add localstore as a remote storage:
dvc remote add --default localstore $PWD/datastore
# localstore is a placeholder for the remote storage
# $PWD/datastore is the path of the remote storage
# Add the data store to .gitignore to untrack it
echo datastore >> .gitignore
This command line will modify the .dvc/config file and you need to commit this change to the Git repo.
git add .dvc/config
git commit -m "add remote storage"
Notes on data versioning using DVC:
To version a data file data.csv:
- Do not add data.csv to your Git repository.
- Add data.csv to the DVC repository.
Here I will download my data data.csv, version it, introduce a change to it, after that I will version the change, and switch between versions:
Download the data using wget as follows:
mkdir -p data/raw
wget https://archive.ics.uci.edu/static/public/222/data.csv -O data/raw/data.csv
If your data is stored in a Git or DVC repository, then you can download it using dvc as follows:
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/raw/data.csv
dvc add data/raw/data.csv
DVC stores information about the added file in a special .dvc file named data/raw/data.csv.dvc. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking. The actual data file data.csv is added to .gitignore to untrack it from the Git repository.
git add data/raw/data.csv.dvc data/raw/.gitignore
Now the metadata about your data is versioned alongside your source code, while the original data file was added to .gitignore.
git commit -m "add raw data"
git push
git tag -a "v1.0" -m "add data version v1.0"
git push --tags
dvc push
Info: DVC stores the data in the remote storage and cache, whereas Git tracks only the .dvc metafiles, which contain the hash codes of the data files.
Here I will download a new version of the data data.csv and version it.
9. Download new data.
wget https://archive.ics.uci.edu/static/public/96/data.csv -O data/raw/data.csv
You can check the status of DVC repository by running:
dvc status
The option -c|--cloud shows the status of the local cache (.dvc/cache) compared to the remote repository.
dvc add data/raw/data.csv
git add .
git commit -m "add new data"
git push
git tag -a "v2.0" -m "A new version of the data V2.0"
git push --tags
dvc push
You can list the tags of a Git repo as:
git tag
You can actually delete the file from the data folder since it is stored in the remote storage. You will notice that the data is also saved locally in the cache folder .dvc/cache using file hashes.
You can restore the previous versions of the data from Git repository by checking out the corresponding commit. You can use the tags that you defined as follows:
git checkout v1.0 data/raw/data.csv.dvc
This will restore the data.csv.dvc metafile.
In order to restore the actual file, you need to checkout DVC too:
dvc checkout
You can add and commit the selected version then push it.
git commit data/raw/data.csv.dvc -a -m "Restored data v1.0"
You do not need to push it to DVC again since this version is already stored in the remote storage.
Notice that if you checkout without specifying the file (e.g. git checkout v1.0), then all other tracked files (e.g. config files, source code, …etc) at the commit tagged “v1.0” will also be checked out (be careful). This will also detach the HEAD pointer to point to the commit which contains the data of version v1.0. You can attach it to the main branch by running git switch main.
dvc.api Python API
DVC comes as a VS Code extension, as a command line interface, and as a Python API. Here we will use dvc.api to obtain the data files based on the selected version.
# configs/main.yaml
data:
version: 1.0
path: data/raw/data.csv
repo: .
remote: localstore
# src/app.py
import pandas as pd
import hydra
import dvc.api
import os.path
@hydra.main(config_path="../configs", config_name="main", version_base=None)
def app(cfg=None):
url = dvc.api.get_url(
path=os.path.join("data/raw/data.csv"), # the path to the file in the remote storage
repo=os.path.join(cfg.data.repo), # the path/url of the DVC repository
rev=str(cfg.data.version), # the version of the data to restore
remote=cfg.data.remote # The placeholder of the remote storage
)
print(url) # It will be the path of the file in the remote storage
# Read the data
df = pd.read_csv(url)
print(df.head())
if __name__=="__main__":
app()
You can set the version in the config file and read the data with the requested version.
- The data file data.csv in the data folder can be deleted for some reason, but you still need to keep its path the same as when you added it to the DVC repository, since DVC also tracks the path of the file in the data.csv.dvc file.
- If data/data.csv is deleted, we can still restore it by running dvc pull, since it is stored in the remote storage.
- The metafile data.csv.dvc is important; do not delete it, since it records how to restore your actual data file.
Some other functions which could be useful from dvc.api:
from dvc.api import DVCFileSystem
fs = DVCFileSystem(".") # /path/to/local/repository
url = "https://remote/repository" # remote repository
fs = DVCFileSystem(url, rev="main")
# Open a file
with fs.open("data.csv", mode="r") as f:
print(f.readlines()[0])
# Reading a file
text = fs.read_text("get-started/data.xml", encoding="utf-8")
contents = fs.read_bytes("get-started/data.xml")
# Listing all DVC-tracked files recursively
fs.find("/", detail=False, dvc_only=True)
# Listing all files (including Git-tracked)
fs.find("/", detail=False)
# Downloading a file or a directory
fs.get_file("data/data.xml", "data.xml")
## This downloads "data/data.xml" file to the current working directory as "data.xml" file
fs.get("data", "data", recursive=True)
## This downloads all the files in "data" directory - be it Git-tracked or DVC-tracked into a local directory "data".
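Besides DVCFileSystem, dvc.api also offers open() and read() for loading a tracked file at a given revision directly into memory; a small sketch (the v1.0 tag refers to the version created earlier):
import pandas as pd
import dvc.api

# Stream a tracked file for a given revision
with dvc.api.open("data/raw/data.csv", repo=".", rev="v1.0", mode="r") as f:
    df = pd.read_csv(f)
    print(df.head())

# Or load the whole content at once as a string
text = dvc.api.read("data/raw/data.csv", repo=".", rev="v1.0")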
Testing is one of the important practices to do in MLOps pipelines. Writing and maintaining boilerplate tests is not easy, so you should leverage all the tools at your disposal to make it as painless as possible. pytest
is one of the best tools that you can use to boost your testing productivity. In ML projects, we need to test or validate data, code and model. pytest
is intended to validate the codebase of the project.
pip install pytest
The AAA (Arrange-Act-Assert) pattern is a widely recognized and effective approach to structuring unit tests. This pattern provides a clear and organized structure for writing tests, enhancing readability and maintainability. By following the AAA pattern, developers can create robust and reliable unit tests that validate the behavior of their code.
The Arrange-Act-Assert pattern:
- Arrange: set up the inputs and preconditions for the test.
- Act: invoke the code under test.
- Assert: verify that the result matches the expectation.
Example:
def sum(a, b):
return a + b
def test_sum():
# Arrange
x = 1
y = 1
# Act
result = sum(x, y)
# Assert
assert result == 2
unittest short demo
If you have written unit tests for your Python code before, then you may have used Python’s built-in unittest module. The package unittest provides a solid base on which to build your test suite, but it has a few shortcomings.
Testing frameworks typically hook into your test’s assertions so that they can provide information when an assertion fails. unittest
, for example, provides a number of helpful assertion utilities out of the box. However, even a small set of tests requires a fair amount of boilerplate code.
# src/dummy.py
def sum(a, b):
return a + b
# tests/test_sum_unittest.py
from unittest import TestCase
from src.dummy import sum
class TryTesting(TestCase):
def test_sum_pn(self):
x = 1
y = -1
result = sum(x, y)
self.assertTrue(result == 0)
def test_sum_pp(self):
x = 1
y = 1
result = sum(x, y)
self.assertGreater(result, 2)
You can run unittest tests from the command line using the discover option.
python -m unittest discover tests
Note: By default, the test modules and functions should start with “test*”, such as “test_sum”, “test_x.py”, “test_fun”, …etc.
The module unittest
is a built-in testing framework in Python but includes a lot of boilerplate code. pytest
simplifies this workflow by allowing you to use normal functions and Python’s assert
keyword directly.
pytest vs unittest
Here I created two dummy functions with their test functions.
# src/dummy.py
def sum(a, b):
return a + b
def mul(a, b):
return a * b
# tests/test_sum.py
from src.dummy import sum
def test_sum_pn():
x = 1
y = -1
result = sum(x, y)
assert result == 0
def test_sum_pp():
x = 1
y = 1
result = sum(x, y)
assert result > 2
# tests/test_mul.py
from src.dummy import mul
def test_mul_pn():
x = 1
y = -1
result = mul(x, y)
assert result < 0
def test_mul_pp():
x = 1
y = 1
result = mul(x, y)
assert result > 0
You can run tests from command line using pytest
command.
pytest
If you get an error “ModuleNotFoundError: No module named ‘src’”, then it is probably because you did not add an __init__.py file to the src and tests folders. Add them and run the command again.
The output:
================= test session starts =================
platform win32 -- Python 3.12.0, pytest-8.2.2, pluggy-1.5.0
rootdir: C:\Users\Admin\mlops\test\mlops-dvc-test01
plugins: hydra-core-1.3.2
collected 4 items
tests\test_mul.py .. [ 50%]
tests\test_sum.py .F [100%]
====================== FAILURES =======================
_____________________ test_sum_pp _____________________
def test_sum_pp():
x = 1
y = 1
result = sum(x, y)
> assert result > 2
E assert 2 > 2
tests\test_sum.py:17: AssertionError
=============== short test summary info ===============
FAILED tests/test_sum.py::test_sum_pp - assert 2 > 2
============= 1 failed, 3 passed in 0.27s =============
pytest is able to find all the tests we want it to run as long as we name them according to the pytest naming conventions, which are:
- Test files should be named test_<something>.py or <something>_test.py.
- Test functions and methods should be named test_<something>.
- Test classes should be named Test<Something>.
So far we’ve seen one passing test and one failing test. However, pass and fail are not the only outcomes possible.
Here are the possible outcomes of a test:
- PASSED (.): the test ran successfully.
- FAILED (F): the test did not run successfully.
- SKIPPED (s): the test was skipped.
- XFAIL (x): the test was expected to fail, ran, and failed; it is marked with the @pytest.mark.xfail() decorator.
- XPASS (X): the test was marked with xfail, but it ran and passed.
- ERROR (E): an exception happened outside of the test function, e.g. in a fixture.
A test will fail if there is any uncaught exception. This can happen if:
- an assert statement fails, which will raise an AssertionError exception,
- the test code calls pytest.fail(), which will raise an exception, or
- any other exception is raised (e.g. with the raise keyword).
# src/cards.py
from dataclasses import asdict, dataclass
@dataclass
class Card:
summary: str = None
owner: str = None
state: str = "todo"
id: int = 0
def from_dict(cls, d):
return Card(**d)
def to_dict(self):
return asdict(self)
# tests/test_alt_fail.py
import pytest
from src.cards import Card
def test_with_fail():
c1 = Card("sit there", "brian")
c2 = Card("do something", "okken")
if c1 != c2:
pytest.fail("they don't match")
We’ve looked at how any exception can cause a test to fail. But what if a bit of code you are testing is supposed to raise an exception? How do you test for that? You use pytest.raises()
to test for expected exceptions.
# src/cards.py
class CardsDB:
def __init__(self, db_path):
self._db_path = db_path
# tests/test_exceptions.py
import pytest
from src.cards import CardsDB
def test_no_path_raises():
with pytest.raises(TypeError):
CardsDB()
We can run this test alone as follows:
pytest tests/test_exceptions.py
The output:
================= test session starts =================
platform win32 -- Python 3.12.0, pytest-8.2.2, pluggy-1.5.0
rootdir: C:\Users\Admin\mlops\test\mlops-dvc-test01
plugins: hydra-core-1.3.2
collected 1 item
tests\test_exceptions.py . [100%]
================== 1 passed in 0.04s ==================
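You can additionally check the exception message with the match argument of pytest.raises(), which matches a regular expression against the message. A small sketch (the test file name is hypothetical):
# tests/test_exceptions_match.py
import pytest
from src.cards import CardsDB

def test_no_path_error_message():
    # match= checks the exception message against a regular expression
    with pytest.raises(TypeError, match="db_path"):
        CardsDB()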
So far we have written test functions within test modules. pytest
also allows us to group tests with classes.
# tests/test_card.py
from src.cards import Card
class TestEquality:
def test_equality(self):
c1 = Card("something", "brian", "todo", 123)
c2 = Card("something", "brian", "todo", 123)
assert c1 == c2
def test_equality_with_diff_ids(self):
c1 = Card("something", "brian", "todo", 123)
c2 = Card("something", "brian", "todo", 4567)
assert c1 == c2
def test_inequality(self):
c1 = Card("something", "brian", "todo", 123)
c2 = Card("completely different", "okken", "done", 123)
assert c1 != c2
class TestCardIds:
def test_equality_id(self):
c1 = Card("something", "brian", "todo", 123)
c2 = Card("something", "brian", "todo", 122)
assert c1.id == c2.id
def test_equality_with_diff_ids(self):
c1 = Card("something", "brian", "todo", 123)
c2 = Card("something", "brian", "todo", 4567)
assert c1.id != c2.id
def test_inequality(self):
c1 = Card("something", "brian", "todo", 123)
c2 = Card("completely different", "okken", "done", 123)
assert c1.id != c2.id
def test_hello():
msg = "hello"
assert msg == "hello"
We can run a subset of tests as follows:
pytest tests/test_card.py::TestCardIds
pytest tests/test_card.py
pytest tests/test_card.py::test_hello
pytest tests/test_card.py::TestCardIds::test_inequality
pytest tests
Fixtures are functions that are run by pytest before (and sometimes after) the actual test functions.
You can use fixtures to get a data set for the tests to work on. Fixtures are also used to get data ready for multiple tests.
import pandas as pd
import pytest
@pytest.fixture()
def some_data():
"""
Loads some data
"""
df = pd.read_csv("data.csv")
return df
# some_data is the function name (fixture)
def test_some_data(some_data):
assert len(some_data) > 0
The @pytest.fixture()
decorator is used to tell pytest
that a function is a fixture. When you include the fixture name in the parameter list of a test function, pytest
knows to run it before running the test. Fixtures can do work, and can also return data to the test function.
A fixture has a scope that controls how often it is set up and torn down:
- scope='function': run once per test function (the default).
- scope='class': run once per test class.
- scope='module': run once per module.
- scope='session': run once per test session.
You can trace fixture setup and teardown with the --setup-show option of pytest.
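A minimal sketch of a module-scoped fixture (the resource is a stand-in dictionary, not a real database connection):
import pytest

@pytest.fixture(scope="module")
def db_connection():
    """Created once per test module and shared by all tests in it."""
    conn = {"status": "open"}      # stand-in for an expensive resource
    yield conn
    conn["status"] = "closed"      # teardown runs after the last test in the module

def test_connection_is_open(db_connection):
    assert db_connection["status"] == "open"
Running pytest --setup-show prints when each fixture is set up and torn down.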
# tests/test_auto.py
import pytest
import time
@pytest.fixture(autouse=True, scope="session")
def footer_session_scope():
"""Report the time at the end of a session."""
yield
now = time.time()
print("--")
print(
"finished : {}".format(
time.strftime("%d %b %X", time.localtime(now))
)
)
print("-----------------")
@pytest.fixture(autouse=True)
def footer_function_scope():
"""Report test durations after each function."""
start = time.time()
yield
stop = time.time()
delta = stop - start
print("\ntest duration : {:0.3} seconds".format(delta))
def test_1():
"""Simulate long-ish running test."""
time.sleep(1)
def test_2():
"""Simulate slightly longer test."""
time.sleep(1.23)
import pytest
from pathlib import Path
from tempfile import TemporaryDirectory
from src import cards
@pytest.fixture()
def cards_db():
with TemporaryDirectory() as db_dir:
db_path = Path(db_dir)
db = cards.CardsDB(db_path)
yield db
db.close()
def test_empty(cards_db):
assert cards_db.count() == 0
Parametrized testing refers to adding parameters to our test functions and passing in multiple sets of arguments to the test to create new test cases.
# test_param.py
import pytest
@pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
def test_eval(test_input, expected):
assert eval(test_input) == expected
Here, the @parametrize
decorator defines three different (test_input,expected)
tuples so that the test_eval
function will run three times using them in turn.
In pytest, markers are a way to tell pytest there’s something special about a particular test.
pytest’s builtin markers are used to modify the behavior of how tests run.
@pytest.mark.parametrize() is a marker.
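A few of the other builtin markers in action (a small sketch; the reasons are placeholders):
import sys
import pytest

@pytest.mark.skip(reason="feature not implemented yet")
def test_not_ready():
    assert False  # never executed because the test is skipped

@pytest.mark.skipif(sys.version_info < (3, 10), reason="requires Python 3.10+")
def test_only_on_new_python():
    assert True

@pytest.mark.xfail(reason="known bug, fix is planned")
def test_known_bug():
    assert 0.1 + 0.2 == 0.3  # fails today, reported as XFAIL instead of FAILED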
Great Expectations is a tool for validating and documenting your data. Software developers have long known that automated testing is essential for managing complex codebases. Great Expectations brings the same discipline, confidence, and acceleration to data science and data engineering teams. It is a Python library that provides a framework for describing the acceptable state of data and then validating that the data meets those criteria.
pip install great_expectations
We can import the package as follows:
import great_expectations as gx
When working with GX you use the following four core components to access, store, and manage underlying objects and processes:
A Data Context contains all the metadata used by GX, the configurations for GX objects, and the output from validating data.
We have 3 data context types: Ephemeral, File, and Cloud data contexts.
We can create an Ephemeral data context as follows:
context = gx.get_context()
# This will create a temporary folder in /tmp to hold the data context files
We can create a File data context as follows:
from great_expectations.data_context import FileDataContext
context = FileDataContext(context_root_dir = "../services")
# This will create a folder "../services/gx" in the current working directory and will store the data context files
# context = FileDataContext(project_root_dir = "../services")
# This will create a folder "../services/gx" in the current working directory and will store the data context file
If you have an existing File data context, then you can load it as follows:
context = gx.get_context(context_root_dir = "../services/gx")
# OR
# context = gx.get_context(project_root_dir = "../services")
You should work with File data context for the project.
You can convert an Ephemeral Data Context to a Filesystem Data Context.
context = context.convert_to_file_context()
This will put the data context files in a folder ./gx
.
Data Sources connect GX to Data Assets such as CSV files in a folder, a PostgreSQL database hosted on AWS, or any combination of data formats and environments. Regardless of the format of your Data Asset or where it resides, Data Sources provide GX with a unified API for working with it.
Data Sources do not modify your data.
We can use a Filesystem data source or a cloud data source but for this project, we need to use a Filesystem data source.
We can create a Pandas data source to connect to the data file stored on the local filesystem as follows:
ds1 = context.sources.add_or_update_pandas(name="my_pandas_ds")
# ds1 is PandasDatasource
We can create a Pandas data source to connect to the data files stored in a folder on the local filesystem as follows:
ds2 = context.sources.add_or_update_pandas_filesystem(name="my_pandas_ds", base_directory="../data/raw")
# base_directory is the path to the folder containing data files
# ds2 is PandasFilesystemDatasource
Data Assets are collections of records within a Data Source.
We can create a csv data asset to read the data from data file(s) stored on the local filesystem as follows:
# Read a single data file
da1 = ds1.add_csv_asset(
name = "asset01",
filepath_or_buffer="/path/to/data/file.csv"
)
# Read multiple data files
da2 = ds2.add_csv_asset(
name = "asset02",
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)
# Here we defined a regex pattern to read files only those match this pattern
Here we will try to read data from a Pandas dataframe residing in memory. So, Pandas is responsible for reading the data and loading to the memory.
import pandas as pd
df = pd.read_csv("/path/to/data/file.csv")
# ds1 is a PandasDatasource
da3 = ds1.add_dataframe_asset(name = "pandas_dataframe")
We did not pass the dataframe yet; we will do so when we build a batch request.
Data Assets can be partitioned into Batches. Batches are unique subsets of records within a Data Asset. For example, say you have a Data Asset in a SQL Data Source that consists of all records from last year in a given table. You could then partition those records into Batches of data that correspond to the records for individual months of that year.
A Batch Request specifies one or more Batches within the Data Asset. Batch Requests are the primary way of retrieving data for use in GX.
We can request data from the previous data asset as follows:
batch_request = da1.build_batch_request()
batches = da1.get_batch_list_from_batch_request(batch_request)
for batch in batches:
print(batch.batch_spec)
Here we will pass the Pandas dataframe to build the batch request.
batch_request = da3.build_batch_request(dataframe = df)
batches = da3.get_batch_list_from_batch_request(batch_request)
for batch in batches:
print(batch.batch_spec)
You can filter data when you build batch requests for data assets connected to a PSQL table or a Pandas filesystem data source. The options available for a batch request depend on the type of data asset being used; we can inspect them as follows:
print(my_asset.batch_request_options)
For instance, for file based data assets, we can use sorters to sort the retrieved data as follows:
my_asset = my_asset.add_sorters(['+year', '-month'])
# Here we can use da2 as data asset
# Here the data will be sorted by `year` in asc order and by `month` in desc order
We can use the head() method of the Batch to see the content of the dataframe.
Expectations provide a flexible, declarative language for describing expected behaviors. Unlike traditional unit tests which describe the expected behavior of code given a specific input, Expectations apply to the input data itself.
For example, you can define an Expectation that a column contains no null values. When GX runs that Expectation on your data it generates a report which indicates if a null value was found.
Expectations can be built directly from the domain knowledge of subject matter experts, interactively while introspecting a set of data, or through automated tools provided by GX.
We have four workflows to create expectations. The methodology for saving and testing Expectations is the same for all workflows.
Here, you work in a Python interpreter or Jupyter Notebook.
- You use a Validator and call Expectations as methods on it to define them in an Expectation Suite.
- When you have finished you save that Expectation Suite into your Expectation Store.
Example:
# Create a batch request
data_asset = context.get_datasource("my_datasource").get_asset("my_data_asset")
batch_request = data_asset.build_batch_request()
# Create an expectations suite
context.add_or_update_expectation_suite("my_expectation_suite")
# Verify that the expectation is created
context.list_expectation_suite_names()
# Create a validator
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name="my_expectation_suite",
)
# Check the validator content
print(validator.head())
# Add your expectations on the validator
validator.expect_column_values_to_not_be_null(
column="vendor_id"
)
validator.expect_column_values_to_not_be_null(
column="account_id"
)
validator.expect_column_values_to_be_in_set(
column="transaction_type",
value_set = ["purchase", "refund", "upgrade"]
)
# Save your expectations suite
validator.save_expectation_suite(
discard_failed_expectations = False
)
# In interactive mode, GX will add all expectations that you successfully run to the suite.
Advanced users can use a manual process to create Expectations. This workflow does not require Data Assets, but it does require knowledge of the configurations available for Expectations.
Example:
# Data assets and batch request are not needed since we will not run the expectations
# data_asset = context.get_datasource("my_datasource").get_asset("my_data_asset")
# batch_request = data_asset.build_batch_request()
# Create an expectations suite
suite = context.add_or_update_expectation_suite("my_expectation_suite")
# Verify that the expectation is created
context.list_expectation_suite_names()
# Import the class to create expectation configuration
from great_expectations.core.expectation_configuration import (
ExpectationConfiguration
)
# Create your expectations
# Expectation 1
expectation_configuration_1 = ExpectationConfiguration(
# Name of expectation type being added
expectation_type="expect_column_values_to_not_be_null",
# These are the arguments of the expectation
# The keys allowed in the dictionary are Parameters and
# Keyword Arguments of this Expectation Type
kwargs={
"column": "vendor_id"
},
# This is how you can optionally add a comment about this expectation.
# It will be rendered in Data Docs.
# See this guide for details:
# `How to add comments to Expectations and display them in Data Docs`.
meta={
"notes": {
"format": "markdown",
"content": "Some clever comment about this expectation. **Markdown** `Supported`",
}
},
)
# Add the Expectation 1 to the suite
suite.add_expectation(
expectation_configuration=expectation_configuration_1
)
# validator.expect_column_values_to_not_be_null(
# column="vendor_id"
# )
# Expectation 2
expectation_configuration_2 = ExpectationConfiguration(
# Name of expectation type being added
expectation_type="expect_column_values_to_not_be_null",
# These are the arguments of the expectation
# The keys allowed in the dictionary are Parameters and
# Keyword Arguments of this Expectation Type
kwargs={
"column": "account_id"
},
# This is how you can optionally add a comment about this expectation.
# It will be rendered in Data Docs.
# See this guide for details:
# `How to add comments to Expectations and display them in Data Docs`.
meta={
"notes": {
"format": "markdown",
"content": "Some clever comment about this expectation. **Markdown** `Supported`",
}
},
)
# Add the Expectation 2 to the suite
suite.add_expectation(
expectation_configuration=expectation_configuration_2
)
# validator.expect_column_values_to_not_be_null(
# column="account_id"
# )
# Expectation 3
expectation_configuration_3 = ExpectationConfiguration(
# Name of expectation type being added
expectation_type="expect_column_values_to_be_in_set",
# These are the arguments of the expectation
# The keys allowed in the dictionary are Parameters and
# Keyword Arguments of this Expectation Type
kwargs={
"column": "transaction_type",
"value_set": ["purchase", "refund", "upgrade"],
},
)
# validator.expect_column_values_to_be_in_set(
# column="transaction_type",
# value_set = ["purchase", "refund", "upgrade"]
# )
# Add the Expectation 3 to the suite
suite.add_expectation(
expectation_configuration=expectation_configuration_3
)
# Save your expectations suite
context.save_expectation_suite(expectation_suite=suite)
# validator.save_expectation_suite(
# discard_failed_expectations = False
# )
# In non-interactive mode, you have to manually add each expectation to the suite.
Here, you use a Data Assistant to generate Expectations based on the input data you provide. You can preview the Metrics that these Expectations are based on, and you can save the generated Expectations as an Expectation Suite in an Expectation Store.
You can find all supported expectations in the gallery of GX expectations
When GX validates data, an Expectation Suite helps streamline the process by running all of the contained Expectations against that data. In almost all cases, when you create an Expectation you will be creating it inside of an Expectation Suite object.
You can define multiple Expectation Suites for the same data to cover different use cases. An example could be having one Expectation Suite for raw data, and a more strict Expectation Suite for that same data post-processing. Because an Expectation Suite is decoupled from a specific source of data, you can apply the same Expectation Suite against multiple, disparate Data Sources. For instance, you can reuse an Expectation Suite that was created around an older set of data to validate the quality of a new set of data.
A Data Assistant is a utility that automates the process of building Expectations by asking questions about your data, gathering information to describe what is observed, and then presenting Metrics and proposed Expectations based on the answers it has determined. This can accelerate the process of creating Expectations for the provided data.
The following diagram illustrates the end-to-end GX data validation workflow.
A Checkpoint is the primary means for validating data in a production deployment of GX. Checkpoints provide an abstraction for bundling a Batch (or Batches) of data with an Expectation Suite (or several), and then running those Expectations against the paired data.
# Given the batch requests and the expectation suites
# We create checkpoint
checkpoint = context.add_or_update_checkpoint(
name="my_checkpoint",
validations=[ # A list of validations
{
"batch_request": batch_request,
"expectation_suite_name": "my_expectation_suite",
},
],
)
checkpoint_result = checkpoint.run()
By default, a Checkpoint only validates the last Batch included in a Batch Request. We can run the checkpoint on all batches in a batch request by converting the list of batches into a list of batch requests:
# Given a batch request
batch_list = asset.get_batch_list_from_batch_request(batch_request)
batch_request_list = [batch.batch_request for batch in batch_list]
validations = [
{
"batch_request": batch.batch_request,
"expectation_suite_name": "my_expectation_suite"
}
for batch in batch_list
]
checkpoint = context.add_or_update_checkpoint(
name="my_validator_checkpoint",
validations=validations
)
checkpoint_result = checkpoint.run()
# Build the data docs (website files)
context.build_data_docs()
# Open the data docs in a browser
context.open_data_docs()
We can retrieve a saved checkpoint later as follows:
retrieved_checkpoint = context.get_checkpoint(name="my_checkpoint")
We can add more validations (different batch requests with different expectation suites) to the checkpoint as follows:
validations = [
{
"batch_request": {
"datasource_name": "my_datasource",
"data_asset_name": "users",
},
"expectation_suite_name": "users.warning",
},
{
"batch_request": {
"datasource_name": "my_datasource",
"data_asset_name": "users.error",
},
"expectation_suite_name": "users",
},
]
checkpoint = context.add_or_update_checkpoint(
name="my_checkpoint",
validations=validations,
)
The Validation Results returned by GX tell you how your data corresponds to what you expected of it. You can view this information in the Data Docs that are configured in your Data Context. Evaluating your Validation Results helps you identify issues with your data. If the Validation Results show that your data meets your Expectations, you can confidently use it.
Data Docs are human-readable documentation generated by GX. Data Docs describe the standards that you expect your data to conform to, and the results of validating your data against those standards. The Data Context manages the storage and retrieval of this information.
We can host and share data docs locally as follows:
# Build the data docs (website files)
context.build_data_docs()
# Open the data docs in a browser
context.open_data_docs()
The Validation Results generated when a Checkpoint runs determine what Actions are performed. Typical use cases include sending email, Slack messages, or custom notifications. Another common use case is updating Data Docs sites. Actions can be used to do anything you are capable of programming in Python. Actions are a versatile tool for integrating Checkpoints in your pipeline’s workflow.
In the project, we will raise an exception if there are any failed expectations. You can get the status of the checkpoint as follows:
checkpoint_result.success
After you define your checkpoint and expectation suite, you should write a script/function to validate your data.
import great_expectations as gx
context = gx.get_context(project_root_dir = "services")
retrieved_checkpoint = context.get_checkpoint(name="my_checkpoint")
results = retrieved_checkpoint.run()
assert results.success
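If you prefer an explicit alert instead of an assert, here is a sketch with a custom exception; the exception name and the message are assumptions, and the context/checkpoint names follow the example above.
import great_expectations as gx

class DataValidationError(Exception):
    """Hypothetical custom exception used as the alert."""

def validate_initial_data():
    context = gx.get_context(project_root_dir="services")
    checkpoint = context.get_checkpoint(name="my_checkpoint")
    results = checkpoint.run()
    if not results.success:
        raise DataValidationError("Data validation failed: at least one expectation was not met.")

if __name__ == "__main__":
    validate_initial_data()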
Great Expectations comes with a CLI to perform operations as command lines. This tool will basically create a FileDataContext in a folder gx in the current working directory.
It is not recommended to use the CLI to create datasources or other objects. Use this tool only if you want to explore the GX project folder. Use the Python API (great_expectations) to do tasks with GX.
Some commands:
great_expectations init
You initialize your project only once per repository. init initializes a new Great Expectations project gx in the current working directory. If you want to specify the GX project directory, then you need to access the folder (“project_root_dir”) where the gx folder should be created by the CLI and run the command line there.
great_expectations datasource new
This will guide you to create a datasource and then open a Jupyter notebook. You run the cells to create a new data source.
great_expectations docs build
You should validate the data after you collect it and whenever you transform it. As the figure below shows, data validation happens after data collection and after each data transformation step.
In this project, you will validate the data:
Data quality (DQ) is defined as the data’s suitability for a user’s defined purpose. It is subjective, as the concept of quality is relative to the standards defined by the end users’ expectations.
Data Quality Expectations: It is possible that for the exact same data, depending on their usage, different users can have totally different data quality expectations. For example, the accounting department needs accurate data to the penny, whereas the marketing team does not, because approximate sales numbers are enough to determine the sales trends.
The six primary data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity. However, this classification is not universally agreed upon.
The completeness data quality dimension is defined as the percentage of data populated vs. the possibility of 100% fulfillment.
- expect_column_values_to_not_be_null: expect the column values to not be null.
Uniqueness is a dimension of data quality that refers to the degree to which each record in a dataset represents a unique and distinct entity or event. It measures whether the data is free of duplicates or redundant records, and whether each record represents a unique and distinct entity.
- expect_column_values_to_be_unique: expect each column value to be unique.
Timeliness is a dimension of data quality that measures the relevance and accuracy of data over time. It refers to whether the data is up to date.
- expect_column_values_to_be_between: expect the column values to be between a minimum and a maximum value.
Validity is a dimension of data quality that measures whether the data is accurate and conforms to the expected format or structure. Because invalid data can disrupt the training of AI algorithms on a dataset, organizations should establish a set of methodical business rules for evaluating data validity.
- expect_column_values_to_match_regex: expect the column values to match a specific regex format.
Accuracy is the degree to which data represent real-world things, events, or an agreed-upon source.
- expect_column_max_to_be_between: expect the column max value to be between two specific values.
Consistency is a dimension of data quality that refers to the degree to which data is uniform across a dataset.
- expect_column_values_to_be_in_set: expect each column value to be in a given set.
More than one data quality dimension can be checked using one GX expectation, and there is more than one expectation to check each data quality dimension.
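As a hedged illustration of how the dimensions map to expectations, the sketch below adds one expectation per dimension to a validator created as in the interactive workflow above; every column name, range, and pattern is hypothetical and must be replaced by your own data requirements.
validator.expect_column_values_to_not_be_null(column="customer_id")                            # completeness
validator.expect_column_values_to_be_unique(column="transaction_id")                           # uniqueness
validator.expect_column_values_to_be_between(column="year", min_value=2020, max_value=2025)    # timeliness
validator.expect_column_values_to_match_regex(column="email", regex=r"[^@]+@[^@]+\.[^@]+")     # validity
validator.expect_column_max_to_be_between(column="price", min_value=0, max_value=10000)        # accuracy
validator.expect_column_values_to_be_in_set(column="currency", value_set=["USD", "EUR"])       # consistency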
Note: The project tasks are graded, and they form the practice part of the course. We have tasks for the repository as well as for the report (for Master's students).
1. Create a repository for the project. Add a .gitignore file and add a README.md file. Add the instructor and all team members to the repository as contributors. The structure is similar to:
├───README.md # Repo docs
├───.gitignore # gitignore file
├───requirements.txt # Python packages
├───configs # Hydra configuration management
├───data # All data
├───docs # Project docs like reports or figures
├───models # ML models
├───notebooks # Jupyter notebooks
├───outputs # Outputs of Hydra
├───pipelines # A Soft link to DAGs of Apache Airflow
├───reports # Generated reports
├───scripts # Shell scripts (.sh)
├───services # Metadata of services (PostgreSQL, Feast, Apache airflow, ...etc)
├───sql # SQL files
├───src # Python scripts
└───tests # Scripts for testing Python code
Note: You can use Cookiecutter PyPackage if you prefer but it is not required in this project.
2. Install VS Code and add required and useful extensions such as Python, Docker, Jupyter, …etc.
3. Open the project repository using VS Code in some local folder.
4. Create a virtual environment using venv or conda. Activate it.
5. Add requirements.txt. Add an executable file like a script scripts/install_requirements.sh to install the requirements into the created virtual environment using pip. Run it whenever you add a new package to requirements.txt.
6. Complete the notebook business_understanding.ipynb (shared in this document) and push it to the notebooks folder. You can complete this notebook by writing in the notebook itself or by writing it in a document shared as pdf and pushed to the reports folder. Build the ML Canvas and push it to the reports folder as a pdf.
7. Use Jupyter/Colab notebooks to understand the data. The notebook should include data analysis results supported with charts as follows:
- Push the notebook to the notebooks folder.
- Add a new development branch (dev) to the repo.
- Create a function sample_data in a module src/data.py to read the data file and take a sample. Use Hydra to manage the data url in the configs and any configurations needed like the size of the sample, the name of the dataset, …etc. You need to define a structure for project configurations and use Hydra to read them from the source code. The function should read the data url from config files using Hydra and sample the data. Then, it stores the data sample in the data/samples folder as sample.csv, where the sample size is 20% of the total rows in the whole data. The sample size can be added to the config files too. The idea here is to make your project maintainable and ready for any changes with minimal modifications (DRY principle). For each new sample added, version the data using dvc to not lose any data samples.
- Call sample_data and it should generate a sample. The function should overwrite the file sample.csv if it already exists. You can manage the data versions using dvc.
- Write tests in the tests folder to test the function sample_data and make sure it can pass all tests. You can add extensions from VS Code to set up pytest.
- Write a function validate_initial_data which declares expectations about the data features and validates them. Create an expectation suite and save it. Make sure that you created a File data context in a folder services/gx where the metadata of the context will be stored.
Note: You need to write expectations to satisfy the data requirements. The data requirements should be written (and added to 1_business_data_understanding.ipynb
) to cover the data quality dimensions, which are completeness, uniqueness, timeliness, validity, consistency, and accuracy. Basically, the number of expectations is not fixed. Ideally, for each data feature we aim to validate all dimensions. For simplicity, we ask you to validate the 3 most important dimensions for each of the 7 most critical features (21-40 expectations in total). Do not try to validate the same dimension for all features; use a diverse set of quality dimensions. You should validate each dimension at least once per dataset.
Note: If any of the expectations fail, then the program should raise a custom exception (alert) and terminate. This is the action that we will use for validating the data.
16. Write a script scripts/test_data.sh to automate:
Note: You should not version the data if it is invalid.
Note: Make sure that it is automated and it gives reproducible results without side effects.
17. Commit the changes to the dev branch and merge it to the main branch.
18. You should commit and push whenever you fix some issue or start/finish building some artifact.
Properly use all the tools (hydra, pytest, dvc, …etc) that you studied in this phase.
Complete the following chapters:
Chapter 1: Introduction
Chapter 2: Business and Data understanding
This phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a machine learning problem. Your organisation may have competing objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of the project. Neglecting this step can mean that a great deal of effort is put into producing the right answers to the wrong questions.
Write a paragraph to state your business problem connected to your data, assessment of the current situation. Give the reader a short overview of the problem and the necessary background about the business that you will focus on in your report. Make sure that concepts are explained/defined. Do not forget to cite your sources.
Section 2.1. Terminology
Section 2.2. Scope of the ML project
Section 2.3. Success criteria
Section 2.5. Data collection
Section 2.6. Data quality verification
Section 2.7. Project feasibility
Section 2.8. Project plan
Push your report as a pdf to the reports folder for this phase. Add a version to the report name, such as reportv1.pdf.