Packaging models for production
by Ty Myrddin
Published on May 1, 2022
Soledad Galli and Christopher Samiullah of trainindata have a wonderful way of structuring notebooks which maps easily to modules and packages. We tried it out in the titanic and ames training wheel repos to package the resulting R&D models.
The rest is a matter of puzzling out the configurations and bringing it all together in a Tox config file.
Requirements
Following trainindata's structural setup for packaging, we specify acceptable version ranges for the project dependencies, in the spirit of the compatible release clause (see PEP 440). This gives us the flexibility to keep up with small updates and fixes, whilst ensuring we don't install a larger update which could introduce backwards incompatible changes.
For now we just wish to get it all to work, so we are rather conservative and are not taking unnecessary risks in this packaging for the pipeline. Any changes beyond these limits will require a separate branch, to prevent breaking the build.
There will be scenarios where testing is not necessary, so the requirements are split into two requirements files.
The requirements.txt:
numpy>=1.20.0,<1.21.0
pandas>=1.3.5,<1.4.0
pydantic>=1.8.1,<1.9.0
scikit-learn>=1.0.2,<1.1.0
strictyaml>=1.3.2,<1.4.0
ruamel.yaml==0.16.12
feature-engine>=1.0.2,<1.1.0
joblib>=1.0.1,<1.1.0
And the test_requirements.txt:
-r requirements.txt
# testing requirements
pytest>=6.2.3,<6.3.0
# repo maintenance tooling
black==20.8b1
flake8>=3.9.0,<3.10.0
mypy==0.812
isort==5.8.0
Install without testing with:
$ pip install -r requirements/requirements.txt
Install all of it with:
$ pip install -r requirements/test_requirements.txt
Specifying version ranges to manage dependencies and keep them stable is a common best practice, whether one uses poetry, pipenv, or just pip, as here.
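To make the effect of a range such as numpy>=1.20.0,<1.21.0 concrete, here is a minimal sketch using only the standard library. It is not how pip evaluates PEP 440 specifiers: it assumes plain numeric versions like 1.20.3, and the helper names parse_version and in_range are made up for this illustration.

```python
"""Simplified sketch of how a range like ">=1.20.0,<1.21.0" filters versions.

Not pip's implementation: only plain numeric versions such as "1.20.3" work.
"""
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "1.20.3" into (1, 20, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# The numpy range from requirements.txt: >=1.20.0,<1.21.0
for candidate in ("1.20.0", "1.20.3", "1.21.0", "2.0.0"):
    verdict = "accepted" if in_range(candidate, "1.20.0", "1.21.0") else "rejected"
    print(f"numpy {candidate}: {verdict}")
```

Patch releases within 1.20 pass, while 1.21.0 and anything later are rejected, which is the conservative behaviour described above.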
Package Configuration
The configuration objects are created from a config.yml file, and not from a Python config module.
All global constants from the Jupyter notebooks are moved into a config.yml for ames and a config.yml for titanic, in YAML format.
The core.py files can be found in the config directories. We are using pydantic, which builds on Python type hints (see PEP 484), for data validation and settings management.
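The idea of a typed, validated configuration object can be sketched as follows. This is not the code in core.py: it swaps pydantic for a standard library dataclass so that it runs without extra installs, the dictionary stands in for whatever a YAML parser would return, and every field name and value is hypothetical.

```python
"""Sketch of a validated configuration object, standard library only.

The packages described here use pydantic models; this dataclass stand-in
only illustrates the idea. All field names and values are hypothetical.
"""
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ModelConfig:
    target: str
    features: List[str]
    test_size: float
    random_state: int

    def __post_init__(self) -> None:
        # Reject values that would otherwise only fail deep inside training.
        if not isinstance(self.target, str) or not self.target:
            raise ValueError("target must be a non-empty string")
        if not self.features:
            raise ValueError("features must list at least one column")
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be between 0 and 1")


def create_config(parsed: Dict[str, Any]) -> ModelConfig:
    """Build a config object from the mapping a YAML parser would return."""
    return ModelConfig(**parsed)


# Stand-in for the parsed contents of a config.yml (made-up values).
parsed_yaml = {
    "target": "SalePrice",
    "features": ["LotArea", "YearBuilt"],
    "test_size": 0.1,
    "random_state": 0,
}
config = create_config(parsed_yaml)
print(config.target, config.test_size)
```

The benefit is the same as with pydantic: a typo or an out-of-range value in the YAML file surfaces when the configuration is loaded, not halfway through a training run.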
Tox
Tox is a generic virtualenv management and test command line tool. Its goal is to standardize testing in Python.
Using Tox we can, on multiple operating systems:
- Eliminate PYTHONPATH challenges when running scripts/tests
- Eliminate virtual environment setup confusion
- Streamline steps such as model training and model publishing
- Reduce the use of shell scripts and their pitfalls
All of Tox's configuration is in the tox.ini file:
[tox]
envlist = test_package, typechecks, stylechecks, lint
skipsdist = True
[testenv]
install_command = pip install {opts} {packages}
[testenv:test_package]
deps =
    -rrequirements/test_requirements.txt
setenv =
    PYTHONPATH=.
    PYTHONHASHSEED=0
commands=
    python regression_model/train_pipeline.py
    pytest \
    -s \
    -vv \
    {posargs:tests/}
# The environments below reuse the test_package virtualenv via envdir.
[testenv:train]
envdir = {toxworkdir}/test_package
deps =
    {[testenv:test_package]deps}
setenv =
    {[testenv:test_package]setenv}
commands=
    python regression_model/train_pipeline.py
[testenv:typechecks]
envdir = {toxworkdir}/test_package
deps =
    {[testenv:test_package]deps}
commands = {posargs:mypy regression_model}
[testenv:stylechecks]
envdir = {toxworkdir}/test_package
deps =
    {[testenv:test_package]deps}
commands = {posargs:flake8 regression_model tests}
# The leading dash tells tox to ignore a failing exit code from black.
[testenv:lint]
envdir = {toxworkdir}/test_package
deps =
    {[testenv:test_package]deps}
commands =
    isort regression_model tests
    - black regression_model tests
    mypy regression_model
    flake8 regression_model
[flake8]
exclude = .git,env
max-line-length = 90
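To see which environments a bare tox call runs and which ones only run on request, the file can be inspected with Python's configparser. This is only a rough sketch: tox's own parser does far more (substitutions such as {toxworkdir}, factors, posargs), and the string below is a trimmed copy of the tox.ini above, not the full file.

```python
"""Sketch: inspect a tox.ini with configparser (tox itself does much more)."""
import configparser

# Trimmed copy of the tox.ini above; continuation lines must be indented.
TOX_INI = """
[tox]
envlist = test_package, typechecks, stylechecks, lint
skipsdist = True

[testenv:test_package]
deps =
    -rrequirements/test_requirements.txt

[testenv:train]
envdir = {toxworkdir}/test_package

[testenv:typechecks]
envdir = {toxworkdir}/test_package

[testenv:stylechecks]
envdir = {toxworkdir}/test_package

[testenv:lint]
envdir = {toxworkdir}/test_package
"""

# interpolation=None keeps configparser from interpreting the values.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(TOX_INI)

# Environments that a bare `tox` runs, in order.
default_envs = [name.strip() for name in parser["tox"]["envlist"].split(",")]

# Every [testenv:NAME] section that is defined.
defined_envs = [
    section.split(":", 1)[1]
    for section in parser.sections()
    if section.startswith("testenv:")
]

# Defined but not in envlist: these only run when selected with `tox -e NAME`.
on_demand = [name for name in defined_envs if name not in default_envs]

print("default:", default_envs)
print("on demand:", on_demand)
```

Here train is the only on-demand environment, which is why it is invoked explicitly with tox -e train.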
Example Tox commands
Make sure tox is installed (pip install tox or, system-wide, sudo apt install tox), and that the train.csv and test.csv files are available in the production-model-package/regression_model/datasets directory when working with the ames training wheel repo.
To train the regression model (triggers the train_pipeline.py script):
$ tox -e train
train installed: appdirs==1.4.4,attrs==21.4.0,black==20.8b1,click==8.1.3,feature-engine==1.0.2,flake8==3.9.2,iniconfig==1.1.1,isort==5.8.0,joblib==1.0.1,mccabe==0.6.1,mypy==0.812,mypy-extensions==0.4.3,numpy==1.20.3,packaging==21.3,pandas==1.3.5,pathspec==0.9.0,patsy==0.5.2,pluggy==1.0.0,py==1.11.0,pycodestyle==2.7.0,pydantic==1.8.2,pyflakes==2.3.1,pyparsing==3.0.8,pytest==6.2.5,python-dateutil==2.8.2,pytz==2022.1,regex==2022.4.24,ruamel.yaml==0.16.12,scikit-learn==1.0.2,scipy==1.8.0,six==1.16.0,statsmodels==0.13.2,strictyaml==1.3.2,threadpoolctl==3.1.0,toml==0.10.2,typed-ast==1.4.3,typing_extensions==4.2.0
train run-test-pre: PYTHONHASHSEED='0'
train run-test: commands[0] | python regression_model/train_pipeline.py
___________________________________________________________________ summary ___________________________________________________________________
train: commands succeeded
congratulations :)
The production-model-package/regression_model/trained_models directory now contains a regression_model_output_v0.0.1.pkl pickle file.
To run the test_package environment:
$ tox -e test_package
test_package installed: [snip]
test_package run-test-pre: PYTHONHASHSEED='0'
test_package run-test: commands[0] | python regression_model/train_pipeline.py
test_package run-test: commands[1] | pytest -s -vv tests/
============================================================= test session starts =============================================================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /path/to/production-model-package/.tox/test_package/bin/python
cachedir: .tox/test_package/.pytest_cache
rootdir: /home/nina/Development/gitlab/ames/production-model-package, configfile: pyproject.toml
collected 2 items
tests/test_features.py::test_temporal_variable_transformer PASSED
tests/test_prediction.py::test_make_prediction PASSED
============================================================== 2 passed in 0.19s ==============================================================
___________________________________________________________________ summary ___________________________________________________________________
test_package: commands succeeded
congratulations :)
To run all:
$ tox
test_package installed: [snip]
test_package run-test-pre: PYTHONHASHSEED='0'
test_package run-test: commands[0] | python regression_model/train_pipeline.py
test_package run-test: commands[1] | pytest -s -vv tests/
============================================================= test session starts =============================================================
platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /path/to/production-model-package/.tox/test_package/bin/python
cachedir: .tox/test_package/.pytest_cache
rootdir: /home/nina/Development/gitlab/ames/production-model-package, configfile: pyproject.toml
collected 2 items
tests/test_features.py::test_temporal_variable_transformer PASSED
tests/test_prediction.py::test_make_prediction PASSED
============================================================== 2 passed in 0.18s ==============================================================
typechecks installed: [snip]
typechecks run-test-pre: PYTHONHASHSEED='4267597864'
typechecks run-test: commands[0] | mypy regression_model
Success: no issues found in 12 source files
stylechecks installed: [snip]
stylechecks run-test-pre: PYTHONHASHSEED='4267597864'
stylechecks run-test: commands[0] | flake8 regression_model tests
lint installed: [snip]
lint run-test-pre: PYTHONHASHSEED='4267597864'
lint run-test: commands[0] | isort regression_model tests
lint run-test: commands[1] | - black classification_model tests
Traceback (most recent call last):[snip]
ImportError: cannot import name '_unicodefun' from 'click' [snip]
lint run-test: commands[2] | mypy regression_model
Success: no issues found in 12 source files
lint run-test: commands[3] | flake8 regression_model
___________________________________________________________________ summary ___________________________________________________________________
test_package: commands succeeded
typechecks: commands succeeded
stylechecks: commands succeeded
lint: commands succeeded
congratulations :)
Troubleshooting Tox
The ImportError: cannot import name '_unicodefun' from 'click' is a known issue. Upgrading to black==22.3.0 and downgrading to click==8.0.2 helps when running the checks manually, but not via Tox. Hence the dash in front of black in the lint commands: the command runs and reports the error, but Tox ignores the failure and carries on.
If for some reason you are unable to run things with Tox, you can run the Python commands listed in the tox.ini file by hand. To do this, add the production-model-package directory to the system PYTHONPATH:
$ pwd
/path/to/production-model-package
Then, add the path to your ~/.bashrc:
$ export PYTHONPATH="${PYTHONPATH}:/path/to/production-model-package"
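Whether the export took effect can be checked from Python, because entries from PYTHONPATH end up in sys.path when the interpreter starts. The sketch below uses the same placeholder path as above; the helper name ensure_on_path is made up, and unlike the ~/.bashrc entry it only affects the running process.

```python
"""Sketch: check that a package root is importable, as PYTHONPATH would ensure."""
import sys
from pathlib import Path


def ensure_on_path(package_root: str) -> bool:
    """Add package_root to sys.path for this process; return True if it was added."""
    root = str(Path(package_root).resolve())
    if root in sys.path:
        return False
    sys.path.insert(0, root)
    return True


# Placeholder path, as in the text above; replace it with the output of `pwd`.
added = ensure_on_path("/path/to/production-model-package")
print("added to sys.path" if added else "already on sys.path")
```

If the script reports that the directory was already on sys.path, the PYTHONPATH setting is working and imports such as regression_model should resolve when running the commands by hand.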
Building the package
Packaging uses a MANIFEST.in, a pyproject.toml and a setup.py. All three are often adapted from existing templates, or generated by yet another tool. The MANIFEST.in file specifies which files are to be included in the package, and the setup.py is where the package metadata and build settings are defined.
To build the package, install build, and run the build from the directory where pyproject.toml is located:
$ python -m pip install --upgrade build
[snip]
$ python -m build
[snip]
Successfully built ames-regression-model-0.0.1.tar.gz and ames_regression_model-0.0.1-py3-none-any.whl
A build, an .egg-info and a dist directory will appear. The archives in dist are what gets installed in the api application.
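The wheel filename reported by the build encodes what the archive is compatible with. As a small illustration, the sketch below splits the name into its parts; it assumes the common five-part form without the optional build tag, and the function name parse_wheel_name is made up for this example.

```python
"""Sketch: read the parts of the wheel filename reported by the build."""
from typing import Dict


def parse_wheel_name(filename: str) -> Dict[str, str]:
    """Split name-version-python-abi-platform.whl (no optional build tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel name: {filename}")
    keys = ("distribution", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


info = parse_wheel_name("ames_regression_model-0.0.1-py3-none-any.whl")
print(info)
```

The tags py3, none and any mark a pure Python 3 wheel with no ABI requirement that installs on any platform.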