Packaging models for production

by Ty Myrddin

Published on May 1, 2022

Soledad Galli and Christopher Samiullah of trainindata have a wonderful way of structuring notebooks so that they map easily onto modules and packages. We tried it out in the titanic and ames training wheel repos to package the resulting R&D models.

The rest is a matter of puzzling out the configuration and bringing it all together in a Tox config file.

Requirements

Following trainindata's structural setup for packaging, we specify acceptable version ranges for the project dependencies. Each range is the spelled-out form of a compatible release clause (see PEP 440): numpy>=1.20.0,<1.21.0 accepts the same versions as numpy~=1.20.0. This gives us the flexibility to keep up with small updates/fixes, whilst ensuring we don't install a larger update which could introduce backwards incompatible changes.

For now we just wish to get it all to work, so we are rather conservative and are not taking unnecessary risks in this packaging for the pipeline. Any changes beyond these limits will require a separate branch, to prevent breaking the build.

There will be scenarios where testing is not necessary, so the requirements are split into two requirements files.

The requirements.txt:


        numpy>=1.20.0,<1.21.0
        pandas>=1.3.5,<1.4.0
        pydantic>=1.8.1,<1.9.0
        scikit-learn>=1.0.2,<1.1.0
        strictyaml>=1.3.2,<1.4.0
        ruamel.yaml==0.16.12
        feature-engine>=1.0.2,<1.1.0
        joblib>=1.0.1,<1.1.0

And the test_requirements.txt:


        -r requirements.txt

        # testing requirements
        pytest>=6.2.3,<6.3.0

        # repo maintenance tooling
        black==20.8b1
        flake8>=3.9.0,<3.10.0
        mypy==0.812
        isort==5.8.0

Install without testing with:


        $ pip install -r requirements/requirements.txt

Install all of it with:


        $ pip install -r requirements/test_requirements.txt

Specifying version ranges to manage dependencies and keep them stable is a common best practice, whether one uses poetry, pipenv, or just pip, as we do here.
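
To make the ranges concrete, here is a minimal sketch in plain Python of how a range such as numpy>=1.20.0,<1.21.0 behaves, and of the upper bound implied by the PEP 440 compatible release operator (~=). The helper names are hypothetical and the parsing is deliberately simplified (numeric release segments only, no pre-releases or local versions); for real work, rely on pip or the packaging library.

```python
def parse(version: str) -> tuple:
    """Turn '1.20.3' into (1, 20, 3). Simplified: numeric segments only."""
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper, as in 'pkg>=lower,<upper'."""
    return parse(lower) <= parse(version) < parse(upper)


def compatible_upper_bound(version: str) -> str:
    """Exclusive upper bound implied by '~=version' (two or more segments)."""
    parts = parse(version)
    if len(parts) < 2:
        raise ValueError("a compatible release needs at least two segments")
    # Drop the last segment and increment the one before it: 1.20.0 -> 1.21
    return ".".join(str(p) for p in parts[:-2] + (parts[-2] + 1,))


if __name__ == "__main__":
    # numpy>=1.20.0,<1.21.0 from requirements.txt
    for candidate in ("1.20.0", "1.20.3", "1.21.0", "2.0.0"):
        print(candidate, in_range(candidate, "1.20.0", "1.21.0"))
    # '~=1.20.0' leads to the same exclusive upper bound
    print(compatible_upper_bound("1.20.0"))
```

Patch releases such as 1.20.3 fall inside the range, while 1.21.0 and anything later are rejected.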

Package Configuration

Configuration objects are created from a config.yml file, not from a Python config module (why not).

All global constants from the Jupyter notebooks have been moved into a YAML file: one config.yml for ames and one config.yml for titanic.

The core.py files can be found in the config directories. We use pydantic, which builds on Python type hints (see PEP 484), for data validation and settings management.
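
As an illustration of what such validation looks like, here is a minimal sketch, assuming the YAML file has already been parsed into a plain dictionary. The class name AppConfig, its field names and the example feature names are hypothetical and are not taken from the actual core.py; only pydantic's BaseModel and ValidationError are real library names.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class AppConfig(BaseModel):
    """Hypothetical settings model; field names are illustrative only."""

    package_name: str
    training_data_file: str
    features: List[str]
    test_size: float


# Stand-in for the dictionary a YAML parser would return for config.yml.
raw = {
    "package_name": "regression_model",
    "training_data_file": "train.csv",
    "features": ["LotArea", "YearBuilt"],  # assumed example features
    "test_size": 0.1,
}

# Valid input yields a typed configuration object.
config = AppConfig(**raw)
print(config.package_name, config.test_size)

# A value of the wrong type is rejected when the object is created.
try:
    AppConfig(**{**raw, "test_size": "ten percent"})
except ValidationError as err:
    print("invalid config:", type(err).__name__)
```

The benefit is that a typo or a wrong type in config.yml fails loudly at load time, rather than somewhere deep inside the training pipeline.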

Tox

Tox is a generic virtualenv management and test command line tool. Its goal is to standardize testing in Python.

Using Tox (why use it) we can (on multiple operating systems):

Eliminate PYTHONPATH challenges when running scripts/tests
Eliminate virtual environment setup confusion
Streamline steps such as model training and model publishing
Reduce the use of shell scripts and their pitfalls

All of Tox's configuration is in the tox.ini file:


        [tox]
        envlist = test_package, typechecks, stylechecks, lint
        skipsdist = True

        [testenv]
        install_command = pip install {opts} {packages}

        [testenv:test_package]
        deps =
            -rrequirements/test_requirements.txt

        setenv =
            PYTHONPATH=.
            PYTHONHASHSEED=0

        commands=
            python regression_model/train_pipeline.py
            pytest \
            -s \
            -vv \
            {posargs:tests/}

        [testenv:train]
        envdir = {toxworkdir}/test_package
        deps =
            {[testenv:test_package]deps}

        setenv =
            {[testenv:test_package]setenv}

        commands=
            python regression_model/train_pipeline.py


        [testenv:typechecks]
        envdir = {toxworkdir}/test_package

        deps =
            {[testenv:test_package]deps}

        commands = {posargs:mypy regression_model}


        [testenv:stylechecks]
        envdir = {toxworkdir}/test_package

        deps =
            {[testenv:test_package]deps}

        commands = {posargs:flake8 regression_model tests}


        [testenv:lint]
        envdir = {toxworkdir}/test_package

        deps =
            {[testenv:test_package]deps}

        commands =
            isort regression_model tests
            - black regression_model tests
            mypy regression_model
            flake8 regression_model

        [flake8]
        exclude = .git,env
        max-line-length = 90
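
To see how the sections relate to the command line, here is a small sketch that reads a trimmed copy of the file with Python's standard configparser. It is not how Tox itself parses its configuration, and the function names are hypothetical. It only illustrates that each [testenv:NAME] section can be run with tox -e NAME, and that a plain tox run covers just the names in envlist.

```python
import configparser

# Trimmed copy of the tox.ini above (deps and setenv left out).
TOX_INI = """
[tox]
envlist = test_package, typechecks, stylechecks, lint

[testenv:test_package]
commands =
    python regression_model/train_pipeline.py
    pytest -s -vv {posargs:tests/}

[testenv:train]
commands =
    python regression_model/train_pipeline.py

[testenv:typechecks]
commands = {posargs:mypy regression_model}
"""


def _parser(ini_text):
    # interpolation=None keeps the {braces} exactly as written
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser


def default_envs(ini_text):
    """Environment names that a plain `tox` run would go through."""
    envlist = _parser(ini_text)["tox"]["envlist"]
    return [name.strip() for name in envlist.split(",")]


def env_commands(ini_text):
    """Map each [testenv:NAME] section to its list of commands."""
    parser = _parser(ini_text)
    result = {}
    for section in parser.sections():
        if section.startswith("testenv:"):
            name = section.split(":", 1)[1]
            lines = parser[section].get("commands", "").splitlines()
            result[name] = [line.strip() for line in lines if line.strip()]
    return result


print(default_envs(TOX_INI))
print(sorted(env_commands(TOX_INI)))
```

Note that train is defined as an environment but is not part of envlist, so it only runs when requested explicitly with tox -e train.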

Example Tox commands

Make sure tox is installed (pip install tox, or system-wide with sudo apt install tox) and, when working with the ames training wheel repo, that the train.csv and test.csv files are available in the production-model-package/regression_model/datasets directory.

To train the regression model (triggers the train_pipeline.py script):


        $ tox -e train
        train installed: appdirs==1.4.4,attrs==21.4.0,black==20.8b1,click==8.1.3,feature-engine==1.0.2,flake8==3.9.2,iniconfig==1.1.1,isort==5.8.0,joblib==1.0.1,mccabe==0.6.1,mypy==0.812,mypy-extensions==0.4.3,numpy==1.20.3,packaging==21.3,pandas==1.3.5,pathspec==0.9.0,patsy==0.5.2,pluggy==1.0.0,py==1.11.0,pycodestyle==2.7.0,pydantic==1.8.2,pyflakes==2.3.1,pyparsing==3.0.8,pytest==6.2.5,python-dateutil==2.8.2,pytz==2022.1,regex==2022.4.24,ruamel.yaml==0.16.12,scikit-learn==1.0.2,scipy==1.8.0,six==1.16.0,statsmodels==0.13.2,strictyaml==1.3.2,threadpoolctl==3.1.0,toml==0.10.2,typed-ast==1.4.3,typing_extensions==4.2.0
        train run-test-pre: PYTHONHASHSEED='0'
        train run-test: commands[0] | python regression_model/train_pipeline.py
        ___________________________________________________________________ summary ___________________________________________________________________
          train: commands succeeded
          congratulations :)

The production-model-package/regression_model/trained_models directory now contains a regression_model_output_v0.0.1.pkl pickle file.

To run the test_package environment:


        $ tox -e test_package
        test_package installed: [snip]
        test_package run-test-pre: PYTHONHASHSEED='0'
        test_package run-test: commands[0] | python regression_model/train_pipeline.py
        test_package run-test: commands[1] | pytest -s -vv tests/
        ============================================================= test session starts =============================================================
        platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /path/to/production-model-package/.tox/test_package/bin/python
        cachedir: .tox/test_package/.pytest_cache
        rootdir: /home/nina/Development/gitlab/ames/production-model-package, configfile: pyproject.toml
        collected 2 items

        tests/test_features.py::test_temporal_variable_transformer PASSED
        tests/test_prediction.py::test_make_prediction PASSED

        ============================================================== 2 passed in 0.19s ==============================================================
        ___________________________________________________________________ summary ___________________________________________________________________
          test_package: commands succeeded
          congratulations :)

To run all:


        $ tox
        test_package installed: [snip]
        test_package run-test-pre: PYTHONHASHSEED='0'
        test_package run-test: commands[0] | python regression_model/train_pipeline.py
        test_package run-test: commands[1] | pytest -s -vv tests/
        ============================================================= test session starts =============================================================
        platform linux -- Python 3.9.5, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /path/to/production-model-package/.tox/test_package/bin/python
        cachedir: .tox/test_package/.pytest_cache
        rootdir: /home/nina/Development/gitlab/ames/production-model-package, configfile: pyproject.toml
        collected 2 items

        tests/test_features.py::test_temporal_variable_transformer PASSED
        tests/test_prediction.py::test_make_prediction PASSED

        ============================================================== 2 passed in 0.18s ==============================================================
        typechecks installed: [snip]
        typechecks run-test-pre: PYTHONHASHSEED='4267597864'
        typechecks run-test: commands[0] | mypy regression_model
        Success: no issues found in 12 source files
        stylechecks installed: [snip]
        stylechecks run-test-pre: PYTHONHASHSEED='4267597864'
        stylechecks run-test: commands[0] | flake8 regression_model tests
        lint installed: [snip]
        lint run-test-pre: PYTHONHASHSEED='4267597864'
        lint run-test: commands[0] | isort regression_model tests
        lint run-test: commands[1] | - black classification_model tests
        Traceback (most recent call last):[snip]
        ImportError: cannot import name '_unicodefun' from 'click' [snip]
        lint run-test: commands[2] | mypy regression_model
        Success: no issues found in 12 source files
        lint run-test: commands[3] | flake8 regression_model
        ___________________________________________________________________ summary ___________________________________________________________________
          test_package: commands succeeded
          typechecks: commands succeeded
          stylechecks: commands succeeded
          lint: commands succeeded
          congratulations :)
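
In the full run above, test_package reports PYTHONHASHSEED='0' because tox.ini sets it, while the other environments get a random seed. The sketch below shows what the fixed seed buys: string hashes, and anything that depends on them, become identical from one interpreter run to the next. The function name and the hashed string are made up for the demonstration.

```python
import os
import subprocess
import sys


def string_hash(seed=None):
    """Return hash('ames') as computed by a fresh Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean slate
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('ames'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With a fixed seed, two separate interpreter runs agree.
print(string_hash("0") == string_hash("0"))

# Without a seed, Python randomizes string hashing per process,
# so the two values will normally differ.
print(string_hash(), string_hash())
```

For a training pipeline this removes one source of run-to-run variation, which makes test results easier to reproduce.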

Troubleshooting Tox

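The ImportError: cannot import name '_unicodefun' from 'click' is a known issue: older black releases, such as the pinned 20.8b1, do not work with click 8.1 and later. Upgrading to black==22.3.0 and downgrading to click==8.0.2 helps when running the checks manually, but not via Tox. Hence the dash in front of the black command: Tox ignores the exit code of a command that is prefixed with a dash, so black still runs and reports the error, but does not block the process.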
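
The effect of the dash can be mimicked in a few lines of Python. This is only a sketch of the semantics, not Tox's implementation: the function name is hypothetical, and each command is written as an argument list in which a leading "-" element marks it as allowed to fail.

```python
import subprocess
import sys


def run_commands(commands):
    """Run argument lists in order; a leading "-" means: ignore failure."""
    for argv in commands:
        ignore_failure = argv[0] == "-"
        if ignore_failure:
            argv = argv[1:]
        exit_code = subprocess.run(argv).returncode
        if exit_code != 0 and not ignore_failure:
            return False  # a failing command without a dash stops the run
    return True


# Stand-ins for a failing and a succeeding tool.
FAIL = [sys.executable, "-c", "raise SystemExit(1)"]
PASS = [sys.executable, "-c", "pass"]

# The failing command is prefixed with a dash: the run still succeeds.
print(run_commands([PASS, ["-"] + FAIL, PASS]))

# Without the dash, the same failure makes the whole run fail.
print(run_commands([PASS, FAIL, PASS]))
```

This is why the lint environment in the summary above still reports "commands succeeded" despite the traceback from black.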

If for some reason you are unable to run things with Tox, you can run the Python commands listed in the tox.ini file by hand. For that to work, the production-model-package directory must be on the PYTHONPATH. First, look up its full path:


        $ pwd
        /path/to/production-model-package

Then export it in your current shell, or add the export line to your ~/.bashrc to make it permanent:


        $ export PYTHONPATH="${PYTHONPATH}:/path/to/production-model-package"
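
Why this matters can be shown with a small experiment. The sketch below creates a throwaway package (the name demo_model_pkg is made up) and tries to import it from a fresh interpreter started in another directory, first without and then with PYTHONPATH pointing at the project directory.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def can_import(module_name, cwd, pythonpath=None):
    """Try to import module_name in a fresh interpreter started in cwd."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)
    if pythonpath is not None:
        env["PYTHONPATH"] = str(pythonpath)
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        env=env,
        cwd=cwd,
        capture_output=True,
    )
    return result.returncode == 0


def demo():
    """Return (importable without PYTHONPATH, importable with PYTHONPATH)."""
    with tempfile.TemporaryDirectory() as project_dir, \
            tempfile.TemporaryDirectory() as elsewhere:
        package = Path(project_dir) / "demo_model_pkg"
        package.mkdir()
        (package / "__init__.py").write_text("")
        without = can_import("demo_model_pkg", cwd=elsewhere)
        with_path = can_import(
            "demo_model_pkg", cwd=elsewhere, pythonpath=project_dir
        )
        return without, with_path


print(demo())
```

Tox takes care of this through the PYTHONPATH=. line under setenv, which is one of the reasons for using it in the first place.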

Building the package

Packaging uses MANIFEST.in, pyproject.toml and setup.py. All three are often adapted from existing templates, or generated by yet another tool. The MANIFEST.in file specifies which files are to be included in the package, and setup.py, which declares the package metadata and its dependencies, is where the magic happens.
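
A common pattern in such a setup.py is to read the dependency list from the requirements file instead of repeating it. The sketch below shows the idea under that assumption. The function name list_reqs and the parsing rules are illustrative and not necessarily what the setup.py in these repos does.

```python
def list_reqs(text):
    """Extract requirement specifiers from the text of a requirements file.

    Simplified: drops comments, blank lines and pip options such as '-r'.
    """
    requirements = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        requirements.append(line)
    return requirements


# A few lines in the style of requirements.txt above.
EXAMPLE = """
numpy>=1.20.0,<1.21.0
pandas>=1.3.5,<1.4.0

# pinned exactly
ruamel.yaml==0.16.12
"""

print(list_reqs(EXAMPLE))
```

In a setup.py, the returned list would typically be passed to setup() as install_requires, so that installing the built package pulls in the same version ranges.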

To build the package, install build, and run this command from the same directory where pyproject.toml is located:


        $ python -m pip install --upgrade build
        [snip]
        $ python -m build
        [snip]
        Successfully built ames-regression-model-0.0.1.tar.gz and ames_regression_model-0.0.1-py3-none-any.whl

A build, an egg-info and a dist directory will appear. The dist directory holds the built archives, which are used for installing the package in the API application.
