Model selection with non-SBML models and unsupported calibration tools

PEtab Select is currently designed for use with models encoded in SBML. However, this requirement is inherited from PEtab, and the SBML model is irrelevant to PEtab Select, which only reads the parameter table of a model’s PEtab problem to determine the estimated parameters (e.g. to accurately compute criteria values).

Furthermore, PEtab Select is designed for use with one of the supported calibration tools. However, PEtab Select also implements a command-line interface (CLI) and Python interface such that other calibration tools can easily interface with PEtab Select in order to perform the model selection.

In this notebook, we demonstrate model selection with a modeling framework that currently does not support PEtab problems (specifically, statsmodels). Since statsmodels is a Python package, we use the Python interface of PEtab Select. Synthetic data was generated from the model

\[y(t) = 20t + 10t^2 + 5t^3 + \epsilon\]

where \(\epsilon\) is drawn from the standard normal distribution and represents measurement noise. The hypothetical scenario is that you are provided \((t,y)\) data pairs and asked to identify the powers of \(t\) that are evidenced by the data (i.e., that were used to generate the data). Here, we only consider polynomial models involving terms up to \(t^5\), i.e., the superset model is

\[y(t) = \sum_{i=0}^5 k_i t^i,\]

where \(k_i\) is the \(i\)th coefficient in the linear regression model.

“True” model

We start by loading the data and demonstrating model calibration of the true model using statsmodels.

[1]:
# Based on the statsmodels "Ordinary Least Squares" example
# https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html

import numpy as np
import petab
import statsmodels.api as sm
from petab import MEASUREMENT, TIME

import petab_select
from petab_select.constants import (
    CANDIDATE_SPACE,
    ESTIMATE,
    TERMINATE,
    UNCALIBRATED_MODELS,
    Criterion,
)

# Data was generated with the formula (see next cell to regenerate data)
# y(t) = 20*t + 10*t^2 + 5*t^3
df = petab.get_measurement_df(
    "other_model_types_problem/petab/measurements.tsv"
)
t = df[TIME].to_numpy()
y = df[MEASUREMENT].to_numpy()

# Show statsmodels fit for the "true" model
true_T = np.column_stack([t**1, t**2, t**3])
print(sm.OLS(y, true_T).fit().summary())
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          2.226e+08
Date:                Tue, 30 Jun 2026   Prob (F-statistic):                        0.00
Time:                        15:50:30   Log-Likelihood:                         -137.59
No. Observations:                 101   AIC:                                      281.2
Df Residuals:                      98   BIC:                                      289.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            19.8935      0.165    120.611      0.000      19.566      20.221
x2            10.0536      0.051    197.694      0.000       9.953      10.155
x3             4.9954      0.004   1334.839      0.000       4.988       5.003
==============================================================================
Omnibus:                        1.563   Durbin-Watson:                   1.861
Prob(Omnibus):                  0.458   Jarque-Bera (JB):                1.284
Skew:                          -0.078   Prob(JB):                        0.526
Kurtosis:                       2.470   Cond. No.                         695.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2]:
# Uncomment to regenerate data

# import numpy as np
# rng = np.random.default_rng(seed=0)

# # Measurement time points
# t = np.linspace(0, 10, 101)
# # The first three powers of x are used to generate the data
# T = np.column_stack([t**1, t**2, t**3])
# K = np.array([20, 10, 5])
# noise = rng.normal(size=t.size)
# y = np.dot(T, K) + noise

# # Write PEtab measurement table
# lines = ["observableId\tsimulationConditionId\ttime\tmeasurement"]
# for t_, y_ in zip(t, y, strict=True):
#     lines.append(f"y\tdefault\t{round(t_,1)}\t{round(y_,5)}")
# content = "\n".join(lines)
# with open("other_model_types_problem/petab/measurements.tsv", "w") as f:
#     f.write(content)

Model selection

To perform model selection, we define three methods to:

  1. calibrate a single model

    This is where we convert a PEtab Select model into a statsmodels model, fit the model with statsmodels, then save the log-likelihood value in the PEtab Select model.

  2. perform a single iteration of a model selection method involving calibration of multiple models

    This is generic code that executes the required PEtab Select commands.

  3. perform a full model selection run, involving all iterations of a model selection method

    This loads the PEtab Select problem from disk and performs all iterations of a model selection method.

In this case, the PEtab Select problem is setup to perform backward selection (see the specification files for this example).

[3]:
def calibrate_model(model: petab_select.Model, t=t, y=y):
    # Convert the PEtab Select model into a statsmodel model
    powers = [
        int(parameter_id[1:])
        for parameter_id, parameter_value in model.parameters.items()
        if parameter_value == ESTIMATE
    ]
    if not powers:
        return
    T = np.column_stack([t**i for i in powers])
    sm_model = sm.OLS(y, T)

    # Calibrate
    results = sm_model.fit()

    # Save the log-likelihood
    model.set_criterion(criterion=Criterion.LLH, value=results.llf)


def perform_iteration(
    problem: petab_select.Problem, candidate_space: petab_select.CandidateSpace
):
    # Start the iteration, request models according to the model selection method.
    iteration = petab_select.ui.start_iteration(
        problem=problem, candidate_space=candidate_space
    )

    # Calibrate each model.
    models = iteration[UNCALIBRATED_MODELS]
    for model in models:
        if all(
            parameter_value != ESTIMATE
            for parameter_value in model.parameters.values()
        ):
            del models[model.hash]
            continue
        calibrate_model(model=model)
        # Compute the criterion based on the log-likelihood computed by statsmodels.
        model.compute_criterion(criterion=problem.criterion)

    # End the iteration and return the results.
    return petab_select.ui.end_iteration(
        problem=problem,
        candidate_space=iteration[CANDIDATE_SPACE],
        calibrated_models=models,
    )


def run_model_selection():
    # Load the PEtab Select problem.
    problem = petab_select.Problem.from_yaml(
        "other_model_types_problem/petab_select/problem.yaml"
    )
    candidate_space = None

    # Perform all iterations of the model selection method until the termination signal is emitted.
    while True:
        iteration_results = perform_iteration(
            problem=problem, candidate_space=candidate_space
        )
        if iteration_results[TERMINATE]:
            break
        candidate_space = iteration_results[CANDIDATE_SPACE]

    # Return the problem, which contains the calibrated models.
    return problem
[4]:
# Run the model selection method.
problem = run_model_selection()

Analysis

As when using a PEtab-compatible calibration tool, the results from calibration with statsmodels can also be analyzed with PEtab Select.

Firstly, we get the best model identified during model selection. It matches the “true” model, with the same AIC as computed by statsmodels at the top of this notebook.

[5]:
best_model = petab_select.analyze.get_best(
    models=problem.state.models, criterion=problem.criterion
)
print(f"""
Best model information
----------------------
PEtab Select model hash: {best_model.hash}
Selected powers of t: {[int(parameter_id[1:]) for parameter_id, parameter_value in best_model.parameters.items() if parameter_value == ESTIMATE]}
Log-Likelihood: {best_model.get_criterion(criterion=Criterion.LLH)}
AIC: {best_model.get_criterion(criterion=Criterion.AIC)}
""")

Best model information
----------------------
PEtab Select model hash: M-011100
Selected powers of t: [1, 2, 3]
Log-Likelihood: -137.59037189701047
AIC: 281.18074379402094

[6]:
# Plot AIC of all models that were calibrated.

import petab_select.plot

plot_data = petab_select.plot.PlotData(
    models=problem.state.models, criterion=problem.criterion
)

petab_select.plot.scatter_criterion_vs_n_estimated(plot_data=plot_data);
../_images/examples_other_model_types_9_0.png
[7]:
# Plot the history of model selection iterations.
import matplotlib.pyplot as plt

draw_networkx_kwargs = {
    "arrowstyle": "-|>",
    "node_shape": "s",
    "edgecolors": "k",
}
fig, ax = plt.subplots(figsize=(20, 20))
petab_select.plot.graph_iteration_layers(
    plot_data=plot_data,
    draw_networkx_kwargs=draw_networkx_kwargs,
    ax=ax,
);
../_images/examples_other_model_types_10_0.png

Supplementary: the PEtab files

The PEtab Select problem used above is stored alongside this notebook, in other_model_types_problem/. This section explains what those files contain. The directory has two parts: a standard PEtab problem (petab/) that describes the model and data, and a PEtab Select problem (petab_select/) that describes the model space and the selection settings.

other_model_types_problem/
├── petab/                 # the PEtab problem
│   ├── problem.yaml
│   ├── parameters.tsv
│   ├── observables.tsv
│   ├── conditions.tsv
│   ├── measurements.tsv
│   └── dummy.xml
└── petab_select/          # the PEtab Select problem
    ├── problem.yaml
    └── model_space.tsv

The PEtab problem (petab/)

A PEtab problem normally pairs an SBML model with tables for parameters, observables, conditions, and measurements. In this example however, we focus on only the information relevant to PEtab Select. The required non-dummy files are:

  • problem.yaml — the PEtab problem file. It points to the parameter table and, for its single problem, to the condition, measurement, (dummy) observable, and (dummy) SBML files.

  • parameters.tsv — the six regression coefficients k0k5, one per power of t. Each is on a linear scale with bounds [0, 100]. The estimate column is only a default; for each candidate model, PEtab Select overrides it (via the model space) to decide which coefficients are estimated. This is the only table PEtab Select reads, in order to count the estimated parameters when computing criteria such as the AIC.

The other files can be dummy files. In this example, we used them to encode as much information as possible, even though they are not used, just to demonstrate the PEtab format:

  • observables.tsv — defines a single observable y whose formula is the regression model \(y(t) = \sum_{i=0}^{5} k_i\, t^i\), written as k0*time^0 + k1*time^1 + ... + k5*time^5. The noiseFormula is 1 (unit, constant noise), matching the standard-normal noise used to generate the data.

  • conditions.tsv — a single dummy experimental condition, default (the data involves no experimental perturbations/conditions).

  • measurements.tsv — the synthetic data: 101 measurements of observable y at condition default, one per time point \(t = 0.0, 0.1, \ldots, 10.0\). The values were generated from \(y(t) = 20t + 10t^2 + 5t^3 + \epsilon\) (i.e. only the powers \(t^1, t^2, t^3\) are truly present) with standard-normal noise \(\epsilon\). This is used in the notebook to load data, but the data could have been loaded from e.g. an Excel spreadsheet too.

  • dummy.xml — an essentially empty SBML model.

We display each PEtab table below, and print the PEtab problem YAML file.

[1]:
import pandas as pd
from IPython.display import display

for petab_file in [
    "petab/parameters.tsv",
    "petab/observables.tsv",
    "petab/conditions.tsv",
    "petab/measurements.tsv",
]:
    print(petab_file)
    display(pd.read_csv(f"other_model_types_problem/{petab_file}", sep="\t"))
petab/parameters.tsv
parameterId parameterScale lowerBound upperBound estimate
0 k0 lin 0 100 1
1 k1 lin 0 100 1
2 k2 lin 0 100 1
3 k3 lin 0 100 1
4 k4 lin 0 100 1
5 k5 lin 0 100 1
petab/observables.tsv
observableId observableFormula noiseFormula
0 y k0*time^0 + k1*time^1 + k2*time^2 + k3*time^3 ... 1
petab/conditions.tsv
conditionId
0 default
petab/measurements.tsv
observableId simulationConditionId time measurement
0 y default 0.0 0.12573
1 y default 0.1 1.97290
2 y default 0.2 5.08042
3 y default 0.3 7.13990
4 y default 0.4 9.38433
... ... ... ... ...
96 y default 9.6 5537.44101
97 y default 9.7 5697.67947
98 y default 9.8 5861.01878
99 y default 9.9 6028.19348
100 y default 10.0 6200.50268

101 rows × 4 columns

[9]:
# The PEtab problem YAML file
print(open("other_model_types_problem/petab/problem.yaml").read())
format_version: 1
parameter_file: parameters.tsv
problems:
- condition_files:
  - conditions.tsv
  measurement_files:
  - measurements.tsv
  sbml_files:
  - dummy.xml
  observable_files:
  - observables.tsv

The PEtab Select problem (petab_select/)

  • model_space.tsv — the model space. It defines a single model subspace M (pointing to ../petab/problem.yaml) with one column per coefficient, k0k5. Each entry is 0;estimate, meaning the coefficient can either be fixed to 0 (that power of t is then effectively removed from the model) or estimated (that power is present). With six independently toggled coefficients, this single row compactly encodes all \(2^6 = 64\) candidate models. The goal of the selection is to recover the powers that generated the data (\(t^1, t^2, t^3\)).

[10]:
print("petab_select/model_space.tsv")
display(
    pd.read_csv(
        "other_model_types_problem/petab_select/model_space.tsv", sep="\t"
    )
)
petab_select/model_space.tsv
model_subspace_id petab_yaml k0 k1 k2 k3 k4 k5
0 M ../petab/problem.yaml 0;estimate 0;estimate 0;estimate 0;estimate 0;estimate 0;estimate
  • problem.yaml — the model selection settings: compare models by the AIC criterion, search the model space with the backward method (start from the full model and remove terms), and read the model space from model_space.tsv.

[11]:
# The PEtab Select problem YAML file
print(open("other_model_types_problem/petab_select/problem.yaml").read())
format_version: 1
criterion: AIC
method: backward
model_space_files:
- model_space.tsv