rwskit.benchmarking

Benchmarking tools.

Attributes

`TimeUnit`	The supported units of time.
`AggregationFunctionName`	The names of the supported aggregation functions.
`AggregationFunction`	An aggregation function is a callable that takes an array-like object and returns a single float.
`BenchmarkSortValue`	The supported values for sorting a `BenchmarkResult` when it is represented as a string.

Classes

`BenchmarkRunner`	A class for profiling a set of functions based on one or more criteria.
`BenchmarkResult`	A class for managing the results output by a `BenchmarkRunner`.

Functions

`get_time_unit_abbreviation`(→ TimeUnit)	Get the time unit abbreviation from the given string.
`change_time_unit`(→ float)	Change the unit of time of a given `value` currently in the `from_unit` unit to a value in the `to_unit` unit.
`validate_call_signature`(→ bool)	Check that the two functions take the same parameters.

Module Contents

rwskit.benchmarking.TimeUnit: The supported units of time.

rwskit.benchmarking.AggregationFunctionName: The names of the supported aggregation functions.

rwskit.benchmarking.AggregationFunction: An aggregation function is a callable that takes an array-like object and returns a single float.
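As a rough illustration of the AggregationFunction contract, the sketch below maps the names accepted by test_agg_fn ('min', 'max', 'mean', 'median', 'sum') to standard-library callables. The AGGREGATIONS table and the aggregate helper are hypothetical. This documentation does not show rwskit's own mapping.

```python
import statistics
from typing import Callable, Sequence

# Hypothetical mapping from aggregation names to callables that reduce
# a sequence of timings to one number; rwskit's implementation may differ.
AGGREGATIONS: dict[str, Callable[[Sequence[float]], float]] = {
    "min": min,
    "max": max,
    "mean": statistics.mean,
    "median": statistics.median,
    "sum": sum,
}


def aggregate(name: str, timings: Sequence[float]) -> float:
    """Reduce a sequence of timings to a single float using the named function."""
    return float(AGGREGATIONS[name](timings))
```

For example, `aggregate("median", [0.3, 0.1, 0.2])` picks the middle timing of the three.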

rwskit.benchmarking.BenchmarkSortValue: The supported values for sorting a BenchmarkResult when it is represented as a string.

rwskit.benchmarking.get_time_unit_abbreviation(name: TimeUnit) → TimeUnit

Get the time unit abbreviation from the given string.

Parameters: name (str) – The name or abbreviation of a supported time unit.
Returns: The abbreviation of the time unit specified by name.
Return type: str
Raises: icontract.errors.ViolationError – If the name is not a supported TimeUnit.
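Below is a minimal sketch of the lookup that get_time_unit_abbreviation performs. The members of TimeUnit are not listed here, so it assumes a hypothetical set of unit names (seconds, milliseconds, microseconds, nanoseconds). It raises ValueError rather than icontract.errors.ViolationError so that it runs without extra dependencies.

```python
# Hypothetical name -> abbreviation table; the real TimeUnit values are not
# listed in this documentation.
_ABBREVIATIONS = {
    "s": "s", "second": "s", "seconds": "s",
    "ms": "ms", "millisecond": "ms", "milliseconds": "ms",
    "us": "us", "microsecond": "us", "microseconds": "us",
    "ns": "ns", "nanosecond": "ns", "nanoseconds": "ns",
}


def abbreviate_time_unit(name: str) -> str:
    """Return the abbreviation for a time unit name or abbreviation."""
    key = name.strip().lower()
    if key not in _ABBREVIATIONS:
        # rwskit raises icontract.errors.ViolationError here; ValueError keeps
        # this sketch dependency-free.
        raise ValueError(f"unsupported time unit: {name!r}")
    return _ABBREVIATIONS[key]
```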

rwskit.benchmarking.change_time_unit(value: int | float, from_unit: TimeUnit, to_unit: TimeUnit) → float

Change the unit of time of a given value currently in the from_unit unit to a value in the to_unit unit.

Parameters:

value (int or float) – The current time value.
from_unit (TimeUnit) – The unit of the current value.
to_unit (TimeUnit) – The unit to change the value into.

Returns:

The equivalent value in the new time unit, to_unit.

Return type:

float
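Converting between these units comes down to scaling by a power of ten. The sketch below assumes the hypothetical abbreviations 's', 'ms', 'us' and 'ns'. convert_time is an illustrative stand-in, not the library function.

```python
# Hypothetical abbreviations mapped to their power of ten relative to seconds.
_EXPONENT = {"s": 0, "ms": -3, "us": -6, "ns": -9}


def convert_time(value: float, from_unit: str, to_unit: str) -> float:
    """Convert value from from_unit to to_unit (illustrative stand-in)."""
    # Scale by 10 ** (difference in exponents); e.g. s -> ms multiplies by 10 ** 3.
    return float(value) * 10.0 ** (_EXPONENT[from_unit] - _EXPONENT[to_unit])
```

Using exponents instead of a table of factors avoids listing every pair of units explicitly.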

rwskit.benchmarking.validate_call_signature(fn1: Callable[..., Any], fn2: Callable[..., Any], strict: bool = False) → bool

Check that the two functions take the same parameters.

Parameters:

fn1 (Callable[..., Any]) – The first function to compare.
fn2 (Callable[..., Any]) – The second function to compare.
strict (bool, default = False) – If True, the signatures must match exactly, including whether defaults are present and their values. Otherwise, they are considered equal if the number and types of all parameters are the same.

Returns:

True if the functions take the same number and type of parameters.

Return type:

bool
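One way to implement the two comparison modes with the standard inspect module is sketched below. same_call_signature is a hypothetical stand-in. Strict mode relies on inspect.Signature equality, which compares parameter names, kinds, defaults and annotations, plus the return annotation. Relaxed mode compares only the parameter count and annotations. The library may treat types differently.

```python
import inspect
from typing import Any, Callable


def same_call_signature(
    fn1: Callable[..., Any], fn2: Callable[..., Any], strict: bool = False
) -> bool:
    """Hypothetical stand-in for validate_call_signature."""
    sig1, sig2 = inspect.signature(fn1), inspect.signature(fn2)
    if strict:
        # Signature equality checks names, kinds, defaults, annotations and
        # the return annotation.
        return sig1 == sig2
    params1 = list(sig1.parameters.values())
    params2 = list(sig2.parameters.values())
    # Relaxed: same number of parameters with matching annotations, in order.
    return len(params1) == len(params2) and all(
        p1.annotation == p2.annotation for p1, p2 in zip(params1, params2)
    )
```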

class rwskit.benchmarking.BenchmarkRunner(functions: Iterable[Callable] | dict[str, Callable], benchmark_space: dict[str, list[T]], setup_fn: Callable | None = None, use_single_setup: bool = True, n_runs: int = 10, n_tests: int = 2, n_warm_ups: int = 1, time_unit: TimeUnit = cast(TimeUnit, 's'), test_agg_fn: AggregationFunctionName = 'min', run_label: str = 'run', show_progress: bool = False, verbose: bool = True, float_fmt: str = '0.4e', sort_by: BenchmarkSortValue = 'min', test_significance: bool = True)

A class for profiling a set of functions based on one or more criteria.

The high-level view of the benchmarking process is as follows. A sub-benchmark is run for every combination of parameters in the benchmark_space. Each sub-benchmark has two nested execution loops. The inner loop runs each function on the current data n_tests times and aggregates the timings using test_agg_fn. The same data is always used within this loop. The outer loop repeats this process n_runs times. If use_single_setup is True, the setup function is called only once and its data is reused for all the runs. If it is False, the setup function is called for every run. The execution times for all runs are stored in a pandas DataFrame that can be retrieved from the BenchmarkResult returned by the benchmark.
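The nested loops described above can be sketched in plain Python as follows. run_benchmark_sketch is hypothetical and leaves out progress bars, time-unit conversion and significance testing. The placement of the warm-up calls is also an assumption: here they run before each function's timed tests in every run.

```python
import itertools
import time
from typing import Any, Callable


def run_benchmark_sketch(
    functions: dict[str, Callable[[Any], Any]],
    benchmark_space: dict[str, list[Any]],
    setup_fn: Callable[..., Any],
    use_single_setup: bool = True,
    n_runs: int = 10,
    n_tests: int = 2,
    n_warm_ups: int = 1,
    agg: Callable[[list[float]], float] = min,
) -> list[dict[str, Any]]:
    """Return one row per (parameter combination, run, function)."""
    rows = []
    keys = list(benchmark_space)
    # One sub-benchmark per combination (Cartesian product) of parameter values.
    for values in itertools.product(*benchmark_space.values()):
        params = dict(zip(keys, values))
        data = setup_fn(**params) if use_single_setup else None
        for run in range(n_runs):  # outer loop
            if not use_single_setup:
                data = setup_fn(**params)  # fresh data for every run
            for label, fn in functions.items():
                for _ in range(n_warm_ups):
                    fn(data)  # warm-up calls are not timed
                timings = []
                for _ in range(n_tests):  # inner loop on the same data
                    start = time.perf_counter()
                    fn(data)
                    timings.append(time.perf_counter() - start)
                rows.append(
                    {"function": label, **params, "run": run, "time": agg(timings)}
                )
    return rows
```

A list of row dictionaries like this maps naturally onto the long-format DataFrame that the runner stores.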

Note

min, max, mean, std cannot be used as keys in the benchmark_space.

Note

functions must be an iterable, but cannot be a generator.

Note

The first parameter in the benchmark_space is always used as the x-axis for BenchmarkResult.plot().

Parameters:

functions (Iterable[Callable] or dict[str, Callable]) – A list of functions to benchmark, or a dictionary that maps a label to a benchmark function.
benchmark_space (dict[str, list[T]]) – The space of values to benchmark over. A benchmark is executed for each combination of values in the dictionary. The combinations are the Cartesian product of the lists, taking one value from each list. The keys of this dictionary must be the names of keyword arguments of the setup_fn, or of the benchmark functions if no setup_fn is provided. Only bool, int, float, and str values are supported.
setup_fn (SetupFunction, optional) – A function that initializes the data passed to the benchmark functions. If None, the values from the benchmark_space are passed directly to each function in functions.
use_single_setup (bool, default = True) – For functions that are guaranteed to be deterministic no matter what the input is, this should be True. However, if the function is non-deterministic or the performance might depend on how the data is initialized, this should be False.
n_runs (int, default = 10) – The number of runs (iterations of the outer loop) to execute for each combination of parameters.
n_tests (int, default = 2) – The number of times to run each function within a single run. These timings are aggregated using test_agg_fn.
n_warm_ups (int, default = 1) – The number of tests to run before recording the timing data.
time_unit (TimeUnit, default = 's') – The unit of time used to record and display the execution times.
test_agg_fn ({'min', 'max', 'mean', 'median', 'sum'}, default = 'min') – The function used to aggregate the individual test results within a run.
run_label (str, default = 'run') – The column label in the resulting pandas DataFrame that indicates the run number for the given execution times.
show_progress (bool, default = False) – Show progress bars while running the benchmark.
verbose (bool, default = True) – Print the full results and summary statistics to stdout when complete.
float_fmt (str, default = '0.4e') – The format used to print floating point values as strings.
sort_by (str {min, mean, function}, default = 'min') – When verbose=True, determines how the results are sorted: by the min run time, the mean run time, or the function name.
test_significance (bool, default = True) – If True, run pairwise t-tests to check whether the run times differ significantly between every pair of functions.
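The significance test compares every pair of functions. The sketch below uses only the standard library and computes Welch's t statistic for each pair. It does not compute p-values. This documentation does not say whether rwskit uses Welch's or Student's t-test, or how it computes p-values, so treat the choice of test here as an illustrative assumption.

```python
import itertools
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two samples of run times (each needs >= 2 values)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    # Raises ZeroDivisionError if both samples have zero variance.
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def pairwise_t(times: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Compute the t statistic for every unordered pair of functions."""
    return {
        (f1, f2): welch_t(times[f1], times[f2])
        for f1, f2 in itertools.combinations(times, 2)
    }
```

A large absolute t value suggests the difference in mean run time is unlikely to be noise. Converting it to a p-value requires the t distribution, for example from SciPy.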

Notes

Deterministic Function and Deterministic Data

If your algorithm is deterministic and is not influenced at all by the content of the data, only its size, then I would suggest the following parameters:

use_single_setup = True: Use the same data for all the runs on the current setup parameters.
n_runs > 1: Run it at least a few times per parameter set to make sure there weren’t any anomalies biasing the results.
n_tests = 1: You should not need to run multiple tests here.

Deterministic Function and Non-Deterministic Data

If your function is deterministic (the sequence of execution is always the same) but could be influenced by the content of the data, I would suggest the following parameters:

setup_fn != None: The setup function should return different data (of the same size) on each run.
use_single_setup = False: Run the setup function to generate new data on each run.
n_tests > 1: Run the function on the same data a few times in case there was an anomaly, which could bias the result.
n_runs > 1: Run the function on multiple different data sets to estimate how much variability is expected due to the makeup of the data.
test_agg_fn = 'min': Since the function should execute the same way on the same data, the min should be the most informative.

Non-Deterministic Function

If the function itself is non-deterministic you probably want something similar to the deterministic case with non-deterministic data. In this case however, it is probably pointless to set n_tests > 1 and you should just increase n_runs to get better overall estimates.
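The recommendations above can be collected as keyword arguments for BenchmarkRunner. The preset names and the specific counts below are illustrative choices, not library defaults.

```python
# Illustrative BenchmarkRunner keyword arguments for the three scenarios above.
# The counts are arbitrary examples; only the direction (> 1 or = 1) follows
# the recommendations.
DETERMINISTIC_FN_DETERMINISTIC_DATA = {
    "use_single_setup": True,  # reuse the same data for every run
    "n_runs": 5,
    "n_tests": 1,
}

DETERMINISTIC_FN_VARIABLE_DATA = {
    "use_single_setup": False,  # needs a setup_fn that returns fresh data each run
    "n_runs": 10,
    "n_tests": 3,
    "test_agg_fn": "min",
}

NON_DETERMINISTIC_FN = {
    "use_single_setup": False,
    "n_runs": 30,  # prefer more runs over repeated tests
    "n_tests": 1,
}
```

These would be unpacked into the constructor, e.g. `BenchmarkRunner(functions, benchmark_space, setup_fn=my_setup, **DETERMINISTIC_FN_VARIABLE_DATA)`, where `my_setup` is a hypothetical setup function.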

Examples

>>> import time
>>> import numpy as np
>>> from rwskit.benchmarking import BenchmarkRunner
>>> sort_setup_fn = (
...    lambda array_size, dtype, unique_values:
...        np.random.randint(unique_values, size=array_size).astype(dtype)
... )

>>> b = BenchmarkRunner(functions={"fn1": lambda a: time.sleep(0.01),
...                                "fn2": lambda a: time.sleep(0.02)},
...                     benchmark_space={"array_size": [100, 10000],
...                                      "dtype": ["U", "int"],
...                                      "unique_values": [10, 100, 1000]},
...                     setup_fn=sort_setup_fn,
...                     time_unit="ms",
...                     float_fmt="0.3f")

>>> b()
function  array_size  unique_values    min   mean    std
--------------------------------------------------------
     fn1         100             10  1.040  1.053  0.008
     fn2         100             10  5.057  5.058  0.001
--------------------------------------------------------
     fn1         100            100  1.053  1.056  0.002
     fn2         100            100  5.057  5.058  0.001
--------------------------------------------------------
     fn1         100           1000  1.056  1.057  0.000
     fn2         100           1000  5.058  5.058  0.000
--------------------------------------------------------
     fn1       10000             10  1.056  1.057  0.000
     fn2       10000             10  5.058  5.059  0.000
--------------------------------------------------------
     fn1       10000            100  1.056  1.057  0.000
     fn2       10000            100  5.058  5.064  0.009
--------------------------------------------------------
     fn1       10000           1000  1.056  1.057  0.001
     fn2       10000           1000  5.063  5.066  0.002

functions

benchmark_space

setup_fn

use_single_setup = True

n_runs = 10

n_tests = 2

n_warm_ups = 1

test_agg_fn

run_label = 'run'

show_progress = False

verbose = True

time_unit: TimeUnit

float_fmt = '0.4e'

sort_by = 'min'

test_significance = True

__call__() → BenchmarkResult

Runs the benchmark.

Return type: BenchmarkResult

run() → BenchmarkResult

Runs the benchmark.

Return type: BenchmarkResult

class rwskit.benchmarking.BenchmarkResult(results: pandas.DataFrame, significance_results: pandas.DataFrame | None, benchmark_space: dict[str, list[T]], float_fmt: str = '0.4e', sort_by: BenchmarkSortValue = 'min', run_label: str = 'run', time_unit: TimeUnit = 's')

A class for managing the results output by a BenchmarkRunner.

Note

This class is not intended to be instantiated directly.

Parameters:

results (DataFrame) – The results DataFrame obtained by a BenchmarkRunner.
significance_results (DataFrame or None) – Pairwise t-test results, or None if no significance tests were run.
benchmark_space (dict[str, list[T]]) – The parameters used to benchmark the functions.
float_fmt (str) – A valid format string to use for floating point numbers.
sort_by (str {min, mean}) – The summary statistic to sort the results by when represented as a string.
run_label (str) – The label used to indicate the run number.
time_unit (TimeUnit) – The original time unit used to benchmark the results.

__repr__() → str

The full pandas.DataFrame containing all the runs, as a string.

Returns: The full benchmark results as a string.
Return type: str

__str__() → str

Returns a table of the summary statistics of the benchmark results as a string.

Returns: The summary statistics of the benchmark results as a string.
Return type: str

property benchmark_space: dict[str, list[T]]

Return the parameters used to benchmark the functions.

Returns: The parameters used to produce these results.
Return type: dict[str, list[T]]

results(wide: bool = True, as_time_unit: TimeUnit | None = None) → pandas.DataFrame

Get a pandas DataFrame containing the results.

Parameters:

as_time_unit (TimeUnit, optional) – Return the results in this time unit instead of the one used during the benchmark.
wide (bool, default = True) – If True return the results in the default wide format, which is easier to read. Otherwise, return the results in long format, which can be easier to use for plotting.

Returns:

The DataFrame containing the results, either in wide or long format.

Return type:

DataFrame

property significance_results: pandas.DataFrame

Return the results of the significance tests as a pandas.DataFrame.

Returns: The results of the significance tests as a pandas.DataFrame.
Return type: DataFrame

summary(wide: bool = True, as_time_unit: TimeUnit | None = None) → pandas.DataFrame

The summary statistics of the benchmark results.

Parameters:

wide (bool, default = True) – If True, return the results in wide format, otherwise, return the results in long format.
as_time_unit (TimeUnit, optional) – Return the results in this time unit instead of the one used during the benchmark.

Returns:

The DataFrame with the summary statistics.

Return type:

pd.DataFrame
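Conceptually, summary groups the individual runs by function and benchmark parameters and reduces each group to statistics like the min, mean and std columns in the example output above. The sketch below does the same over plain dictionaries. Both the row layout (function, parameter columns, time) and the use of the sample standard deviation are assumptions, not rwskit's documented behavior.

```python
import statistics
from collections import defaultdict
from typing import Any


def summarize(rows: list[dict[str, Any]], param_keys: list[str]) -> list[dict[str, Any]]:
    """Group run rows by function and parameters, then compute min/mean/std."""
    groups: dict[tuple, list[float]] = defaultdict(list)
    for row in rows:
        key = (row["function"], *(row[k] for k in param_keys))
        groups[key].append(row["time"])
    summary = []
    for (function, *values), times in groups.items():
        summary.append({
            "function": function,
            **dict(zip(param_keys, values)),
            "min": min(times),
            "mean": statistics.mean(times),
            # Sample standard deviation; defined only for two or more runs.
            "std": statistics.stdev(times) if len(times) > 1 else 0.0,
        })
    return summary
```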

plot(x_label: str = None, use_stat: Literal['min', 'mean'] = 'min', functions: Iterable[str] = None, show_points: bool = False, show_ribbon: bool = False, free_y: bool = False, theme_name: PlotTheme = None, figure_size: tuple[int, int] | None = None, as_time_unit: TimeUnit | None = None) → plotnine.ggplot

Create and return a ggplot object that visualizes the benchmark results.

Note

Use show or save on the resulting object to render or save it.

Parameters:

x_label (str, optional) – Label the x-axis with this value, if given. Otherwise, the first key in the benchmark_space will be used.
use_stat (str {min, mean}) – Which stat to plot.
functions (Iterable[str], optional) – If defined, limit the plot to these functions.
show_points (bool, default = False) – Show each individual run as a point on the graph.
show_ribbon (bool, default = False) – If True, include a ribbon that encompasses the minimum and maximum value over all the runs in a group.
free_y (bool, default = False) – If True and the benchmark_space includes more than 1 parameter, the y-axis is not constrained to be the same for each resulting plot.
theme_name (PlotTheme, optional) – The name of a theme to style your plot with, otherwise, it will use the default theme.
figure_size (tuple[int, int], optional) – Override the size of the resulting figure. If not specified and the benchmark_space has more than one parameter, a heuristic is used to try to keep each subplot legible.
as_time_unit (TimeUnit, optional) – Display the results in this time unit instead of the one used during the benchmarking.

Returns:

The plot object.

Return type:

plotnine.ggplot