API Reference

Histoptimizer

Histoptimizer provides a base class, Histoptimizer, a pure Python reference implementation that will suffice for small problem sets. As problem sizes grow, however, performance degrades quickly:

Time to Solve in Seconds, Pure Python Optimizer, AMD Ryzen 9 5900 12-Core Processor

num_items    5 buckets    10 buckets
100          0.005        0.027
1000         0.91         2.07
5000         22.995       51.541

Larger problem sets should use the NumbaOptimizer or CUDAOptimizer subclasses, depending on the available hardware.

class histoptimizer.Histoptimizer

Base class for objects implementing the Histoptimizer API.

Histoptimizer provides a basic, pure Python solution for the linear partition problem with a minimum-variance cost function.

The base implementation is a straightforward linear implementation of Skiena's dynamic programming algorithm for the linear partition problem, with a modified cost function.

See:

The Algorithm Design Manual by Steven Skiena. Springer, London, 2008.

Section 8.5: The Partition Problem, pp 294-297

See Also: https://www3.cs.stonybrook.edu/~algorith/video-lectures/1997/lecture11.pdf
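The following is a rough sketch, not the library's own code, of that recurrence: each bucket's cost is taken as its squared deviation from the mean bucket size, so minimizing the summed cost minimizes the variance of the bucket sums:

    from itertools import accumulate

    def partition_sketch(item_sizes, num_buckets):
        """Illustrative O(n^2 * k) dynamic program for minimum-variance partitioning."""
        prefix = [0.0] + list(accumulate(float(s) for s in item_sizes))
        n = len(item_sizes)
        mean = prefix[n] / num_buckets

        # cost[i][j]: minimal summed squared deviation when the first i items
        # fill j buckets; divider[i][j] remembers where the last bucket begins.
        inf = float("inf")
        cost = [[inf] * (num_buckets + 1) for _ in range(n + 1)]
        divider = [[0] * (num_buckets + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            cost[i][1] = (prefix[i] - mean) ** 2

        for j in range(2, num_buckets + 1):
            for i in range(j, n + 1):
                for x in range(j - 1, i):
                    candidate = cost[x][j - 1] + (prefix[i] - prefix[x] - mean) ** 2
                    if candidate < cost[i][j]:
                        cost[i][j] = candidate
                        divider[i][j] = x

        # Walk the divider table backwards to recover the divider locations.
        dividers, i = [], n
        for j in range(num_buckets, 1, -1):
            i = divider[i][j]
            dividers.append(i)
        return list(reversed(dividers)), cost[n][num_buckets] / num_buckets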

histoptimizer.Histoptimizer.partition(item_sizes, num_buckets: int, debug_info: dict | None = None) -> (List[int], numpy.float32)

Given a list of item sizes, partition the items into buckets evenly.

This function returns a set of partition indexes, or divider locations, such that dividing the given ordered set of items into “buckets” at these indexes results in a set of buckets with the lowest possible variance over the sum of the item sizes in each bucket.

Arguments:
item_sizes

An iterable of float or float-compatible values representing an ordered series of item sizes.

num_buckets

The number of buckets to partition the items into.

debug_info

A dictionary that can accept debug information.

Returns:

A tuple:

partition_locations

Index of dividers within items. Dividers come after the item in 0-based indexing and before the item in 1-based indexing.

min_variance

The variance of the solution defined by partition_locations
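A minimal usage sketch (the item sizes below are arbitrary illustration values):

    from histoptimizer import Histoptimizer

    item_sizes = [1.8, 3.2, 6.5, 2.1, 8.4, 3.7, 0.3, 5.2]

    # Divide the ordered items into 3 buckets with minimal variance
    # over the bucket sums.
    dividers, variance = Histoptimizer.partition(item_sizes, 3)
    print(f"dividers: {dividers}  variance: {variance:.4f}")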

histoptimizer.Histoptimizer.precompile()

Precompile any accelerator code used by this class.

Some implementations of the Histoptimizer API rely on the compilation of Python code to GPU or SIMD machine code. For these implementations, precompile will run a trivial problem set in order to trigger JIT compilation.

For the default implementation, this is a no-op.
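For example, when timing one of the accelerated subclasses, precompile can be called first so that JIT compilation is not charged to the first measured run. A sketch, assuming that precompile, like partition, can be called directly on the class:

    import time

    from histoptimizer.numba_optimizer import NumbaOptimizer

    # Trigger JIT compilation up front so it does not skew the timing below.
    NumbaOptimizer.precompile()

    item_sizes = [i % 250 + 1 for i in range(10_000)]  # arbitrary sizes
    start = time.perf_counter()
    dividers, variance = NumbaOptimizer.partition(item_sizes, 10)
    print(f"partitioned in {time.perf_counter() - start:.3f} seconds")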

NumbaOptimizer

The Numba-accelerated partitioner is very practical for what could well be very useful workload sizes, if there were any useful workloads:

Time to Solve in Seconds, Numba Optimizer, AMD Ryzen 9 5900 12-Core Processor

num_items    10 buckets    20 buckets    30 buckets
5000         0.125         0.256         0.424
10000        0.456         1.053         1.81
15000        1.052         2.38          4.304
20000        1.854         4.184         7.742
25000        2.869         6.559         12.097
30000        4.163         9.369         17.611

class histoptimizer.numba_optimizer.NumbaOptimizer

Numba JIT-based implementation of Histoptimizer.

NumbaOptimizer uses Numba to compile Python functions to native SIMD instructions, significantly improving speed over Histoptimizer.
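Because it implements the same class API, NumbaOptimizer is a drop-in replacement for the reference implementation; a brief sketch:

    from histoptimizer.numba_optimizer import NumbaOptimizer

    # Same call shape as Histoptimizer.partition; only the class differs.
    item_sizes = [i % 97 + 1 for i in range(20_000)]  # arbitrary sizes
    dividers, variance = NumbaOptimizer.partition(item_sizes, 20)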

CUDAOptimizer

The CUDA optimizer is the best-performing implementation by a wide margin, but even it starts to work hard by 1 million items.

Time to Solve in Seconds, CUDA Optimizer, GeForce RTX 3080

num_items    5 buckets    10 buckets    15 buckets
10000        0.035        0.032         0.047
100000       0.472        0.8           1.179
500000       7.718        16.621        27.279
1000000      31.35        85.078        278.634

class histoptimizer.cuda.CUDAOptimizer

GPU-based implementation of Skiena’s dynamic programming algorithm for the linear partition problem with variance cost function.
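CUDAOptimizer is used through the same class API. The sketch below assumes a CUDA-capable GPU and a working Numba CUDA toolchain, and uses numba.cuda.is_available() to fall back to the Numba implementation when no GPU is present:

    from numba import cuda

    from histoptimizer.cuda import CUDAOptimizer
    from histoptimizer.numba_optimizer import NumbaOptimizer

    optimizer = CUDAOptimizer if cuda.is_available() else NumbaOptimizer
    optimizer.precompile()  # warm up JIT/GPU compilation before timing

    item_sizes = [i % 1000 + 1 for i in range(100_000)]  # arbitrary sizes
    dividers, variance = optimizer.partition(item_sizes, 10)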

histoptimize

Histoptimizer provides a convenience API for dealing with Pandas DataFrames. The histoptimizer CLI is mostly a wrapper around this function.

histoptimizer.histoptimize(data: pandas.DataFrame, sizes: str, bucket_list: list, column_name: str, partitioner: object, optimal_only=False) -> (pandas.DataFrame, List[str])

Histoptimize takes a Pandas DataFrame and adds additional columns, one for each integer in bucket_list.

The additional columns are named column_name + str(bucket_list[i]) and contain, for each row, a bucket number such that the rows are distributed into the given number of buckets in a manner that minimizes the variance/standard deviation over all buckets.

Args:
data (DataFrame)

The DataFrame to add columns to.

sizes (str)

Column to get size values from.

bucket_list (list)

A list of integer bucket sizes.

column_name (str)

Prefix to be added to the number of buckets to get the column name.

partitioner (class)

Class that implements the Histoptimizer API.

optimal_only (bool)

If true, add only one column, for the number of buckets with the lowest variance.

Returns:

A tuple of values:

DataFrame

Original DataFrame with one or more columns added.

list(str)

List of column names added to the original DataFrame
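For example, to distribute the rows of a DataFrame into three and four buckets by a size column (the column and value names below are purely illustrative):

    import pandas as pd

    from histoptimizer import Histoptimizer, histoptimize

    books = pd.DataFrame({
        "title": ["A", "B", "C", "D", "E", "F"],
        "pages": [150, 820, 310, 440, 270, 990],
    })

    books, added_columns = histoptimize(
        data=books,
        sizes="pages",
        bucket_list=[3, 4],
        column_name="shelf_",
        partitioner=Histoptimizer,
    )
    print(added_columns)  # expected to be something like ['shelf_3', 'shelf_4']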

benchmark

The histobench CLI is a wrapper around the benchmark function, which may itself be of use to you if you are, for some reason, benchmarking competing partitioner implementations.

histoptimizer.benchmark.benchmark(partitioner_list: list, item_list: list, bucket_list: list, iterations: int = 1, begin_range: int = 1, end_range: int = 10, specified_items_sizes: list | None = None, verbose: bool = False, include_items_in_results: bool = False) -> DataFrame

Benchmark a given list of partitioners against the same data.

The caller can specify the item counts and bucket counts to test, the number of iterations for each combination, and either a range from which random item sizes are drawn or an explicit list of item sizes.

Args:
partitioner_list

List of partitioner functions to benchmark.

item_list

A list of item counts to benchmark.

bucket_list

A list of bucket counts to benchmark.

iterations

Number of iterations to test each item_list x bucket_list combination.

begin_range

For random item generation, the lower bound of the random size values.

end_range

For random item generation, the upper bound of the random size values.

specified_items_sizes

An ordered list of item sizes. Must be as long as the max value of item_list.

verbose

If true, log debugging information.

include_items_in_results

If true, include the items used in the test in every result row.

Returns:
pandas.DataFrame: A DataFrame containing one row for each partitioner x item size x bucket size x iteration combination.

Each row contains the following columns:

partitioner (str)

Name of the partitioner used in this run.

num_items

Number of items in this run.

buckets

Number of buckets in this run.

iteration

Iteration number for this run.

item_set_id

32-character hexadecimal form of a random UUID generated when the items used in the problem were generated.

variance

Variance of the discovered solution.

elapsed_seconds

Number of seconds to find the solution.

dividers (list)

Divider locations for the optimal solution.

items

List of item sizes for this run, included if include_items_in_results is True.
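A sketch of a small benchmark run comparing the pure-Python and Numba implementations, assuming (as with histoptimize) that the entries of partitioner_list are classes implementing the Histoptimizer API:

    from histoptimizer import Histoptimizer
    from histoptimizer.benchmark import benchmark
    from histoptimizer.numba_optimizer import NumbaOptimizer

    results = benchmark(
        partitioner_list=[Histoptimizer, NumbaOptimizer],
        item_list=[1000, 2000],
        bucket_list=[5, 10],
        iterations=3,
    )
    print(results[["partitioner", "num_items", "buckets", "elapsed_seconds"]])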


partitioner_pivot is a small function that pivots the time results for a particular partitioner out of the results DataFrame returned by benchmark.

histoptimizer.benchmark.partitioner_pivot(df: DataFrame, partitioner) -> DataFrame

Given a DataFrame of results produced by histobench, and the name of a partitioner, isolate results of that partitioner and create a pivot table on bucket, so that the frame becomes a grid with # items in rows and # buckets in columns.

Args:
df (DataFrame)

Results DataFrame.

partitioner (str)

partitioner name to isolate.

Returns:

Pivoted DataFrame with performance data for the specified partitioner.
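Continuing the benchmark sketch above, each partitioner's timings can be pulled out into an items-by-buckets grid:

    from histoptimizer.benchmark import partitioner_pivot

    # `results` is the DataFrame returned by benchmark() in the sketch above.
    for name in results["partitioner"].unique():
        print(name)
        print(partitioner_pivot(results, name))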