API Reference¶
Histoptimizer¶
Histoptimizer provides a base class, Histoptimizer, a pure-Python reference implementation that will suffice for small problem sets. As problem sizes grow, however, performance falls off quickly (times in seconds):
| num_items | 5 buckets | 10 buckets |
|---|---|---|
| 100 | 0.005 | 0.027 |
| 1000 | 0.91 | 2.07 |
| 5000 | 22.995 | 51.541 |
Larger problem sets should use the NumbaOptimizer or CUDAOptimizer subclasses, depending on the available hardware.
- class histoptimizer.Histoptimizer¶
Base class for objects implementing the Histoptimizer API.
Histoptimizer provides a basic, pure-Python solution for the linear partition problem with a minimum-variance cost function.
The base implementation is a straightforward linear implementation of Skiena’s dynamic programming algorithm for the linear partition problem, with a modified cost function.
See:
- The Algorithm Design Manual by Steven Skiena. Springer, London, 2008.
Section 8.5: The Partition Problem, pp 294-297
See Also: https://www3.cs.stonybrook.edu/~algorith/video-lectures/1997/lecture11.pdf
- histoptimizer.Histoptimizer.partition(item_sizes, num_buckets: int, debug_info: dict | None = None) -> (List[int], numpy.float32)¶
Given a list of item sizes, partition the items into buckets evenly.
This function returns a set of partition indexes, or divider locations, such that dividing the given ordered set of items into “buckets” at these indexes results in a set of buckets with the lowest possible variance over the sum of the item sizes in each bucket.
- Arguments:
- item_sizes
An iterable of float or float-compatible values representing an ordered series of item sizes.
- num_buckets
The number of buckets to partition the items into.
- debug_info
A dictionary that can accept debug information.
- Returns:
A tuple:
- partition_locations
Index of dividers within items. Dividers come after the item in 0-based indexing and before the item in 1-based indexing.
- min_variance
The variance of the solution defined by partition_locations
- histoptimizer.Histoptimizer.precompile()¶
Precompile any accelerator code used by this class.
Some implementations of the Histoptimizer API rely on the compilation of Python code to GPU or SIMD machine code. For these implementations, precompile() runs a trivial problem set in order to trigger JIT compilation.
For the default implementation, this is a no-op.
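To illustrate the algorithm the base class implements, here is a pure-Python sketch of Skiena’s dynamic programming recurrence with a minimum-variance cost function (sum of squared deviations of bucket sums from the mean, divided by the bucket count). This is an illustrative reimplementation, not the library’s actual code; for real use, call Histoptimizer.partition.

```python
def partition_sketch(item_sizes, num_buckets):
    """Sketch of the DP for the linear partition problem with a
    minimum-variance cost. Returns (divider_locations, variance)."""
    n = len(item_sizes)
    prefix = [0.0]
    for size in item_sizes:
        prefix.append(prefix[-1] + size)
    mean = prefix[n] / num_buckets

    INF = float("inf")
    # cost[j][i]: minimal sum of squared deviations splitting the first
    # i items into j buckets; back[j][i] records the last divider used.
    cost = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    back = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for j in range(1, num_buckets + 1):
        for i in range(j, n + 1):
            for x in range(j - 1, i):
                c = cost[j - 1][x] + (prefix[i] - prefix[x] - mean) ** 2
                if c < cost[j][i]:
                    cost[j][i] = c
                    back[j][i] = x

    # Walk back pointers to recover divider locations: a divider at
    # position d means the first d items fall before it.
    dividers = []
    i = n
    for j in range(num_buckets, 1, -1):
        i = back[j][i]
        dividers.append(i)
    dividers.reverse()
    return dividers, cost[num_buckets][n] / num_buckets
```

For example, `partition_sketch([1, 1, 1, 1, 4], 2)` places a single divider after the first four items, giving two buckets that each sum to 4 and a variance of zero.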
NumbaOptimizer¶
The Numba-accelerated partitioner is practical for much larger workload sizes (times in seconds):
| num_items | 10 buckets | 20 buckets | 30 buckets |
|---|---|---|---|
| 5000 | 0.125 | 0.256 | 0.424 |
| 10000 | 0.456 | 1.053 | 1.81 |
| 15000 | 1.052 | 2.38 | 4.304 |
| 20000 | 1.854 | 4.184 | 7.742 |
| 25000 | 2.869 | 6.559 | 12.097 |
| 30000 | 4.163 | 9.369 | 17.611 |
- class histoptimizer.numba_optimizer.NumbaOptimizer¶
Numba JIT-based implementation of Histoptimizer.
NumbaOptimizer uses Numba to compile Python functions to native SIMD instructions, significantly improving speed over Histoptimizer.
CUDAOptimizer¶
The CUDA optimizer is by far the best-performing implementation, though it, too, begins to work hard by 1 million items (times in seconds):
| num_items | 5 buckets | 10 buckets | 15 buckets |
|---|---|---|---|
| 10000 | 0.035 | 0.032 | 0.047 |
| 100000 | 0.472 | 0.8 | 1.179 |
| 500000 | 7.718 | 16.621 | 27.279 |
| 1000000 | 31.35 | 85.078 | 278.634 |
- class histoptimizer.cuda.CUDAOptimizer¶
GPU-based implementation of Skiena’s dynamic programming algorithm for the linear partition problem with variance cost function.
histoptimize¶
Histoptimizer provides a convenience API for dealing with Pandas DataFrames.
The histoptimizer CLI is mostly a wrapper around this function.
- histoptimizer.histoptimize(data: DataFrame, sizes: str, bucket_list: list, column_name: str, partitioner: object, optimal_only=False) -> (DataFrame, List[str])¶
Histoptimize takes a Pandas DataFrame and adds one new column for each integer in bucket_list.
Each added column is named column_name + str(bucket_list[i]) and contains, for each row, a bucket number such that the rows are distributed into the given number of buckets so as to minimize the variance/standard deviation over all buckets.
- Args:
- data (DataFrame)
The DataFrame to add columns to.
- sizes (str)
Column to get size values from.
- bucket_list (list)
A list of integer bucket sizes.
- column_name (str)
Prefix to be added to the number of buckets to get the column name.
- partitioner (class)
Class that implements the Histoptimizer API.
- optimal_only (bool)
If true, add only one column, for the number of buckets with the lowest variance.
- Returns:
A tuple of values:
- DataFrame
Original DataFrame with one or more columns added.
- list(str)
List of column names added to the original DataFrame
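The column-naming and bucket-numbering convention described above can be illustrated with plain pandas. In the sketch below the divider positions are hard-coded for clarity; histoptimize computes optimal ones using the supplied partitioner.

```python
import pandas as pd

# A toy frame; "size" plays the role of the sizes column.
df = pd.DataFrame({"name": list("abcdef"), "size": [1, 1, 2, 2, 3, 3]})

def assign_buckets(n_rows, dividers):
    """Turn divider positions into per-row bucket numbers (1-based).
    A divider at position d means the first d rows fall before it."""
    buckets, bucket, pos = [], 1, 0
    for i in range(n_rows):
        if pos < len(dividers) and i == dividers[pos]:
            bucket += 1
            pos += 1
        buckets.append(bucket)
    return buckets

# Column name = prefix + bucket count, as histoptimize does with
# column_name + str(bucket_list[i]). Divider [3] is hard-coded here.
df["bucket_2"] = assign_buckets(len(df), [3])
```

With the divider after row 3, rows a–c land in bucket 1 and rows d–f in bucket 2, each bucket summing to 4.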
benchmark¶
The histobench CLI is a wrapper around the benchmark function, which may itself be useful if you are benchmarking competing partitioner implementations.
- histoptimizer.benchmark.benchmark(partitioner_list: list, item_list: list, bucket_list: list, iterations: int = 1, begin_range: int = 1, end_range: int = 10, specified_items_sizes: list | None = None, verbose: bool = False, include_items_in_results: bool = False) -> DataFrame¶
Benchmark a given list of partitioners against the same data.
- Args:
- partitioner_list
List of partitioner functions to benchmark.
- item_list
A list of item counts to benchmark.
- bucket_list
A list of bucket counts to benchmark.
- iterations
Number of iterations to test each item_list x bucket_list combination.
- begin_range
For random item generation, the lower bound of the random size values.
- end_range
For random item generation, the upper bound of the random size values.
- specified_items_sizes
An ordered list of item sizes. Must be as long as the max value of item_list.
- verbose
If true, log debugging information.
- include_items_in_results
If true, include the items used in the test in every result row.
- Returns:
- pandas.DataFrame: DataFrame containing one row for each
partitioner x item size x bucket size x iteration.
Each row contains the following columns:
- partitioner (str)
Name of the partitioner used in this run.
- num_items
Number of items in this run.
- buckets
Number of buckets in this run.
- iteration
Iteration number for this run.
- item_set_id
32-character hex form of a random UUID generated when the items used in the problem were generated.
- variance
Variance of the discovered solution.
- elapsed_seconds
Number of seconds to find the solution.
- dividers (list)
Divider locations for the optimal solution.
- items
List of item sizes for this run, if include_items_in_results is True.
partitioner_pivot is a small function that pivots the timing results for a particular partitioner out of the results DataFrame returned by benchmark.
- histoptimizer.benchmark.partitioner_pivot(df: DataFrame, partitioner) -> DataFrame¶
Given a DataFrame of results produced by histobench, and the name of a partitioner, isolate results of that partitioner and create a pivot table on bucket, so that the frame becomes a grid with # items in rows and # buckets in columns.
- Args:
- df (DataFrame)
Results DataFrame.
- partitioner (str)
Partitioner name to isolate.
- Returns:
Pivoted DataFrame with performance data for the specified partitioner.
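The effect of partitioner_pivot can be sketched with plain pandas. The results frame below is synthetic, with column names taken from the benchmark documentation above; the helper shown is an illustrative equivalent, not the library’s implementation.

```python
import pandas as pd

# Synthetic results frame shaped like benchmark() output.
results = pd.DataFrame({
    "partitioner": ["numba"] * 4 + ["cuda"] * 4,
    "num_items": [100, 100, 200, 200] * 2,
    "buckets": [5, 10, 5, 10] * 2,
    "elapsed_seconds": [0.1, 0.2, 0.3, 0.5, 0.01, 0.02, 0.03, 0.05],
})

def partitioner_pivot_sketch(df, partitioner):
    """Isolate one partitioner's rows and pivot into a grid with
    num_items in rows and bucket counts in columns."""
    subset = df[df["partitioner"] == partitioner]
    return subset.pivot(index="num_items", columns="buckets",
                        values="elapsed_seconds")

grid = partitioner_pivot_sketch(results, "numba")
```

The resulting grid has the same shape as the benchmark tables earlier on this page: one row per item count, one column per bucket count, timings in the cells.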