Histoptimizer Quickstart

Base API

Histoptimizer provides two APIs. The lower-level interface takes a list or NumPy array of floating-point values and a number of buckets, and returns two values: the optimal divider locations and the variance of the resulting bucket totals:

[1]:
from histoptimizer import Histoptimizer

item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]

# Get the optimal position of two dividers that partition the list above into 3 buckets.
(dividers, variance) = Histoptimizer.partition(item_sizes, 3)

print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")
Optimal Divider Locations: [5 9] Optimal solution variance: 6.842
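
To see what the two return values mean, the result above can be checked by hand. The following is a minimal sketch using only the standard library, not Histoptimizer itself. It assumes that a divider value d splits the list just before index d, and that the reported variance is the population variance of the bucket sums; both assumptions are inferred from the printed output, which they reproduce. The helper name bucket_sums is hypothetical.

```python
from statistics import pvariance

item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]
dividers = [5, 9]  # copied from the output above


def bucket_sums(sizes, dividers):
    """Sum each contiguous bucket (assumption: divider d splits before index d)."""
    edges = [0, *dividers, len(sizes)]
    return [sum(sizes[lo:hi]) for lo, hi in zip(edges, edges[1:])]


sums = bucket_sums(item_sizes, dividers)
variance = pvariance(sums)  # population variance of the three bucket totals

print([round(s, 1) for s in sums])  # [22.3, 17.8, 24.0]
print(f"{variance:.4}")             # 6.842
```

The three totals are as close to each other as any choice of two dividers allows, which is what "optimal" means here.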

Histoptimizer is a pure Python implementation and is slow. For better performance, try the Numba JIT-accelerated implementation, NumbaOptimizer. The API is the same:

[2]:
from histoptimizer.numba_optimizer import NumbaOptimizer

(dividers, variance) = NumbaOptimizer.partition(item_sizes, 3)

print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")
Optimal Divider Locations: [5 9] Optimal solution variance: 6.842

If you have an NVIDIA GPU and have installed the CUDA toolkit, you can use the CUDA-based CUDAOptimizer:

[3]:
from histoptimizer.cuda import CUDAOptimizer

(dividers, variance) = CUDAOptimizer.partition(item_sizes, 3)

print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")
Optimal Divider Locations: [5 9] Optimal solution variance: 6.842

NumbaOptimizer and CUDAOptimizer are compiled just in time the first time they are invoked, so the first call includes compilation time and can skew benchmarks. Both classes provide a static method, precompile, that solves a small problem instance to trigger compilation ahead of time.
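
One way to keep compilation out of a measurement is to run the warm-up step before the timer starts. The sketch below is a generic timing helper, not part of Histoptimizer; time_call and workload are hypothetical names, and the stand-in workload is there only so the example runs without Numba or CUDA installed.

```python
import time


def time_call(fn, *args, warmup=None, repeats=3):
    """Return the best wall-clock time of fn(*args) over `repeats` runs.

    If `warmup` is given, it is called once before timing and is not measured.
    """
    if warmup is not None:
        warmup()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def workload(n):
    """Stand-in for an optimizer call (hypothetical placeholder)."""
    return sum(i * i for i in range(n))


elapsed = time_call(workload, 10_000)
print(f"best of 3 runs: {elapsed:.6f} s")
```

With Histoptimizer installed, the equivalent call would presumably be time_call(NumbaOptimizer.partition, item_sizes, 3, warmup=NumbaOptimizer.precompile), assuming precompile can be called with no arguments.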

Pandas API

Histoptimizer also provides a function, histoptimize, that takes a Pandas DataFrame and a list of bucket counts, and adds one column to the DataFrame per bucket count. Each added column assigns every item its bucket number in the optimal partition for that count.

For example, consider the following DataFrame, books:

[4]:
from histoptimizer import histoptimize
import pandas as pd

books = pd.read_csv('books.csv', header=0)
books
[4]:
    Title                           Pages
 0  The Algorithm Design Manual       748
 1  Software Engineering at Google    599
 2  Site Reliability Engineering      550
 3  Hands-on Machine Learning         850
 4  Clean Code                        464
 5  Code Complete                     960
 6  Web Operations                    338
 7  Consciousness Explained           528
 8  I am a Strange Loop               432
 9  The Information                   544
10  The Fractal Geometry of Nature    500
11  Consider Phlebas                  544
12  I Heart Logs                       60
13  Kraken                            528
14  Noise                             464
15  Snow Crash                        440

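To find the optimal division of the books into 3 buckets by total page count, pass the DataFrame, the name of the size column, the list of bucket counts, a prefix for the new column names, and the optimizer class to use: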

[5]:
divisions, column_names = histoptimize(books, "Pages", [3], "assistant_", Histoptimizer)
print(divisions.to_markdown())
|    | Title                          |   Pages |   assistant_3 |
|---:|:-------------------------------|--------:|--------------:|
|  0 | The Algorithm Design Manual    |     748 |             1 |
|  1 | Software Engineering at Google |     599 |             1 |
|  2 | Site Reliability Engineering   |     550 |             1 |
|  3 | Hands-on Machine Learning      |     850 |             1 |
|  4 | Clean Code                     |     464 |             2 |
|  5 | Code Complete                  |     960 |             2 |
|  6 | Web Operations                 |     338 |             2 |
|  7 | Consciousness Explained        |     528 |             2 |
|  8 | I am a Strange Loop            |     432 |             2 |
|  9 | The Information                |     544 |             3 |
| 10 | The Fractal Geometry of Nature |     500 |             3 |
| 11 | Consider Phlebas               |     544 |             3 |
| 12 | I Heart Logs                   |      60 |             3 |
| 13 | Kraken                         |     528 |             3 |
| 14 | Noise                          |     464 |             3 |
| 15 | Snow Crash                     |     440 |             3 |
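
As a sanity check on the table above, the page totals per bucket can be added up directly. This is a plain-Python sketch with the Pages and assistant_3 columns copied from the output; it does not call Histoptimizer.

```python
# Pages and bucket numbers copied from the table above, in row order.
pages = [748, 599, 550, 850, 464, 960, 338, 528,
         432, 544, 500, 544, 60, 528, 464, 440]
buckets = [1] * 4 + [2] * 5 + [3] * 7

# Accumulate the total page count of each bucket.
totals = {}
for page_count, bucket in zip(pages, buckets):
    totals[bucket] = totals.get(bucket, 0) + page_count

print(totals)  # {1: 2747, 2: 2722, 3: 3080}
```

With the DataFrame at hand, divisions.groupby("assistant_3")["Pages"].sum() should give the same totals.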

If we wish, we can obtain the optimal divisions into both 3 and 5 buckets in a single call by passing both bucket counts:

[6]:
divisions, column_names = histoptimize(books, "Pages", [3, 5], "assistant_", Histoptimizer)
divisions
[6]:
    Title                           Pages  assistant_3  assistant_5
 0  The Algorithm Design Manual       748            1            1
 1  Software Engineering at Google    599            1            1
 2  Site Reliability Engineering      550            1            2
 3  Hands-on Machine Learning         850            1            2
 4  Clean Code                        464            2            2
 5  Code Complete                     960            2            3
 6  Web Operations                    338            2            3
 7  Consciousness Explained           528            2            3
 8  I am a Strange Loop               432            2            4
 9  The Information                   544            3            4
10  The Fractal Geometry of Nature    500            3            4
11  Consider Phlebas                  544            3            4
12  I Heart Logs                       60            3            5
13  Kraken                            528            3            5
14  Noise                             464            3            5
15  Snow Crash                        440            3            5
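
The bucket columns relate to the divider lists of the Base API in a simple way. The sketch below uses a hypothetical helper, bucket_numbers, which is not part of Histoptimizer, and assumes that a divider d splits the items just before index d. The divider lists [4, 9] and [2, 5, 8, 12] are read off the table above under that assumption, not returned by the library.

```python
from bisect import bisect_right


def bucket_numbers(n_items, dividers):
    """Label items 0..n_items-1 with 1-based bucket numbers.

    Assumption: a divider d splits the sequence just before index d, so an
    item's bucket is one more than the count of dividers at or before its index.
    """
    return [bisect_right(dividers, i) + 1 for i in range(n_items)]


three = bucket_numbers(16, [4, 9])
five = bucket_numbers(16, [2, 5, 8, 12])

print(three)  # [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
print(five)   # [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]
```

The two lists match the assistant_3 and assistant_5 columns row for row.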