Histoptimizer Quickstart
Base API
Histoptimizer provides two APIs. The lower-level interface takes a list or NumPy array of floating-point values and returns two values: the optimal divider locations and the variance they achieve:
[1]:
from histoptimizer import Histoptimizer
item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]
# Get the optimal position of two dividers that partition the list above into 3 buckets.
(dividers, variance) = Histoptimizer.partition(item_sizes, 3)
print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")
Optimal Divider Locations: [5 9] Optimal solution variance: 6.842
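To make the result concrete, the sketch below splits the list at the returned divider positions and recomputes the variance by hand. It assumes that a divider value `d` places a cut immediately before the item at 0-based index `d`, and that the reported figure is the population variance of the bucket sums. Both assumptions are inferred from the output above, not taken from the library documentation, and `bucket_sums` is a hypothetical helper, not part of Histoptimizer.

```python
from statistics import pvariance

item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]


def bucket_sums(items, dividers):
    """Sum each contiguous bucket produced by cutting `items` at `dividers`."""
    # Assumption: divider d places a cut immediately before items[d].
    edges = [0, *dividers, len(items)]
    return [sum(items[start:end]) for start, end in zip(edges, edges[1:])]


sums = bucket_sums(item_sizes, [5, 9])
print([round(s, 1) for s in sums])  # bucket totals: [22.3, 17.8, 24.0]
print(f"{pvariance(sums):.4}")      # 6.842, matching the value reported above
```

Under these assumptions the three buckets hold 5, 4, and 5 items, and their totals stay within a few units of the mean bucket size of about 21.4.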
`Histoptimizer` is a pure Python implementation, and slow. For improved performance, try the Numba JIT-accelerated implementation, `NumbaOptimizer`. The API is the same:
[2]:
from histoptimizer.numba_optimizer import NumbaOptimizer
(dividers, variance) = NumbaOptimizer.partition(item_sizes, 3)
print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")
Optimal Divider Locations: [5 9] Optimal solution variance: 6.842
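Since the implementations are expected to agree, a brute-force reference can serve as a cross-check on small inputs. The following is a minimal sketch, not the algorithm the library uses: it tries every placement of the dividers and keeps the one with the lowest population variance of bucket sums. The function name `brute_force_partition`, the variance definition, and the convention that a divider `d` cuts just before index `d` are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import pvariance


def brute_force_partition(items, buckets):
    """Try every placement of buckets - 1 dividers; return (dividers, variance)."""
    best = None
    for dividers in combinations(range(1, len(items)), buckets - 1):
        edges = (0, *dividers, len(items))
        sums = [sum(items[a:b]) for a, b in zip(edges, edges[1:])]
        candidate = (pvariance(sums), dividers)
        # Keep the placement with the smallest variance seen so far.
        if best is None or candidate < best:
            best = candidate
    variance, dividers = best
    return list(dividers), variance


item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]
dividers, variance = brute_force_partition(item_sizes, 3)
print(dividers, f"{variance:.4}")  # [5, 9] 6.842
```

The number of placements grows combinatorially with the list length and bucket count, so this is only practical for checking results on short lists.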
If you have an NVIDIA GPU and have installed the CUDA toolkit, you can use the CUDA-based `CUDAOptimizer`:
[3]:
from histoptimizer.cuda import CUDAOptimizer
(dividers, variance) = CUDAOptimizer.partition(item_sizes, 3)
print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")
Optimal Divider Locations: [5 9] Optimal solution variance: 6.842
`NumbaOptimizer` and `CUDAOptimizer` require just-in-time compilation the first time they are invoked, which can distort benchmarks. Both classes provide a `precompile` static method that solves a small problem instance to trigger compilation ahead of time.
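The effect on benchmarks can be avoided with a warm-up call before timing starts. The sketch below shows that pattern in a generic form: `time_after_warmup` is a hypothetical helper, and `stand_in_solver` is a placeholder workload so that the example runs without Numba or CUDA installed.

```python
import time


def time_after_warmup(solve, *args, repeats=3):
    """Return the best wall-clock time of solve(*args), excluding a warm-up call."""
    solve(*args)  # warm-up: any one-time JIT compilation cost is paid here
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        solve(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def stand_in_solver(items, buckets):
    # Placeholder workload; it does not solve the partition problem.
    return sorted(items)[:buckets]


elapsed = time_after_warmup(stand_in_solver, [3.0, 1.0, 2.0], 2)
print(f"best of 3 runs: {elapsed:.6f} s")
```

With the library installed, passing `NumbaOptimizer.partition`, `item_sizes`, and `3` in place of the stand-in arguments should measure the compiled code path only, since the first call absorbs the compilation.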
Pandas API
Histoptimizer also provides a function, `histoptimize`, that takes a Pandas DataFrame and a list of bucket counts, and adds one column to the DataFrame per bucket count, assigning each item its optimal bucket number for that count.
For example, consider the following DataFrame, `books`:
[4]:
from histoptimizer import histoptimize
import pandas as pd
books = pd.read_csv('books.csv', header=0)
books
[4]:
| | Title | Pages |
---|---|---|
0 | The Algorithm Design Manual | 748 |
1 | Software Engineering at Google | 599 |
2 | Site Reliability Engineering | 550 |
3 | Hands-on Machine Learning | 850 |
4 | Clean Code | 464 |
5 | Code Complete | 960 |
6 | Web Operations | 338 |
7 | Consciousness Explained | 528 |
8 | I am a Strange Loop | 432 |
9 | The Information | 544 |
10 | The Fractal Geometry of Nature | 500 |
11 | Consider Phlebas | 544 |
12 | I Heart Logs | 60 |
13 | Kraken | 528 |
14 | Noise | 464 |
15 | Snow Crash | 440 |
We can find the optimal division of the books into 3 buckets, balanced by total page count, as follows:
[5]:
divisions, column_names = histoptimize(books, "Pages", [3], "assistant_", Histoptimizer)
print(divisions.to_markdown())
| | Title | Pages | assistant_3 |
|---:|:-------------------------------|--------:|--------------:|
| 0 | The Algorithm Design Manual | 748 | 1 |
| 1 | Software Engineering at Google | 599 | 1 |
| 2 | Site Reliability Engineering | 550 | 1 |
| 3 | Hands-on Machine Learning | 850 | 1 |
| 4 | Clean Code | 464 | 2 |
| 5 | Code Complete | 960 | 2 |
| 6 | Web Operations | 338 | 2 |
| 7 | Consciousness Explained | 528 | 2 |
| 8 | I am a Strange Loop | 432 | 2 |
| 9 | The Information | 544 | 3 |
| 10 | The Fractal Geometry of Nature | 500 | 3 |
| 11 | Consider Phlebas | 544 | 3 |
| 12 | I Heart Logs | 60 | 3 |
| 13 | Kraken | 528 | 3 |
| 14 | Noise | 464 | 3 |
| 15 | Snow Crash | 440 | 3 |
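To see how balanced the result is, the page totals per bucket can be tallied from the table above. This sketch copies the `Pages` and `assistant_3` columns into plain lists so that it runs without Pandas or Histoptimizer; the variable names are chosen for illustration.

```python
from collections import defaultdict

# Values copied from the Pages and assistant_3 columns of the table above.
pages = [748, 599, 550, 850, 464, 960, 338, 528, 432, 544, 500, 544, 60, 528, 464, 440]
assistant_3 = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

totals = defaultdict(int)
for page_count, bucket in zip(pages, assistant_3):
    totals[bucket] += page_count  # accumulate pages per bucket number

print(dict(totals))  # {1: 2747, 2: 2722, 3: 3080}
```

On the DataFrame itself, the same totals should be obtainable with `divisions.groupby("assistant_3")["Pages"].sum()`.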
If we wish, we can obtain the optimal divisions into both 3 and 5 buckets in a single call by passing both bucket counts:
[6]:
divisions, column_names = histoptimize(books, "Pages", [3, 5], "assistant_", Histoptimizer)
divisions
[6]:
| | Title | Pages | assistant_3 | assistant_5 |
---|---|---|---|---|
0 | The Algorithm Design Manual | 748 | 1 | 1 |
1 | Software Engineering at Google | 599 | 1 | 1 |
2 | Site Reliability Engineering | 550 | 1 | 2 |
3 | Hands-on Machine Learning | 850 | 1 | 2 |
4 | Clean Code | 464 | 2 | 2 |
5 | Code Complete | 960 | 2 | 3 |
6 | Web Operations | 338 | 2 | 3 |
7 | Consciousness Explained | 528 | 2 | 3 |
8 | I am a Strange Loop | 432 | 2 | 4 |
9 | The Information | 544 | 3 | 4 |
10 | The Fractal Geometry of Nature | 500 | 3 | 4 |
11 | Consider Phlebas | 544 | 3 | 4 |
12 | I Heart Logs | 60 | 3 | 5 |
13 | Kraken | 528 | 3 | 5 |
14 | Noise | 464 | 3 | 5 |
15 | Snow Crash | 440 | 3 | 5 |
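The bucket columns carry the same information as the divider positions returned by the lower-level API. As a rough sketch, the hypothetical helper `dividers_from_labels` below recovers divider positions from a bucket column by finding the row indices where the bucket number changes. It assumes that a divider is expressed as the 0-based index of the first item in the following bucket, which is an inference and not documented behavior.

```python
def dividers_from_labels(labels):
    """Return the 0-based indices at which the bucket number changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Values copied from the assistant_3 and assistant_5 columns of the table above.
assistant_3 = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
assistant_5 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]

print(dividers_from_labels(assistant_3))  # [4, 9]
print(dividers_from_labels(assistant_5))  # [2, 5, 8, 12]
```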