Histoptimizer Quickstart

Base API

Histoptimizer provides two APIs. The lower-level interface takes a list or NumPy array of floating-point values and a number of buckets, and returns two values: the optimal divider locations and the variance of the bucket sums they achieve:

[1]:

from histoptimizer import Histoptimizer

item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]

# Get the optimal position of two dividers that partition the list above into 3 buckets.
(dividers, variance) = Histoptimizer.partition(item_sizes, 3)

print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")

Optimal Divider Locations: [5 9] Optimal solution variance: 6.842
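
To see what the two return values mean, the result above can be checked by hand. The following is a minimal sketch using only the standard library. It assumes that each divider is the index at which a new bucket starts and that the reported variance is the population variance of the bucket sums, an interpretation that reproduces the printed 6.842. The helper name bucket_sums is illustrative and not part of Histoptimizer.

```python
from statistics import pvariance

item_sizes = [1.0, 4.5, 6.3, 2.1, 8.4, 3.7, 8.6, 0.3, 5.2, 6.9, 1.2, 2.4, 9.8, 3.7]
dividers = [5, 9]  # divider locations printed above


def bucket_sums(sizes, dividers):
    """Sum each contiguous bucket; a divider d starts a new bucket at index d.

    Hypothetical helper for illustration, not a Histoptimizer function.
    """
    edges = [0, *dividers, len(sizes)]
    return [sum(sizes[start:end]) for start, end in zip(edges, edges[1:])]


sums = bucket_sums(item_sizes, dividers)
variance = pvariance(sums)  # population variance of the three bucket sums

print([round(s, 1) for s in sums])
print(f"{variance:.4}")
```

Under these assumptions the bucket sums come out to roughly 22.3, 17.8 and 24.0, and their variance rounds to 6.842, matching the output above.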

Histoptimizer is a pure Python implementation and is therefore slow. For better performance, try NumbaOptimizer, the Numba JIT-accelerated implementation. The API is the same:

[2]:

from histoptimizer.numba_optimizer import NumbaOptimizer

(dividers, variance) = NumbaOptimizer.partition(item_sizes, 3)

print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")

Optimal Divider Locations: [5 9] Optimal solution variance: 6.842

If you have an NVIDIA GPU and have installed the CUDA toolkit, you can use the CUDA-based CUDAOptimizer:

[3]:

from histoptimizer.cuda import CUDAOptimizer

(dividers, variance) = CUDAOptimizer.partition(item_sizes, 3)

print(f"Optimal Divider Locations: {dividers} Optimal solution variance: {variance:.4}")

Optimal Divider Locations: [5 9] Optimal solution variance: 6.842

NumbaOptimizer and CUDAOptimizer are compiled just in time the first time they are invoked, which can skew benchmarks. Both classes provide a precompile static method that solves a small problem instance to trigger compilation ahead of time.
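
A benchmark that accounts for this might look like the sketch below. It is a generic timing helper, not part of Histoptimizer: timed_call and FakeJitFunction are hypothetical names, and the stand-in class only mimics a one-time compilation cost so that the example runs without Numba or CUDA installed.

```python
import time


def timed_call(fn, *args, warmup=None):
    """Return (result, seconds) for fn(*args), optionally warming up first.

    Hypothetical helper; `warmup` is any zero-argument callable that
    triggers one-time setup such as JIT compilation.
    """
    if warmup is not None:
        warmup()
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


class FakeJitFunction:
    """Stand-in for a JIT-compiled function: the first call pays a one-time cost."""

    def __init__(self):
        self.compiled = False

    def __call__(self, values):
        if not self.compiled:
            time.sleep(0.05)  # simulated compilation delay
            self.compiled = True
        return sum(values)


# Cold measurement: the timed call includes the simulated compilation.
cold_fn = FakeJitFunction()
cold_result, cold_seconds = timed_call(cold_fn, [1.0, 2.0, 3.0])

# Warm measurement: compilation happens in the warm-up, outside the timed region.
warm_fn = FakeJitFunction()
warm_result, warm_seconds = timed_call(
    warm_fn, [1.0, 2.0, 3.0], warmup=lambda: warm_fn([0.0])
)

print(f"cold: {cold_seconds:.3f}s  warm: {warm_seconds:.3f}s")
```

With the real classes, the equivalent would presumably be to pass NumbaOptimizer.precompile as warmup and NumbaOptimizer.partition as fn. This assumes precompile can be called with no arguments, which the text above does not state.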

Pandas API

Histoptimizer also provides a function, histoptimize, that takes a Pandas DataFrame, the name of the column holding the item sizes, and a list of bucket counts. For each requested bucket count, it adds a column to the DataFrame that assigns an optimal bucket number to each item.

For example, consider the following DataFrame, books:

[4]:

from histoptimizer import histoptimize
import pandas as pd

books = pd.read_csv('books.csv', header=0)
books

[4]:

	Title	Pages
0	The Algorithm Design Manual	748
1	Software Engineering at Google	599
2	Site Reliability Engineering	550
3	Hands-on Machine Learning	850
4	Clean Code	464
5	Code Complete	960
6	Web Operations	338
7	Consciousness Explained	528
8	I am a Strange Loop	432
9	The Information	544
10	The Fractal Geometry of Nature	500
11	Consider Phlebas	544
12	I Heart Logs	60
13	Kraken	528
14	Noise	464
15	Snow Crash	440

We can find the optimal division of the books into 3 buckets, balanced by total page count, as follows. The added column is named with the given prefix followed by the bucket count:

[5]:

divisions, column_names = histoptimize(books, "Pages", [3], "assistant_", Histoptimizer)
print(divisions.to_markdown())

|    | Title                          |   Pages |   assistant_3 |
|---:|:-------------------------------|--------:|--------------:|
|  0 | The Algorithm Design Manual    |     748 |             1 |
|  1 | Software Engineering at Google |     599 |             1 |
|  2 | Site Reliability Engineering   |     550 |             1 |
|  3 | Hands-on Machine Learning      |     850 |             1 |
|  4 | Clean Code                     |     464 |             2 |
|  5 | Code Complete                  |     960 |             2 |
|  6 | Web Operations                 |     338 |             2 |
|  7 | Consciousness Explained        |     528 |             2 |
|  8 | I am a Strange Loop            |     432 |             2 |
|  9 | The Information                |     544 |             3 |
| 10 | The Fractal Geometry of Nature |     500 |             3 |
| 11 | Consider Phlebas               |     544 |             3 |
| 12 | I Heart Logs                   |      60 |             3 |
| 13 | Kraken                         |     528 |             3 |
| 14 | Noise                          |     464 |             3 |
| 15 | Snow Crash                     |     440 |             3 |
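
To judge how balanced this division is, the page totals per bucket can be tallied. The sketch below copies the Pages and assistant_3 columns from the table above into plain lists, so it runs without Pandas or Histoptimizer; the variable names are illustrative only.

```python
from collections import defaultdict

# Values copied from the Pages and assistant_3 columns of the table above.
pages = [748, 599, 550, 850, 464, 960, 338, 528, 432, 544, 500, 544, 60, 528, 464, 440]
bucket = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

# Accumulate the total page count of each bucket.
totals = defaultdict(int)
for page_count, bucket_number in zip(pages, bucket):
    totals[bucket_number] += page_count

for bucket_number in sorted(totals):
    print(bucket_number, totals[bucket_number])
```

The three buckets hold 2747, 2722 and 3080 pages respectively. With the DataFrame itself in hand, divisions.groupby("assistant_3")["Pages"].sum() should yield the same totals.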

If we wish, we can obtain the optimal divisions into both 3 and 5 buckets in a single call:

[6]:

divisions, column_names = histoptimize(books, "Pages", [3, 5], "assistant_", Histoptimizer)
divisions

[6]:

	Title	Pages	assistant_3	assistant_5
0	The Algorithm Design Manual	748	1	1
1	Software Engineering at Google	599	1	1
2	Site Reliability Engineering	550	1	2
3	Hands-on Machine Learning	850	1	2
4	Clean Code	464	2	2
5	Code Complete	960	2	3
6	Web Operations	338	2	3
7	Consciousness Explained	528	2	3
8	I am a Strange Loop	432	2	4
9	The Information	544	3	4
10	The Fractal Geometry of Nature	500	3	4
11	Consider Phlebas	544	3	4
12	I Heart Logs	60	3	5
13	Kraken	528	3	5
14	Noise	464	3	5
15	Snow Crash	440	3	5
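
The bucket columns relate to the divider locations of the base API in a simple way. The sketch below assumes the same convention as in the base API example, where a divider d starts a new bucket at index d, and that buckets are numbered from 1. The divider values are read off the tables above rather than taken from library output, and bucket_numbers is a hypothetical helper, not a Histoptimizer function.

```python
from bisect import bisect_right


def bucket_numbers(item_count, dividers):
    """Map each item index to a 1-based bucket number.

    An item's bucket is one more than the number of dividers at or
    before its index. `dividers` must be sorted in ascending order.
    """
    return [bisect_right(dividers, index) + 1 for index in range(item_count)]


# Divider positions inferred from the 16-row tables above (an assumption).
three_buckets = bucket_numbers(16, [4, 9])
five_buckets = bucket_numbers(16, [2, 5, 8, 12])

print(three_buckets)
print(five_buckets)
```

Under these assumptions the two lists reproduce the assistant_3 and assistant_5 columns shown above.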
