# CLI Guide

## histoptimizer

Histoptimizer takes a CSV of ordered items and partitions them into a given number of buckets as evenly as possible, adding a column that records the bucket assigned to each item.
### Usage
Usage: histoptimizer [OPTIONS] FILE SIZE_COLUMN PARTITIONS
Partition ordered items in a CSV into a given number of buckets, evenly.
Given a CSV or JSON DataFrame, a size column name, and a number of buckets,
Histoptimizer will add a column which gives the partition number for each
row that optimally divides the given items into the buckets so as to
minimize the variance from mean of the summed items in each bucket.
Additional features allow doing a list of bucket sizes in one go, sorting
items beforehand, and producing output with only relevant columns.
Example:
> histoptimizer states.csv population 10
Output:
state_name, population, partition_10
Wyoming, xxxxxx, 1
California, xxxxxxxx, 10
Options:
-l, --limit INTEGER Take the first {limit} records from the
input, rather than the whole file.
-a, --ascending, --asc / -d, --descending, --desc
                                If a sort column is provided, sort in
                                ascending (default) or descending order.
--print-all, --all / --no-print-all, --brief
Output all columns in input, or with
--brief, only output the ID, size, and
buckets columns.
-c, --column-prefix TEXT Partition column name prefix. The number of
buckets will be appended. Defaults to
partition_{number of buckets}.
-s, --sort-key TEXT Optionally sort records by this column name
before partitioning.
-i, --id-column TEXT Optional ID column to print with brief
output.
-p, --partitioner TEXT Use the named partitioner implementation.
Defaults to "numba". If you have an NVidia
GPU, use "cuda" for better performance.
-o, --output FILENAME Send output to the given file. Defaults to
stdout.
-f, --output-format [csv|json] Specify output format. Pandas JSON or CSV.
Defaults to CSV.
--help Show this message and exit.
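The optimization the help text describes can be illustrated with a brute-force sketch. This is illustrative only and is not Histoptimizer's actual implementation (the real partitioners use much faster dynamic-programming code compiled with Numba or CUDA), but it makes the objective concrete: choose bucket boundaries that minimize the variance of the bucket sums from the mean bucket sum.

```python
from itertools import combinations

def best_partition(sizes, buckets):
    """Find divider positions that minimize the variance of the
    bucket sums from the mean bucket sum.

    Brute force for illustration only; Histoptimizer itself uses
    dynamic programming and is far faster.
    """
    n = len(sizes)
    mean = sum(sizes) / buckets
    best_dividers, best_variance = None, float("inf")
    # Each divider is the index at which a new bucket begins.
    for dividers in combinations(range(1, n), buckets - 1):
        edges = (0, *dividers, n)
        variance = sum(
            (sum(sizes[a:b]) - mean) ** 2
            for a, b in zip(edges, edges[1:])
        ) / buckets
        if variance < best_variance:
            best_dividers, best_variance = dividers, variance
    return best_dividers, best_variance
```

For example, `best_partition([1, 2, 3, 4], 2)` splits before index 3, giving bucket sums of 6 and 4, the most even two-way split of those items.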
### Examples
Consider the following CSV:
| Title | Pages |
|---|---|
| The Algorithm Design Manual | 748 |
| Software Engineering at Google | 599 |
| Site Reliability Engineering | 550 |
| Hands-on Machine Learning | 850 |
| Clean Code | 464 |
| Code Complete | 960 |
| Web Operations | 338 |
| Consciousness Explained | 528 |
| I am a Strange Loop | 432 |
| The Information | 544 |
| The Fractal Geometry of Nature | 500 |
| Consider Phlebas | 544 |
| I Heart Logs | 60 |
| Kraken | 528 |
| Noise | 464 |
| Snow Crash | 440 |
To sort by title, and then divide optimally into 3, 5, 6, and 7 buckets, use this command:
histoptimizer -s Title books.csv Pages 3,5-7
Returns:
| | Title | Pages | partition_3 | partition_5 | partition_6 | partition_7 |
|---|---|---|---|---|---|---|
| 0 | Clean Code | 464 | 1 | 1 | 1 | 1 |
| 1 | Code Complete | 960 | 1 | 1 | 1 | 1 |
| 2 | Consciousness Explained | 528 | 1 | 1 | 2 | 2 |
| 3 | Consider Phlebas | 544 | 1 | 2 | 2 | 2 |
| 4 | Hands-on Machine Learning | 850 | 2 | 2 | 3 | 3 |
| 5 | I Heart Logs | 60 | 2 | 2 | 3 | 3 |
| 6 | I am a Strange Loop | 432 | 2 | 2 | 3 | 3 |
| 7 | Kraken | 528 | 2 | 3 | 4 | 4 |
| 8 | Noise | 464 | 2 | 3 | 4 | 4 |
| 9 | Site Reliability Engineering | 550 | 2 | 3 | 4 | 5 |
| 10 | Snow Crash | 440 | 3 | 4 | 5 | 5 |
| 11 | Software Engineering at Google | 599 | 3 | 4 | 5 | 6 |
| 12 | The Algorithm Design Manual | 748 | 3 | 4 | 5 | 6 |
| 13 | The Fractal Geometry of Nature | 500 | 3 | 5 | 6 | 7 |
| 14 | The Information | 544 | 3 | 5 | 6 | 7 |
| 15 | Web Operations | 338 | 3 | 5 | 6 | 7 |
## histobench

histobench is a CLI that lets you unlock Histoptimizer’s most powerful abilities: running against random data and then throwing away the results.

If you supply it with specifications for multiple item counts and bucket counts, it will benchmark every combination of item count and bucket count.
### Set Expressions

Set expressions are used by histoptimizer and histobench to define the counts of items or buckets to be used for partitioning and benchmarking.

The format is a comma-separated list of range specifications. A range specification may be a single number, or a beginning and ending number (inclusive) separated by a ‘-’. A two-number range may be followed by a ‘:’ and a third number giving a step. If the end number is not reachable in whole steps, the series is truncated at the last reachable value.
Some examples:
10000-50000:10000 → (10000, 20000, 30000, 40000, 50000)
3,4,7-9 → (3, 4, 7, 8, 9)
10,20-30:5,50 → (10, 20, 25, 30, 50)
10-25:8 → (10, 18)
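The grammar above is small enough to capture in a few lines. The sketch below is not Histoptimizer's actual parser, just an illustration of the rules (note that `range()` with an inclusive end naturally produces the truncation behavior of the last example):

```python
def parse_set_expression(spec):
    """Parse a set expression such as '10,20-30:5,50' into a list of ints.

    Illustrative sketch of the documented grammar, not Histoptimizer's
    actual parser; assumes non-negative numbers (a leading '-' would
    confuse the range split).
    """
    values = []
    for part in spec.split(","):
        if "-" in part:
            bounds, _, step = part.partition(":")
            start, _, end = bounds.partition("-")
            # The end of the range is inclusive; a missing step means 1.
            values.extend(range(int(start), int(end) + 1, int(step) if step else 1))
        else:
            values.append(int(part))
    return values
```

For instance, `parse_set_expression("10-25:8")` yields `[10, 18]`, matching the truncation rule: 26 would overshoot 25, so the series stops at 18.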
### Usage
Usage: histobench [OPTIONS] PARTITIONER_TYPES [ITEM_SPEC] [BUCKET_SPEC]
[ITERATIONS] [SIZE_SPEC]
Histobench is a benchmarking harness for testing Histoptimizer partitioner
performance.
By default it uses random data, and so may not be an accurate benchmark for
algorithms whose performance depends upon the data set.
The PARTITIONER_TYPES parameter is a comma-separated list of partitioners to
benchmark, which can be specified as either:
1. A standard optimizer name, or
2. filepath:classname
To specify the standard cuda module and also a custom variant, for example,
one could use: cuda,./old_optimizers/cuda_20221130.py:CUDAOptimizer20221130
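The two spec forms can be distinguished by splitting on the last ‘:’, and the filepath:classname form implies a dynamic import. The sketch below illustrates that mechanism; it is not Histoptimizer's actual loader, and the function and module names are illustrative:

```python
import importlib.util

def resolve_partitioner_spec(spec):
    """Classify a PARTITIONER_TYPES entry.

    Returns ('standard', name) for a bare optimizer name such as 'cuda',
    or ('file', path, classname) for a filepath:classname reference.
    Illustrative only; a real loader would also need to handle Windows
    drive letters like 'D:'.
    """
    path, sep, name = spec.rpartition(":")
    if sep and path and name:
        return ("file", path, name)
    return ("standard", spec)

def load_class_from_file(path, classname):
    """Import a class from a .py file by path, as filepath:classname implies."""
    module_spec = importlib.util.spec_from_file_location("custom_partitioner", path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, classname)
```

With this scheme, `cuda,./old_optimizers/cuda_20221130.py:CUDAOptimizer20221130` would resolve to one standard partitioner and one class loaded from a file.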
Options:
--debug-info / --no-debug-info
--force-jit / --no-force-jit
--report PATH
--sizes-from PATH
--tables / --no-tables
--verbose / --no-verbose
--include-items / --no-include-items
--help Show this message and exit.
### Examples
$ histobench --report benchymcmark.csv numba,cuda 1000,25000 3-5 1
If you supply the --report option, histobench will write a row to a CSV (or JSON, or a compressed version of either) for each benchmark test it performs.
You may request multiple iterations of each test to smooth out outliers, though for interestingly large problem sets there are rarely outliers. If you supply --no-force-jit and request two iterations for a single bucket and item size, then the difference between the first and second iterations is the compile-time overhead for NumbaOptimizer and CUDAOptimizer.
| partitioner | num_items | buckets | iteration | variance | elapsed_seconds | item_set_id |
|---|---|---|---|---|---|---|
| numba | 1000 | 3 | 1 | 8 | 0 | 30…2b |
| cuda | 1000 | 3 | 1 | 8 | 0.015 | 30…2b |
| numba | 1000 | 4 | 1 | 2 | 0 | 20…b5 |
| cuda | 1000 | 4 | 1 | 2 | 0 | 20…b5 |
| numba | 1000 | 5 | 1 | 5.440 | 0.015 | d9…6d |
| cuda | 1000 | 5 | 1 | 5.440 | 0 | d9…6d |
| numba | 25000 | 3 | 1 | 2 | 0.627 | d11…02 |
| cuda | 25000 | 3 | 1 | 2 | 0.065 | d11…02 |
| numba | 25000 | 4 | 1 | 0.687 | 0.949 | 85…31 |
| cuda | 25000 | 4 | 1 | 0.687 | 0.062 | 85…31 |
| numba | 25000 | 5 | 1 | 4.240 | 1.255 | fe…5f |
| cuda | 25000 | 5 | 1 | 4.240 | 0.080 | fe…5f |
If you supply the --tables option, histobench will show you the average time to solve each combination of problem size and partitioner:
$ histobench.exe --tables numba,cuda 5000-25000:5000 10-30:10 1
Partitioner: numba
           10     20      30
5000    0.110  0.257   0.421
10000   0.465  1.052   1.868
15000   1.050  2.380   4.350
20000   1.853  4.255   7.679
25000   2.901  6.553  11.954

Partitioner: cuda
           10     20      30
5000    0.016  0.010   0.010
10000   0.022  0.032   0.047
15000   0.049  0.275   0.345
20000   0.157  0.377   0.345
25000   0.298  0.432   0.465
If you have a custom partitioner, you can reference the file path to import it:
(venv-39) PS D:\histoptimizer\docs> histobench \
--tables ..\old_optimizers\cuda_shfl_down.py:CUDAOptimizerShuffleDown,cuda \
10000-50000:10000 3
Partitioner: cuda_shfl_down
           3
10000  0.877
20000  0.039
30000  0.063
40000  0.079
50000  0.110

Partitioner: cuda
           3
10000  0.025
20000  0.047
30000  0.063
40000  0.095
50000  0.126