CLI Guide

histoptimizer

Histoptimizer reads a CSV of ordered items and divides them into a supplied number of buckets as evenly as possible, adding a column that assigns each item to a bucket.

Usage

Usage: histoptimizer [OPTIONS] FILE SIZE_COLUMN PARTITIONS

  Partition ordered items in a CSV into a given number of buckets, evenly.

  Given a CSV or JSON Dataframe, a size column name, and a number of buckets,
  Histoptimizer will add a column which gives the partition number for each
  row that optimally divides the given items into the buckets so as to
  minimize the variance from mean of the summed items in each bucket.

  Additional features allow doing a list of bucket sizes in one go, sorting
  items beforehand, and producing output with only relevant columns.

  Example:

      > histoptimizer states.csv population 10

      Output:

      state_name, population, partition_10
      Wyoming, xxxxxx, 1
      California, xxxxxxxx, 10

Options:
  -l, --limit INTEGER             Take the first {limit} records from the
                                  input, rather than the whole file.
  -a, --ascending, --asc / -d, --descending, --desc
                                  If a sort column is provided, sort in
                                  ascending (default) or descending order.
  --print-all, --all / --no-print-all, --brief
                                  Output all columns in input, or with
                                  --brief, only output the ID, size, and
                                  buckets columns.
  -c, --column-prefix TEXT        Partition column name prefix. The number of
                                  buckets will be appended. Defaults to
                                  partition_{number of buckets}.
  -s, --sort-key TEXT             Optionally sort records by this column name
                                  before partitioning.
  -i, --id-column TEXT            Optional ID column to print with brief
                                  output.
  -p, --partitioner TEXT          Use the named partitioner implementation.
                                  Defaults to "numba". If you have an NVIDIA
                                  GPU, use "cuda" for better performance.
  -o, --output FILENAME           Send output to the given file. Defaults to
                                  stdout.
  -f, --output-format [csv|json]  Specify output format. Pandas JSON or CSV.
                                  Defaults to CSV.
  --help                          Show this message and exit.
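The objective described in the help text, minimizing the variance of the bucket sums around their mean, can be written out directly. A minimal sketch (`bucket_variance` is an illustrative name, not part of Histoptimizer's API):

```python
# Compute the quantity Histoptimizer minimizes: the variance of the
# per-bucket size sums around their mean, for a given assignment.
def bucket_variance(sizes, assignment, buckets):
    sums = [0.0] * buckets
    for size, bucket in zip(sizes, assignment):
        sums[bucket - 1] += size  # buckets are numbered from 1
    mean = sum(sums) / buckets
    return sum((s - mean) ** 2 for s in sums) / buckets
```

A perfectly even split has variance 0; the optimizer searches for the assignment of contiguous items that gets as close to 0 as possible.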

Examples

Consider the following CSV:

books.csv

Title                             Pages
The Algorithm Design Manual         748
Software Engineering at Google      599
Site Reliability Engineering        550
Hands-on Machine Learning           850
Clean Code                          464
Code Complete                       960
Web Operations                      338
Consciousness Explained             528
I am a Strange Loop                 432
The Information                     544
The Fractal Geometry of Nature      500
Consider Phlebas                    544
I Heart Logs                         60
Kraken                              528
Noise                               464
Snow Crash                          440

To sort by title, and then divide optimally into 3, 5, 6, and 7 buckets, use this command:

histoptimizer -s Title books.csv Pages 3,5-7

Returns:

    Title                           Pages  partition_3  partition_5  partition_6  partition_7
0   Clean Code                        464            1            1            1            1
1   Code Complete                     960            1            1            1            1
2   Consciousness Explained           528            1            1            2            2
3   Consider Phlebas                  544            1            2            2            2
4   Hands-on Machine Learning         850            2            2            3            3
5   I Heart Logs                       60            2            2            3            3
6   I am a Strange Loop               432            2            2            3            3
7   Kraken                            528            2            3            4            4
8   Noise                             464            2            3            4            4
9   Site Reliability Engineering      550            2            3            4            5
10  Snow Crash                        440            3            4            5            5
11  Software Engineering at Google    599            3            4            5            6
12  The Algorithm Design Manual       748            3            4            5            6
13  The Fractal Geometry of Nature    500            3            5            6            7
14  The Information                   544            3            5            6            7
15  Web Operations                    338            3            5            6            7
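The optimal assignments in tables like the one above come from a classic linear-partition dynamic program over contiguous runs of items. A minimal pure-Python sketch of the idea (illustrative only, not Histoptimizer's actual implementation, which is far faster):

```python
# Dynamic-programming sketch: split ordered item sizes into `buckets`
# contiguous groups minimizing the variance of the group sums.
def partition(sizes, buckets):
    n = len(sizes)
    prefix = [0]
    for s in sizes:
        prefix.append(prefix[-1] + s)
    mean = prefix[n] / buckets
    INF = float("inf")
    # cost[j][k]: minimal sum of squared deviations for the first j items in k buckets
    cost = [[INF] * (buckets + 1) for _ in range(n + 1)]
    divider = [[0] * (buckets + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for k in range(1, buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[i][k - 1] + (prefix[j] - prefix[i] - mean) ** 2
                if c < cost[j][k]:
                    cost[j][k] = c
                    divider[j][k] = i
    # Walk the dividers back to recover 1-based bucket numbers per item.
    assignment = [0] * n
    j, k = n, buckets
    while k > 0:
        i = divider[j][k]
        for idx in range(i, j):
            assignment[idx] = k
        j, k = i, k - 1
    return assignment, cost[n][buckets] / buckets
```

For example, `partition([3, 1, 1, 3], 2)` splits the items into sums of 4 and 4, for a variance of 0.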

histobench

histobench is a CLI that lets you unlock Histoptimizer’s most powerful abilities: running against random data and then throwing away the results.

If you supply it with specifications for multiple item counts and multiple bucket counts, it will benchmark every combination of item count and bucket count.
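That cross-product behavior can be sketched as follows (a toy illustration of the scheduling, not histobench's internals):

```python
# Expand item-count and bucket-count specs into one benchmark run per
# (num_items, buckets, iteration) combination.
from itertools import product

item_counts = [1000, 25000]
bucket_counts = [3, 4, 5]
iterations = 1

runs = [(n, k, i)
        for n, k in product(item_counts, bucket_counts)
        for i in range(1, iterations + 1)]
# 2 item counts x 3 bucket counts x 1 iteration = 6 benchmark runs
```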

Set Expressions

Set expressions are used by histoptimizer and histobench to define counts of items or buckets to be used for partitioning and benchmarking.

The format is a comma-separated list of range specifications. A range specification may be a single number, or two numbers (beginning and ending, inclusive) separated by a '-'. If two numbers are given, a ':' and a third number may be appended to specify a step. If the end number is not reachable in whole steps, the series is truncated at the last reachable value.

Some examples:

10000-50000:10000   (10000, 20000, 30000, 40000, 50000)

3,4,7-9             (3, 4, 7, 8, 9)

10,20-30:5,50       (10, 20, 25, 30, 50)

10-25:8             (10, 18)
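The range rules above fit in a few lines of Python. A sketch (`parse_spec` is a hypothetical name, not the parser the tools actually use):

```python
# Expand a set expression like "10,20-30:5,50" into a list of integers.
def parse_spec(spec):
    values = []
    for part in spec.split(","):
        if "-" in part:
            bounds, _, step = part.partition(":")
            start, _, end = bounds.partition("-")
            start, end, step = int(start), int(end), int(step) if step else 1
            # range() naturally truncates at the last value reachable in whole steps
            values.extend(range(start, end + 1, step))
        else:
            values.append(int(part))
    return values
```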

Usage

Usage: histobench [OPTIONS] PARTITIONER_TYPES [ITEM_SPEC] [BUCKET_SPEC]
                  [ITERATIONS] [SIZE_SPEC]

  Histobench is a benchmarking harness for testing Histoptimizer partitioner
  performance.

  By default it uses random data, and so may not be an accurate benchmark for
  algorithms whose performance depends upon the data set.

  The PARTITIONER_TYPES parameter is a comma-separated list of partitioners to
  benchmark, which can be specified as either:

  1. A standard optimizer name, or
  2. filepath:classname

  To specify the standard cuda module and also a custom variant, for example,
  one could use: cuda,./old_optimizers/cuda_20221130.py:CUDAOptimizer20221130

Options:
  --debug-info / --no-debug-info
  --force-jit / --no-force-jit
  --report PATH
  --sizes-from PATH
  --tables / --no-tables
  --verbose / --no-verbose
  --include-items / --no-include-items
  --help                          Show this message and exit.

Examples

$ histobench --report benchymcmark.csv numba,cuda 1000,25000 3-5 1

If you supply the --report option, histobench will write a row to a CSV (or JSON, or compressed version of either) for each benchmark test it performs.

You may request multiple iterations of each test to smooth out outliers; for interestingly large problem sets there are rarely outliers. If you supply --no-force-jit and request two iterations for a single bucket and item size, then the difference between the first and second iterations is the compile-time overhead for NumbaOptimizer and CUDAOptimizer.

benchymcmark.csv

partitioner,num_items,buckets,iteration,variance,elapsed_seconds,item_set_id
numba,1000,3,1,8,0,30…2b
cuda,1000,3,1,8,0.015,30…2b
numba,1000,4,1,2,0,20…b5
cuda,1000,4,1,2,0,20…b5
numba,1000,5,1,5.440,0.015,d9…6d
cuda,1000,5,1,5.440,0,d9…6d
numba,25000,3,1,2,0.627,d11…02
cuda,25000,3,1,2,0.065,d11…02
numba,25000,4,1,0.687,0.949,85…31
cuda,25000,4,1,0.687,0.062,85…31
numba,25000,5,1,4.240,1.255,fe…5f
cuda,25000,5,1,4.240,0.080,fe…5f

If you supply the --tables option then histobench will show you the average time to solve for each problem size and each partitioner:

$ histobench.exe --tables numba,cuda 5000-25000:5000 10-30:10 1
Partitioner: numba
             10      20      30
   5000   0.110   0.257   0.421
  10000   0.465   1.052   1.868
  15000   1.050   2.380   4.350
  20000   1.853   4.255   7.679
  25000   2.901   6.553  11.954

Partitioner: cuda
            10     20     30
   5000  0.016  0.010  0.010
  10000  0.022  0.032  0.047
  15000  0.049  0.275  0.345
  20000  0.157  0.377  0.345
  25000  0.298  0.432  0.465

If you have a custom partitioner, you can reference the file path to import it:

(venv-39) PS D:\histoptimizer\docs> histobench \
--tables ..\old_optimizers\cuda_shfl_down.py:CUDAOptimizerShuffleDown,cuda \
10000-50000:10000 3
Partitioner: cuda_shfl_down
             3
  10000  0.877
  20000  0.039
  30000  0.063
  40000  0.079
  50000  0.110

Partitioner: cuda
             3
  10000  0.025
  20000  0.047
  30000  0.063
  40000  0.095
  50000  0.126