Issues with volume breaking algorithm and solutions #27

bertfrees opened this issue Nov 18, 2019 · 0 comments

The volume breaking algorithm has a number of issues that result in slow convergence and no guarantee that the final solution is actually a good one:

  1. EvenSizeVolumeSplitter does not compare solutions. All it does is step in a certain direction at every iteration, increasing the number of volumes when content does not fit, or decreasing the number of volumes when a good solution has been found, and it keeps doing this until it ends up at a solution where it was before. Because of this, and because there are two variables that determine the next solution, namely the volume offset and the sheet count, the final solution depends a lot on the (two-dimensional) path taken by the algorithm (a sketch of this stepping behaviour follows this list). Experience shows that small changes in the algorithm (how the two variables are computed) can influence the path and the final solution a lot.

  2. A second problem is that EvenSizeVolumeSplitter doesn't even promote even volumes. The only thing it does is push the average number of sheets per volume down in order to push the size of the last volume up. At the moment it doesn't control the relation between the other volumes (these are kept equal by the distance penalty in VolumeProvider). Moreover, it probably isn't even very effective at getting the size of the last volume in balance with the other volumes, because it doesn't actually check the individual sizes; it's guesswork. In other words, EvenSizeVolumeSplitter is basically useless.
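
To make the stepping behaviour in point 1 concrete, here is a minimal, hypothetical sketch (the class, method names and update rules are illustrative, not the actual EvenSizeVolumeSplitter code). The search only ever moves one step per iteration, never compares candidate solutions, and accepts whatever state it holds when the walk revisits a state, so any change to how the steps are computed changes the path and therefore the result:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntPredicate;

class SteppingSplitterSketch {

    /**
     * Walks through candidate solutions one step at a time: when the content
     * does not fit, one volume is added; when it does fit, one volume is taken
     * away again. Candidate solutions are never compared with each other; the
     * walk simply stops as soon as it revisits a state, and whatever state it
     * holds at that point is accepted.
     *
     * @param contentFits stand-in for the real "does everything fit" test,
     *                    which in practice also depends on cross-references
     * @param totalSheets total number of sheets in the document
     * @param maxSheets   maximum number of sheets per volume
     * @return the accepted volume offset (extra volumes beyond the minimum)
     */
    static int split(IntPredicate contentFits, int totalSheets, int maxSheets) {
        int minVolumes = (totalSheets + maxSheets - 1) / maxSheets;
        int offset = 0;
        Set<Integer> visited = new HashSet<>();
        while (visited.add(offset)) {
            int volumes = minVolumes + offset;
            // the target number of sheets per volume is derived from the offset
            int sheetsPerVolume = (totalSheets + volumes - 1) / volumes;
            if (!contentFits.test(sheetsPerVolume)) {
                offset++;      // content does not fit: add a volume
            } else if (offset > 0) {
                offset--;      // it fits: try again with one volume less
            } else {
                break;         // already at the minimum volume count
            }
        }
        return offset;
    }
}
```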

These are the solutions that come to mind:

  1. If there were only one variable, namely the volume count (and if we kept the target size of each volume constant at the maximum value), there would be no "randomness" because the path would be one-dimensional. EvenSizeVolumeSplitter could be eliminated. This would make the algorithm much faster, but it would still not favour solutions with more even volumes.

  2. We could also do a more systematic search to make sure we find the optimum. For instance, whenever a volume is added we could start with the maximum number of sheets and decrease the number of sheets (step by step, or with a binary search) until an optimum is found. Note that we would actually need to check the results (rather than decreasing the number of sheets until the content does not fit anymore).

  3. The solution that I prefer would be to eliminate EvenSizeVolumeSplitter and do the "even sizes" optimization in VolumeProvider's cost function. The distance penalty should already keep the volume sizes close to some average value, which indirectly leads to even volumes, provided the optimization is done well. The problem is that the optimization is currently done in a greedy manner that does not lead to a global optimum, and especially for the last volume it can lead to a big deviation from the average volume size (which is what EvenSizeVolumeSplitter tries to fix). A proper optimization could be achieved with something like the A* (a-star) algorithm (a sketch of such a search is included after the tables below). Of course the big question is: how much more processing will this require? In order to find a global optimum we need to try many more solutions. On the other hand, if instead of 50 we only need a handful of iterations (to let the cross-references settle down), that brings the total processing down. To get an idea of the net effect on processing time and memory consumption, I have done some simulations.

    In the following we assume a limit of 40 sheets per volume, and a worst-case scenario of 50 iterations for the greedy method:

    | volume count | greedy worst-case | brute force | a* (min) | a* (max) |
    |---|---|---|---|---|
    | 5  | 10000  | 8.0E59  | 200   | 6832  |
    | 10 | 20000  | 1.3E120 | 1792  | 14957 |
    | 15 | 30000  | 2.1E180 | 6933  | 22442 |
    | 20 | 40000  | 3.3E240 | 13320 | 29973 |
    | 25 | 50000  | 5.4E300 | 18680 | 38224 |
    | 30 | 60000  | Inf     | 27000 | 46792 |
    | 35 | 70000  | Inf     | 32273 | 54498 |
    | 40 | 80000  | Inf     | 38000 | 62179 |
    | 45 | 90000  | Inf     | 43638 | 70703 |
    | 50 | 100000 | Inf     | 56755 | 78030 |

    Another difference from the greedy method is the number of unfinished solutions that need to be kept in memory: with the greedy method there is at most one solution that we need to save, while with the A* method we may need to save as many unfinished solutions as there are sheets in the whole document:

    | volume count | dijkstra worst-case | min | max |
    |---|---|---|---|
    | 5  | 200  | 157 | 196 |
    | 10 | 400  | 256 | 375 |
    | 15 | 600  | 309 | 535 |
    | 20 | 800  | 331 | 540 |
    | 25 | 1000 | 404 | 635 |
    | 30 | 1200 | 461 | 622 |
    | 35 | 1400 | 478 | 713 |
    | 40 | 1600 | 528 | 737 |
    | 45 | 1800 | 530 | 767 |
    | 50 | 2000 | 507 | 761 |
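
To make the shape of the proposed optimization concrete, here is a minimal, self-contained sketch of a best-first search over break positions (not dotify code; the quadratic penalty is only a stand-in for VolumeProvider's real cost function, and the numbers in the tables above come from separate simulations, not from this code). Sheet indices are the search nodes, every edge represents one volume, and the cheapest path from sheet 0 to the last sheet yields the break points:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.PriorityQueue;

class AStarVolumeBreakerSketch {

    /**
     * Break points (in sheets) minimising the summed deviation of every volume
     * from a target size. Sheet indices 0..totalSheets are the search nodes; an
     * edge i -> j (0 < j - i <= maxSheets) stands for one volume and costs
     * (j - i - target)^2, a stand-in for the real cost function.
     */
    static int[] breakPoints(int totalSheets, int maxSheets, int target) {
        double[] best = new double[totalSheets + 1];
        int[] prev = new int[totalSheets + 1];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        Arrays.fill(prev, -1);
        best[0] = 0;
        // entries are {estimated total cost, node}; stale entries are skipped
        PriorityQueue<double[]> open =
                new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        open.add(new double[] {heuristic(0, totalSheets, maxSheets), 0});
        while (!open.isEmpty()) {
            double[] entry = open.poll();
            int i = (int) entry[1];
            if (i == totalSheets) {
                break; // goal reached with minimal cost
            }
            if (entry[0] > best[i] + heuristic(i, totalSheets, maxSheets)) {
                continue; // stale queue entry
            }
            for (int size = 1; size <= maxSheets && i + size <= totalSheets; size++) {
                int j = i + size;
                double cost = best[i] + Math.pow(size - target, 2);
                if (cost < best[j]) {
                    best[j] = cost;
                    prev[j] = i;
                    open.add(new double[] {cost + heuristic(j, totalSheets, maxSheets), j});
                }
            }
        }
        // reconstruct the break points by walking back from the goal
        Deque<Integer> breaks = new ArrayDeque<>();
        for (int i = totalSheets; i > 0; i = prev[i]) {
            breaks.addFirst(i);
        }
        return breaks.stream().mapToInt(Integer::intValue).toArray();
    }

    // admissible lower bound on the cost of breaking the remaining sheets;
    // 0 is always safe (plain Dijkstra), a tighter bound turns this into A*
    static double heuristic(int node, int totalSheets, int maxSheets) {
        return 0;
    }
}
```

The memory concern from the second table shows up here as the `best`/`prev` arrays plus the open queue, which can hold up to roughly one unfinished entry per sheet in the document.
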
bertfrees referenced this issue in bertfrees/dotify.formatter.impl Dec 10, 2020
The reason is explained in issue
https://github.com/mtmse/dotify.formatter.impl/issues/5. While an
algorithm that favours solutions with more even volumes would be
useful to have, EvenSizeVolumeSplitter (despite its name) does not
actually accomplish this. Moreover, it has many problems,
deteriorating the volume breaking more than improving it. Until we
find a way to implement it properly, the whole feature is dropped.
bertfrees referenced this issue in bertfrees/dotify.formatter.impl Jan 13, 2021 (same commit message as above)
kalaspuffar referenced this issue in mtmse/dotify.formatter.impl Feb 22, 2021 (same commit message as above)
kalaspuffar transferred this issue from mtmse/dotify.formatter.impl Aug 20, 2021