Pareto frontier? #230

Open
bknowles opened this issue May 30, 2017 · 3 comments

Comments

@bknowles

I learned about Squash from the page at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html, and one of the key additional things I think that article brings to the table is reducing the amount of data to be charted: it eliminates every point for which some other point is both faster and achieves a smaller compressed size.

As your corpus grows, and as the number of compression algorithms you test against grows, I think it's going to become more and more important to reduce the amount of data you present in your graphs. So I think this principle will be very useful to you.
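To make that concrete, here is a minimal sketch of the filtering step in Python (the codec names and numbers are invented purely for illustration, not taken from your benchmark):

```python
from typing import List, Tuple

# A benchmark point: (codec name, speed, compression ratio);
# higher is better on both numeric axes.
Point = Tuple[str, float, float]

def pareto_frontier(points: List[Point]) -> List[Point]:
    """Drop every point that some other point beats on both axes."""
    # Sort fastest-first; among equal speeds, best ratio first,
    # so a single pass over the list suffices.
    ordered = sorted(points, key=lambda p: (p[1], p[2]), reverse=True)
    frontier: List[Point] = []
    best_ratio = float("-inf")
    for point in ordered:
        if point[2] > best_ratio:  # no faster point compresses this well
            frontier.append(point)
            best_ratio = point[2]
    return frontier

# Made-up numbers, purely for illustration:
results = [
    ("lz4 -1",   450.0, 2.1),
    ("zstd -3",  330.0, 3.1),
    ("gzip -9",   60.0, 2.7),  # dominated: zstd -3 is faster AND smaller
    ("zstd -19",  35.0, 3.8),
    ("xz -9",      5.0, 4.2),
]
print(pareto_frontier(results))  # gzip -9 disappears
```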

@nemequ
Member

nemequ commented May 30, 2017

Did you see the Optimal Codecs chart, which is pretty much what you describe?

As for the other charts, I think it's going to be a long time before this becomes a problem; see Rich Geldreich's amazing charts for an example with a lot of data points which are still quite useful.

Execution time of the benchmark is a limiting factor here; it already takes around 24 hours to run on a fast machine, and over a week on some of the slower machines. I want to keep adding more codecs to Squash, but I plan to reduce the amount of data in the main benchmark (by replacing it with the Squash Compression Corpus when it's ready). Additional data should go in its own benchmark.

@bknowles
Author

I had seen the Optimal Codecs chart, but while each data point there is identified by compression algorithm and by the compression/decompression speed achieved, what gets lost is how the selected few "best" algorithms compare at each data point -- all codecs get munged together on the blue line, the green line, or the black line.

I know you have to slice the data in multiple different ways, and this chart is just one way to look at it. But I found the Pareto frontier a particularly interesting concept that helped reduce the complexity of the charts at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html, and I was thinking it might help even more in your case, since you're comparing considerably more compression algorithms and compression settings.

Anyway, it was just a thought. If it's not helpful to you, then please feel free to close this issue without further comment.

Thanks again!

@nemequ
Member

nemequ commented May 30, 2017

You're right about everything getting munged together; it would be very nice to provide a visual indication of which codec is represented by each point without having to hover over the point in question. I'm not sure how to do that with Highcharts, though…

I don't see a lot of benefit in showing only Pareto-frontier results in the other charts, though, since you can get the same information by just looking for points with no other points higher on both axes. I'm hardly a data visualization expert, though.

One thing that might be interesting is a way to show only the results on the Pareto frontier of other variables; for example, a version of the "compression ratio vs. compression speed" chart which shows only the configurations that are on the Pareto frontier of "compression ratio vs. decompression speed".
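Something along these lines, roughly sketched in Python (the record fields and numbers below are placeholders invented for illustration, not the actual benchmark's data model):

```python
from typing import List, NamedTuple

class Result(NamedTuple):
    codec: str
    ratio: float             # higher is better on all three axes
    compress_mbps: float
    decompress_mbps: float

def on_frontier(results: List[Result], x: str, y: str) -> List[Result]:
    """Results on the Pareto frontier of attributes x and y (both maximized)."""
    ordered = sorted(results, key=lambda r: (getattr(r, x), getattr(r, y)),
                     reverse=True)
    frontier: List[Result] = []
    best_y = float("-inf")
    for r in ordered:
        if getattr(r, y) > best_y:
            frontier.append(r)
            best_y = getattr(r, y)
    return frontier

# Made-up numbers, purely for illustration:
data = [
    Result("lz4 -1",    2.1, 450.0, 2400.0),
    Result("zstd -3",   3.1, 330.0,  950.0),
    Result("brotli -5", 3.3,  90.0,  400.0),  # off the decompression frontier
    Result("zstd -19",  3.8,  35.0,  900.0),
    Result("xz -9",     4.2,   5.0,   80.0),
]

# Filter on the ratio-vs-decompression-speed frontier, then chart the
# survivors on the ratio-vs-compression-speed axes.
for r in on_frontier(data, "decompress_mbps", "ratio"):
    print(f"{r.codec}: {r.compress_mbps} MB/s compression, ratio {r.ratio}")
```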
