Pareto frontier? #230

Open
bknowles opened this issue May 30, 2017 · 3 comments

Comments

@bknowles

I learned about Squash from the page at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html, and one of the key additional things I think that article brings to the table is reducing the amount of data to be charted: it eliminates every point for which some other point is both faster and achieves a smaller compressed size.

As your corpus grows, and as the number of compression algorithms you test against grows, I think it's going to become more and more important to reduce the amount of data you present in your graphs. So I think this principle will be very useful to you.
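To make that concrete, here is a minimal sketch of the filtering step in Python (the codec names and numbers are invented purely for illustration, not taken from your benchmark):

```python
from typing import List, Tuple

# A benchmark point: (codec name, speed, compression ratio);
# higher is better on both numeric axes.
Point = Tuple[str, float, float]

def pareto_frontier(points: List[Point]) -> List[Point]:
    """Drop every point that some other point beats on both axes."""
    # Sort fastest-first; among equal speeds, best ratio first,
    # so a single pass over the list suffices.
    ordered = sorted(points, key=lambda p: (p[1], p[2]), reverse=True)
    frontier: List[Point] = []
    best_ratio = float("-inf")
    for point in ordered:
        if point[2] > best_ratio:  # no faster point compresses this well
            frontier.append(point)
            best_ratio = point[2]
    return frontier

# Made-up numbers, purely for illustration:
results = [
    ("lz4 -1",   450.0, 2.1),
    ("zstd -3",  330.0, 3.1),
    ("gzip -9",   60.0, 2.7),  # dominated: zstd -3 is faster AND smaller
    ("zstd -19",  35.0, 3.8),
    ("xz -9",      5.0, 4.2),
]
print(pareto_frontier(results))  # gzip -9 disappears
```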

@nemequ
Member

nemequ commented May 30, 2017

Did you see the Optimal Codecs chart, which is pretty much what you describe?

As for the other charts, I think it's going to be a long time before this becomes a problem; see Rich Geldreich's amazing charts for an example with a lot of data points which are still quite useful.

Execution time of the benchmark is a limiting factor here; it already takes around 24 hours to run on a fast machine, and over a week on some of the slower machines. I want to keep adding more codecs to Squash, but I plan to reduce the amount of data in the main benchmark (by replacing it with the Squash Compression Corpus when it's ready). Additional data should go in its own benchmark.

@bknowles
Author

I had seen the Optimal Codecs chart, but while each data point there is identified by compression algorithm and by the compression/decompression speed achieved, what gets lost is how the selected few "best" algorithms compare at each data point -- all codecs get munged together on the blue line, the green line, or the black line.

I know you have to slice the data in multiple different ways, and this chart is just one way to look at it. But I found the Pareto frontier a particularly interesting concept that helped reduce the complexity of the charts at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html, and I was thinking it might help even more in your case, since you're comparing considerably more compression algorithms and compression settings.

Anyway, it was just a thought. If it's not helpful to you, then please feel free to close this issue without further comment.

Thanks again!

@nemequ
Member

nemequ commented May 30, 2017

You're right about everything getting munged together; it would be very nice to provide a visual indication of which codec is represented by each point without having to hover over the point in question. I'm not sure how to do that with Highcharts, though…

I don't see a lot of benefit in showing only Pareto-frontier results in the other charts, though, since you can get the same information by just looking for points with no other points higher on both axes. I'm hardly a data visualization expert, though.

One thing that might be interesting is a way to show only the results on the Pareto frontier of other variables; for example, a version of the "compression ratio vs. compression speed" chart which shows only the configurations that are on the Pareto frontier of "compression ratio vs. decompression speed".
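Something along these lines, roughly sketched in Python (the record fields and numbers below are placeholders invented for illustration, not the actual benchmark's data model):

```python
from typing import List, NamedTuple

class Result(NamedTuple):
    codec: str
    ratio: float             # higher is better on all three axes
    compress_mbps: float
    decompress_mbps: float

def on_frontier(results: List[Result], x: str, y: str) -> List[Result]:
    """Results on the Pareto frontier of attributes x and y (both maximized)."""
    ordered = sorted(results, key=lambda r: (getattr(r, x), getattr(r, y)),
                     reverse=True)
    frontier: List[Result] = []
    best_y = float("-inf")
    for r in ordered:
        if getattr(r, y) > best_y:
            frontier.append(r)
            best_y = getattr(r, y)
    return frontier

# Made-up numbers, purely for illustration:
data = [
    Result("lz4 -1",    2.1, 450.0, 2400.0),
    Result("zstd -3",   3.1, 330.0,  950.0),
    Result("brotli -5", 3.3,  90.0,  400.0),  # off the decompression frontier
    Result("zstd -19",  3.8,  35.0,  900.0),
    Result("xz -9",     4.2,   5.0,   80.0),
]

# Filter on the ratio-vs-decompression-speed frontier, then chart the
# survivors on the ratio-vs-compression-speed axes.
for r in on_frontier(data, "decompress_mbps", "ratio"):
    print(f"{r.codec}: {r.compress_mbps} MB/s compression, ratio {r.ratio}")
```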
