Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Towards a dictionary builder #185

Merged
merged 9 commits into from
Oct 12, 2018

Conversation

Yoric
Copy link
Collaborator

@Yoric Yoric commented Oct 9, 2018

No description provided.

@Yoric Yoric requested a review from arai-a October 9, 2018 15:37
@Yoric Yoric force-pushed the entropy-with-dictionary-builder branch 2 times, most recently from f5ab182 to 6f0afa8 Compare October 9, 2018 18:50
@arai-a
Copy link
Collaborator

arai-a commented Oct 10, 2018

can you summarize the goal here?

@arai-a
Copy link
Collaborator

arai-a commented Oct 10, 2018

also, the fix for #176 seems to be mixed into mutliple PRs
what's happening?

Copy link
Collaborator

@arai-a arai-a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll continue after the question for #176 answered.

Cargo.toml Outdated
@@ -45,6 +46,11 @@ path = "src/bin/decode.rs"
name = "binjs_dump"
path = "src/bin/dump.rs"

[[bin]]
# Create a dictionary based on a sample.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify what "create a dictionary" mean, in term of command line application?
like, what's input and what's output and what is the output for?

@Yoric
Copy link
Collaborator Author

Yoric commented Oct 10, 2018

also, the fix for #176 seems to be mixed into mutliple PRs
what's happening?

Well, it's a complicated refactoring. I added new features in one PR, refactored the code that used the old features in another PR.

@Yoric Yoric force-pushed the entropy-with-dictionary-builder branch 2 times, most recently from 2e9ae26 to 72563ab Compare October 10, 2018 07:42
@Yoric
Copy link
Collaborator Author

Yoric commented Oct 10, 2018

I should also clarify that the format used to store dictionaries is temporary. Actual format is being discussed here.

Yoric added 2 commits October 10, 2018 10:05
…t as probability dictionaries.

1. Reworking a little bit the entropy code to make it type-safer, with the introduction of
integer-like structs `InstancesInFile` and `FilesContaining`.

2. Moving a few core definitions from module entropy::predict into a new module entropy::probabilities.

3. Introducing conversion from Dictionary<usize> (which counts symbols) into Dictionary<Symbol> (which
may be used directly for encoding using Opus).
@Yoric Yoric force-pushed the entropy-with-dictionary-builder branch 2 times, most recently from 9071c85 to 93788ea Compare October 10, 2018 10:37
Copy link
Collaborator

@arai-a arai-a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed commit message :)
it helped much!

crates/binjs_io/src/io/deprecated.rs Show resolved Hide resolved
@@ -28,7 +28,7 @@ lzw = "*"
rand = "^0.4"
test-logger = "*"
vec_map = "*"
webidl = "^0.7"
webidl = "*"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reasoning behind the difference between "*" and "^0.8" for webidl dependency in each file?
(looks like it's related to the section build-dependencies/dev-dependencies/dependencies, but I don't know much)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure how it works, but the "^0.8" should specificy that we use the latest 0.8.X, while "*" apparently simply says "whichever version is required by the other components".

use std::rc::Rc;

let instances = self.values()
.map(|x| x.0 as u32)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too, x.deref()

crates/binjs_io/src/entropy/model.rs Outdated Show resolved Hide resolved
crates/binjs_io/src/entropy/predict.rs Outdated Show resolved Hide resolved
.borrow_mut()
.exit_tagged_tuple_at(&null_name, &[], path)
.map_err(Error::TokenWriterError)?;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be sure, these braces are workaround for ownership things?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. They specify the end of the borrow_mut().

crates/binjs_io/src/entropy/model.rs Outdated Show resolved Hide resolved
src/bin/generate_dictionary.rs Outdated Show resolved Hide resolved
src/bin/generate_dictionary.rs Outdated Show resolved Hide resolved
Arg::with_name("quiet")
.long("quiet")
.short("q")
.help("Do not print progress"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll eventually want to unify the shared part of command line application implementation
(arguments, file handling, etc)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we already did some of that with the advanced subcommand, but we'll probably want to go further.

@Yoric Yoric force-pushed the entropy-with-dictionary-builder branch 3 times, most recently from f9de6c1 to 9c88d01 Compare October 11, 2018 13:13
Yoric added 7 commits October 11, 2018 16:50
…, Tree} and the companion code in ast.rs

This changeset simplifies trait `io::TokenWriter`, getting rid of some baggage that makes work on entropy
needlessly complicated. For backwards compatibility reason, we keep a `io::deprecated::TokenWriterWithTree`,
which is essentially a copy of the old `io::TokenWriter`, along with a `io::deprecated::TokenWriterTreeAdapter`,
which converts from the old form to the new one.

This change should make future work on entropy a bit simpler, improve performance and this has already helped locate a few bugs in the generic implementation of encoding.

This change will need to be followed up by a corresponding work on the TokenReader, once we have a clear idea of exactly what needs to be done.

1. `TokenWriter` doesn't define a type `Tree` anymore. Rather, on success, all methods return `()`. Consequently,
	all implementations of `Serializer`, which returned `TokenWriter::Tree` on success now return `()` on success.
	This includes both hardcoded implementations and implementations extracted from the webidl.

2. `TokenWriter` doesn't take a `Vec` of children for lists and tagged tuples anymore. Consequently, we adapt the
	code generator to not create these `Vec` of children anymore.

3. Similarly, the generic implementation of encoding (which might not be useful anymore, but it's still here for the moment)
	needs to be adapted to the new `TokenWriter`. This requires a few contorsions in binjs_generic/src/io/encode.rs.

4. Both methods `Encoder::encode`, which give access to all token writers now use `TokenWriterTreeAdapter` for older
	implementations of `TokenWriterWithThree`.

5. We adapt the entropy dictionary builder to this new `TokenWriter`, which essentially means removing now useless code.

6. Other implementations of `TokenWriter` move to `TokenWriterWithTree`, which changes essentially the interface name.

7. Adding methods `from_rc_string` to all types derived from `SharedString`.

8. Adapting tests to the API changes.
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde
(de)serialization mechanism. While we have not decided a format for storing dictionaries
(that's binast/container-format-spec#7 ), this will let us
experiment using a simple, temporary format.
This change adds a new command-line tool, binjs_generate_prediction_tables,
which analyzes a bunch of JS source files and outputs binary files containing
the prediction tables for the group. Once we have implemented v2 of the `entropy`
encoder, these prediction tables may then be used for encoding/decoding with
that format.
@Yoric Yoric force-pushed the entropy-with-dictionary-builder branch from 9c88d01 to 9e857c2 Compare October 11, 2018 14:50
@Yoric Yoric merged commit d553a78 into binast:master Oct 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants