-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Towards a dictionary builder #185
Conversation
f5ab182
to
6f0afa8
Compare
can you summarize the goal here? |
also, the fix for #176 seems to be mixed into mutliple PRs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll continue after the question for #176 answered.
Cargo.toml
Outdated
@@ -45,6 +46,11 @@ path = "src/bin/decode.rs" | |||
name = "binjs_dump" | |||
path = "src/bin/dump.rs" | |||
|
|||
[[bin]] | |||
# Create a dictionary based on a sample. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you clarify what "create a dictionary" mean, in term of command line application?
like, what's input and what's output and what is the output for?
Well, it's a complicated refactoring. I added new features in one PR, refactored the code that used the old features in another PR. |
2e9ae26
to
72563ab
Compare
I should also clarify that the format used to store dictionaries is temporary. Actual format is being discussed here. |
…t as probability dictionaries. 1. Reworking a little bit the entropy code to make it type-safer, with the introduction of integer-like structs `InstancesInFile` and `FilesContaining`. 2. Moving a few core definitions from module entropy::predict into a new module entropy::probabilities. 3. Introducing conversion from Dictionary<usize> (which counts symbols) into Dictionary<Symbol> (which may be used directly for encoding using Opus).
9071c85
to
93788ea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed commit message :)
it helped much!
@@ -28,7 +28,7 @@ lzw = "*" | |||
rand = "^0.4" | |||
test-logger = "*" | |||
vec_map = "*" | |||
webidl = "^0.7" | |||
webidl = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reasoning behind the difference between "*" and "^0.8" for webidl dependency in each file?
(looks like it's related to the section build-dependencies/dev-dependencies/dependencies, but I don't know much)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure how it works, but the "^0.8"
should specificy that we use the latest 0.8.X, while "*"
apparently simply says "whichever version is required by the other components".
crates/binjs_io/src/entropy/model.rs
Outdated
use std::rc::Rc; | ||
|
||
let instances = self.values() | ||
.map(|x| x.0 as u32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here too, x.deref()
.borrow_mut() | ||
.exit_tagged_tuple_at(&null_name, &[], path) | ||
.map_err(Error::TokenWriterError)?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be sure, these braces are workaround for ownership things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. They specify the end of the borrow_mut()
.
Arg::with_name("quiet") | ||
.long("quiet") | ||
.short("q") | ||
.help("Do not print progress"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll eventually want to unify the shared part of command line application implementation
(arguments, file handling, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we already did some of that with the advanced
subcommand, but we'll probably want to go further.
f9de6c1
to
9c88d01
Compare
…, Tree} and the companion code in ast.rs This changeset simplifies trait `io::TokenWriter`, getting rid of some baggage that makes work on entropy needlessly complicated. For backwards compatibility reason, we keep a `io::deprecated::TokenWriterWithTree`, which is essentially a copy of the old `io::TokenWriter`, along with a `io::deprecated::TokenWriterTreeAdapter`, which converts from the old form to the new one. This change should make future work on entropy a bit simpler, improve performance and this has already helped locate a few bugs in the generic implementation of encoding. This change will need to be followed up by a corresponding work on the TokenReader, once we have a clear idea of exactly what needs to be done. 1. `TokenWriter` doesn't define a type `Tree` anymore. Rather, on success, all methods return `()`. Consequently, all implementations of `Serializer`, which returned `TokenWriter::Tree` on success now return `()` on success. This includes both hardcoded implementations and implementations extracted from the webidl. 2. `TokenWriter` doesn't take a `Vec` of children for lists and tagged tuples anymore. Consequently, we adapt the code generator to not create these `Vec` of children anymore. 3. Similarly, the generic implementation of encoding (which might not be useful anymore, but it's still here for the moment) needs to be adapted to the new `TokenWriter`. This requires a few contorsions in binjs_generic/src/io/encode.rs. 4. Both methods `Encoder::encode`, which give access to all token writers now use `TokenWriterTreeAdapter` for older implementations of `TokenWriterWithThree`. 5. We adapt the entropy dictionary builder to this new `TokenWriter`, which essentially means removing now useless code. 6. Other implementations of `TokenWriter` move to `TokenWriterWithTree`, which changes essentially the interface name. 7. Adding methods `from_rc_string` to all types derived from `SharedString`. 8. Adapting tests to the API changes.
This changeset extendes Dictionary<> and KindedStringsPerFile<> to support the generic Serde (de)serialization mechanism. While we have not decided a format for storing dictionaries (that's binast/container-format-spec#7 ), this will let us experiment using a simple, temporary format.
This change adds a new command-line tool, binjs_generate_prediction_tables, which analyzes a bunch of JS source files and outputs binary files containing the prediction tables for the group. Once we have implemented v2 of the `entropy` encoder, these prediction tables may then be used for encoding/decoding with that format.
9c88d01
to
9e857c2
Compare
No description provided.