Serialization of trie objects #9

Open
hrbrmstr opened this issue Sep 29, 2016 · 7 comments

Comments

@hrbrmstr

Ref: http://stackoverflow.com/questions/39769316/save-on-disk-an-r-object-of-type-externalptr

I have used the neat library triebeard that implements tries in R.
I created my trie using the following syntax:
myTrie <- trie(keys = unigram$words, values = unigram$frequency)

It works perfectly, so now I'd like to save the variable myTrie on disk for future usage. Unfortunately the variable is of type external pointer:

class(myTrie)
[1] "externalptr" "trie"        "string_trie"
object.size(myTrie)
408 bytes

So it looks like I cannot use the standard save function to store the trie to disk. Is there a way to access the object referenced by the pointer? Or is there a way to save to disk the objects referenced by the pointer?

@hrbrmstr
Author

While I cut/pasted that from SO, this would be super helpful for itools since we could enable saving/loading of large CIDR lists.

Having said that, it's super fast to re-create a trie so I'm not sure how high on the priority list this should be.

Having said that, http://gallery.rcpp.org/articles/rcpp-serialization/ seems like it could help here.

@Ironholds
Owner

Mn; that's a lot of dependencies to add (including C++11) to urltools.

I'm wondering if there could be a simpler way of doing it? What if we shadowed save and load to:

  1. If save() is given a trie, it runs get_keys() and get_values() and saves them to an identified list (and if not, just runs base::save());
  2. If load() brings in an IDd list, it generates a trie out of it and returns that to the global namespace.
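
A minimal sketch of the keys/values round trip described in the two steps above, under some loudly stated assumptions: `save_trie()`, `load_trie()` and the `"trie_dump"` marker class are hypothetical names, not part of triebeard, and the sketch uses explicit helpers around `saveRDS()`/`readRDS()` rather than actually shadowing `save()` and `load()`. It also assumes `get_keys()` and `get_values()` return their entries in matching order.

```r
# Hypothetical helpers; not part of triebeard's API.
# They persist the trie's contents rather than the external pointer itself.

save_trie <- function(x, file) {
  stopifnot(inherits(x, "trie"))
  # Pull the data out of the C++ object and tag it so the loader
  # can recognise it ("trie_dump" is an invented marker class).
  payload <- structure(
    list(keys = triebeard::get_keys(x), values = triebeard::get_values(x)),
    class = "trie_dump"
  )
  saveRDS(payload, file)
  invisible(file)
}

load_trie <- function(file) {
  payload <- readRDS(file)
  stopifnot(inherits(payload, "trie_dump"))
  # Rebuild a fresh trie; trie() picks the trie type from the value type.
  triebeard::trie(keys = payload$keys, values = payload$values)
}
```

With these in place, `save_trie(myTrie, "myTrie.rds")` followed by `myTrie <- load_trie("myTrie.rds")` in a new session would rebuild the trie from its contents. The cost is a full re-insert on load, which the thread already notes is fast.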

Alternately we could just bite the bullet and add C++11 and all the rest (not as horrendous an imposition as in days of yore) which would also allow me to write ye olde crappy inefficient URL extractor (which I think is a feature request of yours from waaay back) into urltools.

Thoughts?

@hrbrmstr
Author

hrbrmstr commented Oct 3, 2016

I'm all for C++11. I've started making it the default for new Rcpp packages
and it works on Windows, Ubuntu/Debian and macOS, so it's quite CRANnable.

@Ironholds
Owner

boB, any thoughts on how exactly one would hook a custom function here up into save() or load()? (I mean, aside from telling the user to do it ;p)

@Ironholds
Owner

Update: yeah working out an elegant way of doing that is my main blocker. There are 11192342 ways of serialising and deserialising objects, but making it nice to do is a different matter.

@rexdouglass

+1
I just wanted to compare the memory size of a trie, thinking it might be stored more efficiently than data frame/data.table and therefore could be a good way to compress text.

@Ironholds
Owner

Oh! So just a two-column df analogue?
