Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-architecture the node binding #36

Open
kelson42 opened this issue Apr 30, 2020 · 11 comments
Open

Re-architecture the node binding #36

kelson42 opened this issue Apr 30, 2020 · 11 comments
Assignees
Milestone

Comments

@kelson42
Copy link
Contributor

kelson42 commented Apr 30, 2020

The current node binding does not respect the libzim reader/writer patterns. This should be rewritten to respect it (like the the python-libzim binding does).

After this is doing, we could probably provide high level abstraction of it in a dedicated package (like the python_scraperlib) to simplify the life of scraper devs.

@kelson42
Copy link
Contributor Author

@mgautierfr I would be happy if you could give a few guidelines/things to respect to do that work properly.

@mgautierfr
Copy link

First, you have to consider the reading part and the writing part as two different libraries. They are in the same repository and share few common structure name (as Article) but they are totally different (in usage and code behind).

  • On the reading part. The Article is a "view" on what is inside the zim file. Article contains nothing. When you call a method (getTitle) then libzim search for the title and return it to you. A wrapper should use the same mechanism. The article wrapper must be a "proxy" to the real article and call the c++ method only when needed. It may store the result as a cache system but it must not a be container itself.
  • On the writing part. The Article is a interface to implement. As the creator need to know some information about the article to write, it will call the method of the inteface. At the wrapper level, you cannot know how to implement those methods, by definition, the implementation depends of the use case. Information can be store in memory but it can come from a fileystem, a database, ...
    It is the to code using the wrapper to implement this. So the wrapper must be a proxy here also. But the methods are call from the c++ side and forward to node side. (Where on reading part, the call go from node to c++).

About blob.

Blob is a proxy to a content store somewhere. It mostly contains a pointer to the data and a size. It is cheap to copy.

On reading side, the wrapper should avoid to copy the data pointed by the blob. If it is not possible, the data must be copied the later possible, and only when actually needed. The best would be to "convert" the blob to the equivalent in node world (I don't know if it is possible, we have memoryview in python for this). If the data is not copied but referenced, the wrapper must ensure that the pointed data is not deleted while node is using it. To do so, it is enought to keep a copy of the blob object somewhere as long as node is using (may use) the pointed data.

When creating zim file, the wrapper must create a blob (when needed) pointing to the data and pass it to c++. You must ensure that the data is not deleted by node as long as c++ use the blob. (You are sure that the blob will not live longer than the article. So it is enough, and easier, to ensure that the pointed data existing as long as the article exists).

@kelvinhammond
Copy link
Collaborator

@mgautierfr I have a few questions about libzim's architecture regarding this.

  1. Why aren't zim::Article and zim::writer::Article sharing a common subclass such that zim::writer::Article can be polymorphic and interfaces can simply use zim::Article *?
  2. Why is there no way to create a zim::Blob and have it manage the data stored internally instead of an external pointer to the data? Internally libzim can do this but externally one cannot.
  3. There is a way, if I recall correctly, to store data in napi and have a callback function handle its deletion if this is useful.

@mgautierfr
Copy link

mgautierfr commented May 5, 2020

Why aren't zim::Article and zim::writer::Article sharing a common subclass such that zim::writer::Article can be polymorphic and interfaces can simply use zim::Article *?

Because they are nothing in common (even if few methods have the same name). When you use inheritance, B inherits A, so B is a specific A. Then you can pass a B to a code waiting for a A. Here zim::writer::Article is not a a specific zim::Article (libzim cannot give you a writer article when you are asking for a specific article of a zim file). The same way, zim::Article is not a zim::writer::Article, they are missing methods (shouldIndex, shouldCompress, getRedirectUrl) and it make no sense to add theme to a (reading) article.
About an intermediate subclass, well, what would it be ? zim::writer::Article is just a collection of methods to implement, there is no code.
And code of zim::Article is really specific to reading. Have a look for the getData implementation (https://github.com/openzim/libzim/blob/master/src/article.cpp#L198-L205). How this could be reuse by a writer article ?

Why is there no way to create a zim::Blob and have it manage the data stored internally instead of an external pointer to the data? Internally libzim can do this but externally one cannot.

This is a flaw in the API. I realized that when working with python wrapper. You can't create your own Blob managing its own data internally.
(To be exact, you can create it (inherit from it), but you cannot pass it to the creator). This will probably be changed with openzim/libzim#325. The signature of getData will be to return a pointer to a blob. The reading entry will return a specific blob using data in a zim file. On the writing side, user will have to create its own blob and manage the data the way specific to the usecase.

There is a way, if I recall correctly, to store data in napi and have a callback function handle its deletion if this is useful.

Yes. It is probably this system that should be use once we will fix the issue with the blob. The deletion of the data would be tied to the existance of the blob (or one of its copy). When the last copy of the blob is deleted, then napi would delete the data.

@kelson42
Copy link
Contributor Author

kelson42 commented May 5, 2020

@kelvinhammond Would be happy to invite you to our Slack channel if you are interested.... because part of the discussion around node-libzim and MWoffliner just happen there. Just write me an email to kelson at kiwix . org to get an invitation.

@kelson42 kelson42 pinned this issue Jun 27, 2020
@kelson42
Copy link
Contributor Author

@kelvinhammond To which extend #72 fixes this ticket?

@kelvinhammond
Copy link
Collaborator

@kelson42 You're still going to have issues with the copying, I haven't figured out a way to avoid that in nodejs. We now have the threading issue where we can't easily create custom content creators.

@kelson42 kelson42 added this to the 3.0.0 milestone Dec 4, 2022
@kelson42
Copy link
Contributor Author

kelson42 commented Dec 4, 2022

@kelvinhammond To my understanding, it looks like you have done most of the work requested here. If so, I woukd recommend to open dedicated tickets for the parts still to do and close this one. Do you agree?

@kelson42
Copy link
Contributor Author

kelson42 commented May 4, 2023

@kelvinhammond Any feedback here please? Now that libzim 8.2.0 has been released, I will push the release of libzim-nodejs 3.0.0 (finally).

@kelson42 kelson42 modified the milestones: 3.0.0, 3.1.0 May 19, 2023
@kelson42 kelson42 modified the milestones: 3.1.0, 3.2.0 Dec 3, 2023
@kelson42
Copy link
Contributor Author

kelson42 commented Dec 9, 2023

@kelson42 You're still going to have issues with the copying, I haven't figured out a way to avoid that in nodejs. We now have the threading issue where we can't easily create custom content creators.

@kelvinhammond We have #80 for the threading issue, but if you could open a ticket for the copying issue, I would be glad to close this old ticket which has been implemented to 99% thanks to your work.

@kelson42 kelson42 modified the milestones: 3.2.0, 3.3.0 Dec 9, 2023
@kelson42
Copy link
Contributor Author

@audiodude Issue probably worth a read

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants