Turn local scanning into a pipeline #25
I've been thinking more about this lately, and the setup I imagine still has the external interface be the URLs-to-scan queue. The GStreamer scanning bit should be reduced in scope so we only produce the tags and the duration. This result gets passed to the next queue without being converted to a track; the format should match whatever the tags-changed event emits. The results in the post-processing queue should then have per-URI-type annotations run on them; an example of this would be last-modified for local files. At the end of the post-processing we should either have another queue, or just emit the resulting track/playlist/... From that point the consumers of the scanner data become relevant again and take over, either returning the metadata we found for a stream lookup or adding the result to the library.

For the GStreamer bit we should consider getting rid of the manual bus handling and just leave that to the GObject loop. This way we can scale the number of GStreamer scanners without throwing more Python threads at the problem (GStreamer will still have its own internal threads, though).

Note, however, that there are two/three current use cases: local scanning done "offline", the planned in-process local scanning, and finally metadata lookup for streams. This is important because the turnaround time for the stream case is much tighter than for the others. As such, there is a fair chance we should make the queues priority queues, or have two of them with different service levels, ensuring that a running local scan doesn't block the stream metadata lookup or otherwise consume too many resources. For this I'm also assuming that we have a single scanner pipeline running as part of audio/core which local and others are allowed to use.

For the batch scanning case we don't really care when we get our answers out, while for the stream case we do. So another idea that just popped into my head while writing this is to have "scan sessions": each session has a priority, a way to add tracks, and a way to get the results out. For the stream case we simply create a session, give it the one URI, get our result, and then close the session. For batch scanning we create a session, feed it with URIs to scan as we find them (which might be slow due to network etc.), process results as we get them, and when we've found the last URI we want to scan we tell the session, at which point we can join the queues. Of course this assumes a batch-oriented mindset; for in-process scanning a streaming, continual approach would be nicer IMO.

Hopefully some of this still makes sense, as this became a bit more of a braindump than I had planned.
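A minimal sketch of what such shared scanner queues and scan sessions could look like, assuming plain Python queues; all class and constant names here are hypothetical, not existing Mopidy API:

```python
import itertools
import queue

# Hypothetical priority levels: stream lookups must not wait behind batch scans.
PRIORITY_STREAM = 0
PRIORITY_BATCH = 10

_order = itertools.count()  # tie-breaker so queue entries never compare by payload


class ScanSession:
    """A handle for feeding URIs into the shared pipeline and reading results."""

    def __init__(self, scan_queue, priority):
        self._scan_queue = scan_queue
        self._priority = priority
        self.results = queue.Queue()  # scanner workers put (uri, tags, duration) here

    def add(self, uri):
        # Lower priority number is served first by the scanner workers.
        self._scan_queue.put((self._priority, next(_order), uri, self.results))

    def done_adding(self):
        # Signal that no more URIs are coming, so consumers can join the queues.
        self._scan_queue.put((self._priority, next(_order), None, self.results))


class ScannerPipeline:
    """The single shared pipeline owned by audio/core; sessions are cheap."""

    def __init__(self):
        self.scan_queue = queue.PriorityQueue()

    def session(self, priority=PRIORITY_BATCH):
        return ScanSession(self.scan_queue, priority)


# Stream metadata lookup: one URI in, one result out, then we are done.
pipeline = ScannerPipeline()
session = pipeline.session(priority=PRIORITY_STREAM)
session.add('http://example.com/stream')
```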
Makes sense to me :-)
The current state has also caused out of memory issues for at least one user trying to scan 100k songs over SMB. In this case it was a raspi running out of memory already at the finder stage. |
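One way to keep the finder's memory use flat regardless of library size is to yield paths lazily instead of collecting them all into one big list first; a generic sketch, not the actual Mopidy finder code:

```python
import os


def find_files(media_dir):
    """Yield media file paths one at a time instead of building a 100k-entry list."""
    for dirpath, _dirnames, filenames in os.walk(media_dir):
        for name in filenames:
            yield os.path.join(dirpath, name)


# The next pipeline stage can consume paths as they are found:
# for path in find_files('/music'):
#     queue_for_scanning(path)
```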
For another project I've written a scanner actor that uses Discoverer from gst.pbutils instead of the Python one, plus a Resolver actor that I feed from another actor; it tries to discover a URI and calls another function on completion.
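For reference, a minimal sketch of using the built-in discoverer via the GStreamer 1.x PyGObject bindings (the comment above refers to the older gst.pbutils Python module, but the API is essentially the same); the file URI is just a placeholder:

```python
import gi

gi.require_version('Gst', '1.0')
gi.require_version('GstPbutils', '1.0')
from gi.repository import Gst, GstPbutils

Gst.init(None)

# Synchronous variant; discover_uri_async() plus the 'discovered' signal gives
# the completion-callback style used by the resolver actor described above.
discoverer = GstPbutils.Discoverer.new(5 * Gst.SECOND)
info = discoverer.discover_uri('file:///music/example.flac')

duration_ms = info.get_duration() // Gst.MSECOND
tags = info.get_tags()  # a Gst.TagList with artist, title, etc.
print(duration_ms, tags.to_string() if tags else None)
```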
Thanks for the suggestion. We've looked into this before, and at the time speed was the main reason for not using the built-in one. The downside is of course having to reinvent the wheel and rediscover problems already solved upstream (such as sources/sinks with dynamic pads). If the built-in discoverer can be shown to run at acceptable speed we should probably switch.

On a side note, we've also talked about splitting mopidy-local out into its own extension instead of bundling it. This would probably also cover killing off mopidy-local-json and merging mopidy-local-sqlite into the new mopidy-local. In that case it would be up to whoever maintains the new extension to figure out what is best, and we can keep doing our own thing in core as we see fit :-)
This resolver class could be implemented as a pool of services if speed is crucial, but for my application it is already faster than the Mopidy default and doesn't block the base class while waiting for the resolver to start. The only problem would be if one checker dies and never calls the done function (I'd need a way in Pykka to run something when a thread dies...).
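On the "checker dies without calling done" point: Pykka actors do have an on_failure() hook that runs when an actor crashes with an unhandled exception, so the completion callback can always be fired. A rough sketch, where the message shape and the done callback are made up for illustration:

```python
import pykka


class CheckerActor(pykka.ThreadingActor):
    def __init__(self, done_callback):
        super().__init__()
        self._done_callback = done_callback

    def on_receive(self, message):
        uri = message['uri']
        result = self._discover(uri)        # may raise and kill the actor
        self._done_callback(uri, result)

    def on_failure(self, exception_type, exception_value, traceback):
        # Called by Pykka when the actor dies with an unhandled exception,
        # so the caller is never left waiting for a done() that never comes.
        self._done_callback(None, exception_value)

    def _discover(self, uri):
        raise NotImplementedError  # the actual GStreamer discovery work
```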
Just to update this: if someone were to do some benchmarks comparing recent versions of GST's built-in discoverer against our own scanner, that would help settle whether we should switch.
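A minimal harness for such a comparison could just time each implementation over the same URI list; scan_with_discoverer and scan_with_mopidy below are placeholders for the two candidates:

```python
import time


def benchmark(scan, uris):
    """Return seconds spent scanning all URIs with one implementation."""
    start = time.time()
    for uri in uris:
        scan(uri)
    return time.time() - start

# e.g. benchmark(scan_with_discoverer, uris) vs. benchmark(scan_with_mopidy, uris)
```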
Playing with some prototypes to speed up scanning, I've come up with the following plan:
Additionally, this means we can stat files as we go instead of trying to do everything up front, and most likely still keep up without blocking the scanners. My quick hack to add multiple scanners also showed that this will quickly move the bottleneck to the library indexing. At least that was the case with Whoosh; for the JSON backend there was no work to do, so the scanners could simply run at full speed.
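A sketch of that shape: a few scanner threads that stat and scan as they go, all feeding a single indexing queue, which is where a Whoosh-style backend becomes the bottleneck (the thread count and the scan/index placeholders are illustrative):

```python
import os
import queue
import threading

paths = queue.Queue()      # filled by the finder as files are discovered
results = queue.Queue()    # single stream of scan results for the indexer


def scan_tags(path):
    # Placeholder for the actual GStreamer scan.
    return {'title': os.path.basename(path)}


def index(item):
    # Placeholder for the library backend; with Whoosh this single consumer
    # is where the pipeline backs up, while a no-op backend keeps up easily.
    pass


def scanner_worker():
    while True:
        path = paths.get()
        if path is None:
            break
        mtime = os.stat(path).st_mtime  # stat as we go, not all up front
        results.put((path, mtime, scan_tags(path)))


def indexer_worker():
    while True:
        item = results.get()
        if item is None:
            break
        index(item)


scanners = [threading.Thread(target=scanner_worker) for _ in range(4)]
indexer = threading.Thread(target=indexer_worker)
```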