Optimized Background Noise Augmentation for Large Background Files #360
Conversation
Thanks for the PR. I will have a closer look when I have time
In this case I would prefer lazy caching over eager caching. The difference becomes quite noticeable when there is a large number of files. Hypothetically, if you have half a million files and it takes 1 ms to check the duration of each file, initializing the class would take 500 seconds. On the other hand, with lazy caching, initializing the class would be almost instant.
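To make the distinction concrete, here is a minimal, hypothetical sketch of lazy duration caching (not the code in this PR); it assumes a list of paths called `sound_file_paths` and uses the `soundfile` package to read durations:

```python
import soundfile as sf

class LazyDurationCache:
    """Caches each background file's duration the first time it is requested."""

    def __init__(self, sound_file_paths):
        self.sound_file_paths = sound_file_paths
        self._durations = {}  # file path -> duration in seconds

    def get_duration(self, file_path):
        # Only the first lookup for a given file touches the filesystem;
        # initialization does no I/O at all.
        if file_path not in self._durations:
            self._durations[file_path] = sf.info(file_path).duration
        return self._durations[file_path]
```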
Hello @iver56, Thank you for your valuable feedback. I have implemented the suggested changes and replaced eager caching with lazy caching. The system now caches file-related time information on demand, which significantly improves initialization speed for large datasets. I do have a question regarding the lookup mechanism for the file time information. Currently, I am using a dictionary for this purpose; while its average-case lookup is O(1), its worst-case time complexity is not guaranteed to be constant. I am exploring an alternative approach using an array of size len(sound_file_paths). With this method, each file's cached time information would be stored at the index matching its position in the path list, so a lookup becomes a direct array access.
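As an illustration of that array idea (again hypothetical, not the PR's code), the cache could be a preallocated NumPy array indexed by each file's position in `sound_file_paths`, with NaN marking entries that have not been computed yet:

```python
import numpy as np
import soundfile as sf

class ArrayDurationCache:
    """Stores cached durations at the index matching each file's position."""

    def __init__(self, sound_file_paths):
        self.sound_file_paths = sound_file_paths
        # NaN means "duration not measured yet"
        self._durations = np.full(len(sound_file_paths), np.nan, dtype=np.float32)

    def get_duration(self, file_index):
        if np.isnan(self._durations[file_index]):
            self._durations[file_index] = sf.info(self.sound_file_paths[file_index]).duration
        return float(self._durations[file_index])
```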
Additionally, I was wondering if there is any provision in the current system to prioritize sampling certain files more frequently than others, for instance based on importance, weight, or any custom-defined priority. If such functionality does not currently exist, is there a plan to introduce it in the future? Thank you for your time and guidance. I appreciate your input and look forward to your feedback! Best regards,
Thanks for implementing that change
Here's a rough comparison of the memory usage of the three different alternatives, given that there are half a million items:

If you feel like optimizing it with your array idea, here's my green light: 🟢
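The actual comparison numbers are not reproduced in this excerpt. A rough, hypothetical way to estimate them yourself is sketched below; note that `sys.getsizeof` on a container excludes the objects it references, so those are added separately, and exact figures vary by Python version and platform:

```python
import sys
import numpy as np

n = 500_000
durations = [float(i) for i in range(n)]

# 1) Plain list of Python floats: list slots + the float objects themselves
list_bytes = sys.getsizeof(durations) + sum(sys.getsizeof(d) for d in durations)

# 2) Dict keyed by file path string (hypothetical paths): table + keys + values
paths = [f"/data/background/{i:06d}.wav" for i in range(n)]
duration_dict = dict(zip(paths, durations))
dict_bytes = (
    sys.getsizeof(duration_dict)
    + sum(sys.getsizeof(p) for p in paths)
    + sum(sys.getsizeof(d) for d in durations)
)

# 3) NumPy float32 array indexed by file position: one contiguous buffer
duration_array = np.array(durations, dtype=np.float32)
array_bytes = duration_array.nbytes

print(f"list:  {list_bytes / 1e6:.1f} MB")
print(f"dict:  {dict_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```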
I don't have any immediate plans for adding that feature, but you're welcome to add an issue for it.
Hello @iver56, Thank you for your detailed response and the insights into the performance and memory usage of the different data structures. The comparison between list, dictionary, and NumPy array was particularly helpful. Based on your feedback, I have updated the implementation accordingly.
Regarding the feature to prioritize file sampling based on weights or importance, I will create an issue for it in the repository to track any discussions or future plans around it. Thank you once again for your guidance and support. I look forward to your feedback on the latest changes. Best regards,
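For reference, weighted file selection of the kind asked about above is easy to sketch outside the library; this is purely illustrative and not an existing feature of the project:

```python
import random

# Hypothetical file list and user-defined weights (not part of the library)
sound_file_paths = ["birds.wav", "rain.wav", "traffic.wav"]
weights = [1.0, 3.0, 0.5]  # a higher weight means the file is picked more often

chosen_path = random.choices(sound_file_paths, weights=weights, k=1)[0]
print(chosen_path)
```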
Proposed Algorithm:
This approach ensures that:
Experiments and Results:
I’ve tested this algorithm using:
This optimized approach significantly reduces memory usage while maintaining augmentation quality. I’ve attached the comparison plot showcasing the performance difference for your reference.
Please let me know your thoughts on this proposal and if any further details or clarifications are needed.
In the figure below, the first plot shows the difference in memory usage over the test cases (normalized by 1e6), and the second plot compares the time taken by the old approach versus the proposed one.
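The concrete steps of the proposed algorithm are not reproduced in this excerpt, so the following is only a guess at the general idea behind optimizing for large background files: seek to a random offset and read just the slice that is needed, rather than loading the whole file into memory. All names are illustrative and the `soundfile` package is assumed:

```python
import random
import soundfile as sf

def load_random_background_slice(file_path, num_frames_needed):
    """Read only a randomly positioned slice of a (possibly huge) background file."""
    with sf.SoundFile(file_path) as f:
        if f.frames <= num_frames_needed:
            # Short file: just read everything
            return f.read(dtype="float32")
        start = random.randint(0, f.frames - num_frames_needed)
        f.seek(start)
        # Only num_frames_needed frames are ever held in memory
        return f.read(num_frames_needed, dtype="float32")
```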