Optimized Background Noise Augmentation for Large Background Files #360

Open

PratikKulkar wants to merge 5 commits into main
Conversation

PratikKulkar

Proposed Algorithm:

1. Get the duration of the background file (bg_file) in seconds.
2. Sample a random value from the range [0, bg_file_seconds - event_file_seconds).
3. Read the background file from sampled_value to sampled_value + event_file_seconds.
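
A minimal sketch of that idea, assuming soundfile is used for I/O (the function and variable names below are illustrative, not the actual implementation):

```python
import random

import soundfile as sf


def read_random_background_slice(bg_path, event_num_samples):
    """Illustrative sketch: read only the slice of the background file that the augmentation needs."""
    info = sf.info(bg_path)  # reads just the header, not the audio data
    # Assumes the background file is at least as long as the event
    max_start = info.frames - event_num_samples
    start = random.randint(0, max_start)  # random offset within the valid range
    # Load only event_num_samples frames starting at the random offset
    background, sample_rate = sf.read(
        bg_path, start=start, frames=event_num_samples, dtype="float32"
    )
    return background, sample_rate
```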

This approach ensures that:

  • Only the portion of the background file required for the augmentation is loaded.
  • Randomness in background selection is maintained while memory overhead is reduced.
  • It adapts to varied sample rates and event/background file durations.

Experiments and Results:

I’ve tested this algorithm using:

  • Event durations ranging from 1 to 9 seconds.
  • Background durations ranging from 81 to 10,000 seconds.
  • Sample rates: 16,000 Hz, 22,500 Hz, and 44,100 Hz.

This optimized approach significantly reduces memory usage while maintaining augmentation quality. I’ve attached the comparison plot showcasing the performance difference for your reference.

  • Improves scalability by avoiding unnecessary memory consumption for large files.
  • Enhances performance in real-time audio augmentation workflows.
  • Can be integrated as a feature or an option in AddBackgroundNoise to provide more flexibility to users.

Please let me know your thoughts on this proposal and if any further details or clarifications are needed.

In the figure below, the first plot shows the difference in memory usage over the test cases (normalized by 1e6), and the second plot compares the time taken by the old approach and the proposed one.

[Image: Labeled_final]

iver56 (Owner) commented Oct 14, 2024

Thanks for the PR. I will have a closer look when I have time

iver56 (Owner) commented Jan 15, 2025

In this case I would prefer lazy caching over eager caching. The difference becomes quite noticeable when there is a large number of files. Hypothetically, if you have half a million files, and it takes 1 ms to check the duration of each file, initializing the class would take 500 seconds. On the other hand, with lazy caching, initializing the class would be almost instant.
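
For illustration only, a lazy cache along these lines could look roughly like this (a sketch assuming soundfile is used to read the header; the class and method names are made up):

```python
import soundfile as sf


class LazyDurationCache:
    """Illustrative sketch: cache file durations on first use instead of probing every file up front."""

    def __init__(self, sound_file_paths):
        self.sound_file_paths = sound_file_paths  # no I/O happens at init time
        self._durations = {}  # path -> duration in seconds, filled on demand

    def get_duration(self, path):
        duration = self._durations.get(path)
        if duration is None:
            # First access: read the file header once and remember the result
            duration = sf.info(path).duration
            self._durations[path] = duration
        return duration
```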

PratikKulkar (Author)

Hello @iver56,

Thank you for your valuable feedback. I have implemented the suggested changes and replaced eager caching with lazy caching. The system now caches file-related time information on demand, significantly improving the initialization speed for large datasets.

I do have a question regarding the lookup mechanism for file time information. Currently, I am using a dictionary for this purpose, but its average-case time complexity for lookups is not guaranteed to be constant. I am exploring an alternative approach using an array of size len(sound_file_paths). With this method:

  • Each file would be assigned an index (e.g., from 0 to len(sound_file_paths) - 1).
  • File paths and corresponding time information could then be accessed directly using the index, enabling efficient retrieval.

Additionally, I was wondering if there’s any provision in the current system to prioritize sampling certain files more frequently than others—for instance, based on importance, weight, or any custom-defined priority. If such functionality does not currently exist, is there a plan to introduce it in the future?

Thank you for your time and guidance. I appreciate your input and look forward to your feedback!

Best regards,
Pratik Kulkar

iver56 (Owner) commented Jan 16, 2025

Thanks for implementing that change

> I do have a question regarding the lookup mechanism for file time information. Currently, I am using a dictionary for this purpose, but its average-case time complexity for lookups is not guaranteed to be constant. I am exploring an alternative approach using an array of size len(sound_file_paths). With this method:
>
>   • Each file would be assigned an index (e.g., from 0 to len(sound_file_paths) - 1).
>   • File paths and corresponding time information could then be accessed directly using the index, enabling efficient retrieval.

  • dict lookups are O(1) on average for both string and integer keys.
  • Having integers as keys is faster than having strings as keys, due to faster hashing and comparison. And it uses less memory.
  • Accessing a value in an array/list is also O(1), but in practice it is faster than a dict lookup.
  • A NumPy array requires less memory than a Python-native list of floats.

Here's a rough comparison of the memory usage in the three different alternatives, given that there are half a million items:

  • List of floats: ~16 MB
  • Dictionary (int keys, float values): ~47 MB
  • NumPy array (float32): ~2 MB
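
A rough way to reproduce estimates of this kind (a sketch; sys.getsizeof counts a container's own footprint but not its elements, so the elements are summed separately, and interning of small ints is ignored):

```python
# Illustrative sketch for rough memory estimates, not a precise measurement
import sys

import numpy as np

n = 500_000
durations = [float(i) * 0.1 for i in range(n)]

# List of floats: pointer array plus one float object per element
list_bytes = sys.getsizeof(durations) + sum(sys.getsizeof(x) for x in durations)

# Dict with int keys and float values: hash table plus the key and value objects
d = {i: durations[i] for i in range(n)}
dict_bytes = sys.getsizeof(d) + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())

# NumPy float32 array: 4 bytes per element plus a small fixed header
arr = np.asarray(durations, dtype=np.float32)
numpy_bytes = arr.nbytes

print(f"list:  ~{list_bytes / 1e6:.0f} MB")
print(f"dict:  ~{dict_bytes / 1e6:.0f} MB")
print(f"numpy: ~{numpy_bytes / 1e6:.0f} MB")
```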

If you feel like optimizing it with your array idea, here's my green light: 🟢
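
For illustration, combining lazy caching with the array idea could look roughly like this (a sketch with assumed class and method names; NaN marks durations that have not been read yet):

```python
import numpy as np
import soundfile as sf


class IndexedDurationCache:
    """Illustrative sketch: lazily cache durations in a float32 array indexed by file position."""

    def __init__(self, sound_file_paths):
        self.sound_file_paths = list(sound_file_paths)
        # One float32 slot per file; NaN means the duration has not been read yet
        self._durations = np.full(len(self.sound_file_paths), np.nan, dtype=np.float32)

    def get_duration(self, index):
        if np.isnan(self._durations[index]):
            # Read this one file's header lazily and cache its duration
            self._durations[index] = sf.info(self.sound_file_paths[index]).duration
        return float(self._durations[index])
```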

iver56 (Owner) commented Jan 16, 2025

> Additionally, I was wondering if there’s any provision in the current system to prioritize sampling certain files more frequently than others—for instance, based on importance, weight, or any custom-defined priority. If such functionality does not currently exist, is there a plan to introduce it in the future?

I don't have any immediate plans for adding that feature, but you're welcome to add an issue for it

PratikKulkar (Author)

Hello @iver56,

Thank you for your detailed response and insights into the performance and memory usage of different data structures. The comparison between list, dictionary, and NumPy array was particularly helpful.

Based on your feedback:

  • I have implemented the idea of saving time information as discussed, with a focus on efficient storage and retrieval.
  • I have also fixed a logical bug in the previous commit titled "Avoiding Preloading" to ensure the updated implementation aligns with the lazy caching approach.

Regarding the feature to prioritize file sampling based on weights or importance, I will create an issue for it in the repository to track any discussions or future plans around it.

Thank you once again for your guidance and support. I look forward to your feedback on the latest changes.

Best regards,
Pratik Kulkar
