Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaveBank doesn't update when new files are added to the directory. #275

Open
lvanderlaat opened this issue Apr 18, 2024 · 10 comments
Open
Labels
bug Something isn't working

Comments

@lvanderlaat
Copy link

Description

After I add new mseed files to the base_path, a new run of my program won't recognize the new data.

To Reproduce

My code:

client = WaveBank(base_path).update_index()

print(client.get_availability_df())

I run it once, and it scans perfectly.

If I add data to the folder, the second time I run it, it won't show the new data.

Expected behavior

The output dataframe should show the new data.

Versions (please complete the following information):

  • OS: MacOS 14
  • ObsPlus Version 0.3.0
  • Python Version: 3.11.8
@lvanderlaat lvanderlaat added the bug Something isn't working label Apr 18, 2024
@d-chambers
Copy link
Member

Hi @lvanderlaat,

Yes there is a subtly here. For a new file to get indexed it doesn't just have to be moved into the wavebank directory, but its mtime has to be greater than the timestamp stored in the index of the last indexing time.

Currently the implementation is rather simple; we scan through all the files in the directory and just index the ones which have mtime greater than the last time the index operation was run. If we wanted to be sure to include new files in the directory, regardless of any timestamps in the file, we would have to load an array of the entire bank contents into memory then compare each file to determine if it was already in the index or not. In addition to increased complexity, I suspect doing it this way would slow down indexing considerably.

I suppose rather than mtime, which does not reset when you move a file, we could look using ctime which does. Honestly, I would need to read up on the differences and think through it more before deciding that is the right path. For example, if you moved some files around inside the bank they would then get re-added to the index, which may not be desirable.

So, if you are systematically moving files into your bank, just make sure to reset the mtime. You can use path.Path.touch for this. For example:

import shutil
from pathlib import Path

path1 = Path("...")  # path outside the bank directory
path2 = Path("...")  # path inside the bank directory

shutil.move(path1, path2)
path2.touch()

Give that a try and let me know if they files are now added to your bank.

@shawnboltz
Copy link
Collaborator

@d-chambers

I suppose rather than mtime, which does not reset when you move a file, we could look using ctime which does. Honestly, I would need to read up on the differences and think through it more before deciding that is the right path. For example, if you moved some files around inside the bank they would then get re-added to the index, which may not be desirable.

Two questions:

  1. If you move the files around within the bank, wouldn't it then be desirable to delete the old entry and add the new one to the index? Unless you "move" them to the same spot, the bank will no longer be able to find them.
  2. Would a longer-term strategy be to both check the ctime (assuming it behaves how we think) and then also check if the data are already in the index and revise it accordingly? I have actually come across the above scenario where I updated some files (I can't remember the exact context, but the files covered the same seed id(s) and time windows, but there was something different about the waveform data itself) and it did end up duplicating them in the index.

@d-chambers
Copy link
Member

If you move the files around within the bank, wouldn't it then be desirable to delete the old entry and add the new one to the index? Unless you "move" them to the same spot, the bank will no longer be able to find them.

Yes, good point. If you don't update it will no longer have the correct path so the bank wont work. Maybe we do just want to use the ctime then?

Would a longer-term strategy be to both check the ctime (assuming it behaves how we think) and then also check if the data are already in the index and revise it accordingly?

By check if it is already in the bank you mean to just look at the path right? It would get much more complicated if we wanted to check the contents, or the file hash, or something like that to see if two files are the same (or nearly so). I think it would be a pain to do either way.

Maybe a pragmatic way to address the duplicate issue, if we don't do this already, would be to just drop duplicates from the contents dataframe before determining which files to load. That way each file can only get loaded once so there really wouldn't be a performance hit for having duplicates in the index. As long as the creation of duplicates are relatively rare, I don't think it would be a big problem.

@shawnboltz
Copy link
Collaborator

By check if it is already in the bank you mean to just look at the path right? It would get much more complicated if we wanted to check the contents, or the file hash, or something like that to see if two files are the same (or nearly so). I think it would be a pain to do either way.

I was just thinking to query the index for any entries that have the same contents (minus path) as the new data and then update the path for the entry if it's already in the index. I don't know how the logic for writing to the index works, though, so it could be a bigger PITA than I'm imagining.

Maybe a pragmatic way to address the duplicate issue, if we don't do this already, would be to just drop duplicates from the contents dataframe before determining which files to load. That way each file can only get loaded once so there really wouldn't be a performance hit for having duplicates in the index. As long as the creation of duplicates are relatively rare, I don't think it would be a big problem.

Maybe. In the case of a file that got moved, how would we know which of the duplicates to keep when loading the data?

@d-chambers
Copy link
Member

d-chambers commented Apr 19, 2024

Maybe. In the case of a file that got moved, how would we know which of the duplicates to keep when loading the data?

Wouldn't it work if we always keep the files with the latest ctimes?

@d-chambers
Copy link
Member

Thinking more about it, would we want the bank to completely re-index if we move the bank folder? If we used ctime that would be a likely side effect.

@d-chambers
Copy link
Member

Hi @lvanderlaat, a quick follow up: did the provided work-around work for you?

@d-chambers
Copy link
Member

@shawnboltz and I had a discussion and we think changing ObsPlus to use ctimes rather than mtimes is the right path forward for less surprising behavior. We will open a PR in the (hopefully) near future.

@dcroman
Copy link

dcroman commented Oct 11, 2024

I am also running into this issue and either the solution above doesn't work or I am misunderstanding what should be in path1 and what should be in path2. Right now I am trying to add new files available in path1 to the existing wavebank of files in path2. Any guidance?

(I found a way around it for now, which is to delete the hidden index file in path2).

@shawnboltz
Copy link
Collaborator

Hi @dcroman,

In your scenario, is path2 the path of the bank or the path of a specific new file? If it's the path of the bank (or, more generally, a directory), then you need to touch each individual file within it that needs to be updated. Here's a simple example that I just tested:

import shutil
from pathlib import Path

# As some setup that I just did manually, I created two folders: "folder1" and "folder2" and created a handful 
#  of empty text files in "folder1"

path1 = Path("folder1")
path2 = Path("folder2")

shitul.copytree(path1. path2, dirs_exist_ok=True)

# Note that this will touch -all- files in the directory, not just the new ones. You'll want to be more specific or 
#  you'll effectively re-index everything and possibly end up with duplicate entries
for p in path2.rglob("*"): 
    p.touch()

As a worst-case scenario, it also works to delete the .index.db (EventBank) or .index.h5 (WaveBank) file and just force it to re-index everything. That's not ideal if it's a big bank or something you're going to do repeatedly, however.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants