-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WaveBank doesn't update when new files are added to the directory. #275
Comments
Hi @lvanderlaat, Yes there is a subtly here. For a new file to get indexed it doesn't just have to be moved into the wavebank directory, but its mtime has to be greater than the timestamp stored in the index of the last indexing time. Currently the implementation is rather simple; we scan through all the files in the directory and just index the ones which have mtime greater than the last time the index operation was run. If we wanted to be sure to include new files in the directory, regardless of any timestamps in the file, we would have to load an array of the entire bank contents into memory then compare each file to determine if it was already in the index or not. In addition to increased complexity, I suspect doing it this way would slow down indexing considerably. I suppose rather than mtime, which does not reset when you move a file, we could look using ctime which does. Honestly, I would need to read up on the differences and think through it more before deciding that is the right path. For example, if you moved some files around inside the bank they would then get re-added to the index, which may not be desirable. So, if you are systematically moving files into your bank, just make sure to reset the mtime. You can use import shutil
from pathlib import Path
path1 = Path("...") # path outside the bank directory
path2 = Path("...") # path inside the bank directory
shutil.move(path1, path2)
path2.touch() Give that a try and let me know if they files are now added to your bank. |
Two questions:
|
Yes, good point. If you don't update it will no longer have the correct path so the bank wont work. Maybe we do just want to use the ctime then?
By check if it is already in the bank you mean to just look at the path right? It would get much more complicated if we wanted to check the contents, or the file hash, or something like that to see if two files are the same (or nearly so). I think it would be a pain to do either way. Maybe a pragmatic way to address the duplicate issue, if we don't do this already, would be to just drop duplicates from the contents dataframe before determining which files to load. That way each file can only get loaded once so there really wouldn't be a performance hit for having duplicates in the index. As long as the creation of duplicates are relatively rare, I don't think it would be a big problem. |
I was just thinking to query the index for any entries that have the same contents (minus path) as the new data and then update the path for the entry if it's already in the index. I don't know how the logic for writing to the index works, though, so it could be a bigger PITA than I'm imagining.
Maybe. In the case of a file that got moved, how would we know which of the duplicates to keep when loading the data? |
Wouldn't it work if we always keep the files with the latest ctimes? |
Thinking more about it, would we want the bank to completely re-index if we move the bank folder? If we used ctime that would be a likely side effect. |
Hi @lvanderlaat, a quick follow up: did the provided work-around work for you? |
@shawnboltz and I had a discussion and we think changing ObsPlus to use ctimes rather than mtimes is the right path forward for less surprising behavior. We will open a PR in the (hopefully) near future. |
I am also running into this issue and either the solution above doesn't work or I am misunderstanding what should be in path1 and what should be in path2. Right now I am trying to add new files available in path1 to the existing wavebank of files in path2. Any guidance? (I found a way around it for now, which is to delete the hidden index file in path2). |
Hi @dcroman, In your scenario, is import shutil
from pathlib import Path
# As some setup that I just did manually, I created two folders: "folder1" and "folder2" and created a handful
# of empty text files in "folder1"
path1 = Path("folder1")
path2 = Path("folder2")
shitul.copytree(path1. path2, dirs_exist_ok=True)
# Note that this will touch -all- files in the directory, not just the new ones. You'll want to be more specific or
# you'll effectively re-index everything and possibly end up with duplicate entries
for p in path2.rglob("*"):
p.touch() As a worst-case scenario, it also works to delete the .index.db (EventBank) or .index.h5 (WaveBank) file and just force it to re-index everything. That's not ideal if it's a big bank or something you're going to do repeatedly, however. |
Description
After I add new mseed files to the
base_path
, a new run of my program won't recognize the new data.To Reproduce
My code:
I run it once, and it scans perfectly.
If I add data to the folder, the second time I run it, it won't show the new data.
Expected behavior
The output dataframe should show the new data.
Versions (please complete the following information):
The text was updated successfully, but these errors were encountered: