Comparison of JP2 and other data holdings at different locations #340
Comments
The leading idea for this is from Bogdan's message here
So the idea is something like this: If insertion time is the best metric to use, there are some things to think about:
Hello, thanks for the proposal! We definitely need a better synchronization process. A few clarifications first about your flowchart:
A proposal:
Clarifications:
Yes, "new files" means files whose ingestion date on the remote server is newer than the ingestion date on the local server. Once the local server pulls a new file, its local ingestion date will be newer than the remote ingestion date, so this method only works one way. There must therefore be a "primary" or "authoritative" server.
It is the ingestion date of the file on the local server.
Yes, it will be one of the parameters returned in query.
Yes, the existing ingestion method would stay the same. The part this proposal changes is how new files are selected. For this download method, instead of querying a web directory, it would query another helioviewer server and filter the results following the proposal we're discussing here.
Proposal Comments:
Agreed, UTC for all dates. And the date of interest is the time the file was added to the local database.
It sounds like the dates will need to be stored per-source. Since the database is already storing the latest ingestion dates, it could get the latest ingestion dates for the sources being updated, and choose the oldest time from that selection. |
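The "oldest of the latest ingestion dates" idea can be sketched roughly as below. This is only an illustration: the function name `sync_start_time`, the dictionary layout, and the example source names and timestamps are assumptions, not part of the existing database schema.

```python
from datetime import datetime, timezone


def sync_start_time(latest_ingestion_by_source, source_ids):
    """Return the oldest of the per-source latest ingestion times (UTC).

    Querying the primary server from this time onward should cover every
    source being updated, at the cost of re-listing some files for the
    sources that are already more up to date.
    """
    return min(latest_ingestion_by_source[s] for s in source_ids)


# Hypothetical latest local ingestion times, keyed by source name.
latest = {
    "AIA 94": datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc),
    "AIA 304": datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc),
    "LASCO C2": datetime(2023, 4, 28, 0, 0, tzinfo=timezone.utc),
}

# Only the two AIA sources are being updated, so LASCO C2 is ignored.
start = sync_start_time(latest, ["AIA 94", "AIA 304"])
```

Here `start` is the AIA 304 time, since it is the older of the two selected sources.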
I'm away and have not been able to think about this schema. Another thing to add is the computation of a checksum for each file ingested. This checksum would be stored as a column and could be retrieved over HAPI. This allows the following data integrity checks:
So the query could select remote files based on their (remote) ingestion time, but the comparison between remote files and local files could be on checksum only? (no need to compare ingestion times if the checksums are compared) |
Checksum sounds good. That seems more reliable than just checking a timestamp. |
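A minimal sketch of the checksum comparison discussed above, assuming SHA-256 as the hash (no algorithm has been chosen in this thread) and assuming each server can produce a mapping of file name to checksum. The names `file_checksum` and `files_to_fetch` are hypothetical.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(remote, local):
    """Return names of remote files that are missing locally or whose
    checksum differs from the local copy.

    Both arguments map file name -> checksum string.
    """
    return sorted(
        name for name, checksum in remote.items()
        if local.get(name) != checksum
    )


# Placeholder checksums, for illustration only.
remote = {"a.jp2": "aa", "b.jp2": "bb", "c.jp2": "cc"}
local = {"a.jp2": "aa", "b.jp2": "xx"}

needed = files_to_fetch(remote, local)  # b.jp2 differs, c.jp2 is missing
```

With this comparison the ingestion time is only needed to narrow the remote query; the decision to download rests on the checksum alone.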
I wrote up what we've discussed so far here. And I'm sure I've made some assumptions particularly about who's mirroring which sources. Please review and feel free to edit. For the HAPI server, how will the datasets be grouped? We could group by source id, then each HAPI dataset would be at the measurement level i.e. AIA 94 is its own dataset, AIA 304 is its own dataset, etc |
Looks good. A small comment (somewhat following this comment): the query on the primary server needs to select by ingestion time, but I don't see why returning ingestion time is necessary. In pseudo SQL: I don't have an opinion on dataset grouping. |
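The point about selecting by ingestion time without returning it could look something like the sketch below. The table name `data`, its columns, and the sample rows are all assumptions made for illustration; they are not the actual helioviewer schema. Timestamps are stored as UTC ISO 8601 strings in one fixed format, so string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data ("
    "filename TEXT, source_id INTEGER, checksum TEXT, ingested_at TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("a.jp2", 13, "aa", "2023-04-30T23:00:00Z"),
        ("b.jp2", 13, "bb", "2023-05-01T06:00:00Z"),
        ("c.jp2", 14, "cc", "2023-05-02T00:00:00Z"),
        ("d.jp2", 13, "dd", "2023-05-02T01:00:00Z"),
    ],
)

# Filter on ingestion time, but return only the file name and checksum.
rows = conn.execute(
    "SELECT filename, checksum FROM data "
    "WHERE source_id = ? AND ingested_at > ? "
    "ORDER BY filename",
    (13, "2023-05-01T00:00:00Z"),
).fetchall()
```

For the sample rows, only `b.jp2` and `d.jp2` match: `a.jp2` was ingested before the cutoff and `c.jp2` belongs to a different source.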
Makes sense. Technically in HAPI there's no way to turn that off, though. Time is always returned even if it's not requested. |
There are four JPIP servers running: GSFC, ROB, IAS and ESAC. Each serves the data stored at its own location. It would be good to understand more completely which data are held at which location. This would improve the ability of each helioviewer location to fill in gaps in its data holdings.