Comparison of JP2 and other data holdings at different locations #340

Open
wafels opened this issue Oct 13, 2023 · 9 comments

@wafels
Contributor

wafels commented Oct 13, 2023

There are four different JPIP servers running: GSFC, ROB, IAS and ESAC. Each of them serves the data stored at its own location. It would be good to understand more completely which data are held at which location. This would improve the ability of each helioviewer location to fill in gaps in its data holdings.

@dgarciabriseno
Contributor

The leading idea for this is from Bogdan's message here

My idea is to have insertion time as a column in the HV database and add the ability to query that remotely (HAPI?)

So the idea is something like this:
[Image: diagram of mirror algorithm]

If insertion time is the best metric to use, there are some things to think about:

  • Only works one-way. When B mirrors from A, the timestamps in B will be newer than those in A.
  • I'm not sure whether issues could arise when a server both runs in mirror mode and pulls from upstream sources.

@bogdanni @ebuchlin, any thoughts?

@ebuchlin

Hello, thanks for the proposal! We definitely need a better synchronization process.

A few clarifications first about your flowchart:

  • Is "query primary server for new files" about ingestion date being more recent than the last (remote) ingestion date of the files already received (by the local server)? (I guess yes)
  • In the different servers' databases, is ingestion date the ingestion date on the local server, or first ingestion date of this version of the file on any server?
  • In the box "is timestamp more recent?", is this about the remote file timestamp being more recent than the local file timestamp? (I guess yes) But then how is the remote file timestamp obtained by the local server?
  • What is happening to the local database after download, do we keep the same file ingestion mechanism as currently?

A proposal:

  • The "ingestion date" (everything in UTC) in the local database is the local ingestion date, which should be more or less equivalent to the local file timestamp (and same for the remote server, which is "local" from its own viewpoint)
  • The query to the remote server is by ingestion date range (usually, in routine operations: from the last locally processed remote ingestion date to now)
    • This query uses the remote server's HAPI interface
    • There could be other conditions in the query (observation date range, dataset...) because maybe we don't want to update everything at once, but then the last locally processed remote ingestion date cannot be stored to be used later anymore.
  • For each file in the result of this query, ordered by ingestion date, if (the file does not exist in the local database) or (the local ingestion date is older than the remote ingestion date):
    • download file
    • ingest it (setting the ingestion date to the current date; replacing existing database row if there is one)
  • There could be a mechanism to check that the servers' times (clocks) are synchronized as expected
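The selection rule in the proposal above can be sketched roughly as follows. This is only an illustration under assumptions: `FileRecord`, `files_to_fetch` and `ingest` are hypothetical names, the local database is modelled as a plain dict, and the actual download step is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileRecord:
    name: str
    ingestion_date: datetime  # UTC, as recorded by the server holding the file


def files_to_fetch(remote_records, local_db):
    """Return names of remote files that should be downloaded and re-ingested.

    remote_records: result of the ingestion-date-range query on the remote server.
    local_db: dict mapping file name -> local FileRecord (hypothetical stand-in
    for the real database).
    """
    selected = []
    # Process in order of remote ingestion date, as in the proposal.
    for remote in sorted(remote_records, key=lambda r: r.ingestion_date):
        local = local_db.get(remote.name)
        if local is None or local.ingestion_date < remote.ingestion_date:
            selected.append(remote.name)
    return selected


def ingest(name, local_db, now=None):
    """Record a (re-)ingested file, stamped with the current UTC time."""
    now = now or datetime.now(timezone.utc)
    local_db[name] = FileRecord(name, now)  # replaces an existing row, if any
```

Because `ingest` stamps the file with the local current time, a file that has just been mirrored is no longer selected by the next run against the same remote server.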

@dgarciabriseno
Contributor

Clarifications:

Is "query primary server for new files" about ingestion date being more recent than the last (remote) ingestion date of the files already received (by the local server)? (I guess yes)

Yes, new files means the ingestion date on the remote server is newer than the ingestion date on the local server. Since after the local server pulls the new file, its local ingestion date will be newer than the remote ingestion date, this method only works one way. So there must be a "primary" or "authoritative" server.

In the different servers' databases, is ingestion date the ingestion date on the local server, or first ingestion date of this version of the file on any server?

It is the ingestion date of the file on the local server.

In the box "is timestamp more recent?", is this about the remote file timestamp being more recent than the local file timestamp? (I guess yes) But then how is the remote file timestamp obtained by the local server?

Yes, it will be one of the parameters returned in query.

What is happening to the local database after download, do we keep the same file ingestion mechanism as currently?

Yes, the existing ingestion method would stay the same. The part this proposal changes is how new files are selected. For this download method, instead of querying a web directory, it would query another helioviewer server and filter the results following the proposal we're discussing here.

Proposal Comments:

The "ingestion date" (everything in UTC) in the local database is the local ingestion date, which should be more or less equivalent to the local file timestamp (and same for the remote server, which is "local" from its own viewpoint)

Agreed, UTC for all dates. And the date of interest is the time the file was added to the local database.

There could be other conditions in the query (observation date range, dataset...) because maybe we don't want to update everything at once, but then the last locally processed remote ingestion date cannot be stored to be used later anymore.

It sounds like the dates will need to be stored per-source. Since the database is already storing the latest ingestion dates, it could get the latest ingestion dates for the sources being updated, and choose the oldest time from that selection.
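A small sketch of that per-source idea, assuming the latest ingestion dates are available as a mapping from source to UTC datetime (the function name and the mapping are hypothetical, not the existing schema):

```python
from datetime import datetime, timezone


def query_start_time(latest_ingestion_by_source, sources):
    """Pick the start of the ingestion-date range for a partial update.

    Takes the latest ingestion date of each source being updated and returns
    the oldest of them, so that no selected source misses files.
    """
    return min(latest_ingestion_by_source[source] for source in sources)
```

Starting from the oldest of these dates means some sources will see files they already have, but those are filtered out by the per-file comparison anyway.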

Updated diagram:
[Image: updated diagram for sync method]

@bogdanni
Contributor

bogdanni commented Nov 2, 2023

I'm away and I could not think about this schema.

Another thing to add is the computation of a checksum for each file ingested. This checksum would be stored as a column and could be retrieved over HAPI. This allows the following data integrity checks:

  • identify local storage corruption
  • identify when a file on a remote server is different
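As a rough illustration of the first check, a per-file checksum could be computed at ingestion and compared later against the stored value. The hash algorithm is not specified in this thread; SHA-256 and the function names below are assumptions made for the sketch.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_stored_checksum(path, stored_checksum):
    """True if the file on disk still matches the checksum stored at ingestion."""
    return file_checksum(path) == stored_checksum
```

Reading in chunks keeps memory use flat regardless of file size.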

@ebuchlin

ebuchlin commented Nov 2, 2023

So the query could select remote files based on their (remote) ingestion time, but the comparison between remote files and local files could be on checksum only? (no need to compare ingestion times if the checksums are compared)
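A sketch of what that comparison might look like, assuming the remote query returns (name, checksum) pairs and the local checksums are available as a dict (both hypothetical shapes):

```python
def files_differing_by_checksum(remote_rows, local_checksums):
    """Return names of remote files that are missing locally or differ.

    remote_rows: iterable of (name, checksum) pairs, already restricted by
    remote ingestion time in the query.
    local_checksums: dict mapping file name -> locally stored checksum.
    """
    return [
        name
        for name, checksum in remote_rows
        if local_checksums.get(name) != checksum
    ]
```

A file that is absent locally yields `None` from `dict.get`, which never equals a checksum string, so missing and changed files are handled by the same test.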

@dgarciabriseno
Contributor

Checksum sounds good. That seems more reliable than just checking a timestamp.

@dgarciabriseno
Contributor

I wrote up what we've discussed so far here. I'm sure I've made some assumptions, particularly about who's mirroring which sources.

Please review and feel free to edit.

For the HAPI server, how will the datasets be grouped? We could group by source id, so that each HAPI dataset would be at the measurement level, i.e. AIA 94 is its own dataset, AIA 304 is its own dataset, etc.

@ebuchlin

ebuchlin commented Nov 2, 2023

Looks good. A small comment (somewhat following this comment): the query on the primary server needs to select by ingestion time, but I don't see why returning ingestion time is necessary. In pseudo SQL: SELECT name, jp2_url WHERE ingestion_time > ..., probably no need for SELECT ingestion_time, name, ....
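To make the point concrete, here is a runnable sketch of a query that filters on ingestion time without returning it. The table name `data`, its columns, and the example URLs are hypothetical; timestamps are stored as ISO 8601 UTC strings so that string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (name TEXT PRIMARY KEY, jp2_url TEXT, ingestion_time TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        ("a.jp2", "https://example.org/a.jp2", "2023-10-01T00:00:00Z"),
        ("b.jp2", "https://example.org/b.jp2", "2023-10-03T00:00:00Z"),
        ("c.jp2", "https://example.org/c.jp2", "2023-10-05T00:00:00Z"),
    ],
)


def new_files_since(conn, last_ingestion_time):
    """Select files ingested after the given time; the time itself is not returned."""
    cursor = conn.execute(
        "SELECT name, jp2_url FROM data "
        "WHERE ingestion_time > ? ORDER BY ingestion_time",
        (last_ingestion_time,),
    )
    return cursor.fetchall()
```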

I don't have an opinion on dataset grouping.

@dgarciabriseno
Contributor

Makes sense. Technically in HAPI there's no way to turn that off, though. Time is always returned even if it's not requested.
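For illustration, a request and its CSV response might be handled like this. The base URL, the dataset id `AIA_94` and the parameter names `name` and `checksum` are made up; the query keys follow HAPI 3.x as I understand the spec (`dataset`, `start`, `stop`), whereas HAPI 2.x uses `id`, `time.min` and `time.max`.

```python
import csv
import io
from urllib.parse import urlencode


def hapi_data_url(base_url, dataset, start, stop, parameters):
    """Build a HAPI data request URL for the given time range and parameters."""
    query = urlencode(
        {
            "dataset": dataset,
            "start": start,
            "stop": stop,
            "parameters": ",".join(parameters),
            "format": "csv",
        }
    )
    return f"{base_url}/data?{query}"


def parse_hapi_csv(text, parameters):
    """Parse HAPI CSV rows into dicts.

    The first column is the time value, present even though it was not
    listed in the requested parameters; it is stored under the key "time".
    """
    columns = ["time", *parameters]
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(text))]
```

So the mirror script gets the remote ingestion time for free in the first column and can simply ignore it if only the checksum is compared.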


4 participants