
cd: create a gha that generate an index for each query in the src/sql folder #5

Closed
wants to merge 13 commits into from

Conversation


@vkt1414 vkt1414 commented Mar 15, 2024

This PR aims to address part of #2.

  • Manual trigger only for now

    • has additional triggers but no cron jobs
  • Take all of the queries in the queries folder and run them

  • create artifacts containing the result of each query, saved as CSV and Parquet; files are named consistently with the query file name and include the IDC version in the file name, determined by resolving what idc_current maps to at the time the query is executed

  • to get the number of the latest IDC version, list all of the datasets in the bigquery-public-data project and take the latest idc_v*

    • leverages a more efficient approach by querying the table bigquery-public-data.idc_current.version_metadata to get the latest IDC release
  • create an issue that will include links to the artifacts generated by the GHA, title "[github-action] Index updates IDC v..." (something like that)

    • creates a pull request to update queries if they are outdated
  • replace idc_current with the actual version in the query and save each of the queries as a GHA artifact

    • queries will now have the version explicitly mentioned in them. I had to go this route as there is no other way to know which version we would be referring to. Previously, idc-version in index.py was used as the baseline; in this repo there is no such baseline, so I figured this may be the best way possible.

    @jcfr I'm still figuring out how CMake works and am not completely sure if this is the best way to package the indices. As we are not uploading the indices as GH release attachments, I removed the download-by-URL logic and instead tried to package the generated indices as part of the CD. I'm working on ci.yml too, but am unable to figure out how to package the indices. Any guidance is much appreciated. https://github.com/vkt1414/idc-index-data/actions/runs/8288398907

    @fedorov we may need to decide whether to keep both CSV and Parquet or choose one. In my opinion we should keep both and slowly phase out CSV. Also, please let me know whether this PR meets the goals mentioned in Automate index updates and packaging #2

    We will need to create a couple of secrets for this GHA. A sample successful run is available here: https://github.com/vkt1414/idc-index-data/actions/runs/8289604309/job/22686311347
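The version-discovery step described in the bullets above (list the datasets in bigquery-public-data and take the latest idc_v*) amounts to a numeric sort over dataset names. A minimal Python sketch, with a hypothetical dataset list standing in for the BigQuery client call:

```python
import re

def latest_idc_version(dataset_ids):
    """Return the highest IDC release number among idc_v* dataset IDs.

    Compares numerically, so idc_v10 correctly sorts after idc_v9.
    """
    versions = [
        int(m.group(1))
        for dataset_id in dataset_ids
        if (m := re.fullmatch(r"idc_v(\d+)", dataset_id))
    ]
    if not versions:
        raise ValueError("no idc_v* dataset found")
    return max(versions)

# Hypothetical listing; the workflow would obtain this from
# google.cloud.bigquery.Client().list_datasets("bigquery-public-data").
datasets = ["idc_current", "idc_v9", "idc_v17", "idc_v18"]
print(latest_idc_version(datasets))  # → 18
```

As the sub-bullet notes, the PR instead queries bigquery-public-data.idc_current.version_metadata, which avoids listing datasets altogether.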

@vkt1414 vkt1414 changed the title cd:create a gha that generate an index for each query in the src/sql folder cd: create a gha that generate an index for each query in the src/sql folder Mar 15, 2024
@jcfr

jcfr commented Mar 15, 2024

Well done moving forward. I will review and help out tomorrow.

@fedorov

fedorov commented Mar 15, 2024

I will spend more time reviewing, but a few comments for now:

  • v18 CSV index is ~75 MB compressed; I wonder if we should discuss what it would take to switch to Parquet only for v18? What are the challenges/downsides?
  • I noticed you have idc_v17 in the query - should that point to idc_current instead? I do not think the GHA should be updating the version in the query. The comment in the earlier discussion was to replace idc_current with the specific version and save the result as an artifact, so that the versioned queries could be included in the release for provenance of the packaged indices.

In light of the reorganization of the repositories by @jcfr, I think we should revisit the original plan outlined in #2 (comment).

In particular, as I understand it, there is no need anymore to attach index files as release artifacts. They can be uploaded directly to PyPI as part of the package. This would also make it unnecessary to create an issue.

Maybe we should have one workflow with manual trigger to generate results of running the query, as you did it - this will be useful to verify everything works as expected before preparing the package - and then another one that would be triggered on the creation of the release (as done in idc-index) that will upload the package to PyPI?

@jcfr

jcfr commented Mar 15, 2024

Suggestions:

  • Add scripts directory
  • Move src/sql/idc_index.sql to scripts/sql/idc_index.sql
  • Move and rename from .github/get_latest_index.py to scripts/idc_index_data_manager.py
  • Update pyproject settings to ensure linting also accounts for the scripts directory
  • Add code currently associated with initialize_idc_manager_class step of cd.yml as a function to idc_index_builder.py
  • Update CMakeLists with code similar to the following:
    find_package(
      Python
      COMPONENTS Interpreter
      REQUIRED)

    set(IDC_INDEX_DATA_VERSION "17")

    if(NOT DEFINED ENV{GCP_PROJECT_ID})
      message(FATAL_ERROR "GCP_PROJECT_ID env. variable is not set")
    endif()

    option(IDC_INDEX_DATA_GENERATE_PARQUET "Generate idc_index.parquet file" OFF)

    set(download_dir "${PROJECT_BINARY_DIR}")

    add_custom_command(
      OUTPUT
        ${download_dir}/idc_index.csv.zip
        $<$<BOOL:${IDC_INDEX_DATA_GENERATE_PARQUET}>:${download_dir}/idc_index.parquet>
      COMMAND Python::Interpreter -m idc_index_data_manager
        --generate-csv-archive
        $<$<BOOL:${IDC_INDEX_DATA_GENERATE_PARQUET}>:--generate-parquet>
        --version ${IDC_INDEX_DATA_VERSION}
      ...
  • Revert the changes to cd.yml
  • Add dedicated workflow (e.g called bump-version.yml) that will get latest version, get current version from CMakeLists.txt (by parsing it using a regex), and if it applies, update CMakeLists.txt and create a pull request
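The idc_index_data_manager invocation in the CMake snippet above implies a small command-line interface. A sketch of what that CLI could look like with argparse - the flag names are taken from the CMake snippet, but the actual script may expose a different interface:

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    # Flag names mirror the CMake invocation above; the real
    # idc_index_data_manager may differ in detail.
    parser = argparse.ArgumentParser(prog="idc_index_data_manager")
    parser.add_argument("--generate-csv-archive", action="store_true",
                        help="write idc_index.csv.zip")
    parser.add_argument("--generate-parquet", action="store_true",
                        help="write idc_index.parquet")
    parser.add_argument("--version", required=True,
                        help="IDC release number, e.g. 17")
    return parser

args = make_parser().parse_args(["--generate-csv-archive", "--version", "17"])
print(args.generate_csv_archive, args.generate_parquet, args.version)  # → True False 17
```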

@jcfr

jcfr commented Mar 15, 2024

This first version should package v17 without Parquet; once this is working, we will then release version 17.0.0 on PyPI.

After that, we will add the bump-version.yml workflow to effectively detect that a new version is available and release it as well.

@fedorov

fedorov commented Mar 15, 2024

After that, we will add the bump-version.yml workflow to effectively detect that a new version is available and release it as well.

Can you clarify what you mean by this? Effectively detect new version of what and in which package?

@jcfr

jcfr commented Mar 15, 2024

Can you clarify what you mean by this? Effectively detect new version of what and in which package?

The bump-idc-data-index-version.yml workflow would be scheduled to run at a given frequency (daily, weekly, or monthly); it would then do the following:

  • call retrieve_latest_idc_release_version1
  • extract the version from CMakeLists.txt
  • if the versions differ, create a pull request with the update.

Footnotes

  1. https://github.com/ImagingDataCommons/idc-index-data/pull/5/files#r1526830095
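The "extract the version from CMakeLists.txt" step could be sketched as follows; the set(IDC_INDEX_DATA_VERSION "17") form is assumed from the CMake snippet earlier in this thread:

```python
import re

def extract_pinned_version(cmakelists_text: str) -> int:
    """Parse the hard-coded IDC version out of CMakeLists.txt content."""
    match = re.search(r'set\(IDC_INDEX_DATA_VERSION\s+"(\d+)"\)', cmakelists_text)
    if match is None:
        raise ValueError("IDC_INDEX_DATA_VERSION not found in CMakeLists.txt")
    return int(match.group(1))

def needs_bump(cmakelists_text: str, latest_release: int) -> bool:
    """True when the pinned version lags behind the latest IDC release."""
    return extract_pinned_version(cmakelists_text) < latest_release

print(needs_bump('set(IDC_INDEX_DATA_VERSION "17")', 18))  # → True
```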

@fedorov

fedorov commented Mar 15, 2024

The bump-idc-data-index-version.yml workflow would be scheduled to run at a given frequency (daily, weekly, or monthly); it would then do the following:

I see. We considered this approach before, but at this point I do not believe it is a good one. First, IDC releases do not happen frequently - once every 1-3 months - so there is no significant burden in responding to IDC version updates manually. Second, new IDC releases will have new data or schema updates that may break assumptions, produce an index that is too large, or cause other problems we cannot anticipate. Because of this, it is safer to manually test the queries and consider the result of each query before publishing.

I would prefer to have:

  1. a manually-triggered workflow that would run the queries and generate artifacts for examination and release preparation
  2. an automatic workflow, triggered on creation of a release, that would generate the index based on the current content of the queries and publish the package on PyPI

@vkt1414

vkt1414 commented Mar 20, 2024

@jcfr I was finally able to call the data manager directly from CMake, and cleaned up the data manager and CD heavily. I realized I was very far from your vision of using CMake to handle index generation, but I am also slowly understanding how the pyproject.toml, CMake, and hynek/build-and-inspect-python-package GHA combination works. Thanks for your patience. The latest iteration addresses most if not all of the things Andrey would like. I hardcoded generating both CSV and Parquet, as I realized we would be hardcoding even if we use CMakeLists. But with some help we can fix that; ideally we would parameterize CMake instead of the Python file.

latest successful run: https://github.com/vkt1414/idc-index-data/actions/runs/8366868268

@vkt1414 vkt1414 marked this pull request as ready for review March 20, 2024 23:20
@vkt1414 vkt1414 requested review from fedorov and jcfr March 21, 2024 01:50
@vkt1414

vkt1414 commented Mar 21, 2024

One more update: after including the GCP authentication step, CI is also working on all combinations of Python 3.8-3.12 on the latest macOS, Ubuntu, and Windows. However, PyPy 3.10 on the latest Ubuntu did not work, and the logs did not help to troubleshoot, so I disabled pypy-3.10 for now.
latest CI run: https://github.com/vkt1414/idc-index-data/actions/runs/8376050860

@fedorov

fedorov commented Mar 22, 2024

For reference, this PR was needed to fix the CI failure caused by the inability to access GHA secrets from a PR submitted from a fork of the repo: #7.


@jcfr jcfr left a comment


The issue with the current approach is that wheels are not reproducible: the content of the generated index file depends on the time at which it is generated.

For the sake of "simplicity", you could move forward with this approach, but it may be sensible to:

  • document that the generated wheel includes an index file corresponding to the state of the tables at the time the wheel was generated
  • not upload any source distribution (.tar.gz) to PyPI. Indeed it would be misleading, as rebuilding from the source distribution would not yield an index file matching the one provided in the wheel
  • revisit the versioning from being "number" based to being date based (e.g. YYYY.MM.DD.N)
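For illustration only, a date-based version along those lines could be formatted as below. The zero-padding is a choice here, not something the comment specifies, and PEP 440 normalization would turn 2024.03 into 2024.3, so the exact format would need care:

```python
from datetime import date

def date_based_version(build_date: date, build_number: int = 0) -> str:
    """Format a YYYY.MM.DD.N version string (zero-padded month/day)."""
    return f"{build_date.year}.{build_date.month:02d}.{build_date.day:02d}.{build_number}"

print(date_based_version(date(2024, 3, 22), 1))  # → 2024.03.22.1
```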

@fedorov

fedorov commented Mar 22, 2024

The issue with the current approach is that wheels are not reproducible: the content of the generated index file depends on the time at which it is generated.

@jcfr this is a very good point. I did not consider that it is important for the wheel packaging process to be reproducible.

revisit the versioning from being "number" based to be date based (e.g YYYY.MM.DD.N)

The current approach of using the IDC data release version as the major version of the package serves exactly the purpose of enabling reproducibility. I would argue that mapping a date to the IDC release version is not trivial, whereas if the user knows the IDC data version, that is all that is needed to make queries reproducible.

Thinking about this a bit more, we could modify the queries to have, for example {idc_versioned_dataset} in place of idc_current, and replace that placeholder with the IDC version based on the idc-index-data major release number. So, for example, if the user generates release tag 17.2.10, GHA workflow would pick 17, replace the dataset name in each of the queries with idc_v17, run queries, and package+publish the result into idc-index-data v17.2.10.
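The tag-to-dataset substitution described above can be sketched directly; the query template here is illustrative, not one of the actual src/sql queries:

```python
# Illustrative template; real queries would carry the
# {idc_versioned_dataset} placeholder in place of idc_current.
QUERY_TEMPLATE = (
    "SELECT SeriesInstanceUID "
    "FROM `bigquery-public-data.{idc_versioned_dataset}.dicom_all`"
)

def pin_query_to_release(query_template: str, release_tag: str) -> str:
    """Replace {idc_versioned_dataset} with idc_v<major> for a tag like 17.2.10."""
    major = release_tag.split(".")[0]
    return query_template.format(idc_versioned_dataset=f"idc_v{major}")

print(pin_query_to_release(QUERY_TEMPLATE, "17.2.10"))
# → SELECT SeriesInstanceUID FROM `bigquery-public-data.idc_v17.dicom_all`
```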

@jcfr

jcfr commented Mar 22, 2024

for example {idc_versioned_dataset} in place of

Indeed, that would be similar to the use of @IDC_TABLENAME@ discussed in #5 (comment)

For convenience, supporting --output-sql-query would also be nice to have.

version

Extracting the major version and mapping it to the name of the table is an option. For this to work, we would also need to customize the versioning with the following settings:

[tool.setuptools_scm]
version_scheme = "no-guess-dev"

The "challenge" with relying only on the tag is that we would have no way of testing the generation of a wheel associated with a new version of the index by simply creating a pull request; a new tag and release would have to be created first, without knowing if it actually works.

To address this I would instead suggest to hard-code the major version in the CMakeLists.txt as we originally discussed.

@jcfr jcfr force-pushed the feat-cd branch 2 times, most recently from 4c2e6ef to ae34b46 Compare March 22, 2024 23:21
@vkt1414

vkt1414 commented Mar 23, 2024

The "challenge" with relying only on the tag is that we would have no way of testing the generation of a wheel associated with a new version of the index by simply creating a pull request; a new tag and release would have to be created first, without knowing if it actually works.

To address this I would instead suggest to hard-code the major version in the CMakeLists.txt as we originally discussed.

If all of the solutions we discussed involve manually updating the version, can't we simply hardcode the idc version in the sql query itself?

@jcfr

jcfr commented Mar 25, 2024

If all of the solutions we discussed involve manually updating the version, can't we simply hardcode the idc version in the sql query itself?

I have a draft of a bump function that addresses this and allows updating files using nox -s bump (to query the latest version) or nox -s bump -- <version>.

Before moving forward, we really need @fedorov to integrate #8 as none of the code in this pull request is being tested.

@fedorov

fedorov commented Mar 25, 2024

@jcfr that PR is integrated now, please go ahead or let us know what we should do to fix this.

vkt1414 and others added 13 commits March 25, 2024 14:02
  • …every query in the src/sql folder. The GHA will check if the version of IDC is out of date. If so, the queries will be updated and run with BigQuery to generate CSV and Parquet files in the root directory. CMakeLists is edited to pick up the CSV and Parquet files generated by bq and package them. Moreover, gitignore is edited to have a pattern for ignoring service account files. Lastly, if queries are updated, a pull request will be created to keep the queries in sync with the latest IDC release.
  • a helper Python script is referenced and run in the GHA to generate indices
  • use static methods for functions that do not rely on initialization of a class
  • SQL queries are reverted back to using idc_current, but explicitly versioned queries are uploaded as GHA artifacts
  • pylint is configured to lint the scripts folder as well
  • call the index generator from CMakeLists
  • started using a single function to generate indices instead of two, taking boolean inputs for whether or not to generate the CSV and Parquet files. Currently hard-coded to generate both formats; not fully using the boolean capabilities yet with CMake.
  • queries no longer need to be attached as GHA artifacts
  • fix minor formatting errors
$<$<BOOL:IDC_INDEX_DATA_GENERATE_CSV_ARCHIVE>:${download_dir}/idc_index.csv.zip>
$<$<BOOL:IDC_INDEX_DATA_GENERATE_PARQUET>:${download_dir}/idc_index.parquet>
COMMAND python ${CMAKE_CURRENT_SOURCE_DIR}/scripts/python/idc_index_data_manager.py
--generate-csv-archive
Member


Should this line instead be

$<$<BOOL:IDC_INDEX_DATA_GENERATE_CSV_ARCHIVE>:--generate-csv-archive>

Collaborator Author


Also, I do not know if the boolean value for Parquet is passed along to the index manager properly, as it seems to generate both CSV and Parquet even though the latter is disabled in CMakeLists.
This was from the latest CD run:
https://github.com/ImagingDataCommons/idc-index-data/actions/runs/8425241913
[screenshot of the workflow run output]

@fedorov

fedorov commented Mar 25, 2024

@jcfr I recall when we met last time you mentioned that one could handle packaging with just Python, and that CMake is not critical. Maybe you can comment on that a bit. I do not want to derail the current development, but I also do not know whether learning CMake is a worthwhile investment of effort for Vamsi. I just wanted to raise this so he can understand the alternatives better.


add_custom_command(
OUTPUT
$<$<BOOL:IDC_INDEX_DATA_GENERATE_CSV_ARCHIVE>:${download_dir}/idc_index.csv.zip>
Member


@jcfr if we want to build this for growth from the start, we cannot hard-code the file names. The generator code will now produce CSV/Parquet for each of the queries in the sql folder, but any change to the SQL query file names, or the addition of new queries, will break the code at the moment.

It would make more sense to use wildcards to package all CSV/Parquet files depending on what is configured with the flags. Vamsi is going to give it a try to replace with the wildcard, unless you have other thoughts.

Collaborator


Vamsi is going to give it a try to replace with the wildcard

Using a wildcard as such will not work, as the target checks the timestamp of the output (i.e. the filename) to decide whether it should "re-build" or not.

Once this is integrated, I suggest you create an issue describing the type, name, and size of the index files, and how these relate to the queries that generate them (as well as which domain they relate to, e.g. microscopy, ...).

Let's also note that, with working CI and CD pipelines, refactoring will be straightforward.

@jcfr

jcfr commented Mar 25, 2024

re: CMake vs Python

Ditto. A pure Python plugin could be written to implement similar behavior; there are some hints in the "Discussions" section of one of our recent scikit-build meetings1

Given my limited experience with such a hatch plugin, I suggest we move forward with CMakeLists.txt and consider revisiting afterward. Revisiting later will be much easier once we have a working system.

Also worth noting that I expect the CMakeLists.txt to remain small and be straightforward to maintain.

Footnotes

  1. https://github.com/orgs/scikit-build/discussions/1071

@jcfr

jcfr commented Mar 25, 2024

re: permission error

For now, I suggest we create the feat-cd-2 branch directly in the repo and not from a fork.

@jcfr

jcfr commented Mar 27, 2024

Closing. This pull request is superseded by #9

@jcfr jcfr closed this Mar 27, 2024
@vkt1414 vkt1414 mentioned this pull request Apr 15, 2024
@vkt1414 vkt1414 deleted the feat-cd branch June 9, 2024 18:00