Merge branch 'main' into dmrpp_tutorial_nb

ayushnag authored Jan 27, 2025
2 parents bb38fa1 + fcb0c22 commit e9d9293

Showing 44 changed files with 1,840 additions and 339 deletions.
45 changes: 45 additions & 0 deletions .github/workflows/discussions.yml
@@ -0,0 +1,45 @@
name: Generate Discussion Thread for Hackdays

on:
  workflow_dispatch:

jobs:
  create-discussion-threads:
    runs-on: ubuntu-latest
    permissions:
      discussions: write
      contents: read

    steps:
      - name: Generate the Hackathon title
        run: |
          DATE=$(date --iso-8601 | sed 's|-|/|g')
          echo "DISCUSSION_TITLE=Hackathon $DATE" >> $GITHUB_ENV
      - name: Set the Hackathon description
        run: |
          echo "DISCUSSION_BODY=Reporting out on earthaccess hack days. Use the 'comment' button at the very bottom to send a message. Additionally, consider sending issues and PRs relevant to your work to help make the job of future readers easier. It is okay to duplicate information here! Use the reply feature to have a discussion under any comment. Enjoy!" >> $GITHUB_ENV
      - name: Create Discussions
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPOSITORY_ID: MDEwOlJlcG9zaXRvcnkzOTk4Njc1Mjk=
          CATEGORY_ID: DIC_kwDOF9V-ic4CdYaN
        run: |
          gh api graphql -f query="
          mutation
          {createDiscussion
            (
              input:
              {
                repositoryId: \"${{ env.REPOSITORY_ID }}\",
                categoryId: \"${{ env.CATEGORY_ID }}\",
                body: \"${{ env.DISCUSSION_BODY }}\",
                title: \"${{ env.DISCUSSION_TITLE }}\"
              }
            )
            {
              discussion {id}
            }
          }"
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,12 @@ and this project uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html)

## [Unreleased]

### Fixed

- `earthaccess.download` will let requests automatically decode compressed content
([#887](https://github.com/nsidc/earthaccess/issues/887))
([**@itcarroll**](https://github.com/itcarroll))
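This entry corresponds to the `_download_file` change in `earthaccess/store.py` later in this commit, where `shutil.copyfileobj(r.raw, ...)` is replaced by `r.iter_content(...)`. A minimal sketch of the difference, with a placeholder URL:

```python
# requests transparently decompresses gzip/deflate bodies when the server
# sets Content-Encoding, but only on the decoded interfaces. r.raw is the
# undecoded wire stream; iter_content() yields decoded bytes. The URL here
# is a placeholder, not a real data link.
import requests

r = requests.get("https://example.com/data.nc", stream=True)
for chunk in r.iter_content(chunk_size=1024 * 1024):
    pass  # chunks arrive already decoded, per the fix above
```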

## [v0.12.0] - 2024-11-13

### Changed
7 changes: 2 additions & 5 deletions CODE_OF_CONDUCT.md
@@ -4,17 +4,14 @@

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socioeconomic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
our community a harassment-free experience for everyone.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Using welcoming and fair language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
16 changes: 16 additions & 0 deletions docs/contributing/integration-tests.md
@@ -0,0 +1,16 @@
# Integration tests

## Testing most popular datasets

Some integration tests operate on the most popular collections for each provider in CMR.
Those collection IDs are cached as static data in `tests/integration/popular_collections/`
to give our test suite more stability. The list of most popular collections can be
updated by running a script in the same directory.

Sometimes we find collections with unexpected conditions, like zero granules, and
therefore comment those collections out of the list by prefixing the line with a `#`.

!!! note

Let's consider a CSV format for this data; we may want to, for example, allow
skipping collections with a EULA by representing that as a column.
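As a rough illustration of the `#` convention described above, a test helper might filter a cached list like this; the helper and file name are hypothetical, not part of the test suite:

```python
# Hypothetical helper for reading a cached popular-collections list while
# skipping lines commented out with "#". The file name is illustrative only.
from pathlib import Path


def load_collection_ids(path: Path) -> list[str]:
    """Return collection concept IDs, ignoring blank and '#'-prefixed lines."""
    return [
        line.strip()
        for line in path.read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]


ids = load_collection_ids(Path("tests/integration/popular_collections/POCLOUD.txt"))
```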
2 changes: 1 addition & 1 deletion docs/contributing/maintainers-guide.md
@@ -16,7 +16,7 @@ To get permissions, please start by participating on GitHub by answering questions

1. As a maintainer, there is no strict time obligation, as we understand that everyone's ability to commit can fluctuate. However, we do expect maintainers to communicate openly and transparently with the team and the community.

2. As a maintainer, you are expected to uphold a positive and inclusive team culture. This includes following the guidelines outlined in the [Openscapes team culture page](https://openscapes.github.io/series/core-lessons/team-culture.html) and the [recorded psychological safety talk](https://www.youtube.com/watch?v=rzi-qkl8u5M). By doing so, you can help ensure that all team members and contributors feel safe, respected, and valued.
2. As a maintainer, you are expected to uphold a positive team culture. This includes following the guidelines outlined in the [Openscapes team culture page](https://openscapes.github.io/series/core-lessons/team-culture.html) and the [recorded psychological safety talk](https://www.youtube.com/watch?v=rzi-qkl8u5M). By doing so, you can help ensure that all team members and contributors feel safe, respected, and valued.


### Maintainer Processes Beyond Regular Contributing
2 changes: 1 addition & 1 deletion docs/index.md
@@ -2,7 +2,7 @@

`earthaccess` is a python library to **search for**, and **download** or **stream** NASA Earth science data with just a few lines of code.

Open science only reaches its full potential if we have easy-to-use workflows that facilitate research in an inclusive, efficient and reproducible way. Unfortunately —as it stands today— scientists and students alike face a steep learning curve adapting to systems that have grown too complex and end up spending more time on the technicalities of the tools, cloud and NASA APIs than focusing on their important science.
Open science only reaches its full potential if we have easy-to-use workflows that facilitate research in an efficient and reproducible way. Unfortunately —as it stands today— scientists and students alike face a steep learning curve adapting to systems that have grown too complex and end up spending more time on the technicalities of the tools, cloud and NASA APIs than focusing on their important science.

During several workshops organized by [NASA Openscapes](https://nasa-openscapes.github.io/events.html), the need to provide easy-to-use tools to our users became evident. Open science is a collaborative effort; it involves people from different technical backgrounds, and the data analysis to solve the pressing problems we face cannot be limited by the complexity of the underlying systems. Therefore, providing easy access to NASA Earthdata regardless of the data storage location (hosted within or outside of the cloud) is the main motivation behind this Python library.

2 changes: 1 addition & 1 deletion docs/tutorials/restricted-datasets.ipynb
@@ -12,7 +12,7 @@
"## <img src=\"https://logos-world.net/wp-content/uploads/2020/05/NASA-Logo-1959-present.png\" width=\"100px\" align=\"middle\" /> NASA Earthdata API Client 🌍\n",
"\n",
"\n",
"> Note: Before we can use `earthaccess` we need an account with **[NASA EDL](https://urs.earthaccess.nasa.gov/)**\n"
"> Note: Before we can use `earthaccess` we need an account with **[NASA EDL](https://urs.earthdata.nasa.gov/)**\n"
]
},
{
11 changes: 6 additions & 5 deletions earthaccess/dmrpp_zarr.py
@@ -92,8 +92,8 @@ def open_virtual_mfdataset(
    import xarray as xr

    if access == "direct":
        fs = earthaccess.get_s3_filesystem(results=granules[0])
        fs.storage_options["anon"] = False  # type: ignore
        fs = earthaccess.get_s3_filesystem(results=granules)  # type: ignore
        fs.storage_options["anon"] = False
    else:
        fs = earthaccess.get_fsspec_https_session()
    if parallel:
@@ -114,7 +114,7 @@
                filetype="dmrpp",  # type: ignore
                group=group,
                indexes={},
                reader_options={"storage_options": fs.storage_options},  # type: ignore
                reader_options={"storage_options": fs.storage_options},
            )
        )
    if preprocess is not None:
@@ -127,6 +127,7 @@
        vds = xr.combine_nested(vdatasets, **xr_combine_nested_kwargs)
    if load:
        refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
        protocol = "s3" if "s3" in fs.protocol else fs.protocol
        return xr.open_dataset(
            "reference://",
            engine="zarr",
@@ -135,8 +136,8 @@
"consolidated": False,
"storage_options": {
"fo": refs, # codespell:ignore
"remote_protocol": fs.protocol,
"remote_options": fs.storage_options, # type: ignore
"remote_protocol": protocol,
"remote_options": fs.storage_options,
},
},
)
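The new `protocol = "s3" if "s3" in fs.protocol else fs.protocol` line guards against fsspec filesystems that report a sequence of protocol aliases rather than a single string. A minimal sketch of the behavior it normalizes, assuming s3fs is installed (no credentials needed just to inspect the attribute):

```python
# Why the normalization helps: s3fs advertises several protocol aliases,
# while the kerchunk "reference://" mapper's remote_protocol expects a
# single string.
import fsspec

fs = fsspec.filesystem("s3", anon=True)
print(fs.protocol)             # a sequence such as ('s3', 's3a')

protocol = "s3" if "s3" in fs.protocol else fs.protocol
print(protocol)                # 's3', safe to pass as remote_protocol
```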
12 changes: 11 additions & 1 deletion earthaccess/kerchunk.py
@@ -6,6 +6,7 @@
import fsspec.utils
import s3fs

# import ipdb
import earthaccess


@@ -15,12 +16,19 @@ def _get_chunk_metadata(
) -> list[dict]:
    from kerchunk.hdf import SingleHdf5ToZarr

    if not isinstance(granule, earthaccess.DataGranule) and isinstance(granule, dict):
        # WHY: dask serialization is doing something weird; it serializes the granule
        # as a simple dict, so we need to cast it back to a DataGranule to get the
        # nice methods for parsing the data links
        # TODO: ask James what is going on
        granule = earthaccess.DataGranule(granule)

    metadata = []
    access = "direct" if isinstance(fs, s3fs.S3FileSystem) else "indirect"
    # ipdb.set_trace()

    for url in granule.data_links(access=access):
        with fs.open(url) as inf:
            h5chunks = SingleHdf5ToZarr(inf, url)
            h5chunks = SingleHdf5ToZarr(inf, url)  # type: ignore
            m = h5chunks.translate()
            metadata.append(m)

@@ -50,6 +58,8 @@ def consolidate_metadata(

    # Get metadata for each granule
    get_chunk_metadata = dask.delayed(_get_chunk_metadata)  # type: ignore

    # ipdb.set_trace()
    chunks = dask.compute(*[get_chunk_metadata(g, fs) for g in granules])  # type: ignore
    chunks = sum(chunks, start=[])

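The cast back to `DataGranule` above compensates for a serialization round-trip (noted as unexplained in the code) that hands workers a plain dict. A JSON round-trip gives an analogous minimal illustration of why re-wrapping restores the methods; the `Granule` class here is a stand-in, not the real `DataGranule`, and this is not a claim about dask internals:

```python
# Stand-in illustration of the defensive cast in _get_chunk_metadata: a dict
# subclass loses its type (and methods) across a serialization boundary, and
# re-wrapping the plain dict restores them.
import json


class Granule(dict):
    def data_links(self) -> list[str]:
        return [u["URL"] for u in self["umm"]["RelatedUrls"]]


g = Granule({"umm": {"RelatedUrls": [{"URL": "https://example.com/f.h5"}]}})
plain = json.loads(json.dumps(g))  # comes back as a plain dict
assert type(plain) is dict and not hasattr(plain, "data_links")

if not isinstance(plain, Granule) and isinstance(plain, dict):
    plain = Granule(plain)  # cast back, mirroring the committed code
print(plain.data_links())
```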
3 changes: 3 additions & 0 deletions earthaccess/results.py
@@ -257,6 +257,9 @@ def _repr_html_(self) -> str:
        granule_html_repr = _repr_granule_html(self)
        return granule_html_repr

    def __hash__(self) -> int:  # type: ignore[override]
        return hash(self["meta"]["concept-id"])

    def get_s3_credentials_endpoint(self) -> Union[str, None]:
        for link in self["umm"]["RelatedUrls"]:
            if "/s3credentials" in link["URL"]:
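Hashing on the CMR concept-id makes search results usable in sets and as dict keys, so duplicates of the same granule collapse to one entry. A sketch with a stand-in class and a hypothetical concept-id:

```python
# Minimal sketch of what DataGranule.__hash__ enables; Granule is a stand-in
# for the real class and the concept-id is hypothetical.
class Granule(dict):
    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])


a = Granule({"meta": {"concept-id": "G1234567890-POCLOUD"}})
b = Granule({"meta": {"concept-id": "G1234567890-POCLOUD"}})
assert a == b            # dict equality already compares contents
assert len({a, b}) == 1  # with __hash__, sets deduplicate by concept-id
```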
57 changes: 40 additions & 17 deletions earthaccess/store.py
@@ -1,6 +1,6 @@
import datetime
import logging
import shutil
import threading
import traceback
from functools import lru_cache
from itertools import chain
@@ -18,7 +18,7 @@

import earthaccess

from .auth import Auth
from .auth import Auth, SessionWithHeaderRedirection
from .daac import DAAC_TEST_URLS, find_provider
from .results import DataGranule
from .search import DataCollections
@@ -119,6 +119,7 @@ def __init__(self, auth: Any, pre_authorize: bool = False) -> None:
        Parameters:
            auth: Auth instance to download and access data.
        """
        self.thread_locals = threading.local()
        if auth.authenticated is True:
            self.auth = auth
            self._s3_credentials: Dict[
@@ -127,7 +128,7 @@ def __init__(self, auth: Any, pre_authorize: bool = False) -> None:
            oauth_profile = f"https://{auth.system.edl_hostname}/profile"
            # sets the initial URS cookie
            self._requests_cookies: Dict[str, Any] = {}
            self.set_requests_session(oauth_profile)
            self.set_requests_session(oauth_profile, bearer_token=True)
            if pre_authorize:
                # collect cookies from other DAACs
                for url in DAAC_TEST_URLS:
@@ -183,7 +184,7 @@ def _running_in_us_west_2(self) -> bool:
        return False

    def set_requests_session(
        self, url: str, method: str = "get", bearer_token: bool = False
        self, url: str, method: str = "get", bearer_token: bool = True
    ) -> None:
        """Sets up a `requests` session with bearer tokens that are used by CMR.
@@ -324,19 +325,19 @@ def get_fsspec_session(self) -> fsspec.AbstractFileSystem:
        session = fsspec.filesystem("https", client_kwargs=client_kwargs)
        return session

    def get_requests_session(self, bearer_token: bool = True) -> requests.Session:
    def get_requests_session(self) -> SessionWithHeaderRedirection:
        """Returns a requests HTTPS session with bearer tokens that are used by CMR.

        This HTTPS session can be used to download granules if we want to use a direct,
        lower level API.

        Parameters:
            bearer_token: if true, will be used for authenticated queries on CMR

        Returns:
            requests Session
        """
        return self.auth.get_session()
        if hasattr(self, "_http_session"):
            return self._http_session
        else:
            raise AttributeError("The requests session hasn't been set up yet.")

    def open(
        self,
@@ -652,6 +653,27 @@ def _get_granules(
            data_links, local_path, pqdm_kwargs=pqdm_kwargs
        )

    def _clone_session_in_local_thread(
        self, original_session: SessionWithHeaderRedirection
    ) -> None:
        """Clone the original session and store it in the local thread context.

        This method creates a new session that replicates the headers, cookies, and
        authentication settings from the provided original session. The new session
        is stored in thread-local storage.

        Parameters:
            original_session (SessionWithHeaderRedirection): The session to be cloned.

        Returns:
            None
        """
        if not hasattr(self.thread_locals, "local_thread_session"):
            local_thread_session = SessionWithHeaderRedirection()
            local_thread_session.headers.update(original_session.headers)
            local_thread_session.cookies.update(original_session.cookies)
            local_thread_session.auth = original_session.auth
            self.thread_locals.local_thread_session = local_thread_session

    def _download_file(self, url: str, directory: Path) -> str:
        """Download a single file from an on-prem location, a DAAC data center.
@@ -669,17 +691,18 @@ def _download_file(self, url: str, directory: Path) -> str:
        path = directory / Path(local_filename)
        if not path.exists():
            try:
                session = self.auth.get_session()
                with session.get(
                    url,
                    stream=True,
                    allow_redirects=True,
                ) as r:
                original_session = self.get_requests_session()
                # This reuses the auth cookie; we make sure we only authenticate N
                # threads instead of one per file, see #913
                self._clone_session_in_local_thread(original_session)
                session = self.thread_locals.local_thread_session
                with session.get(url, stream=True, allow_redirects=True) as r:
                    r.raise_for_status()
                    with open(path, "wb") as f:
                        # This is to cap memory usage for large files at 1MB per write to disk per thread
                        # Cap memory usage for large files at 1MB per write to disk per thread
                        # https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content
                        shutil.copyfileobj(r.raw, f, length=1024 * 1024)
                        for chunk in r.iter_content(chunk_size=1024 * 1024):
                            f.write(chunk)
            except Exception:
                logger.exception(f"Error while downloading the file {local_filename}")
                raise Exception
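Together, `_clone_session_in_local_thread` and the `_download_file` change implement a one-session-per-thread pattern: each worker clones the authenticated session once and reuses it for every file it downloads, instead of re-authenticating per file (see #913). A self-contained sketch of the pattern, where `requests.Session` stands in for `SessionWithHeaderRedirection` and the URLs and worker count are placeholders:

```python
# Self-contained sketch of per-thread session reuse.
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

thread_locals = threading.local()


def thread_session(original: requests.Session) -> requests.Session:
    """Clone `original` once per thread, then keep reusing the clone."""
    if not hasattr(thread_locals, "session"):
        clone = requests.Session()
        clone.headers.update(original.headers)
        clone.cookies.update(original.cookies)
        clone.auth = original.auth
        thread_locals.session = clone
    return thread_locals.session


def download(original: requests.Session, url: str) -> int:
    session = thread_session(original)  # one clone per thread, reused per file
    with session.get(url, stream=True, allow_redirects=True) as r:
        return r.status_code


authenticated = requests.Session()  # stands in for the EDL-authenticated session
urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(lambda u: download(authenticated, u), urls)))
```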
12 changes: 7 additions & 5 deletions mkdocs.yml
@@ -60,13 +60,15 @@ nav:
- "What is earthaccess?": "index.md"
- "Quick start": "quick-start.md"
- "Work with us":
- "contributing/index.md"
- "Development": "contributing/development.md"
- "Releasing": "contributing/releasing.md"
- "Our meet-ups": "contributing/our-meet-ups.md"
- "contributing/index.md" # << Link target of the parent node
- "Development Guide": "contributing/development.md"
- "Releasing Guide": "contributing/releasing.md"
- "Maintainers Guide": "contributing/maintainers-guide.md"
- "Code of Conduct": "contributing/code-of-conduct.md"
- "Contributing naming convention": "contributing/naming-convention.md"
- "Meet-ups": "contributing/our-meet-ups.md"
- "Topics":
- "Naming conventions": "contributing/naming-convention.md"
- "Integration tests": "contributing/integration-tests.md"
- "Resources": "resources.md"
- USER GUIDE:
- "user_guide/index.md"
2 changes: 1 addition & 1 deletion notebooks/Demo.ipynb
@@ -526,7 +526,7 @@
"\n",
"**CMR** API documentation: https://cmr.earthaccess.nasa.gov/search/site/docs/search/api.html\n",
"\n",
"**EDL** API documentation: https://urs.earthaccess.nasa.gov/\n",
"**EDL** API documentation: https://urs.earthdata.nasa.gov/documentation\n",
"\n",
"NASA OpenScapes: https://nasa-openscapes.github.io/earthaccess-cloud-cookbook/\n",
"\n",
5 changes: 0 additions & 5 deletions noxfile.py
@@ -57,11 +57,6 @@ def integration_tests(session: nox.Session) -> None:
            EARTHDATA_USERNAME=os.environ["EARTHDATA_USERNAME"],
            EARTHDATA_PASSWORD=os.environ["EARTHDATA_PASSWORD"],
        ),
        external=True,
        # NOTE: integration tests are permitted to pass if the failure rate was less
        # than a hardcoded threshold. PyTest will return 99 if there were some
        # failures, but fewer than the threshold. For more details, see:
        # `pytest_sessionfinish` in tests/integration/conftest.py
        success_codes=[0, 99],
    )

