Merge branch 'main' into dmrpp_tutorial_nb

ayushnag authored Jan 27, 2025
2 parents bb38fa1 + fcb0c22 commit e9d9293

Showing 44 changed files with 1,840 additions and 339 deletions.
45 changes: 45 additions & 0 deletions .github/workflows/discussions.yml
@@ -0,0 +1,45 @@
name: Generate Discussion Thread for Hackdays

on:
  workflow_dispatch:

jobs:
  create-discussion-threads:
    runs-on: ubuntu-latest
    permissions:
      discussions: write
      contents: read

    steps:
      - name: Generate the Hackathon title
        run: |
          DATE=$(date --iso-8601 | sed 's|-|/|g')
          echo "DISCUSSION_TITLE=Hackathon $DATE" >> $GITHUB_ENV
      - name: Set the Hackathon description
        run: |
          echo "DISCUSSION_BODY=Reporting out on earthaccess hack days. Use the 'comment' button at the very bottom to send a message. Additionally, consider sending issues and PRs relevant to your work to help make the job of future readers easier. It is okay to duplicate information here! Use the reply feature to have a discussion under any comment. Enjoy!" >> $GITHUB_ENV
      - name: Create Discussions
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPOSITORY_ID: MDEwOlJlcG9zaXRvcnkzOTk4Njc1Mjk=
          CATEGORY_ID: DIC_kwDOF9V-ic4CdYaN
        run: |
          gh api graphql -f query="
          mutation
          {createDiscussion
            (
              input:
              {
                repositoryId: \"${{ env.REPOSITORY_ID }}\",
                categoryId: \"${{ env.CATEGORY_ID }}\",
                body: \"${{ env.DISCUSSION_BODY }}\",
                title: \"${{ env.DISCUSSION_TITLE }}\"
              }
            )
            {
              discussion {id}
            }
          }"
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,12 @@ and this project uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html)

## [Unreleased]

### Fixed

- `earthaccess.download` will let requests automatically decode compressed content
([#887](https://github.com/nsidc/earthaccess/issues/887))
([**@itcarroll**](https://github.com/itcarroll))
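This entry corresponds to the `_download_file` change in `earthaccess/store.py` later in this commit, where `shutil.copyfileobj(r.raw, ...)` is replaced by `r.iter_content(...)`. A minimal sketch of the difference, with a placeholder URL:

```python
# requests transparently decompresses gzip/deflate bodies when the server
# sets Content-Encoding, but only on the decoded interfaces. r.raw is the
# undecoded wire stream; iter_content() yields decoded bytes. The URL here
# is a placeholder, not a real data link.
import requests

r = requests.get("https://example.com/data.nc", stream=True)
for chunk in r.iter_content(chunk_size=1024 * 1024):
    pass  # chunks arrive already decoded, per the fix above
```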

## [v0.12.0] - 2024-11-13

### Changed
7 changes: 2 additions & 5 deletions CODE_OF_CONDUCT.md
@@ -4,17 +4,14 @@

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socioeconomic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
our community a harassment-free experience for everyone.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Using welcoming and fair language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
16 changes: 16 additions & 0 deletions docs/contributing/integration-tests.md
@@ -0,0 +1,16 @@
# Integration tests

## Testing most popular datasets

Some integration tests operate on the most popular collections for each provider in CMR.
Those collection IDs are cached as static data in `tests/integration/popular_collections/`
to give our test suite more stability. The list of most popular collections can be
updated by running a script in the same directory.

Sometimes we find collections with unexpected conditions, like zero granules, and
therefore comment those collections out of the list by prefixing the line with a `#`.

!!! note

Let's consider a CSV format for this data; we may want to, for example, allow
skipping collections with a EULA by representing that as a column.
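As a rough illustration of the `#` convention described above, a test helper might filter a cached list like this; the helper and file name are hypothetical, not part of the test suite:

```python
# Hypothetical helper for reading a cached popular-collections list while
# skipping lines commented out with "#". The file name is illustrative only.
from pathlib import Path


def load_collection_ids(path: Path) -> list[str]:
    """Return collection concept IDs, ignoring blank and '#'-prefixed lines."""
    return [
        line.strip()
        for line in path.read_text().splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]


ids = load_collection_ids(Path("tests/integration/popular_collections/POCLOUD.txt"))
```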
2 changes: 1 addition & 1 deletion docs/contributing/maintainers-guide.md
@@ -16,7 +16,7 @@ To get permissions, please start by participating on GitHub by answering questions

1. As a maintainer, there is no strict time obligation, as we understand that everyone's ability to commit can fluctuate. However, we do expect maintainers to communicate openly and transparently with the team and the community.

2. As a maintainer, you are expected to uphold a positive and inclusive team culture. This includes following the guidelines outlined in the [Openscapes team culture page](https://openscapes.github.io/series/core-lessons/team-culture.html) and the [recorded psychological safety talk](https://www.youtube.com/watch?v=rzi-qkl8u5M). By doing so, you can help ensure that all team members and contributors feel safe, respected, and valued.
2. As a maintainer, you are expected to uphold a positive team culture. This includes following the guidelines outlined in the [Openscapes team culture page](https://openscapes.github.io/series/core-lessons/team-culture.html) and the [recorded psychological safety talk](https://www.youtube.com/watch?v=rzi-qkl8u5M). By doing so, you can help ensure that all team members and contributors feel safe, respected, and valued.


### Maintainer Processes Beyond Regular Contributing
2 changes: 1 addition & 1 deletion docs/index.md
@@ -2,7 +2,7 @@

`earthaccess` is a python library to **search for**, and **download** or **stream** NASA Earth science data with just a few lines of code.

Open science only reaches its full potential if we have easy-to-use workflows that facilitate research in an inclusive, efficient and reproducible way. Unfortunately —as it stands today— scientists and students alike face a steep learning curve adapting to systems that have grown too complex and end up spending more time on the technicalities of the tools, cloud and NASA APIs than focusing on their important science.
Open science only reaches its full potential if we have easy-to-use workflows that facilitate research in an efficient and reproducible way. Unfortunately —as it stands today— scientists and students alike face a steep learning curve adapting to systems that have grown too complex and end up spending more time on the technicalities of the tools, cloud and NASA APIs than focusing on their important science.

During several workshops organized by [NASA Openscapes](https://nasa-openscapes.github.io/events.html), the need to provide easy-to-use tools to our users became evident. Open science is a collaborative effort; it involves people from different technical backgrounds, and the data analysis to solve the pressing problems we face cannot be limited by the complexity of the underlying systems. Therefore, providing easy access to NASA Earthdata regardless of the data storage location (hosted within or outside of the cloud) is the main motivation behind this Python library.

2 changes: 1 addition & 1 deletion docs/tutorials/restricted-datasets.ipynb
@@ -12,7 +12,7 @@
"## <img src=\"https://logos-world.net/wp-content/uploads/2020/05/NASA-Logo-1959-present.png\" width=\"100px\" align=\"middle\" /> NASA Earthdata API Client 🌍\n",
"\n",
"\n",
"> Note: Before we can use `earthaccess` we need an account with **[NASA EDL](https://urs.earthaccess.nasa.gov/)**\n"
"> Note: Before we can use `earthaccess` we need an account with **[NASA EDL](https://urs.earthdata.nasa.gov/)**\n"
]
},
{
11 changes: 6 additions & 5 deletions earthaccess/dmrpp_zarr.py
@@ -92,8 +92,8 @@ def open_virtual_mfdataset(
    import xarray as xr

    if access == "direct":
        fs = earthaccess.get_s3_filesystem(results=granules[0])
        fs.storage_options["anon"] = False  # type: ignore
        fs = earthaccess.get_s3_filesystem(results=granules)  # type: ignore
        fs.storage_options["anon"] = False
    else:
        fs = earthaccess.get_fsspec_https_session()
    if parallel:
@@ -114,7 +114,7 @@
                filetype="dmrpp",  # type: ignore
                group=group,
                indexes={},
                reader_options={"storage_options": fs.storage_options},  # type: ignore
                reader_options={"storage_options": fs.storage_options},
            )
        )
    if preprocess is not None:
@@ -127,6 +127,7 @@
        vds = xr.combine_nested(vdatasets, **xr_combine_nested_kwargs)
    if load:
        refs = vds.virtualize.to_kerchunk(filepath=None, format="dict")
        protocol = "s3" if "s3" in fs.protocol else fs.protocol
        return xr.open_dataset(
            "reference://",
            engine="zarr",
@@ -135,8 +136,8 @@
"consolidated": False,
"storage_options": {
"fo": refs, # codespell:ignore
"remote_protocol": fs.protocol,
"remote_options": fs.storage_options, # type: ignore
"remote_protocol": protocol,
"remote_options": fs.storage_options,
},
},
)
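The new `protocol = "s3" if "s3" in fs.protocol else fs.protocol` line guards against fsspec filesystems that report a sequence of protocol aliases rather than a single string. A minimal sketch of the behavior it normalizes, assuming s3fs is installed (no credentials needed just to inspect the attribute):

```python
# Why the normalization helps: s3fs advertises several protocol aliases,
# while the kerchunk "reference://" mapper's remote_protocol expects a
# single string.
import fsspec

fs = fsspec.filesystem("s3", anon=True)
print(fs.protocol)             # a sequence such as ('s3', 's3a')

protocol = "s3" if "s3" in fs.protocol else fs.protocol
print(protocol)                # 's3', safe to pass as remote_protocol
```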
12 changes: 11 additions & 1 deletion earthaccess/kerchunk.py
@@ -6,6 +6,7 @@
import fsspec.utils
import s3fs

# import ipdb
import earthaccess


@@ -15,12 +16,19 @@ def _get_chunk_metadata(
) -> list[dict]:
    from kerchunk.hdf import SingleHdf5ToZarr

    if not isinstance(granule, earthaccess.DataGranule) and isinstance(granule, dict):
        # WHY: dask serialization is doing something weird; it serializes the granule
        # as a simple dict, so we need to cast it back to a DataGranule to get the
        # nice methods for parsing the data links
        # TODO: ask James what is going on
        granule = earthaccess.DataGranule(granule)

    metadata = []
    access = "direct" if isinstance(fs, s3fs.S3FileSystem) else "indirect"
    # ipdb.set_trace()

    for url in granule.data_links(access=access):
        with fs.open(url) as inf:
            h5chunks = SingleHdf5ToZarr(inf, url)
            h5chunks = SingleHdf5ToZarr(inf, url)  # type: ignore
            m = h5chunks.translate()
            metadata.append(m)

@@ -50,6 +58,8 @@ def consolidate_metadata(

    # Get metadata for each granule
    get_chunk_metadata = dask.delayed(_get_chunk_metadata)  # type: ignore

    # ipdb.set_trace()
    chunks = dask.compute(*[get_chunk_metadata(g, fs) for g in granules])  # type: ignore
    chunks = sum(chunks, start=[])

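The cast back to `DataGranule` above compensates for a serialization round-trip (noted as unexplained in the code) that hands workers a plain dict. A JSON round-trip gives an analogous minimal illustration of why re-wrapping restores the methods; the `Granule` class here is a stand-in, not the real `DataGranule`, and this is not a claim about dask internals:

```python
# Stand-in illustration of the defensive cast in _get_chunk_metadata: a dict
# subclass loses its type (and methods) across a serialization boundary, and
# re-wrapping the plain dict restores them.
import json


class Granule(dict):
    def data_links(self) -> list[str]:
        return [u["URL"] for u in self["umm"]["RelatedUrls"]]


g = Granule({"umm": {"RelatedUrls": [{"URL": "https://example.com/f.h5"}]}})
plain = json.loads(json.dumps(g))  # comes back as a plain dict
assert type(plain) is dict and not hasattr(plain, "data_links")

if not isinstance(plain, Granule) and isinstance(plain, dict):
    plain = Granule(plain)  # cast back, mirroring the committed code
print(plain.data_links())
```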
3 changes: 3 additions & 0 deletions earthaccess/results.py
@@ -257,6 +257,9 @@ def _repr_html_(self) -> str:
        granule_html_repr = _repr_granule_html(self)
        return granule_html_repr

    def __hash__(self) -> int:  # type: ignore[override]
        return hash(self["meta"]["concept-id"])

    def get_s3_credentials_endpoint(self) -> Union[str, None]:
        for link in self["umm"]["RelatedUrls"]:
            if "/s3credentials" in link["URL"]:
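Hashing on the CMR concept-id makes search results usable in sets and as dict keys, so duplicates of the same granule collapse to one entry. A sketch with a stand-in class and a hypothetical concept-id:

```python
# Minimal sketch of what DataGranule.__hash__ enables; Granule is a stand-in
# for the real class and the concept-id is hypothetical.
class Granule(dict):
    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])


a = Granule({"meta": {"concept-id": "G1234567890-POCLOUD"}})
b = Granule({"meta": {"concept-id": "G1234567890-POCLOUD"}})
assert a == b            # dict equality already compares contents
assert len({a, b}) == 1  # with __hash__, sets deduplicate by concept-id
```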
57 changes: 40 additions & 17 deletions earthaccess/store.py
@@ -1,6 +1,6 @@
import datetime
import logging
import shutil
import threading
import traceback
from functools import lru_cache
from itertools import chain
@@ -18,7 +18,7 @@

import earthaccess

from .auth import Auth
from .auth import Auth, SessionWithHeaderRedirection
from .daac import DAAC_TEST_URLS, find_provider
from .results import DataGranule
from .search import DataCollections
@@ -119,6 +119,7 @@ def __init__(self, auth: Any, pre_authorize: bool = False) -> None:
        Parameters:
            auth: Auth instance to download and access data.
        """
        self.thread_locals = threading.local()
        if auth.authenticated is True:
            self.auth = auth
            self._s3_credentials: Dict[
@@ -127,7 +128,7 @@ def __init__(self, auth: Any, pre_authorize: bool = False) -> None:
            oauth_profile = f"https://{auth.system.edl_hostname}/profile"
            # sets the initial URS cookie
            self._requests_cookies: Dict[str, Any] = {}
            self.set_requests_session(oauth_profile)
            self.set_requests_session(oauth_profile, bearer_token=True)
            if pre_authorize:
                # collect cookies from other DAACs
                for url in DAAC_TEST_URLS:
@@ -183,7 +184,7 @@ def _running_in_us_west_2(self) -> bool:
        return False

    def set_requests_session(
        self, url: str, method: str = "get", bearer_token: bool = False
        self, url: str, method: str = "get", bearer_token: bool = True
    ) -> None:
        """Sets up a `requests` session with bearer tokens that are used by CMR.
@@ -324,19 +325,19 @@ def get_fsspec_session(self) -> fsspec.AbstractFileSystem:
        session = fsspec.filesystem("https", client_kwargs=client_kwargs)
        return session

    def get_requests_session(self, bearer_token: bool = True) -> requests.Session:
    def get_requests_session(self) -> SessionWithHeaderRedirection:
        """Returns a requests HTTPS session with bearer tokens that are used by CMR.

        This HTTPS session can be used to download granules if we want to use a direct,
        lower level API.

        Parameters:
            bearer_token: if true, will be used for authenticated queries on CMR

        Returns:
            requests Session
        """
        return self.auth.get_session()
        if hasattr(self, "_http_session"):
            return self._http_session
        else:
            raise AttributeError("The requests session hasn't been set up yet.")

    def open(
        self,
@@ -652,6 +653,27 @@ def _get_granules(
            data_links, local_path, pqdm_kwargs=pqdm_kwargs
        )

    def _clone_session_in_local_thread(
        self, original_session: SessionWithHeaderRedirection
    ) -> None:
        """Clone the original session and store it in the local thread context.

        This method creates a new session that replicates the headers, cookies, and
        authentication settings from the provided original session. The new session
        is stored in thread-local storage.

        Parameters:
            original_session (SessionWithHeaderRedirection): The session to be cloned.

        Returns:
            None
        """
        if not hasattr(self.thread_locals, "local_thread_session"):
            local_thread_session = SessionWithHeaderRedirection()
            local_thread_session.headers.update(original_session.headers)
            local_thread_session.cookies.update(original_session.cookies)
            local_thread_session.auth = original_session.auth
            self.thread_locals.local_thread_session = local_thread_session

    def _download_file(self, url: str, directory: Path) -> str:
        """Download a single file from an on-prem location, a DAAC data center.
@@ -669,17 +691,18 @@ def _download_file(self, url: str, directory: Path) -> str:
        path = directory / Path(local_filename)
        if not path.exists():
            try:
                session = self.auth.get_session()
                with session.get(
                    url,
                    stream=True,
                    allow_redirects=True,
                ) as r:
                original_session = self.get_requests_session()
                # This reuses the auth cookie; we make sure we only authenticate N
                # threads instead of one per file, see #913
                self._clone_session_in_local_thread(original_session)
                session = self.thread_locals.local_thread_session
                with session.get(url, stream=True, allow_redirects=True) as r:
                    r.raise_for_status()
                    with open(path, "wb") as f:
                        # This is to cap memory usage for large files at 1MB per write to disk per thread
                        # Cap memory usage for large files at 1MB per write to disk per thread
                        # https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content
                        shutil.copyfileobj(r.raw, f, length=1024 * 1024)
                        for chunk in r.iter_content(chunk_size=1024 * 1024):
                            f.write(chunk)
            except Exception:
                logger.exception(f"Error while downloading the file {local_filename}")
                raise Exception
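Together, `_clone_session_in_local_thread` and the `_download_file` change implement a one-session-per-thread pattern: each worker clones the authenticated session once and reuses it for every file it downloads, instead of re-authenticating per file (see #913). A self-contained sketch of the pattern, where `requests.Session` stands in for `SessionWithHeaderRedirection` and the URLs and worker count are placeholders:

```python
# Self-contained sketch of per-thread session reuse.
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

thread_locals = threading.local()


def thread_session(original: requests.Session) -> requests.Session:
    """Clone `original` once per thread, then keep reusing the clone."""
    if not hasattr(thread_locals, "session"):
        clone = requests.Session()
        clone.headers.update(original.headers)
        clone.cookies.update(original.cookies)
        clone.auth = original.auth
        thread_locals.session = clone
    return thread_locals.session


def download(original: requests.Session, url: str) -> int:
    session = thread_session(original)  # one clone per thread, reused per file
    with session.get(url, stream=True, allow_redirects=True) as r:
        return r.status_code


authenticated = requests.Session()  # stands in for the EDL-authenticated session
urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(lambda u: download(authenticated, u), urls)))
```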
12 changes: 7 additions & 5 deletions mkdocs.yml
@@ -60,13 +60,15 @@ nav:
- "What is earthaccess?": "index.md"
- "Quick start": "quick-start.md"
- "Work with us":
- "contributing/index.md"
- "Development": "contributing/development.md"
- "Releasing": "contributing/releasing.md"
- "Our meet-ups": "contributing/our-meet-ups.md"
- "contributing/index.md" # << Link target of the parent node
- "Development Guide": "contributing/development.md"
- "Releasing Guide": "contributing/releasing.md"
- "Maintainers Guide": "contributing/maintainers-guide.md"
- "Code of Conduct": "contributing/code-of-conduct.md"
- "Contributing naming convention": "contributing/naming-convention.md"
- "Meet-ups": "contributing/our-meet-ups.md"
- "Topics":
- "Naming conventions": "contributing/naming-convention.md"
- "Integration tests": "contributing/integration-tests.md"
- "Resources": "resources.md"
- USER GUIDE:
- "user_guide/index.md"
2 changes: 1 addition & 1 deletion notebooks/Demo.ipynb
@@ -526,7 +526,7 @@
"\n",
"**CMR** API documentation: https://cmr.earthaccess.nasa.gov/search/site/docs/search/api.html\n",
"\n",
"**EDL** API documentation: https://urs.earthaccess.nasa.gov/\n",
"**EDL** API documentation: https://urs.earthdata.nasa.gov/documentation\n",
"\n",
"NASA OpenScapes: https://nasa-openscapes.github.io/earthaccess-cloud-cookbook/\n",
"\n",
5 changes: 0 additions & 5 deletions noxfile.py
@@ -57,11 +57,6 @@ def integration_tests(session: nox.Session) -> None:
            EARTHDATA_USERNAME=os.environ["EARTHDATA_USERNAME"],
            EARTHDATA_PASSWORD=os.environ["EARTHDATA_PASSWORD"],
        ),
        external=True,
        # NOTE: integration tests are permitted to pass if the failure rate was less
        # than a hardcoded threshold. PyTest will return 99 if there were some
        # failures, but fewer than the threshold. For more details, see:
        # `pytest_sessionfinish` in tests/integration/conftest.py
        success_codes=[0, 99],
    )

