diff --git a/.gitignore b/.gitignore index 09ccd077..c2a663ae 100644 --- a/.gitignore +++ b/.gitignore @@ -131,6 +131,7 @@ venv/ ENV/ env.bak/ venv.bak/ +.idea # Spyder project settings .spyderproject diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 2d962e28..59e9c99f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,16 +1,16 @@ # Contributing to HashStore -:tada: First off, thanks for contributing! :tada: +**🎉 First off, thanks for contributing! 🎉** -- [Types of contributions](#types-of-contributions) -- [Pull Requests](#pull-requests) -- [Development Workflow](#development-workflow) -- [Release process](#release-process) -- [Testing](#testing) -- [Code style](#code-style) -- [Contributor license agreement](#contributor-license-agreement) +- [✨ Types of Contributions](#types-of-contributions) +- [🌳 Pull Requests](#pull-requests) +- [🔀 Development Workflow](#development-workflow) +- [🚀 Release Process](#release-process) +- [🔬 Testing](#testing) +- [🎨 Code Style](#code-style) +- [📄 Contributor License Agreement](#contributor-license-agreement) -## Types of contributions +## ✨ Types of Contributions We welcome all types of contributions, including bug fixes, feature enhancements, bug reports, documentation, graphics, and many others. You might consider contributing by: @@ -29,7 +29,7 @@ made to increase the value of HashStore to the community. We strive to incorporate code, documentation, and other useful contributions quickly and efficiently while maintaining a high-quality software product. -## Pull Requests +## 🌳 Pull Requests We use the pull-request model for contributions. See [GitHub's help on pull-requests](https://help.github.com/articles/about-pull-requests/). In short: @@ -43,7 +43,7 @@ In short: - our team may request changes before we will approve the Pull Request, or we will make them for you - once the code is reviewed, our team will merge in your changes to `develop` for the next planned release -## Development Workflow +## 🔀 Development Workflow Development is managed through the git repository at https://github.com/DataONEorg/hashstore. The repository is organized into several branches, each with a specific purpose. @@ -104,7 +104,7 @@ gitGraph merge develop id: "11" tag: "v1.1.0" ``` -## Release process +## 🚀 Release Process 1. Our release process starts with integration testing in a `develop` branch. Once all changes that are desired in a release are merged into the `develop` branch, we run @@ -115,7 +115,7 @@ reflect the new release and the `develop` branch can be fast-forwarded to sync w start work on the next release. 3. Releases can be downloaded from the [GitHub releases page](https://github.com/DataONEorg/hashstore/releases). -## Testing +## 🔬 Testing **Unit and integration tests**. HashStore has a full suite of `pytest` tests in the `tests` subdirectory. Any new code developed should include a robust set of tests for each public @@ -127,7 +127,7 @@ or merging to `develop`. Tests are automatically run via GitHub Actions. Check the root `README.md` file for this GitHub Actions status badge and make sure it says "Passing": -## Code style +## 🎨 Code Style Code should be written to professional standards to enable clean, well-documented, readable, and maintainable software. While there has been significant variability @@ -135,7 +135,7 @@ in the coding styles applied historically, new contributions should strive for clean code formatting. 
We generally follow PEP8 guidelines for Python code formatting, typically enforced through the `black` code formatting package. -## Contributor license agreement +## 📄 Contributor License Agreement In order to clarify the intellectual property license granted with Contributions from any person or entity, you agree to diff --git a/README.md b/README.md index 0ddaac61..7dd0bd32 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,81 @@ ## HashStore: hash-based object storage for DataONE data packages -- **Author**: Matthew B. Jones, Dou Mok, Jing Tao, Matthew Brooke +Version: 1.1.0 +- DOI: [doi:10.18739/A2ZG6G87Q](https://doi.org/10.18739/A2ZG6G87Q) + +## Contributors + +- **Author**: Dou Mok, Matthew Brooke, Jing Tao, Jeanette Clarke, Ian Nesbitt, Matthew B. Jones - **License**: [Apache 2](http://opensource.org/licenses/Apache-2.0) - [Package source code on GitHub](https://github.com/DataONEorg/hashstore) - [**Submit Bugs and feature requests**](https://github.com/DataONEorg/hashstore/issues) - Contact us: support@dataone.org - [DataONE discussions](https://github.com/DataONEorg/dataone/discussions) -HashStore is a server-side python package implementing a content-based identifier file system for storing and accessing data and metadata for DataONE services. The package is used in DataONE system components that need direct, filesystem-based access to data objects, their system metadata, and extended metadata about the objects. This package is a core component of the [DataONE federation](https://dataone.org), and supports large-scale object storage for a variety of repositories, including the [KNB Data Repository](http://knb.ecoinformatics.org), the [NSF Arctic Data Center](https://arcticdata.io/catalog/), the [DataONE search service](https://search.dataone.org), and other repositories. +## Citation + +Cite this software as: + +> Dou Mok, Matthew Brooke, Jing Tao, Jeanette Clarke, Ian Nesbitt, Matthew B. Jones. 2024. +> HashStore: hash-based object storage for DataONE data packages. Arctic Data Center. +> [doi:10.18739/A2ZG6G87Q](https://doi.org/10.18739/A2ZG6G87Q) + +## Introduction -DataONE in general, and HashStore in particular, are open source, community projects. We [welcome contributions](https://github.com/DataONEorg/hashstore/blob/main/CONTRIBUTING.md) in many forms, including code, graphics, documentation, bug reports, testing, etc. Use the [DataONE discussions](https://github.com/DataONEorg/dataone/discussions) to discuss these contributions with us. +HashStore is a server-side python package that implements a hash-based object storage file system +for storing and accessing data and metadata for DataONE services. The package is used in DataONE +system components that need direct, filesystem-based access to data objects, their system +metadata, and extended metadata about the objects. This package is a core component of the +[DataONE federation](https://dataone.org), and supports large-scale object storage for a variety +of repositories, including the [KNB Data Repository](http://knb.ecoinformatics.org), the [NSF +Arctic Data Center](https://arcticdata.io/catalog/), the [DataONE search service](https://search.dataone.org), and other repositories. +DataONE in general, and HashStore in particular, are open source, community projects. +We [welcome contributions](https://github.com/DataONEorg/hashstore/blob/main/CONTRIBUTING.md) in +many forms, including code, graphics, documentation, bug reports, testing, etc. 
Use +the [DataONE discussions](https://github.com/DataONEorg/dataone/discussions) to discuss these +contributions with us. ## Documentation -Documentation is a work in progress, and can be found on the [Metacat repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout) as part of the storage redesign planning. Future updates will include documentation here as the package matures. +The documentation around HashStore's initial design phase can be found in the [Metacat +repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout) +as part of the storage redesign planning. Future updates will include documentation here as the +package matures. -## Development build +## HashStore Overview -HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org) build tool. +HashStore is a hash-based object storage system that provides persistent file-based storage using +content hashes to de-duplicate data. The system stores data objects, references (refs) and +metadata in their respective directories and utilizes an identifier-based API for interacting +with the store. HashStore storage classes (like `filehashstore`) must implement the HashStore +interface to ensure consistent and expected usage of HashStore. -To install `hashstore` locally, create a virtual environment for python 3.9+, -install poetry, and then install or build the package with `poetry install` or `poetry build`, respectively. +### Public API Methods -To run tests, navigate to the root directory and run `pytest -s`. The test suite contains tests that -take a longer time to run (relating to the storage of large files) - to execute all tests, run -`pytest --run-slow`. To see detailed +- store_object +- tag_object +- store_metadata +- retrieve_object +- retrieve_metadata +- delete_object +- delete_if_invalid_object +- delete_metadata +- get_hex_digest + +For details, please see the HashStore +interface [hashstore.py](https://github.com/DataONEorg/hashstore/blob/main/src/hashstore/hashstore.py) + +### How do I create a HashStore?
-## Usage Example +To create or interact with a HashStore, instantiate a HashStore object with the following set of +properties: -To view more details about the Public API - see 'hashstore.py` interface documentation +- store_path +- store_depth +- store_width +- store_algorithm +- store_metadata_namespace ```py from hashstore import HashStoreFactory @@ -42,64 +88,291 @@ properties = { "store_path": "/path/to/your/store", "store_depth": 3, "store_width": 2, - "store_algorithm": "sha256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", } # Get HashStore from factory -module_name = "hashstore.filehashstore.filehashstore" +module_name = "hashstore.filehashstore" class_name = "FileHashStore" -my_store = factory.get_hashstore(module_name, class_name, properties) +hashstore = hashstore_factory.get_hashstore(module_name, class_name, properties) # Store objects (.../[hashstore_path]/objects/) pid = "j.tao.1700.1" -object = "/path/to/your/object.data" -hash_address = my_store.store_object(pid, object) -object_cid = hash_address.id +object_path = "/path/to/your/object.data" +object_metadata = hashstore.store_object(pid, object_path) +object_cid = object_metadata.cid # Store metadata (.../[hashstore_path]/metadata/) # By default, storing metadata will use the given properties namespace `format_id` pid = "j.tao.1700.1" sysmeta = "/path/to/your/sysmeta/document.xml" -metadata_cid = my_store.store_metadata(pid, sysmeta) -``` +metadata_cid = hashstore.store_metadata(pid, sysmeta) -If you want to store other types of metadata, add an additional `format_id`. -```py +# If you want to store other types of metadata, include a `format_id`. pid = "j.tao.1700.1" metadata = "/path/to/your/metadata/document.json" format_id = "http://custom.metadata.com/json/type/v1.0" -metadata_cid = my_store.store_metadata(pid, metadata, format_id) +metadata_cid_two = hashstore.store_metadata(pid, metadata, format_id) + +# ... ``` +### What does HashStore look like? + +```sh +# Example layout in HashStore with a single file stored along with its metadata and reference files. +# This uses a store depth of 3 (number of nested levels/directories - e.g. '/4d/19/81/' within +# 'objects', see below), with a width of 2 (number of characters used in directory name - e.g. "4d", +# "19" etc.) 
and "SHA-256" as its default store algorithm
+## Notes:
+## - Objects are stored using their content identifier as the file address
+## - The reference file for each pid contains a single cid
+## - The reference file for each cid contains multiple pids each on its own line
+## - There are two metadata docs under the metadata directory for the pid (sysmeta, annotations)
+
+.../metacat/hashstore
+├── hashstore.yaml
+└── objects
+| └── 4d
+| └── 19
+| └── 81
+| └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
+└── metadata
+| └── 0d
+| └── 55
+| └── 55
+| └── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e
+| └── 323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7
+| └── sha256(pid+formatId_annotations)
+└── refs
+ ├── cids
+ | └── 4d
+ | └── 19
+ | └── 81
+ | └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
+ └── pids
+ └── 0d
+ └── 55
+ └── 55
+ └── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e
+```
+
+### Working with objects (store, retrieve, delete)
+
+In HashStore, data objects begin as temporary files while their content identifiers are calculated. Once the hashes for the default algorithm list are generated, objects are stored in their permanent locations using the hash value of the store's configured algorithm, sharded into directories according to the configured depth and width. Lastly, objects are 'tagged' with a given identifier (ex. persistent identifier (pid)). This process produces reference files, which allow objects to be found and retrieved with a given identifier.
+
+- Note 1: An identifier can only be used once
+- Note 2: Each object is stored once and only once using its content identifier (a checksum generated using a hashing algorithm). Clients that attempt to store duplicate objects will receive the expected ObjectMetadata - with HashStore handling the de-duplication process under the hood.
+
+By calling the various interface methods for `store_object`, the calling app/client can validate, store and tag an object simultaneously if the relevant data is available. In the absence of an identifier (ex. persistent identifier (pid)), `store_object` can be called to solely store an object. The client is then expected to call `delete_if_invalid_object` when the relevant metadata is available to confirm that the object is what is expected. To finalize the data-only storage process (to make the object discoverable), the client calls `tag_object`.
In summary, there are two expected paths to store an object:
+
+```py
+import io
+from hashstore import HashStoreFactory
+
+# Instantiate a factory
+hashstore_factory = HashStoreFactory()
+
+# Create a properties dictionary with the required fields
+properties = {
+    "store_path": "/path/to/your/store",
+    "store_depth": 3,
+    "store_width": 2,
+    "store_algorithm": "SHA-256",
+    "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
+}
+
+# Get HashStore from factory
+module_name = "hashstore.filehashstore"
+class_name = "FileHashStore"
+hashstore = hashstore_factory.get_hashstore(module_name, class_name, properties)
+
+additional_algo = "sha224"
+checksum = "sha3_224_checksum_value"
+checksum_algo = "sha3_224"
+obj_size = 123456
+path = "/path/to/dou.test.1"
+input_stream = io.open(path, "rb")
+pid = "dou.test.1"
+# All-in-one process which stores, validates and tags an object
+obj_info_all_in_one = hashstore.store_object(input_stream, pid, additional_algo, checksum,
+                                             checksum_algo, obj_size)
+
+# Manual Process
+# Store object
+obj_info_manual = hashstore.store_object(input_stream)
+# Validate object with expected values when available
+hashstore.delete_if_invalid_object(obj_info_manual, checksum, checksum_algo, obj_size)
+# Tag object, makes the object discoverable (find, retrieve, delete)
+hashstore.tag_object(pid, obj_info_manual.cid)
+```
+
+**How do I retrieve an object if I have the pid?**
+
+- To retrieve an object, call the Public API method `retrieve_object`, which opens a stream to the object if it exists.
+
+**How do I delete an object if I have the pid?**
+
+- To delete an object and all its associated reference files, call the Public API method `delete_object`.
+- Note, `delete_object` and `store_object` are synchronized processes based on a given `pid`. Additionally, `delete_object` further synchronizes with `tag_object` based on a `cid`. Every object is stored once, is unique and shares one cid reference file.
+
+### Working with metadata (store, retrieve, delete)
+
+HashStore's '/metadata' directory holds all metadata for objects stored in HashStore. To differentiate between metadata documents for a given object, HashStore includes the 'format_id' (the format or namespace of the metadata) when generating the address of the metadata document to store (the hash of the 'pid' + 'format_id'). By default, calling `store_metadata` will use HashStore's default metadata namespace as the 'format_id' when storing metadata. Should the calling app wish to store multiple metadata files about an object, the client app is expected to provide a 'format_id' that identifies the metadata type (ex. `store_metadata(stream, pid, format_id)`).
+
+**How do I retrieve a metadata file?**
+
+- To retrieve a metadata document, call the Public API method `retrieve_metadata`, which returns a stream to the metadata file that's been stored with the default metadata namespace if it exists.
+- If there are multiple metadata objects, a 'format_id' must be specified when calling `retrieve_metadata` (ex. `retrieve_metadata(pid, format_id)`)
+
+**How do I delete a metadata file?**
+
+- Like `retrieve_metadata`, call the Public API method `delete_metadata` to delete all metadata documents associated with the given pid.
+- If there are multiple metadata objects and you wish to delete only one type, a 'format_id' must be specified when calling `delete_metadata(pid, format_id)` to ensure the expected metadata object is deleted.
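The retrieval and deletion methods above can be combined into a short workflow. The sketch below is illustrative only: it assumes the `hashstore` instance, `pid` and `format_id` values from the earlier examples, that the object and its metadata documents have already been stored, and that `delete_object` is called with the pid (as the client examples suggest).

```py
# Illustrative sketch only - reuses `hashstore`, `pid` and `format_id` from the examples above.

# Retrieve the object as a stream and read its bytes
obj_stream = hashstore.retrieve_object(pid)
obj_bytes = obj_stream.read()
obj_stream.close()

# Retrieve the metadata document stored under the default metadata namespace (sysmeta)
sysmeta_stream = hashstore.retrieve_metadata(pid)
sysmeta_bytes = sysmeta_stream.read()
sysmeta_stream.close()

# Retrieve a metadata document stored under a custom format_id
metadata_stream = hashstore.retrieve_metadata(pid, format_id)
metadata_bytes = metadata_stream.read()
metadata_stream.close()

# Delete one metadata document by its format_id, then delete the object
# and its associated reference files
hashstore.delete_metadata(pid, format_id)
hashstore.delete_object(pid)
```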
+### What are HashStore reference files?
+
+HashStore assumes that every data object is referenced by its respective identifier. This identifier is then used when storing, retrieving and deleting an object. In order to facilitate this process, we create two types of reference files:
+
+- pid (persistent identifier) reference files
+- cid (content identifier) reference files
+
+These reference files are implemented in HashStore under the hood, with no expectation of modification from the calling app/client. The one and only exception to this process is when the calling client/app does not have an identifier available (i.e. they receive the stream to store the data object first without any metadata, thus calling `store_object(stream)`).
+
+**'pid' Reference Files**
+
+- Pid (persistent identifier) reference files are created when storing an object with an identifier.
+- Pid reference files are located in HashStore's '/refs/pids' directory
+- If an identifier is not available at the time of storing an object, the calling app/client must create this association between a pid and the object it represents by calling `tag_object` separately.
+- Each pid reference file contains a single string that represents the content identifier of the object it references
+- Just as objects are stored once and only once, there is also only one pid reference file for each data object.
+
+**'cid' Reference Files**
+
+- Cid (content identifier) reference files are created at the same time as pid reference files when storing an object with an identifier.
+- Cid reference files are located in HashStore's '/refs/cids' directory
+- A cid reference file is a list of all the pids that reference a cid, delimited by a new line ("\n") character
+
+## Concurrency in HashStore
+
+HashStore is both threading and multiprocessing safe, and by default synchronizes calls to store & delete objects/metadata with Python's threading module. If you wish to use multiprocessing to parallelize your application, declare a global environment variable `USE_MULTIPROCESSING` as `True` before initializing HashStore. This will direct the relevant Public API calls to synchronize using the Python `multiprocessing` module's locks and conditions. See the example below:
+
+```py
+import os
+
+# Set the global environment variable
+os.environ["USE_MULTIPROCESSING"] = "True"
+
+# Check that the global environment variable has been set
+use_multiprocessing = os.getenv("USE_MULTIPROCESSING", "False") == "True"
+```
+
+## Development build
+
+HashStore is a Python package built using the [Python Poetry](https://python-poetry.org) build tool.
+
+To install `hashstore` locally, create a virtual environment for Python 3.9+, install poetry, and then install or build the package with `poetry install` or `poetry build`, respectively. Note, installing `hashstore` with poetry will also make the `hashstore` command available through the command line terminal (see the `HashStore Client` section below for details).
+
+To run tests, navigate to the root directory and run `pytest`. The test suite contains tests that take a longer time to run (relating to the storage of large files) - to execute all tests, run
+`pytest --run-slow`.
+ +## HashStore Client + +Client API Options: + +- `-storeobject` +- `-storemetadata` +- `-retrieveobject` +- `-retrievemetadata` +- `-deleteobject` +- `-deletemetadata` +- `-getchecksum` (get_hex_digest) + How to use HashStore client (command line app) + ```sh -# Step 1: Create a HashStore -$ python './src/hashstore/client.py' /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1" +# Step 0: Install hashstore via poetry to create an executable script +$ poetry install + +# Step 1: Create a HashStore at your desired store path (ex. /var/metacat/hashstore) +$ hashstore /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1" # Get the checksum of a data object -$ python './src/hashstore/client.py' /path/to/store/ -getchecksum -pid=content_identifier -algo=SHA-256 +$ hashstore /path/to/store/ -getchecksum -pid=persistent_identifier -algo=SHA-256 # Store a data object -$ python './src/hashstore/client.py' /path/to/store/ -storeobject -pid=content_identifier -path=/path/to/object +$ hashstore /path/to/store/ -storeobject -pid=persistent_identifier -path=/path/to/object # Store a metadata object -$ python './src/hashstore/client.py' /path/to/store/ -storemetadata -pid=content_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0 +$ hashstore /path/to/store/ -storemetadata -pid=persistent_identifier -path=/path/to/metadata/object -formatid=https://ns.dataone.org/service/types/v2.0#SystemMetadata # Retrieve a data object -$ python './src/hashstore/client.py' /path/to/store/ -retrieveobject -pid=content_identifier +$ hashstore /path/to/store/ -retrieveobject -pid=persistent_identifier # Retrieve a metadata object -$ python './src/hashstore/client.py' /path/to/store/ -retrievemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0 +$ hashstore /path/to/store/ -retrievemetadata -pid=persistent_identifier -formatid=https://ns.dataone.org/service/types/v2.0#SystemMetadata # Delete a data object -$ python './src/hashstore/client.py' /path/to/store/ -deleteobject -pid=content_identifier +$ hashstore /path/to/store/ -deleteobject -pid=persistent_identifier # Delete a metadata file -$ python './src/hashstore/client.py' /path/to/store/ -deletemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0 +$ hashstore /path/to/store/ -deletemetadata -pid=persistent_identifier -formatid=https://ns.dataone.org/service/types/v2.0#SystemMetadata ``` ## License + ``` Copyright [2022] [Regents of the University of California] @@ -117,12 +390,16 @@ limitations under the License. ``` ## Acknowledgements + Work on this package was supported by: - DataONE Network -- Arctic Data Center: NSF-PLR grant #2042102 to M. B. Jones, A. Budden, M. Schildhauer, and J. Dozier +- Arctic Data Center: NSF-PLR grant #2042102 to M. B. Jones, A. Budden, M. Schildhauer, and J. + Dozier -Additional support was provided for collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California. +Additional support was provided for collaboration by the National Center for Ecological Analysis and +Synthesis, a Center funded by the University of California, Santa Barbara, and the State of +California. 
[![DataONE_footer](https://user-images.githubusercontent.com/6643222/162324180-b5cf0f5f-ae7a-4ca6-87c3-9733a2590634.png)](https://dataone.org) diff --git a/poetry.lock b/poetry.lock index 85abf43e..538bc8be 100644 --- a/poetry.lock +++ b/poetry.lock @@ -1,10 +1,9 @@ -# This file is automatically @generated by Poetry 1.4.2 and should not be changed by hand. +# This file is automatically @generated by Poetry 1.8.3 and should not be changed by hand. [[package]] name = "asn1crypto" version = "1.5.1" description = "Fast ASN.1 parser and serializer with definitions for private keys, public keys, certificates, CRL, OCSP, CMS, PKCS#3, PKCS#7, PKCS#8, PKCS#12, PKCS#5, X.509 and TSP" -category = "dev" optional = false python-versions = "*" files = [ @@ -16,7 +15,6 @@ files = [ name = "astroid" version = "2.15.6" description = "An abstract syntax tree for Python with inference support." -category = "dev" optional = false python-versions = ">=3.7.2" files = [ @@ -36,7 +34,6 @@ wrapt = [ name = "black" version = "22.12.0" description = "The uncompromising code formatter." -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -72,7 +69,6 @@ uvloop = ["uvloop (>=0.15.2)"] name = "click" version = "8.1.5" description = "Composable command line interface toolkit" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -87,7 +83,6 @@ colorama = {version = "*", markers = "platform_system == \"Windows\""} name = "colorama" version = "0.4.6" description = "Cross-platform colored terminal text." -category = "dev" optional = false python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,>=2.7" files = [ @@ -99,7 +94,6 @@ files = [ name = "dill" version = "0.3.6" description = "serialize all of python" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -114,7 +108,6 @@ graph = ["objgraph (>=1.7.2)"] name = "exceptiongroup" version = "1.1.2" description = "Backport of PEP 654 (exception groups)" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -129,7 +122,6 @@ test = ["pytest (>=6)"] name = "iniconfig" version = "2.0.0" description = "brain-dead simple config-ini parsing" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -141,7 +133,6 @@ files = [ name = "isort" version = "5.12.0" description = "A Python utility / library to sort Python imports." -category = "dev" optional = false python-versions = ">=3.8.0" files = [ @@ -159,7 +150,6 @@ requirements-deprecated-finder = ["pip-api", "pipreqs"] name = "lazy-object-proxy" version = "1.9.0" description = "A fast and thorough lazy object proxy." -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -205,7 +195,6 @@ files = [ name = "mccabe" version = "0.7.0" description = "McCabe checker, plugin for flake8" -category = "dev" optional = false python-versions = ">=3.6" files = [ @@ -217,7 +206,6 @@ files = [ name = "mypy-extensions" version = "1.0.0" description = "Type system extensions for programs checked with the mypy type checker." 
-category = "dev" optional = false python-versions = ">=3.5" files = [ @@ -229,7 +217,6 @@ files = [ name = "packaging" version = "23.1" description = "Core utilities for Python packages" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -241,7 +228,6 @@ files = [ name = "pathlib" version = "1.0.1" description = "Object-oriented filesystem paths" -category = "main" optional = false python-versions = "*" files = [ @@ -253,7 +239,6 @@ files = [ name = "pathspec" version = "0.11.1" description = "Utility library for gitignore style pattern matching of file paths." -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -265,7 +250,6 @@ files = [ name = "pg8000" version = "1.29.8" description = "PostgreSQL interface library" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -281,7 +265,6 @@ scramp = ">=1.4.3" name = "platformdirs" version = "3.8.1" description = "A small Python package for determining appropriate platform-specific dirs, e.g. a \"user data dir\"." -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -297,7 +280,6 @@ test = ["appdirs (==1.4.4)", "covdefaults (>=2.3)", "pytest (>=7.3.1)", "pytest- name = "pluggy" version = "1.2.0" description = "plugin and hook calling mechanisms for python" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -313,7 +295,6 @@ testing = ["pytest", "pytest-benchmark"] name = "pylint" version = "2.17.4" description = "python code static checker" -category = "dev" optional = false python-versions = ">=3.7.2" files = [ @@ -343,7 +324,6 @@ testutils = ["gitpython (>3)"] name = "pytest" version = "7.4.0" description = "pytest: simple powerful testing with Python" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -366,7 +346,6 @@ testing = ["argcomplete", "attrs (>=19.2.0)", "hypothesis (>=3.56)", "mock", "no name = "python-dateutil" version = "2.8.2" description = "Extensions to the standard Python datetime module" -category = "dev" optional = false python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7" files = [ @@ -381,7 +360,6 @@ six = ">=1.5" name = "pyyaml" version = "6.0" description = "YAML parser and emitter for Python" -category = "main" optional = false python-versions = ">=3.6" files = [ @@ -431,7 +409,6 @@ files = [ name = "scramp" version = "1.4.4" description = "An implementation of the SCRAM protocol." -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -446,7 +423,6 @@ asn1crypto = ">=1.5.1" name = "six" version = "1.16.0" description = "Python 2 and 3 compatibility utilities" -category = "dev" optional = false python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*" files = [ @@ -458,7 +434,6 @@ files = [ name = "tomli" version = "2.0.1" description = "A lil' TOML parser" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -470,7 +445,6 @@ files = [ name = "tomlkit" version = "0.11.8" description = "Style preserving TOML library" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -482,7 +456,6 @@ files = [ name = "typing-extensions" version = "4.7.1" description = "Backported and Experimental Type Hints for Python 3.7+" -category = "dev" optional = false python-versions = ">=3.7" files = [ @@ -494,7 +467,6 @@ files = [ name = "wrapt" version = "1.15.0" description = "Module for decorators, wrappers and monkey patching." 
-category = "dev" optional = false python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,>=2.7" files = [ @@ -577,5 +549,5 @@ files = [ [metadata] lock-version = "2.0" -python-versions = "^3.9" -content-hash = "6eeffad7b4becc9f995e576d3fc5db2a8640bfe60876d254a6b5854ddd0e283a" +python-versions = ">=3.9" +content-hash = "29d95a36557ed6e054de245ce01f8cc49055e3b478d030a891aa3ee57b981245" diff --git a/pyproject.toml b/pyproject.toml index 1c9f80d9..13d2e42b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,20 +1,41 @@ [tool.poetry] name = "hashstore" -version = "1.0.0" -description = "HashStore, a hash-based object store for data packages." -authors = ["Matt Jones ", "Dou Mok "] +version = "1.1.0" +description = "HashStore, an object storage system using content identifiers." +authors = ["Dou Mok ", "Matt Jones ", + "Matthew Brooke", "Jing Tao", "Jeanette Clark", "Ian M. Nesbitt"] readme = "README.md" +keywords = ["filesystem", "object storage", "hashstore", "storage"] +classifiers = [ + "Development Status :: 4 - Beta", + "Intended Audience :: Developers", + "Intended Audience :: Science/Research", + "License :: OSI Approved :: Apache Software License", + "Natural Language :: English", + "Operating System :: OS Independent", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Topic :: System :: Filesystems" +] [tool.poetry.dependencies] -python = "^3.9" -pathlib = "^1.0.1" -pyyaml = "^6.0" +python = ">=3.9" +pathlib = ">=1.0.1" +pyyaml = ">=6.0" + +[tool.poetry_bumpversion.file."src/hashstore/__init__.py"] [tool.poetry.group.dev.dependencies] -pytest = "^7.2.0" -black = "^22.10.0" -pylint = "^2.17.4" -pg8000 = "^1.29.8" +pytest = ">=7.2.0" +black = ">=22.10.0" +pylint = ">=2.17.4" +pg8000 = ">=1.29.8" + +[tool.poetry.scripts] +hashstore = "hashstore.hashstoreclient:main" [build-system] requires = ["poetry-core"] diff --git a/src/hashstore/__init__.py b/src/hashstore/__init__.py index 352bd3d3..be656f5e 100644 --- a/src/hashstore/__init__.py +++ b/src/hashstore/__init__.py @@ -1,20 +1,22 @@ -"""HashStore is a hash-based object store for data packages. It uses -cryptographic hash functions to name files consistently. -HashStore creates a directory where objects and metadata are -stored using a hash value as the name. +"""HashStore is an object storage file system that provides persistent file-based +storage using content identifiers/hashes to de-duplicate data. HashStore is mainly focused on storing DataONE data package contents on a shared file system for simple and fast access by data management processes that function across a cluster environment. 
Some properties: - Data objects are immutable and never change -- Data objects are named using the SHA-256, base64-encoded hash of their contents +- Data objects are named using the base64-encoded hash of their contents (thus, a content-identifier) -- Metadata objects are stored with the formatId, a null character and its contents -- Metadata objects are named using the SHA-256 + formatId, base64-encoded hash of - their persistent identifier (PID) +- Metadata documents for a given identifier are stored in a directory structure + based on the base64-encoded hash of the identifier +- Metadata objects are named using the base64-encoded hash of the given identifier + + its respective format_id/namespace +- The relationships between data objects and metadata are managed with a reference + system. """ -from hashstore.hashstore import HashStore, HashStoreFactory, ObjectMetadata +from hashstore.hashstore import HashStore, HashStoreFactory -__all__ = ("HashStore", "HashStoreFactory", "ObjectMetadata") +__all__ = ("HashStore", "HashStoreFactory") +__version__ = "1.1.0" diff --git a/src/hashstore/filehashstore.py b/src/hashstore/filehashstore.py index 87f652e7..74b9c600 100644 --- a/src/hashstore/filehashstore.py +++ b/src/hashstore/filehashstore.py @@ -1,24 +1,46 @@ """Core module for FileHashStore""" + import atexit import io +import multiprocessing import shutil import threading -import time import hashlib import os import logging +import inspect +import fcntl +import yaml +from typing import List, Dict, Union, Optional, IO, Tuple, Set, Any +from dataclasses import dataclass from pathlib import Path from contextlib import closing from tempfile import NamedTemporaryFile -import yaml -from hashstore import HashStore, ObjectMetadata +from hashstore import HashStore +from hashstore.filehashstore_exceptions import ( + CidRefsContentError, + OrphanPidRefsFileFound, + CidRefsFileNotFound, + HashStoreRefsAlreadyExists, + NonMatchingChecksum, + NonMatchingObjSize, + PidRefsAlreadyExistsError, + PidNotFoundInCidRefsFile, + PidRefsContentError, + PidRefsDoesNotExist, + PidRefsFileNotFound, + RefsFileExistsButCidObjMissing, + UnsupportedAlgorithm, + StoreObjectForPidAlreadyInProgress, + IdentifierNotLocked, +) class FileHashStore(HashStore): - """FileHashStore is a content addressable file manager based on Derrick - Gilland's 'hashfs' library. It supports the storage of objects on disk using - an authority-based identifier's hex digest with a given hash algorithm value - to address files. + """FileHashStore is an object storage system that was extended from Derrick Gilland's + 'hashfs' library. It supports the storage of objects on disk using a content identifier + to address files (data objects are de-duplicated) and provides a content identifier-based + API to interact with a HashStore. FileHashStore initializes using a given properties dictionary containing the required keys (see Args). Upon initialization, FileHashStore verifies the provided @@ -26,13 +48,12 @@ class FileHashStore(HashStore): store path directory. Properties must always be supplied to ensure consistent usage of FileHashStore once configured. - Args: - properties (dict): A python dictionary with the following keys (and values): - store_path (str): Path to the HashStore directory. - store_depth (int): Depth when sharding an object's hex digest. - store_width (int): Width of directories when sharding an object's hex digest. - store_algorithm (str): Hash algorithm used for calculating the object's hex digest. 
- store_metadata_namespace (str): Namespace for the HashStore's system metadata. + :param dict properties: A Python dictionary with the following keys (and values): + - store_path (str): Path to the HashStore directory. + - store_depth (int): Depth when sharding an object's hex digest. + - store_width (int): Width of directories when sharding an object's hex digest. + - store_algorithm (str): Hash algorithm used for calculating the object's hex digest. + - store_metadata_namespace (str): Namespace for the HashStore's system metadata. """ # Property (hashstore configuration) requirements @@ -44,8 +65,8 @@ class FileHashStore(HashStore): "store_metadata_namespace", ] # Permissions settings for writing files and creating directories - fmode = 0o664 - dmode = 0o755 + f_mode = 0o664 + d_mode = 0o755 # The other algorithm list consists of additional algorithms that can be included # for calculating when storing objects, in addition to the default list. other_algo_list = [ @@ -57,14 +78,10 @@ class FileHashStore(HashStore): "blake2b", "blake2s", ] - # Variables to orchestrate thread locking and object store synchronization - time_out_sec = 1 - object_lock = threading.Lock() - metadata_lock = threading.Lock() - object_locked_pids = [] - metadata_locked_pids = [] def __init__(self, properties=None): + self.fhs_logger = logging.getLogger(__name__) + # Now check properties if properties: # Validate properties against existing configuration if present checked_properties = self._validate_properties(properties) @@ -80,128 +97,177 @@ def __init__(self, properties=None): ] # Check to see if a configuration is present in the given store path - self.hashstore_configuration_yaml = prop_store_path + "/hashstore.yaml" + self.hashstore_configuration_yaml = Path(prop_store_path) / "hashstore.yaml" self._verify_hashstore_properties(properties, prop_store_path) # If no exceptions thrown, FileHashStore ready for initialization - logging.debug("FileHashStore - Initializing, properties verified.") - self.root = prop_store_path - if not os.path.exists(self.root): - self.create_path(self.root) + self.fhs_logger.debug("Initializing, properties verified.") + self.root = Path(prop_store_path) self.depth = prop_store_depth self.width = prop_store_width self.sysmeta_ns = prop_store_metadata_namespace # Write 'hashstore.yaml' to store path - if not os.path.exists(self.hashstore_configuration_yaml): + if not os.path.isfile(self.hashstore_configuration_yaml): # pylint: disable=W1201 - logging.debug( - "FileHashStore - HashStore does not exist & configuration file not found." + self.fhs_logger.debug( + "HashStore does not exist & configuration file not found." + " Writing configuration file." ) - self.write_properties(properties) + self._write_properties(properties) # Default algorithm list for FileHashStore based on config file written self._set_default_algorithms() # Complete initialization/instantiation by setting and creating store directories - self.objects = self.root + "/objects" - self.metadata = self.root + "/metadata" + self.objects = self.root / "objects" + self.metadata = self.root / "metadata" + self.refs = self.root / "refs" + self.cids = self.refs / "cids" + self.pids = self.refs / "pids" if not os.path.exists(self.objects): - self.create_path(self.objects + "/tmp") + self._create_path(self.objects / "tmp") if not os.path.exists(self.metadata): - self.create_path(self.metadata + "/tmp") - logging.debug( - "FileHashStore - Initialization success. 
Store root: %s", self.root + self._create_path(self.metadata / "tmp") + if not os.path.exists(self.refs): + self._create_path(self.refs / "tmp") + self._create_path(self.refs / "pids") + self._create_path(self.refs / "cids") + + # Variables to orchestrate parallelization + # Check to see whether a multiprocessing or threading sync lock should be used + self.use_multiprocessing = ( + os.getenv("USE_MULTIPROCESSING", "False") == "True" ) + if self.use_multiprocessing == "True": + # Create multiprocessing synchronization variables + # Synchronization values for object locked pids + self.object_pid_lock_mp = multiprocessing.Lock() + self.object_pid_condition_mp = multiprocessing.Condition( + self.object_pid_lock_mp + ) + self.object_locked_pids_mp = multiprocessing.Manager().list() + # Synchronization values for object locked cids + self.object_cid_lock_mp = multiprocessing.Lock() + self.object_cid_condition_mp = multiprocessing.Condition( + self.object_cid_lock_mp + ) + self.object_locked_cids_mp = multiprocessing.Manager().list() + # Synchronization values for metadata locked documents + self.metadata_lock_mp = multiprocessing.Lock() + self.metadata_condition_mp = multiprocessing.Condition( + self.metadata_lock_mp + ) + self.metadata_locked_docs_mp = multiprocessing.Manager().list() + # Synchronization values for reference locked pids + self.reference_pid_lock_mp = multiprocessing.Lock() + self.reference_pid_condition_mp = multiprocessing.Condition( + self.reference_pid_lock_mp + ) + self.reference_locked_pids_mp = multiprocessing.Manager().list() + else: + # Create threading synchronization variables + # Synchronization values for object locked pids + self.object_pid_lock_th = threading.Lock() + self.object_pid_condition_th = threading.Condition( + self.object_pid_lock_th + ) + self.object_locked_pids_th = [] + # Synchronization values for object locked cids + self.object_cid_lock_th = threading.Lock() + self.object_cid_condition_th = threading.Condition( + self.object_cid_lock_th + ) + self.object_locked_cids_th = [] + # Synchronization values for metadata locked documents + self.metadata_lock_th = threading.Lock() + self.metadata_condition_th = threading.Condition(self.metadata_lock_th) + self.metadata_locked_docs_th = [] + # Synchronization values for reference locked pids + self.reference_pid_lock_th = threading.Lock() + self.reference_pid_condition_th = threading.Condition( + self.metadata_lock_th + ) + self.reference_locked_pids_th = [] + + self.fhs_logger.debug("Initialization success. Store root: %s", self.root) else: # Cannot instantiate or initialize FileHashStore without config - exception_string = ( - "FileHashStore - HashStore properties must be supplied." - + f" Properties: {properties}" + err_msg = ( + "HashStore properties must be supplied." + f" Properties: {properties}" ) - logging.debug(exception_string) - raise ValueError(exception_string) + self.fhs_logger.debug(err_msg) + raise ValueError(err_msg) # Configuration and Related Methods - def load_properties(self): + @staticmethod + def _load_properties( + hashstore_yaml_path: Path, hashstore_required_prop_keys: List[str] + ) -> Dict[str, Union[str, int]]: """Get and return the contents of the current HashStore configuration. - Returns: - hashstore_yaml_dict (dict): HashStore properties with the following keys (and values): - store_depth (int): Depth when sharding an object's hex digest. - store_width (int): Width of directories when sharding an object's hex digest. 
- store_algorithm (str): Hash algorithm used for calculating the object's hex digest. - store_metadata_namespace (str): Namespace for the HashStore's system metadata. + :return: HashStore properties with the following keys (and values): + - store_depth (int): Depth when sharding an object's hex digest. + - store_width (int): Width of directories when sharding an object's hex digest. + - store_algorithm (str): Hash algo used for calculating the object's hex digest. + - store_metadata_namespace (str): Namespace for the HashStore's system metadata. """ - if not os.path.exists(self.hashstore_configuration_yaml): - exception_string = ( - "FileHashStore - load_properties: hashstore.yaml not found" - + " in store root path." - ) - logging.critical(exception_string) - raise FileNotFoundError(exception_string) + if not os.path.isfile(hashstore_yaml_path): + err_msg = "'hashstore.yaml' not found in store root path." + logging.critical(err_msg) + raise FileNotFoundError(err_msg) + # Open file - with open(self.hashstore_configuration_yaml, "r", encoding="utf-8") as file: - yaml_data = yaml.safe_load(file) + with open(hashstore_yaml_path, "r", encoding="utf-8") as hs_yaml_file: + yaml_data = yaml.safe_load(hs_yaml_file) # Get hashstore properties hashstore_yaml_dict = {} - for key in self.property_required_keys: - if key is not "store_path": + for key in hashstore_required_prop_keys: + if key != "store_path": hashstore_yaml_dict[key] = yaml_data[key] - logging.debug( - "FileHashStore - load_properties: Successfully retrieved 'hashstore.yaml' properties." - ) + logging.debug("Successfully retrieved 'hashstore.yaml' properties.") return hashstore_yaml_dict - def write_properties(self, properties): + def _write_properties(self, properties: Dict[str, Union[str, int]]) -> None: """Writes 'hashstore.yaml' to FileHashStore's root directory with the respective properties object supplied. - Args: - properties (dict): A python dictionary with the following keys (and values): - store_depth (int): Depth when sharding an object's hex digest. - store_width (int): Width of directories when sharding an object's hex digest. - store_algorithm (str): Hash algorithm used for calculating the object's hex digest. - store_metadata_namespace (str): Namespace for the HashStore's system metadata. + :param dict properties: A Python dictionary with the following keys (and values): + - store_depth (int): Depth when sharding an object's hex digest. + - store_width (int): Width of directories when sharding an object's hex digest. + - store_algorithm (str): Hash algo used for calculating the object's hex digest. + - store_metadata_namespace (str): Namespace for the HashStore's system metadata. """ # If hashstore.yaml already exists, must throw exception and proceed with caution - if os.path.exists(self.hashstore_configuration_yaml): - exception_string = ( - "FileHashStore - write_properties: configuration file 'hashstore.yaml'" - + " already exists." - ) - logging.error(exception_string) - raise FileExistsError(exception_string) + if os.path.isfile(self.hashstore_configuration_yaml): + err_msg = "Configuration file 'hashstore.yaml' already exists." 
+ logging.error(err_msg) + raise FileExistsError(err_msg) # Validate properties checked_properties = self._validate_properties(properties) # Collect configuration properties from validated & supplied dictionary - ( - _, - store_depth, - store_width, - store_algorithm, - store_metadata_namespace, - ) = [ + (_, store_depth, store_width, store_algorithm, store_metadata_namespace,) = [ checked_properties[property_name] for property_name in self.property_required_keys ] # Standardize algorithm value for cross-language compatibility - checked_store_algorithm = None # Note, this must be declared here because HashStore has not yet been initialized accepted_store_algorithms = ["MD5", "SHA-1", "SHA-256", "SHA-384", "SHA-512"] if store_algorithm in accepted_store_algorithms: checked_store_algorithm = store_algorithm else: - exception_string = ( - f"FileHashStore - write_properties: algorithm supplied ({store_algorithm})" - + " cannot be used as default for HashStore. Must be one of:" - + " MD5, SHA-1, SHA-256, SHA-384, SHA-512 which are DataONE" - + " controlled algorithm values" + err_msg = ( + f"Algorithm supplied ({store_algorithm}) cannot be used as default for" + f" HashStore. Must be one of: {', '.join(accepted_store_algorithms)}" + f" which are DataONE controlled algorithm values" ) - logging.error(exception_string) - raise ValueError(exception_string) + logging.error(err_msg) + raise ValueError(err_msg) + + # If given store path doesn't exist yet, create it. + if not os.path.exists(self.root): + self._create_path(self.root) # .yaml file to write hashstore_configuration_yaml = self._build_hashstore_yaml_string( @@ -213,69 +279,91 @@ def write_properties(self, properties): # Write 'hashstore.yaml' with open( self.hashstore_configuration_yaml, "w", encoding="utf-8" - ) as hashstore_yaml: - hashstore_yaml.write(hashstore_configuration_yaml) + ) as hs_yaml_file: + hs_yaml_file.write(hashstore_configuration_yaml) logging.debug( - "FileHashStore - write_properties: Configuration file written to: %s", - self.hashstore_configuration_yaml, + "Configuration file written to: %s", self.hashstore_configuration_yaml ) return @staticmethod def _build_hashstore_yaml_string( - store_depth, store_width, store_algorithm, store_metadata_namespace - ): + store_depth: int, + store_width: int, + store_algorithm: str, + store_metadata_namespace: str, + ) -> str: """Build a YAML string representing the configuration for a HashStore. - Args: - store_path (str): Path to the HashStore directory. - store_depth (int): Depth when sharding an object's hex digest. - store_width (int): Width of directories when sharding an object's hex digest. - store_algorithm (str): Hash algorithm used for calculating the object's hex digest. - store_metadata_namespace (str): Namespace for the HashStore's system metadata. + :param int store_depth: Depth when sharding an object's hex digest. + :param int store_width: Width of directories when sharding an object's hex digest. + :param str store_algorithm: Hash algorithm used for calculating the object's hex digest. + :param str store_metadata_namespace: Namespace for the HashStore's system metadata. - Returns: - hashstore_configuration_yaml (str): A YAML string representing the configuration for - a HashStore. 
- """ - hashstore_configuration_yaml = f""" - # Default configuration variables for HashStore - - ############### Directory Structure ############### - # Desired amount of directories when sharding an object to form the permanent address - store_depth: {store_depth} # WARNING: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE - # Width of directories created when sharding an object to form the permanent address - store_width: {store_width} # WARNING: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE - # Example: - # Below, objects are shown listed in directories that are 3 levels deep (DIR_DEPTH=3), - # with each directory consisting of 2 characters (DIR_WIDTH=2). - # /var/filehashstore/objects - # ├── 7f - # │ └── 5c - # │ └── c1 - # │ └── 8f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6 - - ############### Format of the Metadata ############### - # The default metadata format - store_metadata_namespace: "{store_metadata_namespace}" - - ############### Hash Algorithms ############### - # Hash algorithm to use when calculating object's hex digest for the permanent address - store_algorithm: "{store_algorithm}" - # Algorithm values supported by python hashlib 3.9.0+ for File Hash Store (FHS) - # The default algorithm list includes the hash algorithms calculated when storing an - # object to disk and returned to the caller after successful storage. - store_default_algo_list: - - "MD5" - - "SHA-1" - - "SHA-256" - - "SHA-384" - - "SHA-512" + :return: A YAML string representing the configuration for a HashStore. """ - return hashstore_configuration_yaml + hashstore_configuration = { + "store_depth": store_depth, + "store_width": store_width, + "store_metadata_namespace": store_metadata_namespace, + "store_algorithm": store_algorithm, + "store_default_algo_list": [ + "MD5", + "SHA-1", + "SHA-256", + "SHA-384", + "SHA-512", + ], + } + + # The tabbing here is intentional otherwise the created .yaml will have extra tabs + hashstore_configuration_comments = f""" +# Default configuration variables for HashStore + +############### HashStore Config Notes ############### +############### Directory Structure ############### +# store_depth +# - Desired amount of directories when sharding an object to form the permanent address +# - **WARNING**: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE +# +# store_width +# - Width of directories created when sharding an object to form the permanent address +# - **WARNING**: DO NOT CHANGE UNLESS SETTING UP NEW HASHSTORE +# +# Example: +# Below, objects are shown listed in directories that are 3 levels deep (DIR_DEPTH=3), +# with each directory consisting of 2 characters (DIR_WIDTH=2). +# /var/filehashstore/objects +# ├── 7f +# │ └── 5c +# │ └── c1 +# │ └── 8f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6 + +############### Format of the Metadata ############### +# store_metadata_namespace +# - The default metadata format (ex. system metadata) + +############### Hash Algorithms ############### +# store_algorithm +# - Hash algorithm to use when calculating object's hex digest for the permanent address +# +# store_default_algo_list +# - Algorithm values supported by python hashlib 3.9.0+ for File Hash Store (FHS) +# - The default algorithm list includes the hash algorithms calculated when storing an +# - object to disk and returned to the caller after successful storage. 
+ +""" + + hashstore_yaml_with_comments = hashstore_configuration_comments + yaml.dump( + hashstore_configuration, sort_keys=False + ) + + return hashstore_yaml_with_comments - def _verify_hashstore_properties(self, properties, prop_store_path): + def _verify_hashstore_properties( + self, properties: Dict[str, Union[str, int]], prop_store_path: str + ) -> None: """Determines whether FileHashStore can instantiate by validating a set of arguments and throwing exceptions. HashStore will not instantiate if an existing configuration file's properties (`hashstore.yaml`) are different from what is supplied - or if an @@ -286,86 +374,102 @@ def _verify_hashstore_properties(self, properties, prop_store_path): look to see if any directories/files exist in the given store path and throw an exception if any file or directory is found. - Args: - properties (dict): HashStore properties - prop_store_path (string): Store path to check + :param dict properties: HashStore properties. + :param str prop_store_path: Store path to check. """ - if os.path.exists(self.hashstore_configuration_yaml): - logging.debug( - "FileHashStore - Config found (hashstore.yaml) at {%s}. Verifying properties.", + if os.path.isfile(self.hashstore_configuration_yaml): + self.fhs_logger.debug( + "Config found (hashstore.yaml) at {%s}. Verifying properties.", self.hashstore_configuration_yaml, ) # If 'hashstore.yaml' is found, verify given properties before init - hashstore_yaml_dict = self.load_properties() + hashstore_yaml_dict = self._load_properties( + self.hashstore_configuration_yaml, self.property_required_keys + ) for key in self.property_required_keys: # 'store_path' is required to init HashStore but not saved in `hashstore.yaml` - if key is not "store_path": + if key != "store_path": supplied_key = properties[key] if key == "store_depth" or key == "store_width": supplied_key = int(properties[key]) if hashstore_yaml_dict[key] != supplied_key: - exception_string = ( - f"FileHashStore - Given properties ({key}: {properties[key]}) does not" - + f" match. HashStore configuration ({key}: {hashstore_yaml_dict[key]})" + err_msg = ( + f"Given properties ({key}: {properties[key]}) does not match." + + f" HashStore configuration ({key}: {hashstore_yaml_dict[key]})" + f" found at: {self.hashstore_configuration_yaml}" ) - logging.critical(exception_string) - raise ValueError(exception_string) + self.fhs_logger.critical(err_msg) + raise ValueError(err_msg) else: if os.path.exists(prop_store_path): # Check if HashStore exists and throw exception if found - if any(Path(prop_store_path).iterdir()): - exception_string = ( - "FileHashStore - HashStore directories and/or objects found at:" - + f" {prop_store_path} but missing configuration file at: " - + self.hashstore_configuration_yaml + subfolders = ["objects", "metadata", "refs"] + if any( + os.path.isdir(os.path.join(prop_store_path, sub)) + for sub in subfolders + ): + err_msg = ( + "Unable to initialize HashStore. `hashstore.yaml` is not present but " + "conflicting HashStore directory exists. Please delete '/objects', " + "'/metadata' and/or '/refs' at the store path or supply a new path." 
) - logging.critical(exception_string) - raise FileNotFoundError(exception_string) + self.fhs_logger.critical(err_msg) + raise RuntimeError(err_msg) - def _validate_properties(self, properties): + def _validate_properties( + self, properties: Dict[str, Union[str, int]] + ) -> Dict[str, Union[str, int]]: """Validate a properties dictionary by checking if it contains all the required keys and non-None values. - Args: - properties (dict): Dictionary containing filehashstore properties. + :param dict properties: Dictionary containing filehashstore properties. - Raises: - KeyError: If key is missing from the required keys. - ValueError: If value is missing for a required key. + :raises KeyError: If key is missing from the required keys. + :raises ValueError: If value is missing for a required key. - Returns: - properties (dict): The given properties object (that has been validated). + :return: The given properties object (that has been validated). """ if not isinstance(properties, dict): - exception_string = ( - "FileHashStore - _validate_properties: Invalid argument -" - + " expected a dictionary." - ) - logging.debug(exception_string) - raise ValueError(exception_string) + err_msg = "Invalid argument expected a dictionary." + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) + + # New dictionary for validated properties + checked_properties = {} for key in self.property_required_keys: if key not in properties: - exception_string = ( - "FileHashStore - _validate_properties: Missing required" - + f" key: {key}." - ) - logging.debug(exception_string) - raise KeyError(exception_string) - if properties.get(key) is None: - exception_string = ( - "FileHashStore - _validate_properties: Value for key:" - + f" {key} is none." - ) - logging.debug(exception_string) - raise ValueError(exception_string) - return properties + err_msg = "Missing required key: {key}." + self.fhs_logger.error(err_msg) + raise KeyError(err_msg) + + value = properties.get(key) + if value is None: + err_msg = "Value for key: {key} is none." + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) + + # Add key and values to checked_properties + if key == "store_depth" or key == "store_width": + # Ensure store depth and width are integers + try: + checked_properties[key] = int(value) + except Exception as err: + err_msg = ( + "Unexpected exception when attempting to ensure store depth and width " + f"are integers. Details: {err}" + ) + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) + else: + checked_properties[key] = value + + return checked_properties def _set_default_algorithms(self): """Set the default algorithms to calculate when storing objects.""" - def lookup_algo(algo): + def lookup_algo(algo_to_translate): """Translate DataONE controlled algorithms to python hashlib values: https://dataoneorg.github.io/api-documentation/apis/Types.html#Types.ChecksumAlgorithm """ @@ -376,17 +480,17 @@ def lookup_algo(algo): "SHA-384": "sha384", "SHA-512": "sha512", } - return dataone_algo_translation[algo] + return dataone_algo_translation[algo_to_translate] - if not os.path.exists(self.hashstore_configuration_yaml): - exception_string = ( - "FileHashStore - set_default_algorithms: hashstore.yaml not found" - + " in store root path." 
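A minimal sketch of the validation pattern introduced in `_validate_properties` above: every required property must be present and non-None, and `store_depth`/`store_width` are coerced to integers so they can drive the sharding math. The key list mirrors what the diff checks but should be treated as illustrative rather than the package's authoritative constant.

```python
from typing import Dict, Union

REQUIRED_KEYS = [
    "store_path",
    "store_depth",
    "store_width",
    "store_algorithm",
    "store_metadata_namespace",
]


def validate_properties(properties: Dict[str, Union[str, int]]) -> Dict[str, Union[str, int]]:
    if not isinstance(properties, dict):
        raise ValueError("Invalid argument: expected a dictionary.")
    checked: Dict[str, Union[str, int]] = {}
    for key in REQUIRED_KEYS:
        if key not in properties:
            raise KeyError(f"Missing required key: {key}.")
        value = properties[key]
        if value is None:
            raise ValueError(f"Value for key: {key} is None.")
        # Depth and width must be integers; everything else passes through unchanged
        checked[key] = int(value) if key in ("store_depth", "store_width") else value
    return checked


props = validate_properties(
    {
        "store_path": "/var/filehashstore",
        "store_depth": "3",  # strings are accepted and coerced
        "store_width": 2,
        "store_algorithm": "SHA-256",
        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
    }
)
assert props["store_depth"] == 3
```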
- ) - logging.critical(exception_string) - raise FileNotFoundError(exception_string) - with open(self.hashstore_configuration_yaml, "r", encoding="utf-8") as file: - yaml_data = yaml.safe_load(file) + if not os.path.isfile(self.hashstore_configuration_yaml): + err_msg = "hashstore.yaml not found in store root path." + self.fhs_logger.critical(err_msg) + raise FileNotFoundError(err_msg) + + with open( + self.hashstore_configuration_yaml, "r", encoding="utf-8" + ) as hs_yaml_file: + yaml_data = yaml.safe_load(hs_yaml_file) # Set default store algorithm self.algorithm = lookup_algo(yaml_data["store_algorithm"]) @@ -404,265 +508,629 @@ def lookup_algo(algo): def store_object( self, - pid, - data, - additional_algorithm=None, - checksum=None, - checksum_algorithm=None, - expected_object_size=None, - ): - logging.debug( - "FileHashStore - store_object: Request to store object for pid: %s", pid - ) - # Validate input parameters - self._is_string_none_or_empty(pid, "pid", "store_object") - self._validate_data_to_store(data) - self._validate_file_size(expected_object_size) - ( - additional_algorithm_checked, - checksum_algorithm_checked, - ) = self._validate_algorithms_and_checksum( - additional_algorithm, checksum, checksum_algorithm - ) - - # Wait for the pid to release if it's in use - while pid in self.object_locked_pids: - logging.debug( - "FileHashStore - store_object: %s is currently being stored. Waiting.", - pid, - ) - time.sleep(self.time_out_sec) - # Modify object_locked_pids consecutively - with self.object_lock: - logging.debug( - "FileHashStore - store_object: Adding pid: %s to object_locked_pids.", - pid, - ) - self.object_locked_pids.append(pid) - try: - logging.debug( - "FileHashStore - store_object: Attempting to store object for pid: %s", - pid, + pid: Optional[str] = None, + data: Optional[Union[str, bytes]] = None, + additional_algorithm: Optional[str] = None, + checksum: Optional[str] = None, + checksum_algorithm: Optional[str] = None, + expected_object_size: Optional[int] = None, + ) -> "ObjectMetadata": + if pid is None and self._check_arg_data(data): + # If no pid is supplied, store the object only without tagging + logging.debug("Request to store data only received.") + object_metadata = self._store_data_only(data) + self.fhs_logger.info( + "Successfully stored object for cid: %s", object_metadata.cid ) - object_metadata = self.put_object( - pid, - data, - additional_algorithm=additional_algorithm_checked, - checksum=checksum, - checksum_algorithm=checksum_algorithm_checked, - file_size_to_validate=expected_object_size, + else: + # Else the object will be stored and tagged + self.fhs_logger.debug("Request to store object for pid: %s", pid) + # Validate input parameters + self._check_string(pid, "pid") + self._check_arg_data(data) + self._check_integer(expected_object_size) + ( + additional_algorithm_checked, + checksum_algorithm_checked, + ) = self._check_arg_algorithms_and_checksum( + additional_algorithm, checksum, checksum_algorithm ) - finally: - # Release pid - with self.object_lock: - logging.debug( - "FileHashStore - store_object: Removing pid: %s from object_locked_pids.", - pid, + + try: + err_msg = ( + f"Duplicate object request for pid: {pid}. Already in progress." 
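The `lookup_algo` helper above maps DataONE's controlled checksum-algorithm names to the identifiers `hashlib` expects, and the translated name becomes the store's working algorithm. A standalone sketch of that translation and how the resulting name is typically used:

```python
import hashlib

# DataONE controlled vocabulary -> python hashlib names (mirrors the table above)
DATAONE_TO_HASHLIB = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}

store_algorithm = "SHA-256"  # value read from hashstore.yaml
hashlib_name = DATAONE_TO_HASHLIB[store_algorithm]

digest = hashlib.new(hashlib_name, b"hello hashstore").hexdigest()
print(hashlib_name, digest)
```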
) - self.object_locked_pids.remove(pid) - logging.info( - "FileHashStore - store_object: Successfully stored object for pid: %s", - pid, - ) + if self.use_multiprocessing: + with self.object_pid_condition_mp: + # Raise exception immediately if pid is in use + if pid in self.object_locked_pids_mp: + self.fhs_logger.error(err_msg) + raise StoreObjectForPidAlreadyInProgress(err_msg) + else: + with self.object_pid_condition_th: + if pid in self.object_locked_pids_th: + logging.error(err_msg) + raise StoreObjectForPidAlreadyInProgress(err_msg) + + try: + self._synchronize_object_locked_pids(pid) + + self.fhs_logger.debug("Attempting to store object for pid: %s", pid) + object_metadata = self._store_and_validate_data( + pid, + data, + additional_algorithm=additional_algorithm_checked, + checksum=checksum, + checksum_algorithm=checksum_algorithm_checked, + file_size_to_validate=expected_object_size, + ) + self.fhs_logger.debug("Attempting to tag object for pid: %s", pid) + cid = object_metadata.cid + self.tag_object(pid, cid) + self.fhs_logger.info("Successfully stored object for pid: %s", pid) + finally: + # Release pid + self._release_object_locked_pids(pid) + except Exception as err: + err_msg = ( + f"Failed to store object for pid: {pid}. Reference files will not be " + f"created or tagged. Unexpected error: {err})" + ) + self.fhs_logger.error(err_msg) + raise err return object_metadata - def store_metadata(self, pid, metadata, format_id=None): - logging.debug( - "FileHashStore - store_metadata: Request to store metadata for pid: %s", pid - ) - # Validate input parameters - self._is_string_none_or_empty(pid, "pid", "store_metadata") - checked_format_id = self._validate_format_id(format_id, "store_metadata") - self._validate_metadata_to_store(metadata) - - # Wait for the pid to release if it's in use - while pid in self.metadata_locked_pids: - logging.debug( - "FileHashStore - store_metadata: %s is currently being stored. Waiting.", - pid, + def tag_object(self, pid: str, cid: str) -> None: + logging.debug("Tagging object cid: %s with pid: %s.", cid, pid) + self._check_string(pid, "pid") + self._check_string(cid, "cid") + + try: + self._store_hashstore_refs_files(pid, cid) + except HashStoreRefsAlreadyExists as hrae: + err_msg = f"Reference files for pid: {pid} and {cid} already exist. Details: {hrae}" + self.fhs_logger.error(err_msg) + raise HashStoreRefsAlreadyExists(err_msg) + except PidRefsAlreadyExistsError as praee: + err_msg = f"A pid can only reference one cid. Details: {praee}" + self.fhs_logger.error(err_msg) + raise PidRefsAlreadyExistsError(err_msg) + + def delete_if_invalid_object( + self, + object_metadata: "ObjectMetadata", + checksum: str, + checksum_algorithm: str, + expected_file_size: int, + ) -> None: + self._check_string(checksum, "checksum") + self._check_string(checksum_algorithm, "checksum_algorithm") + self._check_integer(expected_file_size) + if object_metadata is None or not isinstance(object_metadata, ObjectMetadata): + err_msg = ( + "'object_metadata' cannot be None. Must be a 'ObjectMetadata' object." 
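`store_object` above rejects a duplicate request for a pid that is already in progress, while other code paths wait on a shared list of locked pids guarded by a `threading.Condition` (with an equivalent multiprocessing variant). A compact sketch of that wait-claim-release synchronization pattern, using hypothetical names rather than the class's attributes:

```python
import threading

object_pid_condition = threading.Condition()
object_locked_pids: list[str] = []


def synchronize_pid(pid: str) -> None:
    """Wait until `pid` is free, then claim it."""
    with object_pid_condition:
        while pid in object_locked_pids:
            object_pid_condition.wait()
        object_locked_pids.append(pid)


def release_pid(pid: str) -> None:
    """Release `pid` and wake any waiting threads."""
    with object_pid_condition:
        object_locked_pids.remove(pid)
        object_pid_condition.notify()


def store(pid: str) -> None:
    synchronize_pid(pid)
    try:
        pass  # store and tag the object here
    finally:
        release_pid(pid)
```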
) - time.sleep(self.time_out_sec) + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) + else: + self.fhs_logger.info( + "Called to verify object with id: %s", object_metadata.cid + ) + object_metadata_hex_digests = object_metadata.hex_digests + object_metadata_file_size = object_metadata.obj_size + checksum_algorithm_checked = self._clean_algorithm(checksum_algorithm) - with self.metadata_lock: - logging.debug( - "FileHashStore - store_metadata: Adding pid: %s to metadata_locked_pids.", - pid, + # Throws exceptions if there's an issue + try: + self._verify_object_information( + pid=None, + checksum=checksum, + checksum_algorithm=checksum_algorithm_checked, + entity="objects", + hex_digests=object_metadata_hex_digests, + tmp_file_name=None, + tmp_file_size=object_metadata_file_size, + file_size_to_validate=expected_file_size, + ) + except NonMatchingObjSize as nmose: + self._delete_object_only(object_metadata.cid) + logging.error(nmose) + raise nmose + except NonMatchingChecksum as mmce: + self._delete_object_only(object_metadata.cid) + raise mmce + self.fhs_logger.info( + "Object has been validated for cid: %s", object_metadata.cid ) - # Modify metadata_locked_pids consecutively - self.metadata_locked_pids.append(pid) + + def store_metadata( + self, pid: str, metadata: Union[str, bytes], format_id: Optional[str] = None + ) -> str: + self.fhs_logger.debug("Request to store metadata for pid: %s", pid) + # Validate input parameters + self._check_string(pid, "pid") + self._check_arg_data(metadata) + checked_format_id = self._check_arg_format_id(format_id, "store_metadata") + pid_doc = self._computehash(pid + checked_format_id) + + sync_begin_debug_msg = ( + f" Adding pid: {pid} to locked list, with format_id: {checked_format_id} with doc " + f"name: {pid_doc}" + ) + sync_wait_msg = ( + f"Pid: {pid} is locked for format_id: {checked_format_id} with doc name: {pid_doc}. " + f"Waiting." 
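`delete_if_invalid_object` above compares a caller-supplied checksum and expected file size against what was recorded when the object was stored, and removes the still-unreferenced object if either does not match. A simplified, standalone sketch of that verification step (the function and exception wording here are illustrative, not the package's `_verify_object_information`):

```python
import hashlib
import os


def verify_object(path: str, expected_checksum: str, algorithm: str, expected_size: int) -> None:
    """Raise ValueError if the stored file's size or checksum differs from what was expected."""
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        raise ValueError(f"Size mismatch: expected {expected_size}, got {actual_size}")

    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    if hasher.hexdigest() != expected_checksum:
        raise ValueError("Checksum mismatch: caller should delete the unreferenced object")
```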
+ ) + if self.use_multiprocessing: + with self.metadata_condition_mp: + # Wait for the pid to release if it's in use + while pid_doc in self.metadata_locked_docs_mp: + self.fhs_logger.debug(sync_wait_msg) + self.metadata_condition_mp.wait() + # Modify metadata_locked_docs consecutively + self.fhs_logger.debug(sync_begin_debug_msg) + self.metadata_locked_docs_mp.append(pid_doc) + else: + with self.metadata_condition_th: + while pid_doc in self.metadata_locked_docs_th: + self.fhs_logger.debug(sync_wait_msg) + self.metadata_condition_th.wait() + self.fhs_logger.debug(sync_begin_debug_msg) + self.metadata_locked_docs_th.append(pid_doc) try: - logging.debug( - "FileHashStore - store_metadata: Attempting to store metadata for pid: %s", - pid, + metadata_cid = self._put_metadata(metadata, pid, pid_doc) + info_msg = ( + f"Successfully stored metadata for pid: {pid} with format_id: " + + checked_format_id ) - metadata_cid = self.put_metadata(metadata, pid, checked_format_id) + self.fhs_logger.info(info_msg) + return str(metadata_cid) finally: # Release pid - with self.metadata_lock: - logging.debug( - "FileHashStore - store_metadata: Removing pid: %s from metadata_locked_pids.", - pid, - ) - self.metadata_locked_pids.remove(pid) - logging.info( - "FileHashStore - store_metadata: Successfully stored metadata for pid: %s", - pid, + end_sync_debug_msg = ( + f"Releasing pid doc ({pid_doc}) from locked list for pid: {pid} with format_id: " + + checked_format_id ) - - return metadata_cid - - def retrieve_object(self, pid): - logging.debug( - "FileHashStore - retrieve_object: Request to retrieve object for pid: %s", - pid, - ) - self._is_string_none_or_empty(pid, "pid", "retrieve_object") - + if self.use_multiprocessing: + with self.metadata_condition_mp: + self.fhs_logger.debug(end_sync_debug_msg) + self.metadata_locked_docs_mp.remove(pid_doc) + self.metadata_condition_mp.notify() + else: + with self.metadata_condition_th: + self.fhs_logger.debug(end_sync_debug_msg) + self.metadata_locked_docs_th.remove(pid_doc) + self.metadata_condition_th.notify() + + def retrieve_object(self, pid: str) -> IO[bytes]: + self.fhs_logger.debug("Request to retrieve object for pid: %s", pid) + self._check_string(pid, "pid") + + object_info_dict = self._find_object(pid) + object_cid = object_info_dict.get("cid") entity = "objects" - object_cid = self.get_sha256_hex_digest(pid) - object_exists = self.exists(entity, object_cid) - if object_exists: - logging.debug( - "FileHashStore - retrieve_object: Metadata exists for pid: %s, retrieving object.", - pid, + if object_cid: + self.fhs_logger.debug( + "Metadata exists for pid: %s, retrieving object.", pid ) - obj_stream = self.open(entity, object_cid) + obj_stream = self._open(entity, object_cid) else: - exception_string = ( - f"FileHashStore - retrieve_object: No object found for pid: {pid}" - ) - logging.error(exception_string) - raise ValueError(exception_string) - logging.info( - "FileHashStore - retrieve_object: Retrieved object for pid: %s", pid - ) + err_msg = f"No object found for pid: {pid}" + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) + self.fhs_logger.info("Retrieved object for pid: %s", pid) return obj_stream - def retrieve_metadata(self, pid, format_id=None): - logging.debug( - "FileHashStore - retrieve_metadata: Request to retrieve metadata for pid: %s", - pid, - ) - self._is_string_none_or_empty(pid, "pid", "retrieve_metadata") - checked_format_id = self._validate_format_id(format_id, "retrieve_metadata") + def retrieve_metadata(self, pid: str, 
format_id: Optional[str] = None) -> IO[bytes]: + self.fhs_logger.debug("Request to retrieve metadata for pid: %s", pid) + self._check_string(pid, "pid") + checked_format_id = self._check_arg_format_id(format_id, "retrieve_metadata") entity = "metadata" - metadata_cid = self.get_sha256_hex_digest(pid + checked_format_id) - metadata_exists = self.exists(entity, metadata_cid) + metadata_directory = self._computehash(pid) + if format_id is None: + metadata_document_name = self._computehash(pid + self.sysmeta_ns) + else: + metadata_document_name = self._computehash(pid + checked_format_id) + metadata_rel_path = ( + Path(*self._shard(metadata_directory)) / metadata_document_name + ) + metadata_exists = self._exists(entity, str(metadata_rel_path)) + if metadata_exists: - metadata_stream = self.open(entity, metadata_cid) + metadata_stream = self._open(entity, str(metadata_rel_path)) + self.fhs_logger.info("Retrieved metadata for pid: %s", pid) + return metadata_stream else: - exception_string = ( - f"FileHashStore - retrieve_metadata: No metadata found for pid: {pid}" - ) - logging.error(exception_string) - raise ValueError(exception_string) + err_msg = f"No metadata found for pid: {pid}" + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) - logging.info( - "FileHashStore - retrieve_metadata: Retrieved metadata for pid: %s", pid - ) - return metadata_stream + def delete_object(self, pid: str) -> None: + self.fhs_logger.debug("Request to delete object for id: %s", pid) + self._check_string(pid, "pid") - def delete_object(self, pid): - logging.debug( - "FileHashStore - delete_object: Request to delete object for pid: %s", pid - ) - self._is_string_none_or_empty(pid, "pid", "delete_object") + objects_to_delete = [] - entity = "objects" - object_cid = self.get_sha256_hex_digest(pid) - self.delete(entity, object_cid) + # Storing and deleting objects are synchronized together + # Duplicate store object requests for a pid are rejected, but deleting an object + # will wait for a pid to be released if it's found to be in use before proceeding. - logging.info( - "FileHashStore - delete_object: Successfully deleted object for pid: %s", - pid, - ) - return True + try: + # Before we begin deletion process, we look for the `cid` by calling + # `find_object` which will throw custom exceptions if there is an issue with + # the reference files, which help us determine the path to proceed with. 
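`retrieve_metadata` (and `store_metadata`) derive the on-disk address of a metadata document from two hashes: the directory is the sharded hash of the pid, and the document name is the hash of the pid concatenated with the format id (or the store's default sysmeta namespace when no format id is given). A sketch of that path computation, assuming SHA-256 and the depth-3/width-2 layout; names and the example identifiers are illustrative:

```python
import hashlib
from pathlib import Path


def computehash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def shard(digest: str, depth: int = 3, width: int = 2) -> list[str]:
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])
    return parts


pid = "doi:10.18739/A2ZG6G87Q"
format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata"

metadata_directory = computehash(pid)                  # which sharded directory to look in
metadata_document_name = computehash(pid + format_id)  # which file inside that directory
metadata_rel_path = Path(*shard(metadata_directory)) / metadata_document_name
print(metadata_rel_path)
```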
+ self._synchronize_object_locked_pids(pid) - def delete_metadata(self, pid, format_id=None): - logging.debug( - "FileHashStore - delete_metadata: Request to delete metadata for pid: %s", - pid, - ) - self._is_string_none_or_empty(pid, "pid", "delete_metadata") - checked_format_id = self._validate_format_id(format_id, "delete_metadata") + try: + object_info_dict = self._find_object(pid) + cid = object_info_dict.get("cid") - entity = "metadata" - metadata_cid = self.get_sha256_hex_digest(pid + checked_format_id) - self.delete(entity, metadata_cid) + # Proceed with next steps - cid has been retrieved without any issues + # We must synchronize here based on the `cid` because multiple threads may + # try to access the `cid_reference_file` + self._synchronize_object_locked_cids(cid) - logging.info( - "FileHashStore - delete_metadata: Successfully deleted metadata for pid: %s", - pid, - ) - return True + try: + cid_ref_abs_path = object_info_dict.get("cid_refs_path") + pid_ref_abs_path = object_info_dict.get("pid_refs_path") + # Add pid refs file to be permanently deleted + objects_to_delete.append( + self._rename_path_for_deletion(pid_ref_abs_path) + ) + # Remove pid from cid reference file + self._update_refs_file(Path(cid_ref_abs_path), pid, "remove") + # Delete cid reference file and object only if the cid refs file is empty + if os.path.getsize(cid_ref_abs_path) == 0: + debug_msg = ( + f"Cid reference file is empty (size == 0): {cid_ref_abs_path} - " + + "deleting cid reference file and data object." + ) + self.fhs_logger.debug(debug_msg) + objects_to_delete.append( + self._rename_path_for_deletion(cid_ref_abs_path) + ) + obj_real_path = object_info_dict.get("cid_object_path") + objects_to_delete.append( + self._rename_path_for_deletion(obj_real_path) + ) + # Remove all files confirmed for deletion + self._delete_marked_files(objects_to_delete) - def get_hex_digest(self, pid, algorithm): - logging.debug( - "FileHashStore - get_hex_digest: Request to get hex digest for object with pid: %s", - pid, - ) - self._is_string_none_or_empty(pid, "pid", "get_hex_digest") - self._is_string_none_or_empty(algorithm, "algorithm", "get_hex_digest") + # Remove metadata files if they exist + self.delete_metadata(pid) - entity = "objects" - algorithm = self.clean_algorithm(algorithm) - object_cid = self.get_sha256_hex_digest(pid) - if not self.exists(entity, object_cid): - exception_string = ( - f"FileHashStore - get_hex_digest: No object found for pid: {pid}" + info_string = ( + f"Successfully deleted references, metadata and object associated" + + f" with pid: {pid}" + ) + self.fhs_logger.info(info_string) + return + + finally: + # Release cid + self._release_object_locked_cids(cid) + + except OrphanPidRefsFileFound: + warn_msg = ( + f"Orphan pid reference file found for pid: {pid}. Skipping object deletion. " + + "Deleting pid reference file and related metadata documents." + ) + self.fhs_logger.warning(warn_msg) + + # Delete pid refs file + pid_ref_abs_path = self._get_hashstore_pid_refs_path(pid) + objects_to_delete.append( + self._rename_path_for_deletion(pid_ref_abs_path) + ) + # Remove metadata files if they exist + self.delete_metadata(pid) + # Remove all files confirmed for deletion + self._delete_marked_files(objects_to_delete) + return + except RefsFileExistsButCidObjMissing: + warn_msg = ( + f"Reference files exist for pid: {pid}, but the data object is missing. " + + "Deleting pid reference file & related metadata documents. Handling cid " + + "reference file." 
+ ) + self.fhs_logger.warning(warn_msg) + + # Add pid refs file to be permanently deleted + pid_ref_abs_path = self._get_hashstore_pid_refs_path(pid) + objects_to_delete.append( + self._rename_path_for_deletion(pid_ref_abs_path) + ) + # Remove pid from cid refs file + pid_refs_cid = self._read_small_file_content(pid_ref_abs_path) + try: + self._synchronize_object_locked_cids(pid_refs_cid) + + cid_ref_abs_path = self._get_hashstore_cid_refs_path(pid_refs_cid) + # Remove if the pid refs is found + if self._is_string_in_refs_file(pid, cid_ref_abs_path): + self._update_refs_file(cid_ref_abs_path, pid, "remove") + finally: + self._release_object_locked_cids(pid_refs_cid) + + # Remove metadata files if they exist + self.delete_metadata(pid) + # Remove all files confirmed for deletion + self._delete_marked_files(objects_to_delete) + return + except PidNotFoundInCidRefsFile: + warn_msg = ( + f"Pid {pid} not found in cid reference file. Deleting pid reference " + + "file and related metadata documents." + ) + self.fhs_logger.warning(warn_msg) + + # Add pid refs file to be permanently deleted + pid_ref_abs_path = self._get_hashstore_pid_refs_path(pid) + objects_to_delete.append( + self._rename_path_for_deletion(pid_ref_abs_path) + ) + # Remove metadata files if they exist + self.delete_metadata(pid) + # Remove all files confirmed for deletion + self._delete_marked_files(objects_to_delete) + return + finally: + # Release pid + self._release_object_locked_pids(pid) + + def delete_metadata(self, pid: str, format_id: Optional[str] = None) -> None: + self.fhs_logger.debug("Request to delete metadata for pid: %s", pid) + self._check_string(pid, "pid") + checked_format_id = self._check_arg_format_id(format_id, "delete_metadata") + metadata_directory = self._computehash(pid) + rel_path = Path(*self._shard(metadata_directory)) + + if format_id is None: + # Delete all metadata documents + objects_to_delete = [] + # Retrieve all metadata doc names + metadata_rel_path = self._get_store_path("metadata") / rel_path + metadata_file_paths = self._get_file_paths(metadata_rel_path) + if metadata_file_paths is not None: + for path in metadata_file_paths: + # Get document name + pid_doc = os.path.basename(path) + # Synchronize based on doc name + # Wait for the pid to release if it's in use + sync_begin_debug_msg = ( + f"Adding pid: {pid} to locked list, with format_id: {checked_format_id} " + + f"with doc name: {pid_doc}" + ) + sync_wait_msg = ( + f"Pid: {pid} is locked for format_id: {checked_format_id} with doc name:" + + f" {pid_doc}. Waiting." 
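The deletion paths above hinge on the cid reference file, which lists every pid that points at a data object, one per line: a pid is removed from it on delete, and the data object itself is only removed once the refs file is empty. A small sketch of that bookkeeping with plain file operations (the helper name and flat file location are hypothetical):

```python
import os


def update_refs_file(refs_path: str, pid: str, mode: str) -> None:
    """Add or remove a single pid line in a cid reference file."""
    lines = []
    if os.path.isfile(refs_path):
        with open(refs_path, "r", encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
    if mode == "add" and pid not in lines:
        lines.append(pid)
    elif mode == "remove":
        lines = [line for line in lines if line != pid]
    with open(refs_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + ("\n" if lines else ""))


update_refs_file("cid_refs_example", "pid-a", "add")
update_refs_file("cid_refs_example", "pid-a", "remove")
if os.path.getsize("cid_refs_example") == 0:
    os.remove("cid_refs_example")  # no pids left: the data object can also be deleted
```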
+ ) + if self.use_multiprocessing: + with self.metadata_condition_mp: + # Wait for the pid to release if it's in use + while pid in self.metadata_locked_docs_mp: + self.fhs_logger.debug(sync_wait_msg) + self.metadata_condition_mp.wait() + # Modify metadata_locked_docs consecutively + self.fhs_logger.debug(sync_begin_debug_msg) + self.metadata_locked_docs_mp.append(pid_doc) + else: + with self.metadata_condition_th: + while pid in self.metadata_locked_docs_th: + self.fhs_logger.debug(sync_wait_msg) + self.metadata_condition_th.wait() + self.fhs_logger.debug(sync_begin_debug_msg) + self.metadata_locked_docs_th.append(pid_doc) + try: + # Mark metadata doc for deletion + objects_to_delete.append(self._rename_path_for_deletion(path)) + finally: + # Release pid + end_sync_debug_msg = ( + f"Releasing pid doc ({pid_doc}) from locked list for pid: {pid} with " + + f"format_id: {checked_format_id}" + ) + if self.use_multiprocessing: + with self.metadata_condition_mp: + self.fhs_logger.debug(end_sync_debug_msg) + self.metadata_locked_docs_mp.remove(pid_doc) + self.metadata_condition_mp.notify() + else: + with self.metadata_condition_th: + self.fhs_logger.debug(end_sync_debug_msg) + self.metadata_locked_docs_th.remove(pid_doc) + self.metadata_condition_th.notify() + + # Delete metadata objects + self._delete_marked_files(objects_to_delete) + info_string = ("Successfully deleted all metadata for pid: {pid}",) + self.fhs_logger.info(info_string) + else: + # Delete a specific metadata file + pid_doc = self._computehash(pid + checked_format_id) + # Wait for the pid to release if it's in use + sync_begin_debug_msg = ( + f"Adding pid: {pid} to locked list, with format_id: {checked_format_id} with doc " + + f"name: {pid_doc}" ) - logging.error(exception_string) - raise ValueError(exception_string) - cid_stream = self.open(entity, object_cid) - hex_digest = self.computehash(cid_stream, algorithm=algorithm) - - info_msg = ( - f"FileHashStore - get_hex_digest: Successfully calculated hex digest for pid: {pid}." - + f" Hex Digest: {hex_digest}", - ) - logging.info(info_msg) + sync_wait_msg = ( + f"Pid: {pid} is locked for format_id: {checked_format_id} with doc name:" + + f" {pid_doc}. Waiting." 
+ ) + if self.use_multiprocessing: + with self.metadata_condition_mp: + # Wait for the pid to release if it's in use + while pid in self.metadata_locked_docs_mp: + self.fhs_logger.debug(sync_wait_msg) + self.metadata_condition_mp.wait() + # Modify metadata_locked_docs consecutively + self.fhs_logger.debug(sync_begin_debug_msg) + self.metadata_locked_docs_mp.append(pid_doc) + else: + with self.metadata_condition_th: + while pid in self.metadata_locked_docs_th: + self.fhs_logger.debug(sync_wait_msg) + self.metadata_condition_th.wait() + self.fhs_logger.debug(sync_begin_debug_msg) + self.metadata_locked_docs_th.append(pid_doc) + try: + full_path_without_directory = Path(self.metadata / rel_path / pid_doc) + self._delete("metadata", full_path_without_directory) + info_string = ( + f"Deleted metadata for pid: {pid} for format_id: {format_id}" + ) + + self.fhs_logger.info(info_string) + finally: + # Release pid + end_sync_debug_msg = ( + f"Releasing pid doc ({pid_doc}) from locked list for pid: {pid} with " + f"format_id: {checked_format_id}" + ) + if self.use_multiprocessing: + with self.metadata_condition_mp: + self.fhs_logger.debug(end_sync_debug_msg) + self.metadata_locked_docs_mp.remove(pid_doc) + self.metadata_condition_mp.notify() + else: + with self.metadata_condition_th: + self.fhs_logger.debug(end_sync_debug_msg) + self.metadata_locked_docs_th.remove(pid_doc) + self.metadata_condition_th.notify() + + def get_hex_digest(self, pid: str, algorithm: str) -> str: + self.fhs_logger.debug("Request to get hex digest for object with pid: %s", pid) + self._check_string(pid, "pid") + self._check_string(algorithm, "algorithm") + + entity = "objects" + algorithm = self._clean_algorithm(algorithm) + object_cid = self._find_object(pid).get("cid") + if not self._exists(entity, object_cid): + err_msg = f"No object found for pid: {pid}" + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) + cid_stream = self._open(entity, object_cid) + hex_digest = self._computehash(cid_stream, algorithm=algorithm) + + info_string = f"Successfully calculated hex digest for pid: {pid}. Hex Digest: {hex_digest}" + logging.info(info_string) return hex_digest # FileHashStore Core Methods - def put_object( + def _find_object(self, pid: str) -> Dict[str, str]: + """Check if an object referenced by a pid exists and retrieve its content identifier. + The `find_object` method validates the existence of an object based on the provided + pid and returns the associated content identifier. + + :param str pid: Authority-based or persistent identifier of the object. 
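`get_hex_digest` above resolves the pid to its content identifier, opens the stored object, and streams it through the requested hash algorithm. The core of that computation, independent of HashStore's internals:

```python
import hashlib
import io
from typing import BinaryIO


def compute_hex_digest(stream: BinaryIO, algorithm: str = "sha256", chunk_size: int = 8192) -> str:
    """Stream a file object through hashlib and return the hex digest."""
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


print(compute_hex_digest(io.BytesIO(b"some object bytes"), "sha256"))
```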
+ + :return: obj_info_dict: + - cid: content identifier + - cid_object_path: path to the object + - cid_refs_path: path to the cid refs file + - pid_refs_path: path to the pid refs file + - sysmeta_path: path to the sysmeta file + """ + self.fhs_logger.debug("Request to find object for for pid: %s", pid) + self._check_string(pid, "pid") + + pid_ref_abs_path = self._get_hashstore_pid_refs_path(pid) + if os.path.isfile(pid_ref_abs_path): + # Read the file to get the cid from the pid reference + pid_refs_cid = self._read_small_file_content(pid_ref_abs_path) + + # Confirm that the cid reference file exists + cid_ref_abs_path = self._get_hashstore_cid_refs_path(pid_refs_cid) + if os.path.isfile(cid_ref_abs_path): + # Check that the pid is actually found in the cid reference file + if self._is_string_in_refs_file(pid, cid_ref_abs_path): + # Object must also exist in order to return the cid retrieved + if not self._exists("objects", pid_refs_cid): + err_msg = ( + f"Reference file found for pid ({pid}) at {pid_ref_abs_path}" + + f", but object referenced does not exist, cid: {pid_refs_cid}" + ) + self.fhs_logger.error(err_msg) + raise RefsFileExistsButCidObjMissing(err_msg) + else: + sysmeta_doc_name = self._computehash(pid + self.sysmeta_ns) + metadata_directory = self._computehash(pid) + metadata_rel_path = Path(*self._shard(metadata_directory)) + sysmeta_full_path = ( + self._get_store_path("metadata") + / metadata_rel_path + / sysmeta_doc_name + ) + obj_info_dict = { + "cid": pid_refs_cid, + "cid_object_path": self._get_hashstore_data_object_path( + pid_refs_cid + ), + "cid_refs_path": cid_ref_abs_path, + "pid_refs_path": pid_ref_abs_path, + "sysmeta_path": ( + sysmeta_full_path + if os.path.isfile(sysmeta_full_path) + else "Does not exist." + ), + } + return obj_info_dict + else: + # If not, it is an orphan pid refs file + err_msg = ( + f"Pid reference file exists with cid: {pid_refs_cid} for pid: {pid} but " + f"is missing from cid refs file: {cid_ref_abs_path}" + ) + self.fhs_logger.error(err_msg) + raise PidNotFoundInCidRefsFile(err_msg) + else: + err_msg = ( + f"Pid reference file exists with cid: {pid_refs_cid} but cid reference file " + + f"not found: {cid_ref_abs_path} for pid: {pid}" + ) + self.fhs_logger.error(err_msg) + raise OrphanPidRefsFileFound(err_msg) + else: + err_msg = ( + f"Pid reference file not found for pid ({pid}): {pid_ref_abs_path}" + ) + self.fhs_logger.error(err_msg) + raise PidRefsDoesNotExist(err_msg) + + def _store_and_validate_data( self, - pid, - file, - extension=None, - additional_algorithm=None, - checksum=None, - checksum_algorithm=None, - file_size_to_validate=None, - ): - """Store contents of `file` on disk using the hash of the given pid - - Args: - pid (string): Authority-based identifier. \n - file (mixed): Readable object or path to file. \n - extension (str, optional): Optional extension to append to file - when saving. \n - additional_algorithm (str, optional): Optional algorithm value to include - when returning hex digests. \n - checksum (str, optional): Optional checksum to validate object - against hex digest before moving to permanent location. \n - checksum_algorithm (str, optional): Algorithm value of given checksum. \n - file_size_to_validate (bytes, optional): Expected size of object - - Returns: - object_metadata (ObjectMetadata): object that contains the object id, - object file size, duplicate file boolean and hex digest dictionary. 
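`_find_object` walks the reference files: the pid refs file stores the cid, the cid refs file must list the pid back, and the data object addressed by the cid must exist; each broken link raises a distinct exception so callers such as `delete_object` can pick the right cleanup path. A simplified sketch of that resolution chain, using plain built-in exceptions and a flat (unsharded) layout for brevity:

```python
import os


def find_object(pid_refs_path: str, cid_refs_dir: str, objects_dir: str, pid: str) -> str:
    """Resolve a pid to its cid, verifying both reference files and the data object."""
    if not os.path.isfile(pid_refs_path):
        raise FileNotFoundError(f"Pid reference file not found for pid: {pid}")

    with open(pid_refs_path, "r", encoding="utf-8") as f:
        cid = f.read().strip()  # pid refs file holds exactly one cid

    cid_refs_path = os.path.join(cid_refs_dir, cid)
    if not os.path.isfile(cid_refs_path):
        raise FileNotFoundError(f"Orphan pid refs file: no cid refs file for cid {cid}")

    with open(cid_refs_path, "r", encoding="utf-8") as f:
        if pid not in (line.strip() for line in f):
            raise ValueError(f"Pid {pid} missing from cid refs file: {cid_refs_path}")

    if not os.path.isfile(os.path.join(objects_dir, cid)):
        raise FileNotFoundError(f"Refs exist but data object is missing for cid {cid}")

    return cid
```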
+ pid: str, + file: Union[str, bytes], + additional_algorithm: Optional[str] = None, + checksum: Optional[str] = None, + checksum_algorithm: Optional[str] = None, + file_size_to_validate: Optional[int] = None, + ) -> "ObjectMetadata": + """Store contents of `file` on disk, validate the object's parameters if provided, + and tag/reference the object. + + :param str pid: Authority-based identifier. + :param mixed file: Readable object or path to file. + :param str additional_algorithm: Optional algorithm value to include when returning + hex digests. + :param str checksum: Optional checksum to validate object against hex digest before moving + to permanent location. + :param str checksum_algorithm: Algorithm value of the given checksum. + :param int file_size_to_validate: Expected size of the object. + + :return: ObjectMetadata - object that contains the object id, object file size, + and hex digest dictionary. """ stream = Stream(file) - logging.debug( - "FileHashStore - put_object: Request to put object for pid: %s", pid - ) + self.fhs_logger.debug("Request to put object for pid: %s", pid) with closing(stream): ( object_cid, @@ -671,199 +1139,229 @@ def put_object( ) = self._move_and_get_checksums( pid, stream, - extension, additional_algorithm, checksum, checksum_algorithm, file_size_to_validate, ) - object_metadata = ObjectMetadata(object_cid, obj_file_size, hex_digest_dict) - logging.debug( - "FileHashStore - put_object: Successfully put object for pid: %s", - pid, + object_metadata = ObjectMetadata( + pid, object_cid, obj_file_size, hex_digest_dict ) + self.fhs_logger.debug("Successfully put object for pid: %s", pid) return object_metadata + def _store_data_only(self, data: Union[str, bytes]) -> "ObjectMetadata": + """Store an object to HashStore and return a metadata object containing the content + identifier, object file size and hex digests dictionary of the default algorithms. This + method does not validate the object and writes directly to `/objects` after the hex + digests are calculated. + + :param mixed data: String or path to object. + + :raises IOError: If the object fails to store. + :raises FileExistsError: If the file already exists. + + :return: ObjectMetadata - object that contains the object ID, object file + size, and hex digest dictionary. + """ + self.fhs_logger.debug("Request to store data object only.") + + try: + # Ensure the data is a stream + stream = Stream(data) + + # Get the hex digest dictionary + with closing(stream): + ( + object_cid, + obj_file_size, + hex_digest_dict, + ) = self._move_and_get_checksums(None, stream) + + object_metadata = ObjectMetadata( + "HashStoreNoPid", + object_cid, + obj_file_size, + hex_digest_dict, + ) + # The permanent address of the data stored is based on the data's checksum + cid = hex_digest_dict.get(self.algorithm) + self.fhs_logger.debug("Successfully stored object with cid: %s", cid) + return object_metadata + # pylint: disable=W0718 + except Exception as err: + err_msg = f"Failed to store object. Unexpected {err=}, {type(err)=}" + self.fhs_logger.error(err_msg) + raise err + def _move_and_get_checksums( self, - pid, - stream, - extension=None, - additional_algorithm=None, - checksum=None, - checksum_algorithm=None, - file_size_to_validate=None, - ): - """Copy the contents of `stream` onto disk with an optional file - extension appended. 
The copy process uses a temporary file to store the - initial contents and returns a dictionary of algorithms and their + pid: Optional[str], + stream: "Stream", + additional_algorithm: Optional[str] = None, + checksum: Optional[str] = None, + checksum_algorithm: Optional[str] = None, + file_size_to_validate: Optional[int] = None, + ) -> Tuple[str, int, Dict[str, str]]: + """Copy the contents of the `Stream` object onto disk. The copy process uses a temporary + file to store the initial contents and returns a dictionary of algorithms and their hex digest values. If the file already exists, the method will immediately - raise an exception. If an algorithm and checksum is provided, it will proceed to - validate the object (and delete the tmpFile if the hex digest stored does + raise an exception. If an algorithm and checksum are provided, it will proceed to + validate the object (and delete the temporary file created if the hex digest stored does not match what is provided). - Args: - pid (string): authority-based identifier. \n - stream (io.BufferedReader): object stream. \n - extension (str, optional): Optional extension to append to file - when saving. \n - additional_algorithm (str, optional): Optional algorithm value to include - when returning hex digests. \n - checksum (str, optional): Optional checksum to validate object - against hex digest before moving to permanent location. \n - checksum_algorithm (str, optional): Algorithm value of given checksum. \n - file_size_to_validate (bytes, optional): Expected size of object - - Returns: - object_metadata (tuple): object id, object file size, duplicate file - boolean and hex digest dictionary. - """ - entity = "objects" - object_cid = self.get_sha256_hex_digest(pid) - abs_file_path = self.build_abs_path(entity, object_cid, extension) - - # Only create tmp file to be moved if target destination doesn't exist - if os.path.isfile(abs_file_path): - exception_string = ( - "FileHashStore - _move_and_get_checksums: File already exists" - + f" for pid: {pid} at {abs_file_path}" - ) - logging.error(exception_string) - raise FileExistsError(exception_string) + :param Optional[str] pid: Authority-based identifier. + :param Stream stream: Object stream when saving. + :param str additional_algorithm: Optional algorithm value to include when returning hex + digests. + :param str checksum: Optional checksum to validate the object against hex digest before + moving to the permanent location. + :param str checksum_algorithm: Algorithm value of the given checksum. + :param int file_size_to_validate: Expected size of the object. - # Create temporary file and calculate hex digests - debug_msg = ( - "FileHashStore - _move_and_get_checksums: Creating temp" - + f" file and calculating checksums for pid: {pid}" - ) - logging.debug(debug_msg) - hex_digests, tmp_file_name, tmp_file_size = self._mktmpfile( - stream, additional_algorithm, checksum_algorithm - ) - logging.debug( - "FileHashStore - _move_and_get_checksums: Temp file created: %s", + :return: tuple - Object ID, object file size, and hex digest dictionary. + """ + debug_msg = f"Creating temp file and calculating checksums for pid: {pid}" + self.fhs_logger.debug(debug_msg) + ( + hex_digests, tmp_file_name, + tmp_file_size, + ) = self._write_to_tmp_file_and_get_hex_digests( + stream, additional_algorithm, checksum_algorithm ) + self.fhs_logger.debug("Temp file created: %s", tmp_file_name) - # Only move file if it doesn't exist. 
- # Files are stored once and only once + # Objects are stored with their content identifier based on the store algorithm + object_cid = hex_digests.get(self.algorithm) + abs_file_path = self._build_hashstore_data_object_path(object_cid) + + # Only move file if it doesn't exist. We do not check before we create the tmp + # file and calculate the hex digests because the given checksum could be incorrect. if not os.path.isfile(abs_file_path): - self._validate_object( + # Files are stored once and only once + self._verify_object_information( pid, checksum, checksum_algorithm, - entity, + "objects", hex_digests, tmp_file_name, tmp_file_size, file_size_to_validate, ) - self.create_path(os.path.dirname(abs_file_path)) + self._create_path(Path(os.path.dirname(abs_file_path))) try: - debug_msg = ( - "FileHashStore - _move_and_get_checksums: Moving temp file to permanent" - + f" location: {abs_file_path}", - ) - logging.debug(debug_msg) + debug_msg = f"Moving temp file to permanent location: {abs_file_path}" + self.fhs_logger.debug(debug_msg) shutil.move(tmp_file_name, abs_file_path) except Exception as err: # Revert storage process - exception_string = ( - "FileHashStore - _move_and_get_checksums:" - + f" Unexpected {err=}, {type(err)=}" - ) - logging.error(exception_string) + err_msg = f" Unexpected Error: {err}" + self.fhs_logger.warning(err_msg) if os.path.isfile(abs_file_path): - # Check to see if object has moved successfully before deleting + # Check to see if object exists before determining whether to delete debug_msg = ( - "FileHashStore - _move_and_get_checksums: Permanent file" - + f" found during exception, checking hex digest for pid: {pid}" + f"Permanent file found, checking hex digest for pid: {pid}" ) - logging.debug(debug_msg) + self.fhs_logger.debug(debug_msg) pid_checksum = self.get_hex_digest(pid, self.algorithm) if pid_checksum == hex_digests.get(self.algorithm): # If the checksums match, return and log warning - warning_msg = ( - "FileHashStore - _move_and_get_checksums: File moved" - + f" successfully but unexpected issue encountered: {exception_string}", + err_msg = ( + f"Object exists at: {abs_file_path} but an unexpected issue has been " + + "encountered. Reference files will not be created and/or tagged." ) - logging.warning(warning_msg) - return + self.fhs_logger.warning(err_msg) + raise err else: debug_msg = ( - "FileHashStore - _move_and_get_checksums: Permanent file" - + f" found but with incomplete state, deleting file: {abs_file_path}", + f"Object exists at {abs_file_path} but the pid object checksum " + + "provided does not match what has been calculated. Deleting object. " + + "References will not be created and/or tagged.", ) - logging.debug(debug_msg) - self.delete(entity, abs_file_path) - logging.debug( - "FileHashStore - _move_and_get_checksums: Deleting temporary file: %s", + self.fhs_logger.debug(debug_msg) + self._delete("objects", abs_file_path) + raise err + else: + self.fhs_logger.debug("Deleting temporary file: %s", tmp_file_name) + self._delete("tmp", tmp_file_name) + err_msg = ( + f"Object has not been stored for pid: {pid} - an unexpected error has " + + f"occurred when moving tmp file to: {object_cid}. Reference files will " + + f"not be created and/or tagged. 
Error: {err}" + ) + self.fhs_logger.warning(err_msg) + raise + else: + # If the data object already exists, do not move the file but attempt to verify it + try: + self._verify_object_information( + pid, + checksum, + checksum_algorithm, + "objects", + hex_digests, tmp_file_name, + tmp_file_size, + file_size_to_validate, ) - self.delete(entity, tmp_file_name) + except NonMatchingObjSize as nmose: + # If any exception is thrown during validation, we do not tag. err_msg = ( - "Aborting store_object upload - an unexpected error has occurred when moving" - + f" file to: {object_cid} - Error: {err}" + f"Object already exists for pid: {pid}, deleting temp file. Reference files " + + "will not be created and/or tagged due to an issue with the supplied pid " + + f"object metadata. {str(nmose)}" ) - logging.error("FileHashStore - _move_and_get_checksums: %s", err_msg) - raise - else: - # Else delete temporary file - warning_msg = ( - f"FileHashStore - _move_and_get_checksums: Object exists at: {abs_file_path}," - + " deleting temporary file." - ) - logging.warning(warning_msg) - self.delete(entity, tmp_file_name) + self.fhs_logger.debug(err_msg) + raise NonMatchingObjSize(err_msg) from nmose + except NonMatchingChecksum as nmce: + # If any exception is thrown during validation, we do not tag. + err_msg = ( + f"Object already exists for pid: {pid}, deleting temp file. Reference files " + + "will not be created and/or tagged due to an issue with the supplied pid " + + f"object metadata. {str(nmce)}" + ) + self.fhs_logger.debug(err_msg) + raise NonMatchingChecksum(err_msg) from nmce + finally: + # Ensure that the tmp file has been removed, the data object already exists, so it + # is redundant. No exception is thrown so 'store_object' can proceed to tag object + if os.path.isfile(tmp_file_name): + self._delete("tmp", tmp_file_name) - return (object_cid, tmp_file_size, hex_digests) + return object_cid, tmp_file_size, hex_digests - def _mktmpfile(self, stream, additional_algorithm=None, checksum_algorithm=None): + def _write_to_tmp_file_and_get_hex_digests( + self, + stream: "Stream", + additional_algorithm: Optional[str] = None, + checksum_algorithm: Optional[str] = None, + ) -> Tuple[Dict[str, str], str, int]: """Create a named temporary file from a `Stream` object and return its filename - and a dictionary of its algorithms and hex digests. If an additionak and/or checksum - algorithm is provided, it will add the respective hex digest to the dictionary. - - Args: - stream (io.BufferedReader): Object stream. - algorithm (string): Algorithm of additional hex digest to generate - checksum_algorithm (string): Algorithm of additional checksum algo to generate - - Returns: - hex_digest_dict, tmp.name (tuple pack): - hex_digest_dict (dictionary): Algorithms and their hex digests. - tmp.name: Name of temporary file created and written into. + and a dictionary of its algorithms and hex digests. If an additional and/or checksum + algorithm is provided, it will add the respective hex digest to the dictionary if + it is supported. + + :param Stream stream: Object stream. + :param str additional_algorithm: Algorithm of additional hex digest to generate. + :param str checksum_algorithm: Algorithm of additional checksum algo to generate. + + :return: tuple - hex_digest_dict, tmp.name + - hex_digest_dict (dict): Algorithms and their hex digests. + - tmp.name (str): Name of the temporary file created and written into. 
+ - tmp_file_size (int): Size of the data object """ # Review additional hash object to digest and create new list algorithm_list_to_calculate = self._refine_algorithm_list( additional_algorithm, checksum_algorithm ) + tmp_root_path = self._get_store_path("objects") / "tmp" + tmp = self._mktmpfile(tmp_root_path) - tmp_root_path = self.get_store_path("objects") / "tmp" - # Physically create directory if it doesn't exist - if os.path.exists(tmp_root_path) is False: - self.create_path(tmp_root_path) - tmp = NamedTemporaryFile(dir=tmp_root_path, delete=False) - - # Delete tmp file if python interpreter crashes or thread is interrupted - # when store_object is called - def delete_tmp_file(): - if os.path.exists(tmp.name): - os.remove(tmp.name) - - atexit.register(delete_tmp_file) - - # Ensure tmp file is created with desired permissions - if self.fmode is not None: - oldmask = os.umask(0) - try: - os.chmod(tmp.name, self.fmode) - finally: - os.umask(oldmask) - - logging.debug( - "FileHashStore - _mktempfile: tmp file created: %s, calculating hex digests.", - tmp.name, + self.fhs_logger.debug( + "Tmp file created: %s, calculating hex digests.", tmp.name ) tmp_file_completion_flag = False @@ -875,12 +1373,12 @@ def delete_tmp_file(): # tmp is a file-like object that is already opened for writing by default with tmp as tmp_file: for data in stream: - tmp_file.write(self._to_bytes(data)) + tmp_file.write(self._cast_to_bytes(data)) for hash_algorithm in hash_algorithms: - hash_algorithm.update(self._to_bytes(data)) - logging.debug( - "FileHashStore - _mktempfile: Object stream successfully written to tmp file: %s", - tmp.name, + hash_algorithm.update(self._cast_to_bytes(data)) + + self.fhs_logger.debug( + "Object stream successfully written to tmp file: %s", tmp.name ) hex_digest_list = [ @@ -891,318 +1389,756 @@ def delete_tmp_file(): # Ready for validation and atomic move tmp_file_completion_flag = True - logging.debug("FileHashStore - _mktempfile: Hex digests calculated.") + self.fhs_logger.debug("Hex digests calculated.") return hex_digest_dict, tmp.name, tmp_file_size # pylint: disable=W0718 except Exception as err: - exception_string = ( - f"FileHashStore - _mktempfile: Unexpected {err=}, {type(err)=}" - ) - logging.error(exception_string) + err_msg = f"Unexpected {err=}, {type(err)=}" + self.fhs_logger.error(err_msg) # pylint: disable=W0707,W0719 - raise Exception(exception_string) + raise Exception(err_msg) except KeyboardInterrupt: - exception_string = ( - "FileHashStore - _mktempfile: Keyboard interruption by user." - ) - logging.error(exception_string) - if os.path.exists(tmp.name): + err_msg = "Keyboard interruption by user." + self.fhs_logger.error(err_msg) + if os.path.isfile(tmp.name): os.remove(tmp.name) finally: if not tmp_file_completion_flag: try: - if os.path.exists(tmp.name): + if os.path.isfile(tmp.name): os.remove(tmp.name) # pylint: disable=W0718 except Exception as err: - exception_string = ( - f"FileHashStore - _mktempfile: Unexpected {err=} while attempting to" - + f" delete tmp file: {tmp.name}, {type(err)=}" + err_msg = ( + f"Unexpected {err=} while attempting to delete tmp file: " + + f"{tmp.name}, {type(err)=}" ) - logging.error(exception_string) + self.fhs_logger.error(err_msg) - def put_metadata(self, metadata, pid, format_id): - """Store contents of metadata to `[self.root]/metadata` using the hash of the - given pid and format_id as the permanent address. 
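`_write_to_tmp_file_and_get_hex_digests` streams the incoming object into a temporary file while feeding the same chunks to several `hashlib` objects, so all default digests are produced in a single pass; the digest for the store algorithm then becomes the content identifier and the temp file is moved into place. A self-contained sketch of that single-pass pattern (the algorithm list, chunk size, and target path are illustrative):

```python
import hashlib
import os
import shutil
from tempfile import NamedTemporaryFile


def write_tmp_and_hash(data: bytes, tmp_dir: str, algorithms=("md5", "sha1", "sha256")):
    """Write data to a tmp file while computing several digests in one pass."""
    hashers = {algo: hashlib.new(algo) for algo in algorithms}
    size = 0
    with NamedTemporaryFile(dir=tmp_dir, delete=False) as tmp:
        for i in range(0, len(data), 8192):  # stream in chunks
            chunk = data[i:i + 8192]
            tmp.write(chunk)
            for hasher in hashers.values():
                hasher.update(chunk)
            size += len(chunk)
        tmp_name = tmp.name
    hex_digests = {algo: h.hexdigest() for algo, h in hashers.items()}
    return hex_digests, tmp_name, size


digests, tmp_path, obj_size = write_tmp_and_hash(b"example object", tmp_dir=".")
cid = digests["sha256"]                        # content identifier under the store algorithm
shutil.move(tmp_path, os.path.join(".", cid))  # move to the permanent address once verified
```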
+ def _mktmpfile(self, path: Path) -> IO[bytes]: + """Create a temporary file at the given path ready to be written. - Args: - pid (string): Authority-based identifier. - format_id (string): Metadata format. - metadata (mixed): String or path to metadata document. + :param Path path: Path to the file location. - Returns: - metadata_cid (string): Address of the metadata document. + :return: file object - object with a file-like interface. """ - logging.debug( - "FileHashStore - put_metadata: Request to put metadata for pid: %s", pid - ) - # Create metadata tmp file and write to it - metadata_stream = Stream(metadata) - with closing(metadata_stream): - metadata_tmp = self._mktmpmetadata(metadata_stream) - - # Get target and related paths (permanent location) - metadata_cid = self.get_sha256_hex_digest(pid + format_id) - rel_path = "/".join(self.shard(metadata_cid)) - full_path = self.get_store_path("metadata") / rel_path + # Physically create directory if it doesn't exist + if os.path.exists(path) is False: + self._create_path(path) - # Move metadata to target path - if os.path.exists(metadata_tmp): - try: - parent = full_path.parent - parent.mkdir(parents=True, exist_ok=True) - # Metadata will be replaced if it exists - shutil.move(metadata_tmp, full_path) - logging.debug( - "FileHashStore - put_metadata: Successfully put metadata for pid: %s", - pid, - ) - return metadata_cid - except Exception as err: - exception_string = ( - f"FileHashStore - put_metadata: Unexpected {err=}, {type(err)=}" - ) - logging.error(exception_string) - if os.path.exists(metadata_tmp): - # Remove tmp metadata, calling app must re-upload - logging.debug( - "FileHashStore - put_metadata: Deleting metadata for pid: %s", - pid, - ) - self.metadata.delete(metadata_tmp) - raise - else: - exception_string = ( - f"FileHashStore - put_metadata: Attempt to move metadata for pid: {pid}" - + f", but metadata temp file not found: {metadata_tmp}" - ) - logging.error(exception_string) - raise FileNotFoundError(exception_string) + tmp = NamedTemporaryFile(dir=path, delete=False) - def _mktmpmetadata(self, stream): - """Create a named temporary file with `stream` (metadata) and `format_id`. + # Delete tmp file if python interpreter crashes or thread is interrupted + def delete_tmp_file(): + if os.path.isfile(tmp.name): + os.remove(tmp.name) - Args: - stream (io.BufferedReader): Metadata stream. - format_id (string): Format of metadata. + atexit.register(delete_tmp_file) - Returns: - tmp.name (string): Path/name of temporary file created and written into. - """ - # Create temporary file in .../{store_path}/tmp - tmp_root_path = self.get_store_path("metadata") / "tmp" - # Physically create directory if it doesn't exist - if os.path.exists(tmp_root_path) is False: - self.create_path(tmp_root_path) - - tmp = NamedTemporaryFile(dir=tmp_root_path, delete=False) # Ensure tmp file is created with desired permissions - if self.fmode is not None: - oldmask = os.umask(0) + if self.f_mode is not None: + old_mask = os.umask(0) try: - os.chmod(tmp.name, self.fmode) + os.chmod(tmp.name, self.f_mode) finally: - os.umask(oldmask) + os.umask(old_mask) + return tmp + + def _store_hashstore_refs_files(self, pid: str, cid: str) -> None: + """Create the pid refs file and create/update cid refs files in HashStore to establish + the relationship between a 'pid' and a 'cid'. + + :param str pid: Persistent or authority-based identifier. 
+ :param str cid: Content identifier + """ + try: + self._synchronize_referenced_locked_pids(pid) + self._synchronize_object_locked_cids(cid) + + try: + # Prepare files and paths + tmp_root_path = self._get_store_path("refs") / "tmp" + pid_refs_path = self._get_hashstore_pid_refs_path(pid) + cid_refs_path = self._get_hashstore_cid_refs_path(cid) + # Create paths for pid ref file in '.../refs/pid' and cid ref file in '.../refs/cid' + self._create_path(Path(os.path.dirname(pid_refs_path))) + self._create_path(Path(os.path.dirname(cid_refs_path))) + + if os.path.isfile(pid_refs_path) and os.path.isfile(cid_refs_path): + # If both reference files exist, we confirm that reference files are where they + # are expected to be and throw an exception to inform the client that everything + # is in place - and include other issues for context + err_msg = ( + f"Object with cid: {cid} exists and is tagged with pid: {pid}." + ) + try: + self._verify_hashstore_references( + pid, + cid, + pid_refs_path, + cid_refs_path, + "Refs file already exists, verifying.", + ) + self.fhs_logger.error(err_msg) + raise HashStoreRefsAlreadyExists(err_msg) + except Exception as e: + rev_msg = err_msg + " " + str(e) + self.fhs_logger.error(rev_msg) + raise HashStoreRefsAlreadyExists(err_msg) + + elif os.path.isfile(pid_refs_path) and not os.path.isfile( + cid_refs_path + ): + # If pid refs exists, the pid has already been claimed and cannot be tagged we + # throw an exception immediately + error_msg = f"Pid refs file already exists for pid: {pid}." + self.fhs_logger.error(error_msg) + raise PidRefsAlreadyExistsError(error_msg) + + elif not os.path.isfile(pid_refs_path) and os.path.isfile( + cid_refs_path + ): + debug_msg = ( + f"Pid reference file does not exist for pid {pid} but cid refs file " + + f"found at: {cid_refs_path} for cid: {cid}" + ) + self.fhs_logger.debug(debug_msg) + # Move the pid refs file + pid_tmp_file_path = self._write_refs_file(tmp_root_path, cid, "pid") + shutil.move(pid_tmp_file_path, pid_refs_path) + # Update cid ref files as it already exists + if not self._is_string_in_refs_file(pid, cid_refs_path): + self._update_refs_file(cid_refs_path, pid, "add") + self._verify_hashstore_references( + pid, + cid, + pid_refs_path, + cid_refs_path, + f"Updated existing cid refs file: {cid_refs_path} with pid: {pid}", + ) + info_msg = f"Successfully updated cid: {cid} with pid: {pid}" + self.fhs_logger.info(info_msg) + return + + # Move both files after checking the existing status of refs files + pid_tmp_file_path = self._write_refs_file(tmp_root_path, cid, "pid") + cid_tmp_file_path = self._write_refs_file(tmp_root_path, pid, "cid") + shutil.move(pid_tmp_file_path, pid_refs_path) + shutil.move(cid_tmp_file_path, cid_refs_path) + log_msg = "Refs files have been moved to their permanent location. Verifying refs." + self._verify_hashstore_references( + pid, cid, pid_refs_path, cid_refs_path, log_msg + ) + info_msg = f"Successfully updated cid: {cid} with pid: {pid}" + self.fhs_logger.info(info_msg) + + except ( + HashStoreRefsAlreadyExists, + PidRefsAlreadyExistsError, + ) as expected_exceptions: + raise expected_exceptions + + except Exception as ue: + # For all other unexpected exceptions, we are to revert the tagging process as + # much as possible. No exceptions from the reverting process will be thrown. + err_msg = f"Unexpected exception: {ue}, reverting tagging process (untag obj)." 
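`tag_object` / `_store_hashstore_refs_files` establish the pid-to-cid relationship by writing two small files: a pid refs file whose content is the cid, and a cid refs file listing every pid that claims the object. Both are written to a tmp area first and then moved into place. A minimal sketch of that tagging step; the real store addresses and shards these refs files by hashes of the identifiers, which is flattened here for brevity, and the helper name is hypothetical:

```python
import os
import shutil
from tempfile import NamedTemporaryFile


def tag_object(refs_root: str, pid: str, cid: str) -> None:
    """Write a pid refs file (contains the cid) and append the pid to the cid refs file."""
    pid_refs_path = os.path.join(refs_root, "pids", pid)
    cid_refs_path = os.path.join(refs_root, "cids", cid)
    tmp_dir = os.path.join(refs_root, "tmp")
    for directory in (os.path.dirname(pid_refs_path), os.path.dirname(cid_refs_path), tmp_dir):
        os.makedirs(directory, exist_ok=True)

    # Write the pid refs file via a tmp file, then move it into place
    with NamedTemporaryFile("w", dir=tmp_dir, delete=False, encoding="utf-8") as tmp:
        tmp.write(cid)
    shutil.move(tmp.name, pid_refs_path)

    # Create or update the cid refs file with this pid
    with open(cid_refs_path, "a+", encoding="utf-8") as f:
        f.seek(0)
        if pid not in (line.strip() for line in f):
            f.write(pid + "\n")


tag_object("refs_example", "example-pid", "example-cid")
```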
+ self.fhs_logger.error(err_msg) + self._untag_object(pid, cid) + raise ue + + finally: + # Release cid + self._release_object_locked_cids(cid) + self._release_reference_locked_pids(pid) + + def _untag_object(self, pid: str, cid: str) -> None: + """Untags a data object in HashStore by deleting the 'pid reference file' and removing + the 'pid' from the 'cid reference file'. This method will never delete a data + object. `_untag_object` will attempt to proceed with as much of the untagging process as + possible and swallow relevant exceptions. + + :param str cid: Content identifier + :param str pid: Persistent or authority-based identifier. + """ + self._check_string(pid, "pid") + self._check_string(cid, "cid") + + untag_obj_delete_list = [] + + # To untag a pid, the pid must be found and currently locked + # The pid will not be released until this process is over + self._check_reference_locked_pids(pid) + + # Before we begin the untagging process, we look for the `cid` by calling `find_object` + # which will throw custom exceptions if there is an issue with the reference files, + # which help us determine the path to proceed with. + try: + obj_info_dict = self._find_object(pid) + cid_to_check = obj_info_dict["cid"] + self._validate_and_check_cid_lock(pid, cid, cid_to_check) + + # Remove pid refs + pid_refs_path = self._get_hashstore_pid_refs_path(pid) + self._mark_pid_refs_file_for_deletion( + pid, untag_obj_delete_list, pid_refs_path + ) + # Remove pid from cid refs + cid_refs_path = self._get_hashstore_cid_refs_path(cid) + self._remove_pid_and_handle_cid_refs_deletion( + pid, untag_obj_delete_list, cid_refs_path + ) + # Remove all files confirmed for deletion + self._delete_marked_files(untag_obj_delete_list) + info_msg = f"Untagged pid: {pid} with cid: {cid}" + self.fhs_logger.info(info_msg) + + except OrphanPidRefsFileFound as oprff: + # `find_object` throws this exception when the cid refs file doesn't exist, + # so we only need to delete the pid refs file (pid is already locked) + pid_refs_path = self._get_hashstore_pid_refs_path(pid) + cid_read = self._read_small_file_content(pid_refs_path) + self._validate_and_check_cid_lock(pid, cid, cid_read) + + # Remove pid refs + self._mark_pid_refs_file_for_deletion( + pid, untag_obj_delete_list, pid_refs_path + ) + self._delete_marked_files(untag_obj_delete_list) + + warn_msg = ( + f"Cid refs file does not exist for pid: {pid}. Deleted orphan pid refs file. " + f"Additional info: {oprff}" + ) + self.fhs_logger.warning(warn_msg) + + except RefsFileExistsButCidObjMissing as rfebcom: + # `find_object` throws this exception when both pid/cid refs files exist but the + # actual data object does not. + pid_refs_path = self._get_hashstore_pid_refs_path(pid) + cid_read = self._read_small_file_content(pid_refs_path) + self._validate_and_check_cid_lock(pid, cid, cid_read) + + # Remove pid refs + self._mark_pid_refs_file_for_deletion( + pid, untag_obj_delete_list, pid_refs_path + ) + # Remove pid from cid refs + cid_refs_path = self._get_hashstore_cid_refs_path(cid) + self._remove_pid_and_handle_cid_refs_deletion( + pid, untag_obj_delete_list, cid_refs_path + ) + # Remove all files confirmed for deletion + self._delete_marked_files(untag_obj_delete_list) + + warn_msg = ( + f"data object for cid: {cid_read}. does not exist, but pid and cid references " + + f"files found for pid: {pid}, Deleted pid and cid refs files. 
" + + f"Additional info: {rfebcom}" + ) + self.fhs_logger.warning(warn_msg) + + except PidNotFoundInCidRefsFile as pnficrf: + # `find_object` throws this exception when both the pid and cid refs file exists + # but the pid is not found in the cid refs file + pid_refs_path = self._get_hashstore_pid_refs_path(pid) + cid_read = self._read_small_file_content(pid_refs_path) + self._validate_and_check_cid_lock(pid, cid, cid_read) + + # Remove pid refs + self._mark_pid_refs_file_for_deletion( + pid, untag_obj_delete_list, pid_refs_path + ) + self._delete_marked_files(untag_obj_delete_list) + + warn_msg = ( + f"Pid not found in expected cid refs file for pid: {pid}. Deleted orphan pid refs " + f"file. Additional info: {pnficrf}" + ) + self.fhs_logger.warning(warn_msg) + + except PidRefsDoesNotExist as prdne: + # `find_object` throws this exception if the pid refs file is not found + # Check to see if pid is in the 'cid refs file' and attempt to remove it + self._check_object_locked_cids(cid) + + # Remove pid from cid refs + cid_refs_path = self._get_hashstore_cid_refs_path(cid) + self._remove_pid_and_handle_cid_refs_deletion( + pid, untag_obj_delete_list, cid_refs_path + ) + # Remove all files confirmed for deletion + self._delete_marked_files(untag_obj_delete_list) + + warn_msg = ( + "Pid refs file not found, removed pid from cid reference file for cid:" + + f" {cid}. Additional info: {prdne}" + ) + self.fhs_logger.warning(warn_msg) + + def _put_metadata( + self, metadata: Union[str, bytes], pid: str, metadata_doc_name: str + ) -> Path: + """Store contents of metadata to `[self.root]/metadata` using the hash of the + given PID and format ID as the permanent address. + + :param mixed metadata: String or path to metadata document. + :param str pid: Authority-based identifier. + :param str metadata_doc_name: Metadata document name + + :return: Address of the metadata document. + """ + self.fhs_logger.debug("Request to put metadata for pid: %s", pid) + # Create metadata tmp file and write to it + metadata_stream = Stream(metadata) + with closing(metadata_stream): + metadata_tmp = self._mktmpmetadata(metadata_stream) + + # Get target and related paths (permanent location) + metadata_directory = self._computehash(pid) + metadata_document_name = metadata_doc_name + rel_path = Path(*self._shard(metadata_directory)) + full_path = self._get_store_path("metadata") / rel_path / metadata_document_name + + # Move metadata to target path + if os.path.isfile(metadata_tmp): + try: + parent = full_path.parent + parent.mkdir(parents=True, exist_ok=True) + # Metadata will be replaced if it exists + shutil.move(metadata_tmp, full_path) + self.fhs_logger.debug("Successfully put metadata for pid: %s", pid) + return full_path + except Exception as err: + err_msg = f"Unexpected {err=}, {type(err)=}" + self.fhs_logger.error(err_msg) + if os.path.isfile(metadata_tmp): + # Remove tmp metadata, calling app must re-upload + self.fhs_logger.debug("Deleting metadata for pid: %s", pid) + self._delete("metadata", metadata_tmp) + raise + else: + err_msg = ( + f"Attempted to move metadata for pid: {pid}, but metadata temp file not found:" + + f" {metadata_tmp}" + ) + self.fhs_logger.error(err_msg) + raise FileNotFoundError(err_msg) + + def _mktmpmetadata(self, stream: "Stream") -> str: + """Create a named temporary file with `stream` (metadata). + + :param Stream stream: Metadata stream. + + :return: Path/name of temporary file created and written into. 
+ """ + # Create temporary file in .../{store_path}/tmp + tmp_root_path = self._get_store_path("metadata") / "tmp" + tmp = self._mktmpfile(tmp_root_path) # tmp is a file-like object that is already opened for writing by default - logging.debug( - "FileHashStore - _mktmpmetadata: Writing stream to tmp metadata file: %s", - tmp.name, - ) + self.fhs_logger.debug("Writing stream to tmp metadata file: %s", tmp.name) with tmp as tmp_file: for data in stream: - tmp_file.write(self._to_bytes(data)) + tmp_file.write(self._cast_to_bytes(data)) - logging.debug( - "FileHashStore - _mktmpmetadata: Successfully written to tmp metadata file: %s", - tmp.name, - ) + self.fhs_logger.debug("Successfully written to tmp metadata file: %s", tmp.name) return tmp.name # FileHashStore Utility & Supporting Methods - def _validate_data_to_store(self, data): - """Evaluates a data argument to ensure that it is either a string, path or - stream object before attempting to store it. + @staticmethod + def _delete_marked_files(delete_list: list[str]) -> None: + """Delete all the file paths in a given delete list. - Args: - data (string, path, stream): object to validate + :param list delete_list: Persistent or authority-based identifier. """ - if ( - not isinstance(data, str) - and not isinstance(data, Path) - and not isinstance(data, io.BufferedIOBase) - ): - exception_string = ( - "FileHashStore - store_object: Data must be a path, string or buffered" - + f" stream type. Data type supplied: {type(data)}" + if delete_list is not None: + for obj in delete_list: + try: + os.remove(obj) + except Exception as e: + warn_msg = f"Unable to remove {obj} in given delete_list. " + str(e) + logging.warning(warn_msg) + else: + raise ValueError("list cannot be None") + + def _mark_pid_refs_file_for_deletion( + self, pid: str, delete_list: List[str], pid_refs_path: Path + ) -> None: + """Attempt to rename a pid refs file and add the renamed file to a provided list. + + :param str pid: Persistent or authority-based identifier. + :param list delete_list: List to add the renamed pid refs file marked for deletion to + :param path pid_refs_path: Path to the pid reference file + """ + try: + delete_list.append(self._rename_path_for_deletion(pid_refs_path)) + + except Exception as e: + err_msg = ( + f"Unable to delete pid refs file: {pid_refs_path} for pid: {pid}. {e}" ) - logging.error(exception_string) - raise TypeError(exception_string) - if isinstance(data, str): - if data.replace(" ", "") == "": - exception_string = ( - "FileHashStore - store_object: Data string cannot be empty." - ) - logging.error(exception_string) - raise TypeError(exception_string) - - def _validate_algorithms_and_checksum( - self, additional_algorithm, checksum, checksum_algorithm - ): - """Determines whether calling app has supplied the necessary arguments to validate - an object with a checksum value - - Args: - additional_algorithm: value of additional algorithm to calculate - checksum (string): value of checksum - checksum_algorithm (string): algorithm of checksum + self.fhs_logger.error(err_msg) + + def _remove_pid_and_handle_cid_refs_deletion( + self, pid: str, delete_list: List[str], cid_refs_path: Path + ) -> None: + """Attempt to remove a pid from a 'cid refs file' and add the 'cid refs file' to the + delete list if it is empty. + + :param str pid: Persistent or authority-based identifier. 
+        :param list delete_list: List to add the renamed cid refs file to if it is marked
+            for deletion
+        :param path cid_refs_path: Path to the cid reference file
         """
-        additional_algorithm_checked = None
-        if additional_algorithm != self.algorithm and additional_algorithm is not None:
-            # Set additional_algorithm
-            additional_algorithm_checked = self.clean_algorithm(additional_algorithm)
-        checksum_algorithm_checked = None
-        if checksum is not None:
-            self._is_string_none_or_empty(
-                checksum_algorithm,
-                "checksum_algorithm",
-                "validate_checksum_args (store_object)",
+        try:
+            # Remove pid from cid reference file
+            self._update_refs_file(cid_refs_path, pid, "remove")
+            # Delete cid reference file and object only if the cid refs file is empty
+            if os.path.getsize(cid_refs_path) == 0:
+                delete_list.append(self._rename_path_for_deletion(cid_refs_path))
+
+        except Exception as e:
+            err_msg = (
+                f"Unable to remove pid from cid refs file: {cid_refs_path} for pid:"
+                f" {pid}. " + str(e)
             )
-            if checksum_algorithm is not None:
-                self._is_string_none_or_empty(
-                    checksum,
-                    "checksum",
-                    "validate_checksum_args (store_object)",
+            self.fhs_logger.error(err_msg)
+
+    def _validate_and_check_cid_lock(
+        self, pid: str, cid: str, cid_to_check: str
+    ) -> None:
+        """Confirm that the two content identifiers provided are equal and that the cid is
+        locked to ensure thread safety.
+
+        :param str pid: Persistent identifier
+        :param str cid: Content identifier
+        :param str cid_to_check: Cid that was retrieved or read
+        """
+        self._check_string(cid, "cid")
+        self._check_string(cid_to_check, "cid_to_check")
+
+        if cid != cid_to_check:
+            err_msg = (
+                f"_validate_and_check_cid_lock: cid provided: {cid_to_check} does not "
+                f"match untag request for cid: {cid} and pid: {pid}"
             )
-            # Set checksum_algorithm
-            checksum_algorithm_checked = self.clean_algorithm(checksum_algorithm)
-        return additional_algorithm_checked, checksum_algorithm_checked
+            self.fhs_logger.error(err_msg)
+            raise ValueError(err_msg)
+        self._check_object_locked_cids(cid)

-    def _refine_algorithm_list(self, additional_algorithm, checksum_algorithm):
-        """Create the final list of hash algorithms to calculate
+    def _write_refs_file(self, path: Path, ref_id: str, ref_type: str) -> str:
+        """Write a reference file into a temporary file in the supplied path.
+        All `pid` or `cid` reference files begin with a single identifier, with the
+        difference being that a cid reference file can potentially contain multiple
+        lines of `pid`s that reference the `cid`.

-        Args:
-            additional_algorithm (string)
-            checksum_algorithm (string)
+        :param path path: Directory to write a temporary file into
+        :param str ref_id: Authority-based, persistent or content identifier
+        :param str ref_type: 'cid' or 'pid'

-        Return:
-            algorithm_list_to_calculate (set): De-duplicated list of hash algorithms
+        :return: tmp_file_path - Path to the tmp refs file
         """
-        algorithm_list_to_calculate = self.default_algo_list
-        if checksum_algorithm is not None:
-            self.clean_algorithm(checksum_algorithm)
-            if checksum_algorithm in self.other_algo_list:
-                debug_additional_other_algo_str = (
-                    f"FileHashStore - _mktempfile: checksum algorithm: {checksum_algorithm}"
-                    + " found in other_algo_lists, adding to list of algorithms to calculate."
-                )
-                logging.debug(debug_additional_other_algo_str)
-                algorithm_list_to_calculate.append(checksum_algorithm)
-        if additional_algorithm is not None:
-            self.clean_algorithm(additional_algorithm)
-            if additional_algorithm in self.other_algo_list:
-                debug_additional_other_algo_str = (
-                    f"FileHashStore - _mktempfile: additional algorithm: {additional_algorithm}"
-                    + " found in other_algo_lists, adding to list of algorithms to calculate."
-                )
-                logging.debug(debug_additional_other_algo_str)
-                algorithm_list_to_calculate.append(additional_algorithm)
+        self.fhs_logger.debug("Writing id (%s) into a tmp file in: %s", ref_id, path)
+        try:
+            with self._mktmpfile(path) as tmp_file:
+                tmp_file_path = tmp_file.name
+                with open(tmp_file_path, "w", encoding="utf8") as tmp_cid_ref_file:
+                    if ref_type == "cid":
+                        tmp_cid_ref_file.write(ref_id + "\n")
+                    if ref_type == "pid":
+                        tmp_cid_ref_file.write(ref_id)
+                    return tmp_file_path

-        # Remove duplicates
-        algorithm_list_to_calculate = set(algorithm_list_to_calculate)
-        return algorithm_list_to_calculate
+        except Exception as err:
+            err_msg = (
+                f"Failed to write {ref_type} refs file for ref_id: {ref_id} into path: {path}. "
+                + f"Unexpected error: {err=}, {type(err)=}"
+            )
+            self.fhs_logger.error(err_msg)
+            raise err
+
+    def _update_refs_file(
+        self, refs_file_path: Path, ref_id: str, update_type: str
+    ) -> None:
+        """Add or remove an existing ref from a refs file.
+
+        :param path refs_file_path: Absolute path to the refs file.
+        :param str ref_id: Authority-based or persistent identifier of the object.
+        :param str update_type: 'add' or 'remove'
+        """
+        debug_msg = f"Updating ({update_type}) for ref_id: {ref_id} at refs file: {refs_file_path}."
+        self.fhs_logger.debug(debug_msg)
+        if not os.path.isfile(refs_file_path):
+            err_msg = (
+                f"Refs file: {refs_file_path} does not exist. "
+                + f"Cannot {update_type} ref_id: {ref_id}"
+            )
+            self.fhs_logger.error(err_msg)
+            raise FileNotFoundError(err_msg)
+        try:
+            if update_type == "add":
+                pid_found = self._is_string_in_refs_file(ref_id, refs_file_path)
+                if not pid_found:
+                    with open(refs_file_path, "a", encoding="utf8") as ref_file:
+                        # Lock file for the shortest amount of time possible
+                        file_descriptor = ref_file.fileno()
+                        fcntl.flock(file_descriptor, fcntl.LOCK_EX)
+                        ref_file.write(ref_id + "\n")
+            if update_type == "remove":
+                with open(refs_file_path, "r+", encoding="utf8") as ref_file:
+                    # Lock file immediately, this process needs to complete
+                    # before any others read/modify the content of the refs file
+                    file_descriptor = ref_file.fileno()
+                    fcntl.flock(file_descriptor, fcntl.LOCK_EX)
+                    new_pid_lines = [
+                        cid_pid_line
+                        for cid_pid_line in ref_file.readlines()
+                        if cid_pid_line.strip() != ref_id
+                    ]
+                    ref_file.seek(0)
+                    ref_file.writelines(new_pid_lines)
+                    ref_file.truncate()
+            debug_msg = (
+                f"Update ({update_type}) for ref_id: {ref_id} "
+                + f"completed on refs file: {refs_file_path}."
+            )
+            self.fhs_logger.debug(debug_msg)
+        except Exception as err:
+            err_msg = (
+                f"Failed to {update_type} for ref_id: {ref_id}"
+                + f" at refs file: {refs_file_path}. Unexpected {err=}, {type(err)=}"
+            )
+            self.fhs_logger.error(err_msg)
+            raise err

-    def _validate_object(
+    @staticmethod
+    def _is_string_in_refs_file(ref_id: str, refs_file_path: Path) -> bool:
+        """Check a reference file for a ref_id (`cid` or `pid`).
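To make the locking behaviour in `_update_refs_file` easier to review, here is a condensed sketch of the same add/remove pattern on a cid refs file (duplicate checks, logging and error handling omitted); it illustrates the approach and is not a drop-in replacement for the method above:

```python
import fcntl

def add_pid_to_cid_refs(cid_refs_path: str, pid: str) -> None:
    # Append the pid while holding an exclusive advisory lock on the open file
    with open(cid_refs_path, "a", encoding="utf8") as ref_file:
        fcntl.flock(ref_file.fileno(), fcntl.LOCK_EX)
        ref_file.write(pid + "\n")

def remove_pid_from_cid_refs(cid_refs_path: str, pid: str) -> None:
    # Rewrite the file without the pid, then truncate any leftover bytes
    with open(cid_refs_path, "r+", encoding="utf8") as ref_file:
        fcntl.flock(ref_file.fileno(), fcntl.LOCK_EX)
        remaining = [line for line in ref_file.readlines() if line.strip() != pid]
        ref_file.seek(0)
        ref_file.writelines(remaining)
        ref_file.truncate()
```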
+ + :param str ref_id: Authority-based, persistent identifier or content identifier + :param path refs_file_path: Path to the refs file + + :return: pid_found + """ + with open(refs_file_path, "r", encoding="utf8") as ref_file: + # Confirm that pid is not currently already tagged + for line in ref_file: + value = line.strip() + if ref_id == value: + return True + return False + + def _verify_object_information( self, - pid, - checksum, - checksum_algorithm, - entity, - hex_digests, - tmp_file_name, - tmp_file_size, - file_size_to_validate, - ): - """Evaluates an object's integrity - - Args: - pid: For logging purposes - checksum: Value of checksum - checksum_algorithm: Algorithm of checksum - entity: Type of object - hex_digests: Dictionary of hex digests to select from - tmp_file_name: Name of tmp file - tmp_file_size: Size of the tmp file - file_size_to_validate: Expected size of the object + pid: Optional[str], + checksum: str, + checksum_algorithm: str, + entity: str, + hex_digests: Dict[str, str], + tmp_file_name: Optional[str], + tmp_file_size: int, + file_size_to_validate: int, + ) -> None: + """Evaluates an object's integrity - if there is a mismatch, deletes the object + in question and raises an exception. + + :param Optional[str] pid: For logging purposes. + :param str checksum: Value of the checksum to check. + :param str checksum_algorithm: Algorithm of the checksum. + :param str entity: Type of object ('objects' or 'metadata'). + :param dict hex_digests: Dictionary of hex digests to parse. + :param Optional[str] tmp_file_name: Name of the temporary file. + :param int tmp_file_size: Size of the temporary file. + :param int file_size_to_validate: Expected size of the object. """ if file_size_to_validate is not None and file_size_to_validate > 0: if file_size_to_validate != tmp_file_size: - self.delete(entity, tmp_file_name) - exception_string = ( - "FileHashStore - _move_and_get_checksums: Object file size calculated: " - + f" {tmp_file_size} does not match with expected size:" - + f"{file_size_to_validate}. Tmp file deleted and file not stored for" - + f" pid: {pid}" + err_msg = ( + f"Object file size calculated: {tmp_file_size} does not match with expected " + f"size: {file_size_to_validate}." ) - logging.error(exception_string) - raise ValueError(exception_string) + if pid is not None: + self._delete(entity, tmp_file_name) + err_msg_for_pid = ( + f"{err_msg} Tmp file deleted and file not stored for pid: {pid}" + ) + self.fhs_logger.debug(err_msg_for_pid) + raise NonMatchingObjSize(err_msg_for_pid) + else: + self.fhs_logger.debug(err_msg) + raise NonMatchingObjSize(err_msg) if checksum_algorithm is not None and checksum is not None: - hex_digest_stored = hex_digests[checksum_algorithm] - if hex_digest_stored != checksum: - self.delete(entity, tmp_file_name) - exception_string = ( - "FileHashStore - _move_and_get_checksums: Hex digest and checksum" - + f" do not match - file not stored for pid: {pid}. Algorithm:" - + f" {checksum_algorithm}. Checksum provided: {checksum} !=" - + f" HexDigest: {hex_digest_stored}. Tmp file deleted." 
- ) - logging.error(exception_string) - raise ValueError(exception_string) + if checksum_algorithm not in hex_digests: + # Check to see if it is a supported algorithm + self._clean_algorithm(checksum_algorithm) + # If so, calculate the checksum and compare it + if tmp_file_name is not None and pid is not None: + # Calculate the checksum from the tmp file + hex_digest_calculated = self._computehash( + tmp_file_name, algorithm=checksum_algorithm + ) + else: + # Otherwise, a data object has been stored without a pid + object_cid = hex_digests[self.algorithm] + cid_stream = self._open(entity, object_cid) + hex_digest_calculated = self._computehash( + cid_stream, algorithm=checksum_algorithm + ) + if hex_digest_calculated != checksum: + err_msg = ( + f"Checksum_algorithm ({checksum_algorithm}) cannot be found in the " + + "default hex digests dict, but is supported. New checksum calculated: " + + f"{hex_digest_calculated}, does not match what has been provided: " + + checksum + ) + self.fhs_logger.debug(err_msg) + raise NonMatchingChecksum(err_msg) + else: + hex_digest_stored = hex_digests[checksum_algorithm] + if hex_digest_stored != checksum.lower(): + err_msg = ( + f"Hex digest and checksum do not match - file not stored for pid: {pid}. " + + f"Algorithm: {checksum_algorithm}. Checksum provided: {checksum} !=" + + f" HexDigest: {hex_digest_stored}." + ) + if pid is not None: + # Delete the tmp file + self._delete(entity, tmp_file_name) + err_msg_for_pid = ( + err_msg + f" Tmp file ({tmp_file_name}) deleted." + ) + self.fhs_logger.error(err_msg_for_pid) + raise NonMatchingChecksum(err_msg_for_pid) + else: + self.fhs_logger.error(err_msg) + raise NonMatchingChecksum(err_msg) + + def _verify_hashstore_references( + self, + pid: str, + cid: str, + pid_refs_path: Optional[Path] = None, + cid_refs_path: Optional[Path] = None, + additional_log_string: Optional[str] = None, + ) -> None: + """Verifies that the supplied pid and pid reference file and content have been + written successfully. + + :param str pid: Authority-based or persistent identifier. + :param str cid: Content identifier. + :param path pid_refs_path: Path to pid refs file + :param path cid_refs_path: Path to cid refs file + :param str additional_log_string: String to append to exception statement + """ + debug_msg = ( + f"Verifying pid ({pid}) and cid ({cid}) refs files. {additional_log_string}" + ) + self.fhs_logger.debug(debug_msg) + if pid_refs_path is None: + pid_refs_path = self._get_hashstore_pid_refs_path(pid) + if cid_refs_path is None: + cid_refs_path = self._get_hashstore_cid_refs_path(cid) + + # Check that reference files were created + if not os.path.isfile(pid_refs_path): + err_msg = f" Pid refs file missing: {pid_refs_path}. Note: {additional_log_string}" + self.fhs_logger.error(err_msg) + raise PidRefsFileNotFound(err_msg) + if not os.path.isfile(cid_refs_path): + err_msg = ( + f"Cid refs file missing: {cid_refs_path}. Note: {additional_log_string}" + ) + self.fhs_logger.error(err_msg) + raise CidRefsFileNotFound(err_msg) + # Check the content of the reference files + # Start with the cid + retrieved_cid = self._read_small_file_content(pid_refs_path) + if retrieved_cid != cid: + err_msg = ( + f"Pid refs file exists ({pid_refs_path}) but cid ({cid}) does not match." 
+ + f" Note: {additional_log_string}" + ) + self.fhs_logger.error(err_msg) + raise PidRefsContentError(err_msg) + # Then the pid + pid_found = self._is_string_in_refs_file(pid, cid_refs_path) + if not pid_found: + err_msg = ( + f"Cid refs file exists ({cid_refs_path}) but pid ({pid}) not found." + + f" Note: {additional_log_string}" + ) + self.fhs_logger.error(err_msg) + raise CidRefsContentError(err_msg) - def _validate_metadata_to_store(self, metadata): - """Evaluates a metadata argument to ensure that it is either a string, path or - stream object before attempting to store it. + def _delete_object_only(self, cid: str) -> None: + """Attempt to delete an object based on the given content identifier (cid). If the object + has any pids references and/or a cid refs file exists, the object will not be deleted. - Args: - metadata (string, path, stream): metadata to validate + :param str cid: Content identifier """ - if isinstance(metadata, str): - if metadata.replace(" ", "") == "": - exception_string = ( - "FileHashStore - store_metadata: Given string path to" - + " metadata cannot be empty." + try: + cid_refs_abs_path = self._get_hashstore_cid_refs_path(cid) + # If the refs file still exists, do not delete the object + self._synchronize_object_locked_cids(cid) + if os.path.isfile(cid_refs_abs_path): + debug_msg = ( + f"Cid reference file exists for: {cid}, skipping delete request." ) - logging.error(exception_string) - raise TypeError(exception_string) - if ( - not isinstance(metadata, str) - and not isinstance(metadata, Path) - and not isinstance(metadata, io.BufferedIOBase) - ): - exception_string = ( - "FileHashStore - store_metadata: Metadata must be a path or string" - + f" type, data type supplied: {type(metadata)}" - ) - logging.error(exception_string) - raise TypeError(exception_string) + self.fhs_logger.debug(debug_msg) + + else: + self._delete("objects", cid) + info_msg = f"Deleted object only for cid: {cid}" + self.fhs_logger.info(info_msg) - def _validate_format_id(self, format_id, method): + finally: + self._release_object_locked_cids(cid) + + def _check_arg_algorithms_and_checksum( + self, + additional_algorithm: Optional[str], + checksum: Optional[str], + checksum_algorithm: Optional[str], + ) -> Tuple[Optional[str], Optional[str]]: + """Determines whether the caller has supplied the necessary arguments to validate + an object with a checksum value. + + :param additional_algorithm: Value of the additional algorithm to calculate. + :type additional_algorithm: str or None + :param checksum: Value of the checksum. + :type checksum: str or None + :param checksum_algorithm: Algorithm of the checksum. + :type checksum_algorithm: str or None + + :return: Hashlib-compatible string or 'None' for additional_algorithm and + checksum_algorithm. 
+ """ + additional_algorithm_checked = None + if additional_algorithm != self.algorithm and additional_algorithm is not None: + # Set additional_algorithm + additional_algorithm_checked = self._clean_algorithm(additional_algorithm) + checksum_algorithm_checked = None + if checksum is not None: + self._check_string(checksum_algorithm, "checksum_algorithm") + if checksum_algorithm is not None: + self._check_string(checksum, "checksum") + # Set checksum_algorithm + checksum_algorithm_checked = self._clean_algorithm(checksum_algorithm) + return additional_algorithm_checked, checksum_algorithm_checked + + def _check_arg_format_id(self, format_id: str, method: str) -> str: """Determines the metadata namespace (format_id) to use for storing, - retrieving and deleting metadata. + retrieving, and deleting metadata. - Args: - format_id (string): Metadata namespace to review - method (string): Calling method for logging purposes + :param str format_id: Metadata namespace to review. + :param str method: Calling method for logging purposes. - Returns: - checked_format_id (string): Valid metadata namespace + :return: Valid metadata namespace. """ - checked_format_id = None - if format_id is not None and format_id.replace(" ", "") == "": - exception_string = f"FileHashStore - {method}: Format_id cannot be empty." - logging.error(exception_string) - raise ValueError(exception_string) + if format_id and not format_id.strip(): + err_msg = f"FileHashStore - {method}: Format_id cannot be empty." + self.fhs_logger.error(err_msg) + raise ValueError(err_msg) elif format_id is None: # Use default value set by hashstore config checked_format_id = self.sysmeta_ns @@ -1210,15 +2146,47 @@ def _validate_format_id(self, format_id, method): checked_format_id = format_id return checked_format_id - def clean_algorithm(self, algorithm_string): + def _refine_algorithm_list( + self, additional_algorithm: Optional[str], checksum_algorithm: Optional[str] + ) -> Set[str]: + """Create the final list of hash algorithms to calculate. + + :param str additional_algorithm: Additional algorithm. + :param str checksum_algorithm: Checksum algorithm. + + :return: De-duplicated list of hash algorithms. + """ + algorithm_list_to_calculate = self.default_algo_list + if checksum_algorithm is not None: + self._clean_algorithm(checksum_algorithm) + if checksum_algorithm in self.other_algo_list: + debug_additional_other_algo_str = ( + f"Checksum algo: {checksum_algorithm} found in other_algo_lists, adding to " + + f"list of algorithms to calculate." + ) + self.fhs_logger.debug(debug_additional_other_algo_str) + algorithm_list_to_calculate.append(checksum_algorithm) + if additional_algorithm is not None: + self._clean_algorithm(additional_algorithm) + if additional_algorithm in self.other_algo_list: + debug_additional_other_algo_str = ( + f"Additional algo: {additional_algorithm} found in other_algo_lists, " + + f"adding to list of algorithms to calculate." + ) + self.fhs_logger.debug(debug_additional_other_algo_str) + algorithm_list_to_calculate.append(additional_algorithm) + + # Remove duplicates + algorithm_list_to_calculate = set(algorithm_list_to_calculate) + return algorithm_list_to_calculate + + def _clean_algorithm(self, algorithm_string: str) -> str: """Format a string and ensure that it is supported and compatible with - the python hashlib library. + the Python `hashlib` library. - Args: - algorithm_string (string): Algorithm to validate. + :param str algorithm_string: Algorithm to validate. 
- Returns: - cleaned_string (string): `hashlib` supported algorithm string. + :return: `hashlib` supported algorithm string. """ count = 0 for char in algorithm_string: @@ -1233,104 +2201,134 @@ def clean_algorithm(self, algorithm_string): cleaned_string not in self.default_algo_list and cleaned_string not in self.other_algo_list ): - exception_string = ( - "FileHashStore: clean_algorithm: Algorithm not supported:" - + cleaned_string - ) - logging.error(exception_string) - raise ValueError(exception_string) + err_msg = f"Algorithm not supported: {cleaned_string}" + self.fhs_logger.error(err_msg) + raise UnsupportedAlgorithm(err_msg) return cleaned_string - def computehash(self, stream, algorithm=None): - """Compute hash of a file-like object using :attr:`algorithm` by default - or with optional algorithm supported. + def _computehash( + self, stream: Union["Stream", str, IO[bytes]], algorithm: Optional[str] = None + ) -> str: + """Compute the hash of a file-like object (or string) using the store algorithm by + default or with an optional supported algorithm. - Args: - stream (io.BufferedReader): A buffered stream of an object_cid object. \n - algorithm (string): Algorithm of hex digest to generate. + :param mixed stream: A buffered stream (`io.BufferedReader`) of an object. A string is + also acceptable as they are a sequence of characters (Python only). + :param str algorithm: Algorithm of hex digest to generate. - Returns: - hex_digest (string): Hex digest. + :return: Hex digest. """ if algorithm is None: - hashobj = hashlib.new(self.algorithm) + hash_obj = hashlib.new(self.algorithm) else: - check_algorithm = self.clean_algorithm(algorithm) - hashobj = hashlib.new(check_algorithm) + check_algorithm = self._clean_algorithm(algorithm) + hash_obj = hashlib.new(check_algorithm) for data in stream: - hashobj.update(self._to_bytes(data)) - hex_digest = hashobj.hexdigest() + hash_obj.update(self._cast_to_bytes(data)) + hex_digest = hash_obj.hexdigest() return hex_digest - def get_store_path(self, entity): - """Return a path object of the root directory of the store. - - Args: - entity (str): Desired entity type: "objects" or "metadata" - """ - if entity == "objects": - return Path(self.objects) - elif entity == "metadata": - return Path(self.metadata) - else: - raise ValueError( - f"entity: {entity} does not exist. Do you mean 'objects' or 'metadata'?" - ) - - def exists(self, entity, file): - """Check whether a given file id or path exists on disk. + def _shard(self, checksum: str) -> List[str]: + """Splits the given checksum into a list of tokens of length `self.width`, followed by + the remainder. - Args: - entity (str): Desired entity type (ex. "objects", "metadata"). \n - file (str): The name of the file to check. - - Returns: - file_exists (bool): True if the file exists. - - """ - file_exists = bool(self.get_real_path(entity, file)) - return file_exists - - def shard(self, digest): - """Generates a list given a digest of `self.depth` number of tokens with width - `self.width` from the first part of the digest plus the remainder. + This method divides the checksum into `self.depth` number of tokens, each with a fixed + width of `self.width`, taken from the beginning of the checksum. Any leftover characters + are added as the final element in the list. 
Example: + For a checksum of '0d555ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e', + the result may be: ['0d', '55', '5e', 'd77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e'] - Args: - digest (str): The string to be divided into tokens. + :param str checksum: The checksum string to be split into tokens. - Returns: - hierarchical_list (list): A list containing the tokens of fixed width. + :return: A list where each element is a token of fixed width, with any leftover + characters as the last element. """ - def compact(items): + def compact(items: List[Any]) -> List[Any]: """Return only truthy elements of `items`.""" + # truthy_items = [] + # for item in items: + # if item: + # truthy_items.append(item) + # return truthy_items return [item for item in items if item] # This creates a list of `depth` number of tokens with width # `width` from the first part of the id plus the remainder. hierarchical_list = compact( - [digest[i * self.width : self.width * (i + 1)] for i in range(self.depth)] - + [digest[self.depth * self.width :]] + [checksum[i * self.width : self.width * (i + 1)] for i in range(self.depth)] + + [checksum[self.depth * self.width :]] ) return hierarchical_list - def open(self, entity, file, mode="rb"): + def _count(self, entity: str) -> int: + """Return the count of the number of files in the `root` directory. + + :param str entity: Desired entity type (ex. "objects", "metadata"). + + :return: Number of files in the directory. + """ + count = 0 + if entity == "objects": + directory_to_count = self.objects + elif entity == "metadata": + directory_to_count = self.metadata + elif entity == "pid": + directory_to_count = self.pids + elif entity == "cid": + directory_to_count = self.cids + elif entity == "tmp": + directory_to_count = self.objects / "tmp" + else: + raise ValueError( + f"entity: {entity} does not exist. Do you mean 'objects' or 'metadata'?" + ) + + for _, _, files in os.walk(directory_to_count): + for _ in files: + count += 1 + return count + + def _exists(self, entity: str, file: str) -> bool: + """Check whether a given file id or path exists on disk. + + :param str entity: Desired entity type (e.g., "objects", "metadata"). + :param str file: The name of the file to check. + + :return: True if the file exists. + """ + if entity == "objects": + try: + return bool(self._get_hashstore_data_object_path(file)) + except FileNotFoundError: + return False + if entity == "metadata": + try: + return bool(self._get_hashstore_metadata_path(file)) + except FileNotFoundError: + return False + + def _open( + self, entity: str, file: str, mode: str = "rb" + ) -> Union[IO[bytes], IO[str]]: """Return open buffer object from given id or path. Caller is responsible for closing the stream. - Args: - entity (str): Desired entity type (ex. "objects", "metadata"). \n - file (str): Address ID or path of file. \n - mode (str, optional): Mode to open file in. Defaults to 'rb'. + :param str entity: Desired entity type (ex. "objects", "metadata"). + :param str file: Address ID or path of file. + :param str mode: Mode to open file in. Defaults to 'rb'. - Returns: - buffer (io.BufferedReader): An `io` stream dependent on the `mode`. + :return: An `io` stream dependent on the `mode`. 
""" - realpath = self.get_real_path(entity, file) + realpath = None + if entity == "objects": + realpath = self._get_hashstore_data_object_path(file) + if entity == "metadata": + realpath = self._get_hashstore_metadata_path(file) if realpath is None: raise IOError(f"Could not locate file: {file}") @@ -1339,229 +2337,459 @@ def open(self, entity, file, mode="rb"): buffer = io.open(realpath, mode) return buffer - def delete(self, entity, file): + def _delete(self, entity: str, file: Union[str, Path]) -> None: """Delete file using id or path. Remove any empty directories after deleting. No exception is raised if file doesn't exist. - Args: - entity (str): Desired entity type (ex. "objects", "metadata"). \n - file (str): Address ID or path of file. + :param str entity: Desired entity type (ex. "objects", "metadata"). + :param str file: Address ID or path of file. """ - realpath = self.get_real_path(entity, file) - if realpath is None: - return None + try: + if entity == "tmp": + realpath = file + elif entity == "objects": + realpath = self._get_hashstore_data_object_path(file) + elif entity == "metadata": + try: + realpath = self._get_hashstore_metadata_path(file) + except FileNotFoundError: + # Swallow file not found exceptions for metadata + realpath = None + elif os.path.isfile(file): + # Check if the given path is an absolute path + realpath = file + else: + raise IOError( + f"FileHashStore - delete(): Could not locate file: {file}" + ) + if realpath is not None: + os.remove(realpath) + + except Exception as err: + err_msg = f"FileHashStore - delete(): Unexpected {err=}, {type(err)=}" + self.fhs_logger.error(err_msg) + raise err + + def _create_path(self, path: Path) -> None: + """Physically create the folder path (and all intermediate ones) on disk. + :param Path path: The path to create. + :raises AssertionError: If the path already exists but is not a directory. + """ try: - os.remove(realpath) - except OSError: - pass + os.makedirs(path, self.d_mode) + except FileExistsError: + assert os.path.isdir(path), f"expected {path} to be a directory" + + def _get_store_path(self, entity: str) -> Path: + """Return a path object to the root directory of the requested hashstore directory type + + :param str entity: Desired entity type: "objects", "metadata", "refs", "cid" and "pid". + Note, "cid" and "pid" are refs specific directories. + + :return: Path to requested store entity type + """ + if entity == "objects": + return Path(self.objects) + elif entity == "metadata": + return Path(self.metadata) + elif entity == "refs": + return Path(self.refs) + elif entity == "cid": + return Path(self.cids) + elif entity == "pid": + return Path(self.pids) else: - self.remove_empty(os.path.dirname(realpath)) + raise ValueError( + f"entity: {entity} does not exist. Do you mean 'objects', 'metadata' or 'refs'?" + ) + + def _build_hashstore_data_object_path(self, hash_id: str) -> str: + """Build the absolute file path for a given content identifier - def remove_empty(self, subpath): - """Successively remove all empty folders starting with `subpath` and - proceeding "up" through directory tree until reaching the `root` - folder. + :param str hash_id: A hash ID to build a file path for. - Args: - subpath (str, path): Name of directory. + :return: An absolute file path for the specified hash ID. """ - # Don't attempt to remove any folders if subpath is not a - # subdirectory of the root directory. 
- if not self._has_subdir(subpath): - return + paths = self._shard(hash_id) + root_dir = self._get_store_path("objects") + absolute_path = os.path.join(root_dir, *paths) + return absolute_path - while subpath != self.root: - if len(os.listdir(subpath)) > 0 or os.path.islink(subpath): - break - os.rmdir(subpath) - subpath = os.path.dirname(subpath) + def _get_hashstore_data_object_path(self, cid_or_relative_path: str) -> Path: + """Get the expected path to a hashstore data object that exists using a content identifier. - def _has_subdir(self, path): - """Return whether `path` is a subdirectory of the `root` directory. + :param str cid_or_relative_path: Content identifier or relative path in '/objects' to check - Args: - path (str, path): Name of path. + :return: Path to the data object referenced by the pid + """ + expected_abs_data_obj_path = self._build_hashstore_data_object_path( + cid_or_relative_path + ) + if os.path.isfile(expected_abs_data_obj_path): + return Path(expected_abs_data_obj_path) + else: + if os.path.isfile(cid_or_relative_path): + # Check whether the supplied arg is an abs path that exists or not for convenience + return Path(cid_or_relative_path) + else: + # Check the relative path + relpath = os.path.join(self.objects, cid_or_relative_path) + if os.path.isfile(relpath): + return Path(relpath) + else: + raise FileNotFoundError( + "Could not locate a data object in '/objects' for the supplied " + + f"cid_or_relative_path: {cid_or_relative_path}" + ) + + def _get_hashstore_metadata_path(self, metadata_relative_path: str) -> Path: + """Return the expected metadata path to a hashstore metadata object that exists. + + :param str metadata_relative_path: Metadata path to check or relative path in '/metadata' + to check - Returns: - is_subdir (boolean): `True` if subdirectory. + :return: Path to the data object referenced by the pid """ - # Append os.sep so that paths like /usr/var2/log doesn't match /usr/var. - root_path = os.path.realpath(self.root) + os.sep - subpath = os.path.realpath(path) - is_subdir = subpath.startswith(root_path) - return is_subdir + # Form the absolute path to the metadata file + expected_abs_metadata_path = os.path.join(self.metadata, metadata_relative_path) + if os.path.isfile(expected_abs_metadata_path): + return Path(expected_abs_metadata_path) + else: + if os.path.isfile(metadata_relative_path): + # Check whether the supplied arg is an abs path that exists or not for convenience + return Path(metadata_relative_path) + else: + raise FileNotFoundError( + "Could not locate a metadata object in '/metadata' for the supplied " + + f"metadata_relative_path: {metadata_relative_path}" + ) - def create_path(self, path): - """Physically create the folder path (and all intermediate ones) on disk. + def _get_hashstore_pid_refs_path(self, pid: str) -> Path: + """Return the expected path to a pid reference file. The path may or may not exist. - Args: - path (str): The path to create. + :param str pid: Persistent or authority-based identifier - Raises: - AssertionError (exception): If the path already exists but is not a directory. 
+ :return: Path to pid reference file """ - try: - os.makedirs(path, self.dmode) - except FileExistsError: - assert os.path.isdir(path), f"expected {path} to be a directory" + # The pid refs file is named after the hash of the pid using the store's algorithm + hash_id = self._computehash(pid, self.algorithm) + root_dir = self._get_store_path("pid") + directories_and_path = self._shard(hash_id) + pid_ref_file_abs_path = os.path.join(root_dir, *directories_and_path) + return Path(pid_ref_file_abs_path) - def get_real_path(self, entity, file): - """Attempt to determine the real path of a file id or path through - successive checking of candidate paths. If the real path is stored with - an extension, the path is considered a match if the basename matches - the expected file path of the id. + def _get_hashstore_cid_refs_path(self, cid: str) -> Path: + """Return the expected path to a cid reference file. The path may or may not exist. - Args: - entity (str): desired entity type (ex. "objects", "metadata"). \n - file (string): Name of file. + :param str cid: Content identifier - Returns: - exists (boolean): Whether file is found or not. + :return: Path to cid reference file """ - # Check for absolute path. - if os.path.isfile(file): - return file + root_dir = self._get_store_path("cid") + # The content identifier is to be split into directories as is supplied + directories_and_path = self._shard(cid) + cid_ref_file_abs_path = os.path.join(root_dir, *directories_and_path) + return Path(cid_ref_file_abs_path) - # Check for relative path. - rel_root = "" - if entity == "objects": - rel_root = self.objects - elif entity == "metadata": - rel_root = self.metadata + # Synchronization Methods + + def _synchronize_object_locked_pids(self, pid: str) -> None: + """Threads must work with 'pid's one identifier at a time to ensure thread safety when + handling requests to store, delete or tag pids. + + :param str pid: Persistent or authority-based identifier + """ + if self.use_multiprocessing: + with self.object_pid_condition_mp: + # Wait for the cid to release if it's being tagged + while pid in self.object_locked_pids_mp: + self.fhs_logger.debug(f"Pid ({pid}) is locked. Waiting.") + self.object_pid_condition_mp.wait() + self.object_locked_pids_mp.append(pid) + self.fhs_logger.debug(f"Synchronizing object_locked_pids_mp for pid: {pid}") else: - raise ValueError( - f"entity: {entity} does not exist. Do you mean 'objects' or 'metadata'?" + with self.object_pid_condition_th: + while pid in self.object_locked_pids_th: + self.fhs_logger.debug(f"Pid ({pid}) is locked. Waiting.") + self.object_pid_condition_th.wait() + self.object_locked_pids_th.append(pid) + self.fhs_logger.debug(f"Synchronizing object_locked_pids_th for pid: {pid}") + + def _release_object_locked_pids(self, pid: str) -> None: + """Remove the given persistent identifier from 'object_locked_pids' and notify other + waiting threads or processes. 
+ + :param str pid: Persistent or authority-based identifier + """ + if self.use_multiprocessing: + with self.object_pid_condition_mp: + self.object_locked_pids_mp.remove(pid) + self.object_pid_condition_mp.notify() + self.fhs_logger.debug(f"Releasing pid ({pid}) from object_locked_pids_mp.") + else: + # Release pid + with self.object_pid_condition_th: + self.object_locked_pids_th.remove(pid) + self.object_pid_condition_th.notify() + self.fhs_logger.debug(f"Releasing pid ({pid}) from object_locked_pids_th.") + + def _synchronize_object_locked_cids(self, cid: str) -> None: + """Multiple threads may access a data object via its 'cid' or the respective 'cid + reference file' (which contains a list of 'pid's that reference a 'cid') and this needs + to be coordinated. + + :param str cid: Content identifier + """ + if self.use_multiprocessing: + with self.object_cid_condition_mp: + # Wait for the cid to release if it's being tagged + while cid in self.object_locked_cids_mp: + self.fhs_logger.debug(f"Cid ({cid}) is locked. Waiting.") + self.object_cid_condition_mp.wait() + # Modify reference_locked_cids consecutively + self.object_locked_cids_mp.append(cid) + self.fhs_logger.debug(f"Synchronizing object_locked_cids_mp for cid: {cid}") + else: + with self.object_cid_condition_th: + while cid in self.object_locked_cids_th: + self.fhs_logger.debug(f"Cid ({cid}) is locked. Waiting.") + self.object_cid_condition_th.wait() + self.object_locked_cids_th.append(cid) + self.fhs_logger.debug(f"Synchronizing object_locked_cids_th for cid: {cid}") + + def _check_object_locked_cids(self, cid: str) -> None: + """Check that a given content identifier is currently locked (found in the + 'object_locked_cids' array). If it is not, an exception will be thrown. + + :param str cid: Content identifier + """ + if self.use_multiprocessing: + if cid not in self.object_locked_cids_mp: + err_msg = f"Cid {cid} is not locked." + self.fhs_logger.error(err_msg) + raise IdentifierNotLocked(err_msg) + else: + if cid not in self.object_locked_cids_th: + err_msg = f"Cid {cid} is not locked." + self.fhs_logger.error(err_msg) + raise IdentifierNotLocked(err_msg) + + def _release_object_locked_cids(self, cid: str) -> None: + """Remove the given content identifier from 'object_locked_cids' and notify other + waiting threads or processes. + + :param str cid: Content identifier + """ + if self.use_multiprocessing: + with self.object_cid_condition_mp: + self.object_locked_cids_mp.remove(cid) + self.object_cid_condition_mp.notify() + self.fhs_logger.debug( + f"Releasing cid ({cid}) from object_cid_condition_mp." + ) + else: + with self.object_cid_condition_th: + self.object_locked_cids_th.remove(cid) + self.object_cid_condition_th.notify() + self.fhs_logger.debug( + f"Releasing cid ({cid}) from object_cid_condition_th." ) - relpath = os.path.join(rel_root, file) - if os.path.isfile(relpath): - return relpath - # Check for sharded path. - abspath = self.build_abs_path(entity, file) - if os.path.isfile(abspath): - return abspath + def _synchronize_referenced_locked_pids(self, pid: str) -> None: + """Multiple threads may interact with a pid (to tag, untag, delete) and these actions + must be coordinated to prevent unexpected behaviour/race conditions that cause chaos. - # Could not determine a match. 
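All of the `_synchronize_*`, `_check_*` and `_release_*` helpers in this hunk follow the same condition-variable idiom; a stripped-down sketch of that idiom (threading case only, with an illustrative shared list; the multiprocessing branch uses the same shape with `multiprocessing.Condition` and a managed list):

```python
import threading
from typing import List

locked_ids: List[str] = []
condition = threading.Condition()

def acquire(identifier: str) -> None:
    with condition:
        # Wait until no other thread holds this identifier, then claim it
        while identifier in locked_ids:
            condition.wait()
        locked_ids.append(identifier)

def release(identifier: str) -> None:
    with condition:
        # Give the identifier back and wake a waiter so it can re-check
        locked_ids.remove(identifier)
        condition.notify()
```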
- return None + :param str pid: Persistent or authority-based identifier + """ + if self.use_multiprocessing: + with self.reference_pid_condition_mp: + # Wait for the pid to release if it's in use + while pid in self.reference_locked_pids_mp: + self.fhs_logger.debug(f"Pid ({pid}) is locked. Waiting.") + self.reference_pid_condition_mp.wait() + # Modify reference_locked_pids consecutively + self.reference_locked_pids_mp.append(pid) + self.fhs_logger.debug( + f"Synchronizing reference_locked_pids_mp for pid: {pid}" + ) + else: + with self.reference_pid_condition_th: + while pid in self.reference_locked_pids_th: + logging.debug(f"Pid ({pid}) is locked. Waiting.") + self.reference_pid_condition_th.wait() + self.reference_locked_pids_th.append(pid) + self.fhs_logger.debug( + f"Synchronizing reference_locked_pids_th for pid: {pid}" + ) - def build_abs_path(self, entity, cid, extension=""): - """Build the absolute file path for a given hash id with an optional file extension. + def _check_reference_locked_pids(self, pid: str) -> None: + """Check that a given persistent identifier is currently locked (found in the + 'reference_locked_pids' array). If it is not, an exception will be thrown. + + :param str pid: Persistent or authority-based identifier + """ + if self.use_multiprocessing: + if pid not in self.reference_locked_pids_mp: + err_msg = f"Pid {pid} is not locked." + self.fhs_logger.error(err_msg) + raise IdentifierNotLocked(err_msg) + else: + if pid not in self.reference_locked_pids_th: + err_msg = f"Pid {pid} is not locked." + self.fhs_logger.error(err_msg) + raise IdentifierNotLocked(err_msg) - Args: - entity (str): Desired entity type (ex. "objects", "metadata"). \n - cid (str): A hash id to build a file path for. \n - extension (str): An optional file extension to append to the file path. + def _release_reference_locked_pids(self, pid: str) -> None: + """Remove the given persistent identifier from 'reference_locked_pids' and notify other + waiting threads or processes. - Returns: - absolute_path (str): An absolute file path for the specified hash id. + :param str pid: Persistent or authority-based identifier """ - paths = self.shard(cid) - root_dir = self.get_store_path(entity) + if self.use_multiprocessing: + with self.reference_pid_condition_mp: + self.reference_locked_pids_mp.remove(pid) + self.reference_pid_condition_mp.notify() + self.fhs_logger.debug( + f"Releasing pid ({pid}) from reference_locked_pids_mp." + ) + else: + # Release pid + with self.reference_pid_condition_th: + self.reference_locked_pids_th.remove(pid) + self.reference_pid_condition_th.notify() + self.fhs_logger.debug( + f"Releasing pid ({pid}) from reference_locked_pids_th." + ) + + # Other Static Methods + @staticmethod + def _read_small_file_content(path_to_file: Path): + """Read the contents of a file with the given path. This method is not optimized for + large files - so it should only be used for small files (like reference files). - if extension and not extension.startswith(os.extsep): - extension = os.extsep + extension - elif not extension: - extension = "" + :param path path_to_file: Path to the file to read - absolute_path = os.path.join(root_dir, *paths) + extension - return absolute_path + :return: Content of the given file + """ + with open(path_to_file, "r", encoding="utf8") as opened_path: + content = opened_path.read() + return content - def count(self, entity): - """Return count of the number of files in the `root` directory. 
+ @staticmethod + def _rename_path_for_deletion(path: Union[Path, str]) -> str: + """Rename a given path by appending '_delete' and move it to the renamed path. - Args: - entity (str): Desired entity type (ex. "objects", "metadata"). + :param Path path: Path to file to rename - Returns: - count (int): Number of files in the directory. + :return: Path to the renamed file """ - count = 0 - directory_to_count = "" - if entity == "objects": - directory_to_count = self.objects - elif entity == "metadata": - directory_to_count = self.metadata + if isinstance(path, str): + path = Path(path) + delete_path = path.with_name(path.stem + "_delete" + path.suffix) + shutil.move(path, delete_path) + # TODO: Adjust all code for constructing paths to use path and revise accordingly + return str(delete_path) + + @staticmethod + def _get_file_paths(directory: Union[str, Path]) -> Optional[List[Path]]: + """Get the file paths of a given directory if it exists + + :param mixed directory: String or path to directory. + + :raises FileNotFoundError: If the directory doesn't exist + + :return: file_paths - File paths of the given directory or None if directory doesn't exist + """ + if os.path.exists(directory): + files = os.listdir(directory) + file_paths = [ + directory / file for file in files if os.path.isfile(directory / file) + ] + return file_paths else: - raise ValueError( - f"entity: {entity} does not exist. Do you mean 'objects' or 'metadata'?" - ) + return None - for _, _, files in os.walk(directory_to_count): - for _ in files: - count += 1 - return count + @staticmethod + def _check_arg_data(data: Union[str, os.PathLike, io.BufferedReader]) -> bool: + """Checks a data argument to ensure that it is either a string, path, or stream + object. - # Other Static Methods + :param data: Object to validate (string, path, or stream). + :type data: str, os.PathLike, io.BufferedReader + + :return: True if valid. + """ + if ( + not isinstance(data, str) + and not isinstance(data, Path) + and not isinstance(data, io.BufferedIOBase) + ): + err_msg = ( + "FileHashStore - _validate_arg_data: Data must be a path, string or buffered" + + f" stream type. Data type supplied: {type(data)}" + ) + logging.error(err_msg) + raise TypeError(err_msg) + if isinstance(data, str): + if data.strip() == "": + err_msg = ( + "FileHashStore - _validate_arg_data: Data string cannot be empty." + ) + logging.error(err_msg) + raise TypeError(err_msg) + return True @staticmethod - def _validate_file_size(file_size): - """Checks whether a file size is > 0 and an int and throws exception if not. + def _check_integer(file_size: int) -> None: + """Check whether a given argument is an integer and greater than 0; + throw an exception if not. - Args: - file_size (int): file size to check + :param int file_size: File size to check. """ if file_size is not None: if not isinstance(file_size, int): - exception_string = ( - "FileHashStore - _is_file_size_valid: size given must be an integer." + err_msg = ( + "FileHashStore - _check_integer: size given must be an integer." + f" File size: {file_size}. Arg Type: {type(file_size)}." 
) - logging.error(exception_string) - raise TypeError(exception_string) - if file_size < 1 or not isinstance(file_size, int): - exception_string = ( - "FileHashStore - _is_file_size_valid: size given must be > 0" - ) - logging.error(exception_string) - raise ValueError(exception_string) + logging.error(err_msg) + raise TypeError(err_msg) + if file_size < 1: + err_msg = "FileHashStore - _check_integer: size given must be > 0" + logging.error(err_msg) + raise ValueError(err_msg) @staticmethod - def _is_string_none_or_empty(string, arg, method): - """Checks whether a string is None or empty and throws an exception if so. + def _check_string(string: str, arg: str) -> None: + """Check whether a string is None or empty - or if it contains an illegal character; + throws an exception if so. - Args: - string (string): Value to check - arg (): Name of argument to check - method (string): Calling method for logging purposes + :param str string: Value to check. + :param str arg: Name of the argument to check. """ - if string is None or string.replace(" ", "") == "": - exception_string = ( + if string is None or string.strip() == "" or any(ch.isspace() for ch in string): + method = inspect.stack()[1].function + err_msg = ( f"FileHashStore - {method}: {arg} cannot be None" + f" or empty, {arg}: {string}." ) - logging.error(exception_string) - raise ValueError(exception_string) + logging.error(err_msg) + raise ValueError(err_msg) @staticmethod - def _to_bytes(text): - """Convert text to sequence of bytes using utf-8 encoding. + def _cast_to_bytes(text: any) -> bytes: + """Convert text to a sequence of bytes using utf-8 encoding. - Args: - text (str): String to convert. - - Returns: - text (bytes): Bytes with utf-8 encoding. + :param Any text: String to convert. + :return: Bytes with utf-8 encoding. """ if not isinstance(text, bytes): text = bytes(text, "utf8") return text - @staticmethod - def get_sha256_hex_digest(string): - """Calculate the SHA-256 digest of a UTF-8 encoded string. - - Args: - string (string): String to convert. - - Returns: - hex (string): Hexadecimal string. - """ - hex_digest = hashlib.sha256(string.encode("utf-8")).hexdigest() - return hex_digest - -class Stream(object): +class Stream: """Common interface for file-like objects. The input `obj` can be a file-like object or a path to a file. If `obj` is @@ -1575,7 +2803,7 @@ class Stream(object): set its position back to ``0``. """ - def __init__(self, obj): + def __init__(self, obj: Union[IO[bytes], str, Path]): if hasattr(obj, "read"): pos = obj.tell() elif os.path.isfile(obj): @@ -1619,3 +2847,25 @@ def close(self): self._obj.close() else: self._obj.seek(self._pos) + + +@dataclass +class ObjectMetadata: + """Represents metadata associated with an object. + + The `ObjectMetadata` class represents metadata associated with an object, including + a persistent or authority-based identifier (`pid`), a content identifier (`cid`), + the size of the object in bytes (`obj_size`), and an optional list of hex digests + (`hex_digests`) to assist with validating objects. + + :param str pid: An authority-based or persistent identifier + :param str cid: A unique identifier for the object (Hash ID, hex digest). + :param int obj_size: The size of the object in bytes. + :param dict hex_digests: A list of hex digests to validate objects + (md5, sha1, sha256, sha384, sha512) (optional). 
+ """ + + pid: str + cid: str + obj_size: int + hex_digests: dict diff --git a/src/hashstore/filehashstore_exceptions.py b/src/hashstore/filehashstore_exceptions.py new file mode 100644 index 00000000..7acb77f8 --- /dev/null +++ b/src/hashstore/filehashstore_exceptions.py @@ -0,0 +1,135 @@ +"""FileHashStore custom exception module.""" + + +class StoreObjectForPidAlreadyInProgress(Exception): + """Custom exception thrown when called to store a data object for a pid that is already + progress. A pid can only ever reference one data object/content identifier so duplicate + requests are rejected immediately.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class IdentifierNotLocked(Exception): + """Custom exception thrown when an identifier (ex. 'pid' or 'cid') is not locked, which is + required to ensure thread safety.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class CidRefsContentError(Exception): + """Custom exception thrown when verifying reference files and a cid refs + file does not have a pid that is expected to be found.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class CidRefsFileNotFound(Exception): + """Custom exception thrown when verifying reference files and a cid refs + file is not found.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class OrphanPidRefsFileFound(Exception): + """Custom exception thrown when a cid refs file does not exist.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class PidRefsContentError(Exception): + """Custom exception thrown when verifying reference files and a pid refs + file does not contain the cid that it is expected.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class PidRefsFileNotFound(Exception): + """Custom exception thrown when verifying reference files and a pid refs + file is not found.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class PidRefsAlreadyExistsError(Exception): + """Custom exception thrown when a client calls 'tag_object' and the pid + that is being tagged is already accounted for (has a pid refs file and + is found in the cid refs file).""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class PidRefsDoesNotExist(Exception): + """Custom exception thrown when a pid refs file does not exist.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class PidNotFoundInCidRefsFile(Exception): + """Custom exception thrown when pid reference file exists with a cid, but + the respective cid reference file does not contain the pid.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class NonMatchingObjSize(Exception): + """Custom exception thrown when verifying an object and the expected file size + does not match what has been calculated.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class NonMatchingChecksum(Exception): + """Custom exception thrown when verifying an object and the expected checksum + does not match what has been calculated.""" + + def __init__(self, message, errors=None): + super().__init__(message) 
+ self.errors = errors + + +class RefsFileExistsButCidObjMissing(Exception): + """Custom exception thrown when pid and cid refs file exists, but the + cid object does not.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class HashStoreRefsAlreadyExists(Exception): + """Custom exception thrown when called to tag an object that is already tagged appropriately.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors + + +class UnsupportedAlgorithm(Exception): + """Custom exception thrown when a given algorithm is not supported in HashStore for + calculating hashes/checksums, as the default store algo and/or other operations.""" + + def __init__(self, message, errors=None): + super().__init__(message) + self.errors = errors diff --git a/src/hashstore/hashstore.py b/src/hashstore/hashstore.py index 6c704209..20a93fd8 100644 --- a/src/hashstore/hashstore.py +++ b/src/hashstore/hashstore.py @@ -1,13 +1,13 @@ """Hashstore Interface""" + from abc import ABC, abstractmethod -from collections import namedtuple import importlib.metadata +import importlib.util class HashStore(ABC): - """HashStore is a content-addressable file management system that - utilizes a persistent identifier (PID) in the form of a hex digest - value to address files.""" + """HashStore is a content-addressable file management system that utilizes + an object's content identifier (hex digest/checksum) to address files.""" @staticmethod def version(): @@ -25,174 +25,185 @@ def store_object( checksum_algorithm, expected_object_size, ): - """The `store_object` method is responsible for the atomic storage of objects to - disk using a given InputStream and a persistent identifier (pid). Upon - successful storage, the method returns a ObjectMetadata object containing - relevant file information, such as the file's id (which can be used to locate the - object on disk), the file's size, and a hex digest map of algorithms and checksums. - `store_object` also ensures that an object is stored only once by synchronizing - multiple calls and rejecting calls to store duplicate objects. - - The file's id is determined by calculating the SHA-256 hex digest of the - provided pid, which is also used as the permanent address of the file. The - file's identifier is then sharded using a depth of 3 and width of 2, - delimited by '/' and concatenated to produce the final permanent address - and is stored in the `/store_directory/objects/` directory. - - By default, the hex digest map includes the following hash algorithms: - Default algorithms and hex digests to return: md5, sha1, sha256, sha384, sha512, - which are the most commonly used algorithms in dataset submissions to DataONE - and the Arctic Data Center. If an additional algorithm is provided, the - `store_object` method checks if it is supported and adds it to the map along - with its corresponding hex digest. An algorithm is considered "supported" if it - is recognized as a valid hash algorithm in the `hashlib` library. - - Similarly, if a file size and/or checksum & checksumAlgorithm value are provided, - `store_object` validates the object to ensure it matches the given arguments - before moving the file to its permanent address. - - Args: - pid (string): Authority-based identifier. - data (mixed): String or path to object. - additional_algorithm (string): Additional hex digest to include. - checksum (string): Checksum to validate against. 
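Every exception class introduced in the new `filehashstore_exceptions.py` module above follows the same constructor pattern: the message goes to `Exception.__init__` and an optional `errors` payload is kept on the instance. A minimal sketch (not part of this patch) of how a caller might surface that payload; the `report_hashstore_error` helper and the sample payload are illustrative only:

```python
from hashstore.filehashstore_exceptions import NonMatchingChecksum


def report_hashstore_error(exc: Exception) -> None:
    # Each exception in filehashstore_exceptions stores its message via
    # Exception.__init__ and keeps any supplementary detail on the optional
    # `errors` attribute (often None).
    print(f"{type(exc).__name__}: {exc}")
    if getattr(exc, "errors", None) is not None:
        print(f"  details: {exc.errors}")


try:
    raise NonMatchingChecksum(
        "checksum does not match what was calculated",
        errors={"expected": "ab12...", "calculated": "cd34..."},
    )
except NonMatchingChecksum as err:
    report_hashstore_error(err)
```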
- checksum_algorithm (string): Algorithm of supplied checksum. - expected_object_size (int): Size of object to verify - - Returns: - object_metadata (ObjectMetadata): Object that contains the permanent address, - file size, duplicate file boolean and hex digest dictionary. + """Atomic storage of objects to disk using a given stream. Upon successful storage, + it returns an `ObjectMetadata` object containing relevant file information, such as + a persistent identifier that references the data file, the file's size, and a hex digest + dictionary of algorithms and checksums. The method also tags the object, creating + references for discoverability. + + `store_object` ensures that an object is stored only once by synchronizing multiple calls + and rejecting attempts to store duplicate objects. If called without a pid, it stores the + object without tagging, and it becomes the caller's responsibility to finalize the process + by calling `tag_object` after verifying the correct object is stored. + + The file's permanent address is determined by calculating the object's content identifier + based on the store's default algorithm, which is also the permanent address of the file. + The content identifier is then sharded using the store's configured depth and width, + delimited by '/', and concatenated to produce the final permanent address. This address + is stored in the `/store_directory/objects/` directory. + + By default, the hex digest map includes common hash algorithms (md5, sha1, sha256, sha384, + sha512). If an additional algorithm is provided, the method checks if it is supported and + adds it to the hex digests dictionary along with its corresponding hex digest. An algorithm + is considered "supported" if it is recognized as a valid hash algorithm in the `hashlib` + library. + + If file size and/or checksum & checksum_algorithm values are provided, `store_object` + validates the object to ensure it matches the given arguments before moving the file to + its permanent address. + + :param str pid: Authority-based identifier. + :param mixed data: String or path to the object. + :param str additional_algorithm: Additional hex digest to include. + :param str checksum: Checksum to validate against. + :param str checksum_algorithm: Algorithm of the supplied checksum. + :param int expected_object_size: Size of the object to verify. + + :return: ObjectMetadata - Object containing the persistent identifier (pid), + content identifier (cid), object size and hex digests dictionary (checksums). + """ + raise NotImplementedError() + + @abstractmethod + def tag_object(self, pid, cid): + """Creates references that allow objects stored in HashStore to be discoverable. + Retrieving, deleting or calculating a hex digest of an object is based on a pid + argument, to proceed, we must be able to find the object associated with the pid. + + :param str pid: Authority-based or persistent identifier of the object. + :param str cid: Content identifier of the object. """ raise NotImplementedError() @abstractmethod def store_metadata(self, pid, metadata, format_id): - """The `store_metadata` method is responsible for adding and/or updating metadata - (ex. `sysmeta`) to disk using a given path/stream, a persistent identifier `pid` - and a metadata `format_id`. The metadata object's permanent address, which is - determined by calculating the SHA-256 hex digest of the provided `pid` + `format_id`. - - Upon successful storage of metadata, `store_metadata` returns a string that - represents the file's permanent address. 
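The revised `store_object` contract returns an `ObjectMetadata` instance carrying `pid`, `cid`, `obj_size` and `hex_digests`. A minimal sketch of that call, assuming the module layout and configuration keys shown in this patch; the store path, pid and data file are placeholders:

```python
from hashstore.hashstore import HashStoreFactory

# Hypothetical store root; the remaining keys follow the configuration documented here.
properties = {
    "store_path": "/var/metacat/hashstore",
    "store_depth": 3,
    "store_width": 2,
    "store_algorithm": "SHA-256",
    "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
}
store = HashStoreFactory.get_hashstore(
    "hashstore.filehashstore", "FileHashStore", properties
)

pid = "doi:10.12345/EXAMPLE"  # hypothetical persistent identifier
object_metadata = store.store_object(pid, "/tmp/example-dataset.csv")

print(object_metadata.cid)                    # content identifier (default algorithm digest)
print(object_metadata.obj_size)               # object size in bytes
print(object_metadata.hex_digests["sha256"])  # md5/sha1/sha256/sha384/sha512 by default

# Per the docstring above, an object may also be stored without a pid and tagged
# afterwards, e.g. store.tag_object(pid, object_metadata.cid), once verified.
```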
Lastly, the metadata objects are stored - in parallel to objects in the `/store_directory/metadata/` directory. - - Args: - pid (string): Authority-based identifier. - format_id (string): Metadata format - metadata (mixed): String or path to metadata document. - - Returns: - metadata_cid (string): Address of the metadata document. + """Add or update metadata, such as `sysmeta`, to disk using the given path/stream. The + `store_metadata` method uses a persistent identifier `pid` and a metadata `format_id` + to determine the permanent address of the metadata object. All metadata documents for a + given `pid` will be stored in a directory that follows the HashStore configuration + settings (under ../metadata) that is determined by calculating the hash of the given pid. + Metadata documents are stored in this directory, and is each named using the hash of the pid + and metadata format (`pid` + `format_id`). + + Upon successful storage of metadata, the method returns a string representing the file's + permanent address. Metadata objects are stored in parallel to objects in the + `/store_directory/metadata/` directory. + + :param str pid: Authority-based identifier. + :param mixed metadata: String or path to the metadata document. + :param str format_id: Metadata format. + + :return: str - Address of the metadata document. """ raise NotImplementedError() @abstractmethod def retrieve_object(self, pid): - """The `retrieve_object` method retrieves an object from disk using a given - persistent identifier (pid). If the object exists (determined by calculating - the object's permanent address using the SHA-256 hash of the given pid), the - method will open and return a buffered object stream ready to read from. + """Retrieve an object from disk using a persistent identifier (pid). The `retrieve_object` + method opens and returns a buffered object stream ready for reading if the object + associated with the provided `pid` exists on disk. - Args: - pid (string): Authority-based identifier. + :param str pid: Authority-based identifier. - Returns: - obj_stream (io.BufferedReader): A buffered stream of a data object. + :return: io.BufferedReader - Buffered stream of the data object. """ raise NotImplementedError() @abstractmethod def retrieve_metadata(self, pid, format_id): - """The 'retrieve_metadata' method retrieves the metadata object from disk using - a given persistent identifier (pid) and metadata namespace (format_id). - If the object exists (determined by calculating the metadata object's permanent - address using the SHA-256 hash of the given pid+format_id), the method will open - and return a buffered metadata stream ready to read from. - - Args: - pid (string): Authority-based identifier - format_id (string): Metadata format - - Returns: - metadata_stream (io.BufferedReader): A buffered stream of a metadata object. + """Retrieve the metadata object from disk using a persistent identifier (pid) + and metadata namespace (format_id). If the metadata document exists, the method opens + and returns a buffered metadata stream ready for reading. + + :param str pid: Authority-based identifier. + :param str format_id: Metadata format. + + :return: io.BufferedReader - Buffered stream of the metadata object. """ raise NotImplementedError() @abstractmethod def delete_object(self, pid): - """The 'delete_object' method deletes an object permanently from disk using a - given persistent identifier. + """Deletes an object and its related data permanently from HashStore using a given + persistent identifier. 
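The `store_metadata` / `retrieve_metadata` pair addresses metadata by the hash of `pid` + `format_id`, while `retrieve_object` streams the data object itself. A short round-trip sketch under the assumption that `store` is an initialized `FileHashStore` and the identifiers and file paths below are placeholders:

```python
# Assumes `store` is an initialized FileHashStore (see the factory sketch above)
# and that a data object has already been stored for this pid.
pid = "doi:10.12345/EXAMPLE"                                           # hypothetical
format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata"

# Store the system metadata document; the returned string is the stored document's address.
stored_metadata_path = store.store_metadata(pid, "/tmp/example-sysmeta.xml", format_id)
print(stored_metadata_path)

# Both retrieval calls return buffered streams ready for reading.
with store.retrieve_metadata(pid, format_id) as metadata_stream:
    sysmeta_bytes = metadata_stream.read()

with store.retrieve_object(pid) as obj_stream:
    first_chunk = obj_stream.read(8192)
```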
The object associated with the pid will be deleted if it is not + referenced by any other pids, along with its reference files and all metadata documents + found in its respective metadata directory. + + :param str pid: Persistent or Authority-based identifier. + """ + raise NotImplementedError() - Args: - pid (string): Authority-based identifier. + @abstractmethod + def delete_if_invalid_object( + self, object_metadata, checksum, checksum_algorithm, expected_file_size + ): + """Confirm equality of content in an ObjectMetadata. The `delete_invalid_object` method + will delete a data object if the object_metadata does not match the specified values. - Returns: - boolean: `True` upon successful deletion. + :param ObjectMetadata object_metadata: ObjectMetadata object. + :param str checksum: Value of the checksum. + :param str checksum_algorithm: Algorithm of the checksum. + :param int expected_file_size: Size of the temporary file. """ raise NotImplementedError() @abstractmethod def delete_metadata(self, pid, format_id): - """The 'delete_metadata' method deletes a metadata document permanently - from disk using a given persistent identifier and format_id. + """Deletes a metadata document (ex. `sysmeta`) permanently from HashStore using a given + persistent identifier (`pid`) and format_id (metadata namespace). If a `format_id` is + not supplied, all metadata documents associated with the given `pid` will be deleted. - Args: - pid (string): Authority-based identifier - format_id (string): Metadata format - - Returns: - boolean: `True` upon successful deletion. + :param str pid: Authority-based identifier. + :param str format_id: Metadata format. """ raise NotImplementedError() @abstractmethod def get_hex_digest(self, pid, algorithm): - """The 'get_hex_digest' method calculates the hex digest of an object that exists - in HashStore using a given persistent identifier and hash algorithm. + """Calculates the hex digest of an object that exists in HashStore using a given persistent + identifier and hash algorithm. - Args: - pid (string): Authority-based identifier. - algorithm (string): Algorithm of hex digest to generate. + :param str pid: Authority-based identifier. + :param str algorithm: Algorithm of hex digest to generate. - Returns: - hex_digest (string): Hex digest of the object. + :return: str - Hex digest of the object. """ raise NotImplementedError() class HashStoreFactory: - """A factory class for creating `HashStore`-like objects (classes - that implement the 'HashStore' abstract methods) + """A factory class for creating `HashStore`-like objects. + + The `HashStoreFactory` class serves as a factory for creating `HashStore`-like objects, + which are classes that implement the 'HashStore' abstract methods. - This factory class provides a method to retrieve a `HashStore` object - based on a given module (ex. "hashstore.filehashstore.filehashstore") - and class name (ex. "FileHashStore"). + This factory class provides a method to retrieve a `HashStore` object based on a given module + (e.g., "hashstore.filehashstore.filehashstore") and class name (e.g., "FileHashStore"). """ @staticmethod def get_hashstore(module_name, class_name, properties=None): """Get a `HashStore`-like object based on the specified `module_name` and `class_name`. - Args: - module_name (str): Name of package (ex. "hashstore.filehashstore") \n - class_name (str): Name of class in the given module (ex. 
"FileHashStore") \n - properties (dict, optional): Desired HashStore properties, if 'None', default values - will be used. \n - Example Properties Dictionary: - { - "store_path": "var/metacat",\n - "store_depth": 3,\n - "store_width": 2,\n - "store_algorithm": "sha256",\n - "store_sysmeta_namespace": "http://ns.dataone.org/service/types/v2.0"\n - } - - Returns: - HashStore: A hash store object based on the given `module_name` and `class_name` - - Raises: - ModuleNotFoundError: If module is not found - AttributeError: If class does not exist within the module + The `get_hashstore` method retrieves a `HashStore`-like object based on the provided + `module_name` and `class_name`, with optional custom properties. + + :param str module_name: Name of the package (e.g., "hashstore.filehashstore"). + :param str class_name: Name of the class in the given module (e.g., "FileHashStore"). + :param dict properties: Desired HashStore properties (optional). If `None`, default values + will be used. Example Properties Dictionary: + { + "store_path": "var/metacat", + "store_depth": 3, + "store_width": 2, + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + } + + :return: HashStore - A hash store object based on the given `module_name` and `class_name`. + + :raises ModuleNotFoundError: If the module is not found. + :raises AttributeError: If the class does not exist within the module. """ # Validate module if importlib.util.find_spec(module_name) is None: @@ -208,18 +219,3 @@ def get_hashstore(module_name, class_name, properties=None): raise AttributeError( f"Class name '{class_name}' is not an attribute of module '{module_name}'" ) - - -class ObjectMetadata(namedtuple("ObjectMetadata", ["id", "obj_size", "hex_digests"])): - """File address containing file's path on disk and its content hash ID. - - Args: - ab_id (str): Hash ID (hexdigest) of file contents. - obj_size (bytes): Size of the object - hex_digests (dict, optional): A list of hex digests to validate objects - (md5, sha1, sha256, sha384, sha512) - """ - - # Default value to prevent dangerous default value - def __new__(cls, ab_id, obj_size, hex_digests=None): - return super(ObjectMetadata, cls).__new__(cls, ab_id, obj_size, hex_digests) diff --git a/src/hashstore/client.py b/src/hashstore/hashstoreclient.py similarity index 75% rename from src/hashstore/client.py rename to src/hashstore/hashstoreclient.py index c1e2e4b6..b1c1cc67 100644 --- a/src/hashstore/client.py +++ b/src/hashstore/hashstoreclient.py @@ -1,4 +1,6 @@ +#!/usr/bin/env python """HashStore Command Line App""" + import logging import os from argparse import ArgumentParser @@ -104,6 +106,11 @@ def __init__(self): action="store_true", help="Delete objects in a HashStore", ) + self.parser.add_argument( + "-gbskip", + dest="gb_file_size_to_skip", + help="Number of objects to convert", + ) # Individual API call related optional arguments self.parser.add_argument( @@ -186,15 +193,16 @@ def __init__(self): help="Flag to delete a metadata document from a HashStore", ) - def load_store_properties(self, hashstore_yaml): + @staticmethod + def load_store_properties(hashstore_yaml): """Get and return the contents of the current HashStore config file. - Returns: - hashstore_yaml_dict (dict): HashStore properties with the following keys (and values): - store_depth (int): Depth when sharding an object's hex digest. - store_width (int): Width of directories when sharding an object's hex digest. 
- store_algorithm (str): Hash algorithm used for calculating the object's hex digest. - store_metadata_namespace (str): Namespace for the HashStore's system metadata. + :return: HashStore properties with the following keys (and values): + - store_depth (int): Depth when sharding an object's hex digest. + - store_width (int): Width of directories when sharding an object's hex digest. + - store_algorithm (str): Hash algorithm used for calculating the object's hex digest. + - store_metadata_namespace (str): Namespace for the HashStore's system metadata. + :rtype: dict """ property_required_keys = [ "store_depth", @@ -234,18 +242,28 @@ class HashStoreClient: MET_TYPE = "metadata" def __init__(self, properties, testflag=None): - """Initialize HashStore and MetacatDB + """Initialize the HashStoreClient with optional flag to test with the + test server at 'test.arcticdata.io' - Args: - properties: See FileHashStore for dictionary example - testflag (str): "knbvm" to initialize MetacatDB + :param dict properties: HashStore properties to initialize with + :param str testflag: 'knbvm' to denote testing on 'test.arcticdata.io' """ factory = HashStoreFactory() # Get HashStore from factory - module_name = "filehashstore" + if testflag: + # Set multiprocessing to true if testing in knbvm + module_name = "filehashstore" + os.environ["USE_MULTIPROCESSING"] = "True" + else: + module_name = "hashstore.filehashstore" class_name = "FileHashStore" + use_multiprocessing = os.getenv("USE_MULTIPROCESSING", "False") == "True" + logging.info( + "HashStoreClient - use_multiprocessing (bool): %s", use_multiprocessing + ) + # Instance attributes self.hashstore = factory.get_hashstore(module_name, class_name, properties) logging.info("HashStoreClient - HashStore initialized.") @@ -257,20 +275,24 @@ def __init__(self, properties, testflag=None): # Methods relating to testing HashStore with knbvm (test.arcticdata.io) - def store_to_hashstore_from_list(self, origin_dir, obj_type, num): - """Store objects in a given directory into HashStore + def store_to_hashstore_from_list(self, origin_dir, obj_type, num, skip_obj_size): + """Store objects in a given directory into HashStore. - Args: - origin_dir (str): Directory to convert - obj_type (str): 'object' or 'metadata' - num (int): Number of files to store + :param str origin_dir: Directory to convert. + :param str obj_type: Type of objects ('object' or 'metadata'). + :param int num: Number of files to store. + :param int skip_obj_size: Size of obj in GB to skip (ex. 4 = 4GB) """ - info_msg = f"HashStore Client - Begin storing {obj_type} objects." + info_msg = f"HashStoreClient - Begin storing {obj_type} objects." logging.info(info_msg) # Object and Metadata list - metacat_obj_list = self.metacatdb.get_object_metadata_list(origin_dir, num) + metacat_obj_list = self.metacatdb.get_object_metadata_list( + origin_dir, num, skip_obj_size + ) + logging.info(info_msg) # Get list of objects to store from metacat db + checked_obj_list = None if obj_type == self.OBJ_TYPE: checked_obj_list = self.metacatdb.refine_list_for_objects( metacat_obj_list, "store" @@ -310,8 +332,7 @@ def store_to_hashstore_from_list(self, origin_dir, obj_type, num): def try_store_object(self, obj_tuple): """Store an object to HashStore and log exceptions as warning. - Args: - obj_tuple: See HashStore store_object signature for details. + :param obj_tuple: See HashStore store_object signature for details. 
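`load_store_properties` reads the store's `hashstore.yaml` and returns the configuration keys listed above. A sketch of the same idea using `yaml.safe_load` directly; the file path is a placeholder and the standalone helper is an assumption, not the client's implementation:

```python
import yaml

# Keys documented for hashstore.yaml in this patch; `store_path` itself is implied
# by the location of the configuration file inside the store root.
REQUIRED_KEYS = [
    "store_depth",
    "store_width",
    "store_algorithm",
    "store_metadata_namespace",
]


def read_hashstore_yaml(path: str) -> dict:
    with open(path, "r", encoding="utf-8") as yaml_file:
        yaml_data = yaml.safe_load(yaml_file)
    missing = [key for key in REQUIRED_KEYS if key not in yaml_data]
    if missing:
        raise KeyError(f"hashstore.yaml is missing required keys: {missing}")
    return {key: yaml_data[key] for key in REQUIRED_KEYS}


# Example (placeholder path):
# props = read_hashstore_yaml("/var/metacat/hashstore/hashstore.yaml")
```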
""" try: self.hashstore.store_object(*obj_tuple) @@ -321,10 +342,10 @@ def try_store_object(self, obj_tuple): print(so_exception) def try_store_metadata(self, obj_tuple): - """Store an object to HashStore and log exceptions as warning. + """Store a metadata document to HashStore and log exceptions as warning. Args: - obj_tuple: See HashStore store_object signature for details. + obj_tuple: See HashStore store_metadata signature for details. """ try: self.hashstore.store_metadata(*obj_tuple) @@ -333,23 +354,28 @@ def try_store_metadata(self, obj_tuple): except Exception as so_exception: print(so_exception) - def retrieve_and_validate_from_hashstore(self, origin_dir, obj_type, num): + def retrieve_and_validate_from_hashstore( + self, origin_dir, obj_type, num, skip_obj_size + ): """Retrieve objects or metadata from a Hashstore and validate the content. - Args: - origin_dir (str): Directory to convert - obj_type (str): 'object' or 'metadata' - num (int): Number of files to store + :param str origin_dir: Directory to convert. + :param str obj_type: Type of objects ('object' or 'metadata'). + :param int num: Number of files to store. + :param int skip_obj_size: Size of obj in GB to skip (ex. 4 = 4GB) """ info_msg = ( f"HashStore Client - Begin retrieving and validating {obj_type} objects." ) logging.info(info_msg) # Object and Metadata list - metacat_obj_list = self.metacatdb.get_object_metadata_list(origin_dir, num) + metacat_obj_list = self.metacatdb.get_object_metadata_list( + origin_dir, num, skip_obj_size + ) # Get list of objects to store from metacat db logging.info("HashStore Client - Refining object list for %s", obj_type) + checked_obj_list = None if obj_type == self.OBJ_TYPE: checked_obj_list = self.metacatdb.refine_list_for_objects( metacat_obj_list, "retrieve" @@ -384,15 +410,14 @@ def retrieve_and_validate_from_hashstore(self, origin_dir, obj_type, num): def validate_object(self, obj_tuple): """Retrieves an object from HashStore and validates its checksum. - Args: - obj_tuple: pid_guid, obj_checksum_algo, obj_checksum + :param obj_tuple: Tuple containing pid_guid, obj_checksum_algo, obj_checksum. """ pid_guid = obj_tuple[0] algo = obj_tuple[1] obj_db_checksum = obj_tuple[2] with self.hashstore.retrieve_object(pid_guid) as obj_stream: - computed_digest = self.hashstore.computehash(obj_stream, algo) + computed_digest = self.hashstore.get_hex_digest(obj_stream, algo) obj_stream.close() if computed_digest != obj_db_checksum: @@ -407,10 +432,9 @@ def validate_object(self, obj_tuple): return def validate_metadata(self, obj_tuple): - """Retrieves a metadata from HashStore and validates its checksum + """Retrieves a metadata from HashStore and validates its checksum. - Args: - obj_tuple: pid_guid, format_id, obj_checksum, obj_algorithm + :param obj_tuple: Tuple containing pid_guid, format_id, obj_checksum, obj_algorithm. """ pid_guid = obj_tuple[0] namespace = obj_tuple[1] @@ -432,19 +456,23 @@ def validate_metadata(self, obj_tuple): return - def delete_objects_from_list(self, origin_dir, obj_type, num): - """Store objects in a given directory into HashStore - Args: - origin_dir (str): Directory to convert - obj_type (str): 'object' or 'metadata' - num (int): Number of files to store + def delete_objects_from_list(self, origin_dir, obj_type, num, skip_obj_size): + """Deletes objects in a given directory into HashStore. + + :param str origin_dir: Directory to convert. + :param str obj_type: Type of objects ('object' or 'metadata'). + :param int num: Number of files to store. 
+ :param int skip_obj_size: Size of obj in GB to skip (ex. 4 = 4GB) """ info_msg = f"HashStore Client - Begin deleting {obj_type} objects." logging.info(info_msg) # Object and Metadata list - metacat_obj_list = self.metacatdb.get_object_metadata_list(origin_dir, num) + metacat_obj_list = self.metacatdb.get_object_metadata_list( + origin_dir, num, skip_obj_size + ) # Get list of objects to store from metacat db + checked_obj_list = None if obj_type == self.OBJ_TYPE: checked_obj_list = self.metacatdb.refine_list_for_objects( metacat_obj_list, "delete" @@ -482,10 +510,9 @@ def delete_objects_from_list(self, origin_dir, obj_type, num): logging.info(content) def try_delete_object(self, obj_pid): - """Delete an object to HashStore and log exceptions as warning. + """Delete an object from HashStore and log exceptions as a warning. - Args: - obj_pid (str): Pid of object to delete + :param str obj_pid: PID of the object to delete. """ try: self.hashstore.delete_object(obj_pid) @@ -495,10 +522,9 @@ def try_delete_object(self, obj_pid): print(do_exception) def try_delete_metadata(self, obj_tuple): - """Delete an object to HashStore and log exceptions as warning. + """Delete an object from HashStore and log exceptions as a warning. - Args: - obj_tuple: pid_guid, format_id (namespace) + :param obj_tuple: Tuple containing the PID and format ID (namespace). """ pid_guid = obj_tuple[0] namespace = obj_tuple[1] @@ -543,12 +569,12 @@ def __init__(self, hashstore_path, hashstore): checked_property = yaml_data[key] self.db_yaml_dict[key] = checked_property - def get_object_metadata_list(self, origin_directory, num): - """Query the metacat db for the full obj and metadata list and order by guid. + def get_object_metadata_list(self, origin_directory, num, skip_obj_size=None): + """Query the Metacat database for the full object and metadata list, ordered by GUID. - Args: - origin_directory (string): 'var/metacat/data' or 'var/metacat/documents' - num (int): Number of rows to retrieve from metacat db + :param str origin_directory: 'var/metacat/data' or 'var/metacat/documents'. + :param int num: Number of rows to retrieve from the Metacat database. + :param int skip_obj_size: Size of obj in GB to skip (ex. 4 = 4GB), defaults to 'None' """ # Create a connection to the database db_user = self.db_yaml_dict["db_user"] @@ -575,8 +601,9 @@ def get_object_metadata_list(self, origin_directory, num): limit_query = f" LIMIT {num}" query = f"""SELECT identifier.guid, identifier.docid, identifier.rev, systemmetadata.object_format, systemmetadata.checksum, - systemmetadata.checksum_algorithm FROM identifier INNER JOIN systemmetadata - ON identifier.guid = systemmetadata.guid ORDER BY identifier.guid{limit_query};""" + systemmetadata.checksum_algorithm, systemmetadata.size FROM identifier INNER JOIN + systemmetadata ON identifier.guid = systemmetadata.guid ORDER BY + identifier.guid{limit_query};""" cursor.execute(query) # Fetch all rows from the result set @@ -585,21 +612,31 @@ def get_object_metadata_list(self, origin_directory, num): # Create full object list to store into HashStore print("Creating list of objects and metadata from metacat db") object_metadata_list = [] + gb_files_to_skip = None + if skip_obj_size is not None: + gb_files_to_skip = int(skip_obj_size) * (1024**3) + for row in rows: - # Get pid, filepath and formatId - pid_guid = row[0] - metadatapath_docid_rev = origin_directory + "/" + row[1] + "." 
+ str(row[2]) - metadata_namespace = row[3] - row_checksum = row[4] - row_checksum_algorithm = row[5] - tuple_item = ( - pid_guid, - metadatapath_docid_rev, - metadata_namespace, - row_checksum, - row_checksum_algorithm, - ) - object_metadata_list.append(tuple_item) + size = int(row[6]) + if gb_files_to_skip is not None and size > gb_files_to_skip: + continue + else: + # Get pid, filepath and formatId + pid_guid = row[0] + metadatapath_docid_rev = ( + origin_directory + "/" + row[1] + "." + str(row[2]) + ) + metadata_namespace = row[3] + row_checksum = row[4] + row_checksum_algorithm = row[5] + tuple_item = ( + pid_guid, + metadatapath_docid_rev, + metadata_namespace, + row_checksum, + row_checksum_algorithm, + ) + object_metadata_list.append(tuple_item) # Close the cursor and connection when done cursor.close() @@ -607,18 +644,18 @@ def get_object_metadata_list(self, origin_directory, num): return object_metadata_list - def refine_list_for_objects(self, metacat_obj_list, action): + @staticmethod + def refine_list_for_objects(metacat_obj_list, action): """Refine a list of objects by checking for file existence and removing duplicates. - Args: - metacat_obj_list (List): List of tuple objects representing rows from metacat db - action (string): "store", "retrieve" or "delete". - "store" will create a list of objects to store that do not exist in HashStore. - "retrieve" will create a list of objects that exist in HashStore. - "delete" will create a list of object pids - - Returns: - refined_object_list (List): List of tuple objects based on "action" + :param List metacat_obj_list: List of tuple objects representing rows from Metacat database. + :param str action: Action to perform. Options: "store", "retrieve", or "delete". + - "store": Create a list of objects to store that do not exist in HashStore. + - "retrieve": Create a list of objects that exist in HashStore. + - "delete": Create a list of object PIDs to delete. + + :return: Refined list of tuple objects based on the specified action. 
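The new `skip_obj_size` option converts a gigabyte threshold into bytes (`int(skip_obj_size) * 1024**3`) and skips any row whose `systemmetadata.size` exceeds it. The same filtering logic in isolation; the row shape mirrors the query in this patch and the sample values are made up:

```python
def filter_rows_by_size(rows, skip_obj_size=None):
    """Drop rows whose object size (bytes, last column) exceeds the GB threshold."""
    gb_files_to_skip = None
    if skip_obj_size is not None:
        gb_files_to_skip = int(skip_obj_size) * (1024**3)  # GB -> bytes

    kept = []
    for row in rows:
        size = int(row[6])  # systemmetadata.size is the seventh selected column
        if gb_files_to_skip is not None and size > gb_files_to_skip:
            continue
        kept.append(row)
    return kept


# Example: a 5 GB object is skipped with `-gbskip 4`, a 10 MB object is kept.
sample_rows = [
    ("pid:a", "doc1", 1, "fmt", "chk", "SHA-256", 5 * 1024**3),
    ("pid:b", "doc2", 1, "fmt", "chk", "SHA-256", 10 * 1024**2),
]
print(len(filter_rows_by_size(sample_rows, skip_obj_size=4)))  # 1
```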
+ :rtype: List """ refined_object_list = [] for tuple_item in metacat_obj_list: @@ -628,50 +665,40 @@ def refine_list_for_objects(self, metacat_obj_list, action): item_checksum_algorithm = tuple_item[4] if os.path.exists(filepath_docid_rev): if action == "store": - # If the file has already been stored, skip it - if not self.hashstore.exists( - "objects", self.hashstore.get_sha256_hex_digest(pid_guid) - ): - # This tuple is formed to match 'HashStore' store_object's signature - # Which is '.starmap()'ed when called - store_object_tuple_item = ( - pid_guid, - filepath_docid_rev, - None, - item_checksum, - item_checksum_algorithm, - ) - refined_object_list.append(store_object_tuple_item) + # This tuple is formed to match 'HashStore' store_object's signature + # Which is '.starmap()'ed when called + store_object_tuple_item = ( + pid_guid, + filepath_docid_rev, + None, + item_checksum, + item_checksum_algorithm, + ) + refined_object_list.append(store_object_tuple_item) if action == "retrieve": - if self.hashstore.exists( - "objects", self.hashstore.get_sha256_hex_digest(pid_guid) - ): - retrieve_object_tuple_item = ( - pid_guid, - item_checksum_algorithm, - item_checksum, - ) - refined_object_list.append(retrieve_object_tuple_item) + retrieve_object_tuple_item = ( + pid_guid, + item_checksum_algorithm, + item_checksum, + ) + refined_object_list.append(retrieve_object_tuple_item) if action == "delete": - if self.hashstore.exists( - "objects", self.hashstore.get_sha256_hex_digest(pid_guid) - ): - refined_object_list.append(pid_guid) + refined_object_list.append(pid_guid) return refined_object_list - def refine_list_for_metadata(self, metacat_obj_list, action): + @staticmethod + def refine_list_for_metadata(metacat_obj_list, action): """Refine a list of metadata by checking for file existence and removing duplicates. - Args: - metacat_obj_list (List): List of tuple objects representing rows from metacat db - action (string): "store", "retrieve" or "delete". - "store" will create a list of metadata to store that do not exist in HashStore. - "retrieve" will create a list of metadata that exist in HashStore. - "delete" will create a list of metadata pids with their format_ids - - Returns: - refined_object_list (List): List of tuple metadata based on "action" + :param List metacat_obj_list: List of tuple objects representing rows from metacat db. + :param str action: Action to perform - "store", "retrieve", or "delete". + - "store": Create a list of metadata to store that do not exist in HashStore. + - "retrieve": Create a list of metadata that exist in HashStore. + - "delete": Create a list of metadata pids with their format_ids. + + :return: List of tuple metadata based on the specified action. 
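`refine_list_for_objects` builds tuples shaped like `store_object`'s positional arguments so the client can fan them out with `starmap`. A hedged sketch of that pattern; the thread-based pool, pool size and wrapper function are assumptions, while the tuple layout `(pid_guid, filepath, None, checksum, checksum_algorithm)` follows the patch:

```python
from multiprocessing.dummy import Pool  # thread-based pool; illustrative choice


def store_objects_in_parallel(store, refined_object_list, processes=4):
    """Fan out (pid, filepath, additional_algo, checksum, checksum_algo) tuples
    to store_object, mirroring the client's log-and-continue behaviour."""

    def try_store_object(pid_guid, filepath, additional_algorithm, checksum, checksum_algorithm):
        try:
            store.store_object(
                pid_guid, filepath, additional_algorithm, checksum, checksum_algorithm
            )
        except Exception as exc:  # keep going; failures are reported, not raised
            print(f"store_object failed for {pid_guid}: {exc}")

    with Pool(processes) as pool:
        pool.starmap(try_store_object, refined_object_list)
```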
+ :rtype: List """ refined_metadata_list = [] for tuple_item in metacat_obj_list: @@ -682,41 +709,22 @@ def refine_list_for_metadata(self, metacat_obj_list, action): item_checksum_algorithm = tuple_item[4] if os.path.exists(filepath_docid_rev): if action == "store": - # If the file has already been stored, skip it - if not self.hashstore.exists( - "metadata", - self.hashstore.get_sha256_hex_digest( - pid_guid + metadata_namespace - ), - ): - tuple_item = (pid_guid, filepath_docid_rev, metadata_namespace) - refined_metadata_list.append(tuple_item) + tuple_item = (pid_guid, filepath_docid_rev, metadata_namespace) + refined_metadata_list.append(tuple_item) if action == "retrieve": - if self.hashstore.exists( - "metadata", - self.hashstore.get_sha256_hex_digest( - pid_guid + metadata_namespace - ), - ): - tuple_item = ( - pid_guid, - metadata_namespace, - item_checksum, - item_checksum_algorithm, - ) - refined_metadata_list.append(tuple_item) + tuple_item = ( + pid_guid, + metadata_namespace, + item_checksum, + item_checksum_algorithm, + ) + refined_metadata_list.append(tuple_item) if action == "delete": - if self.hashstore.exists( - "metadata", - self.hashstore.get_sha256_hex_digest( - pid_guid + metadata_namespace - ), - ): - tuple_item = ( - pid_guid, - metadata_namespace, - ) - refined_metadata_list.append(tuple_item) + tuple_item = ( + pid_guid, + metadata_namespace, + ) + refined_metadata_list.append(tuple_item) return refined_metadata_list @@ -746,6 +754,13 @@ def main(): f"Missing config file (hashstore.yaml) at store path: {store_path}." + " HashStore must first be initialized, use `--help` for more information." ) + else: + # Get the default format_id for sysmeta + with open(store_path_config_yaml, "r", encoding="utf-8") as hs_yaml_file: + yaml_data = yaml.safe_load(hs_yaml_file) + + default_formatid = yaml_data["store_metadata_namespace"] + # Setup logging, create log file if it doesn't already exist hashstore_py_log = store_path + "/python_client.log" python_log_file_path = Path(hashstore_py_log) @@ -773,6 +788,8 @@ def main(): checksum_algorithm = getattr(args, "object_checksum_algorithm") size = getattr(args, "object_size") formatid = getattr(args, "object_formatid") + if formatid is None: + formatid = default_formatid knbvm_test = getattr(args, "knbvm_flag") # Instantiate HashStore Client props = parser.load_store_properties(store_path_config_yaml) @@ -787,6 +804,7 @@ def main(): number_of_objects_to_convert = getattr(args, "num_obj_to_convert") # Determine if we are working with objects or metadata directory_type = getattr(args, "source_directory_type") + size_of_obj_to_skip = getattr(args, "gb_file_size_to_skip") accepted_directory_types = ["object", "metadata"] if directory_type not in accepted_directory_types: raise ValueError( @@ -798,18 +816,21 @@ def main(): directory_to_convert, directory_type, number_of_objects_to_convert, + size_of_obj_to_skip, ) if getattr(args, "retrieve_and_validate"): hashstore_c.retrieve_and_validate_from_hashstore( directory_to_convert, directory_type, number_of_objects_to_convert, + size_of_obj_to_skip, ) if getattr(args, "delete_from_hashstore"): hashstore_c.delete_objects_from_list( directory_to_convert, directory_type, number_of_objects_to_convert, + size_of_obj_to_skip, ) else: raise FileNotFoundError( @@ -844,7 +865,7 @@ def main(): raise ValueError("'-path' option is required") # Store metadata to HashStore metadata_cid = hashstore_c.hashstore.store_metadata(pid, path, formatid) - print(f"Metadata ID: {metadata_cid}") + print(f"Metadata 
Path: {metadata_cid}") elif getattr(args, "client_retrieveobject"): if pid is None: diff --git a/tests/conftest.py b/tests/conftest.py index 9b25c520..e10a83e0 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1,4 +1,5 @@ """Pytest overall configuration file for fixtures""" + import pytest from hashstore.filehashstore import FileHashStore @@ -16,8 +17,8 @@ def pytest_addoption(parser): @pytest.fixture(name="props") def init_props(tmp_path): """Properties to initialize HashStore.""" - directory = tmp_path / "metacat" - directory.mkdir() + directory = tmp_path / "metacat" / "hashstore" + directory.mkdir(parents=True) hashstore_path = directory.as_posix() # Note, objects generated via tests are placed in a temporary folder # with the 'directory' parameter above appended @@ -26,7 +27,7 @@ def init_props(tmp_path): "store_depth": 3, "store_width": 2, "store_algorithm": "SHA-256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", } return properties @@ -47,7 +48,6 @@ def init_pids(): test_pids = { "doi:10.18739/A2901ZH2M": { "file_size_bytes": 39993, - "object_cid": "0d555ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e", "metadata_cid": "323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7", "md5": "db91c910a3202478c8def1071c54aae5", "sha1": "1fe86e3c8043afa4c70857ca983d740ad8501ccd", @@ -55,10 +55,10 @@ def init_pids(): "sha256": "4d198171eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c", "sha384": "d5953bd802fa74edea72eb941ead7a27639e62792fedc065d6c81de6c613b5b8739ab1f90e7f24a7500d154a727ed7c2", "sha512": "e9bcd6b91b102ef5803d1bd60c7a5d2dbec1a2baf5f62f7da60de07607ad6797d6a9b740d97a257fd2774f2c26503d455d8f2a03a128773477dfa96ab96a2e54", + "blake2s": "5895fa29c17f8768d613984bb86791e5fcade7643c15e84663c03be89205d81e", }, "jtao.1700.1": { "file_size_bytes": 8724, - "object_cid": "a8241925740d5dcd719596639e780e0a090c9d55a5d0372b0eaf55ed711d4edf", "metadata_cid": "ddf07952ef28efc099d10d8b682480f7d2da60015f5d8873b6e1ea75b4baf689", "md5": "f4ea2d07db950873462a064937197b0f", "sha1": "3d25436c4490b08a2646e283dada5c60e5c0539d", @@ -66,10 +66,10 @@ def init_pids(): "sha256": "94f9b6c88f1f458e410c30c351c6384ea42ac1b5ee1f8430d3e365e43b78a38a", "sha384": "a204678330fcdc04980c9327d4e5daf01ab7541e8a351d49a7e9c5005439dce749ada39c4c35f573dd7d307cca11bea8", "sha512": "bf9e7f4d4e66bd082817d87659d1d57c2220c376cd032ed97cadd481cf40d78dd479cbed14d34d98bae8cebc603b40c633d088751f07155a94468aa59e2ad109", + "blake2s": "8978c46ee4cc5d1d79698752fd663c60c817d58d6aea901843bf4fc2cb173bef", }, "urn:uuid:1b35d0a5-b17a-423b-a2ed-de2b18dc367a": { "file_size_bytes": 18699, - "object_cid": "7f5cc18f0b04e812a3b4c8f686ce34e6fec558804bf61e54b176742a7f6368d6", "metadata_cid": "9a2e08c666b728e6cbd04d247b9e556df3de5b2ca49f7c5a24868eb27cddbff2", "md5": "e1932fc75ca94de8b64f1d73dc898079", "sha1": "c6d2a69a3f5adaf478ba796c114f57b990cf7ad1", @@ -77,6 +77,7 @@ def init_pids(): "sha256": "4473516a592209cbcd3a7ba4edeebbdb374ee8e4a49d19896fafb8f278dc25fa", "sha384": "b1023a9be5aa23a102be9bce66e71f1f1c7a6b6b03e3fc603e9cd36b4265671e94f9cc5ce3786879740536994489bc26", "sha512": "c7fac7e8aacde8546ddb44c640ad127df82830bba6794aea9952f737c13a81d69095865ab3018ed2a807bf9222f80657faf31cfde6c853d7b91e617e148fec76", + "blake2s": "c8c9aea2f7ddcfaf8db93ce95f18e467b6293660d1a0b08137636a3c92896765", }, } return test_pids diff --git a/tests/filehashstore/__init__.py b/tests/filehashstore/__init__.py new file mode 
100644 index 00000000..e69de29b diff --git a/tests/filehashstore/test_filehashstore.py b/tests/filehashstore/test_filehashstore.py new file mode 100644 index 00000000..0b0e94c3 --- /dev/null +++ b/tests/filehashstore/test_filehashstore.py @@ -0,0 +1,1970 @@ +"""Test module for FileHashStore init, core, utility and supporting methods.""" + +import io +import os +import hashlib +import shutil +from pathlib import Path +import pytest +from hashstore.filehashstore import FileHashStore, ObjectMetadata, Stream +from hashstore.filehashstore_exceptions import ( + OrphanPidRefsFileFound, + NonMatchingChecksum, + NonMatchingObjSize, + PidNotFoundInCidRefsFile, + PidRefsDoesNotExist, + RefsFileExistsButCidObjMissing, + UnsupportedAlgorithm, + HashStoreRefsAlreadyExists, + PidRefsAlreadyExistsError, + CidRefsContentError, + CidRefsFileNotFound, + PidRefsContentError, + PidRefsFileNotFound, + IdentifierNotLocked, +) + + +# pylint: disable=W0212 + + +def test_init_directories_created(store): + """Confirm that object and metadata directories have been created.""" + assert os.path.exists(store.root) + assert os.path.exists(store.objects) + assert os.path.exists(store.objects / "tmp") + assert os.path.exists(store.metadata) + assert os.path.exists(store.metadata / "tmp") + assert os.path.exists(store.refs) + assert os.path.exists(store.refs / "tmp") + assert os.path.exists(store.refs / "pids") + assert os.path.exists(store.refs / "cids") + + +def test_init_existing_store_incorrect_algorithm_format(store): + """Confirm that exception is thrown when store_algorithm is not a DataONE controlled value ( + the string must exactly match the expected format). DataONE uses the library of congress + vocabulary to standardize algorithm types.""" + properties = { + "store_path": store.root / "incorrect_algo_format", + "store_depth": 3, + "store_width": 2, + "store_algorithm": "sha256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + with pytest.raises(ValueError): + FileHashStore(properties) + + +def test_init_existing_store_correct_algorithm_format(store): + """Confirm second instance of HashStore with DataONE controlled value.""" + properties = { + "store_path": store.root, + "store_depth": 3, + "store_width": 2, + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + hashstore_instance = FileHashStore(properties) + assert isinstance(hashstore_instance, FileHashStore) + + +def test_init_write_properties_hashstore_yaml_exists(store): + """Verify config file present in store root directory.""" + assert os.path.exists(store.hashstore_configuration_yaml) + + +def test_init_with_existing_hashstore_mismatched_config_depth(store): + """Test init with existing HashStore raises a ValueError when supplied with + mismatching depth.""" + properties = { + "store_path": store.root, + "store_depth": 1, + "store_width": 2, + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + with pytest.raises(ValueError): + FileHashStore(properties) + + +def test_init_with_existing_hashstore_mismatched_config_width(store): + """Test init with existing HashStore raises a ValueError when supplied with + mismatching width.""" + properties = { + "store_path": store.root, + "store_depth": 3, + "store_width": 1, + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + with 
pytest.raises(ValueError): + FileHashStore(properties) + + +def test_init_with_existing_hashstore_mismatched_config_algo(store): + """Test init with existing HashStore raises a ValueError when supplied with + mismatching default algorithm.""" + properties = { + "store_path": store.root, + "store_depth": 3, + "store_width": 1, + "store_algorithm": "SHA-512", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + with pytest.raises(ValueError): + FileHashStore(properties) + + +def test_init_with_existing_hashstore_mismatched_config_metadata_ns(store): + """Test init with existing HashStore raises a ValueError when supplied with + mismatching default name space.""" + properties = { + "store_path": store.root, + "store_depth": 3, + "store_width": 1, + "store_algorithm": "SHA-512", + "store_metadata_namespace": "http://ns.dataone.org/service/types/v5.0", + } + with pytest.raises(ValueError): + FileHashStore(properties) + + +def test_init_with_existing_hashstore_missing_yaml(store, pids): + """Test init with existing store raises RuntimeError when hashstore.yaml + not found but objects exist.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + store._store_and_validate_data(pid, path) + os.remove(store.hashstore_configuration_yaml) + properties = { + "store_path": store.root, + "store_depth": 3, + "store_width": 2, + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + with pytest.raises(RuntimeError): + FileHashStore(properties) + + +def test_load_properties(store): + """Verify dictionary returned from _load_properties matches initialization.""" + hashstore_yaml_dict = store._load_properties( + store.hashstore_configuration_yaml, store.property_required_keys + ) + assert hashstore_yaml_dict.get("store_depth") == 3 + assert hashstore_yaml_dict.get("store_width") == 2 + assert hashstore_yaml_dict.get("store_algorithm") == "SHA-256" + assert ( + hashstore_yaml_dict.get("store_metadata_namespace") + == "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + ) + + +def test_load_properties_hashstore_yaml_missing(store): + """Confirm FileNotFoundError is raised when hashstore.yaml does not exist.""" + os.remove(store.hashstore_configuration_yaml) + with pytest.raises(FileNotFoundError): + store._load_properties( + store.hashstore_configuration_yaml, store.property_required_keys + ) + + +def test_validate_properties(store): + """Confirm no exceptions are thrown when all key/values are supplied.""" + properties = { + "store_path": "/etc/test", + "store_depth": 3, + "store_width": 2, + "store_algorithm": "SHA-256", + "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata", + } + assert store._validate_properties(properties) + + +def test_validate_properties_missing_key(store): + """Confirm exception raised when key missing in properties.""" + properties = { + "store_path": "/etc/test", + "store_depth": 3, + "store_width": 2, + "store_algorithm": "SHA-256", + } + with pytest.raises(KeyError): + store._validate_properties(properties) + + +def test_validate_properties_key_value_is_none(store): + """Confirm exception raised when a value from a key is 'None'.""" + properties = { + "store_path": "/etc/test", + "store_depth": 3, + "store_width": 2, + "store_algorithm": "SHA-256", + "store_metadata_namespace": None, + } + with pytest.raises(ValueError): + store._validate_properties(properties) + + +def 
test_validate_properties_incorrect_type(store): + """Confirm exception raised when a bad properties value is given.""" + properties = "etc/filehashstore/hashstore.yaml" + with pytest.raises(ValueError): + store._validate_properties(properties) + + +def test_set_default_algorithms_missing_yaml(store, pids): + """Confirm set_default_algorithms raises FileNotFoundError when hashstore.yaml + not found.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + store._store_and_validate_data(pid, path) + os.remove(store.hashstore_configuration_yaml) + with pytest.raises(FileNotFoundError): + store._set_default_algorithms() + + +# Tests for FileHashStore Core Methods + + +def test_find_object_no_sysmeta(pids, store): + """Test _find_object returns the correct content and expected value for non-existent sysmeta.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + obj_info_dict = store._find_object(pid) + retrieved_cid = obj_info_dict["cid"] + + assert retrieved_cid == object_metadata.hex_digests.get("sha256") + + data_object_path = store._get_hashstore_data_object_path(retrieved_cid) + assert data_object_path == obj_info_dict["cid_object_path"] + + cid_refs_path = store._get_hashstore_cid_refs_path(retrieved_cid) + assert cid_refs_path == obj_info_dict["cid_refs_path"] + + pid_refs_path = store._get_hashstore_pid_refs_path(pid) + assert pid_refs_path == obj_info_dict["pid_refs_path"] + + assert obj_info_dict["sysmeta_path"] == "Does not exist." + + +def test_find_object_sysmeta(pids, store): + """Test _find_object returns the correct content along with the sysmeta path""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + object_metadata = store.store_object(pid, path) + stored_metadata_path = store.store_metadata(pid, syspath, format_id) + + obj_info_dict = store._find_object(pid) + retrieved_cid = obj_info_dict["cid"] + + assert retrieved_cid == object_metadata.hex_digests.get("sha256") + + data_object_path = store._get_hashstore_data_object_path(retrieved_cid) + assert data_object_path == obj_info_dict["cid_object_path"] + + cid_refs_path = store._get_hashstore_cid_refs_path(retrieved_cid) + assert cid_refs_path == obj_info_dict["cid_refs_path"] + + pid_refs_path = store._get_hashstore_pid_refs_path(pid) + assert pid_refs_path == obj_info_dict["pid_refs_path"] + + assert str(obj_info_dict["sysmeta_path"]) == stored_metadata_path + + +def test_find_object_refs_exist_but_obj_not_found(pids, store): + """Test _find_object throws exception when refs file exist but the object does not.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + store.store_object(pid, path) + + cid = store._find_object(pid).get("cid") + obj_path = store._get_hashstore_data_object_path(cid) + os.remove(obj_path) + + with pytest.raises(RefsFileExistsButCidObjMissing): + store._find_object(pid) + + +def test_find_object_cid_refs_not_found(pids, store): + """Test _find_object throws exception when pid refs file is found (and contains a cid) + but the cid refs file does not exist.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + _object_metadata = store.store_object(pid, path) 
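The tests above exercise `FileHashStore` through fixtures such as `store` and `pids`. A minimal, self-contained sketch of the same setup using pytest's `tmp_path`, with the property keys this patch standardizes; the fixture name, pid and data file are placeholders rather than fixtures from the suite:

```python
import pytest
from hashstore.filehashstore import FileHashStore


@pytest.fixture(name="my_store")
def init_my_store(tmp_path):
    """Build a throwaway HashStore rooted in pytest's temporary directory."""
    root = tmp_path / "metacat" / "hashstore"
    root.mkdir(parents=True)
    properties = {
        "store_path": root.as_posix(),
        "store_depth": 3,
        "store_width": 2,
        "store_algorithm": "SHA-256",
        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
    }
    return FileHashStore(properties)


def test_store_object_returns_cid(my_store, tmp_path):
    data_file = tmp_path / "example.txt"  # placeholder data object
    data_file.write_text("hello hashstore")
    object_metadata = my_store.store_object("dou.test.example", str(data_file))
    # With SHA-256 as the store algorithm, the cid is the object's sha256 hex digest.
    assert object_metadata.cid == object_metadata.hex_digests["sha256"]
```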
+ + # Place the wrong cid into the pid refs file that has already been created + pid_ref_abs_path = store._get_hashstore_pid_refs_path(pid) + with open(pid_ref_abs_path, "w", encoding="utf8") as pid_ref_file: + pid_ref_file.seek(0) + pid_ref_file.write("intentionally.wrong.pid") + pid_ref_file.truncate() + + with pytest.raises(OrphanPidRefsFileFound): + store._find_object(pid) + + +def test_find_object_cid_refs_does_not_contain_pid(pids, store): + """Test _find_object throws exception when pid refs file is found (and contains a cid) + but the cid refs file does not contain the pid.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + + # Remove the pid from the cid refs file + cid_ref_abs_path = store._get_hashstore_cid_refs_path( + object_metadata.hex_digests.get("sha256") + ) + store._update_refs_file(cid_ref_abs_path, pid, "remove") + + with pytest.raises(PidNotFoundInCidRefsFile): + store._find_object(pid) + + +def test_find_object_pid_refs_not_found(store): + """Test _find_object throws exception when a pid refs file does not exist.""" + with pytest.raises(PidRefsDoesNotExist): + store._find_object("dou.test.1") + + +def test_find_object_pid_none(store): + """Test _find_object throws exception when pid is None.""" + with pytest.raises(ValueError): + store._find_object(None) + + +def test_find_object_pid_empty(store): + """Test _find_object throws exception when pid is empty.""" + with pytest.raises(ValueError): + store._find_object("") + + +def test_store_and_validate_data_files_path(pids, store): + """Test _store_and_validate_data accepts path object for the path arg.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir) / pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + assert store._exists("objects", object_metadata.cid) + + +def test_store_and_validate_data_files_string(pids, store): + """Test _store_and_validate_data accepts string for the path arg.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + assert store._exists("objects", object_metadata.cid) + + +def test_store_and_validate_data_files_stream(pids, store): + """Test _store_and_validate_data accepts stream for the path arg.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + object_metadata = store._store_and_validate_data(pid, input_stream) + input_stream.close() + assert store._exists("objects", object_metadata.cid) + assert store._count("objects") == 3 + + +def test_store_and_validate_data_cid(pids, store): + """Check _store_and_validate_data returns the expected content identifier""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + assert object_metadata.cid == pids[pid][store.algorithm] + + +def test_store_and_validate_data_file_size(pids, store): + """Check _store_and_validate_data returns correct file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + assert object_metadata.obj_size == pids[pid]["file_size_bytes"] + + +def test_store_and_validate_data_hex_digests(pids, store): + """Check _store_and_validate_data 
successfully generates hex digests dictionary.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + assert object_metadata.hex_digests.get("md5") == pids[pid]["md5"] + assert object_metadata.hex_digests.get("sha1") == pids[pid]["sha1"] + assert object_metadata.hex_digests.get("sha256") == pids[pid]["sha256"] + assert object_metadata.hex_digests.get("sha384") == pids[pid]["sha384"] + assert object_metadata.hex_digests.get("sha512") == pids[pid]["sha512"] + + +def test_store_and_validate_data_additional_algorithm(pids, store): + """Check _store_and_validate_data returns an additional algorithm in hex digests + when provided with an additional algo value.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + algo = "sha224" + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data( + pid, path, additional_algorithm=algo + ) + sha224_hash = object_metadata.hex_digests.get(algo) + assert sha224_hash == pids[pid][algo] + + +def test_store_and_validate_data_with_correct_checksums(pids, store): + """Check _store_and_validate_data stores a data object when a valid checksum and checksum + algorithm is supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + algo = "sha224" + algo_checksum = pids[pid][algo] + path = test_dir + pid.replace("/", "_") + store._store_and_validate_data( + pid, path, checksum=algo_checksum, checksum_algorithm=algo + ) + assert store._count("objects") == 3 + + +def test_store_and_validate_data_with_incorrect_checksum(pids, store): + """Check _store_and_validate_data does not store data objects when a bad checksum supplied.""" + test_dir = "tests/testdata/" + entity = "objects" + for pid in pids.keys(): + algo = "sha224" + algo_checksum = "badChecksumValue" + path = test_dir + pid.replace("/", "_") + with pytest.raises(NonMatchingChecksum): + store._store_and_validate_data( + pid, path, checksum=algo_checksum, checksum_algorithm=algo + ) + assert store._count(entity) == 0 + + +def test_store_data_only_cid(pids, store): + """Check _store_data_only returns correct id.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_data_only(path) + assert object_metadata.cid == pids[pid][store.algorithm] + + +def test_store_data_only_file_size(pids, store): + """Check _store_data_only returns correct file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_data_only(path) + assert object_metadata.obj_size == pids[pid]["file_size_bytes"] + + +def test_store_data_only_hex_digests(pids, store): + """Check _store_data_only generates a hex digests dictionary.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_data_only(path) + assert object_metadata.hex_digests.get("md5") == pids[pid]["md5"] + assert object_metadata.hex_digests.get("sha1") == pids[pid]["sha1"] + assert object_metadata.hex_digests.get("sha256") == pids[pid]["sha256"] + assert object_metadata.hex_digests.get("sha384") == pids[pid]["sha384"] + assert object_metadata.hex_digests.get("sha512") == pids[pid]["sha512"] + + +def test_move_and_get_checksums_id(pids, store): + """Test _move_and_get_checksums returns correct id.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + 
pid.replace("/", "_") + input_stream = io.open(path, "rb") + ( + move_id, + _, + _, + ) = store._move_and_get_checksums(pid, input_stream) + input_stream.close() + assert move_id == pids[pid][store.algorithm] + + +def test_move_and_get_checksums_file_size(pids, store): + """Test _move_and_get_checksums returns correct file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + ( + _, + tmp_file_size, + _, + ) = store._move_and_get_checksums(pid, input_stream) + input_stream.close() + assert tmp_file_size == pids[pid]["file_size_bytes"] + + +def test_move_and_get_checksums_hex_digests(pids, store): + """Test _move_and_get_checksums returns correct hex digests.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + ( + _, + _, + hex_digests, + ) = store._move_and_get_checksums(pid, input_stream) + input_stream.close() + assert hex_digests.get("md5") == pids[pid]["md5"] + assert hex_digests.get("sha1") == pids[pid]["sha1"] + assert hex_digests.get("sha256") == pids[pid]["sha256"] + assert hex_digests.get("sha384") == pids[pid]["sha384"] + assert hex_digests.get("sha512") == pids[pid]["sha512"] + + +def test_move_and_get_checksums_does_not_store_duplicate(pids, store): + """Test _move_and_get_checksums does not store duplicate objects.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + store._move_and_get_checksums(pid, input_stream) + input_stream.close() + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + store._move_and_get_checksums(pid, input_stream) + input_stream.close() + assert store._count("objects") == 3 + + +def test_move_and_get_checksums_raises_error_with_nonmatching_checksum(pids, store): + """Test _move_and_get_checksums raises error when incorrect checksum supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + with pytest.raises(NonMatchingChecksum): + # pylint: disable=W0212 + store._move_and_get_checksums( + pid, + input_stream, + checksum="nonmatchingchecksum", + checksum_algorithm="sha256", + ) + input_stream.close() + assert store._count("objects") == 0 + + +def test_move_and_get_checksums_incorrect_file_size(pids, store): + """Test _move_and_get_checksums raises error with an incorrect file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + with pytest.raises(NonMatchingObjSize): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + incorrect_file_size = 1000 + (_, _, _, _,) = store._move_and_get_checksums( + pid, input_stream, file_size_to_validate=incorrect_file_size + ) + input_stream.close() + + +def test_write_to_tmp_file_and_get_hex_digests_additional_algo(store): + """Test _write...hex_digests returns correct hex digests with an additional algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + input_stream = io.open(path, "rb") + checksum_algo = "sha3_256" + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + hex_digests, _, _ = store._write_to_tmp_file_and_get_hex_digests( + input_stream, additional_algorithm=checksum_algo + ) + input_stream.close() + assert hex_digests.get("sha3_256") == checksum_correct + + +def 
test_write_to_tmp_file_and_get_hex_digests_checksum_algo(store): + """Test _write...hex_digests returns correct hex digests when given a checksum_algorithm + and checksum.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + input_stream = io.open(path, "rb") + checksum_algo = "sha3_256" + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + hex_digests, _, _ = store._write_to_tmp_file_and_get_hex_digests( + input_stream, checksum_algorithm=checksum_algo + ) + input_stream.close() + assert hex_digests.get("sha3_256") == checksum_correct + + +def test_write_to_tmp_file_and_get_hex_digests_checksum_and_additional_algo(store): + """Test _write...hex_digests returns correct hex digests when an additional and + checksum algorithm is provided.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + input_stream = io.open(path, "rb") + additional_algo = "sha224" + additional_algo_checksum_correct = ( + "9b3a96f434f3c894359193a63437ef86fbd5a1a1a6cc37f1d5013ac1" + ) + checksum_algo = "sha3_256" + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + hex_digests, _, _ = store._write_to_tmp_file_and_get_hex_digests( + input_stream, + additional_algorithm=additional_algo, + checksum_algorithm=checksum_algo, + ) + input_stream.close() + assert hex_digests.get("sha3_256") == checksum_correct + assert hex_digests.get("sha224") == additional_algo_checksum_correct + + +def test_write_to_tmp_file_and_get_hex_digests_checksum_and_additional_algo_duplicate( + store, +): + """Test _write...hex_digests succeeds with duplicate algorithms (de-duplicates).""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + input_stream = io.open(path, "rb") + additional_algo = "sha224" + checksum_algo = "sha224" + checksum_correct = "9b3a96f434f3c894359193a63437ef86fbd5a1a1a6cc37f1d5013ac1" + hex_digests, _, _ = store._write_to_tmp_file_and_get_hex_digests( + input_stream, + additional_algorithm=additional_algo, + checksum_algorithm=checksum_algo, + ) + input_stream.close() + assert hex_digests.get("sha224") == checksum_correct + + +def test_write_to_tmp_file_and_get_hex_digests_file_size(pids, store): + """Test _write...hex_digests returns correct file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + _, _, tmp_file_size = store._write_to_tmp_file_and_get_hex_digests(input_stream) + input_stream.close() + assert tmp_file_size == pids[pid]["file_size_bytes"] + + +def test_write_to_tmp_file_and_get_hex_digests_hex_digests(pids, store): + """Test _write...hex_digests returns correct hex digests.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + hex_digests, _, _ = store._write_to_tmp_file_and_get_hex_digests(input_stream) + input_stream.close() + assert hex_digests.get("md5") == pids[pid]["md5"] + assert hex_digests.get("sha1") == pids[pid]["sha1"] + assert hex_digests.get("sha256") == pids[pid]["sha256"] + assert hex_digests.get("sha384") == pids[pid]["sha384"] + assert hex_digests.get("sha512") == pids[pid]["sha512"] + + +def test_write_to_tmp_file_and_get_hex_digests_tmpfile_object(pids, store): + """Test _write...hex_digests returns a tmp file successfully.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = 
io.open(path, "rb") + _, tmp_file_name, _ = store._write_to_tmp_file_and_get_hex_digests(input_stream) + input_stream.close() + assert os.path.isfile(tmp_file_name) is True + + +def test_write_to_tmp_file_and_get_hex_digests_with_unsupported_algorithm(pids, store): + """Test _write...hex_digests raises an exception when an unsupported algorithm supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + algo = "md2" + with pytest.raises(UnsupportedAlgorithm): + _, _, _ = store._write_to_tmp_file_and_get_hex_digests( + input_stream, additional_algorithm=algo + ) + with pytest.raises(UnsupportedAlgorithm): + _, _, _ = store._write_to_tmp_file_and_get_hex_digests( + input_stream, checksum_algorithm=algo + ) + input_stream.close() + + +def test_mktmpfile(store): + """Test that _mktmpfile creates and returns a tmp file.""" + path = store.root / "doutest" / "tmp" + store._create_path(path) + tmp = store._mktmpfile(path) + assert os.path.exists(tmp.name) + + +def test_store_hashstore_refs_files_(pids, store): + """Test _store_hashstore_refs_files does not throw exception when successful.""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + assert store._count("pid") == 3 + assert store._count("cid") == 3 + + +def test_store_hashstore_refs_files_pid_refs_file_exists(pids, store): + """Test _store_hashstore_refs_file creates the expected pid reference file.""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + pid_refs_file_path = store._get_hashstore_pid_refs_path(pid) + assert os.path.exists(pid_refs_file_path) + + +def test_store_hashstore_refs_file_cid_refs_file_exists(pids, store): + """Test _store_hashstore_refs_file creates the cid reference file.""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + cid_refs_file_path = store._get_hashstore_cid_refs_path(cid) + assert os.path.exists(cid_refs_file_path) + + +def test_store_hashstore_refs_file_pid_refs_file_content(pids, store): + """Test _store_hashstore_refs_file created the pid reference file with the expected cid.""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + pid_refs_file_path = store._get_hashstore_pid_refs_path(pid) + with open(pid_refs_file_path, "r", encoding="utf8") as f: + pid_refs_cid = f.read() + assert pid_refs_cid == cid + + +def test_store_hashstore_refs_file_cid_refs_file_content(pids, store): + """Test _store_hashstore_refs_file creates the cid reference file successfully with pid + tagged.""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + cid_refs_file_path = store._get_hashstore_cid_refs_path(cid) + with open(cid_refs_file_path, "r", encoding="utf8") as f: + pid_refs_cid = f.read().strip() + assert pid_refs_cid == pid + + +def test_store_hashstore_refs_file_pid_refs_found_cid_refs_found(pids, store): + """Test _store_hashstore_refs_file does not throw an exception when any refs file already exists + and verifies the content, and does not double tag the cid refs file.""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + + with pytest.raises(HashStoreRefsAlreadyExists): + store.tag_object(pid, cid) + + cid_refs_file_path = store._get_hashstore_cid_refs_path(cid) + line_count = 0 + 
with open(cid_refs_file_path, "r", encoding="utf8") as ref_file: + for _line in ref_file: + line_count += 1 + assert line_count == 1 + + +def test_store_hashstore_refs_files_pid_refs_found_cid_refs_not_found(store, pids): + """Test that _store_hashstore_refs_files throws an exception when pid refs file exists, + contains a different cid, and is correctly referenced in the associated cid refs file""" + for pid in pids.keys(): + cid = pids[pid][store.algorithm] + store._store_hashstore_refs_files(pid, cid) + + with pytest.raises(PidRefsAlreadyExistsError): + store._store_hashstore_refs_files( + pid, "another_cid_value_that_is_not_found" + ) + + +def test_store_hashstore_refs_files_refs_not_found_cid_refs_found(store): + """Test _store_hashstore_refs_files updates a cid reference file that already exists.""" + pid = "jtao.1700.1" + cid = "94f9b6c88f1f458e410c30c351c6384ea42ac1b5ee1f8430d3e365e43b78a38a" + # Tag object + store._store_hashstore_refs_files(pid, cid) + # Tag the cid with another pid + additional_pid = "dou.test.1" + store._store_hashstore_refs_files(additional_pid, cid) + + # Read cid file to confirm cid refs file contains the additional pid + line_count = 0 + cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + with open(cid_ref_abs_path, "r", encoding="utf8") as f: + for _, line in enumerate(f, start=1): + value = line.strip() + line_count += 1 + assert value == pid or value == additional_pid + assert line_count == 2 + assert store._count("pid") == 2 + assert store._count("cid") == 1 + + +def test_untag_object(pids, store): + """Test _untag_object untags successfully.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + cid = object_metadata.cid + + store._synchronize_referenced_locked_pids(pid) + store._synchronize_object_locked_cids(cid) + store._untag_object(pid, cid) + store._release_reference_locked_pids(pid) + store._release_object_locked_cids(cid) + + assert store._count("pid") == 0 + assert store._count("cid") == 0 + assert store._count("objects") == 3 + + +def test_untag_object_pid_not_locked(pids, store): + """Test _untag_object throws exception when pid is not locked""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + cid = object_metadata.cid + + with pytest.raises(IdentifierNotLocked): + store._untag_object(pid, cid) + + +def test_untag_object_cid_not_locked(pids, store): + """Test _untag_object throws exception with cid is not locked""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + cid = object_metadata.cid + + with pytest.raises(IdentifierNotLocked): + store._synchronize_referenced_locked_pids(pid) + store._untag_object(pid, cid) + store._release_reference_locked_pids(pid) + + +def test_untag_object_orphan_pid_refs_file_found(store): + """Test _untag_object removes an orphan pid refs file""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + object_metadata = store.store_object(pid, path) + cid = object_metadata.cid + + # Remove cid refs file + cid_refs_abs_path = store._get_hashstore_cid_refs_path(cid) + os.remove(cid_refs_abs_path) + + with pytest.raises(OrphanPidRefsFileFound): + store._find_object(pid) + + store._synchronize_referenced_locked_pids(pid) + store._synchronize_object_locked_cids(cid) + 
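# Untag the orphaned pid refs file while holding both the pid and cid locks +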
store._untag_object(pid, cid)
+    store._release_reference_locked_pids(pid)
+    store._release_object_locked_cids(cid)
+
+    assert store._count("pid") == 0
+
+
+def test_untag_object_orphan_refs_exist_but_data_object_not_found(store):
+    """Test _untag_object removes orphaned pid and cid refs files"""
+    test_dir = "tests/testdata/"
+    pid = "jtao.1700.1"
+    path = test_dir + pid
+    object_metadata = store.store_object(pid, path)
+    cid = object_metadata.cid
+
+    assert store._count("pid") == 1
+    assert store._count("cid") == 1
+
+    # Remove the data object, leaving the pid and cid refs files orphaned
+    data_obj_path = store._get_hashstore_data_object_path(cid)
+    os.remove(data_obj_path)
+
+    with pytest.raises(RefsFileExistsButCidObjMissing):
+        store._find_object(pid)
+
+    store._synchronize_referenced_locked_pids(pid)
+    store._synchronize_object_locked_cids(cid)
+    store._untag_object(pid, cid)
+    store._release_reference_locked_pids(pid)
+    store._release_object_locked_cids(cid)
+
+    assert store._count("pid") == 0
+    assert store._count("cid") == 0
+
+
+def test_untag_object_refs_found_but_pid_not_in_cid_refs(store):
+    """Test _untag_object removes pid refs file whose pid is not found in the cid refs file."""
+    test_dir = "tests/testdata/"
+    pid = "jtao.1700.1"
+    pid_two = pid + ".dou"
+    path = test_dir + pid
+    object_metadata = store.store_object(pid, path)
+    _object_metadata_two = store.store_object(pid_two, path)
+    cid = object_metadata.cid
+
+    assert store._count("pid") == 2
+    assert store._count("cid") == 1
+
+    # Remove pid from cid refs
+    cid_refs_file = store._get_hashstore_cid_refs_path(cid)
+    # First remove the pid
+    store._update_refs_file(cid_refs_file, pid, "remove")
+
+    with pytest.raises(PidNotFoundInCidRefsFile):
+        store._find_object(pid)
+
+    store._synchronize_referenced_locked_pids(pid)
+    store._synchronize_object_locked_cids(cid)
+    store._untag_object(pid, cid)
+    store._release_reference_locked_pids(pid)
+    store._release_object_locked_cids(cid)
+
+    assert store._count("pid") == 1
+    assert store._count("cid") == 1
+
+
+def test_untag_object_pid_refs_file_does_not_exist(store):
+    """Test _untag_object removes pid from cid refs file since the pid refs file does not exist,
+    and does not delete the cid refs file because a reference is still present."""
+    test_dir = "tests/testdata/"
+    pid = "jtao.1700.1"
+    pid_two = pid + ".dou"
+    path = test_dir + pid
+    object_metadata = store.store_object(pid, path)
+    _object_metadata_two = store.store_object(pid_two, path)
+    cid = object_metadata.cid
+
+    assert store._count("pid") == 2
+    assert store._count("cid") == 1
+
+    # Remove the pid refs file
+    pid_refs_file = store._get_hashstore_pid_refs_path(pid)
+    os.remove(pid_refs_file)
+
+    with pytest.raises(PidRefsDoesNotExist):
+        store._find_object(pid)
+
+    store._synchronize_referenced_locked_pids(pid)
+    store._synchronize_object_locked_cids(cid)
+    store._untag_object(pid, cid)
+    store._release_reference_locked_pids(pid)
+    store._release_object_locked_cids(cid)
+
+    assert store._count("pid") == 1
+    assert store._count("cid") == 1
+
+
+def test_untag_object_pid_refs_file_does_not_exist_and_cid_refs_is_empty(store):
+    """Test '_untag_object' removes pid from cid refs file since the pid refs file does not exist,
+    and deletes the cid refs file because it contains no more references (after the pid called
+    with '_untag_object' is removed from the cid refs)."""
+    test_dir = "tests/testdata/"
+    pid = "jtao.1700.1"
+    path = test_dir + pid
+    object_metadata = store.store_object(pid, path)
+    cid = object_metadata.cid
+
+    assert store._count("pid") == 1
+
assert store._count("cid") == 1 + + # Remove pid from cid refs + pid_refs_file = store._get_hashstore_pid_refs_path(pid) + os.remove(pid_refs_file) + + with pytest.raises(PidRefsDoesNotExist): + store._find_object(pid) + + store._synchronize_referenced_locked_pids(pid) + store._synchronize_object_locked_cids(cid) + store._untag_object(pid, cid) + store._release_reference_locked_pids(pid) + store._release_object_locked_cids(cid) + + assert store._count("pid") == 0 + assert store._count("cid") == 0 + + +def test_put_metadata_with_path(pids, store): + """Test _put_metadata with path object for the path arg.""" + entity = "metadata" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + metadata_stored_path = store._put_metadata(syspath, pid, format_id) + assert store._exists(entity, metadata_stored_path) + assert store._count(entity) == 3 + + +def test_put_metadata_with_string(pids, store): + """Test_put metadata with string for the path arg.""" + entity = "metadata" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = str(Path(test_dir) / filename) + metadata_stored_path = store._put_metadata(syspath, pid, format_id) + assert store._exists(entity, metadata_stored_path) + assert store._count(entity) == 3 + + +def test_put_metadata_stored_path(pids, store): + """Test put metadata returns correct path to the metadata stored.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + metadata_document_name = store._computehash(pid + format_id) + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + metadata_stored_path = store._put_metadata(syspath, pid, metadata_document_name) + + # Manually calculate expected path + metadata_directory = store._computehash(pid) + rel_path = Path(*store._shard(metadata_directory)) + full_path = ( + store._get_store_path("metadata") / rel_path / metadata_document_name + ) + assert metadata_stored_path == full_path + + +def test_mktmpmetadata(pids, store): + """Test mktmpmetadata creates tmpFile.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + sys_stream = io.open(syspath, "rb") + # pylint: disable=W0212 + tmp_name = store._mktmpmetadata(sys_stream) + sys_stream.close() + assert os.path.exists(tmp_name) + + +# Tests for FileHashStore Utility & Supporting Methods + + +def test_delete_marked_files(store): + """Test that _delete_marked_files removes all items from a given list""" + pid = "jtao.1700.1" + cid = "94f9b6c88f1f458e410c30c351c6384ea42ac1b5ee1f8430d3e365e43b78a38a" + # Tag object + store._store_hashstore_refs_files(pid, cid) + # Tag the cid with another pid + additional_pid = "dou.test.1" + store._store_hashstore_refs_files(additional_pid, cid) + + list_to_check = [] + pid_refs_path = store._get_hashstore_pid_refs_path(pid) + store._mark_pid_refs_file_for_deletion(pid, list_to_check, pid_refs_path) + pid_refs_path_two = store._get_hashstore_pid_refs_path(additional_pid) + store._mark_pid_refs_file_for_deletion(pid, list_to_check, pid_refs_path_two) + + assert len(list_to_check) == 2 + + store._delete_marked_files(list_to_check) + + assert not os.path.exists(list_to_check[0]) 
+ assert not os.path.exists(list_to_check[1]) + + +def test_delete_marked_files_empty_list_or_none(store): + """Test that _delete_marked_files throws exception when supplied 'None' value - and does not + throw any exception when provided with an empty list.""" + list_to_check = [] + store._delete_marked_files(list_to_check) + + with pytest.raises(ValueError): + store._delete_marked_files(None) + + +def test_mark_pid_refs_file_for_deletion(store): + """Test _mark_pid_refs_file_for_deletion renames a given path for deletion (adds '_delete' to + the path name) and adds it to the given list.""" + pid = "dou.test.1" + cid = "agoodcid" + list_to_check = [] + store._store_hashstore_refs_files(pid, cid) + pid_refs_path = store._get_hashstore_pid_refs_path(pid) + + store._mark_pid_refs_file_for_deletion(pid, list_to_check, pid_refs_path) + + assert len(list_to_check) == 1 + assert "_delete" in str(list_to_check[0]) + + +def test_remove_pid_and_handle_cid_refs_deletion_multiple_cid_refs_contains_multi_pids( + store, +): + """Test _remove_pid_and_handle_cid_refs_deletion removes a pid from the cid refs file.""" + pid = "dou.test.1" + pid_two = "dou.test.2" + cid = "agoodcid" + list_to_check = [] + store._store_hashstore_refs_files(pid, cid) + store._store_hashstore_refs_files(pid_two, cid) + + cid_refs_path = store._get_hashstore_cid_refs_path(cid) + store._remove_pid_and_handle_cid_refs_deletion(pid, list_to_check, cid_refs_path) + + assert store._is_string_in_refs_file(pid, cid_refs_path) is False + assert store._count("cid") == 1 + assert len(list_to_check) == 0 + + +def test_remove_pid_and_handle_cid_refs_deletion_cid_refs_empty(store): + """Test _remove_pid_and_handle_cid_refs_deletion removes a pid from the cid refs file and + deletes it when it is empty after removal.""" + pid = "dou.test.1" + cid = "agoodcid" + list_to_check = [] + store._store_hashstore_refs_files(pid, cid) + + cid_refs_path = store._get_hashstore_cid_refs_path(cid) + store._remove_pid_and_handle_cid_refs_deletion(pid, list_to_check, cid_refs_path) + delete_path = cid_refs_path.with_name(cid_refs_path.name + "_delete") + + assert not os.path.exists(cid_refs_path) + assert os.path.exists(delete_path) + assert len(list_to_check) == 1 + + +def test_validate_and_check_cid_lock_non_matching_cid(store): + """Test that _validate_and_check_cid_lock throws exception when cid is different""" + pid = "dou.test.1" + cid = "thegoodcid" + cid_to_check = "thebadcid" + + with pytest.raises(ValueError): + store._validate_and_check_cid_lock(pid, cid, cid_to_check) + + +def test_validate_and_check_cid_lock_identifier_not_locked(store): + """Test that _validate_and_check_cid_lock throws exception when cid is not locked""" + pid = "dou.test.1" + cid = "thegoodcid" + + with pytest.raises(IdentifierNotLocked): + store._validate_and_check_cid_lock(pid, cid, cid) + + +def test_write_refs_file_ref_type_cid(store): + """Test that write_refs_file writes a reference file when given a 'cid' update_type.""" + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, "test_pid", "cid") + assert os.path.exists(tmp_cid_refs_file) + + +def test_write_refs_file_ref_type_content_cid(pids, store): + """Test that write_refs_file writes the expected content when given a 'cid' update_type.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + with open(tmp_cid_refs_file, "r", encoding="utf8") as f: + 
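# The tmp cid refs file should contain only the pid it was written with +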
cid_ref_file_pid = f.read() + + assert pid == cid_ref_file_pid.strip() + + +def test_write_refs_file_ref_type_pid(pids, store): + """Test that write_pid_refs_file writes a reference file when given a 'pid' update_type.""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_pid_refs_file = store._write_refs_file(tmp_root_path, cid, "pid") + assert os.path.exists(tmp_pid_refs_file) + + +def test_write_refs_file_ref_type_content_pid(pids, store): + """Test that write_refs_file writes the expected content when given a 'pid' update_type""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_pid_refs_file = store._write_refs_file(tmp_root_path, cid, "pid") + with open(tmp_pid_refs_file, "r", encoding="utf8") as f: + pid_refs_cid = f.read() + + assert cid == pid_refs_cid + + +def test_update_refs_file_content(pids, store): + """Test that update_refs_file updates the ref file as expected.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + pid_other = "dou.test.1" + store._update_refs_file(tmp_cid_refs_file, pid_other, "add") + + with open(tmp_cid_refs_file, "r", encoding="utf8") as f: + for _, line in enumerate(f, start=1): + value = line.strip() + assert value == pid or value == pid_other + + +def test_update_refs_file_content_multiple(pids, store): + """Test that _update_refs_file adds multiple references successfully.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + + cid_reference_list = [pid] + for i in range(0, 5): + store._update_refs_file(tmp_cid_refs_file, f"dou.test.{i}", "add") + cid_reference_list.append(f"dou.test.{i}") + + line_count = 0 + with open(tmp_cid_refs_file, "r", encoding="utf8") as f: + for _, line in enumerate(f, start=1): + line_count += 1 + value = line.strip() + assert value in cid_reference_list + + assert line_count == 6 + + +def test_update_refs_file_deduplicates_pid_already_found(pids, store): + """Test that _update_refs_file does not add a pid to a refs file that already + contains the pid.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + # Exception should not be thrown + store._update_refs_file(tmp_cid_refs_file, pid, "add") + + line_count = 0 + with open(tmp_cid_refs_file, "r", encoding="utf8") as ref_file: + for _line in ref_file: + line_count += 1 + assert line_count == 1 + + +def test_update_refs_file_content_cid_refs_does_not_exist(pids, store): + """Test that _update_refs_file throws exception if refs file doesn't exist.""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + with pytest.raises(FileNotFoundError): + store._update_refs_file(cid_ref_abs_path, pid, "add") + + +def test_update_refs_file_remove(pids, store): + """Test that _update_refs_file deletes the given pid from the ref file.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + + pid_other = "dou.test.1" + store._update_refs_file(tmp_cid_refs_file, pid_other, "add") + store._update_refs_file(tmp_cid_refs_file, pid, "remove") + + with open(tmp_cid_refs_file, "r", 
encoding="utf8") as f: + for _, line in enumerate(f, start=1): + value = line.strip() + assert value == pid_other + + +def test_update_refs_file_empty_file(pids, store): + """Test that _update_refs_file leaves a file empty when removing the last pid.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + # First remove the pid + store._update_refs_file(tmp_cid_refs_file, pid, "remove") + + assert os.path.exists(tmp_cid_refs_file) + assert os.path.getsize(tmp_cid_refs_file) == 0 + + +def test_is_string_in_refs_file(pids, store): + """Test that _update_refs_file leaves a file empty when removing the last pid.""" + for pid in pids.keys(): + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + + cid_reference_list = [pid] + for i in range(0, 5): + store._update_refs_file(tmp_cid_refs_file, f"dou.test.{i}", "add") + cid_reference_list.append(f"dou.test.{i}") + + assert store._is_string_in_refs_file("dou.test.2", tmp_cid_refs_file) is True + + +def test_verify_object_information(pids, store): + """Test _verify_object_information succeeds given good arguments.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + hex_digests = object_metadata.hex_digests + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + expected_file_size = object_metadata.obj_size + store._verify_object_information( + None, + checksum, + checksum_algorithm, + None, + hex_digests, + None, + expected_file_size, + expected_file_size, + ) + + +def test_verify_object_information_incorrect_size(pids, store): + """Test _verify_object_information throws exception when size is incorrect.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + hex_digests = object_metadata.hex_digests + checksum = hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + with pytest.raises(NonMatchingObjSize): + store._verify_object_information( + None, + checksum, + checksum_algorithm, + None, + hex_digests, + None, + 1000, + 2000, + ) + + +def test_verify_object_information_incorrect_size_with_pid(pids, store): + """Test _verify_object_information deletes the expected tmp file if obj size does + not match and raises an exception.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + hex_digests = object_metadata.hex_digests + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + expected_file_size = object_metadata.obj_size + + objects_tmp_folder = store.objects / "tmp" + tmp_file = store._mktmpfile(objects_tmp_folder) + assert os.path.isfile(tmp_file.name) + with pytest.raises(NonMatchingObjSize): + store._verify_object_information( + "Test_Pid", + checksum, + checksum_algorithm, + None, + hex_digests, + tmp_file.name, + 1000, + expected_file_size, + ) + assert not os.path.isfile(tmp_file.name) + + +def test_verify_object_information_missing_key_in_hex_digests_unsupported_algo( + pids, store +): + """Test _verify_object_information throws exception when algorithm is not found + in hex digests and is not supported.""" + test_dir = "tests/testdata/" + for pid in 
pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = "md10" + expected_file_size = object_metadata.obj_size + with pytest.raises(UnsupportedAlgorithm): + store._verify_object_information( + None, + checksum, + checksum_algorithm, + "objects", + object_metadata.hex_digests, + None, + expected_file_size, + expected_file_size, + ) + + +def test_verify_object_information_missing_key_in_hex_digests_supported_algo( + pids, store +): + """Test _verify_object_information throws exception when algorithm is not found + in hex digests but is supported, and the checksum calculated does not match.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = "blake2s" + expected_file_size = object_metadata.obj_size + with pytest.raises(NonMatchingChecksum): + store._verify_object_information( + None, + checksum, + checksum_algorithm, + "objects", + object_metadata.hex_digests, + None, + expected_file_size, + expected_file_size, + ) + + +def test_verify_object_information_missing_key_in_hex_digests_matching_checksum( + pids, store +): + """Test _verify_object_information does not throw exception when algorithm is not found + in hex digests but is supported, and the checksum calculated matches.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum_algorithm = "blake2s" + checksum = pids[pid][checksum_algorithm] + expected_file_size = object_metadata.obj_size + store._verify_object_information( + None, + checksum, + checksum_algorithm, + "objects", + object_metadata.hex_digests, + None, + expected_file_size, + expected_file_size, + ) + + +def test_verify_hashstore_references_pid_refs_file_missing(pids, store): + """Test _verify_hashstore_references throws exception when pid refs file is missing.""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + with pytest.raises(PidRefsFileNotFound): + store._verify_hashstore_references(pid, cid) + + +def test_verify_hashstore_references_pid_refs_incorrect_cid(pids, store): + """Test _verify_hashstore_references throws exception when pid refs file cid is incorrect.""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + # Write the cid refs file and move it where it needs to be + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, pid, "cid") + cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + print(cid_ref_abs_path) + store._create_path(os.path.dirname(cid_ref_abs_path)) + shutil.move(tmp_cid_refs_file, cid_ref_abs_path) + # Write the pid refs file and move it where it needs to be with a bad cid + pid_ref_abs_path = store._get_hashstore_pid_refs_path(pid) + print(pid_ref_abs_path) + store._create_path(os.path.dirname(pid_ref_abs_path)) + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_pid_refs_file = store._write_refs_file(tmp_root_path, "bad_cid", "pid") + shutil.move(tmp_pid_refs_file, pid_ref_abs_path) + + with pytest.raises(PidRefsContentError): + store._verify_hashstore_references(pid, cid) + + +def test_verify_hashstore_references_cid_refs_file_missing(pids, store): + """Test _verify_hashstore_references throws exception when cid refs file is missing.""" + 
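# Create only the pid refs file so that the cid refs file is missing +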
for pid in pids.keys(): + cid = pids[pid]["sha256"] + pid_ref_abs_path = store._get_hashstore_pid_refs_path(pid) + store._create_path(os.path.dirname(pid_ref_abs_path)) + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_pid_refs_file = store._write_refs_file(tmp_root_path, "bad_cid", "pid") + shutil.move(tmp_pid_refs_file, pid_ref_abs_path) + + with pytest.raises(CidRefsFileNotFound): + store._verify_hashstore_references(pid, cid) + + +def test_verify_hashstore_references_cid_refs_file_missing_pid(pids, store): + """Test _verify_hashstore_references throws exception when cid refs file does not contain + the expected pid.""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + # Get a tmp cid refs file and write the wrong pid into it + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, "bad pid", "cid") + cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + store._create_path(os.path.dirname(cid_ref_abs_path)) + shutil.move(tmp_cid_refs_file, cid_ref_abs_path) + # Now write the pid refs file, both cid and pid refs must be present + pid_ref_abs_path = store._get_hashstore_pid_refs_path(pid) + store._create_path(os.path.dirname(pid_ref_abs_path)) + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_pid_refs_file = store._write_refs_file(tmp_root_path, cid, "pid") + shutil.move(tmp_pid_refs_file, pid_ref_abs_path) + + with pytest.raises(CidRefsContentError): + store._verify_hashstore_references(pid, cid) + + +def test_verify_hashstore_references_cid_refs_file_with_multiple_refs_missing_pid( + pids, store +): + """Test _verify_hashstore_references throws exception when cid refs file with multiple + references does not contain the expected pid.""" + for pid in pids.keys(): + cid = pids[pid]["sha256"] + # Write the wrong pid into a cid refs file and move it where it needs to be + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_cid_refs_file = store._write_refs_file(tmp_root_path, "bad pid", "cid") + cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + store._create_path(os.path.dirname(cid_ref_abs_path)) + shutil.move(tmp_cid_refs_file, cid_ref_abs_path) + # Now write the pid refs with expected values + pid_ref_abs_path = store._get_hashstore_pid_refs_path(pid) + store._create_path(os.path.dirname(pid_ref_abs_path)) + tmp_root_path = store._get_store_path("refs") / "tmp" + tmp_pid_refs_file = store._write_refs_file(tmp_root_path, cid, "pid") + shutil.move(tmp_pid_refs_file, pid_ref_abs_path) + + for i in range(0, 5): + store._update_refs_file(cid_ref_abs_path, f"dou.test.{i}", "add") + + with pytest.raises(CidRefsContentError): + store._verify_hashstore_references(pid, cid) + + +def test_delete_object_only(pids, store): + """Test _delete_object successfully deletes only object.""" + test_dir = "tests/testdata/" + entity = "objects" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid=None, data=path) + store._delete_object_only(object_metadata.cid) + assert store._count(entity) == 0 + + +def test_delete_object_only_cid_refs_file_exists(pids, store): + """Test _delete_object does not delete object if a cid refs file still exists.""" + test_dir = "tests/testdata/" + entity = "objects" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + object_metadata = 
store.store_object(pid, path)
+        _metadata_stored_path = store.store_metadata(pid, syspath, format_id)
+        store._delete_object_only(object_metadata.cid)
+    assert store._count(entity) == 3
+    assert store._count("pid") == 3
+    assert store._count("cid") == 3
+
+
+def test_clean_algorithm(store):
+    """Check that algorithm values get formatted as expected."""
+    algorithm_underscore = "sha_256"
+    algorithm_hyphen = "sha-256"
+    algorithm_other_hyphen = "sha3-256"
+    cleaned_algo_underscore = store._clean_algorithm(algorithm_underscore)
+    cleaned_algo_hyphen = store._clean_algorithm(algorithm_hyphen)
+    cleaned_algo_other_hyphen = store._clean_algorithm(algorithm_other_hyphen)
+    assert cleaned_algo_underscore == "sha256"
+    assert cleaned_algo_hyphen == "sha256"
+    assert cleaned_algo_other_hyphen == "sha3_256"
+
+
+def test_clean_algorithm_unsupported_algo(store):
+    """Check that an unsupported algorithm raises UnsupportedAlgorithm."""
+    algorithm_unsupported = "mok22"
+    with pytest.raises(UnsupportedAlgorithm):
+        _ = store._clean_algorithm(algorithm_unsupported)
+
+
+def test_computehash(pids, store):
+    """Test _computehash returns the expected sha256 hex digest."""
+    test_dir = "tests/testdata/"
+    for pid in pids.keys():
+        path = test_dir + pid.replace("/", "_")
+        obj_stream = io.open(path, "rb")
+        obj_sha256_hash = store._computehash(obj_stream, "sha256")
+        obj_stream.close()
+        assert pids[pid]["sha256"] == obj_sha256_hash
+
+
+def test_shard(store):
+    """Test _shard splits a hash id into the expected list."""
+    hash_id = "0d555ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e"
+    predefined_list = [
+        "0d",
+        "55",
+        "5e",
+        "d77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e",
+    ]
+    sharded_list = store._shard(hash_id)
+    assert predefined_list == sharded_list
+
+
+def test_count(pids, store):
+    """Check that count returns expected number of objects."""
+    test_dir = "tests/testdata/"
+    entity = "objects"
+    for pid in pids.keys():
+        path_string = test_dir + pid.replace("/", "_")
+        store._store_and_validate_data(pid, path_string)
+    assert store._count(entity) == 3
+
+
+def test_exists_object_with_object_metadata_id(pids, store):
+    """Test exists method with an object's content identifier."""
+    test_dir = "tests/testdata/"
+    entity = "objects"
+    for pid in pids.keys():
+        path = test_dir + pid.replace("/", "_")
+        object_metadata = store._store_and_validate_data(pid, path)
+        assert store._exists(entity, object_metadata.cid)
+
+
+def test_exists_object_with_sharded_path(pids, store):
+    """Test exists method with a sharded path."""
+    test_dir = "tests/testdata/"
+    entity = "objects"
+    for pid in pids.keys():
+        path = test_dir + pid.replace("/", "_")
+        object_metadata = store._store_and_validate_data(pid, path)
+        object_metadata_shard_path = os.path.join(*store._shard(object_metadata.cid))
+        assert store._exists(entity, object_metadata_shard_path)
+
+
+def test_exists_metadata_files_path(pids, store):
+    """Test exists works as expected for metadata."""
+    test_dir = "tests/testdata/"
+    entity = "metadata"
+    format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata"
+    for pid in pids.keys():
+        filename = pid.replace("/", "_") + ".xml"
+        syspath = Path(test_dir) / filename
+        metadata_stored_path = store.store_metadata(pid, syspath, format_id)
+        assert store._exists(entity, metadata_stored_path)
+
+
+def test_exists_object_with_nonexistent_file(store):
+    """Test exists method with a nonexistent file."""
+    entity = "objects"
+    non_existent_file = "tests/testdata/filedoesnotexist"
+    does_not_exist = store._exists(entity, non_existent_file)
+
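# _exists reports False for a missing object rather than raising an exception +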
assert does_not_exist is False + + +def test_open_objects(pids, store): + """Test open returns a stream.""" + test_dir = "tests/testdata/" + entity = "objects" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + object_metadata_id = object_metadata.cid + io_buffer = store._open(entity, object_metadata_id) + assert isinstance(io_buffer, io.BufferedReader) + io_buffer.close() + + +def test_private_delete_objects(pids, store): + """Confirm _delete deletes for entity type 'objects'""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + + store._delete("objects", object_metadata.cid) + assert store._count("objects") == 0 + + +def test_private_delete_metadata(pids, store): + """Confirm _delete deletes for entity type 'metadata'""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + store.store_metadata(pid, syspath, format_id) + + # Manually calculate expected path + metadata_directory = store._computehash(pid) + metadata_document_name = store._computehash(pid + format_id) + rel_path = Path(*store._shard(metadata_directory)) / metadata_document_name + + store._delete("metadata", rel_path) + + assert store._count("metadata") == 0 + + +def test_private_delete_absolute_path(pids, store): + """Confirm _delete deletes for absolute paths'""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + + cid_refs_path = store._get_hashstore_cid_refs_path(object_metadata.cid) + store._delete("other", cid_refs_path) + assert store._count("cid") == 0 + + pid_refs_path = store._get_hashstore_pid_refs_path(pid) + store._delete("other", pid_refs_path) + assert store._count("pid") == 0 + + +def test_create_path(pids, store): + """Test makepath creates folder successfully.""" + for pid in pids: + root_directory = store.root + pid_hex_digest_directory = pids[pid]["metadata_cid"][:2] + pid_directory = root_directory / pid_hex_digest_directory + store._create_path(pid_directory) + assert os.path.isdir(pid_directory) + + +def test_get_store_path_object(store): + """Check get_store_path for object path.""" + # pylint: disable=W0212 + path_objects = store._get_store_path("objects") + path_objects_string = str(path_objects) + assert path_objects_string.endswith("/metacat/hashstore/objects") + + +def test_get_store_path_metadata(store): + """Check get_store_path for metadata path.""" + # pylint: disable=W0212 + path_metadata = store._get_store_path("metadata") + path_metadata_string = str(path_metadata) + assert path_metadata_string.endswith("/metacat/hashstore/metadata") + + +def test_get_store_path_refs(store): + """Check get_store_path for refs path.""" + # pylint: disable=W0212 + path_metadata = store._get_store_path("refs") + path_metadata_string = str(path_metadata) + assert path_metadata_string.endswith("/metacat/hashstore/refs") + + +def test_get_hashstore_data_object_path_file_does_not_exist(store): + """Test _get_hashstore_data_object_path returns None when object does not exist.""" + test_path = "tests/testdata/helloworld.txt" + with pytest.raises(FileNotFoundError): + store._get_hashstore_data_object_path(test_path) + + +def 
test_get_hashstore_data_object_path_with_object_id(store, pids): + """Test _get_hashstore_data_object_path returns absolute path given an object id.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store._store_and_validate_data(pid, path) + obj_abs_path = store._get_hashstore_data_object_path(object_metadata.cid) + assert os.path.exists(obj_abs_path) + + +def test_get_hashstore_metadata_path_absolute_path(store, pids): + """Test _get_hashstore_metadata_path returns absolute path given a metadata id.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + metadata_stored_path = store.store_metadata(pid, syspath, format_id) + metadata_abs_path = store._get_hashstore_metadata_path(metadata_stored_path) + assert os.path.exists(metadata_abs_path) + + +def test_get_hashstore_metadata_path_relative_path(pids, store): + """Confirm resolve path returns correct metadata path.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _metadata_stored_path = store.store_metadata(pid, syspath, format_id) + + metadata_directory = store._computehash(pid) + metadata_document_name = store._computehash(pid + format_id) + rel_path = Path(*store._shard(metadata_directory)) + full_path_without_dir = Path(rel_path) / metadata_document_name + + metadata_resolved_path = store._get_hashstore_metadata_path( + full_path_without_dir + ) + calculated_metadata_path = store.metadata / rel_path / metadata_document_name + + assert Path(calculated_metadata_path) == metadata_resolved_path + + +def test_get_hashstore_pid_refs_path(pids, store): + """Confirm resolve path returns correct object pid refs path""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + _object_metadata = store.store_object(pid, path) + + resolved_pid_ref_abs_path = store._get_hashstore_pid_refs_path(pid) + pid_refs_metadata_hashid = store._computehash(pid) + calculated_pid_ref_path = store.pids / Path( + *store._shard(pid_refs_metadata_hashid) + ) + + assert resolved_pid_ref_abs_path == Path(calculated_pid_ref_path) + + +def test_get_hashstore_cid_refs_path(pids, store): + """Confirm resolve path returns correct object pid refs path""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + cid = object_metadata.cid + + resolved_cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + calculated_cid_ref_path = store.cids / Path(*store._shard(cid)) + + assert resolved_cid_ref_abs_path == Path(calculated_cid_ref_path) + + +def test_check_string(store): + """Confirm that an exception is raised when a string is None, empty or contains an illegal + character (ex. 
tabs or new lines)""" + empty_pid_with_spaces = " " + with pytest.raises(ValueError): + store._check_string(empty_pid_with_spaces, "empty_pid_with_spaces") + + none_value = None + with pytest.raises(ValueError): + store._check_string(none_value, "none_value") + + new_line = "\n" + with pytest.raises(ValueError): + store._check_string(new_line, "new_line") + + new_line_with_other_chars = "hello \n" + with pytest.raises(ValueError): + store._check_string(new_line_with_other_chars, "new_line_with_other_chars") + + tab_line = "\t" + with pytest.raises(ValueError): + store._check_string(tab_line, "tab_line") + + +def test_cast_to_bytes(store): + """Test _to_bytes returns bytes.""" + string = "teststring" + # pylint: disable=W0212 + string_bytes = store._cast_to_bytes(string) + assert isinstance(string_bytes, bytes) + + +def test_stream_reads_file(pids): + """Test that a stream can read a file and yield its contents.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path_string = test_dir + pid.replace("/", "_") + obj_stream = Stream(path_string) + hashobj = hashlib.new("sha256") + for data in obj_stream: + hashobj.update(data) + obj_stream.close() + hex_digest = hashobj.hexdigest() + assert pids[pid]["sha256"] == hex_digest + + +def test_stream_reads_path_object(pids): + """Test that a stream can read a file-like object and yield its contents.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + obj_stream = Stream(path) + hash_obj = hashlib.new("sha256") + for data in obj_stream: + hash_obj.update(data) + obj_stream.close() + hex_digest = hash_obj.hexdigest() + assert pids[pid]["sha256"] == hex_digest + + +def test_stream_returns_to_original_position_on_close(pids): + """Test that a stream returns to its original position after closing the file.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path_string = test_dir + pid.replace("/", "_") + input_stream = io.open(path_string, "rb") + input_stream.seek(5) + hashobj = hashlib.new("sha256") + obj_stream = Stream(input_stream) + for data in obj_stream: + hashobj.update(data) + obj_stream.close() + assert input_stream.tell() == 5 + input_stream.close() + + +# noinspection PyTypeChecker +def test_stream_raises_error_for_invalid_object(): + """Test that a stream raises ValueError for an invalid input object.""" + with pytest.raises(ValueError): + Stream(1234) + + +def test_objectmetadata(): + """Test ObjectMetadata class returns correct values via dot notation.""" + pid = "hashstore" + ab_id = "hashstoretest" + obj_size = 1234 + hex_digest_dict = { + "md5": "md5value", + "sha1": "sha1value", + "sha224": "sha224value", + "sha256": "sha256value", + "sha512": "sha512value", + } + object_metadata = ObjectMetadata(pid, ab_id, obj_size, hex_digest_dict) + assert object_metadata.pid == pid + assert object_metadata.cid == ab_id + assert object_metadata.obj_size == obj_size + assert object_metadata.hex_digests.get("md5") == hex_digest_dict["md5"] + assert object_metadata.hex_digests.get("sha1") == hex_digest_dict["sha1"] + assert object_metadata.hex_digests.get("sha224") == hex_digest_dict["sha224"] + assert object_metadata.hex_digests.get("sha256") == hex_digest_dict["sha256"] + assert object_metadata.hex_digests.get("sha512") == hex_digest_dict["sha512"] diff --git a/tests/filehashstore/test_filehashstore_interface.py b/tests/filehashstore/test_filehashstore_interface.py new file mode 100644 index 00000000..381e1035 --- /dev/null +++ 
b/tests/filehashstore/test_filehashstore_interface.py @@ -0,0 +1,1564 @@ +"""Test module for FileHashStore HashStore interface methods.""" + +import io +import os +from pathlib import Path +from threading import Thread +import random +import threading +import time +import pytest + +from hashstore.filehashstore_exceptions import ( + NonMatchingChecksum, + NonMatchingObjSize, + PidRefsDoesNotExist, + UnsupportedAlgorithm, + HashStoreRefsAlreadyExists, + PidRefsAlreadyExistsError, +) + +# pylint: disable=W0212 + + +# Define a mark to be used to label slow tests +slow_test = pytest.mark.skipif( + "not config.getoption('--run-slow')", + reason="Only run when --run-slow is given", +) + + +def test_store_object_refs_files_and_object(pids, store): + """Test store object stores objects and creates reference files.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(pid, path) + assert object_metadata.cid == pids[pid][store.algorithm] + assert store._count("objects") == 3 + assert store._count("pid") == 3 + assert store._count("cid") == 3 + + +def test_store_object_only_object(pids, store): + """Test store object stores an object only (no reference files will be created)""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + object_metadata = store.store_object(data=path) + assert object_metadata.cid == pids[pid][store.algorithm] + assert store._count("objects") == 3 + assert store._count("pid") == 0 + assert store._count("cid") == 0 + + +def test_store_object_files_path(pids, store): + """Test store object when given a path object.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = Path(test_dir + pid.replace("/", "_")) + _object_metadata = store.store_object(pid, path) + assert store._exists("objects", pids[pid][store.algorithm]) + assert store._count("objects") == 3 + + +def test_store_object_files_string(pids, store): + """Test store object when given a string object.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path_string = test_dir + pid.replace("/", "_") + _object_metadata = store.store_object(pid, path_string) + assert store._exists("objects", pids[pid][store.algorithm]) + assert store._count("objects") == 3 + + +def test_store_object_files_input_stream(pids, store): + """Test store object when given a stream object.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + input_stream = io.open(path, "rb") + _object_metadata = store.store_object(pid, input_stream) + input_stream.close() + assert store._exists("objects", pids[pid][store.algorithm]) + assert store._count("objects") == 3 + + +def test_store_object_cid(pids, store): + """Test store object returns expected content identifier.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + assert object_metadata.cid == pids[pid][store.algorithm] + + +def test_store_object_pid(pids, store): + """Test store object returns expected persistent identifier.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + assert object_metadata.pid == pid + + +def test_store_object_obj_size(pids, store): + """Test store object returns expected file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + 
pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + object_size = object_metadata.obj_size + assert object_size == pids[pid]["file_size_bytes"] + + +def test_store_object_hex_digests(pids, store): + """Test store object returns expected hex digests dictionary.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + assert object_metadata.hex_digests.get("md5") == pids[pid]["md5"] + assert object_metadata.hex_digests.get("sha1") == pids[pid]["sha1"] + assert object_metadata.hex_digests.get("sha256") == pids[pid]["sha256"] + assert object_metadata.hex_digests.get("sha384") == pids[pid]["sha384"] + assert object_metadata.hex_digests.get("sha512") == pids[pid]["sha512"] + + +def test_store_object_pid_empty(store): + """Test store object raises error when supplied with empty pid string.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + with pytest.raises(ValueError): + store.store_object("", path) + + +def test_store_object_pid_empty_spaces(store): + """Test store object raises error when supplied with empty space character.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + with pytest.raises(ValueError): + store.store_object(" ", path) + + +def test_store_object_data_incorrect_type_none(store): + """Test store object raises error when data is 'None'.""" + pid = "jtao.1700.1" + path = None + with pytest.raises(TypeError): + store.store_object(pid, data=path) + + +def test_store_object_data_incorrect_type_empty(store): + """Test store object raises error when data is an empty string.""" + pid = "jtao.1700.1" + path = "" + with pytest.raises(TypeError): + store.store_object(pid, data=path) + + +def test_store_object_data_incorrect_type_empty_spaces(store): + """Test store object raises error when data is an empty string with spaces.""" + pid = "jtao.1700.1" + path = " " + with pytest.raises(TypeError): + store.store_object(pid, data=path) + + +def test_store_object_data_incorrect_type_special_characters(store): + """Test store object raises error when data is empty string with special characters""" + pid = "jtao.1700.1" + path = " \n\t" + with pytest.raises(TypeError): + store.store_object(pid, data=path) + + +def test_store_object_data_incorrect_type_path_with_special_character(store): + """Test store object raises error when data path contains special characters.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + "\n" + with pytest.raises(ValueError): + store.store_object("", path) + + +def test_store_object_additional_algorithm_invalid(store): + """Test store object raises error when supplied with unsupported algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_not_in_list = "abc" + with pytest.raises(UnsupportedAlgorithm): + store.store_object(pid, path, algorithm_not_in_list) + + +def test_store_object_additional_algorithm_hyphen_uppercase(pids, store): + """Test store object accepts an additional algo that's supported in uppercase.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_with_hyphen_and_upper = "SHA-384" + object_metadata = store.store_object(pid, path, algorithm_with_hyphen_and_upper) + sha256_cid = object_metadata.hex_digests.get("sha384") + assert sha256_cid == pids[pid]["sha384"] + assert store._exists("objects", pids[pid][store.algorithm]) + + +def 
test_store_object_additional_algorithm_hyphen_lowercase(pids, store): + """Test store object accepts an additional algo that's supported in lowercase.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_other = "sha3-256" + object_metadata = store.store_object(pid, path, algorithm_other) + additional_sha3_256_hex_digest = object_metadata.hex_digests.get("sha3_256") + sha3_256_checksum = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + assert additional_sha3_256_hex_digest == sha3_256_checksum + assert store._exists("objects", pids[pid][store.algorithm]) + + +def test_store_object_additional_algorithm_underscore(pids, store): + """Test store object accepts an additional algo that's supported with underscore.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_other = "sha3_256" + object_metadata = store.store_object(pid, path, algorithm_other) + additional_sha3_256_hex_digest = object_metadata.hex_digests.get("sha3_256") + sha3_256_checksum = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + assert additional_sha3_256_hex_digest == sha3_256_checksum + assert store._exists("objects", pids[pid][store.algorithm]) + + +def test_store_object_checksum_correct(store): + """Test store object does not throw exception with good checksum.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + checksum_algo = "sha3_256" + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + _object_metadata = store.store_object( + pid, path, checksum=checksum_correct, checksum_algorithm=checksum_algo + ) + assert store._count("objects") == 1 + + +def test_store_object_checksum_correct_and_additional_algo(store): + """Test store object with good checksum and an additional algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_additional = "sha224" + sha224_additional_checksum = ( + "9b3a96f434f3c894359193a63437ef86fbd5a1a1a6cc37f1d5013ac1" + ) + algorithm_checksum = "sha3_256" + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + object_metadata = store.store_object( + pid, + path, + additional_algorithm=algorithm_additional, + checksum=checksum_correct, + checksum_algorithm=algorithm_checksum, + ) + assert object_metadata.hex_digests.get("sha224") == sha224_additional_checksum + assert object_metadata.hex_digests.get("sha3_256") == checksum_correct + + +def test_store_object_checksum_correct_and_additional_algo_duplicate(store): + """Test store object does not throw exception with duplicate algorithms (de-dupes).""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_additional = "sha3_256" + algorithm_checksum = "sha3_256" + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + object_metadata = store.store_object( + pid, + path, + additional_algorithm=algorithm_additional, + checksum=checksum_correct, + checksum_algorithm=algorithm_checksum, + ) + assert object_metadata.hex_digests.get("sha3_256") == checksum_correct + + +def test_store_object_checksum_empty(store): + """Test store object raises error when checksum_algorithm supplied with + an empty checksum.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + checksum_algorithm = "sha3_256" + with pytest.raises(ValueError): + store.store_object( + pid, path, 
checksum="", checksum_algorithm=checksum_algorithm + ) + + +def test_store_object_checksum_empty_spaces(store): + """Test store object raises error when checksum_algorithm supplied and + checksum is empty with spaces.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + checksum_algorithm = "sha3_256" + with pytest.raises(ValueError): + store.store_object( + pid, path, checksum=" ", checksum_algorithm=checksum_algorithm + ) + + +def test_store_object_checksum_incorrect_checksum(store): + """Test store object raises error when supplied with incorrect checksum.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_other = "sha224" + checksum_incorrect = ( + "bbbb069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + with pytest.raises(NonMatchingChecksum): + store.store_object( + pid, path, checksum=checksum_incorrect, checksum_algorithm=algorithm_other + ) + + +def test_store_object_checksum_unsupported_checksum_algo(store): + """Test store object raises error when supplied with unsupported checksum algo.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + algorithm_other = "sha3_256" + checksum_incorrect = ( + "bbbb069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + with pytest.raises(UnsupportedAlgorithm): + store.store_object( + pid, path, checksum=algorithm_other, checksum_algorithm=checksum_incorrect + ) + + +def test_store_object_checksum_algorithm_empty(store): + """Test store object raises error when checksum supplied with no checksum_algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + with pytest.raises(ValueError): + store.store_object(pid, path, checksum=checksum_correct, checksum_algorithm="") + + +def test_store_object_checksum_algorithm_empty_spaces(store): + """Test store object raises error when checksum is supplied and with empty + spaces as the checksum_algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + with pytest.raises(ValueError): + store.store_object( + pid, path, checksum=checksum_correct, checksum_algorithm=" " + ) + + +def test_store_object_checksum_algorithm_special_character(store): + """Test store object raises error when checksum is supplied and with special characters + as the checksum_algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + checksum_correct = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + with pytest.raises(ValueError): + store.store_object( + pid, path, checksum=checksum_correct, checksum_algorithm="\n" + ) + + +def test_store_object_duplicate_does_not_store_duplicate(store): + """Test that storing duplicate object does not store object twice.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + # Store first blob + _object_metadata_one = store.store_object(pid, path) + # Store second blob + pid_that_refs_existing_cid = "dou.test.1" + _object_metadata_two = store.store_object(pid_that_refs_existing_cid, path) + # Confirm only one object exists and the tmp file created is deleted + assert store._count("objects") == 1 + + +def test_store_object_duplicate_object_references_file_count(store): + """Test that storing a duplicate object but with different pids 
creates the expected + amount of reference files.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + # Store with first pid + _object_metadata_one = store.store_object(pid, path) + # Store with second pid + pid_two = "dou.test.1" + _object_metadata_two = store.store_object(pid_two, path) + # Store with third pid + pid_three = "dou.test.2" + _object_metadata_three = store.store_object(pid_three, path) + # Confirm that there are 3 pid reference files + assert store._count("pid") == 3 + # Confirm that there are 1 cid reference files + assert store._count("cid") == 1 + assert store._count("objects") == 1 + + +def test_store_object_duplicate_object_references_file_content(pids, store): + """Test that storing duplicate object but different pid updates the cid refs file + with the correct amount of pids and content.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + # Store with first pid + store.store_object(pid, path) + # Store with second pid + pid_two = "dou.test.1" + store.store_object(pid_two, path) + # Store with third pid + pid_three = "dou.test.2" + store.store_object(pid_three, path) + # Confirm the content of the cid reference files + cid_ref_abs_path = store._get_hashstore_cid_refs_path(pids[pid][store.algorithm]) + cid_count = 0 + with open(cid_ref_abs_path, "r", encoding="utf8") as f: + for _, line in enumerate(f, start=1): + cid_count += 1 + value = line.strip() + assert value == pid or value == pid_two or value == pid_three + + assert cid_count == 3 + + +def test_store_object_duplicate_raises_error_with_bad_validation_data(pids, store): + """Test store duplicate object throws exception when the data to validate against + is incorrect.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + # Store first blob + _object_metadata_one = store.store_object(pid, path) + # Store second blob + with pytest.raises(NonMatchingChecksum): + _object_metadata_two = store.store_object( + pid, path, checksum="nonmatchingchecksum", checksum_algorithm="sha256" + ) + assert store._count("objects") == 1 + # Confirm tmp files created during this process was handled + assert store._count("tmp") == 0 + assert store._exists("objects", pids[pid][store.algorithm]) + + +def test_store_object_with_obj_file_size(store, pids): + """Test store object stores object with correct file sizes.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + obj_file_size = pids[pid]["file_size_bytes"] + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object( + pid, path, expected_object_size=obj_file_size + ) + object_size = object_metadata.obj_size + assert object_size == obj_file_size + + +def test_store_object_with_obj_file_size_incorrect(store, pids): + """Test store object throws exception with incorrect file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + obj_file_size = 1234 + path = test_dir + pid.replace("/", "_") + with pytest.raises(NonMatchingObjSize): + store.store_object(pid, path, expected_object_size=obj_file_size) + assert store._count("objects") == 0 + + +def test_store_object_with_obj_file_size_non_integer(store, pids): + """Test store object throws exception with a non integer value (ex. 
a string) + as the file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + obj_file_size = "Bob" + path = test_dir + pid.replace("/", "_") + with pytest.raises(TypeError): + store.store_object(pid, path, expected_object_size=obj_file_size) + + +def test_store_object_with_obj_file_size_zero(store, pids): + """Test store object throws exception with zero as the file size.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + obj_file_size = 0 + path = test_dir + pid.replace("/", "_") + with pytest.raises(ValueError): + store.store_object(pid, path, expected_object_size=obj_file_size) + + +def test_store_object_duplicates_threads(pids, store): + """Test store object thread lock.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + entity = "objects" + + def store_object_wrapper(obj_pid, obj_path): + try: + store.store_object(obj_pid, obj_path) # Call store_object inside the thread + # pylint: disable=W0718 + except Exception as e: + assert ( + type(e).__name__ == "HashStoreRefsAlreadyExists" + or type(e).__name__ == "StoreObjectForPidAlreadyInProgress" + ) + + thread1 = Thread(target=store_object_wrapper, args=(pid, path)) + thread2 = Thread(target=store_object_wrapper, args=(pid, path)) + thread3 = Thread(target=store_object_wrapper, args=(pid, path)) + thread1.start() + thread2.start() + thread3.start() + thread1.join() + thread2.join() + thread3.join() + # One thread will succeed, file count must still be 1 + assert store._count(entity) == 1 + assert store._exists(entity, pids[pid][store.algorithm]) + + +# Note: +# Multiprocessing has been tested through the HashStore client using +# metacat db data from 'test.arcticdata.io'. When time-permitting, +# implement a multiprocessing test + + +def test_store_object_threads_multiple_pids_one_cid_content(pids, store): + """Test store object thread lock and refs files content""" + entity = "objects" + test_dir = "tests/testdata/" + path = test_dir + "jtao.1700.1" + pid_list = ["jtao.1700.1"] + for n in range(0, 5): + pid_list.append(f"dou.test.{n}") + + def store_object_wrapper(obj_pid, obj_path): + store.store_object(obj_pid, obj_path) # Call store_object inside the thread + + thread1 = Thread(target=store_object_wrapper, args=(pid_list[0], path)) + thread2 = Thread(target=store_object_wrapper, args=(pid_list[1], path)) + thread3 = Thread(target=store_object_wrapper, args=(pid_list[2], path)) + thread4 = Thread(target=store_object_wrapper, args=(pid_list[3], path)) + thread5 = Thread(target=store_object_wrapper, args=(pid_list[4], path)) + thread6 = Thread(target=store_object_wrapper, args=(pid_list[5], path)) + thread1.start() + thread2.start() + thread3.start() + thread4.start() + thread5.start() + thread6.start() + thread1.join() + thread2.join() + thread3.join() + thread4.join() + thread5.join() + thread6.join() + # All threads will succeed, file count must still be 1 + assert store._count(entity) == 1 + assert store._exists(entity, pids["jtao.1700.1"][store.algorithm]) + + cid_refs_path = store._get_hashstore_cid_refs_path( + "94f9b6c88f1f458e410c30c351c6384ea42ac1b5ee1f8430d3e365e43b78a38a" + ) + number_of_pids_reffed = 0 + with open(cid_refs_path, "r", encoding="utf8") as ref_file: + # Confirm that pid is not currently already tagged + for pid in ref_file: + if pid.strip() in pid_list: + number_of_pids_reffed += 1 + + assert number_of_pids_reffed == 6 + + +def test_store_object_threads_multiple_pids_one_cid_files(store): + """Test store object with threads produces the expected amount of 
files""" + test_dir = "tests/testdata/" + path = test_dir + "jtao.1700.1" + pid_list = ["jtao.1700.1"] + for n in range(0, 5): + pid_list.append(f"dou.test.{n}") + + def store_object_wrapper(obj_pid, obj_path): + store.store_object(obj_pid, obj_path) # Call store_object inside the thread + + thread1 = Thread(target=store_object_wrapper, args=(pid_list[0], path)) + thread2 = Thread(target=store_object_wrapper, args=(pid_list[1], path)) + thread3 = Thread(target=store_object_wrapper, args=(pid_list[2], path)) + thread4 = Thread(target=store_object_wrapper, args=(pid_list[3], path)) + thread5 = Thread(target=store_object_wrapper, args=(pid_list[4], path)) + thread6 = Thread(target=store_object_wrapper, args=(pid_list[5], path)) + thread1.start() + thread2.start() + thread3.start() + thread4.start() + thread5.start() + thread6.start() + thread1.join() + thread2.join() + thread3.join() + thread4.join() + thread5.join() + thread6.join() + + # Confirm that tmp files do not remain in refs + def folder_has_files(folder_path): + # Iterate over directory contents + for _, _, files in os.walk(folder_path): + if files: # If there are any files in the folder + print(files) + return True + return False + + # Confirm that tmp files do not remain in refs + def get_number_of_files(folder_path): + # Iterate over directory contents + file_count = 0 + for _, _, files in os.walk(folder_path): + if files: # If there are any files in the folder + file_count += len(files) + return file_count + + assert get_number_of_files(store.refs / "pids") == 6 + assert get_number_of_files(store.refs / "cids") == 1 + assert folder_has_files(store.refs / "tmp") is False + + +@slow_test +def test_store_object_interrupt_process(store): + """Test that tmp file created when storing a large object (2GB) and + interrupting the process is cleaned up. + """ + file_size = 2 * 1024 * 1024 * 1024 # 2GB + file_path = store.root + "random_file_2.bin" + + pid = "Testpid" + # Generate a random file with the specified size + with open(file_path, "wb") as file: + remaining_bytes = file_size + buffer_size = 1024 * 1024 # 1MB buffer size (adjust as needed) + + while remaining_bytes > 0: + # Generate random data for the buffer + buffer = bytearray(random.getrandbits(8) for _ in range(buffer_size)) + # Write the buffer to the file + bytes_to_write = min(buffer_size, remaining_bytes) + file.write(buffer[:bytes_to_write]) + remaining_bytes -= bytes_to_write + + interrupt_flag = False + + def store_object_wrapper(obj_pid, path): + print(store.root) + while not interrupt_flag: + store.store_object(obj_pid, path) # Call store_object inside the thread + + # Create/start the thread + thread = threading.Thread(target=store_object_wrapper, args=(pid, file_path)) + thread.start() + + # Sleep for 5 seconds to let the thread run + time.sleep(5) + + # Interrupt the thread + interrupt_flag = True + + # Wait for the thread to finish + thread.join() + + # Confirm no tmp objects found in objects/tmp directory + assert len(os.listdir(store.root + "/objects/tmp")) == 0 + + +@slow_test +def test_store_object_large_file(store): + """Test storing a large object (1GB). 
This test has also been executed with + a 4GB file and the test classes succeeded locally in 296.85s (0:04:56) + """ + # file_size = 4 * 1024 * 1024 * 1024 # 4GB + file_size = 1024 * 1024 * 1024 # 1GB + file_path = store.root + "random_file.bin" + # Generate a random file with the specified size + with open(file_path, "wb") as file: + remaining_bytes = file_size + buffer_size = 1024 * 1024 # 1MB buffer size (adjust as needed) + + while remaining_bytes > 0: + # Generate random data for the buffer + buffer = bytearray(random.getrandbits(8) for _ in range(buffer_size)) + # Write the buffer to the file + bytes_to_write = min(buffer_size, remaining_bytes) + file.write(buffer[:bytes_to_write]) + remaining_bytes -= bytes_to_write + # Store object + pid = "testfile_filehashstore" + object_metadata = store.store_object(pid, file_path) + object_metadata_id = object_metadata.cid + assert object_metadata_id == object_metadata.hex_digests.get("sha256") + + +@slow_test +def test_store_object_sparse_large_file(store): + """Test storing a large object (4GB) via sparse file. This test has also been + executed with a 10GB file and the test classes succeeded locally in 117.03s (0:01:57). + """ + # file_size = 10 * 1024 * 1024 * 1024 # 10GB + file_size = 4 * 1024 * 1024 * 1024 # 4GB + file_path = store.root + "random_file.bin" + # Generate a random file with the specified size + with open(file_path, "wb") as file: + file.seek(file_size - 1) + file.write(b"\0") + # Store object + pid = "testfile_filehashstore" + object_metadata = store.store_object(pid, file_path) + object_metadata_id = object_metadata.cid + assert object_metadata_id == object_metadata.hex_digests.get("sha256") + + +def test_tag_object(pids, store): + """Test tag_object does not throw exception when successful.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(None, path) + store.tag_object(pid, object_metadata.cid) + assert store._count("pid") == 3 + assert store._count("cid") == 3 + + +def test_tag_object_pid_refs_not_found_cid_refs_found(store): + """Test tag_object updates a cid reference file that already exists.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid.replace("/", "_") + # Store data only + object_metadata = store.store_object(None, path) + cid = object_metadata.cid + # Tag object + store.tag_object(pid, cid) + # Tag the cid with another pid + additional_pid = "dou.test.1" + store.tag_object(additional_pid, cid) + + # Read cid file to confirm cid refs file contains the additional pid + line_count = 0 + cid_ref_abs_path = store._get_hashstore_cid_refs_path(cid) + with open(cid_ref_abs_path, "r", encoding="utf8") as f: + for _, line in enumerate(f, start=1): + value = line.strip() + line_count += 1 + assert value == pid or value == additional_pid + assert line_count == 2 + assert store._count("pid") == 2 + assert store._count("cid") == 1 + + +def test_tag_object_hashstore_refs_already_exist(pids, store): + """Confirm that tag throws HashStoreRefsAlreadyExists when refs already exist""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + + with pytest.raises(HashStoreRefsAlreadyExists): + store.tag_object(pid, object_metadata.cid) + + +def test_tag_object_pid_refs_already_exist(pids, store): + """Confirm that tag throws PidRefsAlreadyExistsError when a pid refs already exists""" + test_dir = "tests/testdata/" + for 
pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + cid_refs_file_path = store._get_hashstore_cid_refs_path(object_metadata.cid) + os.remove(cid_refs_file_path) + + with pytest.raises(PidRefsAlreadyExistsError): + store.tag_object(pid, "adifferentcid") + + +def test_delete_if_invalid_object(pids, store): + """Test delete_if_invalid_object does not throw exception given good arguments.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + expected_file_size = object_metadata.obj_size + store.delete_if_invalid_object( + object_metadata, checksum, checksum_algorithm, expected_file_size + ) + assert store._exists("objects", object_metadata.cid) + + +def test_delete_if_invalid_object_supported_other_algo_not_in_default(pids, store): + """Test delete_if_invalid_object does not throw exception when supported add algo is + supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + supported_algo = "sha224" + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = pids[pid][supported_algo] + expected_file_size = object_metadata.obj_size + store.delete_if_invalid_object( + object_metadata, checksum, supported_algo, expected_file_size + ) + assert store._exists("objects", object_metadata.cid) + + +def test_delete_if_invalid_object_exception_incorrect_object_metadata_type(pids, store): + """Test delete_if_invalid_object throws exception when incorrect obj type is given to + object_metadata arg.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + expected_file_size = object_metadata.obj_size + with pytest.raises(ValueError): + store.delete_if_invalid_object( + "not_object_metadata", checksum, checksum_algorithm, expected_file_size + ) + + +def test_delete_if_invalid_object_exception_incorrect_size(pids, store): + """Test delete_if_invalid_object throws exception when incorrect size is supplied and that data + object is deleted as we are storing without a pid.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + + with pytest.raises(NonMatchingObjSize): + store.delete_if_invalid_object( + object_metadata, checksum, checksum_algorithm, 1000 + ) + + assert not store._exists("objects", object_metadata.cid) + + +def test_delete_if_invalid_object_exception_incorrect_size_object_exists(pids, store): + """Test delete_if_invalid_object throws exception when incorrect size is supplied and that data + object is not deleted since it already exists (a cid refs file is present).""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + store.store_object(pid, data=path) + # Store again without pid and wrong object size + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + checksum_algorithm = store.algorithm + + with 
pytest.raises(NonMatchingObjSize): + store.delete_if_invalid_object( + object_metadata, checksum, checksum_algorithm, 1000 + ) + + assert store._exists("objects", object_metadata.cid) + assert store._count("tmp") == 0 + + +def test_delete_if_invalid_object_exception_incorrect_checksum(pids, store): + """Test delete_if_invalid_object throws exception when incorrect checksum is supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum_algorithm = store.algorithm + expected_file_size = object_metadata.obj_size + + with pytest.raises(NonMatchingChecksum): + store.delete_if_invalid_object( + object_metadata, "abc123", checksum_algorithm, expected_file_size + ) + + assert not store._exists("objects", object_metadata.cid) + + +def test_delete_if_invalid_object_exception_incorrect_checksum_algo(pids, store): + """Test delete_if_invalid_object throws exception when unsupported algorithm is supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + expected_file_size = object_metadata.obj_size + with pytest.raises(UnsupportedAlgorithm): + store.delete_if_invalid_object( + object_metadata, checksum, "md2", expected_file_size + ) + + assert store._exists("objects", object_metadata.cid) + assert store._count("tmp") == 0 + + +def test_delete_if_invalid_object_exception_supported_other_algo_bad_checksum( + pids, store +): + """Test delete_if_invalid_object throws exception when incorrect checksum is supplied.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(data=path) + checksum = object_metadata.hex_digests.get(store.algorithm) + expected_file_size = object_metadata.obj_size + with pytest.raises(NonMatchingChecksum): + store.delete_if_invalid_object( + object_metadata, checksum, "sha224", expected_file_size + ) + + assert not store._exists("objects", object_metadata.cid) + + +def test_store_metadata(pids, store): + """Test store_metadata.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + stored_metadata_path = store.store_metadata(pid, syspath, format_id) + # Manually calculate expected path + metadata_directory = store._computehash(pid) + metadata_document_name = store._computehash(pid + format_id) + rel_path = Path(*store._shard(metadata_directory)) + full_path = ( + store._get_store_path("metadata") / rel_path / metadata_document_name + ) + assert stored_metadata_path == str(full_path) + assert store._count("metadata") == 3 + + +def test_store_metadata_one_pid_multiple_docs_correct_location(store): + """Test store_metadata for a pid with multiple metadata documents.""" + test_dir = "tests/testdata/" + entity = "metadata" + pid = "jtao.1700.1" + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + metadata_directory = store._computehash(pid) + rel_path = Path(*store._shard(metadata_directory)) + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + format_id3 = "http://ns.dataone.org/service/types/v3.0" + format_id4 = "http://ns.dataone.org/service/types/v4.0" + stored_metadata_path = store.store_metadata(pid, syspath, 
format_id) + stored_metadata_path3 = store.store_metadata(pid, syspath, format_id3) + stored_metadata_path4 = store.store_metadata(pid, syspath, format_id4) + + metadata_document_name = store._computehash(pid + format_id) + metadata_document_name3 = store._computehash(pid + format_id3) + metadata_document_name4 = store._computehash(pid + format_id4) + full_path = store._get_store_path("metadata") / rel_path / metadata_document_name + full_path3 = store._get_store_path("metadata") / rel_path / metadata_document_name3 + full_path4 = store._get_store_path("metadata") / rel_path / metadata_document_name4 + + assert stored_metadata_path == str(full_path) + assert stored_metadata_path3 == str(full_path3) + assert stored_metadata_path4 == str(full_path4) + assert store._count(entity) == 3 + + +def test_store_metadata_default_format_id(pids, store): + """Test store_metadata returns expected id when storing with default format_id.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + stored_metadata_path = store.store_metadata(pid, syspath) + # Manually calculate expected path + metadata_directory = store._computehash(pid) + metadata_document_name = store._computehash(pid + format_id) + rel_path = Path(*store._shard(metadata_directory)) + full_path = ( + store._get_store_path("metadata") / rel_path / metadata_document_name + ) + assert stored_metadata_path == str(full_path) + + +def test_store_metadata_files_string(pids, store): + """Test store_metadata with a string object to the metadata.""" + test_dir = "tests/testdata/" + entity = "metadata" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath_string = str(Path(test_dir) / filename) + stored_metadata_path = store.store_metadata(pid, syspath_string, format_id) + assert store._exists(entity, stored_metadata_path) + assert store._count(entity) == 3 + + +def test_store_metadata_files_input_stream(pids, store): + """Test store_metadata with a stream to the metadata.""" + test_dir = "tests/testdata/" + entity = "metadata" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath_string = str(Path(test_dir) / filename) + syspath_stream = io.open(syspath_string, "rb") + _stored_metadata_path = store.store_metadata(pid, syspath_stream, format_id) + syspath_stream.close() + assert store._count(entity) == 3 + + +def test_store_metadata_pid_empty(store): + """Test store_metadata raises error with an empty string as the pid.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = "" + filename = pid.replace("/", "_") + ".xml" + syspath_string = str(Path(test_dir) / filename) + with pytest.raises(ValueError): + store.store_metadata(pid, syspath_string, format_id) + + +def test_store_metadata_pid_empty_spaces(store): + """Test store_metadata raises error with empty spaces as the pid.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = " " + filename = pid.replace("/", "_") + ".xml" + syspath_string = str(Path(test_dir) / filename) + with pytest.raises(ValueError): + store.store_metadata(pid, syspath_string, format_id) + + +def test_store_metadata_pid_format_id_spaces(store): + """Test 
store_metadata raises error with empty spaces as the format_id.""" + test_dir = "tests/testdata/" + format_id = " " + pid = "jtao.1700.1" + filename = pid.replace("/", "_") + ".xml" + syspath_string = str(Path(test_dir) / filename) + with pytest.raises(ValueError): + store.store_metadata(pid, syspath_string, format_id) + + +def test_store_metadata_metadata_empty(store): + """Test store_metadata raises error with empty spaces as the metadata path.""" + pid = "jtao.1700.1" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + syspath_string = " " + with pytest.raises(TypeError): + store.store_metadata(pid, syspath_string, format_id) + + +def test_store_metadata_metadata_none(store): + """Test store_metadata raises error with empty None metadata path.""" + pid = "jtao.1700.1" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + syspath_string = None + with pytest.raises(TypeError): + store.store_metadata(pid, syspath_string, format_id) + + +def test_store_metadata_metadata_path(pids, store): + """Test store_metadata returns expected path to metadata document.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _object_metadata = store.store_object(pid, path) + stored_metadata_path = store.store_metadata(pid, syspath, format_id) + metadata_path = store._get_hashstore_metadata_path(stored_metadata_path) + assert Path(stored_metadata_path) == metadata_path + + +def test_store_metadata_thread_lock(store): + """Test store_metadata thread lock.""" + test_dir = "tests/testdata/" + entity = "metadata" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = "jtao.1700.1" + path = test_dir + pid + filename = pid + ".xml" + syspath = Path(test_dir) / filename + _object_metadata = store.store_object(pid, path) + store.store_metadata(pid, syspath, format_id) + # Start threads + thread1 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) + thread2 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) + thread3 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) + thread4 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) + thread1.start() + thread2.start() + thread3.start() + thread4.start() + thread1.join() + thread2.join() + thread3.join() + thread4.join() + assert store._count(entity) == 1 + + +def test_retrieve_object(pids, store): + """Test retrieve_object returns a stream to the correct object data.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + obj_stream = store.retrieve_object(pid) + sha256_hex = store._computehash(obj_stream) + obj_stream.close() + assert sha256_hex == object_metadata.hex_digests.get("sha256") + + +def test_retrieve_object_pid_empty(store): + """Test retrieve_object raises error when supplied with empty pid.""" + pid = " " + with pytest.raises(ValueError): + store.retrieve_object(pid) + + +def test_retrieve_object_pid_invalid(store): + """Test retrieve_object raises error when supplied with bad pid.""" + pid = "jtao.1700.1" + pid_does_not_exist = pid + "test" + with pytest.raises(PidRefsDoesNotExist): + store.retrieve_object(pid_does_not_exist) + + +def test_retrieve_metadata(store): + """Test retrieve_metadata returns a stream to the 
correct metadata.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = "jtao.1700.1" + filename = pid + ".xml" + syspath = Path(test_dir) / filename + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + metadata_stream = store.retrieve_metadata(pid, format_id) + metadata_content = metadata_stream.read().decode("utf-8") + metadata_stream.close() + metadata = syspath.read_bytes() + assert metadata.decode("utf-8") == metadata_content + + +def test_retrieve_metadata_default_format_id(store): + """Test retrieve_metadata retrieves expected metadata without a format_id.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + filename = pid + ".xml" + syspath = Path(test_dir) / filename + _stored_metadata_path = store.store_metadata(pid, syspath) + metadata_stream = store.retrieve_metadata(pid) + metadata_content = metadata_stream.read().decode("utf-8") + metadata_stream.close() + metadata = syspath.read_bytes() + assert metadata.decode("utf-8") == metadata_content + + +def test_retrieve_metadata_bytes_pid_invalid(store): + """Test retrieve_metadata raises exception when supplied with pid with no system metadata.""" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid_does_not_exist = "jtao.1700.1.metadata.does.not.exist" + with pytest.raises(ValueError): + store.retrieve_metadata(pid_does_not_exist, format_id) + + +def test_retrieve_metadata_bytes_pid_empty(store): + """Test retrieve_metadata raises exception when supplied with empty pid.""" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = " " + with pytest.raises(ValueError): + store.retrieve_metadata(pid, format_id) + + +def test_retrieve_metadata_format_id_empty(store): + """Test retrieve_metadata raises error when supplied with an empty format_id.""" + format_id = "" + pid = "jtao.1700.1" + with pytest.raises(ValueError): + store.retrieve_metadata(pid, format_id) + + +def test_retrieve_metadata_format_id_empty_spaces(store): + """Test retrieve_metadata raises exception when supplied with empty spaces as the format_id.""" + format_id = " " + pid = "jtao.1700.1" + with pytest.raises(ValueError): + store.retrieve_metadata(pid, format_id) + + +def test_delete_object_object_deleted(pids, store): + """Test delete_object successfully deletes object.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _object_metadata = store.store_object(pid, path) + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + store.delete_object(pid) + assert store._count("objects") == 0 + + +def test_delete_object_metadata_deleted(pids, store): + """Test delete_object successfully deletes associated metadata files.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _object_metadata = store.store_object(pid, path) + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + store.delete_object(pid) + assert store._count("metadata") == 0 + + +def test_delete_object_refs_files_deleted(pids, store): + """Test delete_object successfully deletes refs files.""" + test_dir = "tests/testdata/" + 
format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _object_metadata = store.store_object(pid, path) + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + store.delete_object(pid) + assert store._count("pid") == 0 + assert store._count("cid") == 0 + + +def test_delete_object_pid_refs_file_deleted(pids, store): + """Test delete_object deletes the associated pid refs file for the object.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _object_metadata = store.store_object(pid, path) + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + store.delete_object(pid) + pid_refs_file_path = store._get_hashstore_pid_refs_path(pid) + assert not os.path.exists(pid_refs_file_path) + + +def test_delete_object_cid_refs_file_deleted(pids, store): + """Test delete_object deletes the associated cid refs file for the object.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + object_metadata = store.store_object(pid, path) + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + cid = object_metadata.cid + store.delete_object(pid) + cid_refs_file_path = store._get_hashstore_cid_refs_path(cid) + assert not os.path.exists(cid_refs_file_path) + + +def test_delete_object_cid_refs_file_with_pid_refs_remaining(pids, store): + """Test delete_object does not delete the cid refs file that still contains refs.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + path = test_dir + pid.replace("/", "_") + object_metadata = store.store_object(pid, path) + cid = object_metadata.cid + cid_refs_abs_path = store._get_hashstore_cid_refs_path(cid) + store._update_refs_file(cid_refs_abs_path, "dou.test.1", "add") + store.delete_object(pid) + cid_refs_file_path = store._get_hashstore_cid_refs_path(cid) + assert os.path.exists(cid_refs_file_path) + assert store._count("cid") == 3 + + +def test_delete_object_pid_empty(store): + """Test delete_object raises error when empty pid supplied.""" + pid = " " + with pytest.raises(ValueError): + store.delete_object(pid) + + +def test_delete_object_pid_none(store): + """Test delete_object raises error when pid is 'None'.""" + pid = None + with pytest.raises(ValueError): + store.delete_object(pid) + + +def test_delete_metadata(pids, store): + """Test delete_metadata successfully deletes metadata.""" + test_dir = "tests/testdata/" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + store.delete_metadata(pid, format_id) + assert store._count("metadata") == 0 + + +def test_delete_metadata_one_pid_multiple_metadata_documents(store): + """Test delete_metadata for a pid with multiple metadata documents deletes + all associated metadata files as expected.""" + test_dir = "tests/testdata/" + entity = "metadata" + pid = "jtao.1700.1" + filename = 
pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + format_id3 = "http://ns.dataone.org/service/types/v3.0" + format_id4 = "http://ns.dataone.org/service/types/v4.0" + _stored_metadata_path = store.store_metadata(pid, syspath, format_id) + _stored_metadata_path3 = store.store_metadata(pid, syspath, format_id3) + _stored_metadata_path4 = store.store_metadata(pid, syspath, format_id4) + store.delete_metadata(pid) + assert store._count(entity) == 0 + + +def test_delete_metadata_specific_pid_multiple_metadata_documents(store): + """Test delete_metadata for a pid with multiple metadata documents deletes + only the specified metadata file.""" + test_dir = "tests/testdata/" + entity = "metadata" + pid = "jtao.1700.1" + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + format_id3 = "http://ns.dataone.org/service/types/v3.0" + format_id4 = "http://ns.dataone.org/service/types/v4.0" + stored_metadata_path = store.store_metadata(pid, syspath, format_id) + stored_metadata_path3 = store.store_metadata(pid, syspath, format_id3) + _stored_metadata_path4 = store.store_metadata(pid, syspath, format_id4) + store.delete_metadata(pid, format_id4) + assert store._count(entity) == 2 + assert os.path.exists(stored_metadata_path) + assert os.path.exists(stored_metadata_path3) + + +def test_delete_metadata_does_not_exist(pids, store): + """Test delete_metadata does not throw exception when called to delete + metadata that does not exist.""" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + for pid in pids.keys(): + store.delete_metadata(pid, format_id) + + +def test_delete_metadata_default_format_id(store, pids): + """Test delete_metadata deletes successfully with default format_id.""" + test_dir = "tests/testdata/" + for pid in pids.keys(): + filename = pid.replace("/", "_") + ".xml" + syspath = Path(test_dir) / filename + _stored_metadata_path = store.store_metadata(pid, syspath) + store.delete_metadata(pid) + assert store._count("metadata") == 0 + + +def test_delete_metadata_pid_empty(store): + """Test delete_metadata raises error when empty pid supplied.""" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = " " + with pytest.raises(ValueError): + store.delete_metadata(pid, format_id) + + +def test_delete_metadata_pid_none(store): + """Test delete_metadata raises error when pid is 'None'.""" + format_id = "https://ns.dataone.org/service/types/v2.0#SystemMetadata" + pid = None + with pytest.raises(ValueError): + store.delete_metadata(pid, format_id) + + +def test_delete_metadata_format_id_empty(store): + """Test delete_metadata raises error when empty format_id supplied.""" + format_id = " " + pid = "jtao.1700.1" + with pytest.raises(ValueError): + store.delete_metadata(pid, format_id) + + +def test_get_hex_digest(store): + """Test get_hex_digest for expected value.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + _object_metadata = store.store_object(pid, path) + sha3_256_hex_digest = ( + "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" + ) + sha3_256_get = store.get_hex_digest(pid, "sha3_256") + assert sha3_256_hex_digest == sha3_256_get + + +def test_get_hex_digest_pid_not_found(store): + """Test get_hex_digest raises error when supplied with bad pid.""" + pid = "jtao.1700.1" + pid_does_not_exist = pid + "test" 
+ algorithm = "sha256" + with pytest.raises(PidRefsDoesNotExist): + store.get_hex_digest(pid_does_not_exist, algorithm) + + +def test_get_hex_digest_pid_unsupported_algorithm(store): + """Test get_hex_digest raises error when supplied with unsupported algorithm.""" + test_dir = "tests/testdata/" + pid = "jtao.1700.1" + path = test_dir + pid + filename = pid + ".xml" + syspath = Path(test_dir) / filename + syspath.read_bytes() + _object_metadata = store.store_object(pid, path) + algorithm = "sm3" + with pytest.raises(UnsupportedAlgorithm): + store.get_hex_digest(pid, algorithm) + + +def test_get_hex_digest_pid_empty(store): + """Test get_hex_digest raises error when supplied pid is empty.""" + pid = " " + algorithm = "sm3" + with pytest.raises(ValueError): + store.get_hex_digest(pid, algorithm) + + +def test_get_hex_digest_pid_none(store): + """Test get_hex_digest raises error when supplied pid is 'None'.""" + pid = None + algorithm = "sm3" + with pytest.raises(ValueError): + store.get_hex_digest(pid, algorithm) + + +def test_get_hex_digest_algorithm_empty(store): + """Test get_hex_digest raises error when supplied algorithm is empty.""" + pid = "jtao.1700.1" + algorithm = " " + with pytest.raises(ValueError): + store.get_hex_digest(pid, algorithm) + + +def test_get_hex_digest_algorithm_none(store): + """Test get_hex_digest raises error when supplied algorithm is 'None'.""" + pid = "jtao.1700.1" + algorithm = None + with pytest.raises(ValueError): + store.get_hex_digest(pid, algorithm) + + +def test_store_and_delete_objects_100_pids_1_cid(store): + """Test that deleting an object that is tagged with 100 pids successfully + deletes all related files""" + test_dir = "tests/testdata/" + path = test_dir + "jtao.1700.1" + refs_pids_path = store.root / "refs" / "pids" + refs_cids_path = store.root / "refs" / "cids" + # Store + upper_limit = 101 + for i in range(1, upper_limit): + pid_modified = f"dou.test.{str(i)}" + store.store_object(pid_modified, path) + assert sum([len(files) for _, _, files in os.walk(refs_pids_path)]) == 100 + assert sum([len(files) for _, _, files in os.walk(refs_cids_path)]) == 1 + assert store._count("objects") == 1 + # Delete + for i in range(1, upper_limit): + pid_modified = f"dou.test.{str(i)}" + store.delete_object(pid_modified) + assert sum([len(files) for _, _, files in os.walk(refs_pids_path)]) == 0 + assert sum([len(files) for _, _, files in os.walk(refs_cids_path)]) == 0 + assert store._count("objects") == 0 + + +def test_store_and_delete_object_300_pids_1_cid_threads(store): + """Test store object thread lock.""" + + def store_object_wrapper(pid_var): + try: + test_dir = "tests/testdata/" + path = test_dir + "jtao.1700.1" + upper_limit = 101 + for i in range(1, upper_limit): + pid_modified = f"dou.test.{pid_var}.{str(i)}" + store.store_object(pid_modified, path) + # pylint: disable=W0718 + except Exception as e: + print(e) + + # Store + thread1 = Thread(target=store_object_wrapper, args=("matt",)) + thread2 = Thread(target=store_object_wrapper, args=("matthew",)) + thread3 = Thread(target=store_object_wrapper, args=("matthias",)) + thread1.start() + thread2.start() + thread3.start() + thread1.join() + thread2.join() + thread3.join() + + def delete_object_wrapper(pid_var): + try: + upper_limit = 101 + for i in range(1, upper_limit): + pid_modified = f"dou.test.{pid_var}.{str(i)}" + store.delete_object(pid_modified) + # pylint: disable=W0718 + except Exception as e: + print(e) + + # Delete + thread4 = Thread(target=delete_object_wrapper, args=("matt",)) + thread5 
= Thread(target=delete_object_wrapper, args=("matthew",)) + thread6 = Thread(target=delete_object_wrapper, args=("matthias",)) + thread4.start() + thread5.start() + thread6.start() + thread4.join() + thread5.join() + thread6.join() + + refs_pids_path = store.root / "refs" / "pids" + refs_cids_path = store.root / "refs" / "cids" + assert sum([len(files) for _, _, files in os.walk(refs_pids_path)]) == 0 + assert sum([len(files) for _, _, files in os.walk(refs_cids_path)]) == 0 + assert store._count("objects") == 0 diff --git a/tests/test_filehashstore.py b/tests/test_filehashstore.py deleted file mode 100644 index a2f0fdfe..00000000 --- a/tests/test_filehashstore.py +++ /dev/null @@ -1,879 +0,0 @@ -"""Test module for FileHashStore core, utility and supporting methods""" -import io -import os -from pathlib import Path -import pytest -from hashstore.filehashstore import FileHashStore - - -def test_pids_length(pids): - """Ensure test harness pids are present.""" - assert len(pids) == 3 - - -def test_init_directories_created(store): - """Confirm that object and metadata directories have been created.""" - assert os.path.exists(store.root) - assert os.path.exists(store.objects) - assert os.path.exists(store.objects + "/tmp") - assert os.path.exists(store.metadata) - assert os.path.exists(store.metadata + "/tmp") - - -def test_init_existing_store_incorrect_algorithm_format(store): - """Confirm that exception is thrown when store_algorithm is not a DataONE controlled value""" - properties = { - "store_path": store.root, - "store_depth": 3, - "store_width": 2, - "store_algorithm": "sha256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", - } - with pytest.raises(ValueError): - FileHashStore(properties) - - -def test_init_existing_store_correct_algorithm_format(store): - """Confirm second instance of HashStore with DataONE controlled value""" - properties = { - "store_path": store.root, - "store_depth": 3, - "store_width": 2, - "store_algorithm": "SHA-256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", - } - hashstore_instance = FileHashStore(properties) - assert isinstance(hashstore_instance, FileHashStore) - - -def test_init_write_properties_hashstore_yaml_exists(store): - """Verify config file present in store root directory.""" - assert os.path.exists(store.hashstore_configuration_yaml) - - -def test_init_with_existing_hashstore_mismatched_config_depth(store): - """Test init with existing HashStore raises ValueError with mismatching properties.""" - properties = { - "store_path": store.root, - "store_depth": 1, - "store_width": 2, - "store_algorithm": "SHA-256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", - } - with pytest.raises(ValueError): - FileHashStore(properties) - - -def test_init_with_existing_hashstore_mismatched_config_width(store): - """Test init with existing HashStore raises ValueError with mismatching properties.""" - properties = { - "store_path": store.root, - "store_depth": 3, - "store_width": 1, - "store_algorithm": "SHA-256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", - } - with pytest.raises(ValueError): - FileHashStore(properties) - - -def test_init_with_existing_hashstore_mismatched_config_algo(store): - """Test init with existing HashStore raises ValueError with mismatching properties.""" - properties = { - "store_path": store.root, - "store_depth": 3, - "store_width": 1, - "store_algorithm": "SHA-512", - "store_metadata_namespace": 
"http://ns.dataone.org/service/types/v2.0", - } - with pytest.raises(ValueError): - FileHashStore(properties) - - -def test_init_with_existing_hashstore_mismatched_config_metadata_ns(store): - """Test init with existing HashStore raises ValueError with mismatching properties.""" - properties = { - "store_path": store.root, - "store_depth": 3, - "store_width": 1, - "store_algorithm": "SHA-512", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v5.0", - } - with pytest.raises(ValueError): - FileHashStore(properties) - - -def test_init_with_existing_hashstore_missing_yaml(store, pids): - """Test init with existing store raises FileNotFoundError when hashstore.yaml - not found but objects exist.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - store.put_object(pid, path) - os.remove(store.hashstore_configuration_yaml) - properties = { - "store_path": store.root, - "store_depth": 3, - "store_width": 2, - "store_algorithm": "SHA-256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", - } - with pytest.raises(FileNotFoundError): - FileHashStore(properties) - - -def test_load_properties(store): - """Verify dictionary returned from load_properties matches initialization.""" - hashstore_yaml_dict = store.load_properties() - assert hashstore_yaml_dict.get("store_depth") == 3 - assert hashstore_yaml_dict.get("store_width") == 2 - assert hashstore_yaml_dict.get("store_algorithm") == "SHA-256" - assert ( - hashstore_yaml_dict.get("store_metadata_namespace") - == "http://ns.dataone.org/service/types/v2.0" - ) - - -def test_load_properties_hashstore_yaml_missing(store): - """Confirm FileNotFoundError is raised when hashstore.yaml does not exist.""" - os.remove(store.hashstore_configuration_yaml) - with pytest.raises(FileNotFoundError): - store.load_properties() - - -def test_validate_properties(store): - """Confirm properties validated when all key/values are supplied.""" - properties = { - "store_path": "/etc/test", - "store_depth": 3, - "store_width": 2, - "store_algorithm": "SHA-256", - "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0", - } - # pylint: disable=W0212 - assert store._validate_properties(properties) - - -def test_validate_properties_missing_key(store): - """Confirm exception raised when key missing in properties.""" - properties = { - "store_path": "/etc/test", - "store_depth": 3, - "store_width": 2, - "store_algorithm": "SHA-256", - } - with pytest.raises(KeyError): - # pylint: disable=W0212 - store._validate_properties(properties) - - -def test_validate_properties_key_value_is_none(store): - """Confirm exception raised when value from key is 'None'.""" - properties = { - "store_path": "/etc/test", - "store_depth": 3, - "store_width": 2, - "store_algorithm": "SHA-256", - "store_metadata_namespace": None, - } - with pytest.raises(ValueError): - # pylint: disable=W0212 - store._validate_properties(properties) - - -def test_validate_properties_incorrect_type(store): - """Confirm exception raised when key missing in properties.""" - properties = "etc/filehashstore/hashstore.yaml" - with pytest.raises(ValueError): - # pylint: disable=W0212 - store._validate_properties(properties) - - -def test_set_default_algorithms_missing_yaml(store, pids): - """Confirm set_default_algorithms raises FileNotFoundError when hashstore.yaml - not found.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - store.put_object(pid, path) - 
os.remove(store.hashstore_configuration_yaml) - with pytest.raises(FileNotFoundError): - # pylint: disable=W0212 - store._set_default_algorithms() - - -def test_put_object_files_path(pids, store): - """Test put objects with path object.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = Path(test_dir) / pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_id = object_metadata.id - assert store.exists(entity, object_metadata_id) - - -def test_put_object_files_string(pids, store): - """Test put objects with string.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_id = object_metadata.id - assert store.exists(entity, object_metadata_id) - - -def test_put_object_files_stream(pids, store): - """Test put objects with stream.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - object_metadata = store.put_object(pid, input_stream) - input_stream.close() - object_metadata_id = object_metadata.id - assert store.exists(entity, object_metadata_id) - assert store.count(entity) == 3 - - -def test_put_object_cid(pids, store): - """Check put returns correct id.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_id = object_metadata.id - assert object_metadata_id == pids[pid]["object_cid"] - - -def test_put_object_file_size(pids, store): - """Check put returns correct file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_size = object_metadata.obj_size - assert object_size == pids[pid]["file_size_bytes"] - - -def test_put_object_hex_digests(pids, store): - """Check put successfully generates hex digests dictionary.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_hex_digests = object_metadata.hex_digests - assert object_metadata_hex_digests.get("md5") == pids[pid]["md5"] - assert object_metadata_hex_digests.get("sha1") == pids[pid]["sha1"] - assert object_metadata_hex_digests.get("sha256") == pids[pid]["sha256"] - assert object_metadata_hex_digests.get("sha384") == pids[pid]["sha384"] - assert object_metadata_hex_digests.get("sha512") == pids[pid]["sha512"] - - -def test_put_object_additional_algorithm(pids, store): - """Check put_object returns additional algorithm in hex digests.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - algo = "sha224" - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path, additional_algorithm=algo) - hex_digests = object_metadata.hex_digests - sha224_hash = hex_digests.get(algo) - assert sha224_hash == pids[pid][algo] - - -def test_put_object_with_correct_checksums(pids, store): - """Check put_object success with valid checksum and checksum algorithm supplied.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - algo = "sha224" - algo_checksum = pids[pid][algo] - path = test_dir + pid.replace("/", "_") - store.put_object(pid, path, checksum=algo_checksum, checksum_algorithm=algo) - assert store.count("objects") == 3 - - -def 
test_put_object_with_incorrect_checksum(pids, store): - """Check put fails when bad checksum supplied.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - algo = "sha224" - algo_checksum = "badChecksumValue" - path = test_dir + pid.replace("/", "_") - with pytest.raises(ValueError): - store.put_object(pid, path, checksum=algo_checksum, checksum_algorithm=algo) - assert store.count(entity) == 0 - - -def test_move_and_get_checksums_id(pids, store): - """Test move returns correct id.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - ( - move_id, - _, - _, - ) = store._move_and_get_checksums(pid, input_stream) - input_stream.close() - object_cid = store.get_sha256_hex_digest(pid) - assert move_id == object_cid - - -def test_move_and_get_checksums_file_size(pids, store): - """Test move returns correct file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - ( - _, - tmp_file_size, - _, - ) = store._move_and_get_checksums(pid, input_stream) - input_stream.close() - assert tmp_file_size == pids[pid]["file_size_bytes"] - - -def test_move_and_get_checksums_hex_digests(pids, store): - """Test move returns correct hex digests.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - ( - _, - _, - hex_digests, - ) = store._move_and_get_checksums(pid, input_stream) - input_stream.close() - assert hex_digests.get("md5") == pids[pid]["md5"] - assert hex_digests.get("sha1") == pids[pid]["sha1"] - assert hex_digests.get("sha256") == pids[pid]["sha256"] - assert hex_digests.get("sha384") == pids[pid]["sha384"] - assert hex_digests.get("sha512") == pids[pid]["sha512"] - - -def test_move_and_get_checksums_duplicates_raises_error(pids, store): - """Test move does not store duplicate objects and raises error.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - store._move_and_get_checksums(pid, input_stream) - input_stream.close() - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - with pytest.raises(FileExistsError): - # pylint: disable=W0212 - store._move_and_get_checksums(pid, input_stream) - input_stream.close() - assert store.count(entity) == 3 - - -def test_move_and_get_checksums_file_size_raises_error(pids, store): - """Test move and get checksum raises error with incorrect file size""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - with pytest.raises(ValueError): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - incorrect_file_size = 1000 - # pylint: disable=W0212 - ( - _, - _, - _, - _, - ) = store._move_and_get_checksums( - pid, input_stream, file_size_to_validate=incorrect_file_size - ) - input_stream.close() - - -def test_mktempfile_additional_algo(store): - """Test _mktempfile returns correct hex digests for additional algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - input_stream = io.open(path, "rb") - checksum_algo = "sha3_256" - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - # pylint: disable=W0212 - 
hex_digests, _, _ = store._mktmpfile( - input_stream, additional_algorithm=checksum_algo - ) - input_stream.close() - assert hex_digests.get("sha3_256") == checksum_correct - - -def test_mktempfile_checksum_algo(store): - """Test _mktempfile returns correct hex digests for checksum algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - input_stream = io.open(path, "rb") - checksum_algo = "sha3_256" - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - # pylint: disable=W0212 - hex_digests, _, _ = store._mktmpfile(input_stream, checksum_algorithm=checksum_algo) - input_stream.close() - assert hex_digests.get("sha3_256") == checksum_correct - - -def test_mktempfile_checksum_and_additional_algo(store): - """Test _mktempfile returns correct hex digests for checksum algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - input_stream = io.open(path, "rb") - additional_algo = "sha224" - additional_algo_checksum = ( - "9b3a96f434f3c894359193a63437ef86fbd5a1a1a6cc37f1d5013ac1" - ) - checksum_algo = "sha3_256" - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - # pylint: disable=W0212 - hex_digests, _, _ = store._mktmpfile( - input_stream, - additional_algorithm=additional_algo, - checksum_algorithm=checksum_algo, - ) - input_stream.close() - assert hex_digests.get("sha3_256") == checksum_correct - assert hex_digests.get("sha224") == additional_algo_checksum - - -def test_mktempfile_checksum_and_additional_algo_duplicate(store): - """Test _mktempfile succeeds with duplicate algorithms (de-duplicates).""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - input_stream = io.open(path, "rb") - additional_algo = "sha224" - checksum_algo = "sha224" - checksum_correct = "9b3a96f434f3c894359193a63437ef86fbd5a1a1a6cc37f1d5013ac1" - # pylint: disable=W0212 - hex_digests, _, _ = store._mktmpfile( - input_stream, - additional_algorithm=additional_algo, - checksum_algorithm=checksum_algo, - ) - input_stream.close() - assert hex_digests.get("sha224") == checksum_correct - - -def test_mktempfile_file_size(pids, store): - """Test _mktempfile returns correct file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - _, _, tmp_file_size = store._mktmpfile(input_stream) - input_stream.close() - assert tmp_file_size == pids[pid]["file_size_bytes"] - - -def test_mktempfile_hex_digests(pids, store): - """Test _mktempfile returns correct hex digests.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - hex_digests, _, _ = store._mktmpfile(input_stream) - input_stream.close() - assert hex_digests.get("md5") == pids[pid]["md5"] - assert hex_digests.get("sha1") == pids[pid]["sha1"] - assert hex_digests.get("sha256") == pids[pid]["sha256"] - assert hex_digests.get("sha384") == pids[pid]["sha384"] - assert hex_digests.get("sha512") == pids[pid]["sha512"] - - -def test_mktempfile_tmpfile_object(pids, store): - """Test _mktempfile creates file successfully.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - # pylint: disable=W0212 - _, tmp_file_name, _ = store._mktmpfile(input_stream) - input_stream.close() - assert 
os.path.isfile(tmp_file_name) is True - - -def test_mktempfile_with_unsupported_algorithm(pids, store): - """Test _mktempfile raises error when bad algorithm supplied.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - algo = "md2" - with pytest.raises(ValueError): - # pylint: disable=W0212 - _, _, _ = store._mktmpfile(input_stream, additional_algorithm=algo) - with pytest.raises(ValueError): - # pylint: disable=W0212 - _, _, _ = store._mktmpfile(input_stream, checksum_algorithm=algo) - input_stream.close() - - -def test_put_metadata_with_path(pids, store): - """Test put_metadata with path object.""" - entity = "metadata" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - metadata_cid = store.store_metadata(pid, syspath, format_id) - assert store.exists(entity, metadata_cid) - assert store.count(entity) == 3 - - -def test_put_metadata_with_string(pids, store): - """Test_put metadata with string.""" - entity = "metadata" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - filename = pid.replace("/", "_") + ".xml" - syspath = str(Path(test_dir) / filename) - metadata_cid = store.store_metadata(pid, syspath, format_id) - assert store.exists(entity, metadata_cid) - assert store.count(entity) == 3 - - -def test_put_metadata_cid(pids, store): - """Test put metadata returns correct id.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - metadata_cid = store.store_metadata(pid, syspath, format_id) - assert metadata_cid == pids[pid]["metadata_cid"] - - -def test_mktmpmetadata(pids, store): - """Test mktmpmetadata creates tmpFile.""" - test_dir = "tests/testdata/" - entity = "metadata" - for pid in pids.keys(): - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - sys_stream = io.open(syspath, "rb") - # pylint: disable=W0212 - tmp_name = store._mktmpmetadata(sys_stream) - sys_stream.close() - assert store.exists(entity, tmp_name) - - -def test_clean_algorithm(store): - """Check that algorithm values get formatted as expected.""" - algorithm_underscore = "sha_256" - algorithm_hyphen = "sha-256" - algorithm_other_hyphen = "sha3-256" - cleaned_algo_underscore = store.clean_algorithm(algorithm_underscore) - cleaned_algo_hyphen = store.clean_algorithm(algorithm_hyphen) - cleaned_algo_other_hyphen = store.clean_algorithm(algorithm_other_hyphen) - assert cleaned_algo_underscore == "sha256" - assert cleaned_algo_hyphen == "sha256" - assert cleaned_algo_other_hyphen == "sha3_256" - - -def test_computehash(pids, store): - """Test to check computehash method.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - obj_stream = io.open(path, "rb") - obj_sha256_hash = store.computehash(obj_stream, "sha256") - obj_stream.close() - assert pids[pid]["sha256"] == obj_sha256_hash - - -def test_get_store_path_object(store): - """Check get_store_path for object path.""" - # pylint: disable=W0212 - path_objects = store.get_store_path("objects") - path_objects_string = str(path_objects) - assert path_objects_string.endswith("/metacat/objects") - - -def test_get_store_path_metadata(store): - """Check 
get_store_path for metadata path.""" - # pylint: disable=W0212 - path_metadata = store.get_store_path("metadata") - path_metadata_string = str(path_metadata) - assert path_metadata_string.endswith("/metacat/metadata") - - -def test_exists_with_object_metadata_id(pids, store): - """Test exists method with an absolute file path.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - assert store.exists(entity, object_metadata.id) - - -def test_exists_with_sharded_path(pids, store): - """Test exists method with a sharded path (relative path).""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_shard = store.shard(object_metadata.id) - object_metadata_shard_path = "/".join(object_metadata_shard) - assert store.exists(entity, object_metadata_shard_path) - - -def test_exists_with_nonexistent_file(store): - """Test exists method with a nonexistent file.""" - entity = "objects" - non_existent_file = "tests/testdata/filedoesnotexist" - does_not_exist = store.exists(entity, non_existent_file) - assert does_not_exist is False - - -def test_shard(store): - """Test shard creates list.""" - hash_id = "0d555ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e" - predefined_list = [ - "0d", - "55", - "5e", - "d77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e", - ] - sharded_list = store.shard(hash_id) - assert predefined_list == sharded_list - - -def test_open_objects(pids, store): - """Test open returns a stream.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_id = object_metadata.id - io_buffer = store.open(entity, object_metadata_id) - assert isinstance(io_buffer, io.BufferedReader) - io_buffer.close() - - -def test_delete_by_object_metadata_id(pids, store): - """Check objects are deleted after calling delete with hash address id.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_id = object_metadata.id - store.delete(entity, object_metadata_id) - assert store.count(entity) == 0 - - -def test_remove_empty_removes_empty_folders_string(store): - """Test empty folders (via string) are removed.""" - three_dirs = "dir1/dir2/dir3" - two_dirs = "dir1/dir4" - one_dir = "dir5" - os.makedirs(os.path.join(store.root, three_dirs)) - os.makedirs(os.path.join(store.root, two_dirs)) - os.makedirs(os.path.join(store.root, one_dir)) - assert os.path.exists(os.path.join(store.root, three_dirs)) - assert os.path.exists(os.path.join(store.root, two_dirs)) - assert os.path.exists(os.path.join(store.root, one_dir)) - store.remove_empty(os.path.join(store.root, three_dirs)) - store.remove_empty(os.path.join(store.root, two_dirs)) - store.remove_empty(os.path.join(store.root, one_dir)) - assert not os.path.exists(os.path.join(store.root, three_dirs)) - assert not os.path.exists(os.path.join(store.root, two_dirs)) - assert not os.path.exists(os.path.join(store.root, one_dir)) - - -def test_remove_empty_removes_empty_folders_path(store): - """Test empty folders (via Path object) are removed.""" - three_dirs = Path("dir1/dir2/dir3") - two_dirs = Path("dir1/dir4") - one_dir = Path("dir5") - 
(store.root / three_dirs).mkdir(parents=True) - (store.root / two_dirs).mkdir(parents=True) - (store.root / one_dir).mkdir(parents=True) - assert (store.root / three_dirs).exists() - assert (store.root / two_dirs).exists() - assert (store.root / one_dir).exists() - store.remove_empty(store.root / three_dirs) - store.remove_empty(store.root / two_dirs) - store.remove_empty(store.root / one_dir) - assert not (store.root / three_dirs).exists() - assert not (store.root / two_dirs).exists() - assert not (store.root / one_dir).exists() - - -def test_remove_empty_does_not_remove_nonempty_folders(pids, store): - """Test non-empty folders are not removed.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_shard = store.shard(object_metadata.id) - object_metadata_shard_path = "/".join(object_metadata_shard) - # Get parent directory of the relative path - parent_dir = os.path.dirname(object_metadata_shard_path) - # Attempt to remove the parent directory - store.remove_empty(parent_dir) - abs_parent_dir = store.objects + "/" + parent_dir - assert os.path.exists(abs_parent_dir) - - -def test_has_subdir_subdirectory_string(store): - """Test that subdirectory is recognized.""" - sub_dir = store.root + "/filehashstore/test" - os.makedirs(sub_dir) - # pylint: disable=W0212 - is_sub_dir = store._has_subdir(sub_dir) - assert is_sub_dir - - -def test_has_subdir_subdirectory_path(store): - """Test that subdirectory is recognized.""" - sub_dir = Path(store.root) / "filehashstore" / "test" - sub_dir.mkdir(parents=True) - # pylint: disable=W0212 - is_sub_dir = store._has_subdir(sub_dir) - assert is_sub_dir - - -def test_has_subdir_non_subdirectory(store): - """Test that non-subdirectory is not recognized.""" - parent_dir = os.path.dirname(store.root) - non_sub_dir = parent_dir + "/filehashstore/test" - os.makedirs(non_sub_dir) - # pylint: disable=W0212 - is_sub_dir = store._has_subdir(non_sub_dir) - assert not is_sub_dir - - -def test_create_path(pids, store): - """Test makepath creates folder successfully.""" - for pid in pids: - root_directory = store.root - pid_hex_digest_directory = pids[pid]["metadata_cid"][:2] - pid_directory = root_directory + pid_hex_digest_directory - store.create_path(pid_directory) - assert os.path.isdir(pid_directory) - - -def test_get_real_path_file_does_not_exist(store): - """Test get_real_path returns None when object does not exist.""" - entity = "objects" - test_path = "tests/testdata/helloworld.txt" - real_path_exists = store.get_real_path(entity, test_path) - assert real_path_exists is None - - -def test_get_real_path_with_object_id(store, pids): - """Test get_real_path returns absolute path given an object id.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - obj_abs_path = store.get_real_path(entity, object_metadata.id) - assert os.path.exists(obj_abs_path) - - -def test_get_real_path_with_object_id_sharded(pids, store): - """Test exists method with a sharded path (relative path).""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - object_metadata_shard = store.shard(object_metadata.id) - object_metadata_shard_path = "/".join(object_metadata_shard) - obj_abs_path = store.get_real_path(entity, object_metadata_shard_path) - 
assert os.path.exists(obj_abs_path) - - -def test_get_real_path_with_metadata_id(store, pids): - """Test get_real_path returns absolute path given a metadata id.""" - entity = "metadata" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - metadata_cid = store.store_metadata(pid, syspath, format_id) - metadata_abs_path = store.get_real_path(entity, metadata_cid) - assert os.path.exists(metadata_abs_path) - - -def test_get_real_path_with_bad_entity(store, pids): - """Test get_real_path returns absolute path given an object id.""" - test_dir = "tests/testdata/" - entity = "bad_entity" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.put_object(pid, path) - with pytest.raises(ValueError): - store.get_real_path(entity, object_metadata.id) - - -def test_build_abs_path(store, pids): - """Test build_abs_path builds the absolute file path.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - _ = store.put_object(pid, path) - # pylint: disable=W0212 - abs_path = store.build_abs_path(entity, pids[pid]["object_cid"]) - assert abs_path - - -def test_count(pids, store): - """Check that count returns expected number of objects.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path_string = test_dir + pid.replace("/", "_") - store.put_object(pid, path_string) - assert store.count(entity) == 3 - - -def test_to_bytes(store): - """Test _to_bytes returns bytes.""" - string = "teststring" - # pylint: disable=W0212 - string_bytes = store._to_bytes(string) - assert isinstance(string_bytes, bytes) - - -def test_get_sha256_hex_digest(pids, store): - """Test for correct sha256 return value.""" - for pid in pids: - hash_val = store.get_sha256_hex_digest(pid) - assert hash_val == pids[pid]["object_cid"] diff --git a/tests/test_filehashstore_interface.py b/tests/test_filehashstore_interface.py deleted file mode 100644 index 92b125cb..00000000 --- a/tests/test_filehashstore_interface.py +++ /dev/null @@ -1,955 +0,0 @@ -"""Test module for FileHashStore HashStore interface methods""" -import io -import os -from pathlib import Path -from threading import Thread -import random -import threading -import time -import pytest - -# Define a mark to be used to label slow tests -slow_test = pytest.mark.skipif( - "not config.getoption('--run-slow')", - reason="Only run when --run-slow is given", -) - - -def test_pids_length(pids): - """Ensure test harness pids are present.""" - assert len(pids) == 3 - - -def test_store_address_length(pids, store): - """Test store object object_cid length is 64 characters.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.store_object(pid, path) - object_cid = object_metadata.id - assert len(object_cid) == 64 - - -def test_store_object(pids, store): - """Test store object.""" - test_dir = "tests/testdata/" - entity = "objects" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = Path(test_dir + pid.replace("/", "_")) - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - assert object_metadata.id == pids[pid]["object_cid"] - assert store.count(entity) 
== 3 - - -def test_store_object_files_path(pids, store): - """Test store object when given a path.""" - test_dir = "tests/testdata/" - entity = "objects" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = Path(test_dir + pid.replace("/", "_")) - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - assert store.exists(entity, pids[pid]["object_cid"]) - assert store.count(entity) == 3 - - -def test_store_object_files_string(pids, store): - """Test store object when given a string.""" - test_dir = "tests/testdata/" - entity = "objects" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path_string = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path_string) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - assert store.exists(entity, pids[pid]["object_cid"]) - assert store.count(entity) == 3 - - -def test_store_object_files_input_stream(pids, store): - """Test store object given an input stream.""" - test_dir = "tests/testdata/" - entity = "objects" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - input_stream = io.open(path, "rb") - _object_metadata = store.store_object(pid, input_stream) - input_stream.close() - object_cid = store.get_sha256_hex_digest(pid) - assert store.exists(entity, object_cid) - assert store.count(entity) == 3 - - -def test_store_object_id(pids, store): - """Test store object returns expected id (object_cid).""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.store_object(pid, path) - assert object_metadata.id == pids[pid]["object_cid"] - - -def test_store_object_obj_size(pids, store): - """Test store object returns expected file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.store_object(pid, path) - object_size = object_metadata.obj_size - assert object_size == pids[pid]["file_size_bytes"] - - -def test_store_object_hex_digests(pids, store): - """Test store object returns expected hex digests dictionary.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - object_metadata = store.store_object(pid, path) - assert object_metadata.hex_digests.get("md5") == pids[pid]["md5"] - assert object_metadata.hex_digests.get("sha1") == pids[pid]["sha1"] - assert object_metadata.hex_digests.get("sha256") == pids[pid]["sha256"] - assert object_metadata.hex_digests.get("sha384") == pids[pid]["sha384"] - assert object_metadata.hex_digests.get("sha512") == pids[pid]["sha512"] - - -def test_store_object_pid_empty(store): - """Test store object raises error when supplied with empty pid string.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - with pytest.raises(ValueError): - store.store_object("", path) - - -def test_store_object_pid_empty_spaces(store): - """Test store object raises error when supplied with empty space character.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - with pytest.raises(ValueError): - store.store_object(" ", path) - - -def test_store_object_pid_none(store): - """Test store object raises error when supplied with 'None' 
pid.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - with pytest.raises(ValueError): - store.store_object(None, path) - - -def test_store_object_data_incorrect_type_none(store): - """Test store object raises error when data is 'None'.""" - pid = "jtao.1700.1" - path = None - with pytest.raises(TypeError): - store.store_object(pid, path) - - -def test_store_object_data_incorrect_type_empty(store): - """Test store object raises error when data is an empty string.""" - pid = "jtao.1700.1" - path = "" - with pytest.raises(TypeError): - store.store_object(pid, path) - - -def test_store_object_data_incorrect_type_empty_spaces(store): - """Test store object raises error when data is an empty string with spaces.""" - pid = "jtao.1700.1" - path = " " - with pytest.raises(TypeError): - store.store_object(pid, path) - - -def test_store_object_additional_algorithm_invalid(store): - """Test store object raises error when supplied with unsupported algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_not_in_list = "abc" - with pytest.raises(ValueError, match="Algorithm not supported"): - store.store_object(pid, path, algorithm_not_in_list) - - -def test_store_object_additional_algorithm_hyphen_uppercase(pids, store): - """Test store object formats algorithm in uppercase.""" - test_dir = "tests/testdata/" - entity = "objects" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_with_hyphen_and_upper = "SHA-384" - object_metadata = store.store_object(pid, path, algorithm_with_hyphen_and_upper) - sha256_cid = object_metadata.hex_digests.get("sha384") - assert sha256_cid == pids[pid]["sha384"] - object_cid = store.get_sha256_hex_digest(pid) - assert store.exists(entity, object_cid) - - -def test_store_object_additional_algorithm_hyphen_lowercase(store): - """Test store object with additional algorithm in lowercase.""" - test_dir = "tests/testdata/" - entity = "objects" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_other = "sha3-256" - object_metadata = store.store_object(pid, path, algorithm_other) - additional_sha3_256_hex_digest = object_metadata.hex_digests.get("sha3_256") - sha3_256_checksum = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - assert additional_sha3_256_hex_digest == sha3_256_checksum - object_cid = store.get_sha256_hex_digest(pid) - assert store.exists(entity, object_cid) - - -def test_store_object_additional_algorithm_underscore(store): - """Test store object with additional algorithm with underscore.""" - test_dir = "tests/testdata/" - entity = "objects" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_other = "sha3_256" - object_metadata = store.store_object(pid, path, algorithm_other) - additional_sha3_256_hex_digest = object_metadata.hex_digests.get("sha3_256") - sha3_256_checksum = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - assert additional_sha3_256_hex_digest == sha3_256_checksum - pid_hash = store.get_sha256_hex_digest(pid) - assert store.exists(entity, pid_hash) - - -def test_store_object_checksum_correct(store): - """Test store object successfully stores with good checksum.""" - test_dir = "tests/testdata/" - entity = "objects" - pid = "jtao.1700.1" - path = test_dir + pid - checksum_algo = "sha3_256" - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - _object_metadata = store.store_object( - pid, path, checksum=checksum_correct, checksum_algorithm=checksum_algo - 
) - assert store.count(entity) == 1 - - -def test_store_object_checksum_correct_and_additional_algo(store): - """Test store object successfully stores with good checksum and same additional algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_additional = "sha224" - sha224_additional_checksum = ( - "9b3a96f434f3c894359193a63437ef86fbd5a1a1a6cc37f1d5013ac1" - ) - algorithm_checksum = "sha3_256" - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - object_metadata = store.store_object( - pid, - path, - additional_algorithm=algorithm_additional, - checksum=checksum_correct, - checksum_algorithm=algorithm_checksum, - ) - assert object_metadata.hex_digests.get("sha224") == sha224_additional_checksum - assert object_metadata.hex_digests.get("sha3_256") == checksum_correct - - -def test_store_object_checksum_correct_and_additional_algo_duplicate(store): - """Test store object successfully stores with good checksum and same additional algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_additional = "sha3_256" - algorithm_checksum = "sha3_256" - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - object_metadata = store.store_object( - pid, - path, - additional_algorithm=algorithm_additional, - checksum=checksum_correct, - checksum_algorithm=algorithm_checksum, - ) - assert object_metadata.hex_digests.get("sha3_256") == checksum_correct - - -def test_store_object_checksum_algorithm_empty(store): - """Test store object raises error when checksum supplied with no checksum_algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - with pytest.raises(ValueError): - store.store_object(pid, path, checksum=checksum_correct, checksum_algorithm="") - - -def test_store_object_checksum_empty(store): - """Test store object raises error when checksum_algorithm supplied with empty checksum.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - checksum_algorithm = "sha3_256" - with pytest.raises(ValueError): - store.store_object( - pid, path, checksum="", checksum_algorithm=checksum_algorithm - ) - - -def test_store_object_checksum_empty_spaces(store): - """Test store object raises error when checksum_algorithm supplied and checksum is empty - with spaces.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - checksum_algorithm = "sha3_256" - with pytest.raises(ValueError): - store.store_object( - pid, path, checksum=" ", checksum_algorithm=checksum_algorithm - ) - - -def test_store_object_checksum_algorithm_empty_spaces(store): - """Test store object raises error when checksum supplied with no checksum_algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - checksum_correct = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - with pytest.raises(ValueError): - store.store_object( - pid, path, checksum=checksum_correct, checksum_algorithm=" " - ) - - -def test_store_object_checksum_incorrect_checksum(store): - """Test store object raises error when supplied with bad checksum.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - algorithm_other = "sha3_256" - checksum_incorrect = ( - "bbbb069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - 
with pytest.raises(ValueError): - store.store_object( - pid, path, checksum=checksum_incorrect, checksum_algorithm=algorithm_other - ) - - -def test_store_object_duplicate_raises_error(store): - """Test store duplicate object throws FileExistsError.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - entity = "objects" - # Store first blob - _object_metadata_one = store.store_object(pid, path) - # Store second blob - with pytest.raises(FileExistsError): - _object_metadata_two = store.store_object(pid, path) - assert store.count(entity) == 1 - object_cid = store.get_sha256_hex_digest(pid) - assert store.exists(entity, object_cid) - - -def test_store_object_with_obj_file_size(store, pids): - """Test store object with correct file sizes.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - obj_file_size = pids[pid]["file_size_bytes"] - path = test_dir + pid.replace("/", "_") - object_metadata = store.store_object( - pid, path, expected_object_size=obj_file_size - ) - object_size = object_metadata.obj_size - assert object_size == obj_file_size - - -def test_store_object_with_obj_file_size_incorrect(store, pids): - """Test store object throws exception with incorrect file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - obj_file_size = 1234 - path = test_dir + pid.replace("/", "_") - with pytest.raises(ValueError): - store.store_object(pid, path, expected_object_size=obj_file_size) - - -def test_store_object_with_obj_file_size_non_integer(store, pids): - """Test store object throws exception with a non-integer value as the file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - obj_file_size = "Bob" - path = test_dir + pid.replace("/", "_") - with pytest.raises(TypeError): - store.store_object(pid, path, expected_object_size=obj_file_size) - - -def test_store_object_with_obj_file_size_zero(store, pids): - """Test store object throws exception with zero as the file size.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - obj_file_size = 0 - path = test_dir + pid.replace("/", "_") - with pytest.raises(ValueError): - store.store_object(pid, path, expected_object_size=obj_file_size) - - -def test_store_object_duplicates_threads(store): - """Test store object thread lock.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - entity = "objects" - - file_exists_error_flag = False - - def store_object_wrapper(pid, path): - nonlocal file_exists_error_flag - try: - store.store_object(pid, path) # Call store_object inside the thread - except FileExistsError: - file_exists_error_flag = True - - thread1 = Thread(target=store_object_wrapper, args=(pid, path)) - thread2 = Thread(target=store_object_wrapper, args=(pid, path)) - thread3 = Thread(target=store_object_wrapper, args=(pid, path)) - thread1.start() - thread2.start() - thread3.start() - thread1.join() - thread2.join() - thread3.join() - # One thread will succeed, file count must still be 1 - assert store.count(entity) == 1 - object_cid = store.get_sha256_hex_digest(pid) - assert store.exists(entity, object_cid) - assert file_exists_error_flag - - -@slow_test -def test_store_object_interrupt_process(store): - """Test that tmp file created when storing a large object (2GB) and - interrupting the process is cleaned up. 
- """ - file_size = 2 * 1024 * 1024 * 1024 # 2GB - file_path = store.root + "random_file_2.bin" - - pid = "Testpid" - # Generate a random file with the specified size - with open(file_path, "wb") as file: - remaining_bytes = file_size - buffer_size = 1024 * 1024 # 1MB buffer size (adjust as needed) - - while remaining_bytes > 0: - # Generate random data for the buffer - buffer = bytearray(random.getrandbits(8) for _ in range(buffer_size)) - # Write the buffer to the file - bytes_to_write = min(buffer_size, remaining_bytes) - file.write(buffer[:bytes_to_write]) - remaining_bytes -= bytes_to_write - - interrupt_flag = False - - def store_object_wrapper(pid, path): - print(store.root) - while not interrupt_flag: - store.store_object(pid, path) # Call store_object inside the thread - - # Create/start the thread - thread = threading.Thread(target=store_object_wrapper, args=(pid, file_path)) - thread.start() - - # Sleep for 5 seconds to let the thread run - time.sleep(5) - - # Interrupt the thread - interrupt_flag = True - - # Wait for the thread to finish - thread.join() - - # Confirm no tmp objects found in objects/tmp directory - assert len(os.listdir(store.root + "/objects/tmp")) == 0 - - -@slow_test -def test_store_object_large_file(store): - """Test storing a large object (1GB). This test has also been executed with - a 4GB file and the test classes succeeded locally in 296.85s (0:04:56) - """ - # file_size = 4 * 1024 * 1024 * 1024 # 4GB - file_size = 1024 * 1024 * 1024 # 1GB - file_path = store.root + "random_file.bin" - # Generate a random file with the specified size - with open(file_path, "wb") as file: - remaining_bytes = file_size - buffer_size = 1024 * 1024 # 1MB buffer size (adjust as needed) - - while remaining_bytes > 0: - # Generate random data for the buffer - buffer = bytearray(random.getrandbits(8) for _ in range(buffer_size)) - # Write the buffer to the file - bytes_to_write = min(buffer_size, remaining_bytes) - file.write(buffer[:bytes_to_write]) - remaining_bytes -= bytes_to_write - # Store object - pid = "testfile_filehashstore" - object_metadata = store.store_object(pid, file_path) - object_metadata_id = object_metadata.id - pid_sha256_hex_digest = store.get_sha256_hex_digest(pid) - assert object_metadata_id == pid_sha256_hex_digest - - -@slow_test -def test_store_object_sparse_large_file(store): - """Test storing a large object (4GB) via sparse file. This test has also been - executed with a 10GB file and the test classes succeeded locally in 117.03s (0:01:57). 
- """ - # file_size = 10 * 1024 * 1024 * 1024 # 10GB - file_size = 4 * 1024 * 1024 * 1024 # 4GB - file_path = store.root + "random_file.bin" - # Generate a random file with the specified size - with open(file_path, "wb") as file: - file.seek(file_size - 1) - file.write(b"\0") - # Store object - pid = "testfile_filehashstore" - object_metadata = store.store_object(pid, file_path) - object_metadata_id = object_metadata.id - pid_sha256_hex_digest = store.get_sha256_hex_digest(pid) - assert object_metadata_id == pid_sha256_hex_digest - - -def test_store_metadata(pids, store): - """Test store metadata.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - metadata_cid = store.store_metadata(pid, syspath, format_id) - assert metadata_cid == pids[pid]["metadata_cid"] - - -def test_store_metadata_default_format_id(pids, store): - """Test store metadata returns expected id when storing with default format_id.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - metadata_cid = store.store_metadata(pid, syspath) - assert metadata_cid == pids[pid]["metadata_cid"] - - -def test_store_metadata_files_path(pids, store): - """Test store metadata with path.""" - test_dir = "tests/testdata/" - entity = "metadata" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - metadata_cid = store.store_metadata(pid, syspath, format_id) - assert store.exists(entity, metadata_cid) - assert metadata_cid == pids[pid]["metadata_cid"] - assert store.count(entity) == 3 - - -def test_store_metadata_files_string(pids, store): - """Test store metadata with string.""" - test_dir = "tests/testdata/" - entity = "metadata" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path_string = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath_string = str(Path(test_dir) / filename) - _object_metadata = store.store_object(pid, path_string) - metadata_cid = store.store_metadata(pid, syspath_string, format_id) - assert store.exists(entity, metadata_cid) - assert store.count(entity) == 3 - - -def test_store_metadata_files_input_stream(pids, store): - """Test store metadata with an input stream to metadata.""" - test_dir = "tests/testdata/" - entity = "metadata" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - _object_metadata = store.store_object(pid, path) - filename = pid.replace("/", "_") + ".xml" - syspath_string = str(Path(test_dir) / filename) - syspath_stream = io.open(syspath_string, "rb") - _metadata_cid = store.store_metadata(pid, syspath_stream, format_id) - syspath_stream.close() - assert store.count(entity) == 3 - - -def test_store_metadata_pid_empty(store): - """Test store metadata raises error with empty string.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = "" - filename = pid.replace("/", "_") + ".xml" - 
syspath_string = str(Path(test_dir) / filename) - with pytest.raises(ValueError): - store.store_metadata(pid, syspath_string, format_id) - - -def test_store_metadata_pid_empty_spaces(store): - """Test store metadata raises error with empty spaces.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = " " - filename = pid.replace("/", "_") + ".xml" - syspath_string = str(Path(test_dir) / filename) - with pytest.raises(ValueError): - store.store_metadata(pid, syspath_string, format_id) - - -def test_store_metadata_pid_format_id_spaces(store): - """Test store metadata raises error with empty spaces.""" - test_dir = "tests/testdata/" - format_id = " " - pid = "jtao.1700.1" - filename = pid.replace("/", "_") + ".xml" - syspath_string = str(Path(test_dir) / filename) - with pytest.raises(ValueError): - store.store_metadata(pid, syspath_string, format_id) - - -def test_store_metadata_metadata_empty(store): - """Test store metadata raises error with empty metadata string.""" - pid = "jtao.1700.1" - format_id = "http://ns.dataone.org/service/types/v2.0" - syspath_string = " " - with pytest.raises(TypeError): - store.store_metadata(pid, syspath_string, format_id) - - -def test_store_metadata_metadata_none(store): - """Test store metadata raises error with empty None metadata.""" - pid = "jtao.1700.1" - format_id = "http://ns.dataone.org/service/types/v2.0" - syspath_string = None - with pytest.raises(TypeError): - store.store_metadata(pid, syspath_string, format_id) - - -def test_store_metadata_metadata_cid(pids, store): - """Test store metadata returns expected metadata_cid.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - metadata_cid = store.store_metadata(pid, syspath, format_id) - assert metadata_cid == pids[pid]["metadata_cid"] - - -def test_store_metadata_thread_lock(store): - """Test store metadata thread lock.""" - test_dir = "tests/testdata/" - entity = "metadata" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = "jtao.1700.1" - path = test_dir + pid - filename = pid + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - store.store_metadata(pid, syspath, format_id) - # Start threads - thread1 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) - thread2 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) - thread3 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) - thread4 = Thread(target=store.store_metadata, args=(pid, syspath, format_id)) - thread1.start() - thread2.start() - thread3.start() - thread4.start() - thread1.join() - thread2.join() - thread3.join() - thread4.join() - assert store.count(entity) == 1 - - -def test_retrieve_object(pids, store): - """Test retrieve_object returns correct object data.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - object_metadata = store.store_object(pid, path) - store.store_metadata(pid, syspath, format_id) - obj_stream = store.retrieve_object(pid) - sha256_hex = store.computehash(obj_stream) - obj_stream.close() - assert sha256_hex == 
object_metadata.hex_digests.get("sha256") - - -def test_retrieve_object_pid_empty(store): - """Test retrieve_object raises error when supplied with empty pid.""" - pid = " " - with pytest.raises(ValueError): - store.retrieve_object(pid) - - -def test_retrieve_object_pid_invalid(store): - """Test retrieve_object raises error when supplied with bad pid.""" - pid = "jtao.1700.1" - pid_does_not_exist = pid + "test" - with pytest.raises(ValueError): - store.retrieve_object(pid_does_not_exist) - - -def test_retrieve_metadata(store): - """Test retrieve_metadata returns correct metadata.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = "jtao.1700.1" - path = test_dir + pid - filename = pid + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - metadata_stream = store.retrieve_metadata(pid, format_id) - metadata_content = metadata_stream.read().decode("utf-8") - metadata_stream.close() - metadata = syspath.read_bytes() - assert metadata.decode("utf-8") == metadata_content - - -def test_retrieve_metadata_default_format_id(store): - """Test retrieve_metadata retrieves expected metadata with default format_id.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - filename = pid + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath) - metadata_stream = store.retrieve_metadata(pid) - metadata_content = metadata_stream.read().decode("utf-8") - metadata_stream.close() - metadata = syspath.read_bytes() - assert metadata.decode("utf-8") == metadata_content - - -def test_retrieve_metadata_bytes_pid_invalid(store): - """Test retrieve_metadata raises error when supplied with bad pid.""" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = "jtao.1700.1" - pid_does_not_exist = pid + "test" - with pytest.raises(ValueError): - store.retrieve_metadata(pid_does_not_exist, format_id) - - -def test_retrieve_metadata_bytes_pid_empty(store): - """Test retrieve_metadata raises error when supplied with empty pid.""" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = " " - with pytest.raises(ValueError): - store.retrieve_metadata(pid, format_id) - - -def test_retrieve_metadata_format_id_empty(store): - """Test retrieve_metadata raises error when supplied with empty format_id.""" - format_id = "" - pid = "jtao.1700.1" - with pytest.raises(ValueError): - store.retrieve_metadata(pid, format_id) - - -def test_retrieve_metadata_format_id_empty_spaces(store): - """Test retrieve_metadata raises error when supplied with empty spaces format_id.""" - format_id = " " - pid = "jtao.1700.1" - with pytest.raises(ValueError): - store.retrieve_metadata(pid, format_id) - - -def test_delete_objects(pids, store): - """Test delete_object successfully deletes objects.""" - test_dir = "tests/testdata/" - entity = "objects" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - store.delete_object(pid) - assert store.count(entity) == 0 - - -def test_delete_object_pid_empty(store): - """Test delete_object raises error when empty pid supplied.""" - pid = " " - with 
pytest.raises(ValueError): - store.delete_object(pid) - - -def test_delete_object_pid_none(store): - """Test delete_object raises error when pid is 'None'.""" - pid = None - with pytest.raises(ValueError): - store.delete_object(pid) - - -def test_delete_metadata(pids, store): - """Test delete_metadata successfully deletes metadata.""" - test_dir = "tests/testdata/" - entity = "metadata" - format_id = "http://ns.dataone.org/service/types/v2.0" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - store.delete_metadata(pid, format_id) - assert store.count(entity) == 0 - - -def test_delete_metadata_default_format_id(store, pids): - """Test delete_metadata deletes successfully with default format_id.""" - test_dir = "tests/testdata/" - entity = "metadata" - for pid in pids.keys(): - path = test_dir + pid.replace("/", "_") - filename = pid.replace("/", "_") + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath) - store.delete_metadata(pid) - assert store.count(entity) == 0 - - -def test_delete_metadata_pid_empty(store): - """Test delete_metadata raises error when empty pid supplied.""" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = " " - with pytest.raises(ValueError): - store.delete_metadata(pid, format_id) - - -def test_delete_metadata_pid_none(store): - """Test delete_metadata raises error when pid is 'None'.""" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = None - with pytest.raises(ValueError): - store.delete_metadata(pid, format_id) - - -def test_delete_metadata_format_id_empty(store): - """Test delete_metadata raises error when empty format_id supplied.""" - format_id = " " - pid = "jtao.1700.1" - with pytest.raises(ValueError): - store.delete_metadata(pid, format_id) - - -def test_get_hex_digest(store): - """Test get_hex_digest for expected value.""" - test_dir = "tests/testdata/" - format_id = "http://ns.dataone.org/service/types/v2.0" - pid = "jtao.1700.1" - path = test_dir + pid - filename = pid + ".xml" - syspath = Path(test_dir) / filename - _object_metadata = store.store_object(pid, path) - _metadata_cid = store.store_metadata(pid, syspath, format_id) - sha3_256_hex_digest = ( - "b748069cd0116ba59638e5f3500bbff79b41d6184bc242bd71f5cbbb8cf484cf" - ) - sha3_256_get = store.get_hex_digest(pid, "sha3_256") - assert sha3_256_hex_digest == sha3_256_get - - -def test_get_hex_digest_pid_not_found(store): - """Test get_hex_digest raises error when supplied with bad pid.""" - pid = "jtao.1700.1" - pid_does_not_exist = pid + "test" - algorithm = "sha256" - with pytest.raises(ValueError): - store.get_hex_digest(pid_does_not_exist, algorithm) - - -def test_get_hex_digest_pid_unsupported_algorithm(store): - """Test get_hex_digest raises error when supplied with unsupported algorithm.""" - test_dir = "tests/testdata/" - pid = "jtao.1700.1" - path = test_dir + pid - filename = pid + ".xml" - syspath = Path(test_dir) / filename - syspath.read_bytes() - _object_metadata = store.store_object(pid, path) - algorithm = "sm3" - with pytest.raises(ValueError): - store.get_hex_digest(pid, algorithm) - - -def test_get_hex_digest_pid_empty(store): - """Test get_hex_digest raises error when supplied pid is empty.""" - pid = " " - algorithm = "sm3" - with 
pytest.raises(ValueError): - store.get_hex_digest(pid, algorithm) - - -def test_get_hex_digest_pid_none(store): - """Test get_hex_digest raises error when supplied pid is 'None'.""" - pid = None - algorithm = "sm3" - with pytest.raises(ValueError): - store.get_hex_digest(pid, algorithm) - - -def test_get_hex_digest_algorithm_empty(store): - """Test get_hex_digest raises error when supplied algorithm is empty.""" - pid = "jtao.1700.1" - algorithm = " " - with pytest.raises(ValueError): - store.get_hex_digest(pid, algorithm) - - -def test_get_hex_digest_algorithm_none(store): - """Test get_hex_digest raises error when supplied algorithm is 'None'.""" - pid = "jtao.1700.1" - algorithm = None - with pytest.raises(ValueError): - store.get_hex_digest(pid, algorithm) diff --git a/tests/test_filehashstore_stream.py b/tests/test_filehashstore_stream.py deleted file mode 100644 index 8cf4a7d0..00000000 --- a/tests/test_filehashstore_stream.py +++ /dev/null @@ -1,54 +0,0 @@ -"""Test module for Stream""" -import hashlib -import io -from pathlib import Path -import pytest -from hashstore.filehashstore import Stream - - -def test_stream_reads_file(pids): - """Test that a stream can read a file and yield its contents.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path_string = test_dir + pid.replace("/", "_") - obj_stream = Stream(path_string) - hashobj = hashlib.new("sha256") - for data in obj_stream: - hashobj.update(data) - hex_digest = hashobj.hexdigest() - assert pids[pid]["sha256"] == hex_digest - - -def test_stream_reads_path_object(pids): - """Test that a stream can read a file-like object and yield its contents.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path = Path(test_dir + pid.replace("/", "_")) - obj_stream = Stream(path) - hashobj = hashlib.new("sha256") - for data in obj_stream: - hashobj.update(data) - hex_digest = hashobj.hexdigest() - assert pids[pid]["sha256"] == hex_digest - - -def test_stream_returns_to_original_position_on_close(pids): - """Test that a stream returns to its original position after closing the file.""" - test_dir = "tests/testdata/" - for pid in pids.keys(): - path_string = test_dir + pid.replace("/", "_") - input_stream = io.open(path_string, "rb") - input_stream.seek(5) - hashobj = hashlib.new("sha256") - obj_stream = Stream(input_stream) - for data in obj_stream: - hashobj.update(data) - obj_stream.close() - assert input_stream.tell() == 5 - input_stream.close() - - -def test_stream_raises_error_for_invalid_object(): - """Test that a stream raises ValueError for an invalid input object.""" - with pytest.raises(ValueError): - Stream(1234) diff --git a/tests/test_hashstore.py b/tests/test_hashstore.py index 68cd195a..02d83e3b 100644 --- a/tests/test_hashstore.py +++ b/tests/test_hashstore.py @@ -1,7 +1,8 @@ -"""Test module for HashStore Module""" +"""Test module for HashStore's HashStoreFactory and ObjectMetadata class.""" + import os import pytest -from hashstore.hashstore import ObjectMetadata, HashStoreFactory +from hashstore.hashstore import HashStoreFactory from hashstore.filehashstore import FileHashStore @@ -22,8 +23,8 @@ def test_factory_get_hashstore_filehashstore(factory, props): module_name = "hashstore.filehashstore" class_name = "FileHashStore" # These props can be found in tests/conftest.py - store = factory.get_hashstore(module_name, class_name, props) - assert isinstance(store, FileHashStore) + this_store = factory.get_hashstore(module_name, class_name, props) + assert isinstance(this_store, FileHashStore) def 
test_factory_get_hashstore_unsupported_class(factory):
@@ -43,53 +44,146 @@ def test_factory_get_hashstore_unsupported_module(factory):
 
 
 def test_factory_get_hashstore_filehashstore_unsupported_algorithm(factory):
-    """Check factory raises exception with store algorithm value that part of the default list"""
+    """Check factory raises exception with store algorithm value that is not part of
+    the default list."""
     module_name = "hashstore.filehashstore"
     class_name = "FileHashStore"
 
     properties = {
-        "store_path": os.getcwd() + "/metacat/test",
+        "store_path": os.getcwd() + "/metacat/hashstore",
         "store_depth": 3,
         "store_width": 2,
         "store_algorithm": "MD2",
-        "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
     }
     with pytest.raises(ValueError):
         factory.get_hashstore(module_name, class_name, properties)
 
 
 def test_factory_get_hashstore_filehashstore_incorrect_algorithm_format(factory):
-    """Check factory raises exception with incorrectly formatted algorithm value"""
+    """Check factory raises exception with incorrectly formatted algorithm value."""
     module_name = "hashstore.filehashstore"
     class_name = "FileHashStore"
 
     properties = {
-        "store_path": os.getcwd() + "/metacat/test",
+        "store_path": os.getcwd() + "/metacat/hashstore",
         "store_depth": 3,
         "store_width": 2,
-        "store_algorithm": "sha256",
-        "store_metadata_namespace": "http://ns.dataone.org/service/types/v2.0",
+        "store_algorithm": "dou_algo",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
     }
     with pytest.raises(ValueError):
         factory.get_hashstore(module_name, class_name, properties)
 
 
-def test_objectmetadata():
-    """Test class returns correct values via dot notation."""
-    ab_id = "hashstoretest"
-    obj_size = 1234
-    hex_digest_dict = {
-        "md5": "md5value",
-        "sha1": "sha1value",
-        "sha224": "sha224value",
-        "sha256": "sha256value",
-        "sha512": "sha512value",
+def test_factory_get_hashstore_filehashstore_conflicting_obj_dir(factory, tmp_path):
+    """Check factory raises exception when existing `/objects` directory exists."""
+    module_name = "hashstore.filehashstore"
+    class_name = "FileHashStore"
+
+    directory = tmp_path / "douhs" / "objects"
+    directory.mkdir(parents=True)
+    douhspath = (tmp_path / "douhs").as_posix()
+
+    properties = {
+        "store_path": douhspath,
+        "store_depth": 3,
+        "store_width": 2,
+        "store_algorithm": "SHA-256",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
+    }
+    with pytest.raises(RuntimeError):
+        factory.get_hashstore(module_name, class_name, properties)
+
+
+def test_factory_get_hashstore_filehashstore_conflicting_metadata_dir(
+    factory, tmp_path
+):
+    """Check factory raises exception when existing `/metadata` directory exists."""
+    module_name = "hashstore.filehashstore"
+    class_name = "FileHashStore"
+
+    directory = tmp_path / "douhs" / "metadata"
+    directory.mkdir(parents=True)
+    douhspath = (tmp_path / "douhs").as_posix()
+
+    properties = {
+        "store_path": douhspath,
+        "store_depth": 3,
+        "store_width": 2,
+        "store_algorithm": "SHA-256",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
     }
-    object_metadata = ObjectMetadata(ab_id, obj_size, hex_digest_dict)
-    assert object_metadata.id == ab_id
-    assert object_metadata.obj_size == obj_size
-    assert object_metadata.hex_digests.get("md5") == hex_digest_dict["md5"]
-    assert object_metadata.hex_digests.get("sha1") == hex_digest_dict["sha1"]
-    assert object_metadata.hex_digests.get("sha224") == hex_digest_dict["sha224"]
-    assert object_metadata.hex_digests.get("sha256") == hex_digest_dict["sha256"]
-    assert object_metadata.hex_digests.get("sha512") == hex_digest_dict["sha512"]
+    with pytest.raises(RuntimeError):
+        factory.get_hashstore(module_name, class_name, properties)
+
+
+def test_factory_get_hashstore_filehashstore_conflicting_refs_dir(factory, tmp_path):
+    """Check factory raises exception when existing `/refs` directory exists."""
+    module_name = "hashstore.filehashstore"
+    class_name = "FileHashStore"
+
+    directory = tmp_path / "douhs" / "refs"
+    directory.mkdir(parents=True)
+    douhspath = (tmp_path / "douhs").as_posix()
+
+    properties = {
+        "store_path": douhspath,
+        "store_depth": 3,
+        "store_width": 2,
+        "store_algorithm": "SHA-256",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
+    }
+    with pytest.raises(RuntimeError):
+        factory.get_hashstore(module_name, class_name, properties)
+
+
+def test_factory_get_hashstore_filehashstore_nonconflicting_dir(factory, tmp_path):
+    """Check factory does not raise exception when existing non-conflicting directory exists."""
+    module_name = "hashstore.filehashstore"
+    class_name = "FileHashStore"
+
+    directory = tmp_path / "douhs" / "other"
+    directory.mkdir(parents=True)
+    douhspath = (tmp_path / "douhs").as_posix()
+
+    properties = {
+        "store_path": douhspath,
+        "store_depth": 3,
+        "store_width": 2,
+        "store_algorithm": "SHA-256",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
+    }
+
+    factory.get_hashstore(module_name, class_name, properties)
+
+
+def test_factory_get_hashstore_filehashstore_string_int_prop(factory, tmp_path):
+    """Check factory does not raise exception when an integer is passed as a string in a
+    properties object."""
+    module_name = "hashstore.filehashstore"
+    class_name = "FileHashStore"
+
+    directory = tmp_path / "douhs" / "inttest"
+    directory.mkdir(parents=True)
+    douhspath = (tmp_path / "douhs").as_posix()
+
+    properties = {
+        "store_path": douhspath,
+        "store_depth": "3",
+        "store_width": "2",
+        "store_algorithm": "SHA-256",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
+    }
+
+    factory.get_hashstore(module_name, class_name, properties)
+
+    properties = {
+        "store_path": douhspath,
+        "store_depth": str(3),
+        "store_width": str(2),
+        "store_algorithm": "SHA-256",
+        "store_metadata_namespace": "https://ns.dataone.org/service/types/v2.0#SystemMetadata",
+    }
+
+    factory.get_hashstore(module_name, class_name, properties)
diff --git a/tests/test_hashstore_client.py b/tests/test_hashstore_client.py
index 7d73e524..ba0bd566 100644
--- a/tests/test_hashstore_client.py
+++ b/tests/test_hashstore_client.py
@@ -1,8 +1,11 @@
-"""Test module for the Python client (Public API calls only)"""
+"""Test module for the Python client (Public API calls only)."""
+
 import sys
 import os
 from pathlib import Path
-from hashstore import client
+from hashstore import hashstoreclient
+
+# pylint: disable=W0212
 
 
 def test_create_hashstore(tmp_path):
@@ -29,7 +32,7 @@ def test_create_hashstore(tmp_path):
     sys.path.append(client_directory)
     # Manually change sys args to simulate command line arguments
     sys.argv = chs_args
-    client.main()
+    hashstoreclient.main()
 
     hashstore_yaml = Path(client_test_store + "/hashstore.yaml")
     hashstore_object_path = Path(client_test_store + "/objects")
@@ -41,14 +44,49 @@ def test_create_hashstore(tmp_path):
     assert os.path.exists(hashstore_client_python_log)
 
 
+def test_get_checksum(capsys, store, pids):
+    """Test calculating a hash via HashStore through client."""
+    client_directory = os.getcwd() + "/src/hashstore"
+    test_dir = "tests/testdata/"
+    for pid in pids.keys():
+        path = test_dir + pid.replace("/", "_")
+        store.store_object(pid, path)
+
+        client_module_path = f"{client_directory}/client.py"
+        test_store = str(store.root)
+        get_checksum_opt = "-getchecksum"
+        client_pid_arg = f"-pid={pid}"
+        algo_arg = f"-algo={store.algorithm}"
+        chs_args = [
+            client_module_path,
+            test_store,
+            get_checksum_opt,
+            client_pid_arg,
+            algo_arg,
+        ]
+
+        # Add file path of HashStore to sys so modules can be discovered
+        sys.path.append(client_directory)
+        # Manually change sys args to simulate command line arguments
+        sys.argv = chs_args
+        hashstoreclient.main()
+
+        capsystext = capsys.readouterr().out
+        expected_output = (
+            f"guid/pid: {pid}\n"
+            + f"algorithm: {store.algorithm}\n"
+            + f"Checksum/Hex Digest: {pids[pid][store.algorithm]}\n"
+        )
+        assert capsystext == expected_output
+
+
 def test_store_object(store, pids):
     """Test storing objects to HashStore through client."""
     client_directory = os.getcwd() + "/src/hashstore"
     test_dir = "tests/testdata/"
     for pid in pids.keys():
-        path = test_dir + pid.replace("/", "_")
         client_module_path = f"{client_directory}/client.py"
-        test_store = store.root
+        test_store = str(store.root)
         store_object_opt = "-storeobject"
         client_pid_arg = f"-pid={pid}"
         path = f'-path={test_dir + pid.replace("/", "_")}'
@@ -64,22 +102,22 @@ def test_store_object(store, pids):
         sys.path.append(client_directory)
         # Manually change sys args to simulate command line arguments
         sys.argv = chs_args
-        client.main()
+        hashstoreclient.main()
 
-        assert store.exists("objects", pids[pid]["object_cid"])
+        assert store._exists("objects", pids[pid][store.algorithm])
 
 
-def test_store_metadata(store, pids):
+def test_store_metadata(capsys, store, pids):
     """Test storing metadata to HashStore through client."""
     client_directory = os.getcwd() + "/src/hashstore"
     test_dir = "tests/testdata/"
-    namespace = "http://ns.dataone.org/service/types/v2.0"
+    namespace = "https://ns.dataone.org/service/types/v2.0#SystemMetadata"
+    entity = "metadata"
     for pid in pids.keys():
-        path = test_dir + pid.replace("/", "_")
         filename = pid.replace("/", "_") + ".xml"
         syspath = Path(test_dir) / filename
         client_module_path = f"{client_directory}/client.py"
-        test_store = store.root
+        test_store = str(store.root)
         store_metadata_opt = "-storemetadata"
         client_pid_arg = f"-pid={pid}"
         path = f"-path={syspath}"
@@ -97,9 +135,19 @@ def test_store_metadata(store, pids):
         sys.path.append(client_directory)
         # Manually change sys args to simulate command line arguments
         sys.argv = chs_args
-        client.main()
+        hashstoreclient.main()
+
+        metadata_directory = store._computehash(pid)
+        metadata_document_name = store._computehash(pid + namespace)
+        rel_path = Path(*store._shard(metadata_directory))
+        full_path = (
+            store._get_store_path("metadata") / rel_path / metadata_document_name
+        )
+        capsystext = capsys.readouterr().out
+        expected_output = f"Metadata Path: {full_path}\n"
+        assert capsystext == expected_output
 
-        assert store.exists("metadata", pids[pid]["metadata_cid"])
+    assert store._count(entity) == 3
 
 
 def test_retrieve_objects(capsys, pids, store):
@@ -108,10 +156,10 @@ def test_retrieve_objects(capsys, pids, store):
     test_dir = "tests/testdata/"
     for pid in pids.keys():
         path = test_dir + pid.replace("/", "_")
-        _object_metadata = store.store_object(pid, path)
+        store.store_object(pid, path)
 
         client_module_path = f"{client_directory}/client.py"
-        test_store = store.root
+        test_store = str(store.root)
         delete_object_opt = "-retrieveobject"
         client_pid_arg = f"-pid={pid}"
         chs_args = [
@@ -125,7 +173,7 @@ def test_retrieve_objects(capsys, pids, store):
         sys.path.append(client_directory)
         # Manually change sys args to simulate command line arguments
         sys.argv = chs_args
-        client.main()
+        hashstoreclient.main()
 
         object_stream = store.retrieve_object(pid)
         object_content = (
@@ -144,14 +192,14 @@ def test_retrieve_metadata(capsys, pids, store):
     """Test retrieving metadata from a HashStore through client."""
     client_directory = os.getcwd() + "/src/hashstore"
     test_dir = "tests/testdata/"
-    namespace = "http://ns.dataone.org/service/types/v2.0"
+    namespace = "https://ns.dataone.org/service/types/v2.0#SystemMetadata"
     for pid in pids.keys():
         filename = pid.replace("/", "_") + ".xml"
         syspath = Path(test_dir) / filename
         _metadata_cid = store.store_metadata(pid, syspath, namespace)
 
         client_module_path = f"{client_directory}/client.py"
-        test_store = store.root
+        test_store = str(store.root)
         retrieve_metadata_opt = "-retrievemetadata"
         client_pid_arg = f"-pid={pid}"
         format_id = f"-formatid={namespace}"
@@ -167,7 +215,7 @@ def test_retrieve_metadata(capsys, pids, store):
         sys.path.append(client_directory)
         # Manually change sys args to simulate command line arguments
         sys.argv = chs_args
-        client.main()
+        hashstoreclient.main()
 
         metadata_stream = store.retrieve_metadata(pid, namespace)
         metadata_content = (
@@ -188,10 +236,10 @@ def test_delete_objects(pids, store):
     test_dir = "tests/testdata/"
     for pid in pids.keys():
         path = test_dir + pid.replace("/", "_")
-        _object_metadata = store.store_object(pid, path)
+        store.store_object(pid, path)
 
         client_module_path = f"{client_directory}/client.py"
-        test_store = store.root
+        test_store = str(store.root)
         delete_object_opt = "-deleteobject"
         client_pid_arg = f"-pid={pid}"
         chs_args = [
@@ -205,23 +253,23 @@ def test_delete_objects(pids, store):
         sys.path.append(client_directory)
         # Manually change sys args to simulate command line arguments
         sys.argv = chs_args
-        client.main()
+        hashstoreclient.main()
 
-        assert not store.exists("objects", pids[pid]["object_cid"])
+        assert not store._exists("objects", pids[pid][store.algorithm])
 
 
 def test_delete_metadata(pids, store):
     """Test deleting metadata from a HashStore through client."""
     client_directory = os.getcwd() + "/src/hashstore"
     test_dir = "tests/testdata/"
-    namespace = "http://ns.dataone.org/service/types/v2.0"
+    namespace = "https://ns.dataone.org/service/types/v2.0#SystemMetadata"
     for pid in pids.keys():
         filename = pid.replace("/", "_") + ".xml"
         syspath = Path(test_dir) / filename
         _metadata_cid = store.store_metadata(pid, syspath, namespace)
 
         client_module_path = f"{client_directory}/client.py"
-        test_store = store.root
+        test_store = str(store.root)
         delete_metadata_opt = "-deletemetadata"
         client_pid_arg = f"-pid={pid}"
         format_id = f"-formatid={namespace}"
@@ -237,6 +285,6 @@ def test_delete_metadata(pids, store):
         sys.path.append(client_directory)
         # Manually change sys args to simulate command line arguments
         sys.argv = chs_args
-        client.main()
+        hashstoreclient.main()
 
-        assert not store.exists("metadata", pids[pid]["metadata_cid"])
+        assert not store._exists("metadata", pids[pid]["metadata_cid"])
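All of the client tests in this patch drive the command-line interface the same way: they append the client directory to `sys.path`, overwrite `sys.argv` with the simulated arguments (the store path first, then an option flag such as `-storeobject`, `-getchecksum`, or `-deletemetadata`), and then call `hashstoreclient.main()`. As a rough, non-authoritative sketch of that same pattern outside of pytest, the snippet below exercises the `-getchecksum` flow directly; the store path, pid, and algorithm shown are placeholder values, and a HashStore with the object already stored is assumed to exist at that path.

```python
# Illustrative sketch only (not part of the patch above): invoke the HashStore
# client the same way test_get_checksum does, by setting sys.argv and calling
# hashstoreclient.main(). STORE_PATH and PID are placeholders.
import sys

from hashstore import hashstoreclient

STORE_PATH = "/var/metacat/hashstore"  # placeholder: root of an existing HashStore
PID = "doi:10.12345/XXXXXXX"           # placeholder: a pid whose object is already stored

sys.argv = [
    "hashstoreclient",  # argv[0]; the tests put the client module path here
    STORE_PATH,         # positional store path, as in the tests' chs_args
    "-getchecksum",     # option flag exercised by test_get_checksum
    f"-pid={PID}",
    "-algo=SHA-256",    # placeholder algorithm; the tests pass store.algorithm
]
hashstoreclient.main()  # prints guid/pid, algorithm, and Checksum/Hex Digest
```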