[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. #6059

Merged
merged 44 commits into from
Jan 14, 2025
Changes from 36 commits
Commits
c4fb29a
fix
yuqi1129 Jan 2, 2025
4791a64
Merge branch 'main' of github.com:datastrato/graviton into 5472
yuqi1129 Jan 2, 2025
baf42e1
fix
yuqi1129 Jan 3, 2025
d86610b
Merge branch 'main' of github.com:datastrato/graviton into 5472
yuqi1129 Jan 3, 2025
b7eb621
fix
yuqi1129 Jan 3, 2025
1ecc378
update the docs
yuqi1129 Jan 4, 2025
d232e92
polish document again.
yuqi1129 Jan 6, 2025
fbd57ba
Again
yuqi1129 Jan 6, 2025
4fb6e79
fix
yuqi1129 Jan 6, 2025
e481c8d
fix
yuqi1129 Jan 6, 2025
0a97fc7
fix
yuqi1129 Jan 6, 2025
a6fbe7b
fix
yuqi1129 Jan 7, 2025
7b47a9b
fix
yuqi1129 Jan 7, 2025
ab07455
Polish the doc
yuqi1129 Jan 7, 2025
6c1aac3
Optimize the docs
yuqi1129 Jan 7, 2025
44014d9
format code.
yuqi1129 Jan 7, 2025
f4968bd
Merge branch 'main' of github.com:datastrato/graviton into 5472
yuqi1129 Jan 7, 2025
8c61d18
Merge branch 'main' of github.com:datastrato/graviton into 5472
yuqi1129 Jan 8, 2025
8563c91
polish document
yuqi1129 Jan 8, 2025
0b066a5
polish docs
yuqi1129 Jan 8, 2025
4c6f4c8
typo
yuqi1129 Jan 8, 2025
76f651e
Polish document again.
yuqi1129 Jan 9, 2025
51446ce
fix
yuqi1129 Jan 9, 2025
2b9c35f
Fix error.
yuqi1129 Jan 9, 2025
d65b995
Fix error.
yuqi1129 Jan 9, 2025
58e3a90
Fix error.
yuqi1129 Jan 9, 2025
746a3ce
fix
yuqi1129 Jan 9, 2025
4d644f1
Optimize document `how-to-use-gvfs.md`
yuqi1129 Jan 10, 2025
cfb054c
Optimize structure.
yuqi1129 Jan 10, 2025
de96e74
resolve comments
yuqi1129 Jan 13, 2025
7806b2f
resolve comments
yuqi1129 Jan 13, 2025
71586f3
Polish documents
yuqi1129 Jan 13, 2025
7b8ad31
fix
yuqi1129 Jan 13, 2025
c9eca73
fix
yuqi1129 Jan 13, 2025
aacd58f
fix
yuqi1129 Jan 13, 2025
d3a8986
fix
yuqi1129 Jan 14, 2025
54536d9
fix
yuqi1129 Jan 14, 2025
b2e357f
Merge branch 'main' of github.com:datastrato/graviton into 5472
yuqi1129 Jan 14, 2025
1971ba1
Resolve python code indent and fix table format problem.
yuqi1129 Jan 14, 2025
30f4271
Fix incompleted description about endpoint for S3
yuqi1129 Jan 14, 2025
65c171c
Optimize ADLS descriptions
yuqi1129 Jan 14, 2025
1e155d4
Fix the problem in #5737 that does not change azure account-name and …
yuqi1129 Jan 14, 2025
d2d2de3
fix
yuqi1129 Jan 14, 2025
e01b201
fix again
yuqi1129 Jan 14, 2025
26 changes: 26 additions & 0 deletions docs/hadoop-catalog-index.md
@@ -0,0 +1,26 @@
---
title: "Hadoop catalog index"
slug: /hadoop-catalog-index
date: 2025-01-13
keyword: Hadoop catalog index S3 GCS ADLS OSS
license: "This software is licensed under the Apache License version 2."
---

### Hadoop catalog overview

The Gravitino Hadoop catalog index includes the following chapters:

- [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities, and related configurations.
- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using the Gravitino API and provides detailed examples.
- [Using Hadoop catalog with Gravitino Virtual File System](how-to-use-gvfs.md): This chapter explains how to use the Hadoop catalog with the Gravitino Virtual File System (GVFS) and provides detailed examples.
Contributor review comment: Gravitino virtual System -> Gravitino virtual file system
### Hadoop catalog with cloud storage

Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS:

- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md).
- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md).
- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md).
- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md).

More storage options will be added soon. Stay tuned!
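
As a quick taste of the cloud-storage chapters above, the sketch below assembles the catalog properties for an S3-backed Hadoop catalog. The property keys are the documented ones; the bucket, endpoint, and credentials are placeholders you must replace with your own values:

```python
# Catalog properties for an S3-backed Hadoop catalog. The keys
# ("location", "s3-access-key-id", ...) are the documented property
# names; the values are placeholders for illustration only.
s3_properties = {
    "location": "s3a://bucket/root",
    "s3-access-key-id": "access_key",
    "s3-secret-access-key": "secret_key",
    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
    "filesystem-providers": "s3",
}

# With a running Gravitino server, the catalog would then be created
# through the Python client (sketch, not executed here):
#
# s3_catalog = gravitino_client.create_catalog(
#     name="catalog",
#     type=Catalog.Type.FILESET,
#     provider="hadoop",  # Gravitino only supports "hadoop" for now
#     comment="This is a S3 fileset catalog",
#     properties=s3_properties,
# )
```

See [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md) for the full property reference and end-to-end examples.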
525 changes: 525 additions & 0 deletions docs/hadoop-catalog-with-adls.md

Large diffs are not rendered by default.

503 changes: 503 additions & 0 deletions docs/hadoop-catalog-with-gcs.md

Large diffs are not rendered by default.

541 changes: 541 additions & 0 deletions docs/hadoop-catalog-with-oss.md

Large diffs are not rendered by default.

544 changes: 544 additions & 0 deletions docs/hadoop-catalog-with-s3.md

Large diffs are not rendered by default.

86 changes: 18 additions & 68 deletions docs/hadoop-catalog.md

Large diffs are not rendered by default.

173 changes: 7 additions & 166 deletions docs/how-to-use-gvfs.md

Large diffs are not rendered by default.

59 changes: 3 additions & 56 deletions docs/manage-fileset-metadata-using-gravitino.md
@@ -15,7 +15,9 @@ filesets to manage non-tabular data like training datasets and other raw data.

Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc.
With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with
tabular data in Gravitino in a unified way.
tabular data in Gravitino in a unified way. The following operations use HDFS as an example; for other
HCFS implementations such as S3, OSS, GCS, and ADLS, please refer to the corresponding documents: [hadoop-with-s3](./hadoop-catalog-with-s3.md), [hadoop-with-oss](./hadoop-catalog-with-oss.md), [hadoop-with-gcs](./hadoop-catalog-with-gcs.md), and
[hadoop-with-adls](./hadoop-catalog-with-adls.md).

After a fileset is created, users can easily access and manage the files/directories through
the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with
@@ -53,24 +55,6 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

# create a S3 catalog
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "catalog",
"type": "FILESET",
"comment": "comment",
"provider": "hadoop",
"properties": {
"location": "s3a://bucket/root",
"s3-access-key-id": "access_key",
"s3-secret-access-key": "secret_key",
"s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
"filesystem-providers": "s3"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

# For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
# The following link about the catalog properties.
```

</TabItem>
@@ -93,25 +77,8 @@ Catalog catalog = gravitinoClient.createCatalog("catalog",
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a Hadoop fileset catalog",
properties);

// create a S3 catalog
s3Properties = ImmutableMap.<String, String>builder()
.put("location", "s3a://bucket/root")
.put("s3-access-key-id", "access_key")
.put("s3-secret-access-key", "secret_key")
.put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
.put("filesystem-providers", "s3")
.build();

Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
Type.FILESET,
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a S3 fileset catalog",
s3Properties);
// ...

// For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
// The following link about the catalog properties.
```

</TabItem>
@@ -124,23 +91,6 @@ catalog = gravitino_client.create_catalog(name="catalog",
provider="hadoop",
comment="This is a Hadoop fileset catalog",
properties={"location": "/tmp/test1"})

# create a S3 catalog
s3_properties = {
"location": "s3a://bucket/root",
"s3-access-key-id": "access_key"
"s3-secret-access-key": "secret_key",
"s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
}

s3_catalog = gravitino_client.create_catalog(name="catalog",
type=Catalog.Type.FILESET,
provider="hadoop",
comment="This is a S3 fileset catalog",
properties=s3_properties)

# For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
# The following link about the catalog properties.
```

</TabItem>
@@ -371,11 +321,8 @@ The `storageLocation` is the physical location of the fileset. Users can specify
when creating a fileset, or follow the rules of the catalog/schema location if not specified.

The value of `storageLocation` depends on the configuration settings of the catalog:
- If this is a S3 fileset catalog, the `storageLocation` should be in the format of `s3a://bucket-name/path/to/fileset`.
- If this is an OSS fileset catalog, the `storageLocation` should be in the format of `oss://bucket-name/path/to/fileset`.
- If this is a local fileset catalog, the `storageLocation` should be in the format of `file:///path/to/fileset`.
- If this is an HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`.
- If this is a GCS fileset catalog, the `storageLocation` should be in the format of `gs://bucket-name/path/to/fileset`.
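
The location formats above can be sanity-checked with a small helper; `build_storage_location` below is a hypothetical illustration for this document, not part of the Gravitino API:

```python
def build_storage_location(catalog_type: str, root: str, path: str) -> str:
    """Build a `storageLocation` URI for a fileset (illustrative only).

    Covers the local format (`file:///path/to/fileset`, where `root` is
    an absolute directory) and the HDFS format
    (`hdfs://namenode:port/path/to/fileset`, where `root` is
    `namenode:port`). Cloud schemes (s3a, gs, oss) are covered in the
    dedicated cloud-storage documents.
    """
    schemes = {
        "local": "file://",  # e.g. root = "/tmp/test1"
        "hdfs": "hdfs://",   # e.g. root = "namenode:9000"
    }
    if catalog_type not in schemes:
        raise ValueError(f"unsupported catalog type: {catalog_type}")
    return f"{schemes[catalog_type]}{root}/{path.strip('/')}"
```
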

For a `MANAGED` fileset, the storage location is:
