Setup_audio #322

Merged
merged 65 commits into from
Feb 8, 2024
Changes from 49 commits
ade26eb
This should make the validation process easier.
monke6942021 Sep 14, 2023
5faaf02
Validation will be easier now.
monke6942021 Sep 14, 2023
b84be69
Merge branch 'main' of https://github.com/mlcommons/croissant
monke6942021 Sep 14, 2023
82b0217
This will let croissant recognize AudioObjects
monke6942021 Nov 8, 2023
506bd0d
mini_CC_BY_SA dataset has been added
monke6942021 Nov 16, 2023
9861395
This should make the validation process easier.
monke6942021 Sep 14, 2023
50d78ac
mini_CC_BY_SA dataset has been added
monke6942021 Nov 16, 2023
369621d
mini_CC_BY_SA is now part of the non-hermetic tests
monke6942021 Nov 16, 2023
5723b3f
Now it takes input from a Google Drive
monke6942021 Nov 16, 2023
4b7150e
Many of the tests were failing because of librosa, so anything with i…
monke6942021 Nov 20, 2023
fafe57d
Hopefully this passes the pr test
monke6942021 Nov 20, 2023
30d2aca
The validation script CLAIMS that it works.
monke6942021 Nov 20, 2023
5f23744
Hopefully black[jupyter] helps
monke6942021 Nov 20, 2023
56a86e8
Now jupyter has been added before the original black. Let's see if th…
monke6942021 Nov 20, 2023
babbbc2
I can try resetting to see if the formatting test works.
monke6942021 Nov 20, 2023
8cd0e89
Changed treatment of AudioObject in field.py
monke6942021 Dec 7, 2023
7d22310
This version can load audio in field.py
monke6942021 Dec 18, 2023
363bac6
Merge branch 'main' into setup_audio
monke6942021 Dec 18, 2023
4e5f54e
Audio files have some level of usability now.
monke6942021 Dec 18, 2023
081c129
The exact dataset I worked with
monke6942021 Dec 18, 2023
5a88008
I got the audio support to work.
monke6942021 Dec 18, 2023
154ef14
This should be a good touch up to help with jpeg files too.
monke6942021 Dec 19, 2023
28116d8
This is a cheap shot option, but it might pass the unit tests.
monke6942021 Jan 4, 2024
f42626f
Removed URL because it's not being used.
monke6942021 Jan 4, 2024
7f94125
librosa is now in ththe correct alphabetical position.
monke6942021 Jan 4, 2024
7706916
This is a gamble.
monke6942021 Jan 4, 2024
7fdc4b8
This is another gamble.
monke6942021 Jan 4, 2024
e4011ea
Changed import method for librosa.
monke6942021 Jan 4, 2024
8be8b76
I changed the pyptoject file. Let's see if it helps.
monke6942021 Jan 7, 2024
dcc8e01
Commented out the librosa import.
monke6942021 Jan 7, 2024
e9121b2
Hope this passes PyTest
monke6942021 Jan 7, 2024
8924b08
Installing black[jupyter]
monke6942021 Jan 7, 2024
2e24289
Storing black[jupyter] as a dependency.
monke6942021 Jan 7, 2024
5a441e8
black formatted some files.
monke6942021 Jan 7, 2024
12a0fd2
Another file has been black reformatted
monke6942021 Jan 7, 2024
578127e
I ran the test locally, and changed the file accordingly. I hope it w…
monke6942021 Jan 10, 2024
bec4b4f
I deleted the obvious error.
monke6942021 Jan 10, 2024
2dd38a7
Changed format of output.
monke6942021 Jan 10, 2024
ffb126d
Changed output to string format.
monke6942021 Jan 10, 2024
ca6db64
I just changed the test case, so this is a huge gamble.
monke6942021 Jan 10, 2024
24ffa9f
Pytest wants the data to be in double quotations.
monke6942021 Jan 10, 2024
4d48a9a
The output in records.jsonl and the output of the actual function are…
monke6942021 Jan 10, 2024
6777a07
The output file is going to be very long. I don't see any other way t…
monke6942021 Jan 10, 2024
b70229e
I forgot to save the output.
monke6942021 Jan 10, 2024
5c0eaed
Changed the number. Let's see if it works.
monke6942021 Jan 10, 2024
2a69efd
Mentioned that the package name is also librosa.
monke6942021 Jan 10, 2024
ab23572
Hope this works.
monke6942021 Jan 10, 2024
cf043c8
updated dev dependencies.
monke6942021 Jan 10, 2024
fd5393a
I used black in a different way, so it might work this time.
monke6942021 Jan 11, 2024
4c40d91
addressing the issues that Pierre bought up.
monke6942021 Jan 16, 2024
230d1fd
Since I used load.py for this one, I'm confident that this should work.
monke6942021 Jan 16, 2024
ba72b11
Changed the name of the library
monke6942021 Jan 30, 2024
0177150
removed the commment
monke6942021 Jan 30, 2024
9f2d083
dealt with conflicts.
monke6942021 Jan 30, 2024
c5d38e2
Hopefully I don't need git LFS for this.
monke6942021 Jan 30, 2024
b587614
black reformatted optional.py
monke6942021 Jan 30, 2024
1834c49
For some reason, my unit test is supposed to be in 0.8
monke6942021 Jan 30, 2024
7611a5a
Removed audio_test from 0.8.
monke6942021 Jan 31, 2024
c66524b
Added fileObject and fileSet in context.
monke6942021 Jan 31, 2024
a777c7f
I will remove this from 0.8 later, but I need to prove a point.
monke6942021 Jan 31, 2024
8b9b511
Adding audio_test to 0.8 did nothing of value.
monke6942021 Jan 31, 2024
b34ede7
Merge branch 'main' of https://github.com/mlcommons/croissant into se…
monke6942021 Feb 8, 2024
c0ce931
Changed audio_test to be compatible with new changes.
monke6942021 Feb 8, 2024
63603e4
This should pass the tests, but validation still works a bit differen…
monke6942021 Feb 8, 2024
d212703
They needed a version of the audio dataset in 0.8
monke6942021 Feb 8, 2024
Binary file added datasets/audio_test/data/Clap.mp3
Binary file not shown.
Binary file added datasets/audio_test/data/Snap.mp3
Binary file not shown.
75 changes: 75 additions & 0 deletions datasets/audio_test/metadata.json
@@ -0,0 +1,75 @@
+{
+  "@context": {
+    "@language": "en",
+    "@vocab": "https://schema.org/",
+    "column": "ml:column",
+    "conformsTo": "dct:conformsTo",
+    "data": {
+      "@id": "ml:data",
+      "@type": "@json"
+    },
+    "dataBiases": "ml:dataBiases",
+    "dataCollection": "ml:dataCollection",
+    "dataType": {
+      "@id": "ml:dataType",
+      "@type": "@vocab"
+    },
+    "dct": "http://purl.org/dc/terms/",
+    "extract": "ml:extract",
+    "field": "ml:field",
+    "fileProperty": "ml:fileProperty",
+    "format": "ml:format",
+    "includes": "ml:includes",
+    "isEnumeration": "ml:isEnumeration",
+    "jsonPath": "ml:jsonPath",
+    "ml": "http://mlcommons.org/schema/",
+    "parentField": "ml:parentField",
+    "path": "ml:path",
+    "personalSensitiveInformation": "ml:personalSensitiveInformation",
+    "recordSet": "ml:recordSet",
+    "references": "ml:references",
+    "regex": "ml:regex",
+    "repeated": "ml:repeated",
+    "replace": "ml:replace",
+    "sc": "https://schema.org/",
+    "separator": "ml:separator",
+    "source": "ml:source",
+    "subField": "ml:subField",
+    "transform": "ml:transform",
+    "wd": "https://www.wikidata.org/wiki/"
+  },
+  "@type": "sc:Dataset",
+  "name": "audio_test",
+  "description": "This is the basic test case for audio files",
+  "conformsTo": "http://mlcommons.org/croissant/1.0",
+  "url": "None",
+  "distribution": [
+    {
+      "@type": "sc:FileSet",
+      "name": "files",
+      "encodingFormat": "audio/mpeg",
+      "includes": "data/*.mp3"
+    }
+  ],
+  "recordSet": [
+    {
+      "@type": "ml:RecordSet",
+      "name": "records",
+      "description": "These are the records.",
+      "field": [
+        {
+          "@type": "ml:Field",
+          "name": "audio",
+          "description": "These are the sounds.",
+          "dataType": "sc:AudioObject",
+          "source": {
+            "distribution": "files",
+            "extract": {
+              "fileProperty": "content"
+            }
+          }
+        }
+      ]
+    }
+  ]
+}
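For readers following the metadata above: the `audio` field points at the `files` FileSet through `source.distribution`, and that FileSet selects files with the glob in `includes`. The sketch below walks that chain using only the standard library. It is not how mlcroissant resolves sources internally; `files_for_field` and the candidate path list are hypothetical names made up for illustration, and the metadata is a trimmed copy of the file above.

```python
import fnmatch
import json

# Trimmed copy of the metadata above: only the keys this sketch reads.
METADATA = json.loads("""
{
  "distribution": [
    {"@type": "sc:FileSet", "name": "files",
     "encodingFormat": "audio/mpeg", "includes": "data/*.mp3"}
  ],
  "recordSet": [
    {"name": "records",
     "field": [
       {"name": "audio", "dataType": "sc:AudioObject",
        "source": {"distribution": "files",
                   "extract": {"fileProperty": "content"}}}
     ]}
  ]
}
""")


def files_for_field(metadata, record_set, field, candidates):
    """Hypothetical helper: which candidate paths feed `record_set/field`?"""
    distributions = {d["name"]: d for d in metadata["distribution"]}
    for rs in metadata["recordSet"]:
        if rs["name"] != record_set:
            continue
        for f in rs["field"]:
            if f["name"] == field:
                # Follow source.distribution to the FileSet, then apply its glob.
                file_set = distributions[f["source"]["distribution"]]
                pattern = file_set["includes"]
                return sorted(
                    p for p in candidates if fnmatch.fnmatchcase(p, pattern)
                )
    raise KeyError(f"{record_set}/{field} not found")


# Assumed directory listing for the dataset folder (illustrative only).
paths = ["data/Clap.mp3", "data/Snap.mp3", "metadata.json"]
matched = files_for_field(METADATA, "records", "audio", paths)
print(matched)
```

With `fileProperty` set to `content`, each matched file would then contribute its raw bytes as one record.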
2 changes: 2 additions & 0 deletions datasets/audio_test/output/records.jsonl

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions editor/events/resources.py
@@ -72,6 +72,6 @@ def _create_instance1_from_instance2(instance1: Resource, instance2: type):
     attributes1 = set((field.name for field in dataclasses.fields(instance1)))
     attributes2 = set((field.name for field in dataclasses.fields(instance2)))
     common_attributes = attributes2.intersection(attributes1)
-    return instance2(**{
-        attribute: getattr(instance1, attribute) for attribute in common_attributes
-    })
+    return instance2(
+        **{attribute: getattr(instance1, attribute) for attribute in common_attributes}
+    )
11 changes: 11 additions & 0 deletions python/mlcroissant/mlcroissant/_src/core/constants.py
@@ -55,6 +55,14 @@
 SCHEMA_ORG_CREATOR = namespace.SDO.creator
 SCHEMA_ORG_DATE_PUBLISHED = namespace.SDO.datePublished
 SCHEMA_ORG_DATASET = namespace.SDO.Dataset
+SCHEMA_ORG_DATA_TYPE_AUDIO_OBJECT = namespace.SDO.AudioObject
+SCHEMA_ORG_DATA_TYPE_BOOL = namespace.SDO.Boolean
+SCHEMA_ORG_DATA_TYPE_DATE = namespace.SDO.Date
+SCHEMA_ORG_DATA_TYPE_FLOAT = namespace.SDO.Float
+SCHEMA_ORG_DATA_TYPE_IMAGE_OBJECT = namespace.SDO.ImageObject
+SCHEMA_ORG_DATA_TYPE_INTEGER = namespace.SDO.Integer
+SCHEMA_ORG_DATA_TYPE_TEXT = namespace.SDO.Text
+SCHEMA_ORG_DATA_TYPE_URL = namespace.SDO.URL
 SCHEMA_ORG_DESCRIPTION = namespace.SDO.description
 SCHEMA_ORG_DISTRIBUTION = namespace.SDO.distribution
 SCHEMA_ORG_EMAIL = namespace.SDO.email
@@ -124,8 +132,10 @@ class EncodingFormat:

     CSV = "text/csv"
     GIT = "git+https"
+    JPG = "image/jpeg"
     JSON = "application/json"
     JSON_LINES = "application/jsonlines"
+    MP3 = "audio/mpeg"
     PARQUET = "application/x-parquet"
     TEXT = "text/plain"
     TSV = "text/tsv"
@@ -136,6 +146,7 @@ class EncodingFormat:
 class DataType:
     """Data types supported by Croissant."""

+    AUDIO_OBJECT = namespace.SDO.AudioObject
     BOOL = namespace.SDO.Boolean
     BOUNDING_BOX = ML_COMMONS.BoundingBox
     DATE = namespace.SDO.Date
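The two `EncodingFormat` values added above (`audio/mpeg` and `image/jpeg`) are the MIME types usually associated with `.mp3` and `.jpg` files. As a quick sanity check, the sketch below compares them against Python's standard `mimetypes` table. `guess_encoding_format` is a hypothetical helper for illustration and is not part of mlcroissant; the constants are copied from the diff.

```python
import mimetypes
from typing import Optional

# Values copied from the EncodingFormat additions in the diff above.
MP3 = "audio/mpeg"
JPG = "image/jpeg"


def guess_encoding_format(filename: str) -> Optional[str]:
    """Hypothetical helper: MIME type guessed from the file extension."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Compare the guessed type with the constant for each sample file name.
for name, expected in [("data/Clap.mp3", MP3), ("photo.jpg", JPG)]:
    print(name, guess_encoding_format(name) == expected)
```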
5 changes: 5 additions & 0 deletions python/mlcroissant/mlcroissant/_src/core/optional.py
@@ -86,5 +86,10 @@ def PIL_Image(cls) -> types.ModuleType:  # pylint: disable=invalid-name
         """Cached git module."""
         return _try_import("PIL.Image", package_name="Pillow")
+
+    @cached_class_property
+    def LIB_Audio(cls) -> types.ModuleType:  # pylint: disable=invalid-name

Contributor:
def librosa (PIL.Image above is a library, so the name here should probably be librosa)

Contributor Author:
I just changed the name, and the usage in field.py.

+        """Cached git module."""

Contributor:
Cached librosa module

+        return _try_import("librosa", package_name="librosa")


 deps = OptionalDependencies
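The hunk above relies on two helpers whose bodies are not shown in this diff, `_try_import` and `cached_class_property`. The sketch below is one plausible way such a lazy optional-import pattern can be written; it is an assumption for illustration, not the project's actual implementation. `try_import`, `OptionalDeps`, and the deliberately missing module name are all hypothetical, and the standard-library `json` module stands in for librosa so the example runs anywhere.

```python
import importlib
import types


def try_import(module_name: str, package_name: str) -> types.ModuleType:
    """Import `module_name`, or explain which pip package provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Missing optional dependency: run `pip install {package_name}`."
        ) from exc


class cached_class_property:
    """Descriptor that computes a class-level value once, on first access."""

    def __init__(self, func):
        self._func = func
        self._cache = {}

    def __get__(self, instance, owner):
        if owner not in self._cache:
            # A failed import raises here, so failures are never cached.
            self._cache[owner] = self._func(owner)
        return self._cache[owner]


class OptionalDeps:
    @cached_class_property
    def json(cls) -> types.ModuleType:
        """Cached json module (stand-in for an optional dependency)."""
        return try_import("json", package_name="(standard library)")

    @cached_class_property
    def missing(cls) -> types.ModuleType:
        """Cached module that is assumed not to be installed."""
        return try_import("surely_not_installed_pkg", package_name="surely-not-installed-pkg")


print(OptionalDeps.json.dumps({"ok": True}))
try:
    OptionalDeps.missing
except ImportError as error:
    print(error)
```

The point of the pattern is that importing the package never fails just because an extra such as librosa is absent; the error only appears when the audio code path is actually used.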
5 changes: 5 additions & 0 deletions python/mlcroissant/mlcroissant/_src/datasets_test.py
@@ -127,6 +127,11 @@ def test_hermetic_loading(dataset_name, record_set_name, num_records):
         ["huggingface-c4/metadata.json", "en", 1],
         ["huggingface-mnist/metadata.json", "default", 10],
         ["titanic/metadata.json", "passengers", -1],
+        [
+            "audio_test/metadata.json",
+            "records",
+            -1,
+        ],  # Switch the number to 10 if nessacary

Contributor:
necessary

Contributor:
Remove the comment?

Contributor Author:
I just removed it.

     ],
 )
 def test_nonhermetic_loading(dataset_name, record_set_name, num_records):

@@ -31,6 +31,9 @@ def _cast_value(value: Any, data_type: type | term.URIRef | None):
             return deps.PIL_Image.open(io.BytesIO(value))
         else:
             raise ValueError(f"Type {type(value)} is not accepted for an image.")
+    elif data_type == DataType.AUDIO_OBJECT:
+        output = deps.LIB_Audio.load(io.BytesIO(value))
+        return str([output[0].tolist(), output[1]])

Contributor:
Why a str? How will you use this in an ML pipeline? What would be the most useful signal to return here?

Contributor Author:
I did it like that because the output needs to have double quotes. I'm about to push a version that just outputs the regular librosa output, but I don't know if it will pass the test.

     elif data_type == DataType.BOUNDING_BOX:
         return bounding_box.parse(value)
     elif not isinstance(data_type, type):

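To make the reviewer's question about the `str` return concrete, here is a small sketch of the trade-off. It does not use librosa: `fake_load` is a hypothetical stand-in decoder that only mimics the (samples, sample rate) shape of the value being stringified in the hunk above, and the 22050 rate is an arbitrary assumption. The sketch shows that a stringified result has to be parsed back before any numeric use, whereas a tuple is usable directly.

```python
import ast
from typing import List, Tuple


def fake_load(raw: bytes) -> Tuple[List[float], int]:
    """Stand-in decoder (NOT librosa): treats each byte as one sample."""
    samples = [byte / 255.0 for byte in raw]
    return samples, 22050  # assumed sample rate, for illustration only


raw = bytes([0, 51, 255])

# Option A: stringify the result, mirroring the shape of the diff above.
decoded = fake_load(raw)
as_str = str([decoded[0], decoded[1]])

# Option B: return the structured value unchanged.
samples, sample_rate = fake_load(raw)

# A consumer of option A has to parse the string before doing any maths.
recovered = ast.literal_eval(as_str)
print(type(as_str).__name__, type(samples).__name__)
print(recovered == [samples, sample_rate])
```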
@@ -120,6 +120,13 @@ def _read_file_content(self, encoding_format: str, file: Path) -> pd.DataFrame:
                 return pd.DataFrame({
                     FileProperty.content: [file.read()],
                 })
+            elif (
+                encoding_format == EncodingFormat.MP3
+                or encoding_format == EncodingFormat.JPG
+            ):
+                return pd.DataFrame({
+                    FileProperty.content: [file.read()],
+                })
             else:
                 raise ValueError(
                     f"Unsupported encoding format for file: {encoding_format}"

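The added branch above treats MP3 and JPG the same way: the whole file is read as bytes and stored under the `content` property. One possible way to express that grouping is a membership test against a set of binary formats, sketched below. This is an illustrative alternative and not the code in the PR: a plain dict stands in for the one-row `pd.DataFrame`, and `BINARY_FORMATS` and `read_file_content` are hypothetical names.

```python
import io

# Values copied from the EncodingFormat constants in the diff.
MP3 = "audio/mpeg"
JPG = "image/jpeg"

# Formats whose content is passed through as raw bytes (assumed grouping).
BINARY_FORMATS = frozenset({MP3, JPG})


def read_file_content(encoding_format: str, file) -> dict:
    """Return the raw bytes of `file` as a single `content` entry."""
    if encoding_format in BINARY_FORMATS:
        return {"content": [file.read()]}
    raise ValueError(f"Unsupported encoding format for file: {encoding_format}")


row = read_file_content(MP3, io.BytesIO(b"fake mp3 bytes"))
print(row)
```

Adding another binary format then only means extending the set, with no extra `or` clause in the condition.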
@@ -101,6 +101,7 @@ def data_type(self) -> type | term.URIRef | None:
         elif data_type in [
             DataType.IMAGE_OBJECT,
             DataType.BOUNDING_BOX,
+            DataType.AUDIO_OBJECT,
         ]:
             return term.URIRef(data_type)
         # The data_type has to be found on a predecessor:
9 changes: 7 additions & 2 deletions python/mlcroissant/pyproject.toml
@@ -7,6 +7,7 @@ version = "0.0.5"
 authors = [
     { name = "Joaquin Vanschoren" },
     { name = "Jos van der Velde" },
+    { name = "Monjish Bhattacharyya" },
     { name = "Omar Benjelloun" },
     { name = "Peter Mattson" },
     { name = "Pieter Gijsbers" },
@@ -18,6 +19,7 @@ authors = [
 # pip dependencies of the project
 # Installed locally with `pip install -e .`
 dependencies = [
+    "black[jupyter]",

Contributor:
Remove jupyter?

     "absl-py",
     "etils[epath]",
     "jsonpath-rw",
@@ -27,7 +29,7 @@ dependencies = [
     "python-dateutil",
     "rdflib",
     "requests",
-    "tqdm",
+    "tqdm"
 ]
 readme = "README.md"

@@ -38,6 +40,7 @@ dev = [
     "black==23.11.0",
     "datasets",
     "flake8-docstrings",
+    "mlcroissant[audio]",
     "mlcroissant[git]",
     "mlcroissant[image]",
     "mlcroissant[parquet]",
@@ -48,6 +51,7 @@ dev = [
     "pytest",
     "pytype",
 ]
+audio = ["librosa"]
 git = ["GitPython"]
 image = ["Pillow"]
 parquet = ["pyarrow"]
@@ -79,9 +83,10 @@ module = [
     "datasets",
     "etils.*",
     "jsonpath_rw",
+    "librosa",
     "networkx",
     "pandas",
-    "pillow",
+    "pillow"
 ]
 ignore_missing_imports = true