Setup_audio #322

Merged on Feb 8, 2024 (65 commits). The diff below shows changes from 4 of them.

Commits
ade26eb
This should make the validation process easier.
monke6942021 Sep 14, 2023
5faaf02
Validation will be easier now.
monke6942021 Sep 14, 2023
b84be69
Merge branch 'main' of https://github.com/mlcommons/croissant
monke6942021 Sep 14, 2023
82b0217
This will let croissant recognize AudioObjects
monke6942021 Nov 8, 2023
506bd0d
mini_CC_BY_SA dataset has been added
monke6942021 Nov 16, 2023
9861395
This should make the validation process easier.
monke6942021 Sep 14, 2023
50d78ac
mini_CC_BY_SA dataset has been added
monke6942021 Nov 16, 2023
369621d
mini_CC_BY_SA is now part of the non-hermetic tests
monke6942021 Nov 16, 2023
5723b3f
Now it takes input from a Google Drive
monke6942021 Nov 16, 2023
4b7150e
Many of the tests were failing because of librosa, so anything with i…
monke6942021 Nov 20, 2023
fafe57d
Hopefully this passes the pr test
monke6942021 Nov 20, 2023
30d2aca
The validation script CLAIMS that it works.
monke6942021 Nov 20, 2023
5f23744
Hopefully black[jupyter] helps
monke6942021 Nov 20, 2023
56a86e8
Now jupyter has been added before the original black. Let's see if th…
monke6942021 Nov 20, 2023
babbbc2
I can try resetting to see if the formatting test works.
monke6942021 Nov 20, 2023
8cd0e89
Changed treatment of AudioObject in field.py
monke6942021 Dec 7, 2023
7d22310
This version can load audio in field.py
monke6942021 Dec 18, 2023
363bac6
Merge branch 'main' into setup_audio
monke6942021 Dec 18, 2023
4e5f54e
Audio files have some level of usability now.
monke6942021 Dec 18, 2023
081c129
The exact dataset I worked with
monke6942021 Dec 18, 2023
5a88008
I got the audio support to work.
monke6942021 Dec 18, 2023
154ef14
This should be a good touch up to help with jpeg files too.
monke6942021 Dec 19, 2023
28116d8
This is a cheap shot option, but it might pass the unit tests.
monke6942021 Jan 4, 2024
f42626f
Removed URL because it's not being used.
monke6942021 Jan 4, 2024
7f94125
librosa is now in the correct alphabetical position.
monke6942021 Jan 4, 2024
7706916
This is a gamble.
monke6942021 Jan 4, 2024
7fdc4b8
This is another gamble.
monke6942021 Jan 4, 2024
e4011ea
Changed import method for librosa.
monke6942021 Jan 4, 2024
8be8b76
I changed the pyproject file. Let's see if it helps.
monke6942021 Jan 7, 2024
dcc8e01
Commented out the librosa import.
monke6942021 Jan 7, 2024
e9121b2
Hope this passes PyTest
monke6942021 Jan 7, 2024
8924b08
Installing black[jupyter]
monke6942021 Jan 7, 2024
2e24289
Storing black[jupyter] as a dependency.
monke6942021 Jan 7, 2024
5a441e8
black formatted some files.
monke6942021 Jan 7, 2024
12a0fd2
Another file has been black reformatted
monke6942021 Jan 7, 2024
578127e
I ran the test locally, and changed the file accordingly. I hope it w…
monke6942021 Jan 10, 2024
bec4b4f
I deleted the obvious error.
monke6942021 Jan 10, 2024
2dd38a7
Changed format of output.
monke6942021 Jan 10, 2024
ffb126d
Changed output to string format.
monke6942021 Jan 10, 2024
ca6db64
I just changed the test case, so this is a huge gamble.
monke6942021 Jan 10, 2024
24ffa9f
Pytest wants the data to be in double quotations.
monke6942021 Jan 10, 2024
4d48a9a
The output in records.jsonl and the output of the actual function are…
monke6942021 Jan 10, 2024
6777a07
The output file is going to be very long. I don't see any other way t…
monke6942021 Jan 10, 2024
b70229e
I forgot to save the output.
monke6942021 Jan 10, 2024
5c0eaed
Changed the number. Let's see if it works.
monke6942021 Jan 10, 2024
2a69efd
Mentioned that the package name is also librosa.
monke6942021 Jan 10, 2024
ab23572
Hope this works.
monke6942021 Jan 10, 2024
cf043c8
updated dev dependencies.
monke6942021 Jan 10, 2024
fd5393a
I used black in a different way, so it might work this time.
monke6942021 Jan 11, 2024
4c40d91
addressing the issues that Pierre brought up.
monke6942021 Jan 16, 2024
230d1fd
Since I used load.py for this one, I'm confident that this should work.
monke6942021 Jan 16, 2024
ba72b11
Changed the name of the library
monke6942021 Jan 30, 2024
0177150
removed the comment
monke6942021 Jan 30, 2024
9f2d083
dealt with conflicts.
monke6942021 Jan 30, 2024
c5d38e2
Hopefully I don't need git LFS for this.
monke6942021 Jan 30, 2024
b587614
black reformatted optional.py
monke6942021 Jan 30, 2024
1834c49
For some reason, my unit test is supposed to be in 0.8
monke6942021 Jan 30, 2024
7611a5a
Removed audio_test from 0.8.
monke6942021 Jan 31, 2024
c66524b
Added fileObject and fileSet in context.
monke6942021 Jan 31, 2024
a777c7f
I will remove this from 0.8 later, but I need to prove a point.
monke6942021 Jan 31, 2024
8b9b511
Adding audio_test to 0.8 did nothing of value.
monke6942021 Jan 31, 2024
b34ede7
Merge branch 'main' of https://github.com/mlcommons/croissant into se…
monke6942021 Feb 8, 2024
c0ce931
Changed audio_test to be compatible with new changes.
monke6942021 Feb 8, 2024
63603e4
This should pass the tests, but validation still works a bit differen…
monke6942021 Feb 8, 2024
d212703
They needed a version of the audio dataset in 0.8
monke6942021 Feb 8, 2024
1 change: 1 addition & 0 deletions python/mlcroissant/mlcroissant/_src/core/constants.py
@@ -46,6 +46,7 @@
SCHEMA_ORG_CONTENT_SIZE = namespace.SDO.contentSize
SCHEMA_ORG_CONTENT_URL = namespace.SDO.contentUrl
SCHEMA_ORG_DATASET = namespace.SDO.Dataset
SCHEMA_ORG_DATA_TYPE_AUDIO_OBJECT = namespace.SDO.AudioObject
SCHEMA_ORG_DATA_TYPE_BOOL = namespace.SDO.Boolean
SCHEMA_ORG_DATA_TYPE_DATE = namespace.SDO.Date
SCHEMA_ORG_DATA_TYPE_FLOAT = namespace.SDO.Float
3 changes: 3 additions & 0 deletions python/mlcroissant/mlcroissant/_src/core/data_types.py
@@ -28,6 +28,9 @@ def check_expected_type(issues: Issues, jsonld: Json, expected_type: str):
    constants.SCHEMA_ORG_DATA_TYPE_IMAGE_OBJECT: (
        constants.SCHEMA_ORG_DATA_TYPE_IMAGE_OBJECT
    ),
    constants.SCHEMA_ORG_DATA_TYPE_AUDIO_OBJECT: (

Review comment (Contributor):

I believe this has changed. Can you rebase to get the latest version of the code?

git pull --rebase origin main

        constants.SCHEMA_ORG_DATA_TYPE_AUDIO_OBJECT
    ),
    constants.SCHEMA_ORG_DATA_TYPE_INTEGER: int,
    constants.SCHEMA_ORG_DATA_TYPE_TEXT: str,
    constants.SCHEMA_ORG_DATA_TYPE_URL: str,
@@ -7,6 +7,8 @@
from etils import epath
import pandas as pd

import librosa

from mlcroissant._src.core import constants
from mlcroissant._src.core.optional import deps
from mlcroissant._src.operation_graph.base_operation import Operation
@@ -27,6 +29,8 @@ def _cast_value(self, value: Any):
            return value
        elif data_type == constants.SCHEMA_ORG_DATA_TYPE_IMAGE_OBJECT:
            return deps.PIL_Image.open(io.BytesIO(value))
        elif data_type == constants.SCHEMA_ORG_DATA_TYPE_AUDIO_OBJECT:
            return librosa.load(io.BytesIO(value))
        elif data_type == pd.Timestamp:
            # The date format is the first format found in the field's source.
            format = next(
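
The `_cast_value` hunk above decodes audio by wrapping the raw bytes in `io.BytesIO` and passing them to `librosa.load`, which returns a `(samples, sample_rate)` pair. The sketch below illustrates the same in-memory decoding pattern using only the standard library's `wave` module as a stand-in for librosa. The constants `AUDIO_OBJECT` and `TEXT` and the helpers `make_wav_bytes`, `decode_wav`, and `cast_value` are hypothetical names chosen for illustration; they are not part of mlcroissant.

```python
import io
import struct
import wave
from typing import Any

# Hypothetical stand-ins for the schema.org data type constants in the diff.
AUDIO_OBJECT = "https://schema.org/AudioObject"
TEXT = "https://schema.org/Text"


def make_wav_bytes(samples: list[int], sample_rate: int = 8000) -> bytes:
    """Encode 16-bit mono PCM samples as an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(sample_rate)
        writer.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


def decode_wav(value: bytes) -> tuple[list[int], int]:
    """Decode WAV bytes into (samples, sample_rate) without touching disk."""
    with wave.open(io.BytesIO(value), "rb") as reader:
        frames = reader.readframes(reader.getnframes())
        sample_rate = reader.getframerate()
    samples = list(struct.unpack(f"<{len(frames) // 2}h", frames))
    return samples, sample_rate


def cast_value(data_type: str, value: Any) -> Any:
    """Dispatch on the declared data type, as the elif chain in the diff does."""
    if data_type == AUDIO_OBJECT:
        return decode_wav(value)
    if data_type == TEXT:
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return value  # Unknown types pass through unchanged.


raw = make_wav_bytes([0, 1000, -1000, 32767])
samples, sample_rate = cast_value(AUDIO_OBJECT, raw)
print(samples, sample_rate)
```

Unlike this sketch, `librosa.load` returns floating-point samples and, by default, resamples to 22050 Hz and mixes down to mono, so callers that need the file's native rate would have to request it explicitly.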
27 changes: 27 additions & 0 deletions requirements.txt
@@ -0,0 +1,27 @@
absl-py==1.4.0

Review comment (Contributor):

We manage dependencies with pyproject.toml, not requirements.txt.

certifi==2023.7.22
charset-normalizer==3.2.0
decorator==5.1.1
etils==1.4.1
idna==3.4
importlib-resources==6.0.1
isodate==0.6.1
jsonpath-rw==1.4.0
networkx==3.1
numpy==1.25.2
pandas==2.1.0
pip==22.0.2
ply==3.11
pygraphviz==1.11
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
rdflib==7.0.0
requests==2.31.0
setuptools==59.6.0
six==1.16.0
toml==0.10.2
tqdm==4.66.1
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
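
The diff in this PR reads PIL through `deps.PIL_Image` but imports librosa at module level, and several commits above mention test failures tied to librosa. One possible way to keep a heavy dependency optional is to import it lazily on first use. The following is a minimal sketch of that pattern under stated assumptions, not mlcroissant's actual `optional.py`; the class `OptionalDeps` and its name mapping are hypothetical.

```python
import importlib
from types import ModuleType


class OptionalDeps:
    """Hypothetical lazy loader: imports a module the first time it is accessed."""

    def __init__(self, modules: dict[str, str]):
        # Maps attribute name -> importable module name.
        self._modules = modules
        self._cache: dict[str, ModuleType] = {}

    def __getattr__(self, name: str) -> ModuleType:
        # Only reached when normal attribute lookup fails, i.e. for lazy modules.
        if name.startswith("_") or name not in self._modules:
            raise AttributeError(name)
        if name not in self._cache:
            try:
                self._cache[name] = importlib.import_module(self._modules[name])
            except ImportError as error:
                raise ImportError(
                    f"Optional dependency '{self._modules[name]}' is not installed."
                ) from error
        return self._cache[name]


# Nothing is imported here; librosa is only needed once deps.librosa is read.
deps = OptionalDeps({"json": "json", "librosa": "librosa"})
print(deps.json.dumps({"ok": True}))
```

With this shape, code paths that never touch audio would not require librosa to be installed, and a missing package surfaces as an `ImportError` at the point of use.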