[Bug] fugashi.Tagger causes pickling error during multiprocessing in tokenizer (Japanese) #4031

Open
easyautoml opened this issue Oct 18, 2024 · 1 comment

@easyautoml

Describe the bug

When fine-tuning the XTTS model with num_workers > 0 on a Japanese dataset, a TypeError related to fugashi.Tagger occurs.

Specifically, the error "self.c_tagger cannot be converted to a Python object for pickling" is raised because fugashi.Tagger, which the cutlet library uses for Japanese text processing, cannot be serialized for multiprocessing.
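
For reference, the failure can be reproduced outside of TTS in a few lines (a minimal sketch, assuming fugashi is installed):

import pickle
import fugashi

tagger = fugashi.Tagger()
try:
    pickle.dumps(tagger)
except TypeError as e:
    # e.g. "self.c_tagger cannot be converted to a Python object for pickling"
    print(f"Pickling failed: {e}")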

To Reproduce

Steps to Reproduce:

  • Load training and evaluation samples using load_tts_samples().
  • Initialize the Trainer object.
  • Create a training DataLoader using trainer.get_train_dataloader().
  • Set num_workers=2 in the DataLoader to enable multiprocessing.
  • Attempt to iterate through the DataLoader and observe the error.
import pandas as pd
from torch.utils.data import DataLoader
from trainer import Trainer
from TTS.tts.datasets import load_tts_samples

train_samples, eval_samples = load_tts_samples(
    # Your loading code here...
)

# Initialize the trainer
trainer = Trainer(
    # Trainer initialization code here...
)

# Build the training DataLoader via the trainer
train_loader = trainer.get_train_dataloader(
    {},
    train_samples,
    True
)

dataset = train_loader.dataset

# Create DataLoader with num_workers > 0, which uses multiprocessing and may trigger the pickling issue
loader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=dataset.collate_fn,
    drop_last=False,
    sampler=None,
    num_workers=2,  # Setting this to 2 will use multiple workers (multiprocessing)
    pin_memory=False,
)

# Create an iterator from the dataloader
data_iter = iter(loader)

# Fetch the first batch; this should trigger the pickling error
try:
    first_batch = next(data_iter)
    pd.DataFrame(list(first_batch.items()), columns=['Key', 'Value'])
except Exception as e:
    print(f"Error: {e}")

Expected behavior

The DataLoader should yield batches without errors, even with num_workers > 0.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3070 Laptop GPU"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.4.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Windows",
        "architecture": [
            "64bit",
            "WindowsPE"
        ],
        "processor": "Intel64 Family 6 Model 165 Stepping 2, GenuineIntel",
        "python": "3.9.19",
        "version": "10.0.22631"
    }
}

Additional context

This issue only occurs when processing Japanese text, because the tokenization step relies on fugashi.Tagger (via cutlet), which cannot be pickled and therefore fails under multiprocessing.
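
A possible workaround (not part of the original report, just a sketch with illustrative names): keep the cutlet/fugashi objects out of the pickled state and re-create them in each worker process, e.g. via __getstate__/__setstate__:

import cutlet

class PicklableJapaneseTokenizer:
    """Hypothetical wrapper that re-creates cutlet/fugashi objects after unpickling."""

    def __init__(self):
        self.katsu = cutlet.Cutlet()

    def __getstate__(self):
        # Drop the Cutlet instance (it holds an unpicklable fugashi.Tagger).
        state = self.__dict__.copy()
        state["katsu"] = None
        return state

    def __setstate__(self, state):
        # Re-create the Cutlet instance inside the worker process.
        self.__dict__.update(state)
        self.katsu = cutlet.Cutlet()

    def romanize(self, text):
        return self.katsu.romaji(text)

Alternatively, setting num_workers=0 sidesteps worker spawning (and thus pickling) entirely, at the cost of single-process data loading.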

@4125412435

Same issue here.
