Error in sampling the data #55

Open
parvaneh-soleimany opened this issue Jul 21, 2024 · 3 comments

Comments

@parvaneh-soleimany

Hey. I am getting an error when sampling data.
The error is: "Breaking the generation loop!"
When I checked the code, I found that after the following line df_gen becomes empty, apparently because the model generated only placeholders and not the desired data:
df_gen = df_gen[~(df_gen == "placeholder").any(axis=1)]
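
To illustrate what this filter does, here is a minimal sketch on made-up data. The column names, values, and the helper `drop_placeholder_rows` are hypothetical and only wrap the line above; they are not taken from the library. A row is dropped as soon as any of its columns contains "placeholder", so if every generated row has at least one placeholder, nothing is left.

```python
import pandas as pd


def drop_placeholder_rows(df_gen: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows where no column equals the string "placeholder"."""
    # (df_gen == "placeholder") is a boolean frame;
    # .any(axis=1) flags every row with at least one match.
    return df_gen[~(df_gen == "placeholder").any(axis=1)]


# Hypothetical batch where only some rows are incomplete.
partial = pd.DataFrame(
    {"age": ["34", "placeholder"], "note": ["ok", "fine"]}
)

# Hypothetical batch where every row has at least one placeholder.
all_bad = pd.DataFrame(
    {"age": ["34", "51"], "note": ["placeholder", "placeholder"]}
)

kept_partial = drop_placeholder_rows(partial)
kept_all_bad = drop_placeholder_rows(all_bad)

print(len(kept_partial))   # number of rows that survive the filter
print(kept_all_bad.empty)  # whether the filtered frame is empty
```

In the second case the filtered frame is empty even though some columns were generated correctly, which matches the behaviour described here.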

I have been working on this for more than a week but couldn't resolve it.
Can you please help me? I really need this code to work.
My data is tabular text data; each column contains numbers or pieces of text (sometimes long texts).

Thank you in advance.

@parvaneh-soleimany
Author

Does no one else have the same issue?

@juhoUnibw

I got the same error. The only pattern I could identify so far is that it only happens with datasets having >=20 features.

@unnir
Collaborator

unnir commented Nov 15, 2024

> I got the same error. The only pattern I could identify so far is that it only happens with datasets having >=20 features.

This is correct. With many features, the feature names can exceed the context window of the LLM, so you won't be able to generate synthetic samples.
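
A rough way to check for this is to estimate how long one encoded row is compared with the model's context window. The sketch below is only an illustration under stated assumptions: it assumes rows are serialized as "name is value" pairs joined by commas, it uses a whitespace split as a crude stand-in for the real tokenizer (which will usually produce more tokens), and `context_window=1024` is an assumed limit. The helper names are hypothetical and not part of be_great.

```python
def encode_row(row: dict) -> str:
    """Serialize one row as text (assumed format: 'name is value, ...')."""
    return ", ".join(f"{name} is {value}" for name, value in row.items())


def rough_token_count(text: str) -> int:
    """Crude token estimate: whitespace-separated pieces.

    A real subword tokenizer will usually return a larger number.
    """
    return len(text.split())


def fits_context(row: dict, context_window: int = 1024) -> bool:
    """Return True if the rough token estimate fits the assumed window."""
    return rough_token_count(encode_row(row)) <= context_window


# Hypothetical narrow row: five short features.
narrow_row = {f"f{i}": i for i in range(5)}

# Hypothetical wide row: 40 features with long names and long text values.
wide_row = {
    f"a very long descriptive feature name number {i}": "some long free text value " * 20
    for i in range(40)
}

print(rough_token_count(encode_row(narrow_row)), fits_context(narrow_row))
print(rough_token_count(encode_row(wide_row)), fits_context(wide_row))
```

If the estimate is already close to the limit with this crude count, the real tokenized row is very likely too long, and reducing the number of columns or shortening the feature names is worth trying.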

Another solution is to use a newer (and better) model, for example:

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load an example tabular dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Fine-tune a newer LLM backbone on the table
model = GReaT(llm='unsloth/Llama-3.2-1B', batch_size=32, epochs=1, fp16=True, report_to="none")
model.fit(data)

# Draw 100 synthetic rows
synthetic_data = model.sample(n_samples=100)
