Databricks profiling report while using ydata-profiling #1605
Comments
I'm also seeing the behavior described above in Databricks, using numpy 1.23.5 and ydata_profiling 4.5.1. I'm on a Personal Compute cluster with the 15.2 ML Runtime, 28 GB of memory, and 8 active cores at 1.5 DBU/h.
For thoroughness, I also ran a few tests on Azure Synapse Analytics (ASA), without Databricks. If I run this code in ASA:
from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
Row(c1='Ali',c2='Brown'),
Row(c1='John',c2='Brown'),
Row(c1='Sara',c2='Brown')
])
p2 = ProfileReport(df1)
p2
I get the error:
But if I simply add a numeric column at the end (per the suggestion from the anomaly author):
from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
Row(c1='Ali',c2='Brown',c3=1),
Row(c1='John',c2='Brown',c3=2),
Row(c1='Sara',c2='Brown',c3=3)
])
p2 = ProfileReport(df1)
p2
I talked to the author of this anomaly report and understood her to say that ProfileReport will probably fail when all of the columns passed to spark.createDataFrame are strings. This behavior seems to occur in both Azure Databricks and ASA Spark.
Spark Dependencies
Spark Pool Settings:
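The all-string hypothesis above can be checked before profiling. The following is only a minimal sketch: the helper name `is_all_string_schema` is hypothetical, and it assumes the input is the list of `(column_name, type_string)` tuples that PySpark exposes as `DataFrame.dtypes`, where string columns are reported as `"string"`.

```python
def is_all_string_schema(dtypes):
    """Return True when every column in a Spark-style dtypes list is a string.

    `dtypes` is assumed to look like df.dtypes in PySpark,
    e.g. [("c1", "string"), ("c2", "string")].
    An empty schema returns False.
    """
    return bool(dtypes) and all(type_name == "string" for _, type_name in dtypes)


# Schemas matching the two test DataFrames in this comment (hypothetical usage):
only_strings = [("c1", "string"), ("c2", "string")]
with_numeric = [("c1", "string"), ("c2", "string"), ("c3", "bigint")]

print(is_all_string_schema(only_strings))
print(is_all_string_schema(with_numeric))
```

In a notebook this would be called as `is_all_string_schema(df1.dtypes)` to flag DataFrames that match the suspected failure case.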
Hi @Fgoudarzi, thank you for your request. Have you tried to generate the report while following this tutorial? https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
Current Behaviour
I'm creating a very simple Spark DataFrame with only one column. ProfileReport does not generate the report when I run it in a Databricks notebook.
Below is the code that I'm using:
But if I convert the DataFrame to pandas, the report is generated:
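The original snippet is not preserved in this thread. As an illustration only, a conversion-based workaround of this kind might look like the sketch below. The helper names `to_pandas_if_spark` and `profile_via_pandas` are hypothetical; the sketch assumes the Spark DataFrame is small enough for `toPandas()` to collect it onto the driver.

```python
def to_pandas_if_spark(df):
    """Return df converted with toPandas() if it offers that method, else df unchanged."""
    to_pandas = getattr(df, "toPandas", None)
    return to_pandas() if callable(to_pandas) else df


def profile_via_pandas(df, title="Profiling report"):
    """Build a ProfileReport from a pandas copy of df (hypothetical helper)."""
    # Imported lazily so to_pandas_if_spark can be used without ydata_profiling installed.
    from ydata_profiling import ProfileReport

    return ProfileReport(to_pandas_if_spark(df), title=title)
```

In a Databricks notebook, `profile_via_pandas(spark_df)` would then stand in for calling `ProfileReport(spark_df)` directly. This sidesteps the Spark code path rather than fixing it, so it is only suitable for data that fits in driver memory.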
Expected Behaviour
Generate the report as it does when I convert the Spark DataFrame to pandas.
Data Description
Generated in the code.
Code that reproduces the bug
pandas-profiling version
ydata_profiling = 4.8.3
Dependencies
OS
Windows 11 Enterprise
Checklist