From the issue I could not tell how much time Modin takes, and it was not clear to me how many rows were in your original data. Could you clarify what 400w means and the time taken by Modin?
I tried creating a synthetic dataset with the script below and ran your benchmark.
```python
import pandas as pd
import random
import string
import numpy as np

# Function to generate a random URL
def generate_random_url():
    letters = string.ascii_lowercase
    domain = ''.join(random.choice(letters) for _ in range(random.randint(5, 10)))
    extension = random.choice(['com', 'net', 'org', 'biz', 'info', 'co'])
    return f"http://www.{domain}.{extension}"

# Function to generate random data for additional columns
def generate_random_data(size):
    return np.random.rand(size)

# Number of URLs to generate
num_urls = 4000000

# Generate random URLs
urls = [generate_random_url() for _ in range(num_urls)]

# Create a DataFrame with a 'URL' column
df = pd.DataFrame(urls, columns=['URL'])

# Add 10 more random columns
for i in range(10):
    col_name = f'Random_{i+1}'
    df[col_name] = generate_random_data(num_urls)

# Add some duplicates
num_duplicates = 3000
duplicate_indices = random.sample(range(num_urls), num_duplicates)
for index in duplicate_indices:
    df.at[index, 'URL'] = df.at[index // 2, 'URL']

# Shuffle the DataFrame to randomize the order
df = df.sample(frac=1).reset_index(drop=True)

# Print the DataFrame info and the first few rows
print(df.info())
print(df.head())

df.to_csv('waybackmachines-www.amazon.com.csv')
```
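The original benchmark operation is not shown in this thread, so for reference here is a minimal timing harness, assuming the measured operation is `drop_duplicates` on the `URL` column (the row count is kept small here for illustration; scale `num_urls` up to reproduce the timings below):

```python
# Hypothetical timing harness -- the benchmarked operation is assumed
# to be drop_duplicates on 'URL'; swap in the real workload as needed.
import time
import random
import string
import numpy as np
import pandas as pd

def generate_random_url():
    letters = string.ascii_lowercase
    domain = ''.join(random.choice(letters) for _ in range(random.randint(5, 10)))
    extension = random.choice(['com', 'net', 'org', 'biz', 'info', 'co'])
    return f"http://www.{domain}.{extension}"

def time_dedup(df):
    # Time a single drop_duplicates pass over the URL column.
    start = time.perf_counter()
    deduped = df.drop_duplicates(subset=['URL'])
    return deduped, time.perf_counter() - start

num_urls = 10_000  # small for illustration; use 4_000_000 to match the benchmark
urls = [generate_random_url() for _ in range(num_urls)]
df = pd.DataFrame({'URL': urls, 'Value': np.random.rand(num_urls)})

deduped, elapsed = time_dedup(df)
print(f"{len(df) - len(deduped)} duplicate rows removed in {elapsed:.3f}s")
```

The same harness can wrap a Modin or Dask dataframe instead of a pandas one to produce a like-for-like comparison.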
I could observe at my end that Modin on Ray is faster when the number of rows (defined by `num_urls` in my script) is 4,000,000, and that when the number of rows is smaller (say 400,000), Dask performs better.
Perf comparison on Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (112 CPUs):

| Number of rows | Modin on Ray | Dask |
| --- | --- | --- |
| 4,000,000 | 18.183s | 29.693s |
| 400,000 | 7.898s | 5.461s |
As Modin is intended to work on large dataframes, it is expected that Modin can perform worse when the data size is too small.
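The crossover in the table is consistent with a fixed-overhead model: a parallel engine pays a roughly constant cost for partitioning and scheduling, which only amortizes once there is enough per-row work. A toy calculation (all constants below are made up for illustration, not measured):

```python
# Toy cost model: parallel frameworks pay a fixed scheduling/partitioning
# overhead, so they only win once the per-row work amortizes it.
# The per_row, overhead, and workers values are illustrative assumptions.
def serial_time(rows, per_row=1e-6):
    return rows * per_row

def parallel_time(rows, per_row=1e-6, overhead=2.0, workers=8):
    return overhead + rows * per_row / workers

for rows in (400_000, 4_000_000):
    s, p = serial_time(rows), parallel_time(rows)
    winner = "parallel" if p < s else "serial"
    print(f"{rows:>9} rows: serial {s:.2f}s, parallel {p:.2f}s -> {winner}")
```

With these toy numbers the parallel path loses at 400,000 rows but wins at 4,000,000, matching the direction of the crossover observed above.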
rows: 400w (i.e., 4,000,000)
Dask total: 50s