Why so slow compared to Dask? #7341

Open
wanghaisheng opened this issue Jul 13, 2024 · 1 comment
Labels
question ❓ Questions about Modin Triage 🩹 Issues that need triage

Comments

@wanghaisheng

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
def split_url(url):
    parts = url.split()
    return parts[1] if len(parts) > 1 else url

def deduplicate_csv(input_file, output_file):
    # Read the CSV file
    df = dd.read_csv(input_file, usecols=['url'])
    df['dedup_key'] = df['url'].apply(split_url, meta=('dedup_key', 'object'))
    
    # Remove duplicates based on the 'url' column
    df_deduplicated = df.drop_duplicates(subset='dedup_key')
    df_deduplicated = df_deduplicated.drop(columns=['dedup_key'])
    
    # Compute and write to a new CSV file
    with ProgressBar():
        df_deduplicated.to_csv(output_file, single_file=True, index=False)

    # Get some stats
    total_rows = len(df)
    unique_rows = len(df_deduplicated)
    print(f"Total rows: {total_rows:,}")
    print(f"Unique URLs: {unique_rows:,}")
    print(f"Removed {total_rows - unique_rows:,} duplicate rows")

if __name__ == "__main__":
    input_file = 'waybackmachines-www.amazon.com.csv'
    output_file = input_file.replace('.csv', '-1.csv')
    deduplicate_csv(input_file, output_file)

rows: 400w
dask total: 50s

import ray
import modin.pandas as pd

# Initialize Ray (used by Modin's Ray engine)
ray.init()

def split_url(url):
    parts = url.split()
    return parts[1] if len(parts) > 1 else url

def deduplicate_csv(input_file, output_file):
    print("Reading CSV file...")
    df = pd.read_csv(input_file, usecols=['url'])
    
    print("Splitting URLs and creating deduplication key...")
    df['dedup_key'] = df['url'].apply(split_url)
    
    print("Removing duplicates...")
    df_deduplicated = df.drop_duplicates(subset='dedup_key')
    
    # Remove the 'dedup_key' column before saving
    df_deduplicated = df_deduplicated.drop(columns=['dedup_key'])
    
    print("Saving deduplicated data...")
    df_deduplicated.to_csv(output_file, index=False)

    # Get some stats
    total_rows = len(df)
    unique_rows = len(df_deduplicated)
    print(f"Total rows: {total_rows:,}")
    print(f"Unique URLs (based on second part): {unique_rows:,}")
    print(f"Removed {total_rows - unique_rows:,} duplicate rows")

if __name__ == "__main__":
    input_file = 'waybackmachines-www.amazon.com.csv'
    output_file = input_file.replace('.csv', '-modin.csv')
    deduplicate_csv(input_file, output_file)
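
For a closer apples-to-apples comparison, the same Modin script can also be run on Modin's Dask engine by selecting the engine through Modin's config API before the first DataFrame operation. A minimal sketch, assuming Modin's documented modin.config interface:

import modin.config as cfg

# Select the execution engine before any Modin operation is issued.
# "ray" is the default when Ray is installed; "dask" runs Modin on Dask instead.
cfg.Engine.put("dask")

import modin.pandas as pd  # subsequent Modin calls run on the selected engine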
wanghaisheng added the question ❓ Questions about Modin and Triage 🩹 Issues that need triage labels on Jul 13, 2024
@arunjose696 (Collaborator) commented Jul 15, 2024

Hi @wanghaisheng,

From the issue I could not tell how much time Modin takes, and it was not quite clear to me how many rows your original data has. Could you clarify what 400w means and how long Modin took?

I tried creating a synthetic dataset with the script below and ran your benchmark on it.

import pandas as pd
import random
import string
import numpy as np

# Function to generate a random URL
def generate_random_url():
    letters = string.ascii_lowercase
    domain = ''.join(random.choice(letters) for i in range(random.randint(5, 10)))
    extension = random.choice(['com', 'net', 'org', 'biz', 'info', 'co'])
    return f"http://www.{domain}.{extension}"

# Function to generate random data for additional columns
def generate_random_data(size):
    return np.random.rand(size)

# Number of URLs to generate
num_urls = 4000000

# Generate random URLs
urls = [generate_random_url() for _ in range(num_urls)]

# Create a DataFrame with 'URL' column
df = pd.DataFrame(urls, columns=['URL'])

# Adding 10 more random columns
for i in range(10):
    col_name = f'Random_{i+1}'
    df[col_name] = generate_random_data(num_urls)

# Adding some duplicates
num_duplicates = 3000
duplicate_indices = random.sample(range(num_urls), num_duplicates)

for index in duplicate_indices:
    df.at[index, 'URL'] = df.at[index // 2, 'URL']

# Shuffle the DataFrame to randomize the order
df = df.sample(frac=1).reset_index(drop=True)

# Print the DataFrame info
print(df.info())

# Print the first few rows of the DataFrame
print(df.head())

df.to_csv('waybackmachines-www.amazon.com.csv')

At my end I could observe that Modin on Ray is faster when the number of rows (defined by num_urls in my script) is 4,000,000, while for a smaller row count (say 400,000) Dask performs better.

Perf comparison on an Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (112 CPUs):

Number of rows | Modin on Ray | Dask
4,000,000      | 18.183 s     | 29.693 s
400,000        | 7.898 s      | 5.461 s

As Modin is intended to work on large dataframes, I would say it can happen that Modin performs worse when the data size is too small.
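
For reference, a minimal timing sketch for reproducing the comparison, assuming the CSV generated by the script above and that the two deduplicate_csv versions have been saved as dedup_dask.py and dedup_modin.py (hypothetical file names):

import subprocess
import sys
import time

# Hypothetical script names; each one deduplicates
# waybackmachines-www.amazon.com.csv as shown earlier in this issue.
for script in ("dedup_dask.py", "dedup_modin.py"):
    start = time.perf_counter()
    subprocess.run([sys.executable, script], check=True)
    print(f"{script}: {time.perf_counter() - start:.3f}s")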
