[BUG]: Inability to control automatic evaluation metrics, makes clustering infeasible for large datasets #3989

kasperjanehag opened this issue May 10, 2024 · 0 comments
Labels
bug Something isn't working

Comments


kasperjanehag commented May 10, 2024

pycaret version checks

Issue Description

When using PyCaret's clustering module with large datasets, particularly those exceeding 1 million rows, users face significant challenges due to the mandatory calculation of certain evaluation metrics, such as the silhouette score. The lack of flexibility in disabling or customizing these metrics during the setup() function is a critical issue. The silhouette score's computational complexity is O(n^2), making its evaluation impractical for large datasets due to the excessive time and computational resources required.

The primary concern is not with the calculation method of the silhouette score but with the compulsory nature of its evaluation. This automatic evaluation can render clustering operations infeasible on large datasets, as users currently do not have the option to omit this expensive metric calculation.
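For context, here is a minimal sketch of the cost in question: scikit-learn's silhouette_score does pairwise-distance work over the full dataset, while its sample_size argument bounds the evaluation by estimating the score on a random subsample (the data and sizes below are arbitrary, purely for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data roughly matching the benchmark shape (rows x 15 features)
X = np.random.rand(200_000, 15)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Full evaluation: pairwise-distance work over all rows (O(n^2) time), impractical at millions of rows
# score_full = silhouette_score(X, labels)

# Bounded evaluation: estimate the score on a random subsample of 10,000 rows
score_sampled = silhouette_score(X, labels, sample_size=10_000, random_state=0)
print(score_sampled)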

Proposed Enhancements

To resolve this, I propose enhancing the setup() function to allow users to specify which evaluation metrics should be calculated, similar to how they can currently choose which algorithms to train. This could be achieved by introducing a parameter such as include_metrics or exclude_metrics, offering flexibility in metric computation. For instance:

from pycaret.clustering import setup
setup(data, include_metrics=['sse', 'dbscan'], exclude_metrics=['silhouette'])

This approach would allow users to tailor the computational load according to their dataset size and the specific requirements of their analysis, thereby enabling efficient clustering of very large datasets.
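For illustration, a rough sketch of the filtering logic such parameters could apply to the experiment's metric container before any scores are computed (the function and argument names below are hypothetical, not existing PyCaret internals):

def filter_metrics(all_metrics, include_metrics=None, exclude_metrics=None):
    # Start from every metric the experiment knows about (id -> metric container)
    selected = dict(all_metrics)

    # Keep only the explicitly requested metrics when a whitelist is given
    if include_metrics is not None:
        selected = {k: v for k, v in selected.items() if k in include_metrics}

    # Then drop anything explicitly excluded
    if exclude_metrics is not None:
        selected = {k: v for k, v in selected.items() if k not in exclude_metrics}

    return selected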

Benefits of the Enhancement

Implementing this feature would:

  • Make PyCaret's clustering module scalable and more adaptable to large-scale data environments.
  • Reduce computational costs and execution times, making it feasible to cluster large datasets.
  • Increase user control over the analytical process, enhancing PyCaret’s utility in diverse applications.

Thank you for considering this enhancement. Please let me know whether you think this is a worthwhile addition and where in the codebase the changes should be implemented; I would be happy to raise the PR.

Reproducible Example

I conducted benchmark tests comparing model training times and memory usage between plain scikit-learn KMeans and PyCaret's setup, specifically focusing on the impact of automatic silhouette score calculations in large datasets. The script includes detailed logging and measurement of memory consumption and processing times for datasets ranging from 1 to 4 million rows.

The findings show that the steep, roughly quadratic increase in computation time and memory usage in the PyCaret runs is driven primarily by the evaluation of the silhouette score, not by the training process itself. The benchmarking script gives insight into how the inclusion of this metric affects the feasibility of clustering operations in large-scale settings.


import logging

# Set the logging level to DEBUG
logging.basicConfig(level=logging.DEBUG)

import time

import mlflow

mlflow.autolog(disable=True)

import pandas as pd
import plotly.express as px
import seaborn as sns
from memory_profiler import memory_usage
from pycaret.clustering import ClusteringExperiment
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

# Could be related to https://github.com/pycaret/pycaret/issues/3371?


def widen_dataset(df, num_columns):
    # Get the number of columns in the original dataset
    original_num_columns = df.shape[1]

    # Calculate the number of times the columns need to be repeated
    repeat_times = num_columns // original_num_columns

    # Repeat the columns and add a suffix to each column name
    df_repeated = pd.concat(
        [df.add_suffix(f"_{i}") for i in range(repeat_times)], axis=1
    )

    # If there are still columns needed, add them from the beginning of the dataframe
    remaining_columns = num_columns % original_num_columns
    if remaining_columns > 0:
        df_remaining = df[df.columns[:remaining_columns]].add_suffix(f"_{repeat_times}")
        df_repeated = pd.concat([df_repeated, df_remaining], axis=1)

    return df_repeated


def lengthen_dataset(df, num_rows):
    # Repeat the dataset until we reach the desired number of rows
    while len(df) < num_rows:
        df = pd.concat([df, df])

    # Trim the dataset to exactly the desired number of rows
    df = df.iloc[:num_rows]

    # Reset the index of the DataFrame
    df = df.reset_index(drop=True)

    return df


def fit_kmeans(df):
    # Initialize a KMeans model with 4 clusters
    kmeans = KMeans(n_clusters=4)

    # Fit the model to the dataset
    kmeans.fit(df)


def plot_memory_consumption(df):
    # Melt the DataFrame to a long format
    df_melted = df.melt(
        id_vars=["Number of Rows", "Library"],
        var_name="Process",
        value_name="Value",
    )

    # Create the plot with Plotly Express
    fig = px.line(
        df_melted,
        x="Number of Rows",
        y="Value",
        color="Library",
        facet_row="Process",
        markers=True,
    )

    # Update facet titles
    fig.update_yaxes(title_text="Processing Time (s)", row=1, col=1)
    fig.update_yaxes(title_text="Memory Consumption (MiB)", row=2, col=1)

    # Set y-axes to scale automatically
    fig.update_yaxes(matches=None)

    # Save the plot as a high-quality PNG file
    fig.write_image("memory_consumption.png", scale=2)


def create_memory_df(row_numbers, model_memory, libraries, processing_times):
    # Create a DataFrame from the memory consumption data
    df = pd.DataFrame(
        {
            "Number of Rows": row_numbers,
            "Memory Consumption": model_memory,
            "Library": libraries,
            "Processing Time": processing_times,
        }
    )
    return df


def label_encode_columns(df, columns):
    # Initialize a label encoder
    le = LabelEncoder()

    # Loop over the list of columns
    for column in columns:
        # Check if the column exists in the DataFrame
        if column in df.columns:
            # Transform the column
            df[column] = le.fit_transform(df[column])

    return df


def get_mock_data(num_rows):
    # Load the iris dataset
    iris = sns.load_dataset("iris")

    # Widen the dataset
    iris_wide = widen_dataset(iris, 15)

    # Lengthen the dataset
    iris_long = lengthen_dataset(iris_wide, num_rows)

    # Label encode the 'species' columns
    iris_long = label_encode_columns(iris_long, ["species_0", "species_1", "species_2"])

    return iris_long

def fit_kmeans_pycaret(df):
    s = ClusteringExperiment()
    s.setup(
        data=df,
        verbose=False,
        normalize=False,
        index=False,
        transformation=False,
        pca=False,
        preprocess=False,
        remove_outliers=False,
        system_log=False,
        log_experiment=False,
        log_plots=False,
        log_profile=False,
        log_data=False,
        memory=False,
        profile=False,
    )

    kmeans = s.create_model(
        "kmeans",
        # verbose=False,
    )
    return kmeans

def collect_memory_data():
    # Initialize lists to store the memory consumption data
    model_memory = []
    row_numbers = []
    libraries = []
    processing_times = []

    # Start at 1,000,000 rows and increase by 100,000 rows each time, up to 4,000,000
    for num_rows in range(1000000, 4000001, 100000):
        for library in ["scikit", "pycaret"]:
            # Get the mock data
            iris_long = get_mock_data(num_rows)

            # Define a function to call fit_kmeans with iris_long
            def fit_model():
                if library == "scikit":
                    return fit_kmeans(iris_long)
                elif library == "pycaret":
                    return fit_kmeans_pycaret(iris_long)

            # Capture peak memory usage and processing time during the fit_model call
            start_time = time.time()
            mem_usage_model = max(memory_usage(proc=fit_model))
            end_time = time.time()

            # Calculate processing time and add it to the list
            processing_time = end_time - start_time
            processing_times.append(processing_time)

            # Add the peak memory usage to the model_memory list
            model_memory.append(mem_usage_model)

            # Add the number of rows to the list
            row_numbers.append(num_rows)

            # Add the library to the list
            libraries.append(library)

            # Log the current step and parameters
            logging.info(
                f"Step completed. Rows: {num_rows}, Library: {library}, Model Memory: {mem_usage_model}, Processing Time: {processing_time}"
            )

    return row_numbers, model_memory, libraries, processing_times


if __name__ == "__main__":
    # Collect the memory consumption and processing time data
    row_numbers, model_memory, libraries, processing_times = collect_memory_data()

    # Build the results DataFrame from the collected measurements
    df = create_memory_df(row_numbers, model_memory, libraries, processing_times)

    # Plot processing time and memory consumption per library
    plot_memory_consumption(df)

Expected Behavior

It should be possible to configure PyCaret to have the same time complexity as standard clustering with scikit-learn KMeans or similar, i.e. not evaluating expensive metrics on the whole training set.
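As a possible stop-gap, and assuming the clustering experiment exposes the same get_metrics()/remove_metric() API as the supervised modules and that create_model only scores the metrics left in that container (I have not verified either assumption), something along these lines might already approximate this behaviour:

from sklearn.datasets import load_iris
from pycaret.clustering import ClusteringExperiment

# Small sample data just to keep the sketch self-contained
data = load_iris(as_frame=True).frame.drop(columns="target")

s = ClusteringExperiment()
s.setup(data=data, preprocess=False, verbose=False, session_id=123)

# Inspect which metrics the experiment would compute during create_model
print(s.get_metrics())

# Drop the expensive metric before training; the metric ID 'silhouette' is an assumption
s.remove_metric("silhouette")

kmeans = s.create_model("kmeans")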

(attached benchmark plot: memory_consumption_10k_300k)

Actual Results

There is no error trace to publish here, since model training simply never finishes for large datasets. Try running the benchmarking script with 1M rows and you will see that the model training appears to be stuck.

Installed Versions

System:
    python: 3.9.19 (main, May 5 2024, 08:18:20) [GCC 12.2.0]
    executable: /root/.pyenv/versions/3.9.19/bin/python
    machine: Linux-6.6.16-linuxkit-aarch64-with-glibc2.36

PyCaret required dependencies:
pip: 24.0
setuptools: 69.5.1
pycaret: 3.3.2
IPython: 8.18.1
ipywidgets: 8.1.2
tqdm: 4.66.4
numpy: 1.23.5
pandas: 1.5.3
jinja2: 3.1.3
scipy: 1.11.4
joblib: 1.3.2
sklearn: 1.4.2
pyod: 1.1.3
imblearn: 0.12.2
category_encoders: 2.6.3
lightgbm: 4.3.0
numba: 0.59.1
requests: 2.31.0
matplotlib: 3.7.4
scikitplot: 0.3.7
yellowbrick: 1.5
plotly: 5.22.0
plotly-resampler: Not installed
kaleido: 0.2.1
schemdraw: 0.15
statsmodels: 0.14.2
sktime: 0.26.0
tbats: 1.1.3
pmdarima: 2.0.4
psutil: 5.9.8
markupsafe: 2.1.5
pickle5: Not installed
cloudpickle: 3.0.0
deprecation: 2.1.0
xxhash: 3.4.1
wurlitzer: 3.1.0

PyCaret optional dependencies:
shap: 0.44.1
interpret: 0.6.1
umap: 0.5.6
ydata_profiling: 4.7.0
explainerdashboard: 0.3.8
autoviz: 0.1.802
fairlearn: 0.7.0
deepchecks: Not installed
xgboost: 1.6.2
catboost: Not installed
kmodes: Not installed
mlxtend: Not installed
statsforecast: Not installed
tune_sklearn: 0.5.0
ray: 2.20.0
hyperopt: 0.2.7
optuna: 3.6.1
skopt: 0.10.1
mlflow: 2.12.1
gradio: 4.26.0
fastapi: 0.111.0
uvicorn: 0.29.0
m2cgen: 0.10.0
evidently: 0.4.16
fugue: Not installed
streamlit: Not installed
prophet: Not installed

@kasperjanehag kasperjanehag added the bug Something isn't working label May 10, 2024
@kasperjanehag kasperjanehag changed the title [BUG]: Not Possible to Disable Automatic Evaluation Metrics [BUG]: Inability to control automatic evaluation metrics, makes clustering infeasible for large datasets May 10, 2024