When using PyCaret's clustering module with large datasets, particularly those exceeding 1 million rows, users face significant challenges due to the mandatory calculation of certain evaluation metrics, such as the silhouette score. The lack of flexibility in disabling or customizing these metrics during the setup() function is a critical issue. The silhouette score's computational complexity is O(n^2), making its evaluation impractical for large datasets due to the excessive time and computational resources required.
The primary concern is not with the calculation method of the silhouette score but with the compulsory nature of its evaluation. This automatic evaluation can render clustering operations infeasible on large datasets, as users currently do not have the option to omit this expensive metric calculation.
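For comparison, scikit-learn already offers this kind of control for the metric itself: silhouette_score accepts a sample_size argument, so the score can be approximated on a random subsample rather than computed over all pairwise distances. A minimal illustration (with synthetic data standing in for a real workload):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a large dataset
X = np.random.rand(100_000, 15)
labels = KMeans(n_clusters=4).fit_predict(X)

# Computing silhouette over all rows requires O(n^2) pairwise distances;
# subsampling caps the cost at the chosen sample size.
score = silhouette_score(X, labels, sample_size=10_000, random_state=42)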
Proposed Enhancements
To resolve this, I propose enhancing the setup() function to allow users to specify which evaluation metrics should be calculated, similar to how they can currently choose which algorithms to train. This could be achieved by introducing a parameter such as include_metrics or exclude_metrics, offering flexibility in metric computation. For instance:
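A minimal sketch of the proposed API (both exclude_metrics and include_metrics are hypothetical parameters that do not exist in PyCaret today, and the metric identifiers are illustrative):

from pycaret.clustering import ClusteringExperiment

s = ClusteringExperiment()

# Hypothetical: compute everything except the O(n^2) silhouette score
s.setup(
    data=df,  # a large DataFrame, e.g. from get_mock_data() in the script below
    exclude_metrics=["silhouette"],  # proposed parameter, does not exist yet
)

# Hypothetical alternative: opt in to an explicit set of cheap metrics
# s.setup(data=df, include_metrics=["calinski_harabasz", "davies_bouldin"])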
This approach would allow users to tailor the computational load according to their dataset size and the specific requirements of their analysis, thereby enabling efficient clustering of very large datasets.
Benefits of the Enhancement
Implementing this feature would:
Make PyCaret's clustering module scalable and more adaptable to large-scale data environments.
Reduce computational costs and execution times, making it feasible to cluster large datasets.
Increase user control over the analytical process, enhancing PyCaret’s utility in diverse applications.
Thank you for considering this enhancement. Please let me know whether you think this is a worthwhile addition, and advise where the changes should be implemented. I'm happy to raise the PR.
Reproducible Example
I conducted benchmark tests comparing model training times and memory usage between plain scikit-learn KMeans and PyCaret's setup, specifically focusing on the impact of automatic silhouette score calculations in large datasets. The script includes detailed logging and measurement of memory consumption and processing times for datasets ranging from 1 to 4 million rows.
The findings clearly show that the steep increase in computation time and memory usage in the PyCaret setups is primarily due to the evaluation of the silhouette score, not the training process itself. The benchmarking script shows how the inclusion of this metric significantly impacts the feasibility of clustering operations in large-scale settings.
import logging

# Set the logging level to DEBUG
logging.basicConfig(level=logging.DEBUG)

import time

import mlflow

mlflow.autolog(disable=True)

import pandas as pd
import plotly.express as px
import seaborn as sns
from memory_profiler import memory_usage
from pycaret.clustering import ClusteringExperiment
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

# Could be related to https://github.com/pycaret/pycaret/issues/3371?


def widen_dataset(df, num_columns):
    # Get the number of columns in the original dataset
    original_num_columns = df.shape[1]
    # Calculate the number of times the columns need to be repeated
    repeat_times = num_columns // original_num_columns
    # Repeat the columns and add a suffix to each column name
    df_repeated = pd.concat(
        [df.add_suffix(f"_{i}") for i in range(repeat_times)], axis=1
    )
    # If there are still columns needed, add them from the beginning of the dataframe
    remaining_columns = num_columns % original_num_columns
    if remaining_columns > 0:
        df_remaining = df[df.columns[:remaining_columns]].add_suffix(
            f"_{repeat_times}"
        )
        df_repeated = pd.concat([df_repeated, df_remaining], axis=1)
    return df_repeated


def lengthen_dataset(df, num_rows):
    # Repeat the dataset until we reach the desired number of rows
    while len(df) < num_rows:
        df = pd.concat([df, df])
    # Trim the dataset to exactly the desired number of rows
    df = df.iloc[:num_rows]
    # Reset the index of the DataFrame
    df = df.reset_index(drop=True)
    return df


def fit_kmeans(df):
    # Initialize a KMeans model with 4 clusters
    kmeans = KMeans(n_clusters=4)
    # Fit the model to the dataset
    kmeans.fit(df)


def plot_memory_consumption(df):
    # Melt the DataFrame to a long format
    df_melted = df.melt(
        id_vars=["Number of Rows", "Library"],
        var_name="Process",
        value_name="Value",
    )
    # Create the plot with Plotly Express, one facet row per measured quantity
    fig = px.line(
        df_melted,
        x="Number of Rows",
        y="Value",
        color="Library",
        facet_row="Process",
        markers=True,
    )
    # Update facet titles
    fig.update_yaxes(title_text="Processing Time (s)", row=1, col=1)
    fig.update_yaxes(title_text="Memory Consumption (MiB)", row=2, col=1)
    # Set y-axes to scale automatically
    fig.update_yaxes(matches=None)
    # Save the plot as a high-quality PNG file
    fig.write_image("memory_consumption.png", scale=2)


def create_memory_df(row_numbers, model_memory, libraries, processing_times):
    # Create a DataFrame from the collected benchmark data
    df = pd.DataFrame(
        {
            "Number of Rows": row_numbers,
            "Memory Consumption": model_memory,
            "Library": libraries,
            "Processing Time": processing_times,
        }
    )
    return df


def label_encode_columns(df, columns):
    # Initialize a label encoder
    le = LabelEncoder()
    # Loop over the list of columns
    for column in columns:
        # Check if the column exists in the DataFrame
        if column in df.columns:
            # Transform the column
            df[column] = le.fit_transform(df[column])
    return df


def get_mock_data(num_rows):
    # Load the iris dataset
    iris = sns.load_dataset("iris")
    # Widen the dataset to 15 columns
    iris_wide = widen_dataset(iris, 15)
    # Lengthen the dataset to the requested number of rows
    iris_long = lengthen_dataset(iris_wide, num_rows)
    # Label encode the 'species' columns
    iris_long = label_encode_columns(
        iris_long, ["species_0", "species_1", "species_2"]
    )
    return iris_long


def fit_kmeans_pycaret(df):
    s = ClusteringExperiment()
    s.setup(
        data=df,
        verbose=False,
        normalize=False,
        index=False,
        transformation=False,
        pca=False,
        preprocess=False,
        remove_outliers=False,
        system_log=False,
        log_experiment=False,
        log_plots=False,
        log_profile=False,
        log_data=False,
        memory=False,
        profile=False,
    )
    kmeans = s.create_model(
        "kmeans",
        # verbose=False,
    )
    return kmeans


def collect_memory_data():
    # Initialize lists to store the benchmark data
    model_memory = []
    row_numbers = []
    libraries = []
    processing_times = []
    # Start at 1,000,000 rows and increase by 100,000 each time, up to 4,000,000
    for num_rows in range(1000000, 4000001, 100000):
        for library in ["scikit", "pycaret"]:
            # Get the mock data
            iris_long = get_mock_data(num_rows)

            # Define a function that fits the model with the selected library
            def fit_model():
                if library == "scikit":
                    return fit_kmeans(iris_long)
                elif library == "pycaret":
                    return fit_kmeans_pycaret(iris_long)

            # Capture peak memory usage and processing time during the fit_model call
            start_time = time.time()
            mem_usage_model = max(memory_usage(proc=fit_model))
            end_time = time.time()
            # Calculate processing time and add it to the list
            processing_time = end_time - start_time
            processing_times.append(processing_time)
            # Add the peak memory usage to the model_memory list
            model_memory.append(mem_usage_model)
            # Add the number of rows to the list
            row_numbers.append(num_rows)
            # Add the library to the list
            libraries.append(library)
            # Log the current step and parameters
            logging.info(
                f"Step completed. Rows: {num_rows}, Library: {library}, "
                f"Model Memory: {mem_usage_model}, Processing Time: {processing_time}"
            )
    return row_numbers, model_memory, libraries, processing_times


if __name__ == "__main__":
    # Collect the memory consumption and timing data
    row_numbers, model_memory, libraries, processing_times = collect_memory_data()
    # Create the results DataFrame
    df = create_memory_df(row_numbers, model_memory, libraries, processing_times)
    # Plot and save the results
    plot_memory_consumption(df)
Expected Behavior
It should be possible to configure PyCaret to have the same time complexity as standard clustering with scikit-learn KMeans or similar, i.e. not evaluating expensive metrics on the whole training set.
Actual Results
There is no error trace to publish here, since model training simply never finishes for large datasets. Try running the benchmarking script with 1M rows and you will see how the model training appears to be stuck.
pycaret version checks
I have checked that this issue has not already been reported here.
I have confirmed this bug exists on the latest version of pycaret.
I have confirmed this bug exists on the master branch of pycaret (pip install -U git+https://github.com/pycaret/pycaret.git@master).
Installed Versions
PyCaret required dependencies:
pip: 24.0
setuptools: 69.5.1
pycaret: 3.3.2
IPython: 8.18.1
ipywidgets: 8.1.2
tqdm: 4.66.4
numpy: 1.23.5
pandas: 1.5.3
jinja2: 3.1.3
scipy: 1.11.4
joblib: 1.3.2
sklearn: 1.4.2
pyod: 1.1.3
imblearn: 0.12.2
category_encoders: 2.6.3
lightgbm: 4.3.0
numba: 0.59.1
requests: 2.31.0
matplotlib: 3.7.4
scikitplot: 0.3.7
yellowbrick: 1.5
plotly: 5.22.0
plotly-resampler: Not installed
kaleido: 0.2.1
schemdraw: 0.15
statsmodels: 0.14.2
sktime: 0.26.0
tbats: 1.1.3
pmdarima: 2.0.4
psutil: 5.9.8
markupsafe: 2.1.5
pickle5: Not installed
cloudpickle: 3.0.0
deprecation: 2.1.0
xxhash: 3.4.1
wurlitzer: 3.1.0
PyCaret optional dependencies:
shap: 0.44.1
interpret: 0.6.1
umap: 0.5.6
ydata_profiling: 4.7.0
explainerdashboard: 0.3.8
autoviz: 0.1.802
fairlearn: 0.7.0
deepchecks: Not installed
xgboost: 1.6.2
catboost: Not installed
kmodes: Not installed
mlxtend: Not installed
statsforecast: Not installed
tune_sklearn: 0.5.0
ray: 2.20.0
hyperopt: 0.2.7
optuna: 3.6.1
skopt: 0.10.1
mlflow: 2.12.1
gradio: 4.26.0
fastapi: 0.111.0
uvicorn: 0.29.0
m2cgen: 0.10.0
evidently: 0.4.16
fugue: Not installed
streamlit: Not installed
prophet: Not installed