[FEATURE REQUEST]: Deprecate and/or evict Microsoft.Data.Analysis from the Microsoft.Spark assembly #1171

Open
dbeavon opened this issue Apr 17, 2024 · 1 comment
Labels
enhancement New feature or request

dbeavon commented Apr 17, 2024

Is your feature request related to a problem? Please describe.

I'd like to deprecate Microsoft.Data.Analysis in this project, or at least move it out of Microsoft.Spark into a distinct assembly that must be referenced separately by the .NET driver projects.

It can remain in the .NET worker (Microsoft.Spark.Worker) for now.

Describe the solution you'd like

I'm tired of writing the following:

using FxDataFrame = Microsoft.Data.Analysis.DataFrame;

This happens too frequently and it is nuts. This nonsense should not be necessary, especially not for a mission-critical class like DataFrame. The DataFrame class is perhaps the most fundamental component of Spark. The name should not be overloaded to mean two different things within the same Microsoft.Spark assembly.
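For anyone who has not run into the clash, here is a minimal sketch of it. The `Demo.Analysis` and `Demo.SparkSql` namespaces are hypothetical stand-ins for Microsoft.Data.Analysis and Microsoft.Spark.Sql, so the snippet compiles without either package; only the shape of the alias line mirrors the one above.

```csharp
using System;
using Demo.SparkSql;                          // stand-in for Microsoft.Spark.Sql
using FxDataFrame = Demo.Analysis.DataFrame;  // same shape as the alias in this issue

namespace Demo.Analysis
{
    // Hypothetical stand-in for the in-memory DataFrame.
    public class DataFrame { public string Kind => "in-memory"; }
}

namespace Demo.SparkSql
{
    // Hypothetical stand-in for the Spark DataFrame.
    public class DataFrame { public string Kind => "distributed"; }
}

public static class Program
{
    // The bare name resolves to the Spark stand-in because only that
    // namespace is imported. Adding `using Demo.Analysis;` and then writing
    // a bare `DataFrame` would be rejected as ambiguous (CS0104).
    public static string Describe(DataFrame df) => "Spark: " + df.Kind;

    // The other type is only reachable through the alias (or its full name).
    public static string Describe(FxDataFrame df) => "Fx: " + df.Kind;

    public static void Main()
    {
        Console.WriteLine(Describe(new DataFrame()));
        Console.WriteLine(Describe(new FxDataFrame()));
    }
}
```

Every file that touches both types needs the alias line (or fully qualified names), which is the repetition being complained about here.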

It is somewhat unconventional, but marking all the related UDFs as deprecated would still allow that version of DataFrame to be used while cautioning users to stay away.
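A rough sketch of what that marking could look like, using the standard `System.ObsoleteAttribute`. The `LegacyVectorUdfs.Wrap` helper and the message text are hypothetical and only illustrate the mechanism; they are not the actual Microsoft.Spark API surface.

```csharp
using System;

// Hypothetical stand-in for the UDF helpers that depend on
// Microsoft.Data.Analysis.DataFrame.
public static class LegacyVectorUdfs
{
    [Obsolete("Depends on Microsoft.Data.Analysis.DataFrame; prefer Microsoft.Spark.Sql.DataFrame.")]
    public static Func<int, int> Wrap(Func<int, int> f) => f;
}

public static class Program
{
    public static void Main()
    {
        // Still compiles and runs; the compiler reports warning CS0618
        // at this call site, carrying the message above.
        Func<int, int> addOne = LegacyVectorUdfs.Wrap(x => x + 1);
        Console.WriteLine(addOne(41));
    }
}
```

Existing callers keep working, but each use shows up as a warning in the build output, which is the "caution" effect described above.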

To go a step further, we could move the related UDF code to a distinct assembly (Microsoft.Spark.Miscellaneous.DataAnalysis). Developers would reference this miscellaneous assembly from their Spark projects only if they really needed to use the "other DataFrame".

Ultimately, this change would avoid confusing new Spark engineers, who are often unable to determine which version of "DataFrame" is the "right" one. That type of confusion is unnecessary. Unfortunately, it is encountered all too quickly because the two types of "DataFrame" supported by Microsoft.Spark sit so close together.

I'm willing to grant that the DataFrames live in different namespaces, and that helps reduce confusion, of course. However, a new Spark engineer will find the two classes as soon as they download the GitHub project for the first time. They are likely to believe that the non-Spark version of DataFrame is more "advanced", more "native" to .NET, or "better" in some way (else why would it be in the Spark project to begin with?). These assumptions are all wrong. I have regretted using the "Microsoft.Data.Analysis" version of DataFrame every time I took that path.

Final Note
I think Microsoft.Data.Analysis must remain part of the .NET worker for convenience (and to avoid breaking legacy projects).

@dbeavon dbeavon added the enhancement New feature or request label Apr 17, 2024

dbeavon commented Apr 18, 2024

@luisquintanilla Are you ok with this?

I think that having another library (Microsoft.Spark.Miscellaneous.DataAnalysis) is a good compromise and avoids losing the investments that were made on behalf of the ML community. It would only need to be referenced in the driver/runner project. Microsoft.Data.Analysis would remain in the .NET worker next to the executors.
