Add version field to data saver/ loader metadata #751

Open
skrawcz opened this issue Mar 8, 2024 · 2 comments
Labels
enhancement New feature or request i/o

Comments

@skrawcz
Collaborator

skrawcz commented Mar 8, 2024

Is your feature request related to a problem? Please describe.
The data savers and loaders return metadata. We should version this schema so that we can iterate on it and make changes; people would then know which schema version they are writing code against.

Describe the solution you'd like
Add a __version__ field (or something similar) to the metadata. It should follow semantic versioning: major = breaking change, minor = new addition, patch = bug fix.
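
To make the idea concrete, here is a minimal sketch of what a versioned metadata payload and a consumer-side compatibility check could look like. The names METADATA_SCHEMA_VERSION, with_version, parse_version, and is_compatible are hypothetical and do not exist in the library; the version string "1.0.0" is also just a placeholder.

```python
from typing import Any, Dict, Tuple

# Hypothetical schema version for the metadata payload (not an existing constant).
METADATA_SCHEMA_VERSION = "1.0.0"


def with_version(
    metadata: Dict[str, Any], version: str = METADATA_SCHEMA_VERSION
) -> Dict[str, Any]:
    """Returns a copy of the metadata with a __version__ field added."""
    return {**metadata, "__version__": version}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Splits a 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(metadata: Dict[str, Any], supported_major: int) -> bool:
    """True when the payload's major version matches the one the caller targets."""
    major, _, _ = parse_version(metadata["__version__"])
    return major == supported_major


versioned = with_version({"rows": 3, "columns": 2})
```

With something like this, downstream code could guard on the major version only, since minor and patch bumps would not remove or rename fields.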

Describe alternatives you've considered
N/A

Additional context
N/A

@skrawcz skrawcz added enhancement New feature or request good first issue Good for newcomers i/o labels Mar 8, 2024
@roothsmz

roothsmz commented Mar 8, 2024

@skrawcz,

One thought is to accept an optional Callable that overrides the default metadata.

from typing import Any, Callable, Dict, Optional

import pandas as pd


def get_dataframe_metadata(
    df: pd.DataFrame,
    metadata_func: Optional[Callable[[pd.DataFrame], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Gives metadata from loading a dataframe.

    Note: we reserve the right to change this schema. So if you're using this come
    chat so that we can make sure we don't break your code.

    This will default to include:
    - the number of rows
    - the number of columns
    - the column names
    - the data types

    If you provide an override, then your DataFrame will be passed into that
    Callable and the resulting dictionary will hold your custom metadata in
    place of the defaults.
    """
    # DATAFRAME_METADATA is the key constant the current implementation already uses.
    if metadata_func is not None:
        # Caller-supplied override: its result is returned instead of the defaults.
        return {DATAFRAME_METADATA: metadata_func(df)}
    return {
        DATAFRAME_METADATA: {
            "rows": len(df),
            "columns": len(df.columns),
            "column_names": list(df.columns),
            "datatypes": [str(t) for t in df.dtypes],  # for serialization purposes
        }
    }

Source code as currently implemented:

def get_dataframe_metadata(df: pd.DataFrame) -> Dict[str, Any]:
    """Gives metadata from loading a dataframe.

    Note: we reserve the right to change this schema. So if you're using this come
    chat so that we can make sure we don't break your code.

    This includes:
    - the number of rows
    - the number of columns
    - the column names
    - the data types
    """
    return {
        DATAFRAME_METADATA: {
            "rows": len(df),
            "columns": len(df.columns),
            "column_names": list(df.columns),
            "datatypes": [str(t) for t in list(df.dtypes)],  # for serialization purposes
        }
    }
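
As a rough illustration of how the proposed override might be used, the sketch below repeats a compact version of the proposed function so that it runs on its own, then passes in a custom callable. The null_counts function and the sample DataFrame are hypothetical, and the string assigned to DATAFRAME_METADATA is an assumed placeholder for the real key constant.

```python
from typing import Any, Callable, Dict, Optional

import pandas as pd

# Assumed placeholder value; the real constant lives next to the function.
DATAFRAME_METADATA = "dataframe_metadata"


def get_dataframe_metadata(
    df: pd.DataFrame,
    metadata_func: Optional[Callable[[pd.DataFrame], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Compact copy of the proposed function, repeated so the sketch is runnable."""
    if metadata_func is not None:
        return {DATAFRAME_METADATA: metadata_func(df)}
    return {
        DATAFRAME_METADATA: {
            "rows": len(df),
            "columns": len(df.columns),
            "column_names": list(df.columns),
            "datatypes": [str(t) for t in df.dtypes],
        }
    }


def null_counts(df: pd.DataFrame) -> Dict[str, Any]:
    """Hypothetical custom metadata: row count plus the number of nulls per column."""
    return {
        "rows": len(df),
        "null_counts": {str(col): int(n) for col, n in df.isna().sum().items()},
    }


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
default_meta = get_dataframe_metadata(df)
custom_meta = get_dataframe_metadata(df, metadata_func=null_counts)
```

Note that in this form the override replaces the default fields entirely, so a consumer can no longer count on keys such as "columns" being present. That is one reason a version field, or some marker that a custom schema is in use, would matter here.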

@skrawcz
Collaborator Author

skrawcz commented Mar 8, 2024

Hmm, this makes me wonder: do we need schemas per type? Or can we have one general schema for all dataframes, where some fields might not be easily populated?
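
One way to picture the "general schema with optional fields" option is a single record type where anything a backend cannot cheaply compute stays unset. This is only a sketch: the DataFrameMetadata class, its version field, and the lazy dataframe scenario are hypothetical and not part of the library.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DataFrameMetadata:
    """Hypothetical general schema; fields that are hard to populate stay None."""

    rows: Optional[int] = None
    columns: Optional[int] = None
    column_names: Optional[List[str]] = None
    datatypes: Optional[List[str]] = None
    version: str = "1.0.0"  # placeholder schema version

    def to_dict(self) -> Dict[str, Any]:
        """Serializes the record to a plain dictionary."""
        return asdict(self)


# For example, a lazily evaluated dataframe might know its columns
# without knowing its row count up front.
lazy_meta = DataFrameMetadata(columns=2, column_names=["a", "b"])
```

The trade-off is that consumers have to handle None for every field, whereas per-type schemas could guarantee more but would each need their own version.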

@skrawcz skrawcz removed the good first issue Good for newcomers label Mar 8, 2024