Function to just read variable names/metadata #167

Open
leeper opened this issue Jan 3, 2018 · 8 comments

@leeper
Contributor

leeper commented Jan 3, 2018

It might be useful to be able to just read metadata without loading an entire file. I don't think there's a way to do this consistently across file types, though.

@jsonbecker
Collaborator

Would a good way to handle that be to identify all methods that have a "limit"-style parameter for grabbing a certain number or percentage of lines, and then do a transform that returns a data.frame with the column index, column name, and column type of the resulting data.frame?
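
A minimal sketch of that idea for plain-text input, using only base R. The helper name `peek_columns()` and its `n` argument are hypothetical and not part of the package; the column types are guessed from the sampled rows only, so they may differ from what a full read would report.

```r
# Hypothetical helper (not part of the package): read only the first n rows
# of a CSV file and summarise the columns that come back.
peek_columns <- function(file, n = 10L, ...) {
  # nrows caps how many data rows are parsed; types are guessed from them.
  head_rows <- utils::read.csv(file, nrows = n, stringsAsFactors = FALSE, ...)
  data.frame(
    index = seq_along(head_rows),
    name  = names(head_rows),
    type  = unname(vapply(head_rows, function(col) class(col)[1L], character(1))),
    stringsAsFactors = FALSE
  )
}

# Usage: write a small file, then inspect it without loading every row.
tmp_csv <- tempfile(fileext = ".csv")
utils::write.csv(mtcars, tmp_csv, row.names = FALSE)
col_info <- peek_columns(tmp_csv, n = 3L)
print(col_info)
```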

@leeper
Contributor Author

leeper commented Mar 15, 2018

That could work, but what formats will that work for other than plain text?

@jsonbecker
Collaborator

I'll have to peek at which binary formats may or may not have some form of quick access. Unfortunately, the benefits of binary files (faster loading, smaller file size, and consistent type definitions) probably mean there is little performance benefit to slicing the data, since it's stored in non-adjacent ways. We could still have a consistent interface, though, that's simply slower on binary formats (but more accurate).

It may make sense to add a limit param that is overridden where it's not usable (and you get a message saying that limit is essentially Inf). That would also allow people to trade speed for accuracy when using techniques that do type-guessing.
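
A rough sketch of how such an overridden limit could behave, in base R only. The function name `read_with_limit()` is hypothetical, and `.rds` stands in here for "a format where a row limit is not usable"; nothing below reflects the package's actual internals.

```r
# Hypothetical sketch: honour `limit` where the reader supports it, and
# otherwise fall back to a full read with a message.
read_with_limit <- function(file, limit = Inf) {
  format <- tolower(tools::file_ext(file))
  if (format == "csv") {
    # read.csv() treats a negative nrows as "read everything".
    nrows <- if (is.finite(limit)) limit else -1L
    return(utils::read.csv(file, nrows = nrows, stringsAsFactors = FALSE))
  }
  if (format == "rds") {
    if (is.finite(limit)) {
      message("`limit` cannot be applied to '.rds' files; reading everything (limit = Inf).")
    }
    return(readRDS(file))
  }
  stop("Unsupported format in this sketch: ", format)
}

# Usage: the same call works for both files, but only the CSV is truncated.
limit_csv <- tempfile(fileext = ".csv")
limit_rds <- tempfile(fileext = ".rds")
utils::write.csv(mtcars, limit_csv, row.names = FALSE)
saveRDS(mtcars, limit_rds)
csv_head <- read_with_limit(limit_csv, limit = 5)
rds_all  <- suppressMessages(read_with_limit(limit_rds, limit = 5))
```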

@leeper
Contributor Author

leeper commented Mar 15, 2018

We could probably also take a look at the haven codebase and see if there's a way to add some of this functionality upstream. I haven't looked, but I imagine it's possible, as a lot of these formats have metadata at the beginning of the file before any of the actual contents start.

@bokov
Contributor

bokov commented Dec 4, 2019

What I think would be nice is if among the meta-data snooping functions there was one that listed tables/sheets/etc. It would return names if available, numbers otherwise. For formats that cannot have multiple tables or sheets it would always return 1. What do you think?

@leeper
Contributor Author

leeper commented Dec 20, 2019

How about we make a generic like get_col_names() and start creating methods for it? For some of these (like text/tabular), this is going to be easy. We should also be able to make it work with the new haven functionality in #248. It doesn't have to work for everything right away, as the long tail of formats isn't going to give us this metadata easily.

@bokov I'm not sure how useful that is, at least initially. Let's start with the simple/flat file types and then think about how much work is worthwhile to make it work for other file types.

@bokov
Contributor

bokov commented Dec 20, 2019

I agree about starting with simple/flat file types and not necessarily supporting all types. But what do you think about following what seems to be the overall philosophy of this package and writing a unified front-end function that does this for the supported formats (with some kind of message for unsupported ones), instead of exporting format-specific functions?

@leeper
Contributor Author

leeper commented Dec 20, 2019

Yea, that's what I meant. Sorry for not being clear - we wouldn't export the methods, just like we don't export import/export methods.
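
For illustration, the design agreed on here (one front-end function, with unexported per-format methods behind it) might look roughly like the sketch below. Only the name `get_col_names()` comes from this thread; the internal generic `.get_col_names()`, the `colnames_<ext>` classes, and the wording of the message are assumptions made for the example, not the package's code.

```r
# Front end: the only function that would be exported in this sketch.
get_col_names <- function(file, ...) {
  fmt <- tolower(tools::file_ext(file))
  # Tag the path with a format-specific class so S3 dispatch picks a method.
  x <- structure(list(path = file), class = paste0("colnames_", fmt))
  .get_col_names(x, ...)
}

# Internal generic and methods (hypothetical names; these would stay unexported).
.get_col_names <- function(x, ...) UseMethod(".get_col_names")

.get_col_names.colnames_csv <- function(x, ...) {
  # One data row is enough to recover the header.
  names(utils::read.csv(x$path, nrows = 1L, check.names = FALSE, ...))
}

.get_col_names.colnames_tsv <- function(x, ...) {
  names(utils::read.delim(x$path, nrows = 1L, check.names = FALSE, ...))
}

.get_col_names.default <- function(x, ...) {
  # Unsupported formats get a message rather than an error.
  message("Column names cannot be read for this format without importing the file.")
  NULL
}

# Usage
names_csv <- tempfile(fileext = ".csv")
utils::write.csv(mtcars, names_csv, row.names = FALSE)
csv_names <- get_col_names(names_csv)
```

Supporting another format then only means adding one more method; callers keep using the same function.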

4 participants