-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify the pandas analyzer code to always respect the sample size #12097
Conversation
Please take into account this issue #6669 as that is what inspired the FindFirstNonNull method |
This should still work, no? |
Can you explain how? If the analyzer can only find nulls at the given offset, the resulting type is null |
Not really, if the values sniffed are all null and we did not sniff the full dataframe, it now defaults to a varchar. |
That just means that this does not break the referenced issue because the columns happen to be VARCHAR, that's not equivalent I'm fine merging this, I agree that this regression is a problem, but we do have to be aware of the behavior we're throwing away here and the problems that will inevitably cause |
Maybe there's a faster way of checking for NULL values instead of calling |
e.g. maybe we could use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html - although ideally we wouldn't do that for the entire DataFrame range |
How about we do that as a fallback - if the result type is |
Sure, I'm happy to try/benchmark that :-) |
@Mytherin I've modifed the code to essentially use One thing to notice is that the query that is running does not require the columns that are object types. Maybe one thing we should consider doing is performing projection pushdown and the dataframe. |
The problem is that what we really need in that case is some callback on when we need the type of the column, then we could lazily figure out the type only when required. That's not really supported infrastructure wise in the system right now - but would definitely be interesting to add. |
@Mytherin The CI here is failing because the previous implementations has a bug on the benchmark I've added |
Ah yeah, good point. LGTM in that case |
Merge pull request duckdb/duckdb#12152 from carlopi/allow_community_extensions Merge pull request duckdb/duckdb#12097 from pdet/pandas_object_analyzer
The Pandas Analyzer code would ignore the sample size limit for null values when sniffing data types from
object
type columns.In the case where we have an object column where most (or all) values are null, the whole column would be sniffed.
Now the null values are not skipped, and if we sampled the file, we upgrade the type from null to varchar, if only nulls were found.
This improves the scan of the NYC Taxi dataset from a dataframe by 2 orders of magnitude.
Old Time: 1.72
New Time: 0.025545666925609112
Sample benchmark: