sniff_csv how to parse the Columns value safely #12071
-
Hi,

```python
import duckdb

sample_filename = "path/to/my/file.csv"
conn = duckdb.connect(":memory:", config={"threads": 1})
query = conn.execute(
    "SELECT Delimiter, Quote, Escape, HasHeader, Columns FROM sniff_csv(?)",
    [sample_filename],
)
query_result = query.fetchone()
(delimiter, quote, escape, has_header, encoded_columns) = query_result
col_query = conn.execute(f"SELECT {encoded_columns}")
decoded_columns = col_query.fetchone()[0].keys()
```

but that is quite SQL-injection-adjacent, and it fails for some specific CSVs that have an apostrophe in some of their header names (see cloud_services_adoption.csv for one such example). I was able to sidestep the problem by changing the last two lines of the original code to:

```python
col_query = conn.execute(f"DESCRIBE SELECT * FROM '{sample_filename}' LIMIT 0")
decoded_columns = [item[0] for item in col_query.fetchall()]
```

which works even for the files with apostrophes. But it runs the sniffer a second time, which is not ideal; I would still like to be able to read this information directly from the original sniff_csv results. Is there any better way of getting the column names, please? Or is it a bug that the output of sniff_csv does not have the apostrophes escaped?

EDIT: raised as #12089
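One small hardening for the DESCRIBE workaround: the f-string interpolates the path directly into a SQL string literal, so a filename containing an apostrophe would break it too. A minimal sketch of a quoting helper, using the standard SQL rule of doubling embedded single quotes (`quote_sql_literal` is a hypothetical name, not a DuckDB API):

```python
def quote_sql_literal(value: str) -> str:
    """Render a Python string as a SQL single-quoted literal.

    Embedded single quotes are escaped by doubling them, per the
    standard SQL rule, so a path like it's_data.csv stays intact.
    """
    return "'" + value.replace("'", "''") + "'"

# Hypothetical usage with the DESCRIBE workaround above:
# conn.execute(f"DESCRIBE SELECT * FROM {quote_sql_literal(sample_filename)} LIMIT 0")
```

This only guards the filename, of course; it does not make interpolating `encoded_columns` into `SELECT {...}` safe.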
Replies: 2 comments 6 replies
-
Would it make sense for the Columns value to be emitted as JSON? Then one could potentially use duckdb's json functions to parse the output. |
-
I think that the "correct" fix here is that the `columns` type should be a list of structs, and not a varchar |
> I think that the "correct" fix here is that the `columns` type should be a list of structs, and not a varchar
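With a LIST of STRUCTs, DuckDB's Python client would surface the value as a list of dicts, so no string re-parsing would be needed at all. A sketch of what consuming that shape might look like (the value below is illustrative, not real sniff_csv output):

```python
# Hypothetical list-of-structs Columns field as the DuckDB Python
# client would return it: a plain list of dicts.
columns = [
    {"name": "country", "type": "VARCHAR"},
    {"name": "user's count", "type": "BIGINT"},  # apostrophe is harmless here
]

column_names = [col["name"] for col in columns]
column_types = {col["name"]: col["type"] for col in columns}
```

That would sidestep both the injection concern and the unescaped-apostrophe bug in one change.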