Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial & Iceberg extensions do not autoload #12095

Closed
2 tasks done
mehd-io opened this issue May 16, 2024 · 7 comments
Closed
2 tasks done

Spatial & Iceberg extensions do not autoload #12095

mehd-io opened this issue May 16, 2024 · 7 comments
Labels
expected behavior The behavior described in the issue is expected Needs Documentation Use for issues or PRs that require changes in the documentation needs triage

Comments

@mehd-io
Copy link

mehd-io commented May 16, 2024

What happens?

Spatial and Iceberg extensions should auto-load as their functions are in the extension_entries.hpp but they aren't.

To Reproduce

On DuckDB 10.2 CLI

For Spatial extension (data is from a public bucket), st_geomfromwkb is listed in the extensions_entries.hpp

create or replace table oslo as select bf_source, confidence, area_in_meters, country_iso, st_geomfromwkb(geometry) as geom from 's3://us-west-2.opendata.source.coop/vida/google-microsoft-open-buildings/geoparquet/by_country/country_iso=NOR/NOR.parquet' where st_dwithin(st_geomfromwkb(geometry), st_point(10.7409424, 59.9135533), .1);
Catalog Error: Scalar Function with name "st_dwithin" is not in the catalog, but it exists in the spatial extension.

Please try installing and loading the spatial extension:
INSTALL spatial;
LOAD spatial;

For Iceberg, iceberg_scan is listed in the extensions_entries.hpp

FROM iceberg_scan('https://motherduck-demo.s3.amazonaws.com/iceberg/lineitem_iceberg',
        allow_moved_paths=true) ;
 Catalog Error: Table Function with name "iceberg_scan" is not in the catalog, but it exists in the iceberg extension.

Please try installing and loading the iceberg extension:
INSTALL iceberg;
LOAD iceberg;

duckdb_settings ()

┌────────────────────────────────────┬──────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬─────────┐
│                name                │                value                 │                                                                       description                                                                       │ input_type │  scope  │
│              varchar               │               varchar                │                                                                         varchar                                                                         │  varchar   │ varchar │
├────────────────────────────────────┼──────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼─────────┤
│ access_mode                        │ automatic                            │ Access mode of the database (AUTOMATIC, READ_ONLY or READ_WRITE)                                                                                        │ VARCHAR    │ GLOBAL  │
│ allow_persistent_secrets           │ true                                 │ Allow the creation of persistent secrets, that are stored and loaded on restarts                                                                        │ BOOLEAN    │ GLOBAL  │
│ checkpoint_threshold               │ 16.0 MiB                             │ The WAL size threshold at which to automatically trigger a checkpoint (e.g. 1GB)                                                                        │ VARCHAR    │ GLOBAL  │
│ debug_checkpoint_abort             │ none                                 │ DEBUG SETTING: trigger an abort while checkpointing for testing purposes                                                                                │ VARCHAR    │ GLOBAL  │
│ debug_force_external               │ false                                │ DEBUG SETTING: force out-of-core computation for operators that support it, used for testing                                                            │ BOOLEAN    │ LOCAL   │
│ debug_force_no_cross_product       │ false                                │ DEBUG SETTING: Force disable cross product generation when hyper graph isn't connected, used for testing                                                │ BOOLEAN    │ LOCAL   │
│ debug_asof_iejoin                  │ false                                │ DEBUG SETTING: force use of IEJoin to implement AsOf joins                                                                                              │ BOOLEAN    │ LOCAL   │
│ prefer_range_joins                 │ false                                │ Force use of range joins with mixed predicates                                                                                                          │ BOOLEAN    │ LOCAL   │
│ debug_window_mode                  │ NULL                                 │ DEBUG SETTING: switch window mode to use                                                                                                                │ VARCHAR    │ GLOBAL  │
│ default_collation                  │                                      │ The collation setting used when none is specified                                                                                                       │ VARCHAR    │ GLOBAL  │
│ default_order                      │ asc                                  │ The order type used when none is specified (ASC or DESC)                                                                                                │ VARCHAR    │ GLOBAL  │
│ default_null_order                 │ nulls_last                           │ Null ordering used when none is specified (NULLS_FIRST or NULLS_LAST)                                                                                   │ VARCHAR    │ GLOBAL  │
│ disabled_filesystems               │                                      │ Disable specific file systems preventing access (e.g. LocalFileSystem)                                                                                  │ VARCHAR    │ GLOBAL  │
│ disabled_optimizers                │                                      │ DEBUG SETTING: disable a specific set of optimizers (comma separated)                                                                                   │ VARCHAR    │ GLOBAL  │
│ enable_external_access             │ true                                 │ Allow the database to access external state (through e.g. loading/installing modules, COPY TO/FROM, CSV readers, pandas replacement scans, etc)         │ BOOLEAN    │ GLOBAL  │
│ enable_fsst_vectors                │ false                                │ Allow scans on FSST compressed segments to emit compressed vectors to utilize late decompression                                                        │ BOOLEAN    │ GLOBAL  │
│ allow_unsigned_extensions          │ false                                │ Allow to load extensions with invalid or missing signatures                                                                                             │ BOOLEAN    │ GLOBAL  │
│ allow_extensions_metadata_mismatch │ false                                │ Allow to load extensions with not compatible metadata                                                                                                   │ BOOLEAN    │ GLOBAL  │
│ allow_unredacted_secrets           │ false                                │ Allow printing unredacted secrets                                                                                                                       │ BOOLEAN    │ GLOBAL  │
│ custom_extension_repository        │                                      │ Overrides the custom endpoint for remote extension installation                                                                                         │ VARCHAR    │ GLOBAL  │
│ autoinstall_extension_repository   │                                      │ Overrides the custom endpoint for extension installation on autoloading                                                                                 │ VARCHAR    │ GLOBAL  │
│ autoinstall_known_extensions       │ true                                 │ Whether known extensions are allowed to be automatically installed when a query depends on them                                                         │ BOOLEAN    │ GLOBAL  │
│ autoload_known_extensions          │ true                                 │ Whether known extensions are allowed to be automatically loaded when a query depends on them                                                            │ BOOLEAN    │ GLOBAL  │
│ enable_object_cache                │ false                                │ Whether or not object cache is used to cache e.g. Parquet metadata                                                                                      │ BOOLEAN    │ GLOBAL  │
│ enable_http_metadata_cache         │ false                                │ Whether or not the global http metadata is used to cache HTTP metadata                                                                                  │ BOOLEAN    │ GLOBAL  │
│ enable_profiling                   │ NULL                                 │ Enables profiling, and sets the output format (JSON, QUERY_TREE, QUERY_TREE_OPTIMIZER)                                                                  │ VARCHAR    │ LOCAL   │
│ enable_progress_bar                │ true                                 │ Enables the progress bar, printing progress to the terminal for long queries                                                                            │ BOOLEAN    │ LOCAL   │
│ enable_progress_bar_print          │ true                                 │ Controls the printing of the progress bar, when 'enable_progress_bar' is true                                                                           │ BOOLEAN    │ LOCAL   │
│ errors_as_json                     │ false                                │ Output error messages as structured JSON instead of as a raw string                                                                                     │ BOOLEAN    │ LOCAL   │
│ explain_output                     │ physical_only                        │ Output of EXPLAIN statements (ALL, OPTIMIZED_ONLY, PHYSICAL_ONLY)                                                                                       │ VARCHAR    │ LOCAL   │
│ extension_directory                │                                      │ Set the directory to store extensions in                                                                                                                │ VARCHAR    │ GLOBAL  │
│ external_threads                   │ 1                                    │ The number of external threads that work on DuckDB tasks.                                                                                               │ BIGINT     │ GLOBAL  │
│ file_search_path                   │                                      │ A comma separated list of directories to search for input files                                                                                         │ VARCHAR    │ LOCAL   │
│ force_compression                  │ Auto                                 │ DEBUG SETTING: forces a specific compression method to be used                                                                                          │ VARCHAR    │ GLOBAL  │
│ force_bitpacking_mode              │ auto                                 │ DEBUG SETTING: forces a specific bitpacking mode                                                                                                        │ VARCHAR    │ GLOBAL  │
│ home_directory                     │                                      │ Sets the home directory used by the system                                                                                                              │ VARCHAR    │ LOCAL   │
│ log_query_path                     │ NULL                                 │ Specifies the path to which queries should be logged (default: NULL, queries are not logged)                                                            │ VARCHAR    │ LOCAL   │
│ lock_configuration                 │ false                                │ Whether or not the configuration can be altered                                                                                                         │ BOOLEAN    │ GLOBAL  │
│ immediate_transaction_mode         │ false                                │ Whether transactions should be started lazily when needed, or immediately when BEGIN TRANSACTION is called                                              │ BOOLEAN    │ GLOBAL  │
│ integer_division                   │ false                                │ Whether or not the / operator defaults to integer division, or to floating point division                                                               │ BOOLEAN    │ LOCAL   │
│ max_expression_depth               │ 1000                                 │ The maximum expression depth limit in the parser. WARNING: increasing this setting and using very deep expressions might lead to stack overflow errors. │ UBIGINT    │ LOCAL   │
│ max_memory                         │ 51.1 GiB                             │ The maximum memory of the system (e.g. 1GB)                                                                                                             │ VARCHAR    │ GLOBAL  │
│ old_implicit_casting               │ false                                │ Allow implicit casting to/from VARCHAR                                                                                                                  │ BOOLEAN    │ GLOBAL  │
│ memory_limit                       │ 51.1 GiB                             │ The maximum memory of the system (e.g. 1GB)                                                                                                             │ VARCHAR    │ GLOBAL  │
│ null_order                         │ nulls_last                           │ Null ordering used when none is specified (NULLS_FIRST or NULLS_LAST)                                                                                   │ VARCHAR    │ GLOBAL  │
│ ordered_aggregate_threshold        │ 262144                               │ The number of rows to accumulate before sorting, used for tuning                                                                                        │ UBIGINT    │ LOCAL   │
│ password                           │ NULL                                 │ The password to use. Ignored for legacy compatibility.                                                                                                  │ VARCHAR    │ GLOBAL  │
│ perfect_ht_threshold               │ 12                                   │ Threshold in bytes for when to use a perfect hash table                                                                                                 │ BIGINT     │ LOCAL   │
│ pivot_filter_threshold             │ 10                                   │ The threshold to switch from using filtered aggregates to LIST with a dedicated pivot operator                                                          │ BIGINT     │ LOCAL   │
│ pivot_limit                        │ 100000                               │ The maximum number of pivot columns in a pivot statement                                                                                                │ BIGINT     │ LOCAL   │
│ preserve_identifier_case           │ true                                 │ Whether or not to preserve the identifier case, instead of always lowercasing all non-quoted identifiers                                                │ BOOLEAN    │ LOCAL   │
│ preserve_insertion_order           │ true                                 │ Whether or not to preserve insertion order. If set to false the system is allowed to re-order any results that do not contain ORDER BY clauses.         │ BOOLEAN    │ GLOBAL  │
│ profile_output                     │                                      │ The file to which profile output should be saved, or empty to print to the terminal                                                                     │ VARCHAR    │ LOCAL   │
│ profiling_mode                     │ NULL                                 │ The profiling mode (STANDARD or DETAILED)                                                                                                               │ VARCHAR    │ LOCAL   │
│ profiling_output                   │                                      │ The file to which profile output should be saved, or empty to print to the terminal                                                                     │ VARCHAR    │ LOCAL   │
│ progress_bar_time                  │ 2000                                 │ Sets the time (in milliseconds) how long a query needs to take before we start printing a progress bar                                                  │ BIGINT     │ LOCAL   │
│ schema                             │ main                                 │ Sets the default search schema. Equivalent to setting search_path to a single value.                                                                    │ VARCHAR    │ LOCAL   │
│ search_path                        │                                      │ Sets the default catalog search path as a comma-separated list of values                                                                                │ VARCHAR    │ LOCAL   │
│ secret_directory                   │ /Users/mehdio/.duckdb/stored_secrets │ Set the directory to which persistent secrets are stored                                                                                                │ VARCHAR    │ GLOBAL  │
│ default_secret_storage             │ local_file                           │ Allows switching the default storage for secrets                                                                                                        │ VARCHAR    │ GLOBAL  │
│ temp_directory                     │                                      │ Set the directory to which to write temp files                                                                                                          │ VARCHAR    │ GLOBAL  │
│ threads                            │ 10                                   │ The number of total threads used by the system.                                                                                                         │ BIGINT     │ GLOBAL  │
│ username                           │ NULL                                 │ The username to use. Ignored for legacy compatibility.                                                                                                  │ VARCHAR    │ GLOBAL  │
│ arrow_large_buffer_size            │ false                                │ If arrow buffers for strings, blobs, uuids and bits should be exported using large buffers                                                              │ BOOLEAN    │ GLOBAL  │
│ user                               │ NULL                                 │ The username to use. Ignored for legacy compatibility.                                                                                                  │ VARCHAR    │ GLOBAL  │
│ wal_autocheckpoint                 │ 16.0 MiB                             │ The WAL size threshold at which to automatically trigger a checkpoint (e.g. 1GB)                                                                        │ VARCHAR    │ GLOBAL  │
│ worker_threads                     │ 10                                   │ The number of total threads used by the system.                                                                                                         │ BIGINT     │ GLOBAL  │
│ allocator_flush_threshold          │ 128.0 MiB                            │ Peak allocation threshold at which to flush the allocator after completing a task.                                                                      │ VARCHAR    │ GLOBAL  │
│ duckdb_api                         │ cli                                  │ DuckDB API surface                                                                                                                                      │ VARCHAR    │ GLOBAL  │
│ custom_user_agent                  │                                      │ Metadata from DuckDB callers                                                                                                                            │ VARCHAR    │ GLOBAL  │
│ partitioned_write_flush_threshold  │ 524288                               │ The threshold in number of rows after which we flush a thread state when writing using PARTITION_BY                                                     │ BIGINT     │ LOCAL   │
│ binary_as_string                   │                                      │ In Parquet files, interpret binary data as a string.                                                                                                    │ BOOLEAN    │ GLOBAL  │
│ Calendar                           │ gregorian                            │ The current calendar                                                                                                                                    │ VARCHAR    │ GLOBAL  │
│ TimeZone                           │ Europe/Berlin                        │ The current time zone                                                                                                                                   │ VARCHAR    │ GLOBAL  │
├────────────────────────────────────┴──────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴─────────┤
│ 74 rows                                                                                                                                                                                                                                          5 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

OS:

MacOS

DuckDB Version:

0.10.2

DuckDB Client:

CLI

Full Name:

Mehdi Ouazza

Affiliation:

MotherDuck

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@Tishj
Copy link
Contributor

Tishj commented May 16, 2024

It sounds like autoinstall_known_extensions is perhaps not set?
Can you provide the output for select * from duckdb_settings(), I think that will be relevant here

@mehd-io
Copy link
Author

mehd-io commented May 16, 2024

@Tishj it is set, I'm using other extensions (httpfs, aws) and often rely on autoload.

v0.10.2 1601d94f94
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from duckdb_settings() where name='autoinstall_known_extensions';
┌──────────────────────────────┬─────────┬─────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬─────────┐
│             name             │  value  │                                           description                                           │ input_type │  scope  │
│           varchar            │ varchar │                                             varchar                                             │  varchar   │ varchar │
├──────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼─────────┤
│ autoinstall_known_extensions │ true    │ Whether known extensions are allowed to be automatically installed when a query depends on them │ BOOLEAN    │ GLOBAL  │
└──────────────────────────────┴─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴─────────┘
D create or replace table oslo as select bf_source, confidence, area_in_meters, country_iso, st_geomfromwkb(geometry) as geom from 's3://us-west-2.opendata.source.coop/vida/google-microsoft-open-buildings/geoparquet/by_country/country_iso=NOR/NOR.parquet' where st_dwithin(st_geomfromwkb(geometry), st_point(10.7409424, 59.9135533), .1);
Catalog Error: Scalar Function with name "st_dwithin" is not in the catalog, but it exists in the spatial extension.

Please try installing and loading the spatial extension:
INSTALL spatial;
LOAD spatial;

@carlopi
Copy link
Contributor

carlopi commented May 16, 2024

I think everything is working as intended, there are these lines:

static constexpr const char *AUTOLOADABLE_EXTENSIONS[] = {
    "aws", "azure", "autocomplete", "excel",          "fts",      "httpfs",           "inet",
    "icu", "json",  "parquet",      "sqlite_scanner", "sqlsmith", "postgres_scanner", "tpcds",
    "tpch"}; // END_OF_AUTOLOADABLE_EXTENSIONS

at https://github.com/duckdb/duckdb/blob/v0.10.2/src/include/duckdb/main/extension_entries.hpp#L351C1-L355C1 that specify the list of extensions for which autoloading will actually work.

In current main, the line is here: https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/main/extension_entries.hpp#L366C1-L370C1, but still no iceberg nor spatial.

Idea of the current code is that there are two sets: the set of known functions , and the set of extensions that can be autoloaded.

To trigger autoloading a function has to be known (e.g., not added in a later extension version) AND the extension has to be in the list. [edit: also, autoloading needs to be enabled, but that's the case in most stable deployments]

Strictly speaking I dont' see any bug in the implementation, but if it was not clear to you (as someone working with duckdb) then probably there is likely room for improvement in the error messages or [developer] docs or elsewhere.

Expanding the list of known to be autoloadable extensions can be considered.

@carlopi carlopi added the expected behavior The behavior described in the issue is expected label May 16, 2024
@mehd-io
Copy link
Author

mehd-io commented May 17, 2024

Thanks for the explanation, @carlopi , It's much clearer!
I think from an user point of view, this is pretty confusing as there are today no way of knowing if an extension will autoload unless you look at this code source.
So, at the minimum, as you suggested, the list should probably be referenced in the docs.

Regarding the list itself, what's the decision behind not having all duckdb supported extensions auto-loaded? That was my initial thought.
As the philosophy of making things simpler, most of the (basic/new) users shouldn't care about the extensions as they would most of the time use supported extensions.

@carlopi
Copy link
Contributor

carlopi commented May 17, 2024

Rationale is that extension that currently are marked as autoloadable are tested so that if LOAD x statement are ignored, and all keeps working, so basically LOAD statements are superfluous.

Spatial I think still needs to be explicitly loaded in a few cases, say when reading a DB file that store spatial data autoloading can't yet be triggered there (I think, or some other cases like it).

We can change this policy (and change the test infrastructure a bit to reflect that, basically adding in a few places that test do not support the full-autoloading mode), I can also see the reason for more autoloading.

@carlopi
Copy link
Contributor

carlopi commented May 17, 2024

Doc side the difference is sort of explained here: https://duckdb.org/docs/extensions/overview#extension-types, where there are this 3 categories: built-in, autoloadable, explicitly loadable extensions, but text can be improved, input is very welcome. (this is orthogonal to whether spatial is autoloadable or not, given there will always exist third party extensions that requires explicit load)

@carlopi
Copy link
Contributor

carlopi commented May 28, 2024

I think this will be reviewed docs side in duckdb/duckdb-web#2874.

Going forward we might want to move more core extensions to being autoloadable, but this is currently working as intended and so closing this.

Possibly it might be raised on the different repos, or more generally a discussion can also be great.

@carlopi carlopi closed this as completed May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expected behavior The behavior described in the issue is expected Needs Documentation Use for issues or PRs that require changes in the documentation needs triage
Projects
None yet
Development

No branches or pull requests

4 participants