annotations experiment #942

sh-rp · 2024-02-07T12:54:15Z

Description

This PR has an example of how we could leverage python annotations to have a more universal and more readable format of our schema.

Check out test_pipeline.py to see what the user-facing interface would look like in this example. It already works. Basically, all you need to do is put our annotations on class variables, and you can use any new or pre-existing class as the basis for our schema. Type hints that are unknown to us are ignored (as specified by the "Annotated" PEP, PEP 593).

Notes and Considerations:

Not all hints and data_types are supported here. This is a prototype, but one that is easy to extend.
There is a metadata attribute "table" on the class that we can use to set table-level hints. The dlt core is currently not set up to support this via the columns attribute, but I think it should be quite easy to add.
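
To make the idea in the description concrete, here is a minimal sketch of how annotated class variables could be turned into column definitions. It is not the code from this PR: the marker classes `PrimaryKey`, `Unique` and `Classifiers`, the helper `columns_from_class`, and the small `TYPE_MAP` are assumptions made for illustration, and only the `typing` calls are standard library.

```python
from dataclasses import dataclass
from typing import (
    Annotated, Any, Dict, Optional, Tuple, Union,
    get_args, get_origin, get_type_hints,
)


class PrimaryKey:
    """Hypothetical marker, used as a bare class inside Annotated[...]."""


class Unique:
    """Hypothetical marker, used as a bare class inside Annotated[...]."""


@dataclass(frozen=True)
class Classifiers:
    """Hypothetical hint that carries a tuple of classifier labels."""
    values: Tuple[str, ...]


# Illustrative mapping only; a real implementation would reuse existing type resolution.
TYPE_MAP: Dict[Any, str] = {str: "text", bool: "bool"}


def columns_from_class(cls: type) -> Dict[str, Dict[str, Any]]:
    """Build a column dict from the annotated class variables of `cls`."""
    columns: Dict[str, Dict[str, Any]] = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it
    for name, hint in get_type_hints(cls, include_extras=True).items():
        metadata: Tuple[Any, ...] = ()
        if get_origin(hint) is Annotated:
            hint, *rest = get_args(hint)
            metadata = tuple(rest)
        column: Dict[str, Any] = {}
        # Optional[X] is Union[X, None]: mark the column nullable and keep X
        if get_origin(hint) is Union and type(None) in get_args(hint):
            column["nullable"] = True
            hint = next(a for a in get_args(hint) if a is not type(None))
        column["data_type"] = TYPE_MAP[hint]  # raises KeyError for unmapped types
        for meta in metadata:
            if meta is PrimaryKey:
                column["primary_key"] = True
            elif meta is Unique:
                column["unique"] = True
            elif isinstance(meta, Classifiers):
                column["x-classifiers"] = list(meta.values)
            # metadata we do not recognise is skipped
        columns[name] = column
    return columns


class Items:
    id: Annotated[str, PrimaryKey, Unique]
    name: Annotated[Optional[str], Classifiers(("pii.name",))]
    likes_herring: Annotated[bool, "not a dlt hint"]
```

Calling `columns_from_class(Items)` yields one entry per annotated attribute. The plain string attached to `likes_herring` is skipped, which mirrors the "unknown hints are ignored" behaviour described above.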

netlify · 2024-02-07T12:54:18Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`7547aa8`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/65c3b66fe5f9280008f8eab6

sh-rp · 2024-02-07T12:54:42Z

dlt/extract/utils.py

@@ -65,6 +66,11 @@ def ensure_table_schema_columns(columns: TAnySchemaColumns) -> TTableSchemaColum
        isinstance(columns, pydantic.BaseModel) or issubclass(columns, pydantic.BaseModel)
    ):
        return pydantic.pydantic_to_table_schema_columns(columns)
+    elif isinstance(columns, type):


no support for table level hints here yet

sh-rp · 2024-02-07T12:59:21Z

test_pipeline.py

+    # run simple pipeline and see whether schema was used
+    load_info = p.run(data, columns=Items, table_name="blah")
+    print(load_info)
+    print(p.default_schema.to_pretty_yaml())


The items example will produce this segment in the final schema:

blah:
  columns:
    id:
      data_type: text
      primary_key: true
      unique: true
    name:
      data_type: text
      nullable: true
      x-classifiers:
      - pii.name
    email:
      data_type: text
      nullable: true
      unique: true
      x-classifiers:
      - pii.email
    likes_herring:
      data_type: bool
      x-classifiers:
      - pii.food_preference
    _dlt_load_id:
      data_type: text
      nullable: false
    _dlt_id:
      data_type: text
      nullable: false
      unique: true
  write_disposition: append

sultaniman · 2024-02-07T13:07:00Z

dlt/common/schema/annotations.py

+def to_full_type(t: Type[Any]) -> TColumnSchema:
+    result: TColumnSchema = {}
+    if get_origin(t) is Union:
+        for arg in get_args(t):


In this case the last type in the Union will override all previous values. Should we aggregate things somehow?

We can only have one type in the schema, so either we have a default way of resolving multiple types or we throw an error.

Alternatively, a union of int and string could produce a string, but I think that is taking it too far for now.
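
The "throw an error" option discussed above could look roughly like the sketch below. The helper name `resolve_union` is hypothetical and this is not the PR's `to_full_type`. It accepts `Optional[X]` as a nullable `X` and rejects unions that name more than one non-None type.

```python
from typing import Any, Optional, Tuple, Union, get_args, get_origin


def resolve_union(t: Any) -> Tuple[Any, bool]:
    """Return (single_type, nullable); raise if no single type can be picked."""
    if get_origin(t) is not Union:
        return t, False
    args = get_args(t)
    non_none = [a for a in args if a is not type(None)]
    nullable = len(non_none) < len(args)
    if len(non_none) != 1:
        # several candidate types: refuse instead of letting the last one win
        raise TypeError(f"cannot map {t!r} to a single column data type")
    return non_none[0], nullable
```

With this, `resolve_union(Optional[int])` gives `(int, True)`, while `resolve_union(Union[int, str])` raises a `TypeError` rather than silently resolving to `str`.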

rudolfix · 2024-02-07T13:32:06Z

dlt/common/schema/annotations.py

+#
+
+TypeMap: Dict[Any, TDataType] = {
+    str: "text",


we have py_type_to_sc_type. look at the type_helpers.py. tons of edge cases are handled there when converting types

ah yes, great :) I was wondering if we had something like this somewhere.

rudolfix

this is soo cool

rudolfix · 2024-02-07T13:33:17Z

dlt/common/schema/annotations.py

+
+
+def unwrap(t: Type[Any]) -> Tuple[Any, List[Any]]:
+    """Returns python type info and wrapped types if this was annotated type"""


look at typing.py in common, I think I have similar function

yes, i extracted the essential part.
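
For readers who have not seen the helper, a function with the docstring quoted above might be sketched as follows. This is an assumption about its shape, not the PR's or `typing.py`'s actual implementation, and the name `unwrap_annotated` is chosen here to keep that distinction clear.

```python
from typing import Annotated, Any, List, Tuple, get_args, get_origin


def unwrap_annotated(t: Any) -> Tuple[Any, List[Any]]:
    """Split Annotated[T, m1, m2, ...] into (T, [m1, m2, ...]).

    Types that are not Annotated are returned unchanged with no metadata.
    """
    if get_origin(t) is Annotated:
        inner, *metadata = get_args(t)
        return inner, list(metadata)
    return t, []
```

Nested `Annotated` types are flattened by `typing` itself, so `unwrap_annotated(Annotated[Annotated[int, "a"], "b"])` returns `(int, ["a", "b"])` without any recursion in the helper.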

rudolfix · 2024-02-07T13:40:02Z

test_pipeline.py

+class Items:
+
+    # metadata for the table, currently not picked up by the pipeline
+    __table__: Annotated[Never, a.TableName("my_items"), a.WriteDisposition("merge")]


hmmm there must be a better way! ie.
load_info = p.run(data, columns=Annotated[Items, a.TableName("my_items"), a.WriteDisposition("merge")] , table_name="blah")

if Items is not annotated then we fall back to the defaults: "Items" as the table name and append as the write disposition

also, using columns for the above is a kind of abuse... let's sync on table-level hints. they may require changes to our core library, e.g. making the resource a generic that takes the model as T. also see: #753

or we add "model" to resource definition. but then we are going into a big overhaul of our schema system where relational and python schemas are different.

in my view we should rename the columns to "model" or "table" and allow a TableSchema or even multiple table schemas (to cover subtables) in there. If a list of columns is detected we can fall back to the old way.

i changed it to accept annotated tables now, so this would work.
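
A rough sketch of how table-level hints could be read from an annotated class passed as an argument, with the defaults proposed above (class name as table name, append as write disposition). `TableName`, `WriteDisposition` and `table_hints` are stand-ins written for this example and are assumptions, not the PR's API.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, Tuple, get_args, get_origin


@dataclass(frozen=True)
class TableName:
    """Hypothetical table-level hint: explicit table name."""
    name: str


@dataclass(frozen=True)
class WriteDisposition:
    """Hypothetical table-level hint: write disposition such as "merge"."""
    value: str


def table_hints(model: Any) -> Tuple[type, Dict[str, str]]:
    """Return the underlying class and its table-level hints."""
    if get_origin(model) is Annotated:
        cls, *metadata = get_args(model)
    else:
        cls, metadata = model, []
    # defaults when the class is passed without annotations
    hints = {"name": cls.__name__, "write_disposition": "append"}
    for meta in metadata:
        if isinstance(meta, TableName):
            hints["name"] = meta.name
        elif isinstance(meta, WriteDisposition):
            hints["write_disposition"] = meta.value
    return cls, hints


class Items:
    id: str
```

`table_hints(Annotated[Items, TableName("my_items"), WriteDisposition("merge")])` returns the `Items` class together with the overridden hints, and `table_hints(Items)` returns the defaults.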

rudolfix · 2024-02-07T13:48:13Z

test_pipeline.py

+    id: Annotated[str, a.PrimaryKey, a.Unique]
+
+    # additional columns
+    name: Annotated[Optional[str], a.Classifiers(["pii.name"])]


I'd probably generate literals for those

you mean for the classifiers?

annotations experiment

33633f1

sultaniman reviewed Feb 7, 2024

rudolfix reviewed Feb 7, 2024

rudolfix requested changes Feb 7, 2024

sh-rp added 2 commits February 7, 2024 16:09

use built in type resolution

c3f2ba3

allow annotation of table arg

remove unneeded stuff

7547aa8

sh-rp self-assigned this Mar 18, 2024

rudolfix mentioned this pull request Apr 15, 2024

annotate Pydantic models with dlt hints #1221

Open
