Streamlit app cache is invalid & load info crashes #1264

Open
willi-mueller opened this issue Apr 23, 2024 · 8 comments
Assignees
Labels
bug Something isn't working community This issue came from slack community workspace

Comments

@willi-mueller
Contributor

willi-mueller commented Apr 23, 2024

dlt version

0.4.8

Describe the problem

When the schema changes by either
a) dropping a table manually
b) unselecting a resource while having the write_disposition='replace'

the streamlit app does not show the changed schema.
Screenshot 2024-04-23 at 13 06 40

Also, when I delete a row it is still shown when I click "Show Data".

Further, the "Number of loaded rows" in the load info tab only shows a run-time error and stack trace but no row count.

Screenshot 2024-04-23 at 12 57 05

Expected behavior

  1. When I drop a table the streamlit app should not show it
  2. When I deselect a resource and dlt removes the table from the destination the streamlit app should not show it
  3. The streamlit app should show the row count in the load info tab
  4. Different topic: The tables created by the rest_api should not be filed in dlt under a non-existing schema.

Steps to reproduce

See screencast: https://www.loom.com/share/700e9f4a1cbe48a5988f55f27c022588?sid=b937409a-329a-4770-85b6-b65afe05aa51

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

rest_api

dlt destination

DuckDB

Other deployment details

No response

Additional information

My test code:

import dlt

from rest_api import rest_api_source

pokemon_config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
    },
    "resource_defaults": {
        "write_disposition": "replace",
        "endpoint": {
            "params": {
                "limit": 1000,
            },
        },
    },
    "resources": [
        {
            "name": "berries",
            "endpoint": {
                "path": "berry",
            },
            # "selected": False
        },
        "pokemon",
    ],
}

pokemon_source = rest_api_source(pokemon_config)

pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",
    destination="duckdb",
    dataset_name="pokemon",
    progress="log",
)

load_info = pipeline.run(pokemon_source)
print(load_info)
@willi-mueller willi-mueller changed the title Streamlit app cache is invalid Streamlit app cache is invalid & load info crashes Apr 23, 2024
@sultaniman sultaniman self-assigned this Apr 23, 2024
@sultaniman sultaniman added bug Something isn't working community This issue came from slack community workspace and removed bug Something isn't working labels Apr 23, 2024
@rudolfix rudolfix added the bug Something isn't working label Apr 25, 2024
@sultaniman
Collaborator

I tried to reproduce this by running the pipeline with the example code multiple times and then trying:

a) dropping a table manually
b) unselecting a resource while having the write_disposition='replace'

poke-repro.mov

I was unable to reproduce it. Can you please try again with the latest version?

@willi-mueller
Contributor Author

willi-mueller commented Apr 26, 2024

Thanks for testing! Indeed, with v0.4.9 it's better. But when I click on "Show data" after the second run, having just dropped the table, streamlit still shows data from the cache. It does not query the DB and thus does not see that the table no longer exists.

How to reproduce:

  1. python pokemon_pipeline.py # with berry selected
  2. dlt pipeline pokemon_pipeline show
  3. in streamlit: "show data" on berries resource
  4. in DuckDB: use pokemon; drop table berries;
  5. Refresh streamlit
  6. in streamlit: "show data" on berries resource. It still shows the deleted data
  7. python pokemon_pipeline.py # with berry unselected
  8. in streamlit: "show data" on berries resource. It still shows the deleted data
  9. dlt pipeline pokemon_pipeline drop berries
  10. streamlit: Now, berries are no longer visible

@willi-mueller
Contributor Author

willi-mueller commented Apr 26, 2024

Also, even after executing the pipeline multiple times, I get the blue message in streamlit: "pokemon is missing resource state". I wonder when this would disappear. The same happens for berries.

@sultaniman
Collaborator

@willi-mueller I think this is suboptimal UX, and we probably just need to hide the message if no resource state is found.

@sultaniman
Collaborator

@willi-mueller I was able to reproduce the caching issue. It is caused by the TTL we set on the cache when users query the data. I am thinking about a proper invalidation mechanism.
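To illustrate the kind of staleness described here, below is a minimal, self-contained sketch of a time-based cache. It is not the dlt Streamlit app's code: `TTLCache`, `query_berries`, and the in-memory `destination` dict are hypothetical stand-ins, and the 60-second TTL is an assumed value. It shows that a cached result keeps being served after the table is dropped elsewhere, and that an explicit invalidation makes the next read hit the destination again.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny time-based cache (hypothetical stand-in, not the dlt app's cache)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            # Still "fresh" by the clock, even if the underlying source changed.
            return entry[1]
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Optional[str] = None) -> None:
        """Drop one entry, or everything when no key is given."""
        if key is None:
            self._entries.clear()
        else:
            self._entries.pop(key, None)


# Hypothetical destination: table name -> rows.
destination = {"berries": [{"name": "cheri"}, {"name": "chesto"}]}


def query_berries():
    # Raises KeyError once the table is gone, similar to a failing SQL query.
    return list(destination["berries"])


fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get_or_load("berries", query_berries)
del destination["berries"]  # "drop table berries" from another tool
fake_now[0] = 10.0  # 10 seconds later, still inside the TTL
stale = cache.get_or_load("berries", query_berries)  # served from the cache

cache.invalidate("berries")
try:
    cache.get_or_load("berries", query_berries)
    table_missing = False
except KeyError:
    table_missing = True  # after invalidation the drop becomes visible
```

A shorter TTL narrows the window in which stale rows appear, while explicit invalidation (for example on a manual refresh) removes it for that key.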

@willi-mueller
Contributor Author

willi-mueller commented Apr 29, 2024 via email

@sultaniman
Collaborator

sultaniman commented May 7, 2024

@willi-mueller I adjusted the caching lifetime and disabled caching of the schema in the streamlit session state store; please see the attached screen recording. Historically, it was not anticipated that users might keep streamlit open while altering the state of the pipeline and its data from a separate console or other tools.

The PR with the fix is right above this comment.

streamlit-schema-and-session-caching.mov
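As a rough illustration of why keeping the schema in a session-level store can hide changes made from another console, here is a small sketch. It does not use Streamlit or dlt: the `session_state` dict stands in for a store that survives reruns, and `stored_schema`, `load_schema`, and both accessor functions are hypothetical names.

```python
from typing import Dict, List

# Stand-in for a session store that survives reruns of the app script.
session_state: Dict[str, List[str]] = {}

# Hypothetical pipeline schema as persisted on disk: a list of table names.
stored_schema = ["berries", "pokemon"]


def load_schema() -> List[str]:
    """Read the current table names from the (hypothetical) stored schema."""
    return list(stored_schema)


def tables_with_session_cache() -> List[str]:
    # Loads once per session; later schema changes are never picked up.
    if "schema_tables" not in session_state:
        session_state["schema_tables"] = load_schema()
    return session_state["schema_tables"]


def tables_without_session_cache() -> List[str]:
    # Reloads on every call, so each rerun reflects the current schema.
    return load_schema()


before = tables_with_session_cache()
stored_schema.remove("berries")  # e.g. the table is dropped from another console
cached_view = tables_with_session_cache()
fresh_view = tables_without_session_cache()
```

Reloading on every rerun costs an extra read each time, which is the trade-off against always showing the current set of tables.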

@sultaniman
Collaborator

sultaniman commented May 14, 2024

@willi-mueller can you please check this with the latest version? The issue should be gone now.

Projects
Status: In Progress

3 participants