Add support for inserts via GCP cloud function and pub/sub #670

Open
wants to merge 1 commit into
base: master

jordanwillifordcruise

Added support for handle_tests_results and insert_rows to use a new insert_rows_method value, gcp-cloud-function, which calls a UDF to push results into BigQuery via Pub/Sub instead of issuing direct insert queries. This significantly increases Elementary's capacity to insert records and test results when using BigQuery.

Here is an example configuration in dbt_project.yml. You specify a UDF that publishes to a Pub/Sub topic (the topic is the first argument of the UDF call in the SELECT), and then define 3 Pub/Sub topics to send data to. Those topics are configured to pass records straight to BigQuery, which is a simple option. I also defined schemas for those tables (all of this is created in Terraform), which is not ideal: it couples this setup to schema and code changes for those tables.

  # This is to prevent BQ from exceeding the query size limit, specifically used to prevent errors writing Elementary metadata.
  insert_rows_method: gcp-cloud-function
  insert_rows_udf: "_project_._dataset_.publish_to_pubsub_function"
  insert_rows_topics: {
    "data_monitoring_metrics": "projects/_project_/topics/elementary-monitoring-metrics-topic",
    "test_result_rows": "projects/_project_/topics/elementary-test-result-rows-topic",
    "elementary_test_results": "projects/_project_/topics/elementary-elementary-test-results-topic"
  }
  query_max_size: 100000
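
To make the role of these settings concrete, here is a minimal Python sketch of how rows for one table might be packed into SELECT statements that call the UDF, with each statement kept under query_max_size characters. It is not the code in this PR (that lives in dbt macros). The helper names, the UNNEST-based query shape, and the empty '{}' attributes argument are all assumptions made for illustration.

```python
import json


def sql_string(value: str) -> str:
    """Quote text as a BigQuery string literal (escapes backslash and single quote)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_publish_queries(table_name, rows, udf, topics, max_size):
    """Pack rows into SELECT statements that call the publishing UDF.

    Hypothetical helper: each row is serialized to JSON and passed to the UDF
    through UNNEST, and a new statement is started whenever adding another row
    would push the statement past max_size characters. A single row that is
    larger than max_size still gets its own (oversized) statement.
    """
    topic = topics[table_name]
    # Assumption: the third UDF argument (attributes) is a JSON object string.
    head = f"SELECT {udf}({sql_string(topic)}, json_data, '{{}}') FROM UNNEST(["
    tail = "]) AS json_data"
    base_size = len(head) + len(tail)

    queries, batch, size = [], [], base_size
    for row in rows:
        literal = sql_string(json.dumps(row, sort_keys=True))
        cost = len(literal) + (2 if batch else 0)  # 2 chars for the ", " separator
        if batch and size + cost > max_size:
            queries.append(head + ", ".join(batch) + tail)
            batch, size = [], base_size
            cost = len(literal)
        batch.append(literal)
        size += cost
    if batch:
        queries.append(head + ", ".join(batch) + tail)
    return queries


if __name__ == "__main__":
    topics = {
        "data_monitoring_metrics":
            "projects/_project_/topics/elementary-monitoring-metrics-topic",
    }
    udf = "_project_._dataset_.publish_to_pubsub_function"
    rows = [{"id": i, "metric_value": i * 1.5} for i in range(3)]
    for query in build_publish_queries("data_monitoring_metrics", rows, udf, topics, 100000):
        print(query)
```

Under these assumptions, the row data reaches BigQuery through Pub/Sub, so the statement size depends only on the JSON payloads being sent rather than on a full INSERT statement for the target table.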

This is an example for publish_to_pubsub_function:
CREATE OR REPLACE FUNCTION project.dataset.publish_to_pubsub_function(
  pubsub_topic STRING,
  json_data STRING,
  attributes STRING
)
RETURNS STRING
REMOTE WITH CONNECTION ....
OPTIONS (endpoint = ......, max_batching_rows = 500);
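
For context, a BigQuery remote function sends its endpoint a JSON body whose "calls" field holds one argument list per row, and expects a "replies" list of the same length in return. The following is a minimal Python sketch of what the handler behind the endpoint might look like. It is not the function deployed for this PR: the handler name, the injected publish callable, and the assumption that attributes is a JSON object string are all hypothetical.

```python
import json
from typing import Callable, Tuple


def handle_remote_function_call(
    body: dict,
    publish: Callable[[str, bytes, dict], str],
) -> Tuple[dict, int]:
    """Handle one BigQuery remote function request.

    body["calls"] holds one [pubsub_topic, json_data, attributes] list per row,
    matching the UDF signature above. publish is injected so the sketch runs
    without Google Cloud libraries; it should send one message and return an ID.
    """
    try:
        replies = []
        for pubsub_topic, json_data, attributes in body["calls"]:
            # Assumption: attributes is a JSON object string, or NULL/empty.
            attrs = json.loads(attributes) if attributes else {}
            message_id = publish(pubsub_topic, json_data.encode("utf-8"), attrs)
            replies.append(message_id)
        # One reply per call, in the same order as the incoming rows.
        return {"replies": replies}, 200
    except Exception as exc:
        # A non-200 status with "errorMessage" reports the batch as failed.
        return {"errorMessage": str(exc)}, 400


# With the google-cloud-pubsub package, publish could plausibly be:
#   client = pubsub_v1.PublisherClient()
#   publish = lambda topic, data, attrs: client.publish(topic, data, **attrs).result()

if __name__ == "__main__":
    def fake_publish(topic: str, data: bytes, attrs: dict) -> str:
        print("publishing to", topic, data, attrs)
        return "message-id"

    request_body = {"calls": [["projects/p/topics/t", '{"a": 1}', '{"source": "dbt"}']]}
    print(handle_remote_function_call(request_body, fake_publish))
```

With max_batching_rows = 500, each request would carry at most 500 such calls, so the handler publishes at most 500 messages per invocation.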
