Automatically load data from Google Cloud Storage files into BigQuery tables

tfabien/bigquery-autojob

bigquery-autojob

Note: Documentation is currently not in sync with the rewrite, updating soon...

A Google Cloud Function providing a simple and configurable way to automatically load data from GCS files into BigQuery tables.

It follows a convention-over-configuration approach and provides a sensible default configuration for common file formats (CSV, JSON, AVRO, ORC, Parquet):

  • The table name is derived from the file name, minus the extension and any date/timestamp suffix.
  • BigQuery's autodetect option is enabled.
  • Avro logical types are used.
  • New data is appended to the table.
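
The defaults listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `deriveTableName`, `defaultLoadOptions`, `DATE_SUFFIX` and `FORMAT_BY_EXTENSION` are hypothetical names, the date/timestamp suffix pattern is an assumption (the README does not say which suffixes are recognised), and the mapping from `.json` to newline-delimited JSON is also assumed. The option names (`sourceFormat`, `autodetect`, `useAvroLogicalTypes`, `writeDisposition`) are those of a BigQuery load job configuration.

```typescript
// Assumed suffix pattern: a separator followed by YYYYMMDD, optionally
// followed by HHMMSS (e.g. _20190131 or _20190131_235959).
const DATE_SUFFIX = /[-_.]\d{8}(?:[-_T]?\d{6})?$/;

type SourceFormat = "CSV" | "NEWLINE_DELIMITED_JSON" | "AVRO" | "ORC" | "PARQUET";

// Assumed extension-to-format mapping for the formats named in the README.
const FORMAT_BY_EXTENSION: Record<string, SourceFormat> = {
  csv: "CSV",
  json: "NEWLINE_DELIMITED_JSON",
  avro: "AVRO",
  orc: "ORC",
  parquet: "PARQUET",
};

// Hypothetical helper: file name, minus extension, minus date/timestamp suffix.
function deriveTableName(objectName: string): string {
  // Keep only the last path segment of the GCS object name.
  const base = objectName.split("/").pop() ?? objectName;
  // Drop the extension (everything from the last dot onwards).
  const dot = base.lastIndexOf(".");
  const stem = dot > 0 ? base.slice(0, dot) : base;
  // Drop a trailing date or timestamp, if any.
  return stem.replace(DATE_SUFFIX, "");
}

// Hypothetical helper: the default load options described in the list above.
function defaultLoadOptions(objectName: string) {
  const ext = objectName.slice(objectName.lastIndexOf(".") + 1).toLowerCase();
  const sourceFormat = FORMAT_BY_EXTENSION[ext];
  if (!sourceFormat) {
    throw new Error(`Unsupported file extension: ${ext}`);
  }
  return {
    sourceFormat,
    autodetect: true,                 // autodetect enabled
    useAvroLogicalTypes: true,        // Avro logical types are used
    writeDisposition: "WRITE_APPEND", // new data is appended to the table
  };
}
```

Under these assumptions, `deriveTableName("exports/cities_20190131.csv")` yields `cities`, and `defaultLoadOptions("cities_20190131.csv")` selects the `CSV` source format with append semantics.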

If the default behaviour does not suit your needs, it can be modified for all or certain files through mapping files or custom metadata.

Quickstart

  • Create a new bq-autoload Google Cloud Storage bucket

    $> gsutil mb -c regional -l europe-west1 "gs://bq-autoload"
  • Create a new Staging BigQuery dataset

    $> bq mk --dataset "Staging"
  • Clone and deploy this repository as a cloud function triggered by changes on this GCS bucket (do not forget to replace the project id)

    $> git clone "https://github.com/tfabien/bigquery-autoload/"              \
       && cd "bigquery-autoload"                                              \
       && npm install -g typescript                                           \
       && npm install                                                         \
       && npm run build                                                       \
       && gcloud functions deploy "bq-autoload"                               \
              --entry-point autoload                                          \
              --trigger-bucket "bq-autoload"                                  \
              --set-env-vars "PROJECT_ID={{YOUR_GCP_PROJECT_ID}}"             \
              --runtime "nodejs10"                                            \
              --memory "128MB"                                                \
              --region europe-west1

That's it 👍

Any file you upload to the bq-autoload GCS bucket will now automatically be loaded into a BigQuery table within seconds.

Usage

See the wiki for usage samples and advanced configuration.