
Datacatalog Fileset Processor


A package to manage Google Cloud Data Catalog Fileset scripts.

Disclaimer: This is not an officially supported Google product.

Executing in Cloud Shell

# Set your SERVICE ACCOUNT, for instructions go to 1.3. Auth credentials
# This name is just a suggestion, feel free to name it following your naming conventions
export GOOGLE_APPLICATION_CREDENTIALS=~/datacatalog-fileset-processor-sa.json

# Install datacatalog-fileset-processor
pip3 install datacatalog-fileset-processor --user

# Add to your PATH
export PATH=~/.local/bin:$PATH

# Look for available commands
datacatalog-fileset-processor --help

1. Environment setup

1.1. Python + virtualenv

Using virtualenv is optional, but strongly recommended unless you use Docker.

1.1.1. Install Python 3.6+

1.1.2. Get the source code

git clone https://github.com/mesmacosta/datacatalog-fileset-processor
cd ./datacatalog-fileset-processor

All paths starting with ./ in the next steps are relative to the datacatalog-fileset-processor folder.

1.1.3. Create and activate an isolated Python environment

pip install --upgrade virtualenv
python3 -m virtualenv --python python3 env
source ./env/bin/activate

1.1.4. Install the package

pip install --upgrade .

1.2. Docker

Docker may be used as an alternative to run the script. In this case, please disregard the Virtualenv setup instructions.

1.3. Auth credentials

1.3.1. Create a service account and grant it the roles below

  • Data Catalog Admin

1.3.2. Download a JSON key and save it as

  • ./credentials/datacatalog-fileset-processor-sa.json

This name is just a suggestion; feel free to name it following your naming conventions.

1.3.3. Set the environment variables

This step may be skipped if you're using Docker.

export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-fileset-processor-sa.json
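
A quick sanity check (a sketch, using the suggested key file name from step 1.3.2) helps catch a missing or mistyped path before running the CLI:

```shell
# Point the Google client libraries at the service account key
# (the file name below is just the suggested convention from step 1.3.2).
export GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-fileset-processor-sa.json

# Fail early with a clear message if the key file is not where we expect it
if [ ! -f "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
  echo "WARNING: no key file at $GOOGLE_APPLICATION_CREDENTIALS" >&2
fi
```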

2. Create Filesets from CSV file

2.1. Create a CSV file representing the Entry Groups and Entries to be created

Each Fileset is composed of as many CSV lines as required to represent all of its fields. The columns are described as follows:

Column                     Description                 Mandatory
-------------------------  --------------------------  ---------
entry_group_name           Entry Group Name.           Y
entry_group_display_name   Entry Group Display Name.   N
entry_group_description    Entry Group Description.    N
entry_id                   Entry ID.                   Y
entry_display_name         Entry Display Name.         Y
entry_description          Entry Description.          N
entry_file_patterns        Entry File Patterns.        Y
schema_column_name         Schema column name.         N
schema_column_type         Schema column type.         N
schema_column_description  Schema column description.  N
schema_column_mode         Schema column mode.         N

Please note that schema_column_type is an open string field and accepts any value. If you want to use your Fileset with Dataflow SQL, follow the data types in the official docs.
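
As a concrete illustration, the snippet below writes a minimal example CSV for a single Fileset Entry with a two-column schema. The project, location, entry group, bucket, and column names are all hypothetical; adjust them to your environment. Following the one-line-per-field layout described above, the entry columns are repeated on every line of the same Fileset:

```shell
# Sketch: example CSV for one Fileset Entry ("my_fileset") with two schema
# columns. All identifiers (my-project, my_entry_group, my_bucket, name, age)
# are placeholders, not values the tool requires.
cat > filesets.csv <<'EOF'
entry_group_name,entry_group_display_name,entry_group_description,entry_id,entry_display_name,entry_description,entry_file_patterns,schema_column_name,schema_column_type,schema_column_description,schema_column_mode
projects/my-project/locations/us-central1/entryGroups/my_entry_group,My Entry Group,Example entry group,my_fileset,My Fileset,Example fileset,gs://my_bucket/*.csv,name,STRING,Customer name,REQUIRED
projects/my-project/locations/us-central1/entryGroups/my_entry_group,My Entry Group,Example entry group,my_fileset,My Fileset,Example fileset,gs://my_bucket/*.csv,age,INT64,Customer age,NULLABLE
EOF
```

With the file in place, the commands in 2.2 and 2.3 can be pointed at it via --csv-file filesets.csv.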

2.2. Run the datacatalog-fileset-processor script to create the Fileset Entry Groups and Entries

  • Python + virtualenv
datacatalog-fileset-processor filesets create --csv-file CSV_FILE_PATH

2.3. Run the datacatalog-fileset-processor script to delete the Fileset Entry Groups and Entries

  • Python + virtualenv
datacatalog-fileset-processor filesets delete --csv-file CSV_FILE_PATH

