What is the best way to run a Meltano pipeline in 1000 variations? #7257
This is probably a bit of an aside, given it's not a wholly Meltano solution, but it might be of interest. I think we could do this in our environment as of now; we run Meltano via AWS ECS. We already run many Airflow tasks using the same connections, but split across multiple tasks to sync different sets of tables/endpoints. To spin up lots of tasks, we could pass container overrides like:

```python
'command': [
    'string',
],
'environment': [
    {
        'name': 'string',
        'value': 'string'
    },
]
```

Tee up each task in your DAG with different parameters, i.e.:

```python
'command': [
    'meltano', 'run', 'tap1', 'target1', ...
],
'environment': [
    {
        'name': 'TAP1_HOST',
        'value': 'tiger'
    },
    {
        'name': 'TAP1_PORT',
        ...
    },
]
```

Shared credentials could be set in the …
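A minimal sketch of how those overrides might be assembled for `boto3`'s `ecs.run_task` (the container name, cluster, task definition, and the concrete host/port values are placeholder assumptions, not from this thread):

```python
def build_overrides(command, environment):
    """Build the ECS containerOverrides payload for one pipeline variation."""
    return {
        'containerOverrides': [
            {
                'name': 'meltano',  # container name in the task definition (placeholder)
                'command': command,
                'environment': environment,
            },
        ],
    }

# One override payload per variation, each with its own connection settings:
overrides = build_overrides(
    ['meltano', 'run', 'tap1', 'target1'],
    [
        {'name': 'TAP1_HOST', 'value': 'tiger'},
        {'name': 'TAP1_PORT', 'value': '5432'},  # illustrative value
    ],
)

# In the Airflow task this would then be launched along the lines of:
# ecs = boto3.client('ecs')
# ecs.run_task(cluster='my-cluster', taskDefinition='meltano-task',
#              launchType='FARGATE', overrides=overrides)
```

Each DAG task would call this with a different command/environment pair, so the 1000 variations become 1000 ECS task launches sharing one task definition.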
**First challenge: how to invoke the same pipeline with 100 or 1000 different client connection strings or account numbers**

The standard way to implement this as of today would be to declare a `state-id-suffix` in your environment definition, using a unique string for each customer or each variation of the pipeline. This ensures that each variation has its own incremental state tracking.

From one layer above where you invoke Meltano (for instance, an orchestration tool like Airflow, GitHub Actions, etc.), you would then declare a matrix of invocation parameters and simply invoke Meltano once for each of the *n* configuration sets ("configsets").

**Second challenge: how to store and retrieve config for those multiple variations**

Meltano does not solve this as of now. The orchestrator would loop through each configset and invoke Meltano with the proper context and credentials declared in environment variables. Shared config values could be stored in the Meltano environment's systemdb, or in the `meltano.yml` file, but Meltano itself does not handle multiple configsets per environment.