
## How to parameterize a DBX Python Notebook #841

Open
ssr8998 opened this issue Aug 31, 2023 · 5 comments

ssr8998 commented Aug 31, 2023

The overall goal is to make the database name (prod/dev/test) dynamic for each notebook in a dbx job, passing the database name directly from Jenkins without modifying the notebook file or the deployment.yaml file for each environment.
I am creating a dbx job with a few Databricks notebooks, and I want to pass the database name dynamically into each Python notebook without using Databricks widgets (assuming I use sys.argv to read the dbx CLI parameter). I want to run my job with something like:

dbx launch --job "my_job_name" --parameters='{"db_name": "my_db_name"}'

which would send that info to my job and to every associated notebook, which would read it from conf/deployment.yaml. In the deployment.yaml file I would have something like:

notebook_task:
  notebook_path: "/Repos/My_github_repo/blala/notebookname"
  base_parameters:
    db_name: "{{ env.db_name_from_env }}"

Your Environment

  • dbx version used: 0.7.4
  • databricks-cli: 0.17.3
  • spark_version: 12.2.x-scala2.12
  • Databricks Runtime version: 12.2 LTS or above

doug-cresswell commented Sep 4, 2023

Edit: I did not realise you specified a notebook task; updated, with the original comment left underneath.
Edit 2: Updated the CLI snippets to use the same environment as the yml example.

To pass a value from a local environment variable to a notebook task, you should instead define the environment variable in the cluster configuration and read it inside the notebook, e.g. database_name = os.environ.get('DATABASE_NAME'). This can be done in deployment.yml.

  basic-cluster: &basic-cluster
    new_cluster:
      spark_version: "10.4.x-cpu-ml-scala2.12"
      spark_conf:
        <<: *basic-spark-conf
        spark.databricks.passthrough.enabled: false
      spark_env_vars:
        DATABASE_NAME: "{{ env['DATABASE_NAME'] }}"
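
Inside the notebook the variable can then be read from the process environment. A minimal sketch, assuming the DATABASE_NAME variable from the cluster configuration above (the "dev" fallback is only an illustrative default):

import os

# DATABASE_NAME is injected by the cluster via spark_env_vars above;
# the "dev" fallback is just an illustrative default for local runs.
database_name = os.environ.get("DATABASE_NAME", "dev")
print(f"Using database: {database_name}")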

deployment.yml reference

See original comment below for how to use jinja with the deployment file.


Original comment

It is probably better practice to deploy separate workflows for separate environments, but to answer your question you can use the jinja support functionality (Jinja Support) combined with environment variables.

Also see Passing Parameters

Your deployment file should look something like this:
conf/deployment.yml.j2

build:
  python: "pip"

environments:
  default:
    workflows:
      - name: "my-workflow"
        tasks:
          - task_key: "task1"
            python_wheel_task:
              package_name: "some-pkg"
              entry_point: "some-ep"
              parameters: ["database_name", "{{ env['DATABASE_NAME'] }}"]

Deploy via CLI

export DATABASE_NAME=dev
dbx deploy --environment default --deployment-file conf/deployment.yml.j2 "my-workflow"

Launch via CLI

dbx launch --environment default --parameters="{\"python_params\": [\"database_name\", \"${DATABASE_NAME}\"]}" "my-workflow"
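
The wheel's entry point receives these values as plain command-line arguments. A minimal sketch of reading them, assuming the some-pkg / some-ep entry point above and the simple ["database_name", "<value>"] convention:

import sys

def main() -> None:
    # python_wheel_task parameters arrive as ordinary argv entries,
    # e.g. ["database_name", "dev"], not as a single JSON string.
    args = sys.argv[1:]
    params = dict(zip(args[0::2], args[1::2]))
    database_name = params.get("database_name", "dev")
    print(f"Using database: {database_name}")

if __name__ == "__main__":
    main()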

Note that you will need to append the .j2 extension to your yaml file, or alternatively enable in-place Jinja support in your project configuration.


ssr8998 commented Sep 6, 2023

I tried to follow your steps.
Here is how my deployment.yaml.j2 looks:

{% set db_name = env['db_name'] | default('name_of_my_db') %}
...basic config etc. etc...

spark_python_task:
  python_file: "file://my_path_/name_of_python_notebook_converted_to_job.py"
  parameters: ["db_name", "{{ env['db_name'] }}"]
...

Now I am trying to access this database name in my name_of_python_notebook_converted_to_job.py by calling:

db_name = json.loads(sys.argv[1]).get('python_params', [])[1]

I am calling the dbx CLI like:

dbx deploy --deployment-file conf/deployment.yaml.j2 "name_of_my_work_flow"

and then, to launch the job:

dbx launch --parameters='{"python_params": ["db_name", "${db_name}"]}' "name_of_my_work_flow"

It looks like my job can't read from sys.argv. I am getting the error JSONDecodeError: Expecting value: line 1 column 1 (char 0) at:

----> db_name = json.loads(sys.argv[1]).get('python_params', [])[1]
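
A minimal sketch of reading such parameters, assuming a spark_python_task passes them to the script as separate command-line arguments rather than as a single JSON string:

import sys

# Assumes the deployment-file parameters arrive as individual argv entries,
# e.g. ["db_name", "my_db_name"], not as one JSON-encoded string.
args = sys.argv[1:]
params = dict(zip(args[0::2], args[1::2]))
db_name = params.get("db_name")
print(f"db_name = {db_name}")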


ssr8998 commented Sep 6, 2023

If I use:

export DATABASE_NAME=dev
dbx deploy -e dev --deployment-file conf/deployment.yml.j2 "my-workflow"

it complains that "environment dev not found in the project file .dbx/project.json". In my project.json I have environment --> default --> profile, storage_type, properties --> workspace_directory, artifact_location.


doug-cresswell commented Sep 7, 2023

JSONDecodeError

Notebooks use widgets to pass parameters, so you cannot pass parameters to a notebook task the way you would for an entry point in a Python wheel. You either need to use widgets, or define environment variables on the cluster using spark_env_vars. This way the environment variables will be available to the notebook through os.environ.
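
A minimal sketch of both options inside the notebook (the db_name / DATABASE_NAME names are just illustrative):

# Option 1: read a base_parameters value through a widget
dbutils.widgets.text("db_name", "dev")        # registers the widget with a default
db_name = dbutils.widgets.get("db_name")

# Option 2: read a cluster environment variable defined via spark_env_vars
import os
db_name = os.environ.get("DATABASE_NAME", "dev")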

Environment Not Found Error

For the error "environment dev not found in the project file .dbx/project.json", the environments defined in your deployment yaml must match those in your project.json file.

environments:
  default:

You can use the dbx configure command to set up new environments in your project if you need more than one. If not, simply remove the -e / --environment flag from your CLI commands and the "default" environment will be used instead.
dbx configure docs
project.json docs
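
A minimal sketch of registering a dev environment (the profile name is illustrative, and flag names may differ slightly between dbx versions):

# registers a "dev" environment in .dbx/project.json, pointing at the "dev" Databricks CLI profile
dbx configure --environment dev --profile dev

# after which the earlier deploy command should find the environment
dbx deploy -e dev --deployment-file conf/deployment.yml.j2 "my-workflow"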


ssr8998 commented Sep 7, 2023 via email
