[BUG] Azure Databricks disk_offload error #11907

Open
1 of 23 tasks
vitaliy-sharandin opened this issue May 5, 2024 · 10 comments
Labels
bug Something isn't working

Comments

vitaliy-sharandin commented May 5, 2024

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Azure Databricks

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

mlflow==2.12.1

System information

Azure Databricks cluster (14.3 ML)
accelerate==0.29.3
peft==0.10.0
torch==2.3.0
torchvision==0.18.0
transformers==4.41.0.dev0

Describe the problem

I get a disk_offload error whenever I try to register a model in Unity Catalog.

Tracking information

REPLACE_ME

Code to reproduce issue

catalog = "model_registry"
schema = "default"
model_name = "psy-ai"
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri="runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained",
    name=f"{catalog}.{schema}.{model_name}"
)
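
For readers unfamiliar with the two string arguments above: model_uri has the shape runs:/<run_id>/<artifact_path>, and the Unity Catalog name has three dot-separated levels. A minimal sketch of both shapes follows. The helper names split_runs_uri and uc_model_name are hypothetical and are not MLflow APIs.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Join the three levels used to address a model in Unity Catalog."""
    parts = (catalog, schema, model)
    if any(not part for part in parts):
        raise ValueError(f"every level must be non-empty: {parts}")
    return ".".join(parts)


def split_runs_uri(model_uri: str) -> tuple[str, str]:
    """Split 'runs:/<run_id>/<artifact_path>' into (run_id, artifact_path)."""
    prefix = "runs:/"
    if not model_uri.startswith(prefix):
        raise ValueError(f"not a runs:/ URI: {model_uri!r}")
    # Only the first slash separates the run ID; the artifact path may contain more.
    run_id, _, artifact_path = model_uri[len(prefix):].partition("/")
    if not run_id or not artifact_path:
        raise ValueError(f"missing run ID or artifact path: {model_uri!r}")
    return run_id, artifact_path


name = uc_model_name("model_registry", "default", "psy-ai")
run_id, artifact_path = split_runs_uri(
    "runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained"
)
```

Note that the artifact path in this report itself contains a slash, so only the first path segment is the run ID.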

Stack trace

MlflowException: Failed to download the model weights from the HuggingFace hub and cannot register the model in the Unity Catalog. Please ensure that the model was saved with the correct reference to the HuggingFace hub repository and that you have access to fetch model weights from the defined repository.
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:627, in UcModelRegistryStore._download_model_weights_if_not_saved(self, local_model_path)
    626 try:
--> 627     mlflow.transformers.persist_pretrained_model(local_model_path)
    628 except Exception as e:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1080, in persist_pretrained_model(model_uri)
   1079 local_model_path = artifact_repo.download_artifacts(artifact_path, dst_path=tmp_dir.path())
-> 1080 pipeline = load_model(local_model_path, return_type="pipeline")
   1082 # Update MLModel flavor config
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/utils/docstring_utils.py:379, in docstring_version_compatibility_warning.<locals>.annotated_func.<locals>.version_func(*args, **kwargs)
    378     warnings.warn(notice, category=FutureWarning, stacklevel=2)
--> 379 return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1023, in load_model(model_uri, dst_path, return_type, device, **kwargs)
   1021 _add_code_from_conf_to_system_path(local_model_path, flavor_config)
-> 1023 return _load_model(local_model_path, flavor_config, return_type, device, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1211, in _load_model(path, flavor_config, return_type, device, **kwargs)
   1210 if peft_adapter_dir := flavor_config.get(FlavorKey.PEFT, None):
-> 1211     model_and_components[FlavorKey.MODEL] = get_model_with_peft_adapter(
   1212         base_model=model_and_components[FlavorKey.MODEL],
   1213         peft_adapter_path=os.path.join(path, peft_adapter_dir),
   1214     )
   1216 conf = {**conf, **model_and_components}
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/peft.py:50, in get_model_with_peft_adapter(base_model, peft_adapter_path)
     48 from peft import PeftModel
---> 50 return PeftModel.from_pretrained(base_model, peft_adapter_path)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/peft/peft_model.py:356, in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, **kwargs)
    355     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config, adapter_name)
--> 356 model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
    357 return model
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/peft/peft_model.py:760, in PeftModel.load_adapter(self, model_id, adapter_name, is_trainable, **kwargs)
    757     device_map = infer_auto_device_map(
    758         self, max_memory=max_memory, no_split_module_classes=no_split_module_classes
    759     )
--> 760 dispatch_model(
    761     self,
    762     device_map=device_map,
    763     offload_dir=offload_dir,
    764     **dispatch_model_kwargs,
    765 )
    766 hook = AlignDevicesHook(io_same_device=True)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/accelerate/big_modeling.py:490, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
    489     else:
--> 490         raise ValueError(
    491             "You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead."
    492         )
    493 # Convert OrderedDict back to dict for easier usage
ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.

The above exception was the direct cause of the following exception:
MlflowException                           Traceback (most recent call last)
File <command-3799569696053756>, line 5
      3 model_name = "psy-ai"
      4 mlflow.set_registry_uri("databricks-uc")
----> 5 mlflow.register_model(
      6     model_uri="runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained",
      7     name=f"{catalog}.{schema}.{model_name}"
      8 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/fluent.py:77, in register_model(model_uri, name, await_registration_for, tags)
     17 def register_model(
     18     model_uri,
     19     name,
   (...)
     22     tags: Optional[Dict[str, Any]] = None,
     23 ) -> ModelVersion:
     24     """Create a new model version in model registry for the model files specified by ``model_uri``.
     25 
     26     Note that this method assumes the model registry backend URI is the same as that of the
   (...)
     75         Version: 1
     76     """
---> 77     return _register_model(
     78         model_uri=model_uri, name=name, await_registration_for=await_registration_for, tags=tags
     79     )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/fluent.py:112, in _register_model(model_uri, name, await_registration_for, tags, local_model_path)
    109     source = RunsArtifactRepository.get_underlying_uri(model_uri)
    110     (run_id, _) = RunsArtifactRepository.parse_runs_uri(model_uri)
--> 112 create_version_response = client._create_model_version(
    113     name=name,
    114     source=source,
    115     run_id=run_id,
    116     tags=tags,
    117     await_creation_for=await_registration_for,
    118     local_model_path=local_model_path,
    119 )
    120 eprint(
    121     f"Created version '{create_version_response.version}' of model "
    122     f"'{create_version_response.name}'."
    123 )
    124 return create_version_response
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/client.py:2861, in MlflowClient._create_model_version(self, name, source, run_id, tags, run_link, description, await_creation_for, local_model_path)
   2853     # NOTE: we can't easily delete the target temp location due to the async nature
   2854     # of the model version creation - printing to let the user know.
   2855     eprint(
   2856         f"=== Source model files were copied to {new_source}"
   2857         + " in the model registry workspace. You may want to delete the files once the"
   2858         + " model version is in 'READY' status. You can also find this location in the"
   2859         + " `source` field of the created model version. ==="
   2860     )
-> 2861 return self._get_registry_client().create_model_version(
   2862     name=name,
   2863     source=new_source,
   2864     run_id=run_id,
   2865     tags=tags,
   2866     run_link=run_link,
   2867     description=description,
   2868     await_creation_for=await_creation_for,
   2869     local_model_path=local_model_path,
   2870 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/client.py:215, in ModelRegistryClient.create_model_version(self, name, source, run_id, tags, run_link, description, await_creation_for, local_model_path)
    213 arg_names = _get_arg_names(self.store.create_model_version)
    214 if "local_model_path" in arg_names:
--> 215     mv = self.store.create_model_version(
    216         name,
    217         source,
    218         run_id,
    219         tags,
    220         run_link,
    221         description,
    222         local_model_path=local_model_path,
    223     )
    224 else:
    225     # Fall back to calling create_model_version without
    226     # local_model_path since old model registry store implementations may not
    227     # support the local_model_path argument.
    228     mv = self.store.create_model_version(name, source, run_id, tags, run_link, description)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:714, in UcModelRegistryStore.create_model_version(self, name, source, run_id, tags, run_link, description, local_model_path)
    712 with self._local_model_dir(source, local_model_path) as local_model_dir:
    713     self._validate_model_signature(local_model_dir)
--> 714     self._download_model_weights_if_not_saved(local_model_dir)
    715     feature_deps = get_feature_dependencies(local_model_dir)
    716     other_model_deps = get_model_version_dependencies(local_model_dir)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:629, in UcModelRegistryStore._download_model_weights_if_not_saved(self, local_model_path)
    627     mlflow.transformers.persist_pretrained_model(local_model_path)
    628 except Exception as e:
--> 629     raise MlflowException(
    630         "Failed to download the model weights from the HuggingFace hub and cannot register "
    631         "the model in the Unity Catalog. Please ensure that the model was saved with the "
    632         "correct reference to the HuggingFace hub repository and that you have access to "
    633         "fetch model weights from the defined repository.",
    634         error_code=INTERNAL_ERROR,
    635     ) from e
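
The ValueError in the first traceback comes from accelerate's dispatch_model, and its message indicates that the device map it received placed every module on "disk". A rough sketch of that condition, written from the error message and traceback alone; classify_device_map and its return labels are hypothetical and this is not accelerate's actual code.

```python
def classify_device_map(device_map: dict) -> str:
    """Label a module-to-device mapping as 'mixed', 'all-disk', or 'single-device'."""
    if not device_map:
        raise ValueError("device_map is empty")
    devices = set(device_map.values())
    if len(devices) > 1:
        # Modules are spread over several targets, e.g. GPU 0, "cpu", and "disk".
        return "mixed"
    if devices == {"disk"}:
        # The case the error message describes: the whole model is offloaded to disk.
        return "all-disk"
    return "single-device"


all_disk = classify_device_map({"model.embed_tokens": "disk", "lm_head": "disk"})
mixed = classify_device_map({"model.embed_tokens": 0, "model.layers": "cpu", "lm_head": "disk"})
single = classify_device_map({"": "cpu"})
```

In the traceback, the map is produced by infer_auto_device_map inside PeftModel.load_adapter, so an "all-disk" result there would suggest that no GPU or CPU memory budget was available to the inference step. That reading is an assumption, not something the trace proves.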

Other info / logs

REPLACE_ME

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

vitaliy-sharandin added the bug label May 5, 2024
@harupy
Member

harupy commented May 9, 2024

@vitaliy-sharandin Thanks for reporting this. Could you share your model logging code?

@harupy
Member

harupy commented May 9, 2024

I ran the following code but could not reproduce the error:

%pip install -U git+https://github.com/huggingface/transformers torch accelerate==0.29.3 mlflow

dbutils.library.restartPython()

########

import transformers
import torch


model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    token="...",
)

import mlflow

mlflow.set_registry_uri("databricks-uc")

# Log the pipeline to a run, then register that run's model in Unity Catalog
with mlflow.start_run() as run:
    mlflow.transformers.log_model(pipeline, "model")

mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=f"...",
)

@vitaliy-sharandin
Author

vitaliy-sharandin commented May 9, 2024

The main difference between our code is that I first fine-tune adapters with PEFT and then try to register a run that contains only the adapters and a reference to the base model, without the base model weights. I have also read the MLflow Transformers guide, which says you don't need to call mlflow.transformers.persist_pretrained_model() when registering a model to Unity Catalog, so my code should work, as that is exactly what I am doing.

Here is my notebook:
https://github.com/vitaliy-sharandin/data_science_projects/blob/master/portfolio/nlp/fine-tuned-llm/psy_ai_mlflow_tracking_deployment.ipynb

@harupy
Member

harupy commented May 10, 2024

Thanks for the notebook! Let me run it and see if I can reproduce the issue.

@harupy
Member

harupy commented May 10, 2024

@vitaliy-sharandin Can you try inserting this code before loading the model to see if it can fix the error?

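import mlflow.transformers


def get_model_with_peft_adapter(base_model, peft_adapter_path):
    from peft import PeftModel

    # PEFT passes offload_folder on to accelerate's dispatch_model as offload_dir
    return PeftModel.from_pretrained(base_model, peft_adapter_path, offload_folder="offload")


# Replace the helper that mlflow.transformers calls when it loads a PEFT model
mlflow.transformers.get_model_with_peft_adapter = get_model_with_peft_adapter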

I'm not sure offload_folder is the only way to fix this issue, but I want to give it a try.
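
Since the suggestion above replaces a module attribute globally, one option is to scope the replacement to the registration call and restore the original afterwards. A minimal sketch, assuming nothing beyond the standard library; patched_attr is a hypothetical helper, and the demo object stands in for the mlflow.transformers module.

```python
import types
from contextlib import contextmanager


@contextmanager
def patched_attr(obj, name, replacement):
    """Temporarily set obj.<name> to replacement and restore the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield original
    finally:
        # Runs even if the body raises, so the patch never outlives the block.
        setattr(obj, name, original)


# Stand-in object used only to demonstrate the behaviour.
demo = types.SimpleNamespace(loader=lambda: "original")
with patched_attr(demo, "loader", lambda: "patched"):
    inside = demo.loader()
after = demo.loader()
```

Applied to this thread, the body of the with block would be the mlflow.register_model(...) call, with mlflow.transformers as obj and "get_model_with_peft_adapter" as name.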

@vitaliy-sharandin
Author

vitaliy-sharandin commented May 10, 2024

That doesn't quite make sense to me: I have no adapters to load before fine-tuning, so I have no value for the required peft_adapter_path argument.


@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

@harupy
Member

harupy commented May 13, 2024

@vitaliy-sharandin the traceback says get_model_with_peft_adapter is called.
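
One way to check a claim like this programmatically is to walk the exception chain ("The above exception was the direct cause of...") and collect the function names in each traceback. A minimal standard-library sketch; called_functions is a hypothetical helper, and the two demo functions only imitate the names seen in the report.

```python
import traceback


def called_functions(exc: BaseException) -> list[str]:
    """Collect function names from the tracebacks of exc and its chained causes."""
    names = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        for frame, _lineno in traceback.walk_tb(exc.__traceback__):
            names.append(frame.f_code.co_name)
        # Prefer the explicit cause ("raise ... from e"), else the implicit context.
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__
    return names


def get_model_with_peft_adapter():
    raise ValueError("You are trying to offload the whole model to the disk.")


def register_model():
    try:
        get_model_with_peft_adapter()
    except Exception as e:
        raise RuntimeError("Failed to download the model weights") from e


try:
    register_model()
except RuntimeError as err:
    names = called_functions(err)
```

Here names contains both register_model and get_model_with_peft_adapter, mirroring how the MlflowException in the report wraps the ValueError raised further down the call stack.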

@vitaliy-sharandin
Author

@harupy Sorry, I misunderstood your code at first. I did what you proposed and it led to a new error; please check the notebook.

@vitaliy-sharandin
Author

@harupy Any updates?
