[BUG] Azure Databricks disk_offload error #11907

vitaliy-sharandin · 2024-05-05T00:10:45Z

Issues Policy acknowledgement

I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Azure Databricks

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

mlflow==2.12.1

System information

Azure Databricks cluster, Databricks Runtime 14.3 ML
accelerate==0.29.3
peft==0.10.0
torch==2.3.0
torchvision==0.18.0
transformers==4.41.0.dev0

Describe the problem

I get a disk_offload error whenever I try to register a model in Unity Catalog.

Tracking information

REPLACE_ME

Code to reproduce issue

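import mlflow

# Three-level Unity Catalog name: <catalog>.<schema>.<model>
catalog = "model_registry"
schema = "default"
model_name = "psy-ai"

# Point the model registry at Unity Catalog, then register the logged run artifact.
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri="runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained",
    name=f"{catalog}.{schema}.{model_name}"
)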

Stack trace

MlflowException: Failed to download the model weights from the HuggingFace hub and cannot register the model in the Unity Catalog. Please ensure that the model was saved with the correct reference to the HuggingFace hub repository and that you have access to fetch model weights from the defined repository.
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:627, in UcModelRegistryStore._download_model_weights_if_not_saved(self, local_model_path)
    626 try:
--> 627     mlflow.transformers.persist_pretrained_model(local_model_path)
    628 except Exception as e:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1080, in persist_pretrained_model(model_uri)
   1079 local_model_path = artifact_repo.download_artifacts(artifact_path, dst_path=tmp_dir.path())
-> 1080 pipeline = load_model(local_model_path, return_type="pipeline")
   1082 # Update MLModel flavor config
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/utils/docstring_utils.py:379, in docstring_version_compatibility_warning.<locals>.annotated_func.<locals>.version_func(*args, **kwargs)
    378     warnings.warn(notice, category=FutureWarning, stacklevel=2)
--> 379 return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1023, in load_model(model_uri, dst_path, return_type, device, **kwargs)
   1021 _add_code_from_conf_to_system_path(local_model_path, flavor_config)
-> 1023 return _load_model(local_model_path, flavor_config, return_type, device, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/__init__.py:1211, in _load_model(path, flavor_config, return_type, device, **kwargs)
   1210 if peft_adapter_dir := flavor_config.get(FlavorKey.PEFT, None):
-> 1211     model_and_components[FlavorKey.MODEL] = get_model_with_peft_adapter(
   1212         base_model=model_and_components[FlavorKey.MODEL],
   1213         peft_adapter_path=os.path.join(path, peft_adapter_dir),
   1214     )
   1216 conf = {**conf, **model_and_components}
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/transformers/peft.py:50, in get_model_with_peft_adapter(base_model, peft_adapter_path)
     48 from peft import PeftModel
---> 50 return PeftModel.from_pretrained(base_model, peft_adapter_path)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/peft/peft_model.py:356, in PeftModel.from_pretrained(cls, model, model_id, adapter_name, is_trainable, config, **kwargs)
    355     model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config, adapter_name)
--> 356 model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
    357 return model
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/peft/peft_model.py:760, in PeftModel.load_adapter(self, model_id, adapter_name, is_trainable, **kwargs)
    757     device_map = infer_auto_device_map(
    758         self, max_memory=max_memory, no_split_module_classes=no_split_module_classes
    759     )
--> 760 dispatch_model(
    761     self,
    762     device_map=device_map,
    763     offload_dir=offload_dir,
    764     **dispatch_model_kwargs,
    765 )
    766 hook = AlignDevicesHook(io_same_device=True)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/accelerate/big_modeling.py:490, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
    489     else:
--> 490         raise ValueError(
    491             "You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead."
    492         )
    493 # Convert OrderedDict back to dict for easier usage
ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.

The above exception was the direct cause of the following exception:
MlflowException                           Traceback (most recent call last)
File <command-3799569696053756>, line 5
      3 model_name = "psy-ai"
      4 mlflow.set_registry_uri("databricks-uc")
----> 5 mlflow.register_model(
      6     model_uri="runs:/bcae496e43a0496da7a6eafe0ab569d8/NousResearch/Meta-Llama-3-8B-Instruct-peft-trained",
      7     name=f"{catalog}.{schema}.{model_name}"
      8 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/fluent.py:77, in register_model(model_uri, name, await_registration_for, tags)
     17 def register_model(
     18     model_uri,
     19     name,
   (...)
     22     tags: Optional[Dict[str, Any]] = None,
     23 ) -> ModelVersion:
     24     """Create a new model version in model registry for the model files specified by ``model_uri``.
     25 
     26     Note that this method assumes the model registry backend URI is the same as that of the
   (...)
     75         Version: 1
     76     """
---> 77     return _register_model(
     78         model_uri=model_uri, name=name, await_registration_for=await_registration_for, tags=tags
     79     )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/fluent.py:112, in _register_model(model_uri, name, await_registration_for, tags, local_model_path)
    109     source = RunsArtifactRepository.get_underlying_uri(model_uri)
    110     (run_id, _) = RunsArtifactRepository.parse_runs_uri(model_uri)
--> 112 create_version_response = client._create_model_version(
    113     name=name,
    114     source=source,
    115     run_id=run_id,
    116     tags=tags,
    117     await_creation_for=await_registration_for,
    118     local_model_path=local_model_path,
    119 )
    120 eprint(
    121     f"Created version '{create_version_response.version}' of model "
    122     f"'{create_version_response.name}'."
    123 )
    124 return create_version_response
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/client.py:2861, in MlflowClient._create_model_version(self, name, source, run_id, tags, run_link, description, await_creation_for, local_model_path)
   2853     # NOTE: we can't easily delete the target temp location due to the async nature
   2854     # of the model version creation - printing to let the user know.
   2855     eprint(
   2856         f"=== Source model files were copied to {new_source}"
   2857         + " in the model registry workspace. You may want to delete the files once the"
   2858         + " model version is in 'READY' status. You can also find this location in the"
   2859         + " `source` field of the created model version. ==="
   2860     )
-> 2861 return self._get_registry_client().create_model_version(
   2862     name=name,
   2863     source=new_source,
   2864     run_id=run_id,
   2865     tags=tags,
   2866     run_link=run_link,
   2867     description=description,
   2868     await_creation_for=await_creation_for,
   2869     local_model_path=local_model_path,
   2870 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/tracking/_model_registry/client.py:215, in ModelRegistryClient.create_model_version(self, name, source, run_id, tags, run_link, description, await_creation_for, local_model_path)
    213 arg_names = _get_arg_names(self.store.create_model_version)
    214 if "local_model_path" in arg_names:
--> 215     mv = self.store.create_model_version(
    216         name,
    217         source,
    218         run_id,
    219         tags,
    220         run_link,
    221         description,
    222         local_model_path=local_model_path,
    223     )
    224 else:
    225     # Fall back to calling create_model_version without
    226     # local_model_path since old model registry store implementations may not
    227     # support the local_model_path argument.
    228     mv = self.store.create_model_version(name, source, run_id, tags, run_link, description)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:714, in UcModelRegistryStore.create_model_version(self, name, source, run_id, tags, run_link, description, local_model_path)
    712 with self._local_model_dir(source, local_model_path) as local_model_dir:
    713     self._validate_model_signature(local_model_dir)
--> 714     self._download_model_weights_if_not_saved(local_model_dir)
    715     feature_deps = get_feature_dependencies(local_model_dir)
    716     other_model_deps = get_model_version_dependencies(local_model_dir)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3432a8e-11c0-44e8-a291-0f0be438b27d/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py:629, in UcModelRegistryStore._download_model_weights_if_not_saved(self, local_model_path)
    627     mlflow.transformers.persist_pretrained_model(local_model_path)
    628 except Exception as e:
--> 629     raise MlflowException(
    630         "Failed to download the model weights from the HuggingFace hub and cannot register "
    631         "the model in the Unity Catalog. Please ensure that the model was saved with the "
    632         "correct reference to the HuggingFace hub repository and that you have access to "
    633         "fetch model weights from the defined repository.",
    634         error_code=INTERNAL_ERROR,
    635     ) from e
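
The outer MlflowException message talks about the HuggingFace hub, but the chained ValueError raised by accelerate is the actual failure. As a minimal sketch of how to get at that underlying error programmatically, the snippet below walks the `__cause__` chain that `raise ... from e` creates. `WrapperError`, `register` and `root_cause` are hypothetical stand-ins written for illustration; they are not part of MLflow.

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow explicit `raise ... from e` chaining down to the first exception."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


class WrapperError(Exception):
    """Hypothetical stand-in for the outer MlflowException."""


def register():
    """Hypothetical stand-in that wraps a low-level error, like the trace above."""
    try:
        raise ValueError("You are trying to offload the whole model to the disk.")
    except Exception as e:
        raise WrapperError("Failed to download the model weights") from e


try:
    register()
except WrapperError as err:
    cause = root_cause(err)
    # Report the innermost exception rather than the wrapper's message.
    print(type(cause).__name__, "-", cause)
```

In a notebook, the same `root_cause` helper could be applied to the exception caught around `mlflow.register_model(...)` to log the original error type and message.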

Other info / logs

REPLACE_ME

What component(s) does this bug affect?

What interface(s) does this bug affect?

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

What language(s) does this bug affect?

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

What integration(s) does this bug affect?

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations


harupy · 2024-05-09T08:52:33Z

@vitaliy-sharandin Thanks for reporting this. Could you share your model logging code?

harupy · 2024-05-09T10:26:26Z

I ran the following code but could not reproduce the error:

%pip install -U git+https://github.com/huggingface/transformers torch accelerate==0.29.3 mlflow

dbutils.library.restartPython()

########

import transformers
import torch


model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    token="...",
)

import mlflow
import uuid

mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run() as run:
  mlflow.transformers.log_model(pipeline, "model")


mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=f"..."
)

vitaliy-sharandin · 2024-05-09T13:41:23Z

The main difference between your code and mine is that I first fine-tune adapters with peft and then try to register the run, which has only the adapters and a base model reference saved, without the model weights. I have also read the MLflow Transformers guide, which says that you don't need to call mlflow.transformers.persist_pretrained_model() when registering a model to Unity Catalog, so my code should work, as that is exactly what I am doing.

Here is my notebook:
https://github.com/vitaliy-sharandin/data_science_projects/blob/master/portfolio/nlp/fine-tuned-llm/psy_ai_mlflow_tracking_deployment.ipynb

harupy · 2024-05-10T01:14:34Z

Thanks for the notebook! Let me run it and see if I can reproduce the issue.

harupy · 2024-05-10T01:30:59Z

@vitaliy-sharandin Can you try inserting this code before loading the model to see if it can fix the error?

def get_model_with_peft_adapter(base_model, peft_adapter_path):
    from peft import PeftModel

    return PeftModel.from_pretrained(base_model, peft_adapter_path, offload_folder="offload")

mlflow.transformers.get_model_with_peft_adapter = get_model_with_peft_adapter

Not sure if offload_folder is the only way to fix this issue, but I want to give it a try.

vitaliy-sharandin · 2024-05-10T14:57:42Z

It doesn't quite make sense, as I don't have adapters to load before tuning the model, so I don't have a value for the required peft_adapter_path argument.

github-actions · 2024-05-12T00:14:28Z

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

harupy · 2024-05-13T06:48:38Z

@vitaliy-sharandin the traceback says get_model_with_peft_adapter is called.
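
To illustrate why replacing the function takes effect without the user ever calling it directly, here is a minimal sketch of the rebinding mechanism. It assumes the library code calls the function by its bare module-level name, as the `_load_model` frame in the traceback suggests. `fake_lib`, `get_adapter` and `load` are hypothetical stand-ins, not MLflow code.

```python
import types

# Hypothetical stand-in for a library module (not MLflow): `load` calls
# `get_adapter` by bare name, so the name is resolved in the module's
# globals each time `load` runs.
SOURCE = '''
def get_adapter(base_model, adapter_path, **kwargs):
    return {"base": base_model, "adapter": adapter_path, **kwargs}

def load(base_model, adapter_path):
    return get_adapter(base_model, adapter_path)
'''
fake_lib = types.ModuleType("fake_lib")
exec(SOURCE, fake_lib.__dict__)

before = fake_lib.load("base", "adapter_dir")

original = fake_lib.get_adapter


def get_adapter_with_offload(base_model, adapter_path):
    # Forward an extra keyword, as the suggested patch does with offload_folder.
    return original(base_model, adapter_path, offload_folder="offload")


# Rebinding the module attribute changes what `load` finds on its next call.
fake_lib.get_adapter = get_adapter_with_offload

after = fake_lib.load("base", "adapter_dir")
print(before)
print(after)
```

The rebinding has to happen in the same Python process and before the call that triggers loading. If the installed library version resolved the function some other way, for example through a function-local import from another module, the patch would need to target that module instead; that is worth checking against the installed source.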

vitaliy-sharandin · 2024-05-15T19:44:00Z

@harupy Sorry, I misunderstood your code at first. I did what you proposed and it led to a new error; please check out the notebook.

vitaliy-sharandin · 2024-05-20T18:58:22Z

@harupy Any updates?

vitaliy-sharandin added the bug Something isn't working label May 5, 2024
