
ONNX models crash when they are used in Colab's T4 GPU runtime #14109

Open

maziyarpanahi opened this issue Dec 25, 2023 · 3 comments

maziyarpanahi commented Dec 25, 2023

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

@danilojsl

What are you working on?

Downloading and loading ONNX models on GPU devices crashes (at least on a T4 in Colab).

Current Behavior

Crashes with:

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

	at ai.onnxruntime.providers.OrtCUDAProviderOptions.add(Native Method)
	at ai.onnxruntime.providers.OrtCUDAProviderOptions.<init>(OrtCUDAProviderOptions.java:44)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.mapToCUDASessionConfig(OnnxWrapper.scala:152)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.mapToSessionOptionsObject(OnnxWrapper.scala:136)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.com$johnsnowlabs$ml$onnx$OnnxWrapper$$withSafeOnnxModelLoader(OnnxWrapper.scala:90)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.read(OnnxWrapper.scala:122)
	at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel(OnnxSerializeModel.scala:98)
	at com.johnsnowlabs.ml.onnx.ReadOnnxModel.readOnnxModel$(OnnxSerializeModel.scala:75)
	at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.readOnnxModel(MPNetEmbeddings.scala:471)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.readModel(MPNetEmbeddings.scala:416)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.readModel$(MPNetEmbeddings.scala:407)
	at com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings$.readModel(MPNetEmbeddings.scala:471)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.$anonfun$$init$$1(MPNetEmbeddings.scala:424)
	at com.johnsnowlabs.nlp.embeddings.ReadMPNetDLModel.$anonfun$$init$$1$adapted(MPNetEmbeddings.scala:424)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
	at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
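The root cause is buried at the end of the OrtException message: the dynamic linker cannot resolve libcublasLt.so.11, i.e. the CUDA 11 cuBLAS runtime is not on the library path. As a side note, a small helper (hypothetical, not part of Spark NLP or onnxruntime) can pull the missing shared-object name out of such a message:

```python
import re

def missing_shared_object(message):
    """Extract the unresolvable .so name from an onnxruntime
    provider-load failure message; returns None if no match."""
    pattern = (r"Failed to load library \S+ with error: "
               r"([\w.]+\.so[.\d]*): cannot open shared object file")
    match = re.search(pattern, message)
    return match.group(1) if match else None

msg = ("Failed to load library libonnxruntime_providers_cuda.so with error: "
       "libcublasLt.so.11: cannot open shared object file: "
       "No such file or directory")
print(missing_shared_object(msg))  # -> libcublasLt.so.11
```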

Expected Behavior

It should work; it did before upgrading to the newer version of Spark NLP.

Steps To Reproduce

!pip install spark-nlp pyspark

import sparknlp
from sparknlp.annotator import MPNetEmbeddings

spark = sparknlp.start(gpu=True)
embeddings = MPNetEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

Spark NLP version and Apache Spark

Spark NLP version 5.2.0
Apache Spark version: 3.5.0

Type of Spark Application

Python Application

Java Version

11

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response


danilojsl commented Dec 26, 2023

Hi @maziyarpanahi

I haven't been able to replicate the error. I tried in Google Colab with a T4, but it works for spark-nlp 5.2.0. Can you take a look at this notebook, reproduce the error, and let me know?
MPNet notebook

@maziyarpanahi

Hi @danilojsl

You forgot to load the ONNX GPU build in the start function: spark = sparknlp.start(gpu=True). Once the session is started with the GPU builds of ONNX and TF, the ONNX models fail with that error.

@maziyarpanahi

Some extra information: I can use A100 GPUs without any issue, so this must be something with Colab's T4 image itself. It is either missing a library, or it has the library at a different (usually older) version; for GPU we usually fix those in the Colab setup script.
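To see exactly which native libraries the T4 runtime is missing, a quick probe can be run in the notebook before starting Spark. This is a diagnostic sketch; the library list is an assumption based on what the CUDA 11 build of onnxruntime typically tries to load:

```python
import ctypes

def can_dlopen(libname):
    """Return True if the dynamic linker can resolve libname in this process."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False

# Assumed dependency list for the CUDA 11 build of onnxruntime-gpu.
for lib in ("libcublasLt.so.11", "libcublas.so.11", "libcudnn.so.8"):
    print(lib, "ok" if can_dlopen(lib) else "MISSING")
```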

@danilojsl Let's find out what's missing and how to fix it, then we can modify the GPU installation for Colab accordingly.
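One direction for the Colab script-setup, as an unverified sketch rather than a confirmed fix: it assumes the only missing piece is the CUDA 11 cuBLAS runtime, which is available as the nvidia-cublas-cu11 pip wheel. The key constraint is that LD_LIBRARY_PATH must be set before sparknlp.start(gpu=True), since the JVM is launched as a child process and inherits the environment at that point:

```python
import os
import site

def prepend_ld_library_path(libdir, current):
    """Return an LD_LIBRARY_PATH value with libdir in front, no duplicates."""
    parts = [p for p in current.split(":") if p and p != libdir]
    return ":".join([libdir] + parts)

# In the notebook, install the wheel first (assumed package name):
#   !pip install -q nvidia-cublas-cu11
# then expose its lib directory to the JVM before sparknlp.start(gpu=True):
libdir = os.path.join(site.getsitepackages()[0], "nvidia", "cublas", "lib")
os.environ["LD_LIBRARY_PATH"] = prepend_ld_library_path(
    libdir, os.environ.get("LD_LIBRARY_PATH", ""))
```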

@maziyarpanahi maziyarpanahi changed the title ONNX crashes on GPU in latest Spark NLP 5.2 ONNX crashes on Colab's T4 GPU runtime Dec 28, 2023
@maziyarpanahi maziyarpanahi changed the title ONNX crashes on Colab's T4 GPU runtime ONNX models crashe on Colab's T4 GPU runtime Dec 28, 2023
@maziyarpanahi maziyarpanahi changed the title ONNX models crashe on Colab's T4 GPU runtime ONNX models crash when they are used in Colab's T4 GPU runtime Dec 28, 2023