XlmRoBertaSentenceEmbeddings returns huge amount of embeddings instead of set dimensions #14181

Open
maziyarpanahi opened this issue Feb 20, 2024 Discussed in #14180 · 0 comments
Discussed in #14180

Originally posted by kkwasnioch February 20, 2024
I am trying to produce embeddings for whole documents in three languages: English, Polish, and Finnish. Previously I tried sentence-transformers/paraphrase-multilingual-mpnet-base-v2 from Hugging Face, and it works fine, returning 768 dimensions. But when I load the model and run it with Spark NLP's XlmRoBertaSentenceEmbeddings, it produces, for example, 26k dimensions. Am I loading the model the wrong way? Or are there any other issues? Thanks!
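For reference, the Hugging Face baseline mentioned above can be checked roughly as follows. This is a minimal sketch, assuming the `sentence-transformers` package is installed; the helper name `baseline_embedding_shape` is hypothetical and not part of the original post.

```python
MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"


def baseline_embedding_shape(texts, model_name=MODEL_NAME):
    """Encode `texts` with sentence-transformers and return the result's shape.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the helper does not require the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    embeddings = model.encode(texts)         # NumPy array, one row per input text
    return embeddings.shape
```

Calling `baseline_embedding_shape(["Hello world", "Dzień dobry"])` returns a `(number_of_texts, embedding_dimension)` tuple, so the second entry is the value to compare against the sizes reported by Spark NLP.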
The Spark NLP code below is based on this sample notebook: https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_XlmRoBertaSentenceEmbeddings.ipynb
Code:

# Assumes an active Spark NLP session `spark` and a DataFrame `sparkDF` with a "text" column.
import pyspark.sql.functions as f
from pyspark.ml import Pipeline
from sparknlp.annotator import XlmRoBertaSentenceEmbeddings
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher, LightPipeline

MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

# Load the ONNX export of the Hugging Face model.
robert = XlmRoBertaSentenceEmbeddings.loadSavedModel(EXPORT_PATH, spark) \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setStorageRef("xlmroberta_embeddings_paraphrase_mpnet_base_v2")

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Convert the embedding annotations into plain arrays of floats.
embeddings_finisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finished_vectors") \
    .setOutputAsVector(False)

pipeline = Pipeline(stages=[document_assembler, robert, embeddings_finisher])

pipelineModel = pipeline.fit(sparkDF)
lightModel = LightPipeline(pipelineModel, parse_embeddings=True)

# One row per embedding array, plus its length.
out = lightModel.transform(sparkDF) \
    .select("text", f.explode("finished_vectors").alias("emb")) \
    .withColumn("size", f.size("emb"))

Output:
+--------------------+--------------------+-----+
|                text|                 emb| size|
+--------------------+--------------------+-----+
|Do kościoła jak "...|[0.028680567, 0.2...|29952|
|Audi Q7 właśnie p...|[-0.01756316, -0....|28416|
|Białoruś. KGB wpr...|[0.07118901, -0.0...|28416|
|"Są prawdziwym za...|[0.0972352, -0.04...|25344|
|Obsesja, za którą...|[0.07850968, 0.15...|32256|
|Ogromny sukces Po...|[-0.034644652, 0....|22272|
|  Rolnicy "zajęli...|[-0.06938014, 0.0...|29952|
|Szokujące wyznani...|[0.08084734, 0.18...|30720|
|Pogoda zaskoczy w...|[-0.086600736, 0....|34560|
|Kiedyś kary fizyc...|[0.059363756, 0.0...|28416|
+--------------------+--------------------+-----+
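
One observation that may help narrow this down: every value in the size column appears to be an exact multiple of 768, which would be consistent with several 768-dimensional vectors being concatenated into one flat array instead of being pooled into a single vector. The sketch below checks that arithmetic and shows what mean pooling over such a flat array could look like. It assumes the flat array is laid out as consecutive 768-value chunks, which the output above does not confirm; the helper names are hypothetical, and this is a diagnostic aid rather than a confirmed fix.

```python
EMBEDDING_DIM = 768  # dimension reported for the Hugging Face model in the post

# Values copied from the `size` column of the output above.
observed_sizes = [29952, 28416, 28416, 25344, 32256,
                  22272, 29952, 30720, 34560, 28416]


def split_into_chunks(flat, dim=EMBEDDING_DIM):
    """Split a flat list into consecutive chunks of length `dim`.

    ASSUMPTION: the flat array is a plain concatenation of `dim`-sized vectors.
    """
    if len(flat) % dim != 0:
        raise ValueError(f"length {len(flat)} is not a multiple of {dim}")
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


def mean_pool(flat, dim=EMBEDDING_DIM):
    """Average the chunks element-wise, giving one vector of length `dim`."""
    chunks = split_into_chunks(flat, dim)
    count = len(chunks)
    return [sum(chunk[j] for chunk in chunks) / count for j in range(dim)]


# Report how each observed size decomposes with respect to 768.
for size in observed_sizes:
    quotient, remainder = divmod(size, EMBEDDING_DIM)
    print(f"{size} = {quotient} x {EMBEDDING_DIM} + {remainder}")
```

If the remainder is 0 for every row, the sizes differ only in how many 768-value chunks each document produced, which points at a missing pooling step rather than a wrong model dimension.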
