
SparkNLP Embeddings inference 3X slower than with pandas_udf #14184

Open · 1 task done
captify-sivakhno opened this issue Feb 27, 2024 · 3 comments

@captify-sivakhno
Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am trying to optimize a workflow for creating sentence embeddings for a large dataset to be used in a vector database.
I am using two approaches (code below) to compute embeddings: one with pandas_udf and SentenceTransformer, and one with sparknlp.annotator.embeddings.BGEEmbeddings, on the same g5.xlarge instance and with the same model.

Current Behavior

I observe that the code runs three times slower with BGEEmbeddings than with pandas_udf.

Expected Behavior

I would have expected BGEEmbeddings to run twice as fast, since I believe the Spark NLP model has been exported to ONNX format.

I wonder what the main bottleneck is, since looking at the GPU trace I find that the GPU is 85% loaded when running BGEEmbeddings inference.

Maybe it's the saving of the BGEEmbeddings output to Delta Lake? (See the noop check after the Spark NLP snippet below.)

Any suggestions on how to optimize further, or how to investigate additionally, would be most appreciated.

Steps To Reproduce

keyword_embeddings is in-house data that I upsample from its original size of 500K to roughly 5M rows.

Compute:
"cluster_instance": "g5.xlarge"

pandas_udf

# Imports needed to run this snippet (not in the original post)
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

# Load the model once on the driver and broadcast it to the executors
model = SentenceTransformer("BAAI/bge-small-en")
broadcast_model = spark.sparkContext.broadcast(model)

@pandas_udf(returnType=ArrayType(FloatType()))
def embedd_text(x: pd.Series) -> pd.Series:
    # Encode a batch of strings and return the embeddings as float arrays
    return pd.Series(broadcast_model.value.encode(x).tolist())

keyword_embeddings.sample(withReplacement=True, fraction=10.0).select("keywords") \
    .filter((F.length(F.col("keywords")) > 9) & (F.length(F.col("keywords")) < 80)) \
    .withColumn("keyphrase_embedded", embedd_text(F.col("keywords"))) \
    .write.format("delta").mode("overwrite").saveAsTable("qa.tv_segmentation_bronze.semantic_embeddings_test_bge_3")

SparkNLP

from sparknlp.annotator.embeddings import BGEEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pyspark.sql.functions as F  # needed for the filter below (added)

documentAssembler = DocumentAssembler() \
    .setInputCol("keywords") \
    .setOutputCol("document")

embeddings = BGEEmbeddings.pretrained("bge_small", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("keyphrase_embedded")

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings])

tmp = keyword_embeddings.sample(withReplacement=True, fraction=10.0).select("keywords") \
    .filter((F.length(F.col("keywords")) > 9) & (F.length(F.col("keywords")) < 80))

pipeline.fit(tmp).transform(tmp) \
    .select("keywords", "keyphrase_embedded") \
    .write.format("delta").mode("overwrite").saveAsTable("qa.bronze.semantic_embeddings_test_bge_3_sparknlp_v3")
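
To check whether the Delta Lake write (rather than inference) dominates, one option is Spark's built-in noop sink, which forces the full computation but persists nothing. A minimal diagnostic sketch, reusing tmp and pipeline from above:

# Same pipeline, but discard the output instead of writing to Delta;
# if this is still ~3X slower than pandas_udf, the write is not the bottleneck.
pipeline.fit(tmp).transform(tmp) \
    .select("keywords", "keyphrase_embedded") \
    .write.format("noop").mode("overwrite").save()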

Spark NLP version and Apache Spark

com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
spark-nlp==5.2.3
"spark_version": "14.3.x-gpu-ml-scala2.12" https://docs.databricks.com/en/release-notes/runtime/14.3lts-ml.html
spark 3.5.0

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Databricks runtime "14.3.x-gpu-ml-scala2.12" https://docs.databricks.com/en/release-notes/runtime/14.3lts-ml.html was used

@maziyarpanahi
Member

Hi @captify-sivakhno

Thanks for the report, I will have a look. That said, when the GPU is not fully utilized it's usually a matter of throughput, i.e. how many rows go in at once.

I suggest these resources while I have a look into BGEEmbeddings in particular:

If you take a closer look, unlike a UDF, Spark NLP integrates the DL part natively. If you tune the pipeline with the right batchSize and the correct number of partitions for GPUs, and configure the cluster appropriately for the amount of data, Spark NLP will be 30%-40% faster and more efficient than the exact same solution wrapped in a UDF (pure ONNX inference, for instance).
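
A minimal sketch of that tuning advice applied to the pipeline above; the batchSize and partition values are illustrative assumptions, not recommendations:

# Larger batches keep the GPU fed; tune per available GPU memory
embeddings = BGEEmbeddings.pretrained("bge_small", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("keyphrase_embedded") \
    .setBatchSize(64)  # illustrative; try 32/64/128 up to GPU RAM limits

# Fewer, larger partitions avoid tiny tasks that starve a single GPU;
# on a single-GPU g5.xlarge a handful of partitions is a reasonable start
tmp = tmp.repartition(4)  # illustrative value

pipeline = Pipeline().setStages([documentAssembler, embeddings])
result = pipeline.fit(tmp).transform(tmp)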

@captify-sivakhno
Author

@maziyarpanahi thanks for the prompt reply and comprehensive answer. Just to confirm: I have tested different batch sizes (with .setBatchSize()) up to the point of GPU RAM errors, but have not found a difference between variations. All are still 3X slower than pandas_udf (just to note, I am not using a pure UDF, but pandas_udf); specifically 3 min for pandas_udf vs. for SparkNLP. I have also confirmed that GPU usage reaches 90% in both cases.
The experiment set-up is the same as above.
Could it be an issue with the model implementation, or should I still try optimising parameters?
Thanks in advance for your suggestions.

@maziyarpanahi
Member

Hi @captify-sivakhno

  • Could you please share specs of that Runtime? (single-node or multi-node? How many rows, numPartitions, etc.)
  • Could you please run the exact same experiment with BertEmbeddings? I know it's for word embeddings as opposed to sentence embeddings, but BGE is a very new annotator; I just want to be sure it's not the annotator implementation, since you have the env ready to run the same test between Spark NLP and pandas_udf. (A sketch of such a test follows below.)
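
A minimal sketch of that comparative test, assuming the default pretrained BertEmbeddings model; note that BertEmbeddings consumes tokens, so a Tokenizer stage is added:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("keywords") \
    .setOutputCol("document")

# BertEmbeddings works on tokens, unlike BGEEmbeddings which takes documents
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert = BertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("bert_embedded")

bert_pipeline = Pipeline().setStages([documentAssembler, tokenizer, bert])

# Discard output so only inference time is measured (tmp as defined above)
bert_pipeline.fit(tmp).transform(tmp).write.format("noop").mode("overwrite").save()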
