NLP analyzer latency #1097

Gommorach · 2023-06-20T13:33:20Z

Gommorach
Jun 20, 2023

Our use case for Presidio is whether PII information is present in the text that is analyzed, without needing to know which entity that is. We use the built-in entities and custom matchers. We're hitting performance issues in terms of latency as we rely on live feedback. The allocated cpu and memory resources are not being maxed out by Presidio.

We suspect that the underlying spaCy pipeline is too heavy for us in the sense that we're not relying on the ner step in our output. Does this analysis make sense? If so, would it be possible to make that pipeline configurable?

omri374 · 2023-06-21T07:00:23Z

omri374
Jun 21, 2023
Maintainer

Hi, latency depends heavily on the types of recognizers and NLP models you apply, and there's a latency-accuracy tradeoff.
The fastest setup I can think of, is to use the small spaCy model (en_core_web_sm) as the NER model, and remove recognizers that are not needed (the PhoneRecognizer being the slowest I believe. If you expect to have phone numbers only from a certain country, you can also configure it to look only for patterns belonging to this country).

Then, going with heavier spacy models (en_core_web_lg) all the way to transformers (en_core_web_trf or using huggingface) and flair models.

0 replies

Gommorach · 2023-06-21T09:00:43Z

Gommorach
Jun 21, 2023
Author

Thanks for your suggestion. The switch to the small models gave marginal improvements but it's not a panacea.

Are the built in entity recognizers relying on the ner step in spaCy's pipeline? That still seems the bottleneck for our use case. Is it possible to disable that step?

1 reply

omri374 Jun 21, 2023
Maintainer

While the different recognizers do not rely on the NER step, they rely on other spacy components like tokenization and lemmatization. You could tweak the spacy model load to remove NER.
We previously did some profiling and were able to get <10ms per a 1000 token request, so I'm curious to see where the issue is. Have you ran any profiling? This could potentially be useful: https://microsoft.github.io/code-with-engineering-playbook/machine-learning/ml-profiling/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NLP analyzer latency #1097

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

NLP analyzer latency #1097

Gommorach Jun 20, 2023

Replies: 2 comments · 1 reply

omri374 Jun 21, 2023 Maintainer

Gommorach Jun 21, 2023 Author

omri374 Jun 21, 2023 Maintainer

Gommorach
Jun 20, 2023

Replies: 2 comments 1 reply

omri374
Jun 21, 2023
Maintainer

Gommorach
Jun 21, 2023
Author

omri374 Jun 21, 2023
Maintainer