Add a way to pass expected language to FastTextLangId filter #565

shuoyangd · 2025-02-21T22:09:43Z

Description

Currently, the FastTextLangId filter only supports filtering by a minimum language ID score. Sometimes, though, we know which language the data is supposed to be in, and it would be useful to keep only the data that matches the expected language.

We use two-letter ISO-639 codes to denote languages.

Usage

Passing an extra argument when initializing the filter makes it check against the expected language, for example:

FastTextLangId(model_path=FAST_TEXT_MODEL_DIR, lang=SRC_LANG)

If the lang argument is not passed, the filter falls back to the old behavior of filtering by minimum language ID score.
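To make the two modes concrete, here is a minimal sketch of the keep decision, written as a standalone function so it runs without a fastText model. The function name `keep_document` and its signature are hypothetical and are not taken from the PR. The thread does not show whether the score threshold still applies when an expected language is given; the sketch assumes it does.

```python
from typing import Optional


def keep_document(
    score: float,
    lang_code: str,
    min_langid_score: float = 0.3,
    lang: Optional[str] = None,
) -> bool:
    """Hypothetical keep rule; not the code from this PR."""
    # Old behavior: drop documents whose top prediction is not confident enough.
    if score < min_langid_score:
        return False
    # New behavior (assumed): when an expected language is given, the predicted
    # language must also match it. The comparison ignores case.
    if lang is not None and lang_code.lower() != lang.lower():
        return False
    return True


print(keep_document(0.9, "en"))             # True: no expected language given
print(keep_document(0.9, "en", lang="de"))  # False: predicted language differs
```

With `lang=None` the function reduces to the score check alone, which mirrors the fallback described above.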

The bitext cleaning tutorial (tutorials/bitext_cleaning) is updated to demonstrate how this is used in a pipeline.

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

(The FastTextLangId filter is currently only tested with a fake emulator class. Not sure how best to cover this change with a test.)

Signed-off-by: Shuoyang Ding <[email protected]>

ryantwolf

Few comments, nothing major.

ryantwolf · 2025-03-06T16:46:08Z

nemo_curator/filters/classifier_filter.py

@@ -68,14 +68,14 @@ def _load_model(self):

 class FastTextLangId(DocumentFilter):

-    def __init__(self, model_path=None, min_langid_score=0.3):
+    def __init__(self, model_path=None, min_langid_score=0.3, lang=None):


Can you add a docstring for this function?
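One possible shape for the requested docstring is sketched below on a stand-in class. `FastTextLangIdSketch` is a hypothetical name, the attribute `_min_langid_score` is assumed, and the wording of the docstring is a suggestion rather than what was merged. The `ValueError` check and the `_model_path` and `_lang_code` attributes follow the diff in this review.

```python
from typing import Optional


class FastTextLangIdSketch:
    """Stand-in for the filter, used only to show a possible docstring."""

    def __init__(
        self,
        model_path: Optional[str] = None,
        min_langid_score: float = 0.3,
        lang: Optional[str] = None,
    ):
        """Set up a fastText-based language identification filter.

        Args:
            model_path: Path to a fastText language identification model.
                Required; a ValueError is raised if it is None.
            min_langid_score: Minimum language ID score used for filtering.
            lang: Optional two-letter ISO-639 code of the expected language.
                If omitted, only the score-based filtering applies.
        """
        if model_path is None:
            raise ValueError(
                "Must provide a valid path to a FastText model "
                "to identify languages with this filter"
            )
        self._model_path = model_path
        # Attribute name assumed for illustration; the diff does not show it.
        self._min_langid_score = min_langid_score
        self._lang_code = lang
```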

ryantwolf · 2025-03-06T16:46:41Z

nemo_curator/filters/classifier_filter.py

@@ -91,14 +91,17 @@ def _score_document(text):
            pp = text.strip().replace("\n", " ")
            label, score = model.predict(pp, k=1)
            score = score[0]
-            lang_code = label[0][-2:].upper()
+            lang_code = label[0][-2:].lower()


Can we keep it as upper for backwards compatibility? Or, is there some reason to prefer lower?

shuoyangd · Mar 6, 2025 (edited)

I'm trying to follow ISO-639 set 1 here, which uses 2-letter lowercase codes: https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
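For reference, the line under discussion takes the last two characters of the predicted label. The sketch below isolates that step, assuming labels of the form `__label__en` as produced by fastText's pretrained language identification models. The helper name `lang_code_from_label` and its `upper` flag are hypothetical. The slice is only correct for labels that end in a two-letter code.

```python
def lang_code_from_label(label: str, upper: bool = False) -> str:
    """Take the last two characters of a fastText label as the language code."""
    code = label[-2:]
    # The case convention is the open question in this thread; both are shown.
    return code.upper() if upper else code.lower()


print(lang_code_from_label("__label__en"))              # en
print(lang_code_from_label("__label__en", upper=True))  # EN
```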

ryantwolf · 2025-03-06T16:47:26Z

nemo_curator/filters/classifier_filter.py

        if model_path is None:
            raise ValueError(
                "Must provide a valid path to a FastText model "
                "to identify languages with this filter"
            )
        self._model_path = model_path
-        self._lang_code = None
+        self._lang_code = lang


Regardless of what we decide on upper vs. lower, can you also manually set self._lang_code = lang.upper() so the user doesn't have to remember which case to specify?
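Since lang defaults to None, calling lang.upper() unconditionally would raise an AttributeError when no expected language is passed. A minimal None-safe sketch of the normalization follows; the helper name `normalize_lang` is hypothetical, and `.lower()` would work the same way if the thread settles on lowercase.

```python
from typing import Optional


def normalize_lang(lang: Optional[str]) -> Optional[str]:
    """Return the language code in upper case, or None if no language was given."""
    if lang is None:
        return None
    return lang.upper()


print(normalize_lang("en"))  # EN
print(normalize_lang(None))  # None
```

In the constructor this would read self._lang_code = normalize_lang(lang), so callers can pass "en", "EN", or nothing at all.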

ryantwolf · 2025-03-06T16:48:24Z

tutorials/bitext_cleaning/main.py

@@ -67,6 +72,15 @@ def filter_dataset(dataset: ParallelDataset, gpu: bool = False) -> ParallelDatas
        ]
    )

+    if FAST_TEXT_MODEL_DIR:


Instead of making this a hardcoded directory, how do you feel about making this a CLI arg with argparse?
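A rough sketch of what the argparse version could look like is below. The flag name `--fasttext-model-path` and the function `parse_args` are hypothetical; leaving the flag unset yields None, which mirrors the existing `if FAST_TEXT_MODEL_DIR:` guard that skips the language ID step.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse tutorial options; passing argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Bitext cleaning tutorial (illustrative sketch)"
    )
    parser.add_argument(
        "--fasttext-model-path",
        default=None,
        help="Path to a fastText language ID model; "
        "the language ID filter is skipped if omitted.",
    )
    return parser.parse_args(argv)


# Explicit argument lists keep the example independent of the real command line.
args = parse_args(["--fasttext-model-path", "/path/to/model.bin"])
if args.fasttext_model_path:
    print(f"Language ID filtering enabled with {args.fasttext_model_path}")
```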

shuoyangd changed the title ~~Add a way to add expected language to FastTextLangId filter~~ Add a way to pass expected language to FastTextLangId filter Feb 21, 2025

add expected language argument to FastText language id filter

4f164a1

Signed-off-by: Shuoyang Ding <[email protected]>

shuoyangd force-pushed the expected_langid_filter branch from 7b1e380 to 4f164a1 Compare February 21, 2025 22:15

ryantwolf reviewed Mar 6, 2025
