Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way to pass expected language to FastTextLangId filter #565

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

shuoyangd
Copy link
Contributor

Description

Currently, FastTextLangId filter only supports filtering by a language ID filter, but sometimes, we know what the language the data is supposed to be, and it would be a useful addition to keep only the data that matches with the expected language.

We use two-letter ISO-639 code to denote languages.

Usage

Passing an extra argument when initializing the filter will make it check against expected language, for example:

FastTextLangId(model_path=FAST_TEXT_MODEL_DIR, lang=SRC_LANG)

If lang argument is not passed, it falls back to the old behavior of filtering by minimum language ID score.

bitext_filtering tutorial is updated to demonstrate how this is used in a pipeline.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

(FastTextLangId filter is currently only tested with a fake emulator class. Not sure how to best cover this change with test.)

@shuoyangd shuoyangd changed the title Add a way to add expected language to FastTextLangId filter Add a way to pass expected language to FastTextLangId filter Feb 21, 2025
@shuoyangd shuoyangd force-pushed the expected_langid_filter branch from 7b1e380 to 4f164a1 Compare February 21, 2025 22:15
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, nothing major.

@@ -68,14 +68,14 @@ def _load_model(self):

class FastTextLangId(DocumentFilter):

def __init__(self, model_path=None, min_langid_score=0.3):
def __init__(self, model_path=None, min_langid_score=0.3, lang=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring for this function?

@@ -91,14 +91,17 @@ def _score_document(text):
pp = text.strip().replace("\n", " ")
label, score = model.predict(pp, k=1)
score = score[0]
lang_code = label[0][-2:].upper()
lang_code = label[0][-2:].lower()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep it as upper for backwards compatibility? Or, is there some reason to prefer lower?

Copy link
Contributor Author

@shuoyangd shuoyangd Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to follow ISO-639 set 1 here which is 2-letter lowercase: https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes

if model_path is None:
raise ValueError(
"Must provide a valid path to a FastText model "
"to identify languages with this filter"
)
self._model_path = model_path
self._lang_code = None
self._lang_code = lang
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regardless of what we decide on with upper vs lower, can you also manually set self._lang_code = lang.upper() so the user doesn't have to remember which case to specify?

@@ -67,6 +72,15 @@ def filter_dataset(dataset: ParallelDataset, gpu: bool = False) -> ParallelDatas
]
)

if FAST_TEXT_MODEL_DIR:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making this a hardcoded directory, how do you feel about making this a CLI arg with argparse?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants