Add a way to pass expected language to FastTextLangId filter #565

Open, wants to merge 1 commit into base: main
11 changes: 7 additions & 4 deletions nemo_curator/filters/classifier_filter.py
@@ -68,14 +68,14 @@ def _load_model(self):

 class FastTextLangId(DocumentFilter):

-    def __init__(self, model_path=None, min_langid_score=0.3):
+    def __init__(self, model_path=None, min_langid_score=0.3, lang=None):

Collaborator

Can you add a docstring for this function?
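
One possible shape for the requested docstring is sketched below. The class name FastTextLangIdSketch is a hypothetical stand-in so the snippet runs on its own; the parameter descriptions are inferred from how the diff uses each argument and are not the author's wording.

```python
class FastTextLangIdSketch:
    """Stand-in for the filter class, used only to show a possible docstring."""

    def __init__(self, model_path=None, min_langid_score=0.3, lang=None):
        """Create a language ID filter backed by a FastText model.

        Args:
            model_path: Path to a FastText language identification model.
                Required; a ValueError is raised if it is None.
            min_langid_score: Minimum prediction score for a document to be
                kept. Used only when lang is not set.
            lang: Expected language code. When set, a document is kept only
                if its predicted language code equals this value.
        """
        if model_path is None:
            raise ValueError(
                "Must provide a valid path to a FastText model "
                "to identify languages with this filter"
            )
        self._model_path = model_path
        self._lang_code = lang
        self._cutoff = min_langid_score
        self._name = "lang_id"
```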

         if model_path is None:
             raise ValueError(
                 "Must provide a valid path to a FastText model "
                 "to identify languages with this filter"
             )
         self._model_path = model_path
-        self._lang_code = None
+        self._lang_code = lang

Collaborator

Regardless of what we decide on upper vs. lower, can you also manually set self._lang_code = lang.upper() so the user doesn't have to remember which case to specify?
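
A minimal sketch of that normalization follows. Calling .upper() directly on lang would raise AttributeError when lang is left at its default of None, so the hypothetical helper normalize_lang_code (not part of the PR) guards that case. Upper case is used here only because it matches the existing behavior; the choice of case is still open in this thread.

```python
from typing import Optional


def normalize_lang_code(lang: Optional[str]) -> Optional[str]:
    """Return lang in upper case, or None if no language was given."""
    if lang is None:
        return None
    # Strip stray whitespace so " en " and "EN" compare equal after normalizing.
    return lang.strip().upper()


print(normalize_lang_code("en"))
print(normalize_lang_code(None))
```

In the constructor this would be used as self._lang_code = normalize_lang_code(lang), with the predicted code normalized the same way before comparison.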

         self._cutoff = min_langid_score
         self._name = "lang_id"

@@ -91,14 +91,17 @@ def _score_document(text):
             pp = text.strip().replace("\n", " ")
             label, score = model.predict(pp, k=1)
             score = score[0]
-            lang_code = label[0][-2:].upper()
+            lang_code = label[0][-2:].lower()

Collaborator

Can we keep it as upper for backwards compatibility? Or, is there some reason to prefer lower?

Contributor Author

@shuoyangd, Mar 6, 2025

I'm trying to follow ISO 639 Set 1 here, which uses two-letter lowercase codes: https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes
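
For illustration, here is a small sketch of turning a FastText label into a lowercase code. It assumes the labels carry FastText's default __label__ prefix (for example __label__en). The diff slices the last two characters, which gives the ISO 639 Set 1 code for two-letter labels; the variant below strips the prefix instead, so a longer label would be kept whole if the model emits one. The helper name label_to_lang_code is hypothetical and not part of the PR.

```python
LABEL_PREFIX = "__label__"  # FastText's default label prefix (assumed here)


def label_to_lang_code(label: str) -> str:
    """Strip the label prefix, if present, and return the code in lower case."""
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label.lower()


print(label_to_lang_code("__label__en"))
```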


             return [score, lang_code]

         return df.apply(_score_document)

     def keep_document(self, score):
-        return score[0] >= self._cutoff
+        if self._lang_code:
+            return score[1] == self._lang_code
+        else:
+            return score[0] >= self._cutoff

     def _load_model(self):
         return fasttext.load_model(self._model_path)
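
The new keep_document branch can be tried in isolation. The sketch below is a standalone function that mirrors the logic in the diff; the free function and its lang_code and cutoff parameters are illustrative stand-ins for the instance attributes. It shows that once an expected language is given, the prediction score is no longer consulted.

```python
from typing import List, Optional, Union


def keep_document(
    score: List[Union[float, str]],
    lang_code: Optional[str],
    cutoff: float = 0.3,
) -> bool:
    """Mirror of the diff's logic: score is [prediction score, language code]."""
    if lang_code:
        # Expected language set: keep only exact code matches, ignoring the score.
        return score[1] == lang_code
    # No expected language: fall back to the score threshold.
    return score[0] >= cutoff


print(keep_document([0.1, "en"], lang_code="en"))
print(keep_document([0.1, "en"], lang_code=None))
```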
14 changes: 14 additions & 0 deletions tutorials/bitext_cleaning/main.py
@@ -23,6 +23,7 @@
 from nemo_curator import ParallelScoreFilter, Sequential
 from nemo_curator.datasets.parallel_dataset import ParallelDataset
 from nemo_curator.filters import (
+    FastTextLangId,
     HistogramFilter,
     LengthRatioFilter,
     QualityEstimationFilter,
@@ -38,6 +39,10 @@
 SCRIPT_DIR_PATH = os.path.dirname(os.path.abspath(__file__))
 DATA_DIR = os.path.join(SCRIPT_DIR_PATH, "data")

+# If you want to test FastText language ID,
+# download the model from here first then update this with your local model path (https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz)
+FAST_TEXT_MODEL_DIR = ""
+

 def download_files() -> str:
     downloader = TedTalksDownloader(DATA_DIR)
@@ -67,6 +72,15 @@ def filter_dataset(dataset: ParallelDataset, gpu: bool = False) -> ParallelDatas
         ]
     )

+    if FAST_TEXT_MODEL_DIR:

Collaborator

Instead of making this a hardcoded directory, how do you feel about making this a CLI arg with argparse?
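
A rough sketch of what that could look like with argparse is below. The flag name --fasttext-model-path, the parser description, and the parse_args helper are assumptions for illustration, not names taken from the tutorial.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse CLI arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Bitext cleaning tutorial")
    parser.add_argument(
        "--fasttext-model-path",  # hypothetical flag name
        type=str,
        default=None,
        help="Local path to a FastText language ID model (e.g. lid.176.ftz). "
        "The FastText language ID filter is skipped when omitted.",
    )
    return parser.parse_args(argv)


args = parse_args(["--fasttext-model-path", "/tmp/lid.176.ftz"])
print(args.fasttext_model_path)
```

The existing guard would then read if args.fasttext_model_path: in place of the module-level constant, so the filter stays opt-in.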

+        filters.modules.append(
+            ParallelScoreFilter(
+                FastTextLangId(model_path=FAST_TEXT_MODEL_DIR, lang=SRC_LANG),
+                FastTextLangId(model_path=FAST_TEXT_MODEL_DIR, lang=TGT_LANG),
+                score_type=str,
+            )
+        )
+
     if gpu:
         filters.modules.append(
             QualityEstimationFilter(