
Commit d3b3144

Merge branch 'main' into curator_readme_edits
2 parents 4e3a848 + 9b1a13c commit d3b3144


42 files changed (+1086, -446 lines)

.github/workflows/build-test-publish-wheel.yml (+6, -7)

@@ -17,24 +17,23 @@ name: Build, test, and publish a PyPi wheel (to testpypi)
 on:
   push:
     branches:
-      - 'main'
-      - '[rv][0-9].[0-9].[0-9]'
-      - '[rv][0-9].[0-9].[0-9]rc[0-9]'
+      - "main"
+      - "[rv][0-9].[0-9].[0-9]"
+      - "[rv][0-9].[0-9].[0-9]rc[0-9]"
 
 defaults:
   run:
     shell: bash -x -e -u -o pipefail {0}
 
 jobs:
   build-test-publish-wheel:
-    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_build_test_publish_wheel.yml@v0.20.0
+    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_build_test_publish_wheel.yml@v0.22.3
     with:
       dry-run: true
       python-package: nemo_curator
-      environment: public
-      python-version: '3.10'
+      python-version: "3.10"
     secrets:
       TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
       TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
+      SLACK_WEBHOOK: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
       SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
-      SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

.github/workflows/release-freeze.yml (+8, -3)

@@ -14,15 +14,20 @@ on:
       description: Commit SHA to use for cut-off
       required: false
       default: main
-
+    dry-run:
+      type: boolean
+      description: Dry-run of code-freeze
+      required: false
+      default: true
 jobs:
   code-freeze:
-    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_code_freeze.yml@v0.21.6
+    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_code_freeze.yml@v0.22.5
     with:
       library-name: NeMo Curator
       python-package: nemo_curator
       release-type: ${{ inputs.release-type }}
       freeze-commit: ${{ inputs.freeze-commit }}
+      dry-run: ${{ inputs.dry-run }}
     secrets:
-      SLACK_RELEASE_ENDPOINT: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
+      SLACK_WEBHOOK: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
       SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}

.github/workflows/release.yml (+5, -5)

@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-name: 'Release NeMo Curator'
+name: "Release NeMo Curator"
 
 on:
   workflow_dispatch:
@@ -31,17 +31,17 @@ on:
       description: Branch to target for version bump
 jobs:
   release:
-    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.20.1
+    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_release_library.yml@v0.22.6
     with:
      release-ref: ${{ inputs.release-ref }}
      python-package: nemo_curator
-     python-version: '3.10'
+     python-version: "3.10"
      library-name: NeMo Curator
      dry-run: ${{ inputs.dry-run }}
      version-bump-branch: ${{ inputs.version-bump-branch }}
    secrets:
      TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
      TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
-     SLACK_RELEASE_ENDPOINT: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
+     SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
+     SLACK_WEBHOOK: ${{ secrets.SLACK_RELEASE_ENDPOINT }}
      PAT: ${{ secrets.PAT }}
-     SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

docs/user-guide/distributeddataclassification.rst (+8, -8)

@@ -65,7 +65,7 @@ Let's see how ``DomainClassifier`` works in a small excerpt taken from ``example
 
     from nemo_curator.classifiers import DomainClassifier
 
-    files = get_all_files_paths_under("books_dataset/")
+    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
@@ -87,7 +87,7 @@ Using the ``MultilingualDomainClassifier`` is very similar to using the ``Domain
 
     from nemo_curator.classifiers import MultilingualDomainClassifier
 
-    files = get_all_files_paths_under("japanese_books_dataset/")
+    files = get_all_files_paths_under("japanese_books_dataset/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     multilingual_domain_classifier = MultilingualDomainClassifier(
@@ -110,7 +110,7 @@ Here's an example of how to use the ``QualityClassifier``:
 
     from nemo_curator.classifiers import QualityClassifier
 
-    files = get_all_files_paths_under("web_documents/")
+    files = get_all_files_paths_under("web_documents/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     quality_classifier = QualityClassifier(filter_by=["High", "Medium"])
@@ -138,7 +138,7 @@ NeMo Curator provides an easy way to annotate and filter your data using the saf
 
 .. code-block:: python
 
-    files = get_all_files_paths_under("unsafe_documents/")
+    files = get_all_files_paths_under("unsafe_documents/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     token = "hf_1234" # Replace with your user access token
@@ -185,7 +185,7 @@ Here is a small example of how to use the ``InstructionDataGuardClassifier``:
 
     # The model expects instruction-response style text data. For example:
     # "Instruction: {instruction}. Input: {input_}. Response: {response}."
-    files = get_all_files_paths_under("instruction_input_response_dataset/")
+    files = get_all_files_paths_under("instruction_input_response_dataset/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     token = "hf_1234" # Replace with your user access token
@@ -214,7 +214,7 @@ To use the FineWeb Educational Content Classifier, you can follow this example:
 
     from nemo_curator.classifiers import FineWebEduClassifier
 
-    files = get_all_files_paths_under("web_documents/")
+    files = get_all_files_paths_under("web_documents/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     edu_classifier = FineWebEduClassifier(
@@ -337,7 +337,7 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex
 
     from nemo_curator.classifiers import ContentTypeClassifier
 
-    files = get_all_files_paths_under("books_dataset/")
+    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     content_type_classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
@@ -359,7 +359,7 @@ Here's an example of how to use the ``PromptTaskComplexityClassifier``:
 
     from nemo_curator.classifiers import PromptTaskComplexityClassifier
 
-    files = get_all_files_paths_under("my_dataset/")
+    files = get_all_files_paths_under("my_dataset/", keep_extensions="jsonl")
     input_dataset = DocumentDataset.read_json(files, backend="cudf")
 
     classifier = PromptTaskComplexityClassifier()
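
Every hunk in this file makes the same change: passing keep_extensions="jsonl" so only .jsonl files are picked up. For orientation, a minimal sketch of the full read, classify, write flow these excerpts describe, assuming a GPU Dask cluster started via get_client and an illustrative output directory ("games_and_sports/") that is not part of this diff:

    from nemo_curator import get_client
    from nemo_curator.classifiers import DomainClassifier
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    client = get_client(cluster_type="gpu")

    # Only pick up .jsonl files, mirroring the keep_extensions change in this commit
    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
    input_dataset = DocumentDataset.read_json(files, backend="cudf")

    # Keep only documents classified as Games or Sports, as in the doc excerpt
    domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
    result_dataset = domain_classifier(dataset=input_dataset)

    # Illustrative output location (an assumption, not taken from the diff)
    result_dataset.to_json("games_and_sports/")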

docs/user-guide/documentdataset.rst (+2, -2)

@@ -43,7 +43,7 @@ You could read, filter the dataset, and write it using the following methods
     from nemo_curator.utils.file_utils import get_all_files_paths_under
     from nemo_curator.filters import WordCountFilter
 
-    files = get_all_files_paths_under("books_dataset/")
+    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
     books = DocumentDataset.read_json(files, add_filename=True)
 
     filter_step = nc.ScoreFilter(
@@ -58,7 +58,7 @@ You could read, filter the dataset, and write it using the following methods
 
 Let's walk through this code line by line.
 
-* ``files = get_all_files_paths_under("books_dataset/")`` This retrieves a list of all files in the given directory.
+* ``files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")`` This retrieves a list of all files in the given directory, then filters the list to include only files ending with ".jsonl".
   In our case, this is equivalent to writing
 
   .. code-block:: python
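
The "equivalent to writing" snippet referenced by the doc falls outside this hunk, so it is not reproduced here. As a rough, hypothetical illustration of what the new ``keep_extensions="jsonl"`` argument amounts to (recursively walking the directory and keeping only ``.jsonl`` paths), the behaviour is along these lines:

    import os

    # Hypothetical stand-in for get_all_files_paths_under(..., keep_extensions="jsonl"):
    # walk the directory tree and keep only files ending in ".jsonl".
    files = []
    for root, _dirs, filenames in os.walk("books_dataset/"):
        for name in filenames:
            if name.endswith(".jsonl"):
                files.append(os.path.join(root, name))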

docs/user-guide/download.rst (+91, -29)

@@ -36,41 +36,103 @@ By "extraction", we typically mean the process of converting a data format from
 Common crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly set up your credentials with `s5cmd <https://github.com/peak/s5cmd>`_.
 Otherwise, the HTTPS endpoints will be used with ``wget``. Here is a small example of how to use it:
 
-.. code-block:: python
-
-    from nemo_curator.download import download_common_crawl
-
-    common_crawl = download_common_crawl("/extracted/output/folder", "2020-50", "2021-04", output_type="jsonl")
-
-* ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
-* ``"2020-50"`` is the first common crawl snapshot that will be included in the download. **Note:** Not every year and week has a snapshot. Ensure that your range includes at least one valid Common Crawl snapshot. A list of valid Common Crawl snapshots can be found `here <https://data.commoncrawl.org/>`_.
-* ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
-* ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.
+.. code-block:: python
+
+    import os
+    from nemo_curator import get_client
+    from nemo_curator.download import download_common_crawl
+    from nemo_curator.datasets import DocumentDataset
+
+    def main():
+        # Initialize a distributed Dask client
+        client = get_client(cluster_type="cpu")
+
+        # Parameters for downloading Common Crawl data.
+        # - output_folder: directory for temporary download/extraction files
+        # - start_snapshot and end_snapshot define the range to fetch
+        # - output_type: specifies file format for the extracted data (e.g., "jsonl")
+        output_folder = "/extracted/output/folder"
+        start_snapshot = "2020-50"
+        end_snapshot = "2021-04"
+        output_type = "jsonl"
+        os.makedirs(output_folder, exist_ok=True)
+
+        # Download and extract the Common Crawl data.
+        # The function returns a DocumentDataset that contains the extracted documents.
+        # Note: The output folder and output type are passed here to store intermediate files
+        # and check if the data has already been downloaded. They should match the final location
+        # and format of the extracted data.
+        common_crawl_dataset = download_common_crawl(
+            output_folder, start_snapshot, end_snapshot, output_type=output_type
+        )
+
+        # Write the extracted dataset to JSON format.
+        # The 'to_json' method will write one JSON document per line,
+        # preserving the original shard information if write_to_filename is True.
+        common_crawl_dataset.to_json(output_path=output_folder, write_to_filename=True)
+        print("Extracted dataset saved to:", output_folder)
+
+    if __name__ == "__main__":
+        main()
+
+* ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
+* ``"2020-50"`` is the first common crawl snapshot that will be included in the download. **Note:** Not every year and week has a snapshot. Ensure that your range includes at least one valid Common Crawl snapshot. A list of valid Common Crawl snapshots can be found `here <https://data.commoncrawl.org/>`_.
+* ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
+* ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.
 
 You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.
 
-.. code-block:: python
+.. code-block:: python
 
-    from nemo_curator.download import (
+    import os
+    from nemo_curator import get_client
+    from nemo_curator.download import (
         ResiliparseExtractor,
         download_common_crawl,
-    )
-
-    # Change the extraction algorithm
-    extraction_algorithm = ResiliparseExtractor()
-    common_crawl = download_common_crawl(
-        "/extracted/output/folder",
-        "2020-50",
-        "2021-04",
-        output_type="jsonl",
-        algorithm=extraction_algorithm,
-    )
-
-Above, we changed the extraction algorithm from the default ``JusTextExtractor``.
-
-The return value ``common_crawl`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.
-
-NeMo Curator's Common Crawl extraction process looks like this under the hood:
+    )
+    from nemo_curator.datasets import DocumentDataset
+
+    def main():
+        # Initialize a distributed Dask client
+        client = get_client(cluster_type="cpu")
+
+        # Parameters for downloading Common Crawl data.
+        # - output_folder: directory for temporary download/extraction files
+        # - start_snapshot and end_snapshot define the range to fetch
+        # - output_type: specifies file format for the extracted data (e.g., "jsonl")
+        output_folder = "/extracted/output/folder"
+        start_snapshot = "2020-50"
+        end_snapshot = "2021-04"
+        output_type = "jsonl"
+        os.makedirs(output_folder, exist_ok=True)
+
+        # Change the extraction algorithm to use ResiliparseExtractor
+        extraction_algorithm = ResiliparseExtractor()
+
+        # Download and extract the Common Crawl data using the Resiliparse extraction algorithm.
+        # The function returns a DocumentDataset that contains the extracted documents.
+        common_crawl_dataset = download_common_crawl(
+            output_folder,
+            start_snapshot,
+            end_snapshot,
+            output_type=output_type,
+            algorithm=extraction_algorithm,
+        )
+
+        # Write the extracted dataset to JSON format.
+        # The 'to_json' method writes one JSON document per line,
+        # preserving the original shard information if write_to_filename is True.
+        common_crawl_dataset.to_json(output_path=output_folder, write_to_filename=True)
+        print("Extracted dataset saved to:", output_folder)
+
+    if __name__ == "__main__":
+        main()
+
+Above, we changed the extraction algorithm from the default ``JusTextExtractor``.
+
+The return value ``common_crawl`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.
+
+NeMo Curator's Common Crawl extraction process looks like this under the hood:
 
 1. Decode the HTML within the record from binary to text.
 2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
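
For step 2, a self-contained sketch of what language detection with pyCLD2 looks like (illustrative only, not code from this commit; it assumes the HTML record has already been decoded to a string):

    import pycld2 as cld2

    decoded_html = "<html><body><p>The quick brown fox jumps over the lazy dog.</p></body></html>"

    # detect() returns (is_reliable, bytes_found, details); details lists the top languages
    is_reliable, _bytes_found, details = cld2.detect(decoded_html)
    if is_reliable:
        language_name, language_code, _percent, _score = details[0]
        print(f"Detected language: {language_name} ({language_code})")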

docs/user-guide/qualityfiltering.rst (+1, -1)

@@ -35,7 +35,7 @@ Let's examine this small example:
     from nemo_curator.utils.file_utils import get_all_files_paths_under
     from nemo_curator.filters import WordCountFilter
 
-    files = get_all_files_paths_under("books_dataset/")
+    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
     books = DocumentDataset.read_json(files, add_filename=True)
 
     filter_step = nc.ScoreFilter(
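
The excerpt cuts off inside the ``ScoreFilter`` constructor. Based on the ScoreFilter and WordCountFilter APIs used elsewhere in the docs, a plausible completion of the pipeline looks like the sketch below; the ``min_words`` threshold and output directory are illustrative assumptions, not taken from this diff:

    import nemo_curator as nc
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import WordCountFilter
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
    books = DocumentDataset.read_json(files, add_filename=True)

    # Score each document by word count and keep only sufficiently long books
    filter_step = nc.ScoreFilter(
        WordCountFilter(min_words=80),
        text_field="text",
        score_field="word_count",
    )
    long_books = filter_step(books)

    # write_to_filename=True preserves the file names captured by add_filename=True
    long_books.to_json("long_books/", write_to_filename=True)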

docs/user-guide/sparkother.rst (+1, -1)

@@ -91,4 +91,4 @@ The following code snippet demonstrates how to read output from a Spark DataFrame
     stories_dataset = DocumentDataset.read_parquet(processed_files, backend="pandas")
 
 It is worth noting that Spark typically tends to create checksum and other marker files which can vary by Spark distribution,
-so it is advisable to ignore them when reading data into a NeMo Curator ``DocumentDataset``.
+so it is advisable to ignore them when reading data into a NeMo Curator ``DocumentDataset``.

docs/user-guide/taskdecontamination.rst (+1, -1)

@@ -28,7 +28,7 @@ Let's examine this small example:
     from nemo_curator.utils.file_utils import get_all_files_paths_under
     from nemo_curator.tasks import Winogrande, Squad, TriviaQA,
 
-    files = get_all_files_paths_under("books_dataset/")
+    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
     books = DocumentDataset.read_json(files, add_filename=True)
 
     downstream_tasks = [
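
For context, the task-decontamination guide builds on this excerpt roughly as in the hedged sketch below; the exact downstream-task list and output path are illustrative, and only the ``TaskDecontamination`` module the guide documents is assumed:

    import nemo_curator as nc
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.tasks import Winogrande, Squad, TriviaQA
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
    books = DocumentDataset.read_json(files, add_filename=True)

    # Remove text that overlaps with the evaluation sets of these downstream tasks
    downstream_tasks = [Winogrande(), Squad(), TriviaQA()]
    decontaminate = nc.TaskDecontamination(downstream_tasks)
    decontaminated_books = decontaminate(books)

    decontaminated_books.to_json("decontaminated_books/", write_to_filename=True)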

examples/classifier_filtering.py (+1, -1)

@@ -27,7 +27,7 @@
 
 
 def load_dataset(input_data_dir):
-    files = list(get_all_files_paths_under(input_data_dir))
+    files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
     raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
     dataset = DocumentDataset(raw_data)

examples/identify_languages.py (+1, -1)

@@ -26,7 +26,7 @@
 
 
 def load_dataset(input_data_dir):
-    files = list(get_all_files_paths_under(input_data_dir))
+    files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
     raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
     dataset = DocumentDataset(raw_data)

examples/task_decontamination.py (+1, -1)

@@ -44,7 +44,7 @@
 
 
 def load_dataset(input_data_dir):
-    files = list(get_all_files_paths_under(input_data_dir))
+    files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
     raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
     dataset = DocumentDataset(raw_data)
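
All three example scripts change the same shared helper. For reference, a self-contained version of that helper as it reads after this commit; the import paths are stated as assumptions (the hunks do not show them), and the trailing ``return`` falls outside the diff context:

    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.distributed_utils import read_data
    from nemo_curator.utils.file_utils import get_all_files_paths_under


    def load_dataset(input_data_dir):
        # Restrict the file listing to .jsonl files so stray artifacts are ignored
        files = list(get_all_files_paths_under(input_data_dir, keep_extensions="jsonl"))
        raw_data = read_data(files, file_type="jsonl", backend="pandas", add_filename=True)
        dataset = DocumentDataset(raw_data)
        return dataset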

nemo_curator/classifiers/base.py (+7, -4)

@@ -123,10 +123,13 @@ def _run_classifier_helper(
     prob_col: str = None,
 ) -> "dask_cudf.DataFrame":
 
-    if prob_col:
-        df[prob_col] = 0
-    else:
+    if prob_col is None:
         prob_col = "_prob"
+        labeler = op.Labeler(labels, cols=[prob_col], suffix=label_col)
+    else:
+        labeler = op.Labeler(
+            labels, cols=[prob_col], keep_cols=[prob_col], suffix=label_col
+        )
 
     columns_to_keep_list = df.columns.to_list()
 
@@ -140,7 +143,7 @@ def _run_classifier_helper(
             batch_size=batch_size,
             pred_output_col=prob_col,
         ),
-        op.Labeler(labels, cols=[prob_col], suffix=label_col),
+        labeler,
         repartition=df.npartitions,
         keep_cols=columns_to_keep_list,
     )
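
Restated outside the diff for clarity (an illustration of the new control flow, not library code): when no probability column is requested, the helper now labels from an internal "_prob" column that is dropped from the output; when one is requested, the Labeler keeps it alongside the predicted label via keep_cols.

    def build_labeler(op, labels, label_col, prob_col=None):
        # Illustration of the branching added in _run_classifier_helper
        if prob_col is None:
            # No probability column requested: label from an internal column only
            prob_col = "_prob"
            labeler = op.Labeler(labels, cols=[prob_col], suffix=label_col)
        else:
            # Probability column requested: keep it alongside the predicted label
            labeler = op.Labeler(
                labels, cols=[prob_col], keep_cols=[prob_col], suffix=label_col
            )
        return labeler, prob_col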
