@@ -58,7 +58,7 @@ You could read, filter the dataset, and write it using the following methods
Let's walk through this code line by line.
* ``files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")`` This retrieves a list of all files in the given directory, then filters the list to include only files ending with ".jsonl".
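
A minimal sketch of the surrounding read, filter, and write flow, reusing ``get_all_files_paths_under`` as described above. The specific filter (``WordCountFilter`` with an 80-word threshold) and the ``filtered_books/`` output path are illustrative assumptions, not part of the original snippet:

.. code-block:: python

    from nemo_curator import ScoreFilter
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import WordCountFilter
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Collect only the ".jsonl" shards under the dataset directory
    files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")

    # Read the shards into a DocumentDataset, keeping each record's source filename
    # so the output can be re-sharded the same way
    books = DocumentDataset.read_json(files, add_filename=True)

    # Illustrative filter: keep documents with at least 80 words
    filtered_books = ScoreFilter(WordCountFilter(min_words=80))(books)

    # Write the filtered documents back out, one JSONL file per input shard
    filtered_books.to_json("filtered_books/", write_to_filename=True)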

docs/user-guide/download.rst
@@ -36,41 +36,103 @@ By "extraction", we typically mean the process of converting a data format from
Common Crawl has an S3 bucket and a direct HTTPS endpoint. If you want to use the S3 bucket, ensure you have properly set up your credentials with `s5cmd <https://github.com/peak/s5cmd>`_.
Otherwise, the HTTPS endpoints will be used with ``wget``. Here is a small example of how to use it:

.. code-block:: python

    import os

    from nemo_curator import get_client
    from nemo_curator.download import download_common_crawl
    from nemo_curator.datasets import DocumentDataset


    def main():
        # Initialize a distributed Dask client
        client = get_client(cluster_type="cpu")

        # Parameters for downloading Common Crawl data.
        # - output_folder: directory for temporary download/extraction files
        # - start_snapshot and end_snapshot define the range to fetch
        # - output_type: specifies file format for the extracted data (e.g., "jsonl")
        output_folder = "/extracted/output/folder"
        start_snapshot = "2020-50"
        end_snapshot = "2021-04"
        output_type = "jsonl"
        os.makedirs(output_folder, exist_ok=True)

        # Download and extract the Common Crawl data.
        # The function returns a DocumentDataset that contains the extracted documents.
        # Note: The output folder and output type are passed here to store intermediate files
        # and check if the data has already been downloaded. They should match the final location
        # and file format of the extracted output.
        common_crawl_dataset = download_common_crawl(
            output_folder,
            start_snapshot,
            end_snapshot,
            output_type=output_type,
        )


    if __name__ == "__main__":
        main()

* ``"/extracted/output/folder"`` is the path to on your local filesystem where the final extracted files will be placed.
79
+
* ``"2020-50"`` is the first common crawl snapshot that will be included in the download. **Note:** Not every year and week has a snapshot. Ensure that your range includes at least one valid Common Crawl snapshot. A list of valid Common Crawl snapshots can be found `here <https://data.commoncrawl.org/>`_.
80
+
* ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
81
+
* ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.
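
Because ``download_common_crawl`` writes its extracted shards as plain ``.jsonl`` files, a later run can also load them directly instead of downloading again. A minimal sketch, reusing the helpers shown earlier (the folder is the same illustrative path as above):

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_all_files_paths_under

    # Reload previously extracted shards without re-downloading
    files = get_all_files_paths_under("/extracted/output/folder", keep_extensions="jsonl")
    common_crawl_dataset = DocumentDataset.read_json(files, add_filename=True)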
You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.

.. code-block:: python

    import os

    from nemo_curator import get_client
    from nemo_curator.download import (
        ResiliparseExtractor,
        download_common_crawl,
    )
    from nemo_curator.datasets import DocumentDataset


    def main():
        # Initialize a distributed Dask client
        client = get_client(cluster_type="cpu")

        # Parameters for downloading Common Crawl data.
        # - output_folder: directory for temporary download/extraction files
        # - start_snapshot and end_snapshot define the range to fetch
        # - output_type: specifies file format for the extracted data (e.g., "jsonl")
        output_folder = "/extracted/output/folder"
        start_snapshot = "2020-50"
        end_snapshot = "2021-04"
        output_type = "jsonl"
        os.makedirs(output_folder, exist_ok=True)

        # Change the extraction algorithm to use ResiliparseExtractor
        extraction_algorithm = ResiliparseExtractor()

        # Download and extract the Common Crawl data using the Resiliparse extraction algorithm.
        # The function returns a DocumentDataset that contains the extracted documents.
        common_crawl_dataset = download_common_crawl(
            output_folder,
            start_snapshot,
            end_snapshot,
            output_type=output_type,
            algorithm=extraction_algorithm,
        )

        # Write the extracted dataset to JSON format.
        # The 'to_json' method writes one JSON document per line,
        # preserving the original shard information if write_to_filename is True.
        common_crawl_dataset.to_json(output_folder, write_to_filename=True)


    if __name__ == "__main__":
        main()

Above, we changed the extraction algorithm from the default ``JusTextExtractor``.

The return value ``common_crawl_dataset`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.
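
For a quick look at the result, the underlying Dask dataframe can be reached through the dataset's ``.df`` attribute; this is a small sketch, assuming the snippet above has already populated ``common_crawl_dataset``:

.. code-block:: python

    # Peek at a few extracted records; .head() triggers a small computation
    print(common_crawl_dataset.df.head())

    # Count the extracted documents across all partitions
    print(len(common_crawl_dataset.df))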
NeMo Curator's Common Crawl extraction process looks like this under the hood:
1. Decode the HTML within the record from binary to text.
2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.