Minor edits to instructions in readme, requirements, fuzzy and semantic dedupe flags #548

Open
wants to merge 16 commits into base: main

17 changes: 14 additions & 3 deletions tutorials/dapt-curation/README.md
@@ -44,12 +44,23 @@ The tutorial follows the steps below:<br>

## Usage

After installing the NeMo Curator package, install the dependencies and run:
Please follow the instructions in NeMo Curator's [README](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#nemo-framework-container) to run the NeMo Framework Container and install the NeMo Curator package. Then, install the following dependencies for running the DAPT tutorial:

```bash
cd code
cd /opt/NeMo-Curator/tutorials/dapt-curation/code/
apt update
apt-get install poppler-utils
apt-get install tesseract-ocr
apt install libtesseract-dev
pip install -r requirements.txt
pip uninstall --yes $(pip list --format=freeze | grep opencv)
rm -rf /usr/local/lib/python3.10/dist-packages/cv2/
pip install opencv-python-headless
python -c "import nltk; nltk.download('punkt_tab')"
python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"
python main.py --device "gpu"
```
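
As an optional sanity check after the steps above, you can confirm that the headless OpenCV build and the NLTK data installed cleanly (these commands are an editorial suggestion, not part of the tutorial):

```bash
# Should print a version string rather than raise an import error.
python -c "import cv2; print(cv2.__version__)"
# Raises LookupError if the punkt_tab tokenizer data is missing.
python -c "import nltk; nltk.data.find('tokenizers/punkt_tab')"
```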

This will download chip-design related datasets and begin the data curation pipeline. Please use `--device "gpu"` to enable semantic and fuzzy deduplication, which require the GPU.
This will download chip-design related datasets and begin the data curation pipeline.

Please use `--device "gpu"` to enable semantic and fuzzy deduplication, which require the GPU.
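
If no GPU is available, the same pipeline can presumably run with both dedupe stages skipped (the `"cpu"` value is an assumption inferred from the flag, not verified in this PR):

```bash
# Assumed CPU-only invocation; semantic and fuzzy dedupe require --device "gpu".
python main.py --device "cpu"
```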
tutorials/dapt-curation/code/configs/text_semantic_dedupe_config.yaml
@@ -13,15 +13,14 @@ write_to_filename: false

# Clustering configuration
max_iter: 100
n_clusters: 20
n_clusters: 15
clustering_save_loc: "clustering_results"
random_state: 1234
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
sort_clusters: true
kmeans_with_cos_dist: false
clustering_input_partition_size: "2gb"
partition_size: "2gb"
Collaborator @praateekmahajan commented on Mar 12, 2025:
Why has this key changed? I believe this key is still called `clustering_input_partition_size`:

clustering_input_partition_size: str = "2gb"

Contributor Author @ruchaa-apte replied:

Reverted the change

Contributor Author @ruchaa-apte commented on Mar 12, 2025:

@VibhuJawa / @praateekmahajan - I tested this change in the 25.02 image; however, `clustering_input_partition_size: "2gb"` gave me a key error.
When I reverted it to `partition_size: "2gb"`, I was able to run the code to completion.

Contributor Author @ruchaa-apte replied:

Here is the error that pops up upon changing the line to `clustering_input_partition_size: "2gb"`:

```
Traceback (most recent call last):
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 301, in <module>
    main()
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 282, in main
    run_curation_pipeline(args, text_files, code_files)
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 189, in run_curation_pipeline
    duplicates = semantic_dedupe(
  File "/opt/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 343, in semantic_dedupe
    semdedup_config = SemDedupConfig.from_yaml(sem_dedupe_config_yaml_path)
  File "/opt/NeMo-Curator/nemo_curator/modules/config.py", line 27, in from_yaml
    yaml_dict = yaml.safe_load(file)
  File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "/usr/local/lib/python3.12/dist-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/usr/local/lib/python3.12/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/usr/local/lib/python3.12/dist-packages/yaml/composer.py", line 127, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
  File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/local/lib/python3.12/dist-packages/yaml/parser.py", line 428, in parse_block_mapping_key
    if self.check_token(KeyToken):
  File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 223, in fetch_more_tokens
    return self.fetch_value()
  File "/usr/local/lib/python3.12/dist-packages/yaml/scanner.py", line 577, in fetch_value
    raise ScannerError(None, None,
yaml.scanner.ScannerError: mapping values are not allowed here
  in "/opt/NeMo-Curator/tutorials/dapt-curation/code/configs/text_semantic_dedupe_config.yaml", line 46, column 28
```

Collaborator @VibhuJawa commented on Mar 13, 2025:

Interesting. We added `clustering_input_partition_size` in https://github.com/NVIDIA/NeMo-Curator/pull/564/files.

I wonder if you could just remove this line/config entirely, since both versions should then hit the default (which you are using anyway).
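
As a side note, here is a minimal sketch (class and field names are illustrative, not NeMo Curator's actual `config.py`) of how a dataclass-backed `from_yaml` can reject a YAML key that the installed release does not recognize, which is the kind of version mismatch discussed above:

```python
from dataclasses import dataclass, fields

import yaml


@dataclass
class SemDedupConfigSketch:
    # Illustrative fields only; not the real SemDedupConfig schema.
    n_clusters: int = 20
    clustering_input_partition_size: str = "2gb"

    @classmethod
    def from_yaml(cls, path: str) -> "SemDedupConfigSketch":
        with open(path) as f:
            yaml_dict = yaml.safe_load(f)
        known = {fld.name for fld in fields(cls)}
        unknown = set(yaml_dict) - known
        if unknown:
            # A release that predates (or postdates) a renamed key lands here.
            raise KeyError(f"Unrecognized config keys: {sorted(unknown)}")
        return cls(**yaml_dict)
```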


# Extract dedup configuration
eps_thresholds:
@@ -30,3 +29,4 @@ eps_thresholds:

# Which threshold to use for extracting deduped data
eps_to_extract: 0.1

9 changes: 7 additions & 2 deletions tutorials/dapt-curation/code/main.py
@@ -119,7 +119,9 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:
jsonl_dir (str): Directory path where the JSONL files are stored.
"""
# Initialize the Dask cluster.
client = get_client(**ArgumentHelper.parse_client_args(args))
client = get_client(
**ArgumentHelper.parse_client_args(args), set_torch_to_use_rmm=True
)
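
For context on this hunk: a minimal standalone sketch of the pattern it adopts, where `get_client` builds the Dask cluster and `set_torch_to_use_rmm=True` points PyTorch's allocator at RMM so cuDF and torch share one GPU memory pool (the `cluster_type` argument below is an assumption about the parsed args, not shown in this hunk):

```python
# Sketch only; argument values are illustrative.
from nemo_curator.utils.distributed_utils import get_client

client = get_client(cluster_type="gpu", set_torch_to_use_rmm=True)
print(client.dashboard_link)  # Dask dashboard URL for monitoring the pipeline
client.close()
```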

# Define data curation steps for text and pdf files
curation_steps_text = Sequential(
@@ -171,6 +173,7 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:
dataset_text = curation_steps_text(orig_dataset_text)
dataset_code = curation_steps_code(orig_dataset_code)

print("********************* Generating Statistics *********************")
print(f"Original dataset length for text files: {len(orig_dataset_text.df)}")
print(f"After dataprep for text files: {len(dataset_text.df)}")
print(f"Original dataset length for code files: {len(orig_dataset_code.df)}")
@@ -193,6 +196,7 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:
semantic_dataset_text = DocumentDataset(
gpu_dataset_text.df[gpu_dataset_text.df.id.isin(unique_ids)]
)
print("********************* Generating Statistics *********************")
print(f"After semantic dedupe for text files: {len(semantic_dataset_text.df)}")

print("Executing the fuzzy dedupe pipeline...")
@@ -207,8 +211,9 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:

dataset_text.df = fuzzy_dataset_text.df.to_backend("pandas")
dataset_code.df = fuzzy_dataset_code.df.to_backend("pandas")
print("********************* Generating Statistics *********************")
print(f"After fuzzy dedupe for text files: {len(dataset_text.df)}")
print(f"After fuzzy dedupe: {len(dataset_code.df)}")
print(f"After fuzzy dedupe for code files: {len(dataset_code.df)}")

final_dataset_text = dataset_text.persist()
final_dataset_code = dataset_code.persist()
2 changes: 2 additions & 0 deletions tutorials/dapt-curation/code/requirements.txt
@@ -3,5 +3,7 @@ arxiv-downloader
cchardet
nltk==3.8.1
poppler-utils
qgrid
tesseract-ocr
Comment on lines +6 to +7
Collaborator commented:

We recently discussed this; let's pin the versions here so that we don't see breakage with subsequent releases. cc @ayushdg @ryantwolf

unstructured[all-docs]==0.14.5
unstructured[pdf]
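
On the version-pinning suggestion above, one common approach is to capture exact versions from an environment where the tutorial ran to completion (an illustrative command, not part of this PR; the output file name is made up):

```bash
# Record exact versions of the loosely pinned packages for requirements.txt.
pip freeze | grep -iE 'nltk|qgrid|unstructured' > pinned-versions.txt
```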