Minor edits to instructions in readme, requirements, fuzzy and semantic dedupe flags #548

Closed · wants to merge 19 commits into `main` from `curator_readme_edits`

Commits:
- e661439 Minor edits to instructions, fuzzy and semantic (ruchaa-apte, Feb 14, 2025)
- a8d2513 Merge branch 'main' into curator_readme_edits (ruchaa-apte, Feb 14, 2025)
- dab3789 Update tutorials/dapt-curation/README.md (ruchaa-apte, Feb 14, 2025)
- 8e86a1a Update tutorials/dapt-curation/code/main.py (ruchaa-apte, Feb 14, 2025)
- 517a219 Update tutorials/dapt-curation/code/requirements.txt (ruchaa-apte, Feb 14, 2025)
- 4e3a848 Merge branch 'main' into curator_readme_edits (ruchaa-apte, Feb 14, 2025)
- d3b3144 Merge branch 'main' into curator_readme_edits (ruchaa-apte, Feb 20, 2025)
- bec0bfe Addressing PR comments (ruchaa-apte, Feb 20, 2025)
- 6b14249 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Feb 20, 2025)
- 0c5868f Merge branch 'main' into curator_readme_edits (ruchaa-apte, Feb 26, 2025)
- 31a62da Update tutorials/dapt-curation/README.md (ruchaa-apte, Feb 26, 2025)
- 6cb179f edits to ensure recent PR changes (ruchaa-apte, Mar 11, 2025)
- 8d09ae3 Merge branch 'main' into curator_readme_edits (ruchaa-apte, Mar 11, 2025)
- 8026d7d Update tutorials/dapt-curation/code/configs/text_semantic_dedupe_conf… (ruchaa-apte, Mar 12, 2025)
- 063f5ed Addressing PR comment to add new line, partition size key in configs (ruchaa-apte, Mar 12, 2025)
- 82682a7 Merge branch 'main' into curator_readme_edits (ayushdg, Mar 13, 2025)
- 9bd1b47 Changes in requirements, config and fuzzy fix (ruchaa-apte, Mar 24, 2025)
- 6bd5adb [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Mar 24, 2025)
- 0954fbf Merge branch 'main' into curator_readme_edits (ruchaa-apte, Mar 26, 2025)
**tutorials/dapt-curation/README.md** (17 changes: 14 additions & 3 deletions)

````diff
@@ -44,12 +44,23 @@ The tutorial follows the steps below:<br>

 ## Usage

-After installing the NeMo Curator package, install the dependencies and run:
+Please follow the instructions in NeMo Curator's [README](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#nemo-framework-container) to run the NeMo Framework Container and install the NeMo Curator package. Then, install the following dependencies for running the DAPT tutorial:

 ```bash
-cd code
+cd /opt/NeMo-Curator/tutorials/dapt-curation/code/
+apt update
+apt-get install poppler-utils
+apt-get install tesseract-ocr
+apt install libtesseract-dev
 pip install -r requirements.txt
+pip uninstall --yes $(pip list --format=freeze | grep opencv)
+rm -rf /usr/local/lib/python3.10/dist-packages/cv2/
+pip install opencv-python-headless
+python -c "import nltk; nltk.download('punkt_tab')"
+python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng')"
 python main.py --device "gpu"
 ```

-This will download chip-design related datasets and begin the data curation pipeline. Please use `--device "gpu"` to enable semantic and fuzzy deduplication, which require the GPU.
+This will download chip-design related datasets and begin the data curation pipeline.
+
+Please use `--device "gpu"` to enable semantic and fuzzy deduplication, which require the GPU.
````
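The OpenCV swap and NLTK downloads above can fail silently. A minimal sanity check, assuming only the packages the README block just installed, to run inside the container before launching `main.py`:

```python
# Sanity check for the install steps above; assumes only the packages the
# README block just installed. Run inside the container before main.py.
import cv2
import nltk

print("OpenCV:", cv2.__version__)  # headless build should import without libGL errors
nltk.data.find("tokenizers/punkt_tab")  # raises LookupError if the download failed
nltk.data.find("taggers/averaged_perceptron_tagger_eng")
print("NLTK resources present")
```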
**tutorials/dapt-curation/code/configs/text_semantic_dedupe_conf…**

```diff
@@ -13,20 +13,19 @@ write_to_filename: false

 # Clustering configuration
 max_iter: 100
-n_clusters: 20
+n_clusters: 15
 clustering_save_loc: "clustering_results"
 random_state: 1234
 sim_metric: "cosine"
 which_to_keep: "hard"
 batched_cosine_similarity: 1024
 sort_clusters: true
 kmeans_with_cos_dist: false
-clustering_input_partition_size: "2gb"
+partition_size: "2gb"

 # Extract dedup configuration
 eps_thresholds:
-  - 0.1
   - 0.01

 # Which threshold to use for extracting deduped data
-eps_to_extract: 0.1
\ No newline at end of file
+eps_to_extract: 0.1
```
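For readers following along, a minimal sketch of how these keys read back after the change. It assumes only PyYAML; the full filename is hypothetical, since the commit list truncates it at `text_semantic_dedupe_conf…`:

```python
# Minimal sketch using only PyYAML; the filename below is a guess at the
# full path that the commit list truncates.
import yaml

with open("configs/text_semantic_dedupe_config.yaml") as f:  # hypothetical name
    cfg = yaml.safe_load(f)

print(cfg["n_clusters"])      # 15 after this PR (was 20)
print(cfg["partition_size"])  # "2gb"; renamed from clustering_input_partition_size
print(cfg["eps_thresholds"])  # [0.01] once the 0.1 entry is dropped
print(cfg["eps_to_extract"])  # 0.1
```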
**tutorials/dapt-curation/code/main.py** (9 changes: 7 additions & 2 deletions)

```diff
@@ -119,7 +119,9 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:
         jsonl_dir (str): Directory path where the JSONL files are stored.
     """
     # Initialize the Dask cluster.
-    client = get_client(**ArgumentHelper.parse_client_args(args))
+    client = get_client(
+        **ArgumentHelper.parse_client_args(args), set_torch_to_use_rmm=True
+    )

     # Define data curation steps for text and pdf files
     curation_steps_text = Sequential(
@@ -171,6 +173,7 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:
     dataset_text = curation_steps_text(orig_dataset_text)
     dataset_code = curation_steps_code(orig_dataset_code)

+    print("********************* Generating Statistics *********************")
     print(f"Original dataset length for text files: {len(orig_dataset_text.df)}")
     print(f"After dataprep for text files: {len(dataset_text.df)}")
     print(f"Original dataset length for code files: {len(orig_dataset_code.df)}")
@@ -193,6 +196,7 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:
     semantic_dataset_text = DocumentDataset(
         gpu_dataset_text.df[gpu_dataset_text.df.id.isin(unique_ids)]
     )
+    print("********************* Generating Statistics *********************")
     print(f"After semantic dedupe for text files: {len(semantic_dataset_text.df)}")

     print("Executing the fuzzy dedupe pipeline...")
@@ -207,8 +211,9 @@ def run_curation_pipeline(args: Any, text_files: str, code_files: str) -> None:

     dataset_text.df = fuzzy_dataset_text.df.to_backend("pandas")
     dataset_code.df = fuzzy_dataset_code.df.to_backend("pandas")
+    print("********************* Generating Statistics *********************")
     print(f"After fuzzy dedupe for text files: {len(dataset_text.df)}")
-    print(f"After fuzzy dedupe: {len(dataset_code.df)}")
+    print(f"After fuzzy dedupe for code files: {len(dataset_code.df)}")

     final_dataset_text = dataset_text.persist()
     final_dataset_code = dataset_code.persist()
```
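The one behavioral change in this file is the `set_torch_to_use_rmm=True` kwarg. A minimal sketch of the pattern in isolation, assuming NeMo Curator is installed; the kwarg is taken verbatim from the diff, everything else is illustrative:

```python
# A sketch of the client setup in isolation; assumes NeMo Curator is
# installed. The set_torch_to_use_rmm kwarg comes verbatim from the diff
# above; it makes PyTorch allocate GPU memory through the RAPIDS Memory
# Manager (RMM) pool instead of its own caching allocator.
from nemo_curator.utils.distributed_utils import get_client

client = get_client(set_torch_to_use_rmm=True)
print(client.dashboard_link)  # Dask dashboard URL for watching the pipeline run
```

The likely motivation: semantic dedupe runs transformer embeddings (PyTorch) and clustering (cuDF/Dask-CUDA) in the same process, and a shared RMM pool keeps the two allocators from fragmenting GPU memory.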
**tutorials/dapt-curation/code/requirements.txt** (2 changes: 2 additions & 0 deletions)

```diff
@@ -3,5 +3,7 @@ arxiv-downloader
 cchardet
 nltk==3.8.1
 poppler-utils
+qgrid
+tesseract-ocr
 unstructured[all-docs]==0.14.5
 unstructured[pdf]
```
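One caveat worth flagging: `poppler-utils` and `tesseract-ocr` also exist as apt packages (installed in the README diff above), and `unstructured`'s PDF path needs those system binaries rather than the pip entries alone. A small probe, assuming nothing beyond the standard library, to confirm the binaries are on PATH:

```python
# Probe (standard library only): unstructured's PDF path needs the system
# pdftotext and tesseract binaries; the pip packages of the same names do
# not provide them by themselves, the README's apt installs do.
import shutil

for binary in ("pdftotext", "tesseract"):
    print(binary, "->", shutil.which(binary) or "NOT FOUND; install via apt")
```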