
Is it possible to add a multiprocessing argument? #59

Open
myrainbowandsky opened this issue Dec 28, 2022 · 1 comment
Labels
enhancement New feature or request

Comments


myrainbowandsky commented Dec 28, 2022

We could split a CSV into several smaller CSVs and run twarc-csv on each of them. Or maybe we could use joblib's n_jobs to read as many blocks of JSON lines in parallel as there are CPU threads. Servers usually have multiple cores. Is it possible to add such a parameter?
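To make the splitting idea concrete, here is a minimal sketch of cutting a list of JSON lines into contiguous, order-preserving blocks, one per CPU by default. The helper name `split_into_blocks` and its `n_blocks` parameter are hypothetical and are not part of twarc-csv or joblib.

```python
import os
from typing import List, Optional, Sequence


def split_into_blocks(lines: Sequence[str], n_blocks: Optional[int] = None) -> List[List[str]]:
    """Split lines into at most n_blocks contiguous blocks, keeping input order.

    Hypothetical helper for illustration only; not a twarc-csv function.
    """
    if n_blocks is None:
        # Default to one block per CPU reported by the OS (assumption: 1 if unknown).
        n_blocks = os.cpu_count() or 1
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    if not lines:
        return []
    # Never create more blocks than there are lines.
    n_blocks = min(n_blocks, len(lines))
    base, extra = divmod(len(lines), n_blocks)
    blocks: List[List[str]] = []
    start = 0
    for i in range(n_blocks):
        # The first `extra` blocks take one additional line each.
        size = base + (1 if i < extra else 0)
        blocks.append(list(lines[start:start + size]))
        start += size
    return blocks
```

Because the blocks are contiguous slices, concatenating them in list order gives back the original sequence of lines.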

Collaborator

igorbrigadir commented Dec 28, 2022

Yes, it's not implemented, but it is possible and welcome (with the added requirement of keeping the output in the same order as the JSON, as this helps downstream tools).

The process of generating CSVs is already "embarrassingly parallel", as it splits the data into chunks here: https://github.com/DocNow/twarc-csv/blob/main/csv_writer.py#L82 so that would be the best place for it.

Joblib would be nice. Will accept good PRs for this.
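As a rough sketch of the ordering requirement, the example below processes chunks in parallel while yielding results in input order. It uses the standard library's `concurrent.futures` rather than joblib, and `chunked`, `convert_chunk` and `convert_in_order` are hypothetical names, not twarc-csv functions; the real per-chunk conversion in `csv_writer.py` is not reproduced here.

```python
import json
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of up to `size` items, in input order."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def convert_chunk(lines: List[str]) -> List[Tuple[int, str]]:
    """Hypothetical stand-in for the per-chunk JSON-to-row conversion.

    Assumes each line is a JSON object with "id" and "text" keys.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append((record["id"], record["text"]))
    return rows


def convert_in_order(lines, chunk_size=1000, workers=2, executor_cls=ProcessPoolExecutor):
    """Convert chunks in parallel and yield rows in the original line order.

    Executor.map returns results in the order the inputs were submitted,
    regardless of which worker finishes first. Pass ThreadPoolExecutor as
    executor_cls for debugging without spawning processes.
    """
    with executor_cls(max_workers=workers) as pool:
        for rows in pool.map(convert_chunk, chunked(lines, chunk_size)):
            yield from rows
```

With process pools, the calling code generally needs to sit under an `if __name__ == "__main__":` guard on platforms that spawn workers. Note that `Executor.map` submits every chunk up front, so a very large input would need bounded submission to keep memory in check.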

igorbrigadir added the enhancement (New feature or request) label Jan 19, 2023