Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.4 with rewritten parsers + python bindings #45

Merged
merged 11 commits into from
Jul 6, 2020
Merged

v0.4 with rewritten parsers + python bindings #45

merged 11 commits into from
Jul 6, 2020

Conversation

Keats
Copy link
Contributor

@Keats Keats commented Jun 9, 2020

Closes DEV-3520, closes DEV-3519, closes DEV-3521.

@Keats
Copy link
Contributor Author

Keats commented Jun 23, 2020

cc @bovee in case you haven't seen that

@bovee
Copy link
Contributor

bovee commented Jun 27, 2020

This is really cool! My only nit is I think your "random TSV" example is actually from a SAM file header, but SAM files are technically TSVs so that's probably okay. :)

The new streaming iterator syntax is neat; I've been playing with building toy parsers with similar style so it's really nice to see needletail has that. Super excited for new stable maturin too; didn't know they were finally nailing that down.

@luizirber
Copy link
Contributor

luizirber commented Jun 27, 2020

I just tried out this branch (see this commit), and was a bit sad to not see a parse_fastx_reader function, so I tried out a few things:

  • First, using io::Cursor and Box to avoid the Seek bound. Check this commit. This is a trick I learned with @natir, and we use it in niffler. And then...
  • This commit uses niffler to replace the compression detection logic, but implies adding a dependency. I checked the dep tree, and I think the only extra one is thiserror and enum_primitive. I also added parse_fastx_reader as the main method, and parse_fastx_file then becomes just a small shim that opens a File and passes to parse_fastx_reader (still need to fix one unwarp and understand how to pass the compression feature from needletail further down to the right niffler feature fixed in this commit, which also brings back stdin support)

The second commit is pretty aggressive (sorry!) with all the code replacing, but I think the first one is less invasive. I can open a PR against this one if you want.

@luizirber
Copy link
Contributor

oh, and I really liked the new iterator style! It makes working with errors a breeze, I used to have a lot of .expect in my parse_sequence_file function because it didn't allow returning errors...

@Keats
Copy link
Contributor Author

Keats commented Jun 29, 2020

I like the parse_fastx_reader a lot, I forgot to implement it again but it was on my TODO list in the back of my brain.
As for niffler I'm not sure, It's nice to be able to control the decoder. For example you have https://github.com/luizirber/niffler/blob/master/src/lib.rs#L163 which means multi gz file would fail to parse and if I want to change that I would to either fix it in niffle (where it might not be appropriate?) or remove it. We do need the MultiGzDecoder here since we do have files in that format (few but some).

Copy link
Contributor

@boydgreenfield boydgreenfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor typos. The body of this I didn't closely review, however, and would be better with somebody with more Rust experience's eyes...

@Keats One question, has our support for streaming from stdin changed in any measurable way with the port to the seqio parsing model? Just one thing we should be sure is stable that may not have the same test coverage.

@@ -8,61 +8,53 @@ Needletail is a MIT-licensed, minimal-copying FASTA/FASTQ parser and _k_-mer pro
The goal is to write a fast *and* well-tested set of functions that more specialized bioinformatics programs can use.
Needletail's goal is to be as fast as the [readfq](https://github.com/lh3/readfq) C library at parsing FASTX files and much (i.e. 25 times) faster than equivalent Python implementations at _k_-mer counting.

# Example
## Example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional? Looks a bit like the whole block got commented out in an editor. (Referring not just to the header but to the rust code block below too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The # part is intentional, the commented out code not so much...


## Acknowledgements
Starting from 0.4, the parsers algorithms is taken from [seq_io](https://github.com/markschl/seq_io). While it has been slightly modified, it is mainly
coming that library. Links to the original files are available in `src/parser/fast{a,q}.rs`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "coming from that library".

@luizirber
Copy link
Contributor

As for niffler I'm not sure, It's nice to be able to control the decoder. For example you have https://github.com/luizirber/niffler/blob/master/src/lib.rs#L163 which means multi gz file would fail to parse and if I want to change that I would to either fix it in niffle (where it might not be appropriate?) or remove it. We do need the MultiGzDecoder here since we do have files in that format (few but some).

Fair point about controlling your deps. But I also tested changing that GzDecoder to a MultiGzDecoder and niffler worked fine (but need to test with more inputs), so that would be an option too.

@Keats
Copy link
Contributor Author

Keats commented Jun 29, 2020

I'm ok with niffler if it can be changed to MultiGzDecoder, I didn't know if that was ok to change or not as it is pretty niche.

@luizirber
Copy link
Contributor

I'm ok with niffler if it can be changed to MultiGzDecoder, I didn't know if that was ok to change or not as it is pretty niche.

It's niche, but it works fine for regular gzip files too (I checked the flate2 code).
PR opened in niffler: luizirber/niffler#28

@luizirber
Copy link
Contributor

niffler 2.1.1 released, with MultiGzDecoder as the default decoder for gz files.

@Keats
Copy link
Contributor Author

Keats commented Jun 30, 2020

Cheers, can you make a PR with your changes?

luizirber and others added 2 commits July 2, 2020 10:51
…#46)

* Use cursor and Box to avoid Seek bound

* use niffler

* fix feature, bring back stdin support, remove unwrap

* use niffler 2.1.1, with MultiGzDecoder as default

* use niffler 2.2.0, and add stdin tests
@Keats
Copy link
Contributor Author

Keats commented Jul 2, 2020

@bovee do you have a new email I can put in the Cargo.toml?

@bovee
Copy link
Contributor

bovee commented Jul 3, 2020

First initial, last name at gmail. Maybe your email should be in the Cargo.toml instead though since you've fixed up most of the code? ;)

@Keats Keats merged commit 814eb98 into master Jul 6, 2020
@Keats
Copy link
Contributor Author

Keats commented Jul 6, 2020

I've done both!

@Keats Keats deleted the vp_py3 branch July 6, 2020 09:31
@Keats
Copy link
Contributor Author

Keats commented Jul 6, 2020

@luizirber looks like pyo3 0.11 is requiring Send in its latest version for pyclass and the Chain/ dyn std::io::Read type breaks it. I'm looking into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants