Releases: edgardomortiz/Captus
Captus v1.2.0
- Extraction speed has been improved considerably by applying a pre-filter to the PSL hits produced by BLAT. Irrelevant hits are removed earlier as well as cross-loci overlaps (except for some plastome genes that are known to overlap) under the assumption that two different loci in a target file represent non-overlapping regions in the genome, which is what we expect if the loci were carefully selected for phylogenomics.
- The depth of coverage filter for the extraction step has been completely rewritten. The new filter is applied locus-wise: Captus uses the depth of the contig where the best hit for the locus was found to determine the minimum depth allowed for other contigs to be considered part of the locus, using the formula 10^(log(depth of contig with best hit) / `depth_tolerance`), where log is the base-10 logarithm and `depth_tolerance` can be given by the user with `--nuc_depth_tolerance`, `--ptd_depth_tolerance`, `--mit_depth_tolerance`, `--dna_depth_tolerance`, and `--clr_depth_tolerance`. The default `depth_tolerance` of 2.0 means that if the best hit for a locus was found in a contig with a depth of coverage of 100.0, the minimum depth allowed for other contigs will be 10^(log(100)/2.0) = 10.0 (one order of magnitude smaller). For another locus whose best hit lies in a contig with a depth of 275.0, the minimum allowed depth becomes 16.6. Accepted tolerance values are decimals greater than or equal to 1.0, where 1.0 is the strictest tolerance (because it retains only the contig with the best hit and other contigs with higher depth). To replicate the behavior of versions before 1.1.0 use `--ignore_depth` in your `captus extract` command.
- Default clustering parameters changed (`--cl_max_copies` reduced from 5 to 3; also, when `auto` is used for `--cl_min_samples`, a sample proportion of 0.66 is chosen instead of 0.30) to produce fewer clusters that include more samples while keeping potential paralogy lower.
- Installation instructions for Macs with Apple Silicon have been added to the README file.
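
The locus-wise depth threshold described for v1.2.0 can be checked with a few lines of Python. This is only a minimal sketch of the formula as stated in these notes, not the code Captus runs; the function name `min_allowed_depth` is hypothetical, and the logarithm is assumed to be base 10 because that is what reproduces the worked values (10.0 and 16.6).

```python
import math


def min_allowed_depth(best_hit_depth: float, depth_tolerance: float = 2.0) -> float:
    """Minimum contig depth implied by 10^(log10(best_hit_depth) / depth_tolerance).

    Hypothetical helper for illustration; not part of the Captus API.
    """
    if depth_tolerance < 1.0:
        raise ValueError("depth_tolerance must be >= 1.0")
    if best_hit_depth <= 0:
        raise ValueError("best_hit_depth must be positive")
    return 10 ** (math.log10(best_hit_depth) / depth_tolerance)


if __name__ == "__main__":
    # The two worked examples from the release notes.
    for depth in (100.0, 275.0):
        print(depth, round(min_allowed_depth(depth, 2.0), 1))
```

With a tolerance of 1.0 the threshold equals the depth of the best-hit contig itself, which is why 1.0 is the strictest accepted value.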
Captus v1.1.2
- Ability to restart incomplete assemblies automatically (just rerun the same `assemble` command on the directory where the assembly was interrupted)
- Defaults for all `depth_tolerance` arguments have been increased to 20 until more tests are performed
- Tweaking of settings for second round of Scipio
- Despite choosing `--ignore_depth`, the logs were showing the `--depth_tolerance` parameters. Now a message indicates the filters are disabled
- Fixed small bugs
Captus v1.1.1
- If RAM is set to `auto`, Captus will use 70% of available RAM for Java programs to avoid allocation errors
- Empty assemblies produced by MEGAHIT are now logged as FAILED, and skipped for depth calculation and filtering
- Code has been reformatted with the Ruff extension in VSCode
Captus v1.1.0
New in the `assemble` module:
- Contig depth of coverage is now calculated by mapping the reads back to the contigs using Salmon right after the assembly with MEGAHIT. This is now the default behavior unless `--disable_mapping` is enabled.
- The assembly is then automatically filtered by contig depth: if `--disable_mapping` is used, only contigs with depth of coverage >1x are retained; otherwise contigs with depth of coverage >=1.5x are retained. The filtering threshold for depth can be changed with `--min_contig_depth`.
- To replicate the behavior of previous versions use `--disable_mapping --min_contig_depth 0`.
- The filtering can be repeated with `--redo_filtering`, without the need to reassemble, to try different values for `--max_contig_gc` and `--min_contig_depth`.
- The assembly HTML report has been completely rewritten to reflect these changes.
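
The default depth thresholds described above for the `assemble` module can be summarized in a short Python sketch. It only restates the rules given in these notes; the function `keep_contig` and its parameters are hypothetical, and treating a user-supplied `--min_contig_depth` as an inclusive (>=) cutoff is an assumption.

```python
from typing import Optional


def keep_contig(
    depth: float,
    mapping_disabled: bool = False,
    min_contig_depth: Optional[float] = None,
) -> bool:
    """Decide whether a contig passes the depth filter (illustrative only)."""
    if min_contig_depth is not None:
        # Assumption: a user-supplied threshold is applied inclusively.
        return depth >= min_contig_depth
    if mapping_disabled:
        # Default with --disable_mapping: retain depth > 1x.
        return depth > 1.0
    # Default with Salmon mapping enabled: retain depth >= 1.5x.
    return depth >= 1.5


if __name__ == "__main__":
    print(keep_contig(1.5))                         # mapping enabled, default threshold
    print(keep_contig(1.0, mapping_disabled=True))  # mapping disabled, default threshold
    print(keep_contig(0.2, True, 0))                # equivalent of --disable_mapping --min_contig_depth 0
```

Under this reading, a threshold of 0 retains every contig, which matches the stated way of replicating the behavior of previous versions.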
New in the `extract` module:
- Options `--nuc_depth_tolerance`, `--ptd_depth_tolerance`, `--mit_depth_tolerance`, and `--dna_depth_tolerance` allow filtering contigs by depth of coverage during locus extraction. Among the contigs with hits to a particular marker type (e.g., nuclear), the median of the depths of coverage is calculated and this tolerance factor is used to determine the minimum (median / tolerance) and maximum (median * tolerance) depth allowed. The depth of coverage is taken from the contig names when they contain the pattern `_cov_X.XX_`.
- To replicate the behavior of previous versions use `--ignore_depth`.
- Added option `--disable_stitching`. By default, Captus recovers a locus across multiple contigs; this option forces the recovery of a locus in a single contig (for example when providing chromosome-level genome assemblies).
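
As an illustration of the median-based filter introduced in v1.1.0 (later rewritten in v1.2.0), the following Python sketch parses depths from contig names containing `_cov_X.XX_` and applies the median / tolerance and median * tolerance bounds. The contig names, the function names, and the choice to keep contigs whose names lack the pattern are all assumptions made for the example; this is not the Captus implementation.

```python
import re
from statistics import median
from typing import List, Optional, Tuple

# Matches the `_cov_X.XX_` pattern mentioned in the notes.
COV_PATTERN = re.compile(r"_cov_(\d+(?:\.\d+)?)_")


def depth_from_name(name: str) -> Optional[float]:
    """Return the depth encoded in a contig name, or None if absent."""
    match = COV_PATTERN.search(name)
    return float(match.group(1)) if match else None


def depth_bounds(depths: List[float], tolerance: float) -> Tuple[float, float]:
    """Minimum and maximum allowed depth around the median."""
    med = median(depths)
    return med / tolerance, med * tolerance


def filter_contigs(names: List[str], tolerance: float) -> List[str]:
    """Keep contigs whose depth lies within the bounds (inclusive)."""
    depths = {name: depth_from_name(name) for name in names}
    known = [d for d in depths.values() if d is not None]
    if not known:
        return list(names)
    low, high = depth_bounds(known, tolerance)
    # Assumption: contigs without a parsable depth are left untouched.
    return [n for n, d in depths.items() if d is None or low <= d <= high]


if __name__ == "__main__":
    # Hypothetical contig names, for illustration only.
    contigs = ["c1_cov_10.00_x", "c2_cov_12.00_x", "c3_cov_100.00_x", "c4_cov_1.00_x"]
    print(filter_contigs(contigs, tolerance=2.0))
```

Here the median depth is 11.0, so with a tolerance of 2.0 only contigs between 5.5x and 22.0x are retained.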
Other improvements or additions:
- The accessory script `filter_most_common_target_per_locus.py` creates a new reference target file with only the most common target per locus found during the extraction step. This new reference target set can be used to re-extract the loci and potentially improve the `informed` paralog filtering.
- All the reports have been updated to include the version and command of Captus used.
- Updated installation instructions and documentation.
- Some long output filenames have been shortened.
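
The idea behind selecting the most common target per locus can be sketched as a simple tally. The notes do not describe the input format or internals of `filter_most_common_target_per_locus.py`, so the `(locus, target)` pairs and the function below are purely hypothetical and only illustrate the counting step.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def most_common_target_per_locus(hits: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return, for each locus, the target name that was selected most often.

    `hits` is a hypothetical iterable of (locus, target) pairs, one per sample.
    Ties are resolved in favor of the target encountered first.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for locus, target in hits:
        counts[locus][target] += 1
    return {locus: c.most_common(1)[0][0] for locus, c in counts.items()}


if __name__ == "__main__":
    # Made-up example data: which target each sample matched best per locus.
    hits = [("LOC1", "spA"), ("LOC1", "spB"), ("LOC1", "spB"), ("LOC2", "spA")]
    print(most_common_target_per_locus(hits))
```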
Captus v1.0.1
- During assembly of hits when extracting a miscellaneous DNA reference target, the delta in identity percentage between two hits to be considered compatible has been reduced from 5% to 3.33%. Initial tests indicate a slight improvement in recovery.
- In some edge cases, when translating a CDS reference target set, the same nucleotide sequence can produce a perfectly translated protein in more than one reading frame. We now give priority to positive reading frames in case of a tie.
- Latest `pandas` versions introduced breaking changes; we provide a fix.
- When creating a new miscellaneous DNA reference from clustering, each target sequence in a reference locus can have different strands. We add a method to uniformize the strand per reference locus.
- Added an option to the `align` step to `--only_collect` the extracted markers and exit afterwards (requested by Diego Morales)
- Fixed multiple small bugs.
Captus v1.0.0
- Additional improvements to `captusd bait`: added options `--min_expected_tiling` and `--remove_ambiguous_loci` for the creation of baitsets and their corresponding reference target files.
Captus v0.9.99
- Any BUSCO lineage database can now be used as a reference target file: just download a .tar.gz from https://busco-data.ezlab.org/v5/data/lineages/ and provide the file path for Captus extraction
- Added shortcut for `captus_assembly` as simply `captus` (data assembly)
- Added entry point for `captus_design` and a shortcut as `captusd` (bait design)
- The `cluster` step of bait design now reports mean number of copies per locus instead of just classifying it as single- or multi-copy
- Added a function to create a reference target file (for locus extraction) after bait clustering and tiling
- Code cleanup and minor cosmetic changes
Captus v0.9.98
- Fixed potential problem with recognition of `_R1.` or `_R1_` patterns in filenames
- Support for FastQC v0.12.1 update (s-andrews/FastQC@fbd9cf5)
- Speed up QC step during cleaning step
- If the user provides a clustering threshold with `--cl_min_identity` then the miscellaneous DNA extraction is performed using the same identity.
- Allow decimals in maximum average number of copies in a cluster via `--cl_max_copies`
- Minor cosmetic improvements
Captus v0.9.97
- Fixed a bug in the extraction report happening when the extraction statistics tables are not sorted. This bug doesn't affect the output at all, just the report heatmap.
Captus v0.9.96
- Fixed indentation bugs that prevented Falco or FastQC from running during the `clean` step and the subsampling of reads during the `assemble` step
- Secret feature: coding gene databases can also be extracted as nucleotide
- Code cleanup and minor fixes