Merge pull request #69 from sbslee/0.37.0-dev

sbslee · web-flow · commit 4b84de83d2db · 2023-09-09T09:54:43.000+09:00
0.37.0 dev
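
Note on the `common.extract_sequence` fix in this release: FASTA writers wrap long sequences over several lines (`samtools faidx` uses 60 bases per line by default), so keeping only the first line after the header truncates any region longer than one line. The sketch below reproduces the effect with plain strings rather than pysam. `wrap_fasta`, `first_line_only`, and `all_lines` are hypothetical helper names used only for illustration, and the 160-base sequence is made up.

```python
def wrap_fasta(name, sequence, width=60):
    """Format a record as a header line followed by wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def first_line_only(text):
    # Behaviour before the fix: keep only the first sequence line.
    return text.split("\n")[1]


def all_lines(text):
    # Behaviour after the fix: join every line that follows the header.
    return "".join(text.split("\n")[1:])


sequence = "ACGT" * 40  # hypothetical 160-base sequence
record = wrap_fasta("chr1:1-160", sequence)

print(len(first_line_only(record)))  # 60
print(len(all_lines(record)))        # 160
```

Regions of 60 bases or fewer fit on one line, which is probably why the truncation only showed up for long sequences.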
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -5,6 +5,12 @@
 # Required
 version: 2
 
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.7"
+
 # Build documentation in the docs/ directory with Sphinx
 sphinx:
    configuration: docs/conf.py
@@ -15,6 +21,5 @@ sphinx:
 
 # Optionally set the version of Python and requirements required to build your docs
 python:
-   version: 3.7
    install:
    - requirements: docs/requirements.txt
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,15 @@
 Changelog
 *********
 
+0.37.0 (2023-09-09)
+-------------------
+
+* :issue:`67`: Fix bug in :meth:`pymaf.MafFrame.plot_waterfall` method where ``count=1`` was causing color mismatch.
+* Add new submodule ``pychip``.
+* Add new method :meth:`common.reverse_complement`.
+* Fix bug in :meth:`common.extract_sequence` method where a long DNA sequence output was truncated.
+* :issue:`68`: Refresh the variant consequences database from Ensembl VEP. The database's latest update was on May 31, 2021.
+
 0.36.0 (2022-08-12)
 -------------------
 
diff --git a/README.rst b/README.rst
@@ -20,9 +20,6 @@ README
 .. image:: https://anaconda.org/bioconda/fuc/badges/downloads.svg
    :target: https://anaconda.org/bioconda/fuc/files
 
-.. image:: https://anaconda.org/bioconda/fuc/badges/installer/conda.svg
-   :target: https://conda.anaconda.org/bioconda
-
 Introduction
 ============
 
@@ -65,6 +62,11 @@ and cite the following article:
 
 Lee et al., 2022. `ClinPharmSeq: A targeted sequencing panel for clinical pharmacogenetics implementation <https://doi.org/10.1371/journal.pone.0272129>`__. PLOS ONE.
 
+Support fuc
+===========
+
+If you find my work useful, please consider becoming a `sponsor <https://github.com/sponsors/sbslee>`__.
+
 Installation
 ============
 
@@ -183,6 +185,7 @@ Below is the list of submodules available in the fuc API:
 - **common** : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
 - **pybam** : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
+- **pychip** : The pychip submodule is designed for working with annotation or manifest files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina) array platforms.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
diff --git a/docs/api.rst b/docs/api.rst
@@ -14,6 +14,7 @@ Below is the list of submodules available in the fuc API:
 - **common** : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
 - **pybam** : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. If you are mainly interested in working with depth of coverage data, please check out the pycov submodule which is specifically designed for the task.
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
+- **pychip** : The pychip submodule is designed for working with annotation or manifest files from the Axiom (Thermo Fisher Scientific) and Infinium (Illumina) array platforms.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
@@ -48,6 +49,12 @@ fuc.pybed
 .. automodule:: fuc.api.pybed
    :members:
 
+fuc.pychip
+==========
+
+.. automodule:: fuc.api.pychip
+   :members:
+
 fuc.pycov
 =========
 
diff --git a/docs/create.py b/docs/create.py
@@ -48,9 +48,6 @@
 .. image:: https://anaconda.org/bioconda/fuc/badges/downloads.svg
    :target: https://anaconda.org/bioconda/fuc/files
 
-.. image:: https://anaconda.org/bioconda/fuc/badges/installer/conda.svg
-   :target: https://conda.anaconda.org/bioconda
-
 Introduction
 ============
 
@@ -93,6 +90,11 @@
 
 Lee et al., 2022. `ClinPharmSeq: A targeted sequencing panel for clinical pharmacogenetics implementation <https://doi.org/10.1371/journal.pone.0272129>`__. PLOS ONE.
 
+Support fuc
+===========
+
+If you find my work useful, please consider becoming a `sponsor <https://github.com/sponsors/sbslee>`__.
+
 Installation
 ============
 
diff --git a/fuc/api/common.py b/fuc/api/common.py
@@ -804,7 +804,12 @@ def parse_variant(variant):
 
 def extract_sequence(fasta, region):
     """
-    Extract the region's DNA sequence from the FASTA file.
+    Extract the DNA sequence corresponding to a selected region from a FASTA
+    file.
+
+    This function can also retrieve the reference allele of a variant:
+    pass the variant's position as a single-base region such as
+    'chr16:80874864-80874864'.
 
     Parameters
     ----------
@@ -817,9 +822,20 @@ def extract_sequence(fasta, region):
     -------
     str
         DNA sequence. Empty string if there is no matching sequence.
+
+    Examples
+    --------
+
+    >>> from fuc import common
+    >>> fasta = 'resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta'
+    >>> common.extract_sequence(fasta, 'chr1:15000-15005')
+    'GATCCG'
+    >>> # rs1423852 is chr16-80874864-C-T
+    >>> common.extract_sequence(fasta, 'chr16:80874864-80874864')
+    'C'
     """
     try:
-        sequence = pysam.faidx(fasta, region).split('\n')[1]
+        sequence = ''.join(pysam.faidx(fasta, region).split('\n')[1:])
     except pysam.SamtoolsError as e:
         warnings.warn(str(e))
         sequence = ''
@@ -1434,3 +1450,44 @@ def parse_list_or_file(obj, extensions=['txt', 'tsv', 'csv', 'list']):
             return convert_file2list(obj[0])
 
     return obj
+
+def reverse_complement(seq, complement=True, reverse=False):
+    """
+    Given a DNA sequence, generate its reverse, complement, or
+    reverse-complement.
+
+    Parameters
+    ----------
+    seq : str
+        DNA sequence.
+    complement : bool, default: True
+        Whether to return the complement.
+    reverse : bool, default: False
+        Whether to return the reverse.
+
+    Returns
+    -------
+    str
+        Updated sequence.
+
+    Examples
+    --------
+
+    >>> from fuc import common
+    >>> common.reverse_complement('AGC')
+    'TCG'
+    >>> common.reverse_complement('AGC', reverse=True)
+    'GCT'
+    >>> common.reverse_complement('AGC', reverse=True, complement=False)
+    'CGA'
+    >>> common.reverse_complement('agC', reverse=True)
+    'Gct'
+    """
+    new_seq = seq[:]
+    mapping = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
+               'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
+    if complement:
+        new_seq = ''.join([mapping[x] for x in new_seq])
+    if reverse:
+        new_seq = new_seq[::-1]
+    return new_seq
diff --git a/fuc/api/pychip.py b/fuc/api/pychip.py
diff --git a/fuc/api/pymaf.py b/fuc/api/pymaf.py
diff --git a/fuc/version.py b/fuc/version.py
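
Review note on `common.reverse_complement`: the two flags are meant to be independent, so `complement=False, reverse=True` should return the plain reverse. The standalone sketch below shows one way to get that behaviour with `str.maketrans` and `str.translate` from the standard library. It is not the code in this pull request; `transform_sequence` and `_COMPLEMENT_TABLE` are hypothetical names. Unlike a dict lookup, `str.translate` leaves characters outside the table (for example `N`) unchanged instead of raising `KeyError`, which may or may not be the behaviour wanted here.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def transform_sequence(seq, complement=True, reverse=False):
    """Return the complement, reverse, or reverse-complement of a DNA string."""
    # Complement first (if requested), then reverse (if requested).
    new_seq = seq.translate(_COMPLEMENT_TABLE) if complement else seq
    return new_seq[::-1] if reverse else new_seq


print(transform_sequence("AGC"))                                  # TCG
print(transform_sequence("AGC", reverse=True))                    # GCT
print(transform_sequence("AGC", reverse=True, complement=False))  # CGA
print(transform_sequence("agC", reverse=True))                    # Gct
```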