Bio Parsers

About this Repo

This repo contains a set of parsers to convert between datatypes through a generalized JSON format.

CHANGELOG

Exported Functions

Use the following exports to convert to a generalized JSON format:

fastaToJson //handles fasta files (.fa, .fasta)
genbankToJson //handles genbank files (.gb, .gbk)
ab1ToJson //handles .ab1 sequencing read files 
sbolXmlToJson //handles .sbol files
geneiousXmlToJson //handles .geneious files
snapgeneToJson //handles snapgene (.dna) files
anyToJson    //this handles any of the above file types based on file extension
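As a rough illustration of extension-based dispatch, the sketch below maps a file name to the name of the matching export. It is not the library's implementation: `PARSER_BY_EXTENSION` and `pickParserName` are hypothetical names, and the extension list is simply taken from the comments above.

```javascript
// Hypothetical lookup table: file extension -> name of the exported parser.
// The pairs mirror the list above; the table itself is illustrative only.
const PARSER_BY_EXTENSION = new Map([
  ["fa", "fastaToJson"],
  ["fasta", "fastaToJson"],
  ["gb", "genbankToJson"],
  ["gbk", "genbankToJson"],
  ["ab1", "ab1ToJson"],
  ["sbol", "sbolXmlToJson"],
  ["geneious", "geneiousXmlToJson"],
  ["dna", "snapgeneToJson"],
]);

// Return the parser name for a file name, or null when the extension is
// missing or unknown. Matching is case-insensitive.
function pickParserName(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return null;
  const extension = fileName.slice(dot + 1).toLowerCase();
  return PARSER_BY_EXTENSION.get(extension) ?? null;
}

console.log(pickParserName("pBad.ab1")); // ab1ToJson
console.log(pickParserName("pCherry.fasta")); // fastaToJson
```

This is also why passing a file name matters when the input is a plain string: without an extension there is nothing to dispatch on.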

Use the following exports to convert from a generalized JSON format back to a specific format:

jsonToGenbank
jsonToFasta
jsonToBed

Format Specification

The generalized JSON format looks like:

const generalizedJsonFormat = {
    "size": 60,
    "sequence": "asaasdgasdgasdgasdgasgdasgdasdgasdgasgdagasdgasdfasdfdfasdfa",
    "circular": true,
    "name": "pBbS8c-RFP",
    "description": "",
    "parts": [
      {
        "name": "part 1",
        "type": "CDS", //optional for parts
        "id": "092j92", //Must be a unique id. If no id is provided, we'll autogenerate one for you
        "start": 10, //0-based inclusive index
        "end": 30, //0-based inclusive index
        "strand": 1,
        "notes": {},
      }
    ],
    "primers": [
      {
        "name": "primer 1",
        "id": "092j92", //Must be a unique id. If no id is provided, we'll autogenerate one for you
        "start": 10, //0-based inclusive index
        "end": 30, //0-based inclusive index
        "strand": 1,
        "notes": {},
      }
    ],
    "features": [
        {
            "name": "anonymous feature",
            "type": "misc_feature",
            "id": "5590c1978979df000a4f02c7", //Must be a unique id. If no id is provided, we'll autogenerate one for you
            "start": 1,
            "end": 3,
            "strand": 1,
            "notes": {},
        },
        {
            "name": "coding region 1",
            "type": "CDS",
            "id": "5590c1d88979df000a4f02f5",
            "start": 12,
            "end": 9,
            "strand": -1,
            "notes": {},
        }
    ],
    //only if parsing in an ab1 file
    "chromatogramData": { 
      "aTrace": [], //same as cTrace but for a
      "tTrace": [], //same as cTrace but for t
      "gTrace": [], //same as cTrace but for g
      "cTrace": [0,0,0,1,3,5,11,24,56,68,54,30,21,3,1,4,1,0,0, ...etc], //heights of the curve, one value per x position (so if cTrace.length === 1000, the largest possible basePos is 1000)
      "basePos": [33, 46, 55, ...etc], //x position of the bases (can be unevenly spaced)
      "baseCalls": ["A", "T", ...etc],
      "qualNums": [], //or undefined if no qualNums are detected on the file
    },
}
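The "coding region 1" feature above has a start (12) greater than its end (9). The sketch below shows one way to work with such ranges, assuming that start > end denotes a range that wraps around the origin of a circular sequence. The helper names `getRangeLength` and `getSequenceWithinRange` are defined here for illustration and are not exports of this package.

```javascript
// Length of a 0-based inclusive range. Assumes start > end means the range
// runs past the last base of a circular sequence and continues from position 0.
function getRangeLength(range, sequenceLength) {
  if (range.start > range.end) {
    return sequenceLength - range.start + range.end + 1;
  }
  return range.end - range.start + 1;
}

// Extract the bases covered by a range, following the same wrap-around rule.
function getSequenceWithinRange(range, sequence) {
  if (range.start > range.end) {
    return sequence.slice(range.start) + sequence.slice(0, range.end + 1);
  }
  return sequence.slice(range.start, range.end + 1);
}

const exampleSequence = "ATGAGAGCCT"; // 10 bases, positions 0..9
console.log(getRangeLength({ start: 2, end: 4 }, exampleSequence.length)); // 3
console.log(getSequenceWithinRange({ start: 2, end: 4 }, exampleSequence)); // GAG
console.log(getRangeLength({ start: 8, end: 1 }, exampleSequence.length)); // 4
console.log(getSequenceWithinRange({ start: 8, end: 1 }, exampleSequence)); // CTAT
```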

Usage

install

npm install -S bio-parsers

or

yarn add bio-parsers

or

use it from a script tag:

<script src="https://unpkg.com/bio-parsers/umd/bio-parsers.js"></script>
<script>
      async function main() {
        var jsonOutput = await window.bioParsers.genbankToJson(
          `LOCUS       kc2         108 bp    DNA     linear    01-NOV-2016
COMMENT             teselagen_unique_id: 581929a7bc6d3e00ac7394e8
FEATURES             Location/Qualifiers
     CDS             1..108
                     /label="GFPuv"
     misc_feature    61..108
                     /label="gly_ser_linker"
     bogus_dude      4..60
                     /label="ccmN_sig_pep"
     misc_feature    4..60
                     /label="ccmN_nterm_sig_pep"
                     /pragma="Teselagen_Part"
                     /preferred5PrimeOverhangs=""
                     /preferred3PrimeOverhangs=""
ORIGIN      
        1 atgaaggtct acggcaagga acagtttttg cggatgcgcc agagcatgtt ccccgatcgc
       61 ggtggcagtg gtagcgggag ctcgggtggc tcaggctctg ggg
//`
        );
        console.log('jsonOutput:', jsonOutput);
        var genbankString = window.bioParsers.jsonToGenbank(jsonOutput[0].parsedSequence);
        console.log(genbankString);
      }
      main();
</script>

see the ./umd_demo.html file for a full working example

jsonToGenbank (same interface as jsonToFasta)

//To go from json to genbank:
import { jsonToGenbank } from "bio-parsers"
//You can pass an optional options object as the second argument. Here are the defaults
const options = {
  isProtein: false, //by default the sequence will be parsed and validated as type DNA (unless U's instead of T's are found). If isProtein=true the sequence will be parsed and validated as a PROTEIN type (seqData.isProtein === true)
  guessIfProtein: false, //if true the parser will attempt to guess if the sequence is of type DNA or type PROTEIN (this will override the isProtein flag)
  guessIfProteinOptions: {
    threshold: 0.90, //fraction of characters that must be DNA letters for the sequence to be considered of type DNA
    dnaLetters: ['G', 'A', 'T', 'C'] //customizable set of letters to use as DNA 
  }, 
  inclusive1BasedStart: false, //by default feature starts are parsed out as 0-based and inclusive 
  inclusive1BasedEnd: false //by default feature ends are parsed out as 0-based and inclusive 
  // Example:
  // 0123456
  // ATGAGAG
  // --fff--  (the feature covers GAG)
  // 0-based inclusive start:
  // feature.start = 2
  // 1-based inclusive start:
  // feature.start = 3
  // 0-based inclusive end:
  // feature.end = 4
  // 1-based inclusive end:
  // feature.end = 5
} 
const genbankString = jsonToGenbank(generalizedJsonFormat, options)
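The indexing example in the options comments can be checked with a few lines of code. This is a minimal sketch of the arithmetic only; `toOneBasedInclusive` and `toZeroBasedInclusive` are hypothetical helpers, not part of the package, which handles the conversion itself through the inclusive1BasedStart and inclusive1BasedEnd options.

```javascript
// Shift a 0-based inclusive range to a 1-based inclusive range.
function toOneBasedInclusive(range) {
  return { ...range, start: range.start + 1, end: range.end + 1 };
}

// Shift a 1-based inclusive range back to a 0-based inclusive range.
function toZeroBasedInclusive(range) {
  return { ...range, start: range.start - 1, end: range.end - 1 };
}

const bases = "ATGAGAG";
const zeroBased = { name: "f", start: 2, end: 4 };

// With 0-based inclusive indices, slice() needs end + 1.
console.log(bases.slice(zeroBased.start, zeroBased.end + 1)); // GAG

const oneBased = toOneBasedInclusive(zeroBased);
console.log(oneBased.start, oneBased.end); // 3 5
```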

anyToJson (same interface as genbankToJson, fastaToJson, xxxxToJson) (async required)

import { anyToJson } from "bio-parsers"

//note, anyToJson should be called using an await to allow for file parsing to occur (if a file is being passed)
const results = await anyToJson(
  stringOrFile, //if ab1 files are being passed in you should pass files only, otherwise strings or files are fine as inputs
  options //options.fileName (eg "pBad.ab1" or "pCherry.fasta") is important to pass here so that the parser can be chosen from the file extension
) 

//we always return an array of results because some files may contain multiple sequences 
results[0].success //either true or false 
results[0].messages //an array of strings giving any warnings or errors generated during the parsing process
results[0].parsedSequence //this will be the generalized json format as specified above :)
//chromatogram data will be here (ab1 only): 
results[0].parsedSequence.chromatogramData
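Because the return value is always an array, callers usually want to loop over it rather than read only results[0]. The sketch below shows one possible way to do that. `summarizeResults` is a hypothetical helper, and the mock array only imitates the documented shape (success, messages, parsedSequence); a real array would come from awaiting anyToJson.

```javascript
// Collect the successfully parsed sequences and every warning/error message.
function summarizeResults(results) {
  const sequences = [];
  const messages = [];
  for (const result of results) {
    if (result.success) sequences.push(result.parsedSequence);
    for (const message of result.messages || []) messages.push(message);
  }
  return { sequences, messages };
}

// Mock data in the documented shape, for illustration only.
const mockResults = [
  { success: true, messages: [], parsedSequence: { name: "seq1", sequence: "atgc" } },
  { success: false, messages: ["example error message"] },
];

const summary = summarizeResults(mockResults);
console.log(summary.sequences.length); // 1
console.log(summary.messages); // [ 'example error message' ]
```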

Options (for anyToJson or xxxxToJson)

//You can pass an optional options object as the second argument. Here are the defaults
const options = {
  fileName: "example.gb", //the filename is used if none is found in the genbank           
  isProtein: false, //if you know that it is a protein string being parsed you can pass true here
  parseFastaAsCircular: false, //by default fasta files are parsed as linear sequences. You can change this by setting parseFastaAsCircular=true 
  //genbankToJson options only
  inclusive1BasedStart: false, //by default feature starts are parsed out as 0-based and inclusive 
  inclusive1BasedEnd: false, //by default feature ends are parsed out as 0-based and inclusive 
  acceptParts: true, //by default features with a feature.notes.pragma[0] === "Teselagen_Part" are added to the sequenceData.parts array. Setting this to false will keep them as features instead
  // fastaToJson options only
  parseName: true //by default attempt to parse the name and description of sequence from the comment line. Setting this to false will keep the name unchanged with no description
}
}
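To make the acceptParts rule concrete, the sketch below separates features whose notes.pragma[0] is "Teselagen_Part" from the rest. It only restates the rule described in the comment above and is not the parser's own code; `splitPartsFromFeatures` is a hypothetical name.

```javascript
// Split a feature list into parts (pragma[0] === "Teselagen_Part") and
// ordinary features, leaving the input array untouched.
function splitPartsFromFeatures(features) {
  const parts = [];
  const remaining = [];
  for (const feature of features) {
    const pragma = feature.notes && feature.notes.pragma;
    if (Array.isArray(pragma) && pragma[0] === "Teselagen_Part") {
      parts.push(feature);
    } else {
      remaining.push(feature);
    }
  }
  return { parts, features: remaining };
}

// Illustrative input loosely based on the GenBank snippet in the script tag example.
const exampleFeatures = [
  { name: "GFPuv", type: "CDS", start: 0, end: 107, notes: {} },
  { name: "ccmN_nterm_sig_pep", type: "misc_feature", start: 3, end: 59, notes: { pragma: ["Teselagen_Part"] } },
];

const split = splitPartsFromFeatures(exampleFeatures);
console.log(split.parts.map((p) => p.name)); // [ 'ccmN_nterm_sig_pep' ]
console.log(split.features.map((f) => f.name)); // [ 'GFPuv' ]
```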

ab1ToJson

import { ab1ToJson } from "bio-parsers"
const results = await ab1ToJson(
  //this can be either a browser file  <input type="file" id="input" multiple onchange="ab1ToJson(this.files[0])">
  // or a node file ab1ToJson(fs.readFileSync(path.join(__dirname, './testData/ab1/example1.ab1')));
  file, 
  options //options.fileName (eg "pBad.ab1") is important to pass here
)

//we always return an array of results because some files may contain multiple sequences 
results[0].success //either true or false 
results[0].messages //an array of strings giving any warnings or errors generated during the parsing process
results[0].parsedSequence //this will be the generalized json format as specified above :)
//chromatogram data will be here (ab1 only): 
results[0].parsedSequence.chromatogramData
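One way to use chromatogramData is to look up the trace height under each called base. The sketch below assumes that each basePos value can be used directly as a 0-based index into the trace arrays, which the format description suggests but does not state outright. `peakHeights` is a hypothetical helper and the data is made up for illustration.

```javascript
// Map each base letter to the name of its trace array.
const TRACE_FOR_BASE = new Map([
  ["A", "aTrace"],
  ["C", "cTrace"],
  ["G", "gTrace"],
  ["T", "tTrace"],
]);

// For every base call, return the height of that base's trace at the call
// position, or null when the base has no trace (e.g. "N") or the position
// falls outside the trace.
function peakHeights(chromatogramData) {
  return chromatogramData.baseCalls.map((base, i) => {
    const traceName = TRACE_FOR_BASE.get(base.toUpperCase());
    if (!traceName) return null;
    const position = chromatogramData.basePos[i];
    return chromatogramData[traceName][position] ?? null;
  });
}

// Tiny made-up chromatogram: an A peak at x = 2 and a T peak at x = 5.
const exampleChromatogram = {
  aTrace: [0, 5, 9, 5, 0, 0, 0, 0],
  tTrace: [0, 0, 0, 0, 4, 8, 4, 0],
  gTrace: [0, 0, 0, 0, 0, 0, 0, 0],
  cTrace: [0, 0, 0, 0, 0, 0, 0, 0],
  basePos: [2, 5],
  baseCalls: ["A", "T"],
};

console.log(peakHeights(exampleChromatogram)); // [ 9, 8 ]
```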

snapgeneToJson (.dna files)

import { snapgeneToJson } from "bio-parsers"
//file can be either a browser file  <input type="file" id="input" multiple onchange="snapgeneToJson(this.files[0])">
// or a node file snapgeneToJson(fs.readFileSync(path.join(__dirname, './path/to/yourFile.dna')));
const results = await snapgeneToJson(file,options)

genbankToJson

import { genbankToJson } from "bio-parsers"

const result = genbankToJson(string, options)

console.info(result)
// [
//     {
//         "messages": [
//             "Import Error: Illegal character(s) detected and removed from sequence. Allowed characters are: atgcyrswkmbvdhn",
//             "Invalid feature end:  1384 detected for Homo sapiens and set to 1",
//         ],
//         "success": true,
//         "parsedSequence": {
//             "features": [
//                 {
//                     "notes": {
//                         "organism": [
//                             "Homo sapiens"
//                         ],
//                         "db_xref": [
//                             "taxon:9606"
//                         ],
//                         "chromosome": [
//                             "17"
//                         ],
//                         "map": [
//                             "17q21"
//                         ]
//                     },
//                     "type": "source",
//                     "strand": 1,
//                     "name": "Homo sapiens",
//                     "start": 0,
//                     "end": 1
//                 }
//             ],
//             "name": "NP_003623",
//             "sequence": "gagaggggggttatccccccttcgtcagtcgatcgtaacgtatcagcagcgcgcgagattttctggcgcagtcag",
//             "circular": true,
//             "extraLines": [
//                 "DEFINITION  contactin-associated protein 1 precursor [Homo sapiens].",
//                 "ACCESSION   NP_003623",
//                 "VERSION     NP_003623.1  GI:4505463",
//                 "DBSOURCE    REFSEQ: accession NM_003632.2",
//                 "KEYWORDS    RefSeq."
//             ],
//             "type": "DNA",
//             "size": 925
//         }
//     }
// ]

You can see more examples by looking at the tests.

Editing This Repo

All collaborators:

Edit/create a new file and update/add any relevant tests. Make sure they pass by running yarn test

Debug

yarn test-debug

Updating this repo

Teselagen collaborators

Commit and push all changes.
Sign into npm using the teselagen npm account (npm whoami).

npm version patch|minor|major
npm publish

Outside collaborators

fork and pull request please :)

Thanks/Collaborators

IsaacLuo - https://github.com/IsaacLuo/SnapGeneFileReader (from which the snapgene parser was adapted)
Joshua Nixon (original collaborator)
Thomas Rich (original collaborator)
