Skip to content

Database decision

mtorchio edited this page Nov 13, 2017 · 9 revisions

Flux Notes currently loads a single patient's SHR from a static JSON file which contains a list of SHR Entries. We will evolve the storage capability in phases to support use cases as they become supported by Flux Notes.

Phase 0 of database storage:

Support the current API in use by our Data Access Layer, which is to get a single patient's record by passing their unique shrId identifier to the Data Access Layer.

getPatient(id)

Phase 1:

The Curation study will require the storage of records for multiple patients. To support this, we will need a way to list the patients in the database.

getListOfPatients()

Phase 2:

When curating data, the notes will be edited to add structured phrases. The edited notes will be saved back to the database.

savePatient(patient)

Phase 3:

Creation of new patients in Flux Notes. This generates a new shrId for the new patient.

newPatient()

Phase 4:

A more selective way to query patients based on criteria including the unique shrId and other fields such as entry type and entry ID.

getListOfPatients(shrIdList, otherCriteria)

Phase 5:

Notional - We may in the future desire the ability to search across patients, for example to find all Progression objects matching some criteria. The shrId will not be used in these calls.

getListOfPatients(otherCriteria)

Supporting these use cases

To support these use cases with maximum flexibility, each Entry will be stored separately. To assemble a patient's entire record, the database adapter will query for all documents with a matching shrId. The more selective searches can add additional criteria to the query.

The storage method must allow for storing many small Entries. Currently, our one patient has ~30 Entries, each of which is between 1KB and 2KB of text including whitespace.

A database that can store in a supported SHR format natively has the potential to avoid a layer of translation between database format and an SHR format. Since we are starting with a patient record in document form, a document-oriented database is closer to what we need than a SQL table structure. Flux Notes is currently using JSON, so the simplest path forward is to store data as it is sent and received, in JSON. Each Entry has unique data, including dates, dosage information, and the unique shrId. There is little chance of duplicate Entries.

Both MySQL and PostgreSQL support the JSON data type and would be viable options for the Flux Notes database. PostgreSQL in particular is in use on other OHS projects. However, because PostgreSQL isn't able to keep statistics on the values of fields in JSON columns, performance can suffer dramatically when queries are run against large numbers of documents. Experiments documented online (https://heap.engineering/when-to-avoid-jsonb-in-a-postgresql-schema/) show that this can be 2000 times slower than a comparable SQL query in PostgreSQL 9.4. Newer versions of PostgreSQL have improved upon this.

The Flux Notes code is is written in Javascript, and the chosen database must be accessible either by a Javascript client or HTTP.

Searching the list at https://en.wikipedia.org/wiki/Document-oriented_database we find several that meet these criteria:

  • Couchbase Server - started as a key-value store that added persistence. Distributed. N1QL query language. Aimed at realtime performance.
  • CouchDB - Stores data in JSON, query language is Javascript, HTTP API, Apache project. No database locking.
  • HyperDex - project inactive as of 2016.
  • RethinkDB - creating company shut down, no longer offering support, open-sourced their code
  • MongoDB - Interact with data in JSON (Stored in BSON), widely used, server-side Javascript execution possible
  • TokuMX - fork of MongoDB for write-intensive workloads, otherwise very similar to MongoDB. Of these, the most viable options are CouchDB and MongoDB.

MongoDB has the larger set of third-party tools and will likely be more familiar to developers and installers of Flux Notes. It does now allow joins across collections, but the aggregation framework allows only one join at a time. This requires the joins to be performed sequentially in code, which is likely to be less performant than a query planner at runtime.

However, in each of the phases described above, our database design will be very simple. There will be one collection of Entries, which will contain all of the SHR Entries for all patients. The most complex query envisioned in Phase 5 which will return a list of Entries across multiple patients, and even this is contained within one collection and one type of document, the Entry. We do not anticipate using the aggregation pipeline to combine Entries.

For these reasons, Flux Notes should use MongoDB as its database.

Searching in MongoDB is straightforward. For example, the Phase 0 use case will require a query for all Entries associated with a given patient, identified by shrId. To search the entire document collection for all documents with that shrId, the query is:

db.entries.find( { shrId : "the ID value in GUID form" } )

Clone this wiki locally