Document a general-purpose recovery process (broadinstitute#4991)
cjllanwarne authored May 23, 2019
1 parent 765d4e9 commit 937cb05
Showing 14 changed files with 100 additions and 31 deletions.
18 changes: 18 additions & 0 deletions processes/README.MD
@@ -0,0 +1,18 @@
# Documented Processes

This directory contains a selection of processes which are:

* Best expressed as a `.dot` chart
* Still manual for now
* Under source control to make edits easy yet reviewable (just like code!)

## How to update these processes

Do you have a better idea about how any of these processes should work?
Make a PR and it'll be reviewed, just like a code change!

* Modify the appropriate `.dot` file(s)
* Navigate to the `processes` directory
* Run `refresh.sh` to regenerate the `.png` files
* Add and commit the changed `.dot` and `.png` files to git
* Submit a PR for the change to be reviewed - and hopefully adopted!
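
Before rendering, it can help to preview which images a refresh would touch. The sketch below is one way to do that. `list_render_targets` is a hypothetical helper that is not part of this repository; it only assumes the naming convention visible in `refresh.sh`, where each image is the `.dot` file name with `.png` appended. It reads file names line by line, so names containing newlines are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: preview which .png files
# a render pass would write, without invoking Graphviz.
list_render_targets() {
  # $1 is the directory to scan; defaults to the current directory.
  find "${1:-.}" -name "*.dot" -print | sort | while IFS= read -r file
  do
    # refresh.sh names each image by appending ".png" to the .dot file name.
    printf '%s -> %s.png\n' "$file" "$file"
  done
}
```

Calling `list_render_targets .` from the `processes` directory prints one `source -> image` line per chart. The remaining steps (running `refresh.sh`, then adding and committing the `.dot` and `.png` files) are unchanged.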
7 changes: 7 additions & 0 deletions processes/refresh.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

while IFS= read -r -d '' file
do
  echo "Rendering graph ${file} into ${file}.png"
  dot -Tpng -o "${file}.png" "${file}"
done < <(find . -name "*.dot" -print0)
20 changes: 20 additions & 0 deletions processes/release_processes/README.MD
@@ -0,0 +1,20 @@
# Release Processes

## How to update these processes

Have a better idea about how the deployment processes should work?
See our "updating the process" [process](../README.MD)!

## How to Release Cromwell

![release-cromwell-version](release-cromwell-version.dot.png)

## How to Deploy Cromwell releases in Firecloud

![firecloud-develop](firecloud-develop.dot.png)


## How to Deploy Cromwell in CAAS prod

![caas-prod](caas-prod.dot.png)

File renamed without changes.
21 changes: 21 additions & 0 deletions processes/troubleshooting/README.MD
@@ -0,0 +1,21 @@
# Production Troubleshooting Processes

**Note:** These processes contain shorthand descriptions for various tasks.
If you aren't sure how to achieve any of these steps, look for the details in
the [Cromwell playbook](https://docs.google.com/document/d/1_iRESDzuCgPTOPJnTYxTncIqJU8B1IFWarypDe3gbCY).

## General Purpose Fallback Process

* Have you run through the end of the playbook suggestions and not found anything which fixes the issue?
* Do you just want the problem to go away so that you can get back to sleep as quickly as possible?

This is a (near-)foolproof series of steps for bringing Cromwell back into a good state as quickly as
possible when something weird is happening and you don't know why. It also leaves the workflows from
any problem-causing submission in the database, on hold, so that they can be recovered once the issue is resolved.

![all-purpose-mess-remover](all-purpose-mess-remover.dot.png)
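
Read as pseudocode, the chart is a loop: restart, and if the problem persists, put a suspect submission on hold and restart again. The sketch below spells that control flow out in shell. Every function in it is a hypothetical placeholder for a manual step; none of them is a real Cromwell command, and the actual procedures live in the playbook.

```shell
#!/usr/bin/env bash
# Sketch of the fallback loop. Every function below is a hypothetical
# placeholder for a manual playbook step; none is a real Cromwell command.

restart_writer()   { echo "restart Cromwell's 'writer' instance"; }
problem_persists() { return 1; }  # placeholder check: exit status 0 means "still broken"
find_start_time()  { echo "determine what time things started going wrong"; }
find_submission()  { echo "determine a submission of interest from around that time"; }
hold_submission()  { echo "place all workflows from that submission on hold in the database"; }

run_fallback() {
  restart_writer            # always start with a restart
  while problem_persists
  do
    find_start_time
    find_submission
    hold_submission
    restart_writer          # restart again after each hold
  done
  echo "Great! Your work here is done."
}
```

With the placeholder check as written, `run_fallback` restarts once and finishes. As in the chart, the loop has no built-in limit: it repeats for as long as the problem persists.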

## How to update these processes

Have a better idea about how the troubleshooting processes should work?
See our "updating the process" [process](../README.MD)!
34 changes: 34 additions & 0 deletions processes/troubleshooting/all-purpose-mess-remover.dot
@@ -0,0 +1,34 @@
digraph {

# Nodes

something_wrong [shape=oval label="Something went wrong with Cromwell and I can't 'fix' it.\nI just want it to go away and restore service."];

# Always start with a restart:
restart_cromwell_instance [shape=oval label="Restart Cromwell's 'writer' instance"];

determine_time [shape=oval label="Determine what time things started going wrong"];
determine_submissions_of_interest [shape=oval label="Determine a submission of interest from around that time"];

place_submissions_on_hold [shape=oval label="Place all workflows from that submission on hold in the database"];


go_to_sleep [shape=oval label="Great!\nYour work here is done."];

{ rank=max go_to_sleep }


# Edges

something_wrong -> restart_cromwell_instance

restart_cromwell_instance -> go_to_sleep [label="That worked!"]

restart_cromwell_instance -> determine_time [label="The problem persists"]
determine_time -> determine_submissions_of_interest
determine_submissions_of_interest -> place_submissions_on_hold

place_submissions_on_hold -> restart_cromwell_instance


}
25 changes: 0 additions & 25 deletions scripts/release_processes/README.MD

This file was deleted.

6 changes: 0 additions & 6 deletions scripts/release_processes/refresh.sh

This file was deleted.
