Monitor Library

A convenient library that wraps up all of the pieces a Taskcluster service needs to record metrics and write structured logs. By default it reports any errors that cause the process to exit, and reports as warnings any errors that prevent stats from being written. These defaults can be disabled via the options described below.

Process monitoring can be turned on by using the monitor.resources(<process name>) function, where <process name> will generally be something like web or worker.

Taskcluster has some generic concepts that are able to be monitored easily using utility functions in this package. The Usage section lists these cases and shows how to use this package to measure them.

Usage

First create a monitorManager and register any custom message types on it (described later). At runtime, configure it with setup(); only then can you get prefixed monitors to use elsewhere. See the example below.

const MonitorManager = require('taskcluster-lib-monitor');

const manager = new MonitorManager({
  serviceName: 'foo',
});
manager.setup({
  mock: cfg.monitor.mock,  // false in production, true in testing
  processName: 'server',   // or otherwise for e.g., periodic tasks
});
const monitor = manager.monitor();                        // a root monitor
const childMonitor = manager.monitor('prefix');           // a child monitor
const metaMonitor = manager.monitor('prefix', {meta: 5}); // a child monitor with extra metadata

The available options to the manager's constructor are:

  • serviceName - The short name of this service.

The available options to the setup function are:

  • level - A syslog logging level. Any messages with less severity than this level will not be logged.
  • pretty - Only for development use. Pretty prints logs rather than using a structured form.
  • patchGlobal - If true (the default), any uncaught errors in the service will be reported.
  • processName - If set to a string that identifies this process, cpu and memory usage of the process will be reported on an interval. Note: This can also be turned on by monitor.resources(...) later if wanted. That allows for gracefully stopping as well.
  • mock - If true, the monitoring object will be a fake that stores data for inspection in tests rather than reporting it.
  • enable - If false, the monitoring object will only report to the console (but not store data; for deployments without monitoring).
  • metadata - an object of extra fields to attach to any logs generated by this object.
  • metadata.gitVersion - git version (for correlating errors); or...
  • gitVersionFile - file containing git version (relative to app root)
  • verify - If this is true, log messages that have been registered will be verified to define all of the required fields.

Logging

This library allows writing logs to stdout in the mozlog format.

We add an extra message field at the top level if Fields contains a message or stack entry. This is for compatibility with the logging tools we use. Configurable output formats may be added later if wanted.

We have both a Severity and a severity field, to satisfy both our logging tooling requirements and mozlog's. The lowercase field reports a string version of the severity, and the uppercase field is the syslog number for each level.

We also add a serviceContext which is used in our tooling. This contains a single field with the name of the service.

{
  "Timestamp": <time since unix epoch in nanoseconds>,
  "Type": "...",
  "Logger": "...",
  "message": "...",
  "serviceContext": {"service": "..."},
  "Hostname": "...",
  "EnvVersion": "2.0",
  "Severity": ...,
  "severity": ...,
  "Pid": ...,
  "Fields": {...}
}
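For reference, the numeric values carried in the uppercase Severity field are the standard syslog severity numbers (RFC 5424), which map onto the string values in the lowercase severity field:

```javascript
// Standard syslog severity numbers (RFC 5424). The uppercase `Severity`
// field carries the number; the lowercase `severity` field carries the name.
const SYSLOG_LEVELS = {
  emerg: 0,
  alert: 1,
  crit: 2,
  err: 3,
  warning: 4,
  notice: 5,
  info: 6,
  debug: 7,
};
```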

It provides the following functions:

  emerg(type, fields) // Not recommended for use
  alert(type, fields)
  crit(type, fields)
  err(type, fields)
  warning(type, fields)
  notice(type, fields)
  info(type, fields)
  debug(type, fields)

The type will be set in the event object as the Type field in the mozlog format. Everything in fields will be set in the Fields. We default to info logging level so normally debug logs will not be logged.

If you leave out type, the first argument will be used for fields. If fields is a string or number, we will log it in a generic message.
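A minimal sketch of how that argument defaulting could behave (the normalize helper and the 'monitor.generic' type name are illustrative assumptions, not the library's actual internals):

```javascript
// Hypothetical sketch of the argument handling described above.
function normalize(type, fields) {
  if (fields === undefined) {
    fields = type;                 // type was left out; first arg is fields
    type = 'monitor.generic';      // assumed default type name
  }
  if (typeof fields === 'string' || typeof fields === 'number') {
    fields = {message: fields};    // wrap primitives in a generic message
  }
  return {Type: type, Fields: fields};
}
```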

Registering message types

This library allows creating custom message types. We add these registered types to our published references much the same as our api and pulse documentation. To add a message type, do the following:

const manager = new MonitorManager({
  serviceName: 'notify',
});

manager.register({
  name: 'email',
  type: 'email',
  version: 1,
  level: 'info',
  description: 'A request to send an email.',
  fields: {
    address: 'The requested recipient of the email.',
  },
});

// And now after getting a monitor post-setup:
monitor.log.email({address: req.body.address});

Each part is:

  • name - This will be made available on your monitor under a .log prefix.
  • type - This will be the Type field of the logged message.
  • version - This will end up in a v field of the Fields part of a logged message. Bump this if making a backwards-incompatible change.
  • level - This will be the level at which this message logs.
  • description - A description of what this logging message means.
  • fields - An object where every key is the name of a required field to be logged. Corresponding values are documentation of the meaning of that field.

If the verify option is set to true during manager setup, this library will verify that at least the required fields have been passed into the logger upon invoking it.

When using this feature, pass in monitorManager.references() to an instance of taskcluster-lib-docs to publish.

Per-child Levels

If you set the level during setup to a single valid syslog level, it will be propagated to all child loggers. If you would like more control, you can use a specifically formatted string:

level: 'root:info root.api:debug root.handler:warning'

When you use this format, each entry is a prefix (with root included) followed by a : and a valid syslog level. You must set a value for root, and any unspecified prefixes default to it.
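One way to picture how such a string could be resolved for a given prefix (a hypothetical helper, not the library's code):

```javascript
// Hypothetical sketch: resolve the effective level for a logger prefix
// from a spec string like 'root:info root.api:debug root.handler:warning'.
function levelFor(spec, prefix) {
  const levels = {};
  for (const part of spec.split(' ')) {
    const [name, level] = part.split(':');
    levels[name] = level;
  }
  // Unspecified prefixes fall back to the level set for `root`.
  return levels[prefix] || levels['root'];
}
```

With the spec above, levelFor(spec, 'root.api') yields debug, while an unlisted prefix such as root.db falls back to info.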

Measuring and Counting Things

Note: You should prefer logging specific types rather than these generic counts and measures. They exist mostly for backwards compatibility.

To record a current measurement of a named value:

monitor.measure('foo', 10);

To increment the count of some value:

monitor.count('bar');
monitor.count('bar', 4); // increment by 4

These events will have types of monitor.measure and monitor.count respectively. The fields will have key and val.
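Assuming these delegate to the generic logging machinery described earlier, the emitted events would look roughly like this (a sketch of the event shapes, not the actual implementation):

```javascript
// Hypothetical sketch of the events monitor.count and monitor.measure emit.
function countEvent(key, val = 1) {
  return {Type: 'monitor.count', Fields: {key, val}};
}
function measureEvent(key, val) {
  return {Type: 'monitor.measure', Fields: {key, val}};
}
```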

Reporting Errors

There are lots of options for reporting errors:

// Report error as a string
monitor.reportError('Something went wrong!');
// Report error (from catch-block or something like that)
monitor.reportError(new Error("..."));
// Report an error with extra info
monitor.reportError(new Error("..."), {foo: 'bar'});
// (DEPRECATED) Report error as a warning. This will simply append 'warning' to fields
monitor.reportError(new Error("..."), 'warning');

Monitoring CPU & Memory

// Begin monitoring CPU & Memory
const stopMonitor = monitor.resources('web');

You can later call stopMonitor() to gracefully shut down the monitoring.

Timing Functions/Promises

Often we wish to measure how long an operation takes, whether synchronous or asynchronous. This can be done using the monitor.timer(key, funcOrPromise) method. It takes a key (the name of the metric) and a function or promise to time. If the function returns a promise, the measurement includes the time it takes for the promise to resolve.

The following examples are valid usages:

// Timing a synchronous operation
const root = monitor.timer('compute-sqrt', () => {
  return Math.sqrt(25);
})
assert(root === 5);

// Timing a single asynchronous function (promise)
const task = await monitor.timer('load-task', queue.task(taskId));
assert(task.workerType === '...'); // task is the task definition response

// Timing an asynchronous function
const task = await monitor.timer('poll-for-task', async () => {
  while (true) {
    try {
      return await queue.task(taskId);
    } catch (err) {
      // Ignore the error and try again
      // In the real world you would want a maximum time before you stop polling,
      // and probably some sleeping between polling attempts...
    }
  }
});
assert(task.workerType === '...'); // task is the task definition response

Rejected promises and errors will bubble up, and the time will be measured and recorded just as for successful functions or promises.

Timing Handlers

A common pattern in Taskcluster projects is to have handler functions in a worker that take a message as an argument and perform some action. These can be timed (in milliseconds) by wrapping them with taskcluster-lib-monitor:

const listener = new taskcluster.PulseListener({
  credentials: {clientId: 'test-client', accessToken: 'test'},
  queueName: 'a-queue-name',
});

const handler = function(message) {
  console.log(message);
};

listener.on('message', monitor.timedHandler('logging-listener', handler));

Specifically, timedHandler takes a function and wraps it with timing logic, returning a function with the same signature.

Express Timing Middleware

Most Taskcluster services are Express services. We can easily time how long endpoints take to respond to requests by inserting taskcluster-lib-monitor as middleware:

middleware.push(monitor.expressMiddleware('name_of_function'));

This is already integrated in taskcluster-lib-api and probably doesn't need to be implemented in your service on its own.

Timing AWS SDK Calls

Oftentimes a lot of a service's time will be spent interacting with AWS services. These interactions can be measured as in the following example:

const aws = require('aws-sdk');
const ec2 = new aws.EC2({region: 'us-west-2'});
monitor.patchAWS(ec2);
await ec2.describeAvailabilityZones().promise().catch(err => {
  debug('Ignored ec2 error, we measure duration, not success, err: ', err);
});

Timing Arbitrary Steps

If none of the above options are convenient for you, you can also just start and stop timers whenever you want. A timer may only be started and measured once; any further attempt will cause it to throw an Error.

const doodad = monitor.timeKeeper('metricName', {optional: 'extra data'});
// Do some stuff here that takes some amount of time
// ...
// Keep doing stuff here however long you like
doodad.measure();
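The single-measurement rule could be sketched like this (a hypothetical stand-in illustrating the constraint, not the library's real implementation):

```javascript
// Hypothetical sketch of a timer object that can only be measured once.
function timeKeeper(name) {
  const start = process.hrtime.bigint();
  let done = false;
  return {
    measure() {
      if (done) {
        throw new Error(`timer ${name} has already been measured`);
      }
      done = true;
      // Return the elapsed duration in milliseconds.
      return Number(process.hrtime.bigint() - start) / 1e6;
    },
  };
}
```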

Monitoring One-Shot Processes

Many Taskcluster services use one-shot processes to expire old data. The expectation is that these processes will be started at specific times, do their business, and exit. The oneShot method is designed to wrap such processes with timing and error handling support.

  'expire-info': {
    requires: ['cfg', 'monitor'],
    setup: ({cfg, monitor}) => {
      return monitor.oneShot('expire-info', () => {
        // do the expiration stuff
      });
    },  
  },  

This function will:

  • time the invocation, measured as duration
  • count the invocation if successful, measured as done
  • report any errors or promise rejections
  • shut down and flush monitoring
  • call process.exit on success or failure

Note that, because of this, the function does not return.
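The behavior above could be sketched as follows (a simplified model with an injectable exit function for illustration; the metric names and the flushing of monitoring are assumptions, not the library's actual code):

```javascript
// Hypothetical sketch of oneShot's control flow as described above.
async function oneShot(monitor, name, fn, exit = process.exit) {
  let success = false;
  const start = Date.now();
  try {
    await fn();
    success = true;
    monitor.count(`${name}.done`);           // count successful invocations
  } catch (err) {
    monitor.reportError(err);                // report errors and rejections
  } finally {
    monitor.measure(`${name}.duration`, Date.now() - start);
    exit(success ? 0 : 1);                   // the process exits either way
  }
}
```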