Datadog

Overview

Concepts

  • Monitoring: what happened in a system
  • Observability: why it happened

Logs

Example queries

| Search term | Format | Example |
| ----------- | ------ | ------- |
| tag | key:value | service:frontend |
| attribute | @key:value | @http.method:POST |
| single term | word | Response |
| sequence | group of words surrounded by double quotes | "Response fetched" |
| wildcard | tag or attribute name and value | *:prod* |
| wildcard | log message | prod* |
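
Terms separated by spaces are combined with an implicit AND; for example, a hypothetical query for POST requests from the frontend service whose message contains the phrase:

    service:frontend @http.method:POST "Response fetched"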

Configs

docker-compose.yml config for the Agent service

services:
  agent:
    image: "datadog/agent:7.31.1"
    environment:
      - DD_API_KEY
      - DD_APM_ENABLED=true
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
      - DD_PROCESS_AGENT_ENABLED=true
      - DD_DOCKER_LABELS_AS_TAGS={"my.custom.label.team":"team"}
      - DD_TAGS='env:intro-to-logs'
      - DD_HOSTNAME=intro-logs-host
    ports:
      - "8126:8126"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    labels:
      com.datadoghq.ad.logs: '[{"source": "agent", "service": "agent"}]'

Tags and attributes

  • Tags are assigned at host or container level
    • the source tag holds the integration name and determines which Log Processing Pipeline is applied
  • Attributes are extracted from logs
    • By a log processing pipeline, either a built-in Integration Pipeline, or a custom one

Facet

  • Automatically created from common tags and log attributes
  • A facet can be a measure, which is numerical and continuous and can be filtered by a range
    • eg. @network.bytes_written:[512 TO 1024]
  • You can create custom facets from log tags or attributes

Views

  • Queries can be saved as views
  • There are also predefined views, eg. Postgres, NGINX, Redis, ...

Log pathway

Log processing pathway

Log processing rules (Pipelines)

  • Each pipeline includes a list of sequential Processors
    • Each pipeline has a query filter (eg. source:nginx), only matching logs are processed by the pipeline
    • Pipelines could be nested up to one level
  • Pipelines extract attributes from each log message
  • There are out-of-the-box integration pipelines for common services
  • JSON format logs are pre-processed before pipelines
  • Processors
    • Grok
      • Regex matching
      • A pipeline can have multiple Grok parsers
      • One Grok parser can have multiple parsing rules
      • A subsequent Grok parser can be applied to an attribute extracted by preceding parsers (see the example after this list)
  • Standard Attribute
    • Processed after all the pipelines
    • Instead of adding a remapper to each pipeline, you can use this to remap a common attribute from any source
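
For example, a minimal Grok parsing rule (rule name and attribute names are illustrative) that turns a raw line such as john connected on 11/08/2017 into user and connect_date attributes:

    MyParsingRule %{word:user} connected on %{date("MM/dd/yyyy"):connect_date}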

Indexing

  • Ingested logs:
    • Available to Watchdog Insights (automated), Error Tracking, log-based metric generation, and Cloud SIEM detection rules
  • Indexed logs:
    • Can be used in monitors, dashboards, and notebooks

Metrics

Can be collected by:

  • DD Agent
  • Integrations
  • Generated within Datadog (eg. from logs)
  • Custom metrics
    • Agent
    • DogStatsD
    • HTTP API

Metric types

  • Count (times in an interval)
  • Rate (frequency)
  • Gauge (last value in an interval)
  • Histogram (five values: mean, count, median, 95th percentile, and maximum)
  • Distribution (summarizes values across all hosts)
    • Enhanced query functionality and configuration options
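
A minimal sketch of how these types map onto DogStatsD client calls in Python (metric names and values are illustrative, and assume a local Agent listening on UDP 8125):

    from datadog import initialize, statsd

    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    statsd.increment("shop.checkouts.count")            # COUNT: occurrences in an interval
    statsd.gauge("shop.cart.size", 12)                   # GAUGE: last value in an interval
    statsd.histogram("shop.request.duration", 0.25)      # HISTOGRAM: mean/count/median/p95/max per host
    statsd.distribution("shop.page.load.time", 0.31)     # DISTRIBUTION: aggregated across all hosts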

RED metrics: Rate, Errors, Duration

SLI & SLO

Service Level Indicators (SLI): metrics to measure some aspect of the level of service

Service Level Objectives (SLO): SLIs monitored over time, eg.

  • 99% of requests being successful over the past 7 days
  • less than 1 second latency 99% of the time over the past 30 days

You can create an SLO based on a monitor, then you can create a monitor on an SLO to get alerts.
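
For example, the first SLO above leaves an error budget of 1% of all requests served during the 7 days, while the second (a time-based SLO) leaves 0.01 × 30 days × 24 h ≈ 7.2 hours in which latency may exceed 1 second.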

Application Performance Monitoring (APM)

  • Trace: tracks the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans.
  • Span: represents a logical unit of work in a distributed system for a given time period. Multiple spans construct a trace.
    • Trace root span: the entry point of the entire trace; the service that generates this first span also creates the Trace ID

Instrumentation

  • You use language-specific Datadog libraries (ddtrace) in your application code.

  • Traces are submitted to Datadog Agent first, then sent to Datadog.

  • By default, Agent collects traces using TCP port 8126.

  • Instrumented applications expect some environment variables, eg. DATADOG_HOST, DD_ENV, DD_VERSION, and DD_SERVICE.

    • DD_AGENT_HOST: the host (or compose service) running the Agent
    • DD_LOGS_INJECTION: injects trace_id and span_id into logs
    • DD_TRACE_SAMPLE_RATE
    • DD_PROFILING_ENABLED: whether to enable the Continuous Profiler
    • DD_SERVICE_MAPPING: rename services
  • For a Python app, run it with the ddtrace-run command, eg.:

    DD_SERVICE="<SERVICE>" DD_ENV="<ENV>" DD_LOGS_INJECTION=true ddtrace-run python my_app.py
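
Besides automatic instrumentation via ddtrace-run, you can create spans manually in code; a minimal sketch (operation, service, resource, and tag names are illustrative):

    import time

    from ddtrace import tracer

    # The span is flushed to the local Datadog Agent (TCP port 8126 by default)
    with tracer.trace("orders.process", service="orders", resource="process_order") as span:
        span.set_tag("order.id", 42)
        time.sleep(0.1)  # stand-in for real work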

Continuous Profiler

  • Gives you insight into the system resource consumption (eg. CPU, memory and IO bottlenecks) of your applications beyond traces
  • Supported by client libraries

Network Performance Monitoring (NPM)

  • Built on eBPF (detailed visibility into network flows at the Linux kernel level)
  • Powerful and efficient with extremely low overhead
  • Can monitor DNS traffic and DNS servers

To enable with containerized agent:

    environment:
      - DD_SYSTEM_PROBE_NETWORK_ENABLED=true
      - ...
    volumes:
      - /sys/kernel/debug/:/sys/kernel/debug
      - ...
    cap_add:
      - SYS_ADMIN
      - SYS_RESOURCE
      - SYS_PTRACE
      - NET_ADMIN
      - NET_BROADCAST
      - NET_RAW
      - IPC_LOCK
      - CHOWN
    security_opt:
      - apparmor:unconfined

Integrations

Three types:

  • Agent-based (system checks), implemented as a Python class with a method called check (see the sketch after this list)
    • The check method executes every 15 seconds by default
    • A check can collect multiple metrics, events, logs, and service checks
    • Show the checks: docker compose exec datadog agent status
    • Run a specific check: docker compose exec datadog agent check disk
  • Authentication based (crawler)
    • Either pull data from other systems, using the other system's credentials
    • Or authorize other systems to push data to Datadog, using Datadog's API key
  • Library integrations, use the Datadog API to allow you to monitor applications based on the language they are written in, like Node.js or Python.
    • Imported as packages to your code
    • Use Datadog's tracing API
    • Collect performance, profiling, and debugging metrics from your application at runtime
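
A minimal sketch of an Agent-based custom check (check, metric, and tag names are illustrative; the file would live in the Agent's checks.d directory with a matching conf.d configuration):

    from datadog_checks.base import AgentCheck

    class MyCustomCheck(AgentCheck):
        def check(self, instance):
            # Runs on every collection interval (15 seconds by default)
            self.gauge("my_app.queue_depth", 7, tags=["env:dev"])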

Installation

When an integration is installed, it may also install OOTB dashboards, log processing pipelines, etc

Kubernetes

The Datadog Agent is run as a DaemonSet to ensure the Agent is deployed on all nodes in the cluster.

Agent config:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: default
spec:
  global:
    clusterName: tagging-use-cases-k8s
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: api-key
    podLabelsAsTags:
      "*": kube_pod_%%label%%

Configure podLabelsAsTags (as above) to extract pod labels as tags.

Pod config:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod61
  labels:
    component: backend
  annotations:
    ad.datadoghq.com/tags: '{"env": "production", "service": "user-service", "office": "lax", "team": "community", "role": "backend", "color": "red"}'
...

Pod labels and annotations will be extracted as tags

Tagging

Tags can be key-value pairs (eg. env:prod) or simple value tags (eg. file-server)

Reserved tag keys:

  • host: correlation between metrics, traces, processes, and logs
  • device
  • source
  • env
  • service
  • version
  • team

Unified Service Tagging: service, env, version

docker-compose

  • To map a custom container label to a tag, use this environment variable on the agent container: DD_CONTAINER_LABELS_AS_TAGS={"my.custom.label.color":"color"}

Best practices

  • trace_id, span_id can be injected as tags in logs, for correlation

Agent/library Configuration

By priority (high to low):

  • Remote configuration
  • Environment variables
  • Local configuration (remote_config.enabled setting controls whether an agent accepts Remote Configuration)
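
For example, if the APM sampling rate is set both through Remote Configuration and through the DD_TRACE_SAMPLE_RATE environment variable, the remotely configured value wins; an environment variable in turn overrides the same setting in a local configuration file.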

Remote configuration

  • Works for agents or tracing libraries
  • Enables them to pull configurations from Datadog
  • Could be enabled at organization scope
  • Supported features:
    • APM (config, sampling rate)
    • ASM (protect against OWASP, WAF attack patterns)
    • CSM (default agent rules, agentless scanning in AWS only ?)
    • Dynamic instrumentation (metrics, logs and traces from live application without code change)
    • Fleet automation
    • Control of Observability Pipelines Workers

Monitor

Notifications

There is no dedicated field for recipients; you specify them with @ mentions (email or Slack handles) in the message, as in the example below:

The {{service.name}} service container {{color.name}} has high CPU usage!!

Contact: Email - @{{service.name}}@mycompany.com, @[email protected]
Slack - @slack-{{service.name}}

Universal Service Monitoring (USM)

Enabling USM requires the following:

  • If on Linux, your service must be running in a container.
  • If on Windows and using IIS, your service must be running on a virtual machine.
  • The Datadog Agent needs to be installed alongside your service.
  • The env tag for Unified Service Tagging must be applied to your deployment.

Commonly used container tags: app, short_image, container_name

  • The short_image tag is used to discover common services, eg. short_image:nginx identifies an NGINX service

docker-compose

  • You need a few settings for the agent container to turn on USM
  • Use labels like com.datadoghq.tags.service, com.datadoghq.tags.env, and com.datadoghq.tags.version on the other containers for Unified Service Tagging

Service Catalog

For a service to show up, it needs to have unified service tags: service, env, version

Some services (eg. databases) show up in the Catalog even though they do not communicate with the Datadog Agent directly; their spans are captured by the instrumented services that call them.

You can manage metadata of a service either:

  • Manually: using the web UI
  • Automatically: GitHub or Terraform

Synthetic testing

Associate testing results to APM:

  • Not done by default
  • You must specify the URLs for which Datadog should add the necessary HTTP headers

Real User Monitoring (RUM)

  • Works for web JS and mobile apps
  • You need to instrument your app with the RUM SDK (via a <script> tag or an NPM package)

Keys

|                          | API keys | App keys | Client tokens |
| ------------------------ | -------- | -------- | ------------- |
| Scope                    | org      | user     | org           |
| Disabled with user?      | No       | Yes      | No            |
| Auth scopes customizable | No       | Yes      | No            |
| Usage                    | DD Agent | DD API   | End-user-facing applications (browser, mobile, TV) |
  • API keys
    • Datadog Agent requires an API key to submit metrics and events to Datadog
  • Application keys
    • In conjunction with your organization's API key, give users access to Datadog's programmatic API.
    • By default have the permissions and scopes of the user who created them
    • Permissions required to create or edit application keys:
      • user_app_keys permission to scope their own application keys
      • org_app_keys_write permission to scope application keys owned by any user in their organization
      • service_account_write permission to scope application keys for service accounts
    • If a user's role or permissions change, authorization scopes specified for their application keys remain unchanged

DogStatsD

  • DogStatsD consists of a server, which is bundled with the Datadog Agent, and a client library, which is available in multiple languages

    • The server can also be installed as a standalone package

  • The DogStatsD server is enabled by default over UDP port 8125 for Agent v6+. You can set a custom port for the server if necessary.

  • DogStatsD accepts custom metrics, events, and service checks over UDP and periodically aggregates and forwards them to Datadog.

  • Because it uses UDP, your application can send metrics to DogStatsD and resume its work without waiting for a response. If DogStatsD ever becomes unavailable, your application doesn’t experience an interruption.

  • As it receives data, DogStatsD aggregates multiple data points for each unique metric into a single data point over a period of time called the flush interval. DogStatsD uses a flush interval of 10 seconds.
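
A minimal sketch using the Python client (host, port, and metric/event names are illustrative; submission is fire-and-forget over UDP):

    from datadog import initialize, statsd

    initialize(statsd_host="localhost", statsd_port=8125)

    # No response is awaited; the Agent-side server aggregates and flushes every 10 seconds
    statsd.increment("login.attempts", tags=["env:dev", "service:auth"])
    statsd.event("Deploy finished", "auth service deployed", alert_type="info")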

Audit Trail

  • Retention in Datadog up to 90 days
  • Can be forwarded for archiving in Azure Storage, etc