- Overview
- Logs
- Metrics
- Application Performance Monitoring (APM)
- Network Performance Monitoring (NPM)
- Integrations
- Kubernetes
- Tagging
- Agent/library Configuration
- Monitor
- Universal Service Monitoring (USM)
- Synthetic testing
- Real User Monitoring (RUM)
- Keys
- DogStatsD
- Audit Trail
- Monitoring: what happened in a system
- Observability: why it happened
| Search term | Format | Example |
| --- | --- | --- |
| tag | `key:value` | `service:frontend` |
| attribute | `@key:value` | `@http.method:POST` |
| single term | word | `Response` |
| sequence | group of words surrounded by double quotes | `"Response fetched"` |
| wildcard | tag or attribute name and value | `*:prod*` |
| wildcard | log message | `prod*` |
`docker-compose.yml` config for the Agent service:
```yaml
services:
  agent:
    image: "datadog/agent:7.31.1"
    environment:
      - DD_API_KEY
      - DD_APM_ENABLED=true
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
      - DD_PROCESS_AGENT_ENABLED=true
      - DD_DOCKER_LABELS_AS_TAGS={"my.custom.label.team":"team"}
      - DD_TAGS='env:intro-to-logs'
      - DD_HOSTNAME=intro-logs-host
    ports:
      - "8126:8126"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    labels:
      com.datadoghq.ad.logs: '[{"source": "agent", "service": "agent"}]'
```
- Tags are assigned at host or container level
- The `source` tag holds the integration name, which corresponds to a Log Processing Pipeline
- Attributes are extracted from logs
- By a log processing pipeline, either a built-in Integration Pipeline, or a custom one
- Automatically created from common tags and log attributes
- A facet can be a measure, which is numerical and continuous, and can be filtered by a range
  - eg. `@network.bytes_written:[512 TO 1024]`
- You can create custom facets from log tags or attributes
- Queries can be saved as views
- There are also predefined views, eg. Postgres, NGINX, Redis, ...
- Each pipeline includes a list of sequential Processors
- Each pipeline has a query filter (eg. `source:nginx`); only matching logs are processed by the pipeline
- Pipelines can be nested up to one level
- Pipelines extract attributes from each log message
- There are out-of-the-box integration pipelines for common services
- JSON format logs are pre-processed before pipelines
- Processors
- Grok
- Regex matching
- A pipeline can have multiple Grok parsers
- One Grok parser can have multiple parsing rules
- A subsequent Grok parser can be used on an attribute extracted by preceding parsers
- Standard Attribute
- Processed after all the pipelines
- Instead of adding a remapper to each pipeline, you can use this to remap a common attribute from any source
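A Grok parsing rule ultimately boils down to regex matching with named captures that become log attributes. A minimal stdlib sketch (the log line format and attribute names are illustrative, not taken from a real integration pipeline):

```python
import re

# Hypothetical access-log pattern; each named group becomes a log attribute.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) \S+" '
    r'(?P<status_code>\d{3}) (?P<bytes_written>\d+)'
)

def extract_attributes(line: str) -> dict:
    """Return extracted attributes, or an empty dict if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else {}

line = '172.17.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 512'
attrs = extract_attributes(line)
```

A second parser operating on an already-extracted attribute (e.g. splitting `url` into path and query) would just be another pattern applied to `attrs["url"]`.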
- Ingested logs:
  - Used by Watchdog (automated) Insights, Error Tracking, metric generation, and Cloud SIEM detection rules
- Indexed logs:
  - Can be used in monitors, dashboards, and notebooks
Can be collected by:
- DD Agent
- Integrations
- Generated within Datadog (eg. from logs)
- Custom metrics
- Agent
- DogStatsD
- HTTP API
- Count (number of occurrences in an interval)
- Rate (frequency of occurrences per second)
- Gauge (last value in an interval)
- Histogram (five values: mean, count, median, 95th percentile, and maximum)
- Distribution (summarizes values across all the hosts)
  - Enhanced query functionality and configuration options
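As a rough sketch of how these types differ, the following simulates rolling up raw points submitted during one flush interval (simplified for illustration; not Datadog's actual rollup code):

```python
from statistics import mean, median

def flush(points, interval=10.0):
    """Aggregate raw points from one flush interval, one result per metric type."""
    return {
        "count": len(points),                 # COUNT: occurrences in the interval
        "rate": len(points) / interval,       # RATE: occurrences per second
        "gauge": points[-1],                  # GAUGE: last value wins
        "histogram": {                        # HISTOGRAM: five derived values
            "avg": mean(points),
            "count": len(points),
            "median": median(points),
            "p95": sorted(points)[int(0.95 * (len(points) - 1))],
            "max": max(points),
        },
    }

result = flush([3, 1, 4, 1, 5, 9, 2, 6])
```

Distribution metrics behave like the histogram case, but the raw points are aggregated server-side across all hosts instead of per Agent.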
RED metrics: Rate, Errors, Duration
Service Level Indicators (SLI): metrics to measure some aspect of the level of service
Service Level Objectives (SLO): SLIs monitored over time, eg.
- 99% of requests being successful over the past 7 days
- less than 1 second latency 99% of the time over the past 30 days
You can create an SLO based on a monitor, then you can create a monitor on an SLO to get alerts.
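The first example SLO above can be reasoned about with a small error-budget calculation (the request counts below are illustrative):

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """For a success-rate SLO, return (observed SLI, fraction of budget consumed)."""
    allowed_failures = (1 - slo_target) * total_requests
    sli = 1 - failed_requests / total_requests    # observed success rate
    consumed = failed_requests / allowed_failures  # 1.0 means budget exhausted
    return sli, consumed

# A 99% SLO over 1,000,000 requests allows 10,000 failures;
# 2,500 failures means a quarter of the error budget is spent.
sli, consumed = error_budget(0.99, 1_000_000, 2_500)
```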
- Trace: tracks the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans.
- Span: represents a logical unit of work in a distributed system for a given time period. Multiple spans construct a trace.
- Trace root span: The entry point of the entire trace, the service that generates this first span also creates the Trace ID
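A minimal hand-rolled sketch of the trace/span relationship (stdlib only, not ddtrace's actual API): one trace ID shared by all spans, each span timing a unit of work, with the root span as the entry point.

```python
import time
import uuid
from contextlib import contextmanager

spans = []                    # finished spans; a real tracer ships these to the Agent
trace_id = uuid.uuid4().hex   # created by the service that emits the root span

@contextmanager
def span(name, parent_id=None):
    """Record one logical unit of work as a span within the current trace."""
    span_id = uuid.uuid4().hex
    start = time.monotonic()
    try:
        yield span_id
    finally:
        spans.append({
            "trace_id": trace_id, "span_id": span_id,
            "parent_id": parent_id, "name": name,
            "duration": time.monotonic() - start,
        })

with span("web.request") as root:          # trace root span
    with span("db.query", parent_id=root):  # child span within the same trace
        time.sleep(0.01)                    # simulated work
```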
- You use language-specific Datadog tracing libraries (eg. `ddtrace`) in your application code.
- Traces are submitted to the Datadog Agent first, then sent to Datadog.
- By default, the Agent collects traces over TCP port 8126.
- Instrumented applications expect some environment variables, eg. `DATADOG_HOST`, `DD_ENV`, `DD_VERSION`, and `DD_SERVICE`.
  - `DD_AGENT_HOST`: which service hosts the agent
  - `DD_LOGS_INJECTION`: injects `trace_id` and `span_id` into logs
  - `DD_TRACE_SAMPLE_RATE`
  - `DD_PROFILING_ENABLED`: whether to enable the continuous profiler
  - `DD_SERVICE_MAPPING`: rename services
- For a Python app, run it with the `ddtrace-run` command, like: `DD_SERVICE="<SERVICE>" DD_ENV="<ENV>" DD_LOGS_INJECTION=true ddtrace-run python my_app.py`
- Gives you insight into the system resource consumption (eg. CPU, memory and IO bottlenecks) of your applications beyond traces
- Supported by client libraries
- Built on eBPF (detailed visibility into network flows at the Linux kernel level)
- Powerful and efficient with extremely low overhead
- Can monitor DNS traffic and DNS servers
To enable with containerized agent:
```yaml
environment:
  - DD_SYSTEM_PROBE_NETWORK_ENABLED=true
  - ...
volumes:
  - /sys/kernel/debug/:/sys/kernel/debug
  - ...
cap_add:
  - SYS_ADMIN
  - SYS_RESOURCE
  - SYS_PTRACE
  - NET_ADMIN
  - NET_BROADCAST
  - NET_RAW
  - IPC_LOCK
  - CHOWN
security_opt:
  - apparmor:unconfined
```
Three types:
- Agent-based (system checks): use a Python class with a method called `check`
  - The `check` method executes every 15 seconds
  - A check can collect multiple metrics, events, logs, and service checks
  - Show the checks: `docker compose exec datadog agent status`
  - Run a specific check: `docker compose exec datadog agent check disk`
- Authentication based (crawler)
- Either pull data from other systems, using those systems' credentials
- Or authorize other systems to push data to Datadog, using Datadog's API key
- Library integrations: use the Datadog API to let you monitor applications based on the language they are written in, like Node.js or Python.
- Imported as packages to your code
- Use Datadog's tracing API
- Collect performance, profiling, and debugging metrics from your application at runtime
When an integration is installed, it may also install out-of-the-box (OOTB) dashboards, log processing pipelines, etc.
The Datadog Agent is run as a DaemonSet to ensure the Agent is deployed on all nodes in the cluster.
Agent config (configure `podLabelsAsTags` to extract pod labels as tags):

```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: default
spec:
  global:
    clusterName: tagging-use-cases-k8s
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: api-key
    podLabelsAsTags:
      "*": kube_pod_%%label%%
```
Pod config:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod61
  labels:
    component: backend
  annotations:
    ad.datadoghq.com/tags: '{"env": "production", "service": "user-service", "office": "lax", "team": "community", "role": "backend", "color": "red"}'
...
```
Pod labels and annotations will be extracted as tags.

Tags can be key-value pairs (eg. `env:prod`), or simple value tags (eg. `file-server`).
Reserved tag keys:

- `host`: correlation between metrics, traces, processes, and logs
- `device`
- `source`
- `env`
- `service`
- `version`
- `team`

Unified Service Tagging: `service`, `env`, `version`
- To map a custom container label to a tag, use this environment variable on the agent container: `DD_CONTAINER_LABELS_AS_TAGS={"my.custom.label.color":"color"}`
- `trace_id` and `span_id` can be injected as tags in logs, for correlation
By priority (high to low):
- Remote configuration
- Environment variables
- Local configuration (the `remote_config.enabled` setting controls whether an agent accepts Remote Configuration)
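The priority order above can be sketched as a layered dict merge (illustrative only; the real Agent resolves this per setting, and remote values only apply when `remote_config.enabled` is on):

```python
def effective_config(remote: dict, env_vars: dict, local: dict) -> dict:
    """Merge settings by priority: remote > environment variables > local file."""
    merged = dict(local)       # lowest priority: local configuration file
    merged.update(env_vars)    # environment variables override local values
    merged.update(remote)      # remote configuration wins over everything
    return merged

config = effective_config(
    remote={"log_level": "debug"},
    env_vars={"log_level": "info", "site": "us1"},
    local={"log_level": "warn", "site": "eu1", "apm_port": 8126},
)
```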
- Works for agents or tracing libraries
- Enables them to pull configurations from Datadog
- Could be enabled at organization scope
- Supported features:
- APM (config, sampling rate)
- ASM (protect against OWASP, WAF attack patterns)
- CSM (default agent rules, agentless scanning in AWS only ?)
- Dynamic instrumentation (metrics, logs and traces from live application without code change)
- Fleet automation
- Control observability pipeline workers
There's no field dedicated to recipients; you specify them with `@` email or `@slack` handles in the message:
```
The {{service.name}} service container {{color.name}} has high CPU usage!!
Contact: Email - @{{service.name}}@mycompany.com, @[email protected]
Slack - @slack-{{service.name}}
```
Enabling USM requires the following:
- If on Linux, your service must be running in a container.
- If on Windows and using IIS, your service must be running on a virtual machine.
- The Datadog Agent needs to be installed alongside your service.
- The `env` tag for Unified Service Tagging must be applied to your deployment.
Commonly used container tags: `app`, `short_image`, `container_name`

- The `short_name` tag is used to discover common services, eg. `short_name:nginx` will identify an `nginx` service
- You need a few settings on the agent container to turn on USM
- Use `labels` like `com.datadoghq.tags.*` on other containers for tagging
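As a sketch, unified service tags can be applied to an application container via `com.datadoghq.tags.*` labels in the compose file (the service name, image, and tag values below are illustrative):

```yaml
services:
  store-frontend:
    image: "my-org/store-frontend:1.2.0"   # illustrative image name
    labels:
      com.datadoghq.tags.service: "store-frontend"
      com.datadoghq.tags.env: "intro-to-usm"
      com.datadoghq.tags.version: "1.2.0"
```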
For a service to show up, it needs the unified service tags: `service`, `env`, `version`

Some services (eg. databases) show up in the Catalog but do not communicate with the Datadog Agent directly; their traces are captured by other services.
You can manage metadata of a service either:
- Manually: using the web UI
- Automatically: GitHub or Terraform
Associate testing results to APM:
- Not done by default
- You must specify the URLs for which Datadog should add the necessary HTTP headers
- Works for web JS and mobile apps
- You need to instrument your app with the RUM SDK (via a `<script />` tag or NPM package)
|  | API keys | App Keys | Client tokens |
| --- | --- | --- | --- |
| Scope | org | user | org |
| Disabled with user? | No | Yes | No |
| Auth scopes customizable | No | Yes | No |
| Usage | DD Agent | DD API | End-user-facing applications (browser, mobile, TV) |
- API keys
- Datadog Agent requires an API key to submit metrics and events to Datadog
- Application keys
- In conjunction with your organization's API key, give users access to Datadog's programmatic API.
- By default have the permissions and scopes of the user who created them
- Permissions required to create or edit application keys:
  - `user_app_keys`: scope their own application keys
  - `org_app_keys_write`: scope application keys owned by any user in their organization
  - `service_account_write`: scope application keys for service accounts
- If a user's role or permissions change, authorization scopes specified for their application keys remain unchanged
- DogStatsD consists of a server, which is bundled with the Datadog Agent (it can also be installed as a standalone package), and a client library, which is available in multiple languages.
- The DogStatsD server is enabled by default over UDP port 8125 for Agent v6+. You can set a custom port for the server if necessary.
- DogStatsD accepts custom metrics, events, and service checks over UDP and periodically aggregates and forwards them to Datadog.
- Because it uses UDP, your application can send metrics to DogStatsD and resume its work without waiting for a response. If DogStatsD ever becomes unavailable, your application doesn't experience an interruption.
- As it receives data, DogStatsD aggregates multiple data points for each unique metric into a single data point over a period of time called the flush interval. DogStatsD uses a flush interval of 10 seconds.
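A minimal hand-rolled sketch of the client side, assuming the standard DogStatsD datagram shape `<name>:<value>|<type>|@<rate>|#<tags>` and the default port 8125 (a real app would use a Datadog client library instead):

```python
import socket

def format_datagram(metric, value, mtype, tags=(), sample_rate=1.0):
    """Build a DogStatsD datagram: <name>:<value>|<type>[|@<rate>][|#tags]."""
    parts = [f"{metric}:{value}|{mtype}"]
    if sample_rate != 1.0:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)

payload = format_datagram("page.views", 1, "c", tags=["env:dev"])

# Fire-and-forget over UDP: no response is awaited, so the app never blocks,
# even if no DogStatsD server is listening.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload.encode("utf-8"), ("127.0.0.1", 8125))
sock.close()
```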
- Retention in Datadog up to 90 days
- Can be forwarded for archiving to Azure Storage, etc.