Answer:
Prometheus is an open-source monitoring and alerting system used to collect metrics from applications and infrastructure. It is widely used because of its pull-based model, powerful query language (PromQL), and time-series database capabilities.
Example Use Case:
- Monitoring CPU, memory, and network usage
- Collecting application performance metrics
- Alerting on high error rates or latency
Answer:
Prometheus pulls metrics from target endpoints exposed over HTTP at `/metrics`. Targets can be defined in a static configuration or discovered dynamically (e.g., Kubernetes service discovery).
Example scrape configuration (`prometheus.yml`):
```yaml
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
```
Answer:
PromQL (Prometheus Query Language) is used to query and analyze metrics stored in Prometheus. It enables users to create alerts, dashboards, and graphs.
Example Queries:
- CPU usage (percent of non-idle CPU, averaged over 5 minutes):
  ```promql
  100 * sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) / sum(rate(node_cpu_seconds_total[5m]))
  ```
- Request rate:
  ```promql
  rate(http_requests_total[5m])
  ```
Answer:
Exporters are agents that collect and expose metrics from various applications and systems.
Common Exporters:
- Node Exporter (system metrics)
- Blackbox Exporter (network probes)
- MySQL Exporter (database metrics)
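For example, Node Exporter can be run as a container and then scraped on port 9100 (a minimal sketch; the container name and flags are illustrative):
```bash
# Run Node Exporter; it exposes host metrics at :9100/metrics
docker run -d --name node-exporter -p 9100:9100 prom/node-exporter

# Verify the metrics endpoint
curl -s http://localhost:9100/metrics | head
```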
Answer:
Alerting rules are defined in a rules file (e.g., `alerting_rules.yml`) and evaluated by the Prometheus server; firing alerts are then forwarded to Alertmanager, which handles grouping, deduplication, and notification routing.
Example Rule:
```yaml
groups:
  - name: instance_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Instance {{ $labels.instance }} is down."
```
Answer:
Grafana is an open-source analytics and visualization tool used to create interactive dashboards for monitoring data from Prometheus, ELK, and other sources.
Answer:
- Log in to Grafana (http://localhost:3000).
- Navigate to "Configuration" → "Data Sources".
- Select Prometheus as the data source type.
- Enter the Prometheus URL (http://localhost:9090).
- Click "Save & Test".
Answer:
Panels are visual components in Grafana used to display data in various formats:
- Graph Panel: Time-series data visualization
- Single Stat Panel: Displays a single numeric value
- Table Panel: Tabular data display
Answer:
- Select a panel.
- Click "Edit" → "Alert".
- Define a condition using PromQL queries.
- Set the evaluation interval (e.g., every 1m).
- Configure the alert notification (Slack, Email, etc.).
Answer:
Export and import dashboards using JSON files.
Example JSON snippet:
```json
{
  "panels": [
    {
      "type": "graph",
      "title": "CPU Usage",
      "targets": [
        { "expr": "node_cpu_seconds_total", "format": "time_series" }
      ]
    }
  ]
}
```
Answer:
The ELK Stack consists of:
- Elasticsearch (search and analytics engine)
- Logstash (log processing pipeline)
- Kibana (visualization tool)
Answer:
Elasticsearch is a NoSQL, distributed search engine used to store, search, and analyze log data.
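For example, documents can be indexed and searched over its REST API (a minimal sketch; the index name and fields are illustrative):
```bash
# Index a log document
curl -X POST "localhost:9200/logs/_doc" -H 'Content-Type: application/json' -d'
{ "level": "error", "message": "disk full", "@timestamp": "2024-01-01T12:00:00Z" }'

# Search for error logs
curl -X GET "localhost:9200/logs/_search?q=level:error&pretty"
```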
Answer:
Logstash processes logs using a pipeline:
- Input: Reads logs (from files, databases, Kafka, etc.)
- Filter: Transforms logs (parse JSON, remove sensitive data)
- Output: Sends logs to Elasticsearch or other storage
Example Logstash Configuration:
```conf
input  { file { path => "/var/log/syslog" } }
filter { grok { match => { "message" => "%{SYSLOGTIMESTAMP:timestamp}" } } }
output { elasticsearch { hosts => ["localhost:9200"] } }
```
Answer:
Kibana is used to visualize and explore log data stored in Elasticsearch. It provides features like:
- Dashboards: Custom data visualizations
- Discover: Search raw logs
- Alerts: Set up log-based alerts
Answer:
Install Elasticsearch, Logstash, and Kibana:
```bash
# Install Elasticsearch
sudo apt install elasticsearch
# Install Logstash
sudo apt install logstash
# Install Kibana
sudo apt install kibana
```
Start services:
```bash
sudo systemctl start elasticsearch logstash kibana
```
Answer:
An index in Elasticsearch is like a database table that stores documents.
Example:
```bash
curl -X PUT "localhost:9200/logs"
```
Answer:
Define an output plugin in Logstash configuration:
```conf
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```
Answer:
A Kibana Visualization is a graph, chart, or table displaying log data.
Example Visualizations:
- Bar Chart (Logs per hour)
- Pie Chart (Error types distribution)
- Line Chart (CPU usage over time)
Answer:
Filebeat is a lightweight log shipper that forwards logs to Logstash or Elasticsearch.
Example Filebeat Configuration:
```yaml
filebeat.inputs:
  - type: log
    paths:
      - "/var/log/syslog"

output.elasticsearch:
  hosts: ["localhost:9200"]
```
Answer:
- Logstash: Heavyweight, processes logs with complex transformations
- Filebeat: Lightweight, only forwards logs with minimal processing
Answer:
- Pull Model (Prometheus) → The monitoring system requests data from targets at regular intervals.
- Push Model (StatsD, InfluxDB) → The target system sends data to a central monitoring system.
Prometheus uses a pull model because it provides better control over scraping intervals, avoids data duplication, and reduces unnecessary load on monitored systems. However, in some cases (e.g., short-lived jobs), Prometheus Pushgateway can be used to support push-based metrics.
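For example, a short-lived batch job can push a metric to Pushgateway before exiting (a minimal sketch; the metric and job names are illustrative):
```bash
# Push a completion timestamp for a batch job to Pushgateway (default port 9091)
echo "batch_job_last_success_timestamp $(date +%s)" \
  | curl --data-binary @- http://localhost:9091/metrics/job/batch_job
```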
Answer:
Prometheus stores time-series data efficiently, but high-cardinality metrics (many unique label combinations) can cause excessive memory and storage usage. Best practices include:
- Avoid unnecessary labels (e.g., `user_id` or `request_id`); known high-cardinality labels can also be dropped at scrape time, as shown below.
- Use histograms and summaries instead of tracking individual events.
- Enable retention policies and downsampling for old data.
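A minimal sketch of dropping a high-cardinality label with `metric_relabel_configs` (the job, target, and label name are illustrative):
```yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Drop the hypothetical high-cardinality request_id label from all series
      - action: labeldrop
        regex: request_id
```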
Answer:
Recording Rules allow precomputing and storing frequently used queries as new time-series metrics. This improves query performance.
Example:
```yaml
groups:
  - name: response_time_rules
    rules:
      - record: instance:response_time:avg
        expr: avg(rate(http_request_duration_seconds[5m]))
```
This stores the average request duration as `instance:response_time:avg`, making future queries faster.
Answer:
Thanos extends Prometheus for scalability, long-term storage, and high availability. It:
- Provides deduplication across multiple Prometheus instances.
- Enables object storage support (e.g., S3, GCS).
- Allows querying across multiple Prometheus servers via a single query layer.
Thanos is useful in multi-cluster environments where Prometheus instances are spread across multiple regions or clouds.
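A minimal sketch of a Thanos sidecar object-storage configuration (the bucket name and endpoint are illustrative):
```yaml
# objstore.yml, passed via: thanos sidecar --objstore.config-file=objstore.yml
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
```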
Answer:
Prometheus is a single-node system by design, but HA can be achieved by:
- Running multiple Prometheus replicas (scraping the same targets).
- Using Thanos or Cortex for deduplication and query federation.
- Storing time-series data externally (e.g., in S3, Bigtable).
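A minimal sketch of labeling two HA replicas so a query layer (e.g., Thanos) can deduplicate their data (the label names are conventional, not mandatory):
```yaml
# prometheus.yml on replica 0; replica 1 sets replica: prometheus-1
global:
  external_labels:
    cluster: prod
    replica: prometheus-0
```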
Answer:
Grafana supports multiple authentication methods:
- Basic authentication (default).
- OAuth providers (Google, GitHub, Azure AD, etc.).
- LDAP authentication for enterprise use.
To enable OAuth authentication, modify `grafana.ini`:
```ini
[auth.github]
enabled = true
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
```
Answer:
Templating allows users to create dynamic dashboards by using variables. Instead of hardcoding values, users can select values from dropdown menus.
Example:
```promql
rate(http_requests_total{job="$service"}[5m])
```
Here, `$service` is a variable that can be selected from a dropdown list in Grafana.
Answer:
Grafana supports automated provisioning of dashboards and data sources using YAML configuration files.
Example `datasource.yaml`:
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
```
Answer:
- Loki is Grafana's log aggregation system, similar to Elasticsearch but optimized for Kubernetes and microservices.
- Promtail is the log collection agent for pushing logs to Loki.
Promtail collects logs from `/var/log` and forwards them to Loki.
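A minimal Promtail configuration sketch (the Loki URL and log path are illustrative):
```yaml
positions:
  filename: /tmp/positions.yaml        # tracks read offsets per file

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log     # glob of files to tail
```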
Answer:
Use kube-prometheus-stack, which includes:
- Prometheus Operator (for Kubernetes metrics).
- Grafana dashboards for cluster monitoring.
- Node Exporter and Kube-State-Metrics for detailed node/pod-level metrics.
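A minimal sketch of installing it with Helm (the release and namespace names are illustrative):
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```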
Answer:
An Elasticsearch shard is a subdivision of an index. Each index is split into shards to allow parallel processing and redundancy.
- Primary Shards: Store original data.
- Replica Shards: Duplicates of primary shards for fault tolerance.
Example:
```bash
curl -X PUT "localhost:9200/logs?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 3, "number_of_replicas": 2 }
}'
```
This creates an index with 3 primary shards, each with 2 replicas (6 replica shards in total).
Answer:
ILM automates index retention policies, ensuring efficient storage use. Stages include:
- Hot Phase: Frequent reads/writes.
- Warm Phase: Less frequent queries.
- Cold Phase: Rarely accessed data.
- Delete Phase: Data deletion.
ILM is useful for managing log retention in ELK stacks.
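A minimal ILM policy sketch (the policy name and thresholds are illustrative):
```json
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "7d", "max_size": "50gb" } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```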
Answer:
Logstash uses a pipeline of input → filter → output.
Example `logstash.conf`:
```conf
input {
  beats {
    port => 5044
  }
}
filter {
  grok { match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}" } }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```
This pipeline processes logs from Filebeat → Logstash → Elasticsearch.
Answer:
- Canvas → Used for creating custom, highly stylized reports and presentations.
- Lens → Drag-and-drop interface for creating advanced visualizations easily.
Answer:
Enable security in `elasticsearch.yml`:
```yaml
xpack.security.enabled: true
```
Then configure Kibana's credentials in `kibana.yml`:
```yaml
elasticsearch.username: "kibana"
elasticsearch.password: "changeme"
```
Use role-based access control (RBAC) to restrict access.
Answer:
Beats are lightweight data shippers for sending logs, metrics, and security data to ELK.
- Filebeat: Log shipping.
- Metricbeat: System metrics.
- Packetbeat: Network monitoring.
Answer:
Curator is a tool for managing Elasticsearch indices, used for deleting old indices, snapshot backups, and optimizing performance.
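A minimal Curator action-file sketch that deletes `logs-*` indices older than 30 days (the prefix and age are illustrative):
```yaml
actions:
  1:
    action: delete_indices
    description: "Delete logs-* indices older than 30 days"
    options:
      ignore_empty_list: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 30
```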
Answer:
Use Metricbeat to collect system metrics and send them to Elasticsearch, while Prometheus Node Exporter collects Prometheus-compatible metrics.
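A minimal Metricbeat sketch for system metrics (the metricsets and period are illustrative):
```yaml
metricbeat.modules:
  - module: system
    metricsets: ["cpu", "memory", "network"]
    period: 10s

output.elasticsearch:
  hosts: ["localhost:9200"]
```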
Answer:
A slow query is a query that takes too long to execute, often due to large data scans or missing indexes. Enable slow query logs to debug:
```json
PUT _settings
{
  "index.search.slowlog.threshold.query.warn": "2s"
}
```
Answer:
- Prometheus + Grafana → Metrics-based monitoring.
- ELK Stack (Elasticsearch, Logstash, Kibana) → Log-based monitoring.
- Alternative: OpenTelemetry, Loki, and InfluxDB.
Answer:
Prometheus is a single-node system, so for large environments:
- Use multiple Prometheus instances scraping different targets.
- Federation: Create a parent Prometheus that scrapes aggregated metrics from child Prometheus instances (see the sketch below).
- Remote storage: Use Thanos, Cortex, or Mimir to store metrics in scalable object storage (S3, GCS).
- Sharding: Distribute scrape targets across Prometheus instances (e.g., with `hashmod` relabeling), typically deployed as Kubernetes StatefulSets.
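A minimal federation sketch, where a parent Prometheus scrapes selected series from child instances (the match selector and targets are illustrative):
```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node_exporter"}'
    static_configs:
      - targets: ['prometheus-child-1:9090', 'prometheus-child-2:9090']
```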
Answer:
- Stale markers: Prometheus marks time-series data as stale if a target stops reporting metrics.
- Absent function (`absent()`): Used in PromQL to detect missing metrics.
- Dead Man's Switch: A constant alert (e.g., `ALWAYS_ON`) ensures the alerting system is functional.
Example:
```promql
absent(up{job="my_service"})
```
Triggers an alert if `up{job="my_service"}` is missing.
Answer:
The Write-Ahead Log (WAL) in Prometheus:
- Stores data on disk before committing it to TSDB (Time-Series Database).
- Reduces data loss during crashes.
- WAL files are stored in `/data/wal/` and help recover metrics quickly after a restart.
Answer:
Both are used for measuring latency and response time:
- Histogram: Buckets data into predefined ranges, allowing percentiles to be calculated later.
- Summary: Precomputes percentiles but cannot be aggregated across instances.
Example (Histogram metric):
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
This calculates the 95th percentile response time.
Answer:
- Enable authentication & TLS via a reverse proxy (Nginx, Traefik).
- Use RBAC (Role-Based Access Control) in Kubernetes for limiting access.
- Set up network policies to restrict Prometheus access.
Example: using basic auth with Nginx as a reverse proxy in front of Prometheus (assumes Prometheus itself listens on an internal address such as `127.0.0.1:9091`):
```nginx
server {
    listen 9090;
    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9091;  # internal Prometheus listener
    }
}
```
Answer:
- Enable the built-in Prometheus self-metrics endpoint (`/metrics`).
- Use dashboards to monitor scrape latency, TSDB memory usage, and query duration.
- Use the federation endpoint (`/federate`) to collect meta-metrics centrally, as in the sketch below.
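A minimal sketch of Prometheus scraping its own metrics endpoint, with an example self-metric query:
```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
```
```promql
rate(prometheus_tsdb_head_samples_appended_total[5m])
```
The query shows the sample ingestion rate, a useful health signal.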
Answer:
Annotations mark events (deployments, incidents, downtimes) on Grafana graphs for better visualization.
Example: Mark a Kubernetes deployment event in Grafana.
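A minimal sketch of creating an annotation through Grafana's HTTP API (the token, tag, and text are illustrative):
```bash
curl -X POST http://localhost:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["deployment"], "text": "Deployed my-app v2.1.0"}'
```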
Answer:
- Organizations: Create multiple teams with separate dashboards.
- Data source permissions: Restrict access at the data-source level.
- Multi-instance deployment: Run separate Grafana instances for different teams.
Answer:
- Grafana alerts monitor query conditions.
- Alert states: OK, Pending, Alerting, No Data.
- Notification channels: Slack, PagerDuty, Email, Webhooks.
Example Grafana alert condition:
```promql
avg(http_requests_total) > 1000
```
This sends an alert if the average request count exceeds 1000.
Answer:
| Feature | Loki | Elasticsearch |
|---|---|---|
| Storage | Compressed logs | Full-text index |
| Querying | Label-based | Query DSL |
| Performance | Lightweight (optimized for Kubernetes) | Heavy resource usage |
Loki is recommended for lightweight, Kubernetes-native logging, while Elasticsearch is better for complex log analysis.
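For comparison, a minimal LogQL sketch (the label and filter are illustrative):
```logql
{app="nginx"} |= "error"
rate({app="nginx"} |= "error" [5m])
```
The first query returns nginx logs containing "error"; the second computes their rate over 5 minutes.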
Answer:
This strategy optimizes storage cost:
- Hot Nodes → Store recent, frequently queried data.
- Warm Nodes → Store older logs with infrequent access.
- Cold Nodes → Store archived logs for long-term retention.
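One common sketch uses custom node attributes plus index allocation settings (the attribute name and index are illustrative; newer Elasticsearch versions provide dedicated `data_hot`/`data_warm`/`data_cold` node roles instead):
```yaml
# elasticsearch.yml on a hot node
node.attr.data: hot
```
```json
PUT logs-2024.01.01/_settings
{
  "index.routing.allocation.require.data": "hot"
}
```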
Answer:
- Use ILM (Index Lifecycle Management).
- Optimize shard count (Avoid too many small shards).
- Increase refresh intervals (`index.refresh_interval: 30s`).
Answer:
- Persistent Queues → Buffer data before sending to Elasticsearch.
- Dead Letter Queue (DLQ) → Stores failed events for reprocessing.
Example:
```yaml
queue.type: persisted
queue.max_bytes: 1gb
```
Answer:
- Node query cache: Caches the results of filter clauses.
- Shard request cache: Caches full search responses such as aggregations.
- Fielddata cache: Speeds up sorting and aggregations on text fields.
Answer:
- Machine Learning Jobs → Identify unusual trends in logs.
- SIEM (Security Information and Event Management) → Detect security threats.
Example anomaly detection job:
```json
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [{ "function": "mean", "field_name": "cpu_usage" }]
  }
}
```
Answer:
- Enable TLS and security features (`xpack.security.enabled: true`).
- Use API key authentication.
- Implement firewall rules to restrict access.
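A minimal sketch of creating an API key (the name and expiration are illustrative):
```json
POST /_security/api_key
{
  "name": "monitoring-key",
  "expiration": "30d"
}
```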
Answer:
- Use Metricbeat to push Prometheus data into Elasticsearch.
- Use Grafana to visualize both Prometheus & ELK logs.
Example Metricbeat configuration:
```yaml
metricbeat.modules:
  - module: prometheus
    metricsets: ["collector"]
    hosts: ["localhost:9090"]
    period: 10s
```
Answer:
- Use filters (`term`, `match_phrase`) instead of full-text search.
- Avoid wildcard (`*`) searches.
- Use `doc_values` for sorting and aggregations.
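A minimal filtered-query sketch (the index, field names, and time range are illustrative):
```json
GET logs/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```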
Answer:
- Use Fluentd/Filebeat to collect logs.
- Send logs to Elasticsearch or Loki.
- Monitor logs via Kibana or Grafana dashboards.
Example Fluentd configuration:
<match kubernetes.**>
@type elasticsearch
host elasticsearch
logstash_format true
</match>
Answer:
- Use ILM to delete old logs automatically.
- Encrypt sensitive logs (`xpack.security`).
- Mask PII data before indexing logs.
- Enable audit logging for security compliance.
💡 Want to contribute?
We welcome contributions! If you have insights, new tools, or improvements, feel free to submit a pull request.
📌 How to Contribute?
- Read the CONTRIBUTING.md guide.
- Fix errors, add missing topics, or suggest improvements.
- Submit a pull request with your updates.
📢 Stay Updated:
⭐ Star the repository to get notified about new updates and additions.
💬 Join discussions in GitHub Issues to suggest improvements.
🔗 GitHub: @NotHarshhaa
📝 Blog: ProDevOpsGuy
💬 Telegram Community: Join Here