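Building a Monitoring System for Cloud-Native Applications: A Full-Stack Observability Solution with Prometheus, Grafana, and Loki
Introduction
In the cloud-native era, applications have become markedly more complex and dynamic, and traditional monitoring can no longer keep up with them. Observability, a core requirement for cloud-native applications, goes beyond traditional monitoring and alerting to cover three kinds of signals: metrics, logs, and traces.
This article walks through building a complete monitoring system for cloud-native applications: Prometheus collects metrics, Grafana visualizes them, and Loki manages logs. Together they form a full-stack observability solution for metrics and logs; tracing is not covered here.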
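Cloud-Native Monitoring Architecture
Core Components
A modern cloud-native monitoring system typically has three layers:
- Data collection layer: gathers metrics, logs, and other signals from the various data sources
- Data storage layer: stores and manages the collected observability data efficiently
- Presentation layer: provides dashboards and alerting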
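Technology Choices
Prometheus: the core of metrics collection
Prometheus, a CNCF graduated project, offers:
- A multi-dimensional data model built on labels
- PromQL, a powerful query language
- Service discovery that finds scrape targets automatically
- High availability by running redundant instances; scaling beyond a single server relies on federation or remote storage
Grafana: the go-to choice for visualization
Grafana, an open-source visualization platform, provides:
- A wide range of chart and panel types
- Integration with many data sources
- Built-in alerting
- Flexible dashboard customization
Loki: a newer option for log management
Loki is a log aggregation system designed for cloud-native environments:
- Cost-effective: it indexes only labels rather than full log content, which keeps the index small
- A label model similar to Prometheus
- Native Kubernetes integration
- Easy to scale and maintain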
Setting Up Prometheus Metrics Collection
Installation and Configuration
Deploying with Docker
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # The rule file referenced by rule_files in prometheus.yml must be mounted as well
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
volumes:
  prometheus_data:
Core Configuration File
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Scrape Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path, if set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the prometheus.io/port annotation as the scrape port, if set
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
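The last relabel rule above can be hard to read at a glance. The following is a minimal Go sketch of what it does, assuming the documented relabeling behavior: source label values are joined with the default separator `;`, the regex is fully anchored, and a `replace` action leaves the target label untouched when the regex does not match. The helper name `relabelAddress` and the sample addresses are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// addressRule is the regex from the relabel rule, anchored at both ends
// because relabel regexes must match the whole joined value.
var addressRule = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?;(\d+))$`)

// relabelAddress is a hypothetical helper that mimics the rule:
// it joins __address__ and the port annotation with ";" and, on a match,
// rewrites the address to "<host>:<annotated port>".
func relabelAddress(address, portAnnotation string) string {
	joined := address + ";" + portAnnotation
	if !addressRule.MatchString(joined) {
		// No match: a replace action leaves the target label unchanged.
		return address
	}
	return addressRule.ReplaceAllString(joined, "$1:$2")
}

func main() {
	// Pod discovered on port 8080, annotated with prometheus.io/port: "9102".
	fmt.Println(relabelAddress("10.0.0.5:8080", "9102"))
	// Pod discovered without a port.
	fmt.Println(relabelAddress("10.0.0.5", "9102"))
	// Pod without the port annotation keeps its discovered address.
	fmt.Println(relabelAddress("10.0.0.5:8080", ""))
}
```

Note that the `[^:]+` host pattern assumes IPv4 addresses or hostnames; it does not match bracketed IPv6 addresses.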
Service Discovery Configuration
Kubernetes Integration
# prometheus-rbac.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  # Ingresses live in networking.k8s.io; the old extensions group was removed in Kubernetes 1.22
  - apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
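Collecting Custom Metrics
Instrumenting Application Code
// Integrating the Prometheus client into a Go application
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// statusRecorder remembers the status code the wrapped handler writes,
// so the status label reflects the real response instead of a fixed "200".
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrumentedHandler(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200, which is what net/http sends if WriteHeader is never called
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		// Call the original handler
		next(rec, r)
		duration := time.Since(start).Seconds()
		// Record metrics. r.URL.Path is fine for fixed routes, but paths that
		// embed IDs would create a new label value per ID (high cardinality).
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	}
}

func main() {
	http.HandleFunc("/api/users", instrumentedHandler(func(w http.ResponseWriter, r *http.Request) {
		// Business logic
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Users data"))
	}))
	// Expose the Prometheus metrics endpoint
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}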
Configuring Grafana for Visualization
Installation and Basic Configuration
Deploying with Docker
# grafana-docker-compose.yml
version: '3.8'
services:
  grafana:
    image: grafana/grafana-enterprise:10.0.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      # Example password only; replace it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
    restart: unless-stopped
volumes:
  grafana_data:
Data Source Configuration
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      maxLines: 1000
Dashboard Configuration
Provisioning dashboards from files
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
Core business metrics dashboard
The JSON below uses the "dashboard" wrapper expected by Grafana's HTTP API. For file-based provisioning, save only the inner dashboard object.
{
  "dashboard": {
    "id": null,
    "title": "Application Performance Monitoring",
    "timezone": "browser",
    "schemaVersion": 16,
    "version": 0,
    "refresh": "30s",
    "panels": [
      {
        "type": "graph",
        "title": "HTTP Request Rate",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}",
            "refId": "A"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "type": "stat",
        "title": "Error Rate (%)",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
            "refId": "A"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 }
      },
      {
        "type": "gauge",
        "title": "CPU Usage (%)",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A"
          }
        ],
        "gridPos": { "x": 18, "y": 0, "w": 6, "h": 4 }
      }
    ]
  }
}
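The panels above lean on PromQL's `rate()` over a counter. As a rough intuition for what that function computes, here is a simplified Go sketch: it sums the increases between consecutive samples, treats any decrease as a counter reset, and divides by the elapsed time. This is an illustration under stated assumptions, not Prometheus's implementation; in particular, the real `rate()` also extrapolates to the edges of the query window. The `sample` type and `simpleRate` function are hypothetical names.

```go
package main

import "fmt"

// sample is one scraped counter value at a timestamp in seconds.
type sample struct {
	ts    float64
	value float64
}

// simpleRate returns the average per-second increase across the samples.
// A drop in value is interpreted as a counter reset (process restart),
// in which case the new value itself counts as the increase.
func simpleRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].value - samples[i-1].value
		if delta < 0 {
			delta = samples[i].value // counter restarted from zero
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].ts - samples[0].ts
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// The counter resets between t=20 and t=40.
	samples := []sample{{0, 100}, {20, 130}, {40, 10}, {60, 30}}
	fmt.Printf("requests per second: %.2f\n", simpleRate(samples))
}
```

The error-rate panel divides two such rates, which is why both sides of the division use the same `[5m]` window.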
Alert Rule Configuration
Prometheus alert rules
# alert_rules.yml
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the 0.05 threshold means 5%
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate (instance {{ $labels.instance }})"
          description: "Error rate above 5% over the last 5 minutes: {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High latency (instance {{ $labels.instance }})"
          description: "95th-percentile request latency above 1 second: {{ $value }}"
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable (instance {{ $labels.instance }})"
          description: "The service has stopped responding"
Grafana alert rules
# grafana/provisioning/alerting/alerts.yml
apiVersion: 1
groups:
  - orgId: 1
    name: app_alerts
    folder: App Alerts
    interval: 60s
    rules:
      - uid: high_error_rate
        title: High error rate alert
        condition: B
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            # Replace with the UID of your Prometheus data source
            datasourceUid: PBFA97CFB590B2093
            model:
              # Ratio of 5xx responses to all responses, so 0.05 below means 5%
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: "-100"
            model:
              conditions:
                - evaluator:
                    params:
                      - 0.05
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: "-100"
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
              type: classic_conditions
        dashboardUid: ""
        panelId: 0
        noDataState: "NoData"
        execErrState: "Alerting"
        for: 120s
        annotations:
          summary: Application error rate is too high
        labels:
          severity: warning
Deploying Loki for Log Management
Installation and Configuration
Deploying with Docker Compose
# loki-docker-compose.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:2.8.4
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
  promtail:
    image: grafana/promtail:2.8.4
    container_name: promtail
    volumes:
      - /var/log:/var/log
      # Also mount the application log directory if Promtail should read /app/logs
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped
Loki Configuration File
# loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
common:
  # Data under /tmp/loki is lost when the container is recreated; mount a volume for persistence
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h
storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
Promtail Configuration File
# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: applogs
          __path__: /app/logs/*.log
    pipeline_stages:
      # Parse each line as JSON and extract three fields
      - json:
          expressions:
            level: level
            message: message
            timestamp: timestamp
      # Promote the extracted level to a stream label
      - labels:
          level:
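To make the `json` and `labels` stages above more concrete, here is a small Go sketch of their combined effect on a single log line: parse the line as JSON, pull out `level`, and attach it as a label next to the static `job` label. It is an approximation for illustration, not Promtail's code; the function name `extractLabels` and the sample log line are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractLabels copies the stream's base labels and, if the line is JSON
// with a non-empty string "level" field, adds it as a label.
// Lines that are not valid JSON keep the base labels unchanged.
func extractLabels(line string, base map[string]string) map[string]string {
	labels := make(map[string]string, len(base)+1)
	for k, v := range base {
		labels[k] = v
	}
	var fields map[string]interface{}
	if err := json.Unmarshal([]byte(line), &fields); err != nil {
		return labels
	}
	if level, ok := fields["level"].(string); ok && level != "" {
		labels["level"] = level
	}
	return labels
}

func main() {
	base := map[string]string{"job": "applogs"}
	line := `{"level":"ERROR","message":"db timeout","timestamp":"2024-01-01T00:00:00Z"}`
	fmt.Println(extractLabels(line, base))
	fmt.Println(extractLabels("plain text line", base))
}
```

Only low-cardinality fields such as `level` are good label candidates, because every distinct label value creates a separate log stream in Loki.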
Kubernetes Integration
Promtail DaemonSet configuration
# promtail-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      # serviceAccountName replaces the deprecated serviceAccount field
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:2.8.4
          args:
            - -config.file=/etc/promtail/promtail.yaml
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: logs
              mountPath: /var/log
            - name: pods
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: logs
          hostPath:
            path: /var/log
        - name: pods
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    clients:
      - url: http://loki:3100/loki/api/v1/push
    positions:
      filename: /tmp/positions.yaml
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          # Parses Docker's JSON log format; on containerd or CRI-O nodes use "cri: {}" instead
          - docker: {}
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_node_name
            target_label: __host__
          - action: replace
            replacement: $1
            separator: /
            source_labels:
              - __meta_kubernetes_namespace
              - __meta_kubernetes_pod_name
            target_label: job
          - action: replace
            source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
          # Build the log file glob from the pod UID and container name
          - replacement: /var/log/pods/*$1/*.log
            separator: /
            source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
            target_label: __path__
Full-Stack Integration and Best Practices
A Unified Monitoring View
Creating a combined dashboard
{
  "dashboard": {
    "title": "Full-Stack Observability Dashboard",
    "panels": [
      {
        "type": "row",
        "title": "Application Metrics",
        "collapsed": false
      },
      {
        "type": "graph",
        "title": "Request Rate",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "type": "row",
        "title": "Infrastructure",
        "collapsed": false
      },
      {
        "type": "graph",
        "title": "CPU Usage (%)",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "type": "row",
        "title": "Log Analysis",
        "collapsed": false
      },
      {
        "type": "logs",
        "title": "Application Logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"applogs\", level=\"ERROR\"}",
            "refId": "A"
          }
        ]
      }
    ]
  }
}
Performance Optimization Strategies
Prometheus optimization
# prometheus-optimized.yml
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
storage:
  tsdb:
    out_of_order_time_window: 30m
remote_write:
  - url: http://remote-storage:9090/api/v1/write
    write_relabel_configs:
      # Do not forward Prometheus's own internal metrics to remote storage
      - source_labels: [__name__]
        regex: 'prometheus_.*'
        action: drop
scrape_configs:
  - job_name: 'kubernetes-pods'
    scrape_interval: 1m
    scrape_timeout: 30s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Copy every pod label onto the scraped series
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
Loki performance tuning
# loki-optimized-config.yaml
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  aws:
    # Placeholders: substitute your own credentials, region, and bucket
    s3: s3://access_key:secret_access_key@region/bucket_name
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h
    # boltdb-shipper needs to know which object store holds the shipped index files
    shared_store: s3
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_entries_limit_per_query: 10000
  max_query_length: 168h
  max_query_parallelism: 14
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  creation_grace_period: 10m
  # 0 disables the stream limits entirely; set explicit limits in multi-tenant setups
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  unordered_writes: true
Security Configuration
Network security
# prometheus-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        # Matches namespaces that carry the label name=monitoring
        - namespaceSelector:
            matchLabels:
              name: monitoring
        - podSelector:
            matchLabels:
              app: grafana
      ports:
        - protocol: TCP
          port: 9090
  egress:
    # Only port 9100 (Node Exporter) is allowed out. Add rules for DNS,
    # Alertmanager, and any other scrape ports Prometheus needs to reach.
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9100
Authentication and authorization
# grafana-security-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-secret
  namespace: monitoring
type: Opaque
data:
  # Placeholder: put the base64-encoded password here
  admin-password: base64_encoded_password
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-security-config
  namespace: monitoring
data:
  grafana.ini: |
    [security]
    admin_user = admin
    # Requires the Secret above to be mounted at /etc/grafana/secrets
    admin_password = $__file{/etc/grafana/secrets/admin-password}
    [auth]
    disable_login_form = false
    disable_signout_menu = false
    [auth.anonymous]
    enabled = false
    [auth.basic]
    enabled = true
Operating the Monitoring System
Health Checks
Prometheus health check script
#!/bin/bash
# prometheus-health-check.sh
# Requires curl and jq.
PROMETHEUS_URL="http://localhost:9090"
TIMEOUT=10

# Check that Prometheus responds
check_prometheus() {
    local response
    response=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT" "$PROMETHEUS_URL/-/healthy")
    if [ "$response" = "200" ]; then
        echo "Prometheus is healthy"
        return 0
    else
        echo "Prometheus is unhealthy (HTTP $response)"
        return 1
    fi
}

# Check scrape target status: count active targets whose health is not "up"
check_targets() {
    local targets
    targets=$(curl -s --max-time "$TIMEOUT" "$PROMETHEUS_URL/api/v1/targets" \
        | jq '.data.activeTargets | map(select(.health != "up")) | length')
    if [ "$targets" = "0" ]; then
        echo "All targets are up"
        return 0
    else
        echo "$targets targets are down"
        return 1
    fi
}

# Check rule status: count alerting rules currently in the firing state
check_rules() {
    local rules
    rules=$(curl -s --max-time "$TIMEOUT" "$PROMETHEUS_URL/api/v1/rules" \
        | jq '[.data.groups[].rules[] | select(.state == "firing")] | length')
    if [ "$rules" = "0" ]; then
        echo "No rules are firing"
        return 0
    else
        echo "$rules rules are firing"
        return 1
    fi
}

# Run all checks; stop at the first failure
check_prometheus && check_targets && check_rules
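The same target check can be done from Go when the health check needs to live inside a service rather than a shell script. The sketch below decodes a response shaped like the one from `/api/v1/targets` and lists the scrape URLs of targets that are not `up`. It decodes only the fields it needs (`status`, `scrapeUrl`, `health`, `lastError`); the function name `unhealthyTargets` and the sample payload are hypothetical, and fetching the body over HTTP is left out to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse mirrors the subset of the targets API response used here.
type targetsResponse struct {
	Status string `json:"status"`
	Data   struct {
		ActiveTargets []struct {
			ScrapeURL string `json:"scrapeUrl"`
			Health    string `json:"health"`
			LastError string `json:"lastError"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// unhealthyTargets returns the scrape URLs of all active targets
// whose health is anything other than "up".
func unhealthyTargets(body []byte) ([]string, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding targets response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected API status %q", resp.Status)
	}
	var down []string
	for _, t := range resp.Data.ActiveTargets {
		if t.Health != "up" {
			down = append(down, t.ScrapeURL)
		}
	}
	return down, nil
}

func main() {
	// Made-up payload for illustration.
	payload := []byte(`{"status":"success","data":{"activeTargets":[
		{"scrapeUrl":"http://node-a:9100/metrics","health":"up","lastError":""},
		{"scrapeUrl":"http://node-b:9100/metrics","health":"down","lastError":"connection refused"}
	]}}`)
	down, err := unhealthyTargets(payload)
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Printf("%d target(s) down: %v\n", len(down), down)
}
```

Unlike the shell version, a failed or malformed response surfaces as an explicit error instead of being reported as targets being down.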
Data
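This article is from 极简博客, written by 神秘剑客. When reposting, please credit the original link: Building a Monitoring System for Cloud-Native Applications: A Full-Stack Observability Solution with Prometheus, Grafana, and Loki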