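Building a Cloud-Native Application Monitoring System: A Full-Stack Observability Solution with Prometheus, Grafana, and Loki

Introduction

In the cloud-native era, applications have become markedly more complex and dynamic, and traditional monitoring approaches no longer meet their needs. Observability, a core requirement for cloud-native applications, goes beyond traditional monitoring and alerting to cover three dimensions: metrics, logs, and traces.

This article walks through building a complete monitoring system for cloud-native applications: Prometheus collects metrics, Grafana provides visualization, and Loki manages logs. Together they form a full-stack observability solution.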

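Cloud-native monitoring architecture

Core component overview

A modern cloud-native monitoring system typically uses a three-layer architecture:

Data collection layer: collects metrics, logs, and other signals from the various data sources
Data storage layer: stores and manages the collected observability data efficiently
Data presentation layer: provides visualization and alerting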

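Technology selection

Prometheus: the core of metrics collection

Prometheus, a CNCF graduated project, offers the following advantages:

A multi-dimensional data model built on labels
PromQL, a powerful query language
Service discovery that finds scrape targets automatically
High availability by running identical replicas; scaling beyond a single server relies on federation or remote storage rather than built-in clustering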

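Grafana: the first choice for visualization

Grafana is an open-source visualization platform. Its main features are:

A rich set of chart types and panels
Integration with many data sources
Built-in alerting
Flexible dashboard customization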

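Loki: a newer option for log management

Loki is a log aggregation system designed for cloud-native environments:

Cost-effective: it indexes only labels, not the full log text
A label model similar to Prometheus
Native Kubernetes integration
Easy to scale and maintain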

Setting up Prometheus metrics collection

Installation and configuration

Docker deployment

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
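      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # rule_files in prometheus.yml resolves relative to /etc/prometheus
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml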
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped

volumes:
  prometheus_data:

Core configuration file

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Service discovery configuration

Kubernetes integration

# prometheus-rbac.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

Custom metrics collection

Application code integration

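// Integrating the Prometheus client into a Go application
package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// statusRecorder wraps http.ResponseWriter and remembers the status code
// written by the handler so it can be used as a label value.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

func instrumentedHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Default to 200: net/http sends it when the handler never calls WriteHeader.
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

        // Call the wrapped handler
        next(rec, r)

        duration := time.Since(start).Seconds()

        // Record the metrics. r.URL.Path is used as the endpoint label for
        // brevity; paths that embed IDs should be normalized first to keep
        // label cardinality bounded.
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    http.HandleFunc("/api/users", instrumentedHandler(func(w http.ResponseWriter, r *http.Request) {
        // Business logic
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Users data"))
    }))

    // Expose the Prometheus metrics endpoint
    http.Handle("/metrics", promhttp.Handler())

    log.Fatal(http.ListenAndServe(":8080", nil))
}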
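The handler above uses `r.URL.Path` as the `endpoint` label. For routes that embed identifiers (for example `/api/users/42`), every distinct ID would create a new time series. The following is a minimal sketch of one way to bound that, assuming purely numeric segments are the only variable parts; `routeLabel` and `isNumeric` are hypothetical helpers, not part of client_golang.

```go
package main

import (
	"fmt"
	"strings"
)

// routeLabel replaces purely numeric path segments with ":id" so that
// /api/users/1 and /api/users/2 share a single label value.
func routeLabel(path string) string {
	segments := strings.Split(path, "/")
	for i, seg := range segments {
		if isNumeric(seg) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

// isNumeric reports whether s is non-empty and consists only of ASCII digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(routeLabel("/api/users"))             // /api/users
	fmt.Println(routeLabel("/api/users/42/orders/7")) // /api/users/:id/orders/:id
}
```

With such a helper, the label value passed to `WithLabelValues` would be `routeLabel(r.URL.Path)` instead of the raw path. Applications that use a router exposing the matched route pattern can use that pattern directly.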

Configuring the Grafana visualization platform

Installation and basic configuration

Docker deployment

# grafana-docker-compose.yml
version: '3.8'
services:
  grafana:
    image: grafana/grafana-enterprise:10.0.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
    restart: unless-stopped

volumes:
  grafana_data:

Data source configuration

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      maxLines: 1000

Dashboard configuration

Provisioning dashboards from configuration files

# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards

Core business metrics dashboard

{
  "dashboard": {
    "id": null,
    "title": "Application Performance Monitoring",
    "timezone": "browser",
    "schemaVersion": 16,
    "version": 0,
    "refresh": "30s",
    "panels": [
      {
        "type": "graph",
        "title": "HTTP Request Rate",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}",
            "refId": "A"
          }
        ],
        "gridPos": {
          "x": 0,
          "y": 0,
          "w": 12,
          "h": 8
        }
      },
      {
        "type": "stat",
        "title": "Error Rate",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
            "refId": "A"
          }
        ],
        "gridPos": {
          "x": 12,
          "y": 0,
          "w": 6,
          "h": 4
        }
      },
      {
        "type": "gauge",
        "title": "CPU Usage",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A"
          }
        ],
        "gridPos": {
          "x": 18,
          "y": 0,
          "w": 6,
          "h": 4
        }
      }
    ]
  }
}
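One detail worth knowing about the JSON above: the top-level `"dashboard"` wrapper is the payload shape used by Grafana's HTTP API for creating dashboards, whereas a file picked up by the file provider is expected to contain the bare dashboard model. A minimal sketch of unwrapping such a payload before writing it to the provisioning directory follows; `unwrapDashboard` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// unwrapDashboard returns the bare dashboard model from an API-style payload
// of the form {"dashboard": {...}}. Input without a "dashboard" key is
// assumed to be a bare model already and is returned unchanged.
func unwrapDashboard(payload []byte) ([]byte, error) {
	var envelope map[string]json.RawMessage
	if err := json.Unmarshal(payload, &envelope); err != nil {
		return nil, fmt.Errorf("parse dashboard JSON: %w", err)
	}
	inner, ok := envelope["dashboard"]
	if !ok {
		return payload, nil
	}
	if len(inner) == 0 || inner[0] != '{' {
		return nil, errors.New(`"dashboard" is not a JSON object`)
	}
	return inner, nil
}

func main() {
	wrapped := []byte(`{"dashboard": {"title": "Application Performance Monitoring", "panels": []}}`)
	model, err := unwrapDashboard(wrapped)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// model can now be written to a .json file under the provider's path.
	fmt.Println(string(model))
}
```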

Alert rule configuration

Prometheus alert rules

# alert_rules.yml
groups:
- name: application_alerts
  rules:
  - alert: HighErrorRate
    # Share of 5xx responses among all requests, per instance
    expr: |
      sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum by (instance) (rate(http_requests_total[5m])) > 0.05
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High error rate (instance {{ $labels.instance }})"
      description: "More than 5% of requests failed over the last 5 minutes: {{ $value | humanizePercentage }}"

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High latency (instance {{ $labels.instance }})"
      description: "95th-percentile request latency is above 1 second: {{ $value }}s"

  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service unavailable (instance {{ $labels.instance }})"
      description: "The service has stopped responding"
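The HighLatency rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the requested rank. The sketch below is a simplified Go rendering of that idea, not the Prometheus implementation: it assumes non-negative bucket bounds and skips several edge cases. The `bucket` type and `sampleBuckets` data are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // value of the "le" label; +Inf for the last bucket
	count      float64 // observations (or rate) with value <= upperBound
}

// sampleBuckets is made-up data: 100 requests, 90 of them within 1 second.
var sampleBuckets = []bucket{
	{0.1, 10}, {0.5, 60}, {1, 90}, {math.Inf(1), 100},
}

// quantile estimates the q-quantile (0 <= q <= 1). It returns NaN when the
// input has fewer than two buckets, lacks a +Inf bucket, or is empty.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].upperBound < sorted[j].upperBound })

	last := sorted[len(sorted)-1]
	if !math.IsInf(last.upperBound, 1) || last.count == 0 {
		return math.NaN()
	}

	rank := q * last.count
	// First finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(sorted)-1, func(i int) bool { return sorted[i].count >= rank })
	if i == len(sorted)-1 {
		// The rank falls into the +Inf bucket: report the highest finite bound.
		return sorted[len(sorted)-2].upperBound
	}

	start, prev := 0.0, 0.0
	if i > 0 {
		start, prev = sorted[i-1].upperBound, sorted[i-1].count
	}
	end := sorted[i].upperBound
	if sorted[i].count == prev {
		return end
	}
	// Linear interpolation within the bucket.
	return start + (end-start)*(rank-prev)/(sorted[i].count-prev)
}

func main() {
	fmt.Printf("p50 ~ %.2f s\n", quantile(0.50, sampleBuckets))
	fmt.Printf("p95 ~ %.2f s\n", quantile(0.95, sampleBuckets))
}
```

With the sample data, the 95th percentile lands in the `+Inf` bucket and is reported as the highest finite bound (1 second). This is why bucket boundaries should bracket the latency threshold an alert cares about.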

Grafana alert rules

# grafana/provisioning/alerting/alerts.yml
apiVersion: 1
groups:
  - orgId: 1
    name: app_alerts
    folder: App Alerts
    interval: 60s
    rules:
      - uid: high_error_rate
        title: High error rate alert
        condition: B
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: PBFA97CFB590B2093
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: "-100"
            model:
              conditions:
                - evaluator:
                    params:
                      - 0.05
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: "-100"
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
              type: classic_conditions
        dashboardUid: ""
        panelId: 0
        noDataState: "NoData"
        execErrState: "Alerting"
        for: 120s
        annotations:
          summary: Application error rate is too high
        labels:
          severity: warning

Deploying the Loki log management system

Installation and configuration

Docker Compose deployment

# loki-docker-compose.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:2.8.4
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.8.4
    container_name: promtail
    volumes:
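      - /var/log:/var/log
      # Assumed host path: mount wherever the application writes the logs
      # matched by /app/logs/*.log in promtail-config.yaml
      - ./app/logs:/app/logs:ro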
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped

Loki configuration file

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Promtail configuration file

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log

  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: applogs
          __path__: /app/logs/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            timestamp: timestamp
      - labels:
          level:
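The `pipeline_stages` section of the `application` job parses each log line as JSON and promotes the extracted `level` value to a label. As a rough illustration of that behavior (not Promtail's actual code), the sketch below extracts `level` from a JSON line. `extractLevel` is a hypothetical helper, and returning no label for non-JSON lines is an assumption made for the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractLevel parses a JSON log line and returns its "level" field, which
// the labels stage would then attach as the "level" label. ok is false when
// the line is not a JSON object or has no string "level" field.
func extractLevel(line string) (level string, ok bool) {
	var fields map[string]interface{}
	if err := json.Unmarshal([]byte(line), &fields); err != nil {
		return "", false
	}
	level, ok = fields["level"].(string)
	return level, ok
}

func main() {
	lines := []string{
		`{"level":"ERROR","message":"db timeout","timestamp":"2024-01-01T00:00:00Z"}`,
		`plain text line without structure`,
	}
	for _, line := range lines {
		if level, ok := extractLevel(line); ok {
			fmt.Printf("labels: {job=\"applogs\", level=%q}\n", level)
		} else {
			fmt.Println("labels: {job=\"applogs\"}")
		}
	}
}
```

Only fields with a small, fixed set of values (such as a log level) are good label candidates. Promoting high-cardinality fields like request IDs to labels would multiply the number of streams Loki has to index.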

Kubernetes integration

Promtail DaemonSet configuration

# promtail-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:2.8.4
          args:
            - -config.file=/etc/promtail/promtail.yaml
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: logs
              mountPath: /var/log
            - name: pods
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: logs
          hostPath:
            path: /var/log
        - name: pods
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    clients:
      - url: http://loki:3100/loki/api/v1/push

    positions:
      filename: /tmp/positions.yaml

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          - docker: {}
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_node_name
            target_label: __host__
          - action: replace
            replacement: $1
            separator: /
            source_labels:
              - __meta_kubernetes_namespace
              - __meta_kubernetes_pod_name
            target_label: job
          - action: replace
            source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
          - replacement: /var/log/pods/*$1/*.log
            separator: /
            source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
            target_label: __path__

Full-stack integration and best practices

Unified monitoring view

Creating a combined dashboard

{
  "dashboard": {
    "title": "Full-Stack Observability Dashboard",
    "panels": [
      {
        "type": "row",
        "title": "Application Metrics",
        "collapsed": false
      },
      {
        "type": "graph",
        "title": "Request Rate",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "type": "row",
        "title": "Infrastructure",
        "collapsed": false
      },
      {
        "type": "graph",
        "title": "CPU Usage",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "type": "row",
        "title": "Log Analysis",
        "collapsed": false
      },
      {
        "type": "logs",
        "title": "Application Logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"applogs\", level=\"ERROR\"}",
            "refId": "A"
          }
        ]
      }
    ]
  }
}

Performance optimization strategies

Prometheus optimization

# prometheus-optimized.yml
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s

storage:
  tsdb:
    out_of_order_time_window: 30m

remote_write:
  - url: http://remote-storage:9090/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'prometheus_.*'
        action: drop

scrape_configs:
  - job_name: 'kubernetes-pods'
    scrape_interval: 1m
    scrape_timeout: 30s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

Loki performance tuning

# loki-optimized-config.yaml
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://access_key:secret_access_key@region/bucket_name
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_entries_limit_per_query: 10000
  max_query_length: 168h
  max_query_parallelism: 14
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  creation_grace_period: 10m
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  unordered_writes: true
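The `ingestion_rate_mb` and `ingestion_burst_size_mb` limits above behave conceptually like a token bucket: the burst size is the bucket capacity and the rate is how fast it refills. The following is a simplified, deterministic sketch of that idea under the configured values (10 MB/s, 20 MB burst). It is not Loki's limiter, and the `tokenBucket` type is hypothetical.

```go
package main

import "fmt"

// tokenBucket is a minimal token bucket measured in bytes.
type tokenBucket struct {
	rate   float64 // refill rate in bytes per second
	burst  float64 // bucket capacity in bytes
	tokens float64 // bytes currently available
	last   float64 // time of the last refill, in seconds
}

// newTokenBucket starts with a full bucket.
func newTokenBucket(rateMB, burstMB float64) *tokenBucket {
	const mb = 1 << 20
	return &tokenBucket{rate: rateMB * mb, burst: burstMB * mb, tokens: burstMB * mb}
}

// allow reports whether a push of n bytes arriving at time now (in seconds)
// fits within the limits, and consumes the tokens if it does.
func (b *tokenBucket) allow(now float64, n int) bool {
	if elapsed := now - b.last; elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if float64(n) > b.tokens {
		return false
	}
	b.tokens -= float64(n)
	return true
}

func main() {
	limiter := newTokenBucket(10, 20)
	fmt.Println(limiter.allow(0, 20<<20)) // true: the full burst is available
	fmt.Println(limiter.allow(0, 1))      // false: the bucket is empty
	fmt.Println(limiter.allow(1, 10<<20)) // true: one second refilled 10 MB
}
```

Read this way, the burst size should be large enough to absorb the biggest single push a client sends, while the rate caps sustained throughput per tenant.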

Security configuration

Network security

# prometheus-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    - podSelector:
        matchLabels:
          app: grafana
    ports:
    - protocol: TCP
      port: 9090
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 9100

Authentication and authorization

# grafana-security-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-secret
  namespace: monitoring
type: Opaque
data:
  admin-password: base64_encoded_password

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-security-config
  namespace: monitoring
data:
  grafana.ini: |
    [security]
    admin_user = admin
    admin_password = $__file{/etc/grafana/secrets/admin-password}
    
    [auth]
    disable_login_form = false
    disable_signout_menu = false
    
    [auth.anonymous]
    enabled = false
    
    [auth.basic]
    enabled = true

Operating and maintaining the monitoring system

Health checks

Prometheus health check script

#!/bin/bash
# prometheus-health-check.sh

PROMETHEUS_URL="http://localhost:9090"
TIMEOUT=10

# Check that Prometheus responds
check_prometheus() {
    local response=$(curl -s -o /dev/null -w "%{http_code}" --max-time $TIMEOUT $PROMETHEUS_URL/-/healthy)
    
    if [ "$response" = "200" ]; then
        echo "Prometheus is healthy"
        return 0
    else
        echo "Prometheus is unhealthy (HTTP $response)"
        return 1
    fi
}

# Check target health
check_targets() {
    local targets=$(curl -s --max-time $TIMEOUT $PROMETHEUS_URL/api/v1/targets | jq '.data.activeTargets | map(select(.health != "up")) | length')
    
    if [ "$targets" = "0" ]; then
        echo "All targets are up"
        return 0
    else
        echo "$targets targets are down"
        return 1
    fi
}

# Check alert rule state
check_rules() {
    local rules=$(curl -s --max-time $TIMEOUT $PROMETHEUS_URL/api/v1/rules | jq '[.data.groups[].rules[] | select(.state == "firing")] | length')
    
    if [ "$rules" = "0" ]; then
        echo "No rules are firing"
        return 0
    else
        echo "$rules rules are firing"
        return 1
    fi
}

# Run all checks
check_prometheus && check_targets && check_rules
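For teams that prefer to embed the same check in a Go service rather than call a shell script, the target check can be sketched as follows. It parses the JSON shape returned by `/api/v1/targets` and counts targets whose health is not `up`, matching the `jq` filter in the script. `countUnhealthy` is a hypothetical helper and `sampleTargets` is made-up data; in practice the body would come from an HTTP GET against the Prometheus URL.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse models only the fields of /api/v1/targets used here.
type targetsResponse struct {
	Status string `json:"status"`
	Data   struct {
		ActiveTargets []struct {
			Health    string `json:"health"`
			ScrapeURL string `json:"scrapeUrl"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// sampleTargets is a made-up response with one healthy and one failing target.
var sampleTargets = []byte(`{"status":"success","data":{"activeTargets":[
	{"health":"up","scrapeUrl":"http://localhost:9090/metrics"},
	{"health":"down","scrapeUrl":"http://node-exporter:9100/metrics"}]}}`)

// countUnhealthy returns the number of active targets whose health is
// anything other than "up" (this includes "unknown").
func countUnhealthy(body []byte) (int, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("parse targets response: %w", err)
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("unexpected API status %q", resp.Status)
	}
	unhealthy := 0
	for _, target := range resp.Data.ActiveTargets {
		if target.Health != "up" {
			unhealthy++
		}
	}
	return unhealthy, nil
}

func main() {
	n, err := countUnhealthy(sampleTargets)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d target(s) not up\n", n)
}
```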

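Data

This article comes from 极简博客, written by 神秘剑客. Please cite the original link when reposting: Building a Cloud-Native Application Monitoring System: A Full-Stack Observability Solution with Prometheus, Grafana, and Loki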
