Building a Cloud-Native Application Monitoring System: A Full-Stack Observability Solution with Prometheus + Grafana + Loki

 

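Introduction

In the cloud-native era, applications have become markedly more complex and dynamic, and traditional monitoring approaches can no longer keep up with them. Observability, a core requirement for cloud-native applications, goes beyond traditional monitoring and alerting: it covers three kinds of signals, namely metrics, logs, and traces.

This article looks at how to build a complete monitoring system for cloud-native applications: Prometheus collects metrics, Grafana visualizes them, and Loki manages logs. Together they form a full-stack observability solution for metrics and logs; tracing needs a separate backend and is not covered here.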

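Cloud-Native Monitoring Architecture

Core Components

A modern cloud-native monitoring system typically has three layers:

  1. Data collection layer: gathers metrics, logs, and other signals from the various data sources
  2. Data storage layer: stores and manages the collected observability data efficiently
  3. Presentation layer: provides visualization and alerting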

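Technology Selection

Prometheus: the core of metrics collection

Prometheus, a CNCF graduated project, offers the following advantages:

  • A multi-dimensional data model with a rich label system
  • PromQL, a powerful query language
  • Service discovery that finds scrape targets automatically
  • High availability by running identical replicas; scaling beyond a single server relies on federation or remote storage, because a Prometheus server itself is not clustered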
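
To make the multi-dimensional data model concrete: a time series is identified by a metric name plus a set of label key/value pairs. The Go sketch below uses only the standard library; seriesSelector is a hypothetical helper written for this article, not part of any Prometheus library. It renders such an identity in the selector syntax that PromQL uses.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// seriesSelector is a hypothetical helper (not a Prometheus API).
// It renders a metric name and a label set as a PromQL-style selector.
// Label names are sorted so the same label set always yields the same string.
func seriesSelector(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		// strconv.Quote adds the double quotes and escapes special characters.
		pairs = append(pairs, k+"="+strconv.Quote(labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	// Two series with the same name but different labels are different series.
	fmt.Println(seriesSelector("http_requests_total", map[string]string{
		"method": "GET",
		"status": "200",
	}))
	fmt.Println(seriesSelector("http_requests_total", map[string]string{
		"method": "GET",
		"status": "500",
	}))
}
```

Every distinct combination of label values is stored as its own series, which is why labels with unbounded values (user IDs, raw URLs) should be avoided.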

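Grafana: the first choice for visualization

Grafana is an open-source visualization platform whose features include:

  • A wide range of chart types and panels
  • Integration with many data sources
  • Built-in alerting
  • Flexible dashboard customization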

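Loki: a newer option for log management

Loki is a log aggregation system designed for cloud-native environments:

  • Cost-effective: it indexes only labels, not the full log text, which keeps the index small
  • A label model similar to Prometheus's
  • Native Kubernetes integration
  • Easy to scale and maintain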

Setting Up Prometheus Metrics Collection

Installation and Configuration

Deploying with Docker

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.45.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
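      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # rule_files entries in prometheus.yml are resolved relative to the config file's directory
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus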
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped

volumes:
  prometheus_data:

Core Configuration File

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Monitor Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Service Discovery Configuration

Kubernetes Integration

# prometheus-rbac.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
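# Ingress lives in networking.k8s.io; the extensions API group was removed in Kubernetes 1.22
- apiGroups:
  - networking.k8s.io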
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

Collecting Custom Metrics

Instrumenting Application Code

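// Instrumenting a Go application with the Prometheus client library
package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// statusRecorder remembers the status code written by the wrapped handler,
// so the "status" label reflects the real response instead of a fixed "200".
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// instrumentedHandler records request count and latency for next.
// endpoint is a fixed route name; labelling by r.URL.Path would create one
// time series per distinct URL.
func instrumentedHandler(endpoint string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

        // Call the wrapped handler
        next(rec, r)

        duration := time.Since(start).Seconds()

        // Record the metrics
        httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, endpoint, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    http.HandleFunc("/api/users", instrumentedHandler("/api/users", func(w http.ResponseWriter, r *http.Request) {
        // Business logic
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Users data"))
    }))

    // Expose the Prometheus metrics endpoint
    http.Handle("/metrics", promhttp.Handler())

    // ListenAndServe only returns on failure, so report the error and exit
    log.Fatal(http.ListenAndServe(":8080", nil))
}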

Configuring the Grafana Visualization Platform

Installation and Basic Configuration

Deploying with Docker

# grafana-docker-compose.yml
version: '3.8'
services:
  grafana:
    image: grafana/grafana-enterprise:10.0.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      # Example password only; supply a real one through an environment variable or a secret
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
    restart: unless-stopped

volumes:
  grafana_data:

Data Source Configuration

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      maxLines: 1000

Dashboard Configuration

Provisioning dashboards from files

# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards

Core Business Metrics Dashboard

A dashboard file loaded by the file provider contains the dashboard model itself, without the "dashboard" envelope that the HTTP API uses:

{
  "id": null,
  "title": "Application Performance Monitoring",
  "timezone": "browser",
  "schemaVersion": 16,
  "version": 0,
  "refresh": "30s",
  "panels": [
    {
      "type": "timeseries",
      "title": "HTTP Request Rate",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum by (method, endpoint) (rate(http_requests_total[5m]))",
          "legendFormat": "{{method}} {{endpoint}}",
          "refId": "A"
        }
      ],
      "gridPos": {
        "x": 0,
        "y": 0,
        "w": 12,
        "h": 8
      }
    },
    {
      "type": "stat",
      "title": "Error Rate (%)",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "100 * sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
          "refId": "A"
        }
      ],
      "gridPos": {
        "x": 12,
        "y": 0,
        "w": 6,
        "h": 4
      }
    },
    {
      "type": "gauge",
      "title": "CPU Usage (%)",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "refId": "A"
        }
      ],
      "gridPos": {
        "x": 18,
        "y": 0,
        "w": 6,
        "h": 4
      }
    }
  ]
}
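
The request-rate and error-rate panels both build on `rate()`, which turns an ever-increasing counter into a per-second value. The Go sketch below shows the core idea under simplifying assumptions: the sample type and simpleRate function are hypothetical, and unlike the real PromQL function this version does not extrapolate to the window boundaries. It does handle counter resets, which happen whenever the instrumented process restarts.

```go
package main

import "fmt"

// sample is one scraped counter value (hypothetical type for this sketch).
type sample struct {
	t float64 // scrape time in seconds
	v float64 // counter value at that time
}

// simpleRate returns the average per-second increase over the samples.
// A drop in value is treated as a counter reset: the counter restarted
// from zero, so the new value itself is the increase since the reset.
func simpleRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	elapsed := samples[len(samples)-1].t - samples[0].t
	if elapsed <= 0 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].v - samples[i-1].v
		if delta < 0 {
			delta = samples[i].v
		}
		increase += delta
	}
	return increase / elapsed
}

func main() {
	// 120 requests in 60 seconds.
	steady := []sample{{0, 0}, {60, 120}}
	fmt.Println(simpleRate(steady))

	// The process restarted between t=15 and t=30, so the counter dropped.
	withReset := []sample{{0, 100}, {15, 130}, {30, 10}, {45, 40}}
	fmt.Println(simpleRate(withReset))
}
```

Dividing the rate of the 5xx series by the rate of all series gives the error ratio shown in the stat panel.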

Alert Rule Configuration

Prometheus Alert Rules

# alert_rules.yml
groups:
- name: application_alerts
  rules:
  - alert: HighErrorRate
    # Ratio of 5xx responses to all responses per instance, not the raw 5xx rate
    expr: |
      sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
        / sum by (instance) (rate(http_requests_total[5m])) > 0.05
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High error rate (instance {{ $labels.instance }})"
      description: "Error ratio over the last 5 minutes is above 5%: {{ $value }}"

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High latency (instance {{ $labels.instance }})"
      description: "95th percentile request latency is above 1 second: {{ $value }}"

  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service unavailable (instance {{ $labels.instance }})"
      description: "The service has stopped responding"

Grafana Alert Rules

# grafana/provisioning/alerting/alerts.yml
apiVersion: 1
groups:
  - orgId: 1
    name: app_alerts
    folder: App Alerts
    interval: 60s
    rules:
      - uid: high_error_rate
        title: High error rate alert
        condition: B
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
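            # Must match the uid of the provisioned Prometheus data source
            datasourceUid: PBFA97CFB590B2093
            model:
              # Ratio of 5xx responses to all responses, so the 0.05 threshold in B means 5%
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))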
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: -100
            model:
              conditions:
                - evaluator:
                    params:
                      - 0.05
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: "-100"
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
              type: classic_conditions
        dashboardUid: ""
        panelId: 0
        noDataState: "NoData"
        execErrState: "Alerting"
        for: 120s
        annotations:
          summary: Application error rate is too high
        labels:
          severity: warning

Deploying the Loki Log Management System

Installation and Configuration

Deploying with Docker Compose

# loki-docker-compose.yml
version: '3.8'
services:
  loki:
    image: grafana/loki:2.8.4
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.8.4
    container_name: promtail
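    volumes:
      - /var/log:/var/log
      # The "application" job in promtail-config.yaml reads /app/logs/*.log;
      # mount the host directory that holds those logs at /app/logs as well
      - ./promtail-config.yaml:/etc/promtail/config.yml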
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped

Loki Configuration File

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Promtail Configuration File

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log

  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: applogs
          __path__: /app/logs/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            timestamp: timestamp
      - labels:
          level:

Kubernetes Integration

Promtail DaemonSet Configuration

# promtail-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: promtail
  template:
    metadata:
      labels:
        name: promtail
    spec:
      # serviceAccount is a deprecated alias of serviceAccountName
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:2.8.4
          args:
            - -config.file=/etc/promtail/promtail.yaml
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: logs
              mountPath: /var/log
            - name: pods
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: logs
          hostPath:
            path: /var/log
        - name: pods
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    clients:
      - url: http://loki:3100/loki/api/v1/push

    positions:
      filename: /tmp/positions.yaml

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        pipeline_stages:
          # Parses Docker's JSON log format; on containerd or CRI-O nodes use "- cri: {}" instead
          - docker: {}
        relabel_configs:
          - source_labels:
              - __meta_kubernetes_pod_node_name
            target_label: __host__
          - action: replace
            replacement: $1
            separator: /
            source_labels:
              - __meta_kubernetes_namespace
              - __meta_kubernetes_pod_name
            target_label: job
          - action: replace
            source_labels:
              - __meta_kubernetes_namespace
            target_label: namespace
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_name
            target_label: pod
          - action: replace
            source_labels:
              - __meta_kubernetes_pod_container_name
            target_label: container
          - replacement: /var/log/pods/*$1/*.log
            separator: /
            source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
            target_label: __path__

Full-Stack Integration and Best Practices

Unified Monitoring View

Creating a Combined Dashboard

{
  "dashboard": {
    "title": "Full-Stack Observability Dashboard",
    "panels": [
      {
        "type": "row",
        "title": "Application Metrics",
        "collapsed": false
      },
      {
        "type": "timeseries",
        "title": "Request Rate",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum by (method, endpoint) (rate(http_requests_total[5m]))",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "type": "row",
        "title": "Infrastructure",
        "collapsed": false
      },
      {
        "type": "timeseries",
        "title": "CPU Usage (%)",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "type": "row",
        "title": "Log Analysis",
        "collapsed": false
      },
      {
        "type": "logs",
        "title": "Application Logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{job=\"applogs\", level=\"ERROR\"}",
            "refId": "A"
          }
        ]
      }
    ]
  }
}

Performance Optimization Strategies

Prometheus Optimization

# prometheus-optimized.yml
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s

storage:
  tsdb:
    out_of_order_time_window: 30m

remote_write:
  - url: http://remote-storage:9090/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'prometheus_.*'
        action: drop

scrape_configs:
  - job_name: 'kubernetes-pods'
    scrape_interval: 1m
    scrape_timeout: 30s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

Loki Performance Tuning

# loki-optimized-config.yaml
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://access_key:secret_access_key@region/bucket_name
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    # Object store that index files are shipped to; must match object_store above
    shared_store: s3
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_entries_limit_per_query: 10000
  max_query_length: 168h
  max_query_parallelism: 14
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  creation_grace_period: 10m
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  unordered_writes: true

Security Configuration

Network Security

# prometheus-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
  - Ingress
  - Egress
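  ingress:
  - from:
    # Any pod in the monitoring namespace (this label is set automatically since Kubernetes 1.21)
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    # Grafana pods in this namespace
    - podSelector:
        matchLabels:
          app: grafana
    ports:
    - protocol: TCP
      port: 9090
  egress:
  # Scrape node-exporter in any namespace
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 9100
  # DNS lookups; without this rule service names cannot be resolved
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Other scrape ports, Alertmanager (9093), and the Kubernetes API need their own egress rules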

Authentication and Authorization

# grafana-security-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-secret
  namespace: monitoring
type: Opaque
data:
  admin-password: base64_encoded_password

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-security-config
  namespace: monitoring
data:
  grafana.ini: |
    [security]
    admin_user = admin
    admin_password = $__file{/etc/grafana/secrets/admin-password}
    
    [auth]
    disable_login_form = false
    disable_signout_menu = false
    
    [auth.anonymous]
    enabled = false
    
    [auth.basic]
    enabled = true

Operating and Maintaining the Monitoring System

Health Checks

Prometheus Health Check Script

#!/bin/bash
# prometheus-health-check.sh

PROMETHEUS_URL="http://localhost:9090"
TIMEOUT=10

# Check that Prometheus responds
check_prometheus() {
    local response
    response=$(curl -s -o /dev/null -w "%{http_code}" --max-time "$TIMEOUT" "$PROMETHEUS_URL/-/healthy")

    if [ "$response" = "200" ]; then
        echo "Prometheus is healthy"
        return 0
    else
        echo "Prometheus is unhealthy (HTTP $response)"
        return 1
    fi
}

# Check target health: count active targets that are not "up"
check_targets() {
    local targets
    targets=$(curl -s --max-time "$TIMEOUT" "$PROMETHEUS_URL/api/v1/targets" | jq '[.data.activeTargets[] | select(.health != "up")] | length')

    if [ "$targets" = "0" ]; then
        echo "All targets are up"
        return 0
    else
        echo "$targets targets are down"
        return 1
    fi
}

# Check rule state: count alerting rules that are currently firing
check_rules() {
    local rules
    rules=$(curl -s --max-time "$TIMEOUT" "$PROMETHEUS_URL/api/v1/rules" | jq '[.data.groups[].rules[] | select(.state == "firing")] | length')

    if [ "$rules" = "0" ]; then
        echo "No rules are firing"
        return 0
    else
        echo "$rules rules are firing"
        return 1
    fi
}

# Run all checks, stopping at the first failure
check_prometheus && check_targets && check_rules
Data


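Permalink: https://www.cxy163.net/archives/5878 | 绝缘体

This post was published by 绝缘体 on February 19, 2024, in the Uncategorized category. You may comment on it, and you may quote it on your own website or blog as long as you keep the original URL and the author attribution.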