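Deploy Doris Cluster Monitoring

This document describes how to monitor a Doris cluster deployed through Doris Operator and how to query its logs in one place.

Configure DorisMonitor

Doris Operator collects Doris cluster metrics with Prometheus, collects Doris cluster logs with Loki, and provides a unified visualization interface in Grafana.

When you create a new Doris cluster through Doris Operator, you can create and configure an independent monitoring system for each Doris cluster. It runs in the same Namespace as the Doris cluster and consists of four components: Prometheus, Grafana, Loki, and Promtail.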

The DorisMonitor CR defines the configuration of the Doris monitoring and visualization components:

A basic DorisMonitor CR sample

doris-monitor.yaml

# IT IS NOT SUITABLE FOR PRODUCTION USE.
# This YAML describes a basic set of Doris monitor components with minimum resource requirements,
# which should be able to run in any Kubernetes cluster with storage support.

apiVersion: al-assad.github.io/v1beta1
kind: DorisMonitor
metadata:
  name: basic-monitor
spec:
  # The doris cluster name to be monitored
  cluster: basic

  prometheus:
    image: prom/prometheus:v2.37.8
    # The retention time of the prometheus data in the storage
    retentionTime: 15d
    # The storage size of prometheus persistent data at pvc.
    # It is recommended to be greater than 50Gi in the production env.
    requests:
      storage: 5Gi

  grafana:
    image: grafana/grafana:9.5.2
    # The default admin user and password of grafana (optional)
    adminUser: admin
    adminPassword: admin
    # The storage size of grafana persistent data at pvc.
    # It is recommended to be 10Gi in the production env.
    requests:
      storage: 1Gi

  loki:
    image: grafana/loki:2.9.1
    # The retention time of the loki data in the storage
    retentionTime: 15d
    # The storage size of loki persistent data at pvc.
    # It is recommended to be greater than 50Gi in the production env.
    requests:
      storage: 5Gi

  promtail:
    image: grafana/promtail:2.9.1

An advanced DorisMonitor CR sample

doris-monitor.yaml

apiVersion: al-assad.github.io/v1beta1
kind: DorisMonitor
metadata:
  name: basic-monitor
spec:
  ## The doris cluster name to be monitored
  cluster: basic

  ## ImagePullPolicy of Doris monitor Pods
  ## Ref: https://kubernetes.io/docs/concepts/configuration/overview/#container-images
  # imagePullPolicy: IfNotPresent

  ## Ref: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
  # imagePullSecrets:
  # - name: secretName

  ## The storageClassName of the persistent volume for prometheus/grafana/loki data storage.
  ## The Kubernetes default storage class is used if this field is not set.
  # storageClassName: ""

  ## Specifies the service account for prometheus/grafana/loki/promtail components.
  # serviceAccount: ""

  ## Whether to disable Loki for log collection
  # disableLoki: false

  ## NodeSelector of pods.
  ## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  # nodeSelector:
  #   node-role.kubernetes.io/doris-monitor: "true"

  ###########################
  # Prometheus Configuration #
  ###########################
  prometheus:
    ## Image of the prometheus
    image: prom/prometheus:v2.37.8

    ## The retention time of the prometheus data in the storage
    ## When this field is not set, all data from Prometheus will be retained.
    # retentionTime: 15d

    ## The resource requirements
    ## Ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
    requests:
      # cpu: 500m
      # memory: 500Mi
      ## The storage size of prometheus,
      # it is recommended to be greater than 50Gi in the production env.
      storage: 5Gi
    ##  Describes the resource limit
    # limits:
    #   cpu: 4
    #   memory: 8Gi

    ## Defines Kubernetes service for prometheus-service
    # service:
    #  type: NodePort
    #  httpPort: 0

    ## NodeSelector of pods.
    # nodeSelector: {}

  ########################
  # Grafana Configuration #
  ########################
  grafana:
    ## Image of the grafana
    image: grafana/grafana:9.5.2

    ## The default admin user and password of grafana (optional)
    # adminUser: admin
    # adminPassword: admin

    ## Describes the resource requirements
    ## Ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
    requests:
      # cpu: 250m
      # memory: 500Mi
      ## It is recommended to be 10Gi in the production env.
      storage: 1Gi
    ##  Describes the resource limit
    # limits:
    #   cpu: 4
    #   memory: 8Gi

    ## The storageClassName of the persistent volume for grafana data storage.
    # storageClassName: ""

    ## Defines Kubernetes service for grafana-service
    # service:
    #  type: NodePort
    #  httpPort: 0

    ## NodeSelector of pods.
    # nodeSelector: {}

  #####################
  # Loki Configuration #
  #####################
  loki:
    ## Image of the loki
    image: grafana/loki:2.9.1

    ## The retention time of the loki data in the storage
    ## When this field is not set, all data from Loki will be retained.
    retentionTime: 15d

    ## Describes the resource requirements
    ## Ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
    requests:
      # cpu: 500m
      # memory: 500Mi
      ## It is recommended to be greater than 50Gi in the production env.
      storage: 5Gi
    ##  Describes the resource limit
    # limits:
    #   cpu: 4
    #   memory: 8Gi

    ## The storageClassName of the persistent volume for loki data storage.
    # storageClassName: ""

    ## NodeSelector of pods.
    # nodeSelector: {}

  #########################
  # Promtail Configuration #
  #########################
  promtail:
    ## Image of the promtail
    image: grafana/promtail:2.9.1
    ## The resource requirements
    # requests:
    #   cpu: 250m
    #   memory: 256Mi
    # limits:
    #   cpu: 4
    #   memory: 8Gi

Note
We recommend organizing the Doris cluster configuration under the ${cluster_name} directory and saving this file as ${cluster_name}/doris-monitor.yaml.

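Storage

spec.storageClassName defines the storage class used by the monitoring components. See the storage configuration documentation for details.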

spec:
  # ...
  storageClassName: ${storageClassName}

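spec.<prometheus/grafana/loki>.requests.storage defines the persistent storage size for Prometheus, Loki, and Grafana. Choose a size that fits your data retention time. Recommendations for production environments:

  • prometheus: 50Gi or more;
  • loki: 50Gi or more;
  • grafana: 5Gi
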
spec:
  # ...
  prometheus:
    requests:
      storage: 50Gi
  grafana:
    requests:
      storage: 5Gi
  loki:
    requests:
      storage: 50Gi

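Data retention time

Use spec.<prometheus/loki>.retentionTime to configure the data retention time of the Prometheus and Loki components. When this value is not set, Prometheus and Loki data is kept permanently on the PVC bound to each component.

The following example sets the data retention time of both Prometheus and Loki to 15 days: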

spec:
  # ...
  prometheus:
    retentionTime: 15d
  loki:
    retentionTime: 15d

Deploy DorisMonitor

kubectl apply -f ${cluster_name}/doris-monitor.yaml --namespace=${namespace}

Check the status of the monitor components:

kubectl get dorismonitor ${dorismonitor_name} -n ${namespace} -o yaml
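
To see the individual workloads behind the DorisMonitor resource, a quick look at the Pods and PVCs in the namespace can help. The sketch below is a convenience wrapper, not part of Doris Operator: the function name show_monitor_status is hypothetical, and it lists everything in the namespace because the labels the operator applies are not documented here.

```shell
# Hypothetical helper: print the DorisMonitor resource plus the Pods and PVCs
# in its namespace. Usage: show_monitor_status <namespace> <dorismonitor_name>
show_monitor_status() {
  ns="$1"
  name="$2"
  # The custom resource itself, in short form.
  kubectl get dorismonitor "$name" -n "$ns"
  # Prometheus, Grafana, Loki and Promtail Pods run in the same namespace;
  # no label selector is used because the operator's labels are unknown here.
  kubectl get pods -n "$ns"
  # Persistent volume claims backing Prometheus, Grafana and Loki data.
  kubectl get pvc -n "$ns"
}
```

For example: show_monitor_status ${namespace} ${dorismonitor_name}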

Access DorisMonitor

Access the Grafana dashboard

You can access the Grafana monitoring dashboard through kubectl port-forward:

kubectl port-forward -n ${namespace} svc/${dorismonitor_name}-grafana 3000:3000

Then open http://localhost:3000 in a browser. The default username and password are both admin.
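
If the page does not load, it can help to confirm from the command line that the port-forward is reaching Grafana. A minimal sketch, assuming curl is installed and the port-forward is still running; /api/health is Grafana's standard health endpoint, and the function name grafana_health is hypothetical:

```shell
# Hypothetical helper: query Grafana's health endpoint through the port-forward.
# Usage: grafana_health [base_url]   (defaults to http://localhost:3000)
grafana_health() {
  base_url="${1:-http://localhost:3000}"
  # -s silences progress output; -f makes curl exit non-zero on HTTP errors.
  curl -sf "${base_url}/api/health"
}
```

A non-zero exit status suggests the port-forward or the Grafana Pod is not ready yet.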

You can also set spec.grafana.service.type to NodePort and view the monitoring dashboard through the NodePort.

Access Prometheus monitoring data

When you need direct access to the monitoring data, you can reach Prometheus through kubectl port-forward:

kubectl port-forward -n ${namespace} svc/${dorismonitor_name}-prometheus 9090:9090 

Then open http://localhost:9090 in a browser, or access this address with a client tool.
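
As one example of client access, the Prometheus HTTP API can be queried with curl. This is a sketch under assumptions: curl is installed, the port-forward is running, and the built-in up metric serves only as a placeholder query. The function name prom_query is hypothetical.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# Usage: prom_query <promql> [base_url]   (base_url defaults to http://localhost:9090)
prom_query() {
  query="$1"
  base_url="${2:-http://localhost:9090}"
  # -G sends the URL-encoded parameter as a GET query string.
  curl -sG "${base_url}/api/v1/query" --data-urlencode "query=${query}"
}
```

For example, prom_query up returns a JSON document with one sample per scrape target.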

You can also set spec.prometheus.service.type to NodePort and access the monitoring data through the NodePort.
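
Once a Service is of type NodePort, the port Kubernetes assigned can be read back from the Service object. A minimal sketch, assuming the Service names follow the ${dorismonitor_name}-grafana and ${dorismonitor_name}-prometheus pattern used in the port-forward commands, and that the HTTP port is the first entry in the Service's port list. The function name monitor_nodeport is hypothetical.

```shell
# Hypothetical helper: print the node port assigned to a monitor Service.
# Usage: monitor_nodeport <namespace> <dorismonitor_name> <grafana|prometheus>
monitor_nodeport() {
  ns="$1"
  svc="$2-$3"
  # jsonpath picks the nodePort of the first port; that this is the HTTP port
  # is an assumption about how the operator defines the Service.
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.ports[0].nodePort}'
}
```

The dashboard or API should then be reachable at http://<node-ip>:<printed-port>, where <node-ip> is the address of any cluster node.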