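A Prometheus-Based Monitoring System in Practice
The number of requests received per unit of time
The request success/failure rate per unit of time
The average request processing time
Supports PromQL (a query language), which allows flexible aggregation of metric data
Simple to deploy: a single binary is enough to run it, with no dependency on distributed storage
Written in Go, so its components are easy to integrate into projects that are also written in Go
Ships with a built-in web UI that renders time series onto panels via PromQL
A rich ecosystem of components: Alertmanager, Pushgateway, exporters, and more
Use base units (e.g. seconds rather than milliseconds)
Prefix metric names with the application namespace, for example: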
process_cpu_seconds_total
http_request_duration_seconds
Use a suffix to describe the unit, for example:
http_request_duration_seconds
node_memory_usage_bytes
http_requests_total
process_cpu_seconds_total
foobar_build_info
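The naming rules above can be checked mechanically. The following is a minimal sketch in Python; `check_metric_name` is a hypothetical helper, and the set of discouraged units is an illustrative assumption rather than an official list. Only the character pattern for metric names comes from the Prometheus data model.

```python
import re
from typing import List

# Non-base units that the conventions above ask you to avoid (illustrative subset).
DISCOURAGED_UNITS = {"milliseconds", "microseconds", "kilobytes", "megabytes"}

# Character pattern for valid Prometheus metric names.
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def check_metric_name(name: str) -> List[str]:
    """Return the convention problems found in a metric name (hypothetical helper)."""
    if not NAME_RE.match(name):
        return ["invalid characters"]
    problems = []
    parts = name.split("_")
    # A single word has no room for a namespace prefix such as "http" or "process".
    if len(parts) < 2:
        problems.append("missing application namespace prefix")
    for part in parts:
        if part in DISCOURAGED_UNITS:
            problems.append(f"use a base unit instead of {part}")
    return problems
```

For example, `check_metric_name("http_request_duration_seconds")` returns an empty list, while a name ending in `_milliseconds` is flagged.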
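Counter: a metric whose sample values increase monotonically, i.e. it only goes up and never down. It is typically used to count things such as the number of requests served or the number of errors.
Gauge: a metric whose sample values can change arbitrarily, i.e. it can go up or down. It is typically used for values such as a service's CPU usage or memory usage.
Histogram and Summary: these represent observations sampled over a period of time together with their quantile statistics. They are typically used for request latency, response size, and similar measurements.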
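To make the difference between the metric types concrete, here is a minimal sketch in Python. These classes are illustrative stand-ins written for this article, not the API of any Prometheus client library; the histogram keeps per-bucket counts and exposes them cumulatively, which is how `_bucket` series are reported.

```python
import bisect
import math


class Counter:
    """A value that can only increase."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A value that can go up and down freely."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations into buckets keyed by their upper bound (le)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not yet cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return [(le, cumulative_count)], the shape of the *_bucket series."""
        total, out = 0, []
        for le, c in zip(self.bounds, self.counts):
            total += c
            out.append((le, total))
        return out
```

A Summary differs in that quantiles are computed on the client side, so it is omitted from this sketch.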
http_requests{host='host1',service='web',code='200',env='test'}
http_requests{host='host1',service='web',code='200',env='test'} 10
http_requests{host='host2',service='web',code='200',env='test'} 0
http_requests{host='host3',service='web',code='200',env='test'} 12
http_requests{host='host1',service='web',code='200',env='test'}[5m]
http_requests{host='host1',service='web',code='200',env='test'} 0 4 6 8 10
http_requests{host='host2',service='web',code='200',env='test'} 0 0 0 0 0
http_requests{host='host3',service='web',code='200',env='test'} 0 2 5 9 12
rate(http_requests{host='host1',service='web',code='200',env='test'}[5m])
increase(http_requests{host='host1',service='web',code='200',env='test'}[5m])
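What `increase` and `rate` compute over a range vector can be sketched in a few lines of Python. This is a simplified model: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus applies, so real results will differ slightly. The one-minute spacing between samples is an assumption; the sample values are those of host1 and host3 above.

```python
def increase(samples):
    """Total growth across (timestamp, value) samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # After a reset the counter restarts from 0, so the whole current value is new growth.
            total += cur
    return total


def rate(samples, window_seconds):
    """Average per-second growth over the window."""
    return increase(samples) / window_seconds


# Five samples within a 5m window, assumed to be one minute apart.
host1 = [(0, 0), (60, 4), (120, 6), (180, 8), (240, 10)]
host3 = [(0, 0), (60, 2), (120, 5), (180, 9), (240, 12)]

print(increase(host1))   # growth of the counter over the window
print(rate(host1, 300))  # the same growth expressed per second
```

In this simplified model `rate` is just `increase` divided by the window length in seconds, which is why the two functions always describe the same trend.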
histogram_quantile(0.9, rate(employee_age_bucket_bucket[10m]))
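The idea behind `histogram_quantile` is linear interpolation inside the bucket that contains the requested rank. The Python sketch below illustrates that idea under stated assumptions: it takes raw cumulative counts instead of the per-second rates used in the query above, it assumes the lowest bucket starts at 0, and the bucket values in the example are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative buckets for an age histogram (illustrative data).
age_buckets = [(20, 10), (30, 40), (40, 80), (50, 100), (math.inf, 100)]
print(histogram_quantile(0.9, age_buckets))
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries match the real distribution of the data.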
relabel_configs:
- source_labels: [__address__]
  modulus: 3
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: $(PROM_ID)
  action: keep
relabel_configs:
- source_labels: ['__meta_consul_dc']
  regex: 'dc1'
  action: keep
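The effect of the `hashmod` and `keep` rules above can be simulated in Python. To the best of my understanding, Prometheus derives the hash from the MD5 digest of the source label value, reading its low 8 bytes as a big-endian integer; treat that detail as an assumption rather than a specification. The function names and the address list are hypothetical.

```python
import hashlib
import re


def hashmod(value, modulus):
    """MD5 the label value, read the low 8 bytes as a big-endian integer, reduce modulo `modulus`."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def keep(labels, source_label, regex):
    """`keep` action: retain the target only if the whole label value matches the regex."""
    return re.fullmatch(regex, labels.get(source_label, "")) is not None


def targets_for_shard(addresses, prom_id, shards=3):
    """Return the targets that the Prometheus instance with ordinal `prom_id` would scrape."""
    kept = []
    for addr in addresses:
        labels = {"__address__": addr}
        labels["__tmp_hash"] = str(hashmod(addr, shards))
        if keep(labels, "__tmp_hash", str(prom_id)):
            kept.append(addr)
    return kept


# Hypothetical target addresses.
addresses = [f"10.0.0.{i}:9100" for i in range(1, 21)]
for prom_id in range(3):
    print(prom_id, targets_for_shard(addresses, prom_id))
```

Every target lands on exactly one of the three instances, which is what lets each replica scrape only its own share. The second rule works the same way: a target is kept only when `__meta_consul_dc` is exactly `dc1`.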
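When the Querier receives a request, it sends requests to the relevant Sidecars and fetches time series data from their Prometheus servers.
It then aggregates the responses and evaluates the PromQL query against them. It can merge disjoint data sets, and it can also deduplicate data coming from a Prometheus high-availability group.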
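The deduplication step can be pictured as merging series that differ only in their replica label. The Python sketch below is deliberately simplified and is not Thanos's actual algorithm; the `deduplicate` function, the label name `replica`, and the sample data are assumptions for illustration.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    Each series is a pair (labels_dict, [(timestamp, value), ...]).
    """
    merged = {}
    for labels, samples in series_list:
        # Series identity is the label set without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            # Keep the first value seen per timestamp; later replicas only fill gaps.
            points.setdefault(ts, value)
    return [(dict(key), sorted(points.items())) for key, points in merged.items()]


replica_0 = ({"__name__": "up", "job": "web", "replica": "0"}, [(1, 1.0), (2, 1.0)])
replica_1 = ({"__name__": "up", "job": "web", "replica": "1"}, [(2, 1.0), (3, 1.0)])
other = ({"__name__": "up", "job": "db", "replica": "0"}, [(1, 0.0)])

for labels, samples in deduplicate([replica_0, replica_1, other]):
    print(labels, samples)
```

The two `web` series collapse into one that covers all three timestamps, so a gap in one replica is filled by the other.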
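Pushgateway is designed as a cache for metrics, which means it never proactively expires the metrics that services push to it. This is not a problem while a service keeps running, but once the service is rescheduled or destroyed, Pushgateway still retains the metrics pushed from the previous node. Moreover, if multiple Pushgateway instances run behind a load balancer, the same metric may end up on several instances and be duplicated; this has to be solved by adding consistent-hash routing at the proxy layer.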
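Consistent-hash routing means that the same push path always reaches the same Pushgateway instance, and that adding or removing an instance only remaps a fraction of the paths. The following Python sketch shows the technique itself, not the proxy's implementation; the class, the instance names, and the choice of 100 virtual nodes are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes; a key is routed to the next node clockwise."""

    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        self._ring = []  # (hash, node) pairs, sorted by hash
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First ring position after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical instance names and push path.
ring = ConsistentHashRing(["pushgateway-0", "pushgateway-1", "pushgateway-2"])
print(ring.node_for("/metrics/job/web/instance/host1"))
```

Hashing on the request path keeps every push for a given job and instance on one Pushgateway, so the metric is stored only once.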
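In pull mode, Prometheus can easily see the health of each monitored target instance and quickly locate failures. In push mode, the clients are never actively probed, so nothing is known about the health of the target instances.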
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  serviceName: 'prometheus'
  updateStrategy:
    type: RollingUpdate
  replicas: 3
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: 'true'
    spec:
      serviceAccountName: prometheus
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-data
          hostPath:
            path: /data/prometheus
        - name: prometheus-config-shared
          emptyDir: {}
      containers:
        - name: prometheus
          image: prom/prometheus:v2.11.1
          args:
            - --config.file=/etc/prometheus-shared/prometheus.yml
            - --web.enable-lifecycle
            - --storage.tsdb.path=/data/prometheus
            - --storage.tsdb.retention=2w
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --web.enable-admin-api
          ports:
            - name: http
              containerPort: 9090
          volumeMounts:
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared
            - name: prometheus-data
              mountPath: /data/prometheus
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
        - name: watch
          image: watch
          args: ['-v', '-t', '-p=/etc/prometheus-shared', 'curl', '-X', 'POST', '--fail', '-o', '-', '-sS', 'http://localhost:9090/-/reload']
          volumeMounts:
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared
        - name: thanos
          image: improbable/thanos:v0.6.0
          command: ['/bin/sh', '-c']
          args:
            - PROM_ID=`echo $POD_NAME| rev | cut -d '-' -f1` /bin/thanos sidecar
              --prometheus.url=http://localhost:9090
              --reloader.config-file=/etc/prometheus/prometheus.yml.tmpl
              --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus
  namespace: default
  labels:
    app: prometheus
rules:
  - apiGroups: ['']
    resources: ['services', 'pods', 'nodes', 'nodes/proxy', 'endpoints']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['configmaps']
    verbs: ['create']
  - apiGroups: ['']
    resources: ['configmaps']
    resourceNames: ['prometheus-config']
    verbs: ['get', 'update', 'delete']
  - nonResourceURLs: ['/metrics']
    verbs: ['get']
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prometheus
  namespace: default
  labels:
    app: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default
roleRef:
  kind: ClusterRole
  name: prometheus
  apiGroup: ''
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: thanos-query
  name: thanos-query
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  minReadySeconds: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - args:
            - query
            - --log.level=debug
            - --query.timeout=2m
            - --query.max-concurrent=20
            - --query.replica-label=replica
            - --query.auto-downsampling
            - --store=dnssrv+thanos-store-gateway.default.svc
            - --store.sd-dns-interval=30s
          image: improbable/thanos:v0.6.0
          name: thanos-query
          ports:
            - containerPort: 10902
              name: http
            - containerPort: 10901
              name: grpc
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-query
  name: thanos-query
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 10901
      targetPort: http
  selector:
    app: thanos-query
---
apiVersion: v1
kind: Service
metadata:
  labels:
    thanos-store-api: 'true'
  name: thanos-store-gateway
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
  selector:
    thanos-store-api: 'true'
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: thanos-rule
  name: thanos-rule
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-rule
  template:
    metadata:
      labels:
        app: thanos-rule
    spec:
      containers:
        - name: thanos-rule
          image: improbable/thanos:v0.6.0
          args:
            - rule
            - --web.route-prefix=/rule
            - --web.external-prefix=/rule
            - --log.level=debug
            - --eval-interval=15s
            - --rule-file=/etc/rules/thanos-rule.yml
            - --query=dnssrv+thanos-query.default.svc
            - --alertmanagers.url=dns+http://alertmanager.default
          ports:
            - containerPort: 10902
              name: http
          volumeMounts:
            - name: thanos-rule-config
              mountPath: /etc/rules
      volumes:
        - name: thanos-rule-config
          configMap:
            name: thanos-rule-config
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: pushgateway
  name: pushgateway
spec:
  replicas: 15
  selector:
    matchLabels:
      app: pushgateway
  template:
    metadata:
      labels:
        app: pushgateway
    spec:
      containers:
        - image: prom/pushgateway:v1.0.0
          name: pushgateway
          ports:
            - containerPort: 9091
              name: http
          resources:
            limits:
              memory: 1Gi
            requests:
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: pushgateway
  name: pushgateway
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 9091
      targetPort: http
  selector:
    app: pushgateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
spec:
  replicas: 3
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest
          args:
            - --web.route-prefix=/alertmanager
            - --config.file=/etc/alertmanager/config.yml
            - --storage.path=/alertmanager
            - --cluster.listen-address=0.0.0.0:8001
            - --cluster.peer=alertmanager-peers.default:8001
          ports:
            - name: alertmanager
              containerPort: 9093
          volumeMounts:
            - name: alertmanager-config
              mountPath: /etc/alertmanager
            - name: alertmanager
              mountPath: /alertmanager
      volumes:
        - name: alertmanager-config
          configMap:
            name: alertmanager-config
        - name: alertmanager
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: alertmanager-peers
  name: alertmanager-peers
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: alertmanager
  ports:
    - name: alertmanager
      protocol: TCP
      port: 9093
      targetPort: 9093
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: pushgateway-ingress
  annotations:
    kubernetes.io/ingress.class: 'nginx'
    nginx.ingress.kubernetes.io/upstream-hash-by: '$request_uri'
    nginx.ingress.kubernetes.io/ssl-redirect: 'false'
spec:
  rules:
    - host: $(DOMAIN)
      http:
        paths:
          - backend:
              serviceName: pushgateway
              servicePort: 9091
            path: /metrics
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  annotations:
    kubernetes.io/ingress.class: 'nginx'
spec:
  rules:
    - host: $(DOMAIN)
      http:
        paths:
          - backend:
              serviceName: thanos-query
              servicePort: 10901
            path: /
          - backend:
              serviceName: alertmanager
              servicePort: 9093
            path: /alertmanager
          - backend:
              serviceName: thanos-rule
              servicePort: 10902
            path: /rule
          - backend:
              serviceName: grafana
              servicePort: 3000
            path: /grafana