Prometheus TKE Monitoring Data Collection Best Practices
How should collection rules be configured when using Prometheus to collect monitoring data from Tencent Cloud Container Service (TKE)? The main consideration is collecting metrics from kubelet and cadvisor. This article shares how to configure Prometheus scrape_config to collect monitoring data from TKE clusters.
TKE Cluster Regular Node Collection Rules
If you use kube-prometheus-stack or victoria-metrics-k8s-stack to build your monitoring system, collection of regular node monitoring data does not need to be configured manually.
If you maintain Prometheus collection rules manually, you can refer to the following:
- job_name: "tke-cadvisor"
scheme: https
metrics_path: /metrics/cadvisor # Collect container cadvisor monitoring data
tls_config:
insecure_skip_verify: true # TKE's kubelet uses self-signed certificates, ignore certificate verification
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
regex: eklet # Exclude super nodes
action: drop
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- job_name: "tke-kubelet"
scheme: https
metrics_path: /metrics # Collect kubelet's own monitoring data
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
regex: eklet
action: drop
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- job_name: "tke-probes" # Collect container health check data
scheme: https
metrics_path: /metrics/probes
tls_config:
insecure_skip_verify: true
authorization:
credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_label_node_kubernetes_io_instance_type]
regex: eklet
action: drop
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- Use node service discovery (kubernetes_sd_configs with role: node) to scrape the several kinds of monitoring data exposed on kubelet:10250 of every node.
- If the cluster mixes regular nodes and super nodes, exclude the super nodes (in relabel_configs, drop nodes carrying the label node.kubernetes.io/instance-type: eklet).
- The kubelet certificates on TKE nodes are self-signed, so insecure_skip_verify needs to be set to true; authentication uses the in-cluster ServiceAccount token referenced by credentials_file (see the RBAC sketch after this list).
- kubelet exposes container cadvisor monitoring data, its own monitoring data, and container health check data on the /metrics/cadvisor, /metrics, and /metrics/probes paths respectively, so a separate collection job is configured for each of the three paths.
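Because the scrape jobs above authenticate to the kubelet with the Prometheus ServiceAccount token, that ServiceAccount needs RBAC permissions to discover nodes and read their metrics endpoints. A minimal sketch, assuming Prometheus runs in a monitoring namespace under a ServiceAccount named prometheus (both names are placeholders; adjust to your deployment):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-tke-scrape # placeholder name
rules:
  # Node discovery for the role: node jobs above
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  # Reading the kubelet metrics endpoints with the ServiceAccount token
  - apiGroups: [""]
    resources: ["nodes/metrics"]
    verbs: ["get"]
  # Pod discovery, needed by the super node job later in this article
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-tke-scrape
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-tke-scrape
subjects:
  - kind: ServiceAccount
    name: prometheus # placeholder: the ServiceAccount Prometheus runs as
    namespace: monitoring # placeholder namespace
If you deploy with kube-prometheus-stack or victoria-metrics-k8s-stack, equivalent RBAC objects are normally created by the chart and nothing extra is required.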
TKE Cluster Super Node Collection Rules
Super nodes are virtual nodes where each Pod occupies an exclusive virtual machine. Monitoring data is exposed on port 9100 of each Pod. Use the following collection rules:
- job_name: serverless-pod # Collect super node Pod monitoring data
  honor_timestamps: true
  metrics_path: '/metrics' # All monitoring data is exposed on this path
  params: # Usually filter out ipvs-related metrics, which can be voluminous and increase Pod load
    collect[]:
      - 'ipvs'
      # - 'cpu'
      # - 'meminfo'
      # - 'diskstats'
      # - 'filesystem'
      # - 'loadavg'
      # - 'netdev'
      # - 'filefd'
      # - 'pressure'
      # - 'vmstat'
  scheme: http
  kubernetes_sd_configs:
    - role: pod # Super node Pod monitoring data is exposed on port 9100 of each Pod's own IP, so use Pod service discovery
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_tke_cloud_tencent_com_pod_type]
      regex: eklet # Only collect super node Pods
      action: keep
    - source_labels: [__meta_kubernetes_pod_phase]
      regex: Running # Pods not in the Running state have released their machine resources; no need to collect
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      separator: ;
      regex: (.*)
      target_label: __address__
      replacement: ${1}:9100 # Monitoring metrics are exposed on the Pod's port 9100
      action: replace
    - source_labels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      target_label: pod # Write the Pod name to the "pod" label
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace # Write the Pod's namespace to the "namespace" label
      replacement: ${1}
      action: replace
  metric_relabel_configs:
    - source_labels: [__name__]
      separator: ;
      regex: (container_.*|pod_.*|kubelet_.*)
      replacement: $1
      action: keep
- Super node monitoring data is exposed on port 9100 of each Pod via the /metrics HTTP path (not HTTPS). Use Pod service discovery (kubernetes_sd_configs with role: pod) to collect all of it with a single job.
- Super node Pods support filtering out unwanted metrics with the collect[] query parameter, which avoids an excessive metric volume that could increase Pod load. Usually the ipvs metrics are filtered out.
- If the cluster mixes regular nodes and super nodes, make sure only super node Pods are collected (in relabel_configs, keep only Pods with the annotation tke.cloud.tencent.com/pod-type: eklet).
- Pods whose phase is not Running have already released their machine resources and cannot be scraped, so exclude them.
- Metrics prefixed with container_ are cadvisor monitoring data; pod_-prefixed metrics are monitoring data of the super node Pod's underlying machine (equivalent to node_exporter's node_-prefixed metrics with the prefix replaced by pod_); kubelet_-prefixed metrics are kubelet-compatible metrics inside the super node Pod's machine (mainly PVC storage monitoring). A usage sketch follows this list.
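To illustrate how these metric families can be used, here is a small recording-rule sketch. The container_ metric name is a standard cadvisor name; the pod_ metric name is an assumption derived from the node_ to pod_ renaming described above, so verify both against what your cluster actually exposes before relying on them:
groups:
  - name: serverless-pod-examples # hypothetical rule group
    rules:
      # Per-Pod memory working set, using a standard cadvisor metric name
      - record: namespace_pod:container_memory_working_set_bytes:sum
        expr: sum by (namespace, pod) (container_memory_working_set_bytes)
      # CPU usage of the super node Pod's underlying machine; the metric name is
      # assumed from node_exporter's node_cpu_seconds_total with node_ replaced by pod_
      - record: namespace_pod:pod_cpu_usage_seconds:rate5m
        expr: sum by (namespace, pod) (rate(pod_cpu_seconds_total{mode!="idle"}[5m]))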
If you use kube-prometheus-stack to deploy the monitoring system, super node collection configuration can be written under the prometheus.prometheusSpec.additionalScrapeConfigs field. Example:
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: serverless-pod
        ...
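For reference, a values override like the one above is typically applied when installing or upgrading the chart with Helm. The release name, namespace, and repository alias below are assumptions, not part of the original setup; adjust them to your environment:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace -f values.yaml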
TKE Serverless Cluster Collection Rules
TKE Serverless clusters only have super nodes. Pods don't have the tke.cloud.tencent.com/pod-type annotation, so this filtering condition is not needed. Collection rules are:
- job_name: serverless-pod # Collect super node Pod monitoring data
  honor_timestamps: true
  metrics_path: '/metrics' # All monitoring data is exposed on this path
  params: # Usually filter out ipvs-related metrics, which can be voluminous and increase Pod load
    collect[]:
      - 'ipvs'
      # - 'cpu'
      # - 'meminfo'
      # - 'diskstats'
      # - 'filesystem'
      # - 'loadavg'
      # - 'netdev'
      # - 'filefd'
      # - 'pressure'
      # - 'vmstat'
  scheme: http
  kubernetes_sd_configs:
    - role: pod # Super node Pod monitoring data is exposed on port 9100 of each Pod's own IP, so use Pod service discovery
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_phase]
      regex: Running # Pods not in the Running state have released their machine resources; no need to collect
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      separator: ;
      regex: (.*)
      target_label: __address__
      replacement: ${1}:9100 # Monitoring metrics are exposed on the Pod's port 9100
      action: replace
    - source_labels: [__meta_kubernetes_pod_name]
      separator: ;
      regex: (.*)
      target_label: pod # Write the Pod name to the "pod" label
      replacement: ${1}
      action: replace
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: namespace # Write the Pod's namespace to the "namespace" label
      replacement: ${1}
      action: replace
  metric_relabel_configs:
    - source_labels: [__name__]
      separator: ;
      regex: (container_.*|pod_.*|kubelet_.*)
      replacement: $1
      action: keep
FAQ
Why Use the Strange collect[] Parameter to Filter Metrics?
Super node Pods use the collect[] query parameter to filter out unwanted monitoring metrics:
curl "${IP}:9100/metrics?collect[]=ipvs&collect[]=vmstat"
Why such an odd parameter name? Because node_exporter uses this parameter, and super node Pods internally reuse node_exporter's logic; refer to node_exporter's documentation for details on how the collect[] parameter works.