
Kubernetes Monitoring Helm Chart


This Helm chart deploys the components needed for full-stack monitoring.

What this chart is:

  • Collectors for metrics and logging

  • Distribution / Routing of metrics/logging

  • Data visualization

  • Ready for long-term log storage integration, in particular with Loki

  • Ready for long-term metrics storage integration

What this chart is not:

  • A complete monitoring stack inclusive of long term storage.

We have intentionally limited this Helm chart to the monitoring components only. Keeping the monitoring components separate from storage is advantageous: for all intents and purposes, this Helm chart is ephemeral.

This Helm chart started off with components from multiple open-source projects. As such, attribution is warranted, so head on over to their projects and give them a star. The projects used to create the building blocks of this Helm chart are:

Features

  • Alert Manager

  • Grafana as the visualization frontend

    • Dashboards

      • Ceph

      • Cluster Overview

      • Node Exporter Full

      • Alert Manager

    • Sidecar to load dashboards from ConfigMaps (optional)

  • Grafana-Agent

    • DaemonSet on cluster nodes that exposes metrics and ships node and container logs to Loki
  • Loki integration for storing logs

  • Mixins Compatible (partial), Tested with:

    • alert-manager

    • ceph

    • coreDNS

    • Kubernetes

    • Dashboards included with the Loki Helm chart; see the Loki repo for details

    • Node-Exporter

    • Prometheus

    • Promtail

  • Prometheus

    • Thanos sidecar (optional)
  • Service monitors

    • API Server

    • CAdvisor

    • Calico (optional)

      Enable Calico metrics with `kubectl patch felixconfiguration default --type merge --patch '{"spec":{"prometheusMetricsEnabled": true}}'`

    • Ceph (optional)

    • CoreDNS

    • Grafana

    • Kube-Controller-Manager

    • Kube-Scheduler

    • Kubelet

    • Kube-state-metrics

    • Node

    • Node-Exporter

    • Prometheus

    • Prometheus-Adapter

    • Promtail

    • Thanos

  • Kyverno policies (optional, set in values.yaml)

    • auto-deploys a policy creating a Role and RoleBinding so the Prometheus service account can access other namespaces
  • Customization of components via values.yaml. See values.yaml for more details.
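When the optional dashboard sidecar is enabled, dashboards can be supplied as ConfigMaps carrying the label configured under `nfc_monitoring.additions.dashboard_sidecar` (by default `grafana_dashboard: "1"`, per values.yaml). A minimal sketch — the ConfigMap name and dashboard JSON are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard            # hypothetical name
  namespace: grafana            # the chart's default Grafana namespace
  labels:
    grafana_dashboard: "1"      # must match label_name / label_value in values.yaml
data:
  my-dashboard.json: |
    { "title": "My Dashboard", "panels": [] }
```

The sidecar watches for ConfigMaps with this label and loads their contents into Grafana as dashboards.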

Installation

This chart has the following dependencies:

These dependencies must be present before installing this chart. To install, run the following commands:

```shell
git clone -b development https://gitlab.com/nofusscomputing/projects/kubernetes_monitoring.git

helm upgrade -i nfc_monitoring kubernetes_monitoring/
```
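To customize the deployment, values can be overridden with a separate file passed via Helm's `-f` flag. A minimal sketch, assuming you only need to change the ingress hostnames (the hostnames below are placeholders):

```yaml
# overrides.yaml (hypothetical file)
nfc_monitoring:
  grafana:
    ingress:
      hostname: grafana.example.com
  prometheus:
    ingress:
      hostname: prometheus.example.com
```

Then install with `helm upgrade -i nfc_monitoring kubernetes_monitoring/ -f overrides.yaml`.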

Metrics Storage

This chart provides two different ways to route your metrics to long-term storage:

  • Prometheus remote_write

  • Thanos sidecar container

Prometheus Remote Write

With this method, Prometheus pushes its metrics to another service for storage. This option could, for example, be used to push metrics to Grafana Mimir. To configure it, add the following to the values.yaml file at path nfc_monitoring.prometheus.additional:

```yaml
remoteWrite:
  - name: mimir
    url: http://mimir-gateway.metrics.svc.cluster.local/api/v1/push
```

Ensure that url is set to your storage service's push endpoint.
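If the remote endpoint requires authentication, a basicAuth block referencing a Kubernetes Secret can be added, mirroring the commented example in values.yaml (which assumes a Secret named prometheus-remote-write with keys username and password):

```yaml
remoteWrite:
  - name: mimir
    url: http://mimir-gateway.metrics.svc.cluster.local/api/v1/push
    basicAuth:
      username:
        name: prometheus-remote-write   # Kubernetes Secret name
        key: username                   # key within the Secret
      password:
        name: prometheus-remote-write
        key: password
```

See the Prometheus Operator RemoteWriteSpec documentation (linked in values.yaml) for the full set of supported fields.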

Thanos Sidecar

This method deploys a Thanos container within the Prometheus pod. The container reads the Prometheus container's TSDB and uploads it to the configured storage. To configure, set the following in values.yaml:

  • nfc_monitoring.thanos.sidecar.enabled set to bool true

  • nfc_monitoring.thanos.sidecar.config updated to include the sidecar config.

    For example, to configure the Thanos sidecar to upload Prometheus metrics to S3 storage, use (updating the values to suit your environment):

```yaml
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "rook-ceph-rgw-earth.ceph.svc:80"
  access_key: "7J5NM2MNCDB4T4Y9OKJ5"
  secret_key: "t9r69RzZdWEBL3NCKiUIpDk6j5625xc6HucusiGG"
```
    
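Putting both settings together, the relevant values.yaml fragment would look like the sketch below; the bucket, endpoint, and keys are environment-specific placeholders:

```yaml
nfc_monitoring:
  thanos:
    sidecar:
      enabled: true           # deploy the sidecar container in the Prometheus pod
      config:                 # object storage config consumed by Thanos
        type: S3
        config:
          bucket: "thanos-metrics"
          endpoint: "rook-ceph-rgw-earth.ceph.svc:80"
          access_key: "placeholder-access-key"
          secret_key: "placeholder-secret-key"
```

Note that the sidecar is only deployed when config is non-empty, per the comment in values.yaml.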

Template Values

The values file included in this Helm chart is shown below.

Note

This values file is from the development branch and as such may not reflect the current version you have deployed. If you require a specific version, head on over to the git repository and select the git tag that matches the version you are after.

values.yaml
---

# All values within this helm chart values.yaml file are under namespace `nfc_monitoring`.
# this provides the opportunity to include this helm chart as a dependency without
# variable collision

nfc_monitoring:

  kubernetes:
    cluster_dns_name: cluster.local
    networking: calico


  alert_manager:

    enabled: true 
    image: 
      name: quay.io/prometheus/alertmanager
      tag: 'v0.26.0'

    # How many replicas to deploy
    replicas: 1


    ingress:
      annotations:
        cert-manager.io/cluster-issuer: "selfsigned-issuer"
        nginx.ingress.kubernetes.io/ssl-redirect: "true"

      enabled: false

      hostname: alert-manager.local


    labels:
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/name: alertmanager

    namespace: alerting


  grafana:

    dashboards:
      cert_manager: false

    enabled: false

    # Grafana Configuration
    # Type: Dict
    # See: https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana
    config:
      analytics:
        enabled: 'false'
      # database:
      #   type: mysql
      #   host: mariadb-galera.mariadb.svc:3306
      #   name: grafana
      #   user: root
      #   password: admin

      log:
        mode: "console"
      auth:
        disable_login_form: "false"
      security:
        admin_user: admin
        admin_password: admin

    image: 
      name: grafana/grafana
      tag: '10.3.1' # '10.0.5'

    ingress:
      annotations:
        cert-manager.io/cluster-issuer: "selfsigned-issuer"
        nginx.ingress.kubernetes.io/ssl-redirect: "true"

      enabled: true

      hostname:  grafana.local


    labels:
      app.kubernetes.io/component: graphing
      app.kubernetes.io/name: grafana

    namespace: grafana

    replicas: 1

    # storage_accessModes: ReadWriteMany

    affinity: 
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/worker
              operator: Exists
          weight: 100
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/storage
              operator: DoesNotExist
          weight: 100
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - prometheus
            topologyKey: kubernetes.io/hostname
          weight: 10

    # To add Grafana datasources
    # Type: list
    # See: https://grafana.com/docs/grafana/latest/administration/provisioning/#data-sources
    DataSources:
      - name: alertmanager
        type: alertmanager
        access: proxy
        url: "http://alertmanager-main.{{ .Values.nfc_monitoring.alert_manager.namespace }}.svc:9093"
        isDefault: false
        jsonData:
          tlsSkipVerify: true
          timeInterval: "5s"
          implementation: prometheus
          handleGrafanaManagedAlerts: false
          orgId: 1
        editable: true

      - name: loki
        type: loki
        access: proxy
        url: "http://{{ .Values.nfc_monitoring.loki.service_name }}.{{ .Values.nfc_monitoring.loki.namespace }}.svc.{{ .Values.nfc_monitoring.kubernetes.cluster_dns_name }}:{{ .Values.nfc_monitoring.loki.service_port }}"
        isDefault: false
        jsonData:
          orgId: 1
        editable: true

      # - name: mimir
      #   type: prometheus
      #   access: proxy
      #   url: "http://mimir-gateway.metrics.svc.cluster.local/prometheus"
      #   isDefault: false
      #   jsonData:
      #     manageAlerts: true
      #     orgId: 1
      #     prometheusType: Mimir
      #   editable: true

      # - name: prometheus
      #   type: prometheus
      #   access: proxy
      #   url: "http://prometheus-k8s.{{ .Values.nfc_monitoring.prometheus.namespace }}.svc:9090"
      #   isDefault: true
      #   jsonData:
      #     manageAlerts: true
      #     orgId: 1
      #     prometheusType: Prometheus
      #     prometheusVersion: 2.42.0
      #   editable: true

      - name: thanos
        type: prometheus
        access: proxy
        url: "http://thanos-query.metrics.svc:9090"
        isDefault: true
        jsonData:
          manageAlerts: true
          orgId: 1
          prometheusType: Thanos
          prometheusVersion: 0.31.0
        editable: true


  grafana_agent:

    enabled: true 

    image: 
      name: grafana/agent
      tag: 'v0.39.2'

    labels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: grafana-agent

    namespace: monitoring


  loki:

    enabled: true

    image: 
      name: grafana/loki
      tag: 2.7.4

    namespace: logging

    # If no config is setup, logging will not be enabled.
    config: {}
    # service name and port are used for the connection to your loki instance
    # service_name: loki-gateway
    # service_port: 80

    ServiceMonitor:
      selector:
        matchLabels:
          app.kubernetes.io/name: loki
          app.kubernetes.io/component: logging


  kube_monitor_proxy:
    enabled: false
    namespace: monitoring


  kube_rbac_proxy:

    # This image is used as part of kube-monitor-proxy.
    image: 
      name: quay.io/brancz/kube-rbac-proxy
      tag: 'v0.14.2'


  kube_state_metrics:

    enabled: false
    image: 
      name: registry.k8s.io/kube-state-metrics/kube-state-metrics
      tag: 'v2.8.1'
    namespace: monitoring


  prometheus:

    image:
      name: prom/prometheus
      tag: 'v2.49.0'

    # How many replicas to deploy
    replicas: 1

    # alertmanagers:
    #   - name: 

    # Configure prometheus to write metrics to remote host
    # below example config uses a secret named "prometheus-remote-write" with two keys username and password.
    # Documentation: https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.RemoteWriteSpec
    remotewrite: {}
      # url:
      # name:
      # remoteTimeout: 30
      # writeRelabelConfigs:
      # basicAuth:
      #   username:
      #     name: prometheus-remote-write
      #     key: username
      #   password:
      #     name: prometheus-remote-write
      #     key: password


    ingress:
      annotations:
        cert-manager.io/cluster-issuer: "selfsigned-issuer"
        nginx.ingress.kubernetes.io/ssl-redirect: "true"
      enabled: true
      hostname:  prometheus.local


    # These labels are appended to all Prometheus items and are also the selector labels
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus

    namespace: monitoring

    affinity: 
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/worker
              operator: Exists
          weight: 100
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/storage
              operator: DoesNotExist
          weight: 100
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - prometheus
            topologyKey: kubernetes.io/hostname
          weight: 10


    # Deploy a generate policy for kyverno to create Role and RoleBindings
    # for the prometheus service account so it can monitor
    # new/existing namespaces
    kyverno_role_policy: false

    storage:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 40Gi

    # Additional settings for Prometheus.
    # See: https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.PrometheusSpec
    # Type: dict
    additional:

      # Don't declare remoteWrite here, as it's done at path .prometheus.remotewrite
      # remoteWrite: 

      retention: 24h
      retentionSize: 2GB
      ruleSelector:
        matchLabels:
          role: alert-rules

    service_monitor:
      apiserver: false
      cadvisor: false
      calico: false
      ceph: false
      coredns: false
      kube_controller_manager: false
      kubelet: false
      kube_scheduler: false


  prometheus_adaptor:

    enabled: false

    image:
      name: registry.k8s.io/prometheus-adapter/prometheus-adapter
      tag: 'v0.11.1'

    labels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter

    namespace: monitoring

    affinity: 
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/worker
              operator: Exists
          weight: 100
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/storage
              operator: DoesNotExist
          weight: 100
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - prometheus
            topologyKey: kubernetes.io/hostname
          weight: 10

  thanos:
    image:
      name: thanosio/thanos
      tag: v0.32.3

    # Prometheus thanos sidecar
    # see: https://thanos.io/tip/components/sidecar.md/
    # Type: Dict
    sidecar:

      enabled: true

      # Config must be specified for the sidecar to deploy
      config: {}
      #   type: S3
      #   config:
      #     bucket: "thanos-metrics"
      #     endpoint: "rook-ceph-rgw-earth.ceph.svc:80"
      #     access_key: "7J5NM2MNCDB4T4Y9OKJ5"
      #     secret_key: "t9r69RzZdWEBL3NCKiUIpDk6j5625xc6HucusiGG"
      #     insecure: true


  additions:

    ceph:

      enabled: false

      namespace: ceph

      PrometheusRules: true

      ServiceMonitor:

        selector:
          matchLabels:
            app: rook-ceph-mgr

    # Add sidecar to grafana pod to load dashboards from configMap
    dashboard_sidecar: 

      enabled: false

      image:
        name: ghcr.io/kiwigrid/k8s-sidecar
        tag: '1.24.5'

      label_name: grafana_dashboard
      label_value: "1"


  network_policy:

    enabled: false


loki_instance:
  image: 
    name: grafana/loki
    tag: 2.7.4
    # tag: 2.9.0
  namespace: loki


oncall_instance:
  image: 
    name: grafana/oncall
    tag: v1.1.40


# oncall:

#   # image:
#   #   # Grafana OnCall docker image repository
#   #   repository: grafana/oncall
#   #   tag: v1.1.38
#   #   pullPolicy: Always

#   service:
#     enabled: false
#     type: LoadBalancer
#     port: 8080
#     annotations: {}

#   engine:
#     replicaCount: 1
#     resources:
#       limits:
#         cpu: 100m
#         memory: 128Mi
#       requests:
#         cpu: 100m
#         memory: 128Mi

#   celery:
#     replicaCount: 1
#     resources:
#       limits:
#         cpu: 100m
#         memory: 128Mi
#       requests:
#         cpu: 100m
#         memory: 128Mi
#   database:
#     type: none

About:

This page forms part of our Project Kubernetes Monitoring Helm Chart.

Page Metadata
Version: ToDo: place files short git commit here
Date Created: 2023-09-19
Date Edited: 2023-09-27

Contribution:

Would you like to contribute to our Kubernetes Monitoring Helm Chart project? You can assist in the following ways:

 

ToDo: Add the page list of contributors