Post-Deploy Checks¶
After deployment, verify that Monitoring is running correctly and collecting metrics. This guide covers both common verification steps and component-specific validation techniques.
Pod Status Verification¶
First, check that all components are running using the following command:
Expected Pod List¶
For a typical Monitoring deployment, you should see pods similar to:
$ kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-k8s-0 2/2 Running 0 9d
grafana-deployment-7d869f989d-24657 1/1 Running 0 9d
grafana-operator-5857f4d47d-f7nv4 1/1 Running 0 9d
kube-state-metrics-6c89f47c94-tbx7l 1/1 Running 0 9d
monitoring-operator-monito-6d6c78dd86-wqfcx 1/1 Running 0 9d
node-exporter-g5hsr 1/1 Running 0 9d
node-exporter-jq7gq 1/1 Running 0 9d
node-exporter-zppgl 1/1 Running 0 9d
victoriametrics-operator-c5649d646-7tnw8 1/1 Running 0 9d
vmagent-k8s-8dd4ffccc-msdmx 2/2 Running 0 9d
vmalert-k8s-6c4b989889-dndmv 2/2 Running 0 9d
vmalertmanager-k8s-0 2/2 Running 0 9d
vmauth-k8s-76cc654596-gn8pl 1/1 Running 0 9d
vmsingle-k8s-6cf5b84846-c22xs 1/1 Running 0 9d
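For scripted checks, the READY and STATUS columns of this output can be parsed programmatically. A minimal sketch (the helper name and awk logic are illustrative, not part of the deployment):

```shell
# Count pods that are not fully ready: READY must be n/n and STATUS must be Running.
# Pipe in live data with: kubectl get pods -n monitoring --no-headers | count_not_ready
count_not_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") n++ } END { print n+0 }'
}

# Demo against sample output in the expected format (prints 1):
printf 'vmagent-k8s-abc 2/2 Running 0 9d\ngrafana-xyz 0/1 CrashLoopBackOff 3 9d\n' | count_not_ready
```

An exit-code check on the printed count makes this usable as a CI gate after deployment.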
Troubleshooting Pod Issues¶
If pods are not in Running status, investigate further:
# Check specific pod details
kubectl describe pod <pod-name> -n monitoring
# Check pod logs
kubectl logs <pod-name> -n monitoring
# For pods with multiple containers
kubectl logs <pod-name> -c <container-name> -n monitoring
Service Health Verification¶
VictoriaMetrics Health Check¶
Check that all VictoriaMetrics services report OK at the /health endpoint:
# Using curl
curl "http://<victoriametrics-pod-ip-or-service-name>/health"
# Using wget
wget -O - "http://<victoriametrics-pod-ip-or-service-name>/health"
# Port-forward to access from local machine
kubectl port-forward svc/vmsingle-k8s 8428:8428 -n monitoring
curl "http://localhost:8428/health"
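Right after a deployment the endpoint may briefly return errors while pods start up, so a one-shot curl can fail spuriously. A small polling helper (a sketch; the function name and default timeout are arbitrary):

```shell
# Poll a health endpoint until it returns the literal body "OK" or the timeout expires.
wait_healthy() {
  url=$1
  timeout=${2:-30}   # seconds to keep trying
  for _ in $(seq "$timeout"); do
    if [ "$(curl -fsS "$url" 2>/dev/null)" = "OK" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Usage (after the port-forward above):
# wait_healthy "http://localhost:8428/health" 60 && echo "VictoriaMetrics is healthy"
```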
Expected response:
OK
Component-Specific Health Checks¶
Grafana Health Check¶
kubectl port-forward svc/grafana-service 3000:3000 -n monitoring
curl "http://localhost:3000/api/health"
Expected response (version and commit details vary):
{"commit": "...", "database": "ok", "version": "..."}
AlertManager Health Check¶
kubectl port-forward svc/alertmanager-k8s 9093:9093 -n monitoring
curl "http://localhost:9093/-/healthy"
Expected response:
OK
Metrics Collection Verification¶
List Available Metrics¶
Get the list of metrics collected by VictoriaMetrics:
# Port-forward VictoriaMetrics
kubectl port-forward svc/vmsingle-k8s 8428:8428 -n monitoring
# Get all metric labels
curl "http://localhost:8428/api/v1/labels" | jq '.'
Verify Specific Metrics¶
Check that key metrics are being collected:
# Check node metrics
curl "http://localhost:8428/api/v1/label/__name__/values" | jq '.data[]' | grep node
# Check Kubernetes metrics
curl "http://localhost:8428/api/v1/label/__name__/values" | jq '.data[]' | grep kube
# Check specific metric values
curl "http://localhost:8428/api/v1/query?query=up" | jq '.'
Validate Data Collection¶
Ensure metrics have recent timestamps and values:
# Query a basic metric to verify data collection
curl "http://localhost:8428/api/v1/query?query=up{job=\"node-exporter\"}" | jq '.data.result[] | {metric: .metric, value: .value}'
# Check metric count
curl "http://localhost:8428/api/v1/query?query=count(count by (__name__)({__name__!=\"\"}))" | jq '.data.result[0].value[1]'
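The /api/v1/query response is JSON, so down targets can be flagged with a jq filter. The snippet below demonstrates the filter on a fabricated sample response of the same shape; point the same filter at live curl output:

```shell
# Abridged sample of the /api/v1/query?query=up response shape (fabricated data)
response='{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"job":"node-exporter","instance":"10.0.0.1:9100"},"value":[1700000000,"1"]},
  {"metric":{"job":"kube-state-metrics","instance":"10.0.0.2:8080"},"value":[1700000000,"0"]}]}}'

# A scrape target is down when its `up` sample is "0" — print the affected jobs
echo "$response" | jq -r '.data.result[] | select(.value[1] == "0") | .metric.job'
```

Against the sample above this prints kube-state-metrics, the one job whose `up` value is "0".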
Service Discovery Verification¶
Check Service Monitors¶
# List ServiceMonitors
kubectl get servicemonitors -n monitoring
# Check specific ServiceMonitor
kubectl describe servicemonitor <servicemonitor-name> -n monitoring
Check Prometheus Targets¶
If using Prometheus instead of VictoriaMetrics, port-forward the Prometheus service on port 9090, then navigate to http://localhost:9090/targets to verify target discovery.
Check VictoriaMetrics Targets¶
After port-forwarding the vmagent service on port 8429, navigate to http://localhost:8429/targets to verify target discovery.
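The targets page can also be queried as JSON. Assuming vmagent (or Prometheus) serves the standard /api/v1/targets endpoint, unhealthy targets can be counted as shown below; the jq filter is demonstrated on a fabricated sample response:

```shell
# Abridged sample of the /api/v1/targets response shape (fabricated data)
targets='{"status":"success","data":{"activeTargets":[
  {"scrapePool":"node-exporter","health":"up"},
  {"scrapePool":"kube-state-metrics","health":"down"}]}}'

# Count targets whose health is not "up".
# Live usage:
#   curl -s "http://localhost:8429/api/v1/targets" \
#     | jq '[.data.activeTargets[] | select(.health != "up")] | length'
echo "$targets" | jq '[.data.activeTargets[] | select(.health != "up")] | length'
```

A result of 0 means every discovered target is being scraped successfully.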
UI Access Verification¶
Grafana Dashboard Access¶
- Open http://localhost:3000
- Log in with the default credentials (admin/admin) or your configured credentials
- Verify that dashboards are loaded and displaying data
VictoriaMetrics UI Access¶
- Open http://localhost:8428/vmui
- Try executing queries such as up or node_cpu_seconds_total
AlertManager UI Access¶
- Open http://localhost:9093
- Verify that the AlertManager interface loads
- Check for any active alerts
Configuration Verification¶
Check Custom Resources¶
# Check PlatformMonitoring
kubectl get platformmonitoring -n monitoring -o yaml
# Check VictoriaMetrics resources
kubectl get vmsingle -n monitoring
kubectl get vmagent -n monitoring
kubectl get vmalert -n monitoring
# Check Grafana resources
kubectl get grafana -n monitoring
kubectl get grafanadashboard -n monitoring
Verify Ingress Configuration¶
If ingress is enabled:
# Check ingress resources
kubectl get ingress -n monitoring
# Check ingress details
kubectl describe ingress <ingress-name> -n monitoring
# Test external access (if DNS is configured)
curl -H "Host: grafana.example.com" http://<ingress-ip>
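For automation, a small helper can decide whether the HTTP status returned through the ingress counts as healthy (Grafana typically redirects unauthenticated requests, so 3xx is acceptable). A sketch; the set of accepted codes is an assumption:

```shell
# Return success for status codes that indicate the service answered through the ingress
is_ok_code() {
  case "$1" in
    200|301|302) return 0 ;;
    *)           return 1 ;;
  esac
}

# Usage with the curl check above:
#   code=$(curl -s -o /dev/null -w '%{http_code}' -H "Host: grafana.example.com" "http://<ingress-ip>")
#   is_ok_code "$code" && echo "ingress OK ($code)" || echo "ingress unhealthy ($code)"
```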
Storage Verification¶
Check Persistent Volumes¶
# Check PVCs
kubectl get pvc -n monitoring
# Check PV details
kubectl get pv
# Verify storage usage
kubectl exec -it <victoriametrics-pod> -n monitoring -- df -h /victoria-metrics-data
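To alert on disk pressure from a script, the Use% column of df output can be extracted and compared against a threshold. A sketch, demonstrated on sample df -h output (pipe in the kubectl exec output above for live data; the 80% threshold is an arbitrary example):

```shell
# Print the Use% value (without the % sign) from the data line of `df -h <path>` output
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Sample df -h output for demonstration (fabricated numbers)
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        50G   38G   12G  77% /victoria-metrics-data'

pct=$(printf '%s\n' "$sample" | usage_pct)
[ "$pct" -gt 80 ] && echo "WARNING: storage ${pct}% full" || echo "storage OK (${pct}%)"
```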
Alerts Verification¶
Check Alert Rules¶
# Check PrometheusRules
kubectl get prometheusrules -n monitoring
# Verify specific rule
kubectl get prometheusrule <rule-name> -n monitoring -o yaml
Test Alert Generation¶
Trigger a test alert to verify the alerting pipeline. The simplest approach is to post a synthetic alert directly to the AlertManager API; this bypasses rule evaluation and exercises only routing and notification:
# Post a synthetic alert named TestAlert (it expires after AlertManager's resolve timeout)
curl -XPOST "http://localhost:9093/api/v2/alerts" \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "info"}}]'
# Check that the alert appears in AlertManager
curl "http://localhost:9093/api/v2/alerts" | jq '.[] | select(.labels.alertname=="TestAlert")'
Performance Verification¶
Resource Usage Check¶
# Check resource usage of monitoring components
kubectl top pods -n monitoring
# Check node resource availability
kubectl top nodes
Query Performance¶
# Test query performance
time curl "http://localhost:8428/api/v1/query?query=up"
# Check heavy queries
curl "http://localhost:8428/api/v1/query?query=node_memory_MemTotal_bytes" | jq '.data.result | length'
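A single timed request can be noisy; averaging curl's time_total over several requests gives a steadier latency figure. A sketch (the avg helper is plain awk arithmetic; the commented loop shows intended live usage):

```shell
# Average a column of seconds values, printed with millisecond precision
avg() {
  awk '{ s += $1; n++ } END { printf "%.3f\n", s / n }'
}

# Live usage:
#   for _ in 1 2 3 4 5; do
#     curl -s -o /dev/null -w '%{time_total}\n' "http://localhost:8428/api/v1/query?query=up"
#   done | avg

# Demo on fixed values (prints 0.100):
printf '0.120\n0.080\n0.100\n' | avg
```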
Common Issues and Solutions¶
Issue: Pods in CrashLoopBackOff¶
Solutions:
1. Check resource limits and requests
2. Verify persistent volume permissions
3. Check configuration syntax
kubectl describe pod <failing-pod> -n monitoring
kubectl logs <failing-pod> -n monitoring --previous
Issue: No Metrics Appearing¶
Solutions:
1. Verify ServiceMonitors are correctly configured
2. Check network policies
3. Verify RBAC permissions
Issue: Grafana Dashboards Empty¶
Solutions:
1. Verify the data source configuration
2. Check VictoriaMetrics connectivity
3. Verify time range settings
Automated Health Check Script¶
Create a comprehensive health check script:
#!/bin/bash
# monitoring-health-check.sh
NAMESPACE="monitoring"
echo "=== Monitoring Health Check ==="
echo "1. Checking pod status..."
kubectl get pods -n $NAMESPACE
echo "2. Checking services..."
kubectl get svc -n $NAMESPACE
echo "3. Checking VictoriaMetrics health..."
VM_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=vmsingle -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward pod/$VM_POD 8428:8428 -n $NAMESPACE &
sleep 5
curl -s http://localhost:8428/health || echo "VictoriaMetrics health check failed"
pkill -f "port-forward.*8428"
echo "4. Checking metrics count..."
kubectl port-forward pod/$VM_POD 8428:8428 -n $NAMESPACE &
sleep 5
METRIC_COUNT=$(curl -s "http://localhost:8428/api/v1/query?query=count(count%20by%20(__name__)({__name__!=\"\"}))" | jq -r '.data.result[0].value[1]')
echo "Total metrics: $METRIC_COUNT"
pkill -f "port-forward.*8428"
echo "=== Health check complete ==="
Run with:
chmod +x monitoring-health-check.sh && ./monitoring-health-check.sh
Next Steps¶
After successful verification:
- Configuration - Customize monitoring setup
- Component Configuration - Fine-tune individual components
- Troubleshooting - Handle common issues
- Maintenance - Ongoing operations