Post-Deploy Checks¶
After deployment, verify that Monitoring is running correctly and collecting metrics. This guide covers both common verification steps and component-specific validation techniques.
Pod Status Verification¶
First, check that all components are running using the following command:
Expected Pod List¶
For a typical Monitoring deployment, you should see pods similar to:
$ kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-k8s-0 2/2 Running 0 9d
grafana-deployment-7d869f989d-24657 1/1 Running 0 9d
grafana-operator-5857f4d47d-f7nv4 1/1 Running 0 9d
kube-state-metrics-6c89f47c94-tbx7l 1/1 Running 0 9d
monitoring-operator-monito-6d6c78dd86-wqfcx 1/1 Running 0 9d
node-exporter-g5hsr 1/1 Running 0 9d
node-exporter-jq7gq 1/1 Running 0 9d
node-exporter-zppgl 1/1 Running 0 9d
victoriametrics-operator-c5649d646-7tnw8 1/1 Running 0 9d
vmagent-k8s-8dd4ffccc-msdmx 2/2 Running 0 9d
vmalert-k8s-6c4b989889-dndmv 2/2 Running 0 9d
vmalertmanager-k8s-0 2/2 Running 0 9d
vmauth-k8s-76cc654596-gn8pl 1/1 Running 0 9d
vmsingle-k8s-6cf5b84846-c22xs 1/1 Running 0 9d
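For scripted checks, the READY and STATUS columns of this output can be parsed programmatically. A minimal sketch (the helper name and awk logic are illustrative, not part of the deployment):

```shell
# Count pods that are not fully ready: READY must be n/n and STATUS must be Running.
# Pipe in live data with: kubectl get pods -n monitoring --no-headers | count_not_ready
count_not_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") n++ } END { print n+0 }'
}

# Demo against sample output in the expected format (prints 1):
printf 'vmagent-k8s-abc 2/2 Running 0 9d\ngrafana-xyz 0/1 CrashLoopBackOff 3 9d\n' | count_not_ready
```

An exit-code check on the printed count makes this usable as a CI gate after deployment.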
Troubleshooting Pod Issues¶
If pods are not in Running status, investigate further:
# Check specific pod details
kubectl describe pod <pod-name> -n monitoring
# Check pod logs
kubectl logs <pod-name> -n monitoring
# For pods with multiple containers
kubectl logs <pod-name> -c <container-name> -n monitoring
Service Health Verification¶
VictoriaMetrics Health Check¶
Check that all VictoriaMetrics services report OK at the /health endpoint:
# Using curl
curl "http://<victoriametrics-pod-ip-or-service-name>/health"
# Using wget
wget -O - "http://<victoriametrics-pod-ip-or-service-name>/health"
# Port-forward to access from local machine
kubectl port-forward svc/vmsingle-k8s 8428:8428 -n monitoring
curl "http://localhost:8428/health"
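Right after a deployment the endpoint may briefly return errors while pods start up, so a one-shot curl can fail spuriously. A small polling helper (a sketch; the function name and default timeout are arbitrary):

```shell
# Poll a health endpoint until it returns the literal body "OK" or the timeout expires.
wait_healthy() {
  url=$1
  timeout=${2:-30}   # seconds to keep trying
  for _ in $(seq "$timeout"); do
    if [ "$(curl -fsS "$url" 2>/dev/null)" = "OK" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Usage (after the port-forward above):
# wait_healthy "http://localhost:8428/health" 60 && echo "VictoriaMetrics is healthy"
```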
Expected response:
OK
Component-Specific Health Checks¶
Grafana Health Check¶
kubectl port-forward svc/grafana-service 3000:3000 -n monitoring
curl "http://localhost:3000/api/health"
Expected response (version and commit details vary):
{"commit": "...", "database": "ok", "version": "..."}
AlertManager Health Check¶
kubectl port-forward svc/alertmanager-k8s 9093:9093 -n monitoring
curl "http://localhost:9093/-/healthy"
Expected response:
OK
Metrics Collection Verification¶
List Available Metrics¶
Get the list of metrics collected by VictoriaMetrics:
# Port-forward VictoriaMetrics
kubectl port-forward svc/vmsingle-k8s 8428:8428 -n monitoring
# Get all metric labels
curl "http://localhost:8428/api/v1/labels" | jq '.'
Verify Specific Metrics¶
Check that key metrics are being collected:
# Check node metrics
curl "http://localhost:8428/api/v1/label/__name__/values" | jq '.data[]' | grep node
# Check Kubernetes metrics
curl "http://localhost:8428/api/v1/label/__name__/values" | jq '.data[]' | grep kube
# Check specific metric values
curl "http://localhost:8428/api/v1/query?query=up" | jq '.'
Validate Data Collection¶
Ensure metrics have recent timestamps and values:
# Query a basic metric to verify data collection
curl "http://localhost:8428/api/v1/query?query=up{job=\"node-exporter\"}" | jq '.data.result[] | {metric: .metric, value: .value}'
# Check metric count
curl "http://localhost:8428/api/v1/query?query=count(count by (__name__)({__name__!=\"\"}))" | jq '.data.result[0].value[1]'
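The /api/v1/query response is JSON, so down targets can be flagged with a jq filter. The snippet below demonstrates the filter on a fabricated sample response of the same shape; point the same filter at live curl output:

```shell
# Abridged sample of the /api/v1/query?query=up response shape (fabricated data)
response='{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"job":"node-exporter","instance":"10.0.0.1:9100"},"value":[1700000000,"1"]},
  {"metric":{"job":"kube-state-metrics","instance":"10.0.0.2:8080"},"value":[1700000000,"0"]}]}}'

# A scrape target is down when its `up` sample is "0" — print the affected jobs
echo "$response" | jq -r '.data.result[] | select(.value[1] == "0") | .metric.job'
```

Against the sample above this prints kube-state-metrics, the one job whose `up` value is "0".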
Service Discovery Verification¶
Check Service Monitors¶
# List ServiceMonitors
kubectl get servicemonitors -n monitoring
# Check specific ServiceMonitor
kubectl describe servicemonitor <servicemonitor-name> -n monitoring
Check Prometheus Targets¶
If using Prometheus instead of VictoriaMetrics, port-forward the Prometheus service on port 9090, then navigate to http://localhost:9090/targets to verify target discovery.
Check VictoriaMetrics Targets¶
After port-forwarding the vmagent service on port 8429, navigate to http://localhost:8429/targets to verify target discovery.
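The targets page can also be queried as JSON. Assuming vmagent (or Prometheus) serves the standard /api/v1/targets endpoint, unhealthy targets can be counted as shown below; the jq filter is demonstrated on a fabricated sample response:

```shell
# Abridged sample of the /api/v1/targets response shape (fabricated data)
targets='{"status":"success","data":{"activeTargets":[
  {"scrapePool":"node-exporter","health":"up"},
  {"scrapePool":"kube-state-metrics","health":"down"}]}}'

# Count targets whose health is not "up".
# Live usage:
#   curl -s "http://localhost:8429/api/v1/targets" \
#     | jq '[.data.activeTargets[] | select(.health != "up")] | length'
echo "$targets" | jq '[.data.activeTargets[] | select(.health != "up")] | length'
```

A result of 0 means every discovered target is being scraped successfully.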
UI Access Verification¶
Grafana Dashboard Access¶
- Open http://localhost:3000
- Log in with the default credentials (admin/admin) or your configured credentials
- Verify that dashboards are loaded and displaying data
VictoriaMetrics UI Access¶
- Open http://localhost:8428/vmui
- Try executing queries such as up or node_cpu_seconds_total
AlertManager UI Access¶
- Open http://localhost:9093
- Verify that the AlertManager interface loads
- Check for any active alerts
Configuration Verification¶
Check Custom Resources¶
# Check PlatformMonitoring
kubectl get platformmonitoring -n monitoring -o yaml
# Check VictoriaMetrics resources
kubectl get vmsingle -n monitoring
kubectl get vmagent -n monitoring
kubectl get vmalert -n monitoring
# Check Grafana resources
kubectl get grafana -n monitoring
kubectl get grafanadashboard -n monitoring
Verify Ingress Configuration¶
If ingress is enabled:
# Check ingress resources
kubectl get ingress -n monitoring
# Check ingress details
kubectl describe ingress <ingress-name> -n monitoring
# Test external access (if DNS is configured)
curl -H "Host: grafana.example.com" http://<ingress-ip>
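For automation, a small helper can decide whether the HTTP status returned through the ingress counts as healthy (Grafana typically redirects unauthenticated requests, so 3xx is acceptable). A sketch; the set of accepted codes is an assumption:

```shell
# Return success for status codes that indicate the service answered through the ingress
is_ok_code() {
  case "$1" in
    200|301|302) return 0 ;;
    *)           return 1 ;;
  esac
}

# Usage with the curl check above:
#   code=$(curl -s -o /dev/null -w '%{http_code}' -H "Host: grafana.example.com" "http://<ingress-ip>")
#   is_ok_code "$code" && echo "ingress OK ($code)" || echo "ingress unhealthy ($code)"
```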
Storage Verification¶
Check Persistent Volumes¶
# Check PVCs
kubectl get pvc -n monitoring
# Check PV details
kubectl get pv
# Verify storage usage
kubectl exec -it <victoriametrics-pod> -n monitoring -- df -h /victoria-metrics-data
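To alert on disk pressure from a script, the Use% column of df output can be extracted and compared against a threshold. A sketch, demonstrated on sample df -h output (pipe in the kubectl exec output above for live data; the 80% threshold is an arbitrary example):

```shell
# Print the Use% value (without the % sign) from the data line of `df -h <path>` output
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Sample df -h output for demonstration (fabricated numbers)
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        50G   38G   12G  77% /victoria-metrics-data'

pct=$(printf '%s\n' "$sample" | usage_pct)
[ "$pct" -gt 80 ] && echo "WARNING: storage ${pct}% full" || echo "storage OK (${pct}%)"
```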
Alerts Verification¶
Check Alert Rules¶
# Check PrometheusRules
kubectl get prometheusrules -n monitoring
# Verify specific rule
kubectl get prometheusrule <rule-name> -n monitoring -o yaml
Test Alert Generation¶
Trigger a test alert to verify the alerting pipeline. The simplest approach is to post a synthetic alert directly to the AlertManager API; this bypasses rule evaluation and exercises only routing and notification:
# Post a synthetic alert named TestAlert (it expires after AlertManager's resolve timeout)
curl -XPOST "http://localhost:9093/api/v2/alerts" \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "info"}}]'
# Check that the alert appears in AlertManager
curl "http://localhost:9093/api/v2/alerts" | jq '.[] | select(.labels.alertname=="TestAlert")'
Performance Verification¶
Resource Usage Check¶
# Check resource usage of monitoring components
kubectl top pods -n monitoring
# Check node resource availability
kubectl top nodes
Query Performance¶
# Test query performance
time curl "http://localhost:8428/api/v1/query?query=up"
# Check heavy queries
curl "http://localhost:8428/api/v1/query?query=node_memory_MemTotal_bytes" | jq '.data.result | length'
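A single timed request can be noisy; averaging curl's time_total over several requests gives a steadier latency figure. A sketch (the avg helper is plain awk arithmetic; the commented loop shows intended live usage):

```shell
# Average a column of seconds values, printed with millisecond precision
avg() {
  awk '{ s += $1; n++ } END { printf "%.3f\n", s / n }'
}

# Live usage:
#   for _ in 1 2 3 4 5; do
#     curl -s -o /dev/null -w '%{time_total}\n' "http://localhost:8428/api/v1/query?query=up"
#   done | avg

# Demo on fixed values (prints 0.100):
printf '0.120\n0.080\n0.100\n' | avg
```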
Common Issues and Solutions¶
Issue: Pods in CrashLoopBackOff¶
Solutions:
1. Check resource limits and requests
2. Verify persistent volume permissions
3. Check configuration syntax
kubectl describe pod <failing-pod> -n monitoring
kubectl logs <failing-pod> -n monitoring --previous
Issue: No Metrics Appearing¶
Solutions:
1. Verify ServiceMonitors are correctly configured
2. Check network policies
3. Verify RBAC permissions
Issue: Grafana Dashboards Empty¶
Solutions:
1. Verify the data source configuration
2. Check VictoriaMetrics connectivity
3. Verify time range settings
Automated Health Check Script¶
Create a comprehensive health check script:
#!/bin/bash
# monitoring-health-check.sh
NAMESPACE="monitoring"
echo "=== Monitoring Health Check ==="
echo "1. Checking pod status..."
kubectl get pods -n $NAMESPACE
echo "2. Checking services..."
kubectl get svc -n $NAMESPACE
echo "3. Checking VictoriaMetrics health..."
VM_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/name=vmsingle -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward pod/$VM_POD 8428:8428 -n $NAMESPACE &
sleep 5
curl -s http://localhost:8428/health || echo "VictoriaMetrics health check failed"
pkill -f "port-forward.*8428"
echo "4. Checking metrics count..."
kubectl port-forward pod/$VM_POD 8428:8428 -n $NAMESPACE &
sleep 5
METRIC_COUNT=$(curl -s "http://localhost:8428/api/v1/query?query=count(count%20by%20(__name__)({__name__!=\"\"}))" | jq -r '.data.result[0].value[1]')
echo "Total metrics: $METRIC_COUNT"
pkill -f "port-forward.*8428"
echo "=== Health check complete ==="
Run with:
chmod +x monitoring-health-check.sh && ./monitoring-health-check.sh
Next Steps¶
After successful verification:
- Configuration - Customize monitoring setup
- Component Configuration - Fine-tune individual components
- Troubleshooting - Handle common issues
- Maintenance - Ongoing operations