Debugging Service Connectivity in Kubernetes: The Ultimate Guide with Real-World Scenarios
Service connectivity issues in Kubernetes can be daunting, especially in production environments. This guide provides a step-by-step framework to diagnose and resolve these issues, enriched with real-world scenarios, detailed explanations, and actionable fixes. Whether you’re a developer or an SRE, this guide will help you troubleshoot like a pro.
1. Check Endpoints: Are Pods Properly Associated with the Service?
Step 1: Verify Endpoints Exist
In Kubernetes, a Service routes traffic to Pods based on label selectors. If no endpoints are linked to the Service, traffic cannot flow.
Command:
kubectl get endpoints <service-name>
What to Look For:
- No Endpoints? The Service’s selector does not match any Pod labels, or Pods are not in a Ready state.
- Endpoints Exist but Unreachable? Pods might be running but failing readiness probes.
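A quick way to confirm whether the selector and labels agree is to print them side by side (a minimal sketch; substitute your own Service and Pod names):
# Show the selector the Service uses to pick Pods:
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
# Show the labels and readiness of the candidate Pods:
kubectl get pods --show-labels -o wide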
Real-World Scenarios & Fixes
Scenario 1: Misconfigured Selector Labels
Problem:
A Service named order-service uses the selector app: order-v2, but the Pods are labeled app: orders.
Diagnosis:
kubectl get endpoints order-service # ENDPOINTS column shows "<none>"
Fix:
Update the Service’s selector to match the Pod labels:
# service.yaml
selector:
  app: orders
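After applying the corrected manifest (for example with kubectl apply -f service.yaml), the endpoints should populate within a few seconds:
kubectl apply -f service.yaml
# The ENDPOINTS column should now list the Pod IPs:
kubectl get endpoints order-service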
Scenario 2: Pods Not Ready
Problem:
Endpoints exist, but traffic fails.
Diagnosis:
kubectl get pods -l app=orders
Output shows Pods in CrashLoopBackOff or not Ready.
Root Cause:
- Failed Readiness Probes: The application isn’t responding to health checks.
- Resource Limits: Pods are OOMKilled due to memory limits.
Fix:
Check Pod events:
kubectl describe pod <pod-name>
If the readiness probe fails, adjust the probe’s path or timeout in the Deployment:
# deployment.yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
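To confirm the probe path itself is reachable before redeploying, one option is to port-forward to the Pod and query the endpoint locally (this assumes the application serves /health on port 8080, as configured above):
# Forward local port 8080 to the Pod:
kubectl port-forward pod/<pod-name> 8080:8080
# In another terminal, check the health endpoint:
curl -i http://localhost:8080/health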
2. Verify Service Ports: Are Ports Mapped Correctly?
Step 2: Check Service Configuration
A Service’s port (external) must map to the Pod’s targetPort (container port). Misconfigurations here are common.
Command:
kubectl get svc <service-name>
Key Fields:
- Port: The port exposed by the Service.
- TargetPort: The port the Pod is listening on.
- Protocol: TCP/UDP (default: TCP).
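To see both sides of the mapping at a glance, the Service’s port spec and the Pod’s container ports can be pulled with jsonpath (a sketch assuming a single container; adjust the index for multi-container Pods):
# Service-side mapping (port -> targetPort):
kubectl get svc <service-name> -o jsonpath='{range .spec.ports[*]}{.port} -> {.targetPort}{"\n"}{end}'
# Container-side ports exposed by the Pod:
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].ports[*].containerPort}'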
Real-World Scenario: Port Mismatch
Problem:
A redis-service is configured with port: 6379 and targetPort: 6380, but the Redis Pod listens on 6379.
Diagnosis:
kubectl get svc redis-service
Output:
NAME            TYPE        CLUSTER-IP     PORT(S)    AGE
redis-service   ClusterIP   10.96.128.15   6379/TCP   2d
Check the Pod’s container port:
kubectl describe pod redis-pod | grep Port
Output:
Container Port: 6379/TCP
Fix:
Update the Service’s targetPort to 6379:
# service.yaml
ports:
  - port: 6379
    targetPort: 6379
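Once the change is applied, a throwaway client Pod can confirm the Service now reaches Redis (a sketch assuming the public redis image and a redis-service in the same namespace):
# Should print PONG if the Service routes to the Redis Pod:
kubectl run -it --rm redis-test --image=redis:7 --restart=Never -- redis-cli -h redis-service ping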
3. Check Network Policies: Is Traffic Being Blocked?
Step 3: Review Network Policies
Network Policies act as firewalls for Pods. A default “deny-all” policy or overly restrictive rules can block traffic.
Command:
kubectl get networkpolicies -n <namespace>
Real-World Scenario: Default Deny-All Policy
Problem:
A backend-service is unreachable from frontend Pods.
Diagnosis:
A deny-all Network Policy is active in the namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
Fix:
Create a policy allowing traffic from the frontend Pods:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 80
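Note that the deny-all policy above also restricts Egress, so the frontend Pods may additionally need an egress rule toward the backend. To verify the policies end to end, a temporary Pod carrying the frontend label can attempt the request (a sketch; the Service name backend-service is assumed):
# Run a busybox Pod labeled app=frontend and fetch the backend on port 80:
kubectl run -it --rm netpol-test --image=busybox:1.28 --restart=Never \
  --labels="app=frontend" -- wget -qO- -T 2 http://backend-service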
4. Check Pod Logs: Is the Application Healthy?
Step 4: Investigate Pod Logs
Logs reveal application-level errors, such as failed database connections or misconfigurations.
Commands:
# View logs from a running Pod:
kubectl logs <pod-name>
# Debug crashed Pods:
kubectl logs <pod-name> --previous
Real-World Scenario: Database Connection Refused
Problem:
A payment-service Pod crashes repeatedly.
Diagnosis:
kubectl logs payment-pod-xyz --previous
Logs show:
Error: Failed to connect to database-service:3306: Connection refused
Investigation:
Test connectivity from the Pod:
kubectl exec -it payment-pod-xyz -- curl -v telnet://database-service:3306
Output:
Connection refused
Root Cause:
The database-service selector doesn’t match the database Pod labels.
Fix:
Update the database-service selector to match the Pod’s labels.
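Once the selector is corrected, the Service should list the database Pod as an endpoint and the connection test should succeed:
# The endpoint should now show the database Pod IP on port 3306:
kubectl get endpoints database-service
# Repeat the connectivity test from the payment Pod:
kubectl exec -it payment-pod-xyz -- curl -v telnet://database-service:3306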
5. Test DNS Resolution: Can Pods Resolve Service Names?
Step 5: Validate DNS
Kubernetes Services are accessible via DNS names like <service>.<namespace>.svc.cluster.local. If DNS fails, communication breaks.
Test DNS from a Pod:
kubectl run -it --rm --restart=Never dns-test \
--image=busybox:1.28 -- nslookup <service-name>
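If the short name fails, it also helps to check the Pod’s resolver configuration and to try the fully qualified name, which bypasses search-domain expansion (this assumes the container image ships nslookup):
# Confirm the Pod points at the cluster DNS Service (commonly 10.96.0.10):
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
# Try the fully qualified Service name:
kubectl exec -it <pod-name> -- nslookup <service-name>.<namespace>.svc.cluster.local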
Real-World Scenario: CoreDNS Failure
Problem:
Pods in the monitoring namespace can’t resolve prometheus-service.monitoring.svc.cluster.local.
Diagnosis:
kubectl exec -it prometheus-pod -- nslookup prometheus-service
Output:
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'prometheus-service'
Investigation:
Check CoreDNS Pods:
kubectl get pods -n kube-system -l k8s-app=kube-dns
Output shows CoreDNS Pods in CrashLoopBackOff.
Root Cause:
A misconfigured Corefile in the CoreDNS ConfigMap.
Fix:
Update the CoreDNS ConfigMap:
kubectl edit configmap coredns -n kube-system
Ensure the Corefile includes:
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
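After correcting the ConfigMap, restart the CoreDNS Pods so they pick up the change (the Deployment is typically named coredns), then repeat the lookup:
kubectl rollout restart deployment coredns -n kube-system
kubectl rollout status deployment coredns -n kube-system
# Re-test resolution from the affected Pod:
kubectl exec -it prometheus-pod -- nslookup prometheus-service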
6. Check Firewalls and Security Groups: Is External Traffic Allowed?
Step 6: Validate External Access
For Services of type NodePort or LoadBalancer, cloud firewalls or security groups might block traffic.
Real-World Scenario: Blocked LoadBalancer Traffic
Problem:
A web-service of type LoadBalancer is unreachable externally.
Diagnosis:
- The Service has an external IP:
kubectl get svc web-service
Output:
NAME          TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
web-service   LoadBalancer   10.96.128.20   203.0.113.10   80:31000/TCP   1h
- Testing curl http://203.0.113.10 fails.
Root Cause:
The cloud provider’s firewall blocks inbound traffic on port 80.
Fix:
Update the firewall rules (e.g., AWS Security Groups, GCP Firewall Rules) to allow traffic on port 80.
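As an illustration, on AWS a rule like the following opens port 80 on the load balancer’s security group (the group ID is a placeholder; on GCP, a gcloud compute firewall-rules create command plays the same role):
# Allow inbound HTTP from anywhere on the (placeholder) security group:
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 80 --cidr 0.0.0.0/0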
7. Validate Service Configuration: Are Selectors and Ports Correct?
Step 7: Audit Service YAML
Typos in selectors or incorrect Service types (e.g., ClusterIP instead of NodePort) are common culprits.
Command:
kubectl get svc <service-name> -o yaml
Real-World Scenario: Incorrect Service Type
Problem:
A user-service is configured as type: ClusterIP, but the team expects external access.
Diagnosis:
kubectl get svc user-service
Output:
NAME           TYPE        CLUSTER-IP     PORT(S)   AGE
user-service   ClusterIP   10.96.128.30   80/TCP    3d
Fix:
Update the Service type to LoadBalancer:
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: user
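After applying the change, the cloud provider provisions a load balancer; watching the Service shows when EXTERNAL-IP moves from <pending> to a real address:
kubectl apply -f service.yaml
# Wait for EXTERNAL-IP to change from <pending> to an address:
kubectl get svc user-service -w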
8. Advanced: Check kube-proxy and CNI Plugins
Step 8: Investigate kube-proxy and CNI Health
If all previous checks pass but services still don’t work, kube-proxy or the Container Network Interface (CNI) plugin might be misconfigured.
Commands:
# Check kube-proxy logs:
kubectl logs -n kube-system -l k8s-app=kube-proxy
# Verify CNI plugin health (e.g., Calico):
kubectl get pods -n calico-system
Real-World Scenario: kube-proxy Misconfiguration
Problem:
After a cluster upgrade, kube-proxy Pods crash.
Diagnosis:
kubectl get pods -n kube-system -l k8s-app=kube-proxy
Output shows Pods in CrashLoopBackOff.
Root Cause:
Misconfigured iptables rules after the upgrade.
Fix:
Reconfigure kube-proxy or roll back the upgrade. Check the configuration file for errors:
kubectl edit configmap kube-proxy -n kube-system
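If node access is available, it is also possible to confirm whether kube-proxy has actually programmed rules for a Service (a sketch; the exact commands depend on whether kube-proxy runs in iptables or IPVS mode and require root on the node):
# iptables mode: look for chains referencing the Service:
sudo iptables-save | grep web-service
# IPVS mode: list virtual servers and their backends:
sudo ipvsadm -Ln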
Summary: Debugging Checklist
- Check Endpoints: Ensure Pods are matched and ready.
- Verify Ports: Confirm port and targetPort align.
- Check Network Policies: Review allow/deny rules.
- Investigate Logs: Diagnose Pod crashes/errors.
- Test DNS: Validate resolution from a Pod.
- Check Firewalls: Ensure cloud/provider rules allow traffic.
- Audit Service Config: Confirm selectors and type.
- Investigate kube-proxy/CNI: Check logs and health.
By following this structured approach, you can systematically troubleshoot service connectivity issues in Kubernetes. Each step builds on the previous one, ensuring a comprehensive understanding of potential pitfalls and their resolutions. Remember, practice makes perfect—frequent troubleshooting will enhance your skills and confidence in managing Kubernetes environments.