Tuesday, 4 March 2025

Debugging Service Connectivity in Kubernetes: The Ultimate Guide with Real-World Scenarios

Service connectivity issues in Kubernetes can be daunting, especially in production environments. This guide provides a step-by-step framework to diagnose and resolve these issues, enriched with real-world scenarios, detailed explanations, and actionable fixes. Whether you’re a developer or an SRE, this guide will help you troubleshoot like a pro.

1. Check Endpoints: Are Pods Properly Associated with the Service?

Step 1: Verify Endpoints Exist

In Kubernetes, a Service routes traffic to Pods based on label selectors. If no endpoints are linked to the Service, traffic cannot flow.

Command:

kubectl get endpoints <service-name>

What to Look For:

  • No Endpoints? The Service’s selector does not match any Pod labels, or Pods are not in a Ready state.
  • Endpoints Exist but Unreachable? Pods might be running but failing readiness probes.
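
A quick way to confirm a selector mismatch is to compare the Service’s selector with the labels actually carried by the Pods (the service name below is a placeholder):

# Show the selector the Service uses:
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'

# Show the labels on the candidate Pods:
kubectl get pods --show-labels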

Real-World Scenarios & Fixes

Scenario 1: Misconfigured Selector Labels

Problem:
A Service named order-service uses the selector app: order-v2, but the Pods are labeled app: orders.
Diagnosis:

kubectl get endpoints order-service  # ENDPOINTS column shows "<none>"

Fix:
Update the Service’s selector to match the Pod labels:

# service.yaml
selector:
  app: orders
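
If you would rather patch the live object than re-apply the manifest, a one-liner like the following should also work (assuming app is the only selector key):

kubectl patch svc order-service -p '{"spec":{"selector":{"app":"orders"}}}'

Re-run kubectl get endpoints order-service afterwards; the Pod IPs should now be listed.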

Scenario 2: Pods Not Ready

Problem:
Endpoints exist, but traffic fails.
Diagnosis:

kubectl get pods -l app=orders

Output shows Pods in CrashLoopBackOff or not Ready.
Root Cause:

  • Failed Readiness Probes: The application isn’t responding to health checks.
  • Resource Limits: Pods are OOMKilled due to memory limits.

Fix:
Check Pod events:

kubectl describe pod <pod-name>

If the readiness probe fails, adjust the probe’s path or timeout in the Deployment:

# deployment.yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
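
After rolling out the probe change, confirm that the Pods turn Ready and the endpoints repopulate (the Deployment name is a placeholder):

kubectl rollout status deployment/<deployment-name>
kubectl get endpoints order-service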

2. Verify Service Ports: Are Ports Mapped Correctly?

Step 2: Check Service Configuration

A Service’s port (what clients connect to) must map, via targetPort, to the port the container is actually listening on. Misconfigurations here are common.
Command:

kubectl get svc <service-name>

Key Fields:

  • Port: The port exposed by the Service.
  • TargetPort: The port the Pod is listening on.
  • Protocol: TCP/UDP (default: TCP).
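
As an illustrative mapping, note that targetPort may also reference a named container port, which avoids hard-coding the number in two places:

# service.yaml (illustrative)
ports:
- name: http
  port: 80          # port clients connect to on the Service
  targetPort: http  # resolves to the containerPort named "http" in the Pod spec
  protocol: TCP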

Real-World Scenario: Port Mismatch

Problem:
A redis-service is configured with port: 6379 and targetPort: 6380, but the Redis Pod listens on 6379.
Diagnosis:

kubectl get svc redis-service

Output:

NAME            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
redis-service   ClusterIP   10.96.128.15   <none>        6379/TCP   2d
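
The PORT(S) column only shows the Service port, not the targetPort, so read the targetPort from the spec as well:

kubectl get svc redis-service -o jsonpath='{.spec.ports[0].targetPort}'  # Returns 6380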

Check the Pod’s container port:

kubectl describe pod redis-pod | grep Port

Output:

Port:           6379/TCP

Fix:
Update the Service’s targetPort to 6379:

# service.yaml
ports:
- port: 6379
  targetPort: 6379
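
Once the change is applied, the endpoints should list the Pod IP on 6379 (the IP below is illustrative):

kubectl get endpoints redis-service
# NAME            ENDPOINTS          AGE
# redis-service   10.244.1.23:6379   2d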

3. Check Network Policies: Is Traffic Being Blocked?

Step 3: Review Network Policies

Network Policies act as firewalls for Pods. A default “deny-all” policy or overly restrictive rules can block traffic.
Command:

kubectl get networkpolicies -n <namespace>

Real-World Scenario: Default Deny-All Policy

Problem:
A backend-service is unreachable from frontend Pods.
Diagnosis:
A deny-all Network Policy is active in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Fix:
Create a policy allowing traffic from the frontend Pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 80
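
With the policy applied, a quick check from a frontend Pod should now succeed (the Pod name and the presence of wget in the image are assumptions):

kubectl exec -it <frontend-pod> -- wget -qO- -T 2 http://backend-service:80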

4. Check Pod Logs: Is the Application Healthy?

Step 4: Investigate Pod Logs

Logs reveal application-level errors, such as failed database connections or misconfigurations.
Commands:

# View logs from a running Pod:
kubectl logs <pod-name>

# Debug crashed Pods:
kubectl logs <pod-name> --previous
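
For multi-container Pods, or to stream logs live, these variants are also useful:

# Logs from a specific container in the Pod:
kubectl logs <pod-name> -c <container-name>

# Follow logs in real time:
kubectl logs -f <pod-name>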

Real-World Scenario: Database Connection Refused

Problem:
A payment-service Pod crashes repeatedly.
Diagnosis:

kubectl logs payment-pod-xyz --previous

Logs show:

Error: Failed to connect to database-service:3306: Connection refused

Investigation:
Test connectivity from the Pod:

kubectl exec -it payment-pod-xyz -- curl -v telnet://database-service:3306

Output:

Connection refused

Root Cause:
The database-service selector doesn’t match the database Pod labels.
Fix:
Update the database-service selector to match the Pod’s labels.
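
For example, if the database Pods carry the label app: mysql (an assumed label for illustration), the Service spec would need:

# database-service.yaml (illustrative)
selector:
  app: mysql
ports:
- port: 3306
  targetPort: 3306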

5. Test DNS Resolution: Can Pods Resolve Service Names?

Step 5: Validate DNS

Kubernetes Services are accessible via DNS names like <service>.<namespace>.svc.cluster.local. If DNS fails, communication breaks.

Test DNS from a Pod:

kubectl run -it --rm --restart=Never dns-test \
  --image=busybox:1.28 -- nslookup <service-name>
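
If the short name fails but the fully qualified name resolves, the Pod’s DNS search path or the namespace is usually the culprit:

kubectl run -it --rm --restart=Never dns-test \
  --image=busybox:1.28 -- nslookup <service-name>.<namespace>.svc.cluster.local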

Real-World Scenario: CoreDNS Failure

Problem:
Pods in the monitoring namespace can’t resolve prometheus-service.monitoring.svc.cluster.local.
Diagnosis:

kubectl exec -it prometheus-pod -- nslookup prometheus-service

Output:

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'prometheus-service'

Investigation:
Check CoreDNS Pods:

kubectl get pods -n kube-system -l k8s-app=kube-dns

Output shows CoreDNS Pods in CrashLoopBackOff.
Root Cause:
A misconfigured Corefile in the CoreDNS ConfigMap.
Fix:
Update the CoreDNS ConfigMap:

kubectl edit configmap coredns -n kube-system

Ensure the Corefile includes:

.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
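
After saving the ConfigMap, restart CoreDNS so the corrected Corefile is picked up (on most clusters the Deployment is named coredns), then repeat the lookup:

kubectl rollout restart deployment coredns -n kube-system
kubectl exec -it prometheus-pod -- nslookup prometheus-service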

6. Check Firewalls and Security Groups: Is External Traffic Allowed?

Step 6: Validate External Access

For Services of type NodePort or LoadBalancer, cloud firewalls or security groups might block traffic.

Real-World Scenario: Blocked LoadBalancer Traffic

Problem:
A web-service of type LoadBalancer is unreachable externally.
Diagnosis:

  • The Service has an external IP:
    kubectl get svc web-service
    
    Output:
    NAME          TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)        AGE
    web-service   LoadBalancer   10.96.128.20   203.0.113.10     80:31000/TCP   1h
    
  • Testing curl http://203.0.113.10 fails.

Root Cause:
The cloud provider’s firewall blocks inbound traffic on port 80.
Fix:
Update the firewall rules (e.g., AWS Security Groups, GCP Firewall Rules) to allow traffic on port 80.
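
As an example on AWS (the security group ID is a placeholder), the rule could be opened with:

aws ec2 authorize-security-group-ingress \
  --group-id <security-group-id> \
  --protocol tcp --port 80 --cidr 0.0.0.0/0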

7. Validate Service Configuration: Are Selectors and Ports Correct?

Step 7: Audit Service YAML

Typos in selectors or incorrect Service types (e.g., ClusterIP when external access via NodePort or LoadBalancer is expected) are common culprits.
Command:

kubectl get svc <service-name> -o yaml

Real-World Scenario: Incorrect Service Type

Problem:
A user-service is configured as type: ClusterIP, but the team expects external access.
Diagnosis:

kubectl get svc user-service

Output:

NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
user-service   ClusterIP   10.96.128.30    <none>        80/TCP     3d

Fix:
Update the Service type to LoadBalancer:

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: user
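
After applying the manifest, watch for the cloud provider to assign an external IP:

kubectl get svc user-service -w

If the cluster has no cloud load-balancer integration, type: NodePort is the usual alternative for external access.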

8. Advanced: Check kube-proxy and CNI Plugins

Step 8: Investigate kube-proxy and CNI Health

If all previous checks pass but services still don’t work, kube-proxy or the Container Network Interface (CNI) plugin might be misconfigured.

Commands:

# Check kube-proxy logs:
kubectl logs -n kube-system -l k8s-app=kube-proxy

# Verify CNI plugin health (e.g., Calico):
kubectl get pods -n calico-system

Real-World Scenario: kube-proxy Misconfiguration

Problem:
After a cluster upgrade, kube-proxy Pods crash.
Diagnosis:

kubectl get pods -n kube-system -l k8s-app=kube-proxy

Output shows Pods in CrashLoopBackOff.
Root Cause:
Misconfigured iptables rules after the upgrade.
Fix:
Reconfigure kube-proxy or roll back the upgrade. Check the configuration file for errors:

kubectl edit configmap kube-proxy -n kube-system
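
kube-proxy typically reads its configuration only at startup, so restart the DaemonSet after editing (on kubeadm-based clusters it is named kube-proxy in kube-system) and confirm the Pods recover:

kubectl rollout restart daemonset kube-proxy -n kube-system
kubectl get pods -n kube-system -l k8s-app=kube-proxy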

Summary: Debugging Checklist

  1. Check Endpoints: Ensure Pods are matched and ready.
  2. Verify Ports: Confirm port and targetPort align.
  3. Check Network Policies: Review allow/deny rules.
  4. Investigate Logs: Diagnose Pod crashes/errors.
  5. Test DNS: Validate resolution from a Pod.
  6. Check Firewalls: Ensure cloud/provider rules allow traffic.
  7. Audit Service Config: Confirm selectors and type.
  8. Investigate kube-proxy/CNI: Check logs and health.

By following this structured approach, you can systematically troubleshoot service connectivity issues in Kubernetes. Each step builds on the previous one, ensuring a comprehensive understanding of potential pitfalls and their resolutions. Remember, practice makes perfect—frequent troubleshooting will enhance your skills and confidence in managing Kubernetes environments.
