Wednesday, 26 February 2025

21 DevOps Interview Scenarios: Build a Secure, Scalable Cloud Project with AWS, Terraform, Docker & Kubernetes


In today’s cloud-driven world, building a secure, scalable, and fault-tolerant infrastructure is critical for businesses of all sizes. This guide walks through a real-world project that integrates AWS services, Terraform, Docker, Kubernetes, and CI/CD pipelines, while addressing key security, disaster recovery, and operational best practices. Each section answers a critical question, with detailed examples and corrections based on industry standards.

1. AWS Services & Security Best Practices

Key Services Used

  1. Amazon EC2:
    • Security Groups: Act as virtual firewalls to control inbound/outbound traffic.
    • Key Management: Use AWS KMS keys to encrypt EBS volumes and their snapshots.
  2. Amazon S3:
    • Bucket Policies: Restrict access to specific IAM principals and source IP ranges, and encrypt objects at rest (SSE-S3 or SSE-KMS).
    • Example policy with IP restriction:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::123456789012:user/ExampleUser" },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": { "IpAddress": { "aws:SourceIp": "192.0.2.0/24" } }
          }
        ]
      }
      
  3. AWS IAM:
    • Enforce least privilege access and enable MFA for root and privileged users.
  4. AWS Shield & WAF:
    • Shield mitigates DDoS attacks; WAF filters malicious HTTP(S) requests such as SQL injection and cross-site scripting.
  5. AWS Config:
    • Audit resource configurations for compliance.
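
The IP condition in the bucket policy above allows a request only when the caller's address falls inside the listed CIDR range. As a rough illustration of that membership test (not how AWS evaluates policies internally), the check can be sketched with Python's standard ipaddress module. The function name is an assumption made for this example.

```python
import ipaddress


def ip_matches_condition(source_ip: str, allowed_cidr: str) -> bool:
    """Return True if source_ip falls inside allowed_cidr."""
    network = ipaddress.ip_network(allowed_cidr)
    address = ipaddress.ip_address(source_ip)
    # Membership is simply False when one side is IPv4 and the other IPv6.
    return address in network


if __name__ == "__main__":
    # 192.0.2.0/24 is the range used in the example policy.
    print(ip_matches_condition("192.0.2.17", "192.0.2.0/24"))
    print(ip_matches_condition("198.51.100.5", "192.0.2.0/24"))
```

A quick check like this is handy when reviewing a policy: it confirms that an office or VPN address really sits inside the range you intend to allow.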

2. Regaining Access to a Locked EC2 Instance

Steps to Recover Access

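  1. Stop the Instance:
    • Warning: Stopping the instance causes downtime, erases any instance store data, and releases the public IPv4 address unless an Elastic IP is attached.
  2. Detach the Root Volume: Attach it to a helper EC2 instance in the same Availability Zone as a secondary volume, then mount it.
  3. Modify authorized_keys: Add a new public key to the login user's .ssh/authorized_keys file on the mounted volume (not the helper instance's own ~/.ssh directory).
  4. Reattach the Volume: Unmount and detach the volume from the helper, reattach it to the original instance under its original root device name, then start the instance and connect using the new key.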
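
Step 3 is easy to get subtly wrong (duplicate entries, a missing trailing newline, loose file permissions). The following is a minimal sketch of that step in Python, assuming the recovered volume is already mounted. The helper name and any path passed to it are hypothetical.

```python
from pathlib import Path


def add_public_key(authorized_keys: Path, public_key: str) -> bool:
    """Append public_key to authorized_keys unless it is already present.

    Returns True if the file was changed.
    """
    key = public_key.strip()
    authorized_keys.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    existing = authorized_keys.read_text() if authorized_keys.exists() else ""
    if key in (line.strip() for line in existing.splitlines()):
        return False
    # Terminate the previous last line before appending the new entry.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    authorized_keys.write_text(existing + key + "\n")
    # sshd rejects key files that are writable by group or others.
    authorized_keys.chmod(0o600)
    return True
```

On a helper instance this might be called with a path such as /mnt/recovery/home/ec2-user/.ssh/authorized_keys (an assumed mount point and user). If the file is newly created, its owner must also be set to the login user's UID and GID, which this sketch does not do.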

Alternative Methods:

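  • Use EC2 Instance Connect (if the AMI supports it and the security group allows inbound SSH).
  • Leverage AWS Systems Manager Session Manager (requires the SSM Agent and an instance profile that permits Systems Manager access).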

3. Restricting Access to an EC2 Server

If a user’s private key is compromised:

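  1. Remove the Compromised Key: Delete the matching public key from every authorized_keys file that contains it. This is the step that actually ends SSH access.
  2. Rotate SSH Keys: Generate new key pairs for the affected users and add the new public keys to authorized_keys.
  3. Revoke IAM Permissions: IAM does not govern plain SSH key logins, so this closes only the AWS-side paths (EC2 Instance Connect, Session Manager, and API access).
  4. Tighten Security Groups: Security groups support allow rules only, so limit inbound SSH (port 22) to trusted CIDR ranges. To explicitly deny an address, use a network ACL.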
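
Updating authorized_keys after a compromise means deleting the entry whose key material matches the leaked key while leaving every other entry untouched. A minimal sketch of that filter, assuming you know the base64 blob of the compromised public key (the function name is hypothetical):

```python
def remove_public_key(authorized_keys_text: str, compromised_blob: str) -> str:
    """Return authorized_keys content without entries that carry the key blob.

    compromised_blob is the base64 field of the public key, i.e. the part
    after the key type in a line such as "ssh-ed25519 AAAA... comment".
    """
    kept = []
    for line in authorized_keys_text.splitlines():
        fields = line.split()
        # Keep comments, blank lines, and every key that does not match.
        if line.lstrip().startswith("#") or compromised_blob not in fields:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Matching on the key blob rather than the trailing comment matters because the comment is free text and may differ between copies of the same key.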

4. Terraform Infrastructure as Code (IaC)

VPC, Subnets, and Route Tables

provider "aws" {
  region = "us-east-1"
}

# VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Internet Gateway: gives the public subnet a path to the internet
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Subnets
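resource "aws_subnet" "public_subnet" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"

  # Without this, instances launched here get no public IPv4 address by default
  map_public_ip_on_launch = true
}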

resource "aws_subnet" "private_subnet" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1b"
}

# Route Table for Public Subnet
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# Associate Route Table with Public Subnet
resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public_subnet.id
  route_table_id = aws_route_table.public.id
}

Why VPC, Subnets, and Route Tables?

  • VPC: Isolates resources in a private network.
  • Subnets: Segment networks for security (public/private tiers).
  • Route Tables: Direct traffic between subnets and to the internet.
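
Each subnet CIDR must sit inside the VPC CIDR, and subnets must not overlap one another. Before running terraform apply, a layout like the one above can be sanity-checked with a short script. This is a sketch using Python's standard ipaddress module; the function name is an assumption for this example.

```python
import ipaddress
from itertools import combinations
from typing import List


def validate_subnets(vpc_cidr: str, subnet_cidrs: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must be contained in the VPC range.
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
    # No two subnets may share addresses.
    for first, second in combinations(subnets, 2):
        if first.overlaps(second):
            problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    # Values taken from the Terraform example above.
    print(validate_subnets("10.0.0.0/16", ["10.0.1.0/24", "10.0.2.0/24"]))
```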

5. Managing Terraform State

Recovering a Corrupted .tfstate File

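  1. Restore from S3 Versioning: If versioning was enabled on the state bucket before the corruption, restore the last good object version.
  2. Use Terraform Cloud: It backs up and versions state files automatically, so you can roll back to an earlier state version.
  3. Manual Recovery: As a last resort, rebuild the state resource by resource using terraform import.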

Terraform Drift

Drift occurs when the actual infrastructure diverges from what Terraform has recorded in its state, typically after manual changes made outside Terraform. Detect it with:

terraform plan -refresh-only
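
Conceptually, drift detection is a comparison between the attributes Terraform last recorded and the attributes the provider reports now. The toy sketch below illustrates that idea on plain dictionaries. It is not Terraform's implementation, and the function name, the marker string, and the sample attributes are all assumptions.

```python
ABSENT = "<absent>"  # marker for an attribute that exists on only one side


def find_drift(recorded: dict, actual: dict) -> dict:
    """Map each drifted attribute to its (recorded, actual) pair of values."""
    drift = {}
    for key in sorted(recorded.keys() | actual.keys()):
        before = recorded.get(key, ABSENT)
        after = actual.get(key, ABSENT)
        if before != after:
            drift[key] = (before, after)
    return drift


if __name__ == "__main__":
    recorded = {"instance_type": "t3.micro", "tags": {"env": "prod"}}
    actual = {"instance_type": "t3.large", "tags": {"env": "prod"}, "monitoring": True}
    for attribute, (before, after) in find_drift(recorded, actual).items():
        print(f"{attribute}: {before} -> {after}")
```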

State Locking

Prevent concurrent edits using S3 + DynamoDB. The DynamoDB table must have a string partition key named LockID:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}
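
The lock works like a conditional write: a run may create the lock entry only if none exists, and other runs are refused until it is released. The following in-memory model is only a sketch of that behaviour. The class and method names are hypothetical, and it does not talk to DynamoDB.

```python
class LockTable:
    """Toy in-memory stand-in for a lock table keyed by state path."""

    def __init__(self) -> None:
        self._locks = {}

    def acquire(self, state_key: str, owner: str) -> bool:
        # Conditional write: succeed only if no lock entry exists yet.
        if state_key in self._locks:
            return False
        self._locks[state_key] = owner
        return True

    def release(self, state_key: str, owner: str) -> bool:
        # Only the current holder may release the lock.
        if self._locks.get(state_key) != owner:
            return False
        del self._locks[state_key]
        return True


if __name__ == "__main__":
    table = LockTable()
    print(table.acquire("prod/terraform.tfstate", "alice"))  # first run gets the lock
    print(table.acquire("prod/terraform.tfstate", "bob"))    # second run is refused
```

If a run crashes while holding a real lock, Terraform provides terraform force-unlock to clear it manually.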

6. Docker & Kubernetes Deployment

Dockerfile Best Practices

Use multi-stage builds to minimize image size:

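# Build Stage
FROM node:20 AS builder
WORKDIR /app
# Copy manifests first so the dependency layer is cached between builds
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Production Stage
FROM node:20-alpine
ENV NODE_ENV=production
WORKDIR /app
COPY --from=builder /app/package*.json ./
# Install runtime dependencies only (--omit=dev replaces the deprecated --only=production)
RUN npm install --omit=dev
COPY --from=builder /app/dist ./dist
# Drop root privileges; the official node images ship a "node" user
USER node
CMD ["node", "dist/app.js"]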

Staging Builds

Deploy to a staging environment mirroring production to test integrations, performance, and security before release.

Kubernetes Manifest for Database with PVC

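# The StatefulSet's volumeClaimTemplates create the PVC (db-storage-postgres-0),
# so a separate standalone PersistentVolumeClaim is not needed.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: "postgres"  # expects a headless Service named "postgres"
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:13
        ports:
        - containerPort: 5432
        env:
        # The postgres image will not initialize without a superuser password.
        # "postgres-credentials" is an assumed Secret name; create it separately.
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: password
        # Use a subdirectory so initdb is not blocked by the volume's lost+found
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: db-storage
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: db-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gp2  # Aligns with AWS EBS
      resources:
        requests:
          storage: 10Gi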

Kubernetes Architecture

  • Control Plane: Manages the cluster (API Server, Scheduler, Controller Manager, etcd).
  • Worker Nodes: Run application pods (kubelet, kube-proxy, container runtime).
  • Pods: Smallest deployable units containing one or more containers.

7. Disaster Recovery & High Availability

Multi-AZ Deployment

  • Deploy critical workloads across multiple Availability Zones (AZs).
  • Use Amazon RDS Multi-AZ for databases: RDS keeps a synchronous standby in another AZ and fails over to it automatically.

Backup & Restore

  • AWS Backup: Automate cross-region backups for EBS, RDS, and DynamoDB.
  • Route 53 Health Checks: Route traffic to healthy regions during outages.
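
Health-check-based failover boils down to "send traffic to the highest-priority endpoint that is currently healthy". The sketch below shows only that selection rule. The function name and endpoint names are hypothetical, and Route 53's behaviour when every endpoint is unhealthy is not modelled.

```python
from typing import List, Optional, Tuple


def pick_endpoint(endpoints: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None.

    Each entry is (endpoint_name, is_healthy); earlier entries have priority.
    """
    for name, healthy in endpoints:
        if healthy:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical regional endpoints; the primary region is failing its health check.
    regions = [("app.us-east-1.example.com", False), ("app.us-west-2.example.com", True)]
    print(pick_endpoint(regions))
```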

8. CI/CD Pipeline & Production Deployment

Zero-Downtime Deployment Strategies

  1. Blue/Green:
    • Deploy new version alongside old, then switch traffic.
  2. Canary Releases:
    • Gradually roll out updates to a subset of users.
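
A canary rollout needs a stable way to decide which users see the new version, so that the same user does not flip between versions on every request. One common approach, sketched here under the assumption that requests carry a user ID, is to hash the ID into a bucket from 0 to 99 and compare it with the rollout percentage. The function name is hypothetical.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically decide whether a user belongs to the canary group."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Stable bucket in the range 0..99 derived from the hash.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


if __name__ == "__main__":
    share = sum(in_canary(f"user-{i}", 10) for i in range(10_000))
    print(f"{share} of 10000 users are routed to the canary")
```

Because the bucket depends only on the user ID, raising the percentage from 10 to 50 keeps every existing canary user in the canary group and only adds new ones.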

Jenkins Pipeline Example

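pipeline {
  agent any
  stages {
    stage('Build') {
      steps {
        // Single quotes: the shell, not Groovy, expands ${BUILD_ID}
        sh 'docker build -t my-app:${BUILD_ID} .'
      }
    }
    stage('Test') {
      steps {
        // --rm removes the test container when it exits
        sh 'docker run --rm my-app:${BUILD_ID} npm test'
      }
    }
    stage('Deploy') {
      steps {
        // The cluster must be able to pull this tag, so push the image to a registry first
        sh 'kubectl set image deployment/my-app my-app=my-app:${BUILD_ID}'
        // Fail the build if the rollout does not complete
        sh 'kubectl rollout status deployment/my-app'
      }
    }
  }
}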

9. Logging, Monitoring & Security Scanning

AWS CloudWatch & ELK Stack

  • CloudWatch Alarms: Monitor CPU and custom metrics. EC2 does not report memory or disk usage by default; publish those with the CloudWatch agent.
  • ELK Stack: Centralize logs with Elasticsearch, Logstash, and Kibana.
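
An alarm typically fires only when enough recent datapoints breach the threshold, which avoids paging on a single spike. The sketch below is a simplified model of that "M out of N" rule. It ignores missing-data handling, and the function and parameter names are assumptions loosely modelled on CloudWatch terminology.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for the latest window."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent evaluation_periods datapoints are considered.
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    cpu_percent = [40, 85, 90, 95]  # hypothetical CPU utilisation samples
    print(alarm_state(cpu_percent, threshold=80, evaluation_periods=3, datapoints_to_alarm=2))
```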

SonarQube Configuration

  1. Install SonarQube: Use Docker for quick setup.
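  2. Add sonar-project.properties (keep the token out of this file):
    sonar.projectKey=my-node-app
    sonar.projectName=My Node App
    sonar.sources=src
    sonar.exclusions=**/node_modules/**
    
  3. Integrate with CI/CD, passing the server URL and token as environment variables:
    docker run --rm -e SONAR_HOST_URL="${SONAR_HOST_URL}" -e SONAR_TOKEN="${SONAR_TOKEN}" -v "$(pwd):/usr/src" sonarsource/sonar-scanner-cli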
    

10. Service Mesh & Advanced Networking

What is a Service Mesh?

A service mesh (e.g., Istio or Linkerd) manages service-to-service communication, providing:

  • Traffic management (retries, timeouts).
  • Security (mTLS encryption).
  • Observability (metrics, tracing).
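
A mesh applies policies such as retries in a sidecar proxy, so application code does not have to. To make the retry behaviour concrete, here is a minimal sketch of the same logic written as application code: a bounded number of attempts with exponential backoff. The function name, defaults, and the choice of ConnectionError as the retryable failure are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[], T], attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Invoke call, retrying connection failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep parameter exists so the backoff can be observed in tests without actually waiting. Capping the number of attempts matters in a mesh as well, since unbounded retries can amplify load on a service that is already struggling.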

11. Final Architecture Overview

  1. Network Layer: VPC with public/private subnets, NAT gateways, and security groups.
  2. Compute Layer: ECS/Kubernetes clusters for container orchestration.
  3. Storage Layer: S3 for static assets, RDS/Aurora for databases.
  4. Security Layer: IAM roles, WAF, and encrypted secrets.
  5. CI/CD: Jenkins/GitHub Actions for automated testing and deployment.

Building a production-grade cloud infrastructure requires:

  • Security First: IAM policies, encryption, and regular audits.
  • Resilience: Multi-AZ deployments, backups, and disaster recovery plans.
  • Automation: Terraform for IaC, CI/CD for seamless deployments.
  • Observability: Logging, monitoring, and performance tracing.

By following this guide, teams can deploy scalable, secure, and maintainable systems that withstand real-world challenges while minimizing downtime and risk.
