Homelab Infrastructure - Proxmox Cluster & Services

A complete homelab infrastructure project featuring a 4-node Proxmox VE cluster with Ceph distributed storage, containerized microservices, automated deployment pipelines, and comprehensive monitoring. Demonstrates enterprise-grade infrastructure design for home use.

Overview

The Homelab Infrastructure project is a comprehensive, production-grade virtualized computing environment built on Proxmox VE. It is the most complex technical undertaking in this portfolio, integrating hardware, virtualization, distributed storage, containerization, networking, and automation into a cohesive system.

This project serves as both a personal learning platform and a fully functional self-hosted infrastructure running dozens of services for personal and community use.

Architecture Overview

Hardware Infrastructure

Compute Nodes

4-Node Proxmox VE Cluster:

  1. PVE-5600G-1

    • CPU: AMD Ryzen 5 5600G (6 cores / 12 threads)
    • RAM: 32 GB DDR4
    • Storage: 4TB (2TB NVMe + 2TB SATA SSD)
    • Network: 2.5Gbps Ethernet
  2. PVE-5600G-2

    • CPU: AMD Ryzen 5 5600G
    • RAM: 32 GB DDR4
    • Storage: 4TB (2TB NVMe + 2TB SATA SSD)
    • Network: 2.5Gbps Ethernet
  3. PVE-5600G-3

    • CPU: AMD Ryzen 5 5600G
    • RAM: 16 GB DDR4 (upgradable to 32GB)
    • Storage: 4TB (2TB NVMe + 2TB SATA SSD)
    • Network: 2.5Gbps Ethernet
  4. PVE-Chinese (Legacy, being replaced)

    • Older hardware
    • Scheduled for replacement 2025
    • Temporary storage server role

Planned Hardware Upgrades (2025)

Data Boxes (2 units) - replacing the legacy PVE-Chinese node

  • Case: 2U rackmount
  • RAM: 128GB per machine
  • Network: 2x 10Gbps SFP+ cards
  • Storage: 9x SATA bays each

PVE-N305-1 - NAS appliance

  • Case: Fractal Design Node 304
  • CPU: Intel N305
  • RAM: 32GB DDR5
  • Network: 4x 2.5Gbps + PCIe slot
  • Storage: 4x 20TB HDD + 2x 4TB SSD

Networking

Network Topology:

  • Central managed switch (8 SFP+, VLAN support)
  • Horatio unmanaged switches (2.5Gbps and 10Gbps)
  • Cat 7 cabling (50m, future-proofed)
  • VLAN segmentation for service isolation

Storage Architecture

Ceph Distributed Storage

Ceph Cluster (3 nodes)
├── Monitor (MON) - Cluster health
├── Object Storage Daemon (OSD) - Data storage
│   ├── Node 1: 2x 2TB NVMe + 2x 2TB SATA
│   ├── Node 2: 2x 2TB NVMe + 2x 2TB SATA
│   └── Node 3: 2x 2TB NVMe + 2x 2TB SATA
└── Data Replication (3x replication factor)

Ceph Benefits:

  • Distributed, self-healing storage
  • Automatic data replication
  • Scalable capacity
  • Fault tolerance (with 3× replication, data survives the loss of two replicas; client I/O continues through a single node failure)
  • RBD (block storage) for VM disks
  • Object storage for media files

Ceph Operations:

  • Monitor cluster health
  • OSD management and recovery
  • Placement group balancing
  • Data replication tuning
  • Performance optimization
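The operations above map directly onto the Ceph CLI; a sketch of a routine health pass (the pool name `vm-pool` is illustrative):

```shell
# Overall cluster state: HEALTH_OK / HEALTH_WARN / HEALTH_ERR plus a summary
ceph status
ceph health detail

# Per-OSD view: which daemons are up/in, and how capacity is spread
ceph osd tree
ceph df

# Placement group state; anything not active+clean indicates recovery or balancing work
ceph pg stat

# Confirm the replication factor on a pool (pool name is a placeholder)
ceph osd pool get vm-pool size
```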

Remote Storage

Wasabi S3 Bucket

  • Hot media storage
  • S3FS mounted for direct filesystem access
  • Cost-effective cloud backup
  • Replicated media library
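Mounting the bucket with S3FS looks roughly like this (bucket name, mount point, and region endpoint are placeholders):

```shell
# Credentials file in ACCESS_KEY:SECRET_KEY format, readable only by the owner
echo 'ACCESS_KEY:SECRET_KEY' > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount the bucket; eu-central-1 endpoint shown as an example
s3fs media-bucket /mnt/media \
  -o url=https://s3.eu-central-1.wasabisys.com \
  -o passwd_file=~/.passwd-s3fs
```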

Hetzner Storage Box

  • Cold incremental backups
  • Geographic redundancy
  • Off-site disaster recovery
  • Affordable cold storage

Virtualization Layer

Virtual Machine Services

Media Stack (media.box)

  • IP: 192.168.178.103
  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 6TB (Ceph)
  • Services: Jellyfin, Calibre, NFS, Samba, MPD
  • Purpose: Media center and file sharing

Organization Stack (org.box)

  • IP: 192.168.178.101
  • CPU: 2 cores
  • RAM: 4GB
  • Storage: 256GB (Ceph)
  • Services: Bitwarden, WikiJS, Wallabag, RSS, Traggo, Uguu
  • Purpose: Productivity and knowledge management

Development Stack (dev.box)

  • IP: 192.168.178.100
  • CPU: 2 cores
  • RAM: 4GB
  • Storage: 256GB (Ceph)
  • Services: Gitea, Jenkins, GitLab CI/CD, n8n
  • Purpose: CI/CD and automation platform

Control Stack (control.box)

  • IP: 192.168.178.104
  • CPU: 2 cores
  • RAM: 4GB
  • Storage: 256GB (Ceph)
  • Services: Grafana, Portainer, Cockpit, Uptime Kuma, Heimdall
  • Purpose: Monitoring and system management

Container Platform

Docker Compose

  • Service orchestration on each VM
  • Health checks and restart policies
  • Networking and volume management
  • Environment-based configuration
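A minimal compose service illustrating the health-check, restart-policy, and env-file patterns above (service name and volume paths are illustrative; the health check assumes curl is available in the image):

```yaml
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    restart: unless-stopped
    env_file: .env
    volumes:
      - /srv/media:/media:ro
    healthcheck:
      # Jellyfin exposes a /health endpoint on its default port 8096
      test: ["CMD", "curl", "-f", "http://localhost:8096/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```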

Kubernetes Cluster (planned expansion)

  • Current: 2 nodes
  • Planned: 5+ node cluster
  • Advanced orchestration and scaling

Service Architecture

Internet
    ↓
[Firewall/Router - OPNsense planned]
    ↓
[Reverse Proxy - Traefik]
    ↓
[Load Balancer / Router]
    ↓
┌───────┬───────┬───────┬─────────┐
│ Media │  Org  │  Dev  │ Control │
└───────┴───────┴───────┴─────────┘
    ↓
[Ceph Storage Backend]

Infrastructure as Code

Terraform

Purpose: Declarative VM provisioning

resource "proxmox_vm_qemu" "media_box" {
  name        = "media.box"
  target_node = "pve-5600g-1"
  cores       = 4
  sockets     = 1
  memory      = 8192

  disk {
    storage = "ceph"
    size    = "256G"
    type    = "scsi"
    ssd     = 1
  }
}

Capabilities:

  • VM creation and destruction
  • Resource allocation
  • Networking configuration
  • Template management

Ansible

Purpose: Configuration management and deployment automation

Playbooks:

  • VM initial setup and hardening
  • Service deployment (Docker Compose)
  • System updates and patches
  • Configuration synchronization
  • Secrets management (Ansible Vault)

Example Playbooks:

- ansible-proxmox: Main cluster provisioning
- vps-hardening: SSH, firewall, security
- service-deploy: Docker service deployment

Execution:

  • ansible-playbook main.yml: Full cluster setup
  • ansible-playbook hardening.yml: Security hardening
  • Rolling deployments for zero-downtime updates
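The rolling, zero-downtime pattern is typically expressed with `serial` in the play header; a sketch (play, group, and path names are placeholders):

```yaml
- name: Rolling service deployment
  hosts: docker_hosts
  serial: 1            # update one VM at a time to keep services available
  tasks:
    - name: Pull latest images and restart the stack
      community.docker.docker_compose_v2:
        project_src: /opt/stacks/media
        pull: always
        state: present
```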

Monitoring & Observability

Prometheus + Grafana Stack

Prometheus:

  • Metrics collection from all nodes and services
  • Custom scrape targets for applications
  • Alert rule evaluation
  • Time-series data storage
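Node-level scraping can be sketched in `prometheus.yml` as follows (targets reuse the VM IPs listed above with node_exporter's default port 9100; the job name is illustrative):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 192.168.178.100:9100   # dev.box
          - 192.168.178.101:9100   # org.box
          - 192.168.178.103:9100   # media.box
          - 192.168.178.104:9100   # control.box
```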

Grafana:

  • Comprehensive dashboards
  • Real-time monitoring
  • Historical trend analysis
  • Alert visualization

Monitoring Targets:

  • Node exporter (host metrics)
  • Ceph cluster health
  • VM resource utilization
  • Service-specific metrics
  • Network statistics

Alerting

Uptime Kuma:

  • Service health monitoring
  • Public-facing status page
  • Notification routing (Slack, email)
  • Incident tracking

Alert Rules:

  • High CPU/memory usage
  • Storage capacity warnings
  • Network connectivity issues
  • Service unavailability
  • Ceph health warnings
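In Prometheus rule syntax, the storage-capacity warning above might look like this (threshold and durations are illustrative):

```yaml
groups:
  - name: capacity
    rules:
      - alert: FilesystemAlmostFull
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
```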

Logging

Centralized Logging:

  • Container logs aggregation
  • System journal capture
  • Application-specific logs
  • Log analysis and searching

Security & Hardening

Network Security

VLANs:

  • Service isolation
  • Traffic segmentation
  • DDoS mitigation
  • Guest network separation

Firewall:

  • OPNsense/pfSense planned
  • Stateful firewall rules
  • VPN support
  • Network segmentation

SSH Hardening

  • SSH key-based authentication only
  • Disabled password login
  • Custom SSH port
  • Fail2ban for brute-force protection
  • Regular key rotation
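The corresponding `/etc/ssh/sshd_config` directives (the port number is a placeholder):

```
# Key-based authentication only
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password

# Non-default port (placeholder value)
Port 2222
```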

Access Control

Authentik (Planned):

  • Centralized SSO/LDAP
  • User management
  • OAuth2 / OIDC
  • Application-level authentication
  • Multi-factor authentication (MFA)

Secrets Management

  • Ansible Vault for sensitive data
  • Environment variable-based secrets
  • No hardcoded passwords
  • Regular secret rotation

Deployment Pipeline

Git-Driven Infrastructure

Git Push
    ↓
GitHub Actions Trigger
    ↓
Terraform Plan
    ↓
Manual Approval (or auto)
    ↓
Terraform Apply
    ↓
Ansible Provisioning
    ↓
Service Verification
    ↓
Smoke Tests
    ↓
Production Deployment
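The trigger-to-apply portion of the pipeline can be sketched as a GitHub Actions workflow; assigning reviewers to the `production` environment supplies the manual-approval gate (job names and the branch are illustrative):

```yaml
name: infra
on:
  push:
    branches: [main]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan
  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production   # reviewers on this environment provide manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
```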

CI/CD Integration

  • Jenkins: Local CI/CD orchestration
  • GitLab CI: Alternative pipeline engine
  • GitHub Actions: External automation
  • Automated testing before deployment

Capacity & Performance

Current Capacity

  • Total CPU Cores: 18 (3 × Ryzen 5 5600G, 6 cores each; excludes the legacy node)
  • Total RAM: 80GB (32 + 32 + 16)
  • Storage: 12TB local NVMe + 12TB SATA SSD
  • Ceph Capacity: Scales with node addition
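With 3× replication, usable Ceph capacity is raw capacity divided by the replication factor; a quick sanity check on the figures above:

```shell
raw_tb=$((12 + 12))   # 12 TB NVMe + 12 TB SATA across the Ceph nodes
replicas=3            # Ceph pool size (3x replication)
usable_tb=$((raw_tb / replicas))
echo "${usable_tb} TB usable"   # prints "8 TB usable"
```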

Performance Targets

  • VM Startup Time: < 30 seconds
  • Storage IOPS: 10,000+ (NVMe tier)
  • Network Throughput: 2.5Gbps per node
  • Cluster Failover: < 5 minutes
  • Service Availability: 99.9% uptime

Maintenance & Operations

Monitoring Schedule

  • Daily: Automated health checks
  • Weekly: Performance review, backup verification
  • Monthly: Capacity planning, security updates
  • Quarterly: Disaster recovery drills

Backup Strategy

3-2-1 Backup Rule:

  • 3 copies of critical data
  • 2 different storage media
  • 1 off-site backup

Implementation:

  • Ceph replication (3 copies on-site)
  • Wasabi S3 hot backup
  • Hetzner Storage Box cold backup
  • Incremental backups with Restic
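The incremental off-site leg can be sketched with Restic over SFTP to the Hetzner Storage Box (user, host, paths, and retention values are placeholders):

```shell
export RESTIC_REPOSITORY=sftp:u123456@u123456.your-storagebox.de:/backups
export RESTIC_PASSWORD_FILE=/root/.restic-pass

restic init                       # one-time repository setup
restic backup /srv/critical      # incremental snapshot of critical data
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check                      # verify repository integrity
```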

Updates & Patching

  • Automated security updates via Ansible
  • Kernel updates with planned downtime
  • Service updates via Docker image pulls
  • Rolling updates to maintain uptime

Current Status

🔄 In Progress - Active Expansion

Completed

  • 4-node Proxmox cluster operational
  • Ceph distributed storage running
  • 4 VM stacks deployed and operational
  • Traefik reverse proxy configured
  • Prometheus + Grafana monitoring
  • Docker Compose service orchestration
  • Ansible provisioning framework
  • Basic networking and VLAN setup

In Progress

  • Network switch installation (2025-10-05 partial)
  • New managed central switch
  • OPNsense firewall setup
  • Authentik SSO/LDAP
  • Data Box hardware integration
  • PVE-N305 NAS setup
  • Kubernetes cluster expansion

Planned 2025

  • Hardware replacement (Data Boxes, N305)
  • Network infrastructure upgrade (10GbE)
  • OPNsense firewall deployment
  • Advanced monitoring with Elastic Stack
  • Kubernetes cluster (5 nodes)
  • Enhanced backup automation
  • DNS and reverse proxy optimization

Skills Demonstrated

Virtualization & Cloud

  • Proxmox VE cluster design and management
  • VM resource allocation and optimization
  • High availability and failover
  • Cluster networking and VLAN support

Distributed Systems

  • Ceph cluster architecture
  • Data replication and fault tolerance
  • Distributed storage optimization
  • Cluster health monitoring

DevOps & Automation

  • Infrastructure as Code (Terraform)
  • Configuration management (Ansible)
  • Automated deployment pipelines
  • Version-controlled infrastructure

Linux Administration

  • Enterprise Linux configuration
  • Network configuration and optimization
  • Security hardening
  • Service management

Networking

  • Network design and topology
  • VLAN configuration
  • 10Gbps network optimization
  • Firewall and security rules

Monitoring & Observability

  • Prometheus metrics collection
  • Grafana dashboard design
  • Alert rule creation
  • Performance analysis

Storage Management

  • Distributed storage architecture
  • Backup strategy implementation
  • Off-site replication
  • Disaster recovery planning

Impact & Value

This homelab project demonstrates:

  1. Enterprise-Grade Infrastructure Design: Production-quality systems for personal use
  2. Complete System Ownership: From hardware selection through operations
  3. DevOps Mastery: Automation, monitoring, disaster recovery
  4. Continuous Learning: Regular iteration and improvement
  5. Cost Optimization: High-performance setup at reasonable cost
  6. Reliability: Self-healing, fault-tolerant systems

The infrastructure runs dozens of personal services, serves as a learning platform, and represents thousands of hours of accumulated expertise in virtualization, networking, and cloud infrastructure.

Future Vision

  • Kubernetes-based orchestration for advanced scaling
  • Multi-site setup for geographic redundancy
  • Advanced backup automation (Restic + S3)
  • Enhanced monitoring with log aggregation
  • Cost analysis and optimization tooling
  • Public status page and community engagement
  • Documentation and blog series

This is an ongoing, evolving project that continues to grow and improve as technology and needs evolve.