Homelab Infrastructure - Proxmox Cluster & Services

A complete homelab infrastructure project featuring a 4-node Proxmox VE cluster with Ceph distributed storage, containerized microservices, automated deployment pipelines, and comprehensive monitoring. Demonstrates enterprise-grade infrastructure design for home use.

Overview

The Homelab Infrastructure project is a comprehensive, production-grade virtualized computing environment built on Proxmox VE. It is the most complex technical undertaking in this portfolio, integrating hardware, virtualization, distributed storage, containerization, networking, and automation into a cohesive system.

This project serves as both a personal learning platform and a fully functional self-hosted infrastructure running dozens of services for personal and community use.

Architecture Overview

Hardware Infrastructure

Compute Nodes

4-Node Proxmox VE Cluster:

  1. PVE-5600G-1

    • CPU: AMD Ryzen 5 5600G (6 cores / 12 threads)
    • RAM: 32 GB DDR4
    • Storage: 4TB (2TB NVMe + 2TB SATA SSD)
    • Network: 2.5Gbps Ethernet
  2. PVE-5600G-2

    • CPU: AMD Ryzen 5 5600G
    • RAM: 32 GB DDR4
    • Storage: 4TB (2TB NVMe + 2TB SATA SSD)
    • Network: 2.5Gbps Ethernet
  3. PVE-5600G-3

    • CPU: AMD Ryzen 5 5600G
    • RAM: 16 GB DDR4 (upgradable to 32GB)
    • Storage: 4TB (2TB NVMe + 2TB SATA SSD)
    • Network: 2.5Gbps Ethernet
  4. PVE-Chinese (Legacy, being replaced)

    • Older hardware
    • Scheduled for replacement 2025
    • Temporary storage server role

Planned Hardware Upgrades (2025)

Data Boxes (2 units) - replacing the legacy PVE-Chinese node

  • Case: 2U rackmount
  • RAM: 128GB per machine
  • Network: 2x 10Gbps SFP+ cards
  • Storage: 9x SATA bays each

PVE-N305-1 - NAS appliance

  • Case: Fractal Design Node 304
  • CPU: Intel N305
  • RAM: 32GB DDR5
  • Network: 4x 2.5Gbps + PCIe slot
  • Storage: 4x 20TB HDD + 2x 4TB SSD

Networking

Network Topology:

  • Central managed switch (8 SFP+, VLAN support)
  • Horatio unmanaged switches (2.5Gbps and 10Gbps)
  • Cat 7 cabling (50m, future-proofed)
  • VLAN segmentation for service isolation

Storage Architecture

Ceph Distributed Storage

Ceph Cluster (3 nodes)
├── Monitor (MON) - Cluster health
├── Object Storage Daemon (OSD) - Data storage
│   ├── Node 1: 2x 2TB NVMe + 2x 2TB SATA
│   ├── Node 2: 2x 2TB NVMe + 2x 2TB SATA
│   └── Node 3: 2x 2TB NVMe + 2x 2TB SATA
└── Data Replication (3x replication factor)

Ceph Benefits:

  • Distributed, self-healing storage
  • Automatic data replication
  • Scalable capacity
  • Fault tolerance (with 3× replication, data survives the loss of two replicas; client I/O continues through a single node failure)
  • RBD (block storage) for VM disks
  • Object storage for media files

Ceph Operations:

  • Monitor cluster health
  • OSD management and recovery
  • Placement group balancing
  • Data replication tuning
  • Performance optimization
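The operations above map directly onto the Ceph CLI; a sketch of a routine health pass (the pool name `vm-pool` is illustrative):

```shell
# Overall cluster state: HEALTH_OK / HEALTH_WARN / HEALTH_ERR plus a summary
ceph status
ceph health detail

# Per-OSD view: which daemons are up/in, and how capacity is spread
ceph osd tree
ceph df

# Placement group state; anything not active+clean indicates recovery or balancing work
ceph pg stat

# Confirm the replication factor on a pool (pool name is a placeholder)
ceph osd pool get vm-pool size
```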

Remote Storage

Wasabi S3 Bucket

  • Hot media storage
  • S3FS mounted for direct filesystem access
  • Cost-effective cloud backup
  • Replicated media library
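Mounting the bucket with S3FS looks roughly like this (bucket name, mount point, and region endpoint are placeholders):

```shell
# Credentials file in ACCESS_KEY:SECRET_KEY format, readable only by the owner
echo 'ACCESS_KEY:SECRET_KEY' > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs

# Mount the bucket; eu-central-1 endpoint shown as an example
s3fs media-bucket /mnt/media \
  -o url=https://s3.eu-central-1.wasabisys.com \
  -o passwd_file=~/.passwd-s3fs
```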

Hetzner Storage Box

  • Cold incremental backups
  • Geographic redundancy
  • Off-site disaster recovery
  • Affordable cold storage

Virtualization Layer

Virtual Machine Services

Media Stack (media.box)

  • IP: 192.168.178.103
  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 6TB (Ceph)
  • Services: Jellyfin, Calibre, NFS, Samba, MPD
  • Purpose: Media center and file sharing

Organization Stack (org.box)

  • IP: 192.168.178.101
  • CPU: 2 cores
  • RAM: 4GB
  • Storage: 256GB (Ceph)
  • Services: Bitwarden, WikiJS, Wallabag, RSS, Traggo, Uguu
  • Purpose: Productivity and knowledge management

Development Stack (dev.box)

  • IP: 192.168.178.100
  • CPU: 2 cores
  • RAM: 4GB
  • Storage: 256GB (Ceph)
  • Services: Gitea, Jenkins, GitLab CI/CD, n8n
  • Purpose: CI/CD and automation platform

Control Stack (control.box)

  • IP: 192.168.178.104
  • CPU: 2 cores
  • RAM: 4GB
  • Storage: 256GB (Ceph)
  • Services: Grafana, Portainer, Cockpit, Uptime Kuma, Heimdall
  • Purpose: Monitoring and system management

Container Platform

Docker Compose

  • Service orchestration on each VM
  • Health checks and restart policies
  • Networking and volume management
  • Environment-based configuration
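A minimal compose service illustrating the health-check, restart-policy, and env-file patterns above (service name and volume paths are illustrative; the health check assumes curl is available in the image):

```yaml
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    restart: unless-stopped
    env_file: .env
    volumes:
      - /srv/media:/media:ro
    healthcheck:
      # Jellyfin exposes a /health endpoint on its default port 8096
      test: ["CMD", "curl", "-f", "http://localhost:8096/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```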

Kubernetes Cluster (planned expansion)

  • Current: 2 nodes
  • Planned: 5+ node cluster
  • Advanced orchestration and scaling

Service Architecture

Internet
    ↓
[Firewall/Router - OPNsense planned]
    ↓
[Reverse Proxy - Traefik]
    ↓
[Load Balancer / Router]
    ↓
┌───────┬───────┬───────┬─────────┐
│ Media │  Org  │  Dev  │ Control │
└───────┴───────┴───────┴─────────┘
    ↓
[Ceph Storage Backend]

Infrastructure as Code

Terraform

Purpose: Declarative VM provisioning

resource "proxmox_vm_qemu" "media_box" {
  name        = "media.box"
  target_node = "pve-5600g-1"
  cores       = 4
  sockets     = 1
  memory      = 8192

  disk {
    storage = "ceph"
    size    = "256G"
    type    = "scsi"
    ssd     = 1
  }
}

Capabilities:

  • VM creation and destruction
  • Resource allocation
  • Networking configuration
  • Template management

Ansible

Purpose: Configuration management and deployment automation

Playbooks:

  • VM initial setup and hardening
  • Service deployment (Docker Compose)
  • System updates and patches
  • Configuration synchronization
  • Secrets management (Ansible Vault)

Example Playbooks:

- ansible-proxmox: Main cluster provisioning
- vps-hardening: SSH, firewall, security
- service-deploy: Docker service deployment

Execution:

  • ansible-playbook main.yml: Full cluster setup
  • ansible-playbook hardening.yml: Security hardening
  • Rolling deployments for zero-downtime updates
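The rolling, zero-downtime pattern is typically expressed with `serial` in the play header; a sketch (play, group, and path names are placeholders):

```yaml
- name: Rolling service deployment
  hosts: docker_hosts
  serial: 1            # update one VM at a time to keep services available
  tasks:
    - name: Pull latest images and restart the stack
      community.docker.docker_compose_v2:
        project_src: /opt/stacks/media
        pull: always
        state: present
```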

Monitoring & Observability

Prometheus + Grafana Stack

Prometheus:

  • Metrics collection from all nodes and services
  • Custom scrape targets for applications
  • Alert rule evaluation
  • Time-series data storage
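Node-level scraping can be sketched in `prometheus.yml` as follows (targets reuse the VM IPs listed above with node_exporter's default port 9100; the job name is illustrative):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 192.168.178.100:9100   # dev.box
          - 192.168.178.101:9100   # org.box
          - 192.168.178.103:9100   # media.box
          - 192.168.178.104:9100   # control.box
```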

Grafana:

  • Comprehensive dashboards
  • Real-time monitoring
  • Historical trend analysis
  • Alert visualization

Monitoring Targets:

  • Node exporter (host metrics)
  • Ceph cluster health
  • VM resource utilization
  • Service-specific metrics
  • Network statistics

Alerting

Uptime Kuma:

  • Service health monitoring
  • Public-facing status page
  • Notification routing (Slack, email)
  • Incident tracking

Alert Rules:

  • High CPU/memory usage
  • Storage capacity warnings
  • Network connectivity issues
  • Service unavailability
  • Ceph health warnings
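In Prometheus rule syntax, the storage-capacity warning above might look like this (threshold and durations are illustrative):

```yaml
groups:
  - name: capacity
    rules:
      - alert: FilesystemAlmostFull
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
```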

Logging

Centralized Logging:

  • Container logs aggregation
  • System journal capture
  • Application-specific logs
  • Log analysis and searching

Security & Hardening

Network Security

VLANs:

  • Service isolation
  • Traffic segmentation
  • DDoS mitigation
  • Guest network separation

Firewall:

  • OPNsense/pfSense planned
  • Stateful firewall rules
  • VPN support
  • Network segmentation

SSH Hardening

  • SSH key-based authentication only
  • Disabled password login
  • Custom SSH port
  • Fail2ban for brute-force protection
  • Regular key rotation
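The corresponding `/etc/ssh/sshd_config` directives (the port number is a placeholder):

```
# Key-based authentication only
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password

# Non-default port (placeholder value)
Port 2222
```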

Access Control

Authentik (Planned):

  • Centralized SSO/LDAP
  • User management
  • OAuth2 / OIDC
  • Application-level authentication
  • Multi-factor authentication (MFA)

Secrets Management

  • Ansible Vault for sensitive data
  • Environment variable-based secrets
  • No hardcoded passwords
  • Regular secret rotation

Deployment Pipeline

Git-Driven Infrastructure

Git Push
    ↓
GitHub Actions Trigger
    ↓
Terraform Plan
    ↓
Manual Approval (or auto)
    ↓
Terraform Apply
    ↓
Ansible Provisioning
    ↓
Service Verification
    ↓
Smoke Tests
    ↓
Production Deployment
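The trigger-to-apply portion of the pipeline can be sketched as a GitHub Actions workflow; assigning reviewers to the `production` environment supplies the manual-approval gate (job names and the branch are illustrative):

```yaml
name: infra
on:
  push:
    branches: [main]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan
  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production   # reviewers on this environment provide manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
```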

CI/CD Integration

  • Jenkins: Local CI/CD orchestration
  • GitLab CI: Alternative pipeline engine
  • GitHub Actions: External automation
  • Automated testing before deployment

Capacity & Performance

Current Capacity

  • Total CPU Cores: 18 (3 × Ryzen 5 5600G, 6 cores each; excludes the legacy node)
  • Total RAM: 80GB (32 + 32 + 16)
  • Storage: 12TB local NVMe + 12TB SATA SSD
  • Ceph Capacity: Scales with node addition
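With 3× replication, usable Ceph capacity is raw capacity divided by the replication factor; a quick sanity check on the figures above:

```shell
raw_tb=$((12 + 12))   # 12 TB NVMe + 12 TB SATA across the Ceph nodes
replicas=3            # Ceph pool size (3x replication)
usable_tb=$((raw_tb / replicas))
echo "${usable_tb} TB usable"   # prints "8 TB usable"
```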

Performance Targets

  • VM Startup Time: < 30 seconds
  • Storage IOPS: 10,000+ (NVMe tier)
  • Network Throughput: 2.5Gbps per node
  • Cluster Failover: < 5 minutes
  • Service Availability: 99.9% uptime

Maintenance & Operations

Monitoring Schedule

  • Daily: Automated health checks
  • Weekly: Performance review, backup verification
  • Monthly: Capacity planning, security updates
  • Quarterly: Disaster recovery drills

Backup Strategy

3-2-1 Backup Rule:

  • 3 copies of critical data
  • 2 different storage media
  • 1 off-site backup

Implementation:

  • Ceph replication (3 copies on-site)
  • Wasabi S3 hot backup
  • Hetzner Storage Box cold backup
  • Incremental backups with Restic
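The incremental off-site leg can be sketched with Restic over SFTP to the Hetzner Storage Box (user, host, paths, and retention values are placeholders):

```shell
export RESTIC_REPOSITORY=sftp:u123456@u123456.your-storagebox.de:/backups
export RESTIC_PASSWORD_FILE=/root/.restic-pass

restic init                       # one-time repository setup
restic backup /srv/critical      # incremental snapshot of critical data
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
restic check                      # verify repository integrity
```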

Updates & Patching

  • Automated security updates via Ansible
  • Kernel updates with planned downtime
  • Service updates via Docker image pulls
  • Rolling updates to maintain uptime

Current Status

🔄 In Progress - Active Expansion

Completed

  • 4-node Proxmox cluster operational
  • Ceph distributed storage running
  • 4 VM stacks deployed and operational
  • Traefik reverse proxy configured
  • Prometheus + Grafana monitoring
  • Docker Compose service orchestration
  • Ansible provisioning framework
  • Basic networking and VLAN setup

In Progress

  • Network switch installation (2025-10-05 partial)
  • New managed central switch
  • OPNsense firewall setup
  • Authentik SSO/LDAP
  • Data Box hardware integration
  • PVE-N305 NAS setup
  • Kubernetes cluster expansion

Planned 2025

  • Hardware replacement (Data Boxes, N305)
  • Network infrastructure upgrade (10GbE)
  • OPNsense firewall deployment
  • Advanced monitoring with Elastic Stack
  • Kubernetes cluster (5 nodes)
  • Enhanced backup automation
  • DNS and reverse proxy optimization

Skills Demonstrated

Virtualization & Cloud

  • Proxmox VE cluster design and management
  • VM resource allocation and optimization
  • High availability and failover
  • Cluster networking and VLAN support

Distributed Systems

  • Ceph cluster architecture
  • Data replication and fault tolerance
  • Distributed storage optimization
  • Cluster health monitoring

DevOps & Automation

  • Infrastructure as Code (Terraform)
  • Configuration management (Ansible)
  • Automated deployment pipelines
  • Version-controlled infrastructure

Linux Administration

  • Enterprise Linux configuration
  • Network configuration and optimization
  • Security hardening
  • Service management

Networking

  • Network design and topology
  • VLAN configuration
  • 10Gbps network optimization
  • Firewall and security rules

Monitoring & Observability

  • Prometheus metrics collection
  • Grafana dashboard design
  • Alert rule creation
  • Performance analysis

Storage Management

  • Distributed storage architecture
  • Backup strategy implementation
  • Off-site replication
  • Disaster recovery planning

Impact & Value

This homelab project demonstrates:

  1. Enterprise-Grade Infrastructure Design: Production-quality systems for personal use
  2. Complete System Ownership: From hardware selection through operations
  3. DevOps Mastery: Automation, monitoring, disaster recovery
  4. Continuous Learning: Regular iteration and improvement
  5. Cost Optimization: High-performance setup at reasonable cost
  6. Reliability: Self-healing, fault-tolerant systems

The infrastructure runs dozens of personal services, serves as a learning platform, and represents thousands of hours of accumulated expertise in virtualization, networking, and cloud infrastructure.

Future Vision

  • Kubernetes-based orchestration for advanced scaling
  • Multi-site setup for geographic redundancy
  • Advanced backup automation (Restic + S3)
  • Enhanced monitoring with log aggregation
  • Cost analysis and optimization tooling
  • Public status page and community engagement
  • Documentation and blog series

This is an ongoing, evolving project that continues to grow and improve as technology and needs evolve.