Homelab Infrastructure - Proxmox Cluster & Services
A complete homelab infrastructure project featuring a 4-node Proxmox VE cluster with Ceph distributed storage, containerized microservices, automated deployment pipelines, and comprehensive monitoring. Demonstrates enterprise-grade infrastructure design for home use.
Last updated:
Overview
The Homelab Infrastructure project is a comprehensive, production-grade virtualized computing environment built on Proxmox VE. It is the most complex technical undertaking in this portfolio, integrating hardware, virtualization, distributed storage, containerization, networking, and automation into a cohesive system.
This project serves as both a personal learning platform and a fully functional self-hosted infrastructure running dozens of services for personal and community use.
Architecture Overview
Hardware Infrastructure
Compute Nodes
4-Node Proxmox VE Cluster:
PVE-5600G-1
- CPU: AMD Ryzen 5 5600G (6 cores / 12 threads)
- RAM: 32 GB DDR4
- Storage: 4TB (2TB NVMe + 2TB SATA SSD)
- Network: 2.5Gbps Ethernet
PVE-5600G-2
- CPU: AMD Ryzen 5 5600G
- RAM: 32 GB DDR4
- Storage: 4TB (2TB NVMe + 2TB SATA SSD)
- Network: 2.5Gbps Ethernet
PVE-5600G-3
- CPU: AMD Ryzen 5 5600G
- RAM: 16 GB DDR4 (upgradable to 32GB)
- Storage: 4TB (2TB NVMe + 2TB SATA SSD)
- Network: 2.5Gbps Ethernet
PVE-Chinese (Legacy, being replaced)
- Older hardware
- Scheduled for replacement 2025
- Temporary storage server role
Planned Hardware Upgrades (2025)
Data Boxes (2 units) - Replacing legacy Chinese server
- Case: 2U rackmount
- RAM: 128GB per machine
- Network: 2x 10Gbps SFP+ cards
- Storage: 9x SATA bays each
PVE-N305-1 - NAS appliance
- Case: Fractal Design Node 304
- CPU: Intel N305
- RAM: 32GB DDR5
- Network: 4x 2.5Gbps + PCIe slot
- Storage: 4x 20TB HDD + 2x 4TB SSD
Networking
Network Topology:
- Central managed switch (8 SFP+, VLAN support)
- Horatio unmanaged switches (2.5Gbps and 10Gbps)
- Cat 7 cabling (50m, future-proofed)
- VLAN segmentation for service isolation
Equipment:
- 8x2.5Gbps + 1x10Gbps Horatio Switch (3 units)
- 4x2.5Gbps + 2x10Gbps Horatio Switch (1 unit)
- USB 3.0 to 2.5Gbps Ethernet adapters (4 units)
- Dual-port 10Gbps SFP+ cards
Storage Architecture
Ceph Distributed Storage
Ceph Cluster (3 nodes)
├── Monitor (MON) - Cluster health
├── Object Storage Daemon (OSD) - Data storage
│   ├── Node 1: 1x 2TB NVMe + 1x 2TB SATA
│   ├── Node 2: 1x 2TB NVMe + 1x 2TB SATA
│   └── Node 3: 1x 2TB NVMe + 1x 2TB SATA
└── Data Replication (3x replication factor)
Ceph Benefits:
- Distributed, self-healing storage
- Automatic data replication
- Scalable capacity
- Fault tolerance (data survives up to two node failures; the cluster stays available through one)
- RBD (block storage) for VM disks
- Object storage for media files
Ceph Operations:
- Monitor cluster health
- OSD management and recovery
- Placement group balancing
- Data replication tuning
- Performance optimization
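Cluster-health monitoring can feed the alerting stack directly. A sketch of a Prometheus alert rule, assuming the Ceph mgr `prometheus` module is enabled (the metric name follows that exporter; the threshold and group name are illustrative):

```yaml
groups:
  - name: ceph
    rules:
      - alert: CephHealthNotOK
        # ceph_health_status: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
        expr: ceph_health_status >= 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster health is degraded"
```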
Remote Storage
Wasabi S3 Bucket
- Hot media storage
- S3FS mounted for direct filesystem access
- Cost-effective cloud backup
- Replicated media library
Hetzner Storage Box
- Cold incremental backups
- Geographic redundancy
- Off-site disaster recovery
- Affordable cold storage
Virtualization Layer
Virtual Machine Services
Media Stack (media.box)
- IP: 192.168.178.103
- CPU: 4 cores
- RAM: 8GB
- Storage: 6TB (Ceph)
- Services: Jellyfin, Calibre, NFS, Samba, MPD
- Purpose: Media center and file sharing
Organization Stack (org.box)
- IP: 192.168.178.101
- CPU: 2 cores
- RAM: 4GB
- Storage: 256GB (Ceph)
- Services: Bitwarden, WikiJS, Wallabag, RSS, Traggo, Uguu
- Purpose: Productivity and knowledge management
Development Stack (dev.box)
- IP: 192.168.178.100
- CPU: 2 cores
- RAM: 4GB
- Storage: 256GB (Ceph)
- Services: Gitea, Jenkins, GitLab CI/CD, n8n
- Purpose: CI/CD and automation platform
Control Stack (control.box)
- IP: 192.168.178.104
- CPU: 2 cores
- RAM: 4GB
- Storage: 256GB (Ceph)
- Services: Grafana, Portainer, Cockpit, Uptime Kuma, Heimdall
- Purpose: Monitoring and system management
Container Platform
Docker Compose
- Service orchestration on each VM
- Health checks and restart policies
- Networking and volume management
- Environment-based configuration
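The health-check and restart behaviour above can be sketched in a minimal Compose fragment. Jellyfin is one of the deployed services; the image tag, health endpoint, and volume path are assumptions, not the actual configuration:

```yaml
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    restart: unless-stopped          # survive daemon restarts, honour manual stops
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8096/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    volumes:
      - /mnt/media:/media            # hypothetical Ceph-backed media mount
```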
Kubernetes Cluster (planned expansion)
- Current: 2 nodes
- Planned: 5+ node cluster
- Advanced orchestration and scaling
Service Architecture
Internet
↓
[Firewall/Router - OPNsense planned]
↓
[Reverse Proxy - Traefik]
↓
[Load Balancer / Router]
↓
┌───────┬───────┬───────┬─────────┐
│ Media │  Org  │  Dev  │ Control │
└───────┴───────┴───────┴─────────┘
↓
[Ceph Storage Backend]
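With Traefik in front of the stacks, per-service routing is commonly wired through Compose labels. An illustrative fragment (the hostname and router name are assumptions; the port is Jellyfin's default):

```yaml
services:
  jellyfin:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.jellyfin.rule=Host(`media.example.lan`)"
      - "traefik.http.services.jellyfin.loadbalancer.server.port=8096"
```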
Infrastructure as Code
Terraform
Purpose: Declarative VM provisioning
```hcl
resource "proxmox_vm_qemu" "media_box" {
  name        = "media.box"
  target_node = "pve-5600g-1"
  cores       = 4
  sockets     = 1
  memory      = 8192

  disk {
    storage = "ceph"
    size    = "256G"
    ssd     = 1   # SSD emulation is a disk-level flag, not a VM-level one
  }
}
```
Capabilities:
- VM creation and destruction
- Resource allocation
- Networking configuration
- Template management
Ansible
Purpose: Configuration management and deployment automation
Playbooks:
- VM initial setup and hardening
- Service deployment (Docker Compose)
- System updates and patches
- Configuration synchronization
- Secrets management (Ansible Vault)
Example Playbooks:
- ansible-proxmox: Main cluster provisioning
- vps-hardening: SSH, firewall, security
- service-deploy: Docker service deployment
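The SSH step of the vps-hardening playbook might look like this minimal sketch (task names, paths, and the handler are illustrative assumptions, not the actual playbook; modules are standard `ansible.builtin` ones):

```yaml
- name: Harden SSH daemon
  hosts: all
  become: true
  tasks:
    - name: Disallow password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: PasswordAuthentication no
      notify: Restart sshd
  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
```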
Execution:
- ansible-playbook main.yml: full cluster setup
- ansible-playbook hardening.yml: security hardening
- Rolling deployments for zero-downtime updates
Git Repository
- ziegelstein/ansible-proxmox
- Version-controlled infrastructure
- Change tracking and auditing
- CI/CD integration
Monitoring & Observability
Prometheus + Grafana Stack
Prometheus:
- Metrics collection from all nodes and services
- Custom scrape targets for applications
- Alert rule evaluation
- Time-series data storage
Grafana:
- Comprehensive dashboards
- Real-time monitoring
- Historical trend analysis
- Alert visualization
Monitoring Targets:
- Node exporter (host metrics)
- Ceph cluster health
- VM resource utilization
- Service-specific metrics
- Network statistics
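A scrape configuration for the host metrics could look like this sketch. Only the IPs come from the VM list above; the job name is illustrative and 9100 is node_exporter's default port:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 192.168.178.100:9100   # dev.box
          - 192.168.178.101:9100   # org.box
          - 192.168.178.103:9100   # media.box
          - 192.168.178.104:9100   # control.box
```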
Alerting
Uptime Kuma:
- Service health monitoring
- Public status page
- Notification routing (Slack, email)
- Incident tracking
Alert Rules:
- High CPU/memory usage
- Storage capacity warnings
- Network connectivity issues
- Service unavailability
- Ceph health warnings
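The memory alert, for example, can be expressed as a Prometheus rule over node_exporter metrics (the 90% threshold and rule-group name are illustrative assumptions):

```yaml
groups:
  - name: capacity
    rules:
      - alert: HighMemoryUsage
        # fraction of memory in use, derived from node_exporter gauges
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
```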
Logging
Centralized Logging:
- Container logs aggregation
- System journal capture
- Application-specific logs
- Log analysis and searching
Security & Hardening
Network Security
VLANs:
- Service isolation
- Traffic segmentation
- DDoS mitigation
- Guest network separation
Firewall:
- OPNsense/pfSense planned
- Stateful firewall rules
- VPN support
- Network segmentation
SSH Hardening
- SSH key-based authentication only
- Disabled password login
- Custom SSH port
- Fail2ban for brute-force protection
- Regular key rotation
Access Control
Authentik (Planned):
- Centralized SSO/LDAP
- User management
- OAuth2 / OIDC
- Application-level authentication
- Multi-factor authentication (MFA)
Secrets Management
- Ansible Vault for sensitive data
- Environment variable-based secrets
- No hardcoded passwords
- Regular secret rotation
Deployment Pipeline
Git-Driven Infrastructure
Git Push
↓
GitHub Actions Trigger
↓
Terraform Plan
↓
Manual Approval (or auto)
↓
Terraform Apply
↓
Ansible Provisioning
↓
Service Verification
↓
Smoke Tests
↓
Production Deployment
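The trigger and plan stages of the pipeline above could be sketched as a GitHub Actions workflow (workflow name and branch are assumptions; the checkout and setup-terraform actions are the standard published ones):

```yaml
name: infra-deploy
on:
  push:
    branches: [main]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -out=tfplan   # apply happens after approval
```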
CI/CD Integration
- Jenkins: Local CI/CD orchestration
- GitLab CI: Alternative pipeline engine
- GitHub Actions: External automation
- Automated testing before deployment
Capacity & Performance
Current Capacity
- Total CPU Cores: 18 (3x Ryzen 5 5600G × 6 cores; legacy node excluded)
- Total RAM: 80GB (32 + 32 + 16)
- Storage: 6TB NVMe + 6TB SATA SSD across the three active nodes
- Ceph Capacity: Scales with node addition
Performance Targets
- VM Startup Time: < 30 seconds
- Storage IOPS: 10,000+ (NVMe tier)
- Network Throughput: 2.5Gbps per node
- Cluster Failover: < 5 minutes
- Service Availability: 99.9% uptime
Maintenance & Operations
Monitoring Schedule
- Daily: Automated health checks
- Weekly: Performance review, backup verification
- Monthly: Capacity planning, security updates
- Quarterly: Disaster recovery drills
Backup Strategy
3-2-1 Backup Rule:
- 3 copies of critical data
- 2 different storage media
- 1 off-site backup
Implementation:
- Ceph replication (3 copies on-site)
- Wasabi S3 hot backup
- Hetzner Storage Box cold backup
- Incremental backups with Restic
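The Restic-to-Hetzner leg can be driven from Ansible; a minimal sketch, assuming an SFTP repository on the Storage Box (host, paths, and password file are placeholders, not the actual setup):

```yaml
- name: Incremental backup to Hetzner Storage Box
  hosts: media
  become: true
  tasks:
    - name: Run restic backup
      ansible.builtin.command:
        cmd: restic backup /srv/media
      environment:
        RESTIC_REPOSITORY: "sftp:user@example-storagebox.de:/backups"
        RESTIC_PASSWORD_FILE: /root/.restic-password
```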
Updates & Patching
- Automated security updates via Ansible
- Kernel updates with planned downtime
- Service updates via Docker image pulls
- Rolling updates to maintain uptime
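A rolling patch run maps naturally onto Ansible's `serial` keyword; a sketch, assuming Debian-based nodes (play name and host group are illustrative):

```yaml
- name: Apply security updates
  hosts: all
  become: true
  serial: 1            # one node at a time, so the cluster stays up
  tasks:
    - name: Upgrade packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist
```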
Current Status
🔄 In Progress - Active Expansion
Completed
- 4-node Proxmox cluster operational
- Ceph distributed storage running
- 4 VM stacks deployed and operational
- Traefik reverse proxy configured
- Prometheus + Grafana monitoring
- Docker Compose service orchestration
- Ansible provisioning framework
- Basic networking and VLAN setup
In Progress
- Network switch installation (partially completed 2025-10-05)
- New managed central switch
- OPNsense firewall setup
- Authentik SSO/LDAP
- Data Box hardware integration
- PVE-N305 NAS setup
- Kubernetes cluster expansion
Planned 2025
- Hardware replacement (Data Boxes, N305)
- Network infrastructure upgrade (10GbE)
- OPNsense firewall deployment
- Advanced monitoring with Elastic Stack
- Kubernetes cluster (5 nodes)
- Enhanced backup automation
- DNS and reverse proxy optimization
Skills Demonstrated
Virtualization & Cloud
- Proxmox VE cluster design and management
- VM resource allocation and optimization
- High availability and failover
- Cluster networking and VLAN support
Distributed Systems
- Ceph cluster architecture
- Data replication and fault tolerance
- Distributed storage optimization
- Cluster health monitoring
DevOps & Automation
- Infrastructure as Code (Terraform)
- Configuration management (Ansible)
- Automated deployment pipelines
- Version-controlled infrastructure
Linux Administration
- Enterprise Linux configuration
- Network configuration and optimization
- Security hardening
- Service management
Networking
- Network design and topology
- VLAN configuration
- 10Gbps network optimization
- Firewall and security rules
Monitoring & Observability
- Prometheus metrics collection
- Grafana dashboard design
- Alert rule creation
- Performance analysis
Storage Management
- Distributed storage architecture
- Backup strategy implementation
- Off-site replication
- Disaster recovery planning
Resource Files
- homelab-2024.md
- homelab-2025.md
- Serveraufbau.md
- homelab-gear.md
- Git Repository
- Proxmox Documentation
- Ceph Documentation
Impact & Value
This homelab project demonstrates:
- Enterprise-Grade Infrastructure Design: Production-quality systems for personal use
- Complete System Ownership: From hardware selection through operations
- DevOps Mastery: Automation, monitoring, disaster recovery
- Continuous Learning: Regular iteration and improvement
- Cost Optimization: High-performance setup at reasonable cost
- Reliability: Self-healing, fault-tolerant systems
The infrastructure runs dozens of personal services, serves as a learning platform, and represents thousands of hours of accumulated expertise in virtualization, networking, and cloud infrastructure.
Future Vision
- Kubernetes-based orchestration for advanced scaling
- Multi-site setup for geographic redundancy
- Advanced backup automation (Restic + S3)
- Enhanced monitoring with log aggregation
- Cost analysis and optimization tooling
- Public status page and community engagement
- Documentation and blog series
This is an ongoing, evolving project that continues to grow and improve as technology and needs evolve.