The server hardware is racked and cabled. It's time to choose an operating system. Linux was an obvious choice, but which distribution? With a little research, I narrowed the choices down to Unraid and Proxmox VE.

Unraid has a reasonable cost ($60 for up to 6 attached devices) and supports running virtual machines and Docker containers. However, it is primarily a NAS, with virtualization as an additional feature rather than the main focus. I already have a NAS, a QNAP TS-451+, which also runs VMs and Docker containers, but it's not powerful enough to run everything I want. This new server is destined to be primarily a compute server with access to some internal storage and to external storage from the NAS.

Proxmox VE has a subscription support model based on the number of CPU sockets, ranging from € 85 (~ $91 USD) per year per socket for community support up to € 796 (~ $856 USD) per year per socket for premium support. It has all features enabled without a subscription, but the administrative user interface reminds you that you don't have one. Subscriptions also grant access to the Enterprise package repository; without one, you will need to use the No-Subscription repository.

Another plus for Proxmox VE is that it is based on Debian and can be installed directly on top of an existing Debian 10 (Buster) installation. However, the recommended installation method is the Bare Metal installer. The only problem I ran into was that the default BIOS settings didn't have AMD-V enabled. There was no error message; the installer would just never finish loading. It wasn't until I tried the Unraid trial and got an error message about virtualization not being enabled that it dawned on me why the installer had failed.
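If a live Linux environment is at hand, one quick first check is to look for the svm (AMD-V) or vmx (Intel VT-x) CPU flag. This is only a sketch: has_virt_flag is a hypothetical helper name, and the check shows only what the kernel reports. Depending on the kernel, a feature disabled in firmware may or may not still be listed, so the BIOS setting remains the authoritative place to look.

```shell
#!/bin/sh
# Succeed if a cpuinfo-style file lists the svm (AMD-V) or vmx (VT-x) flag.
# has_virt_flag is a hypothetical helper name, not part of any tool.
has_virt_flag() {
    # -w matches whole words only, so flags such as "svm_lock" do not count.
    grep -Eqw 'svm|vmx' "$1" 2>/dev/null
}

if has_virt_flag /proc/cpuinfo; then
    echo "virtualization flag reported by the kernel"
else
    echo "no svm/vmx flag reported: check the BIOS/UEFI settings"
fi
```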

I also didn't take the time to customize the layout of the internal disk so Proxmox took the entire thing and configured it with LVM. I would have preferred ZFS so I could use the internal disk as cache for the external storage, but root zpools require special configuration and I managed to make the cache work anyway.

Fortunately, Proxmox VE offers comprehensive documentation covering both the command-line interface (CLI) and the web-based user interface (GUI). Because everything can be done from the command line, it can also be configured and managed with Ansible.

Storage Configuration

The storage types you choose depend on the features you need and the capabilities of your available storage. I may add more physical hosts to the mix in the future, so I wanted to make sure I could grow into a Proxmox Cluster. That meant I needed storage options which could be shared across multiple nodes. I also wanted support for snapshots for backups.

Proxmox can use either file-based or block-based storage. Once visible to the physical host, the storage is configured within Proxmox to contain one or more types of content: isos, templates, backups, images, rootdir, or snippets. This is configured with the pvesm (Proxmox VE storage manager) command, by editing /etc/pve/storage.cfg directly, or through the administration interface.

Here is an excerpt from storage.cfg after installation:

dir: local
	path /var/lib/vz
	content backup,iso,vztmpl

lvmthin: local-lvm
	thinpool data
	vgname pve
	content rootdir,images

  • "local" is just a local directory on the physical host. It's not shared with any other hosts in the cluster.
  • "local-lvm" is a thin-provisioned LVM volume on the local host, meaning that storage is consumed only as it's needed rather than pre-allocated.

Using the various LVM commands, we can see that there is one physical device and one volume group containing that physical device:

# pvs
  PV             VG  Fmt  Attr PSize    PFree   
  /dev/nvme0n1p3 pve lvm2 a--  <931.01g 1016.00m
  
# vgs 
  VG  #PV #LV #SN Attr   VSize    VFree   
  pve   1  19   0 wz--n- <931.01g 1016.00m

This is going to be high-performance storage because it's M.2 NVMe, but it's not redundant, so anything stored there will be lost if the disk fails. Within that volume group, there are a number of logical volumes representing all logical disks available for use within the system:

# lvs
  LV                VG  Attr       LSize    Pool Origin           Data%  Meta%  Move Log Cpy%Sync Convert
  data              pve twi-aotz-- <794.79g                       3.38   0.40                            
  root              pve -wi-ao----   96.00g                                                              
  swap              pve -wi-ao----    8.00g                                                              
  vm-100-disk-0     pve Vwi-aotz--    8.00g data                  14.32                                  
  vm-101-disk-0     pve Vwi-a-tz--   10.00g data                  9.40                                   
  .
  .
  .
  zcache            pve -wi-ao----   10.00g                                                              
  zlog              pve -wi-ao----    5.00g                                                              

These logical volumes, including the vm-disk volumes, can take advantage of all of the LVM features such as snapshots. Take note of the zcache and zlog volumes; I will explain those later.
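Because the pool is thin provisioned, the Data% column is what shows how much space is really consumed. As a rough sketch, the arithmetic can be done with a small helper. The name thin_used_gib is hypothetical, and the inputs are simply the LSize and Data% values copied from the lvs output above.

```shell
#!/bin/sh
# thin_used_gib SIZE_GIB DATA_PERCENT -> GiB actually allocated in a thin volume.
# Hypothetical helper; inputs are the LSize and Data% columns from lvs.
thin_used_gib() {
    awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", size * pct / 100 }'
}

# The "data" thin pool: <794.79g at 3.38% used.
thin_used_gib 794.79 3.38
# vm-100-disk-0: 8.00g at 14.32% used.
thin_used_gib 8 14.32
```

With the numbers shown, the pool has only about 26.86 GiB allocated even though the volumes inside it add up to far more on paper.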

On the QNAP, I have a folder, available by NFS, which contains the iso images that can be mounted within virtual machines as a CD-ROM device. To make it available for Proxmox to use, I added it like this:

nfs: qnap-iso
	export /iso
	path /mnt/pve/qnap-iso
	server 192.168.xx.xx
	content iso
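The same entry should also be creatable from the command line with pvesm instead of editing storage.cfg. The sketch below mirrors the stanza above. The run wrapper is a hypothetical safety helper that only prints the command unless DRY_RUN=0 is set, and the server address is the same placeholder used in the stanza.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0.
# run is a hypothetical wrapper so the sketch is safe to try anywhere.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Same values as the storage.cfg stanza; substitute the real NAS address.
run pvesm add nfs qnap-iso \
    --server 192.168.xx.xx \
    --export /iso \
    --path /mnt/pve/qnap-iso \
    --content iso
```

Running it with DRY_RUN=0 on the Proxmox host would apply the change, after which `pvesm status` should list the new storage.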

The last piece to configure is a large pool of storage from the QNAP, which will be used for any critical data because it's on two physical disks which are mirrored. First, I created an iSCSI LUN on the QNAP which is thinly provisioned from the primary storage pool of mirrored disks.

Next, I need to install the open-iscsi package to gain access to the iscsiadm command:

# apt-get install open-iscsi

In iSCSI (SCSI over TCP/IP), the client is called the initiator, the storage endpoint exported by the server is the target, the portal is the IP address and port the server listens on, and a LUN is a logical unit presented by a target. The initiator name is set in /etc/iscsi/initiatorname.iscsi. While you can use username/password (CHAP) authentication, I had trouble getting it to work reliably with multipath, so within my private LAN I'm relying only on the client presenting the correct initiator name for access control.

The first step is to query the iSCSI server to find out which targets are available:

# iscsiadm -m discovery -t st -p 192.168.xx.xx

If you will be using CHAP or configuring other options, edit /etc/iscsi/nodes/<targetname>/<portal>/default as needed. Once everything is working, change the node.startup option from manual to automatic so the target is available after reboots.
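As an alternative to editing the node file by hand, iscsiadm can update a single setting in the node record. The sketch below keeps the target name and portal as the same placeholders used in this post, and wraps the call in a hypothetical run helper that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0.
# run is a hypothetical wrapper, not part of open-iscsi.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Placeholders: replace with the discovered target name and portal address.
TARGET="iqn.<target-name>"
PORTAL="192.168.xx.xx"

# Switch the node record from manual to automatic login at boot.
run iscsiadm -m node --targetname "$TARGET" --portal "$PORTAL" \
    --op update -n node.startup -v automatic
```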

Next, I need to open a session to the target and portal I want:

# iscsiadm -m node --targetname "iqn.<target-name>" --portal "192.168.xx.xx" --login

If successful, a new block device will show up in the lsblk output. For me, there were two, /dev/sda and /dev/sdb, because the QNAP has two network interfaces with their own IP addresses:

NAME                            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                               8:0    0     1T  0 disk  
sdb                               8:16   0     1T  0 disk  

Next I need to install multipath-tools and configure multipath by editing /etc/multipath.conf:

# apt install multipath-tools

Before we edit multipath.conf, we need the unique ID (WWID) of our target so we can block all devices from multipath control except the ones we want:

# /lib/udev/scsi_id -u -g /dev/sda
36e843b6f3f12999d1c84d4130db5e3de

I looked at a lot of different samples and ways to configure /etc/multipath.conf, but here is what worked for me:

defaults {
        polling_interval        2
        path_selector           "round-robin 0"
        path_grouping_policy    multibus
        uid_attribute           ID_SERIAL
        rr_min_io               100
        failback                immediate
        no_path_retry           queue
        user_friendly_names     yes
}
blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(td|hd)[a-z]"
        devnode "^dcssblk[0-9]*"
        devnode "^cciss!c[0-9]d[0-9]*"
        device {
                vendor "DGC"
                product "LUNZ"
        }
        device {
                vendor "EMC"
                product "LUNZ"
        }
        device {
                vendor "IBM"
                product "Universal Xport"
        }
        device {
                vendor "IBM"
                product "S/390.*"
        }
        device {
                vendor "DELL"
                product "Universal Xport"
        }
        device {
                vendor "SGI"
                product "Universal Xport"
        }
        device {
                vendor "STK"
                product "Universal Xport"
        }
        device {
                vendor "SUN"
                product "Universal Xport"
        }
        device {
                vendor "(NETAPP|LSI|ENGENIO)"
                product "Universal Xport"
        }
}
blacklist_exceptions {
	wwid "36e843b6f3f12999d1c84d4130db5e3de"
}
multipaths {
        multipath {
                wwid "36e843b6f3f12999d1c84d4130db5e3de"
                alias iproxmox
        }
}
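The two WWID-specific stanzas at the end of the file are the only parts that change per device, so they could be generated rather than typed. This is just a sketch: multipath_stanzas is a hypothetical helper that prints text and does not touch /etc/multipath.conf. On the host, the WWID argument would come from the scsi_id command shown earlier.

```shell
#!/bin/sh
# Print the multipath.conf stanzas that whitelist and name one device.
# multipath_stanzas WWID NAME is a hypothetical helper; it only prints text.
multipath_stanzas() {
    wwid=$1
    name=$2
    cat <<EOF
blacklist_exceptions {
        wwid "$wwid"
}
multipaths {
        multipath {
                wwid "$wwid"
                alias $name
        }
}
EOF
}

# WWID and alias taken from this post.
multipath_stanzas 36e843b6f3f12999d1c84d4130db5e3de iproxmox
```

Note that every opening brace has a matching close; an unbalanced multipaths section is an easy mistake to make when editing by hand.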

After restarting multipath so it could refresh its configuration, it now sees one device with two paths:

# multipath -ll
iproxmox (36e843b6f3f12999d1c84d4130db5e3de) dm-6 QNAP,iSCSI Storage
size=1.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 12:0:0:0 sda     8:0   active ready running
  `- 13:0:0:0 sdb     8:16  active ready running

Run lsblk again:

NAME                            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                               8:0    0     1T  0 disk  
└─iproxmox                      253:6    0     1T  0 mpath 
sdb                               8:16   0     1T  0 disk  
└─iproxmox                      253:6    0     1T  0 mpath 

Now we can treat this device (/dev/mapper/iproxmox) like any other block device. We can partition, create filesystems, and mount them.

Instead, I created a ZFS pool on it:

# zpool create proxmoxqnap iproxmox

ZFS is both a filesystem and a logical volume manager. You can create logical volumes, mount them, even add additional devices to create mirrors and RAID configurations. That's already done (mirrored) by QNAP so I don't need to do anything else but add the pool to Proxmox:

zfspool: qnap-zfs
	pool proxmoxqnap
	content images,rootdir
	mountpoint /proxmoxqnap

I mentioned earlier that I wanted the local disk to act as a cache for the QNAP storage. Remember the zcache and zlog volumes on LVM?

# lvcreate -L 10G -n zcache pve
# lvcreate -L 5G -n zlog pve

ZFS has two different kinds of cache devices. Synchronous writes are cached by recording them on a ZFS Intent Log (ZIL) device, which is the role of the zlog volume. The write succeeds as soon as it reaches the log and is written to the pool devices later. If the log device fails before that data reaches the pool, the data can be lost, so the log device can be mirrored if needed. Reads are cached by a cache device.

# zpool add proxmoxqnap log /dev/mapper/pve-zlog
# zpool add proxmoxqnap cache /dev/mapper/pve-zcache

Now our ZFS pool has both a read and a write cache device on local fast storage:

# zpool status
  pool: proxmoxqnap
 state: ONLINE
  scan: none requested
config:

	NAME          STATE     READ WRITE CKSUM
	proxmoxqnap   ONLINE       0     0     0
	  iproxmox    ONLINE       0     0     0
	logs	
	  pve-zlog    ONLINE       0     0     0
	cache
	  pve-zcache  ONLINE       0     0     0

errors: No known data errors

Before I move any volumes to the new pool, I'd better check the performance with fio:

random_rw: (groupid=0, jobs=1): err= 0: pid=467849: Sun Mar 23 07:11:19 2020
  read: IOPS=1757, BW=7029KiB/s (7198kB/s)(512MiB/74567msec)
    clat (nsec): min=1152, max=14238M, avg=116314.93, stdev=39331283.12
     lat (nsec): min=1182, max=14238M, avg=116345.84, stdev=39331283.14
    clat percentiles (nsec):
     |  1.00th=[  1496],  5.00th=[  1608], 10.00th=[  1720], 20.00th=[  1944],
     | 30.00th=[  2040], 40.00th=[  2128], 50.00th=[  2192], 60.00th=[  2288],
     | 70.00th=[  2480], 80.00th=[  2832], 90.00th=[ 30336], 95.00th=[ 31616],
     | 99.00th=[ 51456], 99.50th=[103936], 99.90th=[134144], 99.95th=[148480],
     | 99.99th=[181248]
   bw (  KiB/s): min= 6576, max=167264, per=100.00%, avg=99458.30, stdev=59612.78, samples=10
   iops        : min= 1644, max=41816, avg=24864.50, stdev=14903.12, samples=10
  write: IOPS=1758, BW=7033KiB/s (7202kB/s)(512MiB/74567msec); 0 zone resets
    clat (usec): min=2, max=16786k, avg=451.80, stdev=80780.30
     lat (usec): min=2, max=16786k, avg=451.84, stdev=80780.30
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    6], 90.00th=[   33], 95.00th=[   35],
     | 99.00th=[   97], 99.50th=[  125], 99.90th=[  172], 99.95th=[  202],
     | 99.99th=[  289]
   bw (  KiB/s): min= 6264, max=168104, per=100.00%, avg=99341.60, stdev=59532.43, samples=10
   iops        : min= 1566, max=42026, avg=24835.40, stdev=14883.11, samples=10
  lat (usec)   : 2=12.31%, 4=57.12%, 10=13.37%, 20=0.07%, 50=15.78%
  lat (usec)   : 100=0.62%, 250=0.73%, 500=0.01%
  cpu          : usr=0.18%, sys=2.99%, ctx=906, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=131040,131104,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=7029KiB/s (7198kB/s), 7029KiB/s-7029KiB/s (7198kB/s-7198kB/s), io=512MiB (537MB), run=74567-74567msec
  WRITE: bw=7033KiB/s (7202kB/s), 7033KiB/s-7033KiB/s (7202kB/s-7202kB/s), io=512MiB (537MB), run=74567-74567msec
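The summary numbers can be cross-checked by hand: bandwidth is the number of I/Os times the block size, divided by the runtime. The sketch below uses the "issued rwts" totals and the runtime from the output above. The 4 KiB block size is an inference (7029 KiB/s divided by 1757 IOPS is about 4), not something the output states directly, and bw_kib_s is a hypothetical helper name.

```shell
#!/bin/sh
# bw_kib_s IOS BLOCK_KIB RUNTIME_MS -> average bandwidth in KiB/s.
# Hypothetical helper for cross-checking a fio summary by hand.
bw_kib_s() {
    awk -v ios="$1" -v bs="$2" -v ms="$3" \
        'BEGIN { printf "%.0f\n", ios * bs / (ms / 1000) }'
}

# "issued rwts: total=131040,131104" over run=74567msec.
# The 4 KiB block size is inferred from BW / IOPS, not reported by fio here.
bw_kib_s 131040 4 74567   # reads
bw_kib_s 131104 4 74567   # writes
```

The results, 7029 and 7033 KiB/s, match the READ and WRITE lines in the run status group.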

Now that we have a fully operational Proxmox server, it's time to create some VMs!