Working with High Performance Computing resources (scale up vs. scale out in the Microsoft World)
This week I got the chance to work with some x86 High Performance Computing hardware at the Donald Smits Center for Information Technology (CIT) at the University of Groningen.
Their hardware typically packs more than 64 logical processors and terabytes of RAM; 10 Gb/s interconnects and plenty of Solid State Disks (SSDs) are not uncommon in their daily routines either. I had the chance to work with systems like Dell’s PowerEdge R910 (G11) and HP’s ProLiant DL980, which were labeled as spare hardware. At roughly €40k apiece, this is no cheap pile of iron.
Installing Windows Server on these systems is easy, since Windows Server Datacenter has supported hardware with 256 logical processors and 2 TB of RAM for almost three years now. (Windows Server 2008 R2 RTM’d on July 22, 2009.)
High Performance Computing hardware shines in ‘number crunching’ and ‘database’ scenarios. In the first scenario, a multithreaded application analyzes data on the hardware; commonly this data amounts to petabytes per week or month. In the second scenario, the hardware serves as a database solution.
A third viable scenario would be a highly efficient x86 virtualization solution. You can build this on top of Windows Server 2012 Datacenter Edition with Hyper-V. With Hyper-V guests now capable of addressing 32 logical processors and 1 TB of RAM each, running a couple of these VMs can easily justify the purchase of typical High Performance Computing (HPC) hardware.
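The sizing here is simple division. A quick back-of-the-envelope sketch (my own illustration, using the host and guest figures mentioned above, not vendor-published capacity guidance):

```python
# Back-of-the-envelope VM capacity for one scale-up host.
# Figures from the text: a host with 256 logical processors and 2 TB RAM;
# a maxed-out Hyper-V guest with 32 logical processors and 1 TB RAM.
HOST_LPS, HOST_RAM_TB = 256, 2
GUEST_LPS, GUEST_RAM_TB = 32, 1

by_cpu = HOST_LPS // GUEST_LPS        # 8 maxed-out guests fit by CPU
by_ram = HOST_RAM_TB // GUEST_RAM_TB  # only 2 fit by RAM

# RAM, not CPU, is the bottleneck for maxed-out guests.
print(min(by_cpu, by_ram))  # prints 2
```

In other words, "a couple of these VMs" is literal: two maxed-out guests exhaust the host's RAM while using only a quarter of its logical processors.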
I would advise against using HPC hardware as virtualization hosts, though.
First, let me address the number-crunching use cases: for most of these, Microsoft offers Windows High Performance Computing (HPC) Server. In this version of Windows Server, a large number of standard servers (“compute nodes”) are combined, under the supervision of a “head node” server, into a high-performance cluster (with or without the help of additional “broker nodes” and job schedulers). The method used here is scale-out. If you have serious High Performance Computing needs, you can use lots of HP DL980s as compute nodes.
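The scale-out model works because number-crunching jobs split into independent chunks. A minimal Python sketch of the idea (worker processes stand in for compute nodes here; this is my own toy illustration, not the Windows HPC Server API):

```python
from multiprocessing import Pool

def analyze_chunk(chunk):
    # Toy "analysis" job: each compute node (here, a worker process)
    # crunches one independent chunk of the data set.
    return sum(x * x for x in chunk)

def head_node(data, nodes=4, chunk_size=1000):
    # The head node splits the workload into chunks and schedules them
    # across the compute nodes, roughly what a job scheduler does.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(nodes) as pool:
        partial_results = pool.map(analyze_chunk, chunks)
    # Finally, the partial results are combined into one answer.
    return sum(partial_results)

if __name__ == "__main__":
    print(head_node(list(range(10_000))))
```

Because the chunks don't depend on each other, you add capacity by adding nodes; no single box needs 64+ processors, which is exactly the opposite of the scale-up approach.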
Now, for virtualization. Although a single HP DL980 server is more than capable of running the VMs for a typical office automation setup, this is not ideal for the following reasons:
- When the server malfunctions, all VMs malfunction or go offline. This can be solved by adding a second server plus shared storage and transforming the two servers into a Windows Cluster. To avoid a “Dutch Cluster”, though, the second node must remain passive so the VMs can fail over or replicate onto it. That wastes a terrifying amount of resources. With more, smaller nodes, a smaller share of the resources is needed for redundancy, and thus a smaller share is wasted. Also, the Dell R910 we saw recently encountered memory problems and ran more safely in memory-redundancy mode; more DIMMs in use simply increases the chance of one of them failing.
- You might assume a bigger piece of iron scales like a smaller piece of iron, but it doesn’t. A server like the HP DL980 uses the same type of RAM (DDR3) as its little brothers. It simply takes a