During my day job, I support a number of developers, QA folks, etc. Another day, living the dream. Twice very recently, I have been accused of providing substandard equipment due to slow script or program performance. Great, now I have to spend 45 minutes schooling an alleged professional.
First, a Little Background
I am the administrator of several virtualized environments. These hypervisors run on top-of-the-line hardware, with great storage on the back end and huge network pipes between all the right components. A lot of work has gone into these platforms, and they function pretty great, if I do say so myself.
Recently, I received complaints from two particular QA folks that certain scripts they were running in their virtual machines were not running as expected. For example, when running the script from the local machine, 100 iterations could be achieved in a one-hour period. Running the same script on the virtual machine would only achieve 4 (four) iterations in eight hours. That works out to roughly 200 times slower. *sigh*
When I get these kinds of complaints, I usually take them with a grain of salt. Devs are notorious for doing strange things to their test machines. These acrobatic tricks they perform are not allowed on staging or production machines, and they don’t have access anyway. I’ve built the virtual environments for the test/dev folks similar to the production environments, with the only change being their access to logging tools, and the ability to deploy specific packages to the VM. So, I knew that it wasn’t likely that something had been pooched on the VM. Besides, 74 other folks using the same image were having no problems.
So I do all the usual perfmon investigation, and nothing is standing out with CPU, RAM, or disk utilization. However, I notice that the network utilization is maxed out.
The Plot Thickens
When I ask the owner of the VM what the script is doing, and where it is reaching out to, he swears that it is not leaving the machine. Everything is on the local machine. Hmmmmm. Every time the script is started, network utilization spikes for that one virtual machine.
Since the code jockey is not very forthcoming, I crack open the PowerShell and inspect the script myself. It’s not a very complicated script; it simply reads a directory of files, shells out to a command prompt, and executes each file it has found in a serial fashion. Nothing magical.
The first thing I notice is that indeed a remote NAS share is specified as the data source directory for this script. This is key. Here is why:
The virtual infrastructure is configured as a cloud. This means that the clusters of hypervisors are geographically dispersed. In this case, the virtual machine actually lived in the Northwest US, while the dev who was blaming the hardware worked 2,000 miles away in the Midwest. Also located at the Midwest office was the NAS share that was being accessed by the script. Are you seeing where I am going with this?
The Big Picture
In this picture, you can see Site A (where the virtual machine lives) and Site B (where the dev and the NAS live).
Also in this picture are color-coded lines for network connections:
- The green lines define 1 Gbps LAN connections, with less than 1 ms of latency.
- The purple lines equal 20 Gbps LAN connections, with less than 1 ms of latency.
- The red line represents a 100 Mbps WAN connection, with approximately 77 ms of latency.
So when Mr. Developer runs his script from his workstation in Site B, it will run very well, since it is accessing the NAS from within his site, at LAN speeds.
But when the script is run from the virtual machine located in Site A, it will run orders of magnitude slower, due to the introduction of the lower-speed, higher-latency WAN connection. The script was reading a directory of files and then running each shelled-out command against the NAS in Site B, all the way from Site A.
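To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The latency and bandwidth figures come from the links described above (I use 1 ms as the upper bound for the LAN). The file count, file size, and number of round trips per file are made-up illustration values, not measurements from the actual script, and `transfer_seconds` is a hypothetical helper of my own.

```python
# Back-of-the-envelope model: time for a script to pull files over a link.
# Link figures come from the diagram; file count, file size, and round trips
# per file are ASSUMED illustration values, not measurements.

def transfer_seconds(files, bytes_per_file, round_trips_per_file,
                     latency_s, bandwidth_bps):
    """Estimate total time as latency cost plus raw transfer cost."""
    latency_cost = files * round_trips_per_file * latency_s
    payload_bits = files * bytes_per_file * 8
    return latency_cost + payload_bits / bandwidth_bps

# Hypothetical workload: 500 files of 1 MB each, 200 round trips per file.
FILES, SIZE, TRIPS = 500, 1_000_000, 200

# Site B workstation to Site B NAS: 1 Gbps, at most 1 ms.
lan = transfer_seconds(FILES, SIZE, TRIPS, latency_s=0.001, bandwidth_bps=1e9)
# Site A virtual machine to Site B NAS: 100 Mbps, about 77 ms.
wan = transfer_seconds(FILES, SIZE, TRIPS, latency_s=0.077, bandwidth_bps=100e6)

print(f"LAN: {lan:.0f} s, WAN: {wan:.0f} s, slowdown: {wan / lan:.0f}x")
```

With these assumed values, nearly all of the WAN time is spent waiting on round trips rather than moving bytes. A chatty, serial script pays the 77 ms over and over, which is why a fatter pipe alone would not have saved it.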
So, in the end, nothing was wrong with the hardware (they claimed low CPU speed), and nothing wrong with the VM (they claimed not enough cores). The problem turned out to be someone writing a script with no knowledge of what a network is.
How Did I Resolve the Problem?
Yes, there is a solution for stupid. At least in this case. When I built this dev private cloud, both sites were members of the cloud, with Hyper-V clusters available in both locations. Since both sites were members of the dev private cloud, it was a simple matter of moving the virtual machine to Site B. Now, since both the script and the data lived within the same site, it could take advantage of the LAN speeds within. Actually, it probably runs quite a bit faster, since it is all running on the higher speed server network.
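A quick pre-flight check could have flagged the remote data source before anyone blamed the hardware. The sketch below is not what I ran; it is a minimal Python illustration, and the helper name `is_unc_path` and both example paths are hypothetical. It only recognizes UNC-style paths, so a share hidden behind a mapped drive letter would slip past it.

```python
from pathlib import PureWindowsPath

def is_unc_path(path: str) -> bool:
    """Return True when a Windows path points at a UNC network share."""
    # PureWindowsPath parses Windows syntax on any OS. For a UNC path the
    # "drive" component is the \\server\share prefix; for a local path it
    # is a drive letter such as C:.
    return PureWindowsPath(path).drive.startswith("\\\\")

# Hypothetical paths, for illustration only.
for candidate in (r"\\nas01\qa\testcases", r"C:\qa\testcases"):
    label = "remote share" if is_unc_path(candidate) else "local path"
    print(f"{candidate}: {label}")
```

If the data source comes back as a remote share, the next question is which site that share lives in, and whether the machine running the script lives there too.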
Why am I going through the trouble of writing this all down? I am hoping that in some small way, somehow, some developer or software tester attempting to automate his testing will stop and take a minute to look at what he has done, before being embarrassed by a system engineer in front of his bosses, whom he included on the email chain.
I am sure they will find something else to complain about.