
January 07, 2004

Solaris Performance

What is it about PHC (Pointy Haired Consultants*)? They know how to say exactly the things that appeal to the PHB (Pointy Haired Boss*) even if what they say makes no sense.

I am thinking all the geeks out there need to start a collective campaign to explain the realities of performance measurement when it comes to Solaris. When you start talking about bad performance, what is the very first question they ask? The PHC will ask exactly the same question!

No idea? If you have been in this situation you know that performance is made up of many factors. Normally we can look at both the throughput (rate of activity) and the capacity (finite limit imposed by hardware). These are not always the same thing. For example, with a filesystem we care about the amount of data (capacity) and the rate of change (I/O throughput). As well as disk I/O you need to worry about three other major subsystems: network, memory and CPU. This is all basic stuff, and most geeks would be aware that when you start to deal with multi-CPU boxes there is no trivial single number that can describe the overall performance. The PHB, on the other hand, wants that one number.

Fortunately, in the meeting yesterday, the PHB was absent. We had an internal architect, some operations people and a single PHC who was ducking and weaving to avoid any possible criticism of his product. Fair enough, that is his job, BUT when bad performance was mentioned he trotted out the old standard PHB question: What is the CPU%?

CPU% - who gives a toss. Let me see, every time I measure it, the CPU% is 100%. Of course it is, because for me to measure it I have to be running on it! [Yes, that is an exaggeration but let me continue with some poetic licence here.] In this case we are talking about a 4-CPU server (running Solaris). It is possible to have CPU% close to 100% and yet have a box performing beautifully, and it is equally possible to have CPU% close to 0% with abysmal performance. So stop already with the CPU%.

What can we use instead? Well, I think the best number to look at (if you insist on a single measure) would be the run queue (use 'uptime' to see the load averages, which are the run queue length averaged over the last 1, 5 and 15 minutes). This is a measure of the backlog of work. If the runq is increasing then the processors are not coping with the workload currently running on the server. If the runq is falling you have spare capacity. Nice and easy, but it is not measured as a percentage so bosses don't understand it. Well, the old rule of thumb I used is to look at the number of processors - say 'n'. If the runq is less than 'n+1' then the box is under-loaded. If the runq falls between 'n+1' and '(2 * n) + 1' then the box is loaded, and over '(2 * n) + 1' it is straining at the seams. To convert to a percentage, try something like 'runq / (2 * n) * 100' and you have a number that makes the PHB happy and should divert the PHC. In the meantime, as a technician you can go and check mpstat, which is an even better tool - check that your processors have a similar run pattern and if not, go hunt down the PHC who supplied the java code that is causing the problem :-)
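
For anyone who wants to hand the PHB that number without doing the arithmetic by hand, here is a rough Python sketch of the rule of thumb above. The function names and the wording of the labels are mine (hypothetical, not from any tool), and I am assuming the 1-minute load average from 'uptime' is a fair stand-in for the run queue. Treat it as an illustration of the thresholds, not a monitoring tool.

```python
import os


def classify_run_queue(runq: float, n_cpus: int) -> str:
    """Apply the rule of thumb: compare the run queue to n+1 and (2*n)+1."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    if runq < n_cpus + 1:
        return "under-loaded"
    if runq <= 2 * n_cpus + 1:
        return "loaded"
    return "straining at the seams"


def run_queue_percent(runq: float, n_cpus: int) -> float:
    """Boss-friendly percentage: runq / (2 * n) * 100. It can exceed 100."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return runq / (2 * n_cpus) * 100


if __name__ == "__main__":
    # Assumption: the 1-minute load average approximates the run queue.
    n = os.cpu_count() or 1
    try:
        one_min, _five_min, _fifteen_min = os.getloadavg()
    except (AttributeError, OSError):
        # Load averages are not available on every platform.
        print("load average not available here")
    else:
        label = classify_run_queue(one_min, n)
        pct = run_queue_percent(one_min, n)
        print(f"runq={one_min:.2f} cpus={n} -> {label} ({pct:.0f}%)")
```

On the 4-CPU box in question, a runq of 4 comes out as "under-loaded" at 50%, and anything over 9 is "straining at the seams". Note that the percentage passes 100% at a runq of 8, slightly before the "straining" threshold, which is fine for a number whose only job is to keep the PHB happy.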

* The PHB (and by derivation the PHC) are inspired by Dilbert.

Posted by Ozguru at January 7, 2004 07:01 AM


Comments


What is a consultant? A person who borrows your watch to tell you the time and charges you for the privilege!

Posted by: ozguru at January 7, 2004 07:01 AM

Project Management vs Software Engineers You may (or may not) recall that I have been working on a performance problem (see this post and this post for details). Yesterday I came across a reference (now unfortunately lost) to this article which is really relevant (as you will see in a minute)...

Posted by: GDay Mate at January 7, 2004 07:01 AM