The purpose of "Open Forum" is to provide a forum for SPEC members and newsletter subscribers to discuss issues, controversial or otherwise, related to SPEC's purpose of providing performance evaluation benchmarks and information.

Articles in the "Open Forum" are the opinions of the authors and do not reflect the official position of SPEC, its board of directors, committees, or member companies.

Open Forum

Lies, Damn Lies, and Benchmarks

What Does One Have To Do To Find Performance Truth?

By Alexander Carlton
Hewlett-Packard
Cupertino, Calif.

Published December, 1994; see disclaimer.

The Performance Problem
What's Your Size?
Getting Help
Conclusion

It has been said that there are three classes of untruths, and these can be classified (in order from bad to worse) as: Lies, Damn Lies, and Benchmarks. Actually, this view is a corollary to the observation that "Figures don't lie, but liars can figure..." Regardless of the derivation of this opinion, criticism of the state of performance marketing has become common in the computer industry press.

The level of media attention is a reflection of how computer performance has become a growing concern for virtually everyone. Computers are becoming ubiquitous, and as such they are becoming a significant part of any company's budget -- and in today's competitive climate every significant budget item is being closely monitored. Buying too little computing power can seriously limit the ability to get the job done. However, buying too much can raise the cost of the job above where it is effective. Thus, there is great interest in determining just how much performance can be expected from any given computer system.

This interest in performance has not gone unnoticed by the computer vendors. Just about every vendor promotes their product as being faster or having better "bang for the buck." All of this performance marketing begs the question: "How can these competitors all be the fastest?" The truth is that computer performance is a complex phenomenon, and who is fastest all depends upon the particular simplifications being employed to present a particular simplistic conclusion.

Still, when trying to buy a computer system, many people feel like they are surrounded by used-car dealers. The truth is often buried under layers and layers of hype. The solution, just as in dealing with car dealers, is to be an informed consumer. Know your own needs and learn about how to find real data rather than marketing without content.

Performance groups like SPEC (and its siblings TPC, BAPco, et al.) can be a source for a great deal of performance data. There is a lot of information available from these results, especially if one digs into the raw data and the disclosures rather than focusing only on the summary metrics. The standardized benchmarks provide openly defined competition within predefined boundaries where the contestants have to disclose at least what features they relied upon to reach their results. The vendors, in their efforts to put out leading results, also release a lot of information in the full disclosures about how they achieved those results. Unfortunately, no standardized test can be expected to provide individualized answers for every particular configuration. In the end, the usefulness of any performance data will be determined by how well you understand your own system and appreciate which pieces of data are relevant and which are superfluous.

The Performance Problem

Just as a chain is only as strong as its weakest link, so too is a system's performance limited by its most severe bottleneck. Maximizing system performance involves attempting to distribute a workload across an interconnected set of chains and examining each link in the system to ensure that it can handle the strain at least as well as its neighbors. If a workload puts too much strain in any one link, it may be possible to adjust the interconnections to spread the weight across other chains and hence be able to manage to carry the load. A system's performance is maximized when it is not possible to add any more work without some link being unable to handle the strain.

The particular problem in competitive performance is that each system has its own set of chains, each with differing weak links. The number of performance-sensitive links in any given system probably runs into the thousands, but the single greatest performance factor is the workload. Some workloads may run fine on one system, but might break links all over another system. Meanwhile some other workloads will shatter the first system but hardly stress some of the others. Sometimes, very small differences in a workload can have great impact on whether certain links are overstressed or not.
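The chain analogy can be made concrete with a small sketch. It assumes a deliberately simplified model that is not from any SPEC benchmark: each system offers a fixed capacity per resource, each job consumes a fixed amount of each resource, and the sustainable job rate is set by whichever resource runs out first. The system names, workload names, and all numbers are hypothetical.

```python
def max_throughput(capacity, demand_per_job):
    """Return (jobs_per_second, bottleneck_resource) for one workload.

    capacity:       resource name -> units of service available per second
    demand_per_job: resource name -> units of service one job consumes
    """
    # Each resource alone could sustain capacity / demand jobs per second.
    limits = {
        resource: capacity[resource] / demand
        for resource, demand in demand_per_job.items()
        if demand > 0
    }
    # The weakest link (smallest limit) caps the whole system.
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck


# Hypothetical systems and workloads, for illustration only.
system_x = {"cpu": 100.0, "disk": 40.0}
system_y = {"cpu": 60.0, "disk": 120.0}
batch_jobs = {"cpu": 2.0, "disk": 0.5}      # compute-heavy
database_jobs = {"cpu": 0.5, "disk": 2.0}   # I/O-heavy

for workload_name, workload in (("batch", batch_jobs), ("database", database_jobs)):
    for system_name, system in (("system_x", system_x), ("system_y", system_y)):
        rate, link = max_throughput(system, workload)
        print(f"{workload_name:8s} on {system_name}: {rate:.0f} jobs/s, limited by {link}")
```

With these made-up numbers, system_x wins on the compute-heavy workload (50 versus 30 jobs per second, both limited by CPU) while system_y wins on the I/O-heavy one (60 versus 20, both limited by disk). Neither system is "the fastest" until the workload is named.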

What's Your Size?

One size does not fit all. At one time, manufacturers like Henry Ford covered an entire market by selling one model available only in basic black. Soon Ford was selling truck configurations as well as Model-Ts, and nowadays there are dozens of basic types (coupes, sedans, pickups, vans, on up to tractor-trailers) each of which has many variations. The point is that, as a technology matures, it necessarily evolves into many forms, each adapting to compete in a particular market segment. The buyer then is free to choose whichever is the best fit -- but that choice requires the buyer to have some basic ideas about what makes a good individual fit.

Evaluating systems may not be as simple as it might seem. You might be able to tell if a car has serious problems by just driving it around the block. But, to get a good understanding of how the car can perform takes an open road and a fair bit of time. Because computer performance is so very dependent on the workload, it is essential that your tests carefully measure what is actually important for your work situation. It is easy to make something look good in the showroom, but to get a feel for how it might handle the real world does take some investigative experience.

The first step is to get a good understanding of your own workload. What are the criteria which define good performance? What are the significant aspects of your workload? What subsystems do you commonly stress? What trade-offs can you afford to make and which elements are not negotiable? No matter how you will go about getting your performance information, it is essential that first you understand how your workload behaves and what things can affect its behavior.

For example, let's say you are looking for a compute server. Would you be looking for a fast turnaround time for each single job or is the total throughput more important? Does your workload have any significant I/O component? It might make a big difference if that I/O is typically random or serial in nature. Does that I/O come in bursts or is there a steady number of requests outstanding? Does your workload fit easily into memory or do you often require paging? Is your workload interactive in nature requiring that certain maximum response times be satisfied, or can the tasks be scheduled for optimal throughput at the expense of some responsiveness? These are just a few questions for this example corresponding to the issues outlined above.
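The turnaround-versus-throughput question can be illustrated with a minimal sketch. It assumes single-threaded, purely CPU-bound jobs with no contention between them, which real workloads rarely satisfy; the `Server` class, the two machines, and the job size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cpus: int
    speed: float  # work units per second, per CPU (hypothetical unit)

    def turnaround(self, job_work: float) -> float:
        """Seconds to finish one job on an otherwise idle machine."""
        return job_work / self.speed

    def throughput(self, job_work: float) -> float:
        """Jobs per second when every CPU is kept busy with its own job."""
        return self.cpus * self.speed / job_work


# Hypothetical machines and job size, for illustration only.
fast_single = Server("fast_single", cpus=1, speed=4.0)
wide_multi = Server("wide_multi", cpus=4, speed=1.5)
JOB_WORK = 12.0

for server in (fast_single, wide_multi):
    print(
        f"{server.name}: {server.turnaround(JOB_WORK):.1f} s per job, "
        f"{server.throughput(JOB_WORK):.2f} jobs/s"
    )
```

Under these assumptions the single fast processor finishes any one job sooner (3 seconds versus 8), while the four slower processors complete more jobs per second overall (0.50 versus about 0.33). Which machine is "faster" depends on which of the two questions matters for your work.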

Once you have an understanding of your workload, then you can try to create a benchmark to test it. Typically it would be too difficult and inefficient to attempt to exactly reproduce a specific workload. A benchmark is usually an abstraction of a workload. It is necessary, in this process of abstracting a workload into a benchmark, to capture the essential aspects of the workload and represent them in a way that maps accurately. One must be careful that the benchmark carries something to represent each significant aspect of the workload and that the resulting benchmark reflects changes in the operating environment in accordance with how the real workload would behave. It is easy in the abstracting process for the desire for simplicity to drop out factors which may have dramatic effects on the real workload. It is even more common for the abstract benchmark to have some aspects out of balance with respect to the real workload, and hence a system might behave differently running the benchmark than if it ran the real workload.

The most difficult step in developing a benchmark is ensuring that the result really does measure what you want it to. This can be especially difficult in situations where some of the systems tested can have very different architectures. What might be a close model of some environment on one system may involve assumptions that are inappropriate, or even simply impossible, on another. One solution to this dilemma is to avoid codifying the benchmark in a rigid implementation, but instead to define a problem to be solved and to allow any valid solution to be tested. The new dilemma then becomes determining if the provided solutions really do pertain to your problem. On the other hand, limiting the means of satisfying the benchmark can force certain systems into unrepresentative configurations which might unfairly bias the benchmark's results even if those systems would be quite capable of carrying the workload in question.

Thus, as part of understanding your workload, it is also important to appreciate the degrees of freedom available in your environment. Prior investments might have tied you to some particular piece of hardware or software, even to a particular version or release. The level of expertise of your users and administrators may determine whether some performance features are useful or unrealistic. Recognize that these limitations will have repercussions on which configurations eventually prove to be effective. Again, the more you know about your workload, the more likely the final fit will be comfortable.

Getting Help

Unfortunately, not everyone has the time and the resources to develop their own benchmarks. Even for those who do have the necessary resources, it is probably not reasonable to run these tests on all available machines. Thus, it is common to look for performance information from other sources.

However, it must first be admitted that no other source is going to provide data for your exact workload. One of the primary tasks in looking for performance data is determining what parts of the available data are pertinent to your needs. This means that it is still important to understand your own workload; then you can seek out those data points which are relevant to your situation. Perhaps only a subset of the results have been run in configurations applicable to your environment; perhaps the summary metrics are not relevant to you, but some of the individual data points could be useful. Perhaps the system-level results rely upon some options or packages that you will not utilize; still, the component-level results (because their limited scope avoids many dependencies) could provide some useful comparison points. Understanding your workload is key to being selective and able to weight the available data in accordance with its relevance to your needs.

There are basically two categories for sources of performance data: analysts and vendors. The analysts are individuals or groups who are selling their insight rather than a particular product. They can be academics, consultants, or members of the press. On the other hand, the vendors, in order to promote their products, are commonly providing a wealth of performance data; some of this data comes from internal experiments or estimates. Some of it comes from the more accessible standardized testing.

The analysts' data is typically available in the form of reviews of certain products or product families. The strengths of these sources are the independence and the breadth of the investigations by these analysts. With analyst-supplied data one can assume that one will get a reasonably full disclosure of the tester's experience. An analyst will report everything of note about a system, whether good or bad. However, the degree to which a system under test has been tuned and is free of avoidable bottlenecks is dictated by the interests and knowledge of the analysts; after all, the goal of an analyst is typically gaining an appreciation across the market and not taking the time to understand all the idiosyncrasies of an individual system. The performance data from the analysts will usually correspond to reasonably, though not necessarily fully, tuned systems and will include data points which flatter the system and others which expose its flaws.

Where an analyst might be limited by a lack of experience or resources with a particular piece of technology, one can assume that the vendor is not so limited. The strengths of the vendor-supplied data come from the depth of the experience and efforts of the vendors. With vendor-supplied data one has to assume that the system has been tuned to near the best possible advantage. There will not be any underappreciated tuning parameter, there will not be any unfinished tuning experiments. However, there may be some results left unreported owing to their unflattering conclusions. The vendors will supply data on well-tuned systems, and can be made to provide all the related configuration information; but it is not reasonable to expect the vendors to publicize their weaknesses. The performance data that you can get from vendors will correspond to what you might be able to expect from the system as configured by those who are familiar with both the workload and the system. (Note: all vendors will tune their systems well, so there is no risk of comparing unbalanced configurations when using vendor-supplied data.)

Both analysts and vendors make some use of standardized performance benchmark suites, such as the SPEC benchmark suites. These results can be particularly useful. The benchmarks themselves are the results of shared research and progressive compromise across a wide variety of vendors, and hence these tests are not likely to be inappropriately limited by any bottleneck nor unduly biased towards any particular configuration. Information about these standardized tests, and often the benchmarks themselves, is usually available to anyone so that it is possible for you to understand which parts of a benchmark may, or may not, apply to your own situation. The rules that accompany these standardized tests also require a fair degree of disclosure so that it may be possible to appreciate what features are required to achieve certain levels of performance. Finally, these standardized benchmarks provide some level playing field where everyone has agreed to compete evenly -- even if the playing field is not at the same level as your own working situation, it may be interesting to watch the open competition.

As with most data, performance data is only as good as its source. Do not necessarily accept everything at face value. Look under the hood. A lot of performance data comes in the form of estimates. Be sure to find out just what is there behind any of the numbers. That is not to say that all estimates are bad; there are some sources who put much more effort into their estimates than some others put into their fully disclosed experiments. It pays to ask a few questions: ask about the basis for the estimates, check into the conditions under which the experiments were performed, be sure that you are comfortable with how the particular values were arrived at. But in the end trust in the data only as much as you trust the source of that data.

The task then becomes extracting the data from these sources and weighing them properly. Review any available analyst data that pertains to systems of interest. Make note of the performance numbers that are relevant to your situation. Pay particular attention to any weaknesses exposed and determine whether that weakness is significant or has a solution for your environment. Review the available data from the vendors. Recognize that estimates or internal metrics may have limited comparative value. Extract all the relevant data from the standardized tests and be sure to get a full disclosure of the configurations tested to be sure that the results quoted are applicable to your configuration. Press the salespeople to provide whatever available data you might need to make your comparison. Finally, be sure to evaluate all data as weighted by its relevance to your own workload -- a dramatic benchmark result may not help you much if it does not stress any part of the system that you care about.
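One way to weight results by relevance is sketched below. It assumes each result is expressed as a ratio against some reference system and combines the ratios with a weighted geometric mean; that choice of mean is an assumption made for this illustration, not something the article prescribes. The benchmark names, ratios, and weights are hypothetical.

```python
import math


def weighted_score(ratios, weights):
    """Weighted geometric mean of performance ratios.

    ratios:  benchmark name -> result relative to a reference system (> 0)
    weights: benchmark name -> relevance to your workload (0 means ignore)
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one benchmark must have a positive weight")
    log_sum = sum(
        weight * math.log(ratios[name])
        for name, weight in weights.items()
        if weight > 0
    )
    return math.exp(log_sum / total_weight)


# Hypothetical results for one candidate system, for illustration only.
ratios = {"compute": 2.0, "disk_io": 0.5, "graphics": 8.0}

# Equal weights give the kind of summary number a brochure might quote.
equal = {name: 1.0 for name in ratios}

# A server workload that never touches the graphics subsystem.
relevant = {"compute": 1.0, "disk_io": 1.0, "graphics": 0.0}

print(f"equal weights:    {weighted_score(ratios, equal):.2f}")
print(f"relevant weights: {weighted_score(ratios, relevant):.2f}")
```

With these made-up figures the equal-weight summary is about 2.0, driven by the dramatic graphics result, while the score over only the relevant tests is about 1.0: no better than the reference system for this particular workload.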

Conclusion

There is no magic formula for evaluating computer performance. As with so many things, the answer depends upon your point of view. More to the point, however, the answer usually depends upon what kind of trade-offs you can live with. Thus, the first order of the day is to understand just what your needs are so that you can make the correct trade-offs. There are a lot of performance numbers available. The salespeople may use some strange math to support their conclusions. However, just because their numbers might not make sense, that does not mean that you cannot find out what is true for you. Don't just take their word for it; ask to see all the numbers and make your own calculations.

Before you sign the bottom line, it pays to read the small print, go out and kick a few tires, and best of all take a test drive out onto some of your favorite open roads and see for yourself how the system really performs.