Opinion The benchmark farce continues
By Mario Rodrigues: Monday 12 May 2003, 11:03
WHEN AMD LAUNCHED Barton, the latest iteration of its Athlon K7 processor, the chipmaker claimed that its XP 3000+ was the highest performing PC processor. Of course, Intel vehemently disagreed and said vendor claims should not be used to decide the performance of a chip, which is very true.
Because some PC performance measuring tools do not meet the criterion to be called a benchmark - that is to say, they are not accepted by all sides as a fair representation of PC performance - opinion is divided about which tools merit the honor of the name. So let's take Intel's advice on board, review how the hardware community rated AMD's top desktop offering, and then highlight the benchmark issues that arose.
Not all reviewers are singing off the same hymn sheet
Starting with the mega hardware web sites, Tom's Hardware Guide and AnandTech had differing conclusions. It was THG's opinion that the 3000+ model number rating wasn't justified. On the other hand, AnandTech said the overall performance was close enough to warrant the rating. So we have a score draw for the behemoths of this industry.
The "second tier" web sites were also in disagreement. Aces Hardware and Hardware Zone felt that the P4 had a performance edge, and because of this, they felt a higher frequency 3000+ would have been more appropriate. [H]ard|OCP, The Tech Report, The Hardware Review, Hexus, and HotHardware either concluded a score draw or put the latest Athlon in front, thus endorsing the 3000+ rating. Sudhian Media, Xbit-labs, ExtremeTech, 3D Velocity, SimHQ, Lost Circuits, DeviantPC, LinuxHardware.org, and Motherboards.org did not conclude that the rating was unfair. This straw poll puts those that did not approve of the 3000+ rating in a rather small minority.
Too many widely used benchmarks fall short of the mark
For many people, the older benchmarks used by reviewers are still very relevant. The Tech Report uncovered an eye-opener with Ziff Davis' Content Creation Winstone (CCW is now owned by Lionbridge Technologies). Scott Wasson, the reviewer who heads up The Tech Report, showed the benchmark results for versions 2001, 2002, and 2003. In the 2001 results, the Athlons took the top five places. In the 2002 results, the Athlons took the top two places and the rest of the placings were mixed. In the latest edition of the benchmark, the tables were totally turned, and the P4s took the top five places. This is reminiscent of what happened to BAPCo's SYSmark, where no credible explanation has yet been given for the changes made to that benchmark. If this were poker or craps, you'd have a very strong suspicion that someone was playing with marked cards or loaded dice. It will be interesting to see if AMD makes any public comment about these results. Scott Wasson said this about his findings:
"CC Winstone has changed over time to incorporate newer versions of applications, more video processing, larger data files, and in the 2003 release, Newtek's Lightwave 7.5 renderer, which is optimized for the Pentium 4's SSE2 but not for the SSE or 3DNow! instruction set extensions (both of which the Athlon XP supports). I'll let you draw your own conclusions about the merits and motivations behind the changes that ZD has wrought to CC Winstone over the years. Given the results, I thought including all three versions would be the best policy."
In a news story from 2001, Randall C. Kennedy, who has done benchmark work for Intel but is now Director of Research at CSA Research, claimed that Ziff Davis' Winstone and BAPCo's benchmarks are heavily influenced by Intel. With the Intel bias claims that AMD made against BAPCo in 2002, and the disparities that are clearly evident in later versions of Content Creation Winstone, Kennedy's claims continue to ring true while credible explanations to the contrary fail to materialize.
Anand Lal Shimpi, editor-in-chief of AnandTech, said in his review, "Since SYSMark isn't the best for comparing AMD and Intel CPUs, the focus here should be on Content Creation Winstone 2003,...". Unfortunately, as I've already shown, the results from that benchmark mirror those of SYSmark, so his argument is questionable.
For these two benchmarks (the second is linked on the next page), AnandTech showed the CPU scaling for both Intel and AMD CPUs. These graphs are very interesting to look at, but they would have far more impact if graphs from previous versions of those benchmarks were shown alongside them. Seen side by side, the disparities from one benchmark version to the next would be plain to see, and those who make buying decisions based on those benchmarks would be prompted to question the methodology behind them. If AnandTech or someone else were to publish such results, readers would benefit greatly. The graphs would probably raise more questions than they answered, which would be good: more pressure would be exerted on the likes of BAPCo, Lionbridge Technologies (owner of Content Creation Winstone), and others to credibly explain the discrepancies.
These graphs could also act as an integrity barometer. Benchmark companies may not want to credibly explain the changes they make to their benchmarks, but the graph disparities would be clear for all to see.
Of course, whoever published such results "might" suffer consequences that "could" affect the prosperity of their web site. Have no fear: send your test results to the INQUIRER and we will publish them anonymously.
Whenever the latest processor debuts, the PC World review is one that I always look forward to reading. The PC WorldBench 4 benchmark suite is application-based. It's not relying on esoteric synthetic methods, which is what makes its reviews all the more interesting to read, and from my perspective, believable.
Hardware web sites have concluded that the P4 wins the synthetic tests. But when it comes to real world benchmarks, Athlon is leading the pack. This assertion was reflected in the PC World results. AMD's latest and greatest set a new PC WorldBench 4 record of 137. Intel's best offering, which was kitted out with RDRAM, was five points behind. The P4 systems that PC World used for its review comparison had 1 GB of RAM, but could only muster a best result of 127, ten points behind the best Athlon system. I can only conclude that those systems were not using RDRAM, as the reviewer does not make this clear.
PC performance measuring tools that claim to be benchmarks
With all of the caveats that have already been highlighted, there are other questions that need to be asked about the validity of today's benchmarks. How much of the software in use today has been optimized for the SSE and SSE2 extensions? I've seen no reliable data that answers that question. I keep reading reports that software continues to be optimized with these extensions, and hence that newer benchmarks should be used, but I never see figures to back up those claims. Conversely, what proportion of software in use today is optimized only for the x87 floating point unit? This matters because the P4's x87 FPU performance is weak when measured against the Athlon's. If those statistics were readily available, hardware web sites would find it far easier to match their benchmarks to the real-world environments we work in. As I see it, benchmark choice today can only be described as ad hoc.
With all the qualification language that reviewers have to use to describe the legitimacy and use of a benchmark, the word "benchmark" is now applied to some PC performance measuring tools that clearly do not merit the description. Let's define the word: a benchmark is a measuring standard, and a "standard" is an authorized model or unit of measurement that is widely used and respected. Well, it's clear that later versions of SYSmark, Content Creation Winstone, and 3DMark fail the definition test. AMD accused BAPCo of Intel bias with respect to SYSmark 2002, the disparities between the later versions of Content Creation Winstone are clear for all to see, and Nvidia has made lots of noise about the latest release of 3DMark.
When so-called benchmarks fail the definition test, the onus should be on the benchmark house to prove that the contention is false. If it fails to deliver evidence that categorically dismisses the claim, then that benchmark should no longer be called a benchmark, but should be described as something else.
When a new version of a PC performance measuring tool is released, it's not good enough that everyone should automatically treat it as kosher. Its benchmark status should have to be earned. Categorizing benchmarks could act as a very powerful lever to force benchmark houses to be more responsive to the needs of their user bases, and less responsive to monopolistic companies.
Hardware web sites could put together a very simple table showing, from their perspective, the usefulness of each benchmark. It would have three columns: benchmark, measuring tool only, and dog's breakfast. Any benchmark house that sees its measuring tool in one of the derogatory columns would be all ears to resolve the issues that would move it to the column that matters. Competent reviewers test all available benchmarks, so it would be easy to make their assessments public.
AMD: Making a stand against the benchmark industry
When one sees how benchmarks can change from one revision to another, it should be no surprise that AMD decided to draw a line in the sand: it would choose a selection of benchmarks, with audited results, to determine the model number ratings of its processors. Given the disparities apparent in later generations of SYSmark and Content Creation Winstone, if AMD had depended on these, its system of determining model numbers would have been found very wanting indeed.
Since October 2001, when AMD introduced its model number nomenclature, its benchmark suite has been updated only once, to add Content Creation Winstone 2002. With this general stability in benchmark choice, AMD has been able to reliably measure how its platform architecture improves and compares with the competition. Even with last year's release of Intel's 3.06 GHz Hyper-Threading P4, AMD's benchmark results show that overall, that chip trails the Athlon XP 3000+ by 17% with HT enabled, and by 11% with it disabled. So with AMD and others alleging Intel bias in benchmarks, what has changed in these later generation benchmarks that shows the Netburst architecture in such a favorable light? Netburst is a netbust when older, more respected benchmarks are used. Likewise, Hyper-Threading comes up a negative.
This brings us back to the earlier point about the validity of today's benchmarks. Which benchmarks are really representative of software use today? Are the benchmarks that reflect favorably on the P4 really measuring real world performance? Or have they been optimized to showcase the architectural strengths of Netburst, which may not be representative of how people really use their computers? This is the conundrum the hardware community has been placed in, but one it could get out of if it really wanted to. Let's not forget who put us in this mess: it's the benchmark industry that has not reconciled - in terms of credible explanations - the disparities that exist in its benchmarks. Until the slate is wiped clean, the benchmark industry will continue to be tarnished with distrust.
On AMD's FAQ web page for Opteron, it answers the question about how AMD chooses its performance benchmarks: "AMD selects industry-standard benchmarks developed and validated by independent third parties that are most likely to represent typical customer usage scenarios." It's a shame that some benchmarks used in the desktop world don't merit that same level of integrity, value, and trust.
Looking beyond the mark
One can discuss the merits of whether AMD's model number nomenclature is fair and just. But as we do this, we overlook a product whose performance gap against Athlon cannot be ignored: Celeron. For a similar price, and knowing how much better the Athlon performs, how many people would recommend a Celeron-based system to their friends, relatives, or colleagues? Not many, I hope. But what actually happens is very different: for every Athlon AMD sells, Intel probably sells at least two Celerons. Consumers and businesses continue this blind approach to IT expenditure.
Celeron: Reliant Robin in Porsche clothing
Paul Otellini, Intel president and COO, repeated his mantra at his company's first quarter earnings Webcast: "For the value segment, we continue to position Celeron as being equal or better than the best products available from our competition." The implication is that in performance terms, Celeron is equal to or better than Athlon. It's not what you say that's important, it's how it's meant to sound.
Of course, if SYSmark and Content Creation Winstone are the PC performance measuring tools used, it's possible to give the impression that a three-wheeled Reliant Robin would perform like a Porsche. It's only when these vehicles are seen side by side that the con becomes fully apparent. In the PC world unfortunately, the disparity between Athlon and Celeron will not be as obvious as two cars side by side in a showroom. So if you don't make an informed buying decision, you may end up purchasing a PC that falls far short of your expectations.
Intel customers have already suffered from the Reliant Robin analogy. After the launch of the P4 in November 2000, consumers who had purchased PCs based on that processor thought that they had made an informed buying decision. It was based on the premise that a P4 with a higher frequency than a PIII or Athlon would deliver superior performance. They were wrong. Now they have filed a class action lawsuit against Intel, Gateway, and Hewlett-Packard because of alleged deceit.
So if you're considering buying a Celeron-based system, just remember what PC World had to say about that processor: "But based on PC World's exclusive tests of a PC using the new Celeron, you should avoid it: This chip is all bark and no bite."
One other observation needs to be made about Otellini's remarks. Intel will soon be releasing new Celeron speed grades at 2.5 and 2.6 GHz. Otellini will no doubt continue his mantra when these new parts arrive. If Celeron is just a speed bump improvement, is he really saying that a 2.5 GHz Celeron with only 128 KB of inclusive cache is somehow equivalent or better than an Athlon 2500+ with 640 KB of exclusive cache? That's a fivefold difference, and that's before the 2500+'s FSB increase to 166 MHz (333 DDR) is taken into account. The performance comparison that Otellini is alluding to is a reach. When you measure his remarks against a 2500+ Athlon, his comparison becomes a pipe dream.
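The fivefold figure follows from how inclusive and exclusive cache policies add up. A minimal back-of-the-envelope sketch, assuming the commonly cited 8 KB L1 data cache for this Celeron and the Barton core's 128 KB L1 plus 512 KB L2 split:

```python
# Effective cache capacity in KB. With an inclusive L2, the L1 contents
# are duplicated in L2, so the effective capacity is just the L2 size;
# with an exclusive design, L1 and L2 hold distinct data and add up.
def effective_cache_kb(l1_kb, l2_kb, policy):
    if policy == "inclusive":
        return l2_kb
    return l1_kb + l2_kb

celeron_kb = effective_cache_kb(l1_kb=8, l2_kb=128, policy="inclusive")
athlon_kb = effective_cache_kb(l1_kb=128, l2_kb=512, policy="exclusive")
print(celeron_kb, athlon_kb, athlon_kb / celeron_kb)  # 128 640 5.0
```

This is an idealized capacity comparison only; real-world performance also depends on latency, associativity, and the workload.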
If Otellini is not reeled in from his "loose cannon" remarks, he might just be setting up Intel for another lawsuit. Consumers can be gullible when they have money to spend. Many may purchase Celeron-based systems believing them to be good value. Unfortunately, when they discover the performance disparity against a comparably priced Athlon system, those people, just like those who have already filed against Intel et al, will want their pound of flesh. In light of this, I believe a lawsuit alleging Celeron misrepresentation would have a far better chance of success than the current litigation.
Athlon and model numbers - life savers for AMD
Comparing the Athlon against the 3.06 GHz P4, you have to give AMD credit for keeping the Athlon architecture competitive for so long. The XP 3000+ runs at a 29% lower frequency, its FSB has 38% less bandwidth, its die is 22% smaller, it uses 29% less power at maximum load, it has no SSE2 extensions, and it lacks Hyper-Threading technology, yet the vast majority of the hardware community did not conclude that the 3000+ rating was unfair. When one looks at what Athlon delivers, less is most definitely more, and it really does highlight the inefficiencies of the Netburst architecture.
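Those percentage gaps can be checked with simple arithmetic. The raw figures below (a 2,167 MHz Barton clock, and 2,666 vs 4,266 MB/s of peak FSB bandwidth) are the commonly quoted numbers for these parts, so treat them as assumptions:

```python
# How much lower the Athlon XP 3000+ figure is than the 3.06 GHz P4's,
# expressed as a rounded percentage of the P4 value.
def pct_lower(p4_value, athlon_value):
    return round(100 * (p4_value - athlon_value) / p4_value)

print(pct_lower(3066, 2167))   # clock frequency in MHz -> 29
print(pct_lower(4266, 2666))   # peak FSB bandwidth in MB/s -> 38
```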
It is fair to say that the introduction of the model number nomenclature has been a strategic decision that has really paid off for AMD. Imagine where AMD would be today if it were still using frequency to describe performance, or if it were relying on the latest iterations of SYSmark and Content Creation Winstone to determine the model number ratings of its processors. It would have been a farce. Its processors wouldn't have been able to command the price points they hold today, Intel would be exploiting its frequency advantage for all it was worth, and AMD's red ink would be worsening by the day.
This all goes to show that AMD's model number nomenclature is just, and its decision not to use later editions of certain benchmarks sound. Fundamentally, this gives consumers comparable technologies to choose from. The competition will also drive down costs, which will ultimately widen the PC user base, which has to be good for everyone.
Using today's benchmarks, how will Athlon64 be portrayed?
With all of the enhancements that Athlon64 will bring to the desktop space, this new processor will slingshot AMD to the front of the pack technologically. But with the benchmarks that are in use today, I don't think we'll see the performance lead that Opteron has over Xeon. It will be interesting to see what changes AMD makes to its benchmark portfolio, which determines the model number used for its processors. Even with the disparities that exist in today's benchmarks, I think Athlon64 will measure up.
It should not be forgotten that we have one last spin of the Athlon before it's replaced by the AMD64 platform. The 200 MHz FSB Athlon (400 DDR) will add 20% to the bandwidth that is currently available. That sounds minuscule next to the 50% increase that the latest P4 generation delivers. But as we should all know by now, FSB bandwidth is only one factor in the overall performance of a processor; processor frequency and instructions per cycle are others. Intel is able to tout bigger numbers, but bigger doesn't always mean better performance.
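The 20% and 50% figures are straightforward arithmetic on the effective FSB clocks, since peak bandwidth scales linearly with the effective clock for a fixed 64-bit bus:

```python
# Percentage bandwidth gain from raising the effective FSB clock.
def pct_gain(old_mhz, new_mhz):
    return round(100 * (new_mhz - old_mhz) / old_mhz)

print(pct_gain(333, 400))  # Athlon: 333 -> 400 MHz DDR, prints 20
print(pct_gain(533, 800))  # P4: 533 -> 800 MHz QPB, prints 50
```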
With AMD about to launch its 200 MHz FSB Athlon (400 DDR) ahead of Intel's latest and greatest, the price positioning needs some explanation. When both products become available, the 3200+ Athlon is going to be about $200 cheaper - $460 against $660 retail. Remember that the Athlon 3200+ is essentially just a FSB increase to 200 MHz (400 DDR). Besides using its 200 MHz FSB (800 QPB) for its next P4, Intel will also increase the frequency by 200 MHz to 3.2 GHz. This should be enough to give Intel the performance edge, but buyers will have to weigh up if that extra performance justifies the additional cost.
Will the Athlon XP 3200+ merit its rating? In the purist sense, probably not, but that $200 saving certainly brings equality to the minds of those who have to make that purchasing decision.
Becoming better informed
So when you come to read the next round of processor comparisons, bear in mind that there are two sides to every story. My advice is to read all shades of opinion. This will give you a real perspective of what's been said. Read also the discussion boards, as the opinion expressed there will also help in determining what the bottom line really is. By doing what I suggest, you should be able to make an informed buying decision based on need.
Prevailing against the odds
With the latest revelations about alleged Intel strong-arming of companies that had planned to be present at the AMD launch of Opteron, and the continued allegations of Intel favoritism in benchmarks, there really can be no smoke without fire - and yes, the fire is burning brightly. µ