What defines a stable system?
phil
10th March 2004, 07:57
As a lot of you know, I am quite a hardware tweaker. One of the things I see so often on hardware forums is people quoting their maximum overclocks without qualifying just how stable they are. Personally, I always check my overclocks by running Prime95 for around 8 hours. If it doesn't report an error in that time, I am happy that the CPU, at least, is stable.
Recently though, I have seen people on a few forums calling Prime95's usefulness into question. They say it doesn't accurately reflect how stable their system is. They can run much higher overclocks while still having a stable system - 3DMark 2001, 3DMark 2003, SETI, games, SuperPi etc. all run fine but P95 will crash almost immediately. My response has always been that P95 error-checks all its calculations against known good values, so if you have an error, you know that there is a problem. Personally, when P95 errored for me, I could always expect a crash at some point.
I have just built a system recently based on an AMD FX-51, Asus SK8V and 1GB Corsair XMS3200R. I have added my water cooling so I should have no problems, right? Well, this is where things get interesting. I decided to keep the FSB (HT) at the default 200 and clock using only the multipliers....this should keep any errors reported purely down to the CPU. The memory tested fine with memtest86 before I even started clocking and P95 ran the full 8 hours at default speed (2.2GHz). I bumped up the multiplier to 11.5 and everything went well. On hitting 2.4GHz though, P95 would fail at any vcore. Everything else would run absolutely fine. I decided to keep pushing until I found instability. I could reach 2.6GHz @ 1.675V and still run games, 3DMark 2001, 3DMark 2003, SuperPi, CPUBurn (which error-checks its results) and even the OcUK SETI bench (the results were verified as being correct). 2.7GHz wouldn't POST.
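For anyone following the numbers, the speeds quoted above come from core clock = reference clock x multiplier. A quick sketch of the arithmetic, assuming the reference clock stays at the default 200MHz; the 11x, 12x and 13x multipliers are inferred from the quoted speeds rather than stated in the post:

```python
# Core clock = reference clock (MHz) x CPU multiplier.
# Assumption: the reference clock is left at the default 200 MHz.
REFERENCE_MHZ = 200


def core_clock_ghz(multiplier, reference_mhz=REFERENCE_MHZ):
    """Return the core clock in GHz for a given multiplier."""
    return multiplier * reference_mhz / 1000


# Multipliers other than 11.5 are inferred, not quoted from the post.
for multiplier in (11, 11.5, 12, 13):
    print(f"{multiplier:>4}x -> {core_clock_ghz(multiplier):.2f} GHz")
```

Changing only the multiplier leaves the memory and HT link at their stock speeds, which is why any new errors can reasonably be pinned on the CPU core.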
Almost everything runs with full stability at 2.6GHz yet P95 would fail. I decided to fire up F@H and it also died immediately. I decided to download StressCPU (http://www.em-dc.com/) to check whether the CPU could run SSE instructions continuously. That immediately errored out. I then backed off until I had stability again on StressCPU, F@H and P95.....all run fine @ 2.38GHz and no more.
So, to sum up: although your overclock may look and run totally stable for hours on end, if it is not 100% Prime95 stable, it isn't stable at all!
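The idea of checking calculations against known good values can be sketched in a few lines. This is only an illustration of the principle, not how Prime95 itself works: it uses Python's integer arithmetic rather than floating-point FFTs, so it puts almost no load on the FPU or SSE units. The names `KNOWN_RESULTS` and `self_check` are made up for the example.

```python
# Known answers: exponent p -> whether 2**p - 1 is a (Mersenne) prime.
KNOWN_RESULTS = {
    3: True, 5: True, 7: True, 11: False, 13: True, 17: True,
    19: True, 23: False, 29: False, 31: True,
    521: True, 607: True, 1279: True,
}


def lucas_lehmer(p):
    """Lucas-Lehmer test: True if 2**p - 1 is prime (p an odd prime)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


def self_check(passes=1):
    """Recompute every known result and return a list of mismatches."""
    errors = []
    for n in range(passes):
        for p, expected in KNOWN_RESULTS.items():
            got = lucas_lehmer(p)
            if got != expected:
                # A mismatch on a deterministic calculation points at hardware.
                errors.append((n, p, expected, got))
    return errors


if __name__ == "__main__":
    bad = self_check(passes=10)
    print("no errors" if not bad else f"mismatches found: {bad}")
```

Because the calculation is deterministic, any mismatch means the machine produced a wrong answer, which is the whole argument for trusting a failed torture test over a game that merely didn't crash.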
dnar
10th March 2004, 08:05
To add another test to the pot, I have found that an O/C that "just" passes Prime for many hours fails compiling glibc (Linux etc). I now compile glibc repeatedly as my stress test...
Many people say they also ignore memtest86 errors, "as my system is stable in normal use"... What they don't realise is their OS and apps may not be using particular memory address ranges, and/or particular memory read/write functions such as cached moves... If it won't run memtest86 at least once without errors, your system WILL crash, it's just a matter of time. This is particularly true for Linux and the latest M$ offerings, as they utilise as much free memory as possible for cache and buffers.
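The write-then-verify idea behind a memory test can be sketched as below. This is only an illustration: a user-space script sees whatever virtual memory the OS hands it and cannot walk physical address ranges the way memtest86 does from boot. The function name `pattern_test` and the choice of patterns are assumptions made for the example.

```python
def pattern_test(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern, read it back, return bad offsets."""
    buf = bytearray(size_bytes)
    failures = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected              # write pass
        if buf != expected:            # verify pass
            # Only walk the buffer byte by byte when something went wrong.
            failures.extend(
                (pattern, offset)
                for offset in range(size_bytes)
                if buf[offset] != pattern
            )
    return failures


if __name__ == "__main__":
    bad = pattern_test()
    print("memory pattern test passed" if not bad else f"failures: {bad[:10]}")
```

A clean pass here only covers the one buffer the script was given, which is exactly the trap described above: memory that is never exercised never gets a chance to fail.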
Nice post phil.
MikeTimbers
10th March 2004, 08:19
I too still believe in Prime95 but I do quick tests first after raising the overclock.
SuperPi 32M is a good one.
I have one called CPU Stability Test which seems to have disappeared from the net but does a range of maths tests including Mersenne, Fibonacci, etc. so it's checking against known results. If anyone wants it, e-mail me.
I've also seen people saying you need to let P95 run for as much as 48 hours, but that seems like overkill. A failure after that amount of time could be down to almost anything.
Bruce
10th March 2004, 11:46
One thing that we've discovered with Folding@Home is that a system which has been classified as stable using common tests like prime95 and memtest86 may not be stable when running applications that depend on SSE.
There's a 3rd party program to Stress Test the SSE components of your CPU. http://forum.folding-community.org/viewtopic.php?t=4737 The Windows version is at the very top, but if you read farther into the thread, you'll find a Linux version.
Since most applications use very little SSE (or none), this may or may not be of concern to you, but it's available.
phil
10th March 2004, 17:20
Cheers Bruce.....that's the prog I linked to earlier. I didn't know it was linked to F@H.
Bruce
10th March 2004, 21:08
:o I didn't notice your link until now.
TheWeatherMan is the author of EM3 which is a great monitor for F@H. (You probably remember EM2 for GAH-classic.) The actual source was developed by the developers of Gromacs, who have written really tightly optimized SSE code (as well as 3DNow+ and AltiVec) that actually runs about 3.5 times as many FP operations as the unoptimized code. (A benchmark can reach the maximum of 4 simultaneous operations, but when you do real work, the number is always lower.)
It was that code that identified a problem with a system freeze associated with AMD hardware running SSE code. This was confirmed by their test lab and an AMD announcement is pending. (I agreed to an NDA, so I can't tell you any more until they make the formal announcement.)
Dustin
31st March 2004, 09:53
Prime95 will crash before most other programs show any sign of weakness.
If you notice, Prime95 heats your CPU noticeably more than most other DC programs. The reason for this is that Prime95 uses only pure x86 FPU FDIV (the newer version can use SSE2). Prime95 uses lots of RAM and turns your CPU into a mini block heater.
When I was running lots of G@H clients, I noticed I could pump my CPUs on average about 100MHz higher than when I ran Prime95. I speculated that this was because the CPUs were running noticeably cooler. If this is the case, then you will not see any stability problems under normal DC usage because the CPU will never get that hot.
Now me, I subscribe to the 100% rock stable theory. If my box can't run any program, under any circumstance, then it's not 100% stable!
From the Prime95 stress test documentation (stress.txt):
Ignoring the problem is a matter of personal preference. There are
two schools of thought on this subject.
Most programs you run will not stress your computer enough to cause a
wrong result or system crash. If you ignore the problem, then video games
may stress your machine resulting in a system crash. Also, stay away from
distributed computing projects where an incorrect calculation might cause
you to return wrong results. Bad data will not help these projects!
In conclusion, if you are comfortable with a small risk of an occasional
system crash then feel free to live a little dangerously! Keep in mind
that the faster prime95 finds a hardware error the more likely it is that
other programs will experience problems.
The second school of thought is, "Why run a stress test if you are going
to ignore the results?" These people want a guaranteed 100% rock solid
machine. Passing these stability tests gives them the ability to run
CPU intensive programs with confidence.
FREQUENTLY ASKED QUESTIONS
--------------------------
Q) My machine is not overclocked. If I'm getting an error, then there must
be a bug in the program, right?
A) The torture test is comparing your machine's results against
KNOWN CORRECT RESULTS. If your machine cannot generate correct
results, you have a hardware problem. HOWEVER, if you are failing
the torture test in the SAME SPOT with the SAME ERROR MESSAGE
every time, then ask for help at http://mersenneforum.org - it is
possible that a recent change to the torture test code may have
introduced a software bug.
Q) How long should I run the torture test?
A) I recommend running it for somewhere between 6 and 24 hours.
The program has been known to fail only after several hours and in
some cases several weeks of operation. In most cases though, it will
fail within a few minutes on a flaky machine.
Q) Prime95 reports errors during the torture test, but other stability
tests don't. Do I have a problem?
A) Yes, you've reached the point where your machine has been
pushed just beyond its limits. Follow the recommendations above
to make your machine 100% stable or decide to live with a
machine that could have problems in rare circumstances.
Q) A forum member said "Don't bother with prime95, it always pukes on me,
and my system is stable!" What do you make of that?
or
"We had a server at work that ran for 2 MONTHS straight, without a reboot.
I installed Prime95 on it and ran it - a couple minutes later I get an error.
You are going to tell me that the server wasn't stable?"
A) These users obviously do not subscribe to the 100% rock solid
school of thought. THEIR MACHINES DO HAVE HARDWARE PROBLEMS.
But since they are not presently running any programs that reveal
the hardware problem, the machines are quite stable. As long as
these machines never run a program that uncovers the hardware problem,
then the machines will continue to be stable.
Keep in mind when testing though, that Prime95 only stresses the FSB, RAM, CPU FPU, SSE2, L1, and L2 caches.
Just my 2¢. :)
Bruce
31st March 2004, 10:27
Dustin wrote: "Prime95 will crash before most other programs show any sign of weakness."
Actually, that's not true. There's nothing wrong with prime95 for the Pentium MMX and other old machines that don't support SSE, but we've found that the Gromacs core used by Folding at Home will push a lot of newer chips even harder. There's an SSE Torture Test which is an adaptation of that code that generally raises temperatures a few more degrees and detects stability problems that prime95 misses.
Of course, a lot of people never run programs that use heavy SSE, but I figure if you've paid good money for a chip with the SIMD instructions, you ought to give them a good work-out before declaring your machine "stable".
Dustin
31st March 2004, 13:26
Bruce, perhaps you didn't read my post closely. Prime95 has used SSE2 (and makes very heavy use of it) for quite some time. Prime95 can't use SSE "1". I'm not positive, but I think it has to do with the lack of double precision in SSE1. Also, when doing a stress test, you can set the program to use any amount of RAM you desire. I set it to use 1GB of RAM, and it used every last megabyte. :eek:
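On the precision point: the original SSE instructions handle only single-precision (32-bit) floats, and double-precision support arrived with SSE2. A small sketch of how much is lost when a value is squeezed into single precision; the helper name `to_single` is made up for the example, and this is not a claim about Prime95's internals:

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 2**24 + 1 fits exactly in a double (53-bit mantissa)
# but not in a single (24-bit mantissa).
big = 2.0 ** 24 + 1
print(big)              # value as a double
print(to_single(big))   # same value after rounding to single precision

# Relative error of a typical fraction stored at each precision.
third = 1.0 / 3.0
print(abs(to_single(third) - third) / third)
```

A calculation that depends on exact results across millions of operations has very little room for that kind of rounding, which fits the guess that single precision alone would not be enough.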
Bruce
31st March 2004, 18:58
Well, SSE and SSE2 are, in fact, different, though they use common registers. I'm sure it depends on just how a particular chip implements Float and DoubleFloat operations.
Feel free to try this test and report your temperatures. Some people say there is no difference; others say it runs their machine several degrees warmer. YMMV.
http://forum.folding-community.org/viewtopic.php?t=4737