Sunday, March 28, 2010

The Speed Demon: Intel's Pentium 4

Although this ageing processor is 10 years old, strangely, some models of the Pentium 4 are still available in computer shops, and more strangely, some models were even still in production at the time I wrote this post. So why has this ageing processor refused to die off completely after 10 years, unlike its elder and (clock-for-clock) faster brother, the Intel Pentium III, which had a much shorter production run (1999-2002)? What is behind the Pentium 4 processor?

The Intel Pentium 4 is based on Intel's 7th-generation NetBurst microarchitecture. NetBurst is a redesigned architecture with some minor resemblance to the Pentium III's older P6, such as the "Advanced Transfer Cache". So what makes the Pentium 4 more modern and advanced than the Pentium III? Here it goes:

400MHz System Bus (100MHz quad-pumped bus):
This is one of the prominent changes in the Pentium 4: the 400MHz system bus. The bus is actually clocked at 100MHz, but it is quad-pumped (it transfers data four times per clock), so the effective clock speed of the Pentium 4's system bus is 400MHz. Compare that to the Pentium III's system bus, which is limited to 133MHz and is single-pumped; do the maths, and its effective clock speed stays at 133MHz, which leaves the Pentium III's system bus very far behind. Both CPUs share the same capability of moving 64 bits of data per transfer. In terms of data transfer, the Pentium 4 can move up to 3,200MB/s, compared to the Pentium III, which is only capable of up to 1,066MB/s, while its direct non-Intel competitor, the AMD Athlon, with its EV6 system bus clocked at 266MHz (133MHz double-pumped), offers up to 2,133MB/s. Although armed with a powerful quad-pumped bus, the Pentium 4 still uses the same kind of "front side bus" as the Pentium III, which is inferior compared to the more advanced EV6 bus signalling used by the AMD Athlon.
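
The bandwidth figures above follow from simple arithmetic: base clock, times transfers per clock, times bus width in bytes. Here is a minimal sketch of that calculation; the function name is my own, and the 133MHz clock is assumed to be the nominal 133.33MHz (400/3), which is what gives 1,066 rather than 1,064.

```python
def peak_bandwidth_mb_s(base_clock_mhz, transfers_per_clock, bus_width_bits=64):
    """Theoretical peak bus bandwidth in MB/s (hypothetical helper)."""
    bytes_per_transfer = bus_width_bits // 8  # 64 bits = 8 bytes
    return base_clock_mhz * transfers_per_clock * bytes_per_transfer

# Assumption: "133MHz" is nominally 133.33MHz, i.e. 400/3.
NOMINAL_133 = 400 / 3

pentium4 = peak_bandwidth_mb_s(100, 4)            # quad-pumped
pentium3 = peak_bandwidth_mb_s(NOMINAL_133, 1)    # single-pumped
athlon = peak_bandwidth_mb_s(NOMINAL_133, 2)      # double-pumped EV6

for name, value in [("Pentium 4", pentium4),
                    ("Pentium III", pentium3),
                    ("Athlon", athlon)]:
    print(f"{name}: {int(value)} MB/s")
```

These are theoretical peaks only; real transfers never hit them all the time.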

Pentium 4 new old cache.
The Pentium 4 retains the same L2 cache design used by the Pentium III 'Coppermine', famously known as the 'Advanced Transfer Cache': both CPUs share the same 256 KiB L2 size and the same 8-way associativity. But the Pentium 4's L2 cache uses 128-byte lines, each divided into two 64-byte halves, compared to the Pentium III, which uses only 32-byte cache lines. Larger cache lines let the CPU fetch more data from the system per access, thus increasing overall performance.
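
To see what the line size changes, here is a rough sketch that works out the cache geometry from the numbers above (256 KiB, 8-way, 128-byte versus 32-byte lines). The helper names are made up for illustration, and the 4 KiB sequential read is just an assumed example workload.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (total lines, number of sets) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets

def line_fills_for_sequential_read(read_bytes, line_bytes):
    """How many cache lines a sequential, line-aligned read touches."""
    return -(-read_bytes // line_bytes)  # ceiling division

L2_SIZE = 256 * 1024  # 256 KiB, same on both CPUs

p4_lines, p4_sets = cache_geometry(L2_SIZE, 128, 8)
p3_lines, p3_sets = cache_geometry(L2_SIZE, 32, 8)

# Assumed example: reading 4 KiB of consecutive data.
p4_fills = line_fills_for_sequential_read(4096, 128)
p3_fills = line_fills_for_sequential_read(4096, 32)

print("Pentium 4:  ", p4_lines, "lines,", p4_sets, "sets,", p4_fills, "fills")
print("Pentium III:", p3_lines, "lines,", p3_sets, "sets,", p3_fills, "fills")
```

The same amount of sequential data needs four times fewer line fills with 128-byte lines, which is where the benefit comes from. The flip side, not modelled here, is that big lines waste bandwidth when only a few bytes of each line are actually used.
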
Furthermore, the Pentium 4's caches were redesigned to have lower latency and higher bandwidth. To achieve this, Intel drastically reduced the L1 data cache size to 8 KiB. Compare that to the Pentium III, which has a 16 KiB L1 instruction cache and a 16 KiB L1 data cache, and the AMD Athlon, which has a 64 KiB L1 instruction cache and a 64 KiB L1 data cache (the Core 2 CPUs and the Core i3, Core i5, Core i7 and Core i9 have a 32 KiB L1 instruction cache and a 32 KiB L1 data cache). This is one of the reasons why the Pentium 4 merely matches, or even falls behind, the Pentium III and even its main rival, the AMD Athlon, on a clock-for-clock basis.
The unique feature of the Pentium 4's L1 cache is the absence of an L1 instruction cache; it is replaced by the 'Execution Trace Cache'. Technically, the Trace Cache does the same job as an L1 instruction cache, but the difference is that the Trace Cache stores micro-operations (micro-ops are instructions that have already been fetched and decoded and are ready for execution). When an instruction is executed again, the CPU does not need to fetch and decode it all over again from the start; instead, it accesses the already decoded micro-ops from the Trace Cache. The secret behind the Trace Cache is its location: unlike a conventional L1 instruction cache, it sits behind the decoders, which reduces latency. The branch prediction unit also has a branch target buffer 8 times larger than the one found in the Pentium III, plus a new prediction algorithm, which together cut about 33% of the mispredictions that the Pentium III suffers.
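
The idea of caching decoded micro-ops instead of raw instructions can be shown with a toy model. This is only a sketch of the principle, not how the real hardware is organised: the real Trace Cache stores traces of micro-ops along the predicted execution path, while this toy just memoises per address. The class, the decoder and the instruction strings are all hypothetical.

```python
class ToyTraceCache:
    """Toy model: remember decoded micro-ops so the decoder runs once."""

    def __init__(self, decoder):
        self._decoder = decoder
        self._uops_by_address = {}
        self.decode_count = 0  # how often the (slow) decoder was used

    def fetch(self, address, raw_instruction):
        if address not in self._uops_by_address:
            # Miss: decode once and keep the result.
            self.decode_count += 1
            self._uops_by_address[address] = self._decoder(raw_instruction)
        # Hit: hand back the already decoded micro-ops.
        return self._uops_by_address[address]


def toy_decode(raw_instruction):
    """Hypothetical decoder: pretend every instruction becomes two micro-ops."""
    return (raw_instruction + ":uop0", raw_instruction + ":uop1")


cache = ToyTraceCache(toy_decode)
loop_body = [(0x10, "load"), (0x14, "add"), (0x18, "jump")]

executed_uops = 0
for _ in range(100):                 # run the same loop 100 times
    for address, raw in loop_body:
        executed_uops += len(cache.fetch(address, raw))

print("micro-ops executed:", executed_uops)
print("decoder invocations:", cache.decode_count)
```

The loop runs 300 instructions but the decoder is only used 3 times, which is the saving the Trace Cache is after in hot loops.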

Hyper Pipelined Technology
Yes, this is the main reason why the Pentium 4's IPC (instructions per clock; more IPC means more performance) is hurt: the Pentium 4 has a 'very deep' 20-stage pipeline (31 stages in the later versions of the Pentium 4), compared to the Pentium III with 10 stages and the AMD Athlon with 11 stages (the Core 2 series has 14 stages).

The reason the Pentium 4 has this deep pipeline is that it enables the Pentium 4 to reach the highest possible clock speeds, but in the end, heat problems caused by those high clock speeds ended the Pentium 4's dream of reaching clock speeds of up to 10 GHz. The ONE BIG disadvantage of a deep pipeline is that when a branch is mispredicted, the work already in the pipeline must be thrown away and processing must start all over again. The deeper the pipeline, the more cycles each mispredicted branch costs compared to a shorter pipeline.
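
How much a deep pipeline hurts can be estimated with a very simple model: average cycles per instruction = base CPI + (fraction of branches) x (misprediction rate) x (penalty in cycles). The sketch below uses that textbook-style formula. Only the stage counts come from this post; the base CPI, branch fraction and misprediction rate are assumptions picked for illustration, and the penalty is assumed to be roughly equal to the pipeline depth.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Assumed workload numbers (not measurements):
BASE_CPI = 1.0          # ideal pipeline, one instruction per clock
BRANCH_FRACTION = 0.20  # 1 in 5 instructions is a branch
MISPREDICT_RATE = 0.05  # 5% of branches are mispredicted

# Assumption: the penalty is about as long as the pipeline is deep.
cpi_10_stage = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)
cpi_20_stage = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)

print(f"10-stage pipeline: CPI = {cpi_10_stage:.2f}, IPC = {1 / cpi_10_stage:.3f}")
print(f"20-stage pipeline: CPI = {cpi_20_stage:.2f}, IPC = {1 / cpi_20_stage:.3f}")
```

With the same predictor, doubling the depth doubles the stall cycles, so the deeper design needs either a higher clock or a better predictor to come out ahead.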

Rapid Execution Engine
This is the fancy name given by Intel to the Pentium 4's execution units, which consist of 2 ALUs and 2 AGUs, each operating at double the speed of the processor. While these two ALUs execute simple instructions, there is another, slower ALU dedicated to executing complex instructions, and this slower ALU operates at the same speed as the processor (compare this to the Pentium III, whose ALU and AGU work only at the same speed as the processor). All of these execution units are there to compensate for the deep pipeline and to deliver integer performance equal to that of the lower-clocked Pentium III.
The Pentium 4's advanced branch predictor also serves to compensate for the deep 20-stage pipeline, since it reduces the mispredictions that would otherwise be discovered in the later stages of the pipeline.

SSE2 and Floating Point Unit
SSE2 is the Pentium 4's most successful new feature; even AMD was impressed by it, and its next line of processors will support the SSE2 instructions in addition to the earlier, also successful, SSE. SSE2 extends MMX and SSE, both of which are also included in the Pentium 4. Besides increasing the Pentium 4's performance, SSE2 can handle some 64-bit operations (two 64-bit SIMD integer and two double-precision 64-bit SIMD floating-point operations at a time), which is very important for floating-point performance and useful for many professional applications.
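
The "two 64-bit operations at a time" part is easier to picture with a small emulation. The sketch below packs two double-precision values into 16 bytes (128 bits, the size of an SSE register) and adds two such packs element by element, which is what SSE2's packed-double add (ADDPD) does in one instruction. This is plain Python pretending to be SIMD, so it shows the data layout only, not the speed; the function names are my own.

```python
import struct

def pack_pd(low, high):
    """Pack two 64-bit doubles into one 128-bit (16-byte) value."""
    return struct.pack("<2d", low, high)

def unpack_pd(register):
    """Split a 128-bit value back into its two doubles."""
    return struct.unpack("<2d", register)

def add_pd(a, b):
    """Emulate a packed-double add: both lanes are added independently."""
    a_low, a_high = unpack_pd(a)
    b_low, b_high = unpack_pd(b)
    return pack_pd(a_low + b_low, a_high + b_high)

x = pack_pd(1.0, 2.0)
y = pack_pd(0.5, 0.25)
result = add_pd(x, y)

print("register size in bits:", len(result) * 8)
print("lanes after add:", unpack_pd(result))
```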

Is It?
Despite all these enhancements, the Pentium 4's clock-for-clock performance seems a little disappointing compared to the Pentium III... for example, take a look at these benchmark graphs that I grabbed from Tom's Hardware:

Whatever it is, the Pentium 4 did deliver stable performance and ran great in games and multimedia-oriented software, but it still loses on a clock-for-clock basis.

-------------------------------------------THE END--------------------------------------------
