
Why are Multicore CPUs So Hot?


Sujay V Sarma


These days, processor codenames and architecture names get used with abandon. The names can confuse anyone, and after a while it becomes nearly impossible to tell them apart. This month, we take you through this jargon jungle and unveil the secrets of how modern x86 processors work. There are two parts to the story.

One is the micro-architecture part, which is a combination of multiple technologies. For instance, when Intel says 'Core 2' or AMD says 'AM2', it denotes that a particular combination of technologies is at work to produce a processor that can be called a 'Core 2' or an 'AM2' product.

The second part to our story is the architecture itself, which is known by a specific codename (say 'Conroe' or 'Manchester'). This month, we look at technologies behind the different micro-architectures and next month, we'll see what each combination (the architectures themselves) is about.

Direct Hit!
Applies To: IT managers

USP: Learn how multicore processors work under the hood and what makes them so special

Primary Link: en.wikipedia.org/wiki/Multi_core

Google Keywords: Multicore architecture

Macro Fusion

A term coined by Intel, it refers to the processor's ability to combine several instructions into one, optimizing the instruction stream and making for faster execution. Thus, if values from two different memory locations already in the processor cache are to be compared, and the instruction sequence would first load them elsewhere and then compare them, Macro Fusion lets the processor compare them directly, skipping a step. Spread across an entire application or thread, this can significantly reduce execution times.
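As a sketch of where fusion kicks in, consider the C loop below (the function name and data are ours, purely for illustration): a compiler typically turns the comparison and the conditional jump into an adjacent CMP/Jcc pair, which is exactly the pattern a Macro Fusion-capable decoder can merge into a single internal operation.

#include <stdio.h>

/* Illustration only: the comparison and branch below typically compile
 * to an adjacent CMP + Jcc pair, the pattern Macro Fusion targets. */
int count_below(const int *values, int n, int limit)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (values[i] < limit)  /* CMP followed by a conditional jump */
            count++;
    }
    return count;
}

int main(void)
{
    int data[] = { 3, 9, 1, 7, 5 };
    printf("%d\n", count_below(data, 5, 6));  /* prints 3 */
    return 0;
}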

Handling L2 Cache


When you have more than one core on a processor die, selecting the right place to put the L2 cache is key. L1 is not touched here, since it is a small cache that holds the instructions being immediately fed to the processing core. One method uses a common L2 cache for multiple cores (Intel), while the other gives each core a dedicated cache (AMD). There are pros and cons to both approaches.

When you go the AMD way, you have two L2 caches on a dual-core die and four of them on a quad-core. While each core gets its own dedicated cache, the drawback is that when some cores are more active than others, their caches overflow, and those cores take a performance hit waiting for fetches from main memory even while the idle cores' caches sit underused.

Intel's shared cache (what it calls an 'Advanced Smart Cache') shares one L2 cache between two cores, letting the cache be better utilized when one of the cores is relatively lightly loaded. But this method introduces the headache of having a memory controller that manages the cache between the two cores (who gets to put what where, and so forth).


The memory bandwidth Intel promises with its approach peaks at 96 GB per second at 3 GHz. It is obviously hard to take a call on which method is better, as each compensates for a different scenario.
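For the curious, here is a minimal C sketch (assuming Linux with glibc, which exposes cache geometry through sysconf) that reports the L2 cache the OS sees. Whether that L2 is private per core (AMD-style) or shared between cores (Intel-style) is a property of the die, not something the size alone reveals.

#include <stdio.h>
#include <unistd.h>

/* Linux/glibc only: query the L2 cache geometry the OS reports. */
int main(void)
{
    long size  = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long line  = sysconf(_SC_LEVEL2_CACHE_LINESIZE);
    long assoc = sysconf(_SC_LEVEL2_CACHE_ASSOC);

    printf("L2 cache: %ld bytes, %ld-byte lines, %ld-way\n",
           size, line, assoc);
    return 0;
}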

In an FSB-based architecture, data hits a memory controller, which routes it to the CPU, memory, etc. AMD's Direct Connect places the memory controller in the CPU, so all data passes through the processor, which then routes it appropriately.

Intel Smart Memory Access


So, when you have a shared L2 cache, as Intel has with its Advanced Smart Cache technology, the headache of managing the cache between two cores falls on a memory controller. This bundle is what Intel calls 'Smart Memory Access' (SMA). Along with memory management, SMA also resolves memory locations and adjusts internal pointers, so that when a core finds an instruction to jump elsewhere in the code, or to fetch an instruction or data that has already been cached, it can fetch directly from the L2 or the main-memory location instead of reloading it. This speeds up out-of-order execution.

The speed arises from the fact that instructions that are independent of each other can be executed as soon as they have been decoded, without waiting for the sequence of instructions before them to finish. Thus, a block of code that loads two values and stores them elsewhere without any processing, even with other logical code in between, can be reordered so that the LOAD/STORE pairs execute independently and many clock cycles earlier.
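A contrived C sketch of that reordering idea (the function and names are illustrative, not from any particular codebase): the two load/store pairs below touch disjoint data and carry no dependency on each other, so an out-of-order core is free to overlap them rather than finish the first pair before starting the second.

void copy_two(const int *src_a, int *dst_a,
              const int *src_b, int *dst_b)
{
    int a = *src_a;  /* LOAD A: independent of the B pair       */
    int b = *src_b;  /* LOAD B: can issue before A's store ends */
    *dst_a = a;      /* STORE A                                 */
    *dst_b = b;      /* STORE B                                 */
}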

Intel's vPro technology lets IT admins create system partitions for troubleshooting, maintenance and inventory management

AMD Direct Connect: AMD solves the problem of managing the L2 cache in multi-core or multi-processor systems by doing away with the need for an FSB and talking directly to the various components through something called the HyperTransport link. Taken together, this architecture is called 'Direct Connect' (DCA). Each processor built around DCA has an integrated memory controller and is HyperTransport-enabled, and each is linked to a specific portion of memory.

When one CPU needs data that sits in the memory linked to the other processor, it uses HyperTransport to reach that processor and its memory. This linkage is called Coherent HyperTransport.

Memory access speeds for processor transactions are boosted when you put the controller on the processor die itself. But in a multi-processor system, adding more processors also adds more memory controllers. To handle access overlaps, violations and race conditions, AMD uses NUMA (Non-Uniform Memory Access), its counterpart to Intel's SMP (Symmetric Multi-Processing) approach. NUMA deviates from SMP in that memory access is asymmetric: memory attached to a processor's own controller is faster to reach than memory attached to another processor's.


Although NUMA and DCA were originally used with multiple processors rather than multiple cores, they apply to a multi-core system just as easily.
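To see what NUMA awareness looks like from software, here is a minimal sketch using Linux's libnuma (an OS-level library, not an AMD-specific API; link with -lnuma). It places a buffer on one node, i.e. the memory bank behind one processor's integrated controller, so code running on that node reaches it locally while other nodes go through the coherent link.

#include <stdio.h>
#include <numa.h>  /* libnuma; link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    printf("highest NUMA node: %d\n", numa_max_node());

    /* Allocate 1 MB on node 0: local to the CPUs on that node. */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return 1;

    /* ... touch buf from a thread pinned to node 0 for local access ... */

    numa_free(buf, size);
    return 0;
}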

Multimedia instructions

The Pentium class of processors gave us MMX; SSE (and AMD's equivalent, 3DNow!), SSE2 and SSE3 are descendants of that now-defunct MMX instruction set. The latest SIMD units take this one more evolutionary step forward: 128-bit SSE instructions traditionally take two clock cycles each to execute, and the new units let these instructions complete in one clock, doubling their throughput.

In addition to all that, SSE added about 70 new instructions to process packed floating-point data, control memory without polluting the cache, and extend the MMX instruction set. This lets your application crunch several data items truly in parallel. The expanded instruction set is called 'Streaming SIMD Extensions' because it works best when data streams into the processor, letting you do jobs like encoding video and other multimedia data faster.
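A minimal C sketch using the standard SSE intrinsics (GCC/Clang shown; the aligned attribute is compiler-specific): one 128-bit instruction adds four packed floats at once, and _mm_stream_ps is the 'streaming' store that writes results to memory without polluting the cache.

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void)
{
    float a[4]   __attribute__((aligned(16))) = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4]   __attribute__((aligned(16))) = { 10.0f, 20.0f, 30.0f, 40.0f };
    float out[4] __attribute__((aligned(16)));

    __m128 va   = _mm_load_ps(a);        /* load 4 packed floats            */
    __m128 vb   = _mm_load_ps(b);        /* load 4 packed floats            */
    __m128 vsum = _mm_add_ps(va, vb);    /* 4 additions in one instruction  */
    _mm_stream_ps(out, vsum);            /* non-temporal (streaming) store  */
    _mm_sfence();                        /* make the streamed store visible */

    printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
    return 0;
}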

Intel vPro

If you've heard talk of system-level partitions, with each partition able to run an isolated operating system and software, and of your IT department managing your system remotely through hidden or access-restricted areas of it while you carry on working in your own section... vPro is the technology that is set to make that a reality.

With vPro, your IT department can remotely monitor, diagnose and repair your desktop even when it is switched off. The system can also send out its configuration information, such as which cards are installed and how the BIOS has been configured. This improves asset tracking and inventory, since no one needs to visit each computer to audit it, install specific software, or open (desktop) firewall ports to do so.

Additionally, vPro lowers power usage by turning off processor functions when they are not in use. Yes, vPro can switch off specific portions inside the processor when they are idle, rather than turning an entire core on or off.

New in virtualization

This is a big topic outside the discussion on processors as well. The reason it is hot is well known: the more applications you can put into a single box (usually by cramming more virtual machines into it), the lower your cost of ownership and running costs, and the better your cost efficiency. But the biggest limiting factor so far has been that, while virtualization has been available in the non-x86 world for some time, it has been primarily software-based in the x86 realm.

The VMwares, Xens and Virtual Server/PCs of the world have ruled this area. However, the performance gained while the hardware does not inherently support virtualization is not that great; the traditional way to boost virtual machine performance has been to jack up the amount of RAM and the processor speed.

Intel's vPro and VT (Virtualization Technology) and AMD's AMD-V are rather big steps toward taking the x86 into a fully virtualizable environment. We have discussed vPro above; let's take some time to understand VT and AMD-V.

Intel VT: Codenamed 'Vanderpool', it adds hardware support that lets a virtualization software layer control the processing actually being done inside a virtual machine. This allows such software to monitor virtual systems and marshal their resource (processor and memory) usage. Intel VT appears not just in x86 desktop processors (as VT-x) but also in the Itanium family (as VT-i).

Three new features in the 32-bit version (called VT-x) are: a more coordinated way of dealing with blocked NMIs (Non-Maskable Interrupts) when a guest OS exits its VM; virtual-processor IDs that the VMM assigns and the processor uses to tag address-translation buffer entries per VM; and separate page tables for host and guest OSs instead of a common set.

The Xen hypervisor gets extra support from the Intel VT architecture, letting it expose processor resources and configuration to its guest software for better virtualized performance. This includes presenting all the real processor information bits (CPUID) to the guest, with the exception of the VMX and MCA feature bits.
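Software can ask the processor whether VT-x is present before trying to use it. A minimal C sketch (using the GCC/Clang cpuid.h helper; CPUID leaf 1 reports the VMX capability in ECX bit 5):

#include <stdio.h>
#include <cpuid.h>  /* GCC/Clang __get_cpuid helper */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }
    /* CPUID.01H:ECX bit 5 is the VMX (VT-x) feature flag. */
    printf("VT-x (VMX): %s\n",
           (ecx & (1u << 5)) ? "supported" : "not supported");
    return 0;
}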

AMD-V: This is a step onward from AMD's 64-bit architecture (AMD64). AMD-V adds two modes of operation to the systems it runs on: host mode and guest mode. Also added is a new instruction, VMRUN, that lets a guest OS and its applications run a little faster inside a VM. An AMD-V processor initially boots with its new VM capabilities disabled (the 'guest' mode) until a compatible VMM (VM manager) is detected.

Once such a VMM is detected, the processor switches to 'host' mode and turns on all its capabilities. In host mode, an AMD-V processor offers a number of interesting capabilities, like specifying at the hardware level the kinds of resources the system software is allowed to access. Using these instructions, one can even give different VMs exclusive access to different resources (like network interfaces).
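The companion check for AMD-V: the SVM capability is reported in CPUID extended leaf 0x80000001, ECX bit 2 (again a minimal sketch using the GCC/Clang helper):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "extended CPUID leaf not supported\n");
        return 1;
    }
    /* CPUID.80000001H:ECX bit 2 is the SVM (AMD-V) feature flag. */
    printf("AMD-V (SVM): %s\n",
           (ecx & (1u << 2)) ? "supported" : "not supported");
    return 0;
}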

As we continue adding cores to processors, and more such processors to our computer systems, their technologies will naturally advance to let multiple cores access what has so far been a resource dedicated to a single core. Next month, we will demystify the various processor codenames in the multi-core domain.

Source: PCQuest
