

# MOTOROLA THAWS COLDFIRE V4

Integrated CF5407 Chip Beats Award-Winning CF5307

By Tom R. Halfhill {5/15/00-01}

It's coming a year later than Motorola had hoped, but the CF5407, the first standard chip based on the ColdFire V4 core, is a significant improvement over the two-year-old CF5307. It delivers three times the raw performance, twice as many mips per megahertz, and nearly four times as many mips per watt. And the 5407 is almost pin compatible with the 5307, requiring only a lower Vcc supply (1.8V) and different clock inputs for its core, so developers can make boards that work with either chip.

Wisely, Motorola didn't tamper much with the basic design. As Figure 1 shows, the 5407 has the same integrated peripherals as its popular and proven predecessor (see *MPR 2/16/98-02*, "5307 Brings Clock Doubling to ColdFire"). But by redesigning the execution pipeline, tripling the amount of cache, and enhancing some of the peripherals, Motorola was able to boost the performance and utility of an already versatile chip—the 5307 won our Best Integrated Processor award for 1998 (see *MPR 1/25/99-02*, "Integration, Diversification Key in 1998").

The 5407 is sampling now and is scheduled for production in 3Q00. It's bargain-priced at \$19.95, only \$5 more than the current price for the much slower 5307. Embedded developers who remain devoted to the 68K architecture should bond quickly with the new chip, which is compatible enough with the 68K to make a transition to ColdFire relatively painless.

## CISC in RISC Clothing

Motorola describes the 32-bit ColdFire architecture as "variable-length RISC," a marketing ploy that tries to disassociate ColdFire from the unfashionable CISC architecture of the 68K. Although ColdFire is a clever overhaul of the 68K that combines some of the efficiencies of RISC with the variable-length instructions of CISC, it is not RISC. ColdFire retains some of the complex instructions and addressing modes of a classic CISC architecture. But the instructions come in only three sizes (16, 32, and 48 bits) and typically execute in a single cycle, and they have enough in common with 68K instructions to breed content with familiar developers.

**Figure 1.** Motorola's ColdFire CF5407 is very similar to the CF5307; new or improved features are highlighted in purple.

ColdFire has never been truly 68K compatible, however. To lure more 68K diehards to ColdFire, V4 adds several new instructions that restore some 68K functions purged from earlier versions of ColdFire. For instance, the Bcc (branch conditionally), BRA (branch always), and BSR (branch to subroutine) instructions now support 32-bit longword fields for jumps to distant addresses—a convenience inherited from the 68020. Another resurrected 68K instruction is TAS (test and set N and Z condition codes), which is intended for implementing semaphores.
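To illustrate why TAS suits semaphores, here is a minimal Python sketch of the instruction's data flow. It assumes the 68K definition of TAS (test the byte, set N and Z from it, then set the byte's high-order bit), a detail the paragraph above does not spell out. The names `Memory`, `tas`, `try_acquire`, and `release` are hypothetical, and a single-threaded model cannot show the indivisible read-modify-write that makes the real instruction useful.

```python
# Toy model of a TAS-based semaphore. All names here are illustrative
# assumptions, not part of any Motorola toolchain.

class Memory:
    """A flat, byte-addressable memory."""
    def __init__(self, size):
        self.bytes = bytearray(size)

def tas(mem, addr):
    """Model TAS: test the byte, report (N, Z), then set bit 7.

    On real hardware this read-modify-write is indivisible; this
    single-threaded model only shows the data flow.
    """
    old = mem.bytes[addr]
    n = bool(old & 0x80)      # N: high-order bit was already set
    z = (old == 0)            # Z: byte was zero
    mem.bytes[addr] = old | 0x80
    return n, z

def try_acquire(mem, addr):
    """Return True if the semaphore was free and is now held by the caller."""
    n, _z = tas(mem, addr)
    return not n              # N clear means nobody held it before us

def release(mem, addr):
    """Free the semaphore by clearing the byte."""
    mem.bytes[addr] = 0

mem = Memory(16)
print(try_acquire(mem, 0))    # True: semaphore was free
print(try_acquire(mem, 0))    # False: already held
release(mem, 0)
print(try_acquire(mem, 0))    # True: free again
```

Because the test and the set happen in one instruction, a second requester can never see the byte as free after the first has claimed it.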

Genuinely new instructions in V4 include three moves, a 32-bit saturating arithmetic instruction, and the oddly named INTOUCH instruction, which doesn't phone home but does fill a 16-byte block in the instruction cache. Table 1 summarizes the additions; for more details, refer to our previous article about Motorola's ColdFire V4 presentation at Microprocessor Forum in 1998 (see *MPR 10/26/98-06*, "ColdFire Doubles Performance With V4").
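The new moves and the saturating instruction are easiest to understand as small functions on 32-bit values. The Python sketch below models MVZ, MVS, and SATS. The MVZ and MVS behavior follows directly from their names; the SATS behavior (clamp to the most positive or most negative value when the overflow flag is set) is an assumption based on the usual meaning of signed saturation. The helper `add32` is hypothetical and exists only to produce an overflow flag.

```python
MASK32 = 0xFFFFFFFF

def mvz(value, bits):
    """Move with zero-fill: keep the low `bits` bits, clear the rest."""
    return value & ((1 << bits) - 1)

def mvs(value, bits):
    """Move with sign extend: copy the source's top bit up through bit 31."""
    low = value & ((1 << bits) - 1)
    sign = 1 << (bits - 1)
    return ((low ^ sign) - sign) & MASK32

def add32(a, b):
    """Hypothetical helper: 32-bit add returning (wrapped result, overflow)."""
    r = (a + b) & MASK32
    v = bool(~(a ^ b) & (a ^ r) & 0x80000000)
    return r, v

def sats(result, overflow):
    """Signed saturate if overflow (assumed semantics).

    After a signed overflow the wrapped result has the wrong sign,
    so clamp to the extreme of the opposite sign.
    """
    result &= MASK32
    if not overflow:
        return result
    return 0x7FFFFFFF if result & 0x80000000 else 0x80000000

print(hex(mvz(0x1234ABCD, 16)))            # 0xabcd
print(hex(mvs(0x80, 8)))                   # 0xffffff80
print(hex(sats(*add32(0x7FFFFFFF, 1))))    # 0x7fffffff
```

Without MVS and MVZ, widening a byte or word to a longword takes two instructions (a move plus an extend or a mask).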

The V4 core is more efficient than earlier ColdFire cores because of the overhauled pipeline, an unusual instruction-folding technique, and better branch prediction. While the V2 and V3 pipelines could execute only register-to-register and store instructions in a single cycle, the V4 executes virtually all instructions, including those that involve memory operands, at a rate of one per cycle. The pipeline is divided into two sections, decoupled by an eight-entry FIFO buffer. The first section, which is four stages long, prefetches instructions. The second section, five stages long, has a pair of two-stage compute engines. One engine at the front of the pipeline typically handles address generation, while the other executes instructions. The address-generation engine allows some types of instructions to execute early, so their results are available to subsequent instructions without stalls.

The instruction-folding technique enables limited parallel execution without the extra logic of dual-issue superscalar pipelines. The second section of the V4 pipeline automatically folds certain pairs of instructions into a single-cycle operation. For example, it combines MOV.l \<mem\>,Rx and ADD.l Ry,Rx to create ADD.l \<mem\>,Ry,Rx. Motorola says these kinds of instruction pairs occur frequently in embedded programs. Programmers writing in assembly language could make sure folding happens by deliberately pairing those kinds of instructions inside critical loops.

| Instruction | Description                  | Change                    | 68K |
|-------------|------------------------------|---------------------------|-----|
| Bcc         | Branch conditionally         | Supports 32-bit longwords | •   |
| BRA         | Branch always                | Supports 32-bit longwords | •   |
| BSR         | Branch to subroutine         | Supports 32-bit longwords | •   |
| CMP         | Compare                      | Supports bytes, words     | •   |
| CMPI        | Compare immediate            | Supports bytes, words     | •   |
| INTOUCH     | Load 16-byte I-cache block   | New in ColdFire V4        |     |
| MOV3Q       | Move 3-bit data quick        | New in ColdFire V4        |     |
| MOVE.x      | Move data to d16(Ax)         | Supports bytes, longwords | •   |
| MVS.{b,w}   | Move with sign extend        | New in ColdFire V4        |     |
| MVZ.{b,w}   | Move with zero-fill          | New in ColdFire V4        |     |
| SATS        | Signed saturate if overflow  | New in ColdFire V4        |     |
| TAS         | Test and set condition codes | New in ColdFire V4        | •   |

**Table 1.** The ColdFire V4 core has 12 new or improved instructions, including 7 from the 68K instruction set.
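The folding rule described above can be sketched as a peephole pass over an instruction list. This is only a software analogy: the V4 folds instructions in hardware inside the pipeline, and the exact pairing conditions are not given here. The `Insn` type, the `fold` function, the parenthesized notation for memory operands, and the extra guard that Ry must differ from Rx are all assumptions made for illustration. Mnemonics are written with the `.l` longword suffix.

```python
from typing import NamedTuple

class Insn(NamedTuple):
    op: str           # e.g. "MOV.l" or "ADD.l"
    operands: tuple   # source(s) first, destination last

def fold(program):
    """Fold MOV.l <mem>,Rx followed by ADD.l Ry,Rx into ADD.l <mem>,Ry,Rx."""
    out = []
    i = 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (nxt is not None
                and cur.op == "MOV.l" and nxt.op == "ADD.l"
                and len(cur.operands) == 2 and len(nxt.operands) == 2
                and cur.operands[0].startswith("(")        # memory source (toy test)
                and cur.operands[1] == nxt.operands[1]     # both write the same Rx
                and nxt.operands[0] != nxt.operands[1]):   # Ry must not be Rx
            mem, rx = cur.operands
            ry = nxt.operands[0]
            out.append(Insn("ADD.l", (mem, ry, rx)))
            i += 2                                         # two instructions, one slot
        else:
            out.append(cur)
            i += 1
    return out

program = [
    Insn("MOV.l", ("(A0)", "D0")),
    Insn("ADD.l", ("D1", "D0")),    # folds with the load above
    Insn("MOV.l", ("(A1)", "D2")),
    Insn("ADD.l", ("D3", "D4")),    # different destination: no fold
]
for insn in fold(program):
    print(insn.op, ",".join(insn.operands))
```

In this toy program, four instructions occupy three issue slots after folding, which is the kind of saving an assembly programmer could aim for by placing a load and its dependent add back to back.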

Like the 68060, the V4 uses a branch target cache to store the target addresses of branch instructions. Unlike the 68060, the V4 also has a branch history table for more accurate dynamic branch prediction. Correct predictions pay off with zero-latency branches.
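The text above does not say how the V4's branch history table is organized. As a rough illustration of how a history table and a target cache work together, the sketch below pairs a target store with a table of two-bit saturating counters, a common textbook scheme. The class name, table size, indexing, and counter width are all assumptions and should not be read as a description of the actual V4 hardware.

```python
class BranchPredictor:
    """Branch target store plus a table of 2-bit saturating counters (assumed design)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.counters = [1] * entries   # 0-1 predict not taken, 2-3 predict taken
        self.targets = {}               # branch address -> last taken target

    def _index(self, pc):
        return (pc >> 1) % self.entries  # instructions are 16-bit aligned

    def predict(self, pc):
        """Return the predicted target, or None for 'not taken or unknown'."""
        if self.counters[self._index(pc)] >= 2 and pc in self.targets:
            return self.targets[pc]
        return None

    def update(self, pc, taken, target=None):
        """Train the predictor with the branch's actual outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.targets[pc] = target
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def run_loop(bp, pc, target, iterations):
    """Simulate a loop-closing branch: taken every time except the last."""
    correct = 0
    for n in range(iterations):
        taken = n < iterations - 1
        predicted = bp.predict(pc)
        if (predicted == target) == taken:
            correct += 1
        bp.update(pc, taken, target if taken else None)
    return correct

bp = BranchPredictor()
print(run_loop(bp, 0x100, 0x80, 10))   # first pass: predictor is still cold
print(run_loop(bp, 0x100, 0x80, 10))   # second pass: history helps
```

The history counters let the predictor survive the single not-taken outcome at loop exit without forgetting that the branch is usually taken, which a target cache alone cannot express.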

All those core enhancements and an 80% higher clock frequency contribute mightily to the 5407's performance advantage over the 5307. So do the triple-size Harvard-architecture caches. Whereas the older chip has an 8K unified cache, the new chip has a 16K instruction cache and an 8K data cache, both four-way set-associative. As before, half of the caches are lockable for real-time applications that require deterministic cache behavior.
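The cache sizes and associativity above determine how an address maps into each cache. The sketch below works out the number of sets and splits an address into tag, index, and offset. The 16-byte line size is an assumption inferred from the 16-byte I-cache block that INTOUCH loads; the function names are hypothetical.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)

def split_address(addr, sets, line_bytes):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset

LINE = 16                                   # assumed line size in bytes
icache_sets = cache_sets(16 * 1024, 4, LINE)
dcache_sets = cache_sets(8 * 1024, 4, LINE)
print(icache_sets, dcache_sets)             # 256 128

tag, index, offset = split_address(0x12345678, icache_sets, LINE)
print(hex(tag), hex(index), hex(offset))    # 0x12345 0x67 0x8
```

Under that line-size assumption, any four lines whose addresses share a set index can coexist in the instruction cache before one must be evicted.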

Motorola also improved the flexibility of the on-chip memory. The 5407, like its predecessor, has 4K of SRAM in addition to the caches. But the 5407's SRAM is divided into a pair of 2K blocks. Programmers can use the blocks as zero-latency memory for 2K of instructions and 2K of data, or they can map both blocks as either 4K of instruction or 4K of data memory.

## Minor Changes to Peripherals

The real attraction of integrated devices like the 5407 is their complement of on-chip peripherals, because equally capable CPUs are available almost anywhere. Motorola didn't rock the boat, preferring to stick with virtually the same level of integration found in the 5307.

Like its predecessor, the 5407 has a memory controller (which supports SDRAM, EDO, and FPM RAM), two UARTs, eight chip selects, four DMA channels, two 16-bit timers, 16 general-purpose I/O ports, an I<sup>2</sup>C serial interface, and a debug module. Motorola improved one of the UARTs, allowing it to function as either a synchronous or asynchronous interface (USART). The debug module has three additional program-counter breakpoint registers plus another set of address-range and data breakpoints, among other enhancements. It also allows the chip to service normal interrupt requests while debugging. To handle the 5407's higher clock rate, the clock circuit supports more divider ratios, so the I/O bus can run at ratios of 1:3, 1:4, 1:5, or 1:6 relative to the core frequency. (The slowest possible bus frequency is 25MHz, because the core can run at 150MHz.) Table 2 compares the features of the 5407 and 5307.
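The divider arithmetic is simple enough to check directly. The short sketch below lists the bus frequency for each divider at the two core clocks mentioned in this article, 162MHz and 150MHz; the function name is hypothetical.

```python
def bus_frequencies(core_mhz, dividers=(3, 4, 5, 6)):
    """Map each I/O bus divider to the resulting bus frequency in MHz."""
    return {d: core_mhz / d for d in dividers}

for core in (162, 150):
    for divider, bus in bus_frequencies(core).items():
        print(f"core {core}MHz, 1:{divider} -> bus {bus:g}MHz")
```

At 162MHz the bus runs between 27MHz and 54MHz; the 25MHz floor quoted above corresponds to a 150MHz core with the 1:6 divider.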

An overdue process shrink accounts for the remaining differences between these two chips. Motorola still fabricates the 3.3V 5307 on an ancient 0.35-micron three-layer-metal process. The 5407 graduates to Motorola's HIP4, a 0.22-micron four-layer-metal (aluminum) process, which drops the core voltage to 1.8V (with 3.3V I/O) and boosts the clock to 162MHz. Motorola also plans to make a faster version of the chip that will run at 200MHz or better.

At 162MHz, the 5407 delivers 257 mips (Dhrystone 2.1), which easily beats the 5307's score of 75 mips at its maximum clock rate of 90MHz. More illuminating is the 5407's performance efficiency: it racks up 1.6 mips per megahertz, compared with 0.8 mips per megahertz for the 5307. The 5407 would deliver only 135 mips if its performance scaled on a linear slope with clock frequency, so apparently the V4 architectural enhancements were worthwhile.
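The efficiency figures and the 135-mips projection follow from simple ratios. The sketch below reproduces them from the Dhrystone scores and clock rates quoted above; the function names are hypothetical.

```python
def mips_per_mhz(mips, mhz):
    """Performance efficiency: Dhrystone mips delivered per megahertz."""
    return mips / mhz

def linear_projection(mips, from_mhz, to_mhz):
    """Performance expected if mips scaled linearly with clock frequency."""
    return mips * to_mhz / from_mhz

eff_5407 = mips_per_mhz(257, 162)
eff_5307 = mips_per_mhz(75, 90)
projected = linear_projection(75, 90, 162)

print(f"5407: {eff_5407:.2f} mips/MHz, 5307: {eff_5307:.2f} mips/MHz")
print(f"5307 scaled to 162MHz: {projected:.0f} mips")
print(f"gain beyond clock scaling: {257 / projected:.2f}x")
```

The last ratio isolates what the clock increase alone cannot explain: the 5407 is about 1.9 times faster than a 5307 would be at the same frequency, which is the contribution of the V4 core and the larger caches.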

One tradeoff, however, is that the additional logic and larger caches increased the gate count by a factor of 1.7, which prevents the 5407 from realizing some other benefits of a smaller IC process and lower core voltage. The 5407's die size is 28mm<sup>2</sup>, which is actually 1mm<sup>2</sup> larger than the 5307, despite the 35% linear process shrink. And although Motorola estimates that the 5407 will consume less than 700mW (typical), that's not a big improvement over the 5307's 792mW. Normally, one would expect that migrating the design from 0.35 to 0.22 microns and from 3.3V to 1.8V, without adding more peripherals, would yield a smaller die and a greater reduction in power consumption.

| Feature                   | CF5407           | CF5307           |
|---------------------------|------------------|------------------|
| Core Architecture         | ColdFire V4      | ColdFire V3      |
| Core Frequency            | 162MHz           | 66MHz, 90MHz     |
| Core Voltage              | 1.8V             | 3.3V             |
| I/O Voltage               | 3.3V             | 3.3V             |
| I/O Bus Width             | 32 bits          | 32 bits          |
| I/O Bus Dividers          | 3, 4, 5, 6       | 2, 3, 4          |
| L1 Cache (I/D)            | 16K/8K           | 8K unified       |
| L1 Cache Locking?         | Yes (half)       | Yes (half)       |
| On-chip SRAM              | 2 x 2K           | 1 x 4K           |
| DRAM Control              | SDRAM, EDO, FPM  | SDRAM, EDO, FPM  |
| UARTs                     | 1 async, 1 sync  | 2 async          |
| Chip Selects              | 8                | 8                |
| DMA                       | 4 channels       | 4 channels       |
| Timers (16-bit)           | 2                | 2                |
| GPIO                      | 16               | 16               |
| MAC & HW Divide?          | Yes              | Yes              |
| Debug Module              | Revision C       | Revision B       |
| I<sup>2</sup>C Interface? | Yes              | Yes              |
| IC Process                | 0.22µ 4LM        | 0.35µ 3LM        |
| Die Size                  | 28mm<sup>2</sup> | 27mm<sup>2</sup> |
| Package                   | 208-pin QFP      | 208-pin QFP      |
| Power (Typical)           | <700mW*          | 792mW            |
| Dhrystone 2.1             | 257 mips         | 75 mips @ 90MHz  |
| Perf Efficiency           | 1.6 mips/MHz     | 0.8 mips/MHz     |
| Power Efficiency          | 367 mips/W*      | 94.6 mips/W      |
| Price Efficiency          | 12.8 mips/\$     | 4.4 mips/\$      |
| Price (10K)               | \$19.95          | \$14.95          |
| Production                | 3Q00             | Now              |

**Table 2.** The 5407 is significantly faster and more efficient than the two-year-old 5307. (\*Based on Motorola's estimate.)
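A back-of-the-envelope calculation shows what "normally" might mean here. The sketch below applies two first-order rules of thumb that are assumptions of this example, not figures from Motorola: die area scales with the square of the feature size, and dynamic power scales with the square of the core voltage times the clock frequency. It ignores changes in capacitance, leakage, I/O power, and pad-limited area, so the results are rough expectations only.

```python
def ideal_area(area_mm2, old_um, new_um, gate_factor=1.0):
    """Die area after a shrink, assuming area scales with feature size squared."""
    return area_mm2 * (new_um / old_um) ** 2 * gate_factor

def ideal_dynamic_power(power_mw, old_v, new_v, old_mhz, new_mhz):
    """Power after a voltage and clock change, assuming P is proportional to V^2 * f."""
    return power_mw * (new_v / old_v) ** 2 * (new_mhz / old_mhz)

same_design = ideal_area(27, 0.35, 0.22)
with_more_gates = ideal_area(27, 0.35, 0.22, gate_factor=1.7)
power = ideal_dynamic_power(792, 3.3, 1.8, 90, 162)

print(f"5307 shrunk unchanged:    {same_design:.1f} mm^2")
print(f"with 1.7x the gate count: {with_more_gates:.1f} mm^2")
print(f"first-order power at 1.8V, 162MHz: {power:.0f} mW")
```

Even after allowing for 1.7 times as many gates, this idealized model lands well under the actual 28mm<sup>2</sup>, which is consistent with the observation that something other than the added logic is keeping the die large.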

The larger caches are the biggest reason that the die didn't shrink dramatically. Another reason is starkly visible in Figure 2, the die photo. ColdFire is the only family of processors from Motorola that's entirely synthesized from high-level models with automated design tools. There's no custom circuit layout at all. Compiled chips are bigger, slower, and less power-efficient than full-custom designs, but they are much quicker and cheaper to create. Where a hand-packed design typically has neat blocks of function units inside a Piet Mondrian grid of buses, the 5407 has an amorphous mass of compiler-generated circuits on a Jackson Pollock canvas of silicon. The only semblance of order comes from the caches and on-chip memories around the periphery of the die. They're compiled too, but SRAM arrays obediently fall into dense rows and columns, even without a guiding hand.

Fortunately, the mess of logic circuitry isn't as inefficient as it appears. Based on Motorola's upper-range power-consumption estimate of 700mW, the 5407 delivers a whopping 367 mips per watt, nearly four times better than the 5307's 94.6 mips per watt. Beauty is in the eye of the beholder, but performance can be measured.

**Figure 2.** The 5407's die size is 28mm<sup>2</sup>, which is 1mm<sup>2</sup> larger than the 5307. Note the core's lack of custom circuit layout. Motorola compiled it with high-level automation tools, creating a "sea of gates" with no distinguishable function units or buses. Only the SRAM arrays appear as neat blocks. Motorola makes all ColdFire chips this way.

## Price & Availability

Motorola is sampling the 162MHz CF5407 now and plans to ramp up production in 3Q00. The price, in 10,000-unit quantities, is \$19.95.

## ColdFire in the Middle

With M-Core anchoring the low end of Motorola's embedded-processor line, and PowerPC dominating the high end, ColdFire and the 68K fill the vast hole in the middle. ColdFire chips serve in a wide range of applications, including industrial equipment, test and measurement devices, communications, printers, disk drives, consumer electronics, set-top boxes, and musical instruments. Some networking applications can use ColdFire chips, but Motorola has better solutions in its arsenal: the QUICC and PowerQUICC series and the C-5 network processor that came with Motorola's recent acquisition of C-Port (see *MPR 3/6/00-03*, "Motorola Buys C-Port: Smart Move").

Except for its lower core voltage and different clock inputs, the 5407 is virtually a drop-in replacement for the 5307 that offers much better performance. And it costs only \$5 more (although Motorola may cut the price of the 5307 again by the time the new chip appears). As Table 3 shows, the 5407 also competes strongly against a representative sample of other 32-bit embedded processors in its price range. It's practically a no-brainer for developers already committed to the 5307 or other ColdFire chips.

ColdFire's biggest competition is probably Motorola's own 68K, which continues to score new design wins. There's no chance the 5407 will dethrone the 68K. For one thing, the 68K line has processors with true superscalar pipelines, FPUs, and MMUs. Those are valuable features that Motorola may add to future versions of ColdFire but is withholding for now. Also, as mentioned above, ColdFire isn't completely compatible with the 68K.

To ease the transition to ColdFire, Motorola offers free migration tools on its Web site (*www.motorola.com/coldfire*). One tool is an emulator that traps unsupported 68K instructions and executes them in software. Another tool is a static translator that converts 68K assembly language into ColdFire assembly language. (Both tools are from MicroAPL, which offers technical support for \$500.)

ColdFire is an important part of Motorola's FlexCore program, a cell library and design system for ASIC customers. Though ColdFire is not as broadly licensed as the cores from ARC Cores, ARM, Lexra, MIPS, and Tensilica, FlexCore opens the door to custom chips for special applications. In fact, Motorola has designed more ASICs than standard parts using ColdFire cores. Three V4-based ASICs are currently in development, including two that will be manufactured in 0.15- or 0.18-micron processes.

Despite its lengthy gestation, the 5407 is a worthy descendant of the 5307 and a hint of better things to come. Motorola's roadmap for ColdFire anticipates true superscalar execution, superpipelining, and possibly the floating-point capability that currently prevents the architecture from winning some designs against the 68K and other rivals, including Motorola's own PowerPC.

| Feature          | Motorola CF5407  | IDT 32364        | National 486SXL | AMD 486DX5       | IBM 403GCX         |
|------------------|------------------|------------------|-----------------|------------------|--------------------|
| Architecture     | ColdFire         | MIPS             | x86             | x86              | PowerPC            |
| Clock Rate       | 162MHz           | 133MHz           | 25MHz           | 133MHz           | 66MHz              |
| L1 Cache (I/D)   | 16K/8K           | 8K/2K            | 1K unified      | 16K unified      | 16K/8K             |
| FPU?             | No               | No               | No              | Yes              | No                 |
| MMU?             | No               | Yes              | No              | Yes              | Yes                |
| Bus Width        | 32 bits          | 32 bits          | 16 bits         | 32 bits          | 32 bits            |
| Bus Frequency    | 25–54MHz         | 66MHz            | 25MHz           | 33MHz            | 33MHz              |
| MIPS             | 257 mips         | 175 mips         | 12 mips         | 57 mips          | 81 mips            |
| Voltage (I/O)    | 3.3V             | 3.3V             | 5V              | 3.3V             | 3.3V               |
| Power (typ)      | <700mW*          | 850mW            | 600mW           | 2.75W            | 400mW              |
| Perf Efficiency  | 1.6 mips/MHz     | 1.3 mips/MHz     | 0.5 mips/MHz    | 0.4 mips/MHz     | 1.2 mips/MHz       |
| Power Efficiency | 367 mips/W*      | 206 mips/W       | 20 mips/W       | 21 mips/W        | 203 mips/W         |
| Price Efficiency | 12.9 mips/\$     | 14 mips/\$       | 0.6 mips/\$     | 2.1 mips/\$      | 4.8 mips/\$        |
| IC Process       | 0.22µ 4M         | 0.35µ 3M         | 0.65µ 3M        | 0.35µ 3M         | 0.45µ 3M           |
| Die Size         | 28mm<sup>2</sup> | 29mm<sup>2</sup> | n/a             | 43mm<sup>2</sup> | 47.4mm<sup>2</sup> |
| Availability     | 3Q00             | Now              | Now             | Now              | Now                |
| Price (10K)      | \$19.95          | \$12.50          | \$20            | \$26.80          | \$17               |

**Table 3.** In this sample of comparably priced 32-bit embedded processors, the 5407 clearly excels in both raw performance and price/performance. The 5407 and IBM's 403GCX look even better in this comparison when their on-chip peripherals are considered, because the other chips are not as highly integrated. (\*Based on Motorola's estimate.)

Reducing the differentiation between ColdFire and PowerPC seems like a strange product strategy, though. PowerPC's superior RISC architecture gives it a clear advantage at the high end, but Motorola's roadmap appears to put ColdFire on a collision course with PowerPC's low end. Perhaps ColdFire's near-compatibility with the 68K will be enough to differentiate it from PowerPC in the minds of customers. Then Motorola's only challenge will be differentiating the superscalar, superpipelined, floating-point ColdFire chips from the superscalar, superpipelined, floating-point 68K chips. It will take some deft juggling to manage all those architectures.

To subscribe to Microprocessor Report, phone 408.328.3900 or visit www.MDRonline.com


© MICRODESIGN RESOURCES

MAY 15, 2000