

# TENSILICA TWEAKS XTENSA

Xtensa LX3 and Xtensa 8 Cores Boost Performance, Cut Power

By Tom R. Halfhill {11/30/09-01}

Tensilica is introducing two new versions of its configurable embedded-processor cores: the Xtensa LX3 and Xtensa 8. In addition to having new features, they are generally smaller and faster than their predecessors and use less power when fabricated in the same CMOS process. Tensilica began delivering the cores to licensees in October and November.

The new processors supersede the Xtensa LX2 and Xtensa 7 cores introduced three years ago. (See *MPR 12/4/06-02*, "Tensilica Upgrades Xtensa Cores.") These 32-bit cores are the base configurations underlying Tensilica's entire product line, including the Diamond Standard preconfigured cores and the ConnX Baseband Engine (ConnX BBE) introduced last summer. (See *MPR 3/20/06-01*, "Tensilica's Preconfigured Cores," and *MPR 8/10/09-01*, "Tensilica Plays Baseband.")

Tensilica's configurable-processor technology, the best in the industry, lets chip designers customize the base configurations of the cores for specific applications. Many optional features are available as simple check-box choices in Tensilica's graphical processor-configuration tools. Developers can create their own extensions in Tensilica Instruction Extension (TIE) language, a proprietary HDL. Better yet, Tensilica's Xtensa PRocessor Extension Synthesis (XPRES) tool can automatically generate custom extensions by analyzing application software written in C/C++. (See *MPR* 7/10/04-01, "Tensilica's Automaton Arrives.")

Ultimately, Tensilica's Processor Generator outputs the synthesis-ready RTL. From that point, the development flow is conventional and uses industry-standard design-automation tools. By upgrading the Processor Generator to use the Xtensa LX3 and Xtensa 8 cores, Tensilica is propagating the new features and improved performance throughout its growing product line.

The Xtensa 8 core is the smallest base configuration of Tensilica's Xtensa architecture and is intended primarily for 32-bit microcontrollers. It's a relatively minor improvement over the Xtensa 7. (See sidebar, "Tensilica Debuts Xtensa 8 Core.")

The Xtensa LX3 is bigger news. It's much more configurable than Xtensa 8 and is intended primarily for data-plane processing and signal processing. New Xtensa LX3 features include ConnX 16-bit DSP extensions, a smaller version of Tensilica's Vectra LX DSP engine, a double-precision floating-point math accelerator, more system-bus options, better SystemC modeling, and code enhancements for C and C++ programmers.

Best of all, Tensilica improved both cores while trimming their gate counts, reducing power consumption, and boosting clock frequencies. Optimized for maximum clock speed, a small configuration of the Xtensa LX3 can exceed 1.0GHz when fabricated in TSMC's 45nm-GS process. Optimized for low power and slowed to 60MHz, the same configuration in the same process consumes less than a milliwatt. Developers are getting a better processor without the usual trade-offs in silicon and power.

## New Focus: Signal Processing

As we noted in our August report on ConnX BBE, Tensilica's new direction is data-plane processing in general and digital-signal processing in particular. Tensilica's general-purpose processors remain as useful as ever for control functions, but ARM, MIPS, and the Power Architecture are more prevalent in the control plane. Tensilica sees more opportunities for market growth in the data plane, especially when the application requires signal processing beyond the basic capabilities of general-purpose embedded-processor cores.

The main target for ConnX BBE is baseband processing in communications, which requires a powerful DSP. Digital audio and video are somewhat less demanding and are spreading everywhere, creating new opportunities for low-to-midrange DSPs. Tensilica has been pursuing audio and video for years (indeed, the company says most licensees are using Xtensa cores for signal processing already), but the latest strategy is more ambitious. Therefore, Tensilica needs a greater range of DSP options throughout its catalog.

Figure 1 is Tensilica's view of its DSP cores and their target markets. All the preconfigured processors on the product map were based on the Xtensa LX2 core, now superseded by the Xtensa LX3. At the low end are the 16-bit ConnX D2 DSP Engine (introduced in August) and the 24-bit HiFi 2 Audio Engine (introduced in 2005). Both can execute two multiply-accumulate (MAC) instructions per clock cycle. ConnX BBE, Tensilica's highest-end DSP core, can execute 16 MACs per cycle. Designers can build even bigger DSPs by using multiple cores and application-specific TIE blocks.

The ConnX D2 DSP Engine is essentially an Xtensa LX3 processor with dual 16-bit MACs, eight 40-bit registers, and extensions that pack two parallel operations into one 64-bit instruction word. Designers can turn the 15,000-gate Xtensa LX3 core into a 70,000-gate ConnX D2 DSP by selecting a single check-box option in Tensilica's processor-configuration tool. This DSP configuration is intended for telecommunications infrastructure equipment, VoIP telephony, computer peripherals, and consumer electronics.

Until now, the next step up the ladder was to add Tensilica's Vectra LX DSP Engine to the LX2/LX3 core. Vectra LX (recently rebranded ConnX Vectra LX) is a 16-bit fixed-point DSP engine capable of executing four MACs per cycle. Two load/store units keep the engine busy by fetching data from separate X/Y local memories, a classic DSP memory architecture. (See *MPR* 5/31/04-01, "Tensilica Tackles Bottlenecks.")

The dual-load/store Vectra LX engine remains available, but Tensilica is introducing a half-step option: the same DSP engine with a single load/store unit. Instead of separate X/Y memories, there's only one unified memory for DSP data. In fact, it's possible to omit the dedicated memory altogether and fetch DSP data from the LX3's data cache—or to use a combination of local memory and cache. Everything else in the Vectra LX engine is the same.

Although eliminating one of the load/store units may seem like a downgrade, it saves 10,000 to 20,000 gates in applications where performance is limited by the DSP engine's processing power, not by I/O throughput. (With both load/store units, the Vectra LX engine adds about 180,000 gates to the Xtensa LX3 core.) A simpler memory system also cuts costs by allowing the use of single-ported SRAM instead of dual-ported memory. With the new Vectra LX option, developers will have more flexibility to create an optimized design.
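The MACs-per-cycle figures quoted above convert to peak throughput once a clock rate is fixed. The following Python sketch does that arithmetic. The 400MHz clock is a hypothetical value chosen only for illustration, not a Tensilica specification, and sustained rates in a real design depend on memory bandwidth and instruction scheduling.

```python
# MACs per clock cycle, as stated in the article for each DSP option.
MACS_PER_CYCLE = {
    "ConnX D2": 2,
    "HiFi 2": 2,
    "ConnX Vectra LX": 4,
    "ConnX BBE": 16,
}

# Hypothetical clock rate, assumed here purely for illustration.
ASSUMED_CLOCK_HZ = 400e6


def peak_macs_per_second(macs_per_cycle, clock_hz):
    """Upper bound on MAC throughput: one full issue of MACs every cycle."""
    return macs_per_cycle * clock_hz


for name, macs in MACS_PER_CYCLE.items():
    rate = peak_macs_per_second(macs, ASSUMED_CLOCK_HZ)
    print(f"{name}: {rate / 1e9:.1f} GMAC/s peak at the assumed clock")
```

Because the result scales linearly with both factors, doubling the MAC units buys the same peak throughput as doubling the clock, which is why the choice between a wider engine and a faster core usually comes down to area and power.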



Figure 1. Tensilica's DSP product line. All these DSP cores were based on the Xtensa LX2 and are moving to Xtensa LX3. To the right of the product map are some optional DSP extensions that turn the general-purpose LX3 into a range of DSPs. The double-precision floating-point accelerator is a new option introduced with the LX3.

Another new option for the Xtensa LX3 is a double-precision floating-point accelerator. Tensilica stops short of calling it an FPU. It's not intended for applications requiring heavy-duty floating-point math, and it lacks some attributes of a full-fledged FPU, such as a dedicated register file. Consequently, it requires only 11,000 gates.

Tensilica continues to offer the option of a full-featured single-precision FPU, which requires about 25,000 gates and was introduced nine years ago with the Xtensa 3 processor. This FPU can execute single-precision operations at a throughput rate of one instruction per clock cycle. The new double-precision accelerator can be used with or without this FPU. (See *MPR 6/19/00-02*, "Vector DSP, FPU Extend Xtensa.")

The accelerator is intended for applications needing double-precision floating-point math in brief spurts, but not often enough to justify the larger gate-count of a full FPU. Two example applications are embedded motor controllers and Global Positioning System (GPS) satellite receivers. Until now, the alternative was to run floating-point emulation in software, which is slow and nondeterministic. (Software floating-point routines stop calculating when they achieve the desired degree of precision, so the latency of an operation depends on the data.)

Table 1 shows the vast improvement that a little floating-point hardware can make. Double-precision addition and subtraction are approximately twice as fast, and multiplication is two to three times faster. Double-precision division is about seven times faster, and square-root operations are 11 to 13 times faster.
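The acceleration columns in Table 1 are simply the ratio of software-emulation cycles to hardware cycles. A short Python sketch reproduces them from the published cycle counts; the dictionary layout and function name are our own, introduced only for illustration.

```python
# Cycle counts from Table 1:
# (software average, software maximum, hardware average, hardware maximum)
TABLE_1 = {
    "Addition": (35.2, 93, 19.1, 45),
    "Subtraction": (34.8, 84, 18.7, 41),
    "Multiplication": (75, 101, 26.3, 46),
    "Division": (496.9, 718, 72.2, 103),
    "Square Root": (860.3, 1056, 77.8, 80),
}


def speedups(sw_avg, sw_max, hw_avg, hw_max):
    """Return (average, maximum) acceleration, rounded to one decimal place."""
    return round(sw_avg / hw_avg, 1), round(sw_max / hw_max, 1)


for operation, cycles in TABLE_1.items():
    avg, worst = speedups(*cycles)
    print(f"{operation}: {avg}x average, {worst}x maximum")
```

Note that the hardware's maximum cycle counts stay close to its averages for division and square root, which reflects the more deterministic latency the article attributes to the accelerator.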

The accelerator conforms to the latest IEEE 754-2008 standard for double-precision arithmetic and logical comparisons. It also accelerates some 32-bit signed and unsigned integer math, such as division and modulus operations.

| Operation      | Software Emulation<br>Average Cycles | Software Emulation<br>Maximum Cycles | DP-FPA Hardware<br>Average Cycles | DP-FPA Hardware<br>Maximum Cycles | Acceleration<br>(Average) | Acceleration<br>(Maximum) |
|----------------|--------------------------------------|--------------------------------------|-----------------------------------|-----------------------------------|---------------------------|---------------------------|
| Addition       | 35.2                                 | 93                                   | 19.1                              | 45                                | 1.8x                      | 2.1x                      |
| Subtraction    | 34.8                                 | 84                                   | 18.7                              | 41                                | 1.9x                      | 2.0x                      |
| Multiplication | 75                                   | 101                                  | 26.3                              | 46                                | 2.9x                      | 2.2x                      |
| Division       | 496.9                                | 718                                  | 72.2                              | 103                               | 6.9x                      | 7.0x                      |
| Square Root    | 860.3                                | 1,056                                | 77.8                              | 80                                | 11.1x                     | 13.2x                     |

**Table 1.** Software emulation vs. hardware acceleration for double-precision floating-point operations, measured in clock cycles. Tensilica's new accelerator isn't quite a full FPU, but it significantly improves performance while adding only 11,000 gates to the Xtensa LX3 processor core. (Data source: Tensilica.)

## New Bus Options

The Xtensa LX3 has new options that will give developers greater flexibility when implementing an on-chip system bus. All Xtensa processors have a proprietary Tensilica bus called the Processor Interface (PIF), but most SoCs need a bus that's more industry-standard and easier to connect to on-chip peripherals. Various incarnations of ARM's AMBA bus have emerged as the prevailing standard. (Tensilica estimates that 70% to 80% of the latest SoCs are using AMBA.) Xtensa processors have long supported AMBA bus bridges, but the Xtensa LX3 has new timing options for those bridges.

The Xtensa LX2 was showing its age by limiting the bus to the same clock frequency as the processor core. That limitation simply won't do when the latest process technologies are vaulting core clock speeds into the gigahertz range; the peripherals can't keep up. Indeed, for many peripherals, high clock speeds are unnecessary or even detrimental. Therefore, the Xtensa LX3 supports new synchronous and asynchronous bus options for AMBA-3 AXI and AHB-Lite. With either bus bridge, the Xtensa LX3 supports separate read and write channels, each 32 bits wide.

The new synchronous core-to-bus clock-frequency ratios are 1:1, 2:1, 3:1, and 4:1. In a synchronous arrangement, the PIF-to-AMBA bridge adds no additional latency between the processor core and the bus. However, clock speeds are fixed at design time and can't be changed at run time to fine-tune performance and power consumption.

In an asynchronous arrangement, there's no fixed relationship between the core clock frequency and bus frequency. The processor and bus are clocked independently, with the limitation that the bus can't run faster than the processor. This arrangement requires additional logic between the core and bus to negotiate I/O transfers. Tensilica is introducing a new block (PIF Async) that handles these transactions between the PIF and AMBA buses. Unlike a synchronous-ratio bus, the asynchronous logic adds a few cycles of I/O latency. Figure 2 illustrates the asynchronous bus bridge.

**Figure 2.** New asynchronous bus bridge for the Xtensa LX3. This option adds bridge logic to negotiate I/O transfers between the processor core and AMBA AXI or AHB-Lite bus. It allows the processor and bus to run asynchronously, freed from the limitations of fixed clock ratios. However, the bus can't run faster than the processor, and the additional bridge logic adds 3.5 cycles of latency during read/write transactions.

**Figure 3.** Tensilica Xtensa LX3 block diagram. This is the industry's most configurable embedded-processor core. Virage Logic's ARC 700 comes close, followed by some MIPS processors with CorExtend, but the Xtensa architecture is by far the most malleable. A minimum configuration starts at only 15,000 gates and is suitable for 32-bit microcontrollers. A full-featured configuration with the ConnX Vectra LX DSP Engine and custom extensions could easily exceed one million gates. The I/O options for this processor are particularly impressive.

The new synchronous and asynchronous bus options have greater scope than matching the Xtensa LX3 processor core with on-chip peripherals. They also give developers more flexibility in multicore designs, because different cores can run at different speeds to suit their assigned tasks. Tensilica says most of its licensees have used Xtensa processors in multicore designs.
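The clock relationships described in this section reduce to two simple rules, sketched below in Python. This is only an illustrative model of the constraints as the article states them, not Tensilica's configuration tool; the function names are hypothetical, and the 780MHz core clock is borrowed from the speed-optimized "typical" configuration in Table 3 merely as a sample input.

```python
# Synchronous core-to-bus ratios listed in the article (core:bus = N:1).
SYNC_RATIOS = (1, 2, 3, 4)


def sync_bus_mhz(core_mhz, ratio):
    """Bus clock for a synchronous bridge; the ratio is fixed at design time."""
    if ratio not in SYNC_RATIOS:
        raise ValueError(f"unsupported core-to-bus ratio {ratio}:1")
    return core_mhz / ratio


def check_async_bus(core_mhz, bus_mhz):
    """Validate an asynchronous pairing: any bus clock up to the core clock."""
    if bus_mhz > core_mhz:
        raise ValueError("the bus can't run faster than the processor")
    return bus_mhz


core = 780  # MHz, sample value only
for ratio in SYNC_RATIOS:
    print(f"{ratio}:1 -> bus at {sync_bus_mhz(core, ratio):.0f}MHz")

# An asynchronous bridge permits a bus clock that no fixed ratio produces.
print(f"async -> bus at {check_async_bus(core, 200)}MHz")
```

The trade-off is the one the article describes: the synchronous options add no bridge latency but offer only four bus speeds per core clock, whereas the asynchronous bridge accepts any slower bus clock at the cost of a few extra cycles per transaction.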

## Additional Options for LX3

The latest version of Tensilica's Processor Generator adds some configuration options that developers can select with a few mouse clicks. One option is ConnX VectraVMB, a package of signal-processing extensions optimized for wireless communications.

VectraVMB extensions include hardware acceleration for the Viterbi convolutional-decoder algorithm; a multiplier that operates on eight 20-bit operands at a time (useful for filtering); and hardware acceleration for bit-unpacking algorithms (such as Huffman codes). Tensilica introduced these extensions in 2006 with the Diamond Standard 545CK preconfigured processor, now rebranded the ConnX 545CK.

Another option package for the Xtensa LX3 is FLIX3, which bundles three operations into a 64-bit very long instruction word (VLIW). Normally, developers must use TIE language to implement Tensilica's FLIX (FLexible Instruction eXtensions), but the FLIX3 package is a quick way to gain the benefits of FLIX without custom handiwork.

An Xtensa processor with FLIX can issue up to three VLIW operations in parallel. Unlike conventional VLIW architectures, FLIX can handle variable-length instructions (16, 24, 32, or 64 bits long), so unused slots in the instruction words needn't be padded with do-nothing NOPs. (See *MPR* 11/25/02-06, "FLIX: The New Xtensa ISA Mix.")

Two additional option packages let developers quickly implement the ports-and-queues feature that Tensilica introduced with the first Xtensa LX processor in 2004. These configurable I/O ports bypass the system bus by wiring application-specific logic and other blocks directly into the core. Almost anything can be hooked up to these ports, including custom logic, lookup tables, and coprocessors. FIFO queues buffer the ports, which are highly configurable. Each port can be as wide as 1,024 bits, and a processor can have as many as 1,024 ports. The practical limit is routing. (See *MPR 5/31/04-01*, "Tensilica Tackles Bottlenecks.")

Normally, developers must implement ports and queues in TIE language. The revised Processor Generator for the Xtensa LX3 and Xtensa 8 has two check-box options called GPIO32 and QIF32. The GPIO32 option adds a pair of 32-wire general-purpose I/O interfaces for input and output. The QIF32 option adds two 32-bit ports with FIFO queues. Both options are time-savers for implementing common I/O interfaces in SoCs. Figure 3 is a block diagram of the Xtensa LX3 that shows some of the options and configurable features of this versatile processor.

Software-development tools for the Xtensa LX3 are improving, too. Tensilica is withholding details until a formal announcement in December but has disclosed the improvements to *Microprocessor Report*. Without giving too much away, we can say that writing code in C or C++ will be a little easier at the syntax level. In particular, the changes will make it easier to port existing C/C++ code to Xtensa processors. That's important, because Tensilica's strategy aims to lure developers away from Ceva, Freescale Semiconductor, Texas Instruments, and other DSP vendors.

# Tensilica Debuts Xtensa 8 Core

Shortly before announcing the new Xtensa LX3, Tensilica introduced the latest version of its Xtensa 8 processor core as well. The Xtensa 8 is the direct descendant of the first configurable processor that brought Tensilica out of stealth mode in 1998. Although today the Xtensa 8 plays second fiddle to the higher-end Xtensa LX family, it's still an important part of Tensilica's product line.

All Xtensa processors share a common architecture and base instruction set. The main difference is the degree of configurability. The Xtensa LX family offers more options, such as the ConnX Vectra LX DSP Engine and FLexible Instruction eXtensions (FLIX) for VLIW processing. Tensilica keeps the original Xtensa family simpler to suit the needs of chip designers working on smaller projects. The Xtensa 8 is intended for 32-bit microcontrollers and other deeply embedded applications.

To that end, the minimum configuration of an Xtensa 8 core requires only about 15,000 gates. Several new options for the Xtensa 8 are the same features offered for the Xtensa LX3: a double-precision floating-point accelerator; point-and-click options for quickly adding 32 pairs of GPIO ports or 32-bit queued I/O ports that bypass the system bus; AMBA-3 AXI or AHB-Lite bus bridges; and synchronous or asynchronous bus clocks.

Without those features, the Xtensa 8 base core requires only 0.05mm<sup>2</sup> of silicon in 40nm CMOS and dissipates only 12 microwatts per megahertz. A more typical configuration would be similar to Tensilica's Diamond Standard 106Micro, a preconfigured core that will be upgraded to Xtensa 8. (See *MPR 3/20/06-01*, "Tensilica's Preconfigured Cores.")

The 106Micro configuration would include a 32- x 32-bit multiplier; separate I/O interfaces for instruction and data memories; an interrupt controller with 15 interrupts; two interrupt priorities; an integrated timer; and on-chip debugging hardware with embedded trace support. When optimized for low power, this Xtensa 8 configuration would require only 0.046mm<sup>2</sup> of silicon in TSMC's 40nm-LP process and would dissipate only 17 microwatts per megahertz. When optimized for speed in the same process, it could reach 540MHz while consuming only 25 microwatts per megahertz (13.5mW total). Core area would be 0.074mm<sup>2</sup>. (All estimates are Tensilica's.)

The Xtensa 8 renews the competition for tiny 32-bit processor cores that aim to replace 8- and 16-bit microcontrollers. Competing cores include the ARM Cortex-M0, ARM Cortex-M3, Cambridge Consultants XAP5a, Cortus APS3, MIPS32 M4K, and MIPS32 M14K. (See *MPR* 5/11/09-01, "Itty-Bitty 32-Bitters," and *MPR* 11/16/09-01, "MicroMIPS Crams Code.") Tensilica already had the most configurable architecture in this class, so the new options for Xtensa 8 will only enhance that attraction.

## Smaller, Faster, Lower Power

All these new features haven't slowed or bloated the Xtensa LX3. In fact, Tensilica says the new core is faster than a similarly configured Xtensa LX2 and is smaller when optimized for area. When optimized for speed, the LX3 is larger than the LX2, but it uses less dynamic power, thanks to enhanced clock gating and other power-management measures. Table 2 compares the Xtensa LX3 with the Xtensa LX2 when both cores are configured as midrange DSPs for audio applications.

|                                  | Tensilica<br>Xtensa LX3 | Tensilica<br>Xtensa LX2 | LX3<br>Difference |
|----------------------------------|-------------------------|-------------------------|-------------------|
| **Optimized for Speed**          |                         |                         |                   |
| Core Frequency                   | 384MHz                  | 312MHz                  | +23%              |
| Core Area                        | 0.538mm<sup>2</sup>     | 0.479mm<sup>2</sup>     | +12%              |
| Core Power                       | 150µW / MHz             | 186µW / MHz             | -19%              |
| **Optimized for Area and Power** |                         |                         |                   |
| Core Frequency                   | 56MHz                   | 56MHz                   | 0%                |
| Core Area                        | 0.281mm<sup>2</sup>     | 0.317mm<sup>2</sup>     | -11%              |
| Core Power                       | 89µW / MHz              | 151µW / MHz             | -41%              |

**Table 2.** Performance comparison, Xtensa LX3 vs. Xtensa LX2. All specifications are Tensilica's estimates for fabrication in TSMC's 65nm-LP low-leakage CMOS process. In a speed-optimized design flow, the LX3 is significantly faster and uses less dynamic power, despite requiring more silicon. In a design flow optimized for area and power, with the clock frequency pegged at 56MHz, the LX3 is a little smaller and uses much less dynamic power than the LX2.
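The "LX3 Difference" column in Table 2 is the change relative to the LX2 value, expressed as a percentage. As a quick check, the Python sketch below recomputes the column from Tensilica's estimates; the data layout and function name are ours, added for illustration.

```python
# (LX3 value, LX2 value) pairs from Table 2.
# Units: MHz, mm^2, and microwatts per MHz.
TABLE_2 = {
    "speed: frequency": (384, 312),
    "speed: area": (0.538, 0.479),
    "speed: power": (150, 186),
    "area/power: frequency": (56, 56),
    "area/power: area": (0.281, 0.317),
    "area/power: power": (89, 151),
}


def pct_difference(lx3, lx2):
    """Percentage change of the LX3 figure relative to the LX2 figure."""
    return round((lx3 - lx2) / lx2 * 100)


for metric, (lx3, lx2) in TABLE_2.items():
    print(f"{metric}: {pct_difference(lx3, lx2):+d}%")
```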

To estimate the performance specifications in Table 2, Tensilica configured both cores like the 330HiFi processor, one of Tensilica's Diamond Standard preconfigured cores. The 330HiFi is a 24-bit audio processor for digital music and VoIP applications. It has FLIX instructions, small L1 caches, and a 64-bit load/store unit. When the 330HiFi made its debut in 2006, it was based on the original Xtensa LX core and HiFi-2 audio-DSP engine. The 330HiFi was upgraded to the LX2 core in late 2006 and now will be upgraded to the LX3. (See *MPR 3/20/06-01*, "Tensilica's Preconfigured Cores.")

Table 3 is a more detailed look at the Xtensa LX3's performance. Again, these are Tensilica's estimates, but they assume fabrication in two smaller processes: TSMC's 40nm-LP, a leading-edge low-leakage CMOS process, and TSMC's 45nm-GS, a higher-performance CMOS process. For each process, Tensilica has estimated the performance of a minimum configuration and a typical configuration of the core, using design flows optimized for low power or maximum clock speed. This table offers a glimpse at the range of possibilities for the Xtensa LX3.

The typical configuration in Table 3 matches Tensilica's Diamond Standard 570T processor, one of the preconfigured cores introduced in 2006. Intended to compete with ARM11-family processors, the 570T has three-way 64-bit FLIX instructions, three ALUs, two branch units, two multipliers (one 16-bit, one 32-bit), one 64-bit load/store unit, 16KB caches, a 64-bit system interface, and 32-bit TIE ports with FIFO queues. It can execute as many as three VLIW operations per clock cycle, though only one can be a DSP instruction. MACs execute in a single cycle. In other words, the "typical" configuration in Table 3 is a luxury model, by embedded-processor standards.

Note the vast clock-frequency range of the Xtensa LX3, even when the same configuration is fabricated in the same process. In 45nm-GS, for example, a minimally configured Xtensa LX3 can mosey along at 62MHz or sprint beyond 1.0GHz.

Dynamic power is very low. For the "typical" configuration, power ranges from 2.0mW at 58MHz (45nm-GS, optimized for power) to 51.5mW at 780MHz (same process, optimized for speed). The largest implementation in Table 3 is less than one-third of a square millimeter, and the smallest implementation is an order of magnitude smaller than that. Don't drop it, or you'll never find it.
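The "Core Power @ Freq" rows in Table 3 follow directly from the per-megahertz figures: total dynamic power is power per megahertz multiplied by clock frequency. The Python sketch below checks the "typical" configuration and, working backward, estimates the clock implied by the ">1.0GHz" entry. Because the published values are rounded, that implied frequency is only approximate and is our inference, not a Tensilica figure.

```python
# "Typical" configuration rows from Table 3: (mW per MHz, core MHz).
TYPICAL = {
    "40nm-LP, low power": (0.046, 57),
    "40nm-LP, high speed": (0.093, 493),
    "45nm-GS, low power": (0.034, 58),
    "45nm-GS, high speed": (0.066, 780),
}


def power_mw(mw_per_mhz, mhz):
    """Total dynamic power at a given clock frequency."""
    return mw_per_mhz * mhz


def implied_mhz(total_mw, mw_per_mhz):
    """Clock frequency implied by a total-power and per-MHz figure."""
    return total_mw / mw_per_mhz


for name, (per_mhz, mhz) in TYPICAL.items():
    print(f"{name}: {power_mw(per_mhz, mhz):.1f}mW at {mhz}MHz")

# Minimum configuration, 45nm-GS, high speed: 14.4mW at 0.014mW/MHz.
print(f"implied clock: roughly {implied_mhz(14.4, 0.014):.0f}MHz")
```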

| Tensilica<br>Xtensa LX3   | TSMC<br>40nm-LP     | TSMC<br>40nm-LP     | TSMC<br>45nm-GS     | TSMC<br>45nm-GS     |
|---------------------------|---------------------|---------------------|---------------------|---------------------|
| **Minimum Configuration** |                     |                     |                     |                     |
| Synthesis Optimization    | Low power           | High speed          | Low power           | High speed          |
| Core Frequency            | 60MHz               | 670MHz              | 62MHz               | >1.0GHz             |
| Core Area                 | 0.024mm<sup>2</sup> | 0.044mm<sup>2</sup> | 0.024mm<sup>2</sup> | 0.044mm<sup>2</sup> |
| Core Power                | 0.012mW / MHz       | 0.018mW / MHz       | 0.009mW / MHz       | 0.014mW / MHz       |
| Core Power @ Freq         | 0.72mW              | 12.1mW              | 0.6mW               | 14.4mW              |
| **Typical Configuration** |                     |                     |                     |                     |
| Synthesis Optimization    | Low power           | High speed          | Low power           | High speed          |
| Core Frequency            | 57MHz               | 493MHz              | 58MHz               | 780MHz              |
| Core Area                 | 0.163mm<sup>2</sup> | 0.295mm<sup>2</sup> | 0.158mm<sup>2</sup> | 0.283mm<sup>2</sup> |
| Core Power                | 0.046mW / MHz       | 0.093mW / MHz       | 0.034mW / MHz       | 0.066mW / MHz       |
| Core Power @ Freq         | 2.62mW              | 45.8mW              | 2.0mW               | 51.5mW              |

**Table 3.** Tensilica's Xtensa LX3 performance estimates, assuming two different fabrication processes, core configurations, and design flows. The minimum configuration has full clock gating and I/O interfaces for local scratchpad memories, but the core-area numbers don't include the RAMs. The typical configuration is the same as Tensilica's Diamond Standard 570T processor (see text). All design flows used Synopsys logic libraries.

## RISC Challenges DSPs

Comparing any configurable processor with competing cores is difficult, because the variety of possible configurations is virtually infinite. The Xtensa LX3 is equally comfortable as a general-purpose RISC processor, a microcontroller core, an application-specific coprocessor, a multicore processor, or a DSP. Tensilica's latest strategy is to challenge DSPs, so we'll focus on that aspect.

Keep in mind that the fastest DSPs aren't necessarily conventional DSPs. By adding DSP extensions to a basic 32-bit RISC core, companies like Tensilica, ARC International, ARM, and MIPS Technologies are making surprisingly fast DSPs. Indeed, according to BDTI benchmarks, Tensilica's Diamond Standard 545CK processor (recently rebranded as the ConnX 545CK) is the fastest licensable DSP core on the market. The 545CK that BDTI benchmarked was based on the Xtensa LX2, and now it will be upgraded to the Xtensa LX3.

© IN-STAT

Building a DSP on a RISC core has additional advantages. Unlike most DSPs, processors like the Xtensa LX3 have the memory-management features needed to run sophisticated embedded operating systems, so one core can handle multiple functions. If an SoC design needs a separate control processor and DSP, using differently configured RISC cores to serve both purposes simplifies licensing, because there's only one core supplier to deal with. Also, programmers can write their control code and DSP code in C, using the same software-development tools.

Tensilica's Xtensa LX3 will compete with ARM's latest processor, the Cortex-A5, which can tackle some DSP chores when equipped with ARM's Neon extensions. The Cortex-A5 is a close match in other respects as well. Like the Xtensa LX3, it has an MMU, so it can run embedded versions of GNU/Linux and other operating systems that manage virtual memory. An optional FPU does single- and double-precision math, and the Cortex-A5 has approximately the same clock-frequency headroom as the Xtensa LX3. Symmetric multicore designs favor the Cortex-A5, because it supports two-, three-, or four-way multiprocessing with coherent memory. (See *MPR 10/26/09-01*, "ARM's Midsize Multiprocessor.")

## MIPS and ARC Do DSP, Too

MIPS Technologies offers two likely competitors: the MIPS32 74K and MIPS32 24KE. Both are 32-bit RISC cores with DSP extensions. Both have MMUs and versions of the core with an FPU. The DSP extensions in the 74K are newer than those in the 24KE but still don't match the DSP capabilities of an Xtensa LX3 with Vectra LX.

The MIPS 74K is a more radical embedded-processor design, clearly geared toward high performance when running general-purpose workloads. With an unusually deep 15-stage pipeline, two-way superscalar execution, and out-of-order execution, the 74K is a speed demon that can reach higher clock speeds (1.7GHz in 40nm-G) and deliver more throughput (2.0 Dhrystone mips per megahertz) than the Xtensa LX3 and similar processors.

However, because of those high-performance features, the 74K can't help being larger and more power-hungry than the Xtensa LX3 and similar processors. Scaling down the clock speed to match the Xtensa LX3 and optimizing the design flow for area and power would improve the 74K's power/performance profile. Overall, though, it's better suited for general-purpose workloads than for intensive signal processing. (See *MPR* 5/29/07-01, "MIPS 74K Goes Superscalar," and *MPR* 6/4/07-01, "MIPS 74K Performance Update.")

The MIPS 24KE isn't quite as fast or as powerful as the MIPS 74K but is still a very swift core. It's closer in size and power to the Xtensa LX3 and can reach higher clock frequencies when manufactured in the same process technology. However, when their clock speeds are the same, the MIPS 24KE is outgunned as a DSP. It has the older MIPS DSP Application-Specific Extensions (DSP ASE-1), which aren't as powerful as Tensilica's Vectra LX engine or new ConnX extensions. (See the sidebar, "MIPS 24KE: Better Late Than Never" in *MPR* 5/31/05-01, "The MIPS32 24KE Core Family.")

# Price & Availability

In November, Tensilica began delivering a revised version of its Processor Generator that uses the new Xtensa LX3 core. Tensilica's ConnX DSPs and Diamond Standard preconfigured cores will also start using the Xtensa LX3. Tensilica began delivering the new Xtensa 8 core in October. The company isn't disclosing upfront licensing fees or royalties for either processor.

More information about the Xtensa LX3 core: www.tensilica.com/products/xtensa-customizable/xtensa-lx.htm

More information about the Xtensa 8 core: www.tensilica.com/products/xtensa-customizable/xtensa.htm

Another competing processor is the ARC 750D, now owned by Virage Logic, which acquired ARC International in September. ARC has offered DSP extensions since the 1990s, and many licensees have used ARC processors in a dual CPU/DSP role. Like the Xtensa LX3, the ARC 750D has an MMU and an optional FPU, but ARC's FPU is more powerful than Tensilica's. (See *MPR 5/23/05-02*, "Float Without Bloat.")

The ARC XY Advanced DSP Memory Subsystem adds separate local memories to the ARC 750D, much like Tensilica's Vectra LX with dual load/store units. With the XY extensions, the ARC 750D can execute 32-bit multiplies, 24- x 24-bit MACs, and other arithmetic operations while fetching two operands in parallel. ARC offers a library of DSP instructions similar to Tensilica's ConnX VectraVMB package, including Viterbi acceleration. While Tensilica is pursuing a DSP strategy, Virage Logic is positioning itself as a one-stop shop for SoC designers. (See *MPR 9/14/09-01*, "Summer Shopping Spree.")

Table 4 summarizes the features of the Xtensa LX3, Cortex-A5, MIPS 74K, MIPS 24KE, and ARC 750D. All are 32-bit synthesizable embedded-processor cores available for licensing. As usual, *MPR* was unable to obtain performance specifications for each core in equivalent fabrication processes, so the maximum clock frequencies, core areas, and power consumption of these processors aren't directly comparable. Nevertheless, the Xtensa LX3 appears to have the advantage in area and power and particularly in power efficiency (Dmips per milliwatt).
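The power-efficiency figures in Table 4 follow from dividing each core's Dhrystone throughput (Dmips per megahertz) by its typical power (milliwatts per megahertz). The short Python sketch below reproduces that arithmetic from the table's vendor-estimated numbers. The dictionary and function names are illustrative only, and the inputs for the MIPS32 74K ("< 0.25mW / MHz") and ARC 750D (">1.5 Dmips / MHz") are bounds taken at face value, so those two results are lower bounds rather than measurements. The computed ratios agree with the published column to within about 0.1Dmips / mW; the published values appear to be truncated rather than rounded.

```python
# Vendor estimates copied from Table 4: (Dmips per MHz, mW per MHz).
# The cores are characterized in different fabrication processes, so the
# results are indicative only and not directly comparable.
TABLE_4 = {
    "Xtensa LX3 (45nm-GS)": (1.52, 0.014),
    "Cortex-A5 (40nm-LP)": (1.57, 0.12),
    "MIPS32 74K (40nm-G)": (2.0, 0.25),    # power listed as "< 0.25": lower bound on efficiency
    "MIPS32 24KE (40nm-G)": (1.55, 0.125),
    "ARC 750D (90nm-LP)": (1.5, 0.12),     # throughput listed as ">1.5": lower bound on efficiency
}


def dmips_per_mw(dmips_per_mhz, mw_per_mhz):
    """Power efficiency: the per-MHz terms cancel, leaving Dmips per mW."""
    return dmips_per_mhz / mw_per_mhz


def rank_by_efficiency(cores):
    """Return (name, Dmips/mW) pairs, most power-efficient first."""
    scored = [(name, dmips_per_mw(d, p)) for name, (d, p) in cores.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, efficiency in rank_by_efficiency(TABLE_4):
        print(f"{name:24s} {efficiency:6.1f} Dmips/mW")
```

Because the clock-frequency terms cancel, the ratio is independent of the operating frequency, which is why it can be quoted even though the maximum clock speeds of the five cores differ widely.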

In DSP applications, the Xtensa LX3 will also compete with the licensable DSP cores from Ceva and NXP. Using a pure DSP has some drawbacks: it can't perform all the functions of a general-purpose CPU, and the software-development flow will be different from the CPU's. Nevertheless, it's a popular choice. We covered the Ceva-XC and NXP CoolFlux BSP cores in our recent report on Tensilica's ConnX BBE. (See *MPR 8/10/09-01*, "Tensilica Plays Baseband.")

| Feature                          | Tensilica<br>Xtensa LX3                                           | ARM<br>Cortex-A5                                                      | MIPS<br>MIPS32 74K                           | MIPS<br>MIPS32 24KE                          | Virage<br>ARC 750D                              |
|----------------------------------|-------------------------------------------------------------------|-----------------------------------------------------------------------|----------------------------------------------|----------------------------------------------|-------------------------------------------------|
| Architecture                     | Xtensa LX3                                                        | ARMv7-A                                                               | MIPS32-R2                                    | MIPS32-R2                                    | ARCompact                                       |
| Integer<br>Pipeline              | 5 or 7 stages<br>In-order<br>1-way                                | 8 stages<br>In-order<br>1-way*                                        | 15 stages<br>Out-of-order<br>2-way           | 8 stages<br>In-order<br>1-way                | 7 stages<br>In-order<br>1-way                   |
| Branch Predict                   | —                                                                 | Dynamic                                                               | Dynamic                                      | Dynamic                                      | Dynamic                                         |
| L1 Cache                         | 0–32K each<br>ECC                                                 | 4K–64K each                                                           | 0–64K each                                   | 0–64K each                                   | 8K–64K each                                     |
| L2 Cache                         | —                                                                 | Optional<br>16K–8MB                                                   | Optional                                     | Optional                                     | —                                               |
| Scratchpad RAM                   | Optional<br>Up to 495GB                                           | —                                                                     | Optional<br>Up to 1MB                        | Optional<br>Up to 1MB                        | 8K–512K (code)<br>8K–256K (data)                |
| 16-Bit Instructions              | Yes                                                               | Thumb, Thumb-2                                                        | MIPS16e                                      | MIPS16e                                      | ARCompact                                       |
| DSP Extensions                   | Vectra LX,<br>ConnX D2, ConnX BBE                                 | Optional Neon                                                         | MIPS DSP ASE-2                               | MIPS DSP ASE-1                               | ARC XY<br>Memory Subsystem                      |
| Java Extensions                  | —                                                                 | Jazelle DBX, RCT                                                      | —                                            | —                                            | —                                               |
| Custom Instructions              | Yes                                                               | —                                                                     | Yes                                          | Yes                                          | Yes                                             |
| System Interface                 | AMBA-3 AXI<br>2 x 8–1,024 bits<br>or AHB-Lite<br>2 x 8–1,024 bits | AMBA-3 AXI<br>1 x 64 bits<br>Optional 2 x 64 bits<br>(multiprocessor) | OCP 2.1<br>2 x 64 bits<br>(Optional bridges) | OCP 2.1<br>2 x 64 bits<br>(Optional bridges) | BVCI, AHB, AXI<br>1 x 32 bits<br>or 1 x 64 bits |
| FPU                              | Optional SP or DP                                                 | Optional VFPv3 (DP)                                                   | Optional SP + DP                             | Optional SP + DP                             | Optional SP or DP                               |
| Memory<br>Management             | Optional<br>MMU + TLB                                             | MMU + TLB                                                             | Optional<br>MMU + TLB                        | Optional<br>MMU + TLB                        | MMU + TLB                                       |
| Privilege Levels                 | 4                                                                 | 2 + TrustZone                                                         | 3                                            | 3                                            | 2                                               |
| Core Frequency<br>(Max)          | >1.0GHz<br>(45nm-GS)                                              | 480MHz–1.0GHz<br>(40nm-LP, 40nm-G)                                    | 1.7GHz<br>(40nm-G)                           | 1.45GHz<br>(40nm-G)                          | 700MHz<br>(90nm-GT)                             |
| Core Area<br>@ Max Freq          | 0.044mm <sup>2</sup><br>(45nm-GS)                                 | 0.27mm <sup>2</sup><br>(40nm-LP)                                      | 0.78mm <sup>2</sup><br>(40nm-G)              | 0.34mm <sup>2</sup><br>(40nm-G)              | 0.53mm <sup>2</sup><br>(90nm-LP)                |
| Dhrystone 2.1                    | 1.52Dmips / MHz                                                   | 1.57Dmips / MHz                                                       | 2.0Dmips / MHz                               | 1.55Dmips / MHz                              | >1.5 Dmips / MHz                                |
| Power (typical)                  | 0.014mW / MHz<br>(45nm-GS)                                        | 0.12mW / MHz<br>(40nm-LP)                                             | < 0.25mW / MHz<br>(40nm-G)                   | 0.125mW / MHz<br>(40nm-G)                    | 0.12mW / MHz<br>(90nm-LP)                       |
| Power Efficiency<br>(Dmips / mW) | 108.5Dmips / mW<br>(45nm-GS)                                      | 13.0Dmips / mW<br>(40nm-LP)                                           | 8.0Dmips / mW<br>(40nm-G)                    | 12.4Dmips / mW<br>(40nm-G)                   | 12.5Dmips / mW<br>(90nm-LP)                     |
| Introduction                     | 2009                                                              | 2009                                                                  | 2007                                         | 2005                                         | 2004                                            |

Table 4. Feature comparison of the Tensilica Xtensa LX3, ARM Cortex-A5, MIPS Technologies MIPS32 74K, MIPS32 24KE, and Virage Logic ARC 750D. The Xtensa LX3 is probably the smallest, lowest-power core in this group, but significant differences among the fabrication processes make comparisons problematic. The power/performance numbers are vendor estimates and assume speed-optimized design flows. Other important variables—such as synthesis libraries, power-measurement methods, and details about the core configurations—are largely unknown. *MPR* selected these cores for comparison because all have MMUs, optional DSP extensions, and optional FPUs. Therefore, all can serve as DSPs or as dual-role CPU/DSPs in high-performance embedded applications. The Xtensa LX3 is likely the most powerful DSP in this group, judging from BDTI benchmarks of Tensilica's 545CK, a preconfigured DSP that's based on the earlier Xtensa LX2 core. (\*Although the Cortex-A5 has a mostly uniscalar pipeline, it can issue some pairs of instructions in parallel.)

## **Tensilica's Flanking Strategy**

With ARM and MIPS dominating the control plane in most 32-bit embedded systems, Tensilica needed a new strategy. Although Tensilica still competes for chip designs in the control plane, the data plane appears to offer more opportunities for design wins. In particular, the demand for signal processing keeps growing as more embedded systems incorporate digital audio, video, and communications. Unlike some other segments, the market for licensable DSP cores is still very much in play.

The Xtensa LX3 establishes a solid foundation for Tensilica's data-plane strategy and builds on the stellar performance of the Xtensa LX2. The new features and options will offer developers more flexibility to create optimized designs. As Tensilica migrates its line of preconfigured processors from the LX2 to the LX3, some developers with projects in motion will reap immediate benefits. The versatility of Tensilica's processors is second to none, and they often lead the pack in core area and power.

It would seem that Tensilica has all the ingredients for a successful strategy. The catch, of course, is that competitors are attuned to the same opportunities—as evidenced by the nearly universal availability of DSP extensions for 32-bit RISC cores. The challenge for Tensilica is to convince developers that Xtensa is a better choice for data-plane processing than other architectures and can peacefully coexist with the CPUs that dominate the control plane. $\diamond$

To subscribe to Microprocessor Report, phone 480.483.4441 or visit www.MPRonline.com