Caches and Local Memories

Subsections describe the individual cache and local memory options in more detail.

Cadence allows up to six local memory interfaces on each of the instruction and data sides. Each interface may be a local RAM, a local ROM, or a cache, and each way of a set-associative cache counts as one interface. Caches can range from 1KB to 128KB, from direct-mapped to 4-way set-associative, with line sizes from 16 to 256 bytes.

Caches provide reasonably robust performance with minimal effort. Local memories can deliver higher performance and efficiency, but not always. Local memories support external DMA engines through the processor’s inbound PIF port. With DMA, you can work on one block of data while loading the next block in the background, potentially avoiding cache-miss penalties entirely. Of course, this only works if the current block and the block being loaded in parallel together fit inside the local memory.
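The double-buffering pattern described above can be sketched as follows. The DMA driver calls (`dma_start_read`, `dma_wait`) are hypothetical stand-ins for whatever API your DMA engine provides; here they are modeled with an immediate `memcpy` so the sketch is self-contained.

```c
/* Sketch of double-buffered ("ping-pong") processing: a DMA engine fills
 * one local-memory buffer while the processor computes on the other.
 * dma_start_read() and dma_wait() are hypothetical driver stubs, not a
 * real API; on hardware they would start an asynchronous transfer and
 * wait for its completion. */
#include <stddef.h>
#include <string.h>

#define BLOCK 256
static unsigned char buf[2][BLOCK];   /* would be placed in local data RAM */

static void dma_start_read(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);              /* real hardware would return immediately */
}
static void dma_wait(void) { /* wait for the outstanding transfer to finish */ }

long process_stream(const unsigned char *src, size_t nblocks) {
    long sum = 0;
    size_t cur = 0;
    if (nblocks == 0)
        return 0;
    dma_start_read(buf[cur], src, BLOCK);              /* prefetch first block */
    for (size_t b = 0; b < nblocks; b++) {
        dma_wait();                                    /* block b is now resident */
        size_t nxt = cur ^ 1;
        if (b + 1 < nblocks)                           /* fetch block b+1 in background */
            dma_start_read(buf[nxt], src + (b + 1) * BLOCK, BLOCK);
        for (size_t i = 0; i < BLOCK; i++)             /* compute on the resident block */
            sum += buf[cur][i];
        cur = nxt;
    }
    return sum;
}
```

Note that both buffers must fit in the local memory simultaneously, which is exactly the working-set constraint mentioned above.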

Caches work well when the total memory in use is significantly larger than the local memory size but the working set at any given time is sufficiently small. Local memories are much harder to use in such scenarios: data must be explicitly and manually moved into and out of the local memory, and partitioning code is not always easy.

You may try to use both local memories and caches, putting your frequently used data or code in local memories while leaving caches to handle the rest. This can be very effective when some code or data is small and used frequently, and other code or data is very large and is being streamed into the processor. Frequently, however, making such a clean partition is difficult; hardware does a better job of dynamically allocating memory to caches than you can statically.

Local memories require less power to access than equivalently sized caches, and direct-mapped caches require significantly less power than set-associative caches. Direct-mapped caches can perform well, but their performance is less robust: a small change to an application can have a dramatic performance impact if two pieces of code or data suddenly fall into the same cache location. With direct-mapped caches, be certain to use some of the performance tuning and measuring methodologies described in Chapter 2 to make sure that you are not thrashing the cache. In particular, the Cache Explorer allows you to automatically simulate performance and power usage for various cache systems on your actual application, and the Link Order tool allows you to rearrange your code to minimize instruction cache misses.
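The thrashing hazard above comes down to simple index arithmetic: in a direct-mapped cache, two addresses collide exactly when they differ by a multiple of the cache size. The sketch below illustrates this with generic, illustrative parameters, not any particular Xtensa configuration.

```c
/* Set-index arithmetic for a direct-mapped cache: two addresses map to
 * the same cache line (and therefore evict each other) when their
 * line indices are equal, i.e. when the addresses differ by a multiple
 * of the cache size. Sizes here are illustrative only. */
#include <stdint.h>

#define CACHE_SIZE 8192u   /* 8KB direct-mapped cache */
#define LINE_SIZE  32u     /* 32-byte lines */

static inline uint32_t cache_line_index(uint32_t addr) {
    return (addr / LINE_SIZE) % (CACHE_SIZE / LINE_SIZE);
}

static inline int conflicts(uint32_t a, uint32_t b) {
    return cache_line_index(a) == cache_line_index(b);
}
```

For example, with an 8KB cache, addresses 0x1000 and 0x3000 (which differ by exactly 8KB) land on the same line and will thrash if accessed alternately, while 0x1000 and 0x2000 do not conflict. This is why a small relayout of code or data can change performance dramatically.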

Two local memories of size n/2 require less power than one local memory of size n. Two local memories can also increase DMA performance, because a DMA engine writing into one memory will not compete for bandwidth with the processor accessing the other. However, with two local memories you must partition the data or code between them. Cadence also supports line locking in all but one way of a set-associative cache. Line locking provides some of the benefits of local memories within a cache, but to use it effectively you must explicitly identify data or code that is small and frequently used. As with local memories, it is often hard to partition statically as well as the hardware caching mechanism partitions automatically.
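Partitioning data between two local memories is typically done statically through linker sections. The sketch below shows one way this might look; the section names `.dram0.data` and `.dram1.data` follow common Xtensa linker-support-package conventions but are an assumption here, so check your own linker configuration for the actual names.

```c
/* Sketch: statically placing two hot buffers in separate local data RAMs
 * so that a DMA engine (or a second load in the same cycle) does not
 * contend with the processor for one memory. The section names are an
 * assumption modeled on common Xtensa LSP conventions -- substitute the
 * names your linker support package defines. */
#define LOCAL_MEM0 __attribute__((section(".dram0.data")))
#define LOCAL_MEM1 __attribute__((section(".dram1.data")))

LOCAL_MEM0 static short coeffs[64];   /* small, frequently read by the core */
LOCAL_MEM1 static short samples[64];  /* e.g. written by DMA, read by the core */

long dot64(void) {
    long acc = 0;
    for (int i = 0; i < 64; i++)
        acc += (long)coeffs[i] * samples[i];
    return acc;
}
```

Because the partition is fixed at link time, this works best when the hot data is known in advance and stable, which is exactly the situation where static partitioning can compete with the cache's dynamic allocation.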

Caches and local data memories can be divided into one to four banks. The data memory is interleaved so that successive accesses of the data-memory width go to different banks. At most one load or store can go to any one bank in a cycle, so on configurations that support multiple loads or stores per cycle, or on systems with DMA, using more banks minimizes the number of stalls due to bank conflicts.
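The bank-interleaving rule above can be expressed as a one-line address computation. The sketch below uses illustrative values for the access width and bank count; your configuration's values may differ.

```c
/* Bank-selection arithmetic for an interleaved local data memory:
 * successive accesses of the data-memory width map to successive banks,
 * so two same-cycle accesses stall only when they select the same bank.
 * MEM_WIDTH and NUM_BANKS are illustrative, not configuration-specific. */
#include <stdint.h>

#define MEM_WIDTH 16u   /* bytes per data-memory access */
#define NUM_BANKS 4u

static inline uint32_t bank_of(uint32_t addr) {
    return (addr / MEM_WIDTH) % NUM_BANKS;
}

static inline int bank_conflict(uint32_t a, uint32_t b) {
    return bank_of(a) == bank_of(b);
}
```

With these parameters, a load at 0x100 and a load at 0x110 hit different banks and can proceed in the same cycle, while loads at 0x100 and 0x140 select the same bank and would serialize; strided access patterns whose stride is a multiple of MEM_WIDTH * NUM_BANKS are the worst case.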

See the appropriate Data Book for more detailed information.