The ELM1A

ELM1A Die plot
Summary data

0.18-micron, 6-layer metal CMOS
125-MHz operation
8-kbyte data cache
8-kbyte instruction cache
64-bit asynchronous bus
vector processing facility
    4 integer or logical operations per cycle
    4 logarithmic multiplies, divides or square-roots per cycle
    2 logarithmic adds or subtracts per 3 cycles
single-level interrupt
1.8-V core; 3.3-V I/O
development and evaluation system for PC-based host
    relocating assembler
    linkage editor
    mathematical function library

Legend

IC: instruction cache
DC: data cache
IIU: instruction issue unit
BIU: bus interface unit
SCALU: single-cycle ALUs
MCALU: multi-cycle ALUs
REG: general registers
F, D, E and P: interpolator lookup tables (mirrored for two MCALUs)
G: transform tables (mirrored)
PLL: phase-locked loop


Architecture

The ELM offers convenient access to, and efficient use of, a logarithmic arithmetic unit for all 32-bit numerical work. A detailed description of its architecture is available in 'IEEE Transactions on Computers', April 2008. It comprises a four-stage pipeline, illustrated below, communicating with 16 general registers and with a conventional 64-bit asynchronous bus. At the end of the pipeline there are two kinds of arithmetic unit. All integer and logical operations, and logarithmic multiply, divide and square-root, complete in one cycle and are handled by the single-cycle ALU. As this unit is small, there are four of them, offering a single-instruction multiple-data (SIMD)-style vector facility in which four pairs of operands in consecutive registers or memory locations may be processed at once. The vast majority of the silicon area is occupied by the larger units required for logarithmic addition and subtraction. These operations take three (or exceptionally four) cycles and are handled by a flowthrough unit, the multi-cycle ALU, of which there are two.
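The split between single-cycle and multi-cycle operations follows from how logarithmic arithmetic works in general. The Python sketch below is only an illustration under stated assumptions: it represents a value as a sign flag plus a base-2 logarithm held in a float, which is not the ELM's actual word format, and it calls math.log2 directly where hardware would approximate the function, for example with lookup tables and interpolation. All function names are hypothetical.

```python
import math

# Assumed representation (not the ELM's word format): (is_negative, log2|x|).
# Zero has no logarithm and is deliberately left out of this sketch.

def to_lns(x):
    """Encode a nonzero real number as (sign flag, base-2 log of magnitude)."""
    if x == 0:
        raise ValueError("zero is not representable in this sketch")
    return (x < 0, math.log2(abs(x)))

def from_lns(v):
    """Decode (sign flag, log) back to an ordinary float."""
    negative, log_mag = v
    return -(2.0 ** log_mag) if negative else 2.0 ** log_mag

def lns_mul(a, b):
    """Multiply: add the logarithms, combine the signs."""
    return (a[0] != b[0], a[1] + b[1])

def lns_div(a, b):
    """Divide: subtract the logarithms, combine the signs."""
    return (a[0] != b[0], a[1] - b[1])

def lns_sqrt(a):
    """Square root of a positive value: halve the logarithm."""
    if a[0]:
        raise ValueError("square root of a negative value")
    return (False, a[1] / 2)

def lns_add(a, b):
    """Add: needs the nonlinear function log2(1 +/- 2**d), with d <= 0."""
    if a[1] < b[1]:
        a, b = b, a              # make a the operand of larger magnitude
    d = b[1] - a[1]              # difference of logarithms, d <= 0
    if a[0] == b[0]:             # same sign: magnitudes add
        return (a[0], a[1] + math.log2(1.0 + 2.0 ** d))
    if d == 0:                   # equal magnitudes, opposite signs
        raise ValueError("result is zero, not representable in this sketch")
    return (a[0], a[1] + math.log2(1.0 - 2.0 ** d))
```

Multiply, divide and square-root reduce to an add, a subtract and a halving of the logarithms, which is consistent with their sharing the small single-cycle ALU with integer operations. Addition and subtraction need a nonlinear function of the difference d, which is why they are the costly case.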

The minimal instruction set comprises 35 operations. Its orthogonal arrangement allows either both operands to be located in registers, or one in a register and the other in a memory location. All instructions operate on 32-bit words (although packing facilities are available for efficient input and output of shorter wordlength data), and each may process either one 32-bit word or a vector of two or four consecutive words as described above. The instructions themselves are 32 bits long. The 8-kbyte instruction and data caches supply one and four words per cycle, respectively, to the processor.
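The operand forms described above can be modelled in a few lines of Python. This is a hypothetical behavioural model, not the ELM instruction encoding: the function name, its parameters, and the choice to write the result back to the first operand's registers are all assumptions made for illustration.

```python
import operator

NUM_REGS = 16  # number of general registers, per the text

def execute(op, regs, rd, length=1, rs=None, mem=None, addr=None):
    """Apply `op` to `length` consecutive words (hypothetical model).

    The first operand comes from regs[rd + i]. The second comes from
    regs[rs + i] (register form) or mem[addr + i] (memory form).
    Writing the result back to regs[rd + i] is an assumption of this sketch.
    """
    if length not in (1, 2, 4):
        raise ValueError("vector length must be 1, 2 or 4")
    if (rs is None) == (addr is None):
        raise ValueError("give exactly one of rs (register) or addr (memory)")
    if rd + length > NUM_REGS or (rs is not None and rs + length > NUM_REGS):
        raise ValueError("register range out of bounds")
    # Read every operand before writing any result, so that all lanes
    # behave as if they were processed at once.
    if rs is not None:
        second = [regs[rs + i] for i in range(length)]
    else:
        second = [mem[addr + i] for i in range(length)]
    results = [op(regs[rd + i], second[i]) for i in range(length)]
    for i, value in enumerate(results):
        regs[rd + i] = value

# Register-register form on a vector of four consecutive registers.
regs = list(range(NUM_REGS))
execute(operator.add, regs, rd=0, length=4, rs=4)

# Register-memory form on a vector of two consecutive words.
mem = [10, 20, 30, 40]
execute(operator.mul, regs, rd=8, length=2, mem=mem, addr=1)
```

The same operation thus covers six cases (two operand forms, three vector lengths) without separate opcodes, which is the sense in which the arrangement is orthogonal.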

After starting an addition, the processor continues to issue instructions during the two remaining cycles before the addition completes. A typical peak processing rate would therefore be two additions (including two implied loads), four loads, and four multiplications (including four more implied loads) in three cycles.
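The peak figure above can be converted to a rate with simple arithmetic. The sketch below assumes one plausible reading of the text: three instructions issued in three consecutive cycles, a 125-MHz clock as given in the summary data, and a data cache delivering four words per cycle. The variable names are illustrative only.

```python
CLOCK_HZ = 125e6             # clock rate from the summary data
CACHE_WORDS_PER_CYCLE = 4    # data cache bandwidth stated in the text

# One instruction per cycle: (name, arithmetic results, words loaded).
schedule = [
    ("vector add, 2 words", 2, 2),   # two additions, two implied loads
    ("vector load, 4 words", 0, 4),  # four explicit loads
    ("vector mul, 4 words", 4, 4),   # four multiplications, four implied loads
]

cycles = len(schedule)
arith_ops = sum(results for _, results, _ in schedule)
words_loaded = sum(loads for _, _, loads in schedule)

# Arithmetic throughput if this pattern is sustained.
ops_per_second = arith_ops / cycles * CLOCK_HZ

# Words the data cache could deliver in the same three cycles.
cache_words_available = CACHE_WORDS_PER_CYCLE * cycles
```

Under these assumptions the pattern yields six arithmetic results and ten loaded words every three cycles, that is, 250 million arithmetic operations per second, and the ten words fit within the twelve the data cache could supply in that time.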




System support

A development and evaluation system consists of an 8 x 7 inch, 4-layer PCB, accommodating a 125-MHz ELM, 4 MB of 10-ns static memory, a 16-bit general-purpose parallel I/O port, an LED array, and a serial link to a PC-based host.

Software support currently comprises a relocating assembler, linkage editor (with a library of common mathematical functions) and register-level simulator, all of which run stand-alone on a PC, and a monitor interface for communicating with the development board.