ELM1A Prototype Evaluation Board

What is the novel idea?

The European Logarithmic Microprocessor performs the basic mathematical operations - add, subtract, multiply, divide and square-root - in a way which is radically different from that of conventional microprocessors. These use the 'floating-point' system, in which a number is represented more-or-less as the direct binary equivalent of its real value. Processing these floating-point numbers takes a lot of time, and also does not deliver a completely accurate result. The error in any one operation is minuscule, but in a complex programme with billions of operations the errors build up. Using the logarithmic number system (LNS), which represents numbers as logarithms instead of floating-point values, dramatically improves the speed and accuracy of the multiply, divide and square-root operations. In fact the circuits to perform these operations logarithmically are almost trivial. What has always been difficult, and had not so far been done in a practical microprocessor, is designing the circuits for logarithmic addition and subtraction. The first challenge was therefore to develop techniques for performing these operations at least as well as the standard floating-point methods. Together with the ready-made advantages of the multiplies, divides and square-roots, the average performance in a typical calculation involving both multiplications and additions was then substantially improved.
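The build-up of rounding error is easy to see even in a trivial floating-point calculation. The Python sketch below is an illustration of the general phenomenon only, not a measurement of any particular processor: it repeatedly adds a value that has no exact binary representation, so every addition rounds slightly.

```python
# 0.1 cannot be represented exactly in binary floating point, so
# each addition introduces a tiny rounding error.  After ten million
# additions the accumulated error becomes plainly visible.
n = 10_000_000
total = 0.0
for _ in range(n):
    total += 0.1

print(total)                  # close to, but not exactly, 1000000.0
print(total - 1_000_000.0)    # the accumulated error
```

Any one rounding error here is of the order of one part in 10^16; it is only their accumulation over millions of operations that produces a visible discrepancy.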
The LNS represents a number as its base 2 logarithm, which is itself a fixed-point value. Multiplications and divisions become fixed-point additions and subtractions. However, adding or subtracting two values, each of which is represented as a logarithm, and returning the logarithm of the result, requires the evaluation of a non-linear function. For two logarithms i and j, with r = j - i:

    log2(2^i ± 2^j) = i + F(r),    where F(r) = log2(1 ± 2^r)

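A minimal numerical sketch of the scheme is given below, in Python, with double-precision floats standing in for the processor's fixed-point hardware. The function names are illustrative only; note how multiply, divide and square-root reduce to trivial fixed-point operations, while addition needs the non-linear function F(r).

```python
import math

def to_lns(x):
    """Represent a positive real as its base-2 logarithm."""
    return math.log2(x)

def from_lns(i):
    return 2.0 ** i

def lns_mul(i, j):       # multiply becomes a fixed-point add
    return i + j

def lns_div(i, j):       # divide becomes a fixed-point subtract
    return i - j

def lns_sqrt(i):         # square root becomes a one-bit shift
    return i / 2

def lns_add(i, j):
    # Addition requires the non-linear function F(r) = log2(1 + 2^r).
    i, j = max(i, j), min(i, j)    # keep r = j - i <= 0
    r = j - i
    return i + math.log2(1.0 + 2.0 ** r)

a, b = to_lns(6.0), to_lns(7.0)
print(from_lns(lns_mul(a, b)))     # 6 * 7
print(from_lns(lns_add(a, b)))     # 6 + 7
print(from_lns(lns_sqrt(to_lns(9.0))))
```

In the real hardware F(r) cannot be computed with a logarithm instruction, of course; evaluating it quickly and accurately is precisely the design problem described next.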
This takes place every time an addition or subtraction is executed, and requires dedicated hardware capable of returning an exactly rounded result in minimal time. As it is not possible to store all the values of F(r) in a memory, an LNS adder/subtractor consists largely of a widely-spaced lookup table, with an interpolator for calculating intervening values. The design of an interpolator with speed and accuracy at least equivalent to that of a floating-point unit is challenging. It was achieved in the ELM with a novel design that delivers a result in three clock cycles, except in the case discussed below, which requires four. Full details are published in 'IEEE Transactions on Computers', July 2000.
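The table-plus-interpolator arrangement can be sketched as follows. This is a first-order (linear) interpolator with an arbitrarily chosen table spacing, purely for illustration; the ELM's actual interpolator is a more sophisticated design, as described in the cited paper.

```python
import math

DELTA = 1.0 / 64                      # table spacing (illustrative)
RMIN = -16.0                          # below this, F(r) is negligibly small

def F(r):                             # addition function F(r) = log2(1 + 2^r)
    return math.log2(1.0 + 2.0 ** r)

# Widely-spaced lookup table: F and a chord slope at each stored point.
N = int(-RMIN / DELTA) + 1
table = [(F(RMIN + k * DELTA),
          (F(RMIN + (k + 1) * DELTA) - F(RMIN + k * DELTA)) / DELTA)
         for k in range(N)]

def F_interp(r):
    """Approximate F(r), r <= 0, by linear interpolation in the table."""
    if r <= RMIN:
        return 0.0
    k = int((r - RMIN) / DELTA)       # index of the stored point below r
    f0, slope = table[k]
    return f0 + slope * (r - (RMIN + k * DELTA))

worst = max(abs(F_interp(-i / 1000.0) - F(-i / 1000.0))
            for i in range(10000))
print(f"worst-case interpolation error: {worst:.2e}")
```

Even this crude linear scheme keeps the error of the addition function small across its whole range, because F(r) for addition is smooth everywhere; the subtraction function, discussed next, is another matter.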
LNS Addition and Subtraction Functions
The subtraction function exhibits a singularity when r approaches 0. As F(r) approaches -infinity, so does the slope, which makes interpolation very difficult. Using an interpolator, the only way to maintain accuracy would be to use very narrowly-spaced, and therefore impractically large, lookup tables. The ELM circumvents this problem in a novel way. Once i, j and r are known, and r found to be close to the singularity, a coefficient k[1] can be selected with no time penalty whatever, by rearranging bit patterns in the three values.
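The difficulty is easy to demonstrate numerically. The sketch below applies the same kind of linear interpolation that works well for the addition function to the subtraction function F(r) = log2(1 - 2^r), with an arbitrarily chosen table spacing, and compares the error far from and close to the singularity:

```python
import math

DELTA = 1.0 / 64                       # table spacing (illustrative)

def F_sub(r):                          # F(r) = log2(1 - 2^r), r < 0
    return math.log2(1.0 - 2.0 ** r)

def interp_error(r):
    """Error of linear interpolation between the two stored points
    bracketing r."""
    r0 = math.floor(r / DELTA) * DELTA
    r1 = r0 + DELTA
    approx = F_sub(r0) + (F_sub(r1) - F_sub(r0)) * (r - r0) / DELTA
    return abs(approx - F_sub(r))

far = interp_error(-2.0 + DELTA / 2)   # benign region, far from r = 0
near = interp_error(-1.5 * DELTA)      # adjacent to the singularity
print(f"error far from singularity : {far:.2e}")
print(f"error near the singularity : {near:.2e}")
```

With the same table spacing, the error near the singularity comes out orders of magnitude larger than in the benign region, which is why a purely interpolated solution would need impractically dense tables there.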

Applied as shown, the index r for the inner subtraction will then lie exactly on a stored point, so the inner subtraction will not need to be interpolated and will be completed with a delay of only one additional cycle. During this time, the complementary value k[2] is also looked up. The outer subtraction is then done in the normal way, in three cycles. However, the preceding algebraic transform has guaranteed that the value of r in this outer subtraction is less than -1, and therefore lies in the region for which interpolation is straightforward. The two tables required to implement the transform are themselves many times smaller than those that would have been required to produce the same result by interpolation. Full details are available in 'IEEE Transactions on Computers', January 2016.
The ELM belongs to a category of microprocessors known as 'digital signal processors': devices intended for fast mathematical processing which find use in a great variety of applications. Because they are often employed in battery-operated equipment, it is important that they have low power consumption. They typically work at around a tenth of the clock frequency of general-purpose microprocessors, but can often offer better performance in mathematical processing because of their specialised design.

The developers

The ELM project originated in Dr Nick Coleman's research at Newcastle University, and was then undertaken with around 1m Euros funding from the Long-Term Research sector of the European Strategic Programme for Research in Information Technology (ESPRIT). Partners in the project were the Institute of Information Theory and Automation at the Czech Academy of Sciences, Philips Research, Eindhoven, University College Dublin, and Massana Ltd, Dublin. Chips were fabricated at Philips Semiconductors, Nijmegen.

The results

Following the original development work, a suite of industrial-scale programmes was developed to prove that the device actually worked in practice. It was also essential to show that, by using the logarithmic arithmetic system, the device could outperform a conventional microprocessor based on floating-point arithmetic. The same programmes were therefore deployed on a state-of-the-art industry-standard floating-point device, the Texas Instruments TMS320C67, and a direct comparison made between the two. The ELM delivered significantly faster execution and more accurate results. Some highlights are presented below; for full details please see 'IEEE Transactions on Computers', April 2008.
A 256 x 256 pixel ray-traced image (with 512 x 512 adaptive supersampling) was generated on the ELM (running at 125 MHz) and on the TMS320C67 (150 MHz). Each was programmed in optimised assembly language, using the fastest algorithms available for each device. The TMS320C67 completed in 14.9 sec and the ELM in 10.9 sec. (Pictured image generated on ELM.)
Nymphaea alba
A signal (top) was corrupted with the addition of random noise (middle). This was presented to a recursive-least-squares filter which recovered the signal (bottom). This was run for several orders of filter, both on the ELM (at 125 MHz) and TMS320C67 (150 MHz). There was no significant difference in execution time between the two devices, but the average signal-to-noise ratio of the result returned by the TMS320C67 was 108.4 dB, and that of the ELM 117.5 dB. (Pictured results for order 64 filter running on ELM.)
QR Recursive Least Squares Filter
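For readers unfamiliar with the technique, the benchmark can be mimicked in a few lines. The sketch below is an illustrative stand-in only: it uses a conventional exponentially-weighted RLS filter rather than the QR-decomposition variant of the actual test, as a one-step predictor that separates a predictable sinusoid from unpredictable white noise, and the signal, filter order and forgetting factor are made-up values.

```python
import math, random

random.seed(1)
M, LAM, N = 16, 0.998, 4000          # filter order, forgetting factor, samples

clean = [math.sin(0.2 * math.pi * n) for n in range(N)]
noisy = [c + random.gauss(0.0, 0.3) for c in clean]

# Exponentially-weighted RLS used as a one-step linear predictor:
# the sinusoid is predictable from its past samples, the white noise
# is not, so the prediction is a de-noised estimate of the signal.
w = [0.0] * M
P = [[100.0 if a == b else 0.0 for b in range(M)] for a in range(M)]

pred = [0.0] * N
for n in range(M, N):
    u = noisy[n - M:n][::-1]                     # M most recent samples
    Pu = [sum(P[a][b] * u[b] for b in range(M)) for a in range(M)]
    g = [p / (LAM + sum(u[a] * Pu[a] for a in range(M))) for p in Pu]
    pred[n] = sum(w[a] * u[a] for a in range(M))
    e = noisy[n] - pred[n]                       # a-priori prediction error
    w = [w[a] + g[a] * e for a in range(M)]
    P = [[(P[a][b] - g[a] * Pu[b]) / LAM for b in range(M)]
         for a in range(M)]

def snr_db(est, lo):                             # SNR over samples lo..N-1
    sig = sum(clean[n] ** 2 for n in range(lo, N))
    err = sum((clean[n] - est[n]) ** 2 for n in range(lo, N))
    return 10.0 * math.log10(sig / err)

in_snr, out_snr = snr_db(noisy, N // 2), snr_db(pred, N // 2)
print(f"input SNR {in_snr:.1f} dB, filtered SNR {out_snr:.1f} dB")
```

In the actual experiment both devices ran the same algorithm; the 9 dB difference in output quality came entirely from the arithmetic, the LNS carrying less rounding error through the filter's recursions than floating point.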

Latest developments

Although the LNS arithmetic circuits were faster and more accurate than the equivalent floating-point versions, one remaining problem was that they were physically larger. Continuing theoretical work has demonstrated that the transform algebra can be applied recursively. Whereas the first-order application, as described above, yielded a dramatic reduction in circuit size compared with a purely interpolated solution, the second-order application delivers a further large reduction. As we show in 'IEEE Transactions on Computers', January 2016, an LNS arithmetic unit may now be constructed with a silicon footprint similar to that of a floating-point unit. Furthermore, the lookup tables are now small enough to allow their implementation in synthesised logic, rather than, as previously, as read-only memories. This results in further gains in speed.