FPGA Implementation of modified non-restoring square root core

ShabirAhmed B J, Narendra K, Swaroop Kumar K, Asha G H

Abstract— The aim of this paper is to synthesize and implement an algorithm to compute square root efficiently and cost effectively. Square root is a computation required in many mathematical problems required in computer multimedia communication and in many space related data processing. So, there is a requirement to develop this operation efficiently. Hence, in this paper a new type of algorithm is used to design a core for finding square root. The core implemented in this paper uses a modified non-restoring division algorithms to find square root. Three types of structures have been developed namely: basic combinational, iterative and pipeline. Basic combinational is simple implementation of non-restoring division algorithm, it is nothing but single stage of pipelined architecture. This architecture can be used when cost is the major factor and speed can be compromised. Iterative architecture is hardware efficient cost effective architecture where single hardware unit is used iteratively for a computation. Pipelined architecture is the fastest architecture to be implemented in this paper. It uses various stages for computation i.e. parallel execution is performed and hence speeding up the execution time and process. Pipelined architecture can be used in real time processing system where speed is the major factor. These three structures are developed to compare various parameters like speed, cost, reliability and distinguish them for various application suited by them. The core is developed for any FPGA processor and is simulated and debugged using XILINX ISE 14.1. The architecture is implemented onto VIRTEX family and debugged on Spartan 3 XC3S400TQ144.

Index Terms—Combinatorial, FPGA, Iterative, Pipelined, Spartan, Virtex

I. INTRODUCTION

Square root is an operation required by system graphics and scientific computation applications such as math coprocessors, DSP algorithms, data processing and control [1]. Hence, it is an important computation that need to be enhanced. In 1996, Lu and Chi [1] have proposed a ‘new non-restoring square root algorithm’ for VLSI implementation, which is better than the existing VLSI algorithms for computing square root. In many VLSI real time image processing applications, it is high prioritized requirement to provide the computation of square root of a binary coded number with low power dissipation and fast computation (low delay propagation). Square root calculation is one of the most useful and vital operations in computer graphics and scientific calculation applications, such as digital signal processing (DSP) algorithms, math coprocessor, data processing and control, and even multimedia applications [1-6]. It is a classical problem in computational number theory, which is often encountered and which is a hard task to get an exact result [7-8].

The paper is divided as follows: Section II describes the algorithm. Section III presents the implemented architectures. Section IV explains the results and analysis, and in the results a detailed comparison between the spartan core and Virtex core is presented. Finally, conclusions is given.

II. NON RESTORING ALGORITHM

The focus of the previous restoring and non-restoring algorithms is on each bit of the square root with each iteration. In this section, non-restoring square root algorithm has been described as in [1]. Each operation consists of addition or subtraction based on the sign of the result of previous operation. The partial remainder generated in each iteration is used in the next iteration even it is negative [1]. At the final iteration, if the partial remainder is not negative, it becomes the final precise remainder.

Radical: ‘D’ of ‘2n’ bits. Square root: ‘Q’ of ‘n’ bits:

<table>
<thead>
<tr>
<th>D:</th>
<th>D</th>
<th>D</th>
<th>D</th>
<th>D</th>
<th>D</th>
<th>1</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>Q:</td>
<td>Q</td>
<td>Q</td>
<td>Q</td>
<td>Q</td>
<td>Q</td>
<td>Q</td>
<td>Q</td>
</tr>
</tbody>
</table>

\[
d_k = D_{2n-1}D_{2n-2} \ldots D_k, k \leq 2n - 1
\]

\[
q_k = Q_{n-1}Q_{n-2} \ldots Q_k, k \leq n - 1
\]

*Note that \( q_k \) has ‘n-k’ bits

\[
r'_{n-1} = D_{2n-1}D_{2n-2} - 01
\]

\[
Q_{n-1} = \begin{cases} 1, & \text{if } r'_{n-1} \geq 0 \\ 0, & \text{if } r'_{n-1} < 0 \end{cases}
\]

For \( k = n-2 \) downto 0 do

Manuscript received April 15, 2015.


Narendra K. Student (M.Tech) Digital Electronics and Communication systems, Malnad College of Engineering, Hassan, Karnataka, India, +91-9738543811.


Asha G H. Associate Professor, Dept. of Electronics and communication, Malnad College of Engineering, Hassan, Karnataka, India, +91-9448033837.
At each iteration, \( q_k \) (the square root of \( d_{2k} \)), is computed. Since \( d_{2k} = d_{2(k+1)}D_{2k+1}D_{2k} \), that is \( D_{2k+1}D_{2k} \) is attached to \( d_{2(k+1)} \) to form \( d_{2k} \), it can be inferred that \( D_{2k+1}D_{2k} \) must be used to get \( q_k \). That explains the fact the algorithm attaches \( D_{2k+1}D_{2k} \) to \( r'_{k+1} \) to form \( r'_{k} \) in order to get \( q_k \). The remainder at each iteration, called \( r_k \), has \( n-k+1 \) bits, one more bit than \( q_k \) [1]:

\[
\begin{align*}
\text{Remainder } R &= r_0 \\
&= \begin{cases} 
  r'_0 & \text{if } r'_0 \geq 0 \\
  q_k & \text{if } r'_0 < 0 
\end{cases}
\end{align*}
\]

III. ARCHITECTURES

As mentioned in abstract three types of architectures have been implemented which will be described in this section.

A. Pipelined

To implement this architecture we need to unfold the algorithm explained in section II. Therefore ‘n’ stages with ‘n’ adders/subtractors will appear. By observing the first iteration, a reduction is obtained:

\[
\begin{align*}
R'_{n-1} &\leftarrow D2n-1D2n-2 - 01 \\
Q_{n-1} &\leftarrow 1, \text{if } r'_{n-1} \geq 0 \\
Q_{n-1} &\leftarrow 0, \text{if } r'_{n-1} < 0 
\end{align*}
\]

There is no need to perform the first subtraction and wait one cycle, if the result from the first iteration can be obtained directly from the first 2 MSBs of \( D \). So the first stage can be embedded into the second stage, and there will be ‘n-1’ pipeline stages.

This architecture is depicted in Figure 1. The computation of the remainder is not considered, although the core computes it if the user wants. Note that the dotted rectangles indicate the registers that would have appeared if the reduction of the first stage hadn’t been performed. Such architecture can obtain a new square root each cycle. The initial latency is ‘n’ cycles.

The longest path delay occurs in the last stage, because the adder/subtractor increases in size as stages advance. A further improvement can be made if the last stages are pipelined, and the initial ones merged.

B. Combinatorial Architecture

This architecture is implemented because some non-real time applications need it, and also in order to establish a comparison with the core that does have a fully-combinatorial architecture. The architecture is very simple: It is the fully-pipelined architecture without the pipelining registers. It only has one register at the input and one at the output.
C. Iterative Architecture

The size of the elements (registers, adder/subtractor) will be the size of the last stage of the pipelined architecture:
Register R ‘n+1’ bits Register Q ‘n’ bits Adder / subtractor ‘n+2’ bits.

Since all iterations are embedded in one stage, the reduction of Section III A cannot be used.

But a simplification for this case exists:

In the adder/subtractor: the 2 LSBs performs either ‘xy-01’ or ‘xy+11’, ‘xy’ is the pair of D bits used at each step. The operation yields: ‘cba’. The truth table is shown:

<table>
<thead>
<tr>
<th>xy</th>
<th>cba = xy + 11</th>
<th>cba = xy - 01</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>011</td>
<td>00</td>
</tr>
<tr>
<td>01</td>
<td>100</td>
<td>01</td>
</tr>
<tr>
<td>00</td>
<td>101</td>
<td>00</td>
</tr>
<tr>
<td>01</td>
<td>110</td>
<td>01</td>
</tr>
</tbody>
</table>

‘C’: carry-in for the next stage of the adder/subtractor ‘ba’: result of the operation.

‘Ba’ depends only on ‘xy’, but ‘c’ depends on the type of operation. Luckily, a conventional adder/subtractor with carry-in (e.g. the lpm_add_sub megafuntion) treats the carry-in as positive logic when adding, and as negative logic when subtracting [3] (this is done to reduce gates usage). So, for subtraction, we have to invert ‘c’ to assure the proper working of the adder/subtractor. The new truth table is:

<table>
<thead>
<tr>
<th>xy</th>
<th>cba = xy + 11</th>
<th>cba = xy - 01</th>
</tr>
</thead>
<tbody>
<tr>
<td>00</td>
<td>011</td>
<td>00</td>
</tr>
<tr>
<td>01</td>
<td>100</td>
<td>01</td>
</tr>
<tr>
<td>00</td>
<td>101</td>
<td>00</td>
</tr>
<tr>
<td>01</td>
<td>110</td>
<td>01</td>
</tr>
</tbody>
</table>

Now, ‘c’ and ‘ba’ depends only on ‘xy’:

\[ c = x + y \]
\[ b = (x \oplus y) \]
\[ a = y \]

This reduces the width of the adder/subtractor by 2 bits. The result ‘ba’ is obtained in parallel and the carry-in comes from just an OR gate. So the new adder/subtractor uses ‘n’ bits and has carry-in. Also, note that the MSB of the second operator of the adder/subtractor is ‘0’ as in the pipelined case. Figure 3 depicts this architecture.

IV. RESULTS

The architecture were synthesized using XILINX ISE v14.1 successfully. After synthesizing the core were implemented on FPGA device XC3S400-TQ144 (Xilinx Spartan-3 family) with speed grade -5. The core presented does not compute the remainder, since it is rarely used. Figure 5 depicts the core with all its options. Table 1 establishes a comparison between this core and the ALTERA core.

Finite state machine of iterative architecture is depicted in figure 4. This FSM controls the iterative architecture. The process start when s = 1. After ‘n’ clock cycles, the result is obtained in register Q, done = 1, and a new process can be started.

IV. RESULTS

The architecture were synthesized using XILINX ISE v14.1 successfully. After synthesizing the core were implemented on FPGA device XC3S400-TQ144 (Xilinx Spartan-3 family) with speed grade -5. The core presented does not compute the remainder, since it is rarely used. Figure 5 depicts the core with all its options. Table 1 establishes a comparison between this core and the ALTERA core.

Results are shown only for a specific device (Spartan 3) because of large results data with just one device and these results are enough to demonstrate the benefits of the core implemented.
FPGA Implementation of modified non-restoring square root core

Figure 5: Parameter comparison graph for Spartan

Figure 5: Parameter comparison graph for Virtex
Result analysis:
From the graphs shown in the figure 5 and 6 we can infer that latency is more in combinatorial. But in pipeline the latency till 16 bits are low than iterative. After 16 bits iterative and pipeline latency time crossover: i.e. for large input bits latency increases in pipeline. Hence efficiency in pipeline architecture is the highest. Maximum frequency operable is highest in pipelined architecture.

Area Comparison:
As shown in figure 7 and 8 area covered by all three architecture can be seen. Plan Ahead tool of XILINX was used to generate area and compare them. We can infer from figure 7 and 8 that pipeline architecture uses more area and iterative uses the least area.

V. CONCLUSION
The core implemented achieves high speed at minimum cost since it only uses only an adder/subtractor unit to perform the operations. The architecture is very flexible, so that the user can choose the best architecture for his application. The efficiency of this core can be observed from graph and table.

The results are better in terms of speed and resource effort than the earlier implementation. An improvement i.e. simplification for the iterative architecture can be applied to each stage of the pipelined architecture.
REFERENCES


Shabir Ahmed B J, Student (M.Tech) Digital Electronics and Communication systems, Malnad College of Engineering, Hassan, Karnataka, India. +91-9738417860.


Swaroop Kumar K, Student (M.Tech) Digital Electronics and Communication systems, Malnad College of Engineering, Hassan, Karnataka, India, +91-7411379265.

Asha G H, Associate Professor, Dept. of Electronics and communication, Malnad College of Engineering, Hassan, Karnataka, India, +91-9448033837.