## 16 Point Fast Fourier Transform

The Fourier transform is a useful operation in the signal processing space. The transform allows for a signal to be operated on in the frequency domain which can make certain filtering and processing operations much more efficient. A time domain signal arrives in time and the amplitudes represent intensity, however a frequency domain signal arrives in frequency and the amplitudes represent intensity of the frequencies. The discrete Fourier transform (DFT), is the comprehensive transform for a number of sample points. The operation is often times redundant for many values, so researchers developed the fast Fourier transform (FFT) algorithm which is often regarded as the most important algorithm developed in the last century. Modern day digital signal processors contain a multitude of large FFT cores capable of processing dense signals reliably. The FFT is easily implemented in software for a large range of sample points, but is much slower than a hardware implementation. For the embedded DSP course, we are expected to implement the algorithm using computer hardware.

Our approach was a very structural one, using principles of hierarchy, abstraction, and optimization. The 16 point FFT can be broken down into 4 stages of a butterfly operation. The butterfly operation is the algorithmic method of computing the FFT. The 4 stages consist of 8 2 point butterfly operations, 4 4-point butterfly operations, 2 8-point butterfly operations, and 1 16 point butterfly operation.



*Figure 1: A diagram of the 16 point FFT butterfly operations*

The butterfly operations require arithmetic with complex numbers, as the signals in the frequency domain present themselves in the complex plane conveniently. We can simply represent complex numbers using two registers and perform the arithmetic with those. Our approach is a decimation-in-time FFT, which requires a bit reversal of the input order to receive the frequency domain signals in the proper order. The inverse transform is of interest as well. The inverse FFT can be computed by using

another forward FFT and passing in the complex conjugate of the output of the first FFT. The complex conjugate is obtained by simply negating the imaginary part of the complex number. After the signal passes through a second FFT module, the output is divided by N (16) to return the original signal. As for the HDL implementation, we followed a "control unit, datapath" architecture, effectively making a single FFT core. The circuit contains the forward and inverse modules, input and output registers, and the control unit. The modules require 16 parallel inputs, and will have 16 parallel outputs, but each I/O is 64 bits and this will require far too many I/O pins on a physical implementation, so a control unit is used to facilitate serial I/O, significantly reducing the number of I/O pins. The control unit is a simple FSM that transitions between filling the SIPO and reading the PISO.

## Testing

Of course, a testbench was developed to properly stimulate and verify the RTL design. The testbench has 16 real inputs which can be easily modified to test different input sequences. The simulation then proceeds to display the original input sequence, the frequency domain signal, and then the output of the inverse transform, which should be the original time domain signal. The circuit's functional operation was confirmed with a software implementation (python's numpy library). For demonstration purposes, the input sequence is simply the numbers 0-15. The results were on target but had lost some precision because of the scaling factor. With 64 bits and 4 stages of scaling the most reasonable scaling factor was  $2^7 = 128$ . A smaller scaling factor does result in a loss of precision, but the testbench does demonstrate the proof of concept.



*Figure 2: Software verification of the input sequence*

The above figure shows the same data that will be shown in the testbench, that is, the original input sequence, the transformed signal, and the signal after passing through the inverse transform.



Testbench results. As we can see, the results are accurate but not extremely precise.

## Hardware Implementation

The top module was synthesized in the Quartus software. The circuitry is very resource intensive and the combinational logic has a considerable propagation delay. However, the circuit can complete the operation in only 16 clock cycles and the data can be read out serially in another 16 cycles. The resources can be optimized by implementing a pipeline structure to reduce the number of multiplier elements. Although, modern DSPs contain many MAC units, which are extensively used in this operation, so with more specialized hardware, the design would be less resource intensive.



*Figure 4: Resource utilization*

The clock period needed to avoid data hazards came out to 107.8 ns. With approximately 32 clock cycles to complete a transform, inverse transform, and read, the whole sequence will take about 3.45 us.



In this view we see the connections of the four stages of butterfly operations as previously described in the paper. This circuitry serves as the combinational logic for the datapath and contains the majority of the resources.

*Figure 5: RTL view of 16 point butterfly operation*



The RTL view of the top module shows all the components of the core. The view is a very close realization of our initial drawup and works as expected.

*Figure 6: RTL view of the top module*



Finally, we see the internals of the control unit state machine.

## Conclusion

The project was rigorous and thorough. Our understanding of signals and systems was enhanced greatly through the research and development of the unit. The project involved many aspects of computer engineering including the principles of abstraction, signal processing, frequency transforms, RTL design, processor architecture, and simulation/verification. The process was comprehensive and informative and we are glad we chose the 16 point FFT for the final project. It was a strong project to demonstrate the skills we've learned in this course and the prerequisites.

