Hardware Accelerator for 1D Differential Filtering

This project, part of the “Reti Logiche” (Digital Design) course at Politecnico di Milano, involved designing a hardware component in VHDL to perform 1D differential filtering on a sequence of data. The module interfaces with a synchronous RAM to read instructions and data, processes them, and writes the saturated results back to memory.

The Challenge

The hardware module had to process a continuous stream of 8-bit signed integers (two's complement). The process involved:

Reading a 17-byte header to extract the sequence length ($K$), the filter order selector ($S$), and 14 dynamic coefficients.
Applying either an Order 3 or Order 5 differential filter based on the selector.
Performing hardware-efficient normalization (division by 12 or 60) without using expensive division blocks.
Applying saturation logic to clamp the final results within the $[-128, +127]$ range before writing back to memory.
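
As a rough software model of the first and last steps above, the sketch below decodes a 17-byte header and clamps a result to the 8-bit signed range. The byte layout (two bytes for $K$ with the high byte first, one byte for $S$ with only the least significant bit used, then the 14 coefficients) is an assumption made for illustration; the names `parse_header`, `to_signed8`, and `saturate8` are hypothetical and do not come from the project.

```python
from typing import List, NamedTuple

HEADER_LEN = 17  # number of setup bytes stated in the text


class Header(NamedTuple):
    k: int             # sequence length K
    s: int             # filter order selector S
    coeffs: List[int]  # 14 signed coefficients


def to_signed8(byte: int) -> int:
    """Interpret an unsigned byte (0..255) as 8-bit two's complement."""
    return byte - 256 if byte >= 128 else byte


def parse_header(mem: bytes) -> Header:
    """Decode the header under an ASSUMED layout (not confirmed by the text):
    bytes 0-1 = K (high byte first), byte 2 = S (LSB only),
    bytes 3-16 = 14 two's complement coefficients."""
    if len(mem) < HEADER_LEN:
        raise ValueError("header needs 17 bytes")
    k = (mem[0] << 8) | mem[1]
    s = mem[2] & 1
    coeffs = [to_signed8(b) for b in mem[3:HEADER_LEN]]
    return Header(k, s, coeffs)


def saturate8(value: int) -> int:
    """Clamp a wider signed result into [-128, +127]."""
    return max(-128, min(127, value))
```

In hardware the clamp would typically be a comparison on the upper bits of the accumulator rather than a `min`/`max` pair, but the input/output behavior is the same.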

Architectural Design

The system was implemented as a Finite State Machine (FSM) combined with a highly optimized datapath.

1. Finite State Machine (FSM)

I designed a reliable FSM to handle memory synchronization and data processing:

WAIT_START & BUS_HANDOVER: Handles the initialization and prevents misalignment during multiple consecutive start/reset cycles.
READ_HEADER: Uses demultiplexer logic to sequentially load the 17 setup bytes into internal registers.
RUN_READ & RUN_SHIFT: Forms the core processing loop, fetching payload bytes and sliding them through the calculation window.
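
The state sequence can be sketched as a next-state function. Only the five state names come from the description above; the transition conditions, the `DONE` state, and the signal names (`rst`, `start`, `header_left`, `samples_left`) are assumptions chosen to make the example self-contained, and the memory write-back step is left out.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_START = auto()
    BUS_HANDOVER = auto()
    READ_HEADER = auto()
    RUN_READ = auto()
    RUN_SHIFT = auto()
    DONE = auto()  # hypothetical terminal state, not named in the text


def next_state(state: State, rst: bool, start: bool,
               header_left: int, samples_left: int) -> State:
    """Return the next state for one clock edge (assumed transitions)."""
    if rst:
        # Reset always returns to the idle state, in any state.
        return State.WAIT_START
    if state is State.WAIT_START:
        return State.BUS_HANDOVER if start else State.WAIT_START
    if state is State.BUS_HANDOVER:
        # One cycle to take over the memory bus before the first read.
        return State.READ_HEADER
    if state is State.READ_HEADER:
        return State.RUN_READ if header_left == 0 else State.READ_HEADER
    if state is State.RUN_READ:
        return State.RUN_SHIFT
    if state is State.RUN_SHIFT:
        return State.DONE if samples_left == 0 else State.RUN_READ
    # DONE: hold until start is lowered, then go back to idle.
    return State.DONE if start else State.WAIT_START
```

Keeping the reset check first mirrors the requirement that repeated resets must never leave the machine in a misaligned state.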

2. Optimized Datapath & Arithmetic

To meet strict timing and area constraints, complex arithmetic was mapped to simple hardware operations:

Sliding Data Window: Created a 7-element shift register (x0 to x6) to align the incoming sequence with the dynamic filter coefficients.
Multiply-Accumulate (MAC): Executed parallel multiplications resized to 20 bits to prevent overflow during accumulation.
Shift-Based Division: Avoided costly hardware dividers by approximating the normalization. For example, dividing by 12 for the Order 3 filter was achieved via right bit-shifts: $1/12 \approx 1/16 + 1/64 + 1/256 + 1/1024$. Custom compensation logic (+1 on shifts) was added to minimize truncation errors on negative numbers.
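
The three datapath steps above can be modeled in a few lines. This is a behavioral sketch, not the VHDL itself: the function names are hypothetical, the window ordering is assumed, and the "+1 on shifts" compensation is interpreted as adding 1 to each shifted term when the accumulator is negative, which is one plausible reading of the description.

```python
from typing import List

ACC_BITS = 20  # accumulator width stated in the text


def shift_window(window: List[int], sample: int) -> List[int]:
    """Slide the 7-element window: drop the oldest sample, append the newest.
    Which end is x0 and which is x6 is an assumption here."""
    return window[1:] + [sample]


def wrap_signed(value: int, bits: int = ACC_BITS) -> int:
    """Model a fixed-width two's complement register."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(window: List[int], coeffs: List[int]) -> int:
    """Sum of products over the window, kept in a 20-bit accumulator."""
    acc = 0
    for x, c in zip(window, coeffs):
        acc = wrap_signed(acc + x * c)
    return acc


def asr_comp(value: int, n: int) -> int:
    """Arithmetic right shift by n; add 1 for negative inputs
    (ASSUMED form of the compensation logic)."""
    shifted = value >> n  # Python's >> rounds toward minus infinity
    return shifted + 1 if value < 0 else shifted


def div12_approx(acc: int) -> int:
    """Approximate acc / 12 as acc/16 + acc/64 + acc/256 + acc/1024."""
    return sum(asr_comp(acc, n) for n in (4, 6, 8, 10))
```

With 8-bit signed samples and coefficients, the worst-case sum over seven taps is $7 \times 128 \times 128 = 114688$, which fits in a 20-bit signed value. The shift factors sum to $85/1024 \approx 0.0830$ against $1/12 \approx 0.0833$, so `div12_approx(1200)` gives 98 rather than the exact 100; the remaining gap comes mostly from truncating each term separately.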

Results

The component was synthesized successfully in Vivado with excellent performance metrics:

Resource Utilization: Highly efficient, using only 971 LUTs (0.83%) and 269 Flip-Flops (0.11%).
Timing: Met the $20\text{ ns}$ clock constraint (50 MHz) with a positive slack of $9.460\text{ ns}$, indicating the logic could run comfortably faster than the required 50 MHz.
Reliability: Handled complex edge cases, such as multiple consecutive resets and memory out-of-bounds protection, passing all provided and custom testbenches.
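
The timing headroom can be estimated from the reported figures. The calculation below is a simplification that treats the slack as pure path-delay margin and ignores clock uncertainty and any differences between synthesis and implementation estimates, so the result is an indication rather than a guaranteed maximum frequency.

```python
CLOCK_PERIOD_NS = 20.0  # constraint from the text (50 MHz)
SLACK_NS = 9.460        # reported positive slack

# Worst-case path delay implied by the slack.
critical_path_ns = CLOCK_PERIOD_NS - SLACK_NS

# Approximate maximum clock frequency in MHz (1000 / period in ns).
fmax_mhz = 1000.0 / critical_path_ns

print(f"critical path ~ {critical_path_ns:.2f} ns, fmax ~ {fmax_mhz:.1f} MHz")
```

Under these assumptions the critical path is about 10.54 ns, which corresponds to roughly 95 MHz.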

Source Code: View on GitHub