Conflict-free parallel memory accessing techniques for FFT architectures
Speeding up fast Fourier transform (FFT) computations is critical for today's real-time systems targeting signal processing and telecommunication applications. Aiming to improve the performance and efficiency of FFT architectures, this paper presents an address generation technique that enables a radix-b processor to access b memory banks in parallel without conflicts during each stage's computations. Using kb memory banks at each stage increases the speedup of the algorithm by a factor of kb. The address generation can be realized in each radix-b stage by the use of lookup tables of size O(kb²) bits. The proposed technique is cost efficient and leads to the design of FFT architectures of high speedup and high sustained throughput. © 2008 IEEE
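To make the notion of conflict-free bank access concrete, here is a minimal Python sketch. It does not reproduce the paper's lookup-table address generation; it uses a well-known digit-sum mapping as a stand-in, and it assumes an in-place radix-b FFT on b^n points in which the b operands of a butterfly at stage s differ only in base-b digit s. The function names are hypothetical.

```python
def digits(addr, b, n):
    """Return the n base-b digits of addr, least significant first."""
    return [(addr // b**i) % b for i in range(n)]


def bank(addr, b, n):
    """Assumed bank mapping: sum of the base-b digits, modulo b."""
    return sum(digits(addr, b, n)) % b


def butterfly_operands(base_addr, stage, b):
    """Addresses of the b operands of one butterfly at `stage`.

    base_addr must have digit `stage` equal to 0; the operands are
    obtained by letting that digit run over 0..b-1.
    """
    return [base_addr + d * b**stage for d in range(b)]


def is_conflict_free(b, n):
    """Check that every butterfly at every stage touches b distinct banks."""
    for stage in range(n):
        for addr in range(b**n):
            if digits(addr, b, n)[stage] != 0:
                continue  # not the base address of a butterfly at this stage
            banks = {bank(a, b, n) for a in butterfly_operands(addr, stage, b)}
            if len(banks) != b:
                return False
    return True
```

Because the operands of one butterfly differ in a single digit that takes all b values, their digit sums are distinct modulo b, so each operand lands in a different bank.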
Efficient systolic array mapping of FIR filters used in PAM-QAM modulators
This paper presents efficient techniques for mapping FIR filter computation circuits for PAM and QAM modulators onto systolic arrays. Exploiting the inherent symmetry of these problem instances and using Look-Up Tables (LUTs) together with systolic architectures increases the performance while keeping the VLSI area minimal. Exploiting parallelism and pipelining enhances the throughput and results in linear expandability of the FIR filter with respect to the bit accuracy and the stage count
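The symmetry mentioned in this abstract can be illustrated in software. The following minimal Python sketch assumes a linear-phase FIR filter with symmetric coefficients (h[k] = h[N-1-k]) and shows how pairing input samples before multiplying roughly halves the number of multiplications. It does not model the systolic array mapping or the LUT-based arithmetic; the function names are hypothetical.

```python
def fir_direct(h, x, n):
    """Direct form: y[n] = sum_k h[k] * x[n-k], with x zero outside its range."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))


def fir_folded(h, x, n):
    """Folded form for symmetric h: add mirrored samples, then multiply once."""
    taps = len(h)

    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    acc = 0
    for k in range(taps // 2):
        # h[k] == h[taps-1-k], so both samples share one multiplication.
        acc += h[k] * (sample(n - k) + sample(n - (taps - 1 - k)))
    if taps % 2:
        mid = taps // 2
        acc += h[mid] * sample(n - mid)  # unpaired centre tap
    return acc


h = [1, 3, 5, 3, 1]          # symmetric coefficients (illustrative values)
x = [2, -1, 4, 0, 7, 3]      # illustrative input samples
outputs_match = all(
    fir_direct(h, x, n) == fir_folded(h, x, n)
    for n in range(len(x) + len(h) - 1)
)
```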
A graphics parallel memory organization exploiting request correlations
Real-time graphics applications require memory organizations featuring parallel pixel access and low-cost implementation. This work builds on a nonlinear skew mapping scheme and exploits the correlation between consecutive requests for pixels to design an efficient parallel memory organization. The mapping achieves parallel access of mn pixels, in various shapes, to a memory organized with mn banks. The proposed design technique combines the mapping properties and the spatial correlations among pixel requests to eliminate conflicts by spending at most one extra cycle every mn consecutive parallel pixel accesses. Consequently, the technique ensures that any pixel pattern, among those commonly used in graphics, can be accessed in a single cycle from any image location. The address computations become straightforward because the numbers of requested pixels and banks, apart from being equal, can be powers of 2. © 2006 IEEE
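As a rough illustration of why skewed bank mappings matter, the Python sketch below uses a simple linear skew, bank(x, y) = (x + m·y) mod mn. This is an assumption chosen for brevity and is not the nonlinear scheme the paper builds on. With this mapping, any m×n block and any row of mn pixels is conflict-free from any location, while a column of mn pixels is not, which is the kind of limitation that more elaborate schemes aim to remove. The function names are hypothetical.

```python
def bank(x, y, m, n):
    """Assumed linear skew mapping onto m*n banks (not the paper's scheme)."""
    return (x + m * y) % (m * n)


def conflict_free(pixels, m, n):
    """True if all requested pixels fall into distinct banks."""
    banks = [bank(x, y, m, n) for x, y in pixels]
    return len(set(banks)) == len(banks)


def block(x0, y0, width, height):
    """Pixel coordinates of a width-by-height rectangle anchored at (x0, y0)."""
    return [(x0 + i, y0 + j) for j in range(height) for i in range(width)]


m, n = 4, 2  # illustrative sizes: 8 banks, 8 pixels per access
origins = [(x0, y0) for x0 in range(16) for y0 in range(16)]

blocks_ok = all(conflict_free(block(x0, y0, m, n), m, n) for x0, y0 in origins)
rows_ok = all(conflict_free(block(x0, y0, m * n, 1), m, n) for x0, y0 in origins)
columns_ok = all(conflict_free(block(x0, y0, 1, m * n), m, n) for x0, y0 in origins)
```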
Acceleration techniques and evaluation on multi-core CPU, GPU and FPGA for image processing and super-resolution
Super-resolution (SR) techniques constitute a key element in image applications that need high-resolution reconstruction when, in the worst case, only a single low-resolution observation is available. SR techniques involve computationally demanding processes, and thus researchers are currently focusing on SR performance acceleration. Aiming to improve SR performance, the current paper builds on the characteristics of the L-SEABI SR method to introduce parallelization techniques for GPUs and FPGAs. The proposed techniques accelerate GPU reconstruction of ultra-high definition content, running three times (3×) faster than real-time performance on mid-range and previous generation devices and at least nine times (9×) faster than real-time performance on high-end GPUs. The FPGA design leads to a scalable architecture performing four times (4×) faster than real time on low-end Xilinx Virtex 5 devices and 69 times (69×) faster than real time on the Virtex 2000t. Moreover, we confirm the benefits of the proposed acceleration techniques by employing them on a different category of image processing algorithms, window-based disparity functions, for which the proposed GPU technique shows an improvement over the CPU performance ranging from 14 times (14×) to 64 times (64×), while the proposed FPGA architecture provides 29× acceleration. © 2016, Springer-Verlag Berlin Heidelberg
Management tool for the “Nephele” data center communication agent
Optical switching provided the means for the development of Data Centers with high throughput interconnection networks. A significant contribution to advanced optical Data Center designs is the Nephele architecture, which employs optical data planes, optical Points of Delivery (PoD) switches and Top of Rack (ToR) switches equipped with 10 Gbps connections to the PoDs and the servers. Nephele follows the Software Defined Network (SDN) paradigm based on the OpenFlow protocol, and it employs an Agent that communicates the protocol commands to the data plane. The current paper presents a management tool for the Agent. The tool is used to configure the Agent, create commands, perform step operations and monitor the results and the status. Moreover, as a testing and validation tool, it plays a significant role in improving the Agent's design and in upgrading the entire data center's organization and performance. © 2018 Advances in Science, Technology and Engineering Systems. All rights reserved
Exact Max-Log MAP Soft-Output Sphere Decoding via Approximate Schnorr-Euchner Enumeration
The complexity gains of sphere decoders (SDs) with Schnorr-Euchner enumeration and nonconstant amplitude constellations are limited by the required node ordering. Aiming at improving the implementation efficiency of SD without compromising optimality, this paper proposes a novel tree traversal for soft-output SDs providing the exact max-log MAP decoder performance. It consists of a predefined visiting order that approximates the exact Schnorr-Euchner enumeration (SEE) and a modified pruning metric that preserves the exact max-log MAP despite the approximate ordering. The proposed approach significantly improves both the computational complexity and the implementation cost of exact soft-output SDs compared with previous techniques. In particular, simulations show gains of 30%-56% in the required calculations for a 4 × 4 multiple-input multiple-output system with 16-quadrature amplitude modulation (QAM), and field-programmable gate array (FPGA) implementations show an average power reduction of 34%-50%. © 1967-2012 IEEE
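To show what the node ordering in this abstract refers to, the minimal Python sketch below computes an exact Schnorr-Euchner visiting order for one real-valued tree level: candidate symbols are visited in order of increasing distance from the received value. Sorting is used here only for clarity; this is neither the predefined approximate order nor the modified pruning metric proposed in the paper, and the function name and the 4-PAM constellation are illustrative assumptions.

```python
def schnorr_euchner_order(y, constellation):
    """Return constellation points sorted by increasing distance from y."""
    return sorted(constellation, key=lambda s: abs(y - s))


pam4 = [-3, -1, 1, 3]  # illustrative 4-PAM constellation

order_a = schnorr_euchner_order(0.4, pam4)   # distances: 0.6, 1.4, 2.6, 3.4
order_b = schnorr_euchner_order(-2.2, pam4)  # distances: 0.8, 1.2, 3.2, 5.2
```

The order depends on the received value, so an exact enumeration has to be recomputed at every visited node; that per-node cost is what a predefined visiting order avoids.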
An efficient multiple precision floating-point Multiply-Add Fused unit
Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple precision floating-point Multiply-Add Fused (MAF) unit. The proposed MAF is reconfigurable and able to execute a quadruple precision MAF instruction, or two double precision instructions, or four single precision instructions in parallel. The MAF architecture features a dual-path organization reducing the latency of the floating-point add (FADD) instruction and utilizes the minimum number of operating components to keep the area low. The proposed MAF design was implemented on a 65 nm silicon process achieving a maximum operating frequency of 293.5 MHz at 381 mW power. © 2015 Elsevier Ltd. All rights reserved
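For readers unfamiliar with fused multiply-add semantics, the following minimal Python sketch contrasts a fused operation, which rounds a·b + c once, with a separate multiply followed by an add, which rounds twice. The fused result is emulated with exact rational arithmetic rather than a hardware instruction. The function names and operand values are illustrative assumptions and do not describe the paper's datapath.

```python
from fractions import Fraction


def fma_emulated(a, b, c):
    """Compute a*b + c exactly, then round once to double precision."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a, b, c):
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


# Illustrative operands: the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fma_emulated(a, b, c)    # keeps the low-order term of the product
unfused = mul_then_add(a, b, c)  # low-order term is lost by the first rounding
```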
