www.irjet.net e-ISSN: 2395-0056 p-ISSN: 2395-0072

© 2021, IRJET **Impact Factor value: 7.529** ISO 9001:2008 Certified Journal Page 156

# AN EFFICIENT VEDIC BASED PROCESSING ELEMENT FOR SYSTOLIC ARRAY

Dr.N.Nagaraju<sup>1</sup>, Revathi B<sup>2</sup>, Shifanila S<sup>3</sup>, Shyama A<sup>4</sup>, Seema M<sup>5</sup>

<sup>1</sup>Assistant Professor, <sup>2,3,4,5</sup>UG scholar, Department of Electronics and Communication Engineering, Adhiyamaan College of Engineering, Hosur, Tamilnadu, India.

<sup>1</sup>nnagaece@gmail.com, <sup>2</sup>revathibr3@gmail.com, <sup>3</sup>Shifanilasenthil17@gmail.com, <sup>4</sup>sizzlingshyamaajith079@gmail.com, <sup>5</sup>mseemaacuec@gmail.com

**Abstract:** VLSI technology is used to develop hardware that meets the required performance within a specified size. High performance in special-purpose computer systems is much in demand in this era of artificial intelligence and machine learning. A tensor processing unit (TPU) is an application-specific integrated circuit used to accelerate machine learning applications. These applications rely on matrix computations, which consist of extensive arithmetic operations and are also important for many signal processing applications. Matrix multiplication can be accelerated using special-purpose hardware such as systolic arrays [4]. A systolic array (SA) is a network of processing elements (PEs) that compute and pass data through the system. However, the PEs in SAs have high critical path delays, which limit the performance of SAs. The main goal is therefore to reduce the delay of each individual PE by employing a more efficient multiplier. In this project, an efficient multiplier is designed in order to achieve better performance. This work is implemented in VHDL and synthesized using Xilinx ISE 14.2i.

**Keywords:** Processing element (PE), Systolic array (SA), Tensor processing unit (TPU), Vedic multiplier.

## I. INTRODUCTION

Matrix multiplication plays a very important role in most signal processing applications, such as image processing, feature extraction, video compression, audio processing and wireless communication, and in machine learning. However, matrix multiplication is a computationally intensive arithmetic operation, so computational delay is a crucial factor. The problem can be reduced by using special-purpose hardware called a systolic array. The performance of a systolic array, however, depends directly on the processing elements (PEs) [12] that make up the array. The total time taken by the system is the time taken by the PEs to propagate the signals. As a result, the efficiency of the SA is greatly reduced. An analysis is made to explore how the delay of the PEs can be reduced, which is the target of the proposed work.

Matrix multiplication is one of the core components in DSP, so the speed of the processor depends directly on the multipliers used in it. The most common multipliers are Booth multipliers and array multipliers, and both have a high propagation delay. A Vedic-based approach is therefore highlighted here, because Vedic mathematics provides a much simpler derivation of the array multiplier than the conventional ones, and its computational speed is also known to be higher. The sutra (algorithm) used here is "Urdhva Tiryagbhyam". Even though Urdhva Tiryagbhyam [8] is known to be a fast and efficient multiplier, a large number of partial products are propagated, because the 2x2 block is the basic building block of the 4x4 multiplier, the 4x4 block is the basic building block of the 8x8 multiplier, and so on. To manage this problem, the 4x4 multiplier can be built using another fast multiplication algorithm, keeping Urdhva Tiryagbhyam for the higher-order multiplier blocks.

## II. APPROXIMATE PROCESSING ELEMENT

Matrix multiplication is a computationally intensive arithmetic operation and is very important for signal processing applications. The ordinary matrix-matrix multiplication algorithm represents a compute-bound task, since every entry in a matrix is multiplied by all entries in some row or column of the other matrix [5]. Adding two matrices, on the other hand, is I/O-bound, because the total number of additions is not larger than the total number of entries in the two matrices. Any attempt to speed up an I/O-bound computation must rely on an increase in memory bandwidth. Memory bandwidth can be increased by using either fast components (costly) or interleaved memories (which bring complicated memory management problems). Speeding up a compute-bound computation, however, can often be accomplished in a relatively simple and inexpensive manner, namely by the systolic approach.

The processing elements in a systolic array, however, have a high critical path delay, which limits the benefits of systolic arrays. Interestingly, most signal processing applications are error resilient. Consequently, approximate computing is used as an alternative design approach to address the compute-bound problem, and many strategies have been proposed to address the issue.

Figure 1. Approximating Partial products in group A

Because signal processing applications are error resilient, approximate-PE systolic-array-based matrix multiplication has been proposed; the systolic array architecture is simple, regular, modular and concurrent, which can be used to increase efficiency. In [12], two designs (Fig. 1, Fig. 2) for approximate matrix multiplication based on a systolic array were proposed. They showed reduced critical path delay by using approximation. The internal architecture of exact processing elements with 8-bit input operands was presented in [12] to highlight the long critical path delay issue. The partial products were divided into two groups, A and B. Partial products are generated and accumulated simultaneously. By using an approximate PPU, a 58% improvement in delay was achieved. Energy gains of about 45% and 51% were also achieved with the two designs (Fig. 1, Fig. 2). These designs exhibit high performance because the critical path delay of the systolic array is reduced.

Figure 2. Approximating Partial products in both groups (A&B)

## III. VEDIC BASED PROCESSING ELEMENT

In the proposed architecture for an n-bit processing element, a Vedic architecture is used; in particular, the Urdhva Tiryagbhyam sutra is used to evaluate the product of n-bit operands. The critical path delay is reduced by employing the Vedic architecture in the processing element, with no compromise in the precision of the output. The proposed block diagram is given in Fig. 3. It has four 8×8 systolic array multiplier blocks and three 16-bit carry look-ahead adders (CLAs). The inputs a[7:0] and b[7:0] are fed into the first 8×8 systolic array multiplier, and Q0[15:0] is its output. Similarly, the remaining blocks are fed with the respective bits of a and b, and their outputs are Q1[15:0], Q2[15:8] and Q3[15:0].

The carry look-ahead adders are used to add the partial products. One CLA takes Q1[15:0], Q2[15:8] and Q3[15:0] as inputs and produces Q4[15:0] along with a carry C1. Another CLA takes Q4[15:0] and Q0[15:8], padded with eight '0' bits, and produces Q5[15:0] along with a carry C2. The rest of the design follows the Vedic design flow in the same way. For the multipliers in the sub-blocks, another fast multiplier was used instead of the Vedic multiplier to reduce the number of partial products generated.
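The block-wise decomposition described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the VHDL design of this work: the function names `split_halves` and `vedic_multiply` are hypothetical, Python's built-in `*` stands in for the fast 4x4 sub-block multiplier, and the grouping of the three additions follows the usual high/low-half decomposition, which may differ from the exact wiring in Fig. 3.

```python
def split_halves(x, half):
    """Return the (high, low) halves of x, each `half` bits wide."""
    mask = (1 << half) - 1
    return (x >> half) & mask, x & mask


def vedic_multiply(a, b, n=8, base=4):
    """Behavioural model of an n-bit x n-bit multiplier built from four
    (n/2)-bit multiplier blocks and three additions.

    Assumptions (not taken from the paper): n is a power of two, and
    blocks of `base` bits or fewer use Python's `*` as a stand-in for
    the fast sub-block multiplier.
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must fit in n bits")
    if n <= base:
        return a * b  # stand-in for the fast 4x4 block

    half = n // 2
    mask = (1 << half) - 1
    a_hi, a_lo = split_halves(a, half)
    b_hi, b_lo = split_halves(b, half)

    # Four half-width products (the four multiplier blocks).
    q0 = vedic_multiply(a_lo, b_lo, half, base)
    q1 = vedic_multiply(a_hi, b_lo, half, base)
    q2 = vedic_multiply(a_lo, b_hi, half, base)
    q3 = vedic_multiply(a_hi, b_hi, half, base)

    # Three additions (modelled here with unbounded integers, so the
    # carries are kept implicitly instead of as separate carry bits).
    cross = q1 + q2                # adder 1: the two cross products
    mid = cross + (q0 >> half)     # adder 2: add the upper half of q0
    high = q3 + (mid >> half)      # adder 3: add the overflow of mid to q3

    # Concatenate the non-overlapping fields into the 2n-bit product.
    return (high << n) | ((mid & mask) << half) | (q0 & mask)


if __name__ == "__main__":
    print(vedic_multiply(0b1010, 0b1110, n=4))          # 140
    print(vedic_multiply(0b10101101, 0b00111010, n=8))  # 10034
```

The two calls at the bottom reuse operand pairs from the simulation results (10 × 14 and 173 × 58). Because the model is exact, it can also be checked exhaustively against ordinary multiplication for all 8-bit operands.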
As a result, an efficient processing element for the systolic array was designed. This design can be used to reduce the delay as well as the number of partial products generated.

Figure 3. Vedic based Processing Element

## IV. RESULTS AND DISCUSSION

Various inputs were applied and the correctness of the design was tested. For the $4\times4$ design, 1010 (10) multiplied by 1110 (14) gave the result 10001100 (140), as shown in Fig. 4.

Figure 4. Simulation Output for 4x4

Delay and LUT count were determined for all the designs, and the results are tabulated in Table 1. The proposed system shows a decrease in delay compared with the other designs, while its LUT count is higher.

Table 1. 4x4 Delay analysis and LUT count

| Design | SPARTAN 3 Delay (ns) | SPARTAN 3 LUT Count | VIRTEX 4 Delay (ns) | VIRTEX 4 LUT Count |
|------------------|--------|----|-------|----|
| 4x4 Conventional | 17.369 | 26 | 9.248 | 26 |
| 4x4 Existing | 16.344 | 30 | 8.882 | 30 |
| 4x4 Proposed | 15.410 | 34 | 8.346 | 34 |

Similarly, for the 8x8 design, the input 10101101 (173) multiplied by 00111010 (58) gave the result 0010011100110010 (10034), as shown in Fig. 5.

Figure 5. Simulation Output for 8x8

For the 4x4 design on the SPARTAN 3 family (Table 1), the proposed system is found to be 11.27% more delay efficient than the conventional method and 5.71% more delay efficient than the existing system. For the VIRTEX 4 family, improvements in delay of 8.78% and 5.02% were noted.

Table 2. 8x8 Design analysis and LUT count

| Design | SPARTAN 3 Delay (ns) | SPARTAN 3 LUT Count | VIRTEX 4 Delay (ns) | VIRTEX 4 LUT Count |
|------------------|--------|-----|--------|-----|
| 8x8 Conventional | 33.369 | 118 | 13.66 | 118 |
| 8x8 Existing | 27.327 | 127 | 13.767 | 127 |
| 8x8 Proposed | 28.733 | 124 | 14.231 | 124 |

The RTL schematic of the proposed system is given in Fig. 7.

Figure 7. RTL Schematic

## V. CONCLUSION

A systolic array multiplier is an arrangement of processors in an array in which data flows synchronously across the array between neighbours. In the proposed work, a Vedic-architecture-based systolic array multiplier is used in each processing element and is considered an ideal solution for dense matrix multiplication. The design of the n-bit systolic array multiplier was optimised using the structural style rather than the behavioural style. Using Xilinx ISE, the design was synthesised for the SPARTAN 3 and VIRTEX 4 families. Implementing such designs in VHDL also makes the behaviour of the design easier to understand. Using these ideas, various computationally intensive processes can reduce their time consumption effectively.

## REFERENCES

- [1] Kung, H.T. and Lehman, P.L., "Systolic (VLSI) arrays for relational database operations," Proc. ACM/SIGMOD International Conference on Management of Data (eds Chen, P.P. and Sprowls, R.C.), pp. 105-116, Santa Monica, Ca., USA, 1980.
- [2] Quinton, P. and Robert, Y., Systolic algorithms & architectures, Prentice Hall Masson, pp. 41-44, 1991.
- [3] Kung, H.T. and Leiserson, C.E., "Systolic arrays for VLSI," Section 8.3 in Introduction to VLSI Systems, by C. Mead and L. Conway, Addison-Wesley Pub. Co., 1981.
- [4] H.T. Kung, "Why systolic architectures?", IEEE Computer, vol. 15, p. 37, Jan. 1982.
- [5] Feifei Dong, Sihan Zhang and Cheng Chen, "Improved Design and Analysis of Parallel Matrix Multiplication on Systolic Array Matrix", IEEE, 2009.
- [6] Ganapathi Hegde, Cyril Prasanna Raj P and P.R. Vaya, "Implementation of Systolic Array Architecture for Full Search Block Matching Algorithm on FPGA", European Journal of Scientific Research, vol. 33, no. 4 (2009), pp. 606-616.
- [7] Sushma R. Huddar, Sudhir Rao Rupanagudi, Kalpana M and Surabhi Mohan, "Novel High-Speed Vedic Mathematics Multiplier using Compressors", International Multi conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 22-23 March 2013, Kottayam, ISBN: 978-1-4673-5090-7/13, pp. 465-469.
- [8] Soma Bhanu Tej, "Vedic Algorithms to develop green chips for future", International Journal of Systems, Algorithms & Applications, Volume 2, Issue ICAEM12, February 2012, ISSN Online: 2277-2677.
- [9] Gaurav Sharma, Arjun Singh Chauhan, Himanshu Joshi and Satish Kumar Alaria, "Delay Comparison of 4 by 4 Vedic Multiplier based on Different Adder Architectures using VHDL", International Journal of IT, Engineering and Applied Sciences Research (IJIEASR), ISSN: 2319-4413, Volume 2, No. 6, June 2013, pp. 28-32.
- [10] Rakshith T R and Rakshith Saligram, "Design of High-Speed Low Power Multiplier using Reversible logic: a Vedic Mathematical Approach", International Conference on Circuits, Power and Computing Technologies (ICCPCT-2013), ISBN: 978-1-4673-4922-2/13, pp. 775-781.
- [11] M.E. Paramasivam and Dr. R.S. Sabeenian, "An Efficient Bit Reduction Binary Multiplication Algorithm using Vedic Methods", IEEE 2nd International Advance Computing Conference, 2010, ISBN: 978-1-4244-4791-6/10, pp. 25-28.
- [12] Haroon Waris, Chenghua Wang, Weiqiang Liu and Fabrizio Lombardi, "AxSA: On the Design of High-Performance and Power-Efficient Approximate Systolic Arrays for Matrix Multiplication", Journal of Signal Processing Systems, 2020, DOI: 10.1007/s11265-020-01582-7.