A few years back I wrote the VHDL code for a 4 bit Wallace tree multiplier. In this post I want to convert that VHDL into Verilog. A Wallace tree multiplier is generally faster than a simple array multiplier, because the partial products are reduced in parallel by layers of half and full adders instead of being summed row by row.
The design uses the half adder and full adder Verilog modules I implemented a few weeks back. These modules are instantiated to build the 4 bit Wallace multiplier.
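The adder modules themselves are not repeated in this post. For reference, here is a minimal sketch of what they might look like. The port order (a, b, sum, carry for the half adder and a, b, cin, sum, carry for the full adder) is an assumption inferred from the positional instantiations in the multiplier below, so check it against your own adder modules before using it.

```verilog
// Sketch only: port names and order are assumed, not taken from the original adder posts.
// Half adder: adds two bits, giving a sum bit and a carry bit.
module half_adder(a, b, sum, carry);
    input a, b;
    output sum, carry;
    assign sum   = a ^ b;
    assign carry = a & b;
endmodule

// Full adder: adds two bits plus a carry in.
module full_adder(a, b, cin, sum, carry);
    input a, b, cin;
    output sum, carry;
    assign sum   = a ^ b ^ cin;
    assign carry = (a & b) | (b & cin) | (a & cin);
endmodule
```

Because the multiplier connects these modules by position, swapping the sum and carry ports here would silently break every stage of the tree.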
4 bit Wallace tree multiplier:
module wallace(A,B,prod);
//inputs and outputs
input [3:0] A,B;
output [7:0] prod;
//internal variables.
wire s11,s12,s13,s14,s15,s22,s23,s24,s25,s26,s32,s34,s35,s36,s37;
wire c11,c12,c13,c14,c15,c22,c23,c24,c25,c26,c32,c34,c35,c36,c37;
//partial products: pN = A AND B[N], each 4 bits wide.
wire [3:0] p0,p1,p2,p3;
//generate the partial products.
assign p0 = A & {4{B[0]}};
assign p1 = A & {4{B[1]}};
assign p2 = A & {4{B[2]}};
assign p3 = A & {4{B[3]}};
//final product assignments
assign prod[0] = p0[0];
assign prod[1] = s11;
assign prod[2] = s22;
assign prod[3] = s32;
assign prod[4] = s34;
assign prod[5] = s35;
assign prod[6] = s36;
assign prod[7] = s37;
//first stage
half_adder ha11 (p0[1],p1[0],s11,c11);
full_adder fa12(p0[2],p1[1],p2[0],s12,c12);
full_adder fa13(p0[3],p1[2],p2[1],s13,c13);
full_adder fa14(p1[3],p2[2],p3[1],s14,c14);
half_adder ha15(p2[3],p3[2],s15,c15);
//second stage (fa24 takes c32, the carry out of ha32 in the third stage)
half_adder ha22 (c11,s12,s22,c22);
full_adder fa23 (p3[0],c12,s13,s23,c23);
full_adder fa24 (c13,c32,s14,s24,c24);
full_adder fa25 (c14,c24,s15,s25,c25);
full_adder fa26 (c15,c25,p3[3],s26,c26);
//third stage
half_adder ha32(c22,s23,s32,c32);
half_adder ha34(c23,s24,s34,c34);
half_adder ha35(c34,s25,s35,c35);
half_adder ha36(c35,s26,s36,c36);
half_adder ha37(c36,c26,s37,c37);
endmodule
Simulation waveform:
The code was simulated using Xilinx ISE 13.1 and its functionality was verified. A part of the waveform is pasted below:
Testbench code:
The testbench code checks the correctness of results for the whole range of inputs A and B.
module tb;
// Inputs
reg [3:0] A;
reg [3:0] B;
// Outputs
wire [7:0] prod;
integer i,j,error;
// Instantiate the Unit Under Test (UUT)
wallace uut (
.A(A),
.B(B),
.prod(prod)
);
initial begin
// Apply inputs for the whole range of A and B.
// 16*16 = 256 inputs.
error = 0;
for(i=0;i <=15;i = i+1)
for(j=0;j <=15;j = j+1)
begin
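A = i; //blocking assignments, so A and B are updated before the delay.
B = j;
#1; //let the combinational outputs settle.
//"!==" also flags x or z bits in prod, which "!=" would silently pass.
if(prod !== A*B) //if the result isn't correct increment "error".
error = error + 1;
end
//report the outcome, otherwise "error" is only visible in the waveform.
$display("Test finished with %0d errors.", error);
$finish;
end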
endmodule
Synthesis Results:
The design was successfully synthesised for a Virtex 4 FPGA, and a maximum combinational path delay of 8.652 ns was obtained.
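To put that delay figure in context, one option is to synthesise a plain behavioural multiplier for the same device and compare the reported path delays. The module below is only a sketch of such a baseline. The name mult_behav is made up for this example, and no timing numbers for it are claimed here.

```verilog
// Hypothetical baseline for comparison (not part of the original design).
// The synthesis tool is free to choose the multiplier structure.
module mult_behav(A, B, prod);
    input  [3:0] A, B;
    output [7:0] prod;
    assign prod = A * B; // 4 bit x 4 bit gives an 8 bit product
endmodule
```

Since it has the same ports as the wallace module, it can be dropped into the same testbench by changing only the instantiated module name. Keep in mind that a synthesis tool may map the "*" operator to dedicated multiplier hardware, in which case the comparison is no longer between two LUT based designs.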