Carry Lookahead Implementation

Carry Lookahead Implementation

Shjanzey
I went ahead and implemented a 4-bit carry lookahead adder and I wanted to know if I am on the right track.  I tried to include helpful comments, for example the Boolean calculation for each carry bit.  The last calculation is quite verbose, and I find it hard to understand why this would be quicker than the ripple chip, which contains "less code".

First I created a chip called a partial full adder (PFA) that takes inputs a and b and a carry bit c.  It outputs a sum bit, a generate bit, and a propagate bit.

CHIP PFA {
    IN a, b, c;
    OUT s, g, p;

    PARTS:
    //Calculate the Generate Bit
    And(a=a, b=b, out=g);

    //Calculate the Propagation Bit
    Or(a=a, b=b, out=p);

    //Calculate the Sum-Bit
    Xor(a=a, b=c, out=s1);
    Xor(a=s1, b=b, out=s);        

}
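As a quick sanity check, the PFA's three outputs can be modeled in a few lines of Python (the helper name `pfa` is mine, purely for illustration) and verified exhaustively against an ordinary full adder:

```python
# Bit-level model of the PFA above: g = a AND b, p = a OR b, s = a XOR b XOR c.
def pfa(a, b, c):
    g = a & b        # generate: this stage produces a carry regardless of c
    p = a | b        # propagate: this stage passes an incoming carry along
    s = (a ^ c) ^ b  # sum bit, same gate ordering as the HDL above
    return s, g, p

# Exhaustive check over all 8 input combinations: the sum bit must match a
# full adder, and g + p*c must equal the full adder's carry-out.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, g, p = pfa(a, b, c)
            total = a + b + c
            assert s == total & 1
            assert (g | (p & c)) == total >> 1
```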

Next I created the 4-bit lookahead adder, which I called "Add4" for simplicity's sake.  Maybe I should rename it to Add4CLA.  After each partial full add, I went ahead and calculated the next carry bit, including the Boolean logic in a comment for your convenience.  It is that step that truly baffles me.  I would love some further explanation as to why, although this seems to involve more calculation and more gates, the carries should be generated in constant time (rather than rippling through linearly as in the ripple design).

CHIP Add4 {
    IN a[4], b[4], c0;
    OUT out[4], carry;

    PARTS:

    PFA(a=a[0], b=b[0], c=c0, s=out[0], g=g0, p=p0);

    //C1 = g0+p0c0
    And(a=p0, b=c0, out=pand1);
    Or(a=pand1, b=g0, out=c1);

    PFA(a=a[1], b=b[1], c=c1, s=out[1], g=g1, p=p1);
   
    //C2 = g1 + p1g0 + p1p0c0
    And(a=p1, b=pand1, out=pand2);
    And(a=p1, b=g0, out=gand1);
    Or(a=g1, b=pand2, out=or1);
    Or(a=or1, b=gand1, out=c2);
   

    PFA(a=a[2], b=b[2], c=c2, s=out[2], g=g2, p=p2);
     
    //C3 = g2 + p2g1 + p2p1g0 + p2p1p0c0    
    And(a=p2, b=pand2, out=pand3);
    And(a=p2, b=gand1, out=gand2);
    And(a=p2, b=g1, out=ggand1);
    Or(a=pand3, b=gand2, out=or2);
    Or(a=or2, b=ggand1, out=or3);
    Or(a=or3, b=g2, out=c3);
   
    PFA(a=a[3], b=b[3], c=c3, s=out[3], g=g3, p=p3);

    //c4 = g3 + p3g2 + p2p3g1 + p1p2p3g0 + p0p1p2p3c0
    And(a=p3, b=pand3, out=pand4);
    And(a=p3, b=gand2, out=gand3);
    And(a=p3, b=ggand1, out=ggand2);
    And(a=p3, b=g2, out=ppand1);
    Or(a=g3, b=ppand1, out=or4);
    Or(a=or4, b=ggand2, out=or5);
    Or(a=or5, b=gand3, out=or6);
    Or(a=or6, b=pand4, out=carry);

}
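As a sanity check on the commented carry equations, here is a bit-level Python model (the function name is mine, purely for illustration).  The key observation it demonstrates: each Ci depends only on the g/p bits and c0, never on an earlier carry, which is why the hardware can compute all four in parallel at constant gate depth.

```python
# Direct evaluation of the commented carry equations C1..C4.
def lookahead_carries(g, p, c0):
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return c1, c2, c3, c4

# Cross-check against the serial (ripple) recurrence c_{i+1} = g_i + p_i*c_i
# over every pair of 4-bit inputs and both carry-in values.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            g = [(a >> i) & (b >> i) & 1 for i in range(4)]
            p = [((a >> i) | (b >> i)) & 1 for i in range(4)]
            c, ripple = c0, []
            for i in range(4):
                c = g[i] | (p[i] & c)
                ripple.append(c)
            assert list(lookahead_carries(g, p, c0)) == ripple
```

The software model of course still runs serially; the point is only that the expanded equations are equivalent to the ripple recurrence while having no stage-to-stage dependency.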


Finally I have my Add16 chip.  Since it is difficult to string long chains of carry lookahead together, I am wondering if the speedup will be found not in the number of adders strung together, but rather will be realized once thousands of iterations have been run over each adder type (e.g., ripple versus carry lookahead).  Your thoughts are appreciated.

Note: I left my old implementation commented out in the Add16 for comparison's sake.

CHIP Add16 {
    IN a[16], b[16];
    OUT out[16];

    PARTS:
    // Second version: 4-bit CLA blocks with ripple carry between them

    Add4(a=a[0..3], b=b[0..3], c0=false, out=out[0..3], carry=c1);
    Add4(a=a[4..7], b=b[4..7], c0=c1, out=out[4..7], carry=c2);
    Add4(a=a[8..11], b=b[8..11], c0=c2, out=out[8..11], carry=c3);
    Add4(a=a[12..15], b=b[12..15], c0=c3, out=out[12..15], carry=c4);


    //First Version with ripple carry
/*    HalfAdder(a=a[0], b=b[0], sum=out[0], carry=c1);
    FullAdder(a=a[1], b=b[1], c=c1, sum=out[1], carry=c2);
    FullAdder(a=a[2], b=b[2], c=c2, sum=out[2], carry=c3);
    FullAdder(a=a[3], b=b[3], c=c3, sum=out[3], carry=c4);
    FullAdder(a=a[4], b=b[4], c=c4, sum=out[4], carry=c5);
    FullAdder(a=a[5], b=b[5], c=c5, sum=out[5], carry=c6);
    FullAdder(a=a[6], b=b[6], c=c6, sum=out[6], carry=c7);
    FullAdder(a=a[7], b=b[7], c=c7, sum=out[7], carry=c8);
    FullAdder(a=a[8], b=b[8], c=c8, sum=out[8], carry=c9);
    FullAdder(a=a[9], b=b[9], c=c9, sum=out[9], carry=c10);
    FullAdder(a=a[10], b=b[10], c=c10, sum=out[10], carry=c11);
    FullAdder(a=a[11], b=b[11], c=c11, sum=out[11], carry=c12);
    FullAdder(a=a[12], b=b[12], c=c12, sum=out[12], carry=c13);
    FullAdder(a=a[13], b=b[13], c=c13, sum=out[13], carry=c14);
    FullAdder(a=a[14], b=b[14], c=c14, sum=out[14], carry=c15);
    FullAdder(a=a[15], b=b[15], c=c15, sum=out[15], carry=c16);*/

}
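As a functional sanity check, the hybrid scheme above (4-bit lookahead blocks with ripple carry between the blocks) can be modeled in software.  The function names below are mine, purely for illustration, and the software model necessarily evaluates everything serially even though the hardware would not:

```python
import random

# Model of one 4-bit CLA block: returns the 4-bit sum and the block carry-out.
def add4_cla(a, b, c0):
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]
    c, s = c0, 0
    for i in range(4):
        s |= (((a >> i) ^ (b >> i) ^ c) & 1) << i  # sum bit uses incoming carry
        c = g[i] | (p[i] & c)                      # then compute the next carry
    return s, c

# Model of the hybrid Add16: four 4-bit blocks chained by ripple carry.
def add16(a, b):
    out, c = 0, 0
    for blk in range(4):
        s, c = add4_cla((a >> (4 * blk)) & 0xF, (b >> (4 * blk)) & 0xF, c)
        out |= s << (4 * blk)
    return out

# Spot-check against ordinary integer addition modulo 2^16.
for _ in range(1000):
    a, b = random.getrandbits(16), random.getrandbits(16)
    assert add16(a, b) == (a + b) & 0xFFFF
```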




Re: Carry Lookahead Implementation

cadet1620
Administrator
The speed improvement in the Carry Lookahead Adder comes into play as the adder gets wider.

For a Ripple Carry Adder, the worst case delay for an N-bit adder is from CIN to SumN through all the bit-to-bit carries. The delay time is k1*N for some constant k1.

For a CLA Adder, the worst case delay path gets from CIN to SumN by going up the tree of Lookahead Generators and back down to the SumN adder. The delay time is k2*log2(N).

k2 is typically larger than k1, so there is a tradeoff point between the two. Often, the smallest building blocks of the CLA adder, usually 4 bits wide, are implemented with ripple carry add and parallel G, P computation.
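To make the tradeoff concrete, here is a toy comparison of the two delay formulas. The k values below are made up purely for illustration; real constants depend on the technology and the block design:

```python
import math

# Illustrative constants only: k1 = per-bit ripple delay, k2 = per-level CLA delay.
k1, k2 = 1.0, 3.0

# With these numbers the ripple adder wins at small widths and the CLA adder
# wins as N grows, since k1*N eventually overtakes k2*log2(N).
for n in (4, 8, 16, 32, 64):
    ripple = k1 * n
    cla = k2 * math.log2(n)
    print(f"N={n:3d}  ripple={ripple:6.1f}  CLA={cla:6.1f}")
```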

--Mark


Re: Carry Lookahead Implementation

Shjanzey
OK, this makes a lot more sense.  Thank you for that clarification.

Oh, by the way, I just realized I can simplify the Add4 program above drastically.  I worked some more on the algebra and then laughed at what I was doing: I was actually recalculating previous carry pins when they are already available.  With your reply and my corrected math I can now understand the elegance of this method.  I'll post an updated version.

CHIP Add4 {
    IN a[4], b[4], c0;
    OUT out[4], carry;

    PARTS:

    PFA(a=a[0], b=b[0], c=c0, s=out[0], g=g0, p=p0);

    //C1 = g0+p0c0
    And(a=p0, b=c0, out=pand1);
    Or(a=pand1, b=g0, out=c1);

    PFA(a=a[1], b=b[1], c=c1, s=out[1], g=g1, p=p1);
   
    //C2 = g1 + p1g0 + p1p0c0 --> g1 + p1(g0+p0c0) --> g1 + p1c1
       
    And(a=p1, b=c1, out=pand2);
    Or(a=g1, b=pand2, out=c2);
   

    PFA(a=a[2], b=b[2], c=c2, s=out[2], g=g2, p=p2);
     
    //C3 = g2 + p2g1 + p2p1g0 + p2p1p0c0 --> g2+p2(g1+p1(g0+p0c0)) --> g2+p2c2    
   
    And(a=p2, b=c2, out=pand3);
    Or(a=g2, b=pand3, out=c3);
   
    PFA(a=a[3], b=b[3], c=c3, s=out[3], g=g3, p=p3);

    //c4 = g3 + p3g2 + p2p3g1 + p1p2p3g0 + p0p1p2p3c0 --> g3 + p3c3
   
    And(a=p3, b=c3, out=pand4);
    Or(a=g3, b=pand4, out=carry);

}
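The factored form ci+1 = gi + pi*ci used here is algebraically identical to the fully expanded sums; here is a quick exhaustive check in Python (function names are mine, for illustration).  Note the tradeoff: the factored form chains each carry through the previous one, so within the block this is effectively the ripple-carry-with-parallel-G/P arrangement Mark described above.

```python
# The factored recurrence c_{i+1} = g_i + p_i*c_i from the simplified Add4.
def factored_c4(g, p, c0):
    c = c0
    for i in range(4):
        c = g[i] | (p[i] & c)
    return c

# The fully expanded C4 sum-of-products from the comment in the HDL.
def expanded_c4(g, p, c0):
    return (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c0))

# Exhaustive check over all 9 input bits (g0..g3, p0..p3, c0).
for bits in range(512):
    g = [(bits >> i) & 1 for i in range(4)]
    p = [(bits >> (4 + i)) & 1 for i in range(4)]
    c0 = (bits >> 8) & 1
    assert factored_c4(g, p, c0) == expanded_c4(g, p, c0)
```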

Re: Carry Lookahead Implementation

cadet1620
Administrator
In reply to this post by Shjanzey
If you are interested in other types of adder optimization, check out this post on Carry Select Adders

--Mark

Re: Carry Lookahead Implementation

Jaytron
Has anyone had any success implementing a full 16 bit carry lookahead adder?  I've successfully implemented the 4-bit carry lookahead adder in .hdl code as visualized here: http://en.wikipedia.org/wiki/Lookahead_Carry_Unit  but when I attempt to scale the chip up to a 16-bit lookahead carry unit the Nand2Tetris hardware simulator gives me the error that I have a "circle in my parts connection".

It's basically the group propagate and group generate outputs from four 4-bit carry lookahead chips piped into a 5th carry lookahead chip (therefore accepting 16 bits as input).  I've narrowed down the error to the carry bits from the 5th carry lookahead chip, since they are fed in to the four upper level carry lookahead chips.

I'm at a bit of a loss, since I don't believe that my logic diagrams or implementation are wrong (I've built and rebuilt it in its entirety twice).

Does anyone have any experience or suggestions for this?


Thanks in advance!

Re: Carry Lookahead Implementation

cadet1620
Administrator
Jaytron wrote
Has anyone had any success implementing a full 16 bit carry lookahead adder?  I've successfully implemented the 4-bit carry lookahead adder in .hdl code as visualized here: http://en.wikipedia.org/wiki/Lookahead_Carry_Unit  but when I attempt to scale the chip up to a 16-bit lookahead carry unit the Nand2Tetris hardware simulator gives me the error that I have a "circle in my parts connection".
I'm assuming that you are trying to implement the second figure on the Wiki page you referenced.

You are running into a limitation of the Hardware Simulator. It sees that there are outputs p4 and g4 from the 4-bit CLA adder that connect to the 16-bit CLA chip, and the C4 output from the 16-bit CLA that connects back to the same 4-bit CLA adder. It doesn't analyze the internals of the CLA adder so it doesn't know that the C4 input doesn't affect the p4 and g4 outputs.

A quick and dirty kludge is to use two CLA adders for each 4-bit group. One to generate the p and g and the other to add A, B and C. From my CLAadd16.hdl:
    ClaAdd4(a=a[0..3],   b=b[0..3],   c=c,  sum=out[0..3]);
    ClaAdd4(a=a[0..3],   b=b[0..3],                         g=g0, p=p0);
(There's a comment in my HDL explaining this kludge. The comment is longer than the implementation!)

If you want to explore more complicated logic I highly recommend Logisim, a fairly powerful schematic-based logic simulator.

--Mark

Re: Carry Lookahead Implementation

Jaytron
Thanks for the advice! I'm not sure why it didn't occur to me before, but I never tried to make a modified 1-bit full adder to handle the propagate/generate bits internally and then plug them directly into the ClaAdd4 chip.

As it stands right now, my 4-bit CLA chip accepts only a C0 (carry-in bit) and two 4-bit numbers to add.  It outputs the 4-bit sum, the C4 bit to pass to the next group, and the GP and GG (group propagate and group generate) bits.

I just scaled it up to a 16-bit chip and it works!  Working on the test scripts for it now.

I really appreciate the help and I'm excited to check out Logisim.


-Jason