
        IF   ID   EX   ME   WB
and          IF   ID   EX   ME   WB
or                IF   ID   EX   ME   WB

A pipeline with this structure normally needs three ALU bypass registers, to record the ALU outputs for the three previous cycles.

However, only two are needed if WB is complete in the first half of its cycle, and ID starts register fetch during the second half.

The bypass registers should also be available as inputs to other units, e.g. the memory interface (for stores).

NOTE From now on we will assume that WB is complete in the first half of its cycle, and ID starts register fetch during the second half.

Suppose the instructions "add r1,r2,r3; sub r4,r1,r5" were followed by "sw 0(r1),r4". The EX stage of this store instruction, which takes place during the fifth cycle, can take its input register, r1, out of the second bypass register.

During the fifth cycle, the value in the second bypass register (the new value of r1) is first written to the register file. Late in the fifth cycle that value is then replaced by the value previously in the first bypass register (the new value of r4).

The ME stage of the store instruction, which takes place during the sixth cycle, can therefore also take its input register, r4, out of the second bypass register.
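The choice between bypass registers and the register file for an operand consumed in EX can be sketched as a lookup on cycle distance. This is only an illustration, assuming the convention from the NOTE above (WB completes in the first half of its cycle, ID fetches registers in the second half) and an ALU instruction as the producer; the function name and the string labels are hypothetical, not DLX terminology.

```python
def alu_operand_source(producer_ex_cycle: int, consumer_ex_cycle: int) -> str:
    """Where an EX-stage operand comes from, given the cycle in which the
    producing ALU instruction was in EX and the cycle of the consumer's EX."""
    delta = consumer_ex_cycle - producer_ex_cycle
    if delta <= 0:
        raise ValueError("the value has not been produced yet")
    if delta == 1:
        return "bypass1"        # ALU output of the previous cycle
    if delta == 2:
        return "bypass2"        # ALU output of two cycles ago
    # Three or more cycles later, the producer's WB (first half-cycle) came
    # before the consumer's register fetch in ID (second half-cycle).
    return "register file"


# add is in EX during cycle 3; the store's EX is in cycle 5 and needs r1.
print(alu_operand_source(3, 5))
```

For the store example, the sketch selects the second bypass register for r1, matching the text.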

Each functional unit that produces values should have bypass registers. In the integer-only subset of DLX, the only functional unit besides the ALU is the memory interface. This needs only one bypass register. Consider the following instruction sequence, where the value fetched by the load instruction is input to the "and" and the "or" instructions:

ld      IF   ID   EX   ME   WB
sub          IF   ID   EX   ME   WB
and               IF   ID   EX   ME   WB
or                     IF   ID   EX   ME   WB

The or instruction can get the loaded value from the register file, but the and instruction must get it from the memory interface's bypass register. (If the value is also input to the sub instruction, the sub instruction must be stalled for one cycle before it can get the value from the bypass register.)

 

Types of data hazards

Two instructions (not necessarily consecutive) can give rise to three types of data hazards:

RAW: inst 1 writes, inst 2 reads. Hazard: inst 2 will see obsolete value.

WAR: inst 1 reads, inst 2 writes. Hazard: inst 1 will see prematurely written value. Can't happen without a write in an earlier stage than a read.

WAW: inst 1 writes, inst 2 writes. Hazard: inst 3 will see value written by inst 1. Can't happen without writes at two stages.

NOTE RAW, WAR and WAW stand for read after write, write after read, and write after write respectively. They give the order in which actions should happen; the hazard is that they may happen in a different order.
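The three definitions can be checked mechanically from the registers each instruction reads and writes. The following is a minimal sketch, not a description of any real hardware: the `Instr` record and `potential_hazards` function are hypothetical names, and the result only says which orderings must be preserved, since whether a given pipeline can actually violate them depends on its stage timing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    reads: frozenset
    writes: frozenset


def potential_hazards(first: Instr, second: Instr) -> set:
    """Hazard types between an earlier and a later instruction."""
    found = set()
    if first.writes & second.reads:
        found.add("RAW")   # second must read after first writes
    if first.reads & second.writes:
        found.add("WAR")   # second must write after first reads
    if first.writes & second.writes:
        found.add("WAW")   # second's write must be the one that survives
    return found


add = Instr("add r1,r2,r3", frozenset({"r2", "r3"}), frozenset({"r1"}))
sub = Instr("sub r4,r1,r5", frozenset({"r1", "r5"}), frozenset({"r4"}))
print(potential_hazards(add, sub))
```

Here sub reads the r1 that add writes, so the pair is reported as a RAW hazard.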

Note that a stage whose timing is not fixed, e.g. it can happen either during cycle 5 or cycle 7 of an instruction depending on the situation, counts as two different stages for the purposes of the slide above.

For example, in DLX with floating point, the execution stage of an FP instruction may take e.g. three cycles. By contrast, the load of an FP value uses the integer ALU for address calculation, and thus it is in the execute stage for only one cycle. Therefore e.g. an FP add followed by an FP load will cause a WAW hazard if their destination registers are the same:

fpadd   IF   ID   EX   EX   EX   ME   WB
fpload       IF   ID   EX   ME   WB

 

Effects of loads

lw      r1,32(r6)     (r1=mem[32+r6])
add     r4,r1,r7      (r4=r1+r7)

Even if the cache hits, this code will cause a RAW data hazard on r1. Since the value being loaded is produced during ME, one cycle after EX, bypassing avoids a stall only for the second instruction after the load. Bypassing is not needed for later instructions.

lw      IF   ID   EX   ME   WB
add          IF   ID   EX   ME   WB
sub               IF   ID   EX   ME   WB
and                    IF   ID   EX   ME   WB

NOTE The add instruction cannot refer to the value being loaded without causing a stall. Assuming it does not refer to it, the sub instruction can take the value out of the DMDR register, provided the DMDR register is one of the inputs of the ALU input multiplexers. The and instruction can take the value out of the register file.

 

Load stalls

lw      IF   ID   EX   ME   WB
add          IF   ID   -    EX   ME   WB
sub               IF   -    ID   EX   ME   WB
and                    -    IF   ID   EX   ME   WB

Today's optimizing compilers perform instruction scheduling, which means they rearrange sequences of instructions to make stalls (e.g. load stalls) less frequent.

A very rough guide is that about 50% of loads cause stalls; and that instruction scheduling cuts this to about 25%.
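To see what these rough percentages mean for performance, one can fold them into a CPI estimate. The sketch below assumes a one-cycle stall per stalling load and no other stall sources; the fraction of instructions that are loads (0.2 here) is a made-up illustrative value, not a figure from these notes.

```python
def cpi_with_load_stalls(load_fraction: float,
                         stalling_fraction: float,
                         stall_cycles: int = 1) -> float:
    """Ideal CPI of 1 plus the average load-stall cycles per instruction."""
    return 1.0 + load_fraction * stalling_fraction * stall_cycles


LOAD_FRACTION = 0.2  # hypothetical: one instruction in five is a load

unscheduled = cpi_with_load_stalls(LOAD_FRACTION, 0.50)
scheduled = cpi_with_load_stalls(LOAD_FRACTION, 0.25)
print(f"unscheduled CPI ~ {unscheduled:.2f}, scheduled CPI ~ {scheduled:.2f}")
```

Under that assumption, instruction scheduling halves the load-stall contribution to CPI.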

 

Pipeline interlocks

One way to detect the need for a load stall is to compare the source register numbers of this instruction with the destination register numbers of the last two instructions. If there is a match on the preceding instruction, which is a load, then stall; otherwise, use the corresponding bypass register.
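The register-number comparison can be sketched as follows. This is a simplified model under stated assumptions: every instruction has at most one destination register, "lw" is the only load opcode, and the names `Decoded` and `resolve_operand` as well as the returned labels are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Decoded(NamedTuple):
    op: str
    dest: Optional[str]
    sources: Tuple[str, ...]


def resolve_operand(reg: str,
                    prev1: Optional[Decoded],
                    prev2: Optional[Decoded]) -> str:
    """Decide how a source register is obtained.

    prev1 is the immediately preceding instruction, prev2 the one before it.
    """
    if prev1 is not None and prev1.dest == reg:
        # A load's value is not ready until after ME, so the consumer stalls.
        return "stall" if prev1.op == "lw" else "bypass1"
    if prev2 is not None and prev2.dest == reg:
        return "memory bypass" if prev2.op == "lw" else "bypass2"
    return "register file"


lw = Decoded("lw", "r1", ("r6",))
add = Decoded("add", "r4", ("r1", "r7"))
print([resolve_operand(r, lw, None) for r in add.sources])
```

For "lw r1,32(r6)" followed directly by "add r4,r1,r7", the sketch reports a stall for r1 and a plain register-file read for r7.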

Another way is to associate an interlock bit with each register. This bit is set only while the register value is unavailable, either in the register file or in a bypass register (during ME for loads). Operating on a register with a set interlock bit causes a stall.
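The interlock-bit scheme amounts to a one-bit scoreboard per register. A minimal sketch, assuming 32 registers addressed by number; the class and method names are hypothetical:

```python
class InterlockBits:
    """One bit per register; set while the register's value is unavailable."""

    def __init__(self, num_regs: int = 32) -> None:
        self.busy = [False] * num_regs

    def mark_unavailable(self, reg: int) -> None:
        # E.g. when a load targeting this register is issued.
        self.busy[reg] = True

    def mark_available(self, reg: int) -> None:
        # E.g. once the loaded value reaches a bypass register.
        self.busy[reg] = False

    def must_stall(self, sources) -> bool:
        # An instruction stalls if any source register is still interlocked.
        return any(self.busy[r] for r in sources)


bits = InterlockBits()
bits.mark_unavailable(1)            # lw r1,32(r6) is in flight
print(bits.must_stall([1, 7]))      # add r4,r1,r7 wants r1
bits.mark_available(1)
print(bits.must_stall([1, 7]))
```

Compared with comparing register numbers, this keeps the stall decision local to the register file and extends naturally to operations of varying length.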

 

Cache misses

The simplest way to handle cache misses is to freeze the entire pipeline. This is cheap, effective if the miss rate is low enough, and avoids data hazards on operands in memory.

Many new machines allow non-memory instructions to proceed when an earlier memory access misses in the cache. By completing some instructions earlier, these machines reduce the possibilities of data hazards.

Their implementation needs interlock bits and a second register write port. Interrupt handling gets quite complex.

 

Control hazards

br      IF   ID   EX   ME   WB
i2           IF   -    -    IF   ID   EX   ME   WB
i3                -    -    -    IF   ID   EX   ME

This naive pipeline scheme has a branch penalty of three cycles. If the branch frequency is 30%, then 10 instructions need 19 cycles instead of 10, halving the machine speed.

Real machines reduce the penalty by finding out early whether the branch is taken and what the target address is. In DLX, the simple branch conditions allow branches to be completed during the ID stage, reducing branch penalty to one cycle.
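The arithmetic behind these figures is simply one cycle per instruction plus the penalty for each branch. A small sketch, assuming every branch pays the full penalty and nothing else stalls (the function name is hypothetical):

```python
def total_cycles(instructions: int, branch_fraction: float, penalty: int) -> float:
    """Cycles needed when each branch adds `penalty` stall cycles."""
    return instructions * (1 + branch_fraction * penalty)


naive = total_cycles(10, 0.30, 3)   # branch resolved late: three-cycle penalty
early = total_cycles(10, 0.30, 1)   # branch resolved in ID: one-cycle penalty
print(round(naive), round(early))
```

With a 30% branch frequency, the three-cycle penalty gives the 19 cycles quoted above, while resolving branches in ID brings this down to 13.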

 

Branch prediction

Some designs try to guess which way the branch will go and avoid the penalty if they guess right.

The simplest prediction is that the branch will not be taken. If the guess is true, the pipeline flows as usual; if it isn't, we must fetch the branch target instruction and continue from there.

untaken:

br      IF   ID   EX   ME   WB
i2           IF   ID   EX   ME   WB
i3                IF   ID   EX   ME   WB

taken:

br      IF   ID   EX   ME   WB
i2           IF   IF   ID   EX   ME   WB
i3                -    IF   ID   EX   ME

 

Delayed jump/branch

An alternative approach is to execute the instruction following the branch instruction regardless of the branch outcome, and make the compiler responsible for making sure that this instruction is useful (or at least not harmful) in both paths.

branch  IF   ID   EX   ME   WB
d-slot       IF   ID   EX   ME   WB
target            IF   ID   EX   ME

One compiler can fill the delay slot with an always-useful instruction about 50% of the time, and with an instruction that is at least sometimes useful about 60% of the time.

NOTE This means that 50% of the time, the result of the delay slot instruction is useful regardless of which way the branch goes, while another 10% of the time, the result of the delay slot instruction is useful if the branch goes one way and not useful (but also not harmful) if it goes the other way. (An instruction can be harmful e.g. by overwriting a value that will be needed later, or by causing an unnecessary exception such as divide by zero.)

 

Branch likely

Most RISCs have simple delayed branches, in which the delay slot is executed whether the branch is taken or not. These require the compiler to fill the delay slots, possibly with no-ops.

Some RISCs also have branch likely instructions (also called annulled or squashed branches). These contain a prediction of whether the branch will be taken or not, and the delay slot is executed only if the prediction turns out to be correct.

The prediction can be an explicit bit, or it may be implicit in the sign of the displacement (backward => taken).

NOTE Annulling an integer operation is quite simple: it only requires inhibiting the writeback to the register file. Annulling a memory operation or a branch is somewhat more complex: not only must one prevent them from affecting any registers or memory locations, one must also prevent them from causing any exceptions.
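The branch-likely rule from this section, with the prediction implicit in the sign of the displacement, can be written down in a few lines. This is an illustrative sketch only; the function names are hypothetical and no particular instruction set's encoding is implied.

```python
def predicted_taken(displacement: int) -> bool:
    """Implicit prediction: backward branches (typically loops) are taken."""
    return displacement < 0


def delay_slot_executes(displacement: int, actually_taken: bool) -> bool:
    """The delay slot completes only if the prediction was correct;
    otherwise it is annulled."""
    return predicted_taken(displacement) == actually_taken


# A loop-closing branch with a backward displacement of 16 bytes:
print(delay_slot_executes(-16, actually_taken=True))    # slot executes
print(delay_slot_executes(-16, actually_taken=False))   # slot is annulled
```

This lets the compiler fill the slot of a loop-closing branch with an instruction from the top of the loop, since that instruction is annulled when the loop exits.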

 

Multiple functional units

Most modern processors have separate hardware for FP operations, the usual set being an FP adder, an FP multiplier, and an FP divider. After the instruction is decoded, it is issued to the appropriate functional unit for execution.

NOTE In DLX, the ID stage must fetch the input registers from both the general purpose register file and from the file of floating point registers.

The WB stage must write the result to the appropriate register file. This will be the floating point register file for instructions that go through the floating point functional units and for instructions that load a floating point value.

Instructions that store a floating point value take one input from the general purpose register file (the base address) and one input from the floating point register file (the value to be stored).

On some machines, the integer multiply and divide instructions use the FP multiplier and FP divider functional units respectively.

Other types of functional units also exist. A shifter is sometimes viewed as one. In some machines (e.g. those by DEC) memory operations have their own functional unit with its own adder; this allows the ME stage to be deleted from normal integer instructions.

 

Multi-cycle operations

Some functional units need more than one cycle in the EX stage, e.g. floating point arithmetic and integer multiply and divide.

Multi-cycle functional units may or may not be pipelined, i.e. able to accept a new operation each cycle; if not, two consecutive operations on the same unit (e.g. two divisions) constitute a structural hazard. Writebacks can also present structural hazards.

i1      IF   ID   EX   EX   EX   ME   WB
i2           IF   ID   EX   ME   WB
i3                IF   ID   EX   ME   WB

NOTE This diagram assumes that instructions i2 and i3 use functional units different from the one used by i1. This is very likely to be true anyway since i2 and i3 spend only 1 cycle in the execution stage whereas i1 spends 3 cycles there, and in most machines a given functional unit will always take the same number of cycles.

If i1 is a floating point operation and i2 and i3 are integer operations, there will be no structural hazard at writeback, since the two writebacks in the same cycle go to different register files. If instead i1 is e.g. an integer multiply, or i3 is a floating-point load, the availability of only one write port on each register file will cause a structural hazard, and the WB stage of i3 will have to be stalled for one cycle. If the register files have two or more write ports, the structural hazard will not occur, but the hardware will have to know how to handle the case where i1 and i3 write to the same register. (It has to make sure that the final result reflects the write by i3.)
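The writeback conflict can be found by computing each instruction's WB cycle and grouping by register file. The sketch below assumes the stage sequence used in the diagram (IF, ID, one or more EX cycles, ME, WB) with no other stalls; the tuple layout and function names are hypothetical.

```python
from collections import defaultdict


def writeback_cycle(fetch_cycle: int, ex_cycles: int) -> int:
    """Cycle of WB for IF at fetch_cycle, then ID, ex_cycles of EX, ME, WB."""
    return fetch_cycle + 1 + ex_cycles + 1 + 1


def writeback_conflicts(instrs, write_ports: int = 1) -> dict:
    """instrs: iterable of (name, fetch_cycle, ex_cycles, register_file).

    Returns the (cycle, register_file) slots that need more write ports
    than are available, with the instructions competing for them.
    """
    by_slot = defaultdict(list)
    for name, fetch, ex, regfile in instrs:
        by_slot[(writeback_cycle(fetch, ex), regfile)].append(name)
    return {slot: names for slot, names in by_slot.items()
            if len(names) > write_ports}


# i1 as a three-cycle integer multiply, i2 and i3 as ordinary integer ops.
program = [("i1", 1, 3, "int"), ("i2", 2, 1, "int"), ("i3", 3, 1, "int")]
print(writeback_conflicts(program))
```

With i1 as an integer multiply, i1 and i3 both reach WB in cycle 7 and compete for the integer register file's single write port; if i1 writes the floating point file instead, no conflict is reported.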

 

Out of order completion

divf  f0,f2,f4
subf  f0,f8,f10
addf  f2,f12,f14

With typical latencies, the subf will complete before the divf. This out of order completion would be OK except that the destination registers match; this constitutes a WAW data hazard.
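A minimal sketch of how such a WAW hazard could be detected is given below. It assumes that one instruction is issued per cycle in program order and that an instruction completes `latency` cycles after issue; the latency table reuses the Pentium figures quoted in these notes purely as illustrative values, and the function name is hypothetical.

```python
# Assumed EX latencies in cycles, used only for illustration.
LATENCY = {"divf": 39, "subf": 3, "addf": 3}

# (opcode, destination, source 1, source 2), in program order.
PROGRAM = [
    ("divf", "f0", "f2", "f4"),
    ("subf", "f0", "f8", "f10"),
    ("addf", "f2", "f12", "f14"),
]


def waw_hazards(program, latency):
    """Return (earlier index, later index, register) for each WAW hazard.

    A hazard is reported when a later instruction writes the same
    destination register and completes no later than the earlier one.
    """
    hazards = []
    for i in range(len(program)):
        done_i = i + latency[program[i][0]]
        for j in range(i + 1, len(program)):
            done_j = j + latency[program[j][0]]
            if program[i][1] == program[j][1] and done_j <= done_i:
                hazards.append((i, j, program[i][1]))
    return hazards


print(waw_hazards(PROGRAM, LATENCY))   # [(0, 1, 'f0')]
```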

The designer must take care that instructions do not overwrite data needed to handle exceptions detected late in a previous long-running instruction (e.g. addf and divf).

NOTE On the Pentium, the floating point adder can perform an addition or subtraction in three cycles and can accept a new operation every cycle. Therefore the adder's latency is three cycles and its throughput is one operation per cycle.

The Pentium's FP multiplier has a latency of three cycles but it can accept new operations only every second cycle, so its throughput is one operation per two cycles.

The Pentium's FP divider has a latency of 39 cycles and cannot accept new operations while one is in progress, so its throughput is one operation per 39 cycles.

These are fairly typical numbers for double precision FP operations, except that most workstations can start an FP multiply every cycle.
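The difference between latency and throughput becomes visible when several independent operations are sent to the same unit. The sketch below uses the Pentium figures quoted above; the function name and the assumption that all operations are independent (so that none waits for the result of another) are mine.

```python
def cycles_for(n_ops, latency, issue_interval):
    """Cycles needed to finish n_ops independent operations on one unit.

    The first result appears after `latency` cycles; each further
    operation can start `issue_interval` cycles after the previous one.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


# (latency, cycles between successive operations) from the notes above.
PENTIUM_FP = {"add": (3, 1), "multiply": (3, 2), "divide": (39, 39)}

for name, (latency, interval) in PENTIUM_FP.items():
    print(name, cycles_for(10, latency, interval))
# add 12
# multiply 21
# divide 390
```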

 

Superscalar processors

The best pipelined processors have a CPI of around 1.25. To reduce CPI below this, a processor must be superscalar, i.e. capable of issuing more than one instruction per clock cycle.

Superscalar processors require more instruction bandwidth and better branch handling than nonsuperscalar ones. A 2-way superscalar pipeline:

i1   IF  ID  EX  ME  WB
i2   IF  ID  EX  ME  WB
i3       IF  ID  EX  ME  WB
i4       IF  ID  -   EX  ME  WB
i5           IF  ID  EX  ME  WB
i6           IF  -   ID  EX  ME  WB

NOTE i3 here represents an instruction that could not be issued together with instruction i4. If the design has room for only two decoded instructions, these will have to be i4 and i5 at the end of the fourth cycle, and so the decoding of i6 must be delayed.

 

Issue restrictions

Due to structural and other hazards, superscalar chips can issue only certain mixes of instructions together. If the next two instructions cannot be issued together, the processor issues only the first one.

The simplest scheme to implement, with least extra hardware, is "one integer, one FP", as e.g. in the Intel i960CA and the DEC Alpha 21064.

The IBM POWER processors can issue four instructions in one cycle, provided that these are one integer, one FP, one comparison and one branch, since there is only one of each functional unit.

NOTE The "one integer, one FP" approach obviously cannot yield speedups for integer-only code. Even on numerical programs, the Alpha 21064 can issue two instructions only about 30% of the time. Partly this is because once it has fetched an instruction pair, it must issue both instructions before it can fetch another pair. Therefore if the two instructions fetched together cannot be issued together, the second must be issued alone, even if it could otherwise be issued together with the first instruction of the next pair.

Since comparisons and branches each account for significantly less than 33% of all integer instructions, the POWER processors very rarely actually do issue three integer instructions in a cycle.
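The cost of the pair-at-a-time rule can be illustrated with a toy model. The sketch below is a deliberate simplification, not a description of the real 21064: it assumes that the only restriction is "one integer (I), one FP (F)", and it compares fetching fixed pairs with a hypothetical design that may pair any two adjacent instructions. All names are assumptions.

```python
def can_pair(a, b):
    """'One integer, one FP': the two instructions must differ in kind."""
    return {a, b} == {"I", "F"}


def cycles_fixed_pairs(stream):
    """Both instructions of a fetched pair must issue before the next fetch."""
    cycles = 0
    for k in range(0, len(stream), 2):
        pair = stream[k:k + 2]
        if len(pair) == 2 and can_pair(pair[0], pair[1]):
            cycles += 1           # dual issue
        else:
            cycles += len(pair)   # one instruction per cycle
    return cycles


def cycles_sliding(stream):
    """Hypothetical design: any two adjacent instructions may be paired."""
    cycles = 0
    k = 0
    while k < len(stream):
        if k + 1 < len(stream) and can_pair(stream[k], stream[k + 1]):
            k += 2
        else:
            k += 1
        cycles += 1
    return cycles


stream = "IIFIFF"
print(cycles_fixed_pairs(stream), cycles_sliding(stream))   # 5 4
```

In this stream the second integer instruction has to issue alone under the fixed-pair rule, although it could have been paired with the FP instruction that follows it.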

 

Duplicate functional units

Newer designs duplicate some functional units. The most important ones are a second integer ALU and a second load store unit. Without at least one of these, superscalarity really helps only FP intensive codes.

Having two copies of a functional unit allows the simultaneous execution of two independent instructions that use that unit, provided the appropriate register file has enough ports.

Designs with duplicated functional units have fewer issue restrictions and thus can get much closer to the minimum CPI.

NOTE For highly superscalar processors, one usually talks not in terms of Cycles Per Instruction, but in terms of its inverse, Instructions Per Cycle. The nonsuperscalar R3000 has a CPI of 1.25, therefore its IPC is 0.8. A superscalar processor that has a CPI of 0.85 has an IPC of 1.18.

Processors designed for number-crunching have more than one FP unit. These are often specialized, e.g. a processor with three FP units may have an adder (which also does subtraction), a multiplier and a divider. Sometimes one unit does additions and multiplications, while another one does additions, divisions and square roots. This approach exploits the frequency of additions and the rarity of divisions and square roots to yield a design which achieves a high IPC while not duplicating the expensive hardware resources needed for the efficient implementation of the more complex operations.

Since number-crunching programs contain many memory accesses and significant numbers of integer operations for housekeeping, additional load/store and integer units are useful for them as well.

 

LIWs and VLIWs

Superscalar machines evolved from (Very) Long Instruction Word machines. These always issue a fixed number (2 to 28) of operations in parallel, relying on a complex compiler to prevent hazards by inserting no-ops if necessary. The hardware can thus be simpler.

Some machines (e.g. i860) provide instructions to switch between normal scalar mode and LIW mode.

Superscalar chips can be binary compatible with older instruction sets; (V)LIW chips cannot.

NOTE Merced, the successor to the x86 architecture now being worked on by Intel and HP, will probably resemble a VLIW machine in several respects.

 

Maximum > expected

Some programs match the issue restrictions of a superscalar CPU much better than other programs, but no real program is even close to a perfect fit. Thus

  • the performance variation between two programs, and
  • the gap between max performance and average performance on real programs

are both significantly higher for superscalar machines.

For example, it is difficult to make a 4-way superscalar machine exceed about 1.5 instructions per cycle (IPC).

NOTE One of the main reasons for this is the problem of branch penalties.

 

Branch penalty

In most modern machines, pipeline disruptions are very expensive.

In the deeply pipelined R4000, I-fetch takes 3 cycles and the branch penalty is 3 cycles = 3 instructions.

In the 3-way superscalar SuperSPARC, it is 1 cycle = 3 instructions.

In the Pentium Pro, the branch penalty is at least 11 cycles, during each of which up to 3 instructions could be issued.

Techniques to reduce the effects of branch penalties are therefore very important.
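The figures above can be put on a common scale by counting lost issue slots. The sketch below does that and also shows a rough estimate of the effect on CPI; the function names are assumptions, and the branch frequency and misprediction rate passed to `effective_cpi` are made-up illustrative values, not measurements.

```python
def lost_issue_slots(penalty_cycles, issue_width):
    """Instructions that could have been issued during one branch penalty."""
    return penalty_cycles * issue_width


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average number of penalty cycles per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


print(lost_issue_slots(3, 1))    # R4000: 3
print(lost_issue_slots(1, 3))    # SuperSPARC: 3
print(lost_issue_slots(11, 3))   # Pentium Pro: 33 (at least)

# Illustrative only: 20% branches, 10% of them mispredicted, 11-cycle penalty.
print(round(effective_cpi(1.0, 0.2, 0.1, 11), 2))   # 1.22
```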

 

Dynamic branch prediction

Some high-end machines maintain a cache that records the results of recently executed branch instructions.

While the CPU decodes an instruction, it looks up its address in this branch history table. If the instruction is a branch and the history says that it is predicted to be taken, the PC for the next cycle is set to the indicated target address.

This inserts the target instruction into the pipeline right after the branch. If the prediction is incorrect, this and later instructions must be nullified.

NOTE The usual branch predictor design has a table whose size is a power of two (usually between 256 and 4096); the entry corresponding to a branch instruction is given by the instruction address modulo the table size. Each table entry has two bits, indicating one of four states: strongly taken, weakly taken, weakly not taken, strongly not taken. Every time a branch is taken, the corresponding entry is moved one state closer to the strongly taken state if it isn't already there; every time a branch is not taken, the corresponding entry is moved one state closer to the strongly not taken state if it isn't already there. A branch is predicted taken if the corresponding entry is in the strongly taken or the weakly taken state.

With this design, loop closing branches will usually stay in the strongly taken state. Although the prediction will be wrong when the loop exits, this will only move the branch history entry to the weakly taken state, which means that the branch will be predicted correctly for the first iteration of the next invocation of the loop.

You can of course get interference between two different branch instructions that map to the same branch history table entry.

This technique can accurately predict about 75% of all branches. More complex techniques that keep more history can achieve prediction rates of about 80% to 95%. Due to its very large branch penalty, the Pentium Pro uses one of the most aggressive branch prediction mechanisms available.
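The two-bit scheme described in this note can be sketched in a few lines of Python. The class and function names, the default table size of 1024 and the initial state (weakly not taken) are assumptions chosen for the example; real designs differ in these details.

```python
class TwoBitPredictor:
    """Branch history table of two-bit saturating counters.

    State 0 = strongly not taken, 1 = weakly not taken,
    state 2 = weakly taken, 3 = strongly taken.
    """

    def __init__(self, table_size=1024, initial_state=1):
        if table_size <= 0 or table_size & (table_size - 1):
            raise ValueError("table size must be a power of two")
        self.table = [initial_state] * table_size

    def _index(self, address):
        # Entry is selected by the instruction address modulo the table size.
        return address % len(self.table)

    def predict(self, address):
        """True if the branch at this address is predicted taken."""
        return self.table[self._index(address)] >= 2

    def update(self, address, taken):
        """Move the entry one state towards the actual outcome."""
        i = self._index(address)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def loop_mispredictions(predictor, address, iterations, invocations):
    """Count mispredictions of a loop-closing branch.

    The branch is taken on every iteration except the last one.
    """
    wrong = 0
    for _ in range(invocations):
        for i in range(iterations):
            taken = i < iterations - 1
            if predictor.predict(address) != taken:
                wrong += 1
            predictor.update(address, taken)
    return wrong


# A 10-iteration loop executed 3 times: one cold miss plus one miss per exit.
print(loop_mispredictions(TwoBitPredictor(), 0x400, 10, 3))   # 4
```

After the first invocation the entry never drops below the weakly taken state, so the only mispredictions are the loop exits. Two branches whose addresses differ by a multiple of the table size share an entry, which is the interference mentioned above.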

All instructions flowing through the pipeline have a flag saying whether they have been nullified or not. All logic modules that are supposed to update the state of the machine for an instruction (the WB stage for most instructions, some earlier stages for loads, stores and branches) actually perform the update only if the instruction's nullification flag is clear.

 

Aggressive approaches

Very aggressive designs will issue an instruction even when previous instructions are stalled; this is dynamic scheduling or out of order issue.

If a CPU keeps around the values of several versions of a register (e.g. before and after an update), this is register renaming.

If the CPU continues issuing instructions after a branch even before it knows whether the branch is taken or not, this is speculative execution.

NOTE Many of these very complex techniques were invented for the IBM 360/91 in the middle sixties. They were not economical then: the CDC 6600 had better performance at lower cost. However, as designers exhaust the speedups available from cheaper and simpler approaches, they increasingly have to turn to these methods. For example, IBM RS/6000s used register renaming from their introduction in 1988, and new CPUs such as MIPS R1x000, HP PA-8x00, and Intel Pentium Pro use all three techniques.

 

Decoupled execution

The new generation of microprocessors (MIPS R10000, HP PA-8000, Intel Pentium Pro) extends these techniques.

One engine predicts branches, fetches and decodes instructions, and issues them into queues on the execution units.

The decoded instructions wait in the queues until their operands become available. They are then executed, probably out of order. The results are put into a reorder buffer, which writes them back to the register file in order.
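The role of the reorder buffer can be shown with a small model. The sketch below is an illustration under simplifying assumptions (instructions are just names, and one completion event is processed at a time); the function name and data layout are hypothetical.

```python
from collections import deque


def reorder_buffer_trace(program_order, completion_order):
    """Show which instructions are written back after each completion.

    Instructions finish execution in completion_order, but the buffer
    releases them to the register file strictly in program order.
    """
    rob = deque(program_order)   # oldest instruction at the left
    finished = set()
    trace = []
    for instr in completion_order:
        finished.add(instr)
        retired = []
        # Write back from the head of the buffer while the head is done.
        while rob and rob[0] in finished:
            retired.append(rob.popleft())
        trace.append((instr, retired))
    return trace


for completed, retired in reorder_buffer_trace(
        ["i1", "i2", "i3", "i4"], ["i2", "i3", "i1", "i4"]):
    print(completed, "completes; written back:", retired)
# i2 completes; written back: []
# i3 completes; written back: []
# i1 completes; written back: ['i1', 'i2', 'i3']
# i4 completes; written back: ['i4']
```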

NOTE The aim of decoupled execution is to keep all the execution units busy all the time. In the HP PA-8000, the CPU may contain up to 56 instructions in various stages of execution.

 

Instruction translation

New x86 processors (Intel Pentium Pro, Cyrix 6x86, AMD K5/K6) decode each x86 instruction into one or more microcode-like operations. The execution units then execute these operations.

Decoding multiple x86 instructions in a cycle is difficult. The Pentium Pro uses 2.5 cycles for the task.

In the AMD K5, instructions loaded into the I-cache are predecoded; for each byte of an I-cache block, the predecode bits say whether an instruction starts or ends there, the number of ops required by it, and the location of the opcode.

NOTE A typical translation would be converting an x86 instruction such as "push register onto stack" into two operations: store register in memory, and update the stack pointer.

Intel calls the Pentium Pro's operations "uops". Other manufacturers call their similar operations "RISC ops" or "RISC86 instructions".

Predecode bits can be useful even for RISC CPUs. The MIPS R10000 and HP's PA-7200 and PA-8000 use them to record whether an instruction can be issued together with the instructions around it, speeding up superscalar instruction issue. However, these designs use only a few predecode bits (at most 5) per 32-bit instruction, while the AMD K5 needs 5 bits per 8-bit byte because an x86 instruction can start on any byte boundary.

Merced, Intel's next generation architecture, will group three operations into a single 128-bit instruction. Each operation will require about 40 bits to describe; the other bits will explicitly describe the dependencies between the operations in this instruction and in other instructions. Since operations that are not dependent can be executed in parallel, Intel calls this "Explicitly Parallel Instruction Computing".

The K5 can decode up to two x86 instructions and issue up to 4 operations per cycle. The Pentium Pro can decode up to three x86 instructions per cycle, but only the first decoder is a full decoder; the second and third only handle instructions that correspond to one operation.

 

Compiler effects

To generate fast code, the compiler must know the pipeline design and the decode and issue restrictions of the CPU type on which the code will run.

Most compilers now have flags that say "optimize for chip X", where e.g. X is 486, Pentium or Pentium Pro. The resulting code will run on all of them, but will probably not be tuned for any of them except X.

Other flags usually exist to tell the compiler that it is OK to use instructions, addressing modes etc that exist only in the later chips in a family, e.g. MMX.

NOTE Pentium loses about 15% to 20% of its speed when running code that was compiled for a 486, i.e. without knowledge of the Pentium's pipeline structure and issue restrictions. Due to its out-of-order design, the Pentium Pro is somewhat more tolerant of code that does not meet its issue restrictions, but its designers believed that by the time it shipped, most people would be running 32-bit code on Windows NT, not 16-bit code on Windows 95 (Microsoft told them so, but they were wrong). This is important, because as a result of this belief, the designers of the Pentium Pro made several tradeoffs that make the execution of some 16-bit code (e.g. segment register loads) significantly slower than even in the Pentium. Therefore when generating code for the Pentium Pro, the compiler should avoid these "legacy" instructions.

 

Vector machines

for (i = 0; i < N; i++)
    y[i] = a * x[i] + y[i];

Vector machines have instructions that operate on entire vectors as well as on scalar values. All operations are processed in a pipeline (two loads, +, * and a store in this case).

The compiler uses vector instructions only if it can remove dependencies between loop iterations. Since data and control hazards cannot occur in the vector pipe, the pipeline may be fairly long.

 

The key parameters

What percentage of the code can take advantage of vectorization?

Depends on compiler; should be > 30%.

At what array length does vectorization pay back the pipe's startup cost? In Cray machines, less than 10; in other vector machines, often about 100.

Does the machine have enough sustainable memory bandwidth? Cray processors can do two loads and one store (all 64 bit) each cycle in parallel with several FP operations.

Peak speeds are meaningless without this info.

NOTE An IBM RS6000 model 397 can add two vectors of FP numbers at a rate that requires 883 MB/s of memory bandwidth; currently this makes it the best uniprocessor workstation on this test. A processor in the Cray T932 achieves a bandwidth of 13014 MB/s on the same test, or over 14 times as fast. A Cray T932 can support 32 processors. When these work together, they achieve a bandwidth of 359841 MB/s, or over 407 times the speed of the IBM RS6000/397.

For other memory bandwidth data, see http://www.cs.virginia.edu/stream.

 

Pipelines and interrupts

i1   IF  ID  EX  ME  WB
i2       IF  ID  EX  ME  WB
i3           IF  ID  EX  ME  WB
i4               IF  ID  EX  ME  WB

When an interrupt or exception occurs, the simplest thing to do would be to draw a vertical line between executed and nonexecuted stages. However, the dividing line should be horizontal. Interrupts should appear to happen between two instructions, and exceptions within one instruction.

The hardest to handle are exceptions (e.g. TLB misses, page faults) that require the faulting instruction to be restartable.

 

Interrupt handling

When an interrupt arrives, the CPU saves the PC of the faulting instruction in a special register (IAR). It then squashes the following instructions, and forces a sequence of special trap "instructions" into the pipeline.

The trap instructions (which should not be usable from user mode) will move the saved PC and the PSW to memory, and override the PC and PSW to invoke the operating system.

For machines with a delayed branch instruction, the CPU must save two PCs, in case the faulting instruction is in a branch delay slot.

NOTE If the faulting instruction is in a branch delay slot, when the process resumes it must execute the faulting instruction, but then it must continue at the target of the branch, not at the instruction following the faulting instruction.

If the faulting instruction is in a branch delay slot and is itself a delayed branch instruction, things get "interesting". Most machines with delay slots do not guarantee the correct execution of such code in the presence of interrupts or exceptions, so the compiler must avoid the situation.

 

Saving the CPU state

Between the arrival of an interrupt or exception and its handling, no instruction (including the faulting one) should modify the CPU state.

In most pipeline schemes, an instruction must not modify the CPU state until it knows that all previous instructions and itself are committed, i.e. can complete without traps.

In others, such modifications are allowed but must be recorded so that they can be undone if needed. This includes recording the previous values of overwritten registers.

NOTE The IBM PowerPC 604 takes the first approach, the Motorola 88110 takes the second.

 

Precise interrupts

If instructions before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline has precise interrupts.

Heavily pipelined and superscalar machines, and those with out of order completion, don't want to wait until previous instructions are committed before modifying the machine state. Some thus have imprecise interrupts.

Such designs (Alpha, POWER) have a trap barrier instruction, which stalls the pipe if necessary until all previous instructions are committed. The compiler can thus guarantee precise traps with a speed penalty.

 

Interrupt order

i1   IF  ID  EX  ME  WB
i2       IF  ID  EX  ME  WB

IF in i2 could cause a page fault before ME in i1. The pipe could post each page fault in a status vector carried with each instruction down the pipe. The CPU checks this and handles the interrupt when the instruction reaches retirement in WB.
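The status vector idea can be modelled as follows. This is a sketch under stated assumptions: the five-stage pipeline from the diagram, one instruction entering per cycle, and hypothetical function names and data layout. It shows that the fault detected first is not necessarily the one handled first.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def detection_cycle(if_cycle, stage):
    """Cycle in which a fault raised in `stage` is detected."""
    return if_cycle + STAGES.index(stage)


def first_detected(instrs):
    """Instruction whose fault the hardware notices first."""
    faults = [(detection_cycle(if_cycle, stage), name)
              for name, if_cycle, status in instrs
              for stage in status]
    return min(faults)[1] if faults else None


def first_handled(instrs):
    """Instruction whose fault is handled first.

    Faults are only acted upon at WB, and instructions reach WB in
    program order, so the oldest faulting instruction wins.
    """
    for name, _if_cycle, status in instrs:
        if status:
            return name
    return None


# (name, cycle of IF, status vector: stage -> fault posted there)
instrs = [
    ("i1", 1, {"ME": "page fault"}),   # data access faults in cycle 4
    ("i2", 2, {"IF": "page fault"}),   # instruction fetch faults in cycle 2
]
print(first_detected(instrs), first_handled(instrs))   # i2 i1
```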

Alternatively, each interrupt can be handled immediately, saving enough state info for software to complete logically previous instructions and to back out of logically later instructions. The Intel i860 does this.

NOTE The first approach makes life a lot easier for OS writers. Most pipelined machines therefore take this approach, as do some out-of-order machines (e.g. the MIPS R10000, Intel Pentium Pro).

With the second approach, there may be complex interactions between the several instructions that may be in the pipeline at a time. On the i860, reputedly it takes a thousand lines of assembly code to save and restore the processor state. Such code is very difficult to get right.

This is one reason why the i860 is not used in general purpose computers. Its main applications are embedded in systems that run only one program, which has usually been written in such a way as to make sure there will be no exceptions or interrupts (one such application is in the graphics engine of one of SGI's high-end animation systems). This was not enough to support its continued development, and Intel has accordingly pulled the plug on the i860.

 

Instruction Sets

Instruction set design

The quintessential job of a computer architect is the definition of a new instruction set. However, there is less and less call for this, as the increasing domination of software costs makes compatibility more and more important.

Since a good instruction set has a very long life (>30 years), and there is no cheap way to correct mistakes in its design, getting the design right is very important. The irony is that most computers sold today had their instruction set designed when that process was still an art form.

 

The role of compilers

Until the eighties, architectures were usually evaluated on how easy it was to write good assembler programs for them.

Now that assembler is all but extinct, they are evaluated on how easy it is to write good compilers for them (good = fast, generates fast code).

The compiler and the architecture should be designed together. Measurement results should dictate whether a given function should be performed in hardware or by the compiler.

 

Instruction set measurement

Today, instruction set design is almost a branch of engineering.

  • Take a set of representative programs from the intended workload.
  • Build a simulator and optimizing compiler for each realizable (now or later) design proposal.
  • Analyze the performance of the workload on each design.
  • Use the results to come up with better alternatives and repeat the process.

 

Objectives

Small static code size: better use of main memory and disk.

Small dynamic code size: less bus traffic, fit in smaller cache.

Small variation in instruction sizes: simpler and faster decoding.

Small variation in instruction times: simpler pipelining.

Good target for nonoptimizing compiler: fast turnaround in debugging.

Good target for optimizing compiler: fast utilities and applications.

NOTE It is a mistake to concentrate on any one of these objectives while ignoring the rest. A good instruction set balances these objectives on an economic basis. For example, since main memory and disks are now both cheap, small code size is not as important as it once was. However, one cannot disregard code size as an objective, as the effectiveness of an instruction cache of a given size increases as code size decreases.

 

What compiler writers want

Regularity: instructions, data types and addressing modes should be as orthogonal, i.e. independent, as possible.

Provide primitives: architectures that attempt to provide solutions to compiler writers' problems often define an instruction that is unusable because it is slightly different from what is needed (e.g. a loop closing instruction).

Simplify tradeoffs: it should be simple to find out which of several alternative code sequences is the best (e.g. load variable into register or not).

NOTE A loop closing instruction designed for Fortran, a language in which every loop is executed at least once and hence for which the test for loop termination is at the bottom, is often not useful in implementing C, a language in which loops can be executed zero times and hence for which the test for loop termination is at the top, because it encodes the wrong test and/or a wrong direction for the jump.

 

Classifying instruction sets

There are five dimensions along which one can classify instruction sets.

  • Form of operand storage in CPU (accumulator, stack, register set)
  • Number of explicit operands per instruction
  • The locations of those operands and how specified
  • Type and size of operands and how specified
  • The set of operations provided

 

Accumulator machines

Accumulator machines (e.g. DEC PDP-8) have a single high speed location in the CPU for the storage of user data; this is called the accumulator.

The accumulator is an implicit operand in (almost) every instruction. Instructions have at most one explicit operand.

There are instructions to load the accumulator from memory and store its contents into memory; to add the contents of a memory location to the accumulator etc; and to jump or not depending on the value in the accumulator.

 

Evaluation

C = A + B

load A; add B; store C

The accumulator approach permits very cheap hardware but data memory traffic is very high (both due to lack of registers). Instructions and programs are short.

In a pipeline the accumulator would always present a data hazard. The resulting stalls would take away most of the performance improvement.

A few accumulator machines survive in embedded applications.

 

Stack machines

Stack machines have a small set of high speed locations in the CPU.

Some instructions push the contents of a memory location onto this stack, or pop the stack and copy the popped value into memory. Branches test the value at the top of the stack.

Arithmetic and logical instructions pop their operands off the stack, and push their result back onto the stack. These operations have no explicit operands.

 

Evaluation

C = A + B

push A; push B; add; pop C

Stack machines also have short instructions and programs and are very easy to generate code for. However, the generation of good code is difficult since the stack is not random access; you need to shuffle data to the top before use.
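The four-instruction sequence above can be traced with a tiny interpreter. This is a minimal sketch of a pure stack machine with only the three operations used in the example; the function name, the tuple encoding of instructions and the use of a dictionary as "memory" are assumptions.

```python
def run_stack_machine(program, memory):
    """Execute push/add/pop instructions; return the final stack."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(memory[instr[1]])   # memory -> top of stack
        elif op == "pop":
            memory[instr[1]] = stack.pop()   # top of stack -> memory
        elif op == "add":
            # No explicit operands: both inputs come from the stack.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError("unknown operation: " + op)
    return stack


memory = {"A": 2, "B": 3, "C": 0}
program = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
run_stack_machine(program, memory)
print(memory["C"])   # 5
```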

Again, stack entries near the top are data hazards in pipelines. Some impure stack machines survive in low-end chips and some business machines (e.g. Unisys).

NOTE These stack machines are called impure because they have instructions such as "add the contents of location xyz to the value on the top of stack".

 

General register machines

All recent designs and many surviving ones (e.g. SPARC, MIPS, IBM System 360) have a set of individually named (numbered) registers.

In these machines, all operands are explicitly named. Arithmetic and logical instructions have either two or three explicit operands.

If two, then the result will overwrite one of the inputs. This is sometimes good and sometimes bad, depending on whether that input is needed in the future; if yes, it must be copied first.

 

Evaluation

C = A + B

load r1,A; load r2,B;

add r3,r1,r2; store C,r3

General register machines have longer instructions and programs. However, they have random access to a larger set of high speed locations, and can exploit fully their speed advantage over memory.

Usually there are enough registers to store variables as well as temporaries.

We will concentrate on these machines.

 

Number of registers

The more registers a machine has, the more variables can be kept in registers. This makes the code denser (a register can be named in fewer bits than a memory location) and faster (registers are faster than memory).

As the size of the register set grows, these advantages decrease, and register save/restore at procedure calls becomes a problem. Typical sizes are 16 to 64 registers.

Compiler writers prefer that all registers be equivalent. Many machines, however, dedicate certain registers to particular tasks (e.g. stack pointer).

 

Division of registers

Many modern instruction sets have separate integer and FP register sets. These were essential to reduce interchip communication when FPUs were on separate chips (e.g. R3000 CPU & R3010 FPU).

Having an integer register set coupled to an integer ALU and an FP register set coupled to FP functional units is still common. It doubles the register bandwidth, i.e. the number of registers that can be accessed in one cycle.

A superscalar machine with this organization can easily execute one integer and one FP instruction in parallel.

NOTE Even in nonsuperscalar machines, a multi-cycle operation may finish at the same time as an integer operation that was started later. This organization allows the integer and FP writebacks to proceed together.

R4000 and Alpha have 32 64-bit GPRs and 32 64-bit FPRs. Many other machines have 32 32-bit GPRs and an FP register set that can be accessed either as 16 64-bit FPRs or as 32 32-bit FPRs or some combination.

 

Two vs three address machines

Most arithmetic operations take two inputs and produce one output.

In three-address machines, arithmetic instructions have room for three register numbers or memory addresses.

In two-address machines, arithmetic instructions have room for only two register numbers or memory addresses. One of these describes where the output is to go as well as where one of the inputs is coming from.

Two-address instructions are more compact, but one may need an extra instruction to save the value that is overwritten.
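The extra instruction appears when a three-address operation has to be expressed in two-address form. The sketch below handles only addition, which is commutative; the function name and the `mov` copy instruction are hypothetical and do not refer to any particular machine.

```python
def add_to_two_address(dest, src1, src2):
    """Lower 'add dest,src1,src2' to two-address instructions.

    In the two-address form 'add x,y' means x = x + y, so the first
    operand is both an input and the output.
    """
    if dest == src1:
        return [("add", dest, src2)]
    if dest == src2:
        return [("add", dest, src1)]   # allowed because a + b == b + a
    # Neither input may be overwritten: copy one of them first.
    return [("mov", dest, src1), ("add", dest, src2)]


print(add_to_two_address("r3", "r1", "r2"))
# [('mov', 'r3', 'r1'), ('add', 'r3', 'r2')]
print(add_to_two_address("r1", "r1", "r2"))
# [('add', 'r1', 'r2')]
```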

NOTE The word "address" is used here in a generic sense; one can think of a register number as the register's "address". The terms two-operand-specifier and three-operand-specifier machines may be more accurate, but they are not standard. However, sometimes people call the two classes two-operand and three-operand machines.

Most two-address machines fall into the reg-mem category (see below), but this is just a statistical correlation, not a firm rule. Other types of machines (e.g. reg-reg) can be two-address, and reg-mem machines can be three-address.

 

Operand location

How many of the operands can be in memory?

Most recent machines are register-register or load-store, i.e. all the operands of ALU instructions must be in registers (e.g. SPARC, MIPS).

In register-memory machines, some but not all operands must be in registers. The number of operands that may be in memory may vary depending on instruction type (e.g. 1 or 2 for the IBM System 360).

Many machines from the seventies are memory-memory, i.e. any operand may be in memory (e.g. DEC VAX).

NOTE Almost all reg-reg machines are three-address. Most reg-mem machines are two-address. Mem-mem machines usually have both two-address and three-address variants of each instruction.

C = A + B

reg-reg            reg-mem          mem-mem
load  r1,A         load  r1,A       add  C,A,B
load  r2,B         add   r1,B
add   r3,r1,r2     store C,r1
store C,r3

The three should take roughly the same amount of time, since they carry out the same operations in the same order, despite the differing numbers of instructions.
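One way to see why the running times are similar is to count data memory accesses rather than instructions. The sketch below encodes the three sequences shown above and counts both. It assumes, for this example only, that any operand whose name starts with "r" is a register and every other operand is a memory location.

```python
SEQUENCES = {
    "reg-reg": [("load", "r1", "A"), ("load", "r2", "B"),
                ("add", "r3", "r1", "r2"), ("store", "C", "r3")],
    "reg-mem": [("load", "r1", "A"), ("add", "r1", "B"),
                ("store", "C", "r1")],
    "mem-mem": [("add", "C", "A", "B")],
}


def is_memory_operand(operand):
    # Assumption of this sketch: registers are named r<number>.
    return not operand.startswith("r")


def summarize(sequence):
    """Return (instruction count, data memory accesses)."""
    accesses = sum(is_memory_operand(operand)
                   for instr in sequence
                   for operand in instr[1:])
    return len(sequence), accesses


for kind, sequence in SEQUENCES.items():
    print(kind, summarize(sequence))
# reg-reg (4, 3)
# reg-mem (3, 3)
# mem-mem (1, 3)
```

The instruction counts differ, but every variant reads A and B and writes C once each.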

 

Register-register machines

Reg-reg machines tend to have fixed instruction sizes (usually 4 bytes) and simple formats, allowing fixed field decoding. In most such designs, most instructions take the same number of clock cycles, simplifying pipelining and superscalar implementation.

The reg-reg instruction sequence is the longest if no operand is in a register and the shortest if all operands are and stay in registers.

Optimization is important but relatively easy. Code is not dense, typically 10-20% larger than mem-mem.

 

Register-memory machines

Reg-mem machines are harder to decode, since they usually have 2-5 instruction sizes (usually 2-10 bytes).

They can be pipelined, but the pipeline scheme must be more complex (e.g. ALU op for address arithmetic, Dcache access, ALU op for execution), and will have more exceptions (e.g. TLB miss and overflow). Superscalar implementation is difficult.

Code density is better than reg-reg. Optimization is a bit harder: e.g. should one load a value into a register and operate on it there, or should one use an instruction that accesses it in memory?

 

Memory-memory machines

The flexibility of mem-mem machines makes instruction size and execution time vary widely (1-30+ bytes and 1-1000+ cycles respectively), slowing down decoding (often making it sequential) and making even simple pipelining very difficult, especially with respect to interrupts.

Code density of mem-mem machines is quite good, and simple compilers are easy to write. However, optimization is hard: e.g. when is it worthwhile to load something into a register? The answer can depend on the model, e.g. what is better on the VAX 11-780 can be worse on the VAX 9000.

 

Caveat

The preceding evaluations are rough guides only.

A compiler can turn a machine of one type (e.g. reg-mem) into another (e.g. reg-reg) by not emitting some kinds of instructions.

The IBM 360 is a reg-mem machine. IBM Research's PL.8 compiler treated it as a reg-reg machine by not generating instructions that both accessed memory and performed arithmetic.

Many machines straddle some of the boundaries.

NOTE Programs compiled with the PL.8 compiler executed more instructions than programs compiled with the usual PL/1 compiler, but these instructions were simpler and took less time. The overall result was that the PL.8 compiler achieved speedups of about 20%. The main reason for this was that it did a better job of reusing values in registers than the PL/1 compiler.

The PL.8 compiler was so called because it implemented about 80% of the PL/1 language.

 

Examples of hybrids

  • Many general register machines have some instructions with implicit operands (e.g. DEC VAX).
  • Some machines are mem-mem for some instructions and reg-mem for others (e.g. Motorola 680x0).
  • The Intel 80x86 started out as an accumulator machine, then it was extended with "sortof" general purpose registers. The floating point instructions treat the FPU (the 80x87) as a stack machine.

 

Register allocation

Compilers for GPR machines try to put variables and temporaries into registers instead of memory whenever this is possible. Exception: variables whose addresses are taken (aliasing).

A procedure may have more values than registers, but they may still all fit into registers if not all values are live at the same time (live = may be needed in future).

Even if all values can be put in registers, one should minimize the number of registers required, as this also minimizes the number of registers that must be saved and restored at procedure calls.

NOTE After executing p = &v, the variable v has two names, v and *p, and we must make sure that both refer to the same storage. Since *p necessarily refers to memory, v must be in memory as well.

If you try to keep v in a register across statements, the program may compute the wrong result. Consider the code fragment

    v = 1;
    *p = 2;
    if (v == 1) ...

If the code does not reload the value of v from memory at the time of the execution of the if statement, the branch will go the wrong way.
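The effect can be imitated with a small simulation. This is only a sketch: the dictionary standing in for memory, the string "v" standing in for an address, and the function name run are all assumptions made for illustration.

```python
def run(keep_v_in_register):
    """Simulate: v = 1; *p = 2; if (v == 1) ...  Returns the branch outcome."""
    memory = {"v": 0}
    p = "v"                  # p = &v: p holds the "address" of v
    memory["v"] = 1          # v = 1
    reg_v = memory["v"]      # the compiler keeps a copy of v in a register
    memory[p] = 2            # *p = 2 updates memory, not the register copy
    v = reg_v if keep_v_in_register else memory["v"]
    return v == 1            # the condition tested by the if statement

stale = run(keep_v_in_register=True)    # uses the out-of-date register copy
fresh = run(keep_v_in_register=False)   # reloads v from memory
```

With the register copy the test sees the old value 1; with the reload it sees 2, which is what the program means.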

 

Variable lifetimes

1    A = ...
2    B = ...
3    ...
4    ... B ...
5    C = ...
6    ... A ...
7    D = ...
8    ... D ...
9    ... C ...

lifetime of A: statements 1 to 6

lifetime of B: statements 2 to 4

lifetime of C: statements 5 to 9

lifetime of D: statements 7 to 8

 

Register allocation by coloring

Construct a graph with one node per value (variable or temporary). If two values can be live at the same time, put an edge between them.

Use a standard heuristic algorithm to allocate a color to each node such that two nodes connected by an edge may not have the same color. Convert each color into a register number.

A and D can share r1 while B and C share r2.
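The construction can be sketched as follows, using the lifetimes of A, B, C and D from the previous slide. The greedy heuristic, the inclusive overlap test and all function names are assumptions of this sketch; real allocators use more refined heuristics and must also handle spilling.

```python
import itertools

# Live ranges: value -> (first statement, last statement), both inclusive.
LIFETIMES = {"A": (1, 6), "B": (2, 4), "C": (5, 9), "D": (7, 8)}

def interference_graph(lifetimes):
    """Put an edge between every two values whose live ranges overlap."""
    graph = {name: set() for name in lifetimes}
    names = list(lifetimes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = lifetimes[a]
            b_start, b_end = lifetimes[b]
            if a_start <= b_end and b_start <= a_end:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def color(graph):
    """Greedy heuristic: lowest color not used by an already colored neighbour."""
    colors = {}
    for node in graph:
        taken = {colors[n] for n in graph[node] if n in colors}
        colors[node] = next(c for c in itertools.count(1) if c not in taken)
    return colors

def allocate(lifetimes):
    """Convert each color into a register name."""
    colors = color(interference_graph(lifetimes))
    return {name: f"r{c}" for name, c in colors.items()}

registers = allocate(LIFETIMES)
```

For this example the sketch needs two colors and reproduces the assignment above: A and D in r1, B and C in r2.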

 

Byte vs word addressing

In virtually all machines today, each byte of main memory has its own address. A memory access may retrieve 1, 2, 4 or possibly 8 bytes.

In some old machines (e.g. Data General Eclipse), addresses are attached not to bytes but to words (which may be 16, 18, 24, 32, 36 etc bits). In these machines, every access retrieves one word.

Word addressed machines are dying out. They are a pain to program; a (char *) may be 48 bits while an (int *) is 32 bits.

NOTE The Data General Eclipse has these pointer sizes. Of the extra 16 bits in a character pointer, only one is meaningful: it selects either the low byte or the high byte in the 16-bit word addressed by the other 32 bits.

 

Byte sex

When retrieving more than one byte on a byte addressed machine, which of the retrieved bytes is the most significant?

In big endian systems, the byte with the lowest address; in little endian systems, the byte with the highest address. DEC, Intel and hence Microsoft are little endian; IBM, Motorola and most other companies are big endian. This makes data transfers non-trivial (the NUXI problem).

Some bi endian chips support both orderings, via a pin sampled at startup (MIPS R3000) or via a status register bit (MIPS R3000A).

NOTE The names big endian and little endian are allusions to an old story by Swift.

In a big endian machine, the significance of bytes (or bits) in any data item decreases as the address (byte or bit number) increases. In a little endian machine, the significance of bytes (or bits) in any data item except character strings increases as the address (byte or bit number) increases. For character strings the significance decreases.

One cannot convert a piece of data from one endianness to the other without knowing which piece of data is of what type, e.g. which parts of the data represent strings.

When Unix was ported from a little endian machine to a big endian machine, both of which had 16 bit words (this was in the seventies), those occurrences of the word "UNIX" that had been stored as words on the little endian machine came out as NUXI on the big endian machine, hence the name of the problem.
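Both the byte orderings and the NUXI effect can be reproduced with Python's standard struct module. The sample value and the helper name swap_16bit_words are assumptions chosen for illustration.

```python
import struct

value = 0x0A0B0C0D
big = struct.pack(">I", value)      # most significant byte at the lowest address
little = struct.pack("<I", value)   # most significant byte at the highest address

def swap_16bit_words(data):
    """Read data as little endian 16-bit words, write them back big endian.

    len(data) must be even.
    """
    count = len(data) // 2
    words = struct.unpack(f"<{count}H", data)
    return struct.pack(f">{count}H", *words)

ported = swap_16bit_words(b"UNIX")  # the string was stored as two 16-bit words
```

Transferring the words rather than the bytes keeps each 16-bit number intact but exchanges the two characters inside it, which is exactly the wrong thing to do for a string.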

 

Alignment

On some machines, alignment must be ensured by the user/compiler. On others, the CPU automatically converts an unaligned access (e.g. a 4-byte access at address 1) into two aligned accesses (at addresses 0 and 4). This interferes with pipelining.

When any machine accesses an object smaller than a word, it must be able to put that object in the least significant part of a register regardless of the address of the object.

A special-purpose shifter called an alignment network usually implements both these functions. It is also used for byte switching in bi endian systems.

NOTE A memory access that accesses N bytes, where N is a power of 2, is aligned if the address of the first byte is divisible by N.
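The definition translates directly into code. The following sketch also lists the aligned accesses that a CPU would have to perform for an unaligned one; the function names are assumptions of the sketch.

```python
def is_aligned(address, size):
    """True if an access of size bytes (a power of 2) at address is aligned."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of 2")
    return address % size == 0

def aligned_accesses(address, size):
    """Start addresses of the aligned size-byte blocks touched by the access."""
    first = address - address % size
    last_byte = address + size - 1
    last = last_byte - last_byte % size
    return list(range(first, last + 1, size))
```

A 4-byte access at address 1 is not aligned and touches the aligned words at addresses 0 and 4, whereas a 4-byte access at address 8 needs only the word at 8.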

 

Addressing modes

DLX and MIPS use the simplest three, register (r1), immediate (4), and register indirect with displacement (M[r1+4]).

Indexed mode (M[r1+r2]) needs a third read port for store instructions; scaled indexed (M[r1+4*r2]) needs a preshifter as well. Otherwise these fit into a simple pipeline.

The autoincrement (M[r1++]) and autodecrement (M[--r1]) modes need an extra write port, allow WAR and WAW hazards, and complicate interrupt handling. Indirect modes (M[M[r1]]) do not fit into reasonable pipelines.
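The modes differ in how the address of the operand is computed. The sketch below models registers and memory as dictionaries; the mode names, the parameter names and the default operand size of 4 bytes are assumptions made for illustration.

```python
def effective_address(mode, regs, mem, base, index=None, disp=0, size=4):
    """Return the address an operand refers to; some modes also update regs."""
    if mode == "displacement":      # M[r1+4]
        return regs[base] + disp
    if mode == "indexed":           # M[r1+r2]: needs a second register read
        return regs[base] + regs[index]
    if mode == "scaled":            # M[r1+4*r2]: index is shifted before the add
        return regs[base] + size * regs[index]
    if mode == "autoincrement":     # M[r1++]: use r1, then write r1 back
        address = regs[base]
        regs[base] = address + size
        return address
    if mode == "autodecrement":     # M[--r1]: write r1 back, then use it
        regs[base] -= size
        return regs[base]
    if mode == "indirect":          # M[M[r1]]: a memory read just for the address
        return mem[regs[base]]
    raise ValueError(f"unknown mode: {mode}")
```

The register write in the autoincrement and autodecrement cases is the extra write port mentioned above, and the memory read in the indirect case is what fails to fit into a simple pipeline.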

 

Addressing modes

Some implementations have dedicated hardware to speed up e.g. r1+4*r2. Others take two cycles for the shift and add; the intermediate result is not visible to the programmer and therefore cannot be reused.

PC relative modes are usually used only for branches.

In some machines, e.g. VAX, the PC is a GPR. This complicates pipelined implementations a great deal (there must be several versions of the PC). Moreover the PC can be written into; this mixes data and control hazards.

 

Addressing mode usage

The frequency of an addressing mode depends critically on both the program and the compiler used. However, in general the simplest addressing modes are the most frequently used ones, especially in code generated by optimizing compilers.

In the seventies, architects kept adding addressing modes to make assembly language programming easier; now they keep removing them to make the machine faster and optimization simpler. The "removed" modes, not generated by new compilers, are usually supported out of pipeline.

NOTE One example is the Motorola 680x0. The 68020 introduced some complex new addressing modes. The 68040 pipeline does not support these new addressing modes: when it encounters them, it suspends the pipeline, executes the instruction, and resumes the pipeline. This means that instructions with the 68020 addressing modes slow down the program instead of speeding it up.

One reason why optimizing compilers do not use the complex addressing modes is that those modes compute intermediate results (e.g. r1+r2) that the compiler knows will be needed again. The compiler therefore computes that value once, puts it in a register, and then uses that register from then on.

Another reason why optimizing compilers do not use the complex addressing modes is that these modes require several things to happen in sequence, where each part has to wait for the previous parts. An optimizing compiler is likely to want to insert other actions between those parts to reduce the probability of the later part having to wait: this can happen because e.g. a memory access can be done in parallel with some other CPU operations.

 

Addressing mode encoding

Different addressing modes have different space requirements; e.g. a register number is 3 to 5 bits, a displacement usually 8, 16 or 32 bits.

In most machines, the opcode specifies the addressing modes of all the operands, and therefore the instruction length. Some combinations of operations and addressing modes may be missing.

In some machines (e.g. VAX), the opcode specifies only the operation. Each operand has an operand specifier that gives the addressing mode and mode-specific info. Finding the length requires parsing every specifier in turn.

NOTE The size of the mode-specific info is different for different addressing modes. Therefore the CPU must finish decoding the addressing mode of one operand specifier before it can even locate the start of the next operand specifier. This is one of the main reasons why the VAX is very hard to pipeline.
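The sequential nature of this decoding can be seen in a toy model. The mode numbers and sizes in the table below are hypothetical and are NOT the real VAX encoding; only the structure (a one-byte opcode followed by specifiers whose first 4 bits give the mode) follows the description above.

```python
# Hypothetical modes: mode number -> bytes of mode-specific info after the specifier byte.
EXTRA_BYTES = {
    0x0: 0,   # register
    0x1: 1,   # 8-bit displacement
    0x2: 2,   # 16-bit displacement
    0x3: 4,   # 32-bit displacement
}

def instruction_length(code, operand_count):
    """Length in bytes of the instruction starting at code[0]."""
    position = 1                          # skip the one-byte opcode
    for _ in range(operand_count):
        mode = code[position] >> 4        # first 4 bits of the specifier
        position += 1 + EXTRA_BYTES[mode] # only now is the next specifier located
    return position

sample = bytes([0xC1, 0x05, 0x12, 0x7F, 0x31, 0x00, 0x00, 0x00, 0x00])
length = instruction_length(sample, operand_count=3)
```

Each loop iteration depends on the previous one, so the specifiers cannot be decoded in parallel without guessing where they start.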

 

Instruction size

The more registers and addressing modes there are, the more bits are required to distinguish them.

One can keep instructions small by having few addressing modes; having few registers also works but hurts performance in other ways. One can also make e.g. a single register number perform two tasks, as in two address machines (both source and destination). The use of implicit operands also keeps instructions short (e.g. clr r1 = mov r1,0).

Most displacements and immediates are small. Instructions should not use 32 bits to represent 0.

 

Type of operands

On a few word addressed machines, each word contains a tag that gives its type. The add instruction can thus add two integers or two floats or one of each, producing an integer in the first case and a float in the other cases. This makes pipelining very difficult.

On most machines, the opcode specifies the type and thus the size of the operands. All computers support integers and bitstrings; virtually all support floats and characters; some support character strings and BCD (binary coded decimal).

 

MultiMedia eXtensions (MMX)

The HP PA, SPARC, MIPS, x86 and some other instruction sets have all been extended in the last few years with instructions that can use e.g. a 32-bit ALU to process e.g. 4 8-bit integers in parallel.

This mostly requires stopping carry propagation between independent operations and supporting saturating arithmetic (replacing an overflow result with the maximum value).
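Both requirements can be imitated in software on a 32-bit word holding four unsigned 8-bit lanes. This is a sketch of the idea, not of any vendor's instructions; the function names are assumptions, and the saturating version loops over the lanes for clarity where hardware would handle them in parallel.

```python
LOW7 = 0x7F7F7F7F   # low 7 bits of each 8-bit lane
HIGH = 0x80808080   # top bit of each 8-bit lane

def add_bytes_wrapping(a, b):
    """Add four 8-bit lanes at once; carries never cross a lane boundary."""
    low = (a & LOW7) + (b & LOW7)        # per-lane sum fits in 8 bits
    return low ^ ((a ^ b) & HIGH)        # add the top bits without a carry out

def add_bytes_saturating(a, b):
    """Add four unsigned 8-bit lanes, clamping each lane at 0xFF."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(lane, 0xFF) << shift
    return result

a, b = 0x01FF10F0, 0x0101F020
plain = (a + b) & 0xFFFFFFFF             # ordinary 32-bit add: carries leak
wrapped = add_bytes_wrapping(a, b)
saturated = add_bytes_saturating(a, b)
```

The ordinary add lets the carry out of each lane corrupt its neighbour; the other two keep the lanes independent, and for pixel intensities the saturated result is usually the one wanted.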

Such extensions are aimed at supporting audio, video and graphics processing, especially playback of MPEG movies.

NOTE The intensity of a pixel in each of three colours is usually an 8-bit integer, while sound amplitudes are usually represented as 16-bit integers. Audio, video and graphics data contains other small integers as well.

The Alpha architecture has no extensions supporting MPEG playback, because by the time this became a desirable requirement, Alpha implementations were already fast enough to support MPEG playback without such extensions. However, DEC has recently added a small number of instructions to the next generation Alphas to support real-time MPEG encoding, a much more compute-intensive task.

MMX is Intel's name for the technology. Other vendors have used other names (e.g. Visual Instruction Set, VIS, for SPARC), but MMX is becoming a generic term in the industry.

While the first MMX implementations were restricted to executing multiple small integer operations in parallel on the same ALU, some systems (e.g. 3DNow! from AMD, Cyrix and IDT, all x86 clone vendors) which will appear in the near future can perform multiple small FP operations in parallel as well.

 

Instruction types

The universal instruction types are data transfer, control transfer, integer arithmetic, and bit-string operations (&, |, ~).

Machines for all markets except embedded systems tend to also have instructions for FP arithmetic and to support the OS (e.g. system call, and privileged instructions like return from interrupt and load page table base register).

Machines for the commercial market (i.e. Cobol) usually have BCD (binary coded decimal) arithmetic and string operations (move, search).

NOTE In BCD, a number is represented as a sequence of digits. Each digit is represented as 4 bits, with the patterns 0000 to 1001 standing for 0 through 9 and the patterns 1010 to 1111 being illegal. A 32 bit word can thus contain numbers from 0 to 99999999.

In some machines (e.g. HP PA), BCD arithmetic is done via the normal binary arithmetic instructions, which may produce illegal values as a result of overflow between digits, and correction instructions that handle the overflow.
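The add-then-correct idea can be sketched digit by digit as follows. This illustrates the principle only and is not the HP PA instruction sequence; the function names and the 8-digit default are assumptions.

```python
def bcd_encode(n):
    """Pack a non-negative integer into BCD, 4 bits per decimal digit."""
    result, shift = 0, 0
    while True:
        result |= (n % 10) << shift
        n //= 10
        shift += 4
        if n == 0:
            return result

def bcd_add(a, b, digits=8):
    """Add two packed BCD numbers; a carry out of the top digit is dropped."""
    result, carry = 0, 0
    for i in range(digits):
        shift = 4 * i
        nibble = ((a >> shift) & 0xF) + ((b >> shift) & 0xF) + carry  # binary add
        carry = 0
        if nibble > 9:                    # illegal BCD digit: correct it
            nibble = (nibble + 6) & 0xF   # skip the six unused bit patterns
            carry = 1
        result |= nibble << shift
    return result
```

For example, adding the BCD encodings of 1999 and 1 gives 0x2000, whereas a plain binary add of 0x1999 and 0x0001 would give the illegal pattern 0x199A.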

BCD operations are quite similar in several respects to MMX operations, but they were present in instruction sets three decades earlier.

 

Control transfer instructions

Conditional branches are used within procedures. Their target is usually specified by a short (8 to 16 bit) offset from the PC; almost all branch distances are short (< 256 instructions).

Unconditional jumps/calls tend to have longer offsets because they are often used to transfer between procedures and because there is more room.

Out-of-range branches can be implemented as a branch around a jump.

 

Procedure call and return

A call is a jump that saves the return PC in a register or the stack. A return is an indirect jump to the return PC.

On some machines (e.g. DEC VAX), these instructions also do register save/restore. On others the compiler emits separate instructions for this.

Hardwired instructions usually do callee save; compilers can use callee save, caller save, or a combination. With 32 GP registers and full optimization, an average call saves and restores 2 registers on the MIPS Rx000.

NOTE The call instruction on the VAX is the classic case of an instruction that does too much. With an optimizing compiler, a call should just save a few registers and branch. The CALLS instruction on the VAX does this, but it also:

  • saves a register save mask, with a 1 for each saved register
  • saves the number of arguments in this call
  • aligns the stack
  • saves the stack pointer
  • updates the stack pointer to point beyond the arguments
  • resets the condition codes
  • resets the trap handlers
  • sets up registers to point to the arguments

As a result, many programs spend 30% of their time in this instruction. The observation of this fact was one of the main motivations for the trend towards simpler instruction sets, i.e. towards RISC.

 

Condition codes

Older machines tend to have condition codes (usually part of the status register).

Instructions (e.g. sub, mov) set all the bits in the CC according to the result. In the VAX, the N bit is 1 if the result was negative, the Z bit is 1 if it was zero, the C bit is 1 if a carry occurred, and the V bit is 1 if an overflow occurred.
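How the four bits follow from a result can be sketched for a 32-bit addition. The function name is an assumption, the overflow flag is labelled V in the sketch, and subtraction is left out because machines differ in how they report a borrow in the carry bit.

```python
MASK = 0xFFFFFFFF   # 32-bit word
SIGN = 0x80000000   # sign bit of a 32-bit word

def add_with_cc(a, b):
    """32-bit add; returns (result, flags) with flags N, Z, C and V."""
    a &= MASK
    b &= MASK
    full = a + b
    result = full & MASK
    flags = {
        "N": int(result & SIGN != 0),    # result negative as a signed number
        "Z": int(result == 0),           # result zero
        "C": int(full > MASK),           # carry out of the top bit
        # signed overflow: operands have equal signs, result sign differs
        "V": int(~(a ^ b) & (a ^ result) & SIGN != 0),
    }
    return result, flags
```

Adding 1 to 0x7FFFFFFF sets N and V but not C; adding 1 to 0xFFFFFFFF sets Z and C but not V.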

Branch instructions test bits in the CC. Sometimes the setting of the CC is free, but usually it is not: the branch is often preceded by an instruction whose sole purpose is to set the CCs (e.g. cmp).

NOTE Many machines have both subtraction and comparison instructions. Both instructions subtract one value from another and set the condition codes according to the value of the result. Usually the only difference is that the subtraction instruction puts the result somewhere whereas the comparison instruction throws it away.

On machines where all instructions set the CCs (e.g. DEC VAX), this fact often prevents instruction reordering by the optimizer.

Some machines (e.g. Sun SPARC) have a bit in each instruction that says whether the CCs should be set or not. This allows some reordering without changing the final CC.

Others (e.g. IBM RS/6000) have several CCs, and each instruction says which, if any, should be set. This is particularly useful in superscalar machines, as the CPU can execute an integer operation generating one CC while it is branching on another CC.

NOTE The compiler cannot change the order of two instructions if both affect the (same) condition codes and they are followed by an instruction that tests that condition code.

Allowing branches to overlap with other operations is very important in superscalar machines, because branches occur on average every 4 to 8 instructions.

Multiple CCs allow more reordering than a single CC: e.g. "test cond1; branch on cond1; test cond2; branch on cond2" can be done in the order "test cond1; test cond2; branch on cond1; branch on cond2" if execution in that order takes fewer cycles, as is possible in pipelined and superscalar machines.

Some condition code bits for FP are mandated by IEEE standard 754.

 

Condition register

Some machines (e.g. MIPS Rx000) dispense with condition codes entirely. In these machines, comparison instructions are like arithmetic ones: they take two inputs and produce an output, except they produce a boolean value, 0 or 1.

All three operands are usually in registers. Conditional branches test whether a register is zero or not.

This scheme allows booleans to be treated the same as the other primitive types. This makes optimization easier, although the condition temporarily uses up a GP register.

NOTE This scheme does not allow a superscalar CPU to execute an integer operation while it is also doing a branch, unless it has an additional port to the register file for retrieving the condition being branched on.

 

Compare and branch

Some machines have compare and branch instructions that specify the branch condition (e.g. r1 < r2) as well as the target. These usually offer a limited set of branch conditions, partially because there may be no room in the instruction for a complex condition.

The need for two ALU operations (compare, compute target) means the pipeline branch penalty is relatively large.

The MIPS Rx000 allows only equality comparisons and comparisons against zero because they do not need an ALU operation; the comparison and target computation are done in parallel.

 

Predicated execution

One way to avoid branches altogether is to convert if statements to straight line code with some conditionally executed instructions. The condition can be attached to the instruction to be nullified, or to the previous instruction (skip if ...). Example from SPARC v9:

    if (x > y)
        z = x;
    else
        z = y;

    cmp     x,y         ; sets %icc
    mov     y,z         ; z = y
    movgt   %icc,x,z    ; if (x > y) z = x

NOTE In SPARC assembler, the destination register is last.

%icc is the integer condition code register. The last instruction has no effect unless the %icc indicates that the result of the last comparison was "greater than", i.e. x is greater than y.

The letters x, y and z should be replaced by the names of SPARC registers.

In the instruction set of the ARM chips, almost all instructions are conditional.

 

RISC vs CISC

The sixties were a time of experimentation. Most techniques used today were pioneered then; many were ahead of their time.

In the seventies, the trend was to use microcode to create more complex instruction sets to try to help assembly programmers and simple compilers. (Optimizing compilers needed more memory than most machines had.)

Since the eighties, the trend is to simplify instruction sets to help optimization, to make CPUs fit on one chip and to allow the application of more techniques to enhance the CPU's performance.

NOTE RISC stands for Reduced Instruction Set Computer, while CISC stands for Complex Instruction Set Computer.

 

Mashey's RISC criteria

  • number of instruction sizes -- RISCs: 1
  • size of largest instruction -- RISCs: 4 bytes
  • number of addressing modes -- RISCs: < 5 (except HP PA)
  • support indirect addressing? -- RISCs: no
  • load/store architecture? -- RISCs: yes
  • max addresses generated/instruction -- RISCs: 1
  • allow unaligned data? -- RISCs: no (except IBM RS/6000, within cache block)
  • max TLB accesses per instruction -- RISCs: 1 instruction + 1 data
  • number of GPRs -- RISCs: 32
  • number of FPRs -- RISCs: >= 16 (except Motorola 88100)

NOTE John Mashey, formerly at MIPS and now at Silicon Graphics, is one of the most respected scientists in the Unix / architecture community. Watch out for his articles in the newsgroup comp.arch: they are models of correctness and clarity.

 

Bell's law

The fastest, cheapest, most reliable and least power hungry components are the ones that aren't there.

The performance enhancing techniques used by RISCs can also be used by CISCs, but only with a complexity penalty. The Intel 860 is almost twice the speed of the 486, and the DEC Alpha more than twice the speed of the NVAX, despite similar silicon technology and time of introduction.

The number of instructions is almost irrelevant in the RISC/CISC dividing line. It is usually just an artifact of age: all architectures tend to acquire more instructions through time.

NOTE Gordon Bell was the principal architect of the DEC PDP-11, a very influential minicomputer of the middle seventies. He also made major contributions to the design of the Encore Multimax and the Ardent Titan, respectively one of the first departmental multiprocessors and one of the first graphics superworkstations.

The MIPS instruction set, which is a RISC designed in the early/middle eighties, is now in its fifth version. Each version added some features and instructions.

In the eighties the Motorola 680x0 architecture was dominant in workstations. However, Motorola's engineers couldn't keep up with the competition from RISC chips, and the architecture is now used mainly in embedded systems. To make the architecture more suitable for this use, Motorola's engineers have mutated the 680x0 architecture into an architecture called ColdFire, with the explicit goal of making it easier to design fast and cheap processors for the architecture. To this end they deleted several of the more CISCy features of the architecture.

 

Transitions

Many companies have moved from CISC to RISC product lines. The only significant CISC architecture still in wide use in computers will soon be the x86 architecture.

Intel is working on an architecture that is intended to succeed the x86. This architecture has most of the RISC characteristics, plus

  • many more registers (2*128 + 64)
  • much use of predicated execution
  • support for speculative loads
  • explicit bundling of operations
  • info on operation dependencies

NOTE Some CISC to RISC transitions:

Apple:    680x0 -> PowerPC
DEC:      VAX -> MIPS, then later Alpha
DG:       Eclipse -> 88000 (then later x86)
HP:       3000 -> HP PA
IBM:      AS/400 -> PowerPC
SGI:      680x0 -> MIPS
Sun:      680x0 -> SPARC
Tandem:   Cyclone -> MIPS

Several CISC architectures are in very common use in embedded applications. One example is the original 68000, some versions of which now cost around one or two dollars. Together with the extensive set of tools (compilers, debuggers etc) available, this cost makes the 68000 very attractive for many mass-market applications. Another example is the PDP-8; although obsolete in computers since the early seventies, it is still used in dumb terminals made by DEC.

Migrating away from the 80x86 line for PCs will be very difficult. One reason is the sheer number of programs that run on this platform; even just recompiling each application and distributing the new versions would be very expensive. On top of this, many popular programs were written in assembler and cannot be recompiled. Some programs also access the I/O hardware directly, so even if they were recompiled they still wouldn't work unless one replicated the precise layout and behaviour of e.g. the graphics control registers. Such programs are dying out, but there are still far too many of them out there.

Nevertheless, people can in fact run many PC applications on machines of other architectures by using emulators such as SoftPC and SoftWindows. These simulate (large parts of) the hardware environment as well as the instruction set.

At the end of 1994, Intel and HP joined forces to develop a successor to the x86 architecture, with a target introduction date of 1999 (which has since slipped to 2000). The projected chip, Merced, would retain the ability to execute not only x86 code but HP-PA code as well, a capability that is by itself difficult and expensive to achieve. At the moment it appears (from the few press releases available) that Intel is doing almost all the chip design, although HP undoubtedly made significant contributions in designing the instruction set and the advanced compilation techniques required.

The 320 registers in Merced comprise 128 integer registers, 128 FP registers, and 64 one-bit predicate registers. Each operation can be conditional upon the value in a given predicate register being true. An instruction packet is 128 bits; it contains descriptions of three operations and information about the dependencies among these operations and between these operations and other instruction packets, in a scheme Intel calls EPIC (Explicitly Parallel Instruction Computing).

 

The VAX architecture

The DEC PDP-11 minicomputer series was very popular in the seventies, but its 16-bit architecture limited its usefulness. The VAX line, introduced in 1977, is a culturally compatible 32-bit extension of the PDP product line.

The VAX is a "kitchen sink" architecture: in terms of the operations and addressing modes supported, it is a superset of most other architectures.

Because of its past omnipresence in universities, the VAX-11/780 is often used as a reference machine.

The main characteristics of the VAX architecture are the following.

  • 16 32-bit GPRs; R15 is program counter, R14 is stack top pointer, R13 is stack frame pointer, R12 is argument pointer (arguments may be on the stack but do not have to be)
  • FP values stored in GPRs
  • memory is byte addressable and little endian; addresses are 32 bits and need not be aligned
  • instruction size varies from one byte to many (> 30) bytes
  • orthogonal mem-mem machine; many addressing modes and types (e.g. two 64-bit FP types)
  • branches test condition codes
  • string and decimal instructions

The first byte of each instruction is the opcode; some "extension opcodes" say the opcode is two bytes (the number of instructions is greater than 256).

The opcode specifies the operation and the number and type of the operands.

The opcode is followed by one operand specifier per operand. Each specifier is 1 to 6 bytes, with the first 4 bits giving the addressing mode and hence the size of the specifier.

Variability in instruction execution times is many thousands to one.

 

The 360 architecture

The IBM 360 architecture was introduced in 1964. It was the architecture of the first computer family, a set of computers that could run each other's programs.

The 360 architecture is dominant in the mainframe market, with several plug-compatible vendors (e.g. Amdahl).

The 360 architecture has been extended several times (e.g. from 24 to 32 to 16+32 bit addresses). The later variants are the 370 (1970), 370-XA (1983), 370-ESA (1986) and 390 (1990) architectures.

The 360 architecture was designed to be general purpose, i.e. suitable for both business and scientific computing, at a time when the few existing computer companies (including IBM) had separate lines of computers for each purpose. Hence the name, from 360 degrees.

Gene Amdahl designed several models of the 360 family while at IBM. Later he established his own company to make machines that were equivalent to IBM's in every way, except faster and cheaper.

The main characteristics of the 360 architecture are the following.

  • 16 32-bit GPRs; R0 is 0 when used as index, but not otherwise
  • 4 64-bit FPRs
  • memory is byte addressable and big endian; addresses are 24 bits and must be aligned in 360, 32 bits and need not be aligned in later variants
  • instructions are 2, 4 or 6 bytes
  • nonorthogonal reg-mem machine
  • branches test condition codes
  • string and decimal instructions

Memory addresses are specified as the sum of one or two registers and a 12-bit immediate. Programs need a base register for each 4 Kb chunk of immediately accessible address space.

Five instruction formats: opcode byte says how many operands, what type, and what addressing mode.

RR format: R[R1] op= R[R2]

RX format: R[R1] op= M[R[X2] + R[B2] + D]

RS format: R[R1] = M[R[B2] + D] op R[R3]

SI format: M[R[B1] + D1] = immed

SS format: M[R[B1] + D1] op= M[R[B2] + D2]
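Address formation for the RX format can be sketched as below, for the original 360 with 24-bit addresses. The function name and the list used as a register file are assumptions of the sketch, and it treats a register number of 0 as "no register" in both the index and the base position.

```python
ADDRESS_MASK = 0xFFFFFF   # 24-bit addresses on the original 360

def rx_address(regs, x2, b2, d):
    """Effective address R[X2] + R[B2] + D of an RX format operand."""
    if not 0 <= d < 4096:
        raise ValueError("displacement must fit in 12 bits")
    index = regs[x2] if x2 != 0 else 0   # register number 0 contributes nothing
    base = regs[b2] if b2 != 0 else 0
    return (index + base + d) & ADDRESS_MASK

regs = [0] * 16
regs[0] = 0x999999        # ignored when register number 0 is named
regs[12] = 0x008000       # base register covering one 4 Kb chunk
regs[3] = 0x000010        # index register
```

Because D is only 12 bits, everything beyond 4095 bytes from the base register needs either an index register or another base register.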

 

The 80x86 architecture

Thanks to the IBM PC and its clones, more computers are based on the Intel 80x86 architecture than on any other (yearly sales exceed 100 million).

The 8086 16-bit chip was introduced in 1978. It has since been extended several times, as the 80186, 80286, 80386, 80486, Pentium and Pentium Pro. (The Pentium II is a minor variant of the Pentium Pro. They differ only in their cache architecture, the presence of MMX in the Pentium II, and in that they are offered at different clock speeds. Intel's recently announced Celeron and Xeon processors are also variations on the Pentium II core.) Architecturally, the most significant extensions are the 286 (protection) and the 386 (32-bit addresses).

The 286 and above can (and if running MS-DOS, do) emulate an unprotected 8086. This wastes much of their capabilities.

The following data is for the 8086; its successors have somewhat better characteristics.

  • 14 16-bit registers
  • optional 8087 FP coprocessor has its own registers, which must be accessed as a stack; transfer between 8086 and 8087 is via memory
  • memory is byte addressable and little endian; addresses are 20 bits and need not be aligned
  • instructions are 1 to 6 bytes
  • nonorthogonal reg-mem machine
  • many instructions have implicit operands
  • string operations
  • 20-bit address is specified via a 16-bit segment address and a 16-bit offset: addr = (seg_base << 4) + offset

  • 1 Mb address space made up of 64 Kb segments
  • four registers hold segment base addresses (code, stack, data, extra); the one to use is implicit in the instruction
  • some instructions (e.g. jumps) have near and far versions, one omitting the segment base and one including it; programmers should not mix the two
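The address computation from the list above can be sketched as follows. The function name is an assumption; the final mask models the 20 address bits of the 8086, so sums beyond 1 Mb wrap around.

```python
def physical_address(segment, offset):
    """20-bit 8086 address from a 16-bit segment base and a 16-bit offset."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & 0xFFFFF   # only 20 address bits exist

a = physical_address(0x1234, 0x0010)
b = physical_address(0x1235, 0x0000)
top = physical_address(0xFFFF, 0x0010)
```

Segments start every 16 bytes and overlap, so many segment/offset pairs name the same byte: a and b above are both 0x12350.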

 

Multi Processor Systems

Why parallel computers?

As processors get faster, it becomes harder to speed them up, both in terms of design effort and manufacturing cost.

For many applications, the simplest and cheapest way to increase computing power is to go to multiprocessor architectures.

In a computer that costs $10,000, the cost of the CPU+cache is usually around $1,000. If performance is CPU limited, it makes sense economically to add another CPU (and maybe another ...).

NOTE Cost and price can be quite different, as we will discuss in slide set 8.

When IBM introduced the PowerPC 604 in June 1995, the price of the 100 MHz 604 chip was $343 US, while the price of the 133 MHz 604 chip was $756, i.e. more than twice the price for at most a 33% speed increase.

As of June 1995, the list price of a DEC AlphaServer 2100 5/250 with a single CPU of about 320 SPECint92 was about $130,000 (AUD), while the list price of a DEC AlphaServer 2100 4/233 with two CPUs each of about 160 SPECint92 was about $80,000. The two configurations are otherwise identical. Obviously, the faster CPU does not cost $50,000; instead, this amount is the premium DEC required customers to pay if they wanted the fastest available CPU on the market. (SPECint92 is a measure of performance; see slide set 8.)

The price of the second CPU in the second configuration is about $13,000. The CPU and its board do not cost this much; instead, some of this amount is the premium DEC required customers to pay if they want the convenience of two CPUs in one box (where they are easier to administer, where load balancing is automatic, etc). Part of this premium is justified by the design cost of extending a single-processor design to a multiprocessor design.

 

Applications

Some applications are inherently parallel, e.g. image processing, weather forecasting. Parallel algorithms have been devised for many others, e.g. matrix multiplication.

Some parallel machines are designed instead for timesharing systems with large numbers of users. Different processors execute one sequential program each, the programs usually (but not always) belonging to different users.

Were it not for Amdahl's law, N processors could be N times faster than one.

NOTE One way that a single user can exploit a parallel machine designed for timesharing is through the use of a version of the Unix make program, and its equivalents on other systems, that can manage several independent tasks. For example, when several modules of a multi-module program must be recompiled, the recompilations of different modules are usually independent tasks that can be done in parallel.

 

Speedup

Different programs experience different speedups when executed on parallel machines. The graph shows the speedup limit and two speedup curves. Most applications fit between these two curves.
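The speedup limit and the curves below it can be approximated numerically with Amdahl's law. The following is a minimal sketch, not a reconstruction of the graph on the slide: the function name amdahl_speedup and the parallel fractions 0.95 and 0.80 are assumptions chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law when `parallel_fraction` of the
    single-processor run time can be spread evenly over `processors` CPUs
    and the remainder stays sequential."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Columns: processors, speedup limit, 95% parallel, 80% parallel.
    # The 0.95 and 0.80 fractions are illustrative assumptions.
    for n in (1, 2, 4, 8, 16):
        print(n, n, round(amdahl_speedup(0.95, n), 2), round(amdahl_speedup(0.80, n), 2))
```

A fully parallel program reaches the limit of N, while even a small sequential part keeps the curve well below it as N grows.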

NOTE Some applications apparently exhibit superlinear speedups, e.g. a speedup of ten on an eight processor machine. Such results appear either because the algorithm was changed during parallelization, or because the program exploits a resource that was duplicated along with processors.

For instance, a data set that doesn't fit into one cache may fit into eight caches. If it does, the speedup from that effect alone may be as high as the ratio between cache speed and memory speed. When joined to the speedup achieved by parallelization, this can lead to superlinear speedups.

 

Flynn classification

Flynn classified parallel machines into four classes depending on the number of instruction and data streams:

SISD

SIMD

MISD

MIMD

You can think of an instruction stream as a control unit that processes branch instructions, and a data stream as an ALU with memory (e.g. registers).

SISD refers to conventional computers, even those employing pipelining and similar techniques. There are no machines widely accepted to be MISD.

 

SIMD machines

For applications with lots of data parallelism, the most cost effective platforms are SIMD machines. In these machines, a single control unit broadcasts (micro-) instructions to many processing elements (datapaths with local memories) in parallel.

The best known SIMD computer is the Connection Machine from Thinking Machines. The CM-2 model has 64K PEs, and even though each PE is only four bits wide, the machine can outperform many big Crays on some specially programmed problems.

NOTE An application is data parallel if it wants to do the same computation on lots of pieces of data, which typically come from different squares in a grid. Examples include image processing, weather forecasting, and computational fluid dynamics (e.g. simulating airflow on a car or inside a jet engine).

SIMD machines cannot use commodity microprocessors, one reason being that it would be very difficult to modify these to broadcast their control signals to a multitude of processing elements. The companies that design SIMD machines have all designed their own processing elements and control units. The processing elements are usually slower than ordinary microprocessors, but they are also much smaller, which makes it possible to put several on a single chip.

Since the CPUs are nonstandard, SIMD machines need their own compilers and other system software. The costs of designing the CPU and this system software add significantly to the up-front investment required for the machine. Due to the multi-million dollar price tags of SIMD machines, this investment has to be recovered from a relatively small number of customers, so each customer's share of the development cost is quite high.

SIMD machines were reasonably popular in the late eighties; at least as popular as machines with multi-million dollar price tags could be. However, the difficulty of programming them and their specialized nature (their price/performance is abysmal for any job that is not data parallel) led to the demise of the companies that designed and sold them.

 

Shared memory multiprocessors

Most multiprocessors on the market today are shared memory MIMD machines. They are built out of standard processors and standard memory chips, interconnected by a fast bus (memory is interleaved).

NOTE The use of standard components is important because it keeps down the costs of the company designing the multiprocessor; the development cost of the standard components is spread out over a much larger number of customers.

In theory, the interconnection network can be something other than a bus. However, for cache coherence (covered a few slides later on), you need an interconnection network in which each processor sees the traffic between every other processor and memory, and all such interconnection networks are either buses or equivalent to buses.

 

Bus bandwidth

In an N-processor machine, the bus needs to have N times the bandwidth of the bus of a single processor machine.

Bus bandwidth can be increased by making the bus wider and/or faster. The first is expensive; the second is impossible beyond a point.

To exploit the available bandwidth as effectively as possible, multiprocessor buses usually have split transactions. This means that bus transactions such as memory access are split into two parts, a request and a response, and (parts of) other transactions may occur between the two.

NOTE Sun has two buses with a 320 Mb/s peak bandwidth. The Mbus, intended for uniprocessors, does not have split transactions; its typical actual bandwidth is about 100 Mb/s. The XDbus, intended for multiprocessors, does have split transactions; its typical actual bandwidth is about 250 Mb/s.

 

Bus contention

If one processor can saturate the bus, there is no point in adding another; it will spend all its time waiting for the bus to become free.

If a processor requires 1/N of bus bandwidth, there is no point in having more than N processors. Due to fluctuations in demand, N processors may not all be useful either.

Because of contention for the bus and for memory banks, adding a second CPU multiplies the power of the system only by 1.8 to 1.95. Further CPUs add less and less power.

 

Caches

Including a cache on each processor reduces the bandwidth requirement by a large factor, and helps processors to tolerate memory latency (which is usually greater than in uniprocessor systems).

However, a computer with multiple caches must ensure that the contents of these caches are consistent with one another. This means it must keep track of which copy of a memory location is the latest, and prevent the use of out of date copies.

This applies to TLBs as well, although TLB consistency is usually maintained at least partly by the OS.

NOTE There are two reasons why memory latency is usually higher in multiprocessors than in uniprocessors. First, multiprocessors must use some kind of protocol on the bus to avoid more than one CPU using the bus at the same time and to share the bus fairly between the CPUs, and this protocol adds overhead. Second, multiprocessors typically allow bigger maximum memory sizes. This means that buses are longer (and hence slower), and that the memory needs ECC (error correcting code) for reliability (which slows it down).

In this context, "consistency" and "coherence" mean the same thing.

 

Cache coherence

Processors P1 and P2 are both accessing location 100. Both of their caches contain the initial value of location 100, which is 5.

When P1 updates location 100 by storing 6 into it, P2 should be able to see the change; its next reference to location 100 should see the new value.

P2 cannot find out for itself at the next reference whether its copy is stale (out of date) without destroying the cache's speed advantage over memory.

Hence P2 must be told of the change when it happens.

 

Bus watching

The simplest algorithm for cache coherence requires the caches to be write through.

The idea is that each cache watches the bus (snoops on it). Every time a word is written to memory by some other processor, the cache checks whether it contains a copy of the affected location. If it does, then it can either update its copy or invalidate it (write update vs write invalidate).

Snooping caches have duplicate sets of tags, for concurrent use by the CPU and the bus watcher.

NOTE Write update has the advantage that the next access to the affected cache block will not cause a miss, but it requires more complex circuitry to implement and requires access to the cache data (not just the tags) for one cycle, which usually means the CPU cannot access the cache that cycle. This is why most current multiprocessors use write invalidate.
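The bus watching idea can be sketched as a small simulation of write through caches with the write invalidate policy. This is a toy model under stated assumptions: the class names Bus and SnoopingCache and their methods are hypothetical, each cache line holds a single word, and timing, tags and replacement are ignored.

```python
class Bus:
    """Shared bus: every write reaches memory and is seen by all caches."""

    def __init__(self):
        self.memory = {}      # address -> value
        self.caches = []      # all caches attached to this bus

    def write(self, writer, address, value):
        self.memory[address] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)


class SnoopingCache:
    """Write through cache that invalidates its copy when it sees
    another processor write to the same location."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}       # address -> cached value
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:              # miss: fetch from memory
            self.lines[address] = self.bus.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value
        self.bus.write(self, address, value)       # write through

    def snoop_write(self, address):
        self.lines.pop(address, None)              # drop a stale copy, if any


if __name__ == "__main__":
    bus = Bus()
    bus.memory[100] = 5
    p1, p2 = SnoopingCache(bus), SnoopingCache(bus)
    p1.read(100)
    p2.read(100)
    p1.write(100, 6)
    print(p2.read(100))   # P2's copy was invalidated, so this misses and refetches
```

A write update variant would replace the pop in snoop_write with a store of the new value, at the cost of touching the cache data rather than only the tags.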

 

Snooping caches

 

Write back caches

Write through caches access the bus whenever their processor does a store. Since store operations are relatively frequent (about 1 in 8 instructions), they cannot decrease bandwidth requirements very much. This limits the number of processors on the bus to about four.

Write back caches do not suffer from this limitation, but their consistency cannot be assured by a simple bus watcher. They can be adapted for multiprocessors by means of ownership schemes, in which the most up to date version of a location may be in a cache.

 

Ownership schemes

The state of each cache line says whether this cache "owns" the location. If it does, then no other cache may have a copy of that location, and the version in main memory may be out of date. Writes to such locations do not have to do write through.

If the location is shared by multiple caches, then none of the caches may own the location. Writes to that location go through to memory.

The write through enables the other caches to find out about the update through snooping. With the write invalidate policy, the writing cache can then own the location.

NOTE The name "ownership" is misleading in one sense. A cache does not own a location until it decides to give it away; a cache owns a location only until some other cache accesses that location. Then it must give up its ownership.

 

Cache consistency protocols

There are many protocols for cache coherence; these are sets of rules about which cache has to do what, when, and under what circumstances.

The most popular protocols are called MESI protocols. In these, each cache line is in one of four states:

Invalid: no data.

Shared: clean data that may be present in other caches.

Exclusive: clean data that is present only in this cache.

Modified: dirty data that is present only in this cache.

NOTE In MESI protocols, a cache owns a block if in that cache the state of the block is either Exclusive or Modified.

All caches support two states: invalid and valid. Write back caches support three states: invalid, valid and clean, valid and dirty. MESI protocols split the "valid and clean" state into two based on whether other caches may have a copy of the block.

Some advanced protocols also split the "valid and dirty" state into two, and allow several caches to contain the same dirty data, with one cache (the original writer) being responsible for updating main memory when the block is evicted. The state of the block in the original writer is called Owned, and therefore such protocols are called MOESI. (The name of this state is confusing, since if a block is in the Owned state in one cache, other caches may have copies of that block in the Shared state. Since according to our previous definition, a cache can own a block only if no other cache has a copy of that block, in a MOESI system a cache does not own a block if in that cache the state of the block is Owned! All it owns is the responsibility of updating memory.)

From CPU:

 

From bus:

 

NOTE The states of this protocol are Invalid, Clean and Dirty. Dirty corresponds to the Modified state of MESI protocols, while Clean corresponds to both the Shared and Exclusive states; this protocol treats all clean blocks as potentially shared. In this protocol, a cache owns a block if in that cache the block is in Dirty (Modified) state.

This protocol is for a cache with write allocate. Therefore write misses cause the affected block to be read in from memory, just as read misses do. The protocol requires that bus transactions that read from memory include a bit that specifies whether the read is to service a read miss or write miss.

Some states do not say what should happen on some events. This is either because the event is impossible or because the event leaves the cache block in the same state as before.

There can be no hits for blocks in the Invalid state. There can be no invalidate signals for blocks in the Dirty state.

A read hit leaves the block in the same state whether the original state was Clean or Dirty. A write hit leaves the block in the same state when the original state was Dirty. A read miss in the Clean state leads to a replacement without writeback, and the state of the cache block stays Clean. A write miss in the Dirty state leads to a replacement with writeback, and the state of the cache block stays Dirty.

Read misses on the bus leave a Clean block Clean, and all bus activity leaves an Invalid block Invalid.

Full MESI protocols have four states, while MOESI protocols have five states. Their transitions are correspondingly more complex.
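The three-state protocol described in the notes above can be written down as a transition table. Since the two state diagrams did not survive, this is only a sketch: entries commented "text" restate sentences from the notes, while entries commented "assumed" are filled in from the usual behaviour of a three-state write invalidate protocol with write allocate and may differ from the original diagrams. The names CPU_TRANSITIONS, BUS_TRANSITIONS and step are hypothetical.

```python
INVALID, CLEAN, DIRTY = "Invalid", "Clean", "Dirty"

# (current state, event) -> (next state, action or None)
CPU_TRANSITIONS = {
    (INVALID, "read miss"):  (CLEAN, "read block from memory"),        # assumed
    (INVALID, "write miss"): (DIRTY, "read block for a write miss"),   # assumed
    (CLEAN, "read hit"):     (CLEAN, None),                            # text
    (CLEAN, "read miss"):    (CLEAN, "replace without writeback"),     # text
    (CLEAN, "write hit"):    (DIRTY, "send invalidate on the bus"),    # assumed
    (CLEAN, "write miss"):   (DIRTY, "replace without writeback"),     # assumed
    (DIRTY, "read hit"):     (DIRTY, None),                            # text
    (DIRTY, "write hit"):    (DIRTY, None),                            # text
    (DIRTY, "read miss"):    (CLEAN, "replace with writeback"),        # assumed
    (DIRTY, "write miss"):   (DIRTY, "replace with writeback"),        # text
}

BUS_TRANSITIONS = {
    (CLEAN, "read miss"):  (CLEAN, None),                   # text
    (CLEAN, "write miss"): (INVALID, None),                 # assumed
    (CLEAN, "invalidate"): (INVALID, None),                 # assumed
    (DIRTY, "read miss"):  (CLEAN, "write block back"),     # assumed
    (DIRTY, "write miss"): (INVALID, "write block back"),   # assumed
}


def step(state, event, from_bus=False):
    """Return (next state, action) for one event on one cache block."""
    if from_bus and state == INVALID:
        return INVALID, None       # text: bus activity leaves Invalid blocks Invalid
    table = BUS_TRANSITIONS if from_bus else CPU_TRANSITIONS
    if (state, event) not in table:
        # e.g. a hit on an Invalid block, or an invalidate for a Dirty block
        raise ValueError(f"impossible event {event!r} in state {state}")
    return table[(state, event)]
```

In this sketch a cache owns a block exactly when step leaves it in the Dirty state, which matches the ownership rule stated in the note.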

 

Two level caches

Multiprocessors have had two level caches for longer than uniprocessors. They need a second level not only to reduce access time but also to reduce bus traffic.

The first level usually consists of small I & D caches that do write through to the second level, which is a large unified write back cache with a large block size.

If the two levels satisfy the inclusion property, i.e. everything in the primary cache is also in the secondary cache, all snooping can be done by the secondary cache (possibly with one set of tags).

NOTE Most of the time the CPU accesses only the primary cache. Therefore the probability that both the CPU and the bus watcher need to use the tags on the secondary cache at the same time is low. It may be low enough for the designer to decide to stall the CPU when this happens instead of incurring the cost of a second set of tags on the secondary cache.

The reason why the inclusion property is important is that one doesn't want the primary cache to be involved with snooping. The primary cache ought to have a fast hit time, and complicating the logic to allow snooping could slow it down.

However, inclusion also has a bad side. By forcing the L1 cache contents to be a subset of the L2 cache contents, it limits the maximum useful associativity of the L1 cache. Consider three locations mapping to the same set in the L1 cache that are frequently used together. If the L1 cache is e.g. four-way associative, all three locations can be in the L1 cache at the same time. However, if any two of these locations map to the same set in the L2 cache, and the L2 cache is direct mapped, then only one can be present in the L2 cache at any given time. On a cache system with the inclusion property, this implies that only one can be present in the L1 cache at any given time, which means that an access to the other has to go all the way to memory. This can be a big performance hit. This is why several modern CPUs (e.g. Pentium II, PowerPC 750) that have a highly associative (e.g. 8-way) on-chip L1 cache but an off-chip L2 cache with much lower associativity (direct-mapped or 2-way) do not maintain the inclusion property and instead snoop both the L1 and L2 caches. (The reason why L2 caches have low associativities is that the number of pins needed to transfer a block from an off-chip cache to the CPU is quite large. Given the number of pins available in packages with an acceptable price, designers working with current technologies cannot afford enough pins to transfer more than one or two blocks at a time.)

 

A state of the art example

The CPUs in a Sun HPC 4500 multiprocessor are 336 MHz UltraSPARC IIs. Assuming about 1.5 instructions per cycle, and that there are about 1.3 memory references per instruction, each CPU makes about 655 million memory accesses per second, which (at four bytes per access) needs 2.6 Gb/s of bandwidth.

The memory bus achieves a maximum bandwidth of 2.6 Gb/s. However, all CPUs (up to 14 of them) share the same memory bus, as do all the I/O devices. The cache system must therefore have a miss rate that is significantly smaller than 5%; this is why it uses 4 Mb L2 caches.
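The arithmetic behind these figures can be checked with a few lines. This is a rough sketch: the four bytes per access is an assumption inferred from the ratio of the two quoted numbers, the bandwidth figures are treated as bytes per second, and the model ignores I/O traffic and the fact that a miss transfers a whole block rather than one word.

```python
clock_hz = 336e6                 # 336 MHz UltraSPARC II
instructions_per_cycle = 1.5     # from the notes
refs_per_instruction = 1.3       # from the notes
bytes_per_access = 4             # assumption, not stated in the notes
cpus = 14                        # maximum configuration
bus_bandwidth = 2.6e9            # peak memory bus bandwidth, bytes per second

accesses_per_second = clock_hz * instructions_per_cycle * refs_per_instruction
demand_per_cpu = accesses_per_second * bytes_per_access

# Fraction of each CPU's references the bus could carry if the 14 CPUs
# shared it equally and nothing else used it.
bus_share = bus_bandwidth / (cpus * demand_per_cpu)

print(f"{accesses_per_second / 1e6:.1f} million accesses/s per CPU")
print(f"{demand_per_cpu / 1e9:.2f} Gb/s demanded per CPU")
print(f"at most {bus_share:.1%} of references can go to the bus")
```

Under these idealized assumptions the bound comes out at roughly 7% of references; I/O traffic and block-sized transfers push the tolerable miss rate lower, in line with the requirement stated above.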

 

Synchronization code

P1                      P2

A = FALSE;              B = FALSE;
...                     ...
A = TRUE;               B = TRUE;
if (B==FALSE)           if (A==FALSE)
{                       {
    critical section        critical section
}                       }

The intention of this code is to make sure that at most one processor is executing inside the critical section at any one time. (The code is part of Dekker's algorithm.)

 

Consistency models

With sequential consistency, at most one if statement will succeed, because the result must be as if the load/store streams of all the processors were somehow interleaved.

With processor consistency, a processor (e.g. P2) is allowed to satisfy a load of e.g. A from its cache while a store of e.g. B is outstanding in P2's write buffer, and so both ifs may succeed. This violates the assumptions behind the synchronization code.
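The difference between the two models can be illustrated by enumerating every order in which the two stores become visible in memory and the two loads are performed. This is a simplified sketch, not a model of any real machine: the function name dekker_outcomes is hypothetical, and a write buffer is modelled only by allowing a processor's store to become visible after its own later load.

```python
from itertools import permutations


def dekker_outcomes(allow_write_buffering):
    """Return the set of (P1 enters, P2 enters) results over all orderings.

    "store" is the moment a processor's flag (A for P1, B for P2) becomes
    TRUE in memory; "load" is the moment it reads the other flag."""
    events = [("P1", "store"), ("P1", "load"), ("P2", "store"), ("P2", "load")]
    results = set()
    for order in permutations(events):
        if not allow_write_buffering:
            # Sequential consistency: each store is visible before the
            # same processor's following load.
            if any(order.index((p, "store")) > order.index((p, "load"))
                   for p in ("P1", "P2")):
                continue
        memory = {"A": False, "B": False}
        enters = {}
        for processor, kind in order:
            mine, other = ("A", "B") if processor == "P1" else ("B", "A")
            if kind == "store":
                memory[mine] = True
            else:
                enters[processor] = not memory[other]   # the if succeeds
        results.add((enters["P1"], enters["P2"]))
    return results


if __name__ == "__main__":
    print("sequential:", sorted(dekker_outcomes(False)))
    print("buffered:  ", sorted(dekker_outcomes(True)))
```

In this model the result (True, True), i.e. both processors inside the critical section, never appears under sequential consistency but does appear once stores may be delayed in a write buffer.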

NOTE Caches that can continue to service accesses that are hits while servicing misses are said to be capable of hit-under-miss. Caches that can continue to service all accesses, both hits and misses, while servicing earlier misses are said to be capable of miss-under-miss. Several current CPUs can do hit-under-miss, and according to rumours some CPUs currently being designed will be able to handle miss-under-miss. Caches that can do either are sometimes called lockup-free. They are important because the CPU speed / memory speed ratio is increasing, and waits for memory are therefore becoming relatively more damaging to fast CPUs.

 

Weak consistency

In Alpha, a memory barrier instruction ensures that subsequent loads and stores will not access memory until after all previous loads and stores have accessed memory, as observed by other processors. Between barriers, the CPU may reorder loads and stores to different locations and may optimize some away.

When writing concurrent programs (especially the synchronization code), programmers must be aware of the consistency model of their system. Sequential consistency is the most convenient, but it can limit performance.

NOTE To make Dekker's algorithm work correctly even though the Alpha uses weak memory consistency, the programmer must place a memory barrier instruction just before each if statement.

 

Symmetric vs asymmetric

Some early multiprocessors directed all interrupts to a single master processor, which ran the operating system; the other slave processors executed only user mode code. In such asymmetric systems, the master can become a bottleneck.

Modern symmetric multiprocessors allow all CPUs to execute in kernel mode. They direct interrupts to an idle CPU, or to the CPU executing the lowest priority process.

This requires arbitration; some machines have a separate bus for this purpose.

 

Taxonomy

MIMD machines can be classified according to whether memory is physically distributed among the processors, and whether the processors access the same physical address space:

memory:        shared AS      separate AS
central        prev slides    ELXSI 6400
distributed    Cray T3D       Intel Paragon

Today, most multiprocessors have central memory and a shared address space. In the future, distributed memory machines will become more important; buses are a bottleneck.

 

Distributed memory multiprocessors

In the currently common type, each processor has access only to its own local memory. The processors communicate among themselves by passing messages to each other.

NOTE In theory, the interconnection network can be a bus. However, if the number of processors is small enough for a bus to provide sufficient bandwidth, you are better off building a shared memory multiprocessor, so in practice you will never find a bus as the interconnection network of a distributed memory multiprocessor.

The main drawback of this class of architectures is the difficulty of programming them, since every task has to be decomposed into subtasks that communicate only fairly infrequently; the communication then has to be programmed explicitly.

Their main advantage is that they are scalable: the number of processors can increase beyond what a single bus can support. Typically, the bandwidth of the interconnection network increases with each new CPU.

NOTE There are several popular interconnection network topologies. One of the more fashionable ones is the hypercube. A hypercube is an n-dimensional cube interconnecting 2^n nodes (processing elements). The address of a node is a binary number of n digits. Each node is connected with all nodes from which its address differs by one bit.

To send a message from A to B, pick a bit in which A and B differ, and send the message in that direction; repeat if necessary.
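The routing rule just described can be sketched in a few lines. The notes only say to pick a bit in which the two addresses differ; always correcting the lowest differing bit first is an assumption made here for determinism, and the function name hypercube_route is hypothetical.

```python
def hypercube_route(source, destination):
    """Return the node addresses visited on the way from source to
    destination, fixing one differing address bit per hop."""
    path = [source]
    current = source
    differing = source ^ destination   # bits in which the addresses differ
    bit = 0
    while differing:
        if differing & 1:
            current ^= 1 << bit        # move along dimension `bit`
            path.append(current)
        differing >>= 1
        bit += 1
    return path


if __name__ == "__main__":
    # Route from node 000 to node 101 in a three-dimensional hypercube.
    print([format(node, "03b") for node in hypercube_route(0b000, 0b101)])
```

The number of hops equals the number of bits in which the two addresses differ, so it is at most n in an n-dimensional hypercube.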

In some machines, the CPU forwards messages; in others, this is done by dedicated hardware. Recent systems tend to use hardware routers in each node. With the best ones, the first bit of a message may reach its destination even before the last bit has left its source.

The topology of the oldest distributed memory machines is a two or three dimensional array or mesh, possibly with wraparound connections to create a torus. Some recent machines also use this topology, for example the Cray T3D supercomputer (the name stands for torus, three-dimensional). Since each node in the mesh contains two CPUs (Alphas), the T3D is partly a shared memory and partly (mostly) a distributed memory machine.

Although meshes have less bandwidth than hypercubes, they are more scalable: the number of connections per CPU stays constant even when adding more processors. (An N-dimensional hypercube needs N connections per CPU, so if you want to double the number of CPUs you also need to add a connection to every existing CPU. With (say) a 2D mesh, every CPU needs four connections regardless of the number of CPUs in the system.)
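The scalability argument can be made concrete by counting links per node. This is a small sketch with hypothetical function names; the mesh figure of four assumes wraparound connections, so that edge nodes have as many neighbours as interior ones.

```python
def hypercube_links_per_node(nodes):
    """Links per node in a hypercube with `nodes` nodes (a power of two)."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("a hypercube has 2**n nodes")
    return nodes.bit_length() - 1      # n for 2**n nodes


def mesh_2d_links_per_node(nodes):
    """Links per node in a 2D mesh with wraparound, whatever its size."""
    return 4


if __name__ == "__main__":
    for nodes in (64, 128, 256, 512):
        print(nodes, hypercube_links_per_node(nodes), mesh_2d_links_per_node(nodes))
```

Each doubling of the machine adds one link to every hypercube node, while the mesh column stays at four.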

Other topologies include hierarchies of buses, and switching networks (interconnected crossbars) like the telephone system.

 

Distributed memory, global address space

Some machines, e.g. the Cray T3D (up to 512 Alphas), distribute memory with the processors but still allow processors to address every memory location directly.

Since access to local memory is faster than access to remote memory, these are usually called NUMA (non uniform memory access) architectures.

These machines can run any program a shared memory computer can, but for acceptable performance they require that most references should be to local memory.

 

Directory based coherence

Each processor in a NUMA machine will have a cache, and these need to be kept coherent. Snooping doesn't work since there is no shared bus.

In such systems, every block of memory has an associated directory, a data structure saying which processors have copies of it. On a write to the block, those processors are notified without a broadcast.

Several schemes exist for compacting directories, but they can still be expensive, as the cost is proportional to main memory size, not cache size.
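A directory can be pictured as a map from each memory block to the set of processors holding a copy. The sketch below is a minimal illustration, not the scheme of any particular machine: the class and method names are hypothetical, and it assumes that notifying a sharer means invalidating its copy.

```python
class Directory:
    """Per-block record of which processors hold a copy of each block."""

    def __init__(self):
        self.sharers = {}     # block number -> set of processor ids

    def record_read(self, block, processor):
        """A processor has fetched a copy of the block."""
        self.sharers.setdefault(block, set()).add(processor)

    def record_write(self, block, writer):
        """Return the processors that must be notified of the write, and
        leave the writer as the only recorded holder."""
        to_notify = self.sharers.get(block, set()) - {writer}
        self.sharers[block] = {writer}
        return to_notify


if __name__ == "__main__":
    directory = Directory()
    directory.record_read(7, processor=0)
    directory.record_read(7, processor=3)
    directory.record_read(9, processor=5)
    print(directory.record_write(7, writer=0))   # only processor 3 is notified
```

Processor 5 holds no copy of block 7 and so receives no message, which is the point of avoiding a broadcast. Note that the table has one entry per block of memory, which is why its cost follows main memory size.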

NOTE The first affordable family of computers using directory based coherence was the Origin2000 family from SGI. This family was announced in October 1996.

 

Multithreading

Values loaded into registers tend to be used soon (within 10 instructions or so). A remote memory access may take 200 cycles or more.

To try to keep CPU utilization high, researchers are exploring including in the CPU storage for the state (PC, registers, TLB etc) of more than one thread. When the pipeline stalls, for remote access or not, the CPU switches to executing another thread.

To take advantage of this, applications must be divided into threads by the programmer or by the compiler.

NOTE At the moment, compilers are not yet advanced enough to be able to divide up existing imperative programs into several threads, and requiring the programmers to divide up their programs would cost far too much money. Multithreading will become cost-effective when people start writing significant numbers of applications in non-imperative languages such as Miranda.

However, there is a company called Tera that is selling a machine based on multithreading. Each Tera CPU has room for the state of 128 threads.

 

Dusty decks

The big problem with parallel machines is the large number of existing programs. Parallelizing compilers for Fortran exist and are improving, but they are effective only for small numbers of processors (at most 8). Parallelizing languages with pointers (e.g. C) is difficult.

Most parallelizing compilers work significantly better if the programmer annotates the program with assertions that the compiler cannot deduce itself. E.g. if a[i] is used as an array index in a loop, the compiler probably won't be able to parallelize the loop unless it knows that a[i] != a[j] if i != j.

NOTE Without an expert who knows such facts and appreciates their importance, automatic parallelization doesn't work very well for most dusty deck programs.
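The role of the assertion a[i] != a[j] can be shown with a small experiment: if the index values are all distinct, the loop iterations touch different elements and can run in any order (and hence in parallel); if not, the result depends on the order. The loop body and the function names below are hypothetical, chosen only to make the dependence visible.

```python
def indices_are_distinct(a):
    """True if a[i] != a[j] whenever i != j."""
    return len(set(a)) == len(a)


def run_loop(x, a, order):
    """Execute x[a[i]] = 2 * x[a[i]] + i for each i in the given order,
    returning the updated copy of x."""
    x = list(x)
    for i in order:
        x[a[i]] = 2 * x[a[i]] + i
    return x


if __name__ == "__main__":
    x = [1, 1, 1, 1]
    for a in ([2, 0, 3, 1], [0, 0, 1, 2]):
        forward = run_loop(x, a, range(4))
        backward = run_loop(x, a, reversed(range(4)))
        print(a, indices_are_distinct(a), forward == backward)
```

A compiler usually cannot run such a check at compile time, since the contents of a are only known when the program runs; this is why the programmer's assertion matters.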

The phrase "dusty deck" comes from the image of a card deck that has been left on a shelf long enough to acquire a layer of dust. The implication is that it is a program that works, but no one in the organization is familiar enough with the program to know how or why it works, so modifying it is not an option. This was already a significant problem in the late sixties and in the seventies, when punched cards were the main program storage medium. It is a more significant problem now.

 

Imperative languages

The traditional approach is to write programs structured as a set of sequential processes (threads) with explicit communication. Writing such code is considerably more difficult than normal because of the potential for synchronization errors. Languages (partially) specialized for the task can help (Ada, Occam).

Data parallel languages like HPF (High Performance Fortran) and C* extend a sequential language with constructs that operate on many data items at once, e.g. add two matrices. These are very useful for data parallel problems and not at all useful for anything else.

 

Declarative languages

Functional and logic programming languages do not prescribe a sequence of actions (which is what imperative languages do). Compilers for such languages can generate parallel code whenever they find two tasks that do not depend on each other's results.

The language implementation will find these tasks and handle synchronization; programmers need not worry about this. This is why optimists expect that when parallelism becomes important in the marketplace, more applications will be written in declarative languages.

NOTE Pessimists say that applications will continue to be written in increasingly inappropriate languages for a long time, until people who learned declarative languages in school become senior enough in their jobs to influence language selections.

 

Input Output

Input/Output

I/O is the main differentiator of computer classes. A fast workstation can have as much CPU power as a mainframe, but the mainframe will have 10 to 1000 times more disk, and the disks will be faster and more reliable.

In an I/O bound system, an improvement in CPU speed would mostly cause more waiting for I/O. Many applications are I/O bound due to Amdahl's law, partially because people often overlook I/O when purchasing systems.

NOTE It is commonplace to hear about people who have upgraded their PCs by buying a much faster CPU (e.g. replacing a 133 MHz Pentium with a 333 MHz Pentium II) and then found that the machine felt only a little faster. In many of these cases, upgrading the disk and/or the graphics card would have been much more effective, and usually cheaper.

I/O capacity was the primary consideration on which candidate machines were evaluated when the Department purchased its fileserver in June 1995.

 

I/O performance

The performance of shared I/O devices (e.g. disks) can be modelled by queueing theory. Good latency requires that the queue of requests for the device be empty most of the time; good throughput requires that it be full all the time.
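The tension between latency and throughput shows up even in the simplest queueing model. The sketch below assumes an M/M/1 queue (random arrivals, exponentially distributed service times, one device), which is a choice made here for illustration and not something the notes specify; the function name and the 10 ms service time are likewise hypothetical.

```python
def mm1_response_time(service_time, utilization):
    """Mean time a request spends queued plus being served in an M/M/1
    queue, given the mean service time and the device utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # A device with a 10 ms mean service time at increasing load.
    for utilization in (0.1, 0.5, 0.9, 0.99):
        print(utilization, round(mm1_response_time(10.0, utilization), 1), "ms")
```

Keeping the device busy nearly all the time, which is what high throughput asks for, makes the mean response time many times the bare service time in this model.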

 

Throughput vs latency

Latency is harder to improve than throughput because of the speed of light: "you can't bribe God".

In mainframes and departmental computers, the main goal is usually to maximize throughput; you aim to overlap I/O for one process with computing for another.

In personal machines, and to some extent in transaction processing and in supercomputers, the main goal is to minimize latency. Latencies above ~1s break the users' train of thought; a forecast of yesterday's weather is useless.

NOTE Light travels about 30 cm (one foot) per nanosecond in vacuum. Many signals in I/O systems travel at about one third of the speed of light.

 

I/O and caches

If the OS can guarantee that I/O is performed for a process only when that process is not running, then consistency can be maintained by flushing the cache whenever the CPU switches from one process to another.

An alternative solution is snooping. This is the method of choice in multiprocessors, which already have the required hardware, but it can also be used in uniprocessors.

A more radical solution is to connect the I/O system not to the CPU-memory bus, but to the cache (e.g. RS/6000). However, this clutters the cache with irrelevant data.

 

Balance

A faster processor will require more cache and more main memory. The choice of the amount of main memory is more important due to the large speed gap between memory and disk.

If there isn't enough memory, the disk has to be accessed too often. Beyond a certain memory size it becomes advisable to invest in disk bandwidth due to the unpredictability of future accesses.

The Case-Amdahl rule of thumb from the sixties dictates about 1 megabyte of memory and 1 megabit/s of I/O per MIPS. Today you would buy more memory: it's relatively cheaper.

NOTE MIPS = millions of instructions executed per second. An outdated measure of CPU performance; see slide set 8.

 

Disk technology

Disk access time consists of seek time, rotational latency, and transfer time, plus setup and completion overhead.

Current disks rotate at 7200 rpm (120 rps), giving an average rotational latency of about 4.2 ms (half of the 8.3 ms revolution time). The "average" seek time (calculated over all track pairs) has also come down to about 9 ms.
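The rotational latency figure follows directly from the rotation rate, since on average the disk has to turn half a revolution before the wanted sector arrives. The helper names below are hypothetical, and the transfer and overhead terms default to zero only because the notes give no figures for them.

```python
def average_rotational_latency_ms(rpm):
    """Time for half a revolution, in milliseconds."""
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


def average_access_time_ms(seek_ms, rpm, transfer_ms=0.0, overhead_ms=0.0):
    """Seek time + rotational latency + transfer time + overhead."""
    return seek_ms + average_rotational_latency_ms(rpm) + transfer_ms + overhead_ms


if __name__ == "__main__":
    for rpm in (5400, 7200, 10000):
        print(rpm, "rpm:", round(average_rotational_latency_ms(rpm), 2), "ms")
    print("9 ms seek at 7200 rpm:", round(average_access_time_ms(9.0, 7200), 2), "ms")
```

With a 9 ms average seek, the seek and the rotational latency together come to roughly 13 ms per random access at 7200 rpm, before any transfer time or overhead.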

Most seeks are shorter than average due to OS or DBMS optimization. In two workloads from the text, 24% / 64% of accesses need no seeks, and 23% / 11% need very short seeks (15 tracks or less, out of 1000).

NOTE There are still many disk drives that rotate at 5400 rpm, but some recent ones rotate at 10000 rpm.

Optical drives seek significantly slower than magnetic hard drives, although their rotation rates and hence transfer rates have been increasing fast recently (you can now get 32x CD-ROM drives).

Floppy disk drives are much slower than even optical drives. For example, Compaq's 1994 3.5 inch floppy drive spins at only 300 rpm, has an average seek time of 94 ms, and a maximum transfer rate of 500 Kb/s. Since floppies are not used much these days for anything speed critical, people are usually not willing to pay more for faster floppies, so the vendors do not have much incentive to improve their speed. They are instead concentrating on reducing the price of DVD drives, which in the next few years should take over from floppies as the main removable read/write medium.

 

Disk performance

In some applications involving large datasets, e.g. many number crunching tasks and image processing, the important measure is transfer rate for sequential access.

Transfer rate is limited by rotation rate and angular bit density (which may or may not vary from inner to outer tracks). It can be improved by accessing several platters at once (this requires tight synchronization and is thus expensive).

The I/O bus is another limit on transfer rate. With most current technologies it is reached only if several disks want to transfer on the bus simultaneously.

NOTE The limit on SCSI 1 is about 4 Mb/s, on SCSI 2 it is 5, 10, 20 or 40 Mb/s, depending on the variant. Machines purchased by the Department in 1995 and 1996 mostly have Fast and Wide SCSI 2 buses, with 20 Mb/s throughput. The newer Ultra SCSI and Ultra-2 SCSI disks have faster transfer rates still (20 Mb/s for Ultra SCSI, 40 Mb/s for Wide Ultra SCSI and Ultra-2 SCSI, and 80 Mb/s for Wide Ultra-2 SCSI); their protocols are defined in the SCSI 3 standards.

The limits on the various IDE variants are 2.1 Mb/s to 8.3 Mb/s. For EIDE, they are 11.1 Mb/s to 16.6 Mb/s. For Ultra-ATA, the limit is 33.3 Mb/s.

Currently used SCSI variants allow commands to be issued and results to be reported while other operations are in progress (a capability similar to split transactions). This allows, for example, several seeks to be overlapped. For applications that involve significant disk activity, SCSI is therefore superior to disk interfaces that lack this capability, which includes most other kinds of disk interfaces used in PCs.

 

Disk performance

In most other application areas, e.g. file serving and transaction processing, the important measure is I/O operations per second (iops), i.e. number of random blocks accessed per second.

Iops can be improved by adding more arms to the disk, but this is very expensive, and only a few mainframe drives do it.

To get the same effect, one may use several smaller disks instead of one larger disk. This has its own cost, since large drives are cheaper per byte.

NOTE In both cases, the problem is the cost of the extra electronics and precision mechanical components required.

The dominant cost of an I/O operation to a random block is the seek time and rotational latency required. Once this cost is paid, accessing neighbouring blocks after the random block has a relatively very low cost.

 

Striping

With a straightforward disk setup, some disks will tend to get more requests than others, so processes wait for that disk while other disks are idle. At different times different disks may be the bottleneck.

Striping or interleaving over two disks puts odd numbered blocks on one disk and even numbered blocks on the other disk.

This evens out the load; it also doubles transfer rate for sequential access. However, the two disk heads must sometimes move together, and this lowers the maximum I/Os per second the combination can do.

NOTE The lowering comes about in situations where there are requests for several consecutive blocks in quick succession. Without striping, one disk would perform a seek and then transfer all the blocks on that track while the other disk is free for other activity. With striping, both disks must perform the seek.

Suppose two disks can both do 40 I/O operations per second, and that one quarter of operations are for sets of consecutive blocks while three quarters are for individual blocks (no neighbouring blocks accessed in the near future).

A non-striped disk pair can do 80 I/O operations per second provided the disks are evenly loaded. If one disk is accessed only one half as frequently as the other, the resulting performance will be only 60 I/O operations per second for the pair. If one disk is accessed only one quarter as frequently as the other, the resulting performance will be only 50 I/O operations per second for the pair.

If the pair were striped, the load on the two disks would be even, so the pair could do 80 seeks per second. However, of every 4 I/O operations coming to the pair, 1 would cause seeks on both disk arms, so 4 operations require a total of 5 seeks. The striped pair is therefore capable of 80 * 4/5 = 64 I/O operations per second.
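The comparison above can be reproduced with a small model. This is a sketch under the same simplifying assumptions as the text (the busier disk of an unstriped pair saturates first; a request for consecutive blocks costs a striped pair two seeks); the function names are invented for this example.

```python
def unstriped_pair_iops(per_disk_iops: float, load_ratio: float) -> float:
    """Throughput of two independent disks when the less busy disk receives
    load_ratio (between 0 and 1) times as many requests as the busier one.
    The busier disk saturates first, so it sets the pace for the pair."""
    return per_disk_iops * (1.0 + load_ratio)


def striped_pair_iops(per_disk_iops: float, consecutive_fraction: float) -> float:
    """Throughput of a striped pair when consecutive_fraction of the requests
    touch both disks and therefore cost two seeks instead of one."""
    seeks_per_request = 1.0 + consecutive_fraction
    return 2.0 * per_disk_iops / seeks_per_request


if __name__ == "__main__":
    for ratio in (1.0, 0.5, 0.25):
        print(f"unstriped, load ratio {ratio}: {unstriped_pair_iops(40, ratio):.0f}")
    print(f"striped, 1/4 consecutive: {striped_pair_iops(40, 0.25):.0f}")
```

In this model, striping beats the unstriped pair whenever the load imbalance is worse than the seek overhead it introduces: 64 operations per second is better than the 60 or 50 of the unevenly loaded pairs, but worse than the 80 of a perfectly balanced one.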

 

Striping unit choice

Striping can also be done by interleaving units other than blocks, e.g. tracks, cylinders, groups of 16 sectors etc.

The bigger the unit, the more likely that the file being accessed fits into the unit, and thus the less likely that the other disk must be accessed.

Accessing the other disk can increase sequential performance on that file, but it can also interfere with unrelated accesses.

Unit size selection thus requires a tradeoff. Track interleaving is often a good compromise for timesharing.

NOTE Block interleaving puts odd numbered blocks on one disk and even numbered blocks on the other disk. Track interleaving puts odd numbered tracks on one disk and even numbered tracks on the other disk. Similarly with cylinders.

One can also do striping on more than two disks.
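The mapping from a logical block number to a disk can be sketched as follows. This is an illustrative model, not the layout of any particular striping driver: blocks are numbered from 0, the unit size is given in blocks, and the names locate and disks_touched are made up for the example.

```python
def locate(block: int, unit_blocks: int, disks: int = 2) -> tuple[int, int]:
    """Map a logical block number to (disk index, block number on that disk)
    when units of unit_blocks consecutive blocks are dealt out round-robin."""
    unit, offset = divmod(block, unit_blocks)
    disk = unit % disks
    unit_on_disk = unit // disks
    return disk, unit_on_disk * unit_blocks + offset


def disks_touched(first_block: int, count: int, unit_blocks: int,
                  disks: int = 2) -> set[int]:
    """Set of disks that must be accessed to read count consecutive blocks."""
    return {locate(b, unit_blocks, disks)[0]
            for b in range(first_block, first_block + count)}


if __name__ == "__main__":
    # A 4-block file with block interleaving versus 64-block (track-sized) units.
    print(disks_touched(0, 4, unit_blocks=1))
    print(disks_touched(0, 4, unit_blocks=64))
```

With a unit of one block, the four-block file needs both disks; with a 64-block unit it fits into one unit and leaves the other disk free for unrelated requests.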

Modern 2 Gb disks contain 64 to 96 512-byte sectors per track, so a track contains 32 to 48 Kb. These disk drives usually contain about 15 surfaces, so a cylinder (a group of tracks above each other) contains about 450 to 750 Kb. This means that a 2 Gb disk needs about 2500 cylinders. (To increase disk drive capacity, manufacturers usually increase both the number of tracks and the capacity of each track, so large disk drives tend to have larger tracks.)

With block interleaving, all files bigger than one disk block will require both disks to be accessed; this represents between 20% and 50% of all files. Files bigger than one cylinder are relatively rare (less than 1% of files). This is important because Unix applications tend to read entire files.

Figuring out the best striping arrangement is at the moment more an art than a science: frequently the best thing to do is simply to try a few configurations.

Muse, the department's old fileserver, used track-based striping for several filesystems, including /home/stude. The new fileserver, munkora, uses RAID 5, an extension of striping.

 

Disk caches

The simplest type is a track buffer. After one revolution, this will contain all data on the current track, so later requests incur no rotational latency. This is particularly effective if the controller can reorder requests (the OS does not know the current rotational position of the disk).

Some controllers use their memories as a cache to speed up reads (although the OS is better at this than the controller). Unless the cache is battery backed, it cannot speed up writes: a crash between reporting completion and the actual write would lose data.

NOTE Disks with caches appeared in the PC market only recently. They have been in the disks for the Unix market for years and the disks for the mainframe market for more than a decade.

 

I/O buses

I/O devices can be connected to the CPU and memory via an adapter that connects to the CPU-memory bus. This adapter is shared between several I/O devices; they communicate with the adapter over an I/O bus. The standard I/O bus on Unix systems is SCSI.

Many systems have a mezzanine bus that connects to the memory bus on the one hand and to several I/O buses on the other hand. High-speed I/O devices (e.g. graphics) can plug into this bus. Sun has used SBus as its mezzanine bus since the eighties; many companies (Intel, DEC, etc) are moving to PCI.

NOTE CPU speeds are increasing very fast, and memory systems must change to meet the growing demand from CPUs (e.g. by introducing interleaving, increasing bus width etc). Therefore the CPU-memory bus of a product one year will differ from the CPU-memory bus of the product last year. On the other hand, I/O devices are improving much more slowly than CPUs, and companies would like their customers to be able to use peripheral devices developed for last year's product in the new product. They can achieve this aim by connecting peripherals to a mezzanine bus and changing mezzanine buses only very rarely (once a decade or so is not unheard of).

The specification of SCSI (the Small Computer System Interface) is now being updated for the second time. SCSI 2 has several variants that differ in maximum clock speed (normal or fast) and the width of the interface (8, 16 or 32 bits; narrow, wide and double wide). The draft SCSI 3 standards also allow still higher clock speeds (usually called Ultra SCSI).

Some supercomputers use an I/O bus called Hippi (High Performance Peripheral Interface).

 

RPS miss

A disk should transmit data being read as soon as it comes under the disk head. If the I/O bus is busy then, an RPS (rotational position sensing) miss occurs.

To avoid having to wait for a full rotation, some disks include a buffer to hold the entire block (or track), and transmit the data from there when the bus is free. In IBM mainframes, buses (strings) are replicated, so that a disk can almost always find a free string when needed (this is dynamic path reconnection).

 

Mainframe I/O

Mainframes attempt whenever possible to reduce latency and provide fault tolerance. Replicated paths can do both.

Multiple independent paths to memory (which must be multiported) are very expensive, but they allow several disks, CPUs, terminals etc to talk to memory at a time (each to a different bank).

Disks are sometimes mirrored and connected to two machines (main and standby). Mirroring protects against disk crashes, multiporting against CPU/OS crashes. Putting one disk of each pair with each CPU protects against disasters.

 

Redundant Array of Inexpensive Disks

RAID is a set of techniques to use several disks to improve reliability and performance.

RAID 1 is mirroring. Writes go to both disks simultaneously, reads can come from either.

This improves I/Os per second for reads. The number of I/Os per second for writes and the transfer rate are unaffected.

NOTE Some people call a striped set of disks a RAID-0 system, but since that arrangement has no redundancy, this is misleading.

The problem with RAID technology is that although the disks may be inexpensive, the array is not. When RAID was new to the Unix market in the early nineties, a RAID setup cost two to four times as much as a set of conventional disks of equivalent storage capacity. With N=5, the expected cost overhead for RAID-5 is 20%, not 100% to 300%. One part of the extra cost was the electronics and software required for a RAID controller, another part was the higher price of the extra-reliable drives that were typically used in RAIDs. Yet another part was the normal premium for a new product. Nowadays even commodity drives are reliable enough to be used in RAIDs, and the price premium has been significantly reduced. A RAID setup now costs only about 50% more than a simple disk setup. Of that 50%, 20 percentage points pay for the extra drives required for redundancy, the rest being for the RAID controller software, extra electronics, redundant power supplies and fans, and maybe hot swap capability (i.e. the ability to insert and remove disk drives while the computer is up and running and without any disruption to its operation).

The RAID techniques are normally applied to disks, but they can be applied to tapes as well; such systems are called RAITs, Redundant Arrays of Inexpensive Tapes. They can protect against losing backups because of tape media failure. However, while most servers have five or more disk drives, most have only one or two tape drives, and one cannot apply RAIT techniques with only one or two tape drives. Therefore use of RAIT is mostly restricted to quite big installations.

 

RAID 2 & 3

RAID 2 and 3 interleave data at the bit (2) or byte (3) level across all disks, with ECC (2) or parity (3) protection. The heads in all disks move together, which requires special drives. Both reads and writes go to all disks simultaneously.

The TMC Data Vault is a RAID 2 with 32 data disks and 8 ECC disks. It accesses 32 bytes while one disk can access one byte, so the transfer rate is 32 times as much. The I/Os per second rate is the same as for one disk, although each I/O gets 32 times as much data.

 

RAID 4

In RAID 4, data is block interleaved (striped) on all disks, with a dedicated disk holding parity. Reads read from one data disk. Writes must read the data block and the parity block, compute new parity, and then write the data block and the parity block.

For reading, RAID 4s have all the advantages of striping: evening out the load and increasing the transfer rate. However, since all writes affect the parity disk, this disk becomes a bottleneck, and writes are even slower than on one disk.

NOTE The bit string 01101 has odd parity because it has an odd number of bits set to 1. The idea of parity bits is to add to a group of N bits an extra parity bit such that the N+1 bits have a known parity, e.g. odd parity. This way, if any one bit is corrupted, the corruption can be detected. If the different bits are on different disks, and it is known which disk failed, then one can reconstruct the missing bit, e.g. given the data 01?01 in an odd-parity system the missing bit must be 1.

Given (e.g.) four data bits a, b, c and d, one can compute the parity bit via the expression 1^a^b^c^d, where ^ is the C exclusive or operator; the expression is easily generalized.

When replacing a data bit d1 with a new data bit d2, one can retain correct parity by making sure that d1^p1 == d2^p2, where p1 and p2 are the old and new values of the corresponding parity bit. p2 can then be computed as p2 = d1^p1^d2.
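The three parity operations described in this note (computing the parity bit, recovering a missing bit, and updating parity after a write) can be checked with a short sketch. Python's ^ operator is the same exclusive or as in C; the function names are invented for this example.

```python
from functools import reduce
from operator import xor


def odd_parity_bit(data_bits):
    """Parity bit that makes the number of 1s in data + parity odd (1^a^b^c^d)."""
    return reduce(xor, data_bits, 1)


def recover_missing_bit(known_bits):
    """Given every bit of an odd-parity group except one (the parity bit is
    included among the known bits), return the value of the missing bit."""
    return reduce(xor, known_bits, 1)


def updated_parity(old_data, old_parity, new_data):
    """New parity after one data bit changes: p2 = d1 ^ p1 ^ d2."""
    return old_data ^ old_parity ^ new_data


if __name__ == "__main__":
    data = [0, 1, 1, 0]
    parity = odd_parity_bit(data)
    print("parity bit:", parity)
    # The example 01?01 from the text: the missing bit must be 1.
    print("missing bit:", recover_missing_bit([0, 1, 0, 1]))
    # Overwrite the third data bit (1 -> 0) without reading the other data bits.
    print("new parity:", updated_parity(1, parity, 0))
```

The update rule is what makes a RAID 4 write cost two reads and two writes: only the old data block and the old parity block are needed, not the rest of the stripe.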

 

RAID 5

To prevent the parity disk from becoming the bottleneck, RAID 5 distributes parity info over all the disks.

Most RAID 5 systems have 5 or more disks. This three disk RAID 5 system has relatively more parity:

NOTE Block 0, block 1, and the block containing the parity of blocks 0 and 1 must be on different disks. If any two (or more) of these blocks were on the same disk, then the failure of that disk would lose data. The same applies to other blocks as well.
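One way to satisfy this constraint is to rotate the parity block from disk to disk on successive stripes. The sketch below prints such a layout; the particular rotation order is an assumption made for illustration (real controllers differ in the order they use), and raid5_layout is a made-up name.

```python
def raid5_layout(disks: int, stripes: int) -> list[list[str]]:
    """Return one row per stripe; each row lists what each disk holds.
    Parity moves one disk to the left on each successive stripe
    (an assumed convention, chosen only for illustration)."""
    rows = []
    next_block = 0
    for stripe in range(stripes):
        parity_disk = (disks - 1 - stripe) % disks
        data_blocks = list(range(next_block, next_block + disks - 1))
        next_block += disks - 1
        parity_label = "P(" + ",".join(str(b) for b in data_blocks) + ")"
        remaining = iter(data_blocks)
        row = [parity_label if disk == parity_disk else f"D{next(remaining)}"
               for disk in range(disks)]
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in raid5_layout(disks=3, stripes=3):
        print("  ".join(f"{cell:<8}" for cell in row))
```

With three disks, one block in every three holds parity; with N+1 disks the fraction drops to 1/(N+1), which is why larger arrays have relatively less parity.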

A RAID 5 with N+1 disks can do almost N read ops per second and almost N/2 write ops per second for each op per second that a single disk can do. The "almost" is there because the disk heads sometimes must move together, as in striping.

Synchronization of data and parity writes is a problem in both RAID 4 and RAID 5; a crash between the two writes can cause data loss.

Modern RAID controllers store writes in non-volatile memory and complete outstanding writes at recovery. They can therefore reorder writes to disk during normal processing.

NOTE If you increase the size of the blocks being interleaved, you decrease the number of times when the heads must move together. Selection of the interleave unit has pretty much the same effect for RAID-5 as for striping.

 

Failures in RAIDs

Every RAID scheme stores enough information so that if a disk fails, its contents can be reconstructed automatically onto a replacement disk. During reconstruction, the system will be slower than usual. Until it is complete, further failures can cause data loss.
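For the parity-based schemes, reconstruction amounts to XORing the surviving blocks of each stripe. The sketch below illustrates this for one stripe; it uses plain XOR (even parity) for simplicity, and the block contents and the name xor_blocks are invented for the example.

```python
def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


if __name__ == "__main__":
    # One stripe: three data blocks (hypothetical contents) plus their parity.
    data = [b"disk", b"fail", b"test"]
    parity = xor_blocks(data)

    # The disk holding data[1] fails; rebuild its block from the survivors.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt)
```

Every surviving disk must be read in full to rebuild the failed one, which is why the array is slower than usual during reconstruction.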

To allow reconstruction to commence immediately, without waiting for delivery of a replacement disk, the sysadmin can include a spare drive.

Some schemes (usually called RAID-6) can tolerate a second failure before or during the rebuild phase.

 

Backup

For unattended backup, tapes must have large capacity. The current best choices are 5 Gb 8mm video tapes (Exabytes), 2 Gb 4mm digital audio tapes (DAT), and 10 Gb Digital Linear Tape (DLT).

Compression can improve these capacities by about a factor of two but must be used carefully, since some kinds of files do not compress well or at all.

The other limit on the size of a nightly backup is the write and verify speed of the drive. DLTs can write data at 1.25 Mb/s if they are kept streaming. Exabytes and DATs are slower.

NOTE Backups that are not read back to verify their contents are likely to fail when they are needed: without verification, sysadmins are not alerted to backup problems until they try to restore from a backup.

If a machine has only RAIDs for disk storage, it needs conventional tape backups only for protection against disasters such as a computer room fire or two (nearly) simultaneous disk crashes.

Devices that can store 25 Gb on an 8mm video tape cartridge are said to be under development.

 

Robotic devices

For very large data sets, one wants a robot handling a carousel of tapes or a jukebox of optical disks (either WORM or erasable). These can be used for backup, or as a primary storage medium for very large data sets (Terabytes).

Some companies sell network file servers that use magnetic disks as a cache for the optical disks, with automatic migration between memory, magnetic disk, optical disk and tape. This is called hierarchical storage management. Other companies sell HSM software running on Unix machines.

 

I/O processors

I/O processors are specialized devices whose purpose is to off-load a defined activity from the main CPU; this usually means reducing the interrupt load on the main CPU.

The simplest IOPs are DMA controllers, which are just a few words of FIFO queue, command and status registers, and a state machine.

Complex IOPs are full computers dedicated to one task: NFS servers, X-terminals, terminal concentrators.

Other IOPs are in between: graphics accelerators, channel controllers, network interfaces.

NOTE Simple IOPs tend to acquire more and more features as time goes on. The interface can get so complex that it takes the main CPU more time to figure out the proper IOP command than it would take to do the task itself.

Eventually somebody decides to remove the IOP and give its job to a general purpose CPU (a faster uniprocessor or a multiprocessor).

Then somebody offloads some of the functions of the resulting system to a simple IOP ... this is called the wheel of reincarnation.

 

Performance and Price

Performance and cost

Customers would like precise information about the price and the performance of available machines. Vendors would like customers to buy their machines.

There is an inherent conflict here. This makes the situation even more muddled than it would be otherwise. Example: quoting performance with all options and price with none.

The trade press rarely supplies accurate, complete information. The situation is improving, but only slowly.

 

What kind of performance?

The csh time command reports approximations of the time the CPU spent in user mode and in system mode and the elapsed time:

% time testcommand

90.7u 12.9s 2:39 65% ...

System mode time is the time spent by the OS on behalf of this process. The difference between elapsed time and user+system time is accounted for by I/O and by the CPU running other processes.
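The percentage in the sample output is simply (user + system) / elapsed. The following sketch recomputes it from the other three fields; the helper names are made up, and the parser handles only the m:ss and h:mm:ss forms of the elapsed time.

```python
def parse_elapsed(text: str) -> float:
    """Convert an elapsed time such as '2:39' or '1:02:39' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_share(user_s: float, system_s: float, elapsed: str) -> float:
    """Fraction of the elapsed time during which the CPU ran this process."""
    return (user_s + system_s) / parse_elapsed(elapsed)


if __name__ == "__main__":
    user, system, elapsed = 90.7, 12.9, "2:39"
    print(f"CPU share: {cpu_share(user, system, elapsed):.0%}")
    waiting = parse_elapsed(elapsed) - (user + system)
    print(f"I/O wait and other processes: {waiting:.1f} s")
```

For the sample line this gives about 65%, leaving roughly 55 seconds of elapsed time attributable to I/O and to other processes.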

 

Which measure?

  • elapsed time
  • user time
  • user time + system time
  • elapsed time on unloaded system

For measuring CPU performance, we use user time.

For measuring system performance, we use elapsed time on an unloaded system. Unloaded: either single user mode or multi user with one person logged in. The two may not give the same result.

Timing results are frequently not exactly reproducible.

NOTE For a long time, the main reason for not being able to exactly reproduce timing results was clock granularity. Nowadays, however, cache effects can be significantly more important. If the program happens to be loaded into physical memory locations that cause frequently used instructions and/or data to map to the same cache block, perhaps in the secondary cache, the program will run significantly more slowly than if the program is loaded into physical memory locations for which this does not happen. This effect can increase the runtime of a program by several percent; in extreme cases, it can multiply the runtime by a factor of 2 or 3. (The upper bound is the ratio between the speed of the cache and the speed of the memory.)

 

lies, damned lies, statistics, benchmarks

The only way to be sure of the performance of your application on a given machine is to run it on that machine and measure the time taken.

Benchmarks ("standard" tests) may help if you treat them with caution and if you know how to interpret them, i.e. know how similar the benchmark programs are to the programs you are interested in.

If they aren't similar, the benchmarks are worse than useless, since they can mislead you.

 

Choosing programs for performance evaluation

  1. The best option: your workload
  2. Some real programs that are known to have characteristics similar to your workload (e.g. gcc)
  3. Kernels, i.e. the key parts of 2 (e.g. Lawrence Livermore Loops)
  4. Synthetic benchmarks constructed to have characteristics that represent the "average" program in a class (e.g. Whetstone, Dhrystone)
  5. Toy benchmarks (e.g. Sieve of Eratosthenes)

NOTE You should base substantive decisions only on results from programs in the first two classes. The last three categories are simply too misleading most of the time.

 

Native MIPS

MIPS stands for Millions of Instructions Per Second. However, it does not say what each of those instructions does, or what its contribution to the solution of the problem is. Without this info, MIPS is a Meaningless Indicator of Processing Speed.

  • MIPS varies between programs
  • MIPS depends on instruction set
  • MIPS depends on compiler
  • MIPS can vary inversely to real performance

 

Variability of native MIPS

The PL.8 research compiler generated better code for the System 370 than commercial compilers by using only the load/store instructions to access memory. It was better at reusing data in registers and at instruction scheduling (e.g. spreading loads and uses).

It improved real performance by up to about 20%. Yet because it replaced each high-CPI instruction with several low-CPI instructions, the MIPS rate of the resulting programs improved from ~2 to ~6. This effect is important when comparing RISC code to CISC code.
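A small calculation shows how the MIPS rate can triple while real performance improves by only 20%. The clock rate, instruction counts and CPI values below are hypothetical, chosen only to reproduce the approximate ratios quoted above; they are not measurements of the System 370.

```python
def run_time_s(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instruction count * cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz


def native_mips(instructions: float, seconds: float) -> float:
    """Millions of instructions executed per second."""
    return instructions / seconds / 1e6


if __name__ == "__main__":
    clock_hz = 10e6  # hypothetical 10 MHz clock

    # Commercial compiler: fewer, more complex (high-CPI) instructions.
    old_time = run_time_s(100e6, 5.0, clock_hz)
    # Load/store style code: 2.5 times as many instructions, each far cheaper.
    new_time = run_time_s(250e6, 5.0 / 3.0, clock_hz)

    print(f"old: {old_time:.1f} s, {native_mips(100e6, old_time):.1f} MIPS")
    print(f"new: {new_time:.1f} s, {native_mips(250e6, new_time):.1f} MIPS")
    print(f"real speedup: {old_time / new_time:.2f}x")
```

The MIPS rate rises by a factor of three because the instruction count rises by a factor of 2.5; only the remaining factor of 1.2 is real improvement.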

 

Relative MIPS

Relative MIPS have nothing to do with native MIPS. Take an old machine with an accepted MIPS rating, execute some task on this machine and on another, and multiply the old machine's MIPS rating by the ratio of the two execution times (old machine's time divided by the other machine's time). What you get is relative performance on an unspecified task (usually Dhrystone 1.1).
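As a formula, the procedure is a single multiplication. In the sketch below the function name and the timings are hypothetical; only the method comes from the text.

```python
def relative_mips(reference_mips: float, reference_seconds: float,
                  machine_seconds: float) -> float:
    """Relative MIPS = reference machine's rating * (reference time / machine time)."""
    return reference_mips * reference_seconds / machine_seconds


if __name__ == "__main__":
    # Hypothetical timings: the task takes 600 s on a reference machine rated
    # at 1 MIPS and 20 s on the machine being rated.
    print(relative_mips(1.0, 600.0, 20.0))
```

Nothing in the calculation counts instructions on the machine being rated, which is why the result says nothing about its native MIPS.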

The usual "old machine" is the VAX-11/780 with 1980 compilers, which is defined to be a 1 MIP machine. In fact, the 780 executed about 470,000 instructions per second, but it had "about" the same speed as an IBM machine that was advertised as being 1 MIP.

NOTE This should give you an idea of the vagueness of most measures of performance.

VUPs, VAX Units of Performance, are DEC's own version of relative MIPS; they are based on a set of small benchmark programs, and are thus suspect.

 

MFLOPS

This measure, Millions of FLoating point Operations Per Second, attempts to measure the work being done on the problem, rather than the work performed for housekeeping. However, MFLOPS are not much better than MIPS.

  • MFLOPS is irrelevant for many programs
  • MFLOPS depends critically on operation mix
  • MFLOPS depends on compiler
  • Housekeeping often cannot be ignored

NOTE Some of today's microprocessors have such good FP hardware that even number crunching applications spend more time on updating index variables, testing loop conditions and moving data into and out of the CPU than on the calculations themselves.

MFLOPS are usually measured by the LINPACK benchmark, which is a linear algebra package operating on matrices. The two main versions operate on 100*100 and 1000*1000 matrices respectively. Due to lower loop overheads, the larger problem should yield the higher MFLOP rating, but the larger problem may not fit into cache, reducing its performance. When interpreting LINPACK numbers, it is important to know whether the basic linear algebra subroutines (BLAS) were written in Fortran or assembly: if they were in assembly, the indicated performance may be unobtainable from high level language programs.

 

Benchmark pitfalls

Vapourware always wins.

Unreproducible benchmarks should be viewed with suspicion.

Most benchmarks focus on CPU performance. Very few test the other bottlenecks: memory, I/O and the O/S.

Small benchmarks fit in caches and thus overestimate performance.

Kernels, synthetics and toys tend to be small.

Performance can vary with compiler and O/S version, cache size, memory size, and disk model.

NOTE Vapourware is something (hardware or software) that is announced a long time before being available, usually with great hype. Some vapourware delivers what it promises and on time, some vapourware delivers what it promises but with significant delay, and some vapourware never becomes available.

For example, Sun launched their SuperSPARC processor with great fanfare. Their customers then had to wait about 18 months before it was available in systems, and at first it was available only at a lower than promised clock rate (36 MHz). By then even performance at the originally promised clock rate (50 MHz) would have been behind the competition. Sun seem to have learned their lesson; they have done much better with UltraSPARC. A similar thing has happened to AMD with their K5 (a Pentium-class x86 processor); they have also done better with the K6 (whose design they acquired when they bought a company called NexGen).

In 1994 IBM disclosed several aspects of their new PowerPC 620 processor, which still hasn't appeared in systems; the project seems to have been abandoned. In late 1996 the startup company Exponential announced their PowerPC 704 processor at a clock speed of 533 MHz amid a considerable amount of hype, which led some Macintosh advocates to predict the imminent death of the x86. (The fastest x86 at the time was the 200 MHz Pentium Pro.) The company abandoned the project in June 1997 because no company designed the 704 into their computers, mainly because they were frightened away by the 85 watt power consumption of the chip, which, because of the chip's ECL design (instead of the usual CMOS), couldn't even be reduced by reducing the clock speed. In any event, at that time the highest speed at which the chip could work reliably was 410 MHz.

 

Benchmark pitfalls

Most vendors inflate their own performance (by lying, picking a favourable test, using non-standard compilers, or picking strawman systems to compare against).

Benchmarks with dead or unusual code (especially Dhrystone 1.1) are far too susceptible to compiler optimizations.

Expected performance need not be proportional to peak performance (vendor MIPS are "guaranteed not to exceed" figures).

Price/performance comparisons are usually bogus: vendors can measure lowball "wheels extra" configurations.

NOTE Dhrystone is an example of several of these points. It does a lot of string copying in an attempt to measure string copying performance. However, the strings in Dhrystone are of known constant length and their starts are aligned on natural boundaries, two characteristics usually absent from real programs. Therefore an optimizer can replace a string copy with a sequence of word moves, which will be much faster. This optimization therefore overstates system performance, sometimes by more than 30%.

Dhrystone is no longer at all useful for performance measurement.

Even its author has held this view for a long time now.

As an example of picking a favourable and quite irrelevant test, consider Intel's introduction of the i860 processor. Since this CPU was one of the first to include graphics instructions (instructions that operate on more than one 8 or 16 bit value at a time, which is the core of the MultiMedia eXtensions (MMX) in new machines), Intel quoted its performance in Mandelbrots per second (the speed of computing the famous fractal graphic).

 

Compiler effects

Different compilers and libraries can produce executables of significantly different speed. These results are for the LAPACK linear algebra package on a SPARCstation 10/712:

compiler             Fortran %   C %
Sun 4.2 libsunperf   100         -
Sun 4.2              87          71
Sun 4.0              85          64
Apogee               76          26
Gnu                  24          37

NOTE LAPACK is written in Fortran, but like other Fortran programs, can be translated into C via f2c. The third column reports the speed of the benchmark when translated into C and compiled with a C compiler. (This translation step usually loses some speed.)

The first three rows report results using versions 4.0 or 4.2 of Sun's SPARCcompiler products. The fourth row uses compilers from a company called Apogee, while the fifth row uses the Free Software Foundation's g77 and gcc compilers.

The results indicate that g77 is still quite immature. While gcc does significantly better on the C version of LAPACK than Apogee, it still trails behind the Sun compilers in optimizing number-crunching code. (For integer programs, gcc trails the Sun compilers by a much smaller margin.)

 

Summarizing performance

 

        A      B     C
test1   1      2     20
test2   1000   200   20
total   1001   202   40

This data can be used to "prove" that:

A is 2 times the speed of B (test1)

B is 5 times the speed of A (test2)

A is 20 times the speed of C (test1)

C is 50 times the speed of A (test2)

B is 10 times the speed of C (test1)

C is 10 times the speed of B (test2)

NOTE Such performance data may seem artificial, but it is possible to get results like this if test1 is a sequential problem while test2 is a parallel problem that does 1000 times more work, and if machine A is a conventional fast uniprocessor, B is a multiprocessor with 10 slower CPUs, and C is an SIMD machine with 1000 very slow PEs.

 

Summarizing performance

Arithmetic mean is appropriate for averaging times:

$$\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

Harmonic mean is appropriate for averaging rates:

$$\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}$$

Using arithmetic mean for rates is cheating. Either formula can be modified to give more weight to some tests.

NOTE Times and rates are inverses of each other. When you take the expression for arithmetic mean of times, and replace the times with the inverse of the rates, and express the resulting average time as a rate, you get the formula for the harmonic mean. This is why the harmonic mean is the appropriate mean for averaging rates.
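The relationship between the two means can be checked numerically. In this sketch the amount of work per test (100 units) and the two times are hypothetical, and the function names are invented for the example.

```python
def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def harmonic_mean(rates):
    """Count of the rates divided by the sum of their reciprocals."""
    return len(rates) / sum(1.0 / r for r in rates)


if __name__ == "__main__":
    work = 100.0        # hypothetical units of work done by each test
    times = [1.0, 4.0]  # hypothetical run times in seconds
    rates = [work / t for t in times]

    mean_time = arithmetic_mean(times)
    print("rate implied by mean time:", work / mean_time)
    print("harmonic mean of rates:   ", harmonic_mean(rates))
    print("arithmetic mean of rates: ", arithmetic_mean(rates))
```

The harmonic mean of the rates agrees with the rate implied by the average time (40 units per second here), while the arithmetic mean of the rates (62.5) overstates it.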

Suppose you want a way of averaging performance ratios in which the results depend neither on the reference machine nor on the distribution of total time among the benchmarks.

The only such averaging method is the geometric mean:

$$\left( \prod_{i=1}^{n} \text{Ratio}_i \right)^{1/n}$$

However, the geometric mean (like other means) can distort results if there is wide variability among the rates (if the base time is 1000, A = 32, B = C = 50).

NOTE Variability of performance is an important piece of information. When that variability is small, ignoring the variability by reporting a single number is acceptable; when the variability is large, suppressing it robs the summary of its value. Therefore in cases with wide variability of performance, there is no good way of summarizing performance.
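The independence from the reference machine can be demonstrated with the times of machines A, B and C on test1 and test2 from the earlier table. The sketch below compares B with A using each machine in turn as the reference; the function names are made up for the example.

```python
import math

# Run times on test1 and test2 for the three machines in the earlier table.
TIMES = {"A": [1, 1000], "B": [2, 200], "C": [20, 20]}


def ratios(machine: str, reference: str) -> list[float]:
    """Per-test performance ratios: reference time / machine time."""
    return [ref / own for own, ref in zip(TIMES[machine], TIMES[reference])]


def geometric_mean(values) -> float:
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))


def arithmetic_mean(values) -> float:
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for reference in TIMES:
        gm = geometric_mean(ratios("B", reference)) / geometric_mean(ratios("A", reference))
        am = arithmetic_mean(ratios("B", reference)) / arithmetic_mean(ratios("A", reference))
        print(f"reference {reference}: B/A by geometric mean {gm:.2f}, "
              f"by arithmetic mean {am:.2f}")
```

The geometric mean gives the same B-to-A comparison whichever machine is the reference; the arithmetic mean of the ratios does not, and can even reverse the ranking.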

 

SPEC

The Systems Performance Evaluation Cooperative was founded by Apollo, HP, MIPS and Sun in 4Q89.

SPEC has published three benchmark suites, in 89, 92 and 95. They each consist of real programs whose sources are as close to identical as possible from machine to machine. The 89 and 92 suites are now obsolete.

The SPEC result sheet must state the machine configuration, the operating system and compiler versions, and the background load. Without this detail, one cannot rely on the reproducibility of the benchmark.

NOTE Some time after its founding, the name of the organization was changed to the Standard Performance Evaluation Corporation, but the acronym remained the same.

 

SPEC results

SPECfp95 is the geometric mean of the SPECratios of 10 FP programs, all in Fortran and mostly double precision. It is the central concern of number crunchers.

SPECint95 is the geometric mean of the SPECratios of 8 integer programs, all in C. It is of most interest to people doing e.g. software development, text processing and (to a lesser extent) database applications.

The ratios are based on the performance of the Sun SPARCstation 10/40 with frozen compilers.

NOTE The earlier suites used a VAX 11/780 with frozen compilers as their baseline. This is no longer practical: a test that runs on modern machines in a few minutes would take a day on a VAX 11/780.

You will find information about SPEC and SPEC results at http://www.spec.org.

 

SPECint_base95 details

 

Vendor        IBM     IBM     DEC      Intel
Model         H50     397     AS1200   ??
CPU           604e    P2SC    21164    P2
MHz           332     160     533      400

go            17.0    ?       17.2     14.5
m88Ksim       17.8    ?       18.4     15.7
gcc           12.8    ?       17.5     14.7
compress       9.7    ?       15.0     11.8
li            12.7    ?       15.1     15.8
jpeg          16.8    ?       16.8     15.5
perl          14.6    ?       18.1     17.8
vortex        12.5    ?       15.9     17.2

SPECint95b    14.0    7.8     16.7     15.3

Notice the variability among the benchmarks, that different machines handle different things well, and that clock speed isn't everything (533/332 = 1.61, while 16.7/14.0 = 1.19).

NOTE For example, the performance of the AlphaServer 1200 with its 533 MHz 21164 is above average on the go program and below average on the vortex program, while the performance of the Intel machine with its 400 MHz Pentium II is below average on the go program and above average on the vortex program.

IBM has not published SPECint_base95 results for the RS6000/397, although they have published SPECfp_base95 results. The unofficial summary number for the 397 above is from IBM's web site; the other numbers on this page and the next are official numbers from the SPEC web site.
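The summary row of the table can be checked against the individual ratios. The following Python sketch recomputes the geometric mean for the three machines with complete columns. The function name, the dictionary layout and the label "Intel P2" are illustrative choices, not part of any SPEC tool.

```python
import math

def geometric_mean(ratios):
    """Return the n-th root of the product of n positive ratios."""
    # Averaging logarithms is equivalent to taking the n-th root of the
    # product, and avoids overflow when there are many large ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# SPECint_base95 ratios copied from the table, in the order
# go, m88Ksim, gcc, compress, li, jpeg, perl, vortex.
specint_ratios = {
    "H50":      [17.0, 17.8, 12.8, 9.7, 12.7, 16.8, 14.6, 12.5],
    "AS1200":   [17.2, 18.4, 17.5, 15.0, 15.1, 16.8, 18.1, 15.9],
    "Intel P2": [14.5, 15.7, 14.7, 11.8, 15.8, 15.5, 17.8, 17.2],
}

summary = {name: geometric_mean(r) for name, r in specint_ratios.items()}

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value:.1f}")
```

To one decimal place the results agree with the published summary figures of 14.0, 16.7 and 15.3.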

 

SPECfp_base95 details

 

Vendor        IBM     IBM     DEC      Intel
Model         H50     397     AS1200   ??
CPU           604e    P2SC    21164    P2
MHz           332     160     533      400

tomcatv       15.1    46.9    24.7     15.4
swim          24.4    56.5    27.3     22.3
su2cor         5.7    10.5    11.8      7.3
hydro2d        6.1    12.9    12.9      6.8
mgrid          9.6    22.8    19.3      7.3
applu          8.0    23.2    10.6      7.1
turb3d        15.0    22.7    22.0     10.4
apsi           9.2    11.5    24.0     14.1
fpppp         36.9    35.1    40.3     16.9
wave5         12.8    30.1    27.3     11.4

SPECfp95b     12.1    23.6    20.4     11.0

Note that two machines with similar SPECfp_base95 can take very different times on one program, and that memory bandwidth can be much more important than clock speed.

NOTE The RS6000/397 and AlphaServer 1200/533 have similar FP performance despite wildly different clock rates. This is because their designers had different design philosophies. The Speed Demon school of CPU design emphasizes high clock rates; the Brainiac school emphasizes doing as much work as possible per CPU cycle. Obviously, DEC's designers are in the Speed Demon school while IBM's are in the Brainiac school.

The 397 was designed for number crunching. Its Power2 CPU has several low-latency floating point units, a large on-chip cache (128 Kb data and 32 Kb instructions), and a very high bandwidth memory system (more than 2.5 Gb/s, due to a 256-bit wide bus) that can work on several accesses at once. By contrast, the H50 is oriented towards general purpose computation. Its PowerPC 604e superscalar processor was designed for integer performance and has smaller on-chip primary caches (32 Kb data + 32 Kb instructions). The H50's 604e CPU cannot work on several accesses at once, its memory subsystem bandwidth is only 1.3 Gb/s (128-bit wide bus), and its memory latency is higher, due to the presence of an L2 cache and bus arbitration overhead (the H50 is a multiprocessor machine, while the 397 is available only as a uniprocessor).
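One rough way to see the Speed Demon versus Brainiac difference is to divide each machine's SPECfp_base95 figure by its clock rate. The sketch below does this with the numbers from the table; score per MHz is only a crude proxy for work per cycle, and the machine labels and function name are illustrative.

```python
# (label, clock in MHz, SPECfp_base95), taken from the table above.
machines = [
    ("IBM H50 (604e)", 332, 12.1),
    ("IBM 397 (P2SC)", 160, 23.6),
    ("DEC AS1200 (21164)", 533, 20.4),
    ("Intel (P2)", 400, 11.0),
]

def score_per_mhz(score, mhz):
    """SPEC score divided by clock rate: a crude proxy for work per cycle."""
    return score / mhz

per_mhz = {name: score_per_mhz(score, mhz) for name, mhz, score in machines}

if __name__ == "__main__":
    # List the machines from most to least work per clock.
    for name, value in sorted(per_mhz.items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {value:.3f} SPECfp_base95 per MHz")
```

By this measure the 397 delivers nearly four times as much floating-point performance per clock as the AlphaServer 1200, which is how the two end up with similar overall scores.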

Some recent compilers can automatically find parallelism in some programs. Moving from 1 to 8 processors increases the SPECfp95 score of the AlphaServer 8400/5/300 from 12.4 to 33.5, a factor of 2.7; the SPECint95 score is almost unchanged.

The SPECrate benchmarks run several copies of a benchmark in parallel, usually one on each processor. Moving from 1 to 8 processors increases the SPECfp_rate95 score from 109 to 789, a factor of 7.2 (the SPECint_rate95 goes from 64.2 to 525, a factor of 8.2).

Speeding up single programs by using multiple processors ranges from hard (as shown by SPECfp95) to impossible with current technology (SPECint95). Speeding up many programs by using multiple processors is easy, as shown by the SPECrate numbers.
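The speedup factors quoted above follow directly from the published scores. The sketch below recomputes them and adds the parallel efficiency (speedup divided by processor count), a standard measure that the notes do not use by name. An efficiency above 1 indicates superlinear speedup.

```python
def speedup(score_1, score_n):
    """Ratio of the n-processor score to the 1-processor score."""
    return score_n / score_1

def efficiency(score_1, score_n, n):
    """Speedup per processor; 1.0 would be perfect linear scaling."""
    return speedup(score_1, score_n) / n

PROCESSORS = 8

# (1-processor score, 8-processor score) for the AlphaServer 8400/5/300,
# as quoted in the text.
scores = {
    "SPECfp95":       (12.4, 33.5),
    "SPECfp_rate95":  (109.0, 789.0),
    "SPECint_rate95": (64.2, 525.0),
}

scaling = {
    name: (speedup(one, many), efficiency(one, many, PROCESSORS))
    for name, (one, many) in scores.items()
}

if __name__ == "__main__":
    for name, (s, e) in scaling.items():
        print(f"{name:15s} speedup {s:.1f}  efficiency {e:.2f}")
```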

In the SPECrate benchmarks, vendors experiment to find the number of copies to run in parallel that maximizes the number of jobs per time period. (This means that SPECint_rate95 and SPECint95 are reported in different units and are thus incomparable, and similarly for FP.) Since the SPEC benchmark programs are designed mostly to exercise the CPU, and have little I/O or OS interaction, they are not representative of many applications. In real life one can't experiment with the load either. Therefore it is unwise to use SPECrate as anything other than a very rough indication.

The most probable explanation of the superlinear speedup for SPECint_rate95 is that the 8-processor system was tested with twice as much memory, which usually means twice the number of banks in an interleaved memory system. Not all processors have performance that scales so well, due to contention for bus, memory, disk, OS etc, which is why multiplying SPEC numbers by N in N-way multiprocessors is bogus. For example, each CPU of a four-processor SGI 4D/340S can do about 25 SPECint89, but if you run four benchmarks together, each CPU can only do about 19.

 

SPEC problems

SPEC rules require full disclosure of all 18 SPECratios. However, marketing often quotes only composite metrics.

Results from unobtainable (beta) compilers are bogus; results with a long string of optimizer switches almost so. This is why reporting base results, in which all benchmarks must use the same flags, is now compulsory. Reporting peak results, where flags may vary, is optional.

The programs in the 89 and 92 suites now fit in some caches. The programs in SPEC 95 handle larger datasets: up to about 64 Mb.

NOTE Buyers need full disclosure so they can pick the benchmark(s) that resemble their own applications and make their own comparison.

From June 1994, vendors who report new SPEC numbers must also report the corresponding SPECbase number. SPECbase uses the same benchmarks as SPEC but requires all programs in a given language to be compiled with the same set of compiler switches. Programmers can use the same set of switches on their own programs, since SPECbase proves them to be generally useful, instead of trying many switch combinations.

SPEC89 had a composite metric, SPECmark, that lumped integer and FP benchmarks together. Since superscalar machines often have SPECfp around 2*SPECint, lumping the two together distorts things. IBM advertised their superscalar RS/6000-970 as ~100 SPECmarks89, yet the SPECint89, which is much more predictive in most cases, was ~50.

The optimization that reorganizes matrix300 for a tenfold speedup will improve performance for very few applications, yet can increase SPECmark from 60 to 70 (e.g. SGI Crimson). This is why the matrix300 benchmark, which was part of SPEC89, was dropped from the SPEC92 benchmark suite.

The sc benchmark (a spreadsheet) was originally intended to measure performance with the output going to the screen. Some vendors have reinterpreted it to allow output going to disk. This inflates the apparent speed of the system. This program has been dropped from the SPEC 95 suite.

The eqntott benchmark (a program to convert equations to truth tables for circuit design) works mostly with small integers. It can be sped up significantly by a smart compiler arranging to pack two 16-bit integers into one 32-bit word, and then using 32-bit arithmetic instructions to perform two 16-bit operations at the same time. Intel's compiler used to perform this optimization even when it wasn't valid. As a result, the version of the eqntott executable used to derive some of Intel's SPECint92 numbers gave incorrect answers for some inputs, although it happened to give the correct answer on the input used in SPEC. These numbers have since been withdrawn by Intel, but you will still find them circulating. The eqntott program has been dropped from the SPEC 95 suite.

Intel's compiler is available for sale (SPEC requires this), but very few people actually use it. Intel does not market it, and according to industry rumors this compiler has a high likelihood of generating incorrect code for anything that is not a benchmark :-(, and as the eqntott story shows, even for some benchmarks.
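The packing trick, and the way it can go wrong, can be sketched in a few lines of Python. Addition is used here purely as an illustration; the notes do not say which operations the compiler combined. All function names are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def pack(hi, lo):
    """Pack two unsigned 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit halves."""
    return (word >> 16) & MASK16, word & MASK16

def add_packed(a, b):
    """One 32-bit addition standing in for two 16-bit additions."""
    return (a + b) & MASK32

def add_separate(x, y):
    """Reference result: two independent 16-bit additions with wraparound."""
    return (x[0] + y[0]) & MASK16, (x[1] + y[1]) & MASK16

# Small values: the packed addition matches the two separate additions.
safe = unpack(add_packed(pack(3, 4), pack(10, 20)))

# The low half overflows: its carry leaks into the high half, so the
# packed result differs from the two separate 16-bit additions.
unsafe = unpack(add_packed(pack(0, 0xFFFF), pack(0, 1)))
reference = add_separate((0, 0xFFFF), (0, 1))
```

Here `safe` equals `(13, 24)`, while `unsafe` is `(1, 0)` and `reference` is `(0, 0)`. The optimization is therefore valid only when the compiler can prove that no carry crosses the boundary between the halves.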

 

System benchmarks

When evaluating a machine for e.g. timesharing or database applications, the benchmark must exercise I/O and the OS as well as the CPU. It is possible for machine A to have lower CPU performance but higher system performance than machine B.

The proprietary AIM benchmark and the System Development Multitasking (SDM) benchmark from SPEC each run a number of scripts in parallel, and measure the number of scripts executed per time period. In SDM, each script is intended to approximate one user.

NOTE Servers usually have lower CPU performance and higher system performance than the workstations on which they are based. Mainframes have very high system performance, much higher than workstations, despite having comparable or lower CPU performance.

SDM is based on MUSBUS, a benchmark developed at Monash University in 1980.

SPEC also has system level benchmarks for measuring the performance of NFS file servers, Web servers, supercomputers, and graphics systems.

Bonnie, IOstone and NFSstone are pure I/O benchmarks that do no computation. They produce figures such as X Mb/s for sequential reads of 4 Kb blocks. The figures need very careful inspection, because small changes in system configuration can cause big changes in benchmark performance. For example, using a faster spinning disk, using a striped filesystem, or using a bigger buffer cache in the OS can all improve the reported results dramatically.

The Khornerstone suite of Unix Review contains some I/O and graphics benchmarks as well as some CPU benchmarks, but it tests these separately and doesn't use the scripts/time approach.

 

Transaction Processing Performance Council

TPC-C simulates a warehouse order-entry application, with five nontrivial transaction types. The database size must go up linearly with performance while maintaining acceptable response times (five seconds).

TPC-D has 17 complex, long-running queries against complex data structures; it represents decision support applications. There is a range of approved database sizes: 1 Gb, 10 Gb, 30 Gb, 100 Gb, 300 Gb and 1 Tb. Results for different sizes are not comparable.

NOTE You will find information about TPC and TPC results at http://www.tpc.org.

DBMS = Database Management System.

The requirement that higher performance results must be measured on a machine with a larger database size is derived from the usual database usage pattern, in which bigger, faster machines support both more users and more data. It significantly complicates the measurement process. You try out the speed of your machine at some data size. Since the speed will not match the required size, you must adjust the size up or down, and try again. However, on a different sized database the speed will be different, so you must keep varying the database size until you get to a fixpoint: a run in which the speed matches the size.
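The search for a fixpoint described above can be sketched as a simple iteration. Both `required_size_gb` and `measure_tpm` are hypothetical stand-ins: the first is a made-up linear sizing rule, the second a made-up performance model replacing a real benchmark run. Neither reflects actual TPC rules or any real system.

```python
def required_size_gb(tpm):
    """Hypothetical sizing rule: database size grows linearly with throughput."""
    return 0.1 * tpm

def measure_tpm(size_gb):
    """Hypothetical stand-in for a benchmark run: throughput falls as data grows."""
    return 2000.0 / (1.0 + size_gb / 100.0)

def find_fixpoint(size_gb, tolerance=0.01, max_runs=100):
    """Repeat runs until the size used matches the size the result requires."""
    for run in range(1, max_runs + 1):
        tpm = measure_tpm(size_gb)        # one (simulated) benchmark run
        needed = required_size_gb(tpm)    # size that this speed demands
        if abs(needed - size_gb) < tolerance:
            return size_gb, tpm, run      # speed and size now agree
        size_gb = needed                  # adjust the size and try again
    raise RuntimeError("no fixpoint found within max_runs")

size, tpm, runs = find_fixpoint(50.0)

if __name__ == "__main__":
    print(f"fixpoint after {runs} runs: {size:.1f} Gb at {tpm:.0f} tpm")
```

With this particular model the iteration settles near 100 Gb and 1000 tpm. Plain resubstitution is not guaranteed to converge for every performance curve, so a real measurement campaign might need a more careful search, and each "run" there costs hours rather than microseconds.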

The now obsolete TPC-A and TPC-B benchmarks simulated simple transactions on an account database supporting a bank's ATM network. TPC-A was a full simulation; TPC-B ignored terminal and network handling and user think time.

TPC-A and TPC-B each have a single transaction, which performs just two disk reads and two disk writes. They are too easy to optimize: many optimizations speed up these benchmarks significantly while not doing much at all for real programs. Therefore the Transaction Processing Performance Council has declared these benchmarks obsolete (just as SPEC 89 and 92 are obsolete).

Results for TPC-C are reported in transactions per minute (tpmC). 1000 TPC-C transactions per minute (tpm) on a 1 Mb database is trivial; 1000 tpmC on a 100 Gb database is very difficult.

TPC-D results report the power metric and the throughput metric, which measure optimal latency and optimal throughput respectively. Both include database updates.

The TPC-D power metric (denoted QppD) demonstrates the speed of the system when serving a single user who is alone on the system (no competition for system resources from other users), while the throughput metric (QthD) shows how many queries per hour the system can complete with the optimum number of users working in parallel. The query-per-hour rating (QphD) is a composite of these two metrics.

All TPC results report full system cost for five years and derive a price/performance metric from this. This is because results depend strongly on the configuration; more/faster disks or memory means higher performance, as does a better DBMS.

The reported cost must include everything in the system, both hardware and software, not just for the computer running the queries but also any front-end computers required to handle the terminals and/or network interfaces needed to connect the required number of users. The hardware costs must include not only CPUs, memory, disks and terminals, but also mundane but necessary things such as power supplies, disk trays and cables. The software must include the operating system, the database management system, and any "middleware" such as transaction monitors (e.g. Tuxedo). Finally, the reported costs must include maintenance costs for both hardware and software for five years (which may be zero for an item that is covered by warranty for that long).
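The price/performance figure is simply the five-year system cost divided by the measured throughput. A minimal sketch follows; every dollar amount and the throughput figure are invented for illustration, and the function names are not taken from any TPC specification.

```python
def five_year_cost(hardware, software, hw_maintenance_per_year,
                   sw_maintenance_per_year, years=5):
    """Purchase price plus maintenance over the reporting period."""
    maintenance = years * (hw_maintenance_per_year + sw_maintenance_per_year)
    return hardware + software + maintenance

def price_performance(total_cost, throughput):
    """Dollars per unit of throughput (e.g. $/tpmC); lower is better."""
    return total_cost / throughput

# Hypothetical system: all figures are made up for this example.
total = five_year_cost(
    hardware=400_000,                # CPUs, memory, disks, terminals, cables...
    software=150_000,                # OS, DBMS, middleware
    hw_maintenance_per_year=20_000,
    sw_maintenance_per_year=10_000,
)
dollars_per_tpmc = price_performance(total, throughput=5_000)

if __name__ == "__main__":
    print(f"five-year cost ${total:,}, ${dollars_per_tpmc:.0f} per tpmC")
```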

 

SYSmark{32,95,NT}

The programs in these suites are crippled binaries of popular packages. They cover databases, spreadsheets, word processing, desktop publishing, and graphics/presentation programs.

The results for these 64 Mb Pentium II machines show the importance of the non-CPU aspects of systems.

Machine              MHz    SYSmark32
NEC Powermate E2     333    425
Dell Dim XPS R400    400    416
NEC Powermate E2     266    373
Aventec XL-600       400    369

NOTE These benchmarks, and others for testing file server performance and battery life, were developed by the Business Applications Performance Corporation (BAPCo), a consortium of most of the major players in the PC market. You will find information about BAPCo, and results on several of their benchmarks, at http://www.bapco.com.

The SYSmark programs are full system benchmarks and are more realistic and representative than the SPEC tests. This is possible partially because BAPCo does not have to handle different OSs or instruction sets (except for SYSmarkNT). The choice of e.g. disk and graphics card can make a significant difference in the performance of the machine, both on these tests and in real life.

The SYSmark95 for Windows suite is based on popular 16-bit programs. The programs and the weightings of the program groups are: Word for Windows 6.0, WordPerfect for Windows 6.0a and AmiPro 3.1 (word processing, 31%); Excel 5.0 and Lotus 1-2-3 for Windows Release 5 (spreadsheets, 23%); Paradox for Windows 5.0 (database, 23%); CorelDraw 5.0 (desktop graphics, 18%); Freelance Graphics for Windows 2.1 and Powerpoint 4.0 (desktop presentation, 3%); and PageMaker 5.0a (desktop publishing, 2%).

The SYSmark32 suite is based on 32-bit programs. The programs are: Word 7.0 and Lotus WordPro 96 (word processing); Excel 7.0 (spreadsheets); Paradox 7.0 (database); CorelDraw 6.0 (desktop graphics); Freelance Graphics 96 and Powerpoint 7.0 (desktop presentation); and PageMaker 6.0 (desktop publishing). BAPCo also has a new benchmark suite, SYSmark for Windows NT. The programs in this suite run natively on the processors that support WNT: the x86, Alpha, and MIPS Rx000.

 

Other PC benchmarks

Winstone 96 is reasonable; it is based on real applications, just like SYSmark. Most others are not.

Intel's competitors use a scheme called P-rating. Since the 208 MHz Cyrix 6x86MX performs similarly to a 266 MHz Pentium on Winstone 96, Cyrix calls it 6x86MX-PR266.

L2 cache size and bus speed are often important. A 233 MHz Pentium with L2 cache is usually faster than a 266 MHz Celeron (Pentium II without L2 cache). A 150 MHz Pentium, with its 60 MHz bus, is often slower than a 133 MHz Pentium with its 66 MHz bus.

NOTE PC benchmarks that you should not place any significant amount of trust in include Winbench, CPUmark, Landmark, Power Meter, Norton SI and Intel's iComp rating.

The P-rating scheme was originally designed to relate x86 clone processors to the frequency of the Pentium whose performance they approximated. However, P-ratings have since also been applied to relate x86 clone processors to the frequency of the Pentium MMX or Pentium Pro whose performance they approximated. Since at a given clock speed, the Pentium MMX is slightly faster than a plain Pentium (due to larger caches), and the Pentium Pro and Pentium II are significantly faster than a Pentium MMX (due to a completely different implementation), a P-rating is not worth much unless you know what processor it is relative to.

When Intel announced the 200 MHz Pentium, with a 66 MHz bus, they also announced that they would not sell 180 MHz Pentiums. A 180 MHz Pentium with a 60 MHz bus would be slower on almost all tasks than a 166 MHz Pentium with a 66 MHz bus, while a 180 MHz Pentium with a 72 MHz bus would almost certainly be more expensive than a 200 MHz Pentium with a 66 MHz bus (the cost of increasing the speed of the bus from 66 to 72 MHz is probably more than the difference in the cost of the 180 and 200 MHz CPUs).

 

Cost

The cost of a product should always decrease: over time manufacturing learns how to make the product cost less.

One source of this improvement is an increase in yield, the percentage of products that survive testing.

Very large amounts of money drive the learning curve of the semiconductor industry. Designers must anticipate the cost and performance of components at the time of product introduction, and use this data at design time.

NOTE The price of a new chip production facility used to start at about 100 million US dollars; these days they start at around 1 billion, and go up to about 3 billion. One reason Asian countries are dominant in chip manufacturing is that they had much lower interest rates through the eighties than the US, and financing new plants was easier for them.

The rate of increase in the cost of new semiconductor plants must slow down sometime in the next two decades, because if the present trend continues, by 2020 a new plant will cost more than the entire annual gross domestic product of the US.

 

Trends

DRAM size quadruples and disk size doubles every three years.

Logic transistor count per chip doubles every three years.

DRAM and disk speeds have doubled in ten years.

Logic transistor speed doubles every four years.

Required address size increases by 0.5 to 1.5 bit per year.

Memory price in 1958: 1 $/bit.

Memory price now: < $3/megabyte.

NOTE Memory cost is now less than one millionth of the memory cost of 1958, even without allowing for inflation.

Disk technology has been accelerating for the last four years or so, and both disk capacities and disk speeds have been growing faster than the historical average.
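The trend figures above are easier to compare once converted to annual growth rates, and the memory price claim can be checked directly. The sketch below does both; it assumes a megabyte of 2^20 bytes and takes "now" to mean exactly $3 per megabyte, the upper bound given above.

```python
def annual_growth(factor, years):
    """Constant yearly multiplier that compounds to `factor` over `years`."""
    return factor ** (1.0 / years)

trends = {
    "DRAM size (x4 per 3 years)":              annual_growth(4, 3),
    "disk size (x2 per 3 years)":              annual_growth(2, 3),
    "logic transistor count (x2 per 3 years)": annual_growth(2, 3),
    "DRAM and disk speed (x2 per 10 years)":   annual_growth(2, 10),
    "logic transistor speed (x2 per 4 years)": annual_growth(2, 4),
}

# Memory price comparison: 1 $/bit in 1958 versus at most $3/megabyte now.
BITS_PER_MEGABYTE = 8 * 2**20            # assumes 1 megabyte = 2**20 bytes
price_1958_per_mb = 1.0 * BITS_PER_MEGABYTE
price_now_per_mb = 3.0
improvement = price_1958_per_mb / price_now_per_mb

if __name__ == "__main__":
    for name, rate in trends.items():
        print(f"{name:42s} {100 * (rate - 1):5.1f}% per year")
    print(f"memory is about {improvement:,.0f} times cheaper than in 1958")
```

DRAM capacity grows by roughly 59% a year while DRAM speed grows by only about 7% a year, which is the gap that caches exist to hide. The price ratio comes out at almost 2.8 million, consistent with the "less than one millionth" claim.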

 

Cost of ICs

A wafer of silicon holds many dies (chips-to-be). The circuits are made on the wafer. The wafer is cut up, and each chip is tested, packaged and tested again. IC cost is

(die + test + package) / final yield

The smaller the chip, the more you can put on a wafer and the less likely a defect ruins it. Die cost is

cost of wafer / (dies per wafer * die yield) = f(die area^3)

NOTE Including spare rows and columns in memory and steering bad cells to spare ones can increase yield significantly.
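The two cost formulas above translate directly into code. In the sketch below the wafer cost, die counts, yields, and test and packaging costs are all invented numbers chosen only to show how die size drives cost; they do not describe any real process.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost of one good die: wafer cost spread over the dies that work."""
    return wafer_cost / (dies_per_wafer * die_yield)

def ic_cost(die, test, package, final_yield):
    """Cost of one finished IC, allowing for parts lost at final test."""
    return (die + test + package) / final_yield

WAFER_COST = 1000.0  # hypothetical cost of one processed wafer, in dollars

# A small die: many fit on the wafer and most of them work.
small_die = die_cost(WAFER_COST, dies_per_wafer=200, die_yield=0.8)

# A die four times larger: a quarter as many fit, and fewer survive defects.
large_die = die_cost(WAFER_COST, dies_per_wafer=50, die_yield=0.4)

small_ic = ic_cost(small_die, test=1.75, package=2.0, final_yield=0.8)

if __name__ == "__main__":
    print(f"small die ${small_die:.2f}, large die ${large_die:.2f}")
    print(f"finished small IC ${small_ic:.2f}")
```

In this made-up case quadrupling the die area multiplies the die cost by eight, because both the number of dies per wafer and the die yield get worse at the same time.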

 

Testing and packaging

One can reduce test time and hence test cost by making the chip simpler, or by design for testability. Even using such techniques, however, testing costs can be a significant fraction of overall costs for high-end chips.

The cost of a chip's packaging depends mainly on its number of pins and its power dissipation capability. Plastic packages are cheaper than ceramic ones, but cannot handle as much heat.

Most high-end chips these days are packaged in Pin-Grid Arrays or Ball-Grid Arrays.

NOTE The dominant packaging technology of the seventies was dual inline packages (DIP). As the name suggests, this had leads on two sides of the package, which was rectangular.

Since modern chips have a lot more functionality, they need a lot more leads to the outside world. Modern packaging generally uses the entire periphery of the package for leads, so packages tend to be square in shape. As the number of pins required per chip has continued to rise, designers are now using the middle of the package as well.

Pin grid arrays (PGAs) have more than one row of pins on each side. Those holding the latest microprocessors are often (literally) the hottest: e.g. a 300 MHz Alpha 21164 dissipates 50 Watts. This requires exotic fins for cooling.

Two recent packaging technologies are ball-grid arrays (BGA) and tape automated bonding (TAB). BGA uses small balls of solder to connect the package to the board, while TAB connects leads from chips directly to the printed circuit board or multichip module without a conventional package.

 

Packaging

Packaging is also important above the chip level. Making the computer smaller (increasing the level of integration) means fewer interfaces, fewer pins (fewer failure points), fewer layers in the PCB, smaller cabinets, power supplies etc.

Discontinuities are important: moving from several boards to one board, or from several chips to one chip, can cause large reductions in cost.

If you get rid of a level of packaging (e.g. the card cage), the signals travel faster, and the same chips can yield a faster system.

NOTE Some microprocessors designed for low end systems (e.g. Sun's MicroSPARC 2, DEC's Alpha 21066, Cyrix's MediaGX family) put the system crossbar, the memory controller, and a mezzanine bus controller on the same chip as the CPU. This increases the cost of the chip slightly, but significantly reduces the cost of the overall system.

 

Power consumption

Power consumption is an important issue for notebooks, since it determines battery life. It is important for other systems because the main factor besides yield limiting chip size is cooling, and every watt consumed by a chip has to be dissipated.

Power consumption is higher in ECL/GaAs than in CMOS. In CMOS designs it goes up with density, frequency, and voltage. Since progress depends on increasing density and frequency, operating voltages have gone from 5V in the eighties to 3.3V in the early nineties, and many current CPUs are in the 2.xV range.
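The effect of lowering the supply voltage can be estimated with the usual first-order model for CMOS switching power, P proportional to C * V^2 * f. That model is a standard approximation brought in for this example, not something stated in these notes, and it ignores leakage. The function name and default arguments are illustrative.

```python
def relative_dynamic_power(v_old, v_new, f_old=1.0, f_new=1.0,
                           c_old=1.0, c_new=1.0):
    """Ratio of new to old switching power under P ~ C * V**2 * f."""
    return (c_new * v_new**2 * f_new) / (c_old * v_old**2 * f_old)

# Same chip and clock, supply lowered from 5 V to 3.3 V.
voltage_only = relative_dynamic_power(5.0, 3.3)

# Same voltage change, but the clock frequency is doubled as well.
voltage_and_clock = relative_dynamic_power(5.0, 3.3, f_old=1.0, f_new=2.0)

if __name__ == "__main__":
    print(f"5 V -> 3.3 V at the same clock: {voltage_only:.2f}x the power")
    print(f"5 V -> 3.3 V at twice the clock: {voltage_and_clock:.2f}x the power")
```

Under this model the move from 5V to 3.3V cuts switching power to about 44% of its previous value, which leaves room to double the clock frequency while still using less power than before.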

 

Design costs

The MIPS R4000 design team peaked at 58 people; the design cost was about $30M. The Intel 486's ~$100M design cost is actually lower per chip sold due to the much larger market size.

The design of RISC chips is still easier than that of CISC chips for equivalent performance levels. Design times and costs are increasing for all types of CPUs due to pressure for superscalar, out-of-order etc implementations.

To keep schedules feasible, vendors use as much automation as possible.

NOTE The designs of the new generation of faster, denser chips (Intel's Pentium and Pentium Pro, the Pentium competitors from AMD and Cyrix, the MIPS R10000, the HP PA-8000, the new IBM PowerPC processors etc) have each probably cost significantly more than $100M. The Intel Pentium Pro design cost is almost certainly closer to half a billion dollars than to $100M.

Register transfer level (RTL) simulators help optimize machine organization. Floorplanners help place the major blocks.

Silicon compilers, logic synthesis tools etc can generate macrocell or gate-array logic from RT level descriptions, while layout and routing tools make custom design easier.

Designs are simulated at several levels (from behavioral to electrical) to verify correctness and test speed. Chips from first tapeout often work with minor problems, which may be detected by automatically generated test vectors.

The first chips may work correctly, but only at a lower speed. Then the designers must increase the speed at which the chip works to the target speed, and when they have done that, they must increase the yield at the target speed.

For example, AMD demonstrated a machine powered by a 50 MHz K5 in June 1995. The target speed of the initial version was 100 MHz, but AMD wasn't able to achieve that speed in volume until late 1996.

 

Cost distribution

A surprisingly small fraction of the cost of a non-PC computer is in the CPU.

The price of main memory can be very significant: it can be as much as half the total price.

The cost of disk and tape drives is significant in servers.

Expandability alone can almost double the price of a machine.

The cost of displays (especially for color, even more for large color) is very significant in workstations and in X-terminals.

NOTE One reason why memory prices can be significant is simply that many non-PC machines have lots of it; many machines these days support several gigabytes. Another reason is that some vendors sell memory for their systems at several times the per-megabyte price of PC memory.

For PCs, the two main cost drivers are the size and quality of the monitor and the speed of the CPU. Substituting a good 17 inch monitor for a bad 14 or 15 inch one can come close to doubling the price of a PC, and so can replacing a 166 MHz Pentium with a 400 MHz Pentium II.

 

Cost vs price

Price consists of several components:

component costs (15-33%)

direct costs (6-8%)
    labor costs
    scrap & warranty

gross margin (34-39%)
    R&D recovery
    marketing & sales
    cost of facilities
    finance cost, taxes
    profit

average discount (25-40%)
    volume discount
    OEM discount
    educational discount

NOTE The above numbers were derived from workstations and servers, and are probably not representative for other kinds of computers.

PC class machines are usually not subject to educational discounts, and even volume discounts tend to be small unless the volume is very large.
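The breakdown above can be turned into a small worked example. The sketch below assumes the percentages are fractions of list price (which the inclusion of the average discount suggests) and picks one value from within each quoted range so that the four add up to 100%. The chosen shares and the $2,500 component cost are illustrative only.

```python
# One share picked from within each quoted range; together they sum to 100%.
shares = {
    "component costs":  0.25,   # quoted range 15-33%
    "direct costs":     0.07,   # quoted range 6-8%
    "gross margin":     0.36,   # quoted range 34-39%
    "average discount": 0.32,   # quoted range 25-40%
}

def list_price(component_cost, component_share):
    """List price implied by a component cost and its share of list price."""
    return component_cost / component_share

COMPONENT_COST = 2500.0  # hypothetical cost of the parts, in dollars

price = list_price(COMPONENT_COST, shares["component costs"])
breakdown = {name: share * price for name, share in shares.items()}
average_selling_price = price - breakdown["average discount"]

if __name__ == "__main__":
    print(f"list price ${price:,.0f}")
    for name, amount in breakdown.items():
        print(f"  {name:17s} ${amount:,.0f}")
    print(f"average selling price ${average_selling_price:,.0f}")
```

With these shares, parts costing $2,500 imply a list price of $10,000 and an average selling price of $6,800.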

 

Profit margins

Intel and Microsoft have margins of 27% and 30% respectively, because they are near-monopoly suppliers.

The very intense competition among component manufacturers and among companies that assemble PCs requires those companies to accept much thinner profit margins.

In the open systems market, customers can choose from several vendors, but then they depend on their vendor for upgrades. Users of proprietary OSs depend on their vendor completely.

NOTE With companies that sell machines with both Unix and proprietary OSs, the price of the same hardware can differ by a factor of two depending on the OS.

As of July 1998, Microsoft is the second most valuable company in the world - behind General Electric - in terms of stock market capitalization, i.e. the total market value of all the shares in the company, and Intel is the tenth, at 209 and 121 billion US dollars respectively. The reason why investors value their stocks so highly is that even though their sales per year lag behind the sales of many other big companies (11 billion US a year for Microsoft and 25 billion for Intel, compared with e.g. 178 billion for General Motors, whose capitalization is only 48 billion), their profit margins are among the largest, and investors expect their profits to rise significantly in the future. However, Microsoft and Intel are both under antitrust investigation.

When AMD started making 486s, Intel was ready with the Pentium. When Cyrix and AMD introduced competitors to low-end Pentiums, Intel moved the mass market to high-end Pentiums and later Pentium IIs. Their strategy is to try to move the market away from the performance levels that AMD, Cyrix and now IDT can reach, and to try to get buyers to pay a premium for the fastest CPUs, which only they can produce.




The above notes have been adapted from the Computer Design Lectures at the Computer Science Department, University of Melbourne.

malam@poboxes.com
Last modified: 9th of Aug, 1999