Newsgroups: comp.os.386bsd.development
Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!elroy.jpl.nasa.gov!usc!howland.reston.ans.net!ux1.cso.uiuc.edu!uwm.edu!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: assembly versions of bcopy, bcmp, memcpy, memmove, etc.
Message-ID: <1993May13.224012.8525@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <CONKLIN.93May12163441@ngai.kaleida.com>
Date: Thu, 13 May 93 22:40:12 GMT
Lines: 59

In article <CONKLIN.93May12163441@ngai.kaleida.com> conklin@kaleida.com writes:
>I have not done this myself, since my impression is that the data
>structures will already be aligned by the compiler (or malloc).  Why
>slow down the general case to handle a special case?  But maybe my
>impression is totally off-base and aligned access is the special case?
>
>I'd appreciate any facts (or opinions) that would either confirm or
>disprove my assumption.

Opinions:

The unaligned case is most frequent.

If alignment can be on 4 byte and 2 byte boundaries, and any byte offset
is equally likely:

	25%	Accessible on 4 byte boundaries
	50%	Accessible on 2 byte boundaries
	100%	Accessible on byte boundaries

In a str*cpy, both the source and the target must be examined:

	6.25%	Both src and dst on 4 byte boundary
	25%	Both src and dst on 2 byte boundary
	100%	Both src and dst on byte boundary

For most systems, the bus size is > 8 bits (16 for an SX, 32 for a DX).
On these machines, it is just as cheap to fetch 2 or 4 bytes at a time
as it is to fetch 1; special casing for aligned copies allows the use of
16 and 32 bit operands for the majority of the copy, reducing the amount
of bus time required to perform the operations.

Your mileage may vary if one or the other of the source or the target
forces totally unaligned access (75% of the time for short words, 93.75%
of the time for long words).
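
A rough C sketch of the special casing described above, not code from
this thread or from any libc. The names common_width and copy_by_width
are made up for illustration. It assumes the low bits of a pointer
converted to uintptr_t reflect its alignment (true on the 386, not
promised by the C standard), and it moves each word through a fixed-size
memcpy, which a compiler will typically turn into one load and one
store, rather than through a cast pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widest unit (4, 2 or 1 bytes) on which BOTH
 * pointers are aligned.  Assumes pointer low bits give the alignment. */
static size_t common_width(const void *dst, const void *src)
{
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;

    if ((bits & 3u) == 0)
        return 4;               /* both on 4 byte boundaries */
    if ((bits & 1u) == 0)
        return 2;               /* both on 2 byte boundaries */
    return 1;                   /* at least one is on an odd address */
}

/* Hypothetical forward copy for NON-overlapping buffers. */
static void *copy_by_width(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t w = common_width(dst, src);

    if (w == 4) {
        /* Bulk of the copy in 32 bit units. */
        for (; n >= 4; n -= 4, d += 4, s += 4) {
            uint32_t t;
            memcpy(&t, s, 4);
            memcpy(d, &t, 4);
        }
    } else if (w == 2) {
        /* Bulk of the copy in 16 bit units. */
        for (; n >= 2; n -= 2, d += 2, s += 2) {
            uint16_t t;
            memcpy(&t, s, 2);
            memcpy(d, &t, 2);
        }
    }

    /* Leftover tail, or the whole copy in the unaligned case. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Only the check at the top is extra work for the unaligned case; whether
that is worth paying depends on how often the wide paths are actually
taken, which is what the percentages above are about.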

Generally, the compiler will have directives to coerce all global data
to 4 byte boundaries (loose structure packing) to "compile for speed".
For aligned data, this can be a significant speed improvement (1/2 or
1/4 of the copy time depending on which optimization can be used).

For length calculation, an int fetch and 4 ands is faster than 4 byte
fetches and 4 ands.  The condition code must be checked in both cases
all 4 times, so the savings is dependent on bus speed and wait states.
This is if you are looking for a null termination.

If you have the choice, implement at least the 16 bit transfers; the 6%
chance of a 32 bit aligned copy on non-coerced data isn't worth the
overhead unless you have a high average string length.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me