*BSD News Article 16048

Newsgroups: comp.os.386bsd.development
Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!elroy.jpl.nasa.gov!usc!howland.reston.ans.net!ux1.cso.uiuc.edu!uwm.edu!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: assembly versions of bcopy, bcmp, memcpy, memmove, etc.
Message-ID: <1993May13.224012.8525@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <CONKLIN.93May12163441@ngai.kaleida.com>
Date: Thu, 13 May 93 22:40:12 GMT
Lines: 59

In article <CONKLIN.93May12163441@ngai.kaleida.com> conklin@kaleida.com writes:
>I have not done this myself, since my impression is that the data
>structures will already be aligned by the compiler (or malloc).  Why
>slow down the general case to handle a special case?  But maybe my
>impression is totally off-base and aligned access is the special case?
>
>I'd appreciate any facts (or opinions) that would either confirm or
>disprove my assumption.

Opinions:

The unaligned case is the most frequent.  If addresses are spread evenly
and alignment can fall on 4 byte or 2 byte boundaries:

	 25%	Accessible on 4 byte boundaries
	 50%	Accessible on 2 byte boundaries
	100%	Accessible on byte boundaries

In a str*cpy, both the source and the target must be examined:

	  6.25%	Both src and dst on 4 byte boundary
	 25%	Both src and dst on 2 byte boundary
	100%	Both src and dst on byte boundary
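
The figures above can be checked by brute force.  This is only a sketch:
both_aligned() is a name made up for illustration, and it assumes that
every src and dst offset (taken modulo 4) is equally likely, which real
workloads may not honor.

```c
/* Count the (src, dst) offset pairs, each taken modulo 4, in which both
 * offsets are multiples of `align`.  There are 16 pairs in total.
 * Assumption: all 16 pairs are equally likely. */
int both_aligned(unsigned align)
{
    int count = 0;
    unsigned src, dst;

    for (src = 0; src < 4; src++)
        for (dst = 0; dst < 4; dst++)
            if (src % align == 0 && dst % align == 0)
                count++;
    return count;
}
```

both_aligned(4), both_aligned(2) and both_aligned(1) give 1, 4 and 16
pairs out of 16, which is where 6.25%, 25% and 100% come from.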

For most systems, the bus size is > 8 bits (16 for an SX, 32 for a DX).
On these machines, it is just as cheap to fetch 2 or 4 bytes at a time
as it is to fetch 1; special casing for aligned copies allows the use
of 16 and 32 bit operands for the majority of the copy, reducing the
amount of bus time required to perform the operations.
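
As a rough illustration of the special casing, here is a C sketch rather
than the assembly under discussion.  copy_widest() is a hypothetical
name, not a libc routine; it assumes the regions do not overlap, and it
assumes the compiler lowers each fixed-size memcpy() to a single 16 or
32 bit move, which is typical but not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: forward copy using the widest operand that both
 * pointers are aligned for.  The regions must not overlap. */
void *copy_widest(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* OR the addresses so one test covers both src and dst. */
    uintptr_t both = (uintptr_t)d | (uintptr_t)s;

    if ((both & 3) == 0) {
        /* Both on 4 byte boundaries: move 32 bits at a time. */
        for (; n >= 4; n -= 4, d += 4, s += 4) {
            uint32_t w;
            memcpy(&w, s, sizeof w);   /* stands in for one 32 bit load */
            memcpy(d, &w, sizeof w);   /* stands in for one 32 bit store */
        }
    } else if ((both & 1) == 0) {
        /* Both on 2 byte boundaries: move 16 bits at a time. */
        for (; n >= 2; n -= 2, d += 2, s += 2) {
            uint16_t h;
            memcpy(&h, s, sizeof h);
            memcpy(d, &h, sizeof h);
        }
    }
    /* Unaligned case, or the leftover tail: one byte at a time. */
    for (; n > 0; n--)
        *d++ = *s++;
    return dst;
}
```

The alignment test is paid once per call, so its cost matters most for
short copies.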

Your mileage may vary if either the source or the target forces
unaligned access (75% of the time for short words, 93.75% of the time
for long words).

Generally, the compiler will have directives to coerce all global data
to 4 byte boundaries (loose structure packing) to "compile for speed".

For aligned data, this can be a significant speed improvement (1/2 or 1/4
of the copy time depending on which optimization can be used).

For length calculation, an int fetch and 4 ands is faster than 4 byte
fetches and 4 ands.  The condition code must be checked all 4 times in
both cases, so the savings is dependent on bus speed and wait states.
This applies when you are looking for a null termination.
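
A C sketch of the "one int fetch, 4 ands" scan follows.  strlen_by_word()
is a hypothetical name.  Note the loud assumption: the word fetch can
read up to 3 bytes past the terminator.  Because the fetch is on a
4 byte boundary those bytes sit in the same aligned word, but ISO C does
not promise they belong to the string's object, so treat this as an
illustration of the technique and not as a drop-in strlen().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: string length with one 32 bit fetch per 4 bytes.
 * Assumption: reading the rest of the aligned word that holds the
 * terminator is harmless on the target machine. */
size_t strlen_by_word(const char *s)
{
    const char *p = s;

    /* Byte steps until p sits on a 4 byte boundary. */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);       /* stands in for one int fetch */
        /* Four ands, one per byte lane, each with its own test. */
        if ((w & 0x000000ffu) == 0 || (w & 0x0000ff00u) == 0 ||
            (w & 0x00ff0000u) == 0 || (w & 0xff000000u) == 0)
            break;
        p += 4;
    }
    /* A NUL is somewhere in the word at p.  Finish bytewise so the
     * result does not depend on byte order. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The number of tests is the same as in a bytewise loop; only the number
of fetches drops, which is why the gain depends on bus speed and wait
states.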

If you have the choice, implement at least the 16 bit transfers; the
6.25% hit rate for 32 bit transfers on a non-coerced alignment isn't
worth the overhead unless you have a high average string length.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me