*BSD News Article 73357

Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!howland.reston.ans.net!news.sprintlink.net!news-stk-200.sprintlink.net!news.sprintlink.net!news-stk-11.sprintlink.net!news.sprintlink.net!new-news.sprintlink.net!server1.nw.ixe.net!server1.adam.ixe.net!wirehub!Leiden.NL.net!sun4nl!surfnet.nl!swsbe6.switch.ch!scsing.switch.ch!news.rccn.net!master.di.fc.ul.pt!usenet
From: Pedro Roque Marques <roque@di.fc.ul.pt>
Newsgroups: comp.os.linux.networking,comp.unix.bsd.netbsd.misc,comp.unix.bsd.freebsd.misc
Subject: Re: TCP latency
Date: 10 Jul 1996 23:39:59 +0100
Organization: Faculdade de Ciencias da Universidade de Lisboa
Lines: 133
Sender: roque@oberon.di.fc.ul.pt
Message-ID: <x74tnfn35s.fsf@oberon.di.fc.ul.pt>
References: <4paedl$4bm@engnews2.Eng.Sun.COM>
	<31D2F0C6.167EB0E7@inuxs.att.com> <4rfkje$am5@linux.cs.Helsinki.FI>
	<31DC8EBA.41C67EA6@dyson.iquest.net> <4rlf6i$c5f@linux.cs.Helsinki.FI>
	<31DEA3A3.41C67EA6@dyson.iquest.net> <Du681x.2Gy@kroete2.freinet.de>
	<31DFEB02.41C67EA6@dyson.iquest.net>
	<4rpdtn$30b@symiserver2.symantec.com>
	<x7ohlq78wt.fsf@oberon.di.fc.ul.pt>
	<Pine.LNX.3.91.960709020017.19115I-100000@reflections.mindspring.com>
NNTP-Posting-Host: oberon.di.fc.ul.pt
Mime-Version: 1.0 (generated by tm-edit 7.69)
Content-Type: text/plain; charset=US-ASCII
X-Newsreader: Gnus v5.2.25/XEmacs 19.14
Xref: euryale.cc.adfa.oz.au comp.os.linux.networking:44726 comp.unix.bsd.netbsd.misc:3991 comp.unix.bsd.freebsd.misc:23248

>>>>> "Todd" == Todd Graham Lewis <tlewis@mindspring.com> writes:


    Todd> On 8 Jul 1996, Pedro Roque Marques wrote:

    >> >>>>> "tedm" == tedm <tedm@agora.rdrop.com> writes:
    >> 
    tedm> I feel this has gotten so academic that it is meaningless.
    >>  I for one am so tired of seeing non-technical arguments on
    >> Usenet about supposedly technical issues that I don't ever
    >> consider a discussion to get "too academic".

    Todd> Then let's get to it.  A steak dinner at the next NANOG or
    Todd> LISA, my treat, to anyone who makes a substantial
    Todd> contribution to this thread.  There are two questions in
    Todd> this message; answer them if you have the time.

With such a prize I figure you'll get a lot of answers ;-)

    >> Having good throughput and/or latency in TCP is much harder than
    >> most people believe.

    Todd> Question #1: Which aspects of network performance under
    Todd> FreeBSD and/or Linux are most in need of improvement?  Extra
    Todd> credit for well-reasoned answers.  If latency is
    Todd> unimportant, which I don't think anyone is seriously
    Todd> asserting, then what else is important, to which everyone
    Todd> should have an answer.

I can only answer in terms of Linux. I have never looked at the
FreeBSD code in particular, although I believe it is largely based on
BSD Net/3.

I think we should split our view of the TCP/IP stack into three areas:
TCP, routing table lookup/maintenance, and UDP performance.

- TCP

I believe Linux TCP can be improved on all fronts :-). It's also
important here to distinguish between raw performance in loopback or
on fast networks, behaviour over delayed/congested/lossy pipes, and
scalability to large servers. Raw performance is usually achieved by
spending the fewest CPU cycles to send a packet, so it shouldn't
conflict with the other two goals. The other two raise a very tricky
question: how to do timers? As I mentioned in a previous post,
BSD-style timers tend to be cheaper, but they are less accurate than
one would normally desire. One of the finest remarks I've heard on
this issue was "if most of the time the retransmit timer won't
expire, why set it in the first place?" The tough part is that when
the timer does expire, we want it to expire with the greatest
precision we can achieve.
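
To make the cost/accuracy trade-off concrete, here is a minimal
sketch (in Python, not kernel code) of a tick-counter timer next to an
exact one. It assumes a global tick every 500 ms, as in the BSD slow
timeout; the function names are made up for the illustration.

```python
import math

def coarse_timer_fire_ms(set_at_ms, timeout_ms, tick_ms=500):
    """Time at which a tick-counter timer set at set_at_ms fires.

    Assumption: ticks happen at multiples of tick_ms, and the timer is a
    counter that is decremented on every tick and fires when it hits 0.
    """
    # The timeout is rounded up to a whole number of ticks.
    ticks = max(1, math.ceil(timeout_ms / tick_ms))
    # First global tick strictly after the moment the timer was set.
    next_tick = (set_at_ms // tick_ms + 1) * tick_ms
    return next_tick + (ticks - 1) * tick_ms

def exact_timer_fire_ms(set_at_ms, timeout_ms):
    """A per-connection timer that fires exactly timeout_ms later."""
    return set_at_ms + timeout_ms

# A nominal 1000 ms timeout, set just before and just after a tick:
early = coarse_timer_fire_ms(499, 1000) - 499   # 501 ms elapsed
late = coarse_timer_fire_ms(501, 1000) - 501    # 999 ms elapsed
```

Under these assumptions the coarse timer can fire up to one tick
early, which is cheap to maintain but is exactly the imprecision
discussed above.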

But back to Linux improvements: on performance I think we can improve
things a lot by having a simpler way of handling the send and receive
queues and by doing header prediction (these two can account for a 30%
improvement in localhost throughput on a 486/66). Another thing coming
up is the change in the way Linux on ix86 handles user and kernel
segments; rumor has it that up to one CPU cycle per memory access can
be saved when copying between user and kernel space, which will be
reflected in the general performance of the network stack. For
instance, Linus posted a profile strace of bw_tcp on localhost some
time ago in which memcpy_to/from user space shows up as using around
90% of the CPU. Also we should fix some details, like only sending an
ack after checking the transmit queue (this is actually required by
the specs), and get a final agreement on how we should do delayed acks
(this detail shows up very nicely in the lat_tcp test). Instead of
doing as BSD does, delaying acks for some random time between 0 and
200 ms, Linux tries to predict the appropriate ack delay time
according to RFC 813 (as suggested in RFC 1122), which is quite tricky
to get right.
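
For readers unfamiliar with header prediction, the following is a
rough Python sketch of the test as it is usually described for the BSD
stack; it is not taken from the Linux source. The Conn and Segment
types and their field names are assumptions for the illustration, and
sequence number wraparound is ignored.

```python
from dataclasses import dataclass

FIN, SYN, RST, ACK, URG = 0x01, 0x02, 0x04, 0x10, 0x20

@dataclass
class Conn:
    established: bool
    rcv_nxt: int   # next sequence number we expect to receive
    snd_una: int   # oldest unacknowledged sequence number
    snd_nxt: int   # next sequence number we will send
    snd_max: int   # highest sequence number sent so far
    snd_wnd: int   # send window last advertised by the peer
    reassembly_queue_empty: bool = True

@dataclass
class Segment:
    seq: int
    ack: int
    flags: int
    window: int
    length: int    # payload bytes

def predict(conn, seg):
    """Return 'pure_ack', 'pure_data', or None (take the slow path)."""
    common = (
        conn.established
        # ACK set; FIN, SYN, RST and URG all clear.
        and (seg.flags & (ACK | FIN | SYN | RST | URG)) == ACK
        # The segment is exactly the next one expected.
        and seg.seq == conn.rcv_nxt
        # The peer's window is non-zero and unchanged.
        and seg.window != 0 and seg.window == conn.snd_wnd
        # We are not in the middle of a retransmission.
        and conn.snd_nxt == conn.snd_max
    )
    if not common:
        return None
    if seg.length == 0:
        # No data: fast path only if the ack covers new data we sent.
        if conn.snd_una < seg.ack <= conn.snd_max:
            return "pure_ack"
        return None
    # Data: fast path only if it acks nothing new and nothing is
    # waiting out of order.
    if seg.ack == conn.snd_una and conn.reassembly_queue_empty:
        return "pure_data"
    return None
```

On a bulk transfer almost every segment should fall into one of the
two fast cases, which is why the check is worth doing before the
general TCP input processing.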

When we have this done, if we are still unhappy with raw performance,
we can adopt Van Jacobson's way of receiving TCP segments: a network
interrupt delivers the packet directly to the socket receive queue,
the process wakes, goes through the IP and TCP tests and copies the
buffer to its address space while checksumming. You can hardly do it
faster... trouble is that it breaks layering. The Linux network stack
has a very simple and IMHO elegant layering that it would be a shame
to break, but...

As for TCP behaviour under delayed/congested paths ... we've fixed the
rto estimate already ;-) (although only lately). I think that assuring
code correctness, i.e. finding bugs, should be the greatest concern in
the short term. One of the things I've been experimenting with is TCP
Vegas congestion avoidance, but I don't have any results yet on
whether it actually behaves better than Van Jacobson's original
algorithm. Then there is a whole set of "little" fixes, like fast
retransmit with some corrections by Sally Floyd, that are quite
important (this one is in already).
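
As a reference point for the rto estimate mentioned above, here is a
small Python sketch of the mean-plus-deviation estimator usually
attributed to Van Jacobson. It is not the Linux code. The gains of 1/8
and 1/4 and the factor of 4 are the commonly quoted values, and the
way the first sample initialises the state is an assumption of this
sketch.

```python
class RttEstimator:
    """Smoothed RTT and mean deviation, in milliseconds."""

    def __init__(self):
        self.srtt = None     # smoothed round-trip time
        self.rttvar = None   # smoothed mean deviation

    def update(self, sample_ms):
        """Feed one RTT measurement and return the new RTO."""
        if self.srtt is None:
            # Assumption: seed from the first sample.
            self.srtt = float(sample_ms)
            self.rttvar = sample_ms / 2.0
        else:
            err = sample_ms - self.srtt
            self.srtt += err / 8.0
            self.rttvar += (abs(err) - self.rttvar) / 4.0
        return self.rto()

    def rto(self):
        return self.srtt + 4.0 * self.rttvar

est = RttEstimator()
first = est.update(100)    # 300.0
steady = est.update(100)   # 250.0
spike = est.update(500)    # a single slow sample raises the RTO sharply
```

The point of tracking the deviation is visible in the last line: one
delayed sample widens the timeout much more than it moves the mean,
which avoids spurious retransmits on paths with variable delay.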

As for big WWW/FTP server support, your biggest enemies are the timer
setting overhead and sockets in TIME-WAIT and SYN-RECV that occupy
memory that could otherwise be free.
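
A back-of-the-envelope sketch of the TIME-WAIT cost, in Python. It
assumes the server is the side that closes first (so it is the one
left in TIME-WAIT) and that the state is held for 2*MSL. The
connection rate, MSL and per-socket size in the example are
hypothetical numbers chosen only to show the arithmetic.

```python
def time_wait_sockets(conn_per_sec, msl_sec):
    """Steady-state number of sockets sitting in TIME-WAIT.

    Each closed connection is held for 2*MSL, so the population is
    arrival rate times holding time.
    """
    return conn_per_sec * 2 * msl_sec

def time_wait_memory_bytes(conn_per_sec, msl_sec, bytes_per_socket):
    """Memory pinned by those sockets, for a given per-socket size."""
    return time_wait_sockets(conn_per_sec, msl_sec) * bytes_per_socket

# Hypothetical load: 100 connections/s, MSL of 30 s, 512 bytes/socket.
sockets = time_wait_sockets(100, 30)             # 6000
memory = time_wait_memory_bytes(100, 30, 512)    # 3072000
```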

- As for routing table maintenance/lookup, I'm not familiar enough
with it, but I think that in 2.0 Linux has reasonably good/fast code.
Rumor is that the big plans are to support part of the routing table
in user space (which has the great advantage of being swappable).

- UDP performance. I think the major improvement will be on the ix86
with the new way of accessing user memory. Checksum and copy is a very
good thing (TM). The price of csum and copy is close to the price of a
memcpy. The rest of the code involved in a send is mainly a routing
table lookup (with its cached link layer address for the next hop).
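
The idea behind checksum and copy is to touch each byte once instead
of twice. A minimal Python sketch of the technique follows; the real
thing is a hand-tuned kernel routine, and the function name here is
made up. The checksum is the standard 16-bit one's complement Internet
checksum.

```python
def csum_and_copy(src, dst, offset=0):
    """Copy src into dst[offset:] and return the Internet checksum of src.

    Both jobs are done in a single pass over the data.
    """
    total = 0
    n = len(src)
    # Walk the buffer one 16-bit big-endian word at a time.
    for i in range(0, n - 1, 2):
        dst[offset + i] = src[i]
        dst[offset + i + 1] = src[i + 1]
        total += (src[i] << 8) | src[i + 1]
    if n % 2:
        # An odd trailing byte is padded with a zero byte for the sum.
        dst[offset + n - 1] = src[n - 1]
        total += src[n - 1] << 8
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = bytes.fromhex("0001f203f4f5f6f7")
user_buf = bytearray(len(payload))
csum = csum_and_copy(payload, user_buf)   # 0x220d
```

Since the loop is bound by memory access rather than by the additions,
the combined pass costs about the same as the copy alone, which is the
claim made above.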

    Todd> Question #2: Does latency optimization have ancillary
    Todd> benefits in terms of general code robustness or quality?

Generally yes. To reduce the latency you must look into reducing the
costs involved in sending and receiving a packet. Remember that in
this case the CPU costs are mainly in the TCP processing functions.

!Warning: This doesn't mean that the latency figures are a valid way
of comparing the quality of two distinct systems!.

    Todd> As a good neighbor, I ask you to do us a favor: Please don't
    Todd> storm off in a huff muttering about "Those stupid Linux
    Todd> guys."  If you have input to make on the network
    Todd> optimization issue(s), then I would love to hear it.

Hum... the "ignorant mob" reputation Linux has in some circles. Rumor
has it that there are actually some T-shirts with the saying: "Linux:
putting power in the wrong hands". I'd like one myself, since I find
it a good joke on both the criticised and the critics. :-)


    Todd> We're listening, and we're curious.  (And we're offering
    Todd> steak dinners.)

Do I win a beer at least? :-) I've surely tired my fingers out writing
this much rubbish ;-)

regards,
  Pedro.