Discussion:
lock-free data structures
z***@globalnet.hr
2006-01-02 18:56:34 UTC
Permalink
Sorry if I'm a bit off-topic, but this is the only kernel-developers
mailing list that I'm subscribed to. The question is a practical one and
I believe that kernel developers can give the most informed answer.

Recently I've come across many papers describing the construction of
lock-free data structures using compare&exchange instructions. Such
implementations do not require explicit locks (mutexes) in an MP
environment. The advantages are more (no explicit locking needed) or
less (performance gains) obvious, so the question arises: why aren't
such data structures used more widely (in general and in the kernel
code?)
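
For concreteness, the basic building block in those papers is a read /
compute / compare&exchange retry loop; here is a minimal sketch (C11
atomics, purely illustrative and not taken from any of the papers):

    #include <stdatomic.h>

    static _Atomic unsigned long counter;

    /* Publish the new value only if nobody changed the location in the
     * meantime; otherwise retry.  On failure the CAS reloads `old' with
     * the current value, so the loop just recomputes and tries again. */
    static void
    counter_add(unsigned long delta)
    {
        unsigned long old = atomic_load(&counter);

        while (!atomic_compare_exchange_weak(&counter, &old, old + delta))
            ;
    }

The same loop structure updates list heads, queue tails and so on; the
papers are mostly about extending it safely to whole structures such as
linked lists.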

I know that Linux has adopted RCU, but again the example code for the
linked list traversal calls rcu_read_lock() / rcu_read_unlock() (thus,
not fulfilling the promise of lock-free code). Why didn't BSDs adopt RCU?
[ Please, I'm not trying to start a Linux vs. BSD war! This is a simple and
honest question related to my general inquiry. ]

Bottom line: what's the "catch" with lock-free data structures?

An immediate drawback is that not all architectures support the cmpxchg
instruction. What are others? Is the possible performance gain negligible
compared to the required effort of recoding a large volume of code?

Thanks for your answers.
Ignatios Souvatzis
2006-01-02 19:16:34 UTC
Permalink
hi,
Post by z***@globalnet.hr
Bottom line: what's the "catch" with lock-free data structures?
An immediate drawback is that not all architectures support the cmpxchg
instruction.
Exactly.

-is
Chapman Flack
2006-01-02 23:38:24 UTC
Permalink
Post by z***@globalnet.hr
An immediate drawback is that not all architectures support the cmpxchg
instruction.
Out of curiosity, anybody have an idea what fraction of NetBSD
supported MP architectures (maybe weighted by prominence?) have a
cmpxchg instruction (or a loadlinked/storeconditional, lwarx/stwcx
or other feature that can be used to the purpose)?

I notice that the JACK Audio Connection Kit (which it might be nice
to be able to port) likes to do userspace lock-free stuff with shared
memory.

-Chap
Garrett D'Amore
2006-01-03 02:03:47 UTC
Permalink
Post by Chapman Flack
Post by z***@globalnet.hr
An immediate drawback is that not all architectures support the cmpxchg
instruction.
Out of curiosity, anybody have an idea what fraction of NetBSD
supported MP architectures (maybe weighted by prominence?) have a
cmpxchg instruction (or a loadlinked/storeconditional, lwarx/stwcx
or other feature that can be used to the purpose)?
Isn't this the "normal" way of handling locking on SMP class machines?
(Actually, I'd sort of guess that even a lot of non-SMP architectures
have this kind of instruction e.g. for synchronizing interrupts, etc.)

So I guess it may be more interesting to hear about any counter
examples, where this kind of instruction cannot be used, especially on
SMP class machines. (On uniprocessor systems its easy to see how to
simulate this with a biglock.)

-- Garrett
Post by Chapman Flack
I notice that the JACK Audio Connection Kit (which it might be nice
to be able to port) likes to do userspace lock-free stuff with shared
memory.
-Chap
--
Garrett D'Amore http://www.tadpolecomputer.com/
Sr. Staff Engineer Extending the Power of 64-bit UNIX Computing
Tadpole Computer, Inc. Phone: (951) 325-2134
Brett Lymn
2006-01-03 11:43:17 UTC
Permalink
Post by Garrett D'Amore
Isn't this the "normal" way of handling locking on SMP class
machines?
Well "normal" for handling locking - you have a basic instruction that
can perform an "atomic" operation. You can use a thing called
Peterson's Algorithm to perform the atomic lock without requiring
underlying hardware support.
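
For two threads the algorithm is tiny; a sketch (written with C11
atomics so the compiler and CPU keep the ordering - the classic
formulation with plain loads and stores assumes strongly ordered
memory):

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic bool flag[2];   /* flag[i]: thread i wants the lock */
    static _Atomic int  turn;      /* whose turn it is to wait */

    static void
    peterson_lock(int id)          /* id is 0 or 1 */
    {
        int other = 1 - id;

        atomic_store(&flag[id], true);   /* announce intent */
        atomic_store(&turn, other);      /* defer to the other thread */
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
            ;                            /* busy-wait */
    }

    static void
    peterson_unlock(int id)
    {
        atomic_store(&flag[id], false);
    }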
Post by Garrett D'Amore
So I guess it may be more interesting to hear about any counter
examples, where this kind of instruction cannot be used, especially on
SMP class machines. (On uniprocessor systems its easy to see how to
simulate this with a biglock.)
An atomic lock is really just a building block. They can be useful in
some cases for gating access to a critical section, but using them
exclusively could have a performance impact. They can also be used to
build more sophisticated locking schemes, like a reader-writer lock,
where you allow multiple readers to acquire a lock on an object (for
concurrent access to static information) but enforce only one writer
to the object. The writer requests the write lock, waits for all
readers to unlock and then does the write - any readers coming after
the write lock is asserted wait for the writer to unlock (of course,
this example is simplified due to lack of deadlock detection :). You
could easily do the same with a simple atomic lock but it would mean
that all read accesses would be serialised, decreasing performance in
some circumstances.
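
A rough sketch of that reader-writer idea built on a single atomic word
(simplified: it spins, and a steady stream of readers can starve a
waiting writer):

    #include <stdatomic.h>

    #define WRITER  (1u << 31)  /* high bit set: a writer holds the lock */

    static _Atomic unsigned rw; /* otherwise: number of active readers */

    static void
    read_lock(void)
    {
        unsigned c = atomic_load(&rw);

        for (;;) {
            if (!(c & WRITER) &&
                atomic_compare_exchange_weak(&rw, &c, c + 1))
                return;                 /* joined the readers */
            c = atomic_load(&rw);       /* writer active or CAS raced */
        }
    }

    static void
    read_unlock(void)
    {
        atomic_fetch_sub(&rw, 1);
    }

    static void
    write_lock(void)
    {
        unsigned expect = 0;            /* need: no readers, no writer */

        while (!atomic_compare_exchange_weak(&rw, &expect, WRITER))
            expect = 0;
    }

    static void
    write_unlock(void)
    {
        atomic_store(&rw, 0);
    }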

Of course, not having to perform the lock dance in the first place
would be an even bigger win.
--
Brett Lymn
David Laight
2006-01-05 21:36:59 UTC
Permalink
Post by Brett Lymn
Post by Garrett D'Amore
Isn't this the "normal" way of handling locking on SMP class
machines?
Well "normal" for handling locking - you have a basic instruction that
can perform an "atomic" operation. You can use a thing called
Peterson's Algorithm to perform the atomic lock without requiring
underlying hardware support.
If that is the algorithm I think it might be, it doesn't work on all
architectures. Specifically, it doesn't work on most sparc cpus.

The thing is that for any sort of compare+exchange instruction to be
useful it must do a locked bus cycle - so has the same basic cost
as acquiring a lock. Using a lock is much more portable.

Some CPUs have a rmw instruction, but the data written and/or the read
value is constrained in some way - so you can't do an actual exchange.
IIRC at least one cpu needs locks to be initialised to ~0.

David
--
David Laight: ***@l8s.co.uk
j***@dsg.stanford.edu
2006-01-05 22:09:18 UTC
Permalink
Post by David Laight
Post by Brett Lymn
Post by Garrett D'Amore
Isn't this the "normal" way of handling locking on SMP class
machines?
Well "normal" for handling locking - you have a basic instruction that
can perform an "atomic" operation. You can use a thing called
Peterson's Algorithm to perform the atomic lock without requiring
underlying hardware support.
If that is the algorithm I think it might be, it doesn't work on all
architectures. Specifically, it doesn't work on most sparc cpus.
.. I think what you are trying to say is something like this:

Peterson's algorithm works only on CPUs with strong memory
ordering. One example of systems where this assumption does not hold
is SPARC: Peterson's algorithm doesn't work on multiprocessor
configurations with store buffers.

For an authoritative discussion of the SPARC issues, see:
http://docs.sun.com/app/docs/doc/816-5137/6mba5vpl5?a=view
Post by David Laight
The thing is that for any sort of compare+exchange instruction to be
useful it must do a locked bus cycle
I'm boggled. David, Have you never heard of ll/sc, or are you trying
to say that an atomic compare+exchange synthesized from ll/sc
instructions is not _an_ (i.e., one) instruction?
Post by David Laight
- so has the same basic cost
as acquiring a lock.
No, that's just wrong. You are conflating the costs of a _software_
lock, with the costs of a hardware atomic operation, or attempted
atomic operation. On some systems (like mips or alpha, which combine
an ll/sc pair, which may fail, with a software loop to retry), there
isn't a 1:1 match between the two.

For a more sophisticated model, you might consider the total costs on
the memory system, which (may) impact *ALL* processors within a UMA
domain. When considering the r4k implementation (which, if memory
serves, allowed one issued LL per CPU, and one comparator per CPU
dedicated to LL by snooping the cache-coherency protocol for writes to
the cacheline holding the LL address: the following SC fails if the
comparator saw a remote CPU frob the LL-pending line), that model
gives a much more useful intuition about the costs of synchronization.

Apologies if I misremember the exact details, its been... what, a decade?
David Laight
2006-01-05 22:25:45 UTC
Permalink
Post by j***@dsg.stanford.edu
Peterson's algorithm works only on CPUs with strong memory
ordering. One example of systems where this assumption does not hold
is SPARC: Peterson's algorithm doesn't work on multiprocessor
configurations with store buffers.
Sounds good....
Post by j***@dsg.stanford.edu
Post by David Laight
The thing is that for any sort of compare+exchange instruction to be
useful it must do a locked bus cycle
I'm boggled. David, Have you never heard of ll/sc, or are you trying
to say that an atomic compare+exchange synthesized from ll/sc
instructions is not _an_ (i.e., one) instruction?
Post by David Laight
- so has the same basic cost
as acquiring a lock.
No, that's just wrong. You are conflating the costs of a _software_
lock, with the costs of a hardware atomic operation, or attempted
atomic operation. On some systems (like mips or alpha, which combine
an ll/sc pair, which may fail, with a software loop to retry), there
isn't a 1:1 match between the two.
For a more sophisticated model, you might consider the total costs on
the memory system, which (may) impact *ALL* processors within a UMA
domain. When considering the r4k implementation (which, if memory
serves, allowed one issued LL per CPU, and one comparator per CPU
dedicated to LL by snooping the cache-coherency protocol for writes to
the cacheline holding the LL address: the following SC fails if the
comparator saw a remote CPU frob the LL-pending line), that model
gives a much more useful intuition about the costs of synchronization.
Sounds like what I had in mind...
Post by j***@dsg.stanford.edu
Apologies if I misremember the exact details, its been... what, a decade?
What I intended to say is that the rmw cycle done by an exchange instruction
(or a compare and exchange) is much the same expensive operation as
acquiring a lock (given you expect it to be uncontended).

The lock release is typically free - since it gets snooped by the next
request.

There is also the issue that with expensive (ie slow) hardware rmw cycles
the lock memory location itself becomes a 'hotspot' so you need finer
grain locking.

But yes we need per-cpu free lists [1], per-cpu stats.
I'd go for ditching all these separate pools we currently have for
a decent SMP allocator.

For TCP it is probably enough to get data for separate connections
running in parallel....

David

[1] and being aware of the code that tends to 'pump' free items from
one cpu to another.....
--
David Laight: ***@l8s.co.uk
j***@dsg.stanford.edu
2006-01-06 01:42:13 UTC
Permalink
In message <***@snowdrop.l8s.co.uk>,
David Laight writes:

[..]
Post by David Laight
Post by j***@dsg.stanford.edu
Apologies if I misremember the exact details, its been... what, a decade?
What I intended to say is that the rmw cycle done by an exchange instruction
(or a compare and exchange) is much the same expensive operation as
acquiring a lock (given you expect it to be uncontended).
The lock release is typically free - since it gets snooped by the next
request.
There is also the issue that with expensive (ie slow) hardware rmw cycles
the lock memory location itself becomes a 'hotspot' so you need finer
grain locking.
Well, but here you are taking as axiomatic that the atomic operation
*is* implemented as a hardware lock on the memory subsystem. Whereas
ll/sc, or CAS/CAS2, aren't specified that way and typically don't
actually work that way.
Post by David Laight
But yes we need per-cpu free lists [1], per-cpu stats.
I'd go for ditching all these separate pools we currently have for
a decent SMP allocator.
"Decent" is in the eye of the beholder. Baed on the numbers I posted
earlier, I specifically want multiple {allocator, deallocator} pairs
for the same (type-stable) kind of underlying object, viz, mbufs, mbuf
clusters, and (mbuf header plus cluster) pairs.

A patch is probably shorter than text at this point....
Post by David Laight
For TCP it is probably enough to get data for separate connections
running in parallel....
Do you mean multiple 10GbE streams in parallel? That'd be nice, but
seeing as we currently can't handle more than about 500 Mbyte/sec, we
have a long way to go to get there.

Conversely, if what you're trying to say is something like

``multiple single 1GbE streams running in parallel, using at most
1 CPU per stream, is probably enough''

then I disagree. Down that path lies the same, well-known [1] set of
naive approaches [2], which are well known to yield a stack in which
single-stream performance is slower, considerably slower, than the
status quo ante: namely, well-tuned, biglock, spl-synchronized code.

[1, 2]: as seen by people who have prior experience with parallelising
high-performance TCP/IP stacks for SMP hardware, for single flows
fatter than a single CPU can handle. Think IRIX.
Post by David Laight
David
[1] and being aware of the code that tends to 'pump' free items from
one cpu to another.....
Hubert Feyrer
2006-01-03 08:14:17 UTC
Permalink
Post by Chapman Flack
Out of curiosity, anybody have an idea what fraction of NetBSD
supported MP architectures (maybe weighted by prominence?) have a
cmpxchg instruction (or a loadlinked/storeconditional, lwarx/stwcx
or other feature that can be used to the purpose)?
I don't, but maybe you want to start your research from this page:
http://www.netbsd.org/developers/features/


- Hubert
Mahendra M
2006-01-03 03:48:01 UTC
Permalink
Post by z***@globalnet.hr
not fulfilling the promise of lock-free code). Why didn't BSDs adopt RCU?
[ Please, I'm not trying to start a Linux vs. BSD war! This is a simple and
honest question related to my general inquiry. ]
I am not sure about this, but I guess RCU is patented and use of RCU
is permitted only in GPL-ed code. I guess, this would prevent it from
being used
in the BSDs.

Regards,
Mahendra
z***@globalnet.hr
2006-01-03 14:20:49 UTC
Permalink
Post by Mahendra M
I am not sure about this, but I guess RCU is patented and use of RCU
is permitted only in GPL-ed code. I guess, this would prevent it from
being used
in the BSDs.
You are right about this. I've browsed a little through the Linux RCU
documentation, and the patents are free only for use in GPL code :/
matthew green
2006-01-03 12:11:07 UTC
Permalink
lock-free data structures aren't useful for all data structures.
sometimes you need to fiddle with a LOT of stuff inside the lock,
not just add one to a counter.

as you say, this is a very machine-specific operation and while
some code in netbsd could benefit from such a scheme i'm sure,
making that code portable in an MI place could be difficult and
require a lot of MD support code to be written. in other words,
we haven't rejected this approach but no one has made a good
case for it or produced the code, yet


mrg.
Garrett D'Amore
2006-01-03 20:34:59 UTC
Permalink
Post by matthew green
lock-free data structures aren't useful for all data structures.
sometimes you need to fiddle with a LOT of stuff inside the lock,
not just add one to a counter.
as you say, this is a very machine-specific operation and while
some code in netbsd could benefit from such a scheme i'm sure,
making that code portable in an MI place could be difficult and
require a lot of MD support code to be written. in other words,
we haven't rejected this approach but no one has made a good
case for it or produced the code, yet
Seems to me that some basic atomic math operations (increment,
decrement, and test versions of those) would go a long way.

In my opinion, one of the most scalable kernels around is Solaris
(proven on 100+ CPU SMP systems), and if getting locking "right" is key,
it would definitely benefit NetBSD if some developers examined the
Solaris design.

Of course, Solaris uses a hardware test-and-set operation, but they also
use something called a barrier which basically ensures a consistent view
of memory between processors when a boundary is crossed. There is also
some use of atomic math operations, but less so.

The main thing I think that Solaris gets right is lots and lots of
little locks, coupled with software designed so that very rarely do you
ever fail to get a lock (unless as a reader for a reader/writer
lock). The locks themselves are optimized so that the common case of
uncontended locks is very fast.

One other thing I like about Solaris is that the interrupt masking
behavior of certain kinds of locks is hidden behind the interface -- so
callers needn't normally worry about setting the processor mask
explicitly. (The exception is for "high level" locks, which block out
the timer interrupt. Typically hi level locks are only used for certain
hotplug devices (PCMCIA) and for serial ports.)

-- Garrett
Post by matthew green
.mrg.
--
Garrett D'Amore http://www.tadpolecomputer.com/
Sr. Staff Engineer Extending the Power of 64-bit UNIX Computing
Tadpole Computer, Inc. Phone: (951) 325-2134
Jason Thorpe
2006-01-04 05:24:22 UTC
Permalink
Post by Garrett D'Amore
In my opinion, one of the most scalable kernels around is Solaris
(proven on 100+ CPU SMP systems), and if getting locking "right" is key,
it would definitely benefit NetBSD if some developers examined the
Solaris design.
Yah, I've studied it, and even implemented a clone of it on the
"newlock" branch. Never had time to finish it, though. (Needed some
changes to the scheduler which were tricky, time consuming, and not
something I really had time to do...)
Post by Garrett D'Amore
Of course, Solaris uses a hardware test-and-set operation, but they also
use something called a barrier which basically ensures a consistent view
of memory between processors when a boundary is crossed. There is also
some use of atomic math operations, but less so.
It can be done with something other than test-and-set. The Solaris
model leaves the low-level primitive to machine-dependent code, so you
can use whatever method works for your CPU. You could even use
restartable atomic sequences on platforms that are both uniprocessor
and lacking a suitable atomic operation (some primitive is necessary
even on a uniprocessor to handle preemption in the kernel).
Post by Garrett D'Amore
The main thing I think that Solaris gets right is lots and lots of
little locks, coupled with software designed so that very rarely do you
ever fail to get a lock (unless as a reader for a reader/writer
lock). The locks themselves are optimized so that the common case of
uncontended locks is very fast.
Yes, it is very fast indeed.
Post by Garrett D'Amore
One other thing I like about Solaris is that the interrupt masking
behavior of certain kinds of locks is hidden behind the interface -- so
callers needn't normally worry about setting the processor mask
explicitly. (The exception is for "high level" locks, which block out
the timer interrupt. Typically hi level locks are only used for certain
hotplug devices (PCMCIA) and for serial ports.)
Yah, you can either have an "adaptive mutex" or a "spinning mutex",
the latter also performing an implicit spl operation (based on the
IPL value that is provided at mutex initialization).

-- thorpej
Jed Davis
2006-01-05 23:55:09 UTC
Permalink
Post by Garrett D'Amore
Seems to me that some basic atomic math operations (increment,
decrement, and test versions of those) would go a long way.
Agreed! Some individual ports have some of their own (e.g.,
sys/arch/powerpc/include/atomic.h) for whatever MD purposes, but
there's no standardized MI interface.
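
Something like the following would already cover a lot of cases; the
names here are purely illustrative (not an existing interface), and C11
atomics stand in for whatever MD implementation each port would supply:

    #include <stdatomic.h>
    #include <stdbool.h>

    static inline void
    atomic_inc_uint(_Atomic unsigned int *p)
    {
        atomic_fetch_add(p, 1);
    }

    static inline void
    atomic_dec_uint(_Atomic unsigned int *p)
    {
        atomic_fetch_sub(p, 1);
    }

    /* Decrement-and-test: true when the count reaches zero, the usual
     * reference-counting idiom. */
    static inline bool
    atomic_dec_uint_and_test(_Atomic unsigned int *p)
    {
        return atomic_fetch_sub(p, 1) == 1;
    }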
--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map
((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda
(f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l))
(C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
k***@pobox.com
2006-01-04 02:33:11 UTC
Permalink
Post by z***@globalnet.hr
Recently I've come across many papers describing the construction of
lock-free data structures using compare&exchange instructions. Such
implementations do not require explicit locks (mutexes) in an MP
environment. The advantages are more (no explicit locking needed) or
less (performance gains) obvious, so the question arises: why aren't
such data structures used more widely (in general and in the kernel
code?)
Bottom line: what's the "catch" with lock-free data structures?
A guy I know at work did some experimenting with atomic locks vs the more
common blocking mutexes. This was in a threaded user-level application
and not inside an OS (if it matters).

It turns out that atomic locks are faster until the number of threads
reaches a certain point. After that the application gets slower and slower.
The number of threads varied depending on the platform, and it probably
varied on the test case as well.

Here's one example of an issue: Atomic operations spin. With many threads
contending this results in lots of CPU being burned with little actual
work going on. Contrast this with lock types that forfeit the CPU when
they block.

If blocking on a lock causes the loss of the CPU then you end up with
roughly two CPUs operating on the lock at once: one to do an unlock and
another to do a lock. That's a small window compared to having all your
CPUs spinning.

AND, if you have a server handling requests then speeding up individual
requests with atomic locks may delay other requests. CPUs spinning waiting
in the service of one request are CPUs not serving other requests. You
have an overall loss of performance.

"There is no silver bullet."
Post by z***@globalnet.hr
An immediate drawback is that not all architectures support the cmpxchg
instruction. What are others? Is the possible performance gain negligible
compared to the required effort of recoding a large volume of code?
I think the Amiga documentation forbids use of the m68k CAS instruction...
--
Kevin P. Neal http://www.pobox.com/~kpn/

"It sounded pretty good, but it's hard to tell how it will work out
in practice." -- Dennis Ritchie, ~1977, "Summary of a DEC 32-bit machine"
Garrett D'Amore
2006-01-04 04:42:23 UTC
Permalink
Solaris handles this (spinning vs. blocking) quite nicely with what it
calls "adaptive mutexes". These mutexes spin when when there is only
one thread calling, and the owner is currently running. Otherwise they
block. Works really, really well.
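
In userspace one can approximate the effect by spinning a bounded
number of times before falling back to a blocking acquire; a real
adaptive mutex instead checks whether the owner is currently on-CPU.
A sketch (the spin count is arbitrary):

    #include <pthread.h>
    #include <sched.h>

    #define SPIN_TRIES 100

    /* Assume the holder is running and will release soon, so spin
     * briefly; if that fails, block instead of burning CPU. */
    static void
    adaptive_lock(pthread_mutex_t *m)
    {
        for (int i = 0; i < SPIN_TRIES; i++) {
            if (pthread_mutex_trylock(m) == 0)
                return;             /* acquired while spinning */
            sched_yield();
        }
        pthread_mutex_lock(m);      /* still contended: sleep */
    }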

I'm not trying to do a "my OS is better" sort of thing or start a
flamefest, but merely to point developers at the implementation as a
source of alternatives if one is considering a new implementation or
changing the current one.

I'm still pretty new to NetBSD kernel stuff, but it looks to me like
there really isn't much in the way of locks in the kernel right now -- a
lot more focus seems to be given to just masking interrupts. I'm not
sure if we want to do more locking, but it would certainly be good for
(scalable) SMP support. But the locking philosophy is harder for some
developers to get their heads around, tends to be a bit more error
prone, and ultimately may represent too much change. Plus on
uniprocessor systems (e.g. embedded platforms that NetBSD seems to be so
focused on) blocking interrupts is probably simply more performant.

-- Garrett
Post by k***@pobox.com
Post by z***@globalnet.hr
Recently I've come across many papers describing the construction of
lock-free data structures using compare&exchange instructions. Such
implementations do not require explicit locks (mutexes) in an MP
environment. The advantages are more (no explicit locking needed) or
less (performance gains) obvious, so the question arises: why aren't
such data structures used more widely (in general and in the kernel
code?)
Bottom line: what's the "catch" with lock-free data structures?
A guy I know at work did some experimenting with atomic locks vs the more
common blocking mutexes. This was in a threaded user-level application
and not inside an OS (if it matters).
It turns out that atomic locks are faster until the number of threads
reaches a certain point. After that the application gets slower and slower.
The number of threads varied depending on the platform, and it probably
varied on the test case as well.
Here's one example of an issue: Atomic operations spin. With many threads
contending this results in lots of CPU being burned with little actual
work going on. Contrast this with lock types that forfeit the CPU when
they block.
If blocking on a lock causes the loss of the CPU then you end up with
roughly two CPUs operating on the lock at once: one to do an unlock and
another to do a lock. That's a small window compared to having all your
CPUs spinning.
AND, if you have a server handling requests then speeding up individual
requests with atomic locks may delay other requests. CPUs spinning waiting
in the service of one request are CPUs not serving other requests. You
have an overall loss of performance.
"There is no silver bullet."
Post by z***@globalnet.hr
An immediate drawback is that not all architectures support the cmpxchg
instruction. What are others? Is the possible performance gain negligible
compared to the required effort of recoding a large volume of code?
I think the Amiga documentation forbids use of the m68k CAS instruction...
--
Garrett D'Amore http://www.tadpolecomputer.com/
Sr. Staff Engineer Extending the Power of 64-bit UNIX Computing
Tadpole Computer, Inc. Phone: (951) 325-2134
j***@britannica.bec.de
2006-01-04 07:40:17 UTC
Permalink
Post by k***@pobox.com
Here's one example of an issue: Atomic operations spin. With many threads
contending this results in lots of CPU being burned with little actual
work going on. Contrast this with lock types that forfeit the CPU when
they block.
Sorry, but your design is seriously flawed either way if lots of CPUs
contend for a single lock. Using a blocking mutex doesn't make it
cheaper at all, it just moves part of the contention from userland into
the kernel. If you replace the spinning (and consider Ethernet
collision style handling with random delays) with contention-free token
passing, you have to modify your program a lot more to avoid hitting
critical latency issues.

Joerg
Michael van Elst
2006-01-04 12:15:53 UTC
Permalink
Post by k***@pobox.com
Post by z***@globalnet.hr
An immediate drawback is that not all architectures support the cmpxchg
instruction. What are others? Is the possible performance gain negligible
compared to the required effort of recoding a large volume of code?
I think the Amiga documentation forbids use of the m68k CAS instruction...
The instructions like CAS are only necessary for shared data
among multiple bus masters. For a single processor you can
simply use read-modify-write instructions like BSET/BCLR.
Ignatios Souvatzis
2006-01-04 12:56:40 UTC
Permalink
Post by Michael van Elst
Post by k***@pobox.com
Post by z***@globalnet.hr
An immediate drawback is that not all architectures support the cmpxchg
instruction. What are others? Is the possible performance gain negligible
compared to the required effort of recoding a large volume of code?
I think the Amiga documentation forbids use of the m68k CAS instruction...
The instructions like CAS are only necessary for shared data
among multiple bus masters. For a single processor you can
simply use read-modify-write instructions like BSET/BCLR.
Yes, but those are functionally like TAS.

I'm not aware of an atomic CAS equivalent.

As for CAS - isn't CAS only forbidden in CHIPMEM?

-is
--
seal your e-mail: http://www.gnupg.org/
Michael van Elst
2006-01-04 13:55:47 UTC
Permalink
Post by Ignatios Souvatzis
Post by Michael van Elst
simply use read-modify-write instructions like BSET/BCLR.
Yes, but those are functionally like TAS.
I'm not aware of an atomic CAS equivalent.
True, the closest approximation is probably DEC.B/INC.B.
However, a single bit is sufficient for locking.
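
To illustrate: a single atomically set and cleared bit already gives
you a spin lock (C11 atomic_flag shown; on a uniprocessor 68k the
equivalent would be a BSET plus a branch on the bit's previous value,
as above):

    #include <stdatomic.h>

    static atomic_flag lockbit = ATOMIC_FLAG_INIT;

    static void
    bit_lock(void)
    {
        while (atomic_flag_test_and_set(&lockbit))
            ;                       /* spin until the bit was clear */
    }

    static void
    bit_unlock(void)
    {
        atomic_flag_clear(&lockbit);
    }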
Post by Ignatios Souvatzis
As for CAS - isn't CAS only forbidden in CHIPMEM?
The "forbidden in CHIPMEM" rule applies to the 68000,
it uses a special bus cycle that doesn't fit into
a DMA slot and the result is undefined.

68020 and higher use a dedicated locking signal,
I am not aware of any Amiga hardware that interprets
it and I haven't seen any CPU board that would
synthesize the special 68000 bus cycles for
Zorro-II hardware that might interpret an 68000 TAS.

Things are different for other 68k hardware (e.g.
some VME boards).
Ignatios Souvatzis
2006-01-04 14:33:15 UTC
Permalink
Post by Michael van Elst
Post by Ignatios Souvatzis
As for CAS - isn't CAS only forbidden in CHIPMEM?
The "forbidden in CHIPMEM" rule applies to the 68000,
it uses a special bus cycle that doesn't fit into
a DMA slot and the result is undefined.
68020 and higher use a dedicated locking signal,
I am not aware of any Amiga hardware that interprets
it and I haven't seen any CPU board that would
synthesize the special 68000 bus cycles for
Zorro-II hardware that might interpret an 68000 TAS.
So... as the Amiga is a single-CPU machine (well, as supported by
NetBSD) would it just work?

-is
Michael van Elst
2006-01-04 16:08:39 UTC
Permalink
Post by Ignatios Souvatzis
So... as the Amiga is a single-CPU machine (well, as supported by
NetBSD) would it just work?
I guess so. But using BSET/BCLR instead will definitely work
while CAS could possibly fail.
Ignatios Souvatzis
2006-01-04 17:02:03 UTC
Permalink
Post by Michael van Elst
Post by Ignatios Souvatzis
So... as the Amiga is a single-CPU machine (well, as supported by
NetBSD) would it just work?
I guess so. But using BSET/BCLR instead will definitely work
while CAS could possibly fail.
Hm.

The problem is - with bset/bclr you can't implement list
insertion/deletion directly, which was what triggered this thread
above. Well, you can indirectly by using bset/bclr to implement a
lock.

(Btw, I remember an old paper about using CAS2 for more complicated
structure manipulations - this would be slow on 68060 machines, as CAS2
(and unaligned CAS) trap on the '060 and have to be emulated.)

Regards,
-is
j***@dsg.stanford.edu
2006-01-04 17:39:08 UTC
Permalink
Post by Ignatios Souvatzis
Post by Michael van Elst
Post by Ignatios Souvatzis
So... as the Amiga is a single-CPU machine (well, as supported by
NetBSD) would it just work?
I guess so. But using BSET/BCLR instead will definitely work
while CAS could possibly fail.
Hm.
The problem is - with bset/bclr you can't implement list
insertion/deletion directly, which was what triggered this thread
above. Well, you can indirectly by using bset/bclr to implement a
lock.
(Btw, I remember an old paper about using CAS2 for more complicated
structure manipulations - this would be slow on 68060 machines, as CAS2
(and unaligned CAS) trap on the '060 and have to be emulated.)
You're probably remembering either Massalin and Pu (Synthesis kernel),
or work by Michael Greenwald:

Greenwald, M. Non-Blocking Synchronization and System Design.
PhD thesis, Stanford University Technical Report
STAN-CS-TR-99-1624, Stanford, CA 94305.

"Slow" was an understatement. Michael's implementation (and the V++
kernel/cachekernel) ran on 25MHz 68040s. CAS2 stalled the entire
pipeline, dithered internally for hundreds of cycles, issued the
memory ops, then dithered internally for more cycles.

After a little thought, it should become obvious that CAS is simply
not adequate for efficient lock-free synchronization on data structures
more complicated than a singly-linked list. (Yes, solutions like
Herlihy's work exist, but they are not sufficiently efficient for use
in an OS kernel.)

For practical kernel use, one will soon find one *needs* a DCAS/CAS2,
or some suitable approximation.

One approximation that's moderately well-known by practitioners in the
field is to do a ptrdiff_t-sized CAS, where the ptrdiff_t is
interpreted _not_ as an address, but as an index into some type-stable
array of actual pointers (or data structures).
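
For illustration only (the field names are made up): pack two 32-bit
indices into one 64-bit word, so that a single wide CAS changes both
"pointers" atomically or not at all:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Indices into a type-stable array of the real objects. */
    struct pair { uint32_t head_idx, tail_idx; };

    static _Atomic uint64_t packed;

    static uint64_t
    pack(struct pair p)
    {
        return ((uint64_t)p.tail_idx << 32) | p.head_idx;
    }

    /* Succeeds only if *both* indices still hold their expected values. */
    static bool
    update_pair(struct pair expected, struct pair desired)
    {
        uint64_t e = pack(expected);

        return atomic_compare_exchange_strong(&packed, &e, pack(desired));
    }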

Given a double-address, double-operand CAS, one can bundle up ``more
work'', as described in, e.g., Greenwald's dissertation.


One more part of that body of work was my invention of LLP/SCP:

ll ...
llp
scp
sc

the llp and scp are "pipelined" load-linked/store-conditional: either
both scp and sc atomically succeed, or both fail. This architectural
extension allows one to synthesize a DCAS [IBM-speak] or CAS2 [68k].

Michael and I gave a presentation on this idea to SGI, who intended to
implement it in the (then-called) r20k; but after that project was
cancelled it never saw the light of day. An abstract of the talk is
still available at, um,
http://www-dsg.stanford.edu/michaelg/abstracts/97sgi.abstract

Greg Chesson and his team used a similar trick in IRIX. That trick
relied on non-architecturally-mandated aspects of the r4k and later
LL/SC implementation (viz, that the per-CPU indivisibility of those
synchronization instructions applied to entire cache lines, not the
strict address). Essentially, that trick combined LL/SC and
modification of part of the same cache line inside the ll/sc loop,
*knowing* that the ll/sc would effectively protect other writes to the
same cacheline.

(Burying such tricks deep inside the OS is the kind of thing which
causes subsequent microprocessor architects heart trouble; which is
one reason the LLP/SCP was well-received by [then] MIPS implementors.)

Oh, one more thing: to suggest using ras(9) is proof-positive that the
suggester simply doesn't understand the problem domain. Bershad's
work on restartable atomic sequences, of which ras(9) is a
re-implementation, gives atomic semantics to *USERSPACE PROCESSES*,
implemented by adding "check for RAS collision during trap-to-OS"
hooks and PC-unwind in one (or a very, very small integer number) of
codepaths. Thus, RAS is not suitable for use *inside* a kernel.

As a historical note, the main motivation for Brian Bershad's RAS work
was the availability of (at the time) fast, inexpensive R3K-based
machines, which had no support for atomic instructions at *all*. RAS
was a big win for userspace threading or synchronization, compared to
the alternative of trapping into the kernel to execute an atomic
operation with interrupts blocked.
Jason Thorpe
2006-01-04 18:05:55 UTC
Permalink
Post by j***@dsg.stanford.edu
Oh, one more thing: to suggest using ras(9) is proof-positive that the
suggester simply doesn't understand the problem domain. Bershad's
work on restartable atomic sequences, of which ras(9) is a
re-implementation, gives atomic semantics to *USERSPACE PROCESSES*,
implemented by adding "check for RAS collision during trap-to-OS"
hooks and PC-unwind in one (or a very, very small integer number) of
codepaths. Thus, RAS is not suitable for use *inside* a kernel.
Actually, it is possible to use RAS inside a kernel, and Solaris does
(did) it on some platforms for low-level mutex operations on
uniprocessor systems.

-- thorpej
j***@dsg.stanford.edu
2006-01-04 18:59:08 UTC
Permalink
Post by Jason Thorpe
Post by j***@dsg.stanford.edu
Oh, one more thing: to suggest using ras(9) is proof-positive that the
suggester simply doesn't understand the problem domain. Bershad's
work on restartable atomic sequences, of which ras(9) is a
re-implementation, gives atomic semantics to *USERSPACE PROCESSES*,
implemented by adding "check for RAS collision during trap-to-OS"
hooks and PC-unwind in one (or a very, very small integer number) of
codepaths. Thus, RAS is not suitable for use *inside* a kernel.
Actually, it is possible to use RAS inside a kernel, and Solaris does
(did) it on some platforms for low-level mutex operations on
uniprocessor systems.
Uh, Jason, I said "not suitable", not "not possible".

I'm not the only one to notice and comment on your tendency to send
off-the-cuff replies which miss the point of what was said.
It's a New Year, maybe you could make a resolution to take the time to
*read* what people write before responding?
z***@globalnet.hr
2006-01-04 19:29:15 UTC
Permalink
Post by j***@dsg.stanford.edu
Post by Jason Thorpe
Post by j***@dsg.stanford.edu
codepaths. Thus, RAS is not suitable for use *inside* a kernel.
Actually, it is possible to use RAS inside a kernel, and Solaris does
(did) it on some platforms for low-level mutex operations on
uniprocessor systems.
Uh, Jason, I said "not suitable", not "not possible".
Now that you started this discussion, "not suitable" is interpreted by
many people (me included) as "not possible".

Webster online dictionary for "suitable" gives a definition "adapted to
a use or purpose". In this reading "ras is not suitable for in-kernel
use" -> "ras is not adapted to be used in-kernel" -> "it is not possible
to use ras in-kernel" (how can you use it, if it is not adapted?)

Other definitions of "suitable" are: b : satisfying propriety : PROPER
c : ABLE, QUALIFIED

And for "adapted"/"adapt" it says "to make fit (as for a specific or
new use or situation) often by modification".

I admit that I'm not a native English speaker, but consider that there are
at least two persons (Jason and me) who read "not suitable" as "not possible".
Webster seems to corroborate our interpretation.
Jason Thorpe
2006-01-04 21:41:12 UTC
Permalink
Post by j***@dsg.stanford.edu
Uh, Jason, I said "not suitable", not "not possible".
The fact that it is both possible and in use in real world
applications shows that it is, in fact, quite suitable for carefully-
chosen in-kernel uses.

-- thorpej
z***@globalnet.hr
2006-01-04 18:30:03 UTC
Permalink
Post by j***@dsg.stanford.edu
For practical kernel use, one will soon find one *needs* a DCAS/CAS2,
or some suitable approximation.
Newer 32-bit Intel processors offer the 8-byte cmpxchg, and the 64-bit
AMDs offer the 16-byte cmpxchg. Would these be sufficient? [In one of
the papers I remember seeing that DCAS is defined to take two distinct
memory locations which are not necessarily adjacent, unlike the
working of CMPXCHG8B]
Post by j***@dsg.stanford.edu
Oh, one more thing: to suggest using ras(9) is proof-positive that the
suggester simply doesn't understand the problem domain. Bershad's
Hm, I admit that I don't understand the NetBSD kernel part. I have never
even read the manpage thoroughly. What I read in the paper reminded me
of ras(9), whose manpage I browsed through some time ago.. I
assumed that it was intended for in-kernel use since it is in the 9th
section of the manual.. :) It turns out that it is (I guess) just the
kernel support for rasctl(2)?

Bottom line: disregarding portability, would it be beneficial to use
waitfree data structures in the kernel instead of explicit locks (on
architectures that properly and efficiently support DCAS)? Or most of
such structures are logically tied to some other resource so locks are
needed anyway (as someone has already pointed out)?
Bill Studenmund
2006-01-04 21:34:59 UTC
Permalink
Bottom line: disregarding portability, would it be beneficial to use
waitfree data structures in the kernel instead of explicit locks (on
architectures that properly and efficiently support DCAS)? Or most of
such structures are logically tied to some other resource so locks are
needed anyway (as someone has already pointed out)?
I haven't yet read the papers. However I see two issues here. First is the
patent one, and I suspect it's a killer.

Second, I see lock-free structures as an optimization for certain cases
(perhaps an excellent one!) of fine-locked access and structures. We are
still at big-lock, so it strikes me that we have work to do before we need
to worry about this. :-|

Take care,

Bill
Ignatios Souvatzis
2006-01-05 13:09:02 UTC
Permalink
Post by Bill Studenmund
Bottom line: disregarding portability, would it be beneficial to use
waitfree data structures in the kernel instead of explicit locks (on
architectures that properly and efficiently support DCAS)? Or most of
such structures are logically tied to some other resource so locks are
needed anyway (as someone has already pointed out)?
I haven't yet read the papers. However I see two issues here. First is the
patent one, and I suspect it's a killer.
What exactly is patented?

-is
Jonathan Stone
2006-01-04 21:58:38 UTC
Permalink
Post by z***@globalnet.hr
Post by j***@dsg.stanford.edu
For practical kernel use, one will soon find one *needs* a DCAS/CAS2,
or some suitable approximation.
Newer 32-bit Intel processors offer the 8-byte cmpxchg, and the 64-bit
AMDs offer the 16-byte cmpxchg. Would these be sufficient?
One can do that. It's a special-case of using a ptrdiff_t-sized atomic
operation and treating the two halves of the pointer as an index.

(I originally intended to subsume that case, but I concede that, for
clarity, I ended up using wording which hid that intent.)
Post by z***@globalnet.hr
In one of
the papers I remember seeing that DCAS is defined to take two distinct
memory locations which are not necessarily adjacent, unlike the
working of CMPXCHG8B]
Yes, exactly.
Post by z***@globalnet.hr
Post by j***@dsg.stanford.edu
Oh, one more thing: to suggest using ras(9) is proof-positive that the
suggester simply doesn't understand the problem domain. Bershad's
Hm, I admit that I don't understand the NetBSD kernel part. I have never
even read the manpage thoroughly. What I read in the paper reminded me
of ras(9), whose manpage I browsed through some time ago.. I
assumed that it was intended for in-kernel use since it is in the 9th
section of the manual.. :) It turns out that it is (I guess) just the
kernel support for rasctl(2)?
Huh? To quote ras(9):

DESCRIPTION
Restartable atomic sequences are user code sequences which are guaranteed
^^^^^^^^^^^^^^^^^^^
to execute without preemption.

Emphasis added, not in original.
Post by z***@globalnet.hr
Bottom line: disregarding portability, would it be beneficial to use
waitfree data structures in the kernel instead of explicit locks (on
architectures that properly and efficiently support DCAS)? Or most of
such structures are logically tied to some other resource so locks are
needed anyway (as someone has already pointed out)?
That's a hard question. Unfortunately, from a systems perspective,
one sometimes has to take some of the work in the PODC (principles of
distributed computing) with not just a grain, but a sackful of salt:
theoretical results about O(1) wait-free systems sometimes end up
assuming synchronization primitives that are O(n^3) in hardware, if
implemented so as to avoid starvation. If you ask Michael Greenwald
(at UPenn last I heard), he can probably still cite you examples.



The other gotcha with wait-free algorithms is that (not infrequently)
even PhD candidates in systems (OS arena) strain to follow the details
of wait-free algorithms. Getting the details right is _hard_. That
poses a high bar to development and maintenance of such algorithms.

For one example, work through all the details in Michael Greenwald's
NBS paper or dissertation (cited previously).


The gripping hand, though, is that we're still at the big-lock stage.
Before NBS becomes an issue, we need to first move code out from
under the biglock.

And, just for my own interest, networking: an even bigger win there
is to not need synchronization *at all* for operations like allocating or
freeing mbufs. Instead, just add or remove on a per-CPU freelist
dedicated to the desired interrupt priority.

And where synchronization is necessary, amortize the cost over large
chunks (e.g., an entire interrupt's worth of packets), not just one.
Garrett D'Amore
2006-01-04 22:11:13 UTC
Permalink
All this conversation really seems to beg one question, which may be
obvious to everyone else, but I figure I ought to ask it anyway.

Is there a plan to move from the biglock to fine-grained locking?

If not then we should shelve this conversation.

If yes, then perhaps our time is better spent trying to figure out how
to achieve *that* goal first, then worry about optimization details.
--
Garrett D'Amore http://www.tadpolecomputer.com/
Sr. Staff Engineer Extending the Power of 64-bit UNIX Computing
Tadpole Computer, Inc. Phone: (951) 325-2134
j***@dsg.stanford.edu
2006-01-05 01:03:12 UTC
Permalink
Post by Garrett D'Amore
All this conversation really seems to beg one question, which may be
obvious to everyone else, but I figure I ought to ask it anyway.
Is there a plan to move from the biglock to fine-grained locking?
I have (more than once) tried to outline a roadmap for moving
receive-side network processing out from under the big-lock. I see
that as a non-negotiable requirement, if NetBSD is to be a viable
contender with other open-source OSes, for certain
performance-intensive applications.

As Steven M. Bellovin outlined some months ago, the driving factor is
that, for about the last 3 years, CPUs aren't getting exponentially
faster; instead, we're getting more cores (possibly exponentially
more, possibly not) on individual chips.

Consider 10GbE, or better still, multiple 10GbE links. For the sake of
discussion, let's disregard TCP offload engines (TOEs). Using standard
1500-byte packets, a 10GbE link can send on the close order of 800,000
1500-byte packets per second, plus (for TCP, with the standard ACK
every-other-segment), another 400,000 ACK-only packets per second.

That's unidirectional; when using full-duplex, or sinking and
retransmitting a half-duplex stream, one has to handle roughly 2.4
million packets/sec.

If one considers the falling cost of multi-core CPUs, then within a
couple of years, a viable OS will need to support a parallelized TCP
stack, in which one CPU handles interrupts, another CPU runs the TCP
state machine, and yet another CPU handles the copying to or from
user-space. That's hardly rocket science; IRIX could do that
(with 6.4GBit/sec HIPPI and 64k frames) nearly 10 years ago.

You don't have to be a rocket scientist or an LLNL atomic scientist
to conclude that raising and lowering IPL some 5 million times per
second (one raise/lower to allocate, one to free) is a losing
proposition.

I know how _I'd_ start to work toward that:

1. Multiple per-CPU listheads for allocating and freeing mbufs and
clusters, one for use when already at IPL_NET, and one for use below
IPL_NET. Separate allocators (mget(), mget_iplnet()) and deallocators
(mfree(), mfree_iplnet()). Mbuf allocation and deallocation then
becomes synchronization-free when called from inside network drivers
at IPL_NET, except in the (very rare) case of having to acquire the
big-lock to create new mbufs from a pool or shuffle from a global
(biglock-protected, at first) mbuf queue. (A toy sketch of the
per-CPU freelist idea appears after this list.)


2. A new data-structure (call it an mbq), with head and tail pointers
to a chain of mbuf packets; a per-CPU mbq of pending newly-arrived
packets;

3a. A way to move all packets in an mbq into an ifq (e.g., ipintrq)
cheaply, in one synchronization operation, and

3b. A (per-processor?) mbq, onto which low-level IPL_NET code can
temporarily enqueue packets for later deposition onto a layer-2 input
ifq every K packets or when returning from an IPL_NET interrupt;

4. An interprocessor communication mechanism (interprocessor interrupt
or other) whereby one CPU busy handling hardware interrupts can wake up
a different CPU, and force that second CPU to start running our
current softint network-processing code (e.g., ipintr()).
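
For item 1, the core of the per-CPU freelist idea is no more than the
following toy sketch (names are placeholders, not existing NetBSD
interfaces; assume the caller is already pinned to the CPU in question
at the appropriate IPL, as described above):

    #include <stddef.h>

    #define NCPU 8                      /* placeholder */

    struct buf { struct buf *next; };   /* stands in for an mbuf */

    static struct buf *freelist[NCPU];  /* one list head per CPU */

    /* Common case: no atomic op, no IPL change; only this CPU ever
     * touches its own list. */
    struct buf *
    buf_get(unsigned cpu)
    {
        struct buf *b = freelist[cpu];

        if (b == NULL)
            return NULL;        /* rare case: refill from the global pool */
        freelist[cpu] = b->next;
        return b;
    }

    void
    buf_put(unsigned cpu, struct buf *b)
    {
        b->next = freelist[cpu];
        freelist[cpu] = b;
    }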

Unfortunately, even starting to commit _that_ much keeps bogging
down in details, and in people who insist on making progress elsewhere
before touching the networking code.

Oh, and then Bill observed recently that IPIs are not well-ordered in
a machine-independent way, relative to other interrupts (in other
words, the ordering is machine-DEpendent). Which makes the idea of
parallelizing the networking code somewhat moot......
z***@globalnet.hr
2006-01-05 14:52:14 UTC
Permalink
Post by Jonathan Stone
DESCRIPTION
Restartable atomic sequences are user code sequences which are guaranteed
^^^^^^^^^^^^^^^^^^^
to execute without preemption.
Emphasis added, not in original.
NHF, but the word "user" is overloaded in this sentence. I took "user code" to
mean "code produced by the user" (i.e. the programmer). I mean, I was surprised
to see something like ras(9) implemented, so I took this sentence to mean: the
user (= programmer) can create his own restartable sequences. (and the rest of
the manpage documents how...)

IMHO, putting "user-level" instead of "user" would be much more explicit.

Anyway, thanks for clarifying the issue.
David Laight
2006-01-05 22:02:30 UTC
Permalink
Post by Ignatios Souvatzis
(Btw, I remember an old paper about using CAS2 for more complicated
structure manipulations - this would be slow on 68060 machines, as CAS2
(and unaligned CAS) trap on the '060 and have to be emulated.)
I managed to use CAS (or maybe it was CAS2) for a linked list once, but
the code is so tortuous I switched to disabling interrupts....

David
--
David Laight: ***@l8s.co.uk
z***@globalnet.hr
2006-01-04 14:20:18 UTC
Permalink
Post by k***@pobox.com
A guy I know at work did some experimenting with atomic locks vs the more
common blocking mutexes. This was in a threaded user-level application
and not inside an OS (if it matters).
It seems to me that the discussion went way off my original inquiry. I'm
not talking about atomic locks, but about shared data structures that do
not need explicit locks. E.g. parallel readers/writers of a linked list
get serialized correctly without any external locks.

Please look at the following papers:

A Pragmatic Implementation of Non-Blocking Linked-Lists
http://citeseer.ist.psu.edu/harris01pragmatic.html

The Synergy Between Non-blocking Synchronization and Operating System Structure
http://citeseer.ist.psu.edu/greenwald96synergy.html

The first paper describes what I have in mind. It is one of many papers
dealing with lock-free data structures. (I'm also a bit guilty of wrongly
naming them as "lock-free" instead of "non-blocking" in my initial post).
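
To make the flavour concrete: the insertion half of such a list is just
a CAS on the predecessor's next pointer (a sketch; deletion is the hard
part and is what most of the paper is about, since it needs "marked"
next pointers):

    #include <stdatomic.h>

    struct lnode {
        int key;
        _Atomic(struct lnode *) next;
    };

    /* Link n between prev and its current successor.  If a concurrent
     * writer changed prev->next first, the CAS fails and we retry.
     * (A full implementation re-validates that prev itself has not
     * been deleted in the meantime.) */
    static void
    list_insert_after(struct lnode *prev, struct lnode *n)
    {
        struct lnode *succ = atomic_load(&prev->next);

        do {
            atomic_store(&n->next, succ);
        } while (!atomic_compare_exchange_weak(&prev->next, &succ, n));
    }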

The other is a report on implementing a kernel on top of non-blocking
synchronization primitives. It also gives pointers to the software
implementation of the CMPXCHG instruction which, I believe (didn't look into
the papers yet), could be implemented by ras(9).
Post by k***@pobox.com
Here's one example of an issue: Atomic operations spin. With many threads
contending this results in lots of CPU being burned with little actual
work going on. Contrast this with lock types that forfeit the CPU when
they block.
Non-blocking data structures have an upper bound to latency. Achievable
latency is O(n) where n is the number of processes contending for the
data structure. It's very different from one thread acquiring a spinlock
and others indefinitely burning CPU cycles until the spinlock is released.
j***@britannica.bec.de
2006-01-04 14:45:22 UTC
Permalink
Post by z***@globalnet.hr
Non-blocking data structures have an upper bound to latency. Achievable
latency is O(n) where n is the number of processes contending for the
data structure. It's very different from one thread acquiring a spinlock
and others indefinitely burning CPU cycles until the spinlock is released.
This is incorrect. You are confusing non-blocking with wait-free.
Non-blocking algorithms guarantee that at least one instance is making
progress, e.g. can insert an element. It does not mean you can predict
*which* instance it is. This means in a NUMA system a remote CPU can
starve. A wait-free algorithm (which is much harder) gives you an upper
bound on the time of an operation for any single instance. It does
effectively prevent starvation.

Joerg
Joe Seigh
2006-01-06 00:51:14 UTC
Permalink
(2nd attempt to post this)
Post by z***@globalnet.hr
Sorry if I'm a bit off-topic, but this is the only kernel-developers
mailing list that I'm subscribed to. The question is a practical one and
I believe that kernel developers can give the most informed answer.
Recently I've come across many papers describing the construction of
lock-free data structures using compare&exchange instructions. Such
implementations do not require explicit locks (mutexes) in an MP
environment. The advantages are more (no explicit locking needed) or
less (performance gains) obvious, so the question arises: why aren't
such data structures used more widely (in general and in the kernel
code?)
I know that Linux has adopted RCU, but again the example code for the
linked list traversal calls rcu_read_lock() / rcu_read_unlock() (thus,
not fulfilling the promise of lock-free code). Why didn't BSDs adopt RCU?
[ Please, I'm not trying to start a Linux vs. BSD war! This is a simple and
honest question related to my general inquiry. ]
rcu_read_lock() / rcu_read_unlock() are AFAIK dummy macros, in the
non-preemptive kernel anyway, for documentation/code maintenance purposes.
The "lock" is in name only.

As mentioned elsewhere, the problem with RCU is that it is patented
except for one lapsed patent, 4,809,168. You can implement RCU
using that but you'd have to be careful as IBM's lawyers are likely
to interpret the claims on the lapsed patent narrowly and the claims
of their active patents broadly. And IBM's lawyers are bigger than
your lawyers.
Post by z***@globalnet.hr
Bottom line: what's the "catch" with lock-free data structures?
An immediate drawback is that not all architectures support the cmpxchg
instruction. What are others? Is the possible performance gain negligible
compared to the required effort of recoding a large volume of code?
For some lock-free algorithms you need interlocked instructions. For
some like RCU you only need atomic stores and loads with dependent
acquire and release memory semantics.

There's some lock-free stuff I did here
http://atomic-ptr-plus.sourceforge.net/
There's fastsmr, an RCU+SMR userspace implementation, which runs on
32bit x86 Linux, 32 bit ppc OSX, and 32bit sparc Solaris. Besides
the patent problems with RCU, there's a patent application in on
SMR hazard pointers.

The rcpc, reference counted proxy collector, is probably better since
it has no patent entanglements that I'm aware of. It needs double wide
compare and swap (e.g. cmpxchg8b or cmpxchg16b), load locked/store conditional,
or 64 bit or larger single wide compare and swap, which probably
covers every architecture that matters except for some embedded
processors which are probably running on a single processor where
you can simulate interlocked instructions if you are running disabled.

RCU and the proxy stuff are reader lock-free. You use them for what
I call PCOW (partial copy on write) and you need conventional mechanisms
for write access. They're really forms of GC and are used the same
way you'd use reader lock-free in Java with JSR-133 volatile supplying
the acquire and release memory semantics. You can use PCOW for things
besides linked lists, e.g. hash tables or trees. I have some lock-free
b-tree code I prototyped in Java so it is possible.
--
Joe Seigh