Discussion:
Are there good examples of SPMC queues?
Rajiv Kurian
2013-12-23 07:55:09 UTC
Permalink
My use case is the following:

1) A single thread reads bytes off the network and produces work.
2) Multiple worker threads actually complete the work. There are only a few
types of work (say at most 20), but each one takes a variable amount of time.

I first tried using multiple SPSC ring buffers (one connecting the single
producer to each worker thread) to distribute the work. The producer would
just round-robin between the SPSC ring buffers and enqueue some work to
each worker. This avoids contention, since the producer thread and each
worker thread are connected by their own queue. It ends up causing
imbalanced queues, though: some workers are starved of work while others
have too much. This especially applies to my workload, where some work
items take a few seconds while others take a few hundred microseconds.
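To make the setup concrete, here is a minimal sketch of that design (all names are hypothetical, not from any real code): a trivial SPSC ring per worker, with the producer round-robining across them. The dispatch is blind to how much work is already queued at each worker, which is exactly what lets the imbalance build up when item costs vary widely.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>   // N must be a power of two
struct SpscRing {
    std::array<T, N> buf{};
    std::atomic<std::size_t> head{0};  // advanced only by the consumer
    std::atomic<std::size_t> tail{0};  // advanced only by the producer

    bool push(const T& v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;  // full
        buf[t % N] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        out = buf[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

// Round-robin dispatch: oblivious to queue depth and item cost,
// hence the imbalance described above.
template <typename Ring, std::size_t W>
void dispatch(std::array<Ring, W>& rings, int item, std::size_t& next) {
    while (!rings[next % W].push(item)) ++next;  // skip full rings (spins if all full)
    ++next;
}
```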

I thought of a few other options:

1) Use an SPMC ring buffer. The workers just compete to get a slot on the
ring buffer and then do their work. The nice thing is that they are now
never idle if there is work to be done. The bad thing is contention on the
queue. This is the simplest option, though.

2) Use some kind of work requesting. I just saw the work requesting design
(http://www.1024cores.net/home/parallel-computing/concurrent-skip-list/scheduler-algorithm)
on Dmitry's blog. This seems like a nice pattern, especially since there is
no contention if most workers are busy. The problem, though, is that if a
worker is stuck on a large request (one that takes a few seconds), it won't
be able to hand off work to an idle worker.

3) Build a cost model of requests. For my application this doesn't seem
like a bad idea. Since I have only a few types of requests (20 or so), I
could run a tuning phase at startup that builds a cost model for each
input task. Say, for an image processing application, the types of work
are filters like blur, sharpen, resize, etc. We could run a phase on
startup where we check how much time each of the filters takes on images
of different sizes (user-supplied, maybe). Once we have this information,
we could build a cost model that estimates how much time, say, a blur of
an image of size x px * y px would take. Now we could go back to the
multiple-SPSC-queues design, and instead of round-robin we just assign
work to the lightest-loaded worker based on our cost model. So in essence
we are trying to ensure that the workers never get starved or overworked
at the source of scheduling instead of after the fact.
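A sketch of what that scheduler could look like (every name here is hypothetical; the per-pixel cost assumption is mine, and the load counters would need real synchronization with the workers):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumTypes = 20;   // e.g. blur, sharpen, resize, ...
constexpr std::size_t kNumWorkers = 4;

struct CostModel {
    // Nanoseconds per pixel for each work type, filled by a startup
    // tuning phase that times each filter on sample images.
    std::array<double, kNumTypes> ns_per_pixel{};

    double estimate(std::size_t type, std::uint64_t w, std::uint64_t h) const {
        return ns_per_pixel[type] * double(w) * double(h);
    }
};

struct Scheduler {
    CostModel model;
    // Estimated cost currently queued at each worker; workers would
    // decrement this as they finish (synchronization elided in this sketch).
    std::array<double, kNumWorkers> load{};

    // Assign the task to the lightest-loaded worker, per the cost model.
    std::size_t assign(std::size_t type, std::uint64_t w, std::uint64_t h) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < kNumWorkers; ++i)
            if (load[i] < load[best]) best = i;
        load[best] += model.estimate(type, w, h);
        return best;  // then enqueue on this worker's SPSC ring
    }
};
```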

Any comments or other methods would be highly appreciated.
--
---
You received this message because you are subscribed to the Google Groups "Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lock-free+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/lock-free/9d71e72a-51f1-491d-adfc-31a367ce23d7%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
Dmitriy V'jukov
2014-01-09 14:28:21 UTC
Permalink
Post by Rajiv Kurian
1) A single thread reads bytes of the network and produces work.
2) Multiple worker threads actually complete the work. There are only a few
types of work (say at most 20) but each one takes a variable amount of time.
I first tried using multiple SPSC ring buffers (connecting the single
producer to a worker thread) to distribute the work. The producer would just
round robin between the SPSC ring buffers and enqueue some work to each
worker. This avoids contention since the producer thread and each worker
thread is connected by a different queue. It ends up causing imbalanced
queues though. Some workers are starved of work while others have too much.
This especially applies to my work load where some work items take a few
seconds while others take a few 100 micro seconds.
Hi!

Yes, this definitely does not look like a good solution.
Post by Rajiv Kurian
1) Use a SPMC ring buffer. The workers just compete to get a slot on the
ring buffer and then do their work. The nice thing is that they now are
never idle if there is work to be done. The bad thing is contention on the
queue. This is the simplest option though.
2) Use some kind of work requesting. I just saw the work requesting design
on Dmitry's blog. This seems like a nice pattern especially since, if most
workers are busy there is no contention. The problem though is if a worker
is stuck working on a large request (one that takes a few seconds) it won't
be able to hand off work to an idle worker.
3) Build a cost model of requests. For my application this doesn't seem like
a bad thing. Since I have only a few types of requests (20 or so) I could
run a tuning phase at start up that builds a cost model for each input task.
Let's say for an image processing application the types of work are filters
like blur, sharpen, resize etc. We could run a phase on startup where we
check how much time each of the filter takes with image of different sizes
(user supplied maybe). Once we have this info down, we could build a cost
model which estimates how much time say a blur of an image of size x px * y
px would take. Now we could go back to the multiple SPSC queues design and
instead of round robin we just assign work to the lightest loaded worker
based on our cost model. So in essence we are just trying to ensure that the
workers never get starved or over-worked at the source of scheduling instead
of after the fact.
There is also "classical" work stealing. You can have N SPMC queues,
the producer enqueues to them round-robin, each thread takes one
element from its own queue and processes it; if it's out of work it
tries to steal half of the elements from other queues.
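A conceptual sketch of that steal-half policy (locks and a plain deque stand in for a real lock-free per-worker queue, purely to show the scheduling logic; all names are hypothetical):

```cpp
#include <array>
#include <cstddef>
#include <deque>
#include <mutex>

struct WorkQueue {
    std::mutex m;
    std::deque<int> items;  // int stands in for a work descriptor
};

// Take one item from our own queue; if it is empty, steal half of the
// first non-empty victim's items (rounded up) and keep the surplus.
template <std::size_t W>
bool take(std::array<WorkQueue, W>& qs, std::size_t self, int& out) {
    {   // fast path: our own queue
        std::lock_guard<std::mutex> g(qs[self].m);
        if (!qs[self].items.empty()) {
            out = qs[self].items.front();
            qs[self].items.pop_front();
            return true;
        }
    }
    std::deque<int> stolen;
    for (std::size_t v = 0; v < W && stolen.empty(); ++v) {
        if (v == self) continue;
        std::lock_guard<std::mutex> g(qs[v].m);
        std::size_t half = (qs[v].items.size() + 1) / 2;
        for (std::size_t i = 0; i < half; ++i) {
            stolen.push_back(qs[v].items.back());
            qs[v].items.pop_back();
        }
    }
    if (stolen.empty()) return false;  // everyone is out of work
    out = stolen.front();
    stolen.pop_front();
    std::lock_guard<std::mutex> g(qs[self].m);
    for (int x : stolen) qs[self].items.push_back(x);
    return true;
}
```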


However, I would consider using an efficient SPMC queue as well,
because it will be much simpler.
You can take this queue:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
and remove the CAS from enqueue, because you have only one producer.

Or you can even take this modification of Chris's as a base:
https://groups.google.com/forum/#!searchin/lock-free/thomasson/lock-free/acjQ3-89abE/a6-Di0GZsyEJ

Since you have a single queue in the program, you can afford to pad
each element to cache line size. This should keep contention very low.
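Putting both suggestions together, here is a sketch of the bounded MPMC queue from the 1024cores link above, with the enqueue CAS replaced by a plain store (valid only with a single producer) and cells padded to a cache line. This is a reconstruction under C++17 (alignas over-alignment in a vector), not a verbatim copy of the linked code:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class SpmcBoundedQueue {
    struct alignas(64) Cell {        // padded to a cache line, as suggested
        std::atomic<std::size_t> seq;
        T data;
    };
    std::vector<Cell> cells_;
    std::size_t mask_;
    std::size_t enqueue_pos_;                  // plain integer: single producer
    std::atomic<std::size_t> dequeue_pos_{0};  // shared among the workers

public:
    explicit SpmcBoundedQueue(std::size_t size)   // size must be a power of two
        : cells_(size), mask_(size - 1), enqueue_pos_(0) {
        for (std::size_t i = 0; i != size; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }

    bool push(const T& v) {   // no CAS: only this thread advances enqueue_pos_
        Cell& c = cells_[enqueue_pos_ & mask_];
        if (c.seq.load(std::memory_order_acquire) != enqueue_pos_)
            return false;  // full
        c.data = v;
        c.seq.store(enqueue_pos_ + 1, std::memory_order_release);
        ++enqueue_pos_;
        return true;
    }

    bool pop(T& out) {   // workers still race with a CAS on dequeue_pos_
        std::size_t pos = dequeue_pos_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            std::intptr_t diff =
                std::intptr_t(c.seq.load(std::memory_order_acquire)) -
                std::intptr_t(pos + 1);
            if (diff == 0) {
                if (dequeue_pos_.compare_exchange_weak(
                        pos, pos + 1, std::memory_order_relaxed))
                    break;        // we claimed slot pos
            } else if (diff < 0) {
                return false;     // empty
            } else {
                pos = dequeue_pos_.load(std::memory_order_relaxed);
            }
        }
        Cell& c = cells_[pos & mask_];
        out = c.data;
        c.seq.store(pos + mask_ + 1, std::memory_order_release);  // free the slot
        return true;
    }
};
```

The per-cell sequence number is what makes the single-producer simplification safe: a consumer never touches a cell until its sequence says the data is published, and the producer never reuses a cell until a consumer has bumped the sequence past it.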
Rajiv Kurian
2014-01-10 00:25:30 UTC
Permalink
Post by Dmitriy V'jukov
There is also "classical" work stealing. You can have N SPMC queues,
the producer enqueues to them round-robin, each thread takes one
element from its own queue and processes it; if it's out of work it
tries to steal half of the elements from other queues.
However, I would consider using an efficient SPMC queue as well,
because it will be much simpler. You can take this queue:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
and remove the CAS from enqueue, because you have only one producer.
Or you can even take this modification of Chris's as a base:
https://groups.google.com/forum/#!searchin/lock-free/thomasson/lock-free/acjQ3-89abE/a6-Di0GZsyEJ
Since you have a single queue in the program, you can afford to pad
each element to cache line size. This should keep contention very low.
Thank you, Dmitry. I think I will use an SPMC queue. The application is
for image processing, and processing each image takes longer than any
cache contention during dequeue. The automatic load balancing with an
SPMC queue just makes life easier than first distributing the work and
then trying to do work stealing.
Dmitriy V'jukov
2014-01-12 10:36:35 UTC
Permalink
Post by Rajiv Kurian
Thank you, Dmitry. I think I will use an SPMC queue. The application is
for image processing, and processing each image takes longer than any
cache contention during dequeue. The automatic load balancing with an
SPMC queue just makes life easier than first distributing the work and
then trying to do work stealing.
Makes sense.