Rajiv Kurian
2013-12-23 07:55:09 UTC
My use case is the following:
1) A single thread reads bytes off the network and produces work.
2) Multiple worker threads actually complete the work. There are only a few
types of work (say at most 20), but each one takes a variable amount of time.
I first tried using multiple SPSC ring buffers (one connecting the single
producer to each worker thread) to distribute the work. The producer would
just round-robin between the SPSC ring buffers, enqueuing some work to each
worker. This avoids contention, since the producer thread and each worker
thread are connected by a separate queue. It ends up causing imbalanced
queues, though: some workers are starved of work while others have too much.
This especially hurts my workload, where some work items take a few seconds
while others take a few hundred microseconds.
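
For concreteness, here is roughly what that setup looks like (a minimal
C++11 sketch; SpscRing, WorkItem, and the other names are made up for
illustration, and the ring capacity must be a power of two):

#include <array>
#include <atomic>
#include <cstddef>

template <typename T, size_t N>  // N must be a power of two
struct SpscRing {
  std::array<T, N> buf;
  std::atomic<size_t> head{0};  // advanced only by the consumer
  std::atomic<size_t> tail{0};  // advanced only by the producer

  bool try_push(const T& v) {   // called only by the producer
    size_t t = tail.load(std::memory_order_relaxed);
    if (t - head.load(std::memory_order_acquire) == N) return false;  // full
    buf[t & (N - 1)] = v;
    tail.store(t + 1, std::memory_order_release);
    return true;
  }
  bool try_pop(T& out) {        // called only by this queue's worker
    size_t h = head.load(std::memory_order_relaxed);
    if (h == tail.load(std::memory_order_acquire)) return false;      // empty
    out = buf[h & (N - 1)];
    head.store(h + 1, std::memory_order_release);
    return true;
  }
};

struct WorkItem { int type; /* plus payload */ };
constexpr size_t kWorkers = 4;
SpscRing<WorkItem, 1024> queues[kWorkers];  // one private queue per worker

// Blind round-robin: this is exactly what produces the imbalance, because
// it ignores how much work is already queued at each worker. Spins if
// every queue is full, which is fine for a sketch.
void dispatch_round_robin(const WorkItem& w) {
  static size_t next = 0;
  while (!queues[next % kWorkers].try_push(w)) ++next;  // skip full queues
  ++next;
}
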
I thought of a few other options:
1) Use an SPMC ring buffer. The workers just compete to get a slot on the
ring buffer and then do their work. The nice thing is that they are never
idle if there is work to be done. The bad thing is contention on the queue.
This is the simplest option, though. (A sketch follows after this list.)
2) Use some kind of work requesting. I just saw the work-requesting design
on Dmitry's blog
(http://www.1024cores.net/home/parallel-computing/concurrent-skip-list/scheduler-algorithm).
This seems like a nice pattern, especially since there is no contention when
most workers are busy. The problem, though, is that if a worker is stuck on
a large request (one that takes a few seconds), it won't be able to hand off
work to an idle worker. (A rough sketch follows after this list.)
3) Build a cost model of requests. For my application this doesn't seem
like a bad option. Since I have only a few types of requests (20 or so), I
could run a tuning phase at startup that builds a cost model for each input
task. Say, for an image processing application, the types of work are
filters like blur, sharpen, resize, etc. We could run a phase on startup
where we measure how much time each filter takes on images of different
sizes (user supplied, maybe). Once we have that information, we could build
a cost model that estimates how long, say, a blur of an x px * y px image
would take. Then we could go back to the multiple-SPSC-queues design and,
instead of round-robin, assign each work item to the least loaded worker
based on the cost model. In essence, we would be trying to ensure that
workers never get starved or overworked at the point of scheduling rather
than after the fact. (A sketch follows after this list.)
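
For option 1, I imagine the SPMC buffer could be built along the lines of
the bounded queue with per-cell sequence numbers that is also described on
1024cores. This is my paraphrase from memory, not a verified implementation;
the CAS loop in try_pop is exactly where the contention I mentioned shows up:

#include <atomic>
#include <cstdint>
#include <vector>

template <typename T>
class SpmcQueue {
  struct Cell { std::atomic<size_t> seq; T data; };
  std::vector<Cell> cells;
  size_t mask;
  std::atomic<size_t> tail{0};  // producer position (single producer)
  std::atomic<size_t> head{0};  // consumer position (shared, CASed)
 public:
  explicit SpmcQueue(size_t n) : cells(n), mask(n - 1) {  // n: power of two
    for (size_t i = 0; i != n; ++i)
      cells[i].seq.store(i, std::memory_order_relaxed);
  }
  bool try_push(const T& v) {   // single producer, so no CAS needed here
    size_t t = tail.load(std::memory_order_relaxed);
    Cell& c = cells[t & mask];
    if (c.seq.load(std::memory_order_acquire) != t) return false;  // full
    c.data = v;
    c.seq.store(t + 1, std::memory_order_release);
    tail.store(t + 1, std::memory_order_relaxed);
    return true;
  }
  bool try_pop(T& out) {        // all workers race on head here
    size_t h = head.load(std::memory_order_relaxed);
    for (;;) {
      Cell& c = cells[h & mask];
      size_t s = c.seq.load(std::memory_order_acquire);
      intptr_t dif = (intptr_t)s - (intptr_t)(h + 1);
      if (dif < 0) return false;  // cell not filled yet: queue is empty
      if (dif == 0) {
        if (head.compare_exchange_weak(h, h + 1, std::memory_order_relaxed)) {
          out = c.data;
          c.seq.store(h + mask + 1, std::memory_order_release);  // free cell
          return true;
        }
        // CAS failed: h was refreshed with the current head, retry
      } else {
        h = head.load(std::memory_order_relaxed);  // we fell behind, reload
      }
    }
  }
};
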
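For option 2, the crude simplification I have in mind (my own version, NOT
the exact algorithm from the linked post) keeps the per-worker SPSC queues
from the first sketch: an idle worker raises a flag, and the producer routes
the next item to a flagged worker before falling back to round-robin:

#include <atomic>

// Reuses WorkItem, kWorkers, queues[], and dispatch_round_robin() from the
// first sketch above.
std::atomic<bool> wants_work[kWorkers];  // raised by a worker that ran dry

// Worker side: advertise hunger, then wait. The worker must re-check its
// queue after raising the flag, or it can sleep through a hand-off.
void worker_went_idle(size_t id) {
  wants_work[id].store(true, std::memory_order_release);
  // ... spin/park until queues[id] becomes non-empty ...
}

// Producer side: prefer a worker that asked for work. When everyone is
// busy, no flags are set and the scan costs almost nothing. The weakness
// stated above remains: an item already queued behind a long request is
// stuck with that worker.
void dispatch_with_requests(const WorkItem& w) {
  for (size_t i = 0; i != kWorkers; ++i) {
    bool expected = true;
    if (wants_work[i].compare_exchange_strong(expected, false,
                                              std::memory_order_acq_rel)) {
      queues[i].try_push(w);  // requester's queue is (nearly) empty
      return;
    }
  }
  dispatch_round_robin(w);    // nobody asked: old behavior
}
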
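And for option 3, the scheduling side could look like this (cost_model() is
hypothetical, standing in for whatever the startup tuning phase produces;
it again reuses the queues from the first sketch):

#include <atomic>
#include <cstdint>

// Reuses WorkItem, kWorkers, and queues[] from the first sketch above.
std::atomic<uint64_t> pending_cost[kWorkers];  // estimated queued micros

// Hypothetical: maps (work type, input size) to estimated microseconds,
// built from timing blur/sharpen/resize on sample inputs at startup.
uint64_t cost_model(const WorkItem& w);

// Producer side: always enqueue to the cheapest worker by estimated cost.
void dispatch_least_loaded(const WorkItem& w) {
  size_t best = 0;
  uint64_t best_cost = pending_cost[0].load(std::memory_order_relaxed);
  for (size_t i = 1; i != kWorkers; ++i) {
    uint64_t c = pending_cost[i].load(std::memory_order_relaxed);
    if (c < best_cost) { best = i; best_cost = c; }
  }
  pending_cost[best].fetch_add(cost_model(w), std::memory_order_relaxed);
  while (!queues[best].try_push(w)) { /* spin on full queue: sketch only */ }
}

// Worker side, after completing an item: give the estimate back.
void on_item_done(size_t worker_id, const WorkItem& w) {
  pending_cost[worker_id].fetch_sub(cost_model(w), std::memory_order_relaxed);
}
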
Any comments or other methods would be highly appreciated.