Introduction
In the previous post, I wrote about parallelism, and towards the end, mentioned Ray, the ultimate python package for parallelism.
I played around with it today, and wanted to give my take as well as describe the new challenge that I am currently stuck on.
Prelude
Ray is just so much fun. I am so glad I found it.
Want to define a function that can execute in the background?
@ray.remote
def add(x, y):
return x + y
await add.remote(1, 2)
Just how sick is this? Seriously? That function could also execute in a distributed manner, on some other machine entirely if you wanted to.
How about Actors? That must be more challenging to write, right? … WRONG.
@ray.remote
class Actor:
def __init__(self):
self.state = 0
def increment(self, x):
self.state += x
return self.state
actor = Actor.remote()
await actor.increment.remote(1)
print(await actor.increment.remote(1))
# Output: 2
Sateful actors, with a few simple lines of code. Just love it.
The Catch
So far, so good. Where is the catch?
I haven’t been able to find a built-in approach to dispatch work to a pool of workers and await each result separately.
I mean, there is ray.util.ActorPool
, however, its API is very limited. As far as I can tell, the main APIs are ActorPool.map
and ActorPool.submit
. ActorPool.map
makes sense if you have a bunch of data to process, but ActorPool.submit
is more for one-off tasks.
Given the above, ActorPool.map
definitely won’t do for my usecase which involves submitting work that is coming from an API request. ActorPool.submit
sounded interesting, however, I have no idea why they chose to make it not return anything. That means, once you submit a task, you have to go search somewhere else for the result :ugh:.
The Solution
I will now try to conjure up a solution as I go.
At first, I noticed that ActorPool
had one more cool API up its sleeve. ActorPool.pop_idle
. That would work for me, if I can just try to pop an idle actor, do my work on it, and return the result. Unfortunately, if there isn’t an idle actor at the time of the call, it will return None
. I wanted a blocking solution instead.
After reading the source for ActorPool
, I realized the usecase I want to use the actors for is completely different. For example, I would like to keep the async
api.
… After a bit of pondering, I realized I could just use a blocking queue to achieve the desired result. Pop an available actor, use it, then put it back. New API requests that come in will block until an actor is available in the queue.
Of course, I will prepare my own abstraction around the queue of actors to provide a simple API similar to ActorPool
, ProcessPoolExecutor
, etc.
Conclusion
How far will this streak go?