# Intro to concurrent computing (in Nim)
Some notes to get started
Modern computing usually involves multiple compute units, whether in the form of shaders on a graphics card or multiple cores on the central processing unit (CPU). Taking advantage of multiple cores can, under optimal circumstances, speed up your computation by a factor of $n$ (the number of cores).
Implementing a multi-core program, however, is not a trivial task. Some programming languages (such as Matlab, Python, Julia) offer a high-level abstract interface that works well for so-called embarrassingly parallel jobs. These jobs take little effort to implement on multiple cores, as the computation on each core happens independently of the other cores: little to no communication occurs between the cores, and shared memory is read-only.
## Concurrent computing in a nutshell
In concurrent computing the programmer harnesses the power of the compute units accessible to their program. These compute units can be cores on a single computer, or multiple computers communicating over a network or inside a server.
A concurrent program divides the task into sub-tasks that can run independently on threads (small independent programs).
Concurrent programming can be decomposed further into parallel programming and asynchronous programming. In parallel programming, threads are spawned and return results without any communication with the other spawned threads; they are parallel in the sense that they perform their parts of the task jointly but independently. Examples include computing the pairwise correlations between $N$ variables in parallel, or running independent replicates of an agent-based model.
Asynchronous programming works by handing sub-tasks off to threads in a non-blocking manner. That is, the sub-task runs in a thread while the main thread continues executing code. When called upon, the main thread can block to retrieve the outcome of the thread spawned earlier.
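As a quick illustration of this non-blocking pattern, Nim ships with async support in `std/asyncdispatch`. The sketch below (where `fetchData` is just a hypothetical stand-in for real work) starts a task, keeps the main flow running, and only blocks when the result is awaited:

```nim
import std/asyncdispatch

proc fetchData(): Future[int] {.async.} =
  # pretend to do some non-blocking work
  await sleepAsync(100)
  return 42

proc main() {.async.} =
  let fut = fetchData()   # start the task without blocking
  echo "the main flow keeps running"
  let answer = await fut  # block only when we need the result
  echo answer

waitFor main()
```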
In this post I will write down some notes on how to implement parallel programming in Nim. Compared to my other posts these notes are not extensive; they are a mere reminder to myself on how to do this in the future.
## The problem
I have $N$ simulations, where each simulation consists of:

1. creating an agent-based simulation with config $C_i$, $i \in \{1, \dots, N\}$
2. running the simulation and storing its results
The tasks (including the IO in step 2) can run in parallel, as the data generation process takes longer than dumping the files to disk.
Below I will introduce some dummy code, and will not bore you with the details of the implementation.
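To make the shape of the problem concrete, a serial version of the workload would look roughly like the sketch below, where `runSimulation` and `saveResults` are hypothetical placeholders for the actual simulation and IO code:

```nim
# Serial baseline. runSimulation and saveResults are hypothetical
# placeholders for the real simulation and IO code.
proc runSimulation(config: int): int = config * 2
proc saveResults(outcome: int, path: string) =
  writeFile(path, $outcome)

let N = 4
let configs = @[1, 2, 3, 4]
for i in 0 ..< N:
  let outcome = runSimulation(configs[i])    # step 1: run the model
  saveResults(outcome, "run_" & $i & ".txt") # step 2: dump to disk
```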
In Python I would use the `multiprocessing` library to create a pool of workers; the library hides the nitty-gritty details of the implementation. In Nim the parallel functionality is, at the time of writing, still hidden in the experimental section. Luckily, it is fully functional for my ends ;-).
The major frustration I had is that the documentation on the web was not completely clear on how this task can be achieved. The docs give the following example:
```nim
import std/threadpool

proc processLine(line: string) =
  discard "do some heavy lifting here"

for x in lines("myinput.txt"):
  spawn processLine(x)
sync()
```
The code defines a function and uses some special keywords introduced by the `threadpool` import.
The keyword `spawn` is used to spawn off (create) a thread. The execution flow works as follows: when invoking this script, the main thread spawns off separate threads to run `processLine`. These threads are created and executed at slightly different times. The main thread continues through every line of the dummy file until it hits `sync`. The `sync` keyword forces the main program to wait until all threads are finished. If `sync` were not there, the program could exit early, leaving the threads in limbo!
## FlowVar[T]
In Nim, a spawned function returns a `FlowVar[T]` variable. A `FlowVar` is a placeholder for a future result: it is nil as long as the thread has not put any information there yet. It effectively tells the compiler that a result may appear here once the thread returns a value, and allows the main thread to continue in the meantime.
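A minimal sketch of this mechanism might look as follows; `double` is just a toy function, and the snippet needs to be compiled with `--threads:on`:

```nim
import std/threadpool

proc double(x: int): int = 2 * x

# spawn returns immediately with a FlowVar[int] placeholder
let fv: FlowVar[int] = spawn double(21)

# the `^` operator blocks until the result is available, then reads it
echo ^fv  # prints 42
```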
## The implementation
What we need to do is map out the tasks we want to compute, create threads for these tasks, and retrieve the results of their computation.
We first create some dummy code that creates a load on the system. This is necessary because some compilers can optimize away non-functional code, which produces a faster-running program but bypasses the intent of our code (running it concurrently).
```nim
# file: concurrent.nim
import threadpool, sequtils, os

{.experimental: "parallel".}

proc heavyComputation(n: int): int {.thread.} =
  # simulate some arbitrary code
  result = n
  sleep((n * 100 mod 29) * 7)
```
The first lines make the keywords for multithreading available and enable the experimental parallel processing. The function is given a pragma, a compiler directive indicating that this function will be run on a thread. It mainly serves readability and ensures that the heap memory is managed properly.
The task itself simulates a heavy load by giving each invocation a different run time. The operations here are arbitrarily chosen. To identify which thread was executed when, we return the input `n`.
Next, we create the program that runs on the main thread:
```nim
# file: concurrent.nim
when isMainModule:
  # create dummy inputs representing our simulation configs
  let N = 10000
  var toCompute = (0 ..< N).toSeq
  # FlowVarBase erases the result type -- essentially a void conversion
  var results = newSeq[FlowVarBase](N)

  parallel:
    for idx in 0 ..< N:
      results[idx] = spawn heavyComputation(toCompute[idx])

  # get output and dump to disk
  var fp = open("test.txt", fmWrite)
  while results.len > 0:
    let jdx = results.blockUntilAny()
    if jdx >= 0:
      fp.writeLine(^FlowVar[int](results[jdx]))
      results.del(jdx)
  fp.close()
```
The first couple of lines create a simple integer array representing the simulations indicated above. Then we allocate an array `results` that will store the result (or status) of each task to be executed concurrently. The main thread spawns off the threads, stores a `FlowVar` inside `results`, and continues executing.
Then we open a file and scan for a completed result. When one is ready, we block the main thread and write the result to disk. The `^` (hat) operator ensures that the `FlowVar` result is ready (not nil) before reading it. The conversion from `FlowVarBase` to the concrete type is similar to a void conversion in C or C++; it is an implementation detail rather than good programming practice, and I wish this were implemented differently.
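If writing the results in submission order (rather than completion order) is acceptable, a sketch of a simpler variant keeps the typed `FlowVar[int]` handles around, which avoids the downcast entirely:

```nim
# variant: keep typed FlowVars and write results in submission order,
# avoiding the FlowVarBase downcast at the cost of losing
# "first finished, first written" behaviour
var typedResults = newSeq[FlowVar[int]](N)
for idx in 0 ..< N:
  typedResults[idx] = spawn heavyComputation(toCompute[idx])

var fp = open("ordered.txt", fmWrite)
for fv in typedResults:
  fp.writeLine(^fv)  # `^` blocks until this particular result is ready
fp.close()
```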
We execute the code with:

```sh
nim c -r --threads:on -d:release concurrent.nim
```
Opening the resulting text file, we can see that the threads did not (necessarily) return in linear order. For example, for me the file looks like:
```
0
1
2
3
...
22
9976
...
32
9965
```
The order of this sequence can differ.
## Conclusions
Concurrent programming can drastically speed up your code. Using threads in a language like Nim has the advantage of allowing more advanced techniques than the higher-level interfaces to concurrent programming present in, for example, Python. Nim offers some experimental features that can be leveraged to drastically improve the runtime of your code. In the future I will do some work on implementing parallel code on the GPU.