A simple start

When you work in this industry for a while, if you're introspective, you can discover what type of programmer you are. You might be a programmer obsessed with performance, or one always looking for clever tricks.

I am neither of those. While I do like to find new techniques or simpler ways to do something, I generally prefer to steal from others who are much smarter than I. I dislike tracking down performance problems in systems unless it is necessary. I tend to be interested in delivering a product that functions with reasonable performance characteristics. I do not analyze assembly output in order to add custom assembly code to take advantage of a specific Intel instruction that does xyz faster (but thank God for a few people I've met who do). I rarely think that is necessary.

So, for reasons unknown to me, I have ended up going down a rabbit hole of exploring some Go concepts in order to make something faster and use less memory. Let me explain.

rabbit hole

Recently I have had a need for an unbounded queue. Buffered channels are my go-to for a standard queue, and they are nice when you are okay with either:

  • blocking when the buffer is full
  • dropping an update when the buffer is full

Neither of those situations was going to work in this case. In reality, the queue's use would have a bound, but I would not know what that bound was ahead of time. I could not drop an update, have an update out of order, or block for an update. However, I could deal with higher memory use while we built up to whatever the bound would be.
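To make those two behaviors concrete, here is a minimal sketch of a buffered channel acting as a bounded queue; neither behavior was acceptable for my case:

package main

import "fmt"

func main() {
    ch := make(chan int, 2) // a buffered channel acting as a bounded queue

    ch <- 1
    ch <- 2 // the buffer is now full

    // Behavior 1: a plain send blocks until a reader makes room.
    // ch <- 3 // this would block here

    // Behavior 2: select with a default drops the update when the buffer is full.
    select {
    case ch <- 3:
        fmt.Println("queued")
    default:
        fmt.Println("buffer full, update dropped")
    }
}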

What I wanted was something that:

  • was FIFO with a single reader
  • pushed new items fast
  • read items fast
  • grew when the buffer became full
  • shrank when buffer space was not required
  • was unbounded
  • provided methods that popped if possible or returned !ok
  • provided methods that blocked until they could pop
  • did not eat up the CPU when one of those blocking methods was waiting
  • used no goroutines
  • solved a generic use case, so []interface{}

Note on being generic: Yeah, I hate []interface{}. Empty interface{} conversion is a time suck and I hate not having type checking at compile time. However, I'm anti-generics based on the proposals I've seen; they are just ugly and I don't want to spend 6 weeks explaining them to newbies. I'm fairly convinced, given the brain power behind trying to come up with generics for Go, that if there was a good answer it would have happened. They will either screw up compile time or make the code more difficult to understand. With that said, this code is probably ripe for using Genny or some other pre-compile code generator to save you the interface{} conversions.

To test implementations, I eventually settled on the interface I wanted them to satisfy:

type Buffer interface {
    // Push an item on to the buffer. If we have reached maximum
    // buffer size, return false.
    Push(item interface{}) (ok bool)
    // Force is like Push() except it will block until we can push onto the buffer.
    Force(item interface{})
    // Pop retrieves a value from the buffer. It returns false if the buffer
    // is empty.
    Pop() (val interface{}, ok bool)
    // Pull is like Pop() except it will block until it can retrieve a value.
    Pull() interface{}
}

This isn't the most Goish interface. Pull() for example could return a channel that keeps returning values until another method like Close() was called. But I wanted to avoid spinning off goroutines. My final version might include these.
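For illustration, a purely hypothetical shape of that more channel-oriented interface might look like the following; Stream and Close are made-up names, and an implementation would need the goroutines I wanted to avoid:

// StreamBuffer is a hypothetical alternative to the Buffer interface above.
// Instead of a blocking Pull(), Stream() returns a channel that keeps yielding
// values until Close() is called. This is not the interface I went with.
type StreamBuffer interface {
    Push(item interface{}) (ok bool)
    Force(item interface{})
    Pop() (val interface{}, ok bool)
    Stream() <-chan interface{}
    Close()
}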

But before I go any further, other than Attempt #1, everything I am doing here is wrong. This is a near-perfect example of over-optimizing. When writing an application you should concentrate on:

  • Application structure
  • Nice interfaces to data structures that can have the internals switched out
  • Good data structures

Performance optimization should come after testing, tracing, etc. have identified bottlenecks that are unacceptable in your application. Good data structures lead to more optimized algorithms if you need them. And if you abstract your interfaces cleanly, you can always optimize behind the scenes.

For reference, you can find the various code that I show or don't show here:

Different versions of the FIFO queue:
http://github.com/johnsiilver/golib/queue/fifo/unbounded/experimental/

The final version:
http://github.com/johnsiilver/golib/queue/fifo/unbounded/

Attempt 1: Not Being Clever

My first crack at this was to avoid being clever at all and to use Go in the simplest way possible.

This involved:

  • using a slice to hold the queue
  • using the slice's sliding window over its backing array to dequeue entries
  • using a mutex when enqueuing to or dequeuing from the slice
  • using a channel so that blocking calls do not need their own sleeping routines

Here's what the first attempt looked like:

package unbounded

import "sync"

// Buffer provides a first in first out non-blocking (but not lock-free) Buffer.
type Buffer struct {
    mu     sync.Mutex
    nextCh chan interface{}
    data   []interface{}
}

// New is the constructor for Buffer.
func New() *Buffer {
    return &Buffer{nextCh: make(chan interface{}, 1)}
}

// Push adds an item to the Buffer.
func (q *Buffer) Push(item interface{}) {
    q.mu.Lock()
    q.data = append(q.data, item)
    q.shift()
    q.mu.Unlock()
}

// Force will push the item onto the buffer and blocks until the operation completes.
func (q *Buffer) Force(item interface{}) {
    q.Push(item)
}

// shift will move the next available entry from our queue into the channel to
// be returned to the user. Must be locked by a calling function.
func (q *Buffer) shift() {
    if len(q.data) > 0 {
        select {
        case q.nextCh <- q.data[0]:
            q.data = q.data[1:]
        default:
        }
    }
}

// Pop returns the next value in the buffer.  It returns ok == false if there
// is no value in the buffer.
func (q *Buffer) Pop() (val interface{}, ok bool) {
    select {
    case item := <-q.nextCh:
        q.mu.Lock()
        q.shift()
        q.mu.Unlock()
        return item, true
    default:
        return nil, false
    }
}

// Pull pulls an item off the circular buffer. It blocks until a value is found.
func (q *Buffer) Pull() interface{} {
    item := <-q.nextCh
    q.mu.Lock()
    q.shift()
    q.mu.Unlock()
    return item
}

This one worked well enough. But it got me thinking about other ways to do this. I have no idea why. I should have just stopped. I had a good data structure for what I needed. I could optimize later if I needed to. But for some reason, I just decided I wanted to waste a day, very unlike me. Sorry, time to dive headfirst into the rabbit hole.

dive into the rabbit hole

First, for whatever crazy reason, I dislike using Mutex. In recent times I've begun using atomic.Value and other atomic operations to protect constantly changing values. I've benchmarked these and found them to provide significant speed increases if you hammer the variable with changes. But that got me thinking about just using one of the atomic.CompareAndSwap variants for my Mutex needs.
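As a quick illustration of that first point (unrelated to the queue itself), here is a minimal sketch of protecting a frequently read, frequently replaced value with atomic.Value rather than a Mutex; the config type is made up for the example:

package main

import (
    "fmt"
    "sync/atomic"
)

// config is a stand-in for any value that is read constantly and
// replaced often.
type config struct {
    limit int
}

func main() {
    var v atomic.Value
    v.Store(&config{limit: 10}) // writers store a whole new value

    c := v.Load().(*config) // readers load without taking a lock
    fmt.Println(c.limit)

    v.Store(&config{limit: 20}) // replace, never mutate in place
    fmt.Println(v.Load().(*config).limit)
}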

Second, there are a lot of array copies happening in this version. Array copies are very fast[1], but I don't like doing quite so many (this irked me, but it has no bearing on the efficiency of the method). Wouldn't some form of circular buffer be faster? Of course, it would need to be a circular buffer that didn't have a set size.

[1] Note: I saw a talk in the last year from someone showing that array copies were faster than heap allocations and pointer swaps by some crazy multiple. I remember his evidence being fairly good, and for this reason I concentrated on using either channels or slices for the Buffer. However, there are two old sayings from Google that apply here:

  • In God we trust, everyone else must bring the data
  • Trust but verify

I'm sure these originate somewhere else for anyone who is really nitpicky, but Google is the first place I heard them.

Go wants you to be clever, but not too clever

So Go isn't the language for people who want to be clever with optimizations. C is the language for clever optimizations. No runtime to contend with, crazy pointer arithmetic and direct control.

Go really wants you to be clever with your data structures, APIs, etc... Make it easy for your users and maintainers. There is an entire philosophy here, and after you swim in its waters for a while you do find it refreshing. At first, it just feels cold.

With that said, when you start looking at the internals of low-level tasks, you will often find that the standard library has access to code your programs don't. For example, if you want to spin-lock as efficiently as the Mutex implementation does, you can't. Mutex has access to runtime_doSpin(), which your program doesn't.

There are other memory barrier tricks that your code can't take advantage of unless you want to use Cgo, which for the purpose of this article is cheating. And I'm sure I could do a hack to gain access to these things, but that's not portable. One of Go's strengths is that until you start using unsafe or syscall, your code really is portable (while C just dreams it is actually portable).

It is important to understand this, because if you try to write a Mutex that is more optimized for your use case, you are not going to be as efficient. You can attempt to do some of this with runtime.Gosched(), but you are rolling that rock uphill. Apparently, I like rolling rocks.

rolling the rock

Trying to be clever

My first attempt at being clever was to use channels as a circular buffer. The circular buffer would be protected by a simple atomic.CompareAndSwapInt32(). If I couldn't insert a value into the channel, I would simply create a new channel with double the capacity, copy values from the old channel to the new channel, insert the new value, and swap the old channel for the new one. Pretty much what append() does for slices.

My reasoning was:

  • A lot of work has been done in channels to prevent the use of locks when the channel is not blocked
  • select has some very nice properties that can be used
  • atomic.CompareAndSwapInt32 is much faster than sync.Mutex

I knew that there was some risk:

  • Mutex makes certain guarantees that a caller will not be starved. CompareAndSwapInt32() by itself does not.
  • I needed to write my own sleeping code that tried to spin first and eventually slept up to some maximum value. Otherwise, when idle these locking loops would eat the CPUs alive.

The sleep code took a while to get to a point that seemed to work in a manner that didn't kill performance nor choke my CPU for too long.

package spin

import (
    "runtime"
    "time"
)

// Sleeper allows sleeping for an increasing time period up to 1 second
// after giving up a goroutine 2^16 times.
// This is not thread-safe and should be thrown away once the loop that calls it
// is able to perform its function.
type Sleeper struct {
    loop uint16

    at time.Duration
}

// Sleep at minimum allows another goroutine to be scheduled and after 2^16
// calls will begin to sleep from 1 nanosecond to 1 second, with each
// call raising the sleep time by a multiple of 10.
func (s *Sleeper) Sleep() {
    const maxSleep = 1 * time.Second

    if s.loop < 65535 {
        runtime.Gosched()
        s.loop++
        return
    }

    if s.at == 0 {
        s.at = 1 * time.Nanosecond
    }

    time.Sleep(s.at)

    if s.at < maxSleep {
        s.at = s.at * 10
        if s.at > maxSleep {
            s.at = maxSleep
        }
    }
}

The first custom version came out like:

package custom_sleep_atomic

import (
    "runtime"
    "sync/atomic"
    "time"

    "github.com/johnsiilver/golib/queue/fifo/internal/spin"
)

// Unbounded indicates that the Queue should have no memory bounds.
const Unbounded = -1

const (
    unlocked = int32(0)
    locked   = int32(1)
)

// Buffer provides a FIFO circular buffer that can grow and shrink as required.
// Buffer must not be copied after creation (which means use a pointer if
// passing between functions).
type Buffer struct {
    // Max is the maximum size the Buffer can grow to.  Use Unbounded if
    // you wish to grow the buffer to any size. By default, this will grow to 1k items.
    Max int

    lockInt    int32
    lastShrink time.Time
    data       chan interface{}
}

// Push pushes an item into the circular buffer. "ok" indicates if this happens.
func (c *Buffer) Push(item interface{}) (ok bool) {
    c.lock()
    defer c.unlock()

    select {
    case c.data <- item:
        return true
    default:
    }

    // The buffer was too small, grow the buffer and then insert.
    c.grow()
    select {
    case c.data <- item:
        return true
    default:
        return false
    }
}

// Force will push the item onto the buffer and blocks until the operation completes.
func (c *Buffer) Force(item interface{}) {
    sleeper := spin.Sleeper{}

    for {
        if c.Push(item) {
            return
        }
        sleeper.Sleep()
    }
}

// Pop returns the next value off the circular buffer. If the buffer is empty
// ok will be false.
func (c *Buffer) Pop() (value interface{}, ok bool) {
    c.lock()
    defer c.unlock()

    select {
    case v := <-c.data:
        c.shrink()
        return v, true
    default:
        return nil, false
    }
}

// Pull pulls an item off the circular buffer. It blocks until a value is found.
func (c *Buffer) Pull() interface{} {
    sleeper := spin.Sleeper{}

    for {
        v, ok := c.Pop()
        if !ok {
            sleeper.Sleep()
            continue
        }
        return v
    }
}

// grow will double the size of the internal buffer until we hit the max size
// if the buffer is currently full.
// Note: grow must be protected with .lock/.unlock.
func (c *Buffer) grow() {
    if c.Max == 0 {
        c.Max = 1000
    }

    // A zero value Buffer has a nil channel; give it a small starting buffer.
    if c.data == nil {
        c.data = make(chan interface{}, 8)
        return
    }

    if cap(c.data) == c.Max {
        return
    }

    if len(c.data) == cap(c.data) {
        if c.Max == Unbounded {
            ch := make(chan interface{}, cap(c.data)*2)
            c.copy(ch, c.data)
            c.data = ch
            return
        }
        if cap(c.data) < c.Max {
            size := cap(c.data) * 2
            if size == 0 {
                size = 8
            }
            if size > c.Max {
                size = c.Max
            }
            ch := make(chan interface{}, size)
            c.copy(ch, c.data)
            c.data = ch
            return
        }
    }
}

// shrink shrinks the size of the internal buffer if the length of the buffer
// is < 50% of the capacity.  The reduction will be by 25% but will produce
// a buffer size of no less than 8 slots.
// Note: shrink must be protected with .lock/.unlock.
func (c *Buffer) shrink() {
    if cap(c.data) == 8 {
        return
    }

    // Only shrink at most once every 10 minutes to limit the number of copies.
    if time.Since(c.lastShrink) < 10*time.Minute {
        return
    }

    // If the current unused capacity is > 50% of the buffer, reduce it by 25%.
    if (cap(c.data) - len(c.data)) > (cap(c.data) / 2) {
        size := int(float64(cap(c.data)) * .75)
        if size < 8 {
            size = 8
        }

        ch := make(chan interface{}, size)
        c.copy(ch, c.data)
        c.data = ch
        c.lastShrink = time.Now()
    }
}

func (c *Buffer) copy(dst chan<- interface{}, src chan interface{}) {
    if (cap(dst) - len(dst)) < (cap(src) - len(src)) {
        panic("internal error: Buffer.copy() cannot be called when dst is smaller than src")
    }

    close(src)
    for v := range src {
        dst <- v
    }
}

func (c *Buffer) lock() {
    for {
        if atomic.CompareAndSwapInt32(&c.lockInt, unlocked, locked) {
            return
        }
        runtime.Gosched()
    }
}

func (c *Buffer) unlock() {
    for {
        if atomic.CompareAndSwapInt32(&c.lockInt, locked, unlocked) {
            return
        }
        runtime.Gosched()
    }
}

This one had some nice properties. It had a useful zero value, which I tried to use with other implementations instead of a constructor (which was not always possible).

It also only shrank at intervals, reducing the need for copies.
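For illustration, here is a tiny usage sketch of that zero value, assuming the import path used in the benchmark below:

package main

import (
    "fmt"

    "github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_sleep_atomic"
)

func main() {
    // The zero value is usable as-is; Max defaults to 1000 items on first use.
    b := custom_sleep_atomic.Buffer{}

    b.Force("hello")
    if v, ok := b.Pop(); ok {
        fmt.Println(v)
    }
}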

The Truth is in the Pudding

Now I have two versions, but how does each stack up against the other? Oh, the fun of benchmarking!

bench.go

package bench

import (
    "sync"
    "testing"
)

type unbounded interface {
    Force(item interface{})
    Pull() interface{}
}

func singleRun(bench *testing.B, n func() unbounded, items, senders, receivers int) {
    for i := 0; i < bench.N; i++ {
        bench.StopTimer()

        b := n()
        sendCh := make(chan int, items)
        wg := sync.WaitGroup{}
        wg.Add(items)

        // Setup senders.
        for i := 0; i < senders; i++ {
            go func() {
                for v := range sendCh {
                    b.Force(v)
                }
            }()
        }

        // Setup receivers.
        for i := 0; i < receivers; i++ {
            go func() {
                for {
                    b.Pull()
                    wg.Done()
                }
            }()
        }

        bench.StartTimer()

        // Send to Buffer (which the receivers will read from)
        go func() {
            for i := 0; i < items; i++ {
                sendCh <- i
            }
            close(sendCh)
        }()

        wg.Wait()
    }
}

bench_test.go

package bench

import (
    "fmt"
    "testing"

    "github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_locks"
    "github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_sleep_atomic"
    "github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/sliding_slice_atomic"
    "github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/sliding_slice_locks"
)

func BenchmarkUnboundedQueues(b *testing.B) {
    runs := []struct{ items, senders, receivers int }{
        {100000, 1, 1},
        {100000, 10, 1},
        {100000, 1, 10},
        {100000, 10, 10},
        {100000, 100, 10},
        {100000, 10, 100},
    }

    benchmarks := []struct {
        name    string
        newFunc func() unbounded
    }{
        {
            "sliding_slice_atomic",
            func() unbounded { return sliding_slice_atomic.New() },
        },
        {
            "sliding_slice_locks",
            func() unbounded { return sliding_slice_locks.New() },
        },
        {
            "custom_sleep_atomic",
            func() unbounded { return &custom_sleep_atomic.Buffer{} },
        },
        {
            "custom_locks",
            func() unbounded { return &custom_locks.Buffer{} },
        },
    }

    for _, run := range runs {
        for _, benchmark := range benchmarks {
            b.Run(
                benchmark.name+fmt.Sprintf("-items: %d, senders: %d, receivers: %d", run.items, run.senders, run.receivers),
                func(b *testing.B) {
                    singleRun(b, benchmark.newFunc, run.items, run.senders, run.receivers)
                },
            )
        }
    }
}

This benchmark was set up to test a few scenarios:

  • Single sender and receiver
  • 10 senders and 1 receiver
  • 1 sender and 10 receivers
  • 10 senders and 10 receivers
  • 100 senders and 10 receivers
  • 10 senders and 100 receivers

This gives a static spread of senders and receivers to see how each version scales across different scenarios. I want a good generic solution that handles each of these reasonably.

In this initial test, I also included versions of both that use atomic.CompareAndSwapInt32() and sync.Mutex. Here are the initial results:

Note: Bold is the fastest version.

Version #Items #Senders #Receivers Iterations ns/op
sliding_slice_atomic 100000 1 1 30 50688582
sliding_slice_locks 100000 1 1 30 41097381
custom_sleep_atomic 100000 1 1 20 60742172
custom_locks 100000 1 1 30 61267323

Version #Items #Senders #Receivers Iterations ns/op
sliding_slice_atomic 100000 10 1 30 46452147
sliding_slice_locks 100000 10 1 50 37041679
custom_sleep_atomic 100000 10 1 10 106059804
custom_locks 100000 10 1 20 116124583

Version #Items #Senders #Receivers Iterations ns/op
sliding_slice_atomic 100000 100 1 30 59664733
sliding_slice_locks 100000 100 1 30 39991409
custom_sleep_atomic 100000 100 1 1 1048544140
custom_locks 100000 100 1 2 699103394

Version #Items #Senders #Receivers Iterations ns/op
sliding_slice_atomic 100000 10 10 30 50459215
sliding_slice_locks 100000 10 10 30 42415787
custom_sleep_atomic 100000 10 10 20 76158348
custom_locks 100000 10 10 20 74131977

Version #Items #Senders #Receivers Iterations ns/op
sliding_slice_atomic 100000 10 100 20 82940141
sliding_slice_locks 100000 10 100 30 45741994
custom_sleep_atomic 100000 10 100 5 221296458
custom_locks 100000 10 100 20 97246758

Well....huh...

First takeaways:

  • atomic and custom sleeping are not beating standard Mutex
  • The array copies are most likely beating the custom channel copies
  • Using select and channels, while providing some advantages, is probably hurting us with a lot of extra locking and other magic

There was some profiling and such, but I decided to go on and test a bunch of other versions.

If At First You Don't Succeed, Fail Some More??

It would be a lie to say I'm covering all the steps I took here. It was a winding path of outright failure and tuning. There were other tests I wrote to prove that CPU usage would drop after being stuck either pushing or pulling, and other things I'm not going to cover. The rabbit hole seems to keep going.

large rabbit hole

In the end, I decided to benchmark against a few more methods:

  • Versions of each that had no sleep. These would eat CPU, but I wanted to see if it made a difference with using atomic for locking
  • I wanted to run against a reference type that used a channel with a buffer of 100 and was not unbounded. Resizing to support unbounded is expensive, but how expensive? (A minimal sketch of this reference appears right after this list.)
  • What if I used a queue that was more CS-standard, with records holding pointers to the next entry in the queue (I called this the heap version)? According to the talk I saw, this should be much slower, but how much slower?
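The exact reference code isn't shown in this post, but it is nothing more than a bounded, buffered channel dressed up in the same Force/Pull interface. A minimal sketch might look like:

package reference

// Buffer is a bounded reference implementation: a buffered channel of 100
// items wearing the same Force/Pull methods as the unbounded versions.
type Buffer struct {
    data chan interface{}
}

// New is the constructor for Buffer.
func New() *Buffer {
    return &Buffer{data: make(chan interface{}, 100)}
}

// Force blocks until the item fits in the buffer.
func (b *Buffer) Force(item interface{}) {
    b.data <- item
}

// Pull blocks until an item is available.
func (b *Buffer) Pull() interface{} {
    return <-b.data
}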

Well, let's have a look:

Note: Bold is for the fastest, discounting the reference.
Note: The reference is in blue when it was the fastest.

Version #Items #Senders #Receivers Iterations ns/op
reference 100000 1 1 100 20660880
sliding_slice_atomic 100000 1 1 30 49620908
sliding_slice_locks 100000 1 1 30 38928647
custom_sleep_atomic 100000 1 1 30 60764994
custom_nosleep 100000 1 1 20 61262504
custom_locks 100000 1 1 20 63590469
heap_lock 100000 1 1 50 27832756
heap_atomic 100000 1 1 100 25798467

Version #Items #Senders #Receivers Iterations ns/op
reference 100000 10 1 50 31719731
sliding_slice_atomic 100000 10 1 20 99060012
sliding_slice_locks 100000 10 1 30 78695899
custom_sleep_atomic 100000 10 1 10 119628479
custom_nosleep 100000 10 1 10 105837735
custom_locks 100000 10 1 10 111282187
heap_lock 100000 10 1 50 28885042
heap_atomic 100000 10 1 50 28278157

Version #Items #Senders #Receivers Iterations ns/op
reference 100000 100 1 50 29424432
sliding_slice_atomic 100000 100 1 30 45287610
sliding_slice_locks 100000 100 1 30 49388002
custom_sleep_atomic 100000 100 1 2 1014270608
custom_nosleep 100000 100 1 1 1018922777
custom_locks 100000 100 1 3 478910481
heap_lock 100000 100 1 100 26944540
heap_atomic 100000 100 1 50 28301942

Version #Items #Senders #Receivers Iterations ns/op
reference 100000 10 10 50 29539230
sliding_slice_atomic 100000 10 10 30 47400717
sliding_slice_locks 100000 10 10 30 44891524
custom_sleep_atomic 100000 10 10 20 73116554
custom_nosleep 100000 10 10 20 74409189
custom_locks 100000 10 10 20 70609204
heap_lock 100000 10 10 50 23445934
heap_atomic 100000 10 10 50 26386795

Version #Items #Senders #Receivers Iterations ns/op
reference 100000 10 100 50 24754023
sliding_slice_atomic 100000 10 100 20 58304944
sliding_slice_locks 100000 10 100 50 46103498
custom_sleep_atomic 100000 10 100 5 203424993
custom_nosleep 100000 10 100 5 211169360
custom_locks 100000 10 100 20 89729445
heap_lock 100000 10 100 50 25450734
heap_atomic 100000 10 100 100 34094147

Well....huh...., again!

It is not surprising that the reference is faster in several versions. But what surprised me was the standard pointer-based queue. The heap_lock and heap_atomic versions took this test by storm.

heap_atomic and heap_lock being so close in most benchmarks leads me to believe that the custom.* versions are slower because there is a bunch of lock contention with the combination of channels/locks(or atomic)/select. I could go work on that, but I don't think that is worth my time.

Now, I'm not sure why the speaker's assessment I listened to was wrong. I tend to think that this kind of thing is my fault. It could be:

  • I misheard what he was saying
  • My copy-based versions are missing some optimization his had
  • The compiler has changed
  • The benchmark is broken
  • He was just dead wrong

No matter which one it is, the heap_lock version is where I am going to concentrate my effort.

But why not heap_atomic? While it did win in some benchmarks, simpler is better. There isn't enough difference to make me want to use atomic over Mutex, and Mutex makes guarantees about goroutine starvation that atomic will not.
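The heap_lock code itself isn't listed in this post, but the idea is just a classic singly linked list guarded by a sync.Mutex. A rough sketch of that shape (not necessarily the exact code in the repo; Force/Pull would wrap Push/Pop with the Sleeper shown earlier) looks like:

package heap_lock

import "sync"

// node is a single entry in the queue, pointing at the next entry.
type node struct {
    val  interface{}
    next *node
}

// Buffer is a FIFO queue built from heap-allocated nodes, guarded by a Mutex.
type Buffer struct {
    mu          sync.Mutex
    front, back *node
}

// Push adds an item to the back of the queue. The queue is unbounded, so
// this always succeeds.
func (b *Buffer) Push(item interface{}) bool {
    b.mu.Lock()
    defer b.mu.Unlock()

    n := &node{val: item}
    if b.back == nil {
        b.front, b.back = n, n
    } else {
        b.back.next = n
        b.back = n
    }
    return true
}

// Pop removes the item at the front of the queue, if there is one.
func (b *Buffer) Pop() (interface{}, bool) {
    b.mu.Lock()
    defer b.mu.Unlock()

    if b.front == nil {
        return nil, false
    }
    n := b.front
    b.front = n.next
    if b.front == nil {
        b.back = nil
    }
    return n.val, true
}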

What If I Made the Heap Version Circular?

The heap version uses a simple queue. It is memory efficient and apparently fast. But I'm still haunted by allocating extra memory when I don't need to.

What if instead of throwing away entries when dequeuing I just move a pointer in a circular buffer? That entry could be reused without the need for allocating a new entry. And if I run out of room I can always insert more entries at the back.

This will add some overhead in checking if things like the front and tail pointers are the same to know if we need to do an allocation.
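Conceptually, the circular version keeps popped nodes in the ring and reuses them, splicing in new nodes only when every slot is occupied. Here is a simplified sketch of that idea, not the benchmarked code (it tracks counts rather than comparing front/tail pointers, which is a simplification):

package heap_circular

import "sync"

// node is one slot in the ring. Slots are reused after being popped.
type node struct {
    val  interface{}
    next *node
}

// Buffer is a FIFO queue built as a growable circular linked list.
type Buffer struct {
    mu     sync.Mutex
    front  *node // oldest item
    back   *node // newest item
    length int   // items currently stored
    size   int   // nodes in the ring
}

// Push adds an item, reusing a free slot if one exists and splicing in a new
// node only when every slot is occupied.
func (b *Buffer) Push(item interface{}) bool {
    b.mu.Lock()
    defer b.mu.Unlock()

    switch {
    case b.size == 0: // first use: a one-node ring pointing at itself
        n := &node{val: item}
        n.next = n
        b.front, b.back = n, n
        b.size, b.length = 1, 1
    case b.length == b.size: // ring full: splice a new node in after back
        n := &node{val: item, next: b.back.next}
        b.back.next = n
        b.back = n
        b.size++
        b.length++
    default: // reuse the free slot after back
        b.back = b.back.next
        b.back.val = item
        b.length++
    }
    return true
}

// Pop removes the oldest item, leaving its node in place for reuse.
func (b *Buffer) Pop() (interface{}, bool) {
    b.mu.Lock()
    defer b.mu.Unlock()

    if b.length == 0 {
        return nil, false
    }
    v := b.front.val
    b.front.val = nil // drop the reference so it can be collected
    b.front = b.front.next
    b.length--
    return v, true
}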

Here's a first attempt at it with 100k items:

Version #Items #Senders #Receivers Iterations ns/op
heap_lock 100000 1 1 50 26110856
heap_circular 100000 1 1 50 26823725
heap_lock 100000 10 1 50 29731315
heap_circular 100000 10 1 30 34592901
heap_lock 100000 100 1 50 30892013
heap_circular 100000 100 1 30 40746096
heap_lock 100000 10 10 50 27873525
heap_circular 100000 10 10 50 35315318
heap_lock 100000 10 100 20 57328038
heap_circular 100000 10 100 30 60764377

Well, it looks like the queue is still winning. Not by a huge amount, but enough.

I wonder if we see better results with 1m items:

Version #Items #Senders #Receivers Iterations ns/op
heap_lock 1000000 1 1 5 275110923
heap_circular 1000000 1 1 5 251509839
heap_lock 1000000 10 1 5 316068383
heap_circular 1000000 10 1 3 367007666
heap_lock 1000000 100 1 3 489702046
heap_circular 1000000 100 1 3 397814725
heap_lock 1000000 10 10 3 418825312
heap_circular 1000000 10 10 3 404838549
heap_lock 1000000 10 100 2 579100509
heap_circular 1000000 10 100 2 630555319

How about 10m items:

Version #Items #Senders #Receivers Iterations ns/op
heap_lock 10000000 1 1 1 2719649748
heap_circular 10000000 1 1 1 2896301313
heap_lock 10000000 10 1 1 3239424103
heap_circular 10000000 10 1 1 3812422117
heap_lock 10000000 100 1 1 3314603774
heap_circular 10000000 100 1 1 4393173577
heap_lock 10000000 10 10 1 4271595984
heap_circular 10000000 10 10 1 4358516405
heap_lock 10000000 10 100 1 6051491583
heap_circular 10000000 10 100 1 5518405633

Now my growing circular buffer is probably overly complicated, which leads to some time suck. And it is close in time to heap_lock for the common cases (1:1/10:10/10:100 senders to receivers).

But looking at the trace tool, the allocations are just minor time sinks, and GC rarely kicks in because of them in heap_lock. That makes heap_lock the overall winner in my view. The code is easy to follow and fast.

I think heap_circular has merit, but with a slightly different interface and only if the data being stored is concrete, not interface{}. If the entries held large data, such as []byte, we could use it as a combination FIFO queue and free list. You could have a method that returns an empty value, pulling a dequeued entry if one is available or allocating a new one if the buffer is full. That could then be used to enqueue without a large allocation. But that is a very specific use case, which I am not going to test. And of course, there is the very good argument for just keeping a free list and this queue separate.

I am absolutely sure that many of these can be made faster with some trivial work (but I've wasted enough time with this thought experiment). And there were all kinds of things I didn't try (using []unsafe.Pointer and CompareAndSwapPointer() or a slice of slices or ...).

But now I am absolutely sure that:

  • I should have used the one in Attempt #1
  • I am not a computer scientist

If you read this far, I'm sorry to have dragged you down the rabbit hole with me. Misery occasionally loves company.

safe landing