Okay, now that I've optimized once, I have to measure twice *g*
Fortunately it turns out that thread synchronization is pretty cheap.
I whipped up a quick & dirty but hopefully meaningful benchmark to measure both:
1) Raw mutex locking/unlocking
2) Alternating threads, just like in the aforementioned scenario
Bottom line: The synchronization performance hit is on the order of 0.000001 seconds (~1 microsecond) per thread switch, which is easily offset by running the non-critical section concurrently. Acceptable!
Code:
Clock resolution (process timer): 1 ns

Test 1: single thread mutex lock cycles
 iterations: 124413150
 runtime: 4999 ms
 => 24883535 lock cycles per s
 => 24883 lock cycles per ms

Test 2: two alternating threads
 iterations: 12428330
 runtime: 11055 ms
 => 1124151 thread switches per s
 => 1124 thread switches per ms
The test was run on an Intel Core 2 Duo at 3600 MHz. These numbers should be taken with a grain of salt, but they do give an impression of the order of magnitude that can be expected.
Test source code (C99, POSIX.1-2001):
Code:
/* Quick & dirty mutex/thread benchmark.
 * Build with -pthread; link against the real-time library (-lrt)
 * if your libc keeps clock_gettime() there.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>
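
/* Returns the CPU time consumed by this process in microseconds. */
uint64_t time_usec() {
    struct timespec t;
    // High-resolution per-process timer from the CPU. Note that it counts
    // CPU time used by all threads of the process, not wall-clock time.
    clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &t );
    return ((uint64_t) t.tv_sec) * 1000000 + t.tv_nsec / 1000;
}
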
uint64_t get_approx_locks_per_sec();
void test1_singlethreaded_cycles();
void test2_alternating_threads();

int main() {
    // Print clock resolution
    struct timespec res;
    clock_getres( CLOCK_PROCESS_CPUTIME_ID, &res );
    printf("Clock resolution (process timer): %ld ns\n\n", res.tv_nsec);

    // Tests
    test1_singlethreaded_cycles();
    test2_alternating_threads();
    return 0;
}

uint64_t get_approx_locks_per_sec() {
    // Rough estimate of locks/second. Why?
    // I don't want to retrieve the system time in each lock cycle
    // during the actual benchmark and thus need an approximate
    // number of necessary iterations.
    pthread_mutex_t m;
    uint64_t t_start, t_end, t;
    uint64_t lockcount;

    pthread_mutex_init(&m, NULL);
    t_start = time_usec();
    t = lockcount = 0;
    while( t < 1000000 ) { // 1 second
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
        t_end = time_usec();
        t = t_end - t_start;
        lockcount++;
    }
    return (1000000 * lockcount)/t;
}

/**** Test 1 ****/
void test1_singlethreaded_cycles() {
    uint64_t lockcount;
    const uint64_t iterations = get_approx_locks_per_sec() * 50;
    uint64_t t_start, t_end, t;

    // uint64_t values are cast to unsigned long long to match %llu
    printf("Test 1: single thread mutex lock cycles\n");
    printf(" iterations: %llu\n", (unsigned long long) iterations);

    pthread_mutex_t m;
    pthread_mutex_init( &m, NULL );

    // Time a tight, uncontended lock/unlock loop
    t_start = time_usec();
    for(lockcount = 0; lockcount < iterations; lockcount++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    t_end = time_usec();
    t = t_end - t_start;

    printf(" runtime: %llu ms\n", (unsigned long long) (t/1000));
    printf(" => %llu lock cycles per s\n", (unsigned long long) ((1000000 * lockcount) / t));
    printf(" => %llu lock cycles per ms\n", (unsigned long long) ((1000 * lockcount) / t));
    printf("\n");
}

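/**** Test 2 ****/
struct switchlock_t {
    pthread_mutex_t *A, *B;
    uint64_t *count, iterations;
};

void* threadfunc( void* p_slock ) {
    struct switchlock_t* slock = (struct switchlock_t*) p_slock;

    // Quick & dirty caveats: each thread unlocks a mutex it may not hold,
    // which POSIX leaves undefined for default mutexes, and *count is
    // shared between both threads without protection.
    while( (*slock->count) < slock->iterations ) {
        pthread_mutex_unlock( slock->A ); // let the other thread run
        (*slock->count)++;
        pthread_mutex_lock( slock->B );   // wait for the other thread
    }
    // Release both mutexes so the other thread cannot get stuck
    pthread_mutex_unlock( slock->A );
    pthread_mutex_unlock( slock->B );
    printf(" thread finished...\n");
    return NULL;
}
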
void test2_alternating_threads() {
    struct switchlock_t slock1, slock2;
    uint64_t lockcount = 0;
    uint64_t t_start, t_end, t;
    pthread_mutex_t m_A, m_B;
    pthread_mutex_init(&m_A, NULL);
    pthread_mutex_init(&m_B, NULL);

    // Thread 1 releases A and waits on B; thread 2 does the opposite
    slock1.A = &m_A;
    slock1.B = &m_B;
    slock1.count = &lockcount;
    slock1.iterations = get_approx_locks_per_sec() * 5;
    slock2 = slock1;
    slock2.A = &m_B;
    slock2.B = &m_A;

    printf("Test 2: two alternating threads\n");
    printf(" iterations: %llu\n", (unsigned long long) slock1.iterations);

    t_start = time_usec();
    pthread_t thread1, thread2;
    pthread_create( &thread1, NULL, threadfunc, &slock1 );
    pthread_create( &thread2, NULL, threadfunc, &slock2 );
    pthread_join( thread1, NULL );
    pthread_join( thread2, NULL );
    t_end = time_usec();

    // summary
    t = t_end - t_start;
    printf(" runtime: %llu ms\n", (unsigned long long) (t/1000));
    printf(" => %llu thread switches per s\n", (unsigned long long) (1000000 * lockcount / t));
    printf(" => %llu thread switches per ms\n", (unsigned long long) (1000 * lockcount / t));
}
No matter if all that work turns out to be useful or not, I learned quite a bit about multithreading and thought I'd share it while I am at it.