Improving your bot's performance on multi-core CPUs

**Sednogmah** · 01-18-2010

Improve your bot's performance on multi-core CPUs

1. Basic idea

Many injected bots are hooking a library function like OpenGL's glClear or the DirectX equivalent to execute code after WoW has finished rendering a frame. This is necessary to have a consistent engine state. Unfortunately it inevitably slows down the whole game, reducing the number of frames per second.

Fig. 1: WoW rendering frame by frame:

Fig. 2: Injected code (red) running at the end of each frame:

The first step is to offload the bot code into a separate thread. This comes with a major problem though: You can't safely access WoW's memory or call engine functions from a concurrent thread. You need some sort of synchronization mechanism.

Fig. 3: Obviously nothing is gained if you just lock your bot's thread while WoW is processing a frame:

If your bot has computing-intensive parts, most of those will probably not rely on calling engine functions or accessing memory. This means that your code can be split into a critical and non-critical section. The critical section has to be synchronized with WoW's main thread while the non-critical can be concurrent.

The critical section will contain code that:
- accesses WoW's memory
- calls engine functions

The non-critical sections could be:
- path finding
- decision making
- updating your GUI
- accessing a database
- logging
- file access, e.g. writing compressed screenshots

Fig. 4: Asynchronous processing. red = critical, green = non-critical. Obvious performance gain:

2. Mutexes
Threads can be synchronized with mutexes. I will use the default behaviour of POSIX mutexes for my example which are available on Mac OS X and Linux. Our mutex M has two states: locked and unlocked. We will use two functions to interact with the mutex: mutex_lock(M) and mutex_unlock(M). The behaviour of the two functions depends on the state of M:

If "M is unlocked":
- mutex_unlock(M) does nothing
- mutex_lock(M) locks M

If "M is locked":
- mutex_unlock(M) unlocks M
- mutex_lock(M) blocks the calling thread until it is unlocked

3. Implementation (Pseudocode)
Let's assume we have hooked a library function that gets called after every frame. We call it EndFrameHook(). We have also created a separate thread for our bot, running an infinite loop.

Code:

// Initialization code

Mutex m_Wow; // locked when WoW is busy with processing/rendering
Mutex m_Bot; // locked when our bot is in its critical section

// Starting our bot's thread
mutex_lock(m_Wow)
start_bot_thread()

Code:

EndFrameHook() { // called after WoW finished processing a frame
	mutex_unlock(m_Wow); // It's safe to interact with the engine now
	mutex_lock(m_Bot);   // Blocks during the bot's critical section 
}

//
BotThread() {
	for(ever) {
		mutex_lock( m_Wow )
				
		// CRITICAL START
		accessMemory()
		callEngineFunctions()
		// CRITICAL END

		mutex_unlock( m_Bot )

		// NON-CRITICAL section, for example:
		decisionMaking()
		pathFinding()
		accessFiles()
		database()
	}
}

4. Further notes

Fig. 5: What if the non-critical section takes more time than a frame?

As you can see, if the non-critical section takes more time than WoW needs to render a frame, no frames are skipped and the main thread's execution is paused. If you want frame skipping, you need to wrap the non-critical section with a third mutex lock and test it with mutex_trylock in EndFrameHook(). I will add pseudocode for that later. The result would look like this:

Edit 1: As amadmonk pointed out, concurrent access of engine functions is not _necessarily_ unsafe but _can_ be
Edit 2: Fixed implementation, had one mutex_unlock() in the wrong place. Updated images. Added notes.

**namreeb** · 01-18-2010

You present this information very well. Nice graphics!

**amadmonk** · 01-18-2010

Clean presentation.

One thing I'd check, however, is the assumption that you can't access engine functions safely in an unsynchronized fashion. You most assuredly can -- I've had a bot running for 12 hours straight doing roughly 20-40 engine calls per second (some as simply as EnumVisibleObjects, some full Lua stuff) with no evident race conditions. This was surprising to me because I know we use a *global* Lua state object for the Lua engine, but perhaps Lua adds its own synchronization.

So, bravo on a clear, detailed description of a way to improve bot performance. My only modification would be to remember Knuth's caution: "premature optimization is the root of all evil." Make sure that your optimizations actually FIX a problem (surprisingly many programmers forget that), and make sure that your optimization doesn't introduce *other* problems.

**audible83** · 01-18-2010

Easy to understand explanation of the theory behind it. Good work!

Next up is Cuda ?

**Apoc** · 01-18-2010

Very well said. However, I have to agree with amadmonk.

Currently, it's very viable to run a very large amount of code during each frame. In comparison; the amount of code you run, should be roughly 1% of what WoW runs per-frame. (No, that's not an exaggeration either.) Obviously you shouldn't be doing very 'intensive' things from within the main thread, but you can most definitely run a fairly complex AI system from an EndScene pulse without issue.

Even if I try very hard; I can only get WoW to lose roughly 5FPS with our current framework. (Not including deliberately sleeping the main thread of course, or doing any lengthy file reads, etc.)

**Sednogmah** · 01-18-2010

I fully agree with both of you as well: It's certainly not necessary most of the time. Thinking of it now, I'm not entirely sure anymore why I even implemented it as bots usually run in the background anyway which makes the number of FPS pretty irrelevant, except for the loss of accuracy below a certain point. Oh well, at least I learned to use Inkscape to make the pretty pictures *g*
It can still be useful if you're doing more than pure AI and decision making though. For example your could have a control GUI with a graphical radar that needs to be rendered. I really should write that Knuth quote down and stick it to my monitor. Premature optimization is a bad habit of mine.

**audible83** · 01-18-2010

I would typically see this used in the non-time-critical part of code. Using a seperate thread for database fetching and the like is probably a good thing if you have latency on network etc.

**Cypher** · 01-18-2010

Well presented post, however I too have to agree with amadmonk. I've never had any of my bots/hacks/whatever cause a performance drop which would justify all the extra work of doing a multi-threaded implementation.

I'd be worried that without proper profiling and checks that thread synchronization would actually cause a larger performance hit than the actual gain caused by the multi-threading.

I'll add to amadmonk's "Premature optimization is the root of all evil" (was that Knuth or Hoare? I seem to remember seeing it first as Knuth quoting Hoare in one of his books) with "Measure twice, optimize once".

**Sednogmah** · 01-19-2010

Okay, now that I've optimized once, I have to measure twice *g*

Fortunately it turns out that thread synchronization is pretty cheap.

I whipped up a quick & dirty but hopefully meaningful benchmark to measure both:
1) Raw mutex locking/unlocking
2) Alternating threads, just like in the aforementioned scenario

Bottom line: The synchronization performance hit is on the order of 0.000001 seconds (~1 microsecond) which can be easily compensated with the concurrent non-critical section. Acceptable!

Code:

Clock resolution (process timer): 1 ns

Test 1: single thread mutex lock cycles
  iterations: 124413150
  runtime: 4999 ms
  => 24883535 lock cycles per s
  => 24883 lock cycles per ms

Test 2: two alternating threads
  iterations: 12428330
  runtime: 11055 ms
  => 1124151 thread switches per s
  => 1124 thread switches per ms

The test was run on an Intel Core 2 Duo with 3600 MHz. These numbers should not be overrated but they do give an an impression of the performance magnitude that can be expected.

Test source code (C99, POSIX.1-2001):

Code:

/* Quick & dirty mutex/thread benchmark.
 * Link against the real-time library: -lrt
 */

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <time.h>

uint64_t time_usec() {
	struct timespec t;
	// High-resolution per-process timer from the CPU
	clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &t );
	return ((uint64_t) t.tv_sec) * 1000000 + t.tv_nsec / 1000;
}

uint64_t get_approx_locks_per_sec();
void test1_singlethreaded_cycles();
void test2_alternating_threads();

int main() {
	// Print clock resolution
	struct timespec res;
	clock_getres( CLOCK_PROCESS_CPUTIME_ID, &res );
	printf("Clock resolution (process timer): %ld ns\n\n", res.tv_nsec);

	// Tests
	test1_singlethreaded_cycles();
	test2_alternating_threads();

	return 0;
}

uint64_t get_approx_locks_per_sec() {
	// Rough estimate of locks/second. Why?
	//   I don't want to retrieve the system time in each lock cycle
	//   during the actual benchmark and thus need an approximate
	//   number of necessary iterations.

	pthread_mutex_t m;
	uint64_t t_start, t_end, t;
	uint64_t lockcount;

	pthread_mutex_init(&m, NULL);
	t_start = time_usec();
	t = lockcount = 0;
	while( t < 1000000 ) { // 1 second
		pthread_mutex_lock(&m);
		pthread_mutex_unlock(&m);
		t_end = time_usec();
		t = t_end - t_start;
		lockcount++;
	}

	return (1000000 * lockcount)/t;
}

/**** Test 1 ****/
void test1_singlethreaded_cycles() { 
	uint64_t lockcount;
	const uint64_t iterations = get_approx_locks_per_sec() * 50;
	uint64_t t_start, t_end, t;

	printf("Test 1: single thread mutex lock cycles\n");
	printf("  iterations: %lld\n", iterations);
	
	pthread_mutex_t m;
	pthread_mutex_init( &m, NULL );

	t_start = time_usec();
	for(lockcount = 0; lockcount < iterations; lockcount++) {
		pthread_mutex_lock(&m);
		pthread_mutex_unlock(&m);
	}
	t_end = time_usec();
	t = t_end - t_start;

	printf("  runtime: %lld ms\n", t/1000);
	printf("  => %lld lock cycles per s\n", (1000000 * lockcount) / t);
	printf("  => %lld lock cycles per ms\n", (1000 * lockcount) / t);
	printf("\n");
}

/**** Test 2 ****/
struct switchlock_t { 
	pthread_mutex_t *A, *B;
	uint64_t *count, iterations;
};
void* threadfunc( void* p_slock ) {
	struct switchlock_t* slock = (switchlock_t*) p_slock;
	while( (*slock->count) < slock->iterations ) {
		pthread_mutex_unlock( slock->A );
		(*slock->count)++;
		pthread_mutex_lock( slock->B );
	}
	pthread_mutex_unlock( slock->A );
	pthread_mutex_unlock( slock->B );
	printf("  thread finished...\n");
	pthread_exit(NULL);
	return NULL;
};

void test2_alternating_threads() {
	struct switchlock_t slock1, slock2;
	uint64_t lockcount = 0;
	uint64_t t_start, t_end, t;

	pthread_mutex_t m_A, m_B;
	pthread_mutex_init(&m_A, NULL);
	pthread_mutex_init(&m_B, NULL);

	slock1.A = &m_A;
	slock1.B = &m_B;
	slock1.count = &lockcount;
	slock1.iterations = get_approx_locks_per_sec() * 5;

	slock2 = slock1;
	slock2.A = &m_B;
	slock2.B = &m_A;

	printf("Test 2: two alternating threads\n");
	printf("  iterations: %lld\n", slock1.iterations);

	t_start = time_usec();
	pthread_t thread1, thread2;
	pthread_create( &thread1, NULL, threadfunc, &slock1 );
	pthread_create( &thread2, NULL, threadfunc, &slock2 );
	pthread_join( thread1, NULL );
	pthread_join( thread2, NULL );
	t_end = time_usec();

	// summary
	t = t_end - t_start;
	printf("  runtime: %lld ms\n", t/1000);
	printf("  => %lld thread switches per s\n", 1000000 * lockcount / t);
	printf("  => %lld thread switches per ms\n", 1000 * lockcount / t);
}

No matter if all that work turns out to be useful or not, I learned quite a bit about multithreading and thought I'd share it while I am at it.

**amadmonk** · 01-19-2010

Yes, definitely. A well-optimized synch routine (and IIRC .Net's "lock" semantics aren't much more expensive than a raw Mutex write, when amortized out over a thousand calls or so) shouldn't add much of a performance penalty.

I have to correct something I said earlier, when I mentioned that there were no races in Lua; I was wrong. It turned out that I had built a SynchronizationContext into my Lua invokes and forgotten about it (I'm such a slick programmer that I fooled myself

) -- as a result, my Lua invokes specifically were being done in the render thread.

When I took this synch out, I fairly quickly hit a race condition between the event reporting code (which right now is just a bit of glue Lua that forwards events into my bot) and the off-thread Lua execution. This is precisely for the reason that I suspected; the shared Lua state.

Unfortunately for me, this means that my Lua invokes from out of thread are effectively throttled to FPS, since that's the only time I can be sure that it's "safe" to invoke them.

I suspect that I could remove this limitation if I could (a) create a custom Lua state (looking into it) or (b) find a cleaner way to report events that didn't require glue Lua. Unfortunately, I've spent basically a day looking into (b) and there's just no easy way. I can try a mid-function intercept in the combat message log pump, but that's... fragile.

So, I may be stuck gating my logic to the frame rate, after all.

Dammit.

**XTZGZoReX** · 01-20-2010

Yes, definitely. A well-optimized synch routine (and IIRC .Net's "lock" semantics aren't much more expensive than a raw Mutex write, when amortized out over a thousand calls or so) shouldn't add much of a performance penalty.

Feel free to correct me if I'm wrong here, but: Using a Mutex in .NET seems to be a waste of resources in this case, as it is a cross-process lock, unlike other languages/runtimes. I think what you really want is ReaderWriterLockSlim; it only applies to the current instance of your process. Or, Monitor (which, in basic terms, is what the 'lock' keyword wraps around), if you don't need read/write locking.

**hobbienoobie** · 01-26-2010

I usually just do all my heavy processing off bot... I use the "bot" to read the appropriate materials, implement moves or actions and if I have any heavy lifting, I shunt to another process so I am not frame limited. This of course makes more sense if you are already burdened by running multiple bots at the same time, or are needing a common executioner program (coordinating actions of multiple bots in a team, or have a path server). Plenty of free interprocess or memory sharing apps out there to take advantage of if that's your preference.

Shout-Out

User Tag List

Thread: Improving your bot's performance on multi-core CPUs

Thread Tools

Search Thread

Improving your bot's performance on multi-core CPUs

Similar Threads

How well does your bot perform? (Numbers for different bots / different classes)

[EPIC]Improve your WoW performances

2 ways to greatly improve your performance

Make your bot play like you want it

Improve your fishbot's accuracy! New splash texture color!

OwnedCore Forums

casino

CoreCoins

My OwnedCore

About Us

Casino