Improving your bot's performance on multi-core CPUs

  1. #1
    Sednogmah's Avatar Contributor
    Reputation
    129
    Join Date
    Oct 2009
    Posts
    158
    Thanks G/R
    0/0
    Trade Feedback
    0 (0%)
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Improving your bot's performance on multi-core CPUs

    Improve your bot's performance on multi-core CPUs

    1. Basic idea

    Many injected bots hook a library function such as OpenGL's glClear (or its DirectX equivalent) to execute code after WoW has finished rendering a frame. This is necessary to see a consistent engine state. Unfortunately, it inevitably slows down the whole game, reducing the number of frames per second.

    Fig. 1: WoW rendering frame by frame:


    Fig. 2: Injected code (red) running at the end of each frame:


    The first step is to offload the bot code into a separate thread. This comes with a major problem though: You can't safely access WoW's memory or call engine functions from a concurrent thread. You need some sort of synchronization mechanism.

    Fig. 3: Obviously nothing is gained if you just lock your bot's thread while WoW is processing a frame:


    If your bot has compute-intensive parts, most of them probably do not rely on calling engine functions or accessing memory. This means your code can be split into a critical and a non-critical section: the critical section has to be synchronized with WoW's main thread, while the non-critical section can run concurrently.

    The critical section will contain code that:
    - accesses WoW's memory
    - calls engine functions

    The non-critical sections could be:
    - path finding
    - decision making
    - updating your GUI
    - accessing a database
    - logging
    - file access, e.g. writing compressed screenshots

    Fig. 4: Asynchronous processing. red = critical, green = non-critical. Obvious performance gain:


    2. Mutexes
    Threads can be synchronized with mutexes. I will use the default behaviour of POSIX mutexes for my example which are available on Mac OS X and Linux. Our mutex M has two states: locked and unlocked. We will use two functions to interact with the mutex: mutex_lock(M) and mutex_unlock(M). The behaviour of the two functions depends on the state of M:

    If "M is unlocked":
    - mutex_unlock(M) does nothing
    - mutex_lock(M) locks M

    If "M is locked":
    - mutex_unlock(M) unlocks M
    - mutex_lock(M) blocks the calling thread until it is unlocked

    3. Implementation (Pseudocode)
    Let's assume we have hooked a library function that gets called after every frame. We call it EndFrameHook(). We have also created a separate thread for our bot, running an infinite loop.
    Code:
    // Initialization code
    
    Mutex m_Wow; // locked while WoW is busy processing/rendering a frame
    Mutex m_Bot; // locked while our bot is in its critical section
    
    // Starting our bot's thread
    mutex_lock(m_Wow)
    mutex_lock(m_Bot) // so that the first EndFrameHook() call waits for the bot
    start_bot_thread()
    Code:
    EndFrameHook() { // called after WoW has finished processing a frame
    	mutex_unlock(m_Wow); // It's safe to interact with the engine now
    	mutex_lock(m_Bot);   // Blocks during the bot's critical section
    }
    
    BotThread() {
    	for(ever) {
    		mutex_lock(m_Wow)
    
    		// CRITICAL START
    		accessMemory()
    		callEngineFunctions()
    		// CRITICAL END
    
    		mutex_unlock(m_Bot)
    
    		// NON-CRITICAL section, for example:
    		decisionMaking()
    		pathFinding()
    		accessFiles()
    		database()
    	}
    }
    4. Further notes

    Fig. 5: What if the non-critical section takes more time than a frame?


    As you can see, if the non-critical section takes more time than WoW needs to render a frame, no frames are skipped; instead the main thread's execution is paused until the bot is ready again. If you want frame skipping, you need to wrap the non-critical section with a third mutex and test it with mutex_trylock() in EndFrameHook(). The result would look like this:
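    One possible arrangement of that third mutex, in the same pseudocode as above (a hedged, untested sketch; m_Busy and the BUSY return value are assumptions, not part of the implementation in section 3):
    Code:
    Mutex m_Busy; // locked while the bot is in its non-critical section
    
    EndFrameHook() {
    	if( mutex_trylock(m_Busy) == BUSY )
    		return           // bot still busy: skip this frame's sync
    	mutex_unlock(m_Busy)     // undo the successful trylock
    	mutex_unlock(m_Wow)
    	mutex_lock(m_Bot)
    }
    
    BotThread() {
    	for(ever) {
    		mutex_lock(m_Wow)
    		// CRITICAL section as before
    		mutex_lock(m_Busy)   // mark the non-critical section as running
    		mutex_unlock(m_Bot)
    		// NON-CRITICAL section as before
    		mutex_unlock(m_Busy)
    	}
    }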


    Edit 1: As amadmonk pointed out, concurrent access to engine functions is not _necessarily_ unsafe, but it _can_ be.
    Edit 2: Fixed implementation, had one mutex_unlock() in the wrong place. Updated images. Added notes.
    Last edited by Sednogmah; 01-18-2010 at 04:28 PM.

  2. #2
    namreeb's Avatar Legendary

    You present this information very well. Nice graphics!

  3. #3
    amadmonk's Avatar Active Member
    Clean presentation.

    One thing I'd check, however, is the assumption that you can't access engine functions safely in an unsynchronized fashion. You most assuredly can -- I've had a bot running for 12 hours straight doing roughly 20-40 engine calls per second (some as simple as EnumVisibleObjects, some full Lua stuff) with no evident race conditions. This was surprising to me because I know we use a *global* Lua state object for the Lua engine, but perhaps Lua adds its own synchronization.

    So, bravo on a clear, detailed description of a way to improve bot performance. My only modification would be to remember Knuth's caution: "premature optimization is the root of all evil." Make sure that your optimizations actually FIX a problem (surprisingly many programmers forget that), and make sure that your optimization doesn't introduce *other* problems.
    Don't believe everything you think.

  4. #4
    audible83's Avatar Member
    Easy to understand explanation of the theory behind it. Good work!

    Next up is Cuda ?

  5. #5
    Apoc's Avatar Angry Penguin
    Very well said. However, I have to agree with amadmonk.

    Currently, it's very viable to run a very large amount of code during each frame. In comparison, the amount of code you run should be roughly 1% of what WoW runs per frame. (No, that's not an exaggeration either.) Obviously you shouldn't be doing very 'intensive' things from within the main thread, but you can most definitely run a fairly complex AI system from an EndScene pulse without issue.

    Even if I try very hard, I can only get WoW to lose roughly 5 FPS with our current framework. (Not including deliberately sleeping the main thread of course, or doing any lengthy file reads, etc.)

  6. #6
    Sednogmah's Avatar Contributor
    I fully agree with both of you: it's certainly not necessary most of the time. Thinking about it now, I'm not entirely sure anymore why I even implemented it, as bots usually run in the background anyway, which makes the FPS count pretty irrelevant except for the loss of accuracy below a certain point. Oh well, at least I learned to use Inkscape to make the pretty pictures *g*
    It can still be useful if you're doing more than pure AI and decision making though. For example, you could have a control GUI with a graphical radar that needs to be rendered. I really should write that Knuth quote down and stick it to my monitor. Premature optimization is a bad habit of mine.
    Last edited by Sednogmah; 01-18-2010 at 02:51 PM.

  7. #7
    audible83's Avatar Member
    I would typically see this used in the non-time-critical part of the code. Using a separate thread for database fetching and the like is probably a good thing if you have network latency, etc.

  8. #8
    Cypher's Avatar Kynox's Sister's Pimp
    Well presented post, however I too have to agree with amadmonk. I've never had any of my bots/hacks/whatever cause a performance drop which would justify all the extra work of doing a multi-threaded implementation.

    I'd be worried that without proper profiling and checks that thread synchronization would actually cause a larger performance hit than the actual gain caused by the multi-threading.

    I'll add to amadmonk's "Premature optimization is the root of all evil" (was that Knuth or Hoare? I seem to remember seeing it first as Knuth quoting Hoare in one of his books) with "Measure twice, optimize once".

  9. #9
    Sednogmah's Avatar Contributor
    Okay, now that I've optimized once, I have to measure twice *g*

    Fortunately it turns out that thread synchronization is pretty cheap.

    I whipped up a quick & dirty but hopefully meaningful benchmark to measure both:
    1) Raw mutex locking/unlocking
    2) Alternating threads, just like in the aforementioned scenario

    Bottom line: the synchronization overhead is on the order of 0.000001 seconds (~1 microsecond) per switch, which is easily made up for by running the non-critical section concurrently. Acceptable!

    Code:
    Clock resolution (process timer): 1 ns
    
    Test 1: single thread mutex lock cycles
      iterations: 124413150
      runtime: 4999 ms
      => 24883535 lock cycles per s
      => 24883 lock cycles per ms
    
    Test 2: two alternating threads
      iterations: 12428330
      runtime: 11055 ms
      => 1124151 thread switches per s
      => 1124 thread switches per ms
    The test was run on an Intel Core 2 Duo at 3.6 GHz. These numbers should not be over-interpreted, but they do give an impression of the order of magnitude that can be expected.

    Test source code (C99, POSIX.1-2001):
    Code:
    /* Quick & dirty mutex/thread benchmark.
     * Link against the real-time library: -lrt
     */
    
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <time.h>
    
    uint64_t time_usec() {
    	struct timespec t;
    	// High-resolution per-process timer from the CPU
    	clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &t );
    	return ((uint64_t) t.tv_sec) * 1000000 + t.tv_nsec / 1000;
    }
    
    uint64_t get_approx_locks_per_sec();
    void test1_singlethreaded_cycles();
    void test2_alternating_threads();
    
    int main() {
    	// Print clock resolution
    	struct timespec res;
    	clock_getres( CLOCK_PROCESS_CPUTIME_ID, &res );
    	printf("Clock resolution (process timer): %ld ns\n\n", res.tv_nsec);
    
    	// Tests
    	test1_singlethreaded_cycles();
    	test2_alternating_threads();
    
    	return 0;
    }
    
    uint64_t get_approx_locks_per_sec() {
    	// Rough estimate of locks/second. Why?
    	//   I don't want to retrieve the system time in each lock cycle
    	//   during the actual benchmark and thus need an approximate
    	//   number of necessary iterations.
    
    	pthread_mutex_t m;
    	uint64_t t_start, t_end, t;
    	uint64_t lockcount;
    
    	pthread_mutex_init(&m, NULL);
    	t_start = time_usec();
    	t = lockcount = 0;
    	while( t < 1000000 ) { // 1 second
    		pthread_mutex_lock(&m);
    		pthread_mutex_unlock(&m);
    		t_end = time_usec();
    		t = t_end - t_start;
    		lockcount++;
    	}
    
    	return (1000000 * lockcount)/t;
    }
    
    /**** Test 1 ****/
    void test1_singlethreaded_cycles() { 
    	uint64_t lockcount;
    	const uint64_t iterations = get_approx_locks_per_sec() * 50;
    	uint64_t t_start, t_end, t;
    
    	printf("Test 1: single thread mutex lock cycles\n");
    	printf("  iterations: %lld\n", iterations);
    	
    	pthread_mutex_t m;
    	pthread_mutex_init( &m, NULL );
    
    	t_start = time_usec();
    	for(lockcount = 0; lockcount < iterations; lockcount++) {
    		pthread_mutex_lock(&m);
    		pthread_mutex_unlock(&m);
    	}
    	t_end = time_usec();
    	t = t_end - t_start;
    
    	printf("  runtime: %lld ms\n", t/1000);
    	printf("  => %lld lock cycles per s\n", (1000000 * lockcount) / t);
    	printf("  => %lld lock cycles per ms\n", (1000 * lockcount) / t);
    	printf("\n");
    }
    
    /**** Test 2 ****/
    struct switchlock_t { 
    	pthread_mutex_t *A, *B;
    	uint64_t *count, iterations;
    };
    void* threadfunc( void* p_slock ) {
    	struct switchlock_t* slock = (struct switchlock_t*) p_slock;
    	while( (*slock->count) < slock->iterations ) {
    		pthread_mutex_unlock( slock->A );
    		(*slock->count)++;
    		pthread_mutex_lock( slock->B );
    	}
    	pthread_mutex_unlock( slock->A );
    	pthread_mutex_unlock( slock->B );
    	printf("  thread finished...\n");
    	pthread_exit(NULL);
    	return NULL;
    }
    
    void test2_alternating_threads() {
    	struct switchlock_t slock1, slock2;
    	uint64_t lockcount = 0;
    	uint64_t t_start, t_end, t;
    
    	pthread_mutex_t m_A, m_B;
    	pthread_mutex_init(&m_A, NULL);
    	pthread_mutex_init(&m_B, NULL);
    
    	slock1.A = &m_A;
    	slock1.B = &m_B;
    	slock1.count = &lockcount;
    	slock1.iterations = get_approx_locks_per_sec() * 5;
    
    	slock2 = slock1;
    	slock2.A = &m_B;
    	slock2.B = &m_A;
    
    	printf("Test 2: two alternating threads\n");
    	printf("  iterations: %lld\n", slock1.iterations);
    
    	t_start = time_usec();
    	pthread_t thread1, thread2;
    	pthread_create( &thread1, NULL, threadfunc, &slock1 );
    	pthread_create( &thread2, NULL, threadfunc, &slock2 );
    	pthread_join( thread1, NULL );
    	pthread_join( thread2, NULL );
    	t_end = time_usec();
    
    	// summary
    	t = t_end - t_start;
    	printf("  runtime: %lld ms\n", t/1000);
    	printf("  => %lld thread switches per s\n", 1000000 * lockcount / t);
    	printf("  => %lld thread switches per ms\n", 1000 * lockcount / t);
    }
    Whether or not all that work turns out to be useful, I learned quite a bit about multithreading and thought I'd share it while I'm at it.
    Last edited by Sednogmah; 01-19-2010 at 06:14 AM.

  10. #10
    amadmonk's Avatar Active Member
    Yes, definitely. A well-optimized synch routine (and IIRC .Net's "lock" semantics aren't much more expensive than a raw Mutex write, when amortized out over a thousand calls or so) shouldn't add much of a performance penalty.

    I have to correct something I said earlier, when I mentioned that there were no races in Lua; I was wrong. It turned out that I had built a SynchronizationContext into my Lua invokes and forgotten about it (I'm such a slick programmer that I fooled myself ) -- as a result, my Lua invokes specifically were being done in the render thread.

    When I took this synch out, I fairly quickly hit a race condition between the event reporting code (which right now is just a bit of glue Lua that forwards events into my bot) and the off-thread Lua execution. This is precisely the reason I suspected: the shared Lua state.

    Unfortunately for me, this means that my Lua invokes from out of thread are effectively throttled to FPS, since that's the only time I can be sure that it's "safe" to invoke them.

    I suspect that I could remove this limitation if I could (a) create a custom Lua state (looking into it) or (b) find a cleaner way to report events that didn't require glue Lua. Unfortunately, I've spent basically a day looking into (b) and there's just no easy way. I can try a mid-function intercept in the combat message log pump, but that's... fragile.

    So, I may be stuck gating my logic to the frame rate, after all.

    Dammit.
    Don't believe everything you think.

  11. #11
    XTZGZoReX's Avatar Active Member
    Quote (amadmonk): "A well-optimized synch routine (and IIRC .Net's "lock" semantics aren't much more expensive than a raw Mutex write, when amortized out over a thousand calls or so) shouldn't add much of a performance penalty."
    Feel free to correct me if I'm wrong here, but using a Mutex in .NET seems to be a waste of resources in this case, as it is a cross-process lock, unlike the in-process locks of other languages/runtimes. I think what you really want is ReaderWriterLockSlim, which only applies to the current instance of your process -- or Monitor (which, in basic terms, is what the 'lock' keyword wraps around) if you don't need read/write locking.
    Last edited by XTZGZoReX; 01-20-2010 at 12:41 AM.

  12. #12
    hobbienoobie's Avatar Member
    I usually just do all my heavy processing off-bot. I use the "bot" to read the appropriate materials and implement moves or actions, and if I have any heavy lifting, I shunt it to another process so I am not frame-limited. This of course makes more sense if you are already burdened by running multiple bots at the same time, or need a common executioner program (coordinating the actions of multiple bots in a team, or a path server). There are plenty of free interprocess-communication and memory-sharing libraries out there to take advantage of, if that's your preference.
