Originally Posted by
everdox
ntdll32!NtCreateThreadEx -> wow64 system service dispatcher -> NtQueryInformationProcess with infoclass 26 (ProcessWow64Information, check msdn) to determine if target process is running in wow64 emulator or not. if it is, it just simply returns with STATUS_ACCESS_DENIED.
IsWow64Process() is an alternative. Technically it uses NtQueryInformationProcess (NTQIP) internally. I know you knew this already, but for the OP it's probably a better choice. (And I apologize if I'm underestimating your ability, mixtape. No offence intended.)
[Edit: I misunderstood what you were saying when I first posted this. You were explaining how NtCreateThreadEx works internally but I thought you were explaining how to detect a 64 bit application. Sorry about that.]
otherwise with 64 bit stack it will go on to call the x64 NtCreateThreadEx. so you can patch that jump and it works fine for homebrew, but wow64 binaries change with almost every windows update, so there goes portability, and you would also have to walk the x64 peb from your x86 process. more trouble than it's worth.
so your obvious solution here is to just go with what ccKep already stated. just build a 64 bit application.
No wow64 patching needed. Assemble the stub below with yasm and link it into your application. You could potentially also use inline assembly and 'emit' the byte code, or use a function pointer pointing to a byte array, if you want to complicate it even more.
Might also be possible to do with AsmJit. I haven't tested if you can use its 64 bit assembler from a 32 bit app.
Code:
CreateRemoteThread64:
bits 32
; [set up stack]
db 9Ah ; callf opcode (call far ptr16:32)
dd create_thread_proxy64 ; 32 bit target offset
dw 33h ; 16 bit segment selector 33h - the 64 bit code segment, switches the cpu from compatibility mode to 64 bit mode
; [clean up stack and return]
create_thread_proxy64:
bits 64
; [set up stack]
call ntdll64!NtCreateThreadEx ; or use syscall/sysenter directly.
; [possibly wait for the thread to complete and get the exit code]
; [clean up stack]
bits 32
retf
// Then in your C(++) code it's as simple as calling
exit_code = CreateRemoteThread64(target_handle, thread_start_address, thread_args);
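If you go the byte array route instead, the far call from the listing is just 7 bytes: the 9Ah opcode, a 32 bit little-endian offset, and a 16 bit selector. Here is a rough sketch in C of emitting it into a buffer. The name emit_far_call and the constants are mine, made up for illustration, and the buffer would still have to live in executable memory before you could call into it (not shown).

```c
#include <stddef.h>
#include <stdint.h>

/* Selector of the 64 bit code segment under WOW64 (33h in the listing). */
#define WOW64_CS64_SELECTOR 0x0033u

/* call far ptr16:32 = 1 opcode byte + 4 offset bytes + 2 selector bytes. */
#define FAR_CALL_LEN 7u

/*
 * Hypothetical helper: writes "call far selector:target" into buf.
 * Returns the number of bytes written, or 0 if buf is NULL or too small.
 */
size_t emit_far_call(uint8_t *buf, size_t cap, uint32_t target, uint16_t selector)
{
    if (buf == NULL || cap < FAR_CALL_LEN)
        return 0;

    buf[0] = 0x9A;                              /* callf opcode */
    buf[1] = (uint8_t)(target & 0xFF);          /* offset, little-endian */
    buf[2] = (uint8_t)((target >> 8) & 0xFF);
    buf[3] = (uint8_t)((target >> 16) & 0xFF);
    buf[4] = (uint8_t)((target >> 24) & 0xFF);
    buf[5] = (uint8_t)(selector & 0xFF);        /* selector, little-endian */
    buf[6] = (uint8_t)((selector >> 8) & 0xFF);
    return FAR_CALL_LEN;
}
```

Calling emit_far_call(buf, sizeof buf, proxy_address, WOW64_CS64_SELECTOR) fills buf with the same bytes the db/dd lines in the listing produce, where proxy_address stands for wherever your 64 bit stub ends up.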
I'm not sure about the portability of using syscall. Is the system call numbering the same between windows versions?
You'd also need to write your own GetProcAddress64 to get the address for ntdll64!NtCreateThreadEx if you go that route. The link below has an example of how to get the address of ntdll64 using the 64 bit PEB, and the PE+ format is well documented.
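To give an idea of where a GetProcAddress64 would start, here is a rough sketch in C of the very first step: checking that an image is PE32+ and pulling the export directory RVA out of the optional header. It works on a plain byte buffer, so it says nothing about how you get at the 64 bit module's memory from a 32 bit process (that part needs the 64 bit code). The function name pe64_export_dir_rva is made up for this example, and walking the export tables themselves is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PE32PLUS_MAGIC 0x020Bu  /* optional header magic for PE32+ images */

/* Little-endian reads that do not depend on alignment. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: returns the RVA of the export directory of a
 * PE32+ image held in img, or 0 if the buffer is not a usable PE32+ image.
 */
uint32_t pe64_export_dir_rva(const uint8_t *img, size_t len)
{
    /* DOS header: "MZ" signature, e_lfanew at offset 0x3C. */
    if (img == NULL || len < 0x40 || img[0] != 'M' || img[1] != 'Z')
        return 0;

    uint32_t nt = rd32(img + 0x3C);
    /* Need signature (4) + file header (20) + optional header up to and
       including the first data directory entry (112 + 8). */
    if (nt > len || len - nt < 4 + 20 + 112 + 8)
        return 0;

    const uint8_t *sig = img + nt;
    if (sig[0] != 'P' || sig[1] != 'E' || sig[2] != 0 || sig[3] != 0)
        return 0;

    const uint8_t *opt = sig + 24;          /* start of the optional header */
    if (rd16(opt) != PE32PLUS_MAGIC)
        return 0;                           /* PE32 or something else */

    if (rd32(opt + 108) == 0)               /* NumberOfRvaAndSizes */
        return 0;

    return rd32(opt + 112);                 /* data directory 0: export table RVA */
}
```

From that RVA you would go on to read the export directory's name, ordinal and function tables and look up NtCreateThreadEx by name.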
But yes, a 64 bit application is the way to go unless you are just doing this for the learning experience, or because it's incredibly cool to do everything from a single binary.
Credits to "Heaven's Gate: 64-bit code in 32-bit file" (wasnt nate) for publishing info on the call gate.