My earlier post explained “how” you can write a multiboot kernel using VC++. This post will explain “why” I wrote the kernel the way I did.
- The linker (link.exe) puts machine code of functions in the source text into the .text section in the order it finds them in the source.
- The PE headers together never exceed 4K in size. So when the file alignment is 4K (== memory alignment), the .text section is guaranteed to start at offset 4K (4096) in the PE file. This is a good place to put the multiboot header (which in any case needs to be present in its entirety within the first 8192 bytes).
- The multiboot header forms the first 48 bytes of the .text section
- __declspec(naked) is an attribute that makes the compiler generate code _WITHOUT_ a prolog or epilog. This is important because we want the multiboot header to start at offset 4096. Without the naked attribute, the function (and hence the .text section) would start with the bytes 55 8B EC, which stand for the following instructions
55 push ebp
8B EC mov ebp, esp
which is the prolog. Because of this, the multiboot header would be pushed to offset 4099, and Grub would refuse to load the kernel because the multiboot header isn’t longword aligned. [Updated: 10/6/2005, 12:23 PM].
- Compiler switches
  - /Gd: forces the use of the __cdecl calling convention. I’m not really sure why I included this.
  - /Fm: names the map file, which might be useful when the kernel gets large
  - /TC: compile the file as C source
  - /c: compile only, no link
- Linker switches
  - /safeseh:no: disables generating the symbols related to Safe SEH handlers (__safe_se_handler_table and __safe_se_handler_count)
  - /filealign:0x1000: this is an undocumented switch that aligns sections in the file based on this value. I’ve set it to 0x1000 (4K) so that sections are aligned on a 4K boundary in the image file as well (this is the default in-memory alignment). This is required because Grub doesn’t seem to load images whose file alignment is different from the in-memory alignment.
  - /base:0x100000: this makes the linker generate code assuming that the .text section starts at physical address 0x101000 and the .data section at physical address 0x102000. This is what we want because Grub actually loads the image at 0x100000, which forces the .text and .data sections into these addresses automatically, so we need not relocate the kernel. (Now you probably understand why we set the file alignment to 4096 bytes as well.)
  - /entry:__multiboot_entry__: sets the entry point
  - /nodefaultlib:libc: forces the linker to ignore libc while resolving external references. The idea is to be able to use names like memcpy in the kernel and make sure libc’s functions don’t get linked in.
  - /subsystem:console: this sets a field in one of the headers that tells Windows which subsystem to use to execute this application. This doesn’t make any sense here, but I guess I included it to keep the linker happy.
  - /out:kernel.exe: the name of the kernel
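The alignment argument in the list above comes down to a little arithmetic, which can be sketched in C. This is not linker code: the function names are made up for illustration, and the header size of 0x400 and the alternative file alignment of 0x200 are assumed values. Only the 0x1000 alignment and the 0x100000 base come from the post, and the "copy the file flat to the base address" model is the post's own simplification.

```c
#include <stdint.h>

/* Round value up to the next multiple of alignment (alignment must be a power of two). */
static uint32_t align_up(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

/* File offset of the first section: the headers rounded up to the file alignment. */
static uint32_t first_section_file_offset(uint32_t headers_size, uint32_t file_align)
{
    return align_up(headers_size, file_align);
}

/* Address the linker assumes for the first section, given the image base
   and the in-memory section alignment. */
static uint32_t first_section_address(uint32_t image_base, uint32_t headers_size,
                                      uint32_t section_align)
{
    return image_base + align_up(headers_size, section_align);
}

/* Where a byte at file_offset lands if the file is copied flat to load_base. */
static uint32_t flat_load_address(uint32_t load_base, uint32_t file_offset)
{
    return load_base + file_offset;
}
```

With both alignments at 0x1000, the flat-loaded address of the first section and the address the linker assumed are both 0x101000. With a file alignment of 0x200 the section would sit at file offset 0x400 and land at 0x100400, which is not where the code expects to run.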
Aspiring operating system developers who target x86 often don’t get beyond writing a boot sector (seldom do they even complete it) because of the inordinate amount of time needed to understand the “tricks” required to get the processor into a “sane” mode of operation before the kernel can start executing. That’s why newbie kernel developers are always advised to use an alternative like Grub to bootstrap their kernel, so that they can concentrate on implementing the kernel itself rather than the plumbing. Why Grub? Because it is one of the bootloaders that implements the “Multiboot” specification (correctly?).
This specification details the steps that OS / bootloader developers need to follow in order to be compatible with (and usable by) each other. In very simple terms, multiboot compliant operating systems need to have a 48 byte structure called the Multiboot header (in its entirety) somewhere within the first 8192 bytes of the kernel image, longword aligned. [Updated: 10/6/2005, 12:23 PM]. Actual details about the fields are documented in the Multiboot specification here.
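To make the "within the first 8192 bytes, longword aligned" rule concrete, here is a minimal sketch of how a loader might search an image for the header. It is not Grub's code. The function and macro names are hypothetical, and it checks only the three mandatory fields (magic, flags, checksum), whose 32-bit sum must be zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB_MAGIC        0x1BADB002u
#define MB_SEARCH_LIMIT 8192u  /* the header must lie entirely within these bytes */
#define MB_MIN_HEADER   12u    /* magic, flags, checksum */

/* Read and write little-endian 32-bit values in a byte buffer. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Return the file offset of the first valid header, or -1 if none is found.
   Only 4-byte aligned offsets are considered. */
static long find_multiboot_header(const unsigned char *image, size_t len)
{
    size_t limit = len < MB_SEARCH_LIMIT ? len : MB_SEARCH_LIMIT;
    size_t off;

    for (off = 0; off + MB_MIN_HEADER <= limit; off += 4) {
        uint32_t magic    = read_le32(image + off);
        uint32_t flags    = read_le32(image + off + 4);
        uint32_t checksum = read_le32(image + off + 8);
        if (magic == MB_MAGIC && (uint32_t)(magic + flags + checksum) == 0)
            return (long)off;
    }
    return -1;
}

/* Hypothetical helper: build a zeroed image with a minimal header placed
   at header_offset, then scan it. */
static long demo_scan(size_t header_offset)
{
    static unsigned char image[MB_SEARCH_LIMIT + 64];
    uint32_t flags = 1u << 16;

    memset(image, 0, sizeof image);
    if (header_offset + MB_MIN_HEADER > sizeof image)
        return -1;
    put_le32(image + header_offset, MB_MAGIC);
    put_le32(image + header_offset + 4, flags);
    put_le32(image + header_offset + 8, 0u - (MB_MAGIC + flags));
    return find_multiboot_header(image, sizeof image);
}
```

In this sketch a header at offset 4096 is found, while one at 4099 (misaligned) or at 8188 (spilling past the 8192 byte limit) is not.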
Every other “roll your own OS” tutorial invariably talks about how you can make your kernel bootable by Grub. But all of these assume that you are using the GCC toolset. If you are from a Windows background, you are out of luck. The GCC toolset itself is not very difficult to learn, but I’m sure you’d feel more at home using the tools you’ve been familiar with for a long time. At least I do, and that’s why I set out to write this post about how you can make Grub boot your VC++ kernel.
Making a boot loader like Grub boot a custom kernel is easy (at least compared to the effort it takes to create a new boot loader). The kernel itself is only a binary in some file format (AOUT, ELF, PE etc.). For example, the Windows kernel (%WINDRIVE%\Windows\System32\NTOSKRNL.EXE) uses the PE file format (try dumpbin /ALL %WINDRIVE%\Windows\System32\NTOSKRNL.EXE), which is also used by user mode programs under Windows. Similarly, the Linux kernel probably gets compiled into the ELF file format. Now, expecting a bootloader to “know” all executable file formats is probably not a good idea. The multiboot specification takes a different approach to loading a kernel image into RAM. It uses fields in the multiboot header to denote the parts of the kernel image that need to be loaded. Grub “knows” how to load an ELF binary, not a PE. So we are going to give it “hints” in our multiboot header that will help it load the kernel image properly. Time for some code…
/* kernel.h */
#ifndef __kernel_h__
#define __kernel_h__

/* Emit a 32-bit value into the code stream, least significant byte first */
#define dd(x) \
    __asm _emit (x) & 0xff \
    __asm _emit (x) >> 8 & 0xff \
    __asm _emit (x) >> 16 & 0xff \
    __asm _emit (x) >> 24 & 0xff

#define KERNEL_STACK 0x00103fff
#define KERNEL_START 0x00101000
#define KERNEL_LENGTH 0x0000200F

void main(unsigned long, unsigned long);

#endif

/* kernel.c */
#include "kernel.h"

__declspec(naked) void __multiboot_entry__(void)
{
    __asm {
    multiboot_header:
        dd(0x1BADB002) ; magic
        dd(1 << 16) ; flags
        dd(-(0x1BADB002 + (1 << 16))) ; checksum
        dd(0x00101000) ; header_addr
        dd(0x00101000) ; load_addr
        dd(0x0010200F) ; load_end_addr
        dd(0x0010200F) ; bss_end_addr
        dd(0x00101030) ; entry_addr
        dd(0x00000000) ; mode_type
        dd(0x00000000) ; width
        dd(0x00000000) ; height
        dd(0x00000000) ; depth
    kernel_entry:
        mov esp, KERNEL_STACK
        xor ecx, ecx
        push ecx
        popf
        push ebx ; addr: second argument, pushed first under __cdecl
        push eax ; magic: first argument
        call main
        jmp $
    }
}

void main(unsigned long magic, unsigned long addr)
{
    char *string = "Hello World!", *ch;
    unsigned short *vidmem = (unsigned short *) 0xB8000;
    int i;

    /* Write each character with attribute 0x07 into the text-mode buffer */
    for(ch = string, i = 0; *ch; ch++, i++)
        vidmem[i] = (unsigned char) *ch | 0x0700;
}

The first field in the header is a magic number that the bootloader uses to locate the start of the multiboot header in the image. The second field denotes the features that the OS expects from the boot loader. To keep the code simple, I’ve ignored bits 0-15 (about which you can read in the multiboot specification). I’ve set bit 16 of this field. This means that the address fields at offsets 12-28 in the Multiboot header (header_addr through entry_addr) are valid, and the boot loader should use them instead of the fields in the actual executable header to calculate where to load the OS image. This mechanism enables the bootloader to load kernel images whose format it does not understand “natively”.
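The dd macro in the listing emits each 32-bit value one byte at a time, least significant byte first. The same layout can be reproduced in portable C to see exactly which 48 bytes end up at the start of .text. This is only an illustrative sketch: emit_dd, mb_checksum, build_post_header and demo_header are hypothetical names, while the field values are the ones used in the post.

```c
#include <stddef.h>
#include <stdint.h>

#define MB_MAGIC            0x1BADB002u
#define MB_FLAG_ADDR_FIELDS (1u << 16)

/* Store x at buf[pos..pos+3] in little-endian order and return the new position. */
static size_t emit_dd(unsigned char *buf, size_t pos, uint32_t x)
{
    buf[pos + 0] = (unsigned char)(x & 0xff);
    buf[pos + 1] = (unsigned char)((x >> 8) & 0xff);
    buf[pos + 2] = (unsigned char)((x >> 16) & 0xff);
    buf[pos + 3] = (unsigned char)((x >> 24) & 0xff);
    return pos + 4;
}

/* The checksum makes magic + flags + checksum wrap around to zero. */
static uint32_t mb_checksum(uint32_t flags)
{
    return 0u - (MB_MAGIC + flags);
}

/* Fill buf (at least 48 bytes) with the header from the post; returns bytes written. */
static size_t build_post_header(unsigned char *buf)
{
    uint32_t flags = MB_FLAG_ADDR_FIELDS;
    size_t pos = 0;

    pos = emit_dd(buf, pos, MB_MAGIC);
    pos = emit_dd(buf, pos, flags);
    pos = emit_dd(buf, pos, mb_checksum(flags));
    pos = emit_dd(buf, pos, 0x00101000u); /* header_addr   */
    pos = emit_dd(buf, pos, 0x00101000u); /* load_addr     */
    pos = emit_dd(buf, pos, 0x0010200Fu); /* load_end_addr */
    pos = emit_dd(buf, pos, 0x0010200Fu); /* bss_end_addr  */
    pos = emit_dd(buf, pos, 0x00101030u); /* entry_addr    */
    pos = emit_dd(buf, pos, 0u);          /* mode_type     */
    pos = emit_dd(buf, pos, 0u);          /* width         */
    pos = emit_dd(buf, pos, 0u);          /* height        */
    pos = emit_dd(buf, pos, 0u);          /* depth         */
    return pos;
}

/* Scratch buffer for trying the builder out. */
static unsigned char demo_header[48];
```

Calling build_post_header(demo_header) writes 48 bytes, which is why entry_addr in the post is 0x00101030: the first instruction follows immediately after the header.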
Before examining what these address fields mean, let’s take a look at the PE file format.
A PE image starts with a couple of standard headers (DOS / PE / File / Optional). Following these is a set of headers called the section headers that contain information about the different sections in the image. (For a more verbose explanation of the PE file format, read Matt Pietrek’s article.) A section typically contains either code or data. The above kernel, if compiled with the following switches
cl /Gd /Fokernel.obj /Fm /TC /c kernel.c
link /safeseh:no /filealign:0x1000 /BASE:0x100000 /MAP:kernel.map /ENTRY:__multiboot_entry__ kernel.obj /NODEFAULTLIB:LIBC /SUBSYSTEM:CONSOLE /OUT:kernel.exe
produces a .EXE with two sections named .text and .data. Sections are aligned on a 4K boundary using the undocumented linker switch /filealign:0x1000.
Armed with this information about the PE file format, let’s examine the checksum at offset 8 and the address fields at offsets 12-28 in the multiboot header.
dd(0x1BADB002) ; magic
dd(1 << 16) ; flags
dd(-(0x1BADB002 + (1 << 16))) ; checksum
dd(0x00101000) ; header_addr
dd(0x00101000) ; load_addr
dd(0x0010200F) ; load_end_addr
dd(0x0010200F) ; bss_end_addr
dd(0x00101030) ; entry_addr
dd(0x00000000) ; mode_type
dd(0x00000000) ; width
dd(0x00000000) ; height
dd(0x00000000) ; depth

The field at offset 8, checksum, needs to be set to -(magic + flags). Grub loads the .text section of the kernel at physical address 0x100000 (1 MB) + offset by default. The offset is specified indirectly using the header_addr and load_addr fields. According to the specification, header_addr “Contains the address corresponding to the beginning of the Multiboot header”. IMHO, this is a bit confusing. What it really means is: if the image file is loaded at 0x100000, header_addr is the physical address at which the multiboot header starts. The next field, load_addr, contains the physical address of the beginning of the .text section. (In our case both are the same because the multiboot header is the first 48 bytes of the .text section.)

The next field, load_end_addr, is used to determine how many bytes of the image file actually need to be loaded. (Note that the .text and .data sections need to be successive in the image for this to work.) In our case, 0x102000 is where the .data section starts and it has a size of 0xF bytes, hence the value 0x10200F for load_end_addr. Grub now knows it needs to load 0x10200F - 0x101000 bytes. The next field, bss_end_addr, needs to be set to 0 according to the multiboot specification if a bss section doesn’t exist (as in our case). However, Grub refuses to load the image if bss_end_addr is set to 0, so I set it to 0x10200F (same as the previous field). The rest of the code is perhaps obvious and hence doesn’t deserve an explanation.
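The arithmetic a loader performs on these address fields can be written down in a few lines. The sketch below follows my reading of the Multiboot specification (loading starts at the file offset where the header was found, minus header_addr - load_addr). It is not taken from Grub's source, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* The address fields that become valid when bit 16 of flags is set. */
struct mb_addr_fields {
    uint32_t header_addr;
    uint32_t load_addr;
    uint32_t load_end_addr;
    uint32_t bss_end_addr;
    uint32_t entry_addr;
};

/* Number of bytes to copy from the image file into memory. */
static uint32_t mb_load_size(const struct mb_addr_fields *f)
{
    return f->load_end_addr - f->load_addr;
}

/* File offset at which loading starts, given where the header was found in the file. */
static uint32_t mb_load_file_offset(const struct mb_addr_fields *f,
                                    uint32_t header_file_offset)
{
    return header_file_offset - (f->header_addr - f->load_addr);
}

/* Number of bytes to zero after the loaded data (0 when there is no bss). */
static uint32_t mb_bss_size(const struct mb_addr_fields *f)
{
    return f->bss_end_addr > f->load_end_addr
         ? f->bss_end_addr - f->load_end_addr
         : 0;
}

/* Values used by the kernel in the post. */
static const struct mb_addr_fields post_kernel = {
    0x00101000u, /* header_addr   */
    0x00101000u, /* load_addr     */
    0x0010200Fu, /* load_end_addr */
    0x0010200Fu, /* bss_end_addr  */
    0x00101030u  /* entry_addr    */
};
```

For the post's values this gives 0x100F bytes to load, starting at file offset 4096 (the header is the first thing in .text), with no bss to clear and the entry point 48 bytes past load_addr.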
Our multiboot compliant PE kernel is now ready :)
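The kernel's main prints by writing 16-bit cells into the text-mode buffer at 0xB8000: the low byte of each cell is the character and the high byte is the attribute (0x07 in the post). The sketch below isolates that encoding so it can be tried on a normal machine. The names are hypothetical, and fake_vidmem is an ordinary array standing in for video memory, assuming the usual 80x25 text mode.

```c
#include <stddef.h>
#include <stdint.h>

#define VGA_COLUMNS 80
#define VGA_ROWS    25

/* Combine a character and an attribute byte into one text-mode cell. */
static uint16_t vga_cell(char ch, uint8_t attr)
{
    return (uint16_t)((uint8_t)ch | ((uint16_t)attr << 8));
}

/* Write s starting at the top-left cell, stopping at the end of the screen.
   Returns the number of cells written. */
static size_t vga_puts(uint16_t *vidmem, const char *s, uint8_t attr)
{
    size_t i;

    for (i = 0; s[i] != '\0' && i < (size_t)VGA_COLUMNS * VGA_ROWS; i++)
        vidmem[i] = vga_cell(s[i], attr);
    return i;
}

/* Stand-in for the real buffer at physical address 0xB8000. */
static uint16_t fake_vidmem[VGA_COLUMNS * VGA_ROWS];
```

In the kernel itself the first argument would be (uint16_t *)0xB8000 rather than the stand-in array.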
- All threads running managed code are suspended (after bringing them to a “GC safe” place)
- One or more generations are condemned
- Liveness trace is used to distinguish live from dead objects in
  - Gen 0 alone for ephemeral collection
  - Gen 0 + 1 for full collection
- If ephemeral collection (copying collection)
  - Live ephemeral objects are promoted into the elder generation by copying
  - Live objects are located using a recursive scan
    - Elder generation and large-object heap are scanned first for references pointing into ephemeral generation. Such objects are marked as live (card table)
    - The stack of each managed thread is traced to find roots (Interior pointers are also traced)
    - Handle table is traced for references to objects in the ephemeral generation
    - Finalization queue is traced for references to objects in the ephemeral generation
  - Live objects are copied into the elder generation
  - References to copied objects are updated to reflect their new locations
- If full collection (ephemeral + mark and sweep collection (Gen 1))
  - Ephemeral collection is done first
  - Live objects in Gen 1 are traced
    - Cross generational references (i.e. the Gen 0 objects that are pointed to by a Gen 1 object reference) are not visited because they already have their mark bit set (as a result of ephemeral collection)
    - Stack references are traced
    - Handle table references are traced
    - Finalization queue is scanned for references to objects in Gen 1
  - Sweep of the elder generation for dead objects
    - Mark and pin bits of live objects are cleared
    - Dead objects are linked together (if contiguous) into a free list *
    - Brick table is cleared to reflect the disappearance of dead objects
- Weak reference fix up and building finalizer queue follow
* There is no compaction. Dead objects that lie next to each other are treated as one dead zone in the free list.
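The sweep step and its footnote (adjacent dead objects become one dead zone) can be illustrated with a toy model. This is only a sketch of the idea, not the collector's actual data structures: objects are assumed to sit back to back in one heap segment, each with a size and a mark bit, and all names here are hypothetical.

```c
#include <stddef.h>

struct heap_object {
    size_t size;   /* object size in bytes */
    int    marked; /* set by the trace phase for live objects */
};

struct free_zone {
    size_t offset; /* start of the dead zone within the segment */
    size_t size;   /* total bytes of the adjacent dead objects */
};

/* Sweep n objects: clear the mark bit of live objects and record runs of
   adjacent dead objects as single free zones. Returns the number of zones
   recorded. Dead runs beyond max_zones are silently dropped in this sketch. */
static size_t sweep(struct heap_object *objs, size_t n,
                    struct free_zone *zones, size_t max_zones)
{
    size_t offset = 0, count = 0, i;
    int in_dead_run = 0;

    for (i = 0; i < n; i++) {
        if (objs[i].marked) {
            objs[i].marked = 0;                   /* live: reset for the next GC */
            in_dead_run = 0;
        } else if (in_dead_run) {
            zones[count - 1].size += objs[i].size; /* extend the current zone */
        } else if (count < max_zones) {
            zones[count].offset = offset;          /* start a new zone */
            zones[count].size = objs[i].size;
            count++;
            in_dead_run = 1;
        }
        offset += objs[i].size;
    }
    return count;
}

/* Hypothetical demo: live, dead, dead, live, dead. */
static struct free_zone demo_zones[4];

static size_t demo_sweep(void)
{
    struct heap_object objs[5] = {
        { 16, 1 }, { 24, 0 }, { 8, 0 }, { 32, 1 }, { 8, 0 }
    };
    return sweep(objs, 5, demo_zones, 4);
}
```

In the demo, the two neighbouring dead objects (24 and 8 bytes) merge into one 32 byte zone at offset 16, and the trailing dead object forms a second zone at offset 80. Nothing is moved, which matches the "no compaction" footnote.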
Since my previous post was about code generation at runtime, I would like to link to this article that explains why ATL thunks are necessary and how they work.
And as somebody rightly points out, ATL thunks are the worst affected by DEP, as they don’t do it the right way.
In this post, we’ll take a look at how to emit x86 code at runtime and call into it or in other words implement the primary functionality of a JIT compiler. I gained a fair amount of insight into how JIT compilers work when I implemented a rudimentary JIT compiler for Smoke – a virtual machine for dynamic languages. The JIT however is not a part of the official source snapshot. I implemented it in my own tree that I derived from the original source.
There is a lot of complexity involved in implementing a real JIT compiler. In this post, I will only demonstrate how x86 code can be generated at runtime and some of the issues to be kept in mind while doing so.
I wrote a very simple program that generates, at runtime, an add method that takes two integers and returns their sum. Let us first take a look at the main function of this program:
int __cdecl wmain(int argc, wchar_t *argv[])
{
    int a, b, sum;
    int (*add_func)(int, int);

    __try {
        if(!init_jit_compiler()) {
            fwprintf(stderr,L"Unable to initialize JIT compiler\n");
            __leave;
        } else {
            add_func = jit_compile_add_function();
            if(add_func) {
                fwprintf(stdout,L"Enter first number : ");
                scanf("%d",&a);
                fwprintf(stdout,L"Enter second number : ");
                scanf("%d",&b);
                sum = add_func(a,b);
                fwprintf(stdout,L"The sum of %d and %d is %d\n", a, b, sum);
            } else {
                fwprintf(stderr,L"Unable to JIT compile add function\n");
                __leave;
            }
        }
    } __finally {
        uninit_jit_compiler();
    }
    return 0;
}
add_func is a pointer to a function that takes two integers and returns an integer. This variable will be used to hold the pointer to an add function that we generate at runtime. We’ll then pass the two input values to the function for it to produce the sum.
This function if it was written in C would look like this
int __cdecl add(int a, int b)
{
    return a+b;
}
Having defined the prototype, we first need to write this function in assembly so that we know what to generate at runtime. This function written in assembly (inline) would look like this
__declspec(naked) int __cdecl add(int a, int b)
{
    __asm {
        push ebp
        mov ebp, esp
        mov eax, [ebp+0x8]
        add eax, [ebp+0xc]
        pop ebp
        ret
    }
}
The first two lines of assembly code represent the function prolog and the last two lines the function epilog. The actual logic is contained in line 3 and 4. By convention the return value is moved into the EAX register. Line 3 fetches the first argument from the stack and puts it into EAX. Line 4 adds the second argument on the stack to the contents of EAX register and places the sum in the EAX register.
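Why the arguments sit at [ebp+0x8] and [ebp+0xc] follows from what is on the stack once the prolog has run: the saved EBP at [ebp], the return address at [ebp+4], then the arguments in order. The sketch below simulates that frame with an ordinary array in portable C. It is only a model of the 32-bit __cdecl layout, the function name is hypothetical, and the return address is a made-up value.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Simulate the caller's pushes, the prolog, the two body instructions and
   the epilog of the add function. Returns the value left in "EAX". */
static uint32_t simulate_cdecl_add(uint32_t a, uint32_t b)
{
    uint32_t stack[STACK_WORDS] = { 0 };
    uint32_t esp = STACK_WORDS * 4; /* byte offset; the stack grows downward */
    uint32_t ebp = 0xDEADBEEFu;     /* the caller's EBP (arbitrary value) */
    uint32_t eax;

    /* Caller: push the arguments right to left, then CALL pushes the return address. */
    esp -= 4; stack[esp / 4] = b;
    esp -= 4; stack[esp / 4] = a;
    esp -= 4; stack[esp / 4] = 0x00401000u; /* made-up return address */

    /* Prolog: push ebp ; mov ebp, esp */
    esp -= 4; stack[esp / 4] = ebp;
    ebp = esp;

    /* Body: mov eax, [ebp+0x8] ; add eax, [ebp+0xc] */
    eax  = stack[(ebp + 0x8) / 4];
    eax += stack[(ebp + 0xc) / 4];

    /* Epilog: pop ebp (RET would then pop the return address) */
    ebp = stack[esp / 4];
    esp += 4;
    (void)ebp;

    return eax;
}
```

Calling simulate_cdecl_add(2, 3) walks through the same steps as the six assembly lines above and yields 5.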
The next step is to translate this into machine code at runtime. The memory used to hold this code is initialized by init_jit_compiler and is released by uninit_jit_compiler. The actual code generation is done by a routine called jit_compile_add_function that returns a pointer to the function’s entry point. This pointer is used to call the function emitted, at runtime.
This function is written as follows
/*
55          push ebp
8B EC       mov ebp,esp
8B 45 08    mov eax,dword ptr [a]
03 45 0C    add eax,dword ptr [b]
5D          pop ebp
C3          ret
*/
void* jit_compile_add_function(void)
{
    /* unsigned char so that byte values above 0x7f are stored without truncation warnings */
    unsigned char *_jit_heap = (unsigned char *)jit_heap;

    if(_jit_heap) {
        /* push ebp */
        _jit_heap[0] = 0x55;
        /* mov ebp, esp */
        _jit_heap[1] = 0x8b;
        _jit_heap[2] = 0xec;
        /* mov eax, [ebp+0x8] */
        _jit_heap[3] = 0x8b;
        _jit_heap[4] = 0x45;
        _jit_heap[5] = 0x08;
        /* add eax, [ebp+0xc] */
        _jit_heap[6] = 0x03;
        _jit_heap[7] = 0x45;
        _jit_heap[8] = 0x0c;
        /* pop ebp */
        _jit_heap[9] = 0x5d;
        /* ret */
        _jit_heap[10] = 0xc3;
    }
    return _jit_heap;
}
jit_heap is the global variable that holds the pointer to memory obtained from malloc during initialization.
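Hand-numbered indices like _jit_heap[0] through _jit_heap[10] get tedious once more than a few instructions are emitted. One common alternative is a small emitter that tracks the write position and refuses to run past the end of the buffer. The sketch below is an assumption about how that could look, not part of the original program. It only fills a byte buffer and does not execute anything, and all names are hypothetical.

```c
#include <stddef.h>

struct emitter {
    unsigned char *buf;
    size_t capacity;
    size_t length;   /* number of bytes emitted so far */
    int overflow;    /* set once an emit would not fit */
};

static void emit_init(struct emitter *e, unsigned char *buf, size_t capacity)
{
    e->buf = buf;
    e->capacity = capacity;
    e->length = 0;
    e->overflow = 0;
}

/* Append one byte, or record an overflow if the buffer is full. */
static void emit8(struct emitter *e, unsigned char byte)
{
    if (e->length < e->capacity)
        e->buf[e->length++] = byte;
    else
        e->overflow = 1;
}

/* Emit the 11 byte add function from the post.
   Returns the number of bytes written, or 0 if the buffer was too small. */
static size_t emit_add_function(unsigned char *buf, size_t capacity)
{
    struct emitter e;

    emit_init(&e, buf, capacity);
    emit8(&e, 0x55);                                     /* push ebp           */
    emit8(&e, 0x8b); emit8(&e, 0xec);                    /* mov ebp, esp       */
    emit8(&e, 0x8b); emit8(&e, 0x45); emit8(&e, 0x08);   /* mov eax, [ebp+0x8] */
    emit8(&e, 0x03); emit8(&e, 0x45); emit8(&e, 0x0c);   /* add eax, [ebp+0xc] */
    emit8(&e, 0x5d);                                     /* pop ebp            */
    emit8(&e, 0xc3);                                     /* ret                */
    return e.overflow ? 0 : e.length;
}

/* Scratch buffer for trying the emitter out. */
static unsigned char demo_code[16];
```

The bytes produced are identical to those written by jit_compile_add_function, so the buffer could be the same heap block. The difference is only that inserting or removing an instruction no longer means renumbering every index after it.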
This simple program works fine, but there are a couple of issues to keep in mind when generating code at runtime, especially since Windows XP SP2. The code we generated in the above example was put into the standard C heap. In essence, we are trying to execute an area of memory that usually contains data. This is OK if we put it to constructive use like we did, but that is not always what people use it for. To help prevent this, Microsoft added Data Execution Prevention (DEP) in Windows XP SP2. DEP is a set of hardware and software technologies that help prevent malicious code from running on a system.
The primary benefit of DEP is to help prevent code execution from data pages. Typically, code is not executed from the default heap and the stack. Hardware-enforced DEP detects code that is running from these locations and raises an exception when execution occurs. Software-enforced DEP can help prevent malicious code from taking advantage of exception-handling mechanisms in Windows.
Read more about DEP here.
Also, Raymond Chen has a post that describes the right way to go about generating code at runtime.
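A DEP-friendly pattern commonly used on Windows is to allocate the pages with VirtualAlloc as read/write, copy the code in, switch them to executable with VirtualProtect, and call FlushInstructionCache before jumping in. The sketch below shows that sequence. It is a minimal example under stated assumptions rather than a summary of the linked post: run_generated_constant is a hypothetical name, the generated code is just mov eax, 42 / ret (which encodes the same way in 32-bit and 64-bit mode), and on anything other than x86/x64 Windows the function compiles to a stub that returns -1.

```c
#include <stddef.h>
#include <string.h>

#if defined(_WIN32) && (defined(_M_IX86) || defined(_M_X64) || \
                        defined(__i386__) || defined(__x86_64__))
#include <windows.h>

/* Returns the value produced by the generated code, or -1 on failure. */
static int run_generated_constant(void)
{
    /* mov eax, 42 ; ret */
    static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    DWORD old_protect;
    int result = -1;

    /* 1. Allocate whole pages as read/write, not executable. */
    void *mem = VirtualAlloc(NULL, sizeof code, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (mem == NULL)
        return -1;

    /* 2. Emit the code while the pages are still writable. */
    memcpy(mem, code, sizeof code);

    /* 3. Make the pages executable (and no longer writable), then flush
          the instruction cache before calling into them. */
    if (VirtualProtect(mem, sizeof code, PAGE_EXECUTE_READ, &old_protect) &&
        FlushInstructionCache(GetCurrentProcess(), mem, sizeof code)) {
        int (*fn)(void) = (int (*)(void))mem;
        result = fn();
    }

    VirtualFree(mem, 0, MEM_RELEASE);
    return result;
}
#else
/* Stub for platforms this sketch does not cover. */
static int run_generated_constant(void)
{
    return -1;
}
#endif
```

The same three steps would apply to the add function from this post on 32-bit x86: only the bytes copied in step 2 and the function pointer type change.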