Contents of /trunk/src/native/THOUGHTS_AND_IDEAS

Random thoughts about native code generation, which will be compatible 
with the already existing (non-host-specific) dyntrans core.


How to keep track of the number of times a basic block is executed?
(Perhaps needed, since unnecessary native code generation may slow things
down. Only the blocks that are executed frequently need to be natively
translated.)

Perhaps having a small additional array per page is a solution?
        unsigned char count[NR_OF_IC_ENTRIES_PER_PAGE];
For a typical MIPS CPU (4 KB pages, 4-byte instructions), that would be
1024 bytes extra per page. The main loop could be changed to increase
count, and if count goes beyond a certain threshold, the block is
natively translated. Hm.
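Below is a minimal C sketch of the counter idea. The names `struct page_counts`, `count_block_entry()`, `page_counts_reset()` and `NATIVE_THRESHOLD` are made up and are not part of the dyntrans core. The counter stops at the threshold, so an unsigned char can never wrap around and trigger a second translation of the same block.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* 4 KB page / 4-byte MIPS instructions = 1024 entries per page. */
#define NR_OF_IC_ENTRIES_PER_PAGE  1024

/* Made-up threshold; must stay below 256 since counters are unsigned char. */
#define NATIVE_THRESHOLD  100
_Static_assert(NATIVE_THRESHOLD < 256, "threshold must fit in unsigned char");

/* Hypothetical per-page counter array (1024 bytes for MIPS). */
struct page_counts {
	unsigned char count[NR_OF_IC_ENTRIES_PER_PAGE];
};

/* Clear all counters, e.g. when the page is (re)translated. */
static void page_counts_reset(struct page_counts *pc)
{
	memset(pc->count, 0, sizeof pc->count);
}

/*
 * Called each time the block starting at entry ic_index is entered.
 * Returns true exactly once, when the count reaches the threshold;
 * after that the counter stays put, so it can never wrap around.
 */
static bool count_block_entry(struct page_counts *pc, int ic_index)
{
	unsigned char *c = &pc->count[ic_index];

	if (*c >= NATIVE_THRESHOLD)
		return false;
	(*c)++;
	return *c == NATIVE_THRESHOLD;
}
```

Each block entry costs one load, one compare and one store. That is the overhead the next paragraph worries about.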

Or perhaps the overhead of implementing this counter check is more than it 
is worth? After all, most of the time will be spent executing (some of) 
the translated loops.

-------------------------------------

At most one [basic] block is ever translated at any given time.
A small array can hold the INR (intermediate native representation)
entries, and a small memory area can hold a doubly linked list of
native instruction entries.
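Below is a sketch of that scratch state. It assumes a static pool, so there is no malloc during translation. The names `struct translation_state`, `native_append()` and `native_remove()` are made up. The doubly linked list lets the optimizer unlink an entry in O(1) without shifting the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up sizes; a single basic block rarely needs more. */
#define MAX_INR_ENTRIES     256
#define MAX_NATIVE_ENTRIES  1024

/* One INR entry (hypothetical layout). */
struct inr_entry {
	int      opcode;
	size_t   dst_offset;    /* offset of destination register in struct cpu */
	size_t   src_offset;    /* offset of source register in struct cpu */
	uint32_t imm;
};

/* One native pseudo-op, linked both ways for cheap removal. */
struct native_op {
	int               opcode;
	size_t            mem_offset;
	uint32_t          imm;
	struct native_op *prev, *next;
};

/* Only one block is translated at a time, so one instance is enough. */
struct translation_state {
	struct inr_entry  inr[MAX_INR_ENTRIES];
	int               n_inr;
	struct native_op  pool[MAX_NATIVE_ENTRIES];
	int               n_pool_used;
	struct native_op *head, *tail;
};

/* Append a native op taken from the fixed pool; NULL if the pool is full. */
static struct native_op *native_append(struct translation_state *ts,
    int opcode, size_t mem_offset, uint32_t imm)
{
	struct native_op *op;

	if (ts->n_pool_used >= MAX_NATIVE_ENTRIES)
		return NULL;
	op = &ts->pool[ts->n_pool_used++];
	op->opcode = opcode;
	op->mem_offset = mem_offset;
	op->imm = imm;
	op->prev = ts->tail;
	op->next = NULL;
	if (ts->tail != NULL)
		ts->tail->next = op;
	else
		ts->head = op;
	ts->tail = op;
	return op;
}

/* Unlink an op, e.g. one that the peephole optimizer found redundant. */
static void native_remove(struct translation_state *ts, struct native_op *op)
{
	if (op->prev != NULL)
		op->prev->next = op->next;
	else
		ts->head = op->next;
	if (op->next != NULL)
		op->next->prev = op->prev;
	else
		ts->tail = op->prev;
}
```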

Simple instructions:

32-bit MIPS:
        andi $5,$5,0xff00
        ori $5,$5,0x0011

Intermediate native representation:
        AND_REG32PTR_REG32PTR_IMM16 (offset to reg 5, offset to reg 5, 0xff00)
        OR_REG32PTR_REG32PTR_IMM16 (offset to reg 5, offset to reg 5, 0x0011)
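A tiny reference interpreter pins down what the two INR entries mean, and generated native code could later be checked against it. This sketch assumes a made-up `struct fake_cpu` that holds only the 32 GPRs. The immediate is zero-extended, as MIPS andi/ori do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct cpu: only the 32-bit MIPS GPRs. */
struct fake_cpu {
	uint32_t gpr[32];
};

enum inr_opcode {
	AND_REG32PTR_REG32PTR_IMM16,
	OR_REG32PTR_REG32PTR_IMM16
};

struct inr {
	enum inr_opcode op;
	size_t          dst_offset;  /* byte offset of destination register */
	size_t          src_offset;  /* byte offset of source register */
	uint16_t        imm;         /* zero-extended, as for andi/ori */
};

/* Execute one INR entry directly on the register file in memory. */
static void inr_execute(void *cpu, const struct inr *e)
{
	uint32_t v;

	memcpy(&v, (char *)cpu + e->src_offset, sizeof v);
	switch (e->op) {
	case AND_REG32PTR_REG32PTR_IMM16: v &= (uint32_t)e->imm; break;
	case OR_REG32PTR_REG32PTR_IMM16:  v |= (uint32_t)e->imm; break;
	}
	memcpy((char *)cpu + e->dst_offset, &v, sizeof v);
}
```

Running the two entries on a reg 5 value of 0x12345678 leaves 0x00005611 in reg 5.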

Non-peephole-optimized x86[_64] code:  (esi = struct cpu *; rsi on x86_64)
        mov eax, [esi + offset_to_source_reg]
        and eax, 0xff00
        mov [esi + offset_to_destination_reg], eax      (#1)
        mov eax, [esi + offset_to_source_reg]           (#2)
        or eax, 0x0011
        mov [esi + offset_to_destination_reg], eax

Peephole-optimized x86[_64] code:
(on the first pass, #2 is removed, since it loads back the value that #1
just stored; source and destination are both reg 5, so the value is
already in eax!)
(on the second pass, the store at #1 is removed, since a later store
overwrites the same emulated register before anything reads it)
        mov eax, [esi + offset_to_source_reg]
        and eax, 0xff00
        or eax, 0x0011
        mov [esi + offset_to_destination_reg], eax
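The two passes could look roughly like this. The sketch works on a plain array of made-up pseudo-ops (`struct nop`), with a single accumulator standing in for eax. It assumes the block has only one exit, at the end. Otherwise an exit path between #1 and the final store could still observe #1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical native pseudo-ops operating on one accumulator (eax). */
enum nop_kind { N_LOAD, N_STORE, N_AND, N_OR };

struct nop {
	enum nop_kind kind;
	size_t        offset;   /* struct cpu offset, for N_LOAD/N_STORE */
	uint32_t      imm;      /* immediate, for N_AND/N_OR */
	int           dead;     /* set by the optimizer */
};

/* Pass 1: a load directly after a store to the same offset would
   re-read the value that is still in eax, so drop the load. */
static void pass_redundant_loads(struct nop *ops, int n)
{
	for (int i = 1; i < n; i++)
		if (ops[i].kind == N_LOAD && ops[i - 1].kind == N_STORE &&
		    ops[i].offset == ops[i - 1].offset)
			ops[i].dead = 1;
}

/* Pass 2: a store is dead if a later store to the same offset comes
   before any load from it (single exit at the end assumed). */
static void pass_dead_stores(struct nop *ops, int n)
{
	for (int i = 0; i < n; i++) {
		if (ops[i].dead || ops[i].kind != N_STORE)
			continue;
		for (int j = i + 1; j < n; j++) {
			if (ops[j].dead)
				continue;
			if (ops[j].kind != N_LOAD && ops[j].kind != N_STORE)
				continue;
			if (ops[j].offset != ops[i].offset)
				continue;
			if (ops[j].kind == N_STORE)
				ops[i].dead = 1;
			break;   /* a load keeps the store, a store kills it */
		}
	}
}

/* Squeeze out dead ops in place; returns the new length. */
static int compact(struct nop *ops, int n)
{
	int k = 0;

	for (int i = 0; i < n; i++)
		if (!ops[i].dead)
			ops[k++] = ops[i];
	return k;
}
```

The order matters. If the dead-store pass ran first, load #2 would still read reg 5 from memory and keep store #1 alive.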

Native code entry:
        (none on x86_64)

Native code exit:
        ret[q]
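For this example, code generation (stage 4 below) might emit the bytes shown here. The encodings are the standard x86 ones: 8B /r, 89 /r, 25 id, 0D id and C3. The emitter functions and `struct codebuf` are made up. In 64-bit mode the same ModRM byte addresses [rsi + disp32], and retq is just the AT&T spelling of the C3 opcode, so the same bytes serve both x86 and x86_64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal byte emitter into a fixed-size buffer. */
struct codebuf {
	uint8_t *p;
	size_t   len, cap;
};

static void emit8(struct codebuf *cb, uint8_t b)
{
	assert(cb->len < cb->cap);
	cb->p[cb->len++] = b;
}

static void emit32(struct codebuf *cb, uint32_t v)   /* little-endian */
{
	for (int i = 0; i < 4; i++)
		emit8(cb, (uint8_t)(v >> (8 * i)));
}

/* mov eax, [esi + disp32]  (8B /r; ModRM 0x86: mod=10, reg=eax, rm=esi) */
static void emit_load_eax(struct codebuf *cb, uint32_t disp)
{
	emit8(cb, 0x8b); emit8(cb, 0x86); emit32(cb, disp);
}

/* mov [esi + disp32], eax  (89 /r, same ModRM byte) */
static void emit_store_eax(struct codebuf *cb, uint32_t disp)
{
	emit8(cb, 0x89); emit8(cb, 0x86); emit32(cb, disp);
}

/* and eax, imm32  (25 id) */
static void emit_and_eax(struct codebuf *cb, uint32_t imm)
{
	emit8(cb, 0x25); emit32(cb, imm);
}

/* or eax, imm32  (0D id) */
static void emit_or_eax(struct codebuf *cb, uint32_t imm)
{
	emit8(cb, 0x0d); emit32(cb, imm);
}

/* Native code exit: near return (C3). */
static void emit_ret(struct codebuf *cb)
{
	emit8(cb, 0xc3);
}
```

The optimized block plus ret comes to 23 bytes. A real generator could use the shorter 83 /1 ib form (83 C8 11) for small immediates such as 0x0011; this sketch always uses imm32. The buffer would also have to be made executable before it can be called, which is outside this sketch.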

---------------------------

Update of nr-of-executed-instructions and the IC pointer:

        All possible return paths need to update the following:

        x) The nr-of-executed-instructions count (one less than the
           number of instructions in the translated block, since an
           implicit count of 1 is already included).
        x) The next_ic pointer, and also the cur_page if we have
           switched page.
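Below is a sketch of that bookkeeping as an exit stub might do it. `struct cpu_state`, its fields, `struct ic_page` and `native_exit()` are hypothetical simplifications, and the real cpu struct names differ. Passing NULL as the next page means the block ends inside the current page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction call entry and a (tiny) page of them. */
struct ic {
	void (*f)(void);
};

struct ic_page {
	struct ic entries[8];   /* real pages have many more entries */
};

/* Hypothetical, simplified subset of the per-CPU dyntrans state. */
struct cpu_state {
	long            n_executed;  /* nr-of-executed-instructions */
	struct ic      *next_ic;     /* where the main loop continues */
	struct ic_page *cur_page;    /* page that next_ic points into */
};

/*
 * What every return path of a native chunk covering n_instrs emulated
 * instructions has to do. One instruction is already counted
 * implicitly, so only n_instrs - 1 is added here.
 */
static void native_exit(struct cpu_state *cpu, int n_instrs,
    struct ic *next, struct ic_page *next_page)
{
	cpu->n_executed += n_instrs - 1;
	cpu->next_ic = next;
	if (next_page != NULL)          /* only if we switched page */
		cpu->cur_page = next_page;
}
```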

-----------------------------

Stages during translation:

        Stage 1:
                Emulated ISA (e.g. MIPS) to INR instructions.
                Each emulated instruction may be turned into 0 or
                more INR instructions.
                This is done in e.g. src/cpus/cpu_mips_instr.c
                using semi-magic macros.
                The INR array is a small, fixed-size array, pointed
                to by the cpu struct.

        Stage 2:
                INR -> native operations (e.g. x86).
                This is done in src/native/native_x86.c.
                Things to think about are round-robin use of
                temporary registers.
                native_inr_to_native_ops() takes a cpu as input and
                translates the current INR entries into native
                pseudo-opcodes.

        Stage 3:
                Optimization, native ops -> native ops.
                This is done in src/native/native_x86_optim.c,
                and is an optional step. It should be possible
                to turn this step off for debugging.
                If e.g. a value is in a register, and it is stored
                to memory, then the same memory position does not
                have to be read back; the value is already in a
                register.

        Stage 4:
                Code generation, native ops -> native machine code.
                Done in src/native/native_x86_gen.c.

        Stage 5:
                Patch _older_ code chunks so that they can branch
                directly to the new chunk, if possible.
                An optional step.

        Stage 6:
                Enter the newly generated native code chunk into
                the physpage's ic->f.
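Stage 5 could overwrite an exit sequence in an older chunk with a 5-byte `jmp rel32` (E9 cd), whose displacement is relative to the end of the jmp. This sketch assumes that all chunks live in one made-up `code_cache` array, so every distance fits in 32 bits. It also assumes each patchable exit site reserves at least 5 bytes. Thread safety is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical code cache holding all generated native chunks. */
#define CODE_CACHE_SIZE 4096
static uint8_t code_cache[CODE_CACHE_SIZE];

/*
 * Overwrite 5 bytes at exit_site (an exit in an older chunk) with
 * "jmp rel32" so that it branches straight into the chunk that starts
 * at new_chunk. Both arguments are offsets into code_cache.
 */
static void patch_jump(size_t exit_site, size_t new_chunk)
{
	int32_t  rel = (int32_t)((int64_t)new_chunk - (int64_t)(exit_site + 5));
	uint32_t u = (uint32_t)rel;

	assert(exit_site + 5 <= CODE_CACHE_SIZE);
	code_cache[exit_site] = 0xe9;
	for (int i = 0; i < 4; i++)             /* little-endian rel32 */
		code_cache[exit_site + 1 + i] = (uint8_t)(u >> (8 * i));
}
```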