Accelerate ISA simulation by tiered JIT compilation #322
I have successfully built a simple sum program using the LLVM-C API in two ways. The first approach involves building LLVM IR through the API and utilizing
|
In the context of tiered compilation, the tier-1 JIT compiler (T1C) focuses on rapid code generation while omitting certain extensions such as RV32A and RV32F. Conversely, the tier-2 JIT compiler (T2C) aims to handle all execution paths and bolster optimizations by utilizing the LLVM compilation framework. This allows for the translation of code as shown below for `addi`:

```c
void addi(LLVMBuilderRef *builder,
          LLVMTypeRef *param_types UNUSED,
          LLVMValueRef start,
          rv_insn_t *ir)
{
    LLVMValueRef rs1_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, X) / sizeof(int) + ir->rs1, true)};
    LLVMValueRef addr_rs1 =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              rs1_offset, 1, "addr_rs1");
    LLVMValueRef rd_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, X) / sizeof(int) + ir->rd, true)};
    LLVMValueRef addr_rd =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              rd_offset, 1, "addr_rd");
    LLVMValueRef val_rs1 =
        LLVMBuildLoad2(*builder, LLVMInt32Type(), addr_rs1, "val_rs1");
    LLVMValueRef res = LLVMBuildAdd(
        *builder, val_rs1, LLVMConstInt(LLVMInt32Type(), ir->imm, true), "add");
    LLVMBuildStore(*builder, res, addr_rd);
}
```

The LLVM code above can be viewed as a specialized version of the corresponding `RVOP` implementation. Then, let's check the LLVM code for `jalr`:

```c
void jalr(LLVMBuilderRef *builder,
          LLVMTypeRef *param_types UNUSED,
          LLVMValueRef start,
          rv_insn_t *ir)
{
    if (ir->rd) {
        LLVMValueRef rd_offset[1] = {
            LLVMConstInt(LLVMInt32Type(),
                         offsetof(riscv_t, X) / sizeof(int) + ir->rd, true)};
        LLVMValueRef addr_rd = LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(),
                                                     LLVMGetParam(start, 0),
                                                     rd_offset, 1, "addr_rd");
        LLVMBuildStore(
            *builder, LLVMConstInt(LLVMInt32Type(), ir->pc + 4, true), addr_rd);
    }
    LLVMValueRef rs1_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, X) / sizeof(int) + ir->rs1, true)};
    LLVMValueRef addr_rs1 =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              rs1_offset, 1, "addr_rs1");
    LLVMValueRef val_rs1 =
        LLVMBuildLoad2(*builder, LLVMInt32Type(), addr_rs1, "val_rs1");
    LLVMValueRef res1 = LLVMBuildAdd(
        *builder, val_rs1, LLVMConstInt(LLVMInt32Type(), ir->imm, true), "add");
    LLVMValueRef res2 = LLVMBuildAnd(
        *builder, res1, LLVMConstInt(LLVMInt32Type(), ~1U, true), "and");
    LLVMValueRef PC_offset[1] = {LLVMConstInt(
        LLVMInt32Type(), offsetof(riscv_t, PC) / sizeof(int), true)};
    LLVMValueRef addr_PC =
        LLVMBuildInBoundsGEP2(*builder, LLVMInt32Type(), LLVMGetParam(start, 0),
                              PC_offset, 1, "addr_PC");
    LLVMBuildStore(*builder, res2, addr_PC);
    LLVMBuildRetVoid(*builder);
}
```

It can also be viewed as a specialized version of the `RVOP` implementation:

```c
RVOP(
    jalr,
    {
        const uint32_t pc = PC;
        /* jump */
        PC = (rv->X[ir->rs1] + ir->imm) & ~1U;
        /* link */
        if (ir->rd)
            rv->X[ir->rd] = pc + 4;
        rv->csr_cycle = cycle;
        rv->PC = PC;
        return true;
    },
```

Based on the insights gathered, I recommend using pycparser (or another appropriate C parsing package for Python) to parse the code in the existing `RVOP` implementations. Some projects already use pycparser.
|
Rellume is a lifter designed to convert x86-64/AArch64/RISC-V64 machine code into LLVM IR, with a strong emphasis on the performance of the resultant code. The generated LLVM IR is both recompilable and executable, offering the potential to match or even exceed the original machine code's performance. Rellume meticulously models SIMD instructions and pointers, ensuring the optimizer can generate efficient code. Capable of working with specified instructions or automatically decoding control flow, it creates LLVM-IR functions that accurately reflect the original semantics. We can consider adopting the approach of Rellume and Instrew for converting RISC-V instructions to LLVM IR, leveraging compiler optimizations in the process. |
I have implemented a tier-2 just-in-time compiler through two approaches:
|
There are several optimization strategies mentioned in the referenced work.

### For tier-1 JIT compiler

**Function Representation**

A function is decoded into superblocks, which have a single entry point but can have several exits. This is intended to allow for a code representation close to the original machine code layout and reduces the overall number of blocks. To simplify code generation, superblocks can be chained to indicate a preferred ordering and enable the omission of a terminating branch instruction by falling through to the next block. The superblocks of a function represent its CFG.

**Register Liveness Analysis**

For several optimization passes, information about the usage of written registers is helpful. It allows Drob to detect unused effects of instructions, which can therefore be ignored; to identify entirely unused instruction sequences, which can be removed; and to find cyclic chains of unused values. Starting at the return instructions, information about used registers is propagated backwards through the instructions of all superblocks of the function until the liveness data no longer changes. For each instruction, Drob stores the set of registers which is potentially used afterwards. When further information about conditional execution or eventually written registers is available (e.g., from a previous analysis run propagating constant values), such information is incorporated as well.

### For tier-2 JIT compiler

**LLVM-IR based optimization**

For the main optimization, the generic -O3 pass pipeline is used, extended with only three new passes specifically tailored to machine code rewriting. The generic pass sequence provided by LLVM includes passes for instruction combination, inlining, vectorization, loop unrolling (see, e.g., Figures 6.5a and 6.5b) and other (non-trivial) loop transformations, and propagating memory accesses to SSA registers where safely possible (see, e.g., Figure 6.5d). The following passes are specifically designed for DBLL.
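The backward liveness propagation described above can be sketched in plain C. This is a toy model, not Drob's implementation: instructions are reduced to one destination and two source registers, register sets are 32-bit masks, and only a single straight-line block is analyzed.

```c
#include <stdint.h>

/* Hypothetical simplified instruction: writes rd, reads rs1/rs2 (-1 = none). */
typedef struct {
    int rd, rs1, rs2;
} insn_t;

/* Backward liveness over one straight-line block: records, per instruction,
 * the bitmask of registers live *after* it, and flags dead definitions. */
static void liveness(const insn_t *insns, int n, uint32_t *live_after, int *dead)
{
    uint32_t live = 1u << 10; /* assume a0 (x10) is live at the block exit */
    for (int i = n - 1; i >= 0; i--) {
        live_after[i] = live;
        /* A definition is dead if its target is never used afterwards. */
        dead[i] = insns[i].rd > 0 && !(live & (1u << insns[i].rd));
        if (insns[i].rd > 0)
            live &= ~(1u << insns[i].rd); /* kill the written register */
        if (insns[i].rs1 > 0)
            live |= 1u << insns[i].rs1;   /* gen the read registers */
        if (insns[i].rs2 > 0)
            live |= 1u << insns[i].rs2;
    }
}
```

For example, in the block `x5 = x1 + x2; x10 = x5; x6 = x3`, the last write to `x6` is flagged dead because `x6` is never read before the block exits.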
**Client–Server Architecture**

This thesis provides a client–server framework; the design is similar to background compilation. |
Next, we will make several further improvements:

- Tier-1 JIT compiler
- Tier-2 JIT compiler
Based on the experimental results (see DBLL) of *Optimizing Performance Using Dynamic Code Generation*, the indirect jump instruction is also a target that needs to be optimized. |
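One common way to speed up indirect jumps is a small direct-mapped cache from guest target PCs to translated blocks, probed before the slower block-map lookup. The sketch below is a hypothetical illustration, not rv32emu's actual mechanism; `JUMP_CACHE_BITS` and the function names are invented.

```c
#include <stdint.h>
#include <stddef.h>

#define JUMP_CACHE_BITS 10
#define JUMP_CACHE_SIZE (1u << JUMP_CACHE_BITS)

typedef struct {
    uint32_t pc; /* guest target address */
    void *block; /* translated code for that address (NULL = empty) */
} jump_cache_entry_t;

static jump_cache_entry_t jump_cache[JUMP_CACHE_SIZE];

/* An indirect jump first probes this direct-mapped table keyed by the
 * guest PC, and only falls back to the slow block lookup on a miss. */
static void *jump_cache_lookup(uint32_t pc)
{
    jump_cache_entry_t *e = &jump_cache[(pc >> 2) & (JUMP_CACHE_SIZE - 1)];
    return (e->pc == pc) ? e->block : NULL;
}

static void jump_cache_insert(uint32_t pc, void *block)
{
    jump_cache_entry_t *e = &jump_cache[(pc >> 2) & (JUMP_CACHE_SIZE - 1)];
    e->pc = pc;
    e->block = block;
}
```

A colliding target simply evicts the previous entry, so the cache stays O(1) in both time and bookkeeping.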
Ideas for Improvements.

**Translation caching**

Translations are cached, so that if the same block of code requires executing again, it does not need to be retranslated. The translations are stored in a 16 MB cache, and pointers to them are stored in a map. Given the value of the guest instruction pointer and some flags, the cached translation to be executed (if it exists) can easily be found. This lookup is accelerated further by the use of a translation lookaside buffer (TLB), which stores pointers to the most recently used translations. If the cache becomes full, it is cleared completely.

Translations in the cache may become invalid due to the guest code on which they are based being overwritten. To ensure that translations are removed from the cache when this occurs, an array is maintained storing the state of each page of guest memory. If a memory write is performed to a page which contains translations, the corresponding function is invoked to invalidate them.

The translations are dependent on several states within the guest processor, for example, the default stack pointer size (16- or 32-bit). If one of these states changes, care must be taken to avoid using a translation compiled when the states were different. Since these states must be constant throughout a translation, any instructions which change these states are not added to a translation; they are emulated using interpretation.

**Control transfer**

Control transfer instructions necessitate a shift in execution from one translated basic block to another. When reaching the end of a basic block concluding with a control transfer instruction, the process transitions from the translated block to a helper function. An optimization is viable for control transfer instructions that consistently branch to a specific location, such as direct jumps and calls.

**Conditional jumps**

Conditional jumps are processed in a manner akin to unconditional jumps. They are transformed into two separate unconditional jumps: one is executed when the condition is met, and the other when the condition is not met. These two jumps can be patched independently of each other.

**Exceptions**

In some cases, guest code is designed to trigger an exception, which needs to be accurately represented in the host code. This is achieved through two methods. Firstly, certain exceptions, like page faults, are intrinsic to the guest code and must be emulated in the translations. Here, the translation includes a deliberate check for conditions that would lead to an exception. If these conditions are met, the translation logs details about the exception and returns control to the emulator immediately, without completing the rest of the translation. Secondly, for exceptions that occur less frequently, such as divide-by-zero errors, adding a preemptive check for every division instruction would be inefficient. Instead, the division is allowed to proceed without checking the divisor. To handle any resulting divide-by-zero exceptions on the host, all translation execution is enclosed within a structured exception handling (try-catch) block.

Exceptions can interrupt the execution of a translation before it concludes. Therefore, the guest CPU state must be constantly current wherever an exception might occur. For instance, merely updating the guest's instruction pointer at the end of a translation is inadequate, as this update might not be reached. However, updating it after every guest instruction translation would degrade performance, particularly since some guest instructions correspond to a single host instruction. A balance is struck by updating the instruction pointer only before translating a guest instruction that could potentially cause an exception. This approach still necessitates frequent updates to the instruction pointer, as any memory-accessing instruction might trigger an exception, like a page fault.

**Memory access**

In situations where the guest code performs memory access, the translation incorporates a call to one of many helper functions tailored to various types of memory access, sizes, and processor modes. It is important to note that, for efficiency, such a function does not verify whether the segment limit or access rights are appropriate for the memory access being executed. |
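The caching and invalidation scheme described above can be sketched as follows. This is a toy model under invented sizes (`MAP_SIZE`, `NUM_PAGES`), not the emulator's real data structures, and it clears the whole cache on a conflicting guest write, mirroring the "clear completely when full" policy.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  1024 /* toy guest address space */
#define MAP_SIZE   4096 /* toy translation map */

typedef struct {
    uint32_t pc; /* guest block address this translation was made for */
    void *code;  /* host code, NULL if the slot is empty */
} translation_t;

static translation_t map[MAP_SIZE];
/* One flag per guest page: set when the page contains translated code,
 * so a write to it must invalidate cached translations. */
static bool page_has_translation[NUM_PAGES];

static void cache_insert(uint32_t pc, void *code)
{
    translation_t *t = &map[(pc >> 2) % MAP_SIZE];
    t->pc = pc;
    t->code = code;
    page_has_translation[(pc >> PAGE_SHIFT) % NUM_PAGES] = true;
}

static void *cache_lookup(uint32_t pc)
{
    translation_t *t = &map[(pc >> 2) % MAP_SIZE];
    return (t->code && t->pc == pc) ? t->code : NULL;
}

/* Self-modifying code: a guest store into a page that holds translated
 * code invalidates the cache (wholesale, in this toy version). */
static void guest_write(uint32_t addr)
{
    if (page_has_translation[(addr >> PAGE_SHIFT) % NUM_PAGES]) {
        memset(map, 0, sizeof(map));
        memset(page_has_translation, 0, sizeof(page_has_translation));
    }
}
```

A real implementation would invalidate only the translations on the written page; tracking per-page translation lists is what the page-state array enables.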
#341 focuses on tracking register contents within a basic block -- a linear sequence of instructions -- allowing for the efficient reuse of variables and constants directly from registers. |
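A register-content tracker of the kind #341 describes might look like the sketch below. All names here are hypothetical illustrations, not the code in #341: each guest register carries a flag for "already loaded into a host register" and an optional known constant, so repeated reads within a block avoid redundant loads.

```c
#include <stdint.h>
#include <stdbool.h>

#define N_REGS 32

/* Per-register tracking state within one basic block. */
typedef struct {
    bool loaded;   /* value already resides in a host register */
    bool is_const; /* value is a known constant */
    uint32_t cval;
} reg_state_t;

typedef struct {
    reg_state_t regs[N_REGS];
    int loads_emitted; /* how many host loads we actually had to emit */
} block_ctx_t;

/* Request guest register rs for reading: emit a load only on first use. */
static void use_reg(block_ctx_t *ctx, int rs)
{
    if (!ctx->regs[rs].loaded) {
        ctx->loads_emitted++; /* a host load would be emitted here */
        ctx->regs[rs].loaded = true;
    }
}

/* Writing a constant to rd: remember it so later reads can be folded. */
static void set_const(block_ctx_t *ctx, int rd, uint32_t value)
{
    ctx->regs[rd].loaded = true;
    ctx->regs[rd].is_const = true;
    ctx->regs[rd].cval = value;
}
```

At a block boundary (or any exit) the tracked state must be flushed back to the in-memory register file, since the analysis is only valid within the linear sequence.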
After implementing register allocation for T1C and adding some LLVM optimization passes mentioned in *Optimizing Performance Using Dynamic Code Generation* for T2C, the performance of rv32emu with the tiered framework (T1C & T2C) surpasses that of QEMU in all scenarios except miniz (where it is very close). |
Chrome 114 marks the start of Google's roll-out of Maglev as their new mid-tier compiler, further enhancing JavaScript performance in the browser. See the early Maglev design document. |
FEX is a sophisticated x86 emulation frontend designed to enable the execution of x86(-64) binaries on Arm64 hosts, akin to qemu-user. It introduces several innovative concepts worth exploring:
These strategies underscore FEX's approach to enhancing emulation efficiency and performance through targeted optimizations. Reference: FEXCore IR Optimization passes |
After improving indirect jumps for T1C, the performance of rv32emu with the tiered framework (T1C & T2C) is very close to that of QEMU and even surpasses it in some cases. We still need more analysis and improvement for the miniz, FP EMULATION, and bitfield benchmarks. |
Here, we also compare the hardware events: iTLB-load-misses, cache-misses, and page-faults.

Based on the analysis below, the fast interpreter & T1C use less memory to perform dynamic binary translation than QEMU does. However, T2C uses a large amount of memory to generate machine code because of the LLVM backend. |
This task is considered complete! |
The prototype of the fast tier-1 JIT compiler (#283) introduces an advanced sampling-based profiler and a tier-up optimization strategy. The two fundamental components for this are: 1) a well-balanced multi-tiered recompilation system, and 2) a profiling system that is reliable and has low overhead.
Initially, all RISC-V instructions are processed by an optimizing interpreter. Each block has a counter, initialized to a preset threshold, that tracks function calls and loop iterations. When the interpreter detects a loop's backward branch, it assesses the loop iteration count and decrements the counter by that amount. Should the iteration count be substantial, JIT compilation proceeds immediately, without waiting for the counter to deplete fully.
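The counting scheme just described can be sketched in C. This is a hypothetical illustration: `block_t`, `profile_backward_branch`, and `TIER_UP_THRESHOLD` are invented names, not rv32emu's actual API.

```c
#include <stdint.h>
#include <stdbool.h>

#define TIER_UP_THRESHOLD 4096

typedef struct {
    uint32_t pc;
    int32_t counter; /* counts down from the preset threshold */
    bool compiled;   /* already handed to the JIT? */
} block_t;

/* Called when the interpreter takes a backward branch closing a loop in
 * this block. The counter is decremented by the observed iteration count,
 * and a sufficiently hot loop triggers compilation immediately, without
 * waiting for the counter to deplete over many block entries. */
static bool profile_backward_branch(block_t *b, int32_t iterations)
{
    if (b->compiled)
        return false;
    b->counter -= iterations;
    if (iterations >= TIER_UP_THRESHOLD || b->counter <= 0) {
        b->compiled = true; /* hand the block to the tier-1 JIT */
        return true;
    }
    return false;
}
```

Returning `true` signals the dispatcher to compile the block and switch to the generated code on the next entry.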
The tier-2 optimizing JIT compiler further enhances optimizations, including aggressive block inlining, an expanded range of dataflow optimizations, and superior code quality through an additional code generation pass. These optimizations span various techniques like copy propagation, array bound check elimination, and dead code elimination. Additional optimizations at this level encompass escape analysis, code scheduling, and DAG-based optimizations like loop versioning and pre-pass code scheduling, with an increased maximum iteration count for dataflow-based optimizations. This sophisticated approach ensures that blocks needing the most optimization receive it timely and effectively, leveraging a tiered system for optimal performance.
Quote from Design and Evaluation of Dynamic Optimizations for a Java Just-In-Time Compiler:
The tier-2 JIT compiler will be developed with LLVM, specifically leveraging its ORC lazy JIT feature. This approach harnesses LLVM's capabilities for efficient and dynamic JIT compilation. Sample code:
Reference: