
A Rendezvous with System Management Interrupts

In a previous blog post [1], I presented the implementation of a new mode of operation in Chipsec's `smm_ptr` module that aims to identify unusually slow SMIs in the system's BIOS. This work, which was carried out using a black-box approach on an x86 platform running an MSI UEFI BIOS, left some open questions.

As observed in the results, continuously executing SMIs with the same code and parameters produced a trace whose standard deviation was close to 1% of the mean, which, even assuming poor pipeline utilization, corresponds to a variability of thousands of instructions. A second unexplained issue was that sporadic SMI executions were up to 20% faster or slower than the rest, which seemed unjustified given that the SMI parameters remained unchanged. Investigating these questions led to a set of improvements in the measurement strategy, which I discuss in this blog post.

My first thought in trying to address the execution variability was to confirm that there was no SMI contention on the platform. The SMI execution time is measured by issuing the `rdtsc` instruction [2] before and after triggering the SMI, then subtracting the two readings [3]. The presence of other SMIs triggered by the system could therefore inflate the measurement: if another SMI arrives between the two `rdtsc` reads, my measurement also includes the wait for it to complete.
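For reference, a minimal C sketch of this measurement is shown below. The actual Chipsec implementation lives in assembly [3]; the `trigger_smi` helper, which writes the SMI code to the APM command port (0xB2), is simplified for illustration and requires ring 0:

```c
#include <stdint.h>

/* Read the time-stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Trigger a software SMI by writing its code to the APM command
 * port (0xB2). Simplified: Chipsec also sets up the remaining
 * register parameters before the write [3]. */
static inline void trigger_smi(uint8_t smi_code)
{
    __asm__ __volatile__("outb %0, %1"
                         : : "a"(smi_code), "Nd"((uint16_t)0xB2));
}

/* Return the SMI execution time in TSC counts. */
static uint64_t measure_smi(uint8_t smi_code)
{
    uint64_t start = rdtsc();
    trigger_smi(smi_code);
    return rdtsc() - start;
}
```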

Intel processors include the Model-Specific Register (MSR) `MSR_SMI_COUNT` at register address `0x34`, which provides the number of SMIs executed since boot [4]. I introduced a check in the SMI execution loop [5] to ensure that the SMI count increases by exactly one after each executed SMI, re-executing the measurement if it increases by more than one. Although I did observe instances where the system triggered SMIs that contended with my measured ones, this was not a regularly occurring event and does not explain the questions stated above.
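A sketch of that check, assuming a Linux kernel context where `rdmsrl` is available, and reusing the hypothetical `measure_smi` helper from above:

```c
#include <asm/msr.h>

#define MSR_SMI_COUNT 0x34  /* SMIs serviced since reset [4] */

/* Repeat the measurement until exactly one SMI (ours) was
 * serviced between the two counter reads, discarding samples
 * polluted by platform-triggered SMIs. */
static uint64_t measure_smi_uncontended(uint8_t smi_code)
{
    uint64_t before, after, cycles;

    do {
        rdmsrl(MSR_SMI_COUNT, before);
        cycles = measure_smi(smi_code);
        rdmsrl(MSR_SMI_COUNT, after);
    } while (after - before != 1);

    return cycles;
}
```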

Next, I verified that there were no cache-related effects that could explain a slower path depending on events occurring in the system. To rule this out, I flushed the caches right before the first `rdtsc` by adding the `wbinvd` instruction [6]. I also added a memory fence [7] to prevent any speculative reordering and to ensure my intended instruction ordering was followed [8]. These changes did not make an observable difference.
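In the same sketch style, the flush and fence would sit immediately right before the first `rdtsc` in `measure_smi`; both instructions require ring 0:

```c
/* Write back and invalidate all caches, then fence, so the
 * measured SMI neither benefits from nor pays for whatever the
 * system happened to cache beforehand [6][7]. */
static inline void flush_and_fence(void)
{
    __asm__ __volatile__("wbinvd" ::: "memory");
    __asm__ __volatile__("mfence" ::: "memory");
}
```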

I also wanted to ensure that the variability was not caused by interrupts serviced with priority right before reading the time-stamp counter on SMI return, or after the first time measurement was taken but before the SMI was triggered. To address this, I protected the SMI execution and time measurement using the Linux `local_irq_save` and `local_irq_restore` functions, which respectively disable and re-enable interrupt delivery on the CPU core executing the SMI [9]. This did not seem to make much of a difference either.
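A sketch of the protected window, using the Linux kernel API referenced above [9] around the hypothetical helper from the earlier sketches:

```c
#include <linux/irqflags.h>

/* Keep local interrupt delivery disabled across the whole
 * rdtsc / SMI / rdtsc window on this core. */
static uint64_t measure_smi_irqs_off(uint8_t smi_code)
{
    unsigned long flags;
    uint64_t cycles;

    local_irq_save(flags);        /* disable local IRQs */
    cycles = measure_smi_uncontended(smi_code);
    local_irq_restore(flags);     /* restore previous state */

    return cycles;
}
```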

Finally, I decided to look for other ways to reduce the complexity of the system and found that limiting the number of running processor cores had a noticeable impact. In Linux, this can be done with the `maxcpus` command-line parameter [10]: booting with `maxcpus=1` causes Linux to run on a single processor core.
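On a typical GRUB-based distribution, this amounts to appending `maxcpus=1` to the kernel command line, for example via `GRUB_CMDLINE_LINUX` in `/etc/default/grub` and regenerating the GRUB configuration (exact steps vary per distribution).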

My explanation for why limiting the number of processor cores matters relates to how processor cores enter System Management Mode (SMM), through a UEFI mechanism called SMI Rendezvous that is intended to prevent race conditions [11, 12].

The UEFI implementation of SMI Rendezvous [13] waits, before servicing the SMI handler, until the Bootstrap Processor (BSP), which bootstraps the system and handles the SMI interrupts, and the Application Processors (APs), the secondary cores guided by the BSP, have all entered SMM. This implies that every processor has switched its execution context to SMM and saved its running state in SMRAM before the BSP services the SMI.

To achieve this in UEFI, the processor core that triggered the SMI enters SMM and executes the SMI rendezvous. It then elects the BSP and signals an SMI IPI (interprocessor interrupt) to it, causing the BSP to relinquish its current work, enter SMM, and execute the SMI rendezvous [14]. The BSP follows up by sending SMI IPIs to the remaining APs. Once all processors are in SMM, the BSP proceeds with servicing the SMI handler.
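A heavily simplified sketch of that synchronization follows; the real logic lives in EDK2's rendezvous and BSP handover code [13, 14], and the names, counters, and busy-waits here are illustrative only:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative rendezvous state; EDK2 keeps the real bookkeeping
 * in SMM-private structures [13]. */
static atomic_int cpus_in_smm;
static int        total_cpus;

/* Every core runs this on SMM entry; only the BSP falls through
 * to the SMI handler, and only once all cores have arrived. */
void smi_rendezvous(bool is_bsp)
{
    atomic_fetch_add(&cpus_in_smm, 1);

    if (is_bsp) {
        /* send_smi_ipi_to_remaining_aps();   (hypothetical) */
        while (atomic_load(&cpus_in_smm) < total_cpus)
            ;                          /* wait for all cores */
        /* service_smi_handler();            (hypothetical) */
        atomic_store(&cpus_in_smm, 0); /* release the APs */
    } else {
        while (atomic_load(&cpus_in_smm) != 0)
            ;                          /* spin until BSP is done */
    }
}
```

The relevant point for the measurements is the barrier itself: the time between the first `rdtsc` and the SMI handler actually running now includes pulling every other core into SMM.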

These IPIs are handled by each processor's Local APIC (Advanced Programmable Interrupt Controller), which examines the IPI message and delivers the interrupt to the processor core based on its priority relative to the processor's other activities, including servicing other interrupts [15]. Although SMIs are non-maskable, they can be slightly delayed depending on the state of the processor, for example if it was executing a Non-Maskable Interrupt (NMI) handler when the IPI arrived. This, I assume, accounts for the variations observed when measuring SMI executions.

The traces below show the time measurement in CPU counts when SMI 0xE1 is executed repeatedly without modifying its parameters, on an Intel Core i5-7200 CPU running at 2.50 GHz with 4 processor cores. The second image shows the time measurement on the same CPU and SMI code, but with the number of cores limited to one. As is evident, the variability in the SMI execution time is significantly reduced.

CPU counts per SMI with 4 CPU cores

CPU counts per SMI with 1 CPU core

Interestingly, this effect is not experienced on a system running Coreboot, which employs a different strategy for the SMI Rendezvous. However, that would be a discussion for another time!


[1] https://www.nccgroup.com/us/research-blog/enumerating-system-management-interrupts/
[2] https://www.felixcloutier.com/x86/rdtsc
[3] https://github.com/chipsec/chipsec/blob/1.13.9/drivers/linux/amd64/cpu.asm#L467-L490
[4] Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers
[5] https://github.com/chipsec/chipsec/blob/1.13.9/chipsec/modules/tools/smm/smm_ptr.py#L537
[6] https://www.felixcloutier.com/x86/wbinvd
[7] https://www.felixcloutier.com/x86/mfence
[8] https://github.com/chipsec/chipsec/blob/1.13.9/drivers/linux/amd64/cpu.asm#L465-L466
[9] https://github.com/chipsec/chipsec/blob/1.13.9/drivers/linux/chipsec_km.c#L1121-L1123
[10] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
[11] https://www.nccgroup.com/us/research-blog/stepping-insyde-system-management-mode/
[12] https://www.sentinelone.com/labs/zen-and-the-art-of-smm-bug-hunting-finding-mitigating-and-detecting-uefi-vulnerabilities/
[13] https://github.com/tianocore/edk2/blob/edk2-stable202411/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L1525
[14] https://github.com/tianocore/edk2/blob/edk2-stable202411/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304
[15] Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C, 3D): System Programming Guide. Section 11.8.3: Interrupt, Task, and Processor Priority