A technical deep dive on Meltdown and does it work?
Meltdown has definitly taken the internet by storm. The attack seems quite simple and elegant, yet the whitepaper leaves out critical details on the specific vulnerability. It relies mostly on a combination of cache timing side-channels and speculative execution that accesses globally mapped kernel pages.
This deep dive assumes some familiarity with CPU architecture and OS kernel behavior. Read the background section first for a primer on paging and memory protection.
Simplified version of the attack:
- Speculative memory reads to kernel mapped (supervisor) pages and then performing a calculation on value.
- Conditionally issuing a load to some other non-cached location from memory based on the result of the calculation.
- While the 2nd load will be nuked from the pipeline when the faulting exception retires, it already issued a load request out to L2$ and beyond ensuring the outstanding memory request still brings the line into the cache hierarchy like a prefetch.
- Finally, a separate process can issue loads to those same memory locations and measure the time for those loads. A cache hit will be much quicker than a cache miss which can be used to represent binary 1’s (i.e. hits) and binary 0’s (i.e. misses).
Parts 1 and 2. has to do with speculative execution of instructions while parts 3 and 4 enable the microarchitecture state (i.e. in cache or not) to be committed to an architectural state.
Is the attack believable?
What is not specified in the Meltdown whitepaper is what specific x86 instruction sequence or CPU state can enable the memory access to be speculatively executed AND allow the vector or integer unit to consume that value. or L2$. In modern Intel CPUs, when a fault happens such as a page fault, the pipeline is not squashed/nuked until the retirement of the offending instruction. However, memory permission checks for page protection, segmentation limits, and canonical checks are done in what is called the address generation (AGU) stage and TLB lookup stage before the load even look up in the L1D$ or go out to memory. More on this below.
Performing memory permission checks
Intel CPUs implement physically tagged L1D$ and L1I$ which requires translating the linear (virtual) address to a physical address before the L1D$ can determine if it hits or misses in the cache via a tag match. This means the CPU will attempt to find the translation in the (Translation Lookaside Buffer) TLB cache. The TLB caches these translations along with the page table or page directory permissions (privileges required to access a page are also stored along with the physical address translation in the page tables).
A TLB entry may contain the following:
- Physical Address minus page offset.
Thus, even a speculative load already knows the permissions required to access the page is compared against the Current Privilege Level (CPL) and required op privilege and thus can be blocked from any arithmetic unit from ever consuming the speculative load.
Such permission checks include:
- Segment limit checks
- Write faults
- User/Supervisor faults
- Page not present faults
This is what many x86 CPUs in fact are designed to do. The load would be rejected until the fault is later handled by software/uCode when the op is at retirement. The load would be zeroed out on the way to the integer/vector units. In other words, a fault on User/Supervisor protection fault would be similar to a page not present fault or other page translation issue meaning the line read out of the L1D$ should be thrown away immediately and the uOp simply put in a waiting state.
Preventing integer/floating units from consuming faulting loads is beneficial not just to prevent such leaks, but can actually boost performance. i.e. Loads that fault won’t train the prefetchers with bad data, allocate buffers to track memory ordering or allocate a fill buffer to fetch data from L2$ if it missed the L1D$. These are limited resources in modern CPUs and shouldn’t be consumed by loads that are not good anyway.
In fact, if the load missed the TLB’s and had to perform a page walk, some Intel CPUs will even kill the page walk in the PMH (Page Miss Handler) if a fault happens during the walk. Page walks perform a lot of pointer chasing and have to consume precious load cycles, so it makes sense to cancel if it’ll be thrown away later anyway. In addition, the PMH Finite State Machine can usually handle only a few page walks simultaneously.
In other words, aborting the L1D Load uOp can actually a good thing from a performance stand point. The press articles saying Intel slipped because they were trying to extract as much performance as possible with the tradeoff of being less secure isn’t true unless they want to claim the basic concepts of speculation and caching are considered tradeoffs.
While this doesn’t mean the Meltdown vulnerability doesn’t exist. There is more to the story than what the whitepaper and most news posts discuss. Most posts claim that only the mere act of having speculative memory accesses and cache timing attacks can create the attack, and now Intel has to completely redesign it’s CPUs or eliminate speculative execution.
Meltdown is more of a logic bug that slipped Intel CPU validation rather than a “fundamental breakdown in modern CPU architecture” like the press is currently saying. It can probably be fixed with a few gate changes in hardware.
In fact, the bug fix would probably be a few gate changes to add the correct rejection logic in the L1D$ pipelines to mask the load hit. Intel CPUs certainly have the information already as the address generation and TLB lookup stages have to be complete before a L1D$ cache hit can be determined anyway.
It is unknown what are all the scenarios that cause the vulnerability. Is it certain CPU designs that missed validation on this architectural behavior? Is it a special x86 instruction sequence that bypasses these checks, or some additional steps to set up the state of the CPU to ensure the load is actually executed? Project Zero believes the attack can only occur if the faulting load hits in L1D$. Maybe Intel had the logic on the miss path but had a logic bug for the hit path? I wouldn’t be surprised if certain Intel OoO designs are immune to Meltdown as it’s a specific CPU design and validation problem, rather than a general CPU architecture problem.
Unfortunately, x86 has many different flows through the Memory Execution Unit. For example, certain instructions like MOVNTDQA have different memory ordering and flows in the L1D$ than a standard cacheable load. Haswell Transactional Synchronization Extensions and locks add even additional complexity to validate correctness. Instruction fetches go through a different path than D-side loads. The validation state space is very large. Throw in all the bypass networks and then you can see how many different places where fault checks need to be validated.
One thing is for certain, caching and speculation are not going away anytime soon. If it is a logic bug, it may be a simple fix for future Intel CPUs.
Why is this attack easier today?
Let’s say there is an instruction that enable loads to hit and consumed even in presence of faults, or I am wrong on the above catch, why is it happening now rather than discovered decades ago.
Today’s CPUs have much deeper pipelines cough Prescott cough which provides a wider window between the speculative memory access and the actual nuke/squash of those faulting accesses. Faulting instructions are not handled until the instruction is up for retirement/committal to processor architectural state. Only at retirement is the pipeline nuked. Long pipelines allow for a large window between the execution of the faulting instruction and the retirement of it allowing other speculative instructions to race ahead.
Larger cache hierarchy and slower memory fabric speed relative to fast CPU only ops such as cache hit/integer ops which provides a much larger time difference in cycles between a cache hit and a cache miss/memory to enable more robust cache timing attacks. Today’s large multi-core server CPUs with elaborate mesh fabrics to connect tens or hundreds of cores exaggerates this.
Addition of performance enhancing features for fine granularity cache control such as the x86 CLFLUSH and PREFETx give more control for cache timing attacks.
Wider issue processors that enable parallel integer, floating, and memory ops simultaneously. One could place long floating point operations such as divide or sqrt right before the faulting instruction to keep the core busy, but still keeping the integer and memory pipelines free for the attack. Since the faulting instruction will not nuke the pipeline until retirement, it has to wait until any earlier instructions in the instruction sequence to be committed including long running floating point ops.
Virtualization and PaaS. Many web scale companies are now running workloads cloud providers like AWS and Azure. Before cloud, Fortune 500 companies would run their own trusted applications on their own hardware. Thus, applications from different companies were physically separated, unlike today. While it is unknown if Meltdown can allow a guest OS to break into the hypervisor or host OS, what is known is that many virtualization techniques are more lightweight than full blown VT-x. For example, multiple apps in Heroku, AWS Beanstalk, or Azure Web apps along with Docker containers are running within the same VM. Companies no longer spin up a separate VM for each application. This could allow a rogue application to read kernel memory of the specific VM. Shares resources were not a thing in the 90’s when OoO execution became mainstream with the Pentium Pro/Pentium III.
The use of the Global and User/Supervisor bits in x86 paging entries which enables the kernel memory space to be mapped into every user process (but protected from Ring3 code execution) to reduce pressure on TLBs and slow context switching to a separate kernel process. This performance optimization has been done since the 1990’s.
Is this x86 specific?
First of all, cache timing attacks and speculative execution is not specific to Intel or x86 CPUs. Most modern CPUs implement multi-level caches and heavy speculation outside of a few embedded microprocessors for your watch or microwave.
This isn’t an Intel specific problem or a x86 problem but rather a fundamental problem in general CPU architecture. There are now claims that specific OoO ARM CPUs such as those in iPhones and smartphones exhibit this flaw also. Out of order execution has been done since being introduced by Tomasulo algorithm. At the same time, cache timing attacks have been known for decades as it’s been known stuff may be loaded into caches when it shouldn’t have. However, cache timing attacks have traditionally been used to find the location of kernel memory rather than the ability to actually read it. It’s more of a race condition and window that is enabled depending on the microarchitecture. Some CPUs have shallower pipelines than others causing the nuke to happen sooner. Modern desktop/server CPUs like x86 have more elaborate features from CLFLUSH to PREFETCHTx that can be additional tools to make the attach more robust.
Background on Memory Paging
Since the introduction of paging to the 386 and Windows 3.0, operating systems have used this feature to isolate the memory space of one process from another. A process will be mapped to its own independent virtual address space which is independent from another running processes’s address space. These virtual address spaces are backed by physical memory (pages can be swapped out to disk, but that’s beyond the scope of this post).
For example, let’s say Process 1 needs 4KB of memory thus the OS allocates a virtual memory space of 4KB which has a byte-addressable range from 0x0 to 0xFFF. This range is backed by physical memory starting at the location 0x1000. This means processes 1’s
[0x0-0xFFF] is “mounted” at the physical location
[0x1000-0x1FFF]. If there is another process running, it also needs 4KB, thus the OS will map a second virtual address space for this Process 2 with the range 0x0 to 0xFFF. This virtual memory space also needs to be backed by physical memory. Since process 1 is already using 0x1000-0x1FFF, the OS will decide to allocate the next block of physical memory [0x2000-0x2FFF] for Process 2.
Given this setup, if Process 1 issues a load from memory to linear address 0x0, the OS will translate this to physical location 0x1000. Whereas if Process 2 issues a load from memory to linear address 0x0, the OS will translate this to physical location 0x2000. Notice how there needs to be a translation. That is the job of the page tables.
An analogy in the web world would be how two different running Docker containers running on a single host can mount the same
/data dir inside the container to two different physical locations on the host machine
A range of mapped memory is referred to as a page. CPU architectures have a defined page size such as 4KB. Paging allows the memory to be fragmented across the physical memory space. In our above example, we assumed a page size of 4KB, thus each process only mapped one page. Now, let’s say Process 1 performs a malloc() and forces the kernel to map a second 4KB region to be used. Since the next page of physical memory [0x2000-0x2FFF] is already utilized by Process 2, the OS needs to allocate a free block of physical memory [0x3000-0x3FFF] to Process 1 (Note: Modern OS’s use deferred/lazy memory allocation which means virtual memory may be created before being backed by any physical memory until the page is actually accessed, but that’s beyond the scope of this post. See x86 Page Accessed/Dirty Bits for more).
The address space appears contiguous to the process but in reality is fragmented across the physical memory space:
|Virtual Memory||Physical Memory|
|Virtual Memory||Physical Memory|
There is an additional translation step before this to translate logical address to linear address using x86 Segmentation. However, most Operating Systems today do not use segmentation in the classical sense so we’ll ignore it for now.
Besides creating virtual address spaces, paging is also used as a form of protection. The above translations are stored in a structure called a page table. Each 4KB page can have specific attributes and access rights stored along with the the translation data itself. For example, pages can be defined as read-only. If a memory store is executed against a read-only page of memory, a fault is triggered by the CPU.
Straight from the x86 Reference Manual, the following non-exhaustive list of attribute bits (which behave like boolean true/false) are stored with each page table entry:
|P||Present||must be 1 to map a 4-MByte page|
|R/W||Read/write||if 0, writes may not be allowed to the page referenced by this entry|
|U/S||User/supervisor||if 0, user-mode accesses are not allowed to the page referenced by this entry|
|A||Accessed||indicates whether software has accessed the page referenced by this entry|
|D||Dirty||indicates whether software has written to the page referenced by this entry|
|G||Global||if CR4.PGE = 1, determines whether the translation is global, ignored otherwise|
|XD||Execution Disable||If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by this entry; see Section 4.6); otherwise, reserved (must be 0)|
Minimizing context switching cost
We showed how each process has it’s own virtual address mapping. The kernel process is a process just like any other and also has a virtual memory mapping.
When the CPU switches context from one process to another process, there is high switching cost as much of the architectural state needs to be saved to memory so that the old process that was suspended can resume executing with the saved state when it starts executing again.
However, many system calls need to be performed by the kernel such as I/O, interrupts, etc. This means a CPU would constantly be switching between a user process and the kernel process to handle those system calls.
To minimize this cost, kernel engineers and computer architects map the kernel pages right in the user virtual memory space to avoid the context switching. This is done via the User/Supervisor access rights bits. The OS maps the kernel space but designates it as supervisor (a.k.a. Ring0) access only so that any user code cannot access those pages. Thus, those pages appear invisible to any code running at user privilege level (a.k.a. Ring3)
while running in user-mode, if a CPU sees a instruction access a page that requires supervisor rights, a page fault is triggered. In x86, page access rights are one of the paging related reasons that can trigger a #PF (Page Fault).
The Global bit
We showed how each process has it’s own virtual address mapping. The kernel process is a process just like any other and also has a virtual memory mapping. Most translations are private to the process. This ensures Process 1 cannot access Process 2’s data since there won’t be any mapping to [0x2000-0x2FFF] physical memory from Process 1. However, many system calls are shared across many processes to handle a process making I/O calls, interrupts, etc.
Normally, this means each process would replicate the kernel mapping putting pressure on caching these translations and higher cost of context switching between processes. The Global bit enables these certain translations (i.e. the kernel memory space) to be visible across all processes.
It’s always interesting to dig into security issues. Systems are now expected to be secure unlike in the 90’s, AND is only becoming much more critical with the growth of crypto, biometric verification, mobile payments, and digital health. A large breech is much more scary for consumers and businesses today that in the 90’s. At the same time, we also need to keep discussion going on new reports.
The steps taken to trigger the Meltdown vulnerability have been proven by various parties. However, it probably isn’t the mere act of having speculation and cache timing attacks that caused Meltdown nor is a fundamental breakdown in CPU architecture, rather seems like a logical bug that slipped validation. Meaning, speculation and caches are not going away anytime soon, nor will Intel require an entirely new architecture to fix Meltdown. Instead, only change needed in future x86 CPUs is a few small gate changes to the combinational logic that affects determining if a hit in L1D$ (or any temporary buffers) is good.