ZombieLoad: How Intel’s Latest Side Channel Bug Was Discovered and Disclosed

Daniel Gruss, the researcher behind Spectre, Meltdown – and most recently, ZombieLoad – Intel CPU side channel attacks, gives an inside look into how he discovered the flaws.

The release of a new speculative execution vulnerability called ZombieLoad last week follows a similar disclosure path as Meltdown and Spectre. Threatpost caught up with one of the researchers behind the discovery of ZombieLoad to find out how.

ZombieLoad was discovered and reported by Michael Schwarz, Moritz Lipp and Daniel Gruss from the Graz University of Technology (known for their previous discoveries of similar attacks, including Meltdown). Gruss sat down with Threatpost to share the story behind ZombieLoad – from the origins of its name, to how his team discovered the attack and why more researchers are taking side-channel speculative execution flaws seriously.

Below find a lightly-edited transcript of the podcast.

Lindsey O’Donnell: This is Lindsey O’Donnell with Threatpost, and I’m here today on the Threatpost podcast with Daniel Gruss, a security researcher at the Graz University of Technology in Austria. Daniel, how are you doing today?

Daniel Gruss: Good. Thanks for having me.

LO: Thank you for coming on, and glad to hear that. For our listeners who have been keeping up with the headlines this week, Daniel and some of his fellow researchers discovered the Zombieload attack that impacts Intel CPUs, and that was disclosed this past week. And going back a bit further, his research team was also one of the four teams that discovered the Meltdown and Spectre bugs disclosed in early 2018.

So, Daniel, we have a lot to discuss today regarding speculative execution and side channel attacks. But just to start, I have to ask, where did the name Zombieload come from?

DG: Yeah. We always try to come up with names that somehow resemble the nature of the attack. For instance, with Spectre, this was an idea by Paul Kocher from the team that we collaborated with. The reasoning behind the name was that Spectre is something that is a bit… it’s not a nice spectre. It’s one that holds a branch in a hand ready to hit someone, so it’s really a nasty spectre. Spectre is also something that might haunt you, and we believe that the Spectre attack will haunt us for several years. And so, we won’t have this one solved in several years.

For Meltdown, this was something where we saw it is something really dangerous. It has a huge impact right now, but as soon as we have fixed it, it’s not a problem anymore. As soon as we have fixed it, we can forget about it, basically. This is exactly also what we thought is the Meltdown attack. We discovered this. We had a counter measure. As the counter measure was deployed, no one had to worry about the attack anymore.

With Zombieload, it’s a bit different. It’s not a spectre, so it’s not something that will haunt us and it’s also not a meltdown, which is a very, very significant, imminent threat. But the Zombieload is rather something that you suddenly discover maybe in a cellar, maybe some loads rising from their graves. Also, it’s difficult to kill. It’s much more difficult to kill than the Meltdown attacks.

The nature of the attack is also something which fits the name very well. So what happens in the processor is that the processor, for some reason, has to send multiple load requests out to load data. We only want to do one data request. We only want to load data once, but because the processor doesn’t really know what should happen now and it’s doing that in some optimistic, opportunistic way, the processor will send out multiple loads, usually an assisting load, an additional load. This additional load will then cause data leakage, and this load doesn’t do anything very meaningful because it’s already clear that this doesn’t have the right data. That is also the reason why the processor says, “Okay, let it just finish with this data here. I don’t care what data it is. It will not finish anyway, but we can fix that up at a later point.”

This is why we call it the Zombieload, because it runs a bit headless around and loads data that it shouldn’t load and provides it to us then.

LO: Right. Yeah. That’s really interesting, and that is a very good description of the technical details behind the attack as well. Before we get into those pretty technical details, I’m curious, Daniel, about the story behind this vulnerability and how your research team first discovered it. Can you give us some context into that a little bit as well as the process of disclosure?

DG: I can reach a bit back there and include the discovery of Meltdown as well. In 2016, early 2016… I, already, since 2010, I teach systems here at Graz University of Technology. Something that I have to look at every year is the student submissions where they implement things around the address translation from virtual addresses that normal processes see, to physical addresses that normal processes don’t see. They are not aware of these physical addresses. They don’t see that and they usually also don’t care about the physical addresses. They are completely transparent to a user process.

I was wondering, because the students, they have to run through multiple steps there and have to translate the address step by step from virtual to physical, and that takes several computation steps for the students. I was thinking, well, there are processor instructions that also have to do that, which don’t do anything else. For instance, the prefetch instruction. The prefetch instruction prefetches a virtual address into the cache. To do that, it has to translate a virtual address into a physical address. So I thought maybe I should measure whether the runtime of the prefetch instruction varies, depending on where the translation terminates, because the students can also abort early in some cases.

If they are already finished with the translation, then it takes them a bit less time during the exam. They save some time, it’s good. Why wouldn’t the processor do the same? Saving time is always good. It turned out the processor does that. So the processor really opts out early and you can see the timing difference and you can see which pages are four-kilobyte pages, which pages are two-megabyte pages, which pages are one-gigabyte pages. So you can see this structure, how this is mapped into a process just by looking at the response time, the latency of the prefetch instruction.
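As a rough illustration of the early-exit idea, the toy Python model below counts table lookups instead of measuring real prefetch latency. It assumes x86-64 style 4-level paging, where a 1 GiB mapping ends the walk at the PDPT level, a 2 MiB mapping at the PD level and a 4 KiB mapping at the PT level. The function names and the use of a step count as a stand-in for latency are assumptions made for this sketch, not the researchers' measurement code.

```python
# Toy model of a 4-level page-table walk that stops early at huge pages.
# The number of lookups stands in for the latency an attacker would measure.

LEVELS = ["PML4", "PDPT", "PD", "PT"]

# Level at which the walk terminates for each page size (x86-64, 4-level paging).
TERMINATING_LEVEL = {"1GiB": "PDPT", "2MiB": "PD", "4KiB": "PT"}


def walk_steps(page_size):
    """Return how many table lookups a walk needs for the given page size."""
    last = TERMINATING_LEVEL[page_size]
    return LEVELS.index(last) + 1


def guess_page_size(observed_steps):
    """Invert the model: infer the page size from the observed walk length."""
    for size, level in TERMINATING_LEVEL.items():
        if LEVELS.index(level) + 1 == observed_steps:
            return size
    return "unknown"


if __name__ == "__main__":
    for size in ("4KiB", "2MiB", "1GiB"):
        steps = walk_steps(size)
        print(size, "->", steps, "lookups ->", guess_page_size(steps))
```

In a real measurement the attacker only sees a latency, so the inversion step would compare timings against calibrated thresholds rather than exact step counts.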

We did that also then on kernel memory, on memory that belongs to the operating system, that we usually wouldn’t be able to access. If we would try that, our process would crash. Then we tried that and the prefetch instruction cannot crash. It’s only a hint to the processor. The processor can ignore it. But it turned out that it perfectly returns the latency also for these inaccessible kernel addresses. Someone else also discovered that, Anders Fogh. He blogged about it and then we decided to write a paper together. We submitted this paper to Black Hat and to the CCS, ACM CCS Conference in 2016.

We presented this together at Black Hat USA, and there I shared a room with Anders because it’s cheaper to share a room. There we had discussions around this attack and what else you could do. We were wondering if you don’t do this with a prefetch instruction, but with regular memory access instructions, would you be able to get the value from the kernel address and not just see the timing difference? But then we said, “No, if that would be possible, they would have discovered that long ago.” I mean, that would be so obvious to try. We didn’t even give it a shot until 2017.

Anders tried it in July 2017. By then, Jann Horn from Google Project Zero had already discovered this, had already discovered that Meltdown actually works, but that was not public information at that point. So Anders was trying it and then he blogged about it later on and said, “Unfortunately it didn’t work.” Then I think I sent him a message on Twitter and said, “I told you, this doesn’t work. I’m not surprised.” Of course, I was wrong. We later on tried this ourselves in December because we had announced a student project to look at that because we thought, why not, let a student look at that and maybe it works, probably not. But maybe it works and then we have a very nice student thesis.

Then we realized that, oh, if this works, then we have a real problem because we don’t know whether a student could handle the embargo situation that would arise there. And we said, “Okay maybe we should try that on our own first before we let a student do that.” A student already signed up for the project so we were a bit in a hurry. And then we tried it and were shocked that it works. We told the student, “Unfortunately the project is not available any more. You have to pick a different one.” And the student didn’t really know what was going on.

LO: Oh, wow. That’s really interesting that students were wrapped up in this as well. Really impressive.

DG: Yeah. The student later on approached us and said, “Ah, now I know why you told me it’s not available anymore.”

LO: Exactly. Did you then reach out to Intel about that?

DG: Yes. Yes. That was in December 2017 then. The embargo was already running for a long time there. The disclosure of Meltdown was scheduled for, I think it was, the 9th of January. And then it broke early, on the 3rd of January.

Right from the beginning, for Meltdown, we saw leakage that was not in the cache, data that was not in the cache. And if it’s not in the cache, then the question was where does the data come from? And Intel didn’t believe us that this is possible for a long time. Our collaborators back then in the project didn’t believe us that it was working. It took us quite some time to convince them. At some point, we had a POC that worked for our collaborators. Then they saw, okay, it doesn’t work well, but it works a bit. And yes, this definitely leaks data from memory locations which are not in the cache. Not well, but very, very slowly, a bit.

Then we also, in our communication with Intel, always made sure that we say we can leak data that is not in the L1 cache, definitely. In March, that was in 2018, March 28, 2018, because we had exchanged a few emails with Intel, we sent an email to them, because they requested it, with a POC which would prove that we can leak data that is not in the cache, because we marked the memory as uncacheable. It can never be in the L1 cache. It’s uncacheable. We sent this POC to Intel with a few more mails back and forth. We sent them also an explanation in May. That was the 30th of May, 2018, we sent them a mail clarifying that our POC from the 28th of March leaks the data from the line fill buffer. We explicitly wrote in the email, this data is leaked from the line fill buffers.

From this point on, Intel must have been aware that you can leak data from the line fill buffers and they had the first POC that was leaking data from the line fill buffers from March 28. This was basically the first effect that connects to the Zombieload attack.

LO: Right. That’s really interesting. Were you aware that over the past year, that some of the other new types of speculative execution flaws within this new class such as Fallout and Rogue In-flight Data Load and the other ones that were relying on memory that wasn’t in the cache, were you aware of those as well?

DG: No, we weren’t, and Intel didn’t inform us. This is also something we are not really happy about. We understand that it’s really a complex topic but we were not informed until early 2019 that others had also discovered that.

LO: Yeah. That’s interesting.

DG: This is not very nice.

LO: Yeah. Yeah, for sure. I’m curious too, how was this time around separate from when you discovered and publicly disclosed with Intel the Meltdown and Spectre attacks, because it seems to me that… it’s been more than a year later. It seems that maybe chip manufacturers might not any longer be caught by surprise when it comes to these types of flaws. I mean, is that what you’re seeing or do you think that chip manufacturers continue to… are they more prepared for this type of attack or do you think that this is always going to be a big issue?

DG: I think they are not really… they didn’t realize yet that this is a situation that will stay like that. When we had the first software vulnerabilities, people also didn’t think that… or maybe some people thought. I don’t know. That is already a very long time ago so it’s difficult to really figure what people thought back then. But I think it would have been ridiculous to think that, oh, yeah, sure we have this software flaw, but there won’t be any software flaws in the future, because we always introduce bugs.

The way processors are written is by writing code. You just write code. It’s a large amount of source code. Why wouldn’t this source code also contain bugs and assumptions that are not always true? Especially if it comes to security and information leakage, this is much, much more difficult to reason about than functional correctness, because for functional correctness, you can always check whether the operation does the right thing or not and you can, in the worst case, maybe go down to checking all the return values that something could have, checking all the effects that an instruction could have that are architecturally visible.

But if it’s about information leakage, it’s less clear because why is it now bad that you load something into a register? Because someone else from some completely different team in the company will forward this data in the execution unit that they are working on, to subsequent instructions that can work with the data. So it’s a problem of complexity, a bit.

LO: Right. That’s a really good point. I want to get into what Zombieload actually is. I know, like you said, this attack centers around a flaw in the fill buffer of Intel CPUs and allows an attacker to steal sensitive data and keys while the computer accesses them. But could you break down the… I know there were four different attack scenarios that were tied to Zombieload. Can you outline what those are and what an attacker would need specifically to carry out an exploit against this type of flaw?

DG: I don’t have the paper in front of me right now, but I can run you through a few attack examples, of course.

LO: Just from a high level.

DG: The basic problem, yes, this… circles around the microarchitectural buffers. It may be the fill buffer, the line fill buffer, may be the load buffer, may be the store buffer. These are generally the buffers that we work with here.

The line fill buffer is used to exchange data between the Level 1 cache and the other caches as well as the main memory, and also in some cases, the core. Also, if you access, for instance, uncacheable memory, it will enter the line fill buffer, but it will not enter the L1 cache.

Then you have the load buffer. Maybe I can briefly mention what these buffers do. So the line fill buffer is used whenever you want to load data into the Level 1 cache, for instance. It basically can store a cache line plus the metadata. That would be something like 64 bytes plus metadata. And as soon as a place in the Level 1 cache becomes free, you can copy the data from the line fill buffer into the L1 cache and then use it from there. But before that, of course, you can also already leak it from the line fill buffer.

Now, the counter measures against Meltdown and Spectre… not so much Spectre, Meltdown and Foreshadow, were focused on the L1 cache, because Intel’s position always, in the beginning, as was also clear from the communication we had with them, was that data needs to be in the Level 1 cache, which was never correct, which we told them in the beginning. But they assumed that it has to be in the L1 cache. Yes.

The line fill buffer has several entries, something like maybe 10 entries, maybe a bit more. It’s not entirely clear how many entries they have on different processors. The documentation there is not really much and not necessarily correct as we figured. And these entries can be valid or invalid. If an entry is invalid, it can happen that the processor still matches it if the load request cannot be served anyway.

For instance, if I access some address and the address is not valid, or for some other reason the processor decides that this load request cannot be served, for instance, because it has to do something else before that. So, for some reason, it has to cancel basically this load request, this request for data. Then the processor says, “Well, maybe we should just return whatever we have in the line fill buffer. The first thing that matches is fine.” And if we match something, then we will get the data that someone else has put in the line fill buffer before. And this can be anyone having access to the same line fill buffer, so a process that was running in parallel on a different hyper-thread. It could be from a different context, maybe from the kernel or from the sandbox, and within the sandbox you can leak it or from user space you can leak the kernel value that was put there before. This crosses all isolation boundaries.
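The behavior described here can be pictured with a deliberately simplified Python model. It is a sketch under loud assumptions: `ToyLineFillBuffer`, its ten entries and its "first entry that still holds data" rule are all hypothetical and only mirror the description in this interview, not the real, largely undocumented hardware.

```python
from dataclasses import dataclass

LINE_SIZE = 64  # bytes per cache line, as mentioned in the interview


@dataclass
class Entry:
    owner: str = ""
    data: bytes = bytes(LINE_SIZE)
    valid: bool = False


class ToyLineFillBuffer:
    """Hypothetical model: a small buffer shared by both hyper-threads."""

    def __init__(self, entries=10):
        self.entries = [Entry() for _ in range(entries)]
        self.next_slot = 0

    def fill(self, owner, data):
        """A legitimate load allocates an entry and marks it valid."""
        slot = self.next_slot
        self.entries[slot] = Entry(owner, data.ljust(LINE_SIZE, b"\0"), True)
        self.next_slot = (slot + 1) % len(self.entries)
        return slot

    def retire(self, slot):
        """The line moves on to L1: the entry is invalidated, the data is not wiped."""
        self.entries[slot].valid = False

    def faulting_load(self):
        """Model of the flaw: a load that will be cancelled anyway is handed
        the first entry that still holds data, whether it is valid or not."""
        for entry in self.entries:
            if entry.owner:
                return entry.data
        return bytes(LINE_SIZE)


if __name__ == "__main__":
    lfb = ToyLineFillBuffer()
    slot = lfb.fill("victim", b"secret key material")
    lfb.retire(slot)                      # entry is now invalid
    print(lfb.faulting_load().rstrip(b"\0"))
```

The point of the model is only that invalidating an entry and erasing its contents are different things. The attacker never names the victim's address and simply receives whatever stale line is lying around.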

LO: I mean, looking at how this attack plays out, which is looking more at the buffer components of the CPUs versus Meltdown and Spectre which was data in the cache as you’re saying, were Intel’s mitigations against these types of flaws different than they were for Spectre and Meltdown? I mean, what’s the best way to protect yourself against this? Is it just disabling hyper-threading?

DG: For Meltdown, we already had a defense. That was the KAISER patch. In Linux, we proposed this under the name KAISER, which is an acronym for a longer sentence explaining that we want to eliminate the side channels. Also, Kaiser is a German word: the emperor penguin is called the Kaiserpinguin. It’s the largest penguin, and we thought it’s super nice to use this for Linux, since the penguin is the mascot of Linux, of course.

LO: That makes total sense. I love that.

DG: They didn’t like it. They changed the name to PTI. I don’t understand it. Yeah. It’s really sad but they changed it and, yes, then shipped it as PTI. And what PTI does, or what the idea of KAISER is, is you don’t trust the isolation between user and kernel anymore. The processor provides you with one bit there which says whether a page, a memory location, can be accessed from the user process or whether it can only be accessed from the kernel. This bit apparently was not reliable. In Meltdown, we exploited this and accessed the kernel memory, and the fix for that was the KAISER patch, which works against the Meltdown attack.
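The difference between trusting the user/kernel bit and removing kernel pages from the user view can be sketched in a few lines of Python. This is a toy model under stated assumptions: the two addresses, the dictionary-based "page tables" and the function names are hypothetical, and real PTI still keeps a small set of kernel pages mapped for entering and leaving the kernel, which the sketch ignores.

```python
KERNEL_SECRET = 0xFFFF_8000_0000_1000  # hypothetical kernel virtual address
USER_BUFFER = 0x0000_7000_0000_0000    # hypothetical user virtual address


def build_page_tables(kaiser):
    """Return (user_view, kernel_view), each mapping address -> {'user': bool}."""
    kernel_view = {
        USER_BUFFER: {"user": True},
        KERNEL_SECRET: {"user": False},
    }
    if kaiser:
        # KAISER/PTI idea: the view active in user mode omits kernel pages.
        user_view = {a: e for a, e in kernel_view.items() if e["user"]}
    else:
        # Without it, one table serves both modes and only the bit protects.
        user_view = kernel_view
    return user_view, kernel_view


def architectural_access(view, addr):
    """What the architecture promises: user code may only touch user pages."""
    entry = view.get(addr)
    return entry is not None and entry["user"]


def meltdown_style_access(view, addr):
    """Toy flaw: transiently the user bit is ignored, but the page must be mapped."""
    return addr in view


if __name__ == "__main__":
    for kaiser in (False, True):
        user_view, _ = build_page_tables(kaiser)
        print("kaiser =", kaiser,
              "| leak possible:", meltdown_style_access(user_view, KERNEL_SECRET))
```

With the single shared table the permission bit is the only barrier, and the toy flaw walks straight past it. Once the kernel page is absent from the user view, there is no mapping left to abuse.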

It also works against the Meltdown attack on uncacheable memory because if the memory is just not there, you cannot leak anything from there. The next step was the Foreshadow attack, which became public in August 2018. The Foreshadow attack showed that it’s not just this bit which decides whether it’s a user or a kernel page, but it’s also the present bit saying whether this is a valid address at all. Now, if you try to access an address which is not valid at all, the processor will still try to access some data because it already knows that it will raise a fault anyway, so it can also continue. The user won’t be able to access the data anyway. That’s the basic assumption they make.

What the processor then does is it returns just any data that fits approximately from the Level 1 cache, which fits based on the physical address and the virtual address. But the physical address here can be under full control of the user. For instance, in case you’re running a virtual machine, the physical address would be under full control of the guest, in this case. And that means that the guest virtual machine can read any memory location of the host. Against that, you need different defenses, of course, because in the virtual machine, there is no… this is not about the kernel to user isolation anymore.

The solution they proposed back then was, first, either disable hyper-threading or don’t schedule mutually untrusted workloads on the same core, so on different hyper-threads of the same core. Second, if you switch from the hypervisor, so from a virtual machine host to the virtual machine guest, then you have to flush the entire Level 1 cache. This is a lot of performance overhead, but it’s something that you can do.

Microsoft optimized this a bit. They said, “Well, it’s enough if we track all the memory locations that we accessed while we were in the host and only flush those before we return to the guest,” which is also a valid assumption and it’s a bit faster, of course, than flushing the entire cache. So it makes sense. Then this problem was also solved if you choose one of these ways to mitigate it.
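The trade-off between flushing everything and flushing only what the host touched can be illustrated with a small Python model. It follows the description given in this interview and makes no claim about how any hypervisor actually implements it. `ToyL1` and its methods are hypothetical names, and the number of flushed lines stands in for the performance cost.

```python
class ToyL1:
    """Hypothetical L1 model that remembers which lines the host touched."""

    def __init__(self):
        self.lines = {}               # address -> owner of the cached data
        self.touched_by_host = set()  # addresses accessed while in the host

    def access(self, owner, addr):
        self.lines[addr] = owner
        if owner == "host":
            self.touched_by_host.add(addr)

    def flush_all(self):
        """Full flush on host-to-guest switch; returns lines flushed."""
        flushed = len(self.lines)
        self.lines.clear()
        self.touched_by_host.clear()
        return flushed

    def flush_host_lines(self):
        """Selective flush: evict only the lines the host accessed."""
        flushed = 0
        for addr in self.touched_by_host:
            if self.lines.pop(addr, None) is not None:
                flushed += 1
        self.touched_by_host.clear()
        return flushed


if __name__ == "__main__":
    l1 = ToyL1()
    for addr in (0x1000, 0x1040, 0x1080):
        l1.access("guest", addr)
    for addr in (0x9000, 0x9040):
        l1.access("host", addr)
    print("selective flush evicted", l1.flush_host_lines(), "lines")
    print("still cached:", sorted(hex(a) for a in l1.lines))
```

In this model the guest's own lines stay cached after the selective flush, which is where the speed-up over a full flush would come from.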

LO: Yeah, I know.

DG: The cache or flushing parts of the cache as necessary, not running any mutually untrusted workloads in parallel on the same core or entirely disabling hyper-threading. But the mitigations from the Foreshadow attack, they don’t fully work anymore now because the assumption was wrong, that the data has to be in L1. It’s enough if the data is in a line fill buffer or if there is something in the load buffer or the store buffer, which would pull data into an attacker-controlled register.

This bit is what’s new about these new attacks. The load buffer, the store buffer, and the line fill buffer, they behave pretty similar to regular caches. So yes, entries are there, you allocate those entries, and at some point, these entries will be invalid. The problem here is that the processor may use stale entries, entries that are not valid anymore, and then read data that it is not supposed to read, based on the state of these buffers.

The solution is not just flushing the Level 1 anymore. That is something that you still need to do. But additionally, you now have to also make sure that you evict all of these buffers upon every context switch and still not schedule mutually untrusted workloads on the same core, or disable hyper-threading. So this comes as an addition to what you had to do before.
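Why the older L1-only mitigation falls short can be shown with a minimal Python sketch. It assumes a hypothetical `ToyCore` whose state is just two lists, one for the L1 cache and one for the fill, load and store buffers lumped together. It models context switches on a single hardware thread only.

```python
class ToyCore:
    """Hypothetical core state: what a previous context left behind."""

    def __init__(self):
        self.l1 = []       # lines left in the L1 data cache
        self.buffers = []  # lines left in fill/load/store buffers

    def run(self, context, secret):
        """Running a context leaves its data in both structures."""
        self.l1.append((context, secret))
        self.buffers.append((context, secret))

    def context_switch(self, flush_l1, clear_buffers):
        """Apply the chosen mitigations before the next context runs."""
        if flush_l1:
            self.l1.clear()
        if clear_buffers:
            self.buffers.clear()

    def residue(self):
        """Everything a later context could, in this model, still sample."""
        return self.l1 + self.buffers


if __name__ == "__main__":
    for clear_buffers in (False, True):
        core = ToyCore()
        core.run("host", "key")
        core.context_switch(flush_l1=True, clear_buffers=clear_buffers)
        print("clear_buffers =", clear_buffers, "| residue:", core.residue())
```

The sketch says nothing about a sibling hyper-thread sampling the buffers while the victim is still running, which is why the scheduling restriction or disabling hyper-threading remains part of the advice.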

LO: I’m curious, taking a step back, obviously there was Spectre and Meltdown and then there was Foreshadow and a few other types of… these types of attacks that were discovered just in the past year. What is the future of this type of flaw? I mean, do you think that we’ll only discover more side-channel, speculative-execution types of flaws in CPUs in the future? Where is this going in the future?

DG: Going forward, yes, I think we are going to see more and more of these vulnerabilities. Back in 2017, I think there was only a handful of people looking at this area, maybe 10 people, 20 people by the start of the year, maybe a bit more at the end of the year. And then maybe in 2018, this exploded a bit. Maybe now worldwide, maybe 100 people, maybe a few hundred people are looking at that. I think as more flaws are discovered, more people realize that this is an area that you should look at, and more people will look at that in the future. I think this area will establish itself similar to other areas like software-based attacks, software flaws.

Looking for software flaws is something that we today do all the time. It’s become super normal that if we write software, that we search for bugs in the software. And everyone does that. People are looking for bugs in open source software all the time.

LO: Yes, that’s definitely a good point. And I do think that we will see more people who are interested in looking for these types of flaws. Certainly, I’m excited to see more work that you have as well in the future. So, Daniel, thanks so much for coming on to the Threatpost podcast today.

DG: Thanks for having me.

LO: Great. And once again, this is Lindsey O’Donnell with Threatpost here with Daniel Gruss. Catch us next week on the Threatpost podcast.
