The lack of speculation in cores such as the ARM1176 used by the Raspberry Pi spares us from these exploits.

------ [Introduction] ------

Amid the panic, a little reassurance: Raspberry Pi is different!

In the past few days, discussion of the Meltdown and Spectre security vulnerabilities has been everywhere. Meltdown affects all modern Intel processors, and Spectre also affects AMD processors and ARM cores. The Spectre vulnerability allows an attacker to bypass software checks and read data from anywhere in the current address space; the Meltdown vulnerability allows an attacker to read data from anywhere in the operating system's kernel address space (which is normally inaccessible to user programs). Both vulnerabilities exploit performance features of modern processors (caching and speculative execution) to leak data via a side-channel attack. Raspberry Pi founder Eben Upton recently said the Raspberry Pi is not affected by these vulnerabilities, and wrote an article explaining why.

The vulnerabilities discovered by the Google Project Zero team are called "Meltdown" and "Spectre" respectively. They allow malicious programs to steal information from the memory of other programs, which means a malicious program can snoop on passwords, account information, keys, and in principle anything stored in another process's memory.

Meltdown affects Intel processors and breaks the most fundamental isolation between user applications and the operating system. This attack lets a program access the memory of the operating system and of other programs, which can lead to data leakage. Spectre affects not only Intel processors but also a large number of AMD and ARM-based processors, which means that beyond servers and personal computers, devices such as smartphones are also affected: almost no modern computer processor is spared. Spectre breaks the isolation between different applications, so an attacker can use a malicious program to obtain supposedly isolated private data.

Intel recently said that software patches will be released over the next few weeks. The security patches can slow the processor down by anywhere from 0% to 30%, although most PC users will see little impact.

According to Eben Upton, many cheap computing devices such as the Raspberry Pi, including many low-end Android phones, are not affected by either vulnerability.

The lack of speculation in cores such as the ARM1176 used by the Raspberry Pi spares us from these exploits.

His article introduces some of the concepts of modern processor design, illustrated with simple Python-style programs such as:

t = a+b
u = c+d
v = e+f
w = v+g
x = h+i
y = j+k

Although your computer's processor does not execute Python directly, the statements here are simple enough to correspond roughly to single machine instructions. The article glosses over some important details of modern processor design (chiefly pipelining and register renaming) that are not essential to understanding how Spectre and Meltdown work.

For a comprehensive look at processor design and modern computer architecture, see Hennessy and Patterson's classic book, Computer Architecture: A Quantitative Approach.

What is a scalar processor?

The simplest modern processor executes one instruction per cycle; we call this a scalar processor. The example above takes six cycles on a scalar processor.

The Intel 486 and the ARM1176 used in the Raspberry Pi 1 and Raspberry Pi Zero are both scalar processors.

What is a superscalar processor?

The obvious way to speed up a scalar processor is to increase its clock speed. However, we soon hit limits on how fast the processor's internal logic gates can run, so processor designers began looking for ways to do several things at once.

An in-order superscalar processor examines the incoming instruction stream and attempts to execute more than one instruction at a time across several pipelines, subject to the dependencies between instructions. Dependencies matter: you might think a two-way superscalar processor could simply pair up the six instructions like this:

t, u = a+b, c+d
v, w = e+f, v+g
x, y = h+i, j+k

But this doesn't work: we must compute v before we can compute w, so the third and fourth instructions cannot be executed at the same time. The two-way superscalar processor cannot find any instruction to pair with the third one, so the example executes in four cycles:

t, u = a+b, c+d
v    = e+f        # second pipe does nothing here
w, x = v+g, h+i
y    = j+k

Examples of superscalar processors include the Intel Pentium, and the ARM Cortex-A7 and Cortex-A53 used in the Raspberry Pi 2 and Raspberry Pi 3 respectively. The Raspberry Pi 3 has only a 33% higher clock speed than the Raspberry Pi 2, but roughly twice the performance: part of the extra performance comes from the Cortex-A53's greater ability to pair instructions compared with the Cortex-A7.

What is an out-of-order processor?

Going back to our example, we can see that although there is a dependency between v and w, there are other independent instructions later in the program that could fill the empty pipe in the second cycle. An out-of-order superscalar processor shuffles the order of incoming instructions (again subject to the dependencies between them) to keep each of its pipelines busy.

An out-of-order processor can effectively swap the order of w and x:

t = a+b
u = c+d
v = e+f
x = h+i
w = v+g
y = j+k

allowing it to execute in three cycles:

t, u = a+b, c+d
v, x = e+f, h+i
w, y = v+g, j+k
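The four-cycle versus three-cycle difference can be reproduced with a toy issue model. Everything below (the `schedule` function, the tuple encoding of instructions, the one-cycle result latency) is an illustrative sketch of my own, not how any real processor is built:

```python
def schedule(program, inputs, out_of_order, width=2):
    """Count cycles needed to issue all instructions on a `width`-way
    machine where a result becomes usable the cycle after it is issued."""
    done = set(inputs)          # values available at the start of a cycle
    pending = list(program)     # each instruction is (dest, {source operands})
    cycles = 0
    while pending:
        issued = []
        for dest, srcs in pending:
            if len(issued) == width:
                break
            if srcs <= done:
                issued.append((dest, srcs))
            elif not out_of_order:
                break           # in-order: a stalled instruction blocks the rest
        assert issued, "deadlock: some instruction's inputs are never produced"
        for ins in issued:
            pending.remove(ins)
        done |= {dest for dest, _ in issued}
        cycles += 1
    return cycles

# The six-instruction example from the text: only w depends on another result (v).
program = [("t", {"a", "b"}), ("u", {"c", "d"}), ("v", {"e", "f"}),
           ("w", {"v", "g"}), ("x", {"h", "i"}), ("y", {"j", "k"})]
inputs = set("abcdefghijk")

print(schedule(program, inputs, out_of_order=False))  # 4
print(schedule(program, inputs, out_of_order=True))   # 3
```

The in-order machine stalls behind w in the third cycle; the out-of-order machine pulls x forward to fill the gap, just as in the shuffled listing above.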

Examples of out-of-order processors include the Intel Pentium 2 (and most subsequent Intel and AMD x86 processors, with the exception of some Atom and Quark devices) and many recent ARM cores, including the Cortex-A9, -A15, -A17, and -A57.

What is a branch predictor?

The examples above are linear blocks of code. Real programs are not like this: they also contain forward branches (used to implement conditional operations such as if statements) and backward branches (used to implement loops). A branch may be unconditional (always taken), or conditional (taken or not depending on a computed value).

While fetching instructions, the processor may encounter a conditional branch that depends on a value which has not yet been computed. To avoid a stall, it must guess which instruction to fetch next: the next instruction in memory order (corresponding to the branch not being taken), or the instruction at the branch target (corresponding to the branch being taken). A branch predictor helps the processor make an intelligent guess by accumulating statistics about how often a given branch has been taken in the past.

Modern branch predictors are extremely sophisticated and can produce very accurate predictions. The extra performance of the Raspberry Pi 3 comes in part from improvements in branch prediction between the Cortex-A7 and the Cortex-A53. However, by executing a carefully crafted sequence of branches, an attacker can mistrain a branch predictor into making poor predictions.
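As a concrete illustration, the classic two-bit saturating-counter predictor from the textbooks can be sketched in a few lines (this is a generic teaching model, not the predictor in any particular ARM or Intel core):

```python
class TwoBitPredictor:
    """Two-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 2                 # start in "weakly taken"

    def predict(self):
        return self.counter >= 2         # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

def accuracy(history):
    """Fraction of branches in `history` (list of taken? booleans) predicted correctly."""
    p = TwoBitPredictor()
    hits = 0
    for taken in history:
        hits += (p.predict() == taken)
        p.update(taken)
    return hits / len(history)

# A typical loop branch (taken 9 times, then falls through) is predicted well...
print(accuracy([True] * 9 + [False]))   # 0.9
# ...but an adversarial alternating pattern mistrains the counter badly.
print(accuracy([True, False] * 5))      # 0.5
```

The second case shows the principle an attacker relies on: the predictor's state is just statistics, and a crafted branch history can steer it toward whatever prediction the attacker wants.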

What is speculation?

Reordering sequential instructions is a powerful way to recover instruction-level parallelism, but as processors become wider (able to execute three or four instructions at a time), keeping all the pipelines busy becomes harder. Modern processors therefore also speculate. Speculative execution lets the processor issue instructions that may turn out not to be needed: this keeps the pipelines busy, and if an instruction turns out not to be needed, its result is simply discarded.

Speculatively executing unnecessary instructions (and the infrastructure required to support speculation and reordering) consumes extra energy, but in many cases this is a worthwhile trade for extra single-threaded performance. The branch predictor is used to choose the most likely path through the program, maximizing the chance that the speculation pays off.

To show the benefits of speculation, let's look at another example:

t = a+b
u = t+c
v = u+d

if v:
    w = e+f
    x = w+g
    y = x+h

Now we have dependencies from t to u to v, and from w to x to y, so an out-of-order two-way superscalar processor without speculation can never fill its second pipeline. It spends three cycles computing t, u, and v, after which it knows whether the body of the if statement will execute, and then spends three more cycles computing w, x, and y if it does. Assuming the if (implemented by a branch instruction) takes one cycle, the example takes either four cycles (if v turns out to be zero) or seven cycles (if v is non-zero). If the branch predictor indicates that the body of the if statement is likely to execute, speculation effectively shuffles the program like this:

t = a+b
u = t+c
v = u+d

w_ = e+f
x_ = w_+g
y_ = x_+h

if v:
    w, x, y = w_, x_, y_

So now we have additional instruction level parallelism to keep the pipeline busy:

t, w_ = a+b, e+f
u, x_ = t+c, w_+g
v, y_ = u+d, x_+h

if v:
    w, x, y = w_, x_, y_

Cycle counting becomes less well defined in speculative out-of-order processors, but the branch and the conditional update of w, x, and y are (almost) free, so the example above executes in approximately three cycles.

What is caching?

In the past, processor speed was well matched to memory access speed. My BBC Micro (2MHz 6502) could execute an instruction roughly every 2µs (microseconds), with a memory cycle time of 0.25µs. Over the next 35 years, processors became very much faster, but memory barely changed: a single Cortex-A53 in a Raspberry Pi 3 can execute an instruction roughly every 0.5ns (nanoseconds), but may take up to 100ns to access main memory. At first glance this sounds disastrous. Consider:

a = mem[0]
b = mem[1]

It takes 200ns.

But in practice, programs tend to access memory in relatively predictable ways, exhibiting both temporal locality (if I access a location, I am likely to access it again soon) and spatial locality (if I access a location, I am likely to access nearby locations soon). Caching exploits these properties to reduce the average cost of accessing memory.

A cache is a small on-chip memory, close to the processor, which stores copies of the most recently used locations (and their neighbors) so they are quickly available on subsequent accesses. With the help of the cache, the example above executes in a little over 100ns:

a = mem[0]    # 100ns delay, copies mem[0:15] into cache
b = mem[1]    # mem[1] is in the cache

From the perspective of Spectre and Meltdown, the important point is that if you can time a memory access, you can determine whether the address you accessed was in the cache (short time) or not (long time).
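That hit-or-miss timing difference can be made concrete with a toy cache model. The 1ns/100ns latencies and the 16-byte line size are the illustrative figures used above, not measurements of any real part:

```python
LINE = 16                        # bytes per cache line, as in mem[0:15] above

class Cache:
    HIT_NS, MISS_NS = 1, 100     # illustrative latencies from the text

    def __init__(self):
        self.lines = set()       # which lines currently hold a copy

    def access(self, addr):
        """Return the (simulated) time in ns to read one byte."""
        line = addr // LINE
        if line in self.lines:
            return Cache.HIT_NS
        self.lines.add(line)     # miss: load the line and its neighbours
        return Cache.MISS_NS

    def flush(self):
        self.lines.clear()

cache = Cache()
print(cache.access(0))      # 100 -- miss: first touch of line 0
print(cache.access(1))      # 1   -- hit: mem[1] shares a line with mem[0]
print(cache.access(0x100))  # 100 -- miss: a different line
```

The last two accesses show the side channel in miniature: measuring nothing but the access time tells you whether an address's line was already cached.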

What is a side channel?

From Wikipedia:

"A side channel attack is any attack based on information obtained from the physical implementation of the cryptosystem, not a brute force or theoretical weakness in the algorithm (compared to cryptanalysis). For example, timing information, power consumption, electromagnetic leakage, and even sound. Both can provide an additional source of information that can be used to crack the system."

Spectre and Meltdown are side-channel attacks: they use timing to observe whether another, accessible, location is present in the cache, and from that infer the contents of a memory location that should not normally be accessible.
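The same idea appears in a simpler, non-cache setting. The sketch below is a deliberately insecure string comparison of my own devising; the number of loop steps stands in for elapsed time:

```python
def insecure_equals(secret, guess):
    """Compare byte by byte, exiting at the first mismatch.
    Returns (equal?, steps) -- `steps` stands in for elapsed time."""
    steps = 0
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            return False, steps
    return secret == guess, steps

secret = "hunter2"
_, t1 = insecure_equals(secret, "xxxxxxx")  # wrong at position 0
_, t2 = insecure_equals(secret, "huxxxxx")  # wrong at position 2
print(t1, t2)  # 1 3
```

Both guesses are rejected, yet the running time alone leaks how many leading characters were correct, letting an attacker recover the secret one character at a time. Spectre and Meltdown apply the same principle with cache latency as the clock.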

Putting it all together

Now let's see how speculation and caching combine to permit a Meltdown-like attack. Consider the following example, a user program that sometimes reads from an illegal (kernel) address, causing a fault (crash):

t = a+b
u = t+c
v = u+d

if v:
    w = kern_mem[address]   # if we get here, fault
    x = w & 0x100
    y = user_mem[x]

Now, provided we can train the branch predictor to believe that v is likely to be non-zero, our out-of-order two-way superscalar processor will shuffle the program like this:

t, w_ = a+b, kern_mem[address]
u, x_ = t+c, w_ & 0x100
v, y_ = u+d, user_mem[x_]

if v:                        # fault
    w, x, y = w_, x_, y_     # we never get here

Even though the processor always speculatively reads from the kernel address, it must defer the resulting fault until it knows whether v is non-zero. On the face of it this seems safe, because either:

v is zero, so the result of the illegal read is never committed to w; or

v is non-zero, but the fault occurs before the read result is committed to w.

However, suppose we flush the cache before executing the code, and arrange a, b, c, and d so that v is in fact zero. Now the speculative read in the third cycle:

v, y_ = u+d, user_mem[x_]

accesses either user address 0x000 or user address 0x100, depending on bit 8 of the result of the illegal read, loading that address and its neighbors into the cache. Because v is zero, the results of the speculative instructions are discarded and execution continues. If we subsequently time an access to each of those addresses, we can determine which one is in the cache. Congratulations: you have just read a single bit from the kernel's address space!
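The whole chain (speculative illegal read, cache side effect, timing probe) can be simulated in ordinary Python. This is only a model of the logic described above, not an actual exploit: the names kern_mem, user_mem, and address follow the text, the cache is just a set of line numbers, and the secret value is made up:

```python
LINE = 0x100                          # user_mem[0x000] and user_mem[0x100]
                                      # fall on different cache lines here
kern_mem = {0x1234: 0x1FF}            # pretend kernel secret: bit 8 is set
user_mem = list(range(0x200))

def speculate(cache, address):
    """The shuffled speculative path: the fault is squashed and w_, x_, y_
    are discarded, but the load of user_mem[x_] still touches the cache."""
    w_ = kern_mem[address]            # illegal read (would fault if committed)
    x_ = w_ & 0x100
    cache.add(x_ // LINE)             # user_mem[x_] pulled into the cache
                                      # v is zero, so nothing is committed

def probe(cache):
    """Time accesses to both candidate lines: the fast (cached) one
    reveals bit 8 of the secret."""
    return 0x100 if (0x100 // LINE) in cache else 0x000

cache = set()                         # flush the cache first
speculate(cache, 0x1234)
print(hex(probe(cache)))              # 0x100 -> the stolen bit was 1
```

Repeating this with different bit masks would recover the secret one bit at a time, which is exactly how the real attack scales up.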

The real Meltdown exploit is more complex than this (notably, to avoid having to mistrain the branch predictor, the authors execute the illegal read unconditionally and handle the resulting exception), but the principle is the same. Spectre uses a similar approach to subvert software array-bounds checks.

Conclusion

Modern processors go to great lengths to preserve the abstraction that they are in-order scalar machines accessing memory directly, while in fact using a host of techniques, including caching, instruction reordering, and speculation, to deliver much higher performance than a simple processor could hope to achieve. Meltdown and Spectre are examples of what happens when we reason about security in the context of that abstraction, and then encounter subtle discrepancies between the abstraction and reality.

The lack of speculation in the ARM1176, Cortex-A7, and Cortex-A53 cores used by the Raspberry Pi is what saves us from attacks of this sort.
