Modern microprocessors. Modern microprocessors from AMD

Modern microprocessors are the fastest and most sophisticated integrated circuits in the world. They can perform up to 4 billion operations per second and are manufactured using a variety of process technologies. Since the early 1990s, when processors came into mass use, they have gone through several stages of development. The peak in the development of microprocessor structures based on existing sixth-generation technologies is considered to be 2002, when it became possible to exploit all the basic properties of silicon to reach high frequencies with minimal losses in the production of logic circuits. Since then, the efficiency of new processors has been declining somewhat despite the steady increase in clock frequency, because silicon technology is approaching the limit of its capabilities.

All modern processors use field-effect transistors. Moving to a new process node makes it possible to build transistors with higher switching frequencies, lower leakage currents, and smaller dimensions. Smaller transistors reduce the die area and therefore the heat dissipated, and a thinner gate allows a lower switching voltage, which also reduces power consumption and heat dissipation.
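The link between supply voltage and power consumption can be illustrated with the usual first-order model of CMOS switching power, P ≈ a·C·V²·f. This model is not given in the text; it is the common textbook approximation, and the capacitance, voltage, and frequency values below are illustrative assumptions rather than data for any real processor.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz, activity=1.0):
    """Approximate CMOS switching power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Assumed example values: 1 nF of switched capacitance at 2 GHz.
old_power = dynamic_power(1e-9, 1.5, 2e9)  # older process, 1.5 V supply
new_power = dynamic_power(1e-9, 1.2, 2e9)  # newer process, 1.2 V supply

print(f"relative power after lowering the voltage: {new_power / old_power:.2f}")
```

Because the voltage enters squared, even a modest reduction in switching voltage gives a disproportionate saving in power at the same frequency.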

From the home user's point of view, not all processor functionality is actually in demand. For example, virtualization technology is usually unnecessary for home use, so there is little point in worrying about whether the processor in your PC supports it.

Technologies and market

There is now an interesting trend in the market: on the one hand, manufacturers try to introduce new process technologies into their products as quickly as possible; on the other hand, the growth of processor frequencies is being artificially restrained. First, marketers feel that the market is not fully ready for the next change of processor families, and the companies have not yet earned enough from sales of the CPUs currently in production: the stock has not yet run out. The price of the finished product quite noticeably takes precedence over all other company interests. Second, the significant slowdown of the "frequency race" reflects an understanding that new technologies are needed that genuinely increase performance at minimal technological cost. As already noted, manufacturers ran into problems when moving to new process nodes.

The 90 nm process node turned out to be a serious technological barrier for many chip manufacturers. This is confirmed by TSMC, which produces chips for many market giants such as AMD, nVidia, ATI, and VIA. For a long time it was unable to set up production of chips on the 0.09 micron process, which resulted in a low yield of usable dies. This is one of the reasons why AMD delayed the release of its processors with SOI (Silicon-on-Insulator) technology for so long. The delays arose because it is precisely at this feature size that various negative factors, previously not so noticeable, began to show up strongly: leakage currents, a wide spread of parameters, and an exponential increase in heat generation. Let us take them in order.

As is well known, there are two leakage currents: gate leakage and subthreshold leakage. The first is caused by the spontaneous movement of electrons between the silicon channel substrate and the polysilicon gate. The second is the spontaneous movement of electrons from the transistor's source to its drain. Both effects make it necessary to raise the supply voltage in order to control the currents in the transistor, and this increases heat dissipation. When we shrink the transistor, we first of all shrink its gate and the layer of silicon dioxide (SiO2) that forms a natural barrier between the gate and the channel. On the one hand, this improves the transistor's speed (switching time); on the other hand, it increases leakage. The result is a kind of vicious circle. The transition to 90 nm thus means another reduction in the thickness of the dioxide layer and, with it, more leakage. Fighting leakage again means higher control voltages and, accordingly, a significant increase in heat generation. All this delayed the introduction of the new process by the competitors in the microprocessor market, Intel and AMD.

One alternative is SOI (silicon-on-insulator) technology, which AMD recently introduced in its 64-bit processors. It cost the company a great deal of effort to overcome the many associated difficulties, but the technology itself offers a large number of advantages with relatively few disadvantages. The idea is quite logical: the transistor is separated from the silicon substrate by an additional thin layer of insulator. The advantages are numerous. First, there is no uncontrolled movement of electrons under the transistor channel that would affect its electrical characteristics. Second, after the turn-on voltage is applied to the gate, the time needed to bring the channel to its operating state (until the operating current flows through it) is shorter; in other words, the second key performance parameter of the transistor, its on/off time, improves. Third, at the same speed, the turn-on voltage can simply be lowered. Alternatively, some compromise can be found between higher operating speed and lower voltage. With the same gate voltage, transistor performance can increase by up to 30%; if the frequency is left unchanged and the focus is on energy saving, the gain can be even larger, up to 50%. Finally, the channel characteristics become more predictable, and the transistor itself becomes more resistant to sporadic errors, such as those caused by cosmic particles that enter the channel substrate and unexpectedly ionize it. Now, when such particles hit the substrate beneath the insulator layer, they do not affect the operation of the transistor at all. The only disadvantage of SOI is that the depth of the source/drain region has to be reduced, which directly increases its resistance as the thickness decreases.

Finally, the third reason for the slowdown in frequency growth is the low activity of competitors in the market. One could say that everyone was busy with their own affairs: AMD was engaged in the widespread introduction of 64-bit processors, while for Intel this was a period of refining the new process and debugging it to increase the yield of usable dies.

Future microprocessor technologies

It is known that existing CMOS transistors have many limitations and will not allow processor frequencies to be raised as painlessly in the near future. At the end of 2003, at a conference in Tokyo, Intel specialists made a very important announcement about the development of new materials for the semiconductor transistors of the future.

First of all, this concerns a new transistor gate dielectric with a high dielectric constant (the so-called "high-k" material), which will replace the silicon dioxide (SiO2) used today, as well as new metal alloys compatible with the new gate dielectric.

The solution proposed by the researchers reduces the leakage current by a factor of 100, which brings a production process with a 45-nanometer design rule within reach. Experts regard it as a small revolution in the world of microelectronic technology. To understand what this is about, let us first look at a regular MOS transistor (Figure 1), on the basis of which the most complex CPUs are built.

Figure 1 - MOSFET transistor

In it, the conductive polysilicon gate is separated from the transistor channel by a thin (only 1.2 nm or 5 atoms thick) layer of silicon dioxide (a material used for decades as a gate dielectric).

Such a thin dielectric is needed not only to keep the transistor as a whole small, but also to achieve the highest performance (charged particles move through the gate faster, so that such a transistor can switch up to 10 billion times per second). Put simply, the closer the gate is to the transistor channel (that is, the thinner the dielectric), the greater its influence, in terms of speed, on the electrons and holes in the channel.

To combat leakage, the thickness of the dielectric must be increased to at least 2-3 nm (see figure above). To keep the same transconductance (the dependence of current on voltage), the dielectric constant of the dielectric material must be increased proportionally. If the dielectric constant of bulk silicon dioxide is 4 (or slightly less in ultra-thin layers), then a reasonable value for the new "Intel" dielectric is around 10-12. Although there are many materials with such a dielectric constant (capacitor ceramics or single-crystal silicon), the technological compatibility of the materials is no less important here. Therefore, a high-precision deposition process was developed for the new high-k material, in which one molecular layer of the material is formed per cycle (Figure 2).
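The proportionality between thickness and dielectric constant can be checked with a short calculation. The sketch below assumes the simple parallel-plate model, in which the gate capacitance per unit area is proportional to k / t; the function name is hypothetical, and the input values (k = 4, 1.2 nm, 2-3 nm) are the ones quoted in the text.

```python
def required_dielectric_constant(k_old, t_old_nm, t_new_nm):
    """Dielectric constant needed to keep k / t (capacitance per area) unchanged."""
    return k_old * t_new_nm / t_old_nm


K_SIO2 = 4.0      # bulk silicon dioxide, as quoted in the text
T_SIO2_NM = 1.2   # current gate dielectric thickness, nm

for t_new in (2.0, 3.0):
    k_new = required_dielectric_constant(K_SIO2, T_SIO2_NM, t_new)
    print(f"thickness {t_new:.1f} nm -> dielectric constant about {k_new:.1f}")
```

Under this model, a 3 nm layer needs a dielectric constant of about 10, which is consistent with the lower end of the 10-12 range mentioned above.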


Figure 2 - Formation of one molecular layer in one cycle

Based on Figure 2, it can be assumed that the new material is also an oxide. Moreover, it appears to be a monoxide, which suggests materials mainly from the second group, for example magnesium, zinc, or even copper.

But the matter was not limited to the dielectric. The gate material itself, conventional polycrystalline silicon, also had to be changed. The point is that replacing silicon dioxide with a high-k dielectric causes problems in its interaction with polycrystalline silicon (the band gap of the transistor determines the minimum possible voltage for it). These problems can be eliminated by using special metals for the gates of both types of transistors in combination with a special manufacturing process. This combination of materials achieves record transistor performance and uniquely low leakage currents, 100 times lower than with current materials. In this case, there is no longer any temptation to use the much more expensive SOI technology to combat leakage, as some large microprocessor manufacturers do.

Modern microprocessors are complex devices that differ from one another in design, instruction set, and software support. Therefore, it is not possible to give specific detailed recommendations for using the analyzer with a particular type of microprocessor. [ 1 ]

Structurally, a modern microprocessor is a very-large-scale integrated circuit implemented on a single semiconductor die, a thin rectangular slab of crystalline silicon with an area of only a few square millimeters. It contains circuits that implement all the functions of the processor. The die is usually placed in a flat plastic or ceramic package and connected by gold leads to metal pins so that it can be attached to the computer's motherboard. [ 2 ]

Architecture has many features. One is that instructions and data are stored in the same storage device. For most systems, this is necessary, since they exchange commands and data using certain program development tools. For example, a bootloader loads a program stored in an external storage device into memory and must interpret it as data to do so. However, in widespread applications such as cash registers and car ignition systems, resident program development tools are not available and programs are never mixed with data. Therefore, in some microcontrollers, such as the MCS-48, instructions and data are stored in different storage devices; commands of such microcontrollers do not allow accessing memory cells in which the program is stored as data cells. [ 3 ]

At the hardware level, a protection scheme is implemented. When the operating system uses it, the system can respond adequately to errors and emergency situations in application programs, blocking access to critical system elements and ensuring reliable operation of several applications when one of them fails. [ 4 ]

In modern microprocessors, as a rule, a single supply voltage is used. Typically, the power source for a microprocessor is a 5 V DC source. Two pins of the microprocessor are used to connect this source: one is supplied with 5 V, and the other is grounded. Some microprocessors provide two more pins for supply voltages of 12 V and -5 V. [ 5 ]

The architecture of modern microprocessors is characterized by a single addressable memory space, which is called main memory. [ 6 ]

The KR580IK80 microprocessor is one of the simplest in the family of modern microprocessors. [ 7 ]

Currently, there is a trend toward more complex data stack structures in modern microprocessors. To ensure their flexibility, the stack should generally have a hierarchical structure, although at the level of the basic instruction set it is advisable to implement only a one-, two-, or three-stack structure. [ 8 ]

First, the functional model of the microprocessor itself is not complete enough (that is, it does not reflect all the functions of modern microprocessors). [ 9 ]

Disk rotation speeds reaching 10,000 rpm and the clock speeds of modern microprocessors, at 100 MHz and higher, make it possible to generate the check codes in RAID without any problems. Data buses between the processor and storage devices support a bandwidth of 100 MB/s, which also solves the problems associated with network access to disk arrays. In addition, some buses, for example FC-AL, provide network interfaces for the TCP/IP, FDDI, and ATM protocols, which allows storage devices to be connected directly to networks. [ 10 ]

In general, the second half of the 1950s was an extremely interesting period, rich in ideas that have survived into modern microprocessors, though more often under different names, and that are perceived by the younger generation as something completely new. [ 11 ]

Indeed, while ENIAC (1946), recognized as the first large computer, occupied an area of about 90 m² and weighed more than 30 tons, a modern microprocessor capable of accommodating all the electronics of such a machine has an area of only 1.5-2 cm² and provides computing power that exceeds the combined power of all the computers that existed in the world in the mid-1960s. The first computer contained about 17 thousand vacuum tubes, whereas 0.15 micron technology now allows the same number of electronic components to be placed within the cross-section of a human hair. [ 12 ]

Using a special utility, EXE and DLL files are shifted within the space they occupy so as to ensure real compliance with the paged memory organization implemented. [ 13 ]

Because the necessary fundamental improvements could not be introduced into MS DOS, Microsoft was forced to create new operating systems (Windows, Windows NT, Windows 95, etc.) that provide adequate services for users and developers, support the simultaneous operation of several programs and data protection tools, and make more effective use of the capabilities of modern microprocessors. [ 1 ]

Modern microprocessors perform the functions of small computers. They are used in automatic digital monitoring and control devices, regulators, telemechanics systems and teleautomatic complexes. [ 2 ]

The 80286 and later microprocessors (MPs) support multitasking (multiprogramming) and the accompanying memory protection. Modern microprocessors have two operating modes. [ 3 ]

Both parts of the MP operate in parallel, with the interface part running ahead of the operating part, so that the next instruction is fetched from memory (written into the instruction register block and pre-analyzed) while the operating part executes the previous instruction. Modern microprocessors have several groups of registers in the interface part that operate with varying degrees of lookahead, which allows operations to be performed in pipeline mode. This organization of the MP can significantly increase its effective performance. [ 4 ]
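The benefit of fetching the next instruction while the previous one executes can be estimated with a simple timing model. This is a minimal sketch under stated assumptions: a two-stage scheme (fetch and execute), fixed stage times for every instruction, and no stalls; the function names and cycle counts are hypothetical.

```python
def sequential_time(instructions, fetch_cycles, execute_cycles):
    """Total cycles when each instruction is fetched and then executed in turn."""
    return instructions * (fetch_cycles + execute_cycles)


def overlapped_time(instructions, fetch_cycles, execute_cycles):
    """Total cycles when fetching instruction i+1 overlaps executing instruction i."""
    if instructions == 0:
        return 0
    slowest_stage = max(fetch_cycles, execute_cycles)
    return fetch_cycles + execute_cycles + (instructions - 1) * slowest_stage


n = 10
print("sequential:", sequential_time(n, 1, 1), "cycles")
print("overlapped:", overlapped_time(n, 1, 1), "cycles")
```

With equal stage times the overlapped scheme approaches a twofold speedup as the number of instructions grows; with unequal stages, the slower stage sets the pace.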

The first microprocessors were built on p-MOS circuits. Modern microprocessors are built on n-MOS circuits, which have low cost and average performance, on extremely low-power CMOS circuits, and on high-performance TTL circuits. [ 5 ]

The permissible depth of subroutine nesting depends on the type of computer and the programming language used. Most modern microprocessors and programming languages allow multi-level nesting. [ 6 ]

As for operations on floating-point numbers and other special complex operations, in systems based on the first processors they were implemented by a sequence of simpler instructions and special subroutines. Later, special computing units were developed, mathematical coprocessors, which took over from the main processor while such instructions were executed. In modern microprocessors, mathematical coprocessors are included in the structure as an integral part. [ 7 ]

The memory addressing methods considered here are as follows: indexed addressing, relative addressing, and addressing of data stored on the stack by means of a stack pointer. Many modern microprocessors use either the first or the second of these methods. Almost every microprocessor provides stack-based access to data using a stack pointer. [ 8 ]
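The three addressing methods can be sketched on a toy machine. Everything in this example is an assumption made for illustration: the class name, the 16-word memory, and the choice of a stack that grows toward lower addresses; real microprocessors differ in these details.

```python
class ToyMachine:
    """A toy memory model illustrating indexed, relative, and stack addressing."""

    def __init__(self, size=16):
        self.memory = [0] * size
        self.sp = size  # stack pointer; the stack grows toward lower addresses

    def indexed_address(self, base, index):
        """Indexed addressing: effective address = base + index register."""
        return base + index

    def relative_address(self, pc, offset):
        """Relative addressing: effective address = program counter + offset."""
        return pc + offset

    def push(self, value):
        """Store a value at the address held in the stack pointer."""
        self.sp -= 1
        self.memory[self.sp] = value

    def pop(self):
        """Read back the most recently pushed value."""
        value = self.memory[self.sp]
        self.sp += 1
        return value


machine = ToyMachine()
machine.push(7)
machine.push(9)
print(machine.pop(), machine.pop())           # values come back in reverse order
print(machine.indexed_address(4, 3))          # element 3 of a table at address 4
print(machine.relative_address(10, -2))       # branch target two words back
```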

A microprocessor is implemented on one or more semiconductor chips and usually consists of an arithmetic unit and a control unit. In terms of its structure, a modern microprocessor processes information according to a specific program defined for each application and provides the program control needed to perform the various functions of the device. [ 9 ]

Obviously, we will soon see changes in this area and the transition to 32-bit operating systems. At that point, a number of important capabilities of modern microprocessors will finally be put to use, such as protected mode, multitasking, and the flat memory model, and multithreading, multiprocessing, and network operation on various hardware platforms will become possible. [ 10 ]

An important feature of this method is that the specified graphic images must be decomposed into the corresponding sets of elementary vectors before being displayed. This operation can be performed by the host computer, but modern microprocessors allow it to be carried out more efficiently in the display terminal. In this case, the display terminal has two memories, one for the normal display file and one for image regeneration, the latter containing the coordinates of the ends of the elementary vectors or the character codes for each of the cells. For both the host computer and the programmer, such a system is no different from a beam-steered vector display, except for the lower resolution. [ 11 ]

The negative phenomena noted in the production of military equipment at state-owned enterprises in Russia were caused primarily by the lack of the necessary component base at the modern world level. After the collapse of the USSR, the Russian electronics industry was unable, for a number of reasons, to quickly master and develop the mass production of modern microprocessors and high-capacity memory chips. [ 12 ]

To obtain digital values of active power, the resulting sequence of numbers must be numerically integrated over the period of the instantaneous power. Multiplication, delay, and summation during numerical integration are typical operations of digital processors, and modern microprocessors perform them in real time at the industrial (power-line) frequency. In this case, the time digital power measuring converters need to convert active and reactive power into a digital signal is practically one half and three quarters of the industrial frequency period, respectively. Some research results are known on the implementation of faster digital power measuring converters. [ 13 ]

To obtain digital values of active power, the resulting sequence of numbers must be numerically integrated over the period of the instantaneous power. Multiplication, delay, and summation during numerical integration are typical operations of digital processors, and modern microprocessors perform them in real time at the industrial (power-line) frequency. [ 14 ]
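The numerical integration described here can be sketched as averaging the product of voltage and current samples over one period. This is a minimal illustration, assuming sinusoidal signals sampled uniformly over exactly one period; the amplitudes, phase shift, and sample count are made-up values, and the function name is hypothetical.

```python
import math


def active_power(voltage_samples, current_samples):
    """Average of instantaneous power u[k] * i[k] over the sampled period."""
    n = len(voltage_samples)
    return sum(u * i for u, i in zip(voltage_samples, current_samples)) / n


SAMPLES = 64                 # samples per period (assumed)
U_AMPL, I_AMPL = 325.0, 10.0  # assumed amplitudes, volts and amperes
PHASE = math.pi / 6          # assumed phase shift between voltage and current

u = [U_AMPL * math.sin(2 * math.pi * k / SAMPLES) for k in range(SAMPLES)]
i = [I_AMPL * math.sin(2 * math.pi * k / SAMPLES - PHASE) for k in range(SAMPLES)]

measured = active_power(u, i)
analytic = U_AMPL * I_AMPL * math.cos(PHASE) / 2  # textbook value for sinusoids
print(f"numerical: {measured:.3f} W, analytic: {analytic:.3f} W")
```

The loop contains only multiplications and additions, which is why such a computation fits comfortably within one period of the power-line frequency.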

What are the requirements for microprocessor power supplies? Indicate what power consumption is typical for modern microprocessors. [ 1 ]

Semiconductor devices and microcircuits are constantly being improved. Their parameters are improving; in addition, the degree of integration of microcircuits is growing and their functions are becoming more complex. A striking example of this trend is microprocessor integrated circuit kits. Modern microprocessors are actually a computer on a chip. [ 2 ]

Undoubtedly, the recursive version of the program is more compact, but the multiple function calls used in any recursive algorithm significantly reduce its performance. Therefore, a loop is preferable in terms of execution speed. In addition, thanks to the optimization of arithmetic operations in most modern microprocessors, the speed advantage of non-recursive algorithms is becoming increasingly obvious. [ 3 ]
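The excerpt does not say which program is meant, so the sketch below uses the factorial as a hypothetical stand-in to contrast the two styles. The recursive version makes one function call per step, while the loop version performs the same multiplications without the call overhead.

```python
def factorial_recursive(n):
    """Compact recursive form: one function call per multiplication."""
    if n <= 1:
        return 1
    return n * factorial_recursive(n - 1)


def factorial_loop(n):
    """Loop form: the same multiplications without repeated calls."""
    result = 1
    for factor in range(2, n + 1):
        result *= factor
    return result


print(factorial_recursive(10), factorial_loop(10))
```

Both functions return the same values; the difference lies only in how much call and stack handling the processor has to do along the way.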

Modern advances in microelectronics have made it possible to design and mass-produce so-called microprocessors. In terms of the functions it performs, a microprocessor is similar to the processor of a mainframe computer; it is usually implemented on one or several large microcircuits with a high degree of integration. The functions of the microprocessor are defined by the corresponding set of executable instructions recorded in read-only memory. The microprocessor is also characterized by a certain amount of register memory, the word length of the information being processed, and other parameters. There are microprocessors with scalable and with fixed word length. In modern microprocessors, the number of operations reaches 100 or more, and double-word operations and byte-by-byte information processing are provided. In addition, the microprocessor is equipped with general software, which is stored in read-only memory built on integrated circuits. [ 4 ]

The material is presented on the basis of a hypothetical microprocessor. There are several reasons for this. First of all, most microprocessors produced by industry are too complex to serve as a basis for becoming familiar with the principles of construction and operation of these devices. However, mastery of these general principles makes it possible to master microprocessors of any type and model. Moreover, there is always a danger that such a microprocessor will remain the most preferable for the student. As for the hypothetical microprocessor chosen here, some of its characteristics cannot be found in any modern microprocessor. The author does not undertake to predict them for future models, since in our time engineering and technology are changing too quickly. [ 5 ]

However, when we talk about combating detonation, we mean forced modes, in which the danger of detonation is especially great. But is this right, if more than 80% of the fuel is burned during stable engine operation, when high anti-knock characteristics are not needed at all and low-octane gasoline would do? Are we hammering nails with a violin? Thus was born the idea of dividing the fuel between two tanks: a smaller one for a high-octane additive and a larger one for regular low-octane gasoline. The whole question is in the dosage, in supplying these flows in a ratio that exactly matches how the engine is operating at the moment. It is clear that both dosage and carburetion must be regulated with pinpoint precision in such an engine. Modern microprocessors in combination with a computer can take on this task. [ 6 ]

Highly efficient computing systems need as many functions as possible on the chip for processing and storing data, as well as for interfacing with the user and with other systems. Higher microprocessor (MP) performance is achieved by raising the clock frequency, by parallel and pipelined data processing, and by reducing memory access time.

Structural parallelism of the MP. Exploiting the natural parallelism inherent in most programs, between the evaluation of integer address expressions and the actual data processing in floating-point format, led to the emergence of decoupled architectures. Such an MP consists of two connected subprocessors (the address A-CPU and the execution E-CPU), each controlled by its own instruction stream. A decoupled architecture delivers equal performance both in scalar number crunching and in processing arrays of numbers (where a single vector instruction is used). The program is split into programs for the A- and E-processors by the compiler or by a splitter unit.

Structural reduction of memory access time. The need arises because main memory access time is more than ten times longer than the time needed to transform data in the processor registers. Access time is reduced by a multi-level memory hierarchy:

registers: 64-256 words with an access time of 1 processor clock cycle;

Level 1 cache: 8 K words with an access time of 1-2 clock cycles;

Level 2 cache: 256 K words with an access time of 3-5 clock cycles.
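The effect of such a hierarchy can be estimated with the usual average-access-time calculation. In the sketch below, the cache access times are taken from the ranges listed above, while the hit rates and the main-memory latency are assumptions chosen only for illustration; the function name is hypothetical.

```python
def average_access_time(cache_levels, memory_cycles):
    """Average access time in cycles for a hierarchy of caches.

    cache_levels: list of (access_cycles, hit_rate), fastest level first.
    memory_cycles: access time of main memory, paid on a miss in every cache.
    """
    time = memory_cycles
    # Work from the slowest cache back to the fastest one.
    for access_cycles, hit_rate in reversed(cache_levels):
        time = access_cycles + (1.0 - hit_rate) * time
    return time


levels = [
    (1, 0.95),  # level 1 cache: 1 cycle, assumed 95% hit rate
    (4, 0.90),  # level 2 cache: 4 cycles, assumed 90% hit rate
]
MAIN_MEMORY_CYCLES = 20  # assumed main-memory access time

print(f"average access time: {average_access_time(levels, MAIN_MEMORY_CYCLES):.2f} cycles")
print(f"without caches:      {MAIN_MEMORY_CYCLES} cycles")
```

Under these assumed hit rates, most accesses are satisfied at the speed of the first-level cache, so the average stays close to one or two cycles even though main memory is much slower.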

RISC processors. In practice, programs use a limited set of simple instructions in the formats "register, register → register" and "register ↔ memory". Compilers are unable to use complex instructions efficiently. This contributed to the concept of processors with a reduced instruction set, or RISC processors.

Methods for speeding up processor context switching. Modern operating systems and programming systems make wide use of processor context switching (the context being the contents of the registers and of individual control flip-flops) on interrupt entry and exit, on subroutine entry and exit, and when organizing multiprogram operation. The context switch time should be minimal. The CPU context switch time can be reduced by:

reducing the number of registers whose contents are stored in memory;

hardware support for register saving;

introducing special conventions that govern the use of registers in programs (this makes it possible to move from saving the full context to saving only part of it).

3.3.2.1 Varieties of architectures of modern microprocessors



Superscalar processors implement the approach in which the instruction set contains no indication of parallel processing inside the processor. A different approach exposes all the possibilities of parallel processing: in special instruction fields, each of the parallel functional units is assigned the action it must perform. These are processors with a very long instruction word (VLIW).

Superscalar and VLIW processors belong to the class of architectures that exploit instruction-level parallelism (ILP). The text of a sequential program in a high-level language is compiled into machine code that reflects the static structure of the program.

To resolve the dependencies caused by branch instructions, a prediction method is used that allows the instructions of the predicted branch to be fetched and speculatively executed. If the prediction turns out to be incorrect, the processor state is restored to the point at which the branch decision was made. Executed instructions can be data dependent when they use the same memory resources. Therefore, these resources are used in the order prescribed by the program.
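The text does not say which prediction scheme is used. As one common illustration, the sketch below models a two-bit saturating counter, a simple predictor often described in textbooks; the class name, the initial counter state, and the branch pattern are assumptions.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 2  # assumed starting state: weakly taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Run the predictor over a sequence of branch outcomes and count misses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1  # here a real processor would restore its state
        predictor.update(taken)
    return misses


# A loop branch taken nine times and then not taken, repeated three times.
loop_branch = ([True] * 9 + [False]) * 3
print("mispredictions:", count_mispredictions(loop_branch), "of", len(loop_branch))
```

Each misprediction in this model corresponds to the case described above, where speculatively executed instructions are discarded and the processor state is rolled back.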

Multiscalar processors use static and dynamic code analysis in their architecture to find parallelism at the level of individual instructions and program segments, using information from the high-level language compiler. The program is divided into a set of tasks by software and hardware means. A task is a part of a program whose execution corresponds to a contiguous region of the dynamic instruction sequence.

Signal processors are used for digital signal processing. Their features are low-width (40 bits or less) floating-point arithmetic, predominant use of fixed-point numbers of 32 bits or less, and an orientation toward simple processing of large data sets. Digital signal processing tasks are characterized by continuous real-time processing of large volumes of data.

There is a different approach to achieving high performance. The large number of components on a semiconductor chip can be used to build a symmetric multiprocessor system from simpler processors that process integer operands. These are the so-called media processors, used for real-time processing of video and audio information.

Transputers. The concept of parallelism makes it possible to improve the performance and reliability of computing systems by building massively parallel systems based on LSI circuits. A transputer (transistor + computer) is a microcomputer with its own internal memory and links (channels) for connection to other transputers. The term is often interpreted as a generic name for microprocessors with built-in interprocessor interfaces. Transputers have a high degree of "functional independence", are easy to integrate, and have peripheral devices.

Neuroprocessors. The neural network approach is effective for poorly formalized problems for which it is difficult to specify a sequence of actions. These include pattern recognition and data clustering, that is, grouping data according to their inherent "closeness". Clustering is used for data compression, data analysis, and the search for patterns in data.

Unlike formalized methods, a neural network can extrapolate the result. A neural network takes new factors into account by being retrained with them, rather than by reworking formalized rules. When the amount of experimental data is limited, neural networks are a tool that makes maximum use of the available information.

The general idea of using neural networks is based not on executing a prescribed algorithm but on the network memorizing the examples presented to it when the network is created and producing results consistent with the memorized examples. In the problems being solved, points of a multidimensional space form regions of points that share the same property. Neural networks memorize such regions rather than the individual points that represent the examples presented during training. A separate elementary computing unit called a neuron is used for memorization, and to memorize all the regions the constituent neurons are combined into a parallel structure, the neural network.

Computer systems that interpret neural network algorithms are built on a traditional element base. The performance unit accepted in the neurocomputer world is "connections per second", CPS. A connection here means multiplying an input by a weight and adding the product to the accumulated sum. Another indicator is the number of weight values changed per second, CUPS (connection updates per second).
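The definition of a connection as a multiply-and-accumulate step can be written down directly. This is a minimal sketch: the function names are hypothetical, and the connection count and elapsed time used for the CPS figure are made-up numbers rather than measurements.

```python
def neuron_sum(inputs, weights):
    """Weighted sum of a neuron, counting one connection per multiply-accumulate."""
    accumulated = 0.0
    connections = 0
    for x, w in zip(inputs, weights):
        accumulated += x * w  # one connection: multiply by weight, add to sum
        connections += 1
    return accumulated, connections


def connections_per_second(connections, seconds):
    """CPS: number of connections processed divided by the elapsed time."""
    return connections / seconds


total, used = neuron_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
print("weighted sum:", total, "connections:", used)

# Assumed figures: three million connections processed in half a second.
print("CPS:", connections_per_second(3_000_000, 0.5))
```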

Digital neuro-LSIs, like analog and hybrid ones, implement neural algorithms; they may include circuits for adjusting the weights during training or provide for external loading of weights. Digital LSIs for systolic and single-stream systems are devices similar to conventional RISC processors, typically 16- or 32-bit.

Analog neuro-LSIs use simple physical effects to perform neural network transformations. Analog elements are typically smaller and simpler than digital elements, but require careful design to achieve the required accuracy.

AMD products compete successfully with Intel microprocessors. According to a number of indicators, the microprocessors of this company occupy a leading position. Some interesting architectural and technical solutions, first used in AMD microprocessors, subsequently became widespread in products from other manufacturers, including microprocessors from Intel.

Microprocessor K5

For a number of years AMD, at least one microprocessor generation behind Intel, relied primarily on licensed technology and made only minor design changes to its microprocessors. The appearance of the Pentium microprocessor directly threatened to force AMD out of the market, which pushed the company to intensify work on a new family of x86-compatible microprocessors. Work on the K5 began when details of the Pentium processor were not yet known. AMD engineers had to develop their own microarchitecture while ensuring compatibility with existing software for x86 processors.

AMD originally planned to begin shipping its 100-120 MHz microprocessor in 1995, but only a few thousand of these processors were released, and they were clocked at only 75 MHz. Major deliveries of the K5 began in the first quarter of 1996, after the company switched to 0.35 micron technology developed jointly with Hewlett-Packard. This made it possible to increase the number of transistors to 4.2 million on a die with an area of 167 mm².

The K5 [68] is the first AMD microprocessor created without using any Intel intellectual property (except for microcode), and at the same time it outperforms Intel processors. Many applications, such as Microsoft Excel, Word and CorelDRAW, ran 30% faster on K5 series processors than on a Pentium with the same clock frequency. This performance was achieved mainly through a larger cache memory and a more advanced superscalar architecture, the RISC86 architecture used in AMD microprocessors.

As is well known, x86 instructions have variable length and a complex structure, which makes it difficult to decode them and to analyze data dependencies between instructions. In AMD's architecture the decoder, the most complex part of the microprocessor, breaks long CISC instructions into small RISC-like components called ROPs (RISC operations).
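The idea of splitting a complex instruction into ROPs can be sketched as follows. This is a toy model under stated assumptions: the tuple format, the temporary names `tmp0` and `tmp1`, and the `load`/`store` operation names are all hypothetical and do not describe AMD's actual ROP encoding.

```python
def decode_to_rops(instruction):
    """Split one (op, dst, src) instruction into simple register-only operations.

    Operands written as "[...]" are treated as memory references (an assumption
    of this sketch); everything else is treated as a register.
    """
    op, dst, src = instruction

    def is_mem(operand):
        return operand.startswith("[")

    rops = []
    a, b = dst, src
    if is_mem(src):
        rops.append(("load", "tmp1", src))   # bring the source operand into a temporary
        b = "tmp1"
    if is_mem(dst):
        rops.append(("load", "tmp0", dst))   # bring the destination operand into a temporary
        a = "tmp0"
    rops.append((op, a, b))                  # the arithmetic itself works on registers only
    if is_mem(dst):
        rops.append(("store", dst, a))       # write the result back to memory
    return rops


print(decode_to_rops(("add", "[0x100]", "eax")))
print(decode_to_rops(("add", "eax", "ebx")))
```

A register-to-register instruction maps to a single operation, while an instruction with a memory destination expands into a load, the operation itself and a store.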

ROPs resemble the microcode instructions of x86 microprocessors. The first microprocessors with the x86 architecture executed their complex instruction set by fetching microcode from internal read-only memory. In the latest x86 microprocessors the use of microcode is minimized by relying on simple instructions and implementing them in hardware.

Whereas the Pentium has two pipelines for parallel execution of two integer operations, the K5 has six execution units operating in parallel. Floating-point, load/store, or branch instructions can be executed concurrently with integer operations. The load/store unit can perform two memory accesses in one cycle. Another difference from the Pentium is that the K5 can execute instructions out of order.

The floating point unit (FPU) meets x86 standards, but is somewhat inferior in performance to the FPU of the Pentium processor.

The combination of CISC and RISC principles used in the K5 architecture made it possible to overcome the limitations of the x86 instruction set. At the cost of increased complexity, the AMD processor achieved higher performance while maintaining compatibility with the x86 instruction set. The latter is quite important given the widespread availability of software for this microprocessor architecture.

Microprocessor K6

The K6 microprocessor was released in 1997 using 0.35 micron CMOS technology with five-layer metallization. It contained 8.8 million transistors on a die with an area of 162 mm², operated at clock frequencies of 166, 200 and 233 MHz, and was installed in the Socket 7 connector.

Like the K5, the K6 used the RISC86 superscalar architecture with separate instruction decoding and execution, ensuring continuity with the x86 instruction set and achieving the high performance typical of sixth-generation microprocessors. The K6 was equipped with the MMX multimedia extension of the instruction set. At the same clock frequency, the K6 was significantly faster than the Pentium MMX and comparable to the Pentium Pro. Unlike the Pentium Pro, the K6 worked equally well with both 32-bit and 16-bit applications.

High processor performance was ensured thanks to a number of new architectural and technological solutions.

· The processor pre-decodes x86 instructions when fetching them into cache memory. Each instruction in the L1 cache is equipped with pre-decode bits indicating the offset of the start of the next instruction in the cache (from 1 to 15 bytes).

· The K6 contains separate internal L1 caches of 32 KB each for data and instructions.

· The processor implements a high-performance floating point calculation unit.

· There is a high-performance block of multimedia operations of the MMX standard.

· Multiple decoding of x86 instructions into single-cycle RISC operations (ROP) is used.

· The processor contains parallel decoders, a centralized operation scheduler, and seven execution units that provide superscalar execution of instructions in a six-stage pipeline.

· The processor uses speculative execution with out-of-order instruction sequencing, data forwarding, and register renaming.

At the beginning of 1998, processor variants based on 0.25 micron technology with five layers of metallization were released for clock frequencies of 266 MHz and 300 MHz.

Microprocessor K7

The next-generation microprocessor, the K7 (marketed as Athlon), was released in June 1999. The K7 contains over 22 million transistors on a 184 mm² die and was originally manufactured using 0.25 micron technology with 6 layers of metallization for clock frequencies of 500, 550, 600 and 650 MHz. Subsequently, with the transition to 0.18 micron technology, the frequency was increased to 1 GHz and higher. The microprocessor supply voltage is 1.6 V.

The processor is housed in a cartridge and connects to the board via Slot A, developed by AMD. Athlon and Slot A use the Digital Alpha EV6 bus protocol, which has a number of advantages over the GTL+ protocol used by Intel. In particular, EV6 makes it possible to use a point-to-point topology for multiprocessor systems. In addition, EV6 transfers data on both the rising and falling edges of the clock signal, which at a frequency of 100 MHz gives an effective data transfer frequency of 200 MHz and an interface bandwidth of 1.6 GB/s. In subsequent processor models, the bus operating frequency (effective frequency) reached 133 (266) and then 200 (400) MHz.
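The quoted bandwidth follows from simple arithmetic. The sketch below assumes a 64-bit (8-byte) data path, which the text does not state explicitly; the function name is illustrative.

```python
def bus_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, bus_width_bits):
    """Peak bandwidth = clock frequency * transfers per clock * bytes per transfer."""
    return clock_hz * transfers_per_clock * (bus_width_bits // 8)


# 100 MHz clock, data transferred on both clock edges, assumed 64-bit data path.
ev6_100 = bus_bandwidth_bytes_per_s(100_000_000, 2, 64)
# The same bus at a 133 MHz clock (266 MHz effective).
ev6_133 = bus_bandwidth_bytes_per_s(133_000_000, 2, 64)
print(ev6_100 / 1e9, "GB/s")
print(ev6_133 / 1e9, "GB/s")
```

Under the same assumption, the 133 MHz bus gives roughly 2.1 GB/s.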

The Athlon architecture is called QuantiSpeed™ and features superscalar, superpipeline execution, a pipelined floating point unit, hardware cache prefetch, and advanced branch prediction technology.

The Athlon has nine execution units: three integer execution units (IEU), three address generation units (AGU), and three floating-point and multimedia units (one floating-point load/store unit (FSTORE) and two pipelined units for executing FPU/MMX/3DNow! instructions).

The Athlon can decode three x86 instructions into six RISC operations. After decoding, the ROPs are placed in a buffer, where they await their turn for execution in one of the processor's functional units. The K7 buffer holds 72 operations (three times more than the K6) and issues 9 ROPs to the 9 execution units.

The Athlon has 128 KB of L1 cache (64 KB for data and 64 KB for instructions). To interact with the second level cache memory, a special bus is provided (like the Intel P6 architecture). The second level cache memory, 512 KB in size, is located outside the processor core, in the processor cartridge, and operates at half the core frequency.

The next microprocessor with the K7 architecture, based on the Thunderbird core, was the Duron, a budget version of the microprocessor aimed at inexpensive PCs. Its main difference is the second-level cache memory, reduced to 64 KB. The Duron contains 25 million transistors on a 100 mm² die and is designed for frequencies from 600 to 1200 MHz.

Placing the cache memory on-chip allowed developers to abandon the use of a cartridge and return to the socket type connector (462-pin Socket A connector). In Athlon and Duron processors, cache memory operates according to an algorithm that ensures exclusivity of data representation in caches (data is not duplicated in the first and second level caches), which increases the effective volume of cached data.
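The gain from exclusive caching can be estimated with a simple calculation. The sketch below is a deliberate simplification: a hierarchy that duplicates data is modelled as holding only as much unique data as its larger level, and the function name is illustrative.

```python
def effective_capacity_kb(l1_kb, l2_kb, exclusive):
    """Amount of distinct data the two cache levels can hold together."""
    if exclusive:
        return l1_kb + l2_kb      # no line is stored in both levels
    return max(l1_kb, l2_kb)      # simplified model: the smaller level only repeats data


# Duron figures from the text: 128 KB of L1 cache and 64 KB of L2 cache.
print(effective_capacity_kb(128, 64, exclusive=True))
print(effective_capacity_kb(128, 64, exclusive=False))
```

With the Duron's small second-level cache, the exclusive scheme is what makes the L2 worthwhile at all: without it the 64 KB would largely repeat data already present in the 128 KB first-level cache.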

Thanks to the new architectural and technical solutions used in the K7, AMD microprocessors managed to exceed the performance of the Pentium III by 7-10% at equal clock frequencies.

Further improvements in the architecture and production technology of microprocessors within the K7 family led to the emergence of two new versions of Athlon: Athlon XP and Athlon MP.

The main difference between the AMD Athlon MP processor and the AMD Athlon XP is the use of Smart MP technology, which combines a high-speed dual system bus with the coherent MOESI cache protocol to manage memory bandwidth, as is necessary for balanced processor operation in multiprocessor systems. The bus bandwidth is 2.1 GB/s per processor.

The processor is available with clock frequencies from 1 GHz (0.18 micron technology) to 2.133 GHz (0.13 micron technology, Thoroughbred core).

TOPIC 4 Microprocessors

LECTURE 7

Lecture questions:

1. General information about microprocessors.

2.

General information about microprocessors

A microprocessor is a software-controlled device for processing digital information and controlling the processing, implemented as a large-scale (LSI) or very-large-scale (VLSI) integrated circuit. Thus, the microprocessor plays the role of the processor in digital systems for various purposes. These can be information processing systems (computers), object and process control systems, information measurement systems and other types of systems used in industry, household appliances, communications equipment and many other applications.

A microprocessor is a universal device for performing software processing of information, which can be used in a wide variety of areas of human activity. Dozens of manufacturing companies produce several thousand types of microprocessors with different characteristics and designed for various applications. Manufactured microprocessors are divided into separate classes in accordance with their architecture, structure and functionality. This section provides an overview of the main architectural and structural options for implementing modern microprocessors used in various applications.

The development of technology makes it possible to place an ever greater number of active components (transistors) on a chip. These can be used to implement new architectural and structural solutions that increase the performance and extend the functionality of microprocessors.

Classification of microprocessors

Although the microprocessor is a universal tool for digital information processing, certain areas of application require specific variants of its structure and architecture. Therefore, according to functional characteristics, two classes are distinguished: general-purpose microprocessors and specialized microprocessors (Fig. 1.3).

Fig. 1.3. Classification of modern microprocessors according to functionality

Among specialized microprocessors, the most widespread are microcontrollers, designed to perform control functions for various objects, and digital signal processors (DSP - Digital Signal Processor), which are oriented toward procedures that perform the required conversion of analog signals represented in digital form (as a sequence of numerical values).

General-purpose microprocessors are designed to solve a wide range of information processing problems. Their main areas of use are personal computers, workstations, servers and other digital systems for mass use. This class includes the CISC processors Pentium from Intel, K7 from Advanced Micro Devices (AMD) and 680x0 from Motorola, the RISC processors PowerPC from Motorola and IBM and SPARC from Sun Microsystems, and a number of other products from various manufacturers.

The scope of such microprocessors is expanded mainly by increasing their performance, which widens the range of tasks they can solve. Increasing performance is therefore the main direction of development for this class of microprocessors. Typically these are 32-bit microprocessors (some microprocessors in this class have a 64-bit or 128-bit structure), manufactured using the latest industrial technology to ensure the maximum operating frequency.

A number of the most popular microprocessors of this class (Pentium, AMD K7 and some others) should be classified as CISC processors, since they execute a large set of multi-format instructions using numerous addressing methods. However, their internal structure contains a RISC processor that executes incoming instructions after converting them into a sequence of simple RISC operations. A number of other microprocessors in this class implement the RISC architecture directly. Therefore, the use of the RISC architecture can be considered typical of most of these microprocessors. However, a number of recent developments (Itanium, PA8500) from some leading manufacturers have successfully applied the principles of the VLIW architecture, which can compete with the RISC architecture for the highest performance.

Almost all modern microprocessors in this class use the Harvard internal architecture, where the separation of instruction and data streams is implemented using separate cache memory blocks. In most cases, they have a superscalar structure with several execution pipelines (up to 10 in modern models) that contain up to 20 stages.

Due to their versatility, general-purpose microprocessors are also used in specialized systems where high performance is required. Single-board computers and industrial computers, which are used in control systems for various objects, are built on their basis. Single-board (embedded) computers carry on the board the additional chips needed for their specialized use and are intended for integration into equipment for various purposes. Industrial computers are housed in specially designed enclosures that ensure reliable operation in harsh production conditions. Typically, such computers operate without standard peripheral devices (monitor, keyboard, mouse) or use special versions of these devices, modified to suit specific application conditions.

Microcontrollers are specialized microprocessors that are focused on the implementation of control devices built into a variety of equipment. Due to the huge number of objects that are controlled using microcontrollers, their annual production volume exceeds 2 billion units, an order of magnitude greater than the production volume of general-purpose microprocessors. The range of manufactured microcontrollers is also very wide, containing several thousand types.

A characteristic feature of the structure of microcontrollers is the placement of internal memory and a large set of peripheral devices on the same chip as the central processor. Peripheral devices usually include several 8-bit parallel data input/output ports (from 1 to 8), one or two serial ports, a timer unit, and an analog-to-digital converter. In addition, various types of microcontrollers contain additional specialized devices: a pulse-width-modulation signal generation unit, a liquid crystal display controller and a number of others. Thanks to the use of internal memory and peripheral devices, control systems based on microcontrollers contain a minimum number of additional components.

Due to the wide range of control problems being solved, the requirements for processor performance, the amount of internal memory of commands and data, and the set of necessary peripheral devices turn out to be very diverse. To meet consumer demands, a large range of microcontrollers are produced, which are usually divided into 8-, 16- and 32-bit.

8-bit microcontrollers are the largest group in this class of microprocessors. They have relatively low performance, which is nevertheless quite sufficient for solving a wide range of control problems. These are simple and cheap microcontrollers aimed at use in relatively simple mass-produced devices. The main areas of their application are household and measuring equipment, industrial automation, automotive electronics, television, video and audio equipment, and communications.

These microcontrollers are characterized by the Harvard architecture, which uses separate memories to store programs and data. To store programs, various types of microcontrollers use either a mask-programmable ROM, a one-time programmable ROM (PROM), or an electrically reprogrammable ROM (EPROM, EEPROM or Flash). The internal program memory usually ranges in size from a few kilobytes to tens of kilobytes. To store data, a register block organized as several register banks, or internal RAM, is used. The volume of internal data memory ranges from several tens of bytes to several KB. A number of microcontrollers in this group allow, if necessary, additional external instruction and data memory with a capacity of up to 64-256 KB to be connected.

Microcontrollers in this group usually execute a relatively small set of instructions (50-100) using the simplest addressing methods. A number of the latest models of these microcontrollers implement the principles of the RISC architecture, which can significantly increase their performance. As a result, such microcontrollers execute most instructions in one clock cycle.

16-bit microcontrollers are in many cases an improved modification of their 8-bit prototypes. They are characterized not only by an increased word length of the processed data, but also by an expanded system of instructions and addressing methods, a larger set of registers and a larger amount of addressable memory, as well as a number of other additional features that improve performance and open up new areas of application. Typically, these microcontrollers allow the program and data memory to be expanded to several MB by connecting external memory chips. In many cases they are software-compatible with the lower-end 8-bit models. The main areas of application for such microcontrollers are complex industrial automation, telecommunications equipment, medical and measuring equipment.

32-bit microcontrollers contain a high-performance processor that matches the capabilities of low-end general-purpose microprocessors. In some cases, the processor used in these microcontrollers is similar to CISC or RISC processors that are or have previously been released as general-purpose microprocessors. For example, 32-bit microcontrollers from Intel use the i386 processor, microcontrollers from Motorola widely use the 680x0 processor, and a number of other microcontrollers use PowerPC-type RISC processors as the processor core. Various models of personal computers have been built on these processors. Introducing these processors into microcontrollers makes it possible to use in the corresponding control systems the huge amount of application and system software previously created for those personal computers.

In addition to the 32-bit processor, the microcontroller chip houses an internal instruction memory with a capacity of up to tens of kilobytes, a data memory with a capacity of up to several kilobytes, as well as complex functional peripheral devices: a timer processor, a communication processor, a serial exchange module and a number of others. These microcontrollers work with external memory of up to 16 MB and more. They are widely used in control systems for complex objects of industrial automation (engines, robotic devices, complex production automation equipment), in control and measuring equipment and in telecommunications equipment.

The internal structure of these microcontrollers implements Princeton or Harvard architecture. The processors they contain may have a CISC or RISC architecture, and some of them contain several execution pipelines that form a superscalar structure.

Digital signal processors (DSP) are a class of specialized microprocessors focused on digital processing of incoming analog signals. A specific feature of analog signal processing algorithms is the need to execute a series of multiply-add instructions in sequence, accumulating the intermediate result in an accumulator register. Therefore, the DSP architecture is oriented toward fast execution of operations of this kind. The instruction set of these processors contains special MAC (Multiplication with Accumulation) instructions that implement these operations.
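A typical algorithm built from such multiply-accumulate steps is the finite impulse response (FIR) filter. The sketch below is an illustration chosen for this text rather than an example from the source; on a DSP, the inner loop body would map onto one MAC instruction per tap.

```python
def fir_filter(signal, taps):
    """Apply an FIR filter: each output sample is a sum of products."""
    output = []
    for n in range(len(signal)):
        acc = 0                                # accumulator register
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]     # one multiply-accumulate (MAC) step
        output.append(acc)
    return output


print(fir_filter([1, 2, 3, 4], [1, 1]))
```

For a filter with T taps, every output sample costs T multiply-accumulate steps, which is why single-cycle MAC execution dominates DSP performance.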

The value of the received signal can be represented as a fixed-point or floating-point number. Accordingly, DSPs are divided into processors that handle fixed-point numbers and processors that handle floating-point numbers. The simpler and cheaper fixed-point DSPs typically handle 16-bit operands represented as proper fractions. However, the limited word length in some cases does not provide the necessary conversion accuracy. Therefore, fixed-point DSPs produced by Motorola adopt a 24-bit operand representation. The highest processing accuracy is achieved when data is represented in a floating-point format. In DSPs that process floating-point data, the data is typically represented in a 32-bit format.
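A 16-bit operand interpreted as a proper fraction is commonly called the Q15 format (1 sign bit and 15 fractional bits). The text does not name a specific format, so Q15 and the helper names below are assumptions of this sketch.

```python
Q15_ONE = 1 << 15   # scale factor: 2**15 represents the value 1.0


def to_q15(x):
    """Convert a real number in [-1, 1) to a 16-bit fixed-point integer, saturating."""
    value = int(round(x * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, value))


def from_q15(value):
    """Convert a Q15 integer back to a float."""
    return value / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 numbers; the 30-bit fractional product is shifted back to 15 bits."""
    return (a * b) >> 15


half = to_q15(0.5)
print(half, from_q15(q15_mul(half, half)))
```

The saturation in `to_q15` shows the accuracy limit mentioned above: 1.0 itself cannot be represented, and the smallest step is 2**-15.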

To improve performance on specific signal processing operations, most DSPs implement the Harvard architecture using multiple buses for transmitting addresses, instructions, and data. A number of DSPs also use some features of the VLIW architecture: several operations are combined in one instruction, so that existing data is processed while new data is simultaneously loaded into the execution pipeline for subsequent processing.

Processor architecture is the set of the processor's hardware and software facilities provided to the user. This general concept includes the set of software-accessible registers and execution (operating) units, the system of basic instructions and addressing methods, the volume and structure of addressable memory, and the types and methods of interrupt processing.

For example, all modifications of the Pentium, Celeron, i486 and i386 processors have the IA-32 architecture (Intel Architecture - 32 bit), which is characterized by a standard set of registers provided to the user, a common system of basic instructions and methods of organizing and addressing memory, and the same implementation of memory protection and interrupt servicing.

When describing the architecture and operation of a processor, it is usually represented as a set of software-accessible registers that form the register, or software, model. These registers contain the processed data (operands) and control information. Accordingly, the register model includes a group of general-purpose registers, which store operands, and a group of service registers, which control program execution and the processor operating mode and organize memory access (memory protection, segment and page organization, etc.).

General-purpose registers form the processor's internal register memory. The composition and number of service registers are determined by the microprocessor architecture. Typically they include:

Program Counter PC (or CS+IP in Intel microprocessor architecture);

Status Register SR (or EFLAGS);

CPU operating mode control registers CR (Control Register);

Registers that implement segment and page memory organization;

Registers that provide program debugging and processor testing.

In addition, different microprocessor models contain a number of other specialized registers.

The functioning of the processor is represented in the form of the implementation of register transfers - procedures for changing the state of these registers by reading and writing their contents. As a result of such transfers, addressing and selecting commands and operands, storing and forwarding results, changing the sequence of commands and operating modes of the processor in accordance with the arrival of new contents in the service registers, as well as all other procedures that implement the information processing process according to specified conditions are provided.
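The view of processor operation as a series of register transfers can be sketched with a toy register model. The three-instruction machine below (`mov`, `sub`, `jz`) is entirely hypothetical and is not modelled on any real instruction set; it keeps only a program counter PC, a status register SR with a zero flag Z, and general-purpose registers R.

```python
def step(state, program):
    """Execute one instruction; every effect is a change to some register."""
    op, *args = program[state["PC"]]
    state["PC"] += 1                              # PC now points at the next instruction
    if op == "mov":                               # mov dst, immediate
        dst, value = args
        state["R"][dst] = value
    elif op == "sub":                             # sub dst, src (also updates the Z flag)
        dst, src = args
        state["R"][dst] -= state["R"][src]
        state["SR"]["Z"] = state["R"][dst] == 0
    elif op == "jz":                              # jump if the zero flag is set
        if state["SR"]["Z"]:
            state["PC"] = args[0]                 # changes the instruction sequence
    else:
        raise ValueError(f"unknown operation: {op}")
    return state


program = [("mov", 0, 5), ("mov", 1, 5), ("sub", 0, 1), ("jz", 0)]
state = {"PC": 0, "SR": {"Z": False}, "R": [0, 0]}
for _ in range(4):
    step(state, program)
print(state)
```

Here the `sub` instruction writes both a general-purpose register and the status register, and the `jz` instruction changes program flow purely by writing a new value into PC.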

A number of processors distinguish between registers that are used when executing application programs and are available to every user, and registers that control the operating mode of the entire system and are available only to privileged programs that are part of the operating system (supervisor). Accordingly, such processors are represented either by the user register model, which includes the registers used when executing application programs, or by the supervisor register model, which contains the entire set of software-accessible processor registers used by the operating system.

The microprocessor structure determines the composition and interaction of the main devices and blocks located on its chip. This structure includes:

The central processor (processor core), consisting of a control unit (CU) and one or more operating units (OU);

Internal memory (register memory, cache memory, RAM and ROM blocks);

An interface block that provides access to the system bus and data exchange with external devices through parallel or serial I/O ports;

Peripherals (timer modules, analog-to-digital converters, specialized controllers);

Various auxiliary circuits (clock generator, circuits for debugging and testing, watchdog timer and a number of others).

The composition of devices and blocks included in the structure of the microprocessor and the implemented mechanisms of their interaction are determined by the functional purpose and scope of the microprocessor.

The architecture and structure of the microprocessor are closely interrelated. The implementation of certain architectural features requires the introduction of the necessary hardware (devices and blocks) into the microprocessor structure and the provision of appropriate mechanisms for their joint functioning.

The following architecture options are implemented in modern microprocessors.

The CISC (Complex Instruction Set Computer) architecture is implemented in many types of microprocessors that execute a large set of multi-format instructions using numerous addressing methods. This is the classic processor architecture, which began its development in the 1940s with the advent of the first computers. A typical example of CISC processors is the Pentium family of microprocessors. They execute more than 200 instructions of varying complexity, which range in size from 1 to 15 bytes, and provide more than 10 different addressing methods. Such a wide variety of instructions and addressing methods allows the programmer to implement the most effective algorithms for solving various problems. However, it significantly complicates the structure of the microprocessor, especially its control unit, which increases the size and cost of the die and reduces performance. At the same time, many instructions and addressing methods are used quite rarely. Therefore, starting from the 1980s, the architecture of processors with a reduced instruction set (RISC processors) underwent intensive development.

The RISC (Reduced Instruction Set Computer) architecture is characterized by the use of a limited set of fixed-format instructions. Modern RISC processors typically implement about 100 instructions, which have a fixed format 4 bytes in length. The number of addressing methods used is also significantly reduced. Typically, in RISC processors, all data processing instructions are executed only with register or immediate addressing. Moreover, to reduce the number of memory accesses, RISC processors have an enlarged internal register memory, from 32 to several hundred registers, while in CISC processors the number of general-purpose registers is usually 8-16.
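One practical consequence of the fixed 4-byte format is that the address of any instruction can be computed directly, whereas with variable-length instructions of 1 to 15 bytes each boundary depends on all the preceding lengths. A minimal sketch, with illustrative function names:

```python
def risc_instruction_address(index, base=0):
    """With fixed 4-byte instructions the i-th instruction starts at base + 4 * i."""
    return base + 4 * index


def cisc_instruction_addresses(lengths, base=0):
    """With variable lengths the start addresses must be found one after another."""
    addresses = []
    address = base
    for length in lengths:
        if not 1 <= length <= 15:
            raise ValueError("instruction length must be 1 to 15 bytes")
        addresses.append(address)
        address += length      # the next start depends on this instruction's length
    return addresses


print(risc_instruction_address(3))
print(cisc_instruction_addresses([1, 3, 15, 2]))
```

This sequential dependency is one reason why decoding several variable-length instructions per cycle requires considerably more hardware than decoding fixed-length ones.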

In RISC processors, memory is accessed only by operations that load data into the register memory or transfer results from the register memory to main memory. A small number of the simplest addressing methods are used for this: register indirect, indexed and some others. As a result, the structure of the microprocessor is significantly simplified, its size and cost are reduced, and performance is significantly increased.

These advantages of the RISC architecture have led to the fact that many modern CISC processors use a RISC core that performs data processing. In this case, incoming complex and multi-format commands are pre-converted into a sequence of simple RISC operations that are quickly executed by this processor core. This is how, for example, the latest models of Pentium and K7 microprocessors work, which according to external indicators belong to CISC processors. The use of RISC architecture is a characteristic feature of many modern microprocessors.

The VLIW (Very Long Instruction Word) architecture appeared relatively recently, in the 1990s. Its distinguishing feature is the use of very long instructions (up to 128 bits or more), the individual fields of which contain codes that specify various operations. Thus, one instruction causes several operations to be executed in parallel in the various operating units included in the microprocessor structure. When translating programs written in a high-level language, the compiler generates "long" VLIW instructions, each of which makes the processor carry out an entire procedure or group of operations. This architecture is implemented in some types of modern microprocessors (PA8500 from Hewlett-Packard, Itanium - a joint development of Intel and Hewlett-Packard, some types of DSP - digital signal processors) and is very promising for creating a new generation of ultra-high-performance processors.
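The way one long instruction drives several operating units at once can be sketched as follows. The bundle format, the operation names and the register names are hypothetical; the essential point of the model is that all slots read the register state as it was before the instruction and write their results together.

```python
def execute_bundle(registers, bundle):
    """Execute one long instruction word whose slots run in parallel."""
    results = {}
    for slot in bundle:
        if slot is None:                 # empty slot: this operating unit idles
            continue
        op, dst, a, b = slot
        if op == "add":
            results[dst] = registers[a] + registers[b]   # reads the old state only
        elif op == "mul":
            results[dst] = registers[a] * registers[b]
        else:
            raise ValueError(f"unknown operation: {op}")
    updated = dict(registers)
    updated.update(results)              # all results are written back together
    return updated


regs = {"r1": 2, "r2": 3, "r3": 0, "r4": 0}
print(execute_bundle(regs, [("add", "r3", "r1", "r2"), ("mul", "r4", "r1", "r2"), None]))
```

Because the slots are independent by construction, it is the compiler, not the hardware, that must find operations without mutual dependencies to pack into one word.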

In addition to the set of commands to be executed and addressing methods, an important architectural feature of microprocessors is the memory implementation option used and the organization of fetching commands and data. According to these characteristics, processors with Princeton and Harvard architecture differ. These architectural options were proposed in the late 1940s by specialists from Princeton and Harvard Universities in the USA, respectively, for the computer models they were developing.

Princeton architecture, which is often called Von Neumann architecture, is characterized by the use of a common RAM for storing programs, data, and also for organizing a stack. To access this memory, a common system bus is used, through which both commands and data enter the processor. This architecture has a number of important advantages. The presence of shared memory allows you to quickly redistribute its volume to store separate arrays of commands, data and stack implementation, depending on the tasks being solved. Thus, it is possible to more efficiently use the available amount of RAM in each specific case of using the microprocessor. The use of a common bus for transmitting commands and data greatly simplifies debugging, testing and ongoing monitoring of system operation, and increases its reliability. Therefore, Princeton architecture dominated computing for a long time.

However, it also has significant disadvantages. The main one is the need to fetch instructions and data sequentially over a common system bus. The common bus thus becomes a "bottleneck" that limits the performance of the digital system. In recent years, ever-increasing demands on the performance of microprocessor systems have led to growing use of the Harvard architecture in many types of modern microprocessors.

Harvard architecture is characterized by the physical separation of instruction (program) memory and data memory. Its original version also used a separate stack to store the contents of the program counter, which made it possible to execute nested subroutines. Each memory is connected to the processor by a separate bus, so that while the current command reads and writes its data, the next command can be fetched and decoded at the same time. Thanks to this separation of the command and data streams and the overlapping of their fetch operations, higher performance is achieved than with the Princeton architecture.
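The difference in bus load can be illustrated with a deliberately idealized count of bus cycles. The sketch assumes that every instruction fetch and every data access occupies a bus for exactly one cycle and that, with separate buses, the two streams overlap perfectly; the workload is made up for the example.

```python
def bus_cycles_princeton(data_accesses):
    """Shared bus: every instruction fetch and every data access is serialized."""
    return sum(1 + d for d in data_accesses)


def bus_cycles_harvard(data_accesses):
    """Separate buses: fetches overlap with data accesses, the busier bus sets the pace."""
    fetches = len(data_accesses)
    data = sum(data_accesses)
    return max(fetches, data)


# Number of data accesses made by each instruction (hypothetical workload).
program = [1, 0, 1, 2, 0, 1]

shared = bus_cycles_princeton(program)   # 6 fetches + 5 data accesses = 11 cycles
split = bus_cycles_harvard(program)      # max(6, 5) = 6 cycles
```

Real systems fall between these two bounds, because the overlap is rarely perfect.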

The disadvantages of the Harvard architecture stem from the need for a larger number of buses and from the fixed amounts of memory allocated for commands and for data, which cannot be quickly redistributed to suit the problem being solved. It is therefore necessary to use larger memories, whose utilization when solving various problems is lower than in systems with the Princeton architecture. However, the development of microelectronic technology has made it possible to largely overcome these shortcomings, so the Harvard architecture is widely used in the internal structure of modern high-performance microprocessors, which use separate cache memories for instructions and for data. At the same time, the principles of the Princeton architecture are implemented in the external structure of most microprocessor systems.

Harvard architecture is also widely used in microcontrollers - specialized microprocessors for controlling various objects, the working program of which is usually stored in a separate ROM.

The internal structure of modern high-performance microprocessors implements the pipeline principle of command execution. The process of executing a command is divided into a number of stages. Fig. 1.1, a gives an example of breaking the execution of a command into six stages:

1) fetching of the next command (VC);

2) decoding of the command;

3) formation of the operand address (FA);

4) fetching of the operand from memory (PO);

5) execution of the operation (VO);

6) placement of the result in memory (RR).

Each stage takes one clock cycle and is performed by the processor devices and blocks that form the stages of the execution pipeline, each of which carries out the corresponding micro-operation. When fetched commands are loaded into the pipeline one after another, each pipeline stage carries out its part of the execution of the next command. Thus, the pipeline simultaneously contains several commands at different stages of execution. Ideally, when the pipeline is fully loaded, the result of one command appears at its output on every clock cycle (Fig. 1.1, a). In this case, the processor performance (operations/s) is equal to its clock frequency (cycles/s).
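The ideal case can be checked with a short calculation. The sketch below assumes one cycle per stage and a new command entering the pipeline on every cycle; the function names are chosen for the example.

```python
def ideal_pipeline_cycles(n_instructions, n_stages=6):
    """Cycles to finish n instructions when one enters the pipeline every cycle."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one finishes 1 cycle after.
    return n_stages + n_instructions - 1


def sequential_cycles(n_instructions, n_stages=6):
    """Cycles needed without pipelining: every instruction passes all stages alone."""
    return n_stages * n_instructions


n = 100
pipelined = ideal_pipeline_cycles(n)      # 6 + 100 - 1 = 105 cycles
unpipelined = sequential_cycles(n)        # 6 * 100 = 600 cycles
throughput = n / pipelined                # about 0.95 instructions per cycle
```

As the number of instructions grows, the fill time of the pipeline matters less and the throughput approaches one instruction per cycle.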

However, the pipeline operates this efficiently only when it is evenly loaded with commands of the same type. In reality, individual pipeline stages may be underused, standing in a wait state or an idle state. A stage is in the wait state when it cannot perform the required micro-operation because the necessary operand, which is the result of executing the previous instruction, has not yet been received. A stage is in the idle state when it is forced to skip the next clock cycle because the received command does not require that stage to do anything. For example, when executing addressless commands, it is not necessary to form an address or fetch an operand (the address-formation and operand-fetch stages of the pipeline are idle).

Fig. 1. Pipelined execution of commands with ideal (a) and real (b) loading of a 6-stage pipeline

Fig. 1, b shows an example of the operation of a 6-stage pipeline when executing a fragment of a real program, in which individual stages are in the wait state or the idle state. The INC R2 instruction, which increments the contents of register R2 by 1, does not require fetching operands from memory or placing the result in memory. Therefore, when it is executed, the pipeline stages that form the operand address (FA), fetch the operand and place the result (RR) are idle. The MOV (R2), R3 instruction transfers the contents of the memory cell addressed by register R2 to register R3. Its execution is held in wait states until the result of the previous operation arrives in register R2. Wait cycles are also inserted when executing the addition command ADD R3, (R4), until the required operand value is available in register R3. As a result of these wait and idle states, the actual processor performance on this program fragment is 3/5 instructions per cycle (5/3 cycles per instruction), that is, about 1.7 times lower than in the ideal case (Fig. 1, a).
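The 1.7-fold slowdown can be reproduced with exact fractions. The sketch assumes, as a reading of the example rather than a fact taken from the figure, that the three instructions lose two cycles in total to wait states.

```python
from fractions import Fraction


def steady_state_ipc(n_instructions, lost_cycles):
    """Instructions per cycle when `lost_cycles` issue slots are lost to waiting."""
    return Fraction(n_instructions, n_instructions + lost_cycles)


# Assumption: 3 instructions (INC, MOV, ADD) and 2 cycles lost to wait states.
ipc = steady_state_ipc(3, 2)     # Fraction(3, 5) instructions per cycle
slowdown = 1 / ipc               # Fraction(5, 3), i.e. about 1.7 times slower than ideal
```

The ideal pipeline corresponds to `lost_cycles = 0`, which gives exactly one instruction per cycle.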

In modern high-performance microprocessors, the instruction execution procedure can be divided into even smaller stages in order to have time to perform the corresponding micro-operations at each stage in one clock cycle, the duration of which at clock frequencies above 1 GHz is less than a nanosecond. Therefore, in such processors the number of pipeline stages reaches 10 or more. For example, Pentium 4 microprocessors use a 20-stage pipeline.

The efficiency of a pipeline is determined by the type of incoming commands. Uniform commands reduce the number of idle and wait states during their execution, which improves processor performance. When a program uses commands of different formats containing different numbers of bytes, the number of idle and wait states that must be inserted during command execution increases significantly. Therefore, the standard 4-byte instruction format adopted in many RISC processors considerably reduces the number of wait and idle states in the pipeline, which can significantly improve performance.

Another reason for reduced pipeline efficiency is conditional branch instructions. If the branch condition is met, the pipeline has to be reloaded with commands from another branch of the program, which takes additional cycles and causes a significant drop in performance. Therefore, one of the main conditions for efficient pipeline operation is to reduce the number of pipeline reloads when executing conditional branches. This is achieved by various mechanisms for predicting the branch direction, implemented in special devices, the branch prediction blocks, built into the processor structure.
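The cost of pipeline reloads can be estimated with a simple average. In the sketch below the share of branch commands, the misprediction rates and the reload penalty are assumed values chosen only to show the shape of the calculation.

```python
def effective_cpi(branch_fraction, mispredict_rate, reload_penalty, base_cpi=1.0):
    """Average cycles per instruction when mispredicted branches reload the pipeline."""
    return base_cpi + branch_fraction * mispredict_rate * reload_penalty


# Assumed workload: 20% of commands are branches, a reload costs 19 cycles.
cpi_poor = effective_cpi(0.20, 0.20, 19)   # 20% of branches mispredicted, about 1.76
cpi_good = effective_cpi(0.20, 0.05, 19)   # 5% of branches mispredicted, about 1.19
```

The longer the pipeline, the larger the reload penalty, so accurate prediction matters more as the number of stages grows.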

Modern microprocessors use a variety of branch prediction techniques. In the simplest one, the processor records the outcome of the previous branch command at a given address and assumes that the next execution of the command at this address will have the same outcome. This method relies on the higher probability that a given branch repeats its previous direction. To implement it, a special memory called the BTB (Branch Target Buffer) is used, which stores the addresses of previously executed conditional branches. When the same branch command is encountered again, the direction taken the previous time is predicted, and commands from the corresponding branch are loaded into the pipeline. If the prediction is correct, the pipeline does not have to be reloaded and its efficiency does not drop. The effectiveness of this method depends on the capacity of the BTB and turns out to be quite high: the probability of correct prediction is 80% or more. Higher prediction accuracy is achieved with more complex methods that store and analyze the branch history, that is, the results of several previous branch commands at a given address. This makes it possible to determine the most frequently taken direction and to identify alternating branches. Such algorithms require more complex prediction blocks, but the probability of correct prediction rises to 90-95%.
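The two approaches can be compared on a small trace. The sketch below is an illustration under stated assumptions: a dictionary stands in for the BTB, the history-based scheme is a 2-bit saturating counter (one well-known option, not necessarily the one used in any particular processor), and the trace is a made-up loop branch.

```python
def last_outcome_accuracy(trace):
    """Predict that each branch repeats its previous outcome (one bit per address)."""
    table = {}                                   # stands in for the BTB
    correct = 0
    for address, taken in trace:
        prediction = table.get(address, False)   # assumed default: not taken
        correct += prediction == taken
        table[address] = taken
    return correct / len(trace)


def two_bit_accuracy(trace):
    """Predict from a 2-bit saturating counter per address (taken if counter >= 2)."""
    counters = {}
    correct = 0
    for address, taken in trace:
        counter = counters.get(address, 1)       # assumed start: weakly not taken
        correct += (counter >= 2) == taken
        counters[address] = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(trace)


# Made-up trace: a loop branch at address 0x40, taken 9 times and then not taken,
# with the whole loop repeated 10 times.
trace = [(0x40, i % 10 != 9) for i in range(100)]

simple = last_outcome_accuracy(trace)    # 80 of 100 predictions correct
history = two_bit_accuracy(trace)        # 89 of 100 predictions correct
```

The simple scheme mispredicts twice per loop pass (on exit and on re-entry), while the counter tolerates the single exit and mispredicts only once per pass after warming up.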

Processor performance can also be increased by introducing several operating devices that work in parallel into the processor structure, so that several operations are executed simultaneously. Such a processor structure is called superscalar. These processors run several execution pipelines in parallel, each of which receives one of the fetched and decoded instructions for execution. Ideally, the number of concurrently executed instructions equals the number of operating devices in the execution pipelines. However, when executing real programs it is difficult to keep all execution pipelines fully loaded, so in practice the efficiency of a superscalar structure is somewhat lower. Modern superscalar processors contain from 4 to 10 different operating devices, whose parallel operation ensures the execution of an average of 2 to 6 instructions per clock cycle.
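Why the average falls below the number of operating devices can be seen in a toy scheduling model. The sketch assumes that every instruction takes one cycle, that any ready instruction may be issued, and that each instruction depends only on earlier ones; the dependency list is invented for the example.

```python
def schedule_cycles(dependencies, width):
    """Cycles needed when up to `width` ready instructions are issued per cycle.

    dependencies[i] lists the earlier instructions whose results instruction i needs.
    """
    done_cycle = {}
    remaining = list(range(len(dependencies)))
    cycle = 0
    while remaining:
        cycle += 1
        # An instruction is ready when all of its inputs finished in an earlier cycle.
        ready = [i for i in remaining
                 if all(done_cycle.get(d, cycle) < cycle for d in dependencies[i])]
        issued = ready[:width]
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in issued]
    return cycle


# Hypothetical program of 8 instructions with data dependencies between them.
program = [[], [], [0], [1], [2, 3], [], [5], []]

wide = schedule_cycles(program, width=4)     # 3 cycles, about 2.7 instructions/cycle
scalar = schedule_cycles(program, width=1)   # 8 cycles, 1 instruction/cycle
```

With four devices the program still needs three cycles, because the chain of dependent instructions cannot be shortened by adding hardware.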