How Many Registers Does a Modern CPU Have?

There are certainly several ways to ask the question, but IMHO the best answer to an unqualified question is to frame it in terms of the state necessary to store in a thread context. To that end, the registers boil down to:

* 16 general-purpose registers.

* 16 or 32 vector registers.

* 7 vector mask registers

* 8 x87 floating-point registers, with 8 MMX registers aliased. (Whether or not to split x87/MMX is certainly a challenging question.)

* 3 normal status registers: RIP, RFLAGS, MXCSR

* 6 x87 status registers: FSW, FCW, FTW, FDP, FIP, FOP

* 6 segment registers

* 6 debug registers

* Relevant MPX registers (I don't know this ISA extension very well, so I can't count these registers accurately)

These are the registers that I would expect to be able to poke at in a debugger or inspect/modify via something like ucontext_t, and they're going to be found in whatever kernel abstraction you use to save not-currently-running thread data.
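To make that concrete, here is a rough sketch in C of what such a saved context might look like. The struct name, field names, and layout are all made up for illustration; they do not match Linux's actual ucontext_t/mcontext_t layout:

```c
#include <stdint.h>

/* Illustrative sketch of per-thread register state, loosely mirroring the
   list above. Names and layout are invented, not any real kernel's. */
struct thread_context {
    uint64_t gpr[16];          /* RAX..R15 */
    uint64_t rip, rflags;
    uint32_t mxcsr;
    uint8_t  zmm[32][64];      /* 16 or 32 vector registers, up to 512-bit */
    uint64_t opmask[7];        /* k1..k7 (k0 treated as the constant mask) */
    uint8_t  st_mmx[8][10];    /* x87 stack, 80-bit each; MMX aliases the low 64 bits */
    uint16_t fsw, fcw, ftw, fop;
    uint64_t fip, fdp;         /* x87 instruction/data pointers */
    uint16_t seg[6];           /* ES, CS, SS, DS, FS, GS selectors */
    uint64_t dr[6];            /* DR0..DR3, DR6, DR7 */
};
```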

That is a perfectly reasonable approach to that answer, indeed. Registers do hold state, and saving/restoring them is a burden on program code (user, compiler, lib, OS).

Contrast this with a hypothetical CPU that has only one register, "base", and allows addressing 32 words after the address in that base register. I.e., things like arithmetic instructions would have 5 bits to address operands, which would be interpreted as base+8*n. To make things even more interesting, this architecture's instruction pointer lives at base+0.

Such an architecture would have one register under your metric (as only one register needs to be saved/restored to context switch an entire "register file").

Yet, implementations (microarchitecture) could actually shadow that memory range into hardware registers, and page in/out the whole register bank upon writes of the base register (effectively performing a hardware-assisted context switch; hello TSS).

Even so, since each instruction in this hypothetical ISA must have enough space in the encoding to address these operands, for all intents and purposes this architecture would have 32 registers.
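A toy interpreter fragment (entirely hypothetical, matching the made-up ISA above) shows both sides: context switching is a single store to `base`, yet every instruction still spends 5 bits per operand naming one of 32 slots:

```c
#include <stdint.h>

/* Toy model of the hypothetical one-register machine described above.
   Operand fields are 5 bits wide, interpreted as memory at base + 8*n;
   the instruction pointer lives at slot 0 (base + 0). */
typedef struct { uint64_t base; uint8_t *mem; } Machine;

static uint64_t *slot(Machine *m, unsigned n /* 0..31 */) {
    return (uint64_t *)(m->mem + m->base + 8ull * n);
}

/* An "add" with three 5-bit operand fields: dst = src1 + src2. */
static void op_add(Machine *m, unsigned dst, unsigned s1, unsigned s2) {
    *slot(m, dst) = *slot(m, s1) + *slot(m, s2);
}

/* A context switch is one store: point base at another bank of 32 slots
   and the entire "register file" (including the IP) is swapped. */
static void context_switch(Machine *m, uint64_t new_base) {
    m->base = new_base;
}
```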

Decoding instructions, addressing operands, dealing with the consequences of code density (icache misses), ... are all way more frequent events than context switches.

Hence I do agree with TFA that operand encoding should be the default metric to count registers. And this also includes sub/overlapping registers, if they are independently addressed.

> Contrast this with a hypothetical CPU that has only one register, "base", and allows addressing 32 words after the address in that base register.

This, er, wasn't really hypothetical. The TMS9900, the CPU used in the TI-99/4A, had three hardware registers: a program counter, a status word, and what was called a Workspace Pointer (WP). General-purpose "registers" lived in RAM, and were referenced by an offset off the value in the WP. Subroutine calls were initiated by saving the PC and changing the WP to a fresh new register context before branching.

Weird architectures definitely make the question a lot harder to answer. My gut reaction would be to say that your hypothetical ISA has 33 registers, and that the register file is memory-mapped to a specified region of virtual memory. That's partially because of the way that you're going to worry about how cache coherence will work out, but also because I suspect the mcontext_t or OS-equivalent interface will also define its structure layout as such.

The broader point, though, is in deciding whether or not to include registers like CR0 and DR0. The principle I'm using here is that registers that are not expected to be saved/restored on task switches should be excluded. Registers that are per-process (i.e., page tables in general, or segment descriptors on x86) or per-CPU (most MSRs) are thus excluded by this criterion.

FSBASE/GSBASE are extremely borderline--I wouldn't complain if they were or weren't excluded from a list of registers. These act as a mixture of user-visible registers (even if accessible only via syscalls until very recently) and segment descriptor information. They're not in Linux's userspace-visible mcontext_t struct, but they are in the kernel's equivalent to mcontext_t.


The goal of the hypothetical ISA in my case was merely to tease apart two distinct aspects of the register file cardinality. Both aspects (state and encoding) exist in real-world architectures, but apparently it's easy to confuse them.

> I won't count microarchitectural implementation details, like shadow registers.

I actually think this article would have been more interesting if he had discussed the microarchitectural registers. Those are important to understand for optimization even if they're not directly visible.

Also, discussing MSRs but not special data tables like the page table and VMCS is a somewhat odd distinction. While they're probably stored differently, they are somewhat similar in how they are used.

Also, isn't the TLB a kind of set of registers? The TLB entries are very often accessed. How about store buffers and the like?

Yes, what really happens under the hood is dynamic register allocation at runtime. I wonder how important static register allocation by the compiler is in that scenario. In theory, even if the compiler uses the same register over and over in sequential instructions, the renamer should be smart enough to detect the false data dependencies and allocate registers from individual instructions to different locations in the register file.

Are there any x86 profiling tools which give any metrics about the real utilisation of the register file?


The number of architectural registers is nonetheless relevant for register allocation because, of course, overlapping and independent code sequences cannot share the same architectural register name. This is not very important for integer loads, but still relevant for FP, where optimal scheduling requires having multiple computations in flight at the same time. In some cases 16 FP registers are not enough, and Intel had to add 16 more FP registers with AVX512.


Oh: to clarify, you mean that the compiler could just use stack slots for everything, but some instructions are only allowed to operate on architectural registers, right? If you have to execute a lot of those instructions, the number of architectural registers can be the bottleneck in performance?


Yes, consider vector floating-point fused multiply-add (FMA). On a typical AVX implementation like Skylake, this instruction has a latency of 4 cycles and a throughput of 2 instructions per cycle. To avoid stalls, you'd need 4 * 2 = 8 instructions to run independently, and 8 architectural registers just to store the results. You could store the results onto the stack and reuse the same architectural registers, but normally you want to use the values immediately in the next loop iteration (e.g., matrix multiply), so this would be expensive. You probably want a few more architectural registers (at the very least 2, up to 16) to hold the inputs as well.
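As a concrete sketch (using AVX2+FMA intrinsics; the 8-accumulator count follows the latency-times-throughput arithmetic above, everything else is illustrative):

```c
/* Sketch: hiding FMA latency with 8 independent accumulator registers.
   Assumes an AVX2+FMA machine; compile with e.g. gcc -O2 -mavx2 -mfma. */
#include <immintrin.h>
#include <stddef.h>

/* Dot product of a and b; n is assumed to be a multiple of 64 floats. */
float dot(const float *a, const float *b, size_t n) {
    __m256 acc[8];
    for (int i = 0; i < 8; i++)
        acc[i] = _mm256_setzero_ps();

    /* 8 independent FMA chains: with 4-cycle latency and 2 FMAs/cycle,
       fewer chains would leave the FMA units idle waiting on results. */
    for (size_t i = 0; i < n; i += 64)
        for (int j = 0; j < 8; j++)
            acc[j] = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8 * j),
                                     _mm256_loadu_ps(b + i + 8 * j),
                                     acc[j]);

    /* Reduce the 8 accumulators down to a scalar. */
    for (int i = 1; i < 8; i++)
        acc[0] = _mm256_add_ps(acc[0], acc[i]);
    float tmp[8], s = 0.0f;
    _mm256_storeu_ps(tmp, acc[0]);
    for (int i = 0; i < 8; i++)
        s += tmp[i];
    return s;
}
```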

Exactly this. Thanks for the practical example.

The reason it is less relevant for integer computations is that integer ops normally have lower latency and tend to have shorter loop-carried dependency chains.


Not just in theory. The way modern register renaming works is that every single write to a register name always allocates a new physical register. It's not even possible for there to be false dependencies from reusing the same name over and over.
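A toy rename table makes this mechanical. This is a heavily simplified sketch -- real renamers also reclaim registers, checkpoint across branches, and handle partial registers:

```c
#include <stdint.h>

#define ARCH_REGS 16
#define PHYS_REGS 180   /* physical file is much larger than the architectural one */

/* Simplified register renamer: every architectural write gets a fresh
   physical register, so reusing a name can't create a false dependency. */
typedef struct {
    int map[ARCH_REGS];  /* architectural -> physical mapping */
    int next_free;       /* trivial allocator; real HW uses a free list */
} Renamer;

/* Rename a write: allocate a new physical register for the destination. */
int rename_write(Renamer *r, int arch_dst) {
    int phys = r->next_free++ % PHYS_REGS;  /* no reclamation in this sketch */
    r->map[arch_dst] = phys;
    return phys;
}

/* Rename a read: just look up the current mapping. */
int rename_read(const Renamer *r, int arch_src) {
    return r->map[arch_src];
}
```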

From the programmer's point of view only the working registers are really important; other registers are mostly OS-related (MSRs). Some are really important, but the internal representation may be totally different.

Like a mode switch bit in a CR register. So MSRs are just the interface. And MSR register access can be "slow", so no synchronization or optimization is required.

But the idea that more registers make a better architecture is a totally bad assumption. See the dead body of Itanium (128 general-purpose 64-bit integer registers, 128 floating-point registers, etc.).

With multitasking, one has to switch between contexts, and a larger context (register file size) takes more time.

There are cases where you are better off using just the GPRs rather than the SIMD registers. (The Linux kernel does not use FPU or SIMD registers.)

Also, SIMD usage may slow down the clock, like AVX on x86_64. So you may trust your compiler for vectorization, but it may do more harm than good.

IMHO what killed Itanium wasn't too many registers, and not even compiler difficulties — it was the attempt to have working x86 emulation.

So, instead of a weird but very fast CPU, it ended up being not very fast in both x86 and native modes, while still being weird. (The makers of the Cell CPU did not compromise, went full weird, and had a winner of sorts.)


More GPRs visible in the ISA also means more bits needed to encode instructions. If instruction length and encoding were not an issue, I bet we would have seen memory-to-memory ISAs where no GPRs exist, just instructions referencing memory locations. The dynamic register file would then be merely a level below the L1 cache, or even completely removed.

> With multitasking, one has to switch between contexts, and a larger context (register file size) takes more time.

SPARC chips got around that by having sliding windows of registers: instead of having to push all the registers to the stack you only moved the window.


Only registers that are orthogonal to each other should be counted. RAX, EAX, AX, AL, AH cannot be independently varied, and thus should only count as one.

They share the underlying storage (i.e., they are aliased) but they are independently addressed, and thus they consume instruction encoding space.

Not saying that it's not interesting to know how much actual storage the register file offers; just highlighting that TFA focuses on the instruction-encoding angle of the question, which is also important.

CPU architectures are masterpieces of tradeoffs.

Put in too many registers and your instruction stream is not dense enough, and you cannot keep your CPU busy due to stalls in the fetch phase. Also, context switches become expensive (there are solutions to that, though).

Put in too few registers and you have to spill registers to memory too often, and thus also consume precious instruction stream space.

AL/AH are distinct at least

Maybe count how many bits of registers there are. So count RAX as 64 bits of registers.

> I will count sub-registers (e.g., EAX for RAX) as distinct registers. My justification: they have different instruction encodings, and both Intel and AMD optimize/pessimize particular sub-register use patterns in their microcode.

I disagree strongly with that characterisation. Just no.

I think it's pointless to debate a methodology without a purpose. "How many registers does an x86-64 CPU have?" is interesting (to 58 voters so far) but too general to be useful for any particular purpose. Consider a couple of alternative questions brought up in this thread:

* How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)

* How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)

Even these, one might argue, aren't directly useful; when considering context switching, one could dig down further into how much of the context switching time is attributable to saving the registers, validate that with experiments across architectures, etc.

> * How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)

This is something that has been bothering me for some time now - actually since the mid-'80s: why not implement multiple contexts as an index into a large register file? This way, a context switch would take the time it takes to write to the `task-id` register. It will affect latencies, but would the impact of having, say, eight contexts not be smaller than having to hit L1 or L2 for the same data?

True, but at two per core (4 or 8 in more enlightened architectures), it's very meh.

I would assume that, instead, a modern CPU tags decoded instructions in the reorder buffer with the virtual core number (and register set) it should be applied to. This way, the parallelism would be much easier to exploit.

> How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)

It still shouldn't, because they can't be distinct in the emulator, as writing to a subregister affects the larger register and vice versa.


The emulator has to handle, and the testing infrastructure has to have tests for, each subregister and how it affects its parent register. You have to prevent and find design-level oversights such as "AH is not the LSB of AX/EAX/RAX and my design only accounts for least-significant subregisters". Your point that they can't be distinct holds, but if they were distinct they would have less effect on the engineering than they will have as subregisters.
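To illustrate the aliasing rules under test (a minimal sketch of x86-64 subregister write semantics, not any real emulator's code): 8- and 16-bit writes merge into the parent register, AH sits at bits 8-15 rather than the LSB, and 32-bit writes zero the upper half:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of x86-64 subregister write semantics on a modeled RAX. */
typedef struct { uint64_t rax; } GuestRegs;

/* 8-bit writes merge into the low byte... */
void write_al(GuestRegs *r, uint8_t v)  { r->rax = (r->rax & ~0xFFull) | v; }
/* ...while AH is bits 8..15, NOT the LSB -- the oversight mentioned above. */
void write_ah(GuestRegs *r, uint8_t v)  { r->rax = (r->rax & ~0xFF00ull) | ((uint64_t)v << 8); }
/* 16-bit writes merge into the low word. */
void write_ax(GuestRegs *r, uint16_t v) { r->rax = (r->rax & ~0xFFFFull) | v; }
/* 32-bit writes ZERO the upper 32 bits (x86-64 rule), unlike 8/16-bit writes. */
void write_eax(GuestRegs *r, uint32_t v){ r->rax = v; }

int main(void) {
    GuestRegs r = { .rax = 0x1122334455667788ull };
    write_ah(&r, 0xAB);        /* only bits 8..15 change */
    printf("%016llx\n", (unsigned long long)r.rax); /* 112233445566ab88 */
    write_eax(&r, 0xDEADBEEF); /* upper half cleared */
    printf("%016llx\n", (unsigned long long)r.rax); /* 00000000deadbeef */
    return 0;
}
```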

On the other hand, they do exist as a separate case, and have instructions that treat them as an individual register.

There is precedent for all this as well.

Take the 6809, which has 8-bit A and B accumulators. These can be addressed as D, a 16-bit register.

D is not generally counted as its own register because the D addressing does not point to anything new.

A and B are counted as 2 registers.

If somehow D brought new bits, say it was 24 bits long, then it would need to be counted.


Does that mean you would count AL, AH, but not AX (as it can be split into AL/AH and doesn't "bring new bits"), and then again would count EAX and RAX?

In concrete terms, I probably would not count it.

However, if I were doing emulation, I would have to count it because it is a register case that has to be addressed. Just like the D register has to be dealt with on a 6809.

My own personal way to resolve this has always been to determine whether or not a given register specification, that can be addressed somehow, brings new data to the table not contained in any other register specification.

Fact is, CPU designers do all kinds of crazy things with registers. They overlap, they may be indirect, i.e., not directly addressable, but still there as a consideration for the programmer.

Think about something like the REP instruction found on some CPUs. There's a little circuit, it keeps track of a count, and some rules, and the count may be a register that may or may not be directly addressable any other way.

My general take on this article is, "wow, that's a lot of registers!"

We can all quibble about what quantity counts as a lot, and it's all good fun, and I don't think it means anything really.

"an emulator is hardware or software that enables i computer system (called the host) to behave like another computer system" (Wikipedia).

This is not the case for Rosetta 2, which does a translation pass before running the generated ARM binary. No x86 code is run.


I'd argue that "translation" is an implementation detail of emulation. Your translated app still thinks it is x86-64 (witnessed by running "uname -a" in a Rosetta terminal).


That's just the difference between a JIT and not - something like QEMU in the other direction doesn't "run" ARM code either, but in the end it actually doesn't matter and is only a minor pedantry.


The title "How many registers can an x86-64 _address_" is probably more accurate and less controversial.


It makes sense if you're counting register _encodings_ (that is, how much ISA encoding space the registers use). But yeah, a more useful count would consider the sub-registers as part of the main register (and the same for other ISAs like 64-bit ARM, which does have a 32-bit view of its 64-bit general-purpose registers), and would not consider registers outside each core (like the MTRR registers).


It depends on whether you're looking at this from an implementation in hardware vs. software, I'd say. The article specifically mentions Rosetta 2 as the context, so I'm guessing the enumeration is more important if you intend to understand all the things Rosetta 2 has to implement.


You can't independently store different values in the sub-registers. Storing in EAX clobbers what's in RAX.

Author here: I don't think this is a good necessary condition for what makes something a register. For consideration:

* Both x86 and other ISAs have registers that can't be stored to at all, like `k0` for the constant opmask and a whole bunch of read-only MSRs. But they can be read from wholly independently and as discrete registers.

* There are lots of cases where registers can be programmed to clobber other registers, particularly in the performance counter and PAT MSRs.

Sub-registers are definitely a stretch from the above, since they explicitly share bits in the x86 model. But then again, even the x86 model exposed to assembly programmers is a lie: the underlying microcode dynamically renames a large arena of anonymous registers at runtime, and subregisters like AL and AH have been separated in the microcode (to avoid some cases of partial register stalls) for over a decade.

MSRs are called "registers", sure, but you can't actually perform any operations on them other than a load or a store, and you can't use them to stash random data outside of main memory because they affect the operation of the processor. `k0` (or e.g. MIPS's `r0`) are even worse, since you can't even write to them.

That's not what most people care about when they talk about how many registers a processor has.

> That's not what most people care about when they talk about how many registers a processor has.

I think most people do consider the instruction pointer and status word to be registers, despite those also violating the constraints you specified.

Most people probably don't think about MSRs at all, and so perhaps just aren't interested in a count of them. But I'm interested in counting the different pieces of on-CPU state that would be necessary to faithfully model an entire x86 core, and both Intel and AMD refer to those bits of state as "registers."


It ruined the article for me. I skimmed a bit after that, but you started off so wrong that I dismissed the rest.


This analysis does have some limited value for characterizing the complexity required in the processor's front end to decode instructions. Register renaming means it has somewhat less relevance to the difficulty a compiler faces with register allocation, and basically no relevance to the actual size of a processor's physical register file(s).


They are the same architectural register, but due to register renaming, they may not reside in the same physical transistors. That's probably why the author supersets them.


Funny to think that x86 was once the platform that had too few general-purpose registers, so people sacrificed the frame pointer register in their highly optimized assembly routines...


I still remember benchmarking the various optimization options in GCC, and the only one that consistently and significantly improved performance on real code was -fomit-frame-pointer.


It saves one push per function call, which probably helps more than freeing up the register.


The improvement was dramatic. IIRC 20-25% faster code pretty much every time. Nothing else in the optimization set even came close.


It still is limited: the number of general-purpose registers is 16, and I semi-regularly encounter register spills on x86_64 while looking at compiler output.


You'll see register spills no matter how many registers, right? I definitely encounter register spills on architectures with 32 registers.

You didn't miss anything! I completely forgot them.

I plan to do an update to the post this afternoon.


Phew. I must have re-read the post like 3 times to confirm they were really missing. Thanks for the clarification.


TIL about an entire Intel microprocessor subsystem, MPX, that was added and then deemed useless before I even learned about it. It is both less secure and slower than software solutions. What a failure in processor design.


How about removing obsolete stuff from x86 CPUs to make the platform perform better? If someone needs to execute old programs/OSes they can use emulators for that...

People always get the idea that removing the cruft will make the design faster, and yet most of the top 500 supercomputers use x86-64 chips:

https://en.wikipedia.org/wiki/TOP500

> As of November 2020, all supercomputers on TOP500 are 64-bit, mostly based on CPUs using the x86-64 instruction set architecture (of which 459 are Intel EMT64-based and 22 are AMD AMD64-based. The few exceptions are all based on RISC architectures). Thirteen supercomputers, including the no. 2 and no. 3, are based on the Power ISA used by IBM Power microprocessors, 3 on Fujitsu-designed SPARC64 chips. One computer uses another non-US design, the Japanese PEZY-SC (based on the British ARM[8]) as an accelerator paired with Intel's Xeon.

There are non-x86 architectures in the TOP500, including ones which have less cruft than x86, but x86 chips keep on being used in some of the fastest machines on the planet. My hypothesis is that x86 cruft doesn't actually matter, and you'd need to go to a cruft level that was orders of magnitude worse for the ISA choice to dominate performance.

It actually doesn't matter. The biggest benefit of ARM is the fixed-length instructions, and only Apple is actually taking advantage of this by decoding 8 instructions at once. The big question is whether decoding that many instructions is actually a benefit. It's entirely possible that branch prediction and other factors are greater bottlenecks that have to be tackled first to take advantage of the faster instruction decoding.

Intel's Pentium processors (the original ones) were doing pretty badly because they made the pipeline too deep at the expense of other things.

I recall the original Pentium had the pretty much canonical 5-stage pipeline. It did pretty well. Its successor, the Pentium Pro, an OoO design, was deeper but also did amazingly well.

You are probably thinking of the Pentium 4, which was designed as a speed demon with a very deep pipeline and failed to reach its target frequency.


Anandtech's article indicates they have an out-of-order buffer of around 630 entries (Zen 3 is only 256 entries). The M1 has seven integer math ports and four FP/SIMD math ports, plus a bunch of load/store/branch ports. It seems like they could completely saturate those decoders given the right code.


Fair points, but I think we can be 100% certain that Apple has modelled this and made their architectural decisions based on this modelling - especially as they are now on the 10th or so iteration of their designs.


That, and that their CPUs are designed to run one OS, and apps are developed against one set of libraries. This frees them to tune the hardware to the needs of the software much more than any other manufacturer can do (a PC needs to run Word and Autocad equally well).

They have certainly optimised for some key aspects of their software (e.g. Rosetta and reference counting) but that is not at the expense of other software. The M1 Arm CPUs are just very fast general-purpose CPUs.

Word and Autocad also run on the Mac!

> but that is not at the expense of other software

It's always at the expense of something. Transistor budget is fixed.


That's still not free - you are locked out from a software base. Today it's cheap, but, all the same, not free.

It's probably that x86 has a better cost/performance than others. If you can get the job that'd require, say, 200 POWERs or 250 SPARC64s done with 300 Xeons that cost half as much per socket as a POWER, x86 will still be a better choice. This could be for many reasons - from the intrinsic performance of the CPU to the quality of the code generated by the compiler and/or architectural fitness to the task at hand.

Also, take into account that CPUs are not always the most expensive part of the compute node - GPUs, HBM, lots of DDR4, and fast networking gear are also pretty expensive and will be more or less constant as you change CPU architectures.

All that stuff is emulated in microcode.

And while counting architectural registers is something, in reality modern out-of-order processors do something called "register renaming" so that more registers can be used on the fly as the processor dynamically creates a data-flow dependency graph. Yes, inside each processor.

There have been attempts to move the register renaming out of the processor and into the program through VLIW architectures such as the Itanium, but it failed because (1) they require "sufficiently smart compilers" that weren't available in the day, and (2) putting it in the silicon allows new architecture revisions to benefit older programs (putting it in code means it can't benefit from newer register renaming algorithms without a recompile).

Also, if it's in the silicon and there's a bug in the algorithm, a microcode update can fix it for everybody. If the bug was in the compiler, you'd need to recompile everything to fix it.


The only successful VLIW architectures run JIT compilers on an existing ISA. Transmeta did this for x86, but Intel brought them to court, which gave Intel time to release superior chips (through superior manufacturing). The other example is Nvidia's Project Denver.

A very debatable level of "successful" on that. It shipped in a product, yes, but it wasn't very good either. It ran ARM code, but not very well, and was a horrible nightmare to work on. The JIT'd aspect completely breaks profilers. I.e., what do you mean this simple field access took 20ms?? Oh, because the CPU wasn't running my code, it just silently went out to lunch to JIT some random shit, cool, thanks. There's then also the questionable security design of a globally read/write/executable chunk of memory where security is "enforced" by the JIT, which is a complex bit of microcode, totally nothing can go wrong there...

Its only noteworthy "feature" was that it was able to ship ARMv8 support before ARM had a proper ARMv8 CPU design. Time to market for a new ISA was fast, but that's about it.


Project Denver lives on in NVIDIA's Carmel cores shipping in Tegra 194 (Xavier) chips today. Though they seem to be giving up on it, as they'll be using Cortex "Hercules" A78 cores in the successor named Orin.

Essentially every device that does not use an x86 CPU did that. There's no point in waiting an entire development cycle to get a new smaller x86-like when you could solder an ARM chip or similar to the board today.

The vast majority of CPUs shipped are not x86 compatible.

Statista claims 23.5 billion microcontrollers are shipped annually.

I know Microchip (the PIC people) made press releases roughly annually as they shipped another billion flash microcontrollers. Google found the one from 2011 when they shipped their tenth billion PIC chip.

I find it difficult to get x86 sales figures. Intel gross revenue is high because they have their fingers in everything. AMD financial statements claim about $2B/quarter total revenue, so if you figure the average shipped price of an AMD CPU is $200 and they made all their revenue off CPUs, that would be 40 million CPUs shipped per year, which seems both ridiculously high AND about a 25th the quantity of Microchip PICs shipped.

One way to look at the number of ARM CPUs shipped is that the licensing/holding company has made enough licensing fees to pay for about "a hundred and fifty billion" ARM chips in its lifetime.


If you remove the obsolete stuff from x86, there will be no point in keeping this old ISA at all. One of the selling points of x86 is legacy support.


You mean remove backwards compatibility and ruin the whole reason people use x86? Not to mention that when AMD64 ("long mode") was introduced, a big amount of cruft was disabled in the new mode (but is still available in 16-bit ("real mode") and 32-bit ("protected mode") for compatibility).

> (but is still available in 16-bit ("real mode") and 32-bit ("protected mode") for compatibility).

There exist both a 16-bit protected mode (available since the 80286) and a 32-bit protected mode (available since the 80386).


There is also "32 bit realmode", which is not mentioned in the official documentation but simply a combination of existing states. Ditto for real mode paging and the like --- which finds more applications as emulator acrid-tests than anything else.


I thought the 80286 had 32-bit protected mode, but it was badly implemented (the only way to get back to real mode was a reboot), and then they fixed it with the 80386. Unless, are you referring to "unreal mode"?


No, most people forget about it (I had to be reminded by the above comment), but the 80286 did have a 16-bit protected mode. Quoting Wikipedia (https://en.wikipedia.org/wiki/Protected_mode): "[...] Acceptance was additionally hampered by the fact that the 286 only allowed memory access in 16-bit segments via each of four segment registers, meaning only 4*2^16 bytes, equivalent to 256 kilobytes, could be accessed at a time. [...]"


I am still upset, after all this time, at what AMD recklessly did --- out of what might be the same misguided notion as the GP comment, they made instructions like LAHF invalid and removed much of segmentation, only to be forced to put much of it back later because people were really using them. I suspect Intel was actually working on a more consistent extension to 64 bits too, but AMD beat them to it.


If you don't need backwards compatibility (such as on servers), there's not much reason to go x86. For that reason, ARM servers exist. IIRC, AWS and Azure have some.

What obsolete stuff could you remove? If you want to actually meaningfully cut out a large amount of space on the processor, you'll have to cut into the actual instruction space and remove instructions that take up space in your execution units.

Removing the 16-bit and 32-bit modes doesn't actually remove any instructions from the platform (save the binary-coded decimal instructions)--you're largely saving just a few bits of decoder table entries at best. Furthermore, processors reset into 16-bit mode on startup for compatibility reasons, so killing 16-bit and 32-bit mode would introduce major compatibility headaches.

ISA extensions can be more easily removed since there's already a CPUID bit that tells operating systems and applications whether or not they are available. The MPX extension for bounds checking is now regarded as a mistake, and Intel has already confirmed that they are removing it from future processor generations. The TSX extension for transactional memory is apparently on the hit list because of Spectre, and was removed from some processor generations.

The only significant processor execution unit space that is truly obsolete I can think of is the x87 floating-point execution unit logic, with the concomitant MMX execution unit logic--SSE is just strictly better for everything here, except if you're trying to actually get the 80-bit precision. But the existence of 80-bit floating point in the 64-bit ABI (i.e., long double) means you'd have a hard ABI break that would potentially break software even written today, and the pain of breaking that ABI is probably not worth any savings you get out of it.
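For what it's worth, the 80-bit type is observable from ordinary C under the x86-64 SysV ABI (a small check; the exact sizeof includes padding and can vary):

```c
#include <float.h>
#include <stdio.h>

int main(void) {
    /* On the x86-64 SysV ABI, long double is the x87 80-bit format:
       64 mantissa bits (vs. 53 for double), padded out in memory. */
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    printf("LDBL_MANT_DIG = %d, DBL_MANT_DIG = %d\n",
           LDBL_MANT_DIG, DBL_MANT_DIG);   /* typically 64 and 53 */
    return 0;
}
```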

> Furthermore, processors reset into 16-bit mode on startup for compatibility reasons, so killing 16-bit and 32-bit mode would introduce major compatibility headaches.

Didn't the switch to UEFI effectively reduce the scope of this problem to only apply to motherboard firmware? Operating systems no longer need 16-bit code to boot.


There is a lot of x86 software out there. Even games switched to x64 binaries relatively recently. That alone means that you would want to emulate with the performance of, let's say, an i5-2500 if you want a decent framerate, which would be quite challenging given that modern CPUs are 80% faster at best.


I think if something is truly obsolete it is already removed from hardware and emulated by the CPU with microcode.

Yes, it's called the Apple M1.

Once you remove a single opcode, it's not actually x86 any more, and you need emulation at the OS level. But then, once you've done that, why not remove more instructions? Why not remove all of them and start again on a much more power-efficient platform? Why not remove the memory model?

One of the bright ideas in the M1 is a flag for whether the current process insists on the slower but more comprehensible x86 memory ordering.

That's pretty misleading. You can remove many instructions while keeping the CPU x86-compatible.

You can e.g. remove all MMX instructions and signal this fact through CPUID. Very few apps require MMX (as in, can't work without it).

Another option is to remove the instructions from hardware and emulate them in microcode.

Several CPU features require other ones--for instance, AVX requires the XSAVE feature. Additionally, x86-64 implicitly requires several features (most notably SSE2), and the glibc folks have been working on a proposed ABI for x86-64 that groups the feature sets into levels--roughly base (≤SSE2), ≤SSE4.2, ≤AVX2, current skylake-server (a clutch of AVX-512 features), with the BMI and FMA features scattered in there somewhere.

That said, the MMX instructions in particular are so problematic to use (and SSE ubiquitous and strictly better) that I suspect you could introduce a processor that lacks MMX support and break almost nobody, certainly far fewer people than removing x87 would. I don't know if there is an implicit or explicit actual dependency on MMX anywhere.


Perhaps, but compilers mostly assume its existence with the default flags. So there's plenty of software that has no fallback.

We can't really demand compilers to create code that's compatible with all the ancient variations of a currently popular architecture. It's still good manners to include a function that exits with a clear error message about a required architecture feature that's missing.
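With GCC or Clang, such a check can be as small as the following sketch (using the real `__builtin_cpu_supports` builtin; AVX2 is just an example feature):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* GCC/Clang builtin that checks the CPUID feature bits at runtime. */
    if (!__builtin_cpu_supports("avx2")) {
        fprintf(stderr, "error: this program requires a CPU with AVX2\n");
        return EXIT_FAILURE;
    }
    puts("AVX2 present, continuing.");
    return EXIT_SUCCESS;
}
```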

I had similar issues with PPC software that just blindly assumed I had Altivec on my G3. It wasn't fun.


It doesn't matter. The only meaningful change would be to go with fixed-length instructions and possibly get rid of the memory ordering guarantees. If you do that you might as well switch to any other ISA that has these properties. But since Intel has an inferior manufacturing process the new ISA would still suffer from inferior CPUs. Nevertheless, AMD has shown that x86_64 is still viable, so the benefits of switching are minuscule anyway.


Why are people downvoting this? It's a reasonable question. And it has some very helpful answers.

AMD64 has 32 registers: 16 scalar ones and 16 vector ones. Maybe 34 if you count RIP and flags.

From a developer's perspective, the rest of them are either different ways to access these 32-34, or very exotic and rarely used.

P.S. Modern compilers don't usually emit x87 or MMX instructions, because they are often slower compared to SSE (all AMD64 processors are required to support at least SSE1 and SSE2). For instance, FSQRT on Zen 2 has 22 cycles both latency and throughput; VSQRTPD has 20 cycles latency and 8.5 cycles throughput (lower is better), despite taking 4 square roots in one shot. I think it's safe to assume x87 and MMX instructions are just left for backward compatibility with old 32-bit binaries; when writing new code they can be ignored.
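For instance, the four-at-a-time square root via AVX intrinsics looks like this (a sketch; compile with AVX enabled, e.g. -mavx):

```c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* One VSQRTPD computes four double-precision square roots at once,
       vs. one value at a time (and more cycles) for x87 FSQRT. */
    __m256d v = _mm256_set_pd(16.0, 9.0, 4.0, 1.0);
    __m256d r = _mm256_sqrt_pd(v);          /* compiles to vsqrtpd */

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 1 2 3 4 */
    return 0;
}
```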


Source: https://news.ycombinator.com/item?id=25253797
