
From Wikipedia, the free encyclopedia

Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

While the parallelism of modern CPU designs generally plateaus at about eight internal units and one to four "cores", EDGE designs intend to support hundreds of internal units and to offer processing speeds hundreds of times greater than those of existing designs. Major development of the EDGE concept has been led by the University of Texas at Austin under DARPA's Polymorphous Computing Architectures program, with the stated goal of producing a single-chip CPU design with 1 TFLOPS performance by 2012, a goal that had yet to be realized as of 2018.[1]

Traditional designs


Almost all computer programs consist of a series of instructions that convert data from one form to another. Most instructions require several internal steps to complete an operation. Over time, the relative performance and cost of the different steps have changed dramatically, resulting in several major shifts in ISA design.

CISC to RISC


In the 1960s memory was relatively expensive, and CPU designers produced instruction sets that densely encoded instructions and data in order to better utilize this resource. For instance, the add A to B to produce C instruction would be provided in many different forms that would gather A and B from different places: main memory, indexes, or registers. Providing these different instructions allowed the programmer to select the instruction that took up the least possible room in memory, reducing the program's needs and leaving more room for data. For instance, the MOS 6502 has eight instructions (opcodes) for performing addition, differing only in where they collect their operands.[2]
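
As a concrete illustration of this mode-rich encoding, the short sketch below lists the eight 6502 encodings of ADC ("add with carry") together with their instruction lengths. The opcode bytes and operand sizes follow the published 6502 instruction set; the Python table itself is merely a convenient way to display them.

    # The eight 6502 encodings of ADC ("add with carry"), keyed by addressing mode.
    # One operation is offered in many forms so the programmer could pick the
    # smallest encoding that fit.
    ADC_FORMS = {
        "immediate":    (0x69, 1),  # ADC #$nn     - constant follows the opcode
        "zero page":    (0x65, 1),  # ADC $nn      - one-byte address
        "zero page,X":  (0x75, 1),  # ADC $nn,X    - one-byte address plus index
        "absolute":     (0x6D, 2),  # ADC $nnnn    - full two-byte address
        "absolute,X":   (0x7D, 2),  # ADC $nnnn,X
        "absolute,Y":   (0x79, 2),  # ADC $nnnn,Y
        "(indirect,X)": (0x61, 1),  # ADC ($nn,X)  - pointer fetched from zero page
        "(indirect),Y": (0x71, 1),  # ADC ($nn),Y
    }

    for mode, (opcode, operand_bytes) in ADC_FORMS.items():
        print(f"{mode:<13} opcode={opcode:#04x}  length={1 + operand_bytes} bytes")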

Actually making these instructions work required circuitry in the CPU, which was a significant limitation in early designs and required designers to select just those instructions that were really needed. In 1964, IBM introduced its System/360 series which used microcode to allow a single expansive instruction set architecture (ISA) to run across a wide variety of machines by implementing more or fewer instructions in hardware depending on the need.[3] This allowed the 360's ISA to be expansive, and it became the paragon of computer design in the 1960s and 70s, the so-called orthogonal design. This style of memory access, with its wide variety of addressing modes, led to instruction sets with hundreds of different instructions, a style known today as CISC (Complex Instruction Set Computing).

In 1975 IBM started a project to develop a telephone switch that required performance about three times that of their fastest contemporary computers. To reach this goal, the development team began to study the massive amount of performance data IBM had collected over the last decade. This study demonstrated that the complex ISA was in fact a significant problem; because only the most basic instructions were guaranteed to be implemented in hardware, compilers ignored the more complex ones that only ran in hardware on certain machines. As a result, the vast majority of a program's time was being spent in only five instructions. Further, even when the program called one of those five instructions, the microcode required a finite time to decode it, even if it was just to call the internal hardware. On faster machines, this overhead was considerable.[4]

Their work, known at the time as the IBM 801, eventually led to the RISC (Reduced Instruction Set Computing) concept. Microcode was removed, and only the most basic versions of any given instruction were put into the CPU. Any more complex code was left to the compiler. The removal of so much circuitry, about one-third of the transistors in the Motorola 68000 for instance, allowed the CPU to include more registers, which had a direct impact on performance. By the mid-1980s, further developed versions of these basic concepts were delivering performance as much as 10 times that of the fastest CISC designs, in spite of using less-developed fabrication.[4]

Internal parallelism


In the 1990s the chip design and fabrication process grew to the point where it was possible to build a commodity processor with every potential feature built into it. Units that were previously on separate chips, like floating point units and memory management units, were now able to be combined onto the same die, producing all-in-one designs. This allowed different types of instructions to be executed at the same time, improving overall system performance. In the later 1990s, single instruction, multiple data (SIMD) units were also added, and more recently, AI accelerators.

While these additions improve overall system performance, they do not improve the performance of programs which are primarily operating on basic logic and integer math, which is the majority of programs (one of the outcomes of Amdahl's law). To improve performance on these tasks, CPU designs started adding internal parallelism, becoming "superscalar". In any program there are instructions that work on unrelated data, so by adding more functional units these instructions can be run at the same time. A new portion of the CPU, the scheduler, looks for these independent instructions and feeds them into the units, taking their outputs and re-ordering them so externally it appears they ran in succession.
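
A minimal sketch of the dependence test such a scheduler performs is given below. The three-operand register tuples, the window size, and the greedy in-order grouping are illustrative assumptions rather than any particular CPU's design.

    WINDOW = 8  # how many instructions the toy scheduler may examine at once

    def depends(later, earlier):
        """True if 'later' must wait for 'earlier' (read-after-write,
        write-after-read or write-after-write hazard)."""
        d2, s2a, s2b = later
        d1, s1a, s1b = earlier
        return d1 in (s2a, s2b) or d2 in (s1a, s1b) or d1 == d2

    def issue_groups(instrs):
        """Greedily pack in-order batches of mutually independent instructions."""
        groups, current = [], []
        for ins in instrs:
            if len(current) < WINDOW and not any(depends(ins, prev) for prev in current):
                current.append(ins)        # independent: runs alongside the batch
            else:
                groups.append(current)     # dependent or window full: next cycle
                current = [ins]
        if current:
            groups.append(current)
        return groups

    program = [
        ("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r5", "r6"),   # touches different registers: can issue with the first
        ("r7", "r1", "r4"),   # needs both results above: forced into a new group
    ]
    for cycle, group in enumerate(issue_groups(program)):
        print(f"cycle {cycle}: issue {group}")

Even in this toy form, every candidate must be compared against everything already selected, which hints at why enlarging the window makes real schedulers grow so quickly in complexity.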

The amount of parallelism that can be extracted in superscalar designs is limited by the number of instructions that the scheduler can examine for interdependencies. Examining a greater number of instructions can improve the chance of finding an instruction that can be run in parallel, but only at the cost of increasing the complexity of the scheduler itself. Despite massive efforts, CPU designs using classic RISC or CISC ISAs plateaued by the late 2000s. Intel's Haswell designs of 2013 have a total of eight dispatch units,[5] and adding more would significantly complicate the design and increase power demands.[6]

Additional performance can be wrung from systems by examining the instructions to find ones that operate on different types of data and adding units dedicated to that sort of data; this led to the introduction of on-board floating point units in the 1980s and 90s and, more recently, single instruction, multiple data (SIMD) units. The drawback to this approach is that it makes the CPU less generic; feeding the CPU with a program that uses almost all floating point instructions, for instance, will bog down the FPUs while the other units sit idle.

A more recent problem in modern CPU designs is the delay talking to the registers. In general terms the size of the CPU die has remained largely the same over time, while the size of the units within the CPU has grown much smaller as more and more units were added. That means that the relative distance between any one functional unit and the global register file has grown over time. Once introduced in order to avoid delays in talking to main memory, the global register file has itself become a delay that is worth avoiding.

A new ISA?


Just as the growing delay in talking to memory, even as its price fell, suggested the radical change in ISA (Instruction Set Architecture) from CISC to RISC, designers are considering whether the problems of scaling parallelism and the increasing delay in talking to registers demand another switch in the basic ISA.

Among the ways to introduce a new ISA are the very long instruction word (VLIW) architectures, typified by the Itanium. VLIW moves the scheduler logic out of the CPU and into the compiler, where it has much more memory and longer timelines to examine the instruction stream. This static placement, static issue execution model works well when all delays are known, but in the presence of cache latencies, filling instruction words has proven to be a difficult challenge for the compiler.[7] An instruction that might take five cycles if the data is in the cache could take hundreds if it is not, but the compiler has no way to know whether that data will be in the cache at runtime – that's determined by overall system load and other factors that have nothing to do with the program being compiled.

The key performance bottleneck in traditional designs is that the data and the instructions that operate on them are theoretically scattered about memory. Memory performance dominates overall performance, and classic dynamic placement, dynamic issue designs seem to have reached the limit of their performance capabilities. VLIW uses a static placement, static issue model, but has proven difficult to master because the runtime behavior of programs is difficult to predict and properly schedule in advance.

EDGE


Theory


EDGE architectures are a new class of ISAs based on a static placement, dynamic issue design. EDGE systems compile source code into a form consisting of statically allocated hyperblocks containing many individual instructions, often hundreds or thousands. These hyperblocks are then scheduled dynamically by the CPU. EDGE thus combines the advantages of the VLIW concept of looking for independent data at compile time with the superscalar RISC concept of executing the instructions when the data for them becomes available.

In the vast majority of real-world programs, the linkage of data and instructions is both obvious and explicit. Programs are divided into small blocks referred to as subroutines, procedures or methods (depending on the era and the programming language being used) which generally have well-defined entrance and exit points where data is passed in or out. This information is lost as the high-level language is converted into the processor's much simpler ISA. But this information is so useful that modern compilers have generalized the concept as the "basic block", attempting to identify these blocks within programs while they optimize memory access through the registers. A block of instructions does not have control statements but can have predicated instructions. The dataflow graph is encoded using these blocks, by specifying the flow of data from one block of instructions to another, or to some storage area.

The basic idea of EDGE is to directly support and operate on these blocks at the ISA level. Since basic blocks access memory in well-defined ways, the processor can load up related blocks and schedule them so that the output of one block feeds directly into the one that will consume its data. This eliminates the need for a global register file, and simplifies the compiler's task in scheduling access to the registers by the program as a whole – instead, each basic block is given its own local registers and the compiler optimizes access within the block, a much simpler task.
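
One way to picture this block-local communication is the sketch below, which rewrites a small block from conventional register form into a "target form" in which each instruction names the later instructions that will consume its result. The instruction tuples and naming are invented for illustration; real EDGE encodings such as TRIPS differ in detail.

    # register form of a small block: (slot, op, dest, src1, src2)
    block = [
        (0, "add", "t1", "a",  "b"),
        (1, "mul", "t2", "t1", "c"),
        (2, "sub", "t3", "t1", "d"),
        (3, "add", "t4", "t2", "t3"),
    ]

    def to_target_form(block):
        """For each instruction, list the (slot, operand position) pairs that
        consume its result; values produced and used inside the block never
        need a shared register name at all."""
        targets = {slot: [] for slot, *_ in block}
        producer = {}                              # temporary name -> producing slot
        for slot, _op, dest, *sources in block:
            for position, src in enumerate(sources):
                if src in producer:                # forward the value directly
                    targets[producer[src]].append((slot, position))
            producer[dest] = slot
        return targets

    for slot, consumers in to_target_form(block).items():
        print(f"instruction {slot} sends its result to {consumers or 'the block output'}")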

EDGE systems bear a strong resemblance to the dataflow languages of the 1960s–1970s, which saw renewed interest in the 1990s. Dataflow computers execute programs according to the "dataflow firing rule", which stipulates that an instruction may execute at any time after its operands are available. Due to the isolation of data, similar to EDGE, dataflow languages are inherently parallel, and interest in them followed the more general interest in massive parallelism as a solution to general computing problems. Studies based on existing CPU technology at the time demonstrated that it would be difficult for a dataflow machine to keep enough data near the CPU to be widely parallel, and it is precisely this bottleneck that modern fabrication techniques can solve by placing hundreds of CPUs and their memory on a single die.
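
The firing rule itself is simple enough to express as a tiny interpreter, sketched below: an instruction runs as soon as all of its operands have arrived, regardless of program order. The three-operand instruction format and the operator set are illustrative assumptions only.

    import operator

    OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def run_dataflow(instructions, inputs):
        """Fire any instruction whose operands have arrived; repeat until done."""
        values = dict(inputs)               # operands that have "arrived" so far
        pending = list(instructions)
        step = 0
        while pending:
            ready = [i for i in pending if i[2] in values and i[3] in values]
            if not ready:
                raise RuntimeError("deadlock: some operand never arrives")
            step += 1
            for op, dest, a, b in ready:    # everything ready fires in the same step
                values[dest] = OPS[op](values[a], values[b])
            pending = [i for i in pending if i not in ready]
            print(f"step {step}: fired {[dest for _, dest, _, _ in ready]}")
        return values

    program = [
        ("add", "t1", "a", "b"),
        ("add", "t2", "c", "d"),   # independent of t1, so it fires in the same step
        ("mul", "out", "t1", "t2"),
    ]
    print(run_dataflow(program, {"a": 1, "b": 2, "c": 3, "d": 4}))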

Another reason that dataflow systems never became popular is that compilers of the era found it difficult to work with common imperative languages like C++. Instead, most dataflow systems used dedicated languages like Prograph, which limited their commercial interest. A decade of compiler research has eliminated many of these problems, and a key difference between dataflow and EDGE approaches is that EDGE designs intend to work with commonly used languages.

CPUs


An EDGE-based CPU would consist of one or more small block engines with their own local registers; realistic designs might have hundreds of these units. The units are interconnected to each other using dedicated inter-block communication links. Due to the information encoded into the block by the compiler, the scheduler can examine an entire block to see if its inputs are available and send it into an engine for execution – there is no need to examine the individual instructions within.

With a small increase in complexity, the scheduler can examine multiple blocks to see if the outputs of one are fed in as the inputs of another, and place these blocks on units that reduce their inter-unit communications delays. If a modern CPU examines a thousand instructions for potential parallelism, the same complexity in EDGE allows it to examine a thousand hyperblocks, each one consisting of hundreds of instructions. This gives the scheduler considerably better scope at no additional cost. It is this pattern of operation that gives the concept its name; the "graph" is the string of blocks connected by the data flowing between them.
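
A placement of this kind might be sketched as follows: each block is assigned greedily to the free engine nearest the engines that feed it. The 4-by-4 grid, the Manhattan-distance cost, and the greedy ordering are assumptions made for illustration, not a description of any real scheduler.

    from itertools import product

    GRID = list(product(range(4), range(4)))     # sixteen block-engine coordinates

    def place(blocks, edges):
        """Greedy placement: put each block on the free engine closest (in
        Manhattan distance) to the engines already feeding it."""
        placement, free = {}, set(GRID)
        for block in blocks:
            sources = [placement[p] for p, c in edges if c == block and p in placement]
            def cost(slot):
                return sum(abs(slot[0] - x) + abs(slot[1] - y) for x, y in sources)
            chosen = min(sorted(free), key=cost)
            placement[block] = chosen
            free.remove(chosen)
        return placement

    blocks = ["load", "filter", "sum", "store"]              # placement order
    edges = [("load", "filter"), ("filter", "sum"), ("sum", "store")]
    for name, engine in place(blocks, edges).items():
        print(f"block {name!r} -> engine {engine}")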

Another advantage of the EDGE concept is that it is massively scalable. A low-end design could consist of a single block engine with a stub scheduler that simply sends in blocks as they are called by the program. An EDGE processor intended for desktop use would instead include hundreds of block engines. Critically, all that changes between these designs is the physical layout of the chip and private information that is known only by the scheduler; a program written for the single-unit machine would run without any changes on the desktop version, albeit thousands of times faster. Power scaling is likewise dramatically improved and simplified; block engines can be turned on or off as required with a linear effect on power consumption.

Perhaps the greatest advantage to the EDGE concept is that it is suitable for running any sort of data load. Unlike modern CPU designs where different portions of the CPU are dedicated to different sorts of data, an EDGE CPU would normally consist of a single type of ALU-like unit. A desktop user running several different programs at the same time would get just as much parallelism as a scientific user feeding in a single program using floating point only; in both cases the scheduler would simply load every block it could into the units. At a low level the performance of the individual block engines would not match that of a dedicated FPU, for instance, but it would attempt to overwhelm any such advantage through massive parallelism.

Implementations


TRIPS


The University of Texas at Austin was developing an EDGE ISA known as TRIPS. In order to simplify the microarchitecture of a CPU designed to run it, the TRIPS ISA imposes several well-defined constraints on each TRIPS hyperblock; each hyperblock must:

  • have at most 128 instructions,
  • issue at most 32 loads and/or stores,
  • issue at most 32 register bank reads and/or writes,
  • have one branch decision, used to indicate the end of a block.
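
A minimal sketch of checking these four constraints is shown below. The block representation, a list of instruction records with a "kind" field, is invented for illustration and is not the actual TRIPS encoding.

    LIMITS = {"instructions": 128, "loads/stores": 32, "register accesses": 32}

    def check_hyperblock(instructions):
        """instructions: list of dicts like {"kind": "load"}, where kind is one of
        load, store, read, write, branch or alu. Returns the violated constraints."""
        counts = {
            "instructions": len(instructions),
            "loads/stores": sum(i["kind"] in ("load", "store") for i in instructions),
            "register accesses": sum(i["kind"] in ("read", "write") for i in instructions),
        }
        problems = [name for name, limit in LIMITS.items() if counts[name] > limit]
        if sum(i["kind"] == "branch" for i in instructions) != 1:
            problems.append("exactly one branch decision")
        return problems

    block = [{"kind": "read"}, {"kind": "alu"}, {"kind": "store"}, {"kind": "branch"}]
    print(check_hyperblock(block) or "block satisfies the TRIPS constraints")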

The TRIPS compiler statically bundles instructions into hyperblocks, but also statically compiles these blocks to run on particular ALUs. This means that TRIPS programs have some dependency on the precise implementation they are compiled for.

In 2003 they produced a sample TRIPS prototype with sixteen block engines in a 4 by 4 grid, along with a megabyte of local cache and transfer memory. A single chip version of TRIPS, fabbed by IBM in Canada using a 130 nm process, contains two such "grid engines" along with shared level-2 cache and various support systems. Four such chips and a gigabyte of RAM are placed together on a daughter-card for experimentation.

The TRIPS team had set an ultimate goal of producing a single-chip implementation capable of running at a sustained performance of 1 TFLOPS, about 50 times the performance of high-end commodity CPUs available in 2008 (the dual-core Xeon 5160 provides about 17 GFLOPS).

CASH


CMU's CASH is a compiler that produces an intermediate code called "Pegasus".[8] CASH and TRIPS are very similar in concept, but CASH is not targeted to produce output for a specific architecture, and therefore has no hard limits on the block layout.

WaveScalar


The University of Washington's WaveScalar architecture is substantially similar to EDGE, but does not statically place instructions within its "waves". Instead, special instructions (phi and rho) mark the boundaries of the waves and allow scheduling.[9]

References


Citations

  1. ^ University of Texas at Austin, "TRIPS: One Trillion Calculations per Second by 2012"
  2. ^ Pickens, John (17 October 2020). "NMOS 6502 Opcodes".
  3. ^ Shirriff, Ken. "Simulating the IBM 360/50 mainframe from its microcode".
  4. ^ a b Cocke, John; Markstein, Victoria (January 1990). "The evolution of RISC technology at IBM" (PDF). IBM Journal of Research and Development. 34 (1): 4–11. doi:10.1147/rd.341.0004.
  5. ^ Shimpi, Anand Lal (5 October 2012). "Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel". AnandTech.
  6. ^ Tseng, Francis; Patt, Yale (June 2008). "Achieving Out-of-Order Performance with Almost In-Order Complexity". ACM SIGARCH Computer Architecture News. 36 (3): 3–12. doi:10.1145/1394608.1382169.
  7. ^ W. Havanki, S. Banerjia, and T. Conte. "Treegion scheduling for wide-issue processors", in Proceedings of the Fourth International Symposium on High-Performance Computer Architectures, January 1998, pp. 266–276
  8. ^ "Phoenix Project"
  9. ^ "The WaveScalar ISA"
