From Wikipedia, the free encyclopedia

Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

Parallelism in modern CPU designs generally starts to plateau at about eight internal units and from one to four "cores"; EDGE designs intend to support hundreds of internal units and offer processing speeds hundreds of times greater than existing designs. Major development of the EDGE concept was led by the University of Texas at Austin under DARPA's Polymorphous Computing Architectures program, with the stated goal of producing a single-chip CPU design with 1 TFLOPS performance by 2012, a goal that had yet to be realized as of 2018.[1]

Traditional designs

Almost all computer programs consist of a series of instructions that convert data from one form to another. Most instructions require several internal steps to complete an operation. Over time, the relative performance and cost of the different steps have changed dramatically, resulting in several major shifts in ISA design.

CISC to RISC

In the 1960s memory was relatively expensive, and CPU designers produced instruction sets that densely encoded instructions and data in order to make better use of this resource. For instance, the add A to B to produce C instruction would be provided in many different forms that would gather A and B from different places: main memory, indexes, or registers. Providing these different instructions allowed the programmer to select the instruction that took up the least possible room in memory, reducing the program's memory needs and leaving more room for data. For instance, the MOS 6502 has eight instructions (opcodes) for performing addition, differing only in where they collect their operands.[2]

Actually making these instructions work required circuitry in the CPU, which was a significant limitation in early designs and forced designers to select just those instructions that were really needed. In 1964, IBM introduced its System/360 series, which used microcode to allow a single expansive instruction set architecture (ISA) to run across a wide variety of machines by implementing more or fewer instructions in hardware depending on the need.[3] This allowed the 360's ISA to be expansive, and it became the paragon of computer design in the 1960s and 70s, the so-called orthogonal design. This style of memory access, with its wide variety of addressing modes, led to instruction sets with hundreds of different instructions, a style known today as CISC (Complex Instruction Set Computing).

In 1975 IBM started a project to develop a telephone switch that required performance about three times that of their fastest contemporary computers. To reach this goal, the development team began to study the massive amount of performance data IBM had collected over the last decade. This study demonstrated that the complex ISA was in fact a significant problem; because only the most basic instructions were guaranteed to be implemented in hardware, compilers ignored the more complex ones that only ran in hardware on certain machines. As a result, the vast majority of a program's time was being spent in only five instructions. Further, even when the program called one of those five instructions, the microcode required a finite time to decode it, even if it was just to call the internal hardware. On faster machines, this overhead was considerable.[4]

Their work, known at the time as the IBM 801, eventually led to the RISC (Reduced Instruction Set Computing) concept. Microcode was removed, and only the most basic versions of any given instruction were put into the CPU. Any more complex code was left to the compiler. The removal of so much circuitry, about one third of the transistors in the Motorola 68000 for instance, allowed the CPU to include more registers, which had a direct impact on performance. By the mid-1980s, further developed versions of these basic concepts were delivering performance as much as 10 times that of the fastest CISC designs, in spite of using less-developed fabrication.[4]

Internal parallelism

In the 1990s the chip design and fabrication process grew to the point where it was possible to build a commodity processor with every potential feature built into it. Units that were previously on separate chips, like floating point units and memory management units, could now be combined onto the same die, producing all-in-one designs. This allows different types of instructions to be executed at the same time, improving overall system performance. In the later 1990s, single instruction, multiple data (SIMD) units were also added, and more recently, AI accelerators.

While these additions improve overall system performance, they do not improve the performance of programs which are primarily operating on basic logic and integer math, which is the majority of programs (one of the outcomes of Amdahl's law). To improve performance on these tasks, CPU designs started adding internal parallelism, becoming "superscalar". In any program there are instructions that work on unrelated data, so by adding more functional units these instructions can be run at the same time. A new portion of the CPU, the scheduler, looks for these independent instructions and feeds them into the units, taking their outputs and re-ordering them so externally it appears they ran in succession.
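
The independence test such a scheduler performs can be illustrated with a short sketch. The following Python fragment is a simplified, hypothetical model rather than any real scheduler: it tracks only which named results each instruction reads, ignores write-after-read and write-after-write hazards, and greedily groups instructions into "waves" that could issue in the same cycle.

```python
# Hypothetical sketch: group a linear instruction stream into waves of
# mutually independent instructions by tracking which earlier results
# each instruction reads. Real schedulers also handle WAR/WAW hazards.

def schedule_waves(instructions):
    """instructions: list of (destination, [source registers])."""
    waves, produced_in_wave = [], []
    for dest, sources in instructions:
        # The earliest wave this instruction can join is the one after
        # the last wave that produces any of its sources.
        last_dep = -1
        for i, produced in enumerate(produced_in_wave):
            if any(src in produced for src in sources):
                last_dep = i
        target = last_dep + 1
        if target == len(waves):
            waves.append([])
            produced_in_wave.append(set())
        waves[target].append(dest)
        produced_in_wave[target].add(dest)
    return waves

program = [
    ("r3", ["r1", "r2"]),   # r3 = r1 + r2
    ("r6", ["r4", "r5"]),   # r6 = r4 + r5  (independent of the first)
    ("r7", ["r3", "r6"]),   # r7 = r3 * r6  (depends on both)
]
print(schedule_waves(program))   # [['r3', 'r6'], ['r7']]
```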

The amount of parallelism that can be extracted in superscalar designs is limited by the number of instructions that the scheduler can examine for interdependencies. Examining a greater number of instructions improves the chance of finding an instruction that can be run in parallel, but only at the cost of increasing the complexity of the scheduler itself. Despite massive efforts, CPU designs using classic RISC or CISC ISAs plateaued by the late 2000s. Intel's Haswell designs of 2013 have a total of eight dispatch units,[5] and adding more significantly complicates the design and increases power demands.[6]

Additional performance can be wrung from systems by examining the instructions to find ones that operate on different types of data and adding units dedicated to that sort of data; this led to the introduction of on-board floating point units in the 1980s and 90s and, more recently, single instruction, multiple data (SIMD) units. The drawback to this approach is that it makes the CPU less generic; feeding the CPU with a program that uses almost all floating point instructions, for instance, will bog the FPUs while the other units sit idle.

A more recent problem in modern CPU designs is the delay in talking to the registers. In general terms the size of the CPU die has remained largely the same over time, while the size of the units within the CPU has grown much smaller as more and more units were added. That means that the relative distance between any one functional unit and the global register file has grown over time. Once introduced in order to avoid delays in talking to main memory, the global register file has itself become a delay that is worth avoiding.

A new ISA?

Just as the growing delay in talking to memory, even as its price fell, suggested a radical change in ISA (Instruction Set Architecture) from CISC to RISC, designers are considering whether the problems of scaling parallelism and the increasing delays in talking to registers demand another switch in basic ISA.

Among the ways to introduce a new ISA are the very long instruction word (VLIW) architectures, typified by the Itanium. VLIW moves the scheduler logic out of the CPU and into the compiler, where it has much more memory and longer timelines to examine the instruction stream. This static placement, static issue execution model works well when all delays are known, but in the presence of cache latencies, filling instruction words has proven to be a difficult challenge for the compiler.[7] An instruction that might take five cycles if the data is in the cache could take hundreds if it is not, but the compiler has no way to know whether that data will be in the cache at runtime – that's determined by overall system load and other factors that have nothing to do with the program being compiled.

The key performance bottleneck in traditional designs is that the data and the instructions that operate on them are theoretically scattered about memory. Memory performance dominates overall performance, and classic dynamic placement, dynamic issue designs seem to have reached the limit of their performance capabilities. VLIW uses a static placement, static issue model, but has proven difficult to master because the runtime behavior of programs is difficult to predict and properly schedule in advance.

EDGE

Theory

EDGE architectures are a new class of ISAs based on a static placement, dynamic issue design. EDGE systems compile source code into a form consisting of statically allocated hyperblocks containing many individual instructions, often hundreds or thousands of them. These hyperblocks are then scheduled dynamically by the CPU. EDGE thus combines the advantages of the VLIW concept of finding independent data at compile time with the superscalar RISC concept of executing instructions when the data for them becomes available.

In the vast majority of real-world programs, the linkage of data and instructions is both obvious and explicit. Programs are divided into small blocks referred to as subroutines, procedures or methods (depending on the era and the programming language being used) which generally have well-defined entrance and exit points where data is passed in or out. This information is lost as the high level language is converted into the processor's much simpler ISA. But this information is so useful that modern compilers have generalized the concept as the "basic block", attempting to identify them within programs while they optimize memory access through the registers. A block of instructions does not have control statements but can have predicated instructions. The dataflow graph is encoded using these blocks, by specifying the flow of data from one block of instructions to another, or to some storage area.
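
A rough illustration of what this analysis produces may help; the fragment below is purely hypothetical and simply splits a toy three-address program into basic blocks at its labels, the raw material over which a compiler would then build a dataflow graph.

```python
# Hypothetical example: split a toy three-address program into basic
# blocks at its labels. A compiler would then record how data flows
# between these blocks (and to memory) to form the dataflow graph.

program = [
    "entry:", "a = load x", "b = load y", "branch a>b then else",
    "then:",  "c = a - b",  "jump exit",
    "else:",  "c = b - a",  "jump exit",
    "exit:",  "store z, c",
]

blocks, current = {}, None
for line in program:
    if line.endswith(":"):        # a label starts a new basic block
        current = line[:-1]
        blocks[current] = []
    else:
        blocks[current].append(line)

for name, body in blocks.items():
    print(name, "->", body)
```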

The basic idea of EDGE is to directly support and operate on these blocks at the ISA level. Since basic blocks access memory in well-defined ways, the processor can load up related blocks and schedule them so that the output of one block feeds directly into the one that will consume its data. This eliminates the need for a global register file, and simplifies the compiler's task in scheduling access to the registers by the program as a whole – instead, each basic block is given its own local registers and the compiler optimizes access within the block, a much simpler task.
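
How a block might execute can be sketched in a few lines. The following Python model is hypothetical and greatly simplified (it is not the TRIPS encoding): each instruction in the block names the instructions that consume its result rather than writing to a shared register file, and it fires as soon as all of its operands have arrived.

```python
# Hypothetical model of execution inside one block: results are forwarded
# directly to consumer instructions, and an instruction fires once every
# operand it needs has arrived (the dataflow firing rule).

block = {
    # id: (operation, operand count, consumer instruction ids)
    "i0": ("const 2",  0, ["i2"]),
    "i1": ("const 40", 0, ["i2"]),
    "i2": ("add",      2, ["i3"]),
    "i3": ("output",   1, []),
}

operands = {name: [] for name in block}
ready = [name for name, (_, count, _) in block.items() if count == 0]

while ready:
    inst = ready.pop()
    op, _, consumers = block[inst]
    if op.startswith("const"):
        result = int(op.split()[1])
    elif op == "add":
        result = sum(operands[inst])
    else:                                         # "output"
        print("block result:", operands[inst][0])  # prints 42
        continue
    for target in consumers:                      # forward the result directly
        operands[target].append(result)
        if len(operands[target]) == block[target][1]:
            ready.append(target)
```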

EDGE systems bear a strong resemblance to dataflow languages from the 1960s–1970s, and again in the 1990s. Dataflow computers execute programs according to the "dataflow firing rule", which stipulates that an instruction may execute at any time after its operands are available. Due to the isolation of data, similar to EDGE, dataflow languages are inherently parallel, and interest in them followed the more general interest in massive parallelism as a solution to general computing problems. Studies based on existing CPU technology at the time demonstrated that it would be difficult for a dataflow machine to keep enough data near the CPU to be widely parallel, and it is precisely this bottleneck that modern fabrication techniques can solve by placing hundreds of CPUs and their memory on a single die.

Another reason that dataflow systems never became popular is that compilers of the era found it difficult to work with common imperative languages like C++. Instead, most dataflow systems used dedicated languages like Prograph, which limited their commercial interest. A decade of compiler research has eliminated many of these problems, and a key difference between dataflow and EDGE approaches is that EDGE designs intend to work with commonly used languages.

CPUs

An EDGE-based CPU would consist of one or more small block engines with their own local registers; realistic designs might have hundreds of these units. The units are interconnected to each other using dedicated inter-block communication links. Due to the information encoded into the block by the compiler, the scheduler can examine an entire block to see if its inputs are available and send it into an engine for execution – there is no need to examine the individual instructions within.

With a small increase in complexity, the scheduler can examine multiple blocks to see if the outputs of one are fed in as the inputs of another, and place these blocks on units that reduce their inter-unit communications delays. If a modern CPU examines a thousand instructions for potential parallelism, the same complexity in EDGE allows it to examine a thousand hyperblocks, each one consisting of hundreds of instructions. This gives the scheduler considerably better scope for no additional cost. It is this pattern of operation that gives the concept its name; the "graph" is the string of blocks connected by the data flowing between them.
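
A toy sketch of this block-level scheduling is shown below; the block names, the four-engine layout, and the adjacency preference are all hypothetical, chosen only to illustrate the idea of dispatching whole blocks and placing consumers near their producers.

```python
# Hypothetical sketch: dispatch whole blocks once their producer blocks
# have finished, preferring an engine adjacent to a producer so that the
# inter-block data travels a shorter distance. Engines are never freed
# in this toy version.

blocks = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}   # block -> producer blocks
free_engines = [0, 1, 2, 3]                                   # four engines in a row
engine_of, done = {}, set()

while len(done) < len(blocks):
    for name, producers in blocks.items():
        if name in done or not all(p in done for p in producers):
            continue
        adjacent = [engine_of[p] + 1 for p in producers
                    if engine_of[p] + 1 in free_engines]
        engine = adjacent[0] if adjacent else free_engines[0]
        free_engines.remove(engine)
        engine_of[name] = engine
        done.add(name)
        print(f"block {name} -> engine {engine}")
```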

Another advantage of the EDGE concept is that it is massively scalable. A low-end design could consist of a single block engine with a stub scheduler that simply sends in blocks as they are called by the program. An EDGE processor intended for desktop use would instead include hundreds of block engines. Critically, all that changes between these designs is the physical layout of the chip and private information that is known only by the scheduler; a program written for the single-unit machine would run without any changes on the desktop version, albeit thousands of times faster. Power scaling is likewise dramatically improved and simplified; block engines can be turned on or off as required with a linear effect on power consumption.

Perhaps the greatest advantage to the EDGE concept is that it is suitable for running any sort of data load. Unlike modern CPU designs where different portions of the CPU are dedicated to different sorts of data, an EDGE CPU would normally consist of a single type of ALU-like unit. A desktop user running several different programs at the same time would get just as much parallelism as a scientific user feeding in a single program using floating point only; in both cases the scheduler would simply load every block it could into the units. At a low level the performance of the individual block engines would not match that of a dedicated FPU, for instance, but it would attempt to overwhelm any such advantage through massive parallelism.

Implementations

TRIPS

The University of Texas at Austin was developing an EDGE ISA known as TRIPS. In order to simplify the microarchitecture of a CPU designed to run it, the TRIPS ISA imposes several well-defined constraints on each TRIPS hyperblock (a sketch of a corresponding compile-time check follows the list); each hyperblock must:

  • have at most 128 instructions,
  • issue at most 32 loads and/or stores,
  • issue at most 32 register bank reads and/or writes,
  • have one branch decision, used to indicate the end of a block.
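
These limits lend themselves to a simple compile-time check. The sketch below is illustrative only: the counts come from the list above, while the instruction representation and function name are hypothetical.

```python
# Illustrative check of the TRIPS hyperblock limits listed above. Only the
# numeric limits come from the TRIPS ISA; the representation is made up.

MAX_INSTRUCTIONS = 128
MAX_LOAD_STORES = 32
MAX_REGISTER_OPS = 32      # register bank reads and/or writes
BRANCHES_PER_BLOCK = 1     # one branch decision marks the end of the block

def hyperblock_is_valid(instructions):
    """instructions: list of kinds such as 'load', 'store', 'reg_read',
    'reg_write', 'branch', or 'alu'."""
    return (
        len(instructions) <= MAX_INSTRUCTIONS
        and sum(k in ("load", "store") for k in instructions) <= MAX_LOAD_STORES
        and sum(k in ("reg_read", "reg_write") for k in instructions) <= MAX_REGISTER_OPS
        and sum(k == "branch" for k in instructions) == BRANCHES_PER_BLOCK
    )

print(hyperblock_is_valid(["reg_read", "alu", "alu", "branch"]))   # True
```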

The TRIPS compiler statically bundles instructions into hyperblocks, but also statically compiles these blocks to run on particular ALUs. This means that TRIPS programs have some dependency on the precise implementation they are compiled for.

In 2003 they produced a sample TRIPS prototype with sixteen block engines in a 4 by 4 grid, along with a megabyte of local cache and transfer memory. A single chip version of TRIPS, fabbed by IBM in Canada using a 130 nm process, contains two such "grid engines" along with shared level-2 cache and various support systems. Four such chips and a gigabyte of RAM are placed together on a daughter-card for experimentation.

The TRIPS team had set an ultimate goal of producing a single-chip implementation capable of running at a sustained performance of 1 TFLOPS, about 50 times the performance of high-end commodity CPUs available in 2008 (the dual-core Xeon 5160 provides about 17 GFLOPS).

CASH

CMU's CASH is a compiler that produces an intermediate code called "Pegasus".[8] CASH and TRIPS are very similar in concept, but CASH is not targeted to produce output for a specific architecture, and therefore has no hard limits on the block layout.

WaveScalar

The University of Washington's WaveScalar architecture is substantially similar to EDGE, but does not statically place instructions within its "waves". Instead, special instructions (phi and rho) mark the boundaries of the waves and allow scheduling.[9]

References

Citations

  1. ^ University of Texas at Austin, "TRIPS: One Trillion Calculations per Second by 2012"
  2. ^ Pickens, John (17 October 2020). "NMOS 6502 Opcodes".
  3. ^ Shirriff, Ken. "Simulating the IBM 360/50 mainframe from its microcode".
  4. ^ a b Cocke, John; Markstein, Victoria (January 1990). "The evolution of RISC technology at IBM" (PDF). IBM Journal of Research and Development. 34 (1): 4–11. doi:10.1147/rd.341.0004.
  5. ^ Shimpi, Anand Lal (5 October 2012). "Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel". AnandTech. Archived from the original on April 24, 2013.
  6. ^ Tseng, Francis; Patt, Yale (June 2008). "Achieving Out-of-Order Performance with Almost In-Order Complexity". ACM SIGARCH Computer Architecture News. 36 (3): 3–12. doi:10.1145/1394608.1382169.
  7. ^ W. Havanki, S. Banerjia, and T. Conte. "Treegion scheduling for wide-issue processors", in Proceedings of the Fourth International Symposium on High-Performance Computer Architectures, January 1998, pg. 266–276
  8. ^ "Phoenix Project"
  9. ^ "The WaveScalar ISA"
