On Identifying Instruction Criticality for Energy-Aware Applications.docx

资源描述

《On Identifying Instruction Criticality for Energy-Aware Applications.docx》由会员分享，可在线阅读，更多相关《On Identifying Instruction Criticality for Energy-Aware Applications.docx（20页珍藏版）》请在三一文库上搜索。

1、On Identifying Instruction Criticality for Energy-Aware ApplicationsOn Identifying Instruction Criticality for Energy-Aware ApplicationsAkihiro Chiyonobu1Toshinori Sato1,21Department of Arti?cial Intelligence,Kyushu Institute of Technology2Center for Microelectronic Systems,Kyushu Institute of Techn

2、ologyAbstractRecent studies show that microprocessor performance orits energy ef?ciency can be improved,if we utilize infor-mation regarding instruction criticality.This paper inves-tigates why we could not achieve low energy consumptionand high performance simultaneously,and reveals there isa funda

3、mental problem in heuristics to identify critical in-structions.1IntroductionCurrently,smart mobile devices and embedded systemsrequire high computing capability,thus employing high per-formance microprocessors.In addition,however,they re-quire low power consumption as well as high performance.Becau

4、se there is a tradeoff between power consumption andperformance in microprocessors,power is the primary de-sign constraint in embedded microprocessors for mobile de-vices.The active power P active and gate delay t pd of aCMOS circuit are given byP activefC load V2dd(1)t pdV dd(V dd?V th),(2)where f

5、is clock frequency,C load is load capacitance,V dd is supply voltage,and V th is the threshold voltage of the de-vice.is a factor depending upon the carrier velocity satu-ration and is about1.31.5in advanced MOSFETs.Eq.(1) shows clearly that power supply reduction is the most effec-tive way to lower

6、 power consumption.However,Eq.(2)tells us that supply voltage reduction increases gate delay,result-ing in a slower clock frequency.Thus,microprocessor per-formance is diminished.In order to solve this problem,we decided to exploit information regarding critical path in ex-ecuting a program2.Actuall

7、y,a microprocessor has dual-power functional units.That means that it has several func-tional units distinguished in terms of their execution latency and power consumption,and that instructions on the critical path,which determines the execution time of the program, are executed in fast and power-hu

8、ngry units and instructions on the non-critical path are executed in slow and power-ef?cient http:/ this scheduling strategy,microproces-sor power consumption can be reduced while maintaining its performance.Differences from the previous studies on utilizing instruction criticality are as follows.Tu

9、ne et al.9 and Kobayashi et al.6focus on performance improvement. Seng et al.8attacks peak power but dont consider total power;i.e.energy.In construct,we are interested in energy reduction because energy rather than power is more impor-tant for battery-operated devices.Thus,our goal is reducing ener

10、gy consumption of microprocessors with maintaining their performance.Unfortunately,however,it has not been achieved3,4.Thus,this paper investigates why instruc-tion scheduling utilizing critical path information does not improve energy ef?ciency with maintaining processor per-formance.2Energy-Aware

11、Instruction Scheduling Most contemporary microprocessors execute instruc-tions in an out-of-order fashion in order to reduce the ex-ecution time of a program.The execution time is deter-mined by the microprocessors computing capability and by dependences between instructions executed on the mi-cropr

12、ocessor.The critical path is the longest path in a data ?ow graph,where each node represents an instruction and each arc represents a dependence between instructions,and it determines the execution time of the program6.Figure 1shows an example of a data?ow graph(DFG).In this ex-ample,its critical pa

13、th consists of instructions I:0-I:3-I:4-I:6-I:7when every instructions latency is as-sumed to be1cycle.We propose that a microprocessor has several functional units distinguished in terms of their execution latency and power consumption,and that instructions on the critical path,which determines the

14、 execution time of the program, are executed in fast and power-hungry units and instructions on the non-critical path are executed in slow and power-I : 5I : 7I : 3I : 4I : 6I : 0Figure1.Critical Pathef?cient http:/ this scheduling strategy,we can re-duce microprocessor energy consumption while main

15、tain-ing its performance.This paper describes critical path iden-ti?cation methods,which are an essential component for this dual-pipeline architecture.3Critical Path PredictorsIn order to identify if each instruction is critical or not,we utilize critical path predictors.Recently,critical path pred

16、iction has been proposed for improving instruc-tion schedulers5,9.This section describes critical path predictors.3.1Per-address Critical Path PredictorsA critical path predictor proposed by Tune et al.9pre-dicts every instructions criticality based on its past history regarding criticality.A critic

17、al path history table(CPHT) is its main component and consists of up-down counters in-dexed by the program counter(PC),as shown in Figure2. Each counter is incremented if its associated instruction is critical.Otherwise,it is http:/ during instruc-tion scheduling,the CPHT is referred to and if the c

18、ounter value corresponding with the instruction is larger than the threshold,it is predicted as critical.Otherwise,it is pre-dicted as non-critical.The bit width,the threshold for deter-mining criticality,incremental value,and decrement value depend upon each implementation decision.In order to iden

19、tify every instructions criticality at the commit stage, several heuristics are proposed9.If we can identify which instructions are critical,we can accelerate their execution by any means5,6,9.3.2Correlation-based Critical Path PredictorsTune et al.s critical pathpredictor9utilizes only the local hi

20、story of each instruction.We assume that each in-structions criticality is affected by dependences betweenFigure2.Tunes Critical Path Predictor instructions and that prediction accuracy is improved by utilizing correlations between the histories of each instruc-tion.We then proposed correlation-base

21、d critical path predictors2.They are classi?ed into three categories: GCPH-type,GBH-type,and BOTH-type predictors.The following sections describe them in detail.In order to utilize correlations between the histories of each instructions criticality,a shift register is added to the CPHT.This shift re

22、gister keeps a history of the latest in-structionscriticality.Each bit in the register indicates if its corresponding instruction was critical at the last step. When the latest instructions criticality is determined,its information is inserted to the top of the shift register,and information kept in

23、 the bottom of the register is removed. Because this shift register saves global history through the latest instructions instead of each instructions local history, we named it the Global critical path history(GCPH)regis-ter.The GCPH-type correlation-based critical path predic-tor is shown in Figure

24、3.It is possible to utilize the correla-tion between instructions by maintaining the global critical-ity history in the GCPH register.Because each instructions criticality affects its dependent instructionscriticality,the global history might be useful for accurate critical path pre-diction.Branch i

25、nstructions reduce instruction-level parallelism in programs and thus diminishes processor performance. This is because it is impossible to?nd the target instruc-tion of each branch instruction until it is executed.In order to solve this problem,many studies have addressed branch prediction.One of t

26、he branch predictors is the correlation-based branch predictor.It saves the global branch outcome history in a shift register as does the GCHP-type critical path predictor.This register is called the Branch history reg-ister(BHR).We assume that branch outcome history affects every instructions criti

27、cality and thus propose a correlation-Figure3.GCPH-type Critical Path Predictor based predictor utilizing branch outcome history.Figure4 shows how the critical path predictor utilizes branch history. The index to CPHT is made from the PC and the BHR.We expect that this combination of global branch h

28、istory and local criticality history will improve critical path prediction accuracy.We named it the GBH-type critical path predictor.Figure4.GBH-type Critical Path PredictorIn brief,we expect that utilizing both global criticality history and global branch history will improve critical path predicti

29、on accuracy.Hence,we propose to combine the GCPH-type predictor and the GBH-type predictor.Their combination might result in an effective synergy,and we named this type of predictor the BOTH-type critical path predictor.Figure5shows the BOTH-type predictor.The PC,the GCPH register,and the BHR are us

30、ed for generat-ing the index to CPHT.Figure5.BOTH-type Critical Path Predictor4Evaluation MethodologyThis section describes our processor model and bench-mark programs,which are used in simulations,in order to explain what environment is used for our evaluation.We use the sim-outorder simulator from

31、 the Sim-pleScalar tool set(version3.0b)1to implement our sim-ulator.It is a cycle-by-cycle simulator.The proposed correlation-based predictors are included in the simulator in detail.In this papaer,we use2K-entry GCPH-type pre-dictor.It is a direct-mapped table of6-bit saturating up-down counters.W

32、hen an instruction is identi?ed as criti-cal according to the QOLD,ALOLD,QCONS,or FREED3 heuristics9,its associated counter is incremented by8. Otherwise,it is decremented by1.We will present simu-lation results using the QOLD heuristics,when we do not specify which heuristics is used.The counters a

33、re updated speculatively at the issue stage.The counter threshold at which an instruction is predicted as critical is set to8.The criticality history length saved in the GCPH register is8 instructions.The incremental value,the threshold value, and the history length were determined based on prelimi-

34、nary simulations.The fast functional units can execute most integer oper-ations in one cycle,while the slow functional units execute operations in two cycles.In the rest of this paper,functional units means integer units.Both the fast and slow units share their circuit design,while each transistors

35、size and thresh-old voltage might be optimized independently.We assume that the supply voltages for fast and slow units are assumed to be1.1V and0.7V,respectively7.Our processor model has3fast functional units and3slow units.Table1shows the processors con?guration.Instruction set architecture(ISA)is

36、 the Sim-pleScalar/PISA ISA,which is an extension of MIPS R10000ISA.The SPEC2000CINT benchmark suite is used for this study.Table2lists the benchmarks and theTable1.Processor modelFetch Bandwidth8instructionsBranch Predictor1K-set4-way set-associative BTB,4K-entry8-history-length gshare predictor, 6

37、4-entry return address stack,6-cycle miss penalty,updated at commit stage Insn.Windows32-entry instruction queue,32-entry load/store queueIssue Width8instructionsCommit Width8instructionsFunctional Units6Int,3FP,4Ld,StLatency iALU1/1,iMUL8/1,iDIV32/1,fADD4/1,fCMP4/1,fCVT4/1,fMUL4/1, (total/issue)fDI

38、V32/1,fSQRT32/1,Ld/St2/1Register Files3232-bit Int registers,3232-bit FP registersInsn.Cache64KB,2-way,64B blocks,18-cycle miss penaltyData Cache64KB,2-way,64B blocks,4-port,write-back,non-blocking load,hit under miss, 18-cycle miss penaltyL2Cache uni?ed,1MB,4-way,64B blocks,80-cycle miss penaltyinp

39、ut sets.For each program,1billion instructions are skipped before the actual simulation begins,and the next 100million instructions are executed.We do not count NOP instructions.Table2.Benchmark programsBenchmark input set164.gzip http:/ net.in arch.in176.gcc cccp.i197.parser test.in255.vortex lendi

40、an.raw256.bzip2input.random5Dif?culty in Identifying Critical Instruc-tionsIn order to utilize information regarding instruction crit-icality,we rely on critical path predictor(CPP)2,9.Un-fortunately,its accuracy is not enough high to achieve our goal.We have already known that we can perform low en

41、-ergy consumption and maintain processing performance if exact critical path information is used.Figure6shows pro-cessor performance and energy delay product(EDP).The ?rst bar is for our baseline model,which does not exploit in-struction criticality,and the second one is for the model uti-lizing CPP

42、.The third one shows processor performance and EDP,if path information table(PIT)6is used for identify-ing critical instructions.PIT obtains accurate critical path information,since it creates DFG in the instruction window for every cycle.We guess the energy consumed in PIT is very large,however,bec

43、ause of its complexity in hardware.In construct,the CPPs hardware structure is very simple, and it is estimated its energy consumption is1.7%of64KB data cache.Hence,it is better to improve prediction accu-racy of the CPP rather than utilizing PIT.We guess that the low prediction accuracy of the CPP

44、is due to the information which trains the predictor.Iden-tifying critical instructions is based on heuristics,such as QOLD,ALOLD,QCONS,and FREED3,which were pro-posed by Tune at el.9.We compare the identi?cation re-sults of heuristics with those of PIT.Surprisingly,we?nd approximately40%of identi?c

45、ation based on the heuris-tics does not agree with PIT,as shown in Figure7.And therefore,almost half of predictions does not agree with PIT.In order to con?rm that the low prediction accuracy is due to the heuristics,we replace the training heuristics with PIT.That is,the CPP is trained by PIT and i

46、nstruc-tions are scheduled based on critical path prediction.The last bar in Figure6shows processor performance and EDP in this case.While processor performance is improved, EDP is getting worse.This means that maintaining pro-cessing performance is realized by utilizing fast and power-hungry units.

47、This result shows prediction accuracy is not improve even if the CPP is updated using exact critical path information.A reason why the CPP still can not perform accurate predictions might be that each predicted critical-ity never changes while PIT-reported criticality follows dy-namic changes in DFG

48、.6ConclusionsThe evaluation performed in this paper shows that the energy-saving architecture is effective when exact critical path information is utilized.Unfortunately,however,PIT which can supply accurate critical path information is ex-pected to be very power-hungry since hardware of PIT is0.50.550.60.650.70.750.80.850.90.951PerformanceEDPR e l a t i v e r e s u l t sFigure 6.SimulationResults 0.10.20.30.40.50.60.70.8QO

展开阅读全文