蔡武男: Heterogeneous Computing, with the ARM Architecture as an Example (PDF)

Uploaded by: 来看看 | Document ID: 3335286 | Uploaded: 2019-08-13 | Format: PDF | Pages: 18 | Size: 1.01 MB

Heterogeneous Computing in ARM Architecture
Alan Tsai, Business Development Manager
Sept 2012

Slide 2: Agenda
- Trends in Heterogeneous Computing
- GPU Computing with ARM Mali-T600 series as example
- Heterogeneous System Architecture (HSA) is the future

Slide 3: Trends in the Industry
- Heterogeneous multiprocessing
  - Established approach for SoC design
  - Mix of many specialized accelerators, implementing different ISAs
  - Diverse programming approaches lead to lack of portability
- Parallel computation for performance and efficiency
  - Endorsed at all levels of computer architecture
  - Parallel programming traditionally difficult
- General purpose programmability of GPUs
  - Massive parallel computation potential
  - Increasing programmability

Slide 4: What is Parallel Computing?
- Simply, doing multiple tasks simultaneously
- Task-parallel computing does different tasks concurrently
  - Reading email, playing music, and surfing the web are all separate tasks
  - In a multicore system, these can execute simultaneously
- Data-parallel computing does the same operation on a collection of data concurrently
  - Adjusting the contrast of the pixels of an image
  - Each thread executes the same code but with different data
  - Classic SIMD (single-instruction, multiple-data)
  - GPU computing is perfect for data-parallel applications

Slide 5: What is Heterogeneous Computing?
[Diagram labels: CPU, GPU]
- GPU used as computational accelerators or companion processors
- Massively parallel architecture gives great computational capabilities
- Cost effective, efficient, great floating point performance

Slide 6: Complementary Processor Architectures
- The CPU: serial workloads and task-parallel workloads
- The GPU: 50 stages; very high latency; high throughput; 2D/3D graphics; stream processing

Slide 7: GPU Compute Making the Difference
- Use cases: Computer Vision; Real Time Still and Moving Image Perfection; Up scaling; Multi-Perspective Vision; 2D to 3D; Information Extraction; Multi-User Interaction; Light-Field Photography; Computational Photography
- Benefits: more efficient processing; improved accuracy/quality; BOM reduction; unlock new use cases; improved existing use cases
- Trends: heterogeneous computing; portability; parallel computation; hardware acceleration; GPU computing

Slide 8: GPU COMPUTING, Mali-T600 as Example

Slide 9: Mali-T600 GPU Series Overview
[Diagram label: Mali-T628]
- Innovation and market leadership
  - Tri-pipe ALU design: optimal graphics and GPU compute
  - Native 64-bit integer and floating point (IEEE 754-2008), scalar and SIMD
- Flexibility and scalability
  - Mali-T624 and Mali-T628 for smartphones and SmartTVs
  - Mali-T678 for the best in compute and graphics for tablets
- Software compatibility and comprehensive API support
  - DirectX 11, OpenGL ES 3.0
  - OpenCL Full Profile and Renderscript compute
- Performance
  - 100s of GFLOPs of arithmetic performance
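The data-parallel example from the "What is Parallel Computing?" slide (adjusting the contrast of the pixels of an image, with each thread running the same code on different data) can be sketched in plain C. This is only an illustration under stated assumptions: the function names `contrast_item` and `contrast_all`, the mid-grey pivot of 128, and the `gain` parameter are invented here and do not come from the slides, and the serial loop merely stands in for a parallel launch in which each index would run on its own thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a float to the 0..255 range of an 8-bit pixel, rounding to nearest. */
static uint8_t clamp_u8(float v) {
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Hypothetical "kernel": the work done for ONE pixel, selected by gid.
   Every invocation executes the same code on a different element. */
static void contrast_item(const uint8_t *in, uint8_t *out, size_t gid, float gain) {
    float centered = (float)in[gid] - 128.0f;   /* assumed pivot: mid-grey */
    out[gid] = clamp_u8(centered * gain + 128.0f);
}

/* Serial stand-in for the parallel launch: on a data-parallel device each
   value of gid would be handled by its own thread instead of a loop. */
static void contrast_all(const uint8_t *in, uint8_t *out, size_t n, float gain) {
    for (size_t gid = 0; gid < n; ++gid) {
        contrast_item(in, out, gid, gain);
    }
}
```

Because `contrast_item` reads and writes only element `gid`, the iterations are independent of one another, which is the property that makes this kind of workload a good fit for SIMD or GPU execution.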

Slide 10: What about OpenCL?
- OpenCL is an API for heterogeneous computing
  - Write one source, deploy on many types of processors
  - Currently, it is targeted at data-parallel applications
- Applications use kernels to process data provided to the OpenCL runtime
- Kernels are written in OpenCL C
  - Subset of C99 with the addition of vector data types (e.g. float4)
- Application flow
  - Initializes the OpenCL runtime
  - Compiles and links the kernel
  - Creates and initializes data buffers
  - Executes the kernel and collects results

Slide 11: GPU Computing with no compromises
- Embedded Profile is a subset of Full Profile, reducing features and precision
- All shipping processors openly programmable with OpenCL 1.1 are Full Profile
- All mainstream developers are producing for Full Profile
- All existing software in the industry has been developed for Full Profile
- With Mali-T600, ARM is the first IP vendor to pass conformance for OpenCL 1.1 Full Profile

Feature: Native support for 64-bit integer maths (scalar and SIMD)
  Benefit: Radically faster and more efficient than software emulation. Beneficial for multimedia encoders/decoders and encryption software, pointer arithmetic for the post-4 GB world, large counters.
Feature: IEEE 754-2008 compliance
  Benefit: Same floating point accuracy on a Mali-T600 Series GPU as any other Full Profile conformant platform.
Feature: Hardware accelerated support for 3D images
  Benefit: Great for volumetric modelling. Useful in physics, games.
Feature: Built-in atomic operations
  Benefit: Accelerated in hardware on Mali-T600. No need for expensive external memory synchronization or emulation. Cornerstone of parallel computation.

Slide 12: OpenCL Platform Model on Mali-T600
[Diagram labels: Host; ARM Mali-T600 MP4 GPU; ARM Compute Subsystem; Core; Thread]
- Multiple hardware execution queues
- Each work-item runs as a single thread on a core
- A whole work-group executes on a single core
- Each thread has its own registers, PC, SP, private stack
- Job manager handles everything in hardware:
  - Issuing all tasks to available cores
  - Handling out-of-order execution queues
  - Continually spawning work items (threads) to keep cores busy
  - Providing work item IDs
  - Per-job completion interrupts can be requested

Slide 13: OpenCL Programming Model
[Diagram labels: Application; Program; Runtime Compiler; Kernel object; Kernel (OpenCL kernel, Native kernel); Index space (NDRange); Execute command]
- Can use static compilation
- Binaries are cached
- The kernel is executed over each element of the N-dimensional index space

Slide 14: OpenCL Execution Model on Mali-T600
[Diagram: an NDRange divided into work-groups of work-items, mapped onto a core group whose cores each have an L1 cache]
- Work item: registers, PC, SP, private stack
- Work group: barriers, local memory/atomics, constants
- NDRange: global atomics, cached global memory

Slide 15: OpenCL Execution Model on Mali-T600
[Diagram labels: Hardware Queue 1; Hardware Queue 2; Core Group (four cores, each with L1); OpenCL Queue (Task Graph)]
- Multiple hardware queues supported (whilst one is executed, the other is being built)
- Job manager handles everything in hardware
- Applications make driver calls to queue tasks/jobs to the target compute device
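The relationship between NDRange, work-groups, and work-items described on the execution-model slides can be sketched as index arithmetic in plain C. This is a hedged illustration, not ARM's implementation: the `launch_1d` struct and all function names are hypothetical, the launch is one-dimensional, the global size is assumed to be a multiple of the work-group size, and the round-robin placement of work-groups on cores is an assumption made purely for illustration (the slides only state that a whole work-group executes on a single core and that the job manager issues tasks to available cores).

```c
#include <stddef.h>

/* Hypothetical description of a 1-D launch; field names are illustrative. */
typedef struct {
    size_t global_size;  /* total number of work-items in the NDRange */
    size_t local_size;   /* number of work-items per work-group */
    size_t num_cores;    /* cores in the core group, e.g. 4 for an MP4 part */
} launch_1d;

/* Number of work-groups; assumes global_size is a multiple of local_size. */
static size_t num_groups(const launch_1d *l) {
    return l->global_size / l->local_size;
}

/* Work-group that owns the work-item with global ID gid. */
static size_t group_of(const launch_1d *l, size_t gid) {
    return gid / l->local_size;
}

/* Position of the work-item inside its work-group. */
static size_t local_id_of(const launch_1d *l, size_t gid) {
    return gid % l->local_size;
}

/* ASSUMPTION: work-groups are placed on cores round-robin. The real job
   manager schedules dynamically; what matters here is that every work-item
   of one work-group lands on the same core. */
static size_t core_of(const launch_1d *l, size_t gid) {
    return group_of(l, gid) % l->num_cores;
}
```

For example, with 32 work-items in groups of 4 on 4 cores, work-item 21 belongs to work-group 5 with local ID 1, and under the round-robin assumption every work-item of that group (global IDs 20 to 23) maps to core 1.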

Slide 16: Coherency on Cortex-A15 & Mali-T600
[Diagram labels: Quad Cortex-A15; Video; Quad/Octal Mali-T600; AMBA 4 Cache Coherent Interconnect CCI-400; MMU-400 (x2); DMC-400; LPDDR2/DDR3 (x2)]
- Coherency allows the sharing of on-chip data
  - Reduces external memory access
  - Saves power
- Compute subsystems for SoC
  - Designed and optimized by ARM
- Cache Coherent Interconnect
  - Enables hardware cache coherency
  - Increases available CPU performance
  - Reduces the need to access external memory
  - Improved OpenCL performance across CPU and GPU
- GPU snoops into CPU caches
  - Enables simple sharing of data between processors

Slide 17: ARM and HSA

Slide 18: Summary
- Compute more efficiently using heterogeneous and parallel processing
- Use OpenCL to enable portable heterogeneous multiprocessing
- Mali-T600 GPUs bring efficient GPU computing to you, now
