Two Major Directions and Latest Advances in Big Data Technology - Sun Yuanhao.pdf


15/4/21 1 www.transwarp.io

Two Major Directions and Latest Advances in Big Data Technology
Sun Yuanhao (孙元浩), Co-founder and CTO, Transwarp (星环科技)
www.transwarp.io
April 16, 2015

Slide 2: Direction 1: SQL on Hadoop is the Killer App
Limited SQL support has held back enterprise adoption of Hadoop.

Slide 3
60% of Hadoop applications are SQL-based statistics. Source: wikibon.org

Slide 4: SQL on Hadoop Technologies

| Name | Compute engine | ANSI SQL support | Stored procedures | First release |
|---|---|---|---|---|
| Cloudera Impala | Dremel-like, MPP-like engine | SQL92 subset + SQL2003 extensions | Not supported | 2011/10 |
| Hortonworks Tez/Stinger | Improved Map/Reduce | SQL92 subset + SQL2003 extensions | Not supported | 2012/5 |
| Transwarp Inceptor | Spark | SQL99 + SQL2003 | Oracle Compatible PL/SQL | 2013/11 |
| Databricks SparkSQL | Spark | HiveQL (SQL92 subset) | Not supported | 2014/6 |
| MapR Drill | Improved from OpenDremel | SQL92 subset | Not supported | Project started 2012/6, released 2014/11 |
| IBM BigSQL v3 | DB2/DPF like MPP Engine over HDFS | SQL 2003 | Not supported | 2014/6 |
| Pivotal HAWQ | Greenplum like MPP Engine over HDFS | SQL 2003 | Partially supported (Postgres like) | 2013/2 |
| Splice Machine | Apache Derby + HBase | SQL 1999 | Not supported | 2015 GA |
| Actian Vortex | MPP Engine over HDFS | SQL 2003 | Not supported | 2014 |

Slide 5: Inceptor PL/SQL Compiler
First PL/SQL Compiler on Hadoop; 80% Oracle PL/SQL Compatibility.

Compiler diagram, components and the passes shown next to them:
- Inputs: PL/SQL, SQL2003
- SQL Parser: SQL Statements, Abstract Syntax Tree
- AST optimizer: Constant Folding
- SQL Normalizer
- Logical Optimizer: CSE, byte code generation, column pruner, operator pruner, partition pruner, predicate pushdown
- CBO Optimizer: Join optimizations, Table Statistics
- PL/SQL Analyzer: Control Flow Graph
- CFG Optimizer: function inlining, dead code elimination, redundant elimination, CSE, loop invariants hoisting
- Parallel Optimizer: cursor parallelization
- DAG Optimizer: shuffle reducer
- Physical Plan, RDD DAG, DAG Scheduler, spark tasks

Slide 6: Parallel Query Optimizer

Cursor example program (serial execution logic, CFG on Master):

    CURSOR c IS SELECT * from score
    OPEN c
    FOR v_rec IN c LOOP
      IF v_rec.flag 0 THEN
        UPDATE fact1 SET ...
      ELSE
        UPDATE fact2 SET ...
      END IF
    END LOOP

A cursor can be parallelized if there is no loop-carried dependence or the dependence is inductive.
After optimization, the loop is equivalent to sql("SELECT * from score").map(loop_cfg_func).

The Parallel Query Optimizer exploits three kinds of parallelism: partition parallelism, control flow parallelism, and pipeline parallelism.

Parallel execution logic: the score table is split into partition 0 through partition N, placed on slave0 through slaveN. Each slave runs the same loop control flow graph over its own partition: test flag 0; if Yes, update fact1; if No, update fact2; then move c ahead and repeat while rows remain.
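The parallelization rule on the Parallel Query Optimizer slide can be illustrated with a small sketch. It is not Inceptor's implementation: the `SCORE` rows, the `loop_body`, `split`, `run_serial`, and `run_parallel` names, and the use of Python threads in place of Spark tasks are all assumptions made for illustration. The comparison `flag > 0` is also an assumption, because the operator is missing from the slide text. The point it shows is that when each iteration depends only on its own row, running the loop body per partition gives the same result as the serial cursor.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the `score` table: (row_id, flag) pairs.
SCORE = [(i, i % 3 - 1) for i in range(12)]


def loop_body(row):
    """One iteration of the cursor loop, written as a pure function.

    Assumption: the test is `flag > 0` (the operator is lost in the slide).
    Instead of issuing a real UPDATE, it returns (target_table, row_id).
    """
    row_id, flag = row
    return ("fact1", row_id) if flag > 0 else ("fact2", row_id)


def run_serial(rows):
    # Serial cursor semantics: one node visits every row in order.
    return [loop_body(row) for row in rows]


def split(rows, n_partitions):
    # Contiguous partitions, so concatenating results keeps row order.
    size = max(1, (len(rows) + n_partitions - 1) // n_partitions)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def run_parallel(rows, n_partitions=4):
    # Every partition runs the same loop body independently,
    # like one copy of the loop CFG per slave.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        parts = pool.map(run_serial, split(rows, n_partitions))
        return [update for part in parts for update in part]


if __name__ == "__main__":
    # No loop-carried dependence, so both strategies agree.
    assert run_parallel(SCORE) == run_serial(SCORE)
```

If `loop_body` read a value written by an earlier iteration (a loop-carried dependence), the two functions could disagree, which is why the slide restricts parallelization to loops without such a dependence or with an inductive one.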

Slide 7: Transwarp Inceptor vs Greenplum DB
[Chart: ExecTime(GP) vs ExecTime(Inceptor). Y axis: ratio, log scale from 0.06 to 128.00. X axis: TPC-DS query, q1 to q99.]
GP is faster than Inceptor for 25 of the TPC-DS queries.

Slide 8: Transwarp Inceptor 4.1 vs Spark SQL 1.3
[Chart: Inceptor 4.1 vs SparkSQL 1.3 speedup, ExecTime(SparkSQL1.3)/ExecTime(Inceptor4.1). Y axis: speedup of Inceptor over SparkSQL, log scale from 0.5 to 64. X axis: TPC-DS query.]
SparkSQL can run 35 queries from the TPC-DS benchmark: 3, 7, 13, 15, 17, 19, 22, 25, 26, 27, 28, 29, 34, 42, 43, 45, 46, 48, 50, 52, 55, 61, 62, 68, 71, 73, 76, 79, 84, 85, 87, 88, 90, 96, 99.

Slide 9: Interactive OLAP Analysis: Distributed Cube
OLAP operations: Slicing, Dicing, Rollup, Drill Up/Down, Pivot.
Holodesk: a columnar store on SSD cache layer.

[Diagram: Cube on Transwarp Holodesk. An Inceptor Server drives four Executors, coordinated through a ZK Cluster. Each Executor reaches its data through the Columnar Store API, which holds Cube (D1, D2, D3), indexed columns D1, D2, D3, and M1, and Cube (D1, D2), (D2, D3), (D1, D3).]

Cube is a common technique for OLAP analysis.
Cube size: fixed at 256KB.

How to define a Cube?

    create table store_sales tblproperties( cache=ram, holodesk.dimensions=product, cities, time ) as select * from store_sales;

Slide 10: Performance Speedup from Holodesk Cube
[Chart: execution time in seconds (axis 0 to 160) for queries 1 to 8, w/ cube vs w/o cube. Data labels in the chart: 0.9, 9.8, 12.4, 12.1, 14.0, 1.3, 8.8, 12.7, 20.2, 43.3, 58.9, 86.6, 136.1, 1.4, 55.2, 56.5.]

| Query | Operation | SQL |
|---|---|---|
| q1 | count | select count(*) from store_sales |
| q2 | measure | select sum(ss_sales_price) from store_sales |
| q3 | aggregation | select sum(ss_sales_price) from store_sales group by ss_customer_sk |
| q4 | drill down | select sum(ss_sales_price) from store_sales group by ss_sold_date_sk |
| q5 | drill down | select sum(ss_sales_price) from store_sales group by ss_customer_sk, ss_sold_date_sk |
| q6 | slice | select sum(ss_sales_price) from store_sales_r where ss_customer_sk=5000 group by ss_customer_sk,ss_sold_date_sk |
| q7 | dice | select sum(ss_sales_price) from store_sales where ss_sold_date_sk between 2450629 and 2451816 group by ss_customer_sk |
| q8 | pivot | select sum(ss_sales_price) from store_sales where ss_customer_sk 5000 and ss_sold_date_sk between 2450629 and 2451816 group by ss_customer_sk,ss_sold_date_sk |

Test environment:
- 4 billion records, 500GB in total, resident in memory
- 4 ordinary two-socket servers
- 256GB of memory per server
- CPU: E5-2620v2
- 10-gigabit network
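Why a cube speeds up the queries above can be shown with a toy example. This is a minimal sketch of pre-aggregation in general, not of Holodesk: the `ROWS` data, the `build_cube` and `query` helpers, and the two-dimension layout are hypothetical, and the columnar storage, indexes, and 256KB cube blocks from the slides are not modeled. The idea is that `sum(ss_sales_price)` is computed once for every combination of dimensions, so a later group-by, slice, or dice reads a small pre-aggregated table instead of scanning all fact rows.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (customer, date, price), standing in for
# ss_customer_sk, ss_sold_date_sk, and ss_sales_price.
ROWS = [(1, 10, 5.0), (1, 11, 7.0), (2, 10, 3.0), (2, 10, 4.0)]
DIMS = ("customer", "date")


def build_cube(rows):
    """Pre-aggregate sum(price) for every subset of the dimensions."""
    cube = {}
    for k in range(len(DIMS) + 1):
        for subset in combinations(range(len(DIMS)), k):
            agg = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in subset)
                agg[key] += row[-1]  # the measure is the last field
            cube[tuple(DIMS[i] for i in subset)] = dict(agg)
    return cube


def query(cube, group_by, where=None):
    """Answer a group-by sum from the matching pre-aggregated table.

    `where` is an optional predicate over a {dimension: value} dict,
    which covers slice and dice style filters on the grouped dimensions.
    """
    cuboid = cube[tuple(group_by)]
    if where is None:
        return dict(cuboid)
    return {
        key: total
        for key, total in cuboid.items()
        if where(dict(zip(group_by, key)))
    }


if __name__ == "__main__":
    cube = build_cube(ROWS)
    print(query(cube, ("customer",)))           # like q3: group by customer
    print(query(cube, ("customer", "date"),
                where=lambda r: r["customer"] == 2))  # like q6: slice
```

The trade-off is the usual one for cubes: build time and storage grow with the number of dimension combinations, in exchange for query time that no longer depends on the number of fact rows.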

Slide 11: Direction 2: Hadoop Is Rapidly Moving to Docker
Enterprise demand for multi-tenant resource control and elastic computing is driving change in Hadoop.

Slide 12: The Real Needs of an Enterprise Big Data Platform
A unified enterprise big data platform (Data Hub).

Need 1: elastic resource sharing, to raise resource utilization
- Flexible deployment: Big Data + Application
- Resource scheduling: Auto-scaling + Self-healing
- Service discovery: Central Repository

Need 2: isolation, to guarantee quality of service and security
- Data isolation: Data Sources, Access Pattern, Confidential Levels
- Compute isolation: CPU, Memory, I/O
- Application isolation

Slide 13: Announcing Transwarp Operating System
Transwarp Operating System (TOS) can run on a cluster of bare-metal machines or on a public cloud.

[Diagram: TOS architecture]
- Ring 0: Docker/Container (Containers, Container Plugins)
- Ring 1: Resource scheduler (CPU/MEM priority-based scheduler, Disk storage manager, Network VLAN manager)
- Ring 2: Built-in system services (coordination (etcd), orchestration, load balancer, auto-scaling, replicator, discovery, name service)
- Ring 3: Central service repository (docker images): HDFS, YARN, Inceptor, Stream, Elastic Search, Hyperbase, Pig, Oozie, Flume, Sqoop, PostgreSQL, Redis

Slide 14: TOS Automated Deployment
- Install and deploy a cluster instantly with one click, through the Web, REST API, or CLI.
- Other service components that a service depends on are installed automatically.

[Diagram: HDFS, YARN + Map/Reduce, Inceptor, Stream, Elastic Search, Hyperbase, Flume, Sqoop, Oozie, and Pig deployed from the Service Repository on top of the TOS architecture shown on slide 13.]

Slide 15: TOS: Better Scheduler for Isolation
[Diagram: isolated workloads running on TOS]
- HDFS1, Inceptor1: Data Warehouse Apps
- Inceptor2: Datamarts, Analysis, Mining
- HDFS2, Hyperbase (HBase): Online Query Apps
- Stream: Real-time LBS Apps

Why rewrite the resource management framework to replace YARN?

| | Resource granularity | Isolation level | Dependency | Generality |
|---|---|---|---|---|
| YARN | CPU/MEM | Process level, imprecise | Depends on a particular HDFS | Supports a small number of compute engines |
| Kubernetes | CPU/MEM | Container | Does not depend on Hadoop | Supports general Linux workloads |
| TOS | CPU, MEM, DISK, NETWORK | Container + Quota + VLAN | Does not depend on Hadoop | Supports big data and general applications |

Slide 16: TOS: Auto-scaling & Self-healing
[Diagram: Stream (Real-time LBS Apps), HDFS1, and Inceptor1 (Data Warehouse Apps) running on the TOS architecture shown on slide 13.]
- Self-healing cluster: the Replicator monitors the cluster size and keeps it at that size.
- Dynamic scale-out and scale-in: Capacity Scheduler + Priority/Price-based Bidding (preemption supported).
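The self-healing behavior described for the Replicator (monitor the cluster size and keep it) follows a common control-loop pattern, sketched below. This is not TOS code: the `reconcile` function, the `inst-N` naming scheme, and the `is_healthy` callback are hypothetical, and real scheduling, placement, and preemption are left out. Each pass compares the desired instance count with the healthy instances and returns the set that should be running afterwards.

```python
def reconcile(desired, running, is_healthy):
    """One pass of a replicator-style control loop.

    desired:    target number of instances (assumed to be >= 0)
    running:    list of instance names currently known
    is_healthy: callback returning True for instances that are alive
    Returns the list of instance names that should run after this pass.
    """
    # Drop failed instances, then trim any surplus (scale-in).
    alive = [name for name in running if is_healthy(name)][:desired]

    # Start replacements until the desired size is reached (self-healing
    # and scale-out). Names are hypothetical: the first unused "inst-N".
    used = set(alive)
    next_id = 0
    while len(alive) < desired:
        candidate = f"inst-{next_id}"
        next_id += 1
        if candidate not in used:
            alive.append(candidate)
            used.add(candidate)
    return alive


if __name__ == "__main__":
    cluster = ["inst-0", "inst-1", "inst-2"]
    # inst-1 has failed; the loop replaces it to keep three instances.
    print(reconcile(3, cluster, lambda name: name != "inst-1"))
```

Running the same function with a larger or smaller `desired` value covers scale-out and scale-in, which is why auto-scaling and self-healing can share one mechanism.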

Slide 17: Big Data Will Become One of the Main Applications of Docker
Docker use cases, ranked: 1. QA, 2. Web Apps, 3. Big Data
source:

Transwarp Operating System:
- automated hadoop deployment
- run any docker images
- better isolation
- auto-scaling & self-healing

Transwarp Operating System will be released at the end of Q2 2015!

Slide 18
TRANSWARP 2014
