Apache Hive & Stinger

Uploaded by: 韩长文 | Document ID: 3329627 | Uploaded: 2019-08-13 | Format: PDF | Pages: 27 | Size: 952.20 KB

Hortonworks Inc. 2013.

Apache Hive

Tez avoids unnecessary writes to HDFS (HIVE-4660)

[Figure: the query plan GROUP a BY a.x, GROUP b BY b.x, JOIN (a,b), ORDER BY, drawn as map (M) and reduce (R) tasks in two variants: one with three intermediate HDFS writes between stages, and one with no intermediate HDFS writes.]
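The claim that Tez avoids unnecessary writes to HDFS can be illustrated with a toy count of intermediate writes. The sketch below is a simplified model, not Hive's or Tez's planner: it assumes that each shuffle-requiring operator becomes one MapReduce job, that every job except the last hands its result to the next job through HDFS, and that a single Tez DAG passes data between vertices directly. The stage list and function names are hypothetical.

```python
# Toy model (an assumption for illustration, not Hive's actual planner):
# each shuffle-requiring operator in the slide's query is one stage.
SHUFFLE_STAGES = [
    "GROUP a BY a.x",
    "GROUP b BY b.x",
    "JOIN (a, b)",
    "ORDER BY",
]


def intermediate_hdfs_writes_mapreduce(stages):
    """One MapReduce job per stage; every job except the last
    materializes its output in HDFS for the next job to read."""
    return max(len(stages) - 1, 0)


def intermediate_hdfs_writes_tez(stages):
    """One DAG for the whole query; in this model, data moves between
    vertices without an HDFS round trip."""
    return 0


if __name__ == "__main__":
    print("MapReduce:", intermediate_hdfs_writes_mapreduce(SHUFFLE_STAGES))
    print("Tez:", intermediate_hdfs_writes_tez(SHUFFLE_STAGES))
```

Under these assumptions the four-stage plan incurs three intermediate HDFS writes as a MapReduce chain and none as a Tez DAG; the final query result is written in both cases and is not counted.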

Tez Sessions: because Map/Reduce query startup is expensive

- Tez Sessions
  - Hot containers ready for immediate use
  - Removes task and job launch overhead (5 s to 30 s)
- Hive
  - Session launch/shutdown in background (seamless, user not aware)
  - Submits query plan directly to Tez Session
- Native Hadoop service, not ad-hoc

Tez Delivers Interactive Query - Out of the Box!

| Feature | Description | Benefit |
| --- | --- | --- |
| Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster. | Latency |
| Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency |
| Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency |
| Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput |
| Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency |
| Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput |

ORC File Format

- Columnar format for complex data types
- Built into Hive from 0.11
- Support for Pig and MapReduce via HCatalog
- Two levels of compression
  - Lightweight type-specific and generic
- Built-in indexes
  - Every 10,000 rows with position information
  - Min, Max, Sum, Count of each column
  - Supports seek to row number
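The built-in index described above (one entry every 10,000 rows, holding min, max, sum, and count) can be sketched as follows. This is a minimal in-memory model for a single integer column, not the ORC on-disk layout or its reader API; the class and function names are hypothetical, and only the 10,000-row stride and the four statistics come from the slide.

```python
from dataclasses import dataclass

ROW_INDEX_STRIDE = 10_000  # from the slide: one index entry every 10,000 rows


@dataclass
class RowGroupStats:
    first_row: int   # position information: row number where the group starts
    count: int
    minimum: int
    maximum: int
    total: int       # the "Sum" statistic


def build_row_index(values, stride=ROW_INDEX_STRIDE):
    """Compute one statistics entry per group of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append(
            RowGroupStats(start, len(chunk), min(chunk), max(chunk), sum(chunk))
        )
    return index


def row_group_for(row_number, stride=ROW_INDEX_STRIDE):
    """Seek to a row number: find the index entry that covers it."""
    return row_number // stride


def groups_that_may_match(index, low, high):
    """Row groups whose [minimum, maximum] range overlaps [low, high].
    All other groups cannot contain a matching value and can be skipped."""
    return [
        i for i, g in enumerate(index)
        if g.maximum >= low and g.minimum <= high
    ]


if __name__ == "__main__":
    column = list(range(25_000))
    index = build_row_index(column)
    print(len(index), row_group_for(12_345))
    print(groups_that_may_match(index, 21_000, 22_000))
```

Keeping min and max per row group is what lets a reader rule out whole groups for a range filter without decoding them, while the stored start position lets it jump to a given row number.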

ORC File Format (HIVE-3874)

- Hive 0.12
  - Predicate Push Down
  - Improved run length encoding
  - Adaptive string dictionaries
  - Padding stripes to HDFS block boundaries
- Trunk
  - Stripe-based Input Splits
  - Input Split elimination
  - Vectorized Reader
  - Customized Pig Load and Store functions

Vectorized Query Execution (HIVE-4160)

- Designed for Modern Processor Architectures
  - Avoid branching in the inner loop.
  - Make the most use of L1 and L2 cache.
- How It Works
  - Process records in batches of 1,000 rows.
  - Generate code from templates to minimize branching.
- What It Gives
  - 30x improvement in rows processed per second.
  - Initial prototype: 100M rows/sec on a laptop.

HDFS Buffer Cache

- Use memory mapped buffers for zero copy
  - Avoid overhead of going through DataNode
  - Can mlock the block files into RAM
- ORC Reader enhanced for zero-copy reads
  - New compression interfaces in Hadoop
- Vectorization specific reader
  - Read 1,000 rows at a time
  - Read into Hive's internal representation

Cost-based optimization (Optiq) (HIVE-5775)

- Optiq: open source, Apache licensed query execution framework in Java
  - Used by Apache Drill, Cascading, LucidDB, ...
  - Based on the Volcano paper
  - 20 man years dev, more than 50 optimization rules
- Goals for Hive
  - Ease of use: no manual tuning for queries, make choices automatically based on cost
  - View chaining / ad hoc queries involving multiple views
  - Help enable BI tools front-ending Hive
  - Emphasis on latency reduction
- Cost computation will be used for
  - Join ordering
  - Join algorithm selection
  - Tez vertex boundary selection

How Stinger Phase 3 Delivers Interactive Query

| Feature | Description | Benefit |
| --- | --- | --- |
| Tez Integration | Tez is a significantly better engine than MapReduce. | Latency |
| Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput |
| Query Planner | Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning). | Latency |
| ORC File | Columnar, type aware format with indices. | Latency |
| Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics including histograms etc. (future) | Latency |

SCALE: Interactive Query at Petabyte Scale

- Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale.
- Smaller Footprint: better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.
- Larger Block Sizes: columnar format arranges columns adjacent within the file for compression & fast access.

File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500)

| Encoding | File size | Relative to original |
| --- | --- | --- |
| Text | 585 GB | Original size |
| RCFile | 505 GB | 14% smaller |
| Parquet (Impala) | 221 GB | 62% smaller |
| ORCFile (Hive 12) | 131 GB | 78% smaller |

Stinger Phase 3: Interactive Query In Hadoop

| Query | Hive 10 | Hive 0.11 (Phase 1) | Trunk (Phase 3) | Improvement |
| --- | --- | --- | --- | --- |
| TPC-DS Query 27: Pricing Analytics using Star Schema Join | 1400s | 39s | 7.2s | 190x |
| TPC-DS Query 82: Inventory Analytics Joining 2 Large Fact Tables | 3200s | 65s | 14.9s | 200x |

All results at Scale Factor 200 (approximately 200 GB of data).

Speed: Delivering Interactive Query

Test cluster: 200 GB data (ORCFile); 20 nodes, 24 GB RAM each, 6x disk each. Query time in seconds.

| Query | Hive 0.12 | Trunk (Phase 3) |
| --- | --- | --- |
| TPC-DS Query 52: Star Schema Join | 41.1s | 4.2s |
| TPC-DS Query 55: Star Schema Join | 39.8s | 4.1s |
| TPC-DS Query 28: Vectorization | 22s | 9.8s |
| TPC-DS Query 12: Complex join (M-R-R pattern) | 31s | 6.7s |

Next Steps

- Blog: http:/
- Stinger Initiative: http:/
- Stinger Beta: HDP-2.1 Beta, December, 2013

Thank You!

gunther@apache.org
@yakrobat @hortonworks
