Kylin–基于Hadoop的大规模联机分析引擎.pdf

资源描述

《Kylin–基于Hadoop的大规模联机分析引擎.pdf》由会员分享，可在线阅读，更多相关《Kylin–基于Hadoop的大规模联机分析引擎.pdf（33页珍藏版）》请在三一文库上搜索。

1、Kylin OLAP Engine on Hadoop Jiang Xu, Architect of Kylin 1 October 25, 2014 http:/kylin.io Kylin (麒麟) - n. (in Chinese art) a mythical animal of composite form 2 Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on H

2、adoop supporting extremely large datasets What Is Kylin? Extremely Fast OLAP Engine at Scale Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data ANSI-SQL Interface on Hadoop Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions Seamless Integrati

3、on with BI Tools Kylin currently offers integration capability with BI Tools like Tableau. Integration with Microstrategy and Excel is coming soon Interactive Query Capability Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset MOLAP Cub

4、e User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records 3 What Is Kylin? - Other Highlights Compression and Encoding Support Incremental Refresh of Cubes Approximate Query Capability for distinct Count (HyperLogLog) Leverage HBase Coprocessor for query l

5、atency Job Management and Monitoring Easy Web interface to manage, build, monitor and query cubes Security capability to set ACL at Cube/Project Level Support LDAP Integration 4 Glance of SQL-on-Hadoop Ecosystem SQL translated to MapReduce jobs Hive Stinger without Tez SQL processed by a MPP Engine

6、Impala Drill Presto Spark + Shark SQL process by a existing SQL Engine + HDFS EMC Greenplum (postgres) Taobao Garude (mysql) OLAP on Hadoop in other Companies Adobe: HBase Cube LinkedIn: Avatara S: Phoenix 5 Why Do We Build Kylin? Why existing SQL-on-Hadoop solutions fall short? The existing SQL-on-

7、Hadoop needs to scan partial or whole data set to answer a user query. Moreover, table join may trigger the huge data transfer across host. Due to large scan range and network traffic latency, many queries are very slow (minute+ latency). What is MOLAP/ROLAP? MOLAP (Multi-dimensional OLAP) is to pre

8、-compute data along different dimensions of interest and store resultant values in the cube. MOLAP is much faster but is inflexible. Kylin is more like MOLAP. ROLAP (Relational-OLAP) is to use star or snow-flake schema to do runtime aggregation. ROLAP is flexible but much slower. All existing SQL-on

9、-Hadoop is kind of ROLAP. How does Kylin support ROLAP/MOLAP? Kylin builds data cube (MOLAP) from hive table (ROLAP) according to the metadata definition. If the query can be fulfilled by data cube, Kylin will route the query to data cube that is MOLAP. If the query cant be fulfilled by data cube, K

10、ylin will route the query to hive table that is ROLAP. Basically, you can think Kylin as HOLAP(Hybrid OLAP) on top of MOLAP and ROLAP. 6 What Is Kylin Architecture? 7 How Does Kylin Utilize Hadoop? Hive Pre-join star schema during cube building MapReduce Pre-aggregation metrics during cube building

11、HDFS Store intermediated files during cube building. HBase Store data cube. Serve query on data cube. Coprocessor is used for query processing. 8 What Does Kylin Query Look Like? Kylin support ANSI-SQL: projection, filter, join, aggregation, groups and sub-query. 9 SELECT test_cal_dt.week_beg_dt, te

12、st_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS total_price, COUNT(*) AS total_count FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_cate

13、gory_groupings ON test_kylin_fact.leaf_categ_id = test_category_groupings.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category_groupings.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format

14、_name = New GROUP BY test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name Analytics Query Taxonomy 10 High Level Aggregation Very High Level, e.g GMV by site by vertical by weeks Analysis Query Middle le

15、vel, e.g GMV by site by vertical, by category (level x) past 12 weeks Drill Down to Detail Detail Level (SSA Table) Low Level Aggregation Seller ID Transaction Level Transaction ID 80+% Analytics Kylin is designed to accelerate Analytics queries! Whats The Query Performance? - Compare to Hive 11 # Q

16、uery Type Return Dataset Query On Kylin (s) Query On Hive (s) Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002

17、49.054 N/A 0 50 100 150 200 SQL #1 SQL #2 SQL #3 Hive Kylin Whats The Query Performance? - Latency & Throughput 12 Technical Highlights 13 Whats MOLAP Cube? 14 Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids) time, item time, item, location time, item, locati

18、on, supplier time item location supplier time, location Time, supplier item, location item, supplier location, supplier time, item, supplier time, location, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid Component Design 15 Query Engine (Calci

19、te) Metadata Manager Storage Engine Data Cube (HBase) Job Engine (MapReduce) Star Schema (Hive) SQL RESTful Server Dictionary & Cube JDBC Driver ODBC Driver HBase Coprossor How To Design Cube? 16 Cube: Fact Table: Dimensions: Measures: Row Key: HBase Mapping: Fact Dim Dim Dim Source Star Schema row

20、A row B row C Column Family Val 1 Val 2 Val 3 Row Key Column Target HBase Storage Cube Metadata Whats the cube metadata? Dimension Normal Mandatory Hierarchy Derived Measure Sum Count Max Min Average Distinct Count (based on HyperLogLog) 17 How to build Cube? - From Relational to Key-Value 18 How To

21、 Build Cube ? Job Flow 19 How To Store Cube? HBase Schema 20 How To Query Cube? Query Engine Query engine is based on Apache Calcite (http:/incubator.apache.org/projects/calcite.html) Apache Calcite is an extensible open source SQL engine that is also used in Stinger/Drill/ Cascading. 21 How To Quer

22、y Cube? Apache Calcite Plugin-ins Metadata SPI Provide table schema from kylin metadata Optimize Rule Translate the logic operator into kylin operator Relational Operator Find right cube Translate SQL into storage engine api call Generate physical execute plan by linq4j java implementation Result En

23、umerator Translate storage engine result into java implementation result. SQL Function Add HyperLogLog for distinct count Implement date time related functions (i.e. Quarter) 22 How To Query Cube? Explain Plan 23 SELECT test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_

24、category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name, sum(test_kylin_fact.price) as total_price, count(*) as total_count FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category_groupings ON test_kylin_fact.leaf_categ_id = tes

25、t_category_groupings.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category_groupings.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = New GROUP BY test_cal_dt.week_beg_dt, test_ca

26、tegory.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=$0, LV1_CATEG=$1, LVL2_CATEG=$2, LVL3_CATEG=$3, FORMAT_NAME=$4, SITE_NAME=$5, TOTAL_PRICE=CASE(=($7, 0), null, $6), TOTAL_COUNT=$

27、8) OLAPAggregateRel(group=0, 1, 2, 3, 4, 5, agg#0=$SUM0($6), agg#1=COUNT($6), TRANS_CNT=COUNT() OLAPProjectRel(WEEK_BEG_DT=$13, LV1_CATEG=$21, LVL2_CATEG=$15, LVL3_CATEG=$14, FORMAT_NAME=$5, SITE_NAME=$23, PRICE=$0) OLAPFilterRel(condition=OR(=($3, 123456), =($5, New) OLAPJoinRel(condition=($2, $25)

28、, joinType=left) OLAPJoinRel(condition=AND(=($6, $22), =($2, $17), joinType=left) OLAPJoinRel(condition=($4, $12), joinType=left) OLAPTableScan(table=DEFAULT, TEST_KYLIN_FACT, fields=0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) OLAPTableScan(table=DEFAULT, TEST_CAL_DT, fields=0, 1) OLAPTableScan(table=DEFA

29、ULT, TEST_CATEGORY_GROUPINGS, fields=0, 1, 2, 3, 4, 5, 6, 7, 8) OLAPTableScan(table=DEFAULT, TEST_SITES, fields=0, 1, 2) How To Query Cube? Storage Engine Provide cube query for query engine Common iterator interface for storage engine Isolate query engine from underline storage Translate cube query

30、 into HBase table scan Groups Cuboid ID Filters - Scan Range (Row Key) Aggregations - Measure Columns (Row Values) Scan HBase table and translate HBase result into cube result HBase Result (key + value) - Cube Result (dimensions + measures) 24 How To Optimize Cube? “Curse of dimensionality”: N dimen

31、sion cube has 2N cuboid Full Cube vs. Partial Cube Hugh data volume Dictionary Encoding Incremental Building Slow Table Scan TopN Query on High Cardinality Dimension Bitmap inverted index Time range partition In-memory parallel scan: block cache + endpoint coprocessor 25 How To Optimize Cube? Full C

32、ube vs. Partial Cube Full Cube Pre-aggregate all dimension combinations “Curse of dimensionality”: N dimension cube has 2N cuboid. Partial Cube To avoid dimension explosion, we divide the dimensions into different aggregation groups 2N+M+L 2N + 2M + 2L For cube with 30 dimensions, if we divide these

33、 dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands 230 210 + 210 + 210 Tradeoff between online aggregation and offline pre-aggregation 26 How To Optimize Cube? Partial Cube 27 How To Optimize Cube? Dictionary Encoding Data cube has lost of duplicated dimension valu

34、es Dictionary maps dimension values into IDs that will reduce the memory and storage footprint. Dictionary is based on Trie 28 How To Optimize Cube? Incremental Building 29 How To Optimize Cube? TopN Query on High Cardinality Dimension 30 Bitmap inverted index Separate high cardinality dimension fro

35、m low cardinality dimension Time range partition In-memory (block cache) Parallel scan (endpoint coprocessor) Roadmap Q3 2014 Version 1.0 MOLAP Cube ANSI SQL ODBC for Tableau Web GUI ACL Open Source Q2 2015 Version 2.0 HOLAP In-Memory Analysis Q4 2014 Version 1.1 Bitmap Inverted Index Filter on coprocessor Remote JDBC Q1 2015 Version 1.2 Job Schedule and Priority Capacity Management Cube Design Automation Kylin Resources Web Site http:/kylin.io Google Groups https:/ Twitter KylinOLAP Source Code https:/ 32 Q & A 33

展开阅读全文