Developer Best Practices Day: Spark Ecosystem (Spark-Ecosystem.pdf)

Uploaded by: 哈尼dd | Document ID: 3333356 | Uploaded: 2019-08-13 | Format: PDF | Pages: 42 | Size: 2.82 MB

Spark Ecosystem & Internals
Chen Chao (陈超), CrazyJvm
Developer Best Practices Day, 3W Coffee, Beijing

Show of Hands
How familiar are you with Spark?
A. Heard of it, but haven't used it before.
B. Kicked the tires with some basics.
C. Worked or working on a proof-of-concept deployment.
D. Worked or working on a production deployment.

Outline
- Basics & internals
- Ecosystem

Current Major Release
- Spark 1.2

Spark: What & Why
Apache Spark is a fast and general engine for large-scale data processing.
- Speed
- Ease of use
- Generality
- Integrated with Hadoop

Hadoop Data Sharing

Spark Data Sharing
- DAG & in-memory

Why Is Spark Fast?
- Memory-based computation
- DAG
- Thread model
- Optimization (e.g. delay scheduling)

BDAS
One stack to rule them all

Key Concept: RDD
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs
- Optionally, a list of preferred locations to compute each split on
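The five RDD properties listed above can be illustrated with a small pure-Python toy model. This is only a sketch: `ToyRDD`, `ListRDD`, and `MappedRDD` are hypothetical names invented here and are not Spark's classes or API. The sketch shows how a derived dataset stores a dependency on its parent and computes each split lazily on demand.

```python
from typing import Callable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; NOT Spark's actual API)."""

    def __init__(self, deps: Optional[List["ToyRDD"]] = None):
        self.dependencies: List["ToyRDD"] = deps or []  # 3. dependencies on other RDDs
        self.partitioner: Optional[Callable] = None     # 4. optional, key-value RDDs only

    def partitions(self) -> List[int]:                  # 1. a list of partitions
        raise NotImplementedError

    def compute(self, split: int) -> Iterator:          # 2. a function computing each split
        raise NotImplementedError

    def preferred_locations(self, split: int) -> List[str]:  # 5. optional locality hints
        return []

    def map(self, f: Callable) -> "ToyRDD":
        # Builds a new node in the graph; nothing is computed yet.
        return MappedRDD(self, f)

    def collect(self) -> list:
        # Only here are the splits actually computed, one partition at a time.
        return [x for p in self.partitions() for x in self.compute(p)]


class ListRDD(ToyRDD):
    """Source dataset: an in-memory list cut into num_partitions chunks."""

    def __init__(self, data: list, num_partitions: int):
        super().__init__()
        self._chunks = [data[i::num_partitions] for i in range(num_partitions)]

    def partitions(self) -> List[int]:
        return list(range(len(self._chunks)))

    def compute(self, split: int) -> Iterator:
        return iter(self._chunks[split])


class MappedRDD(ToyRDD):
    """Derived dataset: one parent, same partitioning, f applied per element."""

    def __init__(self, parent: ToyRDD, f: Callable):
        super().__init__([parent])
        self._f = f

    def partitions(self) -> List[int]:
        return self.dependencies[0].partitions()

    def compute(self, split: int) -> Iterator:
        return (self._f(x) for x in self.dependencies[0].compute(split))

    def preferred_locations(self, split: int) -> List[str]:
        return self.dependencies[0].preferred_locations(split)


base = ListRDD([1, 2, 3, 4, 5, 6], num_partitions=2)
doubled = base.map(lambda x: x * 2)
result = doubled.collect()
```

Because `doubled` keeps only a reference to `base` and a function, any of its partitions can be rebuilt from the parent at any time, which is the idea behind lineage.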

RDDs are immutable!

Key Concept: Lineage
- Unroll partitions safely when caching

Key Concept: Dependency

Key Concept: ClusterManager
- Local
- Standalone
- YARN
- Mesos

Cluster Overview

Schedule

Executor

Shuffle
- Sort-based shuffle supported
- Pull-based (not push-based)
- Write intermediate files to disk
- Build a hash map within each partition
- Can spill across keys
- A single key-value pair must fit in memory

Better Metrics System
- Previously: metrics were collected only after a task completed
- Now: metrics are reported while the task is still running

Outline
- Basics & internals
- Ecosystem

Spark Streaming
- Mini-batch
- Enhanced HA

Data Loss?
- Worker failure
- Driver failure
- Reliable receiver

MLlib
Spark implementation of some common machine learning algorithms and utilities:
- classification
- regression
- clustering
- collaborative filtering
- dimensionality reduction
- feature extraction supported: Word2Vec, TF-IDF

ML Pipeline
- ML Dataset
- Transformer
- Estimator
- Pipeline
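The relationship between the ML Pipeline terms above (Transformer, Estimator, Pipeline) can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's `spark.ml` API: the dataset is a list of dicts, and `Tokenizer`, `MeanLengthEstimator`, and `LongerThan` are hypothetical stages made up for illustration. The point it shows is that an Estimator is fitted on data to produce a Transformer, and a Pipeline is itself an Estimator that chains stages.

```python
class Transformer:
    """Turns one dataset into another."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer."""

    def fit(self, rows) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    # Hypothetical stage: adds a "words" column derived from "text".
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class LongerThan(Transformer):
    # Hypothetical fitted model: flags rows with more words than a threshold.
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, is_long=len(r["words"]) > self.threshold) for r in rows]


class MeanLengthEstimator(Estimator):
    # Hypothetical estimator: learns the mean word count from training rows.
    def fit(self, rows):
        mean = sum(len(r["words"]) for r in rows) / len(rows)
        return LongerThan(mean)


class PipelineModel(Transformer):
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Chains stages; fitting it fits each Estimator on the data so far."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # Estimator -> fitted Transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # feed transformed data to the next stage
        return PipelineModel(fitted)


train = [{"text": "Spark is fast"}, {"text": "Spark"}]
model = Pipeline([Tokenizer(), MeanLengthEstimator()]).fit(train)
predictions = model.transform([{"text": "a b c d"}, {"text": "hi"}])
flags = [r["is_long"] for r in predictions]
```

Here the training rows have 3 and 1 words, so the learned threshold is 2.0, and the fitted pipeline can then be reused on new rows without refitting.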

GraphX

Spark SQL
- Data sources: RDDs / Parquet files / JSON datasets / Hive tables
- DSL
- JDBC server
- Programmatically specifying the schema
- So what is Hive on Spark?

Shark
- Development of Shark has ended; it has been subsumed by Spark SQL
- Mission completed!

Tachyon
- Data sharing between different jobs or frameworks
- Process crash
- Duplicated memory and GC

SparkR

BlinkDB
Queries with Bounded Errors and Bounded Response Times on Very Large Data

JobServer
Provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts

Spark Packages
http://spark-packages.org
A community index of packages for Apache Spark

Q & A
Weibo: CrazyJvm
WeChat public account: ChinaScala

Thank you all!
