Spark Ecosystem & Internals
Chen Chao (陈超), CrazyJvm
Developer Best Practices Day, Beijing, 3W Coffee (3W咖啡)

Show of Hands
How familiar are you with Spark?
A. Heard of it, but haven't used it before.
B. Kicked the tires with some basics.
C. Worked or working on a proof-of-concept deployment.
D. Worked or working on a production deployment.

Outline
- Basics & internals
- Ecosystem

Current Major Release
- Spark 1.2 (released)

Spark: What & Why
Apache Spark is a fast and general engine for large-scale data processing.
- Speed
- Ease of Use
- Generality
- Integrated with Hadoop

Hadoop Data Sharing

Spark Data Sharing
- DAG & in-memory

Why Is Spark Fast?
- Memory-based computation
- DAG
- Thread model
- Optimization (e.g. delay scheduling)

BDAS: one stack to rule them all

Key Concept: RDD
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs
- Optionally, a list of preferred locations to compute each split on
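The five RDD properties listed above can be made concrete with a toy, single-process model. This is only a sketch and not Spark's API: `ToyRDD`, `parallelize`, and `lineage_depth` are hypothetical names invented for illustration, partitions are plain Python lists, and nothing here is distributed or fault tolerant.

```python
class ToyRDD:
    """Toy model of the five RDD properties (hypothetical, not Spark's API)."""

    def __init__(self, num_partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.num_partitions = num_partitions       # 1. a list of partitions (as indices)
        self._compute = compute                    # 2. a function computing each split
        self.dependencies = tuple(dependencies)    # 3. dependencies on other RDDs
        self.partitioner = partitioner             # 4. optional partitioner (key-value data)
        self.preferred_locations = preferred_locations or {}  # 5. optional: split -> hosts

    def compute(self, split):
        """Materialize a single split by running its compute function."""
        return list(self._compute(split))

    def map(self, f):
        """Return a NEW ToyRDD that depends on this one; self is never mutated."""
        return ToyRDD(
            self.num_partitions,
            lambda split: (f(x) for x in self.compute(split)),
            dependencies=[self],
        )

    def collect(self):
        """Compute every split in order and concatenate the results."""
        out = []
        for split in range(self.num_partitions):
            out.extend(self.compute(split))
        return out

    def lineage_depth(self):
        """Length of the longest dependency chain ending at this RDD."""
        return 1 + max((d.lineage_depth() for d in self.dependencies), default=0)


def parallelize(data, num_partitions):
    """Slice an in-memory collection into contiguous chunks (hypothetical helper)."""
    items = list(data)
    size = -(-len(items) // num_partitions)  # ceiling division
    chunks = [items[i * size:(i + 1) * size] for i in range(num_partitions)]
    return ToyRDD(num_partitions, lambda split: chunks[split])


base = parallelize([1, 2, 3, 4, 5], num_partitions=2)
doubled = base.map(lambda x: x * 2)   # base is left untouched
```

Because `map` only records a new compute function and a dependency on its parent, any split of `doubled` can be recomputed from `base` on demand, which is the idea behind lineage-based recovery.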
Immutable!

Key Concept: Lineage
- Unroll partition safely when caching

Key Concept: Dependency

Key Concept: ClusterManager
- Local
- Standalone
- YARN
- Mesos

Cluster Overview

Schedule

Executor

Shuffle
- Sort-based shuffle supported

Shuffle
- Pull-based (not push-based)
- Write intermediate files to disk
- Build hash map within each partition
- Can spill across keys
- A single key-value pair must fit in memory

Better Metrics System
- Previously: metrics were collected only after a task completed
- Now: metrics are reported while the task is still running

Outline
- Basics & internals
- Ecosystem

Spark Streaming
- Mini-batch
- Enhanced HA

Data Loss?
- Worker failure
- Driver failure
- Reliable receiver

MLlib
Spark implementation of some common machine learning algorithms and utilities:
- classification
- regression
- clustering
- collaborative filtering
- dimensionality reduction
- feature extraction supported: Word2Vec, TF-IDF

ML Pipeline
- ML Dataset
- Transformer
- Estimator
- Pipeline
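How the ML Pipeline pieces (Transformer, Estimator, Pipeline) fit together can be sketched in plain Python. The sketch assumes the usual reading of these terms: a Transformer maps a dataset to a dataset, an Estimator is fitted on a dataset and returns a Transformer, and a Pipeline chains both kinds of stage. All class names below (`Tokenizer`, `VocabularyEstimator`, `CountVectorModel`, `Pipeline`, `PipelineModel`) are hypothetical toy classes written for this illustration, not Spark's own classes, and the dataset is just a list of dicts.

```python
class Tokenizer:
    """Transformer: adds a 'words' column derived from 'text'."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class CountVectorModel:
    """Transformer produced by fitting VocabularyEstimator."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        # One count per vocabulary word, in vocabulary order.
        return [dict(r, features=[r["words"].count(w) for w in self.vocab])
                for r in rows]


class VocabularyEstimator:
    """Estimator: learns a vocabulary from the data, returns a Transformer."""

    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["words"]})
        return CountVectorModel(vocab)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Estimator made of stages; fitting it yields a PipelineModel."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it on the current data
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"text": "spark is fast"}, {"text": "Spark SQL"}]
model = Pipeline([Tokenizer(), VocabularyEstimator()]).fit(train)
scored = model.transform([{"text": "Spark spark sql"}])
```

The design point is that fitting happens once on training data, while the resulting `PipelineModel` only contains Transformers and can be reused on new data.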
GraphX

Spark SQL
- Data sources: RDDs / Parquet files / JSON datasets / Hive tables
- DSL
- JDBC server
- Programmatically specifying the schema

So what is Hive on Spark?

Shark
- Development of Shark has ended; it has been subsumed by Spark SQL
- Mission completed!

Tachyon
- Data sharing between different jobs or frameworks
- Process crash
- Duplicated memory and GC

SparkR

BlinkDB
Queries with Bounded Errors and Bounded Response Times on Very Large Data

JobServer
Provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts

Spark Packages
http://spark-packages.org
A community index of packages for Apache Spark

Q & A
weibo: CrazyJvm
WeChat public account: ChinaScala

Thank you all!