Hadoop Platform Monitoring, Alerting, and Automation
Sami Ben-Romdhane, Distinguished Architect, eBay

Hadoop Platform Team
- China team: 9 engineers
- US team: 11 engineers
- Eagle was fully developed by the China team

Cluster growth
- 2007: 1-10 nodes
- 2009: 50+ nodes
- 2010: 100+ nodes, 1,000s of cores, 1 PB
- 2011: 1,000+ nodes, 10,000+ cores, 10+ PB
- 2012: 3,000+ nodes, 10,000+ cores, 50+ PB
- 2013/2014: 10,000 nodes, 150,000+ cores, 150 PB

Fast business growth needs
- 10 large Hadoop clusters
- 10,000 data nodes
- 50,000 jobs per day
- 50,000,000 tasks per day
- 300+ types of Hadoop native metrics
- Millions of Hadoop component logs per day
- 1 billion audit events per day

Challenges with large-scale Hadoop cluster monitoring
- Too many entities to monitor: jobs, tasks, hosts, daemons, queues
- Alerting without setting a threshold per metric
- Auto-detecting node failures
- Auto-remediating failed nodes or daemon services
- Correlating job failures or slowness with node failures

- Observe
- Correlate
- Prevent
- Remediate
- Learn

Product comparison by capability
Columns: Product, Real-Time Collection, Real-Time Alert, Storage, Query API, Machine Learning, Correlation/Analytics, Dashboard
(Per-feature marks are not recoverable; surviving cell text is listed per product.)
- Eagle: HBase
- Hunk (Yahoo): Splunk
- Ganglia: gmond, RRD file
- Nagios: MySQL, etc.
- Apache Ambari (Ganglia + Nagios)
- YARN Timeline Server
- Cloudera Manager: MySQL
- White Elephant (LinkedIn): HDFS
- Enviso (Netflix): ?, Elasticsearch
- ELK (SequenceIQ): Logstash, Elasticsearch
- Twitter: Cassandra

Product comparison by monitored data
Columns: Product, History Job, Running Job, Audit Log, NN RPC Calls, Hadoop Native Metrics, System Metrics, Job SLA, Job Log, GC Log, YARN App, Job Workflow
(Per-feature marks are not recoverable.)
Products compared: Eagle, Hunk (Yahoo), Ganglia, Nagios, Apache Ambari, YARN Timeline Server, Cloudera Manager, White Elephant (LinkedIn), Enviso (Netflix), ELK (SequenceIQ), Twitter

Eagle features
- Monitor M/R jobs continuously
- Job Performance Analyzer
- Host anomaly detection
- Manage SLA jobs
- Monitoring NameNode audit log
- Monitoring HDFS image
- Monitoring daemon GC logs

Monitor M/R jobs continuously

Traditional job monitoring (White Elephant, ELK at SequenceIQ, etc.)
- Uploads job files in batch
- No monitoring of running job state
- No alerts

Monitor job history files in near real time
- Crawl job history files upon completion
- Apply expert rules to produce job performance suggestions
- Event sequence: Job Start Event, Task Start Event, Task End Event, task roll-up, Task2 Start Event, Task2 End Event, task roll-up, Job End Event, job suggestion rules

Eagle monitors running job states
- Minute-level job life-cycle snapshots
- Minute-level resource usage snapshots: CPU, HDFS I/O, disk I/O, slot seconds
- Roll-up to user/queue/cluster level
- Sliding-window-based alerts

Job Performance Analyzer

Job performance trend
- Job history trend (durations, job counters, failure ratio)
- Job running progress trends (MR progress, job counter progress)

Job optimization suggestions
- Data skew, task number, compression, split settings, spill, gcTime

Data skew analysis
- Is task duration skew triggered by data skew?

Task failure drill-down to hosts
- Task failure distribution by host
- Task execution time sequence
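The data skew question above ("Is task duration skew triggered by data skew?") can be illustrated with a small Python sketch. This is not Eagle's implementation: the function names, the max-to-median skew ratio, and the default threshold of 3.0 are all assumptions chosen for illustration. The idea is to compare how skewed task durations are with how skewed task input sizes are.

```python
from statistics import median


def skew_ratio(values):
    """Ratio of the largest value to the median; close to 1.0 means balanced."""
    med = median(values)
    return max(values) / med if med > 0 else float("inf")


def diagnose_skew(tasks, ratio_threshold=3.0):
    """Classify task duration skew for one job.

    tasks: list of (duration_seconds, input_bytes) pairs, one per task.
    ratio_threshold: hypothetical cut-off above which a ratio counts as skew.
    Returns (duration_skew, input_skew, verdict).
    """
    durations = [duration for duration, _ in tasks]
    inputs = [size for _, size in tasks]
    duration_skew = skew_ratio(durations)
    input_skew = skew_ratio(inputs)

    if duration_skew < ratio_threshold:
        verdict = "no duration skew"
    elif input_skew >= ratio_threshold:
        # The slow task also read far more data than its peers.
        verdict = "duration skew likely caused by data skew"
    else:
        # The slow task read a normal amount of data; look at the host instead.
        verdict = "duration skew not explained by input size"
    return duration_skew, input_skew, verdict


if __name__ == "__main__":
    # Synthetic example: one task is slow and also reads far more input.
    job_tasks = [(60, 100), (62, 110), (58, 95), (600, 1000)]
    print(diagnose_skew(job_tasks))
```

When the verdict is "not explained by input size", the slow tasks are candidates for the host-level drill-down, since the cause is more likely the node than the data.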
[Figure panels: data skew analysis; task failure drill-down to host level; job history trend; job optimization suggestions]

Anomaly detection

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu)\right)$$

Threshold computation uses the Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

[Figure: normal (green) and abnormal (red) data, probability distribution, and threshold selection]

- The machine learning model can list the top metrics contributing to an anomaly
- The model's predictive output is fed back to the YARN scheduler

Astro advantages
- Intelligent workload placement and scheduling
- Workloads are scheduled only to good nodes
- Workloads shift from bad nodes to good nodes
- Advantage of feedback to the prediction system: with prediction feedback, Astro execution time is lower than with the current YARN implementation

End-to-end automatic remediation center
- Remove failed disk volume
- Hadoop daemon restart
- Hadoop abnormal job killer
- Hadoop cluster expansion
- Node reboot
- HDFS/HBase balance

Automation steps
- TT/RS/DN decommission
- HDFS version file backup
- Hardware remediation
- Node reimage
- Host key setup
- SSH key setup
- Hadoop and OS configuration
- Node restart
- Disk burning
- Node health verification
- Node re-commissioning
- Client config deployment
- Daemon startup

Q&A
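The anomaly detection approach described earlier, a multivariate Gaussian density with a threshold chosen by the Matthews correlation coefficient, can be sketched in plain Python. This is a minimal illustration, not the presenters' code: the function names, the sample-covariance estimate, and the rule "flag a host when its density is at or below the threshold" are assumptions, and the demo data is synthetic.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the mean vector and sample covariance matrix from metric rows."""
    m, n = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(n)]
    sigma = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (m - 1)
              for j in range(n)] for i in range(n)]
    return mu, sigma


def solve_and_det(a, b):
    """Solve a*y = b by Gaussian elimination with partial pivoting; also return det(a)."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular covariance matrix")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= aug[col][col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        rest = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - rest) / aug[i][i]
    return y, det


def density(x, mu, sigma):
    """Multivariate Gaussian density p(x; mu, Sigma)."""
    n = len(mu)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    y, det = solve_and_det(sigma, diff)  # y = Sigma^{-1} (x - mu)
    quad = sum(d * yi for d, yi in zip(diff, y))
    norm = (2.0 * math.pi) ** (n / 2.0) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def select_threshold(scores, labels):
    """Pick the density threshold that maximizes MCC on labeled validation data.

    scores: density values; labels: True for known anomalies.
    A point is flagged as anomalous when its density is <= the threshold.
    """
    best_eps, best_mcc = None, -2.0
    for eps in sorted(set(scores)):
        tp = tn = fp = fn = 0
        for p, bad in zip(scores, labels):
            flagged = p <= eps
            if flagged and bad:
                tp += 1
            elif flagged:
                fp += 1
            elif bad:
                fn += 1
            else:
                tn += 1
        score = mcc(tp, tn, fp, fn)
        if score > best_mcc:
            best_eps, best_mcc = eps, score
    return best_eps, best_mcc


if __name__ == "__main__":
    random.seed(0)
    # Synthetic two-metric data: healthy hosts near (0, 0), bad hosts near (6, 6).
    train = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
    mu, sigma = fit_gaussian(train)
    healthy = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)]
    bad = [[random.gauss(6, 0.3), random.gauss(6, 0.3)] for _ in range(10)]
    scores = [density(x, mu, sigma) for x in healthy + bad]
    labels = [False] * 100 + [True] * 10
    print(select_threshold(scores, labels))
```

The model is fitted on data assumed to be mostly normal, and the threshold is then tuned on a smaller labeled set. MCC is a reasonable selection criterion here because anomalies are rare, so plain accuracy would reward a threshold that never flags anything.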