Decision Tree (决策树)


6.1 Classification

Classification refers to the process of learning a mapping from samples to a set of predefined classes: given a set of input vectors together with their corresponding class labels, an inductive learning method is used to build a classifier. The main objective is to analyze the input data and, from the characteristics exhibited in the training data set, to find an accurate description or model for each class. Classification is one of the most heavily researched problems in data mining and one of its most widely applied techniques.

It is widely used in banking, medical diagnosis, biology, text mining, Internet content screening, and other areas. In banking, classification can help staff separate normal credit-card users from fraudulent ones, so that effective measures can be taken to reduce the bank's losses. In medical diagnosis, it can help medical personnel distinguish normal cells from cancer cells, so that a treatment plan can be made in time to save the patient's life. On the Internet, it can help staff separate normal mail from spam, so that effective filtering mechanisms can be built to keep spam from interfering with people's lives.

The basic steps of data classification (see pp. 126-127)

The data classification process consists of two steps:
1. Learning (building the model)
2. Classification (testing and applying the model)

Step one: learning, i.e. building the classification model
A model is built that describes a predetermined set of data classes or concepts; the model is obtained by analyzing the data tuples in a database. Each row (tuple) is assumed to belong to a predefined class, and the class is recorded by one designated attribute, called the class-label attribute. (The data set used here is called the classification data set.)
The data tuples used for learning form the training set, so classification learning is also called supervised learning (learning by example): the model is built under the supervision of training samples whose classes are known. Unsupervised learning, by contrast, proceeds when neither the class of each training sample nor even the number of classes is known in advance.
The learned classification model is usually expressed in the form of classification rules, a decision tree, or a mathematical formula. In short, classification rules or knowledge (a model) are learned from the training data set.

Step two: classification, i.e. testing the model and classifying data
This step uses the learned model to perform the actual classification.
First, the predictive accuracy of the model is estimated. The holdout method is a simple estimation technique: it uses a test set of class-labeled samples that are randomly selected and independent of the training samples. For a given data set, the accuracy of the constructed model is the proportion of test samples that the model classifies correctly out of the total number of test samples; for each test sample, the known class label is compared with the class predicted by the model. If the accuracy were instead estimated on the training data set, the learned model would tend to over-fit the training data and the accuracy estimate would be too optimistic; therefore the model's accuracy must be checked on a separate test data set.
Once the learned classification rules (the model) have been assessed on known test data and found sufficiently accurate, they can be used to classify or predict new data whose class is unknown, such as new customers.
What form does the data set used in a classification problem take? It consists of descriptive attributes together with a class-label attribute.
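As a rough sketch of the holdout method described above (the function names and the trivial majority-class learner below are illustrative, not from the text):

```python
import random

def holdout_accuracy(samples, labels, train_fraction, train_and_classify):
    """Estimate model accuracy with the holdout method: split the labeled
    data into independent training and test sets, learn a classifier on the
    training part only, and measure accuracy on the held-out test part."""
    indices = list(range(len(samples)))
    random.shuffle(indices)                      # random, independent split
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    classify = train_and_classify([samples[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
    # Accuracy = correctly classified test samples / total test samples.
    correct = sum(1 for i in test_idx if classify(samples[i]) == labels[i])
    return correct / len(test_idx)

def majority_learner(train_samples, train_labels):
    """Toy 'model' for demonstration: always predict the majority class."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda sample: most_common
```

Any real learner with the same `train_and_classify` shape (training data in, classifier function out) can be plugged in place of `majority_learner`.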

An application example
A shopping mall has a database of customer e-mail addresses, and plans to use those addresses to send potential customers discount promotions and brochures for new products. The database describes each customer by attributes such as age, income, occupation, and credit rating; the class attribute records whether the customer became a buyer at the mall. When a new customer's information is added to the database, we need to classify the customer as a likely computer buyer or not (i.e. classify the customer's purchase intention) in order to decide whether to send that customer the product brochure. Sending this kind of promotional brochure indiscriminately to every customer is obviously very wasteful; by contrast, sending targeted advertisements for the goods they need to the customers most likely to buy is an efficient and economical marketing strategy. Clearly, meeting the needs of this application requires building a classification model of customer purchase intention, which helps the business predict, for each newly joined customer, whether that customer will buy. Furthermore, if we need to predict how many purchases a customer may make in this market within a year (an order count), we must instead build a prediction model that estimates, for each new customer, the likely number of purchases in this store.

Valuation (estimation)
Difference from classification: classification outputs a discrete value from a fixed, known set of classes; estimation outputs a continuous value whose amount is not fixed in advance. For example, estimating a family's income from its purchase patterns is an estimation task.
Connection with classification: estimation can be viewed as a preliminary step of classification. Estimation first yields the value of a continuous unknown variable, which is then classified against preset thresholds. For example, in a bank's home-loan business, estimation is first used to give each customer a score, and the loan grade is then decided by thresholds on that score.

The decision tree method, the most typical classification method (see p. 128)
A decision tree is a flow-chart-like tree structure in which each internal node represents a test on an attribute (value), each branch represents an outcome of the test, and each leaf node represents a class. The topmost node of the tree is the root node.
A decision tree has two kinds of nodes:
1. Decision nodes: a decision node has several branches, each branch representing one decision outcome and connecting to a new node.
2. State nodes: these correspond to leaf nodes and represent a specific final state (class).
Advantages: comprehensible and intuitive (simple structure and high efficiency).
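The two kinds of nodes can be sketched as a small data structure (the class and field names here are illustrative, not from the text):

```python
class Node:
    """A decision-tree node: a decision (internal) node when `attribute`
    is set, a state (leaf) node when `label` is set."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at a decision node
        self.branches = {}           # attribute value -> child Node
        self.label = label           # class label at a leaf node

    def classify(self, sample):
        """Follow the branches matching the sample's attribute values
        from this node down to a leaf, and return the leaf's class."""
        node = self
        while node.label is None:
            node = node.branches[sample[node.attribute]]
        return node.label
```

A root node testing `student` with two leaf children, for instance, classifies `{"student": "yes"}` by following the `"yes"` branch to its leaf.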

Difficulties: how to choose good branching attributes and good split values.

Decision tree construction
Decision tree induction is an example-based inductive learning algorithm. It infers classification rules, expressed in decision tree form, from a set of unordered, ruleless tuples. Proceeding top-down and recursively, it compares attribute values at each internal node of the tree, branches downward from the node according to the different attribute values, and arrives at leaf nodes that give the classes to be learned. Each path from the root to a leaf corresponds to a conjunctive rule, and the whole decision tree corresponds to a set of disjunctive rules.
There are many decision tree algorithms: CLS, ID3, CHAID, C4.5, CART, SLIQ, SPRINT, etc. The well-known ID3 (Iterative Dichotomiser 3) algorithm was proposed by J. R. Quinlan in 1986; it introduces ideas from information theory and builds the decision tree classifier on the basis of information entropy.

The ID3 decision tree algorithm
The core of ID3 is the choice of attribute at each level of the decision tree: using information gain as the attribute selection criterion, the test performed at each non-leaf node is the one that yields the most information about the class of the records.
Concretely: evaluate all attributes and select the one with the maximum information gain as the decision node; create one branch for each distinct value of that attribute; then recursively apply the same tree-building method to the sample subset on each branch, until every subset contains records of only one class. The resulting decision tree can then be used to classify new samples.

Attribute selection method
In decision tree induction, information gain is usually used to help determine which attribute each generated node should test. The attribute with the highest information gain (the greatest entropy reduction) is chosen as the test attribute of the current node, so that the information still needed to classify the resulting partition of the sample set is minimized; in other words, partitioning the current node's sample set on this attribute makes the mixing of different classes within the subsets as low as possible. Using this information-theoretic method reduces the number of tests needed to classify an object, which helps keep the generated decision tree relatively simple, though not necessarily the simplest possible.

Worked example (textbook pp. 130-131, step 2): establishing the branches
Consider the branch node age <= 30. Because Gain(income) = 0.571, Gain(student) = 0.971, and Gain(credit) = 0.02, student is chosen as the test attribute of the age <= 30 branch node.
Consider the branch node age > 40. Because Gain(income) = 0.02, Gain(student) = 0.02, and Gain(credit) = 0.971, credit is chosen as the test attribute of the age > 40 branch node.
Consider the branch student = no: all its records belong to the same class, no, so the student = no node becomes a leaf labeled no.
Consider the branch student = yes: all its records belong to the same class, yes, so the student = yes node becomes a leaf labeled yes.
Consider the branch credit = excellent: all its records belong to the same class, no, so the credit = excellent node becomes a leaf labeled no.
Consider the branch credit = fair: all its records belong to the same class, yes, so the credit = fair node becomes a leaf labeled yes.

Extracting classification rules from the decision tree
Create one rule of the form IF ... THEN ... for each path from the root to a leaf: the attribute tests at the internal nodes along the path are conjoined to form the rule antecedent (the IF part), and
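The gain values quoted in the worked example can be reproduced with a short entropy and information-gain sketch. The five rows below are a reconstruction of the age <= 30 partition of the textbook's customer table (the standard buys-computer example), so treat the exact rows as illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution of D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attribute):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j),
    where the D_j are the partitions of D induced by attribute A."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Reconstructed age <= 30 partition: 2 buyers ("yes"), 3 non-buyers ("no").
rows = [
    {"income": "high",   "student": "no",  "credit": "fair"},
    {"income": "high",   "student": "no",  "credit": "excellent"},
    {"income": "medium", "student": "no",  "credit": "fair"},
    {"income": "low",    "student": "yes", "credit": "fair"},
    {"income": "medium", "student": "yes", "credit": "excellent"},
]
labels = ["no", "no", "no", "yes", "yes"]
```

On these rows, `info_gain` yields 0.571 for income, 0.971 for student, and 0.02 for credit, matching the values used in the example, so student is selected for this branch.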

the leaf node supplies the rule consequent (the THEN part).

Tree pruning
When a decision tree is first built, many of its branches are constructed from anomalous data in the training set (caused by noise and other problems). Tree pruning addresses exactly this over-fitting problem. Pruning usually relies on statistical methods to delete the least reliable branches, which improves both classification speed and the ability to classify new data correctly.
There are two approaches:
1. Pre-pruning (pruning during construction)
2. Post-pruning (pruning after construction)

Pre-pruning
Pre-pruning halts the branch-generation process early: at the current node, a criterion such as the node's entropy is used to decide whether to continue partitioning the training samples the node contains. Once branching stops, the current node becomes a leaf node; such a leaf may contain training samples of several different classes.
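A minimal pre-pruning check might look like the following (the threshold names and default values are illustrative assumptions, not from the text):

```python
def should_stop(labels, min_samples=5, purity_threshold=0.9):
    """Pre-pruning test: stop splitting this node, turning it into a leaf,
    if it holds too few samples or is already dominated by one class."""
    if len(labels) < min_samples:
        return True                      # too few samples to split further
    majority = max(labels.count(c) for c in set(labels))
    return majority / len(labels) >= purity_threshold
```

The tree builder would call this before generating branches; when it returns `True`, the node is labeled with its majority class instead of being split.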

In constructing a decision tree, statistical significance tests such as chi-square, or information gain, can be used to evaluate the quality of a generated branch. If partitioning the sample set at a node would produce a child node whose sample count falls below a specified threshold, the partitioning of that sample set is stopped. However, determining a reasonable threshold is often rather difficult: too large a threshold may make the decision tree over-simple, while too small a threshold may leave redundant branches that cannot be pruned.

Post-pruning
This method starts from a fully grown tree and prunes away its redundant branches. One post-pruning method is based on the cost of pruning. Pruning a branch turns its node into a leaf node, labeled with the majority class of the samples the node contains. For each non-leaf node of the tree, the expected classification error rate that would result if the branch at that node were pruned is computed; at the same time, from the classification error rates of the node's branches, weighted by their sample distribution, the expected classification error rate if the node is not pruned is computed. If pruning would make the expected classification error rate larger, pruning at that node is abandoned and its branches are retained; otherwise the node's branches are deleted. After pruning the candidate nodes of the decision tree, an independent test data set is used to evaluate the classification accuracy of the pruned trees, and the constructed decision tree with the minimum expected classification error rate is kept.
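The keep-or-prune comparison at a single node can be sketched as follows (a simplified sketch of the error comparison described above; real post-pruning implementations use pessimistic or cost-complexity error estimates):

```python
def prune_decision(leaf_error, branch_errors, branch_weights):
    """Decide whether to prune the branch rooted at a node.
    leaf_error: expected error rate if the node is collapsed into a leaf
    labeled with its majority class.
    branch_errors, branch_weights: per-branch error rates and their
    sample-distribution weights if the branch is kept.
    Prune only when that does not increase the expected error rate."""
    kept_error = sum(w * e for w, e in zip(branch_weights, branch_errors))
    return leaf_error <= kept_error
```

For example, a node whose collapsed-leaf error is 0.10 while its kept branches average 0.30 would be pruned; one at 0.30 versus 0.15 would keep its branches.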

Applications of decision trees
Business: typical business problems that decision trees can solve include customer relationship management, database marketing, customer classification, cross-selling, behavior analysis, market analysis, customer churn analysis, customer credit scoring, fraud detection, and so on.
Industry: industrial process control, fault diagnosis, etc.
Medicine: disease diagnosis and treatment, gene and molecular sequence analysis, hospital information systems, and medical-policy analysis and mining.
Today, many research institutions and companies have developed data mining and knowledge discovery application systems that use decision tree techniques, such as LMDT, OC1, SE-Learn, SIPINA-W, AC2, C4.5, IND, KATE-Tools, Knowledge SEEKER, SPSS CHAID, and CART. In addition, there are American Vanguard's DecisionPro 3.0 software, Litigation Risk Analysis's product of the same name, and the decision tree modules of the SAS and SGI data mining systems.
