The Generalization Performance of ERM Algorithm with Strongly Mixing Observations

Bin Zou, Luoqing Li
Faculty of Mathematics and Computer Science, Hubei University, Wuhan 430062, China
zoubin0502

November 23, 2007

Abstract

The generalization performance is the main purpose of machine learning theoretical research. The previous main bounds describing the generalization ability of the ERM algorithm are based on independent and identically distributed (i.i.d.) samples. When the complexity of the given function set is high, the problem of solving the ERM algorithm is usually ill-posed and overfitting may happen. In order to study the generalization performance of the ERM algorithm with dependent observations, in this paper we first decompose the given function set into different compact subsets, and then we establish an exponential bound on the rate of relative uniform convergence on these compact subsets for the ERM algorithm with strongly mixing observations. Finally, we obtain bounds on the generalization ability of the ERM algorithm with strongly mixing observations over the given function set, which extend the previous results for i.i.d. observations to the case of strongly mixing observations.

Keywords: Generalization performance, Principle of ERM, Relative uniform convergence, Exponentially strongly mixing

Supported in part by NSFC under grant 10771053 and by the National Research Foundation for the Doctoral Program of Higher Education of China (SRFDP) under grant 200605120019.

1 Introduction

Recently there has been a large increase of interest in theoretical issues in the machine learning community. This is mainly due to the fact that statistical learning theory has demonstrated its usefulness by providing the ground for developing successful and well-founded learning algorithms such as Support Vector Machines (SVM) [21]. This renewed interest in theory naturally boosted the development of performance bounds [4, 5, 6, 7, 16, 17, 27, 29].

The setting of learning theory is that, given a set of data consisting of labeled objects, the goal is to find a function that assigns labels to objects such that, if new objects are given, this function will label them correctly. We assume that all the data are generated by the same process, that is, the data are sampled from a fixed unknown probability distribution. The only knowledge we have about this distribution comes from a sample of observations. Typically, we choose a class of possible functions that corresponds to the possible ways in which the labels can be related to the objects, and then we choose a function in that class which agrees as much as possible with the given data. Therefore the problems that learning from random sampling should address are how to choose an appropriate class of functions and how to choose a function in that class [4]. In the sequel, we are interested in how to choose a function from a given function set.

Since we assume that the data are sampled from a certain distribution, it is possible to relate the observed behavior of a function on the data to its expected behavior on future data by means of probability theory. More precisely, for each given function in the class of interest, one can obtain confidence intervals for the expected mislabel error (expected error) in terms of the observed mislabel error (empirical error). This is not enough to guarantee that, in a given class, a function which has a small empirical error will have a small expected error [4]. Considering our sample as a random variable, we see that the empirical error of each function in our class is a random variable, which means we have a collection of random variables. If we want a bound on the expected error of the learning algorithm, since the particular function that the algorithm will pick after seeing the data is not known in advance, one has to bound uniformly the deviations of this collection of random variables. Therefore the notion of the size of a collection of random variables (indexed by a class of functions) is crucial in learning theory. Measures of the size of such a collection include covering numbers [28] and packing numbers [1], entropy numbers [26], the VC-dimension for indicator functions [10, 21], the V-dimension (or P-dimension) for real-valued functions [1], etc.

In order to measure the generalization ability of algorithms that minimize the empirical risk with i.i.d. observations, Vapnik [21] established exponential bounds on the rate of uniform convergence and relative uniform convergence for i.i.d. observations. Bousquet [4] obtained a generalization of the Vapnik and Chervonenkis bounds by using a measure of the size of function classes, the local Rademacher average. Cucker and Smale [6] and Smale and Zhou [16, 17] considered the least squares error, decomposed the difference between the value of the achieved risk and the value of the minimal possible risk into two parts, the sample error and the approximation error, and then bounded the sample error and the approximation error based on i.i.d. observations over a compact subset of the hypothesis space. Chen et al. [5] obtained a bound on the excess expected risk with i.i.d. observations for pattern recognition by introducing a projection operator.

However, independence is a very restrictive concept in several ways [23]. First, it is often an assumption, rather than a deduction made on the basis of observations. Second, it is an all-or-nothing property, in the sense that two random variables are either independent or they are not; the definition does not permit an intermediate notion of being nearly independent. As a result, many of the proofs based on the assumption that the underlying stochastic sequence is i.i.d. are rather "fragile". The notion of mixing allows one to put the notion of "near independence" on a firm mathematical foundation and, moreover, permits one to derive a robust rather than a "fragile" theory. Many machine learning applications such as market prediction, system diagnosis, and speech recognition are inherently temporal in nature, and consequently not i.i.d. processes [20]. Therefore, Yu [3] established rates of uniform convergence of the empirical means to their means based on stationary mixing sequences. White [24] considered cross-validated regression estimators for strongly mixing processes and established convergence, without rates, of these estimators. Modha and Masry [18] established minimum complexity regression estimation with m-dependent observations and strongly mixing observations, respectively. Vidyasagar [23] considered the notions of mixing and proved that most of the desirable properties (e.g. the PAC property or the UCEMUP property) of i.i.d. sequences are preserved when the underlying sequence is a mixing sequence. Nobel and Dembo [15] proved that, if a family of functions has the property that empirical means based on an i.i.d. sequence converge uniformly to their values as the number of samples approaches infinity, then the family of functions continues to have the same property if the i.i.d. sequence is replaced by a β-mixing sequence. Karandikar and Vidyasagar [14] extended this result to the case where the underlying probability is itself not fixed, but varies over a family of measures. Vidyasagar [23] obtained the rate of uniform convergence of the empirical means to their means with a β-mixing sequence for a finite family of measurable functions. Steinwart, Hush and Scovel [20] studied the learnability of SVMs for α-mixing (not necessarily stationary) processes, for both classification and regression. Zou et al. [30] established bounds on the rate of uniform convergence of learning machines with strongly mixing observations.

In this paper we first obtain a bound on the relative difference between the empirical risks and their expected risks, using an argument similar to that used by Modha and Masry in their proof of the Bernstein inequality for strongly mixing sequences [18]; we then extend the previous bounds [4, 21] on the rate of relative uniform convergence to the case where the i.i.d. observations are replaced by strongly mixing observations, and improve the results of [30, 23]. The rest of this paper is organized as follows: in Section 2 we introduce some notation; in Section 3 we obtain bounds on the rate of relative uniform convergence for learning machines with strongly mixing observations; in Section 4 we establish bounds on the generalization ability of the ERM algorithm with strongly mixing sequences for the given function set.

2 Preliminaries

We introduce some notation and do some preparations in this section.

Let Z = {z_i = (x_i, y_i)}_{i=-∞}^{∞} be a stationary real-valued sequence on a probability space (Ω, B, P). For -∞ ≤ i ≤ j ≤ ∞, denote by σ_i^j the σ-algebra generated by the random variables z_k, i ≤ k ≤ j. The α-mixing (strongly mixing) coefficients of the sequence Z are defined, for k ≥ 1, by

\[
\alpha(k) = \sup_{A \in \sigma_{-\infty}^{0},\ B \in \sigma_{k}^{\infty}} \bigl| P(A \cap B) - P(A)\,P(B) \bigr|.
\]

Assumption 1. The sequence Z is exponentially strongly mixing, that is, its mixing coefficients satisfy

\[
\alpha(k) \le \bar{\alpha} \exp\bigl(-c\,k^{\beta}\bigr), \qquad k \ge 1,
\]

for some ᾱ > 0, β > 0, and c > 0, where the constants β and c are assumed to be known.

Remark 1 ([18]). Assumption 1 is satisfied by a large class of processes. For example, certain linear processes (which include certain ARMA processes) satisfy the assumption with β = 1 [25], and certain aperiodic, Harris-recurrent Markov processes (which include certain bilinear processes, nonlinear ARX processes, and ARH processes) satisfy the assumption [9]. As a trivial example, i.i.d. random variables satisfy the assumption with β = ∞.
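For illustration only (this sketch is not part of the paper), the following draws a dependent sample of the kind covered by Remark 1: the inputs follow a stationary Gaussian AR(1) process, a linear process that is exponentially strongly mixing; the regression target sin(x), the noise level, and the helper name ar1_regression_sample are assumptions made for the example.

```python
import numpy as np

def ar1_regression_sample(n, rho=0.5, noise=0.1, seed=0):
    """Draw n dependent observations z_i = (x_i, y_i).

    The inputs x_i follow a stationary AR(1) process x_i = rho * x_{i-1} + e_i,
    a linear process of the type that Remark 1 notes satisfies Assumption 1
    (with beta = 1); the outputs are a fixed regression function plus noise.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - rho**2))  # stationary initial state
    for i in range(1, n):
        x[i] = rho * x[i - 1] + rng.normal()
    y = np.sin(x) + noise * rng.normal(size=n)
    return x, y

# Draw a dependent sample of size n = 500 from this mixing sequence.
x, y = ar1_regression_sample(500)
```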

Denote by S the sample set of n observations

\[
S = \{ z_1, z_2, \ldots, z_n \}
\]

drawn from the exponentially strongly mixing sequence Z. The goal of machine learning from random sampling is to find a function f that assigns values to objects such that, if new objects are given, the function f will forecast them correctly. Let

\[
E(f) = E\,\ell(f, z) = \int_{Z} \ell(f, z)\, dP \tag{1}
\]

be the expected error (or expected risk) of the function f, where the function ℓ(f, z), which is integrable for any f and depends on f and z, is called the loss function. In this paper we would like to establish a general framework which includes both pattern recognition and regression estimation; the important feature of the regression estimation problem is that the loss function ℓ(f, z) can take arbitrary non-negative values, whereas in the pattern recognition problem it can take only two values. Throughout the article, we require that for all z ∈ Z,

\[
|\ell(f, z)| \le M.
\]

Let

\[
F = \{ f : a \le E(f) \le b \}
\]

be the compact set of admissible functions, where a, b are positive constants and b satisfies b ≤ M. A learning task is to find the minimizer of the expected risk over the function set F. Since one knows only the set S of random samples instead of the distribution P, the minimizer of the expected risk cannot be computed directly.

According to the principle of Empirical Risk Minimization (ERM), we minimize, instead of the expected risk (1), the so-called empirical risk (or empirical error)

\[
E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i).
\]

Let f^* be a function minimizing the expected risk E(f) over f ∈ F, i.e.,

\[
f^{*} = \arg\min_{f \in F} E(f) = \arg\min_{f \in F} \int_{Z} \ell(f, z)\, dP.
\]

We define f_n to be a function minimizing the empirical risk E_n(f) over f ∈ F, i.e.,

\[
f_{n} = \arg\min_{f \in F} E_n(f) = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i).
\]

According to the principle of ERM, we shall consider the function f_n as an approximation of the target function f^*.
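For intuition, here is a minimal sketch (not from the paper) of the ERM principle over a finite class of candidates, using the squared loss ℓ(f, z) = (f(x) − y)^2 as an illustrative loss; the helper names empirical_risk and erm, the toy data, and the candidate class are all assumptions made for the example.

```python
import numpy as np

def empirical_risk(f, x, y):
    """E_n(f): the average loss of candidate f over the sample (squared loss here)."""
    return np.mean((f(x) - y) ** 2)

def erm(candidates, x, y):
    """Return f_n, a candidate minimizing the empirical risk over a finite class F."""
    risks = [empirical_risk(f, x, y) for f in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

# Toy usage: a sample (x, y) and a small finite class F of admissible functions.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
F = [np.sin, np.cos, np.tanh, lambda t: 0.5 * t]
f_n, risk_n = erm(F, x, y)  # f_n plays the role of the empirical minimizer
```

Only the empirical risk enters the selection; it is the bounds developed in Sections 3 and 4 that relate E_n(f_n) back to the expected risk E(f_n).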

To measure the generalization performance of the ERM algorithm, Vapnik [21], Bousquet [4], and Cucker and Smale [6] obtained bounds on the rate at which the empirical risks converge uniformly to their expected risks on a given set H of functions based on i.i.d. sequences; that is, for any ε > 0, they bounded the term

\[
\mathrm{Prob}\Bigl\{ \sup_{f \in H} \bigl| E(f) - E_n(f) \bigr| \ge \varepsilon \Bigr\}. \tag{2}
\]

Yu [3] and Vidyasagar [23] also obtained the convergence rate of the term (2) based on β-mixing and α-mixing sequences, respectively.
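To make the quantity in (2) concrete, the sketch below (illustrative only) estimates sup_{f∈H} |E(f) − E_n(f)| for a small finite class H, approximating each expected risk by the empirical risk on a very large auxiliary sample; the class, the squared loss, and the sampling distribution are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

def risks(fs, x, y):
    """Empirical squared-loss risks E_n(f) of each candidate f on the sample (x, y)."""
    return np.array([np.mean((f(x) - y) ** 2) for f in fs])

H = [np.sin, np.cos, np.tanh, lambda t: 0.5 * t]

# Stand-in for the expected risks E(f): empirical risks on a very large sample.
X = rng.normal(size=200_000)
Y = np.sin(X) + 0.1 * rng.normal(size=200_000)
E_approx = risks(H, X, Y)

# One training sample of size n, and the uniform deviation appearing in (2).
n = 200
x = rng.normal(size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)
E_n = risks(H, x, y)
uniform_deviation = np.abs(E_approx - E_n).max()
```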

However, if the complexity of the set H is high, the problem of solving the ERM algorithm is usually ill-posed and overfitting may happen. Moreover, the term (2) fails to capture the phenomenon that, for those functions f ∈ H for which E(f) is small, the deviation E_n(f) − E(f) is also small with large probability. In order to extend the results in [4, 6, 21] to the case where the i.i.d. sequence is replaced by an α-mixing sequence, to improve the estimates in [3, 30, 23], and to solve the ill-posedness and overfitting problem of the ERM algorithm on a function set of high complexity, in this paper we decompose the given target function set F into disjoint parts by making use of the idea of [12]. That is, for given a, b, take q > 1 and suppose that b = a q^m for some m ∈ N, so that

\[
m = \log_{q} \frac{b}{a}.
\]

Let σ_i = a q^i, i = 0, 1, ..., m (with σ_0 = a and σ_m = b). We set

\[
F(\sigma_{i-1}) = \{ f \in F : E(f) \le \sigma_{i-1} \},
\]

and

\[
F(\sigma_{i-1}, \sigma_i] = F(\sigma_i) \setminus F(\sigma_{i-1}).
\]

Then we have

\[
F = \bigcup_{i=1}^{m} F(\sigma_{i-1}, \sigma_i], \tag{3}
\]

where m is finite because F is a compact set.
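The following small sketch (illustrative only; the helper names risk_grid and subset_index are not from the paper) shows how the geometric grid σ_i = a q^i partitions the admissible risk range into the subsets F(σ_{i−1}, σ_i] of (3): a function belongs to the i-th subset exactly when its expected risk lies in (σ_{i−1}, σ_i].

```python
import math

def risk_grid(a, b, q):
    """Geometric grid sigma_i = a * q**i for i = 0..m, where b = a * q**m."""
    m = int(round(math.log(b / a, q)))
    return [a * q**i for i in range(m + 1)]

def subset_index(expected_risk, grid):
    """Index i with sigma_{i-1} < E(f) <= sigma_i, i.e. f lies in F(sigma_{i-1}, sigma_i]."""
    for i in range(1, len(grid)):
        if grid[i - 1] < expected_risk <= grid[i]:
            return i
    raise ValueError("expected risk outside (a, b]")

# Example: a = 0.01, b = 0.16, q = 2 gives m = 4 and grid [0.01, 0.02, 0.04, 0.08, 0.16];
# a function with E(f) = 0.05 lies in the third subset, (0.04, 0.08].
grid = risk_grid(0.01, 0.16, 2)
i = subset_index(0.05, grid)  # i == 3
```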

For the sake of simplicity, we denote F(σ_{i-1}, σ_i] by F(r, s] in the sequel. Thus we first bound, for any ε > 0, the term

\[
\mathrm{Prob}\Bigl\{ \sup_{f \in F(r, s]} \frac{E(f) - E_n(f)}{\sqrt{E(f)}} \ge \varepsilon \Bigr\} \tag{4}
\]

for the ERM algorithm with the strongly mixing sequences Z, and then we establish bounds on the generalization ability over the set F(r, s] by making use of the bound on the term (4). To bound the term (4) for the ERM algorithm with the strongly mixing sequences Z, we need some assumptions on the loss function. For any t ≥ 0, define

\[
L(t) = \sup_{z \in Z}\ \max_{|g_1| \le t,\ |g_2| \le t} \frac{|\ell(g_1, z) - \ell(g_2, z)|}{|g_1 - g_2|}.
\]

We assume that L(t) is finite whenever t is finite. Our approach is based on the Bernstein moment condition [8] and the covariance inequality for α-mixing sequences [23], which we use to establish a new bound on the relative difference between the empirical risks and their expected risks by an idea similar to that of [18].
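As a worked illustration (assumed here, not stated in the preview), take the least squares loss ℓ(g, z) = (g − y)^2, where g denotes a predicted value and the labels satisfy |y| ≤ M_y; then for |g_1| ≤ t and |g_2| ≤ t,

\[
|\ell(g_1, z) - \ell(g_2, z)| = \bigl| (g_1 - y)^2 - (g_2 - y)^2 \bigr|
= |g_1 + g_2 - 2y|\,|g_1 - g_2| \le 2\,(t + M_y)\,|g_1 - g_2|,
\]

so L(t) ≤ 2(t + M_y), which is finite for every finite t, as required.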
