CS229 Lecture notes (Fall 2012)

Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet²)    Price (1000$s)
    2104                   400
    1600                   330
    2400                   369
    1416                   232
    3000                   540
    ...                    ...

We can plot this data:

[Figure: scatter plot of the housing data, price (in $1000) against living area in square feet.]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use x^(i) to denote the "input" variables (living area in this example), also called input features, and y^(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

[Figure: a training set is fed to a learning algorithm, which outputs a hypothesis h; h maps x (the living area of a house) to a predicted y (the predicted price of the house).]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.

Part I
Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:

    Living area (feet²)    #bedrooms    Price (1000$s)
    2104                   3            400
    1600                   3            330
    2400                   3            369
    1416                   2            232
    3000                   4            540
    ...                    ...          ...

Here, the x's are two-dimensional vectors in ℝ². For instance, x_1^(i) is the living area of the i-th house in the training set, and x_2^(i) is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on.)
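As a concrete sketch (our own code, not from the notes), the table above can be held as plain arrays, with each row of X a two-dimensional feature vector x^(i) and each entry of y the corresponding price:

```python
# each row of X is [living area (ft^2), #bedrooms]; y is the price in $1000s
X = [[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]]
y = [400, 330, 369, 232, 540]

# x^(i)_1 is the living area of the i-th house, x^(i)_2 its number of bedrooms
area_of_house_0, bedrooms_of_house_0 = X[0]
```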

We'll say more about feature selection later, but for now let's take the features as given.

To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the θ_i's are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the θ subscript in h_θ(x), and write it more simply as h(x). To simplify our notation, we also introduce the convention of letting x_0 = 1 (this is the intercept term), so that

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$

where on the right-hand side above we are viewing θ and x both as vectors, and here n is the number of input variables (not counting x_0).

Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ's, how close the h(x^(i))'s are to the corresponding y^(i)'s. We define the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2.$$

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going, and we'll eventually show this to be a special case of a much broader family of algorithms.

1  LMS algorithm

We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some "initial guess" for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let's consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta).$$

(This update is simultaneously performed for all values of j = 0, ..., n.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.

In order to implement this algorithm, we have to work out the partial derivative term on the right hand side. Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j}\, \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2 \\
&= 2 \cdot \frac{1}{2}\bigl(h_\theta(x) - y\bigr) \cdot \frac{\partial}{\partial \theta_j}\bigl(h_\theta(x) - y\bigr) \\
&= \bigl(h_\theta(x) - y\bigr) \cdot \frac{\partial}{\partial \theta_j}\Bigl(\sum_{i=0}^{n} \theta_i x_i - y\Bigr) \\
&= \bigl(h_\theta(x) - y\bigr)\, x_j
\end{aligned}$$

For a single training example, this gives the update rule:¹

$$\theta_j := \theta_j + \alpha \bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\, x_j^{(i)}.$$

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y^(i) − h_θ(x^(i))).
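The single-example update above can be written directly in code; this is a minimal sketch of our own (the function name and the numbers are illustrative, not from the notes):

```python
import numpy as np

def lms_update(theta, x, y, alpha):
    """One LMS (Widrow-Hoff) step for a single training example.
    x includes the intercept term, i.e. x[0] == 1."""
    error = y - theta @ x             # y^(i) - h_theta(x^(i))
    return theta + alpha * error * x  # step proportional to the error

theta = np.zeros(2)                   # start from theta = 0
theta = lms_update(theta, np.array([1.0, 2.0]), 5.0, alpha=0.1)
# prediction was 0, so error = 5.0 and theta moves to [0.5, 1.0]
```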

Thus, if we are encountering a training example on which our prediction nearly matches the actual value of y^(i), we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction h_θ(x^(i)) has a large error (i.e., if it is very far from y^(i)).

We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

Repeat until convergence {

    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i)    (for every j).

}

The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θ_j (for the original definition of J). So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum, and no other local optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function:

[Figure: contours of a quadratic function, with the trajectory taken by gradient descent overlaid.]

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.

When we run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain θ_0 = 71.27, θ_1 = 0.1345. If we plot h_θ(x) as a function of x (area), along with the training data, we obtain the following figure:

[Figure: the housing data with the fitted line h_θ(x) overlaid, price (in $1000) against square feet.]

If the number of bedrooms were included as one of the input features as well, we get θ_0 = 89.60, θ_1 = 0.1392, θ_2 = −8.738.

The above results were obtained with batch gradient descent. There is an alternative to batch gradient descent that also works very well. Consider the following algorithm:

Loop {

    for i = 1 to m, {

        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)    (for every j).

    }

}

In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets "close" to the minimum much faster than batch gradient descent. (Note however that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.²)

¹ We use the notation "a := b" to denote an operation (in a computer program) in which we set the value of a variable a to be equal to the value of b. In other words, this operation overwrites a with the value of b. In contrast, we will write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.
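The two variants can be sketched side by side. This is a minimal illustration under our own assumptions (a tiny noiseless dataset and a fixed learning rate), not the notes' reference code:

```python
import numpy as np

def batch_gd(X, y, alpha=0.05, iters=5000):
    """Batch gradient descent: sum over all m examples per step."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta + alpha * X.T @ (y - X @ theta)
    return theta

def sgd(X, y, alpha=0.05, epochs=2000):
    """Stochastic gradient descent: one update per training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            theta = theta + alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

# data generated from y = 1 + 2x, with x_0 = 1 as the intercept feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta_batch = batch_gd(X, y)   # both converge near [1, 2] on this data
theta_sgd = sgd(X, y)
```

On this consistent dataset both reach the same minimizer; the contrast the text draws is that `batch_gd` touches all m rows before taking each step, while `sgd` updates after every single row.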

For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

2  The normal equations

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θ_j's, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices.

² While it is more common to run stochastic gradient descent as we have described it, with a fixed learning rate α, by slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around the minimum.

2.1  Matrix derivatives

For a function f : ℝ^{m×n} → ℝ mapping from m-by-n matrices to the real numbers, we define the derivative of f with respect to A to be:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

Thus, the gradient ∇_A f(A) is itself an m-by-n matrix, whose (i, j)-element is ∂f/∂A_ij.
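This definition can be sanity-checked numerically: approximate each entry of ∇_A f(A) with a central finite difference and compare against a function whose gradient is known in closed form. A small sketch (our own example, not from the notes):

```python
import numpy as np

def num_grad(f, A, eps=1e-6):
    """Approximate the m-by-n matrix derivative of f at A:
    entry (i, j) is a central difference of f with respect to A_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# illustration: f(A) = sum of squared entries, so grad_A f(A) = 2A
f = lambda A: float((A ** 2).sum())
A = np.array([[1.0, -2.0, 0.5], [3.0, 4.0, -1.0]])
assert np.allclose(num_grad(f, A), 2 * A, atol=1e-4)
```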

For example, suppose

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

is a 2-by-2 matrix, and the function f : ℝ^{2×2} → ℝ is given by

$$f(A) = \frac{3}{2} A_{11} + 5 A_{12}^2 + A_{21} A_{22}.$$

Here, A_ij denotes the (i, j) entry of the matrix A. We then have

$$\nabla_A f(A) = \begin{bmatrix} \frac{3}{2} & 10 A_{12} \\ A_{22} & A_{21} \end{bmatrix}.$$

We also introduce the trace operator, written "tr." For an n-by-n (square) matrix A, the trace of A is defined to be the sum of its diagonal entries:

$$\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$$

If a is a real number (i.e., a 1-by-1 matrix), then tr a = a. (If you haven't seen this "operator notation" before, you should think of the trace of A as tr(A), or as application of the "trace" function to the matrix A. It's more commonly written without the parentheses, however.)

The trace operator has the property that for two matrices A and B such that AB is square, we have that tr AB = tr BA. (Check this yourself!) As corollaries of this, we also have, e.g.,

$$\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA,$$
$$\operatorname{tr} ABCD = \operatorname{tr} DABC = \operatorname{tr} CDAB = \operatorname{tr} BCDA.$$

The following properties of the trace operator are also easily verified. Here, A and B are square matrices, and a is a real number:

$$\operatorname{tr} A = \operatorname{tr} A^T$$
$$\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$$
$$\operatorname{tr}\, aA = a \operatorname{tr} A$$
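The identities above are easy to spot-check numerically on random matrices; a quick sketch (the dimensions are our own choice, picked only so that each product is square where required):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))
D = rng.standard_normal((3, 3))

# trAB = trBA whenever AB is square
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# cyclic permutations: trABC = trCAB = trBCA
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
# transpose invariance and linearity on square matrices
assert np.isclose(np.trace(C), np.trace(C.T))
assert np.isclose(np.trace(C + D), np.trace(C) + np.trace(D))
assert np.isclose(np.trace(2.5 * C), 2.5 * np.trace(C))
```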

We now state without proof some facts of matrix derivatives (we won't need some of these until later this quarter). Equation (4) applies only to non-singular square matrices A, where |A| denotes the determinant of A. We have:

$$\nabla_A \operatorname{tr} AB = B^T \tag{1}$$
$$\nabla_{A^T} f(A) = \bigl(\nabla_A f(A)\bigr)^T \tag{2}$$
$$\nabla_A \operatorname{tr} ABA^TC = CAB + C^TAB^T \tag{3}$$
$$\nabla_A |A| = |A|\bigl(A^{-1}\bigr)^T \tag{4}$$

To make our matrix notation more concrete, let us now explain in detail the meaning of the first of these equations. Suppose we have some fixed matrix B ∈ ℝ^{n×m}. We can then define a function f : ℝ^{m×n} → ℝ according to f(A) = tr AB. Note that this definition makes sense, because if A ∈ ℝ^{m×n}, then AB is a square matrix, and we can apply the trace operator to it; thus, f does indeed map from ℝ^{m×n} to ℝ. We can then apply our definition of matrix derivatives to find ∇_A f(A), which will itself be an m-by-n matrix. Equation (1) above states that the (i, j) entry of this matrix will be given by the (i, j)-entry of B^T, or equivalently, by B_ji.

The proofs of Equations (1-3) are reasonably simple, and are left as an exercise to the reader. Equation (4) can be derived using the adjoint representation of the inverse of a matrix.³

2.2  Least squares revisited

Armed with the tools of matrix derivatives, let us now proceed to find in closed form the value of θ that minimizes J(θ).

We begin by re-writing J in matrix-vectorial notation.

Given a training set, define the design matrix X to be the m-by-n matrix (actually m-by-(n+1), if we include the intercept term) that contains the training examples' input values in its rows:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}.$$

Also, let \vec{y} be the m-dimensional vector containing all the target values from the training set:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}.$$

Now, since h_θ(x^(i)) = (x^(i))^T θ, we can easily verify that

$$X\theta - \vec{y} = \begin{bmatrix} (x^{(1)})^T\theta \\ \vdots \\ (x^{(m)})^T\theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}.$$

Thus, using the fact that for a vector z, we have that $z^Tz = \sum_i z_i^2$:

$$\frac{1}{2}\bigl(X\theta - \vec{y}\bigr)^T\bigl(X\theta - \vec{y}\bigr) = \frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 = J(\theta)$$

Finally, to minimize J, let's find its derivatives with respect to θ. Combining Equations (2) and (3), we find that

$$\nabla_{A^T} \operatorname{tr} ABA^TC = B^TA^TC^T + BA^TC \tag{5}$$

³ If we define A′ to be the matrix whose (i, j) element is (−1)^{i+j} times the determinant of the square matrix resulting from deleting row i and column j from A, then it can be proved that A^{-1} = (A′)^T / |A|. (You can check that this is consistent with the standard way of finding A^{-1} when A is a 2-by-2 matrix. If you want to see a proof of this more general result, see an intermediate or advanced linear algebra text, such as Charles Curtis, 1991, Linear Algebra, Springer.) This shows that A′ = |A| (A^{-1})^T. Also, the determinant of a matrix can be written |A| = Σ_j A_ij A′_ij. Since (A′)_ij does not depend on A_ij (as can be seen from its definition), this implies that (∂/∂A_ij)|A| = A′_ij. Putting all this together shows the result.
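The matrix-vectorial form of J can be checked against the summation form on a few rows of the housing data; a small sketch of our own, using three rows from the table at the start of the notes and the θ quoted earlier for the area-only fit:

```python
import numpy as np

# design matrix with intercept column x_0 = 1, plus living area
X = np.array([[1.0, 2104.0],
              [1.0, 1600.0],
              [1.0, 2400.0]])
y = np.array([400.0, 330.0, 369.0])
theta = np.array([71.27, 0.1345])   # batch-gradient-descent fit from the text

residual = X @ theta - y
J_matrix = 0.5 * residual @ residual   # (1/2)(X theta - y)^T (X theta - y)
J_sum = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(len(y)))
assert np.isclose(J_matrix, J_sum)
```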
