Information Science and Statistics

Series Editors:
M. Jordan
J. Kleinberg
B. Schölkopf

Information Science and Statistics

Akaike and Kitagawa: The Practice of Time Series Analysis.
Bishop: Pattern Recognition and Machine Learning.
Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert Systems.
Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.
Fine: Feedforward Neural Network Methodology.
Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement.
Jensen: Bayesian Networks and Decision Graphs.
Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint.
Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.
Studený: Probabilistic Conditional Independence Structures.
Vapnik: The Nature of Statistical Learning Theory, Second Edition.
Wallace: Statistical and Inductive Inference by Minimum Message Length.

Christopher M. Bishop

Pattern Recognition and Machine Learning

Christopher M. Bishop F.R.Eng.
Assistant Director
Microsoft Research Ltd
Cambridge CB3 0FB, U.K.
http:/

Editors

Michael Jordan
Department of Computer Science and Department of Statistics
University of California, Berkeley
Berkeley, CA 94720
USA

Professor Jon Kleinberg
Department of Computer Science
Cornell University
Ithaca, NY 14853
USA

Bernhard Schölkopf
Max Planck Institute for Biological Cybernetics
Spemannstrasse 38
72076 Tübingen
Germany

Library of Congress Control Number: 2006922522
ISBN-10: 0-387-31073-8
ISBN-13: 978-0387-31073-2

Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in Singapore. (KYO)

This book is dedicated to my family:
Jenna, Mark, and Hugh

Total eclipse of the sun, Antalya, Turkey, 29 March 2006.

Preface

Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years. In particular, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic models. Also, the practical applicability of Bayesian methods has been greatly enhanced through the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation. Similarly, new models based on kernels have had significant impact on both algorithms and applications.

This new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first year PhD students, as well as researchers and practitioners, and assumes no previous knowledge of pattern recognition or machine learning concepts. Knowledge of multivariate calculus and basic linear algebra is required, and some familiarity with probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory.

Because this book has broad scope, it is impossible to provide a complete list of references, and in particular no attempt has been made to provide accurate historical attribution of ideas. Instead, the aim has been to give references that offer greater detail than is possible here and that hopefully provide entry points into what, in some cases, is a very extensive literature. For this reason, the references are often to more recent textbooks and review articles rather than to original sources.

The book is supported by a great deal of additional material, including lecture slides as well as the complete set of figures used in the book, and the reader is encouraged to visit the book web site for the latest information:

http:/

Exercises

The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been carefully chosen to reinforce concepts explained in the text or to develop and generalize them in significant ways, and each is graded according to difficulty ranging from (⋆), which denotes a simple exercise taking a few minutes to complete, through to (⋆⋆⋆), which denotes a significantly more complex exercise.

It has been difficult to know to what extent these solutions should be made widely available. Those engaged in self study will find worked solutions very beneficial, whereas many course tutors request that solutions be available only via the publisher so that the exercises may be used in class. In order to try to meet these conflicting requirements, those exercises that help amplify key points in the text, or that fill in important details, have solutions that are available as a PDF file from the book web site. Such exercises are denoted by www. Solutions for the remaining exercises are available to course tutors by contacting the publisher (contact details are given on the book web site). Readers are strongly encouraged to work through the exercises unaided, and to turn to the solutions only as required.

Although this book focuses on concepts and principles, in a taught course the students should ideally have the opportunity to experiment with some of the key algorithms using appropriate data sets. A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab software implementing most of the algorithms discussed in this book.

Acknowledgements

First of all I would like to express my sincere thanks to Markus Svensén who has provided immense help with preparation of figures and with the typesetting of the book in LaTeX. His assistance has been invaluable.

I am very grateful to Microsoft Research for providing a highly stimulating research environment and for giving me the freedom to write this book (the views and opinions expressed in this book, however, are my own and are therefore not necessarily the same as those of Microsoft or its affiliates).

Springer has provided excellent support throughout the final stages of preparation of this book, and I would like to thank my commissioning editor John Kimmel for his support and professionalism, as well as Joseph Piliero for his help in designing the cover and the text format and MaryAnn Brickner for her numerous contributions during the production phase. The inspiration for the cover design came from a discussion with Antonio Criminisi.

I also wish to thank Oxford University Press for permission to reproduce excerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop, 1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are reproduced with the permission of Arvin Calspan Advanced Technology Center. I would also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1, and Bernhard Schölkopf for permission to use his kernel PCA code to plot Figure 12.17.

Many people have helped by proofreading draft material and providing comments and suggestions, including Shivani Agarwal, Cedric Archambeau, Arik Azran, Andrew Blake, Hakan Cevikalp, Michael Fourman, Brendan Frey, Zoubin Ghahramani, Thore Graepel, Katherine Heller, Ralf Herbrich, Geoffrey Hinton, Adam Johansen, Matthew Johnson, Michael Jordan, Eva Kalyvianaki, Anitha Kannan, Julia Lasserre, David Liu, Tom Minka, Ian Nabney, Tonatiuh Peña, Yuan Qi, Sam Roweis, Balaji Sanjiya, Toby Sharp, Ana Costa e Silva, David Spiegelhalter, Jay Stokes, Tara Symeonides, Martin Szummer, Marshall Tappen, Ilkay Ulusoy, Chris Williams, John Winn, and Andrew Zisserman.

Finally, I would like to thank my wife Jenna who has been hugely supportive throughout the several years it has taken to write this book.

Chris Bishop
Cambridge
February 2006

Mathematical notation

I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the field. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour.

I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as $\mathbf{x}$, and all vectors are assumed to be column vectors. A superscript $\mathrm{T}$ denotes the transpose of a matrix or vector, so that $\mathbf{x}^{\mathrm{T}}$ will be a row vector. Uppercase bold Roman letters, such as $\mathbf{M}$, denote matrices. The notation $(w_1, \ldots, w_M)$ denotes a row vector with $M$ elements, while the corresponding column vector is written as $\mathbf{w} = (w_1, \ldots, w_M)^{\mathrm{T}}$.

The notation $[a, b]$ is used to denote the closed interval from $a$ to $b$, that is the interval including the values $a$ and $b$ themselves, while $(a, b)$ denotes the corresponding open interval, that is the interval excluding $a$ and $b$. Similarly, $[a, b)$ denotes an interval that includes $a$ but excludes $b$. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.

The $M \times M$ identity matrix (also known as the unit matrix) is denoted $\mathbf{I}_M$, which will be abbreviated to $\mathbf{I}$ where there is no ambiguity about its dimensionality. It has elements $I_{ij}$ that equal 1 if $i = j$ and 0 if $i \neq j$.

A functional is denoted $f[y]$ where $y(x)$ is some function. The concept of a functional is discussed in Appendix D.

The notation $g(x) = O(f(x))$ denotes that $|g(x)/f(x)|$ is bounded as $x \to \infty$. For instance, if $g(x) = 3x^2 + 2$, then $g(x) = O(x^2)$.

The expectation of a function $f(x, y)$ with respect to a random variable $x$ is denoted by $\mathbb{E}_x[f(x, y)]$. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance