*********************************************************
** Learning from complete and incomplete data using EM **
*********************************************************

This directory contains matlab code for the series of EM based
algorithms for learning from incomplete data outlined in
(Ghahramani & Jordan 1994).

The code is Matlab Version 3.5i. Modification for more recent versions
of Matlab should be trivial.

The code is intended for exploratory use as a research tool. It is not
written in a very optimized way (it wouldn't have been written in
Matlab!)-- in fact at times speed is clearly sacrificed for clarity of
code.

Comments, bug reports, and better implementations are welcome.

Zoubin Ghahramani
zoubin@psyche.mit.edu
3/10/94

-----------------------------------------------------------------------

To extract the tar files type from the UNIX prompt line:

% uncompress EMcode.tar.Z
% tar xvf EMcode.tar

-----------------------------------------------------------------------

*******************************
** The Learning Engine Files **
*******************************

The learning engine for each of the series of algorithms is denoted by
the following codes:

"EM"      all start with this code
"class"   the classification algorithms
"bin"     for binary valued inputs (Bernoulli mixtures)
"d"       the algorithms with diagonal covariance matrices
"inc"     the algorithms that can handle incomplete data

E.g. EM_inc_class_d.m is the code for learning a classifier which has
diagonal covariance Gaussians from incomplete data.

********************
** The Data Files **
********************

The format of a data file is very simple and uniform: it must be a
rectangular matrix of numbers. Each row is a data vector. Therefore the
number of rows in the file is the number of input patterns (N) and the
number of columns is the dimensionality of the inputs (D). Missing
inputs are denoted by setting their value to -999.
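As an illustration of the file format described above, the following is a minimal sketch and not part of the distributed code. The file name toy.dat and the variable names are hypothetical; the sketch assumes only that a data file is a plain ASCII matrix with -999 marking missing entries.

```matlab
% Sketch only: 'toy.dat' and all variable names are illustrative
% assumptions, not files or identifiers shipped with this code.
X = [5.1 3.5; 4.9 -999; -999 3.2];      % N = 3 patterns, D = 2 inputs
save('toy.dat', 'X', '-ascii');         % write a rectangular ASCII matrix

Y = load('toy.dat');                    % read it back as an N-by-D matrix
[N, D] = size(Y);                       % rows = patterns, columns = dimensions
missing = (Y == -999);                  % logical mask of missing inputs
complete_rows = find(~any(missing, 2)); % indices of fully observed patterns

fprintf('N = %d, D = %d, %d missing values\n', N, D, sum(missing(:)));
```

Keeping the missing-value mask separate from the data makes it easy to check how much of a file is incomplete before running any of the learning engines on it.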
For classification problems the first D-1 columns are real valued
attribute data and the Dth column is an integer from 1...nclass,
denoting the class to which that data point belongs. Missing values
(-999) can appear in any column.

For binary input problems the data file must be all {0,1,-999}.

*********************
** The Script File **
*********************

A sample script file for running the EM algorithm for Gaussian mixtures
is shown in script.m. For a classification problem there is also a
corresponding scriptclass.m.

Note that the algorithm will estimate the maximum likelihood parameters
of the joint input/output density. To obtain estimates of an output
given an input for function approximation or classification we need
some extra code that will form conditional expectations or sample
stochastically:

***********************************************
** Function Approximation and Classification **
***********************************************

regress.m   uses the parameters of the mixture model to predict the
            values of some variables (y) from the other variables (x)
            using the least squares estimate E(y|x).

classify.m  uses the parameters of the mixture model to classify new
            data points, to fill in data, and to form class conditional
            means.

**********************
** Sample Data Sets **
**********************

Four small data sets are provided:

dgauss1--A single Gaussian with mean (5,5) and covariance matrix
         (1.25 2.25, 2.25 4.25). The missing data pattern is
         nontrivial.

dgauss3--A mixture of three Gaussians w/
         mu1=(-5,0)  cov1=(8 10,10 13)
         mu2=(2,2)   cov2=(1.25 -0.5, -0.5 1)
         mu3=(4,6)   cov3=(2 1,1 1)

dclass--A simple classification problem w/ 2 Gaussian classes with
        means (0,0) and (2,2) and variance 1.

irisdata--The classic Iris data set with varying proportions of missing
          data (irisdatat is the test set of 50 items).
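To experiment with data resembling dgauss1, a synthetic set can be generated along the following lines. This is a hedged sketch, not the procedure used to create the supplied file. The sample size, the 20% missing rate, and the missing-at-random pattern are all assumptions chosen for illustration; the actual missing data pattern in dgauss1 is not reproduced here.

```matlab
% Sketch only: N, the missing rate, and the random masking scheme are
% illustrative assumptions, not properties of the supplied dgauss1 file.
N     = 100;                            % assumed number of patterns
mu    = [5 5];                          % mean given for dgauss1
Sigma = [1.25 2.25; 2.25 4.25];         % covariance given for dgauss1

R = chol(Sigma);                        % upper triangular, R'*R == Sigma
X = randn(N, 2) * R + repmat(mu, N, 1); % rows are samples from N(mu, Sigma)

mask = rand(N, 2) < 0.2;                % assumed: each entry missing w.p. 0.2
mask(all(mask, 2), 2) = false;          % keep at least one observed value per row
X(mask) = -999;                         % encode missing inputs as -999
```

The resulting matrix X follows the data file format described earlier, so it can be saved as a plain ASCII matrix and used in place of one of the supplied data sets.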