Matlab code for fitting mixture models using the EM algorithm
******************
** Introduction **
******************
This archive contains Matlab code for fitting mixture models to
discrete and continuous data. The algorithm is based on EM, and can
accommodate any pattern of incompleteness in the data set. It can be
used for density estimation and, by conditioning, for function
approximation and classification. Details are outlined in
(Ghahramani & Jordan 1994).
The code was written for Matlab Version 3.5i; porting it to more
recent versions of Matlab should be trivial. The code is intended for
exploratory use as a research tool. It is not optimized for speed; in
fact, at times speed is deliberately sacrificed for clarity of code.
Comments, bug reports, and better implementations are welcome.
Zoubin Ghahramani
zoubin@cs.toronto.edu
-----------------------------------------------------------------------
The code is available by ftp as EMcode.tar.Z. To extract the
archive, type the following at the UNIX prompt:
% uncompress EMcode.tar.Z
% tar xvf EMcode.tar
-----------------------------------------------------------------------
*******************************
** The Learning Engine Files **
*******************************
The learning engine files are named according to the following
codes:
"EM"      all file names start with this prefix
"class"   the classification algorithms
"bin"     binary-valued inputs (Bernoulli mixtures)
"d"       the algorithms with diagonal covariance matrices
"inc"     the algorithms that can handle incomplete data
For example, EM_inc_class_d.m learns a classifier with
diagonal-covariance Gaussians from incomplete data.
********************
** The Data Files **
********************
The format of a data file is very simple and uniform: it must be
a rectangular matrix of numbers. Each row is a data vector. Therefore
the number of rows in the file is the number of input patterns (N) and
the number of columns is the dimensionality of the inputs (D).
Missing inputs are denoted by setting their value to -999.
For classification problems the first D-1 columns are real-valued
attribute data and the Dth column is an integer from 1...nclass,
denoting the class to which that data point belongs. Missing values
(-999) can appear in any column.
For binary-input problems the data file must contain only the
values {0, 1, -999}.
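As a concrete illustration, here is a minimal Matlab sketch of
building such a data matrix and writing it out in plain ASCII. The
file name and variable name are illustrative, not part of the
package:

   % N = 3 patterns, D = 2 inputs; one pattern per row,
   % with missing inputs coded as -999
   X = [ 5.1   4.9 ;
         4.7  -999 ;     % second input of this pattern is missing
        -999   6.2 ];
   save mydata.dat X -ascii    % a rectangular matrix of numbers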
*********************
** The Script File **
*********************
A sample script file for running the EM algorithm for Gaussian
mixtures is shown in script.m. For a classification problem there is
also a corresponding scriptclass.m.
Note that the algorithm estimates the maximum likelihood
parameters of the joint input/output density. To obtain estimates of
an output given an input, for function approximation or
classification, we need some extra code that forms conditional
expectations or samples stochastically:
***********************************************
** Function Approximation and Classification **
***********************************************
regress.m    uses the parameters of the mixture model to predict
             the values of some variables (y) from the other
             variables (x) using the least-squares estimate E(y|x);
             a sketch of this computation follows below.
classify.m   uses the parameters of the mixture model to classify
             new data points, to fill in missing data, and to form
             class-conditional means.
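For concreteness, here is a minimal sketch of the least-squares
estimate E(y|x) under a Gaussian mixture. It is not the actual
interface of regress.m: the function name, argument names, and the
use of three-dimensional arrays (which require a modern Matlab) are
all illustrative assumptions.

   function y = cond_mean(x, p, mu, S, dx)
   % E(y|x) under a Gaussian mixture (illustrative, not regress.m):
   %   p(j)     mixing proportion of component j
   %   mu(j,:)  mean of component j, partitioned as [mu_x mu_y]
   %   S(:,:,j) covariance of component j, partitioned conformably
   %   dx       dimensionality of the conditioning variables x
   K = length(p);  h = zeros(K,1);
   for j = 1:K
     mx  = mu(j,1:dx)';
     Sxx = S(1:dx,1:dx,j);
     d   = x(:) - mx;
     % responsibility of component j given the observed x; the common
     % normalizing constant of the mixture cancels when h is rescaled
     h(j) = p(j) * exp(-0.5*d'*(Sxx\d)) / sqrt(det(2*pi*Sxx));
   end
   h = h / sum(h);
   y = 0;
   for j = 1:K
     mx  = mu(j,1:dx)';     my  = mu(j,dx+1:end)';
     Sxx = S(1:dx,1:dx,j);  Syx = S(dx+1:end,1:dx,j);
     % each component contributes its own linear regression of y on x,
     % weighted by its responsibility
     y = y + h(j) * (my + Syx*(Sxx\(x(:) - mx)));
   end

Classification is analogous: with the class label as the Dth
variable, the class posterior is a responsibility-weighted sum, and
the predicted class is the one that maximizes it.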
**********************
** Sample Data Sets **
**********************
Four small data sets are provided:
dgauss1  -- a single Gaussian with mean (5,5) and covariance
            matrix (1.25 2.25, 2.25 4.25). The missing-data
            pattern is nontrivial.
dgauss3  -- a mixture of three Gaussians with
            mu1 = (-5,0)  cov1 = (8 10, 10 13)
            mu2 = (2,2)   cov2 = (1.25 -0.5, -0.5 1)
            mu3 = (4,6)   cov3 = (2 1, 1 1)
dclass   -- a simple classification problem with 2 Gaussian classes
            with means (0,0) and (2,2) and unit variance.
irisdata -- the classic Iris data set with varying proportions
            of missing data (irisdatat is the test set of 50 items).
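To get a feel for a data set, it can be loaded and plotted directly.
The sketch below assumes the sample files are plain ASCII, so that,
e.g., "load dgauss3" creates a variable of the same name:

   load dgauss3
   X  = dgauss3;
   ok = X(:,1) ~= -999 & X(:,2) ~= -999;   % keep fully observed rows
   plot(X(ok,1), X(ok,2), '+')             % scatter of the two inputs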