%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Revision March 7 1997
% \documentstyle[mystyles,apalike,psfig,12pt]{myarticle3}
\documentstyle[mystyles,psfig]{kbsjrnl}
\renewcommand{\*}[1]{}
\input{mathnew.tex}
\newcommand{\new}{^{\rm \scriptstyle new}}
\newcommand{\mnew}{^{(m)\; \rm \scriptstyle new}}
\newcommand{\bea}{\begin{eqnarray*}}
\newcommand{\eea}{\end{eqnarray*}}
\begin{document}
\begin{article}
%%%%% To be entered at Kluwers: =====>>
%% Enter each date on its own line, and write in correct dates:
\begin{kluwerdates}
Received Date
Accepted Date
Final Manuscript Date
\end{kluwerdates}
\journame{Machine Learning}
\volnumber{?}
\issuenumber{?}
\issuemonth{?}
\volyear{1997}
%% Do not delete either of the following two commands.
%% Please supply facing curly brackets for the part you
%% are not using for this article.
%\received{May 1, 1991}\revised{}
\authorrunninghead{Z. Ghahramani and M.I. Jordan}
\titlerunninghead{Factorial Hidden Markov Models}
% \setcounter{page}{275} %% This command is optional.
%% May set page number only for first page in
%% issue, if desired.
%% <<== End of commands to be entered at Kluwers
%% Authors, start here ==>>
\title{Factorial Hidden Markov Models}
\authors{Zoubin Ghahramani}
\email{zoubin@cs.toronto.edu}
\affil{Department of Computer Science, University of Toronto, Toronto, ON
M5S 3H5, Canada}
\authors{Michael I. Jordan}
\email{jordan@psyche.mit.edu}
\affil{Department of Brain \& Cognitive Sciences, Massachusetts
Institute of Technology, \\
Cambridge, MA 02139, USA}
\editor{???}
\abstract{
Hidden Markov models (HMMs) have proven to be one of the most widely
used tools for learning probabilistic models of time series data. In
an HMM, information about the past of the time series is conveyed
through a single discrete variable---the hidden state. We present a
generalization of HMMs in which this state is factored into multiple
state variables and is therefore represented in a distributed manner.
We describe an exact algorithm for inferring the posterior
probabilities of the hidden state variables given the observations,
and relate it to the forward--backward algorithm for HMMs and to
algorithms for more general graphical models. Due to the combinatorial
nature of the hidden state representation, this exact algorithm is
intractable. As in other intractable systems, approximate inference
can be carried out using Gibbs sampling or variational methods. We
also present a structured approximation in which the state
variables are decoupled, based on which we derive a tractable
algorithm for learning the parameters of the model. Empirical
comparisons suggest that these approximations are efficient and
accurate alternatives to the exact methods. Finally, we use the
structured approximation to model Bach's chorales and show that
factorial HMMs can capture statistical structure in this data set
which an unconstrained HMM cannot.}
\keywords{Hidden Markov models, time series, EM algorithm, graphical
models, Bayesian networks, mean field theory}
\section{Introduction}
Due to its flexibility and to the simplicity and efficiency of its
parameter estimation algorithm, the hidden Markov model (HMM) has
emerged as one of the basic statistical tools for modeling discrete
time series, finding widespread application in the areas of speech
recognition~\cite{RabJua86} and computational molecular
biology~\cite{BalChaHun94}. An HMM is essentially a mixture model,
encoding information about the history of a time series in the value
of a single multinomial variable---the hidden state---which can take
on one of $K$ discrete values. This multinomial assumption allows an
efficient parameter estimation algorithm to be derived---the
Baum-Welch algorithm---which considers each of the $K$ settings of the
hidden state at each time step. However, the multinomial assumption
also severely limits the representational capacity of HMMs. For
example, to represent 30 bits of information about the history of a
time sequence, an HMM would need $K=2^{30}$ distinct states. On the
other hand, an HMM with a {\em distributed} state representation could
achieve the same task with 30 binary state variables~\cite{WilHin91}.
This paper addresses the problem of deriving efficient learning
algorithms for hidden Markov models with distributed state
representations.

The need for distributed state representations in HMMs can be
motivated in two ways. First, such representations allow the model to
automatically decompose the state space into features that decouple
the dynamics of the process that generated the data. Second,
distributed state representations simplify the task of modeling time
series which are known a priori to be generated from an interaction of
multiple, loosely-coupled processes. For example, a speech signal
generated by the superposition of multiple simultaneous speakers can
be potentially modeled with such an architecture.

Williams and Hinton~(1991) first formulated the problem of learning in
HMMs with distributed state representation and proposed a solution
based on deterministic Boltzmann learning.\footnote{For related work
on inference in distributed state HMMs, see Dean and Kanazawa
(1989).}~\nocite{WilHin91,DeaKan89} The approach presented in this
paper is similar to Williams and Hinton's in that it can also be
viewed from the framework of statistical mechanics and mean field
theory. However, our learning algorithm is quite different in that it
makes use of the special structure of HMMs with distributed state
representation, resulting in a significantly more efficient learning
procedure. Anticipating the results in section~\ref{sec3}, this
learning algorithm obviates the need for the two-phase procedure of
Boltzmann machines, has an exact M-step, and makes use of the
forward-backward algorithm from classical HMMs as a subroutine. A
different approach comes from Saul and Jordan (1995), who derived a
set of rules for computing the gradients required for learning in HMMs
with distributed state spaces.~\nocite{SauJor95} However, their
methods can only be applied to a limited class of architectures.

Hidden Markov models with distributed state representations are a
particular class of probabilistic graphical
model~\cite{Pea88,LauSpi88}. Probabilistic graphical models represent
probability distributions as graphs, with the nodes corresponding to
random variables and the links representing conditional independence
relations. The relation between hidden Markov models and graphical
models has recently been reviewed in Smyth, Heckerman and Jordan
(1997).~\nocite{SmyHecJor97} While exact probability propagation
algorithms exist for graphical models \cite{JenLauOle90}, these
algorithms are intractable for densely-connected models such as the
ones we consider in this paper. One approach to dealing with this
issue is to utilize stochastic sampling methods (Kanazawa et al.,
1995).~\nocite{KanKolRus95} Another approach, which provides the basis
for algorithms described in the current paper, is to make use of
variational methods (cf.~Saul, Jaakkola, \&
Jordan,~1996). \nocite{SauJaaJor96}

The paper is organized as follows. In the following section we define the
probabilistic model for factorial HMMs. In section~\ref{sec3} we
present algorithms for inference and learning. In section~\ref{sec4}
we describe empirical results comparing exact and approximate
algorithms for learning on the basis of time complexity and model
quality. We also apply factorial HMMs to a real time series data set
consisting of the melody lines from a collection of chorales by
J. S. Bach. We discuss several generalizations of the probabilistic
model in section~\ref{sec5}, and conclude in section~\ref{sec6}. Where
necessary, details of derivations are provided in the appendixes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Probabilistic Model}
\label{sec2}
We begin by describing the hidden Markov model. In a hidden Markov
model, a sequence of observations $\{Y_t\}$, where $t=1, \ldots, T$, is
modeled by specifying a probabilistic relation between the
observations and a sequence of hidden states $\{S_t\}$, and a Markov
transition structure linking the hidden states. The model assumes two
sets of conditional independence relations: that $Y_t$ is independent
of all other observations and states given $S_t$, and that $S_t$ is
independent of $S_1 \ldots S_{t-2}$ given $S_{t-1}$ (the first-order
Markov property). Using these independence relations, the joint probability
for the sequence of states and observations can be factored as:
\be
P(\{S_t,Y_t\}) = P(S_1)P(Y_1|S_1) \prod_{t=2}^T P(S_t|S_{t-1})
P(Y_t|S_t).
\el{hmm1}
\ee
The conditional independencies specified by equation~\er{hmm1} can be
expressed graphically in the form of Figure~\ref{fig1}a. The state is
a single multinomial random variable that can take one of $K$ discrete
values, $S_t \in \{1, \ldots, K\}$. The state transition
probabilities, $P(S_t|S_{t-1})$, are specified by a $K\times K$
transition matrix. If the observations are discrete symbols taking on
one of $D$ values, the observation probabilities $P(Y_t|S_t)$ can be
fully specified as a $K\times D$ observation matrix. For a continuous
observation vector, $P(Y_t|S_t)$ can be modeled in many different
forms, such as a Gaussian, mixture of Gaussians, or even a neural
network.\footnote{In speech, neural networks are sometimes used to
model $P(S_t|Y_t)$, using Bayes rule to obtain from this the
observation probabilities in the HMM.}
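As a concrete illustration, the factorization in equation~\er{hmm1} can be
evaluated directly for the discrete-output case. The following schematic
Python sketch uses toy dimensions and randomly generated parameters, for
illustration only:

```python
import numpy as np

# Schematic evaluation of the joint factorization
# P({S_t, Y_t}) = P(S_1) P(Y_1|S_1) prod_{t>=2} P(S_t|S_{t-1}) P(Y_t|S_t)
# for a discrete-output HMM.  Toy sizes and parameters, illustration only.
K, D = 3, 2                                   # K hidden states, D output symbols
rng = np.random.default_rng(0)
pi = np.full(K, 1.0 / K)                      # initial distribution P(S_1)
A = rng.dirichlet(np.ones(K), size=K)         # A[i, j] = P(S_t = j | S_{t-1} = i)
B = rng.dirichlet(np.ones(D), size=K)         # B[i, y] = P(Y_t = y | S_t = i)

def joint_log_prob(states, obs):
    """log P({S_t, Y_t}) for one state/observation sequence."""
    lp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for t in range(1, len(states)):
        lp += np.log(A[states[t - 1], states[t]]) + np.log(B[states[t], obs[t]])
    return lp

states, obs = [0, 1, 1, 2], [1, 0, 0, 1]
```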
\newcommand{\text}[3]{\put(#1,#2){\makebox(0,0){#3}}}
\begin{figure}
\begin{picture}(360,110)
\text{70}{42}{\psfig{figure=hmm.eps,width=1.8in}}
\text{230}{55}{\psfig{figure=fhmm.eps,width=2.8in}}
\text{20}{110}{a}
\text{155}{110}{b}
\end{picture}
\caption{a) A directed acyclic graph (DAG) specifying conditional
independence relations for a hidden Markov model. Each node is
conditionally independent from its non-descendants given its parents.
b) A DAG representing the conditional independence relations in a
factorial HMM.}
\label{fig1}
\end{figure}

In the present paper, we generalize the HMM state representation by
allowing the state to be represented by a collection of state
variables,
\be
S_t=S^{(1)}_t, \ldots, S^{(m)}_t, \ldots, S^{(M)}_t,
\el{fact}
\ee
each of which can take on $K^{(m)}$ values. We refer to these models
as {\em factorial hidden Markov models} as the state space consists of
the cross product of these state variables. For simplicity, we will
assume that $K^{(m)} = K$, for all $m$, although all the results we
present can be trivially generalized to the case of differing
$K^{(m)}$. Given that the state space of this factorial HMM consists
of all $K^M$ combinations of the $\Smt$ variables, placing no
constraints on the state transition structure would result in a $K^M
\times K^M$ transition matrix. Such an unconstrained system is
uninteresting for several reasons: (1) it is equivalent to an HMM with
$K^M$ states; (2) it is unlikely to discover any interesting structure
in the $M$ state variables, as all variables are allowed to interact
arbitrarily; (3) both the time complexity and sample complexity of the
estimation algorithm are exponential in $M$.

We therefore focus on factorial HMMs in which the underlying state
transitions are constrained. A natural structure to consider is one in
which each state variable evolves according to its own dynamics,
and is {\em a priori} uncoupled from the other state variables,
\be
P(S_t|S_{t-1}) = \prod_{m=1}^M P(\Smt|S\m_{t-1}).
\el{fact2}
\ee
A graphical representation for this model is presented in
Figure~\ref{fig1}b. The transition structure for this system can be
represented as $M$, $K \times K$ matrices. Generalization to models
which allow coupling between the state variables is briefly discussed
in section~\ref{sec5}.

As shown in Figure~\ref{fig1}b, the observation at time step $t$ can
depend on all the state variables at that time step. For continuous
observations, one simple form for this dependence assumes that the
observation is Gaussian distributed with mean specified by a linear
function of the state variables. We represent the state variables as
$K \times 1$ vectors, where each of the $K$ discrete values
corresponds to a 1 in one position and 0 elsewhere. The probability
density of a $D \times 1$ observation vector $Y_t$ under this Gaussian
model is
\be
P(Y_t|S_t) = |C|^{-1/2} \; (2 \pi)^{-D/2} \;
\exp \left\{ - \frac{1}{2} \left(Y_t - \hat{Y}_t \right)' C^{-1}
\left( Y_t - \hat{Y}_t \right) \right\}, \mathletter{a}
\el{out}
\ee
where
\be
\hat{Y}_t = \sum_{m=1}^M W\m \Smt. \mathletter{b}
\ee
Each $W\m$ matrix is a $D \times K$ matrix whose columns are the
contributions to the means for each of the settings of $S_t\m$, $C$ is
the covariance matrix, $'$ denotes matrix transpose, and $|\cdot |$
denotes the matrix determinant. Since there are $K$ settings for each
of the $M$ state variables, the marginal density of $Y_t$ is a mixture
of $K^M$ Gaussians with differing means and a constant covariance
matrix, $C$. The means are obtained by forming all sums in which one
column is picked from each of the $M$ $W\m$ matrices. This static
model, without inclusion of the time index and the Markov dynamics, is
a generalization of mixtures of Gaussians which can be used to model
data with componential structure, such as images with multiple
objects~\cite{Zem93,HinZem94,Gha95}. The model we are considering in
this paper extends this idea to modeling time series by allowing
Markov dynamics in the discrete state variables underlying the
mixture. Unless otherwise stated, we will assume this Gaussian
observation model throughout the paper.
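To make the generative process concrete, the following schematic sketch
samples a sequence from a factorial HMM with uncoupled transitions
(equation~\er{fact2}) and the linear-Gaussian output model of
equation~\er{out}. All sizes and parameter values are illustrative:

```python
import numpy as np

# Schematic generative sampler for a factorial HMM with linear-Gaussian
# output: Y_t ~ N(sum_m W^(m) S_t^(m), C).  Toy sizes and parameters only.
M, K, D, T = 3, 2, 4, 5
rng = np.random.default_rng(1)
W = rng.normal(size=(M, D, K))               # W[m]: D x K contribution matrix
C_chol = 0.1 * np.eye(D)                     # Cholesky factor of covariance C
pi = np.full((M, K), 1.0 / K)                # per-chain initial distributions
P = rng.dirichlet(np.ones(K), size=(M, K))   # P[m, i, j] = P(next=j | prev=i)

S = np.zeros((T, M), dtype=int)
Y = np.zeros((T, D))
for t in range(T):
    for m in range(M):
        probs = pi[m] if t == 0 else P[m, S[t - 1, m]]
        S[t, m] = rng.choice(K, p=probs)
    # the mean sums one column picked from each W^(m)
    mean = sum(W[m][:, S[t, m]] for m in range(M))
    Y[t] = mean + C_chol @ rng.normal(size=D)
```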

The hidden state variables at one time step, while marginally
independent, become conditionally dependent given the observation
sequence. This can be determined by applying the semantics of
directed graphs, in particular the d-separation criterion (Pearl,
1988), \nocite{Pea88} to the graphical model in Figure~\ref{fig1}b.
Consider the Gaussian model in equation~\er{out}. Given an observation
vector $Y_t$, the posterior probability of each of the settings of the
hidden state variables is proportional to the probability of $Y_t$
under a Gaussian with mean $\hat{Y}_t$. Since $\hat{Y}_t$ is a
function of all the state variables, the probability of a setting of
one of the state variables will depend on the setting of the other state
variables.\footnote{If the columns of $W\m$ and $W^{(n)}$ are
orthogonal for every pair of state variables, $m$ and $n$, and $C$ is
a diagonal covariance matrix, then the state variables will no longer
be dependent given the observation. In this case there is no
``explaining away'': each state variable is modeling the variability
in the observation along a different subspace.} This induced
dependency effectively couples all of the hidden state variables for
the purposes of calculating posterior probabilities and makes exact
inference intractable for the factorial HMM.
\section{Inference and Learning}
\label{sec3}
Given a probabilistic model structure and parameters, the inference
problem can generally be stated as computing the posterior
probabilities of the hidden variables given the observations. In the
context of an HMM used for speech, for example, the observations may
be acoustic vectors and the inference problem may be estimating the
probabilities for a particular word or sequence of phonemes~(the
hidden state). This problem can be solved efficiently via the
forward--backward~(F--B) algorithm~\cite{RabJua86}, which can be shown to
be a special case of the Jensen, Lauritzen, and Olesen~(JLO; 1990)
\nocite{JenLauOle90} algorithm for probability propagation in more
general graphical models~\cite{SmyHecJor97}. In some cases, rather than
a probability distribution over hidden states it is desirable to infer
the single most probable hidden state sequence. This can be achieved
via the Viterbi algorithm~\cite{Vit67}, a form of dynamic programming
which is very closely related to the F-B algorithm and also
has analogues in the graphical model literature~\cite{Daw92}.

The learning problem for probabilistic models consists of two
components: learning the structure of the model, and learning the
parameters of the model. Structure learning is a topic of current
research in both the graphical models and machine learning
communities~(e.g.\ Heckerman, 1995; Stolcke and Omohundro,
1993). \nocite{Hec95,StoOmo93} In the current paper we deal
exclusively with the problem of learning the parameters for a given
structure.

The parameters of a factorial HMM can be estimated via the Expectation
Maximization~(EM) algorithm~\cite{DemLaiRub77}, which in the case of
classical HMMs is known as the Baum--Welch algorithm~\cite{BauPetSou70}.
This procedure iterates between a step that fixes the current parameters
and computes posterior probabilities over the hidden states~(the E-step),
and a step that uses these probabilities to maximize the expected log
likelihood of the parameters~(the M-step). Since the E-step of EM
is exactly the inference problem as described above, we subsume
the discussion of both inference and learning problems into our
description of the EM algorithm for factorial HMMs.

The EM algorithm derives from the definition of the expected log
likelihood of the complete~(observed and hidden) data:
\be
\Q(\phi\new|\phi) = E \left\{ \rule{0mm}{2ex} \log P(\{S_t,Y_t\} |
\phi\new) \; | \; \phi, \{
Y_t\} \right\},
\el{Q}
\ee
where $\Q$ is a function of the parameters of the model $\phi\new$,
given the current estimate of the parameters $\phi$ and the
observation sequence $\{ Y_t\}$. For the factorial HMM the parameters
of the model are $\phi=\{ W\m, \pi\m, P\m, C \}$, where $\pi\m
= P(S\m_1)$ and $P\m = P(\Smt|S\m_{t-1})$. The E-step consists of
computing $\Q$. By expanding~\er{Q} using
equations~\er{hmm1}--\er{out}, we find that $\Q$ can be expressed as a
function of three types of expectations over the hidden state
variables: $\Smte$, $\SmSnte$, and $\Smtte$, where $ \langle
\cdot \rangle$ has been used to abbreviate $E \{ \cdot | \phi, \{ Y_t
\} \}$. In the HMM notation of Rabiner and Juang~(1986),
\nocite{RabJua86} $\Smte$ corresponds to $\gamma_t$, the vector of state
occupation probabilities, $\Smtte$ corresponds to $\xi_t$, the
$K \times K$ matrix of state occupation probabilities at two
consecutive time steps, whereas $\SmSnte$ has no analogue when there
is only a single underlying Markov model. The M-step uses these
expectations to maximize $\Q$ as a function of $\phi\new$. Using
Jensen's inequality, Dempster~et al.~(1977) showed that each iteration
of the E and M steps increases the likelihood, $P(\{Y_t\}|\phi)$,
until convergence to a~(local) maximum.

As in hidden Markov models, the exact M step for factorial HMMs is
simple and tractable. For example, the M step for the parameters of
the output model described in equation~\er{out} can be found via the
normal equations for weighted linear regression. Similarly, the M step
for the priors, $\pi\m$, and state transition matrices, $P\m$, are
identical to those used in the Baum--Welch algorithm. The details of
the M-step are described in Appendix~\ref{appa}. We now turn to the
substantially more difficult problem of computing the expectations
required for the E-step.
\subsection{Exact calculations}
% EXACT
Unfortunately, the exact E-step for factorial HMMs is computationally
intractable. This fact can best be shown by making reference to
standard algorithms for probabilistic inference in graphical models
(Lauritzen \& Spiegelhalter, 1988),\nocite{LauSpi88} although it can
also be derived readily from direct application of Bayes rule.
Consider the computations that are required for calculating posterior
probabilities for the factorial HMM shown in Figure~\ref{fig1}b within
the framework of the Lauritzen and Spiegelhalter algorithm.
Moralizing and triangulating the graphical structure for the factorial
HMM results in a junction tree (in fact a chain) with $T (M+1) - M$
cliques of size $M+1$. The resulting probabilistic propagation
algorithm has time complexity $O (T M K^{M+1})$ for a single
observation sequence of length $T$. We present a forward-backward
type recursion which implements the exact E-step in
Appendix~\ref{appb}. The naive exact algorithm, consisting of
translating the factorial HMM into an equivalent HMM with $K^M$ states
and using the F-B algorithm, has time complexity $O(T K^{2M})$. Like
other models with multiple densely-connected hidden variables, this
exponential time complexity makes exact learning and inference
intractable.
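The equivalence to a $K^M$-state HMM can be made explicit. For the
uncoupled transition structure of equation~\er{fact2}, the $K^M \times
K^M$ transition matrix of the equivalent HMM is the Kronecker product of
the $M$ per-chain transition matrices. A brief sketch, with arbitrary
toy parameters of our own choosing:

```python
import numpy as np
from functools import reduce

# Schematic illustration: with uncoupled per-chain transitions,
# P(S_t | S_{t-1}) = prod_m P(S_t^(m) | S_{t-1}^(m)), so the flattened
# K^M x K^M transition matrix is the Kronecker product of the M matrices.
M, K = 3, 2
rng = np.random.default_rng(2)
P = [rng.dirichlet(np.ones(K), size=K) for _ in range(M)]  # rows sum to 1

P_equiv = reduce(np.kron, P)                  # shape (K^M, K^M)
assert P_equiv.shape == (K**M, K**M)
assert np.allclose(P_equiv.sum(axis=1), 1.0)  # still row-stochastic
```

Running the forward--backward algorithm on this flattened chain is what
incurs the $O(T K^{2M})$ cost quoted above.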

Thus, although the Markov property can be used to obtain
forward-backward--like factorizations of the expectations across time
steps, the sum over all possible configurations of the other hidden
state variables {\em within} each time step is unavoidable. This
intractability is due inherently to the cooperative nature of the
model---the setting of one state variable only determines the mean of
the observation if all the other state variables are fixed.
\subsection{Gibbs sampling}
Rather than computing the exact posterior probabilities, one can
approximate them using a Markov chain Monte Carlo sampling procedure,
and thereby avoid the sum over exponentially many state patterns at
some cost in accuracy. Although there are many possible sampling
schemes~(for a review see Neal, 1993), here we present one of the
simplest ones---Gibbs sampling~\cite{GemGem84}. \nocite{Nea93}

For a given observation sequence $\{ Y_t \}$, this procedure starts
with a random setting of the hidden states $\{ S_t \}$. At each time
step, each state vector is updated stochastically according to its
probability distribution conditioned on the setting of all the other
state vectors. The graphical model is again useful here, as each node
is conditionally independent of all other nodes given its Markov
blanket. We therefore obtain the following sampling scheme in which we
only need to examine the states of a few neighboring nodes:
\be
\Smt \sim P(\Smt | \{ S^{(n)}_t : n\neq m \}, S\m_{t-1}, S\m_{t+1},
Y_t, \phi).
\ee
These conditional distributions are straightforward to compute---they
involve Gaussian distance calculations. Sampling once from each of the
$TM$ hidden variables in the model results in a new sample of the
overall hidden state of the model and requires $O(T M K )$
operations. The sequence of overall states resulting from each pass of
Gibbs sampling defines a Markov chain over the state space of the
model. Assuming that all probabilities are bounded away from zero,
this Markov chain is guaranteed to converge to the posterior
probabilities of the states given the observations~\cite{GemGem84}.
Thus, after a suitable burn-in period, samples from the
Markov chain can be taken as approximate samples from the posterior
probabilities. The first- and second-order statistics needed to
estimate $\Smte$, $\SmSnte$ and $\Smtte$ are collected using the
states visited and the probabilities estimated during this sampling
process, and are used in the approximate E-step of EM.\footnote{A more
Bayesian treatment of the learning problem, in which the parameters
are also considered hidden random variables, can be handled by Gibbs
sampling by replacing the ``M step'' with sampling from the conditional
distribution of the parameters given the other hidden variables~(see
for example, Tanner and Wong, 1987). \nocite{TanWon87} }
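One sweep of this sampling scheme might be sketched as follows. This is
an illustrative implementation with our own conventions, not the code
used in our experiments; for simplicity the reconstruction mean is
recomputed for each candidate state, whereas the $O(T M K)$ count above
assumes it is updated incrementally:

```python
import numpy as np

# Schematic single Gibbs sweep over the hidden states {S_t^(m)}.
# Conventions (ours): S is a T x M integer array; W[m] is D x K;
# P[m, i, j] = P(S_t = j | S_{t-1} = i); pi[m] is the initial
# distribution of chain m; C_inv is the inverse output covariance.
def gibbs_sweep(S, Y, W, P, pi, C_inv, rng):
    T, M = S.shape
    K = P.shape[-1]
    for t in range(T):
        for m in range(M):
            logp = np.zeros(K)
            for k in range(K):
                # Markov blanket of S_t^(m): its neighbors in chain m ...
                logp[k] += np.log(pi[m, k] if t == 0 else P[m, S[t - 1, m], k])
                if t < T - 1:
                    logp[k] += np.log(P[m, k, S[t + 1, m]])
                # ... and Y_t, whose mean depends on all chains at time t
                mean = sum(W[n][:, k if n == m else S[t, n]] for n in range(M))
                diff = Y[t] - mean
                logp[k] -= 0.5 * diff @ C_inv @ diff
            p = np.exp(logp - logp.max())
            S[t, m] = rng.choice(K, p=p / p.sum())
    return S
```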
\subsection{Completely factorized variational approximation}
In this section, we present a second approximation of the posterior
probability of the hidden states which is both tractable and
deterministic. The basic idea is to approximate the posterior
distribution over the hidden variables $P(\{S_t\}|\{Y_t\})$ by a
tractable distribution $Q(\{S_t\}|\{Y_t\})$. This approximation
provides a lower bound on the log likelihood which can be used to
obtain an efficient learning algorithm.

The argument can be formalized as follows~(Saul et
al.,~1996). \nocite{SauJaaJor96} Any distribution over the hidden
variables $Q(\{S_t\}|\{Y_t\})$ can be used to define a lower bound on
the log likelihood:
\bea
\log P(\{ Y_t \}) &=& \log \sum_{\{S_t\}} P(\{S_t, Y_t\})\\
&=& \log \sum_{\{S_t\}} Q(\{S_t\}|\{Y_t\}) \left[ \frac{P(\{S_t,
Y_t\})}{Q(\{S_t\}|\{Y_t\})} \right]\\
&\geq& \sum_{\{S_t\}} Q(\{S_t\}|\{Y_t\}) \; \log \left[ \frac{P(\{S_t,
Y_t\})}{Q(\{S_t\}|\{Y_t\})} \right],
\eea
where we have made use of Jensen's inequality in the
last step. The difference between the left hand side and the right
hand side of this inequality is given by the Kullback-Leibler (KL)
divergence~\cite{CovTho91}:
\be
\KL(Q\|P)=\sum_{\{S_t\}} Q(\{S_t\}|\{Y_t\}) \; \log \left[
\frac{Q(\{S_t \}|\{Y_t\})}{P(\{S_t\}|\{Y_t\})} \right].
\el{kl}
\ee

Henceforth, we drop $\{Y_t\}$ from the conditioning side of $Q$ for
notational convenience.
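The relation between the bound and the KL divergence can be checked
numerically on a minimal example: for any $Q$, the gap between the log
likelihood and the Jensen lower bound equals $\KL(Q\|P)$ exactly. The
toy numbers below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Toy check: the gap between log P(Y) and the Jensen lower bound is
# exactly KL(Q || P(S|Y)).  Numbers are arbitrary.
P_SY = np.array([0.3, 0.1])            # joint P(S, Y = y) for S in {0, 1}
P_Y = P_SY.sum()                       # likelihood P(Y = y)
post = P_SY / P_Y                      # posterior P(S | Y = y)

Q = np.array([0.9, 0.1])               # an arbitrary approximating Q
bound = np.sum(Q * np.log(P_SY / Q))   # Jensen lower bound on log P(Y)
kl = np.sum(Q * np.log(Q / post))      # KL(Q || P(S|Y))

assert bound <= np.log(P_Y) + 1e-12
assert np.isclose(np.log(P_Y) - bound, kl)
```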

The complexity of exact inference in the approximation given by $Q$ is
determined by its conditional independence relations, not by its
parameters. Thus, we can choose $Q$ to have a tractable
structure---a graphical representation which eliminates some of
the dependencies in $P$. Given this structure, we are free to vary the
parameters of $Q$ so as to obtain the tightest possible bound by
minimizing~\er{kl}.

We will refer to the general strategy of using a parameterized
approximating distribution as a {\em variational approximation\/}
and refer to the free parameters of the distribution as {\em variational
parameters}. To illustrate, consider the simplest variational
approximation, in which the state variables are assumed independent
given the observations~(Figure~\ref{fig2}a). This distribution can be
written as
\be
Q(\{S_t\}|\theta) = \prod_{t=1}^T \prod_{m=1}^M Q(\Smt|\theta\m_t).
\el{mf1}
\ee
The variational parameters, $\theta=\{\theta\m_t\}$, are the
means of the state variables, where, as before, a state variable
$\Smt$ is represented as a $K$-dimensional vector with a 1 in the
$k^{\mbox{\scriptsize th}}$ position and zeros elsewhere, if the
$m^{\mbox{\scriptsize th}}$ Markov chain is in state $k$ at time $t$.
The elements of the vector $\theta\m_t$ therefore define the state occupation
probabilities for multinomial variable $\Smt$ under the distribution $Q$:
\be
Q(S\m_{t} |\theta\m_t) = \prod_{k=1}^K \left( \theta\m_{t,k}
\right)^{S\m_{t,k}} \hspace*{0.2in} \mbox{where} \hspace*{0.15in}
S\m_{t,k} \in\{0,1\}; \hspace*{0.15in} \sum_{k=1}^K S\m_{t,k}=1.
\el{mf2}
\ee
This completely factorized approximation is often used in
statistical physics, where it provides the basis for simple
yet powerful {\em mean field approximations\/} to statistical
mechanical systems~\cite{Par88}.
\begin{figure}
\centerline{
\psfig{figure=mhmm.eps,width=2.55in} \hspace{0.2in}
\psfig{figure=shmm.eps,width=2.9in}
}
\begin{picture}(432,0)
\text{10}{130}{a)}
\text{226}{130}{b)}
\end{picture}
\caption{a) The completely factorized variational approximation assumes
that all the state variables are independent (conditional on the observation
sequence). b) The structured variational approximation assumes that the state
variables retain their Markov structure within each chain, but are independent
across chains.}
\label{fig2}
\end{figure}

To make the bound as tight as possible we vary $\theta$ separately for
each observation sequence so as to minimize the KL divergence. Taking
the derivatives of~\er{kl} with respect to $\theta\m_t$ and setting to
zero, we obtain the set of fixed point equations defined by (see
Appendix~\ref{appc}):
\begin{splitmath}
\theta^{(m) \; {\rm \scriptstyle new}}_t =& \varphi \left\{ W^{(m)'} C^{-1}
\left(Y_t - \hat{Y}_t\right) + W^{(m)'} C^{-1} W^{(m)} \theta\m_t -
\frac{1}{2} \Delta\m + \right. \\
\hspace{0.3in} \left. + (\log P\m) \; \theta\m_{t-1} + (\log
P\m)' \; \theta\m_{t+1} \rule{0mm}{3ex} \right\} \mathletter{a}
\el{mf}
\end{splitmath}
where $\hat{Y}_t$ is defined as
\be
\hat{Y}_t \equiv \sum_{\ell=1}^M W^{(\ell)} \theta^{(\ell)}_t,
\el{mfy}
\mathletter{b}
\ee
$\Delta\m$ is the vector of diagonal elements of $ W^{(m)'} C^{-1}
W^{(m)}$, and $\varphi \{ \cdot \}$ is the softmax operator which
maps a vector $A$ into a vector $B$ of the same size, with elements
\be
B_i = \frac{\exp \{ A_i \}}{\displaystyle \sum_j \exp \{ A_j \}}.
\ee

The first term of~\er{mf} is the projection of the error in reconstructing the
observation onto the weights of state vector $m$---the more a
particular setting of a state vector can reduce this error, the larger
its associated variational parameter. The next two terms arise from
the fact that the second order correlation $\langle S\m_t S\m_t
\rangle$ evaluated under the variational distribution is equal to the
mean $\theta\m_t$. The last two terms introduce dependencies forward
and backward in time. We therefore see that although the posterior
distribution over the hidden variables is approximated with a
completely factorized distribution, the fixed point equations couple
the parameter associated with each node with the parameters of its
Markov blanket. In this sense, the fixed point equations propagate
information along the same pathways as those defining the exact
algorithms for probability propagation.

The following may provide an intuitive interpretation of the
approximation being made by this distribution. Given a particular
observation sequence, the hidden state variables for the $M$ Markov
chains at time step $t$ are stochastically coupled. This stochastic
coupling is approximated by a system in which the hidden variables are
uncorrelated but have coupled means. The variational or ``mean-field''
equations solve for the deterministic coupling of the means that best
approximates the stochastically coupled system.

Each hidden state vector is updated in turn using~\er{mf}, with a time
complexity of $O(T K M)$ per iteration. Convergence is diagnosed by
monitoring the KL divergence of the variational distribution between
successive iterations; in practice convergence is very rapid (about 2
to 10 iterations of \er{mf}). Once the fixed point equations have
converged, the expectations required for the E-step can be obtained as
a simple function of the parameters (equations~\er{emf1}--\er{emf3} in
Appendix~\ref{appc}).
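One pass of these fixed point updates can be sketched as follows. The
code is schematic, with conventions of our own: $\theta$ is stored as a
$T \times M \times K$ array, transition matrices are indexed as
$P_{k,j} = P(S_t = k | S_{t-1} = j)$, and the $t=1$ update uses the log
prior in place of the backward transition term, a boundary convention we
assume here:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Schematic pass of the completely factorized fixed point updates.
# theta: (T, M, K) variational means; W[m]: D x K; P[m][k, j] is the
# probability of moving from state j to state k in chain m; pi[m] is
# the prior of chain m; C_inv is the inverse output covariance.
def mean_field_sweep(theta, Y, W, P, pi, C_inv):
    T, M, K = theta.shape
    logP = [np.log(P[m]) for m in range(M)]
    Delta = [np.diag(W[m].T @ C_inv @ W[m]) for m in range(M)]
    for t in range(T):
        for m in range(M):
            Y_hat = sum(W[l] @ theta[t, l] for l in range(M))  # reconstruction
            a = (W[m].T @ C_inv @ (Y[t] - Y_hat)
                 + W[m].T @ C_inv @ W[m] @ theta[t, m]
                 - 0.5 * Delta[m])
            a += logP[m] @ theta[t - 1, m] if t > 0 else np.log(pi[m])
            if t < T - 1:
                a += logP[m].T @ theta[t + 1, m]
            theta[t, m] = softmax(a)
    return theta
```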
\subsection{Structured variational approximation}
The approximation presented in the previous section factorizes the
posterior probability such that all the state variables are
statistically independent. In contrast to this rather extreme
factorization, in this section we present an approximation which is
both tractable and preserves much of the probabilistic structure of
the original system. The factorial HMM is approximated by $M$
uncoupled HMMs~(Figure~\ref{fig2}b). Within each HMM, efficient and
exact inference is implemented via the forward--backward algorithm.
The approach of exploiting tractable substructures was first suggested
in the machine learning literature by Saul and
Jordan~(1996). \nocite{SauJor96}

Note that the arguments presented in the previous section did not
hinge on the form of the approximating distribution. Therefore,
more structured variational approximations can be obtained by
using more structured variational distributions $Q$. Each such
$Q$ provides a lower bound on the log likelihood and can be used
to obtain a learning algorithm.

We write the structured variational approximation as
\be
Q(\{S_t\}|\theta) = \prod_{m=1}^M Q(S\m_1|\theta)
\prod_{t=2}^T Q(S\m_t|S\m_{t-1},\theta), \mathletter{a}
\el{smf1}
\ee
where
\begin{eqnarray}
Q(S\m_1|\theta)&=&\prod_{k=1}^K \left( h\m_{1,k} \pi\m_{k}
\right)^{S\m_{1,k}} \mathletter{b} \\
Q(S\m_t|S\m_{t-1},\theta)&=&P(\Smt|S\m_{t-1},\phi) \; h\m_t \nonumber \\
& =& \prod_{k=1}^K \left( h\m_{t,k} \sum_{j=1}^K P\m_{k,j} S\m_{t-1,j}
\right)^{S\m_{t,k}}. \mathletter{c}
\el{smf1c}
\end{eqnarray}
The parameters of this distribution are $\theta=\{\pi\m, P\m,
h\m_t\}$---the original priors and state transition matrices of the
factorial HMM and a time-varying bias for each state variable. By
comparing equations~\er{smf1}--\er{smf1c} to equation~\er{hmm1}, we
see that the $K \times 1$ vector $h\m_t$ plays the role of the
probability of an observation ($P(Y_t|S_t)$ in \er{hmm1}) for each
of the $K$ settings of $\Smt$.
Intuitively, this approximation uncouples the $M$ Markov chains and
attaches to each state variable a distinct fictitious observation. The
probability of this fictitious observation can be varied so as to
minimize the KL divergence between $Q$ and $P$.
Applying the same arguments as before, we obtain a set of fixed point
equations for $h\m_t$ that minimize $KL(Q \| P)$---see
Appendix~\ref{appd}:
\begin{wideequation}
\be
h_t^{(m) \; {\rm \scriptstyle new}} \propto \exp \left\{ W^{(m)'} C^{-1}
\left(Y_t - \hat{Y}_t\right) + W^{(m)'} C^{-1} W^{(m)} \Smte -
\frac{1}{2} \Delta\m \right\}, \mathletter{a}
\el{smf2}
\ee
\end{wideequation}
where
\be
\hat{Y}_t \equiv \sum_{\ell=1}^M W^{(\ell)} \langle S^{(\ell)}_t \rangle,
\el{smf3} \mathletter{b}
\ee
and $\Delta\m$ is defined as before. The parameter $h\m_t$ obtained
from these fixed point equations is the observation probability associated
with state variable $\Smt$ in hidden Markov model $m$. Using these
probabilities, the forward--backward (F--B) algorithm is used to compute a new set of
expectations for $\Smte$, which are fed back into~\er{smf2} and
\er{smf3}. The F--B algorithm is therefore used as a subroutine in the
minimization of the KL divergence.
Note the similarity between equations~\er{smf2}--\er{smf3} and
equations~\er{mf}--\er{mfy} for the completely factorized system. In
the completely factorized system, since $\Smte = \thetamt$, the fixed
point equations can be written explicitly in terms of the variational
parameters. In the structured approximation, the dependence of $\Smte$
on $h\m_t$ is computed via the F--B algorithm. Note also
that~\er{smf2} does not contain terms involving the prior, $\pi\m$, or
transition matrix, $P\m$. These terms have cancelled out by our choice
of approximation.
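The resulting E-step alternates between the fixed-point update for
$h\m_t$ and a forward--backward pass per chain. The following is a
minimal numpy sketch of this iteration under our own naming
conventions (not the authors' implementation), assuming
column-stochastic transition matrices and taking $\Delta\m$ to be the
vector of diagonal elements of $W^{(m)'} C^{-1} W\m$, as defined
earlier in the paper:

```python
import numpy as np

def forward_backward(pi_m, P_m, h):
    """Exact smoothing for one chain, treating h[t] as observation probabilities."""
    T, K = h.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi_m * h[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = h[t] * (P_m @ alpha[t - 1])   # columns of P_m sum to one
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = P_m.T @ (h[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def structured_e_step(Y, W, C, pi, P, n_iter=20):
    """Fixed-point iteration for the structured approximation (eqs. smf2-smf3).

    Y: (T, D) observations; W: list of M (D, K) output matrices;
    C: (D, D) covariance; pi, P: per-chain priors and transition matrices.
    Returns the list of (T, K) expectations <S_t^(m)> under Q.
    """
    T, D = Y.shape
    M, K = len(W), W[0].shape[1]
    Cinv = np.linalg.inv(C)
    Delta = [np.diag(Wm.T @ Cinv @ Wm) for Wm in W]     # (K,) per chain
    ES = [np.full((T, K), 1.0 / K) for _ in range(M)]
    for _ in range(n_iter):
        Yhat = sum(ES[n] @ W[n].T for n in range(M))    # eq. smf3
        for m in range(M):
            # eq. smf2: new h_t^(m), up to a per-time-step constant
            logh = ((Y - Yhat) @ Cinv @ W[m]
                    + ES[m] @ (W[m].T @ Cinv @ W[m])
                    - 0.5 * Delta[m])
            h = np.exp(logh - logh.max(axis=1, keepdims=True))
            ES[m] = forward_backward(pi[m], P[m], h)    # F--B as a subroutine
            Yhat = sum(ES[n] @ W[n].T for n in range(M))
    return ES
```

Each pass through the chains recomputes $\hat{Y}_t$ with the freshest
expectations, so the updates are applied sequentially rather than in
parallel; either schedule is consistent with the fixed-point equations.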
\subsection{Choice of approximation}
The theory of the EM algorithm as presented in Dempster et al.~(1977)
assumes that an exact E-step is used. For models in which the exact
E-step is intractable, the choice of approximation must be made
considering several theoretical and practical issues.
Markov chain Monte Carlo approximations, such as Gibbs sampling, offer
the theoretical assurance that the Markov chain will converge to the
correct posterior distribution in the limit. Although this means that
one can come arbitrarily close to the exact E-step, in practice
convergence can be slow (especially for multimodal distributions) and
it is often very difficult to diagnose how close one is to
convergence. However, it is important to keep in mind that when
sampling is used for the E-step of EM, there is a time tradeoff
between the number of samples used and the number of iterations of
EM. It seems wasteful to wait until convergence early on in learning,
when the posterior distribution being sampled from is far from the
posterior given the optimal parameters. In practice we have found that
even approximate E-steps using very few Gibbs samples (e.g. $\sim 10$
samples of each hidden variable) tend to increase the true likelihood.
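Concretely, such a sampling E-step visits each hidden variable in turn
and resamples it from its conditional distribution given its Markov
blanket: the chain's own transitions into and out of time $t$, and the
Gaussian output at time $t$ with the other chains held fixed. The
sketch below is our own minimal rendering for the Gaussian factorial
HMM (variable names and the column-stochastic convention for the
transition matrices are our assumptions):

```python
import numpy as np

def gibbs_sweep(S, Y, W, C, pi, P, rng):
    """One Gibbs sweep over all hidden variables of a factorial HMM.

    S: list of M length-T integer state sequences (updated in place);
    Y: (T, D) observations; W: list of M (D, K) output matrices;
    C: (D, D) covariance; pi, P: per-chain priors and (column-stochastic)
    transition matrices; rng: a numpy Generator.
    """
    T, D = Y.shape
    M, K = len(W), W[0].shape[1]
    Cinv = np.linalg.inv(C)
    for t in range(T):
        for m in range(M):
            # Markov blanket of S_t^(m): its own transitions and the output at t
            logp = np.log(pi[m]) if t == 0 else np.log(P[m][:, S[m][t - 1]])
            if t < T - 1:
                logp = logp + np.log(P[m][S[m][t + 1], :])
            other = sum(W[n][:, S[n][t]] for n in range(M) if n != m)
            for k in range(K):
                resid = Y[t] - other - W[m][:, k]
                logp[k] += -0.5 * resid @ Cinv @ resid
            p = np.exp(logp - logp.max())
            S[m][t] = rng.choice(K, p=p / p.sum())
    return S
```

An "impatient" E-step of the kind described above would run only a
handful of such sweeps and estimate the required expectations from the
resulting samples.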
Variational approximations offer the theoretical assurance that a
lower bound on the likelihood is being maximized. Both the
minimization of the KL divergence in the E-step and the parameter
update in the M-step are guaranteed not to decrease this lower bound,
and therefore convergence can be defined in terms of this cost
function although not in terms of the true likelihood. An alternative
view, given by Neal and Hinton (1993), describes EM in terms of the
free energy $F$, which is a function both of the parameters, $\phi$,
and a posterior probability distribution over the hidden variables,
$Q(S)$:
\[
F(Q,\phi) = E_Q \left\{ \log P ( S, Y | \phi) \right\} - E_Q \left\{ \log
Q(S) \right\},
\]
where $E_Q$ denotes expectation over $S$ using the distribution
$Q(S)$. The exact E-step in EM maximizes $F$ with respect to $Q$ given
$\phi$. The variational E-steps used here maximize $F$ with respect to
$Q$ given $\phi$, subject to the constraint that $Q$ is of a particular
tractable form. \nocite{NeaHin93} Given this view, it seems
clear that the structured approximation is preferable to the
completely factorized approximation since it places fewer constraints
on $Q$, at no cost in tractability.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Simulations}
\label{sec4}
To investigate learning and inference in factorial HMMs we conducted
two experiments. The first experiment compared the different
approximate and exact methods of inference on the basis of computation
time and performance, as measured by the likelihood of the model
obtained. The second experiment sought to determine whether
the decomposition of the state space in factorial HMMs presented any
advantage in modeling a real time-series data set which we might
assume to have complex internal structure---Bach's chorale melodies.
\subsection{Experiment 1: Performance and timing benchmark}
Using data generated from a factorial HMM structure with $M$
underlying Markov models with $K$ states each, we compared the time
per cycle of EM and the training and test set likelihoods of the
following models:
\bi
\item HMM trained using Baum-Welch.
\item Factorial HMM trained using exact inference for the E-step. The
exact algorithm used was a straightforward application of the
forward-backward algorithm, rather than the more efficient algorithm
outlined in Appendix~\ref{appb}.
\item Factorial HMM trained using Gibbs sampling for the E-step. The
number of samples was fixed at 10 samples per variable, a
priori. All samples were used for learning (i.e. no samples were
discarded at the beginning of the run). While this is too few to
even approach convergence, it provides a run-time roughly comparable
to the variational methods. The goal was to see whether this ``impatient''
Gibbs sampler would be able to compete with the other approximate
methods.
\item Factorial HMM trained using the completely factorized approximation.
\item Factorial HMM trained using the structured approximation.
\end{itemize}
All factorial HMMs had a structure consisting of $M$ underlying Markov
models with $K$ states each; the HMM had $K^M$ states. All of the
parameters of the generating model, except for the output covariance
matrix, were sampled
from a uniform $[0, 1]$ distribution, and appropriately normalized to
satisfy the sum-to-one constraints of the transition matrices and
priors. The covariance matrix was set to a multiple of the identity
matrix $C=0.0025 I$.
The training set consisted of 10 sequences of length 20 where the
observable was a 4-dimensional vector; the test set was 20 such
sequences. Algorithms were run for a maximum of 100 cycles of EM or
until convergence, defined as the cycle $k$ at which the log
likelihood $L(k)$ satisfied $L(k)-L(k-1) < 0.0001 (L(k-1)-L(2))$. At
the end of learning, the log likelihoods on the training and test set
were computed for all models using the exact algorithm. The test set
log likelihood of the parameters $\phi$ for $N$ observation sequences
is defined as $ \sum_{n=1}^N \log P(Y^{(n)}_1, \ldots,
Y^{(n)}_t|\phi), $ where $\phi$ is obtained by maximizing the log
likelihood over a training set which is disjoint from the test
set. This provides a measure of how well the model generalizes to a
novel observation sequence from the same distribution as the training
data.
\newcommand{\mc}[1]{\multicolumn{2}{c}{#1}}
\begin{table}[h]
\caption{Comparison of factorial HMM on four problems of
varying size. The Train and Test columns show the log likelihood $\pm$
one standard deviation on the two data sets. CF refers to the
completely factorized variational approximation. ($*$) The likelihood
of the test data for the largest size HMMs was zero to within
numerical resolution of the simulation.}
\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}cccrlrl} \hline
$M$ & $K$ & Alg & \multicolumn{2}{c}{Train} &
\multicolumn{2}{c}{Test} \\ \hline
3 & 2 & HMM & 649 & $\pm$ 8 & 358 & $\pm$ 81\\
& & Exact & 877 & $\pm$ 0 & 768 & $\pm$ 0 \\
& & Gibbs & 710 & $\pm$ 152 & 627 & $\pm$ 129 \\
& & CF & 755 & $\pm$ 168 & 670 & $\pm$ 137 \\
& & Struct & 755 & $\pm$ 168 & 661 & $\pm$ 152 \\ \hline
3 & 3 & HMM & 670 & $\pm$ 26 & $-782$ & $\pm$ 128 \\
& & Exact & 568 & $\pm$ 164 & 276 & $\pm$ 62 \\
& & Gibbs & 564 & $\pm$ 160 & 305 & $\pm$ 51 \\
& & CF & 495 & $\pm$ 83 & 326 & $\pm$ 62 \\
& & Struct & 606 & $\pm$ 154 & 317 & $\pm$ 47 \\ \hline
5 & 2 & HMM & 588 & $\pm$ 37 & $-2634$ & $\pm$ 566 \\
& & Exact & 223 & $\pm$ 76 & 159 & $\pm$ 80 \\
& & Gibbs & 123 & $\pm$ 103 & 73 & $\pm$ 95 \\
& & CF & 292 & $\pm$ 101 & 237 & $\pm$ 103 \\
& & Struct & 242 & $\pm$ 247 & 184 & $\pm$ 250 \\ \hline
5 & 3 & HMM & 1610 & $\pm$ 102 & \mc{$-\infty^*$} \\
& & Exact & $-283$ & $\pm$ 130 & $-331$ & $\pm$ 117 \\
& & Gibbs & $-206$ & $\pm$ 69 & $-299$ & $\pm$ 79 \\
& & CF & $-290$ & $\pm$ 5 & $-368$ & $\pm$ 4\\
& & Struct & $-275$ & $\pm$ 22 & $-368$ & $\pm$ 8 \\ \hline
\end{tabular*}
\label{tab}
\end{table}
The results indicate that even for moderately small state spaces
$(M=3 \;\mbox{and}\; K=3)$, the standard HMM with $K^M$ states
suffered from severe overfitting~(Table~\ref{tab}). Furthermore, both
the standard HMM and the exact E-step factorial HMM are extremely slow
for the larger models~(Figure~\ref{fig3}). The Gibbs sampling,
completely factorized, and structured approximations offered roughly
comparable performance at a great increase in speed.
\begin{figure}
\centerline{\psfig{figure=times2.ps,width=3.5in}}
\caption{Time per cycle of EM on a Silicon Graphics R4400 processor
running Matlab.}
\label{fig3}
\end{figure}
\subsection{Experiment 2: Bach chorales}
Musical pieces naturally exhibit complex structure at many different
time scales. Furthermore, one can imagine that to represent the
``state'' of the musical piece at any given time it would be necessary
to specify a conjunction of many different features. For these
reasons, we chose to test whether a factorial HMM would provide an
advantage over a regular HMM in learning the structure of a collection
of musical pieces.
The data set consisted of discrete event sequences encoding the melody
lines of J.~S.~Bach's Chorales~(obtained from the UCI Repository for
Machine Learning~\cite{MurAha92}; originally discussed in Conklin and
Witten, 1995). \nocite{MurAha92,ConWit95} Each event in the sequence
consisted of the setting of six attributes, described in
Table~\ref{tab2}. Sixty-six chorales, with 40 or more events each,
were divided into a training set~(30 chorales) and a test set~(36
chorales). Using the first set, hidden Markov models with state space
ranging from 2 to 100 states were trained until convergence ($30
\pm 12$ steps of EM). Factorial HMMs of varying sizes ($K$ ranging
from 2 to 6; $M$ ranging from 2 to 9) were also trained on the
same data. To approximate the E-step for factorial HMMs we used the
structured approximation. This choice was motivated by three
considerations. First, for the size of state space we wished to
explore the exact algorithms were prohibitively slow. Second, the
Gibbs sampling algorithm did not appear to present any advantages in
speed or performance and required some heuristic method for
determining the number of samples. Third, theoretical arguments
suggest that the structured approximation should in general be
superior to the completely factorized variational approximation since
more of the dependencies of the original model are preserved.
\begin{table}
\centerline{
\begin{tabular}{cll} \hline
Attribute & Description & Representation \\ \hline
pitch & pitch of the event & int $[0, 127]$\\
keysig & key signature & int
$[-7,7]$ \\
timesig & time signature & (1/16 notes)\\
fermata & event under fermata? & binary \\
st & start time of event & int (1/16 notes)\\
dur & duration of event & int (1/16 notes)\\ \hline
\end{tabular}}
\caption{Attributes in the Bach chorale data set. The key signature
and time signature attributes were constant over the duration of the
chorale.}
\label{tab2}
\end{table}
The test set log likelihoods for the HMMs, shown in
Figure~\ref{fig4}a, exhibited the typical U-shaped curve demonstrating
a trade-off between bias and variance~\cite{GemBieDou92}. HMMs with
fewer than 10 states did not capture much of the structure of the
chorales, while HMMs with more than 40 states overfit the training
data and therefore provided a poor model of the test data. Out of the
75 runs, the highest test set log likelihood was $-6.26$ nats (9.0
bits), obtained by an HMM with 30 hidden states.
\begin{figure}
\centerline{
\psfig{figure=chhmm.ps,width=2in}
\psfig{figure=ch2hmm.ps,width=2in}
\psfig{figure=chfhmm.ps,width=2in}}
\begin{picture}(432,0)
\text{5}{160}{a)}
\text{149}{160}{b)}
\text{293}{160}{c)}
\end{picture}
\caption{Test set log likelihood per event of the Bach chorale
data set as a function of number of states for a) HMMs, b) factorial
HMMs with $K=2$ and c) factorial HMMs with $K=3$ ($\times$'s; dotted
line), $K=4$ ($\circ$'s; heavy dashed line) and $K=5$ ($+$'s; solid
line). Each circle represents a single run; the lines indicate the
mean performances. The thin dashed line in b and c indicates the log
likelihood of the best run in a. The factorial HMMs were trained using
the structured approximation. For both methods the true likelihood
was computed using the same algorithm.}
\label{fig4}
\end{figure}
Almost half (31/63) of the runs of the factorial HMM surpassed the
best test set log likelihood of the HMM~(Figure~\ref{fig4}b
and~c). Furthermore, the best factorial HMMs had very large state
spaces, approaching 1000 states, which would result in severe
overfitting and prohibitively slow training for unconstrained HMMs. It
is possible to constrain HMMs to have many states with few parameters
by allowing only a limited number of transitions. However, such HMMs
do not have the componential state representation of factorial
HMMs. This componentiality, which allows for several dynamic processes
to be represented at the same time, is most likely to account for the
superior performance of factorial HMMs in modeling Bach's chorales.
\section{Generalizations}
\label{sec5}
\subsection{Discrete observables}
The probabilistic model presented in this paper assumed real-valued
Gaussian observations. One of the advantages arising from this
assumption is that the conditional density of a $D$-dimensional
observation $P(Y_t|S^{(1)}_t, \ldots, S^{(M)}_t)$ can be compactly
specified through $M$ $D \times K$ mean matrices, and one $D \times D$
covariance matrix. Furthermore, the M-step for such a model reduces to
a set of regression equations.
The model can be generalized to handle discrete observations in
several ways. Consider a single $D$-valued discrete observation
$Y_t$. In analogy to traditional HMMs, the output probabilities could
be modeled using a matrix. However, in the case of a factorial HMM,
this matrix would have $D \times K^M$ entries for each combination of
the state variables and observation. Thus the compactness of the
representation would be entirely lost. Standard methods from
graphical models suggest approximating such large matrices with
``noisy-OR''~\cite{Pea88} or~``sigmoid'' \cite{Nea92} models of
interaction. For example, in the softmax model, which generalizes the
sigmoid model to $D > 2$,
\be
P(Y_t|S^{(1)}_t, \ldots,S^{(M)}_t) \propto \exp \left\{ \sum_{m=1}^M
W\m S\m_t \right\}.
\ee
Like the Gaussian model, this specification is again compact, using $M$
$D\times K$ matrices.~(Note that, as in the Gaussian model, the
means of the $W\m$ are overparametrized; see Appendix~\ref{appa}.)
While the nonlinearity induced by the softmax function makes
both the E-step and M-step of the algorithm more difficult, iterative
numerical methods can be used in the M-step and Gibbs sampling and
variational methods can again be used in the E-step~(see Neal,
1992; Hinton et al., 1995; and Saul et al., 1996, for discussions of
different approaches to learning in sigmoid networks).
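As an illustration, the softmax output model above can be written in a
few lines. This is a sketch under our own parametrization
(unnormalized logits summed across chains), not a specification taken
from the paper:

```python
import numpy as np

def discrete_output_probs(W, states):
    """Softmax output model P(Y_t | S_t^(1..M)) for a D-valued observation.

    W: list of M (D, K) weight matrices; states: list of M state indices.
    Returns a length-D probability vector: the logits of the chosen
    columns are summed across chains, then normalized by the softmax.
    """
    logits = sum(W[m][:, states[m]] for m in range(len(W)))
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()
```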
\subsection{Introducing couplings}
The architecture for factorial HMMs presented in section~\ref{sec2}
assumes that the underlying Markov chains interact only through the
observations. This constraint can be relaxed by introducing couplings
between the hidden state variables. For example, if $S^{(m)}_t$
depends on $S^{(m)}_{t-1}$ and $S^{(m-1)}_{t-1}$, equation~\er{fact2}
is replaced by the following factorization
\be
P(S_t|S_{t-1}) = P(S^{(1)}_t|S^{(1)}_{t-1}) \prod_{m=1}^M
P(\Smt|S\m_{t-1},S^{(m-1)}_{t-1}). \el{coup}
\ee
Similar exact, variational, and Gibbs sampling procedures can be
defined for this architecture. Note, however, that these couplings must
be introduced with caution as they may result in an exponential growth
in parameters. For example, the above factorization requires
transition matrices of size $K^2 \times K$. Rather than specifying
these higher-order couplings through probability transition matrices,
one can introduce second-order interaction terms in the energy (log
probability) function. Such terms effectively couple the chains
without the number of parameters incurred by a full probability
transition matrix.\footnote{This is analogous to the fully-connected
Boltzmann machine with $N$ units~\cite{AckHinSej85}, in which every
binary unit is coupled to every other unit using $O(N^2)$ parameters,
rather than the $O(2^N)$ parameters required to specify the complete
probability table.} In the graphical model formalism these correspond
to symmetric undirected links, making the model a chain graph. While
the JLO algorithm can still be used to propagate information exactly
in chain graphs, such undirected links cause the normalization
constant of the probability distribution---the {\em partition
function}---to depend on the coupling parameters. As in Boltzmann
machines~\cite{AckHinSej85}, both a clamped and an unclamped phase are
therefore required for learning, where the goal of the unclamped phase
is to compute the derivative of the partition function with respect to
the parameters~\cite{Nea92}.
\subsection{Conditioning on inputs}
Like the hidden Markov model, the factorial HMM provides a model of
the unconditional density of the observation sequences. In certain
problem domains, some of the observations can be better thought of as
inputs or predictor variables, and the others as outputs or response
variables. The goal, in these cases, is to model the conditional
density of the output sequence given the input sequence. In machine
learning terminology, unconditional density estimation is unsupervised
while conditional density estimation is supervised.
Several algorithms for learning in hidden Markov models which are
conditioned on inputs have recently been presented in the
literature~\cite{CacNow94,BenFra95,MeiJor96}. Given a
sequence of input vectors $\{X_t\}$, the probabilistic model for an
input-conditioned factorial HMM is
\begin{eqnarray}
P(\{S_t,Y_t\}|\{X_t\}) &=& \prod_{m=1}^M P(S\m_1|X_1) P(Y_1|S\m_1,X_1)
\nonumber \\
& & \cdot \prod_{t=2}^T \prod_{m=1}^M P(S\m_t|S\m_{t-1},X_t) P(Y_t|S\m_t,X_t).
\el{iofhmm}
\end{eqnarray}
The model depends on the specification of $P(Y_t|S\m_t,X_t)$ and
$P(S\m_t|S\m_{t-1},X_t)$, which are conditioned both on a discrete
state variable and on a (possibly continuous) input vector. The
approach used in Bengio and Frasconi's Input Output HMMs~(IOHMMs)
suggests modeling $P(S\m_t|S\m_{t-1},X_t)$ as $K$ separate neural
networks, one for each setting of $S\m_{t-1}$. This decomposition
ensures that a valid probability transition matrix is defined at each
point in input space if a sum-to-one constraint (e.g. a softmax
nonlinearity) is used at the output of these networks.
Using the decomposition of each conditional probability into $K$
networks, inference in input-conditioned factorial HMMs is a
straightforward generalization of the algorithms we have presented for
factorial HMMs. The exact forward--backward algorithm in
Appendix~\ref{appb} can be adapted by using the appropriate
conditional probabilities. Similarly, the Gibbs sampling procedure is
no more complex when conditioned on inputs. Finally, the completely
factorized and structured approximations can also be generalized simply as
long as the approximating distribution has a dependence on the input
similar to the model's. If the probability transition structure
$P(S\m_t|S\m_{t-1},X_t)$ is not decomposed as above, but has a
complex dependence on the previous state variable and input, inference
may become considerably more complex.
Depending on the form of the input conditioning, the Maximization step
of learning may also change considerably. In general, if the output
and transition probabilities are modeled as neural networks the M-step
can no longer be solved exactly and a gradient-based generalized EM
algorithm must be used. For log-linear models, the M-step can be
solved using an inner loop of iteratively reweighted
least-squares~(IRLS; McCullagh and Nelder, 1989).\nocite{McCNel89}
\subsection{Hidden Markov decision trees}
An interesting generalization of factorial HMMs results if we
condition on an input $X_t$ and order the $M$ state variables such
that $S\m_t$ depends on $S^{(n)}_t$ for $1 \le n <
m$~(Figure~\ref{fig5}).
\begin{figure}
\centerline{\psfig{figure=hmdt.eps,width=4in}}
\caption{The hidden Markov decision tree.}
\label{fig5}
\end{figure}
The resulting architecture can be seen as a hidden Markov decision
tree (HMDT). Consider how this probabilistic model would generate data at
the first time step, $t=1$. Given input $X_1$, the top node
$S^{(1)}_1$ can take on $K$ values. This induces a stochastic
partitioning of $X$-space into $K$ decision regions. The next node down
the hierarchy, $S^{(2)}_1$, subdivides each of these regions into $K$
subregions, and so on. The output $Y_1$ is generated based on the
input $X_1$ and the $M$~\hspace{0.1in} $K$-way decisions at the hidden
nodes. At the next time step, a similar procedure is used to generate
data from the model, except that now, each decision in the decision
tree is dependent on the decision taken at that node in the previous
time step. Thus, the ``hierarchical mixture of experts''
architecture~\cite{JorJac94} is generalized such that it includes
Markovian dynamics for the decisions. Hidden Markov decision trees
provide a useful starting point for modeling time series with both
temporal and spatial structure at multiple resolutions. We explore
this generalization of factorial HMMs in Jordan, Ghahramani, and
Saul~(1997). \nocite{JorGhaSau97}
\section{Conclusion}
\label{sec6}
In this paper we have examined the problem of learning for a class of
generalized hidden Markov models with distributed state
representations. This generalization provides both a richer modeling
tool and a method for incorporating prior structural information about
the state variables underlying the dynamics of the system generating
the data. Although exact inference in this class of models is
generally intractable, we provide a structured variational
approximation which can be computed tractably. This approximation
forms the basis of the Expectation step in an EM algorithm for
learning the parameters of the model. Empirical comparisons to
several other approximations and to the exact algorithm show that this
approximation is both efficient to compute and accurate. Finally, we
have shown that the factorial HMM representation provides an advantage
over traditional HMMs in modeling the complex temporal patterns in
Bach's chorales.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\appendix{A}
\appendixtitle{The M-step}
\label{appa}
The M-step equations for each parameter are obtained by setting the
derivatives of $\Q$ with respect to that parameter to zero, and
solving. We start by expanding $\Q$ using equations~\er{hmm1}--\er{out}:
\begin{eqnarray}
\Q \! \! &=& \! \! -\frac{1}{2} \sum_{t=1}^T \left[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^M
Y_t' C^{-1} W\m \Smte + \! \! \sum_{m=1}^M \sum_{n=1}^M \tr \{ W^{(m)'}
C^{-1} W^{(n)} \SnSmte \} \right] \nonumber \\
& & + \sum_{m=1}^M \langle S_1^{(m)'} \rangle \log \pi\m +
\sum_{t=2}^T \sum_{m=1}^M \tr \left\{ (\log P\m) \Smtte \right\} -
\log Z,
\end{eqnarray}
where $Z$ is a normalization term independent of the states and observations
ensuring that the probabilities sum to one.
Setting the derivatives of $\Q$ with respect to the output weights to
zero, we obtain a linear system of equations for the $W\m$:
\be
\deldel{\Q}{W\m}= \sum_{t=1}^T \left[ \sum_{n=1}^M W^{(n)} \SnSmte -
Y_t \Smtpe \right]=0.
\el{msw1}
\ee
Assuming $Y_t$ is a $D \times 1$ vector, let $S_t$ be the $MK \times
1$ vector obtained by concatenating the $S\m$, and $W$ be the $D
\times MK$ matrix obtained by concatenating the $D \times K$
\hspace{0.1in} $W\m$ matrices. Then,
\be
W\new = \left(\sum_{t=1}^T Y_t \Stpe \right) \left( \sum_{t=1}^T
\SSte \right)^{{\normalsize \dagger}},
\el{msw2}
\ee
where $\dagger$ denotes the Moore-Penrose pseudo-inverse. Note that
the model is slightly overparametrized since the $D \times 1$ means of
each of the $W\m$ matrices add up to a single mean. Using the
pseudo-inverse obviates the need to explicitly subtract this overall
mean from each $W\m$ and estimate it separately as another parameter.
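In matrix form, equation~\er{msw2} is a single least-squares solve. A
minimal numpy sketch with the concatenated layout described above
(array names are ours):

```python
import numpy as np

def m_step_W(Y, ES, ESS):
    """M-step for the output weights, eq. msw2.

    Y: (T, D) observations; ES: (T, M*K) concatenated expectations <S_t>;
    ESS: (T, M*K, M*K) expectations <S_t S_t'>.  Returns the (D, M*K)
    concatenated weight matrix; the Moore-Penrose pseudo-inverse handles
    the overparametrization noted in the text.
    """
    A = Y.T @ ES            # sum_t Y_t <S_t'>
    B = ESS.sum(axis=0)     # sum_t <S_t S_t'>
    return A @ np.linalg.pinv(B)
```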
To estimate the priors we solve $\deldelb{\Q}{\pi\m} = 0$ subject to
the constraint that they sum to one, obtaining
\be
\pi\mnew = \langle S\m_1 \rangle.
\ee
Similarly, to estimate the transition matrices we solve
$\deldelb{\Q}{P\m} = 0$, subject to the constraint that the columns
of $P\m$ sum to one. For element $(i,j)$ in $P\m$:
\be
P\mnew_{i,j} = \frac{\sum_{t=1}^T \langle S\m_{t,i} S\m_{t-1,j} \rangle }{
\sum_{t=1}^T \langle S\m_{t-1,j} \rangle}.
\el{M2}
\ee
Finally, the re-estimation equations for the covariance matrix can be
derived by taking derivatives with respect to $C^{-1}$,
\be
\deldel{\Q}{C^{-1}} = \frac{T}{2} C + \sum_{t=1}^T \left[ \sum_{m=1}^M
Y_t \Smtpe W^{(m)'} - \frac{1}{2} Y_t Y_t' - \frac{1}{2} \sum_{m=1}^M
\sum_{n=1}^M W^{(n)} \SnSmte W^{(m)'}\right],
\ee
where the first term arises from the normalization term for the
Gaussian density function. Substituting~\er{msw1} and re-organizing we
get
\be
C\new=\frac{1}{T} \sum_{t=1}^T Y_t Y_t' - \frac{1}{T} \sum_{t=1}^T \sum_{m=1}^M
W\m \Smte \; Y_t'.
\ee
Note that, for $M=1$, these equations reduce to the Baum-Welch
re-estimation equations for HMMs with Gaussian observables. The above
M-step has been presented for the case of a single observation
sequence. The extension to multiple sequences is straightforward.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\appendix{B}
\appendixtitle{Exact forward--backward algorithm}
\label{appb}
The following algorithm defines an exact forward--backward recursion
for computing the posterior probabilities of the hidden states in a
factorial HMM. It differs from a straightforward application of the
F--B algorithm on the equivalent $K^M$ state HMM, in that it does not
depend on a $K^M \times K^M$ transition matrix. Rather, it makes use
of the independence of the underlying Markov chains to sum over $M$ $K
\times K$ transition matrices.
Using the notation $\{ Y_{\tau} \}_{t}^{r}$ to mean the observation
sequence $Y_t, \ldots, Y_r$, define:
\begin{eqnarray*}
\alpha_t &=& P(S^{(1)}_t, S^{(2)}_t, \ldots S^{(M)}_t, \{ Y_{\tau} \}_1^t | \phi) \\
\alpha^{(0)}_t &=& P(S^{(1)}_t, S^{(2)}_t, \ldots S^{(M)}_t, \{ Y_{\tau} \}_1^{t-1} | \phi) \\
\alpha^{(1)}_t &=& P(S^{(1)}_{t-1}, S^{(2)}_t,\ldots S^{(M)}_t, \{ Y_{\tau} \}_1^{t-1} | \phi)
\\
& \vdots & \\
\alpha^{(M)}_t &=& P(S^{(1)}_{t-1}, \ldots S^{(M)}_{t-1}, \{ Y_{\tau}
\}_1^{t-1} | \phi) = \alpha_{t-1}.\\
\end{eqnarray*}
Then we obtain the following forward recursions:
\begin{eqnarray}
\alpha_t &=& P(Y_t | S^{(1)}_t, \ldots, S^{(M)}_t,\phi) \alpha^{(0)}_t \\
\alpha^{(m-1)}_t &=& \sum_{S^{(m)}_{t-1}} P(S^{(m)}_t | S\m_{t-1})
\alpha^{(m)}_t.
\end{eqnarray}
Note that at the end of the forward recursions, the likelihood of the
observation sequence is the sum of the elements in $\alpha_T$.
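The forward recursion can be rendered directly in numpy. The following
is a minimal sketch (not the authors' code) that holds $\alpha_t$ as
an $M$-dimensional tensor, applies one $K \times K$ transition matrix
per chain via a tensor contraction, and rescales at each step in place
of working in the log domain:

```python
import numpy as np
from functools import reduce

def factorial_forward(Y, W, C, pi, P):
    """Exact forward pass for a factorial HMM with Gaussian outputs.

    alpha_t is an M-dimensional (K, ..., K) tensor; each chain's K x K
    transition matrix is summed over in turn, rather than forming the
    K^M x K^M joint transition matrix.  Returns log P(Y_1, ..., Y_T).
    """
    T, D = Y.shape
    M, K = len(W), W[0].shape[1]
    Cinv = np.linalg.inv(C)
    lognorm = -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(C)[1])
    shape = (K,) * M

    # mean output for every joint state setting
    means = np.zeros(shape + (D,))
    for idx in np.ndindex(*shape):
        means[idx] = sum(W[m][:, idx[m]] for m in range(M))

    alpha = reduce(np.multiply.outer, pi)      # product of the M priors
    loglik = 0.0
    for t in range(T):
        if t > 0:
            for m in range(M):                 # one K x K matrix per chain
                alpha = np.moveaxis(
                    np.tensordot(P[m], alpha, axes=([1], [m])), 0, m)
        resid = Y[t] - means
        q = np.einsum('...i,ij,...j->...', resid, Cinv, resid)
        alpha = alpha * np.exp(lognorm - 0.5 * q)
        s = alpha.sum()                        # rescale to avoid underflow
        loglik += np.log(s)
        alpha /= s
    return loglik
```

The accumulated rescaling constants recover the log likelihood, the
sum of the elements of $\alpha_T$ mentioned above.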
Similarly, to obtain the backward recursions we define:
\begin{eqnarray*}
\beta_t &=& P(\{ Y_{\tau}\}_{t+1}^T | S^{(1)}_t, \ldots S^{(M)}_t, \phi) \\
\beta^{(M)}_{t-1} &=& P(\{ Y_{\tau}\}_t^T | S^{(1)}_t, \ldots
S^{(M)}_t, \phi) \\
& \vdots & \\
\beta^{(1)}_{t-1} &=& P(\{ Y_{\tau}\}_t^T | S^{(1)}_t, S^{(2)}_{t-1} \ldots
S^{(M)}_{t-1}, \phi) \\
\beta^{(0)}_{t-1} &=& P(\{ Y_{\tau}\}_t^T | S^{(1)}_{t-1}, S^{(2)}_{t-1}
\ldots S^{(M)}_{t-1}, \phi) = \beta_{t-1}.\\
\end{eqnarray*}
Then:
\begin{eqnarray}
\beta^{(M)}_{t-1} &=& P(Y_t | S^{(1)}_t, \ldots, S^{(M)}_t, \phi)
\beta_t \\
\beta^{(m-1)}_{t-1} &=& \sum_{\Smt} P(\Smt | S\m_{t-1}) \beta\m_{t-1}.
\end{eqnarray}
The posterior probability of the state at time $t$ is obtained by
multiplying $\alpha_t$ and $\beta_t$ and normalizing:
\be
\gamma_t = P(S_t|\{Y_{\tau}\}_1^T,\phi) = \frac{\alpha_t
\beta_t}{\sum_{S_t} \alpha_t \beta_t}.
\ee
This algorithm can be shown to be equivalent to the JLO algorithm for
probability propagation in graphical models. The probabilities are
defined over collections of state variables corresponding to the
cliques in the equivalent junction tree. Information is passed
forwards and backwards by summing over the sets separating each
neighboring clique in the tree. This results in
forward--backward-type recursions of order $O(TM{K^{M+1}})$.
Using the $\alpha_t$, $\beta_t$, and $\gamma_t$ quantities, the
statistics required for the E-step are
\begin{eqnarray}
\Smte &=& \sum_{S^{(n)}_t (n\neq m)} \hspace*{-0.2in} \gamma_t\\
\SmSnte &=& \sum_{S^{(r)}_t (r\neq m \wedge r\neq n)} \hspace*{-0.3in} \gamma_t\\
\Smtte &=& \frac{\displaystyle \sum_{S^{(n)}_{t-1}, S^{(r)}_t (n\neq m
\wedge r \neq m)} \hspace*{-0.35in} \alpha_{t-1} P(S_t|S_{t-1}) P(Y_t|S_t)
\beta_t}{\displaystyle \sum_{S^{(n)}_{t-1}, S^{(r)}_t}
\alpha_{t-1} P(S_t|S_{t-1}) P(Y_t|S_t) \beta_t}.
\end{eqnarray}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\appendix{C}
\appendixtitle{Completely factorized variational approximation}
\label{appc}
Using the definition of the probabilistic model given by
equations~\er{hmm1}--\er{out}, the posterior probability of the states
given an observation sequence can be written as
\be
P(\{\Smt\} | \{Y_t\},\phi) = \frac{1}{Z} \exp \{ - H(\{S_t,Y_t\}) \},
\ee
where $Z$ is the normalization constant ensuring that the
probabilities sum to one and
\begin{eqnarray}
H(\{S_t,Y_t\}) &=& \frac{1}{2} \sum_{t=1}^T \left(Y_t - \sum_{m=1}^M W\m \Smt
\right)' C^{-1} \left( Y_t - \sum_{m=1}^M W\m \Smt \right) \nonumber \\
& & - \sum_{m=1}^M S^{(m)'}_1 \log \pi\m - \sum_{t=2}^T
\sum_{m=1}^M \Smtp (\log P\m) S\m_{t-1}. \el{hmf}
\end{eqnarray}
Similarly, the probability distribution given by the variational
approximation~\er{mf1}--\er{mf2} can be written as
\be
Q(\{S_t\}|\theta) = \frac{1}{Z_Q} \exp \{ - H_Q(\{S_t\}) \},
\ee
where
\be
H_Q(\{S_t\}) = - \sum_{t=1}^T \sum_{m=1}^M \Smtp \log \theta\m_t.
\ee
Using this notation, and denoting expectation with respect to the
variational distribution using angular brackets $\langle \cdot \rangle$, the
KL divergence is
\be
\KL(Q\|P)= \langle H \rangle - \langle H_Q \rangle - \log Z_Q +
\log Z.
\ee
The following facts can be verified from the definition of the
variational approximation:
\begin{eqnarray}
\Smte &=& \thetamt \el{emf1} \\
\Smtte &=& \thetamt \theta^{(m)'}_{t-1} \el{emf2} \\
\SmSnte &=& \left\{
\begin{array}{ll}
\thetamt \theta^{(n)'}_t & \mbox{if $m\neq n$}\\
\diag\{\thetamt\} & \mbox{if $m=n$,}
\end{array} \el{emf3}
\right.
\end{eqnarray}
where $\diag$ is an operator that takes a vector and returns a
square matrix with the elements of the vector along its diagonal, and
0s everywhere else.
The KL divergence can therefore be expanded out as
\begin{eqnarray}
\KL&=&\sum_{t=1}^T \sum_{m=1}^M \theta^{(m)'}_t \log \thetamt
+ \frac{1}{2} \sum_{t=1}^T \left[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^M
Y_t' C^{-1} W\m \thetamt \right. \nonumber \\
& & \left. + \sum_{m=1}^M \sum_{n\neq m}^M \tr \{ W^{(m)'}
C^{-1} W^{(n)} \theta^{(n)}_t \theta^{(m)'}_t \} + \sum_{m=1}^M \tr \{ W^{(m)'}
C^{-1} W\m \diag \{ \theta\m_t \} \} \right] \nonumber \\
& & - \sum_{m=1}^M \theta^{(m)'}_1 \log \pi\m - \sum_{t=2}^T
\sum_{m=1}^M \tr \{ \theta^{(m)}_{t-1} \theta^{(m)'}_t \log P\m \} -
\log Z_Q + \log Z,
\end{eqnarray}
where $\tr$ is the trace operator for square matrices.
Taking derivatives with respect to $\thetamt$, we obtain
\begin{eqnarray}
\deldel{\KL}{\thetamt} &=& \log \thetamt - Y_t' C^{-1} W\m +
\sum_{n\neq m}^M \theta^{(n)'}_t W^{(n)'} C^{-1} W\m + \frac{1}{2}
\diag \{ W^{(m)'} C^{-1} W\m \} \nonumber \\
& & - (\log P\m) \; \theta\m_{t-1} - (\log P\m)' \; \theta\m_{t+1} + c,
\end{eqnarray}
where $c$ is a constant arising from $\log Z_Q$, ensuring that the
$\thetamt$ are normalized. Setting this derivative equal to 0 and
solving for $\thetamt$ gives equation~\er{mf}.
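The resulting fixed-point iteration can be sketched in a few lines. In the NumPy fragment below, all problem sizes, variable names, and random parameters are illustrative, and the $\log \pi\m$ contribution at $t=1$ is omitted for brevity; each coordinate update is the softmax implied by setting the derivative above to zero:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
M, K, D, T = 3, 4, 5, 6                         # chains, states per chain, obs dim, length
W = rng.standard_normal((M, D, K))              # output matrices W^{(m)}
C_inv = np.eye(D)                               # C^{-1} (identity for simplicity)
# logP[m][i, j] = log P(S_t = i | S_{t-1} = j); columns of each P^{(m)} sum to one
logP = np.log(rng.dirichlet(np.ones(K), size=(M, K)).transpose(0, 2, 1))
Y = rng.standard_normal((T, D))
theta = rng.dirichlet(np.ones(K), size=(T, M))  # variational parameters theta_t^{(m)}

def update(t, m):
    # terms of the zero-gradient condition, solved for log theta_t^{(m)}
    g = W[m].T @ C_inv @ Y[t]                    # W^{(m)'} C^{-1} Y_t
    for n in range(M):
        if n != m:                               # coupling to the other chains
            g -= W[m].T @ C_inv @ W[n] @ theta[t, n]
    g -= 0.5 * np.diag(W[m].T @ C_inv @ W[m])    # -1/2 diag{W^{(m)'} C^{-1} W^{(m)}}
    if t > 0:
        g += logP[m] @ theta[t - 1, m]           # (log P^{(m)}) theta_{t-1}
    if t < T - 1:
        g += logP[m].T @ theta[t + 1, m]         # (log P^{(m)})' theta_{t+1}
    return softmax(g)                            # normalization absorbs the constant c

for sweep in range(20):                          # coordinate ascent over t and m
    for t in range(T):
        for m in range(M):
            theta[t, m] = update(t, m)
```

Each pass leaves every $\thetamt$ a valid probability vector, and repeated sweeps decrease the KL divergence monotonically, as coordinate descent on a bound.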
\appendix{D}
\appendixtitle{Structured approximation}
\label{appd}
For the structured approximation, $H_Q$ is defined as
\be
H_Q(\{S_t\}) = - \sum_{m=1}^M S^{(m)'}_1 \log \pi\m - \sum_{t=2}^T
\sum_{m=1}^M \Smtp (\log P\m) S\m_{t-1} - \sum_{t=1}^T \sum_{m=1}^M
\Smtp \log h\m_t.
\ee
Using~\er{hmf} we write the KL divergence as
\begin{eqnarray}
\KL&=&\sum_{t=1}^T \sum_{m=1}^M \Smte \log h\m_t
+ \frac{1}{2} \sum_{t=1}^T \left[ Y_t' C^{-1} Y_t - 2 \sum_{m=1}^M
Y_t' C^{-1} W\m \Smte \right. \nonumber \\
& & \left. + \sum_{m=1}^M \sum_{n\neq m}^M \tr \{ W^{(m)'}
C^{-1} W^{(n)} \langle S^{(n)}_t \rangle \langle S^{(m)'}_t \rangle \}
+ \sum_{m=1}^M \tr \{ W^{(m)'} C^{-1} W\m \Smtpe \} \right] \nonumber \\
& & - \log Z_Q + \log Z. \el{smfkl}
\end{eqnarray}
\newcommand{\loghnt}{\log h^{(n)}_{\tau}}
Since $\KL$ does not depend on $\pi\m$ and $P\m$, these parameters of
the structured approximation remain equal to the corresponding
parameters of the true system. Taking derivatives with respect to
$\loghnt$,
\begin{eqnarray}
\deldel{\KL}{\loghnt} &=& \langle S^{(n)}_{\tau} \rangle + \sum_{t=1}^T
\sum_{m=1}^M \left[ \log h\m_t -
Y_t' C^{-1} W\m + \sum_{\ell\neq m}^M \langle S^{(\ell)}_t \rangle W^{(\ell)'}
C^{-1} W\m \right. \nonumber \\
& & \left. + \frac{1}{2} \diag \{ W^{(m)'} C^{-1} W\m \}
\rule{0mm}{3.5ex} \right]\deldel{\Smte}{\loghnt} - \langle
S^{(n)}_{\tau} \rangle. \el{smfdkl}
\end{eqnarray}
The last term was obtained by making use of the fact that
\be
\deldel{\log Z_Q}{\loghnt}= \langle S^{(n)}_{\tau} \rangle,
\ee
and cancels out the first term. Setting the terms inside the brackets
in~\er{smfdkl} equal to zero yields equation~\er{smf2}.
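For concreteness, the update for $h\m_t$ implied by~\er{smfdkl} can be sketched as follows (NumPy; names and sizes are hypothetical, and the expectations $\Smte$ are stand-ins for the output of a per-chain forward--backward pass):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, D, T = 3, 4, 5, 6
W = rng.standard_normal((M, D, K))              # output matrices W^{(m)}
C_inv = np.eye(D)                               # C^{-1}
Y = rng.standard_normal((T, D))
# <S_t^{(m)}>, normally produced by forward-backward on each chain (placeholders here)
ES = rng.dirichlet(np.ones(K), size=(T, M))

def log_h(t, m):
    # the bracketed terms of (smfdkl), set to zero and solved for log h_t^{(m)}
    v = W[m].T @ C_inv @ Y[t]                   # W^{(m)'} C^{-1} Y_t
    for l in range(M):
        if l != m:
            v -= W[m].T @ C_inv @ W[l] @ ES[t, l]   # subtract the other chains' predictions
    v -= 0.5 * np.diag(W[m].T @ C_inv @ W[m])
    return v

h = np.exp(np.array([[log_h(t, m) for m in range(M)] for t in range(T)]))
```

Each new $h\m_t$ then plays the role of the observation probabilities in the forward--backward recursions for chain $m$, which in turn produce new expectations; iterating this loop to convergence completes the structured approximation.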
\acknowledgments
We thank Lawrence Saul for helpful discussions and Geoffrey Hinton for
support. This project was supported in part by a grant from the
McDonnell-Pew Foundation, by a grant from ATR Human Information
Processing Research Laboratories, by a grant from Siemens Corporation,
and by grant N00014-94-1-0777 from the Office of Naval
Research. Zoubin Ghahramani was supported by a grant from the Ontario
Information Technology Research Centre.
% \bibliographystyle{apalike}
% \bibliography{/homes/neuron/a/zoubin/papers/ref/thesis}
\begin{thebibliography}{AAAckley etal., 1985}
% \begin{references}
\bibitem[Ackley et~al., 1985]{AckHinSej85}
Ackley, D., Hinton, G., and Sejnowski, T. (1985).
\newblock A learning algorithm for {Boltzmann} machines.
\newblock {\em Cognitive Science}, 9:147--169.
\bibitem[Baldi et~al., 1994]{BalChaHun94}
Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. (1994).
\newblock Hidden {{Markov}} models of biological primary sequence information.
\newblock {\em Proc. Nat. Acad. Sci. (USA)}, 91(3):1059--1063.
\bibitem[Baum et~al., 1970]{BauPetSou70}
Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970).
\newblock A maximization technique occurring in the statistical analysis of
probabilistic functions of {Markov} chains.
\newblock {\em The Annals of Mathematical Statistics}, 41:164--171.
\bibitem[Bengio and Frasconi, 1995]{BenFra95}
Bengio, Y. and Frasconi, P. (1995).
\newblock An input--output {HMM} architecture.
\newblock In Tesauro, G., Touretzky, D.~S., and Leen, T.~K., editors, {\em
Advances in Neural Information Processing Systems 7}, pages 427--434. MIT
Press, Cambridge, MA.
\bibitem[Cacciatore and Nowlan, 1994]{CacNow94}
Cacciatore, T.~W. and Nowlan, S.~J. (1994).
\newblock Mixtures of controllers for jump linear and non-linear plants.
\newblock In Cowan, J.~D., Tesauro, G., and Alspector, J., editors, {\em
Advances in Neural Information Processing Systems 6}, pages 719--726. Morgan
Kaufmann Publishers, San Francisco, CA.
\bibitem[Conklin and Witten, 1995]{ConWit95}
Conklin, D. and Witten, I.~H. (1995).
\newblock Multiple viewpoint systems for music prediction.
\newblock {\em J. of New Music Research}, 24:51--73.
\bibitem[Cover and Thomas, 1991]{CovTho91}
Cover, T. and Thomas, J. (1991).
\newblock {\em Elements of Information Theory}.
\newblock Wiley, New York.
\bibitem[Dawid, 1992]{Daw92}
Dawid, A.~P. (1992).
\newblock Applications of a general propagation algorithm for probabilistic
expert systems.
\newblock {\em Statistics and Computing}, 2:25--36.
\bibitem[Dean and Kanazawa, 1989]{DeaKan89}
Dean, T. and Kanazawa, K. (1989).
\newblock A model for reasoning about persistence and causation.
\newblock {\em Computational Intelligence}, 5(3):142--150.
\bibitem[Dempster et~al., 1977]{DemLaiRub77}
Dempster, A., Laird, N., and Rubin, D. (1977).
\newblock Maximum likelihood from incomplete data via the {{EM}} algorithm.
\newblock {\em J. Royal Statistical Society Series B}, 39:1--38.
\bibitem[Geman et~al., 1992]{GemBieDou92}
Geman, S., Bienenstock, E., and Doursat, R. (1992).
\newblock Neural networks and the bias/variance dilemma.
\newblock {\em Neural Computation}, 4:1--58.
\bibitem[Geman and Geman, 1984]{GemGem84}
Geman, S. and Geman, D. (1984).
\newblock Stochastic relaxation, {G}ibbs distributions, and the {{Bayesian}}
restoration of images.
\newblock {\em IEEE Transactions on Pattern Analysis and Machine Intelligence},
6:721--741.
\bibitem[Ghahramani, 1995]{Gha95}
Ghahramani, Z. (1995).
\newblock Factorial learning and the {EM} algorithm.
\newblock In Tesauro, G., Touretzky, D., and Leen, T., editors, {\em Advances
in Neural Information Processing Systems 7}, pages 617--624. MIT Press,
Cambridge, MA.
\bibitem[Heckerman, 1995]{Hec95}
Heckerman, D. (1995).
\newblock A tutorial on learning {Bayesian} networks.
\newblock {\em Microsoft Research Technical Report MSR-TR-95-06}.
\bibitem[Hinton and Zemel, 1994]{HinZem94}
Hinton, G.~E. and Zemel, R.~S. (1994).
\newblock Autoencoders, minimum description length, and {Helmholtz} free
energy.
\newblock In Cowan, J., Tesauro, G., and Alspector, J., editors, {\em Advances
in Neural Information Processing Systems 6}. Morgan Kaufmann Publishers, San
Francisco, CA.
\bibitem[Jensen et~al., 1990]{JenLauOle90}
Jensen, F.~V., Lauritzen, S.~L., and Olesen, K.~G. (1990).
\newblock {Bayesian} updating in recursive graphical models by local
computations.
\newblock {\em Computational Statistics Quarterly}, 4:269--282.
\bibitem[Jordan et~al., 1997]{JorGhaSau97}
Jordan, M.~I., Ghahramani, Z., and Saul, L.~K. (1997).
\newblock Hidden {Markov} decision trees.
\newblock In Mozer, M., Jordan, M., and Petsche, T., editors, {\em Advances in
Neural Information Processing Systems 9}. MIT Press, Cambridge, MA.
\bibitem[Jordan and Jacobs, 1994]{JorJac94}
Jordan, M.~I. and Jacobs, R. (1994).
\newblock Hierarchical mixtures of experts and the {{EM}} algorithm.
\newblock {\em Neural Computation}, 6:181--214.
\bibitem[Kanazawa et~al., 1995]{KanKolRus95}
Kanazawa, K., Koller, D., and Russell, S.~J. (1995).
\newblock Stochastic simulation algorithms for dynamic probabilistic networks.
\newblock In Besnard, P. and Hanks, S., editors, {\em Uncertainty in Artificial
Intelligence. Proceedings of the Eleventh Conference.}, pages 346--351.
Morgan Kaufmann Publishers, San Francisco, CA.
\bibitem[Lauritzen and Spiegelhalter, 1988]{LauSpi88}
Lauritzen, S.~L. and Spiegelhalter, D.~J. (1988).
\newblock Local computations with probabilities on graphical structures and
their application to expert systems.
\newblock {\em J. Royal Statistical Society B}, pages 157--224.
\bibitem[McCullagh and Nelder, 1989]{McCNel89}
McCullagh, P. and Nelder, J. (1989).
\newblock {\em Generalized {Linear} Models}.
\newblock Chapman \& Hall, London.
\bibitem[Meila and Jordan, 1996]{MeiJor96}
Meila, M. and Jordan, M.~I. (1996).
\newblock Learning fine motion by {Markov} mixtures of experts.
\newblock In Touretzky, D.~S., Mozer, M.~C., and Hasselmo, M.~E., editors, {\em
Advances in Neural Information Processing Systems 8}. MIT Press.
\bibitem[Murphy and Aha, 1992]{MurAha92}
Murphy, P.~M. and Aha, D.~W. (1992).
\newblock {\em {UCI} Repository of machine learning databases
[Machine-readable data repository]}.
\newblock University of California, Department of Information and Computer
Science, Irvine, CA.
\bibitem[Neal, 1992]{Nea92}
Neal, R.~M. (1992).
\newblock Connectionist learning of belief networks.
\newblock {\em Artificial Intelligence}, 56:71--113.
\bibitem[Neal, 1993]{Nea93}
Neal, R.~M. (1993).
\newblock Probabilistic inference using {Markov} chain {Monte Carlo} methods.
\newblock {\em University of Toronto, Technical Report {CRG-TR-93-1}}.
\bibitem[Neal and Hinton, 1993]{NeaHin93}
Neal, R.~M. and Hinton, G.~E. (1993).
\newblock A new view of the {EM} algorithm that justifies incremental and other
variants.
\newblock {\em Submitted}.
\bibitem[Parisi, 1988]{Par88}
Parisi, G. (1988).
\newblock {\em Statistical Field Theory}.
\newblock Addison-Wesley, Redwood City, CA.
\bibitem[Pearl, 1988]{Pea88}
Pearl, J. (1988).
\newblock {\em Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference}.
\newblock Morgan Kaufmann, San Mateo, CA.
\bibitem[Rabiner and Juang, 1986]{RabJua86}
Rabiner, L.~R. and Juang, B.~H. (1986).
\newblock An introduction to hidden {Markov} models.
\newblock {\em IEEE Acoustics, Speech \& Signal Processing Magazine}, 3:4--16.
\bibitem[Saul et~al., 1996]{SauJaaJor96}
Saul, L., Jaakkola, T., and Jordan, M.~I. (1996).
\newblock Mean field theory for sigmoid belief networks.
\newblock {\em Journal of Artificial Intelligence Research}, 4:61--76.
\bibitem[Saul and Jordan, 1995]{SauJor95}
Saul, L. and Jordan, M.~I. (1995).
\newblock Boltzmann chains and hidden {{Markov}} models.
\newblock In Tesauro, G., Touretzky, D., and Leen, T., editors, {\em Advances
in Neural Information Processing Systems 7}. MIT Press, Cambridge, MA.
\bibitem[Saul and Jordan, 1996]{SauJor96}
Saul, L. and Jordan, M.~I. (1996).
\newblock Exploiting tractable substructures in intractable networks.
\newblock In Touretzky, D., Mozer, M., and Hasselmo, M., editors, {\em Advances
in Neural Information Processing Systems 8}. MIT Press.
\bibitem[Smyth et~al., 1997]{SmyHecJor97}
Smyth, P., Heckerman, D., and Jordan, M.~I. (1997).
\newblock Probabilistic independence networks for hidden {Markov} probability
models.
\newblock {\em Neural Computation}, 9:227--269.
\bibitem[Stolcke and Omohundro, 1993]{StoOmo93}
Stolcke, A. and Omohundro, S. (1993).
\newblock Hidden {Markov} model induction by {Bayesian} model merging.
\newblock In Hanson, S.~J., Cowan, J.~D., and Giles, C.~L., editors, {\em
Advances in Neural Information Processing Systems 5}, pages 11--18. Morgan
Kaufmann, San Francisco, CA.
\bibitem[Tanner and Wong, 1987]{TanWon87}
Tanner, M.~A. and Wong, W.~H. (1987).
\newblock The calculation of posterior distributions by data augmentation
(with discussion).
\newblock {\em Journal of the American Statistical Association}, 82:528--550.
\bibitem[Viterbi, 1967]{Vit67}
Viterbi, A.~J. (1967).
\newblock Error bounds for convolutional codes and an asymptotically optimal
decoding algorithm.
\newblock {\em IEEE Trans. Informat. Theory}, IT-13:260--269.
\bibitem[Williams and Hinton, 1991]{WilHin91}
Williams, C. K.~I. and Hinton, G.~E. (1991).
\newblock Mean field networks that learn to discriminate temporally distorted
strings.
\newblock In Touretzky, D., Elman, J., Sejnowski, T., and Hinton, G., editors,
{\em Connectionist Models: Proceedings of the 1990 Summer School}, pages
18--22. Morgan Kaufmann Publishers, San Mateo, CA.
\bibitem[Zemel, 1993]{Zem93}
Zemel, R.~S. (1993).
\newblock {\em A minimum description length framework for unsupervised
learning}.
\newblock Ph.D. Thesis, Dept. of Computer Science, University of Toronto,
Toronto, Canada.
% \end{references}
\end{thebibliography}
\end{article}
\end{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%