\documentstyle[twocolumn,psfig]{article}
%\pagestyle{empty}
\textheight 8.5in
\textwidth 6.0in
\columnsep 0.37in
\oddsidemargin 0.25in \evensidemargin 0.25in
\topmargin -.3in \headheight 0in \footheight 1in
\title{A Robot Agent That Can Search}
\author{ Yiming Ye \hspace{.1in} and \hspace{.1in} John K. Tsotsos \\
Department of Computer Science\\
University of Toronto\\
Toronto Ontario M5S 1A4 \\
yiming@vis.toronto.edu \hspace{.5in} tsotsos@vis.toronto.edu \\
Tel: (416) 978-6986 \hspace{.5in} (416) 978-3619 \\
FAX: 416-978-1455
}
\date{}
\begin{document}
%********************************************************************
%********************************************************************
\bibliographystyle{plain}
\maketitle
%********************************************************************
%* Two commands are needed for the hard copy version with no page
%* numbers. Under the "\documentstyle..." command use
%* '\pagestyle{empty}' (see above) and after the '\maketitle' command
%* use '\thispagestyle{empty}'.
%* Comment these two commands out for the soft copy version.
%********************************************************************
%\thispagestyle{empty}
%********************************************************************
%* Abstract
%********************************************************************
\begin{abstract}
{\em Object search} is the task of efficiently
searching for a given 3D object
in a given 3D environment by an agent equipped with a camera
for target detection
and, if the environment configuration is not known,
a method of calculating depth, such as stereo or a laser range
finder. {\em Sensor planning for object search} refers
to the task of selecting the sensing parameters
so as to bring the target into the field of view of the
camera, and to make the image of the target easily
detectable by the available recognition algorithms.
In this paper\footnote{\footnotesize
The IBM contact for this paper is
Karen Bennet, Centre for Advanced Studies, IBM Canada Ltd.,
Stn. 2G, Dept. 894,
1150 Eglinton Avenue East,
North York, Ontario M3C~1V7.}, the task of sensor planning
for object search is formulated, and a mechanism
for this task is proposed.
The search space is characterized by the probability distribution of the presence
of the target.
The goal is to reliably find the desired object with minimum effort.
The control of the sensing parameters depends on the
current state of the search region and the detecting ability of the
recognition algorithm.
The huge space of possible sensing actions is decomposed
into a finite set of actions that must be considered.
In order to represent the surrounding environment of the camera
and to efficiently determine the sensing parameters over time,
a concept called the {\em sensed sphere} is proposed,
and its construction
is derived.
\end{abstract}
\section{Introduction}
\label{Introduction}
{\em Object search} is the task of finding a given $3D$
object in a given $3D$ environment. It is clear
that a brute-force blind search will suffice for its
solution; however, our goal is the design of efficient strategies
for searching because exhaustive searching is computationally
and mechanically
prohibitive for non-trivial situations.
Generally speaking, this task contains
three parts. The first is selecting the sensing parameters
so as to bring the target into the field of view of the sensor.
This is the sensor planning problem,
which is the main concern of this paper.
The second is manipulating the hardware so that the sensing
operators can reach the state specified by the planner.
The third is searching for the target within the image.
This is the object recognition and localization
problem, which attracts a
lot of attention within the computer vision community.
Sensor planning for object search is very important if a robot
wants to interact intelligently and effectively with its environment.
However, there has been little research done in this
area within the computer
vision community (\cite{Tsotsos92a}, \cite{Wixson92},
\cite{Thomas76}, \cite{Maver90},
\cite{Rimey92}).
Connell \cite{Connell89}
constructed a robot that roams an area searching for and collecting
soda cans.
The planning is very simple, since the robot just follows the walls
of the room and the sensor searches only the area immediately
in front of the robot. This may not be very efficient, since the
likely presence of the target is not considered when the robot is
roaming.
Rimey and Brown \cite{Rimey92} used the composite Bayes net and
utility decision rule to plan the sensor in
their task-oriented system, TEA.
The sensor is directed
to the center of mass of the expected area for a certain
object based on the belief value of the net.
{\em Probability of presence} is used in their system, but
the detection ability of the sensor is not considered,
and the purpose of the sensor planning is mainly verification
rather than searching.
The indirect search mechanism proposed by Garvey \cite{Thomas76}
is to first direct the sensor to search for
an ``intermediate" object that commonly
participates in a spatial relationship with the target,
and then direct the sensor to examine the
restricted region specified by this relationship.
Wixson \cite{Wixson92} presented a mathematical model of
search efficiency, and predicted that
{\em indirect search} can
improve efficiency in many situations.
The problems with indirect search are that the
spatial relationship between the target and
the ``intermediate" object may not always exist,
and that detecting the ``intermediate" object
may not always be easier than detecting the target.
It is interesting to note that the operational research community
has done a lot of research
on optimal search \cite{Koopman}.
Their purpose is to determine how to allocate effort to search for
a target, such as a
lost submarine in the ocean or an oil field
within a certain region.
Although the results are elegant in
a mathematical sense,
they cannot be directly applied here because
the searcher model is too abstract and
general and there is no
sensor planning involved in their approach.
This paper proposes a practical mechanism for the task of
sensor planning
for object search.
This task is formulated
as an optimization problem: select an ordered set of camera
configurations that maximize the probability of finding the
target with minimum cost.
The optimization task is further simplified to a
decision problem: decide only which is the very next
action to execute, considering the effect and cost of
each candidate action.
\section{Problem Formulation}
In this section, we first explain the searcher model, the environment,
and some basic concepts
used in our discussion, and then formulate the object search task and
give a simplified version of it.
The searcher model is based on the ARK robot, which is
a
mobile platform
equipped with a special sensor called the Laser Eye \cite{Jasiobedzki93}.
The Laser Eye is mounted on a robotic head that has pan and tilt capabilities.
It consists of a camera
with controllable focal length (zoom), a laser range-finder, and a mirror.
The mirror is used to ensure collinearity of effective
optical axes of the camera lens and the range finder.
The state of the searcher is uniquely determined
by seven
parameters $(x_{c},y_{c},z_{c},p,t,w,h)$.
$(x_{c},y_{c},z_{c})$ is the position of the
camera center (the starting point of the camera viewing
axis);
$(p,t)$ is the direction of the camera viewing axis,
where $p$ is
the amount of pan ($0 \leq p < 2 \pi$), $t$ is the amount of tilt
$(0 \leq t < \pi$), and
$w$ and $h$
are the width and height of the solid viewing angle of the camera,
respectively.
$(x_{c},y_{c},z_{c})$ can be adjusted by moving
the mobile platform,
$(p,t)$ can be adjusted by the motors on the robotics head,
and
$w,h$ can be adjusted by the zoom lens of the camera.
We assume in this paper that the camera's image plane is always
coincident with its focal plane.
The search region, $\Omega$, can be in any form, and it is assumed
that we know its boundaries exactly but we do not
know its internal configuration.
In practice, we tessellate the region
$\Omega$ into a series of elements: $c_i$,
\( \Omega = \bigcup_{i=1}^{n} c_{i} \) and \( c_{i} \bigcap c_{j} = \emptyset \)
for $i \neq j$.
In the rest of the paper, we assume the search region is an
office-like environment, and we tessellate the space into little
cubes each of the same size. Usually, the size of the cube is determined
by the size of the environment and the size of the target \cite{YimingTech}.
An operation \( {\bf f} = {\bf f}(x_{c},y_{c},z_{c},p,t,w,h,a) \)
is an action of the searcher
within the region $\Omega$,
where
$a$ is the recognition algorithm used to detect the target.
An operation ${\bf f}$ entails taking a
{\bf perspective} projection image
according to the camera configuration of ${\bf f}$, and then
searching the image
using the recognition algorithm $a$.
The target distribution
can be specified by a probability distribution function
${\bf p}$.
${\bf p}(c_{i},t)$ gives the probability that
the center of the target is within cube $c_{i}$ at time $t$.
Note that we use ${\bf p}(c_{o},t)$ to represent the probability
that the target is outside the search region at time $t$.
The detection function on $\Omega$
is a function ${\bf b}$, such that
$ {\bf b}( c_{i},{\bf f} )$ gives the conditional
probability of detecting the target given that the
center of the target is located within $c_{i}$ and
the operation is ${\bf f}$.
For any operation,
if the projection of the center of the cube
$c_{i}$ is outside the image, we
assume ${\bf b}( c_{i},{\bf f} ) = 0$;
if the cube is occluded or it is too far from or too near to the camera, we also
have ${\bf b}( c_{i},{\bf f} ) = 0$.
In general \cite{YimingTech}, ${\bf b}( c_{i},{\bf f} )$ is determined by
various factors, such as intensity, occlusion, orientation, etc.
It is obvious that the probability of detecting the target
by applying action ${\bf f}$ is given by
$P({\bf f}) = \sum_{i=1}^{n} {\bf p}(c_{i},t_{{\bf f}})
{\bf b}(c_{i},{\bf f})$,
where $t_{{\bf f}}$ is the time just
before ${\bf f}$ is applied.
Let $\Psi$ be the set of all the cubes that are within the field of view
of ${\bf f}$ and that are not occluded; then we have
$P({\bf f}) = \sum_{c \in \Psi}{\bf p}(c,t_{{\bf f}}) {\bf b}(c,{\bf f})$.
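As a concrete illustration, $P({\bf f})$ might be computed as follows (a minimal sketch; the data layout and the names `p`, `b`, and `visible_cubes` are our hypothetical stand-ins for the distribution ${\bf p}$, the detection function ${\bf b}$, and the set $\Psi$, not the authors' implementation):

```python
# Sketch: P(f) = sum over cubes c in Psi of p(c) * b(c, f), where Psi is
# the set of cubes in f's field of view that are not occluded.
def detection_probability(p, b, visible_cubes, f):
    """p: dict cube -> probability; b: callable (cube, f) -> detection
    probability; visible_cubes: the set Psi for operation f."""
    return sum(p[c] * b(c, f) for c in visible_cubes)
```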
The reason why the term $t_{{\bf f}}$ is introduced in the
calculation of $P({\bf f})$ is that the probability
distribution needs to be updated whenever an action fails.
Here we use Bayes' formula.
Let $\alpha_{i}$ be the event that the center of the target
is in cube $c_{i}$; $\alpha_{o}$ be the event that the center of the target
is outside the search region; and $\beta$ be the event that after applying
a recognition action, the recognizer successfully detects the target.
Then $P(\neg \beta \mid \alpha_{i})= 1 - {\bf b} ( c_{i}, {\bf f} )$
and
$ P( \alpha_{i} \mid \neg \beta ) = {\bf p}(c_{i},t_{{\bf f}+})$,
where $t_{{\bf f}+}$ is the time just after ${\bf f}$ is applied.
Since
the above events \(\alpha_{1}, \ldots,\alpha_{n},\alpha_{o} \)
are mutually exclusive and collectively
exhaustive, we get the following updating rule:
\small
\begin{displaymath}
{\bf p}(c_{i},t_{{\bf f}+}) \leftarrow
\frac{ {\bf p}(c_{i},t_{{\bf f}}) (1 - {\bf b} ( c_{i}, {\bf f} ))}
{ {\bf p}(c_{o},t_{{\bf f}}) +
\sum_{j=1}^{n} {\bf p}(c_{j},t_{{\bf f}}) (1 - {\bf b} ( c_{j}, {\bf f} ))
} \nonumber
\end{displaymath} \normalsize
where $i=1,\ldots,n,o$.
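A minimal sketch of this updating rule (assuming, as a hypothetical data layout, that the distribution is stored as a dict over inside cubes plus a scalar for the outside probability; this is not the authors' code):

```python
def update_after_failure(p_inside, p_outside, b):
    """Bayes update after a failed action f.
    p_inside: dict cube -> p(c_i, t_f); p_outside: p(c_o, t_f);
    b: dict cube -> b(c_i, f), with b = 0 for cubes f cannot see."""
    denom = p_outside + sum(p * (1.0 - b.get(c, 0.0))
                            for c, p in p_inside.items())
    new_inside = {c: p * (1.0 - b.get(c, 0.0)) / denom
                  for c, p in p_inside.items()}
    new_outside = p_outside / denom   # b(c_o, f) = 0 by definition
    return new_inside, new_outside
```

Note that cubes the action could not see keep their relative weight but are renormalized upward, reflecting the failure of the action elsewhere.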
The
cost ${\bf t_{o}(f)}$
gives the total time needed to (1) manipulate the hardware
to the status specified by ${\bf f}$; (2) take a picture;
(3) update the environment and register the space;
and (4) run the recognition algorithm.
We assume that (1) and (2) are the same for all the actions;
(3) is a constant;
and (4) is known for any recognition algorithm.
Then, ${\bf t_{o}(f)}$ is influenced only by (4).
Let ${\bf O_{\Omega} }$ be the set of all the
possible operations that can be applied.
The effort allocation \( {\bf F} = \{ {\bf f}_{1}, \ldots, {\bf f}_{k} \} \)
gives the ordered set of operations applied in the search, where ${\bf f}_{i} \in {\bf O_{\Omega}}$.
It is clear that
the probability of detecting the target by this allocation
is:
\vspace{-0.15in}
\begin{displaymath}
P[{\bf F}] = P({\bf f}_{1}) + \ldots + \{ \: \prod^{k-1}_{i=1}[1 - P({\bf f}_{i})]\}
P({\bf f}_{k})
\nonumber
\end{displaymath}
The total cost for applying this allocation is
(following \cite{Tsotsos92b}):
$
T[{\bf F}] = \sum_{i=1}^{k} {\bf t_{o}}({\bf f}_{i})
$.
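Both quantities are easy to compute incrementally; a minimal sketch (function names are ours, not from the paper):

```python
def allocation_probability(detection_probs):
    """P[F] for an ordered allocation, given P(f_1), ..., P(f_k): each
    action contributes its detection probability weighted by the
    probability that all earlier actions missed."""
    total, miss = 0.0, 1.0
    for pf in detection_probs:
        total += miss * pf
        miss *= 1.0 - pf
    return total

def allocation_cost(costs):
    """T[F]: the costs t_o(f_i) simply add up."""
    return sum(costs)
```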
Suppose $K$ is the total time that can be allowed in the search. Then
the task of sensor planning for object search can be defined as
finding an allocation \( {\bf F} \subset
{\bf O_{\Omega}} \), which satisfies \( T[{\bf F}] \leq K \)
and maximizes $P[{\bf F}]$.
Since this task is NP-Complete \cite{Yiming95AIM},
we
consider a simpler problem:
decide only which is the very next
action to execute.
Suppose we have already executed $q$ ($q \geq 1$) actions
${\bf F}_{q}=\{ {\bf f}_{1}, \ldots, {\bf f}_{q} \}$.
We now want to find the next action
to execute, with the hope that our strategy of finding
it may finally
lead to an {\bf approximate} optimal
solution of the object search task.
For any next action ${\bf f}$, its contribution to the probability of detecting the target is
$\Delta_{P}({\bf f}) = \{ \: \prod^{q}_{j=1}[1 - P({\bf f}_{j})]\} P({\bf f}) $.
The additional cost is
$ \Delta_{T}({\bf f}) = {\bf t_{o}}({\bf f})$.
Since
$\{ \: \prod^{q}_{j=1}[1 - P({\bf f}_{j})]\}$
is fixed, maximizing $\Delta_{P}({\bf f})/\Delta_{T}({\bf f})$
is equivalent to maximizing
$ E({\bf f}) = \frac{P({\bf f})}{\Delta_{T}({\bf f})} $,
so the next action selected is the one with the largest $E({\bf f})$.
Note that this greedy strategy may sometimes lead to an optimal
solution (see \cite{Yiming95AIM} for details).
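The resulting greedy rule is tiny; a sketch (the candidate triples are a hypothetical representation of an action, its detection probability $P({\bf f})$, and its cost $\Delta_{T}({\bf f})$):

```python
def best_next_action(candidates):
    """candidates: iterable of (action, P_f, cost) triples.
    Returns the action maximizing E(f) = P(f) / cost."""
    return max(candidates, key=lambda t: t[1] / t[2])[0]
```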
The ``One Step Look Ahead" problem is further
divided into two subproblems.
One is the ``where to look next" problem:
how to select
$w,h,p,t,a$ of ${\bf f}$ so as to maximize
$E({\bf f})$ for a fixed camera position.
The other is the ``where to move next"
problem: how to decide the next best
robot
position.
\section{Detection Function}
We briefly discuss the detection function in this section.
For details, refer to \cite{YimingTech}.
The standard detection function
${\bf b_{0}} ((\theta,\delta,l), \langle w,h \rangle ) $
gives a measure of the detecting ability of the
recognition algorithm $a$ when no previous
action has been applied.
Here, $\langle w,h \rangle$ is the viewing angle size of the camera,
and $(\theta,\delta,l)$ is the position of the center
of the target relative to the camera: $\theta = \arctan( \frac{x}{z} )$,
$\delta = \arctan( \frac{y}{z} )$, and $l=z$,
where $(x,y,z)$ is the coordinate of the target center in
the camera
coordinate system.
The value of ${\bf b_{0}} ((\theta,\delta,l), \langle w,h \rangle ) $ can be
obtained by experiments.
We can first put the target at $(\theta,\delta,l)$
and then perform experiments
under various conditions, such as light intensity,
background situation, and the relative orientation of the
target with respect to the camera center.
The final value
is the total number of successful recognitions divided by
the total number of experiments.
These values can be stored in a look-up table indexed
by $\theta,\delta,l$ and retrieved when needed.
Sometimes we may approximate these values by
analytic formulas.
We only need to record the
detection values for one angle size,
$\langle w_{0}, h_{0} \rangle$. Those of other
sizes can be approximately
transformed to those of size $\langle w_{0}, h_{0} \rangle$.
Suppose $(\theta,\delta,l)$ is the target position for angle size $\langle w,h \rangle$.
We want to find the value $(\theta_{0}, \delta_{0},l_{0})$ for
angle size $\langle w_{0}, h_{0} \rangle$ such that
${\bf b_{0}} ( (\theta_{0}, \delta_{0},l_{0}), \langle w_{0}, h_{0} \rangle )
\approx {\bf b_{0}} ((\theta,\delta,l), \langle w,h \rangle )$.
To guarantee this, the images taken with parameters
$\langle \theta_{0}, \delta_{0},l_{0},w_{0}, h_{0} \rangle$ and
$\langle \theta,\delta,l,w,h \rangle$ should be almost the same.
Thus,
the area and position of the
projected target image on the image plane
should be almost the same for both images. We get
$
l_{0} =l \sqrt{ \frac{\tan(\frac{w}{2})\tan(\frac{h}{2})}
{\tan(\frac{w_{0}}{2})\tan(\frac{h_{0}}{2})} }
$;
$
\theta_{0} = \arctan[ \tan(\theta)
\frac{\tan(\frac{w_{0}}{2})}{\tan(\frac{w}{2})}]
$;
$
\delta_{0} = \arctan[ \tan(\delta)
\frac{\tan(\frac{h_{0}}{2})}{\tan(\frac{h}{2})}]
$.
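These transformations might be coded as follows (a sketch; the function name is ours and all angles are in radians). We scale $\theta$ by the width ratio and $\delta$ by the height ratio, which keeps the projected image position fixed in both axes:

```python
import math

def to_reference_size(theta, delta, l, w, h, w0, h0):
    """Map a target position (theta, delta, l) seen under angle size
    (w, h) to the equivalent (theta0, delta0, l0) under the reference
    size (w0, h0), preserving the projected image area and position.
    All angles are in radians."""
    l0 = l * math.sqrt((math.tan(w / 2) * math.tan(h / 2)) /
                       (math.tan(w0 / 2) * math.tan(h0 / 2)))
    theta0 = math.atan(math.tan(theta) * math.tan(w0 / 2) / math.tan(w / 2))
    delta0 = math.atan(math.tan(delta) * math.tan(h0 / 2) / math.tan(h / 2))
    return theta0, delta0, l0
```

By construction the mapping is invertible: transforming to the reference size and back recovers the original position.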
When the configurations of two operations are very similar, they might
be correlated with each other (refer to \cite{YimingTech} for details).
Repeated actions are avoided during the search
process.
When independence is assumed,
$ {\bf b}( c_{i},{\bf f} )$
is calculated as follows.
First, calculate the corresponding $(\theta, \delta,l)$
of the center of $c_{i}$ with respect to operation ${\bf f}$.
Second, transform $(\theta, \delta,l)$ into the corresponding
$(\theta_{0}, \delta_{0},l_{0})$ of angle size $\langle w_{0}, h_{0} \rangle$.
Third, retrieve the detection value from the look-up
table, or get the detection value from a formula.
\section{The Sensed Sphere}
The space around the center of the camera
can be divided into a set of solid angles.
Each solid angle
is associated with a radius that is the length of an emitting
line along the direction of the central axis of the solid angle
from the origin.
The environment can thus be represented by the union of these solid
angles. This representation is called the {\bf sensed sphere}
\cite{Tsotsos92a}.
We can use a laser range finder to construct the sensed sphere.
First,
we need to {\bf tessellate} the surface of the unit sphere centered at the
camera center into a set of surface patches.
Then, we need to ping the
laser at the center of each patch so as to get the radius
of each solid angle.
In order to make the tessellation as uniform as possible
and to make the number of mechanical operations as small as possible,
we use the following method.
First, we tessellate
the range $[0,\pi]$ of tilt uniformly by a factor
$2m$, where $m$ is an integer. This tessellation in general
depends on the complexity of the environment.
Thus, the tilt ranges \vspace{1mm} are
$[0, \alpha), \ldots, [(i-1)\alpha,i\alpha), \ldots, [ \pi-\alpha,\pi)$,
where $\alpha = \frac{\pi}{2m}$.
Then, for each tilt range except $[0, \alpha)$ and $[ \pi-\alpha,\pi)$,
we tessellate the range $[0,2\pi)$ of pan.
In the tessellation, the amount of change of pan $\Delta_{i}$ for
tilt range $[(i-1)\alpha,i\alpha)$ is:
\[ \Delta_{i} =
\left\{ \begin{array}{ll}
2 \arcsin( \frac{\sin\frac{\alpha}{2}}{\sin(i\alpha)} )
&
\mbox{ if $ i \alpha \leq \frac{\pi}{2} $ } \\
2 \arcsin( \frac{\sin\frac{\alpha}{2}}{\sin[(i-1)\alpha]}
)
&
\mbox{ if $(i-1)\alpha \geq \frac{\pi}{2} $ }
\end{array}
\right. \]
So, for tilt range $[(i-1)\alpha,i\alpha)$, the pan ranges are
$[0,\Delta_{i})$, $[\Delta_{i},2\Delta_{i})$, $\ldots$, $[n_{i}\Delta_{i},2\pi)$.
The length of each pan range
is $\Delta_{i}$, except the last one, $[n_{i}\Delta_{i},2\pi)$, which may be shorter.
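The pan increment above can be sketched in a few lines (function names are ours; angles in radians). The band boundary closest to the equator is the widest circle in the band, which is why the two cases select $\sin(i\alpha)$ or $\sin[(i-1)\alpha]$:

```python
import math

def pan_step(i, m):
    """Pan increment Delta_i for the tilt band [(i-1)*alpha, i*alpha),
    where alpha = pi / (2m).  Bands nearer a pole get a larger step so
    that patches stay roughly uniform in solid angle.  (The paper leaves
    the two polar caps, i = 1 and i = 2m, untessellated in pan.)"""
    alpha = math.pi / (2 * m)
    if i * alpha <= math.pi / 2:            # band entirely above the equator
        widest = math.sin(i * alpha)
    else:                                   # band entirely below the equator
        widest = math.sin((i - 1) * alpha)
    return 2 * math.asin(math.sin(alpha / 2) / widest)
```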
The sensed sphere constructed with the tessellation scheme described above
can be concisely represented as the
union of all solid angles:
\[
\bigcup_{ [t_{b_{i}}, t_{e_{i}}) = [(i-1)\alpha,i\alpha),i=1,\ldots,2m}\]
\[
\Big\{ \bigcup_{ [p_{b_{i,j}}, p_{e_{i,j}}) = [0,\Delta_{i}), [\Delta_{i},2\Delta_{i}),\ldots,
[n_{i}\Delta_{i}, 2\pi)} \]
\[
A_{ij}(t_{b_{i}}, t_{e_{i}}; p_{b_{i,j}}, p_{e_{i,j}}; r_{ij}) \Big\}
\]
where $r_{ij}$ is the length of the radius along the direction
$tilt= \frac{t_{b_{i}}+t_{e_{i}}}{2}$, $pan=\frac{p_{b_{i,j}}+p_{e_{i,j}}}{2}$.
$ [t_{b_{i}}, t_{e_{i}})$, $[p_{b_{i,j}},p_{e_{i,j}})$ and $r_{ij}$
give the range of the $A_{ij}$,
$t_{b_{i}} \leq tilt < t_{e_{i}}, p_{b_{i,j}} \leq pan < p_{e_{i,j}}$.
\section{Where to look next}
We need to select $w,h,p,t,a$ for the next action.
First, we select $w,h,p,t$ for each given
recognition algorithm.
The ability of the recognition algorithm
and the value of the detection function
are influenced by the
image size of the target.
Only when the target can be brought entirely into
the field of view of the camera and its features
detected with sufficient
precision can the recognition algorithm be expected to
function correctly.
So, for an operation with a given recognition algorithm
and a fixed viewing angle,
the probability of successfully
recognizing the target is high only when the target is within a certain
distance range.
We call this range the {\bf effective range}.
Our purpose here is to select
a set of viewing angles whose effective ranges together cover the whole
depth $D$ of the search region without overlapping.
Suppose the biggest viewing angle size
for the camera is $w_{0} \times h_{0} $,
and its
effective range is $[N_{0}, F_{0}]$.
We want to select other
necessary viewing angle sizes
$w_{1} \times h_{1}, \ldots,
w_{n_{0}} \times h_{n_{0}} $ and their corresponding
effective ranges $[N_{1}, F_{1}], \ldots, [N_{n_{0}}, F_{n_{0}}]$,
such that
\( [N_{0}, F_{0}] \bigcup \ldots \bigcup [N_{n_{0}}, F_{n_{0}}]
\supseteq [N_{0},D] \)
and
\( [N_{i}, F_{i}) \bigcap [N_{j}, F_{j}) = \emptyset \)
if $i \neq j$.
To guarantee this, we have $F_{i-1}=N_{i}$, $i=1,\ldots,n_{0}$, and
the area of the image of the target patch
for angle size
$w_{i-1} \times h_{i-1}$ at $N_{i-1}$ and $F_{i-1}$
should equal the area of the image of the target patch
for angle size $w_{i} \times h_{i}$ at $N_{i}$ and $F_{i}$,
respectively. According to Section 3, we get
$w_{i} = 2\arctan[ (\frac{N_{0}}{F_{0}})^{i}\tan(\frac{w_{0}}{2}) ]$,
$h_{i} = 2\arctan[ (\frac{N_{0}}{F_{0}})^{i}\tan(\frac{h_{0}}{2}) ]$,
$N_{i} = F_{0} ( \frac{F_{0}}{N_{0}})^{i-1}$ and
$F_{i} = F_{0} ( \frac{F_{0}}{N_{0}})^{i}$,
where $1\leq i \leq n_{0}$.
Size $i$ is necessary only while
\( N_{i} < D \), which gives
$i < \frac{ \ln(\frac{D}{F_{0}})}{ \ln(\frac{F_{0}}{N_{0}})} + 1$.
So,
$n_{0} = \lceil \frac{ \ln(\frac{D}{F_{0}})}{\ln(\frac{F_{0}}{N_{0}})}
\rceil $.
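Using the recurrences above, the whole family of sizes and ranges can be generated; a sketch (function name is ours; angles in radians; $n_{0}$ is taken as the smallest $i$ with $F_{i} \geq D$):

```python
import math

def effective_angle_sizes(w0, h0, N0, F0, D):
    """Generate the angle sizes (w_i, h_i) and effective ranges
    [N_i, F_i] that cover depth D with no overlap."""
    r = F0 / N0                                   # constant ratio F_i / N_i
    n0 = max(0, math.ceil(math.log(D / F0) / math.log(r)))
    sizes = []
    for i in range(n0 + 1):
        w = 2 * math.atan((N0 / F0) ** i * math.tan(w0 / 2))
        h = 2 * math.atan((N0 / F0) ** i * math.tan(h0 / 2))
        N = N0 if i == 0 else F0 * r ** (i - 1)   # near edge: F_{i-1} = N_i
        F = F0 * r ** i                           # far edge
        sizes.append((w, h, N, F))
    return sizes
```

With $w_{0} \times h_{0} = 41^{\circ} \times 39^{\circ}$ and effective range $[1.47m, 3.01m]$, this reproduces the second size of roughly $20.7^{\circ} \times 19.7^{\circ}$ with range $[3.01m, 6.16m]$ used in the experiments of Section 7.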
The sensed sphere can be further divided into several layers
according to the effective viewing angle sizes derived above.
This layered sensed sphere $LSS$ can be represented as
$LSS = \bigcup_{k=0,\ldots,n_{0}} LSS_{\langle w_{k},h_{k} \rangle}$.
Here $LSS_{\langle w_{k},h_{k} \rangle}$ is the layer corresponding to
angle size $\langle w_{k},h_{k} \rangle$,
$LSS_{\langle w_{k},h_{k} \rangle} = \bigcup_{i,j}LA_{ij}^{\langle w_{k},h_{k} \rangle}$.
The layered solid angle slice $LA_{ij}^{\langle w_{k},h_{k} \rangle}$ is
the intersection of the solid angle $A_{ij}$ of the sensed sphere
and the current
layer $LSS_{\langle w_{k},h_{k} \rangle}$.
There is a probability $p_{ij}^{\langle w_{k},h_{k} \rangle}$ associated
with $LA_{ij}^{\langle w_{k},h_{k} \rangle}$, which gives the sum of the probabilities
of all the cubes that belong to $LA_{ij}^{\langle w_{k},h_{k} \rangle}$. (Note:
the distance of each cube to the camera center should be less than
$r_{ij}$ of $A_{ij}$.)
The following steps are used to select the viewing direction
for a given angle size $\langle w_{k},h_{k} \rangle$.
First, tessellate the sphere using the method in Section 4, with
an angle size equal to $\min\{w_{k},h_{k}\}$. Each resulting
patch corresponds to a viewing direction.
Second, calculate the sum of all $p_{ij}^{\langle w_{k},h_{k} \rangle}$
for those $LA_{ij}^{\langle w_{k},h_{k} \rangle}$ that belong to a given patch.
This is the probability for this patch.
Third, the direction $\langle p_{k},t_{k} \rangle$
whose corresponding patch has the maximum
probability is the best direction for size $\langle w_{k},h_{k} \rangle$.
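The second and third steps amount to a grouped sum followed by an argmax; a sketch (the data layout and names are our hypothetical illustration):

```python
def best_direction(slice_probs, patch_of):
    """slice_probs: dict slice_id -> probability p_ij for the current
    angle size; patch_of: callable slice_id -> patch (viewing direction).
    Returns the patch whose slices carry the most probability."""
    totals = {}
    for s, p in slice_probs.items():
        key = patch_of(s)
        totals[key] = totals.get(key, 0.0) + p
    return max(totals, key=totals.get)
```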
For each recognition algorithm, we can find $n_{0}+1$ candidates
$\langle w_{k},h_{k},p_{k},t_{k} \rangle$ ($0 \leq k \leq n_{0}$). Then,
use $P({\bf f})$ to select among them to get the best candidate
for this algorithm.
Finally, use $E({\bf f})$ to select among
the best candidates for all the recognition algorithms
so as to find the next action to be applied.
The environment needs
to be updated if the target is not found after
the selected action is applied.
\section{Where to Move Next}
The goal of the ``where to move next" task is to select the next robot
position such that, at the new position, the camera can examine those
parts of the region that have
a high probability of containing the target but that are occluded if
viewed from the previous positions.
In order to perform this task, first we need
to decide which
candidate positions the robot {\bf can reach}, then, we
need to know which one is the
{\bf most beneficial} among these reachable positions.
In order to simplify the navigation problem
(our primary task is object search, not navigation), we limit the movement
of the robot from the current position to the next position
to be on {\bf straight lines}.
We have developed an algorithm that can decide which positions
the robot can reach from the current position.
To select the next robot position among these reachable
candidate positions,
we need to know the approximate unoccluded
region
that can be seen by the camera with respect to each
reachable
candidate position.
As described in Section 4, the unoccluded region around the camera
center can be represented by a union of solid angles, the
{\em sensed sphere}.
The expected sensed sphere $SS_{e}$ refers to the sensed sphere
that can be constructed at the reachable
candidate positions.
$SS_{e}$ can be {\bf calculated} by using the boundary of the search
region and the solidity property of the cubes in the region
(see \cite{YimingTech} for details).
Each candidate position is associated with an expected sensed sphere.
For each $SS_{e}$, we can calculate the probability
of presence of the target by summing the probabilities of all the cubes
that are within $SS_{e}$:
\( \sum_{c_{i} \in SS_{e}} {\bf p}(c_{i}) \).
We then select
the position whose corresponding expected
sensed sphere has the maximum probability of presence of the target
as the next robot position.
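A sketch of this selection (the data layout is our hypothetical illustration):

```python
def best_next_position(candidate_spheres, p):
    """candidate_spheres: dict position -> cubes inside the expected
    sensed sphere at that reachable position; p: dict cube -> probability
    of target presence.  Returns the position whose expected sensed
    sphere covers the most probability mass."""
    return max(candidate_spheres,
               key=lambda pos: sum(p[c] for c in candidate_spheres[pos]))
```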
After the robot moves to the next position, it will begin the
``where to look next" process again to search for the target.
At this point, we must consider the inaccuracy of the robot movement.
We have proposed a strategy to incorporate this inaccuracy
into the robot's knowledge
about the environment and the target distribution
(see \cite{YimingTech} for details).
\section{Experiments}
We assume that only one recognition algorithm is available for
all the experiments.
The experiments were performed in our lab using the
ARK robot.
The task is to search for a white baseball within the
region shown in Figure 1 (a)(b).
Four criteria are used
by the recognition algorithm:
intensity threshold $I_{0}=119$; minimum blob size $B_{min}=250$
pixels; maximum blob size $B_{max}=1375$ pixels; and
roundness threshold
$R_{0}=0.91$. The algorithm first generates a binary image by thresholding
the
image using $I_{0}$. Then it performs morphological and region-growing operations.
Only a white blob whose size is between $B_{min}$ and $B_{max}$ and whose roundness is greater
than $R_{0}$ is taken as the target.
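One way the final acceptance test might be coded (a sketch only; the real algorithm also performs the thresholding, morphological, and region-growing steps first, which are not shown here):

```python
def accept_blob(size, roundness, B_min=250, B_max=1375, R0=0.91):
    """A white blob surviving the I0 = 119 threshold is taken as the
    target only if its pixel count lies in [B_min, B_max] and its
    roundness exceeds R0."""
    return B_min <= size <= B_max and roundness > R0
```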
Two angle sizes are needed to search the region: $41^{\circ} \times 39^{\circ}$ and
$20.7^{\circ} \times 19.7^{\circ}$.
Their effective ranges are $[1.47m,3.01m]$ and $[3.01m,6.16m]$,
respectively.
Experimental results are listed in Table 1, which
gives the average
number of actions needed to find the ball for the planning and non-planning
strategies. Here (A) means that the region corresponding to
table surface A is given a high probability
at the beginning; similarly for (B), (C), and (ABC).
From the second and third row of Table 1,
we can see that the planning strategy
is much more efficient when our knowledge about the distribution
is correct at the beginning.
From the fourth row of Table 1, we can see that the performance
of the planning strategy is not very satisfactory when we have
misleading information at the beginning.
One of the above test results is shown in
Figure 2 (a)(b)(c)(d)(e).
\vspace{-0.35in}
\[ \]
\begin{tabular}{| l | l | l | l | } \hline \hline
{\bf Target Pos.} & A & C & A, B or C \\ \hline \hline
{\bf Plan}$(^{correct}_{knowledge})$&(A)1 & (C)1 & (ABC)2.5 \\ \hline
{\bf No Plan} & 8 & 10 & 9 \\ \hline
{\bf Plan}$(^{incorrect}_{knowledge})$&(C)7 & (B) 11.5 & (C)4.3 \\ \hline
\end{tabular}
\vspace{-0.25in}
\[ \]
\hspace{35mm}Table 1
\begin{figure}[!htbp]
\centerline{
\psfig{figure=Platform.ps,width=1.15in,height=1.15in}
\psfig{figure=real_robot_env.x.ps,width=1.15in,height=1.15in}
}
\centerline{\makebox[1.15in]{(a)} \makebox[1.15in]{(b)}}
\centerline{
\psfig{figure=image2.ps,width=1.15in,height=1.15in}
\hspace{-4.5mm} \psfig{figure=image1.ps,width=1.15in,height=1.15in}
}
\centerline{\makebox[2.5in]{(c)}}
Figure 1: The search agent and the search environment.
(a): The ARK robot. (b): Top view of the search region, where
A, B, C, E are table surfaces. The Laser Eye is on E.
(c): Composite image of the region from
position D of (b).
\end{figure}
\begin{figure}[!htbp]
\centerline{
\psfig{figure=sphere.ps,width=1.15in,height=1.15in}
\psfig{figure=image_3.ps,width=1.15in,height=1.15in}
}
\centerline{\makebox[1.15in]{(a)}
\makebox[1.15in]{(b)}}
\centerline{
\psfig{figure=image_2.ps,width=1.15in,height=1.15in}
\psfig{figure=image_1.ps,width=1.15in,height=1.15in}
}
\centerline{ \makebox[1.15in]{(c)} \makebox[1.15in]{(d)} }
\centerline{
\psfig{figure=output_image.ps,width=1.15in,height=1.15in}
\psfig{figure=outimage.ps,width=1.15in,height=1.15in}
}
\centerline{ \makebox[1.15in]{(e)}
\makebox[1.15in]{(f)}
}
Figure 2: Testing the ``where to look next" strategy.
(a): Sensed sphere from Laser Eye.
(b,c,d): One of the image sequences of the real
experiment by using the planning strategy, where the target is assumed
on A, B, or C and we give (ABC) a high probability at the beginning.
Although the ball appeared in the first image (b), the algorithm
failed to detect it because it is outside the effective
range of the action (size $41^{\circ}\times39^{\circ}$). The third action
((d), size $20.7^{\circ}\times19.7^{\circ}$)
found the target. (e): The image of (d) after region growing, etc.
(f): The result of the image analysis of (e), where the target is
detected.
\end{figure}
Experiments are also
performed to test the ``where to move next"
strategy.
The camera differs from the one used in the previous
experiments; a single camera angle size is enough to examine the
search region.
Before the search process, we give the table top {\bf B}, the
table top
{\bf C},
and the floor region {\bf D} a
high probability of presence (Figure 1(b)).
The real position
of the baseball is within the region {\bf D}.
Initially, the robot is at $P_{1}$ of Figure 1(b).
Figure 3(a)(b)(c)(d) shows the first four images taken by the head.
The first action examines table {\bf
C}; the second and third
actions examine table {\bf B}.
The fourth action simply points upward,
and none of the specified regions is
examined.
No action is selected to check region {\bf D}, because {\bf D}
is occluded
by the office wall.
Since the probability of detecting the target by the fourth action
is very low, the robot drives to the next position $P_{2}$
(Figure 1(b)). $P_{2}$ is selected by the
``where to move next" algorithm.
At the new position, the robot begins the search process again.
The first action detects the target.
Figure 3(f)(g)(h) shows the image analysis results, where
the
target is found (Figure 3(h)).
\begin{figure}[!htbp]
\centerline{
\psfig{figure=/scr/image_1.ps,width=1.15in,height=1.15in}
\psfig{figure=/scr/image_2.ps,width=1.15in,height=1.15in}
}
\centerline{ \makebox[1.15in]{(a)} \makebox[1.15in]{(b)}}
\centerline{
\psfig{figure=/scr/image_3.ps,width=1.15in,height=1.15in}
\psfig{figure=/scr/image_5.ps,width=1.15in,height=1.15in}
}
\centerline{ \makebox[1.15in]{(c)} \makebox[1.15in]{(d)} }
\centerline{
\psfig{figure=/scr/next_image.ps,width=1.15in,height=1.15in}
\psfig{figure=/scr/DividedImage.ps,width=1.15in,height=1.15in}
}
\centerline{ \makebox[1.15in]{(e)} \makebox[1.15in]{(f)}}
\centerline{
\psfig{figure=/scr/output_image.ps,width=1.15in,height=1.15in}
\psfig{figure=/scr/outimage.ps,width=1.15in,height=1.15in}
}
\centerline{ \makebox[1.15in]{(g)} \makebox[1.15in]{(h)} }
Figure 3. Testing the ``where to move next" strategy.
(a,b,c,d): The top four images at the first position.
(e): The first action at the second position.
(f,g,h): The image analysis results at the second position, where
the target is detected.
\end{figure}
\section{Conclusion}
In this paper, we formulate the sensor planning task
for object search, and present a practical strategy
for this task.
By introducing the concepts of sensed sphere
and layered sensed sphere, we are able to decompose
the huge space of possible sensing actions into a finite
set of actions that must be tried,
and to select the
next action from this finite set.
By combining
the detecting ability of a recognition algorithm
with the knowledge of the probability distribution
of the target, we argue that the object
search problem is quite different from the
exploration problem.
The concept of a detection function for a recognition
algorithm may find applications in other vision tasks,
such as serving as an evaluation metric in the comparison
of recognition algorithms.
The theory has been applied to hardware such as a
platform equipped with a camera and
a laser range finder.
Experiments so far have been successful as a proof of
concept.
We would like to apply the theory
to other kinds of tasks and hardware, such as
searching for a target
on a cluttered table top
using a stereo camera that is controlled by
a robot arm.
\vspace{4mm}
\noindent
{\Large\bf Acknowledgments}
This work was supported by the IBM Centre for Advanced Studies and
the ARK (Autonomous Robot for a Known environment) Project,
which receives its funding from PRECARN Associates Inc.,
Industry Canada, the National Research Council
of Canada, Technology Ontario, Ontario Hydro Technologies,
and Atomic Energy of Canada Limited.
The authors are grateful to Dave Wilkes, Piotr Jasiobedzki, Victor Lee,
and Karen Bennet for their comments.
\vspace{4mm}
\noindent
{\Large\bf About the Author}
Yiming Ye is a Ph.D.\ candidate in the Department of Computer Science,
University of Toronto, and a Research Fellow student at the
IBM Canada Centre for Advanced Studies.
He has published papers in the areas of
internet agents and the WWW, computer vision and image processing,
robotics, computational geometry, machine translation,
computational complexity,
and knowledge-based systems
in journals and conference proceedings
such as:
{\em Proceedings of CASCON96},
{\em Proceedings of the
Second International Conference on Multiagent Systems (ICMAS96)},
{\em Proceedings of the
International Symposium on Intelligent Robotic Systems (SIRS96)},
{\em Proceedings of 1995
IEEE International Symposium for Computer Vision (ISCV95)},
{\em Proceedings of the
4th International Symposium on Artificial Intelligence
and Mathematics (AIM95)},
{\em Proceedings of the
Second Asian Symposium on Computer Mathematics (ASCM96)},
{\em Proceedings of the
IJCAI Workshop for handicapped Children},
{\em Proceedings of 1990 ACM Eighteenth Annual Computer Science
Conference},
and {\em Science in China}.
His current research interests include
internet agents, image processing, computer vision,
robotics, and multimedia.
He can be reached at
yiming@vis.toronto.edu.
John K. Tsotsos is a professor in the Department of Computer Science,
University of Toronto, and
a Fellow of the Canadian Institute for Advanced Research.
He can be reached at tsotsos@vis.toronto.edu.
\bibliography{active,sort}
\end{document}