Abstracts
Contextual Weighting for Vocabulary Tree based Image Retrieval
In this paper we address the problem of image retrieval from millions of
database images. We improve the vocabulary tree based approach by
introducing contextual weighting of local features in both descriptor
and spatial domains. Specifically, we propose to incorporate efficient
statistics of neighbor descriptors both on the vocabulary tree and in
the image spatial domain into the retrieval. These contextual cues
substantially enhance the discriminative power of individual local
features with very small computational overhead. We have conducted
extensive experiments on benchmark datasets, namely the UKbench,
Holidays, and our new Mobile dataset, which show that our method achieves
state-of-the-art performance with much less computation. Furthermore,
the proposed method demonstrates excellent scalability in terms of both
retrieval accuracy and efficiency on large-scale experiments using 1.26
million images from the ImageNet database as distractors.
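As a rough illustration of where such per-feature weights enter a retrieval pipeline, the sketch below scores database images by TF-IDF voting over a quantized visual vocabulary, with a weight hook where contextual statistics would multiply each vote. This is a minimal Python sketch, not the authors' implementation: the vocabulary-tree quantization is abstracted into precomputed word ids, and all names are hypothetical.

import numpy as np
from collections import defaultdict

def build_index(db_words, n_words):
    # db_words: list of per-image integer arrays of visual-word ids
    index = defaultdict(list)              # word id -> [(image id, count), ...]
    df = np.zeros(n_words)                 # document frequency per word
    for img_id, words in enumerate(db_words):
        ids, counts = np.unique(words, return_counts=True)
        df[ids] += 1
        for w, c in zip(ids, counts):
            index[w].append((img_id, c))
    idf = np.log((len(db_words) + 1.0) / (df + 1.0))
    return index, idf

def score(query_words, query_weights, index, idf, n_images):
    # query_weights: one weight per query feature; uniform weights recover
    # plain TF-IDF, and this is where contextual weights would plug in
    scores = np.zeros(n_images)
    for w, cw in zip(query_words, query_weights):
        for img_id, count in index.get(w, []):
            scores[img_id] += cw * idf[w] * count
    return scores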
Learning from Partial Labels
We address the problem of partially-labeled multiclass classification,
where instead of a single label per instance, the algorithm is given
a candidate set of labels, only one of which is correct. Our setting is
motivated by a common scenario in many image and video collections, where
only partial access to labels is available. The goal is to learn a
classifier that can disambiguate the partially-labeled training
instances and generalize to unseen data. We define an intuitive
property of the data distribution that sharply characterizes the ability
to learn in this setting, and show that effective learning is possible
even when all the data is only partially labeled. Exploiting this
property of the data, we propose a convex learning formulation based on
minimization of a loss function appropriate for the partial label
setting. We analyze the conditions under which our loss function is
asymptotically consistent, as well as its generalization and transductive
performance. We apply our framework to identifying faces culled from web
news sources and to naming characters in TV series and movies; in
particular, we annotated and experimented on a very large video dataset,
achieving 6% error for character naming on 16 episodes of the TV
series Lost.
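As a minimal sketch of one convex surrogate consistent with this description (the paper's exact loss may differ), the snippet below rewards the average classifier score inside the candidate set and penalizes every score outside it, using the squared hinge as the convex decreasing function; since both terms are convex in the scores, the overall objective stays convex.

import numpy as np

def squared_hinge(z):
    # convex, decreasing upper bound on the 0/1 step loss
    return np.maximum(0.0, 1.0 - z) ** 2

def partial_label_loss(scores, candidates):
    # scores: (n_classes,) classifier scores for one instance
    # candidates: boolean mask of the candidate label set Y
    inside = scores[candidates].mean()     # encourage the set as a whole
    outside = scores[~candidates]          # penalize every non-candidate
    return squared_hinge(inside) + squared_hinge(-outside).sum()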
Large-scale image classification: fast feature extraction and SVM training
Most research efforts on image classification so far have been focused
on medium-scale datasets, which are often defined as datasets that can
fit into the memory of a desktop (typically 4 GB to 48 GB). There are two main
reasons for the limited effort on large-scale image classification.
First, until the emergence of ImageNet dataset, there was almost no
publicly available large-scale benchmark data for image classification.
This is mostly because class labels are expensive to obtain. Second,
large-scale classification is hard because it poses more challenges than
its medium-scale counterparts. A key challenge is how to achieve
efficiency in both feature extraction and classifier training without
compromising performance. This paper shows how we address this
challenge, using the ImageNet dataset as an example. For feature extraction,
we develop a Hadoop scheme that performs feature extraction in parallel
using hundreds of mappers. This allows us to extract fairly
sophisticated features (with dimensionality in the hundreds of thousands) on
1.2 million images within one day. For SVM training, we develop a
parallel averaging stochastic gradient descent (ASGD) algorithm for
training one-against-all 1000-class SVM classifiers. The ASGD algorithm
is capable of dealing with terabytes of training data and converges very
fast; typically 5 epochs are sufficient. As a result, we achieve
state-of-the-art performance on the ImageNet 1000-class classification
task: 52.9% classification accuracy and 71.8% top-5 hit rate.
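For concreteness, here is a minimal single-machine sketch of averaged SGD for one binary hinge-loss SVM (Pegasos-style step sizes with a running Polyak average); the parallel ASGD and the one-against-all 1000-class setup are not reproduced, and the parameter values are illustrative only. Training one such classifier per class against all others gives the one-against-all scheme.

import numpy as np

def asgd_svm(X, y, lam=1e-4, epochs=5):
    # X: (n, d) features, y: (n,) labels in {-1, +1}
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # standard decaying step size
            margin = y[i] * X[i].dot(w)
            w *= 1.0 - eta * lam           # gradient of the L2 regularizer
            if margin < 1.0:               # subgradient of the hinge loss
                w += eta * y[i] * X[i]
            w_avg += (w - w_avg) / t       # running (Polyak) average
    return w_avg                           # the average, not the last iterate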
Talking Pictures: Temporal Grouping and Dialog-Supervised Person Recognition
We address the character identification problem in movies and television
videos: assigning names to faces on the screen. Most prior work on
person recognition in video assumes some supervised data such as
screenplay or hand-labeled faces. In this paper, our only source
of 'supervision' is dialog cues: first-, second-, and third-person
references (such as "I'm Jack", "Hey, Jack!", and "Jack left").
While this kind of supervision is sparse and indirect, we exploit
multiple modalities and their interactions (appearance, dialog, mouth
movement, synchrony, continuity-editing cues) to effectively resolve
identities through local temporal grouping followed by global weakly
supervised recognition. We propose a novel temporal grouping model that
partitions face tracks across multiple shots while respecting
appearance, geometric and film-editing cues and constraints. In this
model, states represent partitions of the k most recent face tracks, and
transitions represent compatibility of consecutive partitions. We
present dynamic programming inference and discriminative learning for
the model. The individual face tracks are subsequently assigned a name
by learning a classifier from partial label constraints. The weakly
supervised classifier incorporates multiple-instance constraints from
dialog cues as well as soft grouping constraints from our temporal
grouping. We evaluate both the temporal grouping and final character
naming on several hours of TV and movies.
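The partition inference described above is a first-order max-sum dynamic program; below is a generic sketch of that DP in Python. In the actual model the states would enumerate partitions of the k most recent face tracks and the pairwise term their compatibility; here both are abstracted into given score matrices, so the code illustrates only the inference pattern.

import numpy as np

def viterbi(unary, pairwise):
    # unary: (T, S) per-step state scores; pairwise: (S, S) transition scores
    T, S = unary.shape
    best = unary[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = best[:, None] + pairwise    # score of (previous, current) pairs
        back[t] = cand.argmax(axis=0)      # best predecessor of each state
        best = cand.max(axis=0) + unary[t]
    path = [int(best.argmax())]            # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]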
Weakly Supervised Learning from Multiple Modalities: Exploiting Video, Audio and Text for Video Understanding
As web and personal content become ever more enriched by videos, there
is an increasing need for semantic video search and indexing. A main
challenge for this task is the lack of supervised data for learning models.
In this dissertation we propose weakly supervised algorithms for video
content analysis, focusing on recovering video structure, retrieving
actions and identifying people. Key components of the algorithms we
present are (1) alignment between multiple modalities: video, audio and
text, and (2) unified convex formulation for learning under weak
supervision from easily accessible data.
At a coarse level, we focus on the task of recovering scene structure in
movies and TV series. We present a weakly supervised algorithm that
parses a movie into a hierarchy of scenes, threads and shots. Movie
scene boundaries are aligned with screenplay scenes and shots are
reordered into threads. We present a unified generative model and a
novel hierarchical dynamic programming inference procedure.
At a finer level, we aim at resolving person identity in video using
images, screenplay and closed captions. We consider a
partially-supervised multiclass classification setting where each
instance is labeled ambiguously with more than one label. The set of
potential labels for each face consists of the character names mentioned
in the corresponding screenplay scene. We propose a novel convex
formulation based on minimization of a surrogate loss. We provide
theoretical analysis and strong empirical evidence that effective
learning is possible even when all examples are ambiguously labeled.
We also investigate the challenging scenario of naming people in video
without a screenplay. Our only source of (indirect) supervision is person
references mentioned in dialog, such as "Hey, Jack!". We resolve
identities by learning a classifier from partial label constraints,
incorporating multiple-instance constraints from dialog, gender and
local grouping constraints, in a unified convex learning formulation.
Grouping constraints are provided by a novel temporal grouping model
that integrates appearance, synchrony and film-editing cues to partition
faces across multiple shots. We present dynamic programming inference
and discriminative learning for this partitioning model.
We have deployed our framework on hundreds of hours of movies and TV,
and present quantitative and qualitative results for each component.
Learning from Ambiguously Labeled Images
In many image and video collections, we have access only to partially
labeled data. For example, personal photo collections often contain
several faces per image and a caption that only specifies who is in the
picture, but not which name matches which face. Similarly, movie
screenplays can tell us who is in the scene, but not when and where they
are on the screen. We formulate the learning problem in this setting as
partially-supervised multiclass classification where each instance is
labeled ambiguously with more than one label. We show
theoretically that effective learning is possible under reasonable
assumptions even when all the data is weakly labeled. Motivated by the
analysis, we propose a general convex learning formulation based on
minimization of a surrogate loss appropriate for the ambiguous label
setting. We apply our framework to identifying faces culled from
web news sources and to naming characters in TV series and movies. We
experiment on a very large dataset consisting of 100 hours of video, and
in particular achieve 6% error for character naming on 16 episodes
of LOST.
Movie/Script: Alignment and Parsing of Video and Text Transcription
Movies and TV are a rich source of diverse and complex video of people,
objects, actions and locales “in the wild”. Harvesting automatically
labeled sequences of actions from video would enable creation of
large-scale and highly varied datasets. To enable such collection, we
focus on the task of recovering scene structure in movies and TV series
for object tracking and action retrieval. We present a weakly supervised
algorithm that uses the screenplay and closed captions to parse a movie
into a hierarchy of shots and scenes. Scene boundaries in the movie are
aligned with screenplay scene labels and shots are reordered into a
sequence of long continuous tracks or threads, which allow for more
accurate tracking of people, actions and objects. Scene segmentation,
alignment, and shot threading are formulated as inference in a unified
generative model, solved by a novel hierarchical dynamic programming
algorithm that handles alignment and jump-limited reorderings in linear
time. We present quantitative and qualitative results on movie
alignment and parsing, and use the recovered structure to improve
character naming and retrieval of common actions in several episodes of
popular TV series.
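As a toy illustration of the alignment layer alone, the dynamic program below assigns shots to screenplay scenes under a monotonicity constraint (scene labels may only stay or advance); it ignores thread reordering and jump-limited moves, and the similarity matrix, e.g. derived from dialog and closed-caption overlap, is assumed given.

import numpy as np

def align_shots_to_scenes(sim):
    # sim: (n_shots, n_scenes) similarity between each shot and each scene
    T, S = sim.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = sim[0]
    for t in range(1, T):
        stay = dp[t - 1]                                       # same scene
        advance = np.concatenate(([-np.inf], dp[t - 1][:-1]))  # next scene
        back[t] = (advance > stay).astype(int)
        dp[t] = np.maximum(stay, advance) + sim[t]
    s = int(dp[-1].argmax())                                   # backtrack
    labels = [s]
    for t in range(T - 1, 0, -1):
        s -= back[t][s]
        labels.append(s)
    return labels[::-1]                                        # scene id per shot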
Recognizing Objects by Piecing Together the Segmentation Puzzle
We present an algorithm that recognizes objects of a given category using a small number of hand-segmented images as references. Our method first over-segments an input image into superpixels, and then finds a shortlist of optimal combinations of superpixels that best fit one of the template parts, under affine transformations. Second, we develop a contextual interpretation of the parts, gluing image segments using top-down fiducial points and checking overall shape similarity. In contrast to previous work, the search for candidate superpixel combinations is not exponential in the number of segments, and in fact leads to a very efficient detection scheme. Both the storage and the detection of templates require only space and time proportional to the length of the template boundary, allowing us to store potentially millions of templates and to detect a template anywhere in a large image in roughly 0.01 seconds. We apply our algorithm to the Weizmann horse database and show that our method is comparable to the state of the art, while offering a simpler and more efficient alternative to previous work.
Solving Markov Random Fields with Spectral Relaxation
Markov Random Fields (MRFs) are used in a large array of computer vision
and machine learning applications. Finding the maximum a posteriori (MAP)
solution of an MRF is in general intractable, and one has to resort to
approximate solutions, such as Belief Propagation, Graph Cuts, or more
recently, approaches based on quadratic programming. We propose a novel
type of approximation, Spectral relaxation to Quadratic Programming
(SQP). We show that our method offers tighter bounds than recently
published work, while at the same time being computationally efficient. We compare
our method to other algorithms on random MRFs in various settings.
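To give a flavor of the approach, the toy sketch below writes MAP over node-label indicator variables x as maximization of x'Wx, replaces the integrality constraint with ||x|| = 1 (solved exactly by the leading eigenvector), and rounds the result. The paper's SQP relaxation is tighter than this bare spectral step; the sketch only illustrates the core idea of spectral relaxation.

import numpy as np

def spectral_map_relaxation(W, n_nodes, n_labels):
    # W: (n_nodes*n_labels)^2 matrix of unary (diagonal) and pairwise
    # potentials over node-label indicators, so MAP is max_x x' W x
    vals, vecs = np.linalg.eigh(W)
    v = vecs[:, -1]                        # eigenvector of largest eigenvalue
    if v.sum() < 0:                        # fix the global sign ambiguity
        v = -v
    # round the relaxed solution: each node takes its highest-scoring label
    return v.reshape(n_nodes, n_labels).argmax(axis=1)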
Learning Spectral Graph Segmentation
We present a general graph learning algorithm for spectral graph
partitioning that allows direct supervised learning of graph structures
using hand-labeled training examples. The learning algorithm is based on
gradient descent in the space of all feasible graph weights. Computation
of the gradient involves finding the derivatives of eigenvectors with
respect to the graph weight matrix. We show that these derivatives exist
and can be computed in exact analytical form using the theory of
implicit functions. Furthermore, we show that, in a simple case, the
gradient converges exponentially fast. In the image segmentation domain,
we demonstrate how to encode a top-down, high-level object prior in a
bottom-up shape detection process.
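The eigenvector derivative in question has a closed form when the eigenvalue is simple: differentiating Wv = lambda*v and projecting out the eigendirection gives dlambda = v'(dW)v and dv = pinv(lambda*I - W)(dW)v. The small numerical sketch below checks these formulas against finite differences; the variable names are illustrative.

import numpy as np

def eigvec_derivative(W, dW):
    # top eigenpair of symmetric W, perturbed along symmetric dW
    lams, V = np.linalg.eigh(W)
    lam, v = lams[-1], V[:, -1]
    dlam = v @ dW @ v
    dv = np.linalg.pinv(lam * np.eye(len(W)) - W) @ dW @ v
    return dlam, dv

# finite-difference check of the analytic derivatives
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); W = A + A.T
B = rng.standard_normal((5, 5)); dW = B + B.T
eps = 1e-6
l0, V0 = np.linalg.eigh(W)
l1, V1 = np.linalg.eigh(W + eps * dW)
v1 = V1[:, -1] * np.sign(V1[:, -1] @ V0[:, -1])   # align eigenvector signs
dlam, dv = eigvec_derivative(W, dW)
print(np.allclose((l1[-1] - l0[-1]) / eps, dlam, atol=1e-4))
print(np.allclose((v1 - V0[:, -1]) / eps, dv, atol=1e-4))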
@inproceedings{Wang:iccv11,
author = "Wang, X. and Yang, M. and Cour, T. and Zhu, S. and Yu, K. and Han, T.X.",
title = "Contextual Weighting for Vocabulary Tree based Image Retrieval",
booktitle = "IEEE International Conference on Computer Vision (ICCV'11)",
year = "2011"
}
@article{Cour:jmlr11,
author = "Timothee Cour and Benjamin Sapp and Ben Taskar",
title = "Learning from Partial Labels",
journal = "JMLR",
year = "2011"
}
@inproceedings{lin11:_large,
author = {Yuanqing Lin and Fengjun Lv and Shenghuo Zhu and Kai Yu and Ming Yang and Timothee Cour},
title = {Large-scale image classification: fast feature extraction and SVM training},
booktitle = {CVPR'11: IEEE Conference on Computer Vision and Pattern Recognition},
year = 2011,
note = {to appear}
}
@inproceedings{Cour:cvpr10,
author= "Timothee Cour and Ben Sapp and Akash Nagle and Ben Taskar",
title= "Talking Pictures: Temporal Grouping and Dialog-Supervised Person Recognition",
booktitle= "IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'10)",
year= "2010"
}
@PHDTHESIS{Cour:thesis,
author = {Timothee Cour},
title = {Weakly Supervised Learning from Multiple Modalities: Exploiting Video, Audio and Text for Video Understanding},
school = {University of Pennsylvania},
year = {2009}
}
@inproceedings{Cour:cvpr09,
author= "Timothee Cour and Ben Sapp and Chris Jordan and Ben Taskar",
title= "Learning from Ambiguously Labeled Images",
booktitle= "IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'09)",
year= "2009"
}
@inproceedings{Cour:eccv08,
author= "Timothee Cour and Chris Jordan and Eleni Miltsakaki and Ben Taskar",
title= "Movie/Script: Alignment and Parsing of Video and Text Transcription",
booktitle= "Proceedings of 10th European Conference on Computer Vision, Marseille, France",
year= "2008"
}
@inproceedings{Cour:cvpr07,
author= "Timothee Cour and Jianbo Shi",
title= "Recognizing objects by piecing together the Segmentation Puzzle",
booktitle= "IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'07)",
year= "2007"
}
@inproceedings{Cour:aistats07,
author= "Timothee Cour and Jianbo Shi",
title= "Solving Markov Random Fields with Spectral Relaxation",
booktitle= "Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics",
volume= "11",
year= "2007"
}
@incollection{Cour:nips06,
author = {Timothee Cour and Praveen Srinivasan and Jianbo Shi},
title = {Balanced Graph Matching},
booktitle = {Advances in Neural Information Processing Systems 19},
editor = {B. Sch\"{o}lkopf and J.C. Platt and T. Hofmann},
publisher = {MIT Press},
address = {Cambridge, MA},
year = {2007}
}
@inproceedings{Cour:cvpr05,
author = {Timothee Cour and Florence Benezit and Jianbo Shi},
title = {Spectral Segmentation with Multiscale Graph Decomposition},
booktitle = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2},
year = {2005},
isbn = {0-7695-2372-2},
pages = {1124--1131},
doi = {10.1109/CVPR.2005.332},
publisher = {IEEE Computer Society},
address = {Washington, DC, USA},
}
@inproceedings{Cour:aistats05,
author = "Timothee Cour and Nicolas Gogin and Jianbo Shi",
title = "Learning Spectral Graph Segmentation",
booktitle = "Proceedings of the 10th International Workshop on
Artificial Intelligence and Statistics",
year = "2005"
}
@techreport{Cour:TR04,
author = "Timothee Cour and Jianbo Shi",
title = "A Learnable Spectral Memory Graph for Recognition and Segmentation",
institution = "University of Pennsylvania CIS Technical Reports",
month = "June",
year = "2004",
number = "MS-CIS-04-12",
address = "Philadelphia, PA"
}