
Bag-of-words model in computer vision

Image classification model
In computer vision, the bag-of-words (BoW) model, sometimes called bag-of-visual-words model (BoVW), can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

Image representation based on the BoW model

To represent an image using the BoW model, the image can be treated as a document. Similarly, "words" in images need to be defined too. Achieving this usually involves three steps: feature detection, feature description, and codebook generation (Fei-Fei Li and Perona, CVPR 2005). A definition of the BoW model can be the "histogram representation based on independent features". Content-based image retrieval (CBIR) appears to be the early adopter of this image representation technique.

Feature representation

After feature detection, each image is abstracted by several local patches. Feature representation methods deal with how to represent the patches as numerical vectors. These vectors are called feature descriptors. A good descriptor should be able to handle intensity, rotation, scale and affine variations to some extent. One of the most famous descriptors is the scale-invariant feature transform (SIFT) (Lowe, ICCV 1999).

Codebook generation

The final step for the BoW model is to convert vector-represented patches into "codewords" (analogous to words in text documents), which also produces a "codebook" (analogous to a word dictionary). A codeword can be considered as a representative of several similar patches. One simple method is performing k-means clustering over all the vectors (Leung and Malik, IJCV 2001).

Thus, each patch in an image is mapped to a certain codeword through the clustering process and the image can be represented by the histogram of the codewords.
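The clustering and mapping steps above can be sketched as follows. The 2-D toy descriptors and the codebook size of 3 are illustrative assumptions; real pipelines cluster, for example, 128-dimensional SIFT descriptors into thousands of codewords:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "descriptors": 2-D patch vectors drawn around three centers
# (stand-ins for real local feature descriptors such as SIFT).
descriptors = np.vstack([
    rng.normal(loc=c, scale=0.1, size=(50, 2))
    for c in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
])

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns (codebook, assignments)."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest codeword
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each codeword to the mean of its assigned vectors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

codebook, _ = kmeans(descriptors, k=3)

def bow_histogram(image_descriptors, codebook):
    """Map each patch to its nearest codeword and count occurrences."""
    d = np.linalg.norm(image_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))

hist = bow_histogram(descriptors, codebook)
```

The resulting `hist` is the image's BoW representation: a V-dimensional count vector, exactly as in the document-classification analogy.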

Learning and recognition based on the BoW model

Computer vision researchers have developed several learning methods to leverage the BoW model for image-related tasks, such as object categorization. These methods can roughly be divided into two categories: unsupervised and supervised models. For the multi-class categorization problem, the confusion matrix can be used as an evaluation metric.
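For instance, a confusion matrix for a small multi-class run can be tallied directly from true and predicted labels; the labels below are made up for illustration:

```python
# Assumed toy labels for a 3-category problem.
true = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
k = 3

# cm[t][p] counts examples of true class t predicted as class p.
cm = [[0] * k for _ in range(k)]
for t, p in zip(true, pred):
    cm[t][p] += 1

# Overall accuracy is the diagonal mass over the total count.
accuracy = sum(cm[i][i] for i in range(k)) / len(true)
```

Off-diagonal entries show which categories are confused with which, which is more informative than accuracy alone.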

Unsupervised models

Some notation for this section: suppose the size of the codebook is V.

  • w: each patch w is a V-dimensional vector with a single component equal to one and all other components equal to zero (in the k-means clustering setting, the single nonzero component indicates the cluster that w belongs to). The vth codeword in the codebook can be represented as w^v = 1 and w^u = 0 for u \neq v.
  • \mathbf{w}: each image is represented by \mathbf{w} = [w_1, w_2, \cdots, w_N], the collection of all patches in the image
  • d_j: the jth image in an image collection
  • c: category of the image
  • z: theme or topic of the patch
  • \pi: mixture proportion
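The indicator representation of a patch can be sketched directly; V = 5 here is an arbitrary toy codebook size:

```python
import numpy as np

V = 5  # toy codebook size

def one_hot(v, V):
    """Patch assigned to codeword v -> V-dimensional indicator vector w."""
    w = np.zeros(V, dtype=int)
    w[v] = 1
    return w

# An image is the collection of its N patch vectors w_1 ... w_N.
image = np.stack([one_hot(v, V) for v in [2, 2, 0, 4]])

# Summing the indicators recovers the BoW histogram of the image.
counts = image.sum(axis=0)
```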

Since the BoW model in computer vision is analogous to the BoW model in natural language processing, generative models developed for text can also be adapted to computer vision. The simple Naive Bayes model and hierarchical Bayesian models are discussed below.

Naive Bayes

The simplest one is the Naive Bayes classifier. Using the language of graphical models, the Naive Bayes classifier is described by the equation below. The basic idea (or assumption) of this model is that each category has its own distribution over the codewords, and that the distributions of the categories are observably different. Take a face category and a car category as an example. The face category may emphasize the codewords which represent "nose", "eye" and "mouth", while the car category may emphasize the codewords which represent "wheel" and "window". Given a collection of training examples, the classifier learns the distribution of each category. The categorization decision is then made by

c^* = \arg\max_c p(c|\mathbf{w}) = \arg\max_c p(c)\,p(\mathbf{w}|c) = \arg\max_c p(c)\prod_{n=1}^{N} p(w_n|c)
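A minimal sketch of this decision rule, evaluated in log space for numerical stability; the two categories, the four-codeword vocabulary, and all probabilities are hand-picked assumptions for illustration:

```python
import numpy as np

# Assumed priors p(c) and per-category codeword distributions p(w|c)
# over a toy vocabulary of V = 4 codewords.
log_prior = {"face": np.log(0.5), "car": np.log(0.5)}
log_pw = {
    "face": np.log([0.4, 0.4, 0.1, 0.1]),  # emphasizes codewords 0, 1
    "car":  np.log([0.1, 0.1, 0.4, 0.4]),  # emphasizes codewords 2, 3
}

def classify(word_ids):
    """c* = argmax_c  log p(c) + sum_n log p(w_n | c)."""
    scores = {c: log_prior[c] + log_pw[c][list(word_ids)].sum()
              for c in log_prior}
    return max(scores, key=scores.get)

label = classify([0, 0, 1, 2])  # an image dominated by "face" codewords
```

Working in log space turns the product over patches into a sum, avoiding underflow when N is large.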

Since the Naive Bayes classifier is simple yet effective, it is usually used as a baseline method for comparison.

Hierarchical Bayesian models

The basic assumption of the Naive Bayes model does not always hold. For example, a natural scene image may contain several different themes. Probabilistic latent semantic analysis (pLSA) (Hofmann, UAI 1999; Sivic et al., ICCV 2005) and latent Dirichlet allocation (LDA) (Blei, Ng and Jordan, 2003) are two popular topic models from text domains that tackle this multiple-"theme" problem. When these models are adapted to images:

  • the image category is mapped to the document category;
  • the mixture proportion of themes is mapped to the mixture proportion of topics;
  • the theme index is mapped to the topic index;
  • the codeword is mapped to the word.

This method showed very promising results on the 13 Natural Scene Categories dataset.

Supervised models

Since images are represented based on the BoW model, any discriminative model suitable for text document categorization can be tried, such as the support vector machine (SVM) and AdaBoost. This approach has achieved very impressive results in the PASCAL Visual Object Classes Challenge.

Pyramid match kernel

The pyramid match kernel (Grauman and Darrell, ICCV 2005) maps a set of features to multi-resolution histograms and computes a weighted histogram intersection, counting the feature matches that newly appear as the bins coarsen. Because it satisfies Mercer's condition and can be computed in time linear in the number of features, it can be used directly with kernel classifiers such as the SVM, while partially recovering the geometric information that plain BoW discards.
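A minimal sketch of a pyramid match on 1-D feature values, assuming bins that double in width at each level and match weights that halve at each coarser level; the point values are made up for illustration:

```python
import numpy as np

def histogram(points, level, lo=0.0, hi=1.0):
    """Histogram of 1-D points; bin width doubles at each level."""
    bins = max(1, 2 ** (4 - level))   # level 0: 16 bins ... level 4: 1 bin
    h, _ = np.histogram(points, bins=bins, range=(lo, hi))
    return h

def pyramid_match(x, y, levels=5):
    """Weighted count of newly matched features as bins coarsen:
    matches found at finer resolutions count more (weight 1/2^i)."""
    k, prev = 0.0, 0.0
    for i in range(levels):
        inter = np.minimum(histogram(x, i), histogram(y, i)).sum()
        k += (inter - prev) / (2 ** i)   # new matches at this level
        prev = inter
    return k

x = np.array([0.1, 0.4, 0.8])
y = np.array([0.12, 0.45, 0.9])
score = pyramid_match(x, y)
```

A set matched against itself scores its own size (all matches occur at the finest level), so nearby feature sets score close to that maximum.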

Limitations and recent developments

One of the notorious disadvantages of BoW is that it ignores the spatial relationships among the patches, which are very important in image representation. Researchers have proposed several methods to incorporate this spatial information. At the feature level, correlogram features can capture spatial co-occurrences of features (Savarese, Winn and Criminisi, CVPR 2006).

The BoW model has not yet been extensively tested for viewpoint invariance and scale invariance, so its performance in these settings is unclear. The suitability of the BoW model for object segmentation and localization is also not well understood.

A systematic comparison of classification pipelines found that encoding first- and second-order statistics (the Vector of Locally Aggregated Descriptors (VLAD) and the Fisher Vector (FV)) considerably increased classification accuracy compared to BoW, while also decreasing the codebook size and thus the computational effort of codebook generation. Moreover, a detailed 2017 comparison of coding and pooling methods for BoW showed that second-order statistics combined with sparse coding and an appropriate pooling such as power normalisation can further outperform Fisher Vectors and even approach the results of simple Convolutional Neural Network models on some object recognition datasets such as the Oxford Flowers 102 dataset.
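A minimal sketch of VLAD encoding, which replaces BoW's occurrence counts with sums of residuals between descriptors and their nearest codewords; the tiny codebook and descriptors are made up for illustration:

```python
import numpy as np

def vlad(descriptors, codebook):
    """VLAD: for each codeword, sum the residuals of the descriptors
    assigned to it, concatenate the sums, then L2-normalize."""
    k, d = codebook.shape
    v = np.zeros((k, d))
    # nearest codeword for every descriptor
    dist = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    for j in range(k):
        if np.any(nearest == j):
            v[j] = (descriptors[nearest == j] - codebook[j]).sum(axis=0)
    v = v.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.1]])
encoding = vlad(desc, codebook)
```

The encoding has dimension k x d (codebook size times descriptor dimension), so a much smaller codebook suffices compared to plain BoW histograms.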

References

  1. Sivic, J.; Zisserman, A. (2003). "Video Google: A Text Retrieval Approach to Object Matching in Videos".
  2. Csurka, G.; Dance, C.; Fan, L.; Willamowski, J.; Bray, C. (2004). "Visual categorization with bags of keypoints".
  3. (2002). "Indexing chromatic and achromatic patterns for content-based colour image retrieval". Pattern Recognition.
  4. Zhang, J.; Marszałek, M.; Lazebnik, S.; Schmid, C. (2007). "Local Features and Kernels for Classification of Texture and Object Categories: a Comprehensive Study". International Journal of Computer Vision.
  5. Jianchao Yang. (2009). "2009 IEEE Conference on Computer Vision and Pattern Recognition".
  6. (2005). "Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1".
  7. (2006). "2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR'06)".
  8. (2013-05-01). "Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection". Computer Vision and Image Understanding.
  9. (2017-02-24). "Higher-order occurrence pooling for bags-of-words: Visual concept detection". IEEE Transactions on Pattern Analysis and Machine Intelligence.
  10. (2010-06-01). "2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition".
  11. (2017-02-24). "Plant species classification using flower images—A comparative study of local feature representations". PLOS ONE.

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.
