Abstract

In this blog, I want to share my current understanding of the Cobweb algorithm: a human-like learning algorithm, focusing on its computational aspect, i.e. what does this approach claim as being computed by the function of human category learning, with a hint of algorithmic aspect in the last section. In other words, this is about what should be implemented, not how to implement them. This is not meant to be a comprehensive introduction/significance statement of Cobweb, which should be done sufficiently well by existing literature (e.g., [5-7] [9-13]). Instead, it aims to unify partial information across various literatures and clarify core spirits.

Content

Why Cobweb

It is not controversial to say that the back-propagation gradient descent algorithm is a learning algorithm not designed to model human learning behaviors. Neural networks are universal function approximators, not developed as adaptive intelligent systems. Accordingly, an algorithm that enables this over-parameterized approximator to fit an extremely complicated curve does not necessarily provide a satisfying answer to the hard problem of learning.

Human learning has the following characteristics that are not exhibited in the current major machine-learning paradigm: sample-efficient, incremental, and continual learning (without catastrophic forgetting). Cobweb [6], on the other hand, is one of a few early models that support incremental learning by design. Empirical study also shows promising results in its anti-forgetting ability [7] and sample efficiency [7].

Arguably, according to [8], any algorithm that cannot easily scale up will not beat algorithms that are designed to utilize rapidly growing compute resources, e.g. deep learning, in terms of “performance” in unconstrained-compute settings in the long term. However, many real-world settings have various constraints, e.g. a home robot should be able to adapt to new environments efficiently and learn domain knowledge without massive trial-and-error. When enough prior knowledge is present, an intelligent agent should be able to learn a lot about a novel domain from a little new information. Therefore, cobweb serves as a preliminary exploration and primitive solution to this question.

Instance and Probabilistic Category Representation

For an object (instance) $obj$, it is assumed to be characterized by multiple attribute-value pairs. Denote the set of attributes/features as $A$. For each attribute $A_i \in A$, its value can be $v_{ij} \in V_i$, where $V_i$ is the value range of $A_i$. $A_i$ can be nominal ($|V_i| < \infty$ with no ordering), numeric ($V_i \subset \R$), component ($V_i = \{ f | f: A \to \cup_i V_i \}$), or relational ($V_i = A \times R \times A$, where $R$ is the set of relations).

A (probabilistic) category/concept/class $c$ contains multiple attribute and probabilistic distribution pairs. Any algorithm that solely operates over such a probabilistic description assumes that attributes are mutually independent. By default, discrete features are modelled using categorical distribution or multinomial distribution.

An object instance representation (in the middle) containing nominal, numeric, component, and relational attributes. The category description (on the right) stores the probabilistic distribution of each attribute. From [5].

An object instance representation (in the middle) containing nominal, numeric, component, and relational attributes. The category description (on the right) stores the probabilistic distribution of each attribute. From [5].

Assumptions

Category utility: a heuristic measure of category learning

This section will start from the question of measuring the utility gain due to learning a concept. Given the category utility hypothesis by [14], a quantitative answer termed as category utility is reviewed and its relation with basic-level effect in human category learning is discussed. Then it is applied to the setting of measuring the utility gain due to learning a refined partition of a concept, termed as partition utility, which leads to Cobweb [6]. Through Bayes’ theorem, it is connected to an equivalant form that captures the trade-off of intra-class similarity and inter-class dissimilarity in measuring goodness of category clustering. Lastly, the modern viewpoint of information gain is introduced (due to [14]) and is applied to Cobweb (as in [7]).

Problem formulation: category goodness

What is the value added due to learning, knowing, and using a category? In this formulation, a category $c$ is learned iff the probabilistic description of $c$ is known and there is a learned perfect test T s.t. for any object $obj$, $T(obj)=\top \iff obj \in c$.