Self-Supervised Learning



In recent years, self-supervised learning (SSL) has gained significant attention in the field of artificial intelligence (AI) and machine learning (ML). This approach aims to extract useful information from unlabelled data by having models generate their own training signal from the data itself. In this article, we will delve deeper into self-supervised learning, its benefits, and how it has been applied in natural language processing (NLP) and computer vision.

Supervised vs Self-supervised Learning

Supervised learning has been the go-to approach in machine learning, where models are trained on labelled data. This approach works well when there is abundant labelled data, but it has its limitations. One of the biggest challenges of supervised learning is the availability of labelled data, which can be scarce, noisy, and expensive to obtain. Additionally, models trained using supervised learning often struggle to generalise to related but out-of-distribution (OOD) tasks, as the models tend to focus more on learning a direct mapping between the input and output rather than understanding the underlying structure of the data.

On the other hand, unsupervised learning involves using unlabelled data to learn about the underlying structure or patterns in the data, and self-supervised learning is a subcategory of unsupervised learning. Self-supervised learning can be accomplished without explicit human supervision, as the model is trained to learn from a specific aspect of the data by predicting a certain part of the input using another part of the input. This can be achieved by masking a portion of the input and training the model to predict the missing portion.

The most significant advantage of self-supervised learning is its ability to leverage the vast amount of unlabelled data available, resulting in more general representations of the underlying structure of the data. These representations can then be fine-tuned to a variety of downstream tasks, making self-supervised learning a highly flexible approach. By learning more general representations, self-supervised learning models can better capture the underlying structure of the data and thus generalize better to related tasks.

In short, self-supervised learning is a promising approach for tackling unsupervised learning tasks and improving the generalization performance of models. In the next section, we will look at the motivation behind self-supervised learning in more detail.

Motivation behind the need for self-supervised learning

Deep learning has made remarkable strides over the past decade, achieving human-level or even surpassing human-level performance in complex tasks such as machine translation, image recognition, speech recognition, and more. However, as we push the boundaries of deep learning, we have begun to recognize the fundamental limitations of current approaches. One of the key differences between humans and current AI is the speed at which we learn. Humans can learn things significantly faster than machines, and the reason behind it is “common sense.”

Common sense is a combination of perception, motor skills, and basic physics that humans use to navigate the world. Humans learn mostly by observation, and this is the inspiration behind self-supervised learning, where models learn to predict the masked part of the data from the unmasked part. Self-supervised learning is a form of unsupervised learning that leverages the vast amounts of unlabeled data available to learn useful representations of the data.

One of the main motivations behind self-supervised learning is to address the problem of limited availability of labeled data. Labeling data is a laborious and expensive process, and it is not feasible to label all the data in the world. Self-supervised learning allows us to leverage the abundance of unlabeled data to learn more robust and generalizable representations of the data. By learning to predict some aspect of the input data, the model is forced to focus on the most salient features of the data, which can lead to better generalization performance.

To make machines with human-like common sense, we need to create a world model that can take advantage of the knowledge acquired over time to help navigate the world. With the development of self-supervised learning and other emerging techniques, we are one step closer to achieving this goal.

Self-supervised Learning in NLP and Computer Vision

Self-supervised learning has been widely adopted in both natural language processing (NLP) and computer vision (CV). In NLP, the technique involves hiding part of the text and predicting it from the surrounding text. This has become a popular approach, and most state-of-the-art models, such as BERT, RoBERTa, XLM-R, GPT-2, and GPT-3, are trained this way. Since the prediction of a missing word can take only discrete values, it is relatively easy to generate a probability distribution over the 10-20k words in the vocabulary.
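To make the NLP objective concrete, here is a minimal sketch of masked-token prediction. The toy model, the mask token id, and the vocabulary size are illustrative placeholders, not the actual BERT setup:

```python
import torch
import torch.nn.functional as F

vocab_size = 20000                                  # assume a 10-20k word vocabulary
token_ids = torch.randint(0, vocab_size, (1, 12))   # a toy tokenised sentence
mask_id = 0                                         # hypothetical [MASK] token id

# Mask one position and ask the model to predict it
masked = token_ids.clone()
masked[0, 5] = mask_id

# `model` stands in for any network mapping token ids to per-position logits
# over the vocabulary (e.g. a Transformer encoder); here a toy placeholder.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)
logits = model(masked)                              # shape: (1, 12, vocab_size)

# Cross-entropy on the masked position: a discrete distribution over the vocabulary
loss = F.cross_entropy(logits[0, 5].unsqueeze(0), token_ids[0, 5].unsqueeze(0))
```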

However, self-supervised learning in computer vision presents a different set of challenges. In CV, we are dealing with high-dimensional continuous objects. For instance, a 10×10 masked patch of an image can potentially take 256¹⁰⁰ values across a single channel, making the space of possibilities astronomically large. Speech recognition presents similar challenges. Unlike NLP, it is not possible to enumerate every single possibility and then pick the most probable predictions. Therefore, the problem seems intractable in computer vision.

Using a Siamese network or joint embedding architecture –

There are some ways around this problem. For instance, one approach involves using a Siamese network or joint embedding architecture to calculate the similarity between two images. The two neural networks used here can be exactly the same, partially shared, or completely different.

The idea is to train the Siamese network to calculate the similarity between two images while holding the following properties:

1) Images that are similar/compatible should return higher similarity scores.
2) Images that are different/incompatible should return lower similarity scores.

Achieving point 1 is easy: augment the image in different ways (e.g., cropping, color enhancement, rotation, shifting, etc.) and then ask the Siamese network to learn similar representations for the original and augmented images. Since we are no longer doing predictive modeling, the model’s output is not compared to a fixed target; instead, the outputs of the two encoders are compared to each other. This makes learning the representations quite flexible.
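As a rough sketch of this joint-embedding setup (the toy encoder and the horizontal-flip “augmentation” below are illustrative placeholders, not a specific published architecture):

```python
import torch
import torch.nn.functional as F

# `encoder` stands in for any backbone (e.g. a ResNet); this toy net is a placeholder.
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, 32),
)

def similarity(view_a, view_b):
    # Both views pass through the (shared) encoder; the two embeddings are
    # compared to each other rather than to a fixed target.
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    return (z_a * z_b).sum(dim=-1)                 # cosine similarity, one score per image

original = torch.rand(4, 3, 96, 96)                # a toy batch of images
augmented = torch.flip(original, dims=[-1])        # a simple horizontal-flip augmentation
scores = similarity(original, augmented)           # training pushes these towards high values
```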

But achieving point 2 is tricky because, without an incentive, the network can learn the same representation for every image irrespective of the input. This failure is known as collapse (sometimes called mode collapse). To combat this issue, there are two families of approaches:

A) Contrastive Learning
B) Non-Contrastive Learning

 

Contrastive Learning

Contrastive learning is a type of unsupervised learning that has gained a lot of attention in recent years for its effectiveness in pre-training deep neural networks. It is based on the idea that the model should learn to differentiate between similar and dissimilar pairs of data points.

The intuition behind contrastive learning is that if we can teach a neural network to distinguish between similar and dissimilar pairs of data points, it can learn useful representations of the input data that can be used for downstream tasks such as classification or object detection.

Let me describe the intuition behind two of the key algorithms in contrastive learning:

1) Momentum Contrast

Momentum Contrast (MoCo) is a popular contrastive learning algorithm that was introduced by He et al. in 2019. It uses a momentum-based update rule: a second, slow-moving “key” encoder whose parameters are an exponential moving average of the main encoder’s parameters over time. Keeping this encoder slow-moving makes the representations it produces consistent across training steps, which smooths out the learning process and improves the quality of the learned representations.

The Momentum Contrast algorithm works by creating two different views of each input data point, and then trying to maximize the similarity between them. One view is created by randomly applying data augmentations to the input image, while the other view is created by applying a different set of data augmentations to the same image.

The model is then trained to learn representations that are similar for the two views of the same image, but dissimilar for different images. The momentum-based update rule helps to smooth out the learning process and improve the quality of the learned representations.
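A minimal sketch of that momentum (exponential moving average) update follows; the toy linear encoders and the 0.999 coefficient are illustrative, and MoCo’s queue of negative keys is omitted:

```python
import copy
import torch

query_encoder = torch.nn.Linear(128, 64)        # placeholder for the real backbone
key_encoder = copy.deepcopy(query_encoder)      # momentum (key) encoder starts as a copy

momentum = 0.999                                # typical value; illustrative

@torch.no_grad()
def momentum_update():
    # The key encoder is an exponential moving average of the query encoder,
    # so the keys it produces change slowly and stay consistent over time.
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data = momentum * k_param.data + (1.0 - momentum) * q_param.data
```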

2) SimCLR

SimCLR is another popular contrastive learning algorithm that was introduced by Chen et al. in 2020. It is based on the idea of maximizing the agreement between two different views of the same input data point.

SimCLR uses a contrastive (NT-Xent) loss function that encourages the model to learn representations that are similar for the two views of the same image, but dissimilar for the other images in the batch. The algorithm works by creating two differently augmented views of each input data point and then maximizing the agreement between them.
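A condensed sketch of that contrastive loss is shown below, assuming z1 and z2 are the projected embeddings of two augmented views of the same batch; the temperature and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def simclr_loss(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) embeddings of two views of the same images
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)      # (2N, dim)
    sim = z @ z.t() / temperature                             # pairwise similarities
    n = z1.shape[0]
    # Mask out self-similarity so an example is never its own "negative"
    sim.fill_diagonal_(float("-inf"))
    # For row i, the positive is the other view of the same image
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = simclr_loss(torch.randn(8, 32), torch.randn(8, 32))
```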

SimCLR has been shown to be highly effective for pre-training deep neural networks, and has achieved state-of-the-art results on a wide range of benchmark datasets.

Limitations of Contrastive Learning

One of the main limitations of contrastive learning is that it requires a large amount of data to be effective. The algorithm relies on contrasting similar pairs against many dissimilar pairs, which in practice means very large batches or memory banks of negatives; when little data or few negatives are available, the quality of the learned representations degrades.

Another limitation is that contrastive learning can be computationally expensive, especially for large datasets. This is because the algorithm requires a lot of pairwise comparisons between different data points, which can be time-consuming and memory-intensive.

Interesting Side-Effect of Data Augmentation

Data augmentation is a technique used in deep learning to increase the size of the training dataset by applying various transformations to the input data. These transformations can include rotations, translations, and other types of geometric transformations, as well as changes in brightness, contrast, and color.
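For instance, a typical augmentation pipeline might look like the following torchvision sketch; the specific transforms and parameter values are arbitrary choices for illustration:

```python
from torchvision import transforms

# An illustrative augmentation pipeline; the exact transforms and parameters vary by method.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),            # random crop and rescale
    transforms.RandomHorizontalFlip(),            # geometric transformation
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomRotation(degrees=15),        # small random rotations
    transforms.ToTensor(),
])

# Applying `augment` twice to the same PIL image yields two different "views":
# view_1, view_2 = augment(image), augment(image)
```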

One interesting side-effect of data augmentation is that it can help to improve the generalization performance of the model. This is because the augmented data provides a wider range of examples for the model to learn from, which can help to reduce overfitting and improve the model’s ability to generalize to new data.

Another interesting side-effect of data augmentation is that it can help to increase the robustness of the model to various types of perturbations and adversarial attacks. This is because the augmented data introduces a wider range of variations and distortions that the model must learn to recognize and classify correctly.

Non-Contrastive Learning

Non-contrastive learning is a type of self-supervised learning in which the model learns from positive pairs only, without using any explicit negative (dissimilar) examples. In other words, instead of contrasting an example against many other examples, non-contrastive models are trained to produce consistent representations for different views of the same input, with other mechanisms used to prevent collapse.

The key intuition behind non-contrastive learning is that by forcing the model to predict the representation of one view of the input from another view, it is implicitly learning a representation that captures the underlying structure of the data. This is because the model needs to understand the relationships between different parts of the input in order to make these predictions accurately.

One popular family of non-contrastive learning algorithms is based on the idea of learning representations that are invariant to certain transformations of the input data. For example, in image data, the model can learn to be invariant to changes in lighting, rotation, or scale, which can help to improve generalization performance.

One interesting side-effect of data augmentation is that it can help to increase the invariance of the learned representations. This is because data augmentation increases the diversity of the training examples, which forces the model to learn to be invariant to a wider range of transformations. For example, if we apply random rotations to an image during training, the model will learn to recognize objects regardless of their orientation.

The success of non-contrastive learning models that learn from positive pairs only (i.e., without any explicit negative examples) can be attributed to the fact that they force the model to learn a rich representation that captures the underlying structure of the data. This is in contrast to contrastive learning models, which require explicitly constructed negative pairs, something that can be limiting in some scenarios.

Two popular algorithms in the family of non-contrastive learning are Barlow Twins and BYOL.

 

Barlow Twins –

Barlow Twins is based on the idea of constraining the similarity between two different views of the same input data, where the views are created by applying different random transformations to the input data. The goal is to make the representations of the two views highly correlated, while at the same time keeping the representations of different inputs uncorrelated. This helps the model to learn to capture the underlying structure of the data, while also ensuring that the learned representation is invariant to the applied transformations.

(Figure: the Barlow Twins architecture)

The loss function is quite intuitive:

L_BT = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} (C_ij)²

Here C is the cross-correlation matrix between the embeddings of the two distorted images. Remember, we are not using any negative examples here!

The first term, the invariance term, is minimized when all the C_ii are 1, i.e., the diagonal elements of the cross-correlation matrix are 1. This makes the embeddings invariant to the distortions, as the correlation between the two views is strengthened. The second term, the redundancy-reduction term, forces the off-diagonal values towards 0, i.e., it de-correlates the different dimensions of the embeddings. As a result, the model learns non-redundant information about the sample while ignoring the distortions.
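A compact sketch of this loss in code (the standardisation step, batch size, and the λ weight of 5e-3 are illustrative):

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    # z1, z2: (batch, dim) embeddings of two distorted views of the same images
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)        # standardise each embedding dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / n                     # cross-correlation matrix, shape (dim, dim)

    invariance = ((1 - torch.diagonal(c)) ** 2).sum()           # push C_ii towards 1
    off_diag = c - torch.diag_embed(torch.diagonal(c))
    redundancy = (off_diag ** 2).sum()                           # push C_ij (i != j) towards 0
    return invariance + lam * redundancy

loss = barlow_twins_loss(torch.randn(16, 64), torch.randn(16, 64))
```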

BYOL –

BYOL, on the other hand, is based on the idea of an “online” network with a predictor head and a “target” network. During training, the online network is trained to predict the target network’s representation of another augmented view of the same image, where the target network is a slower-moving (exponential moving average) copy of the online network. The key idea is that predicting the target’s representation helps to ensure that the learned representation captures the underlying structure of the data, without needing any negative examples.
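A rough sketch of that setup (toy linear networks stand in for the real encoder, projector, and predictor; the momentum value is illustrative and the symmetrised version of the loss is omitted):

```python
import copy
import torch
import torch.nn.functional as F

online = torch.nn.Linear(128, 64)            # placeholder online encoder
predictor = torch.nn.Linear(64, 64)          # predicts the target's embedding
target = copy.deepcopy(online)               # slow-moving copy of the online network

def byol_step(view_1, view_2, momentum=0.996):
    # The online network predicts the target network's representation of the other view;
    # the target receives no gradients (stop-gradient) and is updated by EMA instead.
    p = F.normalize(predictor(online(view_1)), dim=-1)
    with torch.no_grad():
        t = F.normalize(target(view_2), dim=-1)
    loss = 2 - 2 * (p * t).sum(dim=-1).mean()        # equivalent to a normalised MSE

    # EMA update keeps the target a slower-moving copy of the online network
    with torch.no_grad():
        for o_param, t_param in zip(online.parameters(), target.parameters()):
            t_param.data = momentum * t_param.data + (1 - momentum) * o_param.data
    return loss

loss = byol_step(torch.randn(8, 128), torch.randn(8, 128))
```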

In summary, non-contrastive learning is a powerful paradigm for self-supervised learning that learns consistent representations of different views of the same input without relying on negative pairs. This paradigm has been successfully used in a variety of applications, such as image and speech recognition. Barlow Twins and BYOL are two popular algorithms in this family that have shown promising results in recent research.


Reference – Grill et al., “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning” (BYOL), https://arxiv.org/pdf/2006.07733.pdf
