<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Home</title>
<link>https://shonczinner.github.io/</link>
<atom:link href="https://shonczinner.github.io/index.xml" rel="self" type="application/rss+xml"/>
<description>Shon Czinner&#39;s blog about AI, statistics, quantitative finance, economics, and more</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Mon, 04 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>The 90-year-old idea behind JEPA models: Canonical Correlation Analysis (CCA)</title>
  <dc:creator>Shon Czinner</dc:creator>
  <link>https://shonczinner.github.io/posts/embedding-prediction/</link>
  <description><![CDATA[ 




<section id="introduction" class="level1">
<h1>Introduction</h1>
<blockquote class="blockquote">
<p>Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions.</p>
</blockquote>
<p>This is the first sentence from the paper “Relations Between Two Sets of Variates” <span class="citation" data-cites="hotelling1936">(Hotelling 1936)</span> by statistician and economist Harold Hotelling. This paper introduced Canonical Correlation Analysis (CCA). In modern terminology, “CCA is used to find a common signal among two large matrices” <span class="citation" data-cites="bykhovskaya2025">(Bykhovskaya and Gorin 2025)</span>.</p>
<p>In JEPA, the objective is the same, except that the second data matrix is simply a different view of the data in the first (e.g.&nbsp;via data augmentation or spatial or temporal proximity). One of the recent papers to acknowledge this connection states that “JEPA-based models implicitly perform a non-linear generalization of Canonical Correlation Analysis” <span class="citation" data-cites="huang2026">(Huang 2026)</span>.</p>
<p>CCA’s connection to JEPA is relevant to Schmidhuber’s dispute with Yann LeCun over <a href="https://people.idsia.ch/~juergen/who-invented-jepa.html">who invented JEPA</a>. Personally, I think Hotelling deserves the credit for the idea of maximizing correlation in embedding space.</p>
<p>Of course, the CCA model has many differences from JEPA.</p>
<p>For one, CCA does not enforce a shared encoder. But the biggest difference is that CCA is linear. Non-linear neural variants of CCA have been studied, with the earliest use of the term “Deep CCA” appearing in <span class="citation" data-cites="andrew2013">(Andrew et al. 2013)</span>.</p>
<p>Connecting JEPA models back to their CCA roots is genuinely useful. Another Deep CCA paper <span class="citation" data-cites="benton2017">(Benton et al. 2017)</span> relaxed the assumption of two sets of variables to an arbitrary number, based on a generalization of CCA proposed in 1961 <span class="citation" data-cites="horst1961">(Horst 1961)</span>. Conceivably, JEPAs could be expanded to handle more than two views as well.</p>
</section>
<section id="cca-vs.-jepa-overview" class="level1">
<h1>CCA vs.&nbsp;JEPA Overview</h1>
<section id="cca" class="level2">
<h2 class="anchored" data-anchor-id="cca">CCA</h2>
<p>Suppose we have zero-mean matrices <img src="https://latex.codecogs.com/png.latex?X=(x_1,...,x_n)%5ET%5Cin%20%5Cmathbb%20R%5E%7Bn%5Ctimes%20d_x%7D"> and <img src="https://latex.codecogs.com/png.latex?Y=(y_1,...,y_n)%5ET%5Cin%5Cmathbb%20R%5E%7Bn%5Ctimes%20d_y%7D">.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?k%5Cleq%20%5Cmin(d_x,d_y,%20n)"> and <img src="https://latex.codecogs.com/png.latex?A%5Cin%20%20%5Cmathbb%20R%5E%7Bd_x%5Ctimes%20k%7D"> and <img src="https://latex.codecogs.com/png.latex?B%5Cin%20%20%5Cmathbb%20R%5E%7Bd_y%5Ctimes%20k%7D"> so that <img src="https://latex.codecogs.com/png.latex?XA=z_x%5Cin%5Cmathbb%20R%5E%7Bn%20%5Ctimes%20k%7D"> and <img src="https://latex.codecogs.com/png.latex?YB=z_y%5Cin%5Cmathbb%20R%5E%7Bn%20%5Ctimes%20k%7D">.</p>
<p>CCA solves the following maximization problem,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmax_%7BA,B%7D%20%5Ctext%7Btr%7D%5Cleft(%5Cfrac%7B1%7D%7Bn%7Dz_x%5ETz_y%5Cright)%20"> <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bs.t%7D"> <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7Dz_x%5ETz_x=%5Cfrac%7B1%7D%7Bn%7Dz_y%5ETz_y=I"></p>
<p>This maximizes the trace of the cross-correlation matrix while constraining each set of embeddings to be whitened: every embedding dimension has unit variance, and the dimensions are mutually uncorrelated.</p>
<p>Similar to the equivalence between maximizing variance and minimizing reconstruction error in PCA, there is a relationship between the trace of the cross-correlation matrix and the embedding prediction error,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi=1%7D%5En%20%7C%7Cz_x%5E%7B(i)%7D-z_y%5E%7B(i)%7D%7C%7C%5E2=%5Cfrac%7B1%7D%7Bn%7D%7C%7Cz_x-z_y%7C%7C_F%5E2=%20%5Cfrac%7B1%7D%7Bn%7D%5Ctext%7Btr%7D(z_x%5ETz_x)%20+%20%5Cfrac%7B1%7D%7Bn%7D%5Ctext%7Btr%7D(z_y%5ETz_y)%20-%20%5Cfrac%7B2%7D%7Bn%7D%5Ctext%7Btr%7D(z_x%5ETz_y)"> And due to the whitening constraints, <img src="https://latex.codecogs.com/png.latex?=2k-%20%5Cfrac%7B2%7D%7Bn%7D%5Ctext%7Btr%7D(z_x%5ETz_y)"></p>
<p>So maximizing the trace of the cross-correlation under the whitening constraints is equivalent to minimizing the MSE of the embedding representations. Therefore we can write CCA as,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%7BA,B%7D%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi=1%7D%5En%20%7C%7Cz_x%5E%7B(i)%7D-z_y%5E%7B(i)%7D%7C%7C%5E2"> <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bs.t%7D"> <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7Dz_x%5ETz_x=%5Cfrac%7B1%7D%7Bn%7Dz_y%5ETz_y=I"></p>
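<p>The classical solution can be computed in closed form from an SVD of the whitened cross-covariance matrix. Below is a minimal numpy sketch (the data and dimensions are synthetic, chosen only for illustration) that also checks the whitening constraints and the MSE identity above numerically:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_y, k = 500, 5, 4, 3

# Two views sharing a common latent signal, plus independent noise
latent = rng.standard_normal((n, k))
X = latent @ rng.standard_normal((k, d_x)) + 0.5 * rng.standard_normal((n, d_x))
Y = latent @ rng.standard_normal((k, d_y)) + 0.5 * rng.standard_normal((n, d_y))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# CCA: SVD of the whitened cross-covariance
U, s, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
A = inv_sqrt(Sxx) @ U[:, :k]    # d_x x k
B = inv_sqrt(Syy) @ Vt[:k].T    # d_y x k
z_x, z_y = X @ A, Y @ B

# Whitening constraints hold: (1/n) z^T z = I
assert np.allclose(z_x.T @ z_x / n, np.eye(k), atol=1e-6)
assert np.allclose(z_y.T @ z_y / n, np.eye(k), atol=1e-6)

# MSE identity: (1/n)||z_x - z_y||_F^2 = 2k - (2/n) tr(z_x^T z_y)
mse = np.sum((z_x - z_y) ** 2) / n
assert np.isclose(mse, 2 * k - 2 * np.trace(z_x.T @ z_y) / n)
```

<p>The singular values of the whitened cross-covariance are exactly the canonical correlations, so truncating the SVD at <img src="https://latex.codecogs.com/png.latex?k"> components maximizes the trace objective above.</p>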
</section>
<section id="jepa" class="level2">
<h2 class="anchored" data-anchor-id="jepa">JEPA</h2>
<p>Adopting the previous notation, JEPA requires <img src="https://latex.codecogs.com/png.latex?d_x=d_y=d"> because both views pass through a shared encoder (the joint embedding). In JEPA, we have the encoder <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta:%5Cmathbb%20R%5E%7Bd%7D%5Crightarrow%20%5Cmathbb%20R%5Ek">, and the predictor <img src="https://latex.codecogs.com/png.latex?g_%5Cvarphi:%5Cmathbb%20R%5E%7Bk%7D%5Crightarrow%20%5Cmathbb%20R%5Ek">.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?z_x%5E%7B(i)%7D=g_%5Cvarphi(f_%5Ctheta(x_i))">, <img src="https://latex.codecogs.com/png.latex?z_y%5E%7B(i)%7D=f_%5Ctheta(y_i)">.</p>
<p>Then we solve,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%7B%5Ctheta,%5Cvarphi%7D%5Cfrac%7B1%7D%7Bn%7D%20%5Csum_%7Bi=1%7D%5En%20%7C%7Cz_x%5E%7B(i)%7D-z_y%5E%7B(i)%7D%7C%7C%5E2"></p>
<p>Note the similarity in the objective function, but also the lack of whitening constraints. Without them, the objective admits representational and dimensional collapse: a trivial solution to the above problem is the constant embedding <img src="https://latex.codecogs.com/png.latex?z_x%5E%7B(i)%7D=z_y%5E%7B(i)%7D=c">.</p>
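<p>To make the collapse concrete, here is a tiny synthetic numpy illustration (purely for intuition): constant embeddings drive the prediction MSE to zero while the embedding covariance degenerates to zero.</p>

```python
import numpy as np

n, k = 200, 4
c = np.arange(1.0, k + 1.0)        # any fixed vector

# Collapsed solution: every sample maps to the same embedding c
z_x = np.tile(c, (n, 1))
z_y = np.tile(c, (n, 1))

# The JEPA objective is perfectly minimized...
mse = np.mean(np.sum((z_x - z_y) ** 2, axis=1))
assert mse == 0.0

# ...but the whitening constraint (1/n) z^T z = I is nowhere near satisfied:
centered = z_x - z_x.mean(axis=0)
cov = centered.T @ centered / n
assert np.allclose(cov, 0.0)        # zero variance in every dimension
```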
<p>As discussed in my <a href="../../posts/sigreg-sketched-isotropic-gaussian-regularization/">previous blog post</a>, SIGReg <span class="citation" data-cites="balestriero2025">(Balestriero and LeCun 2025)</span> fixes this problem by encouraging the embeddings <img src="https://latex.codecogs.com/png.latex?z_x"> and <img src="https://latex.codecogs.com/png.latex?z_y"> to follow an isotropic (i.e.&nbsp;unit-variance, uncorrelated) Gaussian distribution. As a result, it encourages</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7Dz_x%5ETz_x=%5Cfrac%7B1%7D%7Bn%7Dz_y%5ETz_y=I"></p>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>As I mentioned in the introduction, Schmidhuber has debated <a href="https://people.idsia.ch/~juergen/who-invented-jepa.html">who invented JEPA</a> and said this about LeCun,</p>
<blockquote class="blockquote">
<p>Dr.&nbsp;LeCun’s heavily promoted Joint Embedding Predictive Architecture (JEPA) is the heart of his new company. However, the core ideas are not original to LeCun. Instead, JEPA is essentially identical to our 1992 Predictability Maximization system.</p>
</blockquote>
<p>Schmidhuber references Yann LeCun’s response,</p>
<blockquote class="blockquote">
<p>JEPA is merely a name for a general concept. The question is, and has always been, how do you make it work (particularly how do you prevent it from collapsing), and how do you make it work at scale with SOTA results on non-toy problems. That’s the hard part. Ideas are a dime a dozen. Making them work is what the community will give you credit for.</p>
</blockquote>
<p>Do I agree with LeCun? Yes and no.</p>
<p>Yes, because of course you will get credit for making things work, and ideas are indeed arguably “a dime a dozen”.</p>
<p>No, because the thread of citations is important for progress. If important citations are missed, whether intentionally or not, the correct thing to do is simply to add them; we’re all the better for it. The connection that JEPA models have to CCA is informative.</p>
<p>My opinion is that JEPA/Predictability Maximization models are architectural enhancements layered on top of CCA, non-linearity chief among them.</p>
<p>Ultimately, these models all have the same objective function introduced by CCA: find the transformations that result in maximal correlation between sets of multidimensional data.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-andrew2013" class="csl-entry">
Andrew, Galen, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. <span>“Deep Canonical Correlation Analysis.”</span> <em>International Conference on Machine Learning</em>, 1247–55. <a href="https://proceedings.mlr.press/v28/andrew13.html">https://proceedings.mlr.press/v28/andrew13.html</a>.
</div>
<div id="ref-balestriero2025" class="csl-entry">
Balestriero, Randall, and Yann LeCun. 2025. <em>LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics</em>. <a href="https://arxiv.org/abs/2511.08544">https://arxiv.org/abs/2511.08544</a>.
</div>
<div id="ref-benton2017" class="csl-entry">
Benton, Adrian, Huda Khayrallah, Biman Gujral, Dee Ann Reisinger, Sheng Zhang, and Raman Arora. 2017. <em>Deep Generalized Canonical Correlation Analysis</em>. <a href="https://arxiv.org/abs/1702.02519">https://arxiv.org/abs/1702.02519</a>.
</div>
<div id="ref-bykhovskaya2025" class="csl-entry">
Bykhovskaya, Anna, and Vadim Gorin. 2025. <em>Canonical Correlation Analysis: Review</em>. <a href="https://arxiv.org/abs/2411.15625">https://arxiv.org/abs/2411.15625</a>.
</div>
<div id="ref-horst1961" class="csl-entry">
Horst, Paul. 1961. <span>“Generalized Canonical Correlations and Their Application to Experimental Data.”</span> <em>Journal of Clinical Psychology</em>.
</div>
<div id="ref-hotelling1936" class="csl-entry">
Hotelling, Harold. 1936. <span>“Relations Between Two Sets of Variates.”</span> <em>Biometrika</em> 28 (3/4): 321–77. <a href="http://www.jstor.org/stable/2333955">http://www.jstor.org/stable/2333955</a>.
</div>
<div id="ref-huang2026" class="csl-entry">
Huang, Yongchao. 2026. <em>VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models</em>. <a href="https://arxiv.org/abs/2601.14354">https://arxiv.org/abs/2601.14354</a>.
</div>
</div></section></div> ]]></description>
  <category>ai</category>
  <category>jepa</category>
  <guid>https://shonczinner.github.io/posts/embedding-prediction/</guid>
  <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://shonczinner.github.io/posts/embedding-prediction/noisy_matrix_with_marginals.png" medium="image" type="image/png" height="146" width="144"/>
</item>
<item>
  <title>A Small JEPA Word Embedding Model</title>
  <dc:creator>Shon Czinner</dc:creator>
  <link>https://shonczinner.github.io/posts/small-jepa-language-model/</link>
  <description><![CDATA[ 




<p>After my <a href="../../posts/sigreg-sketched-isotropic-gaussian-regularization/">prior blog post about SIGReg</a>, I figured I’d train a small Joint-Embedding Predictive Architecture (JEPA) model to demonstrate it.</p>
<p>The paper “LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels” <span class="citation" data-cites="maes2026">(Maes et al. 2026)</span> suggested significantly reducing the complexity of JEPA models by removing stop-gradients and the exponential-moving-average encoder. This was in the context of world models and planning.</p>
<p>In this case, I’m applying JEPA to the task of creating word embeddings. Prior methodologies include Word2vec <span class="citation" data-cites="mikolov2013">(Mikolov et al. 2013)</span> which uses a log-linear model and negative sampling, MLP next-word prediction <span class="citation" data-cites="bengio2000">(Bengio et al. 2000)</span>, applying CCA to small context windows <span class="citation" data-cites="dhillon2011">(Dhillon et al. 2011)</span>, and training an autoencoder on small context windows <span class="citation" data-cites="shao2025">(Shao et al. 2025)</span>.</p>
<p>We’ll train a linear JEPA model with SIGReg on a small Shakespeare dataset to show that it learns informative embeddings. In other words, we’ll train a shared encoder that maps each word to an embedding, and a linear predictor that predicts the target word’s embedding from the context word’s embedding. It would be easy to extend this methodology to non-linear encoders and to contexts larger than single words.</p>
<section id="overview" class="level1">
<h1>Overview</h1>
<p>First we’ll take our dataset and convert it into tokens. For example,</p>
<pre><code>["to", "be", ",", "or", "not", "to", "be"] -&gt; [1, 2, 3, 4, 5, 1, 2]</code></pre>
<p>Then we create the dataset where we have context/target pairs. So in this case that would look like,</p>
<pre><code>Context 1: [1], Target 1: [2]
Context 2: [2], Target 2: [3]
...
Context 5: [5], Target 5: [1]
Context 6: [1], Target 6: [2]</code></pre>
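<p>Concretely, the context/target pairs are just adjacent token ids, which can be built with a one-position shift:</p>

```python
# Token ids for ["to", "be", ",", "or", "not", "to", "be"]
tokens = [1, 2, 3, 4, 5, 1, 2]

# Contexts are all tokens but the last; targets are all tokens but the first
contexts, targets = tokens[:-1], tokens[1:]
print(list(zip(contexts, targets)))
# [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 2)]
```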
<p>Then we create the JEPA model which uses the same embedding <img src="https://latex.codecogs.com/png.latex?f_%5Ctheta(%5Ccdot)"> for the context <img src="https://latex.codecogs.com/png.latex?x"> and target <img src="https://latex.codecogs.com/png.latex?y">, and then has predictor <img src="https://latex.codecogs.com/png.latex?g_%5Cvarphi(%5Ccdot)"> predict the target from the context. More formally,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Af_%5Ctheta(x)=h_x%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0Af_%5Ctheta(y)=h_y%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ag_%5Cvarphi(h_x)=%5Chat%20h_y%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%7BL%7D_%7BJEPA%7D(%5Ctheta,%5Cvarphi)=MSE(h_y,%5Chat%20h_y)+%5Clambda%5Ctext%7BSIGReg%7D(h_x)%0A"></p>
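<p>Before the full implementation below, here is a schematic numpy version of this loss for a linear encoder. All names and sizes are made up for illustration, and a simple moment-matching penalty (pushing batch mean and covariance toward those of a standard Gaussian) stands in for the actual SIGReg term:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, lam = 100, 8, 0.05

# f_theta: a shared embedding table; g_phi: a linear predictor
E = rng.standard_normal((vocab_size, embed_dim)) * 0.1
W = rng.standard_normal((embed_dim, embed_dim)) * 0.1

x_ids = rng.integers(0, vocab_size, size=32)   # context token ids
y_ids = rng.integers(0, vocab_size, size=32)   # target token ids

h_x = E[x_ids]           # f_theta(x)
h_y = E[y_ids]           # f_theta(y), same encoder
h_y_hat = h_x @ W        # g_phi(h_x)

mse = np.mean(np.sum((h_y_hat - h_y) ** 2, axis=1))

# Stand-in for SIGReg: push batch moments of h_x toward N(0, I)
mean_pen = np.sum(h_x.mean(axis=0) ** 2)
cov = np.cov(h_x, rowvar=False)
cov_pen = np.sum((cov - np.eye(embed_dim)) ** 2)

loss = mse + lam * (mean_pen + cov_pen)
```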
<div id="be13a26c" class="cell" data-execution_count="50">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb3-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb3-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb3-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span></code></pre></div></div>
</details>
</div>
</section>
<section id="preparing-the-data" class="level1">
<h1>Preparing The Data</h1>
<p>The dataset is a text file containing the complete works of Shakespeare, downloaded from Project Gutenberg.</p>
<div id="4ff7a761" class="cell" data-execution_count="51">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">txt_url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://www.gutenberg.org/cache/epub/100/pg100.txt"</span></span>
<span id="cb4-2">response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> requests.get(txt_url)</span>
<span id="cb4-3">txt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> response.text</span>
<span id="cb4-4">txt[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>]</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display" data-execution_count="51">
<pre><code>'\ufeffThe Project Gutenberg eBook of The Complete Works of William Shakespeare\r\n    \r\nThis eBook is for t'</code></pre>
</div>
</div>
<p>To turn this into our dataset, we’ll convert everything to lower-case, split out punctuation, and then split on spaces to get our tokens. We’ll treat every token that appears fewer than 25 times as an unknown token. Each example consists of a single context word, and the target is simply the next word.</p>
<div id="4f4866bd" class="cell" data-execution_count="52">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Lowercase then put spaces around punctuation and \n and then split on spaces</span></span>
<span id="cb6-2">tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.findall(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\w</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\w\s</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, txt.lower(), re.UNICODE)</span>
<span id="cb6-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(tokens[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>])  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># tokens[:10]</span></span>
<span id="cb6-4"></span>
<span id="cb6-5">min_freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span></span>
<span id="cb6-6">vocab_freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.Series(tokens).value_counts()</span>
<span id="cb6-7">vocab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vocab_freq[vocab_freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> min_freq].index.tolist()</span>
<span id="cb6-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Vocab size: "</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(vocab))</span>
<span id="cb6-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"First 10 vocab tokens: "</span>, vocab[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>])</span>
<span id="cb6-10"></span>
<span id="cb6-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add &lt;unk&gt; token for out-of-vocab words</span></span>
<span id="cb6-12">vocab.insert(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;unk&gt;"</span>)</span>
<span id="cb6-13">token_to_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {token: idx <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx, token <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(vocab)}</span>
<span id="cb6-14">id_to_token <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {idx: token <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx, token <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(vocab)}</span>
<span id="cb6-15"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> encode(tokens):</span>
<span id="cb6-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> [token_to_id.get(token, token_to_id[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;unk&gt;"</span>]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> token <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> tokens]</span>
<span id="cb6-17"></span>
<span id="cb6-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> decode(token_ids):</span>
<span id="cb6-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> [id_to_token.get(token_id, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;unk&gt;"</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> token_id <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> token_ids]  </span>
<span id="cb6-20"></span>
<span id="cb6-21">encoded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encode(tokens)</span>
<span id="cb6-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(encoded[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>]) </span>
<span id="cb6-23"></span>
<span id="cb6-24">x0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoded[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb6-25">x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoded[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:]</span>
<span id="cb6-26"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dataset size: "</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x0))</span>
<span id="cb6-27">pd.DataFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"x0"</span>: x0, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"x1"</span>: x1}).head()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>['\ufeff', 'the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'complete', 'works', 'of']
Vocab size:  3227
First 10 vocab tokens:  [',', '.', 'the', 'and', '’', 'i', 'to', 'of', 'a', 'you']
[0, 3, 1071, 1098, 0, 8, 3, 0, 1523, 8, 1174, 0, 27, 0, 16]
Dataset size:  1262243</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="52">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">x0</th>
<th data-quarto-table-cell-role="th">x1</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>0</td>
<td>3</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>3</td>
<td>1071</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>1071</td>
<td>1098</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>1098</td>
<td>0</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>0</td>
<td>8</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
</section>
<section id="sigreg" class="level1">
<h1>SIGReg</h1>
<p>We use the same SIGReg code as in my <a href="../../posts/sigreg-sketched-isotropic-gaussian-regularization/">prior blog post</a>. This is what pushes the embedding distribution toward an isotropic Gaussian, avoiding dimensional collapse, as you’ll see later in Figure&nbsp;1, which plots the first two embedding dimensions against each other.</p>
<div id="1c4016f1" class="cell" data-execution_count="53">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> SIGReg(x, num_slices<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span>):</span>
<span id="cb8-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># x: (N, D) samples</span></span>
<span id="cb8-3">    N, D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.shape</span>
<span id="cb8-4">    device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.device</span>
<span id="cb8-5"></span>
<span id="cb8-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Projection directions ---</span></span>
<span id="cb8-7">    A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn(D, num_slices, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>device)</span>
<span id="cb8-8">    A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> A.norm(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># normalize columns → unit directions</span></span>
<span id="cb8-9"></span>
<span id="cb8-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Project to 1D: shape → (N, num_slices)</span></span>
<span id="cb8-11">    X_proj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> A</span>
<span id="cb8-12"></span>
<span id="cb8-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Integration points ---</span></span>
<span id="cb8-14">    t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, k, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>device)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (k,)</span></span>
<span id="cb8-15">    phi_normal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (k,)</span></span>
<span id="cb8-16">    weight <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> phi_normal                          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Gaussian window</span></span>
<span id="cb8-17"></span>
<span id="cb8-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Broadcast shapes: (N, M, 1) ⋅ (1, 1, k)</span></span>
<span id="cb8-19">    X_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_proj.unsqueeze(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t</span>
<span id="cb8-20"></span>
<span id="cb8-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Empirical characteristic function across samples</span></span>
<span id="cb8-22">    ecf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">1j</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X_t).mean(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (M, k)</span></span>
<span id="cb8-23"></span>
<span id="cb8-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Squared difference</span></span>
<span id="cb8-25">    diff_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (ecf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> phi_normal).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (M, k)</span></span>
<span id="cb8-26"></span>
<span id="cb8-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Weighted integration for all projections → shape (M,)</span></span>
<span id="cb8-28">    per_direction_T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.trapz(diff_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> weight, t, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N</span>
<span id="cb8-29"></span>
<span id="cb8-30">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># GLOBAL aggregation — MEAN instead of MAX</span></span>
<span id="cb8-31">    T_global <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> per_direction_T.mean()</span>
<span id="cb8-32"></span>
<span id="cb8-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> T_global</span></code></pre></div></div>
</details>
</div>
</section>
<section id="making-and-training-the-embedding-model" class="level1">
<h1>Making and Training the Embedding Model</h1>
<p>Now we’re ready to train a model. The encoder in this case is just an Embedding module, and the predictor is just a Linear module. The objective is the MSE loss between the predicted next-word embedding and the actual next-word embedding, plus a SIGReg regularization term on the embeddings, scaled by a small coefficient.</p>
<div id="3369ceed" class="cell" data-execution_count="54">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cuda"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.cuda.is_available() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cpu"</span></span>
<span id="cb9-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Using device: "</span>, device)</span>
<span id="cb9-3"></span>
<span id="cb9-4">x0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor(x0, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">long</span>).to(device)</span>
<span id="cb9-5">x1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor(x1, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">long</span>).to(device)</span>
<span id="cb9-6"></span>
<span id="cb9-7">embedding_dim <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb9-8">encoder <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Embedding(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(vocab), embedding_dim).to(device)</span>
<span id="cb9-9">next_encoding_predictor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Linear(embedding_dim, embedding_dim).to(device)</span>
<span id="cb9-10">sigreg_lambda <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span></span>
<span id="cb9-11"></span>
<span id="cb9-12">n_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb9-13">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2048</span></span>
<span id="cb9-14">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(encoder.parameters()) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(next_encoding_predictor.parameters()), lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb9-15">loss_fn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.MSELoss()</span>
<span id="cb9-16"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_epochs):</span>
<span id="cb9-17">    total_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb9-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x0), batch_size):</span>
<span id="cb9-19">        x0_batch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x0[i:i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>batch_size]</span>
<span id="cb9-20">        x1_batch <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x1[i:i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>batch_size]</span>
<span id="cb9-21">        </span>
<span id="cb9-22">        x0_embedded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoder(x0_batch)</span>
<span id="cb9-23">        x1_embedded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoder(x1_batch)</span>
<span id="cb9-24">        </span>
<span id="cb9-25">        x1_predicted <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> next_encoding_predictor(x0_embedded)</span>
<span id="cb9-26">        </span>
<span id="cb9-27">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loss_fn(x1_predicted, x1_embedded)</span>
<span id="cb9-28">        sigreg_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SIGReg(x0_embedded)</span>
<span id="cb9-29">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> sigreg_lambda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>sigreg_loss</span>
<span id="cb9-30">        optimizer.zero_grad()</span>
<span id="cb9-31">        loss.backward()</span>
<span id="cb9-32">        optimizer.step()</span>
<span id="cb9-33">        </span>
<span id="cb9-34">        total_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> loss.item()</span>
<span id="cb9-35">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> (n_epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb9-36">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>n_epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>total_loss<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x0)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Using device:  cuda
Epoch 1/10, Loss: 0.00048023610461530306
Epoch 2/10, Loss: 0.0004523864227085027
Epoch 3/10, Loss: 0.0004391360668865251
Epoch 4/10, Loss: 0.0004303184247303124
Epoch 5/10, Loss: 0.00042514875212748543
Epoch 6/10, Loss: 0.00042165358612022394
Epoch 7/10, Loss: 0.0004195104785958759
Epoch 8/10, Loss: 0.0004182606700457514
Epoch 9/10, Loss: 0.0004172710015018738
Epoch 10/10, Loss: 0.00041632126264852835</code></pre>
</div>
</div>
</section>
<section id="visualize" class="level1">
<h1>Visualize</h1>
<div id="cell-fig-embspace" class="cell" data-execution_count="55">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> visualize_embeddings(encoder, vocab, token_to_id, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>):</span>
<span id="cb11-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Visualize token embeddings.</span></span>
<span id="cb11-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb11-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    If embedding_dim == 2 → plot directly.</span></span>
<span id="cb11-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    If embedding_dim &gt; 2 → plot first 2 dimensions</span></span>
<span id="cb11-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb11-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    max_tokens limits how many tokens to plot for readability.</span></span>
<span id="cb11-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb11-10"></span>
<span id="cb11-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># optionally limit tokens for clarity</span></span>
<span id="cb11-12">    tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vocab[:max_tokens]</span>
<span id="cb11-13">    indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [token_to_id[t] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> tokens]</span>
<span id="cb11-14">    emb_subset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoder(torch.tensor(indices).to(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(encoder.parameters()).device)).detach().cpu().numpy()</span>
<span id="cb11-15">    </span>
<span id="cb11-16"></span>
<span id="cb11-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plot</span></span>
<span id="cb11-18">    plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>))</span>
<span id="cb11-19">    plt.scatter(emb_subset[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], emb_subset[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb11-20">    </span>
<span id="cb11-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># annotate tokens</span></span>
<span id="cb11-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, token <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(tokens):</span>
<span id="cb11-23">        plt.annotate(token, (emb_subset[i, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], emb_subset[i, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>)</span>
<span id="cb11-24">    </span>
<span id="cb11-25">    plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Token Embeddings Visualization"</span>)</span>
<span id="cb11-26">    plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dim 1"</span>)</span>
<span id="cb11-27">    plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dim 2"</span>)</span>
<span id="cb11-28">    plt.grid()</span>
<span id="cb11-29">    plt.show()</span>
<span id="cb11-30"></span>
<span id="cb11-31">visualize_embeddings(encoder, vocab, token_to_id)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div id="fig-embspace" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-embspace-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://shonczinner.github.io/posts/small-jepa-language-model/index_files/figure-html/fig-embspace-output-1.png" id="fig-embspace" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-embspace-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
</div>
</div>
<p>The visualization above shows only the first two dimensions of the embedding space. We can see several clusters, including character names, royal titles, and tokens that follow apostrophes in words like ne’er, ’tis, and o’er. This suggests that the JEPA model is learning informative embeddings.</p>
</section>
<section id="further-embedding-investigation" class="level1">
<h1>Further Embedding Investigation</h1>
<p>We can also observe what words are closest in embedding space.</p>
<div id="d21a8295" class="cell" data-execution_count="61">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> neighbour_table_l2(words, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>):</span>
<span id="cb12-2">    device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(encoder.parameters()).device</span>
<span id="cb12-3">    encoder.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb12-4"></span>
<span id="cb12-5">    vocab_ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.arange(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(vocab)).to(device)</span>
<span id="cb12-6"></span>
<span id="cb12-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb12-8">        vocab_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoder(vocab_ids)</span>
<span id="cb12-9"></span>
<span id="cb12-10">    rows <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb12-11"></span>
<span id="cb12-12">    word_order <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {w: i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, w <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(words)}</span>
<span id="cb12-13"></span>
<span id="cb12-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> words:</span>
<span id="cb12-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> token_to_id:</span>
<span id="cb12-16">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">continue</span></span>
<span id="cb12-17"></span>
<span id="cb12-18">        word_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> token_to_id[word]</span>
<span id="cb12-19"></span>
<span id="cb12-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb12-21">            query_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> encoder(torch.tensor([word_id]).to(device))</span>
<span id="cb12-22"></span>
<span id="cb12-23">            x_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (query_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, keepdim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-24">            v_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (vocab_emb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).unsqueeze(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb12-25">            cross <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.matmul(query_emb, vocab_emb.T)</span>
<span id="cb12-26"></span>
<span id="cb12-27">            distances <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (x_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> cross).squeeze(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb12-28"></span>
<span id="cb12-29">            distances[word_id] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inf"</span>)</span>
<span id="cb12-30"></span>
<span id="cb12-31">            top_ids <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.topk(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>distances, n).indices.tolist()</span>
<span id="cb12-32"></span>
<span id="cb12-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> rank, i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(top_ids, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb12-34">            rows.append({</span>
<span id="cb12-35">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query"</span>: word,</span>
<span id="cb12-36">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query_order"</span>: word_order[word],</span>
<span id="cb12-37">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rank"</span>: rank,</span>
<span id="cb12-38">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"token"</span>: id_to_token[i],</span>
<span id="cb12-39">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"l2_distance"</span>: distances[i].item()</span>
<span id="cb12-40">            })</span>
<span id="cb12-41"></span>
<span id="cb12-42">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(rows)</span>
<span id="cb12-43"></span>
<span id="cb12-44">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># enforce deterministic ordering for display</span></span>
<span id="cb12-45">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.sort_values([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query_order"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rank"</span>])</span>
<span id="cb12-46"></span>
<span id="cb12-47">    pivot_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb12-48">        df.pivot(index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query"</span>, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rank"</span>, values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"token"</span>)</span>
<span id="cb12-49">        .reindex(words)</span>
<span id="cb12-50">    ).T</span>
<span id="cb12-51"></span>
<span id="cb12-52">    display(pivot_tokens)</span>
<span id="cb12-53"></span>
<span id="cb12-54">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> df</span>
<span id="cb12-55"></span>
<span id="cb12-56"></span>
<span id="cb12-57"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ---- run ----</span></span>
<span id="cb12-58">words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"young"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"king"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"romeo"</span>]</span>
<span id="cb12-59">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> neighbour_table_l2(words, n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">query</th>
<th data-quarto-table-cell-role="th">young</th>
<th data-quarto-table-cell-role="th">king</th>
<th data-quarto-table-cell-role="th">romeo</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">rank</th>
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th"></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">1</th>
<td>old</td>
<td>friar</td>
<td>malcolm</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">2</th>
<td>delicate</td>
<td>chief</td>
<td>hamlet</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">3</th>
<td>civil</td>
<td>tamora</td>
<td>wolsey</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">4</th>
<td>honourable</td>
<td>taking</td>
<td>lucius</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">5</th>
<td>troubled</td>
<td>perfect</td>
<td>viola</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>It’s encouraging that “king” is near other professions, “young” is near other adjectives (including its opposite, “old”), and “romeo” is near other names.</p>
</section>
<section id="future-directions" class="level1">
<h1>Future Directions</h1>
<p>As I mentioned earlier, it would be easy to extend this methodology to non-linear models (e.g.&nbsp;MLP, CNN, RNN, or Transformer) and to use larger contexts and targets than single words. It’s also possible to experiment with other hyperparameters, such as the SIGReg regularizer coefficient, the embedding dimension, and the hidden dimension, and to try larger datasets.</p>
<p>Compared to many prior methods for learning word embeddings, this approach is also relatively simple: it avoids machinery like the negative sampling used by word2vec.</p>
<p>There’s also recent work on approximations to SIGReg that are likely more computationally efficient with very little downside <span class="citation" data-cites="akbar2026">(Akbar 2026)</span>.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-akbar2026" class="csl-entry">
Akbar, Habibullah. 2026. <em>Weak-SIGReg: Covariance Regularization for Stable Deep Learning</em>. <a href="https://arxiv.org/abs/2603.05924">https://arxiv.org/abs/2603.05924</a>.
</div>
<div id="ref-bengio2000" class="csl-entry">
Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. 2000. <span>“A Neural Probabilistic Language Model.”</span> In <em>Advances in Neural Information Processing Systems</em>, edited by T. Leen, T. Dietterich, and V. Tresp, vol. 13. MIT Press. <a href="https://proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf</a>.
</div>
<div id="ref-dhillon2011" class="csl-entry">
Dhillon, Paramveer, Dean P Foster, and Lyle Ungar. 2011. <span>“Multi-View Learning of Word Embeddings via CCA.”</span> In <em>Advances in Neural Information Processing Systems</em>, edited by J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, vol. 24. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper_files/paper/2011/file/6c4b761a28b734fe93831e3fb400ce87-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2011/file/6c4b761a28b734fe93831e3fb400ce87-Paper.pdf</a>.
</div>
<div id="ref-maes2026" class="csl-entry">
Maes, Lucas, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. 2026. <em>LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels</em>. <a href="https://arxiv.org/abs/2603.19312">https://arxiv.org/abs/2603.19312</a>.
</div>
<div id="ref-mikolov2013" class="csl-entry">
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. <em>Efficient Estimation of Word Representations in Vector Space</em>. <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a>.
</div>
<div id="ref-shao2025" class="csl-entry">
Shao, Chenze, Darren Li, Fandong Meng, and Jie Zhou. 2025. <em>Continuous Autoregressive Language Models</em>. <a href="https://arxiv.org/abs/2510.27688">https://arxiv.org/abs/2510.27688</a>.
</div>
</div></section></div> ]]></description>
  <category>ai</category>
  <category>code</category>
  <category>jepa</category>
  <guid>https://shonczinner.github.io/posts/small-jepa-language-model/</guid>
  <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Sketched Isotropic Gaussian Regularization (SIGReg) Explained</title>
  <dc:creator>Shon Czinner</dc:creator>
  <link>https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/</link>
  <description><![CDATA[ 




<p>The paper “LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics” <span class="citation" data-cites="balestriero2025">(Balestriero and LeCun 2025)</span> proposed an interesting method of regularizing latent spaces that I thought I’d explore.</p>
<p>This is a regularization technique that pushes latent embeddings toward an isotropic Gaussian distribution: the dimensions are encouraged to be mutually independent and each Gaussian distributed. The authors argue that Gaussian latent embeddings are optimal because they yield unbiased, lower-variance estimators for downstream tasks.</p>
<p>They push latent embeddings toward a Gaussian distribution using one-dimensional tests of normality along random directions, based on characteristic functions.</p>
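<p>In outline, and as my own toy sketch rather than the paper’s code: sample random unit directions, project the batch of embeddings onto each direction, and each projection becomes a one-dimensional sample whose normality can be tested as in the sections below.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((512, 32))   # toy stand-in for a batch of 32-d embeddings

# random directions, normalized to unit length (columns of A)
A = rng.standard_normal((32, 4))
A /= np.linalg.norm(A, axis=0, keepdims=True)

# each column of proj is a 1-D sample to feed into a normality test
proj = Z @ A
assert proj.shape == (512, 4)
```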
<section id="one-dimension" class="level1">
<h1>One-Dimension</h1>
<section id="ecdf-tests" class="level2">
<h2 class="anchored" data-anchor-id="ecdf-tests">ECDF Tests</h2>
<p>The paper first motivates measuring how Gaussian a distribution is via its empirical (i.e.&nbsp;observed) cumulative distribution function (ECDF). Let’s look at this in one dimension.</p>
<p>First we generate two univariate samples: one uniform and one standard Gaussian. The target distribution here is the standard Gaussian, so we simply compare each sample’s ECDF with the theoretical Gaussian CDF.</p>
<div id="d347bc82" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> stats</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span></code></pre></div></div>
</details>
</div>
<div id="cell-fig-ecdfs" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sample size</span></span>
<span id="cb2-2">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate samples</span></span>
<span id="cb2-5">X_uniform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(n)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># uniform(-3,3)</span></span>
<span id="cb2-6">X_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randn(n)      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Standard Normal</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define points for theoretical gaussian CDF</span></span>
<span id="cb2-9">x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>)</span>
<span id="cb2-10">gaussian_cdf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.norm.cdf(x)</span>
<span id="cb2-11"></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ECDF function</span></span>
<span id="cb2-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ecdf(data):</span>
<span id="cb2-14">    x_sorted <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sort(data)</span>
<span id="cb2-15">    y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.arange(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x_sorted)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(x_sorted)</span>
<span id="cb2-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> x_sorted, y</span>
<span id="cb2-17"></span>
<span id="cb2-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute ECDFs</span></span>
<span id="cb2-19">x_ecdf_uniform, y_ecdf_uniform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ecdf(X_uniform)</span>
<span id="cb2-20">x_ecdf_gaussian, y_ecdf_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ecdf(X_gaussian)</span>
<span id="cb2-21"></span>
<span id="cb2-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plotting using a loop but keeping samples separate</span></span>
<span id="cb2-23">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb2-24"></span>
<span id="cb2-25">plots <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb2-26">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform Sample ECDF"</span>, x_ecdf_uniform, y_ecdf_uniform),</span>
<span id="cb2-27">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian Sample ECDF"</span>, x_ecdf_gaussian, y_ecdf_gaussian)</span>
<span id="cb2-28">]</span>
<span id="cb2-29"></span>
<span id="cb2-30"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (title, x_ecdf, y_ecdf) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(plots, start<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb2-31">    plt.subplot(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, i)</span>
<span id="cb2-32">    plt.plot(x_ecdf, y_ecdf, marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.'</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'none'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'ECDF </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">F_n</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">x</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb2-33">    plt.plot(x, gaussian_cdf, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'Gaussian CDF </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">F</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">x</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>)</span>
<span id="cb2-34">    plt.title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>title<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> vs Gaussian CDF'</span>)</span>
<span id="cb2-35">    plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'x'</span>)</span>
<span id="cb2-36">    plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'CDF'</span>)</span>
<span id="cb2-37">    plt.legend()</span>
<span id="cb2-38">    plt.grid(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb2-39"></span>
<span id="cb2-40">plt.tight_layout()</span>
<span id="cb2-41">plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div id="fig-ecdfs" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ecdfs-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/fig-ecdfs-output-1.png" id="fig-ecdfs" class="img-fluid figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-ecdfs-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
</div>
</div>
<p>As you’d expect, in Figure&nbsp;1 the ECDF of the uniform sample shows large gaps from the Gaussian CDF, while the ECDF of the Gaussian sample tracks it closely.</p>
<p>We can quantify this “gap,” i.e.&nbsp;how far the empirical CDF is from the desired theoretical CDF, with the following statistic,</p>
<p><img src="https://latex.codecogs.com/png.latex?T=n%5Cint_%7B-%5Cinfty%7D%5E%5Cinfty%20(F_n(x)-F(x))%5E2%20w(x)%20dF(x)"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?F_n(x)"> is the empirical CDF, <img src="https://latex.codecogs.com/png.latex?F(x)"> is the theoretical CDF, and <img src="https://latex.codecogs.com/png.latex?w(x)"> is a weighting function. With <img src="https://latex.codecogs.com/png.latex?w(x)=1"> this is known as the Cramér–von Mises test statistic, and with <img src="https://latex.codecogs.com/png.latex?w(x)=(F(x)(1-F(x)))%5E%7B-1%7D"> it is known as the Anderson–Darling test statistic. We can compute both in scipy.</p>
<p>With some algebraic manipulation, these test statistics have closed-form formulas. However, anything built on the ECDF is still computationally expensive and non-differentiable, because it requires sorting the data. We compute the Cramér–von Mises and Anderson–Darling test statistics below.</p>
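<p>For reference, the Cramér–von Mises closed form works on the sorted sample. Here is a quick sketch (the helper name <code>cvm_statistic</code> is mine) checking it against scipy; note the <code>np.sort</code> call, which is exactly where differentiability is lost:</p>

```python
import numpy as np
from scipy import stats

def cvm_statistic(x, cdf=stats.norm.cdf):
    # closed-form Cramér–von Mises statistic against a given theoretical CDF:
    # T = 1/(12n) + sum_i (F(x_(i)) - (2i-1)/(2n))^2 over the sorted sample
    n = len(x)
    u = cdf(np.sort(x))          # probability transform of the order statistics
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
ours = cvm_statistic(x)
scipys = stats.cramervonmises(x, cdf=stats.norm.cdf).statistic
assert np.isclose(ours, scipys)
```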
<div id="5a77b522" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tests</span></span>
<span id="cb3-2">cvm_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.cramervonmises(X_uniform, cdf<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stats.norm.cdf)</span>
<span id="cb3-3">cvm_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.cramervonmises(X_gaussian, cdf<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stats.norm.cdf)</span>
<span id="cb3-4"></span>
<span id="cb3-5">ad_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.anderson(X_uniform, dist<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'norm'</span>, method<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'interpolate'</span>)</span>
<span id="cb3-6">ad_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.anderson(X_gaussian, dist<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'norm'</span>, method<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'interpolate'</span>)</span>
<span id="cb3-7"></span>
<span id="cb3-8">alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># significance level</span></span>
<span id="cb3-9"></span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> cvm_report(name, cvm):</span>
<span id="cb3-12">    decision <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"REJECT H0 (not Gaussian)"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> cvm.pvalue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> alpha <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FAIL TO REJECT H0 (Gaussian-consistent)"</span></span>
<span id="cb3-13">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">:"</span>)</span>
<span id="cb3-14">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  statistic (T) = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cvm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-15">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  p-value   = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cvm<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>pvalue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4g}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-16">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  decision  = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>decision<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-17"></span>
<span id="cb3-18"></span>
<span id="cb3-19"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ad_report(name, ad):</span>
<span id="cb3-20">    decision <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"REJECT H0 (not Gaussian)"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> ad.pvalue <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> alpha <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FAIL TO REJECT H0 (Gaussian-consistent)"</span></span>
<span id="cb3-21">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">:"</span>)</span>
<span id="cb3-22">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  statistic (T) = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ad<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>statistic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-23">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  p-value   = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ad<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>pvalue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4g}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-24">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  decision  = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>decision<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-25"></span>
<span id="cb3-26"></span>
<span id="cb3-27"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cramér–von Mises Test Results"</span>)</span>
<span id="cb3-28"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-----------------------------"</span>)</span>
<span id="cb3-29">cvm_report(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform sample"</span>, cvm_1)</span>
<span id="cb3-30">cvm_report(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian sample"</span>, cvm_2)</span>
<span id="cb3-31"></span>
<span id="cb3-32"></span>
<span id="cb3-33"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Anderson–Darling Test Results"</span>)</span>
<span id="cb3-34"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-----------------------------"</span>)</span>
<span id="cb3-35">ad_report(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform sample"</span>, ad_1)</span>
<span id="cb3-36">ad_report(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian sample"</span>, ad_2)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Cramér–von Mises Test Results
-----------------------------
Uniform sample:
  statistic (T) = 1.7954
  p-value   = 3.264e-05
  decision  = REJECT H0 (not Gaussian)

Gaussian sample:
  statistic (T) = 0.3512
  p-value   = 0.09743
  decision  = FAIL TO REJECT H0 (Gaussian-consistent)

Anderson–Darling Test Results
-----------------------------
Uniform sample:
  statistic (T) = 1.6875
  p-value   = 0.01
  decision  = REJECT H0 (not Gaussian)

Gaussian sample:
  statistic (T) = 0.2015
  p-value   = 0.15
  decision  = FAIL TO REJECT H0 (Gaussian-consistent)
</code></pre>
</div>
</div>
<p>As expected, the Gaussian hypothesis is rejected for the uniform sample and not rejected for the Gaussian sample.</p>
<p>Note: scipy.stats.anderson with dist set to “norm” tests for normality, not standard normality: it standardizes the data by subtracting the sample mean and dividing by the sample standard deviation before computing the statistic.</p>
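<p>A quick way to see this internal standardization (my own check, not from the paper): the Anderson–Darling statistic is unchanged under any affine transformation of the data.</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(200)

a_raw = stats.anderson(x, dist='norm').statistic
a_affine = stats.anderson(5.0 + 3.0 * x, dist='norm').statistic  # shifted and scaled

# identical, because the test standardizes by the sample mean and std internally
assert np.isclose(a_raw, a_affine)
```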
</section>
<section id="characteristic-functions" class="level2">
<h2 class="anchored" data-anchor-id="characteristic-functions">Characteristic Functions</h2>
<p>A distribution is uniquely determined not only by its CDF but also by its characteristic function. The characteristic function (CF) of a random variable <img src="https://latex.codecogs.com/png.latex?X"> is defined as,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cvarphi_X(t)=E(e%5E%7BitX%7D)"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?i"> is the imaginary unit. For a sample, we have the empirical characteristic function (ECF),</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%5Cvarphi_n(t)=%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bj=1%7D%5En%20e%5E%7Bitx_j%7D"></p>
<p>Reminiscent of how the CDF gap was defined, for CFs we can quantify the gap using,</p>
<p><img src="https://latex.codecogs.com/png.latex?T=n%5Cint_%7B-%5Cinfty%7D%5E%5Cinfty%20%7C%5Chat%5Cvarphi_n(t)-%5Cvarphi(t)%7C%5E2%20w(t)%20dt"></p>
<p>which is known as the Epps–Pulley test statistic. The weight function is typically <img src="https://latex.codecogs.com/png.latex?w(t)=e%5E%7B-t%5E2/2%7D">, which is also the theoretical CF of the standard Gaussian distribution.</p>
<p>An advantage of the Epps–Pulley statistic is that it is continuous and differentiable in the data: moving one data point by a small amount changes the statistic proportionally. In contrast, moving a data point shifts the jump locations of the ECDF, which makes ECDF-based statistics non-differentiable.</p>
<p>This integral can be approximated with the basic trapezoidal rule on a finite grid.</p>
<div id="c6f54dbd" class="cell" data-execution_count="6">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> eps_pulley(X, k, plot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>):</span>
<span id="cb5-2">    t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, k)</span>
<span id="cb5-3"></span>
<span id="cb5-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># theoretical characteristic function of N(0,1)</span></span>
<span id="cb5-5">    phi_gauss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb5-6"></span>
<span id="cb5-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># empirical characteristic function</span></span>
<span id="cb5-8">    X_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.outer(X, t)</span>
<span id="cb5-9">    exp_itx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.exp(<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">1j</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X_t)   <span class="co" style="color: #5E5E5E;
background-color: null;
# switched">
font-style: inherit;"># e^{itx} terms, one row per data point</span></span>
<span id="cb5-10">    phi_emp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.mean(exp_itx, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb5-11"></span>
<span id="cb5-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># integrated squared error</span></span>
<span id="cb5-13">    err <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(phi_emp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> phi_gauss)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb5-14">    T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.trapezoid(err, t)</span>
<span id="cb5-15"></span>
<span id="cb5-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> plot:</span>
<span id="cb5-17">        plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb5-18"></span>
<span id="cb5-19">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plot individual exp(itx) curves (real parts)</span></span>
<span id="cb5-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X)):</span>
<span id="cb5-21">            plt.plot(</span>
<span id="cb5-22">                t,</span>
<span id="cb5-23">                exp_itx[i].real,</span>
<span id="cb5-24">                color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgrey"</span>,</span>
<span id="cb5-25">                alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb5-26">                linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span></span>
<span id="cb5-27">            )</span>
<span id="cb5-28"></span>
<span id="cb5-29">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># empirical mean</span></span>
<span id="cb5-30">        plt.plot(</span>
<span id="cb5-31">            t,</span>
<span id="cb5-32">            phi_emp.real,</span>
<span id="cb5-33">            color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dimgray"</span>,</span>
<span id="cb5-34">            linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb5-35">            label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"Empirical mean </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">Re</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[E[e^{itX}]</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">]=Re</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">hat</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\v</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">arphi_n(t)]</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-36">        )</span>
<span id="cb5-37"></span>
<span id="cb5-38">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># theoretical Gaussian CF</span></span>
<span id="cb5-39">        plt.plot(</span>
<span id="cb5-40">            t,</span>
<span id="cb5-41">            phi_gauss,</span>
<span id="cb5-42">            color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>,</span>
<span id="cb5-43">            linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>,</span>
<span id="cb5-44">            linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb5-45">            label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"Gaussian CF </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">e</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">^</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">{-t</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">^</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">2/2}=</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\v</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">arphi</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">t</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">)</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-46">        )</span>
<span id="cb5-47"></span>
<span id="cb5-48">        plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"t"</span>)</span>
<span id="cb5-49">        plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Real part"</span>)</span>
<span id="cb5-50">        plt.title(title)</span>
<span id="cb5-51">        plt.legend()</span>
<span id="cb5-52">        plt.grid(alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>)</span>
<span id="cb5-53">        plt.show()</span>
<span id="cb5-54"></span>
<span id="cb5-55">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> T</span>
<span id="cb5-56"></span>
<span id="cb5-57"></span>
<span id="cb5-58">k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb5-59"></span>
<span id="cb5-60">T_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eps_pulley(</span>
<span id="cb5-61">    X_gaussian, k,</span>
<span id="cb5-62">    plot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb5-63">    title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian Sample"</span></span>
<span id="cb5-64">)</span>
<span id="cb5-65"></span>
<span id="cb5-66">T_uniform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> eps_pulley(</span>
<span id="cb5-67">    X_uniform, k,</span>
<span id="cb5-68">    plot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb5-69">    title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform Sample"</span></span>
<span id="cb5-70">)</span>
<span id="cb5-71"></span>
<span id="cb5-72"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Epps–Pulley Test Results"</span>)</span>
<span id="cb5-73"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-------------------------"</span>)</span>
<span id="cb5-74"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Gaussian sample: T = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>T_gaussian<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.6f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-75"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Uniform sample:  T = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>T_uniform<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.6f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-76"></span>
<span id="cb5-77"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> T_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> T_uniform:</span>
<span id="cb5-78">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Conclusion: Gaussian sample is closer to the standard normal reference distribution."</span>)</span>
<span id="cb5-79"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-80">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Conclusion: Uniform sample appears closer to the standard normal reference distribution."</span>)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-5-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-5-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>Epps–Pulley Test Results
-------------------------
Gaussian sample: T = 0.064450
Uniform sample:  T = 0.924244

Conclusion: Gaussian sample is closer to the standard normal reference distribution.</code></pre>
</div>
</div>
<p>The lower T statistic for the Gaussian sample indicates a smaller “gap”. I haven’t looked into the limiting distribution of the Epps–Pulley test statistic, which would be needed to compute p-values.</p>
<p>We can also check robustness to the number of grid points in the trapezoidal approximation, and we see that the statistic converges quite quickly.</p>
<div id="58c3509a" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">ks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)]</span>
<span id="cb7-2">T_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [eps_pulley(X_gaussian, k) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ks]</span>
<span id="cb7-3">T_uniform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [eps_pulley(X_uniform, k) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> ks]</span>
<span id="cb7-4"></span>
<span id="cb7-5">plots <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb7-6">    (<span class="st" style="color: #20794D;
background-color: null;
&quot;uniform">
font-style: inherit;">"Uniform Sample"</span>, T_uniform),</span>
<span id="cb7-7">    (<span class="st" style="color: #20794D;
background-color: null;
&quot;gaussian">
font-style: inherit;">"Gaussian Sample"</span>, T_gaussian)</span>
<span id="cb7-8">]</span>
<span id="cb7-9"></span>
<span id="cb7-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (title, T) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(plots, start<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb7-11">    plt.plot(ks, T, marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.'</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
&#39;ECDF&#39;">
font-style: inherit;">'T'</span>)</span>
<span id="cb7-12">    plt.title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>title<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> Epps-Pulley Test'</span>)</span>
<span id="cb7-13">    plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'k'</span>)</span>
<span id="cb7-14">    plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'T'</span>)</span>
<span id="cb7-15">    plt.legend()</span>
<span id="cb7-16">    plt.grid(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb7-17">    plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-6-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-6-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="gradient-descent" class="level2">
<h2 class="anchored" data-anchor-id="gradient-descent">Gradient Descent</h2>
<p>Now that we know how to quantify the gap between data and a target distribution with Epps–Pulley, let’s make a neural network output data that fits the desired target distribution. First, let’s send 1D data through a random, untrained neural network and check its output distribution.</p>
<div id="dc796af9" class="cell" data-execution_count="8">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> plot_output_distribution(model, X, bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>):</span>
<span id="cb8-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Plot model output histogram against standard normal PDF.</span></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Args:</span></span>
<span id="cb8-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        model: PyTorch model</span></span>
<span id="cb8-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        X: input tensor</span></span>
<span id="cb8-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        bins: histogram bins</span></span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb8-10">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb8-11"></span>
<span id="cb8-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb8-13">        Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(X).squeeze().cpu().numpy()</span>
<span id="cb8-14"></span>
<span id="cb8-15">    plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb8-16"></span>
<span id="cb8-17">    plt.hist(</span>
<span id="cb8-18">        Y,</span>
<span id="cb8-19">        bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bins,</span>
<span id="cb8-20">        density<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb8-21">        alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>,</span>
<span id="cb8-22">        label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Network outputs"</span></span>
<span id="cb8-23">    )</span>
<span id="cb8-24"></span>
<span id="cb8-25">    x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(Y.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, Y.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>)</span>
<span id="cb8-26"></span>
<span id="cb8-27">    plt.plot(</span>
<span id="cb8-28">        x,</span>
<span id="cb8-29">        stats.norm.pdf(x, loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb8-30">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r--'</span>,</span>
<span id="cb8-31">        linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb8-32">        label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$\m</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">athcal{N}</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">0,1</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;"> PDF'</span></span>
<span id="cb8-33">    )</span>
<span id="cb8-34"></span>
<span id="cb8-35">    plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Output value"</span>)</span>
<span id="cb8-36">    plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Density"</span>)</span>
<span id="cb8-37">    plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Distribution of Network Outputs"</span>)</span>
<span id="cb8-38">    plt.legend()</span>
<span id="cb8-39">    plt.grid(alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>)</span>
<span id="cb8-40">    plt.show()</span>
<span id="cb8-41"></span>
<span id="cb8-42"></span>
<span id="cb8-43"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># -----------------------</span></span>
<span id="cb8-44"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example setup</span></span>
<span id="cb8-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># -----------------------</span></span>
<span id="cb8-46">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb8-47">hidden <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span></span>
<span id="cb8-48"></span>
<span id="cb8-49">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.rand(n) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb8-50">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X.unsqueeze(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-51"></span>
<span id="cb8-52">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[:n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]</span>
<span id="cb8-53">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:]</span>
<span id="cb8-54"></span>
<span id="cb8-55">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Sequential(</span>
<span id="cb8-56">    torch.nn.Linear(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, hidden),</span>
<span id="cb8-57">    torch.nn.ReLU(),</span>
<span id="cb8-58">    torch.nn.Linear(hidden, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-59">)</span>
<span id="cb8-60"></span>
<span id="cb8-61"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Before training</span></span>
<span id="cb8-62">plot_output_distribution(model, X_test)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-7-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>We can see the output isn’t very Gaussian. Now let’s train the model using the Epps–Pulley regularizer:</p>
<div id="2528c5f8" class="cell" data-execution_count="9">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> epps_pulley_1d(X, k):</span>
<span id="cb9-2">    t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, k, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>X.device)</span>
<span id="cb9-3"></span>
<span id="cb9-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># theoretical CF of N(0,1)</span></span>
<span id="cb9-5">    phi_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-6"></span>
<span id="cb9-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># empirical CF</span></span>
<span id="cb9-8">    X_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X.unsqueeze(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t</span>
<span id="cb9-9">    phi_emp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.mean(torch.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">1j</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X_t), dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-10"></span>
<span id="cb9-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># weighted squared error</span></span>
<span id="cb9-12">    err <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> phi_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(phi_emp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> phi_gaussian)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb9-13"></span>
<span id="cb9-14">    T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.trapz(err, t)</span>
<span id="cb9-15"></span>
<span id="cb9-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> T</span>
<span id="cb9-17"></span>
<span id="cb9-18">epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span></span>
<span id="cb9-19"></span>
<span id="cb9-20">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(model.parameters(), lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb9-21">loss_fn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: epps_pulley_1d(x, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span>)</span>
<span id="cb9-22"></span>
<span id="cb9-23"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(epochs):</span>
<span id="cb9-24">    optimizer.zero_grad()</span>
<span id="cb9-25">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loss_fn(model(X_train))</span>
<span id="cb9-26">    loss.backward()</span>
<span id="cb9-27">    optimizer.step()</span>
<span id="cb9-28">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> (epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb9-29">      <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>item()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb9-30"></span>
<span id="cb9-31">plot_output_distribution(model, X_test)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Epoch 200/2000, Loss: 0.0006097470759414136
Epoch 400/2000, Loss: 0.0005753693985752761
Epoch 600/2000, Loss: 0.0005708417156711221
Epoch 800/2000, Loss: 0.0005661727045662701
Epoch 1000/2000, Loss: 0.0005581520381383598
Epoch 1200/2000, Loss: 0.0005494060460478067
Epoch 1400/2000, Loss: 0.0005422477843239903
Epoch 1600/2000, Loss: 0.0005335460300557315
Epoch 1800/2000, Loss: 0.0005124190938659012
Epoch 2000/2000, Loss: 0.0004806756041944027</code></pre>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-8-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Now we can see the trained model’s output is more Gaussian.</p>
</section>
</section>
<section id="multiple-dimensions" class="level1">
<h1>Multiple Dimensions</h1>
<section id="multidimensional-gaussian-regularization" class="level2">
<h2 class="anchored" data-anchor-id="multidimensional-gaussian-regularization">Multidimensional Gaussian Regularization</h2>
<p>It’s not immediately clear how to apply this in higher dimensions without running into the curse of dimensionality. Naively approximating the CF integral on a grid would require a number of evaluation points that scales exponentially with dimension. In the 1D example we used <img src="https://latex.codecogs.com/png.latex?17"> points; we want to avoid needing <img src="https://latex.codecogs.com/png.latex?17%5E2"> points in 2D, <img src="https://latex.codecogs.com/png.latex?17%5E3"> in 3D, and so on.</p>
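<p>To make that scaling concrete (a one-off illustration, not from the paper):</p>

```python
# Points needed for a tensor-product grid with 17 points per axis
for d in range(1, 6):
    print(f"{d}D: {17**d:,} points")
# 1D: 17, 2D: 289, 3D: 4,913, 4D: 83,521, 5D: 1,419,857
```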
<p>The paper’s idea is instead to sample directions in the <img src="https://latex.codecogs.com/png.latex?n">-dimensional space, project the <img src="https://latex.codecogs.com/png.latex?n">-dimensional points onto these directions, and then compute the corresponding univariate test statistics.</p>
<p>For a multivariate Gaussian random variable</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX%20%5Csim%20%5Cmathcal%7BN%7D(0,%20I)%0A"></p>
<p>any linear projection <img src="https://latex.codecogs.com/png.latex?u%5ET%20X"> is also Gaussian. And when <img src="https://latex.codecogs.com/png.latex?u"> is a unit vector <img src="https://latex.codecogs.com/png.latex?%5C%7Cu%5C%7C=1">, the variance remains unchanged:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BVar%7D(u%5ET%20X)=u%5ET%20I%20u%20=%201%0A"></p>
<p>This means every projection of a standard isotropic Gaussian looks like:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Au%5ET%20X%20%5Csim%20%5Cmathcal%7BN%7D(0,1)%0A"></p>
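<p>A quick numerical check of this claim (a sketch; the dimension and sample size here are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 5))  # samples from N(0, I_5)

u = rng.standard_normal(5)
u = u / np.linalg.norm(u)              # random unit direction

p = X @ u                              # one-dimensional projection
print(round(p.mean(), 3), round(p.var(), 3))  # close to 0 and 1
```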
<p>The converse is what makes this useful: by the Cramér–Wold theorem, if <strong>every</strong> one-dimensional projection of a distribution is standard Gaussian, then the full multivariate distribution must itself be a standard (isotropic) multivariate Gaussian.</p>
<p>This gives us a scalable approximation strategy: instead of testing Gaussianity over the full <img src="https://latex.codecogs.com/png.latex?n">-dimensional space, we test many random one-dimensional projections.</p>
<p>That’s the core idea behind SIGReg: approximate high-dimensional Gaussian regularization using randomized projections rather than exponentially expensive multidimensional integration.</p>
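<p>That strategy can be sketched in NumPy as follows. This is a minimal illustration under my own naming, not the paper’s API: <code>sliced_epps_pulley</code> and its parameters (<code>n_directions</code>, <code>k</code>) are hypothetical, and the integral is approximated with a simple rectangle rule.</p>

```python
import numpy as np

def epps_pulley_1d(x, k=17):
    # Weighted squared distance between the empirical characteristic
    # function of the 1D sample x and the CF of N(0, 1)
    t = np.linspace(-5.0, 5.0, k)
    phi_gauss = np.exp(-0.5 * t**2)
    phi_emp = np.mean(np.exp(1j * x[:, None] * t), axis=0)
    err = phi_gauss * np.abs(phi_emp - phi_gauss) ** 2
    return np.sum(err) * (t[1] - t[0])  # rectangle-rule integral

def sliced_epps_pulley(X, n_directions=32, k=17, rng=None):
    # Average the 1D statistic over random unit directions ("slices")
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_directions):
        u = rng.standard_normal(d)
        u = u / np.linalg.norm(u)  # unit vector: projection keeps variance 1
        total += epps_pulley_1d(X @ u, k=k)
    return total / n_directions
```

<p>On unit-variance data, the statistic should be near zero for isotropic Gaussian samples and noticeably larger for, say, uniform samples, mirroring the 1D behavior above.</p>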
<p>Here I’ll demonstrate this in two dimensions. First, we’ll inspect scatter plots of 2D data, then choose a random direction on the 2D unit circle, project the points onto that direction, and examine the resulting histograms. The key observation is that these projections reduce the problem back to one dimension, where we can run the same Epps–Pulley test as before.</p>
<div id="18f370b5" class="cell" data-execution_count="10">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span></span>
<span id="cb11-2"></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate samples</span></span>
<span id="cb11-4">X_uniform_2d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(n, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb11-5">X_gaussian_2d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randn(n, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb11-6"></span>
<span id="cb11-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Random unit direction</span></span>
<span id="cb11-8">v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randn(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb11-9">v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> np.linalg.norm(v)</span>
<span id="cb11-10"></span>
<span id="cb11-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Projections</span></span>
<span id="cb11-12">proj_uniform <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_uniform_2d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> v</span>
<span id="cb11-13">proj_gaussian <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_gaussian_2d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> v</span>
<span id="cb11-14"></span>
<span id="cb11-15"></span>
<span id="cb11-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ----------------------------</span></span>
<span id="cb11-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 1. Scatter plot with direction</span></span>
<span id="cb11-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ----------------------------</span></span>
<span id="cb11-19">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb11-20"></span>
<span id="cb11-21">plt.scatter(</span>
<span id="cb11-22">    X_uniform_2d[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],</span>
<span id="cb11-23">    X_uniform_2d[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb11-24">    alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb11-25">    label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform"</span></span>
<span id="cb11-26">)</span>
<span id="cb11-27"></span>
<span id="cb11-28">plt.scatter(</span>
<span id="cb11-29">    X_gaussian_2d[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],</span>
<span id="cb11-30">    X_gaussian_2d[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb11-31">    alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb11-32">    label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian"</span></span>
<span id="cb11-33">)</span>
<span id="cb11-34"></span>
<span id="cb11-35">plt.arrow(</span>
<span id="cb11-36">    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb11-37">    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>v[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>v[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb11-38">    width<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,</span>
<span id="cb11-39">    color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>,</span>
<span id="cb11-40">    length_includes_head<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb11-41">)</span>
<span id="cb11-42"></span>
<span id="cb11-43">plt.text(</span>
<span id="cb11-44">    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>v[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],</span>
<span id="cb11-45">    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>v[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb11-46">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"projection direction"</span>,</span>
<span id="cb11-47">    fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb11-48">)</span>
<span id="cb11-49"></span>
<span id="cb11-50">plt.axhline(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gray'</span>, lw<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb11-51">plt.axvline(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gray'</span>, lw<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb11-52">plt.legend()</span>
<span id="cb11-53">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2D data with random projection direction"</span>)</span>
<span id="cb11-54">plt.axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"equal"</span>)</span>
<span id="cb11-55">plt.show()</span>
<span id="cb11-56"></span>
<span id="cb11-57"></span>
<span id="cb11-58"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ----------------------------</span></span>
<span id="cb11-59"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 2. Projection geometry comparison</span></span>
<span id="cb11-60"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ----------------------------</span></span>
<span id="cb11-61"></span>
<span id="cb11-62">datasets <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb11-63">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform Projection Geometry"</span>, X_uniform_2d),</span>
<span id="cb11-64">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian Projection Geometry"</span>, X_gaussian_2d)</span>
<span id="cb11-65">]</span>
<span id="cb11-66"></span>
<span id="cb11-67"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> title, X_subset <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> datasets:</span>
<span id="cb11-68">    plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb11-69">    plt.scatter(X_subset[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], X_subset[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb11-70"></span>
<span id="cb11-71">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># projection line</span></span>
<span id="cb11-72">    t_line <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb11-73">    plt.plot(</span>
<span id="cb11-74">        t_line<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>v[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>],</span>
<span id="cb11-75">        t_line<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>v[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],</span>
<span id="cb11-76">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'k--'</span>,</span>
<span id="cb11-77">        label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'projection line'</span></span>
<span id="cb11-78">    )</span>
<span id="cb11-79"></span>
<span id="cb11-80">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># projection segments</span></span>
<span id="cb11-81">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> X_subset:</span>
<span id="cb11-82">        proj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> v) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> v</span>
<span id="cb11-83">        plt.plot(</span>
<span id="cb11-84">            [x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], proj[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]],</span>
<span id="cb11-85">            [x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], proj[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]],</span>
<span id="cb11-86">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r-'</span>,</span>
<span id="cb11-87">            alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span></span>
<span id="cb11-88">        )</span>
<span id="cb11-89"></span>
<span id="cb11-90">    plt.legend()</span>
<span id="cb11-91">    plt.title(title)</span>
<span id="cb11-92">    plt.axis(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"equal"</span>)</span>
<span id="cb11-93">    plt.show()</span>
<span id="cb11-94"></span>
<span id="cb11-95"></span>
<span id="cb11-96"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ----------------------------</span></span>
<span id="cb11-97"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3. Histogram comparison</span></span>
<span id="cb11-98"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ----------------------------</span></span>
<span id="cb11-99">x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span>)</span>
<span id="cb11-100"></span>
<span id="cb11-101">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb11-102"></span>
<span id="cb11-103">plt.hist(</span>
<span id="cb11-104">    proj_uniform,</span>
<span id="cb11-105">    bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>,</span>
<span id="cb11-106">    density<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb11-107">    alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>,</span>
<span id="cb11-108">    label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Uniform projected"</span></span>
<span id="cb11-109">)</span>
<span id="cb11-110"></span>
<span id="cb11-111">plt.hist(</span>
<span id="cb11-112">    proj_gaussian,</span>
<span id="cb11-113">    bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>,</span>
<span id="cb11-114">    density<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb11-115">    alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>,</span>
<span id="cb11-116">    label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gaussian projected"</span></span>
<span id="cb11-117">)</span>
<span id="cb11-118"></span>
<span id="cb11-119">plt.plot(</span>
<span id="cb11-120">    x,</span>
<span id="cb11-121">    stats.norm.pdf(x),</span>
<span id="cb11-122">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'k--'</span>,</span>
<span id="cb11-123">    lw<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb11-124">    label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'N(0,1) density'</span></span>
<span id="cb11-125">)</span>
<span id="cb11-126"></span>
<span id="cb11-127">plt.legend()</span>
<span id="cb11-128">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1D projections along a random direction"</span>)</span>
<span id="cb11-129">plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-9-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-9-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-9-output-3.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-9-output-4.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="sigreg" class="level2">
<h2 class="anchored" data-anchor-id="sigreg">SIGReg</h2>
<p>SIGReg functions the same as above, except that it averages the test statistic over many random slices (<code>num_slices</code> in the code below). See the code block below, which is similar to the code in the paper.</p>
<div id="d721a524" class="cell" data-execution_count="11">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> SIGReg(x, num_slices<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">256</span>, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span>):</span>
<span id="cb12-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># x: (N, D) samples</span></span>
<span id="cb12-3">    N, D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.shape</span>
<span id="cb12-4">    device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x.device</span>
<span id="cb12-5"></span>
<span id="cb12-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Projection directions ---</span></span>
<span id="cb12-7">    A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.randn(D, num_slices, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>device)</span>
<span id="cb12-8">    A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/=</span> A.norm(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># normalize columns → unit directions</span></span>
<span id="cb12-9"></span>
<span id="cb12-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Project to 1D: shape → (N, num_slices)</span></span>
<span id="cb12-11">    X_proj <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">@</span> A</span>
<span id="cb12-12"></span>
<span id="cb12-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Integration points ---</span></span>
<span id="cb12-14">    t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.linspace(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, k, device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>device)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (k,)</span></span>
<span id="cb12-15">    phi_normal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (k,)</span></span>
<span id="cb12-16">    weight <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> phi_normal                          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Gaussian window</span></span>
<span id="cb12-17"></span>
<span id="cb12-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Broadcast shapes: (N, M, 1) ⋅ (1, 1, k)</span></span>
<span id="cb12-19">    X_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_proj.unsqueeze(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> t</span>
<span id="cb12-20"></span>
<span id="cb12-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Empirical characteristic function across samples</span></span>
<span id="cb12-22">    ecf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.exp(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">1j</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X_t).mean(dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (M, k)</span></span>
<span id="cb12-23"></span>
<span id="cb12-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Squared difference</span></span>
<span id="cb12-25">    diff_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (ecf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> phi_normal).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>()<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (M, k)</span></span>
<span id="cb12-26"></span>
<span id="cb12-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Weighted integration for all projections → shape (M,)</span></span>
<span id="cb12-28">    per_direction_T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.trapz(diff_sq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> weight, t, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N</span>
<span id="cb12-29"></span>
<span id="cb12-30">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># GLOBAL aggregation — MEAN instead of MAX</span></span>
<span id="cb12-31">    T_global <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> per_direction_T.mean()</span>
<span id="cb12-32"></span>
<span id="cb12-33">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> T_global</span></code></pre></div></div>
</details>
</div>
<p>Now let’s make a network’s 10-dimensional output standard Gaussian. We’ll plot a histogram along the first dimension, before and after training with the regularizer.</p>
<div id="ad4eec8a" class="cell" data-execution_count="12">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">device <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.device(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cuda"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.cuda.is_available() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cpu"</span>)</span>
<span id="cb13-2"></span>
<span id="cb13-3">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb13-4">hidden <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span>
<span id="cb13-5">D <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb13-6">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.rand((n,D))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># uniform(-5, 5)</span></span>
<span id="cb13-7"></span>
<span id="cb13-8">X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[:n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>].to(device)</span>
<span id="cb13-9">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X[n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:].to(device)</span>
<span id="cb13-10"></span>
<span id="cb13-11">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Sequential(</span>
<span id="cb13-12">    torch.nn.Linear(D, hidden),</span>
<span id="cb13-13">    torch.nn.ReLU(),</span>
<span id="cb13-14">    torch.nn.Linear(hidden, D)</span>
<span id="cb13-15">).to(device)</span>
<span id="cb13-16"></span>
<span id="cb13-17">Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(X_test)</span>
<span id="cb13-18"></span>
<span id="cb13-19"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> plot_output_distribution(model, X, bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>):</span>
<span id="cb13-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb13-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Plot model output histogram against standard normal PDF.</span></span>
<span id="cb13-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb13-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Args:</span></span>
<span id="cb13-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        model: PyTorch model</span></span>
<span id="cb13-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        X: input tensor</span></span>
<span id="cb13-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        bins: histogram bins</span></span>
<span id="cb13-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb13-28">    model.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">eval</span>()</span>
<span id="cb13-29"></span>
<span id="cb13-30">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> torch.no_grad():</span>
<span id="cb13-31">        Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model(X).squeeze().cpu().numpy()[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb13-32"></span>
<span id="cb13-33">    plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb13-34"></span>
<span id="cb13-35">    plt.hist(</span>
<span id="cb13-36">        Y,</span>
<span id="cb13-37">        bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bins,</span>
<span id="cb13-38">        density<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb13-39">        alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>,</span>
<span id="cb13-40">        label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Network outputs"</span></span>
<span id="cb13-41">    )</span>
<span id="cb13-42"></span>
<span id="cb13-43">    x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(Y.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, Y.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>)</span>
<span id="cb13-44"></span>
<span id="cb13-45">    plt.plot(</span>
<span id="cb13-46">        x,</span>
<span id="cb13-47">        stats.norm.pdf(x, loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>),</span>
<span id="cb13-48">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'r--'</span>,</span>
<span id="cb13-49">        linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb13-50">        label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$\m</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">athcal{N}</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">0,1</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">$</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;"> PDF'</span></span>
<span id="cb13-51">    )</span>
<span id="cb13-52"></span>
<span id="cb13-53">    plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Output value"</span>)</span>
<span id="cb13-54">    plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Density"</span>)</span>
<span id="cb13-55">    plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Distribution of Network Outputs"</span>)</span>
<span id="cb13-56">    plt.legend()</span>
<span id="cb13-57">    plt.grid(alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>)</span>
<span id="cb13-58">    plt.show()</span>
<span id="cb13-59"></span>
<span id="cb13-60"></span>
<span id="cb13-61">plot_output_distribution(model, X_test)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-11-output-1.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>Again, not very Gaussian. Let’s train using SIGReg:</p>
<div id="43d5385c" class="cell" data-execution_count="13">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">epochs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb14-2"></span>
<span id="cb14-3">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(model.parameters(), lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>)</span>
<span id="cb14-4">loss_fn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: SIGReg(x)</span>
<span id="cb14-5"></span>
<span id="cb14-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> epoch <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(epochs):</span>
<span id="cb14-7">    optimizer.zero_grad()</span>
<span id="cb14-8">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loss_fn(model(X_train))</span>
<span id="cb14-9">    loss.backward()</span>
<span id="cb14-10">    optimizer.step()</span>
<span id="cb14-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> (epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb14-12">      <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Epoch </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>epochs<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>item()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb14-13"></span>
<span id="cb14-14">Y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>model(X_test)</span>
<span id="cb14-15">plot_output_distribution(model, X_test)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Epoch 100/1000, Loss: 2.2469592094421387
Epoch 200/1000, Loss: 0.9632902145385742
Epoch 300/1000, Loss: 0.7503044605255127
Epoch 400/1000, Loss: 0.548032283782959
Epoch 500/1000, Loss: 0.42803165316581726
Epoch 600/1000, Loss: 0.4888719618320465
Epoch 700/1000, Loss: 0.4376848340034485
Epoch 800/1000, Loss: 0.37319087982177734
Epoch 900/1000, Loss: 0.3438933491706848
Epoch 1000/1000, Loss: 0.2580392360687256</code></pre>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><img src="https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/index_files/figure-html/cell-12-output-2.png" class="img-fluid figure-img"></p>
</figure>
</div>
</div>
</div>
<p>And now, after training, the output is noticeably more Gaussian. Adding this regularizer to an objective function therefore encourages Gaussian latent embeddings.</p>
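<p>To make concrete how the regularizer slots into a larger objective, here is a minimal, self-contained sketch: a condensed version of the SIGReg function above combined with a simple MSE prediction term. The MSE term, the embeddings, and the weighting factor <code>lam</code> are illustrative placeholders, not the exact setup from the paper.</p>

```python
import torch

def sigreg(x, num_slices=256, k=17):
    """Condensed version of the SIGReg function above."""
    N, D = x.shape
    # Random unit directions to slice the D-dimensional embeddings.
    A = torch.randn(D, num_slices, device=x.device)
    A /= A.norm(dim=0)
    proj = x @ A                                     # (N, num_slices)
    # Integration grid and N(0,1) characteristic function.
    t = torch.linspace(-5.0, 5.0, k, device=x.device)
    phi = torch.exp(-0.5 * t ** 2)
    # Empirical characteristic function along each slice: (num_slices, k).
    ecf = torch.exp(-1j * proj.unsqueeze(-1) * t).mean(dim=0)
    diff_sq = (ecf - phi).abs() ** 2
    # Weighted integral per slice, then mean over slices.
    return (torch.trapz(diff_sq * phi, t, dim=1) * N).mean()

# Illustrative combined objective: prediction loss + lam * SIGReg.
# `pred` and `target` stand in for predictor and target embeddings;
# `lam` is a placeholder weighting hyperparameter.
pred = torch.randn(128, 10, requires_grad=True)
target = torch.randn(128, 10)
lam = 0.1
loss = ((pred - target) ** 2).mean() + lam * sigreg(pred)
loss.backward()  # gradients flow through the regularizer
```

<p>Since the statistic is differentiable, the regularizer can simply be added to whatever prediction loss the model already uses and trained end-to-end.</p>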
<p>In a future blog post, I’d like to train a small JEPA model using SIGReg as in “LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels”<span class="citation" data-cites="maes2026">(Maes et al. 2026)</span>.</p>



</section>
</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-balestriero2025" class="csl-entry">
Balestriero, Randall, and Yann LeCun. 2025. <em>LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics</em>. <a href="https://arxiv.org/abs/2511.08544">https://arxiv.org/abs/2511.08544</a>.
</div>
<div id="ref-maes2026" class="csl-entry">
Maes, Lucas, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. 2026. <em>LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels</em>. <a href="https://arxiv.org/abs/2603.19312">https://arxiv.org/abs/2603.19312</a>.
</div>
</div></section></div> ]]></description>
  <category>ai</category>
  <category>code</category>
  <category>jepa</category>
  <guid>https://shonczinner.github.io/posts/sigreg-sketched-isotropic-gaussian-regularization/</guid>
  <pubDate>Sun, 26 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Motivation To Write</title>
  <dc:creator>Shon Czinner</dc:creator>
  <link>https://shonczinner.github.io/posts/welcome/</link>
  <description><![CDATA[ 




<p>I plan to write about AI, statistics, quantitative finance, economics, and more.</p>
<p>For my first post I’ve decided to write about why I’m writing.</p>
<ol type="1">
<li>Writing things down provides a reference for when I forget things.</li>
<li>Writing improves understanding.</li>
</ol>
<p>If other people find the things I write about interesting, that’s a bonus.</p>
<p>Feel free to get in touch if you find any mistakes!</p>



 ]]></description>
  <category>meta</category>
  <guid>https://shonczinner.github.io/posts/welcome/</guid>
  <pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
