Matrix Chernoff bound
For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose \{\mathbf{X}_k\} is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter t: : \Pr \left\{ \lambda_{\max} \left( \sum_k \mathbf{X}_k \right) \geq t \right\} The following theorems answer this general question under various assumptions; these assumptions are named below by analogy to their classical, scalar counterparts. All of these theorems can be found in , as the specific application of a general result which is derived below. A summary of related works is given.
Matrix Gaussian and Rademacher series
Self-adjoint matrices case
Consider a finite sequence \{\mathbf{A}_k\} of fixed, self-adjoint matrices with dimension d, and let \{\xi_k\} be a finite sequence of independent standard normal or independent Rademacher random variables.
Then, for all t\geq0, : \Pr \left\{ \lambda_{\text{max}} \left( \sum_k \xi_k \mathbf{A}_k \right) \geq t \right\} \leq d \cdot e^{-t^2/2\sigma^2} where : \sigma^2 = \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
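As a numerical illustration (not part of the original statement), the bound can be compared with a Monte Carlo estimate of the tail probability for a matrix Rademacher series. The following sketch assumes NumPy is available; the dimensions, seed, and variable names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, n_terms, n_trials = 5, 30, 2000

def sym(M):
    """Symmetrize a square matrix so that it is self-adjoint."""
    return (M + M.T) / 2

# Fixed self-adjoint matrices A_k (real symmetric, for simplicity).
A = [sym(rng.standard_normal((d, d))) for _ in range(n_terms)]

# Variance parameter sigma^2 = || sum_k A_k^2 || (spectral norm).
sigma2 = np.linalg.norm(sum(Ak @ Ak for Ak in A), ord=2)

t = 2.0 * np.sqrt(sigma2)
bound = d * np.exp(-t**2 / (2 * sigma2))

# Monte Carlo estimate of Pr{ lambda_max(sum_k xi_k A_k) >= t } with Rademacher signs xi_k.
hits = 0
for _ in range(n_trials):
    xi = rng.choice([-1.0, 1.0], size=n_terms)
    S = sum(x * Ak for x, Ak in zip(xi, A))
    hits += np.linalg.eigvalsh(S)[-1] >= t

print("empirical tail:", hits / n_trials, "  bound:", bound)
</syntaxhighlight>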
Rectangular case
Consider a finite sequence \{\mathbf{B}_k\} of fixed matrices with dimension d_1\times d_2, and let \{\xi_k\} be a finite sequence of independent standard normal or independent Rademacher random variables. Define the variance parameter : \sigma^2 = \max \left\{ \bigg\Vert \sum_k \mathbf{B}_k\mathbf{B}_k^* \bigg\Vert, \bigg\Vert \sum_k \mathbf{B}_k^*\mathbf{B}_k \bigg\Vert \right\}.
Then, for all t\geq0, : \Pr \left\{ \bigg\Vert \sum_k \xi_k \mathbf{B}_k \bigg\Vert \geq t \right\} \leq (d_1+d_2) \cdot e^{-t^2/2\sigma^2}.
Matrix Chernoff inequalities
The classical Chernoff bounds concern the sum of independent, nonnegative, and uniformly bounded random variables. In the matrix setting, the analogous theorem concerns a sum of positive-semidefinite random matrices subject to a uniform eigenvalue bound.
Matrix Chernoff I
Consider a finite sequence \{\mathbf{X}_k\} of independent, random, self-adjoint matrices with dimension d. Assume that each random matrix satisfies : \mathbf{X}_k \succeq \mathbf{0} \quad \text{and} \quad \lambda_{\text{max}}(\mathbf{X}_k) \leq R almost surely.
Define : \mu_{\text{min}} = \lambda_{\text{min}}\left( \sum_k \mathbb{E}\, \mathbf{X}_k \right) \quad \text{and} \quad \mu_{\text{max}} = \lambda_{\text{max}}\left( \sum_k \mathbb{E}\, \mathbf{X}_k \right). Then : \Pr \left\{ \lambda_{\text{min}}\left( \sum_k \mathbf{X}_k \right) \leq (1-\delta)\mu_{\text{min}} \right\} \leq d \cdot \left[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right]^{\mu_{\text{min}}/R} \quad \text{for } \delta\in [0,1)\text{, and} : \Pr \left\{ \lambda_{\text{max}}\left( \sum_k \mathbf{X}_k \right) \geq (1+\delta)\mu_{\text{max}} \right\} \leq d \cdot \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_{\text{max}}/R} \quad \text{for } \delta \geq 0.
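For concreteness, the two tail bounds can be evaluated directly once d, R, \mu_{\text{min}}, \mu_{\text{max}}, and \delta are known. The following helpers are a sketch; the function names and example parameters are assumptions of the illustration, not part of the theorem.

<syntaxhighlight lang="python">
import numpy as np

def chernoff_lower_tail(d, R, mu_min, delta):
    """Bound on Pr{ lambda_min(sum X_k) <= (1 - delta) mu_min }, valid for 0 <= delta < 1."""
    return d * (np.exp(-delta) / (1 - delta) ** (1 - delta)) ** (mu_min / R)

def chernoff_upper_tail(d, R, mu_max, delta):
    """Bound on Pr{ lambda_max(sum X_k) >= (1 + delta) mu_max }, valid for delta >= 0."""
    return d * (np.exp(delta) / (1 + delta) ** (1 + delta)) ** (mu_max / R)

# Example values: d = 100, R = 1, mu_min = mu_max = 50, delta = 0.5.
print(chernoff_lower_tail(100, 1.0, 50.0, 0.5))
print(chernoff_upper_tail(100, 1.0, 50.0, 0.5))
</syntaxhighlight>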
Matrix Chernoff II
Consider a sequence \{\mathbf{X}_k:k=1,2,\ldots,n\} of independent, random, self-adjoint matrices that satisfy : \mathbf{X}_k \succeq \mathbf{0} \quad \text{and} \quad \lambda_{\text{max}}(\mathbf{X}_k) \leq 1 almost surely.
Compute the minimum and maximum eigenvalues of the average expectation, : \bar{\mu}_{\text{min}} = \lambda_{\text{min}}\left( \frac{1}{n} \sum_{k=1}^n \mathbb{E}\, \mathbf{X}_k \right) \quad \text{and} \quad \bar{\mu}_{\text{max}} = \lambda_{\text{max}}\left( \frac{1}{n} \sum_{k=1}^n \mathbb{E}\, \mathbf{X}_k \right). Then : \Pr \left\{ \lambda_{\text{min}}\left( \frac{1}{n} \sum_{k=1}^n \mathbf{X}_k \right) \leq \alpha \right\} \leq d \cdot e^{-nD(\alpha \Vert \bar{\mu}_{\text{min}})} \quad \text{for } 0 \leq \alpha \leq \bar{\mu}_{\text{min}}\text{, and} : \Pr \left\{ \lambda_{\text{max}}\left( \frac{1}{n} \sum_{k=1}^n \mathbf{X}_k \right) \geq \alpha \right\} \leq d \cdot e^{-nD(\alpha \Vert \bar{\mu}_{\text{max}})} \quad \text{for } \bar{\mu}_{\text{max}} \leq \alpha \leq 1. The binary information divergence is defined as : D(a\Vert u) = a \left( \log a - \log u \right) + (1-a)\left( \log(1-a)-\log(1-u) \right) for a,u\in[0,1].
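The divergence D(a\Vert u) and the resulting bound are straightforward to compute. The sketch below is illustrative only; the function names and parameter values are assumptions of the example, not part of the theorem.

<syntaxhighlight lang="python">
import numpy as np

def binary_divergence(a, u):
    """D(a || u) = a (log a - log u) + (1 - a)(log(1 - a) - log(1 - u)) for a, u in (0, 1)."""
    return a * (np.log(a) - np.log(u)) + (1 - a) * (np.log(1 - a) - np.log(1 - u))

def chernoff_ii_bound(d, n, alpha, mu_bar):
    """d * exp(-n D(alpha || mu_bar)); apply only in the ranges of alpha stated in the theorem."""
    return d * np.exp(-n * binary_divergence(alpha, mu_bar))

# Example: d = 50, n = 200 summands, average expectation eigenvalue 0.4, deviation to alpha = 0.25.
print(chernoff_ii_bound(50, 200, 0.25, 0.4))
</syntaxhighlight>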
Matrix Bennett and Bernstein inequalities
In the scalar setting, the Bennett and Bernstein inequalities describe the upper tail of a sum of independent, zero-mean random variables that are either bounded or subexponential. In the matrix case, the analogous results concern a sum of zero-mean random matrices.
Bounded case
Consider a finite sequence \{\mathbf{X}_k\} of independent, random, self-adjoint matrices with dimension d. Assume that each random matrix satisfies : \mathbb{E}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \lambda_{\text{max}}(\mathbf{X}_k) \leq R almost surely.
Compute the norm of the total variance, : \sigma^2 = \bigg\Vert \sum_k \mathbb{E}\, (\mathbf{X}^2_k) \bigg\Vert.
Then, the following chain of inequalities holds for all t \geq 0: : \begin{align} \Pr \left\{ \lambda_{\text{max}} \left( \sum_k \mathbf{X}_k \right) \geq t \right\} & \leq d \cdot \exp \left( -\frac{\sigma^2}{R^2} \cdot h\left( \frac{Rt}{\sigma^2} \right) \right) \\ & \leq d \cdot \exp \left( \frac{-t^2/2}{\sigma^2+Rt/3} \right) \\ & \leq \begin{cases} d \cdot \exp ( -3t^2/8\sigma^2 ) \quad & \text{for } t\leq \sigma^2/R; \\ d \cdot \exp ( -3t/8R ) \quad & \text{for } t\geq \sigma^2/R. \end{cases} \end{align} The function h(u) is defined as h(u) = (1+u) \log (1+u)-u for u \geq 0.
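The chain can be checked numerically for specific parameters. The following sketch (illustrative names and values, not from the source) evaluates the Bennett-type bound, the Bernstein-type relaxation, and the two-regime split, which should appear in increasing order.

<syntaxhighlight lang="python">
import numpy as np

def h(u):
    """h(u) = (1 + u) log(1 + u) - u for u >= 0."""
    return (1 + u) * np.log1p(u) - u

def bennett_bound(d, sigma2, R, t):
    return d * np.exp(-(sigma2 / R**2) * h(R * t / sigma2))

def bernstein_bound(d, sigma2, R, t):
    return d * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))

def split_bound(d, sigma2, R, t):
    return d * np.exp(-3 * t**2 / (8 * sigma2)) if t <= sigma2 / R else d * np.exp(-3 * t / (8 * R))

d, sigma2, R = 10, 4.0, 1.0
for t in (1.0, 2.0, 8.0):
    print(t, bennett_bound(d, sigma2, R, t), bernstein_bound(d, sigma2, R, t), split_bound(d, sigma2, R, t))
</syntaxhighlight>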
A related bound for the empirical second-moment matrix of i.i.d. bounded random vectors is the following. Consider a sequence \{\mathbf{X}_k\}_{k=1}^n of independent and identically distributed random column vectors in \mathbb{R}^d. Assume that each random vector satisfies \Vert \mathbf{X}_k \Vert_2 \leq M almost surely, and \Vert \mathbb{E}[\mathbf{X}_k \mathbf{X}_k^T] \Vert_2 \leq 1. Then, for all t\geq0, : \Pr \left\{ \bigg\Vert \frac{1}{n} \sum_{k=1}^n \mathbf{X}_k \mathbf{X}_k^T - \mathbb{E}[\mathbf{X}_1 \mathbf{X}_1^T] \bigg\Vert_2 \geq t \right\} \leq (2 \min(d, n))^2 \cdot \exp \left(-\frac{n (t-1)}{4 M^2} \right)
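As a hypothetical numerical check (not from the source), one can sample bounded i.i.d. vectors, form the empirical second-moment matrix, and compare its observed deviation with the stated bound. Here the vectors are drawn uniformly from the sphere of radius M = \sqrt{d}, so that \mathbb{E}[\mathbf{X}\mathbf{X}^T] is the identity and the assumptions hold exactly.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d, n = 20, 5000

# Vectors uniform on the sphere of radius M = sqrt(d), so E[X X^T] = I and ||E[X X^T]||_2 = 1.
M = np.sqrt(d)
X = rng.standard_normal((n, d))
X = M * X / np.linalg.norm(X, axis=1, keepdims=True)

emp = X.T @ X / n                                   # empirical second-moment matrix
deviation = np.linalg.norm(emp - np.eye(d), ord=2)  # spectral-norm deviation from E[X X^T]

t = 1.5
bound = (2 * min(d, n)) ** 2 * np.exp(-n * (t - 1) / (4 * M**2))
print("observed deviation:", deviation, "  bound at t =", t, ":", bound)
</syntaxhighlight>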
Subexponential case
Consider a finite sequence \{\mathbf{X}_k\} of independent, random, self-adjoint matrices with dimension d. Assume that : \mathbb{E}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbb{E}\,(\mathbf{X}_k^p) \preceq \frac{p!}{2}\cdot R^{p-2} \mathbf{A}_k^2 for p=2,3,4,\ldots.
Compute the variance parameter, : \sigma^2 = \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
Then, the following chain of inequalities holds for all t \geq 0: : \begin{align} \Pr \left\{ \lambda_{\text{max}} \left( \sum_k \mathbf{X}_k \right) \geq t \right\} & \leq d \cdot \exp \left( \frac{-t^2/2}{\sigma^2+Rt} \right) \\ & \leq \begin{cases} d \cdot \exp ( -t^2/4\sigma^2 ) \quad & \text{for } t\leq \sigma^2/R; \\ d \cdot \exp ( -t/4R ) \quad & \text{for } t\geq \sigma^2/R. \end{cases} \end{align}
Rectangular case
Consider a finite sequence \{\mathbf{Z}_k\} of independent, random matrices with dimension d_1\times d_2. Assume that each random matrix satisfies : \mathbb{E}\,\mathbf{Z}_k = \mathbf{0} \quad \text{and} \quad \Vert \mathbf{Z}_k \Vert \leq R almost surely. Define the variance parameter : \sigma^2 = \max \left\{ \bigg\Vert \sum_k \mathbb{E}\,(\mathbf{Z}_k\mathbf{Z}_k^*) \bigg\Vert, \bigg\Vert \sum_k \mathbb{E}\, (\mathbf{Z}_k^*\mathbf{Z}_k) \bigg\Vert \right\}.
Then, for all t \geq 0, the bound : \Pr \left\{ \bigg\Vert \sum_k \mathbf{Z}_k \bigg\Vert \geq t \right\} \leq (d_1+d_2) \cdot \exp \left( \frac{-t^2 / 2}{\sigma^2+Rt/3} \right) holds.
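A short sketch (illustrative, not from the source) shows how the variance parameter and the rectangular bound are computed for summands of the form \mathbf{Z}_k = \xi_k \mathbf{B}_k with Rademacher signs, for which the expectations reduce to deterministic sums.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
d1, d2, n_terms = 8, 5, 40

# Fixed rectangular matrices B_k; the random summands are Z_k = xi_k B_k with E[xi_k^2] = 1.
B = [rng.standard_normal((d1, d2)) / np.sqrt(n_terms) for _ in range(n_terms)]
R = max(np.linalg.norm(Bk, ord=2) for Bk in B)   # almost-sure bound on ||Z_k||

# sigma^2 = max{ ||sum E[Z_k Z_k^*]||, ||sum E[Z_k^* Z_k]|| } reduces to deterministic sums here.
sigma2 = max(np.linalg.norm(sum(Bk @ Bk.T for Bk in B), ord=2),
             np.linalg.norm(sum(Bk.T @ Bk for Bk in B), ord=2))

t = 10.0
bound = (d1 + d2) * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
print("sigma^2:", sigma2, "  bound at t =", t, ":", bound)
</syntaxhighlight>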
Matrix Azuma, Hoeffding, and McDiarmid inequalities
Matrix Azuma
The scalar version of Azuma's inequality states that a scalar martingale exhibits normal concentration about its mean value, and the scale for deviations is controlled by the total maximum squared range of the difference sequence. The following is the extension to the matrix setting.
Consider a finite adapted sequence \{\mathbf{X}_k\} of self-adjoint matrices with dimension d, and a fixed sequence \{\mathbf{A}_k\} of self-adjoint matrices that satisfy : \mathbb{E}_{k-1}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbf{X}_k^2 \preceq \mathbf{A}_k^2 almost surely.
Compute the variance parameter : \sigma^2 = \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
Then, for all t \geq 0, : \Pr \left\{ \lambda_{\text{max}} \left( \sum_k \mathbf{X}_k \right) \geq t \right\} \leq d \cdot e^{-t^2/8\sigma^2}
The constant 1/8 can be improved to 1/2 when there is additional information available. One case occurs when each summand \mathbf{X}_k is conditionally symmetric. Another example requires the assumption that \mathbf{X}_k commutes almost surely with \mathbf{A}_k .
Matrix Hoeffding
Adding the assumption that the summands in the matrix Azuma inequality are independent gives a matrix extension of Hoeffding's inequalities.
Consider a finite sequence \{\mathbf{X}_k\} of independent, random, self-adjoint matrices with dimension d, and let \{\mathbf{A}_k\} be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies : \mathbb{E}\,\mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbf{X}_k^2 \preceq \mathbf{A}_k^2 almost surely.
Then, for all t \geq 0, : \Pr \left\{ \lambda_{\text{max}} \left( \sum_k \mathbf{X}_k \right) \geq t \right\} \leq d \cdot e^{-t^2/8\sigma^2} where : \sigma^2 = \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
An improvement of this result was established in : for all t \geq 0, : \Pr \left\{ \lambda_{\text{max}} \left( \sum_k \mathbf{X}_k \right) \geq t \right\} \leq d \cdot e^{-t^2/2\sigma^2} where : \sigma^2 = \frac{1}{2}\bigg\Vert \sum_k \mathbf{A}^2_k + \mathbb{E}\,\mathbf{X}^2_k \bigg\Vert \leq \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
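The improvement is easy to see numerically: for any concrete choice of dominating matrices, the new variance parameter is never larger than the original one. The following sketch (an illustrative construction, not from the source) uses summands \mathbf{X}_k = \xi_k \mathbf{C}_k with Rademacher \xi_k, so that \mathbb{E}\,\mathbf{X}_k^2 = \mathbf{C}_k^2.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
d, n_terms = 6, 25

A, EX2 = [], []
for _ in range(n_terms):
    C = rng.standard_normal((d, d))
    C = (C + C.T) / 2                                              # self-adjoint C_k
    A.append(2 * np.abs(np.linalg.eigvalsh(C)).max() * np.eye(d))  # A_k chosen so that C_k^2 <= A_k^2
    EX2.append(C @ C)                                              # E[X_k^2] = C_k^2

sigma2_original = np.linalg.norm(sum(Ak @ Ak for Ak in A), ord=2)
sigma2_improved = 0.5 * np.linalg.norm(sum(Ak @ Ak + E2 for Ak, E2 in zip(A, EX2)), ord=2)
print("original sigma^2:", sigma2_original, "  improved sigma^2:", sigma2_improved)
</syntaxhighlight>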
Matrix bounded difference (McDiarmid)
In the scalar setting, McDiarmid's inequality bounds the deviations of a function of independent random variables with bounded differences, and it is commonly obtained by applying Azuma's inequality to a Doob martingale. A version of the bounded differences inequality holds in the matrix setting.
Let \{Z_k:k=1,2,\ldots,n\} be an independent family of random variables, and let \mathbf{H} be a function that maps n variables to a self-adjoint matrix of dimension d. Consider a sequence \{\mathbf{A}_k\} of fixed self-adjoint matrices that satisfy : \left( \mathbf{H}(z_1,\ldots,z_k,\ldots,z_n) - \mathbf{H}(z_1,\ldots,z'_k,\ldots,z_n) \right)^2 \preceq \mathbf{A}_k^2, where z_i and z'_i range over all possible values of Z_i for each index i. Compute the variance parameter : \sigma^2 = \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
Then, for all t \geq 0, : \Pr \left\{ \lambda_{\text{max}} \left( \mathbf{H}(\mathbf{z}) - \mathbb{E}\,\mathbf{H}(\mathbf{z}) \right) \geq t \right\} \leq d \cdot e^{-t^2/8\sigma^2}, where \mathbf{z}=(Z_1,\ldots,Z_n).
An improvement of this result was established in (see also ): for all t \geq 0, : \Pr \left\{ \lambda_{\text{max}} \left( \mathbf{H}(\mathbf{z}) - \mathbb{E}\,\mathbf{H}(\mathbf{z}) \right) \geq t \right\} \leq d \cdot e^{-t^2/\sigma^2}, where \mathbf{z}=(Z_1,\ldots,Z_n) and \sigma^2 = \bigg\Vert \sum_k \mathbf{A}^2_k \bigg\Vert.
Derivation and proof
Ahlswede and Winter
The Laplace transform argument found in is a significant result in its own right: Let \mathbf{Y} be a random self-adjoint matrix. Then : \Pr \left\{ \lambda_{\max} ( \mathbf{Y} ) \geq t \right\} \leq \inf_{\theta > 0} \left\{ e^{-\theta t} \cdot \operatorname{E} \left[ \operatorname{tr} e^{\theta \mathbf{Y}} \right] \right\}.
To prove this, fix \theta > 0. Then : \begin{align} \Pr \left\{ \lambda_{\max} (\mathbf{Y}) \geq t \right\} &= \Pr \left\{ \lambda_{\max} (\theta \mathbf{Y}) \geq \theta t \right\} \\ &= \Pr \left\{ e^{\lambda_{\max} (\theta \mathbf{Y})} \geq e^{\theta t} \right\} \\ &\leq e^{-\theta t} \operatorname{E} e^{\lambda_{\max} (\theta \mathbf{Y})} \\ &\leq e^{-\theta t} \operatorname{E} \operatorname{tr} e^{\theta \mathbf{Y}} \end{align} The second-to-last inequality is Markov's inequality. The last inequality holds since e^{\lambda_{\max} (\theta \mathbf{Y})} = \lambda_{\max} (e^{\theta \mathbf{Y}}) \leq \operatorname{tr} (e^{\theta \mathbf{Y}}). Since the left-most quantity is independent of \theta, the infimum over \theta > 0 remains an upper bound for it.
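The last two steps can be checked numerically for a fixed symmetric matrix. The sketch below (assuming NumPy and SciPy; all names illustrative) confirms that e^{\lambda_{\max}(\theta\mathbf{Y})} equals \lambda_{\max}(e^{\theta\mathbf{Y}}) and is dominated by \operatorname{tr} e^{\theta\mathbf{Y}}.

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
Y = rng.standard_normal((5, 5))
Y = (Y + Y.T) / 2                                   # a fixed self-adjoint Y
theta = 0.7

lhs = np.exp(theta * np.linalg.eigvalsh(Y)[-1])     # e^{lambda_max(theta Y)}
E = expm(theta * Y)
mid = np.linalg.eigvalsh(E)[-1]                     # lambda_max(e^{theta Y}); equals lhs
rhs = np.trace(E)                                   # tr e^{theta Y}; dominates both
print(lhs, mid, rhs)
</syntaxhighlight>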
Thus, our task is to understand \operatorname{E} [\operatorname{tr} (e^{\theta \mathbf{Y}})]. Since trace and expectation are both linear, we can commute them, so it is sufficient to consider \operatorname{E}\, e^{\theta \mathbf{Y}} := \mathbf{M}_\mathbf{Y}(\theta), which we call the matrix generating function. This is where the methods of and diverge. The immediately following presentation follows .
The Golden–Thompson inequality implies that : \operatorname{tr} \mathbf{M}_{\mathbf{X}_1 + \mathbf{X}_2}(\theta) \leq \operatorname{tr} \left[ \left( \operatorname{E} e^{\theta \mathbf{X}_1} \right) \left( \operatorname{E} e^{\theta \mathbf{X}_2} \right) \right] = \operatorname{tr} \mathbf{M}_{\mathbf{X}_1} (\theta) \mathbf{M}_{\mathbf{X}_2}(\theta), where we used the linearity of expectation several times. Suppose \mathbf{Y} = \sum_k \mathbf{X}_k. We can find an upper bound for \operatorname{tr} \mathbf{M}_{\mathbf{Y}} (\theta) by iterating this result. Noting that \operatorname{tr}(\mathbf{AB}) \leq \operatorname{tr}(\mathbf{A}) \lambda_{\max}(\mathbf{B}) for positive-semidefinite \mathbf{A} and self-adjoint \mathbf{B}, we have : \operatorname{tr} \mathbf{M}_\mathbf{Y} (\theta) \leq \operatorname{tr} \left[ \left( \operatorname{E} e^{\sum_{k=1}^{n-1} \theta \mathbf{X}_k} \right) \left( \operatorname{E} e^{\theta \mathbf{X}_n} \right) \right] \leq \operatorname{tr} \left( \operatorname{E} e^{\sum_{k=1}^{n-1} \theta \mathbf{X}_k} \right) \lambda_{\max} ( \operatorname{E} e^{\theta \mathbf{X}_n}). Iterating this, we get : \operatorname{tr} \mathbf{M}_\mathbf{Y} (\theta) \leq (\operatorname{tr} \mathbf{I}) \left[ \prod_k \lambda_{\max} (\operatorname{E} e^{\theta \mathbf{X}_k}) \right] = d \, e^{\sum_k \lambda_{\max} \left( \log \operatorname{E} e^{\theta \mathbf{X}_k} \right)}
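The Golden–Thompson inequality itself is easy to verify numerically for random symmetric matrices. The following sketch (illustrative, assuming SciPy's matrix exponential) prints \operatorname{tr} e^{\mathbf{A}+\mathbf{B}} followed by \operatorname{tr}(e^{\mathbf{A}} e^{\mathbf{B}}).

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2
B = rng.standard_normal((4, 4)); B = (B + B.T) / 2

lhs = np.trace(expm(A + B))          # tr e^{A+B}
rhs = np.trace(expm(A) @ expm(B))    # tr(e^A e^B); Golden-Thompson asserts lhs <= rhs
print(lhs, rhs)
</syntaxhighlight>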
So far we have found a bound with an infimum over \theta. In turn, this can be bounded. At any rate, one can see how the Ahlswede–Winter bound arises as the sum of largest eigenvalues.
Tropp
The major contribution of is the application of Lieb's theorem where Ahlswede and Winter had applied the Golden–Thompson inequality. Tropp's corollary is the following: If \mathbf{H} is a fixed self-adjoint matrix and \mathbf{X} is a random self-adjoint matrix, then : \operatorname{E} \operatorname{tr} e^{\mathbf{H}+\mathbf{X}} \leq \operatorname{tr} e^{\mathbf{H} + \log( \operatorname{E} e^{\mathbf{X}} )} Proof: Let \mathbf{Y} = e^\mathbf{X}. Then Lieb's theorem tells us that : f(\mathbf{Y}) = \operatorname{tr} e^{\mathbf{H} + \log(\mathbf{Y})} is concave. The final step is to use Jensen's inequality to move the expectation inside the function: : \operatorname{E} \operatorname{tr} e^{\mathbf{H} + \log(\mathbf{Y})} \leq \operatorname{tr} e^{\mathbf{H} + \log(\operatorname{E} \mathbf{Y})}.
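Tropp's corollary can be sanity-checked numerically when \mathbf{X} has a simple discrete distribution, so that both sides can be computed exactly. The sketch below (illustrative, assuming SciPy's expm and logm) uses a two-point distribution.

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(6)
d = 4
H = rng.standard_normal((d, d)); H = (H + H.T) / 2
X1 = rng.standard_normal((d, d)); X1 = (X1 + X1.T) / 2
X2 = rng.standard_normal((d, d)); X2 = (X2 + X2.T) / 2   # X equals X1 or X2, each with probability 1/2

lhs = 0.5 * np.trace(expm(H + X1)) + 0.5 * np.trace(expm(H + X2))   # E tr e^{H+X}
log_mean = logm(0.5 * expm(X1) + 0.5 * expm(X2)).real               # log E e^X
rhs = np.trace(expm(H + log_mean))                                  # tr e^{H + log E e^X}
print(lhs, rhs)   # the corollary asserts lhs <= rhs
</syntaxhighlight>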
This gives us the major result of the paper: the subadditivity of the log of the matrix generating function.
Subadditivity of log mgf
Let \{\mathbf{X}_k\} be a finite sequence of independent, random self-adjoint matrices. Then for all \theta \in \mathbb{R}, : \operatorname{tr} \mathbf{M}_{\sum_k \mathbf{X}_k} (\theta) \leq \operatorname{tr} e^{\sum_k \log \mathbf{M}_{\mathbf{X}_k} (\theta)}
Proof: It is sufficient to let \theta=1. Expanding the definitions, we need to show that : \operatorname{E} \operatorname{tr} e^{\sum_k \theta \mathbf{X}_k } \leq \operatorname{tr} e^{\sum_k \log \operatorname{E} e^{\theta \mathbf{X}_k} }.
To complete the proof, we use the law of total expectation. Let \operatorname{E}_k be the expectation conditioned on \mathbf{X}_1, \ldots, \mathbf{X}_k. Since we assume all the \mathbf{X}_i are independent, : \operatorname{E}_{k-1} e^{\mathbf{X}_k} = \operatorname{E} e^{\mathbf{X}_k}. Define \mathbf{\Xi}_k = \log \operatorname{E}_{k-1} e^{\mathbf{X}_k} = \log \mathbf{M}_{\mathbf{X}_k}(\theta).
Finally, we have : \begin{align} \operatorname{E} \operatorname{tr} e^{\sum_{k=1}^n \mathbf{X}_k} & = \operatorname{E}_0 \cdots \operatorname{E}_{n-1} \operatorname{tr} e^{\sum_{k=1}^{n-1} \mathbf{X}_k + \mathbf{X}_n } \\ &\leq \operatorname{E}_0 \cdots \operatorname{E}_{n-2} \operatorname{tr} e^{\sum_{k=1}^{n-1} \mathbf{X}_k + \log(\operatorname{E}_{n-1} e^{\mathbf{X}_n} ) } \\ &= \operatorname{E}_0 \cdots \operatorname{E}_{n-2} \operatorname{tr} e^{\sum_{k=1}^{n-2} \mathbf{X}_k + \mathbf{X}_{n-1} + \mathbf{\Xi}_n} \\ & \vdots \\ & = \operatorname{tr} e^{\sum_{k=1}^n \mathbf{\Xi}_k} \end{align} where at every step m we use Tropp's corollary with : \mathbf{H}_m = \sum_{k=1}^{m-1} \mathbf{X}_k + \sum_{k=m+1}^n \mathbf{\Xi}_k
Master tail bound
The following is immediate from the previous result: : \Pr \left\{ \lambda_\max \left( \sum_k \mathbf{X}_k \right) \geq t \right\} \leq \inf_{\theta > 0} \left\{ e^{-\theta t} \operatorname{tr} e^{\sum_k \log \mathbf{M}_{\mathbf{X}_k} (\theta) } \right\} All of the theorems given above are derived from this bound; the theorems consist in various ways to bound the infimum. These steps are significantly simpler than the proofs given.
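In simple cases the master bound can be evaluated directly by a grid search over \theta > 0. The sketch below (illustrative, not from the source) does this for summands \mathbf{X}_k = \xi_k \mathbf{A}_k with Rademacher signs, for which \log \mathbf{M}_{\mathbf{X}_k}(\theta) = \log\cosh(\theta\mathbf{A}_k) can be computed spectrally.

<syntaxhighlight lang="python">
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(7)
d, n_terms, t = 5, 30, 6.0

def sym(M):
    return (M + M.T) / 2

A = [sym(rng.standard_normal((d, d))) / np.sqrt(n_terms) for _ in range(n_terms)]

def log_mgf(Ak, theta):
    """log E exp(theta * xi * Ak) = log cosh(theta * Ak) for a Rademacher sign xi."""
    w, V = np.linalg.eigh(Ak)
    return V @ np.diag(np.log(np.cosh(theta * w))) @ V.T

thetas = np.linspace(0.01, 3.0, 150)
bounds = [np.exp(-th * t) * np.trace(expm(sum(log_mgf(Ak, th) for Ak in A)))
          for th in thetas]
print("master tail bound at t =", t, ":", min(bounds))
</syntaxhighlight>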
References
- Oliveira, Roberto. (January 2010). "Sums of random Hermitian matrices and an inequality by Rudelson". Electronic Communications in Probability.