intuitive explanation)

18

If $X$ has full rank, the inverse of $X^TX$ exists and we obtain the least squares estimate

$$\hat\beta = (X^TX)^{-1}X^TY$$

with

$$\operatorname{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}.$$

How can we intuitively explain the $(X^TX)^{-1}$ term in the variance formula? The derivation technique itself is clear to me.

regression variance least-squares

— Daniel Yefimov

3

You may want to add a note pointing out that the formula you have stated for the variance-covariance matrix of $\hat{\beta}$ (assuming $\hat{\beta}$ is estimated by OLS) is correct only if the conditions of the Gauss-Markov theorem are satisfied and, in particular, only if the variance-covariance matrix of the error terms is given by $\sigma^2 I_n$, where $I_n$ is the $n\times n$ identity matrix and $n$ is the number of rows of $X$ (and $Y$). The formula you have provided is not correct for the more general case of non-spherical errors.

— Flash

13

Consider a simple regression without a constant term, where the single regressor is centered on its sample mean. Then $X'X$ is ($n$ times) its sample variance, and $(X'X)^{-1}$ is its reciprocal. So the higher the variance (variability) of the regressor, the lower the variance of the coefficient estimator: the more variability we have in the explanatory variable, the more accurately we can estimate the unknown coefficient.

Why? Because the more a regressor varies, the more information it contains. When there are many regressors, this generalizes to the inverse of their variance-covariance matrix, which also takes into account the co-variability of the regressors. In the extreme case where $X'X$ is diagonal, the precision of each estimated coefficient depends only on the variance/variability of the associated regressor (given the variance of the error term).
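As a small numerical illustration of the diagonal versus non-diagonal case, here is a minimal sketch in NumPy. The regressor values and the value `sigma2 = 1.0` are made up for illustration; they are not from the question.

```python
import numpy as np

sigma2 = 1.0  # assumed error variance (illustrative value)

# Two centered regressors that are orthogonal to each other, so X'X is diagonal.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])        # sum of squares = 10
x2 = 3.0 * np.array([1.0, -2.0, 0.0, 2.0, -1.0])  # sum of squares = 90

X_orth = np.column_stack([x1, x2])
cov_orth = sigma2 * np.linalg.inv(X_orth.T @ X_orth)
var_orth = np.diag(cov_orth)

# With a diagonal X'X, each variance is just sigma^2 / (sum of squares of that regressor).
per_regressor = sigma2 / np.sum(X_orth**2, axis=0)

# Now make the second regressor co-vary with the first one.
X_corr = np.column_stack([x1, x1 + x2])
cov_corr = sigma2 * np.linalg.inv(X_corr.T @ X_corr)

print("orthogonal case :", var_orth, "vs", per_regressor)
print("correlated case :", np.diag(cov_corr))
print("sigma^2 / sum(x1^2):", sigma2 / (x1 @ x1))
```

In the orthogonal case the more variable regressor gets the smaller coefficient variance. In the correlated case the variance of the first coefficient exceeds $\sigma^2/\sum x_{1i}^2$, which is the co-variability effect mentioned above.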

— Alecos Papadopoulos

Could you relate this argument to the fact that the inverse of the variance-covariance matrix yields the partial correlations?

— Heisenberg

5

A simple way of viewing $\sigma^2 \left(\mathbf{X}^{T} \mathbf{X} \right)^{-1}$ is as the matrix (multivariate) analogue of $\frac{\sigma^2}{\sum_{i=1}^n \left(X_i-\bar{X}\right)^2}$, which is the variance of the slope coefficient in simple OLS regression. One can even get $\frac{\sigma^2}{\sum_{i=1}^n X_i^2}$ for that variance by omitting the intercept from the model, i.e. by performing regression through the origin.

From either of these formulas it can be seen that larger variability of the predictor variable will in general lead to more precise estimation of its coefficient. This idea is often exploited in the design of experiments, where, by choosing the values of the (non-random) predictors, one tries to make the determinant of $\left(\mathbf{X}^{T} \mathbf{X} \right)$ as large as possible, the determinant being a measure of variability.
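The correspondence between the matrix formula and the two scalar formulas can be checked numerically. The sketch below uses made-up values for `x` and `sigma2`; it only illustrates the algebra and is not part of the original answer.

```python
import numpy as np

sigma2 = 2.0                               # assumed error variance (illustrative)
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])   # made-up predictor values
n = x.size

# Model with intercept: the slope variance is the (1, 1) entry of sigma^2 (X'X)^{-1}.
X_int = np.column_stack([np.ones(n), x])
cov_int = sigma2 * np.linalg.inv(X_int.T @ X_int)
slope_var_with_intercept = cov_int[1, 1]
scalar_with_intercept = sigma2 / np.sum((x - x.mean()) ** 2)

# Regression through the origin: X is a single column, X'X is the scalar sum of x_i^2.
X_origin = x.reshape(-1, 1)
slope_var_origin = (sigma2 * np.linalg.inv(X_origin.T @ X_origin))[0, 0]
scalar_origin = sigma2 / np.sum(x ** 2)

# For the intercept model, det(X'X) = n * sum (x_i - xbar)^2, a measure of spread in x.
det = np.linalg.det(X_int.T @ X_int)

print(slope_var_with_intercept, scalar_with_intercept)
print(slope_var_origin, scalar_origin)
print(det, n * np.sum((x - x.mean()) ** 2))
```

Each printed pair should agree up to floating-point error, which shows in a small case why maximizing the determinant of $\mathbf{X}^T\mathbf{X}$ shrinks the slope variance.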

— JohnK

2

Would the linear transformation of a Gaussian random variable help? Use the rule that if $x \sim \mathcal{N}(\mu,\Sigma)$, then $Ax + b \sim \mathcal{N}(A\mu + b, A\Sigma A^T)$.

Assume that the underlying model is $Y = X\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. Then

$$
\begin{aligned}
Y &\sim \mathcal{N}(X\beta,\ \sigma^2 I_n)\\
X^TY &\sim \mathcal{N}(X^TX\beta,\ \sigma^2 X^TX)\\
(X^TX)^{-1}X^TY &\sim \mathcal{N}\left[\beta,\ \sigma^2 (X^TX)^{-1}\right]
\end{aligned}
$$

So $(X^TX)^{-1}X^T$ is just a complicated scaling matrix that transforms the distribution of $Y$.
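A quick Monte Carlo check that $(X^TX)^{-1}X^TY$ has covariance $\sigma^2(X^TX)^{-1}$ is sketched below. The design matrix, `beta`, `sigma`, the seed and the number of replications are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 50, 1.5                       # illustrative sample size and error s.d.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])        # made-up true coefficients

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                        # the "scaling matrix" applied to Y
theory = sigma**2 * XtX_inv

# Exact algebra: Var(A Y) = A (sigma^2 I) A^T = sigma^2 (X'X)^{-1}.
via_rule = sigma**2 * (A @ A.T)

# Simulation: draw many Y vectors for the same fixed X and estimate beta each time.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))
Y = X @ beta + eps                       # shape (reps, n)
beta_hats = Y @ A.T                      # shape (reps, 3), one estimate per row
empirical = np.cov(beta_hats, rowvar=False)

print(np.round(theory, 4))
print(np.round(empirical, 4))
```

The two printed matrices should be close, with the gap shrinking as `reps` grows.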

Hope that was helpful.

— kedarps

Nothing in the derivation of the OLS estimator and its variance requires normality of the error terms. All that's required is $E(\varepsilon)=0$ and $E(\varepsilon\varepsilon^T)=\sigma^2 I_n$. (Of course, normality is required to show that OLS achieves the Cramer-Rao lower bound, but that's not what the OP's posting is about, is it?)
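For reference, the two moment conditions are enough to carry the derivation through. A short sketch, assuming additionally that $X$ is nonstochastic with full column rank and that the model is $Y = X\beta + \varepsilon$:

$$
\begin{aligned}
\hat\beta &= (X^TX)^{-1}X^T(X\beta+\varepsilon) = \beta + (X^TX)^{-1}X^T\varepsilon,\\
E(\hat\beta) &= \beta + (X^TX)^{-1}X^T E(\varepsilon) = \beta,\\
\operatorname{Var}(\hat\beta) &= (X^TX)^{-1}X^T\,E(\varepsilon\varepsilon^T)\,X(X^TX)^{-1}
= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}.
\end{aligned}
$$

No distributional assumption beyond the first two moments of $\varepsilon$ is used at any step.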

— Mico

2

I'll take a different approach towards developing the intuition that underlies the formula $\text{Var}\,\hat{\beta}=\sigma^2 (X'X)^{-1}$ . When developing intuition for the multiple regression model, it's helpful to consider the bivariate linear regression model, viz.,

$$y_i=\alpha+\beta x_i + \varepsilon_i, \quad i=1,\ldots,n.$$

$\alpha+\beta x_i$ is frequently called the deterministic contribution to $y_i$, and $\varepsilon_i$ is called the stochastic contribution. Expressed in terms of deviations from the sample means $(\bar{x},\bar{y})$, this model may also be written as

$$(y_i-\bar{y}) = \beta(x_i-\bar{x})+(\varepsilon_i-\bar{\varepsilon}), \quad i=1,\ldots,n.$$

To help develop the intuition, we will assume that the simplest Gauss-Markov assumptions are satisfied: $x_i$ nonstochastic, $\sum_{i=1}^n(x_i-\bar{x})^2>0$ for all $n$ , and $\varepsilon_i \sim \text{iid}(0,\sigma^2)$ for all $i=1,\ldots,n$ . As you already know very well, these conditions guarantee that

$$\text{Var}\,\hat{\beta}=\tfrac{1}{n}\sigma^2(\text{Var}\,x)^{-1}\text{,}$$

where $\text{Var}\,x$ is the sample variance of $x$. In words, this formula makes three claims: "The variance of $\hat{\beta}$ is inversely proportional to the sample size $n$, it is directly proportional to the variance of $\varepsilon$, and it is inversely proportional to the variance of $x$."

Why should doubling the sample size, ceteris paribus, cause the variance of $\hat{\beta}$ to be cut in half? This result is intimately linked to the iid assumption applied to $\varepsilon$ : Since the individual errors are assumed to be iid, each observation should be treated ex ante as being equally informative. And, doubling the number of observations doubles the amount of information about the parameters that describe the (assumed linear) relationship between $x$ and $y$ . Having twice as much information cuts the uncertainty about the parameters in half. Similarly, it should be straightforward to develop one's intuition as to why doubling $\sigma^2$ also doubles the variance of $\hat{\beta}$ .
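The "twice the data, half the variance" claim can be seen directly in the matrix formula: if the same $x$ values are observed twice, $X^TX$ doubles and its inverse halves. A minimal sketch with made-up `x` values and an assumed `sigma2`:

```python
import numpy as np

sigma2 = 1.0                                # assumed error variance (illustrative)
x = np.array([0.5, 1.0, 2.0, 3.5, 5.0])     # made-up regressor values
X = np.column_stack([np.ones(x.size), x])

# Doubling the sample while keeping the sample variance of x unchanged:
# every x value is simply observed twice.
X_doubled = np.vstack([X, X])

cov_n = sigma2 * np.linalg.inv(X.T @ X)
cov_2n = sigma2 * np.linalg.inv(X_doubled.T @ X_doubled)

# Doubling the error variance instead scales the whole matrix by two.
cov_double_noise = (2.0 * sigma2) * np.linalg.inv(X.T @ X)

print(cov_2n / cov_n)            # every entry is 0.5
print(cov_double_noise / cov_n)  # every entry is 2.0
```

Stacking the design matrix is just one convenient way to hold the sample variance of $x$ fixed (the ceteris paribus condition) while $n$ doubles.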

Let's turn, then, to your main question, which is about developing intuition for the claim that the variance of $\hat{\beta}$ is inversely proportional to the variance of $x$ . To formalize notions, let us consider two separate bivariate linear regression models, called Model $(1)$ and Model $(2)$ from now on. We will assume that both models satisfy the assumptions of the simplest form of the Gauss-Markov theorem and that the models share the exact same values of $\alpha$ , $\beta$ , $n$ , and $\sigma^2$ . Under these assumptions, it is easy to show that $\text{E}\,\hat{\beta}{}^{(1)}=\text{E}\,\hat{\beta}{}^{(2)}=\beta$ ; in words, both estimators are unbiased. Crucially, we will also assume that whereas $\bar{x}^{(1)}=\bar{x}^{(2)}=\bar{x}$ , $\text{Var}\,x^{(1)}\ne \text{Var}\,x^{(2)}$ . Without loss of generality, let us assume that $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$ . Which estimator of $\beta$ will have the smaller variance? Put differently, will $\hat{\beta}{}^{(1)}$ or $\hat{\beta}{}^{(2)}$ be closer, on average, to $\beta$ ? From the earlier discussion, we have $\text{Var}\,\hat{\beta}{}^{(k)} =\tfrac{1}{n}\sigma^2/\text{Var}\,x^{(k)}$ for $k=1,2$ . Because $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$ by assumption, it follows that $\text{Var}\,\hat{\beta}{}^{(1)} <\text{Var}\,\hat{\beta}{}^{(2)}$ . What, then, is the intuition behind this result?

Because by assumption $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$ , on average each $x_i^{(1)}$ will be farther away from $\bar{x}$ than is the case, on average, for $x_i^{(2)}$ . Let us denote the expected average absolute difference between $x_i$ and $\bar{x}$ by $d_x$ . The assumption that $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$ implies that $d_x^{(1)} >d_x^{(2)}$ . The bivariate linear regression model, expressed in deviations from means, states that $d_y = \beta d_x^{(1)}$ for Model $(1)$ and $d_y = \beta d_x^{(2)}$ for Model $(2)$ . If $\beta\ne0$ , this means that the deterministic component of Model $(1)$ , $\beta d_x^{(1)}$ , has a greater influence on $d_y$ than does the deterministic component of Model $(2)$ , $\beta d_x^{(2)}$ . Recall that both models are assumed to satisfy the Gauss-Markov assumptions, that the error variances are the same in both models, and that $\beta^{(1)}=\beta^{(2)}=\beta$ . Since Model $(1)$ imparts more information about the contribution of the deterministic component of $y$ than does Model $(2)$ , it follows that the precision with which the deterministic contribution can be estimated is greater for Model $(1)$ than is the case for Model $(2)$ . The flip side of greater precision is a lower variance of the point estimate of $\beta$ .
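The comparison between Model $(1)$ and Model $(2)$ can be simulated. The sketch below is only an illustration: the parameter values, the two designs `x_wide` and `x_narrow`, the normal errors and the seed are assumptions made for the example, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta, sigma, n = 1.0, 2.0, 1.0, 20   # shared by both models (illustrative)

base = np.linspace(-1.0, 1.0, n)
x_wide = 5.0 + 3.0 * base     # Model (1): same mean, larger spread
x_narrow = 5.0 + 1.0 * base   # Model (2): same mean, smaller spread

def slope_estimates(x, reps=20000):
    """Simulate `reps` samples for a fixed design x and return the OLS slopes."""
    eps = rng.normal(scale=sigma, size=(reps, x.size))
    y = alpha + beta * x + eps
    xc = x - x.mean()
    # OLS slope: sum (x - xbar) y / sum (x - xbar)^2, computed for each row of y
    return (y @ xc) / (xc @ xc)

def theoretical_var(x):
    # (1/n) * sigma^2 / Var x, with Var x the sample variance (divisor n)
    return sigma**2 / (x.size * np.var(x))

slopes_wide = slope_estimates(x_wide)
slopes_narrow = slope_estimates(x_narrow)

print("Model (1):", slopes_wide.var(), "theory", theoretical_var(x_wide))
print("Model (2):", slopes_narrow.var(), "theory", theoretical_var(x_narrow))
```

Both sets of slope estimates center on the same $\beta$, but the estimates from the wider design cluster more tightly. Here the spread of $x$ differs by a factor of 3, so the variances differ by a factor of 9.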

It is reasonably straightforward to generalize the intuition obtained from studying the simple regression model to the general multiple linear regression model. The main complication is that instead of comparing scalar variances, it is necessary to compare the "size" of variance-covariance matrices. Having a good working knowledge of determinants, traces and eigenvalues of real symmetric matrices comes in very handy at this point :-)

— Mico

1

Say we have $n$ observations (or sample size) and $p$ parameters.

The covariance matrix $\operatorname{Var}(\hat{\beta})$ of the estimated parameters $\hat{\beta}_1,\hat{\beta}_2$ etc. is a representation of the accuracy of the estimated parameters.

If in an ideal world the data could be perfectly described by the model, then the noise will be $\sigma^2= 0$ . Now, the diagonal entries of $\operatorname{Var}(\hat{\beta})$ correspond to $\operatorname{Var}(\hat{\beta_1}),\operatorname{Var}(\hat{\beta_2})$ etc. The derived formula for the variance agrees with the intuition that if the noise is lower, the estimates will be more accurate.

In addition, as the number of measurements gets larger, the variance of the estimated parameters will decrease. The reason is that each entry of $X^TX$ is a sum of $n$ products (the number of columns of $X^T$ and the number of rows of $X$ are both $n$ ), so the entries of $X^TX$ typically grow in absolute value with $n$ , and the entries of the inverse $(X^TX)^{-1}$ correspondingly shrink.

Hence, even if there is a lot of noise, we can still reach good estimates $\hat{\beta_i}$ of the parameters if we increase the sample size $n$ .

I hope this helps.

Reference: Section 7.3 on Least squares: Cosentino, Carlo, and Declan Bates. Feedback Control in Systems Biology. CRC Press, 2011.

— Dilly Minch

1

This builds on @Alecos Papadopoulos' answer.

Recall that the result of a least-squares regression doesn't depend on the units of measurement of your variables. Suppose your X-variable is a length measurement, given in inches. Then rescaling X, say by multiplying by 2.54 to change the unit to centimeters, doesn't materially affect things. If you refit the model, the new regression estimate will be the old estimate divided by 2.54.

The $X'X$ matrix is (for centered X, and up to a factor of $n$ ) the variance of X, and hence reflects the scale of measurement of X. If you change the scale, the estimate of $\beta$ has to change accordingly, and this is done by multiplying by the inverse of $X'X$ .
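The inches-to-centimeters example is easy to verify numerically. In the sketch below the data are simulated with arbitrary parameters and the helper `fit` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 30
x_inches = rng.uniform(10.0, 40.0, size=n)                 # made-up lengths in inches
y = 3.0 + 0.8 * x_inches + rng.normal(scale=2.0, size=n)   # made-up response
x_cm = 2.54 * x_inches                                     # same lengths in centimeters

def fit(x, y):
    """OLS with an intercept; returns the estimates and (X'X)^{-1}."""
    X = np.column_stack([np.ones(x.size), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    return beta_hat, XtX_inv

b_in, inv_in = fit(x_inches, y)
b_cm, inv_cm = fit(x_cm, y)

print("slope (inches):", b_in[1], " slope (cm):", b_cm[1])
print("slope ratio:", b_in[1] / b_cm[1])                       # 2.54
print("variance-factor ratio:", inv_in[1, 1] / inv_cm[1, 1])   # 2.54 ** 2
```

The slope shrinks by the factor 2.54 and the slope entry of $(X'X)^{-1}$ by $2.54^2$, so the standard error rescales exactly like the estimate and the fit itself is unchanged.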

— Hong Ooi