Why does the chi-square test use the expected count as the variance?

In $\chi^2$ testing, what is the basis for using the square root of the expected counts as the standard deviations (i.e., the expected counts as the variances) of each of the normal distributions? The only discussion of this I could find is at http://www.physics.csbsju.edu/stats/chi-square.html, and it only mentions Poisson distributions.

As a simple illustration of my confusion: what if we were testing whether two processes are significantly different, one that generates 500 As and 500 Bs with very small variance, and another that generates 550 As and 450 Bs with very small variance (rarely generating 551 As and 449 Bs)? Isn't the variance here clearly not simply the expected value?

(I'm not a statistician, so I'm really looking for an answer that is accessible to a non-specialist.)

hypothesis-testing chi-squared

— Yang

This probably has something to do with the fact that the variance of a $\chi^{2}_{k}$ random variable is $2k$, and with the fact that the statistic does not have to be multiplied by 2 to have the correct distribution (as in the likelihood ratio test). Perhaps someone knows this more formally.

— Macro

Answers:

The general form for many test statistics is

$\frac{\text{observed} - \text{expected}}{\text{standard error}}$

In the case of a normal variable, the standard error is based on either the known population variance (z-stats) or the estimate from the sample (t-stats). With the binomial, the standard error is based on the proportion (the hypothesized proportion for tests).

In a contingency table, the count in each cell can be thought of as coming from a Poisson distribution with a mean equal to the expected value (under the null). The variance of a Poisson distribution is equal to its mean, so we use the expected value for the standard error calculation as well. I have seen a statistic that uses the observed count instead, but it has less theoretical justification and does not converge as well to the $\chi^2$ distribution.
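As a minimal sketch of the recipe described above (illustrative code, not part of the original answer), each cell can be standardized using the expected count as its variance, then squared and summed. The function name `pearson_chi_square` is hypothetical, and the counts are borrowed from the 550/450 example in the question, with 500 of each assumed under the null.

```python
import numpy as np

def pearson_chi_square(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    # Standardize each cell with sqrt(expected) as the standard error,
    # i.e. the expected count plays the role of the variance.
    z = (observed - expected) / np.sqrt(expected)
    return float(np.sum(z ** 2))

# Illustrative counts: 550 As and 450 Bs observed, 500 of each expected.
stat = pearson_chi_square([550, 450], [500, 500])
print(stat)
```

Each cell here contributes $50^2 / 500 = 5$ to the total, so the statistic is about 10.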

— Greg Snow

I'm getting stuck on the connection with the Poisson, that is, on understanding why each cell can be thought of as coming from a Poisson. I know the mean and variance of a Poisson, and I know it represents the number of events given a rate. I also know that chi-square distributions represent the sum of squares of standard (variance 1) normals. I'm just trying to wrap my head around the justification for reusing the expected value as the assumed "spread" of each of the normals. Is it simply to make everything conform to the chi-square distribution, i.e., to "standardize" the normals?

— Yang

There are a couple of issues. The Poisson distribution is common for counts when things are fairly independent. Instead of thinking of the table as having a fixed total that you distribute among the cells, think of just a single cell of the table, where you wait for a fixed amount of time to see how many responses fall into that cell. This fits the general idea of a Poisson. For large means you can approximate a Poisson with a normal distribution, so the test statistic makes sense as a normal approximation to the Poisson, which is then converted to the $\chi^2$ distribution.

— Greg Snow

(+1) Suppose the cell counts $X_1,\ldots,X_k$ were independent Poisson random variables with means $n\pi_i$. Then, certainly, $\sum_{i=1}^k \frac{(X_i - n\pi_i)^2}{n \pi_i} \to \chi_k^2$ in distribution. The trouble with this, however, is that $n$ is a parameter and not the actual observed count. The total observed count is $N = \sum_{i=1}^k X_i \sim \mathrm{Poi}(n)$. Though $N/n \to 1$ almost surely by the SLLN, some more work needs to be done to turn the heuristic into something workable.

— cardinal

— Yang

@Yang: It sounds like your data (which you haven't described) don't fit the model underlying the use of the chi-square statistic. The standard model is one of multinomial sampling. Strictly speaking, even (unconditional) Poisson sampling isn't covered, which is what Greg's answer assumes. I make a (perhaps obtuse) reference to this in my previous comment.

— cardinal

Let's handle the simplest case to try to provide the most intuition. Let $X_1, X_2, \ldots, X_n$ be an iid sample from a discrete distribution with $k$ outcomes. Let $\pi_1,\ldots,\pi_k$ be the probabilities of each particular outcome. We are interested in the (asymptotic) distribution of the chi-squared statistic

$$X^2 = \sum_{i=1}^k \frac{(S_i - n \pi_i)^2}{n\pi_i} \> .$$

Here $n \pi_i$ is the expected number of counts of the $i$th outcome.

A suggestive heuristic

Define $U_i = (S_i - n\pi_i) / \sqrt{n \pi_i}$, so that $X^2 = \sum_i U_i^2 = \newcommand{\U}{\mathbf{U}}\|\U\|^2_2$, where $\U = (U_1,\ldots,U_k)$.

Since $S_i$ is $\mathrm{Bin}(n,\pi_i)$, then by the Central Limit Theorem,

$$\newcommand{\convd}{\xrightarrow{d}}\newcommand{\N}{\mathcal{N}} T_i = \frac{U_i}{\sqrt{1-\pi_i}} = \frac{S_i - n \pi_i}{\sqrt{ n\pi_i(1-\pi_i)}} \convd \N(0, 1) \>,$$

hence we also have that $U_i \convd \N(0, 1-\pi_i)$.

Now, if the $T_i$ were (asymptotically) independent (which they aren't), then we could argue that $\sum_i T_i^2$ was asymptotically $\chi_k^2$ distributed. But note that $T_k$ is a deterministic function of $(T_1,\ldots,T_{k-1})$, and so the $T_i$ variables can't possibly be independent.

Hence, we must take the covariance between them into account somehow. It turns out that the "correct" way to do this is to use the $U_i$ instead, and the covariance between the components of $\U$ also changes the asymptotic distribution from what we might have thought was $\chi_{k}^2$ to what is, in fact, a $\chi_{k-1}^2$.

Some details on this follow.

A more rigorous treatment

It is not hard to check that, in fact, $\newcommand{\Cov}{\mathrm{Cov}}\Cov(U_i, U_j) = - \sqrt{\pi_i \pi_j}$ for $i \neq j$ .
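For readers who want that check spelled out, here is one way to do it. It assumes that $S_i$ is the number of sample points equal to outcome $i$, so that $(S_1,\ldots,S_k)$ is multinomial. The answer does not state this explicitly, but it is consistent with the binomial marginals used above. Writing $S_i = \sum_{m=1}^n \mathbf{1}\{X_m = i\}$ and using independence across $m$, for $i \neq j$,

$$\mathrm{Cov}(S_i, S_j) = \sum_{m=1}^n \mathrm{Cov}\big(\mathbf{1}\{X_m = i\}, \mathbf{1}\{X_m = j\}\big) = n\,(0 - \pi_i \pi_j) = -n \pi_i \pi_j ,$$

since a single observation cannot equal both $i$ and $j$. Dividing by the scaling factors in $U_i$ and $U_j$ then gives

$$\mathrm{Cov}(U_i, U_j) = \frac{\mathrm{Cov}(S_i, S_j)}{\sqrt{n\pi_i}\,\sqrt{n\pi_j}} = \frac{-n\pi_i\pi_j}{n\sqrt{\pi_i\pi_j}} = -\sqrt{\pi_i \pi_j} .$$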

So, the covariance of $\U$ is

$$\newcommand{\sqpi}{\sqrt{\boldsymbol{\pi}}} \newcommand{\A}{\mathbf{A}} \A = \mathbf{I} - \sqpi \sqpi^T \>,$$

where $\sqpi = (\sqrt{\pi_1}, \ldots, \sqrt{\pi_k})$. Note that $\A$ is symmetric and idempotent, i.e., $\A = \A^2 = \A^T$. So, in particular, if $\newcommand{\Z}{\mathbf{Z}}\Z = (Z_1, \ldots, Z_k)$ has iid standard normal components, then $\A \Z \sim \N(0, \A)$. (NB The multivariate normal distribution in this case is degenerate.)
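These properties of the matrix are easy to confirm numerically. The sketch below is illustrative only: the probability vector `pi` is an arbitrary choice, and the check uses NumPy rather than anything from the original answer. It also shows where the degeneracy comes from, since the matrix sends the vector of square-root probabilities to zero.

```python
import numpy as np

# Hypothetical outcome probabilities, chosen only for illustration.
pi = np.array([0.2, 0.3, 0.5])
k = len(pi)

sqrt_pi = np.sqrt(pi)
A = np.eye(k) - np.outer(sqrt_pi, sqrt_pi)  # A = I - sqrt(pi) sqrt(pi)^T

is_symmetric = np.allclose(A, A.T)
is_idempotent = np.allclose(A @ A, A)
# sqrt(pi) has unit length because the probabilities sum to 1, so A maps it
# to the zero vector; this is the direction in which N(0, A) is degenerate.
null_image = A @ sqrt_pi
rank = int(np.linalg.matrix_rank(A))

print(is_symmetric, is_idempotent, rank)
print(null_image)
```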

Now, by the Multivariate Central Limit Theorem, the vector $\U$ has an asymptotic multivariate normal distribution with mean $0$ and covariance $\A$ .

So, $\U$ has the same asymptotic distribution as $\A \Z$; hence, the asymptotic distribution of $X^2 = \U^T \U$ is the same as the distribution of $\Z^T \A^T \A \Z = \Z^T \A \Z$ by the continuous mapping theorem.

But, $\A$ is symmetric and idempotent, so (a) it has orthogonal eigenvectors, (b) all of its eigenvalues are 0 or 1, and (c) the multiplicity of the eigenvalue of 1 is $\mathrm{rank}(\A)$ . This means that $\A$ can be decomposed as $\A = \mathbf{Q D Q}^T$ where $\mathbf{Q}$ is orthogonal and $\mathbf{D}$ is a diagonal matrix with $\mathrm{rank}(\A)$ ones on the diagonal and the remaining diagonal entries being zero.

Thus, $\Z^T \A \Z$ must be $\chi^2_{k-1}$ distributed since $\A$ has rank $k-1$ in our case.
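One way to sanity-check this limit is a small Monte Carlo experiment. The sketch below assumes that $S_i$ is the number of sample points equal to outcome $i$, so the counts are multinomial. The probabilities, sample size, number of replications, and seed are all arbitrary illustrative choices. It compares the simulated mean and variance of $X^2$ with those of a $\chi^2_{k-1}$ variable ($k-1$ and $2(k-1)$), and the sample covariance of the $U_i$ with $\mathbf{I} - \sqrt{\boldsymbol{\pi}}\sqrt{\boldsymbol{\pi}}^T$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical settings, chosen only for illustration.
pi = np.array([0.2, 0.3, 0.5])
n = 2000        # sample size behind each table of counts
reps = 20000    # number of simulated tables
k = len(pi)

S = rng.multinomial(n, pi, size=reps)    # counts, shape (reps, k)
U = (S - n * pi) / np.sqrt(n * pi)       # scaled by sqrt(expected count)
X2 = np.sum(U ** 2, axis=1)              # chi-square statistic per table

mean_X2 = X2.mean()                      # should be near k - 1
var_X2 = X2.var()                        # should be near 2 * (k - 1)
cov_U = np.cov(U, rowvar=False)          # sample covariance of U
A = np.eye(k) - np.outer(np.sqrt(pi), np.sqrt(pi))
max_cov_error = np.abs(cov_U - A).max()

print(mean_X2, var_X2, max_cov_error)
```

With three outcomes, the simulated mean should come out near 2 rather than 3, which is the loss of one degree of freedom described above.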

Other connections

The chi-square statistic is also closely related to likelihood ratio statistics. Indeed, it is a Rao score statistic and can be viewed as a Taylor-series approximation of the likelihood ratio statistic.
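To illustrate how close the two statistics can be, the following sketch computes both for one made-up table. The function names and the counts are hypothetical. The likelihood ratio statistic is written in its usual multinomial form, $G = 2 \sum_i O_i \log(O_i / E_i)$, which assumes every observed count is positive.

```python
import numpy as np

def pearson_statistic(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

def likelihood_ratio_statistic(observed, expected):
    """G statistic: 2 * sum of O * log(O / E); needs all O > 0."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(2.0 * np.sum(observed * np.log(observed / expected)))

# Made-up counts close to the null expectation.
observed = [520, 480]
expected = [500, 500]

x2 = pearson_statistic(observed, expected)
g = likelihood_ratio_statistic(observed, expected)
print(x2, g)
```

When the observed counts are close to the expected ones, as here, the two values nearly coincide. They drift apart as the deviations grow relative to the expected counts.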

References

This is my own development based on experience, but obviously influenced by classical texts. Good places to look to learn more are

G. A. F. Seber and A. J. Lee (2003), Linear Regression Analysis, 2nd ed., Wiley.
E. Lehmann and J. Romano (2005), Testing Statistical Hypotheses, 3rd ed., Springer. Section 14.3 in particular.
D. R. Cox and D. V. Hinkley (1979), Theoretical Statistics, Chapman and Hall.

— cardinal

(+1) I think it is hard to find this proof in standard categorical data analysis texts like Agresti, A. (2002). Categorical Data Analysis. John-Wiley.

— suncoolsu

Thanks for the comment. I know there is some treatment of the chi-squared statistic in Agresti, but don't recall how far he takes it. He may just appeal to the asymptotic equivalence with the likelihood ratio statistic.

— cardinal

I don't know if you'll find the proof above in any text. I haven't seen the use of the full (degenerate) covariance matrix and its properties elsewhere. The usual treatment looks at the (nondegenerate) distribution of the first $k-1$ coordinates and then uses the inverse covariance matrix (which has a nice form, but one which is not immediately obvious) and some (somewhat) tedious algebra to establish the result.

— cardinal

Your answer begins by defining a set of $X$'s but then defines the statistic in terms of $S$'s. Can you include something in the answer that indicates how the variables you define at the start and the variables in the statistic are related?

— Glen_b -Reinstate Monica