I'm a noob in statistics, so I hope you guys can help me out here.
My question is the following: what does pooled variance actually mean?
When I search the internet for a pooled variance formula, I find a lot of literature using the following formula (for example, here: http://math.tntech.edu/ISR/Mathematical_Statistics/Introduction_to_Statistical_Tests/thispage/newnode19.html ):

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$
But what does it actually calculate? Because when I use this formula to calculate my pooled variance, it gives me the wrong answer.
For example, consider this "parent sample":

2, 2, 2, 2, 2, 8, 8, 8, 8, 8

The variance of this parent sample is $S_p^2 = 10$, and its mean is $\bar{x}_p = 5$.
Now, suppose I split this parent sample into two sub-samples:
- The first sub-sample is 2,2,2,2,2 with mean $\bar{x}_1 = 2$ and variance $S_1^2 = 0$.
- The second sub-sample is 8,8,8,8,8 with mean $\bar{x}_2 = 8$ and variance $S_2^2 = 0$.
Now, clearly, using the above formula to calculate the pooled/parent variance of these two sub-samples will produce zero, because $S_1^2 = 0$ and $S_2^2 = 0$. So what does this formula actually calculate?
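To make the mismatch concrete, here is a minimal Python sketch of the check I did. The helper name `pooled_variance` is my own (it is not from any library), and I am assuming the literature's formula is the usual weighted average of the two sample variances with divisor $n_1 + n_2 - 2$.

```python
from statistics import variance  # sample variance (divisor n - 1)

def pooled_variance(a, b):
    """Literature formula: weighted average of the two sample variances."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

sub1 = [2, 2, 2, 2, 2]
sub2 = [8, 8, 8, 8, 8]
parent = sub1 + sub2

# The literature formula only sees the spread inside each sub-sample,
# so it returns zero here.
print("pooled (literature):", pooled_variance(sub1, sub2))

# The variance of the combined data also reflects the gap between
# the two sub-sample means, so it is 10.
print("parent variance:    ", variance(parent))
```

So the two numbers disagree even though both are computed from exactly the same ten values.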
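On the other hand, after some lengthy derivation, I found that the formula which produces the correct pooled/parent variance is:

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + n_1 d_1^2 + (n_2 - 1) S_2^2 + n_2 d_2^2}{n_1 + n_2 - 1}$$

In the above formula, $d_1 = \bar{x}_1 - \bar{x}_p$ and $d_2 = \bar{x}_2 - \bar{x}_p$.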
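Here is a rough Python sketch of my formula, checked against the first example. The function name `combined_variance` is my own, and the code assumes the formula adds a between-group term $n_i (\bar{x}_i - \bar{x}_p)^2$ for each sub-sample to the within-group terms and divides by $n_1 + n_2 - 1$.

```python
from statistics import mean, variance

def combined_variance(a, b):
    """Variance of the merged sample, built only from sub-sample summaries."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    grand_mean = (n1 * m1 + n2 * m2) / (n1 + n2)

    # Offsets of each sub-sample mean from the parent mean.
    d1 = m1 - grand_mean
    d2 = m2 - grand_mean

    within = (n1 - 1) * variance(a) + (n2 - 1) * variance(b)
    between = n1 * d1 ** 2 + n2 * d2 ** 2
    return (within + between) / (n1 + n2 - 1)

sub1 = [2, 2, 2, 2, 2]
sub2 = [8, 8, 8, 8, 8]

print("my formula:     ", combined_variance(sub1, sub2))
print("parent variance:", variance(sub1 + sub2))
```

For these data the within-group part is zero and the between-group part is 90, so dividing by 9 gives 10, the same as the variance computed directly from the parent sample.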
I found formulas similar to mine, for example here: http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html and also on Wikipedia, although I have to admit that they don't look exactly the same as mine.
So again, what does pooled variance actually mean? Shouldn't it mean the variance of the parent sample formed from the two sub-samples? Or am I completely wrong here?
Thank you in advance.
EDIT 1: Someone said that my two sub-samples above are pathological since they have zero variance. Well, I can give you a different example. Consider this parent sample:

1, 2, 3, 4, 5, 46, 47, 48, 49, 50

The variance of this parent sample is $S_p^2 = 564.7$, and its mean is $\bar{x}_p = 25.5$.
Now, suppose I split this parent sample into two sub-samples:
- The first sub-sample is 1,2,3,4,5 with mean $\bar{x}_1 = 3$ and variance $S_1^2 = 2.5$.
- The second sub-sample is 46,47,48,49,50 with mean $\bar{x}_2 = 48$ and variance $S_2^2 = 2.5$.
Now, if you use the "literature's formula" to compute the pooled variance, you will get 2.5, which is completely wrong, because the parent/pooled variance should be 564.7. If you use "my formula" instead, you will get the correct answer.
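For anyone who wants to reproduce these numbers, here is a small self-contained Python sketch comparing both formulas on this second example. The names `literature_formula` and `my_formula` are just labels I made up, and the code assumes the literature version divides by $n_1 + n_2 - 2$ while mine adds the between-group term and divides by $n_1 + n_2 - 1$.

```python
from statistics import mean, variance

def literature_formula(a, b):
    """Weighted average of the two sample variances."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

def my_formula(a, b):
    """Within-group plus between-group sum of squares over n1 + n2 - 1."""
    n1, n2 = len(a), len(b)
    grand_mean = (n1 * mean(a) + n2 * mean(b)) / (n1 + n2)
    within = (n1 - 1) * variance(a) + (n2 - 1) * variance(b)
    between = n1 * (mean(a) - grand_mean) ** 2 + n2 * (mean(b) - grand_mean) ** 2
    return (within + between) / (n1 + n2 - 1)

sub1 = [1, 2, 3, 4, 5]
sub2 = [46, 47, 48, 49, 50]
parent = sub1 + sub2

print("literature formula:", literature_formula(sub1, sub2))   # 2.5
print("my formula:        ", round(my_formula(sub1, sub2), 1))  # 564.7
print("parent variance:   ", round(variance(parent), 1))        # 564.7
```

The within-group part is only 20 here, while the between-group part is 5062.5, which is why the two formulas land so far apart.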
Please understand, I use extreme examples here to show people that the formula is indeed wrong. If I use "normal data" that doesn't have a lot of variation (no extreme cases), then the results from the two formulae will be very similar, and people could dismiss the difference as rounding error rather than recognizing that the formula itself is wrong.