Interpreting non-significant results as "trends"

16

Recently, two different co-workers have used a kind of argument about differences between conditions that seems incorrect to me. Both of these co-workers use statistics, but they are not statisticians. I am a novice in statistics.

In both cases, I argued that, because there was no significant difference between two conditions in an experiment, it was incorrect to make a general claim about these groups with regard to the manipulation. Note that "making a general claim" means something like writing: "Group A used X more often than group B".

My co-workers responded with: "even though there is no significant difference, the trend is still there" and "even though there is no significant difference, there is still a difference". To me, both of these sound like equivocation, i.e., they change the meaning of "difference" from "a difference that is likely to be the result of something other than chance" (i.e., statistical significance) to "any non-zero difference in measurement between groups".

Was my co-workers' response correct? I did not take it up with them because they outrank me.

statistical-significance

— amdex

I found these articles useful: "Still Not Significant" and "Marginally Significant".

— user20637

26

This is a good question; the answer depends a lot on context.

In general I would say you are right: making an unqualified general claim like "group A used X more often than group B" is misleading. It would be better to say something like

in our experiment group A used X more often than group B, but we are very uncertain how this will play out in the general population

or

although group A used X 13% more often than group B in our experiment, our estimate of the difference in the general population is not clear: the plausible values range from A using X 5% less often than group B to A using X 21% more often than group B

or

group A used X 13% more often than group B, but the difference was not statistically significant (95% CI -5% to 21%; p = 0.75)
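As a rough illustration of how a statement like the last one might be produced, here is a minimal sketch that computes an absolute difference in proportions with a Wald 95% interval and a two-sided z-test. The counts (33 of 60 versus 27 of 60) and the function name `diff_in_proportions` are hypothetical and not taken from the question. The "13% more often" above may also have been meant as a relative rather than an absolute difference.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def diff_in_proportions(x_a, n_a, x_b, n_b, z_crit=1.959964):
    """Return (difference, 95% Wald CI, two-sided p-value) for p_a - p_b."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b
    # Unpooled standard error, used for the confidence interval
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled standard error, used for the test of "no difference"
    pooled = (x_a + x_b) / (n_a + n_b)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - normal_cdf(abs(diff / se0)))
    return diff, ci, p_value

# Hypothetical data: 33 of 60 participants in A used X, 27 of 60 in B
diff, (lo, hi), p = diff_in_proportions(33, 60, 27, 60)
print(f"Group A used X {diff:.0%} more often than group B "
      f"(95% CI {lo:.0%} to {hi:.0%}; p = {p:.2f})")
```

Reporting the interval alongside the point estimate makes it visible that both "A uses X less" and "A uses X more" remain plausible for the population.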

On the other hand: your co-workers are right that in this particular experiment, group A used X more often than group B. However, people rarely care about the participants in a particular experiment; they want to know how your results will generalize to a larger population, and in this case the general answer is that you cannot say with confidence whether a randomly selected group A will use X more or less often than a randomly selected group B.

If you needed to make a choice today about whether to use treatment A or treatment B to increase the usage of X, then choosing A would be your best bet. But if you wanted to be comfortable that you were probably making the right choice, you would need more information.

Note that you should not say "there is no difference between group A and group B in their usage of X", or "group A and group B use X the same amount". This is true neither of the participants in your experiment (where A used X 13% more) nor of the general population; in most real-world contexts, you know that there must really be some effect (no matter how slight) of A vs. B; you just don't know which direction it goes.

— Ben Bolker

5

Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B in our experiment, the difference in usage of X between groups in the general population was not clear: the plausible range of that difference goes from A using X 5% less often than group B to A using X 21% more often than group B"

— Isabella Ghement

3

Thanks, partially incorporated (

— Brevity

8

+1 I think many people fail to realize that, in the absence of statistical evidence, the observed differences may well be the opposite of what is going on in the population!

— Dave

@Dave: even in the presence of "statistical evidence" (a statistically significant p-value?), "the observed differences may very well be the opposite of what is going on in the population"

— boscovich

@boscovich Sure, I was speaking in absolutes when we are doing statistics, but I think of an insignificant p-value as meaning you really have no idea what is going on in the population. At least with a significant p-value you have reached some established threshold of evidence that suggests you know something. But it is definitely possible to get a significant p-value when the direction is wrong. That error should happen from time to time.

— Dave

3

This is a tough question!

First things first: any threshold you choose to determine statistical significance is arbitrary. The fact that most people use a $5\%$ $p$-value does not make it more correct than any other. So, in some sense, you should think of statistical significance as a "spectrum" rather than a black-or-white matter.

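Suppose we have a null hypothesis $H_0$ (for example, groups $A$ and $B$ show the same mean for variable $X$, or the population mean of variable $Y$ is below some value). We gather data to see whether we can reject $H_0$, and from our sample we eventually get a $p$-value. Put briefly, the $p$-value is the probability that pure chance would produce results at least as extreme as the ones we got, assuming $H_0$ to be true (i.e., no trend).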

If we get a "low" $p$-value, we say that chance rarely produces results like these, and therefore we reject $H_0$ (there's statistically significant evidence that $H_0$ could be false). If we get a "high" $p$-value, then the results are more likely to be a result of luck rather than an actual trend. We don't say $H_0$ is true, but rather that further study should take place in order to reject it.

WARNING: A $p$-value of $23\%$ does not mean that there is a $23\%$ chance of there not being any trend, but rather that chance alone generates results like those $23\%$ of the time, which sounds similar but is a completely different thing. For example, if I claim something ridiculous, like "I can predict the results of rolling dice an hour before they take place," and we run an experiment to check the null hypothesis $H_0:=$ "I cannot do such a thing" and get a $0.5\%$ $p$-value, you would still have good reason not to believe me, despite the statistical significance.
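To make the distinction concrete, here is a small sketch of the dice example. It computes an exact one-sided binomial $p$-value and then, separately, a posterior probability for the claim under a skeptical prior. The number of rolls and hits, the prior of one in a million, and the assumed 50% hit rate of a genuine predictor are all illustrative assumptions, not figures from the answer.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_tail(k, n, p):
    """P(X >= k): the one-sided exact p-value for observing k hits."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

# Hypothetical experiment: 6 correct predictions in 10 rolls of a fair die
n_rolls, n_hits = 10, 6
p_value = binomial_tail(n_hits, n_rolls, 1 / 6)

# Assumed prior belief that the claim is true, and assumed hit rate if it is
prior_claim = 1e-6
hit_rate_if_true = 0.5

like_null = binomial_pmf(n_hits, n_rolls, 1 / 6)
like_claim = binomial_pmf(n_hits, n_rolls, hit_rate_if_true)
posterior_claim = (prior_claim * like_claim) / (
    prior_claim * like_claim + (1 - prior_claim) * like_null
)

print(f"p-value under H0: {p_value:.4f}")
print(f"posterior probability of the claim: {posterior_claim:.6f}")
```

Under these assumptions the $p$-value is well below $5\%$, yet the posterior probability that the claim is true stays tiny, which is why the $p$-value cannot be read as the probability that $H_0$ is true.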

So, with these ideas in mind, let's go back to your main question. Let's say we want to check whether increasing the dose of drug $X$ has an effect on the likelihood that patients survive a certain disease. We perform an experiment, fit a logistic regression model (taking into account many other variables) and check for significance of the coefficient associated with the "dose" variable (calling that coefficient $\beta$, we'd test a null hypothesis $H_0:$ $\beta=0$ or maybe $\beta \leq 0$). In English, "the drug has no effect" or "the drug has either no or negative effect."

The experiment yields a positive $\beta$, but the $p$-value for the test of $\beta=0$ is 0.79. Can we say there is a trend? Well, that would really diminish the meaning of "trend". If we accept that kind of thing, basically half of all experiments we run would show "trends," even when testing for the most ridiculous things.
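The "half of all experiments" remark can be checked with a quick simulation. This is only a sketch: it swaps the logistic regression for a simple difference in group means, draws both groups from the same distribution so that the null hypothesis holds by construction, and uses an arbitrary group size and seed.

```python
import random

def fraction_positive_under_null(n_experiments=2000, n_per_group=30, seed=0):
    """Share of null experiments in which group A's mean exceeds group B's."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: there is no true effect
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if sum(a) / n_per_group > sum(b) / n_per_group:
            positive += 1
    return positive / n_experiments

share = fraction_positive_under_null()
print(f"Experiments showing a positive 'trend' with no real effect: {share:.1%}")
```

With no real effect at all, the estimated difference should come out positive in roughly half of the simulated experiments, so a positive sign alone carries almost no information.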

So, in conclusion, I think it is dishonest to claim that our drug makes any difference. What we should say, instead, is that our drug should not be put into production unless further testing is done. Indeed, I would say that we should still be careful about the claims we make even when statistical significance is reached. Would you take that drug if chance had a $4\%$ probability of generating those results? This is why research replication and peer review are critical.

I hope this too-wordy explanation helps you sort your ideas. The summary is that you are absolutely right! We shouldn't fill our reports, whether it's for research, business, or whatever, with wild claims supported by little evidence. If you really think there is a trend, but you didn't reach statistical significance, then repeat the experiment with more data!

— David

1

+1 for pointing out that any significance threshold is arbitrary (and by implication it is not possible to infer absolute claims about the general population from the results in a sample -- all you get are better probabilities).

— Peter - Reinstate Monica

0

A significant effect just means that you measured an unlikely anomaly (unlikely if the null hypothesis, absence of effect, were true). As a consequence, the null hypothesis must be doubted with high probability (although this probability is not equal to the p-value and also depends on prior beliefs).

Depending on the quality of the experiment, you could measure the same effect size without it being an anomaly (not an unlikely result if the null hypothesis were true).

When you observe an effect but it is not significant then indeed it (the effect) can still be there, but it is only not significant (the measurements do not indicate that the null hypothesis should be doubted/rejected with high probability). It means that you should improve your experiment, gather more data, to be more sure.

So instead of the dichotomy effect versus no-effect you should go for the following four categories:

Image from https://en.wikipedia.org/wiki/Equivalence_test explaining the two one sided t-tests procedure (TOST)

You seem to be in category D, the test is inconclusive. Your coworkers might be wrong to say that there is an effect. However, it is equally wrong to say that there is no effect!
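The idea behind the two one-sided tests can be sketched as follows, using a normal approximation. The estimate, standard error and equivalence margin below are hypothetical inputs, and `classify` is an illustrative helper rather than a library function. The sketch only reports whether each of the two tests is significant; the case where neither is significant corresponds to the inconclusive outcome described above.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def classify(estimate, se, margin, alpha=0.05):
    """Combine a test of 'no effect' with a TOST equivalence test."""
    # Two-sided test of H0: effect == 0
    p_diff = 2 * (1 - normal_cdf(abs(estimate) / se))
    # TOST: reject H0 "effect <= -margin" and H0 "effect >= +margin"
    p_lower = 1 - normal_cdf((estimate + margin) / se)
    p_upper = normal_cdf((estimate - margin) / se)
    p_equiv = max(p_lower, p_upper)
    return {
        "different_from_zero": p_diff < alpha,
        "equivalent_within_margin": p_equiv < alpha,
    }

# Hypothetical noisy study: neither test is significant, so it is inconclusive
noisy = classify(estimate=0.10, se=0.09, margin=0.05)
# Hypothetical precise study: small effect, tightly estimated inside the margin
precise = classify(estimate=0.01, se=0.01, margin=0.05)
print(noisy)
print(precise)
```

Only the second result supports a statement like "there is no meaningful effect"; the first supports neither "there is an effect" nor "there is no effect".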

— Sextus Empiricus

"Significant effect just means that you measured the null hypothesis (absence of effect) must be doubted with high probability." I strongly disagree with this statement. What if I told you I can predict the result of any coin flip, we make an experiment, and out of pure luck we get a 1% $p$-value? Would you say there is a high probability of the null hypothesis being false?

— David

@David, I completely agree with you that the p-value is more precisely a measure of 'the probability that we make an error conditional on the null hypothesis being true' (or the probability of seeing such extreme results), and it does not directly express 'the probability that the null hypothesis is wrong'. However, I feel that the p-value is not meant to be used in this 'official' sense. The p-value is used to express doubt in the null hypothesis, to express that the results indicate an anomaly, and anomalies should make us doubt the null....

— Sextus Empiricus

....in your case, when you set out to challenge the null effect (challenge the idea that one cannot predict the coins) by providing a rare case (just like the lady tasting tea), then we should indeed have doubt in the null hypothesis. In practice we would need to set an appropriate p-value threshold for this (since indeed one might challenge the null by mere chance), and I would not use the 1% level. The high probability to doubt the null should not be equated, one-to-one, with the p-value (since that probability is more a Bayesian concept).

— Sextus Empiricus

I have adapted the text to take away this misinterpretation.

— Sextus Empiricus

0

It sounds like they're arguing p-value vs. the definition of "Trend".

If you plot the data out on a run chart, you may see a trend: a run of plot points that go up or down over time.

But when you do the statistics on it, the p-value suggests it's not significant.

For the p-value to show little significance while they still see a trend or run in the series of data, it would have to be a very slight trend.

So, if that was the case, I would fall back on the p-value, i.e.: OK, yes, there's a trend or run in the data, but it's so slight and insignificant that the statistics suggest it's not worth pursuing further analysis of.

An insignificant trend is something that may be attributable to some kind of bias in the research, maybe something very minor, something that may just be a one-time occurrence in the experiment that happened to create a slight trend.

If I was the manager of the group, I would tell them to stop wasting time and money digging into insignificant trends, and to look for more significant ones.

— blahblah

0

It sounds like in this case they have little justification for their claim and are just abusing statistics to reach the conclusion they already had. But there are times when it's ok to not be so strict with p-val cutoffs. This (how to use statistical significance and pval cutoffs) is a debate that has been raging since Fisher, Neyman, and Pearson first laid the foundations of statistical testing.

Let's say you are building a model and you are deciding what variables to include. You gather a little bit of data to do some preliminary investigation into potential variables. Now there's this one variable that the business team really is interested in, but your preliminary investigation shows that the variable is not statistically significant. However, the 'direction' of the variable comports with what the business team expected, and although it didn't meet the threshold for significance, it was close. Perhaps it was suspected to have a positive correlation with the outcome and you got a beta coefficient that was positive, but the p-value was just a bit above the .05 cutoff.

In that case, you might go ahead and include it. It's sort of an informal Bayesian statistics: there was a strong prior belief that it is a useful variable, and the initial investigation into it showed some evidence in that direction (but not statistically significant evidence!), so you give it the benefit of the doubt and keep it in the model. Perhaps with more data it will be more evident what relationship it has with the outcome of interest.

Another example might be where you are building a new model and you look at the variables that were used in the previous model -- you might continue to include a marginal variable (one that is on the cusp of significance) to maintain some continuity from model to model.

Basically, depending on what you are doing there are reasons to be more and less strict about these sorts of things.

On the other hand, it's also important to keep in mind that statistical significance does not have to imply a practical significance! Remember that at the heart of all this is sample size. Collect enough data and the standard error of the estimate will shrink to 0. This will make any sort of difference, no matter how small, 'statistically significant' even if that difference might not amount to anything in the real world. For example, suppose the probability of a particular coin landing on heads was .500000000000001. This means that theoretically you could design an experiment which concludes that the coin is not fair, but for all intents and purposes the coin could be treated as a fair coin.
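The effect of sample size can be seen with a short calculation. This sketch uses a coin with heads probability 0.501 instead of the far smaller bias above, purely so the numbers stay readable in floating point, and it assumes the observed proportion lands exactly on the true value. The function name `z_for_observed_rate` is illustrative.

```python
import math

def z_for_observed_rate(p_observed, n, p_null=0.5):
    """z statistic for testing H0: P(heads) == p_null with n flips."""
    # The standard error under the null shrinks like 1 / sqrt(n)
    se = math.sqrt(p_null * (1 - p_null) / n)
    return (p_observed - p_null) / se

for n in (1_000, 1_000_000, 10_000_000):
    z = z_for_observed_rate(0.501, n)
    print(f"n = {n:>10,}: z = {z:.2f}, significant at 5%: {abs(z) > 1.96}")
```

The same 0.1 percentage point departure from fairness goes from undetectable to "statistically significant" purely because the standard error shrinks, while its practical importance does not change.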

— eps