Does down-sampling change logistic regression coefficients?

If I have a dataset with a very rare positive class, and I down-sample the negative class and then fit a logistic regression, do I need to adjust the regression coefficients to reflect the fact that I changed the prevalence of the positive class?

For example, let's say I have a dataset with 4 variables: Y, A, B and C. Y, A, and B are binary; C is continuous. For 11,100 observations Y = 0, and for 900 Y = 1:

set.seed(42)
n <- 12000
r <- 1/12
A <- sample(0:1, n, replace=TRUE)
B <- sample(0:1, n, replace=TRUE)
C <- rnorm(n)
Y <- ifelse(10 * A + 0.5 * B + 5 * C + rnorm(n)/10 > -5, 0, 1)

I fit a logistic regression to predict Y, given A, B and C:

dat1 <- data.frame(Y, A, B, C)
mod1 <- glm(Y~., dat1, family=binomial)

However, to save time I could remove 10,200 non-Y observations, leaving 900 with Y = 0 and 900 with Y = 1:

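library(caret)  # provides downSample()
# Randomly drop Y = 0 rows until both classes are the same size;
# downSample() names the outcome column "Class" by default.
dat2 <- downSample(data.frame(A, B, C), factor(Y), list=FALSE)
mod2 <- glm(Class~., dat2, family=binomial)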

The regression coefficients from the 2 models look very similar:

> coef(summary(mod1))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -127.67782  20.619858 -6.191983 5.941186e-10
A           -257.20668  41.650386 -6.175373 6.600728e-10
B            -13.20966   2.231606 -5.919353 3.232109e-09
C           -127.73597  20.630541 -6.191596 5.955818e-10
> coef(summary(mod2))
              Estimate  Std. Error     z value    Pr(>|z|)
(Intercept) -167.90178   59.126511 -2.83970391 0.004515542
A           -246.59975 4059.733845 -0.06074284 0.951564016
B            -16.93093    5.861286 -2.88860377 0.003869563
C           -170.18735   59.516021 -2.85952165 0.004242805

This leads me to believe that the down-sampling did not affect the coefficients. However, this is a single, contrived example, and I'd rather know for sure.

logistic unbalanced-classes case-control-study

— Zach

Putting the intercept aside, you're estimating the same population parameters when you down-sample, but with less precision. The exception is the intercept, which you can estimate when you know the population prevalence of the response. See Hosmer & Lemeshow (2000), Applied Logistic Regression, Ch 6.3 for a proof. You can sometimes introduce separation, though not commonly, when you down-sample the majority response.

— Scortchi - Reinstate Monica

@Scortchi Please post your comment as an answer; it seems sufficient for my question. Thanks for the reference.

— Zach

@Scortchi and Zach: According to the down-sampled model (mod2), Pr(>|z|) for A is almost 1. We cannot reject the null hypothesis that the coefficient of A is 0, so we have lost a covariate that is used in mod1. Isn't this a substantial difference?

— Zhubarb

@Zhubarb: As I noted, you may introduce separation, which can render the Wald standard error estimates completely unreliable.

— Scortchi - Reinstate Monica
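
The row for A in mod2 (a large estimate with an enormous standard error) is the typical signature of separation. As a minimal sketch of why the Wald p-value is misleading there, the toy data below (invented for illustration, not taken from the question) are completely separated by x. A likelihood-ratio test is one commonly suggested alternative, since it does not rely on the standard error.

```r
# Toy data with complete separation: y is 1 exactly when x > 0.
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# glm() warns that fitted probabilities are numerically 0 or 1;
# the warning is suppressed here only to keep the output short.
fit <- suppressWarnings(glm(y ~ x, family = binomial))

# Wald test: the estimate diverges and its standard error blows up,
# so the p-value says nothing useful about the association.
wald_p <- coef(summary(fit))["x", "Pr(>|z|)"]

# Likelihood-ratio test against the intercept-only model.
fit0 <- glm(y ~ 1, family = binomial)
lr_p <- anova(fit0, fit, test = "Chisq")[2, "Pr(>Chi)"]

c(wald = wald_p, lr = lr_p)
```

Here the Wald test fails to flag a predictor that separates the classes perfectly, while the likelihood-ratio test does flag it.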

See also Scott 2006

— StasK

Down-sampling is equivalent to case–control designs in medical statistics—you're fixing the counts of responses & observing the covariate patterns (predictors). Perhaps the key reference is Prentice & Pyke (1979), "Logistic Disease Incidence Models and Case–Control Studies", Biometrika, 66, 3.

They used Bayes' Theorem to rewrite each term in the likelihood for the probability of a given covariate pattern conditional on being a case or control as two factors; one representing an ordinary logistic regression (probability of being a case or control conditional on a covariate pattern), & the other representing the marginal probability of the covariate pattern. They showed that maximizing the overall likelihood subject to the constraint that the marginal probabilities of being a case or control are fixed by the sampling scheme gives the same odds ratio estimates as maximizing the first factor without a constraint (i.e. carrying out an ordinary logistic regression).
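
A quick simulation can make this concrete. The sketch below uses base R only and assumes a made-up population model (`logit P(Y = 1) = -4 + 1*x1 + 0.5*x2`); the variable names and coefficient values are illustrative and are not the data from the question. It fits the same logistic regression to the full data and to a 1:1 down-sampled subset.

```r
set.seed(1)
n  <- 100000
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)

# Assumed population model (illustrative values): rare positive class.
y    <- rbinom(n, 1, plogis(-4 + 1 * x1 + 0.5 * x2))
full <- data.frame(y, x1, x2)

# Keep every case and an equal-sized random sample of controls.
cases    <- which(full$y == 1)
controls <- sample(which(full$y == 0), length(cases))
down     <- full[c(cases, controls), ]

fit_full <- glm(y ~ x1 + x2, data = full, family = binomial)
fit_down <- glm(y ~ x1 + x2, data = down, family = binomial)

# Slopes (log odds ratios) should agree up to sampling noise;
# only the intercept shifts with the changed case:control ratio.
round(cbind(full = coef(fit_full), down = coef(fit_down)), 3)
```

Unlike the example in the question, this outcome is genuinely random given the covariates, so neither fit suffers from separation and the comparison is meaningful.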

The intercept for the population $\beta_0^*$ can be estimated from the case–control intercept $\hat{\beta}_0$ if the population prevalence $\pi$ is known:

$\hat{\beta}_0^* = \hat{\beta}_0 - \log\left( \frac{1-\pi}{\pi}\cdot \frac{n_1}{n_0}\right)$

where $n_0$ & $n_1$ are the number of controls & cases sampled, respectively.
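
As a small sketch, the correction can be wrapped in a helper. The function name `correct_intercept` and the numbers in the usage lines are hypothetical; only the formula comes from the text above.

```r
# Population intercept from a case-control (down-sampled) intercept.
#   b0_hat : intercept estimated on the down-sampled data
#   pi     : known population prevalence of Y = 1
#   n1, n0 : number of cases and controls in the down-sampled data
correct_intercept <- function(b0_hat, pi, n1, n0) {
  b0_hat - log((1 - pi) / pi * n1 / n0)
}

# Hypothetical usage: balanced sample of 900 cases and 900 controls,
# population prevalence 900 / 12000 = 0.075.
b0_pop <- correct_intercept(b0_hat = 0.3, pi = 0.075, n1 = 900, n0 = 900)

# Sanity check: if the sample keeps the population ratio
# (900 cases, 11100 controls), the correction is zero.
b0_same <- correct_intercept(b0_hat = 0.3, pi = 0.075, n1 = 900, n0 = 11100)
```

The slopes need no adjustment, so predicted probabilities on the population scale follow from the down-sampled slopes combined with the corrected intercept.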

Of course by throwing away data you've gone to the trouble of collecting, albeit the least useful part, you're reducing the precision of your estimates. Constraints on computational resources are the only good reason I know of for doing this, but I mention it because some people seem to think that "a balanced data-set" is important for some other reason I've never been able to ascertain.
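
The loss of precision can be seen in a rough simulation. This sketch assumes an invented one-predictor model and a hypothetical helper `slope_se`; it keeps all cases and compares the slope's standard error when 1 control per case, 4 controls per case, or all controls are retained.

```r
set.seed(2)
n   <- 100000
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(-4 + 0.7 * x))  # assumed model, rare outcome
dat <- data.frame(y, x)

cases    <- which(dat$y == 1)
controls <- which(dat$y == 0)

# Standard error of the slope when k controls per case are kept
# (k = Inf keeps every control, i.e. no down-sampling).
slope_se <- function(k) {
  n_keep <- min(length(controls), k * length(cases))
  keep   <- c(cases, sample(controls, n_keep))
  fit    <- glm(y ~ x, data = dat[keep, ], family = binomial)
  coef(summary(fit))["x", "Std. Error"]
}

se <- sapply(c(`1:1` = 1, `4:1` = 4, all = Inf), slope_se)
se
```

The standard error grows as more controls are discarded, which is the cost to weigh against the computational saving.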

— Scortchi - Reinstate Monica

Thanks for the detailed answer. And yes, the reason I'm doing this is that running the full model (with no down-sampling) is computationally prohibitive.

— Zach

Dear @Scortchi, thanks for the explanation, but in a case where I want to use logistic regression, the balanced dataset seems necessary regardless of the computational resources. I tried to use "Firth's bias-reduced penalized-likelihood logit" to no avail. So seemingly down-sampling is the only alternative for me, right?

— Shahin

@Shahin Well, (1) why are you unhappy with a logistic regression fit by maximum-likelihood? & (2) what exactly goes wrong using Firth's method?

— Scortchi - Reinstate Monica

@Scortchi, the problem is that the model is very bad at detecting success instances; in other words, the TPR is very low. Changing the threshold increases the TPR, but precision is then very bad: over 70% of the instances labeled as positive are in fact negatives. I read that logistic regression does not do well with rare events, and that this is where Firth's method comes into play, or at least that this is one of the roles it can take. But the results of Firth's method turned out to be very similar to the usual logit. I thought I might have applied Firth's method incorrectly, but seemingly everything is okay.

— Shahin

@Shahin: You seem to be barking up the wrong tree there: down-sampling isn't going to improve the discrimination of your model. Bias correction or regularization might (on new data - are you assessing its performance on a test set?), but a more complex specification could perhaps help, or it could simply be that you need more informative predictors. You should probably ask a new question, giving details of the data, the subject-matter context, the model, diagnostics and your aims.

— Scortchi - Reinstate Monica