Outliers/High Leverage observations and influential points
2017-08-20 20:54:42

In this section, we learn the distinction between outliers and high leverage observations. In short:

An outlier is a data point whose response y does not follow the general trend of the rest of the data.

A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).

Note that — for our purposes — we consider a data point to be an outlier only if it is extreme with respect to the other y values, not the x values.

A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential.
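To make the distinction concrete, here is a minimal sketch (not from the course page) that computes leverage from the diagonal of the hat matrix and uses Cook's distance as one common influence measure. The function name `leverage_and_cooks` and the toy data are made up for illustration, and a simple one-predictor regression with an intercept is assumed.

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Leverage and Cook's distance for the simple regression y ~ 1 + x.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
    n, p = X.shape
    # Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages.
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = y - H @ y                # ordinary residuals
    mse = resid @ resid / (n - p)    # residual mean square
    # Cook's distance combines residual size and leverage.
    cooks = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)
    return h, cooks

# Toy data (assumed): the last point has an extreme x value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 8.0])
h, d = leverage_and_cooks(x, y)
print(np.round(h, 3))
print(np.round(d, 3))
```

The leverages depend only on x, while Cook's distance also uses y, which is why a high leverage point still has to be checked before calling it influential.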

One advantage of the case in which we have only one predictor is that we can look at simple scatter plots in order to identify any outliers and influential data points. Let's take a look at a few examples that should help to clarify the distinction between the two types of extreme values.

from https://onlinecourses.science.psu.edu/stat501/node/337

Qual prep 1
2017-08-18 23:44:14

stat 2

- 1-way RM ANOVA(wg(ssub, err))

- 2-way ANOVA

stat 3

- pooled proportion to SE(pooled q!!, sqrt((pooledP * pooledQ) / N))

- proportion: 2-way table
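
A rough sketch of the pooled standard error, assuming the note refers to the usual two-proportion z-test, where the pooled p and pooled q are both used and the sample sizes enter as (1/n1 + 1/n2). The function name and the counts are made up.

```python
from math import sqrt

def pooled_two_proportion_z(x1, n1, x2, n2):
    """z statistic and pooled SE for H0: p1 == p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under H0
    q_pool = 1 - p_pool             # pooled q, easy to forget
    se = sqrt(p_pool * q_pool * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, se

# Made-up counts: 30/100 successes vs 20/100 successes.
z, se = pooled_two_proportion_z(30, 100, 20, 100)
print(round(z, 4), round(se, 4))
```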

ml 0

- Forward, State Prob.

- EM: theta is the emission

- GMM: Euclidean / minimize variance

- SVM (maximal margin) -> which points??

ml 1

- bootstrap: lim (1 - 1/n)^n = e^-1 as n -> infinity
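
A quick numerical check of that limit, as a sketch: (1 - 1/n)^n is the chance that one particular observation is left out of a bootstrap sample of size n. The function names and the seed are arbitrary choices for illustration.

```python
import math
import random

def prob_excluded(n):
    """P(a given observation never appears in n draws with replacement)."""
    return (1 - 1 / n) ** n

def simulate_excluded_fraction(n, seed=0):
    """Fraction of the n observations missing from one bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # indices that were drawn
    return 1 - len(drawn) / n

for n in (10, 100, 10_000):
    print(n, prob_excluded(n))
print("e^-1 =", math.exp(-1))
print("simulated:", simulate_excluded_fraction(10_000))
```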

ml 2

- WCV is 2 times the cluster variance

- Random Forest

- Effective df

- Stacked classifier

- Ensemble: p(majority)
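
For the last item, a small sketch of p(majority) under assumptions the note does not state: binary classification, n independent classifiers, each correct with the same probability p, and n odd so there are no ties. The function name is made up.

```python
from math import comb

def p_majority_correct(n, p):
    """P(more than half of n independent voters are correct).

    Assumes independence and a shared accuracy p (illustrative only).
    """
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

for n in (1, 3, 11, 101):
    print(n, p_majority_correct(n, 0.6))
```

With p above 0.5 the probability rises with n, and with p below 0.5 it falls, which is the usual argument for why weak but better-than-chance voters help.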

Even though I'm busy
2017-08-15 20:43:36

my daily life matters too, so

upgrade complete

Just one round T_T

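After reading a book once, I tend to think
2017-08-13 21:03:49

there is nothing about it I don't know, but

when I open the book again, of course,

and even when I reread notes I wrote up earlier after understanding the material,

I realize that I was badly mistaken.

Of course, since I have never understood everything perfectly,

for things that feel completely unfamiliar, even when I search my memory and look through my materials, in most cases I never wrote anything up in the first place.

Until I have read it several times, I don't even know what I missed the first time,

so I guess this is why you have to read a book several times.

Which suddenly reminded me of the opening line of The Big Short: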

"It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.", a fake Mark Twain quote used in The Big Short
