Why Is Data Analysis Done So Much?
Advancement of Technology
• (Big) data handling
– Much easier‐to‐use software
– Server clusters in the cloud
• Algorithms and analysis platforms
– Much improved performance
• Data tend to be less difficult to obtain
– People are aware of the importance
(although this does not necessarily mean many people understand its
exact value or how to extract it)
• When simple (or less complicated)
– Data approach works
– ‘Traditional Research’ might work as well (or
better)
• When complicated
– Data approach often gives a better chance
– When done properly…
LDA
Data Processing
Visualization
Analysis
Interpretation
Box Plot!!!
The intended purpose is what matters.
Causality
Confounding means a difference between the treatment and control groups, other than the treatment itself, that affects the response being studied.
In other words: a factor other than the independent variable that affects the dependent variable.
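A minimal simulation sketch of the definition above. Everything in it is hypothetical and chosen only for illustration: the confounder is `age`, older people are more likely to be treated, and the response depends on age alone (the treatment has zero true effect). The naive comparison still shows a large gap, which mostly disappears when comparing people of similar age.

```python
import random
import statistics


def simulate(n=10_000, seed=0):
    """Return (treated, control) lists of (age, response) pairs."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        age = rng.uniform(20, 80)  # hypothetical confounder
        # Assumption: the chance of being treated grows linearly with age.
        gets_treatment = rng.random() < (age - 20) / 60
        # Assumption: the response depends on age only, not on the treatment.
        response = 0.5 * age + rng.gauss(0, 5)
        (treated if gets_treatment else control).append((age, response))
    return treated, control


def mean_response(group):
    return statistics.fmean([r for _, r in group])


def age_band(group, lo, hi):
    return [(a, r) for a, r in group if lo <= a < hi]


treated, control = simulate()

# Naive comparison: ignores that the two groups differ in age.
naive_gap = mean_response(treated) - mean_response(control)

# Crude adjustment: compare only people in a narrow age band.
adjusted_gap = (mean_response(age_band(treated, 45, 55))
                - mean_response(age_band(control, 45, 55)))

print(f"naive gap:    {naive_gap:.2f}")
print(f"adjusted gap: {adjusted_gap:.2f}")
```

Restricting to one age band is only the simplest possible adjustment; it is used here because it makes the effect of the confounder easy to see.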
Human vs. Machine
• Certain tasks are better done by humans
– Hacking
– Iteration
– Visual inspection
– Bringing knowledge from outside (of data)
• Some others by machine
– High‐dimensional analysis
– Data crunching (stats calculation, etc.)
– Organized approaches when one has no clue
• e.g. Automatic feature generation/selection
– ML algorithms
• Each known to work well for certain problems
• e.g. Deep learning on text, speech, image
Typical Sequence
(1/50) ^6
1 - (1/(50^6))
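The slides do not say which event these two expressions describe, so the sketch below treats them purely as arithmetic. One possible reading (an assumption, not stated in the source) is six independent events that each have probability 1/50, and the complement of all six happening. The names `p_single`, `p_all_six`, and `p_complement` are illustrative.

```python
from fractions import Fraction

# Exact rational arithmetic avoids floating-point rounding for tiny values.
p_single = Fraction(1, 50)

# (1/50)^6: under the independence assumption, all six events occur.
p_all_six = p_single ** 6

# 1 - (1/50^6): the complement of the event above.
p_complement = 1 - p_all_six

print(p_all_six)            # exact fraction
print(float(p_all_six))     # approximate decimal value
print(float(p_complement))  # very close to 1
```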
Lean Startup!
Minimize the total time through the loop
IDEAS -> (Build) -> CODE -> (Measure) -> DATA -> (Learn) -> IDEAS
Ensemble
Statistical: quantity (of data)
Representational: cannot be represented at all
Computational: computation (local minima)
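A small sketch of why combining models can help at all. It is an idealized illustration, not a model of the three reasons listed above: it assumes every model is correct with the same probability and that their errors are independent, which real ensembles rarely satisfy. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p_correct, independently
    of the others (a strong simplifying assumption).
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With an odd number of models there are no ties, so the printed accuracy rises as more models vote, provided each one is better than chance.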
ML Performance, Task, Evaluation
Entropy
$H(X) = -\sum_{x} p(x) \log p(x)$
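The entropy formula can be sketched in a few lines. The slide does not fix the base of the logarithm; base 2 (bits) is assumed here as the default, and terms with p(x) = 0 are skipped by the usual convention that 0 log 0 = 0. The function name `entropy` is illustrative.

```python
import math


def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as probabilities.

    `base=2` (bits) is an assumption; the formula works with any base.
    Zero-probability outcomes contribute nothing to the sum.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
print(fair_coin, biased_coin)
```

A fair coin gives 1 bit, the maximum for two outcomes; the biased coin gives less because its outcome is more predictable.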
Correlation
$r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}}$
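A direct sketch of the (Pearson) correlation coefficient, written to follow the formula term by term: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: sqrt of the product of the summed squared deviations.
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    if den == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return num / den


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, increasing
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear, decreasing
```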
Box Plot: quartiles, mean, median, outliers!!, min, max
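The quantities a box plot shows can be computed as follows. Two choices here are assumptions, not taken from the slides: outliers are flagged with the common 1.5 × IQR rule, and quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles` (other tools may give slightly different quartiles). The function name `box_plot_stats` is illustrative.

```python
import statistics


def box_plot_stats(values):
    """Summary statistics typically drawn in a box plot."""
    data = sorted(values)
    # Quartile cut points: Q1, median (Q2), Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.fmean(data),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this sample the single extreme value pulls the mean far above the median, which is the kind of discrepancy a box plot makes visible at a glance.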