
Zura Kakushadze Stevens 05042016


Page 1: Zura Kakushadze Stevens 05042016

Stat Risk Models, Billion Alphas, & ... Cancer Signatures

Zura Kakushadze

Quantigic® Solutions LLC, Stamford, CT, USA

Business School & School of Physics, Free University of Tbilisi, Georgia

[email protected]

Talk Presented at Financial Engineering Division, School of Systems and Enterprises

Stevens Institute of Technology, Hoboken, NJ, USA

May 4, 2016

Zura Kakushadze (Quantigic & FreeUni) Stat RM, Bn α’s, & ... Cancer Signatures May 4, 2016 1 / 20

Page 2: Zura Kakushadze Stevens 05042016

Motivation

Proliferation of Alphas

α mining: machines have taken over!

Olden days: #(α’s) N ∼ 10

Nowadays: $N \sim 10^4, 10^5, 10^6, \dots, 10^9$ (soon!)

α’s: ephemeral, faint, #(obs.) $\ll N$ ⇒ sample cov.mat singular!

“Mega” α: α combos → optimization → need invertible cov.mat


Page 3: Zura Kakushadze Stevens 05042016

Factor Models

Factor Models

Sample cov.mat: $C_{ij} = \sigma_i \sigma_j \Psi_{ij}$ ($i, j = 1, \dots, N$)

Sample var: $\sigma_i^2$ (computable, relatively stable, skewed)

Sample cor.mat: $\Psi_{ij}$ (singular, out-of-sample unstable)

Factor model: $\Psi_{ij} \approx \xi_i^2 \delta_{ij} + \sum_{A=1}^{K} \beta_{iA} \beta_{jA}$ (pos-def, $K \ll N$)

Pair-wise cor: ⇐ factor loadings $\beta_{iA}$ ($\xi_i^2 \Leftarrow \Psi_{ii} = 1$)

What to Use for Factor Loadings?

Stocks: style factors (size, value, growth, mom, vol, liq, etc., $\lesssim 10$)

Stocks: industry classification (GICS, ICB, BICS, etc.)

Alphas: no useful classification, few style factors (bad cor proxies)


Page 4: Zura Kakushadze Stevens 05042016

Statistical Risk Models

Statistical Risk Models [ZK & Yu, 2016a]

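No add’l input: only α return time series $R_{is}$ ($s = 1, \dots, M + 1$)

Factor loadings $\beta_{iA}$: linear combos of $\widetilde{R}_{is} = R_{is}/\sigma_i$ (norm.ret)

Principal components ($X_{is}$ = serially demeaned $\widetilde{R}_{is}$):

$$\Psi_{ij} = \frac{1}{M} \sum_{s=1}^{M+1} X_{is} X_{js} = \sum_{A=1}^{M} \lambda^{(A)} V^{(A)}_i V^{(A)}_j$$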


Page 5: Zura Kakushadze Stevens 05042016

Statistical Risk Models

Truncation

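First K prin.comp ($\lambda^{(1)} > \lambda^{(2)} > \dots$):

$$\Psi_{ij} \approx \xi_i^2 \delta_{ij} + \sum_{A=1}^{K} \beta_{iA} \beta_{jA}, \qquad \beta_{iA} = \sqrt{\lambda^{(A)}}\, V^{(A)}_i$$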

What Should K Be?

Fix K : keep it simple

K = eRank(Ψij): eRank = effective rank [Roy & Vetterli, 2007]

eRank = effective dimensionality (R code):

    eig = eigen(Psi)$values          # eigenvalues of cor.mat
    eig = eig[eig > 0]               # positive eigenvalues
    p = eig / sum(eig)               # normalized as weights
    eRank = exp(-sum(p * log(p)))    # arg.exp = spectral entropy
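The construction on these two slides (sample cor.mat, principal components, truncation at K = Round(eRank)) can be sketched end to end in R. This is a minimal illustration under stated assumptions, not the authors' production code: the return matrix `R` is simulated (3 hidden common factors plus noise, all hypothetical), the `1e-10` cutoff for "positive" eigenvalues is a numerical convenience, and the names `beta`, `xi2` and `Gamma` are chosen here for readability.

```r
set.seed(7)
N <- 30; M <- 20                       # N alphas, M + 1 observations (N > M)

# Hypothetical returns: 3 hidden common factors plus idiosyncratic noise
B <- matrix(rnorm(N * 3), N, 3)
R <- B %*% matrix(rnorm(3 * (M + 1)), 3, M + 1) +
  matrix(rnorm(N * (M + 1)), N, M + 1)

Psi <- cor(t(R))                       # N x N sample cor.mat, rank <= M < N
ev <- eigen(Psi, symmetric = TRUE)     # eigenvalues in decreasing order

eig <- ev$values[ev$values > 1e-10]    # drop numerically zero eigenvalues
p <- eig / sum(eig)
K <- round(exp(-sum(p * log(p))))      # K = Round(eRank(Psi))

# beta_iA = sqrt(lambda^(A)) * V_i^(A) for the first K principal components
beta <- sweep(ev$vectors[, 1:K, drop = FALSE], 2, sqrt(ev$values[1:K]), "*")
xi2 <- 1 - rowSums(beta^2)             # specific variance, fixed by Psi_ii = 1
Gamma <- beta %*% t(beta) + diag(xi2)  # factor-model cor.mat
```

Here `Psi` is singular (rank at most M), while `Gamma` has a unit diagonal and is invertible whenever all `xi2` are positive, which is what the optimization step needs.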


Page 6: Zura Kakushadze Stevens 05042016

Figure: eRank Illustration

Simple Example

Uniform cor: $\Psi_{ij} = (1 - \rho)\delta_{ij} + \rho$ [1-factor model, $\beta_i = \sqrt{\rho}$]

[Plot: eRank (vertical axis, 0 to 100) v. Correlation (%) (horizontal axis, 0 to 100), for N = 100]
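The curve in this figure can be reproduced approximately with a few lines of R. The sketch below relies on the standard spectrum of a uniform correlation matrix (one eigenvalue $1 + (N-1)\rho$ and $N - 1$ eigenvalues $1 - \rho$); the function name `erank_uniform` and the grid of ρ values are illustrative choices, not taken from the talk.

```r
# eRank of the uniform cor.mat Psi_ij = (1 - rho) * delta_ij + rho,
# computed from its known eigenvalues instead of a full diagonalization.
erank_uniform <- function(rho, N = 100) {
  eig <- c(1 + (N - 1) * rho, rep(1 - rho, N - 1))
  eig <- eig[eig > 0]                  # positive eigenvalues only
  p <- eig / sum(eig)
  exp(-sum(p * log(p)))                # exp(spectral entropy)
}

rho_grid <- seq(0, 1, by = 0.1)
erank_curve <- sapply(rho_grid, erank_uniform)

# Cross-check against a direct eigen-decomposition at rho = 0.5
N <- 100
Psi <- matrix(0.5, N, N); diag(Psi) <- 1
eig <- eigen(Psi, symmetric = TRUE, only.values = TRUE)$values
p <- eig / sum(eig)
erank_numeric <- exp(-sum(p * log(p)))
```

At ρ = 0 all eigenvalues are equal and eRank = N; as ρ approaches 1 a single eigenvalue dominates and eRank falls to 1, consistent with a 1-factor model.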


Page 7: Zura Kakushadze Stevens 05042016

How to Combine a Billion Alphas?

Sharpe → max [ZK & Yu, 2016b]

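Exp.ret: $E_i$

Weights: $w_i = \gamma \sum_{j=1}^{N} C^{-1}_{ij} E_j$ $\left[\gamma \Leftarrow \sum_{i=1}^{N} |w_i| = 1\right]$

Rescale: $\widetilde{w}_i = \sigma_i \xi_i w_i / \gamma$, $\widetilde{E}_i = E_i / (\sigma_i \xi_i)$, $\widetilde{\beta}_{iA} = \beta_{iA} / \xi_i$

Claim: $N \gg 1 \Rightarrow \widetilde{w}_i \approx \widetilde{\varepsilon}_i$ = regression residuals of $\widetilde{E}_i$ over $\widetilde{\beta}_{iA}$

Example ($K = 1$, uniform cor):

$$\widetilde{w}_i = \widetilde{E}_i - \frac{\rho}{1 + (N - 1)\rho} \sum_{j=1}^{N} \widetilde{E}_j \approx \widetilde{E}_i - \frac{1}{N} \sum_{j=1}^{N} \widetilde{E}_j$$

Reg over int: $\widetilde{\beta}_i \equiv \sqrt{\rho}/\sqrt{1 - \rho}$

General: cor $\not\ll 1$, $N \gg 1 \Rightarrow$ approx. regression (holds for prin.comp)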
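The uniform-correlation example can be checked numerically. The R sketch below assumes unit variances, so that the covariance matrix equals the uniform cor.mat, and uses simulated expected returns `E` (hypothetical values); it compares the optimized weights, rescaled by $(1 - \rho)$, with the residuals of a regression of `E` over the intercept.

```r
set.seed(1)
N <- 500; rho <- 0.3
E <- rnorm(N)                          # hypothetical expected returns

Psi <- matrix(rho, N, N); diag(Psi) <- 1

# Sharpe-maximizing weights are proportional to Psi^{-1} E; multiplying by
# (1 - rho) gives E_i - rho / (1 + (N - 1) * rho) * sum(E)
w_opt <- (1 - rho) * solve(Psi, E)
w_exact <- E - rho / (1 + (N - 1) * rho) * sum(E)

# Regression of E over the intercept: the residuals are demeaned E
w_reg <- E - mean(E)

max_gap <- max(abs(w_opt - w_reg))     # shrinks roughly like 1 / N
```

The gap between `w_opt` and `w_reg` is a constant shift of size $(1 - \rho)\sum_j E_j / (N (1 + (N - 1)\rho))$, which is negligible for large N unless ρ is very small.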


Page 8: Zura Kakushadze Stevens 05042016

Removing “Overall” Mode

Sharpe → max = Overkill

Hedge: incl. all α’s going bust at once

Unlikely: if avg. α cor is low

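Math: 1st prin.comp $V^{(1)}_i \approx 1/\sqrt{N}$ (“overall” ∼ “market” mode)

Residuals: $\widetilde{\varepsilon}_i \perp \widetilde{\beta}_{iA} \Rightarrow \widetilde{\varepsilon}_i \perp V^{(1)}_i \Rightarrow$ ∼ 50% of $\widetilde{w}_i$ & $w_i$ < 0

Solution: X-sec demean $R_{is}$, then calc $\Psi_{ij}$: remove “overall” mode!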
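A small simulation illustrates what cross-sectional demeaning does to the "overall" mode. Everything in this R sketch is hypothetical (the single common driver `f`, its scale, and the sizes N and M); it only shows that subtracting the cross-sectional mean at each date removes the large average pair-wise correlation.

```r
set.seed(42)
N <- 50; M <- 200
f <- rnorm(M, sd = 2)                              # one common "overall" driver
R <- matrix(rep(f, each = N), N, M) +              # same f_s for every alpha i
  matrix(rnorm(N * M), N, M)                       # plus idiosyncratic noise

# Average off-diagonal element of the sample cor.mat
avg_cor <- function(R) {
  Psi <- cor(t(R))
  mean(Psi[upper.tri(Psi)])
}

R_dm <- sweep(R, 2, colMeans(R))                   # X-sec demean: per date s

avg_before <- avg_cor(R)                           # close to 4 / 5 by design
avg_after <- avg_cor(R_dm)                         # close to -1 / (N - 1)
```

After demeaning, the unit vector is no longer a dominant direction of Ψ, so regression residuals are not forced to be roughly half negative by orthogonality to it.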

Computational Cost

Regression: $O(M^2 N)$

Prin.comp: $O(M^2 N)$ (not power iterations) [ZK & Yu, 2016a]

Reduce: $\beta_{iA}$ = X-sec demeaned $X_{is}$ ($s = 1, \dots, M - 1$), $\xi_i \equiv 1$


Page 9: Zura Kakushadze Stevens 05042016

Cancer Signatures

Motivation

Cancer: 1 in 8 human deaths

Diff: somatic mutations

Common: single nucleotide variation = single base

Exo: chemical insults, UV, etc.

Endo: imperfect DNA replic., spont. cytosine deamination, etc.

Mutational signatures: alteration patterns in cancer genome

Identify: understand origins and development of cancer

Therapy: if ∃ small #(sigs), cure for 1 cancer may help cure many

Prevention: pair obs. sigs w/ sigs caused by carcinogens


Page 10: Zura Kakushadze Stevens 05042016

Figure: Double Helix

DNA = 2 strands. Each strand = string of A, C, G, T (adenine, cytosine, guanine and thymine).

Base complementarity: A in one strand always binds with T in the other, and G with C.


Page 11: Zura Kakushadze Stevens 05042016

Mutation Categories

Base Mutations

6 mutations: C > A, C > G, C > T, T > A, T > C, T > G

Other 6: base complementarity

96 mut.cat: 4 (lhs) × 6 (base.mut) × 4 (rhs) (e.g.: TCG > TAG)
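The count of 96 categories follows from enumerating 4 left neighbors × 6 base mutations × 4 right neighbors. A minimal R sketch of that enumeration is given below; the label format (e.g. `TCG>TAG`) mirrors the example on the slide, and the variable names are illustrative.

```r
bases <- c("A", "C", "G", "T")
base_mut <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")   # 6 base mutations

# All combinations of left neighbor, base mutation, right neighbor
grid <- expand.grid(lhs = bases, mut = base_mut, rhs = bases,
                    stringsAsFactors = FALSE)

ref <- substr(grid$mut, 1, 1)          # base before mutation (C or T)
alt <- substr(grid$mut, 3, 3)          # base after mutation

# Trinucleotide before > trinucleotide after, e.g. "TCG>TAG"
mut_cat <- paste0(grid$lhs, ref, grid$rhs, ">", grid$lhs, alt, grid$rhs)
n_cat <- length(unique(mut_cat))
```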

Data

Samples: DNA sequenced whole cancer genomes

Occur.cts: matrix $G_{is} \geq 0$ ($i = 1, \dots, N = 96$; $s = 1, \dots, d$ samples)

By cancer type: $[G(\alpha)]_{is}$ ($\alpha = 1, \dots, n$ cancer types)

Nonnegative Matrix Factorization (NMF) [Alexandrov et al, 2013]

NMF: $G \approx W H$, $N \times K$ sig weights $W_{iA}$, $K \times d$ sig exposures $H_{As}$

Iter.algo: K = trial.err, many loc.min, no glob.conv, avg.samplings

Run: cancer type/“big matrix”, days/weeks, sig instability (var)...
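To make the factorization $G \approx WH$ concrete, here is a bare-bones NMF in base R using Lee-Seung multiplicative updates for the Frobenius-norm error. This is a generic textbook variant chosen for brevity; it is an assumption that it resembles the iterative algorithm referenced above, and it omits the averaging over samplings mentioned on the slide. The function name `nmf_mu`, the small constant `eps` and the simulated counts matrix are all hypothetical.

```r
# NMF via multiplicative updates: G (N x d) ~ W (N x K) %*% H (K x d)
nmf_mu <- function(G, K, n_iter = 200, eps = 1e-9, seed = 1) {
  set.seed(seed)                                   # random nonnegative start
  N <- nrow(G); d <- ncol(G)
  W <- matrix(runif(N * K), N, K)
  H <- matrix(runif(K * d), K, d)
  for (it in seq_len(n_iter)) {
    H <- H * (t(W) %*% G) / (t(W) %*% W %*% H + eps)   # update exposures
    W <- W * (G %*% t(H)) / (W %*% H %*% t(H) + eps)   # update sig weights
  }
  list(W = W, H = H, err = sqrt(sum((G - W %*% H)^2)))
}

set.seed(11)
N <- 96; d <- 40; K <- 3
G <- matrix(runif(N * K), N, K) %*% matrix(rexp(K * d), K, d)  # toy data

fit_short <- nmf_mu(G, K, n_iter = 5)
fit_long <- nmf_mu(G, K, n_iter = 200)
```

Different seeds generally converge to different local minima, which is one source of the signature instability noted on the slide.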


Page 12: Zura Kakushadze Stevens 05042016

Stat Factor Models for Cancer Sigs [ZK & Yu, 2016c]

Skewed Counts, Aggregation, “Overall” Mode, etc.

bio ↔ fin dict: mut.cat ↔ tkr, sample ↔ trd.day, sigs ↔ “sectors”

Skewed cts: nonneg.cts ∼ log-norm distrib.

Fix: $R_{is} = \ln(1 + G_{is})$ (some $G_{is} = 0$)

Cor.mat: calc $\Psi_{ij}$ based on $R_{is}$

Samples: too noisy in each cancer type

Fix: aggregate by cancer type

Data: published only (Q&A), 14 cancer types, 1389 samples

Avg.cor: whopping 96% ⇒ “overall” mode = somatic mut.noise

Fix: rm “overall” mode = X-sec demean $R_{is}$ ($s$: 14 cancer types)

#(sigs.eRank): $K = \mathrm{Round}(\mathrm{eRank}(\Psi_{ij})) = 7$

NMF: re-exp $G_{is} = \exp(R'_{is})$ ($R'_{is}$ = X-sec demeaned $R_{is}$)

Result: #(sigs.NMF) = 7 = #(sigs.eRank), much less noisy sigs!
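The preprocessing steps on this slide can be strung together as follows. This R sketch runs on simulated Poisson counts standing in for the aggregated 96 × 14 matrix, so the resulting K is not expected to equal 7; the variable names and the `1e-10` eigenvalue cutoff are illustrative assumptions.

```r
set.seed(3)
N <- 96; n <- 14                                   # mut.cat x cancer types
G <- matrix(rpois(N * n, lambda = 50), N, n)       # hypothetical aggregated counts

R <- log(1 + G)                                    # tame skewed counts
R_dm <- sweep(R, 2, colMeans(R))                   # X-sec demean per cancer type

Psi <- cor(t(R_dm))                                # N x N cor.mat across types
eig <- eigen(Psi, symmetric = TRUE, only.values = TRUE)$values
eig <- eig[eig > 1e-10]                            # positive eigenvalues
p <- eig / sum(eig)
K <- round(exp(-sum(p * log(p))))                  # number of sigs from eRank

G_new <- exp(R_dm)                                 # re-exponentiated NMF input
```

`G_new` is strictly positive, so it is a valid NMF input, and K is fixed before any NMF run instead of by trial and error.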


Page 13: Zura Kakushadze Stevens 05042016

Average Correlation and 1st 5 Eigenvalues of Ψij

Cancer Type                    Avg.Cor (%)  Eig.1  Eig.2  Eig.3  Eig.4  Eig.5
B Cell Lymphoma                66.6         65.0   5.67   3.08   2.54   2.23
Bone Cancer                    48.1         48.1   3.31   2.03   1.88   1.81
Brain Lower Grade Glioma       17.7         20.3   4.52   3.91   3.47   3.23
Breast Cancer                  65.2         64.1   6.18   3.01   2.07   1.11
Chronic Lymphocytic Leukemia   79.6         77.6   1.73   1.23   1.17   1.07
Esophageal Cancer              16.7         23.3   9.58   7.59   7.34   6.98
Gastric Cancer                 80.9         78.2   6.41   3.95   1.62   0.68
Liver Cancer                   87.9         84.7   1.77   1.09   0.90   0.76
Lung Cancer                    80.0         78.3   6.03   4.47   1.93   1.17
Medulloblastoma                54.2         53.6   3.33   2.01   1.76   1.63
Ovarian Cancer                 75.6         73.8   6.04   2.64   1.99   1.31
Pancreatic Cancer              17.0         21.3   5.38   3.71   3.27   2.84
Prostate Cancer                68.1         68.5   11.5   8.60   7.43   0
Renal Cell Carcinoma           78.2         75.9   5.89   1.86   1.63   0.97
All Cancer Types               88.1         84.9   5.47   0.77   0.58   0.37
Aggregated by Cancer Type      96.1         92.4   1.26   0.89   0.51   0.34
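As a rough sanity check on the table, the first eigenvalue of a uniform N × N correlation matrix is $1 + (N - 1)\rho$. Treating the average correlation as ρ is only an approximation (an assumption made here, not a claim from the talk), but with N = 96 it lands close to the reported Eig.1 for the high-correlation rows. The three rows below are copied from the table.

```r
N <- 96                                            # number of mutation categories

# Avg.Cor (as a fraction) and Eig.1 for three rows of the table
avg_cor <- c(Liver = 0.879, All = 0.881, Aggregated = 0.961)
eig1_table <- c(Liver = 84.7, All = 84.9, Aggregated = 92.4)

# Uniform-correlation approximation of the first eigenvalue
eig1_uniform <- 1 + (N - 1) * avg_cor
gap <- eig1_uniform - eig1_table
```

The gaps are a few tenths at most, which supports reading Eig.1 as the "overall" mode driven by the average correlation.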


Page 14: Zura Kakushadze Stevens 05042016

Figure: NMF Reconstruction Accuracy (Our Method)

[Plot: “Reconstruction accuracy by cancer types”; Pearson correlation (vertical axis, 0 to 1) by cancer subtype (horizontal axis); series for 4, 5, 6, 7 and 8 signatures]


Page 15: Zura Kakushadze Stevens 05042016

Figure: Signature Contributions (Our Method)

[Plot: “Signature contribution to 14 cancer subtypes”; % contribution (vertical axis, 0% to 100%) by cancer subtype (horizontal axis); series for Signature 1 through Signature 7]


Page 16: Zura Kakushadze Stevens 05042016

Figure: Sig1 (Pancreatic), Vanilla NMF v. Our Method


Page 17: Zura Kakushadze Stevens 05042016

Figure: Sig5 (Liver), Vanilla NMF v. Our Method


Page 18: Zura Kakushadze Stevens 05042016

Figure: Yoda v. Yoda on Steroids


Page 19: Zura Kakushadze Stevens 05042016

Novel Approach

7 Cancer Signatures

4 known sigs: [Nik-Zainal et al, 2012], [Alexandrov et al, 2013b]

Novel sig: dominant liver cancer sig: 96% contr. (exciting!)

Novel sig: renal cell carcinoma (kidney): 70% contr.

Novel sig: bone cancer, brain lower grade glioma, medulloblastoma

Bonus: more stable sigs, comp.cost cut by factor ∼ 10 (fewer iter)!

What’s Next?

Exome: much cheaper, faster than genome

Issue: small ⊂ genome, sparse (many 0’s), noisy

Fix: aggregate by cancer type, rm “overall” mode

Cleaner sigs: novel prop method, still a secret...

Stability: need $\gg$ data, Int’l Cancer Genome Consortium (embargoed)

Funding: gov, angels, biotech/pharma, etc.


Page 20: Zura Kakushadze Stevens 05042016

References

ZK & Willie Yu (2016a) Statistical Risk Models. The Journal of Investment Strategies (forthcoming); http://ssrn.com/abstract=2732453 (Feb 15, 2016).

ZK & Willie Yu (2016b) How to Combine a Billion Alphas. Journal of Asset Management (under review); http://ssrn.com/abstract=2739219 (Feb 29, 2016).

ZK & Willie Yu (2016c) Factor Models for Cancer Signatures. Working Paper; http://ssrn.com/abstract=2772458 (Apr 29, 2016).

Thank you!
