Doing Data Science, Ch. 6: Time Stamps and Financial Modeling. 아꿈사 study group, 2015.06.27, 정민철 ([email protected])

Doing Data Science - Ch. 6: Time Stamps and Financial Modeling


Page 1: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Doing Data Science
Ch. 6: Time Stamps and Financial Modeling

아꿈사 study group, 2015.06.27
정민철 ([email protected])

www.it-ebooks.info

Page 2: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

tvtag

• Purpose: provide personalized TV program recommendations and schedules

• Format of collected data: {user, action, item} + time

• A user reacting to a particular show in a particular way => classified as a “like”

• Visualizing the “like” data: a user-item bipartite graph

• Refining the graph: edges between users (follow/friend) and between TV shows (similarity)

• Time: helps identify specific time slots, the spread of influence, and change over time

Page 3: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Users can “check in” to TV shows, which means they can tell other people they’re watching a show, thereby creating a timestamped data point. They can also perform other actions such as liking or commenting on the show.

We store information in triplets of data of the form {user, action, item}, where the item is a TV show (or a movie). One way to visualize this stored data is by drawing a bipartite graph as shown in Figure 6-1.

Figure 6-1. Bipartite graph with users and items (shows) as nodes

We’ll go into graphs in later chapters, but for now you should know that the dots are called “nodes” and the lines are called “edges.” This specific kind of graph, called a bipartite graph, is characterized by there being two kinds of nodes, in this case corresponding to “users” and “items.” All the edges go between a user and an item, specifically if the user in question has acted in some way on the show in question. There are never edges between different users or different shows.
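As a minimal sketch of this structure (the users, shows, and actions below are invented, not data from the text), the triplets map to bipartite-graph nodes and edges like so:

```python
# Hypothetical {user, action, item} triplets.
triplets = [
    ("user_a", "liked", "Show 1"),
    ("user_a", "commented", "Show 2"),
    ("user_b", "liked", "Show 1"),
]

# The two kinds of nodes in the bipartite graph...
users = {u for u, _, _ in triplets}
items = {i for _, _, i in triplets}

# ...and the edges, which only ever run between a user and an item.
edges = {(u, i) for u, _, i in triplets}

print(sorted(users))
print(sorted(items))
print(sorted(edges))
```

Note that no pair of users or pair of shows ever appears in `edges`, which is exactly the bipartite property described above.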

136 | Chapter 6: Time Stamps and Financial Modeling


Page 4: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Timestamps

• Handling timestamped event data
- A common form of data in the big data era, and a factor that created the big data phenomenon
- Human behavior can be measured at precise times throughout the day
- Large volumes of data can be stored and processed quickly

• Data extraction
…and which stories were clicked on. This generates event logs. Each record is an event that took place between a user and the app or website.

Here’s an example of a raw data point from GetGlue:

{"userId": "rachelschutt", "numCheckins": "1",
 "modelName": "movies", "title": "Collaborator",
 "source": "http://getglue.com/stickers/tribeca_film/collaborator_coming_soon",
 "numReplies": "0", "app": "GetGlue", "lastCheckin": "true",
 "timestamp": "2012-05-18T14:15:40Z",
 "director": "martin donovan", "verb": "watching",
 "key": "rachelschutt/2012-05-18T14:15:40Z",
 "others": "97", "displayName": "Rachel Schutt",
 "lastModified": "2012-05-18T14:15:43Z",
 "objectKey": "movies/collaborator/martin_donovan",
 "action": "watching"}

If we extract four fields: {"userid": "rachelschutt", "action": "watching", "title": "Collaborator", "timestamp": "2012-05-18T14:15:40Z"}, we can think of it as being in the order we just discussed, namely {user, verb, object, timestamp}.
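In Python, that extraction step might look like the following sketch (the raw record is abbreviated here to a few of the fields shown above):

```python
import json

# A shortened version of the raw GetGlue record from the text.
raw = ('{"userId": "rachelschutt", "action": "watching", '
       '"title": "Collaborator", "timestamp": "2012-05-18T14:15:40Z", '
       '"others": "97", "app": "GetGlue"}')

record = json.loads(raw)

# Keep only the {user, verb, object, timestamp} quadruple we care about.
event = {
    "userid": record["userId"],
    "action": record["action"],
    "title": record["title"],
    "timestamp": record["timestamp"],
}
print(event)
```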

Exploratory Data Analysis (EDA)

As we described in Chapter 2, it’s best to start your analysis with EDA so you can gain intuition for the data before building models with it. Let’s delve deep into an example of EDA you can do with user data, stream-of-consciousness style. This is an illustration of a larger technique, and things we do here can be modified to other types of data, but you also might need to do something else entirely depending on circumstances.

The very first thing you should look into when dealing with user data is individual user plots over time. Make sure the data makes sense to you by investigating the narrative the data indicates from the perspective of one person.

To do this, take a random sample of users: start with something small like 100 users. Yes, maybe your dataset has millions of users, but to start out, you need to gain intuition. Looking at millions of data points is too much for you as a human. But just by looking at 100, you’ll start to understand the data, and see if it’s clean. Of course, this kind of sample size is not large enough if you were to start making inferences about the entire set of data.
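A sketch of the sampling step, with a made-up universe of user IDs standing in for the real logs:

```python
import random

# Hypothetical user IDs; in practice these would come from the event logs.
all_users = [f"user_{i}" for i in range(10_000)]

random.seed(42)                          # fix the seed so the sample is reproducible
sample = random.sample(all_users, 100)   # 100 distinct users, drawn without replacement

print(len(sample))
```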


{"userid": "rachelschutt", "action": "watching", "title": "Collaborator", "timestamp": "2012-05-18T14:15:40Z"}

Page 5: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Timestamps - Exploratory Data Analysis

• Gain intuition about the data through EDA

• Understand the data’s characteristics by posing many questions and finding their answers

• Choose an analysis approach based on the characteristics you find
- Choices grounded in the situation
- With sound reasons given for each choice

• Things to choose: metrics, encoding schemes, time horizons, data categories, how to handle mixed actions, which behavior patterns to watch for, and so on

Page 6: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

You might do this by finding usernames and grepping or searching for 100 random choices, one at a time. For each user, create a plot like the one in Figure 6-2.

Figure 6-2. An example of a way to visually display user-level data over time

Now try to construct a narrative from that plot. For example, we could say that user 1 comes at the same time each day, whereas user 2 started out active in this time period but then came less and less frequently. User 3 needs a longer time horizon for us to understand his or her behavior, whereas user 4 looks “normal,” whatever that means.

Let’s pose questions from that narrative:

• What is the typical or average user doing?
• What does variation around that look like?
• How would we classify users into different segments based on their behavior with respect to time?
• How would we quantify the differences between these users?

Timestamps | 139


Think abstractly about a typical question from the data munging discipline. Say we have some raw data where each data point is an event, but we want to have data stored in rows where each row consists of a user followed by a bunch of timestamps corresponding to actions that user performed. How would we get the data to that point? Note that different users will have a different number of timestamps.

Make this reasoning explicit: how would we write the code to create a plot like the one just shown? How would we go about tackling the data munging exercise?
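One way to make that munging step concrete, as a sketch in plain Python over an invented event stream: group the events by user, giving each user a variable-length row of timestamps.

```python
from collections import defaultdict

# Hypothetical event log: one (user, timestamp) pair per action.
events = [
    ("user_2", "2012-05-18T14:16:02Z"),
    ("user_1", "2012-05-18T14:15:40Z"),
    ("user_1", "2012-05-19T09:01:13Z"),
]

# Pivot the events into one row per user; rows have different lengths.
rows = defaultdict(list)
for user, ts in events:
    rows[user].append(ts)
for stamps in rows.values():
    stamps.sort()   # ISO 8601 strings sort chronologically

print(dict(rows))
```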

Suppose a user can take multiple actions: “thumbs_up,” “thumbs_down,” “like,” and “comment.” How can we plot those events? How can we modify our metrics? How can we encode the user data with these different actions? Figure 6-3 provides an example for the first question, where we color-code the actions thumbs up and thumbs down, denoted thumbs_up and thumbs_down.

Figure 6-3. Use color to include more information about user actions in a visual display

In this toy example, we see that all the users did the same thing at the same time toward the right end of the plots. Wait, is this a real event or a bug in the system? How do we check that? Is there a large co-occurrence of some action across users? Is “black” more common than “red”? Maybe some users like to always thumb things up, another


Page 7: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

group always likes to thumb things down, and some third group of users is a mix. What’s the definition of a “mix”?

Now that we’ve started to get some sense of variation across users, we can think about how we might want to aggregate users. We might make the x-axis refer to time, and the y-axis refer to counts, as shown in Figure 6-4.

Figure 6-4. Aggregating user actions into counts

We’re no longer working with 100 individual users, but we’re still making choices, and those choices will impact our perception and understanding of the dataset.

For example, are we counting the number of unique users or the overall number of user logins? Because some users log in multiple times, this can have a huge impact. Are we counting the number of actions or the number of users who did a given action at least once during the given time segment?

What is our time horizon? Are we counting per second, minute, hour, 8-hour segments, day, or week? Why did we choose that? Is the signal overwhelmed by seasonality, or are we searching for seasonality?
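The difference between those two kinds of counts is easy to see in a sketch (hourly buckets, invented events):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events; the same user can appear several times in one hour.
events = [
    ("a", "2012-05-18T14:15:40Z"),
    ("a", "2012-05-18T14:50:02Z"),
    ("b", "2012-05-18T14:59:59Z"),
    ("b", "2012-05-18T15:10:00Z"),
]

action_counts = defaultdict(int)   # total actions per hour
active_users = defaultdict(set)    # users seen at least once per hour

for user, ts in events:
    hour = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d %H:00")
    action_counts[hour] += 1
    active_users[hour].add(user)

print(dict(action_counts))
print({h: len(u) for h, u in active_users.items()})
```

In the 14:00 bucket there are three actions but only two unique users, so the two counting choices already tell different stories about the same data.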


Page 8: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Timestamps - Wrap-up

• Metrics and new variables or features
- Intuition gained from EDA helps in constructing metrics
- Feed that intuition into models and algorithms

• What should we do next?
- Time series modeling, including autoregression
- Clustering via a definition of proximity
- Detecting behavior patterns
- Detecting change points: capturing when big events happen
- Training recommender systems

• Time series modeling: predicting events that are extremely sensitive to time, or events that can be predicted from what has already happened

Page 9: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Thought experiment: what do we lose by ignoring timestamps when working with a large training dataset?

• Without a sense of time, we cannot work out cause and effect
- Absolute timestamps vs. relative time differences
- Seasonality and trend analysis

Figure 6-6. Without keeping track of timestamps, we can’t see time-based patterns; here, we see a seasonal pattern in a time series

This idea, of keeping track of trends and seasonalities, is very important in financial data, and essential to keep track of if you want to make money, considering how small the signals are.

Financial Modeling

Before the term data scientist existed, there were quants working in finance. There are many overlapping aspects of the job of the quant and the job of the data scientist, and of course some that are very different. For example, as we will see in this chapter, quants are singularly obsessed with timestamps, and don’t care much about why things work, just if they do.

Of course there’s a limit to what can be covered in just one chapter, but this is meant to give a taste of the kind of approach common in financial modeling.

Financial Modeling | 145


Page 10: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Financial Modeling

• Financial quants: the original big data analysts
- Obsessed with timestamps, but not much concerned with why things work

• In-sample and out-of-sample data
- In-sample data: training data + validation data
- Out-of-sample data: data used only after the model is finished

Page 11: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Causal modeling

• Never use information from the future to predict something in the present (use only information from the past up to now)

• Timestamp of record vs. timestamp of availability (the moment the data becomes usable)

• Sets of coefficients
- We don’t know the best-fit coefficients until we reach the last timestamp
- Time series data never yields a single set of best-fit coefficients
- The coefficients change as events occur
- Update the model every time new data arrives
- A model’s coefficients are a living organism that keeps evolving

• Decisions about the future have to be made from what we know now

Page 12: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Preparing financial data

• Data preparation: transforming the data so it better reflects reality
- Normalizing the data
- Taking logs of the data
- Creating categorical variables
- Converting data to binary variables around a threshold

• Running submodels of the model
- Considering new components
- Training submodels such as univariate regressions

• When you can’t normalize by a precomputed mean => normalize by a moving average
- Acausal interference can make a bad model look good

Page 13: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Log returns

• In finance, returns are computed on a daily basis

• Percent returns:
- Not additive
- Biased in favor of gains

• Log returns:
- Additive
- Symmetric with respect to gains and losses

• The two are close for very small returns

…all causal, so you have to be careful when you train your overall model how to introduce your next data point, and make sure the steps are all in order of time, and that you’re never ever cheating and looking ahead in time at data that hasn’t happened yet.

In particular, and it happens all the time, one can’t normalize by the mean calculated over the training set. Instead, have a running estimate of the mean, which you know at a given moment, and normalize with respect to that.

To see why this is so dangerous, imagine a market crash in the middle of your training set. The mean and variance of your returns are heavily affected by such an event, and doing something as innocuous as a mean estimate translates into anticipating the crash before it happens. Such acausal interference tends to help the model, and could likely make a bad model look good (or, what is more likely, make a model that is pure noise look good).
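A sketch of the causal alternative, with invented return values: each point is normalized only by the mean of the data seen before it, so a later crash cannot leak backward.

```python
# Toy daily returns with a "crash" in the middle; all values are invented.
returns = [0.01, -0.02, 0.015, -0.30, 0.02]

# Normalize each point by the running mean of strictly earlier data.
normalized = []
total, count = 0.0, 0
for r in returns:
    running_mean = total / count if count else 0.0  # mean of data known so far
    normalized.append(r - running_mean)
    total += r
    count += 1

print(normalized)
```

Note that the crash at index 3 changes the running mean only for the points after it, never for the points before it.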

Log Returns

In finance, we consider returns on a daily basis. In other words, we care about how much the stock (or future, or index) changes from day to day. This might mean we measure movement from opening on Monday to opening on Tuesday, but the standard approach is to care about closing prices on subsequent trading days.

We typically don’t consider percent returns, but rather log returns: if F_t denotes a close on day t, then the log return that day is defined as log(F_t / F_{t−1}), whereas the percent return would be computed as 100 (F_t / F_{t−1} − 1). To simplify the discussion, we’ll compare log returns to scaled percent returns, which are the same as percent returns except without the factor of 100. The reasoning is not changed by this difference in scalar.

There are a few different reasons we use log returns instead of percentage returns. For example, log returns are additive but scaled percent returns aren’t. In other words, the five-day log return is the sum of the five one-day log returns. This is often computationally handy.

By the same token, log returns are symmetric with respect to gains and losses, whereas percent returns are biased in favor of gains. So, for example, if our stock goes down by 50%, or has a −0.5 scaled percent gain, and then goes up by 100%, so has a 1.0 scaled percent gain, we are where we started. But working in the same scenarios with log



returns, we’d see first a log return of log 0.5 = −0.301 followed by a log return of log 2.0 = 0.301 (logs taken base 10); the two cancel exactly.
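A quick numeric check of this example, using base-10 logs to match the figures quoted above:

```python
import math

# Down 50% (ratio 0.5), then up 100% (ratio 2.0): the price is back where it started.
ratios = [0.5, 2.0]

log_returns = [math.log10(r) for r in ratios]   # roughly -0.301 and +0.301
pct_returns = [r - 1 for r in ratios]           # -0.5 and +1.0

print(sum(log_returns))   # ~0: log returns cancel
print(sum(pct_returns))   # 0.5: scaled percent returns do not
```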

Even so, the two kinds of returns are close to each other for smallish returns, so if we work with short time horizons, like daily or shorter, it doesn’t make a huge difference. This can be proven easily: setting x = F_t / F_{t−1}, the scaled percent return is x − 1 and the log return is log(x), which has the following Taylor expansion:

log(x) = Σ_{n≥1} (−1)^{n+1} (x − 1)^n / n = (x − 1) − (x − 1)² / 2 + ⋯

In other words, the first term of the Taylor expansion agrees with the scaled percent return. So as long as the second term is small compared to the first, which is usually true for daily returns, we get a pretty good approximation of percent returns using log returns.
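A small numeric check of how fast the approximation degrades as the move gets bigger:

```python
import math

# For small moves, log(x) is close to x - 1: the first Taylor term is the
# scaled percent return, and the gap is roughly (x - 1)^2 / 2.
gaps = {}
for x in (1.001, 1.01, 1.1):
    gaps[x] = abs(math.log(x) - (x - 1))
    print(x, round(gaps[x], 7))
```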

Here’s a picture of how closely these two functions behave, keeping in mind that when x = 1, there’s no change in price whatsoever, as shown in Figure 6-8.

Figure 6-8. Comparing log and scaled percent returns


Page 14: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Example: The S&P Index

Example: The S&P Index

Let’s work out a toy example. If you start with S&P closing levels as shown in Figure 6-9, then you get the log returns illustrated in Figure 6-10.

Figure 6-9. S&P closing levels shown over time

Figure 6-10. The log of the S&P returns shown over time



What’s that mess? It’s crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described earlier). Once we do that we get something like Figure 6-11, which is clearly better behaved.

Figure 6-11. The volatility-normalized log of the S&P closing returns shown over time

Working out a Volatility Measurement

Once we have our returns defined, we can keep a running estimate of how much we have seen it change recently, which is usually measured as a sample standard deviation, and is called a volatility estimate.

A critical decision in measuring the volatility is choosing a lookback window, which is a length of time in the past we will take our information from. The longer the lookback window is, the more information we have to go by for our estimate. However, the shorter our lookback window, the more quickly our volatility estimate responds to new information. Sometimes you can think about it like this: if a pretty big market event occurs, how long does it take for the market to “forget about it”? That’s pretty vague, but it can give one an intuition on the appropriate length of a lookback window. So, for example, it’s definitely more than a week, sometimes less than four months. It also depends on how big the event is, of course.
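A sketch of such a volatility estimate in plain Python, using a rolling lookback window over invented daily returns:

```python
import math

def rolling_vol(returns, window):
    """Sample standard deviation over a trailing lookback window (hypothetical helper)."""
    vols = []
    for t in range(len(returns)):
        chunk = returns[max(0, t - window + 1): t + 1]
        if len(chunk) < 2:
            vols.append(None)   # not enough history for a sample std yet
            continue
        m = sum(chunk) / len(chunk)
        var = sum((r - m) ** 2 for r in chunk) / (len(chunk) - 1)
        vols.append(math.sqrt(var))
    return vols

returns = [0.01, -0.01, 0.02, -0.02, 0.01]   # toy daily returns
vols = rolling_vol(returns, window=3)
print(vols)
```

Shrinking `window` makes the estimate react faster to the latest returns; growing it gives a smoother estimate based on more history, which is exactly the trade-off described above.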


Page 15: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Measuring volatility

• Choosing a lookback window
- The length of past time we take our information from
- The longer it is => the more information goes into the estimate
- The shorter it is => the faster the estimate responds to new information
- When a big event happens, how long does it take to be forgotten?

• How do we use the past data?
- Apply a rolling window: equal weight on each of the previous n days
- Use a continuous lookback window: apply a half-life to older data

• Choose the downweighting decay that minimizes risk

Page 16: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Figure 6-12. Volatility in the S&P with different decay factors

Exponential Downweighting

We’ve already seen an example of exponential downweighting in the case of keeping a running estimate of the volatility of the returns of the S&P.

The general formula for downweighting some additive running estimate E is simple enough. We weight recent data more than older data, and we assign the downweighting of older data a name s and treat it like a parameter. It is called the decay. In its simplest form we get:

E_t = s · E_{t−1} + (1 − s) · e_t

where e_t is the new term.
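A direct transcription of this recurrence (seeding the estimate with the first observation, which is one common but not the only choice):

```python
def exponential_downweight(values, s):
    """Running estimate E_t = s * E_{t-1} + (1 - s) * e_t, seeded with the first value."""
    est = values[0]
    history = [est]
    for e in values[1:]:
        est = s * est + (1 - s) * e
        history.append(est)
    return history

# Higher decay s => older data dominates; lower s => the estimate chases new data.
print(exponential_downweight([1.0, 2.0, 3.0], s=0.9))
```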


Page 17: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Exponential downweighting

• Decay
- Give more weight to recent data
- Call the downweighting of older data s and treat it like a parameter


Page 18: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

The financial modeling feedback loop

• The market learns over time
- Buying affects the market => it shrinks the very signal you predicted

• A mass of algorithms that predict things and, in doing so, make the things they predict disappear
- Existing signals have become very weak
- Because market participants have understood and pre-anticipated them all

• Ordinary evaluation metrics don’t work => use a PnL (Profit & Loss) graph instead

Page 19: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

The consequence of this learning over time is that the existing signals are very weak. Things that were obvious (in hindsight) with the naked eye in the 1970s are no longer available, because they’re all understood and pre-anticipated by the market participants (although new ones might pop into existence).

The bottom line is that, nowadays, we are happy with a 3% correlation for models that have a horizon of 1 day (a “horizon” for your model is how long you expect your prediction to be good). This means not much signal, and lots of noise! Even so, you can still make money if you have such an edge and if your trading costs are sufficiently small.

In particular, lots of the machine learning “metrics of success” for models, such as measurements of precision or accuracy, are not very relevant in this context.

So instead of measuring accuracy, we generally draw a picture to assess models, as shown in Figure 6-13, namely of the (cumulative) PnL of the model. PnL stands for Profit and Loss and is the day-over-day change (difference, not ratio), or today’s value minus yesterday’s value.

Figure 6-13. A graph of the cumulative PnLs of two theoretical models
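Producing the curve in such a plot is just a cumulative sum of the daily differences; a sketch with invented PnL values:

```python
from itertools import accumulate

# Hypothetical day-over-day PnL values (differences, not ratios).
daily_pnl = [1.0, -0.5, 2.0, 0.25]

# The curve actually drawn in a chart like Figure 6-13: cumulative PnL over time.
cumulative_pnl = list(accumulate(daily_pnl))
print(cumulative_pnl)   # -> [1.0, 0.5, 2.5, 2.75]
```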


Page 20: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

Adding priors

• A prior: an opinion formulated mathematically and folded into the model

• Example: giving reduced weight to older data
- Newer data matters more than older data

• Priors reduce degrees of freedom

Page 21: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

A baby model

• Compute the autocorrelation of the time series plot

• Penalty function: add a term to the function being minimized
- It measures how well the model fits

• Choose the exponential downweighting (decay) term
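The autocorrelation computation named in the first bullet, shifting the series by a lag and correlating it with itself, can be sketched in plain Python (the series below is a toy example):

```python
def autocorrelation(series, lag):
    """Correlation between the series and a copy of itself shifted by `lag` steps."""
    x, y = series[:-lag], series[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A steadily trending toy series has lag-1 autocorrelation of essentially 1.
series = [float(t) for t in range(50)]
print(round(autocorrelation(series, lag=1), 6))   # -> 1.0
```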

y = F_t = α_0 + α_1 F_{t−1} + α_2 F_{t−2} + ε

which is just the example where we take the last two values of the time series F to predict the next one. We could use more than two values, of course. If we used lots of lagged values, then we could strengthen our prior in order to make up for the fact that we’ve introduced so many degrees of freedom. In effect, priors reduce degrees of freedom.

The way we’d place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to our covariance matrix when we perform linear regression. See more about this here.

A Baby Model

Say we drew a plot in a time series and found that we have strong but fading autocorrelation up to the first 40 lags or so, as shown in Figure 6-14.

Figure 6-14. Looking at auto-correlation out to 100 lags

We can calculate autocorrelation when we have time series data. We create a second time series that is the same vector of data shifted by a day (or some fixed time period), and then calculate the correlation between the two vectors.

If we want to predict the next value, we’d want to use the signal that already exists just by knowing the last 40 values. On the other hand, we don’t want to do a linear regression with 40 coefficients because that would be way too many degrees of freedom. It’s a perfect place for a prior.

A good way to think about priors is by adding a term to the function we are seeking to minimize, which measures the extent to which we have a good fit. This is called the “penalty function,” and when we have no prior at all, it’s simply the sum of the squares of the error:

F(β) = Σ_i (y_i − x_i β)² = (y − xβ)^τ (y − xβ)

If we want to minimize F, which we do, then we take its derivative with respect to the vector of coefficients β, set it equal to zero, and solve for β; there’s a unique solution, namely:

β = (x^τ x)^{−1} x^τ y

If we now add a standard prior in the form of a penalty term for large coefficients, then we have:

F_1(β) = (1/N) Σ_i (y_i − x_i β)² + Σ_j λ² β_j² = (1/N) (y − xβ)^τ (y − xβ) + (λIβ)^τ (λIβ)

This can also be solved using calculus, and we solve for β to get:

β_1 = (x^τ x + N·λ² I)^{−1} x^τ y

In other words, adding the penalty term for large coefficients translates into adding a scalar multiple of the identity matrix to the covariance matrix in the closed-form solution for β.

If we now want to add another penalty term that represents a “coefficients vary smoothly” prior, we can think of this as requiring that adjacent coefficients should not be too different from each other, which can be expressed in the following penalty function with a new parameter µ:

Page 22: Doing Data Science - Ch. 6: Time Stamps and Financial Modeling

• With no prior: the sum of squared errors

• Adding the standard prior: a scalar multiple of the identity matrix

• Adding a prior: the coefficients vary smoothly

F_2(β) = (1/N) Σ_i (y_i − x_i β)² + Σ_j λ² β_j² + Σ_j µ² (β_j − β_{j+1})²
       = (1/N) (y − xβ)^τ (y − xβ) + λ² β^τ β + µ² (Iβ − Mβ)^τ (Iβ − Mβ)

where M is the matrix that contains zeros everywhere except on the lower off-diagonal, where it contains 1’s. Then Mβ is the vector that results from shifting the coefficients of β by one and replacing the last coefficient by 0. The matrix M is called a shift operator, and the difference I − M can be thought of as a discrete derivative operator (see here for more information on discrete calculus).

Because this is the most complicated version, let’s look at this in detail. Remembering our vector calculus, the derivative of the scalar function F_2(β) with respect to the vector β is a vector, and it satisfies a bunch of the properties that hold at the scalar level, including the fact that it’s both additive and linear, and that:

∂(u^τ · u)/∂β = 2 (∂u^τ/∂β) u

Putting the preceding rules to use, we have:

∂F_2(β)/∂β = (1/N) ∂[(y − xβ)^τ (y − xβ)]/∂β + λ² ∂(β^τ β)/∂β + µ² ∂[((I − M)β)^τ (I − M)β]/∂β
           = −(2/N) x^τ (y − xβ) + 2λ² β + 2µ² (I − M)^τ (I − M) β

Setting this to 0 and solving for β gives us:

β_2 = (x^τ x + N·λ² I + N·µ² (I − M)^τ (I − M))^{−1} x^τ y

In other words, we have yet another matrix added to our covariance matrix, which expresses our prior that coefficients vary smoothly. Note that the symmetric matrix (I − M)^τ (I − M) has −1’s along its sub- and super-diagonal, but also has 2’s along its diagonal. In other words, we need to adjust our λ as we adjust our µ because there is an interaction between these terms.
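As a sketch of this closed form with both penalty terms, here it is run on synthetic data with NumPy (the data and the λ and µ values are all invented for illustration, not tuned):

```python
import numpy as np

# beta_2 = (x^T x + N*lam^2*I + N*mu^2*(I - M)^T (I - M))^{-1} x^T y
rng = np.random.default_rng(0)
N, p = 200, 5
x = rng.normal(size=(N, p))
true_beta = np.array([1.0, 0.9, 0.8, 0.7, 0.6])   # smoothly varying coefficients
y = x @ true_beta + 0.1 * rng.normal(size=N)

lam, mu = 0.01, 0.1                # illustrative penalty strengths
I = np.eye(p)
M = np.diag(np.ones(p - 1), k=-1)  # shift operator: 1's on the lower off-diagonal
D = I - M                          # a discrete derivative operator

A = x.T @ x + N * lam**2 * I + N * mu**2 * (D.T @ D)
beta_2 = np.linalg.solve(A, x.T @ y)
print(np.round(beta_2, 2))
```

With penalties this mild the estimate stays close to the smoothly varying true coefficients; raising µ pulls adjacent coefficients closer together, which is the prior doing its work.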
