
Supercharging Crowd Dynamics Estimation in Disasters via Spatio-Temporal Deep Neural Network

Fang-Zhou Jiang∗, Lei Zhong†, Kanchana Thilakarathna∗, Aruna Seneviratne∗, Kiyoshi Takano‡, Shigeki Yamada† and Yusheng Ji†

∗Data61, CSIRO & University of New South Wales, Australia, {firstname.lastname}@data61.csiro.au
†National Institute of Informatics, Japan, {zhong, shigeki, kei}@nii.ac.jp

‡University of Tokyo, Japan, eri.u-tokyo.ac.jp

Abstract—Accurate estimation of crowd dynamics is difficult, especially when it comes to fine-grained spatial and temporal predictions. A deep understanding of these fine-grained dynamics is crucial during a major disaster, as it guides efficient disaster management. However, it is particularly challenging as these fine-grained dynamics are mainly caused by high-dimensional individual movement and evacuation. Furthermore, abnormal user behavior during disasters makes the problem of accurate prediction even more acute. Traditional models have difficulties in dealing with the high-dimensional patterns caused by disruptive events. For example, the 2016 Kumamoto earthquakes significantly disrupted normal crowd dynamics patterns in the affected regions. We first perform a thorough analysis of a crowd population distribution dataset collected during the Kumamoto earthquakes by a major mobile network operator in Japan, which shows strong fine-grained temporal autocorrelation and spatial correlation among geographically neighboring grids. It is also demonstrated that temporal autocorrelation during disasters involves more than simple diurnal patterns. Moreover, there are many factors that could potentially influence spatial correlations and affect the dynamics patterns. Then, we illustrate how a spatio-temporal Long Short-Term Memory (LSTM) deep neural network can be applied to boost prediction power. It is shown that the error in terms of Mean Square Error (MSE) is reduced by as much as 55.1-69.4% compared to regressive models such as AR, ARIMA and SVR. Furthermore, LSTM significantly outperforms the aforementioned models even when little training data is available right after the mainshock. Finally, we also show that a Region-aware LSTM does not necessarily outperform a regular LSTM.

Index Terms—Data Mining; Data-Driven Modeling; Disaster Management; Spatio-Temporal Dynamics; Deep Neural Network.

I. INTRODUCTION

Since the 2011 Great East Japan Earthquake, achieving a better understanding of urban crowd dynamics during disasters has received significant attention from both governments and research communities. Crowd evacuation behaviors heavily influence disaster management and relief, where careful pre-disaster planning and effective post-disaster evacuation guidance are key to reducing casualties and chaos. These behaviors are reflected in the highly dynamic crowd population distributions that are observed, while our current understanding of them is extremely limited. For example, it has been shown that many reported evacuation spots in the 2016 Kumamoto earthquake, e.g. parking areas and shopping malls, were not officially planned and recognized by the administrative organizations as evacuation shelters, which hampered the ability to provide food and supplies efficiently [1].

Similar to other phenomena caused by human behavior, such as video consumption patterns [2], crowd dynamics exhibits spatial and temporal patterns (i.e. diurnal/weekly patterns) [3], [4], [5], [6]. Thus, prediction of future crowd dynamics can be achieved with relatively high accuracy in normal times. However, it becomes much more difficult when abnormal events take place, especially during major disasters. The abnormality, along with the high-dimensional individual movement caused by large-scale evacuation, makes the task highly challenging.

A lot has been done to improve our current understanding of crowd dynamics during major disasters. For example, Song et al. [7] developed a probabilistic model to simulate and predict human evacuations over complex geographic areas in Tokyo. However, it focuses on modeling and estimation of individual mobility patterns and does not accurately reflect the crowd distribution dynamics both spatially and temporally. In addition, similar to most existing works (e.g. [1], [8]), population densities are calculated from GPS data samples collected from users' mobile phones, which do not accurately reflect the behavior of the majority of the crowd. Sekimoto et al. [9] proposed a real-time population movement estimation system for large-scale disasters based on mobile phone data and data integration techniques. Again, the accuracy of the system suffers as the sample size is small and the adopted framework is simplistic.

In this paper, we first perform a thorough analysis of a population distribution dataset collected during the Kumamoto earthquakes, and show that there exists a strong spatio-temporal correlation of crowd population dynamics among co-located grids. After that, we study and analyze factors that influence spatial grid correlations (i.e. physical distance and land usage), which could potentially aid our understanding of crowd population dynamics. Deep learning has been successfully applied to many fields such as image/speech/text recognition and medical diagnostics [10], [11]. Therefore, to overcome the challenges associated with abnormal human behaviors, we adopt a deep neural network in an attempt to discover these high-dimensional patterns both spatially and temporally. We propose to treat the crowd population density in each grid as pixels in an image and use deep learning techniques to understand the complex high-dimensional patterns. Specifically, we apply the Long Short-Term Memory model (LSTM) [12] to accurately predict crowd population distributions.

The performance of LSTM is evaluated and compared with several baseline regressive models, namely AR, ARIMA and SVR. To the best of our knowledge, our paper is the first that applies deep neural networks to improve urban crowd dynamics estimation and demonstrates this with a large-scale real-world dataset collected in a period that is highly dynamic and disruptive. Specifically, this paper makes the following contributions:

• We analyze a unique large-scale real-world crowd dynamics dataset during a natural disaster, collected by one of the largest mobile operators in Japan, and illustrate the high-dimensional patterns of crowd population dynamics both spatially and temporally.

• We propose to treat the crowd dynamics map as a series of images, and leverage an LSTM based deep neural network model for predicting crowd dynamics in both the short and long term.

• We demonstrate and compare the performance of LSTM with multiple popular regressive models, and show an improvement in estimation error of up to 69.4%.

• Lastly, we illustrate that LSTM outperforms in highly dynamic scenarios (i.e. disasters) even with very little training data, while attempts at adding intelligence to the deep neural network do not necessarily work better than a regular deep neural network.

The rest of the paper is organized as follows. We introduce the dataset in Section II and explore spatio-temporal correlations and patterns of crowd dynamics in Section III. Section IV describes spatio-temporal regressive models and a spatio-temporal LSTM based deep neural network. We evaluate the performance of these models in Section V. Section VI discusses the work and provides directions for future work. Finally, Section VII concludes the paper.

II. DATASET DESCRIPTION

We use a population distribution dataset derived from user mobility data collected by NTT DoCoMo1, a major mobile network operator in Japan. The crowd population distribution data is collected continuously from the operational mobile network. There are three main steps in the collection process: i) the operator aggregates the number of mobile users in each of the base station areas (or cells); ii) the population within each cell is estimated from information such as market share; iii) the per-cell populations obtained in the previous step are re-aggregated into geographical grids. To preserve users' privacy, if the population in a grid is too small, the grid is not included in the final dataset. Due to the high penetration of mobile devices in Japan, the estimated population distribution is considered much more accurate than any traditional method of population survey [13]. The downtown area of Kumamoto city (Chuo and Higashi wards), where the data was collected, is split into approximately 350 geographical grids of 500 × 500 m, and no grid is dropped due to the privacy rule. The data covers hourly snapshots of the 6 days before and after the Kumamoto earthquake, from the 14th to the 19th of April 2016.

1www.nttdocomo.co.jp/english/

Fig. 1. Aggregated Population (hourly, foreshock and mainshock annotated)


Land usage information for each grid is tagged using data from the national land numerical information website, which is provided by the Ministry of Land, Infrastructure, Transport and Tourism of Japan2. The data is mapped to a 100 × 100 m mesh grid, which has a finer granularity than the population data. Land usage is categorized into 17 types such as field, forest, river, and low/high-rise residential buildings. We annotated each geographical grid with multiple land usage tags since the land usage data is of finer granularity.
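As an illustration only (the helper name and data layout below are ours, not from the paper), the 25 mesh tags inside each 500 m grid could be reduced to tag counts plus a most frequent ("top") category, which is how grids are later assigned a single category in Section III-D:

```python
from collections import Counter

def annotate_grid(mesh_tags):
    """mesh_tags: the 25 land-use category codes of the 100 m cells inside one 500 m grid.

    Returns all tag counts and the most frequent ("top") category of the grid.
    """
    counts = Counter(mesh_tags)
    top_category, _ = counts.most_common(1)[0]
    return {"tags": counts, "top_category": top_category}

# Example: a grid dominated by low-rise buildings (code 703)
print(annotate_grid([703] * 18 + [701] * 4 + [901] * 3))
```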

III. DATA ANALYSIS

A. Aggregated Dynamics

The Kumamoto earthquakes consist of a magnitude 6.2 foreshock at 21:26 on the 14th of April and a magnitude 7.0 mainshock at 01:25 on the 16th of April 2016. Fig. 1 shows the hourly change in overall population from the 14th to the 19th of April in the downtown region of Kumamoto city, with the times of the foreshock and mainshock annotated. As expected, the diurnal population dynamics pattern is disrupted after the foreshock, and both the peak and average population in the following day drop slightly, by approximately 5-10%. In contrast, after the mainshock there is a significant population drop of at least 20%. Moreover, this trend of population outflow continued for two days following the mainshock, which could be caused by both disaster evacuation and a weekend effect. As workdays begin from the 96th hour, the diurnal pattern starts recovering to a certain extent.

B. Disruption of “Normal” Pattern

Coordinate $(i, j)$ is used to uniquely identify each grid. We denote the population of grid $(i, j)$ at time slot $t$ as $p^t_{i,j}$, and $P = \{p^t_{i,j}\}, \forall i, j, t$. In addition, $p_{i,j} = \{p^t_{i,j}\}, \forall t$, represents the full temporal dynamics of grid $(i, j)$.

We illustrate the spatial distribution of grid population density on an hourly basis in Fig. 2. At each hour, the box represents the first (Q1) and third (Q3) quartiles, with the red line indicating the median (second quartile). The Inter-Quartile Range (IQR) is defined as the distance between the first and third quartiles.

2http://www.mlit.go.jp/en/index.html

Fig. 2. Distribution of Population by Hour

Fig. 3. Distribution of anomaly value K (percentage of anomalies per hour for 3 > |K| ≥ 2 and |K| ≥ 3)

Points that are outside of Q3 + 1.5 IQR or Q1 - 1.5 IQR are marked as "+". A sparse distribution with a diurnal pattern is observed before the foreshock, while outliers drop dramatically before the mainshock and are almost non-existent in the days after the mainshock, due to the effects of the upcoming weekend and the disaster evacuation in the following two days. The diurnal pattern and sparsity start recovering past the 100th hour, with both the return of weekdays and the end of the disruptive events.

Similar to [1], we define the anomaly value $K^t_{i,j}$ using the grid's average population across the period, $\bar{p}_{i,j}$, and its standard deviation, $\sigma_{p_{i,j}}$:

$$K^t_{i,j} = \frac{p^t_{i,j} - \bar{p}_{i,j}}{\sigma_{p_{i,j}}} \quad (1)$$

$|K| \geq 2$ represents values outside the interval $\mu \pm 2\sigma$ (beyond 95% confidence, assuming $K$ is normally distributed), and $|K| \geq 3$ indicates values outside the interval $\mu \pm 3\sigma$ (99.7% confidence).
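As a minimal illustration (assuming the hourly populations are held in a NumPy array of shape (T, n_grids); the helper names are ours), Eq. (1) and the per-hour anomaly percentages plotted in Fig. 3 could be computed as follows:

```python
import numpy as np

def anomaly_values(pop):
    """pop: array of shape (T, n_grids) with hourly populations per grid.

    Returns K of the same shape, following Eq. (1).
    """
    mean = pop.mean(axis=0)          # per-grid average over the whole period
    std = pop.std(axis=0)            # per-grid standard deviation
    return (pop - mean) / std

def anomaly_percentage(pop, threshold=2):
    """Percentage of grids per hour whose |K| exceeds the given threshold."""
    K = anomaly_values(pop)
    return 100.0 * (np.abs(K) >= threshold).mean(axis=1)
```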

It can be seen in Fig. 3 that the percentage of anomalies (95% confidence) reaches approximately 30% in the first 20 hours. This can be explained by the fact that the majority of our data (∼85%) is recorded post-disaster; therefore, "normal"-time data behaves differently and is regarded as "abnormal". Furthermore, the percentage of anomalies also displays a periodic pattern, which motivates us to look into finer-granularity spatio-temporal dynamics.

Fig. 4 compares the population distribution one hour before and after, as well as four hours before and after, both the foreshock and the mainshock. Overall, the population density drops after the foreshock as well as after the mainshock, which is also reflected in Fig. 1. Moreover, the population distribution pattern is affected not only by the diurnal pattern but also by individual movement and evacuation patterns, as the two shocks did not happen at the same time of day. It is challenging to understand how the two shocks of different magnitude affect population distribution dynamics on top of the diurnal pattern, especially when it comes to fine-grained spatio-temporal population distribution dynamics. Next, we attempt to dive deeper to understand the finer-grained spatio-temporal crowd population dynamics during a major disaster.

C. Spatio-Temporal Dynamics

1) Rate of Change in Individual Grid: Aggregated dynamics do not tell the story of fine-grained spatial and temporal fluctuations caused by individual movements and evacuations in the affected region. We define the rate of change of grid $(i, j)$ at time $t$, $C^t_{i,j}$, as the percentage change of population compared to the previous observation: $C^t_{i,j} = \frac{p^t_{i,j} - p^{t-1}_{i,j}}{p^{t-1}_{i,j}}$.
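A one-line sketch of this quantity, again assuming the (T, n_grids) population array introduced above:

```python
def rate_of_change(pop):
    """Percentage change of each grid's population relative to the previous hour."""
    return 100.0 * (pop[1:] - pop[:-1]) / pop[:-1]
```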

In Fig. 5, we show the dynamics of population as well as its changes in fine-granularity grids, where grids with diverse dynamics characteristics can be easily identified. Specifically, Fig. 5a depicts the population of the grids sorted by grid ID, which gives a general idea of grid popularity/density. Fig. 5b then shows the rates of change of the same set of grids. Firstly, some grids display a quick-inflow, slow-outflow pattern, while the reverse can also be seen in other grids. Moreover, there exist grids whose population dynamics are relatively stable regardless of time, without significant changes. Additionally, the density of a particular grid does not directly determine its dynamics. Therefore, further investigation is warranted to better understand the dynamics.

In the following spatial and temporal analysis, we normalize the population data in each grid into the range between 0 and 1 with the tanh estimator, which has been shown to be robust and efficient for normalizing time series data [14]:

$$p^t_{i,j} = 0.5\left(\tanh\left(\frac{0.01\,(p^t_{i,j} - \bar{p}_{i,j})}{\sigma_{p_{i,j}}}\right) + 1\right) \quad (2)$$

where $\bar{p}_{i,j}$ and $\sigma_{p_{i,j}}$ are the mean and standard deviation of $p_{i,j}$.
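A small sketch of this normalization (Eq. (2)), using the same array layout as the earlier helpers:

```python
import numpy as np

def tanh_normalize(pop):
    """Tanh-estimator normalization of each grid's time series into (0, 1)."""
    mean = pop.mean(axis=0)
    std = pop.std(axis=0)
    return 0.5 * (np.tanh(0.01 * (pop - mean) / std) + 1.0)
```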

2) Strong Fine-grained Temporal Autocorrelation: The temporal AutoCorrelation Function (ACF) [15] is a widely used method for discovering temporal data dependency. The sample ACF at grid (i, j) with a time lag h is defined as follows:

$$\mathrm{corr}(h) = \frac{\sum_{t=1}^{T-|h|} (p^{t+|h|}_{i,j} - \bar{p}_{i,j})(p^t_{i,j} - \bar{p}_{i,j})}{\sum_{t=1}^{T} (p^t_{i,j} - \bar{p}_{i,j})^2}, \quad -T < h < T \quad (3)$$

where $T$ is the total number of hours and $\bar{p}_{i,j}$ is the mean population at grid $(i, j)$ during the time interval. $\mathrm{corr}(h) = 1$ indicates a total positive autocorrelation, while $\mathrm{corr}(h) = 0$ indicates no autocorrelation. A temporal ACF plot usually depicts a repeating pattern if there exists a reasonable amount of temporal autocorrelation.

Fig. 4. Spatial Population Distribution: (a) 4hr before foreshock (5pm), (b) right before foreshock (9pm), (c) right after foreshock (10pm), (d) 4hr after foreshock (2am), (e) 4hr before mainshock (9pm), (f) right before mainshock (1am), (g) right after mainshock (2am), (h) 4hr after mainshock (6am)

Fig. 5. Heatmap: (a) Population, (b) Rate of Change (%)

Fig. 6. Temporal autocorrelation pattern

In Fig. 6, the foreshock and mainshock events both trigger a drop in autocorrelation. However, the autocorrelation oscillates with a period of approximately 24 hours, whose amplitude reflects the strength of the diurnal pattern. In addition, the red error bars represent one standard deviation of the change among all grids. Therefore, a strong fine-grained temporal autocorrelation is observed.
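A sketch of how the per-grid sample ACF of Eq. (3) might be computed (our own helper; a library routine such as statsmodels' acf could be used instead):

```python
import numpy as np

def sample_acf(series, max_lag=144):
    """Sample autocorrelation of one grid's (normalized) population series, Eq. (3)."""
    x = series - series.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[h:] * x[:len(x) - h]) / denom
                     for h in range(max_lag)])
```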

3) Strong Fine-grained Spatial Correlation: Moving on to spatial relationships, Pearson's product-moment coefficient is widely used to examine spatial correlation [16]. The correlation coefficient $\rho_{g,g'}$ between two grids $g(i, j)$ and $g'(i', j')$ is given by:

$$\rho_{g,g'} = \frac{\mathrm{cov}(g, g')}{\sigma_g \sigma_{g'}} = \frac{E[(g - \mu_g)(g' - \mu_{g'})]}{\sigma_g \sigma_{g'}} \quad (4)$$

where $E$ is the expected value operator and $\mathrm{cov}$ denotes covariance. We present the grid-grid correlation matrix as a heatmap in Fig. 7. The matrix is symmetric, and the bottom half of the matrix is used for interpretation. Overall, more than 21% of the grid pairs have an absolute correlation greater than 0.5, which is moderately strong. Furthermore, some positively correlated clusters can be discovered because the arrangement of grid IDs is closely related to their distance. Hence, spatial correlation is also highly location-dependent. Because crowd dynamics could be heavily influenced by geolocations and their inter-correlations, we then attempt to discover causes of spatial correlation by analyzing spatial characteristics, i.e. land usage and distance.
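For illustration, the grid-grid correlation matrix of Fig. 7 could be obtained directly from the normalized population array (one column per grid):

```python
import numpy as np

def spatial_correlation(pop_norm):
    """Pearson correlation between every pair of grid time series (Eq. (4)).

    pop_norm: array of shape (T, n_grids); returns an (n_grids, n_grids) matrix.
    """
    return np.corrcoef(pop_norm, rowvar=False)
```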

D. Factors Influencing Spatial Correlation

1) Distance: We study whether physical distance influences spatial correlation. Intuitively, regardless of other factors, locations further away from each other behave more differently, while closer locations are more correlated and similar.

Fig. 7. Spatial Correlation with neighboring grids

Fig. 8. Grid Distance Matrix

We test this hypothesis by computing Pearson's coefficient between the long-term spatial correlation and the distance between grids. A value of -0.216 suggests a moderate negative relationship, indicating that grids further away are less correlated regardless of time. Next, we examine land usage to better understand the spatial details.
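A hedged sketch of this check, assuming a pairwise distance matrix (Fig. 8) aligned with the correlation matrix from the previous sketch:

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_vs_distance(corr_matrix, dist_matrix):
    """Pearson coefficient between pairwise spatial correlation and grid distance."""
    iu = np.triu_indices_from(corr_matrix, k=1)   # unique grid pairs only
    r, _ = pearsonr(corr_matrix[iu], dist_matrix[iu])
    return r
```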

2) Land Usage: We categorize the land usage of each grid using data collected by the Ministry of Land, Infrastructure, Transport and Tourism described in Section II. As the land usage dataset uses a grid size of 100 × 100 m, each 500 × 500 m grid is tagged 25 times. Table I lists the classification of some of the top category codes; the full category code information can be found online3. We further assign a broad category to each category code, i.e. commercial or residential area. Fig. 9a shows the distribution of all tags, and Fig. 9b shows the distribution of the top category in each grid. The majority of the grids by far are tagged as low-rise buildings (residential areas), followed by high-rise buildings (commercial areas).

Next, we show in Fig. 10 how crowd population distribution dynamics react to the disaster for grids in different categories. We consider the category of a grid to be its most frequent tag (Fig. 9b). Commercial areas display the strongest pattern and are the most sensitive to the disaster. In addition, nature areas, e.g. rivers, lakes and parks, also depict a moderate periodic pattern.

3http://nlftp.mlit.go.jp/ksj/gml/codelist/LandUseCd-09-u.html

TABLE I
CLASSIFICATION OF TOP CATEGORY CODES

Category Code   Classification              Broad Category
100             Rice field                  Agricultural
200             Other agricultural land     Agricultural
500             Forest                      Nature
701             High-rise building          Commercial
703             Low-rise building           Residential
704             Low-rise building (dense)   Residential
901             Road                        Transport
1001            Public facility site        Facility
1100            Rivers and lakes            Nature

Fig. 9. Distribution of Land Usage: (a) All categories, (b) Top categories

Fig. 10. Dynamics by Category: (a) Commercial, (b) Residential, (c) Agricultural, (d) Nature, (e) Facility, (f) Transport

Fig. 11. Intra-Category Correlation Coefficient

These two categories of areas are the most affected by the disaster, while residential areas are generally stable with only a very slight drop in mean population per grid. The remaining three categories do not respond sensitively to the disastrous events. Furthermore, the intra-category correlation coefficient of grids in the same category is shown in Fig. 11. Again, grids in commercial areas show the strongest positive correlations within the same category.

An algorithm that models spatial and temporal dynamics, considering their correlations as well as factors such as land usage, could potentially perform well in predicting crowd density dynamics. However, there could be other factors influencing the dynamics, which are difficult to enumerate completely. Furthermore, there are issues in generalizing such a model to different locations without much prior knowledge. As a result, in the next section, we discuss potential spatio-temporal models that could be applied to solve the task.

IV. SPATIO-TEMPORAL MODEL OF CROWD DENSITY DYNAMICS FOLLOWING ABNORMAL EVENTS

A. Regressive Models

A possible solution is to apply an auto-regressive model [17] in each grid that utilizes the historical temporal information of that particular location to perform predictions. The notation AR(p) indicates an autoregressive model of order p, which can be written as follows:

$$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t \quad (5)$$

where $\varphi_1, \ldots, \varphi_p$ are the parameters of the AR model, $c$ is a constant, and $\varepsilon_t$ is a time-dependent error term. When an autoregressive model is used for temporal forecasting, the linear predictor is the optimal h-step-ahead forecast in terms of mean squared error. The reasons why auto-regression performance could be limited are: a) the fine-grained temporal data are not necessarily stationary in all locations; b) a disruptive event could break the usual regressive patterns.

The autoregressive integrated moving average (ARIMA) model is a generalized model combining an autoregressive (AR) part with a moving average (MA) part, and it is very popular in time series forecasting for non-stationary data [18]. On the other hand, Support Vector Regression (SVR) is another popular regression method based on the support vector machine (SVM), and it is also widely used as a machine learning baseline. We apply AR, ARIMA and SVR models at each fine-grained location and evaluate their performance accordingly. The regressive nature of these models allows us to capture the unique temporal dynamics at each geographical location; however, their performance could suffer in highly dynamic scenarios and when the amount of historical data is limited.
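A brief sketch of how such per-grid baselines might be fit with statsmodels and scikit-learn; the model orders and the one-lag SVR feature below are illustrative assumptions, as the paper does not specify them:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.svm import SVR

def fit_baselines(series, train_len):
    """Fit AR, ARIMA and SVR to one grid's hourly series and forecast the remainder."""
    train, horizon = series[:train_len], len(series) - train_len

    ar_pred = ARIMA(train, order=(3, 0, 0)).fit().forecast(horizon)     # AR(3)
    arima_pred = ARIMA(train, order=(3, 1, 1)).fit().forecast(horizon)  # ARIMA(3,1,1)

    # SVR with an RBF kernel, using the previous hour's value as the only feature
    svr = SVR(kernel="rbf").fit(train[:-1].reshape(-1, 1), train[1:])
    svr_pred, last = [], train[-1]
    for _ in range(horizon):
        last = svr.predict(np.array([[last]]))[0]
        svr_pred.append(last)
    return ar_pred, arima_pred, np.array(svr_pred)
```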

B. Spatial-Temporal LSTM Model

A traditional feed-forward network has limitations such as a fixed input length and the assumed independence of different training samples. Researchers developed recurrent neural networks (RNN) [19], [20] to address this problem by taking short and long term temporal dependencies into account. Fig. 12 is a simple example demonstrating how the RNN structure works, and the following transformations describe an RNN:

$$h_t = g_h(W_I x_t + W_R h_{t-1} + b_h) \quad (6)$$

$$y_t = g_y(W_y h_t + b_y) \quad (7)$$

In order to obtain the optimal weight matrices $W_I$, $W_R$ and $W_y$, an RNN is trained by combining the costs at every time step via back-propagation of chained gradients.

Fig. 12. Recurrent Neural Network

Fig. 13. Graphical representation of Long Short-term Memory Model (input, output and forget gates)

This training introduces the issue of vanishing or exploding gradients because the weight matrix $W_R$ is shared at every step [21]. The exploding gradient issue is relatively easy to deal with via techniques such as truncated back-propagation through time (BPTT), gradient clipping or RMSprop (adaptive learning). The Long Short-term Memory (LSTM) model [12] is specially designed to combat the vanishing gradient problem. Put simply, LSTM introduces an input gate, an output gate and a forget gate to manipulate (add, read and flush) memory cells $C$ [22], as illustrated in Fig. 13. The following transformations map the inputs $x_t$ to the outputs $h_t$ and $C_{t+1}$:

$$h_t = \sigma(W_o[x_t, h_{t-1}] + b_o) \odot \tanh(C_t) \quad (8)$$

$$C_{t+1} = f_t \odot C_t + i_t \odot \tilde{C}_t \quad (9)$$

where $\odot$ represents element-wise multiplication, the tanh function squashes the memory cell into the range [-1, 1], and $\sigma$ is the sigmoid function squashing its input to a range between 0 and 1. The memory cell $C_{t+1}$ at time $t+1$ is a combination of the amount to "forget" from the previous time slot, $f_t \odot C_t$, and the amount to add from the proposal of the new time slot, $i_t \odot \tilde{C}_t$. As one type of deep neural network, LSTM can learn complicated spatial and temporal dynamics when the crowd density maps at different time instances are used as inputs for training. We use the results of the regressive models as baselines and compare their performance with the LSTM based algorithm.
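To make Eqs. (8)-(9) concrete, here is a minimal NumPy sketch of a single LSTM step; the gate weight names (W, b) are ours, and in practice a framework implementation such as Keras' LSTM layer is what is actually used in the evaluation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step: gates computed from the concatenation [x_t, h_prev]."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate memory ("proposal")
    C_t = f_t * C_prev + i_t * C_tilde        # Eq. (9)
    h_t = o_t * np.tanh(C_t)                  # Eq. (8)
    return h_t, C_t
```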

Fig. 14. Auto-regression Model (test data vs. auto-regression prediction of aggregated population)

V. EVALUATION

A. System Setup

We use the AR/ARIMA implementations in statsmodels [23] and the SVR implementation in scikit-learn [24]. The LSTM model is implemented using the Keras framework with a TensorFlow backend [25]. The common metric Mean Squared Error (MSE) is used to evaluate the performance of the models, which are trained with the first 80% of the data of all grids and tested with the remaining 20% unless otherwise specified. Validation data accounts for 5% of the selected training data. All available grids are used in the training and evaluation process, and all models are optimized for MSE. Moreover, we choose an LSTM model that has 1 hidden layer with 342 hidden units (approximately 1 million parameters). The output of the LSTM layer is passed to either a linear activation function or a sigmoid activation function, and the results are compared. Finally, the RMSprop optimizer is used for adaptive learning.
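A hedged Keras sketch of this setup; the input dimension (one scaled value per grid) and the window length are our assumptions based on the description above, not values stated by the paper:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

N_GRIDS = 324   # features per time step: one scaled population value per grid (assumed)
WINDOW = 5      # history window; 1 for LSTM(1), 5 for LSTM(5)

model = Sequential()
model.add(LSTM(342, input_shape=(WINDOW, N_GRIDS)))   # single hidden layer, 342 units
model.add(Dense(N_GRIDS, activation="linear"))        # or activation="sigmoid"
model.compile(optimizer="rmsprop", loss="mse")
# model.fit(X_train, y_train, validation_split=0.05, epochs=100)
```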

B. Results

1) Regressive Models: Fig. 14 shows the estimated overall population obtained by summing up the hourly predictions of the individual grids. Overall, the AR model delivers an MSE of 22040 over all 324 grids, and the aggregated prediction does not fit the testing data very well. ARIMA (MSE: 15020) outperforms AR, largely due to the non-stationarity of the data shown in the analysis section. In addition, the SVR algorithm with an RBF kernel achieves an MSE of 36554. We use these three algorithms as baselines to evaluate the performance of the spatio-temporal LSTM.

2) Spatial-Temporal LSTM: Since LSTM is sensitive to scaling, the usual practice is to scale the data into a range between 0 and 1. The model is trained both with a history window of 1 time step (LSTM(1)) and of 5 time steps (LSTM(5)); the training and validation losses are shown in Fig. 15. Although the training loss keeps dropping over the first 100 epochs, the validation loss stays relatively stable after around 20 epochs. The MSEs for the scaled-back LSTM(1) and LSTM(5) are 11761 and 6742 respectively, which are 46.6% and 69.4% lower than the autoregressive model. In addition, increasing the length of the LSTM history window from 1 to 5 improves prediction performance by a further 43%. Table II shows the impact of switching between the linear and sigmoid activation functions applied to the LSTM output, where linear performs slightly better in both cases on our dataset.

TABLE II
MSE (SCALED MSE) FOR LSTM MODEL

Activation Function   Linear           Sigmoid
LSTM(1)               11761 (0.0203)   18604 (0.0209)
LSTM(5)               6742 (0.0149)    7006 (0.0169)
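For reference, a small sketch of how the supervised training windows behind LSTM(1) and LSTM(5) could be built from the scaled data (the helper name is ours):

```python
import numpy as np

def make_windows(series, window=5):
    """Build (X, y) samples from a scaled (T, n_grids) array.

    X has shape (T - window, window, n_grids); y is the next hour's grid map.
    """
    X = np.stack([series[t:t + window] for t in range(len(series) - window)])
    y = series[window:]
    return X, y
```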

Fig. 15. LSTM(1) and LSTM(5) Training/Validation Loss

Fig. 16. Fitting and Prediction Performance of LSTM Model (training data, model fitting, test data and prediction for LSTM(1) and LSTM(5))

Fig. 16 illustrates a comparison of the training data with the reconstructed LSTM models and the temporal prediction results. The fitting and predictions are surprisingly good, considering that the model figures out the high-dimensional temporal fluctuations within individual grids and successfully predicts the overall population changes for almost 2 days. Furthermore, we present the effectiveness of the LSTM(5) (Linear) model in the fine-grained spatial domain. Fig. 17 lists the heatmaps of three random original spatial distributions along with their reconstructed/predicted distributions. As the fitting is relatively close, we further illustrate their differences numerically in a heatmap. The first two random time instances (T1 and T2) are estimated with a per-grid error of mostly less than 100. Random T3 has good spatial estimation in the majority of the grids, with only a few relatively large errors (∼700) (note the differences in heatmap scales).

C. Post-Abnormal Event Performance

Next, we evaluate the forecasting ability of all the aforementioned algorithms when trained only with data prior to the mainshock. Performance in terms of MSE is presented in Table III. Moreover, in Fig. 18 we compare the performance in terms of normalized MSE when trained with 80% of the data and when trained only with data prior to the mainshock (the disruptive event). Comparing Fig. 18a and Fig. 18b, we find that AR shows significant performance variation, while the remaining algorithms perform relatively stably. This is mainly because AR is not suitable for non-stationary time series, especially when the changes are highly dynamic and disruptive (i.e. right after the mainshock). Furthermore, the LSTM algorithm outperforms the other baseline algorithms regardless of the amount of available training data.

Fig. 17. LSTM(5) Spatial Reconstruction and Prediction: (a)-(c) random time instances T1-T3, (d)-(f) their predictions, (g)-(i) their differences

TABLE III
MODEL PERFORMANCE RIGHT AFTER ABNORMAL EVENT

Model        MSE      Improvement upon Prev. Model (%)
AR           473206   -
ARIMA        113856   95.9
SVR          78724    30.9
LSTM(1-L)    62556    20.5
LSTM(5-L)    55925    10.6

Fig. 18. Model Performance Evaluation: (a) 80% Training Data, (b) Right After Mainshock

D. Region-Aware LSTM

Lastly, we investigate whether additional context information can potentially improve the performance of the spatio-temporal LSTM model. We propose Region-aware LSTM with the hypothesis that locations of similar types/categories are more likely to behave similarly with regard to crowd dynamics.

TABLE IV
REGION-AWARE LSTM PERFORMANCE

MSE in Diff. Categories   LSTM(1-L)   LSTM(5-L)
Residential               11139       6450
Commercial                703         291
Agricultural              11955       7878
Nature                    9318        8726
Region-Aware LSTM         11029       6805
Regular LSTM              11761       6742

Firstly, each grid location is tagged with one major category based on the land usage information collected in Section III. An independent LSTM model is trained on the locations of each category, and the overall Region-aware LSTM MSE is calculated by combining the estimations of the different categories in a weighted manner. The results of both LSTM(1) and LSTM(5) with the linear activation function are displayed in Table IV. It can be seen that the "commercial" and "nature" categories show higher regularity and lower MSE, which is consistent with their strong patterns shown in Fig. 10. Additionally, Region-aware LSTM does not outperform regular LSTM consistently: it achieves a 6.2% lower prediction error in the case of LSTM(1), but a 0.9% performance loss when LSTM(5) is used. We believe that Region-aware LSTM does not significantly outperform regular LSTM for the following possible reasons: a) LSTM already figures out the geographical category/type from the training data; b) the proposed hypothesis might not hold in all scenarios; c) the tagged category data does not completely reflect the real-world scenario, since a grid could consist of multiple types of areas.
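A hedged sketch of this variant; the per-category training loop and the grid-count weighting are our interpretation of the description above, and make_windows is the helper from the earlier sketch:

```python
def region_aware_mse(data_by_category, build_model, window=5, train_frac=0.8):
    """Train one LSTM per land-use category and combine test MSEs weighted by grid count.

    data_by_category: {category: scaled array of shape (T, n_grids_in_category)}.
    build_model: callable returning a fresh compiled Keras model for a given grid count.
    """
    total_grids, weighted_mse = 0, 0.0
    for category, series in data_by_category.items():
        X, y = make_windows(series, window)
        split = int(train_frac * len(X))
        model = build_model(series.shape[1], window)     # independent model per category
        model.fit(X[:split], y[:split], epochs=100, verbose=0)
        mse = model.evaluate(X[split:], y[split:], verbose=0)
        n_grids = series.shape[1]
        weighted_mse += mse * n_grids
        total_grids += n_grids
    return weighted_mse / total_grids
```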

VI. IMPLICATIONS AND FUTURE WORKS

A. Discussion and Implications

We collected shelter location data for the same geographical region and map the planned shelter locations in Fig. 19. Comparing these to the post-shock crowd population dynamics observed in Fig. 4, we identify mismatches between the current planning and the actual crowd dynamics, where further improvement could be achieved. As shown in the evaluation, the proposed LSTM model can achieve higher accuracy with little data. As a result, a fine-grained spatio-temporal LSTM framework could potentially be deployed right after any detected major disaster and be used to guide post-disaster evacuation planning. Thus, shelter location planning could be guided and improved by the proposed framework. Furthermore, the framework could be customized to different cities/areas and trained separately on historical disaster data without much human interaction and input.

B. Future Works

The LSTM deep neural network framework could be further extended to a hybrid deep neural network model for potential performance improvement. As explained in the previous subsection, the RNN architecture is good at temporal prediction of high-dimensional/complicated structures. That is to say, the current LSTM implementation considers the relative position of grids (i.e. Fig. 17) rather than their geographical locations (i.e. Fig. 4).

Fig. 19. Evacuation Shelter Locations

We believe that the performance could be further improved by deploying a deep neural network structure such as convolutional-LSTM [26] or RNN-RBM [27]. The convolutional (CNN) or RBM layers act as filters that can further identify complicated spatial structures, which are then passed into LSTM layers for temporal prediction. We will evaluate whether a convolutional-LSTM model is suitable for the disaster scenario in our future work.
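A hedged sketch of what the convolutional-LSTM extension could look like in Keras, assuming the population maps are arranged as (time, rows, cols, 1) image sequences; the map size, filter count and window length below are illustrative, not the paper's values:

```python
from keras.models import Sequential
from keras.layers import ConvLSTM2D, Conv2D

ROWS, COLS, WINDOW = 18, 18, 5   # illustrative grid-map size and history window

model = Sequential([
    ConvLSTM2D(32, kernel_size=(3, 3), padding="same",
               input_shape=(WINDOW, ROWS, COLS, 1)),    # spatial filters learned over time
    Conv2D(1, kernel_size=(1, 1), activation="linear"), # next-hour population map
])
model.compile(optimizer="rmsprop", loss="mse")
```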

VII. CONCLUSION

In this paper, we first performed an analysis of a crowd population distribution dataset collected during the period of the 2016 Kumamoto earthquakes. The analysis shows strong fine-grained temporal autocorrelation and spatial correlation among neighboring locations, which motivates us to explore methods that accurately estimate population dynamics in abnormal times. By adopting an LSTM based deep learning model, we can predict the crowd distribution in a specified area at a certain time with higher accuracy compared to traditional regressive models. The results demonstrate that deep learning based population distribution estimation is practical and could be applied to future disasters.

ACKNOWLEDGMENT

This research was partially supported by the Strategic International Collaborative Research Program (SICORP) of the Japan Science and Technology Agency (JST) on Big Data and Disaster Management, and by an Australian Government Research Training Program Scholarship. We thank our colleagues who provided insights and expertise that greatly assisted the research.

REFERENCES

[1] T. Yabe, K. Tsubouchi, A. Sudo, and Y. Sekimoto, "Estimating evacuation hotspots using GPS data: What happened after the large earthquakes in Kumamoto, Japan," in Proc. of the 5th Urban Computing, 2016.

[2] K. Thilakarathna, F.-Z. Jiang, S. Mrabet, M. A. Kaafar, A. Seneviratne, and G. Xie, "Crowd-cache: Leveraging on spatio-temporal correlation in content popularity for mobile networking in proximity," Computer Communications, vol. 100, pp. 104–117, 2017.

[3] P. Turchin, Complex population dynamics: a theoretical/empirical synthesis. Princeton University Press, 2003, vol. 35.

[4] J. Wang, J. Tang, Z. Xu, Y. Wang, G. Xue, X. Zhang, and D. Yang, "Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach," in INFOCOM, 2017 Proceedings IEEE, 2017, pp. 1323–1331.

[5] S. J. Guy, J. Van Den Berg, W. Liu, R. Lau, M. C. Lin, and D. Manocha, "A statistical similarity measure for aggregate crowd dynamics," ACM Transactions on Graphics (TOG), vol. 31, no. 6, p. 190, 2012.

[6] L. Zhong, K. Takano, F. Jiang, X. Wang, Y. Ji, and S. Yamada, "Spatio-temporal data-driven analysis of mobile network availability during natural disasters," in Information and Communication Technologies for Disaster Management (ICT-DM), 2016 3rd International Conference on. IEEE, 2016, pp. 1–7.

[7] X. Song, Q. Zhang, Y. Sekimoto, T. Horanont, S. Ueyama, and R. Shibasaki, "Modeling and probabilistic reasoning of population evacuation during large-scale disaster," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 1231–1239.

[8] M. Shimosaka, K. Maeda, T. Tsukiji, and K. Tsubouchi, "Forecasting urban dynamics with mobility logs by bilinear Poisson regression," in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2015, pp. 535–546.

[9] Y. Sekimoto, A. Sudo, T. Kashiyama, T. Seto, H. Hayashi, A. Asahara, H. Ishizuka, and S. Nishiyama, "Real-time people movement estimation in large disasters from several kinds of mobile phone data," in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. ACM, 2016, pp. 1426–1434.

[10] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[11] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.

[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[13] M. Terada, T. Nagata, and M. Kobayashi, "Population estimation technology for mobile spatial statistics," NTT DOCOMO Techn. J, vol. 14, pp. 10–15, 2013.

[14] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust statistics: the approach based on influence functions. John Wiley & Sons, 2011, vol. 114.

[15] P. J. Brockwell and R. A. Davis, Introduction to time series and forecasting. Springer, 2016.

[16] P. Legendre, "Spatial autocorrelation: trouble or new paradigm?" Ecology, vol. 74, no. 6, pp. 1659–1673, 1993.

[17] H. Lutkepohl, New introduction to multiple time series analysis. Springer Science & Business Media, 2005.

[18] G. P. Zhang, "Time series forecasting using a hybrid ARIMA and neural network model," Neurocomputing, vol. 50, pp. 159–175, 2003.

[19] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, 2010, p. 3.

[20] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, "DRAW: A recurrent neural network for image generation," arXiv preprint arXiv:1502.04623, 2015.

[21] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," ICML (3), vol. 28, pp. 1310–1318, 2013.

[22] Nervana, "Recurrent neural networks," https://experiencenervana.com/, 2017.

[23] J. Seabold and J. Perktold, "Statsmodels: Econometric and statistical modeling with Python," in Proceedings of the 9th Python in Science Conference, 2010.

[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[25] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.

[26] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 2015, pp. 802–810.

[27] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," arXiv:1206.6392, 2012.