Processing queries to search engine of Yandex: possibilities of analysis and forecast AINL FRUCT 2016 Boldyreva Anna (RANEPA-MIPT)


Page 1: AINL 2016: Boldyreva

Processing queries to search engine of Yandex:

possibilities of analysis and forecast

AINL FRUCT 2016

Boldyreva Anna (RANEPA-MIPT)

Page 2: AINL 2016: Boldyreva

Content

Introduction

Databases

Tasks of analysis

Tasks of forecast

Conclusion

INTRODUCTION

Page 3: AINL 2016: Boldyreva

Terminology

Search query is a request made by an Internet user to obtain information from a search engine. Statistics on search queries are provided by the search engines' own services:

https://www.google.ru/trends/

https://adwords.google.com/

http://wordstat.yandex.ru/

Descriptor is a word or phrase that forms part of the search queries entered by users.

INTRODUCTION

Page 4: AINL 2016: Boldyreva

Terminology

Indicators are economic, social, demographic and other

indicators that are analyzed or forecasted by analysts and

researchers;

Top-rated lists of descriptors are search queries that are

the most highly correlated with selected indicators;

Barometer is the mean value of the normalized dynamics

of the top-rated selection.
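
A minimal sketch of how a barometer could be computed from the definitions above. It assumes Pearson correlation for ranking descriptors and min-max normalization of each query series; the slides do not state which correlation measure or normalization is used. The function name `build_barometer` and the parameter `top_k` are hypothetical.

```python
import numpy as np


def build_barometer(queries, indicator, top_k=10):
    """Return (indices of the top-rated descriptors, barometer series).

    queries   : array of shape (n_descriptors, n_months) with query frequencies
    indicator : array of shape (n_months,) with the indicator values
    Assumptions (not from the slides): Pearson correlation is used for
    ranking, and each series is min-max normalized to [0, 1].
    Constant series are not handled (they give undefined correlation).
    """
    queries = np.asarray(queries, dtype=float)
    indicator = np.asarray(indicator, dtype=float)

    # Correlation of every descriptor series with the indicator.
    corr = np.array([np.corrcoef(q, indicator)[0, 1] for q in queries])

    # Top-rated list: descriptors with the highest positive correlation.
    top = np.argsort(corr)[::-1][:top_k]

    # Normalize the dynamics of each selected descriptor to [0, 1].
    selected = queries[top]
    lo = selected.min(axis=1, keepdims=True)
    hi = selected.max(axis=1, keepdims=True)
    normalized = (selected - lo) / (hi - lo)

    # Barometer: mean of the normalized dynamics over the selection.
    return top, normalized.mean(axis=0)
```

A barometer of negatively correlated descriptors could be built the same way by ranking on `-corr` instead of `corr`.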

INTRODUCTION

Page 5: AINL 2016: Boldyreva

Hypothesis

There is a stable statistical dependence between the intensity of

search queries and real-world events and social processes.

Fig. 1. The dynamics of the descriptor ‘swimsuit’ in

U.S.A.: peaks in February and May-June

INTRODUCTION

Page 6: AINL 2016: Boldyreva

INTRODUCTION

Relevance

We can use search queries

• for monitoring the economic situation in regions in real time, avoiding difficulties related to the lack of data;

• for parallel control of official information, which makes it possible to reveal distortions introduced by official institutions;

• for forecasting economic, demographic and social parameters during a crisis period;

• for forecasting the dynamics of various socio-economic and socio-political processes;

• for analysis of other countries, without relying on official data that is published with a delay.

Page 7: AINL 2016: Boldyreva

State of the art

2009: Google launched a service showing disease outbreak hotspots in real time, based on the intensity of queries from different regions.

2009: H. Choi and H. Varian introduced the first model predicting fluctuations in business cycles with the help of search query statistics.

2011: Z. Da, J. Engelberg and P. Gao demonstrated that analyzing the dynamics of Google searches for companies gives traders a 10% advantage.

INTRODUCTION

Page 8: AINL 2016: Boldyreva

State of the art

2011: Michael Stolbov (MGIMO University) demonstrated the feasibility of using Google search statistics to explain the dynamics of aggregated financial indicators (for example, deposits of individuals).

2013: Tobias Preis presented the work "Complex dynamics of our economic life on different scales: insights from search engine query data". The work is devoted to stock markets: he analyzed surges in searches for "Subprime", "Lehman Brothers" and "Financial Crisis", which were followed by a drop in the S&P 500 Index.

INTRODUCTION


Page 10: AINL 2016: Boldyreva

Domain-oriented databases of descriptors

• economic terms: 25000 SQ (search queries);

• legal terms: 4500 SQ;

• crime articles: 365 SQ;

• well-known brands and goods: 3013 SQ;

• emotions:

with positive tonality: 400 SQ;

with negative tonality: 400 SQ;

• slang used in finance, computers and other fields: 3300 SQ;

• medical terms: 1600 SQ.

DATABASES

Technical databases

• lemmas: 18638 SQ;

• n-grams (n = 2, 3, ..., 8) of letters and syllables: ~90000 SQ.

Page 11: AINL 2016: Boldyreva

Lemma: the initial form of a word.

Examples: avant-garde, sauna, drum, dune, velvet, bass, basketball, battalion commander, comet, compass, icon, contour, piggy, mop, cordon

n-gram

Examples: ев, ег, ед, ее, еж, ти, тк, тл, тм, тн, то, авв, авг, авп, бре, бри, бро, век, вел, вес, лак, лал, лам, лан, лао, лап, греч, декс, сдел, кром

Emotion words with positive tonality

Examples: good, great, beautiful, holiday, goodness, beauty, super, fun, cool, happy, dream, luck, well, success, joy, laugh, nice

Emotion words with negative tonality

Examples: chaos, amoral, immoral, sabotage, punishment, violation, cattle, schmuck, moron, hopeless, useless, helpless

DATABASES

Page 12: AINL 2016: Boldyreva

Barometers

Examples of words that got into the "barometer" with direct positive correlation with the indicator "Consumer Price Index":

"treat" – 0.93

"okmarket" (hypermarket) – 0.91

"pariet" (drug for ulcer) – 0.89

"patents" – 0.87

"mfbank" (commercial bank) – 0.87

"headhunter" (site to find job) – 0.86

"pediashur" (baby food) – 0.86

"convenient" – 0.86

"often" – 0.85

"close" – 0.85

DATABASES

Page 13: AINL 2016: Boldyreva

Barometers

Examples of words that got into the "barometer" with direct negative correlation with the indicator "Consumer Price Index":

"chemical" (British musical duo) – -0.92

"artofvar" (musical group of war veterans) – -0.91

"incest" – -0.87

"group" – -0.87

"babylon" (Italian clothing brand) – -0.87

"young child" – -0.86

"diprivan" (a sedative) – -0.86

"ilarauto" (van sales) – -0.86

"miss" – -0.86

DATABASES

Page 14: AINL 2016: Boldyreva

Bases of indicators

• Retail trade turnover (mln of roubles);

• Consumer Price Index;

• Entrepreneurs Price Index on industrial products;

• Entrepreneurs Price Index on minerals;

• Unemployment (thousands);

• Sales of new passenger cars and light commercial vehicles (units);

• Per capita income (thousands of roubles);

• The dollar/ruble exchange rate (USDTOM_UTS);

• Brent price (ICE.Brent), USD/barrel;

DATABASES

Page 15: AINL 2016: Boldyreva

Bases of indicators

• newborns (thousands);

• marriages (thousands);

• real activity (thousands);

• deaths (thousands);

• registered economic crimes

DATABASES

Page 16: AINL 2016: Boldyreva

Programs

1. A program for collecting the dynamics of search queries from the Yandex statistics service;

2. A program for automatic processing of the files and formation of an Excel spreadsheet;

3. A program for automatic processing of the tables and selection of the top search queries.

DATABASES


Page 18: AINL 2016: Boldyreva

Distribution of positively correlated search queries by their correlation with the indicator "Retail turnover"

Statistics of queries by regions

ANALYSIS

Page 19: AINL 2016: Boldyreva

The values of the correlation coefficients are plotted on the vertical axis. The number of positive descriptors with the corresponding level of correlation with the indicator "Turnover of retail trade" is plotted on the horizontal axis.

ANALYSIS

Statistics of queries by regions

Page 20: AINL 2016: Boldyreva

Query statistics on domain-oriented databases

Example: distribution of queries from the database "Brands and products"

relative to the indicators, with which there is a high level of correlation.

Observation: newlyweds are buying more than young parents

ANALYSIS

Page 21: AINL 2016: Boldyreva

Example: distribution of queries with a high level of correlation with the indicator "Sales of new cars" across thematic databases.

Observation: the active usage of slang, a variety of products/services

Consumer profiling

ANALYSIS

Page 22: AINL 2016: Boldyreva

Excess frequency of search queries from the database of medical terms in Leningrad Region compared to data for Russia (data for Russia are taken as 100%):

diaphragm spasm +21%

loss of taste +108%

doctor on duty +105%

full pulse +91%

cough with yellow sputum +73%

neck numbness +70%

insensitivity +70%

night sweats +69%

sterile bandages +67%

abstinent +53%

hospital on duty +55%

vertigo +54%

bitter taste in the mouth +52%

pharmacy phone number +46%

emergency room +46%

manic phase +45%

Naphthyzin +43%

bleeding from the ears +38%

call a doctor +38%

Efferalgan +37%

buy medicine +36%

wheezing +34%

Increased mortality in Leningrad Region

ANALYSIS


Page 24: AINL 2016: Boldyreva

Group method of data handling (GMDH)

selects the model of optimal complexity within a given class of models to describe the current set of experimental data

Polynomial class of models:

where x = {xi | i = 1, … , m} is a set of indicators

and w = (wi , wij, wijk, … | i, j, k = 1, … , m) is a weight vector.
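
The formula itself is missing from the extracted slide. The polynomial class normally used with GMDH is the Kolmogorov-Gabor polynomial, which matches the weight vector defined above; the free term $w_0$ is an assumption here, since the slide lists only $w_i$, $w_{ij}$, $w_{ijk}$:

```latex
y(\mathbf{x}, \mathbf{w}) = w_0
  + \sum_{i=1}^{m} w_i x_i
  + \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij} x_i x_j
  + \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} w_{ijk} x_i x_j x_k
  + \dots
```

A linear model, such as the ones reported later in the deck, is the special case in which only $w_0$ and the first-order weights $w_i$ are non-zero.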

FORECAST

Page 25: AINL 2016: Boldyreva

GMDH Shell implements GMDH

Possibilities:

• Approximation

• Extrapolation

• Classification

http://www.gmdhshell.com

Lead developer: Candidate of Technical Sciences A. A. Koshulko

Program GMDH Shell

FORECAST

Page 26: AINL 2016: Boldyreva

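1st criterion: MAPE (mean absolute percentage error):

$MAPE = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \cdot 100\%$,

where $N$ is the sample size, $y_t$ is the real value for $t$, and $\hat{y}_t$ is the estimated value for $t$;

2nd criterion: P (one-month step forward forecast error):

$P = \frac{y_{N+1} - \hat{y}_{N+1}}{y_{N+1}} \cdot 100\%$.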
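
The two criteria above can be computed as in the following sketch. It assumes the signed error P is taken as real value minus estimated value, divided by the real value; the function names `mape` and `one_step_error` are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over the sample, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((y - f) / y) for y, f in zip(actual, predicted))
    return 100.0 * total / len(actual)


def one_step_error(actual_next, predicted_next):
    """Signed one-month-ahead forecast error P, in percent.

    Assumption: the sign convention is (real - estimated) / real.
    """
    return 100.0 * (actual_next - predicted_next) / actual_next
```

MAPE summarizes the fit over the whole sample, while P is signed and describes a single out-of-sample month, so P can be negative and can be much larger in magnitude than MAPE.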

Error evaluation

FORECAST

Page 27: AINL 2016: Boldyreva

Observations are pseudo-randomly shuffled;

Validation method is two-fold cross-validation;

Internal criterion is OLS (ordinary least squares);

External criterion is RMSE (root mean squared error) with a penalty in the form of the difference between the RMSE values on the training and examination parts of the sample;

Neuron function is linear;

The maximum number of layers is 6;

The initial layer width is 5.
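
A minimal sketch of an external criterion of the kind described above. The slides do not give the exact formula, so the form used here (examination RMSE plus the absolute difference between examination and training RMSE) is an assumption, as are the names `external_criterion` and `penalty_weight`; it is not taken from GMDH Shell.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long series."""
    squared = [(y - f) ** 2 for y, f in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def external_criterion(train_actual, train_pred, exam_actual, exam_pred,
                       penalty_weight=1.0):
    """Examination RMSE penalized by its gap to the training RMSE.

    Assumed form: RMSE_exam + penalty_weight * |RMSE_exam - RMSE_train|.
    A model that fits the training part much better than the
    examination part is penalized as likely overfitted.
    """
    rmse_train = rmse(train_actual, train_pred)
    rmse_exam = rmse(exam_actual, exam_pred)
    return rmse_exam + penalty_weight * abs(rmse_exam - rmse_train)
```

Among candidate models, the one with the smallest criterion value would be preferred.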

Forecast settings

FORECAST

Page 28: AINL 2016: Boldyreva

Neural algorithm with linear barometers

MAPE = 1.0%,

One-month forward forecast error P=1.8%

Forecast of retail turnover

FORECAST

Page 29: AINL 2016: Boldyreva

Y1[t] = 64.4813 + Cm3m_pol[t-3]*40.7607 + N2*0.966131
N2[t] = -599.916 - Cm1m_pol[t-1]*410.022 + N3*1.3329
N3[t] = -45.4194 + N12*0.261196 + N4*0.759475
N4[t] = -99.6924 + ORT_PK_otr[t-2]*118.488 + N6*1.02163
N6[t] = 5.48719 - ORT_PK_pol[t-1]*181.667 + N10*1.0246
N10[t] = 1926.58 + Cm1m_pol[t-1]*1209.97 - Cm3m_otr[t-1]*241.064
N12[t] = 2327.24 + Cm3m_pol[t-3]*685.968 - Cm3m_otr[t-3]*581.729
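
The layered model above can be evaluated bottom-up, from the innermost neurons (N12, N10) to the output Y1. The sketch below copies the coefficients from the slide; the function name `retail_turnover_model` and the dictionary-of-lists input format are assumptions made for illustration.

```python
def retail_turnover_model(b, t):
    """Evaluate the seven equations of the slide at month index t.

    b maps a barometer name (for example "Cm3m_pol") to a list of
    monthly values; t must be at least 3 because lags up to t-3 are used.
    """
    if t < 3:
        raise ValueError("t must be >= 3 to cover the largest lag")
    n12 = (2327.24 + b["Cm3m_pol"][t - 3] * 685.968
           - b["Cm3m_otr"][t - 3] * 581.729)
    n10 = (1926.58 + b["Cm1m_pol"][t - 1] * 1209.97
           - b["Cm3m_otr"][t - 1] * 241.064)
    n6 = 5.48719 - b["ORT_PK_pol"][t - 1] * 181.667 + n10 * 1.0246
    n4 = -99.6924 + b["ORT_PK_otr"][t - 2] * 118.488 + n6 * 1.02163
    n3 = -45.4194 + n12 * 0.261196 + n4 * 0.759475
    n2 = -599.916 - b["Cm1m_pol"][t - 1] * 410.022 + n3 * 1.3329
    return 64.4813 + b["Cm3m_pol"][t - 3] * 40.7607 + n2 * 0.966131
```

Because every neuron is linear, the whole network collapses to a single linear combination of lagged barometers; the layered form only records how GMDH assembled it.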

FORECAST

Forecast of retail turnover

Page 30: AINL 2016: Boldyreva

Neural algorithm with barometers in square-root form

MAPE = 1.4%

One-month forward forecast error P=-2%.

Forecast of USD/ruble exchange rate

FORECAST

Page 31: AINL 2016: Boldyreva

Y1[t] = -2.49737e-10 + N2*1
N2[t] = -769.864 + sqrt(Cm2_otr[t-2])*522.473 + N3*1.09103
N3[t] = 2267.23 - sqrt(Cm3_otr[t-3])*1563.91 + N6*0.738373
N6[t] = -5926.55 + sqrt($_PK_otr[t-2])*4826.58 + N10*1.48421
N10[t] = 8962.17 - sqrt(Cm2_otr[t-2])*3666.07 - sqrt(Cm3_otr[t-4])*2607.62

FORECAST

Forecast of USD/ruble exchange rate

Page 32: AINL 2016: Boldyreva

Combinatorial algorithm with linear variables

MAPE = 4.5%

One-month forward forecast error P=-2.9%.

Forecast of economic crimes

FORECAST

Page 33: AINL 2016: Boldyreva

Y[t] = 5368.54 + Cm2_pol[t-1]*8610.05 + Cm2_pol[t-2]*4452.71 - Cm2_otr[t-2]*11350 + Cm3_pol[t-2]*11285.4

Forecast of economic crimes

FORECAST

Page 34: AINL 2016: Boldyreva

Comparison of algorithms. MAPE

MAPE                                                       Neuro,        Neuro, with   Combi,        Combi, with
                                                           no roots      square roots  no roots      square roots
Retail turnover in Russia                                  1.0%          2.1%          2.3%          1.4%
Unemployment                                               0.8%          0.5%          0.9%          0.6%
Marriages                                                  7.5%          6.9%          9.9%          8.3%
Real activity                                              0.1%          0.1%          0.1%          0.1%
Consumer Price Index                                       0.1%          0.1%          0.1%          0.1%
Entrepreneurs Price Index on industrial products           0.3%          0.3%          0.3%          0.3%
Entrepreneurs Price Index on minerals                      1.4%          1.4%          1.4%          1.4%
The dollar/ruble exchange rate                             0.8%          1.4%          1.1%          1.2%
Newborns                                                   1.2%          1.8%          1.9%          2.3%
Sales of new passenger cars and light commercial vehicles  3.5%          1.6%          3.9%          5.4%
Per capita income                                          2.4%          1.6%          1.9%          1.0%
Economic crimes                                            3.4%          6.4%          4.5%          6.5%
Oil prices                                                 1.0%          1.7%          1.7%          1.6%

FORECAST

Page 35: AINL 2016: Boldyreva

One-month forward forecast                                 Neuro,        Neuro, with   Combi,        Combi, with
                                                           no roots      square roots  no roots      square roots
Retail turnover in Russia                                  1.8%          -3.8%         3.8%          -2.9%
Unemployment                                               -2.8%         2.5%          -0.9%         3.1%
Marriages                                                  -33.2%        -20.4%        -52.1%        -20.2%
Real activity                                              -0.3%         -0.2%         -0.3%         -0.2%
Consumer Price Index                                       0.1%          -0.2%         0.1%          0.7%
Entrepreneurs Price Index on industrial products           2.3%          -0.2%         4.1%          -0.2%
Entrepreneurs Price Index on minerals                      86.6%         37.0%         1.5%          37.0%
The dollar/ruble exchange rate                             12.7%         -2.0%         10.6%         17.3%
Newborns                                                   -0.7%         -0.7%         -0.6%         7.6%
Sales of new passenger cars and light commercial vehicles  7.8%          -16.9%        24.9%         -100.0%
Per capita income                                          6.7%          10.8%         13.8%         23.6%
Economic crimes                                            -16.5%        6.6%          -2.9%         6.4%
Oil prices                                                 -5.3%         -7.0%         -14.0%        -7.0%

Comparison. One-month forward deviations in forecast with barometers

FORECAST


Page 37: AINL 2016: Boldyreva

CONCLUSION

Scientific results – 1 (databases)

Experimentally we have shown that the following can be applied effectively:

• Several domain-oriented databases instead of one;

• Bases of n-grams (n = 2, ..., 8);

• Significantly negatively correlated descriptors along with significantly positively correlated descriptors.

Page 38: AINL 2016: Boldyreva

CONCLUSION

Scientific results – 2 (analysis)

We suggested an interpretation of the results of statistical analysis:

• in evaluating the reasons for increased mortality at the beginning of 2015 in the regions of Russia;

• in evaluating people's reaction to retail trade turnover;

• in revealing groups of consumers.

Page 39: AINL 2016: Boldyreva

CONCLUSION

Scientific results – 3 (forecast)

Experimentally we have shown the high accuracy of GMDH algorithms, which achieve error levels of

~3%-6% in the best models of crimes;

~1%-4% in the models for economic and social indicators.

Page 40: AINL 2016: Boldyreva

As future work we consider

• proposing a technology to use the mentions of descriptors in

social media;

• developing a procedure for processing queries including

outliers related to major circumstances;

• developing models for fuzzy forecasting taking into account

qualitative dynamics of queries.

Future research

CONCLUSION

Page 41: AINL 2016: Boldyreva

Thank you!


Page 42: AINL 2016: Boldyreva

X_pol – barometer with a strong direct positive correlation with indicator X

X_otr – barometer with a strong direct negative correlation with indicator X

Cmim_pol – barometer with a strong positive correlation with indicator X at a lag of i months

Cmim_otr – barometer with a strong negative correlation with indicator X at a lag of i months

Notation
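
A minimal sketch of the lagged correlation behind the Cmim barometers. It assumes that a lag of i months means the query series leads the indicator, so the query value at month t-i is paired with the indicator at month t, as in the model equations where barometers enter with indices such as [t-3]. The function name `lagged_correlation` is hypothetical.

```python
import numpy as np


def lagged_correlation(query, indicator, lag):
    """Pearson correlation between query[t - lag] and indicator[t].

    Assumption: the query series leads the indicator by `lag` months.
    """
    query = np.asarray(query, dtype=float)
    indicator = np.asarray(indicator, dtype=float)
    if query.shape != indicator.shape:
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(query) - 2:
        raise ValueError("lag leaves fewer than two paired observations")
    if lag == 0:
        return float(np.corrcoef(query, indicator)[0, 1])
    # Drop the last `lag` query values and the first `lag` indicator values.
    return float(np.corrcoef(query[:-lag], indicator[lag:])[0, 1])
```

Descriptors could then be ranked separately for each lag, with the strongly positive ones forming a `_pol` barometer and the strongly negative ones an `_otr` barometer.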

FORECAST

Page 43: AINL 2016: Boldyreva

Latest research papers

• Boldyreva A., Alexandrov M., Koshulko O., Sobolevskiy O.: Queries to Internet as a tool for analysis of regional police work and forecast of crimes in regions: Proc. of 15th Mexican Intern. Conf. on Artificial Intelligence, Springer, LNCS, 2016, 12 p. [to be published]

• Boldyreva A., Sobolevskiy O., Alexandrov M., Danilova V.: Creating collections of descriptors based on Internet queries: Proc. of 15th Mexican Intern. Conf. on Artificial Intelligence, Springer, LNCS, 11 p. [to be published]

• Boldyreva A.: An integral method for investigating attitudes of Internet users based on search queries. “Mathematical modeling of social processes”, Proc. of Sociological Faculty of MSU, Publ. House MSU (Moscow State Lomonosov Univ.), 2016, vol. 18, pp. 26-34, [rus]

• Boldyreva A.: Building predictive models of economic and social conditions based on the intensity of search queries to the Internet. “Modern economics: theory, policy, innovation. Collection of student research papers”, Moscow, Publ. House RANEPA, 2016, pp. 36-61, [rus]

• Boldyreva A., Alexandrov M., Surkova D.: Words with negative sentiment in search queries to the Internet as an indicator of per capita income in the Federal Districts of Russia. Inductive modeling of complex systems, NAS of Ukraine, Kyev, 2015, vol. 7, pp. 77-92, [rus]