Processing queries to search engine of Yandex:
possibilities of analysis and forecast
AINL FRUCT 2016
Boldyreva Anna (RANEPA-MIPT)
Content
Introduction
Databases
Tasks of analysis
Tasks of forecast
Conclusion
INTRODUCTION
Terminology
A search query is a request made by an Internet user to
obtain information from a search engine; statistics on
search queries are available from search engine services:
https://www.google.ru/trends/
https://adwords.google.com/
http://wordstat.yandex.ru/.
A descriptor is a word or phrase that forms part of the
search queries entered by users;
Indicators are economic, social, demographic and other
measures that are analyzed or forecast by analysts and
researchers;
A top-rated list of descriptors consists of the search queries
most highly correlated with a selected indicator;
A barometer is the mean value of the normalized dynamics
of the descriptors in a top-rated list.
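The barometer definition above can be sketched in a few lines. This is a minimal illustration, not the author's program: min-max scaling is an assumed choice of normalization (the slides do not say which one was used), and the function names and monthly frequencies are hypothetical.

```python
def normalize(series):
    """Min-max scale one query-frequency series to [0, 1] (assumed normalization)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


def barometer(descriptor_series):
    """Average the normalized series of the top-rated descriptors, month by month."""
    normalized = [normalize(s) for s in descriptor_series]
    return [sum(month) / len(month) for month in zip(*normalized)]


# Hypothetical monthly frequencies for three top-rated descriptors
top_rated = [
    [100, 150, 200, 250],
    [10, 20, 30, 40],
    [5, 5, 10, 20],
]
b = barometer(top_rated)
```

Normalizing first keeps a very frequent query from dominating the mean, so each descriptor contributes equally to the barometer.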
Hypothesis
There is a stable statistical dependence between the intensity of
search queries and real-world events and social processes.
Fig. 1. The dynamics of the descriptor 'swimsuit' in
the U.S.A.: peaks in February and May-June
Relevance
Search queries can be used
• for monitoring the economic situation in regions in real time, avoiding difficulties caused by the lack of data;
• for parallel control of official information, which makes it possible to reveal distortions introduced by official institutions;
• for forecasting economic, demographic and social parameters during a crisis period;
• for forecasting the dynamics of various socio-economic and socio-political processes;
• for analysis of other countries, where there is no need to wait for official data that is published with a delay.
State of the art
2009: Google launched a service showing disease outbreak
hotspots in real time based on the intensity of queries from
different regions;
2009: H. Choi and H. Varian introduced the first model
predicting fluctuations in business cycles using search
query statistics;
2011: Z. Da, J. Engelberg and P. Gao demonstrated that
analyzing the dynamics of Google searches for companies gives a 10% advantage to traders;
2011: Michael Stolbov (MGIMO-University) demonstrated
the feasibility of using Google search statistics to explain the
dynamics of aggregated financial indicators (for example,
deposits of individuals);
2013: Tobias Preis presented the work "Complex
dynamics of our economic life on different scales: insights
from search engine query data".
The work is devoted to stock markets; he analyzed
surges in searches for "Subprime", "Lehman Brothers" and
"Financial Crisis", which were followed by a drop in the S&P 500 Index.
DATABASES
Domain-oriented databases of descriptors (SQ = search queries)
• economic terms: 25000 SQ;
• legal terms: 4500 SQ;
• crime articles: 365 SQ;
• well-known brands and goods: 3013 SQ;
• emotions:
with positive tonality: 400 SQ
with negative tonality: 400 SQ;
• slang used in finance, computers and other fields: 3300 SQ;
• medical terms: 1600 SQ.
Technical databases
• lemmas: 18638 SQ;
• n-grams (n = 2, 3, ..., 8) of letters and syllables: ~90 000 SQ.
Lemma: the initial form of a word.
Examples: avant-garde, sauna, drum, dune, velvet, bass, basketball, battalion commander, comet, compass, icon, contour, piggy, mop, cordon
n-grams (of Russian letters).
Examples: ев, ег, ед, ее, еж, ти, тк, тл, тм, тн, то, авв, авг, авп, бре, бри, бро, век, вел, вес, лак, лал, лам, лан, лао, лап, греч, декс, сдел, кром
Emotion words with positive tonality.
Examples: good, great, beautiful, holiday, goodness, beauty, super, fun, cool, happy, dream, luck, well, success, joy, laugh, nice
Emotion words with negative tonality.
Examples: chaos, amoral, immoral, sabotage, punishment, violation, cattle, schmuck, moron, hopeless, useless, helpless
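As an illustration of how a database of letter n-grams could be built from lemmas, here is a minimal sketch. The function name and defaults are hypothetical, and it covers only contiguous letter n-grams; the syllable n-grams mentioned above would need a syllabifier, which is not shown.

```python
def letter_ngrams(word, n_min=2, n_max=8):
    """Return all contiguous letter n-grams of `word` for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A word shorter than n simply yields no n-grams of that length.
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


# Bigrams and trigrams of one of the example lemmas
grams = letter_ngrams("sauna", n_min=2, n_max=3)
```

Collecting these over all lemmas and removing duplicates would give a candidate list of partial-word descriptors.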
Barometers
Examples of words included in the "barometer" with strong positive
correlation with the indicator "Consumer Price Index":
"treat" – 0.93
"okmarket" (hypermarket) – 0.91
"pariet" (drug for ulcer) – 0.89
"patents" – 0.87
"mfbank" (commercial bank) – 0.87
"headhunter" (site to find job) – 0.86
"pediashur" (baby food) – 0.86
"convenient" – 0.86
"often" – 0.85
"close" – 0.85
Examples of words included in the "barometer" with strong negative
correlation with the indicator "Consumer Price Index":
"chemical" (British musical duo) – -0.92
"artofvar" (musical group of war veterans) – -0.91
"incest" – -0.87
"group" – -0.87
"babylon" (Italian clothing brand) – -0.87
"young child" – -0.86
"diprivan" (a sedative) – -0.86
"ilarauto" (van sales) – -0.86
"miss" – -0.86
Databases of indicators
• Retail trade turnover (mln roubles);
• Consumer Price Index;
• Entrepreneurs Price Index on industrial products;
• Entrepreneurs Price Index on minerals;
• Unemployment (thousands);
• Sales of new passenger cars and light commercial vehicles (units);
• Per capita income (thousands of roubles);
• The dollar/ruble exchange rate (USDTOM_UTS);
• Brent price (ICE.Brent), USD/barrel;
• newborns (thousands);
• marriages (thousands);
• real activity (thousands);
• deaths (thousands);
• registered economic crimes
Programs
1. A program for collecting the dynamics of search queries from the Yandex statistics service;
2. A program for automatically processing the collected files and building an Excel spreadsheet;
3. A program for automatically processing the tables and selecting the top search queries.
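The third step, selecting the top search queries, can be sketched as a correlation filter. This is only an illustration under stated assumptions: Pearson correlation is assumed as the measure, the 0.85 threshold is a hypothetical cut-off suggested by the barometer examples, and the query names and data are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(vx * vy)


def top_descriptors(queries, indicator, threshold=0.85):
    """Split descriptors into strongly positive and strongly negative lists."""
    scored = {name: pearson(series, indicator) for name, series in queries.items()}
    positive = sorted(
        ((n, r) for n, r in scored.items() if r >= threshold), key=lambda p: -p[1]
    )
    negative = sorted(
        ((n, r) for n, r in scored.items() if r <= -threshold), key=lambda p: p[1]
    )
    return positive, negative


# Hypothetical monthly data: one indicator and three query series
indicator = [1, 2, 3, 4, 5]
queries = {
    "alpha": [2, 4, 6, 8, 10],
    "beta": [5, 4, 3, 2, 1],
    "gamma": [1, 3, 2, 3, 1],
}
positive, negative = top_descriptors(queries, indicator)
```

Keeping the negative list alongside the positive one mirrors the use of both kinds of barometers in the forecasting models.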
ANALYSIS
Statistics of queries by regions
Distribution of positive search queries by correlation with the indicator
"Retail turnover"
Correlation coefficients are plotted on the vertical axis; the number of
positive descriptors at each level of correlation with the indicator
"Retail trade turnover" is plotted on the horizontal axis.
Query statistics on domain-oriented databases
Example: distribution of queries from the database "Brands and products"
relative to the indicators, with which there is a high level of correlation.
Observation: newlyweds are buying more than young parents
Consumer profiling
Example: distribution of queries with a high level of correlation with the
indicator "Sales of new cars" across thematic databases.
Observation: active use of slang and a wide variety of products/services
Increased mortality in Leningrad Region
Excess frequency of search queries from the database of medical terms in Leningrad Region compared with data for Russia (data for Russia are taken as 100%; queries translated from Russian)

| Query | Excess | Query | Excess |
|---|---|---|---|
| diaphragm spasm | +21% | vertigo | +54% |
| loss of taste | +108% | bitter taste in the mouth | +52% |
| doctor on duty | +105% | pharmacy phone number | +46% |
| full pulse | +91% | emergency room | +46% |
| cough with yellow sputum | +73% | manic phase | +45% |
| neck numbness | +70% | naphthyzin | +43% |
| insensitivity | +70% | bleeding from the ears | +38% |
| night sweats | +69% | call a doctor | +38% |
| sterile bandages | +67% | Efferalgan | +37% |
| abstinent | +53% | buy medicines | +36% |
| hospital on duty | +55% | wheezing | +34% |
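One plausible way to compute such an excess is to compare the share of a query in regional traffic with its share in national traffic. The slide does not state the exact normalization, so the use of shares, the function name and the numbers below are all assumptions made for illustration.

```python
def excess_frequency(region_count, region_total, country_count, country_total):
    """Percentage by which a query's regional share exceeds its national share.

    The national share is taken as 100%, so a result of +50 means the query
    is 1.5 times as frequent in the region as in the country overall.
    """
    region_share = region_count / region_total
    country_share = country_count / country_total
    return (region_share / country_share - 1.0) * 100.0


# Hypothetical counts: 30 of 10 000 regional queries, 2 000 of 1 000 000 national ones
excess = excess_frequency(30, 10_000, 2_000, 1_000_000)
```

Using shares rather than raw counts removes the effect of the region simply having fewer users than the country as a whole.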
FORECAST
Group method of data handling (GMDH)
GMDH makes it possible to select a model of optimal complexity within a
given class of models to describe the available experimental data.
Class of models: polynomials in x,
where x = {x_i | i = 1, …, m} is a set of indicators
and w = (w_i, w_ij, w_ijk, … | i, j, k = 1, …, m) is a weight vector.
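The polynomial class referred to above is presumably the Kolmogorov-Gabor polynomial traditionally used with GMDH. A sketch under that assumption, using the symbols x and w defined above; the free term w_0 is an addition that the listed weight vector does not mention:

```latex
$$
y(x, w) = w_0 + \sum_{i=1}^{m} w_i x_i
        + \sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\, x_i x_j
        + \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{k=1}^{m} w_{ijk}\, x_i x_j x_k + \dots
$$
```

GMDH searches this class for the truncation (which terms to keep) that gives the best value of an external criterion.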
Program GMDH Shell
GMDH Shell implements GMDH.
Capabilities:
• Approximation
• Extrapolation
• Classification
http://www.gmdhshell.com
Chief designer: Candidate of Technical Sciences Koshulko A.A.
Error evaluation
1st criterion: MAPE (mean absolute percentage error):
$$\mathrm{MAPE} = \frac{1}{N}\sum_{t=1}^{N}\left|\frac{y_t - \hat{y}_t}{y_t}\right| \cdot 100\%,$$
where $N$ is the sample size, $y_t$ is the real value at time $t$, and $\hat{y}_t$ is the estimated value at time $t$;
2nd criterion: P (one-month step forward forecast error):
$$P = \frac{y_{N+1} - \hat{y}_{N+1}}{y_{N+1}} \cdot 100\%.$$
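The two error criteria can be computed as follows. This is a minimal sketch: the sign convention for P (real value minus estimate) is an assumption, the function names are hypothetical, and the sample values are invented.

```python
def mape(actual, predicted):
    """Mean absolute percentage error over the sample, in percent."""
    n = len(actual)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted)) / n * 100.0


def one_step_error(actual_next, predicted_next):
    """Signed one-month-ahead forecast error P, in percent.

    Assumed convention: a negative P means the model overestimated the value.
    """
    return (actual_next - predicted_next) / actual_next * 100.0


# Hypothetical indicator values and model estimates
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
sample_mape = mape(actual, predicted)
p = one_step_error(50.0, 51.0)
```

MAPE summarizes the fit over the whole sample, while P checks a single out-of-sample month, which is why the two can differ widely for the same model.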
Forecast settings
Observations are pseudo-randomly shuffled;
The validation method is two-fold cross-validation;
The internal criterion is OLS;
The external criterion is RMSE (root mean squared error) with a
penalty in the form of the difference between the RMSE
values on the training and testing parts of the sample;
The neuron function is linear;
The maximum number of layers is 6; the initial layer width is 5.
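To make the interplay of the internal and external criteria concrete, here is a small sketch for a single linear neuron with one input. The exact penalty formula used by GMDH Shell is not given in the slides; adding the absolute difference of the two RMSE values is an assumption, and all names and data are hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Internal criterion: ordinary least squares fit of y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


def rmse(x, y, a, b):
    """Root mean squared error of the line a + b * x on the given points."""
    return sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x))


def external_criterion(x_train, y_train, x_test, y_test):
    """RMSE on the testing part plus a penalty for the train/test gap (assumed form)."""
    a, b = fit_line(x_train, y_train)
    rmse_train = rmse(x_train, y_train, a, b)
    rmse_test = rmse(x_test, y_test, a, b)
    return rmse_test + abs(rmse_test - rmse_train)


# Hypothetical data: a perfect fit on the training part, unit errors on the testing part
score = external_criterion([1, 2, 3, 4], [2, 4, 6, 8], [5, 6], [11, 11])
```

The penalty term discourages candidate models that fit the training part much better than the testing part, which is the usual sign of overfitting.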
Forecast of retail turnover
Neural algorithm with linear barometers
MAPE = 1.0%
One-month forward forecast error P = 1.8%
Y1[t] = 64.4813 + Cm3m_pol[t-3]*40.7607 + N2*0.966131
N2[t] = -599.916 - Cm1m_pol[t-1]*410.022 + N3*1.3329
N3[t] = -45.4194 + N12*0.261196 + N4*0.759475
N4[t] = -99.6924 + ORT_PK_otr[t-2]*118.488 + N6*1.02163
N6[t] = 5.48719 - ORT_PK_pol[t-1]*181.667 + N10*1.0246
N10[t] = 1926.58 + Cm1m_pol[t-1]*1209.97 - Cm3m_otr[t-1]*241.064
N12[t] = 2327.24 + Cm3m_pol[t-3]*685.968 - Cm3m_otr[t-3]*581.729
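The layered equations for retail turnover can be read bottom-up: the deepest neurons (N10, N12) depend only on lagged barometers, and each higher neuron combines a barometer with the output of a lower one. The sketch below evaluates the chain with the coefficients given above; the input format (a dict of monthly lists) and the all-zero inputs are hypothetical and serve only to show the mechanics.

```python
def retail_turnover_forecast(b, t):
    """Evaluate the layered model for month index t.

    b maps barometer names to lists of monthly values (assumed input format).
    """
    if t < 3:
        raise ValueError("the model needs barometer values up to three months back")
    n12 = 2327.24 + b["Cm3m_pol"][t - 3] * 685.968 - b["Cm3m_otr"][t - 3] * 581.729
    n10 = 1926.58 + b["Cm1m_pol"][t - 1] * 1209.97 - b["Cm3m_otr"][t - 1] * 241.064
    n6 = 5.48719 - b["ORT_PK_pol"][t - 1] * 181.667 + n10 * 1.0246
    n4 = -99.6924 + b["ORT_PK_otr"][t - 2] * 118.488 + n6 * 1.02163
    n3 = -45.4194 + n12 * 0.261196 + n4 * 0.759475
    n2 = -599.916 - b["Cm1m_pol"][t - 1] * 410.022 + n3 * 1.3329
    return 64.4813 + b["Cm3m_pol"][t - 3] * 40.7607 + n2 * 0.966131


# Hypothetical inputs: four months of zeros for every barometer
names = ["Cm1m_pol", "Cm3m_pol", "Cm3m_otr", "ORT_PK_pol", "ORT_PK_otr"]
zero_inputs = {name: [0.0] * 4 for name in names}
baseline = retail_turnover_forecast(zero_inputs, t=3)
```

Because every input is lagged by at least one month, the forecast for month t needs only data that is already available at the end of month t-1.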
Forecast of USD/ruble exchange rate
Neural algorithm with barometers in square-root form
MAPE = 1.4%
One-month forward forecast error P = -2%
Y1[t] = -2.49737e-10 + N2*1
N2[t] = -769.864 + "Cm2_otr[t-2], sqrt"*522.473 + N3*1.09103
N3[t] = 2267.23 - "Cm3_otr[t-3], sqrt"*1563.91 + N6*0.738373
N6[t] = -5926.55 + ""$_PK_otr"[t-2], sqrt"*4826.58 + N10*1.48421
N10[t] = 8962.17 - "Cm2_otr[t-2], sqrt"*3666.07 - "Cm3_otr[t-4], sqrt"*2607.62
Forecast of economic crimes
Combinatorial algorithm with linear variables
MAPE = 4.5%
One-month forward forecast error P = -2.9%
Y[t] = 5368.54 + Cm2_pol[t-1]*8610.05 + Cm2_pol[t-2]*4452.71 + Cm2_otr[t-2]*(-11350) + Cm3_pol[t-2]*11285.4
Comparison of algorithms. MAPE

| Indicator | Neuro, no roots | Neuro, with square roots | Combi, no roots | Combi, with square roots |
|---|---|---|---|---|
| Retail turnover in Russia | 1.0% | 2.1% | 2.3% | 1.4% |
| Unemployment | 0.8% | 0.5% | 0.9% | 0.6% |
| Marriages | 7.5% | 6.9% | 9.9% | 8.3% |
| Real activity | 0.1% | 0.1% | 0.1% | 0.1% |
| Consumer Price Index | 0.1% | 0.1% | 0.1% | 0.1% |
| Entrepreneurs Price Index on industrial products | 0.3% | 0.3% | 0.3% | 0.3% |
| Entrepreneurs Price Index on minerals | 1.4% | 1.4% | 1.4% | 1.4% |
| The dollar/ruble exchange rate | 0.8% | 1.4% | 1.1% | 1.2% |
| Newborns | 1.2% | 1.8% | 1.9% | 2.3% |
| Sales of new passenger cars and light commercial vehicles | 3.5% | 1.6% | 3.9% | 5.4% |
| Per capita income | 2.4% | 1.6% | 1.9% | 1.0% |
| Economic crimes | 3.4% | 6.4% | 4.5% | 6.5% |
| Oil prices | 1.0% | 1.7% | 1.7% | 1.6% |
Comparison of algorithms. One-month forward forecast deviations with barometers

| Indicator | Neuro, no roots | Neuro, with square roots | Combi, no roots | Combi, with square roots |
|---|---|---|---|---|
| Retail turnover in Russia | 1.8% | -3.8% | 3.8% | -2.9% |
| Unemployment | -2.8% | 2.5% | -0.9% | 3.1% |
| Marriages | -33.2% | -20.4% | -52.1% | -20.2% |
| Real activity | -0.3% | -0.2% | -0.3% | -0.2% |
| Consumer Price Index | 0.1% | -0.2% | 0.1% | 0.7% |
| Entrepreneurs Price Index on industrial products | 2.3% | -0.2% | 4.1% | -0.2% |
| Entrepreneurs Price Index on minerals | 86.6% | 37.0% | 1.5% | 37.0% |
| The dollar/ruble exchange rate | 12.7% | -2.0% | 10.6% | 17.3% |
| Newborns | -0.7% | -0.7% | -0.6% | 7.6% |
| Sales of new passenger cars and light commercial vehicles | 7.8% | -16.9% | 24.9% | -100.0% |
| Per capita income | 6.7% | 10.8% | 13.8% | 23.6% |
| Economic crimes | -16.5% | 6.6% | -2.9% | 6.4% |
| Oil prices | -5.3% | -7.0% | -14.0% | -7.0% |
CONCLUSION
Scientific results – 1 (databases)
We have shown experimentally that the following can be used
effectively:
• Several domain-oriented databases instead of one;
• Databases of n-grams (n = 2, ..., 8);
• Significantly negatively correlated descriptors along with
significantly positively correlated descriptors;
Scientific results – 2 (analysis)
We suggested interpretations of the results of statistical analysis:
• in evaluating the reasons for increased
mortality at the beginning of 2015 in the regions of
Russia;
• in evaluating people's reaction to retail
trade turnover;
• in identifying groups of consumers.
Scientific results – 3 (forecast)
We have shown experimentally the high accuracy of GMDH
algorithms, which achieve error levels of
~3%-6% in the best models of crimes;
~1%-4% in the models for economic and social indicators.
Future research
As future work we consider
• proposing a technology for using mentions of descriptors in
social media;
• developing a procedure for processing queries that include
outliers related to major events;
• developing models for fuzzy forecasting that take into account
the qualitative dynamics of queries.
Thank you!
[email protected] +7-916-542-37-64
Notation
X_pol: barometer with strong positive correlation
with indicator X
X_otr: barometer with strong negative correlation
with indicator X
Cmim_pol: barometer with strong positive correlation with indicator X
at a lag of i months
Cmim_otr: barometer with strong negative correlation with indicator X
at a lag of i months
Latest research papers
• Boldyreva A., Alexandrov M., Koshulko O., Sobolevskiy O.: Queries to Internet as a tool for analysis of regional police work and forecast of crimes in regions: Proc. of 15th Mexican Intern. Conf. on Artificial Intelligence, Springer, LNCS, 2016, 12 p. [to be published]
• Boldyreva A., Sobolevskiy O., Alexandrov M., Danilova V.: Creating collections of descriptors based on Internet queries: Proc. of 15th Mexican Intern. Conf. on Artificial Intelligence, Springer, LNCS, 11 p. [to be published]
• Boldyreva A.: An integral method for investigating attitudes of Internet users based on search queries. “Mathematical modeling of social processes”, Proc. of Sociological Faculty of MSU, Publ. House MSU (Moscow State Lomonosov Univ.), 2016, vol. 18, pp. 26-34, [rus]
• Boldyreva A.: Building predictive models of economic and social conditions based on the intensity of search queries to the Internet. “Modern economics: theory, policy, innovation. Collection of student research papers”, Moscow, Publ. House RANEPA, 2016, pp. 36-61, [rus]
• Boldyreva A., Alexandrov M., Surkova D.: Words with negative sentiment in search queries to the Internet as an indicator of per capita income in the Federal Districts of Russia. Inductive modeling of complex systems, NAS of Ukraine, Kyev, 2015, vol. 7, pp. 77-92, [rus]