Deep Learning for Malware Analysis
23 December 2019, TOBB-ETÜ Cyber Security Day
Cengiz Acartürk, Deniz Demirci, Melih Şırlancı, Nazenin Şahin
Natural Language Processing (NLP)
How to investigate natural languages for
• Natural language understanding (speech-to-text)
• Natural language generation
• Detection and Translation
• Designing conversational agents
• Text simplification
• …
Rule-based approaches to NLP
Formal representation of language by syntactic categories
Source: Levels of Representations of Syntactic Structures by Roussanka Loukanova, October 22, 2014 , http://staff.math.su.se/palmgren/logling/HandoutLevelsOfRepr.pdf
Lexical categories
Levels of Representations of Syntactic Structures
Roussanka Loukanova
October 22, 2014
In the following sections, we overview linguistic concepts that have shaped research developments in linguistics and corresponding mathematical methods from theory of formal languages and other sub-fields of logic. The formalization of these concepts evolved through stages of major approaches to formal and computational syntax and semantics. We introduce some of the concepts by the way they have been developed in key stages of formal approaches. These concepts and ideas, while introduced at first in 50s-80s, continue to be among cornerstones in linguistics and corresponding mathematical theories.
1 Basic concepts of syntax of human language
There are three major kinds of concepts of syntax, which are interrelated.
1.1 Constituent structure
The task of a formal grammar is not only to distinguish grammatical expressions, but also to assign them syntactic structure. Constituent structure of a language expression represents it as hierarchical composition of larger syntactic constructs from parts. The constructed structures are called constituents.
What exactly are the constituents, what information they carry, how they are related to each other, and similar concepts, vary among theories. Often the constituent structures are represented by tree diagrams, e.g., (1), with information on their nodes.
1.2 Syntactic categories
Constituents of sentences and other expressions are classified according to the syntactic categories they belong to.
Lexical categories. Lexical items, e.g., words, are assigned to syntactic categories, traditionally called parts of speech. Usually, we refer to them as lexical categories:
• nouns (N): book, man, etc.
• verbs (V): run, read, give, etc.
• adjectives (Adj): blue, big, etc.
• adverbs (Adv): quickly, easy, etc.
• prepositions (P): on, to, above, etc.
• determiners (Det): the, this, a, some, every
• subordinator: that, for, to, whether, if
• coordinator: and, or, if . . . then, either . . . or, but
• interjection: wow, oh, ah, etc.
Phrasal categories. Complex constituents, determined by rules from one or more lexical items, are constructs that are called phrases. Most of the syntactic categories of phrases are based on lexical categories. The following are traditional notations of phrasal categories, across theories and approaches.
• NP
• VP
• AdjP (AP)
• AdvP (AP)
• PP
• DetP (DP)
What objects exactly are the lexical items, the rules, and the phrases (i.e., NP, VP, etc. are usually notations or abbreviations) vary across theories of syntax, semantics, and related interfaces. E.g., the tree diagram (1) represents a constituent structure of a sentence.
(1)
S
  NP
    Det: the
    NOM
      Adj: small
      N: mouse
  VP
    V: ate
    NP
      Det: the
      N: cheese
1.3 Grammatical functions in constructions
Assuming a given formal grammar, its rules determine well-formed, grammatical constructions. E.g., the tree (1) can be a constituent structure defined by such a grammar, for instance in CBLG, where the labels S, NP, VP, D, NOM, Adj, N, VP, V are abbreviations of CBLG descriptions encoding syntactic categories. Then, in (1), both expressions “the small mouse” and “the cheese” belong to the same category NP, but they have different grammatical functions, subject and object respectively, because they stand in different relations to the main verb “ate” in the constituent structure (1).
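The distinction between category and grammatical function can be made concrete with a small sketch. The nested-tuple encoding and the helper names below are illustrative assumptions, not part of the handout or of CBLG; the subject/object rule (NP directly under S versus NP directly under VP) is a deliberate simplification that only fits tree (1).

```python
# Tree (1) as nested tuples: (category, child, child, ...); leaves are plain strings.
# This encoding is a hypothetical choice made for illustration only.
TREE = ("S",
        ("NP", ("Det", "the"), ("NOM", ("Adj", "small"), ("N", "mouse"))),
        ("VP", ("V", "ate"), ("NP", ("Det", "the"), ("N", "cheese"))))


def leaves(node):
    """Return the words dominated by a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def constituents(node, category):
    """Return the word strings of all constituents labelled with `category`."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == category else []
    for child in node[1:]:
        found.extend(constituents(child, category))
    return found


def grammatical_functions(sentence):
    """Simplifying assumption: subject = NP under S, object = NP under VP."""
    roles = {}
    for child in sentence[1:]:
        if child[0] == "NP":
            roles["subject"] = " ".join(leaves(child))
        elif child[0] == "VP":
            for grandchild in child[1:]:
                if grandchild[0] == "NP":
                    roles["object"] = " ".join(leaves(grandchild))
    return roles
```

Calling `constituents(TREE, "NP")` collects both noun phrases under the same category, while `grammatical_functions(TREE)` separates them by their position relative to the verb.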
Phrasal categories
Tree diagrams
Syntax and Semantics
Formal representation by syntactic categories: NOT enough for constructing meaningful sentences!
The small mouse ate the cheese
*The colorless green dream ate the sun
How to understand/generate meaningful sentences?
Language models
A language model may help understanding/generating meaningful sentences
Given: A history of words
A language model predicts the next word
Language models
Can you please come here ?
History (previous words, context)
Word being predicted
P(S) = P(can) x P(you | can) x P(please | can you) x P(come | can you please) x P(here | can you please come)
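The chain-rule product above can be sketched with plain counts. This is a toy illustration under stated assumptions: the three-sentence corpus, the function names, and the maximum-likelihood estimate from prefix counts are all hypothetical and are not the model used in the talk. Practical models usually shorten the history (n-grams) or learn it with a neural network.

```python
from collections import Counter
from fractions import Fraction


def count_prefixes(sentences):
    """Count how many training sentences start with each word prefix."""
    counts = Counter()
    for sentence in sentences:
        words = tuple(sentence.split())
        for i in range(len(words) + 1):
            counts[words[:i]] += 1
    return counts


def sentence_probability(sentence, counts):
    """Chain rule: multiply P(word | all previous words) over the sentence."""
    words = tuple(sentence.split())
    prob = Fraction(1)
    for i in range(len(words)):
        history, extended = words[:i], words[:i + 1]
        if counts[history] == 0:
            return Fraction(0)  # history never observed in the toy corpus
        prob *= Fraction(counts[extended], counts[history])
    return prob


# Hypothetical toy corpus, chosen only to make the arithmetic easy to follow.
CORPUS = ["can you please come here", "can you help me", "can I go"]
COUNTS = count_prefixes(CORPUS)
```

With this corpus, `sentence_probability("can you please come here", COUNTS)` multiplies 3/3 × 2/3 × 1/2 × 1/1 × 1/1, and a word sequence never seen after its history receives probability zero.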
The research question
Suppose that you have a language model for each natural language
Given: A sentence (“Elnézést, szabad még ez a hely?”)
Goal: To find the closest language model that would find the sentence meaningful (cf. having higher probability)
Word embeddings
Pre-trained Model: The probabilities of all words in a language within the context of all other words
Input: Large corpora (sentences of a language)
A sample repository for pre-trained language models (word embeddings)
How to use this for malware analysis
Malware assembly code forms sentences of a malware language
Benign assembly code forms sentences of a benign language
Suppose that you have a language model for each of the two languages
Given: A piece of code
Goal: To find the closest language model (malware or benign) that would find the code meaningful (cf. having higher probability)
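The "closest language model" idea can be sketched as follows: train one model per class and assign a piece of code to the class whose model gives it the higher probability. Everything below is an assumption made for illustration: the opcode sequences are invented toy data, the models are add-one-smoothed bigram models rather than the neural models discussed in this talk, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(sequences):
    """Count opcode bigrams and their histories over the training sequences."""
    bigrams, histories, vocab = Counter(), Counter(), set()
    for seq in sequences:
        tokens = ["<s>"] + seq  # "<s>" marks the start of a sequence
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            histories[prev] += 1
    return bigrams, histories, vocab


def log_prob(model, seq, vocab_size):
    """Log-probability of a sequence under a bigram model, add-one smoothed."""
    bigrams, histories, _ = model
    tokens = ["<s>"] + seq
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size))
    return total


def classify(seq, models):
    """Return the label of the model that finds the sequence most probable."""
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    scores = {label: log_prob(m, seq, vocab_size) for label, m in models.items()}
    return max(scores, key=scores.get), scores


# Invented toy opcode sequences; real training data would be disassembled binaries.
MODELS = {
    "benign": train_bigram_model([["push", "mov", "call", "ret"],
                                  ["push", "mov", "add", "ret"]]),
    "malware": train_bigram_model([["xor", "xor", "jmp", "call"],
                                   ["xor", "jmp", "xor", "jmp"]]),
}
```

Here `classify(["xor", "jmp", "call"], MODELS)` returns the label together with both log-probabilities, so the margin between the two models can be inspected as well as the decision.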
Our Pipeline
Experiments & Approaches:
• The CBOW architecture predicts the current word based on the context,
• and the skip-gram predicts surrounding words given the current word.
• «All models are wrong, but some are useful.»
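The difference between the two architectures shows up in the training examples each one is fed. The sketch below only builds those examples from a token sequence; it does not train embeddings. The function names, the window size, and the opcode tokens are illustrative assumptions.

```python
def cbow_pairs(tokens, window=1):
    """CBOW examples: (context words, current word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=1):
    """Skip-gram examples: (current word, one surrounding word to predict)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical opcode sequence standing in for one "sentence" of assembly.
OPCODES = ["push", "mov", "call", "ret"]
```

For `OPCODES` with a window of 1, CBOW yields one example per token (for instance the context `["push", "call"]` with target `"mov"`), while skip-gram yields one example per (center, neighbour) pair.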
Datasets:
• C:\Program Files (x86)\*.exe
• A total of 5800 executables
• All 32-bit
• Mixture of Visual C++, C, ANSI C, MinGW …
• NASM, MASM, FASM
Preprocessing:
Word2vec for weights:
Assembly Distribution:
Splitting data for training:
Creating The Model:
Accuracy of The Model: