Deep Learning for Malware Analysis (Zararlı Yazılım Analizi için Derin Öğrenme)

23 December 2019, TOBB-ETÜ Cyber Security Day (Siber Güvenlik Günü)

Cengiz Acartürk, Deniz Demirci, Melih Şırlancı, Nazenin Şahin



Natural Language Processing (NLP)

How to investigate natural languages for:

• Natural language understanding (speech-to-text)

• Natural language generation

• Detection and translation

• Designing conversational agents

• Text simplification

• …

Page 3: Zararlı Yazılım Analizi için Derin Öğrenmeaselcuk.etu.edu.tr/SiberGuvenlikGunu-2019.12.23/... · 2019. 12. 31. · Zararlı Yazılım Analizi için Derin Öğrenme 23 Aralık

Rule-based approaches to NLP

Formal representation of language by syntactic categories

Source: Levels of Representations of Syntactic Structures by Roussanka Loukanova, October 22, 2014, http://staff.math.su.se/palmgren/logling/HandoutLevelsOfRepr.pdf

Lexical categories

Levels of Representations of Syntactic Structures

Roussanka Loukanova

October 22, 2014

In the following sections, we overview linguistic concepts that have shaped research developments in linguistics and corresponding mathematical methods from theory of formal languages and other sub-fields of logic. The formalization of these concepts evolved through stages of major approaches to formal and computational syntax and semantics. We introduce some of the concepts by the way they have been developed in key stages of formal approaches. These concepts and ideas, while introduced at first in 50s-80s, continue to be among cornerstones in linguistics and corresponding mathematical theories.

1 Basic concepts of syntax of human language

There are three major kinds of concepts of syntax, which are interrelated.

1.1 Constituent structure

The task of a formal grammar is not only to distinguish grammatical expressions, but also to assign them syntactic structure. Constituent structure of a language expression represents it as hierarchical composition of larger syntactic constructs from parts. The constructed structures are called constituents.

What exactly are the constituents, what information they carry, how they are related to each other, and similar concepts, vary among theories. Often the constituent structures are represented by tree diagrams, e.g., (1), with information on their nodes.

1.2 Syntactic categories

Constituents of sentences and other expressions are classified according to the syntactic categories they belong to.

Lexical categories: Lexical items, e.g., words, are assigned to syntactic categories, traditionally called parts of speech. Usually, we refer to them as lexical categories:

• nouns (N): book, man, etc.
• verbs (V): run, read, give, etc.
• adjectives (Adj): blue, big, etc.
• adverbs (Adv): quickly, easy, etc.
• prepositions (P): on, to, above, etc.
• determiners (Det): the, this, a, some, every
• subordinator: that, for, to, whether, if
• coordinator: and, or, if . . . then, either . . . or, but
• interjection: wow, oh, ah, etc.

Phrasal categories: Complex constituents, which are determined by rules from one or more lexical items, are constructs called phrases. Most of the syntactic categories of phrases are based on lexical categories. The following are traditional notations of phrasal categories, across theories and approaches.

• NP
• VP
• AdjP (AP)
• AdvP (AP)
• PP
• DetP (DP)

What objects exactly are the lexical items, the rules, and the phrases (i.e., NP, VP, etc. are usually notations or abbreviations) vary across theories of syntax, semantics, and related interfaces. E.g., the tree diagram (1) represents a constituent structure of a sentence.

(1) [S [NP [Det the] [NOM [Adj small] [N mouse]]] [VP [V ate] [NP [Det the] [N cheese]]]]

1.3 Grammatical functions in constructions

Assuming a given formal grammar, its rules determine well-formed, grammatical constructions. E.g., the tree (1) can be a constituent structure defined by such a grammar, for instance in CBLG, where the labels S, NP, VP, D, NOM, Adj, N, VP, V are abbreviations of CBLG descriptions encoding syntactic categories. Then, in (1), both expressions “the small mouse” and “the cheese” belong to the same category NP, but they have different grammatical functions, subject and object respectively, because they stand in different relations to the main verb “ate” in the constituent structure (1).

Phrasal categories


Tree diagrams


Syntax and Semantics

Formal representation by syntactic categories: NOT enough for constructing meaningful sentences!


The small mouse ate the cheese

*The colorless green dream ate the sun

How to understand/generate meaningful sentences?
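The gap between syntax and meaning can be sketched with a toy recognizer. The grammar and lexicon below are illustrative assumptions that loosely follow tree (1), with one added recursive rule (NOM -> Adj NOM) so that stacked adjectives are allowed; they are not taken from the handout. Under these assumptions both example sentences are accepted, although only the first is meaningful.

```python
# Toy grammar (an illustrative assumption, loosely following tree (1)):
#   S   -> NP VP
#   NP  -> Det NOM
#   NOM -> Adj NOM | N
#   VP  -> V NP
LEXICON = {
    "the": "Det",
    "small": "Adj", "colorless": "Adj", "green": "Adj",
    "mouse": "N", "cheese": "N", "dream": "N", "sun": "N",
    "ate": "V",
}


def parse_np(tags, i):
    """Return the index just past an NP that starts at position i, or None."""
    if i < len(tags) and tags[i] == "Det":
        i += 1
        while i < len(tags) and tags[i] == "Adj":  # NOM -> Adj NOM
            i += 1
        if i < len(tags) and tags[i] == "N":       # NOM -> N
            return i + 1
    return None


def is_grammatical(sentence):
    """Check whether the sentence matches S -> NP VP under the toy grammar."""
    words = sentence.lower().split()
    if any(w not in LEXICON for w in words):
        return False
    tags = [LEXICON[w] for w in words]
    i = parse_np(tags, 0)                          # subject NP
    if i is None or i >= len(tags) or tags[i] != "V":
        return False
    return parse_np(tags, i + 1) == len(tags)      # object NP must end the sentence


if __name__ == "__main__":
    for s in ["The small mouse ate the cheese",
              "The colorless green dream ate the sun",
              "mouse the ate cheese"]:
        print(s, "->", is_grammatical(s))
```

A purely category-based check cannot tell the two accepted sentences apart, which is the motivation for the language models on the following slides.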


Language models

A language model may help understanding/generating meaningful sentences

Given: A history of words

A language model predicts the next word


Language models

Can you please come here?

History (previous words, context): Can you please come

Word being predicted: here

P(S) = P(can) x P(you | can) x P(please | can you) x P(come | can you please) x P(here | can you please come)
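The chain-rule factorization above can be tried out on a small scale. The following is a minimal sketch that estimates each conditional probability by counting sentence prefixes in a four-sentence toy corpus; the corpus and the function names are assumptions made for illustration, and real language models use far larger corpora and smoothed or neural estimators.

```python
from fractions import Fraction

# Toy corpus, invented for illustration only.
CORPUS = [
    "can you please come here",
    "can you please sit down",
    "can you help me",
    "will you please come here",
]
SENTENCES = [s.split() for s in CORPUS]


def prefix_count(prefix):
    """Number of corpus sentences that start with the given word list."""
    n = len(prefix)
    return sum(1 for s in SENTENCES if s[:n] == prefix)


def cond_prob(word, history):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    denom = prefix_count(history)
    if denom == 0:
        return Fraction(0)
    return Fraction(prefix_count(history + [word]), denom)


def sentence_prob(sentence):
    """Chain rule: P(S) is the product of P(w_i | w_1 ... w_{i-1})."""
    words = sentence.lower().split()
    p = Fraction(1)
    for i, w in enumerate(words):
        p *= cond_prob(w, words[:i])
    return p


if __name__ == "__main__":
    print(sentence_prob("can you please come here"))   # 1/4
    print(sentence_prob("here come please you can"))   # 0
```

With this corpus the factors are 3/4, 1, 2/3, 1/2 and 1, so the sentence from the slide gets probability 1/4, while the scrambled word order gets 0.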


The research question

Suppose that you have a language model for each natural language

Given: A sentence (“Elnézést, szabad még ez a hely?”)

Goal: To find the language model that finds the sentence most meaningful (i.e., assigns it the highest probability)


Word embeddings

Pre-trained Model: The probabilities of all words in a language within the context of all other words

Input: Large corpora (sentences of a language)

A sample repository for pre-trained language models (word embeddings)


How to use this for malware analysis

Malware assembly codes are sentences of a malware language

Benign assembly codes are sentences of a benign language

Suppose that you have a model for both languages

Given: A piece of code

Goal: To find the closest language model (malware or benign) that would find the code meaningful (cf. having higher probability)
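The idea of choosing the closer of two language models can be sketched with simple count-based models. The example below swaps in add-one-smoothed bigram models over opcode sequences, which is a deliberately simpler technique than the word2vec-based pipeline of the talk. The opcode sequences, the class name `BigramModel`, and the function `classify` are all invented for illustration and are not taken from the authors' datasets or code.

```python
import math
from collections import Counter

# Toy opcode "sentences", invented for illustration only.
BENIGN = [
    ["push", "mov", "call", "add", "ret"],
    ["push", "mov", "sub", "call", "ret"],
    ["mov", "call", "test", "jz", "ret"],
]
MALWARE = [
    ["xor", "xor", "jmp", "xor", "jmp"],
    ["pop", "xor", "loop", "jmp", "call"],
    ["xor", "loop", "xor", "jmp", "ret"],
]


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sequences):
        self.unigrams = Counter()   # how often a token appears as a predecessor
        self.bigrams = Counter()    # how often (previous, current) appears
        self.vocab = set()
        for seq in sequences:
            tokens = ["<s>"] + seq  # <s> marks the start of a sequence
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def log_prob(self, seq, vocab_size):
        """Log-probability of a sequence under the smoothed bigram model."""
        tokens = ["<s>"] + seq
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + vocab_size
            total += math.log(num / den)
        return total


benign_lm = BigramModel(BENIGN)
malware_lm = BigramModel(MALWARE)
# Shared vocabulary size, plus one slot for tokens never seen in training.
VOCAB_SIZE = len(benign_lm.vocab | malware_lm.vocab) + 1


def classify(opcodes):
    """Return the label of the model that gives the code the higher probability."""
    b = benign_lm.log_prob(opcodes, VOCAB_SIZE)
    m = malware_lm.log_prob(opcodes, VOCAB_SIZE)
    return "malware" if m > b else "benign"


if __name__ == "__main__":
    print(classify(["push", "mov", "call", "ret"]))
    print(classify(["xor", "loop", "jmp"]))
```

Both models share one vocabulary size so that their smoothed probabilities are comparable; without that, the model with the smaller vocabulary would be favored on unseen tokens.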


Our Pipeline


Experiments & Approaches:

• The CBOW architecture predicts the current word based on the context,

• and the skip-gram predicts surrounding words given the current word.

• «All models are wrong, but some are useful.»
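The difference between the two word2vec architectures shows up in the training examples they are fed. The sketch below only builds those (input, target) examples from a token sequence; it does not train any embeddings. The function names, the window size, and the opcode tokens are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: the current word is the input, each neighbor is a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: the surrounding context is the input, the current word is the target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            pairs.append((context, center))
    return pairs


if __name__ == "__main__":
    opcodes = ["push", "mov", "call"]   # hypothetical token sequence
    print(skipgram_pairs(opcodes))
    print(cbow_pairs(opcodes))
```

For the three-token sequence, skip-gram yields four (word, neighbor) pairs, while CBOW yields one (context, word) example per position.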


Datasets:


• C:\Program Files (x86)\*.exe

• A total of 5800 executables

• All 32-bit

• Mixture of Visual C++, C, ANSI C, MinGW …

• NASM, MASM, FASM


Preprocessing:


Word2vec for weights:



Assembly Distribution:


Splitting data for training:


Creating The Model:


Accuracy of The Model: