A Malware Detection Algorithm Based on Multi-view.doc

资源描述

《A Malware Detection Algorithm Based on Multi-view.doc》由会员分享，可在线阅读，更多相关《A Malware Detection Algorithm Based on Multi-view.doc（6页珍藏版）》请在三一文库上搜索。

1、精品论文A Malware Detection Algorithm Based on Multi-viewFusionGuo Shanqing1, Yu Qixia1, Lin Fengbo1, Wang Fengyu1, Ban Tao25(1. Computer Science and Technology School,Shandong University, JiNan 250101;2. National Institute of Information and Communications Technology,Tokyo 184-8795) Abstract: One of th

2、e major problems concerning information assurance is malicious code. In order to detect them, many existing run-time intrusion or malware detection techniques utilize information available in Application Programming Interface (API) call sequences to discriminate between benign10and malicious process

3、es. Although some great progresses have been made, the new research results of ensemble learning make it possible to design better malware detection algorithm. This paper present a novel approach of detecting malwares using API call sequences. Basing on the fact that the API call sequences of a soft

4、ware show local property when doing network, file IO and other operations, we firstdivide the API call sequences of a malware into seven subsequences, and then use each subsequence to15build a classification model. After these building models are used to classify software, their outputs are combined

5、 by using BKS and the final fusion results will be used to label whether a software is malicious or not. Experiments show that our algorithm can detect known malware effectively. Keywords: Computer Network; Malware Detection; API Call Sequences; Multi-view Fusion; BKS Algorithm200IntroductionWith th

6、e improvement of the techniques used by malwares like virus, trojan and worm, traditional malware detection approaches based on signatures, were not effectively enough to detect variables of known malwares. In order to solve this problem, some researchers started to25use the API call sequences of a

7、program for detecting malwares and made much progress in this field1. Recently, the local property, a phenomenon that described the affinity between API functions in the network, file IO or other operations, has been discovered, but not taken into consideration to design better detection method. Usi

8、ng this property, we design a novel malware detection algorithm based on ensemble leaning methods and the experiments show that our30algorithm has a better classification result.The rest of the paper is organized as follows: in part 2, we briefly review previous approaches closely related to this pa

9、per; In part 3 , we deliver our core technical contribution include how to extract featuresfrom the API call sequences of a program using information gain algorithm and design a malware detection algorithm based on a multi-modal fusion; in part 4, the experiment35shows that our algorithm have a bett

10、er classification results and can identify a malware type effectively whilst part 5 contains our concluding remarks.1RELATED WORKForrest et al. proposed to use fixed-length system calls for distinguishing benign and malicious UNIX process . Later on, Wepsiet al. improved the method using variable-le

11、ngth system40calls2. In a recent work, the authors proposed to extract semantics information by annotating call sequences for malware detection 3. Using flow graphs to model temporal information of system calls has been proposed in 4. Faraz Ahmed proposed Using Spatio-Temporal Information for Malwar

12、e Detection and Analysis 5, but the extracting method of temporal information leaved abackdoor for insidious malware author to fake API call sequences to cover up temporalFoundations: Doctoral Program of Higher Education (No. 20090131120009)Brief author introduction:Guo Shanqing, (1976-), Male, Asso

13、ciate Professor,network security and system security. E-mail: - 6 -45information.The technology used by the malwares is improving continuously. The statistics showed that most of the malwares came from the modification of the old ones 6 and traditional signature based anti-malware technology is not

14、effective enough to detect varieties of the known malwares. In order to solve this problem, some researcher had used machine learning technology for50detecting the polymorphic malwares, for example, Konrad Rieck used SVM to classify the behavior of malwares 7. Similar method went into finite automat

15、on, HMM 8, data mining 9 and neural network 10. Although simple and well performed, different classification method performed large differences to different type of malwares. Each classification method had its own emphasis. As a result, there was not a classification method which performs well for e

16、very kind of55malwares. Fusion of multiple classification method showed well performance. However, on thechosen of the fusion method, there is still improvement space.2MULTI-VIEW MALWARE DETECTION METHOD BASING ON API CALL SEQUENCESThe system is divided into two parts: offline analysis and online an

17、alysis. The offline analysisKernel ModeHookAPI CallTraceFeatureExtractor60part is responsible for constructing classification modal by training with the historical data. The online analysis part is responsible for three functions: getting the API call sequences of an unknown software, extracting fea

18、tures from the API call sequences, and verifying whether it is amalicious software. We are now going into details with the analysis.Gram 1Gra m2Gra m3APIDatasequencesDataapi1 api2 api3 api4 api5 api6 .Online AnalysisOffline AnalysisBenignVirusClassificationTrojan MaliciousWormTrainingModelMachine Le

19、arningAlgorithm65Fig 1：System Design Figure Fig 2：4-Gram Sequence2.1Getting API Call SequencesAPIs are interface functions that the Microsoft Windows operating system provides for users in dynamic link libraries. It can be run in user space or kernel space.Although there are many70APIs in Microsoft

20、Windows operating system, Nave APIs are enough to show the characteristic of a program. As a result, we focus on the nave APIs in this paper. There are over 900 nave APIs in Microsoft windows operating system and we only consider the most important 229 ones.Table 1API CategoryAPI Category Numbers De

21、scriptionFile Input/Output (I/O) 50 Interface functions of file input and outputDynamic-Link Libraries 7 Interface functions of DLL operationsNetwork Management 16 Interface functions for managing network Memory Management35 Interface functions for managing memory Processes & Threads 39 Interface fu

22、nctions of process operations Registry41 Interface functions of registry operations Windows Socket 41 Interface functions of Windows Socket75In order to catch the API call sequences produced by a running program, we design a hookprogram resident in memory to monitor the running of a program to catch

23、 the critical APIs listed above, which output the API call sequences.2.2Characterization of API Call Sequences80The API call sequences got from the hook program cannot be directly used as the input of a classification method. As a result, we design a feature extraction algorithm using N-Gram and inf

24、ormation gain 11. According to experience and experiment, we set the value of n as 2-6 in n-gram algorithm. After several comparative experiments, we find that 4-gram algorithm performs well and is reasonable in time and resource. For the above reasons, we choose 4-gram algorithm.85The feature extra

25、ction algorithm includes the following process:（1）Extracting the gram of every API call sequence.In 4-gram algorithm, the probability of an API appears only depends on the 3 APIs before it. By doing sliding window operation in every API call sequence, we get a series of 4 length segment sequences. E

26、ach of the segment sequence is called a gram. What we need to do in this90step is to count the probability of each gram appears in an API call sequence. After doing this operation to all the API call sequence, we get a series of 4-gramrecording the probability of a4-gram appears in its corresponding

27、 API call sequence file. Figure 2 shows the 4-gram APIsequences.（2）Calculating the value of information gain for each Gram95The gram we get in step 1 can be treated as a feature. However, there are too many grams, normally over ten thousands and not all the grams are a valuable feature. As a result,

28、 we need to filter out the useful ones. We treat every gram as a feature t and calculate its information gain 15using the formulanIG(t ) = P(Ci ) log2 P(Ci )i =1 .n n + P(t) P(Ci | t) log2 P(Ci | t) + P(t ) P(Ci | t ) log2 P(Ci | t )i =1i =1100The Ci represents a category. In our paper, there are 4

29、different categories, namely benign, virus, trojan and worm, so we have C1, C2, C3 and C4. P (Ci) means the probability of Ci appears. Similarly, P(t) represents the probability of feature t appears in all files.（3）Generating the final signature vectorTo sort the features from step 1 in the order of

30、 its information gain calculated in step 2.105Choose the biggest n ones as signature vector S, namely:S = gram1 , gram2 .gramn ,Then everyAPI call sequence file can be represent by a One-dimensional feature vector R of size n, namely:R = r r r r =1, if grami exsits in file 1 , 2 . n , i 0, if gramno

31、t exsit in filei1101152.3Classification ModelWe designed a multiple classification fusion algorithm (Figure 3) basing on the BKS (behavior knowledge space) proposed by Huang 12.It can be used to discriminate the benign and malicious programs, and it is also well enough for deeply classifying malicio

32、us program into virus, Trojan and worm. This algorithms process procedures were listed bellows:（1） Dividing the API call sequences of a software API call sequence into seven differentsub-sequences according to the API category mentioned in 3.1.（2）Classifying them using machine learning algorithm and

33、 seven results can be got.（3）Fusing them with BKS algorithm or other ensemble methods, and using its fusion result as this softwares label.120V ect o r F eat u r e F ile sC l a s s i f yU s i ng M ach i n e L e a r ni ng M o d e lC l a s s i f y U s i ng B K S3EXPERIMENTA P I S e que nc eC4.5.F ile

34、C4.5.C4.5.E x t r a c t i n g F e a t ur e s by A P I C a t e gor yC4.5.C4.5.C4.5.C4.5.F ile I O D L L N e t w o r k M e m o r y P r o c e s s R e g i s t r y So c k e tB K S C l a s s i f y i ng A l gor i t h mFi n a l R e s u l tFig 3：Multi-View Classification Algorithm1253.1Experiment SamplesTota

35、lly, there are 817 sample programs in our experiment, including benign program, trojan, virus and worm. These programs can be downloaded from web13. We run Microsoft windows XP in VmWare and use API Monitor 14 reference as a “hook” program to catch API call sequences of a program. The constitution o

36、f the samples is shown in Table 2.Table 2Constitution of Experiment SamplesProgram CategoryBenignNumbers of Samples100Trojan193Virus289Worm235Total8171301353.2Performance IndexWe use TP (true positive) Rate, FP (false positive) Rate, Precision, and Accuracy (number of correctly recognized samples/to

37、tal number of samples) as the performance index to judge the classification modal. The terms true positives, true negatives, false positives and false negatives are used to compare the given classification of an item (the class label assigned to the item by a classifier) with the desired correct cla

38、ssification (the class the item actually belongs to). This isillustrated by Table 3.Table 3: Shown of classification performance indexcorrect result / classificationE1 E2obtained E1 result / classification E2tp(true positive) fp(false positive)fn(false negative) tn(true negative)The TP rate, FP rate

39、, Precision, Recall, F-measure and Accuracy are then defined as:140TP Rate=tp , FP Rate=fp , Precision=tp , Accuracy=tp+tntp+fnfp+tntp+fptp+tn+fp+fn1453.3PerformanceWe use LibSVM, Id3, J48, NaiveBayes and SMO to carry out our experiment. For we use10-fold cross-validation, we dont need to distinguis

40、h training samples and testing samples manually. First we check out the capability of these classifications modal to correctly distinguish a program as benign and malicious. The results are shown in Table 4. The “Five” in row API means that five types of api sequences, namely net, process, memory, r

41、eg and socket. The result shows that after fusion by BKS, most of the classification performance (the only exception is NaiveBayes)is improved. Specially, the Id3 and SMO reach a perfect result.150Table 4：Result of Malicious-benign ClassificationTesting method API TP Rate FP Rate Precision Accuracy

42、(%) ClassLibSVM All BKS (LibSVM) Five Id3All BKS (Id3) Five J48All BKS (J48) Five SMOAllBKS (SMO) Five0.998 0.104 0.9840.896 0.002 0.9861.000 0.091 0.9860.909 0.000 1.0000.984 0.117 0.9820.883 0.016 0.8951.000 0.000 1.0001.000 0.000 1.0000.99 0.091 0.9860.909 0.01 0.9330.994 0.039 0.9940.961 0.006 0

43、.961 1.000 0.065 0.99 0.935 0.000 1.0001.000 0.000 1.0001.000 0.000 1.00098.42498.77497.02310097.89898.94999.124100malicious benign malicious benign maliciousbenignmalicious benignmalicious benign malicious benign malicious benign malicious benign155We also carry out experiments for other fusion met

44、hod, namely Vote. The results are shown in Table 5. The performance of BKS is better than that of Vote except for using NaiveBayes as the base classifier.Table 5: Results of Different Fusion MethodBase Classifier Fusion TP Rate FP Rate Precision Accuracy (%) ClassLibSVMVoteBKS Vote0.998 0.104 0.9840

45、.896 0.002 0.9861.000 0.091 0.9860.909 0.000 1.0000.994 0.104 0.9840.896 0.006 0.95898.42498.77498.074malicious benign malicious benign maliciousbenignId3BKS 1.000 0.000 1.000 100 malicious1.000 0.000 1.000 benignJ48VoteBKS Vote0.986 0.117 0.9820.883 0.014 0.9070.994 0.039 0.9940.961 0.006 0.9611.000 0.078 0.9880.922 0.000 1.00097.19898.94998.949malicious benign malicious benign maliciousbenignSMOBKS 1.000 0.000 1.000 100 malicious1.000 0.000 1.

展开阅读全文