
      SmartSeed:Smart seed generation strategy for fuzzing testing

2021-09-24

      LV Chen-Yang, LI Yu-Wei, JI Shou-Ling

      (College of Computer Science and Technology, Zhejiang University, Hangzhou 310027,China)

Abstract: Mutation-based fuzzers mutate the initial seed files to obtain a number of inputs, which are used to test the application in order to trigger potential crashes. As shown in the existing literature, seed selection is crucial for fuzzing efficiency. However, current seed selection strategies do not appear to perform better than randomly picking seed files. Therefore, a novel and generic system, named SmartSeed, is proposed to generate seed files for efficient fuzzing. We evaluate SmartSeed along with American Fuzzy Lop (AFL) on 12 open-source applications with input formats of mp3, bmp or flv. We also combine SmartSeed with different fuzzers to examine its compatibility. Extensive experiments show that SmartSeed has the following advantages: ① it can generate seeds with different input formats and significantly improves the fuzzing performance on most applications; ② SmartSeed is compatible with different fuzzers. In total, SmartSeed finds more than twice as many unique crashes and 5 040 extra paths compared with the existing best strategy on the 12 applications. From the crashes found by SmartSeed, we discover 16 unreported CVEs.

      Key words:fuzzing test;vulnerability detection;seed generation

Fuzzing is one of the most prevalent vulnerability discovery solutions. A fuzzing tool uses a simple yet efficient approach to detect vulnerabilities: it generates a great number of input files, tests the objective application with these inputs, and detects whether the application behaves abnormally. If the objective application crashes, the fuzzing tool stores the input file that triggers the crash. In this way, the fuzzing tool can detect crashes and obtain the input files that trigger them. Users can then study these input files to figure out whether the objective application has vulnerabilities.

Nevertheless, modern applications are larger and more complex, which makes it harder to apply fuzzing tools effectively. To improve fuzzing efficiency, researchers typically follow two directions: designing better fuzzing tools and using a better fuzzing seed set.

Following the first direction, a number of works focus on designing better tools, which can be classified into the following two categories:

Generation-based fuzzers can learn the input format of the objective application[1-5] and then generate highly-structured input files based on that format. While other fuzzing tools spend most of their time passing the format check, generation-based fuzzers use the generated highly-structured files to test the execution of applications. Thus, generation-based fuzzers are better at detecting crashes in applications that check the syntax features and semantic rules of input files.

Mutation-based fuzzers do not consider the input format of the objective application. Following genetic algorithms, these tools mutate the initial input seed set by byte flipping, crossover and so on. Then, to discover crashes, the tools feed both the initial input seed set and the mutated input files to the application. Since they require little prior knowledge of the objective application, mutation-based fuzzers work quite efficiently. However, they sometimes get stuck because of their simple strategy. To improve the efficiency of mutation-based fuzzers, a number of studies combine them with other vulnerability detection technologies such as taint analysis and symbolic execution[6-10]. Several studies also indicate that steering fuzzers to improve coverage and test low-frequency paths can improve fuzzing efficiency[11-14].
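To make the mutation operators concrete, the following is a minimal, illustrative Python sketch of bit flipping and crossover; it is a toy, not AFL's actual implementation, and the function names are ours.

```python
import random

def bit_flip(data: bytes, n_flips: int = 1) -> bytes:
    """Flip n random bits of the input, as mutation-based fuzzers do."""
    buf = bytearray(data)
    for _ in range(n_flips):
        pos = random.randrange(len(buf) * 8)  # pick a random bit position
        buf[pos // 8] ^= 1 << (pos % 8)       # flip that bit
    return bytes(buf)

def crossover(a: bytes, b: bytes) -> bytes:
    """Splice a prefix of one input onto a suffix of another."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

seed = b"RIFF....example seed"
mutant = bit_flip(seed, n_flips=2)
child = crossover(seed, mutant)
```

Both the original seed and such mutants are then fed to the objective application, and any input that crashes it is kept.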

The second direction for improving fuzzing efficiency is to use a better seed set. Allen and Foote presented an algorithm that considers parameter selection and automated selection of seed files[15]. Woo et al. developed a framework to evaluate 26 randomized online scheduling algorithms that schedule better seeds for fuzzing[16]. In 2014, Rebert et al. evaluated 6 seed selection strategies for fuzzing and showed how to select the best one[17].

However, current seed selection strategies have some deficiencies. For instance, as shown by our experiments and the above-mentioned work[17], current seed selection strategies perform unstably in many application scenarios. Further, in many cases they have no evident advantage over random seed selection.

Therefore, to solve the problem of how to obtain a better seed set for applications without a highly-structured input format, we come up with the following heuristic questions:

      Q1: Can we obtain valuable seeds in an effective manner? As we discussed before, existing solutions cannot yield effective seeds in many scenarios. Therefore, our primary goal is to study how to obtain the effective seeds for fuzzing tools.

Q2: Can we obtain valuable seeds in a robust manner? Even assuming we have figured out an effective seed strategy, it is still not robust if the strategy can only obtain valuable seeds for one specific input format, or if we have to re-run the strategy every time we want to fuzz a new application with the same input format. Thus, our second goal is to design a robust seed strategy: it should be able to obtain valuable seeds for multiple input formats, and we should only need to run it once for each input format. The obtained files can then improve the fuzzing performance for other applications with that input format.

Q3: Can we obtain valuable seeds in a compatible manner? Our strategy would be of limited use in practice if it could only obtain valuable seeds for specific fuzzing tools. Therefore, we aim to design a seed strategy that can be combined with different fuzzing tools and improve their performance.

Following the above heuristic questions, instead of studying how to select seeds, we leverage state-of-the-art machine learning techniques to automatically generate effective seeds. We present a novel seed generation system named SmartSeed, shown in Fig.1.

      Fig.1 Workflow of SmartSeed

      Basically, the workflow of SmartSeed consists of three stages:

(1)Preparation. SmartSeed is a machine learning-based system. To bootstrap SmartSeed, we need to prepare the necessary training data. Specifically, we collect some regular files and employ them to fuzz some applications with widely used fuzzers like AFL[18]. Then, we collect the valuable input files that trigger unique crashes or new paths as the training data. The details are shown in Section 1.2.

      (2)Model Construction. To make SmartSeed easily extendable in practice, we propose a transformation mechanism to encode the raw training data into generic matrices, which are the training set of a generative model for seed generation. Then, leveraging the generative model, we generate effective files as seeds of fuzzers.

      (3)Fuzzing. Leveraging the seeds generated from the constructed generative model, we use fuzzing tools (e.g., AFL) to discover crashes of objective applications.

      With the help of SmartSeed, users can efficiently generate a valuable seed set. Our evaluation on 12 open source applications demonstrates that SmartSeed significantly improves the fuzzing performance compared to state-of-the-art seed selection strategies. The main contributions can be summarized as follows:

We present a machine learning-based system named SmartSeed to generate valuable binary seed files for fuzzing applications without requiring a highly-structured input format.

Combining it with AFL, we evaluate the seed files generated by SmartSeed on 12 open-source applications with input formats such as mp3, bmp or flv. Compared with state-of-the-art seed selection strategies, SmartSeed finds 608 more unique crashes and 5 040 more new paths than the existing best strategy in total. We then combine SmartSeed with three popular fuzzing tools, i.e., AFLFast, honggfuzz and VUzzer, and demonstrate its compatibility.

We further analyze the seed sets generated by SmartSeed and other state-of-the-art seed selection strategies, and present several interesting findings to enlighten fuzzing research. We find that execution speed is an improper indicator for discovering crashes, whereas larger seed files help fuzzing tools discover more unique paths. What is more, when visualized by t-SNE[19], the seed files generated by SmartSeed are closer to the most valuable files that trigger crashes or paths. Meanwhile, the SmartSeed files that trigger unique crashes cover the largest area, which implies that the generated files are easier to mutate into more scattered valuable files. In total, SmartSeed finds 23 unique vulnerabilities in 9 applications, including 16 unreported ones.

1 SmartSeed

      1.1 System architecture

      The core idea of SmartSeed is to construct a generative model. Then, we use this model to generate valuable files as the input seed set of fuzzing tools. As shown in Fig.2, the whole architecture of SmartSeed can be divided into 4 procedures:

      Fig.2 Architecture of SmartSeed

      (1)Training data collection: We introduce a criterion to measure the value of input files and present a method to obtain a training set for SmartSeed (Section 1.2).

      (2)Raw data conversion: To deal with files with unfixed formats or unfixed file sizes, we convert binary files of raw training data to a uniform type of matrices (Section 1.3).

      (3)Model construction: Taking the matrices as training data, we construct a seed generative model based on Wasserstein Generative Adversarial Networks (Section 1.4).

      (4)Inverse conversion: Based on the generative model, we generate new matrices and convert them into proper input files, which is the reverse process of Procedure (2) (Section 1.5).

      In our system, the employed fuzzer can be flexible, i.e., SmartSeed can be combined with most existing mutation-based fuzzers. Since AFL is one of the most efficient existing fuzzers[18], by default, we select AFL to be the fuzzer in our implementation.

      1.2 Training data collection

      To construct a machine learning model for generating valuable seed files, we need to obtain an initial training set first.

Certainly, we must ensure that the input files in the training set are really valuable. Otherwise, SmartSeed may not be able to learn useful features of "valuable input files" and further generate such files. Therefore, we first clarify what valuable input files are. Specifically, in our implementation, we define valuable files as the input files that trigger unique crashes or unique paths of applications. The reasons are as follows:

      (1)Since the ultimate goal of fuzzing is to detect more crashes, the input files are considered as valuable if they can trigger unique crashes of objective applications.

(2)According to the existing research[11-14], increasing the coverage and the depth of fuzzing paths makes it more likely to discover crashes. Hence, files triggering new paths are also valuable from this perspective.

Intuitively, we might employ existing seed selection strategies to select a few valuable input files as the training set. However, according to the existing research[17] and the results of our experiments, current seed selection strategies seem to be unreliable. We then realize that fuzzing tools such as AFL store the input files that trigger unique crashes or paths, which perfectly coincides with our needs. Thus, we adopt the following training data collection strategy: we first use regular input files collected from the Internet to fuzz applications with the same input format. Then, we gather the valuable input files, which trigger unique paths or unique crashes of those applications, as the training set of SmartSeed.
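As a sketch of this collection step (an illustration, not the authors' tooling): stock AFL keeps path-discovering inputs under `queue/` and crash-triggering inputs under `crashes/` in its output directory, so harvesting the training set amounts to walking those directories and keeping files in the size band used for training (12 KB to 17 KB, per Section 2.2):

```python
import os
import shutil

def collect_valuable_files(afl_out_dirs, dest, min_kb=12, max_kb=17):
    """Copy AFL-saved inputs (queue = new paths, crashes = unique crashes)
    whose size falls in the training range into a single directory."""
    os.makedirs(dest, exist_ok=True)
    count = 0
    for out_dir in afl_out_dirs:
        for sub in ("queue", "crashes"):
            d = os.path.join(out_dir, sub)
            if not os.path.isdir(d):
                continue
            for name in sorted(os.listdir(d)):
                path = os.path.join(d, name)
                if not os.path.isfile(path):
                    continue
                kb = os.path.getsize(path) / 1024.0
                if min_kb <= kb <= max_kb:
                    shutil.copy(path, os.path.join(dest, "%06d" % count))
                    count += 1
    return count
```

The returned count is the number of training files gathered across all fuzzing runs.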

      By collecting a training set in this way, we further have the following advantages:

      (1)We can accurately evaluate the value of the input files. The input files in the training set can certainly detect unique paths or trigger unique crashes of some applications. Thus, they carry useful features for learning.

(2)During the fuzzing process, we find that the formats of many files that trigger crashes or new paths are corrupted. The reason is that files are randomly mutated by the employed genetic algorithm, and corrupted files seem more likely to trigger unique crashes and paths. However, it is difficult to gather many corrupted files from the Internet, while SmartSeed can be trained to generate many corrupted files as expected.

      1.3 Raw data conversion

      To construct a generic seed generation model, we propose a mechanism to convert the raw input files in the training set to a uniform type of matrices. The reasons for conducting such conversion are as follows.

First, one of our objectives is to make SmartSeed handle multiple input formats and unfixed file sizes. However, it is inconvenient to adjust the data read mode for different kinds and sizes of files, so we need a uniform method to read data from the training set. Second, the formats of many files in the training set are corrupted: a normal reading manner, such as reading a bmp picture as a three-dimensional matrix, may not work in many scenarios. Third, since machine learning algorithms work better on quantitative values in matrices than on arbitrary value types, we need a way to convert multiple types of files into a uniform type of matrices. Finally, we want to expose the magic bytes in the binary form of the training files, because a machine learning model can then more easily learn the features of the magic bytes that control the code execution paths. Thus, the files should be read in binary form and converted to uniform matrices. Below, we introduce the main procedures of raw training data conversion, which are shown in Fig.3.

      Fig.3 Training data conversion

(1)Since every type of file can be read in binary form, we read the file in binary form to obtain the binary data.

(2)SmartSeed encodes the binary data as a Base64 string in order to recognize the end of the binary data. Thus, we obtain a character string that consists of the characters in Fig.4.

(3)Since the character string may contain 65 different characters, SmartSeed converts each character in the string to a number according to the correspondence in Fig.4. We thereby obtain the number string shown in Fig.3.

Fig.4 Converting Base64 characters to numbers from 0 to 64

(4)In this procedure, we convert the number string to a matrix. To economize the number of elements in a matrix, SmartSeed converts every six numbers (n0, n1, n2, n3, n4, n5) into a large integer a. Since neural network algorithms are not sensitive to the characteristics of large numbers, SmartSeed divides the large integer a by 100 000 000 000 to normalize it into [0, 1). Finally, 0 is appended at the end of the matrix if there are empty elements.

      Therefore, we can convert raw files with corrupted formats or unfixed file sizes into a uniform type of matrices, which can significantly improve the extendability and compatibility of SmartSeed.
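The four procedures above can be sketched in Python. The text does not spell out how six numbers become one integer; one reading consistent with the stated normalization is base-65 packing, since 65^6 - 1 (about 7.5×10^10) stays below the 10^11 divisor, so every packed value lands in [0, 1). The symbol table and function name below are our assumptions:

```python
import base64

# The 64 Base64 characters plus '=' (the end marker) give 65 symbols,
# mapped to 0..64 as in Fig.4 (assumed ordering).
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/="
)
CHAR2NUM = {c: i for i, c in enumerate(B64_ALPHABET)}
DIVISOR = 100_000_000_000  # 10^11; 65**6 - 1 < 10^11, so cells lie in [0, 1)

def file_bytes_to_matrix(data: bytes, dim: int = 64):
    """Read a file's bytes, Base64-encode them, map characters to 0..64,
    pack every six numbers into one integer, normalize to [0, 1), and
    zero-pad into a dim x dim matrix."""
    s = base64.b64encode(data).decode("ascii")
    nums = [CHAR2NUM[c] for c in s]
    while len(nums) % 6:              # pad the number string to a multiple of 6
        nums.append(0)
    cells = []
    for i in range(0, len(nums), 6):
        a = 0
        for n in nums[i:i + 6]:       # interpret six symbols as a base-65 integer
            a = a * 65 + n
        cells.append(a / DIVISOR)
    assert len(cells) <= dim * dim, "file too large for the chosen matrix size"
    cells += [0.0] * (dim * dim - len(cells))
    return [cells[r * dim:(r + 1) * dim] for r in range(dim)]
```

With dim = 64 this matches the 64×64 matrices used in Section 2.2.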

      1.4 Model construction

One of the best existing generative models is the Generative Adversarial Network (GAN), which has been widely used for unsupervised learning since 2014[20]. GAN is a framework consisting of a generative model and a discriminative model. The generative model tries to generate fake data similar to the real data in the training set, while the discriminative model tries to distinguish the fake data from the real data. The two models work alternately, training and thereby improving each other. As a result, the generative model eventually generates data that are too realistic to be distinguished by the discriminative model. Usually, the generative model provided by GAN can generate more realistic data than other algorithms. However, training a GAN is unstable, and GANs suffer from problems such as mode collapse.

      Recently, Martin A et al. presented the Wasserstein GAN (WGAN) model[21]. Unlike other GAN models, WGAN improves the stability of learning significantly. It is also much easier to train a WGAN model. In addition, WGAN can solve the problems of GAN like mode collapse in most application scenarios.

      Hence, for our purpose, we employ WGAN to learn the characteristics of valuable files and then generate valuable seed files. We further highlight the benefits of our selection below.

(1)WGAN is one of the best generative models. It is widely used in data generation, e.g., generating high-quality pictures. In addition, it is easier to train WGAN without introducing the problems of traditional GAN models such as mode collapse.

      (2)WGAN is one of the unsupervised learning models which can learn the features of the training set by itself. Thus, we do not need to pay special attention to feature selection, which is very time-efficient.

(3)We can freely choose an appropriate machine learning model as the generative model and/or the discriminative model of WGAN. Specifically, according to our analysis, the Multi-Layer Perceptron (MLP) focuses more on every quantitative value in a matrix, while the Convolutional Neural Network (CNN) pays more attention to the global features of a matrix. To construct a better model, we test the performance of both neural networks. For our application, MLP works better than CNN as the generative and discriminative models of WGAN, and it also requires less training time. Consequently, we choose MLP as the model in SmartSeed.

Note that a detailed description of WGAN can be found in [21]. Since our focus in this paper is to construct a seed generation strategy and demonstrate its effectiveness, we leave the development of an improved machine learning model as future work. Furthermore, users of SmartSeed can select an alternative machine learning model as the generative model depending on the application scenario.
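To illustrate the WGAN training dynamics in isolation, here is a deliberately tiny scalar toy, not the paper's MLP-based model: the generator emits a constant sample theta, the critic is a linear scorer f(x) = w·x, and the critic's weight is clipped after each update, which is the WGAN weight-clipping trick. The alternating updates drive theta toward the mean of the "real" data.

```python
def train_toy_wgan(real_mean=0.8, clip=0.5, lr=0.05, n_critic=5, steps=400):
    """Toy scalar WGAN loop: generator g(z) = theta, critic f(x) = w * x.
    The critic takes n_critic ascent steps per generator step, and its
    weight is clipped to [-clip, clip] to keep f Lipschitz-bounded."""
    theta, w = 0.0, 0.0
    for _ in range(steps):
        for _ in range(n_critic):
            # critic ascends E[f(real)] - E[f(fake)]; d/dw = real_mean - theta
            w += lr * (real_mean - theta)
            w = max(-clip, min(clip, w))   # WGAN weight clipping
        # generator ascends E[f(fake)] = w * theta; d/dtheta = w
        theta += lr * w
    return theta, w

theta, w = train_toy_wgan()
```

After training, theta oscillates around the real mean 0.8; in SmartSeed the same alternating scheme is applied to MLPs over the 64×64 seed matrices.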

      1.5 Inverse conversion

      In this subsection, we introduce how to employ the generative model of SmartSeed to obtain an effective input seed set. Since the training set of SmartSeed is a number of matrices, the generative model is trained to generate similar matrices. To obtain binary input files for fuzzing, we have to convert the generated matrices to binary files. Thus, we do the reverse of the abovementioned procedures in Section 1.3.

To be specific, the first step is to restore the [0, 1) elements of the matrix to the original integers. Second, convert each integer to six numbers (from 0 to 64), i.e., the number string shown in Fig.3. Then, convert the numbers to the Base64 characters and "=", i.e., the character string shown in Fig.3. Finally, decode the Base64 character string into a binary file and store it locally. We thereby obtain the input files for the subsequent fuzzing.
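A self-contained sketch of this inverse conversion follows, under the same assumptions as our reading of Section 1.3 (base-65 packing of six symbols per matrix cell, with '=' mapped to 64 as the end marker). Data whose length is a multiple of 3 produces no '=' in Base64, so a production version would need to store the length explicitly; the sketch simply raises in that case:

```python
import base64

# Same 65-symbol table as the forward conversion (Fig.4, assumed ordering).
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/="
)
DIVISOR = 100_000_000_000  # 10^11, matching the normalization step

def matrix_to_file_bytes(matrix):
    """Reverse the raw-data conversion: de-normalize each matrix cell to
    an integer, expand it into six base-65 digits, map the digits back to
    characters, cut at the '=' end marker, and Base64-decode."""
    nums = []
    for row in matrix:
        for cell in row:
            a = int(round(cell * DIVISOR))
            digits = []
            for _ in range(6):
                digits.append(a % 65)
                a //= 65
            nums.extend(reversed(digits))
    s = "".join(B64_ALPHABET[n] for n in nums)
    # '=' marks the end of the payload (the stated reason for using
    # Base64); everything after the last '=' is zero padding.
    end = s.rfind("=")
    if end == -1:
        raise ValueError("no '=' end marker; the data length must not be "
                         "a multiple of 3 in this sketch")
    return base64.b64decode(s[:end + 1])
```

Running a generated matrix through this function yields a binary seed file ready for fuzzing.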

      2 Implementation & evaluation

      In this section, we implement SmartSeed and compare its performance with state-of-the-art seed selection strategies.

      2.1 Datasets

To evaluate the performance of SmartSeed, we employ 12 open-source Linux applications with input formats of mp3, bmp or flv, as shown in Table 1. We choose these applications mainly for three reasons.

      Table 1 Objective applications

Second, these applications are popular and important open-source programs. For instance, mp3gain is a popular tool to analyze and adjust the volume of mp3 files; according to SourceForge[22], mp3gain for Linux is downloaded 639 times per week on average. magick is the main program of ImageMagick, whose mirror has been starred 1.7 k times on GitHub. bmp2tiff is the conversion tool of libtiff, which is employed as a fuzzing dataset in many other studies[23]. As for flv, avconv and ps2ts are popular media tools provided by libav and tstools, respectively.

Third, these applications have different code logic and functionalities, and thus are sufficiently representative. For example, magick is used to browse bmp, sam2p is used to convert bmp into eps, bmp2tiff is used to convert bmp into tiff, and exiv2 is a cross-platform C++ library and command-line utility to manage image metadata. The four applications mentioned above are provided by different groups with different code logic. Although mp42aac is a tool in Bento4 that deals with mp4 files, we also use SmartSeed to generate flv files as the initial seed set to fuzz it. This is mainly for evaluating the robustness of SmartSeed.

      2.2 SmartSeed implementation

      To implement SmartSeed, the first step is to collect the training set. As we mentioned before, we fuzz some applications and collect the valuable input files.

In consideration of training efficiency, if the training files are too large, we have to use a big matrix to store each file, which leads to a longer training time. On the other hand, it is hard to collect small files for meaningful multimedia data in formats such as mp3 and flv. What is more, small files may not carry enough features for machine learning. Therefore, we need to determine a proper size for the training files.

Considering both training efficiency and training effectiveness, we prefer to employ files smaller than 17 KB to construct the generative model of SmartSeed. To accommodate files of 17 KB or less, 64×64 matrices are sufficient (their maximum storage is 18 KB). On the other hand, it would be hard for SmartSeed to learn meaningful features from overly sparse matrices, e.g., a matrix in which more than 35% of the elements are null, since such matrices carry less valuable information. Therefore, we finally decide to collect valuable input files with sizes between 12 KB and 17 KB as the training data in our implementation. This setting ensures a fast training rate, makes it easy to collect valuable files and meanwhile ensures that sufficient information is carried.
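The 18 KB capacity figure can be checked with a few lines of arithmetic, assuming the six-symbols-per-cell packing described in Section 1.3:

```python
# Why a 64x64 matrix holds any file up to 17 KB.
cells = 64 * 64                    # 4096 matrix elements
b64_chars = cells * 6              # six Base64 symbols packed per element
raw_bytes = b64_chars * 3 // 4     # Base64 encodes 3 raw bytes per 4 chars
# raw_bytes is 18432 bytes = 18 KB, the stated maximum storage

file_bytes = 17 * 1024                       # a 17 KB input file
needed_chars = 4 * ((file_bytes + 2) // 3)   # Base64 length incl. padding
# needed_chars (23 212) fits within the 24 576 available symbols
```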

Note that easy extendability is one of SmartSeed's advantages: the size of the input files in the training set is adjustable in practice. If users want to use smaller files as the training set, they may pack fewer Base64 characters into each number or use smaller matrices. Conversely, if users want to train SmartSeed on bigger files, they may employ a bigger matrix to store the Base64 string.

Now, we are ready to collect the training data. The collection processes are conducted on 16 identical virtual machines, each with an Intel i7 CPU, 4.5 GB memory and an Ubuntu 16.04 LTS system, and last for a week per input format. For applications with the mp3 format, we employ AFL to fuzz mp3gain and ffmpeg, and collect 20 646 valuable input files with sizes from 12 KB to 17 KB as the training set. For bmp, we use AFL to fuzz magick and bmp2tiff, and collect 24 101 input files as the training set. For applications with the flv format, we employ AFL to fuzz avconv and flvmeta, and collect 21 688 input files between 12 KB and 17 KB that discover new paths or unique crashes.

Based on the collected training data, we implement the prototype of SmartSeed. The core function of SmartSeed is implemented in Python 2.7, and the WGAN model in PyTorch 0.3.0. We use Adam as the optimization algorithm of the WGAN model. We then run several tests to decide the hyperparameters, such as how many times to train the discriminative model after each training step of the generative model. During the training process, we gradually decrease the learning rate of the optimization algorithm from 5×10⁻⁴ to 5×10⁻¹³. It takes around 38 h to train the generative model of SmartSeed for 100 000 iterations on a server with 2 Intel Xeon E5-2640 v4 CPUs running at 2.40 GHz, 64 GB memory, a 4 TB HDD and a GeForce GTX 1080Ti GPU card. Note that we only train the generative model once for each format; the model can then generate the seed set for the four objective applications with that format.

      2.3 Effectiveness

      2.3.1 Comparison strategies

To compare SmartSeed's performance with that of state-of-the-art seed selection strategies, we implement the following methods.

      Random. Under this scheme, we randomly select files from the regular input files that are usually downloaded from the Internet. Since this is the most common seed selection strategy in practice, we take random as the baseline seed strategy in our experiments.

AFL-result. We randomly select seeds from the input files saved by AFL, which are also the files used for training SmartSeed. Since these files are certain to trigger either unique crashes or new paths during the fuzzing process, they may yield outstanding performance as a seed set. As our system is trained on these files, AFL-result can serve as the baseline for SmartSeed.

Peachset. The fuzzing framework Peach provides a seed selection tool named MinSet[24], whose workflow is as follows: (1) MinSet inspects the coverage of each file in the full set, which might be collected from the Internet; (2) it sorts the files by their coverage in descending order; (3) MinSet initializes the seed set as an empty set, denoted by peachset; (4) MinSet checks the coverage of the files in order, and if a file improves the coverage of peachset, the file is added to peachset.
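The four-step workflow above amounts to a greedy coverage-based selection, which can be sketched as follows (an illustration of the algorithm, not Peach's code; coverage is modeled as a set of covered branches per file):

```python
def build_peachset(coverage):
    """Greedy MinSet-style selection: 'coverage' maps each candidate file
    name to the set of branches it covers."""
    # (2) sort files by coverage size, descending
    ordered = sorted(coverage, key=lambda f: len(coverage[f]), reverse=True)
    peachset, covered = [], set()
    # (3)-(4) start empty; keep a file only if it adds new coverage
    for f in ordered:
        if not coverage[f] <= covered:
            peachset.append(f)
            covered |= coverage[f]
    return peachset
```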

Hotset. hotset contains the files that discover the most unique crashes and paths within t time[17]. To construct hotset, first, use each file, which may be collected from the Internet, as the initial seed set to fuzz an application for t seconds. Second, sort the files in descending order by the number of discovered crashes and paths. Third, select the top-k files to constitute hotset. In our experiments, we fuzz each file for 240 seconds.
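The hotset ranking step reduces to a top-k selection over per-file fuzzing results, sketched below (our names; each file's value is the crash and path counts from its own t-second run, with t = 240 s in the experiments):

```python
def build_hotset(results, k):
    """'results' maps each candidate file to (unique_crashes, unique_paths)
    found when the file alone seeds a t-second fuzzing run; keep the
    top-k files by combined count."""
    ranked = sorted(results, key=lambda f: sum(results[f]), reverse=True)
    return ranked[:k]
```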

AFL-cmin. In order to select a better seed set, AFL provides a tool named afl-cmin. Its core idea is to filter out redundant files that only exercise already-discovered paths: afl-cmin tries to find the smallest subset of the full set that still has the same coverage as the full set.

      2.3.2 Results

Now, we evaluate the effectiveness of the different seed generation/selection strategies. For the schemes that need an initial dataset to bootstrap the seed selection process, we collect a dataset of 4 600 input files for each input format from the Internet. For SmartSeed, we generate seeds with its generative model. All the following experiments are conducted on identical virtual machines with an Intel i7 CPU, 4.5 GB memory and an Ubuntu 16.04 LTS system.

We examine the relationship between the number of seed files and the fuzzing effectiveness. The results are shown in Table 2, from which we can make the following observations.

      Table 2 Unique crashes and paths of each objective application when using different number of files as the initial seed set

(1)Because mutation-based fuzzing tools use mutation operators like byte flipping, the variation of the mutated files is slight and slow. Therefore, fuzzing performance suffers if the number of seeds is too small, due to the lack of seed diversity; a proper number of seeds should thus exceed this lower bound. Since the performance of the three fuzzing strategies does not obviously worsen when the initial set contains 50 seeds, we conclude that 50 or more seeds are enough to bootstrap these fuzzing tools.

(2)In regard to the discovery of unique crashes, for mpg321, SmartSeed and AFL-result perform best with 50 seeds, while random discovers the most crashes with 200 seeds. SmartSeed and AFL-result discover the most crashes on sam2p with 100 seeds, while random finds the most crashes on sam2p with 300 seeds. Therefore, there seems to be no obvious relationship between the number of discovered crashes and the number of seeds once there are enough seeds.

(3)As for the discovery of unique paths, when fuzzing sam2p, SmartSeed finds the most paths with 200 seeds, while random discovers the most paths with 50 seeds. For mp42aac, SmartSeed, random and AFL-result find the most paths with 50, 200 and 100 seeds, respectively. It seems that if the number of initial seeds is large, the number of discovered paths decreases a bit. We conjecture the reason is that the initial seeds already detect more unique paths, so fewer new paths remain to be discovered later. Therefore, there is a negligible relationship between the number of seeds in the initial set and the number of discovered paths.

In summary, 50 seeds in the initial seed set seem to be enough to guide the fuzzers, while there is no evident relationship between the number of seeds and the fuzzing performance. Therefore, it is reasonable to use 100 seeds as the initial seed set in our experiments.

Now, we evaluate the effectiveness of the seeds obtained by different strategies. In our experiments, we employ each seed generation/selection strategy to obtain a seed set of 100 files and then feed the seed set to AFL. To control irrelevant variables, we run all the fuzzing experiments for 72 h on identical virtual machines with an Intel i7 CPU, 4.5 GB memory and an Ubuntu 16.04 LTS system.

The primary goal of fuzzing is to discover crashes. Thus, we use the number of discovered unique crashes as the main evaluation criterion. Furthermore, as shown in many existing studies[11-14], higher coverage can improve fuzzing performance. Therefore, our second evaluation criterion is the number of unique paths discovered during the same time. The results are shown in Table 3, from which we can draw the following conclusions.

      Table 3 Unique crashes and paths of each objective application discovered by different fuzzing strategies

(1)For discovering unique crashes, SmartSeed is very effective and performs the best in most evaluation scenarios. When fuzzing mp3 applications, SmartSeed discovers 24 more crashes than the existing best solution for mp3gain, and more than twice as many unique crashes as the existing best solution for mpg123 and mpg321. When fuzzing bmp applications, SmartSeed also yields the best performance. Again, when fuzzing flv applications, SmartSeed discovers the most crashes in total among the evaluated solutions. An exception is flvmeta, on which AFL-cmin discovers the most crashes. After examining the saved valuable files of flvmeta, we conjecture that normal flv files are more likely to find crashes of flvmeta, while SmartSeed tends to generate corrupted files that are more likely to find paths of flvmeta, since much of its training data is corrupted.

(2)For triggering unique paths, again, SmartSeed is very effective. Among the 12 evaluated applications, SmartSeed discovers the most paths on 8 of them, the second-most on one and the third-most on three. Specifically, for mp3, SmartSeed discovers nearly twice as many unique paths as the other fuzzing strategies, except on ffmpeg. It also outperforms the other strategies when fuzzing magick, bmp2tiff and sam2p. Both SmartSeed and AFL-result perform well when fuzzing flv applications.

      (3)Interestingly, SmartSeed + AFL discovers 78 unique crashes on mpg123 and 238 crashes on magick, and is the only evaluated strategy that discovers any crash on magick. This again suggests that SmartSeed is very effective in practice. Moreover, for ffmpeg and avconv, no strategy finds any crash. We conjecture the reasons are that these two applications are very robust and/or our fuzzing time might be too short to trigger crashes.

      (4)For random, AFL-result, peachset, hotset and AFL-cmin, it is difficult to say which is better. In total, the unique crashes and paths discovered by them range over [340, 490] and [11 000, 16 500], respectively. Peachset seems to perform better than the others in more scenarios. However, even the naive random solution outperforms the others on many applications, which is unexpected. For instance, random discovers more crashes than peachset and AFL-result on mpg321; compared with hotset, it discovers more crashes on mpg321 and mp42aac and the same number on bmp2tiff; and compared with AFL-cmin, it discovers more crashes on sam2p and mp42aac and the same number on bmp2tiff.

      In summary, compared with state-of-the-art seed selection strategies, SmartSeed is more stable and yields much better performance. In total, SmartSeed + AFL discovers 124.6% more unique crashes and 30.7% more unique paths than the existing best seed strategy.

      We also visualize the growth of unique crashes for each strategy as shown in Fig.5. We have the following observations.

      Fig.5 Number of crashes over 72 h. X-axis: time (over 72 h). Y-axis: the number of unique crashes

      (1)SmartSeed is very efficient, e.g., when fuzzing mpg321, SmartSeed discovers more unique crashes in 10 h than the other schemes discover in 72 h.

      (2)The existing seed selection strategies exhibit similar performance during the fuzzing processes and their curves usually mix together, which is consistent with the results in Table 3.

      (3)As indicated by the results in Fig.5, it takes time for each curve to become stable, which implies that sufficient time is necessary to enable these strategies to find crashes. This observation is also meaningful for us to conduct fuzzing evaluation in a proper manner.

      In summary, compared with state-of-the-art seed selection strategies, SmartSeed performs considerably better at generating valuable files with multiple input formats. It not only discovers unique crashes faster, but also discovers more of them.

      2.4 Compatibility

      In this subsection, we examine the compatibility and extendibility of SmartSeed. Specifically, we combine SmartSeed with existing popular fuzzing tools and evaluate its performance.

      2.4.1 Fuzzing tools

      In this evaluation, in addition to AFL, we consider the following fuzzing tools.

      AFLFast[11]. AFLFast is a fuzzing tool based on AFL. By using a power schedule to guide the tool towards low-frequency paths, AFLFast can detect many more paths than AFL with the same number of executions.

      Honggfuzz[25]. Honggfuzz is an easy-to-use fuzzing tool provided by Google. Similar to AFL, honggfuzz mutates the files from the initial seed set and uses them for fuzzing. In addition, honggfuzz provides powerful process state analysis by leveraging ptrace.

      VUzzer[12]. VUzzer is a fuzzing tool that focuses on increasing coverage. By prioritizing the files mutated from the input files that reach deep paths, VUzzer can explore more and deeper paths. Unlike AFL, AFLFast and honggfuzz, which count edge coverage, VUzzer uses a dynamic binary instrumentation tool named PIN to calculate block coverage, i.e., VUzzer computes the percentage of discovered unique blocks to measure the coverage of an application.

      2.4.2 Results

      For comparison with SmartSeed, we also combine random and AFL-result with the considered fuzzers. Specifically, we first employ each seed strategy to obtain 100 seed files, then feed each fuzzing tool with the seeds. We use mpg123, mpg321, magick, sam2p, ps2ts and mp42aac as the objective applications. All the fuzzing evaluations last for 72 h and are conducted on the same virtual machines as in Section 2.3.

      We show the results in Table 4. Note that, since honggfuzz fails to build sam2p and mp42aac with its compiler hfuzzcc, it cannot count the discovered unique paths for these two applications. From Table 4, we have the following observations.

      Table 4 Unique crashes and paths of each objective application using different fuzzing tools

      (1)With respect to the discovered unique crashes, when combined with AFLFast, SmartSeed discovers the most crashes in all the evaluation scenarios. When combined with honggfuzz, SmartSeed discovers more than twice as many crashes on mpg321 as the other strategies and is the only one that guides honggfuzz to discover crashes on magick, while on mpg123, sam2p and mp42aac, all three seed strategies perform similarly. When combined with VUzzer, SmartSeed discovers the most crashes on mpg321, sam2p, ps2ts and mp42aac. As for mpg123 and magick, no crash is discovered by any evaluated strategy. The results demonstrate that SmartSeed is compatible with existing popular fuzzing tools, and meanwhile is very effective in fuzzing.

      (2)With respect to the discovered unique paths, SmartSeed also yields the best performance in most of the evaluation scenarios. Specifically, SmartSeed discovers the most unique paths in all the cases when combined with AFLFast. As for honggfuzz, except for the error cases (sam2p and mp42aac), SmartSeed discovers more unique paths than the others on mpg123 and ps2ts; on mpg321, all the strategies discover the same number of paths. When combined with VUzzer, we count the block coverage rate instead of unique paths, as in [12]. From the results, all the strategies discover a similar number of paths on the mp3 and bmp applications, while SmartSeed yields a much better performance on the flv applications.

      Overall, SmartSeed exhibits good compatibility when combined with different fuzzing tools. Meanwhile, the seeds generated by SmartSeed can significantly improve the performance of existing popular fuzzing tools.

      2.5 Vulnerability results

      To identify unique vulnerabilities, we recompile the evaluated applications with AddressSanitizer[26] and use the crash-triggering files discovered by SmartSeed to test the applications. The results are shown in Table 5, from which we make the following observations.
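The replay step described above can be sketched as follows: each crash-triggering input is re-run against an ASan-instrumented build, and the inputs whose stderr contains an AddressSanitizer report are kept for manual study. This is a minimal illustration; the binary path and input list are hypothetical, and real triage would also parse the report to group inputs by bug.

```python
# Minimal sketch of replaying crash inputs against an ASan-built binary.
# An input is "confirmed" if the run emits an AddressSanitizer report.

import subprocess

def triage(binary, input_paths, timeout_s=10):
    """Return the inputs that produce an AddressSanitizer report."""
    confirmed = []
    for path in input_paths:
        try:
            proc = subprocess.run([binary, path], capture_output=True,
                                  timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # hangs are triaged separately, not as ASan crashes
        if b"AddressSanitizer" in proc.stderr:
            confirmed.append(path)
    return confirmed
```

For example, `triage("./mpg123_asan", crash_files)` (hypothetical names) would return the subset of crash files that reproduce a sanitizer-detected fault.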

      Table 5 Vulnerabilities found by SmartSeed (16 new CVE IDs have been received)

      (1)Although we only run each fuzzing process for 72 h, SmartSeed finds 23 unique vulnerabilities in total, 16 of which are previously unreported. We submit them to CVE[27] and have acquired the corresponding CVE IDs. This demonstrates that SmartSeed is not only efficient at finding crashes but can also guide fuzzing tools to find previously undiscovered vulnerabilities.

      (2)In total, we discover 9 types of vulnerabilities. This demonstrates that our system is not limited to specific kinds of vulnerabilities; rather, SmartSeed can guide fuzzing tools to discover various vulnerabilities.

      In summary, SmartSeed is efficient at discovering various types of vulnerabilities. Note that recently, the state-of-the-art fuzzer CollAFL[23] fuzzed the same version of exiv2 for 200 h and discovered 13 new vulnerabilities. Nevertheless, we can still discover new and unreported vulnerabilities on exiv2 using SmartSeed within only 72 h. This implies that SmartSeed is effective at guiding fuzzing tools to find undiscovered vulnerabilities.

      Then, to evaluate the ability of different fuzzing strategies to discover unique vulnerabilities, we use the crash-triggering input files of all six fuzzing strategies to test the applications recompiled with AddressSanitizer[26]. Since no fuzzing strategy discovers any crash on ffmpeg or avconv, and the CVE[27] assignment team does not regard the crash on flvmeta as a vulnerability, we omit their results. The results are shown in Table 6, from which we make the following observations.

      Table 6 Vulnerabilities found by six fuzzing strategies

      (1)For the already-known CVEs, all six fuzzing strategies discover nearly the same number of vulnerabilities on mp3gain and bmp2tiff, and none of them finds any known vulnerability on mpg123, mpg321, sam2p or mp42aac. Since there are no published CVEs for ps2ts (tstools), no strategy finds any known CVE on it. Only SmartSeed finds a known vulnerability on magick, while the others do not find any crash there. Although random and AFL-result find one fewer known CVE on mp3gain than the others, they perform quite well on exiv2, finding two more known CVEs than the others. In conclusion, all six strategies perform closely on discovering the known CVEs: SmartSeed, random and AFL-result find the most, while the others find one or two fewer.

      (2)For previously undiscovered CVEs, SmartSeed finds 16, which is 6 more than the second-best result achieved by random and AFL-result; by comparison, peachset, hotset and AFL-cmin find 7, 8 and 7, respectively. SmartSeed performs particularly well on mp3gain, for which it finds three more undiscovered vulnerabilities than the others. Only AFL-result finds an undiscovered CVE on bmp2tiff and another on ps2ts that are not found by SmartSeed.

      In total, SmartSeed discovers 23 CVEs, 16 of which are unreported. Peachset, hotset and AFL-cmin yield the worst performance at finding unique vulnerabilities, which is unexpected.

      3 Further analysis

      3.1 Execution count

      Some existing coverage-based greybox fuzzing tools, such as AFL, are designed to prioritize mutating seed files that execute fast. This design is based on the intuition that a fast-executing seed is more likely to be mutated into input files that also execute fast; thus, more tests can be conducted on the objective application within a fixed time, and more potential crashes might be found. To measure the average execution speed of the input files, we use the execution count, which is defined as the number of executions conducted by the fuzzer within a time period. Evidently, a larger execution count implies that more input files are used to test the application and that each input file executes faster on average.
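The metric defined above can be computed directly from a fuzzer's execution log. A small hypothetical helper (the timestamps are assumed to be seconds since the start of the campaign):

```python
# Hypothetical helpers for the "execution count" metric: the number of
# test-case executions a fuzzer performs within a time window, and the
# implied average execution speed.

def execution_count(exec_timestamps, t_start, t_end):
    """Number of executions whose timestamps fall in [t_start, t_end)."""
    return sum(t_start <= t < t_end for t in exec_timestamps)

def mean_exec_speed(exec_timestamps, duration_s):
    """Average executions per second over the whole campaign."""
    return len(exec_timestamps) / duration_s

ts = [0.1, 0.5, 1.2, 2.7, 3.3]
print(execution_count(ts, 0, 2))  # -> 3
print(mean_exec_speed(ts, 5.0))   # -> 1.0
```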

      Now, under the same settings as the experiments in Table 3, we analyze the execution count of the seeds generated by SmartSeed and state-of-the-art seed selection strategies. The results are shown in Table 7. Considering the results in Tables 3 and 7 together, we find that although SmartSeed + AFL is the most effective strategy in most of the evaluation scenarios with respect to discovering both crashes and paths, it does not have the largest execution count in most cases, i.e., SmartSeed + AFL discovers more crashes and paths using fewer input files. For instance, SmartSeed + AFL is the only strategy that discovers crashes on magick, yet its execution count is the smallest. Therefore, based on the results in Tables 3 and 7, we conclude that there is no explicit correlation between the execution count and the number of crashes and paths discovered.

      Table 7 Execution count and generation of each objective application under different fuzzing strategies

      The above analysis leads to an interesting insight. It might be a misunderstanding that seed files with faster execution speed can discover more unique crashes or paths. Although such seeds can generate more files and test the objective application more times, most of the generated files may exercise already-known paths and thus discover nothing. Therefore, valuable seeds are more desirable for efficient fuzzing than fast-executing ones.

      3.2 Generation analysis

      For mutation-based fuzzing, we use generation to measure the mutation relationship between an input file and the seed files. For instance, the generation of the initial seed files is “1”, the files that are mutated/generated from the seed files have a generation of “2”, and similarly, the files that are mutated/generated from files with generation k have a generation of k+1.
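The bookkeeping defined above is straightforward to compute from a parent-child mutation log. A sketch (the parent map is a hypothetical fuzzer log, not an artifact any real fuzzer emits in this form):

```python
# Sketch of generation bookkeeping: initial seeds have generation 1,
# and a file mutated from a generation-k file has generation k+1.

def assign_generations(parents, seeds):
    """parents: child -> parent file; seeds: initial seed files."""
    gen = {s: 1 for s in seeds}

    def generation(f):
        if f not in gen:
            # Recurse up the mutation chain until we reach a seed.
            gen[f] = generation(parents[f]) + 1
        return gen[f]

    for child in parents:
        generation(child)
    return gen

parents = {"a1": "seed", "a2": "a1", "a3": "a2"}
gen = assign_generations(parents, ["seed"])
print(max(gen.values()))  # -> 4  (seed -> a1 -> a2 -> a3)
```

The largest value in `gen` is the "largest achieved generation" analyzed in Table 7.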

      Under the same settings as the experiments in Table 3, we show the largest generation number among all the generated files of each seed strategy in Table 7. Considering the results in Tables 3 and 7 together, we find that, in general, a larger achieved generation implies better coverage performance. In 11 of the 12 evaluated applications, the most unique paths are discovered by the fuzzing strategy that has the largest or the second-largest generation. For instance, the largest achieved generation of SmartSeed is much larger than that of the other seed generation/selection strategies on mp3gain, mpg123, magick, bmp2tiff, sam2p and mp42aac, and meanwhile, SmartSeed has better coverage performance on these applications.

      Therefore, based on our results and analysis, we have another interesting insight: it is very likely that there is a positive correlation between the largest achieved generation and the coverage performance of seed strategies. This can explain the significant coverage improvement of SmartSeed.

      In summary, our results demonstrate that execution count is not an appropriate indicator of the value of an input file. Moreover, the results show that larger generations lead to more unique paths, which explains why SmartSeed discovers more unique paths more easily.

      3.3 Distribution

      We would like to examine the distribution of the seeds generated by different strategies, and the distribution of the valuable files mutated from those seeds that trigger crashes.

      To facilitate our analysis, we employ t-SNE[19], one of the best dimensionality reduction algorithms for aggregating similar data together, to visualize the distribution. Specifically, we use SmartSeed, random and AFL-result to generate 300 seeds each, and then visualize the distribution of these seeds in Fig.6. Note that (1) the seeds of AFL-result are selected from the stored files of AFL that trigger crashes or new paths, which are actually the mutated files of random's seeds using AFL, as we described before; (2) we do not consider peachset, hotset and AFL-cmin. This is mainly because the seeds generated by these three schemes highly depend on the initial seed source dataset, and it turns out that the seeds obtained by them exhibit a similar distribution to random. Thus, we do not plot their distributions in Fig.6 in order to keep it readable. From Fig.6, we have the following observations.
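The paper does not specify the feature representation fed to t-SNE. As one plausible illustration, each seed file can be turned into a fixed-length vector via a normalized byte-frequency histogram, which a projector such as `sklearn.manifold.TSNE` could then map to 2-D for plotting; the representation below is an assumption, not the authors' method.

```python
# Illustrative feature extraction for the t-SNE visualization: each seed
# file becomes a 256-dimensional normalized byte-frequency vector.

from collections import Counter

def byte_histogram(data: bytes):
    """Normalized byte-frequency vector for one seed file."""
    counts = Counter(data)
    total = len(data) or 1  # avoid dividing by zero on empty files
    return [counts.get(b, 0) / total for b in range(256)]

vec = byte_histogram(b"\x00\x00\xff\xff")
print(len(vec))    # -> 256
print(vec[0])      # -> 0.5  (two of the four bytes are 0x00)
print(vec[255])    # -> 0.5
```

Stacking these vectors for all 300 seeds per strategy yields the matrix on which a 2-D embedding like Fig.6 can be computed.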

      Fig.6 Similarity distribution of the seeds of SmartSeed, random and AFL-result

      (1)Although AFL-result is mutated from random, its distribution is far away from random. In other words, it takes time for random to discover valuable input files that trigger crashes or new paths. Compared with the distribution of random, the distribution of SmartSeed is closer to AFL-result. This implies that SmartSeed can discover crashes and paths faster than random.

      (2)From Fig.6, we learn that the distribution of AFL-result is more decentralized. Compared with AFL-result, the distribution of SmartSeed is more concentrated and closer to the main part of AFL-result. These facts indicate that when using AFL-result to fuzz an application, it may take longer for an input file to be mutated into another valuable input file because of the discrete distribution. By contrast, the seeds of SmartSeed may be quickly mutated into the main part of valuable input files that trigger crashes and paths of the objective application. Thus, SmartSeed is more effective than AFL-result.

      Then, we leverage t-SNE to analyze the valuable mutated files of SmartSeed+AFL, random+AFL and AFL-result+AFL, i.e., the files mutated from the seeds of SmartSeed, random and AFL-result that trigger unique crashes of the objective applications. Since SmartSeed discovers more crashes than random and AFL-result, it has more points than the other two. Note that the distribution of the valuable files discovered by SmartSeed would only be sparser if we plotted the same number of files. The results are shown in Fig.7, from which we have the following observations.

      (1)The distributions of the valuable files mutated from AFL-result and random are similar. Also, it seems difficult for the seeds of AFL-result and random to be mutated into the distant points that trigger crashes, which limits their fuzzing performance.

      (2)On the contrary, the points of the valuable mutated files of SmartSeed spread over a larger area. This demonstrates that the seed files generated by SmartSeed are more easily mutated into the discrete valuable input files that trigger crashes.

      (3)We can learn from Fig.6 and Fig.7 that although the distributions of AFL-result and random are more discrete than that of SmartSeed, the valuable input files discovered by SmartSeed, which trigger unique crashes of the objective applications, cover a larger area. In other words, the relatively concentrated distribution of the seeds generated by SmartSeed does not prevent them from being mutated into discrete input files that trigger more unique crashes.

      Fig.7 Similarity distribution of valuable mutated input files of SmartSeed, random and AFL-result that triggered crashes

      In summary, the distribution of SmartSeed is dense and closer to the main part of the files that trigger unique crashes and paths, which may explain the better performance of SmartSeed.

      4 Discussion

      In this section, we discuss SmartSeed with respect to the three heuristic questions proposed in Section 1. Then, we remark on the limitations and future work of this paper.

      Q1: Can we obtain valuable seeds in an effective manner? As evaluated in Section 2, SmartSeed significantly outperforms existing state-of-the-art seed selection strategies when fuzzing multiple kinds of applications. Therefore, SmartSeed can generate valuable seeds in an effective manner, as expected.

      Q2: Can we obtain valuable seeds in a robust manner? As shown in Section 1, SmartSeed is designed as a generic system to generate valuable seeds for applications without highly structured input formats. We also implement this system to generate seeds for applications with different formats. As shown in Section 2, once the generative model of SmartSeed is constructed, it can be employed to fuzz many applications with the same or a similar format, and it significantly outperforms state-of-the-art seed selection techniques. Therefore, based on our evaluation, SmartSeed can generate valuable seeds in a robust manner.

      Q3: Can we obtain valuable seeds in a compatible manner? As shown in Section 2, SmartSeed is easily compatible with existing popular fuzzing tools. Furthermore, their fuzzing performance is also improved in most scenarios. Therefore, SmartSeed is easily extendable and compatible.

      As the focus of this paper, SmartSeed in its current form is designed for mutation-based fuzzing. Thus, it is not suitable for generating highly structured input files. We leave this issue as future work and will keep extending our system. From the performance perspective, SmartSeed could be improved in many aspects. For instance, a better generative model could be designed. Also, it is interesting to study the best working scope of different generative models and how to further improve the model training process. From the evaluation perspective, our primary goal in this paper is to demonstrate the performance and usability of SmartSeed. Certainly, more evaluations can be conducted to comprehensively evaluate SmartSeed, e.g., considering more applications and more formats.

      5 Related work

      In this section, we briefly introduce the related work.

      Mutation-based Fuzzing: As a representative of mutation-based fuzzing, AFL[18]employs a novel type of compile-time instrumentation and genetic algorithms to automatically detect unique crashes. Because of its high efficiency and ease of use, AFL is one of the most popular fuzzing tools. Based on AFL, Marcel B et al. presented a fuzzing tool named AFLFast that can detect more paths by prioritizing low-frequency paths[11]. In[28], Marcel B et al. combined a simulated annealing-based power schedule scheme with AFL and presented AFLGo. Xu W et al. improved the fuzzing performance of large-scale tasks on multicore machines with three new operating primitives[29]. In [23], Gan S T et al. presented a solution to mitigate path collisions, which can be combined with AFL and AFLFast to construct CollAFL and CollAFL-fast, respectively. Recently, Lyu C Y et al. analyzed how to select the mutation operators for each application and presented MOPT-AFL based on AFL[30].

      Many works focus on combining fuzzing with other bug detection technologies. Wang T L et al. combined fuzzing with dynamic taint analysis and symbolic execution techniques and presented a fuzzing tool named TaintScope[9]. Istvan H et al. presented a fuzzing tool named Dowser, which takes taint tracking, program analysis and symbolic execution into consideration[8]. Sang K C et al. considered employing white-box symbolic analysis in their design[6]. Stephens N et al. incorporated selective symbolic execution into their fuzzer named Driller[7]. Pham V T et al. combined input model-based approaches with symbolic execution and presented Model-based Whitebox Fuzzing[10]. In [31], She D D et al. proposed that neural network models can be used to improve the efficiency of the fuzzing process and developed the corresponding fuzzer named NEUZZ.

      In order to improve the fuzzing coverage, Rawat S et al. presented an application-aware evolutionary fuzzing tool named VUzzer[12], which uses static and dynamic analysis to analyze the priority of paths. Payer M et al. presented T-Fuzz[13], which uses a dynamic tracing based technique to detect and remove the checks in objective applications. By using scalable byte-level taint tracking, context-sensitive branch count, gradient descent based search, shape and type inference and input length exploration to solve path constraints, Angora presented by Chen P et al. can increase the branch coverage of objective applications[14]. Recently, Chen P et al. proposed an approach named Matryoshka to solve path constraints that involve deeply nested conditional statements[32].

      Note that SmartSeed is designed as a generic system to generate valuable seed files for and to be easily compatible with mutation-based fuzzing. With the seeds generated by SmartSeed, we expect to improve the performance of mutation-based fuzzing.

      Generation-based Fuzzing: Generation-based fuzzers are designed to generate input files with specific input formats. Following this track, Godefroid P et al. used the grammar-based specification of valid highly structured input files to improve the fuzzing performance[2]. Holler C et al. presented LangFuzz to fuzz applications with highly structured inputs such as JavaScript interpreters[3]. Kyle D et al. proposed to use Constraint Logic Programming (CLP) for program generation[4]. With CLP, users can manually write declarative predicates to specify interesting programs. Recently, Wang J J et al. presented a novel data-driven seed generation approach named Skyfire[1], which uses Probabilistic Context-Sensitive Grammar (PCSG) to learn the syntax features and semantic rules from the training set. In [5], Godefroid et al. presented an RNN-based machine-learning technique to generate highly structured format files such as PDF.

      The primary difference between SmartSeed and generation-based fuzzing approaches is that our method is used to generate binary seed files such as image, music and video, while they focus on improving the fuzzing efficiency of applications with highly-structured input formats.

      Other Fuzzing Strategies:As for kernel vulnerabilities, Jake C et al. presented an interface-aware fuzzing tool named DIFUZE to automatically generate inputs for kernel drivers[33]. Han H K et al. proposed a novel method called model-based API fuzzing and presented IMF for testing commodity OS kernels[34]. You W et al. presented a novel technique named SemFuzz[35]that can learn from vulnerability-related texts and automatically generate Proof-of-Concept (PoC) exploits. Petsios T et al. focused on algorithmic complexity vulnerabilities and proposed SlowFuzz to generate inputs to trigger the maximal resource utilization behavior of applications and algorithms[36].

      Seed Selection: To figure out how to select a better initial seed set, Allen D H and Jonathan M F presented an algorithm that considers parameter selection and automated selection of seed files[15]. Woo M et al. developed a framework to evaluate 26 randomized online scheduling algorithms for fuzzing[16]. Alexandre R et al. evaluated six seed selection strategies for fuzzing and presented several interesting conclusions about seed sets[17]. Recently, Nichols N et al. showed that using the files generated by a GAN to reinitialize AFL can potentially find more unique paths in ethkey[37]. However, they neither described their model in detail nor provided a prototype. Thus, we cannot reproduce their model or compare it with SmartSeed.

      Different from existing research, we focus on designing a generic seed generation system leveraging state-of-the-art machine learning techniques. We also demonstrate its effectiveness, robustness and compatibility through extensive experiments.

      GAN Models:In 2014, Ian G et al. presented a new unsupervised learning framework named Generative Adversarial Networks (GAN)[20]. Radford A et al. presented the Deep Convolutional GAN (DCGAN) model to construct better generative and discriminative models[38]. Later, Zhai S F et al. combined GAN with an Energy Based Model (EBM) and proposed VGAN[39]. Chen X et al. combined GAN with mutual information and presented InfoGAN[40]. Recently, Arjovsky et al. presented the WGAN model by using the approximation of the Wasserstein distance as the loss function[21].

      Because GAN and its variants can use unsupervised learning methods to generate more realistic data, they have been widely applied in many application scenarios such as high-quality image generation[41-43] and image translation[44-46]. In this paper, we extend the application of GAN to improving fuzzing performance.

      6 Conclusions

      In this paper, a novel unsupervised learning system named SmartSeed is presented to generate valuable seed input files for fuzzing. Compared with state-of-the-art seed selection strategies, SmartSeed discovers 608 extra unique crashes and 5 040 extra unique paths when we use AFL to fuzz 12 open-source applications with different input formats. We then combine SmartSeed with different fuzzing tools. The evaluation results demonstrate that SmartSeed is easily compatible and meanwhile very effective. To further understand the performance of SmartSeed, we conduct more analysis on the seeds it generates and present several interesting findings that deepen our knowledge of effective fuzzing. Moreover, we find 23 unique vulnerabilities on 9 applications with SmartSeed and have obtained CVE IDs for the 16 unreported ones.
