Sim: a utility for detecting similarity in computer programs

The algorithm is under constant improvement and will become considerably more efficient in the future, while providing almost the same precision as superior audio comparison algorithms. Similarity rapidly scans your music collection and shows all duplicate music files you may have.

The comparison, powered by "acoustic fingerprint" technology, considers the actual contents of files, not just tags or filenames, and thus ensures highly accurate similarity estimation. Alongside eliminating musical doubles, Similarity also provides an advanced quality control function. Really, who wants a collection full of low-bitrate music, badly remastered compositions and songs full of noise? Similarity analyzes files and calculates a quality score based on various technical parameters of each recording, such as bitrate, frequency, amplitude cut value, average amplitude, maximum amplitude and many others.
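
As a purely illustrative sketch of how such a weighted quality score could be put together (the parameter names, reference values and weights below are assumptions, not Similarity's actual formula):

    # Hypothetical sketch of a weighted audio quality score; the parameter names,
    # reference values and weights are illustrative, not Similarity's actual algorithm.
    def quality_score(bitrate_kbps, cutoff_hz, avg_amplitude, max_amplitude):
        components = {
            "bitrate": min(bitrate_kbps / 320.0, 1.0),    # 320 kbps taken as a nominal ceiling
            "cutoff": min(cutoff_hz / 20000.0, 1.0),      # full-bandwidth audio reaches roughly 20 kHz
            "dynamics": min(avg_amplitude / max_amplitude, 1.0) if max_amplitude else 0.0,
        }
        weights = {"bitrate": 0.5, "cutoff": 0.3, "dynamics": 0.2}   # hypothetical weights
        return sum(weights[k] * components[k] for k in components)

    print(quality_score(192, 16000, 0.25, 0.9))   # roughly 0.60 for a mid-quality file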

The program automatically detects common problems with audio files and assigns a corresponding quality mark to each file. It is constantly being developed and improved. In one or more embodiments of the present invention, model parameters from a discriminative modeling technique, such as maximum entropy or the like, can be employed to determine whether a word is important, unimportant, or counter-evidence to a given class, category, or subcluster.

As indicated at block , the feature vectors representing the sentences in a list of sentences may be sorted by frequency, that is, how many times a given sentence appears in the pertinent training corpus.

At block , pair-wise dot products according to equation 3 above can be computed between every pair of unique normalized feature vectors. Such precomputation can be performed for purposes of efficiency. Initial centroids can be created as follows. One can fetch the most frequently occurring remaining sentence, as per block . Of course, on the first pass through the process, this is simply the most frequent sentence. The sentence can then be compared with all existing centroids in terms of the similarity metric, sim.
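
Since equation 3 is not reproduced here, the following sketch assumes it is the dot product of L2-normalized feature vectors, i.e. cosine similarity; the NumPy code below is only an illustration of the precomputation.

    import numpy as np

    # Sketch: precompute pair-wise similarities between the unique normalized
    # feature vectors, assuming sim(x, y) is the dot product of L2-normalized vectors.
    def precompute_similarities(vectors):
        X = np.asarray(vectors, dtype=float)
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / np.where(norms == 0, 1.0, norms)    # normalize, guarding against empty vectors
        return X @ X.T                              # entry [i, j] is sim between vectors i and j

    sims = precompute_similarities([[1, 0, 2], [0, 3, 1], [1, 0, 2]])
    print(sims[0, 2])                               # identical sentences yield similarity 1.0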

As indicated in block , when the comparison is performed, if the parameter sim is not greater than a given threshold for any existing centroid, then the sentence is not well modeled by any existing centroid, and a new centroid should be created using the vector representing the given sentence, as indicated at block . Any appropriate value for the threshold that yields suitable results can be employed; at present, it is believed that a value of approximately 0. works well.

As indicated at block , one can loop through the process until all the sentences have been appropriately examined to see if they should correspond to new centroids that should be created. It is presently believed that the seeding procedure just described is preferable in one or more embodiments of the present invention, and that it will provide better results than traditional K-means procedures, where an original model is split into two portions, one with a positive perturbation and one with a negative perturbation.
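
By way of a non-limiting illustration, the seeding just described might be sketched as follows; the similarity function is assumed to be the normalized dot product above, and the threshold value of 0.6 is an arbitrary placeholder, since the text leaves the value unspecified.

    # Sketch of centroid seeding: walk the sentences from most to least frequent and
    # start a new centroid whenever no existing centroid models the sentence well.
    # The threshold of 0.6 is an arbitrary placeholder, not a value taken from the text.
    def seed_centroids(sorted_vectors, sim, threshold=0.6):
        centroids = []
        for v in sorted_vectors:                    # already sorted by frequency, descending
            if all(sim(v, c) <= threshold for c in centroids):
                centroids.append(v)                 # not well modeled by any centroid: create one
        return centroids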

The seeding process described herein is believed to converge relatively quickly. It will be appreciated that algorithms other than the K-means algorithm can also be employed. As indicated at block , each sentence is assigned to the most similar centroid according to an appropriate similarity or distance metric, for example the sim parameter described above. As indicated at block , the assignment proceeds until all the sentences have been assigned.

As shown at block , an average distortion measure can be computed which indicates how well the centroids represent the members of the corresponding subclusters. One can use the average similarity metric over all sentences. As indicated at block , one can continue to loop through the process until an appropriate criterion is satisfied. For example, the criterion can be that the change in the distortion measure between subsequent iterations is less than some given threshold.

In this case, one must of course perform at least two iterations in order to have a difference to compare to the threshold. Where the change in distortion is not less than the desired threshold, one can proceed to block and compute a new centroid vector for each subcluster, and then loop back through the process just described.

The threshold can be determined empirically; it has been found that any small non-zero value is satisfactory, as convergence, with essentially zero change between subsequent iterations, tends to occur fairly quickly.
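
Putting the assignment, distortion and convergence test together, the loop can be sketched as below; taking the mean of a subcluster's members as its new centroid and the convergence threshold eps are assumptions not fixed by the text.

    # Sketch of the clustering loop: assign each sentence to its most similar centroid,
    # use the average similarity over all sentences as the distortion measure, and stop
    # once the change between iterations falls below a small threshold.
    def cluster(vectors, centroids, sim, eps=1e-4, max_iter=100):
        prev = None
        for _ in range(max_iter):
            assignments = [max(range(len(centroids)), key=lambda i: sim(v, centroids[i]))
                           for v in vectors]
            distortion = sum(sim(v, centroids[i]) for v, i in zip(vectors, assignments)) / len(vectors)
            if prev is not None and abs(distortion - prev) < eps:
                break                               # converged: distortion barely changed
            prev = distortion
            # recompute each centroid (here: the mean of its members) and re-enter the loop
            for i in range(len(centroids)):
                members = [v for v, a in zip(vectors, assignments) if a == i]
                if members:
                    centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        return assignments, centroids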

When the loop is reentered after step , the sentences' feature vectors are reassigned to the closest of the newly calculated centroids and the new distortion measure is calculated. Subclusters can also be pruned at this point: for example, one might choose to delete or merge subclusters that had fewer than five vectors, and one might choose to merge subclusters that were too similar, for example where the similarity was greater than 0.

When subclusters are merged, the distortion measure may degrade, such that it may be desirable to reset the base distortion measure. Members of the deleted subclusters can be re-assigned to the closest un-deleted centroids. It will be appreciated that one goal of the aforementioned process is to make each subcluster more homogeneous.
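
The pruning step can be sketched as follows; the minimum subcluster size of five follows the text, whereas the merge threshold of 0.9 is an assumed placeholder for the truncated value.

    # Sketch: delete subclusters with too few members and merge subclusters whose
    # centroids are too similar; members of deleted subclusters are re-assigned to
    # the closest surviving centroid. min_size=5 follows the text; merge_threshold=0.9
    # is an assumed placeholder for the truncated value.
    def prune_subclusters(centroids, members, sim, min_size=5, merge_threshold=0.9):
        kept, orphans = [], []
        for c, m in zip(centroids, members):
            if len(m) >= min_size:
                kept.append((c, list(m)))
            else:
                orphans.extend(m)                   # subcluster deleted
        merged = []
        for c, m in kept:
            for c2, m2 in merged:
                if sim(c, c2) > merge_threshold:    # too similar: merge into the existing subcluster
                    m2.extend(m)
                    break
            else:
                merged.append((c, m))
        if merged:
            for v in orphans:                       # re-assign members of deleted subclusters
                closest = max(merged, key=lambda cm: sim(v, cm[0]))
                closest[1].append(v)
        return merged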

Thus, one looks for competing subclusters, that is, two subclusters that are similar. Further, one looks for subclusters that contain too many heterogeneous items. Such comparison of subclusters is typically conducted across classes, that is, one checks whether a subcluster in a first class is similar to a subcluster in a second, different class or category. Competing subclusters may be flagged for analysis and need not always be merged.

One response would be to move the subclusters between the classes. After the data within each class has been clustered, each subcluster within each class can be represented by one of the aforementioned centroid vectors. As shown at block , one can compute the pair-wise similarity metrics between centroid vectors across classes. The similarity metric can be given by equation 3 above.

Where the similarity metric is greater than some threshold, for example approximately 0. , the pair of subclusters can be flagged. This comparison and flagging is indicated at blocks , . By way of example, subcluster three of class one might be very similar to subcluster seven of class four, and flagging could take place.
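
A sketch of this cross-class comparison and flagging; the threshold of 0.8 is an assumed placeholder for the truncated value, and sim is again the similarity metric of equation 3.

    # Sketch: flag pairs of subcluster centroids from different classes whose similarity
    # exceeds a threshold; 0.8 is an assumed placeholder for the truncated value.
    def flag_competing_subclusters(class_centroids, sim, threshold=0.8):
        flagged = []
        names = list(class_centroids)               # e.g. {"class1": [c0, c1, ...], ...}
        for a in range(len(names)):
            for b in range(a + 1, len(names)):
                for i, ca in enumerate(class_centroids[names[a]]):
                    for j, cb in enumerate(class_centroids[names[b]]):
                        if sim(ca, cb) > threshold:
                            flagged.append((names[a], i, names[b], j))
        return flagged                              # e.g. ("class1", 3, "class4", 7)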

The flagged pairs may then be highlighted, for example using a graphical user interface (GUI) to be discussed below, as depicted at block . Potential confusion can be handled as follows, optionally using the GUI to examine the data.

It may be determined that, for example, subcluster seven of class four was labeled incorrectly. If this is the case, all the data in subcluster seven of class four could be assigned to class one in a single step, as indicated at block . Thus, in one or more embodiments of the present invention, such reassignment can be accomplished without laboriously re-assigning individual sentences.

It will be appreciated that the foregoing operations can be performed by a software program with, for example, input from an application developer or other user. As noted, inconsistent subclusters can be re-assigned completely to the correct subcluster. However, it will be appreciated that such re-assignment could also take place for less than all the sentences in the subcluster; for example, the subcluster to be reassigned could be broken up into two or more groups of sentences, some or all of which could be moved to one or more other subclusters, or some could be retained.

As indicated at blocks , , it may be that the confusion between the subclusters is inherent in the application. In such a case, a disambiguation dialog may be developed as described above. Where no incorrect labeling is detected, no reassignment need be performed; further, where no confusion is detected, no disambiguation need be performed. Yet further, where the similarity metric does not exceed the threshold in block , the aforementioned analyses can be bypassed. One can then determine, per block , whether all pairs have been analyzed; if not, one can loop back to the point prior to block . If all pairs have been analyzed, one can proceed to block , and determine whether the number of conflicts detected exceeds a certain threshold.

This threshold is best determined empirically by investigating whether performance is satisfactory and, if not, applying a more stringent value. The processed image file is shown on the upper-right. Its file name begins with 2. It was saved with an IrfanView quality level of 20 — relatively low quality; highly compressed.

The reference image file is shown on the upper-right. Its file name, 1. Crop LX7…, is shown in brown below the images on the left. Note that SSIM is a symmetrical measurement: results are identical if the processed and reference images are interchanged. Colorbar Options II has been selected in the View dropdown menu; the other options are No colorbar (grayscale image) and Grayscale colorbar.
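
For readers who wish to reproduce a mean SSIM value outside this viewer, a minimal sketch using scikit-image is shown below; the file names are placeholders, RGB input is assumed, and this is not the viewer's own implementation.

    from skimage.io import imread
    from skimage.color import rgb2gray
    from skimage.metrics import structural_similarity

    # Sketch: mean SSIM between a reference and a processed image. The file names
    # are placeholders and RGB input is assumed; SSIM is symmetrical, so swapping
    # the two arguments gives the same result.
    reference = rgb2gray(imread("reference.png"))
    processed = rgb2gray(imread("processed.png"))
    mean_ssim = structural_similarity(reference, processed, data_range=1.0)
    print(f"mean SSIM: {mean_ssim:.4f}")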

Image size and mean SSIM for the entire image are shown near the center of the window. Because SSIM is not very familiar, some explanation is needed. The original images were downsampled 2X to account for viewing conditions. The most important options involve zooming into the image to see the degradation more clearly. However, after manually investigating false positives in a preliminary study, we found that the provided ground truth contains errors.

An investigation revealed that the provided answer key contained two large clusters in which pairs were missing and that two given pairs were wrong. Again, no transformation or normalisation has been applied to this data set as it is already prepared. Since the SOCO data set is 2. Executions of the 30 tools with all of their possible configurations cover 99,, pairwise comparisons in total for this data set compared to 14,, comparisons in Scenario 1.

We analysed 1, similarity reports in total. This is based on the observation that compilation has a normalising effect: variable names disappear in bytecode, and nominally different kinds of control structures can be replaced by the same bytecode (e.g. different loop constructs can compile to the same conditional-jump instructions).

Likewise, changes made by bytecode obfuscators may also be normalised by decompilers. In this scenario, we focus on the generated data set containing pervasive code modifications of source code files generated in Scenario 1. However, we added normalisation through decompilation to the post-processing step (Step 3) of the framework by compiling all the transformed files using javac and decompiling them using either Krakatau or Procyon.
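
A rough sketch of this normalisation step is shown below; the decompiler command line is a placeholder (the exact Krakatau or Procyon invocation depends on the installed version), so only the overall compile-then-decompile flow is illustrated.

    import subprocess
    from pathlib import Path

    # Sketch of normalisation via compilation and decompilation: compile each
    # transformed .java file with javac, then decompile the resulting .class files.
    # DECOMPILE_CMD is a placeholder; substitute the Krakatau or Procyon command
    # line appropriate for your installation.
    DECOMPILE_CMD = ["my-decompiler", "--out"]      # hypothetical command

    def normalise(java_file: Path, out_dir: Path) -> None:
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(["javac", "-d", str(out_dir), str(java_file)], check=True)
        for class_file in out_dir.glob("*.class"):
            subprocess.run(DECOMPILE_CMD + [str(out_dir), str(class_file)], check=True)

    normalise(Path("Transformed.java"), Path("normalised"))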

We then followed the same similarity detection and analysis process in Steps 4 and 5. The results are then compared to the results obtained from Scenario 1 to observe the effects of normalisation through decompilation. The F-score offers a weighted harmonic mean of precision and recall. It is a set-based measure that does not consider any ordering of results.

The optimal F-scores are obtained by varying the threshold T to find the highest F-score. We observed from the results of the previous scenarios that the thresholds are highly sensitive to each particular data set. Therefore, we had to repeat the process of finding the optimal threshold every time we changed to a new data set. This was burdensome but could be done since we knew the ground truth of the data sets. The configuration problem for clone detection tools, including setting thresholds, has been mentioned by several studies as one of the threats to validity (Wang et al.).
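
The optimal-threshold search can be sketched as a simple sweep; the code below assumes similarity values on a 0 to 100 scale and a known set of truly similar pairs, and is not the framework's actual implementation.

    # Sketch: sweep the similarity threshold T and keep the one giving the best F-score.
    # 'report' maps a pair of files to the similarity value (assumed to be 0 to 100)
    # reported by a tool; 'ground_truth' is the set of truly similar pairs.
    def best_threshold(report, ground_truth):
        best = (0.0, None)
        for T in range(0, 101):
            predicted = {pair for pair, s in report.items() if s >= T}
            tp = len(predicted & ground_truth)
            precision = tp / len(predicted) if predicted else 0.0
            recall = tp / len(ground_truth) if ground_truth else 0.0
            f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            if f > best[0]:
                best = (f, T)
        return best                                 # (optimal F-score, optimal threshold T)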

There has also been an initiative to avoid using thresholds at all for clone detection (Keivanloo et al.). Hence, we try to avoid the problem of threshold sensitivity affecting our results.

Moreover, this approach also has applications in software engineering, including finding candidates for plagiarism detection, automated software repair, working code examples, and large-scale code clone detection. Instead of looking at the results as a set and applying a cut-off threshold to obtain true and false positives, we consider only a subset of the results based on their rankings. We present the definitions of the ranked-based measures below. Given n as the number of top results ranked by similarity, precision-at-n (Manning et al.) is the proportion of relevant results amongst the top n returned results.

In the presence of ground truth, we can set the value of n to be the number of relevant results r, which gives the r-precision. When more than one query is present, an average r-precision (ARP) can be computed as the mean of all r-precision values (Beitzel et al.). Lastly, mean average precision (MAP) measures the quality of results across the recall levels at which each relevant result is returned.

The average precision-at-n, aprec(n), of a query q is calculated as aprec(n) = (1/r) Σ_{k=1}^{n} P(k) rel(k), where P(k) is the precision over the top k results, rel(k) is 1 if the result at rank k is relevant and 0 otherwise, and r is the number of relevant results. Mean average precision (MAP) is then derived as the mean of the aprec(n) values over all the queries in Q (Manning et al.).
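
A small sketch of these two measures, following the standard textbook definitions rather than any particular tool; each query is assumed to come with a ranked result list and a known set of relevant items.

    # Sketch: average precision over the top n results of one query, and MAP over a
    # set of queries. 'ranked' is a list of result identifiers, best first; 'relevant'
    # is the set of identifiers that are truly relevant to the query.
    def average_precision_at_n(ranked, relevant, n):
        hits, total = 0, 0.0
        for k, item in enumerate(ranked[:n], start=1):
            if item in relevant:
                hits += 1
                total += hits / k                   # precision at each rank holding a relevant item
        return total / len(relevant) if relevant else 0.0

    def mean_average_precision(queries, n):
        """queries: iterable of (ranked, relevant) pairs."""
        values = [average_precision_at_n(r, rel, n) for r, rel in queries]
        return sum(values) / len(values) if values else 0.0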

Precision-at-n, ARP, and MAP are used to measure how well the tools retrieve relevant results within the top-n ranked items for a given query (Manning et al.). We simulate a querying process by (1) running the tools on our data sets and generating similarity pairs, and (2) ranking the results based on the similarities reported by the tools. The higher the similarity value, the higher the rank; the top-ranked result has the highest similarity value.

If a tie happens, we resort to ranking by alphabetical order of the file names. Our calculation of precision-at-n in this study can be considered a hybrid between a set-based and a rank-based measure. This is suitable for the case of plagiarism detection. To locate plagiarised source code files, one may not want to give a specific file as a query, since one does not know which file has been copied, but instead wants to retrieve the set of all similar pairs ranked by their similarities.

JPlag uses this method to report plagiarised source code pairs (Prechelt et al.). Moreover, finding the most similar files is useful in a manual study of large-scale code clones.

We picked one file at a time from the data set as a query and retrieved a ranked list of files (including the query itself) for that query. An r-precision was calculated from the top 10 results. We limited results to the top 10, since our ground truth contained 10 pervasively modified versions of each original source code file, including the original itself.

Thus, the number of relevant results, r, is 10 in this study. We derive ARP from the average of the r-precision values. The same process is repeated for MAP, except using average precision-at-n instead of r-precision.
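
The per-query process just described can be sketched as follows; the similarity lookup, the alphabetical tie-break and r fixed at 10 mirror the description above, but this is a simplification rather than the actual evaluation script.

    # Sketch: use each file as a query, rank all files by reported similarity with an
    # alphabetical tie-break, take the top r = 10, and average the r-precision over
    # all queries to obtain ARP.
    def arp(files, similarity, relevant_sets, r=10):
        """similarity(q, f) -> score; relevant_sets[q] -> set of files relevant to query q."""
        precisions = []
        for q in files:
            ranked = sorted(files, key=lambda f: (-similarity(q, f), f))
            top = ranked[:r]
            precisions.append(len(set(top) & relevant_sets[q]) / r)
        return sum(precisions) / len(precisions)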

The query-based approach is suitable when one does not require the retrieval of all the similar pairs of code, but only the most relevant ones for a given query. This situation occurs when performing code search for automated software repair (Ke et al.).

One may not feasibly try all returned repair candidates, but only the top-ranked ones. Another example is searching for working code examples (Keivanloo et al.). Using these three error measures, we can compare the performance of the similarity detection techniques and tools without relying on a threshold at all.

We have two objectives for this experimental scenario. First, we are interested in a situation where local and global code modifications are combined together. This is done by applying pervasive modifications on top of reused boiler-plate code. This scenario occurs in software plagiarism when only a small fragment of code is copied and later pervasive modifications are applied to the whole source code to conceal the copied part of the code.

It also represents a situation where boiler-plate code has been reused and repeatedly modified or refactored during software evolution. We are interested to see whether the tools can still locate the reused boiler-plate code. Second, we shift our focus from measuring how well our tools find all similar pairs of pervasively modified code, as we did in Scenario 1, to measuring how well they find similar pairs of code based on each pervasive code modification type.

This is a finer-grained result and provides insights into the effects of each pervasive code modification type on code similarity. Since some threshold needs to be chosen, we used the optimal threshold for each tool. We follow the five steps of our experimental framework (see Fig. ).

Amongst the SOCO files, 33 are successfully compiled and decompiled after code obfuscation by our framework. Each of the 33 files generates 10 pervasively modified files (including itself), resulting in 330 files available for the detection step (Step 4).

We change the similarity detection in Step 4 to focus only on comparing modified code to its original. Given M as the set of the 10 pervasive code modification types, the set of similar pairs of files Sim_m(F), out of all files F, with a pervasive code modification m is, in effect, the set of pairs formed between each original file and its version modified with type m, in both directions. The number of code pairs and true positive pairs for each modified type is therefore twice as large as for the Original (O) type, because of the asymmetric similarity between pairs, i.e. each original-modified pair is compared in both directions.

By applying the tools to pairs of original and pervasively modified code, we measure the tools on one particular type of code modification at a time.
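
A sketch of how such per-modification-type pair sets could be formed; the tuple layout and the type label 'O' for originals are assumptions standing in for the data set's actual naming scheme.

    # Sketch: build, for each modification type m, the set of (original, modified) pairs
    # in both directions, which is why every type other than the original yields twice
    # as many pairs as the original-to-original case.
    def pairs_per_type(files):
        """files: list of (name, original_name, mod_type) tuples; mod_type 'O' marks an original."""
        by_type = {}
        for name, orig, mod_type in files:
            if mod_type == "O":
                by_type.setdefault("O", set()).add((name, name))      # original compared with itself
            else:
                by_type.setdefault(mod_type, set()).update({(orig, name), (name, orig)})
        return by_type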

In total, we made , pairwise comparisons and analysed similarity reports in this scenario. We used the five experimental scenarios of pervasive modifications, decompilation, reused boiler-plate code, ranked results, and the combination of local and global code modification to answer the six research questions. The execution of 30 similarity analysers on the data sets along with searching for their optimal parameters took several months to complete.

We carefully observed and analysed the similarity reports, and the results are discussed below in order of the six research questions. The results for this research question are collected from experimental Scenario 1 (pervasive modifications) and Scenario 2 (reused boiler-plate code). The tools are classified into four groups: clone detection tools, plagiarism detection tools, compression tools, and other similarity analysers.

For clone detectors, we applied three different granularity levels of similarity calculation: line (L), token (T), and character (C). We find that measuring code similarity at different code granularity levels has an impact on the performance of the tools. For example, ccfx gives a higher F-score when measuring similarity at the character level than at the line or token level.

We present only the results for the best granularity level in each case here. In terms of accuracy and F-score, the token-based clone detector ccfx is ranked first. The top 10 tools with the highest F-score include ccfx (0. ). Interestingly, tools from all four groups appear in the top ten.

For clone detectors, we have a token-based tool (ccfx), an AST-based tool (deckard), and a string-based tool (simian) in the top ten. This shows that, with pervasive modifications, multiple clone detectors with different detection techniques can offer comparable results, given that their optimal configurations are provided. However, some clone detectors, e.g. iclones and nicad, perform poorly. ccfx, with its optimal configuration, effectively performs similarity computation on one small chunk of code at a time.

This approach is flexible and effective in handling code with pervasive modifications that spread changes over the whole file. We also manually investigated the similarity reports of the poorly performing iclones and nicad and found that the tools were susceptible to code changes involving the two decompilers, Krakatau and Procyon. When comparing files decompiled by Krakatau with files decompiled by Procyon, with or without bytecode obfuscation, they could not find any clones and hence reported zero similarity.

For plagiarism detection tools, jplag-java and simjava, which are token-based plagiarism detectors, are the leaders. Other plagiarism detectors give acceptable performance except simtext. This is expected, since the tool is intended for plagiarism detection on natural-language text rather than source code. Compression tools show promising results using NCD for code similarity detection. They are ranked mostly in the middle, from 7th to 24th, with comparable results. The NCD implementations ncd-zlib, ncd-bzlib, and bzip2ncd only slightly outperform other compressors such as gzip or LZMA.
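
For reference, NCD-based similarity can be sketched in a few lines; the bz2-based version below follows the standard NCD formula and is not the exact implementation used by the tools above.

    import bz2

    # Sketch: normalized compression distance,
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with bz2 as the compressor C.
    def ncd(x: bytes, y: bytes) -> float:
        cx, cy, cxy = len(bz2.compress(x)), len(bz2.compress(y)), len(bz2.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = open("FileA.java", "rb").read()             # placeholder file names
    b = open("FileB.java", "rb").read()
    print(1 - ncd(a, b))                            # closer to 1 means more similar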

So the actual compression method may not have a strong effect in this context. Other techniques for code similarity offer varied performance. Tools such as ngram, diff, cosine, jellyfish and bsdiff perform badly. They are ranked amongst the last positions, at 22nd, 26th, 28th, 29th, and 30th respectively.

Surprisingly, two Python tools using difflib and fuzzywuzzy string matching techniques produce very high F-scores. To find the overall performance over similarity thresholds from 0 to , we drew receiver operating characteristic (ROC) curves, calculated the area under the curve (AUC), and compared them. Figure 7 includes the ten tools with the highest AUC values. We can see from the figure that ccfx is again the best performing tool, with the highest AUC (0. ).
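
The AUC over all thresholds can be computed directly from the reported similarities and the ground-truth labels, for example with scikit-learn; the values below are placeholders, and this is a sketch rather than the exact evaluation script used here.

    from sklearn.metrics import roc_auc_score

    # Sketch: y_true marks whether each compared pair is truly similar (1) or not (0);
    # y_score is the similarity value a tool reported for that pair. The AUC summarises
    # performance over every possible threshold at once.
    y_true = [1, 1, 0, 0, 1, 0]                     # placeholder ground truth
    y_score = [92, 65, 40, 71, 88, 15]              # placeholder similarity values
    print(roc_auc_score(y_true, y_score))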

The best tool with respect to accuracy and F-score is ccfx. The tool with the fewest false positives is difflib, and the fewest false negatives are given by diff. However, considering the large number of false positives for diff (8, false positives, meaning that 8, out of 9, dissimilar file pairs are treated as similar), the tool tends to judge everything as similar.

The second lowest number of false negatives is once again achieved by ccfx. Compared to our previous study (Ragkhitwetsagul et al.), although half of the data set is shared, the expanded data set makes our results more generalisable. To sum up, we found that specialised tools such as source code clone and plagiarism detectors perform well against pervasively modified code. They were better than most of the compression-based and general string similarity tools.

Compression-based tools mostly give decent and comparable results across all compression algorithms. String similarity tools perform poorly and are mostly ranked amongst the last. However, we found that Python difflib and fuzzywuzzy perform surprisingly better on this expanded version of the data set than on the original data set in our previous study (Ragkhitwetsagul et al.).

They are both ranked highly, amongst the top 5. Lastly, ccfx performed well on both the smaller data set in our previous study and the current data set, and is ranked 1st on several error measures. We report the complete evaluation of the tools on the SOCO data set with the optimal configurations in Table 8. Amongst the 30 tools, the top-ranked tool in terms of F-score is jplag-text (0. ). Most of the tools and techniques perform well on this data set. We observed high accuracy, precision, recall, and an F-score of over 0.

Since the data set contains source code that is copied and pasted with local modifications, the clone detectors (ccfx, deckard, nicad, and simian) and the plagiarism detectors (jplag-text, jplag-java and simjava) performed very well, with F-scores between 0. Other clone detectors, including iclones, nicad, and simian, provide their highest F-score at the line level. The Python difflib and fuzzywuzzy are outliers of the Others group, offering high performance against boiler-plate code with an F-score of 0.

Once again, these two string similarity techniques show promising results. The compression-based techniques are amongst the last although they still offer relatively high F-scores ranging from 0. Regarding the overall performance over similarity thresholds of 0 to , the results are illustrated as ROC curves in Fig.

The tool with the highest AUC is difflib (0. ). To sum up, we observed that almost every tool detected boiler-plate code effectively, reporting high scores on all error measures. Similar to pervasive modifications, we found the string matching techniques difflib and fuzzywuzzy ranked amongst the top. This is due to the nature of boiler-plate code, which has local modifications contained within a single method or code block, on which clone and plagiarism detectors perform well.

However, on the more challenging pervasive modifications data set, there is no clear distinction in terms of ranking between dedicated code similarity techniques, compression-based tools, and general text similarity tools. We found that the Python difflib string matching and Python fuzzywuzzy token similarity techniques even outperform several clone and plagiarism detection tools on both data sets.

Given that they are simple and easy-to-use Python libraries, one can adopt these two techniques to measure code similarity in situations where dedicated tools are not available. Compression-based techniques are not ranked at the top in either scenario, possibly due to the small size of the source code files — NCD is known to perform better with large files.
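
For completeness, using the two libraries in the way suggested above takes only a line each; the file names below are placeholders, and any pre-processing (such as comment removal) is left out of this sketch.

    import difflib
    from fuzzywuzzy import fuzz

    code_a = open("A.java").read()                  # placeholder file names
    code_b = open("B.java").read()

    # difflib: ratio() returns a value in 0..1 based on matching-block string similarity.
    print(difflib.SequenceMatcher(None, code_a, code_b).ratio())

    # fuzzywuzzy: token_set_ratio() returns 0..100 and is less sensitive to reordering.
    print(fuzz.token_set_ratio(code_a, code_b))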

In the experimental Scenarios 1 and 2, we thoroughly analysed various configurations of every tool and found that some specific settings are sensitive to pervasively modified and boiler-plate code whilst others are not. The complete list of the best configurations of every tool for pervasive modifications from Scenario 1 can be found in the second column of Table 7. The optimal configurations are significantly different from the default configurations, in particular for the clone detectors.

Interestingly, ccfx was also included in a previous study on the agreement of clone detectors (Wang et al.); this is because ccfx is a widely-used tool in several clone research studies. Two parameter settings are chosen for ccfx in this study: b, the minimum length of clone fragments in units of tokens, and t, the minimum number of kinds of tokens in clone fragments. We did a fine-grained search of b from 3 to 25 in steps of one and a coarse-grained search from 30 to 50 in steps of 5. From Fig. , we can see that, whilst there is no setting for ccfx that obtains the optimal precision and recall at the same time, there are a few cases where ccfx can obtain high precision and recall, as shown in the top right corner of the figure.

The best settings for precision and recall of ccfx are described in Table 9.

[Figure: trade-off between precision and recall for ccfx parameter settings.]

The default settings provide high precision but low recall against pervasive code modifications. The landscape of ccfx performance in terms of F-score is depicted in Fig. . There are two strong regions: one covering a b value of 19 with t values from 7 to 9, and one covering a b value of 5 with t values from 11 to . The two regions provide F-scores ranging from 0. For boiler-plate code, we found another set of optimal configurations for the 30 tools by once again analysing a large search space of their configurations.

The complete list of the best configurations for every tool from Scenario 3 can be found in the second column of Table 8. These empirical results support the findings of Wang et al. Our optimal configurations can be used as guidelines for studies involving pervasive modifications and boiler-plate code. Nevertheless, they are only effective against their respective data set and not guaranteed to work well on other data sets.

The results after adding compilation and decompilation for normalisation to the post-processing step, before performing similarity detection on the generated data set in experimental Scenario 3, are shown in Fig. . We can clearly observe that decompilation by both Krakatau and Procyon boosts the F-scores of every tool in the study. Every tool has its number of false positives and false negatives greatly reduced, and three tools (simian, jplag-java, and simjava) no longer report any false results at all.

All compression-based and other techniques still report some false results. This supports the results of our previous study (Ragkhitwetsagul et al.). To strengthen the findings, we performed a statistical test to see whether the performances before and after normalisation via decompilation differ with statistical significance.

Table 11 shows that the observed F-scores before and after decompilation differ with statistical significance for both Krakatau and Procyon. According to Vargha and Delaney, the A12 value of 0.

An A12 value over or below 0. The guideline in Vargha and Delaney shows that 0. A similar finding also applies to Procyon. The large effect sizes clearly support the finding that compilation and decompilation is an effective normalisation technique against pervasive modifications. To gain insight, we carefully investigated the source code after normalisation and found that the decompiled files created by Krakatau are very similar despite the applied obfuscation, as depicted in Fig. . This is because Krakatau has been designed to be robust with respect to minor obfuscations, and the transformations made by Artifice and ProGuard are not very complex.
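
The A12 effect size itself is straightforward to compute from two samples of F-scores; the sketch below implements the standard Vargha and Delaney statistic with placeholder values, and is not the exact script behind Table 11.

    # Sketch: Vargha and Delaney's A12, the probability that a value drawn from sample x
    # is larger than one drawn from sample y, with ties counting half; 0.5 means no effect.
    def a12(x, y):
        greater = sum(1 for xi in x for yi in y if xi > yi)
        ties = sum(1 for xi in x for yi in y if xi == yi)
        return (greater + 0.5 * ties) / (len(x) * len(y))

    after = [0.95, 0.91, 0.97, 0.90]                # placeholder F-scores after decompilation
    before = [0.70, 0.65, 0.72, 0.68]               # placeholder F-scores before decompilation
    print(a12(after, before))                       # 1.0: every 'after' value beats every 'before'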

Code normalisation by Krakatau resulted in multiple optimal configurations being found for some of the tools. We selected only one optimal configuration to include in Table 10 and separately report the complete list of optimal configurations on our study website (Ragkhitwetsagul and Krinke a). Normalisation via decompilation using Procyon also improves the performance of the similarity detectors, but not as much as Krakatau (see Table ). Interestingly, Procyon performs slightly better for deckard, sherlock, and cosine.

An example of code before and after decompilation by Procyon is shown in Fig. . It seems that the low-level approach of Krakatau has a stronger normalisation effect. We answer this research question using the results from RQ1 and RQ2 (experimental Scenarios 1 and 2, respectively). For the 30 tools from RQ1, we applied the derived optimal configurations obtained from the generated data set (denoted as C gen) to the SOCO data set. Table 13 shows that using these configurations has a detrimental impact on the similarity detection results for another data set, even for tools that have no parameters.

We noticed that the low F-scores when C gen is reused on SOCO come from a high number of false positives, possibly due to their relaxed configurations. To confirm this, we refer to the best configurations (settings and threshold) for the SOCO data set discussed in RQ1 (see Table 8); the comparison of the best configurations between the two data sets is shown in Table . The reported F-scores are very high for the dataset-based optimal configurations (denoted as C soco), confirming that configurations are very sensitive to the data set on which the similarity detection is applied.

We found the dataset-based optimal configurations, C soco, to be very different from the configurations for the generated data set, C gen. Although the table shows only the top 10 tools from the generated data set, the same findings apply to every tool in our study. The complete results can be found on our study website (Ragkhitwetsagul and Krinke a). Lastly, we noticed that the best thresholds for the tools are very different between one data set and another, and that the chosen similarity threshold tends to have the largest impact on the performance of similarity detection.

This observation provides further motivation for a threshold-free comparison using precision-at-n. In experimental Scenario 4, we applied three error measures adopted from information retrieval, precision-at-n (prec(n)), average r-precision (ARP), and mean average precision (MAP), to the generated and SOCO data sets.

The results are discussed below. As discussed in Section 4. For the generated data set, we sorted the 10, pairs of documents by their similarity values from the highest to the lowest.

Then, we evaluated the tools based on the set of top n elements. We varied the value of n over a range of values. In Table 14, we report only the value of n equal to the number of true positives in the data set. The ccfx tool is ranked 1st with the highest prec(n) of 0. In comparison with the rankings by F-score, the ranking of the ten tools changed slightly: simjava and simian perform better, whilst jplag-java and difflib now perform worse.

As illustrated in Fig. , the number of true positives is depicted by a dotted line. We can see that most of the tools perform very well over the first few hundred of the top-n results, showing steady flat lines at a prec(n) of 1.


