What is codesearch.isocpp.org?
Who made codesearch.isocpp.org?
Who is codesearch.isocpp.org for?
How big is the ACTCD19 dataset?
How was the ACTCD19 dataset prepared?
Where can I get the ACTCD19 dataset?
What is the query syntax of codesearch.isocpp.org?
How is a token-based search different from a text search?
What does codesearch.isocpp.org do with the matches?
What software is used to perform the code search?
How does the code search engine work?
What is the format of the index file? What search algorithm is used?
I have an issue not listed here?
codesearch.isocpp.org is a code search engine of the ACTCD19 dataset for studying existing practice of C++.
codesearch.isocpp.org was developed by Andrew Tomazos (www.tomazos.com) and is hosted by the Standard C++ Foundation (www.isocpp.org).
codesearch.isocpp.org was developed for ISO Standard C++ proposal authors in order to explore existing C++ practice and to provide empirical evidence to support claims about existing practice made in proposals.
The ACTCD19 dataset consists of a corpus of C and C++ source files that were taken from the source packages of the third party software package repository of a popular Linux distribution (Debian Sid as at March 2019).
Total Source Files | 2,489,599 |
Total Lines of Code | 876,800,403 |
Total Source Bytes | 27,580,074,304 |
Total Tokens | 4,057,773,201 |
Unique Tokens | 47,705,063 |
A mirror was created of the Debian Sid package repository by using apt-mirror. From the mirror, each .dsc file was passed to dpkg-source -x. The extracted files were then filtered based on file extension to filter out non C/C++ source files. The source files were then tokenized according the standard C++ translation phase 1 through 3 rules. Source files that failed to tokenize or contained no tokens were removed. Of each set of files that contained the same token sequence all but a randomly selected one was removed.
The tarball of the entire raw ACTCD19 dataset is 9.9GB and is available for download via this link:
https://codesearch.isocpp.org/actcd19.tar.gz
Individual files of ACTCD19 can be browsed via this link:
https://codesearch.isocpp.org/actcd19
To code search ACTCD19 use this link:
The query string is tokenized according to standard C++ translation phase 1 through 3 rules. That sequence of query tokens is then searched for as a token subsequence in each of the source files of the ACTCD19 dataset. There are currently no special escape sequences or escape tokens at this time.
So the query string foo+bar will tokenize to {“foo”, “+”, “bar”} and so will match the following code:
foo+bar
foo + bar
foo /* comment */ + bar
foo /* newline intentional */
+ bar
but will not match the following:
/* foo+bar */
somethingfoo+bar
“foo+bar”
Firstly, it counts them and returns the number of matches. One can then divide by the previously provided metrics to determine how common the query token sequence is (per file, per line, etc). One can also divide by the number of matches of some other query to determine the relative prevalence of two query token sequences:
For example:
Searching for `switch`...
2489599 source files searched.
1410650 matches found.
Searching for `case`...
2489599 source files searched.
9658183 matches found.
We can then conclude from this there are an average of 1410650 / 2489599 = 0.56 switch statements per source file.
We can also conclude on average a switch statement has 9658183 / 1410650 = 6.8 case labels.
Secondly, if there are less than 100 matches it will list them all in random order. If there are more than 100 matches, it will select an unbiased random subset of 100 matches and list them in random order. This random sample is therefore statistically likely to be a representative sample of the population of all matches.
One can then hand classify the random sample by some predicate, and then multiply the ratio by the returned number of matches to approximate the number of matches that have that predicate. For example if there are 10000 matches and 80 out of 100 of the random sample are from “test code”, one can conclude that there are (100 - 80) / 100 * 10000 = 2000 matches of the query that are not in test code.
More generally, looking through the random sample allows an unbiased survey of usage of the query token sequence.
Each element of the random sample provides:
Andrew Tomazos wrote the code search engine over a few days. The source code of it is posted here:
https://github.com/tomazos/pptoken
There is an indexing program (see ppindex.cc) that is run once to create an index file. That index file is permanently resident in memory on the server. There is then a search program (see ppsearch.h) that uses the memory-resident index file to search for matches.
The indexing program counts occurences of each unique token spelling and then sorts them into descending order, assigning id 1 to the most common token, id 2 to the second most common token and so on until id 47705063 is the least common token.
Each source file is tokenized and its token sequence is translated into these token ids, with a special end-of-file 0 token id appended to the end. These token ids are then concatenated into one long sequence.
The token id sequence is then translated into a novel variable-length encoding with the properties that:
- the lower the token id the shorter the encoding
- a sequence of ids A is a subsequence of a sequence of ids B if and only if the encoded version of A is a subsequence of the encoded version of B
This codec can be seen in token_codec.h. Token ids 0-127 are encoded as 1 byte:
0xxxxxxx
Tokens ids above 127 are encoded with a leading byte:
10nnxxxx
Where nn is a 2 bit unsigned integer N
And xxxx is the lower 4 bits of the token id
The leading byte is trailed by N+1 trailing bytes:
11xxxxxx
Where each xxxxxx encodes the next higher 6 bits of the token id.
This encoded token section is placed into the index file along with several other sections to allow forward and reverse lookup of filenames, token spellings and line numbers.
The search program then tokenizes and encodes the query string into a byte sequence in the same fashion as the indexing program. That query byte sequence is then concurrently searched for from multiple CPU cores over the subdivided encoded section. Because of the codec this is simply a dumb bytewise memory scan.
Each match found (which is a memory address) is reported to a sampler (see dvc/sampler.h) that keeps track of the total matches and keeps a random sample of 100 of the memory addresses.
After the search is complete, the search program then takes the random sample of memory addresses and uses the other sections of the memory resident index file to symbolize the memory addresses back into file and line references. These are then reported to the UI.
Please post suggestions, issues, questions, bugs or anything else about codesearch.isocpp.org, ACTCD19 or pptoken to the ACTCD19 github issue tracker at:
https://github.com/tomazos/actcd19/issues
Thanks for your support.