Custom code2vec implementation for Python3. Uses path-miner from astminer project.
- Clone this repository and its submodules:
  `git clone --recurse-submodules https://github.com/NURx2/pycode2vec`
- Run `./gradlew shadowJar` in the `astminer` directory
- Place `train.csv` containing `code_block`, `target`, and an index as the first column in the `dataset` folder
- Run `train.sh --nthreads N` (by default, the number of threads is -1)
- Place `test.csv` containing `code_block`, `target`, and an index as the first column in the `dataset` folder
- Run `predict.sh`
- Check out the `vectors` folder to get the calculated vectors
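For reference, a `train.csv` matching the layout described above (an index as the first column, followed by `code_block` and `target`) could be prepared like this. The sample rows and the exact header names are assumptions for illustration, not taken from the project:

```python
import csv
import os

# Hypothetical sample data: each row is a Python code snippet and its label.
rows = [
    ("def add(a, b):\n    return a + b", "add"),
    ("def sort_items(xs):\n    return sorted(xs)", "sort"),
]

os.makedirs("dataset", exist_ok=True)
with open("dataset/train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Index as the first column, then code_block and target.
    writer.writerow(["", "code_block", "target"])
    for i, (code, label) in enumerate(rows):
        writer.writerow([i, code, label])
```

The `csv` writer quotes the multi-line snippets automatically, so each example stays in a single CSV record.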
Configuration options can be found in `default.config`.
The original work offers much more (including pre-trained models for Java code chunks); it is strongly recommended to get acquainted with it.
The code vectors and the target embeddings are trained to be close to each other, but they are different. Target embeddings are the weight vectors between the code vector and the softmax layer. The embedding of a specific method name (e.g., `sort`) is shared among all the methods labeled `sort`, whereas the code vector of each of these examples is slightly different from the others. During training, the softmax (with cross-entropy loss) encourages the code vector to have a large dot-product with the "correct" target embedding (the target embedding of the true label) and a low dot-product with each of the remaining target embeddings. Eventually, this brings the code vector and its corresponding true-target embedding close to each other in Euclidean space. This is a characteristic of softmax + cross-entropy and is not specific to code2vec.
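This dynamic can be seen in a minimal NumPy sketch. The dimensions, random vectors, and learning rate below are toy assumptions, not the project's actual model; the point is that gradient descent on softmax + cross-entropy increases the dot-product between the code vector and the true target embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_labels = 8, 5
code_vec = rng.normal(size=dim)             # code vector for one example
targets = rng.normal(size=(n_labels, dim))  # target embeddings (softmax weights)
true = 2                                    # index of the true label

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A few steps of gradient descent on cross-entropy w.r.t. the code vector only
# (target embeddings held fixed for simplicity).
lr = 0.5
before = code_vec @ targets[true]
for _ in range(50):
    probs = softmax(targets @ code_vec)
    # d(-log probs[true]) / d(code_vec) = targets^T probs - targets[true]
    grad = targets.T @ probs - targets[true]
    code_vec -= lr * grad
after = code_vec @ targets[true]

print(before < after)
```

The update direction `targets[true] - targets.T @ probs` pulls the code vector toward the true target embedding and away from the probability-weighted average of the others, which is exactly the "close in Euclidean space" effect described above.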