Custom code2vec implementation for Python3. Uses path-miner from astminer project.
- Clone this repository and its submodules:
  `git clone --recurse-submodules https://github.com/NURx2/pycode2vec`
- Run `./gradlew shadowJar` in the `astminer` directory
- Place `train.csv` containing `code_block`, `target`, and an index as the first column in the `dataset` folder
- Run `train.sh --nthreads N` (by default, the number of threads is -1)
- Place `test.csv` containing `code_block`, `target`, and an index as the first column in the `dataset` folder
- Run `predict.sh`
- Check out the `vectors` folder to get the calculated vectors
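For reference, a `train.csv` matching the layout described above (an index as the first column, followed by `code_block` and `target`) could be prepared like this. The sample rows and the exact header names are assumptions for illustration, not taken from the project:

```python
import csv
import os

# Hypothetical sample data: each row is a Python code snippet and its label.
rows = [
    ("def add(a, b):\n    return a + b", "add"),
    ("def sort_items(xs):\n    return sorted(xs)", "sort"),
]

os.makedirs("dataset", exist_ok=True)
with open("dataset/train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Index as the first column, then code_block and target.
    writer.writerow(["", "code_block", "target"])
    for i, (code, label) in enumerate(rows):
        writer.writerow([i, code, label])
```

The `csv` writer quotes the multi-line snippets automatically, so each example stays in a single CSV record.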
Configuration options can be found in `default.config`.
The original work offers much more (including pre-trained models for Java code chunks); it is strongly recommended to get acquainted with it.
The code vectors and the target embeddings are trained to be close to each other, but they are different. Target embeddings are the weight vectors between the code vector and the softmax layer. The embedding of a specific method name (e.g., `sort`) is shared among all the methods labeled `sort`, whereas the code vector of each of these examples is slightly different from the others. During training, the softmax (with cross-entropy loss) encourages the code vector to have a large dot-product with the "correct" target embedding (the target embedding of the true label) and a low dot-product with each of the remaining target embeddings. Eventually, this brings the code vector and its corresponding true-target embedding close to each other in Euclidean space. This is a characteristic of softmax + cross-entropy and is not specific to code2vec.
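This dynamic can be seen in a minimal NumPy sketch. The dimensions, random vectors, and learning rate below are toy assumptions, not the project's actual model; the point is that gradient descent on softmax + cross-entropy increases the dot-product between the code vector and the true target embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_labels = 8, 5
code_vec = rng.normal(size=dim)             # code vector for one example
targets = rng.normal(size=(n_labels, dim))  # target embeddings (softmax weights)
true = 2                                    # index of the true label

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A few steps of gradient descent on cross-entropy w.r.t. the code vector only
# (target embeddings held fixed for simplicity).
lr = 0.5
before = code_vec @ targets[true]
for _ in range(50):
    probs = softmax(targets @ code_vec)
    # d(-log probs[true]) / d(code_vec) = targets^T probs - targets[true]
    grad = targets.T @ probs - targets[true]
    code_vec -= lr * grad
after = code_vec @ targets[true]

print(before < after)
```

The update direction `targets[true] - targets.T @ probs` pulls the code vector toward the true target embedding and away from the probability-weighted average of the others, which is exactly the "close in Euclidean space" effect described above.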