The tool for getting embeddings of Python 3 code chunks

NURx2/pycode2vec

pycode2vec

A custom code2vec implementation for Python 3. It uses the path miner from the astminer project.

Usage

  1. Clone this repository together with its submodules: git clone --recurse-submodules https://github.com/NURx2/pycode2vec

Before the first launch

  1. Run ./gradlew shadowJar in the astminer directory

Training

  1. Place train.csv in the dataset folder. The file must contain the code_block and target columns, with the index as the first column
  2. Run train.sh --nthreads N (by default, the number of threads is -1)
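
The expected layout of train.csv can be sketched as follows. This is a minimal illustration, not part of the repository: the helper write_dataset is hypothetical, and the unnamed header for the index column is an assumption (it mirrors what a typical dataframe export produces). Only the column names code_block and target come from the steps above.

```python
import csv
from pathlib import Path


def write_dataset(path, samples):
    """Write (code_block, target) pairs to a CSV with the index as the first column.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Assumption: the index column has an empty header name.
        writer.writerow(["", "code_block", "target"])
        for index, (code_block, target) in enumerate(samples):
            # Multi-line code blocks are quoted automatically by csv.writer.
            writer.writerow([index, code_block, target])


if __name__ == "__main__":
    samples = [
        ("def add(a, b):\n    return a + b", "add"),
        ("def is_even(n):\n    return n % 2 == 0", "is_even"),
    ]
    write_dataset("dataset/train.csv", samples)
```

The same layout would apply to test.csv in the Predicting step, since it lists the same columns.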

Predicting

  1. Place test.csv in the dataset folder. The file must contain the code_block and target columns, with the index as the first column
  2. Run predict.sh
  3. Check the vectors folder for the calculated vectors

Configurations can be found in default.config.

The original code2vec work offers many more features (including pre-trained models for Java code chunks). Getting acquainted with it is strongly recommended.

FAQ

Differences between code vectors and target embeddings

The code vectors and the target embeddings are trained to be close to each other, but they are different. Target embeddings are the weight vectors between the code vector and the softmax layer. The vector of a specific method name (e.g., sort) is shared among all the methods that are labeled as sort, whereas the code vector of each of these examples is slightly different from the others. During training, the softmax (+ cross-entropy loss) encourages the code vector to have a large dot product with the "correct" target embedding (the target embedding of the true label) and a low dot product with each of the other target embeddings. So eventually, it brings the code vector and its corresponding true-target embedding close to each other in Euclidean space. This is a property of softmax + cross-entropy and is not specific to code2vec.
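
The effect described above can be reproduced on a toy example. The sketch below is not the project's training code: the target embeddings and the starting code vector are made-up values, and only the code vector is updated (in real training the target embeddings are learned as well). It applies plain gradient descent on the cross-entropy loss and records the dot products before and after.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgd_step(code_vec, targets, true_idx, lr=0.5):
    """One gradient step on the cross-entropy loss w.r.t. the code vector only."""
    probs = softmax([dot(code_vec, t) for t in targets])
    # dL/dv = sum_j (p_j - y_j) * t_j, where y is the one-hot true label.
    grad = [0.0] * len(code_vec)
    for j, t in enumerate(targets):
        coeff = probs[j] - (1.0 if j == true_idx else 0.0)
        for k in range(len(grad)):
            grad[k] += coeff * t[k]
    return [v - lr * g for v, g in zip(code_vec, grad)]


# Toy target embeddings for three labels (illustrative values).
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
code_vec = [0.1, 0.2]
true_idx = 0

before = [dot(code_vec, t) for t in targets]
for _ in range(20):
    code_vec = sgd_step(code_vec, targets, true_idx)
after = [dot(code_vec, t) for t in targets]

print("dot products before:", before)
print("dot products after: ", after)
```

After the updates, the dot product with the true label's embedding (index 0) is larger than before and larger than the dot products with the other two embeddings.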
