Feature Extraction from JavaScript

Abstract

Source code analysis is generally a challenging task, and this is especially true for loosely typed languages like JavaScript. Traditionally, analysis is done by hand with the help of static analysis tools, an approach that has many disadvantages, one of which is a lack of robustness. Recent advances in machine learning promise to make source code analysis more robust; however, for ML models to work, a meaningful and compatible representation is needed. We propose a specific way of extracting features from JavaScript source code based on its underlying structure, the abstract syntax tree (AST), and then embed these features into a fixed-length vector using Doc2Vec. Applying this method to a dataset of 150,000 JavaScript source files, we found the representation to be meaningful, as semantically similar AST nodes are grouped together after the embedding.
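
To make the idea of AST-based feature extraction more concrete, the following is a minimal sketch and not the paper's actual extraction scheme. It assumes an ESTree-style AST represented as nested Python dictionaries, written by hand here for the snippet `var x = 1 + 2;`, and flattens it into a pre-order sequence of node type names. The names `SAMPLE_AST`, `walk`, and `node_type_tokens` are hypothetical and introduced only for illustration.

```python
from typing import Any, Dict, Iterator, List

# Hand-written ESTree-style AST for `var x = 1 + 2;` (assumed input shape;
# in practice a JavaScript parser would produce this structure).
SAMPLE_AST: Dict[str, Any] = {
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "kind": "var",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {"type": "Identifier", "name": "x"},
                    "init": {
                        "type": "BinaryExpression",
                        "operator": "+",
                        "left": {"type": "Literal", "value": 1},
                        "right": {"type": "Literal", "value": 2},
                    },
                }
            ],
        }
    ],
}


def walk(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every AST node (a dict with a "type" key) in pre-order."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for child in node.values():
            yield from walk(child)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def node_type_tokens(ast: Dict[str, Any]) -> List[str]:
    """Flatten an AST into the sequence of its node type names."""
    return [n["type"] for n in walk(ast)]


tokens = node_type_tokens(SAMPLE_AST)
print(tokens)
```

A token sequence like this, one per source file, is the kind of input a Doc2Vec implementation expects; for example, it could be passed as the `words` of a gensim `TaggedDocument`. Which node attributes the paper actually encodes is not stated in the abstract, so the choice of node type names as tokens is an assumption.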

Publication
Proceedings of the 12th Conference of PhD Students in Computer Science (CSCS 2020), Szeged, Hungary, Pages 143-146

BibTeX:

@InProceedings{AJF20,
    author    = {Aladics, Tamás and Jász, Judit and Ferenc, Rudolf},
    booktitle = {Proceedings of the 12th Conference of PhD Students in Computer Science (CSCS 2020)},
    title     = {Feature Extraction from JavaScript},
    year      = {2020},
    address   = {Szeged, Hungary},
    month     = jul,
    pages     = {143--146},
    publisher = {University of Szeged},
    keywords  = {JavaScript, Feature Assembler, Deep Learning, Doc2Vec},
    url       = {http://www.inf.u-szeged.hu/~cscs/cscs2020/pdf/cscs2020.pdf},
}