Source code analysis is generally a challenging task, and this is especially true for loosely typed languages like JavaScript. Traditionally, analysis is done by hand with the help of static analysis tools, an approach that has many disadvantages, one of which is a lack of robustness. Recent advances in machine learning promise to increase the robustness of source code analysis; however, for ML models to work, a meaningful and compatible representation of the code is needed. We propose a specific way of extracting features from JavaScript source code based on its underlying structure, the abstract syntax tree (AST), and then embed these features into a fixed-length vector using Doc2Vec. Applying this method to a dataset of 150,000 JavaScript source files, we found the representation to be meaningful, as semantically similar AST nodes are grouped together after the embedding.
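The abstract does not spell out the exact feature definition, so the following is only a minimal sketch of the general idea under stated assumptions: an AST is flattened into a sequence of node-type tokens that could serve as the "document" for an embedding model such as Doc2Vec. The ESTree-style dictionary, the pre-order traversal, and the helper names `node_types` and `ast_to_document` are illustrative assumptions, not the authors' method.

```python
from typing import Any, Iterator, List

# Hand-written ESTree-style AST for the JavaScript statement `var x = 1 + 2;`.
# Assumption: a real pipeline would obtain this tree from a JavaScript parser.
EXAMPLE_AST = {
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "kind": "var",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {"type": "Identifier", "name": "x"},
                    "init": {
                        "type": "BinaryExpression",
                        "operator": "+",
                        "left": {"type": "Literal", "value": 1},
                        "right": {"type": "Literal", "value": 2},
                    },
                }
            ],
        }
    ],
}


def node_types(node: Any) -> Iterator[str]:
    """Yield the `type` of every AST node in depth-first pre-order."""
    if isinstance(node, dict):
        if "type" in node:
            yield node["type"]
        # Recurse into child values; plain strings and numbers are skipped.
        for value in node.values():
            yield from node_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from node_types(item)


def ast_to_document(ast: Any) -> List[str]:
    """Flatten an AST into a token list (hypothetical feature format)."""
    return list(node_types(ast))


tokens = ast_to_document(EXAMPLE_AST)
print(tokens)
```

A token list of this kind could then be passed as the words of one tagged document per source file to a Doc2Vec implementation, for example `gensim.models.doc2vec.Doc2Vec` with `TaggedDocument`, to obtain a fixed-length vector per file. Whether the paper uses this exact tokenization or library is not stated in the abstract.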
BibTeX:
@InProceedings{AJF20,
author = {Aladics, Tamás and Jász, Judit and Ferenc, Rudolf},
booktitle = {Proceedings of the 12th Conference of PhD Students in Computer Science (CSCS 2020)},
title = {Feature Extraction from JavaScript},
year = {2020},
address = {Szeged, Hungary},
month = jul,
pages = {143--146},
publisher = {University of Szeged},
keywords = {JavaScript, Feature Assembler, Deep Learning, Doc2Vec},
url = {http://www.inf.u-szeged.hu/~cscs/cscs2020/pdf/cscs2020.pdf},
}