Code Similarity Search in a Latent Space

dc.contributor.advisor: Jermaine, Christopher [en_US]
dc.creator: Qi, Letao [en_US]
dc.date.accessioned: 2017-08-01T16:39:58Z [en_US]
dc.date.available: 2017-08-01T16:39:58Z [en_US]
dc.date.created: 2017-05 [en_US]
dc.date.issued: 2017-04-21 [en_US]
dc.date.submitted: May 2017 [en_US]
dc.date.updated: 2017-08-01T16:39:58Z [en_US]
dc.description.abstract: A huge database of program source codes that supports fast search via code similarity would be useful for several applications, including automated program synthesis and debugging, and user-facing code search in an integrated development environment. Here, "similar" is defined with respect to a set of application-defined similarity functions. The key difficulty in realizing this goal is that standard database indexing techniques cannot be applied to the problem of querying based on arbitrary similarity functions. To address this difficulty, I propose a dictionary-based approach where I represent each piece of code by a vector of similarities to a set of example database codes. Cosine similarity between the vector representing a query code and the vector representing a database code can be used to measure closeness. However, the dictionary may need to be very high dimensional if the goal is to accurately index a wide variety of database codes. Hence, I explore the idea of using a projection matrix to reduce the dimensionality of the problem. One approach is to use random projection. The other approach that I explore is learning the projection matrix by developing a machine learning algorithm that is supervised using the text/code pairs provided by StackOverflow, a question-answering website for programmers. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Qi, Letao. "Code Similarity Search in a Latent Space." (2017) Master’s Thesis, Rice University. https://hdl.handle.net/1911/96022. [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/96022 [en_US]
dc.language.iso: eng [en_US]
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.subject: code similarity search [en_US]
dc.subject: latent space [en_US]
dc.title: Code Similarity Search in a Latent Space [en_US]
dc.type: Thesis [en_US]
dc.type.material: Text [en_US]
thesis.degree.department: Computer Science [en_US]
thesis.degree.discipline: Engineering [en_US]
thesis.degree.grantor: Rice University [en_US]
thesis.degree.level: Masters [en_US]
thesis.degree.name: Master of Science [en_US]
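
The dictionary-based scheme described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation: token_jaccard is a hypothetical stand-in for the application-defined similarity functions, the function names are invented for this example, and only the random-projection variant is shown (the learned projection supervised by StackOverflow text/code pairs is not attempted here).

```python
import math
import random


def token_jaccard(a, b):
    """Hypothetical stand-in similarity: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def embed(code, dictionary, sim=token_jaccard):
    """Represent a code snippet by its similarity to each dictionary entry."""
    return [sim(code, entry) for entry in dictionary]


def random_projection(in_dim, out_dim, seed=0):
    """Gaussian random matrix with out_dim rows and in_dim columns."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Multiply the projection matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def search(query, database, dictionary, matrix=None):
    """Rank database codes by cosine similarity to the query, best first.

    If a projection matrix is given, similarity vectors are projected
    into the lower-dimensional space before comparison.
    """
    def represent(code):
        vec = embed(code, dictionary)
        return project(matrix, vec) if matrix is not None else vec

    q = represent(query)
    scored = [(cosine(q, represent(code)), code) for code in database]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch the linear scan in search is only for clarity; the point of moving to a vector representation is that standard nearest-neighbor indexing could then replace the scan. Calling search(query, database, dictionary, random_projection(len(dictionary), k)) compares codes in a k-dimensional projected space instead of the full dictionary space.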
Files
Original bundle
Name: QI-DOCUMENT-2017.pdf
Size: 1.7 MB
Format: Adobe Portable Document Format
License bundle
Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text
Name: LICENSE.txt
Size: 2.6 KB
Format: Plain Text