Code Intelligence Papers

Code intelligence is an interdisciplinary research field that combines machine learning and software engineering. Since large-scale pre-trained language models (PLMs) achieved remarkable improvements in NLP, researchers have been trying to transfer PLMs from natural language to programming languages. After GPT-3 showed the ability to program automatically from human instructions (Codex) and GitHub released the Copilot tool to help developers, the code intelligence area has gradually emerged and begun to show commercial value.

In this repository, I collect a series of papers on code intelligence, including surveys, sub-area papers, pre-trained models, metrics, datasets, etc. Most of these papers were published at top AI or SE conferences and come with open-source code or data. I hope this list helps you with further research.

Survey

Allamanis M, Barr E T, Devanbu P, et al. A Survey of Machine Learning for Big Code and Naturalness. ACM Computing Surveys (CSUR), 2018.
Gros D, Sezhiyan H, Devanbu P, et al. Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation, ASE 2020.
Liu C, Xia X, Lo D, et al. Opportunities and Challenges in Code Search Tools. ACM Computing Surveys (CSUR), 2021.

Text to Code

Code Search

Gu X, Zhang H, Kim S. Deep Code Search, ICSE 2018.
Sachdev S, Li H, Luan S, et al. Retrieval on Source Code: A Neural Code Search, MAPL 2018.
Luan S, Yang D, Barnaby C, et al. Aroma: Code recommendation via structural code search. Proceedings of the ACM on Programming Languages, 2019.
Cambronero J, Li H, Kim S, et al. When Deep Learning Met Code Search, ESEC/FSE 2019.

Natural Language to Code

Text to Code

Yin P, Neubig G. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation, EMNLP 2018.
Sun Z, Zhu Q, Xiong Y, et al. TreeGen: A Tree-Based Transformer Architecture for Code Generation, AAAI 2020.

Text to SQL

Zhong V, Xiong C, Socher R. Seq2SQL: Generating Structured Queries From Natural Language Using Reinforcement Learning, 2017.
Xu X, Liu C, Song D. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. ICLR 2018.
Yu T, Li Z, Zhang Z, et al. TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation, NAACL 2018.

Code to Code

Code Translation (Migration)

Nguyen A T, Nguyen T T, Nguyen T N. Lexical Statistical Machine Translation for Language Migration. ESEC/FSE 2013.
Karaivanov S, Raychev V, Vechev M. Phrase-Based Statistical Translation of Programming Languages. Onward! 2014.
Nguyen A T, Nguyen T T, Nguyen T N. Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code. ASE 2015.
Chen X, Liu C, Song D. Tree-to-tree Neural Networks for Program Translation. NeurIPS 2018.
Fu C, Chen H, Liu H, et al. Coda: An End-to-End Neural Program Decompiler. NeurIPS 2019.
Shiv V, Quirk C. Novel positional encodings to enable tree-based transformers. NeurIPS 2019.
Weisz J D, Muller M, Houde S, et al. Perfection Not Required? Human-AI Partnerships in Code Translation. IUI 2021.

API Mapping

Nguyen T D, Nguyen A T, Nguyen T N. Mapping API Elements for Code Migration with Vector Representations, ICSE-C 2016.
Gu X, Zhang H, Zhang D, et al. DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning. IJCAI 2017.
Nguyen T D, Nguyen A T, Phan H D, et al. Exploring API embedding for API usages and applications. ICSE 2017.
Bui N D Q, Yu Y, Jiang L. SAR: learning cross-language API mappings with little knowledge, ESEC/FSE 2019.
Collie B, Ginsbach P, Woodruff J, et al. M3: Semantic api migrations, ASE 2020.

Code Completion

Svyatkovskiy A, Deng S K, Fu S, et al. IntelliCode Compose: Code Generation Using Transformer. ESEC/FSE 2020.

Code Repair

Code Clone Detection

Code to Text

Alon U, Brody S, Levy O, et al. code2seq: Generating Sequences from Structured Representations of Code, ICLR 2019.

Code Summarization/Documentation

Li J, et al. EditSum: A Retrieve-and-Edit Framework for Source Code Summarization, ASE 2021.
Cheng J, Fostiropoulos I, Boehm B. GN-Transformer: Fusing Sequence and Graph Representation for Improved Code Summarization, 2021.
Shi E, Wang Y, Du L, et al. CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees, EMNLP 2021.
LeClair A, Haque S, Wu L, et al. Improved Code Summarization via a Graph Neural Network, ICPC 2020.
Ahmad W, Chakraborty S, Ray B, et al. A Transformer-based Approach for Source Code Summarization, ACL 2020.
Wan Y, Zhao Z, Yang M, et al. Improving automatic source code summarization via deep reinforcement learning, ASE 2018.
Iyer S, Konstas I, Cheung A, et al. Summarizing Source Code using a Neural Attention Model, ACL 2016.

Code Representation and Pretrained Models

Nguyen T T, Nguyen A T, Nguyen H A, et al. A Statistical Semantic Language Model for Source Code, ESEC/FSE 2013.
Karampatsis R M, Babii H, Robbes R, et al. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code, ICSE 2020.
Karampatsis R M, Sutton C. SCELMo: Source Code Embeddings from Language Models, 2020.
Kanade A, Maniatis P, Balakrishnan G, et al. CuBERT: Learning and Evaluating Contextual Embedding of Source Code. ICML 2020.
Feng Z, Guo D, Tang D, et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. Findings of EMNLP 2020.
Guo D, Ren S, Lu S, et al. GraphCodeBERT: Pre-training Code Representations with Data Flow, ICLR 2021.
Chen M, Tworek J, Jun H, et al. CodeX: Evaluating Large Language Models Trained on Code, 2021.
Lachaux M A, Roziere B, Chanussot L, et al. TransCoder: Unsupervised Translation of Programming Languages, NeurIPS 2020.
Wang Y, Wang W, Joty S, et al. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP 2021.
Ahmad W U, Chakraborty S, Ray B, et al. PLBART: Unified Pre-training for Program Understanding and Generation, NAACL 2021.
Roziere B, Lachaux M A, Szafraniec M, et al. DOBF: A Deobfuscation Pre-Training Objective for Programming Languages, NeurIPS 2021.
Clement C, Drain D, Timcheck J, et al. PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers, EMNLP 2020.
Jung T H. CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model, 2021.
Peng D, Zheng S, Li Y, et al. OSCAR: How could Neural Networks understand Programs? 2021.
Qi W, Gong Y, Yan Y, et al. ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation. 2021.
Wang X, Wang Y, Mi F, et al. SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation. 2021.

Metrics and Estimation

Papineni K, Roukos S, Ward T, et al. BLEU: a Method for Automatic Evaluation of Machine Translation, ACL 2002.
Ren S, Guo D, Lu S, et al. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis, 2020.
Tran N, Tran H, Nguyen S, et al. Does BLEU Score Work for Code Migration?, ICPC 2019.
Agarwal M, Talamadupula K, Houde S, et al. Quality Estimation & Interpretability for Code Translation, NeurIPS 2020 Workshop on Computer-Assisted Programming.

Datasets and Benchmark

Lu S, Guo D, Ren S, et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, 2021.
Ahmad W U, Tushar M G R, Chakraborty S, et al. AVATAR: A Parallel Corpus for Java-Python Program Translation, 2021.
Puri R, Kung D S, Janssen G, et al. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks, NeurIPS 2021.
LeClair A, McMillan C. Recommendations for Datasets for Source Code Summarization, NAACL 2019.
Husain H, Wu H H, Gazit T, et al. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. 2019.
Hu X, Li G, Xia X, et al. Summarizing source code with transferred API knowledge, IJCAI 2018.
Miceli-Barone A V, Sennrich R. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation, IJCNLP 2017.
Yu T, Zhang R, Yang K, et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task, EMNLP 2018.
Lin X V, Wang C, Zettlemoyer L, et al. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System, LREC 2018.
