Code intelligence is an interdisciplinary field that combines machine learning and software engineering. Since large-scale pre-trained language models (PLMs) brought striking improvements to NLP, researchers have been trying to transfer PLMs from natural language to programming languages. With GPT-3 and its code-focused descendant Codex showing that models can write programs from human instructions, and GitHub releasing Copilot to help developers, code intelligence is gradually emerging as an area in its own right and showing commercial value.
In this repository, I collect a series of papers on code intelligence, including surveys, sub-area papers, pre-trained models, metrics, datasets, etc. Most of these papers were published at top AI or SE conferences and come with open-source code or data. I hope this list helps you with further research.
- Allamanis M, Barr E T, Devanbu P, et al. A Survey of Machine Learning for Big Code and Naturalness. ACM Computing Surveys (CSUR), 2018.
- Gros D, Sezhiyan H, Devanbu P, et al. Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation, ASE 2020.
- Liu C, Xia X, Lo D, et al. Opportunities and Challenges in Code Search Tools. ACM Computing Surveys (CSUR), 2021.
- Gu X, Zhang H, Kim S. Deep Code Search, ICSE 2018.
- Sachdev S, Li H, Luan S, et al. Retrieval on Source Code: A Neural Code Search, MAPL 2018.
- Luan S, Yang D, Barnaby C, et al. Aroma: Code recommendation via structural code search. Proceedings of the ACM on Programming Languages, 2019.
- Cambronero J, Li H, Kim S, et al. When Deep Learning Met Code Search, ESEC/FSE 2019.
- Yin P, Neubig G. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation, EMNLP 2018.
- Sun Z, Zhu Q, Xiong Y, et al. TreeGen: A Tree-Based Transformer Architecture for Code Generation, AAAI 2020.
- Zhong V, Xiong C, Socher R. Seq2SQL: Generating Structured Queries From Natural Language Using Reinforcement Learning, 2017.
- Xu X, Liu C, Song D. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. ICLR 2018.
- Yu T, Li Z, Zhang Z, et al. TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation, NAACL 2018.
- Nguyen A T, Nguyen T T, Nguyen T N. Lexical Statistical Machine Translation for Language Migration. ESEC/FSE 2013.
- Karaivanov S, Raychev V, Vechev M. Phrase-Based Statistical Translation of Programming Languages. Onward! 2014.
- Nguyen A T, Nguyen T T, Nguyen T N. Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code. ASE 2015.
- Chen X, Liu C, Song D. Tree-to-tree Neural Networks for Program Translation. NeurIPS 2018.
- Fu C, Chen H, Liu H, et al. Coda: An End-to-End Neural Program Decompiler. NeurIPS 2019.
- Shiv V, Quirk C. Novel positional encodings to enable tree-based transformers. NeurIPS 2019.
- Weisz J D, Muller M, Houde S, et al. Perfection Not Required? Human-AI Partnerships in Code Translation. IUI 2021.
- Nguyen T D, Nguyen A T, Nguyen T N. Mapping API Elements for Code Migration with Vector Representations, ICSE-C 2016.
- Gu X, Zhang H, Zhang D, et al. DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning. IJCAI 2017.
- Nguyen T D, Nguyen A T, Phan H D, et al. Exploring API embedding for API usages and applications. ICSE 2017.
- Bui N D Q, Yu Y, Jiang L. SAR: learning cross-language API mappings with little knowledge, ESEC/FSE 2019.
- Collie B, Ginsbach P, Woodruff J, et al. M3: Semantic API migrations, ASE 2020.
- Svyatkovskiy A, Deng S K, Fu S, et al. IntelliCode Compose: Code Generation Using Transformer. ESEC/FSE 2020.
- Alon U, Brody S, Levy O, et al. code2seq: Generating Sequences from Structured Representations of Code, ICLR 2019.
- Li J, et al. EditSum: A Retrieve-and-Edit Framework for Source Code Summarization, ASE 2021.
- Cheng J, Fostiropoulos I, Boehm B. GN-Transformer: Fusing Sequence and Graph Representation for Improved Code Summarization, 2021.
- Shi E, Wang Y, Du L, et al. CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees, EMNLP 2021.
- LeClair A, Haque S, Wu L, et al. Improved Code Summarization via a Graph Neural Network, ICPC 2020.
- Ahmad W, Chakraborty S, Ray B, et al. A Transformer-based Approach for Source Code Summarization, ACL 2020.
- Wan Y, Zhao Z, Yang M, et al. Improving automatic source code summarization via deep reinforcement learning, ASE 2018.
- Iyer S, Konstas I, Cheung A, et al. Summarizing Source Code using a Neural Attention Model, ACL 2016.
- Nguyen T T, Nguyen A T, Nguyen H A, et al. A Statistical Semantic Language Model for Source Code, ESEC/FSE 2013.
- Karampatsis R M, Babii H, Robbes R, et al. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code, ICSE 2020.
- Karampatsis R M, Sutton C. SCELMo: Source Code Embeddings from Language Models, 2020.
- Kanade A, Maniatis P, Balakrishnan G, et al. CuBERT: Learning and Evaluating Contextual Embedding of Source Code. ICML 2020.
- Feng Z, Guo D, Tang D, et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. Findings of EMNLP 2020.
- Guo D, Ren S, Lu S, et al. GraphCodeBERT: Pre-training Code Representations with Data Flow, ICLR 2021.
- Chen M, Tworek J, Jun H, et al. Codex: Evaluating Large Language Models Trained on Code, 2021.
- Lachaux M A, Roziere B, Chanussot L, et al. TransCoder: Unsupervised Translation of Programming Languages, NeurIPS 2020.
- Wang Y, Wang W, Joty S, et al. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, EMNLP 2021.
- Ahmad W U, Chakraborty S, Ray B, et al. PLBART: Unified Pre-training for Program Understanding and Generation, NAACL 2021.
- Roziere B, Lachaux M A, Szafraniec M, et al. DOBF: A Deobfuscation Pre-Training Objective for Programming Languages, NeurIPS 2021.
- Clement C, Drain D, Timcheck J, et al. PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers, EMNLP 2020.
- Jung T H. CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model, 2021.
- Peng D, Zheng S, Li Y, et al. OSCAR: How could Neural Networks understand Programs? 2021.
- Qi W, Gong Y, Yan Y, et al. ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation. 2021.
- Wang X, Wang Y, Mi F, et al. SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation. 2021.
- Papineni K, Roukos S, Ward T, et al. BLEU: a Method for Automatic Evaluation of Machine Translation, ACL 2002.
- Ren S, Guo D, Lu S, et al. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis, 2020.
- Tran N, Tran H, Nguyen S, et al. Does BLEU Score Work for Code Migration?, ICPC 2019.
- Agarwal M, Talamadupula K, Houde S, et al. Quality Estimation & Interpretability for Code Translation, NeurIPS 2020 Workshop on Computer-Assisted Programming.
- Lu S, Guo D, Ren S, et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, 2021.
- Ahmad W U, Tushar M G R, Chakraborty S, et al. AVATAR: A Parallel Corpus for Java-Python Program Translation, 2021.
- Puri R, Kung D S, Janssen G, et al. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks, NeurIPS 2021.
- LeClair A, McMillan C. Recommendations for Datasets for Source Code Summarization, NAACL 2019.
- Husain H, Wu H H, Gazit T, et al. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. 2019.
- Hu X, Li G, Xia X, et al. Summarizing source code with transferred API knowledge, IJCAI 2018.
- Miceli-Barone A V, Sennrich R. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation, IJCNLP 2017.
- Yu T, Zhang R, Yang K, et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task, EMNLP 2018.
- Lin X V, Wang C, Zettlemoyer L, et al. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System, LREC 2018.