# flow.dag.yaml (forked from microsoft/promptflow)
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
environment:
  python_requirements_txt: requirements.txt
inputs:
  question:
    type: string
    default: What is the name of the new language representation model introduced in
      the document?
  answer:
    type: string
    default: The document mentions multiple language representation models, so it is
      unclear which one is being referred to as "new". Can you provide more
      specific information or context?
  context:
    type: string
    default: '["statistical language modeling. arXiv preprint arXiv:1312.3005 . Z.
      Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.
      Christopher Clark and Matt Gardner. 2018. Simple and effective
      multi-paragraph reading comprehen- sion. In ACL.Kevin Clark, Minh-Thang
      Luong, Christopher D Man- ning, and Quoc Le. 2018. Semi-supervised se-
      quence modeling with cross-view training. In Pro- ceedings of the 2018
      Conference on Empirical Meth- ods in Natural Language Processing , pages
      1914\u2013 1925. Ronan Collobert and Jason Weston. 2008. A uni\ufb01ed
      architecture for natural language processing: Deep neural networks with
      multitask learning. In Pro- ceedings of the 25th international conference
      on Machine learning , pages 160\u2013167. ACM. Alexis Conneau, Douwe
      Kiela, Holger Schwenk, Lo \u00a8\u0131c Barrault, and Antoine Bordes.
      2017. Supervised learning of universal sentence representations from
      natural language inference data. In Proceedings of the 2017 Conference on
      Empirical Methods in Nat- ural Language Processing , pages 670\u2013680,
      Copen- hagen, Denmark. Association for Computational Linguistics. Andrew M
      Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in
      neural informa- tion processing systems , pages 3079\u20133087. J. Deng,
      W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- Fei. 2009. ImageNet: A
      Large-Scale Hierarchical Image Database. In CVPR09 . William B Dolan and
      Chris Brockett. 2005. Automati- cally constructing a corpus of sentential
      paraphrases. InProceedings of the Third International Workshop on
      Paraphrasing (IWP2005) . William Fedus, Ian Goodfellow, and Andrew M Dai.
      2018. Maskgan: Better text generation via \ufb01lling in the.arXiv
      preprint arXiv:1801.07736 . Dan Hendrycks and Kevin Gimpel. 2016. Bridging
      nonlinearities and stochastic regularizers with gaussian error linear
      units. CoRR , abs\/1606.08415. Felix Hill, Kyunghyun Cho, and Anna
      Korhonen. 2016. Learning distributed representations of sentences from
      unlabelled data. In Proceedings of the 2016 Conference of the North
      American Chapter of the Association for Computational Linguistics: Human
      Language Technologies . Association for Computa- tional Linguistics.
      Jeremy Howard and Sebastian Ruder. 2018. Universal language model
      \ufb01ne-tuning for text classi\ufb01cation. In ACL. Association for
      Computational Linguistics. Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng
      Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine
      reading comprehen- sion. In IJCAI . Yacine Jernite, Samuel R. Bowman, and
      David Son- tag. 2017. Discourse-based objectives for fast un- supervised
      sentence representation learning. CoRR , abs\/1705.00557.Mandar Joshi,
      Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large
      scale distantly supervised challenge dataset for reading comprehen- sion.
      In ACL. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel,
      Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought
      vectors. In Advances in neural information processing systems , pages
      3294\u20133302. Quoc Le and Tomas Mikolov. 2014. Distributed rep-
      resentations of sentences and documents. In Inter- national Conference on
      Machine Learning , pages 1188\u20131196. Hector J Levesque, Ernest Davis,
      and Leora Morgen- stern. 2011. The winograd schema challenge. In Aaai
      spring symposium: Logical formalizations of commonsense reasoning , volume
      46, page 47. Lajanugen Logeswaran and Honglak Lee. 2018. An ef\ufb01cient
      framework for learning sentence represen- tations. In International
      Conference on Learning Representations . Bryan McCann, James Bradbury,
      Caiming Xiong, and Richard Socher. 2017. Learned in translation:
      Con-","tool for measuring readability. Journalism Bulletin ,
      30(4):415\u2013433. Erik F Tjong Kim Sang and Fien De Meulder. 2003.
      Introduction to the conll-2003 shared task: Language-independent named
      entity recognition. In CoNLL . Joseph Turian, Lev Ratinov, and Yoshua
      Bengio. 2010. Word representations: A simple and general method for
      semi-supervised learning. In Proceedings of the 48th Annual Meeting of the
      Association for Compu- tational Linguistics , ACL \u201910, pages
      384\u2013394. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
      Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
      Attention is all you need. In Advances in Neural Information Pro- cessing
      Systems , pages 6000\u20136010. Pascal Vincent, Hugo Larochelle, Yoshua
      Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust
      features with denoising autoen- coders. In Proceedings of the 25th
      international conference on Machine learning , pages 1096\u20131103. ACM.
      Alex Wang, Amanpreet Singh, Julian Michael, Fe- lix Hill, Omer Levy, and
      Samuel Bowman. 2018a. Glue: A multi-task benchmark and analysis
      platformfor natural language understanding. In Proceedings of the 2018
      EMNLP Workshop BlackboxNLP: An- alyzing and Interpreting Neural Networks
      for NLP , pages 353\u2013355. Wei Wang, Ming Yan, and Chen Wu. 2018b.
      Multi- granularity hierarchical attention fusion networks for reading
      comprehension and question answering. InProceedings of the 56th Annual
      Meeting of the As- sociation for Computational Linguistics (Volume 1: Long
      Papers) . Association for Computational Lin- guistics. Alex Warstadt,
      Amanpreet Singh, and Samuel R Bow- man. 2018. Neural network acceptability
      judg- ments. arXiv preprint arXiv:1805.12471 . Adina Williams, Nikita
      Nangia, and Samuel R Bow- man. 2018. A broad-coverage challenge corpus for
      sentence understanding through inference. In NAACL . Yonghui Wu, Mike
      Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey,
      Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016.
      Google\u2019s neural ma- chine translation system: Bridging the gap
      between human and machine translation. arXiv preprint arXiv:1609.08144 .
      Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How
      transferable are features in deep neural networks? In Advances in neural
      information processing systems , pages 3320\u20133328. Adams Wei Yu, David
      Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V
      Le. 2018. QANet: Combining local convolution with global self-attention
      for reading comprehen- sion. In ICLR . Rowan Zellers, Yonatan Bisk, Roy
      Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset
      for grounded commonsense inference. In Proceed- ings of the 2018
      Conference on Empirical Methods in Natural Language Processing (EMNLP) .
      Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut- dinov, Raquel Urtasun,
      Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies:
      Towards story-like visual explanations by watching movies and reading
      books. In Proceedings of the IEEE international conference on computer
      vision , pages 19\u201327. Appendix for \u201cBERT: Pre-training of Deep
      Bidirectional Transformers for Language Understanding\u201d We organize
      the appendix into three sections: \u2022 Additional implementation details
      for BERT are presented in Appendix A;\u2022 Additional details for our
      experiments are presented in Appendix B; and \u2022 Additional ablation
      studies are presented in Appendix C. We present additional ablation
      studies for BERT including: \u2013Effect of Number of Training Steps; and
      \u2013Ablation for Different"]} {"question": "What is the main difference
      between BERT and previous language representation models?", "variant_id":
"v1", "line_number": 2, answer":"BERT is designed to pre-train deep
      bidirectional representations from unlabeled text by jointly conditioning
      on both left and right context in all layers, allowing it to incorporate
      context from both directions. This is unlike previous language
      representation models that are unidirectional, which limits the choice of
      architectures that can be used during pre-training and could be
      sub-optimal for sentence-level tasks and token-level tasks such as
      question answering.","context":["BERT: Pre-training of Deep Bidirectional
      Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton
      Lee Kristina Toutanova Google AI Language
      fjacobdevlin,mingweichang,kentonl,kristout g@google.com Abstract We
      introduce a new language representa- tion model called BERT , which stands
      for Bidirectional Encoder Representations from Transformers. Unlike recent
      language repre- sentation models (Peters et al., 2018a; Rad- ford et al.,
      2018), BERT is designed to pre- train deep bidirectional representations
      from unlabeled text by jointly conditioning on both left and right context
      in all layers. As a re- sult, the pre-trained BERT model can be \ufb01ne-
      tuned with just one additional output layer to create state-of-the-art
      models for a wide range of tasks, such as question answering and language
      inference, without substantial task- speci\ufb01c architecture
      modi\ufb01cations. BERT is conceptually simple and empirically powerful.
      It obtains new state-of-the-art re- sults on eleven natural language
      processing tasks, including pushing the GLUE score to 80.5% (7.7% point
      absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute
      improvement), SQuAD v1.1 question answer- ing Test F1 to 93.2 (1.5 point
      absolute im- provement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute
      improvement). 1 Introduction Language model pre-training has been shown to
      be effective for improving many natural language processing tasks (Dai and
      Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder,
      2018). These include sentence-level tasks such as natural language
      inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing
      (Dolan and Brockett, 2005), which aim to predict the re- lationships
      between sentences by analyzing them holistically, as well as token-level
      tasks such as named entity recognition and question answering, where
      models are required to produce \ufb01ne-grained output at the token level
      (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).There are
      two existing strategies for apply- ing pre-trained language
      representations to down- stream tasks: feature-based and\ufb01ne-tuning .
      The feature-based approach, such as ELMo (Peters et al., 2018a), uses
      task-speci\ufb01c architectures that include the pre-trained
      representations as addi- tional features. The \ufb01ne-tuning approach,
      such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et
      al., 2018), introduces minimal task-speci\ufb01c parameters, and is
      trained on the downstream tasks by simply \ufb01ne-tuning allpre- trained
      parameters. The two approaches share the same objective function during
      pre-training, where they use unidirectional language models to learn
      general language representations. We argue that current techniques
      restrict the power of the pre-trained representations, espe- cially for
      the \ufb01ne-tuning approaches. The ma- jor limitation is that standard
      language models are unidirectional, and this limits the choice of archi-
      tectures that can be used during pre-training. For example, in OpenAI GPT,
      the authors use a left-to- right architecture, where every token can only
      at- tend to previous tokens in the self-attention layers of the
      Transformer (Vaswani et al., 2017). Such re- strictions are sub-optimal
      for sentence-level tasks, and could be very harmful when applying
      \ufb01ne- tuning based approaches to token-level tasks such as question
      answering, where it is crucial to incor- porate context from both
      directions. In this paper, we improve the \ufb01ne-tuning based approaches
      by proposing BERT: Bidirectional Encoder Representations from
      Transformers.","the self-attention layers of the Transformer (Vaswani et
      al., 2017). Such re- strictions are sub-optimal for sentence-level tasks,
      and could be very harmful when applying \ufb01ne- tuning based approaches
      to token-level tasks such as question answering, where it is crucial to
      incor- porate context from both directions. In this paper, we improve the
      \ufb01ne-tuning based approaches by proposing BERT: Bidirectional Encoder
      Representations from Transformers. BERT alleviates the previously
      mentioned unidi- rectionality constraint by using a \u201cmasked lan-
      guage model\u201d (MLM) pre-training objective, in- spired by the Cloze
      task (Taylor, 1953). The masked language model randomly masks some of the
      tokens from the input, and the objective is to predict the original
      vocabulary id of the maskedarXiv:1810.04805v2 [cs.CL] 24 May 2019word
      based only on its context. Unlike left-to- right language model
      pre-training, the MLM ob- jective enables the representation to fuse the
      left and the right context, which allows us to pre- train a deep
      bidirectional Transformer. In addi- tion to the masked language model, we
      also use a \u201cnext sentence prediction\u201d task that jointly pre-
      trains text-pair representations. The contributions of our paper are as
      follows: \u2022 We demonstrate the importance of bidirectional
      pre-training for language representations. Un- like Radford et al. (2018),
      which uses unidirec- tional language models for pre-training, BERT uses
      masked language models to enable pre- trained deep bidirectional
      representations. This is also in contrast to Peters et al. (2018a), which
      uses a shallow concatenation of independently trained left-to-right and
      right-to-left LMs. \u2022 We show that pre-trained representations reduce
      the need for many heavily-engineered task- speci\ufb01c architectures.
      BERT is the \ufb01rst \ufb01ne- tuning based representation model that
      achieves state-of-the-art performance on a large suite of sentence-level
      andtoken-level tasks, outper- forming many task-speci\ufb01c
      architectures. \u2022 BERT advances the state of the art for eleven NLP
      tasks. The code and pre-trained mod- els are available at
      https:\/\/github.com\/ google-research\/bert . 2 Related Work There is a
      long history of pre-training general lan- guage representations, and we
      brie\ufb02y review the most widely-used approaches in this section. 2.1
      Unsupervised Feature-based Approaches Learning widely applicable
      representations of words has been an active area of research for decades,
      including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et
      al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014)
      methods. Pre-trained word embeddings are an integral part of modern NLP
      systems, of- fering signi\ufb01cant improvements over embeddings learned
      from scratch (Turian et al., 2010). To pre- train word embedding vectors,
      left-to-right lan- guage modeling objectives have been used (Mnih and
      Hinton, 2009), as well as objectives to dis- criminate correct from
      incorrect words in left and right context (Mikolov et al., 2013).These
      approaches have been generalized to coarser granularities, such as
      sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or
      paragraph embeddings (Le and Mikolov, 2014). "]'
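  # The input defaults above form one sample evaluation line: a question, the
  # answer under test, and the retrieved context (raw text extracted from the
  # BERT paper, including its reference list).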
outputs:
  groundedness:
    type: string
    reference: ${parse_score.output}
nodes:
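# Graph: gpt_groundedness (llm) -> parse_score (python) -> aggregate (python).
# The llm node rates how grounded the answer is in the context, parse_score
# turns the raw completion into a number, and aggregate reduces the per-line
# scores into a single run metric.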
- name: parse_score
  type: python
  source:
    type: code
    path: calc_groundedness.py
  inputs:
    gpt_score: ${gpt_groundedness.output}
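# Aggregation node: with aggregation set to true, promptflow runs this once per
# batch, passing the list of parse_score outputs collected across all lines.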
- name: aggregate
  type: python
  source:
    type: code
    path: aggregate.py
  inputs:
    groundedness_scores: ${parse_score.output}
  aggregation: true
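# LLM node: max_tokens 5 with temperature 0 forces a short, deterministic
# completion; the prompt in gpt_groundedness.md presumably asks for a bare
# numeric rating that parse_score can extract.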
- name: gpt_groundedness
  type: llm
  source:
    type: code
    path: gpt_groundedness.md
  inputs:
    # Keeping both of the next two inputs makes it easy to switch between
    # OpenAI and Azure OpenAI: deployment_name is required by Azure OpenAI;
    # model is required by OpenAI.
    deployment_name: gpt-35-turbo
    model: gpt-3.5-turbo
    max_tokens: 5
    answer: ${inputs.answer}
    question: ${inputs.question}
    context: ${inputs.context}
    temperature: 0
  connection: open_ai_connection
  api: chat
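The contents of calc_groundedness.py and aggregate.py are not shown on this page, so the sketches below are only a rough guide to what the two python nodes likely do. First, a minimal sketch of calc_groundedness.py, assuming the standard promptflow `@tool` decorator; the regex and the fallback value are illustrative, not the actual implementation:

```python
import re

from promptflow import tool


@tool
def parse_score(gpt_score: str) -> float:
    """Extract a numeric rating from the LLM completion.

    With max_tokens: 5 and temperature: 0, gpt_groundedness should return a
    bare number such as "5", but a regex keeps parsing robust to stray text.
    """
    match = re.search(r"\d+(\.\d+)?", gpt_score)
    # Fall back to the lowest rating when no number is found (illustrative).
    return float(match.group()) if match else 1.0
```

A sketch of aggregate.py under the same assumptions; a simple mean is one plausible reduction, and the metric name is made up for illustration:

```python
from typing import List

from promptflow import log_metric, tool


@tool
def aggregate(groundedness_scores: List[float]) -> float:
    """Reduce the per-line scores collected by the aggregation node."""
    average = (
        sum(groundedness_scores) / len(groundedness_scores)
        if groundedness_scores
        else 0.0
    )
    # Record the value as a run metric (the metric name is an assumption).
    log_metric("avg_groundedness", average)
    return average
```

Assuming a connection named open_ai_connection has been created, the flow can be exercised against the defaults above with `pf flow test --flow .` run from the flow directory.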