GLUE Benchmark

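The GLUE (General Language Understanding Evaluation) benchmark is a collection of English-language resources for training, evaluating, and analyzing natural language understanding systems. It is widely used to evaluate language models.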

 

Paper: https://openreview.net/pdf?id=rJ4km2R5t7

Site: https://gluebenchmark.com/

 


 

The benchmark comprises nine datasets, which fall into three broad categories.
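As a quick orientation, the grouping can be written down as a small lookup table. This is only an illustrative sketch: the names `GLUE_TASKS` and `category_of` are made up for this post and are not part of any GLUE tooling.

```python
# The nine GLUE datasets grouped into the three categories used in this post.
# GLUE_TASKS and category_of are hypothetical names chosen for illustration.
GLUE_TASKS = {
    "single_sentence": ["CoLA", "SST-2"],
    "similarity_and_paraphrase": ["MRPC", "QQP", "STS-B"],
    "inference": ["MNLI", "QNLI", "RTE", "WNLI"],
}


def category_of(task: str) -> str:
    """Return the category a task belongs to, or raise KeyError if unknown."""
    for category, tasks in GLUE_TASKS.items():
        if task in tasks:
            return category
    raise KeyError(task)


# Sanity check: the three groups together should contain nine datasets.
total = sum(len(tasks) for tasks in GLUE_TASKS.values())
print(total)
print(category_of("STS-B"))
```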

 

SINGLE-SENTENCE TASKS

Classification tasks over a single sentence; there are two datasets.

 

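  • CoLA (Corpus of Linguistic Acceptability): the task is to predict whether an English sentence is grammatically acceptable.
    • Data Format
      • Column 1: a code indicating the source of the sentence
      • Column 2: the acceptability label (0 = unacceptable, 1 = acceptable)
      • Column 3: the acceptability judgment as originally notated by the author
      • Column 4: the sentence
    • Example
      Column 1 | Column 2 | Column 3 | Column 4
      gj04 | 1 | | Our friends won't buy this analysis, let alone the next one we propose.
      gj04 | 0 | * | They drank the pub.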

 

  • SST-2 (Stanford Sentiment Treebank): the task is to predict the sentiment of a movie review (0 = negative, 1 = positive).
    • Example
      sentence | label
      contains no wit , only labored gags | 0
      that loves its characters and communicates something rather beautiful about human nature | 1
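The four-column CoLA layout described above can be read with Python's standard `csv` module. The sketch below assumes the rows are tab-separated with no header line; that file layout, the inline `SAMPLE` string, and the `parse_cola` helper are assumptions made for illustration, so adjust them to the file you actually have.

```python
import csv
import io

# Two CoLA-style rows copied from the example above, joined with tabs.
# Assumption: tab-separated, four columns, no header row.
SAMPLE = (
    "gj04\t1\t\tOur friends won't buy this analysis, let alone the next one we propose.\n"
    "gj04\t0\t*\tThey drank the pub.\n"
)


def parse_cola(text):
    """Yield one dict per row: source code, 0/1 label, author's mark, sentence."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for source, label, original_mark, sentence in reader:
        yield {
            "source": source,
            "label": int(label),             # 0 = unacceptable, 1 = acceptable
            "original_mark": original_mark,  # "*" or empty
            "sentence": sentence,
        }


rows = list(parse_cola(SAMPLE))
for row in rows:
    print(row["label"], row["sentence"])
```

The same pattern should carry over to SST-2 by unpacking two columns (sentence, label) instead of four, again assuming tab-separated input.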

 

SIMILARITY AND PARAPHRASE TASKS

Tasks that either classify whether the two sentences in a pair are similar or predict a similarity score for the pair; there are three datasets.

 

  • MRPC (Microsoft Research Paraphrase Corpus): a corpus of sentence pairs extracted from online news; the task is to predict whether the two sentences in a pair have the same meaning.
    • Example
      Quality | #1 ID | #2 ID | #1 String | #2 String
      1 | 702876 | 702977 | Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence. | Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.
      0 | 2108705 | 2108831 | Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion. | Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.

 

  • QQP (Quora Question Pairs): a corpus of question pairs extracted from the question-and-answer website Quora; the task is to predict whether the two questions in a pair have the same meaning.
    • Example
      id | qid1 | qid2 | question1 | question2 | is_duplicate
      133273 | 213221 | 213222 | How is the life of a math student? Could you describe your own experiences? | Which level of prepration is enough for the exam jlpt5? | 0
      402555 | 536040 | 536041 | How do I control my horny emotions? | How do you control your horniness? | 1

 

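  • STS-B (Semantic Textual Similarity Benchmark): sentence pairs drawn from news headlines, video and image captions, and natural language inference data; the task is to predict how similar in meaning the two sentences are as a score from 1 to 5 (this is the range given in the GLUE paper; the scores in the STS-B data itself range from 0 to 5).
    • Example
      index | genre | filename | year | old_index | source1 | source2 | sentence1 | sentence2 | score
      0 | main-captions | MSRvid | 2012test | 0001 | none | none | A plane is taking off. | An air plane is taking off. | 5.000
      1 | main-captions | MSRvid | 2012test | 0004 | none | none | A man is playing a large flute. | A man is playing a flute. | 3.800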
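Unlike MRPC and QQP, STS-B is a regression task, so predictions are usually compared with the gold scores through a correlation coefficient rather than accuracy. The sketch below computes Pearson correlation with the standard library only. The `pearson` helper is written for this post, the first two gold scores come from the example above, and the third gold score and all predictions are made-up values for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# 5.0 and 3.8 are the gold scores from the example rows; 1.0 is hypothetical.
gold = [5.0, 3.8, 1.0]
# Hypothetical model outputs, not results from any real system.
predicted = [4.6, 3.5, 1.9]

r = pearson(gold, predicted)
print(round(r, 3))
```

For the two classification datasets in this group, a plain comparison of predicted and gold labels would take the place of the correlation.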

 

INFERENCE TASKS

Natural language inference tasks; there are four datasets.

 

  • MNLI (Multi-Genre Natural Language Inference Corpus): given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts it, or neither.
    • Example
      • index: 0
      • promptID: 31193
      • pairID: 31193n
      • genre: government
      • sentence1_binary_parse: ( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )
      • sentence2_binary_parse: ( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )
      • sentence1_parse: (ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .)))
      • sentence2_parse: (ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .)))
      • sentence1: Conceptually cream skimming has two basic dimensions - product and geography.
      • sentence2: Product and geography are what make cream skimming work.
      • label1: neutral
      • gold_label: neutral

 

  • QNLI (Question NLI, derived from the Stanford Question Answering Dataset): built from Wikipedia data; the task is to predict whether a sentence contains the answer to a question.
    • Example
      index | question | sentence | label
      0 | When did the third Digimon series begin? | Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese. | not_entailment
      2 | What two things does Popper argue Tarski's theory involves in an evaluation of truth? | He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer. | entailment

 

  • RTE (Recognizing Textual Entailment): data based on news and Wikipedia text; the task is to predict whether the first sentence of a pair entails the second.
    • Example
      index | sentence1 | sentence2 | label
      0 | No Weapons of Mass Destruction Found in Iraq Yet. | Weapons of Mass Destruction Found in Iraq. | not_entailment
      1 | A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI. | Pope Benedict XVI is the new leader of the Roman Catholic Church. | entailment

 

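  • WNLI (Winograd NLI, derived from the Winograd Schema Challenge): the underlying task is to determine what a pronoun in a sentence refers to. In this sentence-pair form, the second sentence replaces the pronoun with a candidate referent, and the label indicates whether it follows from the first sentence.
    • Example
      index | sentence1 | sentence2 | label
      0 | I stuck a pin through a carrot. When I pulled the pin out, it had a hole. | The carrot had a hole. | 1
      3 | Steve follows Fred's example in everything. He influences him hugely. | Steve influences him hugely. | 0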
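The four inference datasets store their labels differently in the examples above: strings for MNLI, QNLI, and RTE, and 0/1 for WNLI. A classifier usually needs integer class ids, so a small encoding step is typical. The sketch below is one way to do that. The `LABELS` table and `encode_label` helper are hypothetical, and the integer ids are an arbitrary choice for illustration rather than an official GLUE mapping.

```python
# Label vocabularies per task, following the examples and descriptions above.
# The position of each string in its list is used as the class id; this
# ordering is an arbitrary assumption, not an official GLUE convention.
LABELS = {
    "MNLI": ["entailment", "neutral", "contradiction"],
    "QNLI": ["entailment", "not_entailment"],
    "RTE": ["entailment", "not_entailment"],
}


def encode_label(task: str, label: str) -> int:
    """Map a raw label from the data file to an integer class id."""
    if task == "WNLI":
        # WNLI labels already appear as 0/1 in the example rows.
        value = int(label)
        if value not in (0, 1):
            raise ValueError(f"unexpected WNLI label: {label!r}")
        return value
    try:
        return LABELS[task].index(label)
    except ValueError:
        raise ValueError(f"unexpected {task} label: {label!r}") from None


print(encode_label("MNLI", "neutral"))
print(encode_label("QNLI", "not_entailment"))
print(encode_label("WNLI", "1"))
```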

 

 

As of 2021-11-07, the ranking of models evaluated on the GLUE benchmark is as follows.

[Image: GLUE leaderboard]

 


REFERENCE

https://openreview.net/pdf?id=rJ4km2R5t7

https://gluebenchmark.com/