GLUE Benchmark

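The GLUE (General Language Understanding Evaluation) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. All of its data is in English, and it is widely used for evaluating language models.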

Paper: https://openreview.net/pdf?id=rJ4km2R5t7

Site: https://gluebenchmark.com/

The tasks fall into three broad categories, covering nine datasets in total.
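As a quick orientation, the grouping described in this post can be written down as a small lookup table. This is only a sketch: the name `GLUE_TASKS` and the category keys are illustrative choices, not identifiers from the GLUE distribution.

```python
# Task grouping as described in this post.
# GLUE_TASKS and its keys are illustrative names, not part of GLUE itself.
GLUE_TASKS = {
    "single-sentence": ["CoLA", "SST-2"],
    "similarity and paraphrase": ["MRPC", "QQP", "STS-B"],
    "inference": ["MNLI", "QNLI", "RTE", "WNLI"],
}

# Count the datasets across all categories.
total = sum(len(tasks) for tasks in GLUE_TASKS.values())
print(f"{len(GLUE_TASKS)} categories, {total} datasets")
```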

SINGLE-SENTENCE TASKS

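These are classification tasks over a single sentence. There are two datasets.

CoLA (Corpus of Linguistic Acceptability): the task is to predict whether an English sentence is grammatically acceptable.
- Data Format
  - Column 1: a code identifying the source of the sentence
  - Column 2: the acceptability label (0 = unacceptable, 1 = acceptable)
  - Column 3: the acceptability judgment as originally notated by the author
  - Column 4: the sentence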
- Example
  
  Column 1	Column 2	Column 3	Column 4
  gj04	1		Our friends won't buy this analysis, let alone the next one we propose.
  gj04	0	*	They drank the pub.
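The four-column layout above can be read with Python's standard `csv` module. The sketch below assumes the columns are tab-separated and that there is no header row, as in the example; `read_cola` is a hypothetical helper name, and the sample rows are the two shown above.

```python
import csv
import io

# Two rows copied from the example above; columns are tab-separated.
SAMPLE = (
    "gj04\t1\t\tOur friends won't buy this analysis, let alone the next one we propose.\n"
    "gj04\t0\t*\tThey drank the pub.\n"
)


def read_cola(handle):
    """Yield (source, label, original_mark, sentence) from CoLA-style rows.

    read_cola is a hypothetical helper written for this post.
    """
    # QUOTE_NONE treats quotation marks in sentences as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for source, label, original_mark, sentence in reader:
        yield source, int(label), original_mark, sentence


rows = list(read_cola(io.StringIO(SAMPLE)))
for source, label, original_mark, sentence in rows:
    print(label, repr(original_mark), sentence)
```

Column 3 is empty for the acceptable sentence and holds `*` for the unacceptable one, which matches the usual linguistic notation for ungrammatical examples.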

SST-2 (Stanford Sentiment Treebank): the task is to predict the sentiment of movie reviews (0 = negative, 1 = positive).
- Example
  
  sentence	label
  contains no wit , only labored gags	0
  that loves its characters and communicates something rather beautiful about human nature	1

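SIMILARITY AND PARAPHRASE TASKS

These tasks take a sentence pair and either classify whether the two are similar or predict a similarity score. There are three datasets.

MRPC (Microsoft Research Paraphrase Corpus): a corpus of sentence pairs extracted from online news; the task is to predict whether the two sentences have the same meaning (Quality 1 = same meaning, 0 = different).

Example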

Quality	#1 ID	#2 ID	#1 String	#2 String
1	702876	702977	Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.	Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.
0	2108705	2108831	Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.	Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.
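Sentence-pair files like this one differ mainly in their column names, so a small reader that takes the column names as arguments can be reused across tasks. The sketch below assumes a tab-separated file with a header row, as in the example above; `read_pairs` is a hypothetical helper, not part of any GLUE tooling.

```python
import csv
import io

# Header and rows copied from the MRPC example above (tab-separated).
SAMPLE = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t702876\t702977\t"
    'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.\t'
    'Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.\n'
    "0\t2108705\t2108831\t"
    "Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.\t"
    "Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.\n"
)


def read_pairs(handle, first, second, label):
    """Yield (text_a, text_b, label) from a tab-separated file with a header.

    read_pairs is a hypothetical helper; column names are parameters because
    each file names its columns differently.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row[first], row[second], row[label]


pairs = list(read_pairs(io.StringIO(SAMPLE), "#1 String", "#2 String", "Quality"))
labels = [int(label) for _, _, label in pairs]
print(labels)
```

Disabling quote handling is a deliberate choice here: the sentences contain double quotes as ordinary text, and a reader that interprets them as field delimiters could merge or split fields.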

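QQP (Quora Question Pairs): a corpus of question pairs collected from the question-and-answer website Quora; the task is to predict whether the two questions have the same meaning.

Example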

id	qid1	qid2	question1	question2	is_duplicate
133273	213221	213222	How is the life of a math student? Could you describe your own experiences?	Which level of prepration is enough for the exam jlpt5?	0
402555	536040	536041	How do I control my horny emotions?	How do you control your horniness?	1

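STS-B (Semantic Textual Similarity Benchmark): a set of sentence pairs drawn from news headlines, video and image captions, and natural language inference data; the task is to predict how similar in meaning the two sentences are, as a score on a scale from 0 to 5.

Example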

index	genre	filename	year	old_index	source1	source2	sentence1	sentence2	score
0	main-captions	MSRvid	2012test	0001	none	none	A plane is taking off.	An air plane is taking off.	5.000
1	main-captions	MSRvid	2012test	0004	none	none	A man is playing a large flute.	A man is playing a flute.	3.800
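Unlike the classification tasks, the target here is a real number, so the last column should be parsed as a float rather than an integer class id. A minimal sketch, assuming the ten-column tab-separated layout shown above; the sample rows are copied from the example.

```python
import csv
import io

# Header and rows copied from the STS-B example above (tab-separated).
SAMPLE = (
    "index\tgenre\tfilename\tyear\told_index\tsource1\tsource2\tsentence1\tsentence2\tscore\n"
    "0\tmain-captions\tMSRvid\t2012test\t0001\tnone\tnone\t"
    "A plane is taking off.\tAn air plane is taking off.\t5.000\n"
    "1\tmain-captions\tMSRvid\t2012test\t0004\tnone\tnone\t"
    "A man is playing a large flute.\tA man is playing a flute.\t3.800\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)

# Keep only the two sentences and the similarity score (a regression target).
examples = [(row["sentence1"], row["sentence2"], float(row["score"])) for row in reader]
scores = [score for _, _, score in examples]
print(scores)
```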

INFERENCE TASKS

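These are natural language inference tasks. There are four datasets.

MNLI (Multi-Genre Natural Language Inference Corpus): given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or does neither (neutral).
- Example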
  - index: 0
  - promptID: 31193
  - pairID: 31193n
  - genre: government
  - sentence1_binary_parse: ( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )
  - sentence2_binary_parse: ( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )
  - sentence1_parse: (ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .)))
  - sentence2_parse: (ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .)))
  - sentence1: Conceptually cream skimming has two basic dimensions - product and geography.
  - sentence2: Product and geography are what make cream skimming work.
  - label1: neutral
  - gold_label: neutral
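For training, the string label usually has to be turned into a class id. The sketch below shows one way to do that for a record like the one above. The mapping `LABEL_TO_ID` and its ordering are assumptions made for illustration, since different toolkits order the three classes differently, and `encode` is a hypothetical helper.

```python
# Hypothetical label-to-id mapping; the ordering is an assumption,
# not something fixed by the MNLI files themselves.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

# Fields copied from the example record above.
example = {
    "sentence1": "Conceptually cream skimming has two basic dimensions - product and geography.",
    "sentence2": "Product and geography are what make cream skimming work.",
    "gold_label": "neutral",
}


def encode(record):
    """Return (premise, hypothesis, label_id) for one MNLI-style record."""
    return record["sentence1"], record["sentence2"], LABEL_TO_ID[record["gold_label"]]


premise, hypothesis, label_id = encode(example)
print(label_id)
```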

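QNLI (Question-answering NLI, derived from the Stanford Question Answering Dataset): built from Wikipedia; the task is to predict whether the given sentence contains the answer to the question.

Example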

index	question	sentence	label
0	When did the third Digimon series begin?	Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese.	not_entailment
2	What two things does Popper argue Tarski's theory involves in an evaluation of truth?	He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer.	entailment

RTE (Recognizing Textual Entailment): built from news and Wikipedia text; the task is to predict whether the first sentence entails the second.

Example

index	sentence1	sentence2	label
0	No Weapons of Mass Destruction Found in Iraq Yet.	Weapons of Mass Destruction Found in Iraq.	not_entailment
1	A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.	Pope Benedict XVI is the new leader of the Roman Catholic Church.	entailment

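WNLI (Winograd NLI, derived from the Winograd Schema Challenge): the underlying task is to determine what a pronoun in a sentence refers to. In GLUE it is presented as sentence pairs: the second sentence replaces the pronoun with a candidate referent, and the label indicates whether the first sentence entails it (1) or not (0).

Example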

index	sentence1	sentence2	label
0	I stuck a pin through a carrot. When I pulled the pin out, it had a hole.	The carrot had a hole.	1
3	Steve follows Fred's example in everything. He influences him hugely.	Steve influences him hugely.	0

The model rankings on the GLUE Benchmark as of 2021.11.07 are shown below.

REFERENCE

https://openreview.net/pdf?id=rJ4km2R5t7

https://gluebenchmark.com/
