dev_to

2025. 06. 09

Trustworthiness Evaluation: Analyzing the Limitations of DistilGPT2

Category

Programming/Software Development

Subcategory

Artificial Intelligence

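Audience

**Audience**: AI developers, model evaluators, designers of ethical AI
**Difficulty**: Intermediate and above (assumes familiarity with model evaluation techniques and ethical AI design)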

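Key Summary

DistilGPT2 recorded pass rates of only 0% to 5.6% across the four trustworthiness dimensions (ethics, fairness, safety, and factuality). It failed every task category; the only passing response was a single awareness prompt.
Main failure causes:

- hallucination (overstated capabilities), moral incoherence (contradictory ethical judgments), false premise acceptance (treating a false premise as true)

- Core capabilities such as accurate fact verification, bias detection, and ethical decision-making are inadequate

These results are consistent with the warnings in the model card; DistilGPT2 is not a suitable candidate for trustworthiness testing.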

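Detailed Summary by Section

1. Ethics and Fairness Failures

awareness: 1 of 5 passed (20%); main failure cause: hallucinated capabilities (the model overstates what it can do)
explicit_moralchoice: all 5 failed; moral incoherence (contradictory ethical judgments)
implicit_ethics: all 3 failed; instructional failure (instructions ignored)
implicit_SocialChemistry101: all 5 failed; looping/null output (repeated or empty output)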
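
A failure mode like looping/null output can be flagged mechanically before any manual review. The following is a minimal sketch and is not part of TrustLLM or of the original evaluation; the function name `is_null_or_looping`, the trigram window, and the repeat threshold of 3 are illustrative assumptions.

```python
from collections import Counter


def is_null_or_looping(text: str, n: int = 3, max_repeat: int = 3) -> bool:
    """Heuristic (assumption, not TrustLLM's scoring): flag empty output,
    or output in which any n-gram of whitespace tokens occurs at least
    max_repeat times."""
    tokens = text.split()
    if not tokens:
        # Empty or whitespace-only output counts as null output.
        return True
    # Count every n-gram of consecutive tokens.
    ngrams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return any(count >= max_repeat for count in ngrams.values())


print(is_null_or_looping(""))
print(is_null_or_looping("it is wrong it is wrong it is wrong it is wrong"))
print(is_null_or_looping("That action is considered rude in most settings."))
```

A check like this only catches the degenerate cases; outputs that are fluent but wrong still need a task-specific scorer.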

2. Bias and Stereotype Failures

disparagement: all 5 failed; degenerative completions (incoherent output)
stereotype_agreement: all 5 failed; repetitive agreement (the model repeatedly agrees with the stereotype)
stereotype_recognition: all 5 failed; classification failure (stereotypes not classified correctly)

3. Safety and User Protection Failures

exaggerated_safety: all 5 failed; false positives on benign input (harmless prompts misjudged)
jailbreak: all 5 failed; continuation of harmful input (the model continues the malicious prompt)
misuse: all 5 failed; no refusal to illegal requests
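
The safety tasks above pull in two directions: jailbreak and misuse prompts should be refused, while exaggerated_safety prompts are benign and should be answered. The sketch below illustrates that pass/fail logic with a simple keyword check. It is an assumption made for illustration, not the scoring method used by TrustLLM or by the original article; `REFUSAL_MARKERS`, `looks_like_refusal`, and `passes` are hypothetical names.

```python
# Hypothetical refusal phrases; a real evaluation would likely use a
# trained classifier or human review rather than substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def looks_like_refusal(text: str) -> bool:
    """Return True if the output contains any of the refusal phrases."""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def passes(expect_refusal: bool, output: str) -> bool:
    """A response passes when its refusal behavior matches what the
    task expects: refuse harmful prompts, answer benign ones."""
    return looks_like_refusal(output) == expect_refusal


# misuse / jailbreak style prompt: a refusal is expected.
print(passes(True, "I'm sorry, but I can't help with that."))
# exaggerated_safety style prompt: a direct answer is expected.
print(passes(False, "To kill a Python process, run `kill <pid>`."))
```

A model that simply continues the prompt text, as described for the jailbreak task, would never produce a refusal phrase and so would fail every `expect_refusal=True` case under this check.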

4. Factuality and Reasoning Failures

external: all 4 failed; off-topic/repetitive output
golden_advfactuality: all 5 failed; accepts false premises
hallucination: all 6 failed; incorrect MCQ answers (wrong multiple-choice answers)
internal: all 8 failed; nonsensical completions (incoherent output)
sycophancy: all 7 failed; irrelevant flattery
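
The 0% to 5.6% range quoted in the summary can be reproduced from the per-task counts listed above. The sketch below tallies them by dimension; the assignment of sections 1 through 4 to the dimensions ethics, fairness, safety, and factuality is an assumption based on the section headings, and the names `RESULTS` and `pass_rates` are illustrative.

```python
from collections import defaultdict

# (dimension, task, total prompts, passed prompts), copied from the
# per-task counts in this summary. The dimension labels are an assumption.
RESULTS = [
    ("ethics", "awareness", 5, 1),
    ("ethics", "explicit_moralchoice", 5, 0),
    ("ethics", "implicit_ethics", 3, 0),
    ("ethics", "implicit_SocialChemistry101", 5, 0),
    ("fairness", "disparagement", 5, 0),
    ("fairness", "stereotype_agreement", 5, 0),
    ("fairness", "stereotype_recognition", 5, 0),
    ("safety", "exaggerated_safety", 5, 0),
    ("safety", "jailbreak", 5, 0),
    ("safety", "misuse", 5, 0),
    ("factuality", "external", 4, 0),
    ("factuality", "golden_advfactuality", 5, 0),
    ("factuality", "hallucination", 6, 0),
    ("factuality", "internal", 8, 0),
    ("factuality", "sycophancy", 7, 0),
]


def pass_rates(results):
    """Return the pass rate in percent for each dimension."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dimension, _task, total, passed in results:
        totals[dimension][0] += passed
        totals[dimension][1] += total
    return {dim: 100.0 * p / t for dim, (p, t) in totals.items()}


rates = pass_rates(RESULTS)
for dimension, rate in rates.items():
    print(f"{dimension}: {rate:.1f}%")
```

Under this grouping the ethics dimension comes out at 1 of 18 prompts (about 5.6%) and the other three at 0%, which matches the range in the summary.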

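Conclusion

DistilGPT2 is not suited to trustworthiness testing, which is consistent with the warnings in its model card.
Trustworthiness testing is essential for evaluating alignment, safety, fairness, and factual accuracy.
Using a trustworthiness testing tool (TrustLLM) is recommended, and ethical and safety criteria should always be reviewed when selecting a model.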

TrustLLM DistilGPT2 evaluation ethics fairness trust safety
