KoGPT2

Korean GPT-2 pre-trained language model

KoGPT2 is an open-source GPT-2 model trained on Korean text. By adapting OpenAI’s GPT-2 architecture to Korean, it can be used in a range of applications that call for Korean language understanding and generation, such as text generation, sentence completion, and chatbots.

Project Information

  • Developer: SK Telecom
  • Release Date: 2020 (Korea’s first open-source Korean GPT-2)
  • License: CC BY-NC-SA 4.0 (non-commercial use; modification and redistribution allowed under the same license)
  • GitHub: https://github.com/SKT-AI/KoGPT2

Key Features

1. Korean Text Generation

  • Natural Korean sentence generation
  • Context-aware sentence completion
  • Support for creative writing

2. Diverse Applications

  • Chatbot building: Conversational AI services
  • Text sentiment prediction: Emotion analysis
  • Response generation: Generating answers to questions
  • Sentence completion: Context-based text completion
  • Storytelling: Creative writing support

3. Developer-Friendly

  • Support for multiple frameworks (PyTorch, ONNX; see the export sketch after this list)
  • Easy installation and usage
  • Abundant example code provided
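
The repository ships PyTorch weights; for ONNX, one route is Hugging Face Optimum’s ONNX Runtime integration. The following is a minimal sketch under that assumption (ORTModelForCausalLM and the export flag are Optimum’s generic API, not something KoGPT2-specific) and requires pip install "optimum[onnxruntime]":

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import PreTrainedTokenizerFast

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForCausalLM.from_pretrained('skt/kogpt2-base-v2', export=True)
tokenizer = PreTrainedTokenizerFast.from_pretrained('skt/kogpt2-base-v2')

input_ids = tokenizer.encode("The future of artificial intelligence is", return_tensors='pt')
output = ort_model.generate(input_ids, max_length=64)
print(tokenizer.decode(output[0]))

# Save the converted model so the export only happens once
ort_model.save_pretrained('./kogpt2-onnx')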

Installation and Usage

Installation

pip install torch transformers

Basic Text Generation

import torch
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load model and tokenizer (special tokens as documented on the official model card)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    'skt/kogpt2-base-v2',
    bos_token='</s>', eos_token='</s>', unk_token='<unk>',
    pad_token='<pad>', mask_token='<mask>'
)
model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')

# Text generation
text = "The future of artificial intelligence is"
input_ids = tokenizer.encode(text, return_tensors='pt')

# Set generation parameters
gen_ids = model.generate(
    input_ids,
    max_length=128,
    repetition_penalty=2.0,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    use_cache=True
)

# Decode results
generated = tokenizer.decode(gen_ids[0])
print(generated)

Using Hugging Face Transformers

from transformers import pipeline

# Text generation pipeline
generator = pipeline(
    'text-generation',
    model='skt/kogpt2-base-v2',
    tokenizer='skt/kogpt2-base-v2'
)

# Generate text
prompt = "Korean natural language processing technology"
result = generator(
    prompt,
    max_length=100,
    num_return_sequences=3,
    do_sample=True,  # sampling is required for multiple distinct sequences
    temperature=0.8
)

for i, text in enumerate(result):
    print(f"Result {i+1}: {text['generated_text']}")

Sentiment Analysis Example

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    'skt/kogpt2-base-v2',
    bos_token='</s>', eos_token='</s>', unk_token='<unk>',
    pad_token='<pad>', mask_token='<mask>'
)
model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')

# Reviews for sentiment analysis
reviews = [
    "This movie was really fun",
    "The service was terrible",
    "Great product for the price"
]

for review in reviews:
    # Prompt engineering for positive/negative judgment
    prompt = f"{review} This review is"
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=len(input_ids[0]) + 10,
            num_return_sequences=1,
            do_sample=True,  # temperature has no effect under greedy decoding
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id
        )
    
    result = tokenizer.decode(output[0])
    print(f"Original: {review}")
    print(f"Analysis: {result}\n")

Chatbot Building Example

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    'skt/kogpt2-base-v2',
    bos_token='</s>', eos_token='</s>', unk_token='<unk>',
    pad_token='<pad>', mask_token='<mask>'
)
model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')

def generate_response(user_input, context=""):
    """Generate conversation-based response"""
    prompt = f"{context}\nUser: {user_input}\nAI:"
    
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + 50,
            do_sample=True,
            temperature=0.8,
            top_k=50,
            top_p=0.95,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.pad_token_id  # avoids a warning when padding is needed
        )
    
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    # Extract only the AI response part
    ai_response = response.split("AI:")[-1].strip()
    
    return ai_response

# Chatbot conversation example
context = ""
while True:
    user_input = input("You: ")
    if user_input.lower() in ['quit', 'exit']:
        break
    
    response = generate_response(user_input, context)
    print(f"AI: {response}\n")
    
    # Update context
    context += f"User: {user_input}\nAI: {response}\n"

Model Specifications

  • Architecture: GPT-2
  • Parameters: 125M
  • Vocabulary Size: 51,200
  • Context Length: 1,024 tokens (see the configuration check below)
  • Training Data: Korean web documents, news, Wikipedia
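
These figures can be read directly from the published model configuration (AutoConfig is the standard transformers API; the attribute names below are those of GPT2Config):

from transformers import AutoConfig

config = AutoConfig.from_pretrained('skt/kogpt2-base-v2')
print(config.vocab_size)   # vocabulary size (expected: 51200)
print(config.n_positions)  # maximum context length (expected: 1024)
print(config.n_layer, config.n_head, config.n_embd)  # GPT-2 base: 12 layers, 12 heads, 768-dim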

Performance Benchmarks

Task                     | Dataset          | KoGPT2 Score
-------------------------|------------------|-------------
Text generation quality  | Human evaluation | 4.2/5.0
Sentence completion      | Self-evaluation  | 85%
Conversation naturalness | Self-evaluation  | 78%

License

CC BY-NC-SA 4.0 - Non-commercial use; modification and redistribution are allowed under the same license.

