KoBERT

Korean BERT pre-trained cased (KoBERT)

KoBERT is a Korean-specialized BERT model developed by SK Telecom to overcome the limitations of Google’s publicly released BERT language model in processing Korean.

Key Features

1. Korean Language Optimization

  • Trained on millions of Korean sentences collected from Wikipedia and news sources
  • Large-scale Korean language corpus utilization
  • Captures the irregular linguistic variations characteristic of Korean

2. Efficient Tokenization

  • Data-driven tokenization technique
  • Roughly 27% fewer tokens and more than 2.6% higher task performance compared with existing methods
  • Subword segmentation tailored to Korean language characteristics
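
The effect of this subword segmentation can be inspected directly with the tokenizer shipped in the kobert-transformers package (see Installation and Usage below); the pieces shown in the comment are illustrative and depend on the released vocabulary.

from kobert_transformers import get_tokenizer

tokenizer = get_tokenizer()

# Korean for "Natural language processing is fun."
pieces = tokenizer.tokenize("자연어 처리는 재미있다.")
print(pieces)
# SentencePiece-style pieces, e.g. ['▁자연', '어', '▁처리', '는', ...] (exact split may vary)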

3. Distributed Learning Technology

  • Ring-reduce based distributed learning technique
  • Fast training of over a billion sentences across multiple machines
  • Efficient processing of large-scale data
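
SK Telecom has not released its training pipeline, so the sketch below only illustrates ring-allreduce style data parallelism with PyTorch DistributedDataParallel (whose NCCL backend uses ring-based all-reduce); the model class, script name, and launch command are assumptions.

# Illustrative only: not SK Telecom's actual training code.
# Launch with: torchrun --nproc_per_node=<gpus_per_node> pretrain.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import BertForMaskedLM

dist.init_process_group(backend="nccl")      # NCCL all-reduce is ring/tree based
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = BertForMaskedLM.from_pretrained("skt/kobert-base-v1").cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across workers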

4. Multi-framework Support

  • PyTorch
  • TensorFlow
  • ONNX
  • MXNet
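
For the ONNX path, one possible sketch is exporting the PyTorch checkpoint with torch.onnx; the wrapper class, file name, and opset version are illustrative assumptions, not an official conversion script.

import torch
from kobert_transformers import get_kobert_model, get_tokenizer

class KoBertEncoder(torch.nn.Module):
    """Thin wrapper so the exported graph returns a plain tensor."""
    def __init__(self, bert):
        super().__init__()
        self.bert = bert

    def forward(self, input_ids, attention_mask):
        return self.bert(input_ids=input_ids, attention_mask=attention_mask)[0]

tokenizer = get_tokenizer()
wrapper = KoBertEncoder(get_kobert_model().eval())

# "This is a Korean sentence."
enc = tokenizer("한국어 문장입니다.", return_tensors="pt")
torch.onnx.export(
    wrapper,
    (enc["input_ids"], enc["attention_mask"]),
    "kobert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"},
                  "last_hidden_state": {0: "batch", 1: "seq"}},
    opset_version=14,
)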

Applications

SK Telecom Internal Usage

  1. Call center chatbots - Improving customer service efficiency
  2. AI legal/patent search service - Document search and analysis
  3. Machine Reading Comprehension (MRC) - Extracting accurate answers from marketing materials
  4. Context-based document vector generation - Similar document recommendations (patent applications)

General Use Cases

  • Sentiment Analysis
  • Named Entity Recognition (NER)
  • Text Classification
  • Question Answering Systems
  • Sentence Similarity Measurement
  • Text Embedding Generation
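
For classification-style tasks such as sentiment analysis, KoBERT is typically fine-tuned with a task head on top of the encoder. A minimal sketch, assuming the packages from the Installation section below (the sentences and labels are made up):

import torch
from transformers import BertForSequenceClassification
from kobert_transformers import get_tokenizer

tokenizer = get_tokenizer()
# The classification head is newly initialized and must be fine-tuned on labeled data
model = BertForSequenceClassification.from_pretrained("skt/kobert-base-v1", num_labels=2)

texts = ["영화가 정말 재미있었어요", "완전히 시간 낭비였다"]  # "really enjoyable movie" / "a complete waste of time"
labels = torch.tensor([1, 0])                                  # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

print(outputs.logits.shape)  # torch.Size([2, 2])
outputs.loss.backward()      # hook this into an optimizer or Trainer loop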

Installation and Usage

Installation

pip install kobert-transformers
pip install transformers

Basic Usage

from kobert_transformers import get_tokenizer
from transformers import BertModel

# Load tokenizer and model
tokenizer = get_tokenizer()
model = BertModel.from_pretrained('skt/kobert-base-v1')

# Tokenize and generate embeddings
text = "Korean natural language processing is fascinating"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)  # unpack the encoding dict into keyword arguments

# Extract sentence embedding
sentence_embedding = outputs.last_hidden_state[:, 0, :].squeeze()
print(sentence_embedding.shape)  # torch.Size([768])
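
Building on the tokenizer and model loaded above, sentence similarity can be sketched by comparing [CLS] embeddings with cosine similarity; the example sentences are illustrative, and mean pooling over all tokens is a common alternative.

import torch
import torch.nn.functional as F

def embed(text):
    # [CLS] embedding, as in the example above
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]

a = embed("오늘 날씨가 맑다")      # "The weather is clear today"
b = embed("하늘이 맑고 화창하다")  # "The sky is clear and sunny"
print(F.cosine_similarity(a, b).item())  # in [-1, 1]; higher means more similar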

PyTorch Example

import torch
from kobert_transformers import get_kobert_model, get_tokenizer

# Load model and tokenizer
tokenizer = get_tokenizer()
model = get_kobert_model()

# Process text
text = "KoBERT is specialized in Korean language understanding."
encoded = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    max_length=128,
    padding='max_length',
    return_attention_mask=True,
    return_tensors='pt'
)

# Model inference
with torch.no_grad():
    outputs = model(
        input_ids=encoded['input_ids'],
        attention_mask=encoded['attention_mask']
    )
    
pooled_output = outputs[1]  # pooler output: the [CLS] hidden state passed through a linear layer + tanh
print(pooled_output.shape)  # torch.Size([1, 768])

Performance Benchmarks

Task                  Dataset       KoBERT Score   Baseline
Sentiment Analysis    NSMC          89.63%         87.42%
NER                   Korean NER    86.11%         84.13%
Sentence Similarity   KorSTS        81.59%         77.92%
Question Answering    KorQuAD 1.0   52.81 (EM)     48.42

Model Specifications

  • Architecture: BERT-base
  • Vocabulary Size: 8,002
  • Hidden Size: 768
  • Number of Layers: 12
  • Number of Attention Heads: 12
  • Intermediate Size: 3,072
  • Max Sequence Length: 512
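
These values can be verified against the published checkpoint's configuration (a quick sanity check, not official documentation):

from transformers import BertConfig

config = BertConfig.from_pretrained("skt/kobert-base-v1")
print(config.vocab_size)               # 8002
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 512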

Using on Hugging Face

from transformers import AutoModel, AutoTokenizer

# Load directly from Hugging Face Hub
model_name = "skt/kobert-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Inference
text = "KoBERT is the standard for Korean natural language processing"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # unpack the encoding dict into keyword arguments

License

Apache License 2.0 - Commercial use allowed
