KoBART

Korean BART Model

KoBART is a BART (Bidirectional and Auto-Regressive Transformers) model specialized for Korean text generation and summarization. Built on an encoder-decoder architecture, it performs well across a range of Korean natural language generation tasks.


Key Features

1. Encoder-Decoder Architecture

  • Bidirectional encoder and auto-regressive decoder (illustrated in the sketch below)
  • Optimized for text generation and transformation tasks
  • Balance between context understanding and generation
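
To make the encoder/decoder split concrete, the two parts can be pulled out of a loaded checkpoint as separate modules. This is only a minimal sketch, assuming the gogamza/kobart-base-v2 checkpoint used in the usage examples below:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v2')

# The bidirectional encoder and the auto-regressive decoder are separate sub-modules
encoder = model.get_encoder()
decoder = model.get_decoder()
print(type(encoder).__name__, type(decoder).__name__)  # BartEncoder BartDecoder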

2. Main Application Areas

  • Text summarization: Condensing long documents into concise summaries
  • Sentence generation: Producing natural Korean language sentences
  • Paraphrasing: Sentence transformation and rewriting
  • Dialogue generation: Question-answering systems

3. Korean Language Optimization

  • Pre-trained on a Korean corpus (see the tokenizer sketch below)
  • Considers Korean grammar and word order
  • Supports diverse Korean language domains
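
A quick way to see the Korean-specific vocabulary at work is to tokenize a Korean sentence and inspect the resulting subwords; a minimal sketch, assuming the same checkpoint as in the usage examples below:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')

# "Korean natural language processing is fun." split into subword tokens
tokens = tokenizer.tokenize("한국어 자연어 처리는 재미있다.")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))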

Installation and Usage

Installation

pip install transformers torch

Basic Text Summarization

from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

# Load model and tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v2')

# Summarize long text
text = """
SK Telecom is Korea's leading mobile telecommunications company with 
extensive ICT technology including AI, 5G, and cloud services. Recently, 
it developed the Korean large language model A.X and released it as 
open source, contributing to the development of the domestic AI ecosystem.
"""

# Encode and generate summary
inputs = tokenizer(text, return_tensors='pt', max_length=1024, truncation=True)
summary_ids = model.generate(
    inputs['input_ids'],
    max_length=150,
    num_beams=5,
    early_stopping=True
)

# Decode
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

Text Generation Example

# Prompt-based text generation
prompt = "With the advancement of artificial intelligence"

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    inputs['input_ids'],
    max_length=100,
    temperature=0.8,
    do_sample=True,
    top_k=50
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Model Specifications

  • Architecture: BART
  • Parameters: 123M
  • Vocabulary Size: 30,000
  • Max Sequence Length: 1,024
  • Encoder Layers: 6
  • Decoder Layers: 6
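
These figures can be cross-checked against the loaded checkpoint's configuration; a quick sketch, assuming the model loaded as in the usage examples above:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v2')
config = model.config

# Compare the published specifications with the checkpoint's configuration
print("Vocabulary size:", config.vocab_size)
print("Encoder layers:", config.encoder_layers)
print("Decoder layers:", config.decoder_layers)
print("Max positions:", config.max_position_embeddings)
print("Parameters:", sum(p.numel() for p in model.parameters()))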

Fine-tuning Guide

from transformers import Trainer, TrainingArguments

# Fine-tuning configuration
training_args = TrainingArguments(
    output_dir='./kobart-finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch"
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Start training
trainer.train()
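
The train_dataset and eval_dataset passed to the Trainer above are assumed to be prepared beforehand. One possible sketch of such a dataset for summarization fine-tuning, using hypothetical train_documents / train_summaries lists that are not part of the original guide:

import torch

class SummarizationDataset(torch.utils.data.Dataset):
    """Tokenizes (document, summary) pairs for sequence-to-sequence training."""

    def __init__(self, documents, summaries, tokenizer, max_len=512):
        self.documents = documents
        self.summaries = summaries
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, idx):
        source = self.tokenizer(self.documents[idx], max_length=self.max_len,
                                truncation=True, padding='max_length', return_tensors='pt')
        target = self.tokenizer(self.summaries[idx], max_length=self.max_len,
                                truncation=True, padding='max_length', return_tensors='pt')
        labels = target['input_ids'].squeeze(0)
        # Ignore padded positions when computing the loss
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            'input_ids': source['input_ids'].squeeze(0),
            'attention_mask': source['attention_mask'].squeeze(0),
            'labels': labels,
        }

# Hypothetical document/summary pair lists; replace with your own data
train_dataset = SummarizationDataset(train_documents, train_summaries, tokenizer)
eval_dataset = SummarizationDataset(eval_documents, eval_summaries, tokenizer)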
