KoBART

Korean BART Model

    KoBART is a BART (Bidirectional and Auto-Regressive Transformers) model specialized for Korean text generation and summarization. Built on an encoder-decoder architecture, it performs well across a wide range of natural language generation tasks.

    Project Information

    Key Features

    1. Encoder-Decoder Architecture

    • Bidirectional encoder and auto-regressive decoder
    • Optimized for text generation and transformation tasks
    • Balances context understanding and text generation (see the sketch below)
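
    The two halves can be used separately: the bidirectional encoder reads the whole input at once, while the auto-regressive decoder generates output tokens step by step. A minimal sketch, assuming the gogamza/kobart-base-v2 checkpoint used later on this page (the sample sentence is illustrative):

    from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
    model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v2')

    # The bidirectional encoder and the auto-regressive decoder are separate modules
    encoder = model.get_encoder()
    decoder = model.get_decoder()
    print(type(encoder).__name__, type(decoder).__name__)  # BartEncoder BartDecoder

    # The encoder builds contextual representations of the full input in one pass;
    # generate() then runs the decoder over them token by token
    enc = tokenizer("한국어 문장을 인코딩합니다.", return_tensors='pt')  # "Encoding a Korean sentence."
    encoder_out = encoder(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])
    print(encoder_out.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)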

    2. Main Application Areas

    • Text summarization: Condensing long documents into concise summaries
    • Sentence generation: Producing natural Korean language sentences
    • Translation: Sentence transformation and paraphrasing
    • Dialogue generation: Question-answering systems

    3. Korean Language Optimization

    • Pre-trained on a Korean corpus
    • Accounts for Korean grammar and word order
    • Supports diverse Korean-language domains (see the tokenizer example below)
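
    As a quick illustration (a sketch assuming the same gogamza/kobart-base-v2 checkpoint; the sample sentence is illustrative), the KoBART tokenizer segments Korean text into subword tokens from its 30,000-entry vocabulary:

    from transformers import PreTrainedTokenizerFast

    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')

    # "Korean natural language processing is interesting." (illustrative sentence)
    tokens = tokenizer.tokenize("한국어 자연어 처리는 흥미롭습니다.")
    print(tokens)                # Korean-aware subword pieces
    print(tokenizer.vocab_size)  # size of the subword vocabulary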

    Installation and Usage

    Installation

    pip install transformers torch
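
    A quick way to confirm the installation is to import both packages and print their versions:

    import torch
    import transformers

    # Both imports succeeding confirms the packages are installed
    print(transformers.__version__)
    print(torch.__version__)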
    

    Basic Text Summarization

    from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration
    
    # Load model and tokenizer
    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
    model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v2')
    
    # Summarize long text
    text = """
    SK Telecom is Korea's leading mobile telecommunications company with 
    extensive ICT technology including AI, 5G, and cloud services. Recently, 
    it developed the Korean large language model A.X and released it as 
    open source, contributing to the development of the domestic AI ecosystem.
    """
    
    # Encode and generate summary
    inputs = tokenizer(text, return_tensors='pt', max_length=1024, truncation=True)
    summary_ids = model.generate(
        inputs['input_ids'],
        max_length=150,
        num_beams=5,
        early_stopping=True
    )
    
    # Decode
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print(summary)
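
    Note that gogamza/kobart-base-v2 is the pre-trained base checkpoint; summaries are usually much better with a checkpoint fine-tuned on a summarization dataset. A minimal sketch, assuming a fine-tuned checkpoint such as gogamza/kobart-summarization is available on the Hugging Face Hub:

    # Same generation call, but with a summarization fine-tuned checkpoint
    # ('gogamza/kobart-summarization' is assumed to be available on the Hugging Face Hub)
    sum_tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-summarization')
    sum_model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')

    inputs = sum_tokenizer(text, return_tensors='pt', max_length=1024, truncation=True)
    summary_ids = sum_model.generate(
        inputs['input_ids'],
        max_length=150,
        num_beams=5,
        early_stopping=True
    )
    print(sum_tokenizer.decode(summary_ids[0], skip_special_tokens=True))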
    

    Text Generation Example

    # Prompt-based text generation
    prompt = "With the advancement of artificial intelligence"
    
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(
        inputs['input_ids'],
        max_length=100,
        temperature=0.8,
        do_sample=True,
        top_k=50
    )
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)
    

    Model Specifications

    • Architecture: BART
    • Parameters: 123M
    • Vocabulary Size: 30,000
    • Max Sequence Length: 1,024
    • Encoder Layers: 6
    • Decoder Layers: 6
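
    These values can be read directly from the model configuration; a small sketch, assuming the gogamza/kobart-base-v2 checkpoint:

    from transformers import BartConfig

    config = BartConfig.from_pretrained('gogamza/kobart-base-v2')

    print(config.vocab_size)                             # vocabulary size
    print(config.max_position_embeddings)                # maximum sequence positions
    print(config.encoder_layers, config.decoder_layers)  # encoder / decoder layers
    print(config.d_model)                                # hidden size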

    Fine-tuning Guide

    from transformers import Trainer, TrainingArguments
    
    # Fine-tuning configuration
    training_args = TrainingArguments(
        output_dir='./kobart-finetuned',
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        evaluation_strategy="epoch"  # renamed to eval_strategy in recent transformers releases
    )
    
    # Create Trainer (train_dataset / eval_dataset are tokenized datasets; see the preparation sketch below)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )
    
    # Start training
    trainer.train()
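
    The train_dataset and eval_dataset above must already be tokenized (document, summary) pairs. A minimal sketch of one way to prepare them with a plain PyTorch Dataset, reusing the tokenizer loaded earlier (the example pairs are illustrative placeholders):

    import torch

    class SummaryDataset(torch.utils.data.Dataset):
        """Tokenizes (document, summary) pairs into encoder inputs and decoder labels."""

        def __init__(self, pairs, tokenizer, max_input_len=1024, max_target_len=150):
            self.examples = []
            for document, summary in pairs:
                enc = tokenizer(document, max_length=max_input_len,
                                truncation=True, padding='max_length')
                labels = tokenizer(summary, max_length=max_target_len,
                                   truncation=True, padding='max_length')['input_ids']
                # Padding positions in the labels are set to -100 so the loss ignores them
                labels = [tok if tok != tokenizer.pad_token_id else -100 for tok in labels]
                self.examples.append({
                    'input_ids': torch.tensor(enc['input_ids']),
                    'attention_mask': torch.tensor(enc['attention_mask']),
                    'labels': torch.tensor(labels),
                })

        def __len__(self):
            return len(self.examples)

        def __getitem__(self, idx):
            return self.examples[idx]

    # Illustrative placeholder data
    train_pairs = [("A long source document ...", "A short summary ...")]
    eval_pairs = [("Another source document ...", "Another summary ...")]

    train_dataset = SummaryDataset(train_pairs, tokenizer)
    eval_dataset = SummaryDataset(eval_pairs, tokenizer)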
    

    Resources