datasets

March 29, 2025

10:30

https://huggingface.co/docs/datasets/index

The Trainer automatically ignores columns in your dataset which aren’t used by the model. For T5 for instance, the model expects input_ids, attention_mask, labels etc., but not “summary”, “document”, “id”. As long as input_ids etc are in your dataset, it’s fine.

In general you can have whatever column names you want for the text and labels before tokenization - it’s up to you to decide how the text should be processed.

Once you’ve tokenized the text, you shouldn’t need to rename the resulting columns such as input_ids and attention_mask (and I wouldn’t recommend it, since renaming them will probably break the Trainer logic).

By default, the Trainer looks for a label column named labels, but you can override this by setting TrainingArguments.label_names (see: Trainer, transformers 4.5.0.dev0 documentation).
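To make the column-filtering idea concrete, here is a simplified sketch of the mechanism as I understand it: the names of the arguments accepted by the model's forward method, plus the label names, decide which columns survive. ToyT5 and keep_used_columns are hypothetical names made up for this note, not part of transformers, and the real Trainer does more than this.

```python
import inspect


class ToyT5:
    """Hypothetical stand-in for a model; only the forward signature matters here."""

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        return None


def keep_used_columns(model, column_names, label_names=("labels",)):
    """Return the columns a Trainer-like loop would keep for this model."""
    # Argument names accepted by model.forward (bound method, so no `self`).
    accepted = set(inspect.signature(model.forward).parameters)
    # Label columns are kept as well, even if forward does not list them.
    accepted.update(label_names)
    return [name for name in column_names if name in accepted]


columns = ["id", "document", "summary", "input_ids", "attention_mask", "labels"]
kept = keep_used_columns(ToyT5(), columns)
print(kept)
```

With this toy model, the raw text columns (id, document, summary) are dropped and only the tokenized columns and the label column remain. Passing a different label_names tuple plays the role that TrainingArguments.label_names plays in the real library.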

On the question of why a dataset built this way can be trained on directly:

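from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

# df_train, config, args, and loss are defined elsewhere.
# Keep only the three text columns the triplet setup needs.
ds_train = Dataset.from_pandas(df_train).select_columns(["anchor", "positive", "negative"])

model = SentenceTransformer(config)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=ds_train,
    loss=loss,
)
trainer.train()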

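Because the trainer is a SentenceTransformerTrainer, it knows how to handle the anchor, positive, and negative columns of the dataset: it tokenizes each of them (among other preprocessing steps) and passes the results to the SentenceTransformer model. As far as I understand, the trainer treats every column that is not a label or score column as a text input, so what matters is the column order expected by the loss rather than the exact column names.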
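A rough sketch of what "tokenize each column and pass it on" could look like, to make the idea tangible. Everything here is an assumption for illustration: toy_tokenize is a made-up whitespace tokenizer, collate is a hypothetical function, and the column-name prefix on the keys is my reading of how the library's collator names its outputs, not a verified copy of it.

```python
def toy_tokenize(texts):
    """Hypothetical whitespace tokenizer; a real model uses its own tokenizer."""
    token_lists = [text.lower().split() for text in texts]
    width = max(len(tokens) for tokens in token_lists)
    # Fake token ids (the token length), padded with 0 to a common width.
    input_ids = [
        [len(token) for token in tokens] + [0] * (width - len(tokens))
        for tokens in token_lists
    ]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(tokens) + [0] * (width - len(tokens)) for tokens in token_lists
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


LABEL_COLUMNS = {"label", "score"}  # assumed label column names


def collate(rows):
    """Tokenize every non-label column separately; prefix keys with the column name."""
    batch = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if column in LABEL_COLUMNS:
            batch["label"] = values
            continue
        for key, value in toy_tokenize(values).items():
            batch[f"{column}_{key}"] = value
    return batch


rows = [
    {"anchor": "how do I sort a list", "positive": "use sorted", "negative": "bake at 200C"},
    {"anchor": "what is a tuple", "positive": "an immutable sequence", "negative": "rain tomorrow"},
]
batch = collate(rows)
print(sorted(batch))
```

Each text column ends up with its own input_ids and attention_mask, which is why the raw anchor, positive, and negative strings can be handed to the trainer without tokenizing them by hand first.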