2025-09-16 Open-source solutions

2025-09-16

10:45

    I only found the DataCite Corpus dataset; I did not find any dataset that includes metadata.

     

    1st place:

    Mainly used an API to fetch each dataset's title, authors, and year, compared these against the article's title, authors, and year, and fed the comparisons straight into CatBoost.
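    A minimal sketch of this feature-plus-CatBoost setup as I understand it (the column names and the similarity measure are my assumptions, not the winner's code):

    from difflib import SequenceMatcher

    import pandas as pd
    from catboost import CatBoostClassifier

    def sim(a, b):
        # Character-level similarity in [0, 1].
        return SequenceMatcher(None, (a or "").lower(), (b or "").lower()).ratio()

    def make_features(df: pd.DataFrame) -> pd.DataFrame:
        # One row per (article, dataset) candidate pair; column names are hypothetical.
        return pd.DataFrame({
            "title_sim": [sim(a, b) for a, b in zip(df["article_title"], df["dataset_title"])],
            "author_sim": [sim(a, b) for a, b in zip(df["article_authors"], df["dataset_authors"])],
            "same_year": (df["article_year"] == df["dataset_year"]).astype(int),
        })

    # model = CatBoostClassifier(iterations=500, verbose=False)
    # model.fit(make_features(train_df), train_df["label"])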

    Model: Qwen2.5-Coder (32B, AWQ quantization) + vLLM. The Coder variant classified better than the base model. This matched my experience: public LB 0.81 → 0.85, private LB 0.745 → 0.749.

     

     

    5th place:

    Ran LLM classification only on SAMN accessions and DOIs.

    Prompt: they put the task requirements in the user prompt, whereas I had put mine directly in the system prompt.

    <|im_start|>system

    You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>

    <|im_start|>user

    ### Core Instructions ###

    * Inspect WINDOW taking particular interest in ID, given below.

    * The ID is specifically a data citation - that relates to data held in an open-access repository.

    * Determine whether the WINDOW context holds evidence that the WINDOW authors are responsible for the ID held in the public repository.

    * After thinking, give your final answer using the rubric:

        * Owner: the WINDOW authors have some sort of ownership around ID.

        * User: the data has be re-used/referenced/compared in the WINDOW.

        * None: there is no evidence to determine ownership.

    * When reviewing the METADATA remember:

        * The METADATA is collected from several sources and hence has various formats for authour names and dates.

        * The most important thing is finding the overlap of WINDOW author(s) with the ID author(s); usually one author overlap is enough to assume Owner.

    * The final answer should be wrapped in \boxed{} containing only User, Owner or None.

     

    # ID

    https://doi.org/10.17882/47142

     

    # METADATA

    ## ID METADATA

    [Title]: A global bio-optical database derived from Biogeochemical Argo float measurements within the layer of interest for field and remote ocean color applications

    [Authors]: Organelli, Emanuele; Barbieux, Marie; Claustre, Hervé; Schmechtig, Catherine; Poteau, Antoine; Bricaud, Annick; Uitz, Julia; Dortenzio, Fabrizio; Dallolmo, Giorgio

    ## WINDOW METADATA

    [Title]: Assessing the Variability in the Relationship Between the Particulate Backscattering Coefficient and the Chlorophyll <i>a</i> Concentration From a Global Biogeochemical-Argo Database

    [Authors]: Marie Barbieux; Julia Uitz; Annick Bricaud; Emanuele Organelli; Antoine Poteau; Catherine Schmechtig; Bernard Gentili; Grigor Obolensky; Edouard Leymarie; Christophe Penkerc'h; Fabrizio D'Ortenzio; Hervé Claustre

    [Date]: 2018-2

     

    # WINDOW

    ## Paragraph

    <p xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><s>Sherbrooke, Canada) are acknowledged for useful comments and fruitful discussion.</s><s>We also thank the International Argo Program and the CORIOLIS project that contribute to make the data freely and publicly available.</s><s>Data referring to <ref type="bibr">(Organelli et al., 2016a)</ref> (doi:10.17882/47142)</s><s>and <ref target="#b8" type="bibr">(Barbieux et al., 2017)</ref> (doi: 10.17882/49388) are freely available on SEANOE.</s></p>

     

    ## References Condensed

    <biblstruct xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:id="b100">

     

    <monogr>

    <title level="j">SEANOE</title>

    <imprint>

    <date type="published" when="2016">2016</date>

    </imprint>

    </monogr>

    <note type="raw_reference">Organelli, E., M. Barbieux, H. Claustre, C. Schmechtig, A. Poteau, A. Bricaud, J. Uitz, F. D'Ortenzio, and G. Dall'Olmo (2016a), A global bio-optical database derived from Biogeochemical Argo float measurements within the layer of interest for field and remote ocean colour applications, SEANOE, doi:10.17882/47142.</note>

    </biblstruct><|im_end|>

    <|im_start|>assistant

     

    llm_response = (
        pl.read_parquet('/kaggle/working/llm_out.pq')
        # Split each completion at the end of the reasoning block: everything
        # before '</think>' is the chain of thought, everything after holds the answer.
        .with_columns(
            cot = pl.col('completions').str.split('</think>').list.first(),
            ans = pl.col('completions').str.split('</think>').list.last()
        )
        # Pull the label out of \boxed{...}; matching on 'oxed{' sidesteps escaping the backslash.
        .with_columns(
            pl.col('ans').str.extract(r'oxed\{(.*)\}').alias('ans')
        )
        # Some completions wrap the label again, e.g. \boxed{\text{Owner}}; unwrap \text{...}.
        .with_columns(
            pl.when(pl.col('ans').str.starts_with('\\'))
            .then(pl.col('ans').str.extract(r'ext\{(.*)\}'))
            .otherwise(pl.col('ans'))
            .alias('ans')
        )
        # Map the label to a citation type. is_doi (a boolean expression) and the
        # constants S and P (presumably the Secondary/Primary labels) are defined
        # earlier in the original notebook; note every 'None'/null branch falls back to S.
        .with_columns(
            pl.when(is_doi.and_((pl.col('ans')=='None').or_(pl.col('ans').is_null())))
            .then(S)
            .when((~is_doi).and_((pl.col('ans')=='None').or_(pl.col('ans').is_null())))
            .then(S)
            .when(pl.col('ans')=='Owner')
            .then(P)
            .otherwise(S)
            .alias('type')
        )
        .select('article_id', 'dataset_id', 'type')
    )

     

     

    2nd place:

    A question about why some accession IDs don't count as datasets:

    E.g., in many cases some accession numbers from the same table and repository were picked while others were not. The speculation here was that some kind of NER model was used for labeling, and thresholding on its scores leaves out some relevant accession numbers.

     

    On DOI type classification — pure rules:

    Accession: SAMN and EMDB -> Primary

    DOI: If a dataset is found in multiple papers per the DataCite corpus, tag the first as Primary and the rest as Secondary after sorting by publicationDate (how did I not think of this! — see the sketch after this list)

    DOI: If article_id isSupplementTo dataset_id as per datacite public data file -> Primary

    DOI: If there are more than 4 occurrences of the same repo (first 4 letters of the ID) or more than 4 DOIs mentioned around the dataset in the article -> Secondary
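    A minimal polars sketch of the publicationDate rule (the frame and its columns are hypothetical; ISO date strings sort chronologically):

    import polars as pl

    pairs = pl.DataFrame({
        "article_id": ["a1", "a2", "a3"],
        "dataset_id": ["d1", "d1", "d2"],
        "publicationDate": ["2019-01-04", "2021-06-30", "2020-03-15"],
    })

    typed = pairs.with_columns(
        # The earliest paper citing a dataset is tagged Primary, all later ones Secondary.
        pl.when(pl.col("publicationDate").rank("ordinal").over("dataset_id") == 1)
        .then(pl.lit("Primary"))
        .otherwise(pl.lit("Secondary"))
        .alias("type")
    )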

     

    Classification: two people used different methods.

    First person: Mohsin - DOI classification

    Most time was spent on creating context for classification (<2048 tokens). I mapped dataset authors and abstracts from the DataCite public data file, then collated context by extracting the following parts of the paper:

    • context around the first regex match of the dataset ID
    • the whole paper split into overlapping chunks of 1024 chars, scored with BM25 similarity w.r.t. the dataset abstract to take the top 3 chunks (sketch below)
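    A minimal sketch of the BM25 chunk selection, assuming the rank_bm25 package, whitespace tokenization, and a half-window stride (the writeup doesn't give the overlap):

    from rank_bm25 import BM25Okapi

    def top_chunks(paper_text, dataset_abstract, k=3, size=1024, stride=512):
        # Overlapping character chunks over the whole paper.
        chunks = [paper_text[i:i + size] for i in range(0, max(len(paper_text), 1), stride)]
        bm25 = BM25Okapi([c.lower().split() for c in chunks])
        scores = bm25.get_scores(dataset_abstract.lower().split())
        best = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
        return [chunks[i] for i in sorted(best)]  # keep document order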

    Model -> MedGemma-4B lora

     

    Second person: Nikhil - classification models; trained BERT models for the classification.

    Data Sources

    Training Labels:

    • Competition Data
    • RDMPage Data (including labels absent from the Zenodo DataCite corpus) — another participant shared data they had labeled themselves

    Model Architecture

    Base Model

    microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

    Stabilization Techniques

    1. Token Replacement Strategy

    • Replace all non-target dataset IDs with: "other dataset id"
    • Transform the target dataset ID to: "Prediction {DATASET_ID}"
    • Purpose: focus the model's attention on the specific ID being classified (see the sketch after this list)

    2. Hyperparameter Configuration

    • Batch Size: 256 (unusually large)
    • Impact: Significantly improved training stability

    3. Threshold Optimization

    • Run models with multiple random seeds
    • Average out-of-fold (OOF) predictions
    • Select optimal threshold from averaged results
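    A minimal sketch of the token-replacement idea from point 1 (hypothetical helper; real code would need to guard against IDs that are substrings of other IDs):

    def mark_target(text, target, all_ids):
        for ds_id in all_ids:
            if ds_id == target:
                # Highlight the ID the model should classify.
                text = text.replace(ds_id, f"Prediction {ds_id}")
            else:
                # Neutralize every other ID so it cannot distract the model.
                text = text.replace(ds_id, "other dataset id")
        return text

    print(mark_target("See GSE1 and GSE2.", "GSE1", ["GSE1", "GSE2"]))
    # -> "See Prediction GSE1 and other dataset id."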

     

    Context Construction Example — besides the context, they also added the mention count and the repository type

    DOI Context = [Mention Token] + [Repository Token] + [Original Context]

    Example:
    "More than 3 mentions" + "Zenodo Repository" + [150 chars context]

    Key Success Factors

    Model Separation: Treating accession IDs and DOIs as distinct problems

    Context Optimization: Different window sizes for different ID types

    Feature Engineering: Leveraging metadata (mentions, repository)

    Stability Focus: Large batch size and multi-seed averaging

    Token Strategy: Replacing irrelevant IDs to reduce noise

     

    3rd place:

    Dataset type classification

    At the very beginning we knew that type classification was one of the most important parts of this competition. The reason: since the metric is F1, a FN or a FP in the retrieval part counts as only 1 error; but if we get a TP sample and misclassify its type, we get 2 errors: 1 FN for the missing correct type and 1 FP for the wrongly predicted type.
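    A quick worked example with my own numbers: suppose 100 gold (id, type) pairs and all 100 IDs retrieved correctly.

    10 type errors: TP = 90, FP = 10, FN = 10 → P = R = 0.9, F1 = 0.900
    10 retrieval misses instead: TP = 90, FP = 0, FN = 10 → P = 1.0, R = 0.9, F1 = 2(1.0)(0.9)/(1.0 + 0.9) ≈ 0.947

    So getting the ID right but the type wrong scores worse than not retrieving the citation at all.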

    Our solution here consists mainly of two steps: a 6-fold DeBERTa-v3 ensemble plus some heuristics.

    Some rules (similar to 2nd place):

    Similarly to other teams, we first used a few rules that proved to work on both LB and CV:

    • All Dryad DOIs and SAMN accessions are primary
    • If a DOI's title or author list is similar to the paper's (via string edit distance), the dataset is primary (a sketch follows this list)
    • If an accession is cited more than 5 times in EUPMC, it is secondary (simple probability check)
    • If a DOI has 5 or more citations, it is secondary (simple probability check)
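    A minimal sketch of the title-similarity rule (the 0.9 threshold is my assumption; the writeup doesn't give the cutoff):

    from difflib import SequenceMatcher

    def is_primary_by_title(paper_title, dataset_title, thr=0.9):
        ratio = SequenceMatcher(None, paper_title.lower(), dataset_title.lower()).ratio()
        return ratio >= thr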

    The remaining citations were then classified using a DeBERTa-v3 Large ensemble.

     

    Training details

    We created a 6-fold StratifiedGroupKFold (stratified by type and grouped by article_id) and trained one DeBERTa-v3 Large on each of those folds using a binary classification setup (classification head on top). The following features were used to generate the training prompts:

    • The context around the citation in the paper (1k characters)
    • The first 500 characters of the article text
    • Paper and dataset titles whenever available

     

    Two tricks made training much more stable!

    We found two very important tricks to stabilize training: gradient clipping and model/weight EMA. The latter is a very simple technique: alongside the model trained via SGD/Adam you keep a frozen counterpart that is updated by a direct (weighted) averaging of the two models' weights after each training step. With the transformers library it can be easily implemented by inheriting from the Trainer class, like the following:

    from ema_pytorch import EMA
    from transformers import Trainer


    class EMATrainer(Trainer):

        def __init__(self, ema_decay=0.9995, ema_update_every=1, *args, **kwargs):
            super().__init__(*args, **kwargs)

            # EMA is initialized lazily, after the model is fully set up.
            self.ema_decay = ema_decay
            self.ema_update_every = ema_update_every
            self.ema = None

        def _setup_ema(self):
            if self.ema is None:
                self.ema = EMA(
                    self.model,
                    beta=self.ema_decay,
                    update_every=self.ema_update_every,
                    update_after_step=50  # Start EMA after 50 steps
                )

        def training_step(self, model, inputs, num_items_in_batch=None):
            """Override training step to include EMA updates"""
            if self.ema is None:
                self._setup_ema()

            # Perform the normal training step
            loss = super().training_step(model, inputs, num_items_in_batch=num_items_in_batch)

            # Update the EMA weights after each step
            self.ema.update()

            return loss

        def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
            """Evaluate using the EMA model"""
            if self.ema is not None:
                # Temporarily swap in the EMA weights for evaluation
                original_model = self.model
                self.model = self.ema.ema_model

                results = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)

                # Restore the trainable model
                self.model = original_model
                return results
            return super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)

        def save_model(self, output_dir=None, _internal_call=False):
            """Save the EMA weights (fall back to the raw model if EMA was never set up)"""
            if output_dir is None:
                output_dir = self.args.output_dir
            ema_output_dir = f"{output_dir}/ema_model"
            if self.ema is not None:
                self.ema.ema_model.save_pretrained(ema_output_dir)
            else:
                self.model.save_pretrained(ema_output_dir)

     

    On the difference between the moving averages in AdamW and model EMA:

    The moving averages inside AdamW track the statistics of the gradients, while model EMA's moving average tracks the parameters themselves.

    • Procedurally it really is "AdamW's gradient smoothing produces θ_t, then model EMA smooths θ_t again into θ_ema_t" — a smoothing-on-top-of-smoothing effect, which matches the intuition of a "moving average of a moving average".
    • But the core difference: AdamW's EMA is an optimizer-level mechanism that acts on gradient statistics and shapes the update direction and step size of the parameters; model EMA is a model-level mechanism that acts on the parameters themselves and directly adjusts their final values.
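    In standard notation (textbook AdamW and model-EMA definitions, not from the writeup; hats denote bias-corrected moments):

    % AdamW: exponential moving averages of gradient statistics
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \theta_t = \theta_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

    % Model EMA: exponential moving average of the parameters AdamW produces
    \theta_t^{EMA} = \lambda \, \theta_{t-1}^{EMA} + (1 - \lambda) \, \theta_t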

     

     

     

    4th place (has code! worth a look, including the agent demo):

    Type Classification — I really should have fine-tuned for accession types too.

    We trained LLM classifiers that handled both the DOI and accession subsets without separate models for each. Training a robust model was tricky because the competition dataset was small and noisy. We adopted a few strategies to address these challenges:

    • Implemented a tool calling agent that generates synthetic labels using articles from the Europe PMC open access subset. The agent combined keyword and semantic search tools to gather sufficient context/evidence from an article before making a citation type classification. We warmed up public LLMs (Qwen 2.5 family) using the synthetic dataset before finetuning with the competition examples. We shared a demo notebook here showing the agent setup and classification trajectory. Agent generated synthetic dataset can be found here.
    • For model diversity in ensemble, we also trained a few models by adding pseudo labeled examples (generated with Qwen-2.5-72B) to the competition dataset.
    • A few articles contained lots of accession IDs (32+), having an outsized impact on model training. So, we limited dataset mentions to a maximum of 24 per article in the fine-tuning datamix. For data augmentation, we masked dataset IDs with indexed placeholder tokens (e.g. <DATASET_i>), where i indexes distinct dataset mentions within a given context.
    • Averaged checkpoints using Exponential Moving Average (EMA).

     

    Inference

    • To speed up inference, we assumed Secondary type for certain accession ID families — CATH, AlphaFold, Cellosaurus, ChEMBL, dbGaP, IGSR, Pfam, Reactome, and RefSeq. This was informed by stats from the synthetic labels, where these were predominantly secondary.
    • We removed accession IDs when a given article had DOI mentions in the data availability (or similar) sections, OR when the sum of primary DOI probabilities was greater than 0.8.
    • We used a cascaded inference approach: initial type classification with a fine-tuned Qwen-2.5-14B model, then routing uncertain examples (50% of DOI cases + 20% of accession cases) to a fine-tuned Qwen-2.5-32B model. Finally, the top 10% most difficult predictions were handled by a fine-tuned Qwen-2.5-72B.
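    A minimal sketch of the cascaded routing (uncertainty = distance from 0.5; the stage-2/3 helpers are hypothetical):

    import numpy as np

    def route_uncertain(probs, frac):
        # Indices of the `frac` least confident examples (prob closest to 0.5).
        margin = np.abs(probs - 0.5)
        k = int(len(probs) * frac)
        return np.argsort(margin)[:k]

    probs = np.random.rand(1000)            # stand-in for stage-1 (14B) probabilities
    to_32b = route_uncertain(probs, 0.5)    # e.g. 50% of DOI cases re-scored by the 32B
    # probs[to_32b] = qwen_32b(to_32b)      # hypothetical stage-2 helper
    to_72b = route_uncertain(probs, 0.1)    # hardest 10% go to the 72B model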

    Links:

     

     

    9th place:

    Used an LLM to extract the authors and other info from the PDF-to-text output:

    • extract title, author, published year from beginning of paper using Qwen2.5-32B

    Embedding similarity to decide Primary vs. Secondary:

    • Extracted the sentence containing the ACC_ID and computed sentence similarity with Qwen3-Embedding-0.6B against seed_sentence = "DNA Deposition\nThe following information was supplied regarding the deposition of DNA sequences:\nThey are available at\nGenBank: PRJNA664798. BioSample: SAMN16233641, SAMN16233642, SAMN16233643, SAMN16233644, SAMN16233645." — if similarity >= 0.667 and the sentence contains submit/deposit, then Primary, else Secondary.
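    A minimal sketch of this check, assuming Qwen3-Embedding-0.6B loads through sentence-transformers:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
    seed = ("DNA Deposition\nThe following information was supplied regarding "
            "the deposition of DNA sequences: ...")  # full seed sentence from above

    def classify(sentence):
        sim = util.cos_sim(model.encode(sentence), model.encode(seed)).item()
        has_cue = any(w in sentence.lower() for w in ("submit", "deposit"))
        return "Primary" if sim >= 0.667 and has_cue else "Secondary"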

     

     

    11th place:

    I also used the approach of treating the main text and the references separately — but why didn't I think of matching on author information, or of using an LLM to extract the authors?

    Part 2: DOI Classification (Primary vs. Secondary)

    We used two distinct methods for classifying DOIs:

    • Main Text DOIs: For DOIs found in the main body of the paper, we used a semantic approach, analyzing the context to determine their role.
    • Reference Section DOIs: For DOIs in the references, we based our classification on authorship. A reference DOI was classified as primary if its author list overlapped with the authors of the main paper. To get the paper's authors, we extracted them by having an LLM parse the first 2,000 characters of the text.
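    A minimal sketch of the surname-overlap rule (assumes "First Last" author strings; real name parsing needs more care):

    def surname_set(authors):
        return {a.split()[-1].lower() for a in authors if a.strip()}

    def is_primary(paper_authors, dataset_authors):
        # One shared surname is taken as evidence of ownership.
        return bool(surname_set(paper_authors) & surname_set(dataset_authors))

    print(is_primary(["Marie Barbieux", "Julia Uitz"], ["Emanuele Organelli", "Marie Barbieux"]))  # True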

     

    12th place:

    Type classification

    DOIs

    LLM-based classification leveraging context and metadata.

    • Qwen2.5-72B-Instruct-AWQ (zero-shot)
    • Input features
      • Context (~3000 chars)
      • Data Citation Corpus based features
        • Number of dataset citations
        • Whether it is the first paper that cited the dataset in the corpus
        • Elapsed days from dataset release date to paper publication date
      • Paper and dataset metadata
        • Title
        • Authors
        • Abstract
    • Force a binary choice (A: Primary / B: Secondary) and select the option with the larger logit (see the sketch after the prompt below).

     

    system = """You are given (1) an article snippet (Context) and (2) a candidate dataset identifier (DOI) with metadata for both the paper and the dataset.
     
    Decide whether the dataset is used as:
    A) Primary   data generated by/for this study
    B) Secondary reused/derived from prior work or previously published dataset
     
    Use BOTH:
    Context: the article snippet discussing data usage
    Metadata similarity: closeness between paper vs. dataset (titles, abstracts, author overlap, topics)
     
    Reply with ONLY one letter: A or B.
    """
    user = (
                f"Identifier (DOI): {to_str(dsid)}\n"
                f"Features: n_citations={to_str(n_cit)}, "
                f"is_first_publication={str(bool(row.get('is_first_publication'))).lower()}, "
                f"citations_before={to_str(cb)}, "
                f"elapsed_days_from_dataset_publication={to_str(dd)}\n\n"
                f"=== Paper Metadata ===\n"
                f"Title: {to_str(row.get('paper_title'))}\n"
                f"Authors: {to_str(row.get('paper_author_name'))}\n"
                f"Abstract: {to_str(row.get('paper_abstract'))}\n\n"
                f"=== Dataset Metadata ===\n"
                f"Title: {to_str(row.get('dataset_title'))}\n"
                f"Authors: {to_str(row.get('dataset_author_name'))}\n"
                f"Abstract: {to_str(row.get('dataset_abstract'))}\n\n"
                f"=== Context (article snippet) ===\n{to_str(row.get('chunk'))}\n"
            )
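    A minimal sketch of the larger-logit selection, written with transformers for clarity (they presumably served this with vLLM and compared the logprobs of "A" vs "B" instead):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "Qwen/Qwen2.5-72B-Instruct-AWQ"
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

    def choose(system, user):
        messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]
        prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        inputs = tok(prompt, return_tensors="pt").to(lm.device)
        with torch.no_grad():
            logits = lm(**inputs).logits[0, -1]   # next-token logits
        id_a = tok.encode("A", add_special_tokens=False)[0]
        id_b = tok.encode("B", add_special_tokens=False)[0]
        return "Primary" if logits[id_a] > logits[id_b] else "Secondary"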

    The addition of metadata contributed the most to the performance improvement (DOI-only LB: 0.333 → 0.345).

    My fine-tuned model only reached 0.314, which shows how useful the metadata is.

     

    15th place:

    Finally, I used Grobid to extract the author list for each paper and determined primary vs. secondary authors based on surnames, using an LLM in a multi-stage decision process.

    Grobid is an open-source machine learning library designed specifically for processing scholarly literature. It can:

    extract document metadata (title, authors, abstract, etc.)

    recognize and parse document structure (sections, paragraphs, figures, tables, etc.)

    extract citation information and references

    convert PDFs into structured XML or JSON

    Grobid is particularly good at handling academic papers and can accurately identify their parts, including the introduction, methods, results, and discussion sections.
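    A minimal sketch of calling a local Grobid server for the header (default port 8070; processHeaderDocument returns TEI XML with the author list):

    import requests

    def paper_header_tei(pdf_path):
        with open(pdf_path, "rb") as f:
            resp = requests.post(
                "http://localhost:8070/api/processHeaderDocument",
                files={"input": f},
            )
        resp.raise_for_status()
        return resp.text  # TEI XML; parse the <author>/<persName> elements from it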

     

    20th place:

    Accession IDs

    • Extracted candidates using regex patterns.
    • Fine-tuned DeBERTa to filter out false positives.
    • Certain ID types always mapped directly to Secondary.
    • For the rest, fine-tuned Qwen2.5-3B (base) with QLoRA:
      • Input = identifier + context in a structured template with cues.
      • Representation = global mean + target-span mean pooling.
      • Added a span-gated GELU classifier head (sketch after this list).
    • Combined with the context as input to Light-R1-14B-awq. (What model is this?)
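    A minimal sketch of how I read "global mean + target-span mean pooling" with a span-gated GELU head — my interpretation, not the team's code:

    import torch
    import torch.nn as nn

    class SpanGatedHead(nn.Module):
        def __init__(self, hidden, n_classes=2):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
            self.proj = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU())
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, states, span_mask):
            # states: (B, T, H) encoder outputs; span_mask: (B, T) floats, 1 on the target-ID span.
            global_mean = states.mean(dim=1)
            span_mean = (states * span_mask.unsqueeze(-1)).sum(1) / span_mask.sum(1, keepdim=True).clamp(min=1)
            pooled = torch.cat([global_mean, span_mean], dim=-1)
            # Gate the joint representation by what the target span looks like.
            return self.out(self.proj(pooled) * self.gate(span_mean))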

 
