scibert_scivocab_uncased

Other · by Ai2 · Model page

Ai2's BERT model pretrained on scientific text with a domain-specific vocabulary for NLP tasks in scientific domains.

Model Card

This is the pretrained model presented in "SciBERT: A Pretrained Language Model for Scientific Text". It is a BERT model trained on scientific text.

The training corpus consists of papers taken from Semantic Scholar: 1.14M papers, 3.1B tokens in total. Training uses the full text of the papers, not just the abstracts.

SciBERT has its own wordpiece vocabulary (scivocab) that's built to best match the training corpus. We trained cased and uncased versions.
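
To illustrate why a corpus-matched wordpiece vocabulary matters, here is a minimal sketch of greedy longest-match-first wordpiece splitting, the scheme commonly used by BERT tokenizers. The two vocabularies below are tiny, made-up examples (they are not the real scivocab or the original BERT vocabulary), and the helper names `wordpiece` and `tokenize` are hypothetical. The `uncased` flag only lowercases the input, which is a simplification of what an uncased BERT tokenizer does.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab, uncased=True):
    """Whitespace-split the text, optionally lowercase it, then apply wordpiece."""
    if uncased:
        text = text.lower()
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Toy vocabularies, invented for this illustration only.
general_vocab = {"the", "binds", "im", "##mun", "##og", "##lo", "##bul", "##in"}
science_vocab = {"the", "binds", "immunoglobulin"}

sentence = "The Immunoglobulin binds"
general_tokens = tokenize(sentence, general_vocab)
science_tokens = tokenize(sentence, science_vocab)
cased_lookup = tokenize(sentence, science_vocab, uncased=False)
```

With the toy general-purpose vocabulary the scientific term is broken into six pieces, while the toy domain vocabulary keeps it as one token. The last call shows why the cased and uncased variants need separate vocabularies: looking up cased text in a lowercase-only vocabulary yields unknown tokens.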

Available models include:

  • scibert_scivocab_cased
  • scibert_scivocab_uncased
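
A minimal sketch of loading either variant with the transformers library, assuming the models are published on the Hugging Face Hub under the `allenai` organization and that `transformers` and PyTorch are installed. The helper names `load_scibert` and `embed` are hypothetical and not part of any library; the heavy imports are placed inside the functions so that defining them does not trigger a download.

```python
# Hub identifiers, assuming the "allenai" organization namespace.
MODEL_IDS = {
    "cased": "allenai/scibert_scivocab_cased",
    "uncased": "allenai/scibert_scivocab_uncased",
}


def load_scibert(variant="uncased"):
    """Download (or read from cache) the tokenizer and encoder for one variant."""
    from transformers import AutoModel, AutoTokenizer

    model_id = MODEL_IDS[variant]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return contextual token embeddings for one string."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

A typical call would be `tokenizer, model = load_scibert("uncased")` followed by `embed("The immunoglobulin binds the antigen.", tokenizer, model)`. The first call needs network access to fetch the weights.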

The original repo can be found at https://github.com/allenai/scibert.

If using these models, please cite the following paper:

@inproceedings{beltagy-etal-2019-scibert,
    title = "SciBERT: A Pretrained Language Model for Scientific Text",
    author = "Beltagy, Iz  and Lo, Kyle  and Cohan, Arman",
    booktitle = "EMNLP",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1371"
}

Author
Ai2 (allenai) · Organization · ✓

Details
Downloads: 174.8K
Likes: 173
Access: Open Source
Library: transformers
Created: Mar 2, 2022
Updated: Oct 3, 2022
Languages: en