This guide shows how to set up and run the DistilBERT model from Hugging Face's Transformers library locally, covering tokenization, embedding extraction, and text classification.
Ensure Python 3.6 or newer is installed on your system (note that recent Transformers releases require Python 3.8 or newer). You can check your Python version by running:
python --version
Install PyTorch and the Transformers library to use DistilBERT. Run the following command:
pip install torch transformers
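A quick way to confirm the installation succeeded is to import both packages and print their versions:

import torch
import transformers

print(torch.__version__)
print(transformers.__version__)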
Use the following Python code to download the DistilBERT model and tokenizer:
from transformers import DistilBertTokenizer, DistilBertModel
# Load pre-trained model tokenizer (vocabulary)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# Load pre-trained model
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
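Equivalently, the library's Auto classes resolve the right tokenizer and model classes from the checkpoint name, which is convenient if you later switch to a different model:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')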
Tokenize your input text with the following code:
input_text = "Hello, world! This is a test sentence."
encoded_input = tokenizer(input_text, return_tensors='pt')
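The tokenizer returns a dict-like object holding PyTorch tensors. You can inspect the token IDs and attention mask, and map the IDs back to tokens:

print(encoded_input['input_ids'])       # tensor of token IDs, shape (1, sequence_length)
print(encoded_input['attention_mask'])  # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0].tolist()))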
Extract features from your tokenized text as follows:
import torch

# Run the forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**encoded_input)

# Token-level embeddings: shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
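If you need a single fixed-size vector per sentence rather than per-token embeddings, one common recipe (an illustrative technique, not a Transformers API) is masked mean pooling over the last hidden states. A minimal sketch:

# Masked mean pooling: average the token vectors, ignoring padding positions
mask = encoded_input['attention_mask'].unsqueeze(-1)  # (batch, seq_len, 1)
summed = (last_hidden_states * mask).sum(dim=1)       # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1)                 # guard against division by zero
sentence_embedding = summed / counts
print(sentence_embedding.shape)  # torch.Size([1, 768]) for distilbert-base-uncased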
To save the model and tokenizer locally:
model.save_pretrained('./distilbert_local')
tokenizer.save_pretrained('./distilbert_local')
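You can verify the export by listing the directory; exact filenames vary with the library version:

import os
print(sorted(os.listdir('./distilbert_local')))
# Expect config.json and tokenizer files, plus the model weights
# (model.safetensors or pytorch_model.bin, depending on your Transformers version)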
To load them:
model = DistilBertModel.from_pretrained('./distilbert_local')
tokenizer = DistilBertTokenizer.from_pretrained('./distilbert_local')
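The base DistilBertModel only returns hidden states. For end-to-end text classification, load a checkpoint that includes a classification head; the sketch below uses the publicly available distilbert-base-uncased-finetuned-sst-2-english sentiment checkpoint from the Hugging Face Hub:

import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# A DistilBERT checkpoint fine-tuned for sentiment analysis on SST-2
ckpt = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = DistilBertTokenizer.from_pretrained(ckpt)
model = DistilBertForSequenceClassification.from_pretrained(ckpt)

inputs = tokenizer("I really enjoyed this guide!", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])  # e.g. 'POSITIVE'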
You're now ready to integrate DistilBERT into your applications for a variety of NLP tasks. Adjust the provided examples according to your specific project needs.
For more detailed information on using DistilBERT and other models in the Transformers library, visit the Hugging Face documentation.
Contributions to improve this guide or the accompanying code are welcome. Please feel free to submit issues or pull requests to the repository.
Happy coding!