
Scraping paper data, preprocessing it, training BERT variants on it, deploying the model, and integrating it into a website

Neloy-Barman/Scientific-Paper-Keywords-Categorization

Scientific Paper Keywords Categorization

Project Development Journal

Problem Statement

Using scraped paper abstracts and keywords, I build a multi-label keyword classifier that assigns an abstract to one or more of a selected set of keywords.

Objective

Keywords are an essential part of a scientific paper. They help search engines show users papers on related topics, so choosing them well is important. The goal is to build an optimized keyword categorizer that assigns a scientific paper to a fixed set of keywords based on its abstract.

Data Collection

To collect data, I scraped the open-access papers available on IEEE. After inspecting the website, I wrote the scrapers with Selenium. First, I collected the paper URLs using "url_scraper". Then, visiting each URL, I fetched the abstract, the IEEE keywords, and the author keywords using "details_scraper". Despite some unpredictable issues, I managed to scrape the data and stored it in several .csv files. You can find the scraper files in the "scrapers" folder.

Data Cleaning & Pre-processing

Almost every column contained NaN or redundant values. In the "abstracts" column, some values were repetitive or irrelevant, so I treated them as invalid and deleted those rows. Then I merged the IEEE and author keywords. From the merged keywords, I kept the most commonly used ones, using a threshold value of 0.004. After that, I dropped the rows containing NaN or rare keywords and created the final dataset. You can check the data cleaning steps in the "data_cleaning" notebook. The following table gives an overview of the initial and final csv files. The final dataset is available here.
| File Name | Data Type | Rows | Columns |
| --- | --- | --- | --- |
| merged_data | Tabular Text | 40457 | 3 |
| papers_final_data | Tabular Text | 36398 | 2 |
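
The frequency-based keyword selection from the data cleaning step can be sketched as follows. This is a minimal illustration, not the code from the "data_cleaning" notebook. It assumes the threshold is the fraction of papers that carry a keyword, and that rows left without any frequent keyword are dropped; both are assumptions, and the function name and sample rows are hypothetical.

```python
from collections import Counter


def filter_by_keyword_frequency(rows, threshold=0.004):
    """Keep keywords that appear in at least `threshold` of the rows.

    rows: list of (abstract, [keywords]) pairs.
    Returns the set of frequent keywords and the filtered rows.
    Assumption: the threshold is a relative frequency over rows.
    """
    n_rows = len(rows)
    # Count each keyword once per paper, even if it is listed twice.
    counts = Counter(kw for _, kws in rows for kw in set(kws))
    common = {kw for kw, c in counts.items() if c / n_rows >= threshold}

    kept = []
    for abstract, kws in rows:
        frequent = [kw for kw in kws if kw in common]
        if frequent:  # drop rows that only had rare keywords
            kept.append((abstract, frequent))
    return common, kept


# Hypothetical toy data; a large threshold is used so the effect is visible.
rows = [
    ("abstract a", ["deep learning", "feature extraction"]),
    ("abstract b", ["deep learning"]),
    ("abstract c", ["quantum annealing"]),
    ("abstract d", ["deep learning", "task analysis"]),
]
common, kept = filter_by_keyword_frequency(rows, threshold=0.5)
print(common)
print(len(kept), "rows kept")
```

On the real dataset the threshold would stay at 0.004; the toy value of 0.5 is only there because four rows are too few for a small threshold to filter anything.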

Dataloader Creation

I encoded the unique keywords and then indexed each row by the keywords available in it. Since pre-processing may differ between models, I imported the pre-defined configuration for each model. I split the dataset into a 90% training set and a 10% validation set. Finally, I created separate dataloaders with a batch size of 16. You can check the dataloader creation steps in the "dataloader_creation" notebook.
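
The encoding, split, and batching described above can be illustrated with a small standard-library sketch. It is not the notebook's code: the helper names, the random seed, and the toy rows are hypothetical, and the model-specific tokenization is left out.

```python
import random


def build_vocab(rows):
    """Map each unique keyword to an integer index."""
    unique = sorted({kw for _, kws in rows for kw in kws})
    return {kw: i for i, kw in enumerate(unique)}


def multi_hot(keywords, vocab):
    """Encode a row's keywords as a 0/1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for kw in keywords:
        vec[vocab[kw]] = 1
    return vec


def train_valid_split(items, valid_frac=0.1, seed=42):
    """Shuffle and split into (train, valid); 10% validation by default."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = max(1, round(len(items) * valid_frac))
    return items[n_valid:], items[:n_valid]


def batches(items, batch_size=16):
    """Yield consecutive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical toy rows: (abstract, keywords).
rows = [
    (f"abstract {i}", ["deep learning"] if i % 2 else ["deep learning", "task analysis"])
    for i in range(40)
]
vocab = build_vocab(rows)
encoded = [(text, multi_hot(kws, vocab)) for text, kws in rows]
train, valid = train_valid_split(encoded)
print(len(train), len(valid), [len(b) for b in batches(train)])
```

Each target is a multi-hot vector, so one abstract can carry several keywords at once, which is what makes the task multi-label rather than multi-class.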

Model Experimentations

To classify an abstract into multiple labels, I chose BERT and two of its variants. The models are:
  • BERT
  • DistilBERT
  • RoBERTa
Training process:
  1. I froze the model with its pre-trained weights and searched for a suitable learning rate range.
  2. Then I trained the model for 10 epochs using the fit_one_cycle() method.
  3. After that, I unfroze the trained model, selected a learning rate range again, and trained the model for another 10 epochs.
For BERT and DistilBERT, the whole training process gave satisfactory results. For RoBERTa, however, training again after unfreezing caused overfitting, so it performs better after the frozen phase alone.
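
To give a feel for what a one-cycle training run does to the learning rate, here is a small sketch of a one-cycle schedule with cosine warm-up and cosine annealing. It is an illustration only: the parameter names and defaults (div=25, div_final=1e5, pct_start=0.25) are assumptions modeled on fastai's fit_one_cycle(), and the peak learning rate is hypothetical, not the value used in this project.

```python
import math


def cosine_anneal(start, end, pos):
    """Move from start to end along half a cosine wave as pos goes 0 -> 1."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2


def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pos in [0, 1].

    Warm up from lr_max / div to lr_max during the first pct_start of
    training, then anneal down to lr_max / div_final.
    """
    if pos < pct_start:
        return cosine_anneal(lr_max / div, lr_max, pos / pct_start)
    return cosine_anneal(
        lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start)
    )


lr_max = 1e-3  # hypothetical peak learning rate
schedule = [one_cycle_lr(step / 100, lr_max) for step in range(101)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```

The short warm-up followed by a long decay is the reason a learning rate range is chosen before each phase: the chosen value becomes the peak of the cycle.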

Model Evaluation

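| Model | Micro Avg. Precision | Micro Avg. Recall | Micro Avg. F1-Score | Weighted Avg. Precision | Weighted Avg. Recall | Weighted Avg. F1-Score |
| --- | --- | --- | --- | --- | --- | --- |
| BERT | 62.211 | 45.104 | 52.294 | 60.635 | 45.104 | 50.618 |
| DistilBERT | 65.810 | 40.588 | 50.209 | 63.739 | 40.588 | 48.119 |
| RoBERTa | 69.113 | 20.353 | 31.446 | 59.215 | 20.353 | 24.646 |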
The evaluation table shows that all three models have high precision and low recall, which pulls the F1-scores down. So, despite the high precision values, the overall results do not meet expectations. BERT and DistilBERT do not predict all of the expected classes, but the classes they do predict are selective and mostly correct, which explains the higher precision. Because they miss many of the expected classes, their recall is lower. RoBERTa has the highest micro-average precision, but it predicts even fewer of the expected classes, so its recall is the lowest. The F1-score balances precision and recall, and BERT scores highest on it under both averages. So, I chose BERT for the remaining tasks.
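
The relationship between the three micro-average columns can be checked with a short sketch. The function names and the toy labels are hypothetical; the last lines recompute the micro-average F1-scores from the precision and recall values reported in the table.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for multi-hot label rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t * p          # predicted and expected
            fp += (1 - t) * p    # predicted but not expected
            fn += t * (1 - p)    # expected but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy case: every predicted label is right, but half the labels are missed.
y_true = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
precision, recall, f1 = micro_scores(y_true, y_pred)
print(precision, recall, round(f1, 3))

# Micro-average precision and recall taken from the evaluation table.
reported = {
    "BERT": (62.211, 45.104),
    "DistilBERT": (65.810, 40.588),
    "RoBERTa": (69.113, 20.353),
}
recomputed = {name: f1_score(p, r) for name, (p, r) in reported.items()}
print({name: round(value, 2) for name, value in recomputed.items()})
```

The toy case mirrors the pattern in the table: a model that predicts few labels can be almost always right and still miss half of what it should find. The same check does not apply to the weighted-average columns, because a weighted F1-score averages per-class F1-scores rather than combining the averaged precision and recall.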

Model Compression

I compressed the model using ONNX. The model size was reduced by 87.45%, from 838.8 MB to 105.3 MB. The reduction comes at the cost of a drop in prediction performance. To measure it, I used the micro-average F1-score as the performance metric. The compressed model shows a relative drop of about 2.8% (from 52.2939 to 50.8322).
| Model | Size (MB) | Performance (Micro Avg. F1-Score) |
| --- | --- | --- |
| BERT | 838.8 | 52.2939 |
| Compressed BERT | 105.3 | 50.8322 |
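
The two percentages in this section follow directly from the table. A short worked calculation, using only the reported numbers (the variable names are illustrative):

```python
original_size_mb, compressed_size_mb = 838.8, 105.3
original_f1, compressed_f1 = 52.2939, 50.8322

# Share of the original size that was removed by compression.
size_reduction_pct = (original_size_mb - compressed_size_mb) / original_size_mb * 100

# Performance drop in F1 points and relative to the original score.
absolute_drop = original_f1 - compressed_f1
relative_drop_pct = absolute_drop / original_f1 * 100

print(f"size reduction: {size_reduction_pct:.2f}%")
print(f"F1 drop: {absolute_drop:.4f} points ({relative_drop_pct:.1f}% relative)")
```

Note that the 2.8% figure is relative to the original score; in absolute terms the micro-average F1-score falls by about 1.46 points.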

Deployment

I deployed the model on Hugging Face. Check out the deployment here.

Integration to website

I integrated the model into a website hosted on Render. Check out the live website here.
Screenshots: Home Page, Prediction Result

Short Video Demonstration

I prepared a short video demonstration and shared it as a LinkedIn post. Check it out here.

References

  • Fallah, Haytame, et al. "Adapting transformers for multi-label text classification." CIRCLE (Joint Conference of the Information Retrieval Communities in Europe) 2022. 2022.

Challenges Faced

  • After a scraper script had been running for a long time, Chrome sometimes showed the "Aw, Snap!" message. In that case, I reloaded the webpage manually and the script continued working as before.
  • The required web elements were not laid out the same way on every page. The details scraper worked fine on some pages but raised exceptions on others. So, I had to rewrite parts of the code to handle the different layouts and make it more general.
  • As I had to collect a lot of data, I created several scrapers of the same type and ran them simultaneously from different starting indexes. This sped up data collection a bit, although it depended heavily on internet speed.
  • Some abstracts contained values like "Retracted.", "Final version", "IEEE Plagiarism Policy.", and other unusable values. So, I went through the whole dataset and identified these values manually for the data cleaning process.
  • In the end, collecting a sufficient amount of data took a very long time, so I had to be patient.
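
Running several scrapers from different indexes, as described in the list above, comes down to giving each scraper its own slice of the collected URLs. A minimal sketch of such a split, with a hypothetical function name, URL count, and number of scrapers:

```python
def split_index_ranges(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous (start, end) ranges.

    The end index is exclusive. Ranges differ in size by at most one item.
    """
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Hypothetical example: 10 collected URLs shared between 3 scrapers.
for worker, (start, end) in enumerate(split_index_ranges(10, 3)):
    print(f"scraper {worker}: urls[{start}:{end}]")
```

Because the ranges do not overlap, each scraper can write to its own .csv file and the files can be concatenated afterwards without duplicate rows.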
