I applied stochastic gradient descent (SGD) and multinomial naive Bayes (MNB) classifiers to website classification in two settings: first on stemmed words extracted from the URLs, and then on n-grams without stemming. I also implemented a convolutional neural network (CNN) on unigram, bigram, and trigram models.
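To make the n-gram setting concrete, here is a minimal sketch of how a URL could be turned into unigram, bigram, and trigram features. It assumes word-level n-grams over alphabetic URL tokens; the helper names (`url_tokens`, `word_ngrams`, `url_features`) and the exact tokenization rule are illustrative assumptions, not the code used in this project.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (assumed tokenization rule)."""
    # Drop the scheme (e.g. "http://") so it does not become a token.
    url = re.sub(r"^[a-z]+://", "", url.lower())
    # Keep runs of letters only; digits and punctuation act as separators.
    tokens = re.findall(r"[a-z]+", url)
    # "www" carries no category information, so it is removed.
    return [t for t in tokens if t != "www"]


def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def url_features(url, max_n=3):
    """Collect unigrams up to max_n-grams for one URL (no stemming applied)."""
    tokens = url_tokens(url)
    features = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features


if __name__ == "__main__":
    print(url_features("http://www.example.com/sports/football-news"))
```

Feature lists like these could then be vectorized and passed to a linear classifier trained with SGD or to a multinomial naive Bayes model. For the stemming setting, each token would be passed through a stemmer before counting instead of building n-grams.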
The DMOZ dataset, which was also known as the Open Directory Project (ODP), is used for this task. It contains over 1.5 million websites, each belonging to one of 15 categories such as Sports, Arts, and Business (you can find it here: https://www.kaggle.com/shaurov/datasets).