Machine Learning for Website Categorization

Introduction:

 

Endurance caters to the web presence needs of small businesses across the world. Those needs vary with the kind of business a customer runs and, ultimately, with the needs of their end consumers. For example, the hosting needs of an entertainment website differ from those of a food delivery business or an e-commerce store. It is therefore important to have an intelligent, automated way of determining the business vertical of any given website.

Simply put, a website is made up of text and media. The focus of this article is on categorizing websites based on the text they contain. The problem of categorizing websites thus reduces to the well-known Document Classification problem.

A classifier needs two things: features and labels. In our case, the labels are the known business verticals of the websites and the features must come from their text content. The question is: how do we represent that text content as features? We need a vector representation that captures the meaning of each document.

 

Natural Language Features: Bag of Words

 

Bag of Words (BoW) is a well-known technique for representing text as numerical features that can be used as input to any machine learning model. In this approach, the frequency of occurrence of each word is used as a feature for training our classifier. Let's understand it with a simple example.

 

Let's say we have two documents, A and B, containing the following text:
A: Tim likes having pizza. Tim likes football too.
B: John loves football.
Based on these documents, we create a "bag" of words:
["Tim", "likes", "having", "pizza", "football", "too", "John", "loves"]
Now, we can create two feature lists, one per document, where each entry is the frequency of the corresponding word from the bag:
["Tim", "likes", "having", "pizza", "football", "too", "John", "loves"]
A: [2, 2, 1, 1, 1, 1, 0, 0]
B: [0, 0, 0, 0, 1, 0, 1, 1]

 

The frequency of the word "Tim" is 2 in document A and 0 in document B. Hence the entry in the column representing "Tim" is 2 for document A and 0 for document B.
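
To make this concrete, here is a minimal Python sketch (illustrative only, not our production code) that builds the bag and the two frequency vectors for the example above:

```python
from collections import Counter

# The two example documents from above.
docs = {
    "A": "Tim likes having pizza. Tim likes football too.",
    "B": "John loves football.",
}

def tokenize(text):
    # Lower-case and strip punctuation so "pizza." and "pizza" count as the same word.
    return [w.strip(".,!?").lower() for w in text.split()]

# Build the shared vocabulary ("bag") from all documents, keeping first-seen order.
vocab = []
for text in docs.values():
    for word in tokenize(text):
        if word not in vocab:
            vocab.append(word)

# Represent each document as a vector of word frequencies over the vocabulary.
for name, text in docs.items():
    counts = Counter(tokenize(text))
    print(name, [counts[word] for word in vocab])
# A [2, 2, 1, 1, 1, 1, 0, 0]
# B [0, 0, 0, 0, 1, 0, 1, 1]
```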

Bag-of-words features have some major weaknesses:

  • Loss of word ordering: When we constructed the "bag" of words earlier, the order in which words appeared in each document was not preserved. All we cared about was the frequency of occurrence of each word in each document.
  • Semantics of the words ignored: The same word can be used in different contexts in different documents, but the context in which a word appears is not captured by the counts.
  • Data sparsity: The "bag" of words is constructed from all unique words across all documents, so most documents have non-zero entries for only a few words from the bag. In the feature list for document B, only 3 entries are non-zero, and this will be the case for most documents.
  • High dimensionality: As we add more documents, the number of entries in the bag of words grows considerably. Say we add a third document C containing the text "Man is mortal": the words "man" and "mortal" now have to be added to the bag. With 30,000 documents we would end up with thousands of unique words in the bag, so the number of columns keeps growing as we add documents. This leads to the classical curse of dimensionality: high-dimensional training data reduces the predictive power of a model when the number of training samples is fixed.

To overcome the shortcomings of the BoW model, we use the Doc2Vec algorithm to represent documents as vectors.

 

Natural Language Features: Doc2Vec

 

Doc2Vec is an unsupervised algorithm that learns fixed-length vector representations for documents, given documents associated with labels (tags). It is an extension of Word2Vec, which is used to generate word embeddings: Word2Vec takes a huge corpus of words as input and generates a vector representation for each word, such that words used in similar contexts are placed closer to each other. Similarly, Doc2Vec takes a large corpus of documents as input and represents each document as a vector such that similar documents are placed close together in the vector space. A useful property of Doc2Vec is that you can choose the dimensionality of the vectors that represent your documents, so the curse of dimensionality mentioned earlier can be avoided.
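
To get an intuition for what "placed closer to each other" means, the sketch below compares made-up embedding vectors with cosine similarity; the vectors and their dimensionality are purely hypothetical, while real Word2Vec or Doc2Vec vectors would be learned from a large corpus:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors: values near 1.0 mean
    # "pointing in the same direction", values near 0 mean unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings purely for illustration; real document
# vectors typically have 100-300 dimensions.
pizza_site  = np.array([0.9, 0.1, 0.2, 0.0])
recipe_site = np.array([0.8, 0.2, 0.1, 0.1])
gaming_site = np.array([0.1, 0.9, 0.0, 0.3])

print(cosine_similarity(pizza_site, recipe_site))  # high: similar content
print(cosine_similarity(pizza_site, gaming_site))  # low: unrelated content
```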

Here are 2-D projections of the vectors generated by the BoW and Doc2Vec techniques. In the following visuals, each point is a unique website, colored by its business vertical, and we can visually differentiate the business categories.

[Figures: 2-D projections of the BoW vectors and of the Doc2Vec vectors, with each point a website colored by its business vertical.]
From the projections of the Doc2Vec vectors, we can observe that categories which are closely related in nature, such as "Computer and Electronics" and "Internet and Telecom", or "Business and Industry" and "Finance", are placed close together. Similarly, "Arts and Entertainment", "Games", "Sports" and "News and Media" are visual neighbours.

Building the Classifier:

A. Get Data

There are two steps involved in getting the data for our classifier:

 

  1. Get a list of websites along with their categories. We obtained the list from DMOZ, where you can get the names of websites labelled with their hierarchical categories.
  2. Crawl the HTML from the homepage of each of these websites (a minimal sketch follows this list).
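
As a rough illustration of step 2, here is a minimal crawling sketch using the Python requests library; the URLs and category names are placeholders, and a real crawler would need politeness delays, retries and deduplication:

```python
import requests

# Hypothetical (site, category) pairs as they might come from a DMOZ-style listing.
labelled_sites = [
    ("http://example.com", "Business and Industry"),
    ("http://example.org", "Arts and Entertainment"),
]

def fetch_homepage(url, timeout=10):
    # Download the raw HTML of a site's homepage; return None on any network error.
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

raw_pages = [(fetch_homepage(url), category) for url, category in labelled_sites]
raw_pages = [(html, category) for html, category in raw_pages if html]
```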

 

B. Preprocess the website data

  1. Extract plain text from the HTML pages.
  2. Tokenize.
  3. Remove all punctuation, stop-words, foreign characters, and words shorter than 2 characters (see the sketch below).
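
A minimal sketch of these preprocessing steps, assuming BeautifulSoup for HTML parsing and a deliberately tiny stop-word list (in practice a fuller list, such as NLTK's, would be used):

```python
import re
from bs4 import BeautifulSoup

# A tiny illustrative stop-word list; a real pipeline would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}

def preprocess(html):
    # 1. Extract plain text from the HTML page.
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    # 2. Tokenize, keeping only alphabetic ASCII runs; this also drops punctuation,
    #    digits and foreign characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 3. Drop stop-words and words shorter than 2 characters.
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 2]

sample_html = "<html><body><h1>Fresh Pizza!</h1><p>Order the best pizza in town.</p></body></html>"
print(preprocess(sample_html))  # ['fresh', 'pizza', 'order', 'best', 'pizza', 'town']
```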

 

C. Convert Documents to Vectors

  1. Train Doc2Vec model using the preprocessed documents and their corresponding labels.
  2. Infer a vector for every document (a minimal sketch of both steps follows).
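
With the gensim implementation of Doc2Vec, these two steps might look roughly like this; the toy corpus and the parameter values are illustrative, not the settings we actually used:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical preprocessed documents: each is a token list plus its business vertical.
corpus = [
    (["order", "fresh", "pizza", "delivery", "menu"], "Food and Drink"),
    (["live", "football", "scores", "league", "highlights"], "Sports"),
    (["cloud", "hosting", "servers", "domain", "email"], "Computer and Electronics"),
]

# 1. Train Doc2Vec on the labelled documents (the tags carry the category labels).
tagged = [TaggedDocument(words=tokens, tags=[label]) for tokens, label in corpus]
model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# 2. Infer a fixed-length vector for every document (or for a new, unseen one).
vectors = [model.infer_vector(tokens) for tokens, _ in corpus]
print(vectors[0].shape)  # (100,)
```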

 

Classifier

 

We collected around 2,000 websites per category. We kept around 90% of the dataset to train and validate our classifier and used the remaining 10% as a hold-out set to test it.

Steps A through C make our raw labelled webpages ready for training our classifier. We used a Support Vector Machine (SVM) as our classifier.

Classifiers are machine learning algorithms which learn the association between input features and output categories. Once a classifier has been trained on labelled data, it can predict labels for unseen data, provided the unseen data is supplied in the same format as the training input. There are several classification algorithms available, such as Random Forest, SVM and Naive Bayes. Each of these algorithms also has tuning knobs, called hyper-parameters, which cannot be learned directly from the training process and need to be tuned for the kind of dataset we are dealing with. There are various methods for tuning hyper-parameters: grid search, random search and Bayesian hyper-parameter optimization. We performed Bayesian hyper-parameter optimization and obtained the best hyper-parameters for our classifier.
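
A simplified sketch of this training and tuning flow with scikit-learn is shown below. It uses random placeholder data instead of real Doc2Vec vectors, and a plain grid search stands in for the Bayesian optimization described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Placeholder data: X holds one "document vector" per website, y the category labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # 200 documents, 100-dimensional vectors
y = rng.integers(0, 4, size=200)  # 4 example categories

# Hold out ~10% of the data for final testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Tune the SVM's hyper-parameters over a small grid of C and gamma values.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
# On random placeholder data this is near chance; with real vectors it reflects
# the classifier's actual hold-out performance.
print("hold-out accuracy:", search.score(X_test, y_test))
```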

Conclusion

 

The two most important components of any machine learning system are the quality of the input provided to it and the learning technique used. We discussed the issues with BoW representations: they suffer from the curse of dimensionality and do not take semantics into consideration. Doc2Vec, by contrast, generates fixed-length vectors from documents, so we can keep adding documents per category and train the classifier better without blowing up the feature dimensionality.

 

The classifier can be further improved in two suggested ways:

 

  • Get more data

    That's true of any predictive model you'll build. The more training samples available to your classifier, the better it will learn to classify unseen documents.

 

  • Semi-supervised approach

    Use semi-supervised modelling on top of billions of websites to create a better classifier. In simple terms, this means inferring vector representations of new websites with our Doc2Vec model and labelling a website with a category when its vector lies close to the vectors of that category. This way, we get lots of training data for our classifier to learn from. A minimal sketch of this idea follows.
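
A minimal sketch of this pseudo-labelling idea, with hypothetical category centroids (the mean Doc2Vec vector of each category's labelled websites), might look like this:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical centroids: the mean document vector of the labelled websites per category.
category_centroids = {
    "Sports": np.array([0.9, 0.1, 0.0]),
    "Finance": np.array([0.1, 0.8, 0.2]),
}

def pseudo_label(doc_vector, centroids, threshold=0.8):
    # Assign the closest category's label only if the new website's vector is
    # confidently close to that centroid; otherwise leave it unlabelled.
    best_category, best_score = max(
        ((cat, cosine_similarity(doc_vector, centroid)) for cat, centroid in centroids.items()),
        key=lambda item: item[1],
    )
    return best_category if best_score >= threshold else None

new_site_vector = np.array([0.85, 0.15, 0.05])  # inferred with the trained Doc2Vec model
print(pseudo_label(new_site_vector, category_centroids))  # 'Sports'
```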