Automated Backlink Suggestions using NLP

November 20, 2021
By Soumayan Pal

Introduction

In this blog, we will see how machine learning can help product marketers and digital marketers improve their SEO score. Today’s marketers want organic growth, and for that their content needs a good SEO score so that it is visible to their audience in search engines. Relevant backlinks and interlinks, placed appropriately in a blog, help improve that score.

Marketers around the world find it challenging to link the right blogs or videos in their content for the following reasons:

They have no record of the hundreds or thousands of blogs and videos generated by their company
They have no time to go through other blogs and their intent every time they publish an article

This is a problem of scale, which makes it a good candidate for AI/ML. As a solution, I have developed an algorithm that automatically suggests relevant backlinks for a content piece while marketers enjoy their coffee! Ain’t that cool?

What are Backlinks?

Basically, backlinks are hyperlinks that direct us from the current web page to another web page. They are very useful for referring users to related pages and, most importantly, they enhance the customer experience: if someone wants to know more about a topic, there should be a way for them to learn more.

We call this task Automated Backlink Suggestions, a term I will keep using throughout this blog. Let’s get started…

Pre-requisite or inputs needed for Automated Backlink Suggestions task:

A finalised content piece, usually a blog.
Primary related keyword (targeted by companies to improve SEO score)
Database of all blogs, or videos, or any other content to be highlighted or referenced
The link to the content which will be attached as a backlink.
Meta Title of the content
Meta Description of the content

Once the input is fed to a machine, it will leverage an AI model (we have used NLP models) that will suggest backlinks related to the content.

Steps involved for Automated Backlink Suggestions task using NLP models

We can divide the Automated Backlink Suggestions task into 6 steps for easier understanding. Let’s have a look at the steps before I explain them all:

Processing the input text (blog)
Filtering on the basis of ‘Keywords’
Processing the filtered sentences of the input text
Dealing with the Metadata table
Installing and loading the Sentence embedding model
Finding the similarity score of the word embeddings and filtering out similar page links

Now we will walk through each step in detail…

1. Processing the input text (blog)

First, the user is asked to enter the content (a blog, in this case) for which they need suggestions.

Once the input is given, the blog is broken down into sentences. We have used the split function to extract all the sentences, with ‘.’ (full stop) as the separator. Once all the sentences of the blog are extracted, a data frame is created with n rows, where n is the number of sentences.

In the above diagram, we can see that the sample blog has been split into 58 sentences, and each sentence is a row of the data frame ‘df1’. ‘split_text’ is a list holding all the sentences, along with their position indexes.
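The splitting step can be sketched roughly as follows. This is a minimal illustration rather than the exact code behind the screenshot: the sample text and the column name 'Blog_lines' are assumptions, while ‘split_text’ and ‘df1’ follow the names used above.

```python
import pandas as pd

# Hypothetical sample input; in practice this is the full blog text.
blog_text = (
    "Spinnaker is a continuous delivery platform. "
    "It integrates with Jenkins. "
    "Many teams deploy to Kubernetes."
)

# Split on the full stop, trim whitespace, and drop empty fragments
# (the text after the final '.' is an empty string).
split_text = [s.strip() for s in blog_text.split(".") if s.strip()]

# One sentence per row; the column name 'Blog_lines' is an assumption.
df1 = pd.DataFrame({"Blog_lines": split_text})
print(len(df1), "sentences")
```

Splitting on ‘.’ is simple but it also breaks on abbreviations, decimals and URLs, so a dedicated sentence tokenizer may be worth considering for messier text.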

2. Filtering on the basis of ‘Keywords’

Now that we have df1 (the input data frame), which naturally holds a lot of data, we need to keep only those sentences that contain the prioritised, domain-centric keywords to which our user wants to attach backlinks.

Why filtering sentences is a good idea:

Firstly, filtering helps the user get more optimised results for the selected sentences.
Secondly, it reduces the computation time and power required for this task.
And finally, it’s unnecessary to keep every sentence for suggestions; nobody wants to see backlinks attached to every line of a blog, right?

So, basically the user can enter the ‘Keywords’, as shown below.

Refer to the below screenshot of the set of keywords (Spinnaker, CI/CD, Kubernetes, Jenkins, AWS, Security, Continuous Delivery) I have entered to filter out the sentences.

Once you enter the set of keywords, only those sentences containing these exact keywords will be kept, and all others will be discarded.

In this snapshot, we can see that the list ‘lst’ contains only those sentences that include the entered keywords. The dataframe ‘df2’ is then created; it contains the 24 filtered sentences, out of the 58 sentences present in the blog, as individual rows.
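A minimal sketch of the keyword filter is shown below. The sample sentences and the column name 'Blog_lines' are assumptions, and matching is done case-insensitively here, which is my choice for the illustration (drop `case=False` for strictly exact matching).

```python
import re
import pandas as pd

# Hypothetical stand-in for the sentence data frame from step 1.
df1 = pd.DataFrame({"Blog_lines": [
    "Spinnaker is a continuous delivery platform",
    "The weather was pleasant that day",
    "Jenkins pipelines can trigger deployments to Kubernetes",
]})

keywords = ["Spinnaker", "CI/CD", "Kubernetes", "Jenkins",
            "AWS", "Security", "Continuous Delivery"]

# One alternation pattern; re.escape keeps characters like '/' literal.
pattern = "|".join(re.escape(k) for k in keywords)
mask = df1["Blog_lines"].str.contains(pattern, case=False, regex=True)

# 'lst' and 'df2' follow the names used in the text.
lst = df1.loc[mask, "Blog_lines"].tolist()
df2 = pd.DataFrame({"Blog_lines": lst})
print(len(df2), "of", len(df1), "sentences kept")
```

Note that this is plain substring matching, so a short keyword such as ‘AWS’ would also match inside a longer word; adding word boundaries to the pattern is one way to tighten it.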

3. Processing the filtered sentences of the input text

The filtered sentences need to be processed so that they work as a more optimised input for the model. For this we can perform basic preprocessing steps, such as removing URLs, hashtags, numeric values and extra white spaces, to get the final version of the user input. If you want to learn more, read my previous blog on text similarity and how to implement it.

Let’s have a look at the pre-processing code directly for a better understanding.

Now another column, ‘Cleaned_blog_lines’, gets added to the dataframe ‘df2’. It contains all the processed sentences of the blog/content to which the user wants to add backlinks.
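The preprocessing described above might look something like this. It is a sketch under assumptions: the helper name `clean_text`, the regular expressions and the sample sentences are mine, and only the four cleaning steps and the column name ‘Cleaned_blog_lines’ come from the text.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Remove URLs, hashtags, numeric values and extra white space."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"\d+(?:\.\d+)*", " ", text)          # numbers such as 2021 or 1.21
    text = re.sub(r"\s+", " ", text)                    # collapse white space
    return text.strip()

# Hypothetical filtered sentences; 'Blog_lines' is an assumed column name.
df2 = pd.DataFrame({"Blog_lines": [
    "Read more at https://example.com/docs about  Jenkins #devops",
    "Kubernetes 1.21 was released in 2021",
]})

df2["Cleaned_blog_lines"] = df2["Blog_lines"].apply(clean_text)
print(df2["Cleaned_blog_lines"].tolist())
```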

4. Dealing with the Metadata table

Before performing this task, let us have a look at what the metadata table stored in the database looks like.

As we can see in the snapshot, the dataframe ‘df3’ denotes the metadata table which contains several columns of information about the page links (which will be used as backlinks in the content). We only need the ‘Meta Description’ column which gives a description of every page link.

Once we have loaded this dataframe and extracted the ‘Meta Description’ column, we need to clean and process the description text to make it more machine readable. For this we perform the same preprocessing steps that we applied to the filtered input sentences.

Once all the preprocessing steps have been performed, we get the cleaned description text column for the page links. This column will be compared with the input text to get the similarity score.
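As a rough illustration, the metadata table and its cleaned column could be prepared as below. The rows are made up, and dropping rows without a description is my assumption; the column names ‘Page’, ‘Meta Title’, ‘Meta Description’ and ‘Cleaned_Meta_description’ follow the text.

```python
import pandas as pd

# Hypothetical metadata table; real data would come from the database.
df3 = pd.DataFrame({
    "Page": ["https://example.com/blog/a",
             "https://example.com/blog/b",
             "https://example.com/blog/c"],
    "Meta Title": ["Deploying with Spinnaker", "Untitled", "Jenkins security"],
    "Meta Description": [
        "Deploy to Kubernetes with Spinnaker in 5 steps #cicd",
        None,
        "Secure   your Jenkins pipelines",
    ],
})

# Pages without a description cannot be compared, so skip them (assumption).
df3 = df3.dropna(subset=["Meta Description"]).reset_index(drop=True)

# Same cleaning steps as for the blog sentences, in vectorised form.
df3["Cleaned_Meta_description"] = (
    df3["Meta Description"]
    .str.replace(r"https?://\S+|www\.\S+", " ", regex=True)  # URLs
    .str.replace(r"#\w+", " ", regex=True)                   # hashtags
    .str.replace(r"\d+(?:\.\d+)*", " ", regex=True)          # numeric values
    .str.replace(r"\s+", " ", regex=True)                    # extra white space
    .str.strip()
)
print(df3[["Page", "Cleaned_Meta_description"]])
```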

5. Installing and loading the Sentence embedding model

Once we have our pair of inputs (the description metadata of every page link, and the user blog to which the best-suited page links are to be attached as backlinks), we have to think about how to create a model that can pick the most related page links and display them as output.

To do this, we first have to transform all the input texts into embeddings, or word vectors. Embeddings are N-dimensional vectors that try to capture meaning and context in their values. Any set of numbers is a valid word vector, but to be useful, a set of word vectors should capture the meaning of words, the relationships between words, and the context in which different words have been used in the text. This capturing of meaning and context is referred to as the semantic analysis of the text. To understand more about embeddings and vectors, refer to my previous blog.

There are different pre-trained models available for transforming text into embeddings, such as GloVe, fastText and BERT. Here, we have used the stsb-roberta-large model from the Sentence Transformers (SBERT) package, whose pre-trained models are available through the Hugging Face model hub.

Using GloVe Model

Personally, when I tried to generate word embeddings using the GloVe model, it failed to recognize the technical words, which were missing from its vocabulary, and kept on throwing errors. Have a look at the error message (ignore the code for now; we will understand it in the next step).

It’s a KeyError showing that the word ‘kubernetes’ is not in the vocabulary of the model, that is, the dictionary of words on which the model has been trained. The reason behind this error is that the GloVe model (like any other model that maps whole words to vectors from a fixed vocabulary) cannot handle words on which it has not been trained.
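To see why a fixed vocabulary fails like this, here is a toy illustration. The dictionary below is a hypothetical stand-in with made-up numbers, not real GloVe vectors, and `lookup` is a helper invented for the example.

```python
# Toy stand-in for a fixed-vocabulary embedding table (not real GloVe data).
toy_vectors = {
    "cloud": [0.1, 0.3],
    "server": [0.2, 0.1],
}

def lookup(word):
    # A plain dictionary lookup: unseen words raise KeyError.
    return toy_vectors[word.lower()]

try:
    lookup("Kubernetes")
    missing = None
except KeyError as err:
    missing = err.args[0]
    print(f"KeyError: {missing!r} is not in the vocabulary")
```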

But it’s practically impossible to keep all existing words in the training set, especially technical words like ‘Kubernetes’ or ‘Jenkins’, as thousands of new words are born with new technologies emerging every day. One solution is to retrain the huge model every day with an updated training set of all new words… And yes, you guessed it right, that’s not a ‘solution’ at all (at least not a practical one). So we should use sentence transformers, which break unfamiliar words into smaller sub-word pieces that they already know.

Using Sentence Transformers

Firstly, you have to install the Sentence Transformers package on your system. You can run the following command.

!pip install sentence-transformers

Once the model is loaded, we are all set to proceed to the next step: using this model to extract the embeddings and check the similarity between them.
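Loading the model could be sketched as follows. The wrapper `load_sentence_model` is a hypothetical helper of mine; `SentenceTransformer` and its `encode` method are part of the sentence-transformers package, and the model name is the one mentioned above.

```python
def load_sentence_model(model_name: str = "stsb-roberta-large"):
    """Load a pre-trained Sentence Transformers model by name.

    The import is deferred to call time so that this snippet can be
    defined even on a machine where the package is not installed yet.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)

# Usage (the first call downloads the model weights, which are large):
# model = load_sentence_model()
# embedding = model.encode("Spinnaker automates continuous delivery")
# print(embedding.shape)
```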

6. Finding the similarity score of the word embeddings and filtering out similar page links

This is the final step of our task, where we will use the Cosine Similarity metric to get a similarity score. In general cosine similarity ranges from -1 to 1, but for the text embeddings compared here the score typically falls between 0 and 1.

If the score is close to 0, the two sets of text embeddings are least similar,
and if it is 1, they are most similar to each other.
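For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The small NumPy sketch below uses made-up two-dimensional vectors (real sentence embeddings have many more dimensions); it also shows that, mathematically, the value can be negative when vectors point in opposite directions.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """cos(theta) = (a . b) / (|a| * |b|)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # most similar
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite direction
print(same_direction, orthogonal, opposite)
```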

There are other text similarity metrics, like Jaccard Similarity or Euclidean Distance, which we can use. However, I prefer Cosine Similarity because it depends only on the angle between the embedding vectors and not on their magnitude, so two texts can be scored as similar even when their vectors are far apart in Euclidean distance.

To get our set of suggested backlinks, we first run a loop in which the previously loaded model encodes each text piece of the ‘Cleaned_blog_lines’ column of the ‘df2’ dataframe into embeddings. Inside this loop we run a nested loop that performs the following steps in sequence:

Encode the text in each row of the ‘Cleaned_Meta_description’ column in the dataframe ‘df3’ (metadata table) into embeddings.
Compare each encoded row of ‘Cleaned_Meta_description’ with the ith row of ‘Cleaned_blog_lines’ using the cosine similarity metric, producing a score in each jth iteration, where i is the loop variable of the outer loop and j is the loop variable of the inner loop.
Within the jth iteration, once the similarity score for the jth row of ‘Cleaned_Meta_description’ is produced, a condition appends the corresponding row of the ‘Page’ column of dataframe ‘df3’ (which contains the page links) to a list ‘links’, but only if the score is greater than 0.65.

Once we are out of the inner loop, the list ‘links’ holds the set of backlinks for the ith row of the ‘Cleaned_blog_lines’ column. So we declare another list, ‘t_links’, initially empty, to which ‘links’ is appended once per outer iteration. Along with this, we also store the corresponding original filtered sentences in a list ‘t_blog_lines’. Now that we have understood the theory of the backlink suggestion task, let me show the code to make it more understandable.
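The nested loop can be sketched along these lines. This is an illustration under assumptions, not the original code: the function `suggest_backlinks`, the column name 'Blog_lines' and the sample rows are mine, and a toy keyword-count encoder stands in for the real model so the example runs without downloading anything. With Sentence Transformers, `model.encode` could be passed as `encode` instead.

```python
import numpy as np
import pandas as pd

def suggest_backlinks(df2, df3, encode, threshold=0.65):
    """For each blog sentence, collect page links whose description is similar."""
    t_links, t_blog_lines = [], []
    for i in range(len(df2)):                      # outer loop: blog sentences
        blog_vec = np.asarray(encode(df2["Cleaned_blog_lines"].iloc[i]), dtype=float)
        links = []
        for j in range(len(df3)):                  # inner loop: page descriptions
            meta_vec = np.asarray(encode(df3["Cleaned_Meta_description"].iloc[j]), dtype=float)
            denom = np.linalg.norm(blog_vec) * np.linalg.norm(meta_vec)
            # Cosine similarity; a zero vector is treated as "no similarity".
            score = float(np.dot(blog_vec, meta_vec) / denom) if denom else 0.0
            if score > threshold:
                links.append(df3["Page"].iloc[j])
        t_links.append(links)
        t_blog_lines.append(df2["Blog_lines"].iloc[i])
    return t_links, t_blog_lines

# Toy encoder (assumption): counts a few keywords instead of using a real model.
VOCAB = ["kubernetes", "jenkins", "spinnaker", "security"]

def toy_encode(text):
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

df2 = pd.DataFrame({
    "Blog_lines": ["Spinnaker deploys to Kubernetes.", "Jenkins builds the code.", "AWS hosts the cluster."],
    "Cleaned_blog_lines": ["Spinnaker deploys to Kubernetes", "Jenkins builds the code", "AWS hosts the cluster"],
})
df3 = pd.DataFrame({
    "Page": ["https://example.com/k8s", "https://example.com/jenkins"],
    "Cleaned_Meta_description": ["Kubernetes and Spinnaker deployment guide", "Jenkins security tips"],
})

t_links, t_blog_lines = suggest_backlinks(df2, df3, toy_encode)
print(t_links)
```

Because the page descriptions do not change between outer iterations, encoding them once before the loops would avoid a lot of repeated work with a large model; the version above simply mirrors the steps as described.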

When the execution of the outer loop is complete, we get the following lists as output:

t_links – Contains all the sub-lists of suggested backlinks, one for each filtered sentence of the user blog. If no suitable links are found for a particular sentence, the sub-list for that sentence will be empty.

t_blog_lines – Contains all the filtered sentences. The position indexes of these sentences are the same as the indexes of the t_links list.

Saving the outputs

Now that we have both lists, we will store them in a .csv file and export it so that the user can access it. We create a new data frame, add two columns, ‘Blog_lines’ and ‘Backlinks’, and store the lists t_blog_lines and t_links in them respectively.
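A minimal sketch of the export step is given below. The sample lists, the file name and the choice to join each sub-list into a comma-separated string are assumptions for illustration; the data frame and column names follow the text.

```python
import os
import tempfile
import pandas as pd

# Hypothetical results from the previous step.
t_blog_lines = ["Spinnaker deploys to Kubernetes.", "AWS hosts the cluster."]
t_links = [["https://example.com/k8s", "https://example.com/cd"], []]

output_df = pd.DataFrame({"Blog_lines": t_blog_lines, "Backlinks": t_links})

# Join each sub-list into one string so the CSV cell stays readable (assumption).
output_df["Backlinks"] = output_df["Backlinks"].apply(", ".join)

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "backlink_suggestions.csv")
output_df.to_csv(out_path, index=False)
print(output_df)
```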

Once the data frame ‘output_df’ is created, it will look something like this:

Conclusion

This was my experience of developing a backlink suggestion algorithm, which I think should be very useful for product marketers and digital marketers, as it will save a lot of the precious time they need for creating quality content. I must also add that without highly efficient transformer models, these kinds of NLP projects would be difficult and time-consuming. So, I think open-source communities like Hugging Face are doing a really good job of training these agile and robust transformer models.
