Introduction
Search Engine Optimisation (SEO) plays a crucial role in any consumer website: it helps relevant pages rank on search engines like Google, Bing, and DuckDuckGo. The basic building block of successful SEO is how well a website allows its pages to be crawled, or in other words indexed, by search engines.
There are many SEO tools on the market today, such as Screaming Frog, which offers a licensed version and a limited trial version. These tools do not give us much flexibility when it comes to adding assertions on URLs or handling special cases: checking the format of the page title and Open Graph tags, validating URLs while the crawl is running, running the crawler in the cloud, scheduling, Docker containerisation, and so on.
So, while working on one of my projects, I decided to build an alternative crawler based on Python's Scrapy that helps address the shortcomings above.
Prerequisites
- Python 3.6 or higher
- Docker, if you want to run single or multiple instances of the crawler
Clone the project from: Crawler in Python
Running with Custom URLs:
After the project is set up, create a .env file at the root level. The following variables are set there:
site - the website URL to crawl
check_social_in_urls - sub-URLs on which to check social media tags (optional)
This .env file is served as environment variables that specify the URLs you want to crawl.
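A minimal .env could look like the following; the values here are placeholders for illustration, not taken from the project:

```
site=https://example.com
check_social_in_urls=/blog,/news
```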
Commands to run
- Set up a virtual environment with:
python3 -m venv nameofenv
A folder with that name is created in the root directory, for example (bot).
- Activate the virtual environment with:
source nameofenv/bin/activate
- After activation, install all required packages using:
pip install -r requirements.txt
- Now, under the subdirectory demo, run:
scrapy crawl me
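The custom assertions mentioned in the introduction (page-title format, Open Graph tags) can be implemented as plain helper functions that the spider calls from its parse callback. The sketch below is illustrative only; the function names and the 60-character title limit are my assumptions, not part of the repository:

```python
import re


def check_title(title, max_len=60):
    """Return a list of problems with a page <title> (empty list means OK)."""
    problems = []
    if not title or not title.strip():
        problems.append("missing title")
    elif len(title) > max_len:
        problems.append(f"title longer than {max_len} chars")
    return problems


def check_open_graph(html):
    """Return the required Open Graph properties missing from raw HTML."""
    required = ("og:title", "og:description", "og:url")
    return [prop for prop in required
            if not re.search(rf'property=["\']{prop}["\']', html)]
```

In a Scrapy spider you would call these from `parse(self, response)` with `response.css("title::text").get()` and `response.text`, and log or yield any problems alongside the crawled URL.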
Running with Docker images
Build the image with:
docker build -t crawler:custom .
Then run the crawler inside a container:
docker run crawler:custom crawl me
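For reference, a minimal Dockerfile for such a Scrapy project could look like the sketch below. The base image, paths, and the `demo` working directory are assumptions on my part; check the repository for the actual file:

```dockerfile
FROM python:3.9-slim

# Install dependencies first so Docker can cache this layer
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project and run from the Scrapy project subdirectory
COPY . .
WORKDIR /app/demo

# "docker run <image> crawl me" appends arguments to the scrapy entrypoint
ENTRYPOINT ["scrapy"]
CMD ["crawl", "me"]
```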
Final note
There are many aspects of the crawler library still to be unearthed, and any of us can discover more of them while working with it.