Introduction
Search Engine Optimisation (SEO) plays a crucial role in any consumer website: it helps relevant pages rank on search engines like Google, Bing, and DuckDuckGo. The basic building block of successful SEO is how well a website allows its pages to be crawled, or in other words indexed, by search engines.
There are many SEO tools on the market today, such as Screaming Frog, which offers a licensed version and a limited trial version. These tools do not give us much flexibility when it comes to adding assertions on URLs or handling special cases: checking the format of the page title and Open Graph tags, validating URLs while the crawl is running, running the crawler in the cloud, scheduling, Docker containerisation, and so on.
So, while working on one of my projects, I decided to build an alternative crawler based on Python's Scrapy that helps address the shortcomings above.
Prerequisites
- Python 3.6 or higher
- Docker, if you want to run single or multiple instances of the crawler
Clone the project from: Crawler in Python
Running with Custom URLs:
After the project is set up, create a .env file at the root level. The following variables are set there:
site - the website URL to crawl
check_social_in_urls - sub-URLs on which to check social media tags (optional)
This .env file is served as environment variables that specify the URLs you want to crawl.
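A minimal .env could look like the following; the values here are placeholders for illustration, not taken from the project:

```
site=https://example.com
check_social_in_urls=/blog,/news
```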
Commands to run
- Set up a virtual environment with:
python3 -m venv nameofenv
A folder with that name is created in the root directory, for example (bot).
- Activate the virtual environment with:
source nameofenv/bin/activate
- After activation, install all required packages using:
pip install -r requirements.txt
- Now, under the subdirectory demo, run:
scrapy crawl me
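The custom assertions mentioned in the introduction (page-title format, Open Graph tags) can be implemented as plain helper functions that the spider calls from its parse callback. The sketch below is illustrative only; the function names and the 60-character title limit are my assumptions, not part of the repository:

```python
import re


def check_title(title, max_len=60):
    """Return a list of problems with a page <title> (empty list means OK)."""
    problems = []
    if not title or not title.strip():
        problems.append("missing title")
    elif len(title) > max_len:
        problems.append(f"title longer than {max_len} chars")
    return problems


def check_open_graph(html):
    """Return the required Open Graph properties missing from raw HTML."""
    required = ("og:title", "og:description", "og:url")
    return [prop for prop in required
            if not re.search(rf'property=["\']{prop}["\']', html)]
```

In a Scrapy spider you would call these from `parse(self, response)` with `response.css("title::text").get()` and `response.text`, and log or yield any problems alongside the crawled URL.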
Running with Docker images
Build the image with:
docker build -t crawler:custom .
Then run the crawler inside a container:
docker run crawler:custom crawl me
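For reference, a minimal Dockerfile for such a Scrapy project could look like the sketch below. The base image, paths, and the `demo` working directory are assumptions on my part; check the repository for the actual file:

```dockerfile
FROM python:3.9-slim

# Install dependencies first so Docker can cache this layer
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project and run from the Scrapy project subdirectory
COPY . .
WORKDIR /app/demo

# "docker run <image> crawl me" appends arguments to the scrapy entrypoint
ENTRYPOINT ["scrapy"]
CMD ["crawl", "me"]
```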
Final note
There are many aspects of the crawler library still to be unearthed, and any of us can discover more of them while working with it.