There are thousands of bots and web crawlers working the internet, but below is my list of the 10 most popular search engine user agents. If you browse the log files of your website, you will almost always see requests for a file called “robots.txt”. These are usually calls from search engines: their web crawlers, identified by their user agents, read the robots.txt file (hopefully you have one). They...
Crawler, Robot, Spider, Indexer, Bot
A Web crawler, sometimes called a spiderbot, is software that systematically browses the World Wide Web so that the visited pages can be stored and indexed (web spidering).
Search engines use Web crawling or spidering software to keep their web content and their indices of web content up to date.
With a robots.txt file, a webmaster can request that bots index only parts of a website.
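To make this concrete, here is a minimal sketch of how a crawler might check such rules with Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot` and the example.com URLs are made up for illustration; a real crawler would download the file from the site instead of using a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed user-agent name for this sketch.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")

print(allowed_public, allowed_private)
```

Keep in mind that robots.txt is only a request: a well-behaved bot honors it, but nothing technically forces a crawler to do so.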
A Web crawler starts with a list of URLs to visit, called the seeds.
Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so users can search more efficiently.
The archive of downloaded pages is known as the repository and is designed to store and manage the collection of web pages.
The repository is stored in such a way that it can be searched.
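The flow from seeds to repository can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `FAKE_WEB` is a made-up in-memory stand-in for the real web, and `crawl` and its `fetch` parameter are hypothetical names, not part of any library. A real crawler would do HTTP requests and parse HTML to find the links.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "https://example.com/": ("home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", ["https://example.com/b"]),
    "https://example.com/b": ("page b", ["https://example.com/"]),
}


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` must return (text, links) for a URL, or None if it is unavailable.
    """
    frontier = deque(seeds)  # URLs still to visit
    seen = set(seeds)        # URLs already queued, to avoid visiting twice
    repository = {}          # URL -> stored page text

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        result = fetch(url)
        if result is None:
            continue
        text, links = result
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


repository = crawl(["https://example.com/"], FAKE_WEB.get)
print(sorted(repository))
```

The `seen` set is what keeps the crawler from running in circles when pages link back to each other, as the three fake pages here do.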
What must a crawler pay attention to?
- Selection, which content (e.g. top-level domains) should be fetched.
- Following links, fetch only specific kinds of links.
- URL normalization, not all links are absolute, so relative links have to be resolved first.
- Crawl depth, down to which path level of a URL should be fetched.
- Focused crawling, fetch only content on specific topics.
- Re-visit policy, needed to keep your data fresh.
- Politeness, don’t hit a server with too much load.
- Identification, say who you are with your User-Agent.
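A few of the points above (URL normalization, crawl depth, politeness and identification) can be sketched with Python's standard library. All names here (`normalize`, `path_depth`, `seconds_to_wait`, `build_request`) and the `ExampleBot` user-agent string are assumptions made for this example, and the one-second delay is an arbitrary choice, not a standard.

```python
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import Request

# Hypothetical user-agent string; a real bot should point to its own info page.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"


def normalize(base_url: str, href: str) -> str:
    """Resolve a possibly relative link against its page and drop the #fragment."""
    absolute = urljoin(base_url, href)
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment


def path_depth(url: str) -> int:
    """Count the non-empty path segments of a URL."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def seconds_to_wait(last_request_time: float, now: float, min_delay: float = 1.0) -> float:
    """Politeness: how long to pause before hitting the same host again."""
    return max(0.0, min_delay - (now - last_request_time))


def build_request(url: str) -> Request:
    """Identification: attach the User-Agent header to the request."""
    return Request(url, headers={"User-Agent": USER_AGENT})


link = normalize("https://example.com/docs/index.html", "../about.html#team")
print(link, path_depth(link))
```

A crawler could, for example, skip every link whose `path_depth` exceeds a configured limit, and sleep for `seconds_to_wait(...)` before each request to the same host.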