Search Tech Blog

Linklist you need if you want to build a search engine


Here is some background information on how exactly a search engine works. The links below shed light on what is hard to crack when you try to build your own web-crawler search engine from scratch:

Gigablast

This page is a bit outdated (2004), but here the developer Matt Wells describes in his own words all the steps the Gigablast search engine went through during its development:
http://www.gigablast.com/rants.html

Next, an interview with Matt Wells (Gigablast) that answers the question: “When it comes to competing in the search engine arena, is bigger always better?” It also covers some other interesting pitfalls you have to overcome when running your own search engine:
http://queue.acm.org/detail.cfm?id=988401

Most importantly, the spider/crawler search engine Gigablast (C/C++) has become an open source project hosted on GitHub:
https://github.com/gigablast/open-source-search-engine

Blog Posts

Yioop is the search engine from the open source search engine software project SeekQuarry. Its blog has information about this PHP search engine:
http://www.yioop.com/blog

Technical Theory

The links from the High Scalability blog are fairly interesting. The fourth one is the most technical, so if you only have time to read one, go with the fourth:
1. http://highscalability.com/blog/2008/10/13/challenges-from-large-scale-computing-at-google.html
2. http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html
3. http://highscalability.com/blog/2011/8/29/the-three-ages-of-google-batch-warehouse-instant.html
4. http://highscalability.com/blog/2012/4/25/the-anatomy-of-search-technology-blekkos-nosql-database.html
5. http://highscalability.com/blog/2013/1/28/duckduckgo-architecture-1-million-deep-searches-a-day-and-gr.html

An article from Twitter about indexing the full history of tweets. Of note is the information about sharding: because the data is linear in nature (it grows over time), they need a way to scale across time. Worth a look:
https://blog.twitter.com/2014/building-a-complete-tweet-index
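
To make the idea of scaling across time more concrete, here is a minimal Python sketch of time-based sharding. It is not Twitter's implementation. The shard boundaries and the function names `shard_for` and `shards_for_range` are assumptions made up for illustration. Each shard simply owns one time window, and a query only has to visit the shards whose windows overlap the requested range.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical shard boundaries: each shard holds posts from its start date
# up to (but not including) the next shard's start date. The last shard is
# open-ended and receives all newer posts.
SHARD_STARTS = [
    datetime(2012, 1, 1, tzinfo=timezone.utc),
    datetime(2013, 1, 1, tzinfo=timezone.utc),
    datetime(2014, 1, 1, tzinfo=timezone.utc),
]


def shard_for(timestamp: datetime) -> int:
    """Return the index of the time shard that should hold this timestamp."""
    index = bisect_right(SHARD_STARTS, timestamp) - 1
    if index < 0:
        raise ValueError("timestamp is older than the first shard")
    return index


def shards_for_range(start: datetime, end: datetime) -> list[int]:
    """Return every shard a query over the range start..end has to visit."""
    return list(range(shard_for(start), shard_for(end) + 1))
```

With this layout, growing over time means appending a new start date to the list, while older shards stay untouched.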

A talk about the internals of Lucene. It covers some design decisions and shows Lucene's internal architecture:
http://lucene.sourceforge.net/talks/pisa/

Not as technical as the above, but a good primer that covers quite a lot of history. Worth a read:
http://alexmiller.com/the-students-guide-to-search-engines/

“Write an Internet search engine with 200 lines of Ruby code”. It is all about writing a small-scale internet search engine in Ruby. The code covers crawling as well as indexing into MySQL:
http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/
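
The crawling half of such a small engine can be sketched in a few lines. The following is a rough Python illustration, not the Ruby code from the article: it does a breadth-first crawl with a frontier queue and a set of already seen URLs. The `PAGES` dictionary is a hypothetical in-memory "web" so that the sketch runs offline. In a real crawler the `fetch` argument would download the page over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="http://example.test/c">C</a>',
}


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    frontier = deque([seed])  # URLs waiting to be fetched
    seen = {seed}             # URLs already queued, to avoid loops
    visited = []              # URLs fetched successfully, in crawl order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched: skip it
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited
```

Calling `crawl("http://example.test/", PAGES.get)` walks the three sample pages. A real crawler would additionally need politeness delays, robots.txt handling and URL normalization, which is where most of the difficulty lies.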

The Difficulties

This blog has dedicated a separate page to it: perhaps the most famous post on the topic, with the exception of the original Google paper. It was written by Anna Patterson, who was a developer of the search engines at Cuil and Archive.org. It highlights the difficulties of developing and running a search engine, from crawling to indexing to delivering the ranked search engine results page:
https://www.suchmaschine.biz/writing-your-own-search-engine/

Next is a humorously written article that still contains many truths. You think developing a search engine is easy? Then you should definitely read this article:
http://www.ideaeng.com/write-search-engine-0402

Ranking Algorithm

Algolia is a search-as-a-service solution provider. In 2014, Nicolas Dessaigne (Co-founder & CEO at Algolia) published a blog post about the ranking algorithm they use:
http://blog.algolia.com/search-ranking-algorithm-unveiled/

Google

The classic among search engine technology papers: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. From today’s point of view it is more than outdated, but it describes how the first version of Google was designed and written:
http://infolab.stanford.edu/~backrub/google.html

Some archive.org Links

ProCog was a blog by Matt Wells (Gigablast) that never had much content. Unfortunately the blog was shut down and the content is gone. Matt really knows his stuff and promotes an open ranking algorithm, which is why I put an archive link to the old content here:
https://web.archive.org/web/20130606014958/http://blog.procog.com:80/

On the page “The Banana Tree” you can find a few articles about the complete design of a search engine from scratch. The site is very old, but some of the content is worth reading. For this reason I have also added an archive link here:
https://web.archive.org/web/20150214015110/http://www.thebananatree.org:80/

The blekko technology and team have joined IBM Watson! However, the old Blekko engineering blog had some interesting material applicable to search engine development:
https://web.archive.org/web/20150315043329/http://blog.blekko.com:80/

Ben Boyter

Ben Boyter wrote this great series of blog posts on how to write a search engine in PHP that works well with 1 million pages:
http://www.boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/

The complete list above was kindly provided to me by Ben Boyter, the developer and founder of SearchCodeServer.com.

My Additions to This Linklist

This four-part blog post series shows how to implement an actual search engine with working code in Python. It provides detailed articles about creating the index, querying the index and ranking the results:
http://www.ardendertat.com/2012/01/11/implementing-search-engines/
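
The three steps named above (creating the index, querying it and ranking the results) fit into a very small Python sketch. This is not the code from the linked series. It is a minimal illustration that assumes a plain inverted index and a simple TF-IDF score, and the sample documents in `DOCS` as well as all function names are made up.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample documents, used only for illustration.
DOCS = {
    1: "web crawler fetches pages",
    2: "the index maps terms to pages",
    3: "a crawler needs another crawler for ranking pages",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Create the inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, num_docs, query):
    """Return doc ids ordered by summed TF-IDF over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        # Rare terms get a higher weight than terms found everywhere.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]
```

For example, `search(build_index(DOCS), len(DOCS), "crawler")` ranks document 3 ahead of document 1, because the term occurs twice there. Production engines use more refined scoring and keep the index on disk, but the basic shape is the same.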

A blog post by Deangela Neves on medium.com about developing a search engine for TED talks. The TED finder is an open source search engine developed in Python:
https://medium.com/@deangelaneves/how-to-build-a-search-engine-from-scratch-in-python-part-1-96eb240f9ecb

This is a very detailed Quora post by David Quaid (PrimaryPosition.com): “How do you build a search engine from scratch? What’s the best technology stack for this?”. A must-read:
https://www.quora.com/How-do-you-build-a-search-engine-from-scratch-What’s-the-best-technology-stack-for-this/answers/13752046

Above all, many interesting blog posts about crawling the web come from Jim Mischel (programmer and author), so I decided to link to the web-crawling category of his blog, where you can take a look at all the nice posts:
http://blog.mischel.com/category/web-crawling/

A nice short post on Werner Ziegelwanger's developer blog about “Build your own search engine”:
https://developer-blog.net/en/build-your-own-search-engine/

Linklist Feedback

Do you have another link that I have missed in my Linklist?
Please drop me a line.
I would love to hear from you.
Please add a link in the comments below.


About the author

I. Gaffling

I would like to introduce myself: my name is Igor Gaffling, I was born in 1968 and have more than 30 years of experience in the IT and new-media industry. In this blog I write about how search engines work, facts, ideas, code experiments and the possibility of developing a simple search engine from scratch that can handle a few million entries at an acceptable speed.
