Data is a fairly new resource that becomes more important every day. Big data is a common and important keyword for every company that is growing and trying to succeed against its competitors. The company that makes the best use of its data may win this competition. For this you need software to collect, store, provide and analyse data. A search engine is a key tool for getting the needed data fast.
Build a search engine in just 3 days
Before we start this project we need to define a few things:
- you are not able to re-implement Google in 3 days
Everyone thinks of Google when they hear "search engine". But you use search engines every day and may not even know it. Think about an online shop: if you search Amazon for a product, you use a search engine.
Online shops use search engines to provide information about product stock that changes by the second. Think about the last item of a kind that was bought seconds before you entered your search. You do not want to get products that are not available!
- you need to define a small use case
For your first attempt you should define a really simple case that your data can actually support.
- you already have ideas about data sources
If you have an idea for your first project, you probably also have an idea of what data you want to search.
Over the next 3 days we will build a basic search engine that can be extended in later development cycles.
Day 1 – gathering information
The basis of every search engine is information. It is possible to research this information manually and transform it into a machine-readable format, but you will notice that this process is far too slow, time-consuming and in fact expensive.
We need to automate it. The problem is that there is a nearly unlimited number of formats that have to be read and transformed into one database format. HTML text is fairly easy to get and to read, but what about PDF files? For scanned documents we would need a good OCR algorithm.
What about data that is hard to reach at all, like data on floppy disks? On this first day we will focus on HTML text, the easiest form of information, because you can get it from the internet at any time.
A program that fetches HTML code and transforms it into database records is called a crawler. A basic crawler is able to separate plain text from HTML tags and valuable information from unnecessary words. The first simple step is to build a program that only stores nouns. Why? Because you need at least one noun to set up a useful search request. The simplest possible crawler performs the following steps:
- request an HTML page from a given link
- get all nouns from the plain text (not from the HTML tags!)
- store each noun, how often it is mentioned and its source (the link) in the database
- get all links from the HTML page and start again at the first step for each one
If you start this simple crawler, it will fill up your database with information. A simple PHP implementation of a web crawler can be found on the given link.
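The crawler steps above can be sketched in a few lines. This is only an illustration in Python, not the PHP implementation mentioned above. The names parse_page, crawl, STOP_WORDS and max_pages are assumptions made for this example. Real noun detection would need a part-of-speech tagger; as a rough stand-in, the sketch keeps every word that is not in a small, made-up stop word list.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical, tiny stop word list; a crude stand-in for real noun detection.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "for", "with"}


class PageParser(HTMLParser):
    """Collects visible text and link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def parse_page(html, base_url):
    """Return (keyword counts, links) for one HTML document."""
    parser = PageParser(base_url)
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts, parser.links


def crawl(start_url, max_pages=10):
    """Breadth-first crawl; yields (url, keyword counts) per fetched page."""
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip pages that cannot be fetched
        counts, links = parse_page(html, url)
        yield url, counts
        queue.extend(link for link in links if link.startswith("http"))
```

The max_pages limit is there because following every link without a bound would never stop. A real crawler should also respect robots.txt and wait between requests, which this sketch leaves out.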
Day 2 – data storage
Storing data is the next big topic. On day 2 we focus on creating a database. For this, we need to think about how the stored data will be queried. With this information we can speed up queries. Fast search leads to a better user experience for our search engine later. On day 1 we already created a crawler that stores information, but we need a database first!
Choosing a database system
We will use a MySQL database. This is an open source solution that nearly every web host offers. It typically stores data for web applications like WordPress, but we can also use it for our own implementation. MySQL is a good start: it is easy to use and there is lots of documentation and user-generated help on various websites. If you want to make a business out of your search engine, you can later switch to commercial solutions from Oracle or Microsoft.
We already created a simple crawler on day 1, so we need to add a table to store its information. For this we create a new database and one table; let's call it webdata. Next we need to add columns. A very basic set is:
- text for one keyword (parsed nouns)
- amount (how often we found that keyword)
- link (where we found this keyword)
- date (to improve our crawler later, we store a date for last scanned information – we may forget data after some time to keep this database up to date)
Now it is time to set up our crawler to crawl regularly or continuously and fill this empty table.
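The webdata table and the storing step could look roughly like the sketch below. To keep the example runnable without a database server it uses Python's built-in sqlite3 module as a stand-in for MySQL, so the SQL dialect differs slightly (for example, INSERT OR REPLACE is SQLite syntax). The column names, the primary key on (keyword, link) and the function names create_webdata and store_counts are assumptions made for this example.

```python
import sqlite3
from datetime import date


def create_webdata(conn):
    """Create the webdata table with the four basic columns."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS webdata (
            keyword TEXT NOT NULL,     -- one parsed noun
            amount  INTEGER NOT NULL,  -- how often it was found on the page
            link    TEXT NOT NULL,     -- where it was found
            scanned TEXT NOT NULL,     -- date of the last scan (ISO format)
            PRIMARY KEY (keyword, link)  -- assumption: one row per keyword and page
        )
        """
    )


def store_counts(conn, link, counts, scanned=None):
    """Insert or refresh the keyword counts for one crawled page."""
    scanned = scanned or date.today().isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO webdata (keyword, amount, link, scanned) "
        "VALUES (?, ?, ?, ?)",
        [(keyword, amount, link, scanned) for keyword, amount in counts.items()],
    )
    conn.commit()
```

Because of the assumed primary key, crawling the same page again replaces its old rows instead of adding duplicates, and the stored date shows which rows are stale.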
Day 3 – get information
Over the last 2 days we implemented a system that gets information from the web and stores it in a database. Now it is time to search for keywords and get an ordered list of links. Let's create a simple frontend with PHP. We need only 2 pages:
- search form
a simple form with a text box and a button. A user enters a keyword and presses the button to start the search.
- result page
after some seconds (or milliseconds) we get a page with an ordered list of links
The source code is simple. The HTML search form sends a POST request with the given keyword. The receiving page runs a simple SELECT on our database table. Be sure to order the result set by amount, highest first. On the result page we simply print the search keyword and a list of links, or a message if we found no results.
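The query behind the result page can be sketched as follows. Again this is Python with sqlite3 as a stand-in for the PHP and MySQL setup described in the text, and it leaves out the HTML form itself. The functions search, render_results and demo_connection, as well as the sample rows, are made up for this example.

```python
import sqlite3


def search(conn, keyword):
    """Return (link, amount) rows for one keyword, highest amount first."""
    rows = conn.execute(
        # The placeholder keeps user input out of the SQL string itself.
        "SELECT link, amount FROM webdata WHERE keyword = ? ORDER BY amount DESC",
        (keyword.strip().lower(),),
    )
    return rows.fetchall()


def render_results(keyword, results):
    """Build the plain-text body of the result page."""
    if not results:
        return f"No results for '{keyword}'."
    lines = [f"Results for '{keyword}':"]
    lines += [f"{link} ({amount})" for link, amount in results]
    return "\n".join(lines)


def demo_connection():
    """In-memory table with a few hypothetical rows, for trying things out."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE webdata (keyword TEXT, amount INTEGER, link TEXT, scanned TEXT)"
    )
    conn.executemany(
        "INSERT INTO webdata VALUES (?, ?, ?, ?)",
        [
            ("engine", 3, "http://a.example", "2024-01-01"),
            ("engine", 7, "http://b.example", "2024-01-01"),
            ("search", 2, "http://a.example", "2024-01-01"),
        ],
    )
    return conn


if __name__ == "__main__":
    connection = demo_connection()
    print(render_results("engine", search(connection, "engine")))
```

Whatever language the frontend is written in, the keyword should be passed to the database as a bound parameter, as done here, and never pasted into the SQL string.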
You have implemented a really simple search engine from scratch. It is not perfect (and not yet really useful), but it shows all the necessary parts. Now it is time to improve it. Some thoughts:
- we can only search for one keyword
we need to support combinations of multiple keywords and define stop words to improve the search
- fake sites can rank high
we only count a keyword and give a high rank to sites with a high count of that keyword. A website owner may simply write a given keyword a thousand times and get to number one. We need to implement validity checks that analyze the text and are able to separate good sites from bad sites
- gather more information
our crawler should get more useful information from a site, for example embedded media.
- improve database
create an index for columns (for example the keyword column) to improve search time
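Two of the improvements listed above, multiple keywords with stop words and an index on the keyword column, could be sketched like this. As before, this is Python with sqlite3 standing in for MySQL, so the exact index syntax may differ on your database. The stop word list, the index name idx_webdata_keyword and the choice to rank pages by the sum of their keyword counts are assumptions for this example, not the only possible design.

```python
import sqlite3

# Hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for"}


def add_keyword_index(conn):
    """Index the keyword column so lookups do not scan the whole table."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_webdata_keyword ON webdata (keyword)"
    )


def search_all(conn, query):
    """Return (link, score) for pages that contain every non-stop-word keyword."""
    # Lowercase, drop stop words, and remove duplicates while keeping order.
    keywords = list(
        dict.fromkeys(w for w in query.lower().split() if w not in STOP_WORDS)
    )
    if not keywords:
        return []
    placeholders = ", ".join("?" for _ in keywords)
    sql = (
        "SELECT link, SUM(amount) AS score FROM webdata "
        f"WHERE keyword IN ({placeholders}) "
        "GROUP BY link "
        "HAVING COUNT(DISTINCT keyword) = ? "  # page must match all keywords
        "ORDER BY score DESC"
    )
    return conn.execute(sql, (*keywords, len(keywords))).fetchall()
```

Summing raw counts is still easy to manipulate by repeating a keyword, so this does not solve the fake site problem; it only shows how several keywords can be combined in one query.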
Building a simple search engine is simple, but creating a useful, fast search tool may be very complicated and time-consuming. You have learned all the basics and are now able to improve your system as you wish.