Search Tech Blog

Stop-Word List


What is a stop-word list, and what are the advantages of removing stop words?

Stop words are extremely common words

A stop word is a word that carries little information content on its own, such as “and”, “the”, or “www”. In English, both spellings “stopword” and “stop word” are in use. Such words occur very often, but they contribute almost nothing to telling one document from another during a search.

Advantages of removing Stop Words

  • First, the content is reduced to the essentials.
  • Second, this saves disk space in your data store.
  • Third, it makes it easier to rank the content by relevance.
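
To make the first point concrete, here is a minimal sketch of stop-word removal in Python. The tiny `STOP_WORDS` set and the helper names `tokenize` and `remove_stop_words` are assumptions made for this illustration; a real list would be much longer.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in the stop-word set."""
    return [token for token in tokenize(text) if token not in stop_words]


print(remove_stop_words("The history of the search engine"))
# ['history', 'search', 'engine']
```

Six tokens shrink to three, and the three that remain are the ones a searcher would actually type.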

Search without Stop-Words

In theory, a full-text search indexes every word. Stop words are the exception: the index should not be inflated with entries for words that carry little meaning. Search engines that filter stop words therefore ignore these words, along with punctuation marks other than Boolean operators. Which words can safely be ignored is determined by stop-word lists, which can be extended at any time. Besides articles, conjunctions, and prepositions, the lists may also contain very frequent pronouns, verbs, and adjectives. Abbreviations such as www, http, or com are treated as stop words by many search engines as well. Because stop words are skipped at indexing time, they never reach the index, which saves space in the database and speeds up the search query.
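
As a rough illustration of skipping stop words at indexing time, the following sketch builds a small inverted index in Python. The function name `build_index`, the sample documents, and the short stop-word set are all assumptions made for this example, not part of any particular search engine.

```python
import re
from collections import defaultdict

# Illustrative stop-word set, including the abbreviations mentioned above.
STOP_WORDS = {"a", "and", "com", "http", "is", "of", "the", "www"}


def build_index(docs, stop_words=STOP_WORDS):
    """Map each non-stop-word term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in stop_words:  # stop words never reach the index
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "The index of the library",
    2: "A library and the web",
}
index = build_index(docs)
print(sorted(index))     # ['index', 'library', 'web']
print(index["library"])  # [1, 2]
```

Ten tokens go in, but only three distinct terms end up in the index.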

Search with Stop-Words

If you want stop words to be taken into account, you can use a so-called phrase search: put the search phrase in quotation marks so that it is matched as a whole, stop words included. Alternatively, in engines that support it, the search terms can be linked with the plus sign so that each of them is treated as required.
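
A query parser could honor both conventions roughly as follows. This is only a sketch: the function name `parse_query`, the returned dictionary layout, and the reading of `+term` as "required term" are assumptions for illustration, and real engines differ in their operator syntax.

```python
import re

# Illustrative stop-word set (an assumption for this sketch).
STOP_WORDS = {"a", "and", "be", "not", "of", "or", "the", "to", "who"}


def parse_query(query, stop_words=STOP_WORDS):
    """Split a query into quoted phrases, +required terms, and plain terms.

    Stop words are dropped only from the plain terms; phrases and
    +terms keep them.
    """
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]*"', " ", query)  # remove the quoted parts
    required, plain = [], []
    for token in rest.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token not in stop_words:
            plain.append(token)
    return {"phrases": phrases, "required": required, "plain": plain}


print(parse_query('"to be or not to be" +the quotes of hamlet'))
# {'phrases': ['to be or not to be'], 'required': ['the'], 'plain': ['quotes', 'hamlet']}
```

The quoted phrase consists entirely of stop words and still survives, while the unquoted “of” is dropped.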

Special Cases

Proper names that contain stop words are difficult. The English article “the” is a stop word, but it is also an essential part of the band name “The Who”. If it is removed, only “who” remains, which is itself on most stop-word lists, so the query can lose its meaning entirely.

To avoid this problem, Google uses an advanced stop-word detection that, according to the patent linked below, works as follows:

  • Stop words are identified from lists, as usual.
  • Two search queries are generated: one with and one without the detected stop words.
  • Search results are retrieved for both queries.
  • The two result sets are compared.
  • If the results are the same or similar, the removed terms are insignificant stop words. If the results differ, the stop words carry meaning and are kept.

In this way, Google can avoid removing search terms that play an important role in the evaluation of search queries.
https://patents.google.com/patent/US7945579
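
The comparison step can be sketched in Python as follows. This is not Google's implementation: the Jaccard overlap as the similarity measure, the `threshold` of 0.5, the rule for queries made up only of stop words, and the canned results in `toy_search` are all assumptions chosen to keep the example small.

```python
# Illustrative stop-word set (an assumption for this sketch).
STOP_WORDS = {"a", "of", "the", "who"}


def jaccard(a, b):
    """Overlap of two result sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stop_words_matter(query, search, stop_words=STOP_WORDS, threshold=0.5):
    """Return True if dropping the stop words changes the results noticeably.

    `search` is a hypothetical callable that maps a list of terms to a
    list of document ids.
    """
    terms = query.lower().split()
    stripped = [t for t in terms if t not in stop_words]
    if stripped == terms:
        return False  # the query contains no stop words
    if not stripped:
        return True   # assumption: a query of only stop words keeps them
    return jaccard(search(terms), search(stripped)) < threshold


# Made-up results standing in for a real search backend.
CANNED_RESULTS = {
    ("the", "who", "tour"): [1, 2, 3],
    ("tour",): [7, 8, 9, 3],
    ("history", "of", "the", "telephone"): [4, 5, 6],
    ("history", "telephone"): [4, 5, 6],
}


def toy_search(terms):
    return CANNED_RESULTS.get(tuple(terms), [])


print(stop_words_matter("the who tour", toy_search))              # True
print(stop_words_matter("history of the telephone", toy_search))  # False
```

In the first query the two result lists share only one document out of six, so the stop words are kept. In the second they are identical, so the stop words can be dropped.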

Jeff Atwood, co-founder of Stack Overflow, ran the following stop-word experiment in 2004.

Over the course of a week, he searched Google for each word in a dictionary of ~110k English words and recorded how many hits each one returned. This was probably a massive violation of the Google terms of service, but he tried to keep it polite and low-impact: he used gzip-compressed HTTP requests, requested only 10 results per query (all he needed was the hit count), and added a healthy delay between queries so that he wasn’t querying too rapidly. He is not sure this kind of experiment would fly against today’s Google, but it worked in 2004. He ended up with a MySQL database of 110,000 English words and their frequency in Google as of late summer 2004.

Most used words in Google (52)

the          522,000,000
of           515,000,000
and          508,000,000
to           507,000,000
in           479,000,000
for          468,000,000
internet     429,000,000
on           401,000,000
home         370,000,000
is           368,000,000
by           366,000,000
all          352,000,000
this         341,000,000
with         338,000,000
services     329,000,000
about        319,000,000
or           317,000,000
at           316,000,000
email        311,000,000
from         308,000,000
are          306,000,000
website      302,000,000
us           301,000,000
site         283,000,000
sites        279,000,000
you          276,000,000
information  276,000,000
contact      274,000,000
more         271,000,000
an           271,000,000
search       269,000,000
new          269,000,000
that         267,000,000
your         262,000,000
it           261,000,000
be           258,000,000
prices       258,000,000
as           255,000,000
page         246,000,000
hotels       240,000,000
products     234,000,000
other        222,000,000
have         219,000,000
web          219,000,000
copyright    218,000,000
download     218,000,000
not          214,000,000
can          209,000,000
reviews      209,000,000
our          206,000,000
use          205,000,000
women        200,000,000

Example of an English Stop-Word List (153)

a
about
above
after
again
against
all
am
an
and
any
are
as
at
be
because
been
before
being
below
between
both
but
by
could
did
do
does
doing
down
during
each
few
for
from
further
had
has
have
having
he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
I
I’d
I’ll
I’m
I’ve
if
in
into
is
it
it’s
its
itself
let’s
me
more
most
my
myself
nor
of
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
she
she’d
she’ll
she’s
should
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
was
we
we’d
we’ll
we’re
we’ve
were
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
would
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves

Download stop-word lists in several languages

https://github.com/gaffling/stopwords/
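
Once a list has been downloaded, it could be loaded roughly like this. The sketch assumes a plain-text file with one word per line, encoded as UTF-8; the file name `stopwords-en.txt` is hypothetical, and the actual layout of the files in the repository has not been checked here. Typographic apostrophes (as in “I’m” in the list above) are normalized to plain ones so that they match typical user input.

```python
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Read one stop word per line, ignoring blank lines (assumed file format)."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Lowercase and replace the typographic apostrophe with a plain one.
        word = line.strip().lower().replace("\u2019", "'")
        if word:
            words.add(word)
    return words


# Demo with a temporary file standing in for a downloaded list.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "stopwords-en.txt"  # hypothetical file name
    sample.write_text("a\nabout\nI\u2019m\n\nthe\n", encoding="utf-8")
    print(sorted(load_stop_words(sample)))
    # ['a', 'about', "i'm", 'the']
```

Keeping the words in a set makes the membership test during indexing a constant-time lookup.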


About the author

I. Gaffling

My name is Igor Gaffling. I was born in 1968 and have more than 30 years of experience in the IT and new-media industry. In this blog I write about how search engines work: facts, ideas, code experiments, and the possibility of developing a simple search engine from scratch that can handle a few million entries at an acceptable speed.
