Search Tool Data Analysis

by Paige Laytos (laytosplaytosp in BIT330, Fall 2008)

Questions and queries

Web search engines

As a Division I athlete, it is essential that I eat healthily. I am constantly on the go and want to be sure that I am getting a sufficient amount of nutrients from the food I eat. I had struggled with my eating habits, so over the summer I took part in a cleanse, Quantum Wellness, that advised me to eat fruit with a lower glycemic index. I was curious as to where peaches rank on the glycemic index.

Query: fruit glycemic index (used for all three searches)

Blog search engines

When I first started the Quantum Wellness cleanse, I wanted to do it as a challenge for myself. During the cleanse I really struggled with giving up certain items, sugar in particular. The cleanse calls for 21 days of no sugar, alcohol, gluten, animal products, or caffeine. Through my struggles with Quantum Wellness, I became curious about other people’s experiences with and thoughts about the cleanse.

Query: “Quantum Wellness” (used for all three searches)

Data that I collected

Search engine overlap data

Web search    Live    Google    Yahoo Web
Live          40      25        25
Google                70        25
Yahoo Web                       55
All           15

Blog search   Technorati   Google Blog   Bloglines
Technorati    20           0             10
Google Blog                35            10
Bloglines                                25
All           0
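
The percentages in the overlap tables above can be read as simple set arithmetic: each diagonal entry is one engine's share of relevant results, each off-diagonal entry is the share two engines have in common, and "All" is the share common to all three. The sketch below illustrates that reading. It assumes that 20 results were examined per engine and that overlap is counted over relevant results; neither is stated on this page, and the URLs are made up.

```python
# Hypothetical sketch: the sample size of 20 and the URLs are assumptions,
# not data taken from this page.
RESULTS_PER_ENGINE = 20


def percent(count, total=RESULTS_PER_ENGINE):
    """Express a count of results as a percentage of the results examined."""
    return 100 * count / total


def overlap_table(relevant):
    """Build the upper-triangular overlap table.

    `relevant` maps an engine name to the set of its relevant result URLs.
    """
    names = list(relevant)
    table = {}
    for i, a in enumerate(names):
        # Diagonal entry: precision of engine `a`.
        table[(a, a)] = percent(len(relevant[a]))
        # Off-diagonal entries: results shared with each later engine.
        for b in names[i + 1:]:
            table[(a, b)] = percent(len(relevant[a] & relevant[b]))
    # Results shared by every engine.
    table["All"] = percent(len(set.intersection(*relevant.values())))
    return table


# Made-up example: 8, 14, and 11 relevant URLs out of 20 per engine.
example = {
    "Live": {f"url{i}" for i in range(0, 8)},
    "Google": {f"url{i}" for i in range(3, 17)},
    "Yahoo Web": {f"url{i}" for i in range(5, 16)},
}
table = overlap_table(example)
for key, value in table.items():
    print(key, value)
```

Because every count is divided by 20, every percentage is a multiple of 5, which matches the shape of the numbers in the tables above.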

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.
GY          Yahoo
Google      5     10    20
5           0     2     1
10          0     2     3
20          1     4     5

This table provides a measure of how much of Yahoo's responses are reproduced by Google.
YG          Google
Yahoo       5     10    20
5           0     0     1
10          2     2     4
20          3     3     5

This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
BG          Google
Bloglines   5     10    20
5           1     1     1
10          2     2     2
20          2     2     2

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
GB          Bloglines
GBlog       5     10    20
5           1     2     2
10          1     2     2
20          1     2     2

Results

Web search

First Set of Data (overlap of search engines)

Web search
                      Precision               Overlap                 All
Statistical Measure   Live    Google  Yahoo   L/G     L/Y     G/Y     L/G/Y
Size                  19      19      19      19      19      19      19
Standard Deviation    22.1    19.5    21.8    9.3     11.2    7.7     7.3
Mean                  42.7    54.6    51.7    18.5    20.4    20.8    10.2
Max                   80      90      85      35      45      35      25
Min                   10      20      10      0       5       5       0
Median                42      57      52      20      20      20      10
Mode                  15      70      70      10      10      25      10
Variance              489.5   380.6   475.0   86.5    124.8   59.6    53.6

From this set of data, several conclusions can be drawn. Google has the smallest standard deviation (19.5), compared with 21.8 for Yahoo! and 22.1 for Live, which supports the claim that Google's precision doesn't vary as much across the class's queries as that of the other two. From the means, it is clear that the class found Google to be the most precise search engine (54.6%), followed by Yahoo! (51.7%) and Live (42.7%). In the "overlap" columns, G/Y has the highest mean (20.8) and the lowest standard deviation (7.7), so one could infer that Google and Yahoo! return relatively similar results and do so fairly consistently. Overall, an average of 10.2% of results appeared in all three search engines.
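
The statistical measures in the table can be reproduced with Python's standard `statistics` module. The sketch below is only an illustration: the sample values are made up, and the use of sample (n - 1) formulas is an assumption, since the page does not say which formula the class spreadsheet used. The second half uses the reported web search figures to check that each standard deviation is the square root of its variance, allowing for rounding.

```python
import math
import statistics


def summarize(values):
    """Compute the measures used in the tables (sample formulas assumed)."""
    return {
        "size": len(values),
        "std_dev": statistics.stdev(values),
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),
    }


# Made-up precision scores for five hypothetical students (not class data).
summary = summarize([10, 15, 15, 40, 70])

# (variance, standard deviation) pairs as reported in the web search table.
reported = {
    "Live": (489.5, 22.1),
    "Google": (380.6, 19.5),
    "Yahoo": (475.0, 21.8),
    "L/G": (86.5, 9.3),
    "L/Y": (124.8, 11.2),
    "G/Y": (59.6, 7.7),
    "L/G/Y": (53.6, 7.3),
}


def consistent(variance, std_dev, tolerance=0.05):
    """True if std_dev matches sqrt(variance) to one rounded decimal place."""
    return abs(math.sqrt(variance) - std_dev) <= tolerance


checks = {name: consistent(v, s) for name, (v, s) in reported.items()}
print(summary)
print(checks)
```

All seven reported pairs pass the check, so the Standard Deviation and Variance rows carry the same information in different units.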

Second Set of Data (overlap of rankings)

GY
Statistical Measure   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Size                  17        17        17        17        17        17        17        17        17
Standard Deviation    1.2       1.3       1.4       1.2       1.3       1.7       1.2       1.5       2.1
Mean                  1.1       1.4       1.6       1.3       2         2.6       1.6       2.5       3.7
Max                   4         4         4         4         4         5         4         5         7
Min                   0         0         0         0         0         0         0         0         0
Median                1         1         2         1         2         3         1         3         4
Mode                  1         0         0         1         1         4         1         3         5
Variance              1.4       1.7       2.0       1.5       1.8       3.0       1.5       2.4       4.5

YG
Statistical Measure   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Size                  17        17        17        17        17        17        17        17        17
Standard Deviation    1.2       1.3       1.4       1.2       1.4       1.6       1.3       1.7       2.1
Mean                  1.1       1.2       1.6       1.5       1.9       2.5       1.9       2.6       3.8
Max                   4         4         4         4         4         5         4         5         7
Min                   0         0         0         0         0         0         0         0         0
Median                1         1         1         1         2         3         2         3         4
Mode                  1         0         1         1         3         3         1         4         5
Variance              1.4       1.7       1.9       1.5       1.9       2.5       1.6       3.0       4.3

Expanding on the first set of data, this table provides information about how Google's and Yahoo!'s results overlap within the top 5, top 10, and top 20 positions. It compares the ranking of Google's results with Yahoo!'s and vice versa. For instance, in the GY section of the chart the mean for o(20,10) is 2.6, meaning that of the top 20 results in Google, an average of 2.6 appear in Yahoo!'s top 10 results. In the YG portion of the chart the mean for o(10,10) is 1.9, meaning that of the top 10 results in Yahoo!, an average of 1.9 appear in Google's top 10 results.
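
Following the reading of o(m,n) given above (how many of the top m results from the first engine appear in the top n results of the second), the measure can be sketched in a few lines. The function names and the ranked lists here are hypothetical, and the sketch assumes that a ranked list contains no duplicate URLs.

```python
def rank_overlap(a, b, m, n):
    """o(m, n): count of the top m results in `a` found in the top n of `b`."""
    top_b = set(b[:n])
    return sum(1 for url in a[:m] if url in top_b)


def overlap_grid(a, b, cutoffs=(5, 10, 20)):
    """Compute o(m, n) for every pair of cutoffs, as in the tables above."""
    return {(m, n): rank_overlap(a, b, m, n) for m in cutoffs for n in cutoffs}


# Made-up ranked lists: the last five "google" results are the first five
# "yahoo" results. These are placeholders, not results from any real search.
google = [f"u{i}" for i in range(0, 20)]
yahoo = [f"u{i}" for i in range(15, 35)]

gy = overlap_grid(google, yahoo)  # Google's results reproduced by Yahoo!
yg = overlap_grid(yahoo, google)  # Yahoo!'s results reproduced by Google

print(gy[(20, 5)], gy[(5, 5)], yg[(5, 20)])
```

Under this definition, o(n,n) is the same in both directions, because it is just the size of the intersection of the two top-n lists; only the mixed cutoffs such as o(20,10) depend on which engine comes first.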

Blog search

First Set of Data (overlap of search engines)

Blog search
                      Precision                       Overlap                 All
Statistical Measure   Technorati  GBlog   Bloglines   T/G     T/B     G/B     T/G/B
Size                  19          19      19          19      19      19      19
Standard Deviation    20.6        21.6    14.0        6.9     7.7     6.4     3.4
Mean                  33.4        52.6    44.6        3.9     9.5     7.2     1.6
Max                   85          100     75          25      25      20      10
Min                   5           25      20          0       0       0       0
Median                30          45      48          0       10      5       0
Mode                  35          40      50          0       5       5       0
Variance              425.1       464.9   194.8       48.2    58.7    40.6    11.3

Focusing on the precision portion of the chart, Google Blog is the most precise, with a mean of 52.6%. Bloglines came in second, with 44.6% of its results being relevant, and Technorati averaged 33.4%. Interestingly, in the "overlap" section the mean for T/B is 9.5, meaning that on average 9.5% of the relevant results found in Technorati were also found in Bloglines. In all, the share of relevant results retrieved by all three blog search engines was 1.6%, which is not very high.

Second Set of Data (overlap of rankings)

GB
Statistical Measure   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Size                  17        17        17        17        17        17        17        17        17
Standard Deviation    .47       .61       .62       .62       .72       1.0       .92       1.1       1.2
Mean                  .29       .35       .47       .41       .47       .82       .71       .76       1.1
Max                   1         2         2         2         2         3         3         4         4
Min                   0         0         0         0         0         0         0         0         0
Median                0         0         0         0         0         0         0         0         1
Mode                  0         0         0         0         0         0         0         0         0
Variance              .22       .37       .39       .38       .51       1.0       .85       1.2       1.4

BG
Statistical Measure   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Size                  17        17        17        17        17        17        17        17        17
Standard Deviation    .47       .61       .87       .62       .72       1.1       .62       .99       1.2
Mean                  .29       .35       .59       .41       .53       .82       .53       .88       1.1
Max                   1         2         3         2         2         4         2         3         4
Min                   0         0         0         0         0         0         0         0         0
Median                0         0         0         0         0         1         0         1         1
Mode                  0         0         0         0         0         0         0         0         1
Variance              .22       .37       .76       .38       .51       1.2       .39       .99       1.4

The first set of data can assist in interpreting this second chart. It should be highlighted that the means for the overlaps are very low; they rarely even reach 1. In the GB section, the mean for o(20,10) is .82. This is a count, not a percentage: of the top 20 results from Google Blog Search, fewer than one on average (.82) can be found in the top 10 of Bloglines. This reveals that the results from Google Blog Search are rarely found in Bloglines and vice versa, regardless of the order they are in.

Discussion

Web search

  • Meaning:

Taken together, the two sets of data make it clear that Google is the most precise and consistent of the three search engines, ahead of Live and Yahoo!. Although it may seem that Google and Yahoo! are the most alike, the second set of data shows they aren't as similar as you would first assume. Breaking down the numbers reveals that only 5 of the results are shared between the two.

  • Recommendations:

My recommendation to a person searching for information is to use all of your resources. There are many different search engines available, and they don't necessarily retrieve similar results. The more specific you make your query, the more likely you are to find what you are looking for.

  • Take Away:

What I was able to take away from this process was the importance of a smart query in finding relevant information. Before this, I never really took notice of the precision of a search engine. Precision provides very valuable information if you are able to understand what it is saying about your query.

  • Follow Up:

The only real methodological change I would make to this experiment would be to standardize the definition of "relevant." It is possible that results were distorted because understandings of relevancy varied from one student to another.

Blog search

  • Meaning:

Google Blog was again the winner in precision, but it is important to point out that Google Blog has a higher standard deviation (21.6) than the other two blog search engines (20.6 for Technorati and 14.0 for Bloglines). This means its results vary more. There didn't appear to be much consistency among the relevant results retrieved from the three blog search engines. It was shocking to find that the mean share of relevant results found on all three was as low as 1.6%. This carried over to the second set of data, which looked more closely at the overlap of Google Blog and Bloglines. There didn't appear to be much overlap of relevant results.

  • Recommendations:

My recommendation to a person searching for information is to search around. There are so many different blogging resources available that it would be hard not to find one that is relevant to the topic at hand. In my opinion, it is better to have an idea of where your topic falls within a broad array of general areas (e.g., business, fashion). From there it will be easier to narrow down your search.

  • Take Away:

What I was able to take away from this process was that blogs are a great place to ask questions about a particular topic. Blogs are a great resource when you are looking for information, but it is best to narrow in on a specific type of topic. Blog search engines provide a terrific gateway to the overwhelming world of blogging.

  • Follow Up:

The same problem with relevancy that was encountered in the web search could arise in the blog search. Relevancy is very subjective, and differing judgments could throw off the results a bit.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License