Eat My Data: Google index


Tuesday, December 8, 2009

Google Realtime Search? No! Call it: Google realtime ticker (with a filter)!

You might have read the recent announcement of Google introducing realtime search, and you may remember my previous post about the likelihood of Google's infrastructure facing a major challenge with realtime search.

OK, today's announcement has not yet proven me wrong. What Google presented today is not search as Google itself would define it. It is only a realtime ticker with a filter applied to it. No relevance rating is added to the ticker.

Realtime search is only a challenge if Google wants to sort realtime posts by relevance. The current solution does not do that. For this solution, no central infrastructure is needed.
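To make the "ticker with a filter" idea concrete, here is a minimal Python sketch. It is an illustration only, not a description of what Google actually runs: the `ticker` function and the post fields (`id`, `text`) are hypothetical names. The point is that each post is checked on its own, in arrival order, so no shared ranking state is needed.

```python
from typing import Dict, Iterable, Iterator


def ticker(posts: Iterable[Dict], query: str) -> Iterator[Dict]:
    """Yield posts that contain every query term, in arrival order.

    Hypothetical sketch: there is no scoring and no sorting, only a filter.
    """
    terms = query.lower().split()
    for post in posts:
        text = post["text"].lower()
        if all(term in text for term in terms):
            yield post


# Made-up sample stream, oldest post first.
stream = [
    {"id": 1, "text": "Google announces realtime search"},
    {"id": 2, "text": "Lunch was great"},
    {"id": 3, "text": "Is realtime search from Google really search?"},
]

matches = [post["id"] for post in ticker(stream, "google realtime")]
print(matches)
```

Because the filter looks at one post at a time, it can run on any number of independent machines without coordination.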

However, if you would like to add a relevance factor, things are different. Computing the relevance of realtime updates would require a central infrastructure. When I use the term relevance, think of a mechanism that ranks often-followed tweets higher than others, where re-tweets push a result higher and where a simple tweet with no links and no follow-ups loses relevance very quickly. This would be realtime PageRank. And for this you need a central infrastructure different from the massively parallel Google server world. Let's wait and see.
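As a rough illustration of such a relevance mechanism, the Python sketch below combines follow-ups, re-tweets and links into a score that decays over time. Everything in it is an assumption made for the example: the function name `relevance`, the weights, the 300-second half-life, and the reading of "followed" as the number of follow-up messages. It is not a known Google or Twitter formula.

```python
import math


def relevance(follow_ups: int, retweets: int, has_link: bool, age_s: float) -> float:
    """Hypothetical time-decayed relevance score for a single tweet."""
    # Base score: re-tweets count more than follow-ups (assumed weights).
    base = 1.0 + math.log1p(follow_ups) + 2.0 * math.log1p(retweets)

    # A plain tweet loses half its score every 300 seconds (assumed value).
    half_life_s = 300.0
    if has_link:
        half_life_s *= 4.0  # tweets with links decay more slowly
    # Engagement stretches the half-life further.
    half_life_s *= 1.0 + math.log1p(follow_ups + retweets)

    return base * 0.5 ** (age_s / half_life_s)


plain = relevance(follow_ups=0, retweets=0, has_link=False, age_s=300)
popular = relevance(follow_ups=5, retweets=20, has_link=True, age_s=300)
print(plain, popular)
```

Note that the inputs (follow-up and re-tweet counts) change continuously and have to be aggregated across all messages, which is where the need for a central, quickly updated store comes from.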

Monday, November 30, 2009

Realtime Messages: Can Google include realtime data?

OK, we already pointed out the factors that will drive search results to be enriched with your social network data (remember: in our opinion this will be driven by smartphones, social address books and Universal Search). But what about realtime data? How can realtime data be included in general search results?

Realtime data is definitely becoming more and more important on the net. Just think about the news of the Hudson plane crash in January. At that time it was still newsworthy that Twitter was the first to report the accident, but now the new Google Chrome OS, the latest firmware update for your favorite device and so on are all naturally first reported and discussed on Twitter. Of course, realtime data can be found in many other places, such as Facebook, comments on blogs and so on. Thinking about it, I suddenly get the feeling that quite a substantial amount of new data on the web is entered as "realtime data" with a time component. This time component strongly influences the relevance of the data.

A lot is currently happening in the realtime search space. Please see TechCrunch and VentureBeat for excellent summaries of the current state of the art.

While the two challenges (social data and realtime data) are similar when it comes to presenting search results to the user and weighting the relevance of each result, realtime data is by nature much more complicated, because it deeply affects the infrastructure needed to process it.

Google's infrastructure is clearly an "offline" architecture. By offline we mean that updates to the Google search index are only included very slowly. The underlying reason is that this allows Google to scale its systems with massive numbers of rather small and cheap servers. This is normally called horizontal scaling, in contrast to vertical scaling, where you need big and expensive machines to which you add processors, storage and memory as necessary. Across Google's fleet of thousands of small servers, the index is replicated for better performance. This replication is not a "realtime" thing. It takes a significant amount of time. Realtime replication is usually very costly. Software architectures with realtime update capabilities tend to be developed for large-scale machines. So we have a natural contradiction between Google's way of computing (with massive numbers of small server machines) and the requirement of realtime updates for parts of the index.

A possible solution for Google would be to enrich the standard offline search results with realtime results produced by a new and different infrastructure. Most likely this infrastructure will be based on large, powerful and expensive servers, which might be a completely new world for Google. This is certainly possible for Google, but scaling might be the "real" challenge of the realtime search game.
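One way to picture this enrichment is a blending step that places a few fresh items from the realtime system in front of the slowly updated offline ranking. The Python sketch below is only a thought experiment: the function `blend`, the `slots` parameter and the sample data are hypothetical, and nothing here reflects how Google actually combines results.

```python
from typing import List, Tuple


def blend(offline: List[str], realtime: List[Tuple[str, float]], slots: int = 2) -> List[str]:
    """Put up to `slots` of the newest realtime URLs in front of the offline ranking.

    `offline` is an already ranked list of URLs from the slow index.
    `realtime` holds (url, timestamp) pairs from a separate, fast system.
    """
    newest_first = sorted(realtime, key=lambda item: item[1], reverse=True)
    seen = set(offline)
    picked: List[str] = []
    for url, _timestamp in newest_first:
        if len(picked) == slots:
            break
        if url not in seen:  # skip anything the offline index already returned
            picked.append(url)
            seen.add(url)
    return picked + offline


offline_results = ["a", "b"]
realtime_results = [("c", 10.0), ("a", 30.0), ("d", 20.0), ("e", 5.0)]
blended = blend(offline_results, realtime_results)
print(blended)
```

The two inputs can come from entirely separate systems, which matches the idea of keeping the offline index as it is and adding a different infrastructure next to it.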

Admittedly, we have described the world a bit simplistically here. As the world is neither black nor white, there are numerous new trends (e.g. virtualization) and technologies which blur the line between horizontal and vertical scaling and between offline and realtime architectures. Nevertheless, we see realtime search as a challenge for big old "offline" Google search.

OJ

Sunday, November 29, 2009

Social data: Extending search into the social networks.

In the previous post we pointed out that data from social networks is not sufficiently included in Google search results. If all your friends digg a certain web page or "like it" in facebook, it should be preferred in your personalized search results.
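As a small illustration of what "preferred in your personalized search results" could mean, the Python sketch below boosts each result by the number of times friends have shared its URL. The function `rerank`, the `weight` parameter and the sample scores are assumptions made up for this example, not an existing search API.

```python
from collections import Counter
from typing import List, Tuple


def rerank(results: List[Tuple[str, float]], friend_links: List[str], weight: float = 0.5) -> List[str]:
    """Re-order (url, base_score) results, boosting URLs shared by friends."""
    mentions = Counter(friend_links)  # how often each URL appears among friends

    def boosted(result: Tuple[str, float]) -> float:
        url, base_score = result
        return base_score * (1.0 + weight * mentions[url])

    return [url for url, _score in sorted(results, key=boosted, reverse=True)]


general_results = [("x", 1.0), ("y", 0.8)]
shared_by_friends = ["y", "y"]
personalized = rerank(general_results, shared_by_friends)
print(personalized)
```

With no friend activity, the original order is kept; two shares of "y" are enough here to move it above "x".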

Several different solutions are already trying to socialize search results (Digg, Wowd, Google Social Search). However, none of them takes the integration sufficiently far.

One solution could be to extend your personal index in a desktop search (e.g. Google Desktop Search) with your social network data. For this you would not need to give login and password information directly to Google, but only to a local application. Alternatively, an authorization mechanism could allow Google to read your personalized social data (this would be similar to Flickr and its way of authorizing partner sites).

More progress on integration is currently being made within a different trend: the wide adoption of smartphones. On new smartphones your address book is already linked to your social networks. Additionally, you will have Universal Search, which searches your contacts, applications, local files and the Internet from one search window. In my opinion the next step will be to add "social data" to this universal search. However, as you do not want your mobile to frequently scan the net and extensively process a personal search index (reducing your battery life), this will still require some remote application to produce the index. Whether you will trust the personal index to sit on a centralized server in the cloud or on your local machine remains to be seen.

Another problem is certainly how to mix the social search results into the general results. In this area, careful experiments with weighting and graphical positioning are currently being made.

What do you think: would social search results be relevant for you? Are you already using any specific standalone search for this (like a search inside Facebook)?

OJ

Sunday, November 22, 2009

Google: Where is the realtime and social data in your index?

TinyURL, Twitter and social networks pose a significant risk for Google. And even for Google THIS might not be easy to solve.

Increasing amounts of data are either part of realtime communication like Twitter or buried in social networks like Facebook. Neither of those data streams is included sufficiently in Google's search results.

However, this data is very relevant for our search results. News is now often revealed by Twitter messages, and traffic is directed by shortened URLs in those messages. All pages and news which are actively discussed by the Twitter community should be ranked higher in my search results. Similarly, we would want links which are mentioned frequently in our own social network to be ranked higher in any search we do.

Both streams of data are difficult for Google. The internal structure of the Google index does not cope very well with realtime data. Google's index is replicated across its vast number of servers, which means that any index change takes a significant amount of time to travel through Google's infrastructure until it is available to all searches across the world. The largely decentralized structure of Google's infrastructure with thousands of servers, once considered to be Google's core strength, might therefore become a major competitive disadvantage for Google.

Social networking data poses a different challenge. Google naturally does not have any access to the private data in social networks. While users will want this data to be included in their overall search results, they will be very reluctant to give Google access to it.

In my own behaviour I have already noticed a significant shift. Depending on the topic, I have started using alternative search engines like Twitter search and Facebook search.

With AdSense being the major cash cow, is Google doomed to fail? In the next two posts we will explore some principles of how Google might solve these problems.
