Eat My Data: 2009

Sunday, December 20, 2009

Google Realtime Search

We were talking about the difficulties of creating a realtime search and how Google introduced a ticker in an attempt to solve this. You can find an interesting article about this here: Google Search RIP

Tuesday, December 8, 2009

Picture Processing: Google Goggles Update

Have you seen Google Goggles? This is even slightly more disturbing than our previous post. Somebody taking pictures of you with Google Goggles might get your name displayed directly on her mobile.
However, Goggles cannot do this quite yet.

Any guess how long it will take? I would say no longer than one or two years...
OJ

Google Realtime Search? No! Call it: Google realtime ticker (with a filter)!

You might have read the recent announcement of Google introducing realtime search and remember my previous post about the likelihood of Google's infrastructure facing a major challenge with realtime search.

OK, today's announcement has not yet proven me wrong. What Google presented today is not search as Google itself would define it. It is only a realtime ticker with a filter applied to it. No relevance rating is added to the ticker.

Realtime search is only a challenge if Google wants to sort realtime posts by relevance. The current solution does not do that. For this solution no central infrastructure is needed.

However, if you would like to add some relevance factor to it, things are different. Computing the relevance of realtime updates would require a central infrastructure. When I use the term relevance, think of a mechanism that ranks often-followed tweets higher than others, where re-tweets push a result higher, and where a simple tweet with no links and no follow-ups loses relevance very quickly. This would be realtime PageRank. And for this you need a central infrastructure different from the massively parallel Google server world. Let's wait and see.
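To make the idea of a "realtime page-rank" a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the mechanism described above, not anything Google has published: the function name, the weights and the half-life values are all assumptions, and "often followed" is interpreted here as the follower count of the author.

```python
import math

def realtime_relevance(followers, retweets, has_link, replies, age_seconds,
                       base_half_life=600.0):
    """Toy relevance score for one realtime post. All weights are assumptions."""
    # Posts by often-followed authors and posts that get re-tweeted start higher.
    popularity = math.log1p(followers) + 2.0 * math.log1p(retweets)
    # A bare post with no link and no follow-ups keeps the short base half-life,
    # so its relevance degrades fastest.
    half_life = base_half_life
    if has_link:
        half_life *= 2.0
    if replies > 0:
        half_life *= 2.0
    # Exponential decay: the score halves every `half_life` seconds.
    decay = 0.5 ** (age_seconds / half_life)
    return popularity * decay

# Two posts of the same age (10 minutes) from the same author:
bare = realtime_relevance(followers=100, retweets=0, has_link=False,
                          replies=0, age_seconds=600)
discussed = realtime_relevance(followers=100, retweets=25, has_link=True,
                               replies=3, age_seconds=600)
print(bare < discussed)
```

Note that the inputs (re-tweet and reply counts) are global numbers that change every second. Keeping them current for every post is exactly the part that seems to call for some central bookkeeping.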

Sunday, December 6, 2009

Picture Processing: or John Q. Public kissing in Hawaii

I guess we all have realized that something is happening in the world of picture processing.
Services like Photosynth and Polar Rose and even Automatic Photo Tagging illustrate this trend. Not to forget the face recognition capabilities of Picasa and iPhoto.

What we see is basically that the computer starts to "understand" the content of pictures and its relation to the real world. We will not go into details about the mechanisms, but the idea is easy to grasp. Just think of large amounts of publicly tagged photos, add cheap server processing power and online storage, add photo comparison and finally some recognition algorithms for faces, buildings and so on.

Let's explore what this will mean for users and bystanders in the future by thinking it through a bit:

We can safely assume that all faces in all public pictures will at some point be tagged with the real person's name. Yes, even your name. As is typical on the Internet, this tagging will not be 100% reliable, but a fair amount of the data will be correct. Even if you think you can avoid this, it will not help in the long run. Somebody somewhere will put a picture of you online and tag your face with your name. And once this information is in the wild, it can be used as a reference for all the other pictures of you.

I guess you might have known this. But extend the thought a bit. After face recognition comes building recognition (sorry, I do not have a link yet; however, it is possible and similar to Photosynth). The buildings in your pictures will be recognized and automatically geotagged. Other recognition algorithms will follow. (How difficult can it be to detect whether two faces are kissing each other?)

So let's put all this together in a single use case in the near future:

John Q. Public is on a holiday trip in Hawaii.
Somebody takes a random photo with his mobile that shows John in the background giving a goodbye kiss to his traveling acquaintance.
This photo is uploaded a day later to a public photo page.
Somebody will automatically detect John and tag this picture with "Hawaii" (the airport building), "John" (face recognition) and "kissing" (new algorithm)
As the Internet never really forgets something, the picture now captures an eternal moment of John.

So you will stop kissing at airports from now on? It might be too late, your last goodbye could already be online... and it will resurface when you least expect it.

Who will do all this tagging and analysis? That's easy, don't you remember Google's mission? (Google's mission: to organize the world's information and make it universally accessible and useful.) You will be able to search and find poor John with "John Q. Public"+"kissing"+"Hawaii".
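Once the tags exist, the search itself is the easy part. The following Python sketch shows how a combined query like the one above could be answered with a simple inverted index from tags to photos. The photo ids, the tag lists and the function names are made up for illustration; a real photo search would of course be far more elaborate.

```python
from collections import defaultdict

def build_tag_index(photos):
    """Map each lower-cased tag to the set of photo ids carrying it."""
    index = defaultdict(set)
    for photo_id, tags in photos.items():
        for tag in tags:
            index[tag.lower()].add(photo_id)
    return index

def search(index, *tags):
    """Return the ids of photos that carry ALL of the given tags."""
    result = None
    for tag in tags:
        ids = index.get(tag.lower(), set())
        # Intersect step by step; copy the first set so the index is not modified.
        result = set(ids) if result is None else result & ids
    return result if result is not None else set()

# Hypothetical photos with automatically generated tags.
photos = {
    "img_001": ["John Q. Public", "kissing", "Hawaii"],
    "img_002": ["John Q. Public", "Hawaii"],
    "img_003": ["kissing", "Paris"],
}
index = build_tag_index(photos)
hits = search(index, "John Q. Public", "kissing", "Hawaii")
print(hits)
```

Every additional recognition algorithm simply adds more tags to the index, which is why each new algorithm makes John a little easier to find.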

You feel a bit of pity for John? Maybe this will result in a general increase in tolerance. Everybody might have his eternal moments online! So nobody can point a finger at anybody else.

What do you think? Is this story too absurd or did we hit some points?
Please comment!

OJ

Monday, November 30, 2009

Realtime Messages: Can Google include realtime data?

OK, we already pointed out the factors that will drive search results to be enriched with your social network data (remember: in our opinion this will be driven by smartphones, social address books and Universal Search). But what about realtime data? How can realtime data be included in general search results?

Realtime data is definitely getting more and more important on the net. Just think about the news of the Hudson plane crash in January. At that time it was still newsworthy that Twitter was the first to report the accident, but now the new Google Chrome OS, the latest firmware update for your favorite device and so on are all naturally first reported and discussed on Twitter. Certainly realtime data can be found in many other places, like Facebook, comments in blogs and so on. Thinking about it, I suddenly get the feeling that quite a substantial amount of new data on the web is entered as "realtime data" with a time component. This time component strongly influences the relevance of the data.

Lots of action is currently happening in this realtime search space. Please see TechCrunch and VentureBeat for excellent summaries of the current state of the art.

While the two kinds of data (social data and realtime data) pose a similar challenge in presenting the search results to the user and weighting the relevance of each result, realtime data is by nature much more complicated. It deeply affects the infrastructure needed to process it.

Google's infrastructure is clearly an "offline" architecture. By offline we mean that updates to the Google search index are only included very slowly. The underlying reason is that this allows Google to scale its systems with massive numbers of rather small and cheap servers. This is normally called horizontal scaling, in contrast to vertical scaling, where you need big and expensive machines to which you add processors, storage and memory as necessary. In Google's fleet of thousands of small servers, the index is replicated for better performance. This replication is not a "realtime" thing. It takes a significant amount of time. Realtime replication is usually very costly. Software architectures with realtime update capabilities tend to be developed for large-scale machines. So we have a natural contradiction between Google's way of computing (with massive amounts of small server machines) and the requirement of realtime updates for parts of the index.
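The effect of slow replication can be shown with a toy model in Python. This is a deliberately simplified sketch and not a description of Google's real systems: the classes `Replica` and `ReplicatedIndex`, the batch mechanism and the example data are all assumptions made for illustration.

```python
class Replica:
    """One small server holding a full copy of the index."""

    def __init__(self):
        self.index = {}      # term -> set of document ids
        self.pending = []    # updates received but not yet applied

    def receive(self, term, doc_id):
        # Updates are only queued; they become searchable after the next batch.
        self.pending.append((term, doc_id))

    def apply_batch(self):
        for term, doc_id in self.pending:
            self.index.setdefault(term, set()).add(doc_id)
        self.pending.clear()

    def search(self, term):
        return set(self.index.get(term, set()))


class ReplicatedIndex:
    """A fleet of replicas; every update is sent to all of them."""

    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]

    def publish(self, term, doc_id):
        for replica in self.replicas:
            replica.receive(term, doc_id)

    def visible_on(self, term, doc_id):
        """Count the replicas on which the document can already be found."""
        return sum(doc_id in r.search(term) for r in self.replicas)


cluster = ReplicatedIndex(replica_count=4)
cluster.publish("hudson", "tweet-1")
before = cluster.visible_on("hudson", "tweet-1")   # nothing applied yet

cluster.replicas[0].apply_batch()
cluster.replicas[1].apply_batch()
partial = cluster.visible_on("hudson", "tweet-1")  # only some replicas caught up

for replica in cluster.replicas:
    replica.apply_batch()
after = cluster.visible_on("hudson", "tweet-1")    # all replicas caught up
print(before, partial, after)
```

Between `publish` and the last `apply_batch`, whether a user finds the fresh message depends on which replica happens to answer the query. Shrinking that window across thousands of machines is the costly part.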

A possible solution for Google would be to enrich the standard offline search results with realtime results produced by a new and different infrastructure. Most likely this infrastructure will be based on large, powerful and expensive servers, which might be a completely new world for Google. Certainly this is possible for Google, but scaling might be the "real" challenge of the realtime search game.

Certainly we have described the world a bit simplistically here. As the world is neither black nor white, there are numerous new trends (e.g. virtualization) and technologies which blur the line between horizontal and vertical scaling and between offline and realtime architectures. Nevertheless we see realtime search as a challenge for big old "offline" Google search.

OJ

Sunday, November 29, 2009

Social data: Extending search into the social networks.

In the previous post we pointed out that data from social networks is not sufficiently included in Google search results. If all your friends digg a certain web page or "like" it on Facebook, it should be preferred in your personalized search results.

Several different solutions are already trying to socialize search results (Digg, Wowd, Google Social Search). However, none of them takes the integration far enough.

One solution could be to extend your personal index in a desktop search (e.g. Google Desktop Search) with your social network data. For this you would not need to give login and password information directly to Google, only to a local application. Alternatively, an authorization mechanism could allow Google to read your personalized social data (this would be similar to Flickr and its way of authorizing partner sites).

More progress in the integration is currently being made within a different trend: the wide adoption of smartphones. On new smartphones your address book is already linked to your social networks. Additionally you will have Universal Search, which searches your contacts, applications, local files and the Internet from one search window. In my opinion the next step will be to add "social data" to this universal search. However, as you do not want your mobile to frequently scan the net and extensively process a personal search index (reducing your battery life), this will still require some remote application to produce the index. Whether you will trust the personal index to sit on a centralized server in the cloud or on your local machine remains to be seen.

Another problem is certainly how to mix the social search results into the general results. In this area careful experiments with weighting and graphical positioning are currently being made.
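One very simple way to think about this mixing is a weighted blend of the general score and a social score. The Python sketch below assumes that each page already has a general relevance score between 0 and 1 and that we know how many of your friends liked it. The function name, the weight of 0.3 and the example URLs are purely illustrative assumptions.

```python
def blend_results(general_scores, friend_likes, social_weight=0.3):
    """Re-rank pages by mixing a general score with a social score.

    general_scores: url -> general relevance in the range 0..1
    friend_likes:   url -> number of friends who liked or dugg the page
    social_weight:  share of the final score that comes from the social signal
    """
    max_likes = max(friend_likes.values(), default=0)
    ranked = []
    for url, score in general_scores.items():
        likes = friend_likes.get(url, 0)
        # Normalise likes to 0..1 so both signals are on the same scale.
        social = likes / max_likes if max_likes else 0.0
        combined = (1.0 - social_weight) * score + social_weight * social
        ranked.append((url, combined))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked]

general = {"a.example": 0.9, "b.example": 0.8, "c.example": 0.4}
likes = {"b.example": 5, "c.example": 1}

print(blend_results(general, likes, social_weight=0.0))  # plain general ranking
print(blend_results(general, likes))                     # friends push b.example up
```

The interesting experiments are then about the value of `social_weight`: too low and your friends make no difference, too high and you only ever see what your friends already found.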

What do you think, would social search results be relevant for you? Are you already using a specific standalone search for this (like the search inside Facebook)?

OJ

Sunday, November 22, 2009

Google: Where is the realtime and social data in your index?

TinyURL, Twitter and social networks pose a significant risk for Google. And even for Google THIS might not be easy to solve.

Increasing amounts of data are either part of realtime communication like Twitter or buried in social networks like Facebook. Neither of these data streams is sufficiently included in Google's search results.

However, this data is very relevant for our search results. News is now often broken by Twitter messages, and traffic is directed by shortened URLs in those messages. All pages and news which are actively discussed by the Twitter community should be ranked higher in my search results. Similarly, we would want links which are mentioned frequently in our own social network to be ranked higher in any search we do.

Both streams of data are difficult for Google. The internal structure of the Google index does not cope very well with realtime data. Google's index is replicated across its vast number of servers, which means that any index change takes a significant time to travel through Google's infrastructure until it is available to all searches across the world. The largely decentralized structure of Google's infrastructure with thousands of servers, once considered to be Google's core strength, might therefore become a major competitive disadvantage for Google.

Social networking data poses a different challenge. Google naturally does not have any access to the private data in social networks. While users will want this data to be included in their overall search results they will be very reluctant to give Google access to this data.

In my own behaviour I have already noticed a significant shift. Depending on the topic, I have started using alternative search engines like Twitter search and Facebook search.

With AdSense being the major cash cow, is Google doomed to fail? In the next two posts we will explore some principles of how Google might solve these problems.

Tuesday, November 17, 2009

Google storage extensions getting cheaper

Hey, have you noticed that Google just decreased its price for additional storage dramatically? The extensions are mainly valid for Gmail and Picasa Web Albums.

Now it is 20GB of additional storage for $5 per year. Yes, you did read this correctly: $5 per year for 20GB.

So put all this data of yours online. We will keep you updated here on how all this data can be used for good and for bad...

Keep you posted,
OJ

Monday, November 16, 2009

Getting rich with XING network updates

How social networking sites could be using your confidential information.

Today it is time for a post about how to become a millionaire in 5 easy steps.

Okay, I guess I have your attention now. So let me start somewhat differently. Maybe it is not you but Xing that will become rich by using your confidential data!

I used to be a consultant, traveling quite a lot to different customers and partners. A lot of them ended up in my contact list in Xing. So now I have a network of 200+ contacts in my industry. As I am a premium user, I get notifications as soon as something new happens in my network. Typically this is something like a new link between an existing contact and somebody else.

When I stopped working as a consultant, these network updates became really interesting. If you see one contact adding another one, you can often guess whether this is a private or a business relationship. If it is a business link, you can assume that these two persons met for a new project, sales opportunity or similar.

OK, detecting one new link between two companies is not special. I wouldn't bet a fortune that something special is going on between the two companies. But what if you see, for example, ten persons from one company linking to people from a second company? In this case you can assume that something big is happening. And maybe this is a new deal that could affect the share price?
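The correlation described here takes only a few lines of code once you have the raw link data. The following Python sketch counts new links per pair of companies and reports the pairs that reach a threshold. The record format, the function name and the company names are invented for illustration; we have no knowledge of how Xing stores or analyses its data.

```python
from collections import Counter

def hot_company_pairs(new_links, threshold=10):
    """Find pairs of companies with many new links between their employees.

    new_links: iterable of (person_a, company_a, person_b, company_b) tuples,
               one tuple per new contact link (a hypothetical record format).
    threshold: minimum number of new links for a pair to be reported.
    """
    counts = Counter()
    for _, company_a, _, company_b in new_links:
        if company_a == company_b:
            continue  # links inside one company tell us nothing here
        # Sort the names so that (A, B) and (B, A) count as the same pair.
        pair = tuple(sorted((company_a, company_b)))
        counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}

# Made-up data: ten cross-company links, one stray link, three internal links.
links = [(f"acme_{i}", "Acme", f"globex_{i}", "Globex") for i in range(10)]
links.append(("acme_0", "Acme", "initech_0", "Initech"))
links += [(f"acme_{i}", "Acme", f"acme_{i + 1}", "Acme") for i in range(3)]

print(hot_company_pairs(links))
```

For simplicity the sketch counts links rather than distinct persons, and it ignores the time window. Both would be easy refinements for anybody who holds the full data set.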

For individuals like you and me it might be a bit difficult to make a lot of money from this information. I guess we do not very often see strong new links between two companies while looking at our network updates. But think about Xing. All this data is sitting right at their feet. And this is data about all individuals and companies connecting to each other.

Maybe this is their real business model? Detecting something hot and then buying shares?

Be aware: Xing knows your new project, deal or merger!

For us the message is clear: if you start a large project which is not publicly announced, or if you start a merger or acquisition, make sure none of your team members connect with each other on Xing. You could adjust your privacy settings in Xing, but Xing still sees your data. Treat all your Xing data as publicly known. We are just waiting for somebody to correlate it.

How do you behave on Xing? Are you aware of the information you disclose to the world with each "new link"? Do you think Xing uses this data?

Please comment!

Saturday, November 14, 2009

Unpersonalize Google

Have you ever been proud that you know the Internet inside out?
You are searching for something and Google does not bring up any new, unknown links?

The reason is not that you know the Internet too well, but that Google knows you too well. Hardly any search that you do in Google is not personalised specifically for you. Personalisation of search results is one of Google's main focuses for providing better answers.

But what does that mean on the other side? Google is trying to get to know you as well as possible. The more Google knows about you, the better the search results. And the better Google knows you, the better it can put you into a perfect customer segment and sell targeted ads.

None of this is very new. But personalised searches mean that Google is showing you only the stuff that you like anyway. Google is not interested in you living wild and dangerously and exploring the unknown parts of the Internet.

Google uses (at least) the following mechanisms for personalization:

Google accounts
Cookies
IP addresses
and even the language you use and the way you phrase your search.

An excellent article explaining this was written by Danny Sullivan on Search Engine Land.

So you have not become too clever - it is Google that has become too clever!

Just as a check:

When did you last delete your cookies?
When did you last sign out of your Google account? (This normally switches off the logging of your search history.)
When did you last compare your search results against an anonymous search like Scroogle?

Did you ever have the feeling that you knew all Google's results in advance?

Happy searching!
OJ

Friday, November 13, 2009

Stupid, it's the data

Our world is constantly changing and it is changing fast.

One of the underlying forces is technology. More specifically, it is the availability of massive amounts of data and the ability to process it at acceptable cost and in acceptable time. This data is about our environment (think Google Earth) as well as about ourselves (think Facebook).
People think differently about this change:

some fear big brother
some are excited about the opportunities and great changes to come
and some people have never thought about this at all.

We the authors have now realized that we sit comfortably in both of the first two groups. We would like to make the most of the opportunities offered by the new types of media, communities and tools available. However at the same time we are uncomfortable about the amount of information which is being collected, stored and processed - often without our knowledge or consent.

So we intend to explore the future of data processing and its implications in a series of random blog posts. We hope to examine what great things may be possible and how to protect ourselves from those people and companies who collect all that data ;-)

Feel free to comment and send us your ideas for future topics.

ND and OJ
