Google's proposal to truncate words in order to save memory
In a video uploaded to YouTube (“Google Developers Day US – Theorizing from Data”, http://youtube.com/watch?v=nU8DcBF-qo4), Peter Norvig of Google presents in part of his talk (31:17-33:00) results from tests that aim to find the shortest length to which any word can be cut while still preserving its uniqueness and not confusing it with other words. The motivation for cutting words is the need to save memory and to discard the lexical form while keeping the semantics as much as possible. Many times when you search with Google you see bolded words like “robots” or “robotical” although you typed “robot” in the search query. Truncation is useful when we want to decide whether two strings of characters are one and the same word in different inflections, or completely different things.
Google's Research Director proposes cutting words down to a length of four letters.
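A minimal sketch of this idea in Python. The function name and details are my own; the talk only states the four-letter cutoff, not the exact procedure Google uses:

```python
def truncate_key(word, prefix_len=4):
    """Reduce a word to a short prefix key.

    prefix_len=4 follows the four-letter cutoff mentioned in the talk;
    this is an illustrative sketch, not Google's actual procedure.
    """
    return word.lower()[:prefix_len]

# Inflected forms of "robot" collapse to the same key:
# truncate_key("robots")    -> "robo"
# truncate_key("robotical") -> "robo"
```

With such keys, “robots” and “robotical” compare as the same word, which matches the bolding behavior seen in search results.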
Here I want to bring to your attention a passage attributed to people from Cambridge. Read it.
THE PAOMNNEHAL PWEOR OF THE HMUAN MNID
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
Let's apply the lesson of this example to the natural genesis of words and to the way they are memorized in the mind. Psychologists say that we remember the beginning, the end, and the remaining content, but without the letter order. If this is how the human mind works, applying it in machines that operate on natural-language words should not cause problems. Nevertheless, this has to hold true not only for English texts, but for any language that represents words as arrays of characters.
I think that in order to understand the nature of a phenomenon, you should look analytically back at its history and find the circumstances that created (caused) it, its behavior, and its characteristics. It may be wiser to apply the above algorithm to the first and last sounds instead of the first and last letters, because of the verbal genesis of language, which precedes its later written representation.
We have to trust that by reordering all the letters except the first and the last, we will not lose the identity (uniqueness) of the word and will not fool ourselves about its equality to a different word.
Let's apply this Cambridge approach as a way to save memory when manipulating large amounts of text. I think the application would look like this:
Aoccdrnig // first, keep the first and the last letters in place
A ccdinor g // then sort the remaining letters alphabetically
A c2dinor g // finally, encode repeated characters by keeping one letter of each group followed by the count of identical letters
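The three steps above can be sketched in Python. The function name is my own, and I assume the word is a single run of letters in one case; the original text does not specify how mixed case or very short words should be handled:

```python
def canonical_form(word):
    """Canonicalize a word by the three steps above (illustrative sketch)."""
    if len(word) <= 2:
        # Nothing between the first and last letter to reorder.
        return word
    first, last = word[0], word[-1]
    interior = sorted(word[1:-1])  # step 2: alphabetical order
    # Step 3: run-length encode repeated letters.
    encoded = []
    i = 0
    while i < len(interior):
        j = i
        while j < len(interior) and interior[j] == interior[i]:
            j += 1
        count = j - i
        encoded.append(interior[i] + (str(count) if count > 1 else ""))
        i = j
    return first + "".join(encoded) + last

# canonical_form("Aoccdrnig") -> "Ac2dinorg"
```

Two words map to the same canonical form exactly when they share the first letter, the last letter, and the multiset of interior letters, which is the equality the Cambridge passage suggests the mind relies on.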
The resulting array of characters, which may even contain digits, may be hard for a human to read, but if the Cambridge study is right about how people memorize words in English, these transformations will preserve the uniqueness of the words. (This algorithm could also be used by a spell checker to generate suggestions when a word is written with the proper first and last letters.)
Applying these ideas to international words, such as the names of geographical places, I offer one observation of my own that I find useful: international words keep at least their first two groups of consonants in every language into which they are translated.
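A sketch of extracting those leading consonant groups in Python. The function name and the vowel set are my assumptions, and this only fits Latin-script spellings:

```python
import re

def consonant_groups(word, groups=2):
    """Return the first `groups` consonant clusters of a word
    (illustrative sketch of the observation above; the vowel set
    "aeiou" is an assumption and ignores non-Latin scripts)."""
    clusters = re.findall(r"[^aeiou\W\d_]+", word.lower())
    return clusters[:groups]

# consonant_groups("London")  -> ['l', 'nd']
# consonant_groups("Londres") -> ['l', 'ndr']   (French spelling)
```

The leading clusters overlap across the two spellings, which is the kind of stability the observation claims; a practical matcher would compare the clusters prefix-wise rather than demand exact equality.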
Google and Peter Norvig are completely right about data.
I share Google's view that by collecting input data, we as algorithm architects understand the problem better than when we have no data. It is a deep insight that collecting data can help you when you are making decisions.
“Mathematics was founded on the analogy between two apples and two pears, not by collecting two tons of apples. In this way mathematics teaches us that data is not the only way to model a problem and build a solution that handles every case of the problem reflected in the input data. Statistics is not the only general algorithm by which people build solutions. We all know there is data that cannot be collected yet is relevant (important) to the final result; in these cases the brain works by analogy, finding problems that are close in some aspect (crucial for the concrete problem), that have solutions, and even better, solutions with proven (understood) semantics, and in this way finding solutions that are equivalent in the concrete situation and applicable to the raised problem. Data contains unsorted information about many problem-solving possibilities; it is important both to understand what the data says (analogy) and to collect it (statistics).” [part of AI-research page]
We have to recognize Google's deep understanding of statistical problems and of how and where to apply statistical methods. They are very good at this.