Posts Tagged ‘web’
Querying Tables in the Web
I am glad to announce that after a lot of work now I am ready with a new work over how search engines should understand and search in HTML tables.
Printable version as a PDF file: querying-tables-in-the-web
Querying Tables in the Web
Information technology research
Here is expressed a new way of crawling and searching information stored in HTML tables. The results of this search are new tables (like results of SQL queries).
When I have started working on this topic I was not aware of Google’s MapReduce, now I find it very useful to the topic of this paper.
Contents:
-
Abstract. Where tables are used
-
The Algorithm in a few words. Understanding how tables care information
-
Addressing cells. Setting calling names of cells in order to extract information
-
Semantics and value of secondary cells
-
The Algorithm in deep. Theoretical translation of tables into natural text statements
-
Constructing new tables (using MapReduce’s idea). Comparison with SQL queries
-
Legal Drinking Age in Asia. Table to natural text translations
-
Is Semantic web a table like model
Understanding meaning of main cells and secondary cells in a table and deriving meanings in order to present tables as natural language sentences (search-able by machines). Querying this data and receiving as a result a composed table by the search engine.
Free text to table is a possible future option.
Abstract. Where tables are used
No matter if you read a restaurant menu with no table borders (content is which arranged like a table with no borders) or a bordered table, the way content is placed is like it is filled in columns and rows.
I believe that the table is able to be translated in natural language sentences in order to understand and index the information as a natural text by search engines. Browsing the web we find tables explaining time schedules, steps, categories and so on. May be the table is the second most popular form of expressing information next to the simple text (array of characters and symbols).
The Algorithm in a few words. Understanding how tables care information
Data in tables is presented in two main cell groups, the first group is the main cells which carry the general attributes of the data and the second is the group about the concrete category/ case (in variation tables). This two groups are in most of the real life tables separated by their formatting, data type of content and place in the table’s rows and columns.
Understanding the way this cells content relates to each other and translating this relations into natural text sentences like data structure which is able to be queried like a relational database table.
By using of a MapReduce (http://labs.google.com/papers/mapreduce.html) like algorithm, new tables can be created from the database of many tables to present search result by mapping equal cell names or values according to a complicated user query search.
Addressing cells. Setting calling names of cells in order to extract information
Lets address cells of a table like a mathematical matrix’s cells.
This is an example of a table aiming to present all possible combinations of a structure of a table.
|
R1; C1 |
R1; C2 |
R1; C3,4 |
||
|
R2; C1 |
R2; C2 |
R2; C3 |
R2; C4 |
|
|
R3; C1 |
R3; C2 |
R3; C3 |
R3; C4a |
R3; C4b |
|
R4; C1 |
R4; C2,3 |
R4; C4 |
||
|
R5,6; C1 |
R5; C2 |
R5; C3 |
R5; C4 |
|
|
R6; C2 |
R6; C3 |
R6; C4 |
||
|
R7; C1 |
R7; C2 |
R7; C3 |
R7; C4 |
|
Table 1. Theoretical table presenting all important cell arrangements (carrying information about the content of the cell according its arrangement)
One of the possible ways to address cells is to begin setting names from the upper left cell, but the other which I find as more practical is to start from the most lower and most right cell. I prefer the second way because the most common tables have main cells on first (and second) row and/ or first row and first column, so it is very possible if you start with the upper and most left cells to find a cell, which is separating going deep in the table (if table contains more than one main cells row or column). For this reason in order to not name and later address cells like: R3a, C4a. Lets begin from the most low and right cell.
The above consideration does not reflect to the algorithm itself but rather to the more rational naming of cells which rationalization enables easy white box testing and debugging.
Highlighted cell -addressing starting point, this cell’s width and height gives the width of the last column and height of the last row.
Semantics and value of secondary cells
Every secondary cell has value and name (semantic).
The semantics of the cells is placed in their main cells.
Main cells carry only the semantics of the rest cells as its value and does not have semantics. Main cells value is in common a string.
A table can contain from 0 (no main cells found in the table) up to 2 groups main cells (into a main cell is placed another table containing its own main cells forming the second group of letter).
|
Empty/ no content |
semantic1 |
|
semantic2 |
value1 |
Table 2. Scheme of semantics and values of a cell. The bottom right cell is a secondary cell with semantics: semantic1, semantic2 and value: values1
Values can be these data types:
string
number (if a space character, new line character if found it is ignored)
number range (spaces are ignored, “-” and “,”/”;” between the numbers describing the range)
currency (recognized by patterns)
time (recognized by patterns)
data (recognized by patterns)
time and date (recognized by patterns)
picture (the “alt” text if it is set can be treated as value)
href
Note: Ignore comments (i.e.:“18 (<string>)” is not string, but number). Comments are used with this patterns: “<statement> (<comment>)”; “<statement> - <comment>”; “<statement> <- <comment>”. Spaces are added only for a better visual expression. Of its own the algorithm ignores them from the very beginning, except the statement is string.
The Algorithm in deep. Theoretical translation of tables into natural text statements
All cells are addressed and main cells are recognized from the rest. The above theoretical table (Table 1) is used in order to present the all important policies of the algorithm.
R3; C4a and R3; C4b are split cells. R4; C2,3 is a merged cell (by columns). R5,6; C1 is a merged cell (by rows), too.
There is no difference in split cell by row or column. I do not find any reasonable application of this two different kinds of splitting instead of splitting by diagonal (by column and row at once) of the upper right cell.
Every secondary cell is translated into a natural text statement, by this pattern:
[R2; C2] are/is (R2; C2)
are - if main cells are more than 1
is - if there is only one main cell
R2; C2 – the cell itself
(R2; C2) – the value within cell R2; C2
[R2; C2] – the value(s) of the main cells of R2; C2. The semantics of cell R2; C2
Constructing new tables (using MapReduce’s idea). Comparison with SQL querie
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. [labs.google.com]
Having many secondary cells collected from many tables from different domains around the Internet is enough data to construct a new table as a result every time a query is executed. Secondary cells contain in themselves information about the semantics and about its value. Constructing a table as a search result is useful when the database does not contain a suitable table for the search query, but many columns and enough rows from different tables solely extracted can satisfy the search. Secondary cells’ semantics and data types of their values are not unique, many tables represent one and the same phenomena. By using some NLP policy is possible to claim is a secondary cell semantics equal to others secondary cell semantic from a different table.
User search can be treated as a query for a set of secondary cells with specified semantics and a specified value(s) for a semantic(s) which have to be bound with the presented results.
|
Product type |
Vendor |
The Lowest 3 Prices |
Alternatives (on the same price as tempVariable1? ) |
Alternatives (on the same price or lower as tempVariable1? ) |
|
MP3 Player |
Sony |
?Price? |
?Vendor, Product? |
?Price? |
Table 3. Search query presented in a table
Result:
The Lowest 10 Prices are: price1, price2, price3 // there are supposed to be MP3 Players on a same low price, but I have not deal with this possible case, this is just an example
Alternatives on Price price1: vendor1, product1, priceX1; vendor2, product2, priceX2
Alternatives on Price price2: vendor3, product3, priceX3; and all the above
Alternatives on Price price3: vendor4, product4, priceX4; vendor5, product5, priceX5; and all the above
As a comparison with SQL queries and the relational Database theory. I think that this are queries just like SQL ones executed on a relational DataBase but here we have no FROM clause, so the engine searches everywhere he finds such columns’ semantics and uses its rows if possible to do so. This kind of searching is made on the base of multiple searches for intermediate values associated with the same intermediate semantic around this mass storage of tables.
Examples of query type searches which can be performed better if the above is implemented:
What is the population in USA in 1979 and which countries have a similar amount of citizens?
How much does it cost a Sony mp3 player and which are the alternatives on the same or lower price?
How long does it take to travel from LA to San Francisco by car and is there other so big or bigger (in meaning of citizens) city where I can go from LA in a shorter traveling time again by car?
Of course, these are not general search queries according to the level of Natural Language Processing capabilities of search engines. This queries are at all executed by a specific domain related user interface. But I believe that if every result is serialized to tables this approach can be extremely useful.
Many data analysis can be made on the base of such search queries.
Legal Drinking Age in Asia. Table to natural text translations
At the end is attached a table of Legal Drinking Ages in Asia.
Having in mind the above table pieces of important translations have to be presented in order to picture the algorithm of translation and its’ major policies.
“Armenia” “De jure” “Drinking Age” is “none”.
“Armenia” “De jure” “Purchase Age” is “none”.
“Armenia” “De jure” “Drinking Age” and “Purchase Age” are “none”. (the two cells are merged)
“Azerbaijan” “De jure” “Drinking Age” is “18”.
“Azerbaijan” “De jure” “Purchase Age” is “18”.
“Azerbaijan” “De jure” “Drinking Age” and “Purchase Age” are “18”. (the two cells are merged)
“Thailand” “De jure” “Drinking Age” is “18”.
“Thailand” “De jure” “Purchase Age” is “18”.
(no matter that the two columns are with identical values, they are not merged so third sentence like Azerbaijan and Armenia is not composed)
The underlined and words (phrases) in quotes are exact words/ phrases. Search engine algorithms have been always dealing with problems to distinguish is”Vegas” and “Las Vegas” identical or is “Los Angeles” and “Angeles” not identical. In the sentence “Thailand De jure Drinking Age is 18.” additional knowledge is needed in order to claim is “Thailand De jure” one phrase (or it is not) or may be is an other combination of serial words a phrase or not.
The bold-ed words are added by the algorithm and are needed for this presentation of the translation. In order to construct a human readable sentence the above words are added. In a possible technological implementation of the algorithm specific DataBase data separation bytes would be used instead of this English words.
Is Semantic web a table like model
Semantic web is a declaration between the HTML in general code and the information semantics behind the HTML (in microformats and RDFa). This can be presented as a link between the semantic and the concrete technical presentation of the semantic.
This semantic web links (semantic tags and pieces of HTML as values) are not table like, it is hard to compare the values of the semantics because they are different and are closely related to the technical representation.
Attachments to Querying Tables in the Web
Table of Legal Drinking Age in Asia
“De jure” column is the reason this table is presented as a real life example. “De jure” column contains a table with two main cells placed on its first row- “Drinking Age”, “Purchase Age”. This main cells head two columns which are sometimes merged.
|
Muslims are not allowed to drink or possess alcohol, non-Muslim residents and visitors may import small amounts of alcohol for personal consumption. Most restaurants will allow non-Muslim customers to drink their own brought in wine on premises with no corking fee. Public sale of alcohol is illegal. |
|||
|
Indonesia (excluding Bali) |
|||
|
Regulated by underage drinking prohibition law. (Alcohol vending machines widely available.) |
|||
|
Legal ages are reckoned “from birth”, rather than East Asian age reckoning. South Koreans are 20 or 21 in their own reckoning when they reach legal drinking age. |
|||
|
Republic of China (Taiwan) |
|
||
|
|||
|
People under 18 can buy alcohol but in order to purchase alcohol that has more than 4.5 alcohol concentration, you have to be more than 25 years old. |
|||
I have queried Powerset (http://www.powerset.com/) and received these results:
Screenshot not available here. See the printable version of the document.
Screenshot 1. Search results from Powerset on “what is the legal drinking age in Azerbaijan”. With red color are surrounded important text snippets.
Screenshot not available here. See the printable version of the document.
Screenshot 2. Search results from Powerset on “what is the legal drinking age in Armenia”. With red color are surrounded important text snippets.
Azerbaijan: The Legal drinking age is equal to the Age of majority.
Armenia: There is no connection with the Age of majority and the Legal drinking age, the letter is not set in the legal environment of Republic of Armenia.
Virtual world as a place to meet and interact with people on web page
This New Virtual World (known to me by the explaination of Marvin Minsky, MIT in a presentation “Searching the HAL”) will enable people to meet in real time with other users browsing the same content and work in groups (interact in groups around the web).
Google Docs enables people to collaborate on editing one document in real time, by analogy, the above is browsing in collaboration.
Imagine you are browsing and you saw a page which content, functionality or whatever is interesting to you. You want to meet with people out there like you which are visiting this page just right now in the moment (in order to chat in real time with them) or in the past and left a comment, rating, reference or whatever (in order to read, change (wiki philosophy) or write your own on the base of theirs). You just want more from the web.
This can be impelemented by a browser toolbar (by collecting agreements of the participants about some privacy policy and other permission). Visitors can connect between each other in real time. Change some data (RDF data in example) about the page in a public Wiki generated page about the browsed content (if the content on the page is dynamic it is a little bit more complicated than the situation with a static web content). Connect with the authors of the browsed content.
When computers made things extremely fast and people started doing their tasks without bureaucracy they have lost the social contacts between other people doing the same thing. At this time psychologist started examining that using computers is making user to lose its social behavior skills. I think that this loose is not reasonable, technology can keep the social contact in every computer task.
Interact and present your iteration to other people in real time with the page (content) by surfing it and providing additional content (audio, video, text, web links). Or just being a spectator of the others presentations on the content. Like they all sit next to you and show you how they browse the web.
MyBlogLog http://www.mybloglog.com/ is a Yahoo!’s tool which enables visitors to see the recent visitors, but MyBlogLog doesn’t enable interaction in groups with the web page.
An interesting in the topic of the post is a new Microsoft Research’s product called SearchTogether (author of the idea: Meredith Ringel Morris).
Google Talk ChatBack is a tool which enables blog owners to insert a JS app in their blogs and this way let visitors to chat directly to them.
Social networking as a possible platform for such a Virtual World interactions
Imagine you are in a virtual room and meet there with a friend. In the room is a ball, so you pick it up, make it like a flower and throw it to your friend. Your friend catches it and makes of it a football ball and kicks it to you.
Meeting with other web visitors which visit the same content is a social connection. Social networking has a big data base with peoples, it only needs to present a toolbar (browser plug-in) which enables people to “meet” each other on the web.
In a virtual library where you take a book, go next to your friend, point some paragraph of it, create a abstract virtual model (with some mashup tools, i.e. drawing apps) of what the paragraph was about, step by step as the text explains (and all these steps are viewed by your friend) and show how the model will behave further (depending on the subject of your meeting). The wonderful thing is that your friend(s) will be able to watch the all process and interrupt you (if it is needed) and interact with the model by their own. I hope that this can be enabled by technology which is not dependent of the content of the web.
Public Web applications (wiki, blog, etc) are created with the idea to let visitors to interact with the others visitors’ work. Why don’t you let visitors interact in groups with the all web content. (with a toolbar or browser plug-in, not a web site).
Imagine the web as a virtual library of many virtual objects (papers, web applications and so on) and you took your friend and present him some application, the same interactions with it are supposed to be viewed from your friend’s monitor.
New: German company Weblin provided users with avatars by which they can surf the web. When a web user visits a web page he meets with the rest people in the network of Weblin visiting the same page at this moment. Users can chat in private or with the all visitors of the concrete web page.
— Added on 26 September 2008
The 3D future, according to Microsoft: A Photosynth-based “Spatial Web”
http://www.thestandard.com/news/2008/09/25/3d-future-according-microsoft-photosynth-based-spatial-web