Garo Garabedyan's Divergent Thinking Blog

Querying Tables in the Web

with 4 comments

I am glad to announce that after a lot of work now I am ready with a new work over how search engines should understand and search in HTML tables.

Printable version as a PDF file: querying-tables-in-the-web

Querying Tables in the Web

Information technology research

Here is expressed a new way of crawling and searching information stored in HTML tables. The results of this search are new tables (like results of SQL queries).

When I have started working on this topic I was not aware of Google’s MapReduce, now I find it very useful to the topic of this paper.

Contents:

  1. Abstract. Where tables are used

  2. The Algorithm in a few words. Understanding how tables care information

  3. Addressing cells. Setting calling names of cells in order to extract information

  4. Semantics and value of secondary cells

  5. The Algorithm in deep. Theoretical translation of tables into natural text statements

  6. Constructing new tables (using MapReduce’s idea). Comparison with SQL queries

  7. Legal Drinking Age in Asia. Table to natural text translations

  8. Is Semantic web a table like model

Understanding meaning of main cells and secondary cells in a table and deriving meanings in order to present tables as natural language sentences (search-able by machines). Querying this data and receiving as a result a composed table by the search engine.

Free text to table is a possible future option.

Abstract. Where tables are used

No matter if you read a restaurant menu with no table borders (content is which arranged like a table with no borders) or a bordered table, the way content is placed is like it is filled in columns and rows.

I believe that the table is able to be translated in natural language sentences in order to understand and index the information as a natural text by search engines. Browsing the web we find tables explaining time schedules, steps, categories and so on. May be the table is the second most popular form of expressing information next to the simple text (array of characters and symbols).

The Algorithm in a few words. Understanding how tables care information

Data in tables is presented in two main cell groups, the first group is the main cells which carry the general attributes of the data and the second is the group about the concrete category/ case (in variation tables). This two groups are in most of the real life tables separated by their formatting, data type of content and place in the table’s rows and columns.

Understanding the way this cells content relates to each other and translating this relations into natural text sentences like data structure which is able to be queried like a relational database table.

By using of a MapReduce (http://labs.google.com/papers/mapreduce.html) like algorithm, new tables can be created from the database of many tables to present search result by mapping equal cell names or values according to a complicated user query search.

Addressing cells. Setting calling names of cells in order to extract information

Lets address cells of a table like a mathematical matrix’s cells.

This is an example of a table aiming to present all possible combinations of a structure of a table.

R1; C1

R1; C2

R1; C3,4

R2; C1

R2; C2

R2; C3

R2; C4

R3; C1

R3; C2

R3; C3

R3; C4a

R3; C4b

R4; C1

R4; C2,3

R4; C4

R5,6; C1

R5; C2

R5; C3

R5; C4

R6; C2

R6; C3

R6; C4

R7; C1

R7; C2

R7; C3

R7; C4

Table 1. Theoretical table presenting all important cell arrangements (carrying information about the content of the cell according its arrangement)

One of the possible ways to address cells is to begin setting names from the upper left cell, but the other which I find as more practical is to start from the most lower and most right cell. I prefer the second way because the most common tables have main cells on first (and second) row and/ or first row and first column, so it is very possible if you start with the upper and most left cells to find a cell, which is separating going deep in the table (if table contains more than one main cells row or column). For this reason in order to not name and later address cells like: R3a, C4a. Lets begin from the most low and right cell.

The above consideration does not reflect to the algorithm itself but rather to the more rational naming of cells which rationalization enables easy white box testing and debugging.

Highlighted cell -addressing starting point, this cell’s width and height gives the width of the last column and height of the last row.

Semantics and value of secondary cells

Every secondary cell has value and name (semantic).

The semantics of the cells is placed in their main cells.

Main cells carry only the semantics of the rest cells as its value and does not have semantics. Main cells value is in common a string.

A table can contain from 0 (no main cells found in the table) up to 2 groups main cells (into a main cell is placed another table containing its own main cells forming the second group of letter).

Empty/ no content

semantic1

semantic2

value1

Table 2. Scheme of semantics and values of a cell. The bottom right cell is a secondary cell with semantics: semantic1, semantic2 and value: values1

Values can be these data types:

string

number (if a space character, new line character if found it is ignored)

number range (spaces are ignored, “-” and “,”/”;” between the numbers describing the range)

currency (recognized by patterns)

time (recognized by patterns)

data (recognized by patterns)

time and date (recognized by patterns)

picture (the “alt” text if it is set can be treated as value)

href

Note: Ignore comments (i.e.:“18 (<string>)” is not string, but number). Comments are used with this patterns: “<statement> (<comment>)”; “<statement> <comment>”; “<statement> <- <comment>”. Spaces are added only for a better visual expression. Of its own the algorithm ignores them from the very beginning, except the statement is string.

The Algorithm in deep. Theoretical translation of tables into natural text statements

All cells are addressed and main cells are recognized from the rest. The above theoretical table (Table 1) is used in order to present the all important policies of the algorithm.

R3; C4a and R3; C4b are split cells. R4; C2,3 is a merged cell (by columns). R5,6; C1 is a merged cell (by rows), too.

There is no difference in split cell by row or column. I do not find any reasonable application of this two different kinds of splitting instead of splitting by diagonal (by column and row at once) of the upper right cell.

Every secondary cell is translated into a natural text statement, by this pattern:

[R2; C2] are/is (R2; C2)

are – if main cells are more than 1

is – if there is only one main cell

R2; C2 – the cell itself

(R2; C2) – the value within cell R2; C2

[R2; C2] – the value(s) of the main cells of R2; C2. The semantics of cell R2; C2

Constructing new tables (using MapReduce’s idea). Comparison with SQL querie

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. [labs.google.com]

Having many secondary cells collected from many tables from different domains around the Internet is enough data to construct a new table as a result every time a query is executed. Secondary cells contain in themselves information about the semantics and about its value. Constructing a table as a search result is useful when the database does not contain a suitable table for the search query, but many columns and enough rows from different tables solely extracted can satisfy the search. Secondary cells’ semantics and data types of their values are not unique, many tables represent one and the same phenomena. By using some NLP policy is possible to claim is a secondary cell semantics equal to others secondary cell semantic from a different table.

User search can be treated as a query for a set of secondary cells with specified semantics and a specified value(s) for a semantic(s) which have to be bound with the presented results.

Product type

Vendor

The Lowest 3 Prices

Alternatives (on the same price as tempVariable1? )

Alternatives (on the same price or lower as tempVariable1? )

MP3 Player

Sony

?Price?

?Vendor, Product?

?Price?

Table 3. Search query presented in a table

Result:

The Lowest 10 Prices are: price1, price2, price3 // there are supposed to be MP3 Players on a same low price, but I have not deal with this possible case, this is just an example

Alternatives on Price price1: vendor1, product1, priceX1; vendor2, product2, priceX2

Alternatives on Price price2: vendor3, product3, priceX3; and all the above

Alternatives on Price price3: vendor4, product4, priceX4; vendor5, product5, priceX5; and all the above

As a comparison with SQL queries and the relational Database theory. I think that this are queries just like SQL ones executed on a relational DataBase but here we have no FROM clause, so the engine searches everywhere he finds such columns’ semantics and uses its rows if possible to do so. This kind of searching is made on the base of multiple searches for intermediate values associated with the same intermediate semantic around this mass storage of tables.

Examples of query type searches which can be performed better if the above is implemented:

What is the population in USA in 1979 and which countries have a similar amount of citizens?

How much does it cost a Sony mp3 player and which are the alternatives on the same or lower price?

How long does it take to travel from LA to San Francisco by car and is there other so big or bigger (in meaning of citizens) city where I can go from LA in a shorter traveling time again by car?

Of course, these are not general search queries according to the level of Natural Language Processing capabilities of search engines. This queries are at all executed by a specific domain related user interface. But I believe that if every result is serialized to tables this approach can be extremely useful.

Many data analysis can be made on the base of such search queries.

Legal Drinking Age in Asia. Table to natural text translations

At the end is attached a table of Legal Drinking Ages in Asia.

Having in mind the above table pieces of important translations have to be presented in order to picture the algorithm of translation and its’ major policies.

ArmeniaDe jure” “Drinking Age isnone.

ArmeniaDe jure” “Purchase Age isnone.

ArmeniaDe jure” “Drinking Age and Purchase Age arenone. (the two cells are merged)

Azerbaijan De jure” “Drinking Age is18.

Azerbaijan De jure” “Purchase Age is18.

Azerbaijan De jure”Drinking Age and Purchase Age are 18. (the two cells are merged)

Thailand” “De jure”Drinking Age is18.

ThailandDe jure” “Purchase Age is18.

(no matter that the two columns are with identical values, they are not merged so third sentence like Azerbaijan and Armenia is not composed)

The underlined and words (phrases) in quotes are exact words/ phrases. Search engine algorithms have been always dealing with problems to distinguish is”Vegas” and “Las Vegas” identical or is “Los Angeles” and “Angeles” not identical. In the sentence “Thailand De jure Drinking Age is 18.” additional knowledge is needed in order to claim is “Thailand De jure” one phrase (or it is not) or may be is an other combination of serial words a phrase or not.

The bold-ed words are added by the algorithm and are needed for this presentation of the translation. In order to construct a human readable sentence the above words are added. In a possible technological implementation of the algorithm specific DataBase data separation bytes would be used instead of this English words.

Is Semantic web a table like model

Semantic web is a declaration between the HTML in general code and the information semantics behind the HTML (in microformats and RDFa). This can be presented as a link between the semantic and the concrete technical presentation of the semantic.

This semantic web links (semantic tags and pieces of HTML as values) are not table like, it is hard to compare the values of the semantics because they are different and are closely related to the technical representation.

Attachments to Querying Tables in the Web

Table of Legal Drinking Age in Asia

“De jure” column is the reason this table is presented as a real life example. “De jure” column contains a table with two main cells placed on its first row- “Drinking Age”, “Purchase Age”. This main cells head two columns which are sometimes merged.

Country / region

De jure

Notes

Drinking Age

Purchase Age

Armenia

none

Azerbaijan

18

Bahrain

18

Bangladesh

illegal

Alcoholic beverages are allowed for foreigners only and also served in hotels and restaurants but otherwise for Muslims, it’s illegal.

Bali

none

15

Brunei

illegal

Muslims are not allowed to drink or possess alcohol, non-Muslim residents and visitors may import small amounts of alcohol for personal consumption. Most restaurants will allow non-Muslim customers to drink their own brought in wine on premises with no corking fee. Public sale of alcohol is illegal.

Cambodia

none

People’s Republic of China

18

Introduced in January, 2006.

Georgia

none

16

Hong Kong

18

India

18-25 (varies between states).

Consumption of alcohol is prohibited in the states of Gujarat, Manipur and Mizoram. The legal drinking age in Tamil Nadu is 21.

Indonesia (excluding Bali)

21

Iraq

18

18 years old or above is required to purchase alcohol

Iran

illegal

Only alcohol used for Jewish or Christian religious ceremonies is allowed.

Israel

none

18

Jordan

18

Japan

20

Regulated by underage drinking prohibition law. (Alcohol vending machines widely available.)

Kuwait

illegal

Selling alcohol is illegal.

Lebanon

18

Macau

none

none

Malaysia

none

18

The sale of alcohol to Muslims is illegal, as is consumption of alcohol by Muslims in public. However, non-Muslims who are 18 years old or over are allowed to buy and drink alcohol.

Mongolia

18

Myanmar

No age limit for drinking.

Nepal

18

Oman

21

Very few (if any) establishments will serve alcohol during the Holy Month of Ramadan.

North Korea (DPRK)

17

Alcohol may legally be consumed or purchased only on Saturdays.

Pakistan

21

Illegal for Muslims. Forbidden by Sharia (Islamic Law, with qur’anic and other traditional legal inspirations) but can be purchased in some areas of Lahore, Karachi, and Islamabad.

Philippines

none

18 (16)

Qatar

18

Russia

18

Saudi Arabia

illegal

Forbidden by Sharia (Islamic Law, with qur’anic and other traditional legal inspirations). Offenders are typically punished with lashes.

Singapore

18

South Korea

19

Legal ages are reckoned “from birth”, rather than East Asian age reckoning. South Koreans are 20 or 21 in their own reckoning when they reach legal drinking age.

Sri Lanka

18

Republic of China (Taiwan)

18

  • It is illegal for anyone under the age of 18 to consume alcohol.

  • Parents, guardians, and others taking care of people under 18 shall prohibit underage drinking, or risk administrative fines of 10000 to 50000 new Taiwan dollars when the situations are serious.

  • One shall not supply alcohol to anyone under the age of 18. A violator shall be administratively fined 3000 to 15000 new Taiwan dollars.

Thailand

18

18

  • Purchasing age is 18 years old. However, you need to be 20 years old to get into clubs and bars, although this is not very strict.

United Arab Emirates

21

Dubai laws state that no person under the age of 16 may be in a place serving alcohol after 18:00. Alcohol is served only in restaurants and bars attached to hotels. Alcohol is prohibited in Sharjah.

Vietnam

15

18

People under 18 can buy alcohol but in order to purchase alcohol that has more than 4.5 alcohol concentration, you have to be more than 25 years old.

Nowadays results of current search technologies

I have queried Powerset (http://www.powerset.com/), now part of Microsoft’s Bing, and received these results:

Screenshot not available here. See the printable version of the document.

Screenshot 1. Search results from Powerset on “what is the legal drinking age in Azerbaijan”. With red color are surrounded important text snippets.

Screenshot not available here. See the printable version of the document.

Screenshot 2. Search results from Powerset on “what is the legal drinking age in Armenia”. With red color are surrounded important text snippets.

Azerbaijan: The Legal drinking age is equal to the Age of majority.

Armenia: There is no connection with the Age of majority and the Legal drinking age, the letter is not set in the legal environment of Republic of Armenia.

Advertisements

Written by garabedyan

August 20, 2008 at 16:19

4 Responses

Subscribe to comments with RSS.

  1. […] public links >> rationalization Querying Tables in the Web Saved by gabbygurl1993 on Sun 28-9-2008 Heart a remission ingoing Qatar, anyone? Saved by Arachne […]

  2. […] relevance information in web pages With HTML table understanding (Querying Tables in the Web), table recognition (future research task) and Semantic web technologies all the information […]

  3. […] – bookmarked by 6 members originally found by princesssakura300 on 2008-12-27 Querying Tables in the Web https://garabedyan.wordpress.com/2008/08/20/querying-tables-in-the-web/ – bookmarked by 6 members […]

    Bookmarks about Algorithms

    January 21, 2009 at 09:30

  4. Google Squared (http://www.google.com/squared/) is a product in Google Labs implementing such functionality.

    garabedyan

    October 2, 2010 at 23:24


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s