Garo Garabedyan- a private blog

Everything here is written and belongs to me except otherwise noted.

Posts Tagged ‘semantic web

Semantic Web

without comments

I like the above interview and I have decided to write down the foundations of Semantic web according to it as a struggle to make the first popular.

http://se-radio.net/podcast/2008-11/episode-116-semantic-web-jim-hendler
Software Engineering Radio Episode 116: The Semantic Web with Jim Hendler

We have spent a lot of our times in thinking of very fast ways to explain Semantic web and the best I like is:
Right now, if I have a document, I can put it on the web and instant point to any other document with a simple link. But if I have a database or a spreadsheet I can not put that on the web and have it point another database or spreadsheet.
So, the sense is figuring out what is the database and the infrastructure by which data and definitional information can comes to the web rather than these documents is the heart of the Semantic web.

Web for machines, software, programs:
The key concept is right now the information on the web is primarily described for human inns, but if I am using data or something like that it is very hard to describe that for human use, because what I meant to do is aggregated, collated, queried, all things that pure happen back behind somewhere in the infrastructure, than when I present it to the human it is going to look like a table or a text or a web site, so it is really about building a new infrastructure for the web, to make far more the data work together and of course that will require more machines and machine integration things.

Three things you need:
1. Data format that can include links- If I need to integrate DBs I need someone to knowing what can be merged and what can not be merged, which things link to which other things. RDF is primarily the language that gives you that. So we need a standard for the kind of data description.

2. Means for describing terms of that data- If I have a DB, I look at a column called C27 column 19 and I see that it is filled with the number 15, that doesn’t help me much, if I want a merge with another DB. If I know that that represents the age of a person, then I can find the another DB which have either other properties of that person or other ages of people. Now I know how to put them together. An Ontology or Domain description in a formal way to represent it.
These Ontologies define the meanings of things and here is where the term semantic comes.
If I want to link data together I need to know what is it about, and it is often about something that is not in the DB.
We need to take data out of the database into a place where it can be merged or you need query dynamically the DB.
RDF gives you data in triple stores- graph. If you take two table you can not guarantee that you will get table as a result of merging, if you have two trees I can not guarantee that I will have a tree as a result. Merging graphs always gives you graphs.
Directed labeled graph is RDF.
OWL (higher level data schema) to represent the Ontologies.

3. Standard way to query RDF- SPARQL.

Written by garabedyan

July 27, 2009 at 19:07

Querying Tables in the Web

with 3 comments

I am glad to announce that after a lot of work now I am ready with a new work over how search engines should understand and search in HTML tables.

Printable version as a PDF file: querying-tables-in-the-web

Querying Tables in the Web

Information technology research

Here is expressed a new way of crawling and searching information stored in HTML tables. The results of this search are new tables (like results of SQL queries).

When I have started working on this topic I was not aware of Google’s MapReduce, now I find it very useful to the topic of this paper.

Contents:

  1. Abstract. Where tables are used

  2. The Algorithm in a few words. Understanding how tables care information

  3. Addressing cells. Setting calling names of cells in order to extract information

  4. Semantics and value of secondary cells

  5. The Algorithm in deep. Theoretical translation of tables into natural text statements

  6. Constructing new tables (using MapReduce’s idea). Comparison with SQL queries

  7. Legal Drinking Age in Asia. Table to natural text translations

  8. Is Semantic web a table like model

Understanding meaning of main cells and secondary cells in a table and deriving meanings in order to present tables as natural language sentences (search-able by machines). Querying this data and receiving as a result a composed table by the search engine.

Free text to table is a possible future option.

Abstract. Where tables are used

No matter if you read a restaurant menu with no table borders (content is which arranged like a table with no borders) or a bordered table, the way content is placed is like it is filled in columns and rows.

I believe that the table is able to be translated in natural language sentences in order to understand and index the information as a natural text by search engines. Browsing the web we find tables explaining time schedules, steps, categories and so on. May be the table is the second most popular form of expressing information next to the simple text (array of characters and symbols).

The Algorithm in a few words. Understanding how tables care information

Data in tables is presented in two main cell groups, the first group is the main cells which carry the general attributes of the data and the second is the group about the concrete category/ case (in variation tables). This two groups are in most of the real life tables separated by their formatting, data type of content and place in the table’s rows and columns.

Understanding the way this cells content relates to each other and translating this relations into natural text sentences like data structure which is able to be queried like a relational database table.

By using of a MapReduce (http://labs.google.com/papers/mapreduce.html) like algorithm, new tables can be created from the database of many tables to present search result by mapping equal cell names or values according to a complicated user query search.

Addressing cells. Setting calling names of cells in order to extract information

Lets address cells of a table like a mathematical matrix’s cells.

This is an example of a table aiming to present all possible combinations of a structure of a table.

R1; C1

R1; C2

R1; C3,4

R2; C1

R2; C2

R2; C3

R2; C4

R3; C1

R3; C2

R3; C3

R3; C4a

R3; C4b

R4; C1

R4; C2,3

R4; C4

R5,6; C1

R5; C2

R5; C3

R5; C4

R6; C2

R6; C3

R6; C4

R7; C1

R7; C2

R7; C3

R7; C4

Table 1. Theoretical table presenting all important cell arrangements (carrying information about the content of the cell according its arrangement)

One of the possible ways to address cells is to begin setting names from the upper left cell, but the other which I find as more practical is to start from the most lower and most right cell. I prefer the second way because the most common tables have main cells on first (and second) row and/ or first row and first column, so it is very possible if you start with the upper and most left cells to find a cell, which is separating going deep in the table (if table contains more than one main cells row or column). For this reason in order to not name and later address cells like: R3a, C4a. Lets begin from the most low and right cell.

The above consideration does not reflect to the algorithm itself but rather to the more rational naming of cells which rationalization enables easy white box testing and debugging.

Highlighted cell -addressing starting point, this cell’s width and height gives the width of the last column and height of the last row.

Semantics and value of secondary cells

Every secondary cell has value and name (semantic).

The semantics of the cells is placed in their main cells.

Main cells carry only the semantics of the rest cells as its value and does not have semantics. Main cells value is in common a string.

A table can contain from 0 (no main cells found in the table) up to 2 groups main cells (into a main cell is placed another table containing its own main cells forming the second group of letter).

Empty/ no content

semantic1

semantic2

value1

Table 2. Scheme of semantics and values of a cell. The bottom right cell is a secondary cell with semantics: semantic1, semantic2 and value: values1

Values can be these data types:

string

number (if a space character, new line character if found it is ignored)

number range (spaces are ignored, “-” and “,”/”;” between the numbers describing the range)

currency (recognized by patterns)

time (recognized by patterns)

data (recognized by patterns)

time and date (recognized by patterns)

picture (the “alt” text if it is set can be treated as value)

href

Note: Ignore comments (i.e.:“18 (<string>)” is not string, but number). Comments are used with this patterns: “<statement> (<comment>)”; “<statement> - <comment>”; “<statement> <- <comment>”. Spaces are added only for a better visual expression. Of its own the algorithm ignores them from the very beginning, except the statement is string.

The Algorithm in deep. Theoretical translation of tables into natural text statements

All cells are addressed and main cells are recognized from the rest. The above theoretical table (Table 1) is used in order to present the all important policies of the algorithm.

R3; C4a and R3; C4b are split cells. R4; C2,3 is a merged cell (by columns). R5,6; C1 is a merged cell (by rows), too.

There is no difference in split cell by row or column. I do not find any reasonable application of this two different kinds of splitting instead of splitting by diagonal (by column and row at once) of the upper right cell.

Every secondary cell is translated into a natural text statement, by this pattern:

[R2; C2] are/is (R2; C2)

are - if main cells are more than 1

is - if there is only one main cell

R2; C2 – the cell itself

(R2; C2) – the value within cell R2; C2

[R2; C2] – the value(s) of the main cells of R2; C2. The semantics of cell R2; C2

Constructing new tables (using MapReduce’s idea). Comparison with SQL querie

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. [labs.google.com]

Having many secondary cells collected from many tables from different domains around the Internet is enough data to construct a new table as a result every time a query is executed. Secondary cells contain in themselves information about the semantics and about its value. Constructing a table as a search result is useful when the database does not contain a suitable table for the search query, but many columns and enough rows from different tables solely extracted can satisfy the search. Secondary cells’ semantics and data types of their values are not unique, many tables represent one and the same phenomena. By using some NLP policy is possible to claim is a secondary cell semantics equal to others secondary cell semantic from a different table.

User search can be treated as a query for a set of secondary cells with specified semantics and a specified value(s) for a semantic(s) which have to be bound with the presented results.

Product type

Vendor

The Lowest 3 Prices

Alternatives (on the same price as tempVariable1? )

Alternatives (on the same price or lower as tempVariable1? )

MP3 Player

Sony

?Price?

?Vendor, Product?

?Price?

Table 3. Search query presented in a table

Result:

The Lowest 10 Prices are: price1, price2, price3 // there are supposed to be MP3 Players on a same low price, but I have not deal with this possible case, this is just an example

Alternatives on Price price1: vendor1, product1, priceX1; vendor2, product2, priceX2

Alternatives on Price price2: vendor3, product3, priceX3; and all the above

Alternatives on Price price3: vendor4, product4, priceX4; vendor5, product5, priceX5; and all the above

As a comparison with SQL queries and the relational Database theory. I think that this are queries just like SQL ones executed on a relational DataBase but here we have no FROM clause, so the engine searches everywhere he finds such columns’ semantics and uses its rows if possible to do so. This kind of searching is made on the base of multiple searches for intermediate values associated with the same intermediate semantic around this mass storage of tables.

Examples of query type searches which can be performed better if the above is implemented:

What is the population in USA in 1979 and which countries have a similar amount of citizens?

How much does it cost a Sony mp3 player and which are the alternatives on the same or lower price?

How long does it take to travel from LA to San Francisco by car and is there other so big or bigger (in meaning of citizens) city where I can go from LA in a shorter traveling time again by car?

Of course, these are not general search queries according to the level of Natural Language Processing capabilities of search engines. This queries are at all executed by a specific domain related user interface. But I believe that if every result is serialized to tables this approach can be extremely useful.

Many data analysis can be made on the base of such search queries.

Legal Drinking Age in Asia. Table to natural text translations

At the end is attached a table of Legal Drinking Ages in Asia.

Having in mind the above table pieces of important translations have to be presented in order to picture the algorithm of translation and its’ major policies.

ArmeniaDe jure” “Drinking Age isnone.

ArmeniaDe jure” “Purchase Age isnone.

ArmeniaDe jure” “Drinking Age and Purchase Age arenone. (the two cells are merged)

Azerbaijan De jure” “Drinking Age is18.

Azerbaijan De jure” “Purchase Age is18.

Azerbaijan De jure”Drinking Age and Purchase Age are 18. (the two cells are merged)

Thailand” “De jure”Drinking Age is18.

ThailandDe jure” “Purchase Age is18.

(no matter that the two columns are with identical values, they are not merged so third sentence like Azerbaijan and Armenia is not composed)

The underlined and words (phrases) in quotes are exact words/ phrases. Search engine algorithms have been always dealing with problems to distinguish is”Vegas” and “Las Vegas” identical or is “Los Angeles” and “Angeles” not identical. In the sentence “Thailand De jure Drinking Age is 18.” additional knowledge is needed in order to claim is “Thailand De jure” one phrase (or it is not) or may be is an other combination of serial words a phrase or not.

The bold-ed words are added by the algorithm and are needed for this presentation of the translation. In order to construct a human readable sentence the above words are added. In a possible technological implementation of the algorithm specific DataBase data separation bytes would be used instead of this English words.

Is Semantic web a table like model

Semantic web is a declaration between the HTML in general code and the information semantics behind the HTML (in microformats and RDFa). This can be presented as a link between the semantic and the concrete technical presentation of the semantic.

This semantic web links (semantic tags and pieces of HTML as values) are not table like, it is hard to compare the values of the semantics because they are different and are closely related to the technical representation.

Attachments to Querying Tables in the Web

Table of Legal Drinking Age in Asia

“De jure” column is the reason this table is presented as a real life example. “De jure” column contains a table with two main cells placed on its first row- “Drinking Age”, “Purchase Age”. This main cells head two columns which are sometimes merged.

Country / region

Notes

Drinking Age

Purchase Age

none

18

18

illegal

Alcoholic beverages are allowed for foreigners only and also served in hotels and restaurants but otherwise for Muslims, it’s illegal.

none

15

illegal

Muslims are not allowed to drink or possess alcohol, non-Muslim residents and visitors may import small amounts of alcohol for personal consumption. Most restaurants will allow non-Muslim customers to drink their own brought in wine on premises with no corking fee. Public sale of alcohol is illegal.

none

18

Introduced in January, 2006.

none

16

18

18-25 (varies between states).

Consumption of alcohol is prohibited in the states of Gujarat, Manipur and Mizoram. The legal drinking age in Tamil Nadu is 21.

Indonesia (excluding Bali)

21

18

18 years old or above is required to purchase alcohol

illegal

Only alcohol used for Jewish or Christian religious ceremonies is allowed.

none

18

18

20

Regulated by underage drinking prohibition law. (Alcohol vending machines widely available.)

illegal

Selling alcohol is illegal.

18

none

none

none

18

The sale of alcohol to Muslims is illegal, as is consumption of alcohol by Muslims in public. However, non-Muslims who are 18 years old or over are allowed to buy and drink alcohol.

18

No age limit for drinking.

18

21

Very few (if any) establishments will serve alcohol during the Holy Month of Ramadan.

17

Alcohol may legally be consumed or purchased only on Saturdays.

21

Illegal for Muslims. Forbidden by Sharia (Islamic Law, with qur’anic and other traditional legal inspirations) but can be purchased in some areas of Lahore, Karachi, and Islamabad.

none

18 (16)

18

18

illegal

Forbidden by Sharia (Islamic Law, with qur’anic and other traditional legal inspirations). Offenders are typically punished with lashes.

18

19

Legal ages are reckoned “from birth”, rather than East Asian age reckoning. South Koreans are 20 or 21 in their own reckoning when they reach legal drinking age.

18

18

  • It is illegal for anyone under the age of 18 to consume alcohol.

  • Parents, guardians, and others taking care of people under 18 shall prohibit underage drinking, or risk administrative fines of 10000 to 50000 new Taiwan dollars when the situations are serious.

  • One shall not supply alcohol to anyone under the age of 18. A violator shall be administratively fined 3000 to 15000 new Taiwan dollars.

18

18

  • Purchasing age is 18 years old. However, you need to be 20 years old to get into clubs and bars, although this is not very strict.

21

Dubai laws state that no person under the age of 16 may be in a place serving alcohol after 18:00. Alcohol is served only in restaurants and bars attached to hotels. Alcohol is prohibited in Sharjah.

15

18

People under 18 can buy alcohol but in order to purchase alcohol that has more than 4.5 alcohol concentration, you have to be more than 25 years old.

Nowadays results of current search technologies

I have queried Powerset (http://www.powerset.com/) and received these results:

Screenshot not available here. See the printable version of the document.

Screenshot 1. Search results from Powerset on “what is the legal drinking age in Azerbaijan”. With red color are surrounded important text snippets.

Screenshot not available here. See the printable version of the document.

Screenshot 2. Search results from Powerset on “what is the legal drinking age in Armenia”. With red color are surrounded important text snippets.

Azerbaijan: The Legal drinking age is equal to the Age of majority.

Armenia: There is no connection with the Age of majority and the Legal drinking age, the letter is not set in the legal environment of Republic of Armenia.

Written by garabedyan

August 20, 2008 at 16:19