A method to automatically find foaf:Groups based on the cooccurrence of people with keywords in the web

Yohei Asada¹, Yutaka Matsuo², Mitsuru Ishizuka¹
¹University of Tokyo
²National Institute of Advanced Industrial Science and Technology
E-mail: asadayo@miv.t.u-tokyo.ac.jp
(This article is currently under proofread by a native speaker.)

Abstract

We have conducted research on social network extraction from the Web. A social network is useful for showing an overview of human relations in a community. Therefore, we can find a person, with the help of a social network, who is familiar with the person we want to get acquainted with. However, as the network grows larger, it is difficult to find an appropriate person. On the other hand, in the FOAF vocabulary, it seems that the broad concept of foaf:Group includes a group of people who have the same interest in common, such as a group of researchers who work on the same research topic. Knowing such a group, we can efficiently find an appropriate person we research with or an appropriate business partner. In this paper, we propose a method to find foaf:Groups based on cooccurrence with keywords in the web. We first extract the person-by-keyword cooccurrence matrix from the web by a search engine, and then apply the co-clustering algorithm to the matrix. We can yield groups of people and groups of keywords simultaneously. In a experiment, our method can automatically find a group of researchers who work on the same (or similar) research topic and a group of keywords which indicate the research topic of the corresponding researcher group.

1. Introduction

We have conducted research on automatic social network extraction from the Web [1]. In our previous work, we proposed a method to measure relation strength between two individuals based on the number of retrieved web pages that contain both of their names. When we have the list of names in a community, we can extract the social network in the community from the web by check up this relation strength between each combination of two names. Figure 1. shows an exapmle of social network extracted from the web in this way.

Figure 1. an example of social network extracted from the web
A social network is useful for showing an overview of human relations in a community. Therefore, we can find a person, with the help of a social network, who is familiar with the person we want to get acquainted with. However, as the network grows larger, it is difficult to find an appropriate person who is familiar with the person we want to get acquainted with, because persons in more than two or three hops from ourselves are often strangers.
When a person makes friends with another who is not a friend of his/her friend, there often is something they have in common : they listen to the same kind of music, they read the same book, they research on the same topic, they work in the same workplace and so on. Analyzing a social network of university students showed that students of the same interest tend to make friends with each other [9]. In this way, finding something they have in common is as important for social networking as finding someone both of them know.
On the other hand, FOAF attracts a great deal of attention in the context of the Semantic Web and social networking. By FOAF, we can write a machine-readable description of ourselves, who we know and what we create. FOAF files are collected to construct a network of know-links between people, which is also useful for showing an overview of human relations in a community. However, a know-link is not only a relation between people. As mentioned above, having something in common leads to friends, and so it is an important relation between people.
In FOAF vocabularies[2], there is foaf:Group class. According to this specification, "this concept (foaf:Group) is intentionally quite broad, covering informal and ad-hoc groups, long-lived communities, organizational groups within a workplace, etc". It seems that foaf:Group includes a group of people who have the same interest in common, such as a group of researchers who work on the same research topic. By foaf:Group, we can express many kinds of relations other than know-link, such as having the same interest. Knowing such a group, we can efficiently find an appropriate person we research with or an appropriate business partner.
In this paper, we propose a method to automatically find foaf:Groups based on cooccurrence with keywords in the web. Keywords here are words that are related to interest of a person. Our basic idea is very simple : we know what a person is interested in by checking up what kind of keywords his/her name cooccur with. We perform co-clustering the cooccurrence matrix of people and keywords and yield groups of people and keywords simultaneously.
We now give a brief outline of the paper. Section 2 introduces the method to automatically find foaf:Groups based on the cooccurrence of people with keywords in the web. Our method makes use of the entire network to find what kind of groups are in the network and classify each person to a relatively appropriate group simultaneously. To get this simultaneous clustering we use co-clustering algorithm. In Section 3, we present experimental results of finding groups from participants in the 17th annual conference of JSAI (the Japanese Society for Artificial Intelligence). We discuss related work in Section 4, and we present our conclusions and future works in Section 5.

2. Proposed method

In this section, we introduce a method to automatically find foaf:Groups based on cooccurrence with keywords in the web.

Represent people as a matrix

In information retrieval, the entire document collection are often represented as a matrix [4]. This approach is known as vector space model (or bag-of-words model). In vector space model, content-bearing words are extracted from the entire document collection as features and each document is represented as a vector of weighted word frequencies in the feature space. Thus the entire document collection is represented as a document-by-word matrix whose rows correspond to documents and columns to words.
In the same way as a document, a person is characterized by keywords which are related to his/her interest, jobs, projects he/she works on and so on. We represent a person as a vector of keywords weighted according to how many times they cooccur with his/her name in the web.
We use a search engine to know how many times a person cooccur with those keywords in the web: we simply give a query "(his/her name) AND (a keyword)" to a search engine, and the number of web pages which it retrieves is exactly what we want to know. Knowing how many times each person cooccur with each keyword, we can represent the entire community as a cooccurrence matrix A whose rows correspond to persons and columns to keywords. Say, A_ij indicates the number of web pages in which i-th person and j-th keyword cooccur.

Adjust the weight of keywords

Up to this point, all the people in a community are represented by a person-by-keyword matrix. The problem now is that among keywords some are relatively general words which cooccur with many people many times, others are not so general. Those keywords, which do not occur frequently in the web but cooccur with specific people, are very important to characterize those who cooccur with them. Therefore we give those keywords larger weight than others. In order to adjust the weight of keywords, we introduce a value person frequency to each keyword. Person frequency is the value which shows how many people in the community a keyword cooccurs with. This idea is the same as document frequency in the tfidf weighting. The smaller the person frequency of a keyword is, the larger its weight to those who cooccur with it becomes. We use person frequency to weight each keyword as follows :

Now the person-by-keyword matrix A is completed. Then, we apply the co-clustering algorithm to this person-by-keyword matrix A.

Co-clustering[3]

Co-clustering is a clustering algorithm to yield clusterings of both documents and words simultaneously [3]. When we perform co-clustering, all we have to do is to decide how many clusters the document collection is divided into. We do not need to know in advance what kind of clusters are in the document collection. Therefore, we get clusters of documents and corresponding clusters of words automatically.
Figure 2. gives an overview of co-clustring as graph partitioning.

Figure 2. Co-clustering as graph partitioning
In co-clustering, a document-by-word matrix is regarded as a graph in which a vertex indicates each document or each word, and a edge between a document and a word indicates that the word's weight in the document vector is non-zero. Because there is no edge between documents and between words, the graph is bipartite. The basic idea of co-clustering is to find vertex subsets which minimize the summation of edge weight between subsets. Each subset indicates the group of simlar documents and corresponding words.
If we replace a document with a person in the discussion above, we can apply co-clustering to the person-by-keyword matrix A. As a result, we yield subsets of persons and keywords. A subset of persons indicates a group of the persons who have the same interest in common, and the corresponding subset of keywords indicates what kind of interest they have in common.

3. Experimental Results

In this section, we show experimental results and discuss what kind of information we obtain from the results.
We performed a experiment of finding groups from participants in the 17th annual conference of JSAI (the Japanese Society for Artficial Intelligence). We prepared the list of participants and the list of keywords that are related to this conference. Then, for each researcher, we check up the cooccurrence with each keyword in the web by a search engine.
The number of participants and keywords we treated in this experiment were 200 and 170 respectively. That is to say, the size of cooccurrence matrix was 200 x 170. We divide the particpants and keywords into 10 groups by co-clustering.
Table 1. shows the top 10 keywords and the number of people in each groups. The top keywords here are those whose internal edge weights are the greatest. When there are less than 10 keywords in a group, all the keywords are shown. (In our expriment, almost all people and keywords are Japanese. Therefore the keywords in the Table 1. are translated into English.)

Table 1 : The top 10 keywords and the number of people in each group
Group number	The top 10 keywords	The number of people
#1	"TVCF"	1
#2	"edu-tainment"	2
#3	"wearable", "Web service orchestration", "semantics", "MGTP", "demand bus", "sensor network", "radius of curvature", "POMDPs", "multiple categories", "grid environment"	21
#4	"multiple handicaps", "story structure element", "appreciation process", "metaphor", "autistic child", "recycle system", "body language", "hypertext", "the number of steps taken", "kinesthesia"	9
#5	"discourse", "narrative discourse theory", "sound animation"	2
#6	"mutuality", "dialogue management", "conversation", "behavior and intelligence", "design knowledge description", "dynamic adaptation", "parallel recognition", "web adaptation", "learning environment", "object recogntion"	60
#7	"event space", "pen touch", "Robovie", "punin"	0
#8	"rhetoric", "poetic", "poetic effect", "software management"	7
#9	"interaction", "cooperative action", "autonomos movement", "scheduling", "real world information", "object detection", "data mining", "robotics","robust","mining"	70
#10	"corpus", "intelligent learning support enviroment", "dialogue context", "learning support system", "dialogue strategy", "unknown word", "concept assciation", "learning contents", "concept cooccur dictionary", "inheritance support environment"	28

We can guess the research topic of each group from the top 10 keywords in Table 1. For example, the research topic in the group #3 seems to be "ubiquitous", and the research topic in the group #4 seem to be "cognitive science". In the group #6, there seems to be two research topics : "behavioral science" and "cognitive science". The group #9 seems to include two research topics : "(interaction of) robot" and "mining". Broadly speaking, the main research topic in the group #10 seems to be "interactive e-learning system".
In this way, each group has one or two research topics. It is interesting that research topic "cognitive science" are divided into two groups : #4 and #6. This could be interpreted that some subset of the research area "cognitive science" are closely related to "behavioral science". It is also interesting that the group #9 contain two reseach topics "(interaction of) robot" and "mining" which ,at a glance, seem to be completely different. This could be explained that both research topics are related to "pattern recognition". Pattern recognition is necessary for robots to recognize and interact with the outside world, and pattern recognition is ordinary method for data mining.
Thus, it could be said that we extracted reseacher groups and their research fields simultaneously.

4. Related Work

There are several researches to find an expert in the web.
Referral web is an information retrieval system which makes use of a social network [5, 6, 7]. Given the name of a user, the system shows a social network around the user. Each user has an agent that has topic words of user's interest, and the agent searches in the neighborhood an expert who has the same topic words. The difference between Referral web and our approach is that the former focuses on a user but the latter has an overview of the entire network.
On the other hand, social networking by FOAF is now a topic of interest. The foafnaut visualizes know-links in FOAF [8]. However, our approach to automatically extract a social network and useful information for FOAF and social networking from the web seems to be new. We believe that our approach to automatically extract useful information and social networking based on know-links are complementary to each other.

5. Future Work and Conclusions

In this paper, we pointed out that finding a person who have the same interest is important in social networking but it is not easy. We solve this problem by automatically finding foaf:Groups in which people have something in common. Then, we proposed a method to find groups who have the same interest in a community. In the proposed method, we perform co-clustering to automatically find groups of people and keywords. A group of keywords represents the interest which people in the corresponding group have in common. The merits of our proposed method are that by using a search engine we make use of a great deal of information on the web and that we can automatically extract the groups.
We performed an experiment of finding groups from conference participants. Our experiment showed some interesting results such as that two research fields which seem to be different at a glance have something in common. In future work, we will evaluate the experimental results and examine the validity of our keyword weighting method, the number of keywords and so on.
If we get the name list of the FOAF workshop participants, we want to extract the social network of them and find groups of people who have similar interest, and show it in my presentation.

References

[1] Yutaka Matsuo, Hironori Tomobe, Koichi Hashida and Mitsuru Ishizuka. Mining Social Network of Conference Participants from the Web. 12th International WWW Conference, poster, 2003.
[2] FOAF Vocabulary Specification. http://xmlns.com/foaf/0.1/.
[3] I. S. Dhillon. Co-clustering documents and words using Bipartite Spectral Graph Partitioning. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD), 2001.
[4] G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.
[5] Henry Kautz, Bart Selman, and Mehul Shah. Referral Web : combining social networks and collabolative filtering. Communications of the ACM, vol.40, no.3, 1997.
[6] Henry Kautz, Bart Selman, and Al Milewski. Agent amplified communication. In proceedings of the National Conference on Artificial Intelligence, pp.3-9, 1999.
[7] Henry Kautz, Bart Selman, and Mehul Shah. The Hidden Web. AI Magazine, 18(2), pp.27-36, 1997.
[8] foafnaut. http://www.foafnaut.org/.
[9] Lada A. Adamic, Orkut Buyukkokten, and Eytan Adar. A social network caught in the Web. "First Monday" http://www.firstmonday.dk/, vol.8, no.6, 2003.