MINING MAXIMAL WEB ACCESS PATTERNS- A NEW APPROACH

RAJIMOL A.1*, RAJU G.2
1School of computer Science, Mahatma Gandhi University, Kottayam, Kerala, India
2Department of IT, Kannur University, Kannur, Kerala, India
* Corresponding Author : rajikovoor@yahoo.com

Received : 06-11-2011     Accepted : 09-12-2011     Published : 12-12-2011
Volume : 3     Issue : 4       Pages : 346 - 348
Int J Mach Intell 3.4 (2011):346-348
DOI : http://dx.doi.org/10.9735/0975-2927.3.4.346-348

Conflict of Interest : None declared

Cite - MLA : RAJIMOL A. and RAJU G. "MINING MAXIMAL WEB ACCESS PATTERNS- A NEW APPROACH." International Journal of Machine Intelligence 3.4 (2011):346-348. http://dx.doi.org/10.9735/0975-2927.3.4.346-348

Cite - APA : RAJIMOL A., RAJU G. (2011). MINING MAXIMAL WEB ACCESS PATTERNS- A NEW APPROACH. International Journal of Machine Intelligence, 3 (4), 346-348. http://dx.doi.org/10.9735/0975-2927.3.4.346-348

Cite - Chicago : RAJIMOL A. and RAJU G. "MINING MAXIMAL WEB ACCESS PATTERNS- A NEW APPROACH." International Journal of Machine Intelligence 3, no. 4 (2011):346-348. http://dx.doi.org/10.9735/0975-2927.3.4.346-348

Copyright : © 2011, RAJIMOL A. and RAJU G., Published by Bioinfo Publications. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

Abstract

Web access pattern mining is an application of sequence mining on web log data to generate interesting user access behavior on World Wide Web. In this paper we present a new method for the efficient mining of maximal web access patterns. The method is a variation of recently published, FOL-Mine (First Occurrence List Mine) [1] for mining web access patterns. It is a top-down method that uses the concept of first occurrence to reduce search space and thus improving the performance.

Keywords

Web access pattern, Maximal access patterns, Sequence data base, WAP-Tree methods, Web usage mining.

Introduction

Mining of web access patterns generated by the users’ interaction with the World Wide Web is thrust area of research. Web access pattern mining is an application of sequence mining on web log data to generate interesting user access behavior on World Wide Web.
A sequence database consists of ordered elements or events, recorded with or without a concrete notion of time. There are many applications involving sequence data, such as customer shopping sequences, web click streams and biological sequences. Sequential pattern mining is an important data mining tool used for web log mining. Sequential pattern mining that discover frequent pattern in a Web Access Sequence Data Base (WASD) was first introduced by Agarwal and Srikant in [2] : Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minimum support threshold, sequential pattern mining is to find all frequent subsequences. A web log is a sequence of pairs: user-id and access information. For the purpose of study of sequential pattern mining, preprocessing is applied to the original log file and WASD is generated.
Let E be a set of events. A Web Access Sequence (WAS), S = e1e2 …..en for (1 ≤ i ≤ n), is a sequence of events where events can be repeated; n is the length of access sequence. S’= e1’e2’…..en’ is subsequence of Access Sequence (AS) S = e1e2 …..en and S is a super sequence of S’, if and only if 1≤ i1 ≤ i2 ≤ .. ≤ in ≤ n such that E’j = Ej for 1 ≤ j ≤ n . In S = e1e2 ….ek ek+1….en, if subsequence Ssuffix = ek+1 …en is a super sequence of pattern P = e’1e’2 …e’l, and e k+1 = e’1, the subsequence of S, S prefix = e1 e2…ek, is called the prefix of S with respect to pattern P. Support of a pattern S in WASD is the number of sequences Si in WASD, which contain the subsequence P, divided by the number of transactions in the database. An access sequence S is said to be an access pattern of WASD, if support of S in WAS, supWAS(S), is greater than or equal to the given threshold. Given a WASD and a support threshold ξ, Web Access Pattern Mining mines the complete set of ξ-patterns of WAS [3] .
A major challenge in mining frequent patterns from a large data set is the fact that such mining often generates a huge number of patterns satisfying the minimum support threshold, especially when minimum support is set low. This is because if a pattern is frequent, each of its sub-patterns is frequent as well. A large pattern will contain an exponential number of smaller, frequent sub-patterns. Maximal frequent pattern mining was proposed to overcome this situation. A pattern α is a maximal frequent pattern in set D if α is frequent, and there exists no super-pattern β such that α ⊆ β and β is frequent in D.
Recently, Rajimol A and G. Raju proposed FOL-Mine (First Occurrence List Mine), a new efficient, pattern growth, web access pattern mining algorithm [1] . The FOL-Mine algorithm used a highly efficient linked structure, to hold the web access sequences, and employed the concept of first occurrence of symbols to mine access patterns. In this work we have modified the FOL-Mine algorithm so that it generates all and only maximal access patterns.
The rest of the paper is organized as follows. Section 2 provides a study of related works, section 3 describes our new algorithm section and section 4 presents the experimental results. Conclusion is given in section 5.

Related Works

Mining max-patterns was first studied by Bayardo and MaxMiner, an Apriori-based, level-wise, breadth-first search method was proposed to find maximal patterns [4] . To reduce search space MaxMiner performs a subset infrequency pruning and also a superset frequency pruning. Though the super set pruning reduces search space and thus reducing time, the algorithm needs many passes to get all maximal patterns.
Another efficient depth first method MAFIA (Maximal Frequent Itemset Algorithm) was proposed in [5] for mining maximal frequent itemsets from a transactional database. This algorithm uses vertical bitmaps to compress the transaction id list, thus improving the previous work by a factor of three to five, But in [6] the authors conducted an extensive experiments on a variety of data sets and they showed that many times MAFIA yields a superset of maximal patterns.
M. J. Zaki and Gouda K introduced GenMax [6] , a back track search based algorithm for enumerating all maximal patterns. GenMax uses a number of optimizations to quickly prune away a large portion of the subset search space. It also uses a novel progressive focusing technique to eliminate non-maximal itemset and employs diffset propagation for fast frequency checking. They compared their work with previous works in the field using a good set of real and synthetic datasets and claimed that performance could vary significantly depending on the dataset characteristics. But GenMax performs better than the previous algorithms on majority of datasets [5] .
In paper [7] , the authors proposed a method based on MFP (Maximum Frequent Pattern) for discovering maximal access pattern of web users. They introduced a WUAP-Tree (Web User Access Pattern Tree) as basic data structure for the mining process. Recursion is avoided using this structure. WUAP-Tree is a variation of suffix tree with user count and shared prefix sub sequence count.

The Proposed Algorithm

FOL-Mine [1] used a linked list with header node to store the first occurrences of each event in a very compact and efficient way. The count of first occurrences, which gives the total support is held in the header node. Unnecessary generations of intermediate projected databases and tedious support counting are eliminated in FOL-Mine algorithm.
FOL-Mine uses three main procedures i) the main algorithm ii) The algorithm to construct FOL structure and iii) the mining algorithm FOL-Mine.
In the current work to find maximal access patterns, the first two algorithms remain the same. The detailed illustration of these algorithms is available in [1] . The mining algorithm FOL-Mine is modified to mine each and every maximal pattern efficiently.
FOLMax-Mine, the proposed algorithm to mine maximal access pattern is given in Fig. 1. A linked list of type FOL-list holds the first occurrences of an event. Each node of FOL-list is pointer of the first occurrence of an item in a web access sequence. Head node stores the total number of occurrences and start address of the FOL- list.
Structure of Header node and FOL-list are as described below.
1. struct htype {int supp;
symbol item;
folnode *next;}
2. Nodes of FOL-list is of the format
struct folnode {wasdlist *occr;
folnode *next};
FOLMax-Mine (pattern q, FOL L, η)
1. for each a Є ∑ do
i. La ← construct FOL (L, a)
ii set max-flag to 1
iii. if support of a > η
FOLMax-Mine (q. a, La, η);
If max-flag=1;
F ← F U q;
Set max-flag to 0;
end if
iv. delete La
end for
2. return
Fig. (1). Algorithm FOLMax-Mine
Step 1 (i) of the algorithm uses the algorithm construct FOL (L, a) [1] to build the First Occurrence List of the symbol a, that stores the first occurrences of symbol a in each sequence. Step 1 (iii) calls FOLMax-Mine recursively to mine all maximal patterns.

Experiments

All the tests were performed on a Due Core machine with 1GB RAM and running Microsoft Windows XP Professional version 2002. FOLMax-Mine is implemented in Microsoft visual C++ 6.0.
Data sets used for the experiments are msnbc and T10i4D100K. msnbc dataset is available in UCI Machine Learning Repository [8] . The data comes from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September, 28, 1999 (Pacific Standard Time). Each sequence in the dataset corresponds to page views of a user during that twenty-four hour period. It contains 10,00,000 sequences. T10i4D100k is a synthetic dataset generated by IBM data generator. This dataset contains 1,00,000 sequences.
The execution time of the algorithm is tested. The total execution time is measured by adding codes in the original implementation. [Fig-2] shows the execution time variation of FOLMax-Mine with support on msnbc dataset. [Fig-3] gives a comparison of execution time with number of maximal access pattern generated on msnbc dataset. Experiment was repeated on T10I4D100 synthetic dataset and the result is shown in [Fig-4] and [Fig-5] . For verifying the correctness of the proposed algorithm, we compared the output of FOLMax- Mine with that of FOL-Mine [1] . FOL-Mine generates each and every web access pattern whereas FOLMax-Mine generates all and only Maximal access patterns.

Scalability of algorithm

Keeping the support threshold constant the size of the data base is changed to test the scalability of the algorithm. Experiment is done using the dataset T10i4D100K. Data size is changed from 20k to 100k. Fig. shows the result of the scale-up experiments. The experimental result shows that the algorithm is linearly scalable.

Conclusion

A new method for generating Maximal access pattern is presented. Experimental result also is given. Scaling up experiments shows a linear scaling for the proposed method. The set of max-patterns, though more compact, usually does not contain the complete support information regarding to its corresponding frequent patterns. The process of comparing the efficiency of the new algorithm with the existing one is being carried out.

References

[1] Rajimol A. and Raju G. (2011) FOL - Mine – International conference on Advances in Computing and Communication (ACC2011), 253-262.  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[2] Agrawal R. and Srikant R. (1994) In: 20th International Conference on Very Large Databases, Santiago, Chile, 487–499.  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[3] Pei J., Han J., Mortazavi-asl B. and Zhu, H. (2000) 4th PAKDD, Kyoto, Japan, 396-407.  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[4] Bayardo R.J. (1998) ACM-SIGMOD International Conference on Management of Data, 85-93.  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[5] Burdick D., Calimlim M., and Gehrke J. (2001) International Conference on Data Engineering.  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[6] Gouda K. and Zaki M.J. (2010) International Conference on Knowledge Discovery and Data Mining.  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[7] CHENG L.V., WEI Chu-yuan and ZHANG Han-tao. (2006) Journal of Communication and Computer, 3(11)  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

[8] UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].  
» CrossRef   » Google Scholar   » PubMed   » DOAJ   » CAS   » Scopus  

Images
Fig. 2- Execution time variation of FOLMax-Mine with support on msnbc.
Fig. 3- Comparison of execution time and number of maximal access pattern with varying support on msnbc
Fig. 4- Execution time variation of FOLMax-Mine with support on T10I4D100K
Fig. 5- Comparison of execution time and number of maximal access pattern with varying support on T10I4D100K.
Fig. 6- Scalabilty of FOLMax-Mine on the dataset T10i4D100K.