St.Petersburg Institute for Informatics RAS
Big Basket V1.0

1. Purpose

The Big Basket system is designed for market basket analysis. It applies a new technology for finding association rules, based on a modified apparatus of linear algebra that uses a data self-organization procedure and the effect of information structure resonance. The system's distinctive properties make it possible to find high-accuracy associations between sets of transaction items and a given item in the data. These sets form a basket with a high support level and long itemsets.

 

2. Statement of Market Basket Analysis Task

A market basket is the set of commodities (or services) purchased by a customer within a single transaction. Examples are a customer's visit to a supermarket or grocery, or an online purchase in a virtual store such as Amazon.com. Companies that offer commodities or services register their business operations over the whole period of their activity and so accumulate large collections of such transactions (databases).

One of the most common tasks in the statistical analysis of such databases is to find commodities and itemsets that occur together in many transactions. The customer behavior patterns revealed by this analysis are generally characterized by the list of commodities included in the set and the number of transactions containing the set. Trading companies use these patterns to arrange commodities in stores more effectively, to change the structure of commodity catalogues and web pages, to form packages of services that are purchased together, and so on. A set consisting of i commodities is called an i-itemset. The percentage of transactions containing the set is called the "support" of the set. A set is considered to be of interest if its support is higher than a minimum established by the user; such sets are called frequent.
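The definitions of support and frequent itemsets above can be illustrated with a minimal Python sketch. It is not the Big Basket algorithm: the function names `support` and `frequent_itemsets` are hypothetical, the enumeration is plain brute force, and the five transactions are taken from the data fragment in section 3.1.1.

```python
from collections import Counter
from itertools import combinations

# Five transactions taken from the data fragment in section 3.1.1.
transactions = [
    {"ketchup", "orange juice", "tomato sauce", "potato chips", "beer", "sugar"},
    {"onion rings", "sugar", "pamper", "roses", "soya sauce", "ketchup", "cd"},
    {"cd", "cigarettes", "sugar", "orange juice", "beer", "vase", "oven"},
    {"newspaper", "cd", "battery", "sweets", "soya sauce", "rice"},
    {"rice", "sugar", "tomato sauce", "apple", "pamper", "pacifier"},
]


def support(itemset, transactions):
    """Percentage of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)


def frequent_itemsets(transactions, size, min_support):
    """Brute-force search for itemsets of `size` items with support >= min_support (%)."""
    counts = Counter()
    for t in transactions:
        for combo in combinations(sorted(t), size):
            counts[combo] += 1
    n = len(transactions)
    return {
        combo: 100.0 * k / n
        for combo, k in counts.items()
        if 100.0 * k / n >= min_support
    }


print(support({"sugar"}, transactions))        # 80.0
print(support({"cd", "sugar"}, transactions))  # 40.0
print(sorted(frequent_itemsets(transactions, 2, 40.0)))
```

Brute-force enumeration is only practical for short itemsets and small data sets; it is used here purely to make the definitions concrete.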

A "confidence" characteristic is also often used for an itemset; it reflects the accuracy with which the set is revealed by a given algorithm. The accuracy is usually determined with respect to one of the items in the set: it equals the probability that the i-th item joins the set, given that the other i - 1 items are already included in it. The higher the confidence of the chosen set, the more significant the set is in practice. The length of the i-itemset is another important characteristic.
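As a hedged illustration of the confidence definition above, the sketch below estimates the probability that a given item appears in a transaction that already contains the other items of the set. The function name `confidence` and the sample data (from section 3.1.1) are assumptions made for the example, not part of Big Basket.

```python
# Five transactions taken from the data fragment in section 3.1.1.
transactions = [
    {"ketchup", "orange juice", "tomato sauce", "potato chips", "beer", "sugar"},
    {"onion rings", "sugar", "pamper", "roses", "soya sauce", "ketchup", "cd"},
    {"cd", "cigarettes", "sugar", "orange juice", "beer", "vase", "oven"},
    {"newspaper", "cd", "battery", "sweets", "soya sauce", "rice"},
    {"rice", "sugar", "tomato sauce", "apple", "pamper", "pacifier"},
]


def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)


print(confidence({"sugar", "soya sauce"}, "cd", transactions))  # 1.0
print(confidence({"sugar"}, "cd", transactions))                # 0.5
```

On this small sample, adding "soya sauce" to the antecedent raises the confidence from 0.5 to 1.0 but reduces the number of supporting transactions, which is the usual trade-off between confidence and support.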

 

3. General Information on Big Basket Operation

3.1. Initial Data Format

The initial data should be presented in one of two formats, which the Big Basket system identifies automatically:

3.1.1. A list of transactions in which the commodities of each transaction are separated from each other by a separator. A data fragment is given below:

 

ketchup, orange juice, tomato sauce, potato chips, beer, sugar

onion rings, sugar, pamper, roses, soya sauce, ketchup, cd

cd, cigarettes, sugar, orange juice, beer, vase, oven

newspaper, cd, battery, sweets, soya sauce, rice

rice, sugar, tomato sauce, apple, pamper, pacifier

ketchup, orange juice, tomato sauce, potato chips, beer, sugar

onion rings, sugar, pamper, roses, soya sauce, ketchup, cd

cd, cigarettes, sugar, orange juice, beer, vase, oven

newspaper, cd, battery, sweets, soya sauce, rice

rice, sugar, tomato sauce, apple, pamper, pacifier

ketchup, orange juice, tomato sauce, potato chips, beer, sugar

onion rings, sugar, pamper, roses, soya sauce, ketchup, cd

cd, cigarettes, sugar, orange juice, beer, vase, oven
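A file in this format can be read with a few lines of Python. The sketch below is only an illustration: the function name `parse_transactions` is hypothetical, and the comma separator is an assumption based on the fragment above (the text does not say which separators Big Basket accepts).

```python
def parse_transactions(text, separator=","):
    """Split each non-empty line into a list of item names."""
    transactions = []
    for line in text.splitlines():
        items = [item.strip() for item in line.split(separator)]
        items = [item for item in items if item]  # drop empty fields
        if items:  # skip blank lines between transactions
            transactions.append(items)
    return transactions


sample = """ketchup, orange juice, tomato sauce, potato chips, beer, sugar

onion rings, sugar, pamper, roses, soya sauce, ketchup, cd

cd, cigarettes, sugar, orange juice, beer, vase, oven
"""

for transaction in parse_transactions(sample):
    print(transaction)
```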

 

3.1.2. A flat table in which the column headings are the names of commodities and the rows correspond to transactions. Each table cell takes one of two values, "yes" or "no", depending on whether the given commodity is included in the given transaction (Fig. 1).

 

Fig. 1. Example of Initial Data Presentation
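The flat-table format can be converted into the same set-of-items form as the transaction list. The following sketch assumes, purely for illustration, that the table is available as comma-separated text with well-formed rows; the function name `table_to_transactions` and the tiny sample table are hypothetical and are not the data shown in Fig. 1.

```python
import csv
import io


def table_to_transactions(csv_text):
    """Turn a yes/no table (one row per transaction) into a list of item sets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name for name, value in row.items() if (value or "").strip().lower() == "yes"}
        for row in reader
    ]


# Hypothetical table: column headings are commodities, rows are transactions.
table = """cd,sugar,soya sauce,beer
yes,yes,yes,no
yes,yes,no,yes
no,yes,no,yes
"""

for items in table_to_transactions(table):
    print(sorted(items))
```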

3.2. Sequence of the System's Operations

Four main parameters are set in the Big Basket system after the initial data have been read:

1. The commodity Ai for which associations are to be found.

2. The transaction number (the line in the Data Table) for which the most complete association with the given accuracy is sought.

3. The desired error level (accuracy) of the association.

4. The minimum support level for transactions containing the given item.

At the first stage, the user selects the desired commodity Ai (as a rule, the most frequently purchased one) and sets the planned error level for the association rule. The system then automatically finds the first association with the Ai product and calculates its Confidence and Support. At the next stage, the system selects the transaction most saturated with purchases of the Ai commodity among those not covered by the first association, and finds a second association with the Ai product for it. Together, the two associations cover a larger number of transactions than either covers separately. The procedure then continues in the same way for the transactions not yet covered, until all associations with the Ai product that satisfy the given parameters have been found.
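The covering procedure described above can be sketched as a simple loop. This is not the Big Basket algorithm, whose rule search is based on the self-organization procedure mentioned in section 1; here a brute-force stand-in (`find_rule`, a hypothetical name) tries antecedents of one or two items taken from the seed transaction, and the seed is simply the first uncovered transaction rather than the "most saturated" one. The sample data come from section 3.1.1.

```python
from itertools import combinations

transactions = [
    {"ketchup", "orange juice", "tomato sauce", "potato chips", "beer", "sugar"},
    {"onion rings", "sugar", "pamper", "roses", "soya sauce", "ketchup", "cd"},
    {"cd", "cigarettes", "sugar", "orange juice", "beer", "vase", "oven"},
    {"newspaper", "cd", "battery", "sweets", "soya sauce", "rice"},
    {"rice", "sugar", "tomato sauce", "apple", "pamper", "pacifier"},
]


def find_rule(seed, target, transactions, max_error, max_len=2):
    """Stand-in rule finder: best antecedent (by coverage) drawn from the seed."""
    best = None
    candidates = sorted(seed - {target})
    for size in range(1, max_len + 1):
        for antecedent in combinations(candidates, size):
            items = set(antecedent)
            with_a = [t for t in transactions if items <= t]
            with_both = [t for t in with_a if target in t]
            conf = len(with_both) / len(with_a)  # with_a always includes the seed
            if conf >= 1.0 - max_error and (best is None or len(with_both) > best[2]):
                best = (antecedent, conf, len(with_both))
    return best


def cover(target, transactions, max_error=0.0):
    """Find rules one by one until every transaction with `target` is handled."""
    uncovered = [i for i, t in enumerate(transactions) if target in t]
    rules = []
    while uncovered:
        seed_index = uncovered[0]  # simplification: first uncovered transaction
        rule = find_rule(transactions[seed_index], target, transactions, max_error)
        if rule is None:
            uncovered.remove(seed_index)  # no acceptable rule for this seed
            continue
        rules.append(rule)
        antecedent = set(rule[0])
        uncovered = [i for i in uncovered if not antecedent <= transactions[i]]
    return rules


for antecedent, conf, covered in cover("cd", transactions):
    print(" and ".join(antecedent), "=> cd", conf, covered)
```

Each rule is reported as its antecedent, its confidence, and the number of transactions with the target item that it covers; ties between equally good antecedents are resolved in favor of the first one found.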

4. The System Operation

4.1. The general view of the system is shown in Fig. 2.

Fig. 2. Starting with the System Operation

4.2. Creation of New Project

As an example, let us take the supmart.tra file from the sample data of the well-known software product CBA (http://www.comp.nus.edu.sg/~dm2); the commercial version of CBA costs 2,000 US dollars.

Start the project creation wizard with a left mouse click. A new dialog box appears for selecting the data source (Fig. 3).

Fig. 3. Dialog Box for Selection of Data Source

Select, for example, ODBC and press OK. A new dialog box appears for selecting the ODBC driver. Select the Microsoft Excel driver (as shown in Fig. 4).

Fig. 4. Selection of Excel Driver

Press the Connect button. A new dialog box appears; in it, press the Select Workbook button (labeled Выбор Книги in the Russian version of the operating system) (Fig. 5).

Fig. 5. ODBC *.xls Driver Installation

Then, in the Select Workbook (Выбор Книги) dialog box (Fig. 6), select the *.xls file to be analyzed (in our case the initial file supmart.tra was converted to supmart.xls).

Fig. 6. Select Workbook (*.xls) Dialog Box

Then press OK again in the dialog box shown in Fig. 4. A dialog box for setting up the SQL query appears (Fig. 7).

Fig. 7. Dialog Box for Creation of an SQL Query to the Selected Excel Workbook

Here, in the Table field, select Data, the name we gave to the Data Table of transactions in Excel. The formal text of the query in SQL immediately appears in the SQL query field at the bottom.

Press OK, and the system reads the data.

Click the Options button; a dialog box for adjusting the system settings appears (Fig. 8).

 

Fig. 8. Dialog Box for System Settings Adjustment

 

On the first tab of the dialog box, set the planned error levels for the associations the system will search for. Here you can also change the self-organization parameter of the association search procedure; recommended values are from 0.3 to 0.5.

On the second tab, "Program settings", the optional "Show startup window" check box can be cleared.

The third tab, "Associations" (Fig. 9), is used to set the threshold values of "Confidence" and "Support" that are applied in the final selection of associations for the required basket.

Fig. 9. Setting Up Parameters for Selection of Associations

4.3. Searching for Associations in the Data

Press the "Analyze data" button on the toolbar to start the wizard that searches for associations in the initial data. A dialog box appears for selecting the item for which associations will be searched in all transactions (Fig. 10). All items available in the Data Table are listed in the left column, and their absolute frequencies are shown in the right column. Select the "CD" product, which occurs most frequently in the transactions, and press the OK button. The table of associations found for the "CD" item appears on the screen (Fig. 11).

Fig. 10. Selection of Item to Search Associations with

Fig. 11. Associations with "CD" Item
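The item list with absolute frequencies shown in the item selection dialog can be approximated with a short Python sketch. The function name `item_frequencies` is hypothetical, and the data are the fragment from section 3.1.1 rather than the supmart data used in this walkthrough, so the ranking differs from the one in Fig. 10.

```python
from collections import Counter

transactions = [
    ["ketchup", "orange juice", "tomato sauce", "potato chips", "beer", "sugar"],
    ["onion rings", "sugar", "pamper", "roses", "soya sauce", "ketchup", "cd"],
    ["cd", "cigarettes", "sugar", "orange juice", "beer", "vase", "oven"],
    ["newspaper", "cd", "battery", "sweets", "soya sauce", "rice"],
    ["rice", "sugar", "tomato sauce", "apple", "pamper", "pacifier"],
]


def item_frequencies(transactions):
    """Number of transactions containing each item, most frequent first."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count an item at most once per transaction
    return counts.most_common()


for item, count in item_frequencies(transactions)[:3]:
    print(item, count)
```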

As the table shows, three associations were found that have 100% "Confidence"; individually they cover from approximately 29% to 37% of all transactions containing the "CD" product.

4.4. Information on Associations Found

The Big Basket system makes it possible to view in detail the transactions covered by any found association or group of associations. To do this, select the desired association (or group of associations) in the table and press the "View details" button. For instance, select the first association found and perform this operation. The "Associations details" window appears on the screen. In this window, select the "Data matrix" tab, where all transactions covered by the first association are marked in a dark color (Fig. 12).

Fig. 12. Transactions Covered by the First Association "sugar and soya sauce => cd"

 

If we select all three associations, the "Data matrix" tab will look as follows (Fig. 13):

Fig. 13. Transactions Covered by All Three Associations

4.5. Graphical Display of Market Basket

An important feature of the Big Basket system is the graphical display of the market basket as a set of high-accuracy associations. To see it, press the View Basket (graphical display) button on the toolbar. The Consumer's basket window appears on the screen (Fig. 14).

The left field shows the commodities for which associations were searched and the associations actually found.

A camomile-shaped diagram is shown in the middle of the window, with the "CD" commodity at its center. The basket support level is indicated at the top left of the "camomile" field. The first number is the support level relative to the total number of transactions. The second number (in brackets) is the support level relative to the number of transactions that include the central commodity "CD".
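The two support numbers described above can be computed as follows. This is a minimal sketch under stated assumptions: the function name `basket_support` is hypothetical, each association is represented only by its antecedent, and the data are the fragment from section 3.1.1, not the supmart data shown in Fig. 14.

```python
transactions = [
    {"ketchup", "orange juice", "tomato sauce", "potato chips", "beer", "sugar"},
    {"onion rings", "sugar", "pamper", "roses", "soya sauce", "ketchup", "cd"},
    {"cd", "cigarettes", "sugar", "orange juice", "beer", "vase", "oven"},
    {"newspaper", "cd", "battery", "sweets", "soya sauce", "rice"},
    {"rice", "sugar", "tomato sauce", "apple", "pamper", "pacifier"},
]


def basket_support(antecedents, central, transactions):
    """Return (support among all transactions, support among those with `central`), in %."""
    with_central = [t for t in transactions if central in t]
    if not with_central:
        return 0.0, 0.0
    covered = [t for t in with_central if any(set(a) <= t for a in antecedents)]
    overall = 100.0 * len(covered) / len(transactions)
    relative = 100.0 * len(covered) / len(with_central)
    return overall, relative


# One hypothetical association for illustration: "soya sauce => cd".
overall, relative = basket_support([("soya sauce",)], "cd", transactions)
print(f"{overall:.0f}% ({relative:.0f}%)")
```

The second number is never smaller than the first, because the transactions containing the central commodity are a subset of all transactions.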

To the right of the diagram there is a column with the names of the commodities included in the constructed basket (for convenient graphical display these commodities are assigned the indexes C00, C01, ..., C11).

Fig. 14. Graphical Display of the Commodities Basket Found

Below the "camomile" is a bar chart showing how often (as a percentage) each commodity occurs in the associations found. The outer circles of the "camomile" are colored according to this percentage. The color scale is explained to the left of the "camomile".

Overall, as we can see, the basket found covers 47% of the transactions containing the "CD" commodity. This basket is made up of three associations with 100% accuracy. Thus, a new customer purchasing the "CD" commodity will buy, with 100% accuracy, an itemset from one of the associations that make up this basket.

If we click on any association in the left field (in our case, one of three), the commodities included in the selected association are shown in bold type in the graphical display of the basket.

4.6. Construction of All Possible Consumer Baskets

When selecting the commodity for which associations will be searched, it is also possible to run the search for all commodities. To do this, set the Analyze all items switch in the Select Item dialog box (Fig. 10). For the example considered, the following table of associations (Fig. 15) and the corresponding graphical information (Fig. 16) are then displayed.

Fig. 15. All Discovered Associations

Fig. 16. All Consumer Baskets
(the graphical display is shown only for the basket with "sugar" as the central commodity)

By selecting the required commodity in the left field of the graphical display, the user is free to choose whichever basket is most appropriate for one reason or another.

5. Demo version

The demo version can analyze at most 500 transactions and 50 products. The Save Project and Load Project functions are disabled.

6. System Requirements

Minimum system configuration for Big Basket operation:

Microsoft Windows 95 or a later version;

Pentium processor, 100 MHz or faster;

32 MB RAM.


Created by MaxMaster, 2003-2004