Data Mining

ABSTRACT

Nowadays, digital information is relatively easy to capture and fairly inexpensive to store. The digital revolution has seen collections of data grow in size, and the complexity of the data therein increase. A question commonly arising as a result of this state of affairs is: having gathered such quantities of data, what do we actually do with it? It is often the case that large collections of data, however well structured, conceal implicit patterns of information that cannot be readily detected by conventional analysis techniques. Such information may often be usefully analyzed using a set of techniques referred to as “knowledge discovery” or “data mining”. These techniques essentially seek to build a better understanding of data and, in building characterizations of data that can be used as a basis for further analysis, extract value from volume.
Data mining is the process of extracting patterns from data, and it is becoming an increasingly important tool for transforming that data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information – information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
In this paper we present data warehousing and data mining concepts, the goals behind data mining, and its applications in the real world.

Introduction
Data mining is the process of extracting patterns from data. It is becoming an increasingly important tool for transforming raw data into information, and it is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.
Data mining can be used to uncover patterns in data, but it is often carried out only on samples of the data. The mining process will be ineffective if the samples are not a good representation of the larger body of data: data mining cannot discover patterns that may be present in the larger body of data if those patterns are not present in the sample being “mined”. An inability to find patterns may become a cause of disputes between customers and service providers. Data mining is therefore not foolproof, but it may be useful if sufficiently representative data samples are collected. Conversely, the discovery of a particular pattern in a particular set of data does not necessarily mean that the pattern holds in the larger body of data from which that sample was drawn. An important part of the process is therefore the verification and validation of patterns on other samples of data.
The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques on samples that are (or may be) too small for statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). Data dredging may, however, be used to develop new hypotheses, which must then be validated on sufficiently large sample sets.
Index:

1. Data
2. What is a data warehouse
3. Key features
4. Data warehouse models
5. What is data mining
6. History of data mining
7. Architecture of data mining
8. How does data mining work
9. What kind of data can be mined
10. Data mining technologies
11. Difference between OLTP and OLAP
12. Features of OLAP
13. Advantages of data mining
14. Disadvantages of data mining
15. Conclusion
16. References

 

DATA

What is data?
In computing, data is information that has been translated into a form that is more convenient to move or process. Relative to today’s computers and transmission media, data is information converted into binary digital form.
In computer component interconnection and network communication, data is often distinguished from “control information,” “control bits,” and similar terms to identify the main content of a transmission unit.
What is a data warehouse?
In computing, a data warehouse (DW) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.

Figure: Data Warehouse Overview

A data warehouse is organized in three layers: staging, integration, and access. The staging layer stores raw data for use by developers. The integration layer integrates the data and provides a level of abstraction from users. The access layer makes the data available to end users.
This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O’Brien 2009).
However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization’s operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated, historical data for analysis.

According to W. H. Inmon, a leading architect in the construction of data warehouse systems, “a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process” [Inmon 1992]. This short but comprehensive definition presents the major features of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.

Key features: [researchpaper1]

  • Subject-oriented: The term subject-oriented indicates that information is arranged according to the business areas of an organization, such as manufacturing, sales, marketing, finance, etc.
  • Integrated: The term integrated refers to the fact that a data warehouse integrates data derived from the various functional systems in the organisation and provides a unified and consistent view of the overall organisational situation. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
  • Time-variant: Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time (e.g., the past 5-10 years).
  • Nonvolatile: The data is non-volatile in that it is always appended to (and rarely deleted from) the warehouse, thus maintaining the company’s entire history.

Data warehouse models

From an architectural point of view, there are three data warehouse models:

1 Enterprise warehouse
• Collects all of the information about subjects spanning the entire organization.
• Provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
• Typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to terabytes or beyond.
• May be implemented on traditional mainframes, UNIX super servers, or parallel architecture platforms.
2 Data mart
• Contains a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific selected subjects.
• Is usually implemented on low-cost departmental servers that are UNIX- or Windows/NT-based.
• Is categorized as independent or dependent, depending on the source of its data: independent data marts are sourced from operational systems, external information providers, or data generated locally within a particular department, while dependent data marts are sourced directly from the enterprise data warehouse.
• The data contained in a data mart tends to be summarized.
3 Virtual warehouse
• Is a set of views over operational databases.
• Only some of the possible summary views may be materialized for efficient query processing.
• Is easy to build but requires excess capacity on operational database servers.
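
To make the virtual warehouse idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table, view and column names here are hypothetical, chosen only for illustration; the point is that the “warehouse” is just a set of views computed on demand over the operational data:

    import sqlite3

    # In-memory stand-in for an operational database (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, sale_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [("east", "widget", "2023-01-05", 120.0),
         ("east", "gadget", "2023-01-07", 80.0),
         ("west", "widget", "2023-02-02", 200.0)])

    # The "virtual warehouse" is just a set of summary views: no data is
    # copied, and every query runs against the operational store.
    conn.execute("""
        CREATE VIEW monthly_sales AS
        SELECT region, substr(sale_date, 1, 7) AS month, SUM(amount) AS total
        FROM sales
        GROUP BY region, month""")

    for row in conn.execute("SELECT * FROM monthly_sales"):
        print(row)  # e.g. ('east', '2023-01', 200.0)

Because the views are evaluated at query time against the operational tables, this design is cheap to set up but shifts the analytical query load onto the operational database servers, which is exactly the trade-off noted above.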

What is data mining?

Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually a part of the knowledge discovery process.
History of data mining
Recently, data mining has been the subject of many articles in business and software magazines. However, just a few short years ago, few people had even heard of the term “data mining”. Though data mining is the evolution of a field with a long history, the term itself was only introduced relatively recently, in the 1990s. This section explores the history of data mining.

Data mining’s roots can be traced back along three family lines. The longest of these three lines is classical statistics. Without statistics, there would be no data mining, as statistics are the foundation of most technologies on which data mining is built. Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the very building blocks on which more advanced statistical analyses are built. Certainly, within the heart of today’s data mining tools and techniques, classical statistical analysis plays a significant role.

Data mining’s second longest family line is artificial intelligence, or AI. This discipline, which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Because this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. AI found a few applications in the very high-end scientific and government markets, but the required supercomputers of the era priced AI out of the reach of virtually everyone else. The notable exceptions were certain AI concepts adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

The third family line of data mining is machine learning, which is more accurately described as the union of statistics and AI. While AI was not a commercial success, its techniques were largely co-opted by machine learning. Machine learning, able to take advantage of the ever-improving price/performance ratios offered by computers of the 80s and 90s, found more applications because the entry price was lower than that of AI. Machine learning can be considered an evolution of AI, because it blends AI heuristics with advanced statistical analysis. Machine learning attempts to let computer programs learn about the data they study, so that programs make different decisions based on the qualities of the studied data; it uses statistics for its fundamental concepts and adds more advanced AI heuristics and algorithms to achieve its goals.

Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. It is best described as the union of historical and recent developments in statistics, AI, and machine learning. These techniques are then used together to study data and find previously hidden trends or patterns within it.

Architecture of data mining

The following figure shows the basic architecture of data mining as part of the knowledge discovery process. It consists of the following steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation.

Figure: Data mining as a step in the process of knowledge discovery

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: also known as data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
• Data transformation: also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: this is the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: this is the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

It is common to combine some of these steps. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.
KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
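
As a minimal, hypothetical sketch of the pre-mining steps, the following Python/pandas fragment walks through cleaning, integration, selection and transformation; the file names and column names (transactions.csv, customers.csv, customer_id, region, amount) are assumptions made purely for illustration:

    import pandas as pd

    # Data cleaning: remove records with missing or clearly invalid values.
    raw = pd.read_csv("transactions.csv")        # hypothetical source file
    clean = raw.dropna().query("amount > 0")

    # Data integration: combine a second, heterogeneous source on a shared key.
    customers = pd.read_csv("customers.csv")     # hypothetical source file
    merged = clean.merge(customers, on="customer_id")

    # Data selection: keep only the attributes relevant to the analysis.
    selected = merged[["customer_id", "region", "amount"]]

    # Data transformation: consolidate into one row per customer, a form
    # suitable for a mining algorithm.
    features = (selected.groupby(["customer_id", "region"])["amount"]
                .agg(total="sum", average="mean", transactions="count")
                .reset_index())

The data mining step proper would then run on the consolidated features table, and the resulting patterns would pass through pattern evaluation and knowledge representation as described above.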

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

Sequential patterns: Data is mined to anticipate behaviour patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer’s purchase of sleeping bags and hiking shoes.
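
To make one of these relationship types concrete, here is a small, hypothetical sketch of the “clusters” case using k-means from scikit-learn; the two behavioural attributes (visits per month and average spend) and the choice of three segments are assumptions made for the example:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer attributes: [visits per month, average spend].
    X = np.array([[2, 15.0], [3, 12.5], [20, 5.0],
                  [22, 6.5], [1, 90.0], [2, 85.0]])

    # Group the customers into three market segments by similarity.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    segments = kmeans.fit_predict(X)

    print(segments)                 # cluster label assigned to each customer
    print(kmeans.cluster_centers_)  # the average profile of each segment

Analogous library support exists for the other relationship types, e.g. classifiers for classes and frequent-itemset algorithms for associations.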

What kind of data can be mined?

In principle, data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data, and the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases; data warehouses; transactional databases; unstructured and semi-structured repositories such as the World Wide Web; advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases; and even flat files. Here are some examples in more detail:

Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.

Relational Databases: Briefly, a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count.
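
As a small sketch of the aggregate functions mentioned above, the following snippet uses Python’s built-in sqlite3 module with a hypothetical orders table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("ann", 10.0), ("ann", 30.0), ("bob", 25.0)])

    # Average, sum, min, max and count, computed per customer in one query.
    query = """
        SELECT customer,
               AVG(amount), SUM(amount), MIN(amount), MAX(amount), COUNT(*)
        FROM orders
        GROUP BY customer"""
    for row in conn.execute(query):
        print(row)  # e.g. ('ann', 20.0, 40.0, 10.0, 30.0, 2)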

Multimedia Databases: Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.
Data mining technologies

A) OLTP:
The job of earlier on-line operational systems was to perform transaction and query processing, so they are also termed on-line transaction processing (OLTP) systems.
B) OLAP:
Data warehouse systems serve users or knowledge workers in the role of data analysis and decision-making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are called on-line analytical processing (OLAP) systems.

Difference between OLTP and OLAP
i) Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives and analysts.
ii) Data contents: An OLTP system manages current data in a highly detailed format, while an OLAP system manages large amounts of historical data and provides facilities for summarization and aggregation. Moreover, information is stored and managed at different levels of granularity, which makes the data easier to use in informed decision-making.
iii) Database design: An OLTP system generally adopts an entity-relationship data model and an application-oriented database design. An OLAP system adopts either a star or snowflake model and a subject-oriented database design.
iv) View: An OLTP system focuses mainly on current data, without referring to historical data or data in different organizations. In contrast, an OLAP system spans multiple versions of a database schema, due to the evolutionary process of an organization. Because of their huge volume, OLAP data are stored on multiple storage media.
v) Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. Accesses to OLAP systems, by contrast, are mostly read-only operations, although many may be complex queries.
Features of OLAP

The key indicator of a successful OLAP application is its ability to provide information as needed, i.e. its ability to provide “just in time” information for effective decision-making. All OLAP applications, found in divergent functional areas, have the following key features:
1 Multidimensional views of data
• Is inherently a representation of the actual business model.
• Provides more than the ability to “slice and dice”; it provides the foundation for analytical processing through flexible access to information.
• Managers must be able to analyze data across any dimensions at any level of aggregation, with equal functionality and ease.
2 Calculation-intensive capabilities
• The real test of an OLAP application is its ability to perform complex calculations; it must be able to do more than simple aggregation.
• OLAP software must provide a rich tool kit of powerful yet succinct computational methods, because key performance indicators often require involved algebraic equations.
• Analytical processing systems are judged on their ability to create information from data.
3 Time Intelligence
• Is an integral component of almost any analytical application. Time is a unique dimension because it is sequential in character; true OLAP systems should understand the sequential nature of time.
• The time hierarchy is not always used in the same manner as other hierarchies. Concepts such as year-to-date and period-over-period comparisons must be easy to define in an OLAP system. In addition, such systems must understand the concept of balances over time.
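
As a minimal sketch of these time calculations, the following pandas fragment derives a year-to-date total and a period-over-period change from a monthly revenue series; the column names and figures are hypothetical:

    import pandas as pd

    sales = pd.DataFrame({
        "month": pd.period_range("2023-01", periods=6, freq="M"),
        "revenue": [100, 120, 90, 110, 130, 125],
    })

    # Year-to-date: running total within each calendar year.
    sales["ytd"] = sales.groupby(sales["month"].dt.year)["revenue"].cumsum()

    # Period-over-period: change relative to the previous month.
    sales["pop_change"] = sales["revenue"].pct_change()

    print(sales)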

ADVANTAGES OF DATA MINING

  • Marketing/Retailing: Data mining can aid direct marketers by providing them with useful and accurate trends about their customers’ purchasing behaviour. Based on these trends, marketers can direct their marketing attention to their customers with more precision. For example, marketers of a software company may advertise their new software to consumers who have a long software purchasing history. In addition, data mining may also help marketers predict which products their customers may be interested in buying. Through this prediction, marketers can surprise their customers and make the customer’s shopping experience a pleasant one.
    Retail stores can also benefit from data mining in similar ways. For example, through the trends provided by data mining, store managers can arrange shelves, stock certain items, or provide certain discounts that will attract customers.
  • Banking/Crediting: Data mining can assist financial institutions in areas such as credit reporting and loan information. For example, by examining previous customers with similar attributes, a bank can estimate the level of risk associated with each given loan. In addition, data mining can also assist credit card issuers in detecting potentially fraudulent credit card transactions. Although the data mining technique is not 100% accurate in its predictions about fraudulent charges, it does help credit card issuers reduce their losses.
  • Law enforcement: Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these criminals by examining trends in location, crime type, habit, and other patterns of behaviour.
  • Researchers: Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.

DISADVANTAGES OF DATA MINING

  • Privacy issues
    Personal privacy has always been a major concern, and in recent years, with the widespread use of the Internet, concerns about privacy have increased tremendously. Because of privacy issues, some people do not shop on the Internet; they are afraid that somebody may gain access to their personal information and then use that information in an unethical way, thus causing them harm.
    Although it is against the law to sell or trade personal information between different organizations, the selling of personal information has occurred. For example, according to the Washington Post, in 1998 CVS sold its patients’ prescription purchases to a different company. In addition, American Express also sold its customers’ credit card purchases to another company. What CVS and American Express did clearly violated privacy law, because they were selling personal information without the consent of their customers. The selling of personal information may also bring harm to customers, because they do not know what the other companies are planning to do with the personal information they have purchased.
  • Security issues
    Although companies have a lot of personal information about us available online, they do not always have sufficient security systems in place to protect that information. For example, the Ford Motor Credit Company recently had to inform 13,000 consumers that their personal information, including Social Security numbers, addresses, account numbers and payment histories, had been accessed by hackers who broke into a database belonging to the Experian credit reporting agency. This incident illustrates that companies are willing to disclose and share your personal information, but they are not taking care of it properly. With so much personal information available, identity theft could become a real problem.
  • Misuse of information/inaccurate information
    Trends obtained through data mining and intended to be used for marketing or other ethical purposes may be misused. Unethical businesses or people may use the information obtained through data mining to take advantage of vulnerable people or to discriminate against a certain group of people. In addition, the data mining technique is not 100 percent accurate; mistakes do happen, and they can have serious consequences.

Conclusion
Data mining is a synonym for knowledge discovery. Data mining also refers to a specific step in the knowledge discovery process, a process that focuses on the application of specific algorithms used to identify interesting patterns in the data repository. These patterns are then conveyed to an end user who converts these patterns into useful knowledge and makes use of that knowledge.
Data mining has evolved out of the need to make sense of huge quantities of information. Usama M. Fayyad says that stored data is doubling every nine months and that the “demand for data mining and reduction tools increase exponentially” (Fayyad, Piatetsky-Shapiro, & Uthurusamy, 2003, p. 192). In 2006, $6 billion in text and data mining activities was anticipated (Zanasi, Brebbia, & Ebecken, 2005).
The U.S. government is involved in many data mining initiatives aimed at improving services, detecting fraud and waste, and detecting terrorist activities. One such activity, the work of Able Danger, had identified one of the men who would, one year later, participate in the 9/11 attacks (Waterman, 2005). This fact emphasizes the importance of the final step of the knowledge discovery process: putting the knowledge to use.
There is much work to be done in the area of knowledge discovery and data mining, and its future depends on developing tools and techniques that yield useful knowledge without posing undue threats to individuals’ privacy.
References

1) Advances and research directions in data warehousing technology, by Mukesh Mohania, Sunil Samtani, John F. Roddick and Yahiko Kambayashi [yahiko@kuis.kyoto-u.ac.jp] [researchpaper1]
2) www.wikipedia.com
3) Future trends in data mining, by Hans-Peter Kriegel, Karsten M. Borgwardt, Peer Kröger, Alexey Pryakhin, Matthias Schubert and Arthur Zimek
4) Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber (2006), Elsevier Inc.
5) www.springerlink.com
