Saturday, September 21, 2019
A Guide Into Business Intelligence Studies Information Technology Essay
Data Warehousing

Data warehousing is the integration of data from multiple sources into large warehouses, in support of on-line analytical processing and business decision making.

DW vs. Operational Databases

Data warehouse:
- Subject oriented
- Integrated
- Nonvolatile
- Time variant
- Ad hoc retrieval

Operational databases:
- Application oriented
- Limited integration
- Continuously updated
- Current data values only
- Predictable retrieval

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

Data Mart

- A monothematic data warehouse
- Department-oriented or business-line-oriented

Top-Down Approach

Advantages:
- A truly corporate effort, an enterprise view of data
- Inherently architected, not a union of disparate data marts
- Single, central storage of data about the content
- Centralized rules and control
- May see quick results if implemented with iterations

Disadvantages:
- Takes longer to build, even with an iterative method
- High exposure to the risk of failure
- Needs a high level of cross-functional skills
- High outlay without proof of concept

Bottom-Up Approach

Advantages:
- Faster and easier implementation of manageable pieces
- Favorable return on investment and proof of concept
- Less risk of failure
- Inherently incremental; important data marts can be scheduled first
- Allows the project team to learn and grow

Disadvantages:
- Each data mart has its own narrow view of data
- Permeates redundant data in every data mart
- Perpetuates inconsistent and irreconcilable data
- Proliferates unmanageable interfaces

Data Staging Component

Three major functions need to be performed to get the data ready (ETL):
- Extract the data
- Transform the data
- Load the data into the data warehouse storage

Data Warehouse

- Subject-oriented data: data is stored by subject.
- Integrated data: all the relevant data must be pulled together from the various systems, both internal operational systems and outside sources.
- Time-variant data: in an operational system the stored data contains only the current values, whereas the warehouse user needs data about past purchases as well as the current purchase.
- Nonvolatile data: data from the operational systems is moved into the data warehouse at specific intervals.

Data Granularity

- Data granularity in a data warehouse refers to the level of detail.
- The lower the level of detail, the finer the data granularity.
- Storing the lowest level of detail means a lot of data in the data warehouse.

Four steps in dimensional modeling

1. Identify the process being modeled.
2. Determine the grain at which facts will be stored.
3. Choose the dimensions.
4. Identify the numeric measures for the facts.

Components of a star schema

- Fact tables contain factual or quantitative data.
- There is a 1:N relationship between dimension tables and fact tables.
- Dimension tables contain descriptions about the subjects of the business.
- Dimension tables are denormalized to maximize performance.

Slowly changing dimensions

- Are the Customer and Product dimensions independent of the Time dimension?
- Examples of change: names, family status, product district/region.
- How can these changes be handled without affecting the historical record? Example: insurance.

3 suggestions for slowly changing dimensions

- Type 1: overwrite (erase) the old values. No accurate tracking of history is needed; easy to implement.
- Type 2: create a new record at the time of change, partitioning the history into the old and the new description.
- Type 3: add new "current" fields when there is a legitimate need to track both the old and the new state. The original and current values are kept; intermediate values are lost.

Junk Dimensions

Options for handling miscellaneous flags and indicators:
- Leave the flags in the fact tables
  - likely sparse data
  - no real browse entry capability
  - can significantly increase the size of the fact table
- Remove the attributes from the design
  - potentially critical information will be lost
  - if they provide no relevance, remove them
- Make each flag into its own dimension
  - may greatly increase the number of dimensions, increasing the size of the fact table
  - can clutter and confuse the design
- Combine all relevant flags, etc. into a single dimension
  - the number of possibilities remains finite
  - information is retained

The Monster Dimension

- It is a compromise.
- Avoids creating copies of dimension records in a significantly large dimension.
- Done to manage space and changes efficiently.

3 types of multidimensional data

- Input data: data from external sources (represented in the original figure by a blue cylinder) is copied into the small red marble cube, which represents input multidimensional data.
- Pre-calculated, stored results derived from the input data.
- On-the-fly results, calculated as required at run time but not stored in a database.

Aggregation

- The system uses physically stored aggregates as a way to enhance the performance of common queries.
- These aggregates, like indexes, are chosen silently by the database if they are physically present.
- End users and application developers do not need to know which aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate.
- At higher levels of aggregation the sparsity percentage goes down, eventually reaching 100% occupancy.

Data Extraction

There are two major types of data extraction from the source operational systems: "as is" (static) data and data of revision.
- "As is" or static data is the capture of data at a given point in time. It is used for the initial load.
- Data of revision is also known as incremental data capture.

Data Quality Issues

- Dummy values in fields
- Missing data
- Unofficial use of fields
- Cryptic values
- Contradicting values
- Reused primary keys
- Inconsistent values
- Incorrect values
- Multipurpose fields

Steps in Data Cleansing

1. Parsing
2. Correcting
3. Standardizing
4. Matching
5. Consolidating

Data Transformation

- All the extracted data must be made usable in the data warehouse.
- The quality of the data in many old legacy systems is unlikely to be good enough for the data warehouse.
- Transformation of source data encompasses a wide variety of manipulations that turn the extracted source data into usable information to be stored in the data warehouse.
- Data warehouse practitioners have attempted to classify data transformations in several ways.

Basic tasks:
- Selection
- Splitting/Joining
- Conversion
- Summarization
- Enrichment

Loading

- Initial load: load mode.
- Incremental loads: constructive merge mode; for a Type 1 slowly changing dimension, destructive merge mode.
- Full refresh: load and append modes are applicable.

OLAP

OLAP defined: On-line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

Users need the ability to perform multidimensional analysis with complex calculations.

The basic virtues of OLAP:
- Enables analysts, executives, and managers to gain useful insights from the presentation of data
- Can reorganize metrics along several dimensions and allow data to be viewed from different perspectives
- Supports multidimensional analysis
- Is able to drill down or roll up within each dimension

Business Metadata

Business metadata is like a roadmap or an easy-to-use information directory showing the contents and how to get to them. It answers questions such as:
- How can I sign onto and connect with the data warehouse?
- Which parts of the data warehouse can I access?
- Can I see all the attributes from a specific table?
- What are the definitions of the attributes I need in my query?
- Are there any queries and reports already predefined to give the results I need?

Technical Metadata

- Technical metadata is meant for the IT staff responsible for the development and administration of the data warehouse.
- It is like a support guide for the IT professionals who build, maintain, and administer the data warehouse.

Physical Design Objectives

- Improve performance: OLTP response times are 1-2 seconds at most; data warehouse queries take seconds to minutes
- Ensure scalability
- Manage storage
- Provide ease of administration
- Design for flexibility
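To make the roll-up and drill-down operations mentioned under OLAP a little more concrete, here is a minimal Python sketch. The fact rows, the column names, and the `aggregate` helper are all hypothetical and chosen only for illustration; a real OLAP engine would use stored aggregates rather than rescanning the facts each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, sales_amount).
FACTS = [
    (2019, "Q1", "East", 100.0),
    (2019, "Q1", "West", 150.0),
    (2019, "Q2", "East", 120.0),
    (2019, "Q2", "West", 130.0),
]

# Position of each dimension attribute within a fact row (assumed layout).
COLUMNS = {"year": 0, "quarter": 1, "region": 2}


def aggregate(facts, group_by):
    """Sum the sales measure over the given dimension attributes."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[COLUMNS[name]] for name in group_by)
        totals[key] += row[3]
    return dict(totals)


# Drill down: the most detailed view, by year, quarter and region.
detail = aggregate(FACTS, ["year", "quarter", "region"])

# Roll up along the time dimension: quarters are summed into years.
by_year_region = aggregate(FACTS, ["year", "region"])

# Roll up further: all regions combined.
by_year = aggregate(FACTS, ["year"])
```

Rolling up simply drops an attribute from the grouping key, and drilling down adds one back, which is why the same facts can be viewed from several perspectives.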
Physical Design Steps

1. Develop standards
2. Create aggregates plan
3. Determine data partitioning
4. Establish clustering options
5. Prepare indexing strategy
6. Assign storage structures

Partitioning

- Partitioning means breaking data into several physical units that can be handled separately.
- In data warehouses it is not a question of whether to do it, but how to do it.
- Granularity and partitioning are key to the effective implementation of a warehouse.
- Partitions are spread across multiple disks to boost performance.

Why Partition?

- Flexibility in managing data
- Smaller physical units allow
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring
- Improved performance

Criteria for Partitioning

- Vertically: groups of selected columns are kept together. More typical in dimension tables.
- Horizontally: e.g. recent events and past history. Typical in fact tables.

Parallelization

The argument goes: if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (parallel processing). Oracle uses this strategy in its warehousing products.

Indexing

- An index is a structure separate from the table data it refers to, storing the location of rows in the database based on the column values specified when the index is created.
- Indexes are used in a data warehouse to improve warehouse throughput.
- Topics: indexing and loading; indexing for large tables.

B-tree characteristics:
- Balanced
- Bushy: a multi-way tree
- Block-oriented
- Dynamic

Bitmap Index

- Bitmap indices are a special type of index designed for efficient querying on multiple keys.
- Records in a relation are assumed to be numbered sequentially from, say, 0.
  - Given a number n, it must be easy to retrieve record n.
  - This is particularly easy if records are of fixed size.
- Applicable to attributes that take on a relatively small number of distinct values.
  - E.g. gender, country, state, ...
  - E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
- A bitmap is simply an array of bits.
- In its simplest form, a bitmap index on an attribute has a bitmap for each value of the attribute.
  - Each bitmap has as many bits as there are records.
  - In the bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise.

Clustering

- The technique involves placing and managing related units of data in the same physical block of storage.
- This arrangement causes related units of data to be retrieved together in one single operation.
- In a clustering index, the order of the rows is close to the index order. "Close" means that physical records containing rows will not have to be accessed more than once if the index is accessed sequentially.

DW Deployment

Major deployment activities:
- Complete user acceptance
- Perform initial loads
- Get user desktops ready
- Complete initial user training
- Institute initial user support
- Deploy in stages

DW Growth Maintenance

- Monitoring the DW
  - Collection of statistics
  - Usage of statistics: for growth planning and for fine tuning
- User training
  - Data content
  - Applications
  - Tools

Dimensional Modeling Exercise

Exercise: Create a star schema diagram that will enable FIT-WORLD GYM INC. to analyze their revenue.
- The fact table will include, for every instance of revenue taken, the attribute(s) useful for analyzing revenue.
- The star schema will include all dimensions that can be useful for analyzing revenue.
- The only data sources available are shown below.

SOURCE 1: FIT-WORLD GYM Operational Database: ER diagram and the tables based on it (with data)

SOLUTION
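Separately from the exercise, the bitmap index described in the indexing notes can be illustrated with a minimal Python sketch. The sample records, the attribute names, and both helper functions are hypothetical; Python integers stand in for the bit arrays purely for brevity, so this shows the idea rather than how any particular database stores its bitmaps.

```python
def build_bitmap_index(records, attribute):
    """Return {value: bitmap}; bit i of a bitmap is 1 if record i has that value."""
    index = {}
    for position, record in enumerate(records):
        value = record[attribute]
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_records(bitmap, count):
    """List the record numbers whose bit is set in the bitmap."""
    return [i for i in range(count) if (bitmap >> i) & 1]


# Hypothetical relation; records are numbered sequentially from 0.
RECORDS = [
    {"gender": "F", "income_level": "0-9999"},
    {"gender": "M", "income_level": "10000-19999"},
    {"gender": "F", "income_level": "10000-19999"},
    {"gender": "M", "income_level": "0-9999"},
    {"gender": "F", "income_level": "10000-19999"},
]

gender_index = build_bitmap_index(RECORDS, "gender")
income_index = build_bitmap_index(RECORDS, "income_level")

# A query on multiple keys becomes a bitwise AND of the two bitmaps.
hits = gender_index["F"] & income_index["10000-19999"]
result = matching_records(hits, len(RECORDS))
```

Because each attribute has only a few distinct values, the index needs only a few bitmaps, and combining conditions costs one bitwise operation per pair of bitmaps.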