Database lecture 5 - 3 April 2015

Physical database design

Physical database is how the data in database are stored on a disk. Normally, database abstract the file out but we need to know the actual structure in order to optimize it.

Storage level

Several records in a database (called logical record -- LR) can be stored in a file (called a physical record -- PR). The logical records may not be from the same table to store in the same physical record.

You access one logical record at a time, which DBMS will access it from one physical record at a time. The physical record is cached in the memory of DBMS. If you try to access a record not loaded to memory, DBMS and OS will locate and read it to memory. The application doesn't need to explicitly request for this.

Benchmark

Response time of database is hard to benchmark, so we measure resource usage such as physical record transfer, CPU and memory usage, network usage instead. To increase the response time we can minimize the memory and disk usage, reducing physical record access time and CPU usage. Joining table is one way that consume lot of resources.

Database design decision

To optimize, we need some data such as

  • Table profile: Summary of a table, their column and their relationships.
  • Table: number of row and physical record (DBMS can generate this)
  • Application profile
  • Query: what queries are used, their frequency
  • Form: frequency of select/insert/delete operation

File structure

  • Sequential files
  • Simple
  • Used in the past
  • Data are stored in a file
  • Unordered (ordered by insertion time) or ordered by key
  • Fast insertion: just append it
  • Slow deletion: physical record must be reorganized
  • Fast sequential search, slow key search
  • Ordered sequential file
  • On insert, reorganize the physical record by key
  • Slow insertion
  • Lookup by key or sequential lookup is fast
  • Hash file
  • Fast access by unique key
  • Map key to physical record address by hashing it (mostly by using mod)
  • Good hash function should distribute data evenly
  • If physical record for a hash value is full, a new file is created and linked to old one (like HashMap)
  • One problem is the data may be too full in one file, and other files are empty so it would be slow to access most data, so data file sometimes need reorganization
  • Slow sequential search
  • BTree
  • Popular, supported by most DBMS
  • Balanced tree so all leaf nodes can be accessed by O(height) equally.
  • Bushy: a node has large number of edges
  • We want a tree that has low height, so the tree must be bushy
  • Each node in a tree is a physical record. It contains key and a pointer to physical record of the data. A node is filled to a half (d keys, d+1 pointers) and up to full (2d key, 2d+1 pointers)
  • The tree dynamically changes when physical record are inserted/deleted.
  • Good performance on both sequential and key search
  • You should practice tree balancing (see slide)
  • Cost of insertion: Cost of finding nearest key + cost of changing node (or adding new level)
  • Logarithmic search cost
  • B+Tree
  • Improved sequential and range search performance
  • All keys are stored redundantly in leaf nodes
  • Similar algorithm to BTree, but data is not deleted from leaf nodes, only copied to parent nodes.

Index matching

When we know what kind of queries we'll going to use we can also determine its index. It can be unique or non-unique index.

For example, if we select by department a lot, we might add index for department column.

Bitmap index

Bitmap is a table of possible values, and if an key contains that value, the bit for that will be true.

It is useful for non-unique column with few values

Bitmap join index

A special type of bitmap that optimize for joining. It is more suitable for data in data warehouse (data that are never changed) as if a data is inserted the entire bitmap needs to be updated.

Query optimization

Another way to increase performance is to write better queries. Some DBMS may optimize queries automatically. The DBMS also plan how to search physical records of it automatically.

Access plan

Access plan is a plan for accessing physical record and joining algorithm (sort before/after join, etc). DBMS will generate several plans and automatically choose the best one for you. You also can explictly send access plan to the DBMS.

Adding index

Adding correct type of index can help increase performance

  • Clustering index: data are stored sequentially
  • Nonclustering index: data are stored non-sequentially

Denormalization

Sometimes too much normalization can also slow data. It is useful in data warehouse.

Raid

Optimizing physical record to align with RAID can be useful

Other

  • Data warehouse: store data in summarized form
  • Parallel computing