 Type 2 dimensions


X002548
Not Just a Number

15586 Posts

Posted - 2005-06-23 : 08:57:37
OK, I've been doing this for a while, and maybe I've lost touch, but what the hell is this?

In a reply

http://weblogs.sqlteam.com/brettk/archive/2004/06/09/1530.aspx

Scott Jameson (mmmm....I love Jameson's....esp with a Guinness) posted:

quote:

In dimensional modeling a Type 2 dimension absolutely requires surrogate keys. But type 1 seems to be a judgement call. If you have large facts, more efficient surrogate keys can help considerably with performance through space savings and through index efficiency.



I'll go google this, but I've never heard of this before....anyone know what this is?

I've learned a lot from being here this last x years, but it sounds like bs to me.

Thanks


Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Kristen
Test

22859 Posts

Posted - 2005-06-23 : 09:13:18
<woosh/> !
Go to Top of Page

mwjdavidson
Aged Yak Warrior

735 Posts

Posted - 2005-06-23 : 10:18:40
Hi Brett
He's talking about a Kimball Type 2 slowly changing dimension. Check out the link below for an article by the man himself (Kimball that is, not Jameson! ).
http://www.dbmsmag.com/9805d05.html
I've relatively recently had to alter one of our dimensions to support change over time, and type 2 was the way to go. I found using a surrogate key saved a whole lotta grief...
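
Roughly, a Type 2 dimension ends up looking something like this (a sketch only; the table and column names are invented for illustration, not our actual schema):

-- Type 2 customer dimension: one row per version of a customer
CREATE TABLE DimCustomer (
    CustomerKey   int IDENTITY(1,1) PRIMARY KEY, -- surrogate key, what the fact tables reference
    CustomerID    varchar(20) NOT NULL,          -- natural key from the source system
    CustomerName  varchar(100) NOT NULL,
    City          varchar(50) NOT NULL,
    EffectiveDate datetime NOT NULL,
    EndDate       datetime NULL                  -- NULL = current version
)

-- When a tracked attribute changes (say the customer moves city),
-- close the old row and add a new one with a new surrogate key:
UPDATE DimCustomer SET EndDate = GETDATE()
WHERE CustomerID = 'C1001' AND EndDate IS NULL

INSERT INTO DimCustomer (CustomerID, CustomerName, City, EffectiveDate, EndDate)
VALUES ('C1001', 'Acme Ltd', 'Leeds', GETDATE(), NULL)

Old facts keep pointing at the old CustomerKey, new facts pick up the new one, so history stays intact.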

Mark
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2005-06-23 : 11:09:43
Read it, thanks.

Don't buy it.

I've done data warehousing before they coined the phrase, and I must admit I have never thrown my towel fully into that ring, but I still don't buy it.

His big deal is discussing tables with billions of rows....and the fact that natural keys can change....to me if a key changes, it's a new thing, and there should be a relationship table to identify the change.

I don't buy it.



Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx
Go to Top of Page

mwjdavidson
Aged Yak Warrior

735 Posts

Posted - 2005-06-23 : 12:27:02
I take it you don't buy it Brett

Mark
Go to Top of Page

Crito
Starting Member

40 Posts

Posted - 2005-06-23 : 16:08:28
I once heard a DBA say data warehousing set relational theory back a decade. I'm not sure I agree, but what the term meant 10 years ago and what data warehousing means today are definitely two completely different things.
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-06-23 : 19:05:11

I read through Kimball's article, and I would have to say that I agree with most or all that he said. The problem of changing natural keys is widespread, and for a data warehouse you have to have a universal system for dealing with them. I have run into many of these problems myself with data warehouse applications, and it really is the best way to go.

With a data warehouse for a retail company you might be tempted to put SKU (Stock Keeping Unit) in a sales fact table; it may be 7 numeric digits, and fit easily in a 4 byte integer. However, it is very common to reuse SKUs because many of the retail systems on the market have a limited range of SKUs available.

When you deal with fashion merchandise, each variation of a shirt for size and color is a different SKU and is often only sold for a single season and never again. You can easily have several hundred thousand new SKUs each year. In addition, many systems make the SKU a “smart” code: first 3 digits = department, next 3 = class, etc. They may use the same SKU two years later for a totally unrelated product.

If you use SKU in the retail fact table, in two years you are left to deal with the problem that the same SKU now has a completely different meaning. If you update the description in your dimension table, the old data is now pointing to the new SKU.

You can have a parallel problem in that the same product can have different SKUs. Maybe two retail chains with different retail management systems merge, but they sell many of the same products. Oh, and one of the companies has an alphanumeric SKU system, and you get to hack that into your system.

Using natural keys in a data warehouse system is usually a huge mistake, and leaves you with a gigantic mess to work around with hacks. On the other hand, I have never seen a situation where using a surrogate key turned out to be a mistake.
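
To make that concrete, a rough sketch (the SKU and product values are invented):

-- Same SKU reused years apart for two unrelated products; each gets its own surrogate key
-- DimProduct: ProductKey | SKU     | Description       | EffectiveDate | EndDate
--             1001       | 1234567 | Mens Shirt L Blue | 2003-02-01    | 2004-01-31
--             2045       | 1234567 | Womens Scarf Red  | 2005-03-01    | NULL
--
-- FactSales carries ProductKey, not SKU, so the 2003 sales still report as the shirt
-- even after the SKU is reassigned:
SELECT d.Description, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimProduct d ON d.ProductKey = f.ProductKey
GROUP BY d.Description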





CODO ERGO SUM
Go to Top of Page

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-06-24 : 06:07:54
The problem that type 2 SCD are designed to solve is to keep track of historical changes in an entity.
IMO You need a denormalized structure to be able to implement type 2 SCD in the first place.
Apart from that little caveat, You open up a whole new can of worms if You use type 2 SCD;
because now You need to know which historical version of the entity You are referring to.

Things quickly become very messy...

rockmoose
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2005-06-24 : 10:14:03
Ya think?

With your SKU example, how do we establish what a thing was and what it became, whether you use surrogate keys or not?

I didn't find the answer in your post Michael.



Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-06-24 : 11:04:17
I am not entirely sure what your question is, but I think you are asking how you map something back to the source system and keep track of the version.

That is a big subject and I would say a huge part of the ETL processing. My experience is that it has to be customized to the data feed you are receiving to deal with all of the issues in that source of data, and that you end up building custom data structures to do this mapping. I’m not claiming it is easy to do.

For example, I built a process to capture changes to the standard cost of products over time so that I could show cost of sales and margins correctly. The feeder system only had the current cost of a product, so it could not provide the cost or margins on historical sales info. I had to load their entire cost table every day, and compare it to the costs I had previously recorded in order to detect the changes, update the end date of the last cost for each product, and then add a new cost.
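
In rough outline the nightly compare looked something like this (simplified, and the names are made up, so take it as a sketch rather than the actual code):

-- StageCost: full current-cost extract loaded from the feeder system each night
-- ProductCostHistory: ProductKey, StandardCost, EffectiveDate, EndDate (NULL = current row)

-- 1. Close out the current history row wherever the feeder system now shows a different cost
UPDATE h
SET EndDate = GETDATE()
FROM ProductCostHistory h
JOIN StageCost s ON s.ProductKey = h.ProductKey
WHERE h.EndDate IS NULL
  AND s.CurrentCost <> h.StandardCost

-- 2. Add a new current row for every product that no longer has one (changed cost or brand new)
INSERT INTO ProductCostHistory (ProductKey, StandardCost, EffectiveDate, EndDate)
SELECT s.ProductKey, s.CurrentCost, GETDATE(), NULL
FROM StageCost s
LEFT JOIN ProductCostHistory h ON h.ProductKey = s.ProductKey AND h.EndDate IS NULL
WHERE h.ProductKey IS NULL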

Basically, I had to develop something in the data warehouse that really should have been in the transactional system, but wasn’t. Things like this seem to be the norm in commercial retail management systems that may be well designed to handle day to day needs, but have little thought for historical reporting.

Issues like this are what drive the need for data warehouses to begin with. A system that is designed for transaction processing may function very well for that, but may not support convenient historical reporting and trend analysis. The need to merge the data from different transactional systems and business divisions is also another big driver in the needs.





CODO ERGO SUM
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2005-06-24 : 14:54:12
Doesn't that support the idea that the use of natural keys is the best model?

I mean, you say that you have to understand the state of the data before you load it?

How do surrogate keys help?



Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx
Go to Top of Page

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-06-24 : 16:56:39
quote:
Basically, I had to develop something in the data warehouse that really should have been in the transactional system, but wasn’t. Things like this seem to be the norm in commercial retail management systems that may be well designed to handle day to day needs, but have little thought for historical reporting.

I agree 100%, the requirements of the dw often include fixing the schemas of the source systems.
Keeping historical changes is commonly something that OLTP systems are very poor at doing.


The great debate

"Datawarehousing" made the choice "easy", use surrogates to insulate the datawarehouse from changing keys
in the source systems...
I think the main points are that if You have surrogates in the dw You don't need to update millions of transactions
when a source key changes!
And that surrogates help You to integrate the same entity from several sources.

And, well for type 2 SCD, they require a surrogate + 2NF to work.
(logically You could of course have natural + version number / or date)

-------------------
DW are designed to solve reporting and analysis requirements,
some factions (Kimball, e.g.) propose that 2NF -> 3NF + surrogate keys is the best way to meet the reqs.
I don't agree with the Kimball approach much anymore, imo it's somewhat of a quick & dirty design,
and some design rules (conformed dimensions, SCDs, e.g.) to glue the mess together.
-------------------

Well, a plan is probably better than no plan...
but it's hard to beat a properly normalized database with well defined keys.
(even if it is a DW)

It would be simpler if the world was more perfect
Things get complicated when systemA calls emp#1 "E1254", and systemB calls emp#1 "654321-4542", and systemC calls emp#1 "Jiggly wiggly" and the corporation wants a unified view of everything in the DW.
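
What You end up building is a mapping layer, something like this (names invented, just a sketch):

-- One surrogate per real-world employee, plus a map from every source system's key
-- DimEmployee:       EmployeeKey | EmployeeName
--                    1           | Emp #1
-- EmployeeSourceMap: EmployeeKey | SourceSystem | SourceKey
--                    1           | SystemA      | E1254
--                    1           | SystemB      | 654321-4542
--                    1           | SystemC      | Jiggly wiggly
--
-- Facts from any source join through the map to the single surrogate:
SELECT m.EmployeeKey, f.HoursWorked
FROM SystemA_Timesheet f
JOIN EmployeeSourceMap m ON m.SourceSystem = 'SystemA' AND m.SourceKey = f.EmpNo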

rockmoose
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-06-24 : 18:00:10
There are also physical storage and performance issues with the design of a star schema that favor the use of integer surrogate keys. If you have 50,000 products with a 20 byte natural key and a 50 million row sales fact table using the natural key and an index on that column, it will use about 1.86 GB just for that column and its index. With a 4 byte integer key, it would use about 0.37 GB. Five times the space used with a big hit in performance, especially when you are scanning a table.

It may not seem all that important to have an extra 1.5 GB in the table, but there is a real cost in terms of disk required, backup run time, load run time, and especially query run time. Since the primary purpose of a data warehouse is reporting, anything that makes queries slower is a big negative. Considering the huge numbers of rows in some of the fact tables, anything that makes them even one byte bigger can have a serious cost and performance hit.
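
The arithmetic, roughly (assuming the key column is stored once in the fact table and once in the index):

50,000,000 rows x 20 bytes x 2 = 2,000,000,000 bytes, about 1.86 GB
50,000,000 rows x  4 bytes x 2 =   400,000,000 bytes, about 0.37 GB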






CODO ERGO SUM
Go to Top of Page

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-06-25 : 19:16:22
quote:
Originally posted by Michael Valentine Jones

There are also physical storage and performance issues with the design of a star schema that favor the use of integer surrogate keys. If you have 50,000 products with a 20 byte natural key and a 50 million row sales fact table using the natural key and an index on that column, it will use about 1.86 GB just for that column and its index. With a 4 byte integer key, it would use about 0.37 GB. Five times the space used with a big hit in performance, especially when you are scanning a table.

It may not seem all that important to have an extra 1.5 GB in the table, but there is a real cost in terms of disk required, backup run time, load run time, and especially query run time. Since the primary purpose of a data warehouse is reporting, anything that makes queries slower is a big negative. Considering the huge numbers of rows in some of the fact tables, anything that makes them even one byte bigger can have a serious cost and performance hit.

CODO ERGO SUM



You want to keep the large fact tables as narrow as possible,
I agree with You Michael.

I am inclined to have the "star schemas" not physically stored, but logically stored as views (and the physical tables in 3NF+).
This depends on how much massaging of the data You need to do.
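
Something like this (grossly simplified, names invented):

-- The "dimension" is just a view over the normalized base tables
CREATE VIEW DimProduct_v AS
SELECT p.ProductId AS ProductKey,
       p.ProductName,
       c.ClassName,
       d.DepartmentName
FROM Product p
JOIN ProductClass c ON c.ClassId = p.ClassId
JOIN Department d ON d.DepartmentId = c.DepartmentId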

rantmode>>
If domains were properly implemented in RDBMS, maybe a "large" domain key would not need to be physically stored in each table using it, but only once, and referencing tables would only need a reference to the value of the actual key...
<<rantmode

rockmoose
Go to Top of Page
   
