none
DataWareHouse FK RRS feed

  • Domanda

  • Ciao a tutti,

    Avrei bisogno di un vostro consiglio.

    Ho alcune dimensioni che sono collegate con la tabella dei fatti, devo creare del FK o no ? potrebbe incidere sulle prestazioni ?

    Grazie

     

    martedì 29 novembre 2011 15:10

Tutte le risposte

  • Simone, dato che ho risposto a questa domanda molti anni fa su un forum in Inglese ti allego quello che avevo scritto all'epoca. Tieni conto che è vecchio di dieci anni, anche se non mi risulta che ci siano stati cambiamenti significativi nello stato dell'arte.

    Andrea

    ---

    Referential Integrity in Data Warehouses

    In general it’s good practice to check referential integrity before loading any data into the data warehouse; for this reason I wouldn't recommend relaxing the RI constraints as a good solution.

     

    To better understand the rationale behind this statement, I have included here a summary of what some of the best known DW gurus have to say about the subject.

     

    Not surprisingly, Kimball and Inmon have rather different opinions, although they agree on the basic fact that some form of referential integrity in the data warehouse is needed.

     

    For Kimball, the notion of RI in a data warehouse environment is very straightforward:

     

    “In the warehouse, referential integrity (RI) means that for each foreign key in the fact table, an entry exists in the corresponding dimension table” (The Data Warehouse Lifecycle Toolkit, page 632)

     

    He suggests that the referential integrity should be checked during the loading process in the staging area so that the data are already coherent when they are loaded into the database tables. However, he adds that sometimes this may not be sufficient, because RI could be compromised in many ways (like deleting dimensional records when there are still fact records depending on the dimension key values):

     

    “Strictly speaking, it is not necessary to declare referential integrity between facts and dimensions, but we recommend doing so because every database otherwise develops integrity problems”. (The Data Warehouse Lifecycle Toolkit, page 635)

     

    Warren Thornthwaite has the same approach as Kimball, except for the fact that he does not recommend to put the constraints into the database:

     

    "The only DW specific question is whether to declare foreign key relationships in the database. If your staging/maintenance application enforces referential integrity, it is not necessary to do so in the database. Having the constraints declared and enforced in the database adds overhead during loads, because the database looks up all the foreign keys upon load. This effort is "wasted" if you've already done the work." (The Data Warehouse Institute, 4th Annual Leadership Conference, Architecture, Design & Implementation Class)

     

    On the other side, Inmon gives a more complex interpretation of RI in data warehouses; in his opinion traditional RI constraints are not applicable to the DWH environment (all the following quotes are taken from the article “Referential Integrity For The Data Warehouse Environment”, which can be found at billinmon.com).

     

    This is what Inmon says:

     

    “Why is it that current implementations of referential integrity under most dbms do not work for data warehouses? The first thing any self-respecting data warehouse administrator does upon implementing a data warehouse is to turn off conventional referential integrity features. Referential integrity as it is implemented under a conventional dbms does not apply to data warehousing for three primary reasons …”

     

    The three reasons are: the size of the data, the fact that data updates are not normally done in data warehouses, and the fact that the relationship between two entities could change in time (the third reason is too complex to explain here, so for those interested I’d recommend to read Inmon’s article).

     

    What he suggests is to define a new approach to RI:

     

    “There is still a need to identify and manage data relationships in the data warehouse environment, even if the relationships are fundamentally different than the relationships found in the operational environment. What is needed for the world of data warehousing is a different approach to referential integrity altogether”

     

    Inmon calls this new approach “Bounded referential integrity”, and says that it is similar to traditional RI except for two things:

     

    - applies to a subset of a table or tables found within the data warehouse, not to an entire table or tables, and

    - applies after the data has been loaded into the warehouse, not as the data is being loaded

     

    This kind of RI cannot be enforced by the usual constraints defined in a relational database; one has to write his own programs that do an audit on the data once they are loaded and report exceptions to the data warehouse administrator.

     

    Personally, I have always tried to follow Kimball directions, and they always worked very well. I cannot exclude that in some very large data warehouses the performance implications of leaving the RI constraints enabled could have a negative impact, but this never happened to me.

    One final point: Microsoft documentation also suggests that RI should be enforced: "Referential integrity must be maintained between all dimension tables and the fact table" (SQL Server 2000 books on line).

     

    Andrea Vincenzi

    Independent DW consultant - Rome - Italy

    mercoledì 30 novembre 2011 09:56