Dealing with somewhat related datasets: missing values, "top values"

Unanswered Dealing with somewhat related datasets: missing values, "top values"

  • Friday, March 08, 2013 12:38 PM
     
     

    I am working through the US Census microdata (http://www.census.gov/acs/www/data_documentation/pums_data/) as a learning exercise.  I have found on this site the results of about ten different (but related) questionaires filled out by several million people.  I have about 17 millions records with 200 columns of data from one questionaire.  Almost all of the columns have numeric data with the values spelled out in lookup tables (that I need to build).  There are a few columns (income, rent, utilities costs) that have an actual integer value.

    One of the challenges that I have is that many of the questions are not asked in certain questionaires.  If the value in the column uses a lookup, my approach is simple:  create additional lookup table rows, one for "Column not populated in this set" and a second for "Column is empty in this set."  My problem is what to do with the actual integer values.  Since I am going to want to aggregate these values, I would like to put a null in the column.  But I would still like to distinguish between "missing because the questiionaire didn't ask question" and "person did not put a value."

    Here are the ideas I have so far:

    1) One fact table for each questionaire that has covers all of these actual integer values.

    2) One fact table, but add a bit column to identify whether the null means "not in the questionaire" or "not filled in".

    I think that either will have challenges when I query this.

    Please let me know your thoughts. 


    Russel Loski, MCT, MCSA SQL Server 2012, 2008, MCITP Business Intelligence Developer and Database Developer 2008 Twitter: @sqlmovers; blog: www.sqlmovers.com

All Replies

  • Friday, March 08, 2013 1:39 PM
     
     
    It would obviously depend on the reporting requirements, but my instinct tells me to have a single "QuestionnaireAnswers" fact table, linking (independently) to dimensions for the Questionnaire and Questions, therefore alleviating the need for a bit column as you simply exclude the reference to the question.
  • Friday, March 08, 2013 1:52 PM
     
     

    I realized as soon as I posted my question that I really need to profile my data first.  Also, building a structure without reporting requirements is backwards.

    Thank you Mike for your reply.  When I profile the data and at least develop some hypothetical reporting requirements, I will see how your suggestion works.


    Russel Loski, MCT, MCSA SQL Server 2012, 2008, MCITP Business Intelligence Developer and Database Developer 2008 Twitter: @sqlmovers; blog: www.sqlmovers.com