Dealing with somewhat related datasets: missing values, "top values"
-
Friday, March 08, 2013 12:38 PM
I am working through the US Census microdata (http://www.census.gov/acs/www/data_documentation/pums_data/) as a learning exercise. I have found on this site the results of about ten different (but related) questionaires filled out by several million people. I have about 17 millions records with 200 columns of data from one questionaire. Almost all of the columns have numeric data with the values spelled out in lookup tables (that I need to build). There are a few columns (income, rent, utilities costs) that have an actual integer value.
One of the challenges that I have is that many of the questions are not asked in certain questionaires. If the value in the column uses a lookup, my approach is simple: create additional lookup table rows, one for "Column not populated in this set" and a second for "Column is empty in this set." My problem is what to do with the actual integer values. Since I am going to want to aggregate these values, I would like to put a null in the column. But I would still like to distinguish between "missing because the questiionaire didn't ask question" and "person did not put a value."
Here are the ideas I have so far:
1) One fact table for each questionaire that has covers all of these actual integer values.
2) One fact table, but add a bit column to identify whether the null means "not in the questionaire" or "not filled in".
I think that either will have challenges when I query this.
Please let me know your thoughts.
Russel Loski, MCT, MCSA SQL Server 2012, 2008, MCITP Business Intelligence Developer and Database Developer 2008 Twitter: @sqlmovers; blog: www.sqlmovers.com
All Replies
-
Friday, March 08, 2013 1:39 PMIt would obviously depend on the reporting requirements, but my instinct tells me to have a single "QuestionnaireAnswers" fact table, linking (independently) to dimensions for the Questionnaire and Questions, therefore alleviating the need for a bit column as you simply exclude the reference to the question.
-
Friday, March 08, 2013 1:52 PM
I realized as soon as I posted my question that I really need to profile my data first. Also, building a structure without reporting requirements is backwards.
Thank you Mike for your reply. When I profile the data and at least develop some hypothetical reporting requirements, I will see how your suggestion works.
Russel Loski, MCT, MCSA SQL Server 2012, 2008, MCITP Business Intelligence Developer and Database Developer 2008 Twitter: @sqlmovers; blog: www.sqlmovers.com

