locked
Parsing field for multipule values RRS feed

  • Question

  • I'm working with a field that I need to parse, but I'm running into some issues with trying to account for the multipule values to key off of that you need to parse from.

    I'm trying to parse a field that contains first and last name and the occaisional middle initial. I've included a list of all the iderations that I've came across while trying to parse this thing out. I've been able to account for the first two through either tsql coding or using a function. However the last one has thrown me and while I can get the first name parsed, I'm having some issues getting the rest of it parsed. Also, these fields aren't always populated, so I need to be able to account for NULL and BLANKS. I've also included the code I have so far

    FullNameField
    
    ----------------
    
    LASTNAME FIRSTNAME <-- Space between LN and FN
    
    LASTNAME, FIRSTNAME <-- Comma Separated
    
    FIRSTNAME MIDDLENAME LASTNAME <-- All separated by space but middle name is not always there
    CREATE FUNCTION [dbo].[ufn_ReturnNameWithComma] ( @ReturnName VARCHAR(150))
    RETURNS VARCHAR(150)
    AS
    BEGIN
     RETURN LEFT(@ReturnName, ISNULL(NULLIF(CHARINDEX(',', @ReturnName) - 1, -1), LEN(@ReturnName)))
    END
    
    --This will return the first name fine, but can't handle multiple names
    CREATE FUNCTION [dbo].[ufn_ReturnNameWithSpace] ( @ReturnName VARCHAR(150))
    RETURNS VARCHAR(150)
    AS
    BEGIN
     RETURN LEFT(@ReturnName, ISNULL(NULLIF(CHARINDEX(' ', @ReturnName) - 1, -1), LEN(@ReturnName)))
    END
    • Edited by BigErn782 Thursday, August 12, 2010 6:04 PM Formatting
    Thursday, August 12, 2010 6:04 PM

Answers

  • The easiest way is to try and normalize the data.  You know that a given person cannot have more than 3 names (first, middle, last) so you can convert the commas to empy spaces.  You can then use parsename to split out each element.

    Something like this should work.

    DECLARE @t TABLE(
    col VARCHAR(100)
    );
    
    INSERT INTO @t VALUES ('LASTNAME FIRSTNAME') -- Space between LN and FN
    INSERT INTO @t VALUES ('LASTNAME, FIRSTNAME')-- Comma Separated
    INSERT INTO @t VALUES ('LASTNAME,FIRSTNAME')-- Comma Separated
    INSERT INTO @t VALUES ('LASTNAME  FIRSTNAME') 
    INSERT INTO @t VALUES ('LASTNAME A FIRSTNAME') 
    INSERT INTO @t VALUES ('LASTNAME Adam FIRSTNAME') 
    
    SELECT 
    	col,
    	LTRIM(PARSENAME([CleanStr],1)) AS FirstName,
    	CASE 
    		WHEN PARSENAME([CleanStr],3) IS NULL 
    		THEN NULL
    	ELSE LTRIM(PARSENAME([CleanStr],2))
    	END AS MiddleName,
    	CASE 
    		WHEN PARSENAME([CleanStr],3) IS NOT NULL 
    		THEN LTRIM(PARSENAME([CleanStr],3))
    	ELSE LTRIM(PARSENAME([CleanStr],2))
    	END AS LastName
    FROM(
    	SELECT 
    		col,
    		REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(col,' ', ' |'),'| ',''),' |',' '),',',' '),' ','.') AS CleanStr
    	FROM @t
    ) AS x
    
    • Edited by Adam Haines Thursday, August 12, 2010 7:07 PM syntax
    • Marked as answer by KJian_ Thursday, August 19, 2010 9:03 AM
    Thursday, August 12, 2010 6:21 PM

All replies

  • The easiest way is to try and normalize the data.  You know that a given person cannot have more than 3 names (first, middle, last) so you can convert the commas to empy spaces.  You can then use parsename to split out each element.

    Something like this should work.

    DECLARE @t TABLE(
    col VARCHAR(100)
    );
    
    INSERT INTO @t VALUES ('LASTNAME FIRSTNAME') -- Space between LN and FN
    INSERT INTO @t VALUES ('LASTNAME, FIRSTNAME')-- Comma Separated
    INSERT INTO @t VALUES ('LASTNAME,FIRSTNAME')-- Comma Separated
    INSERT INTO @t VALUES ('LASTNAME  FIRSTNAME') 
    INSERT INTO @t VALUES ('LASTNAME A FIRSTNAME') 
    INSERT INTO @t VALUES ('LASTNAME Adam FIRSTNAME') 
    
    SELECT 
    	col,
    	LTRIM(PARSENAME([CleanStr],1)) AS FirstName,
    	CASE 
    		WHEN PARSENAME([CleanStr],3) IS NULL 
    		THEN NULL
    	ELSE LTRIM(PARSENAME([CleanStr],2))
    	END AS MiddleName,
    	CASE 
    		WHEN PARSENAME([CleanStr],3) IS NOT NULL 
    		THEN LTRIM(PARSENAME([CleanStr],3))
    	ELSE LTRIM(PARSENAME([CleanStr],2))
    	END AS LastName
    FROM(
    	SELECT 
    		col,
    		REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(col,' ', ' |'),'| ',''),' |',' '),',',' '),' ','.') AS CleanStr
    	FROM @t
    ) AS x
    
    • Edited by Adam Haines Thursday, August 12, 2010 7:07 PM syntax
    • Marked as answer by KJian_ Thursday, August 19, 2010 9:03 AM
    Thursday, August 12, 2010 6:21 PM
  • This is going to be really difficult and may be impossible.

    For example, how will you tell the difference between LASTNAME FIRSTNAME and FIRSTNAME LASTNAME (especially since you said that sometimes there is not always a middle name)?

    Also, how would you handle a person whose name is Maria de la Cruz?  What about Thurston Howell III?  Or John Smith Jr?

    Sorry to be discouraging, but I just want to let you know what you're up against.

    But can you tell me what specifically you're trying to accomplish?  Getting the first word/name in the string?  Or a function to get the 2nd or possibly 3rd word/name?


    --Brad (My Blog)
    • Proposed as answer by Naomi N Thursday, August 12, 2010 6:32 PM
    Thursday, August 12, 2010 6:27 PM
  • Thanks Adam, I'll see what I can do with this as this looks as if it will be very helpful
    Big Ern
    Thursday, August 12, 2010 6:33 PM
  • Usually you attempt to split the names, then present a user interface for the user to make manual corrections (and probably correct auto-corrected names as well). I remember I had such application about 10 years ago written in VFP. The application had set of lookup tables (say, we even had a table with common first names) and quite complex program that was attempting parsing names according to some rules. 
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Thursday, August 12, 2010 6:36 PM
  • Bottom line, I'd like to return all names/words from one query/function. We can make these assumptions off of the data.

    1. Going back over the data, I was incorrect about one combination - LASTNAME FIRSTNAME. That is not happening in this data set.

    2. We can assume that the first word/name is the first name, 2nd is middle, and 3rd is last. I know this will cause issues with the 'DE LA Cruz' scenario, but I've verified that it can be broken out this way.

    3. The last name will always be first when the combination LASTNAME, FIRSTNAME is encountered.


    Big Ern
    Thursday, August 12, 2010 6:39 PM
  • Thing thing to remember here is that what every you do it is not likely to be 100% accurate.  You would need a ton of logic and man hours to parse this out nearly 100%.  The code I posted above will work for situations where 3-4 spaces occur and really doesnt do more than that.  There is no magic button to get this right.  I mean from a logical standpoint, how could you discern that "Haines Adam" is in last name first name order, if no comma exists?  The answer is you really cant, so this type of cleanup is unfortunatly very manual. If you cant rely on your users to input the data consitently it makes it that much more difficult to retrieve the data consitently.  with that said, you can use the code above to help migrate the simple names. this will reduce the amount of manual work required to migrate the remaining rows.
    http://jahaines.blogspot.com/
    • Edited by Adam Haines Thursday, August 12, 2010 7:06 PM syntax
    Thursday, August 12, 2010 7:03 PM
  • I do understand that this is not going to be anywhere near 100% accuracy and have made everyone very aware of this. This is only to supplement a system already in place. If it can return 60% accuracy, it would be better than what is currently in place.

    Ufortunately, when plugging the dataset I have with your code, it still hasn't gotten me any closer to an end result. I'm in the process of seeing what is causing the problems (on the data side, not with your code) and I suspect it has to do with the space not being read between the names.


    Big Ern
    Thursday, August 12, 2010 7:15 PM
  • Do splitting process separately for comma in the data and no , in the data, in other words, 2 separate CTEs with different logic and then UNION ALL them in the final result.
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Thursday, August 12, 2010 7:21 PM
  • Big Ern,

    I just modified my code at the top because i thought of a new problem that you may face.  The problem you may be hitting is when there are multiple spaces between names.  In this scenario it becomes very difficult to parse names out.  This is why I fixed the code to normalize the number of spaces between words.  Try the new code out and see if it helps.


    http://jahaines.blogspot.com/
    Thursday, August 12, 2010 7:22 PM