Import Text File

  • Question

  • Hi: I need to import a large text file every fortnight to do some basic data transformation in PQ. In the text file (as per below) there are multiple blank spaces in the header row as well as in the transaction rows. Each field is enclosed in double quotes. What is the best way to get rid of all the extra spaces (including the spaces in the header row), given the file contains over 500K rows? Is PQ suitable for dealing with large numbers of rows? Thank you in advance for any assistance.

    https://1drv.ms/t/s!Amc8fiGpDxekh2B16Q5B-WWqV7Y2?e=S7ZxiY

    Monday, December 23, 2019 1:43 AM


All replies

  • Hi

    You could read the file and apply Text.Trim to all columns before the Promote Headers step.

    Not sure how the performance will be, but I believe it's reasonably fast for 500K rows.

    Monday, December 23, 2019 8:11 AM
  • Hi: Thanks for your prompt reply. Text.Trim does get rid of leading and trailing spaces. However, PQ doesn't allow me to select all the columns (39 in this instance) and perform Text.Trim; I can only do it individually. Having said that, it doesn't get rid of the extra spaces inside the column headers even if I do it one column at a time.

    Cheers

    Monday, December 23, 2019 8:53 AM
  • Hi S: I tried the following code, which works with text columns but doesn't work with numeric or date type columns. I am not sure about the performance of this code for 500K rows or more.

    Monday, December 23, 2019 10:10 AM
  • Hi M.Awal

    To Trim "inside" a string you need to use a function like the following:

    fnPowerTrim:

    // Original from Ken Puls's blog
    (text as text, optional char_to_trim as text) as text =>
    let
        char = if char_to_trim = null then " " else char_to_trim,
        split = Text.Split(text, char),
        removeblanks = List.Select(split, each _ <> ""),
        result = Text.Combine(removeblanks, char)
    in
        result
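
    As a quick illustration (assuming the function is saved as a query named fnPowerTrim), it collapses any run of the trim character inside the string, not just at the ends:

    // fnPowerTrim("  Txn   Type  ")  returns "Txn Type"
    // fnPowerTrim("--A--B--", "-")   returns "A-B"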

    To Trim your Headers + all your columns - whatever the number is - you could do something like:

    let
        Source = Csv.Document(File.Contents(PathToYourTextFile),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]),
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
        RemovedExtraSpacesFromHeader = Table.TransformColumnNames(PromotedHeaders, fnPowerTrim),
        TrimmedAllColumns = List.Accumulate(Table.ColumnNames(RemovedExtraSpacesFromHeader),RemovedExtraSpacesFromHeader,
            (state,current)=> Table.TransformColumns(state,{current, Text.Trim})
        )
    in
        TrimmedAllColumns

    39 columns * 500k rows, good luck… ;-)
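
    If you're not familiar with List.Accumulate: it folds a list into a single value, feeding each list item together with the running state into the accumulator function. A minimal sketch of the pattern used above:

    // List.Accumulate({1, 2, 3}, 0, (state, current) => state + current)  returns 6
    // In the query above, the seed is the table itself and each step
    // returns a new table with one more column trimmed.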

    EDIT: if you want something really clean you could also do:

    let
        Source = Csv.Document(File.Contents(PathToYourTextFile),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]),
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
        RemovedExtraSpacesFromHeader = Table.TransformColumnNames(PromotedHeaders, fnPowerTrim),
        columnNames = Table.ColumnNames(RemovedExtraSpacesFromHeader),
        TrimmedAllColumns = List.Accumulate(columnNames,RemovedExtraSpacesFromHeader,
            (state,current)=> Table.TransformColumns(state,{current, Text.Trim})
        ),
        EmptyAsNull = Table.ReplaceValue(TrimmedAllColumns,"",null,Replacer.ReplaceValue,columnNames)
    in
        EmptyAsNull



    • Edited by Lz._ Monday, December 23, 2019 3:37 PM Code amendments
    • Marked as answer by M.Awal Tuesday, December 24, 2019 10:26 AM
    Monday, December 23, 2019 10:14 AM
  • Hi Lz: Thanks for the solution. I have seen and used Ken Puls's function above. I will come back to you tomorrow once I've tested it on the real data. Total rows for the last fortnightly pay for the entire organisation came to 1.9M due to back pay and Christmas bonuses. But this is a one-off, not a typical fortnight.

    Cheers


    Monday, December 23, 2019 10:41 AM
  • Thanks again. I tried it and it takes around 30 seconds. Faster than I thought it would be. Great job. However, I had to add a further transformation and I'm getting an error; I'm not sure where it's coming from. The transformation is about first replacing blanks with null and then populating the text string from the column to the left (shown in screenshot 2).

    Also, does it make a difference performance-wise to use Table.Buffer before Table.Sort and Table.FillDown? I find these functions cause performance issues, especially when the table size is half a million rows.

    Tuesday, December 24, 2019 10:26 AM
  • Hi

    It's not quite clear (from your pictures) what you're trying to achieve, though I think I got the idea. So, could you create a separate thread with, if possible, a representative sample of your data after the EmptyAsNull step, and explain which null fields should be replaced by what value and in which columns?


    • Edited by Lz._ Tuesday, December 24, 2019 5:16 PM
    Tuesday, December 24, 2019 11:08 AM
  • Hi Lz: Sorry for not explaining it clearly. An example: Lvl-1 contains no nulls, and if Lvl-2 contains nulls then I would like those null values filled from the values in Lvl-1. The same goes for Level-3 & 4.

    Anyway, I pasted the entire code below. There is no error at the moment. The only problem is that Table.FillDown is excruciatingly slow.

    let
        Source = Csv.Document(File.Contents("filepath"),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]),
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
        RemovedExtraSpacesFromHeader = Table.TransformColumnNames(PromotedHeaders, fnPowerTrim),
        columnNames = Table.ColumnNames(RemovedExtraSpacesFromHeader),
        TrimmedAllColumns = List.Accumulate(columnNames,RemovedExtraSpacesFromHeader,
            (state,current)=> Table.TransformColumns(state,{current, Text.Trim})
        ),
        EmptyAsNull = Table.ReplaceValue(TrimmedAllColumns,"",null,Replacer.ReplaceValue,columnNames),
        replace = Table.ReplaceValue(EmptyAsNull, 0, each if [OU_Lvl_6] <> null then [OU_Lvl_6]
                                                     else if [OU_Lvl_5] <> null then [OU_Lvl_5]
                                                     else if [OU_Lvl_4] <> null then [OU_Lvl_4]
                                                     else if [OU_Lvl_3] <> null then [OU_Lvl_3]
                                                     else [OU_Lvl_2],
                                                     (a,b,c)=> if a = null then c else a, {"OU_Lvl_3", "OU_Lvl_4", "OU_Lvl_5", "OU_Lvl_6"}),
        Sort = Table.Buffer(Table.Sort(replace, {{"Emp_No", Order.Ascending}, {"Txn_Type", Order.Descending}})),
        Fillup = Table.Buffer(Table.FillDown(Sort, {"Increm"})),
        replace2 = Table.ReplaceValue(Fillup, 0, each if [Date_Frm] <> null then [Date_Frm]
                                                 else if [Date_To] <> null then [Date_To]
                                                 else [Process_Prd],
                                                 (a,b,c)=> if a = null then c else a, {"Date_Frm", "Date_To"})
    in
        replace2
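
    A side note on the (a,b,c) replacer used above: as I understand Table.ReplaceValue's custom-replacer form, a is the current cell value, b is the old-value argument (the dummy 0 here), and c is the per-row new value, so only null cells end up replaced. A sketch:

    // (a, b, c) => if a = null then c else a
    // a = value currently in the cell, b = old value (ignored dummy), c = computed new value
    // => null cells take the computed value; everything else is left untouched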

    Tuesday, December 24, 2019 9:25 PM
  • Hi M.Awal

    If you want nulls in [Level-2], [Level-3] and [Level-4] to be replaced with the value from [Level-1] same row:

        ReplacedNulls = List.Accumulate({"Level-2","Level-3","Level-4"}, EmptyAsNull,
            (state,current)=>
                Table.ReplaceValue(state, each Record.Field(_,current), each
                    if Record.Field(_,current) <> null then Record.Field(_,current)
                    else Record.Field(_,"Level-1"),
                    Replacer.ReplaceValue,{current}
                )
        )

    Hope this is what you expect. AFAIK this is the only way to do it in this context.

    • Marked as answer by M.Awal Wednesday, December 25, 2019 12:38 AM
    Tuesday, December 24, 2019 10:37 PM
  • Very creative, the way you use List.Accumulate with Table.ReplaceValue inside it. What's your workaround to the performance issue when you have to use Table.Sort and Table.FillDown on a large dataset?
    Wednesday, December 25, 2019 12:38 AM
  • Very creative, the way you use List.Accumulate with Table.ReplaceValue inside it 

    AFAIK that's the only way to do it in your scenario for 2 reasons:

    1 - Table.ReplaceValue only accepts static values as its 2nd (old value) and 3rd (new value) arguments. In your case you want to take the new value from another column, so you should use the following syntax when the replacement takes place on one column (Level-2 below):

    ReplacedNulls = Table.ReplaceValue(EmptyAsNull, each [#"Level-2"], each
        if [#"Level-2"] <> null then [#"Level-2"]
        else [#"Level-1"],
        Replacer.ReplaceValue,{"Level-2"}
    )

    2 - You cannot do something like:

    ReplacedNulls = List.Accumulate({[#"Level-2"],[#"Level-3"],[#"Level-4"]}, EmptyAsNull,...

    hence Record.Field(…)
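
    For a single hard-coded column the two forms access the same field; Record.Field just takes the column name as a text value, which is what List.Accumulate needs:

    // each [#"Level-2"]                field access with a literal name
    // each Record.Field(_, "Level-2")  same field, name passed as a value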

    BTW I'll be interested to know how this List.Accumulate replacement (on 3 or 4 columns in your case) performs against your 2 million rows - Thanks in advance



     
    Wednesday, December 25, 2019 7:44 AM
  • What's your workaround to the performance issue when you have to use Table.Sort and Table.FillDown on a large dataset?

    I won't comment on Table.FillDown/FillUp as I don't see them as an efficient option in your scenario.

    Re. Table.Sort: AFAIK the only way to speed things up is to buffer the table before sorting it, and looking at your code above I don't think that's what you do with:

    Sort = Table.Buffer(Table.Sort(replace, {{"Emp_No", Order.Ascending}, {"Txn_Type", Order.Descending}}))
     

    Try the following instead and see how it goes:

    Sort = Table.Sort(Table.Buffer(replace), {{"Emp_No", Order.Ascending}, {"Txn_Type", Order.Descending}})

    Again, that's what I know. My recommendation would be that you create a separate thread re. sorting a large table (don't forget to mention 2 million rows * 40 columns) as there could be other option(s) I'm not aware of. Hope this makes sense 

    Wednesday, December 25, 2019 7:59 AM
  • Hi Lz:

    "BTW I'll be interested to know how this List.Accumulate replacement (on 3 or 4 columns in your case) performs against your 2 million rows" .

    - I did performance testing between Table.ReplaceValue by itself and Table.ReplaceValue inside List.Accumulate. The former is slightly faster.

    I also tested Table.Buffer as you recommended; it still takes a while to perform the sort on 500K rows. I also tried Table.Buffer in the first step. No difference. 

    When you say "create a separate thread", did you mean a separate query set as "connection only"? Would you be able to provide an example? Thanks

    Friday, December 27, 2019 9:23 AM
  • Hi M.Awal

    Really appreciate your feedback, and I'm not surprised at all that Table.ReplaceValue is faster than Table.ReplaceValue(state, each Record.Field(_,current), each…). At least that sounds logical (to me)

    Apologies for confusing you with "create a separate thread"; I meant create a new/separate request on this forum. This will avoid treating more than one problem within the same thread/request and, more importantly, contributors who have much more experience than me might give you better recommendations.
    If you go that way, again don't forget to mention #rows, #columns and a good estimate of the time the sorting took, as it might be normal/expected

    Hope this clarifies & makes sense 

    Friday, December 27, 2019 9:42 AM
  • Sounds good. Thanks for the advice.
    Friday, December 27, 2019 9:46 AM
  • In the meantime, and if you haven't already, it's worth reading:

    The 2nd is more Power BI oriented, but not only: it provides a good summary of information you can find by searching e.g. this forum, and you might be interested in some of the memory aspects + Table.Buffer considerations...

    Friday, December 27, 2019 2:03 PM