none
Transform large data set in PQ RRS feed

  • Question

  • Hi

    I have a dataset comprising around 2m rows and 39 columns in text file, on which I perform some transformations in PQ in a separate excel workbook. I have found that Table.Sort and Table.FillDown are very taxing on my query. Are there any alternative ways of sorting data and filling null values based on the cell above, which are efficient? Please see the entire code below.

    After the transformation, the PQ output excel table can only hold just over 1mil rows. Is there a way I can split into multiple outputs to accommodate for my 2 mil rows?

    Thanks

    let
        Source = Csv.Document(File.Contents("Z:\costing_report31012020.asc"),[Delimiter=",", Columns=39, Encoding=1252, QuoteStyle=QuoteStyle.None]),
       PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
       RemovedExtraSpacesFromHeader = Table.TransformColumnNames(PromotedHeaders, fPowerTrim),
            columnNames = Table.ColumnNames(RemovedExtraSpacesFromHeader),
        TrimmedAllColumns = List.Accumulate(columnNames,RemovedExtraSpacesFromHeader,
            (state,current)=> Table.TransformColumns(state,{current, Text.Trim})
        ),
        EmptyAsNull = Table.ReplaceValue(TrimmedAllColumns,"",null,Replacer.ReplaceValue,columnNames),
        replace = Table.ReplaceValue(EmptyAsNull, 0, each if [OU_Lvl_6] <> null then [OU_Lvl_6]
                                                     else if [OU_Lvl_5] <> null then [OU_Lvl_5]
                                                     else if [OU_Lvl_4] <> null then [OU_Lvl_4]
                                                     else if [OU_Lvl_3] <> null then [OU_Lvl_3]
                                                   
                                                     else [OU_Lvl_2],
                                                     (a,b,c)=> if a = null then c else a, {"OU_Lvl_3", "OU_Lvl_4", "OU_Lvl_5", "OU_Lvl_6"}),
        Sort = Table.Buffer(Table.Sort(replace, {{"Emp_No", Order.Ascending}, {"Txn_Type", Order.Descending}})),
        Fillup = Table.Buffer(Table.FillDown(Sort, {"Increm"})),
        replace2 = Table.ReplaceValue(Fillup, 0, each if [Date_Frm] <> null then [Date_Frm]
                                                 else if [Date_To] <> null then [Date_To]
                                                 else [Process_Prd],
                                                 (a,b,c)=> if a = null then c else a, {"Date_Frm", "Date_To"})
    in replace2

    Sunday, April 5, 2020 4:28 AM

Answers

  • Hi M.Awal,

    you can use the query above as staging query without loading anywhere and then reference it with filters in different queries that you then load to the sheets (Keep first/last x rows)

    Performancewise:

    You can eliminate the Buffer in the Fillup-step. I would also try to do the sort-and Fillup as early as possible.

    TrimedAllColumns could be rewritten like so (List.Accumulate costs performance):

        TrimmedAllColumns = Table.TransformColumns(RemovedExtraSpacesFromHeader, List.Transform(columnNames, each{ _, Text.Trim}))

    Having trouble to understand what you're doing with the conditional replacements there, but it might be worth checking if the logic could be simplified.


        


    Imke Feldmann - MVP Data Platform (PowerBI) - http://www.TheBIccountant.com 

    Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to vote it as helpful - Thanks!

    Friday, April 10, 2020 3:00 PM
    Moderator

All replies

  • Hi M.Awal,

    you can use the query above as staging query without loading anywhere and then reference it with filters in different queries that you then load to the sheets (Keep first/last x rows)

    Performancewise:

    You can eliminate the Buffer in the Fillup-step. I would also try to do the sort-and Fillup as early as possible.

    TrimedAllColumns could be rewritten like so (List.Accumulate costs performance):

        TrimmedAllColumns = Table.TransformColumns(RemovedExtraSpacesFromHeader, List.Transform(columnNames, each{ _, Text.Trim}))

    Having trouble to understand what you're doing with the conditional replacements there, but it might be worth checking if the logic could be simplified.


        


    Imke Feldmann - MVP Data Platform (PowerBI) - http://www.TheBIccountant.com 

    Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to vote it as helpful - Thanks!

    Friday, April 10, 2020 3:00 PM
    Moderator
  • Hi Imke: Thanks for your advice above and sorry for late reply.

    1)     "TrimmedAllColumns = Table.TransformColumns(RemovedExtraSpacesFromHeader, List.Transform(columnNames, each{ _, Text.Trim}))"

     What if I have multiple blank spaces in between words. Can I something along this line?  Table.TransformColumns( columnNames, List.Transform(columnNames, each {{ _, fTrim (_)}})). But it is throwing an Expression.Error: We cannot convert a value of type List to type Table. Any idea?

    2)Did try performing Table.Sort and Table.FillDown initially, still slow.

    3) "Having trouble to understand what you're doing with the conditional replacements...."

    They represents different hierarchical levels, eg. if Level-2 is blank then fetch the word from left column level-1 and so on. 

    Tuesday, April 14, 2020 10:40 AM
  • Hi, M.Awal.

    can you show some rows with structure of your source, and show the result you want.

    If not, try use Table.AddColumn instead of Table.TransformColumns, it works faster.

    Tuesday, April 14, 2020 12:18 PM
  • File_Structure

    Hi Aleksey and Imke:

    Thanks for your advice. I have tried following your comments. Performance wise still the same. Please the link for file structure. This dataset is not so bad. Some fortnights I get dataset containing 1.9M rows. In that case, I can't even do the same transformation. Also, I need to have the output in multiple tables as 1.9M wouldn't fit in the one worksheet.

    As Imke already suggested to use staging query without loading anywhere and then reference it with filters in different queries that you then load to the sheets (Keep first/last x rows).

    Cheers


    Friday, April 17, 2020 10:43 AM
  • Hi M.Awal. Were you able to make any headway on this?

    Ehren

    Tuesday, May 5, 2020 8:57 PM
    Owner
  • Hi Ehren

    Same position that I was in before.

    Cheers

    Wednesday, June 17, 2020 10:21 AM