Poor performance importing text file in Power Query

  • Question

  • Hello,

    I'm dealing with a 122MB .txt file.

    It contains ~20k rows and 4.3k columns (yes, I thought that was absurd too), separated by semicolons.

    I did some cleaning to extract the information I needed, with the first step being splitting the column by the semicolon delimiter; I then merge the result with another query. However, loading the data takes far too long (8-10 minutes).

    Any ideas on how to improve performance?


    Here is my M code (I truncated the first two steps because they list all 4.3k columns):

    let
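        // Lines.FromBinary reads each line as a single text value, so every row then has to be split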
        Source = Table.FromColumns({Lines.FromBinary(File.Contents("G:\exporta.txt"),null,null,1252)}),
        #"Split Column by Delimiter" = Table.SplitColumn(Source,"Column1",Splitter.SplitTextByDelimiter(";", QuoteStyle.Csv),{"Column1.1", .....
        #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"Column1.1", type text}, .....
        #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type"),
        #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers",{"_Data", "_Hora", "cScore", "_ID_log", "CNPJ", "CPF"}),
        #"Merged Columns" = Table.CombineColumns(#"Removed Other Columns",{"CNPJ", "CPF"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"CPFCNPJ"),
        #"Reordered Columns" = Table.ReorderColumns(#"Merged Columns",{"CPFCNPJ", "_Data", "_Hora", "cScore", "_ID_log"}),
        #"Replaced Value" = Table.ReplaceValue(#"Reordered Columns",".","",Replacer.ReplaceText,{"CPFCNPJ"}),
        #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","-","",Replacer.ReplaceText,{"CPFCNPJ"}),
        #"Replaced Value2" = Table.ReplaceValue(#"Replaced Value1","/","",Replacer.ReplaceText,{"CPFCNPJ"}),
        #"Changed Type1" = Table.TransformColumnTypes(#"Replaced Value2",{{"CPFCNPJ", Int64.Type}, {"_Data", type date}, {"_Hora", type time}}),
        #"Added Custom" = Table.AddColumn(#"Changed Type1", "Data_Hora", each [_Data]&[_Hora]),
        #"Changed Type2" = Table.TransformColumnTypes(#"Added Custom",{{"Data_Hora", type datetime}}),
        #"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"_Data", "_Hora"}),
        #"Reordered Columns1" = Table.ReorderColumns(#"Removed Columns",{"CPFCNPJ", "Data_Hora", "cScore", "_ID_log"}),
        #"Merged Columns1" = Table.CombineColumns(Table.TransformColumnTypes(#"Reordered Columns1", {{"CPFCNPJ", type text}, {"Data_Hora", type text}}, "pt-BR"),{"CPFCNPJ", "Data_Hora"},Combiner.CombineTextByDelimiter(";", QuoteStyle.None),"Merged")
    in
        #"Merged Columns1"

    Monday, February 13, 2017 11:57 AM

Answers

  • Hi williamfuu. Here are some recommendations:

    • Move the step that excludes most of the 4.3k columns (currently called Removed Other Columns) as early in the flow as possible; right now quite a bit of other work happens before that point (see the combined sketch below).

    • Instead of using Lines.FromBinary, use Csv.Document. Something like the following, replacing 4300 with the minimum number of columns you need to import in order to get the handful you'll actually keep, and replacing QuoteStyle.None with QuoteStyle.Csv if your data contains quoted fields (fields that may themselves contain semicolons as part of the data):

      = Csv.Document(File.Contents("G:\exporta.txt"),[Delimiter=";", Columns=4300, Encoding=1252, QuoteStyle=QuoteStyle.None])
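
    Putting the two suggestions together, here's a minimal sketch (it assumes the file really has 4,300 columns and reuses the column names from your query; swap in QuoteStyle.Csv as noted above):

    let
        // Parse the semicolon-delimited file directly instead of splitting one giant text column
        Source = Csv.Document(File.Contents("G:\exporta.txt"),[Delimiter=";", Columns=4300, Encoding=1252, QuoteStyle=QuoteStyle.None]),
        #"Promoted Headers" = Table.PromoteHeaders(Source),
        // Drop the unneeded columns immediately, before any other work happens
        #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers",{"_Data", "_Hora", "cScore", "_ID_log", "CNPJ", "CPF"}),
        // Only type the handful of columns you keep, not all 4.3k
        #"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"_Data", type date}, {"_Hora", type time}})
    in
        #"Changed Type"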

    Let me know if this helps.

    Ehren

    Monday, February 13, 2017 8:37 PM

All replies

  • Ehren,

    These suggestions are brilliant.

    • By deleting the step where Power Query was unnecessarily converting types for all my columns (#"Changed Type"), the refresh time was cut down to ~2min 30s. In hindsight, that step was clearly useless.
    • Good to know you can import .txt files using the Csv functions. I tried importing only the first 20 columns and performance was even better: I refreshed my data model in just under a minute. This might not be the final solution, though, because the last column in my text file could be useful (see the sketch below).
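
    If that last column does turn out to be needed, I could keep Columns=4300 so the full width is parsed but still select only the fields I need right away. A rough sketch ("UltimaColuna" stands in for the real header of that last column):

    let
        Source = Csv.Document(File.Contents("G:\exporta.txt"),[Delimiter=";", Columns=4300, Encoding=1252, QuoteStyle=QuoteStyle.None]),
        #"Promoted Headers" = Table.PromoteHeaders(Source),
        // Keep only the needed fields immediately; the last column stays available without the rest
        #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers",{"_Data", "_Hora", "cScore", "_ID_log", "CNPJ", "CPF", "UltimaColuna"})
    in
        #"Removed Other Columns"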

    Thank you!


    Tuesday, February 14, 2017 12:46 PM