none
Serious performance issues: Filtering and sorting 900k rows from 20 .xlsx takes literally for ever! RRS feed

  • Question

  • I hope you can help me with this problem. I am using powerpivot to build a data model for a NGO so they can produce their reports in an easy way.

    With the largest table, I need to get rid off some rows that distort then the metrics I built in PowerPivot. This table is built from around 20 files .xlsx of between 3 to 5Mb each. Because of the filtering and sorting (as far I could see step by step in the procedure) the process gets so heavy that it even never ends.

    Reading a lot in some blogs, I tried already gathering around 15 of those files into one, buffering some of the steps with table.buffer(), creating a new file and changing the files extensions to .csv. Nothing seems to work yet.

    I have a 64-bit surface 3 pro (i3 @ 1.5Ghz with 4GB RAM) with Windows 10 Pro and Office64x.

    What can I do? Is this normal?

    Thank you very much always for your great support,
    Gerónimo

    Here is my code:

    let
        Source = Folder.Files("C:\Users\Goyalee\Desktop\New folder\Historial de Aportes"),
        #"Invoke Custom Function1" = Table.AddColumn(Source, "Transform Binary from Historial de Aportes", each #"Transform Binary from Historial de Aportes"([Content])),
        #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
        #"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transform Binary from Historial de Aportes"}),
        #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform Binary from Historial de Aportes", Table.ColumnNames(#"Removed Other Columns1"[#"Transform Binary from Historial de Aportes"]{0})),
        #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1",{{"Source.Name", type text}, {"COD. RTA. TRX.", type text}, {"NO. AUTORIZACIÓN", type text}, {"ESTADO DE TRANSACCIÓN", type text}, {"COD RTA. ENT.", type text}, {"NO. IDENTIFICACIÓN", type text}, {"TIPO IDENTIFICACIÓN", type text}, {"NO. TRANSACCIÓN", type text}, {"FECHA DE TRANSACCIÓN", type datetime}, {"VALOR", Int64.Type}, {"CONCEPTO", type text}, {"ID TRX ORGANIZACIÓN", type text}, {"ID DONANTE ORGANIZACIÓN", type text}, {"ID. COMP. ORGANIZACIÓN", type text}, {"CÓDIGO URL", type any}}),
        #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Source.Name"}),
        #"Removed Duplicates" = Table.Distinct(#"Removed Columns", {"ID TRX ORGANIZACIÓN"}),
        #"Removed Blank Rows" = Table.SelectRows(#"Removed Duplicates", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
        #"Changed Type with Locale" = Table.TransformColumnTypes(#"Removed Blank Rows", {{"FECHA DE TRANSACCIÓN", type date}}, "es-CO"),
        #"Changed to COP" = Table.TransformColumnTypes(#"Changed Type with Locale", {{"VALOR", Currency.Type}}, "es-CO"),
        #"Replaced Value" = Table.ReplaceValue(#"Changed to COP","1","Aprobado",Replacer.ReplaceText,{"ESTADO DE TRANSACCIÓN"}),
        #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","3","Rechazado",Replacer.ReplaceText,{"ESTADO DE TRANSACCIÓN"}),
        #"Replaced Value2" = Table.ReplaceValue(#"Replaced Value1","2","Pendiente de recobro",Replacer.ReplaceText,{"ESTADO DE TRANSACCIÓN"}),
        #"Add Index" = Table.AddIndexColumn(#"Replaced Value2", "Index", 0, 1),
        #"Filter empty PledgeIDs" = Table.SelectRows(#"Add Index", each ([ID. COMP. ORGANIZACIÓN] = "")),
        #"Filter not empty PledgeIDs" = Table.SelectRows(#"Add Index", each ([ID. COMP. ORGANIZACIÓN] <> "")),
        #"Add YearMonth" = Table.AddColumn(#"Filter not empty PledgeIDs", "YearMonth", each Date.Year([FECHA DE TRANSACCIÓN])*100+Date.Month([FECHA DE TRANSACCIÓN])),
        #"Group" = Table.Group(#"Add YearMonth", {"ID. COMP. ORGANIZACIÓN", "YearMonth"}, {{"AllDate", each Table.Max(_,"FECHA DE TRANSACCIÓN"), type record}}),
        #"Expanded AllDate" = Table.ExpandRecordColumn(Group, "AllDate", {"COD. RTA. TRX.", "NO. AUTORIZACIÓN", "ESTADO DE TRANSACCIÓN", "COD RTA. ENT.", "NO. IDENTIFICACIÓN", "TIPO IDENTIFICACIÓN", "NO. TRANSACCIÓN", "FECHA DE TRANSACCIÓN", "VALOR", "CONCEPTO", "ID TRX ORGANIZACIÓN", "ID DONANTE ORGANIZACIÓN", "ID. COMP. ORGANIZACIÓN", "CÓDIGO URL", "Index", "YearMonth"}, {"COD. RTA. TRX.", "NO. AUTORIZACIÓN", "ESTADO DE TRANSACCIÓN", "COD RTA. ENT.", "NO. IDENTIFICACIÓN", "TIPO IDENTIFICACIÓN", "NO. TRANSACCIÓN", "FECHA DE TRANSACCIÓN", "VALOR", "CONCEPTO", "ID TRX ORGANIZACIÓN", "ID DONANTE ORGANIZACIÓN", "ID. COMP. ORGANIZACIÓN.1", "CÓDIGO URL", "Index", "YearMonth.1"}),
        #"Append empty PledgeIDs" = Table.Combine({#"Expanded AllDate",#"Filter empty PledgeIDs"}),
        #"Sort by Index"= Table.Sort(#"Append empty PledgeIDs", {{"Index", Order.Ascending}}),
        #"Remove YearMonth and Index" = Table.RemoveColumns(#"Sort by Index",{"YearMonth", "Index", "YearMonth.1"})
    in
        #"Remove YearMonth and Index"

    Sunday, May 21, 2017 9:40 PM

Answers

  • First of all, you can avoid some unnecessary steps like assigning a type to columns that will be removed later. I doubt that you ever need "Source.Name" column at all, so why apply transformations to it?

    Second: why you ever need sorting of a table for Power Pivot? Initial sort order of rows means nothing for PowerPivot, AND you are deleting Index column as well. 

    I propose to remove #"Add Index" step (and carefulle change next two steps for referring a correct step. Then you can remove #"Sort by Index" step.

    It is hard to add more details becaue there could be other issues adding a leak of performance, but avoiding unnecessary sort will speed up your query a lot.


    Maxim Zelensky Excel Inside

    Monday, May 22, 2017 7:40 AM

All replies

  • Hi Gerotutu

    Two suggestions:

    1. Excel files are 4-5 times slower to load than CSV !

    I would suggest that you use ".csv" files rather than ".xlsx" files as your big sources files:

    You also can check here :
    http://excelando.co.il/en/analyzing-power-query-performance-source-large-files/#blog

    "changing the files extensions to .csv" is not helping, you need to save your files as new Csv files.

    2. you could try to move your"changed type" step at the end of your query

    Hope it helps,
    Anthony

    Monday, May 22, 2017 7:29 AM
  • First of all, you can avoid some unnecessary steps like assigning a type to columns that will be removed later. I doubt that you ever need "Source.Name" column at all, so why apply transformations to it?

    Second: why you ever need sorting of a table for Power Pivot? Initial sort order of rows means nothing for PowerPivot, AND you are deleting Index column as well. 

    I propose to remove #"Add Index" step (and carefulle change next two steps for referring a correct step. Then you can remove #"Sort by Index" step.

    It is hard to add more details becaue there could be other issues adding a leak of performance, but avoiding unnecessary sort will speed up your query a lot.


    Maxim Zelensky Excel Inside

    Monday, May 22, 2017 7:40 AM