Performance of List.Generate()

  • Question

  • Hi all,

    I'm working with a csv file that contains one text field that, in some cases, contains line breaks. Obviously this breaks the built-in csv import functionality, so I've been experimenting with custom M code to solve the problem. The following query works for files with a few thousand rows, but when I point it at the full 100MB file, memory usage on my PC goes through the roof and I have to kill Excel:

    let
        // Buffer the whole file so it is only read from disk once
        BinaryFile = Binary.Buffer(File.Contents("C:\Test.csv")),
        Source = Text.Trim(Text.FromBinary(BinaryFile)),
        // 18 pipes followed by a CRLF marks the end of each logical row
        DelimiterList = List.Combine({List.Repeat({"|"}, 18), {"#(cr,lf)"}}),
        ColumnCount = List.Count(DelimiterList),
        LineSplitter = Splitter.SplitTextByEachDelimiter(DelimiterList),
        // Repeatedly split off one row's worth of fields, passing the
        // unsplit remainder of the text along as the next state
        GetContentsList = List.Generate(
            () => LineSplitter(Source),
            each _{ColumnCount}? <> null,
            each LineSplitter(_{ColumnCount}),
            each List.Range(_, 0, ColumnCount)
        ),
        OutputTable = Table.FromRows(GetContentsList),
        AddColumnNames = Table.PromoteHeaders(OutputTable)
    in
        AddColumnNames

    I've experimented with adding Binary.Buffer() and List.Buffer() in a few places, but nothing seems to make a difference. I'm considering some other approaches, but I would also like to know if there's a way of making this code perform well - it looks like List.Generate() is the problem here. Any suggestions?
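    For anyone following the logic of the query above, the same row-reassembly idea - keep accumulating text until a full row's worth of delimiters has been seen - can be sketched in Python (the field count, delimiter, and sample data here are made up for illustration, not taken from the real file):

    ```python
    import io

    def rows_with_embedded_newlines(text, n_fields=19, delim="|"):
        """Reassemble logical CSV rows when one text field may contain
        line breaks. A physical line is only a complete row once it
        holds n_fields - 1 delimiters; until then, keep appending the
        next physical line (the break was inside a field)."""
        pending = ""
        for line in io.StringIO(text):
            pending += line.rstrip("\n")
            if pending.count(delim) >= n_fields - 1:
                yield pending.split(delim, n_fields - 1)
                pending = ""
            else:
                pending += "\n"  # restore the break inside the field
        if pending:
            yield pending.split(delim, n_fields - 1)

    sample = "a|b|hello\nworld|d\ne|f|g|h\n"
    for row in rows_with_embedded_newlines(sample, n_fields=4):
        print(row)
    ```

    As with the M version, this only works if the embedded line breaks never bring extra delimiters with them.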

    Thanks,

    Chris


    Check out my MS BI blog. I also do SSAS, PowerPivot, MDX and DAX consultancy and run public SQL Server and BI training courses in the UK

    Thursday, February 4, 2016 11:04 PM

Answers

  • Hi Chris. I've tried your M code against a mocked-up dataset, and I think I've finally figured out what's going on.

    If the problem were with List.Generate, then you'd likely see issues in the mashup container evaluation process. But for me at least, that doesn't seem to be the case. The List.Generate approach might be a bit slow, but it doesn't consume unnecessary memory. (It also streams when loading to Excel, which means it should work over very large data sets.)
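    The streaming behaviour is easiest to picture if you think of List.Generate as a lazy generator: each item is produced on demand from the previous state, so only the current state is held in memory at any time. A rough Python analogue (just an illustration of the evaluation model, not the actual mashup engine):

    ```python
    def generate(initial, condition, next_state, selector):
        """Python analogue of M's List.Generate: items are produced
        lazily, one at a time, so memory use stays flat no matter
        how many items the consumer pulls."""
        state = initial()
        while condition(state):
            yield selector(state)
            state = next_state(state)

    # Nothing runs until the generator is consumed; each step only
    # needs the current state, exactly like a streaming evaluation.
    squares = generate(lambda: 0, lambda n: n < 5, lambda n: n + 1, lambda n: n * n)
    print(list(squares))  # [0, 1, 4, 9, 16]
    ```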

    However, I was able to get Excel to lock up under certain conditions. Perhaps this is what you were running into.

    During the initial fill to the data model, we are careful to not do things that would block UI interaction. However, if you refresh (e.g. by hitting Close & Load for a query that's already loaded to the data model, or clicking Refresh All on the Data tab), Excel is in control and prevents UI interaction while the refresh is occurring.

    Here are the steps that seem to reproduce the behavior you described:

    1. Do the initial load to the data model with a small sample of data.
    2. Modify my query to pull a much larger amount of data and click "Close & Load".

    Step 1 completes quickly, and Excel remains responsive. However, step 2 causes Excel to lock up for the duration of the load operation.

    Ehren


    Wednesday, February 10, 2016 8:30 PM

All replies

  • Hi Chris,

    Could you post an example file somewhere (just a small sample of the data, of course)? I understand your code, but my English is not as good as I'd wish, so I prefer working with hard data to descriptive information.
    BTW, maybe you could export the data to csv with one additional column that contains some unique value as a cut marker (ten spaces, say)? Then the task would be simple.
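    The cut-marker idea would look something like this in Python (the marker string and sample data are hypothetical):

    ```python
    def rows_by_marker(text, marker="##ENDROW##", delim="|"):
        """Split the raw export on an explicit end-of-row marker, so
        line breaks inside fields can never be mistaken for row
        boundaries."""
        chunks = text.split(marker)
        return [c.strip("\r\n").split(delim) for c in chunks if c.strip()]

    raw = "a|b|multi\nline##ENDROW##c|d|e##ENDROW##"
    print(rows_by_marker(raw))
    ```

    Because the marker is guaranteed unique, no delimiter counting is needed at all.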

    Regards

    Bill


    • Edited by Bill Szysz Friday, February 5, 2016 1:33 PM
    Friday, February 5, 2016 1:32 PM
  • Thanks Bill, this is customer data and it would take some time to anonymize it. In any case I have another approach which works well now.

    However, I would still like to know what the problem with this approach was. It was quite strange: it seemed as though everything froze when Power Query memory usage went over a certain level. For the solution that worked, the query executed very quickly; for the solutions that didn't, I was unable to cancel the query and had to kill Excel.

    Chris



    Monday, February 8, 2016 10:06 PM
  • Hi Ehren,

    That sounds like it could be my problem - I had loaded a sample dataset into the data model before trying the full dataset. It sounds like deleting the sample dataset before loading the full dataset might have helped here?

    Thanks for your help,

    Chris



    Wednesday, February 10, 2016 10:30 PM
  • Yes, if this was indeed the issue then deleting the sample before loading the full dataset would have helped.

    Ehren

    Thursday, February 11, 2016 10:01 PM