none
Excel.Workbook fails to import some cells from xls (97-03) workbooks. NOT a merged cell issue RRS feed

  • Question

  • I've come across what seems like unreliable behavior in the Power Query Excel.Workbook function when opening .xls files. It is returning null values for some cells, when in fact they contain values. I can't detect any weirdness with the xls file, nor can I explain why some cells fail and others don't. It's NOT a merged cell issue. The cells contain regular text characters, nothing weird, and PQ isn't throwing any errors. It just returns null when there should be a value.

    The import command is like this:

    Source = Excel.Workbook(File.Contents("C:\MyPath\MyFile.xls"), null, true)

    This was tested using the latest release of Power BI (Version: 2.73.5586.984 64-bit (September 2019). The problem was replicated on another PC in our system.

    •    If we import the original file (xls) using the Excel.Workbook function, certain cells that contain values are imported as nulls.
    •    If we save the file as xlsx and import using the Excel.Workbook function, the cell contents import correctly.
    •    If we save the file as csv and import using the Csv.Document function, the cell contents import correctly.

    There are workarounds (saving the original xls files as csv or xlsx), but these are clunky, and we'd like to have confidence that the Excel.Workbook function can reliably parse xls files. A lot of the data that we work with was stored in xls files, so this is a common use case for us.

    Thanks in advance for any insight!

    Saturday, September 28, 2019 10:46 AM

Answers

  • Hi Frijoles. We use a library called ACE to read legacy xls files. I reached out to someone on the ACE team, and here's what they shared:

    When Ace loads a sheet, it looks at the first 8 rows to try to guess the data types to use.

    In this case, that means it only sees the first row of data that contains percentages, so it thinks all the columns are number columns, and then fails trying to read in the text in the following rows as numbers.

    The reason January works is that the numbers are formatted as text, so we treat them as text.

    The reason setting HDR=YES makes it work is that then we don’t use the first row to guess, so we see the row of text (which is now the eight row of data) and choose text as the data type rather than number.

    Ehren

    Friday, October 4, 2019 4:03 PM
    Owner

All replies

  • Power Quer y certainly parses them differently (more about it see here for example: https://www.poweredsolutions.co/2018/02/19/power-query-excel-files-xlsx-vs-xls-formats-datatypes/ )

    Although no idea why yours return null unfortunately.


    Imke Feldmann - MVP Data Platform (PowerBI) - http://www.TheBIccountant.com 

    Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to vote it as helpful - Thanks!

    Sunday, September 29, 2019 8:41 AM
    Moderator
  • Hi there. Would you be able to provide a small sample xls that illustrates the issue (with any sensitive info removed, of course)?

    Ehren

    Wednesday, October 2, 2019 11:30 PM
    Owner
  • yes, happy to - how do I submit?
    Thursday, October 3, 2019 7:25 AM
  • Great. Could you share a OneDrive link here (or something similar)?

    Ehren

    Thursday, October 3, 2019 5:12 PM
    Owner
  • January tab should import fine

    April tab should not import the contents of R9C4:R10C6

    https

    ://

    drive.google.com/file/d/1XiWu_OxfSWl3Lw59p2desRHCf8IWyiX6/view?usp=sharing

    Thursday, October 3, 2019 6:07 PM
  • January tab should import fine

    April tab should not import the contents of R9C4:R10C6

    https

    ://

    drive.google.com/file/d/1XiWu_OxfSWl3Lw59p2desRHCf8IWyiX6/view?usp=sharing

    There appears to be a strange behavior, but a workaround would be to use a code like the following:

    let
        Source = Excel.Workbook(File.Contents("C:\Users\MiguelEscobar\Downloads\ImportProblemRepro.xls"), true),
        April1 = Source{[Name="April"]}[Data]
    in
        April1

    Thursday, October 3, 2019 11:23 PM
  • That's curious. It's strange that setting useHeaders to true would have an impact on how the remainder of the sheet is read/imported. I wonder if that provides a clue as to why the import isn't working as expected?
    Excel.Workbook(workbook as binary, optional useHeaders as nullable logical, optional delayTypes as nullable logical)
    Friday, October 4, 2019 1:32 PM
  • We don't have much control over reading xls and xlsb files because for both we rely on the "ACE" provider to be able to read the file. The "useHeaders" flag drives the HDR property on the ACE connection string. After that, the behavior is opaque to us.


    Friday, October 4, 2019 4:03 PM
  • Hi Frijoles. We use a library called ACE to read legacy xls files. I reached out to someone on the ACE team, and here's what they shared:

    When Ace loads a sheet, it looks at the first 8 rows to try to guess the data types to use.

    In this case, that means it only sees the first row of data that contains percentages, so it thinks all the columns are number columns, and then fails trying to read in the text in the following rows as numbers.

    The reason January works is that the numbers are formatted as text, so we treat them as text.

    The reason setting HDR=YES makes it work is that then we don’t use the first row to guess, so we see the row of text (which is now the eight row of data) and choose text as the data type rather than number.

    Ehren

    Friday, October 4, 2019 4:03 PM
    Owner
  • Thank you for the detailed response, that makes sense. Thus it is an artifact of data typing.

    If I understand correctly, the delayTypes argument can be set to FALSE to force typing based on a full scan (slow), or it can be set to TRUE to set types based on the 8-row preview. But it's not clear that type detection and response can be turned off entirely (as it can for unstructured data sources under File/Options/CurrentFile/DataLoad). True?

    Is delayTypes even relevant for xls files? When i set delayTypes to FALSE, I would expect it do a full scan, find text in rows 9 and 10, and then import the full contents of those columns as txt. But that's not what we see.

    More generally, this seems like a use case that demonstrates the value of an option to force type detection off. We frequently work with clients who stored their data in xls files. The headers on the sheets are notoriously messy and variable, and a given column could have any mix of data types, especially at the top of the page. My takeaway here is that to reliably import from these sources that we should actually step out of the 'modern' workflow, save each tab as a csv (using eg VBA), then import into PQ using the Csv.Document function. Any thoughts on alternate approaches?

    Thanks very much for your help!

    Friday, October 4, 2019 4:52 PM
  • Just to clarify: the above is a description of how ACE behaves (which Power Query uses behind-the-scenes) not Power Query in general. The delayTypes option affects xlsx files but not xls. Thus it does not affect the 8-row-scan that ACE performs.

    Ehren

    Friday, October 4, 2019 5:59 PM
    Owner