none
What's the Best Strategy for Creating Multiple Tables? RRS feed

  • Question

  • I'm gathering data and meta data from several thousand web pages, and I want to export one Table with data from those pages and a second Table with their meta data.

    Is there a way to set this up so that the two queries—combined—access each web page only once?

    To my knowledge there's not a way that one query can maintain two Tables.

    One possibility would be to create a massive Table with data and meta data, stage it in the Data Model, then use PQ to return data and meta data from the staging table to the two Excel Tables. However, PQ can't RETURN data from the Data Model...as far as I know.

    So what's my best strategy here?


    Founder, ExcelUser.com

    Monday, July 2, 2018 1:29 PM

Answers

  • Hey Charley,

    Here's a more detailed explanation on why that's happening,written by Ehren. 

    From my experience working against web data sources, it is probably best to create a denormalized table with all of the fields that you want in just 1 table (so you only need to hit that API or web service once) and then re-arrange things in DAX using DAX table functions as needed. If you're bound to Excel, then you'd need to do exactly what you're doing which is load to a table and then reference that table to other queries. In that case you should probably create a VBA code to set up a refresh order.

    If Ehren or anyone from the PQ team sees this, he/they might be able to give you more suggestions as to what you might be able to do.

    Here's a blog post that I did a while ago about this topic.

    • Marked as answer by Charley Kyd Sunday, July 8, 2018 1:48 AM
    Sunday, July 8, 2018 12:03 AM

All replies

  •  You could create a List of records that could contain 2 fields:

    1. The Actual Data (as whatever data type you want)
    2. The Metadata (as whatever data type you want)

    That's probably the easiest way.

    Then you'd need to reference that master query and do, for example, a List.Transform( ReferencedList, Record.Field(_, "Data") )

    Monday, July 2, 2018 9:39 PM
  • Miguel,

    Just to be clear, you're saying that I would have three queries, MasterQuery, DataQuery, and MetaQuery. Both DataQuery and MetaQuery would use MasterQuery as their Source. 

    With this method, would MasterQuery run only once for each web page, even though it's called by two other queries?

    Charley



    Founder, ExcelUser.com

    Monday, July 2, 2018 11:42 PM
  • After thinking about this some more, MasterQuery could set up one table with two fields, PageID and Contents. And then, each of the other queries could filter out the contents they don't need and process what they do need.

    In my case, the metadata on each web page is in a vertical list. So in MasterQuery, I would transpose that list into multiple columns for each page, and then merge the columns into a single Contents column. That way, each of the other two queries would merely need to filter the MasterQuery's table, and then split the Contents column into separate columns as needed. 

    Does that sound reasonable?


    Founder, ExcelUser.com

    Tuesday, July 3, 2018 1:14 PM
  • Miguel,

    Just to be clear, you're saying that I would have three queries, MasterQuery, DataQuery, and MetaQuery. Both DataQuery and MetaQuery would use MasterQuery as their Source. 

    With this method, would MasterQuery run only once for each web page, even though it's called by two other queries?

    Charley



    Founder, ExcelUser.com

    You'd need to buffer that List query as the last step. (List.Buffer) That should cache the results of that query  so you can reference the query with no issues and basically just do "one trip" to the internet to get the data that you need and then just use PQ to shape it.

    Tuesday, July 3, 2018 5:57 PM
  • Thanks!

    You're keep referring to a List and I keep referring to a Table. In this case, does it matter?


    Founder, ExcelUser.com

    Tuesday, July 3, 2018 6:12 PM
  • Thanks!

    You're keep referring to a List and I keep referring to a Table. In this case, does it matter?


    Founder, ExcelUser.com

    Is completely up to you. Both approaches work - it's just that instead of using List.Buffer you'd end up using Table.Buffer
    Wednesday, July 4, 2018 2:27 AM
  • Miguel,

    Table.Buffer isn't working for me at all. The two queries that rely on the buffered table from the preceding query both are hitting the original 2393 web pages each. 

    The preceding query ends: Buffer=Table.Buffer(Results) in Buffer

    The only way I've found to stop that behavior is to load the preceding query into a worksheet and then have the last two queries reference that worksheet Table for the data they need.

    By the way, to make sure that the last two queries didn't launch the query that returned the Table, I named the Table's columns and then wrote the last two queries to reference those named ranges, NOT the Table returned by PQ.

    Because I'm working in Excel (not Power BI), this solution seems workable but kludgy. However, I'm not sure what the calculation chain is like.

    Specifically, when QueryA loads a Table in a worksheet and QueryB references named ranges in that Table, does the calculation chain work as expected? Or do the queries run simultaneously? Or randomly? 

    In short, will I have to manually run QueryA to create the Table and then run QueryB to read it?

    Thanks,

    Charley


    Founder, ExcelUser.com

    Saturday, July 7, 2018 12:02 AM
  • Hey Charley,

    Here's a more detailed explanation on why that's happening,written by Ehren. 

    From my experience working against web data sources, it is probably best to create a denormalized table with all of the fields that you want in just 1 table (so you only need to hit that API or web service once) and then re-arrange things in DAX using DAX table functions as needed. If you're bound to Excel, then you'd need to do exactly what you're doing which is load to a table and then reference that table to other queries. In that case you should probably create a VBA code to set up a refresh order.

    If Ehren or anyone from the PQ team sees this, he/they might be able to give you more suggestions as to what you might be able to do.

    Here's a blog post that I did a while ago about this topic.

    • Marked as answer by Charley Kyd Sunday, July 8, 2018 1:48 AM
    Sunday, July 8, 2018 12:03 AM
  • Miguel,

    About Table.Buffer, Ehren wrote, "However, because buffering happens in memory and is not persisted on disk, buffering during the load of one query does not affect the load of another query. If Q1 is buffered when Q2 is loaded, Q1 will be re-buffered when Q3 is loaded."

    That makes no sense to me, for two reasons. First, Table.Buffer seems to have NO effect on ANY subsequent queries in the same workbook after a refresh-all. Second, Ehren doesn't explain why Q2 apparently destroys the Q1 buffer if Q3 runs in the same workbook and is refreshed with the same refresh-all. 

    That is, I've found no indication that Table.Buffer EVER does anything useful in Excel, in any circumstance.

    Your VBA suggestion is very disappointing. It's a kludge to fix a kludge. (To be clear, I'm disappointed with Microsoft, not you, Miguel.) Instead, Microsoft could have added a function to define a refresh order. If I can do it in VBA, they could do it for M.

    But particularly, I see no user benefit to Microsoft providing a Buffer that doesn't BUFF, so to speak.



    Founder, ExcelUser.com

    Sunday, July 8, 2018 1:47 AM
  • On a counterpart, you can say that this system is perfect for something like a SQL system since this is where Query folding shines.

    Imagine that you reference 2 queries from a single one that has a SQL Database data source. The 2 child queries will try to perform some sort of query folding in order to create the best SQL statement and save quite a bit of local workspace usage.

    The system is not perfect - I agree. I wish that I could tell PQ how to chain interactions/refresh operations or when to cache a query that should be used on another query.

    Hopefully Ehren will see this post and make a comment about it.

    Sunday, July 8, 2018 9:25 PM
  • Obviously, I'm looking at PQ from a completely different perspective...

    For years, I've encouraged companies to include external data with their internal reports. Internal data tells us WHAT happened; external data adds CONTEXT to our performance. It helps us to better understand WHY it happened and what might happen next.

    (These days, for example, I seldom create a chart of a long-term measure of business performance without including shaded areas that mark periods of slowdowns or recessions.)

    In the past, it was a PITA to accumulate and store external data for internal use. But now, with PQ, it's easy...with the exception of query stuttering, so to speak. 

    I'm sure that other non-SQL sources of data have the same problem. And that problem will continue to grow as Microsoft gives us access to ever more sources of non-SQL data.

    We need a RELIABLE way in M for Excel to tell PQ that there's no trespassing beyond this point. The Table.Buffer function SHOULD do that, but doesn't.


    Founder, ExcelUser.com

    Sunday, July 8, 2018 10:29 PM
  • Obviously, I'm looking at PQ from a completely different perspective...

    For years, I've encouraged companies to include external data with their internal reports. Internal data tells us WHAT happened; external data adds CONTEXT to our performance. It helps us to better understand WHY it happened and what might happen next.

    (These days, for example, I seldom create a chart of a long-term measure of business performance without including shaded areas that mark periods of slowdowns or recessions.)

    In the past, it was a PITA to accumulate and store external data for internal use. But now, with PQ, it's easy...with the exception of query stuttering, so to speak. 

    I'm sure that other non-SQL sources of data have the same problem. And that problem will continue to grow as Microsoft gives us access to ever more sources of non-SQL data.

    We need a RELIABLE way in M for Excel to tell PQ that there's no trespassing beyond this point. The Table.Buffer function SHOULD do that, but doesn't.


    Founder, ExcelUser.com

    From external sources, you can create your own custom connector and have a bit of more control over what goes down. Especially if those external sources offer a REST API or some sort of web service - that's where Custom Connectors come in clutch. Sadly, they're currently only part of the Power BI experience.
    Sunday, July 8, 2018 11:54 PM
  • I haven't run into any sources that offer RESTful APIs. In fact, most do what they can to discourage screen scraping.  

    That particularly includes stock market and real estate sites, in my experience.

    Ideally, PQ would encourage agile access, including data-source agility.


    Founder, ExcelUser.com

    Monday, July 9, 2018 12:37 AM
  • Usually most SaaS or these "data" vendors offer a paid service so you can have access to their API. They all have something in their legal terms of use agreement that discourages web scrapping.

    I did a quick google search and found this API for stock data.

    Monday, July 9, 2018 1:48 AM
  • Thanks, Miguel!

    Founder, ExcelUser.com

    Monday, July 9, 2018 1:51 AM
  • This looks very promising:

    [excel.uservoice.com] Admin response: 'Power Query - cache shared nodes' - Sun 2018-07-08 19:45

    Hi all,

    I wanted to share with you that the new query caching mechanism in Excel has been deployed to Office Insiders starting from Excel version 1801 (build 9001.*). At this stage, we would like to allow some “baking” time as we monitor the feature health metrics.

    In this scope, we need your help to ensure a quality release! I encourage you to try the following scenarios and share your feedback: 1. Run Refresh All on a complex workbook with multiple query dependencies. Does it work faster?

    2. Run Refresh All on a complex workbook with multiple query dependencies. Do you see any issues with your data?

    3. Refresh a single query several times. Do you see any issues with your data?

    - The Excel Team

    Excel Team [MSFT]

    Excel Team [MSFT]

    Admin, Office.com

    Monday, July 9, 2018 7:53 AM
  • Agreed. Thanks, Anthony!


    Founder, ExcelUser.com

    Monday, July 9, 2018 1:13 PM