Risk of data leak in Power Query when choosing "public"?

  • Question


  • Hello Microsoft Team,

    my company is evaluating the data leak risks of Power Query. Is there any risk when choosing the privacy option "public"?
    I do not expect that a leak of sensitive data is possible, as Microsoft has surely prevented this, but I want to be 100% sure before rolling out the Power Query add-in in the company.

    I have tested it with non-sensitive data. If you import, for example, population data or any other public data with the data search tool in Power Query and then append your own data to the population table, the question about privacy (private, organizational, public) pops up. I then chose these worst-case actions and tried to find my now "public" private data on the internet. I had included some unique words in the title, description and content so that I should have found it easily. But there were no results.

    The core question is thus:
    What exactly can happen in the "worst case" if a user chooses "public"? Is there a risk of private data being leaked into a public cloud and made available to the whole world within that destination cloud, in this case the public cloud hosting the population data? And if so, where would such data from other users around the world be available to me if I wanted to use it?


    The "worst case" scenario in detail (which I do not believe to be possible at all):
    If data can be leaked into a public cloud, a user might inadvertently press public, thinking it would be just a system question regarding her own computer. She would interpret the question just as a choice of restriction within her own secure environment (private, corporate with choosing only certain groups or persons in the company, or the whole private cloud that might be available in the company). If the user is not linked to a company at all or there is no company cloud or sharepoint system anywhere, the choice of public could as well be interpreted as irrelevant. So the user might just press "yes and yes and yes" without caring too much about it, just in order to get further on with the program. She just does not expect that an upload into public cloud would be possible that makes the data accessible in the whole world, especially not in a small Add-In like Power Query in Excel.


    Now I have found several sources that confuse me a lot.

    The Microsoft Power BI blog (!) states that personal data can indeed be leaked into the public cloud if "public" is chosen.
    See:
    <I had to delete this link as my account is not verified, but you can search for the exact words of the following quote and find it directly>

    It contains the following quote (the Yelp cloud corresponds to the Wikipedia population cloud in my example, so Yelp is just an example here as well):
    "At this point, we will be asked for information about the privacy level of my workbook data. This is done so that users don’t accidentally leak data from a private or organizational source and inadvertently send it to a public source (like the Yelp API in this case)."

    Thus, according to this Microsoft Power BI blog, the abstract process is as follows: the user loads some public data from a public cloud somewhere, then wants to merge it with her own data (e.g. for benchmarking or whatever). The public cloud the data came from is then already pre-filled in the privacy pop-up window as the destination cloud. I understand this to mean that the user could upload her personal data to this public cloud.


    Scenario 1: a leak is possible
    - Is there any way to generally prevent any possible leak by forcing a privacy level other than "public" in every case?
    - And/or: is there any way to remove data from a public cloud after such a leak, without having any user account there?
    - How would I even find such user-uploaded data in the cloud afterwards?

    Scenario 2: a leak is impossible (which I expect to be true)
    - If a leak is not possible, is the privacy setting "public" relevant at all for a user who does not have a Power BI account for sharing data within her personal SharePoint or cloud system?

    Many thanks. I hope the answer is just "no leak possible" and not as long as my text here, sorry for that!

    Kind regards,
    LifeTheUniverseAndEverything


    Some other links or quotes:

    1)
    This is from a Power Query tutorial:
    "So I am working with public data and all of a sudden I add some private data to it,
    from an Excel workbook or another data source in my organization,
    and now it shouldn't be public anymore, it shouldn't be available to anybody,
    it actually should be private or organizational."

    2)
    This is a user asking how to suppress the privacy prompt in his obviously data-sensitive company network, so a data leak does not seem to be a concern at all judging by this forum question:
    <I had to delete this link as my account is not verified>

    3)
    If you click on Options in Power Query, you find global and local privacy options. If you open one of those privacy submenus and look at the information icon of the last option, the one that always ignores the privacy settings, you find a hint that this option might cause sensitive data to be shared.

    4)
    This question is probably outdated (2014): a user asking how to share queries publicly but being unable to do so:
    <I had to delete this link as my account is not verified>

    5)
    It is in Dutch, but it states that shared queries are only shared within the company, even when choosing "public". I guess all of these features are only available with a Power BI account.
    <I had to delete this link as my account is not verified>
    -->
    Then you can find the shared queries as a separate option within your private environment, but not in the basic data search from a public source:

    <I had to delete this link as my account is not verified>

    Wednesday, October 12, 2016 8:01 AM

Answers

  • Hi there. When the documents you referenced above talk about "leaking" data, they don't mean that the data is uploaded to the cloud and made available for all to see. Rather, they're referring to the fact that values from one data source can be sent to another data source as part of queries, and that this data can be seen by the operators of those other systems via website logs, server traces, etc.

    Let's look at a couple examples.

    If you're pulling population data from a single Wikipedia page, no data should ever be leaked to it, because Wikipedia (since it's a simple web page source) doesn't support any kind of querying. We will download the page's data, and then locally (i.e. on your machine) use it in conjunction with data from other sources.
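
    To make that concrete, here is a rough Power Query (M) sketch of that case; the page URL and the "MyData" table name are just placeholders for illustration:

        let
            // A plain web page source: Power Query simply downloads the page.
            // There is no query language here to push your own values back to Wikipedia.
            Page = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/List_of_countries_by_population")),
            PopulationTable = Page{0}[Data],
            // Your own workbook table stays on your machine; the append runs locally.
            MyData = Excel.CurrentWorkbook(){[Name = "MyData"]}[Content],
            // Assumes both tables have compatible columns for the append.
            Combined = Table.Combine({PopulationTable, MyData})
        in
            Combined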

    But a public API (such as Yelp, Twitter, Facebook, etc.) is different (as are other sources that support querying, such as SQL, OData, etc.). Let's assume you're joining Yelp data with some internal Excel data, and both of them are categorized as public. In this case, we might realize that we can optimize the call to the public Yelp API by including some data from the Excel file. This data would then be passed to the Yelp http request, and be visible to anyone perusing the internal Yelp web logs. This might be fine ("Who cares if they know I have the zip code 58575 in my data?"), or it might not ("I just sent them our employees' names and social security numbers!"). The privacy settings are intended to be an easy way for the "right" thing to happen automatically. Public data sources never have Organizational or Private data sent to them. And Organizational data sources never have Private data sent to them. But Public to Public, Organizational to Organizational, Public to Organizational, etc. is fine.
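
    As a rough M sketch of that join scenario (the OData URL, table and column names here are hypothetical, and whether the engine actually folds the join depends on the query):

        let
            // A queryable public source (hypothetical OData service URL).
            PublicFeed = OData.Feed("https://example.com/odata/Businesses"),
            // A local workbook table that might contain sensitive values.
            MyCustomers = Excel.CurrentWorkbook(){[Name = "MyCustomers"]}[Content],
            // With both sources marked Public, the mashup engine is allowed to fold this
            // join into the remote request, e.g. by building a filter from the values in
            // MyCustomers; that is how those values can end up in the service's logs.
            Joined = Table.NestedJoin(PublicFeed, {"Name"}, MyCustomers, {"Name"}, "Matches", JoinKind.Inner)
        in
            Joined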

    Regarding your other question, there's currently no way to force data sources to be categorized as non-public.

    I hope this helps clarify the effect of the privacy settings (at least a little). Wrapping your head around them can be a challenge.

    Ehren

    Wednesday, October 12, 2016 6:27 PM

All replies

  • > But a public API (such as Yelp, Twitter, Facebook, etc.) is different (as are other sources that support querying, such as SQL, OData, etc.).

    Hello Ehren,

    thank you for your quick answer, I just took some time to think about it :). It is clear then that in normal use of the add-in there is no risk of a data leak. But how can I find out which data sites have a public API that data can be sent to, as you describe? Is this noted on the data service pages of the Microsoft Azure Data Market?

    The "worst case" scenario again: I download statistical data from Azure market and merge it with some private sensitive data in order to get some benchmarking or whatever.

    How can I know what happens when I inadvertently choose public as the privacy setting? Is there some structured information about the API process for every data provider of the Azure data market?

    Thank you,

    LifeTheUniverseAndEverything


    Wednesday, October 19, 2016 2:41 PM
  • > But how can I find out which data sites have a public API where data can be uploaded to as you also state it now?

    It's not about the site, but the protocol being used. Vanilla http/https requests don't support generic querying that we can automatically leverage. OData does. The Facebook API does (as do many SaaS providers that we provide out of the box connectors for).
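
    A small M illustration of the difference (the URLs and column name are made up, and the exact request a source receives depends on how the query folds):

        let
            // Plain http source: the engine can only download what the URL returns;
            // nothing from your workbook is folded into the request.
            PlainPage = Web.Page(Web.Contents("https://example.com/stats.html")),
            // Queryable source: this filter can be translated into the request itself,
            // e.g. .../People?$filter=LastName eq 'Contoso', so the literal value
            // travels to the server and can show up in its logs.
            Filtered = Table.SelectRows(
                OData.Feed("https://example.com/odata/People"),
                each [LastName] = "Contoso")
        in
            [LocalOnly = PlainPage, Folded = Filtered]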

    In your scenario, the "worst case" of marking private data as public would be that you end up sending some of your private data to the Azure market APIs.

    Ehren

    Wednesday, October 19, 2016 4:53 PM
  • > In your scenario, the "worst case" of marking private data as public would be that you end up sending some of your private data to the Azure market APIs.

    OK, thanks again, Ehren! This turns out to be a far more technical question than I thought.

    This is something to explore further when using such services; the details of each service probably explain whether OData or something similar is used.

    Users should then just be a bit aware of the theoretical risk; I hope this suffices for my company. Still, I do not like these "worst case" 0.001 % probabilities.

    The main risk is that a normal user would never expect that sharing with the "whole world" could become possible at any point while using an add-in in Excel. He would simply assume that the setting refers to his company environment at most.

    In general, Power Query is a nice tool even without the data services, and the sequence of downloading data from API providers on the Microsoft Azure Data Market, merging in sensitive data and then pressing "public" when asked seems far-fetched. But you never know who might do this out of curiosity or whatever; Murphy's law is still in effect. If you use the Azure data often, it will at least give you an uncomfortable feeling, or simply get on your nerves, to have to pay attention not to press "public" at every merge, so I do not really understand why this cannot be prevented globally via a central menu option.

    I guess this is my last question then: if private data reaches an API, does this necessarily mean that the data is, in that same step, made available to all other Azure users of that API, or is there any approval of the newly uploaded data by the specific database providers?

    I mean, the data services would presumably not want to spread wrong or chaotic test data, so is there some checking done? Could data that was uploaded by mistake ever be removed from the public API or OData source again?

    Regards and thank you for your service here.

    LifeTheUniverseAndEverything

    Thursday, October 20, 2016 8:33 AM
  • > I guess this is my last question then: if private data reaches an API, does this necessarily mean that the data is, in that same step, made available to all other Azure users of that API, or is there any approval of the newly uploaded data by the specific database providers?

    No, in the case of the Azure Market the data sent in query requests will not be made available to other users of the Azure Market APIs. It will only be visible to Microsoft and (perhaps...not 100% sure) the publisher of the API, through telemetry and server logs. It is not suddenly available for the world to download.

    I hope that clears up your concern.

    Ehren

    Thursday, October 20, 2016 3:09 PM
  • Dear Ehren,

    yes, this clears it up to some 99%; at least it takes away the fear of doing something really risky, given that such sharing with the world is not attempted at any stage.

    But I simply do not understand why this option exists for users at all. I guess this question is rather naive by now, but let me give it a try.

    If all the data just ends up in server logs when pressing "public", why should the data service and the user willingly incur this traffic for nothing? Why does the data have to reach the server at all? Is it for speed reasons, because the query can then run in the "backend" without large amounts of data having to be downloaded to the user first? But that would be independent of the privacy option I choose. So I still do not get the function of this option :)

    What would a real-world example of using this "public" privacy option look like? Thank you very much in advance; the questions should really stop after this...

    Regards.

    LifeTheUniverseAndEverything

    Friday, October 28, 2016 8:14 AM
  • The reason these settings exist is that we don't want to send your data to random sites without your permission.

    Ehren

    Monday, October 31, 2016 5:58 PM
  • Well, That About Wraps It Up For God :)

    OK, I'll just accept it. Thank you very much for your help and patience. I think I was running in logical circles here :)

    Tuesday, November 8, 2016 5:23 PM