Import from csv folder, ignore first rows, and put the date from the filenames in a column

  • Question

  • Hello all

We have a csv folder with dates in the filenames, like this:

    and the individual files are like this:


    As you can see, the first 3 rows are causing trouble when we want to import.

So we want to import the folder, fixing the first-rows issue, and also extract the date from each filename and place it in a column.

    How can we do it?

Here's the zip with all the needed files:

    DataFolder



BTW, note to devs: I think the first-rows import issue (which is very common) could be fixed by just counting the total columns first and using the first row that has the max columns (ignoring the previous rows), couldn't it?


    • Edited by NicoPer Monday, September 12, 2016 9:22 PM
    Monday, September 12, 2016 9:19 PM

Answers

  • This is solvable.

PQ is sometimes not that smart: it guesses delimiters, quotes, etc. badly, especially with non-standard lines at the top.

    Not sure how advanced you are, but this is pretty easy to solve.

    Check it out: https://drive.google.com/file/d/0BxCBbV34rkcETHJuWkd0ZVBBaVE/view?usp=sharing (there is a new query in the file called "Test")

Over time, I have discovered that the safest method of importing CSVs is via the Lines.FromBinary function, then splitting by the delimiter you define.

When you want to import multiple files from a folder, you still start with the Import from Folder feature, but then you keep only the Path & Name columns and take it from there. I know it sounds a bit weird to get rid of the Binary column, but this approach is 100% reliable, more so than letting PQ decide on delimiters and top non-standard rows.
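A minimal sketch of this approach might look like the following. The folder path, the date-extraction pattern, and the expanded column names are assumptions (they depend on your actual files); the 3 skipped rows come from the question:

```
let
    // folder path is an assumption; point it at your DataFolder
    Source = Folder.Files("C:\DataFolder"),
    KeepCols = Table.SelectColumns(Source, {"Folder Path", "Name"}),
    // date portion of the filename; the exact pattern depends on your naming scheme
    AddDate = Table.AddColumn(KeepCols, "FileDate", each Text.BeforeDelimiter([Name], ".")),
    // read each file as text lines, drop the 3 problem rows, then split by comma
    AddData = Table.AddColumn(AddDate, "Data", each
        let
            FileLines = Lines.FromBinary(File.Contents([Folder Path] & [Name]), null, null, 1252),
            Skipped   = List.Skip(FileLines, 3),
            Split     = Table.FromList(Skipped, Splitter.SplitTextByDelimiter(","))
        in
            Table.PromoteHeaders(Split)),
    // expand; replace the column names with the ones actually in your files
    Expanded = Table.ExpandTableColumn(AddData, "Data", {"Column1", "Column2"})
in
    Expanded
```

The FileDate column carries the filename date onto every expanded row, which covers the second half of the question.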

    DZ




    • Edited by Daniel Zrust Wednesday, September 14, 2016 3:30 PM
    • Marked as answer by NicoPer Thursday, September 15, 2016 3:30 AM
    Wednesday, September 14, 2016 3:21 PM

All replies

Assuming all your data files are in the same format, make a query that parses one file into a table. This will be the basis for a user-defined function that parses each item in the folder. Let's call it parseFile1.

Make a duplicate of the query, parseFile2. Open it, switch to the Advanced Editor, and view the code. You want to replace the beginning of the code with something like this:

(dataFile as text) as table =>
let
    Source = Csv.Document(File.Contents(dataFile), [Delimiter=",", Encoding=1252]),

The first line changes the data source from a single file location to a function parameter named dataFile. Don't forget the "," at the end, and capitalization matters. Usually the next line starts with #"Split…". Close this query and you should now see it as a function instead of a representation of your data file.
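Put together, parseFile2 might look roughly like this. The steps after Source are illustrative (your recorded query will have its own step names), and the 3 skipped rows come from the question:

```
(dataFile as text) as table =>
let
    Source = Csv.Document(File.Contents(dataFile), [Delimiter=",", Encoding=1252]),
    // skip the non-standard lines at the top of each file
    RemovedTopRows = Table.Skip(Source, 3),
    // use the first remaining row as column headers
    PromotedHeaders = Table.PromoteHeaders(RemovedTopRows)
in
    PromotedHeaders
```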

Now start a new query with the source being your folder of data files; let's call it myFiles. After selecting the folder you get a directory listing of the folder: Content, Name, Extension, ... Folder Path.

Now add a new column holding each file's parseFile2 results. It will store them in a single column to start. We'll call the new column theData, and the formula will be parseFile2([Folder Path]&[Name]). This will apply the query steps stored in parseFile2 to each file in the folder and put the results into the column theData, a sort of sub-table for each file in the list.

Now you are ready to clean things up. Clicking the Expand icon at the top of the new column lets you select the columns from your data file. The directory data for that file will be added to each row of the data in your data file. You can delete/parse the file directory info as needed to get your data and/or other information you want to assign to each row.
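The folder query described above might look roughly like this, reusing the myFiles and theData names from the reply. The folder path and the expanded column names are assumptions:

```
let
    // folder path is an assumption; point it at your folder of data files
    myFiles = Folder.Files("C:\DataFolder"),
    // invoke parseFile2 (the function defined earlier) on each file's full path
    theData = Table.AddColumn(myFiles, "theData", each parseFile2([Folder Path] & [Name])),
    // expand; substitute the real column names from your data files
    Expanded = Table.ExpandTableColumn(theData, "theData", {"Column1", "Column2"})
in
    Expanded
```

Any columns kept from the directory listing (e.g. Name) repeat on every expanded row, which is how per-file information like the filename date ends up alongside the data.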

    Wednesday, September 14, 2016 4:58 PM
Thanks so much, man. It worked the first time.
    I highly appreciate that you took the time to make the demo.
    Just copied the code.
    Thursday, September 15, 2016 3:32 AM
  • Thank you very much for your reply too.

    Your solution stays here for future reference.

    Thursday, September 15, 2016 3:36 AM