How can I compute the number of fields in a CSV file that does not contain a header?

  • Question

  • How can I compute the number of fields in a CSV file that does not contain a header?
    Friday, August 2, 2013 2:58 PM

Answers

  • What I can't get around yet is when the first record of the CSV file contains data like

    1,2,3,2,1

    with non-unique data across the fields.

    I hadn't considered that, when I wrote the code in the other thread.  Good catch!

    In that case, what you're already doing with string parsing is the way to go.  If your data might contain embedded newlines, or if the file might contain a type information line, it'll need a couple more tweaks, but you're already on the right track.

    For those who are confused about the purpose of this question, it's made clear in Larry's first thread at http://social.technet.microsoft.com/Forums/scriptcenter/en-US/5056daac-3e12-4518-bf0b-99890d9aa178/normal-csv-file-to-a-fixed-recordsize-text-file

    • Marked as answer by Yan Li_ Monday, August 12, 2013 12:33 PM
    Tuesday, August 6, 2013 3:20 PM
  • from get-help export-csv -detailed:

        -NoTypeInformation [<SwitchParameter>]
            Omits the type information from the CSV file. By default, the first line of the CSV file contains "#TYPE " followed by the fully-qualified name of the type of the .NET Framework object.
    

    according to that, it is always a single line, and it always contains the fully qualified name of the type of the .NET object being exported.

    I suspect that #TYPE is a PowerShell-only thing rather than something intrinsically belonging to the .csv specification, Excel, or any other spreadsheet software.


    Al Dunbar -- remember to 'mark or propose as answer' or 'vote as helpful' as appropriate.

    • Marked as answer by Yan Li_ Monday, August 12, 2013 12:33 PM
    Tuesday, August 6, 2013 6:36 PM

All replies

  • Try this please:

    $CSVFile = Get-Content abc.csv
    $Lines = $CSVFile.Count
    $Columns = ($CSVFile[0].split(",")).Count
    $Fields = $Lines * $Columns

    Best regards
    brima
    Friday, August 2, 2013 5:27 PM
  • For some CSV files that works OK.  But for CSV files that have only one record, this slight variation has to be made to the first line:

    $CSVFile = @(Get-Content abc.csv)
    $Lines = $CSVFile.Count
    $Columns = ($CSVFile[0].split(",")).Count
    $Fields = $Lines * $Columns
    

    But then it gets more complicated.  I need it to be able to handle CSV files like this one:

    XXX,"Well,now",2,3
    
    Where the answer should be 4.
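    For reference, a quote-aware parser does give 4 for that line; here it is checked with Python's standard csv module (used only as a convenient illustration, since it follows the same quoting rules):

```python
import csv

# The comma inside "Well,now" is data, not a delimiter,
# so a quote-aware parser finds 4 fields.
line = 'XXX,"Well,now",2,3'
fields = next(csv.reader([line]))
print(len(fields))  # 4
print(fields)       # ['XXX', 'Well,now', '2', '3']
```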

    Friday, August 2, 2013 5:35 PM
  • ok, I think in this case you could work with a regex to preserve the >,< between the quotes

    this works exactly for your example:

    'XXX,"Well,now",2,3' -replace ('(".*)(,)(.*")'),('$1#$3')

    XXX,"Well#now",2,3

    it replaces the >,< between the quotes with a >#<

    Friday, August 2, 2013 5:49 PM
  • Thanks.  I'm not very conversant with REs.  Here is what I ended up with:

      # Examine the first record to learn the number of fields in each record
      # ----------------------------------------------------------------------
      $n = 0      # this will be the number of fields in a record in this file
      $q = $false # if $true then we are inside a quoted field
      $s = @(gc $csvPath)[0]
      $l = $s.length
      for ($i = 0; $i -lt $l; $i++)
       {
        $c = $s[$i]
        if ($c -eq '"')
         {
          if ($q)
           {
            if ($s[$i+1] -eq '"')
             {
              $i++ # ignore two doublequotes inside a doublequote wrapped field
             }   
            else
             { 
              $q = $false
             }
           }
          else
           {
            $q = $true
           }
         }
        if (($c -eq ',') -and (!($q))) {$n++} # count commas except those in doublequote wrapped fields
       }
      $n++
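    For comparison, the same quote-aware scan ports almost line-for-line to Python (a sketch; `count_fields` is a name introduced here, not from the thread):

```python
def count_fields(line):
    """Count CSV fields in one record: commas inside double-quoted
    fields are skipped, and doubled "" escapes are ignored,
    mirroring the character scan above."""
    n, in_quotes, i = 0, False, 0
    while i < len(line):
        c = line[i]
        if c == '"':
            if in_quotes and i + 1 < len(line) and line[i + 1] == '"':
                i += 1            # "" inside a quoted field: skip the pair
            else:
                in_quotes = not in_quotes
        elif c == ',' and not in_quotes:
            n += 1                # count only unquoted delimiters
        i += 1
    return n + 1

print(count_fields('XXX,"Well,now",2,3'))  # 4
```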
      

    Friday, August 2, 2013 9:54 PM
  • The answer might very much depend on what kind of 'computing' you have in mind.

    Not knowing that, perhaps, it would suffice to read out the first line, check out the number of fields and then, do a straight 'Import-CSV -path <yourfile> -header (1..$No)' on your file, whereby '$No' would be the count of fields you detected earlier. This would work, no matter if your file has only one line or many.

    Thus, you'd get a nice clean object for further computation.
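    In Python terms the same trick looks like this (a sketch for comparison; DictReader's fieldnames parameter plays the role of -Header, and the field count of 4 is assumed to have been detected from the first line already):

```python
import csv
import io

# Headerless sample data; in practice this would be the file contents
data = io.StringIO('XXX,"Well,now",2,3\n')

no = 4  # field count detected from the first line beforehand
reader = csv.DictReader(data, fieldnames=[str(i) for i in range(1, no + 1)])
rows = list(reader)
print(rows[0]['2'])  # Well,now
```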

    wizend


    • Edited by Wizend Saturday, August 3, 2013 12:30 PM correction
    • Proposed as answer by Al Dunbar Sunday, August 4, 2013 5:31 PM
    Saturday, August 3, 2013 12:27 PM
  • Thanks!
    Saturday, August 3, 2013 12:42 PM
  • The answer might very much depend on what kind of 'computing' you have in mind.

    Not knowing that, perhaps, it would suffice to read out the first line, check out the number of fields and then, do a straight 'Import-CSV -path <yourfile> -header (1..$No)' on your file, whereby '$No' would be the count of fields you detected earlier. This would work, no matter if your file has only one line or many.

    Thus, you'd get a nice clean object for further computation.

    wizend


    Or you could guess at a first estimate of $No as being 1 plus the number of commas in the first line, do the Import-Csv above, and look for the last non-blank element in the first line.


    Al Dunbar -- remember to 'mark or propose as answer' or 'vote as helpful' as appropriate.

    Sunday, August 4, 2013 5:31 PM
  • That would work except when the last fields of the first line of the CSV contain no data.  In other words, the record ends with one or more consecutive commas.
    Monday, August 5, 2013 12:53 PM
  • Larry,

    There is no 100% miracle tool that can fix a broken csv file!

    There is only an approximation to that.

    Please let your imagination fly!

    You can use a slightly modified version of the solution suggested by wizend and Al Dunbar!

    1. Read all lines of the file to look for the maximum count of commas in a line,
    or use only the first 5 or 10 lines for that... (Microsoft does so.)

    2. Take that determined count value and add 50 or 100 or 1000 = $No
    3. Import-Csv -path <yourfile> -header (1..$No)
    4. Look at all returned objects for the last non-blank element
    5. Reread the CSV file with this determined count of headers.


    Please click “Mark as Answer” if my post answers your question and click “Vote As Helpful” if my Post helps you.
    Bitte markiere hilfreiche Beiträge von mir als “Als Hilfreich bewerten” und Beiträge die deine Frage ganz oder teilweise beantwortet haben als “Als Antwort markieren”.
    My PowerShell Blog http://www.admin-source.info
    [string](0..21|%{[char][int]([int]("{0:d}" -f 0x28)+('755964655967-86965747271757624-8796158066061').substring(($_*2),2))})-replace' '
    German ? Come to German PowerShell Forum!

    • Edited by Peter Kriegel Monday, August 5, 2013 1:15 PM
    Monday, August 5, 2013 1:14 PM
  • Just because a CSV file's records end with a comma doesn't mean it is broken.  It just means that there is no data in the last field.
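    That reading matches what standard CSV parsers do; a quick check with Python's csv module (for illustration only):

```python
import csv

# A trailing comma produces a trailing empty field, not a parse error.
row = next(csv.reader(['1,2,3,']))
print(row)       # ['1', '2', '3', '']
print(len(row))  # 4
```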
    Monday, August 5, 2013 1:56 PM
  • Refactored to avoid an out-of-bounds index and packaged as a function

    <###################################################################################>
    
    function Get-ColumnCount
     {
      # a PowerShell function named Get-ColumnCount
      #
      #  Returns the count of columns in a CSV file
      #
      # Usage:
      #  
      #  Get-ColumnCount csvfilename.csv
      #
      # this version is archived at
      #   http://sp.ntpcug.org/PowerShell/Shared%20Documents/Larry_Weiss_Get-ColumnCount_function.ps1
      #
      
      param(
        [string] $csvPath
      )
    
      # All code below expects the CSV file to be well-formed
      # ------------------------------------------------------
      
      # Examine the first record to learn the number of fields in each record
      # ----------------------------------------------------------------------
      $n = 0                                # $n will be the number of fields in a record in this file
      $q = $false                           # if $q is $true then we are inside a quoted field
      $s = @(gc $csvPath -TotalCount 1)[0]  # read the first line of the CSV
      $m = $s.length - 1                    # $m is the maximum index for the string
      for ($i = 0; $i -le $m; $i++)
       {
        $c = $s[$i]
        if ($c -eq '"')
         {
          if ($q)
           {
            if ( ($i -ne $m) -and ($s[$i+1] -eq '"') )
             {
              $i++ # ignore two consecutive doublequotes inside a doublequote wrapped field
             }   
            else
             { 
              $q = $false
             }
           }
          else
           {
            $q = $true
           }
         }
        if (($c -eq ',') -and (!($q))) {$n++} # count commas except those in doublequote wrapped fields
       }
      $n++
      
      return $n
      
     }
    

    Monday, August 5, 2013 2:10 PM
  • Larry, the code you have is pretty good, to be honest.  All tokenizer implementations are likely to wind up running similar code at some point.  Here's the relevant code from the Import-Csv cmdlet (decompiled with dotPeek).  I haven't pasted all of the helper functions that it calls, but the names are descriptive enough that you can tell what it's doing, and that it's very similar to yours.

        private Collection<string> ParseNextRecord(bool isHeaderRow)
        {
          Collection<string> collection = new Collection<string>();
          StringBuilder current = new StringBuilder();
          bool flag = false;
          while (!this.EOF)
          {
            char ch = this.ReadChar();
            if ((int) ch == (int) this._delimiter)
            {
              if (flag)
              {
                current.Append(ch);
              }
              else
              {
                collection.Add(((object) current).ToString());
                current.Remove(0, current.Length);
              }
            }
            else if ((int) ch == 34)
            {
              if (flag)
              {
                if (this.PeekNextChar('"'))
                {
                  int num = (int) this.ReadChar();
                  current.Append('"');
                }
                else
                {
                  flag = false;
                  bool endOfRecord = false;
                  this.ReadTillNextDelimiter(current, ref endOfRecord, true);
                  collection.Add(((object) current).ToString());
                  current.Remove(0, current.Length);
                  if (endOfRecord)
                    break;
                }
              }
              else if (current.Length == 0)
              {
                flag = true;
              }
              else
              {
                bool endOfRecord = false;
                current.Append(ch);
                this.ReadTillNextDelimiter(current, ref endOfRecord, false);
                collection.Add(((object) current).ToString());
                current.Remove(0, current.Length);
                if (endOfRecord)
                  break;
              }
            }
            else if ((int) ch == 32 || (int) ch == 9)
            {
              if (flag)
                current.Append(ch);
              else if (current.Length != 0)
              {
                bool endOfRecord = false;
                current.Append(ch);
                this.ReadTillNextDelimiter(current, ref endOfRecord, true);
                collection.Add(((object) current).ToString());
                current.Remove(0, current.Length);
                if (endOfRecord)
                  break;
              }
            }
            else if (this.IsNewLine(ch))
            {
              if ((int) ch == 13)
              {
                int num = (int) this.ReadChar();
              }
              if (flag)
              {
                current.Append(ch);
                if ((int) ch == 13)
                  current.Append('\n');
              }
              else
              {
                collection.Add(((object) current).ToString());
                current.Remove(0, current.Length);
                break;
              }
            }
            else
              current.Append(ch);
          }
          if (current.Length != 0)
            collection.Add(((object) current).ToString());
          if (isHeaderRow)
          {
            while (collection.Count > 1 && collection[collection.Count - 1].Equals(string.Empty))
              collection.RemoveAt(collection.Count - 1);
          }
          return collection;
        }

    Monday, August 5, 2013 2:17 PM
  • Thanks for that  decompilation.  
    I suspected that I needed to be able to handle line-termination sequences inside quoted CSV field data.
    To do so, I'd need to do a byte-sequence analysis rather than the simple Get-Content approach I have now.    
    I'm not understanding the need for the special treatment for Tabs and space-characters seen in the decompiled code.
    Monday, August 5, 2013 4:05 PM
  • You need the number of fields in a given csv file, but the problem might be that there are fields that contain delimiter characters themselves. In such cases a simple split operation will fail. But those fields containing a delimiter character (',') have to start with a quote and end with one; otherwise you've got an irregular, broken csv file.

    I checked the following with the example line from your recent post above and it worked for me. But there are as many variations in files as there are ways to fail. Perhaps you could give it a try:

    # file content = "XXX","Well,now" ,2 ,3,
    $cnt = 0
    (type $file -first 1).Split(',') | % {if ($_.Trim().StartsWith('"') -and $_.Trim().EndsWith('"')) {
        $cnt ++} elseif (!$_.Trim().StartsWith('"')){$cnt++}
       }
    Write-Host ("Number of elements (hopefully): {0}" -f $cnt)
    
    # No of fields: 5

    wizend

    Monday, August 5, 2013 5:08 PM
  • Just because a CSV file's records end with a comma doesn't mean it is broken.  It just means that there is no data in the last field.

    The CSV file format is defined in RFC 4180 (http://tools.ietf.org/html/rfc4180).

    A CSV file without a header is definitely broken, because it does not follow the CSV rules!

    But I see you would rather reinvent the wheel and program Import-Csv yourself.



    Tuesday, August 6, 2013 6:04 AM
  • Thanks for the URL to RFC 4180.    But notice the word "optional" in RFC 4180 in this sentence.

    3.  There maybe an optional header line appearing as the first line
           of the file with the same format as normal record lines
    

    CSVs without headers are not broken.  They are just harder to work with, as this discussion is proving.

    And I'd prefer not to have to duplicate the logic in Import-Csv when writing a PowerShell function that returns the number of fields in a CSV file, but I don't yet have working code that uses Import-Csv to do the heavy lifting when confronted with a CSV file that lacks a traditional header line.

    Tuesday, August 6, 2013 12:33 PM
  • Larry, I used Import-Csv for exactly that purpose in your other thread.  The only catch was that it wouldn't work if the file had exactly one record (and your current code, or some variation on it, will work in that case as well).

    Tuesday, August 6, 2013 12:52 PM
  • I sure wish I could find a general solution that uses Import-Csv for the CSV parsing.  Making my explicit logic cover all the cases will require it to handle embedded newline sequences in doublequote-wrapped data and perhaps other scenarios.

    I'll review your code again to see if it can be changed to handle a file with only one record.

    Tuesday, August 6, 2013 1:27 PM
  • I'm not understanding the need for the special treatment for Tabs and space-characters seen in the decompiled code.

    I have no idea why they did that, either.  The end result is that leading spaces/tabs are removed, and trailing spaces/tabs are limited to the first character (unless contained within double quotes).

    Also, looking through this code finally explains their (probably bugged) handling of lines starting with #.  I never understood why Import-Csv would ignore a line starting with # if it were the first line in a file, but "Get-Content csvfile.csv | ConvertFrom-Csv" would ignore all lines starting with # (particularly when there's no "comment" functionality in the CSV specification).  Turns out that calls to the ReadHeader() method first look for a line starting with #Type that contains type information from Export-Csv or ConvertTo-Csv, but the way they handle it seems flawed:

        private string ReadTypeInformation()
        {
          string str1 = (string) null;
          if (this.PeekNextChar('#'))
          {
            string str2 = this.ReadLine();
            if (str2.StartsWith("#Type", StringComparison.OrdinalIgnoreCase))
            {
              string str3 = str2.Substring(5).Trim();
              str1 = str3.Length != 0 ? "CSV:" + str3 : (string) null;
            }
          }
          return str1;
        }
    
        internal void ReadHeader()
        {
          if (this._typeName == null && !this.EOF)
            this._typeName = this.ReadTypeInformation();
          if (this._header == null && !this.EOF)
          {
            Collection<string> collection = this.ParseNextRecord(true);
            if (collection.Count != 0)
              this._header = (IList<string>) collection;
          }
          if (this._header == null || this._header.Count <= 0)
            return;
          ImportCsvHelper.ValidatePropertyNames(this._header);
        }
    

    So if the line starts with # (but doesn't match #Type), it still gets completely swallowed by the call to ReadHeader() (which calls ReadTypeInformation, even if a header has already been specified).

    That, combined with pipeline behavior, explains the difference between Import-Csv and GetContent | ConvertFrom-Csv.  Get-Content (by default) pipes one line at a time, and ConvertFrom-Csv calls ReadHeader() on every input object in the stream.  If no type information has already been found in the file, it'll just keep swallowing any line that starts with #.

    If you fetch the entire file contents before piping to ConvertFrom-Csv (something like ConvertFrom-Csv (Get-Content csvfile.csv) or Get-Content csvfile.csv -Raw | ConvertFrom-Csv), you get the same results as Import-Csv.


    • Edited by David Wyatt Tuesday, August 6, 2013 1:39 PM edit
    Tuesday, August 6, 2013 1:37 PM
  • I can get around the single-record CSV file situation with modifications to your code like the below:

    $csvPath = 'csv.csv'
    gc $csvPath -TotalCount 1 >temp.csv
    gc $csvPath -TotalCount 1 >>temp.csv
    $csvPath = 'temp.csv'
    $csvData = Import-Csv -Path $csvPath
    $props = $csvData | Get-Member -MemberType NoteProperty
    $props.Count
    

    What I can't get around yet is when the first record of the CSV file contains data like

    1,2,3,2,1
    

    with non-unique data across the fields.

    Tuesday, August 6, 2013 2:17 PM
  • That would work except when the last fields of the first line of the CSV contained no data.  In other words the records ends with one or more consecutive comma.

    Correct, of course. So then one would process all lines imported and find the rightmost non-blank element.

    But the question may remain, what if the headless csv looks like this:

    1,2,3,,,
    4,5,
    6,,7,8,,,,
    

    the "8" element is the rightmost non-blank one, so my logic would determine the number of fields to be 4. But other possibilities would be 7 (# commas) or 8 (comma following last field seems optional).

    But this all comes down to a question that I do not think has been asked here: If you do not have the header how do you know what the non-blank fields represent, let alone the blank fields?


    Al Dunbar -- remember to 'mark or propose as answer' or 'vote as helpful' as appropriate.

    Tuesday, August 6, 2013 2:56 PM
  • Here is an adaptation of David's logic that can handle single-record CSV files as well as the case where non-unique data is present in the first record.   I just need to do a better job of generating a unique file name and a better location for the scratch file, and remember to delete it afterwards.

    $csvPath = 'csv.csv'
    $r1 = @(gc $csvPath -TotalCount 1)
    $r1a = $r1 -split ','
    $r1a | % {$n=1;$r1=''}{$r1=$r1+$_+('""'*$n)+',';$n++}{}
    $r1 = $r1 -replace ".$"
    $r1 >temp.csv
    $r1 >>temp.csv
    $csvPath = 'temp.csv'
    $csvData = Import-Csv -Path $csvPath
    $props = $csvData | Get-Member -MemberType NoteProperty
    $props.Count
    

    Tuesday, August 6, 2013 2:57 PM
  • What I can't get around yet is when the first record of the CSV file contains data like

    1,2,3,2,1

    with non-unique data across the fields.

    I hadn't considered that, when I wrote the code in the other thread.  Good catch!

    In that case, what you're already doing with string parsing is the way to go.  If your data might contain embedded newlines, or if the file might contain a type information line, it'll need a couple more tweaks, but you're already on the right track.

    For those who are confused about the purpose of this question, it's made clear in Larry's first thread at http://social.technet.microsoft.com/Forums/scriptcenter/en-US/5056daac-3e12-4518-bf0b-99890d9aa178/normal-csv-file-to-a-fixed-recordsize-text-file

    • Marked as answer by Yan Li_ Monday, August 12, 2013 12:33 PM
    Tuesday, August 6, 2013 3:20 PM
  • Your example is not a well-formed CSV file.

    From the spec, RFC 4180 (http://tools.ietf.org/html/rfc4180).

     4.  Within the header and each record, there may be one or more
           fields, separated by commas.  Each line should contain the same
           number of fields throughout the file.

    And not all CSV files are self-describing.  Sometimes you just have to know by other means what the columns actually represent.

    Tuesday, August 6, 2013 3:25 PM
  • Your example is not a well-formed CSV file.

    From the spec, RFC 4180 (http://tools.ietf.org/html/rfc4180).

     4.  Within the header and each record, there may be one or more
           fields, separated by commas.  Each line should contain the same
           number of fields throughout the file.

    And not all CSV files are self-describing.  Sometimes you just have to know by other means what the columns actually represent.

    Quite right. Of course, if the user actually *did* have other means to determine what columns were present and what they meant, he would have been able to build his header without asking us how to figure out how many columns exist in a headerless .csv file.

    RFC4180 also states:

       While there are various specifications and implementations for the
       CSV format (for ex. [4], [5], [6] and [7]), there is no formal
       specification in existence, which allows for a wide variety of
       interpretations of CSV files.

    If, as I suspect might be the case, the headless .csv file the user is dealing with is also not well-formed, it makes little sense to quote the standard. What is he supposed to do with that file, send it back to the developer of the non-compliant application that created it and ask them to fix it?

    Al Dunbar -- remember to 'mark or propose as answer' or 'vote as helpful' as appropriate.

    Tuesday, August 6, 2013 3:48 PM
  • Thanks for the analysis.  Have you considered filing a bug report at

    https://connect.microsoft.com/powershell

    ?

    Tuesday, August 6, 2013 5:14 PM
  • Yes, I decided to narrow down the question here in this thread.  The other thread does explain my motivation.

    Tuesday, August 6, 2013 5:18 PM
  • Thanks for the analysis.  Have you considered filing a bug report at

    https://connect.microsoft.com/powershell

    ?

    For this one?  Probably not worth the effort.  Not many people are likely to vote it up, and the PowerShell team probably wouldn't even read it.  It only becomes an issue if you have a header row that starts with #, and isn't quoted (or, in the GetContent | ConvertFrom-Csv scenario, any line that starts with # and isn't quoted).  Ironically, fixing the bug might actually break some people's scripts, if they've discovered this behavior and assumed that the # character was being treated as a comment in a CSV file.  That's what I originally thought, when I saw those lines being ignored in some cases.

    Tuesday, August 6, 2013 5:32 PM
  • What is the name for the #TYPE record?    I need to do some tests with CSVs that include it.

    Is the #TYPE record always exactly one record?  Is there a specification for how it is composed?

    The #TYPE records produced by code like that below seem simple.  Can they get more complex?

    get-process Powershell | Export-CSV gps.csv

    Tuesday, August 6, 2013 5:51 PM
  • from get-help export-csv -detailed:

        -NoTypeInformation [<SwitchParameter>]
            Omits the type information from the CSV file. By default, the first line of the CSV file contains "#TYPE " followed by the fully-qualified name of the type of the .NET Framework object.
    

    according to that, it is always a single line, and it always contains the fully qualified name of the type of the .NET object being exported.

    I suspect that #TYPE is a PowerShell-only thing rather than something intrinsically belonging to the .csv specification, Excel, or any other spreadsheet software.
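    If that's right, a downstream consumer can simply drop the line before parsing. A Python sketch (the sample text is illustrative, not actual Export-CSV output):

```python
import csv

# Illustrative sample of what Export-CSV emits by default
text = '#TYPE System.Diagnostics.Process\n"Name","Id"\n"powershell","1234"\n'
lines = text.splitlines()
if lines and lines[0].upper().startswith('#TYPE'):
    lines = lines[1:]  # drop the single type-information line
rows = list(csv.reader(lines))
print(rows[0])  # ['Name', 'Id']
print(rows[1])  # ['powershell', '1234']
```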


    Al Dunbar -- remember to 'mark or propose as answer' or 'vote as helpful' as appropriate.

    • Marked as answer by Yan Li_ Monday, August 12, 2013 12:33 PM
    Tuesday, August 6, 2013 6:36 PM
  • Thank you. 

    Tuesday, August 6, 2013 6:49 PM
  • Hi,

    Just checking in to see if the suggestions were helpful. Please let us know if you would like further assistance.



    Cataleya Li
    TechNet Community Support

    Thursday, August 8, 2013 2:48 AM
  • Hi Larry,

    If you're interested, yesterday I wrote and posted a tokenizer function much like yours, but which can be configured via command-line arguments (controlling the delimiters and quoting behavior).  Might come in handy for this kind of thing in the future.

    http://gallery.technet.microsoft.com/scriptcenter/Generic-PowerShell-string-e9ccfe73

    Friday, August 30, 2013 1:44 PM
  • Thanks!
    Friday, August 30, 2013 2:16 PM