Grab string from URL using Powershell

    Question

  • Hi

    I am new to Powershell and I need some assistance. I am trying to grab the full URL of each jpg file from a webpage, but the image filenames follow no fixed pattern and the images can be stored on different subdomains. Any idea how to create a Powershell script to grab the entire URL and output it to a text file?

    [{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}, {formatCode: 102, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 103, url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg', width: 72, height: 57}]

    However, the tricky part is that there are many other images on the webpage, and only these 3 links from this section of the HTML are important.

    Also, we would like to output the link to a text file first. This script will scrape through multiple URLs which will be read from a list (I managed to get this part working) but I need to be able to select the above URL based on pattern matching or something to achieve it.


    Thanks

    Monday, February 13, 2012 12:27 AM

All replies

  • Regular expressions will be the key here.  My first stab:

    ("[{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}, {formatCode: 102, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 103, url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg', width: 72, height: 57}]" -split ',') -match "'http://.+.'"

    yields:

     url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg'
     url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg'
     url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg'
     url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg'

    Doing another -Split on the output:

    (("[{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}, {formatCode: 102, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 103, url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg', width: 72, height: 57}]" -split ',') -match "'http://.+.'") -split "url:"

    yields:

     'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg'
    
     'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg'
    
     'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg'
    
     'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg'

    Putting it together with the file output:

    (("[{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}, {formatCode: 102, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 103, url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg', width: 72, height: 57}]" -split ',') -match "'http://.+.'") -split "url:" | Foreach {$_.Trim().Replace("'","")} | Where {$_ -match "[a-z0-9]"} | Out-File C:\test\output.txt -Encoding ASCII; Invoke-Item C:\test\output.txt


    • Edited by Will Steele Monday, February 13, 2012 12:43 AM Additional info
    • Proposed as answer by Anders_WangModerator Monday, February 13, 2012 3:32 AM
    • Marked as answer by cyw77 Monday, February 13, 2012 4:11 AM
    Monday, February 13, 2012 12:38 AM
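
    For comparison, here is a minimal sketch of a different route to the same result: instead of splitting and filtering, it asks the .NET regex class for every match and reads the captured group. This is not the approach used in the post above. The sample text is shortened from the question, and the variable names and the output path are assumptions made for illustration.

```powershell
# Sample input shortened from the question; in a real run this would be the downloaded page text.
$html = "[{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}]"

# Capture whatever sits between "url: '" and the closing quote, as long as it ends in .jpg.
$pattern = "url:\s*'(http://[^']+\.jpg)'"

# Each match exposes the captured URL as group 1.
$urls = [regex]::Matches($html, $pattern) | ForEach-Object { $_.Groups[1].Value }

# Hypothetical output location (a file in the temp folder); adjust as needed.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'image-urls.txt'
$urls | Out-File $outFile -Encoding ASCII
```

    Because the pattern is anchored on the "url: '" label, other http:// links on the page that are not written in that form should be ignored.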
  • Hi Will

    Thanks for your prompt response.

    Here I have part of the code

      $outputurl = $webClient.DownloadString("http://www.domain.com")
      (($outputurl -split ',') -match "'http://.+.'") -split "url:" | Foreach {$_.Trim().Replace("'","")} | Where {$_ -match "[a-z0-9]"} | Out-File output.txt -Encoding ASCII; Invoke-Item output.txt

    $outputurl will contain the above HTML code.

    I got the following error

    You must provide a value expression on the right-hand side of the '-' operator.


    Thanks

    Monday, February 13, 2012 1:24 AM
  • Here's mine:

    $string ="{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}, {formatCode: 102, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 103, url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg', width: 72, height: 57}"
    
    ($string -replace ".+?'(http:.+?\.jpg')",'$1') -split "'" -match '^http'
    http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg
    http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg
    http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg
    http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg


    [string](0..33|%{[char][int](46+("686552495351636652556262185355647068516270555358646562655775 0645570").substring(($_*2),2))})-replace " "

    Monday, February 13, 2012 1:36 AM
  • Hi

    I am amazed at how quickly you guys can come up with a single line to resolve this. However, I still get the same error. Below is my script:

    $webClient = new-object Net.WebClient
    $webClient = New-Object Net.WebClient
    $webClient.UseDefaultCredentials = $true
    $webClient.Proxy.Credentials = $webClient.Credentials
    $webClient.Headers.Add("user-agent", "PowerShell Script")

    $info = get-content c:\URL-List.txt

    foreach ($i in $info) {
      $outputurl = ""
      $outputurl = $webClient.DownloadString($i)
      $outputurl -replace ".+?'(http:.+?\.jpg')",'$1' -split "'" -match '^http'
    }

    I got the error below

    You must provide a value expression on the right-hand side of the '-' operator.
    At C:\Scripts\GrabJPG.ps1:14 char:52
    +   $outputurl -replace ".+?'(http:.+?\.jpg')",'$1' -s <<<< plit "'" -match '^http'


    Thanks

    Monday, February 13, 2012 1:46 AM
  • You missed a set of parens:

    ($outputurl -replace ".+?'(http:.+?\.jpg')",'$1') -split "'" -match '^http'

    Those are important, don't take them out.


    Monday, February 13, 2012 1:53 AM
  • Same error with those parens.

    You must provide a value expression on the right-hand side of the '-' operator.
    At C:\Scripts\GrabJPG.ps1:14 char:54
    +   ($outputurl -replace ".+?'(http:.+?\.jpg')",'$1') -s <<<< plit "'" -match '^http'


    Thanks

    Monday, February 13, 2012 1:59 AM
  • Are you by any chance running Powershell V1?

    Monday, February 13, 2012 2:00 AM
  • Let's change to using a string method for the split and see if that works better for you:

    ($outputurl -replace ".+?'(http:.+?\.jpg')",'$1').split("'") -match '^http'


    • Marked as answer by cyw77 Monday, February 13, 2012 4:11 AM
    Monday, February 13, 2012 2:04 AM
  • "Are you by any chance running Powershell V1?"

    You are actually right. I will be trying out powershell V2 to see how it goes.

    Thanks

    Monday, February 13, 2012 3:49 AM
  • "Let's change to using a string method for the split and see if that works better for you:"

    ($outputurl -replace ".+?'(http:.+?\.jpg')",'$1').split("'") -match '^http'

    Did get it to work with the above command. The script seems stuck, as I think there are many other http:// links which I have to ignore. But I managed to merge your code with Will's and got it working using Powershell V1:

    ($outputurl).split(",") -match "'http://folder.+.'"

    I am confused as to why we need to split the string based on "," just to get it working?


    Thanks

    Monday, February 13, 2012 4:14 AM
  • I switched from using the Powershell -split operator  to using the string method .split().

    In Powershell V2 either one will work.  The -split operator was new in V2, so to get it to work in V1 I had to switch back to using the string method. 

    The -split operator is more flexible, since it uses a regular expression to do the split.  The string split only takes literal characters for the split argument, but that was fine for what we're doing.



    • Edited by mjolinor Monday, February 13, 2012 11:12 AM
    Monday, February 13, 2012 11:10 AM
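
    The difference between the two can be seen on a tiny example. This is a sketch only: the sample string and variable names are made up for illustration, and the method call passes an explicit character array so that it is clear each character is a separator on its own.

```powershell
$sample = "a, url: 'x', b"

# Operator: the right-hand side is a regular expression, so "url:" is matched as one substring.
$byOperator = $sample -split 'url:'

# Method with a character array: u, r, l and : each act as a separator by themselves.
$byMethod = $sample.Split('url:'.ToCharArray())

"Operator pieces: $($byOperator.Count)"
"Method pieces:   $($byMethod.Count)"
```

    The operator returns 2 pieces. The method returns 5, three of them empty, because the four separator characters sit next to each other.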
  • Thanks for the reply. I understand the syntax of split but I am curious why do we have to use split to achieve the above.

    Thanks

    Monday, February 13, 2012 12:55 PM
  • The split is necessary to get the individual urls extracted into an array.  Your input is a single string containing multiple urls.  We can eliminate all the text that's not urls with the regex -replace, but that's just going to leave you with a single string of multiple urls, not an array of individual urls.  Splitting gives you that array.


    • Edited by mjolinor Monday, February 13, 2012 1:57 PM
    Monday, February 13, 2012 1:56 PM
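
    The explanation above can be followed step by step with a short sketch. The input is a shortened, made-up version of the string from the question, and the variable names are assumptions; the pattern is the one used earlier in the thread.

```powershell
$text = "{url: 'http://folder1.domain.com/a.jpg', width: 30}, {url: 'http://folder2.domain.com/b.jpg', width: 60}"

# Step 1: -replace strips the text in front of each URL, but the result is still ONE string.
$stripped = $text -replace ".+?'(http:.+?\.jpg')", '$1'

# Step 2: splitting on the quote character turns that single string into an array.
$pieces = $stripped.Split("'")

# Step 3: -match against an array works as a filter and keeps only the elements starting with http.
$urls = $pieces -match '^http'

$urls
```

    After step 1 the URLs are still glued together, separated only by the quote characters; it is the split in step 2 that makes each URL its own array element.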
  • Thanks for your response.

    I got the above working using this command below

    ($outputurl).split(",") -match "'http://folder.+.'"

    Any idea why we have to split(",")?

    Does it mean we split at each "," that is found? Sorry if I am asking a basic question here, as I am very new to Powershell.


    Thanks

    Wednesday, February 15, 2012 12:56 AM
  • Same answer as before.  Your string contains multiple URLs.  In order to get them separated from each other, so that each one is a distinct string by itself, you have to split your string apart into multiple strings.


    • Edited by mjolinor Wednesday, February 15, 2012 1:03 AM
    Wednesday, February 15, 2012 1:02 AM
  • Ok then why use split(",") and not split(".") or not split("http://") or not split("url:")?


    Thanks

    Wednesday, February 15, 2012 2:08 AM
  • It might be easier to show you how to see what each one would do:

    Copy this to your powershell session:

    $string = "[{formatCode: 1002, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 101, url: 'http://folder1.domain.com/home/image/ea3afd89-f09e-4224-9018-4199aec1585d.jpg', width: 30, height: 24}, {formatCode: 102, url: 'http://folder2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width: 60, height: 48}, {formatCode: 103, url: 'http://folder3.domain.com/home/image/fe42a087-01f5-47dd-852d-1587cce0d2f5.jpg', width: 72, height: 57}]"

    Then you can try out all of those and see how they would split that line:

    $string.split(",")

    $string.split(".")

    $string.split("http://")

    $string.split("url:")


    Wednesday, February 15, 2012 2:38 AM
  • Thanks for breaking it down. It makes understanding this easier.
    I think $string.split(",") is the best option for getting the desired result, since each "," it finds separates the string into another array element.

    Using -match "http://folder.+." will then return the array elements that contain a string starting with http://folder

    However I do have some queries.
    First Question
    If I use $string.split("url:"), I get weird results with missing letters in the url like
     'http
    //fo
    de
    2.domain.com/home/image/7a9d5b7c-feb8-422f-a823-ea534c382f2e.jpg', width
     60, height
     48}, {fo
    matCode
     101,

    Not sure why that happened. I would think that it finds any occurrence of any of the letters in "url:" and performs a split, rather than treating "url:" as a single string.

    Second Question
    When I try to perform a replace after the split I get an error.
    $string1=$string.split(",") -match "http://folder.+."
    $string1.replace("url:", "")

    I got the error below
    Method invocation failed because [System.Object[]] doesn't contain a method named 'replace'.
    At line:1 char:15
    + $string1.replace( <<<< "url:", "")


    Thanks

    Wednesday, February 15, 2012 11:21 PM
  • "Not sure why that happened. I would think that it finds any occurrence of any of the letters in "url:" and performs a split, rather than treating "url:" as a single string."

    That's exactly what's happening.  It's splitting it wherever it finds any of the characters u, r, l, or :.  Also realize that when it does the split, it keeps whatever is on either side of the split character, but the split character itself is discarded.

    You're getting this:

    Method invocation failed because [System.Object[]] doesn't contain a method named 'replace'.

    because this:

    $string1=$string.split(",") -match "http://folder.+."

    is not creating a string, it is creating an array of strings.  The individual elements of the array are strings, and you can do a replace on each one, separately:

    $string1 | foreach {$_.replace("url:","")}

    but you cannot do it to the array because an array doesn't have a replace method. 

    The Powershell replace operator on the other hand, will do a replace on an entire array at once. 

    $string1 -replace "url:",""


    Wednesday, February 15, 2012 11:38 PM
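
    The two options described above can be tried side by side. This is a minimal sketch: the two sample lines imitate the output of the earlier split-and-match step, and the variable names are assumptions. Since the -replace operator takes a regular expression, a single alternation can remove both the "url:" label and the quotes.

```powershell
# Stand-in for the array produced by the earlier split and -match.
$lines = @(" url: 'http://folder1.domain.com/a.jpg'", " url: 'http://folder2.domain.com/b.jpg'")

# Option 1: call the string method on each element through the pipeline.
$viaMethod = $lines | ForEach-Object { $_.Replace('url:', '').Replace("'", '').Trim() }

# Option 2: the -replace operator works on the whole array at once.
# The pattern is a regex, so the alternation removes "url:" and the quote characters together.
$viaOperator = ($lines -replace "url:|'", '') | ForEach-Object { $_.Trim() }

$viaOperator
```

    Both variables end up holding the same two clean URLs, so the choice is mostly a matter of whether a literal replacement or a regex is more convenient.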