locked
Invoke-webrequest webscraping RRS feed

  • Question

  • Hi,

    I'm fairly new to webscraping.

    I can't seem to find a way to capture an attribute of the span-class.

    Does anybody know how to get the value of "data-count" which has the exact amount of followers? The innertext which is 652K is not good enough as data :)

    Thanks
    /Daniel


    Tuesday, July 3, 2018 6:27 AM

Answers

  • $twitter = Invoke-WebRequest –Uri https://twitter.com/PowerShell_Team
    $twitter.ParsedHtml.body.getElementsByClassName("ProfileNav-stat") | 
        where nameProp -eq 'followers' |
        foreach {
            $_.getElementsByClassName("ProfileNav-value")[0].getAttribute("data-count")
        }

    You should use the API instead: 

    https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show

    • Proposed as answer by jrv Wednesday, July 4, 2018 9:07 AM
    • Marked as answer by Daniel Mercourios Wednesday, July 4, 2018 7:41 PM
    Wednesday, July 4, 2018 8:58 AM
  • function Get-TwitterAccount {
        [CmdletBinding()]
        param
        (
            [Parameter(
                Mandatory=$true,
                ValueFromPipeline=$true,
                ValueFromPipelineByPropertyName=$true,
                Position=0)]
            [String[]]$URL
        )
    
        process {
            $nav = 'tweets', 'following', 'followers', 'favorites'
            $dict = [ordered]@{}
            $URL | foreach {
                $resp = Invoke-WebRequest –Uri $_
                $dict.Add('Name', $resp.ParsedHtml.body.getElementsByClassName("ProfileHeaderCard-nameLink")[0].innerText)
                $dict.Add('Username', $resp.ParsedHtml.body.getElementsByClassName("ProfileHeaderCard-screenname")[0].innerText)
                $resp.ParsedHtml.body.getElementsByClassName("ProfileNav-stat") | 
                    where {$_.getAttribute("data-nav") -in $nav} |
                    foreach {
                        $dict.Add(
                            $_.getElementsByClassName("ProfileNav-label")[0].innerText, 
                            $_.getElementsByClassName("ProfileNav-value")[0].getAttribute("data-count")
                        )    
                    }
                [PSCustomObject]$dict
            }
        }
    }
    
    'https://twitter.com/microsoft', 'https://twitter.com/powershell_team' | Get-TwitterAccount | Format-Table -AutoSize

    You should still use API instead. Scraping will use a lot more bandwidth and server resources than API calls. It is also unreliable and likely to break when Twitter decides to update their web design. It may also go against their terms and conditions which could result in your account getting banned. Use at your own risk.

    Wednesday, July 4, 2018 8:25 PM

All replies

  • Ok, sorry about that.

    I have a script that grabs followers and likes from twitter accounts:

        $typeOfActivity = @{
            0 = 'Tweets'
            1 = 'Following'
            2 = 'Followers'
            3 = 'Likes'
        }

    $WebRequest = Invoke-WebRequest https://twitter.com/$($account.name)

    foreach ($activity in $typeOfActivity.GetEnumerator())
    {

    ($WebRequest.ParsedHtml.body.getElementsByTagName('span') | Where {$_.getAttributeNode('class').Value -eq 'ProfileNav-value'}).innertext[$activity.Name]

    }

    The problem is when the amount of followers is very high and is typed like "100K". It doesn't give me an exact number. Does anybody know how to pull the exact amount which is the value of "data-count" in the span class?

    Tuesday, July 3, 2018 8:41 PM
  • I gave it a try and I have this :

    $twitter = Invoke-WebRequest –Uri https://twitter.com/ExpertsExchange
    [string]$str = $twitter.AllElements | where {$_.Class -eq "ProfileCanopy ProfileCanopy--withNav ProfileCanopy--large js-variableHeightTopBar"} | select -ExpandProperty outertext
    [Regex]$reg = "(\d{1,3}.\d{1,3}|\d{1,3}.\d{1,3}.\d{1,3}) (mil|k) (Siguiendo|Following)"
    $matches = $reg.Matches($str)
    if($matches.Count -gt 0){
        foreach($match in $matches){
            $match.value
        }
    }
    

    The result is something like this: 11,1 mil Siguiendo

    If you need to double check just give it a test in the English Regex.

    • Proposed as answer by j0rt3g4 Wednesday, July 4, 2018 12:02 AM
    Tuesday, July 3, 2018 11:44 PM
  • $twitter = Invoke-WebRequest –Uri https://twitter.com/PowerShell_Team
    $twitter.ParsedHtml.body.getElementsByClassName("ProfileNav-stat") | 
        where nameProp -eq 'followers' |
        foreach {
            $_.getElementsByClassName("ProfileNav-value")[0].getAttribute("data-count")
        }

    You should use the API instead: 

    https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show

    • Proposed as answer by jrv Wednesday, July 4, 2018 9:07 AM
    • Marked as answer by Daniel Mercourios Wednesday, July 4, 2018 7:41 PM
    Wednesday, July 4, 2018 8:58 AM
  • Very nice, Leif-Arne, thank you. That worked for where nameprop equals followers, following and likes but not tweets. Don't know why. Is there a way to grab tweets as well...?
    Wednesday, July 4, 2018 7:42 PM
  • function Get-TwitterAccount {
        [CmdletBinding()]
        param
        (
            [Parameter(
                Mandatory=$true,
                ValueFromPipeline=$true,
                ValueFromPipelineByPropertyName=$true,
                Position=0)]
            [String[]]$URL
        )
    
        process {
            $nav = 'tweets', 'following', 'followers', 'favorites'
            $dict = [ordered]@{}
            $URL | foreach {
                $resp = Invoke-WebRequest –Uri $_
                $dict.Add('Name', $resp.ParsedHtml.body.getElementsByClassName("ProfileHeaderCard-nameLink")[0].innerText)
                $dict.Add('Username', $resp.ParsedHtml.body.getElementsByClassName("ProfileHeaderCard-screenname")[0].innerText)
                $resp.ParsedHtml.body.getElementsByClassName("ProfileNav-stat") | 
                    where {$_.getAttribute("data-nav") -in $nav} |
                    foreach {
                        $dict.Add(
                            $_.getElementsByClassName("ProfileNav-label")[0].innerText, 
                            $_.getElementsByClassName("ProfileNav-value")[0].getAttribute("data-count")
                        )    
                    }
                [PSCustomObject]$dict
            }
        }
    }
    
    'https://twitter.com/microsoft', 'https://twitter.com/powershell_team' | Get-TwitterAccount | Format-Table -AutoSize

    You should still use API instead. Scraping will use a lot more bandwidth and server resources than API calls. It is also unreliable and likely to break when Twitter decides to update their web design. It may also go against their terms and conditions which could result in your account getting banned. Use at your own risk.

    Wednesday, July 4, 2018 8:25 PM
  • Thanks, very nice. I'll look into the API but had a problem with bad request so I guess there's a little process with access tokens and accounts. For now, your function works very nice and I'm aware of the caveats. Again, thanks!
    Thursday, July 5, 2018 6:52 PM