Invoke-webrequest webscraping

Question
-
Hi,
I'm fairly new to webscraping.
I can't seem to find a way to capture an attribute of the span-class.
Does anybody know how to get the value of "data-count" which has the exact amount of followers? The innertext which is 652K is not good enough as data :)
Thanks
/Daniel
Tuesday, July 3, 2018 6:27 AM
Answers
-
$twitter = Invoke-WebRequest -Uri https://twitter.com/PowerShell_Team
$twitter.ParsedHtml.body.getElementsByClassName("ProfileNav-stat") |
    where nameProp -eq 'followers' |
    foreach { $_.getElementsByClassName("ProfileNav-value")[0].getAttribute("data-count") }
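A minimal offline sketch of the same idea, assuming markup like the fragment below. The $html sample is hypothetical and the real page may differ. It reads data-count straight from the raw HTML with a regex, which may help where ParsedHtml is not available (PowerShell 6 and later do not expose it).

```powershell
# Hypothetical HTML fragment modelled on the markup the answer targets;
# the live page may be structured differently.
$html = @'
<a class="ProfileNav-stat" data-nav="followers" href="/PowerShell_Team/followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-count="652123">652K</span>
</a>
'@

# Assumes the class attribute appears before data-count inside the same tag.
$pattern = 'class="ProfileNav-value"[^>]*\bdata-count="(\d+)"'
$m = [regex]::Match($html, $pattern)

if ($m.Success) {
    # Exact number from the attribute rather than the rounded "652K" text
    $followers = [int]$m.Groups[1].Value
    $followers
}
```

Against a live response the same pattern could be applied to $twitter.Content instead of the sample string.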
You should use the API instead:
- Proposed as answer by jrv Wednesday, July 4, 2018 9:07 AM
- Marked as answer by Daniel Mercourios Wednesday, July 4, 2018 7:41 PM
Wednesday, July 4, 2018 8:58 AM -
function Get-TwitterAccount {
    [CmdletBinding()]
    param (
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true,
            Position=0)]
        [String[]]$URL
    )
    process {
        $nav = 'tweets', 'following', 'followers', 'favorites'
        $URL | foreach {
            # Fresh dictionary per URL, so passing several URLs via -URL does not hit duplicate keys
            $dict = [ordered]@{}
            $resp = Invoke-WebRequest -Uri $_
            $dict.Add('Name', $resp.ParsedHtml.body.getElementsByClassName("ProfileHeaderCard-nameLink")[0].innerText)
            $dict.Add('Username', $resp.ParsedHtml.body.getElementsByClassName("ProfileHeaderCard-screenname")[0].innerText)
            # Keep only the stats whose data-nav value is in $nav, keyed by their visible label
            $resp.ParsedHtml.body.getElementsByClassName("ProfileNav-stat") |
                where {$_.getAttribute("data-nav") -in $nav} |
                foreach {
                    $dict.Add(
                        $_.getElementsByClassName("ProfileNav-label")[0].innerText,
                        $_.getElementsByClassName("ProfileNav-value")[0].getAttribute("data-count")
                    )
                }
            [PSCustomObject]$dict
        }
    }
}

'https://twitter.com/microsoft', 'https://twitter.com/powershell_team' | Get-TwitterAccount | Format-Table -AutoSize
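The data-nav filtering in the function can be tried without a network call. This is only a sketch: the XML fragment is a hypothetical, well-formed stand-in for the real markup (real HTML usually will not load as XML), and it uses the .NET XML API rather than ParsedHtml.

```powershell
# Hypothetical, well-formed fragment modelled on the markup the function reads.
[xml]$doc = @'
<ul>
  <li><a class="ProfileNav-stat" data-nav="tweets" href="/PowerShell_Team">
    <span class="ProfileNav-label">Tweets</span>
    <span class="ProfileNav-value" data-count="4321">4,321</span></a></li>
  <li><a class="ProfileNav-stat" data-nav="followers" href="/PowerShell_Team/followers">
    <span class="ProfileNav-label">Followers</span>
    <span class="ProfileNav-value" data-count="652123">652K</span></a></li>
</ul>
'@

$nav  = 'tweets', 'following', 'followers', 'favorites'
$dict = [ordered]@{}

# Select every link that carries a data-nav attribute, then keep the wanted ones.
foreach ($stat in $doc.SelectNodes('//a[@data-nav]')) {
    if ($stat.GetAttribute('data-nav') -in $nav) {
        $label = $stat.SelectSingleNode("span[@class='ProfileNav-label']").InnerText
        $count = $stat.SelectSingleNode("span[@class='ProfileNav-value']").GetAttribute('data-count')
        $dict.Add($label, [int]$count)
    }
}

# One object with a property per label, as in the function above
$result = [PSCustomObject]$dict
$result
```

Filtering on data-nav rather than on the link name means the Tweets entry is picked up the same way as the others.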
You should still use the API instead. Scraping uses a lot more bandwidth and server resources than API calls. It is also unreliable and likely to break whenever Twitter updates its web design. It may also violate Twitter's terms and conditions, which could get your account banned. Use at your own risk.
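A rough sketch of what the API route could look like, assuming the v1.1 application-only (bearer token) flow that was documented around the time of this thread. Endpoints and access rules may have changed since. Get-TwitterBasicAuthHeader is a hypothetical helper name, and the key and secret are placeholders for your own app credentials. The network calls are left commented out because they need real credentials.

```powershell
# Hypothetical helper: builds the Basic auth header used to request a bearer token.
function Get-TwitterBasicAuthHeader {
    param(
        [Parameter(Mandatory=$true)][string]$ConsumerKey,
        [Parameter(Mandatory=$true)][string]$ConsumerSecret
    )
    # URL-encode each part, join with ':', then Base64-encode the pair
    $pair  = '{0}:{1}' -f [uri]::EscapeDataString($ConsumerKey), [uri]::EscapeDataString($ConsumerSecret)
    $basic = [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($pair))
    @{ Authorization = "Basic $basic" }
}

# Placeholder credentials, for illustration only
$headers = Get-TwitterBasicAuthHeader -ConsumerKey 'key' -ConsumerSecret 'secret'

# Assumed v1.1 calls (not executed here):
# $token = Invoke-RestMethod -Method Post -Uri 'https://api.twitter.com/oauth2/token' `
#              -Headers $headers -Body @{ grant_type = 'client_credentials' }
# $user  = Invoke-RestMethod -Uri 'https://api.twitter.com/1.1/users/show.json?screen_name=PowerShell_Team' `
#              -Headers @{ Authorization = "Bearer $($token.access_token)" }
# $user.followers_count
```

A missing or malformed token request is one possible cause of a "bad request" response, so it may be worth checking the header and grant_type body first.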
- Marked as answer by Daniel Mercourios Thursday, July 5, 2018 6:51 PM
Wednesday, July 4, 2018 8:25 PM
All replies
-
Please carefully review the following links to set your expectation for posting in technical forums.
This Forum is for Scripting Questions Rather than script requests
- Script Gallery.
- Script Center
- Learn PowerShell
- Script requests
- PowerShell Documentation
- PowerShell Style Guidelines
- Posting guidelines
- Handy tips for posting to this forum
- How to ask questions in a technical forum
- Rubber duck problem solving
- How to write a bad forum post
- Help Vampires: A Spotter's Guide
- This forum is for scripting questions rather than script requests
\_(ツ)_/
Tuesday, July 3, 2018 7:52 AM -
Ok, sorry about that.
I have a script that grabs followers and likes from twitter accounts:
$typeOfActivity = @{
    0 = 'Tweets'
    1 = 'Following'
    2 = 'Followers'
    3 = 'Likes'
}
$WebRequest = Invoke-WebRequest https://twitter.com/$($account.name)
foreach ($activity in $typeOfActivity.GetEnumerator())
{
    ($WebRequest.ParsedHtml.body.getElementsByTagName('span') | Where {$_.getAttributeNode('class').Value -eq 'ProfileNav-value'}).innertext[$activity.Name]
}
The problem is when the amount of followers is very high and is typed like "100K". It doesn't give me an exact number. Does anybody know how to pull the exact amount which is the value of "data-count" in the span class?
Tuesday, July 3, 2018 8:41 PM -
I gave it a try and I have this :
$twitter = Invoke-WebRequest -Uri https://twitter.com/ExpertsExchange
[string]$str = $twitter.AllElements |
    where {$_.Class -eq "ProfileCanopy ProfileCanopy--withNav ProfileCanopy--large js-variableHeightTopBar"} |
    select -ExpandProperty outertext
[Regex]$reg = "(\d{1,3}.\d{1,3}|\d{1,3}.\d{1,3}.\d{1,3}) (mil|k) (Siguiendo|Following)"
$matches = $reg.Matches($str)
if($matches.Count -gt 0){
    foreach($match in $matches){
        $match.value
    }
}
The result is something like this: 11,1 mil Siguiendo
If you need to double-check, test the regex against the English version of the page.
- Proposed as answer by j0rt3g4 Wednesday, July 4, 2018 12:02 AM
Tuesday, July 3, 2018 11:44 PM -
Very nice, Leif-Arne, thank you. That worked for where nameprop equals followers, following and likes but not tweets. Don't know why. Is there a way to grab tweets as well...?
- Edited by Daniel Mercourios Wednesday, July 4, 2018 8:05 PM
Wednesday, July 4, 2018 7:42 PM -
Thanks, very nice. I'll look into the API, but I ran into a "bad request" error, so I guess there's a bit of a process involving access tokens and accounts. For now, your function works very well and I'm aware of the caveats. Again, thanks!
Thursday, July 5, 2018 6:52 PM