Downloading and reading web pages from the command line is a fascinating exercise. For Linux users, the wget command is a heavenly gift. On Windows, things get a little more hectic: you can open web pages from cmd, but actually working with the data is time consuming. So this article is about how to open and read web pages from PowerShell.
The steps included in this process are:
- Open the webpage
- Extract HTML Title, Description, Keywords
- Avoid URLs Matching Any of a Set of Patterns
- Setting a Maximum Response Size
- Setting a Maximum URL Length
- Using the Disk Cache
- Crawling the Web
- Get Referenced Domains
- GetBaseDomain
- Must-Match Patterns
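As a taste of the "extract title, description, keywords" step, here is a minimal sketch using Invoke-WebRequest (PowerShell 3+) and regular expressions. The URL is just a placeholder, and the regexes assume simple, double-quoted meta tags:

```powershell
# Minimal sketch: pull title, description and keywords out of a page's HTML.
# Assumes PowerShell 3+ and network access; example.com is a placeholder URL.
$response = Invoke-WebRequest -Uri 'http://example.com'
$html = $response.Content

$title       = [regex]::Match($html, '<title>(.*?)</title>').Groups[1].Value
$description = [regex]::Match($html, '<meta\s+name="description"\s+content="(.*?)"').Groups[1].Value
$keywords    = [regex]::Match($html, '<meta\s+name="keywords"\s+content="(.*?)"').Groups[1].Value

$title
$description
$keywords
```

A real HTML parser would be more robust, but for a quick look at a page this is often enough.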
Now let's start with the commands.
Start --> Fast Search Server 2010 for SharePoint (right-click --> Run as Administrator)
The Short Version
Add the SharePoint PowerShell cmdlets
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
Create and configure the content source (enter a URL that won't mind being crawled; perhaps your blog page?)
$contentSSA = "FASTContent"
$startaddress = [enter a URL here]
$contentsourcename = "Web site crawl"
$contentsource = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSSA -Type Web -name $contentsourcename -StartAddresses $startaddress -MaxSiteEnumerationDepth 0
Start the crawl
$contentsource.StartFullCrawl()
$contentsource.CrawlStatus
Keep executing $contentsource.CrawlStatus until the status changes to CrawlCompleting and then Idle
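Instead of re-typing $contentsource.CrawlStatus by hand, you can poll it in a loop (a sketch; the ten-second interval is an arbitrary choice):

```powershell
# Poll the crawl status until it settles at Idle.
while ($contentsource.CrawlStatus -ne 'Idle') {
    Start-Sleep -Seconds 10
    $contentsource.CrawlStatus
}
```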
Execute a search
The Long Version
Again, there isn't much reason to go over all the steps, as they don't really change from run to run. So let's clarify a few things.
$contentsource = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSSA -Type Web -name $contentsourcename -StartAddresses $startaddress -MaxSiteEnumerationDepth 0
It is interesting to note that the New-SPEnterpriseSearchCrawlContentSource cmdlet defaults to the Custom crawl rule, which reads all pages and all links found at the starting URL. We set MaxSiteEnumerationDepth to zero, which makes the crawler read only the content at the site we started at, rather than letting it go into ADD mode, becoming easily distracted and chasing down every car that goes by.
Another method :
(New-Object System.Net.WebClient).DownloadFile($url, $localFileName)
In PowerShell v3, the Invoke-WebRequest cmdlet:
Invoke-WebRequest -Uri $url -OutFile $localFileName
Another option is with the Start-BitsTransfer cmdlet:
Start-BitsTransfer -Source $source -Destination $destination
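For all three download methods the variables are just placeholders; a usage sketch might look like this (the URL and local path below are assumptions, not anything the article prescribes):

```powershell
# Placeholder values for the three download examples above.
$url           = 'http://example.com/file.zip'
$localFileName = "$env:TEMP\file.zip"

# WebClient (works on PowerShell 2+)
(New-Object System.Net.WebClient).DownloadFile($url, $localFileName)

# Invoke-WebRequest (PowerShell 3+)
Invoke-WebRequest -Uri $url -OutFile $localFileName

# BITS (requires the BitsTransfer module)
Import-Module BitsTransfer
Start-BitsTransfer -Source $url -Destination $localFileName
```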
There are at least four (not two) ways to open a URL in the default browser from PowerShell.
1. Run the browser's exe file with our URL as a parameter.
How do we get the exe file path of the default browser?
Function Get-DefaultBrowserPath {
    # Get the default browser path from the registry
    New-PSDrive -Name HKCR -PSProvider Registry -Root HKEY_CLASSES_ROOT | Out-Null
    $browserPath = ((Get-ItemProperty 'HKCR:\http\shell\open\command').'(default)').Split('"')[1]
    return $browserPath
}
Call it:
Get-DefaultBrowserPath
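Once you have the browser path, launching a URL with it might look like this (a sketch using Start-Process; the URL is just an example):

```powershell
# Open a URL with the default browser found above.
$browser = Get-DefaultBrowserPath
Start-Process -FilePath $browser -ArgumentList 'http://www.gurucore.com'
```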
Simplest way:
just type start 'http://www.gurucore.com' in PowerShell or cmd.
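Equivalently, Start-Process will hand a URL straight to the default browser without looking up the exe path at all:

```powershell
# Start-Process with a URL delegates to the default browser.
Start-Process 'http://www.gurucore.com'
```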
