Updating .htaccess file based on Apache log files

I am still seeing massive amounts of referral traffic hitting my site and eating up my bandwidth. I did not have time to update my .htaccess file for the last two days, and within the last 24 hours I have had more than 6,000 hits, generating almost 24,000 page views and more than 1 GB of traffic (so at that rate I will reach my 10 GB limit soon).

Looking through the Apache logs, figuring out which sites send me the most referral traffic, getting the hostnames, and transforming them into a format that the Apache rewrite engine in the .htaccess file can use has been time consuming. So I decided that some PowerShell magic might speed up the process a bit.

[ps]
function Select-FileDialog
{
    param(
        [string]$Title,
        [string]$Directory,
        [string]$Filter = "All Files (*.*)|*.*")
    # Load the Windows Forms assembly so the standard Open File dialog is available
    [System.Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms") | Out-Null
    $objForm = New-Object System.Windows.Forms.OpenFileDialog
    $objForm.InitialDirectory = $Directory
    $objForm.Filter = $Filter
    $objForm.Title = $Title
    $Show = $objForm.ShowDialog()
    If ($Show -eq "OK")
    {
        Return $objForm.FileName
    }
    Else
    {
        Write-Error "Operation cancelled by user."
    }
}

#Function to create the Apache rewrite rules.

Function Create-Rewrite {
    Param (
        $Hostname
    )

    # Escape the dots in the hostname so they are treated literally in the rewrite condition
    $HtaRule = "RewriteCond %{HTTP_REFERER} ^http://$($Hostname.Replace('.', '\.')) [OR]"
    $script:BlockList += $HtaRule
}

Function add-htaccess {
    Param (
        $HtaRules
    )
    # Insert the new rule right after the "RewriteEngine" line, unless it is already in the file
    (Get-Content $htaccess) | ForEach-Object {
        $_
        if ($_ -match "RewriteEngine") {
            if (!(Select-String -SimpleMatch "$HtaRules" -Path $htaccess)) {
                $HtaRules
            }
        }
    } | Set-Content $tempFile
    Copy-Item $tempFile $htaccess
}

Function Upload-Ftp {
    Param (
        [Parameter(Position=0, Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [System.String]
        $FTPHost,
        [Parameter(Position=1)]
        [ValidateNotNull()]
        $File
    )
    $webclient = New-Object System.Net.WebClient
    $uri = New-Object System.Uri($FTPHost)

    "Uploading $File..."

    $webclient.UploadFile($uri, $File)
}

#Variables
$log = Select-FileDialog -Title "Select an Apache logfile"
$htaccess = "C:\Temp\.htaccess"
$tempFile = [IO.Path]::GetTempFileName()
$URLCount = 15
$FTPUsername = "Username"
$FTPPassword = "PassW0rd"

#Create list of sites to block
$script:BlockList = @()

#Get the list of URLs in the logfile, capturing each element into a different named capturing group

$urls = Select-String '^(?<client>\S+)\s+(?<auth>\S+\s+\S+)\s+\[(?<datetime>[^\]]+)\]\s+"(?:GET|POST|HEAD) (?<file>[^ ?"]+)\??(?<parameters>[^ ?"]+)? HTTP/[0-9.]+"\s+(?<status>[0-9]+)\s+(?<size>[-0-9]+)\s+"(?<referrer>[^"]*)"\s+"(?<useragent>[^"]*)"$' $log |
Select -Expand Matches | Foreach { $_.Groups["referrer"].value }

#Output statistics for the referrer hostnames (only show top 15)
$urls | group | ForEach -begin { $total = 0 } `
    -process { $total += $_.Count; $_ } | Sort Count | Select Count, Name |
    Add-Member ScriptProperty Percent { "{0,15:0.00}%" -f (100 * $this.Count / $total) } -PassThru | select -Last $URLCount

#Extract the base hostnames from the complete URLs, and output statistics to the screen.

$hosts = $urls | Select-String '\b[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&()*+,;=]+@)?(?<host>[a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&()*+,;=:]+\])' |
    Select -Expand Matches | Foreach { $_.Groups["host"].value } | group | sort count | where { ($_.Name -notlike "*xipher.dk*") -and ($_.Count -gt 100) } |
    ForEach -begin { $total = 0 } `
    -process { $total += $_.Count; $_ } | Sort Count | Select Count, Name |
    Add-Member ScriptProperty Percent { "{0,10:0.00}%" -f (100 * $this.Count / $total) } -PassThru

Write-Host "List of root hostnames"

$hosts

Foreach ($Url in $hosts) {
    Create-Rewrite $Url.Name
}

Foreach ($Block in $script:BlockList) {
    add-htaccess $Block
}

notepad $htaccess

$script:BlockList

Upload-Ftp -FTPHost "ftp://$($FTPUsername):$($FTPPassword)@xipher.dk/httpdocs/.htaccess" -File $htaccess
Upload-Ftp -FTPHost "ftp://$($FTPUsername):$($FTPPassword)@xipher.dk/httpdocs/WordPress/.htaccess" -File $htaccess
[/ps]

Unfortunately my current hosting company does not allow me to download the log files via FTP; I have to connect to the Parallels interface and download them manually. (I have not had time to look into automating this part yet, so it is still a manual step.)
That is why I added a little function that uses a GUI to pick the access_log file.

[ps]
function Select-FileDialog
{
param(
[string]$Title,
[string]$Directory,
[string]$Filter=“All Files (*.*)|*.*")
[System.Reflection.Assembly]::LoadWithPartialName(“System.Windows.Forms”) | Out-Null
$objForm = New-Object System.Windows.Forms.OpenFileDialog
$objForm.InitialDirectory = $Directory
$objForm.Filter = $Filter
$objForm.Title = $Title
$Show = $objForm.ShowDialog()
If ($Show -eq “OK”)
{
Return $objForm.FileName
}
Else
{
Write-Error “Operation cancelled by user.”
}
}
[/ps]

I then call the function like this:

[ps]
$log = Select-FileDialog -Title "Select an Apache logfile"
[/ps]

A little regex magic runs through the logfile and captures the different elements into named capturing groups. In this step I also expand the referrer value of every request and collect the results in the $urls variable:

[ps]
$urls = Select-String '^(?<client>\S+)\s+(?<auth>\S+\s+\S+)\s+\[(?<datetime>[^\]]+)\]\s+"(?:GET|POST|HEAD) (?<file>[^ ?"]+)\??(?<parameters>[^ ?"]+)? HTTP/[0-9.]+"\s+(?<status>[0-9]+)\s+(?<size>[-0-9]+)\s+"(?<referrer>[^"]*)"\s+"(?<useragent>[^"]*)"$' $log |
Select -Expand Matches | Foreach { $_.Groups["referrer"].value }
[/ps]
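
To see what the named groups capture, here is a quick sanity check against a single made-up log line in combined format (the IP address, paths, and referrer below are invented purely for illustration):

[ps]
# A made-up combined-format log line, for illustration only:
$sample = '203.0.113.7 - - [10/Feb/2011:13:55:36 +0100] "GET /index.php HTTP/1.1" 200 2326 "http://spammy-example.com/some-page" "Mozilla/5.0"'

if ($sample -match '^(?<client>\S+)\s+(?<auth>\S+\s+\S+)\s+\[(?<datetime>[^\]]+)\]\s+"(?:GET|POST|HEAD) (?<file>[^ ?"]+)\??(?<parameters>[^ ?"]+)? HTTP/[0-9.]+"\s+(?<status>[0-9]+)\s+(?<size>[-0-9]+)\s+"(?<referrer>[^"]*)"\s+"(?<useragent>[^"]*)"$') {
    $Matches['referrer']   # http://spammy-example.com/some-page
    $Matches['status']     # 200
}
[/ps]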
I modified a script by Joel Bennett to get a little statistics as well. Since there can be thousands of hostnames, I have chosen to output only the top 15 by default (using the $URLCount variable).

[ps]
$urls | group | ForEach -begin { $total = 0 } `
    -process { $total += $_.Count; $_ } | Sort Count | Select Count, Name |
    Add-Member ScriptProperty Percent { "{0,15:0.00}%" -f (100 * $this.Count / $total) } -PassThru | select -Last $URLCount
[/ps]

Then I loop through all the hostnames and extract the base domain name, using regex again. (Here I choose to ignore all traffic from my own domain, xipher.dk, and I only look at referral domains that have generated 100 hits or more.)

[ps]
$hosts = $urls | Select-String '\b[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&()*+,;=]+@)?(?<host>[a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&()*+,;=:]+\])' |
    Select -Expand Matches | Foreach { $_.Groups["host"].value } | group | sort count | where { ($_.Name -notlike "*xipher.dk*") -and ($_.Count -gt 100) } |
    ForEach -begin { $total = 0 } `
    -process { $total += $_.Count; $_ } | Sort Count | Select Count, Name |
    Add-Member ScriptProperty Percent { "{0,10:0.00}%" -f (100 * $this.Count / $total) } -PassThru
[/ps]
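
As a quick illustration, fed a made-up referrer URL like the one below, the host group captures just the hostname part:

[ps]
# Invented referrer URL, for illustration only:
$referrer = 'http://sub.spammy-example.com/landing?page=1'
if ($referrer -match '\b[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&()*+,;=]+@)?(?<host>[a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&()*+,;=:]+\])') {
    $Matches['host']   # sub.spammy-example.com
}
[/ps]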

The script expects to find a .htaccess file in c:\temp containing at least the following two lines:

RewriteEngine On
RewriteRule (.*) http://%{REMOTE_ADDR}/$ [R=301,L]
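
With a couple of offenders blocked (the hostnames below are invented), the uploaded file would end up looking something like this, with each RewriteCond line generated by Create-Rewrite inserted right after the RewriteEngine line:

RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://spammy-example\.com [OR]
RewriteCond %{HTTP_REFERER} ^http://other-example\.net [OR]
RewriteRule (.*) http://%{REMOTE_ADDR}/$ [R=301,L]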
