
Internet content filtering in Linux

27 January, 2009

Posted by aronzak in Linux, Mozilla Firefox, web.

Say you want to use content filtering software, for whatever reason. You’ve probably heard of, or used, a few tools made specifically for Windows. Here are some solutions for Linux:

1. Dansguardian

Dansguardian works best with the Squid proxy. It does some fairly advanced URL filtering: a URL containing two listed words together is blocked, as are media URLs (pictures and video) containing listed words. This is fairly clever, but has some problems. It has sometimes blocked Google for me, because URLs with random strings of characters (used to track which search results are clicked on) can happen to match.
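
To show how a random tracking string can trip this kind of URL check, here is a rough Python sketch. It is not Dansguardian's actual code: the word list, the function names and the "two different words" rule are my assumptions, based only on the description above.

```python
# Hypothetical word list; real lists are far longer.
LISTED_WORDS = {"sex", "xxx", "nude"}


def listed_words_in_url(url, words=LISTED_WORDS):
    """Return the listed words that occur anywhere in the URL, as substrings."""
    lowered = url.lower()
    return {word for word in words if word in lowered}


def url_blocked(url, words=LISTED_WORDS, minimum=2):
    """Block when at least `minimum` different listed words appear in the URL."""
    return len(listed_words_in_url(url, words)) >= minimum
```

With this rule, a made-up redirect link such as http://www.google.com/url?q=http://example.org/&ei=sExXxa9 is blocked, because the random token at the end happens to contain two listed words, while a plain search URL passes.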

Content filtering is done with words and phrases that carry scores, which can be positive or negative. Each site then gets a total score based on its content. For example, the word ‘breast’ is bad, but ‘breast cancer’ is a good phrase. In theory, this should limit the amount of overblocking. Dansguardian also filters against anonymous web proxies, which could otherwise be used to bypass filtering.

Sites are blocked if their score is over the ‘naughtiness’ threshold as defined in /etc/dansguardian/dansguardianf1.conf.
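
The scoring idea can be sketched in a few lines of Python. This is only an illustration: the phrases, weights and limit below are invented, and counting every occurrence of a phrase is my assumption, not necessarily what Dansguardian does.

```python
# Hypothetical weights and limit, for illustration only.
PHRASE_WEIGHTS = {
    "breast": 40,          # a 'bad' word adds to the score
    "breast cancer": -60,  # a 'good' phrase subtracts from it
    "casino": 30,
}
NAUGHTINESS_LIMIT = 50


def page_score(text, weights=PHRASE_WEIGHTS):
    """Sum weight * occurrences for every listed phrase found in the text."""
    lowered = text.lower()
    return sum(weight * lowered.count(phrase) for phrase, weight in weights.items())


def is_blocked(text, limit=NAUGHTINESS_LIMIT, weights=PHRASE_WEIGHTS):
    """A page is blocked only when its total score is over the limit."""
    return page_score(text, weights) > limit
```

A page about "breast cancer screening" matches both ‘breast’ and ‘breast cancer’, so the negative weight outweighs the positive one and the page stays under the limit.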

Dansguardian is meant to be used in a public sector network, such as a school or library. By default, it blocks many downloads that could contain viruses or filtering circumvention software. In a home setup, this is just irritating. To stop this, blank out the configuration file that controls blocking files based on extensions: echo "" > /etc/dansguardian/lists/bannedextensionlist

Installation instructions

Dansguardian needs to be used with a web proxy. The installation of Dansguardian itself is fairly easy, with Dansguardian filtering on one port. Getting it to filter the entire connection is more complicated.

1. Install the squid and dansguardian packages.

apt-get install dansguardian squid

2. Edit /etc/dansguardian/dansguardian.conf and remove the line that says “UNCONFIGURED”.
3. Start the dansguardian daemon.

/etc/init.d/dansguardian start

4. In Firefox, open Edit -> Preferences -> “Advanced” tab -> “Network” tab -> “Settings” button. Set it to manual proxy configuration. Put 127.0.0.1 (localhost) in the HTTP Proxy box and 8080 in the port box.
5. Try to connect to google.com to verify you can still connect to the internet.

Finally, check that the filter is working by confirming that the network traffic is being logged. Open the file /var/log/dansguardian/access.log.

cat /var/log/dansguardian/access.log

There should be an entry there saying google.com. If the file doesn’t exist, something isn’t set up right.
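
If you would rather script this check than click through Firefox, something like the following Python sketch should do. The function names are mine; the proxy address, port and log path are the ones used in the steps above, so adjust them if your setup differs.

```python
import urllib.request


def proxy_opener(host="127.0.0.1", port=8080):
    """Build a urllib opener that sends HTTP requests through the filter."""
    proxy = urllib.request.ProxyHandler({"http": f"http://{host}:{port}"})
    return urllib.request.build_opener(proxy)


def log_mentions(log_path, needle):
    """Return True if any line of the log file contains `needle`."""
    try:
        with open(log_path, encoding="utf-8", errors="replace") as log:
            return any(needle in line for line in log)
    except FileNotFoundError:
        # No log file at all means something isn't set up right.
        return False


def check_filter(url="http://google.com/", needle="google.com",
                 log_path="/var/log/dansguardian/access.log"):
    """Fetch `url` through the proxy, then look for it in the access log."""
    proxy_opener().open(url, timeout=10).read()
    return log_mentions(log_path, needle)
```

Calling check_filter() needs the daemon running and read access to the log, so you will probably have to run it as root.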

2. Willow Content Filter

Willow adopts the novel concept of using Bayesian analysis to filter the web. Bayesian analysis is currently used by most spam filters. The concept is that you have a ‘good’ and a ‘bad’ sample, and the filter can find ‘spamminess’ as a percentage match to the samples. Unfortunately, spammers attempt to make this difficult by inserting ‘normal’ text into spam. Unlike spammers, most adult website owners are probably broadly supportive of efforts to filter the internet more effectively, as evidenced by the existence of voluntary labelling efforts.

In theory, Bayesian web filtering should work better than the more rigid score-based system, with less overblocking or underblocking. Not only does it use words in the title, meta tags and body of a page, but it also analyses the structure of the page itself. The system may also run faster than Dansguardian.
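
To make the ‘good’ and ‘bad’ sample idea concrete, here is a tiny naive Bayes sketch in Python. It is not Willow's code: the class and function names are hypothetical, it only looks at words (not page structure), and real samples would be much larger.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


class TinyBayes:
    def __init__(self, good_samples, bad_samples):
        self.counts = {
            "good": Counter(t for s in good_samples for t in tokens(s)),
            "bad": Counter(t for s in bad_samples for t in tokens(s)),
        }
        self.docs = {"good": len(good_samples), "bad": len(bad_samples)}
        self.vocab = set(self.counts["good"]) | set(self.counts["bad"])

    def _log_likelihood(self, label, words):
        counts = self.counts[label]
        total = sum(counts.values()) + len(self.vocab)  # add-one smoothing
        prior = math.log(self.docs[label] / sum(self.docs.values()))
        return prior + sum(math.log((counts[w] + 1) / total) for w in words)

    def badness(self, text):
        """Probability (0 to 1) that the text belongs with the 'bad' sample."""
        words = tokens(text)
        diff = self._log_likelihood("good", words) - self._log_likelihood("bad", words)
        if diff > 700:  # avoid overflow in exp(); the text is clearly 'good'
            return 0.0
        return 1 / (1 + math.exp(diff))
```

A filter built this way would block a page when badness() goes over some cut-off, say 0.9; that cut-off is my assumption too.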

One issue this creates is that samples of content must be provided, of both the good and the bad kind. Currently, these are not encrypted or obfuscated. This creates some potential legal and moral hurdles in distributing and using Willow.

Installation instructions

1. Download willow

2. Extract it to /var

3. Edit /var/willow/willow.conf and remove ‘exefilter’

4. You might need to install some more software. Try installing python-profiler and python-central.

5. That should be it. Run /var/willow/willow.py --config=/var/willow/willow.conf

6. Set up Firefox as above, to port 8000 (or whatever it is set to in /var/willow/willow.conf)

If that doesn’t work, edit the configuration some more. Unfortunately there doesn’t seem to be much support for Willow right now.

3. Firefox Extensions

While not as effective, Procon is extremely easy to install. Foxfilter is another filtering extension, but I find it a little slower and clunkier. If Firefox is your only browser, this is an easy option.

4. MintNanny

Linux Mint has introduced a novel way to prevent domains from being accessed: MintNanny modifies the /etc/hosts file so that requests for a blocked domain are sent to 0.0.0.0. This is a neat approach as it does not require any extra software to be set up but, unless you are going to subscribe to a domain blacklist, it is relatively ineffective. Most routers will give you this kind of simple filtering anyway.

If you want to give this a go, you don’t need the MintNanny frontend. Just get in there and edit /etc/hosts yourself. You can even redirect to another IP. Just add in a site you really don’t want to see, as in the example.

209.85.171.100          microsoft.com   microsoft

The IP here is one for google.com. Neat. You could also put in the IP of your own server.
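
If you do get hold of a domain blacklist, a few lines of Python can turn it into hosts entries. This is just a sketch; the function name is made up, and adding a www. alias for every domain is my assumption about what you would want.

```python
def hosts_entries(domains, target="0.0.0.0"):
    """Return /etc/hosts lines pointing each domain (and its www. alias) at `target`."""
    lines = []
    for domain in domains:
        domain = domain.strip().lower()
        if not domain or domain.startswith("#"):
            continue  # skip blank lines and comments in the blacklist
        lines.append(f"{target}\t{domain}\twww.{domain}")
    return lines
```

Print the result and append it to /etc/hosts as root. Pass a different target if you want the domains sent to your own server instead.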


So there you have it. If you are setting up a non-home network, you’ll probably want to filter transparently. This is complicated to set up and involves editing the configuration of iptables or your proxy. Good luck. Below are some extras that you might want to use if you do use Dansguardian or Willow.

You may want to edit the page that Dansguardian shows when a resource is blocked to give you the full reason. This can be quite long, so I stuck it in a box.

	<font color=red>
	<b>-CATEGORIES-</b>
	<font color=black>
	<br><br>
	<form action="/html/tags/html_form_tag_action.cfm" method="post">Full Reason:<br />
		<textarea style="width:500px;height:100px;background-color:#FF9900;">
		-REASONLOGGED-
		</textarea><br />
	</form>
	<br><br>

The logs Dansguardian gives contain a whole lot of sometimes irrelevant information. Below is a script to process a log file that uses the tab format, so that it is easier to read. In the future, it would be nice to work on a way of adding this to a database, and sorting it by domain.

#!/bin/bash
# Author: Aronzak
# License: GPL
# A script to process Dansguardian log files

# Use tab formatted access.log
# 1	Date Time
# 2
# 3	IP
# 4	URL
# 5	Full Denied report
# 6	GET/POST
# 7	File size
# 8	Score
# 9	Short Denied report
# 10	1
# 11	HTML error code
# 12	Type
# Edit the following variable to add/remove wanted fields in processed log:

DESIREDFIELDS="1,4,8,9"
LOG=/var/log/dansguardian/access.log
DEST=/var/log/dansguardian

# Keep only the wanted fields: once for every request, once for denied ones
cut -f "$DESIREDFIELDS" "$LOG" > "$DEST/full"
grep DENIED "$LOG" | cut -f "$DESIREDFIELDS" > "$DEST/denied"

# Compare with the previous run; diff lines starting with '<' are new denials
if [ -f "$DEST/old" ]; then
	diff "$DEST/denied" "$DEST/old" > "$DEST/diff"
	if grep -q '^<' "$DEST/diff"; then
		echo "bad"
	fi
fi
cp "$DEST/denied" "$DEST/old"
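
As a first step towards sorting the log by domain, here is a rough Python sketch that counts denied requests per host name. It assumes the tab format described in the script's comments, with the URL in field 4; the function name is my own.

```python
from collections import Counter
from urllib.parse import urlsplit


def denied_by_domain(lines, url_field=4):
    """Count DENIED log lines per host name. Fields are tab separated, 1-indexed."""
    counts = Counter()
    for line in lines:
        if "DENIED" not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < url_field:
            continue  # malformed or truncated line
        host = urlsplit(fields[url_field - 1]).hostname
        if host:
            counts[host] += 1
    return counts
```

Open /var/log/dansguardian/access.log, pass the file object to denied_by_domain(), and call most_common(10) on the result to see which domains get blocked most often.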

Willow has a minimal page displayed when a resource is blocked. I changed this to be more like the Dansguardian page. The page that is displayed is set three times, in urlfilter.py, domainfilter.py and contentfilter.py.

DEFAULTMSG = ('<html><head><title>Content Filtered</title></head>'
              '<body bgcolor=#FFFFFF><center>'
              '<table border=0 cellspacing=0 cellpadding=2 height=540 width=700>'
              '<tr>'
              '	<td colspan=2 bgcolor=#FEA700 height=100 align=center>'
              '	<font face=arial,helvetica size=6>'
              '	<b>Access has been Denied!</b>'
              '	</td>'
              '</tr><tr>'
              '	<td align=center valign=bottom width=150 bgcolor=#B0C4DE><font size=1 >'
              '	<a href="http://www.digitallumber.net/software/willow/" target="_blank">Willow Content Filter</a>'
              '	</td>'
              '	<td width=550 bgcolor=#FFFFFF align=center valign=center><font size=4>'
              '	Access has been denied.<br><br><br><br>'
              '	The content of the resource requested has been determined to be inappropriate<br><br>'
              '	If you have any queries contact your ICT Coordinator or Network Manager.'
              '	<br><br><br><br><br></tr></table></body></html>')

You might want to back up the files before editing them. Have fun.


Comments»

1. Cyber safety test « Aronzak’s Rantings - 11 February, 2009

[…] buy our products” still persists in 2009. If anyone is interested in real products, check out open source tools, or consider quality products such as those from  xxxchurch or […]

2. Linux for my Kids | Linux Niche - 13 February, 2009

[…] After I got all that software installed via APT I decided to try that Willow setup listed in the Wiki.  However, the install wasn’t as easy as I hoped from reading that wiki and ended up having to do it manually by following the steps here. […]

skanzariya - 18 February, 2010

Check out for NetCop. NetCop does content filter and fits for Home and Office environment. NetCop doesn’t require a Linux Geek or Network Administrator to configure. It is very easy to configure just like open an email account.

For more information about NetCop, please visit:
http://whiteway.in/netcop

or visit my blog:
http://bit.ly/9fqb0G

Thank You.
Suresh

3. maxinemckew - 22 October, 2010

Web proxies are the arteries that carry the life of youth -facebook and myspace- to every office and school :)

4. Squidblacklist (@Squidblacklist) - 27 August, 2014

Allow me to introduce a better blacklist, we are Squidblacklist.org, the worlds leading publisher of native acl blacklists tailored specifically for use with Squid proxy, as well as we also publish multiple alternative formats for all major third party plugins as well as many other filtering platforms, such as UFDBGuard and Barracuda Networks devices..

There is room for better blacklists, we intend to fill that gap.

It would be our pleasure to serve you.

Signed,

Benjamin E. Nichols
http://www.squidblacklist.org

5. desywagner - 1 January, 2015

Reblogged this on Developing and commented:
Straightforward setup of dansguardian explained. Might not be too difficult to set it up to send out accountability emails based on log entries…
