
Cleanfeed - Files and their function

Overview

Cleanfeed consists of a primary Perl script that resides in the pathfilter directory. After installation it shouldn't need any customisation unless your site policies are very different from the norm. Editing it is beyond the scope of these instructions.

Besides the main Perl script, there are a number of other files, all located by default in cleanfeed/etc. Most of these are plain-text lists, several contain Regular Expressions, and one (cleanfeed.local) is Perl. All files can be viewed or modified with a standard text editor.

File Descriptions

The following table provides a brief overview of the various files.
Filename File Type Description
bad_adult_paths Text Messages containing a Path string that matches an entry in this file will be rejected if the distribution contains a group intended for adult material. Adult groups are defined by the parameters adult_groups and not_adult_groups, which are covered in the configuration parameters section.
bad_body Regex A list of Perl Regular Expressions for comparison with the message body. Matching content will result in the message being rejected. This file is only intended for simple regex matching such as spammed URLs, email addresses or telephone numbers. More complex filters should be applied in cleanfeed.local, although a filter reload will be needed in order to activate changes applied to that file.
bad_url Regex This works exactly like bad_body but will only match URLs. This is achieved by prefixing all the filters in this file with "http://" or "www.". This file is intended to offer convenience over using bad_body whilst also removing the risk of filtering your entire feed by mistake.
bad_url_central Regex Identical functionality to bad_url. The "_central" suffix implies that the file can be downloaded from a central register of spammed URLs.
bad_cancel_paths Text Cleanfeed will reject cancel messages where the Path contains one of these plain-text strings.
bad_from Regex Messages will be rejected if the From header matches a Regular Expression defined in this file. Use with caution: these entries aren't sanity checked. If you tell Cleanfeed to filter every message, it will!
bad_hosts Text Messages with an NNTP-Posting-Host that matches one of the entries in this file will be rejected.
bad_hosts_central Text Exactly the same as bad_hosts, except this file is maintained and published by a central resource. If you don't want to use it, simply delete it and don't download the updates.
bad_paths Text As with bad_adult_paths, except bad_paths applies to messages posted to any groups, not just adult ones.
cleanfeed Perl The main filtering engine
cleanfeed.local Perl A configuration file for customisation of Cleanfeed's behaviour. Although just an example, this file provides a good starting point and its use for beginners is strongly recommended.
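
The prefixing described for bad_url can be pictured with a short sketch. This is not Cleanfeed's own loader: the helper name build_url_filter and the sample patterns are hypothetical, and it simply assumes that each entry is a regex which only counts when it follows "http://" or "www.".

```perl
use strict;
use warnings;

# Hypothetical helper: anchor every pattern to a URL prefix so that it
# cannot match ordinary body text.
sub build_url_filter {
    my @patterns = @_;
    my $alternation = join '|', map { "(?:$_)" } @patterns;
    return qr{(?:http://|www\.)(?:$alternation)};
}

# Sample entries (made up for this sketch).
my $filter = build_url_filter('spamsite\.example', 'pills-\d+\.example');

for my $body ('Visit http://spamsite.example/buy now',
              'I read about spamsite.example yesterday') {
    printf "%-45s %s\n", $body, ($body =~ $filter ? 'reject' : 'accept');
}
```

Because the prefix is required, a pattern that is accidentally too broad can only catch text that looks like a URL, which is the safety margin the bad_url description refers to.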

The "bad" Guys

All of the files prefixed with bad_ are automatically reloaded by Cleanfeed after a specified number of articles have been processed by the News server. The actual number of articles is defined in cleanfeed.local (see the bad_rate_reload option). They can also be reloaded immediately by issuing a filter reload instruction to INN.

cleanfeed.local

The cleanfeed.local file is constructed in Perl and enables customisation of the default Cleanfeed behaviour. The sample file provided with the package offers a good starting point to build upon. Changes to this file only come into effect when an INN filter reload instruction is issued.

The following table describes the various functions within cleanfeed.local:
Function Description
local_flag_localfeed In many news configurations, articles are fed to the server from sources that are considered local. The role of this function is to identify and tag these local articles. The operator should define one or more conditional rules that cause the function to return True if the source of the article is local. For example:
sub local_flag_localfeed {
    return 1 if $hdr{'X-Trace'} =~ /\.mydomain\.com/;
    return 1 if $hdr{Path} =~ /^not\-for\-mail$/;
    return 1 if $hdr{'Message-ID'} =~ /\@mydomain\.com>$/;
    return 0;
};
When True is returned, the Cleanfeed global variable $localfeed will be set True.
local_flag_spamsource Define a set of rules that return True if a posting originates from a news-service that is considered a frequent source of spam. One such example might be:
sub local_flag_spamsource {
    return 1 if $hdr{'Injection-Info'} =~ /googlegroups\.com/;
    return 1 if $hdr{Path} =~ /newsguy\.com!news\d$/;
    return 0;
};
When True is returned, the Cleanfeed global variable $spamsource will be set True. Besides user-defined functions, this variable is only used within Cleanfeed's scoring functions. Consequently if scoring is disabled, it will do nothing.
local_filter_first Instructions within this function are performed against normal (non-control) articles. They happen very early on in the filtering process prior to any binary or EMP (Excessive Multi-Post) checks.
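No example is given for this function, so here is a minimal sketch of what one might look like. The crosspost limit is a hypothetical site policy, and the %hdr declaration and reject() stub are stand-ins for what Cleanfeed supplies at run time; they exist only so the sketch runs on its own. It also assumes that an empty return lets processing continue, as the fall-through in the other examples suggests.

```perl
use strict;
use warnings;

# Stand-ins for what Cleanfeed provides (assumptions for this sketch only):
# the %hdr hash of article headers and the reject() helper.
our %hdr;
sub reject { my ($reason, $short) = @_; return $reason }

sub local_filter_first {
    # Hypothetical policy: refuse heavily crossposted articles early,
    # before the binary and EMP checks do any work.
    my @groups = split /\s*,\s*/, $hdr{Newsgroups} // '';
    return reject('Too many newsgroups', 'Crosspost') if @groups > 10;
    return;    # nothing to report; later checks still run
}

# Try it on a made-up article crossposted to twelve groups.
%hdr = (Newsgroups => join(',', map { "alt.test.$_" } 1 .. 12));
print local_filter_first() // 'accepted', "\n";
```

In a real cleanfeed.local only the local_filter_first sub would be needed; the declarations above it would be dropped.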
local_filter_bot Processed after the binary checks but before the EMP checks, filters within this function are intended to enhance the in-built bot signature scanning.
local_filter_after_emp As its name suggests, instructions within this function are processed after Cleanfeed's EMP checks. This implies that articles rejected by this function will still seed the EMP hashes.
local_filter_scoring This function is intended to return article scores that are appended to the score generated by the internal scoring filters. If the resulting score exceeds a defined threshold, the article is rejected. Scores applied within this function must be correctly formatted.

To negatively score an article:
**NiceArt
This will reduce the total score by 2.

To positively score an article:
!!!NastyArt
This will increase the total score by 3.

Here is an example that will deduct 1 from locally posted articles and add 2 to articles posted to groups with "spam" in their name.
sub local_filter_scoring {
    my $score = '';
    $score .= '*LocalPost' if $localfeed;
    $score .= '!!SpamGrp' if $hdr{Newsgroups} =~ /spam/;
    return $score;
};
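
The arithmetic behind these score strings can be checked with a small helper. score_delta is hypothetical and is not part of Cleanfeed; it only illustrates the convention described above, counting each "!" as +1 and each "*" as -1, and says nothing about how Cleanfeed itself parses the string.

```perl
use strict;
use warnings;

# Hypothetical helper: net effect of a score string, counting each '!'
# as +1 and each '*' as -1. Label text is ignored.
sub score_delta {
    my ($tags) = @_;
    my $plus  = ($tags =~ tr/!//);    # tr with an empty replacement only counts
    my $minus = ($tags =~ tr/*//);
    return $plus - $minus;
}

printf "%-22s %+d\n", $_, score_delta($_)
    for '**NiceArt', '!!!NastyArt', '*LocalPost!!SpamGrp';
```

The last string is what the example function returns for a locally posted article in a group with "spam" in its name: one deduction and two additions, for a net change of +1.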
local_filter_last Called last, after articles have been processed by all the Cleanfeed internal filters. This is probably the best function to put general filters in. This example will log and reject articles with PLUGH in the Subject:
sub local_filter_last {
    if ($hdr{Subject} =~ /PLUGH/) {
        logart('plugh', 'Subject PLUGH');
        return reject('Subject contains PLUGH', 'PLUGH');
    };
};
local_filter_cancel This function isn't applied to normal messages, only Control Cancel ones. It runs after the internal Cleanfeed Cancel checks and offers a means to define local policies for the handling of Cancel messages. Some operators choose not to honour certain Cancels; this example rejects every locally posted one:
sub local_filter_cancel {
    # Reject all locally posted cancel messages.
    if ($localfeed) {
        logart('local.cancel', 'Local Cancel');
        return reject('Cancels forbidden');
    };
};
local_filter_control This function is called on messages containing a newgroup or rmgroup Control header, after the internal checks have been processed. Should the Cleanfeed internal checks cause a message to be rejected, processing will complete before this function is called. In these instances, local_filter_reject should be used for further actions, such as logging. All rejected Control messages have a short reason of "Bad control message". Example local_filter_control:
sub local_filter_control {
    # Log locally posted Control messages.
    if ($localfeed) {
        logart('ctrl_rmgroup', 'rmgroup', 271)
            if $hdr{Control} =~ /^\s*rmgroup/i;
        logart('ctrl_newgroup', 'newgroup', 271)
            if $hdr{Control} =~ /^\s*newgroup/i;
    };
};
local_filter_reject This function doesn't quite fit the mould of the others. Instructions within it are only processed on articles that have already been rejected. This is a good place to define log files for rejected articles. There are some examples in the sample cleanfeed.local.

In this example, we log rejected articles where the Message-ID contains Xyzzy.
sub local_filter_reject {
    logart('xyzzy', 'Message-ID Xyzzy') if $hdr{'Message-ID'} =~ /Xyzzy/;
};
local_config Unlike all the other functions, this one doesn't contain any filters. It's the place for adjusting Cleanfeed's internal defaults, such as the bad_rate_reload parameter described above. Details of all the adjustable parameters can be found in the configuration parameters section.
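As a rough sketch, a local_config that changes only bad_rate_reload might look like the following. The hash name %config_local is an assumption here, so check it against the sample cleanfeed.local shipped with your version. The value 5000 is purely illustrative.

```perl
use strict;
use warnings;

# Assumed name of the hash whose entries override Cleanfeed's defaults;
# declared here only so the sketch runs on its own.
our %config_local;

sub local_config {
    %config_local = (
        # Reload the bad_ files after this many articles (illustrative value).
        bad_rate_reload => 5000,
    );
    return;
}

local_config();
print "bad_rate_reload = $config_local{bad_rate_reload}\n";
```

As with the rest of cleanfeed.local, a change made here only takes effect after an INN filter reload.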