  SylFilter - a message filter
    
  Copyright (C) 2011 Hiroyuki Yamamoto <hiro-y@kcn.ne.jp>
  Copyright (C) 2011 Sylpheed Development Team
    
    
About This Program
==================

SylFilter is a generic message filter library with command-line tools.
It provides a Bayesian filter, a widely used spam filtering algorithm.
SylFilter is also internationalized and can be applied to any language.

The SylFilter library provides a simple but powerful C API and can be used
from C programs.

The SylFilter command-line tool can be used as a junk filter program, like
other major tools such as bogofilter and bsfilter.

SylFilter is free software distributed under a BSD-like license.
See COPYING for details.
    
    
Install
=======

This program requires GLib and a key-value store engine. Install them before
building. Currently SQLite (enabled by default), QDBM, and GDBM are supported
as key-value store engines.

  $ ./configure
  ( $ ./configure --disable-sqlite --enable-qdbm (enables QDBM) )
  ( $ ./configure --disable-sqlite --enable-gdbm (enables GDBM) )

  $ make
  $ sudo make install
    
    
Usage
=====

SylFilter accepts rfc822 message files (for example: MH, Maildir, eml).

Learning junk mail

  $ sylfilter -j ~/Mail/junk/*

Learning clean mail

  $ sylfilter -c ~/Mail/clean/*

Classifying mail

  $ sylfilter ~/Mail/inbox/1234

Showing the learning status

  $ sylfilter -s

Showing the learning status and all learned tokens

  $ sylfilter -s -v

Showing the help message

  $ sylfilter -h
  $ sylfilter --help
    
    
Usage with Sylpheed
===================

In 'Common preferences... - Junk mail - Learning command:', manually set
each command as follows:

Junk                : sylfilter -j
Not Junk            : sylfilter -c
Classifying command : sylfilter
    
    
Other information
=================

Token database files are created under ~/.sylfilter/ .
(On Windows: %APPDATA%\SylFilter\)
    
    
Library Design
==============

Filtering in SylFilter is performed by a chain of simple filter modules.

         (Learning)                   (Classifying)

        rfc822 message                rfc822 message
              |                             |
   [ text content filter ]       [ text content filter ]
              |                             |
  [ word separator filter ]       [ blacklist filter ]  --> spam
              |                             |
      [ n-gram filter ]         [ word separator filter ]
              |                             |
     [ learning filter ]            [ n-gram filter ]
                                            |
                                   [ bayesian filter ]  --> spam

Library users can combine the provided filters arbitrarily and can also add
their own custom filters.

See src/sylfilter.c for an example of library usage.
    
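The classifying chain in the diagram can be pictured as an array of filter
callbacks that run in order until one of them reaches a verdict. The sketch
below only illustrates that idea. Every name in it (demo_verdict,
demo_filter_fn, demo_run_chain and the two stand-in filters) is hypothetical
and is NOT the SylFilter API; src/sylfilter.c shows the real calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical verdicts; not SylFilter's real status codes. */
typedef enum { DEMO_CONTINUE, DEMO_JUNK, DEMO_CLEAN } demo_verdict;

/* A filter inspects the message text and either decides or passes it on. */
typedef demo_verdict (*demo_filter_fn)(const char *text);

/* Stand-in for a blacklist filter: flags text containing a fixed phrase. */
static demo_verdict demo_blacklist_filter(const char *text)
{
    return strstr(text, "buy now") ? DEMO_JUNK : DEMO_CONTINUE;
}

/* Stand-in for a final classifier that always decides "clean". */
static demo_verdict demo_default_filter(const char *text)
{
    (void)text;
    return DEMO_CLEAN;
}

/* Run filters in order; the first verdict other than DEMO_CONTINUE wins. */
static demo_verdict demo_run_chain(const demo_filter_fn *chain, size_t n,
                                   const char *text)
{
    assert(chain != NULL && text != NULL);
    for (size_t i = 0; i < n; i++) {
        demo_verdict v = chain[i](text);
        if (v != DEMO_CONTINUE)
            return v;
    }
    return DEMO_CONTINUE; /* no filter reached a decision */
}
```

With a chain of { demo_blacklist_filter, demo_default_filter }, a message
containing "buy now" stops at the first filter, which mirrors the early
"--> spam" exit of the blacklist filter in the diagram.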

    
Algorithm of Bayesian Filter
============================

SylFilter implements Fisher's method as described by Gary Robinson.
The same method is also implemented by bogofilter and bsfilter.

  http://radio-weblogs.com/0101454/stories/2002/09/16/spamDetection.html
  http://www.bgl.nu/bogofilter/fisher.html

SylFilter initially implemented a customized version of the algorithm
described by Paul Graham.

  http://paulgraham.com/spam.html
  http://paulgraham.com/better.html

The Robinson-Fisher method is used by default.

Basically, the algorithm can be described as follows:

1. Count the number of occurrences of each word in spam and non-spam
   messages.
2. For each word in a message, calculate the probability that a message
   containing it is spam.
3. Calculate the combined probability using the important words in the
   message.

See the web pages above for details.