Over the last decade, many data mining techniques have been proposed for various
knowledge discovery tasks, with the goal of retrieving useful information for
users. These techniques generate various types of patterns, such as sequential
patterns, frequent itemsets, closed patterns, and maximal patterns. However, how to
effectively exploit the discovered patterns is still an open research issue, especially in the
domain of text mining. Most text mining methods adopt a keyword-based approach,
constructing text representations that consist of single words or terms, whereas other
methods use phrases instead of keywords, on the hypothesis that a phrase carries
more information than a single term. Nevertheless,
these phrase-based methods have not yielded significant improvements, because
high-frequency patterns (normally the shorter ones) usually have high exhaustivity
but low specificity, while the more specific patterns suffer from the low-frequency
problem. The Pattern Taxonomy Model (PTM) is a pattern-based method that adopts
sequential pattern mining and uses closed patterns as features in the text
representation. PTM maps discovered patterns into a hypothesis space to address
the low-frequency problem affecting long, specific patterns.
However, information from negative examples has not been adequately exploited during
the concept learning phase of a PTM-based system, and the discovered patterns need to
be re-evaluated and evolved using this information. Therefore, this project aims to
develop an effective and efficient approach to pattern evolution that overcomes this
problem. The proposed system will be evaluated on real knowledge discovery tasks,
and the experimental results will be compared with those of existing
methods.
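
To make the notion of closed sequential patterns used above more concrete, the following is a minimal sketch and not the PTM implementation. It assumes each paragraph is a list of terms, enumerates ordered subsequences by brute force (practical only for toy data), and treats a pattern as closed when no longer frequent pattern containing it has the same support. The function names, the `max_len` cap, and the sample paragraphs are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_subsequence(short, long):
    """Return True if `short` occurs in `long` in order (gaps allowed)."""
    it = iter(long)
    return all(term in it for term in short)


def frequent_sequential_patterns(paragraphs, min_support, max_len=3):
    """Count, per pattern, how many paragraphs contain it as a subsequence.

    Brute-force enumeration; `max_len` is an assumed cap that keeps it small.
    """
    counts = Counter()
    for para in paragraphs:
        seen = set()  # each pattern is counted at most once per paragraph
        for n in range(1, max_len + 1):
            for idx in combinations(range(len(para)), n):
                seen.add(tuple(para[i] for i in idx))
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}


def closed_patterns(frequent):
    """Keep patterns that no longer pattern with equal support contains.

    Patterns at the `max_len` cap are only checked against mined patterns.
    """
    closed = {}
    for p, sup in frequent.items():
        absorbed = any(
            len(q) > len(p) and q_sup == sup and is_subsequence(p, q)
            for q, q_sup in frequent.items()
        )
        if not absorbed:
            closed[p] = sup
    return closed


# Hypothetical toy paragraphs, for illustration only.
paragraphs = [
    ["data", "mining", "pattern"],
    ["data", "mining", "text"],
    ["data", "pattern"],
]
frequent = frequent_sequential_patterns(paragraphs, min_support=2)
closed = closed_patterns(frequent)
```

In this toy data the single term "data" has the highest support, while the longer patterns are more specific but rarer, which is the trade-off between exhaustivity and specificity described above.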