edu.isi.mavuno.input
Class TagTokenizer

java.lang.Object
  extended by edu.isi.mavuno.input.TagTokenizer

public class TagTokenizer
extends Object

This class processes document text into tokens that can be indexed.

The text may contain HTML/XML tags. The tokenizer extracts as much data as possible from each document, even when the markup is not well formed (e.g. start tags with no matching end tags). The resulting document object contains an array of terms and an array of tags.
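
The lenient handling of malformed markup described above can be illustrated with a small, self-contained sketch. It is not the Mavuno implementation. The class LenientTagSplitter, its Result type, and the choices to lowercase terms, split on non-alphanumeric characters, and record only opening tag names are all assumptions made for illustration. It shows one way to collect terms and tags in a single pass without requiring that every start tag has a matching end tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not the Mavuno TagTokenizer. */
final class LenientTagSplitter {

    /** Terms and opening-tag names collected from one document. */
    static final class Result {
        final List<String> terms = new ArrayList<>();
        final List<String> tags = new ArrayList<>();
    }

    /** Splits text into lowercase terms and tag names, tolerating bad markup. */
    static Result split(String text) {
        Result result = new Result();
        StringBuilder term = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '<') {
                flush(term, result.terms);
                int close = text.indexOf('>', i + 1);
                // Assumption: an unterminated tag runs to the end of the text.
                int end = (close < 0) ? text.length() : close;
                String name = tagName(text.substring(i + 1, end));
                if (!name.isEmpty()) {
                    result.tags.add(name);
                }
                i = (close < 0) ? text.length() : close + 1;
            } else if (Character.isLetterOrDigit(c)) {
                term.append(Character.toLowerCase(c));
                i++;
            } else {
                // Any other character ends the current term.
                flush(term, result.terms);
                i++;
            }
        }
        flush(term, result.terms);
        return result;
    }

    /** Returns the leading alphanumeric name; empty for closing tags and comments. */
    private static String tagName(String body) {
        StringBuilder name = new StringBuilder();
        int k = 0;
        while (k < body.length() && Character.isLetterOrDigit(body.charAt(k))) {
            name.append(Character.toLowerCase(body.charAt(k)));
            k++;
        }
        return name.toString();
    }

    /** Moves a pending term, if any, into the output list. */
    private static void flush(StringBuilder term, List<String> out) {
        if (term.length() > 0) {
            out.add(term.toString());
            term.setLength(0);
        }
    }

    public static void main(String[] args) {
        Result r = split("<p>Hello <b>World</p> unclosed <i>tail");
        System.out.println(r.terms);
        System.out.println(r.tags);
    }
}
```

In this sketch the unclosed b and i tags do not stop processing: no tag stack is kept, so the terms that follow them are still emitted. A real tokenizer would likely also track tag positions and attributes, which are omitted here.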

Author:
trevor, metzler

Nested Class Summary
static class TagTokenizer.Pair
           
 
Constructor Summary
TagTokenizer()
           
 
Method Summary
static byte[] makeBytes(String word)
           
 void onAmpersand()
           
 void reset()
          Resets parsing in preparation for the next document.
 Text[] tokenize(String text)
          Parses the given text and returns the resulting tokens as an array of Text.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TagTokenizer

public TagTokenizer()

Method Detail

makeBytes

public static byte[] makeBytes(String word)

onAmpersand

public void onAmpersand()

reset

public void reset()
Resets parsing in preparation for the next document.


tokenize

public Text[] tokenize(String text)
Parses the given text and returns the resulting tokens as an array of Text.

Parameters:
text - the document text to tokenize
Throws:
IOException