edu.isi.mavuno.input
Class TagTokenizer
java.lang.Object
edu.isi.mavuno.input.TagTokenizer
public class TagTokenizer
- extends Object
This class processes document text into tokens that can be indexed.
The text is assumed to contain some HTML/XML tags. The tokenizer tries
to extract as much data as possible from each document, even if it is not
well formed (e.g. start tags with no matching end tags). The resulting
document object contains an array of terms and an array of tags.
- Author:
- trevor, metzler
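The lenient behavior in the class description can be illustrated with a small, self-contained sketch. This is not the Mavuno implementation: the class `LenientTagSketch`, its `Result` holder, and the `parse` method are hypothetical names chosen for this example, and the rules it applies (lowercasing, splitting on non-alphanumeric characters, recording only start-tag names) are assumptions. It shows only the general idea of collecting terms and tags while tolerating unclosed tags and stray `<` characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Mavuno's TagTokenizer.
class LenientTagSketch {

    /** Holds the two outputs named in the class description: terms and tags. */
    static final class Result {
        final List<String> terms = new ArrayList<>();
        final List<String> tags = new ArrayList<>();
    }

    /** Scans the text once, collecting lowercased terms and start-tag names. */
    static Result parse(String text) {
        Result result = new Result();
        StringBuilder term = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '<') {
                flush(term, result.terms);
                int close = text.indexOf('>', i + 1);
                int nextOpen = text.indexOf('<', i + 1);
                if (close >= 0 && (nextOpen < 0 || nextOpen > close)) {
                    // A complete <...> span: record its name, no matching end tag required.
                    String name = tagName(text.substring(i + 1, close));
                    if (!name.isEmpty()) {
                        result.tags.add(name);
                    }
                    i = close + 1;
                } else {
                    // Stray '<' (never closed, or another '<' comes first): treat it as a separator.
                    i++;
                }
            } else if (Character.isLetterOrDigit(c)) {
                term.append(Character.toLowerCase(c));
                i++;
            } else {
                // Any other character ends the current term.
                flush(term, result.terms);
                i++;
            }
        }
        flush(term, result.terms);
        return result;
    }

    /** Returns the leading alphanumeric run; empty for end tags, comments, and declarations. */
    private static String tagName(String body) {
        StringBuilder name = new StringBuilder();
        for (int k = 0; k < body.length(); k++) {
            char c = body.charAt(k);
            if (!Character.isLetterOrDigit(c)) {
                break;
            }
            name.append(Character.toLowerCase(c));
        }
        return name.toString();
    }

    /** Moves a pending term, if any, into the output list. */
    private static void flush(StringBuilder term, List<String> out) {
        if (term.length() > 0) {
            out.add(term.toString());
            term.setLength(0);
        }
    }

    public static void main(String[] args) {
        // None of the three start tags is ever closed; parsing still succeeds.
        Result r = parse("<html><p>Hello, <b>World");
        System.out.println(r.tags);  // prints [html, p, b]
        System.out.println(r.terms); // prints [hello, world]
    }
}
```

The sketch never throws on malformed markup: anything it cannot interpret as a tag is treated as a separator, which is one simple way to keep as much of the surrounding text as possible.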
Method Summary
static byte[] makeBytes(String word)
void onAmpersand()
void reset()
- Resets parsing in preparation for the next document.
Text[] tokenize(String text)
- Parses the given text and returns the resulting tokens as a Text array.
TagTokenizer
public TagTokenizer()
makeBytes
public static byte[] makeBytes(String word)
onAmpersand
public void onAmpersand()
reset
public void reset()
- Resets parsing in preparation for the next document.
tokenize
public Text[] tokenize(String text)
- Parses the given text and returns the resulting tokens as a Text array.
- Parameters:
text - the document text to parse
- Throws:
IOException