Usage:
java programs.Tokenize [Options]
The tokenizer breaks collections into documents, documents into
sections, and finally sections into sets of sentences and tokens.
It recognizes MEDLINE citation documents and tokenizes their sections
according to the structure of MEDLINE citations. It also accepts
input formatted for the SMART retrieval system, fielded text,
and free text.
The options include the following:
Input and Output File Options
| Short Name | Long Name | Default Value | Purpose |
| __ | --fileName= | stdIn | Name of the file to process |
| __ | --outputFileName= | stdOut | Name of the output file to write to |
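As an illustration, an invocation that reads from one file and writes to another might look like the sketch below. The file names sample.txt and sample.out are hypothetical, and the command is printed rather than executed, because running it assumes the MMTx classes are on the Java classpath.

```shell
# Hypothetical free-text input; any file the tokenizer accepts would do.
printf 'The patient was seen in clinic. He was discharged the same day.\n' > sample.txt

# Collect the documented file options as positional parameters.
set -- --fileName=sample.txt --outputFileName=sample.out

# Dry run: build and print the command line that would be executed.
cmd="java programs.Tokenize $*"
echo "$cmd"

# To run it for real (assumes the MMTx classes are on the CLASSPATH):
# java programs.Tokenize "$@"
```

If both options are omitted, the defaults listed in the table apply: the tokenizer reads standard input and writes standard output.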
Input Format Descriptions
By default, the tokenizer auto-detects whether the input is in MEDLINE
citation format or free text. The following flags override this auto-detection.
| Short Name | Long Name | Default Value | Purpose |
| __ | --medlineCitations | false | The input is a collection of MEDLINE citations |
| __ | --mrcon | false | The input is a collection of MRCON rows |
| __ | --freeText | true | The input is free text |
| __ | --fieldedText | false | The input (file or stdin) is fielded text |
| __ | --textField= | 2 | For fielded text, the field that contains the text |
| __ | --fieldSeparator= | "|" (pipe) | For fielded text, the separator character |
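For fielded input, the sketch below shows how the three fielded-text options fit together. The file fielded.txt and its contents are hypothetical. The cut call only illustrates which field would be selected, under the assumption that fields are numbered from 1, which this page does not state. The separator option is quoted so that the shell does not treat the pipe character as a pipeline.

```shell
# Hypothetical fielded input with three pipe-separated fields: id|text|source
printf '%s\n' '001|Heart attack is a common cause of death.|demo' > fielded.txt

# Illustration only: the second field, assuming fields are counted from 1.
text=$(cut -d'|' -f2 fielded.txt)
echo "$text"

# Quote the separator so the shell passes the pipe character through literally.
set -- --fileName=fielded.txt --fieldedText --textField=2 '--fieldSeparator=|'

# Dry run: build and print the command line that would be executed.
cmd="java programs.Tokenize $*"
echo "$cmd"

# To run it for real (assumes the MMTx classes are on the CLASSPATH):
# java programs.Tokenize "$@"
```

Because 2 and the pipe character are the documented defaults, --fieldedText alone should behave the same way for this file; the explicit form is shown only for clarity.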
Options to display increasing levels of detail
| Short Name | Long Name | Default Value | Purpose |
| __ | --collections | false | Display collection information |
| __ | --documents | false | Display documents |
| __ | --sections | false | Display sections |
| __ | --sentences | false | Display sentences |
| __ | --tokens | false | Display tokens |
| __ | --pipedOutput | false | Display in a pipe-delimited format |
| __ | --details | false | Display the gory details |
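The display flags can be combined. A minimal sketch, assuming a hypothetical input file sample.txt, that asks for sentences and tokens in pipe-delimited form:

```shell
# Hypothetical two-sentence input file.
printf 'Aspirin reduces fever. It also relieves pain.\n' > sample.txt

# Request sentence and token output in pipe-delimited form.
set -- --fileName=sample.txt --sentences --tokens --pipedOutput

# Dry run: build and print the command line that would be executed.
cmd="java programs.Tokenize $*"
echo "$cmd"

# To run it for real (assumes the MMTx classes are on the CLASSPATH):
# java programs.Tokenize "$@"
```

The exact layout of the resulting output is not described on this page, so none is shown here.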
Processing Options
| Short Name | Long Name | Default Value | Purpose |
| __ | --ambiguousAcronyms | false | Disambiguate sentence boundaries using the acronyms and abbreviations file |
Configuration Options
| Short Name | Long Name | Default Value | Purpose |
| __ | --configName= | mmtx cfg | The name of the configuration file |
| -R | --MMTX_ROOT= | /export/nls/mmtx | MMTX root path |
| __ | --ambiguousAcronymsFile= | data/lexicon/ambiguousAcronymsFile.txt | Location of the acronyms and abbreviations file needed by the tokenizer |
| __ | --nmm | false | Flag that switches between MetaMap and non-MetaMap output style. It is useful when combined with --pipedOutput and display flags such as --sentences, --phrases, --nps, --variants, and other levels of detail. |
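A sketch of setting the configuration options explicitly on the command line, combined with acronym-based sentence disambiguation. The shell variable mmtx_root is hypothetical and holds the documented default root. The acronyms file is passed as the relative path listed in the table, on the assumption that it is resolved against the MMTX root, which this page does not state.

```shell
# Hypothetical shell variable holding the MMTx installation root
# (the documented default is used here).
mmtx_root=/export/nls/mmtx

# Enable acronym-based sentence boundary disambiguation and name the
# acronyms file explicitly (the value shown is the documented default).
set -- "--MMTX_ROOT=$mmtx_root" \
       --ambiguousAcronyms \
       --ambiguousAcronymsFile=data/lexicon/ambiguousAcronymsFile.txt \
       --sentences

# Dry run: build and print the command line that would be executed.
cmd="java programs.Tokenize $*"
echo "$cmd"

# To run it for real (assumes the MMTx classes are on the CLASSPATH):
# java programs.Tokenize "$@"
```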