Tokenization


MetaMap Transfer
(MMTx)


Usage:   java programs.Tokenize [Options]

The Tokenizer tokenizes collections into documents, documents into sections, and finally sections into sets of Sentences and Tokens.

This tokenizer recognizes MEDLINE citation documents and tokenizes their sections according to the structure of MEDLINE citations. It also accepts input formatted for the SMART retrieval system, fielded text, and free text.

The options include the following:


Input and Output File Options

Short Name   Long Name           Default Value   Purpose
__           --fileName=         stdIn           Name of the file to process
__           --outputFileName=   stdOut          Name of the output file to write to


Input Format Descriptions

The default behavior is to auto-detect whether the input is in MEDLINE citation format or is free text. The following flags override this auto-detection.

Short Name   Long Name            Default Value   Purpose
__           --medlineCitations   false           The input is a collection of MEDLINE citations
__           --mrcon              false           The input is a collection of MRCON rows
__           --freeText           true            The input is free text
__           --fieldedText        false           The input (file or stdin) is fielded text
__           --textField=         2               For fielded text, the field that contains the text
__           --fieldSeparator=    |               For fielded text, the character that separates fields
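
To illustrate how --textField and --fieldSeparator relate to each other, the following Java sketch pulls the text out of a single fielded-text row. It is not MMTx code: the class and method names are hypothetical, the sample row is made up, and the sketch assumes that fields are counted from 1, which the option table does not state.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of MMTx: selects one field from a fielded-text row. */
class FieldedTextSketch {

    /**
     * Returns the field at position textField (ASSUMED to be 1-based) of a row
     * split on fieldSeparator, or null if the row has no such field.
     */
    static String extractTextField(String row, int textField, char fieldSeparator) {
        // Quote the separator so that '|' is matched literally, not as regex alternation.
        String regex = Pattern.quote(String.valueOf(fieldSeparator));
        // A negative limit keeps trailing empty fields instead of dropping them.
        String[] fields = row.split(regex, -1);
        if (textField < 1 || textField > fields.length) {
            return null;
        }
        return fields[textField - 1];
    }

    public static void main(String[] args) {
        // Documented defaults: text in field 2, fields separated by '|'.
        String row = "12345|Aspirin reduces fever.|2007"; // made-up sample row
        System.out.println(extractTextField(row, 2, '|'));
    }
}
```

With the documented defaults (field 2, separator |), the sketch would hand only the middle field of the sample row to the tokenizer.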


Options for Displaying Increasing Levels of Detail

Short Name   Long Name       Default Value   Purpose
__           --collections   false           Display collection information
__           --documents     false           Display documents
__           --sections      false           Display sections
__           --sentences     false           Display sentences
__           --tokens        false           Display tokens
__           --pipedOutput   false           Display in a pipe-delimited format
__           --details       false           Display the gory details


Processing Options

Short Name   Long Name             Default Value   Purpose
__           --ambiguousAcronyms   false           Disambiguate sentence boundaries using the acronyms and abbreviations file
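
A list of acronyms and abbreviations helps with sentence boundaries because a period after a token such as "Dr." or "vs." usually does not end a sentence. The Java sketch below shows the general idea only. It is not the MMTx algorithm: the class name is hypothetical, and the abbreviation set is hard-coded as a stand-in because the format of the acronyms and abbreviations file is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not MMTx code: abbreviation-aware sentence splitting. */
class SentenceBoundarySketch {

    // ASSUMED stand-in for the contents of an abbreviations file (lower-cased).
    static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "e.g.", "vs."));

    /** Ends a sentence after a word ending in '.', '?' or '!' unless that word is a known abbreviation. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only occurs when the input is empty or all whitespace
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            boolean endsWithTerminator =
                    word.endsWith(".") || word.endsWith("?") || word.endsWith("!");
            if (endsWithTerminator && !ABBREVIATIONS.contains(word.toLowerCase())) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith compared aspirin vs. placebo. Fever fell.";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

Without the abbreviation check, the sample text would be cut after "Dr." and "vs." as well, yielding four fragments instead of two sentences.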


Configuration Options

Short Name   Long Name                  Default Value                            Purpose
__           --configName=              mmtx cfg                                 The name of the configuration file
-R           --MMTX_ROOT=               /export/nls/mmtx                         MMTx root path
__           --ambiguousAcronymsFile=   data/lexicon/ambiguousAcronymsFile.txt   Location of the acronyms and abbreviations file needed by the tokenizer
__           --nmm                      false                                    Switches between MetaMap and non-MetaMap output style. Useful in combination with --pipedOutput and display flags such as --sentences, --phrases, --nps, --variants, and other levels of detail.
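
As an illustration of how the options combine, the Java sketch below assembles one possible command line that reads a file, writes to a file, and displays sentences and tokens in pipe-delimited form. The class name and file names are hypothetical. The sketch assumes that boolean flags are switched on by naming them without a value and that the MMTx classes are already on the Java classpath; neither is stated in the option tables.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds a Tokenize command line from the documented options. */
class TokenizeCommandSketch {

    /** Returns the arguments for tokenizing inputFile into outputFile with piped sentence and token output. */
    static List<String> buildCommand(String inputFile, String outputFile) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("programs.Tokenize");
        command.add("--fileName=" + inputFile);
        command.add("--outputFileName=" + outputFile);
        // ASSUMPTION: boolean options are enabled by passing the bare flag.
        command.add("--sentences");
        command.add("--tokens");
        command.add("--pipedOutput");
        return command;
    }

    public static void main(String[] args) {
        // Placeholder file names. The command is printed rather than executed.
        List<String> command = buildCommand("input.txt", "tokens.out");
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list could be passed to java.lang.ProcessBuilder to launch the tokenizer on a machine where MMTx is installed.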


Last Modified: March 30, 2007