MetaMap Transfer (MMTx) Program

Usage: java programs.MMTx [Options]

MMTx maps (matches) text (from documents, queries) into concepts from the UMLS Metathesaurus.

With this software, text is taken through a series of modules and broken down into the components that include sections, sentences, phrases, lexical elements and tokens. Variants are generated from the resulting noun phrases.

Candidate concepts from the UMLS Metathesaurus are retrieved and evaluated against the noun phrases and their variants. The best-scoring candidates form one result. The resulting concepts are then organized so as to best cover the text; this arrangement is known as a final mapping. There are some examples to show MMTx's functionality.

MMTx Options

MMTx is highly configurable, and its behavior is controlled by option flags, each of which has a short name (e.g., -p) and a long name (e.g., --plain_syntax). On the command line, most of the options are toggle switches. Specifying a non-default option toggles it on; specifying a default option toggles it off. Options that take an argument are never defaults, so their presence always indicates that they are in effect. (Excerpted from "MetaMap: Mapping Text to the UMLS Metathesaurus")

The default options

MMTx's default behavior consists of the following options:
-t (--tag_text);
-l (--stop_large_n);
-p (--plain_syntax);
-c (--candidates);
-s (--semantic_types);
-m (--mappings); and
-b (--best_mappings_only).
Each of these options is defined below.
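The toggle rule can be illustrated with a short Python sketch. It is not MMTx code: the function name effective_options is hypothetical, only short names of toggle options are modelled, and a flag repeated on the command line is assumed to count once.

```python
# Default option set as listed in this section (short names only).
DEFAULTS = {"-t", "-l", "-p", "-c", "-s", "-m", "-b"}


def effective_options(flags, defaults=DEFAULTS):
    """Return the toggle options in effect after applying `flags`.

    Hypothetical helper, not part of MMTx. A non-default flag is switched
    on and a default flag is switched off, which is a symmetric difference.
    Options that take an argument are not modelled here.
    """
    return set(defaults) ^ set(flags)


# Browse-mode style call described under --term_processing:
# -z -o -g, plus -m to switch mapping construction off.
print(sorted(effective_options(["-z", "-o", "-g", "-m"])))
```

In this sketch, -m disappears from the result because it is a default, while -z, -o and -g are added.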

File Options

-fileName=<infile>
Name of Input File
--outputFileName=<outfile>
Name of Output File

Data options

Data options determine the underlying vocabularies and data model used by MMTx.
-V (--mm_data_version) <data version>
specifies which version of MMTx's data files will be used for processing. For example, 2004 specifies the strict model for UMLS 2004, and 2004_level0 specifies the strict model for the level0 version of UMLS 2004. The default data version is:
-A (--strict_model), -B (--moderate_model) and -C (--relaxed_model)
determine which of the data models is used. If more than one model is specified, the strictest one is used; if none is specified, the strict model is used. See the report "Filtering the UMLS Metathesaurus for MetaMap" at the SKR website (http://skr.nlm.nih.gov/papers/index.shtml) for a description of the models.
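The selection rule for the three model flags can be sketched in Python as follows. The function select_model and the model names used as return values are illustrative assumptions, not MMTx identifiers.

```python
# Models ordered from strictest to most relaxed, keyed by short flag.
MODEL_ORDER = [("-A", "strict"), ("-B", "moderate"), ("-C", "relaxed")]


def select_model(flags):
    """Pick the data model implied by the given flags (hypothetical helper)."""
    for flag, name in MODEL_ORDER:
        if flag in flags:
            return name      # first hit is the strictest requested model
    return "strict"          # no model flag given: fall back to strict


print(select_model(["-C", "-B"]))
```

Because the list is scanned from strictest to most relaxed, specifying both -B and -C selects the moderate model.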
--KSYear=<year>
specifies the data model by UMLS Knowledge Source year. It currently has the same behavior as --mm_data_version.

Processing options

Processing options control MMTx's internal behavior.
-t (--tag_text)
specifies that the SPECIALIST parser will use the results of a tagger to assist in parsing. We previously used the Xerox PARC part-of-speech tagger but now use the MedPost/SKR tagger. The MedPost tagger was developed at NCBI specifically for tagging biomedical text; we modified it to use our part-of-speech tags.
-L (--longest_lexicon_match)
causes lexical lookup to prefer matching as much text as possible to lexicon entries. This used to be the only form of lexical lookup, but it has been superseded by a shortest match algorithm. This is mainly due to the fact that the SPECIALIST lexicon is a syntactic lexicon; multi-word items contain no more information than their constituents which have their own lexicon entries.
-P (--composite_phrases)
causes MMTx to reconstruct longer, composite phrases from the simple phrases produced by the parser. A composite phrase is a simple phrase followed by any prepositional phrase, optionally followed by one or more prepositional phrases introduced by "of". An example is "pain on the left side of the chest", which will map to 'Left sided chest pain' rather than to separate concepts as it would without the option. Note that --composite_phrases is experimental; it is currently both inefficient and not completely correct.
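The definition of a composite phrase can be sketched in Python as below. This is only an illustration under stated assumptions: the parser output is modelled as (kind, text) pairs with kind "np" or "pp", the trailing phrases are read as prepositional phrases beginning with "of", and the names build_composites and is_of_phrase are hypothetical. It does not reproduce MMTx's actual algorithm.

```python
def is_of_phrase(phrase):
    """True for a prepositional phrase whose first word is 'of'."""
    kind, text = phrase
    tokens = text.lower().split()
    return kind == "pp" and bool(tokens) and tokens[0] == "of"


def build_composites(phrases):
    """Greedily merge simple phrases into composite phrases.

    `phrases` is a list of (kind, text) pairs, an assumed stand-in for
    the simple phrases produced by the parser.
    """
    result = []
    i = 0
    while i < len(phrases):
        parts = [phrases[i][1]]
        j = i + 1
        # A simple phrase followed by any prepositional phrase ...
        if j < len(phrases) and phrases[j][0] == "pp":
            parts.append(phrases[j][1])
            j += 1
            # ... optionally followed by one or more "of" phrases.
            while j < len(phrases) and is_of_phrase(phrases[j]):
                parts.append(phrases[j][1])
                j += 1
        result.append(" ".join(parts))
        i = j
    return result


simple = [("np", "pain"), ("pp", "on the left side"), ("pp", "of the chest")]
print(build_composites(simple))
```

With the example from the text, the three simple phrases are joined into the single phrase "pain on the left side of the chest".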
-Q (--quick_composite_phrases)
is a version of --composite_phrases designed to overcome its inefficiency. It is both experimental and temporary.
-a (--no_acros_abbrs)
prevents the use of any acronym/abbreviation variants, which are the least reliable form of variation because it is normally the case that at most one of the expansions for an abbreviated form is correct.
-u (--unique_acros_abbrs_only)
restricts the generation of acronym/abbreviation variants to those forms with unique expansions. This option produces better results than allowing all forms of acronym/abbreviation variants, but it is still better to prevent all such variants.
-anu (--filterToANU)
This switch specifies the following options: don't use acronyms and abbreviations, except where the acronyms and abbreviations are unique, and number the candidates list. Same as specifying options -a, -n, and -u separately.
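How the two acronym/abbreviation switches interact can be sketched in Python. The function allowed_expansions and the sample expansions are made up for illustration; the precedence (unique expansions survive even when -a is also given) follows the -anu description above.

```python
def allowed_expansions(expansions, no_acros_abbrs=False, unique_only=False):
    """Return the expansions of one abbreviation that may be used as variants.

    Hypothetical helper, not MMTx code.
    no_acros_abbrs models -a, unique_only models -u.
    """
    if unique_only:
        # Keep the expansion only when it is the single possible one.
        return list(expansions) if len(expansions) == 1 else []
    if no_acros_abbrs:
        return []
    return list(expansions)


ambiguous = ["expansion one", "expansion two"]   # illustrative data
unique = ["only expansion"]

print(allowed_expansions(ambiguous, unique_only=True))
print(allowed_expansions(unique, no_acros_abbrs=True, unique_only=True))
```

An abbreviation with two candidate expansions contributes nothing under -u, while one with a single expansion is kept.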
-d (--no_derivational_variants)
prevents the use of any derivational variation in the computation of word variants. This option exists because derivational variants, as opposed to all other forms of variation, always involve a significant change in meaning.
-D (--an_derivational_variants)
allows the use of derivational variation between adjectives and nouns, hence the name an_derivational variants. Adjective/noun derivational variants are generally the best of the derivational variants.
-l (--stop_large_n)
prevents retrieval of Metathesaurus candidates for two-character words occurring in more than 2,000 Metathesaurus strings or one-character words occurring in more than 1,000 Metathesaurus strings. This option also prevents retrieval for words that can be a preposition, conjunction or determiner.
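The rule stated for --stop_large_n can be written as a small Python predicate. The function name skip_retrieval, the category labels "prep", "conj" and "det", and the way string counts are supplied are all assumptions made for this sketch.

```python
# Assumed labels for preposition, conjunction and determiner.
STOP_CATEGORIES = {"prep", "conj", "det"}


def skip_retrieval(word, string_count, categories=()):
    """Return True when candidate retrieval is skipped for `word`.

    string_count: number of Metathesaurus strings containing the word.
    categories: syntactic categories the word can have (assumed labels).
    """
    if STOP_CATEGORIES & set(categories):
        return True
    if len(word) == 2 and string_count > 2000:
        return True
    if len(word) == 1 and string_count > 1000:
        return True
    return False


print(skip_retrieval("of", 10, ["prep"]))
print(skip_retrieval("lung", 50000, ["noun"]))
```

Longer words are never stopped by the frequency test in this sketch, however often they occur.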
-i (--ignore_word_order)
allows MMTx to ignore the order of words in the phrases it processes. MMTx was originally developed to process full text and consequently depended very strongly on normal English word order. When in effect, this option avoids the use of specialized word indexes used for efficient candidate retrieval, it ignores word order when matching phrase text to candidate words, and it replaces the normal coverage metric with an involvement metric for evaluating how well a candidate covers the words of a phrase.
-Y (--prefer_multiple_concepts)
causes MMTx to score mappings with more concepts higher than those with fewer concepts. (It does so simply by inverting the normal cohesiveness value.) As a simplified example, with this option in effect, the input text "lung cancer" will score the mapping to the two concepts 'Lung' and 'Cancer' higher than the mapping to the single concept 'Lung Cancer'. This option is useful for discovering higher-order relationships among concepts found in text (e.g., that 'Lung' is the location of 'Cancer' in the example).
-z (--term_processing)
tells MMTx to process terms rather than full text. When invoked, MMTx treats each input as a single phrase (although the parser is still used to determine the head of that phrase). It also causes MMTx to use the involvement metric rather than coverage for evaluating Metathesaurus candidates. When used in conjunction with the --allow_overmatches and --allow_concept_gaps options, it constitutes MMTx's browse mode for thorough searching of the Metathesaurus. In this case it is wise to also specify -m (--mappings) to toggle mapping construction off; otherwise, MMTx spends too much time trying to combine the many candidates into final mappings.
-o (--allow_overmatches)
causes MMTx to retrieve Metathesaurus candidates which have words on one or both ends that do not match the text. For example, overmatches of "medicine" include 'Alternative Medicine', 'Medical Records' and 'Nuclear medicine procedure, NOS'. The use of --allow_overmatches greatly increases the number of candidates retrieved and is consequently much slower than MMTx without overmatches. It is appropriate for browsing purposes.
-g (--allow_concept_gaps)
causes MMTx to retrieve Metathesaurus candidates with gaps (such as "Unspecified childhood psychosis" for "unspecified psychosis"). The use of this option does not appreciably affect MMTx's performance. It is appropriate for browsing purposes.
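The difference between an overmatch and a concept gap can be sketched in Python. This is a simplification, not MMTx's matching: words are compared literally after lowercasing, so variant matches such as "medicine" against 'Medical Records' are not modelled, and the names words and classify_candidate are hypothetical.

```python
import re


def words(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def classify_candidate(text, candidate):
    """Classify how `candidate` relates to `text` by word position."""
    text_words, cand_words = words(text), words(candidate)
    positions, start = [], 0
    for w in text_words:
        try:
            idx = cand_words.index(w, start)   # next occurrence, in order
        except ValueError:
            return "no match"
        positions.append(idx)
        start = idx + 1
    if not positions:
        return "no match"
    first, last = positions[0], positions[-1]
    # Unmatched candidate words before the first or after the last match.
    extra_at_ends = first > 0 or last < len(cand_words) - 1
    # Unmatched candidate words between matched words.
    extra_inside = (last - first + 1) > len(positions)
    if extra_at_ends and extra_inside:
        return "overmatch with gap"
    if extra_at_ends:
        return "overmatch"
    if extra_inside:
        return "gap"
    return "exact"


print(classify_candidate("medicine", "Nuclear medicine procedure, NOS"))
print(classify_candidate("unspecified psychosis", "Unspecified childhood psychosis"))
```

The first call reports an overmatch because the extra words sit at the ends of the candidate; the second reports a gap because the extra word sits between matched words.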
-8 (--dynamic_variant_generation)
forces MMTx to generate variants dynamically rather than by looking up variants in a table. This option is normally used only for debugging purposes.
-K (--ignore_stop_phrases)
simply prevents MMTx from aborting its processing for commonly occurring phrases that are known to produce no mappings. This option is useful only for generating a new table of stop phrases after a change in UMLS data.

Output options

Output options control how MMTx displays results.
-q (--machine_output)
causes output to take the form of Prolog terms rather than human-readable form. The --machine_output option affects all other output options. For further information on machine output, including visually enhanced examples, see http://skr.nlm.nih.gov/Help/.
-f (--fielded_output)
produces multi-line, tab-delimited output. Like machine output, it affects all other output options. For further information on fielded output, including visually enhanced examples, see http://skr.nlm.nih.gov/Help/.
-T (--tagger_output)
displays the output of the MedPost/SKR tagger lining up input words on one line with their tags on a line below.
-p (--plain_syntax)
controls the output form of the results of the SPECIALIST parser. It simply outputs text without any syntactic information.
-x (--syntax)
controls the output form of the results of the SPECIALIST parser. It outputs a Prolog term showing details of the syntactic processing.
-v (--variants)
displays the variants generated for each input word. (See Section 10.1 for an example.)
-c (--candidates)
causes the list of Metathesaurus candidates to be displayed, best to worst, according to the MetaMap evaluation metric (see Section 5). Note that if a candidate is not the preferred name for a concept, the preferred name is displayed in parentheses immediately following the candidate. Displaying both the matching string and the preferred concept name when they differ is intended to avoid any confusion about why a concept appears on the candidate list.
-r (--threshold) <integer>
restricts output to candidates with an evaluation score equal to or better than the threshold. Judicious use of this option can prevent MetaMap from making errors in situations where some input text has no close matches in the Metathesaurus. An appropriate threshold can usually be determined simply by examining MetaMap output for typical text in a given application.
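The effect of a threshold can be sketched in Python. The sketch assumes that candidates are (score, name) pairs and that a larger score is better; the function apply_threshold and the sample scores are illustrative, not MMTx output.

```python
def apply_threshold(candidates, threshold):
    """Keep candidates whose score is the threshold or better.

    candidates: list of (score, name) pairs; higher score assumed better.
    """
    return [c for c in candidates if c[0] >= threshold]


# Made-up scores for illustration only.
sample = [(1000, "Lung Cancer"), (861, "Lung"), (694, "Cancer Genus")]
print(apply_threshold(sample, 800))
```

Raising the threshold trades recall for precision: weak matches are dropped, at the risk of losing a correct but low-scoring candidate.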
-X (--truncate_candidates_mappings)
first truncates the list of candidates to the 100 top-scoring ones before computing mappings and then truncates the list of mappings to the 8 top-scoring ones. This option can sometimes prevent a combinatorial explosion caused by computing a large number of mappings from a large number of candidates as is often encountered when using --allow_overmatches.
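The two truncation steps can be sketched in Python. The limits 100 and 8 come from the description above; everything else is assumed for illustration, including the (score, payload) representation, the function truncated_mappings, and the build_mappings callable that stands in for mapping construction.

```python
import heapq

MAX_CANDIDATES = 100   # candidates kept before mappings are computed
MAX_MAPPINGS = 8       # mappings kept afterwards


def score(item):
    """Score of a (score, payload) pair; higher is assumed better."""
    return item[0]


def truncated_mappings(candidates, build_mappings):
    """Truncate candidates, build mappings, then truncate mappings.

    build_mappings is a hypothetical callable that turns a candidate
    list into a list of (score, mapping) pairs.
    """
    top_candidates = heapq.nlargest(MAX_CANDIDATES, candidates, key=score)
    mappings = build_mappings(top_candidates)
    return heapq.nlargest(MAX_MAPPINGS, mappings, key=score)


# Toy data: 250 candidates, one single-concept mapping per candidate.
candidates = [(i, "c%d" % i) for i in range(250)]
result = truncated_mappings(candidates, lambda cs: [(s, [n]) for s, n in cs])
print(len(result))
```

Truncating before mapping construction is what limits the combinatorial cost, since the number of possible mappings grows with the number of candidates that can be combined.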
-n (--number_the_candidates)
simply numbers the candidates in a displayed candidate list.
-R (--restrict_to_sources) <list>
restricts output to those sources in the comma-separated <list> where spaces are not allowed.
-e (--exclude_sources) <list>
excludes those sources in the comma-separated <list> where spaces are not allowed.
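Parsing the <list> argument and applying the two source options can be sketched in Python. As a simplification, each candidate is modelled as a (name, source) pair with a single source; the helper names and the source abbreviations in the sample data are illustrative assumptions.

```python
def parse_list(arg):
    """Split a comma-separated option argument; spaces are not allowed."""
    if " " in arg:
        raise ValueError("list argument must not contain spaces")
    return set(arg.split(","))


def filter_by_source(candidates, restrict_to=None, exclude=None):
    """Apply source restriction and exclusion to (name, source) pairs.

    restrict_to / exclude are sets of source names or None
    (None means the corresponding option was not given).
    """
    kept = []
    for name, source in candidates:
        if restrict_to is not None and source not in restrict_to:
            continue
        if exclude is not None and source in exclude:
            continue
        kept.append((name, source))
    return kept


# Illustrative data only.
sample = [("Lung", "MSH"), ("Lung structure", "SNOMEDCT"), ("lung", "NCI")]
print(filter_by_source(sample, restrict_to=parse_list("MSH,NCI")))
print(filter_by_source(sample, exclude=parse_list("NCI")))
```

Rejecting arguments that contain spaces mirrors the stated restriction on <list>, so that "MSH, NCI" is reported as an error rather than silently treated as a source named " NCI".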
-J (--restrict_to_sts) <list>
restricts output to those concepts with semantic types in the comma-separated <list> where spaces are not allowed.
-k (--exclude_sts) <list>
excludes those concepts with semantic types in the comma-separated <list> where spaces are not allowed. (See Semantic Types Table)
-I (--show_cuis)
shows the UMLS CUI for each concept displayed.
-I (--show_treecodes)
displays MeSH treecodes.
-W (--preferred_name_sources)
lists all sources for the preferred names of displayed concepts. Note that this is just one of many possible choices for showing sources; showing all sources for any synonym in a concept, for example, would often produce very cluttered output.
-s (--semantic_types)
causes the semantic types of Metathesaurus concepts to be displayed in square brackets for each concept in the candidate list or in a mapping.
-m (--mappings)
causes mappings to be displayed. Most of the time it is useful to display both the candidate list and also the final mappings.
-b (--best_mappings_only)
restricts mappings displayed to only the top scoring ones. It is almost never useful to display all mappings because of their large number.
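Restricting output to the top-scoring mappings can be sketched in Python. The sketch assumes mappings are (score, concepts) pairs, that a higher score is better, and that all mappings tied for the top score are kept; the function best_mappings and the sample scores are illustrative.

```python
def best_mappings(mappings):
    """Return only the mappings that share the highest score."""
    if not mappings:
        return []
    top = max(s for s, _ in mappings)
    return [m for m in mappings if m[0] == top]


# Made-up scores for illustration only.
sample = [
    (1000, ["Lung Cancer"]),
    (1000, ["Malignant neoplasm of lung"]),
    (861, ["Lung", "Cancer Genus"]),
]
print(best_mappings(sample))
```

Toggling -b off (it is a default option) would correspond to skipping this filter and displaying every mapping.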
-E (--indicate_citation_end)
This option causes an end-of-transmission term to be written when processing of each unit of input is complete. It is only useful for processing using the Scheduler and only then with validated generic processing.

Debugging Options

-w (--warnings)
Show warnings (default value: false)
-0 (--debug0)
Debug Level Zero (default value: false)
-5 (--debug5)
Debug Level Five (default value: false)
-6 (--debug6)
Debug Level Six - display prelim info (default value: false)
-9 (--debug9)
Debug Level Nine - Full debug (default value: false)

Miscellaneous options

-h (--help)
displays MetaMap usage, i.e., the form of the command and a list of all options.
-w (--warnings)
enables the display of conditions which are noteworthy if not erroneous. This option is normally used only for debugging purposes.

Last Modified: March 30, 2007