📝 Original Info
- Title: Semantic Technology-Assisted Review (STAR) Document analysis and monitoring using random vectors
- ArXiv ID: 1711.10307
- Date: 2017-11-30
- Authors: Jean-François Delpech
📝 Abstract
The review and analysis of large collections of documents and the periodic monitoring of new additions thereto has greatly benefited from new developments in computer software. This paper demonstrates how using random vectors to construct a low-dimensional Euclidean space embedding words and documents enables fast and accurate computation of semantic similarities between them. With this technique of Semantic Technology-Assisted Review (STAR), documents can be selected, compared, classified, summarized and evaluated very quickly with minimal expert involvement and high-quality results.
📄 Full Content
Semantic Technology-Assisted Review (STAR)
Document analysis and monitoring using random vectors
Jean-François Delpech
1515 N. Colonial Ct
Arlington, VA 22209-1439
United States
jfdelpech@gmail.com
The review and analysis of large collections of documents and the periodic monitoring of new additions thereto has greatly benefited from new developments in computer software. This paper demonstrates how using random vectors to construct a low-dimensional Euclidean space embedding words and documents enables fast and accurate computation of semantic similarities between them. With this technique of Semantic Technology-Assisted Review (STAR), documents can be selected, compared, classified, summarized and evaluated very quickly with minimal expert involvement and high-quality results.
- Introduction
In a recent review article, M. R. Grossman and G. V. Cormack [1] give an extensive discussion of various approaches to technology-assisted review, as well as a very detailed bibliography. They define "Technology-assisted review (TAR) [as] the process of using computer software to categorize each document in a collection as responsive or not, or to prioritize the documents from most to least likely to be responsive, based on a human's review and coding of a small subset of the documents in the collection." Various methods of machine learning have been proposed where a small, human-coded subset of representative documents forms a starting point for machine classification and categorization of the whole collection. The results are presumably more consistent and more reliable than manual review, which is error-prone and not practical for collections of millions of items.
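The TAR loop described here can be sketched as a minimal nearest-centroid ranker. The seed documents and collection below are hypothetical toy data, and real systems use far richer features and active learning; this only illustrates "human codes a small subset, machine ranks the rest":

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of several term-frequency vectors."""
    c = Counter()
    for v in vectors:
        c.update(v)
    n = len(vectors)
    return Counter({t: count / n for t, count in c.items()})

# A reviewer codes a small seed set as responsive / non-responsive ...
responsive_seed = ["fuel injector combustion failure", "combustion chamber fuel leak"]
other_seed = ["quarterly payroll report", "payroll tax summary"]

pos = centroid([bow(d) for d in responsive_seed])
neg = centroid([bow(d) for d in other_seed])

# ... and the machine ranks the whole collection from most to least
# likely responsive by relative similarity to the two centroids.
def score(doc):
    return cosine(bow(doc), pos) - cosine(bow(doc), neg)

collection = ["fuel pump and injector maintenance", "annual payroll audit"]
ranked = sorted(collection, key=score, reverse=True)
```

The maintenance document, which shares vocabulary with the responsive seed, is ranked first; the payroll document scores negative.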
If consideration is limited to textual documents, as is the case here, the starting point is the fact that any document $D_j$ containing $t$ distinct words can be represented as a vector in an orthonormal vector space where each dimension represents a word $w_{k,j}$ occurring $n_{k,j}$ times:

$$D_j = (n_{1,j} w_{1,j}, n_{2,j} w_{2,j}, \ldots, n_{t,j} w_{t,j}) \qquad (1)$$

This has been well understood since the pioneering work of Salton [2, 3]. With M documents and N distinct words, the corpus of documents to be reviewed is thus represented by an M × N term-document matrix with M rows such as Equation 1 and N columns; this matrix is very sparse, since a document will usually contain only a very small fraction of all words (very frequent words, such as "the" or "and", are generally ignored because they have no discriminatory value between documents). This representation is extremely fruitful and forms the basis of numerous information retrieval systems.
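The term-document construction of Equation 1 can be built directly. The three-document corpus and stop-word list below are hypothetical, chosen only to show how the M × N matrix arises and why it is sparse:

```python
from collections import Counter

# Hypothetical three-document corpus; stop words like "the"/"and"
# are dropped because they have no discriminatory value.
corpus = [
    "fuel injector tested in the combustion chamber",
    "the counterfeiter evaded authentication",
    "fuel and combustion efficiency report",
]
STOP = {"the", "and", "in"}

docs = [Counter(w for w in d.lower().split() if w not in STOP) for d in corpus]

# Vocabulary of N distinct significant words, in a fixed order.
vocab = sorted(set(w for d in docs for w in d))

# M x N term-document matrix: row j holds the counts n_{k,j} of Equation 1.
matrix = [[d[w] for w in vocab] for d in docs]

# Sparsity: the fraction of zero entries. Even in this tiny corpus most
# entries are zero; with real collections the fraction is far higher.
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(matrix) * len(vocab))
```

Here M = 3 and N = 10, and 60% of the entries are already zero; with hundreds of thousands of distinct words the matrix is overwhelmingly sparse.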
Note that the dual representation, where words are expressed in terms of documents, is in principle equivalent but seldom used because it presents a number of practical difficulties.
A first approach relies on keywords for the selection of documents; the SMART method initially developed by G. Salton [2, 3] involves building a reverse index of the collection of documents (at its simplest, just considering each column of the matrix defined by Equation 1). Although much more sophisticated, this is essentially what Apache Lucene [4] does and it can be very useful, for example when searching for a specific word such as a product or an individual name.
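At its simplest, such a reverse (inverted) index is just a map from each word to the set of documents containing it. The mini-corpus below is hypothetical, and real engines like Lucene add tokenization, scoring and compression on top of this idea:

```python
from collections import defaultdict

# Hypothetical mini-corpus indexed by document id.
corpus = {
    0: "fuel injector recall notice",
    1: "payroll report for the injector plant",
    2: "fuel combustion study",
}

# Reverse index: each word maps to the documents containing it,
# i.e. one column of the term-document matrix of Equation 1.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(word):
    """Exact keyword lookup: the core of SMART/Lucene-style retrieval."""
    return sorted(index.get(word.lower(), set()))
```

A lookup such as `search("injector")` returns the matching document ids in constant time per word, but a word absent from the corpus returns nothing at all, which is exactly the limitation discussed next.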
However, in Equation 1, each distinct word is orthogonal to every other by the very definition of the embedding space. This has serious practical consequences: since different individuals or organizations use different words to describe the same thing, there is no "best" keyword or set of keywords to retrieve relevant items. For example, the words spectrum and wavelength have related meanings, but this is completely ignored by purely keyword-based software. A robust system should automatically take into account the fact that counterfeiter and authentication, or fuel, combustion and injector are semantically related: if one document contains the word counterfeiter and another the word authentication, they presumably cover similar topics even if they don't share any word. Deliberate content masking is also a serious problem in document review and analysis: authors do not necessarily seek clarity; in fact, they often prefer some degree of obfuscation for a number of more or less legitimate reasons. Obviously, in the hands of an expert, a sequence of well-designed Boolean queries can be successful, but when millions of documents are involved it is difficult to be certain that coverage is adequate.
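The orthogonality problem can be made concrete. In the space of Equation 1, two documents sharing no word have cosine similarity exactly zero, however related their topics; the two-word vocabulary below is a deliberately minimal illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Two-word vocabulary: each distinct word is its own orthogonal axis.
vocab = ["counterfeiter", "authentication"]
doc_a = [1, 0]  # contains only "counterfeiter"
doc_b = [0, 1]  # contains only "authentication"

# Despite the obvious topical link, keyword-space similarity is exactly zero.
sim = cosine(doc_a, doc_b)
```

No amount of keyword weighting changes this: as long as distinct words span orthogonal axes, semantic relatedness is invisible.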
To group together words referring to similar topics so that they are not orthogonal to each other, the initial space of dimensionality equal to the number N of distinct, significant words (usually several hundred thousand dimensions) must be transformed to a space of much lower dimensionality, say a few hundred dimensions. Dimensionality reduction, i.e. low-rank approximation of the M × N term-document matrix, can be achieved by a number of techniques, which tend to be slow and cumbersome when M and N are both large.
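A fast alternative, in the spirit of the random-vector technique this paper develops, is random projection (random indexing): assign each word a fixed random vector in a few hundred dimensions and embed a document as the sum of its words' vectors. The sketch below uses a hypothetical six-word vocabulary and Gaussian random vectors; by the Johnson-Lindenstrauss lemma, such projections approximately preserve similarities, so documents sharing words stay close while disjoint ones stay nearly orthogonal:

```python
import math
import random

random.seed(42)
d = 300  # reduced dimensionality: a few hundred, as suggested above

def random_vector():
    """Fixed random direction for one word (Gaussian entries, unit scale)."""
    return [random.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]

# Hypothetical vocabulary: each distinct word gets its own random vector;
# in d dimensions these are nearly (but not exactly) orthogonal.
vocab = ["fuel", "injector", "combustion", "payroll", "tax", "audit"]
word_vec = {w: random_vector() for w in vocab}

def embed(words):
    """Document embedding: the sum of its words' random vectors."""
    v = [0.0] * d
    for w in words:
        for i in range(d):
            v[i] += word_vec[w][i]
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

doc_a = embed(["fuel", "injector", "combustion"])
doc_b = embed(["fuel", "combustion", "tax"])   # shares 2 of 3 words with doc_a
doc_c = embed(["payroll", "tax", "audit"])     # shares no word with doc_a
```

Unlike exact low-rank factorizations, this costs only one pass over the corpus and no matrix decomposition, which is why it scales to large M and N.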
One of the fir
…(Full text truncated)…
This content is AI-processed based on ArXiv data.