Semantic Technology-Assisted Review (STAR): Document analysis and monitoring using random vectors

Reading time: 5 minutes
...

📝 Original Info

  • Title: Semantic Technology-Assisted Review (STAR): Document analysis and monitoring using random vectors
  • ArXiv ID: 1711.10307
  • Date: 2017-11-30
  • Authors: Jean-François Delpech

📝 Abstract

The review and analysis of large collections of documents and the periodic monitoring of new additions thereto have greatly benefited from new developments in computer software. This paper demonstrates how using random vectors to construct a low-dimensional Euclidean space embedding words and documents enables fast and accurate computation of semantic similarities between them. With this technique of Semantic Technology-Assisted Review (STAR), documents can be selected, compared, classified, summarized and evaluated very quickly with minimal expert involvement and high-quality results.

📄 Full Content

Semantic Technology-Assisted Review (STAR): Document analysis and monitoring using random vectors

Jean-François Delpech
1515 N. Colonial Ct, Arlington, VA 22209-1439, United States

jfdelpech@gmail.com

The review and analysis of large collections of documents and the periodic monitoring of new additions thereto have greatly benefited from new developments in computer software. This paper demonstrates how using random vectors to construct a low-dimensional Euclidean space embedding words and documents enables fast and accurate computation of semantic similarities between them. With this technique of Semantic Technology-Assisted Review (STAR), documents can be selected, compared, classified, summarized and evaluated very quickly with minimal expert involvement and high-quality results.

1. Introduction

In a recent review article, M. R. Grossman and G. V. Cormack [1] give an extensive discussion of various approaches to technology-assisted review, as well as a very detailed bibliography. They define “Technology-assisted review (TAR) [as] the process of using computer software to categorize each document in a collection as responsive or not, or to prioritize the documents from most to least likely to be responsive, based on a human’s review and coding of a small subset of the documents in the collection.” Various methods of machine learning have been proposed where a small, human-coded subset of representative documents forms a starting point for machine classification and categorization of the whole collection. The results are presumably more consistent and more reliable than manual review, which is error-prone and not practical for collections of millions of items.

If consideration is limited to textual documents, as is the case here, the starting point is the fact that any document D_j containing t distinct words, where word w_{k,j} occurs n_{k,j} times, can be represented as a vector in an orthonormal vector space with one dimension per word:

D_j = (n_{1,j} w_{1,j}, n_{2,j} w_{2,j}, …, n_{t,j} w_{t,j})    (1)

This has been well understood since the pioneering work of Salton [2, 3]. With M documents and N distinct words, the corpus of documents to be reviewed is thus represented by an M × N term-document matrix with M rows such as Equation 1 and N columns; this matrix is very sparse, since a document will usually contain only a very small fraction of all words (very frequent words, such as “the” or “and”, are generally ignored because they have no discriminatory value between documents). This representation is extremely fruitful and forms the basis of numerous information retrieval systems.
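Equation 1 is easy to make concrete. The sketch below is an illustration only, not code from the paper; the three-document toy corpus and the stop-word list are invented for the example. It builds the sparse rows of the term-document matrix as word-count mappings and shows how few of the N distinct words any single document actually contains.

```python
from collections import Counter

# Toy corpus; real collections contain millions of documents.
docs = [
    "the spectrum of the laser is narrow",
    "wavelength and spectrum measurements of the source",
    "fuel injector design for combustion engines",
]

STOP_WORDS = {"the", "and", "of", "is", "for"}  # no discriminatory value

def term_vector(text):
    """Sparse row of the term-document matrix: word -> n_{k,j} (Equation 1)."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

rows = [term_vector(d) for d in docs]    # the M sparse rows
vocabulary = sorted(set().union(*rows))  # the N distinct words
print(f"M = {len(rows)} documents, N = {len(vocabulary)} distinct words")
for j, row in enumerate(rows):
    print(f"D_{j}: {dict(row)}")  # only a small fraction of N is non-zero
```

In this representation, spectrum and wavelength occupy unrelated coordinates, which is precisely the limitation taken up next.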

Note that the dual representation, in which words are expressed in terms of documents, is in principle equivalent but seldom used because it presents a number of practical difficulties.

A first approach relies on keywords for the selection of documents; the SMART method initially developed by G. Salton [2, 3] involves building a reverse index of the collection of documents (at its simplest, just considering each column of the matrix defined by Equation 1). Although much more sophisticated, this is essentially what Apache Lucene [4] does (a minimal version is sketched after this passage), and it can be very useful, for example when searching for a specific word such as a product or an individual name. However, in Equation 1, each distinct word is orthogonal to every other by the very definition of the embedding space. This has serious practical consequences: since different individuals or organizations use different words to describe the same thing, there is no “best” keyword or set of keywords to retrieve relevant items. For example, the words spectrum and wavelength have related meanings, but this is completely ignored by purely keyword-based software. A robust system should automatically take into account the fact that counterfeiter and authentication, or fuel, combustion and injector, are semantically related: if one document contains the word counterfeiter and another the word authentication, they presumably cover similar topics even if they do not share a single word. Deliberate content masking is also a serious problem in document review and analysis: authors do not necessarily seek clarity; in fact, they often prefer some degree of obfuscation for a number of more or less legitimate reasons. Obviously, in the hands of an expert, a sequence of well-designed Boolean queries can be successful, but when millions of documents are involved it is difficult to be certain that coverage is adequate.

To group together words referring to similar topics so that they are not orthogonal to each other, the initial space, whose dimensionality equals the number of distinct, significant words (usually several hundred thousand dimensions), must be transformed into a space of much lower dimensionality, say a few hundred dimensions. Dimensionality reduction, i.e., low-rank approximation of the term-document matrix, can be achieved by a number of techniques which tend to be slow and cumbersome when M and N are both large. One of the fir
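To make the reverse-index idea concrete, here is a minimal sketch reusing the invented toy corpus from the previous example; this is not SMART's or Lucene's actual implementation. It also exhibits the failure mode described above: the exact keyword spectrum retrieves both physics documents, while the semantically related keyword wavelength misses one of them.

```python
from collections import defaultdict

docs = [
    "the spectrum of the laser is narrow",
    "wavelength and spectrum measurements of the source",
    "fuel injector design for combustion engines",
]

# Reverse index: word -> set of ids of documents containing it
# (conceptually, reading the columns of the term-document matrix).
index = defaultdict(set)
for j, text in enumerate(docs):
    for word in set(text.lower().split()):
        index[word].add(j)

print(index["spectrum"])    # {0, 1} -- exact keyword match works
print(index["wavelength"])  # {1}    -- related meaning, document 0 is missed
```

Production systems such as Lucene layer tokenization, stemming, positional data and relevance scoring on top of this same column-of-the-matrix structure, but the orthogonality problem remains.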

…(Full text truncated)…
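The excerpt breaks off just as dimensionality reduction is introduced, but the random-vector idea named in the title can be illustrated in its simplest form: assign every word a fixed sparse random vector and embed each document as the normalized sum of its words' vectors. The sketch below is a generic random-projection illustration under assumed parameters (dimension D = 300, ternary ±1 entries, the same toy sentences), not the paper's own construction. In particular, it only compresses the Equation 1 space while approximately preserving cosine similarities; making spectrum and wavelength genuinely non-orthogonal additionally requires accumulating co-occurrence (context) vectors, as random-indexing methods do.

```python
import zlib
import numpy as np

D = 300  # target dimensionality: "a few hundred dimensions", per the text

def random_word_vector(word, dim=D, nonzero=10):
    """Sparse ternary random vector for a word; seeding with a stable hash
    of the word guarantees the same word always gets the same vector."""
    rng = np.random.default_rng(zlib.crc32(word.encode()))
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def embed(text, stop_words=frozenset({"the", "and", "of", "is", "for"})):
    """Document vector: sum of its words' random vectors, then normalized.
    This is a random projection of the Equation 1 row into D dimensions."""
    v = np.zeros(D)
    for word in text.lower().split():
        if word not in stop_words:
            v += random_word_vector(word)
    n = np.linalg.norm(v)
    return v / n if n else v

a = embed("the spectrum of the laser is narrow")
b = embed("wavelength and spectrum measurements of the source")
c = embed("fuel injector design for combustion engines")
# a and b share a word, so their cosine similarity is clearly positive;
# a and c share none, so theirs is near zero.
print(round(float(a @ b), 3), round(float(a @ c), 3))
```

Because independent high-dimensional random vectors are nearly orthogonal, distinct words barely interfere with one another, and the embedding is built in a single pass over the corpus rather than through the slow matrix factorizations mentioned above.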

Reference

This content is AI-processed based on ArXiv data.
