



arXiv:math/9201266v1 [math.NA] 1 Jan 1992

APPEARED IN BULLETIN OF THE AMERICAN MATHEMATICAL SOCIETY
Volume 26, Number 1, Jan 1992, Pages 3-28

SOME BASIC INFORMATION ON INFORMATION-BASED COMPLEXITY THEORY

Beresford N. Parlett

Abstract. Numerical analysts might be expected to pay close attention to a branch of complexity theory called information-based complexity theory (IBCT), which produces an abundance of impressive results about the quest for approximate solutions to mathematical problems.

Why then do most numerical analysts turn a cold shoulder to IBCT? Close analysis of two representative papers reveals a mixture of nice new observations, error bounds repackaged in new language, misdirected examples, and misleading theorems.

Some elements in the framework of IBCT, erected to support a rigorous yet flexible theory, make it difficult to judge whether a model is off-target or reasonably realistic.

For instance, a sharp distinction is made between information and algorithms restricted to this information. Yet the information itself usually comes from an algorithm, so the distinction clouds the issues and can lead to true but misleading inferences.

Another troublesome aspect of IBCT is a free parameter F, the class of admissible problem instances. By overlooking F's membership fee, the theory sometimes distorts the economics of problem solving in a way reminiscent of agricultural subsidies.

The current theory's surprising results pertain only to unnatural situations, and its genuinely new insights might serve us better if expressed in the conventional modes of error analysis and approximation theory.

Contents
1. Introduction
2. High level criticisms
3. Preliminaries
4. On the optimal solution of large linear systems
5. Optimal solution of large eigenpair problems
References

1. Introduction and summary

In 1980 Traub and Wozniakowski published a monograph entitled A general theory of optimal algorithms, which initiated a new line of research.

The subject was called analytic complexity theory, initially, but is now referred to as information-based complexity theory (IBCT hereafter). The August 1987 issue of Bull. Amer. Math. Soc. (N.S.) contains a summary of recent results [Pac & Wo, 1987], so acceptance of this branch of complexity theory has been swift.

1991 Mathematics Subject Classification. Primary 68Q25; Secondary 65F99, 65Y20.
Received by the editors November 10, 1989 and, in revised form, July 9, 1990.
The author gratefully acknowledges partial support from ONR Contract N00014-90-J-1372.
©1992 American Mathematical Society

One purpose of the general theory is to provide an attractive setting for the systematic study of some of the problems that have engaged numerical analysts for decades.

One among the programs of IBCT is to determine the minimal cost of computing an approximate solution (of given accuracy) over all algorithms that one could use which restrict themselves to certain limited information about the data. It is also of interest to discover any algorithms that achieve this minimal cost or, at least, some cost close to it. Pride of place in IBCT is given to the information to which the algorithms are limited.

By its choice of problems IBCT is (potentially) a branch of complexity theory that is highly relevant to numerical analysts. Whenever the minimal costs can be estimated, they provide yardsticks against which to measure actual algorithms.

The program seems attractive.

The purpose of this paper is to sound a warning about IBCT, but first we point out why our task is less than straightforward and why our observations have more than local interest.

Our task would be easier if mathematics enjoyed a tradition of public debate and controversy, as exists in some other fields, from which some kind of consensus could emerge if not Truth. But too many of us treat every mathematical assertion as if it must be true or false, and if false, then mistaken, and if mistaken, then a symptom of some lapse or inaptitude.

Such an oversimplification would wrongly impugn the technical competency of the founders of the IBCT and miss the point of our criticism; we intend to show that parts of IBCT are true, but mistaken.

Our task would be easier if IBCT contained a logical flaw so egregious that its mere exposure rendered further discussion unnecessary. But the flaws are subtle, subtle enough that they appear only after their consequences have been analyzed in detail.

That is why we must focus upon specific work. We have selected two related papers [Tr & Wo, 1984; Ku, 1986] to be examined in detail in §§4 and 5. They concern matrix computations, which is the area we understand best of all the many areas that IBCT covers.

Besides, we believe these papers are typical; written in a professional manner, their analysis is not weak, and some of their observations are new and interesting in a technical way regardless of IBCT. We claim that their results, most of them, are seriously misleading.

Since the arguments in the papers are impeccable, the flaw must be in the framework. Yet the definitions are laid out plainly for all to see and seem to be appropriate—a puzzling situation.

One source of difficulty is the redefinition of common terms such as “eigenvalue problem” (see §2.4) or “worst case” (see §2.5) or “information” (see §4.4) or “algorithm” (see §§2.1, 3.2, 4.4, and 5.3).

These slight twists to conventional meanings are subtle enough to escape notice but entail significant consequences. The results mislead because they are remembered and discussed in terms of ordinary word usage.

Most readers will not even be aware of the shifts in meaning, some of which are due to the tempting but artificial distinction between information and algorithm.

Another feature of IBCT that can sometimes rob results of their relevance is the presence of a free parameter: the class F from which worst cases are to be drawn. The cost of testing membership in F is ignored by IBCT, so the model loses validity whenever this cost is not negligible.

Some of our criticisms require very little knowledge of the subject matter.

These criticisms are presented in the next section. After that, we get down to details, provide some background material, and then examine each paper in turn.

In our summaries we try to be fair, but we encourage the interested reader to compare our effort with the original work.

A handful of reservations about IBCT have appeared in print. In a review of the second monograph [Tr, Wo & Wa, 1983], Shub [Shu, 1987] gives a couple of instances of unnatural measures of cost.

In [Ba, 1987], Babuska calls on researchers in IBCT to make their models more realistic. We concur but note that the model may be so flexible as to embrace pointless investigations as readily as pertinent ones.

We make no complaint that IBCT ignores the roundoff error that afflicts implementation on digital computers.

First, a good understanding of computation in exact arithmetic is a prerequisite for tackling practical issues. Second, we must acknowledge that a large part of theoretical numerical analysis confines itself to the comforts of exact arithmetic.

IBCT has already produced a large body of results, some of them surprising and consequently of potential interest.

Yet each surprising result known to us, in worst-case analysis, holds only within a model sufficiently unnatural as to forfeit attention from numerical analysts. This is a pity because IBCT certainly permits realistic models, and there is plenty to do; the investigation of average case complexity of approximately solved problems is in its infancy.

It would take only a few illuminating results concerning some reasonable models to restore faith in the program launched by the general theory of optimal algorithms. So, it is a pleasure to acknowledge the recent and interesting work by Traub and Wozniakowski on the average case behavior of the symmetric Lanczos algorithm [Tr & Wo, 1990]. However, we must note that the infrastructure of IBCT plays little role in this analysis.

The incentive to write this essay came from discussions held during the workshop at the Mathematical Sciences Research Institute at Berkeley in January 1986 under the title Problems Relating Computer Science to Numerical Analysis.

2. High level criticisms

2.1. This is not complexity theory. Numerical analysis and complexity theory are palpably different subjects. Complexity theory (CT hereafter) seeks to determine the intrinsic difficulty of certain tasks; whereas much of theoretical numerical analysis (NA hereafter) has been concerned with studying classes of algorithms, analyzing convergence and stability, developing error bounds (either a priori or a posteriori), and detecting either optimal or desirable members of a class according to various criteria.

Clearly CT has more ambitious goals than does NA.

One major theme of IBCT is to find the minimal cost of achieving a certain level of approximation for the hardest case in a given problem class F, using any algorithm that confines itself to certain partial information about the case. One of the papers we examine is concerned with the solution of large systems of linear equations and the other with the matrix eigenvalue problem (see [Tr & Wo, 1984; Ku, 1986]).

Now we can formulate our first complaint, one that applies to the results in both papers.

The theorems say nothing about the intrinsic cost of computing an approximate solution to either of the problems mentioned above because the specified information is not naturally associated with the task but is acquired when a certain class of numerical methods is employed.

The class is sometimes called the Krylov subspace methods; one is not given a matrix A explicitly, but instead a few vectors b, Ab, A^2 b, A^3 b, ..., A^j b, and one wishes to approximate the solution of a system of linear equations Ax = b or some specified eigenpair of A. More details are given in §§3.2 and 3.3.

So the invitation to minimize cost over all algorithms subject to the given information turns out, in these cases, to amount to the quest for the best that can be achieved at each step of a Krylov subspace method. This is exactly the sort of work that numerical analysts do.

We do not wish to belabor the obvious, but our suggestion that, in these cases, IBCT has the appearance of CT without the substance is important for the following reason.

It might be claimed that we interpret IBCT results as though they were results about Krylov subspace methods (i.e., NA) when, in fact, they are CT results concerning Krylov information. In other words, perhaps we are guilty of looking at the world through NA spectacles and missing that subtle difference of emphasis characteristic of CT.

This possibility needs to be considered, but the stubborn fact remains that restricting information to Krylov information is not part of the linear equations problem nor of the eigenvalue problem.

2.2. Free information in the problem class. The ingredient of IBCT that allows it to generate irrelevant results is the problem class F (see the second paragraph in §2.1). F did not appear in our brief description of the theory in the second paragraph of §1 because it is not a logically essential ingredient but rather a parameter within IBCT.

Let us describe the role it plays. There is a task T to be accomplished; there is an information sequence N = (N_1, N_2, N_3, ...) coupled with a measure of cost (N_j costs j units); and there is F. For each N_j the only algorithms admitted for consideration are those that restrict themselves to use N_j and standard auxiliary computations. For a worst-case complexity analysis, the main technical goal is to determine the minimal cost, over admissible algorithms, required to achieve T for the most difficult problem within F that is consistent with N. This minimal cost may be called C(F, N, T).

Suppose now that F_1 ⊂ F_2 and that C(F_1, N, T) < C(F_2, N, T).

Such a result is of little relevance to the achievement of T unless one can determine that a problem lies within F_1 rather than F_2. To put the matter in other words, we might say that knowledge of membership in F is information and should have a cost attached to it.

Whenever F is very large (for example, the class of continuous functions or the class of invertible matrices), it is realistic to assign no cost to it. On the other hand, there are examples (see §4.4) where it may be as expensive to ascertain membership in F as to achieve T, given N, over a larger class of problems.

In such cases C(F, N, T) bounds one part of the expense while ignoring the other. Let C(N, T) denote C(F, N, T) when F is as large as possible.

We may reformulate the minimax quantity C(N, T) with the aid of a useful piece of notation.

To each N_j there is a set V̂_j of problems (matrices in our case) that are indistinguishable by N_j. The sets V̂_j, j = 1, 2, ..., n, are nested and eventually reduce to a singleton. Associated with any approximation z is the set R_j(z) of indistinguishable residuals (e.g., R_j(z) = b − Âz, Â ∈ V̂_j, for the linear equations problem Ax = b). The goal is to find the smallest natural number k such that there is a z for which R_k(z) lies in the target ball (e.g., B(0, ε∥b∥), the ball in R^n centered at the origin with radius ε∥b∥). This is C(N, T).

This formulation reveals several things.

First, the admissible algorithms cited in the minimax formulation of C(N, T) are not really needed; what matters is the size of R_k(z) for various z. Second, one reason why there is very little in the NA literature on the problem of finding the minimal k is that for most interesting tasks k = n; the sets R_k(z) are just too big, so the problem is not interesting.

One way to reduce the indistinguishable sets is to introduce a subclass F and to use V_j = V̂_j ∩ F in place of V̂_j.

This was discussed above. For approximation theory there is no objection to the introduction of unknown quantities that might characterize F. However, as mentioned above, IBCT seems to use F as a tuning parameter designed to keep k < n.

2.3. Spurious challenges. The optimality properties of various Krylov subspace methods are well known (see [Sti, 1958]).

IBCT's claim to have something new to add is based on the suggestion that its theory considers any algorithm (confined to Krylov information) including those that give approximations z from outside the Krylov subspace. See the quotation in §4.1.

The trouble with this apparent novelty is that it is not possible to evaluate the residual norm ∥b − Az∥ for these external z because there is no known matrix A (only Krylov information). So how can an algorithm that produces z verify whether it has achieved its goal of making ∥b − Az∥ < ε∥b∥? Perhaps that is why no such new algorithm is actually exhibited?

IBCT's suggestion that it goes beyond the well-known polynomial class of algorithms is more apparent than real.

2.4. A new eigenvalue problem. The task of computing some or all the eigenvalues of a matrix is acknowledged to be of practical importance.

When only a few eigenvalues of a large order matrix are wanted, one seeks either the smallest eigenvalues, or the largest, or all in a given region. Unfortunately, [Ku, 1986] makes a subtle change in the problem.

The redefined goal asks for any approximate eigenpair (value λ and vector x) without reference to where in the spectrum the approximate eigenvalue may lie. Of course a theorist is entitled to investigate any problem he or she chooses.

However, we have yet to hear of any use for such output. Our complaint is that no indication is given that the goal is an unusual one.

Very few readers would realize that the familiar relevant eigenvalue problem has been bypassed. Indeed, we missed the point ourselves until a friend pointed it out.

It is standard practice to use the size of the residual norm (∥Ax − xλ∥) as the means by which to decide whether a specified approximate eigenvalue is accurate enough.

IBCT twists things around and makes a small residual norm into the goal. It is the old philosophical error of mistaking the means for the end.

2.5. A confusion of worst cases. An important feature of Krylov information {b, Ab, A^2 b, ...} (see the third paragraph in §2.1) is the so-called starting vector b which is, of course, quite independent of the goal of computing eigenvalues.

There are two different factors that can increase the cost of using this information to approximate eigenvalues of A. One is an intrinsic difficulty; some matrices in the given class may have unfortunate eigenvalue distributions.

The other is that b may be poorly chosen. Instead of separating the effects of these factors, the eigenvalue paper combines them and ends up analyzing Krylov information with a worst possible starting vector even though satisfactory starting vectors are easy to obtain.

The fact that b is treated as prescribed data is quite difficult to spot. This situation is in stark contrast to the linear equations problem where b is part of the problem of solving Ax = b.

The study of worst choices for b is not without interest (see [Sc, 1979], for example).

Such studies are relevant to the inverse eigenvalue problem but not to the complexity of approximating eigenvalues via Krylov subspaces.

Section 5 discusses the issue in more detail, but the conclusion is that the wrong placing of b twists the model away from its original target. It is only this distorted model that permits the proof of the surprising results in [Ku, 1986, Abstract].

3. Preliminaries

3.1. A word on matrix computations. The subject begins with three basic tasks.

(i) Solve systems of linear algebraic equations; written as Ax = b with x, b column vectors and A a matrix.
(ii) Compute eigenvalues and eigenvectors of A.
(iii) Solve least squares problems, i.e., minimize ∥b − Ax∥_2 over all x.

There are very satisfactory programs for accomplishing these tasks when the matrices are small. An n × n matrix is said to be small if a couple of n × n arrays can be comfortably stored in the fast memory of the computer. These days a 50 × 50 matrix would be considered small on most computer systems.

The reader may consult [Pa, 1984] for more information. The methods in use make explicit transformations on the given matrix.

There are one or two open problems concerning convergence of some methods, but by and large the small matrix problem is in satisfactory condition with respect to conventional one-calculation-at-a-time (sequential) computers.

One reason for introducing this slice of history into the discussion is to bring out the fact that computation with large order matrices (say, 5000 × 5000) is a somewhat different game from computation with small ones. Sometimes the very problem itself changes.

For tasks (i) and (iii) the goal does remain the same but often the product Av, for any vector v, can be formed cheaply, so one seeks methods that exploit this feature and do not factor A. For the eigenvalue problem, there are many applications where only a few eigenvalues are wanted, perhaps 30 of 5000, and it is desirable to avoid changing A at all.

Thus the task has changed; there is no desire to diagonalize A.

For all three problems it often happens that a sequence of similar cases must be treated as parameters are changed. This leads to an updating problem and ends our brief historical digressions.

As usual R denotes the real numbers, C the complex numbers, and R^n is the vector space of real n-tuples.

3.2. A word on information-based complexity.

We describe a simple version of IBCT that is used in the two papers to be examined. It does not use the full panoply of concepts in the monograph or its sequel [Tr, Wo & Wa, 1983].

There are a few essential ingredients that are best seen in a specific context.

(1) A class F, e.g., F = {B : B ∈ R^{n×n}, B symmetric, positive definite};
(2) A task, e.g., given b ≠ 0 in R^n, and ε > 0, find x in R^n s.t. ∥Ax − b∥ < ε∥b∥, A ∈ F. Here ∥·∥ is some given norm. For the eigenvalue problem, the task is stated as: find x ∈ C^n and ρ ∈ C such that ∥Ãx − xρ∥ ≤ ε for all Ã in F that are indistinguishable from A. The defects in this definition were mentioned in §2.5.
(3) Information N = (N_0, N_1, ...), e.g., N_j(A, b) = {b, Ab, ..., A^j b} for natural numbers j ≤ n, A ∈ F;
(4) A measure of cost, e.g., j units for N_j. In this model the forming of linear combinations of vectors is free.
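To fix ideas, here is a minimal sketch, ours and not from the paper, of ingredients (3) and (4): only the action v ↦ Av is assumed to be available, each application of A costs one unit, and forming linear combinations of the stored vectors is free. The names (krylov_information, matvec) are illustrative only.

    import numpy as np

    def krylov_information(matvec, b, j):
        """Return N_j(A, b) = [b, Ab, ..., A^j b].

        Only the action v -> A v is assumed to be available, so the cost,
        in the model described above, is j matrix-vector products."""
        info = [b]
        for _ in range(j):
            info.append(matvec(info[-1]))   # one unit of cost per step
        return info

    # Toy usage: A is hidden inside `matvec`; an algorithm restricted to N_3
    # sees only these four vectors (linear combinations of them are free).
    n = 6
    A = np.diag(np.arange(1.0, n + 1))      # stands in for some member of F
    b = np.ones(n) / np.sqrt(n)
    N3 = krylov_information(lambda v: A @ v, b, 3)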

Items (2) and (3) do not make it clear that A is not known explicitly. There is more discussion of this point in §4.3.

To use an algorithm, restricted to the information N, in order to solve a problem in F will entail cost that may vary with the problem.

The primary goal of worst-case IBCT is to minimize, over all such algorithms, the maximum, over all problems in F, of that cost. Determining the minimum, even roughly, is worthwhile even if an algorithm that achieves the minimum is not known. There is an analogous average-case theory. This certainly has an appeal.

Please note that, in our example, (3) puts this theory firmly within numerical analysis.

This is because the information in this example, and it is typical, is not part of the task. The information N_j will only be available if a certain type of method is invoked.

Consequently the theory is not addressing the intrinsic cost, or difficulty, of solving linear systems but is confined to seeking the number of needed steps within a chosen class of methods. This is what numerical analysts do, and have done, from the beginning.

In the 1970s the word “complexity” was reserved for the intrinsic difficulty of a task and the word “cost” was used in connection with a particular algorithm.

For example, see [Bo & Mu, 1975] and [Wi, 1980]. However, it is now common to talk about the complexity of an algorithm as a synonym for cost.

This extension of the term complexity does no great harm. What is misleading is that the notion of information appears to be independent of any algorithm.

This allows the theory to talk about the set of all algorithms that confine themselves to the given information. As indicated in the previous paragraph, this way of talking may sometimes be a reformulation of the standard practice of optimizing over a specified family of methods.

For j < n the information N_j(A, b) is partial; there are many matrices that are indistinguishable from A in the sense that each of them generates the same set of j + 1 vectors. The basic technical concept in the theory, the matrix index k(Φ, A) of an algorithm Φ, is the minimal cost using Φ to guarantee achievement of the task for all matrices in F that are indistinguishable from A.

There is more discussion in §4.2 and §2.2.

The theory seeks min_Φ k(Φ, A) and other related quantities while Φ is restricted to information N. This minimum is the complexity of the task.

IBCT is careful to distinguish itself from the older speciality now called arithmetic complexity, which is concerned with discrete problems such as the minimal number of basic arithmetic operations required to form the product of two n × n matrices (see, for example, [Ra, 1972; Str, 1969; Sch & Str, 1971; Wi, 1970]). Another way to model some aspects of scientific computing was introduced in 1989 by Blum, Shub, and Smale, [Bl, Sh & Sm, 1989].

Their algebraic complexity has the cost of a basic arithmetic operation independent of the operands just as it does in scientific computation. They analyze computation over rings other than Z and so include R and C.

3.3. Krylov subspaces. Here we sketch the conventional wisdom on this topic. These subspaces of R^n are defined by

    K_j = K_j(A, b) = span(b, Ab, ..., A^{j−1} b).

There is no loss in assuming that dim K_j = j.

Information N_j permits computation of any vector v in K_{j+1}, not just K_j, at no further cost provided that the coefficients γ_i in v = Σ_{i=0}^{j} γ_i (A^i b) are known.

On the practical side the dominant question for the experts has been how to obtain a computationally satisfactory basis for K_j. Roundoff error destroys the expected linear independence of the computed vectors.

Some researchers maintain that it is more efficient to use a highly redundant spanning set rather than a basis. Others recommend the additional expense of computing an orthonormal basis. In any case, it is the computation of the basis or spanning set, along with multiplication of vectors by A, that is the expensive part of the computation. The model of cost used in IBCT theory reflects this quite well.

It is the number of steps that matters. We think of a step as adding one more vector to the Krylov sequence {b, Ab, A^2 b, ...}.

One of the founders of Krylov space methods, C. Lanczos [La, 1952], proposed a useful basis for K_j with the property that the projection of symmetric A onto K_j is a symmetric tridiagonal j × j matrix T_j. Tridiagonal matrices are easy to handle. With a basis in hand, there are diverse tasks that can be accomplished.
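Before listing those tasks, here is a minimal sketch, ours, of the Lanczos construction just described for symmetric A. Full reorthogonalization is used purely to keep the computed basis orthogonal (in exact arithmetic the three-term recurrence suffices), and a breakdown (dim K_j < j) is not handled.

    import numpy as np

    def lanczos(A, b, j):
        """Orthonormal basis Q_j of K_j(A, b) and T_j = Q_j^T A Q_j
        (symmetric tridiagonal) for symmetric A."""
        n = len(b)
        Q = np.zeros((n, j))
        alpha = np.zeros(j)
        beta = np.zeros(max(j - 1, 0))
        q = b / np.linalg.norm(b)
        for k in range(j):
            Q[:, k] = q
            w = A @ q                                   # the one costly step
            alpha[k] = q @ w
            w -= Q[:, :k + 1] @ (Q[:, :k + 1].T @ w)    # orthogonalize against K_{k+1}
            if k < j - 1:
                beta[k] = np.linalg.norm(w)             # assumed nonzero (dim K_j = j)
                q = w / beta[k]
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        return Q, T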

Here is a partial list.

(i) Compute an approximation x^{(j)} to A^{−1}b such that x^{(j)} ∈ K_j and its residual b − Ax^{(j)} is orthogonal to K_j. It so happens that ∥b − Ax^{(j)}∥ may be computed without forming x^{(j)} so there is no need to compute unsatisfactory x^{(j)}. When A ∈ SPD (symmetric positive definite matrices) then x^{(j)} coincides with the output of the conjugate gradient algorithm.
(ii) Compute the vector u^{(j)} that minimizes ∥b − Av∥ over all v ∈ K_j (not K_{j+1}). This is the MR (minimum residual) algorithm. The extra vector A^j b is needed to ascertain the coefficients in the expansion of u^{(j)}.
(iii) Compute some, or all, of the Rayleigh-Ritz approximations (θ_i, y_i), i = 1, ..., j, to eigenpairs of symmetric A. Here θ_i ∈ R and {y_1, ..., y_j} is an orthonormal basis for K_j. For each i, Ay_i − y_iθ_i is orthogonal to K_j.

Krylov subspace methods are not really iterative.

All the basic tasks mentionedin §3.1 are solved exactly (in exact arithmetic) in at most n steps. However theinterest in this approach is due to the fact that in many instances far fewer thann steps are required to produce acceptable approximations.

In other words, to take n steps is the practical equivalent of failure. However, for each basic task there are data pairs (A, b) that do require n steps even for loose tolerances such as ε = 10^{−3}, so research has focussed on explanations of why so often the cost is much smaller.

The gradual realization of the efficacy of changing the Ax = b problem to an equivalent one via a technique called preconditioning has enhanced the use of Krylov subspace methods.

In a sense the convergence of all these methods is completely understood in the symmetric case and is embodied in the error bounds published by Kaniel and improved by Paige and Saad (see [Ka, 1966; Sa, 1980; Pa, 1980] for the details). The error depends on two things; the eigenvalue distribution of A and the components of the starting vector b along the eigenvectors. Of course, all this analysis supposes a given matrix A, not a set of indistinguishable matrices.

From this conventional viewpoint the thrust of these two complexity papers is to see to what extent the standard algorithms (CG, MR, Lanczos) do not make best use of the information on hand.

Recall that N_j contains an extra vector not in K_j. This is a reasonable project and the results can be expressed within the usual framework.

The term “complexity theory” appears to a numerical analyst like window dressing.

4. On the optimal solution of large linear systems

These sections offer a description of and commentary on [Tr & Wo, 1984].

4.1. Spurious generality.

Here is a quotation from the introduction:

We contrast our approach with that which is typical in the approximate solution of large linear systems. One constructs an algorithm Φ that generates a sequence {x_k} approximating the solution α = A^{−1}b; the calculation of x_k requires k matrix-vector multiplication and x_k lies in the Krylov subspace spanned by b, Ab, ..., A^k b. The algorithm Φ is often chosen to guarantee good approximation properties of the sequence {x_k}. In some cases Φ is defined to minimize some measure of the error in a restrictive class of algorithms. For instance, let this class be defined as the class of ‘polynomial’ algorithms; that is

    α − x_k = W_k(A)α,   where W_k(0) = 1.

Here W_k is a polynomial of degree at most k. ... [Some omitted sentences define the minimum residual and conjugate gradient algorithms.]

It seems to us that this procedure is unnecessarily restrictive. It is not clear, a priori, why an algorithm has to construct x_k of the form α − x_k = W_k(A)α. Indeed, we show that for orthogonally invariant classes of matrices the minimum residual algorithm (MR) is within at most one matrix vector multiplication of the lower bound without any restriction on the class of algorithms. However, if the class is not orthogonally invariant, the optimality of MR may disappear.

Our first point was made earlier.

The information N does not come with the linear equations problem. The brief answer to the quoted rhetorical question (why must an algorithm construct x_k of the given form?) that serves to justify the whole paper is the following.

To any vector x not in the Krylov subspace K_k, there is an admissible matrix A, such that the residual norm is as large as you please. This holds even when A is required to be symmetric and positive definite.

An admissible matrix A is one that is consistent with the Krylov information (more on this below).

4.2. Definitions and optimality. In this section we put down the definitions made in [Tr & Wo, 1984].

Our comments are reserved for the next section.

(i) Let F be a subclass of the class GL(n, R) of n × n nonsingular real matrices.
(ii) Let b ∈ R^n with ∥b∥ = (b, b)^{1/2} = 1 be given. For 0 ≤ ε < 1, find x ∈ R^n such that ∥b − Ax∥ ≤ ε (it would have been kinder to add A ∈ F).
(iii) Krylov information: N_j(A, b) = {b, Ab, ..., A^j b}, j = 0, 1, ....
(iv) Measure of cost: N_j costs j units.
(v) An algorithm Φ = {φ_j} is a sequence of mappings φ_j : N_j(F, R^n) → R^n.
(vi) The set of indistinguishable matrices for given N_j(A, b):

    V(N_j(A, b)) = {Ã : Ã ∈ F, N_j(Ã, b) = N_j(A, b)}.

(vii) The matrix index of an algorithm Φ:

    k(Φ, A) = min{ j : max_{Ã ∈ V(N_j)} ∥b − Ãx_j∥ ≤ ε },   N_j = N_j(A, b),

where x_j = Φ_j(N_j(A, b)). If the set of j values is empty, then k(Φ, A) = ∞.
(viii) The class index of an algorithm Φ: k(Φ, F) = max_{B∈F} k(Φ, B).
(ix) The optimal matrix index: k(A) = min_Φ k(Φ, A) over Φ restricted to N.
(x) The optimal class index: k(F) = max_{B∈F} k(B).
(xi) Strong optimality: Φ is strongly optimal iff k(Φ, B) = k(B), for each B ∈ F.
(xii) Optimality: Φ is optimal iff k(Φ, F) = k(F).
(xiii) Almost strong optimality: Φ is almost strongly optimal iff k(Φ, B) ≤ k(B) + c, for every B ∈ F, for some small integer c.

Remark 1. Since A^i b = A(A^{i−1} b), it follows that Krylov information N_j(A, b) requires j applications of the operator A.

That is why the cost is given as j units. In practice, one uses better bases for the Krylov subspace K_j than is provided by N_j(A, b); but for theoretical purposes, this modification may be ignored.

Remark 2. It can happen that k(A) ≪ k(F).

For this reason, it is of interest to find algorithms with small matrix index.

Remark 3. For simplicity, the dependence of all concepts on n, N_j, b, and ε is suppressed.

The idea is to compute k(A) and k(F) for interesting classes F and to find strongly optimal or optimal algorithms if possible.

4.3. Discussion of the basic concepts. In §2 we pointed out how misleading it can be to compute complexity for restricted classes F, which are difficult to discern in practice.

Here we wish to point out that F is introduced into the basic definitions, such as V in (vi), and there is no need for it.

To add to any confusion, the basic definitions do not make clear the role of A. In the context of numerical analysis there is a particular matrix A on hand and this permits one to test the residual r = b − Av for any vector v. However, in the context of IBCT, that is not quite the case.

In this game we consider a specific A, but it is not available explicitly. That odd situation certainly warrants some discussion, and it faithfully reflects the state of affairs at a typical step in a Krylov subspace method.

The matrix is hidden inside a subprogram, and the user has only a basis for the Krylov subspace corresponding to Krylov information N_j(A, b) = {b, Ab, ..., A^j b}. Associated with N_j(A, b) is

    V̂(N_j(A, b)) = {Ã : N_j(Ã, b) = N_j(A, b)},

the set of matrices indistinguishable from A by N_j(A, b). Contrast V̂ with V in §4.2 (vi).

With V̂ defined above, the natural definition of the matrix index of an algorithm Φ is

    k̂(Φ, A) = min{ j : max_{Ã ∈ V̂(N_j)} ∥b − Ãx_j∥ ≤ ε },   N_j = N_j(A, b),

where x_j = Φ_j(N_j(A, b)). If the set of j values is empty then k̂(Φ, A) = ∞. Please note that in contrast to (vii) in §4.2 there are no hidden parameters in k̂.

It is the first step in the process at which the task is accomplished by Φ for all matrices indistinguishable from A by N_1 through N_k̂. Then the optimal index is

    k̂(A) = min_Φ k̂(Φ, A)

over all Φ such that Φ_j uses only N_j(A, b) and standard arithmetic operations. There is no logical need for F. However, given a class F, one may define

    k̂(Φ, F) = max_{B∈F} k̂(Φ, B);    k̂(F) = min_Φ k̂(Φ, F).

Why did IBCT not follow this simple approach?

Why does IBCT use V = V̂ ∩ F to define the matrix index k(Φ, A) and thus suppress the role of F? The reason, we suspect, is that with these natural definitions the “polynomial” algorithms, deemed to be restrictive in the introduction to [Tr & Wo, 1984], are mandatory; and, consequently, IBCT has nothing new to offer.

Here is a result of ours that shows why the nonpolynomial algorithms are of no interest in worst case complexity, i.e., k̂(Φ, A) = ∞.

Theorem. Assume A = A^t ∈ R^{n×n}. Let K_j = span(b, Ab, ..., A^{j−1} b) have dimension j (< n) and ∥b∥ = 1. To each v ∉ K_j there exists Ã ∈ V̂(N_j(A, b)) such that ∥b − Ãv∥ > 1.

Sketch of proof.

(1) Should b happen to be orthogonal to some eigenvectors of A, it is always possible to choose an Ā ∈ V̂(N_j(A, b)) such that b is not orthogonal to any of Ā's eigenvectors. If necessary replace A by Ā.

(2) There is a distinguished orthonormal basis for K_j that can be extended to a basis for R^n and in which A is represented by the matrix

    [ T   E^t ]
    [ E   U   ]

where T = T^t ∈ R^{j×j}, E is null except for a β ≠ 0 in the top right entry, and U is unknown. Moreover T and β are determined by N_j(A, b).

(3) In this distinguished basis b is represented by e_1 = (1, 0, 0, ..., 0)^t and let any v ∉ K_j be represented in partitioned form (f, g) where f ∈ R^j, g ∈ R^{n−j}. By hypothesis g ≠ 0, since v ∉ K_j, and

    ∥b − Ãv∥^2 = ∥e_1 − Tf − E^t g∥^2 + ∥Ef + Ũg∥^2,

where Ũ is the (2, 2) block in the representation of Ã and is unconstrained.

(4) To each Ũ ∈ R^{(n−j)×(n−j)} there is an Ã ∈ V̂(N_j(A, b)) and for any g ≠ 0, there exists Ũ such that

    ∥Ef + Ũg∥^2 = ∥e_1 β f(j) + Ũg∥^2 > 1.

In particular, it is possible to select Ũ to be symmetric, positive definite, and diagonal. Here ends the sketch of the proof.

Symmetry is not needed in the result given above. If A is not symmetric there is still a distinguished orthonormal basis for K_j(A, b) and R^n such that b is represented by e_1, and A is represented by

    [ H   J ]
    [ E   L ].

Now H, E, and the first column of J are determined by N_j(A, b).

Moreover, Je_1 ≠ 0 and for Ã indistinguishable from A we have

    ∥b − Ãv∥^2 = ∥e_1 − Hf − Jg∥^2 + ∥e_1 β f(j) + L̃g∥^2.

This can be made as large as desired. In the language of IBCT, k̂(Φ, A) = ∞ for any Φ such that Φ(N_j(A, b)) takes values outside K_j(A, b).

Only two choices were left to IBCT.

Either turn away to unsolved problems or cut down the number of indistinguishable matrices by using V = V̂ ∩ F instead of V̂.

Here is the quandary for IBCT. If F is chosen too small, the model loses realism. If F is allowed to be large, then the standard “polynomial” regime is optimal.

4.4. Discussion of results.

The main result of the paper concerns the minimal residual algorithm MR: this polynomial algorithm's output for the information N_j = {b, Ab, ..., A^j b} is the vector in the Krylov subspace K_j = span(b, Ab, ..., A^{j−1} b) that minimizes the residual norm, so MR is optimal in K_j. Given N_j MR needs k(MR, A) steps to achieve an ε-approximation in the worst case.
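As an aside, here is a minimal least-squares sketch of such an MR step. It is ours, not the implementation used in practice (in practice one works with an orthonormal basis of K_j rather than the raw Krylov vectors), and the function name is illustrative only.

    import numpy as np

    def mr_step(krylov_vectors):
        """Given N_j(A, b) = [b, Ab, ..., A^j b], return the x in
        K_j = span(b, ..., A^{j-1} b) minimizing ||b - A x||.

        If x = sum_i c_i A^i b with i < j, then A x = sum_i c_i A^{i+1} b,
        so every vector needed is already in the information; the extra
        vector A^j b is what makes the coefficients computable."""
        b = krylov_vectors[0]
        K = np.column_stack(krylov_vectors[:-1])    # spans K_j (ill-conditioned, but fine for a sketch)
        AK = np.column_stack(krylov_vectors[1:])    # A applied to those columns
        c, *_ = np.linalg.lstsq(AK, b, rcond=None)  # minimize ||b - AK c||
        return K @ c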

However, IBCT says

Theorem 3.1 [Tr & Wo, 1984]. If F is orthogonally invariant then

    k(MR, A) ≥ k(A) ≥ k(MR, A) − 1

for any A ∈ F. Furthermore, both the upper and lower bounds can be achieved.

Recall that k(A) is the minimal number of steps over all admissible algorithms. The fact that MR is not always strongly optimal for the given information appears to give substance to the theory.

It will astonish the numerical analyst, so let us look at the example that purports to show that MR is not always optimal for the given information (Example 3.2 in the paper). This class is

    F̃_ρ = {A : A = I − B, B = B^t, ∥B∥ ≤ ρ < 1}.

When ε, ρ, and n are specially related so that

    q(ε) = ln((1 + (1 − ε^2)^{1/2})/ε) / ln((1 + (1 − ρ^2)^{1/2})/ρ) < n,

then MR is just beaten by another polynomial algorithm, called the Chebyshev algorithm, because

    k(Cheb, A) = q(ε),   k(MR, A) = q(ε) + 1.

A word of explanation is in order.

Recall that A^j b ∈ N_j(A, b), but A^j b ∉ K_j. The MR algorithm needs A^j b to compute the coefficients γ_i in

    MR(N_j(A, b)) = Σ_{i=0}^{j} γ_i (A^i b).

This always beats the Cheb output from K_j. However, Cheb can use the well-known three-term recurrence, based on ρ, to obtain its equioscillation approximation from K_{j+1}, not just K_j.
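For concreteness, here is a standard textbook form of that three-term Chebyshev recurrence. This is our sketch, not the algorithm as specified in [Tr & Wo, 1984]; it assumes only matrix-vector products with A = I − B and the bound ρ, so its extra knowledge is exactly the parameter supplied by the class F̃_ρ.

    import numpy as np

    def chebyshev(matvec, b, rho, steps, x0=None):
        """Chebyshev semi-iteration for A x = b when A = I - B with
        ||B|| <= rho < 1, i.e., the spectrum of A lies in [1 - rho, 1 + rho]."""
        theta, delta = 1.0, rho          # center and half-width of that interval
        x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
        r = b - matvec(x)
        sigma = theta / delta
        rho_k = 1.0 / sigma
        d = r / theta
        for _ in range(steps):           # one matrix-vector product per step
            x = x + d
            r = r - matvec(d)
            rho_next = 1.0 / (2.0 * sigma - rho_k)
            d = rho_next * rho_k * d + (2.0 * rho_next / delta) * r
            rho_k = rho_next
        return x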

With the right relation between ε, ρ, and n, one has

    ∥b − A MR(N_{q(ε)})∥ > ∥b − A Cheb(N_{q(ε)})∥ > ∥b − A MR(N_{q(ε)+1})∥.

Is it fair to compare them? The theory claims to compare algorithms restricted solely to information N_j. So how could the Cheb algorithm obtain the crucial parameter ρ?

The answer is that ρ is found in the definition of the problem class F̃_ρ! In other words, knowledge that Cheb can use is passed along through the problem class, not the information.

The important point we wish to make is not just that comparisons may not be fair but that the results of IBCT tell us as much about the idiosyncrasies of its framework as they do about the difficulty of various approximation problems. With a realistic class such as SPD (sym. pos. def.), MR is optimal (strongly) as it was designed to be and as is well known.

In more recent work [Wo, 1985; Tr & Wo, 1988], the flaw mentioned above appears to be corrected and the parameter ρ is put into the information explicitly. Again Cheb wins by 1 because it uses ρ while MR does not.

However, this new clarity comes at the expense of realism; the Krylov information is scrupulously priced while ρ comes free.

Yet membership in F̃_ρ may be more difficult to ascertain than the approximate solution.

Although the IBCT paper does not mention the possibility, Krylov information may be used to obtain lower bounds on ρ that get increasingly accurate as the dimension of the Krylov subspace increases. Algorithms that exploit knowledge of the spectrum will have good average behavior but there is little room for improvement in the worst case.

The simple facts are well known: Chebyshev techniques are powerful and users are willing to do considerable preliminary work to estimate parameters such as ρ. It is not clear, and depends strongly on the data, when it is preferable to use a weaker method such as MR that does not need extra parameters.

The result that MR is only almost strongly optimal is a striking example of obfuscation. The framework of IBCT permits unnatural comparison of algorithms.

Embracing the conjugate gradient algorithm.

In §4 of their paper the authors generalize the framework to cover other known methods such as the conjugate gradient algorithm. All that is necessary (for IBCT) is to introduce a parameter p into the basic task.

Now an ε-approximation to A^{−1}b is redefined as any x ∈ R^n that satisfies

    ∥A^p(x − A^{−1}b)∥ < ε∥A^{p−1}b∥.

The cases p = 0, 1/2, 1 are the most important, and when p is not an integer, it is appropriate to restrict attention to the symmetric, positive definite subset (SPD) of R^{n×n}. When p = 1 we recover MR and the results of §3.

To generalize MR to p = 0, it is necessary to use the normal equations of a given system. The new feature, slipped in without mention, is that with p < 1 the right-hand side of the new definitions is not directly computable.

So how does an algorithm know when to terminate? Please note that approximation theory can present results that are not computable without a blush because it merely exhibits relationships. In computation, however, there is no virtue in attaining the desired accuracy if this fact cannot be detected. Since IBCT defines an algorithm as a sequence of mappings, it dodges the difficult task of knowing when to stop.

The fact that the Conjugate Gradient algorithm (p = 1/2) minimizes ∥b − Ax∥_A at each step is well known (see [Da, 1967]).

Nevertheless, in practice, the algorithm is usually terminated by the less desirable condition ∥b − Ax∥ < ε∥b∥ because the desirable A norm is not available (see [Go & Va, 1984]).

For the reasons given in the two previous paragraphs, Theorem 4.2 in [Tr & Wo, 1984], which states that under certain technical conditions the generalized MR algorithm (0 ≤ p ≤ 1) is almost strongly optimal, is a true theorem about improper algorithms.

4.5. An interesting result. Recall that

    N_j(A, b) = {b, Ab, ..., A^j b},
    K_{j+1} = span{b, Ab, ..., A^j b},
    V̂(N_j(A, b)) = {Ã : N_j(Ã, b) = N_j(A, b)}.

Theorem (reformulated by us from Theorem 3.1 in [Tr & Wo, 1984]). If y ∈ R^n yields an ε residual norm, ∥b − Ãy∥ < ε∥b∥, for all Ã ∈ V̂(N_j(A, b)) then so does its orthogonal projection z onto K_{j+1}.

We offer a simplified version of the argument in [Tr & Wo, 1984].

Proof. There are five steps.

(i) Either y ∈ K_{j+1}, and there is nothing more to prove, or y = z + w, with z ∈ K_{j+1}, and 0 ≠ w orthogonal to K_{j+1}.
(ii) For the vector w defined in (i) there is a unique symmetric orthogonal matrix H = H(w), called the reflector, that reverses w. In particular: (a) Hx = x, for x orthogonal to w, (b) Hw = −w.

Define an auxiliary matrix Â by
(iii) Â = HAH ∈ V̂(N_j(A, b)) since, by use of (ii)(a), Â^i b = HA^iHb = HA^i b = A^i b, i = 1, ..., j. Note that

    Ây − b = HAHy − b
           = HA(z − w) − b,     using (i) and (ii)(b),
           = H(Az − b − Aw),    using Hb = b.

This shows the crucial relationship
(iv) ∥Ây − b∥ = ∥Az − b − Aw∥, since H preserves norms. Hence
(v)

    ∥Az − b∥ ≤ ½{∥Az − b − Aw∥ + ∥Az − b + Aw∥},   by the triangle inequality,
             = ½{∥Ây − b∥ + ∥Ay − b∥},              by (iv) and (i),
             ≤ ε∥b∥,                                 using the hypothesis.

Recall that z is y's projection onto K_{j+1}. Hence Ãz = Az for all Ã ∈ V̂(N_j(A, b)) and so

    ∥Ãz − b∥ = ∥Az − b∥ ≤ ε∥b∥,   by (v).   Q.E.D.

This theorem explains why MR cannot lag more than one step behind any algorithm that produces an ε residual norm for all matrices indistinguishable from A.

For, by definition, MR produces from N_{j+1}(A, b) (note the increased subscript) the unique vector in K_{j+1} that gives the smallest residual and is at least as good as the vectors y and z defined in the proof above. But y could be the output of a rival algorithm.

Our formulation of the lemma omits any mention of the class F. Now it is clear why the hypothesis that F should be orthogonally invariant appears in most of the theorems. Recall that V̂(N_j(A, b)) is too big. To cut down the number of indistinguishable matrices, the theory uses V̂ ∩ F = V.

To make use of the theorem, it is necessary to have HAH ∈ F and this will be true provided that F is orthogonally invariant.

4.5. Summary. One hidden defect in the framework for discussing the MR algorithms is the far reaching feature that allows the family F to convey what most people would call free information behind the back of the information operator N.

More disturbing than the previous defect is that we cannot see how any algorithm other than the well-studied polynomial algorithms could know when it had achieved an ε-approximation if it is restricted to the given information. This gives rise to a feeling that [Tr & Wo, 1984] managed to create an artificial problem where no real puzzle exists.

The mentioned theorems (3.1 and 4.2) reflect only the propensity of their general theory of optimal algorithms for creating such situations.

5. Optimal solution of large eigenpair problems

This section offers a description of and commentary on [Ku, 1986].

The paper demonstrates cleverness and clean exposition but, nevertheless, suffers from design flaws; it equates different versions of a given algorithm and it redefines a standard task. From the abstract:

The problem of approximation of an eigenpair of a large n × n matrix A is considered.

We study algorithms which approximate an eigenpair of A using the partial information on A given by b, Ab, A^2 b, ..., A^j b, j ≤ n, i.e., by Krylov subspaces.

A new algorithm called the generalized minimal residual algorithm (GMR) is analyzed. Its optimality for some classes of matrices is proved.

We compare the GMR algorithm with the widely used Lanczos algorithm for symmetric matrices. The GMR and Lanczos algorithms cost essentially the same per step and they have the same stability characteristics.

Since the GMR algorithm never requires more steps than the Lanczos algorithm, and sometimes uses substantially fewer steps, the GMR algorithm seems preferable. ... The Fortran subroutine is also available via ...

This last phrase shows that the subject matter is firmly within the field of numerical analysis. Implementation issues concerning GMR are described in [Ku, 1985].

5.1. A subtle change of goal. Here are the first five lines of the paper.

“Suppose we wish to find an approximation to an eigenpair of a very large matrix A. That is, we wish to compute (x, ρ), where x is an n × 1 normalized vector, ∥x∥ = 1, and ρ is a complex number s.t.

    (1.1)    ∥Ax − xρ∥ < ε

for a given positive ε.

Here ∥·∥ denotes the 2-norm.”

It is all too easy to assent to this statement of the problem and pass on to the rest of the article. However it is not the normal eigenvalue problem.

We are not aware of any demand at all for the accomplishment of this particular task. The users of eigenvalue programs (engineers, theoretical chemists, theoretical physicists) want eigenvalues in specified parts of the spectrum; occasionally, they want the whole spectrum.

The main concern of this article is with symmetric matrices; and because their eigenvalues are real, the usual demands are for the leftmost p eigenvalues (for some p ≤ n) or the rightmost p eigenvalues or for all eigenvalues in a given interval. Eigenvectors may or may not be wanted. There is nothing inherently wrong with restricting attention to the rather special case p = 1 and a few articles (not cited by Kuczynski) have been devoted to it (see [O'L, Ste & Va, 1979; Pa, Si & Str, 1982] for the details).

To support our description of users' demands we refer to three publications from different fields, [Cu & Wi, 1985, Introduction; Je, 1977, Ch. 7; Sh, 1977, §6].

The consequences of leaving out a vital aspect of the usual task are most serious, precisely when one seeks optimal performance.

One reason why it is so easy to overlook the omission in the problem statement is that, for symmetric matrices, almost everyone does use the residual norm ∥Ax − xρ∥ to judge the accuracy of an approximate eigenpair (x, ρ).

However, it is not very interesting to minimize the residual norm if that might yield a ρ in the unwanted part of the spectrum. Now a pure mathematician is free to define his goal at will. What is regrettable is that no hint is given to the reader that the goal is not standard.

We say more about the ε appearing in (1.1) in §5.5.

We mention one other fact which may be news to readers who are not much concerned with eigenvalue problems.

It suggests why the direction taken by Kuczynski has not appeared in the literature before. If we are given a symmetric matrix A and seek a single eigenvalue (with or without its eigenspace) then the wanted eigenvalue is almost certain to be either the leftmost or the rightmost.

Recall that the Rayleigh quotient of a column vector x ≠ 0 in R^n is the number x^tAx/x^tx. The extreme eigenvalues of A are the extreme values of the Rayleigh quotient over all possible vectors x in R^n.

It happens that, for the given Krylov information N_j, the Lanczos algorithm is optimal for this task in the strong sense that it yields the leftmost and rightmost values of the Rayleigh quotient over all vectors in the available space K_j. The last vector A^j b in N_j is needed to ascertain the extreme values over K_j.
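In symbols, the two standard facts just recalled read as follows (a restatement, not a quotation from either paper; θ_1^{(j)} and θ_j^{(j)} denote the extreme Rayleigh-Ritz values over K_j, notation ours):

    % Extremal characterization of the outer eigenvalues of symmetric A,
    % and of the extreme Ritz values delivered by Lanczos over K_j:
    \[
      \lambda_{\min}(A) = \min_{x \neq 0} \frac{x^{t} A x}{x^{t} x},
      \qquad
      \lambda_{\max}(A) = \max_{x \neq 0} \frac{x^{t} A x}{x^{t} x},
    \]
    \[
      \theta_{1}^{(j)} = \min_{0 \neq x \in K_j} \frac{x^{t} A x}{x^{t} x},
      \qquad
      \theta_{j}^{(j)} = \max_{0 \neq x \in K_j} \frac{x^{t} A x}{x^{t} x}.
    \]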

Thus the problem is settled. It is a pity that this well-known fact was not mentioned.

5.2. Choosing a bad starting vector.

The particular aspect of Information-Based Complexity Theory adopted in the paper under review is called worst-case complexity. It seeks bounds on the cost of computing ε-approximations over all matrices in certain classes F and over all starting directions b. Theorems 3.1, 3.2, 4.1, 5.1 (there is no theorem 1.1 or 2.1) in [Ku, 1986] are examples.

In particular, the theory must cover what can happen with the worst possible starting vector. Theorem 3.1 is quoted in full in §5.4.

There is nothing wrong with studying the worst case. Indeed it has already been done.

[Sc, 1979] is a paper with the clear title How to make the Lanczos algorithm converge slowly in which the author gives formulae for a starting vector that prevents any Rayleigh Ritz approximation from converging until the final step! Scott's paper appeared in Mathematics of Computation, the American Mathematical Society's principal outlet for numerical analysis, but it is not referenced in [Ku, 1986].

The fact that some Krylov subspaces can be very badly aligned with A's eigenvectors does prevent worst-case analysis from shedding much light on how Krylov subspaces approach certain eigenvectors in the usual case of a random starting vector. That study, of course, comes under average-case analysis and is ripe for attention.

Please note that this comment is quite independent of comparisons of GMR and Lanczos.

The point is this: the starting vector b is a free parameter in the eigenvalue problem (in contrast to the linear equations problem Ax = b).

It is not given and may be chosen to improve performance. In the absence of extra information, it is the almost universal habit to pick b with the aid of a random number generator. Recent theoretical work on Lanczos has been concerned to explain why this choice is so powerful (see [Sa, 1980; Pa, 1980]).

Note that two quite different situations have been pushed together under the label ‘worst case’. It is quite normal to consider the most difficult matrices A because they are part of the problem.

On the other hand, a bad b is a self-inflicted handicap rather than a genuine difficulty. It is the confounding of these cases that is unfortunate, not their study.

Returning to the eigenvalue problem, we can rephrase our complaint as follows: Kuczynski's focus, perhaps unwittingly, is on Krylov-subspaces-with-worst-possible-starting-vectors.

What a pity that this was not emphasized! The numerical examples given in the paper are not without interest.

The starting vector there, though not perhaps worst possible, is very bad. Both methods, GMR and Lanczos, converge very slowly.

The chosen matrices are extensions of the one used by Scott to illustrate how bad Rayleigh Ritz approximations can be (see [Pa, 1980, p. 218]). We ran these examples with our local Lanczos program. It uses a random starting vector, and convergence was quite satisfactory.

The results are given in §5.5.

There is a different context in which the focus on worst starting values is much more relevant. The GMR algorithm presented by Kuczynski is a generalization of the MR (minimum residual) algorithm used to compute approximations to A^{−1}b. There one seeks vectors x in R^n s.t. ∥Ax − b∥ < ε∥b∥. A well-chosen subspace may be used to generate approximate solutions at low cost.

It is advisable to ensure that the right-hand side is in the chosen subspace, and this consideration leads one to choose the subspace K_j. In this context b is part of the data (A, b) and is not at our disposal.

The study of bad b's is relevant to a study of the complexity of Krylov space methods for linear equations. However, it has been appreciated from the beginning that for reasonable ε and unfortunate b then n steps will be required unless A is close to the identity matrix (see [Ka, 1966, Thm. 4.3; and Me, 1963]).

To burden the Lanczos algorithm (or GMR) with unnecessarily bad starting vectors for the eigenvalue problem is like studying the performance of Olympic athletes only when they suffer from some rare affliction like poison ivy.

5.3. Redefining the Lanczos algorithm. The new algorithm GMR is contrasted with the well-known Lanczos algorithm.

Here is Kuczynski's definition of the Lanczos algorithm, from p. 142. The subspace A_j is our Krylov subspace K_j. Perform the following steps:

(1) Find an orthonormal basis q_1, q_2, ..., q_j of the subspace A_j; let Q_j = (q_1, ..., q_j) be the n × j matrix.
(2) Form the j × j matrix H_j = Q_j^t A Q_j; compute eigenpairs of H_j: H_j g_i = h_i g_i, (g_i, g_m) = δ_im, i, m = 1, ..., j.
(3) Compute the Ritz vectors z_i = Q_j g_i and the residual r^L_j = min_{1≤i≤j} ∥Az_i − θ_i z_i∥ for 1 ≤ i ≤ j.
(4) Define Z_j = {(z_i, θ_i), i = 1, 2, ..., j : ∥Az_i − z_iθ_i∥ = r^L_j}.

The jth step of the L algorithm is defined by

    Φ^L_j(N_j(A, b)) = (x_k, ρ_k),

where (x_k, ρ_k) is an arbitrary element from Z_j.

The trouble is that steps 3 and 4 have been changed from the usual ones to conform with the idiosyncratic goal discussed in §5.1. However, no mention of this fact is made.

Here is the conventional description wherein it is supposed that p eigenvalues are to be approximated.

It is from [Pa, 1980, p. 214].

(1) Is the same as above. (2) Form the j × j matrix Hj = QtjAQj; compute the p(≤j) eigenpairs of Hjthat are of interest, sayHjgi = giθi,i = 1, .

. .

, pThe θi are Ritz values, θ1 < θ2 < · · · < θj. Equality is not possible.

(3) If desired, compute the p Ritz vectors zi = Qjgi, i = 1, . .

. , p. The full set{(θi, zi), i = 1, .

. .

, j} is the best set of j approximations to eigenpairs of Athat can be derived from Aj alone. (4) Residual error bounds.

Form the p residual vectors ri = Azi −ziθi. Eachinterval [θi −∥ri∥, θi + ∥ri∥] contains an eigenvalue of A.

If some intervalsoverlap then a bit more work is required to guarantee approximations to peigenvalues. See [Pa, 1980, §11-5].

(5) If satisfied then stop.

In the context of Kuczynski’s investigations his modification is entirely reasonable; i.e., he selects at each step one Ritz pair with a minimal residual norm. However, it is most misleading to those not familiar with the Lanczos algorithm to suggest that its purpose is simply to produce this Ritz pair. In fact, as indicated above, the Lanczos algorithm produces an approximation, albeit crude, to the whole spectrum, namely θ1, . . . , θj, and the user is free to select from this set to suit the specific goal. Thus, to approximate the right-most eigenvalue, one concentrates on θj and continues until its error bound ∥rj∥ is satisfactory.
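To make the contrast concrete, here is a small numpy sketch of our own (neither paper’s code): given an orthonormal basis Q of Kj, it carries out the conventional Rayleigh-Ritz steps, returning all j Ritz pairs together with the residual intervals of step (4); the trailing comment indicates the single pair that Kuczynski’s redefinition would report instead.

    import numpy as np

    def rayleigh_ritz(A, Q):
        """Conventional Rayleigh-Ritz on range(Q), Q having orthonormal columns.

        Returns Ritz values (ascending), Ritz vectors, residual norms, and the
        intervals [theta_i - ||r_i||, theta_i + ||r_i||], each of which contains
        an eigenvalue of the symmetric matrix A."""
        H = Q.T @ A @ Q                       # the j x j projected matrix H_j
        theta, G = np.linalg.eigh(H)          # Ritz values and eigenvectors of H_j
        Z = Q @ G                             # Ritz vectors z_i = Q g_i
        R = A @ Z - Z * theta                 # residual vectors r_i = A z_i - z_i theta_i
        rnorm = np.linalg.norm(R, axis=0)
        intervals = [(t - r, t + r) for t, r in zip(theta, rnorm)]
        return theta, Z, rnorm, intervals

    # The conventional user watches whichever Ritz pair suits the goal, e.g. the
    # right-most value theta[-1] until rnorm[-1] <= eps.  Kuczynski's redefined
    # "Lanczos" would instead report only the pair with i = np.argmin(rnorm).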

In practice, more refined error bounds can be made but that is not germane here (see [Pa & No, 1985]). It would have been preferable to state the Lanczos algorithm conventionally and then specify the modifications appropriate for the purpose in hand. This action would make clear that the Lanczos algorithm is not trying to minimize one residual norm.

That is why it is inferior to GMR for that purpose.

It is worth pointing out here that, in the model of arithmetic used in these studies, the cost of all the Rayleigh-Ritz approximations and of finding the minimal residual norm over the subspace Kj is taken as nil. It might occur to the reader that in this context it would cost no more per step to compute all the Rayleigh-Ritz approximations and use whatever approximations one desires. Thus setting up GMR and Lanczos as competing algorithms is artificial. Moreover, in practice, it is much more expensive to compute the minimal residual than to compute the Rayleigh-Ritz residuals.
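The following deliberately naive sketch (ours; it is not how GMR is implemented) shows what the true minimal residual over Kj asks for: a nonlinear minimization over the j coefficients of x = Qc, using the fact that for a real symmetric A and a fixed unit vector x the best ρ is the Rayleigh quotient. The Rayleigh-Ritz residual norms, by contrast, drop out of the small eigenproblem that has already been solved.

    import numpy as np
    from scipy.optimize import minimize

    def min_residual_over_subspace(A, Q, trials=5, seed=None):
        """Naive estimate of  min ||Ax - rho*x||  over unit x in range(Q) and real rho.

        For fixed unit x the minimizing rho is the Rayleigh quotient x^T A x
        (A real symmetric), so we optimize only over the j coefficients c of
        x = Q c.  Random restarts guard against local minima; this is far more
        work than reading off the Rayleigh-Ritz residual norms."""
        rng = np.random.default_rng(seed)
        j = Q.shape[1]

        def residual(c):
            x = Q @ c
            nrm = np.linalg.norm(x)
            if nrm == 0.0:
                return np.inf
            x = x / nrm
            rho = x @ (A @ x)
            return np.linalg.norm(A @ x - rho * x)

        best = np.inf
        for _ in range(trials):
            res = minimize(residual, rng.standard_normal(j), method="Nelder-Mead")
            best = min(best, res.fun)
        return best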

Kuczynski has devoted a whole report to the task.

5.4. Theoretical results. From p. 138 of [Ku, 1986].

We are ready to formulate the main theorem of the paper.

Theorem 3.1. If F is unitarily (orthogonally) invariant, then the GMR is almost strongly optimal in F, i.e., k(Φgmr, A, b) = minΦ k(Φ, A, b) + a, for any (A, b) ∈ F × Sn, where a ∈ {0, 1, 2}.

Here k(Φ, A, b) is the minimal number of steps j required to guarantee an ε-residual with algorithm Φ over all matrices Ã that are indistinguishable from A with the given information N = Nj = [b, Ab, . . . , Ajb]. The algorithm Φ returns a pair ρ, x whose residual norm ∥Ãx − xρ∥ is to be compared with ε.

Recall that with information Nj, the GMR algorithm, by definition, picks out a unit vector x ∈ Kj and a ρ ∈ C that produce the minimal residual norm. How could any other algorithm possibly do better?

Well, there might be special circumstances in which one could deduce a suitable additional component of that last vector Ajb that is not used by GMR in forming x, although Ajb is used in calculating the coefficients of GMR’s approximation from Kj. The proof studies this possibility and concludes that GMR would make up any discrepancy in at most two more steps. The argument is very nice.

In many important cases, when A is Hermitian for example, the constant a in Theorem 3.1 is actually 0. There are other clever results. Theorem 4.2 shows that for symmetric matrices the residual norm of GMR must be strictly decreasing at least at every other step. Theorem 5.1 yields a beautiful but esoteric fact about Krylov subspaces generated by Hermitian A.

For the worst starting vector there is a unit vector v in Kj such that

    ∥A∥/(2j) ≤ ∥Av − vρ∥ ≤ ∥A∥/j,    for j < n.

As indicated above, these nice results from approximation theory do not add up to a case for replacing Rayleigh-Ritz with some rival algorithm. We sometimes abbreviate Rayleigh-Ritz by R-R.

5.5. Numerical examples.

All the numerical results reported in [Ku, 1986] concern symmetric tridiagonal matrices with starting vector e1 (the first column of the identity matrix I). This starting vector ensures that the Lanczos algorithm reproduces the original matrix.
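The claim is easy to check. The sketch below is our own illustration, not code from [Ku, 1986]; it runs the plain three-term Lanczos recurrence on a random symmetric tridiagonal with q1 = e1 and confirms that the computed Tj is just the leading j × j block of the input (exactly so when the off-diagonal entries are positive; in general the recurrence reproduces them up to sign).

    import numpy as np

    def lanczos_tridiag(A, q1, j):
        """Plain three-term Lanczos recurrence (no re-orthogonalization).
        Returns the j x j tridiagonal T_j = Q_j^t A Q_j built from starting vector q1."""
        n = len(q1)
        alpha = np.zeros(j)
        beta = np.zeros(j - 1)
        q_prev = np.zeros(n)
        q = q1 / np.linalg.norm(q1)
        b_prev = 0.0
        for k in range(j):
            w = A @ q - b_prev * q_prev
            alpha[k] = q @ w
            w = w - alpha[k] * q
            if k < j - 1:
                beta[k] = np.linalg.norm(w)      # assumes no breakdown (beta > 0)
                q_prev, q, b_prev = q, w / beta[k], beta[k]
        return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

    # Hypothetical check: symmetric tridiagonal of order 10 with positive off-diagonals,
    # started from e1.
    rng = np.random.default_rng(0)
    n = 10
    d = rng.uniform(-13.0, 13.0, n)
    e = rng.uniform(0.1, 13.0, n - 1)
    A = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
    T = lanczos_tridiag(A, np.eye(n)[:, 0], 6)
    print(np.allclose(T, A[:6, :6]))             # True: Lanczos merely reproduces A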

At this point we should recall that the original goal of the Lanczos algorithm was to reduce a symmetric matrix to tridiagonal form. So, the numerical results to be seen below do not relate to the Lanczos recurrence itself but merely indicate alternative rules for stopping. With the goal of tridiagonalization, the algorithm always stopped at step n. Later, it was realized that excellent approximations to a few eigenvectors were usually obtained early in the process.

There is no single stopping criterion for current Lanczos algorithms; termination depends on what the user wants (see [Cu & Wi, 1985; Pa, 1980; and Go & Va, 1984]).

Kuczynski provides the Lanczos algorithm with a stopping criterion to suit his purposes, but his algorithm GMR could have been called (with more justice) the Lanczos algorithm with a new stopping criterion. It uses the minimum residual in the whole Krylov subspace instead of the usual (cheap) Rayleigh-Ritz approximations. So, the numerical results simply indicate the effect of these different termination criteria.

The first batch of results concern tridiagonals with nonzero elements chosen at random from [−13, 13].

The most striking feature is that the GMR residual and the smallest Rayleigh-Ritz residual slip below the given ε at the same step in the vast majority of cases, particularly for ε < 10⁻³. In Table 8.1, with ε = 10⁻⁶, the step was the same in 18 out of 20 cases. In the other two, the Lanczos algorithm (i.e., the minimal R-R norm) took one more step (17 as against 16).

Some weight is given to the fact that the smallest R-R residual norm is rarely monotone decreasing from one step to another whereas GMR does enjoy this property.

However, if the approximate eigenvalue associated with the minimum residual happens to change position in the spectrum from step to step, then this monotonicity of GMR is not associated with convergence to a specific eigenvalue of the original matrix. No indication is given in the results of how the approximate eigenvalue implicitly chosen by GMR jumps around the spectrum.

In practice, the interesting thing to know is how many Ritz values have converged, and to what accuracy, when the algorithm is terminated. Unfortunately, this information is excluded from the GMR viewpoint and is not reported.

The next results, Examples 8.1 and 8.2 in [Ku, 1986], exhibit the dramatic failure of the Lanczos algorithm. On a tridiagonal matrix of order 201 and norm near 1 the minimal R-R residual remained at its initial value 0.035 for all steps except the last (at which it must be 0).

In contrast, the GMR residual declined slowly from the initial 0.035 to 0.0039 at step 200. If ε = 0.034, then GMR takes 2 steps while Lanczos takes 201!

However, with ε ≤ 10⁻³, both algorithms need 201 steps. We repeat, once again, that GMR will not know which eigenvalue it has approximated.

Unfortunately, no attempt is made to put this example in context. It illustrates the phenomenon explored in some detail in [Sc, 1979], namely that for every symmetric matrix with distinct eigenvalues there is a set (with positive Lebesgue measure on the sphere) of starting vectors such that no Rayleigh-Ritz approximation is any good until the last step.

We must repeat that the Lanczos algorithm is not obliged to use a poor initial vector. We ran our Lanczos code on this matrix. Our code starts with a normalized version of Ar, where A is the given matrix (or operator) and r’s elements are chosen at random from a uniform distribution.

The reason for starting with Ar is compelling when an operator A has unwanted infinite eigenvalues. The results are given in Table 1 below. The accepted eigenvalues (104 of them at step 190) agreed with those computed by EISPACK to all of the fifteen decimals printed out.

The efficiency is not at all bad considering that this is a difficult eigenvalue distribution for Krylov space methods.

Example 8.2, a tridiagonal of order 501 with null diagonal and monotonically increasing off-diagonal elements, caused the minimal R-R residual norm to increase from 0.001 initially to 0.011 at steps 499 and 500. In contrast, GMR residual norms declined to 0.00036 at steps 499 and 500.

Thus, with ε = .00099, GMR terminates at step 2 whereas Lanczos terminates at step 501! However, with ε ≤ 10⁻⁴, both take 501 steps.

As with Example 8.1, e1 is a bad starting vector yielding a poor Krylov subspace. We ran our Lanczos program and found the results given in Table 2.

We quote the final paragraph of the article.

    From all the tests we have performed we conclude that the GMR algorithm is essentially superior to the Lanczos Algorithm on matrices with constant or increasing codiagonal elements. For random matrices or matrices with decreasing codiagonal elements, both algorithms produce nearly the same residuals.

Table 1. Convergence of Ritz pairs on T201.

    STEP     ε                  Number of good Ritz values
     20      10⁻²               1
     30      10⁻³               1
     40      10⁻³               4
     50      10⁻³               5
     60      10⁻⁴, 10⁻⁶         7, 4
     70      10⁻⁵, 10⁻⁷         10, 5
     80      10⁻⁵, 10⁻⁷         14, 10
     90      10⁻⁵, 10⁻⁷         18, 14
    100      10⁻⁵, 10⁻⁷         23, 18
    110      10⁻⁵, 10⁻⁷         28, 24
    120      10⁻⁵, 10⁻⁷         35, 30
    130      10⁻⁵, 10⁻⁷         44, 37
    140      10⁻⁵, 10⁻⁷         52, 44
    150      10⁻⁵, 10⁻⁷         60, 54
    160      10⁻⁵, 10⁻⁷         70, 61
    170      10⁻⁵, 10⁻⁷         83, 75
    180      10⁻⁵, 10⁻⁷         99, 89
    190      10⁻⁵, 10⁻⁷         118, 109

The revealing word here is “codiagonal.” The author has worked exclusively with tridiagonal matrices and has forgotten that the goal of the Lanczos recurrence is to produce a tridiagonal matrix! Given such a matrix one has no need of either Lanczos or GMR.

As our results indicate, a random starting vector permits the Lanczos algorithm to perform satisfactorily even on such craftily-designed matrices. The quotation reveals just how far a mathematical theory can stray from relevance.

Table 2. Convergence of Ritz pairs on T501.

    STEP     ε                  Number of good Ritz values
     20      10⁻²               1
     30      10⁻³               1
     40      10⁻³               2
     50      10⁻⁴               3
     60      10⁻⁴, 10⁻⁶         5, 2
     70      10⁻⁵, 10⁻⁷         7, 3
     80      10⁻⁵, 10⁻⁷         9, 6
     90      10⁻⁵, 10⁻⁷         13, 10
    100      10⁻⁵, 10⁻⁷         16, 13
    110      10⁻⁵, 10⁻⁷         20, 15
    120      10⁻⁵, 10⁻⁷         23, 20
    130      10⁻⁵, 10⁻⁷         29, 23
    140      10⁻⁵, 10⁻⁷         34, 29
    150      10⁻⁵, 10⁻⁷         39, 34

5.6. Summary. Here is an attempt to formulate the numerical analyst’s version of Complexity Theory for Krylov subspaces and eigenvalues.

    For each symmetric n × n matrix there are initial vectors that yield an eigenvalue in one step, and initial vectors that yield an eigenvalue only at the nth step.

    The nontrivial result contained in the Kaniel-Paige-Saad error bounds (see [Pa, 1980, Chap. 12]) is that with most starting vectors the extreme eigenvalues can be found in a modest number of steps that depends on the distribution of the spectrum and is nearly independent of n.

We summarize our criticism of [Ku, 1986] but wish to note that the paper is essentially the author’s Ph.D. dissertation, and it would not be charitable to hold him responsible for the vagaries to which IBCT is subject.

    Section 1 exposes a serious flaw in the model, namely the goal.
    Section 2 exposes a subtle way in which features of a method are pushed into the problem statement; the starting vector.
    Section 3 shows how standard terms can be redefined; the Lanczos algorithm.
    Section 4 contains some clever and interesting results on the approximating power of Krylov subspaces.
    Section 5 shows how very misleading numerical results can be in the absence of proper interpretation.

Acknowledgment

The author has been assisted by many friends in composing this essay but would like to single out Stan Eisenstat, G. W. Stewart, and W. Kahan for their helpful and detailed comments.

References

[Ba, 1987] I. Babuska, Information-based numerical practice, J. Complexity 3 (1987), 331–346.
[Bl, Shu & Sm, 1989] L. Blum, M. Shub, and S. Smale, On a theory of computation and complexity over the real numbers; NP completeness, recursive functions, and Turing machines, Bull. Amer. Math. Soc. (N.S.) 21 (1989), 1–46.
[Bo & Mu, 1975] A. Borodin and I. Munro, Computational complexity of algebraic and numeric problems, Amer. Elsevier, New York, 1975.
[Cu & Wi, 1985] J. K. Cullum and R. A. Willoughby, Lanczos algorithms for large symmetric eigenvalue computations, Vol. I: Theory, Progr. Comput. Sci., Birkhäuser, Basel, 1985.
[Da, 1967] J. W. Daniel, The conjugate gradient method for linear and nonlinear operator equations, SIAM J. Numer. Anal. 4 (1967), 10–26.
[Go & Va, 1984] G. H. Golub and C. V. Van Loan, Advanced matrix computations, Johns Hopkins Univ. Press, Maryland, 1984.
[Je, 1977] A. Jennings, Matrix computation for engineers and scientists, John Wiley & Sons, Chichester, 1977.
[Ka, 1966] S. Kaniel, Estimates for some computational techniques in linear algebra, Math. Comp. 20 (1966), 369–378.
[Ku, 1986] J. Kuczynski, A generalized minimal residual algorithm for finding an eigenpair of a large matrix, J. Complexity 2 (1986), 131–162.
[Ku, 1985] J. Kuczynski, Implementation of the GMR algorithm for large symmetric eigenproblems, Tech. Report, Comp. Sci. Dept., Columbia University, New York, 1985.
[Ku & Wo, 1992] J. Kuczynski and H. Woźniakowski, Average case complexity of the symmetric Lanczos algorithm, SIAM J. Matrix Anal. Appl., 1992 (to appear).
[La, 1952] C. Lanczos, Solution of systems of linear equations by minimized iterations, J. Res. Nat. Bur. Standards 49 (1952), 33–51.
[Me, 1963] G. Meinardus, Über eine Verallgemeinerung einer Ungleichung von L. V. Kantorowitsch, Numer. Math. 5 (1963), 14–23.
[O’L, Ste & Va, 1979] D. O’Leary, G. W. Stewart, and J. S. Vandergraft, Estimating the largest eigenvalue of a positive definite matrix, Math. Comp. 33 (1979), 1289–1292.
[Pa, 1980] B. N. Parlett, The symmetric eigenvalue problem, Prentice-Hall, New Jersey, 1980.
[Pa, 1982] B. N. Parlett, Two monitoring schemes for the Lanczos algorithm, Computing Methods in Applied Sciences and Engineering (R. Glowinski and J. L. Lions, eds.), North-Holland Press, 1982.
[Pa, 1984] B. N. Parlett, The software scene in the extraction of eigenvalues from sparse matrices, SIAM J. Sci. Statist. Comput. 5 (1984), 590–603.
[Pa & No, 1985] B. N. Parlett and B. Nour-Omid, The use of refined error bounds when updating eigenvalues of tridiagonals, Linear Algebra Appl. 68 (1985), 179–219.
[Pa, Si & Str, 1982] B. N. Parlett, H. A. Simon, and L. M. Stringer, On estimating the largest eigenvalue with the Lanczos algorithm, Math. Comp. 38 (1982), 153–165.
[Pac, 1986] E. Packel, Review of A general theory of optimal algorithms, by J. F. Traub and H. Woźniakowski (Academic Press, New York, 1980), SIAM Rev. 28 (1986), 435–437.
[Pac & Wo, 1987] E. Packel and H. Woźniakowski, Recent developments on information-based complexity, Bull. Amer. Math. Soc. (N.S.) 17 (1987), 9–26.
[Ra, 1972] M. O. Rabin, Solving linear equations by means of scalar products, Complexity of Computer Computations (R. E. Miller and J. W. Thatcher, eds.), Plenum Press, New York, 1972, pp. 11–20.
[Sa, 1980] Y. Saad, On the rates of convergence of the Lanczos and the block-Lanczos methods, SIAM J. Numer. Anal. 17 (1980), 687–706.
[Sc, 1979] D. S. Scott, How to make the Lanczos algorithm converge slowly, Math. Comp. 33 (1979), 239–247.
[Sch & Str, 1971] A. Schoenhage and V. Strassen, Schnelle Multiplikation grosser Zahlen, Computing 7 (1971), 281–292.
[Sh, 1977] I. Shavitt, The method of configuration interaction, Modern Theoretical Chemistry, vol. 3 (H. P. Schaefer, ed.), Plenum Press, 1977.
[Shu, 1987] M. Shub, Review of Information, uncertainty, complexity, by J. F. Traub, H. Woźniakowski et al. (Addison-Wesley, Reading, MA, 1983), SIAM Rev. 29 (1987), 495–496.
[Sti, 1958] E. Stiefel, Kernel polynomials in linear algebra and their numerical applications, NBS Appl. Math. 43 (1958), 1–22.
[Str, 1969] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969), 354–356.
[Tr & Wo, 1980] J. F. Traub and H. Woźniakowski, A general theory of optimal algorithms, Academic Press, New York, 1980.
[Tr & Wo, 1984] J. F. Traub and H. Woźniakowski, On the optimal solution of large linear systems, J. Assoc. Comput. Mach. 31 (1984), 545–559.
[Tr & Wo, 1988] J. F. Traub and H. Woźniakowski, Information-based complexity, Academic Press, New York, 1988.
[Tr, Wo & Wa, 1983] J. F. Traub, H. Woźniakowski, and G. Wasilkowski, Information, uncertainty, complexity, Addison-Wesley, Reading, 1983.
[Wi, 1970] S. Winograd, On the number of multiplications necessary to compute certain functions, Comm. Pure Appl. Math. 23 (1970), 165–179.
[Wi, 1980] S. Winograd, Arithmetic complexity of computations, CBMS-NSF Regional Conf. Ser. in Appl. Math., vol. 33, SIAM, Philadelphia, 1980.
[Wo, 1985] H. Woźniakowski, A survey of information-based complexity, J. Complexity 1 (1985), 11–44.

Department of Mathematics and Computer Science Division of the Electrical Engineering and Computer Science Department, University of California, Berkeley, California 94720

