
📝 Original Info

  • ArXiv ID: 2512.22692

📝 Abstract

Existing machine learning frameworks operate over the field of real numbers (R) and learn representations in real (Euclidean or Hilbert) vector spaces (e.g., R^d). Their underlying geometric properties align well with intuitive concepts such as linear separability, minimum enclosing balls, and subspace projection; and basic calculus provides a toolbox for learning through gradient-based optimization.

📄 Full Content

But is this the only possible choice? In this paper, we study the suitability of a radically different field as an alternative to R: the ultrametric and non-Archimedean field of p-adic numbers, Q_p. The hierarchical structure of the p-adics and their interpretation as infinite strings make them an appealing tool for coding theory and hierarchical representation learning. Our exploratory theoretical work establishes the building blocks for classification, regression, and representation learning with the p-adics, providing learning models and algorithms. We illustrate how simple Quillian semantic networks can be represented as a compact p-adic linear network, a construction which is not possible with the field of reals. We finish by discussing open problems and opportunities for future research enabled by this new framework.

Since they were introduced by Hensel (1897), p-adic numbers have seen numerous applications in number theory, algebraic geometry, physics, and other fields (Koblitz, 1984; Robert, 2000; Gouvêa, 2020). They differ from the real numbers in many important ways, which leads to many fascinating and surprising results, such as the equality 1 + 2 + 4 + 8 + … = −1 (which holds in Q_2), or the fun fact that in the p-adic world all triangles are isosceles and any point in a ball is a center of that ball.

Yet, with only a few exceptions, very little work has investigated the potential of p-adic numbers in machine learning. Bradley (2009) studies clustering of p-adic data and proposes suboptimal algorithms for minimizing cluster energies. Murtagh (2004, 2009) analyzes dendrograms and ultrametricity in data. Chierchia and Perret (2019) and Cohen-Addad et al. (2020) develop procedures to fit ultrametrics to data. Khrennikov and Tirozzi (1999) propose a “p-adic neural network” (similar to our unidimensional linear classifier in §3). Baker and Molla-Aliod (2022) use a variant of p-adic regression for a small sequence-to-sequence problem in natural language processing, bearing some similarity to our formulation in §4. This paper is an attempt to establish the foundations for p-adic machine learning by developing building blocks for classification (§3) and regression problems (§4) and exploring the expressive power of p-adic representations (§5). The paper ends with a selection of open problems (§6) which I believe need to be addressed to make this framework practically useful. (The title is intentionally misleading: it is mostly about what I have been learning with the p-adics, and not so much about how one can do machine learning with the p-adics.)

This endeavour comes with several challenges, since the classical tools of gradient-based optimization and statistics are not readily available in the world of the p-adics: although we can do calculus and find roots of functions using Newton’s method, the topological properties of the p-adics seem to make derivatives not so useful (e.g., a function with zero derivative everywhere might not be constant), and it is not obvious how to optimize since the p-adics, unlike the reals, are not an ordered field. However, they possess a very interesting hierarchical structure which appears promising for representation learning and certain classification and regression problems.

Notation. We denote by R and Q the fields of real and rational numbers, respectively, R_+ the non-negative reals, and Z the ring of integers. We denote [n] = {1, …, n}.

We start by reviewing ultrametric spaces, the field of p-adic numbers, and the ring of p-adic integers, along with their basic properties (Gouvêa, 2020).

Let K be a field (such as Q or R) and let R_+ denote the non-negative real numbers. An absolute value on K is a function | · | : K → R_+ satisfying (i) |x| = 0 iff x = 0; (ii) |xy| = |x||y| for all x, y ∈ K; (iii) |x + y| ≤ |x| + |y| for all x, y ∈ K. An absolute value is called non-Archimedean if it has the following property (stronger than iii):

|x + y| ≤ max{|x|, |y|}, for all x, y ∈ K. (1)

It is called “Archimedean” otherwise.¹

Example 1 (trivial absolute value). The function defined as |x| = 1 if x ≠ 0 and |0| = 0 is an absolute value, called the trivial absolute value. It is non-Archimedean.

Example 2 (usual absolute value on Q and R). Let K = Q or R. The usual absolute value |x| := max{x, -x} is Archimedean.

Example 3 (p-adic absolute value on Q). Let p be a prime number. Any nonzero x ∈ Q can be written uniquely as x = p^n (a/b), where a and b are coprime integers not divisible by p, and n ∈ Z. The p-adic absolute value on Q is

|x|_p := p^{−n}, with |0|_p := 0.

This absolute value (used extensively throughout this paper) is non-Archimedean.
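To make the definition concrete, here is a minimal Python sketch (not part of the original paper) that computes |x|_p for rational inputs using exact arithmetic with fractions.Fraction and checks the non-Archimedean inequality (1) on a small example. The helper names p_adic_valuation and p_adic_abs are our own and are reused in later sketches.

```python
from fractions import Fraction

def p_adic_valuation(x: Fraction, p: int):
    """v_p(x): the exponent n in x = p^n * (a/b) with a, b coprime to p; v_p(0) = +inf."""
    if x == 0:
        return float("inf")
    v, num, den = 0, x.numerator, x.denominator
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return v

def p_adic_abs(x: Fraction, p: int) -> Fraction:
    """|x|_p = p^{-v_p(x)}, with |0|_p = 0."""
    return Fraction(0) if x == 0 else Fraction(p) ** (-p_adic_valuation(x, p))

p = 5
print(p_adic_abs(Fraction(75), p))       # 75 = 5^2 * 3   ->  1/25
print(p_adic_abs(Fraction(7, 50), p))    # 7 / (2 * 5^2)  ->  25

# Non-Archimedean property (1): |x + y|_p <= max(|x|_p, |y|_p).
x, y = Fraction(75), Fraction(7, 50)
assert p_adic_abs(x + y, p) <= max(p_adic_abs(x, p), p_adic_abs(y, p))
```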

An absolute value induces a metric (and consequently a topology) on K through the distance function d(x, y) := |x − y|. It follows from properties (i-iii) above that (i) d(x, y) ≥ 0 ∀x, y ∈ K, with equality iff x = y; (ii) d(x, y) = d(y, x) ∀x, y ∈ K; (iii) d(x, z) ≤ d(x, y) + d(y, z) ∀x, y, z ∈ K. Property (iii) is called the (weak) triangle inequality. If | · | is non-Archimedean, then the induced metric satisfies the strong triangle inequality:

d(x, z) ≤ max{d(x, y), d(y, z)} ∀x, y, z ∈ K. (2)

Metrics satisfying (2) are called ultrametrics and the corresponding K is called an ultrametric space. Ultrametric spaces have the interesting property that every triangle is isosceles; more specifically, for any x, y, z ∈ K, d(x, y) ≠ d(y, z) implies d(x, z) = max{d(x, y), d(y, z)}.

We provide a simple proof in Appendix A.

Let us recall how the set of real numbers R is constructed by completing the set of rationals Q. Given an absolute value and its induced metric, we can define open balls and a notion of convergence. A field K is complete if any Cauchy sequence² in K converges to a limit point in K; for example, Q is not complete with respect to the usual absolute value.³ By filling the “holes” in Q with these limit points (the irrational numbers), we obtain the larger, complete field of real numbers R.

1 The name comes from the fact that Archimedean absolute values satisfy the Archimedean property: for any x, y ∈ K with x ≠ 0, there is an integer k ∈ Z such that |kx| > |y|, which is equivalent to the assertion that there are arbitrarily “big” integers, an observation which goes back to Archimedes. This does not happen with non-Archimedean absolute values, where (1) implies |kx| = |x + … + x| ≤ |x|.

2 A sequence of elements x_n ∈ K is a Cauchy sequence if for every ε > 0 there is M ∈ N such that m, n ≥ M imply |x_n − x_m| < ε.

3 For example, the sequence obtained by applying Newton’s method to find a solution of x² − 2 = 0, x_{n+1} = x_n/2 + 1/x_n (started from a rational point), is a Cauchy sequence of rationals whose limit, √2, is not in Q.

What happens if we follow this idea but use instead the p-adic absolute value | · |_p to complete Q?⁴ Q is also incomplete with respect to the metric d(x, y) := |x − y|_p. By adding the limits of Cauchy sequences in Q with respect to this metric, we obtain the field of p-adic numbers, denoted by Q_p. The subset Z_p := {x ∈ Q_p : |x|_p ≤ 1} ⊂ Q_p is called the set of p-adic integers and it has a ring structure.

Proposition 1 (p-adic expansion). Every x ∈ Z_p can be written uniquely as an infinite “digit” expansion in base p,

x = a_0 + a_1 p + a_2 p² + ⋯ = Σ_{i≥0} a_i p^i, (3)

where each a_i ∈ {0, …, p − 1}. When p is clear from the context, we abbreviate this as x = ⋯a_2 a_1 a_0. Every x ∈ Q_p can be written uniquely as

x = Σ_{i≥−m} a_i p^i,

where m ∈ Z, each a_i ∈ {0, …, p − 1}, and a_{−m} ≠ 0. Furthermore, we have |x|_p = p^m.

A pivotal difference between Q_p and R is that in Q_p the digit expansion is carried out “to the left” and not “to the right”. Moreover, the p-adic expansion (3) is unique, which is not the case with real numbers (e.g., 1.000… = 0.999… are two different expansions in base 10 of the same number). Addition and multiplication in Z_p can be performed using this p-adic expansion as we would normally do with Z, except quantities carry over infinitely “to the left”. The same holds in Q_p by accounting for the “decimal point”. Note that, with p = 5, the digit-wise addition 1/4 + 1/4 = 1/2 works when we carry over to the left. In other respects, however, Q_p behaves very differently from R.⁵ For example, unlike R, Q_p is not an ordered field, i.e., expressions such as x ≥ 0 or x ≤ y do not make sense when x, y ∈ Q_p. This requires a new approach to define p-adic binary classifiers, as we shall see in §3.

Example 7. The set of p-adic integers is a unit ball, Z_p = B_1(0). The choice of 0 as the center is arbitrary: we also have Z_p = B_1(x) for any x ∈ Z_p. We can decompose Z_p as the disjoint union of p smaller balls:

Z_p = ⋃_{i=0}^{p−1} (i + p Z_p) = B_{1/p}(0) ∪ B_{1/p}(1) ∪ ⋯ ∪ B_{1/p}(p − 1). (4)

This shows that Z_p has a hierarchical structure, as illustrated in Figure 2. Note that something similar happens with any ball of Q_p with radius p^n: it can be decomposed as a disjoint union of p smaller balls, each with radius p^{n−1}. Additional properties of p-adic balls are shown in Appendix B.

Example 8 (p-adic balls are strings). Any ball B_r(a) ⊆ Z_p with r > 0 can be identified with a string over an alphabet with p symbols. Namely, with r = p^{−n} and n ∈ N, any element of B_r(a) has a p-adic expansion of the form x = *a_{n−1}…a_0, where a_i is the i-th digit of a and the wildcard * denotes arbitrary symbols occurring to the left. Therefore, elements of Z_p may be regarded as “infinite strings” and balls of Z_p may be regarded as strings of finite length.
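The following sketch (ours, not the paper's) makes Proposition 1 and Example 8 concrete: it computes the first digits of the p-adic expansion of a rational lying in Z_p, reproduces the 1/4 + 1/4 = 1/2 example in Z_5, and shows that membership in a ball B_{p^{-n}}(a) amounts to sharing the first n digits.

```python
from fractions import Fraction

def p_adic_digits(x: Fraction, p: int, n: int) -> list[int]:
    """First n digits a_0, ..., a_{n-1} of the p-adic expansion of x in Z_p
    (the denominator of x must be coprime with p, i.e. |x|_p <= 1)."""
    assert x.denominator % p != 0, "x is not a p-adic integer"
    digits = []
    for _ in range(n):
        a = (x.numerator * pow(x.denominator, -1, p)) % p   # the unique digit with x ≡ a (mod p)
        digits.append(a)
        x = (x - a) / p
    return digits

p = 5
print(p_adic_digits(Fraction(1, 4), p, 6))   # [4, 3, 3, 3, 3, 3]  ->  1/4 = ...3334 in Z_5
print(p_adic_digits(Fraction(1, 2), p, 6))   # [3, 2, 2, 2, 2, 2]  ->  1/2 = ...2223 in Z_5

# Digit-wise addition with carries "to the left": 1/4 + 1/4 = 1/2 in Z_5.
assert p_adic_digits(Fraction(1, 4) + Fraction(1, 4), p, 6) == p_adic_digits(Fraction(1, 2), p, 6)

# Example 8: x lies in the ball B_{p^-n}(a) iff x and a share their first n digits.
a, x = Fraction(2), Fraction(2 + 5 ** 3)
print(p_adic_digits(a, p, 3) == p_adic_digits(x, p, 3))   # True: |x - a|_5 = 5^{-3}
```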

In Appendix C, we use the connection above between p-adic balls and strings to obtain a simple proof of Kraft’s inequality (Kraft, 1949), a key result in information theory which establishes a necessary and sufficient condition for a code to be a prefix code. (To the best of our knowledge, this proof based on p-adic balls is novel.)

Let d ∈ N and denote by Q_p^d the d-dimensional vector space over the field Q_p with the standard vector addition and scalar multiplication. In this section, we discuss p-adic binary classifiers, which take as input d-dimensional feature vectors x ∈ Q_p^d and predict outputs ŷ(x) ∈ {+1, −1}. Binary classifiers operating over R predict according to the rule ŷ(x) = sign(f(x)), where f : R^d → R is a discriminant function and sign(·) is the sign function. However, we cannot apply a similar prediction rule to p-adic classifiers with f : Q_p^d → Q_p, since Q_p is not an ordered field, and therefore we cannot use a sign function as above. Instead, we define a classification rule where the prediction is +1 iff the argument is in Z_p and −1 otherwise:

ŷ(x) = +1 if |f(x)|_p ≤ 1 (i.e., f(x) ∈ Z_p), and ŷ(x) = −1 otherwise. (5)

The next subsections address the unidimensional case (x ∈ Q p ) and then extend it to d ≥ 1 p-adic features (x ∈ Q d p ).

In the linear and unidimensional case (x ∈ Q_p), we use (5) with f(x) = wx + b, where w, b ∈ Q_p are model parameters. If w = 0, this becomes a trivial always-positive (b ∈ Z_p) or always-negative (b ∉ Z_p) classifier, so we focus on the case w ≠ 0. Since |wx + b|_p = |w|_p |x + b/w|_p, the decision rule (5) becomes

ŷ(x) = +1 iff |x − a|_p ≤ r, i.e., iff x ∈ B_r(a), (6)

where a = −b/w and r = 1/|w|_p. Therefore, a unidimensional p-adic binary linear classifier simply checks whether the input x lies within a p-adic ball. To train such a classifier, we need to find a ball that encloses as many positive training examples as possible and excludes most negative training examples. Perfect linear separation is possible iff there is a ball that perfectly separates one class from the other; this problem can be solved in O(n) time. In one-class classification problems, no negative examples are available, so a reasonable criterion is to search for the ball with the smallest radius which encloses the positive examples, which parallels similar objectives in Euclidean/Hilbert spaces over R (Nolan, 1991; Schölkopf et al., 1999). For p-adic classifiers, this is equivalent to finding the least common ancestor node of the positive examples. A full characterization, proved in Appendix D.1, is given below, illustrated in Figure 3, with a small code sketch of the construction following the proposition.

[Figure 3 caption (beginning missing): … (the left continuation is not important). We represent these points in a tree where each leaf represents the largest ball containing a single point and the nodes represent splitting points (edges are labeled with one or more symbols). To determine the classifier with the smallest misclassification error, we consider a ball rooted at each node (or leaf) and compute the training error associated with that ball. In this example, the optimal classifier corresponds to either of the balls B_{1/4}(2) = *10 or B_{1/8}(6) = *110, where the dashed edge is cut, leading to a training error of 1/8.]

Proposition 2. Let D = D^+ ∪ D^− be a training set with positive/negative examples D^+ = {x^(1), …, x^(m)} and D^− = {x^(m+1), …, x^(n)}. Suppose that m ≥ 2 and that D is linearly separable, i.e., there is (w, b) such that |wx^+ + b|_p ≤ 1 for all x^+ ∈ D^+ and |wx^− + b|_p > 1 for all x^− ∈ D^−. Then:

  1. Pick any x (i) ∈ D + and let x (j) ∈ arg max x+∈D+ |x + -x (i) | p be a maximally distant positive example. Then, w ⋆ := 1/(x (j) -x (i) ) and b ⋆ := -x (i) /(x (j) -x (i) ) parametrize a separating linear classifier. This classifier satisfies w ⋆ x (i) + b ⋆ = 0 and w ⋆ x (j) + b ⋆ = 1.

  2. The classifier (w ⋆ , b ⋆ ) above corresponds to the enclosing ball Br (a) with a = x (i) and r = |x (j) -x (i) | p -this is the (unique) minimal enclosing ball that contains D + .

  3. Let x^(k) ∈ arg min_{x^−∈D^−} |x^− − x^(i)|_p be a closest negative example (due to separability of D and ultrametricity, it is also a closest negative example to any point in D^+). Then, B_{r′}(a) with a = x^(i) and r′ = p^{−1}|x^(k) − x^(i)|_p is the (unique) maximal enclosing ball that contains D^+ and defines a separating linear classifier.

  4. Any ball with center in x (i) and radius in [r, r ′ ] is an enclosing ball defining a separating linear classifier. Any separating linear classifier is of this form.
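Below is a small numerical sketch of the construction in item 1 of Proposition 2; the toy data and function names are our own, and p_adic_abs is the helper defined in the first sketch above.

```python
from fractions import Fraction

# p_adic_abs: the p-adic absolute value helper from the first sketch.

def fit_separable_1d(pos, p):
    """Proposition 2, item 1 (sketch): given linearly separable data, the minimal
    enclosing ball of the positive examples yields a separating classifier (w, b)."""
    xi = pos[0]                                          # any positive example
    xj = max(pos, key=lambda x: p_adic_abs(x - xi, p))   # maximally distant positive example
    return 1 / (xj - xi), -xi / (xj - xi)                # w*, b*

def predict(w, b, x, p):
    return +1 if p_adic_abs(w * x + b, p) <= 1 else -1

# Toy data in Q_2: the positives all share the suffix ...10 (they are ≡ 2 mod 4).
p = 2
pos = [Fraction(2), Fraction(6), Fraction(10)]
neg = [Fraction(1), Fraction(3), Fraction(4)]
w, b = fit_separable_1d(pos, p)
print([predict(w, b, x, p) for x in pos + neg])          # [1, 1, 1, -1, -1, -1]
```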

If the problem is not separable, it is possible to find the best enclosing ball with respect to some loss function (e.g., misclassification rate) with a O(n) algorithm (see Figure 3):

  1. First, build a tree whose leaves represent the points in D + and whose nodes represent the balls containing subsets of these points (edges represent one or more digits, and only nodes with multiple children need to be considered).

  2. Then, examine each node and compute the loss value by counting how many positive and negative points are enclosed by the corresponding ball. Pick the best node.

It is interesting to compare this procedure with what happens in the real line (R) where this unidimensional problem is also tractable-one needs to examine what happens in the intervals delimited by consecutive points instead of reasoning about nodes.

We now consider the decision rule (5) with a nonlinear discriminant function f : Q_p → Q_p, such as a polynomial of degree s ≥ 1, f(x) = c ∏_{j=1}^{s} (x − a_j) for c, a_1, …, a_s ∈ Q_p. In this case (assuming c ≠ 0) the decision is ŷ(x) = +1 iff ∏_{j=1}^{s} |x − a_j|_p ≤ r, with r = 1/|c|_p. We show that in this case the positive region may be a union of p-adic balls.⁶

Proposition 3 (2nd order p-adic classifier). Let f(x) = c(x − a_1)(x − a_2), with ŷ(x) defined as in (5). Let |a_1 − a_2|_p = p^{−k_12} and r = 1/|c|_p = p^{−k} for some k_12, k ∈ Z. Then:

  1. If k > 2k_12, ŷ(x) = +1 iff x ∈ B_{p^{−k+k_12}}(a_1) ∪ B_{p^{−k+k_12}}(a_2), i.e., the positive region is the disjoint union of two balls with the same radius.

  2. If k ≤ 2k_12, ŷ(x) = +1 iff x ∈ B_{p^{−⌈k/2⌉}}(a_1) = B_{p^{−⌈k/2⌉}}(a_2), i.e., the positive region is a single ball and the classifier is equivalent to a linear classifier.

The proof is in Appendix D.2.⁷

Example 9. Consider, with p = 2, the classifier f(x) = c(x − a_1)(x − a_2) with a_1 = 1, a_2 = 5, and r = 1/|c|_2 = 2^{−6}. We have k = 6 and k_12 = 2. Since k > 2k_12 and −k + k_12 = −4, the positive region is B_{1/16}(1) ∪ B_{1/16}(5), which consists of the numbers of the form *0001 or *0101. See Figure 4.
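As a quick check of Proposition 3 and Example 9 (with the centers a_1 = 1 and a_2 = 5 read off from the stated balls, and c = 1/64 as one concrete choice with |c|_2 = 2^6), the following brute-force loop confirms that the positive region consists exactly of the integers congruent to 1 or 5 modulo 16; it reuses p_adic_abs from the first sketch.

```python
from fractions import Fraction

# p_adic_abs: the p-adic absolute value helper from the first sketch.

p = 2
c, a1, a2 = Fraction(1, 64), Fraction(1), Fraction(5)    # so that k = 6 and k12 = 2

def predict_quadratic(x):
    return +1 if p_adic_abs(c * (x - a1) * (x - a2), p) <= 1 else -1

for x in range(64):
    in_balls = x % 16 in (1, 5)                          # B_{1/16}(1) ∪ B_{1/16}(5)
    assert predict_quadratic(Fraction(x)) == (+1 if in_balls else -1)
print("positive region = {x : x ≡ 1 or 5 (mod 16)}")
```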

Let now d ≥ 1. The input space is now a d-dimensional vector space over Q_p, which we denote by Q_p^d. We extend the framework of §3.1 to this scenario as follows.⁸ We assume ŷ(x) = +1 if |w^⊤x + b|_p ≤ 1 and ŷ(x) = −1 otherwise, where w ∈ Q_p^d and b ∈ Q_p are the model parameters.

The next result, proved in Appendix D.3, shows that some problems which cannot be solved by real linear classifiers, such as XOR or parity problems, which Minsky and Papert (1988) have shown cannot be solved by “finite order perceptrons”, or counting problems, which require “second-order perceptrons”, are easy for p-adic linear classifiers, and vice versa: count thresholding problems are trivial for real linear classifiers but cannot be solved directly by p-adic linear classifiers.

Proposition 4 (Properties of Q_p-linear classifiers).

  1. For d = 2 and any prime p, classifiers in H_{Q_p^d} can compute any Boolean function. This includes XOR (which H_{R^d} cannot solve).

  2. For d ≥ 2 and any prime p, classifiers in H_{Q_p^d} can solve congruence problems modulo p^n (which include parity checks) and counting problems, which classifiers in H_{R^d} cannot solve, but they cannot solve count thresholding problems, which are solvable by R-linear classifiers.

Finally, the next proposition, proved in Appendix D.4, generalizes our previous result for the unidimensional case (Proposition 2).

the “positive region” according to this classifier. Then:

8 An alternative way to extend §3.1 which deserves consideration would be to construct a ball-like decision rule similar to (6) by defining balls in Q_p^d, which can be done by defining a norm on Q_p^d. An appealing choice is the supremum norm, ∥x∥_p := max_{1≤i≤d} |x_i|_p, which endows Q_p^d with the structure of an ultrametric space, keeping the same non-Archimedean flavor as Q_p. Unfortunately, classifiers using the rule ∥x − a∥_p = max_i |x_i − a_i|_p ≤ r seem quite restrictive: they require x_i ∈ B_r(a_i) for each i, so they are similar to a conjunction of binary classifiers applied to each feature, all constrained to have the same radius. We therefore opted to follow the construction presented in this section based on linear sums.

The second item in Proposition 5 requires as a condition that the classifier is tight. Note, however, that this is really not an additional condition-for any classifier which classifies correctly all points in D + , we can always multiply w and b by some scalar λ with |λ| p ≥ 1 such that the classifier becomes tight. Note also that this property is not analogous to anything similar for the real numbers, not even for d = 1.

The previous section established what can and cannot be learned by a d-dimensional Q_p-linear classifier. We now describe an approximate algorithm to learn a classifier from data. The next two remarks describe equivalent reformulations of the learning problem that simplify this search.

Remark 1. We can assume all inputs are p-adic integers, x^(j) ∈ Z_p^d for all j ∈ [n], without loss of generality. To see why, assume (w, b) ∈ Q_p^{d+1}; let i correspond to the example where ∥x^(i)∥_p is largest and let p^m denote this quantity. Then, define x^(j)′ := p^m x^(j) ∈ Z_p^d for all j ∈ [n], and let w′ := p^{−m} w. We have that |w′^⊤ x^(j)′ + b|_p = |w^⊤ x^(j) + b|_p. Therefore, we have an equivalent classification problem where all data consist of p-adic integers, and the weight vector w is scaled.

Remark 2. There is also a formulation equivalent to (5) which ensures (w, b) ∈ Z_p^{d+1}, so that learning can be reduced to a search over the p-adic integers.

Without loss of generality, we assume that inputs are (d + 1)-dimensional and have a constant feature, x_{d+1} = 1, so that we can drop the bias parameter. Algorithm 1 shows a simple beam search algorithm for training a linear p-adic classifier. The algorithm assumes that all inputs x^(i) are in Z_p^{d+1}, which can be ensured with the preprocessing in Remark 1. We denote the number of positive and negative errors as ε_+(w) := |{i : y^(i) = +1, ŷ(x^(i); w) = −1}| and ε_−(w) := |{i : y^(i) = −1, ŷ(x^(i); w) = +1}|, respectively. Each node n with depth δ in the search tree has an associated weight vector n.w ∈ Z_p^{d+1}, where only the δ most significant digits matter. For any descendant node, the number of positive errors ε_+ can only increase and the number of negative errors ε_− can only decrease. Therefore, ε_+ is a lower bound for the number of mistakes achieved by any descendant node. The worst-case runtime complexity of Algorithm 1 is O(k δ_max n p^{d+1}), which is linear in the dataset size but grows exponentially fast with the number of features d.

In §5, we experiment with this algorithm to learn a Q_p-linear classifier for solving logical inference problems.

[Algorithm 1 (fragment): at each depth δ, keep the k nodes with fewest mistakes ε_+ + ε_−; if there is a node n in the beam A with at most ε_max mistakes, return w := n.w/p^δ; otherwise continue until δ = δ_max.]
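Below is a rough sketch of the beam search just described, under one possible reading of Algorithm 1 (which is not fully reproduced here): a node at depth δ fixes the first δ digits of each weight coordinate, predictions are made with the induced classifier w = u/p^δ, and the beam keeps the candidates with fewest mistakes. The parameter names, the scoring details, and the stopping rule are our choices and need not match the paper's pseudocode exactly.

```python
from fractions import Fraction
from itertools import product

def residue(x: Fraction, modulus: int) -> int:
    """x mod p^delta for a p-adic integer given as a rational with denominator coprime to p."""
    return (x.numerator * pow(x.denominator, -1, modulus)) % modulus

def beam_search_classifier(X, y, p, beam=8, max_depth=6, max_mistakes=0):
    """Beam search over the digits of an integer weight vector u (one digit per coordinate
    and per depth).  A node at depth delta induces the classifier w = u / p^delta, i.e.
    y_hat(x) = +1 iff u.x ≡ 0 (mod p^delta); X must contain p-adic integers (Fractions)
    with a constant last feature, and y labels in {+1, -1}."""
    D = len(X[0])
    nodes = [tuple(0 for _ in range(D))]
    best_w, best_err = None, None
    for depth in range(1, max_depth + 1):
        modulus = p ** depth
        scored = []
        for u in nodes:
            for digits in product(range(p), repeat=D):           # p^D children per node
                v = tuple(u[i] + digits[i] * p ** (depth - 1) for i in range(D))
                mistakes = 0
                for x, label in zip(X, y):
                    score = sum(v[i] * residue(x[i], modulus) for i in range(D)) % modulus
                    mistakes += ((+1 if score == 0 else -1) != label)
                scored.append((mistakes, v))
        scored.sort(key=lambda item: item[0])
        nodes = [v for _, v in scored[:beam]]                     # keep the `beam` best nodes
        best_err, best = scored[0]
        best_w = [Fraction(best[i], modulus) for i in range(D)]
        if best_err <= max_mistakes:                              # early stop, as in Algorithm 1
            break
    return best_w, best_err

# Example: learn XOR over {0,1}^2 (with a constant third feature) for p = 2.
p = 2
X = [[Fraction(a), Fraction(b), Fraction(1)] for a in (0, 1) for b in (0, 1)]
y = [+1 if x[0] + x[1] == 1 else -1 for x in X]
print(beam_search_classifier(X, y, p))   # e.g. ([Fraction(1, 2)] * 3, 0)
```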

In linear regression, the goal is to estimate a target y ∈ Q_p with ŷ(x) = w^⊤x + b. Given the relation between p-adic numbers and strings (Example 8), we may informally think of p-adic regression as a sequence-to-sequence problem. Given a dataset D = {(x^(i), y^(i))}_{i=1}^n, we formulate the problem as that of finding w and b that minimize the sup-norm of the residuals,

min_{w,b} max_{i∈[n]} |w^⊤ x^(i) + b − y^(i)|_p.

Consider first the unidimensional case x ∈ Q_p. The following result shows that this problem can be solved efficiently.

Proposition 6. Assume that x^(i) ≠ x^(j) for any i ≠ j. Then, there is an optimal solution of the form w⋆ = (y^(j) − y^(k))/(x^(j) − x^(k)) and b⋆ = y^(k) − w⋆ x^(k) for suitable indices j and k, and such a solution can be found efficiently. (Note that we also have b⋆ = y^(j) − w⋆ x^(j).)

The proof is in Appendix D.5. Note that the solution (w ⋆ , b ⋆ ) above satisfies w ⋆ x (j) + b = y (j) and w ⋆ x (k) + b = y (k) , i.e., there is an optimal solution that passes through the points j and k. This is very different from linear regression with the real numbers, and is a consequence of the ultrametric property of the p-adics. We will next see that this property holds also for the multidimensional case (d ≥ 1).

Assume now d ≥ 1 and the overdetermined case n ≥ d + 1. As in §3.4, we can eliminate the bias parameter by appending a constant feature x^(i)_{d+1} = 1 to every input.

Stacking the inputs as the rows of a design matrix X ∈ Q_p^{n×(d+1)} and the targets into a vector y ∈ Q_p^n, the sup-norm can be written as ∥Xw − y∥_p, which we want to minimize with respect to w ∈ Q_p^{d+1}. The next result, proved in Appendix D.6, shows that, under a suitable invertibility condition,⁹ there is an optimal solution which is exemplar-based, as in §4.1.

Proposition 7. Let X ∈ Q_p^{n×(d+1)} be the design matrix with n ≥ d + 1, and assume that all its (d + 1)-by-(d + 1) submatrices are invertible. Assume also that the original design matrix X̄ ∈ Q_p^{n×d} before bias augmentation (i.e., X̄ := X_{1:n,1:d}) has all its d-by-d submatrices invertible. Then, there is an optimal solution w⋆ ∈ arg min_w ∥Xw − y∥_p passing through d + 1 points.

It should be noted that the conditions of the proposition forbid us (among other things) from having x̄^(i) = 0 as an input, as this would lead to non-invertible submatrices. A consequence of Proposition 7 is that, under the stated assumptions, we can find an optimal w⋆ by enumerating all combinations of d + 1 out of the n training examples, solving for each combination a linear p-adic system (which is guaranteed to have a solution since we assumed that all (d + 1)-by-(d + 1) submatrices of X are invertible), and then picking the combination whose weight vector leads to the smallest sup-norm. This algorithm runs in time O(n^{d+1}).
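The enumeration procedure just described can be sketched directly, since for rational training data the p-adic arithmetic can be carried out exactly with Fractions: solve the (d+1)-by-(d+1) system for every subset of d+1 points and keep the weights with the smallest sup-norm of residuals. The helper names are ours, p_adic_abs is the function from the first sketch, and, as in Proposition 7, the code assumes every selected submatrix is invertible.

```python
from fractions import Fraction
from itertools import combinations

# p_adic_abs: the p-adic absolute value helper from the first sketch.

def solve_exact(A, b):
    """Gauss-Jordan elimination over Q with exact Fractions (A assumed invertible)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [a - M[r][col] * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def padic_regression(X, y, p):
    """Enumerate all subsets of d+1 points, fit each exactly, and keep the weight
    vector with the smallest sup-norm of the p-adic residuals (cf. Proposition 7)."""
    n, D = len(X), len(X[0])                  # D = d + 1 (constant feature already appended)
    best_w, best_loss = None, None
    for idx in combinations(range(n), D):
        w = solve_exact([X[i] for i in idx], [y[i] for i in idx])
        loss = max(p_adic_abs(sum(wi * xi for wi, xi in zip(w, X[i])) - y[i], p)
                   for i in range(n))
        if best_loss is None or loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy example in Q_3 with d = 1 and a constant feature appended.
p = 3
X = [[Fraction(x), Fraction(1)] for x in (1, 2, 4, 10)]
y = [Fraction(v) for v in (1, 2, 4, 1)]
print(padic_regression(X, y, p))              # an optimal fit passes through 2 of the 4 points
```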

We now look at representations (“embeddings”) in Q d p . One reason why neural networks are so effective comes from their ability to learn internal representations, by mapping examples to points in R d . What happens in the p-adic space?

Consider first the unidimensional case Q_p (a single “embedding dimension”) and let us reason about the embedding of n input examples {x^(i)}_{i=1}^n in this space. Since Q_p has a hierarchical structure (Figure 2), we can associate to each x^(i) the largest ball B_{r_i}(x^(i)) that contains this example and no other examples. The different balls are arranged hierarchically and can be represented as a finite tree whose leaves correspond to each of the n input examples. For example, if p = 2, we obtain a binary tree where each ball B_{r_i}(x^(i)) corresponds to a bit string; this is similar to Brown clusters (Brown et al., 1992), a popular representation technique in natural language processing. Therefore, we can think of Brown clusters as p-adic representations which contrast with continuous, real-vector representations.

Quillian’s semantic networks. In the previous example, all concepts correspond to leaves in a tree. It is appealing to think about internal nodes higher up in the tree as representing “more general concepts” associated with larger balls in Q_p. These larger balls enclose smaller balls (more specific concepts), forming a nested structure. Consider as an example the semantic network of Figure 5 (top), corresponding to a simple hierarchical propositional model (Quillian, 1968). Rumelhart (1990) and McClelland et al. (1995) built a simple neural network with two hidden layers (Figure 5, bottom left) which is able to answer queries associated with this semantic network, through the embedding of concepts in R^6. We next construct a more compact linear classifier with p-adic representations (with p = 2) that encodes all the propositions of this semantic network (Figure 5, bottom right). Our p-adic classifier has a similar structure to the network in Rumelhart (1990) but only a single hidden layer (to encode the p-adic representations of concepts) instead of two, and it does not have any non-linear activations. It is a composition of linear functions and is therefore still a linear classifier. The advantage of including the hidden representation layer is threefold: (i) it makes the representation of the concepts explicit, (ii) it reduces dimensionality, and (iii) it allows using a shared representation space to perform multiple tasks.

[Figure 5 caption: Top: semantic network of Quillian (1968). Bottom left: neural network developed by Rumelhart (1990) and McClelland et al. (1995) to answer queries using this semantic model; given the active inputs “robin can”, the network produces the completion “grow move fly”; note the two (non-linear) hidden layers, the first of which embeds input entities onto R^6. Bottom right: a linear p-adic network with a single embedding dimension (Q_p) which solves the same problem (color and leaves attributes are excluded for simplicity).]

We first show the representations and weights of the network without the “is green/red/yellow” attribute in the leaves-in this case, a unidimensional p-adic representation space Q p turns out to be sufficient. Then, we extend the network to include the “is green/red/yellow” attribute by using a two-dimensional p-adic representation space (Q 2 p ). Finally, we consider a new attribute, “has leaves”, which defies the hierarchical structure of the semantic network: all plants have leaves except the pine (which has needles instead of leaves). This exception can be accommodated by using a three-dimensional p-adic representation space (Q 3 p ). In this construction, we always use p = 2.

We first construct the model ignoring the color attributes (“is green/red/yellow”). By looking at Figure 5 (top) we can see that, with the exception of those attributes, all others are associated to a single node of the semantic tree. We also see that each attribute is associated only to a single relation (e.g., “grow” is associated to can but not to ISA, is, or has). Our strategy is to (i) place the entity/concept representations in a p-adic tree with the same structure as the semantic network in Figure 5 and (ii) use a binary linear classifier for each attribute associated to a specific node of the tree.

Our network has the form in Figure 5 (bottom right). We have two sets of inputs: 15 entities/concepts (living thing, plant, animal, …, sunfish, salmon) and 4 relations (ISA, is, can, has). Like Rumelhart (1990), we use one-hot encodings for both entities/concepts and relations. Entities/concepts are then fed into an embedding layer in Q_p with no bias parameter. This effectively means that they receive embeddings x_1, …, x_15, where each x_i ∈ Q_p. Relations j use one-hot encodings directly. The representations of entities/concepts and relations are concatenated; therefore, when entity i and relation j are present in the input, this leads to a 5-dimensional vector [x_i, 0, …, 1, …, 0] ∈ Q_p^5, with the 1 in the position of relation j.

These representations are shared across all attributes, and each attribute k is associated to a binary classifier with 5-dimensional weights [w_k, v_k1, …, v_k4] and a bias parameter, which we set to zero. The output layer predicts

ŷ_k = +1 if |w_k x_i + v_kj|_p ≤ 1, and ŷ_k = −1 otherwise, (7)

where (see (6)) the positive condition is equivalent to x_i ∈ B_{1/|w_k|_p}(−v_kj/w_k). By examining the structure of the semantic network, it is straightforward to obtain parameters x_i, w_k, and v_kj that lead to the intended classifiers for each attribute k. We start by choosing representations x_i ∈ Q_p that are compatible with the entities/concepts in Figure 5 (bottom left); the representations in Table 1 satisfy this requirement. Then, for any attribute k, we consider the (unique) relation j associated with that attribute. We first pick w_k such that 1/|w_k|_p is at the correct level of the tree, following the right hand side of (7): e.g., the attribute-relation pair “ISA bird” should encompass the entities/concepts bird, robin, and canary, whose common prefix is ⋯0010, therefore we need 1/|w_k|_p = 2^{−4}, which is satisfied by w_k = 2^{−4} = 1/16. Then, we pick v_kj such that −v_kj/w_k is a center of the desired ball: for example, for the attribute-relation pair “ISA bird” any center prefixed by ⋯0010 will do. A possible such center is 2, which leads to v_kj = −2 w_k = −2 × 2^{−4} = −1/8. Finally, for attribute-relation pairs that are incompatible we choose a ball disjoint from all the entities/concepts. This leads to the parameters shown in Table 2. This choice of parameters ensures correct classification for all choices of entities/concepts and relations in the input.
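As a sanity check of the “ISA bird” classifier derived above (w_k = 1/16, v_kj = −1/8), the snippet below verifies that it accepts exactly the p-adic integers whose expansion ends in 0010, i.e., x ≡ 2 (mod 16). Since Table 1 is not reproduced here, the concrete entity representations used in the test are hypothetical stand-ins, and p_adic_abs is the helper from the first sketch.

```python
from fractions import Fraction

# p_adic_abs: the p-adic absolute value helper from the first sketch.

p = 2
w_k, v_kj = Fraction(1, 16), Fraction(-1, 8)   # "ISA bird": the ball B_{1/16}(2), see (7)

def isa_bird(x: Fraction) -> int:
    return +1 if p_adic_abs(w_k * x + v_kj, p) <= 1 else -1

# Hypothetical representations: entities whose expansion ends in 0010 (x ≡ 2 mod 16),
# e.g. bird/robin/canary-like codes, versus entities with other suffixes.
for x, expected in [(2, +1), (18, +1), (34, +1), (3, -1), (4, -1), (20, -1)]:
    assert isa_bird(Fraction(x)) == expected
```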

Adding colors. The strategy described above fails when we add color attributes, since the red and yellow color attributes are linked to multiple nodes (red is linked to rose, robin, and salmon, and yellow is linked to daisy, canary, and sunfish). This can be solved by adding a second embedding dimension x′_i corresponding to the “color” feature (rightmost columns of Table 1). This requires organizing the entities/concepts according to a different hierarchy associated with the colors; fortunately, multidimensional Q_p-linear classifiers allow representing multiple hierarchies. Now the input representations will be 6-dimensional vectors [x_i, x′_i, 0, …, 1, …, 0] ∈ Q_p^6 and each attribute k is associated to a binary classifier with 6-dimensional weights [w_k, w′_k, v_k1, …, v_k4] and a zero bias. If the attribute k is not a color, we set w′_k = 0 and the classification rule is exactly as in (7). If the attribute k is a color (green, red, or yellow), we set w_k = 0, which leads to

ŷ_k = +1 if |w′_k x′_i + v_kj|_p ≤ 1, and ŷ_k = −1 otherwise.

The same logic as above works here, leading to the weights w ′ k and v kj in Table 2.

Handling leaves. As mentioned above, almost all plants have leaves, except the pine, which has needles instead of leaves. How can we accommodate this exception? It is of course possible to follow the same reasoning as when we added colors: append one extra dimension to the representation space, and associate the relation/attribute pair “has leaves” to all entities/concepts which have leaves: flower, oak, rose, and daisy. However, we describe a different strategy which shows how we can model exceptions to a rule. We start by encoding the (erroneous) rule that all plants have leaves, which can be easily done by adding to Table 2 a new entry “has leaves” with the same weights as the entry for “has roots” (w_k = 1/4, w′_k = 0, and v_kj = 0). Now we need to create exceptions for this rule: since a pine has no leaves, we can no longer say that any of its ancestors has leaves; therefore we have to create exceptions for the entities/concepts pine, tree, and plant. This can be done by adding a third dimension to the representation space, call it x″_i, which is 0 for entities/concepts which are neither pine, tree, nor plant, and 1 for any of those entities/concepts. Next we choose a weight w″_k which multiplies this new feature and which is 0 for all attributes except “leaves”; this has no effect on entities outside the exception list, but changes the decision rule for pine, tree, and plant to

ŷ_k = +1 if |w_k x_i + w″_k x″_i + v_kj|_p ≤ 1, and ŷ_k = −1 otherwise,

when j, k correspond to “has leaves”. Since by design all these entities i (pine, tree, and plant) satisfy w_k x_i + v_kj ∈ Z_p, any w″_k ∉ Z_p (e.g., w″_k = 1/2) creates the desired exceptions.

[Table 2 caption: Weights w_k and v_kj for each relation j and attribute k. Each relation/attribute pair corresponds to a 2-adic ball with radius 1/|w_k| and with −v_kj/w_k as a center. For example, “ISA bird” corresponds to the ball B_{2^{−4}}(2), which corresponds to p-adic digit expansions ⋯0010, containing the representations of bird, robin, and canary (see Table 1). Attributes k which are not compatible with a relation j (e.g., “ISA grow”) have weights v_kj = −w_k/2 (not shown in the table); therefore they correspond to balls with center 1/2 and radius 1/|w_k|, and since 1/|w_k| ≤ 1 for all attributes k in the table, these balls contain only elements outside Z_2 and therefore none of the entities in Table 1. Color attributes are denoted in blue, and for those we have w_k = 0 and the shown value is w′_k. For non-color attributes we have the opposite, w′_k = 0 and the shown value is w_k.]

Experiment. We run a simple synthetic experiment using the semantic network above as follows. We generate all possible 1,680 propositions combining the 15 entities, 4 relations, and 28 attributes. We compare a neural network reproducing Rumelhart (1990), with 6-dimensional embeddings and a 15-dimensional hidden layer (690 real-valued parameters), against a linear p-adic network (“p-adic linear”) with 3-dimensional embeddings (241 Q_p parameters) chosen as in Table 1, both illustrated in Figure 5. We train the neural network with gradient backpropagation using the Adam optimizer (learning rate 0.1), and the linear p-adic classifier with Algorithm 1. We run 10 trials with different random initializations for the former and a single trial for the latter, since it is deterministic. Both models managed to overfit the full training set (in all trials) with zero error rate. Figure 6 shows the results for the train/test splits with different training set sizes. We observe that the two models generalize similarly to the test set.

[Figure 6: experiment comparing the learned neural network and the linear p-adic network for different training set sizes.]

This paper only touches the surface of how one might perform machine learning in p-adic spaces by establishing basic theoretical results. To find out whether this framework might be practically useful, many challenges have yet to be overcome. I summarize below some open problems and suggest possible paths to research them.

In §3 we addressed only one-class and binary classification problems. In multi-class problems, we can have K ≥ 2 classes. While any multi-class problem can be reduced to a combination of binary problems (e.g. through one-against-all or one-against-one schemes), it is interesting to try to derive a native framework for p-adic multi-class classification.

We could define a weight vector w_k ∈ Q_p^d and bias b_k ∈ Q_p for each output class k ∈ [K] and have a decision rule like

ŷ(x) = arg min_{k∈[K]} |w_k^⊤ x + b_k|_p.

A potential drawback with this approach is that ties are very likely, due to the discrete nature of p-adic norms. Let us think about the unidimensional case where d = 1 and assume |w_k|_p = 1 for all k to simplify. In this case, by defining “class centroids” a_k := −b_k/w_k, the problem becomes that of returning the nearest centroid for a given x. The “Voronoi cells” associated to each class are arranged hierarchically as a dendrogram, and there might be ties since the space is ultrametric. Note that not all ties are permitted: there can only be a tie involving two classes if any other classes that are more similar to either of the two are also in the tie.

While we provide efficient classification and regression algorithms for the unidimensional case, for d > 1 the learning algorithms presented here (Algorithm 1 for p-adic classification and the algorithm sketched in §4.2 for p-adic regression) should be seen merely as a proof of concept, since they are not practical: they take exponential runtime with respect to the number of features d. Finding better algorithms is an open problem, on which the practical relevance of p-adic predictors strongly depends. One open question is whether the tools of p-adic calculus can lead to better learning algorithms; indeed, many of the tools from real analysis are also available for the p-adic numbers, including formal power series, continuity, and differentiation (Koblitz, 1984; Robert, 2000). However, doing this requires overcoming several roadblocks: while Newton’s algorithm works in the p-adics to find roots of p-adic functions based on formal derivatives, p-adic calculus is fundamentally different from real calculus: e.g., a function with zero derivative everywhere might not be constant in Q_p. Furthermore, the classical idea in machine learning of optimizing a loss function to fit a model to the training data does not work in a straightforward way in the p-adics: since Q_p is not an ordered field, “optimizing” a p-adic valued loss is meaningless. In our regression formulation in §4, we bypassed this by optimizing a p-adic norm, which is Q-valued. It is likely that many interesting p-adic learning problems can be mathematically formulated as p-adic min-norm problems, a topic which deserves further investigation.

We should note, however, that a cornerstone of p-adic analysis is Hensel’s lemma, an analogue of Newton’s method for root finding which has strong convergence properties. This suggests that a possible path for learning p-adic predictors is to ingeniously design a function F leading to a p-adic equation F(θ; D) = 0, where θ ∈ Q_p^d are the model parameters and D is the training data, in such a way that a solution of this equation corresponds to an “optimal” model configuration in some sense. In machine learning models over the reals one would choose F(θ; D) = ∇L(θ; D) for a differentiable loss function L, but in the p-adic case it would be convenient to work directly with F, bypassing the loss function. Finding a root of F through the p-adic Newton’s algorithm, by generalizing Hensel’s lifting lemma (Gouvêa, 2020, §4) to Q_p^d, would likely lead to very efficient algorithms.

In this paper, we have resorted to linear predictors (with the exception of the higher-order classifiers covered in §3.2). However, in machine learning over the reals, multi-layer networks can be much more expressive (Hornik et al., 1989). We naturally expect the same to happen with “p-adic neural networks”. For example, while we have shown in Proposition 4 that linear p-adic classifiers cannot solve count thresholding problems, it is easy to construct a single-hidden-layer p-adic neural network which solves such problems: we can simply form a first layer of d − k + 1 p-adic classifiers solving exact counting problems for i = k, k + 1, …, d (using the construction presented in Appendix D.3) and then, noting that at most one of these d − k + 1 classifiers returns +1 (the others must return −1), append a top layer with a single exact-one classifier with weights w_1 = … = w_{d−k+1} = p^{−n} and bias b = (d − k − 1) p^{−n}, with n = ⌈log_p(d − k + 1)⌉; the resulting multi-layer system returns +1 if at least k inputs are “true” and −1 otherwise.
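The two-layer construction above can be checked by brute force on a small instance. The sketch below instantiates it for p = 3, d = 4, and k = 2 (our choice of toy parameters), builds the first layer of exact-count classifiers as in Appendix D.3 and the exact-one top layer with the weights given above, and verifies the threshold behaviour on all 2^d binary inputs; p_adic_abs is the helper from the first sketch.

```python
from fractions import Fraction
from itertools import product

# p_adic_abs: the p-adic absolute value helper from the first sketch.

p, d, k = 3, 4, 2                # toy instance: "+1 iff at least k of the d inputs are true"

def ceil_log(m: int, base: int) -> int:
    """Smallest n with base**n >= m."""
    n, v = 0, 1
    while v < m:
        v, n = v * base, n + 1
    return n

def exact_count_classifier(c):
    """First layer (construction of Appendix D.3): +1 iff exactly c inputs are active."""
    n = ceil_log(1 + max(c, d - c), p)
    return [Fraction(1, p ** n)] * d, Fraction(-c, p ** n)

def linear_predict(w, b, x):
    return +1 if p_adic_abs(sum(wi * xi for wi, xi in zip(w, x)) + b, p) <= 1 else -1

# Top layer: "exact-one" classifier over the m = d - k + 1 first-layer outputs (each in {+1, -1}).
m = d - k + 1
n_top = ceil_log(d - k + 1, p)
w_top = [Fraction(1, p ** n_top)] * m
b_top = Fraction(d - k - 1, p ** n_top)

layer1 = [exact_count_classifier(c) for c in range(k, d + 1)]
for bits in product((0, 1), repeat=d):
    x = [Fraction(v) for v in bits]
    z = [Fraction(linear_predict(w, b, x)) for w, b in layer1]
    assert linear_predict(w_top, b_top, z) == (+1 if sum(bits) >= k else -1)
print("the two-layer p-adic network solves the count thresholding problem")
```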

To make progress in this research question, we need to seek suitable non-linearities and learning criteria.

Our whole paper assumes some prime p is fixed from which a non-Archimedean norm | • | p is constructed. However, which p should be chosen? Can we build predictors which use multiple primes or even all primes simultaneously?

The question above has a similar flavor to the so-called Hasse local-global principle (Gouvêa, 2020, §4.8), a cornerstone of number theory. The idea behind this principle is to try to answer a “global” complex question in Q (e.g., finding a rational solution of a Diophantine equation) by working simultaneously at all “local” completions, i.e., Q_p for each prime p as well as R. In fact, it is a consequence of Ostrowski’s theorem (Gouvêa, 2020, Theorem 3.1.4) that any completion of Q has one of these forms: every non-trivial absolute value on Q is equivalent (in the sense of inducing the same topology) to one of the non-Archimedean p-adic absolute values | · |_p (Example 3) or to the usual Archimedean absolute value | · |_∞ (Example 2). Mathematicians often use p = ∞ (the “prime at infinity”) to index the set of real numbers, denoting Q_∞ = R. We can thus express all completions of Q as the set {Q_p : p ≤ ∞}. An example of a local-global result is the product formula (Gouvêa, 2020, Proposition 3.1.5), which states that, for any x ∈ Q \ {0},

∏_{p ≤ ∞} |x|_p = 1. (8)

(This formula is very easy to prove using the fundamental theorem of arithmetic.) I next sketch a path for adelic classification taking inspiration from this principle. Let us think about an “ensemble” of p-adic classifiers for all primes p, including also a “real classifier”, denoted as p = ∞, whose decision rule, given an input x ∈ Q^d, is

ŷ(x) = +1 if ∏_{p ≤ ∞} |w^{(p)⊤} x + b^{(p)}|_p ≤ 1, and ŷ(x) = −1 otherwise. (9)

We can think of this decision rule as letting all p-adic classifiers (one for each p) vote through their corresponding absolute value, using their specific parameters w^(p) ∈ Q_p^d and b^(p) ∈ Q_p, and then collecting all the votes to produce the final decision. From the product formula (8), if the parameters are rational and the same for all p (i.e., w^(p) = w ∈ Q^d and b^(p) = b ∈ Q) and w^⊤x + b ≠ 0, then ∏_{p≤∞} |w^⊤x + b|_p = 1, i.e., all the predictions lie at the decision boundary. On the other hand, if we assume that only finitely many p in a set P contribute to the voting and all the other p ∉ P have w^(p) = 0 and b^(p) = 1, we can think of algorithms which progressively add new primes p to improve model-data fit until some stopping criterion is met.

To parametrize the classifier (9) we need to define w = (w^(p))_{p≤∞} and b = (b^(p))_{p≤∞}; these objects are called adeles (Goldfeld and Hundley, 2011). Formally, the ring of adeles A_Q is defined (with elementwise addition and multiplication) as

A_Q := ∏′_{p ≤ ∞} Q_p,

where ∏′ denotes the restricted product (rather than the Cartesian product), which requires all but finitely many entries of z = (z^(p))_{p≤∞} ∈ A_Q to be p-adic integers. Note that, for any z ∈ A_Q, the restricted product ensures that [z] := ∏_{p≤∞} |z^(p)|_p converges, which enables its use in (9). Note also that any rational number q can be “seen” as an adele z = (q, q, …) with all entries constant, i.e., we have an injection Q → A_Q. From the product formula (8), adeles corresponding to these rational numbers have [z] = 1. Also, any a ∈ Q_p can be seen as an adele z with z^(p) = a and z^(p′) = 1 for p′ ≠ p, for which we have [z] = |a|_p (this works also for p = ∞, i.e., real numbers). Since adeles form a ring, for each input x ∈ Q^d the quantity w^⊤x + b is itself an adele, and the rule (9) can be written compactly as ŷ(x) = +1 if [w^⊤x + b] ≤ 1 and ŷ(x) = −1 otherwise. Note that the p-adic classifiers defined in §3 are a particular case of this construction resulting from setting w^(p) = 0 and b^(p) = 1 for all but a particular p. We can also define adelic regression by considering residuals [w^⊤x + b − y], which are also a generalization of the p-adic regression framework presented in §4.
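As a small numerical sanity check of the product formula (8) that underlies this construction, the sketch below multiplies |x|_∞ by |x|_p over the finitely many primes p dividing the numerator or denominator of a rational x (for all other primes |x|_p = 1); the helper names are ours and p_adic_abs is the function from the first sketch.

```python
from fractions import Fraction

# p_adic_abs: the p-adic absolute value helper from the first sketch.

def prime_factors(n: int) -> set[int]:
    n, factors, q = abs(n), set(), 2
    while q * q <= n:
        while n % q == 0:
            factors.add(q)
            n //= q
        q += 1
    if n > 1:
        factors.add(n)
    return factors

def product_formula(x: Fraction) -> Fraction:
    """|x|_infty * prod_p |x|_p, taking the product only over primes dividing the
    numerator or denominator of x (for every other prime, |x|_p = 1)."""
    result = abs(x)                                   # the Archimedean factor |x|_infty
    for p in prime_factors(x.numerator) | prime_factors(x.denominator):
        result *= p_adic_abs(x, p)
    return result

print(product_formula(Fraction(-360, 7)))             # 1, as predicted by (8)
```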

Developing the concepts above might be an interesting path for future work.

We presented an exploratory study of “p-adic machine learning”, which replaces the field R by Q_p. We established the main building blocks for p-adic classification and regression, including prediction rules, learning formulations, and learning algorithms. We derived foundational theoretical properties, some of which are somewhat surprising, as a consequence of the ultrametricity of Q_p: e.g., linear regressors are exemplar-based (they pass through d + 1 training points); unidimensional linear classifiers correspond to enclosing p-adic balls, and nonlinear classifiers with polynomial discriminant functions correspond to unions of such balls. We showed that the topology of Q_p is appealing for capturing hierarchical relations between objects in the representation space, such as those arising in semantic networks, and we provided a proof of concept for a network designed by Rumelhart (1990) for which there is a compact linear network in Q_p with perfect accuracy, whereas this is not the case in R.

Non-Euclidean geometric representations, such as hyperbolic embeddings, have been studied to capture hierarchical relationships (Nickel and Kiela, 2017). Our framework is radically different from such approaches, which still use the field of real numbers as a backbone. We believe our work is only a first step towards p-adic machine learning, and overcoming the challenges identified in §6 is an exciting direction for future work.

Invoke the strong triangle inequality (2) and assume d(x, y) ≠ d(y, z). We can assume without loss of generality that d(x, y) > d(y, z).

From (2) we have d(x, z) ≤ max{d(x, y), d(y, z)} = d(x, y), but applying (2) again we have d(x, y) ≤ max{d(x, z), d(y, z)}, and since we assumed d(x, y) > d(y, z) we must have d(x, y) ≤ d(x, z). Therefore, combining the two inequalities we must have d(x, z) = d(x, y) = max{d(x, y), d(y, z)}.

Example 10. Any ball in Q_p with nonzero radius can be written as B_{p^n}(a) = a + p^{−n} Z_p. In particular, the decomposition (4) can be equivalently written self-referentially as

Z_p = ⋃_{i=0}^{p−1} (i + p Z_p).

This shows that Z p has self-similar structure. We then have the following corollary, which generalizes (4).

Corollary 1. Any ball in Q_p satisfies

B_{p^n}(a) = ⋃_{i=0}^{p−1} B_{p^{n−1}}(a + i p^{−n}),

where the union is disjoint, as well as B_{p^{n−1}}(a) = p B_{p^n}(a p^{−1}).

Proof. Using the decomposition of the ball of p-adic integers, the proposition above, and the distributivity of Minkowski sums and scalar multiplication over unions, we have B_{p^n}(a) = a + p^{−n} Z_p = a + p^{−n} ⋃_{i=0}^{p−1} (i + p Z_p) = ⋃_{i=0}^{p−1} (a + i p^{−n} + p^{−n+1} Z_p) = ⋃_{i=0}^{p−1} B_{p^{n−1}}(a + i p^{−n}).

It is straightforward to see that this union is disjoint. The second statement is easily proved from the proposition above.

The connection expressed in Example 8 between p-adic balls and strings can be used to obtain a simple proof of Kraft’s inequality (Kraft, 1949), an important result in information theory which establishes a necessary and sufficient condition for a code to be a prefix code (hence, uniquely decodable).

Proposition 9 (Kraft’s inequality). Let each source symbol from the alphabet S = {s_1, …, s_n} be encoded into a prefix code over an alphabet of size m with codeword lengths ℓ_1, ℓ_2, …, ℓ_n. Then Σ_{i=1}^n m^{−ℓ_i} ≤ 1. Conversely, for a given set of natural numbers ℓ_1, ℓ_2, …, ℓ_n satisfying the above inequality, there exists a prefix code over an alphabet of size m with those codeword lengths.

We provide a simple proof for the case where m = p is prime based on properties of p-adic balls.

We start with the following lemma.

Lemma 1. Let {B_{r_i}(a_i)}_{i=1}^n be a finite set of disjoint balls in Q_p, i.e., satisfying B_{r_i}(a_i) ∩ B_{r_j}(a_j) = ∅ for i ≠ j. Assume r_i = p^{n_i} for n_i ∈ Z. Let B_r(a) be another ball enclosing all the balls B_{r_i}(a_i), where r = p^n. Then, we must have Σ_{i=1}^n r_i ≤ r, with equality iff B_r(a) = ⋃_{i=1}^n B_{r_i}(a_i).

Proof. From Corollary 1, we have that B_{p^n}(a) can be decomposed as a disjoint union of p smaller balls, each with radius p^{n−1}, and Σ_{i=0}^{p−1} p^{n−1} = p^n. We can proceed recursively by decomposing some of the smaller balls further, which preserves the equality. Since all balls in Q_p must be either nested or have empty intersection, these are the only possible ways to obtain a decomposition of a ball into smaller balls. The set {B_{r_i}(a_i)}_{i=1}^n is necessarily a subset of one of these decompositions where some balls may be missing, hence the sum Σ_{i=1}^n r_i is upper bounded by r = p^n.

We now provide a simple proof of Proposition 9 for the case where m := p is prime, using the facts we already know about p-adic balls. We start with the ⇒ direction. We identify each codeword with a ball contained in Z_p: a codeword of length ℓ_i corresponds to a ball with radius p^{−ℓ_i} and some center a_i ∈ Z_p such that the first ℓ_i digits of a_i correspond to the characters of the codeword. Since this is a prefix code, none of the balls corresponding to a codeword can be included in a ball corresponding to a different codeword, i.e., the balls {B_{p^{−ℓ_i}}(a_i)} have pairwise empty intersections. Hence, they satisfy the conditions of Lemma 1 and therefore, since Z_p = B_1(0) encloses all these balls, we must have Σ_{i=1}^n p^{−ℓ_i} ≤ 1. To prove the ⇐ direction, suppose that ℓ_1, …, ℓ_n satisfy Σ_{i=1}^n p^{−ℓ_i} ≤ 1 and assume without loss of generality that ℓ_1 ≤ … ≤ ℓ_n. We can create a prefix code by first picking a codeword associated with the ball B_{p^{−ℓ_1}}(0), then another codeword associated with a ball of radius p^{−ℓ_2} which is not included in the first ball, and so on.
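The ⇐ direction of this proof is constructive, and the greedy ball-picking it describes can be sketched directly: process the lengths in nondecreasing order and assign to each one the next unused ball of radius p^{-ℓ}, i.e., the next unused length-ℓ prefix. The implementation below is our own rendering of this standard construction, not code from the paper.

```python
from fractions import Fraction

def kraft_prefix_code(lengths: list[int], p: int) -> list[str]:
    """Greedy construction of a p-ary prefix code with the given codeword lengths,
    possible iff sum_i p^{-lengths[i]} <= 1 (Kraft); each codeword corresponds to a
    ball B_{p^-l}(a) inside Z_p, and the balls are chosen to be pairwise disjoint."""
    assert sum(Fraction(1, p ** l) for l in lengths) <= 1, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes, used = [""] * len(lengths), Fraction(0)   # `used` = total measure already covered
    for i in order:
        l = lengths[i]
        a = int(used * p ** l)                       # index of the next free length-l prefix
        digits = [(a // p ** j) % p for j in range(l)]
        codes[i] = "".join(str(d) for d in reversed(digits))
        used += Fraction(1, p ** l)
    return codes

print(kraft_prefix_code([1, 2, 3, 3], p=2))          # ['0', '10', '110', '111']
```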

To show 1, note that w⋆x + b⋆ = (x − x^(i))/(x^(j) − x^(i)), so |w⋆x^+ + b⋆|_p ≤ 1 for every x^+ ∈ D^+ by the choice of x^(j) as a maximally distant positive example. For every x^− ∈ D^−, we have |x^− − x^(i)|_p > |x^(j) − x^(i)|_p, and hence |w⋆x^− + b⋆|_p > 1, since D is assumed linearly separable and therefore, due to ultrametricity, the distance between any two positive examples must be strictly smaller than the distance between a positive and a negative example.

To see 2, note that a = −b⋆/w⋆ = x^(i) and r = 1/|w⋆|_p = |x^(j) − x^(i)|_p. To show 3, note that the classifier defined by B_{r′}(a) corresponds to a = −b/w and r′ = 1/|w|_p for some w and b; the second equation is satisfied with b = −wa = −wx^(i). We have, for any x^+ ∈ D^+, |x^+ − x^(i)|_p ≤ |x^(j) − x^(i)|_p < |x^(k) − x^(i)|_p, and since p-adic distances are integer powers of p, |x^+ − x^(i)|_p ≤ p^{−1}|x^(k) − x^(i)|_p = r′; for any x^− ∈ D^−, |x^− − x^(i)|_p ≥ |x^(k) − x^(i)|_p > r′. Point 4 follows automatically from points 1 and 3.

Let us first prove 1. We start by showing the “⇒” direction, i.e., that any x satisfying |f (x)| p ≤ 1 is of the form

x ∈ B_{p^{−k+k_12}}(a_1) ∪ B_{p^{−k+k_12}}(a_2). (10)

We have that |f

Moreover, from the strong triangle inequality, we have that p

We now note that, in the scenario 1 where k > 2k 12 , we must have

The third possibility to consider is k 1 = k 2 . In this case the strong triangle inequality implies

, and we must also have from (11

Putting everything together, we obtain x ∈ B_{p^{−⌈k/2⌉}}(a_1) = B_{p^{−⌈k/2⌉}}(a_2), as desired. Finally, to prove the “⇐” direction, note that any x as above satisfies |x −

Boolean problems with d = 2. We start by showing that, for d = 2 and any prime p, classifiers in H_{Q_p^d} can compute any Boolean function. We assume for convenience that (x_1, x_2) ∈ {0, 1}^2 (we could as well assume (x_1, x_2) ∈ {−1, +1}^2 and we would obtain the same proof with a linear transformation of the weights). There are 16 cases to consider: 2^4 possible output assignments for the 4 input configurations. The always-zero and always-one classifiers can be easily solved by picking w = 0 and choosing a suitable b; the cases where the output depends only on x_1 or only on x_2 are also easy, since we can choose respectively w_2 = 0 or w_1 = 0 to ignore the irrelevant input and revert to a unidimensional problem. Negating one of the inputs can also be handled with a transformation x′_i = 1 − x_i and recalculating w and b accordingly. The only cases left to analyze are:

  1. The XOR function. Here the target is y(x) = +1 if x_1 ≠ x_2 and −1 otherwise. We can set w_1 = w_2 = 1/p and b = −1/p.

  2. The negation of the XOR function. Similar to the previous case; we can set w_1 = −w_2 = 1/p and b = 0.

  3. The AND function. Here the target is y(x) = +1 if x_1 = x_2 = 1 and −1 otherwise. For p = 2, we can set w_1 = w_2 = 1/4 and b = −1/2. For p > 2, we can set w_1 = w_2 = 1/p and b = −2/p. We will see below that, like XOR, this is a particular case of a counting problem for which we provide a general solution.

  4. The negation of the AND function. This cannot be solved directly.

Congruence and counting problems. We next show that, for d ≥ 2 and any prime p, classifiers in H_{Q_p^d} can solve congruence problems modulo p^n (of which parity check problems are a special case, when p = 2 and n = 1) as well as counting problems. We assume inputs are binary, (x_1, …, x_d) ∈ {0, 1}^d. Congruence modulo p^n problems correspond to the following target function:

y(x) = +1 if Σ_{i=1}^d x_i ≡ a (mod p^n), and y(x) = −1 otherwise, (13)

where n ∈ N and a ∈ {0, …, p^n − 1}. These problems are solved by a p-adic linear classifier with w_i = p^{−n}, for each i ∈ [d], and b = −a p^{−n}. When c inputs are active, we obtain w^⊤x + b = (c − a) p^{−n}, which lies in Z_p (i.e., is classified +1) iff c ≡ a (mod p^n).

Counting problems (of which XOR and AND are particular cases) correspond to the following target function (again with domain {0, 1}^d):

y(x) = +1 if Σ_{i=1}^d x_i = c, and y(x) = −1 otherwise, (14)

where c ∈ {0, …, d}. These problems are solved by a p-adic linear classifier with w_i = p^{−n}, for each i ∈ [d], and b = −c p^{−n}, with n set as n = ⌈log_p(1 + max{c, d − c})⌉. To see this, observe that the counting problem (14) is equivalent to the congruence problem (13) for a = c and sufficiently large n. More specifically, n should be large enough so that p^n > max{c, d − c}, which guarantees that the only number of active inputs c′ ∈ {0, …, d} with c′ ≡ c (mod p^n) is c′ = c.

Count thresholding problems. These correspond to the following target function (again with domain {0, 1}^d):

y(x) = +1 if Σ_{i=1}^d x_i ≥ c, and y(x) = −1 otherwise, (15)

where c ∈ {1, …, d}.

For 1, we need to show that for any x (i) such that |w ⊤ x (i) + b| p ≤ 1 we have (i) |w ⊤ x + b| p ≤ 1 ⇒ |w ⊤ x -w ⊤ x (i) | p ≤ 1 and (ii) |w ⊤ x + b| p > 1 ⇒ |w ⊤ x -w ⊤ x (i) | p > 1. Note that |w ⊤ x -w ⊤ x (i) | p = |w ⊤ x + b -w ⊤ x (i) -b| p ≤ max{|w ⊤ x + b| p , |w ⊤ x (i) + b| p }, with equality if |w ⊤ x + b| p ̸ = |w ⊤ x (i) + b| p . For (i), since both |w ⊤ x + b| p ≤ 1 and |w ⊤ x (i) + b| p ≤ 1, it follows that |w ⊤ x -w ⊤ x (i) | p ≤ 1. For (ii), since |w ⊤ x + b| p > 1, we must have |w ⊤ x + b| p ̸ = |w ⊤ x (i) + b| p , and therefore |w ⊤ x -w ⊤ x (i) | p = max{|w ⊤ x + b| p , |w ⊤ x (i) + b| p } = |w ⊤ x + b| p > 1.

Now we prove 2. Let w and b be such that the classifier is tight and satisfies w ⊤ x (i) +b = 0 for some i (guaranteed from point 1), and choose j such that |w ⊤ x (j) + b| p = 1. Then we have an invertible u ∈ Z × p such that u(w ⊤ x (j) + b) = 1. Setting w ⋆ = uw and b ⋆ = ub and noting that w ⋆⊤ x (i) + b ⋆ = u(w ⊤ x (i) + b) = 0 completes the proof.

We will make use of the following lemma. It is instructive to examine first what happens in the zero-dimensional case, where x (i) = 0 for all i = 1, . . . , n. In this case, only y (1) , …, y (n) matter, and the problem is that of finding the “centroid” b that minimizes max i |b -y (i) | p . Lemma 2 implies that b is a solution to this problem iff it is a point in the ball spanned by y (1) , …, y (n) (hence, due to ultrametricity, also a center of that ball). In particular, b = y (j) (for any j ∈ [n]) is a solution, again due to Lemma 2.

We now consider another special case where there is no bias parameter, i.e., where the problem is to find w ∈ Q_p which minimizes L(w) := max_i |wx^(i) − y^(i)|_p. Assume that x^(i) ≠ 0 for all i ∈ [n]. Then we have L(w) = max_i |x^(i)|_p |w − y^(i)/x^(i)|_p. From Lemma 2, we have that, for any w, there is a j (namely j ∈ arg min_k |w − y^(k)/x^(k)|_p) such that |x^(i)|_p |y^(j)/x^(j) − y^(i)/x^(i)|_p ≤ |x^(i)|_p |w − y^(i)/x^(i)|_p for all i, hence we can assume without loss of generality that the optimal w is of the form w = y^(j)/x^(j) for some j. We need to find the indices i and j associated to the resulting min-max problem.

Consider now the general case with a bias. Letting w⋆ be a partial solution to this problem, we have from the zero-dimensional case that any b of the form b⋆ = y^(k) − w⋆x^(k) (for arbitrary k) completes the solution. Hence, we can substitute b = y^(k) − wx^(k) and solve for w, which leads to min_w max_i |w(x^(i) − x^(k)) − y^(i) + y^(k)|_p. Since for i = k we have |w(x^(i) − x^(k)) − y^(i) + y^(k)|_p = |0|_p = 0, this problem is equivalent to min_w max_{i≠k} |w(x^(i) − x^(k)) − y^(i) + y^(k)|_p. This now reverts to the problem without bias, from which we obtain the algorithm stated in Proposition 6.

Let w be a regressor which passes through r < d + 1 points. We show that there is another regressor ŵ passing through r + 1 points such that ∥X ŵ -y∥ p ≤ ∥Xw -y∥ p . The result will follow by induction. Assume w ⊤ x (i) = y (i) for i ∈ {j 1 , …, j r } and w ⊤ x (i) ̸ = y (i) for i ∈ [n] \ {j 1 , …, j r }. Let j r+1 , …, j d index arbitrary distinct inputs. We will show that there is k ∈ [n] \ {j 1 , …, j d } and a classifier ŵ such that:

• ŵ^⊤ x^(i) = w^⊤ x^(i) = y^(i), i ∈ {j_1, …, j_r};

• ŵ^⊤ x^(i) = w^⊤ x^(i) ≠ y^(i), i ∈ {j_{r+1}, …, j_d};

• ŵ^⊤ x^(k) = y^(k);

• |ŵ^⊤ x^(i) − y^(i)|_p ≤ max_{j∈[n]} |w^⊤ x^(j) − y^(j)|_p, i ∉ {j_{r+1}, …, j_d, k}.

These conditions imply that ∥Xŵ − y∥_p ≤ ∥Xw − y∥_p. Define X̂ ∈ Q_p^{(d+1)×(d+1)} as the submatrix formed by the rows {j_1, …, j_d, k} of X; this matrix is invertible by assumption. Likewise, let ŷ ∈ Q_p^{d+1} be defined by the entries w^⊤ x^(i) = ŵ^⊤ x^(i) for i ∈ {j_1, …, j_d} and last entry y^(k). We have X̂ŵ = ŷ, and therefore ŵ = X̂^{−1} ŷ. We further have, for i ∈ [n]:

ŵ^⊤ x^(i) − y^(i) = w^⊤ x^(i) − y^(i) + (X̂^{−1} ŷ − w)^⊤ x^(i) = w^⊤ x^(i) − y^(i) + (X̂^{−1}(ŷ − X̂w))^⊤ x^(i) = w^⊤ x^(i) − y^(i) − (w^⊤ x^(k) − y^(k)) [X̂^{−1}]^⊤_{:,d+1} x^(i). (16)

We now note that matrix X can be decomposed as

where we denote by x(i) ∈ Q d p the inputs before the bias augmentation for i ∈ [n], i.e., x (i) = [x (i) ⊤ , 1] ⊤ , and we use X1:d to denote the matrix whose rows are x(j1) ⊤ , …, x(j d ) ⊤ . From the block matrix inversion formula-which is applicable since X1:d is invertible and X is also invertible, the latter implying that the Schur complement (1 -x(k) ⊤ X-1

.

We now define the following choice rule for k: k := arg max_{i ∈ [n] \ {j_1, …, j_d}} |1 − x̄^{(i)⊤} X̄_{1:d}^{−1} 1_d|_p, which ensures that |[X̂^{−1}]^⊤_{:,d+1} x^(i)|_p ≤ 1. Finally, we apply the strong triangle inequality to (16):

|ŵ^⊤ x^(i) − y^(i)|_p ≤ max_{j∈[n]} |w^⊤ x^(j) − y^(j)|_p, for all i ∈ [n] \ {j_1, …, j_d, k}, which concludes the proof.

p = p -k12 , we have automatically |x -a 1 | p = p -k12 , which shows that (12) alone is always a feasible solution. If we had supposed above that k 1 > k 2 , by symmetry we would have obtained |x -a 1 | p ≤ p -k+k12 , which together with (12) shows that the disjunction (10) holds in scenario 1. Conversely, let x

4 In fact, Ostrowski’s theorem (Gouvêa, 2020, Theorem 3.1.4) remarkably states that the three examples in §2.1 tell the full picture regarding absolute values on Q: every non-trivial absolute value on Q is equivalent (in the sense of inducing the same topology) to one of the p-adic absolute values or to the usual absolute value. This result suggests we should give the p-adic absolute values the same importance we give to the usual absolute value. Denoting the latter by | · |_∞, this is nicely captured in the product formula, which relates all the non-trivial absolute values: ∏_{p≤∞} |x|_p = 1 (Gouvêa, 2020, Proposition 3.1.5). We come back to this beautiful formula in §6.4.

5 Formally, Q_p is a totally disconnected Hausdorff topological space. Z_p is compact and Q_p is locally compact (Gouvêa, 2020, Corollaries 4.2.6-7).

6 Note that polynomial classifiers with sufficiently large degree can overfit the training data with perfect accuracy by setting s := |D^+|, a_j := x^(j) for each x^(j) ∈ D^+, and choosing r = 1/|c|_p < min_{x^−∈D^−, x^+∈D^+} |x^− − x^+|_p. This ensures f(x^+) = 0 for any x^+ ∈ D^+ and |f(x^−)|_p > 1 for any x^− ∈ D^−.

7 For s > 2, the positive region is a union of at most s balls, but the balls need not all have the same radius.
