Ideas in need of testing. Take what you find here with a grain of salt. Next to some random observations and thoughts, this blog is about (mathematical) intuition and vague hints at possibly interesting ideas. Entry: Triangular Matrices Date: Sun Oct 9 02:31:26 CEST 2005 Type: tex %% Displacement rank. $\Delta R = R - Z R Z^T$. This looks an awful lot %% like the update equation for the a--priori estimation covariance %% estimate $P_{k+1|k} = A P_{k|k} A^T + W$ in the Kalman filter. So, we %% could stretch things and give $R$ the meaning of covariance matrix, %% and $Z$ the dynamics of the system under study. Waking up now. This %% is of course because in our setting of displacement rank, $R$ is %% symmetric/Hermitian. More dreams were about triangular matrices again. My ``aha'' last week was about the property that they leave a \emph{subspace nest} invariant (a total ordering of subspaces), making them into a \emph{nest algebra}. This ``recursive quality'' leads to all kinds of properties explaining why they are so important for numerical linear algebra. The most striking is linear system solving, where the property leads to an inductive divide and conquer strategy which solves the (trivial) system in the one dimensional subspace and then updates the equations to reduce the problem size from $N$ to $N-1$. The second consequence is $LU$ and $LL^T$ factorization, which establishes the link to general matrices and to positive definite symmetric matrices. The algorithms for computing these have the same recursive properties. %% All this is of course intimately related to causality, where the nice %% example is of course the Schur complement used to express conditional %% probability. Now to say something more about the Hermitian/upper triangular vs.\ unitary matrix concept. (For real matrices, replace these with symmetric and orthogonal.) We know any matrix can be written in ``polar form'' $A = UR$, where $U$ is unitary and $R$ is positive definite Hermitian. BUT, we can also take $R$ to be upper triangular with a positive diagonal. What exactly is the link between these upper triangular matrices and Hermitian matrices, other than the Cholesky factorization? That is, we can take the square root of a positive definite Hermitian matrix to be another Hermitian matrix or a triangular matrix, but can we transform a positive definite Hermitian matrix directly into an upper triangular matrix? Is there some deeper structure relating the two generalizations of a positive real number? Let's have a look at square roots of triangular matrices first. For Hermitian matrices the square root is trivial to see from the eigenvalue decomposition. Writing it inductively we get $$\matrix{A & 0 \\ B & C}^2 = \matrix{A^2 & 0 \\ BA + CB & C^2} = \matrix{D & 0 \\ E & F}. $$ This is trivial to solve when all submatrices are scalars. If not, we can split off the two equations $A^2 = D$ and $C^2 = F$, solve those recursively, and propagate the results back up to solve $BA + CB = E$. Suppose $A = a$ is a scalar and $b$, $e$ are column vectors. Here we have $a = \sqrt{d}$, and since the scalar $a$ commutes with $b$, the remaining equation becomes $(C + aI)b = e$, which is a triangular system. So we have both the divide and conquer steps, giving us an algorithm.
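A small numerical sketch of this recursion (my own addition; the function name and the test matrix are arbitrary), written for a lower triangular matrix with positive diagonal:

\begin{verbatim}
import numpy as np

def sqrt_lower_triangular(m):
    # Recursive square root of a lower triangular matrix with positive
    # diagonal: returns lower triangular s with s @ s == m.
    # A sketch of the divide and conquer idea above, not production code.
    n = m.shape[0]
    if n == 1:
        return np.sqrt(m)                          # the trivial 1x1 system
    d, e, f = m[0, 0], m[1:, 0], m[1:, 1:]         # m = [[d, 0], [e, F]]
    a = np.sqrt(d)                                 # a = sqrt(d)
    c = sqrt_lower_triangular(f)                   # recurse: C^2 = F
    b = np.linalg.solve(c + a * np.eye(n - 1), e)  # (C + aI) b = e, triangular
    s = np.zeros_like(m)
    s[0, 0], s[1:, 0], s[1:, 1:] = a, b, c
    return s

# quick check on a random lower triangular matrix with positive diagonal
L = np.tril(np.random.randn(5, 5))
np.fill_diagonal(L, np.abs(np.diag(L)) + 1.0)
S = sqrt_lower_triangular(L)
print(np.allclose(S @ S, L))    # True
\end{verbatim}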
%% Now, onto Hermitian matrices. Unfortunately, they do not form a group %% under ordinary matrix multiplication like the triangular matrices do, %% only if they commute. So this is where to look. If they commute, they %% clearly do not present the same algebraic structure as lower %% triangular matrices, so what about commutators and anti--Hermitian %% matrices? %% This boils down to identifying with ``positive imaginary'' numbers %% instead of positive real numbers, and imposing a product like $A %% \times B = i[A,B]$ with $[A,B] = AB-BA$. The other interesting link %% might be that for triangular matrices, the commutator is %% nilpotent\footnote{ More generally, the invertible triangular matrices %% are a special case of a solvable Lie group, whose Lie algebra consists of the %% upper triangular matrices. }. %% Now, that's enough for today. I'm waking up. Pure intuition, there are %% two sides to this coin\ldots The obvious side is that some connections %% I make here are very trivial from a logic point of view, others are %% more far out, making me seem like I am on drugs. Now, to make one of %% those drugged statements: this is about category theory and patterns. %% Making explicit these more far out statements is caught in category %% theory, for which at this moment I do not have the energy to dive into %% fully, but it's about turning the equality sign into morphisms. Or in %% other words, making approximate intuition a bit more mathematically %% ``serious''. Entry: Entropy and non-invertibility Date: Mon Oct 10 05:18:05 CEST 2005 I wonder what the link is between entropy and non-invertibility, since both are related to one-way processes and information reduction. [EDIT Thu Jun 18 15:36:35 CEST 2009] This remark appeared in the log. It hints at something I don't yet understand enough to articulate. Look at the matrix computations tutorial (basics.tex). Entry: Displacement Date: Tue Oct 11 04:11:06 CEST 2005 Type: tex % here i'm falling in the ellipse-parabola-hyperbola trap again.. %% Some intuitive leaps. Separability versus displacement structure and %% differential equations like the Laplace and d'Alembert equation. And %% symplectic matrices. So what's this displacement business for symmetric matrices all about? In short: finding a way to write the sum or difference of two transforms of a single matrix $A$ as a sum of ``squares'' $$\Delta A = F A - A \tilde{F}^H = G \tilde{G}^H = \sum_{k=1}^r g_k \tilde{g}_k^H,$$ or more generally, as a limited sum of outer products. This should be done in a way that lowers the rank $r$, making it effectively separable. This way \emph{the original structure that we were forced to introduce to write our problem as an ordinary matrix problem can be recovered}. So the matrices $F$ and $\tilde{F}$ should really be gotten from the problem domain. For example, for filtering problems, they are shift matrices. Care has to be taken though that the original matrix $A$ can be reconstructed from this representation, since we want an alternative representation, not something completely different. One thing that has struck me thinking about structured matrices before is the question: if your equation is structured, why do you blow it up into a matrix? Naivety can be beautiful, because it usually hides the answer right there in the question: we can't do much equation solving in practice without using algorithms based on matrices and their factorizations, so we have to stick to formulating our problem in the language we know, to find a more general approach instead of exploiting the structure of the problem directly. This of course means that some of the (more obvious) fast algorithms we find by applying this philosophy have already been encountered before by exploiting the structure of the problem in a more direct fashion. The key example here is the Levinson algorithm.
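To make the displacement idea concrete, a small check I added (the matrix size is arbitrary): for a symmetric Toeplitz matrix $R$ and the lower shift matrix $Z$, the displacement $R - ZRZ^T$ has rank at most two, independent of the size of $R$.

\begin{verbatim}
import numpy as np
from scipy.linalg import toeplitz

n = 8
R = toeplitz(np.random.randn(n))     # symmetric Toeplitz matrix
Z = np.diag(np.ones(n - 1), -1)      # lower shift matrix

D = R - Z @ R @ Z.T                  # displacement of R
print(np.linalg.matrix_rank(D))      # 2 (generically), whatever n is
\end{verbatim}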
Entry: Hermitian matrices Date: Tue Oct 11 08:30:44 CEST 2005 Type: tex A bit more awake now. I'm still quite puzzled by this Hermitian versus triangular stuff. The main reason for my puzzlement is that they have the same number of degrees of freedom, and are both a kind of generalization of a (positive) real number, as indicated by the QR and polar decompositions, but they do not behave the same under multiplication. Maybe it's because I'm not making a distinction between Lie group and Lie algebra. Another difference is that multiplication of triangular matrices is closed, but this is not the case for Hermitian matrices. For that we have to move to the Lie bracket and anti--Hermitian matrices. Something else to remark: exponentiation is closed for all triangular matrices, but it is only closed for positive definite Hermitian matrices. So there is a very subtle difference between the two. All this weirdness originates in the asymmetry of the $A=QR=US$ relation of course, where $Q$ and $U$ are unitary, $R$ is upper triangular and $S$ is Hermitian. This is not a similarity transform. It is one--sided. Both $R$ and $S$ can be turned into each other by pre--multiplying with a unitary matrix. Entry: Symplectic forms Date: Tue Oct 11 09:30:44 CEST 2005 Type: tex Then, symplectic forms. I can't seem to find any decent explanation of this anywhere, so here we go. Partition all the coordinates in a phase space as $(a_i,b_i)$ pairs, which could be represented as complex numbers, but which we are just going to represent as reals here. The thing that's preserved under symplectic transforms is $$\sum_i a_i b_i \sin \theta_i.$$ Now what does this mean if we compare it to that other expression $\sum_i a_i b_i \cos \theta_i$? If I recall correctly, the last expression involves dissipation, so the other one must mean conservation, no? Let's have a look at the leapfrog method. It solves the nonlinear continuous time set of equations $$ \begin{array}{l} d q / d t = p \\ d p / d t = f(q), \end{array} $$ by approximating them as $$ \begin{array}{l} q_{n+1} = q_n + \Delta t \, p_n \\ p_{n+1} = p_n + \Delta t \, f(q_{n+1}). \end{array} $$ It turns out this gives good results because of the $q_{n+1}$. First updating $q$ and then using this to compute $p$ gives the method its name. It was realized that the reason this scheme works so well is that it respects \emph{symplectic conservation}, a symmetry in Newton's laws of physics analogous to conservation of energy and angular momentum. See also that nice picture of invariant tori and quasiperiodicity. I wonder if there's any nice sound in this. %% And maybe I can combine this displacement rank stuff to do some %% symplectic geometry. %[1] http://www.massey.ac.nz/~rmclachl/SciRevMcL.html
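A minimal numerical sketch of the leapfrog update above (the pendulum-style force $f(q) = -\sin q$, the step size and the run length are my choices): the point of the symplectic property is that the energy error stays bounded over long runs instead of drifting.

\begin{verbatim}
import numpy as np

def leapfrog(q, p, f, dt, steps):
    # q_{n+1} = q_n + dt * p_n ;  p_{n+1} = p_n + dt * f(q_{n+1})
    for _ in range(steps):
        q = q + dt * p
        p = p + dt * f(q)
    return q, p

f = lambda q: -np.sin(q)                    # example force (my choice)
energy = lambda q, p: 0.5 * p**2 + (1.0 - np.cos(q))

q0, p0 = 1.0, 0.0
q, p = leapfrog(q0, p0, f, dt=0.05, steps=20000)
print(energy(q0, p0), energy(q, p))         # close: no secular energy drift
\end{verbatim}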
Entry: Golden Ratio Date: Tue Oct 11 10:30:44 CEST 2005 Type: tex One interesting remark I read on Baez's This Week's Finds about the golden ratio, amongst all the witch hunting, was that the golden ratio can be seen as the most irrational ratio, because each term in its continued fraction expansion is $1$, so each term adds the minimal possible amount of accuracy to the approximation of the true value $(1+\sqrt{5})/2$. %[1] http://math.ucr.edu/home/baez/twf_ascii/week203 Entry: Questions Date: Tue Oct 11 17:44:17 CEST 2005 Type: tex So I should ask more questions. What is a Wiener-Hopf equation, and what is its relation to displacement rank? What's so special about symplectic conservation? Why did I hear about momentum and energy, but not symplectum? [sic. Basically, the answer is in Kepler's law.] Why is the wave equation not separable in more than one time and space dimension? [Hmm. Because it is intrinsically \emph{spatial}. It is about the \emph{coupling in space} of certain phenomena. In one dimension, coupling is only left--right, so there's no arbitrary angle. Maybe the right answer is: the equation is separable, but in so many parts (angles/rays) that it doesn't really compare to the $2$--part 1D solution.] What's the deal with this time-is-imaginary business, and Wick rotation? [It's just a way to deal with hyperbolic spaces differently?] Is this more than a notational trick, or does it really say something? And, last but not least, what is a stochastic process? [It is a distribution over functions, and not a function of a stochastic variable. Or, the stochastic variable is a function of time, a trajectory.] Entry: DFT Date: Thu Oct 13 08:17:03 CEST 2005 Type: tex Another thing I suddenly realized. If $F$ is the DFT matrix, containing as columns the DFT base functions in terms of positive $i$, we have $F^4 = I$, meaning $F$ represents a $90$ degree rotation combined with a reflection. The eigenvalues of $F$ are $4$th roots of unity $\in \{\pm 1, \pm i\}$. Note that a DFT of a matrix $A$ looks like $FAF^H$. For $2 \times 2$ matrices, the normalized $F$ looks like $$F_2 = {1 \over \sqrt{2}}\matrix{1 & 1 \\ 1 & -1}.$$ We can write this as $$\sqrt{2} F_2 = \matrix{1 & 0 \\ 0 & -1} + \matrix{0 & 1 \\ 1 & 0} = E_1 + E_2,$$ where $E_1^2 = E_2^2 = I$ and $E_1E_2 = -E_2E_1$. These generate a $2^2$--dimensional Clifford algebra. Note that for this case $F_2^2 = I$. For $4 \times 4$ matrices we have $$F_4 = {1 \over \sqrt{4}} \matrix{ 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i}.$$ Something which has always bugged me is why no--one talks about eigenvectors of a DFT. For the continuous case this is important, but for the DFT it doesn't really seem to make sense. Why? [Edited] Maybe there is an answer in the recursive formulas for the Hermite polynomials. Probably something needs to be said about the Gaussian wave packet. On the real line, the exponential of $-x^2$ somehow makes sense, but on the finite cycle on which the DFT is defined, it does not really. Entry: Square Root Matrix Notation Date: Sun Oct 16 16:25:56 EDT 2005 Type: tex Reading some displacement rank proofs for symmetric matrices, I found the notation to be rather cumbersome. Why not use a ``square root notation'', something like $QAQ^T = Q_A$? So the subscript denotes the signature/quadratic form, and the main symbol denotes the basis transform for the quadratic form. Absence of a signature implies $I$. This gives $\Delta R = I_R - Z_R = G_J$ as the main expression for $Z$--based symmetric displacement structure. The Cholesky decomposition then becomes $I_R = \sum_k l_k$, with $l_k$ the columns of the lower triangular Cholesky factor. Entry: Fisher Information Date: Mon Oct 17 13:15:33 EDT 2005 Type: tex Supposedly, Fisher information is used for continuous variables. But what about the difference between $\int_A p_X(x) \log p_X(x) d x$ and $\log (\int_A p_X(x) d x)$, for the event $A$? The second one clearly doesn't make much sense, but how can we relate mutual information to Bayes' rule then? The answer is rather obvious. Both rules are due to operations on events, instead of one being a consequence of the other. Interesting, but not for now: information geometry[1]: the study of probability and information by way of differential geometry.
%[1] http://en.wikipedia.org/wiki/Information_geometry Entry: Laplace Operator Date: Mon Oct 17 14:15:33 EDT 2005 Type: tex \def\dt{\delta \over \delta t} \def\dx{\delta \over \delta x} \def\d2x{\delta^2 \over \delta x^2} So, onto the relation between time and space. The main player here is the Laplace operator $\sum_k {\delta^2 \over \delta x_k^2}$, with $k$ ranging over all space dimensions. Instead of starting from some physical process which leads to an equation involving this operator, we start by interpreting the operator in several dimensions. For a univariate function it expresses just the second derivative. All this is nice to understand visually, by expressing the effect of the operator on a Gaussian ``packet'', since the operator is linear and most functions of interest can be modeled as a sum of a collection of Gaussian packets. Starting from $p(x) = \exp(-x^2/2)$, we have $Dp(x) = -x p(x)$ and $D^2p(x) = (x^2-1)p(x)$. What we have here is the Rodrigues formula for the Hermite polynomials, up to a scaling $(-1)^n$. Plotting these functions gives us a good idea of which way the push goes. For multivariate Gaussians the behaviour is essentially the same. We could start from $\int p(x) d x$ if we want to see the effect on a step-like scalar field. Where does this bring us? Take for example the diffusion equation $Lp = {\delta p \over \delta t}$, which equates the rate of change over $t$ to the Laplacian. This means concave sections will see a decrease, while convex sections will see an increase. Sections where the Laplacian is zero will be stationary. So the diffusion equation can be seen to ``straighten out'' a scalar field, evolving towards a state with minimal tension. Equating the Laplacian to zero gives harmonic functions. The equation $Lp = 0$ can be solved given boundary conditions. In one dimension, the solutions are just straight lines between the boundary points. In two dimensions, boundaries are one--dimensional, giving rise to a larger family of possible solutions. For the wave equation, the Laplacian does not express the rate of change of $p$, but its acceleration: $Lp = {\delta^2 p \over \delta t^2}$. This means we need extra information about the speed ${\delta p \over \delta t}$ if we want to interpret what happens for a Gaussian wave packet. In short, we need to know if it's moving. Standing still, a packet will split. When moving with the wave velocity, a packet will keep its shape. Question. Are the nonlinear waves produced in the soapbubble pdp patch solitons? Intermezzo. An interesting pattern. Orthogonal polynomials on the unit interval $[0,1]$ with weight function $w(x)=1$ (Legendre), on the positive real line $\R^+$ with weight function $w(x)=e^{-x}$ (Laguerre), and on the whole of $\R$ with weight function $w(x)=e^{-x^2}$ (Hermite). I wonder what this has to do with the Riemann mapping theorem. The 3 intervals above are basically the analogs of the unit disk, the upper half plane and the complex plane.
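A small finite difference sketch I added to check this reading (grid, time step and initial packet are arbitrary choices): convex regions of the field go up, concave regions go down, and the Gaussian packet flattens out.

\begin{verbatim}
import numpy as np

x = np.linspace(-5, 5, 201)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2)                           # Gaussian packet

def laplacian(u, dx):
    l = np.zeros_like(u)
    l[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    return l

dt = 0.4 * dx**2                                # stable explicit step
for _ in range(2000):
    p = p + dt * laplacian(p, dx)               # dp/dt = Lp

print(p.max())    # the peak has dropped well below 1: the packet spreads
\end{verbatim}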
Entry: Real Numbers Date: Tue Oct 18 11:37:34 EDT 2005 Type: tex Is it so that each irrational number is the limit of a sequence of rational numbers? Sounds very much like my surprise about $e$ in the last entry. Anyway, this is the continued fraction expansion. How did this work again? Continued fractions are motivated by a desire to have a ``mathematically pure'' representation for the real numbers as a sequence $a_k$ of integers. It is generated recursively as follows. With $[x]$ denoting the integer part of $x$, we construct the series as $x_0 = x$, and proceed recursively as $$a_k = [x_k],\qquad x_{k+1} = 1 / (x_k - a_k).$$ For example, the first recursion for $x$ gives $x = a_0 + 1 / x_1$, where $x_1 \geq 1$. Repeating this procedure for $x_1$ gives an infinite series if $x$ is irrational. Truncation of the continued fraction expansion gives in some sense the best rational approximation to an irrational number. Whenever an integer in this expansion is large, the truncation just before it is already a very good approximation. In this sense the ``golden ratio'' $(1+\sqrt{5})/2$ can be seen as the most irrational number, since its expansion consists of all ones.
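A direct transcription of this recursion (floating point only, so the terms degrade after a dozen or so steps); the golden ratio indeed gives all ones:

\begin{verbatim}
from math import floor, sqrt, pi

def continued_fraction(x, terms):
    # a_k = [x_k],  x_{k+1} = 1 / (x_k - a_k)
    out = []
    for _ in range(terms):
        a = floor(x)
        out.append(a)
        frac = x - a
        if frac == 0:            # rational input: the expansion terminates
            break
        x = 1.0 / frac
    return out

print(continued_fraction((1 + sqrt(5)) / 2, 10))   # [1, 1, 1, 1, ...]
print(continued_fraction(pi, 6))                   # [3, 7, 15, 1, 292, 1]
\end{verbatim}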
Entry: Beach Date: Sun Jun 4 21:44:37 EDT 2006 Type: tex Day on the beach. Pascal's triangle modulo 2 and its symmetries. Matrices of symmetric parts of Pascal's triangle. Linear cellular automata and the relation between Boolean algebra and Galois fields. Modulation and splitting fields. Shifts: generated by $x$; Pascal's triangle: generated by $1+x$. Been reading Tony Smith's page about Clifford algebras. He mentions the Clifford product is sort of an extension of the set--theoretic XOR from sets and subsets to vector spaces and subspaces. A limit can be seen as a (linear) operator mapping a sequence of things to a thing. So when does this operator commute with other things? When is $f(\lim_{n\to\infty} s_n) = \lim_{n\to\infty} f(s_n)$, for example? My guess would be that $f$ needs to be continuous. What's the difference between algebra and analysis? The concept of limit. Where does this fit in the construction of the real numbers? The real numbers are such that taking the limit of a convergent sequence is a closed operation. This is not true for the rational numbers, e.g. $e = \sum_k 1/k!$ is not a rational number. For the real numbers, every Cauchy sequence is convergent. (Now, for what spaces is this not true?) This allows one to talk about continuity, differentiation, integration, \ldots All of them defined in terms of limits. Construction of the reals is done axiomatically by imposing Dedekind--completeness: every non--empty subset of $\R$ with an upper bound in $\R$ has a least upper bound in $\R$. This is not true for the rationals, e.g. $S = \{x : x^2 < 2\}$. Something I didn't know is that most real numbers are not computable, because there are only countably many algorithms, or even definable, because the set of all logical formulas is countably infinite. So all this madness is in some sense the result of the fact that every natural number has a successor, and we somehow want to look at the behaviour of the ``last'' one. When we do this for sequences, a.k.a. maps from $\N$ to $\R$, we can give the image of this last element a name, and call it the limit of the sequence. For maps from $\N$ to $\Q$ this is not always possible, hence the need for $\R$. Dedekind--completeness is simply the easiest way to impose a property that makes us see the ``horizon'', being the supremum of a set. Entry: Measure Theory Date: Wed Oct 19 10:30:36 EDT 2005 Type: tex So, what is Lebesgue integrability, and why should I care? Probably because of its relation to distributions. First comment: Lebesgue integration interacts better with limits of sequences of functions, and as such is important in Fourier analysis. [ It bridges $\mathbb{N}$ and $\mathbb{R}$, or the discrete and the continuous. ] This brings us first to \emph{measure theory}, which discusses the notion of \emph{length}. We will work with the Lebesgue measure on $\R$ and $\R^n$, which assigns a length or $n$--volume to subsets of the line or $n$--dimensional Euclidean space. An important concept is a \emph{null set}, which is a set with measure zero. Countable sets are null sets, as are sets with dimension smaller than $n$. The integral is built starting from the measure, following a path through indicator functions, simple functions, non--negative functions, signed functions and complex valued functions. Intuitively, the difference between Riemann integration and Lebesgue integration can be understood as chopping up the domain---evaluate a function on a grid---versus the codomain---sum areas of contour maps. An important property of Lebesgue integration is that functions $f$ and $g$ which differ only on a null set (for instance a countable set) have the same integral. In other words, $f$ and $g$ are equal \emph{almost everywhere}. Now, all this is again very theoretical. I still don't see why in practice there is anything ``useful'' about Lebesgue integration. However, it does seem to be very necessary to prove things about convergence of series and all sorts of tricks involving different kinds of infinities, or in other words things which have to do with $\R$ not being countable. Supposedly it has something to do with the difference between Hilbert spaces (energy, norm from inner product) and Banach spaces. Important applications are probably the $1$ and $\infty$ norms related to control and systems theory. So, what is a Banach space? It is the generalization of the completeness of $\R$ to vector spaces, in a way that generalizes the concept of \emph{distance} to \emph{norm}. Completeness is imposed by requiring that every Cauchy sequence of vectors has a limit. So a Banach space is a complete normed vector space. Sequences and Cantor's diagonal argument. Some detail I got hung up on: representation of real numbers as continued fractions, which can be made unique, and Banach spaces of bounded sequences. For the continued fractions, the values are taken in $\N$. How does this relate to the cardinality of the powerset of $\N$? This brings us to the next question. Disentangle measure, norm, metric, quadratic form, 2--form, inner product. Entry: Progress.. Date: Thu Oct 20 15:53:21 EDT 2005 Some small dose of matrix Lie groups, and then on to trace and determinant. Been randomly reading a bit until I got to multilinear algebra. The thing that strikes me is ``Clifford algebra'' [2] and so many related things on the bottom of [1]. I didn't get anywhere with this. But I did prove the generalized Schur algorithm. There is one question though. How to solve a linear system on the fly without saving the Cholesky coefficients? Another random note, prefixed with a referral to a Feynman quote about always trusting that life will throw good things at you as long as you stay somewhere long enough. [1] http://en.wikipedia.org/wiki/Multilinear_algebra [2] http://en.wikipedia.org/wiki/Clifford_algebra Entry: Spinors Date: Fri Oct 21 06:08:52 EDT 2005 Type: tex Why are spinors not used more often in signal processing? To answer that question, we have to find a way to explain what spinors really are. As far as I can see right now, the essence is about representing a vector (a one--dimensional thingy) as a matrix. So, instead of looking at the transform $Qx$, we look at $QXQ^{-1}$. This is a double cover representation. It brings us straight to Clifford algebras and higher order tensors. For example, take a $16$--component vector.
The spinor is the $Q$. If $X$ is a vector, $Q$ is a $128$--dimensional even grade element called a spinor. One example of this is representing SO(3) by SU(2), its double cover. %% Now, can we do something with this? My immediate question would be, %% what are the analogs of the matrix decompositions for this double cover %% representation, and what can we say about displacement rank? %% \section{Sat Oct 22 12:24:07 EDT 2005} %% Been reading some subspace signal processing and identification stuff. %% Something I should have done way earlier, really. Anyways. Books %% arrived, so the tensor--sumsin estimation stuff will have to wait: %% Papy and Delathauwer. Entry: Unstructured remarks Date: Sun Oct 23 09:24:55 EDT 2005 Type: tex Some of yesterday's notes. Pad\'{e} approximation gives rise to a set of equations with displacement structure. It's no surprise that there exist recurrence relations. Bernoulli's method for finding the largest root of a polynomial and the power method for finding the largest eigenvalue are closely related. Sequences of polynomials give rise to triangular coefficient matrices. Sturm--Liouville problems, which are Hermitian $2$nd order linear differential equations, give orthogonal functions. Examples are all the orthogonal polynomials, and the Bessel functions. They are the analog of $(A-\lambda B)x = 0$, with $A,B$ symmetric and $B$ positive definite. The latter equation produces eigenvectors $x$ that are $B$--orthogonal. Real orthogonal polynomials have recurrence relations that go two steps back, while complex ones go only one step back. This is closely related to oscillatory behaviour being possible only for $2$nd degree real difference equations, but already for $1$st degree complex ones. Today's reading brings me to Green's functions. If I am correct, it is about inverting an operator (differential) equation like $Ly = f$, by transforming it to an operator equation (integral) like $y = Kf$. A special case (?) is when $L$ is Hermitian. This gives a countable spectrum with orthogonal eigenfunctions that span the Hilbert space. This is analogous to the orthogonal decomposition of a Hermitian matrix. The Green function can then be written as $K(x,x') = \sum_k \lambda_k^{-1}\phi_k(x)\overline{\phi_k(x')}$. Entry: Derivative of the Inverse Function Date: Mon Oct 24 16:25:02 EDT 2005 Type: tex Inspired by the question: ``Why does $\tan^{-1}$ have such a nice derivative?'' \subsection{Derivative of $f^{-1}$} Why do some inverse functions have such a nice derivative? With $D\{f\}$ denoting the derivative of $f$, we can say this is a consequence of the fact that the derivatives of $f$ and $f^{-1}$ are $D\{f\}$ and $D\{f\}^{-1}$, where the inverse in $f^{-1}$ is with respect to composition, while the inverse in $D\{f\}^{-1}$ is with respect to multiplication. In order for this to yield a simple derivative, we need $D\{f\} = g \circ f$, with $g$ simple. To distinguish the domain and codomain of $f$ we will denote them by $x$ and $y$ respectively. This gives for the above relations $y = f(x)$, $x = f^{-1}(y)$, and $D_y\{f^{-1}(y)\} = 1 / D_x\{f(x)\}$. Following the chain rule it is easy to see this is true, since the latter expression is simply a rearranged version of what we get by differentiating $x = f^{-1}(f(x))$ with respect to $x$. Now with $D_x\{f(x)\} = g(f(x))$ we can write $D_y\{f^{-1}(y)\} = 1 / g(f(x)) = g(y)^{-1}$. What we have expressed here is that $f$ is a solution of the first order differential equation $$y' = g(y).$$ The most famous example is of course $g(x) = x$, which yields $y = \exp(x)$, leading to $D\{\log x\} = 1/x$. Another is $g(x) = -x^2$, which yields $y = x^{-1}$.
There are some more interesting ones among the trigonometric functions. For example $g(x) = 1 + x^2$, which is satisfied by $y = \tan(x)$, and $g(x) = \sqrt{1 - x^2}$, satisfied by $y = \sin(x)$. Note that differentiation maps the inverse with respect to composition to the inverse with respect to multiplication. This can easily be seen geometrically: for $y = f(x)$, exchanging $x$ and $y$ reflects all tangent lines in the $y = x$ diagonal, which means $y = mx$ is mapped to $x = my$, i.e.\ $y = m^{-1} x$. With $m = \tan \alpha$, we have $m^{-1} = \tan(\pi/2 - \alpha) = \tan(\alpha)^{-1}$. There is more of course. We have $f \circ g = f(g(x))$, which is mapped to $(f' \circ g) g' = f'(g(x)) g'(x)$, so composition is mapped to multiplication. More about that later. Entry: Differential equations Date: Tue Oct 25 09:42:39 EDT 2005 Type: tex Let's talk a bit about differential equations. There is no single elegant theory for solving the general problem. In general we can only talk about existence and uniqueness. There is a theorem which states that under certain conditions (Lipschitz), a system of first order differential equations with initial conditions $y_k' = f_k(x, y_1, \ldots, y_N)$ has exactly one solution on an interval containing $x_0$. This is the most general form to which every ODE can be reduced. Finding expressions for a solution as an implicit $F(x,y)=0$ or explicit function $y(x)$ is often not possible. There are however classes of special cases which allow us to obtain such expressions. The simplest class is no doubt linear equations with constant coefficients, which has an elegant mapping to linear algebraic equations through the Laplace transform, resulting from $L\{y'\} = sL\{y\} - y_0$. The constant term is the result of integration by parts of $\int_0^\infty e^{-sx} y'(x) dx$. Note that the expression allows us to recover $y$ from $y'$ and $y_0$. Expressions for higher order derivatives $y_k^{(n)}$ follow through induction. The consequence of this relation is that a set of linear differential equations for the functions $y_k$ can be transformed into an algebraic set of equations linear in $L\{y_k\}$ with coefficients parametrized in $s$. After obtaining a solution in the Laplace domain by solving the algebraic set, it can be transformed back to obtain the $y_k(x)$. % An important property of the Laplace transform is that it transforms % convolutions into multiplications. An important tool for studying linear differential equations with non--constant coefficients is to treat them as operator equations in infinite dimensional vector spaces. They can often be transformed to integral equations using Green's functions. Classifying equations according to order, we can say some things about first order equations by rewriting them in terms of the differential $P(x,y)dx + Q(x,y)dy$, which we hope to be able to integrate. We will be able to do so if the differential is \emph{separable} or \emph{exact}. The former means it is reducible to the form $P(x)dx + Q(y)dy$, where we can integrate the two terms separately. The second method means we can write it as $dF(x,y)$, which can then be integrated. There are several tricks to reduce an equation to one of these two forms. Equations that can be reduced to the first form include special forms of $P(x,y)$ and $Q(x,y)$. If both are homogeneous in $x$ and $y$, they can be reduced to separable form by writing them in terms of $x/y$. The same is possible when both are linear in $x$ and $y$, where we can shift $x$ or $y$ so both become homogeneous.
For the second form we can try to multiply the differential by an integrating factor, which is used for example to reduce a linear first order equation to an exact differential. Another special case is the Bernoulli equation, which can be reduced to a linear equation. For higher order linear differential equations with non--constant coefficients it is possible to reduce an $N$th order equation to a first order one by using information about $N-1$ known solutions. An $N$th order equation explicit in $y^{(N)}$ can be written as a first order system of $N$ equations. There are more exotic tricks of course. It's quite a maze. From silly substitutions to deep connections? To understand a nonlinear differential equation (first order set), it is instructive to linearize it to first degree, to analyze the behaviour of the equation around a certain point $(x_0, y_0)$ as $y' = J_0 (y-y_0) + f_0$, where $J_0$ is the Jacobian of $f$. It is convenient to perform an affine coordinate transform which brings the linear equation to real normal form. Note that in this case we only study the difference from constant motion, since for $y = y_0$ we have $y' = f_0$. We assume the normal form can be decoupled into a $2 \times 2$ block diagonal system, so we can limit ourselves to the $2$--dimensional normal forms. We ignore the case where the normal form has a Jordan block of order larger than $2$. This brings us to $$y' = \matrix{\lambda_1 & 0 \\ 0 & \lambda_2} y$$ in case of real eigenvalues with $\lambda_1 > \lambda_2$. The solution for each component is $y_k(x) = e^{\lambda_k x}$. We call a positive eigenvalue \emph{repulsive} and a negative eigenvalue \emph{attractive}. In case of $2$ attractive eigenvalues, the largest one will dominate behaviour. This can be seen in the example $y_1(x) = e^{-x}$, $y_2(x) = e^{-2x}$. We can rewrite this as $y_2(x) = [y_1(x)]^2$, so points along the eigendirection of $y_2$ will be attracted much faster to the average behaviour (determined by $f_0$) than those along the direction of $y_1$. The effect of this is that we can mostly view the system as behaving one--dimensionally in this region, with dynamics determined by the eigenvalue $\lambda_1$. When both are repulsive, a similar thing happens. For example $y_1(x) = e^{2x}$, $y_2(x) = e^{x}$. Here $y_1(x) = [y_2(x)]^2$, which explodes much faster than $y_2(x)$, so again, we can view the system as essentially one--dimensional. If there is one positive and one negative eigenvalue, the same reasoning applies, only more strongly. The system behaves as an explosion along the direction of $y_1$, while it is compressed in the direction of $y_2$. So the general conclusion is that when we have a collection of real eigenvalues, the largest one will be dominant. When there are complex conjugate eigenvalues, we get $$y' = \matrix{r & -\omega \\ \omega & r}y = ry + \omega y_{\perp},$$ where $y_{\perp}$ denotes a $90$ degree counterclockwise rotation. This leads to a \emph{rotation} combined with an attraction or repulsion. If $r < 0$, the system will be attracted to the average behaviour of $f_0$. When $r > 0$ it will be repelled. In addition, there is a rotation (oscillation) around the average behaviour, of which the frequency is determined by $\omega$. Again, comparing $r$ to the real parts of the other eigenvalues shows which are dominant. The degenerate case $$y' = \matrix{\lambda & 1 \\ 0 & \lambda} y$$ is a bit more special. A solution would be $y_2 = e^{\lambda x}$, $y_1 = x e^{\lambda x}$, which gives $y_1 = y_2 \log(y_2) / \lambda$.
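A quick sketch (the function name and the example Jacobians are mine) that recovers this qualitative picture directly from the eigenvalues of $J_0$: real eigenvalues give attracting or repelling nodes or a saddle, complex ones give a spiral with rotation rate $\omega = \mathrm{Im}\,\lambda$.

\begin{verbatim}
import numpy as np

def classify_fixed_point(J):
    # Qualitative behaviour of y' = J (y - y0) near the fixed point.
    lam = np.linalg.eigvals(J)
    if np.allclose(lam.imag, 0):
        lam = np.sort(lam.real)
        if lam[-1] < 0:
            return "attracting node, dominated by lambda = %.2f" % lam[-1]
        if lam[0] > 0:
            return "repelling node, dominated by lambda = %.2f" % lam[-1]
        return "saddle: expansion along one direction, compression along the other"
    r, w = lam[0].real, abs(lam[0].imag)
    kind = "attracting" if r < 0 else ("repelling" if r > 0 else "neutral")
    return "%s spiral, rotation rate omega = %.2f" % (kind, w)

print(classify_fixed_point(np.array([[-1.0,  0.0], [0.0, -2.0]])))  # attracting node
print(classify_fixed_point(np.array([[-0.1, -2.0], [2.0, -0.1]])))  # attracting spiral
print(classify_fixed_point(np.array([[ 2.0,  0.0], [0.0, -1.0]])))  # saddle
\end{verbatim}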
% DAMN. Loop quantum gravity: 4D -> 2D for small scales.. this smells % like dominance of diff eq! Entry: Tensors Date: Tue Oct 25 14:55:42 EDT 2005 Type: tex [ Needs serious weeding. ] \def\notindex{\underbrace{\ldots}} \def\t{\mathcal} Tensors for dummies: tensors as grids of numbers. Working just with coordinate blocks, we start with $\R^N$. Transforms from $\R^N \to \R^N$ are matrices $T \in \R^{N \times N}$, which form an algebra. We can multiply $u_{ij}$ and $v_{ij}$ as $$t_{ij} = u_{ik}v_{kj},$$ using Einstein summation. Other kinds of multiplication can be defined, for example the outer product $t_{ij} = u_i v_j$ or the inner product $t = u_i v_i$. The multiplication for matrices is a combination of the two. A coordinate transform $UV\tilde{U}$ can be written as $u_{ik} v_{kl} \tilde{u}_{lj}$. When $\tilde{U}$ is $U^T$ this becomes $u_{ik} v_{kl} u_{jl}$, which makes it clear that both rows and columns of $v_{ij}$ will be transformed by $u_{ij}$. We can multiply a generic tensor with a matrix along a certain dimension as $$t_{\notindex_{k-1}\, j\, \notindex_{K-k}} = t_{\notindex_{k-1}\, i\, \notindex_{K-k}}\, u_{ij},$$ which performs a coordinate transform on the $k$--th coordinate array. We can write this in matrix notation as $\t{T} \times_k U$. To talk about subspaces generated by tensors we first make some observations. We call $N$ the dimension of the vector space generating the tensor space, and call $K$ the degree (also called the order, or sometimes confusingly the rank) of the tensor, which indicates the number of grid coordinates. The dimension of the space generated by all $N$--dimensional $k$--mode vectors can never be larger than $N$, while the number of possible vectors is $N^{K-1}$. The dimension of the space generated by all $N^2$--dimensional $(k,l)$--mode matrices can never be larger than $N^2$, while the number of possible submatrices is $N^{K-2}$. This indicates there is something special going on for $K = 2^k$. For $K=2$ there are two $N$--dimensional spaces generated: the column space $(1)$ and the row space $(2)$. For $K=4$ there are six $N^2$--dimensional spaces generated: $(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)$. Some like to define the rank of a $K$--tensor $\t{T}$ as the cardinality of the minimal set of rank $1$ tensors that sum to $\t{T}$. A rank $1$ tensor is the outer product of $K$ vectors. So, for a $4$--tensor of rank $R$ we have $$\t{T}_{ijkl} = \sum_r^R u_{i}^{(r)} v_{j}^{(r)} w_{k}^{(r)} x_{l}^{(r)}.$$ The SVD can be generalized to tensors. This leads to orthogonal tensors. But first, some weird shit. Take all tensors to be of dim $N^K$. There is a fundamental difference between even and odd $K$. Vectors $(K=1)$ are transformed by matrices $(K=2)$. We cannot multiply vectors and obtain vectors; we can only obtain scalars or matrices. But we can do this with matrices. Does the same thing generalize to higher $K$? Let's define some products. To multiply the $K_a$--tensor $\t{A}$ with the $K_b$--tensor $\t{B}$ we need to choose a subset of indices of both, identify, and sum. We already know what happens when we combine $K \in \{0,1,2\}$. So let's look at $K=3$. The multiplication $(3,0)$ is trivial: $t_{ijk} = u_{ijk} v$. The next one is $(3,1)$, for which we have several possibilities $t_{ij} = u_{ijk} v_k$, where we can choose $3$ different dimensions to multiply using an inner product. Using an outer product we get $t_{ijkl} = u_{ijk} v_l$. There are $4$ possible ways to do this. The resultant $K$s are $\{4,2\}$. The next combination is $(3,2)$.
An outer/outer product gives a $5$--tensor, an inner/outer product gives a $3$--tensor, while an inner/inner product gives a $1$--tensor. A $(3,3)$ combination gives an o/o/o product $6$, an o/o/i product $4$, an o/i/i product $2$, and an i/i/i product $0$. Let's move to $(4,0)$, which gives $4$. Then $(4,1)$ gives $\{5,3\}$, and $(4,2)$ gives $\{6,4,2\}$. Note a pattern here. Combining two even $K$s gives an even $K$. Combining two odd $K$s gives an even $K$. Combining an even $K$ and an odd $K$ gives an odd $K$. We could call the even $K$s \emph{transforms}, because they have the ability to leave $K$ the same. For example take $4 \times 2 \to 2$, or $t_{ij} = u_{ijkl} v_{kl}$. There are $\binom{4}{2} = 6$ possible ways of writing the product without changing the order of the indices. (Compare this to $t_{i} = u_{ik} v_k$, where there are $\binom{2}{1} = 2$ ways of writing the product, corresponding to $t = U v$ or $t = U^T v$.) If we choose $\t{U}$ to be separable, $u_{ijkl} = c_{ik} r_{jl}$, we get the ordinary two--sided matrix transform $C V R^T$, where $C$ operates on the rows and $R$ on the columns. % I do wonder why that $2^k$ structure doesn't show up anywhere. Entry: Central Limit Theorem Date: Tue Oct 25 20:20:41 EDT 2005 Type: tex Using characteristic functions, the proof of the central limit theorem is based on the fact that the characteristic function of a sum of independent variables is the product of their characteristic functions, which gives a relation like $$\lim_{n\to\infty}\left(1 + {x \over n} + o({x\over n}) \right)^n = e^x,$$ where $f \in o(x/n)$ means asymptotically negligible, or $\lim_{n\to\infty} n f(x/n) = 0$. More specifically, $x = -{t^2 \over 2}$ for a standard normal distribution, of which the right hand side is the characteristic function. Never knew it was so simple. Entry: Tensors and Displacement Rank Date: Wed Oct 26 08:49:13 EDT 2005 Type: tex [Probably a dead end.] So, what to do with tensors and displacement rank? First, we have to do this to solve inverse transforms. There is little value in computing a triangular (tetrahedral?) decomposition without using it to decouple spaces. Once this is cleared up, I think it is relatively straightforward to generalize the Schur algorithm to higher dimensions. Another thing I was thinking about: what about fractional $Z$? If we have a structured matrix with fractional delay structure $R - Z_f R Z_f^T$, with $0 < f < 1$, does this make sense? One more thing: how to keep track of covariant/contravariant indices when they're all on the same level. I understand that when using a certain signature/metric, inner products need to pass through the metric tensor. But with ordinary coordinate tensors, the inner product is implicitly defined. Does this only matter for defining the products? I.e. if you take care to define everything well (a coordinate transform is 1=contravariant, 2=covariant), there is nothing to worry about? Maybe it is a good idea to review the standard tensor transform rules and also have a look at the index--free approach, which I guess mostly deals with the grading of tensor spaces? Also it would be nice to finally figure out if it is possible to model (unitary) signals as spinors, or at least have a decent idea of what a spinor actually is, other than something that changes sign under a $2\pi$ rotation.
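A small einsum check of this separable case (dimensions and names are my choice): with $u_{ijkl} = c_{ik} r_{jl}$, the contraction $t_{ij} = u_{ijkl} v_{kl}$ is exactly the two--sided matrix transform, and the mode--$k$ product from the previous entry is the same kind of contraction.

\begin{verbatim}
import numpy as np

n = 4
C = np.random.randn(n, n)
R = np.random.randn(n, n)
V = np.random.randn(n, n)

U = np.einsum('ik,jl->ijkl', C, R)      # separable 4-tensor u_ijkl = c_ik r_jl
T = np.einsum('ijkl,kl->ij', U, V)      # t_ij = u_ijkl v_kl
print(np.allclose(T, C @ V @ R.T))      # True: C acts on rows, R on columns

# a mode-1 product T x_1 M is the same kind of contraction:
A = np.random.randn(3, 4, 5)
M = np.random.randn(3, 6)
B = np.einsum('ijk,il->ljk', A, M)      # transform the first coordinate array
print(B.shape)                          # (6, 4, 5)
\end{verbatim}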
Entry: Particle Music Date: Wed Oct 26 13:47:18 EDT 2005 Type: tex A knot in 3D is a map from the complex unit circle to Euclidean 3--space $$k : \delta \D \to \R^3 : \theta \mapsto (t,x,y).$$ This can be seen as the world line of a 3D spacetime ``fermionic vacuum fluctuation''. Meaning, if you take one of its coordinates to be time, for example the first one, and intersect it with the current time sheet $\tau = \{(t, x, y) : t=t_0\}$, you get a collection of moving/creating/annihilating particles and anti--particles. This could be nice to map to musical structures. Especially creation/annihilation is musically interesting. \def\p{\oplus} \def\m{\ominus} At a creation point, the two particles (roots) emerging could be labeled $\p$ and $\m$ depending on the direction of the curve. Let's call a $\p$ particle a root that follows the increasing part of the curve, and a $\m$ particle the decreasing part. This shows that annihilation is always between a $\p$ and a $\m$ particle. Now, what I thought was that nontrivial knots reflect recombinations of particles. But this is not the case. The trivial knot can be deformed into a fluctuation diagram that exchanges particles. So what exactly is expressed by the ``tangling of world lines''? Some more comments. For knots, which are 1D entities, it is not possible to have interactions other than creation and annihilation, since other interactions would require intersections. For 2D manifolds embedded in 3D this is possible. Before the collision, an intersection contains 2 circles. The moment of collision gives a ``twisted circle'', due to the local saddle point structure. After the collision, both fuse into a compressed circle. An interesting set of knots are the $(p,q)$ torus knots. Actually\ldots All this can be nicely cast into real/complex roots of polynomials. We could have real and imaginary particles. Take any polynomial $$f(x) = \sum_{k=0}^N a_k x^k,$$ and compute its factorization. The real roots correspond to particles. The complex roots could be seen as ``points in harmony'', since they come in complex conjugate pairs that can give birth to closely spaced real roots, which we can associate with repulsion due to inharmonicity. In case $N$ is odd, sweeping $f(x) = c$ from $-\infty$ to $+\infty$ will start out with one particle, and end with one. If $N$ is even, we start out with nothing and end with $2$ if $a_N$ is positive, or vice versa if $a_N$ is negative. To get as many particles as possible during the sweep, it is interesting to use oscillatory polynomials. Other oscillatory functions could be used, but polynomials are easier to work with. We will find good candidates amongst the orthogonal polynomials. Let's concentrate on those from the Sturm--Liouville problem. % this is very similar to the MI paper about algebraic systems
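A rough numerical sketch of this sweep (the choice of polynomial, an oscillatory Chebyshev one, and the sweep range are mine): count the real roots of $f(x) - c$ as the level $c$ moves, and watch particle pairs appear and disappear.

\begin{verbatim}
import numpy as np
from numpy.polynomial import chebyshev as cheb

# an oscillatory polynomial: T_5, converted to monomial coefficients
f = cheb.cheb2poly([0, 0, 0, 0, 0, 1])       # lowest degree first

def real_roots(coefs, c):
    p = coefs.copy()
    p[0] -= c                                # roots of f(x) - c
    r = np.roots(p[::-1])                    # np.roots wants highest degree first
    return np.sort(r[np.abs(r.imag) < 1e-9].real)

for c in np.linspace(-1.2, 1.2, 7):
    print("c = %+.2f  ->  %d particles" % (c, len(real_roots(f, c))))
# 1 particle for |c| > 1, then 5 once the level cuts into the oscillations
\end{verbatim}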
Entry: Chebyshev polynomials Date: Wed Oct 26 15:50:48 EDT 2005 Type: tex Something I learned yesterday reading Conway's Course in Functional Analysis. It's obvious in retrospect, but there's a theorem stating that Hilbert spaces whose orthonormal bases have the same cardinality are isomorphic. This is quite obvious for the Chebyshev polynomials on $[-1,1]$, which are orthogonal w.r.t.\ the weighting function $(1-x^2)^{-1/2}$ and have the property $$T_n(x) = \cos(n \cos^{-1}(x)).$$ With $\theta = \cos^{-1}(x)$, or $\cos\theta = x$, we write $T_n(\cos\theta) = \cos(n\theta)$. The mapping $x\mapsto\theta$ maps the interval $[-1,1]$ to $[\pi, 0]$. This mapping is invertible. It is a Hilbert space isomorphism because it preserves the inner product $$\int_{-1}^{+1}f(x)g(x)(1-x^2)^{-1/2}dx = \int_0^{\pi} f(\cos\theta)g(\cos\theta)d\theta.$$ This mapping maps the orthogonal cosine basis $\cos(n\theta)$ on $[0, \pi]$ to the $T_n$. Chebyshev polynomials satisfy the differential equation $$(1-x^2){d^2 y \over dx^2} - x {dy \over dx} + n^2 y = 0,$$ $${d^2 y \over d\theta^2} + n^2 y = 0,$$ and the recurrence relation $$T_n(x) = 2xT_{n-1}(x) - T_{n-2}(x),$$ $$\cos(n\theta) = 2\cos \theta \cos[(n-1)\theta] - \cos[(n-2)\theta],$$ where the bottom expression is obtained after the substitution $x=\cos\theta$. These polynomials have equiripple properties, which makes them show up in several minimax problems in filtering and beamforming techniques. One result in this domain is that the Chebyshev grid, determined by the zeros of $T_n(x)$, is optimal for polynomial interpolation, in a minimax sense. Furthermore, evaluation of the polynomial is related to the DCT. Now I wondered, if these have a simple bijection to the cosines, what about the other Sturm--Liouville polynomials? Probably, since I never heard about it anyway, the mapping is terribly simple for the things above, but pretty complex for the other polynomials. Entry: Knots Date: Thu Oct 27 12:02:22 EDT 2005 Type: tex Then knots. I'm still thinking about this particle business above. Whether it's trivial or not. Whether it makes a difference if a certain compact closed curve in $\R^3$ is a knot or not when you scan it to get dancing particles. Maybe after all the polynomial approach is better; at least it's easier to understand. Been toying a bit with the trefoil knot in its torus knot representation, with the slicing plane parallel to the revolution plane of the torus. Intersecting it from below gives $3$ pairs of particles. Suppose the parametrization is such that the $\m$ particles stay on the outside of the torus, while the $\p$ particles move through the hole. The dance they make is a $120$ degree rotation before they recombine. Now how is this permutation related to it being a knot or a link, looking only at torus knots/links? Started browsing Arnold's Mathematical Methods of Classical Mechanics. Very interesting book. Of course, my main interest right now is the magic word ``symplectic'' and differential geometry. So a little about that. I've edited it out and moved it to tomorrow's entry. Entry: Differential Geometry Date: Fri Oct 28 11:12:30 EDT 2005 Type: tex A new book on differential geometry by Erwin Kreyszig arrived today. The funny thing about the book is that, while all my orders from Amazon arrived in perfect state, this one was curved! Looking at it, it seems to not use the intrinsic view. It talks about embedded surfaces, so this is fundamentally different from the approach used in Arnold's book, which talks about differential forms, so it's not going to help much with my ``explanation'' of intrinsic differential geometry. Arnold's is quite something. Have to get used to the different kind of style. He's shooting out concepts at a rate you can't believe. But, nevertheless, very interesting. Especially the very basic geometric things about phase space I never really thought about. All in the introduction, but I have to take time to absorb these ideas in a thorough way. One thing I noticed about reading about differential forms is the importance of the determinant, since a $k$--form can be expressed as the determinant of $k$ one--forms.
So one can say, if the inner product relates to angle, the determinant relates to area/volume/\ldots. Today I also got Shilov's book on linear algebra. It starts with determinants. The last chapter is about categories. I just had a brief look at it. It talks about the concepts of algebra and linear family, which after choosing a basis basically describe the algebraic structure of non--square matrices. Have to take some time to see what extra insight this brings. To my knowledge, embedding an $M$--dim manifold in an $N$--dim Euclidean space is always possible when $N$ is large enough (the Whitney embedding theorem gives $N = 2M$). I do not know if it is possible to write it as a solution set of $N-M$ algebraic equations. The latter is somehow very interesting to me, because I see a direct way of plugging in some numerical linear algebra things to use for animation and the study of motion, hence my interest in Arnold's book. Anyway, I'll give Kreyszig a try, and leave the abstract differential geometry for later when I can relate the two. % For an implicit function $F(x_1, \ldots, x_N)$ I do recall something % like $dF = \sum_{k=1}^N {\pd F \over \pd x_k} dx_k,$ from which the % derivative of th \subsection{Differential Geometry} I start with an intuitive introduction of the concepts. We'd like to explore a manifold $M$ and functions defined on it using tools from differential and integral calculus. % * manifold : maps charts atlas (man = class of equivalent atlasses) % * derivative % * differential When we want to study a function $f$ defined on the Euclidean space $\R^N$, we often resort to its linear approximation around the point $x'$, which is given by $$f(x) \approx f_{x'}(x) = f(x') + \sum_{k=1}^N \left. {\pd f \over \pd x_k}\right|_{x'}(x_k - x_k').$$ This function is the equation of the tangent line/plane/volume/\ldots of $f$ at $x'$. Since the function $f_{x'}(x) - f(x')$ is a linear function of $x - x'$, it is a vector living in the dual space of the vector space $\R^N$, which enables us to use linear algebra tools to study the behaviour of $f$. Interpreting $f_{x'}(x) - f(x')$ as a function of $x$ and $x'$ gives a bivariate function called the \emph{differential}. It returns a linear functional for every point $x'$ in $\R^N$. There is no need to stop at linear approximations. Indeed, there are a lot of tools which study $f$ using constant, linear and quadratic terms from its Taylor series. This is all very useful. But how can we carry over these methods to a manifold $M$ which is not isomorphic to $\R^N$? It has to look somewhat like $\R^N$ if we want to use our known tools, so we suppose it is at least locally isomorphic, meaning there is an invertible map called a chart $\phi : M \to U$ from a neighbourhood in $M$ to a subset $U$ of $\R^N$. Since the codomain of $\phi$ is the space of $N$--tuples, we can identify each of its components as $x^k : M \to \R$, and call them \emph{coordinate maps}. If the domains of two such charts $\phi_n$ and $\phi_m$ overlap, we call the function $\phi_n \circ \phi_m^{-1} : \R^N \to \R^N$ a transition function, and require it to be smooth on the overlapping part. A collection of overlapping charts is called an \emph{atlas}. % transition functions % Jacobian of transition functions This structure allows us to relate $M$ to $\R^N$ locally. We now define what a tangent vector is, without assuming there is any algebraic structure on the manifold $M$, which we would have if it were embedded in some higher dimensional Euclidean space.
In the embedded view, a tangent vector at a point $p$ can be used to partition all the curves passing through $p$. In the intrinsic view, we can turn this around and define a tangent vector as the equivalence class of all curves which have a similarity at $p$. For two curves $\sigma, \tau : \R \to M$ to be similar we require $p = \sigma(0) = \tau(0)$ and $${d x^k(\sigma(t)) \over d t} = {d x^k(\tau(t)) \over d t},\qquad k = 1,\ldots,N,$$ where the $x^k$ are the coordinate maps of some chart\footnote{If it is true in one chart, it is true in any chart due to the existence of transition functions: the partial derivatives in one chart transform linearly to the partial derivatives in another chart through the Jacobian of the transition function.}, and the derivatives are evaluated at $t=0$. We denote all the curves similar to $\sigma$ as the equivalence class $[\sigma]$. This defines a partition which behaves as a vector space\footnote{We can define vector scaling $\lambda [\sigma(t)] = [\sigma(\lambda t)]$, and addition as $[\alpha] + [\beta] = [\sigma]$, where every representative $\sigma(t)$ of this class satisfies ${d x^k(\alpha(t)) \over d t} + {d x^k(\beta(t)) \over d t} = {d x^k(\sigma(t)) \over d t} ,\qquad k = 1,\ldots,N$ in some coordinate chart.} which we will denote as $TM_p$. The differential at a point $p \in M$ is then defined as the linear map $$df_p : TM_p \to \R : [\sigma] \mapsto {d f(\sigma(t)) \over d t},$$ where $\sigma(t)$ is any representative of the equivalence class $[\sigma]$. The differential $dx_p^k$ is then defined as the functional for which $dx_p^k([\chi^j]) = \delta_{jk}$. In other words, $[\chi^j]$ is the class of curves for which ${d \over dt} x^k(\chi^j(t)) = \delta_{jk}$. This defines a basis/dual basis pair for $TM_p$ and $TM_p^*$. The derivative of $f$ along a vector $[\sigma]$, which takes the role of a partial derivative if this vector is one of the basis vectors of $TM_p$, is defined as ${d \over dt}f(\sigma(t))$, where $\sigma(t)$ is any representative of $[\sigma]$. So the differential maps a vector $[\sigma]$ to the derivative of $f$ along $[\sigma]$. Interpreting the point $p$ as a variable, we obtain a function $df : TM \to \R$ defined on the tangent bundle $TM = \bigcup_p TM_p$. This is basically the same as the one--dimensional Euclidean version, where the differential of a function $f(x)$ in one independent variable $x$ can be written as $$df : \R^2 \to \R : (x,\Delta x) \mapsto f'(x) \Delta x.$$ The $\Delta x$ can be seen to live in $TM_x$, while $x$ lives in $M$, which makes $(x,\Delta x)$ an element of $TM$. Entry: Curves in 3D Date: Sat Oct 29 13:21:55 EDT 2005 Type: tex Started reading Kreyszig's book on differential geometry. It starts with curves. The central expression to remember is Frenet's formulas $$\matrix{ \dot{t} \\ \dot{p} \\ \dot{b} } = \matrix{ 0 & \kappa & 0 \\ -\kappa & 0 & \tau \\ 0 & -\tau & 0 } \matrix{ t \\ p \\ b }, $$ \def\SO{\text{SO}} where $x' = {d x \over d t } = \dot{x} \|x'\|$. So $\dot{x} = { d x \over d s }$, where $s$ denotes arc length. This gives the local orthonormal set of vectors consisting of the unit tangent $t = \dot{x}$, the principal unit normal $p = k / \kappa$ with $k = \dot{t}$, and the binormal $b = t \times p$. Note that this matrix is skew--symmetric, making it an infinitesimal generator of a subgroup of the rotation group $\SO(3)$.
Calling the axis of the infinitesimal rotation $d$ we get $$\dot{t} = d \times t, \qquad \dot{p} = d \times p, \qquad \dot{b} = d \times b.$$ This reads as: ``Locally, every curve in Euclidean 3--space behaves as a helix.'' Or in other words, locally every curve goes through a screw motion. This is the approximation of a curve by a 3rd order polynomial parametrization in the arc length. This approximation makes a 3rd order contact with the osculating circle. Entry: Generalized Schur Date: Sat Oct 29 14:21:55 EDT 2005 Type: tex While it is possible to compute the Cholesky factorization of a symmetric positive definite matrix $R$ using the generator representation $R-ZRZ^T = GJG^T$, the result is a dense set of matrices. So, while computation time is $O(N^2)$ instead of $O(N^3)$ for the general case, intermediate storage is still $O(N^2)$ and not $O(N)$ as it is for $R$. Can this be remedied? Is there for example a way to compute the solution, or a number of solutions, without actually storing the triangular factors? The tricky thing here is the forward/backward substitution in solving the linear system $LL^T x = b$. So the formal question becomes: is it possible to translate the problem into some other formulation so the triangular factors can be consumed immediately? Is it at all possible to solve the equations $L^T x = y$ and $L y = b$ by using some kind of ping--pong method between two spaces? \def\I{\widetilde{I}} \def\RRR{\widetilde{R}} \def\L{\widetilde{L}} \def\x{\widetilde{x}} \def\y{\widetilde{y}} \def\b{\widetilde{b}} Using the argument of symmetry, there is absolutely no reason why we should not consider the decomposition $$R = LL^T = \L^T \L,$$ which does not contain the arbitrary choice of lower/upper triangular. The relation between $L$ and $\L$ can be expressed as $$L = \I\L\I,$$ which means the first column of $L$ is equal to the reversal of the last column of $\L$. Since $\I^2 = I$ and $\I\RRR\I = R$ for a symmetric Toeplitz matrix, we can use both representations. This smells like Levinson and the triangular decomposition of $R^{-1}$. Is it? Probably it is, since the Levinson algorithm actually finds a solution to a set of equations. [NOTE: see later about Levinson. This is very confusing.] If $R$ is a symmetric Toeplitz matrix, we can generate both $L$ and $\L$ at the same time; now what does this bring us? We have the sets of equations $$ \left\{ \begin{array}{ccc} L^T x &=& y \\ L y &=& b \\ \L^T x' &=& y' \\ \L y' &=& b \\ \end{array} \right.,$$ where $x = \x$ is the solution to $Rx = b$. Now using $L = \I\L\I$ and $\I y' = \y$, $\I x' = \x$, and $\b = \I b$ this becomes $$ \left\{ \begin{array}{ccc} L^T x &=& y \\ L y &=& b \\ L^T \x &=& \y \\ L \y &=& \I b \\ \end{array} \right..$$ Rearranging some more we get $$L^T \matrix{x & \x} = \matrix{y & \y},$$ and $$L \matrix{y & \y} = \matrix{b & \b}.$$ Can we update this? More about this later. Check Bultheel's paper and Kailath's paper about combining triangularization of $T$ and $T^{-1}$. Note: need to look at Mastronardi's PhD to see how QR and RQ play together. [See Sun Nov 6 12:47:15 EST 2005] Entry: Matrix Computations Date: Sun Oct 30 11:30:08 EST 2005 Type: tex \subsection{Levinson algorithm} Now that I finally understand the Schur algorithm, it is time to return to the Levinson algorithm, where instead of computing a triangular factorization, we update the solution to a special kind of symmetric Toeplitz system.
\def\x{\times} The main properties which are exploited in the Levinson algorithm are the symmetric Toeplitz ``rotate'' property $T = \I T \I$, which relates the forward and backward solutions $Tx = b$ and $T (\I x) = (\I b)$, and the Toeplitz shift property, which can be expressed as $$T_{k+1}\matrix{x_k \\ 0} = \matrix{b_k \\ \x}$$ and $$T_{k+1}\matrix{0 \\ x_k} = \matrix{\x \\ b_k},$$ with $T_k x_k = b_k$. The latter is basically the observation that the submatrices indicated by $\x$ in $$ \matrix{ \x & \x & . \\ \x & \x & . \\ . & . & . } \text{ and } \matrix{ . & . & . \\ . & \x & \x \\ . & \x & \x \\ }$$ are equal. In other words $T_k$ occurs two times in $T_{k+1}$.

Combining these two relations enables us to write a recursive relation between $T_{k} x_{k} = b_{k}$ and $T_{k+1} x_{k+1} = b_{k+1}$ as $$T_{k+1} \left( \matrix{x_k \\ 0} + \rho_{k+1} \matrix{0 \\ \I x_k} \right) = \left( \matrix{b_k \\ \beta_k} + \rho_{k+1} \matrix{\beta_k \\ \I b_k} \right).$$ Whether this relation is helpful to solve linear systems depends on the way $b_k$ and $b_{k+1}$ are related. In other words, if we can solve $$b_{k+1} = \left( \matrix{b_k \\ \beta_k} + \rho_{k+1} \matrix{\beta_k \\ \I b_k} \right)$$ for $\rho_{k+1}$, we can obtain $x_{k+1}$. Note that in this scheme, the first coordinate of $x$ is never updated, so we can choose it to be equal to 1. If we choose all coefficients of $b$ except the top one to be zero, and allow that top element to contain an arbitrary value (i.e. it is not part of the data), we are effectively solving the problem $Tx = -t$, where $t$ is the second row of $T$ and the vector $b$ contains a ``crutch'' to allow this system to be updated. This crutch is called the prediction error in linear prediction, and the set of equations above are called the Yule--Walker equations. The bottom equation will yield $\rho_{k+1}$, while the top relation will compute the next prediction error $\beta_{k+1}$.

Now, the interesting thing is that this method can be extended to solve systems $Tx = b$, with arbitrary $b$, by observing that the problem $Tx = -t$ occurs as the core problem in solving $T_{k+1} y_{k+1} = b_{k+1}$. Introducing the new unknowns $x_k$ and $\gamma_k$, $y_k$ and $y_{k+1}$ can be related as $$\matrix{T_k & t \\ t^T & \tau_k} \matrix{y_k + x_k \gamma_k \\ \gamma_{k}} = \matrix{b_k \\ \beta_{k}}.$$ Using $T_k y_k = b_k$, the top equation gives $T_k x_k = -t$, which can be solved using Levinson recursion, while the bottom one gives $t^T (y_k + x_k \gamma_k) + \tau_k\gamma_k = \beta_k$, which can be solved for $\gamma_k$ using the solution of the first equation.

Now why is this so damn ingenious? The way the symmetry is introduced is rather remarkable. The key to solving the system $Tx=-t$ is to introduce the prediction error. Let's try to re--invent the algorithm. Because $t$ is a column of $T$, we can put it where it belongs and rewrite the equation as $$\matrix{t & T}\matrix{1 \\ x} = 0.$$ Now, to restore full symmetry to the matrix, we can add the top row $$\matrix{t_0 & t^T \\ t & T}\matrix{1 \\ x} = \matrix{\beta \\ 0},$$ giving the prediction error $\beta$. Can we learn something from this? Yes. Exploiting symmetry can be obvious sometimes, but it pops up in strange ways. The self--similarity of the Yule--Walker equations leading to the Levinson recursion is a good example of this: instead of focusing on $Ax=b$ with nearly exploitable symmetry, we can try to see it as part of a larger structure which has better exploitable structure.
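To make the recursion tangible, here is a minimal Python sketch (my own; function and variable names are made up) of the Durbin form of this recursion, solving the Yule--Walker system $T x = -t$ for a symmetric positive definite Toeplitz matrix in $O(n^2)$, checked against a dense solver:

  import numpy as np

  def durbin(r):
      # Solve toeplitz(r[0:n]) z = -r[1:n+1] by the order recursion:
      # append the reflection coefficient alpha, update z with its reversal,
      # and track the prediction error beta.
      r = np.asarray(r, dtype=float)
      n = len(r) - 1
      z = np.zeros(n)
      z[0] = -r[1] / r[0]
      beta, alpha = r[0], z[0]
      for k in range(1, n):
          beta *= (1.0 - alpha**2)
          alpha = -(r[k+1] + r[1:k+1][::-1] @ z[:k]) / beta
          z[:k] += alpha * z[:k][::-1]
          z[k] = alpha
      return z

  # quick check against a dense solver
  r = np.r_[1.0, 0.5 ** np.arange(1, 8)]       # r_k = 0.5^k, positive definite
  n = len(r) - 1
  T = np.array([[r[abs(i - j)] for j in range(n)] for i in range(n)])
  z = durbin(r)
  print(np.allclose(T @ z, -r[1:]))            # True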
\subsection{Approximate updating}

\def\d{\Delta} Something I ran into yesterday. If $Ax = b$, how can we find the solution to $(A+\d A)x' = b$? Paying attention to which way to factor out $A$, we can write this as $x' = (I + A^{-1} \d A)^{-1} A^{-1}b$, where the parenthesized expression can be written as $\sum (-1)^k (A^{-1} \d A)^k$. Assuming $\| \d A \|$ is small, this series converges quickly. Using the factored form of $A$ this series is easy to evaluate. Another note is that $(1+ax)^{-1}$ has a binomial series expansion if for non--natural exponents we define $$\binom{x}{k} = {x(x-1)\ldots(x-k+1) \over k!}.$$ For more, check [1].

\subsection{Tensor calculus}

Been reading about tensors yesterday. I think I finally get what this covariant / contravariant thing is about. Lesson to learn: go to books instead of Wikipedia for really new things. An important distinction to make with matrices is whether they are used to represent coordinate transforms or quadratic forms. For example, using two contravariant vectors $x^i$ and $y^j$, a quadratic form $\left\langle x, y \right\rangle = a_{ij}x^i y^j$ is represented by a covariant form $a_{ij}$. The linear transformation $y^j = a_i^j x^i$ is represented by a mixed form $a_i^j$. In matrix lingo we can put this as $\bar{x} = R^{-1}x$, which gives the quadratic form $y^T A x = \bar{y}^T R^TAR \bar{x}$, and the linear transformation $\bar{y} = R^{-1}AR \bar{x}$. Both clearly do not transform in the same way under coordinate transform: the first is a congruence transform, while the other is a similarity transform. The missing thing here is of course contravariant matrices $a^{ij}$ which are related to bilinear forms of dual vectors.

%[1] http://www.math.niu.edu/~rusin/known-math/index/65FXX.html

Entry: Rational Approximation
Date: Mon Oct 31 08:43:43 EST 2005
Type: tex

Polynomials in polynomials, or mapping $x^n \mapsto q_n(x)$. The thing that triggered this is Chebyshev--Pad\'{e} approximation from AFCNA p. 304. First, a transform from a polynomial $p(x) = \sum_k^N p_k x^k$ to a new basis $q_n(x) = \sum_k^n q_{nk} x^k$ is a coordinate transform in the vector space of $N$th order polynomials. In general, Pad\'{e} approximation transforms the Taylor series $f(x) = \sum_k^\infty c_k x^k$ of a function to a rational function $a(x) / b(x)$, where $a(x) = \sum_{k=0}^p a_k x^k$ and $b(x) = \sum_{k=0}^q b_k x^k$. There are $p+q+1$ degrees of freedom in the $(p,q)$ approximation. We can safely take $b_0 = 1$ to fix the arbitrary scaling factor of $a(x)$ and $b(x)$. We impose $p+q+1$ constraints by requiring $f(x) - a(x)/b(x)$ to be maximally flat at $0$. This is equivalent to identifying the first $p+q+1$ coefficients in the series $f(x)b(x)$ with those in $a(x)$. In the case where $(p,q)\to(\infty,\infty)$ this leads to the infinite set of equations $$ \matrix{ c_0 \\ c_1 & c_0 \\ c_2 & c_1 & c_0 \\ \vdots & & & \ddots } \matrix{b_0 \\ b_1 \\ b_2 \\ \vdots} = \matrix{a_0 \\ a_1 \\ a_2 \\ \vdots}$$ or $C b = a$. The finite truncation of this gives $C\matrix{1 \\ b}=\matrix{a_0 \\ a \\ 0}$, where $C \in \R^{p+q+1 \times q+1}$ and $a$ and $b$ have coefficients $1\ldots p,q$. Rearranging some more we get $\matrix{C & -I}\matrix{1 \\ b \\ \hline a_0 \\ a} = 0$, which can be cast into $a_0 = c_0$ and the canonical form $$\matrix{L & -I \\ T & 0}\matrix{b \\ a} = {-c},$$ where $c$ contains coefficients $1\ldots(p+q)$, $L$ is a lower triangular Toeplitz (LTT) matrix, and $T$ is Toeplitz. This system is highly symmetric, so it should come as no surprise there are fast algorithms in the style of the Levinson recursion.
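As a concrete illustration of the construction above, here is a small sketch (mine, and deliberately naive: it solves the Toeplitz block for $b$ with a dense solver and then back--substitutes $a$, rather than using a fast Levinson--style algorithm) computing the $(2,2)$ Pad\'{e} approximant of $\exp(x)$:

  import numpy as np
  from math import factorial

  def pade(c, p, q):
      # (p,q) Pade coefficients from Taylor coefficients c[0..p+q]:
      # match the first p+q+1 terms of b(x) f(x) with a(x), taking b_0 = 1.
      c = np.asarray(c, dtype=float)
      # rows k = p+1 .. p+q:  sum_j b_j c_{k-j} = 0   (with c_m = 0 for m < 0)
      C = np.array([[c[k - j] if k >= j else 0.0 for j in range(1, q + 1)]
                    for k in range(p + 1, p + q + 1)])
      b = np.r_[1.0, np.linalg.solve(C, -c[p + 1:p + q + 1])]
      a = np.array([sum(b[j] * c[k - j] for j in range(min(k, q) + 1))
                    for k in range(p + 1)])
      return a, b

  # (2,2) Pade of exp(x) is (1 + x/2 + x^2/12) / (1 - x/2 + x^2/12)
  c = np.array([1.0 / factorial(k) for k in range(5)])
  a, b = pade(c, 2, 2)
  print(a)    # [ 1.   0.5   0.0833...]
  print(b)    # [ 1.  -0.5   0.0833...]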
For Chebyshev--Pad\'{e} approximation, all the coefficients $a,b,c$ are with respect to the basis $\{T_n(x)\}$ of Chebyshev polynomials, instead of $\{x^n\}$. This gives a better approximation in a minimax sense. Question: when doing Chebyshev--Pad\'{e} approximation, you need to start from $c_k$ obtained by projecting $f(x)$ onto the Chebyshev basis, and rearrange it in a matrix. I was thinking whether it would be possible to transform an LTT matrix like $C$ directly into another matrix $\bar{C}$ where $\bar{c}_k = f(c_0,\ldots)$ using some tensor transform. For a diagonal matrix it is simple. Still, a structure--preserving tensor transform seems to me something that does exist and is useful.

%[1] http://mathworld.wolfram.com/PadeApproximant.html

Entry: More..
Date: Mon Oct 31 09:43:43 EST 2005
Type: tex

\subsection{Differential geometry}

After not being able to really concentrate on matrix stuff, and giving up on operators in Hilbert spaces, I turned to Kreyszig again. Defined the tensor $b_{ij}$ containing the curvature information. For curves this is related to $\kappa$, just as the metric tensor $g_{ij}$ is related to the velocity vector. Hence, for curves it is possible to parametrize them in terms of $s$, such that $t = dx/ds$ is a unit vector. For surfaces this is of course not possible, since there are two independent directions, and the ``velocity'' is a tensor. However, it is possible to construct orthogonal parametrizations (but probably not for all surfaces/manifolds?), which at least diagonalize $g_{ij}$.

\subsection{Conjugate gradient}

The conjugate gradient method is an iterative way to solve the system $Ax = b$, in the case that $A$ is symmetric positive definite. This is used mostly for large sparse systems.

\subsection{Nonlinear equations and optimization}

Newton's algorithm for solving the set of nonlinear equations $f(x) = 0$ starts from the first order Taylor approximation at $x'$ given by $$f(x) \approx f(x') + J (x - x'),$$ where $j_{kl} = \left.{\partial f_k \over \partial x_l}\right|_{x'}$ is the Jacobian evaluated at $x'$. From this we determine the iterative method $$x_{k+1} = x_k - J_k^{-1}f(x_k).$$ From this we can derive several modified versions with different convergence properties, by writing them as $$x_{k+1} = x_k + t_k H_k f(x_k).$$

Entry: Quantum Calculus
Date: Tue Nov 1 11:17:26 EST 2005
Type: tex

Got another load of books yesterday. Probably the last, or I will exceed my allowed weight. First is a big one: Horowitz and Hill, The Art of Electronics. A classic, I've read. I'm not getting into that right now, takes me too far away from the current topics. Reference book. Then there's Weyl's book, The Classical Groups. Hard core group theory, I've read. Maybe this is a bit over my head. I don't know. Would be interesting to finally read a real math book \verb+:-)+ I also got Golub and Van Loan's Matrix Computations. One I should have gotten a long time ago. The last one is Conceptual Mathematics, a book on category theory, by Lawvere and Schanuel. Something I had to get into. Been reading it yesterday. It's nice, though a bit verbose. Maybe that's a good thing when it gets a bit harder\ldots So, onto pf stuff.

\subsection{Take 1}

Apparently, there's a branch of mathematics called \emph{quantum calculus}. There are two forms, the \emph{h--calculus} and \emph{q--calculus}, which replace the derivative by $${f(x+h)-f(x) \over h}$$ and $${f(qx)-f(x) \over qx-x}$$ respectively. The link between the two is $q = \exp(ih)$. For more info see [1].
%% Maybe the most important hunger I feel at this moment is that for %% justification. Why why why!

\subsection{Take 2}

Really, the quantum $h$--calculus is nothing more than discrete signal processing, where the derivative maps to the difference operator $$\lim_{h\to 0} {f(x + h) - f(x) \over h} \mapsto f_s \left[ f(x + {1 \over f_s}) - f(x) \right],$$ where we usually set $h = 1/f_s = 1$. Now, according to something I read, the more interesting modification is the $q$--calculus, where the derivative is replaced by $${f(qx) - f(x) \over qx - x},$$ with $q = \exp(ih)$, which reduces to the usual derivative when $q \to 1$.

%[1] http://math.ucr.edu/home/baez/week183.html

Entry: Levinson Algorithm
Date: Thu Nov 3 16:46:00 EST 2005
Type: tex

A bit more philosophical. This Levinson algorithm is a typical case of extending a problem to another one, one with higher dimensionality, which, due to symmetry, is easier to solve than the smaller problem. %% This idea also comes back in a more formal form in the theory of %% support vector machines.

The Levinson algorithm can be rewritten as a recursion for the solution of the proper system $Tx = b$, with a fixed RHS $b = e_1$, which is the same as the prediction error formulation, only with the RHS normalized. Dividing $x$ by $x_1$ gives the non--normalized solution. The recursion relations for the normalized system are $$T_{k+1} \left( \alpha_{k+1} \matrix{x_k \\ 0} + \beta_{k+1} \matrix{0 \\ \I x_k} \right) = \left( \alpha_{k+1} \matrix{b_k \\ \eta_k} + \beta_{k+1} \matrix{\eta_k \\ \I b_k} \right).$$ These can be solved by solving the symmetric Toeplitz system $$ \matrix{1 & \eta_k \\ \eta_k & 1} \matrix{\alpha_{k+1} \\ \beta_{k+1}} = \matrix{1 \\ 0}.$$ This clearly has the same form as the two solutions $T_n x_n = e_1$ and $T_{n+1}x_{n+1} = e_1$ it relates.

It is easy to see that linear prediction is basically orthogonalization: find an $x$ that is orthogonal to all rows of $T$ except the first one. Note that $T_n x_n = e_1$, but $T_n (\I x_n) = e_n$, so $a^H T (\I x_n)$ is zero whenever $a_i=0$ for $i \geq n$. This means the set of vectors $y_i, i \leq n$ derived from zero extending $\I x_i, i \leq n$ form an orthogonal basis for the inner product space determined by $T_n$. Suppose $l_i$ is the normalized set derived from $y_i$. This means $l_i^H T l_j = \delta_{ij}$, which gives $$\matrix{l_1 & \ldots & l_n}^H T \matrix{l_1 & \ldots & l_n} = L^H T L = I$$ or $$LL^H = T^{-1},$$ so the $l_i$ are the columns of the Cholesky factorization of $T^{-1}$.

Entry: Matrix Computations
Date: Fri Nov 4 14:46:43 EST 2005
Type: tex

Matrix Computations, Golub and Van Loan. Should have read that book a long time ago, instead of equating the world with the FFT. Anyways, it's interesting. Browsed Mastronardi's thesis and found out what's going on with the Hankel STLS problem. STLS1 solves a linear equality constrained LS problem, while STLS2 uses a (modified) Newton algorithm on the Lagrangian of the original non--linear optimization problem.

\subsection{Generalized Least squares}

Let's first have a look at two generalized least squares problems. The first is the equality constrained least squares problem (LSE), given by $$\min\|Ax-c\| \text{ s.t. }Bx=d.$$ The second is the Gauss--Markov linear model problem (GLM) $$\min\|y\| \text{ s.t. }Ax + By = d.$$ Both can be solved using the generalized RQ (GRQ) algorithm. If $B$ is square and nonsingular we can set $By = z$ where we minimize $\|B^{-1}z\|$, which is a weighted least squares problem.
But before we get into that, we first have a look at some shorthand formulas for matrix differentials. It's easy to go from $f(x)=ax^2$ to $f'(x)=2ax$. What is the analog for matrix expressions? The important thing to realize is that there are several kinds of derivatives, depending on how we mix functions. The simplest one is the differential of the parametrized vector $x(t)$, which is just the vector of differentials of the components of $x$ or $$dx_i = {dx_i \over dt} dt.$$ The dual of this is the differential of the functional $f(x)$, which we can write as an inner product between the gradient and the vector of differentials of $x_i$ or $$df = \sum {\pd f \over \pd x_i } dx_i.$$ For multivalued vector functions, the derivative is the combination of both $$df_j = \sum {\pd f_j \over \pd x_i } dx_i,$$ which we usually write in the shorthand form $df = J dx$, where $df$ and $dx$ are vectors of differentials, containing the differentials of the components, and $J$ is the Jacobian matrix, containing all the first order partial differentials of $f_j$ and $x_i$.

Let's have a look at some special cases. Affine vector function: $d(Ax+b) = Adx$. Bilinear form: $d(y^T A x) = y^T A dx + dy^T A x$. Setting $y = x$ gives the quadratic form (for $A = I$ the 2--norm squared) $d(x^T A x) = dx^T A x + x^T A dx$, which simplifies to $2 x^T A dx$ when $A$ is symmetric. Note that the matrix of a quadratic form in a scalar function can always be taken symmetric, as is the case for the Lagrangian used below. Note that when introducing the Lagrangian, we can usually clean up the equations a bit if we scale the multipliers $\lambda$ by an arbitrary constant, or equivalently, scale the cost function. The main reason for this introduction is to arrive at the Lagrange formulation for constrained optimization for a quadratic function with linear constraints, expressed above.

For the LSE problem we have a Lagrangian $$L(x,\lambda) = \frac{1}{2}(Ax-c)^T(Ax-c) + \lambda^T(Bx-d),$$ where the dimension of the Lagrange multiplier vector is determined by the number of constraints. The scalar differential $dL$ in terms of vector differentials $dx$ and $d\lambda$ is $$dL = dx^T [A^T(Ax-c) + B^T\lambda] + d\lambda^T[Bx-d],$$ where we write the differentials in the quadratic forms on the left so we can group them. This is possible because the terms that we sum here are all scalars. The extremum is reached when $dL = 0$, which means the vectors occurring in the scalar products with $dx$ and $d\lambda$ both vanish. This gives the set $$\matrix{A^T A & B^T \\ B & 0}\matrix{x \\ \lambda} = \matrix{A^Tc \\ d},$$ which clearly gives the unconstrained LS problem when $B=0$.

Doing the same for the GLM problem gives the Lagrangian $$L(x,y,\lambda) = \frac{1}{2}y^Ty + \lambda^T(Ax + By - d),$$ which has a differential $$dL = dx^T(A^T\lambda) + dy^T(y + B^T\lambda) + d\lambda^T(Ax + By -d).$$ Requiring $dL=0$ gives the set of equations $$\matrix{0 & 0 & A^T \\ 0 & I & B^T \\ A & B & 0} \matrix{x \\ y \\ \lambda} = \matrix{0 \\ 0 \\ d}.$$ Since the equation $y = -B^T \lambda$ is trivial we can simplify this to $A^T \lambda = 0$ and $Ax - BB^T \lambda = d$ or $$\matrix{BB^T & A \\ A^T & 0}\matrix{-\lambda \\ x} = \matrix{d \\ 0}.$$ The previous equation with $B=I$ gives an alternative representation of the ordinary least squares system where the errors are explicit.

\subsection{Root loci}

Actually, this is quite interesting. How do roots of polynomials and systems of polynomials move? There's an article in the Mathematical Intelligencer book about this.
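Before going further with the root loci: a small numpy sketch (mine, not from the text) checking that the LSE KKT system derived above reproduces the constrained minimizer obtained by the independent route of eliminating the constraint through the nullspace of $B$:

  import numpy as np

  rng = np.random.default_rng(1)
  m, n, p = 8, 5, 2
  A, c = rng.standard_normal((m, n)), rng.standard_normal(m)
  B, d = rng.standard_normal((p, n)), rng.standard_normal(p)

  # KKT system for min ||Ax - c|| s.t. Bx = d
  K = np.block([[A.T @ A, B.T],
                [B, np.zeros((p, p))]])
  x = np.linalg.solve(K, np.r_[A.T @ c, d])[:n]

  # independent check: particular solution plus the nullspace of B,
  # then an unconstrained least squares problem in the reduced variable z
  x0 = np.linalg.lstsq(B, d, rcond=None)[0]
  N = np.linalg.svd(B)[2][p:].T                 # columns span null(B)
  z = np.linalg.lstsq(A @ N, c - A @ x0, rcond=None)[0]
  print(np.allclose(x, x0 + N @ z))             # True
  print(np.allclose(B @ x, d))                  # constraint satisfied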
Then there's that knot/poly particle idea I had, and the ubiquity of statements about eigenvalues under perturbation of a matrix. Not going to make a list here, but maybe I can move my particle music stuff to matrices instead of polynomials? What is so intriguing about this? It has strange dynamics. %% Probably this is the reason why they say algebraic geometry leads to %% deep waters. Anyway, back to my basic books.

\subsection{Conjugate gradient}

It's a bit more elaborate than I thought. Plus there's a connection to the Lanczos method. Interesting, but somehow not surprising, since both have to do with orthogonalization. I think I understand the basics, after trying the local minimizing gradient method. The idea is to always search in directions $p_k$ orthogonal to $Ap_i$ for $i < k$.

7. empathy
8. morality
9. intuition (re-representation of body state, enables empathy)

These are connected to mindfulness, being present and aware (aware of awareness, attention to intention). Creating those integrated states through practice changes the structure of the brain. States become traits with practice. Effects: change to "approach state" as opposed to "withdrawal state". Immune system functions and several physiological properties improve.

[1] http://www.youtube.com/watch?v=Gr4Od7kqDT8

Entry: Scale feedback
Date: Wed Jun 10 14:25:39 CEST 2009

I worked a lot with wavelets during my time at the uni. Repeatedly I ran into the idea that it might be interesting to figure out what it means to feed information from a higher band into a lower band: letting localized small-scale phenomena influence large-scale ones. I didn't run into any use for this, nor did I encounter this in any formalized form, but it seems to be related to self-similarity, fractals and deterministic chaos. Let's look at the former.

If you look at a nonlinear dynamical system that can exhibit chaotic behaviour, it has the following properties:

- local instability
- global stability

Meaning that if you apply local linearization, there is at least one axis that experiences exponential growth. This growth is then limited on a larger scale by "bending" trajectories back into regions that exhibit local instability.

The initial conditions of a chaotic system can be represented as a finite set of real numbers. Suppose a typical 3-dimensional system. Real numbers can be represented by their binary expansion:

x(0) = .1001001010...
y(0) = .1110011001...
z(0) = .1011001110...

Folding the bits into each other gives a single sequence.

s(0) = .111010011101000010111001101010...

An initial condition thus has a countable number of bits of information. Computing the state vector at any time in the future can be done by integrating a set of differential equations. This can be done in such a way that the error is controllable. When the initial conditions are given with enough precision, it is possible to bound the error of the result and drive it arbitrarily low. (This probably needs some constraints on the continuity of the equation though..) What makes chaotic systems special is that for a given output precision, the number of input bits necessary _grows linearly_ with time. For locally stable differential equations this is not the case. What a chaotic system does is lifting information from the depth of the binary expansion into the observable world. This is essentially scale feedback: information that lives on a small scale is lifted into large scale.
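A toy illustration of this (my own sketch, not from the original entry): the doubling map x -> 2x mod 1 literally shifts the binary expansion one position per step, so a one-bit "measurement" at time t reads off bit t+1 of the initial condition, and the required input precision grows linearly with time.

  from fractions import Fraction

  def bits(x, n):                       # first n binary digits of x in [0,1)
      out = []
      for _ in range(n):
          x *= 2
          out.append(int(x))            # integer part = next bit
          x -= int(x)
      return out

  x0 = Fraction(1, 7) + Fraction(1, 2**20)   # exact arithmetic, no roundoff
  x = x0
  observed = []
  for t in range(24):
      observed.append(int(x >= Fraction(1, 2)))   # one-bit "measurement"
      x = (2 * x) % 1                             # the chaotic map

  print(observed)
  print(bits(x0, 24))                   # identical: small-scale bits surface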
As a matter of fact, the sequence s(0) resembles a _program_ executed from left to right, generating the entire future behaviour. The computer on which it runs is an integrator, which maps a sequence of bits to a sequence of bits, given a time difference (represented as a sequence of bits).

Questions:

* Given a classical example of a chaotic oscillator, what happens when you run it backwards?

Entry: What is a polynomial?
Date: Wed Jun 10 22:23:52 CEST 2009

Might sound silly, but if you look at a polynomial in RPN it's actually easier to understand. The polynomial

  a x^2 + b x + c

is written in RPN as

  0 x * a + x * b + x * c +

which is Horner's rule[1] for polynomial evaluation. (The first x * is there only for symmetry.) This leads to a straightforward geometric interpretation: a polynomial is a function that performs "iterative scaling and displacement". A power series is then the infinite extent of this:

  0 ... x * a_2 + x * a_1 + x * a_0 +

Somehow this makes more sense: zoom-shift-zoom-shift...

Now that's one way of seeing polynomials. Another is by collecting them into a set: the set of all (finite) polynomials is closed under addition, multiplication and composition. Are there any other entities that have this property? Polynomials are functions P : R -> R. The functions + and * can be lifted from R x R -> R to P x P -> P. Functions already have a "native" binary composition P x P -> P. This yields an algebraic structure with 3 binary operations: + * o. What is this structure called? (Addition preserves order, multiplication adds order and composition multiplies order.)

[1] http://mathworld.wolfram.com/HornersRule.html

Entry: Abstraction before Experience
Date: Sat Jun 13 14:08:30 CEST 2009

As abstracted from my own experience: people learn by example. It is futile to try to explain a model before you give any examples of real-world "stuff" that the model is trying to explain. The only people this does work for are mathematicians, since this is basically what they do.

[1] entry://../staapl/20090611-231018
[2] http://thefarmers.org/Habitat/2004/04/you_cant_tell_people_anything.html

Entry: Math dreams
Date: Sat Jun 13 16:31:29 CEST 2009

I did this before: recording weird math intuitions when I was studying some numerical algorithms theory. Mostly stuff related to linear algebra and calculus. Maybe it's a good time to fish up some ideas? Most of it was just consolidation: marveling at finding connections that I'd now label as obvious. One of them I find quite startling still: the importance of causality in triangular decompositions of matrices. The fact that you can factor a linear transform in a forward/backward way still strikes me as odd. Let's see if I can find the post..

EDIT: I started moving stuff from the old .tex blogs to here + I'm working on some way to render .tex to .png on the server.

Entry: Computation, Circuits and Physics.
Date: Sun Jun 14 14:44:11 CEST 2009

(Reworked deleted post from Mon Oct 8 15:53:12 CEST 2007) An attempt to link two ideas:

* The circuit-view of computation: complexity theory talks about size of static boolean circuits instead of run-time of Turing machines. The former has a simpler mathematical structure. Its essential property is that it makes _dependence_ explicit. State machine transitions leave this implicit: produce some state which might be collected later, or might not. Essentially, removing _time_ from the picture makes things simpler. Time is made implicit, no longer ordered. (Dataflow turns ordered time into a lattice.)
* Taking this apart into 2 processes: physical interpretation of computation as reversible _transformation_ followed by _selection_ (projection). The former is related to time evolution and the concept of energy, while the latter is related to projection / selection / wavefunction collapse, decoherence, ...

CIRCUITS

By ignoring the actual operation, but concentrating on the data dependency, the same pattern emerges in all computations. The essential concept is the function, the many to one map, the integration of information through selection. Between the mathematics of the natural numbers and the physical reality of computers there is only one essential difference: physical circuits cannot deal with infinite structures (like infinite digit sequences). However, they can usually deal with enumerable[1] structures up to the point of interest. What this means is that it's possible to keep thinking about infinite structures when modeling computation, but keep in mind that if you don't compute infinite results, you can get by with using only a finite amount of data from the infinite models[3]. Now keep in mind that this is not that surprising if you realise that all programming languages eventually boil down to physical circuits, so in some sense I'm stating the obvious. However, looking at how this _actually works_ still amazes me: everything eventually maps back to "selective" logic circuits (AND/OR), and all concepts are abstractions on top of this.

EXAMPLE: LINEAR FILTER (inner product)

Take the concept of linear filtering. It is several levels of abstraction removed from logic circuits, but it can be expressed quite simply. Linear filtering is the isolation of some information in a bunch of other information by transforming (rotating) it so it will stand out in an average: all parts one is not interested in will simply cancel each other out. It is interesting to look at how linear filtering is implemented, and how this high-level process of cancelling / integration relates to more elementary processes of information cancelling / integration.

1. Natural numbers can be represented by infinite bit sequences (2-adic numbers[2]). Taking double-infinite series this then leads to a representation of real numbers.

2. Addition of two numbers a, b, is the "carry connecting local computation", the local computation being the XOR of 3 inputs (a_i, b_i, carry_i). This rippling carry effect connects every bit in the output with all the input bits below it. One could call addition the "causal" operation.

3. Multiplication of two numbers can then be considered as "acausal": each bit in the input is connected to each bit in the output.

(Represented) real numbers with addition and multiplication, together with some laziness tricks to avoid infinite sequences and allow implementation on finite machines, are all that is necessary to make the concept of linear filters implementable.

TRANSFORMATION

Now, the interesting thing is to split computation into transformation and selection. In some sense, transformation is free in physics: it is time-evolution. Selection is then related to structure. Real/complex vector spaces are better suited for models of physical systems. So can't this be exploited in a better way? (Sat Apr 7 11:40:04 EDT 2007)

[1] http://en.wikipedia.org/wiki/Enumerable
[2] http://en.wikipedia.org/wiki/P-adic
[3] http://en.wikipedia.org/wiki/Lazy_evaluation

Entry: Logic vs. Math
Date: Sun Jun 14 15:39:03 CEST 2009

I'm trying to understand the basics of mathematical logic.
Let's try to answer the question: "What is the difference between a theory/logic and a model?", by looking at propositional logic vs. boolean algebra. Entry: Computation and Topology Date: Sun Jun 14 15:53:38 CEST 2009 (Reworked from Sat Apr 7 11:40:04 EDT 2007) The simplest model of computation is cicuits, which are directed a-cyclic graphs[1] or possibly lattices[2]. Now I wondered, can DAGs be made flat, or is "knotting" an essential part when you try to embed it into a topological structure. How does this relate to linear logic and reversable computation? Instead of not allowing destruction, what about not allowing copy/memoization? How does conservation of energy relate to conservation of "cons cells" in a tree rewriting machine? Permutation (automorphism) is actually ``Conservation of Energy'' for discrete structures. There was this crazy idea in the original post: 1) Waves going through space are boring. 2) Waves going through a knotted topology are interesting (essentially circuits). 3) Instead of looking at 3D space as the thing in which networks are embedded, why not turn it around? Maybe 3D is just the most random network? Space is disorder, therefore infinitely boring? [1] http://en.wikipedia.org/wiki/Directed_acyclic_graph [2] http://en.wikipedia.org/wiki/Partially_ordered_set Entry: Random truth Date: Mon Jun 15 11:29:57 CEST 2009 I've been reading a bit about incompleteness lately. I've read Meta Math[1] before, and a week ago I read a transcript[2] of a presentation Chaitin gave about his work. He's a funny guy, though a bit arrogant. He's a friend of Stephen Wolfram. Judging from how they write, I can imagine those two getting along :) One of the things that intrigued me and never realized until reading [1] is that computers are so central to the idea of incompleteness. I've read stuff about Goedel's proof before, but I don't seem to have gotten so much from it. What Chaitin said is that Goedel invented Lisp! Now, that got my attention. I still can't explain it to my own satisfaction, but it probably has to do with this meta-circularity of the lisp interpreter, and the relation to Goedel numbering of proofs. Somewhere there is this "level-shift" which Hofstadter talks about. I'd like to explore this a bit more. Programming and writing programming languages and compilers has given me a lot of intuitive understanding about computation. Maybe some more formal ideas won't hurt.. [1] isbn://1400077974 [2] http://www.cs.auckland.ac.nz/~chaitin/cmu.html [3] http://en.wikipedia.org/wiki/Gödel's_incompleteness_theorems Entry: Code vs. Data Date: Mon Jun 15 13:08:18 CEST 2009 If you look at the circuit model of computation, it turns out that dependency and connectedness are key ideas. Does it make sense to look at programs also _only_ from the data-flow point of view? In the end, control structures in programming languges are essentially re-used data dependency subgraphs, if you translate it to a circuit. So why is functional programming cool again? They make data dependencies explicit, allowing you to _move away_ from the iterated machine approach, where you essentially loose dependency information due to random access. Entry: From Wow! back to normal. Date: Mon Jun 15 14:22:47 CEST 2009 This happens to me quite often: I'm trying to solve some problem I can't get a handle on.. Then by manipulating it, I am eventually able to place the _problem_ into a larger picture which then leads to a very elegant solution. 
This makes me very excited ("This is going to change the world!") until I place the _solution_ itself into a wider context, and see that it isn't so special after all. All that happened is making an (initially) non-trivial link between two distinct representations of knowledge. Once the link is there, the whole thing seems trivial and boring, because this jolt of insight changed the way I'm thinking about the subject. I guess this is why I love math. It turns me into a hero for 5 seconds, and then places my feet back on the ground to go looking for more hits. Entry: 24 Date: Thu Jun 18 11:05:16 CEST 2009 I was originally intrigued by a post from TWF[2]. It has popped up in many other places, and is related to some interesting mathematical structures. Maybe a good starting point to do some exploring? [1] http://en.wikipedia.org/wiki/Dedekind_eta_function [2] http://math.ucr.edu/home/baez/TWF.html [3] entry://../math/20090609-145519 Entry: Fixed Points Date: Thu Jun 18 12:58:32 CEST 2009 Maybe the single most important principle in numerical computation: if you can't compute it directly, try to build a dynamical system that converges to something you can use to compute it directly. What I mean: iterative numerical algorithms take a model with a simple implementation namely update equations of discrete time dynamical systems, and use knowledge about their behaviour in the form of some mathematical theory consisting of a collection of axioms and theorems, to get to approximation of entities that are not directly computable, simply by observing the behaviour of these systems. Essentially this brings one from algebra, which in its basic form as usable on a computer is very well understood, to analysis (through limits) and theories _about_ these simple algebraic structures, which can lead to incredible diverse theories. Entry: How to do math? Date: Thu Jun 18 15:39:13 CEST 2009 (At least, what works for me?) 1. The human brain abstracts from examples, not the other way around. Do not attempt to go too far without doing active manipulations (exercise!). Mathematics is not a spectator sport. 2. Try to intuitively understand the basic concepts (definitions). If they are too abstract this means that you don't understand well enough the theory which _suggests_ these abstractions. Usually new definitions come from patterns that cannot be properly abstracted in some basic theory, but making the theory more abstract usually does expose the patterns. 3. Look at the most important theorems. Try to understand how they work intuitively, then (very important!) look at the proofs, or at least the core elements. Entry: 2D waveguides Date: Thu Jun 18 16:21:54 CEST 2009 (Not dated. From problems.tex) In one dimension, the solution of the wave equation can be split into two traveling wave equations, which can be easily discretizised using waveguides. In more dimensions, this doesn't work. How to work around that? Basically, I'd like to know what exactly is the complexity and error trade--off for simulating 2D and 3D waves. Probably, the only way to do this is using finite element methods, but, there must be some hacks floating in nD hack space. Entry: Our nonlinear ears Date: Thu Jun 18 16:22:43 CEST 2009 (Not dated. From problems.tex) Most of the current problems in computer music and musical signal processing are actually problems in ``Artifical Intelligence''. We, or some of us at least, are lucky that most sound producing phenomena are reasonably linear, or at least, have tractable non--linearities. 
Once we enter the domain of perceivable qualities, hell breaks loose. There is scarce information about how things actually work, so introspection and intuition prevail, leading to a patchwork of heuristics and seemingly original work. Can we do something about it? Two interesting problems here are pitch detection and source separation.

Entry: Cleaning up the papers/ dir
Date: Thu Jun 18 16:24:27 CEST 2009

So I'm collecting some ideas I wrote down over the years in this blog. Supposedly, separating them up into bite-size chunks will make it easier for the ideas to reach their final destination, be it the dust bin.

Entry: Geometric Algebra and Computer Music
Date: Thu Jun 18 16:32:01 CEST 2009
Type: tex

% More specifically, I will investigate for several manifolds what the % concepts of ``modulation'' and

[ Raw. Needs rework. ]

\section{Clifford Algebras}

A thing which has always intrigued me is the algebraic aspect of sound modulation. For example, two oscillators (oscillations) $a (t)$ and $b (t)$, when combined, produce a third oscillation $c (t)$. Take for example mixing and cross--modulation. These are no more than the operations of addition and multiplication in the field of the real numbers. If we make the functions take their value in the field of the complex numbers, we can construct a notion of oscillation easily by using complex multiplication to generate it. Complex numbers seem to have rotation \emph{built in}. This can be used to give a lot of sound generating and processing operations an algebraic feel. In general, this is the subject of geometric algebra.

In layman's terms, we can construct a geometric algebra, or Clifford algebra, by constructing a new vector space spanned by all distinct products of $n$ orthogonal basis vectors of an $n$--dimensional vector space $V$. This gives us a $2^n$ dimensional \emph{multivector} algebra. A Clifford algebra is a quotient of the tensor algebra, which is nothing more than the algebra constructed from a vector space, using the tensor product as multiplication. The tensor product can be seen as the \emph{most general bilinear product}, or most general multiplication of vectors. This produces a graded algebra, which is a direct sum of homogeneous subspaces for the vector addition. You can see this as a generalization of splitting the space of all polynomials into a direct sum of spaces of homogeneous polynomials of degree $n$. Other multiplications can be defined by restricting this most general multiplication. The way to do this is to impose an equivalence relation on the tensor product. For Clifford algebras used in geometric algebra, this equivalence relation is defined by the ideal generated by $\{v \otimes v - Q(v)\,1\}$, with $Q (v)$ the quadratic form defining the algebra. This means that the tensor square of a vector is identified with the (scalar) value of the quadratic form, with all of the other structure coming from the tensor product.

For example, take $V$ to be the 2--dimensional space spanned by two orthogonal vectors $\{\sigma_1, \sigma_2 \}$. We define a product satisfying $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, $\sigma_1 \sigma_2 = - \sigma_2 \sigma_1$. The set of basis vectors for the resulting Clifford algebra is the group generated by these 2 elements using the multiplication rules above, which gives an 8 element group $\{\pm 1, \pm \sigma_1, \pm \sigma_2, \pm \sigma_1 \sigma_2 \}$.
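A small sketch to make this concrete (my own choice of representation, not from the text): realize $\sigma_1, \sigma_2$ as $2\times 2$ real reflection matrices and close them under multiplication; the closure has exactly the eight elements listed above.

  import numpy as np
  from itertools import product

  s1 = np.array([[1, 0], [0, -1]])      # reflection about the x-axis
  s2 = np.array([[0, 1], [1, 0]])       # reflection about the line y = x
  I  = np.eye(2, dtype=int)

  assert (s1 @ s1 == I).all() and (s2 @ s2 == I).all()
  assert (s1 @ s2 == -(s2 @ s1)).all()              # anticommutation
  assert ((s1 @ s2) @ (s1 @ s2) == -I).all()        # a 90 degree rotation

  # close {I, s1, s2} under multiplication -> the 8-element group
  elems, done = [I, s1, s2], False
  while not done:
      done = True
      for a, b in product(elems, repeat=2):
          c = a @ b
          if not any((c == e).all() for e in elems):
              elems.append(c)
              done = False
  print(len(elems))       # 8: {+-I, +-s1, +-s2, +-s1 s2}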
This is the Dihedral group representing 2 (orthogonal) reflections of the plane $\sigma_1$ and $\sigma_2$, and their composition, $90$ degree anticlockwize rotation $\sigma_1 \sigma_2$, which forms an order $4$ subgroup. Over the reals, this generates a basis for a 4--dimensional vector algebra. As the symmetry group generated by the basis vectors suggests. Entry: Manifolds and Sound Date: Thu Jun 18 16:35:12 CEST 2009 Type: tex With $\mathcal{M}$ an arbitrary (Riemannian) manifold, how to make sounds by defining functions $\mathcal{M} \to \mathbb{R}$ (wave shapers) and curves $\mathbb{R} \to \mathcal{M}$ (modulators) on these spaces? In the case where $\mathcal{M}$ is the complex unit circle, we have the basic theory of wave shaping and modulation, which is ``playing with sine and cosine''. This story started a couple of years ago (2002-2003?) when I was using complex and quaternions multiplication and addition to perform sound modulation. I then took a detour reading about Lie groups and differential geometry and lost track of the original idea, but the basic driver has stayed the same: I'd like to get a feel for navigating manifolds like SO(3) the group of 3D rotations (the 3D projective space) and its double cover and SU(2), the group of unit norm quaternions (the 3--sphere). I'd like to do this by making sound, to see if there are any interesting algorithms hidden here, or if everything just reduces to combinations of known things. Entry: Oscillations Date: Thu Jun 18 16:43:57 CEST 2009 Type: tex \def\GF{\text{GF}} \section{Introduction} Physics is about symmetry, about invarance under transformation. I'm writing this to order my thoughts about motion, rotations, oscillations and wave equations a bit better. %% [FIXME] Split everything in oscillations due to symmetry (constants %% of motion), and others. Maybe incorporate the QM picture: %% observables generate symmetry operations. \section{Motion} In the abstract, motion is the trajectory of a point $x(t)$ through a \emph{phase space} $X$, which is a set of points representing all the possible states of a system. % The parameter $t$ is referred to as \emph{time}. % A \emph{space} is a set with some notion of \emph{distance}. Conservative systems are a special kind of systems where the trajectory is restricted to a subspace satisfying an equation $E(x) = E_i$, where $i \in I$, the index set of all possible energy levels. In other words, the energy equation creates a partition $\{X_i\}$, where each $X_i$ is a constant energy subspace of $X$. For conservative motion, each trajectory $x(t) \in X_i$, with $t \in T$, the set of parameters called \emph{time}. The set $T$ is equipped with an order. The trajectory will have the form $$x(t) = g^t x(0) = \exp(w t) x(0).$$ Here $g$ is the generator of motion. The trajectory can be seen to be a parametrized morphism of the constant energy subspace $X_i$. This equation is taken as the definition of the exponential function, which mainly relates $g$ and $w$. The parameter $w$ is the representative of the linear equation of motion, or \emph{differential equation}. The structure preserved by $G_i$ is the algebraic structure used to express the equation of motion, which is here parametrized by $w$, the \emph{momentum}. I will now relate these abstract entities to several spaces $X$, with associated spaces $T$ and $I$. %% talk about temperature and entropy \section{Discrete systems} In this case all of $X$, $T$ and $I$ are taken to be finite or countable. 
\subsection{Binary system}

The simplest nontrivial state space that fits the abstract setting above is $X=\{(0,0), (0,1), (1,0), (1,1)\}$. Using the energy equation $E(i,j) = i + j$, there are $3$ possible energy levels $E \in \{0,1,2\}$, giving rise to the partition $\{X_i\}$ with $X_0 = \{(0,0)\}$, $X_1=\{(0,1),(1,0)\}$ and $X_2=\{(1,1)\}$. The only nontrivial trajectory can be found in $X_1$. To write equations of motion, some algebraic structure on $X$ is necessary. Take $X$ to be the two--dimensional vector space over $\GF(2)$. The equation of motion can be written as $$x(t+1) = \matrix{0 & 1 \\ 1 & 0} x(t) = Gx(t),$$ or $x(t) = G^t x(0)$, where $G$ will leave all spaces $X_i$ invariant. Identifying only one of the components as position yields an oscillation, where the other state is the hidden state, or momentum. Here we started from a space and constructed a compatible equation of motion, but in general one works the other way: given the equation of motion, construct the geometry of phase space.

Keeping the base field at $\GF(2)$, but increasing the dimensionality of the vector space, allows the expression of linear recursive equations like $$x_0(t+1) = \sum_i^{N-1} g_i x_i(t) \text{ and } x_{k+1}(t+1) = x_k(t),$$ which are used to generate pseudo--random bit sequences for example. This can still be expressed as $x(t) = G^t x(0)$ or $x(t+1) = G x(t)$. Conservative cellular automata fall in the same category. These are based on exchange operations similar to $G$. The typical mode of operation is to have two partitions, $P_0$ and $P_1$, of an $N$--dimensional grid, such that each element of the even partition has a maximal local overlap with the elements of the odd partition, and vice versa. This is the essential coupling necessary to get wave--like phenomena. % Sorting networks are related to this too, but have probably more in common with the heat equation. %% NOTE: rectilinear motion and harmonic oscillation are the same

\subsection{Finite fields}

%% The simplest possible set $X$ with non--trivial algebraic structure is %% $\GF(2^2)$, where $X=\{0,1,x,1+x\}$. Addition is modulo $2$, while %% multiplication is modulo $1+x+x^2$.

Looking at finite fields, there is a little more structure than the vector spaces, namely the elements can be multiplied. Let's have a look at the Galois fields $\GF(2^N)$ which can be represented by formal polynomials over $\GF(2)$, modulo an irreducible polynomial $m(x)$. The algebraic structures are equipped with multiplication and addition, which is enough to express the ideas of linear difference/differential equation and exponential, central in this paper. There are two basic structures to create oscillations, which are both related to each other. The first one deals with the multiplicative group, which is abelian. This group is also finite, which means it consists of a collection of primitive cycles of period $N_i$, such that $g_i^{N_i} = 1$. There can be several irreducible polynomials $m(x)$ that give a representation of the field $\GF(2^N)$, so it can be beneficial to choose the ``best'' one. Being only interested in cycles, I want the generator to be as simple as possible, meaning $x$, which corresponds to a simple shift. If $x$ is a generator of the largest possible cycle, of length $2^N - 1$, the modulo polynomial $m(x)$ is called primitive. %% Let's study the effect of $G$ on the basis $\{1, x, x^2, x^3\}$. This %% gives $G(1) = x + g_0$, $G(x) = x^2 + g_1$, $G(x^2) = x^3 + g_2$ and %% $G(x^3) = g_3$.
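A minimal sketch of the ``multiply by $x$ modulo $m(x)$'' shift (a Galois LFSR; my own code, and the choice $m(x) = x^4 + x + 1$ is just a standard primitive example): the orbit of $1$ runs through all $2^4 - 1$ nonzero field elements before repeating.

  M = 0b10011          # m(x) = x^4 + x + 1, bits are polynomial coefficients
  N = 4

  def mulx(a):
      a <<= 1                       # multiply by x
      if a & (1 << N):              # degree reached N: reduce mod m(x)
          a ^= M
      return a

  state, orbit = 1, []
  while True:
      orbit.append(state)
      state = mulx(state)
      if state == 1:
          break

  print(len(orbit))                              # 15 = 2**N - 1
  print(sorted(orbit) == list(range(1, 2**N)))   # True: all nonzero elements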
\subsection{Decimal expansion} It is well known that a fraction represented as a decimal expansion in base $x$ contains infinitely repeating terms if the denominator contains prime factors which do not occur in $x$. In general, a rational number can be represented as $${n \over d} = x^p (a_0 + a \sum_k x^{-Pk}) = x^p(a_0 + a {x^P \over x^{P} - 1}) = {Nx^p \over x^P - 1},$$ where $p$ denotes the position of the point after which the decimal series starts to repeat, and $P$ is the period of repetition. Anoter way of writing the fraction is $${ a \over b} = \prod_k p_k^{m_k},$$ where $p_k$ represents the $k$th prime number. This can be identified with the previous expression by forming the product $$\prod_{k \in K} p_k^{|m_k|},$$ where $K$ is the set of indices for which $m_k < 0$ and $p_k$ does not divide $x$. All the other factors can be incorporated in $N$ and $p$. Now find the smallest $n$ such that the expression above divides $2^n-1$. \subsection{Integers and rational numbers} Starting from Pythagorian triples, which are integer solutions to $a^2 + b^2 = c^2$ interpreted as the energy equation, it is possible to create rational conservative systems $x(t+1) = Gx(t)$, where $$G = {1 \over c}\matrix{a & -b \\ b & a}.$$ This works for example for the famous $(a,b,c) = (3,4,5)$. Sets of those can be computed from two arbitrary integer numbers $m>n$, using $a = m^2 - n^2$, $b=2mn$ and $c=m^2+n^2$. When the relation holds, the determinant of the matrix is $1$, which means $G$ and all powers $G^k$ preserve the $2$--norm or energy, and are thus conservative. Time $T=\mathbb{Z}$. The phase space $X=\mathbb{Q}^2$, the spaced of ordered pairs of rational numbers. There are an infinite number of constant energy partitions parametrized by all positive rational numbers, so $I = \mathbb{Q}^+$. \subsection{Cellular automata} I'm curious about two versions of cellular automata which enable analysis in the way I present here: linear CA, which can be analyzed using generating functions or polynomials in the finite case, and the slightly more general reversible CA, where time evolution is a group action, probably always a composition of elementary bit swaps? %% \subsection{Discrete circles} %% Another thing to look at is the classic discrete circle drawing %% algorithm. Viewing this as a conservative system requires some %% fiddling. The phase space here is again $2$--dimensional, represented %% by the pair $p,q$. The energy of the system is $E(p,q) = p^2 + %% q^2$. Suppose that motion does not require this energy balance to be %% constant all the time, but minimally different from $E_0$, given that %% the updates are a binary decision problem given by %% $$q^+ = q + \Delta_p\sigma(p) \text{ and } p^+ = p - %% \Delta_q\sigma(q),$$ where $\sigma(x)$ represents the sign function %% and the $\Delta_p, \Delta_q \in \{0,1\}$ are chosen such that %% $$B(p,q) = p^2 + q^2 - E_0,$$ the energy bucket, which can contain %% negative or positive energy, remains as close as possible to zero. In %% practice only one of $\Delta_p, \Delta_q$ will change per step, which %% one determined by the absolute values of $p$ and $q$. Moreover, $B$ %% can be updated between time steps. %% \subsection{Oscillation versus rectilinear motion} %% Note that rectilinear motion always produces oscillatory behaviour in %% finite fields, which is not very surprizing due to the finiteness. %% Given two variables $p$ and $q$, with the update equation $p[k+1] = %% p[k]$ and $q[k+1] = q[k] + p[k]$. 
In such a system, the energy is %% purely kinetic, maining $p[k] = p[0]$ is the energy equation. %% Another way of phrasing this is that harmonic oscillation is %% rectilinear motion on a manifold constrained by the energy %% equation. So both are really the same, it is the shape of the space %% which is different. %% One thing which strikes me is that motion seems to have a lot to do %% with morphisms of fields. There is this pattern: for a finite %% dimensional state space, an equation of motion is a linear update %% equation, expressed in terms of field multiplications and %% additions. Due to the linearity of these updates equations, it is %% possible to rephrase them in terms of automorphisms of the state %% space. %% So, motion is a one--parameter submanifold of a constant energy %% submanifold of phase space, the space of possible configurations of a %% system. The one--parameter submanifold is a trajectory generated by %% $g^t = \exp(t a)$, where $g$ is a generator of a subgroup of the %% automorphism group of the constant energy manifold, and $a$ is the %% momentum. %% \subsection{Hacks} %% \subsubsection{Truncation} %% An interesting question is wether it is possible to efficiently find %% numbers like that arbitrarily close to a certain angle. Taking $n=1$ %% it is possible to go arbitrarily close to $0$. If so, it would be %% possible to ``reduce'' a complicated fraction to a simpler one. This %% would allow for an oscillator which never leaves the unit circle, but %% can be reset by allowing a small phase error when truncating the %% representation. %% \subsubsection{Cordic} %% If logarithms where cheap, running exact discrete filters in the log %% domain would make updating them fairly trivial. Oscillation is just %% finite wordlength wraparound, while decay is increment and truncation %% of the log of the amplitude. I'm curious how cordic fits in this %% picture, and wether there are % cordic, discrete log \section{The continuum} Here I take $T=\mathbb{Z}$, and $X=\mathbb{R}^N$. \subsection{Real and Complex numbers} In one dimension there's only the function $f(x) = -x$ to generate the trivial binary oscillation. Using complex numbers or multi--dimensional linear transformations allows the representation of true rotations and oscillations as multiplication $f(x) = ax$, which gives $y[k+n] = a^n y[k]$. Trying to do get oscillations with real numbers requires a two--valued system. The discrete version of the harmonic oscillator with $q$ and $p$ as state variables is $$q[k+1] = q[k] + p[k] \text { and } p[k+1] = p[k] - q[k].$$ Here I'm not using coefficients, but give the general idea: $p$ contributes positively to $q$, and thus behaves as momentum, generator of motion, and $q$ contributes negatively to $p$, so which makes sure the point will eventually stop and return to the zero point. It's easier to collapse position and momentum into one entity $x$ instead of seeing them as different entities, and write the update equation as $x[k+1] = Ax[k]$. When $\det A = 1$ and $A$ is orthogonal it produces an oscillation. The state $x[k]$ is again constrained to a lower dimensional manifold $E(x[k]) = E_0$. The manifold in this example is clearly a circle since $E(x) = p^2 + q^2$. %% A generalization of oscillation is possible by %% letting $A$ be non--orthogonal. In that case, the phase space volume %% $pq$ is still preserved, but one of $p$ and $q$ will go to zero, while %% the other goes to infinity. 
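Looking back at the Pythagorean--triple rotations a little earlier: a small exact-arithmetic sketch (mine, not from the notes) with $(a,b,c) = (3,4,5)$, checking that the energy $p^2 + q^2$ is conserved exactly at every step, with no rounding drift.

  from fractions import Fraction as F

  # Exact rational rotation from the Pythagorean triple (3,4,5):
  # G = (1/5) [[3, -4], [4, 3]] has determinant 1 and preserves p^2 + q^2.
  a, b, c = 3, 4, 5
  G = [[F(a, c), F(-b, c)],
       [F(b, c), F(a, c)]]

  def step(v):
      return [G[0][0]*v[0] + G[0][1]*v[1],
              G[1][0]*v[0] + G[1][1]*v[1]]

  v = [F(1), F(0)]
  E0 = v[0]**2 + v[1]**2
  ok = True
  for _ in range(256):
      v = step(v)
      ok = ok and (v[0]**2 + v[1]**2 == E0)
  print(ok)    # True: the energy equation holds exactly along the trajectory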
\section{Continuous time}

\subsection{The number $e$}

The function $$e^x = \lim_{n\to\infty} \left(1 + {x \over n} \right)^n$$ pops up when moving from a discrete to a continuous time representation, transforming an iterated function into a differential equation. I change notation from $y[k]$ to $y(t)$ to indicate that $t \in \mathbb{R}$. Subdividing a state update from $t$ to $t+1$, represented by the complex multiplier $a$ in the state update equation $y(t+1) = a y(t)$, can be done by setting $a = (1 + s/n)^n$ or $s = n(a^{1 / n} -1)$. This gives $$y(t + 1) = \left(1 + {s \over n}\right)^n y(t).$$ Here the update has been replaced by $n$ smaller updates. Taking the limit $n \to \infty$ gives $$y(t + 1) = e^s y(t),$$ with $s = \log a$. Let's take a look at the intermediate values. Using $\epsilon = 1/n$ and taking only a single update gives $$y(t + \epsilon) = (1 + \epsilon s) y(t),$$ which can be rewritten as $${y(t + \epsilon) - y(t) \over \epsilon} = s y(t).$$ This gives the differential equation ${dy \over dt}=sy$ when $n \to \infty$, $\epsilon \to 0$.

\subsection{The wave equation}

%Waves are about energy and space and time and Minkowsky spacetime.

Deriving the one--dimensional wave equation from scratch goes as follows. Imagine a grid in spacetime $(t, x_1, \ldots, x_n)$. At each point in this spacetime, there are two state variables $(q, p)$ representing the position and momentum of a point. The update equations for these are similar to those of a simple oscillator; however, the influence of position on momentum is now a contribution of all neighbours. So qualitatively it is still true that $$q[t+1,x] = q[t,x] + p[t,x],$$ but the momentum update becomes $$p[t+1,x] = p[t,x] + \sum_{\{n\}} \left( q[t,x+n] - q[t,x] \right),$$ where $\{n\}$ is the set of all neighbours, which is $\{-1, +1\}$ for one space dimension ($2$--dimensional spacetime). The continuous versions of these are $${{\partial q} \over {\partial t}} = p,$$ and $${{\partial p} \over {\partial t}} = \sum_{\{s\}} {{\partial^2 q} \over {\partial x_s^2}},$$ where the neighbour sum in the discrete equation is replaced by a second derivative by substituting $$q(t, x + \epsilon) = q(t,x) + \epsilon q_x(t,x) + \frac{\epsilon^2}{2} q_{xx}(t,x) + o(\epsilon^2)$$ in $${{q(t,x+\epsilon) + q(t,x-\epsilon) -2q(t,x)} \over \epsilon^2},$$ which yields $q_{xx}(t,x) + o(\epsilon^2)/\epsilon^2$, which gives the second derivative after taking the limit $\epsilon \to 0$. The second differential expression can be interpreted as ``The change of momentum (force) is equal to the sum of the curvatures in all directions.'' In other words, if a point's neighbours lie on average closer to some direction, the point is attracted in that direction. %% At all times, the total energy of the system, or %% $$E = \sum_{x} p^2 + q^2$$ %% is constant.
I am interested in ratios decomposed in their prime components like $$r_n = 2^{n_1} 3^{n_2} 5^{n_3},$$ with $n_i \in \mathbb{Z}$. Let's call the space of such ordered triplets $\mathbb{B_N}$. This space is in $1-1$ correspondence with a subset of the rational numbers, which I'll call $\mathbb{Q_N}$, the set of all rationals composed of the first $N$ prime numbers. $\mathbb{B_N}$ is $N$--dimensional, so it can be equipped with a nontrivial geometry. Let's define a distance in this space as $$d(n,m) = \sum_k |n_k - m_k|.$$ This I call the harmonic distance. What can be said about this space? And what happens if I relate it to the ordinary distance measure in $\mathbb{Q}$, which we could call the melodic distance?

Because the ratio basis $B=\{2,3,5\}$ has relatively large unit steps, it is possible to use another basis which has smaller steps, i.e. $B'=\{2,\frac{3}{2}, \frac{5}{4}\}$, corresponding to octave, perfect fifth and major third. Here the coordinate vectors $n$ with respect to $B$ and $n'$ with respect to $B'$ are related by $n = Tn'$ and $n' = T^{-1}n$, where \begin{equation} T,T^{-1} = \left( \begin{array}{rrr} 1 & -1 & -2 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{array} \right), \left( \begin{array}{rrr} 1 & 1 & 2 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{array} \right). \end{equation}

\begin{table} %REF: http://www.redshift.com/~dcanright/harmser/ \begin{tabular}{|l|l|l|l|} \hline ratio & name & $\{2,3,5\}$ & $\{2,3/2,5/4\}$ \\ \hline $2/1$ & octave & $(1,0,0)$ & $(1,0,0)$ \\ \hline $3/2$ & 5th & $(-1,1,0)$ & $(0,1,0)$\\ $4/3$ & 4th & $(2,-1,0)$ & $(1,-1,0)$\\ \hline $5/4$ & M 3rd & $(-2,0,1)$ & $(0,0,1)$ \\ $8/5$ & m 6th & $(3,0,-1)$& $(1,0,-1)$ \\ $5/3$ & M 6th & $(0,-1,1)$ & $(1, -1,1)$\\ $6/5$ & m 3rd & $(1, 1,-1)$& $(0, 1, -1)$ \\ \hline $15/8$ & M 7th & $(-3, 1, 1)$& $(0,1,1)$ \\ $16/15$ & m 2nd & $(4, -1, -1)$& $(1,-1,-1)$ \\ \hline $9/8$& M 2nd & $(-3, 2, 0)$& $(-1,2,0)$ \\ $10/9$& M 2nd (2) & $(1,-2, 1)$& $(1,-2,1)$ \\ $16/9$& m 7th & $(4,-2, 0)$& $(2,-2,0)$ \\ $9/5$& m 7th (2) & $(0,2, -1)$& $(0,2,-1)$ \\ \hline \end{tabular} \caption{Some octave normalized positive intervals with inverses.} \label{tab1} \end{table}

\begin{table} \begin{tabular}{|l|l|l|} \hline ratio & name & $\{2,3,5,7\}$ \\ \hline $7/4$ & bluesy flat seventh & $(-2, 0, 0, 1)$ \\ $8/7$ & ? & $(3, 0, 0, -1)$ \\ $7/6$ & bluesy minor third & $(-1, -1, 0, 1)$ \\ $12/7$ & ? & $(2, 1, 0, -1)$ \\ $7/5$ & diminished fifth & $(0, 0, -1, 1)$ \\ $10/7$& ? & $(1, 0, 1, -1)$ \\ $9/7$& ? & $(0,2, 0, -1)$ \\ $14/9$& ? & $(1,-2, 0, 1)$ \\ \hline \end{tabular} \caption{More intervals based on $7$.} \label{tab2} \end{table}

Note that each $T$ mapping the lattice onto itself must be unimodular or $\det T = \pm 1$. See Table \ref{tab1} for a list of intervals from the $3$-dimensional basis. Table \ref{tab2} has some extensions from a $4$-dimensional interval space. %% In order to limit the grid to a small set of notes, we can introduce %% ``wormholes''. I.e. two points are considered equal if their melodic %% distance falls below a certain threshold. When traversing the grid to %% build a melody for instance, we can take big leaps with a small %% melodic distance, or zap to a location which is similar and continue %% the journey.

Entry: Scale and Rotate
Date: Fri Jun 19 10:59:12 CEST 2009
Type: tex

% Start EDIT here.

Linear algebra and basic geometry are very much related, which gives most operations on matrices an interesting geometric interpretation. We will see that matrices can be seen as a generalization of complex numbers. So what are complex numbers, in short?
Compared to real numbers, complex numbers are richer because they are 2--dimensional. This introduces an additional concept of \emph{angle}. Complex multiplication can modify an angle and can thus be seen as combining the effects of both \emph{scaling}, a quality inherited from the reals, and \emph{rotation}, a new quality involving the manipulation of direction. Going to higher dimensions, we will see the concepts of scaling and rotation are still there, though what we call rotation has become much richer. The biggest change in moving from complex numbers to matrices is the loss of commutativity. The same happens in geometry for dimensions $\geq 3$, where rotations no longer commute. So, what are the analogs? Over $\RR$ we have orthogonal and symmetric matrices. Over $\CC$ we have unitary and Hermitian matrices. Both are related to their conjugates (transpose versus complex conjugate transpose) in special ways. A unitary matrix $U$ is related to its conjugate as \begin{equation} U^H = U^{-1}. \end{equation} A Hermitian matrix $R$ is related to its conjugate as \begin{equation} R^H = R. \end{equation} The same goes for orthogonal and symmetric. Where did we see this before? Suppose the matrices are $1 \times 1$, so they behave like complex numbers. This gives $\bar{u} = u^{-1}$, and $\bar{r} = r$. This means $u$ resides on the unit circle and $r$ is a real number. For real matrices, the same analogy holds, only they cannot represent complex numbers in the $1 \times 1$ case, where $u \in \{\pm 1\}$, but if we take $2 \times 2$ matrices, they can! Take for example \begin{equation} U = \matrix { \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta } \text{ and } R = \matrix { r & 0 \\ 0 & r}. \end{equation} We can make it completely obvious by writing this as $U = I \cos\theta + J \sin\theta$ and $R = r I$, where $J^2 = -I$. On to the concept of eigenvalue. Eigenvalues are defined as the solutions to the equation \begin{equation} Ax = x \lambda, \end{equation} with $\lambda$ a number called an eigenvalue and $x$ a vector called an eigenvector. This means that from the perspective of $x$, the matrix $A$ really behaves as just a number. There are $N$ such directions for an $N \times N$ matrix $A$ in which $A$ behaves as just a number. There is one exception to this rule though, which is related to the generalization of the number $0$ to nilpotent transforms, which leads us to Jordan forms, but we can safely ignore this right now. If we arrange these $N$ vectors as columns of an $N \times N$ matrix $X$, and arrange the corresponding eigenvalues in a diagonal matrix $\Lambda$, we get the relation \begin{equation} AX = X \Lambda, \label{eigenvaluedecomp} \end{equation} which leads to $A=X \Lambda X^{-1}$, which says that $A$ and $\Lambda$ are \emph{similar}. An important remark to make here is that, even if $A$ is real, $\lambda$ can still be complex. The reason for this is that if $A$ represents a rotation, it can only be represented as a number if this number can represent a rotation, which only complex numbers can. We saw above that complex numbers can be represented by real matrices, so for directions where we cannot express the effect of $A$ as simply a scaling by $\lambda$ along a line, we can represent the effect as a rotation in a plane spanned by two vectors. In other words we have \begin{equation} A \matrix{x_1 & x_2} = r \matrix{x_1 & x_2} \matrix{\cos\theta & \sin\theta \\ -\sin\theta & \cos\theta}, \end{equation} with $r$ and $\theta$ real numbers.
The $2 \times 2$ matrix performs the rotation, and $r$ performs the scaling. When we have $2N$ dimensions, we can have $N$ orthogonal rotation planes, so (\ref{eigenvaluedecomp}) exhibits a matrix $\Lambda$ which carries $2\times 2$ blocks when we don't resort to the complex representation, where a rotation is represented by $2$ complex conjugate eigenvalues and $2$ complex conjugate eigenvectors. Ignoring the scaling factor $r$, and setting $x = (1/\sqrt{2}) (x_1 + i x_2)$, we have \begin{equation} \matrix{x & \bar{x}} \matrix{e^{i\theta} & 0 \\ 0 & e^{-i\theta}} \matrix{x & \bar{x}}^H. \end{equation} The complex conjugates assure the product is real. Following from $x^H x = 1$ and $\bar{x}^H x = 0$ we have $x_1^T x_1 = x_2^T x_2 = 1$ and $x_1^Tx_2 = 0$, yielding the real form \begin{equation} \matrix{x_1 & x_2} \matrix{ \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta} \matrix{x_1 & x_2}^T. \end{equation} This enables us to clearly see what it means to have \emph{complex conjugate} eigenvalue/eigenvector pairs: a 2D rotation in the plane spanned by the real and imaginary parts of the eigenvectors. With $\theta$ going counterclockwise, the rotation is from $x_1$ to $x_2$. So what does it really mean for two matrices to be similar? The definition says $A = QBQ^{-1}$. Now suppose $A$ and $B$ are complex numbers. In that case, we have commutativity leading to $A = B$. So similar matrices are one generalization of equality of complex numbers, reflecting the fact they have the same eigenvalues, since they are similar to the same diagonal form. Equality is mapped to an equivalence relation, which ignores \emph{coupling}. If similarity maps to equality in the complex number analog, where do we fit congruence? The complex analog for $A = QBQ^{H}$ is $a = q b \bar{q} = q\bar{q} b$. So the effect of $q$ changes magnitude, but not angle. Most of the time we encounter this, $B$ and $A$ will be Hermitian or symmetric, so the analogy reads as ``a congruence transform preserves signature''. So, we see that matrices are really just another kind of number, though they are richer since they exhibit a larger number of phenomena. The main differences are non--commutativity, making $AB \neq BA$ possible, which enables them to represent things like rotations in higher dimensions, and the presence of zero divisors, which we will discuss later. We saw above that a lot of structure survives from the complex numbers. Let's make some more links. The concept of a unit complex number survives as unitary or orthogonal matrix, and the concept of real number survives as Hermitian or symmetric matrix, since these have real eigenvalues. Another remark is that squares of real numbers are positive, while $X^H X$ and $X X^H$ are positive (semi)definite, which enables us to talk about \emph{real square roots} of positive definite Hermitian matrices. Note that each matrix $Z$ can be split in a Hermitian part $(Z+Z^H)/2$ and an anti--Hermitian part $(Z-Z^H)/2$. Rewriting the latter as $i(Z-Z^H)/(2i)$ enables us to split each matrix as $Z = X + iY$, with both $X$ and $Y$ Hermitian. The exponential of an imaginary number has unit modulus, while the exponential of a skew Hermitian matrix is a unitary matrix. The exponential of a matrix $Z$ is defined through the ordinary power series $\exp Z = \sum_k Z^k / k!$. For normal matrices, we can use the eigenvalue decomposition to compute this, which reduces it to decoupled scalar exponentials.
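To make the last remark concrete, here is a minimal numerical sketch (an illustration added here, not part of the original notes; it assumes Python with \texttt{numpy}, and the helper names \texttt{expm\_series} and \texttt{expm\_eig} are made up). It computes the exponential of a skew--symmetric matrix once through the truncated power series and once through the eigenvalue decomposition, and checks that the result is orthogonal, i.e. a rotation.

\begin{verbatim}
# Sketch: exp of a normal matrix via power series and via EVD (numpy assumed).
import numpy as np

def expm_series(Z, terms=30):
    """Truncated power series exp(Z) = sum_k Z^k / k!."""
    E = np.zeros_like(Z, dtype=complex)
    P = np.eye(Z.shape[0], dtype=complex)
    for k in range(terms):
        E += P
        P = P @ Z / (k + 1)
    return E

def expm_eig(Z):
    """exp(Z) via eigenvalue decomposition; valid for diagonalizable Z."""
    lam, X = np.linalg.eig(Z)
    return X @ np.diag(np.exp(lam)) @ np.linalg.inv(X)

theta = 0.3
Z = np.array([[0.0, theta], [-theta, 0.0]])  # skew-symmetric
E = expm_eig(Z)
print(np.allclose(E, expm_series(Z)))        # both routes agree
print(np.allclose(E.T @ E, np.eye(2)))       # exp(skew) is orthogonal: a rotation
\end{verbatim}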
Entry: Generalized rotations Date: Fri Jun 19 11:00:12 CEST 2009 Type: tex What is an elementary rotation or reflection from a more general point of view? For any real $2 \times 2$ orthogonal matrix $Q$ we have $Q^TQ = QQ^T = I$. For any vector $x$, its length is equal to the length of $Qx$. So transformation by $Q$ preserves length. What if we replace the condition on $Q$ with $Q^TJQ = QJQ^T = J$? Basically it is still the same principle, only we do not preserve the length, but a generalized squared length determined by the quadratic form $x^TJx$. In other words, we use a different \emph{signature}. If $J$ is not positive definite, it is no longer a norm. If it is positive semi--definite, it still represents a semi--norm. If we take length to mean square root of a quadratic form, an indefinite $J$ can yield ``imaginary length'', since the quadratic form can return negative results. Let's have a look at this. Changing as little as possible compared to ordinary elementary rotations, we investigate \begin{equation} J = \matrix{1 & 0 \\ 0 & -1}. \end{equation} A physicist would call a $2$--dimensional space equipped with such a quadratic form a $2$--dimensional spacetime. For example, special relativity requires the laws of physics to be invariant under transforms that conserve this entity, which is called \emph{spacetime interval}, here expressed as $x^TJx$, or $x_1^2 - x_2^2$. Transformations that do so are called \emph{Lorentz transformations}, but we will just call them hyperbolic rotations, because they preserve the entity $x_1^2 - x_2^2 = c$, which is the equation of a hyperbola. Its asymptotes are given by $(x_1-x_2)(x_1+x_2)=0$. Just as circular rotations leave circles invariant, hyperbolic rotations leave hyperbolas invariant. So both are essentially the same except for the effects of $J$. It is no surprise then that just like the ordinary $2$--dimensional rotation, this has just one degree of freedom. It can be written in the following (normalized) form \begin{equation} \frac{1}{\sqrt{a^2-b^2}}\matrix{a & b \\ b & a}. \end{equation} Compare this to the ordinary $2$--dimensional rotation \begin{equation} \frac{1}{\sqrt{a^2+b^2}}\matrix{a & b \\ -b & a} \end{equation} we treated above. Usually one sets $a = 1$ and calls $b$ the \emph{reflection coefficient}. A different parametrization is $a = \cosh t$ and $b = \sinh t$. Using the latter parametrization we get \begin{equation} \frac{1}{2} e^t \matrix{1 & 1 \\ 1 & 1} + \frac{1}{2} e^{-t} \matrix{1 & -1 \\ -1 & 1}, \end{equation} which is the expanded eigenvalue decomposition (or SVD) of the elementary hyperbolic rotation in terms of the asymptotes of the hyperbola $v_1 = \frac{1}{\sqrt{2}}\matrix{1 & 1}^T$ and $v_2 = \frac{1}{\sqrt{2}}\matrix{1 & -1}^T$ and can be written as \begin{equation} \matrix{v_1 & v_2} \matrix{e^t & 0 \\ 0 & e^{-t}} \matrix{v_1 & v_2}^T. \end{equation} Here it is clear to see that, although ordinary (non--hyperbolic) angles are not preserved, the determinant is still $1$, which means this transform does conserve area: expansion along one asymptote exactly compensates compression along the other. Another class of strange gizmos are the symplectic matrices, which are square $2N \times 2N$ matrices orthogonal with respect to the weighting \begin{equation} J_{2N} = \matrix{0 & I_N \\ -I_N & 0}. \end{equation} Alternatively, one uses a permuted version of this, a block diagonal matrix with $J_2$ blocks, which suggests that we are really talking about a different kind of number here.
So it suffices to study the effect of a single such block. The essence of $J$ is that it is anti--symmetric, which yields an anti--symmetric bilinear form $b^TJa = -a^TJb$. This is completely different from any other signature matrix we have seen before, which were all symmetric matrices related to positive (or negative) real numbers. In the same way, $J$ can thus be seen as the imaginary (anti--symmetric) part of some arbitrary matrix. Also note that $J_{2N}^2 = -I_{2N}$. Let's follow the lead here and move to a complex vector space $\CC^N$, coupling 2 components of the original vector space into one complex component, according to the block diagonal distribution of $J$. The bilinear form above is now represented by $\text{imag}(b^Ha) = \frac{1}{2}(b^H a - a^Hb)$. Now, an interesting question would be: are there any numbers that can represent these generalized rotations, in the same fashion the complex numbers can represent circular rotations? The answer is yes of course. Just imagine them by introducing a set of symbols which are not real numbers and defining their multiplication table. For hyperbolic rotations, this leads to the split complex numbers, defined by the relation $i^2 = 1$. For rotations in $3$ dimensions these are the quaternions defined by $i^2 = j^2 = k^2 = ijk = -1$. For arbitrary signature and dimension these generalizations are called Clifford algebras. Most of these are no longer division algebras but merely rings, with the exception of the quaternions, which do form a (non--commutative) division algebra. This is the main reason you don't see them much. The only way they show up usually is in the form of the occasional structured matrix block, as we saw before\footnote{ In fact, we can see matrices as a generalization of all these kinds of numbers. Matrices can represent quite a lot of structure. This is due to an analog of Cayley's theorem for groups. The algebra version says that each associative algebra with unit element is isomorphic to a subalgebra of the automorphism algebra of some vector space. Note that this is not true for non--associative algebras, like the octonions, which are fundamentally different from anything matrix multiplication can represent without modification. }. % what is a vector field over a division ring? Entry: Projection and Nilpotents Date: Fri Jun 19 11:01:12 CEST 2009 Type: tex One thing we didn't talk about yet is the ability of matrices to exhibit behaviour like $AB = 0$ when $A \neq 0$ and $B \neq 0$, and behaviour like $A^k = 0$ for $k > K$. In other words, there are \emph{zero divisors} and \emph{nilpotent elements}. This enables them to represent \emph{projection} and \emph{compaction}. Matrices are a \emph{ring} and not a \emph{field}. So some elements are not invertible. % entropy anyone? % \footnote{ It is possible to have multidimensional extensions of the % real numbers which are still fields, however, these can all be % nicely contained in the ring of matrices. It turns out that % giving up commutativity gives rise to a lot more structure so we % can represent more phenomena than with numbers alone. For % example, rotations in 3D do not commute. This is an essential % property of \emph{direction} in higher dimensions. is this really % true? probably.. but I can't just prove it.. e.g. z^N+1 with N even % gives an N-dim field over \RR representable as diagonal matrices % or semi-circulant matrices % } Entry: Triangular matrices Date: Fri Jun 19 11:02:12 CEST 2009 Type: tex So, in the light of the remarks above, what is a triangular matrix?
While symmetric matrices are important from a more theoretical point of view, as a kind of generalization of (positive) real number, a triangular matrix can also fulfill this role. However, triangular matrices are important from a practical perspective: they are easily invertible, and share this property with unitary/orthogonal matrices. They are a bit like diagonal matrices, but are only half--decoupled. Enough to have $O(N^2)$ multiplication \emph{and} solving time. The application of the inverse of a triangular matrix to a vector is easy to compute in a recursive way. This is quite important, so we have a look at it. Suppose we want to solve the following $N \times N$ system \begin{equation} Lx = y, \end{equation} with $L$ a lower triangular matrix, meaning $l_{ij} = 0$ for $i < j$. The first equation involves only $x_1$, so it can be solved directly. Substituting this value into the remaining equations leaves an $(N-1) \times (N-1)$ lower triangular system, which can be solved in the same way. % causality -> light cone -> hyperbolic space. Is there a direct connection between triangular matrices % and hyperbolic space? Indirectly there is displacement rank, the identification of % triangular matrices with polynomials, and `squares' of those polynomials % which form a kind of generalized hyperbola.. What a Givens rotation is for rotations, and a Householder reflection is for reflections, a Gauss matrix is for\ldots Gaussian elimination. This matrix is also called an atomic triangular matrix, and looks like \begin{equation} L_l = \matrix{ I_{i-1} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & l & I_{N-i} } \end{equation} where $l$ is an arbitrary column vector residing in column $i$. Note that it is a vector--parametrized matrix, for vectors ranging from dimension $N-1$ to $1$. The effect of applying a row elimination downward can be expressed by applying such a matrix from the left. It can be seen as spreading out coordinate $i$ over all coordinates $j>i$\footnote{ The interesting thing about this matrix is that $L_l ^{-1} = L_{-l}$, so in that respect they behave a bit like rotations, for which some kind of conjugation represents the inverse. Another very interesting thing to remark is that $L_a L_b = L_{a+b}$: the map $l \to L_l$ turns addition into multiplication, so it behaves like an exponential. This shouldn't be so surprising, because these matrices represent the operation of adding multiples of one coordinate to the other coordinates, which obviously behaves additively.}. So we see triangular matrices are really important, because they allow us to introduce recurrence relations between matrices exhibiting some form of triangular structure. These recurrence relations are a direct consequence of subspace nests. Recall the subspaces $T_i \subset T_j$ if $i \geq j$. For a triangular transform $L$ we have $LT_i \subset T_i$. This means the effect of $L$ on the orthogonal complement of $T_i$ is independent of $T_i$. It is exactly this property which allows us to concentrate on the effect of $L$ on the $1$--dimensional complement of $T_{N-1}$, and reduce the dimensionality of the problem. In other words, matrix inversion is a decoupling process. We ``disentangle'' a matrix by identifying its invariant subspaces. However, by disentangling only in one direction, we are able to effectively map the problem to a causal solution method, which is all we need. \subsection{Outer product notation} In the sections to come, there are a lot of formulas that would look very redundant when we use standard matrix notation. The pattern which keeps recurring is $QJQ^H$, or more generally $Q\tilde{Q}^H$, where $\tilde{Q}$ is the dual of $Q$, with the word dual understood from the current context. For example, we can write the LU factors as $L\tilde{L}^H$.
In the case of symmetric matrices, the matrix $J$ is a signature matrix. Examples of new notation are $A = L\tilde{L}^H = [L] = \sum_k [l_k]$, where $[X]$ has the meaning of outer product between $X$ and its dual, clear from context. In case we want to make the signature explicit, we can write $QJQ^H = [Q]_J$. Later, for displacement rank expressions, we have $R - ZRZ^H = GJG^H$, which can more concisely be written as $R-[Z]_R = [G]_J$, or alternatively as $[I]_R-[Z]_R = [G]_J$. For a rank one matrix we could write $[x] = [x]_1$, to emphasize it is a matrix generated by a vector $x$ and its dual $\tilde{x}$. Entry: Cholesky decomposition Date: Fri Jun 19 11:03:12 CEST 2009 Type: tex Let's put this knowledge about triangular matrices to the test. Any Hermitian, positive definite matrix $A$ can be written as \begin{equation} A = L L^H = U^H U, \end{equation} with $L = U^H$ a lower triangular matrix. It can be seen as the triangular square root of $A$, as opposed to the Hermitian square root, which can be computed from the SVD or eigenvalue decomposition. It is unique as long as we require the diagonal elements to be positive; their product then equals $\sqrt{\det A}$, the product of the square roots of the (positive) eigenvalues of $A$. The algorithm to compute the decomposition has a recursive taste similar to the solving of triangular systems. Let's write $L$ in terms of columns $l_i$ as $\matrix{l_1 & \ldots & l_N}$. This enables us to write \begin{equation} A = \sum_k l_k l_k^H. \end{equation} Looking at this sum, it consists of a bunch of rank 1 matrices with top row and left column zero, except for the first, which is dense. The top row and left column are only influenced by $l_1 l_1^H$, so that's where we start. Looking at $l_1 l_1^H$, its top left element is $l_{11}^2$, so $l_{11} = \sqrt{a_{11}}$. The top row is the same as the left column, so we look only at the left column. The next element down $l_{21} l_{11}$ has to be equal to $a_{21}$, so we get $l_{21} = a_{21} / l_{11}$, and so on. This gives us $l_1$, the first column of $L$. Subtracting $l_1 l_1^H$ from $A$ now gives us a matrix with top row and left column equal to zero. If we remove those, we have a Cholesky decomposition problem of order $N-1$, which we solve recursively until we get to the trivial $1 \times 1$ problem. Entry: LU decomposition Date: Fri Jun 19 11:04:12 CEST 2009 Type: tex We continue our excursion into the land of triangular matrices. Next stop is the LU decomposition. Any invertible matrix $A$ can be written as \begin{equation} A = L U, \end{equation} with $L$ a lower triangular matrix and $U$ an upper triangular matrix. Sometimes a permutation matrix enters the picture, either because it is needed for the factorization to exist or to increase numerical stability. Because of the properties of triangular matrices, this can be regarded as an alternative representation of the matrix $A$ such that applying $A$ and $A^{-1}$ have the same complexity. See the remarks about triangular matrices above. A variant can be written $A = LDU$, with $D$ diagonal and $U,L$ unit upper and lower triangular. This is usually computed with Gaussian elimination, and used for solving linear systems, or any application where matrices and inverses are necessary, and numerical stability permits. It is interesting to rewrite the LU decomposition as $A = L \tilde{L}^H$, simply renaming $U$ to $\tilde{L}^H$. What's in a name! A lot, seemingly. This gives us a better picture about what is actually going on.
Applying the same notational trick as we did for the Cholesky decomposition gives \begin{equation} A = \sum_k l_k \tilde{l}_k ^H. \end{equation} Here again we can see $A$ modeled as a sum of rank 1 matrices, with top rows and left columns zero, though they are \emph{asymmetric} this time. We can follow almost the same recursive algorithm as we used for the Cholesky decomposition, only we need to compute the top row too, which is given by $\tilde{l}_1$. % Note: Gaussian elimination + structure -> generalized schur algo % see kailath slides_main.ps Writing the LU decomposition in terms of $2 \times 2$ block matrices gives \begin{equation} \matrix{A & B \\ C & D} = \matrix{I & 0 \\ CA^{-1} & I} \matrix{A & B \\ 0 & D - CA^{-1}B}. \label{schurcomp} \end{equation} The part $D - CA^{-1}B$ is called the \emph{Schur complement} of $A$. This leads to a decoupling of two complementary subspaces. If we write vectors as $(x,y)$ corresponding to the compartments, then the right block upper triangular factor leaves invariant the subspace $(x,0)$ and the left block lower triangular factor does the same for the subspace $(0,y)$. Interpreting $A$ in (\ref{schurcomp}) as a scalar gives us a way to reduce an $N \times N$ problem to a trivial one and an $(N-1) \times (N-1)$ problem for the Schur complement of $A$. The interesting thing to note is that if we have $A = LU$ and squeeze $\text{diag}(L, I)\,\text{diag}(L^{-1}, I)$ between the two factors above, absorbing the first part into the left factor and the second into the right, we get \begin{equation} \matrix{A & B \\ C & D} = \matrix{L & 0 \\ CU^{-1} & I} \matrix{U & L^{-1}B \\ 0 & D - CA^{-1}B}, \end{equation} which is a partial LU decomposition in terms of $A = LU$. Entry: QR decomposition Date: Fri Jun 19 11:05:12 CEST 2009 Type: tex We talked above about triangular matrices and their particularly attractive features having mostly to do with the nice recursive algorithms for their inversion, which ultimately boil down to the invariance of subspace nests. Looking at complexity alone, a triangular matrix behaves like an orthogonal one, save the fact that it is not orthogonal of course. (But, as we will see, this depends on the definition of orthogonal!). So let's have a look at the QR decomposition, which factors a square matrix $A$ as the product of an orthogonal (unitary in the complex case) matrix $Q$ and an upper triangular matrix $R$ \begin{equation} A = Q R. \end{equation} In terms of the outer product notation of the product of two matrices, this can alternatively be expressed as $A = QL^H = \sum_k q_k l_k^H$, with $L = R^H$ lower triangular. The first thing this shows us is that $L$ is in fact the Cholesky factor of $A^HA$. This makes the QR decomposition look a lot like a polar decomposition, identifying $R$ with the positive factor. This decomposition is not unique, unless we fix the diagonal elements of $R$ to be positive. If $A$ is real, both $Q$ and $R$ are real, with $Q$ orthogonal. The QR decomposition can be seen as an alternative polar decomposition like (\ref{polarmatrix}). A common misconception is that the diagonal of $R$ corresponds to the eigenvalues of $A$. This is not true; however, the determinant of $A$ is equal to the product of the diagonal elements of $R$, up to the unit--modulus factor $\det Q$. The factorization can be generalized to non--square $M \times N$ matrices, with $M > N$, by imposing $Q^H Q = I$, which is the first step in computing the SVD for an unstructured matrix. This is often used to solve the least squares problem $Ax \approx b$ by converting it to the triangular system $Rx = Q^Hb$. In that case $Q^Hb$ holds the coefficients of the projection of $b$ onto the column space of $A$.
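As an aside, here is a small numerical sketch of this least squares use of the QR decomposition (added for illustration, not part of the original notes; it assumes Python with \texttt{numpy}).

\begin{verbatim}
# Sketch: least squares via QR, solving R x = Q^H b (numpy assumed).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))        # tall M x N matrix
b = rng.standard_normal(8)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is 8x3, R is 3x3 upper triangular
x = np.linalg.solve(R, Q.T @ b)        # solve the triangular system R x = Q^T b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
\end{verbatim}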
The QR decomposition can be computed using Gram--Schmidt orthogonalization, Givens rotations or Householder transformations. Using the Gram--Schmidt procedure, we write (\ref{gramsmidth}) with $W=I$ and $a_n$ the columns of $A$, or \begin{equation} a_n = \|f_n\| e_n + \sum_{i=0}^{n-1} (e_i^T a_n) e_i. \end{equation} Building $Q$ with columns $e_i$ gives us $R$ with diagonal elements $\|f_n\|$ and $e_i^T a_n$ in row $i$ and column $n$. A variant of this is the inverse QR decomposition \begin{equation} A R = Q, \end{equation} with $R$ triangular and $Q$ orthogonal. It can be computed as well using Gram--Schmidt orthogonalization by observing that the columns of the matrix $R$ are orthogonal with respect to the inner product weighted by $A^TA$, or $R^T A^T A R = Q^T Q = I$. This is related to an inverse Cholesky--like factorization of $A^TA$, since $(A^TA)^{-1} = R R^T$ with $R$ upper triangular. In practice this is of little use in the most general situation, since solving and applying a triangular matrix are of the same complexity order. % general pattern: write recurrence relation + use commute/inverse of % real/complex numbers i wonder if it is possible to automate this % search for algorithms :) % note that the relationship between QR and Schur decomp are very % related to the polar decomposition of complex numbers again. Entry: Schur decomposition Date: Fri Jun 19 11:06:12 CEST 2009 Type: tex Any square matrix $A$ can be written in the form \begin{equation} A = Q U Q^H, \label{schurdecomp} \end{equation} where $Q$ is unitary, and $U$ is upper triangular with the diagonal containing the eigenvalues of $A$. This is somehow a middle road between SVD and EVD. When $A$ is normal, it is the same as the EVD, and $U$ will be diagonal. If it is Hermitian and positive definite, it is in addition equal to the SVD. % SPINORS! % First let's see what this means going back to our complex number % analog. Suppose $a = qr$, with $q = e^{i\theta}$, we can rewrite % this as $a = p r \bar{p}$ with $p = e^{i \theta/2}$. Now this is % just an analogy, but there must be some relationship because. % To solve this, we resort to the same tricks we used before, which is % solving the $2 \times 2$ problem, and then generalizing it to more % than one dimension. % this by using tricks we can do with triangular and orthogonal matrices. Since this involves triangular matrices, let's see if we can come up with a recurrence relation like we did before. Rearranging (\ref{schurdecomp}) gives $AQ = QU$. The first column of $U$ contains only one nonzero element, which makes the first column of $AQ$ proportional to the first column of $Q$, making that column an eigenvector of $A$ and the nonzero element an eigenvalue. In order to construct the decomposition, we find one eigenpair $(\lambda, q)$. We complement the eigenvector $q$ with $N-1$ orthogonal vectors to form a unitary matrix $Q_0$. Computing the product gives $Q_0^H A Q_0 = L_0$. Rearranging this gives the recurrence relation \begin{equation} A = Q_0 \matrix{\lambda & \times \\ 0 & A_1} Q_0^H. \end{equation} We can expand this until the lower right block becomes a scalar. This routine just proves existence, and needs a ``subroutine'' to find the eigenvalue and eigenvector. If $A$ is normal, meaning $A A^H = A^H A$, the Schur decomposition is the same as the eigenvalue decomposition, and $U$ is diagonal. If $A$ is normal and positive definite, the eigenvalues are real and the Schur decomposition is equal to the Singular Value Decomposition.
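To illustrate the deflation step described above, here is a hedged numerical sketch (added here, not part of the original notes; it assumes Python with \texttt{numpy}): take one eigenpair $(\lambda, q)$, complete $q$ to a unitary $Q_0$, and check that $Q_0^H A Q_0$ has the promised block structure.

\begin{verbatim}
# Sketch: one step of the Schur recurrence (numpy assumed).
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

lam, V = np.linalg.eig(A)
q = V[:, 0]                                    # one (possibly complex) eigenvector
M = np.column_stack([q, rng.standard_normal((4, 3))])
Q0, _ = np.linalg.qr(M)                        # unitary completion; first column ~ q
B = Q0.conj().T @ A @ Q0
print(np.allclose(B[1:, 0], 0))                # zeros below the (1,1) entry
print(np.allclose(B[0, 0], lam[0]))            # (1,1) entry is the eigenvalue
\end{verbatim}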
A special case of normal matrices are the circulant matrices, which are constant along diagonal lines that ``wrap around''. These are a special kind of Toeplitz matrix. In this case the unitary transform for diagonalization is the discrete Fourier transform, which can be computed in $O(N\log N)$ operations, and is of great importance in a lot of engineering applications. Entry: Eigenvalue decomposition Date: Fri Jun 19 11:07:12 CEST 2009 Type: tex An eigenvalue $\lambda$ of a square matrix $A$ satisfies \begin{equation} Ax = \lambda x \end{equation} for some eigenvector $x$. The algebraic multiplicity of eigenvalue $\lambda$ is the multiplicity of the root $\lambda$ in the polynomial $\det(A-\lambda I)$. The geometric multiplicity is the dimension of the kernel of $A-\lambda I$. If both are equal for all eigenvalues, we can write the normal form for the similarity transform, also called the eigenvalue decomposition \begin{equation} A = S \Lambda S^{-1}, \label{eigendecomp} \end{equation} with $\Lambda$ a diagonal matrix containing the eigenvalues of $A$, and the corresponding columns of $S$ containing the eigenvectors. If $A$ is normal, $S$ can be taken unitary, $S^{-1} = S^H$, containing the normalized eigenvectors, which form an orthonormal basis. The eigenvalue decomposition then corresponds to the Schur decomposition. In this case, we can obtain the normal form for the congruence transform by absorbing the square root of the eigenvalues into the left and right matrices, $X = S \sqrt{|\Lambda|}$, and keeping only their sign in the diagonal matrix. With $J$ a diagonal matrix containing only $\pm 1$ and $0$ elements, this gives the normal form for the congruence transform of a Hermitian matrix as \begin{equation} A = X J X^H. \end{equation} An example of a non--normal matrix is $I+Z$, with $Z$ the down shift matrix defined in (\ref{shiftdown}). Here it is clear to see that $ZZ^T \neq Z^TZ$. They are mirror images around the anti--diagonal, with a zero in the top left and bottom right corner respectively. Matrices like these can be brought into Jordan form. They are similar to a diagonal matrix plus a nilpotent matrix. Entry: Jordan form Date: Fri Jun 19 11:08:12 CEST 2009 Type: tex If the geometric and algebraic multiplicities for eigenvalues $\lambda_i$ do not coincide, we cannot diagonalize $A$ using the eigenvalue decomposition. However, we can still write its Jordan form\footnote{In practice, when the numbers in a matrix are a result of measurement, this only occurs in engineering applications if the matrix is structured. Nilpotent transforms, and more generally, rank--deficiency, have an \emph{exact} quality, that has effectively zero probability to occur as a result of measurements alone. For rational and finite fields this is of course different, but there it is arguable that this is again imposed structure, since we think of physical quantities mostly in terms of real numbers. }. Intuitively one could say the Jordan form expresses the fact that a linear map (endomorphism) can essentially perform only 2 kinds of space--invariant operations that are fundamentally different in nature: scaling/rotation and `collapse'. The first kind is expressed by the eigenvalue decomposition, while the last one can be found in its pure form in nilpotent transforms. The Jordan canonical form writes a linear transformation as a combination of both kinds. Suppose $Z$ is a nilpotent matrix with index $p$, meaning $Z^p = 0$ and $Z^{p-1} \neq 0$.
There is a vector $n$ such that $Z^{p-1} n \neq 0$. The vectors $n_k = Z^k n$ for $k = 0, \ldots, p-1$ are independent and span a $p$--dimensional subspace. Since $Z$ maps each $n_k$ to $n_{k+1}$, the space is invariant under transformation by $Z$. Suppose $\lambda$ is an eigenvalue of $A$ with algebraic multiplicity $m$, and $k$, with $1 \leq k < m$, is the dimension of the kernel of $A-\lambda I$. This means we can find $k$ independent eigenvectors. We can augment this basis with $m-k$ additional vectors to span a space that is invariant under $A$. This space is called the Jordan space corresponding to eigenvalue $\lambda$, and is equal to the kernel of $(A-\lambda I)^\mu$, with $\mu$ the smallest number for which the dimension of this kernel no longer grows. A Jordan chain is a sequence of vectors $H_k$, such that $H_0 = 0$ and $H_{k} = (A-\lambda I) H_{k+1}$. For each vector in the chain we have $AH_{k+1} = H_k + \lambda H_{k+1}$, which means the space spanned by such a chain is invariant under $A$. This makes $H_1$ an eigenvector since $AH_1 = \lambda H_1$. For each $\lambda$, we can find a number of Jordan chains to span the Jordan space. This gives rise to the decomposition \begin{equation} A = S (\Lambda + N) S^{-1}, \end{equation} with $\Lambda$ a diagonal matrix containing the eigenvalues, and $N$ a block diagonal nilpotent matrix built from shift--up blocks. The matrix $S$ contains linearly independent vectors built from Jordan chains. Entry: Singular value decomposition Date: Fri Jun 19 11:09:12 CEST 2009 Type: tex Any $M \times N$ matrix $A$ can be written in the form \begin{equation} A = U \Sigma V^H, \end{equation} where $U$ and $V$ are unitary, and $\Sigma$ is a rectangular matrix containing the singular values on the diagonal. The SVD can be seen as containing the information of the square roots of the EVDs of both squares of $A$: $A^HA = V \Sigma^2 V^H$ and $A A^H = U \Sigma^2 U^H$. It has the properties of an EVD of a positive (semi)definite Hermitian matrix: nonnegative real eigenvalues and orthogonal eigenvectors. The SVD is foremost an orthogonal decomposition: it gives orthogonal bases for the row/column spaces and left/right null spaces. It is central to so--called \emph{subspace methods}. One of its major applications is low rank approximation, with the \emph{total least squares} approximation method and PCA as special cases. The singular values are the square roots of the eigenvalues of $A^H A$, so the SVD could be seen as a sort of double polar representation. In fact, using the SVD one can compute what is ordinarily called the polar decomposition of $A$ in terms of a unitary and a positive definite Hermitian matrix \begin{equation} A = U \Sigma V^H = ( U V^H ) (V \Sigma V^H). \label{polarmatrix} \end{equation} Similarly, the right polar decomposition can be computed as $(U \Sigma U^H) (UV^H)$. This can be seen as a generalization of the polar representation of complex numbers $a = e^{i\theta} r$, where the positive factor is taken to be the Hermitian square root of $AA^H$ or $A^HA$. Note that this is very similar to the QR decomposition, where $R$ is triangular instead of Hermitian, again indicating a link between positive definite triangular and Hermitian matrices. This is reflected in the Cholesky decomposition. % which means there is certainly a deep, strong connection somewhere between % Hermitian matrices and triangular matrices % on the right it gives $( U \Sigma U^H ) (U V^H)$.
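Here is a small numerical sketch of the polar factorization (\ref{polarmatrix}) (added for illustration, not part of the original notes; it assumes Python with \texttt{numpy}): from the SVD we form the orthogonal factor $UV^H$ and the symmetric positive semidefinite factor $V\Sigma V^H$.

\begin{verbatim}
# Sketch: polar decomposition from the SVD (numpy assumed, real case).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

U, s, Vh = np.linalg.svd(A)
W = U @ Vh                              # orthogonal factor U V^T
P = Vh.T @ np.diag(s) @ Vh              # symmetric positive semidefinite factor V S V^T
print(np.allclose(A, W @ P))            # A = W P
print(np.allclose(W.T @ W, np.eye(4)))  # W is orthogonal
print(np.allclose(P, P.T))              # P is symmetric
\end{verbatim}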
Entry: CS Decomposition Date: Fri Jun 19 11:10:12 CEST 2009 Type: tex For any $2 \times 2$ partitioning of a unitary matrix \begin{equation} Q = \matrix{ Q_{11} & Q_{12} \\ Q_{21} & Q_{22} } \end{equation} there exist unitary matrices $U_1$, $U_2$ and $V_1$, $V_2$, such that $D = U^H Q V$, with \begin{equation} U = \matrix{ U_1 & \\ & U_2 }, V = \matrix{ V_1 & \\ & V_2 } \end{equation} and \begin{equation} D = \matrix{ D_{11} & D_{12} \\ D_{21} & D_{22} }, \end{equation} where the $D_{ij}$ are essentially diagonal (permuted diagonal). This is essentially an SVD of each block, with $Q_{ij} = U_i D_{ij} V_j^H$. % It can be seen as the basic form for real matrices with complex % eigenvalues, without using complex numbers\footnote{ Note that the % block above is nothing more than a matrix representation of % complex numbers over the field of real numbers. This is the % algebraic extension field $\RR[\sqrt{-1}]$, isomorphic to the % quotient of the Euclidean ring of polynomials $\RR[x]$ and its % maximal ideal $\{p(x) | (x^2+1)q(x), q(x) \in \RR[x]\}$.}. The % essential observation is that the block above is unitary similar to % a $2 \times 2$ diagonal matrix containing the complex conjugate pair % $c \pm is$. % SEE page wei : history and generality of the CSD decomposition Entry: Vandermonde decomposition Date: Fri Jun 19 11:11:12 CEST 2009 Type: tex Given a sequence $h_k$ which satisfies the relation \begin{equation} h_k = \sum_{l=1}^{L} c_l z_l^k \end{equation} with all $z_l$ different, any $M \times N$ Hankel matrix of $h_k$ can be written as \begin{equation} H = \sum_{l=1}^{L} c_l \left[ \begin{array}{c} 1 \\ z_l \\ \vdots \\ z_l^{M-1} \\ \end{array} \right] \left[ \begin{array}{c} 1 \\ z_l \\ \vdots \\ z_l^{N-1} \\ \end{array} \right]^T, \end{equation} which we denote as $H = \sum_l c_l W_{l,M} W_{l,N}^T = W_M C W_N^T$, with $C$ a diagonal matrix containing the $c_l$, and $W_M$ and $W_N$ Vandermonde matrices built from the $z_l$. In general we can decompose any Hankel matrix $H$ in terms of confluent Vandermonde matrices obtained from the roots of the polynomial \begin{equation} p(z) = z^{L} + \sum_{l=1}^L a_l z^{L-l} \end{equation} % note that also MA signals can be modeled by this!! if all poles are zero, % the confluent vandermonde matrices represent a shifted unit impulse base.. defined by the $a_l$ satisfying the recurrence relation \begin{equation} h_k = - \sum_{l=1}^L a_l h_{k-l}. \end{equation} If $H$ is rank deficient, $p(z)$ can be constructed from an element of the null space of $H$. If the Hankel matrix is full rank, the decomposition can be obtained through the decomposition of the Hankel matrix obtained by the addition of an extra linearly dependent column. This is however no longer unique, since the extension is arbitrary. In the case where $H$ is (Hankel) circulant, the circulant rank deficient extension gives the (mirrored) Fourier decomposition $H = F C F^T$, with $p(z) = z^N - 1$ and $z_l = \exp(j2\pi l / N)$. Perturbing a Hankel matrix of full rank with a Hankel matrix of minimal Frobenius norm such that its rank is reduced to $L$ gives us a way to construct an order $L$ exponential model for a signal $h_k$ with minimal error in a least squares sense. This is a structured total least squares problem. % \subsection{Hankel shifts} % Observe $W_M^{+} = W_M^{-} W$, with $.^+$ and $.^-$ denoting the % removal of top and bottom rows. The diagonal matrix $W$ can be % understood as the diagonal representation of some kind of shift % operator.
% To see this, take $H_L = Z_M C Z_L^T$, the Hankel matrix built from % the first $L$ linearly independent columns of $H$. The product % $CZ_L^T$ is regular, so $Z_M$ and $H_L$ span the same space. We % write $H_L$ as a sum of Hankel matrices $Z_m$, which are zero except % for a one on the $m$th anti--diagonal. This gives % \begin{equation} % H_L = \sum_m h_m Z_m. % \end{equation} % , which is the analog of the % This could be understood as a diagonalizable shift operator. Indeed, % suppose $M \to \infty$, meaning $H \in \mathbb{C}^{\infty \times % L}$, % it is probably best to work the other way around, since it is not immediately clear to me % how that composition can be proven.. % This occurs in (\ref{linpred}) where we model the autocorrelation as % a sum of exponentials, by removing the top row. % \section{Frequency domain interpolation} % We use time domain interpolation to set the base pitch of the % synthesized signal. Let's have a look at the effect of doing this in % the frequency domain. Suppose we have a spectral sequence % $C_k$. What happens if we resample this vector? % \section{Duality} % To understand the benefits of this method, it is interesting to have % a look at time/frequency duality. Since we have a direct and % efficient way to switch between time and frequency domain % representation, some operations become very straightforward. % effects: single sideband modulation harmonic exciting + aliasing % free distortion % todo: pd objects for basic operations bfft/bifft formant shift (time % or frequency dilatation) % Alternative playback: synchronous retriggering. Entry: Subspace frequency estimation Date: Fri Jun 19 11:12:12 CEST 2009 Type: tex An interesting trick exists for relating the Vandermonde decomposition $H = W_M C W_N^T$ and the SVD $H = U \Sigma V^T$. Suppose $C,\Sigma \in \mathbb{C}^{L \times L}$, and $H \in \mathbb{C}^{M \times N}$, with $L < N < M$. We have \begin{equation} W_M^+ = W_M^- W, \label{vandershift} \end{equation} with $.^+$ and $.^-$ indicating the removal of the top and bottom row respectively. $W$ is a diagonal matrix containing the $z_l$. This can be used to compute the $z_l$. Since $W_M$ spans the column space of $H$, given some matrix $X$ spanning this space, there is a regular matrix $Q$ satisfying $W_M = X Q$. Removing the top and the bottom row of this relation respectively, and substituting into (\ref{vandershift}), we can construct the equation \begin{equation} X^+ Q = X^- Q W. \label{subspaceshift} \end{equation} Every basis for the column space of the rank $L$ Hankel matrix has such a shift property. We can solve this for the matrix $W' = QWQ^{-1}$, which has the same eigenvalues as $W$, so we can obtain the $z_l$. For $X$ we could use the first $L$ columns of $H$. An alternative approach is to use the SVD, which can also be used to compute an approximate solution in case $H$ is corrupted by noise. Take $U_L$ to be the matrix of left singular vectors corresponding to the $L$ largest singular values of $H$. In general, when $L$ is not equal to the rank of $H$, the approximation is no longer a Hankel matrix and the subspace shift property $U_L^+ = U_L^- W'$ is not valid. However, we can still solve this equation in a least--squares sense, which gives us an approximation of the $z_l$ in case $H$ is corrupted by noise. This method is called HSVD. If we solve the last equation in a total least squares sense, the method is called HTLS. The signal amplitudes can be recovered by solving a least squares problem after obtaining the signal poles.
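To make this concrete, here is a hedged numerical sketch of the shift trick in the noiseless case (added for illustration, not part of the original notes; it assumes Python with \texttt{numpy}): build a Hankel matrix from $h_k = \sum_l c_l z_l^k$, take the dominant left singular vectors, and recover the $z_l$ as eigenvalues of the least squares solution of the shifted equation.

\begin{verbatim}
# Sketch: recovering signal poles from a Hankel matrix (HSVD-style, numpy assumed).
import numpy as np

z = np.array([0.9 * np.exp(2j * np.pi * 0.10), 0.7 * np.exp(2j * np.pi * 0.27)])
c = np.array([1.0, 0.5])
k = np.arange(40)
h = (c[None, :] * z[None, :] ** k[:, None]).sum(axis=1)        # h_k = sum_l c_l z_l^k

M, N, L = 30, 11, 2
H = np.array([[h[i + j] for j in range(N)] for i in range(M)])  # M x N Hankel matrix

U, s, Vh = np.linalg.svd(H)
UL = U[:, :L]                                             # rank-L signal subspace
W = np.linalg.lstsq(UL[:-1], UL[1:], rcond=None)[0]       # solve U^- W' = U^+
print(np.sort_complex(np.linalg.eigvals(W)))              # recovered poles
print(np.sort_complex(z))                                 # true poles
\end{verbatim}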
% Applying a variant of this method to a signal covariance matrix % yields the (TLS--)ESPRIT algorithms. See Deprettere and Vanderveen % 93. ok, this brings us to 'realization theory' \def\O{\mathcal{O}} \def\C{\mathcal{C}} This method is a special case of ``realization theory'', where one computes a state--space model $$\matrix{x^+ \\ y} = \matrix{A & B \\ C & D} \matrix{x \\ u}$$ from a given impulse response $h_n$. This sequence is zero for $n<0$ because we assume the system is causal. We have $h_0 = D$, and for $n>0$ we have $h_n = CA^{n-1}B$. The parameters are recovered using the Hankel operator $H$, the submatrix of the Toeplitz impulse response matrix $T$ which relates past input $u_n : n<0$ to future output $y_n : n \geq 0$. Arranging the elements $h_n : n > 0$ into an infinite Hankel matrix gives $H$. The elements of $H$ are given by $H_{ij} = h_{i+j-1} = CA^{i+j-2}B$. In other words, $H$ can be factored as $H = \O\C$, where $$\O = \matrix{C \\ CA \\ CA^2 \\ \vdots}$$ is called the \emph{observability matrix} and $$\C = \matrix{B & AB & A^2B & \ldots}$$ is called the \emph{controllability matrix}. This clearly shows $H$ is rank deficient. The Vandermonde decomposition is a special case of this factorization. Both $\C$ and $\O$ satisfy shift properties: $\O A$ is equal to $\O$ with its top (block) row removed, and $A\C$ is equal to $\C$ with its first (block) column removed. The decomposition $H = \O\C$ is not unique, so we could use the SVD as $H = (U)(\Sigma V^T)$. We can obtain $A$ by applying the shift trick on $U$, and recover $B$, $C$ and $D$ from this. Entry: Displacement rank Date: Fri Jun 19 11:13:12 CEST 2009 Type: tex To explain the idea of displacement rank, we first concentrate on the Hermitian $N \times N$ matrix $A$. This enables us to employ the normal form for the congruence transform, which is a bit easier on the eye than asymmetric structures. The particular kind of structure we try to capture is shift structure, which occurs in a lot of problems involving time/space--invariant systems. We are going to investigate the difference between two quadratic forms $x^H A x$ and $(Zx)^H A (Zx)$, where $Z$ is the nilpotent down--shift matrix from (\ref{shiftdown}), which satisfies $Z^N = 0$. The difference between the two quadratic forms is represented by the matrix \begin{equation} \Delta A = A - Z A Z^T = G J G^H, \label{drank} \end{equation} where $J = I_p \oplus -I_q$ is an $r \times r$ diagonal signature matrix containing the elements $j_l \in \{ 1, -1\}$. The fact that we take a difference of two squares is a first indicator that we are going to run into some hyperbolic metric with signature $(p,q)$. Here $r$ is called the displacement rank of $A$, while the pair $(p,q)$ is called the displacement inertia. The redundancy in the dense matrix $A$ is captured by the pair $\{J, G\}$, called a generator of $A$. For most problems of interest $r \ll N$. The task that remains is to figure out how to translate ordinary operations we would like to perform on $A$ to operations on $\{J, G\}$. The first thing we might want to know is how to perform a matrix multiplication with $A$ using this representation. Expanding the recursive formula $A = \Delta A + ZAZ^T$ gives us \begin{equation} A = \sum_{k=0}^{N-1} (Z^k G) J (Z^k G)^H. \end{equation} Expanding this further in terms of columns of $G$ gives us an expansion in terms of ``squares'' of lower triangular Toeplitz (LTT) matrices $L_l L_l^H = \sum_k (Z^k g_l) (Z^k g_l)^H$.
The complete expression for $A$ then becomes \begin{equation} A = \sum_{l=1}^{p} L_l L_l^H - \sum_{l=p+1}^{r} L_l L_l^H, \end{equation} with $L_l$ the LTT matrix built from the $l$th column of $G$. So in words, $A$ is written as a difference of two positive (semi)definite Hermitian matrices, which are the sums of respectively $p$ and $q$ positive semidefinite Hermitian matrices. The remarkable fact is that for each of these matrices, we have the Cholesky decomposition available in terms of lower triangular Toeplitz matrices. The interesting thing about this representation is that multiplication with an $N \times N$ LTT matrix is essentially a convolution and can be performed using an FFT of size at least $2N-1$. This is our first victory! Complexity of application of $A$ is reduced from $O(N^2)$ to $O(N \log N)$. % Note that $A$ is expressed as a sum/difference of ``squares''. % For those who didn't pay close attention, the FFT sneaked in when we % started playing with $Z$. % In the more general non--symmetric case we have $A - Z A Z^T = G B % ^H$, giving rise to $A = \sum_l L_l^g (L_l^b)^H$, where the $k$th % column of the LTT matrices $L_l^g$ and $L_l^b$ are built from the % $l$th column of $Z^kG$ and $Z^kB$. So the matrices are all in LU % form. % These LTT matrices can be handled using FFT algorithms. In layman's % terms, exploiting displacement rank is exploiting shift--invariance % using a Fourier transform, together with a workaround to handle % boundary conditions and other transformations that result in this % invariance no longer strictly holding. Can we do more? An interesting property is \begin{equation} A-FAF^H \cong A^{-1} - F^H A^{-1} F, \label{congruence} \end{equation} with $\cong$ denoting congruence, for any $F$ and $A$ regular Hermitian. Note this is valid for any $F$! One application of this is about the inverse of a Toeplitz matrix. Let $\tilde{I}$ be the anti--diagonal exchange matrix. Using this matrix in a congruence transform reverses the order of the rows and columns of a matrix, effectively rotating the grid of entries over $180$ degrees. When we apply this to a (real) symmetric Toeplitz matrix $T$ or its inverse $T^{-1}$, both are left invariant. If we use $T$ and $Z$ from above in (\ref{congruence}), and transform the right hand side using $\tilde{I}$, we get % yes, this is really interesting!! I wonder whether these % things can be related to other symmetry groups etc.. \begin{equation} \Delta T \cong \Delta T^{-1}. \end{equation} The significance of this is of course that we can transform the generators of $T$ into the generators of $T^{-1}$, without having to represent either of the two in dense form. This is not a new result of course, and is effectively expressed in the Levinson algorithm, where the invariance under transform by $\tilde{I}$ is explicitly exploited, together with a recurrence relation. Note that this is less general and talks specifically about symmetric Toeplitz matrices. The most important general consequence of (\ref{congruence}) is that Schur complements inherit displacement structure. The generators of this complement can be computed using the generalized Schur algorithm. One thing to remark is that if $G$ is in normal form, which means the top row contains only a single nonzero element, in column $k$ say, and the matrix $Z$ is strictly lower triangular, the corresponding column $g_k$ is the first column of the Cholesky factor.
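Before switching notation, here is a hedged numerical sketch of the generator representation above (added for illustration, not part of the original notes; it assumes Python with \texttt{numpy}, and the particular generator chosen for a symmetric Toeplitz matrix is one standard choice, not the only one): the displacement of a symmetric Toeplitz matrix has rank $2$, and the matrix can be rebuilt from $\{J,G\}$ by expanding $\sum_k (Z^kG)J(Z^kG)^T$.

\begin{verbatim}
# Sketch: displacement rank 2 of a symmetric Toeplitz matrix (numpy assumed).
import numpy as np

N = 6
t = np.array([3.0, 1.2, 0.5, 0.2, 0.1, 0.05])                 # first column, t[0] > 0
T = np.array([[t[abs(i - j)] for j in range(N)] for i in range(N)])
Z = np.eye(N, k=-1)                                           # down-shift matrix

D = T - Z @ T @ Z.T                                           # displacement of T
print(np.linalg.matrix_rank(D))                               # displacement rank r = 2

g1 = t / np.sqrt(t[0])                                        # one possible generator
g2 = np.concatenate([[0.0], t[1:]]) / np.sqrt(t[0])
G, J = np.column_stack([g1, g2]), np.diag([1.0, -1.0])
R = sum(np.linalg.matrix_power(Z, k) @ G @ J @ G.T @ np.linalg.matrix_power(Z, k).T
        for k in range(N))
print(np.allclose(R, T))                                      # T rebuilt from {J, G}
\end{verbatim}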
Switching to our new notation, this becomes $$R = [L] = \sum_{k=1}^N [l_k].$$ We identify this with $$R = [G]_J + [Z]_R = \sum_{k=1}^r [g_k]_{j_k} + [Z]_R,$$ from which we conclude that $l_1 = g_1$, since all the other terms do not affect the first row and column of $R$. This gives us the first column in the Cholesky factorization of $R$. Supposing $j_1 = 1$, we can compute the rest from $$R - [g_1] = \bar{R} = \sum_{k=2}^r [g_k]_{j_k} + [Z]_R.$$ The only thing we need to do now is to write $[Z]_R$ in terms of $\bar{R}$. With $R = \bar{R} + [g_1]$, we have $[Z]_R = [Z]_{\bar{R} + [g_1]} = [Z]_{\bar{R}} + [Zg_1]$. The final word is thus $$\bar{R} = [Z g_1] + \sum_{k=2}^r [g_k]_{j_k} + [Z]_{\bar{R}},$$ which gives us the generator for the Schur complement. This is one step of the generalized Schur algorithm. The only thing we assumed is that we can bring any generator into normal form using $J$--orthogonal transforms. This will yield one column of the Cholesky factor. The generators of the Schur complement are then computed by shifting down the pivot column. The question which arises almost immediately is: we go to great lengths trying to keep the number of coefficients small, but in generating the Cholesky decomposition, we do generate $N(N-1)/2$ parameters. Can we somehow consume these parameters immediately? For example, solving $LL^H x = b$. Solving $Lx' = b$ is straightforward, but the $L^H$ factor gives some trouble. The thing to do here would be to use some kind of primal/dual trick. This is essentially what the Levinson algorithm does: instead of updating the Cholesky factors, it updates the solution of a symmetric positive definite Toeplitz system. % where $J|Q = QJQ^H$, $j_k$ is the $k$--th diagonal of $J$. This % gives $R - R|Z = J(G) = \sum_{k=1}^r j_k(g_k)$. We can write the % partial Cholesky decomposition as %$$ %\begin{array}{lll} %(I)_R & = & (L)_I \\ % & = & \sum_k (l_k)_1 \\ % & = & \sum_{k=1}^r (g_k)_{j_k} + (Z)_R \\ % & = & (l_1)_I + \sum_{k=2}^r (g_k)_{j_k} + (Z)_R. \\ %\end{array} % $$ % ok, now the new things are coming! continue understanding this tomorrow.. but the % moral of the story is: recursion is introduced because of the presence % of a relation between a matrix and its Schur complement. % ..., using a sequence of transforms that leave the metric % intact. This brings us to $J$ inner matrices, which are matrices % satisfying $Q^H J Q = J$. They are a generalization of unitary % matrices (where $J=I$), but preserve distance with respect to $J$. % maybe here I will finally get to scratch my symplectic itch % by introducing a symplectic form. % SCHUR : displacement structure allows us to speed up the computation % of schur complements triangular factorization can be stated in terms % of successive schur complements == generalized schur algorithm %If $A$ is Toeplitz, we have $J = \text{diag}(1,-1)$ and so %\footnote{ %An alternative definition is %\begin{equation} %\Delta_{F,A} (R) = FR - RA = GB^T, {%\end{equation} %using the $N \times L$ generator matrices $G$ and $B$.}. % Where can I apply this? To come back for a moment to what this % actually does: combining shift--invariance with boundary conditions. % \section{Heavy speculation} % So what's the whole idea when modeling audio signals? Present a % linear model using time--localized features. The problem is really % to define time--extent. Since we have a discrete number of features, % sliding over a signal will make some of them appear and % disappear.
This is the cause of all trouble, the fact that we are % really dealing with discrete events. % It would be interesting to investigate if it is possible to perform % event segmentation using HTLS. Basicly, starting from a signal % $s_n$, partition the signal such that each partition satisfies the % same prediction equation, minimising an error. This could result in % a separation of an AR filter and an impulse-like stimulus (impulse + % MA). % Without thinking too much, combining the backward and forward errors % from linear prediction in time localisation of impulses is an option % to consider. See p132 godsill & rayner % \subsection{The arrow of time} %[REWRITE] % this part is a bit over the top :) the only real info here is % choleski factor of variance/covariance matrix as conditional % entities arrow of time, ok, but in practice this only refers to the % relation between recursive formulation and causal COMPUTERS :) the % model is not necessarily causal.. % There's something mysterious about triangular matrices. Note the % interesting recurrent patterns that emerged while we were computing % the Cholesky and LU factorization, and in solving triangular % systems. This screams for a closer look. % But first, some buzzwords. Causality, assymetry, order, subspace % inclusion, information delay, entropy, conditional variance (they % occur as Cholesky factors = conditional expansion of mutivariate % gaussians). Time varying systems, the Kalman filter, autoregressive % modeling with reflection back--forth in time, the wave % equation\ldots Ok, enough madness. There is something deep going on % here i don't quite get yet, all centered around this ``conditional % probability'' thing. Bayesian networks, Schur complements. % What is so special about a Cholesky factorization? Writing $A = LL^H % = R^HR = (QR)^H(QR)$ with $Q$ unitary gives us quite a family of % ``square roots''. Because of its signature, $x^TAx = 1$ gives an % ellipse. Taking the square root of $A$ gives an alternative basis % where the ellipse becomes the unit sphere. Rotating and reflecting % this sphere leaves it invariant, hence we have the group of unitary % matrices $Q$ as the symmetry group of square roots. % So why choose $L$? The first and most obvious reason is of course a % pragmatic one. Algorithms involving triangular matrices are % reasonably fast compared to dense and orthogonal ones. But there is % a deeper reason, which is intimately linked to the algebraic % properties of triangular matrices: a Cholesky factorization can be % seen as a ``causal square root''. It is the one factorization that % preserves a specific kind of order. % The basis transform to bring the quadratic form in normal form is % $x' = L^Tx$. Now, we saw before that lower (or upper) triangular % matrices preserve some nesting property of subspaces. This nesting % property is an order relation intimitely linked to the concept of % time, for which causality determines an order relation of time % instances. It is exactly this property (symmetry) which allows % algorithms involving these matrices to be formulated using the % principle of induction. % decategorification of the category of finite sets + definition of % integers + order relation % now i start to sound as a real burgie % Where do we see this causality clearly? In statistical signal % processing. Suppose we have a noise processes $u_k$ and want to % capture its ``essence''. 
In order to do this, we need to make some % assumptions that make our life a lot easier, but are not too far % fetched. % We assume it is stationary, meaning its statistical properties do % not vary over time. The statistical properties can then be expressed % in terms of expected values of products of $u_k$ from different time % instances, which only depend on the difference of 2 time instances, % and not their absolute value. % In other words, time is a torsor. % We assume it is zero mean or $\E(u_k) = 0$. So we can forget about % the first degree term. We assume it is gaussian, which means we % forget about all degrees larger than $2$. This leaves us a second % order representation of the statistical properties, or, a quadratic % form, which we all learned to love. % Suppose we denote $u$ as the vector with each component $i$ a % statistical variable representing the observation at time instant % $i$ before now, we get a variance--covariance matrix given by $U = % \E(uu^T)$. To state the obvious, this is of course only equal to % $\E(u)\E(u)^T$ if all the components are independent, which is the % same as saying there is no correlation over time. If there is such a % correlation, it will be represented by the off--diagonal elements of % $U$. % Now, what is $L$ in $U=LL^T$? It is a certain \emph{causal} model % for the dependence of the components of $u$, such that, the modeled % \emph{features} $u' = L^Tu$ are \emph{independent}. If you draw this % as a graph, the interdependency network, or information flow from % $u$ to $u'$ only goes from past to future, not the other way around. % Note that not in all systems (square root algorithms) it has this % kind of interpretation. Since, due to it's complexity properties, % the Cholesky factorization can be a good for any kind of % decorrelation where we impose (force) some kind of causal or % conditional statistical structure. % BAYES! % To stay in the statistical framework, we explore the concept of % Cholesky decomposition combined with \emph{Schur complements}. % Let's have a look at conditional probability, Schur complements, and % the accumulation of knowledge (the Kalman filter). A multivariate, % zero mean gaussian distribution can be written as $p(x) = % \exp{-x^TAx}$, with $A > 0$ the variance--covariance matrix. This % has a Cholesky decomposition $A=LL^T$. This means that $p(x') = % \exp{-x'Tx'}$, with $x' = L^T x$ is a decoupled multivariate model. % It is in fact a recurrent Bayesian network. Since we explain the % variance of one term in terms of previous (co)variance and new % information. To see this, we have a look at the $2$--dimensional % case, which will bring us to Schur complements. Take % \begin{equation} % S = \matrix{E(xx) & E(xy) \\ E(yx) & E(yy)}. % \end{equation} % Computing the Cholesky factorisation of this gives $S = LL^T$ with % \begin{equation} % L = \matrix{\sqrt{E(xx)} & 0 \\ E(xy)/\sqrt{E(xx)} & \sqrt{E(yy) - E(xy)^2/E(xx)}}. % \end{equation} % NOTE; for triangular matrices, the commutator is not zero, but it is nilpotent, % They are very asymmetric, but are square roots of very symmetric % things. This pops up in AR modeling for instance. What about the % relation of $LL^H$ to $L+L^H$? Compare $a^2 + 2a + 1 = (a+1)^2$ and % $LL^H + L + L^H + I = (L+I)^H(L+I)$? For LTT matrixes this sort of % makes sense, since the diagonal is constant. 
% one more thing: LTI: polynomials (commutative), LTV: not commutative (because of time-varying rotations)

Entry: Trace and Determinant Date: Fri Jun 19 11:14:12 CEST 2009 Type: tex Trace and determinant are two group homomorphisms from square $N \times N$ matrices to the base field. The trace preserves the addition operator $\tr(A) + \tr(B) = \tr(A + B)$, and the determinant preserves the multiplication operator $\det(A)\det(B) = \det(AB)$. Note the determinant formula also works for determinant zero, but in that case it is no longer a group homomorphism. With respect to the eigenvalues we have $\tr(A) = \sum_k \lambda_k$, and $\det(A) = \prod_k \lambda_k$. They are related through the exponential function as $\exp(\tr(A)) = \det(\exp(A))$. The trace is interesting to define an inner product between matrices, when considering a space of matrices as a normed vector space. For the space of $M \times N$ matrices, one could define the inner product $\left< A, B \right> = \tr(B^HA)$. For $N=1$ or $M=1$ this is the ordinary inner product for column or row vectors. We also have $\left< A, A \right> = \|A\|^2_F = \sum_{ij} |a_{ij}|^2$, the Frobenius norm. Since $\tr(rA) = r\tr(A)$, it is a linear map. Also $\tr(AB) = \tr(BA)$, which can be generalized to cyclic permutations of a product of matrices. The trace of the Jacobian of a vector function is the divergence. While the trace $\tr(A)/N = \sum_k \lambda_k/N$ can be seen as an additive average of a square $N \times N$ matrix $A$, the determinant $\det(A)^{1/N} = \prod_k \lambda_k^{1/N}$ can be seen as a geometric average. In this light, it is no surprise the determinant is related to the inverse of a matrix. More specifically $A^{-1} = \adj(A) / \det(A)$. We define the adjugate as $\adj(A) = C^T$, the transpose of the cofactor matrix of $A$. The cofactors are given by $c_{ij} = (-1)^{i+j} m_{ij}$. Here $m_{ij}$ is a minor of $A$: the determinant of the matrix obtained from $A$ by removing row $i$ and column $j$. The adjugate is the part of the inverse of $A$ which can be computed without divisions. Laplace's formula gives an expression for the determinant in terms of minors $\det(A) = \sum_k a_{1k} c_{1k}$.

Entry: Matrix Equations Date: Fri Jun 19 11:15:12 CEST 2009 Type: tex Finding transforms subject to constraints. The discrete time Lyapunov equation in the unknown $X$ is given by $$A X A^H - X + Q = 0,$$ with $Q$ Hermitian. There is a unique solution iff no eigenvalue of $A$ is the reciprocal of an eigenvalue of $A^H$. If this condition is satisfied, and $\lim_{k\to\infty} A^k = 0$, the unique solution is given by $X = \sum_k (A)^k Q (A^H)^k$. It is Hermitian since it is the sum of (infinitely many) Hermitian matrices. % rewrite as Schur complement % [Riccati].

Entry: Statistics Date: Fri Jun 19 11:16:12 CEST 2009 Type: tex \subsection{Introduction} A very brief introduction to statistics. We represent a stochastic variable, denoted by a capital letter $X$, by its \emph{probability density function} or PDF $p_X(x)$, which is a function over the reals. This is also called a probability measure. It satisfies $p_X(x) \geq 0$ and \begin{equation} \int_{\RR} p_X(x) d x = 1. \end{equation} In general we talk about \emph{distributions} instead of functions, and we allow the PDF to vanish, so we can take $\RR$ to be the domain. An event $E$ is defined as a subdomain of the domain of $X$. For example, $E = \{x : x>10\}$.
The probability for the event $E$ is then defined to be \begin{equation} P(E) = \int_{x \in E} p_X(x) d x. \end{equation} % This function can be said to reflect a \emph{belief} about the value % of $X$. The more localized $p_X(x)$ is around a value $x_0$, the % stronger our belief is the value is $x_0$. Since events are subsets, we can compute with them using the operations of union $\cup$ and intersection $\cap$. The joint probability of two events is defined as $P(A,B) = P(A \cap B)$. A conditional probability is defined as $P(A|B) = P(A \cap B) / P(B)$. Two events are said to be independent if $P(A \cap B) = P(A) P(B)$. Two events are said to be mutually exclusive if $P(A \cap B) = 0$, while at least one of $P(A)$ and $P(B)$ does not vanish. Since intersection commutes, this enables us to derive Bayes' rule \begin{equation} P(A|B)P(B) = P(B|A)P(A). \end{equation} In practice, we usually encounter several variables $X_1, \dots, X_N$ that are related. These will be represented by a \emph{joint PDF}, a multivariate function $p_{X_1 \ldots X_N}(x_1, \ldots, x_N) \geq 0$ satisfying \begin{equation} \int_{\RR^N} p_{X_1 \ldots X_N}(x_1,\dots,x_N) d x_1 \ldots d x_N = 1. \end{equation} For a bivariate distribution $p_{XY}(x,y)$ we can say the variables $X$ and $Y$ are independent if $p_{XY}(x,y)=p_X(x)p_Y(y)$. In other words, the PDF is separable. In practice it is easier to disregard the normalization factor. This is possible since most formulas are linear in $p$. It is assumed understood that normalization needs to be performed whenever one integrates over a PDF. We will use the symbol $\sim$ to indicate proportionality up to normalization.

An important concept is the \emph{expectation} operator, which returns the expected value of a function of a stochastic variable. It is defined as \begin{equation} E\{f(X)\} = \int_\RR f(x) p_X(x) d x. \end{equation} This is a linear operator. Usually one talks not directly about $p_X(x)$ but about \emph{moments}. These are the expectation values of monomials in $X$. We call $m_X = E\{X\}$ the \emph{mean}. It is the center of mass of a distribution. We call $\sigma_X^2 = E\{(X - m_X)^2\} = E\{X^2\} - m_X^2$ the \emph{variance}. The square root of the variance is called the \emph{standard deviation}, and is a measure for the spread of mass around the center.

For an event, we can define the \emph{information content} $I$, also called \emph{surprisal} or \emph{self information}. It expresses the amount of information that knowledge about an event adds to one's overall knowledge. It can be seen as a measure for the reduction in uncertainty. It depends only on the probability of an event. Events with low probability carry a lot of self--information, while events with high probability carry little self--information. It is monotonic: $P(A) > P(B) \implies I(A) < I(B)$, and for independent events $A$ and $B$ it is additive: $I(A) + I(B) = I(A \cap B)$. This gives us $I(A) = -\log(P(A))$. A measure often used is the expected value of the self--information, called the \emph{entropy} of $X$, and is defined as \begin{equation} H(A) = E\{I(A)\} = - \int_{A} p_X(x) \log p_X(x) d x. \end{equation} If the event is taken to be the entire sample space, we talk about the source entropy, which is the average amount of information an observation of $X$ brings. What does it mean for a distribution to have maximal entropy? % Entropy is related to physical systems in thermal equilibrium, where % it is maximal. It is prominent in information theory.
In short, % given a choice over a family of distributions, the one with maximal % entropy is the desirable one. More about this later. % Most physical systems tend toward a configuration of maximum entropy % over time. For a class of PDFs, the one with the highest entropy is % the one with the least amount of prior knowledge. Among all distributions with a given mean and variance, the \emph{normal distribution} is the one with maximal entropy. It is given by \begin{equation} N(x ; m_X, \sigma_X) = \frac{1}{\sqrt{2\pi\sigma_X^2}} \exp\left(-\frac{(x-m_X)^2}{2\sigma_X^2}\right). \end{equation} The multivariate version of this is $\det(2\pi A)^{-\frac{1}{2}} \exp (-\frac{1}{2} (x-b)^T A^{-1} (x-b))$, with $b$ the vector of means and $A$ the variance--covariance matrix.

Adding two stochastic variables $X$ and $Y$ produces a third stochastic variable $Z = X+Y$. Written in terms of the distributions $p_X$, $p_Y$ and $p_Z$, and the joint distribution of $X$ and $Y$ written as the bivariate function $p_{XY}$, we can compute $p_Z$ by observing that $p_Z(z)$ collects the contributions of the joint density $p_{XY}(x,y)$ over all points for which $x + y = z$. This gives $$p_Z(z) = \int_{\RR}p_{XY}(z-y,y) d y.$$ When $X$ and $Y$ are independent, their joint distribution is separable, $p_{XY} = p_X p_Y$, and the above formula reduces to a convolution. This enables us to compute the PDF in a transform domain. For this we define the characteristic function of a variable $X$ as \begin{equation} \phi_X(t) = E\{e^{itX}\} = \int_\RR p(x) e^{itx} d x \end{equation} which is the inverse Fourier transform of $p(x)$, interpreting $p(x)$ as a spectrum. The distribution of a linear combination of independent random variables $S = \sum a_i X_i$ can be obtained from the relation $\phi_S(t) = \prod \phi_{X_i}(a_i t)$. For arbitrary many--to--many functions $y = f(x)$, we can write $p_Y(y)$ as an integral of $p_X$ over the level sets $f(x) = y$. For an invertible $f$ this reduces to $p_Y(y) = p_X(f^{-1}(y)) \left| \frac{d f^{-1}(y)}{d y} \right|$.

A word on \emph{interpretation}. The practical applicability of probabilistic methods to real world situations depends on the stance one takes. The \emph{frequentist} approach deals with probabilities as relative frequencies of occurrence or relative descriptions of populations. The \emph{Bayesian} approach deals with probability as a model for \emph{belief}, and is very much related to information. Of course this deals only with the applicability of probability theory, since its development is axiomatic, and independent of interpretation.

\subsection{Parameter estimation} The problem of estimation is defined as follows. We are interested in determining the value of some unobservable parameter $\theta$, given observed data $x_0$. To this end we will use an estimator $\hat{\theta}(x_0)$ that maps the data to an estimate. If we have information about the statistics of $x$, we can judge the value of the estimator by quantities like \emph{bias}, defined as $E\{\theta - \hat{\theta}\}$, and \emph{mean square error}, defined as $E\{(\theta-\hat{\theta})^2\}$. An often used estimator is the \emph{maximum likelihood estimator}. Here $\theta$ is a parameter of a probability density function $p(x;\theta)$. We suppose $x_0$ is drawn from $p$. A means to estimate $\theta$ is to maximize $L(\theta) = p(x_0;\theta)$, which is called the likelihood function. The estimator is then $\hat{\theta}(x_0) = \arg\max_\theta p(x_0;\theta)$. Note that this usually is a biased estimator.
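To make the bias remark concrete, here is a minimal numerical sketch (assuming numpy is available; the true parameters, sample size and number of trials are made-up values): the maximum likelihood variance estimate of a Gaussian is biased by a factor $(N-1)/N$.

\begin{verbatim}
# Minimal sketch: ML estimation of Gaussian parameters, illustrating bias.
# Assumes numpy; parameter values below are made-up examples.
import numpy as np

rng = np.random.default_rng(0)
theta_mean, theta_var = 2.0, 3.0     # true (unobservable) parameters
N, trials = 10, 20000

est_var = np.empty(trials)
for t in range(trials):
    x = rng.normal(theta_mean, np.sqrt(theta_var), N)
    # ML estimator for a Gaussian: sample mean and the 1/N variance
    est_var[t] = np.mean((x - np.mean(x))**2)

# The bias E{theta - theta_hat} is approximated by averaging over trials;
# the ML variance estimate is biased by a factor (N-1)/N.
print("mean ML variance estimate:", est_var.mean())
print("expected value (N-1)/N * var:", (N - 1) / N * theta_var)
\end{verbatim}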
An important concept is \emph{Fisher information}. It is thought of as the amount of information that an observable random variable $X$ carries about an unobservable parameter $\theta$ upon which the probability distribution of $X$ depends. Using the symbol $;$ to distinguish variables from parameters, we write a parametrized distribution for the variable $X$ as $p_X(x;\theta)$. This is a function of the sample $x$ and the unobservable parameter $\theta$. First we define the \emph{score} as ${d \over d \theta} \log p(x ; \theta)$, with $L(\theta) = p(x;\theta)$ the likelihood function. The Fisher information is then defined as the variance of the score. Since the expectation value of the score is zero, the Fisher information can be written as $$I(\theta) = E\{\left[{d \over d \theta} \log p(x ; \theta)\right]^2 \}$$ % alternative form: in terms of second derivative % TODO: think about this some more.. % it's not so hard, just need to make small leap when awake.

Some problems might be \emph{ill--conditioned}. This is typical for estimators that try to recover information going against the ``arrow of time'', or opposite to the normal evolution of a physical system. Problems of this class have a large sensitivity to measurement errors. % Note: we can only talk about estimators, but for some simple % problems it is clear what the best estimator is, so the properties % of that estimator are said to be properties of the problem. This is usually quantified by means of the condition number, which reflects the ratio of relative errors in the estimate to relative errors in the data. We write $$k(\hat\theta) = \max\{ {\left|\hat\theta(x) - \hat\theta(x_0) \right| \over \left| \hat\theta(x) \right|} / {\left| x - x_0 \right| \over \left| x \right|} : |x-x_0| < \epsilon \},$$ with $\epsilon$ defining a small neighbourhood around $x_0$. For a differentiable estimator we can let $\epsilon \to 0$ and write everything in terms of derivatives $$k(\hat\theta) = {\left| \hat\theta'(x) \right| \over \left| \hat\theta(x)\right|} \left|x\right|.$$ An important special case is the linear problem $A\theta=x$, for which $\hat\theta(x)=A^{-1}x$. Rearranging the formula above we get $$k(\hat\theta) = \max\{ {\left|A^{-1}(x-x_0)\right| \over \left|x-x_0\right|} / {\left|A^{-1}x\right| \over \left|x\right|} : |x-x_0| < \epsilon \}.$$ Over all $x \neq 0$, the maximum of $\|Ax\|/\|x\|$ is the largest singular value of $A$, and the minimum is the smallest singular value. The singular values of $A^{-1}$ are the inverses of the singular values of $A$. With $\|A\|$ denoting the operator norm, $k(\hat\theta) = \|A^{-1}\| \|A\| = \sigma_\text{max}(A) / \sigma_\text{min}(A)$, which reduces to $|\lambda_\text{max}(A) / \lambda_\text{min}(A)|$ for normal matrices. % In some cases it might be interesting to use a different matrix norm % to define condition number. % canonical example: poisson distribution % TODO: entropy, bias, MMSE. other estimators. fisher. CRB. % and bayes.

\subsection{Monte Carlo} In cases where it is impractical to obtain theoretical bounds on an estimate, one can perform numerical simulations to estimate the bias and variance of an estimator. This is usually done using Monte Carlo simulation. MC methods are usually employed for evaluating an integral $$\int_{x \in D} f(x)dx$$ when the dimensionality of $D$ is so large that standard numerical methods of integration, which have a complexity $O(\exp(\dim D))$, cannot be employed in practice. As long as $f$ is well behaved, we can relate it to a stochastic sampling problem.
This is due to the central limit theorem, which talks about the sum of independent variables $S_N = \sum_{i=1}^N X_i$, where the $X_i$ are independent variables taken from the same probability distribution with mean $m_X$ and variance $\sigma_X^2$. The theorem states that the distribution of $S_N$ converges to a normal distribution with mean $Nm_X$ and variance $N\sigma_X^2$. This means that the expected value of $S_N/N$ is $m_X$, and the standard deviation of $S_N/N$ is $\sigma_X / \sqrt{N}$. We can use this to evaluate the integral above. By selecting random points uniformly in $D$ and averaging the function values, the result converges to the average of $f$ over $D$, and hence, after scaling by the volume of $D$, to the value of the integral, though the central limit theorem does not by itself provide a convergence rate. There are several tricks to speed up the process which all work by influencing the sampling so important regions in $D$ are sampled more densely. Of course, to know where to sample is to solve the integration problem. One of those methods is \emph{importance sampling}. It works by implementing a part of $f$ as the distribution $g$ of a random number generator such that $$\int_D f(x)dx = \int_D {f(x) \over g(x)} g(x) dx.$$ The integral is then evaluated as $1/N \sum_i f(x_i)/g(x_i)$, with the $x_i$ drawn from $g$. The best choice is $g(x) \sim | f(x) |$, which is of course impractical, since normalizing it requires integrating $|f|$ and is thus as difficult as the original integration. % FIXME!! If the variance of $g(x)$ is smaller than the uniform % distribution the method will converge faster. Metropolis-Hastings % algorithm quasi-MC: low-discrepancy sequences. interesting :)

\subsection{Stochastic processes} % http://en.wikipedia.org/wiki/Stochastic_process A stochastic process is a stochastic function, i.e. a stochastic variable with values in a function space. This is NOT the same as a parametrized PDF! % There are some relations though.

\subsection{Exponential modeling} % % aka harmonic retrieval -- from papy thesis chap 2 Given a sequence of observations $x_n$, we are asked to construct a model $$x_n \approx \hat{x}_n = \sum_{k=1}^L c_k z_k^n$$ that approximates $x_n$ in some optimal way. We assume the measurements and model parameters live in $\C$. Given the $z_k$, this problem is linear in the $c_k$, so it is safe to concentrate on the exponential sequences $z_k^n$. The most useful property of such sequences is that they are eigenfunctions of the shift operator $Z\{x_n\} = x_{n+1}$, or $$Z\{z_k^n\} = z_k^{n+1} = z_k z_k^n.$$ This might look entirely trivial, but it is this fact that is the basis of most of the tools used in sinusoidal estimation. We introduce the polynomial $a_1(z) = 1 - z / z_1$. If we apply it to the operator $Z$ we get $a_1(Z)\{z_1^n\} = 0$. In other words, the sequence $z_1^n$ is an element of the null space of the operator $a_1(Z)$. This means $a_1(Z)$ can ``filter out'' the sequence $z_1^n$. This is true for any $cz_1^n$, $c \in \C$, since $a_1(Z)$ is a linear operator. Similarly, we can define $a_2(z)$. An important property of polynomials in $Z$ is that they commute. This means applying first the operation $a_1(Z)$ and then the operation $a_2(Z)$ is the same as applying them in reverse order. When we apply $a_1(Z)a_2(Z)$ to $c_1z_1^n + c_2z_2^n$ we again get zero. So, for the entire model we can construct the polynomial $$a(z) = \prod_{k=1}^L (1 - z/z_k) = \sum_{k=0}^L a_k z^k$$ which has the property $a(Z)\{\hat{x}_n\} = 0$. This is called a linear prediction equation.
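To make the annihilation property $a(Z)\{\hat{x}_n\} = 0$ concrete, here is a minimal numerical sketch (assuming numpy; the poles $z_k$ and amplitudes $c_k$ below are made-up example values):

\begin{verbatim}
# Minimal check: a(z) = prod_k (1 - z/z_k), applied to the shift operator
# Z{x_n} = x_{n+1}, annihilates any signal x_n = sum_k c_k z_k^n.
# Assumes numpy; poles and amplitudes are hypothetical examples.
import numpy as np
from functools import reduce

z = np.array([0.9 * np.exp(0.3j), 0.7 * np.exp(-1.1j)])  # example poles z_k
c = np.array([1.0 + 0.5j, 2.0 - 1.0j])                   # example amplitudes c_k

n = np.arange(50)
x = (c[None, :] * z[None, :] ** n[:, None]).sum(axis=1)  # x_n = sum_k c_k z_k^n

# coefficients of a(z) in ascending powers, with a_0 = 1
a = reduce(np.convolve, [np.array([1.0, -1.0 / zk]) for zk in z])

# a(Z){x}_n = sum_j a_j x_{n+j}; should vanish up to rounding errors
L = len(a) - 1
residual = np.array([np.dot(a, x[m:m + L + 1]) for m in range(len(x) - L)])
print(np.max(np.abs(residual)))   # close to machine precision
\end{verbatim}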
Noting that the coefficient $a_0 = 1$, we can see this by rewriting the equation above explicitly in terms of the sequence $\hat{x}_n$ and its delayed versions $$\hat{x}_n = -\sum_{k=1}^{L}a_k\hat{x}_{n-k}.$$ The polynomials we defined above can be seen as the frequency domain representation of a signal. Representing any signal $x_n$ by its generating function, also called its Z--transform $x(z) = \sum_k x_k z^k$, allows us to manipulate shift and scale operations on the sequence $x_n$ in terms of multiplication of $x(z)$ by a polynomial. The only thing we need to do now is to make the approximation error explicit. We can do this by putting it in a statistical framework, devising some optimality criterion for the error, and solving the problem. Or we can work the other way, by inventing some estimator with good properties (i.e. a fast algorithm) and analyzing its statistical properties to compare with other methods. % SUBSPACE: ESPRIT (covariance EVD) LS/TLS % HSVD/HTLS (data matrix, 'square root' SVD) LS/TLS % \subsection{Entropy}

\subsection{The Kalman filter} Suppose we have a linear system which behaves like this \begin{equation} \matrix{x_k \\ y_k} = \matrix{A & B \\ C & D} \matrix{x_{k-1} \\ u_k} + \matrix{w_k \\ v_k}, \end{equation} where $x_k$ is the current state, $y_k$ the output, $u_k$ the observable input, and $w_k$ and $v_k$ are noise processes with covariance $W$ and $V$ influencing the state update and observation respectively. What we are interested in is to find an estimate for $x_k$ and its estimate covariance $X_k$. The filter operates in essentially four steps. First a prediction is made using information from the previous step. The second step is an observation. The third step is the calculation of the prediction error (the innovation), and the last step is the update. The error will be used to update the prediction, but only in proportion to a gain factor which depends on ``uncertainty''. This prediction--correction cycle is essential to its workings and indicates the combination of prediction and observation. To not overload the notation we keep the state matrix and noise matrices constant. We will set $D = 0$ for simplicity. The prediction steps are reasonably straightforward \begin{equation} \begin{array}{l} \hat{x}_{k|k-1} = A \hat{x}_{k-1|k-1} + B u_k \\ X_{k|k-1} = A X_{k-1|k-1} A^T + W. \end{array} \end{equation} During this step, the system is followed as if it was deterministic, which is expressed in the first equation. The second equation computes the estimated a priori error covariance $E[(x-\hat{x})(x-\hat{x})^T]$ as a combination of attenuation of the previous estimate by the dynamics $A$ of the system and injection of new uncertainty due to the random evolution of $w_k$, expressed by $W$. % These are just added together because of the assumption they are independent. We use the estimates we just computed to estimate the output of the system in much the same way, but subtract it from the observed output $y_k$, which gives the \emph{innovation} and an estimate of its statistics \begin{equation} \begin{array}{l} \hat{e}_k = y_k - C \hat{x}_{k|k-1}\\ E_k = C X_{k|k-1} C^T + V. \end{array} \end{equation} where we take again the same approach as for the state update. The first equation computes the error using a prediction for the output based on our prediction of the state. The second equation computes an uncertainty estimate for this prediction, which is the same as the variance of the innovation: the innovation equals $C(x_k - \hat{x}_{k|k-1}) + v_k$, and the state estimation error and the measurement noise are assumed to be independent.
Now we have some authoritative information about how we're doing so far, in the form of a prediction error and an estimate of its variance. This allows us to compute the Kalman gain $K_k$ which is used in the update step \begin{equation} \begin{array}{l} K_k = X_{k|k-1} C^T E_k^{-1} \\ \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \hat{e}_k \\ X_{k|k} = (I - K_k C) X_{k|k-1}. \end{array} \end{equation} This deserves some explanation. If $\hat{e}_k$ is small, we're on track, and the state estimate is left where it is. The biggest mystery is in the $K_k$ of course. It is obtained by minimizing the a--posteriori error covariance $X_{k|k}$. TODO: geometric interpretation.

Entry: Numeric Date: Fri Jun 19 11:17:04 CEST 2009 Type: tex When we are faced with solving problems involving real numbers, we are bound to resort to approximate methods, because of the simple fact that (digital) computers can store and manipulate only a finite amount of information, and most real numbers can only be represented by an infinite string of bits. So what we do in practice is to represent real numbers by rational ones, and keep an eye on how approximation errors propagate through a system. An often used representation is the floating point number, where $r \approx n 2^m$, with $m$ and $n$ integers over a limited range. This gives an almost constant relative precision over a large dynamic range. This leads to the concept of \emph{machine precision}, which we denote as $\epsilon$. % well--posed problems resemble physical problems. reverse problems % are usually ill--posed. this can be seen in the light of 'going % against the arrow of entropy'. information is lost in a dissipative % system, what a reverse problem tries to do is to re-construct this % information, which leads to a very high sensitivity to measurement % errors. see http://en.wikipedia.org/wiki/Well-posed_problem If a % problem is well-posed, then it stands a good chance of solution on a % computer using a stable algorithm. If it is not well-posed, it needs % to be re-formulated for numerical treatment. Typically this involves % including additional assumptions, such as smoothness of % solution. This process is known as regularization. % condition number An important concept regarding numerical properties for matrices is the condition number, $k(A) = \|A^{-1}\| \|A\|$. % forward / backward stability

Entry: Projective Geometry Date: Fri Jun 19 11:18:39 CEST 2009 Type: tex For a regular square $N \times N$ matrix $A$, and $N \times 1$ vectors $x$ and $b$, the equation $Ax = b$ is satisfied by a unique point $x = x_0$. In case $A$ has rank $N-1$, we can find a vector $x$ such that $Ax = 0$. If $b$ lies in the column space of $A$, we can find some other vector $x_0$ such that $Ax_0 = b$. The general solution is of the form $x_0 + \lambda x$, with $\lambda$ an arbitrary number. For $A$ of rank $N-k$ we can do similar things, yielding a $k$--dimensional space of solutions. The solution spaces are called affine spaces. They are \emph{shifted} vector spaces. These define the geometric entities of points, lines, planes, volumes, etc\ldots The observation to make here is that these sets are not vector spaces themselves. A vector space can act on them as a symmetry group (translation) but they are cumbersome to represent, due to the presence of $b$. What if we dispense with $b$? A way to do so without changing too much about the solution space is to construct the set of equations $(A - bI)x = 0$.
This leaves the solution space intact, but brings the point $x_0$ to the origin of a vector space again. So we can safely change notation and just drop $b$ altogether, writing $Ax = 0$ to characterise a space of solutions. What can we say about this space? If some $x_0$ satisfies $Ax_0 = 0$, we know that $\lambda x_0$ also satisfies this equation. There is not much difference between the two, so we can just as well identify this 1--dimensional space with one and the same thing, defining an equivalence class identifying all points on one line through the origin, leaving their only distinguishing quality to be their \emph{direction}. % collection = set of sets This construction is called the projective space of dimension $N-1$ derived from the vector space of dimension $N$, as a collection of 1--dimensional subspaces. It can be visualized in different ways. We can think about them as (light) rays originating from the origin, or dually as light rays starting at some point in space and arriving at the origin. We can reduce these one--dimensional rays to zero--dimensional points by intersecting them with some part of $N$--space. Two common subspaces are the unit sphere, which gives two intersection points with a ray that have to be identified, and a plane at unit distance from the origin, which gives one intersection point, but does not include the point/line/\ldots at infinity. Projective space can be seen as the extension of Euclidean space by ``joining ends''. To see what this looks like, we take two easily visualizable examples. First is the projective line, constructed from 1--dimensional subspaces of the plane. This can be identified as a circle, with opposite points identified, which remains a circle. The alternative view is the line, together with the point at infinity. Second is the projective plane, which is again a sphere with opposite points identified. It can be seen as an extension of the Euclidean plane with a line at infinity. % What are transformations in projective space? Things taking spaces % to spaces, instead of points to points. In this way, a rotation in % $N+1$ space can be seen as a translation in the projective space. % conics % Actually, i should add a bit about the Erlangen programme. The links % between different kinds of geometry and some simple matrix algebra % consequences. The only 2 i can see that are important at this moment % are euclidian and projective. hyperbolic maybe to talk about FLT. % And some things about Lie groups & all. And what kind of geometry % are triangular matrices? % notation: pqbd % $\matrix{p & q \\ b & d}$ All this is better seen in terms of Lie groups. The Euclidean group $E(n)$ can be represented by a subgroup of $GL(n+1)$, as $\matrix{R & x \\ 0 & 1}$, where $R$ is an orthogonal matrix, and $x$ a translation vector. The affine group is represented similarly as $\matrix{A & b \\ 0 & 1}$, which represents all affine transforms $x \to Ax+b$; restricting $A$ to Lorentz transformations gives the Poincar\'{e} group in the same form.

Entry: Interpolation Date: Fri Jun 19 11:19:54 CEST 2009 Type: tex Some notes about interpolation. This is basically resampling. Shannon, Nyquist, filtering and polynomial interpolation. We present 2 interpolation methods: polyphase filtering and polynomial approximation. Because we use arbitrary resampling, the standard techniques for sample rate conversion, where the ratio between original and resulting sample rate is a rational number, cannot be employed directly.
% see jos interpolation presentation for an overview % linear % allpass (only for fractional delay, because of IIR) % sinc % windowed sinc % lagrange % farrow: incremental update % thirian allpass % least squares filter design % also something odd: VT interpolation is called time-varying.. % indeed, the FIR mask changes each time, but taken as a whole it is % interpolation + resampling, is that really so? isn't it better to % view this as time-invariant, as a convolution over the whole signal? Arbitrary resampling can be understood as a 2 stage process. We first construct a continuous time signal \begin{equation} s(t) = \sum_n s_n h(t-n), \label{ctinterpol} \end{equation} using an interpolation kernel $h(t)$. This is the same as convolving $\sum_n s_n \delta(t-n)$ with $h(t)$. The effects of interpolation can be studied by examining the Fourier transform of $h(t)$. For practical purposes, we require $h(t)$ to have compact support. Causality of $h(t)$ is not really an issue in our scheme, since all the elements of $s_n$ are known in advance. \def\emax{\epsilon_{\text{max}}} \def\emin{\epsilon_{\text{min}}} Interpolating a discrete sequence $s_n$ reduces to evaluating (\ref{ctinterpol}) at a point $t'$. We set $t' = t_0' + \epsilon$, with $t_0'$ an integer or half-integer reference point we will use to preserve symmetry in our notation. $\epsilon$ spans a range of $1$, or $\emin \leq \epsilon < \emax$ and $\emax - \emin = 1$. This gives us \begin{equation} s(t_0' + \epsilon) = \sum_n s_n h(t_0' + \epsilon - n) \label{dtinterpol} \end{equation} which we will simplify using the following change of variables: $x_i = i - N/2$, $y_i = s_{t_0' + x_i}$ and $f(t) = s(t_0' + t)$. This turns (\ref{dtinterpol}) into \begin{equation} f(\epsilon) = \sum_i y_i h(\epsilon - x_i). \label{simplepol} \end{equation} We require that $h(t)$ only assumes values other than zero in the interval $-N/2 + \epsilon_{\text{min}} \leq t < N/2 + \epsilon_{\text{max}}$. This means we can limit the sum over $x_i$ to $-N/2 \leq x_i \leq N/2$, which means the index $i$ in (\ref{simplepol}) runs from $0$ to $N$. The neighbourhood grid $X$ is then $(-N/2, \ldots, N/2)$, and we can use the following matrix representation \begin{equation} f(\epsilon) = \left[ \begin{array}{c} h(\epsilon - x_0) \\ h(\epsilon - x_1) \\ \vdots \\ h(\epsilon - x_N) \\ \end{array} \right]^T \left[ \begin{array}{c} y_0 \\ y_1 \\ \vdots \\ y_N \\ \end{array} \right] \label{mdtinterpol} \end{equation} We will abbreviate (\ref{mdtinterpol}) as $f(\epsilon) = H(\epsilon)^T Y$. $H(\epsilon)$ is a function mapping the fractional displacement $\epsilon$ relative to $t_0'$ to a weighting function for the neighbourhood values $Y$. It can be understood as an infinitely large polyphase filterbank, with each filter implementing a fractional delay $\epsilon$. % This is actually 'polyphase filtering'

\subsection{Polyphase filtering} In the case where $\epsilon$ takes on a finite number of rational values $m/M$, with $m,M \in \mathbb{N}$, we can use FIR filter design techniques to determine $h(t)$, which then has the form \begin{equation} h(t) = \sum_m h_m \delta(t-m/M), \label{ppinterpol} \end{equation} with $\delta(t)$ the Dirac impulse function. The number of nonzero coefficients $h_m$ is taken to be $MN$. This is the standard approach in sample rate conversion where the ratio of 2 sample rates is a rational number $L/M$.
The filter coefficients $h_m$ are designed to interpolate the $L$ times oversampled signal in such a way that subsequent decimation by $M$ does not introduce significant aliasing artifacts. This can be done using any FIR filter design method. Because the number of taps $MN$ is rather large, one often resorts to a `windowed sinc' design method, where the filter is obtained by multiplying a function with compact support (the window) with the ideal low-pass filter which has a $\sin(t)/t$ form. By choosing different windowing functions and moving the cutoff frequency of the ideal lowpass filter, aliasing vs. bandwidth tradeoffs can be made. Note that (\ref{ppinterpol}) is still discrete. To get to a continuous function we can evaluate, we need to refine $h(t)$ further. We can do this using polynomial interpolation, which is treated below. If $M$ is large enough, low order interpolation is sufficient. In the case where we approximate $\epsilon$ by the nearest $m/M$, (\ref{ppinterpol}) is convolved with a block impulse, making the resulting $h(t)$ piecewise constant. For linear interpolation, it is convolved with a hat function, making $h(t)$ piecewise linear.

\subsection{Polynomial interpolation} An alternative to polyphase filtering for determining $h(t)$ is to use polynomial interpolation only. In this case, interpolation is performed by evaluating an $N$th order polynomial through $N+1$ points. % $C(\epsilon)$ is usually computed using the method of Lagrange interpolation. % Using this method, the function $c(t)$ can be directly computed. % figure out how that $c(t)$ is actually constructed, and see what % is wrong with my interpretation here.. % 'generalized sinc' % equivalent to sinc + binomial window In our scheme we have \begin{equation} f(\epsilon) = \sum_i a_i \epsilon^i \end{equation} with $a_i$ the coefficients of the interpolating polynomial. With $E = (1, \epsilon, \ldots, \epsilon^N)$ we can write this as $f(\epsilon) = A^T E$. This polynomial satisfies the following set of equations \begin{equation} f(x_n) = \sum_i a_i x_n^i = y_n. \end{equation} Using the Vandermonde matrix of the neighbourhood coordinates $X$ \begin{equation} V = \left[ \begin{array}{cccc} 1 & x_0 & \dots & x_0^N \\ 1 & x_1 & \dots & x_1^N \\ \vdots & \vdots & & \vdots \\ 1 & x_N & \dots & x_N^N \\ \end{array} \right], \end{equation} we can rewrite this set of equations as $VA=Y$ or $A = V^{-1}Y$. This gives $f(\epsilon) = E^T V^{-1} Y$. We move on to determining $t_0'$ and the range for $\epsilon$ which will enable us to obtain an expression for $h(t)$. Setting $W = V^{-1}$, equating $E^TWY$ with (\ref{mdtinterpol}) and requiring it to be valid for all $Y$ and $\emin \leq \epsilon < \emax$ gives \begin{equation} H(\epsilon) = W^T \left[ \begin{array}{c} 1 \\ \epsilon \\ \vdots \\ \epsilon^N \\ \end{array} \right] = W^T E(\epsilon). \label{pwpoly} \end{equation} Column $i$ of $W$ contains the coefficients of the polynomial defining a segment of $h(t)$ in the neighbourhood of $x_i$, which makes $h(t)$ piecewise polynomial, i.e.\ a spline. The segments are the center intervals of the Lagrange base polynomials $l_k$. More explicitly for segment $k$ \begin{equation} h(\epsilon - x_k) = l_k(\epsilon) = \prod_{l \neq k} \frac{\epsilon - x_l}{x_k - x_l} \end{equation} for $\emin \leq \epsilon < \emax$. An important property of $h(t)$ is continuity. This constraint translates to requiring $l_k(\emin) = l_{k+1}(\emax)$, for $k$ from $0$ to $N-1$.
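A minimal numerical sketch of this construction (assuming numpy; $N$, the neighbourhood values and $\epsilon$ below are example values): it checks that the weights $H(\epsilon) = W^T E(\epsilon)$ coincide with the Lagrange base polynomials $l_k(\epsilon)$.

\begin{verbatim}
# Minimal sketch: polynomial interpolation weights from the Vandermonde
# inverse, compared against the Lagrange basis. Assumes numpy; the grid,
# order and epsilon are example values.
import numpy as np

N = 3                                     # cubic: N+1 = 4 neighbour samples
X = np.arange(N + 1) - N / 2              # neighbourhood grid (-1.5,...,1.5)
V = np.vander(X, N + 1, increasing=True)  # V[n,i] = x_n^i
W = np.linalg.inv(V)

def H(eps):
    E = eps ** np.arange(N + 1)           # E(eps) = (1, eps, ..., eps^N)
    return W.T @ E

def lagrange(eps):
    return np.array([np.prod([(eps - xl) / (xk - xl)
                              for xl in X if xl != xk]) for xk in X])

eps = 0.2
print(np.allclose(H(eps), lagrange(eps)))   # True
y = np.array([0.1, 0.7, 0.5, -0.3])         # example neighbourhood values Y
print(H(eps) @ y)                            # interpolated value f(eps)
\end{verbatim}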
As stated above, to preserve symmetry around $0$ in $X$, which yields symmetric coefficients in $W$ useful to simplify implementation, we choose $t_0'$ integer when $N$ is even, and half integer when $N$ is odd. The range of $\epsilon$ is best chosen to be as close to $0$ as possible, in the middle of $X$, since this reduces the interpolation error. This leads us to the following choices. When $N$ is odd, we take $t_0'$ to be half integer, and set $(\emin, \emax) = (-1/2, 1/2)$. This leads to a symmetric $h(t)$. When $N$ is even, we have two optimal choices, $(\emin, \emax) = (-1,0)$ or $(0,1)$. This will yield an asymmetric $h(t)$. The splines $h(t-n)$ map the signal $s_n$ to a space of piecewise polynomials of degree up to $N$. The spectrum of $h(t)$, denoted by $H(\omega)$, is maximally flat, meaning $\frac{d^n}{d\omega^n}H(\omega) = 0$ at $\omega = 0$ for $1 \leq n \leq N$. Note that $h(t)$ is not necessarily smooth, only continuous.

\subsection{Aliasing} The interpolation kernel $h(t)$ mentioned above must be designed in a way that minimizes aliasing. In order to do this, we must know the rate at which (\ref{ctinterpol}) is sampled before we can design an optimal kernel. The sampling rate in our application is highly variable, and the approaches we presented above result in a fixed $h(t)$, where we would like this to depend on the resampling ratio. Since we do not have such a technique at our disposal, we need to find a workaround.

\subsubsection{Multiple $h(t)$} In the case where we use a finite polyphase filterbank optionally followed by low order polynomial interpolation, it is possible to design several $h(t)$ directly, each with a different cutoff frequency, and choose the most appropriate one based on $r$. If the spacing of the different interpolators is large (i.e.\ one $h(t)$ per octave), linear interpolation between neighbouring $h(t)$ might be necessary.

\subsubsection{Multiresolution} Another approach is multi stage resampling. This is often applied in fixed rate sample rate converters, where it has the interesting property of reducing the total number of filter coefficients, and so increases efficiency. This can be done efficiently when $N$ is a composite number. Since we will mostly use $N$ to be a power of $2$, our resampling algorithm for a ratio $r = Nf_n$ reduces to a sequence of $m$ resampling operations with a decimation factor of $2$, with the final step implementing the ratio $r/2^m < 1$, so it does not introduce aliasing. One possible framework for multiresolution is the Discrete Wavelet Transform (DWT).

Entry: Sequences and Polynomials Date: Fri Jun 19 11:20:07 CEST 2009 Type: tex To simplify notation we define the Z--transform for a sequence $s_n$ to be its generating function $s(z) = \sum_n s_n z^n$, which maps causal FIR filters to polynomials. This is in contrast to the engineering convention which defines it in terms of $z^{-n}$, so all statements about poles and zeros of transfer functions should be interpreted in this respect. With causal signals we mean $s_n = 0$ for $n<0$. For a non--causal signal, the Z--transform is the Laurent series, which is assumed to converge on the unit circle. This means the generating function has no poles on the unit circle. The Fourier series is defined as $c_n = r^n s_n$, with $f(\omega) = \sum_n c_n e^{jn\omega} = s(r e^{j \omega})$, and $r$ in the annulus of convergence. In most cases $s(z)$ is bounded on the unit circle and we can set $r=1$.
The bivariate version of the Z--transform relates semi--infinite matrices to a bivariate complex function $m(x,y) = \sum_{i,j=0}^{\infty} m_{ij} x^i \overline{y}^j$. We call $m(x,y)$ the generating function of the matrix $[m_{ij}]$. Below we will elaborate on some finite approximations of the Z--transform, where we associate polynomials and matrices to finite sequences.

\subsection{Periodic signals} The Z--transform can be used as a frequency domain representation of infinite length sequences with finite energy. It maps shift invariant operators to multiplication in the transform domain. For periodic signals, the Z--transform does not converge. Similar tools can be constructed however, by using vectors in $\mathbb{C}^N$ to represent a single period of an $N$--periodic signal. The circular down shift operator is then defined as the permutation operator \begin{equation} Z = \left[ \begin{array}{ccccc} 0 & 0 & \hdots & 0 & 1 \\ 1 & 0 & \hdots & 0 & 0 \\ 0 & 1 & \hdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \hdots & 1 & 0 \\ \end{array} \right], \end{equation} which satisfies $Z^0 = Z^N = I$. The powers of the circular down shift operator form an $N$--dimensional basis for circulant $N \times N$ matrices. We can associate an order $N-1$ polynomial $a(z) = \sum_{k=0}^{N-1} a_k z^k$ with a circulant matrix $A = \sum_k a_k Z^k$. The first column of $A$ contains the coefficients of $a(z)$. Define the eigenvalues and eigenvectors of $Z$ by \begin{equation} ZW_k = w_{-k}W_k. \label{shifteig} \end{equation} We have $Z^nW_k = w_{-k}^nW_k$. From $Z^N=I$ we get $w_k^N = 1$, which gives the $N$ solutions $w_k = e^{j2\pi k/N}$ and $W_k = (1, w_k, \ldots, w_k^{N-1})$. We also have $AW_k = a(Z)W_k = a(w_{-k})W_k$, which gives us the eigenvalues of $A$ as the discrete Fourier transform of the coefficients of $a(z)$. This allows us to write the eigenvalue decomposition as \begin{equation} A = FCF^H = \frac{1}{N} \sum_k c_k W_k W_k^H = \sum_k c_k F_k, \label{circeig} \end{equation} with $C$ a diagonal matrix containing the eigenvalues $c_k = a(w_{-k})$ and $F$ containing the normalized orthogonal Fourier basis with Vandermonde columns $W_k / \sqrt{N}$. This means the circulant matrices are similar to diagonal matrices, with the similarity transform equal to the Fourier transform. The diagonal $N \times N$ matrices form a commutative unital ring over $\mathbb{C}$. This ring has a general matrix representation $QDQ^{-1}$, of which the circulant matrices are the special case that can be represented using the basis of shift operators $Z^k$. This case is of special interest to us, since it gives two isomorphic algebras with multiplication defined respectively as circular convolution and componentwise multiplication. % The polynomial multiples of $z^N-1$ form an ideal in \mathbb{C}[x] % Two polynomials $a(z)$ and $b(z)$ are congruent if their difference % is in I, or $a(z)-b(z) = q(z)(z^N-1) The equivalence classes for % this congruence relation are isomorphic to the circulant (and % diagonal) matrices Note that $Z^{k+N} = Z^k$, so the matrix polynomials $a(Z)$ are isomorphic to the factor ring of polynomials modulo $z^N-1$. In this ring the multiplicative unit element $1 = z^N$ corresponds to the $N \times N$ unit matrix $I=Z^N$. The prime ideals $I_k$ are generated by $w_k(z) = 1 - w_kz$ and correspond to matrices with $c_k = 0$. Since $\prod_k w_k(z) = 1 - z^N = 0$, the nonzero elements of the prime ideals are all zero divisors.
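As a quick numerical sanity check of the eigenvalue decomposition above, here is a minimal sketch (assuming numpy and scipy; the coefficients of $a(z)$ are made-up values): the eigenvalues of the circulant matrix $A = a(Z)$ are the DFT of its first column, with the Fourier vectors $W_k$ as eigenvectors.

\begin{verbatim}
# Minimal check: eigenvalues of a circulant matrix = DFT of its first column.
# Assumes numpy and scipy; the coefficients are example values.
import numpy as np
from scipy.linalg import circulant

a = np.array([1.0, 0.5, -0.25, 0.1])   # example coefficients of a(z)
A = circulant(a)                       # first column = coefficients of a(z)
c = np.fft.fft(a)                      # candidate eigenvalues c_k = a(w_{-k})

# each Fourier vector W_k is an eigenvector of A with eigenvalue c_k
N = len(a)
W = np.exp(2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)
print(np.allclose(A @ W, W * c))       # True
\end{verbatim}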
Each rank $1$ circulant Hermitian projection matrix $F_k = \frac{1}{N}W_k W_k^H$ in the eigenvalue decomposition of $A$ corresponds to the polynomial \begin{equation} f_k(z) = \frac{1}{N} \sum_{l=0}^{N-1} (w_k z)^l = \frac{1}{N} \prod_{l \neq k} (1 - w_l z). \end{equation} The matrix $F_k$ projects on the one--dimensional space generated by $W_k$. These matrices satisfy $F_k F_l = \delta_{kl} F_k$. For the polynomials we have $f_k(z) f_l(z) = f_k(z) \delta_{kl}$. The analog of (\ref{circeig}) then becomes \begin{equation} a(z) = \sum_{k=0}^{N-1} c_k f_k(z). \end{equation} If the matrix $A$ is full rank, meaning all $c_k \neq 0$, its inverse exists and is given by $A^{-1} = \sum_k c_k^{-1} F_k = FC^{-1}F^H$. For the polynomials we can write $a(z) = a_0 z^m \prod_n (1-z_nz)$. This has an inverse in the ring of polynomials modulo $z^N-1$ if all of its factors have an inverse. The inverse of the factor $z^m$ is $z^{N-m}$. The inverse of the factor $1 - a z$ exists if $a^N \neq 1$. Using $(az)^N = a^N$, we find it to be \begin{equation} (1-az)^{-1} = \frac{1}{1-a^N}\sum_{k=0}^{N-1} (az)^k. \end{equation} The factor $1/(1-a^N)$ can be understood as the effect of the wrapping of the infinite exponential tail. This makes the ARMA model $b(z)/a(z)$ well defined for periodic signals represented by polynomials modulo $z^N-1$, as long as $a(z)$ does not vanish at the $w_k$. %\footnote{The the stability requirement $|a| < 1$ is not necessary for periodic signals.}. The limit case of $N$ tending to $\infty$ gives us the Z--transform of $a_n = a^n$, which converges for $|az| < 1$. % We could abuse notation and define $(1 - w_kz)^{-1} = f_k(z)$, but % remembering that it is a projection operator, not an inverse. This % makes sense, because $f_k(z) = \frac{1 - z^N}{1 - w_k z} \mod 1 - % z^N$. % As a side note, it might be interesting to look at matrices of the % form $W_2^TDW_1$, with $W_1$ and $W_2$ biorthogonal wavelet bases. % So: shift-invariant operations -> exponentials. What is the equivalent for wavelet bases? Note that the polynomial ring we talked about is over $\mathbb{C}$, which is algebraically closed, so quotients like the ones above, modulo a polynomial of degree larger than one, can never be fields. When we take $\mathbb{R}$ as the starting field, we can construct fields as quotients of $\mathbb{R}[x]$, using maximal ideals generated by irreducible polynomials. For $z^2 + 1$ we get the complex numbers. Since the irreducible polynomials over $\mathbb{R}$ have degree at most two, this is essentially the only new field obtained this way: for even $N > 2$ the polynomial $z^N + 1$ is already reducible over $\mathbb{R}$ (it factors into real quadratics), and the quotient is a product of copies of $\mathbb{C}$ rather than a field. Polynomials modulo $z^2 - 1$, which is reducible, give the ring of split complex numbers. % what is the significance of this?

\subsection{Finite signals} Instead of working with circular shifts and finite periods, we can work with finite signals. Take $Z$ to be the nilpotent $N \times N$ down--shift matrix \begin{equation} Z = \left[ \begin{array}{ccccc} 0 & 0 & \hdots & 0 & 0 \\ 1 & 0 & \hdots & 0 & 0 \\ 0 & 1 & \hdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \hdots & 1 & 0 \\ \end{array} \right], \label{shiftdown} \end{equation} satisfying $Z^N = 0$. Together with $Z^0 = I$, the $Z^k$ form a basis for lower triangular Toeplitz matrices (LTT)\footnote{ which can still be handled by FFT algorithms by embedding them in $(2N-1) \times (2N-1)$ circulant matrices. % Take for example % \begin{equation} % C = left[ % \begin{array} % T 0 % \end{array} % \end{equation} }.
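The embedding trick mentioned in the footnote can be checked numerically. A minimal sketch (assuming numpy and scipy; the coefficients and signal are example values): an LTT matrix-vector product is the first $N$ samples of a linear convolution, so it can be computed with a zero-padded FFT of length $2N-1$.

\begin{verbatim}
# Minimal check: LTT matrix-vector product A x, with A = a(Z) and Z the
# nilpotent down-shift, via a (2N-1)-point circular (FFT) convolution.
# Assumes numpy and scipy; coefficients and signal are example values.
import numpy as np
from scipy.linalg import toeplitz

a = np.array([1.0, -0.8, 0.3, 0.1])    # example coefficients of a(z)
x = np.array([2.0, 1.0, -1.0, 0.5])
N = len(a)

A = toeplitz(a, np.zeros(N))           # LTT: first column a, first row (a_0,0,...)
direct = A @ x

M = 2 * N - 1                          # circulant embedding size
fast = np.fft.ifft(np.fft.fft(a, M) * np.fft.fft(x, M)).real[:N]
print(np.allclose(direct, fast))       # True
\end{verbatim}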
Note that each matrix $M$ which commutes with $Z$ is necessarily LTT, since these are the only matrices for which shifting the rows down ($ZM$) and shifting the columns left ($MZ$) give the same result. With each polynomial of degree smaller than $N$ we can associate an LTT matrix $A = a(Z)$. Relating this to the ring of polynomials over $\mathbb{C}$, the LTT matrices are isomorphic to the quotient ring obtained by the ideal generated by $z^N$, the polynomials truncated to degree $N-1$. In this quotient ring, the zero divisors are given by multiples of $z$, which are nilpotent\footnote{ Note that $a(0)$ is the eigenvalue of $A$ with algebraic multiplicity $N$.}. A polynomial $a(z)$ is invertible if all its factors are invertible. If $a(0) \neq 0$, it is invertible and can be written as $a(z) = a(0)\prod_n (1-z_nz)$. The inverse of $1-az$ is given by \begin{equation} (1-az)^{-1} = \sum_{k=0}^{N-1} (az)^k, \end{equation} which tends to the Z--transform of $a^n$ when $N \to \infty$. % Constructing a basis for any $N \times N$ matrix $M$ can be done in % the following way.

\subsection{Matrix generating function} \def\y{\overline{y}} Now we extend this to two dimensions. The interpretation of the inverse $x^{-1}$ depends on the ring we work in. Suppose this to be the polynomials modulo $z^N$, and associate with matrix $M$ a bivariate function \begin{equation} m(x,y) = \sum_{ij} m_{ij} x^i \y^j = X M Y^H, \end{equation} with row vectors $X = (1,x,\ldots,x^{N-1})$ and $Y = (1,y,\ldots,y^{N-1})$. This associates the term $x^i \y^j$ to the matrix with a $1$ on the $i$th row and $j$th column, counting from zero. Using the notation $Z^{Tj} = (Z^T)^j$, this matrix can be written as \begin{equation} Z^{i} Z^{Tj} - Z^{(i+1)} Z^{T(j+1)} = Z^{i} (I - ZZ^T) Z^{Tj}. \end{equation} When $N \to \infty$ we have the generating function for the semi--infinite matrix $M$. Using this representation, the LTT matrix $A = a(Z)$ corresponds to \begin{equation} a(x,y) = a(x) \sum_{k=0}^{N-1} (x \y)^k = a(x) (1 - x \y)^{-1}, \end{equation} while $A^H$ corresponds to $\overline{a(y)}(1-x\y)^{-1}$. % These relations are tied to the product of Toeplitz matrices. % Furthermore we have $(1-x\y)^{-1} = XY^H$, $x(1-x\y)^{-1} = XZY^H$, % and $\y(1-x\y)^{-1} = XZ^TY^H$. % A seperable matrix $S$ can be associated to a product of two % polynomials $s(x,y) = p(x) q(y)$. % Now take the bivariate polynomial $b(z,z')$ and associate it with % the matrix polynomial $b(Z,Z^T)$. % Now consider the mapping $a(z) \to a(Z) a(Z)^H$. % The Jordan form ... % We can associate with $\overline{a(1/\bar{z})}$ the upper triangular % Toeplitz (UTT) matrix $A^H = \overline{a(Z^T)}$. A positive definite % toeplitz matrix can then be written using its Cholesky factorization % $A = L L^H = l(Z)\overline{l(Z^T)}$. % This brings us to the concept of displacement rank. RICCATI - % BERNOULLI - WIENER/HOPF - KALMAN FILTER ( kailath berkely lecture ) % NOTE: F^2 = \tilde{I} (reversal) displacement rank: sum of cholesky % factorizations, chol = LL^H, with L represented by a polynomial

\subsection{Orthogonal polynomials on the unit circle} In the general case, $s_n$ is a causal sequence in $l_p$, and $s(z) = \sum_{n=0}^{\infty} s_n z^n$ belongs to the Hardy space $H_p$. $r(z) = s(z)\overline{s(1/\bar{z})}$ satisfies $r(e^{j\omega}) = |s(e^{j\omega})|^2 \geq 0$. We can define an inner product for the weighted Hilbert space of polynomials on the unit circle, using the measure $d \mu = r(z)d \omega$ with $z=\exp(j\omega)$.
It is given by \begin{equation} \left< g, h \right> = \frac{1}{2 \pi}\int_{-\pi}^{\pi} \overline{g(z)} h(z) d \mu. \label{polyprod} \end{equation} The time domain equivalent space is the space spanned by shifts of the sequence $s_n$. This means $h_n * s_n$ maps to $h(z)=\sum_n h_n z^n$. By incorporating the effect of $s_n$ in the measure, also called the spectral measure, we can work with the polynomials only. The time domain equivalent of the inner product is \begin{equation} \left< g, h \right> = \sum_n \bar{s}^g_n s^h_n, \label{polyprodtime} \end{equation} with $s^g_n = s_n * g_n$ and $s^h_n = s_n * h_n$ the residuals after filtering $s_n$ with $g_n$ and $h_n$ respectively. This inner product defines a norm $\| h(z) \|^2 = \left< h, h \right>$, which gives the energy of the residual after filtering $s_n$ with $h_n$. In this interpretation, we write the autocorrelation of $s_n$ in terms of inner products of monomials in this Hilbert space; setting $g(z)=z^l$ and $h(z)=z^k$ we get \begin{equation} \left< z^l, z^k \right> = \sum_i \bar{s}_{i-l} s_{i-k} = r_{k-l}, \end{equation} which defines the Gram matrix of the base of monomials $\{ z^k \}$. We can construct an orthogonal base for this Hilbert space, by applying the Gram--Schmidt orthogonalization procedure to the base of monomials. This will solve the least squares linear prediction problem.

To see this, we formulate the effect in the time domain using linear algebra notation. We define the infinite dimensional column vectors $S_n$ to be the signal $s_n$ delayed by $n$ and construct the $\infty \times L$ Hankel matrix $S$ by joining the columns $S_1$ to $S_L$. Setting $A = (a_1, \ldots, a_L)$ and $E$ the infinite dimensional column vector containing the prediction error, we require $A$ to be the solution to the least squares problem $S A = -S_0 + E$, minimizing $\| E \| ^2$. The vector $E$ is orthogonal to the column space of $S$. By rearranging this equation to $\left[S_0 S\right]\left[1 A\right]^T = E$, and observing that the norm $\| E \|$ in the time domain space is the same as the norm $\|a(z)\|$ in the frequency domain, with $a(z) = \sum_{n=0}^{L} a_n z^n$ and $a_0 = 1$, we need to find a minimal norm polynomial $a(z)$. For this polynomial, the time domain orthogonality translates to the frequency domain orthogonality to the span of $\{z,z^2,\ldots,z^{L}\}$. Defining $\phi(z) = z^L \overline{a(1/\bar{z})} = a^*(z)$ gives a monic polynomial orthogonal to the span of $\{1,z,\ldots,z^{L-1}\}$. The time domain equivalent is a `backward predictor', which produces a residual $F$ that is orthogonal to the span of $\{ S_0, \ldots, S_{L-1}\}$. It is trivial to show that $\|a(z)\| = \|\phi(z)\|$, so we can concentrate on the monic polynomial $\phi(z)$. In a Hilbert space, the monic polynomial of minimal norm is the orthogonal one. So our problem reduces to constructing an orthogonal base of polynomials $\phi_n(z)$ using the Gram--Schmidt procedure. The crucial observation is that there exists a recurrence relation between the orthogonal polynomials $\phi_n(z)$, similar to orthogonal polynomials on the real line. This will provide us with a fast algorithm to compute the predictor of order $L+1$ in terms of the predictor of order $L$. Define the partial correlation coefficients as \begin{equation} \rho_{n+1} = \frac{\left< z^{n+1}, a_n \right>}{\|a_n\|^2} \end{equation} These express the correlation between the $n$th order prediction error and the signal delayed by $n+1$.
We can write this as \begin{equation} \rho_{n+1} = \frac{\left< z\phi_n, \phi_n^* \right>}{\|\phi_n\| \|\phi_n^*\|}, \label{parcorlev} \end{equation} since $a_n(z)$ is orthogonal to all the terms in $z\phi_n(z)$ except $z^n$. The recurrence relation for the $\phi_n(z)$ is given by \begin{equation} \left[ \begin{array}{c} \phi_n(z) \\ \phi_n^*(z) \\ \end{array} \right] = \left[ \begin{array}{cc} 1 & - \bar{\rho_n} \\ - \rho_n & 1 \\ \end{array} \right] \left[ \begin{array}{cc} z & 0 \\ 0 & 1 \\ \end{array} \right] \left[ \begin{array}{c} \phi_{n-1}(z) \\ \phi_{n-1}^*(z) \\ \end{array} \right]. \label{szego} \end{equation} This can be shown by expanding the relation for $\phi_n(z)$ in terms of all $\phi_k(z)$, with $k < n$, which gives the Gram--Schmidt procedure. The time domain equivalent of the recurrence relation gives the lattice filter realization in terms of the partial correlation coefficients. The time domain equivalent of the combination of (\ref{parcorlev}) and (\ref{szego}) is the Levinson algorithm, which exploits the symmetry in (\ref{linpred}). \def\car{Carath\'{e}odory } % \subsubsection{\car and Schur functions} % From a symmetric sequence $s_n$ construct the Toeplitz matrices $S_n = \left[ s_{|i-j|}\right]_{i=0,j=0}^{n}$. Now the $S_n$ are positive definite iff the Laurent series of $s_n$ is nonnegative on the unit circle, which means it is a power spectral density function, with in particular $\text{Im}(s(z)) = 0$ for $|z| = 1$. \car linked this to the function $c(z) = s_0 + 2 \sum_{n>0} s_n z^n$, which is equal to the Laurent series of $s_n$, except it replaces all terms in $z^{-n}$ by $z^n$. This function is positive real iff $s(z)$ is nonnegative on the unit circle. Using the bilinear transform, a \car function can be transformed into a Schur function. % Define a \car function as a positive real function $\text{Re } c(z) \geq 0$, for $|z|<1$.

Entry: Spectral Factorization Date: Fri Jun 19 11:21:37 CEST 2009 Type: tex We will have a brief look at spectral factorization and linear prediction. A filter $h(z)$ is minimum phase if it is causal and its inverse $1/h(z)$ exists and is causal. Robinson's energy delay theorem states the minimum phase sequence $h_n$ has the property of minimum energy delay, which means most of its energy is packed towards $n=0$. We write $h(z)$ as \begin{equation} h(z) = \sum_n h_n z^n = h_0 \prod_i (1 - z_iz). \end{equation} If the sum is finite $h(z)$ is a polynomial corresponding to a FIR filter. Using the `mathematical' Z--transform, the minimum phase property requires $|z_i| < 1$, which means the zeros $1/z_i$ are outside the unit circle. Any causal filter $h_n$ can be made minimum phase without changing its magnitude response, by reflecting its zeros outside the unit circle. To see that this indeed produces the same magnitude spectrum, we write the minimum phase version as \begin{equation} h'(z) = h(z) \prod_{i \in I} \frac{z - \bar{z}_i}{1 - z_i z}, \end{equation} with $I$ the set of indices satisfying $|z_i| > 1$. This maps all zeros $1/z_i$ inside the unit circle to $\bar{z}_i$. $h'(z)$ is a polynomial if $h(z)$ is a polynomial. $h'(z)/h(z)$ is an all--pass filter, which has a magnitude response of $1$. This operation preserves the autocorrelation \begin{equation} r(z) = h(z) \overline{h(1/\bar{z})} = h'(z) \overline{h'(1/\bar{z})}, \label{autocorr} \end{equation} since this just switches around zeros in its factorization \begin{equation} r(z) = h_0\bar{h}_0 \prod_i (1 - z_i z) (1 - \bar{z}_i / z).
\end{equation} If $h'(z)$ is causal minimum phase, then its inverse $a(z) = 1/h'(z)$ exists, and we can write \begin{equation} r(z)a(z) = \overline{h'(1/\bar{z})} \label{specfac} \end{equation} which is anti--causal, so the constant term in the Laurent series of (\ref{specfac}) is $\bar{h}'_0 = 1/\bar{a}_0$, and all the other terms corresponding to $z^n$, $n>0$ are zero. If the order of $a(z)$ is $L$, this gives us a set of $L+1$ equations to obtain the $a_i$. The normal form can be obtained by dividing the unknowns and the right hand side by $a_0$. Using $\gamma_L^2 = 1/a_0\bar{a}_0$ this gives the well known symmetric set of Toeplitz equations \begin{equation} \left[ \begin{array}{cccc} r_0 & r_1 & \ldots & r_L \\ r_1 & r_0 & \ldots & r_{L-1} \\ \vdots & & & \vdots \\ r_L & r_{L-1} & \ldots & r_0 \\ \end{array} \right] \left[ \begin{array}{c} 1 \\ a_1 \\ \vdots \\ a_L \\ \end{array} \right] = \left[ \begin{array}{c} \gamma_L^2 \\ 0 \\ \vdots \\ 0 \\ \end{array} \right] \label{linpred} \end{equation} The first row gives the prediction error energy in terms of the $a_i$, while the other rows determine the $a_i$ in terms of the autocorrelation.

In the part on the exponential model for $h_n$, we had a remark about the starting phase of the approximation of $p_n$. Spectral factorization allows us to avoid that problem by modeling the minimum phase part of a sequence as a sum of exponentials. Let's have a look at the error. Let $P_k$ denote the vector with the elements of $p_n$ shifted up by $k$. The shift is done either in a cyclic way, or by adding zeros. Assume $p_n$ is real. Let $A = [a_1 \dots a_L]^T$, $E = [e_0 \ldots e_{N-1}]^T$, and $P = [P_1 \ldots P_L]$. The linear prediction problem solves \begin{equation} PA = -P_0 + E \label{lsar} \end{equation} in a least squares sense, minimizing $E^TE$. The solution is given by pre--multiplying this equation by $P^T$. Since $P^T P$ is invertible, we can write \begin{equation} A = - (P^T P)^{-1} P^T P_0, \end{equation} with $P$ the Hankel signal matrix and $P^TP$ the symmetric positive definite signal autocorrelation matrix. The solution $A$ projects $-P_0$ on the column space of $P$. This can be solved using the Levinson--Durbin algorithm. An alternative approach is the Schur algorithm, which computes $a(z)$ in terms of its lattice filter representation directly. Both have complexity $O(L^2)$. Superfast $O(L\log L)$ algorithms exist, but they are only better for large $L$. % In the framework of linear prediction of a WSS process, the error is % well defined. The predictor for the WSS process behaves as a Minimum % Mean Square Error (MMSE) predictor, minimizing $E(e_k^2)$. In the % case where the WSS process is an autoregressive process of order % $L$, we have in addition $E(e_k, e_l) = 0$, $k \neq l$.

The factorization can be computed analytically by separating the series expansion \begin{equation} \log r(z) = \sum_{n \geq 1} \frac{(-1)^{n+1}}{n}[r(z)-1]^n = \sum_n u_n z^n, \end{equation} which converges when $0 < |r(z)| < 2$ on the unit circle. The lower bound means the corresponding Toeplitz matrix is nonsingular, the upper bound can be ensured with proper scaling. We separate the causal and anticausal part \begin{equation} \log r(z) = \left( \frac{u_0}{2} + \sum_{n>0} u_n z^n \right) + \left( \frac{u_0}{2} + \sum_{n<0} u_n z^n \right), \end{equation} or $\log r(z) = u^{+} (z) + u^{-} (z)$. Because $r(z) = r(1/z)$ we have $u^{-} (z) = \bar{u}^{+}(1/z)$. Taking the exponential gives us $r(z) = h'(z) \bar{h}'(1/z)$. This preserves causality of $h'(z) = \exp(u^{+}(z))$.
% Which amounts to the Levinson algorithm for solving the Toeplitz set % of equations (\ref{linpred}). If $\left< g, h \right>$ is zero, the % residuals after filtering are orthogonal. % This algorithm is based on the observation that extending the matrix % with one row and column, and extending the vector of unknowns with a % zero produces a system of equations which has the same form when the % order of unknowns and right hand side is reversed. Taking the % weighted sum of the original and reflected set of equations, % weighted by $(1, \rho_{L+1})$, allows us to parametrize the solution % for $L+1$ in terms of the solution for $L$. The weighting parameter % $\rho_{L+1}$ is determined by requiring the right hand side of the % weighted sum to have the same form as (\ref{linpred}). % This algorithm can be restated in the frequency domain in terms of % polynomials. % Using linear algebra arguments, the Levinson algorithm can be % explained as follows. We write the equation in (\ref{linpred}) as % $R_L a_L = e_L$. By adding an extra row and column to the Toeplitz % matrix $R_L$ we obtain $R_{L+1}$, and we get % \begin{equation} % R_{L+1} % \left( % \left[ % \begin{array}{c} % a_L \\ % 0 % \end{array} % \right] % + \rho_{L+1} % \left[ % \begin{array}{c} % 0 \\ % a_L' % \end{array} % \right] % \right) % = % \end{equation} % To relate this to Gram--Schmidt orthogonalization in the infinite % dimensional vector space spanned by shifts of $s$, we compute an % orthogonal base with the weighting matrix $W$ in (\ref{gramsmidth}) % equal to $S^H S$ and the $\delta_i$ equal to the columns of % $I_\infty$, we have % \begin{equation} % f_n = \delta_n - \sum_{i=1}^{n-1} \frac{f_i^H S^H S \delta_n}{f_i^H S^H S f_i} f_i. % \end{equation} % With the vectors $s_n = S\delta_n$ the delayed signal and $e_n = % Sf_n$ the prediction error this becomes % \begin{equation} % f_n = \delta_n - \sum_{i=1}^{n-1} \frac{e_i^H s_n}{e_i^H e_i} f_i. % \end{equation} % The trick of the Levinson algorithm is to update the inner products % from $n$ to $n+1$, by taking into account one extra term of the % autocorrelation sequence at each step. The frequency domain version % of the above equation is % \begin{equation} % f_n(z) = z^n - \sum_{i=1}^{n-1} \frac{\left< f_i, z^n \right>}{\left< f_i,f_i \right> } f_i(z). % \end{equation} % We can relate this to the QR decomposition of the $\infty \times % L+1$ Hankel matrix $S_L$ which has as its columns the delayed versions % of the signal $s_n$. Entry: Autoregressive Model Date: Fri Jun 19 11:22:40 CEST 2009 Type: tex \subsection{AR modeling for periodic signals} In the case where $s_n$ is periodic with period $N$, the polynomials are taken modulo $z^N-1$, and we can use the finite vector space $\mathbb{C}^N$ as the alternative of the Hilbert space of polynomials on the unit circle. With $S$ the circulant matrix constructed from $s_n$, the time domain weighted inner product for the space spanned by the columns of $S$ can be written as \begin{equation} \left< g, h \right> = g^H S^HS h. \end{equation} Because $S$ is circulant, we can express it as $S = FCF^H$, with $C$ a diagonal matrix. The equivalent frequency domain inner product then becomes \begin{equation} \left< g, h \right> = g^H F C^HC F^H h. \end{equation} Here $F^H h$ is the analogue of $h(z)$, and the diagonal matrix $R = C^HC$ is the analogue of $r(z)$. Finding the equivalent of a base of orthogonal polynomials stated above can be restated as an inverse QR problem. Note that the columns of $F$ are also an orthogonal base, though they do not solve our successive AR approximation problem since these vectors are dense.
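A quick numerical check of this circulant picture (toy code; the sizes, test signal and variable names are mine): diagonalize the circulant $S$ with the unitary DFT matrix and verify that the time domain and frequency domain weighted inner products agree.

import numpy as np
from scipy.linalg import circulant, dft

N = 16
rng = np.random.default_rng(0)
s = rng.standard_normal(N)

S = circulant(s)                                # columns are circular shifts of one period s_n
F = dft(N, scale='sqrtn').conj()                # unitary DFT matrix in the convention of these notes
C = np.diag(np.diag(F.conj().T @ S @ F))        # F^H S F is diagonal for a circulant S
print(np.allclose(F @ C @ F.conj().T, S))       # S = F C F^H -> True

g = rng.standard_normal(N) + 1j * rng.standard_normal(N)
h = rng.standard_normal(N) + 1j * rng.standard_normal(N)
td = g.conj() @ (S.conj().T @ S) @ h                                  # g^H S^H S h
fd = (F.conj().T @ g).conj() @ (C.conj().T @ C) @ (F.conj().T @ h)    # same inner product in the frequency domain
print(np.allclose(td, fd))                      # -> True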
\subsection{AR as inverse QR} Finding an orthogonal base in a weighted vector space is related to the inverse QR decomposition of a matrix $S$, using the Gram--Schmidt procedure. This decomposition is written as \begin{equation} SR = Q \label{iqr1} \end{equation} with $R$ triangular and $Q$ orthogonal. This is equivalent to stating that $R$ is orthogonal with respect to the inner product weighted by $S^HS$, or \begin{equation} R^HS^HSR = Q^HQ = I, \end{equation} which is the time domain equivalent of the orthogonal polynomials treated above. % another way of saying this is that $R$ reduces the quadratic form defined by $S^HS$ to a sum of squares. Taking $S$ to be the $\infty \times N$ Toeplitz matrix containing shifts of the signal, and computing $R$ using Gram--Schmidt with weighting $S^HS$ gives us the normalized backward predictors as columns of $R$. The columns of $Q$ are then the normalized backward prediction errors. Multiplying (\ref{iqr1}) by the diagonal matrix $E'$ such that the diagonal elements of $\Phi = RE'$ are $1$ gives us the predictors as columns of $\Phi$, while the diagonal elements of $E'$ contain the square root of the prediction error energies. We can obtain this by computing the columns of $R$ by not normalizing (\ref{gramsmidth}), since we start from the base $I$. With $E = QE'$ this gives \begin{equation} S\Phi = E. \end{equation} The corresponding columns of $\Phi$ and $E$ give us the solution to the respective order least squares backward prediction problem. This can be seen because the $n$th column of $E$ is orthogonal to the columns of $S$ that enter the $n$th order prediction problem. The Gram--Schmidt procedure referred to above uses the weighted inner product $\left< a, b \right> = a^T W b$, with $W$ a positive definite matrix. This defines the norm $\| a \|^2 = \left< a, a \right>$. We start from the base $a_n$ of mutually independent vectors and want to generate an orthogonal base which spans the same subspace. The base vectors are given by expressing $a_n$ as a component in the space spanned by $\{e_1, \ldots, e_{n-1}\}$ and a component $f_n$ orthogonal to this space, or \begin{equation} f_n = a_n - \sum_{i=1}^{n-1} \frac{f_i^T W a_n}{f_i^T W f_i} f_i. \end{equation} With $e_n = f_n / \|f_n\|$ we have \begin{equation} f_n = a_n - \sum_{i=1}^{n-1} (e_i^T W a_n) e_i. \label{gramsmidth} \end{equation}
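The whole construction fits in a few lines of numpy (a sketch under assumptions of mine: a tall but finite Toeplitz matrix instead of the $\infty \times N$ one, real data, an arbitrary test signal):

import numpy as np
from scipy.linalg import toeplitz, qr

rng = np.random.default_rng(1)
n, order = 400, 4
x = np.cos(0.3 * np.arange(n + order)) + 0.1 * rng.standard_normal(n + order)

S = toeplitz(x[order:], x[order::-1])        # column j is the signal delayed by j samples

Q_, R_ = qr(S, mode='economic')              # ordinary QR: S = Q_ R_
R = np.linalg.inv(R_)                        # "inverse QR": S R = Q_, with R triangular
print(np.allclose(R.T @ (S.T @ S) @ R, np.eye(order + 1), atol=1e-8))

Phi = R / np.diag(R)                         # rescale columns to unit diagonal: predictor coefficients
E = S @ Phi                                  # prediction errors, one column per order
print(np.sum(E ** 2, axis=0))                # error energies, non-increasing with order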
Entry: Moore's Law Date: Sat Jun 20 10:41:56 CEST 2009 It's all getting just too expensive[1][2][3]. The idea is that in order to keep the exponential going, software needs to become more efficient/effective. This in turn will probably lead to another exponential productivity curve, keeping Moore's law just as it is. [1] http://www.dspdesignline.com/news/218100277 [2] http://c2.com/cgi/wiki?EndOfMooresLaw [3] http://c2.com/cgi/wiki?MooresSecondLaw Entry: DWS: blurbs removed from paper Date: Sat Jun 20 13:38:15 CEST 2009 Type: tex \section{Related Techniques} \subsection{Wavetable synthesis} In wavetable synthesis, a sound signal is generated by interpolating the values of a waveform vector $P \in \mathbb{R}^N$. This is related to our model with ${\pdifft} p = 0$. To simplify indexing we choose the following units: phase $\phi$ and frequency $f$ are assumed to be modulo $1$, instead of $2\pi$ as chosen above. The periodicity of $p$ is taken to be $N$ in the phase dimension. Explicitly this gives $$ \begin{array}{rcl} s_n &=& p(N\phi_n) \\ \phi_{n+1} &=& \phi_n + f_n \end{array} \label{wavetab} $$ with $f_n$ the instantaneous frequency relative to the sample frequency of $s_n$, $\phi_n$ the current waveform phase, and $p$ a function which interpolates $P$ satisfying $p(i) = p_m$, $m = i \mod N$. The basic system we use is an extension of this, where the vector $P$ is replaced by a vector sequence $P_k$, updated at a rate $T$. This turns the equations into \begin{equation} \begin{array}{rcl} s_n &=& p_k(N\phi_n) \\ k &=& [n / T] \\ \phi_{n+1} &=& \phi_n + f_n \end{array} \label{dynwavtab} \end{equation} with $[x]$ the integer part of $x$, and $p_k$ a function interpolating the vector $P_k$. The general form which takes into account time interpolation of successive $P_k$ is \begin{equation} \begin{array}{rcl} s_n &=& p(N\phi_n, n) \\ \phi_{n+1} &=& \phi_n + f_n \end{array} \label{dtdynwav} \end{equation} which is equal to our original model, except for scaling of the parameters. Here $p$ is interpreted as interpolating the grid $P_k$ in both the phase dimension $\phi_n$ and the time dimension $n$. To relate our model to the general overlap--add block approach, we present a special case of this for $2$--point interpolation of the $P_k$, \begin{equation} \begin{array}{rcl} s_n &=& p_k(N\phi_n) w(n - kT) \\ &+& p_{k-1}(N\phi_n) w(n - (k-1)T) \\ k &=& [n / T] \\ \phi_{n+1} &=& \phi_n + f_n \end{array} \label{dws} \end{equation} Here $w$ is the interpolation function used for interpolating the time dimension of our time--phase model, which is zero outside the range $[-T, T]$. It has the property $w(n) + w(n-T) = 1$, $n \in [0, T]$. This is related to the windowing function in sinusoidal modeling, where it is used to limit the time extent of a periodic base function with infinite extent.
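A direct transcription of (\ref{dynwavtab}) as a toy oscillator (my own sketch: linear interpolation in the phase dimension, no interpolation in $k$, and made-up frame contents and parameter values):

import numpy as np

N, T, nframes = 64, 256, 20
# one period per frame: a fundamental plus a fading 3rd harmonic
P = np.sin(2 * np.pi * np.arange(N) / N) \
    + np.linspace(1.0, 0.0, nframes)[:, None] * np.sin(6 * np.pi * np.arange(N) / N)

def wavetable(P, f):
    """s_n = p_k(N phi_n), k = [n/T], phi_{n+1} = phi_n + f_n (f in cycles/sample)."""
    nsamples = len(f)
    phi = np.cumsum(np.concatenate(([0.0], f[:-1]))) % 1.0   # phase accumulator, modulo 1
    k = np.minimum(np.arange(nsamples) // T, len(P) - 1)     # which period vector P_k is active
    x = N * phi                                              # position in the wavetable
    i0 = np.floor(x).astype(int)
    i1 = (i0 + 1) % N                                        # wrap around the period
    frac = x - i0
    return (1 - frac) * P[k, i0] + frac * P[k, i1]           # linear interpolation p_k(N phi_n)

s = wavetable(P, np.full(nframes * T, 440.0 / 44100.0))      # constant pitch
print(s.shape, s[:4])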
\subsection{Sinusoidal modeling} We now consider the frequency domain interpretation of our basic model. We work from the discrete representation (\ref{dtdynwav}). Set $P_k = F C_k$, with $F$ the IDFT. This makes $P_k$ a linear combination of the columns of $F$, so the instantaneous waveform is a sum of harmonically spaced complex exponentials. Translated to $s_n$, the effect of resampling to $f_k$ is to shift each harmonic relative to $f_k$. Let's concentrate on column $l+1$ of $F$, setting the corresponding coefficient in $C_k$ to $1$, with the rest $0$. The column is a vector with a frequency of $l$ relative to the period $N$. Assuming perfect interpolation with $f_n$ constant and $\phi_n = 0$, we have \def\fscale{\frac{1}{\sqrt{N}}} \begin{equation} s_n = \fscale \exp(j2\pi f_n l n). \label{partial} \end{equation} Taking into account interpolation between successive $P_k$, and a variable $f_k$, gives us a special case of sinusoidal modeling, where a signal is expressed as a weighted sum of sinusoidal `blobs' which have most of their energy concentrated in a small part of the time--frequency plane. The frequency locality is an effect of the columns of $F$. The time locality is an effect of interpolation between successive $P_k$. What we are interested in now is to express the effect of phase change of the elements of $C_k$ from $k$ to $k+1$ on the resulting signal $s_n$. There are 2 terms contributing to this. The first is phase change due to $f_n$. With $f_n$ constant, the accumulated phase change over $T$ samples in (\ref{partial}) is $2\pi f_n l T$. The second term is independent of $f_n$ and is merely due to the phase change per $k$ update of the corresponding element of $C_k$. Setting this equal to $\Delta \phi'$ per $k$ update, and defining the instantaneous frequency as phase change over time, gives us \begin{equation} f' = f_n l + \frac{\Delta \phi'}{2 \pi T}. \end{equation} The consequence of this is that it is possible to represent slightly inharmonic sounds. Updating the phase of element $l$ of $C_k$ on each $k$ has the effect of detuning the frequency of the corresponding harmonic. In the limit case where the phase is reversed on each update, significant amplitude modulation distortion occurs as a result of interpolation between successive $P_k$. However, for smaller updates, this distortion is hardly noticeable. On the other hand, the AM modulation between neighbouring harmonics resulting from this slight detuning is pleasant to the ear. It is characteristic of quite a lot of stringed instruments, and makes the signal sound less synthetic. \subsection{Phase modulation} Since phase is an explicit parameter in our model, applying phase modulation to (\ref{torus}) is trivial. Using a modulator $\phi'(t)$, we can transform a signal $s(t)$ to \begin{equation} s'(t) = p(\phi(t) + \phi'(t), t). \label{phasemod} \end{equation} Phase modulation (and its relative, frequency modulation) can be seen as phase warping. Here this would be $r(\phi(t), t) = \phi(t) + \phi'(t)$. In the case where $r$ is independent of $t$, which boils down to periodicity relationships between $\phi$ and $\phi'$, the warping can be embedded in $p$. A warped $p$ then looks like \begin{equation} p'(\phi, t) = p(r(\phi), t). \end{equation} In the discrete representation, this is equivalent to resampling $P_k$ according to $r(\phi)$. The dual operation is to do this in the frequency domain, by resampling $C_k = F^HP_k$. \section{Interpolation} TODO: two things to explain: aliasing and phase cancellation. Interpolation or resampling is central to our approach, so it makes sense to reflect on the possible options we have to implement this. [ See basics article: this is basically Shannon--Nyquist, but with some extra hacks ordinarily not found in the treatment of sample rate conversion with rational ratios. ] \subsubsection{Filtering $P_k$} The resampling ratio in (\ref{dtdynwav}) is $r = N f_n$. This is the ratio between the old and new sample rate. As long as the new sample rate is higher than the original one, which means $r<1$ or $f_n < 1/N$, there is no aliasing. We can eliminate undesired frequencies from $P_k$ before it is interpolated. Several approaches are possible. Since $P_k$ is periodic, we can simply eliminate the higher frequencies in its spectrum $C_k$. Depending on the specific layout of the processing algorithm at hand, this might be free if the last step before determining $P_k$ is an IDFT, for example when using a spectral synthesis or processing operation. If not, this approach might be too expensive, since it involves extra DFT and IDFT steps. An alternative is to filter $P_k$ in the time domain. This can be done using both FIR and IIR techniques. Since we have the whole signal at hand we can use IIR filters in a forward/backward way to obtain a linear phase response. For IIR filters, running the filter once on $P_k$ and once more on zero input, and adding these 2 responses, allows us to approximate the steady state response to the periodic extension of $P_k$. FIR and IIR filter coefficients could be pre-computed, and chosen depending on the current resampling rate at run time. When using IIR filters, the coefficients can be computed at run time using standard filter design techniques, of which Butterworth filters are interesting due to the maximally flat property. This approach is related to the previous one, with the exception that the different $h(t)$ are determined using a 2 stage process. The resulting kernel is obtained by convolving the filter for $P_k$ with the interpolation kernel used.
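To pin down the first option (zeroing bins of $C_k$), here is a toy band-limiting step; the aliasing criterion follows from $r = N f_n$, but the function, its name and the test values are my own:

import numpy as np

def band_limit(P_k, f):
    """Zero the harmonics of one period P_k that would alias at playback frequency f (cycles/sample)."""
    N = len(P_k)
    C_k = np.fft.fft(P_k) / np.sqrt(N)              # spectrum, unitary scaling as in the notation section
    r = N * f                                       # resampling ratio
    if r > 1.0:                                     # only needed when reading faster than one sample per sample
        max_harmonic = int(np.floor((N / 2) / r))   # highest harmonic still below the new Nyquist rate
        keep = np.zeros(N, dtype=bool)
        keep[:max_harmonic + 1] = True              # DC .. max_harmonic
        if max_harmonic > 0:
            keep[N - max_harmonic:] = True          # and their conjugate bins
        C_k = np.where(keep, C_k, 0.0)
    return np.real(np.fft.ifft(C_k) * np.sqrt(N))

P_k = np.random.default_rng(2).standard_normal(64)  # a harmonically rich period
print(np.abs(np.fft.fft(band_limit(P_k, 5.0 / 64)))[:12].round(3))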
\section{Pitch/formant coupling} One of the drawbacks of our model is the coupling between pitch and formants, which is often referred to as the `chipmunk effect'. Resampling a signal $s_n$ changes not only its pitch and time base, but also its formant structure. Formant structure in this context means the `spectral envelope' of a sound, which is usually largely independent of its pitch. An advantage of our model is that time scale modification independent of pitch and formant structure is possible by resampling $P_k$ in the $k$ dimension, and keeping $\phi_n$ as it is. This could be seen as a `harmonic phase vocoder'. Resampling is core to our model, and since the chipmunk effect is generally undesirable because it sounds artificial, we have to take it into account. A positive side of this is that if we solve this problem, we can do arbitrary pitch/time/formant manipulations. % In perceiving sounds, our brain seems to separate pitch and formant % structure into largely independent properties. This can be understood % as an adaptation to the behaviour of most sound sources in our % natural environment. Formant structure is largely an effect of % resonating cavities, the size and shape of objects, and is mostly % constant, with the exception to the rule being voice, where the % formant structure is largely variable due to a variable shape % cavity. Pitch on the other hand, is determined by oscillating % entities placed in or near resonators, like strings (vocal cords). \subsection{Source/filter model} In general, this problem is solvable if we can decompose a signal $s(t)$ as \begin{equation} s(t) = h(t) * s'(t), \label{sourcefilter} \end{equation} with $h(t)$ the impulse response of a filter representing the formant structure, and $s'(t)$ an oscillator. Changing the pitch and formant structure independently amounts to changing the time base of $s'$ and $h$ respectively. For sound synthesis this is trivial, since $h(t)$ is explicitly synthesized. In our model this translates to synthesizing $C_k$ using a parametric method generating a spectral envelope. We simply need to take the playback pitch $f_n$ into account to compensate for formant shifting when generating the spectrum. Without further information or assumptions about the signal $s(t)$, the deconvolution equation (\ref{sourcefilter}) is underdetermined. For some sounds, this model is too crude an approximation. A better model is given by \begin{equation} s(t) = h^b(t) * h^r(t) * s'(t), \end{equation} where $s'(t)$ is an arbitrary excitation signal, $h^r(t)$ is a harmonic resonator determining the pitch, for example an air column or a string, and $h^b(t)$ contains the overall spectral envelope, which can be assumed to be constant. However, in our representation we combine the effects of the last two terms into the oscillator. % Below we will have a look at several approaches that try to solve % the problem of independent time/pitch/formant modification, and will % illustrate how this relates to decomposing $s(t)$ in a source and a % filter component using additional constraints and assumptions.
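The synthesis direction of (\ref{sourcefilter}) as a toy script (not from the paper; the resonance frequencies, pole radii and the impulse-train source are made-up assumptions): an octave pitch change only changes the impulse spacing, while the formant filter stays fixed.

import numpy as np
from scipy.signal import lfilter

fs = 16000

def formant_filter():
    # two fixed resonances (~700 Hz and ~1200 Hz), poles well inside the unit circle
    a = np.array([1.0])
    for f0, radius in [(700.0, 0.97), (1200.0, 0.95)]:
        w = 2 * np.pi * f0 / fs
        a = np.convolve(a, [1.0, -2 * radius * np.cos(w), radius ** 2])
    return a

def synth(pitch_hz, nsamples=8000):
    src = np.zeros(nsamples)
    src[::int(round(fs / pitch_hz))] = 1.0            # impulse train oscillator s'
    return lfilter([1.0], formant_filter(), src)      # all-pole formant filter h

low, high = synth(110.0), synth(220.0)                # an octave apart, same spectral envelope
print(low[:5], high[:5])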
\subsection{Periodic signals} % The coefficients are then given as $s_n = % \frac{1}{n!}\frac{\diff^n}{\diff z^n}s(z)$. In this section we work with periodic sequences with period $N$. The tools to use here are circulant matrices and polynomials modulo $z^N-1$. (See notes). For $a^N \neq 1$ the inverse of a degree 1 polynomial exists and is given by \begin{equation} (1-az)^{-1} = \frac{1}{1-a^N}\sum_{k=0}^{N-1} (az)^k, \label{ringinverse} \end{equation} which allows us to transform rational functions (ARMA models) to polynomials modulo $z^N-1$. The exponential models in the following section take the form \begin{equation} p(z) = \frac{c(z)}{a(z)}, \end{equation} where we can interpret $p(z)$, $c(z)$ and $1/a(z)$ either as the Z--transform of the respective infinite sequences, or as the polynomials $p(z)$, $c(z)$, $a(z)^{-1}$ modulo $z^N-1$, of which the coefficients represent one period of a periodic signal. In the latter case the inverse of a polynomial $a(z)$ is taken to be the polynomial $a'(z)$ satisfying $a'(z) a(z) = q(z)(z^N-1) + 1$ for some polynomial $q(z)$. We normalize the filter denominator polynomial of degree $L$ as \begin{equation} a(z) = \prod_l (1 - z_l z), \end{equation} with zeros given by $1/z_l$, for which we have $|z_l| < 1$, which means the zeros are all outside the unit circle, so the filter is stable. This normalization allows us to write the impulse response $h(z) = 1/a(z)$ as a weighted sum of complex exponentials \begin{equation} h_n = \sum_l c_l z_l^n. \end{equation} Below we take a look at several ways to formulate the problem so we can shape the decomposition of $p(z) = s(z)h(z)$ in such a way that $s(z)$ can serve as the oscillator in our source--filter model. \subsection{ARMA model} Let's restate our goals. We want to separate $s(z) = h(z)s'(z)$ with $h(z)$ a filter and $s'(z)$ an oscillator. This corresponds to the physics of most sounds occurring in nature, including human speech. The reason we want to separate the two is to be able to change the frequency of the oscillator $s'(z)$, without changing the filtering effect $h(z)$, which we can define as the property which is clearly recognized by our auditory system as being an invariant of a certain sound, independent of the pitch. In general, we model $h(z)$ using an ARMA model \begin{equation} s(z) = \frac{b(z)}{a(z)} s'(z), \label{arma} \end{equation} where $a(z)$ and $b(z)$ are low order polynomials. % This equation is understood in the following way. If $s'(z)$ is periodic with period $N$, the steady state response $s(z)$ is equal to the periodic extension of $p_n$. The only way we can make this problem computable is by introducing additional information, or making guesses about properties of both $s'(z)$ and $h(z)$, based on physical or perceptual motivations. For the ARMA model, this boils down to choosing the model order for $h(z)$ and an expression for $s'(z)$, which includes the modeling error. This gives us a general solution to the source--filter separation problem. Below we will look at two approaches which give us such a model. The difference is mostly in the shape of the error term. Once we have a model which incorporates a relatively small error term, we are still free to manipulate (\ref{arma}) by refining the assumptions about $s'(z)$. For example, depending on additional information, we can move some of the poles and zeros of $h(z)$ to the excitation function $s'(z)$. Since we are working with steady state responses, we can even move a pole of $h(z)$ to a set of $N$ zeros in $s'(z)$.
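Before moving on, a short check of (\ref{ringinverse}) (toy code; $a$ and $N$ are arbitrary): multiplying $1 - az$ by the claimed inverse and reducing modulo $z^N - 1$ should give $1$.

import numpy as np

def polymul_mod(p, q, N):
    """Multiply two coefficient vectors (index = power of z) modulo z^N - 1."""
    full = np.convolve(p, q)
    out = np.zeros(N, dtype=complex)
    for k, c in enumerate(full):
        out[k % N] += c                     # z^N == 1, so fold higher powers back
    return out

N, a = 8, 0.3 + 0.4j
one_minus_az = np.zeros(N, dtype=complex)
one_minus_az[:2] = [1.0, -a]
inverse = a ** np.arange(N) / (1 - a ** N)  # 1/(1-a^N) * sum_k (a z)^k
print(np.round(polymul_mod(one_minus_az, inverse, N), 12))   # -> [1, 0, 0, ...]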
\subsection{ARMA Poles} Taking a look at the zeros of $a(z)$, we can basically separate them into two classes. Zeros close to the unit circle correspond to narrow bandwidth oscillatory modes and are grouped in $h^r(z)$, while zeros further from the unit circle correspond to wideband resonance peaks, the part we are really interested in, and are grouped in $h^b(z)$. \begin{equation} s(z) = h^b(z) h^r(z) s'(z), \end{equation} which can be obtained from (\ref{arma}). When shifting the frequency of $s'(z)$, we probably want to shift the narrow band modes too. This means the excitation of the system is actually $h^r(z) s'(z)$ instead of impulse like, so we have little information about the true shape of the spectrum. How can we deal with this or avoid it altogether? We could try to translate this to a preprocessing step where we weight the spectrum of $p_n$ to accentuate its high frequency modes. After fitting the model, we could inverse filter the oscillator waveform to compensate for this. Another remark to be made is that narrowband modes can easily be picked from the harmonic spectrum of $p_n$. By removing them beforehand, it might be possible to push the ARMA model toward broader peaks. \subsection{Exponential modeling} A way to look at the finite sequence $p_n$ is to view it as the steady state response of some IIR filter $h_n$ to some periodic signal. In the time domain this is \begin{equation} p_n + e_n = h_n = \sum_l c_l z_l^n. \label{expmod} \end{equation} We write this in the frequency domain modulo $z^N-1$. Using (\ref{ringinverse}) this becomes \begin{equation} h(z) = \sum_l \frac{c_l'}{1 - z_l z} = \frac{b(z)}{a(z)} \end{equation} with $c'_l = (1 - z_l^N) c_l$ the non--wrapped amplitudes and $a(z) = \prod_l (1 - z_l z)$. The numerator $b(z)$ can be computed from the partial fraction expansion using the $c_l'$. We can solve (\ref{expmod}) minimizing $\sum_n |e_n|^2$. Looking at the error, we seek a sequence $e_n$ such that $p_n + e_n$ is a sum of exponentials. We set $E_k$ to be the vector containing the error sequence shifted up by $k$, in the same fashion as we define $P_k$. Set $E = [E_1 \ldots E_L]$. The exponential modeling problem solves $(P + E)A = -(P_0 + E_0)$ minimizing $E^T E$. With $P' = [P_0 P]$, $A' = [1 A^T]^T$, and $E' = [E_0 E]$ it amounts to finding a Hankel matrix $E'$ that satisfies $(P' + E') A' = 0$. This boils down to finding a rank $L$ Hankel matrix which approximates $P'$, minimizing $\| E' \|_F$. This is a structured (Hankel) total least squares problem, or structured errors--in--variables model. We will also have a look at a subspace approximation called HSVD, where the Lanczos method can be used to find an (unstructured) rank $L$ approximation. One important question remains. In our model the finite sequence $p_n$ is one period of a periodic sequence. Shifting the elements of $p_n$ in a circular way is possible in our model, but gives a different optimization problem for (\ref{expmod}). This can be seen as choosing the starting phase of the impulse train. We will see below that working from the autocorrelation function allows us to ignore the phase problem. This leads us to another formulation of the exponential modeling problem in terms of linear prediction, where we do not present an impulse input, but a broad spectrum noise signal.
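To see the unstructured subspace step in isolation (a toy sketch with my own signal and sizes; this is not the structured problem itself, just the rank-$L$ SVD truncation it starts from):

import numpy as np
from scipy.linalg import hankel, svd

N, L = 64, 4
n = np.arange(N)
p = (0.95 ** n) * np.cos(0.4 * n) + 0.9 * (0.8 ** n) * np.cos(1.1 * n)    # ~ a sum of L=4 complex exponentials
p = p + 0.01 * np.random.default_rng(3).standard_normal(N)                # plus a small error e_n

H = hankel(p[:N // 2], p[N // 2 - 1:])         # Hankel matrix built from shifts of p
U, s, Vt = svd(H, full_matrices=False)
print(np.round(s[:8], 3))                      # L dominant singular values, then the noise floor
H_L = (U[:, :L] * s[:L]) @ Vt[:L]              # best unstructured rank-L approximation (no longer Hankel)
print(np.linalg.norm(H - H_L) / np.linalg.norm(H))   # small relative misfit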
% NOTE: LS or TLS is of course latency versus misfit.. what this is % actually supposed to mean here is not clear. It only says something % about the error distribution: for TLS the additive noise -> white, and % for LS the residual (the source signal) -> white. % Once the model is fitted, the residual $e_i$ can be inverse filtered % to replace the assumption of $s(t)$ being an impulse train. This % gives us the separate description in (\ref{sourcefilter}), with the % added advantage that $h$ is parametrized in frequency through the % poles $z_l$. % Another extension is possible: by deconvolving the inverse filtered % error term using some constraint or assumption about $h$, we could % attribute one part of the deconvolution to $h$, turning it into an % IIR filter instead of allpole. \subsection{Frequency domain AR} Instead of modeling the period vector as a (periodic) AR model, we can take the dual approach and write its spectrum as an AR model. This is equivalent to modeling the time domain waveform as a sequence of pulses, and might be interesting for formant preserving time stretching, by allowing us to extrapolate the waveform. % In the framework of linear prediction of a WSS process, the error is % well defined. The predictor for the WSS process behaves as a Minimum % Mean Square Error (MMSE) predictor, minimizing $E(e_k^2)$. In the % case where the WSS process is an autoregressive process of order % $L$, we have in addition $E(e_k, e_l) = 0$, $k \neq l$. % So we can relate exponential modeling to linear prediction. In the % case where we compute $r_n$ from $p_n$, $r_n$ will be periodic, and % can be written as a sum of sinusoids. % \subsection{Frequency domain AR} % One possible extension is to compute the AR model in the frequency % domain. For the periodic model we can pre--multiply both sides of the % equation (\ref{lsar}) by the matrix $F^H$ representing the Fourier % transform, or % \begin{equation} % C A = - C_0 + F^HE, % \end{equation} % with $C$ the matrix of phase shifted spectra. Since $F^H$ is unitary % the error norm is preserved $\|F^HE\| = \| E \|$ and we have the % same solution % \begin{equation} % A = - (C^H C)^{-1} C^H C_0, % \end{equation} % with $C^H C = (F^HP)^H(F^HP) = P^TP$, and $C^H C_0 = % (F^HP)^H(F^HP_0) = P^TP_0$. % NOTE NOTE: alternative interpretation of AR: the AR model writes the % spectrum in terms of phase shifted versions of the spectrum. NOTE % NOTE: instead of computing the correlation matrix from the power % spectrum $c(\omega)$, one could compute it from % $\sigma(c(\omega))$ (see support vector machines: each linear algo % in terms of inner products can be written as nonlinear using a % kernel function). NOTE NOTE: fourier - mul - inverse fourier = % multiplier theory (i.e. fractional differentiation) % \subsection{Extensions} % We propose a combined method of spectral factorization, spectral % weights and exponential modeling. % First, the spectrum $c_n$ is weighted by $\sqrt{n}$. This % corresponds to the average pink noise distribution of sound % signals. The fundamental is discarded. % As a second step, we compute the minimum phase version of the % weighted spectrum, recording the phase adjustment. This signal will % be treated as an impulse response which will be modeled as a sum of % exponentials. % The resulting error will be inverse filtered to add to the impulse % stimulus. The phase and amplitude effects of the weighting will be % applied in reverse to obtain the original stimulus. % In speech coding, the guesses amount to assuming $s(z)$ is either a % periodic impulse (voiced speech) or white noise (unvoiced % speech). The filter $h(z)$ corresponds to the transfer function of % the vocal tract.
This seems to work reasonably well for coding % applications, but since our goals are high quality sound processing, % we have to do better than that. % Take for example the violin. A rough approximation would be the % domposition $p(z) = h(z) s(z) b(z)$, where $b(z)$ is the input % oscillator, modeled as a sequence of pulses due to sliding of the % bow over the strings. $s(z)$ is the transfer function of the strings % combined with the effect of position and pressure of the fingers of % the player. This determines the eventual pitch. and $h(z)$ is the % effect of the violin body and the transfer function from bridge to % body. In our view we can combine the effect of $b(z)$ and $s(z)$. % That's quite something. Luckily, we can make some % approximations. Let's take a look at the results of exponential % modeling. In the case where the error $e_k=0$, we have the harmonic % decomposition of the signal $p_k$. This is clearly not what we want, % since it is what we had, a periodic oscillator. % Reducing the filter order allows us to move some information from % $h(z)$ to $e(z)$. What remains in $h(z)=1/a(z)$ can be split in 2 % parts. Strong resonnances could be attributes to harmonics of the % oscillator, while weak resonnances can be attributed to modes of a % filter. The closer a zero of $a(z)$ is to $\exp(j2\pi t/T)$, the % more likely it is to be due to an effect we like to incorporate in % the oscillator, not the filter. % We can use this information not only to do a better source/filter % separation, but also to improve the condition of the estimation % problem, since poles close to the unit circle severely infect the % condition of an exponential estimation problem. % \subsection{Fixed zero ARMA} % In the case where the periodic signal $p_k$ has a strong harmonic % content, we try to model its effect by a zero instead of a pole. We % can do this because we are dealing with periodic signals. % \subsection{Resampling lattice filters} % REF = bultheel's paper % It would be interesting to have an explicit polar representation of % $h(z)$, as in (\ref{pureexp}). Though, depending on the application, % a representation using $\rho_l$ might be more interesting. % We have a brief look at the possibility to use reflection % coefficients to represent spectral data, instead of polar or % polynomial form. % \subsection{Phase vocoder} % Without resorting to an extensive description of the algorithm, the % phase vocoder can be seen as an analysis/synthesis tool for time % base modification independent of pitch and formant structure. It % assumes a signal $s_n$ can be represented as a sum of sinusoidal % components which vary slowly over time. % This presents a model in terms of amplitude, phase and % frequency. Contracting the time base is then equivalent to % resynthesis with keeping the amplitudes and frequencies fixed, but % modifying the phase according to the new time base. % Several variants of the algorithm exist. Most are based on DFT % analysis and `peak picking' methods, where the sinusoidal components % are identified by looking at the peaks of a spectrum. Sythesis is % done by direct modification of the phase of a DFT spectrum. % It does not separate pitch and formants, and has the same drawbacks % as our method when used for pich shifting. In fact, our method can % be seen as a harmonic phase vocoder, with all sinusoidal components % harmonicly spaced. %\subsection{WSOLA} % \section{Remarks} % The core of the solution is very simple. 
Stretching or compressing % the spectrum $C_k$ is exactly what we need. Dually, we can compress % or stretch $P_k$. The difficulty lies in the boundary conditions. % When stretching $C_k$, wrapping around from Nyquist to DC is not a % good idea, since it introduces aliasing. The dual operation is % compressing $P_k$, which indeed can introduce aliasing if there are % high frequencies present. So it seems shifting formants up can best % be done in the frequency domain. % When stretching $P_k$, wrapping around is not necessarily a bad % thing. Or is it? % Here we can use some kind of hack. Since most real filters in nature % are causal, we could use this information. Other hack: spectrum % warping: leave the first couple of harmonics intact, and shift only % the higher ones? % Another hack: using HTLS/HSVD on the period vector to isolate % formants? Suppose we have a % Here we will discuss the effects of deforming the function $p(\phi, % t)$ in the phase (or frequency) dimension. This is the basic tool we % have to re--organize the spectral content at time $t$. % in how far is it possible and/or desirable to generalize the model % to pure sinusoidal modeling based on absorbing inharmonicity in the % phase spectrum? \section{Effect processing} Spectral shift. Pitch correction. Harmonic enhancement (aliasing free waveshaping). Phase coherence. \section{Data reduction} Approximation of all discrete sequences (spectra, AR models, pitch envelope) using DWT compression. Phase encoding (constant difference). \section{Pitch detection} A crucial part for any analysis algorithm for our model is a pitch (periodicity) detection scheme. Moreover, given a perfect pitch estimator, parameter estimation for our model is relatively trivial, and amounts to segmentation of a signal $s(t)$ into a sequence of intervals $s_i(t)$ related to a sequence of periods $p_i(\phi)$. The length of each interval is determined by the instantaneous period $T_i$ at time instance $t_i$, with $t_i = \sum_{k<i} T_k$. % FT -> $\sigma$ in FD % There are a lot of approaches to pitch detection to be found in the literature. % Time domain: autocorrelation. Frequency domain: peak picking, harmonic sum % spectrum. Cepstrum methods. % Fiddle uses an ad-hoc likelihood function.
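The simplest item in that list as a toy sketch (the lag range and the test signal are my own; no claim that this is good enough for the model):

import numpy as np

def detect_period(s, min_lag=20, max_lag=400):
    s = s - np.mean(s)
    r = np.correlate(s, s, mode='full')[len(s) - 1:]    # autocorrelation, lags 0 .. len(s)-1
    r = r / r[0]                                        # normalize by the signal energy
    lag = min_lag + np.argmax(r[min_lag:max_lag])
    return lag, r[lag]                                  # period in samples, plus a crude confidence

fs = 8000
t = np.arange(2048) / fs
s = np.sign(np.sin(2 * np.pi * 123.0 * t))              # a harmonically rich 123 Hz test signal
T_est, conf = detect_period(s)
print(T_est, fs / T_est, conf)                          # ~65 samples -> ~123 Hz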
\section{Non--periodicity} Our model assumes signal periodicity, which makes it generally unsuitable for modeling complex sounds. In addition to that, we need to take into account transients. % Although the model we present is mainly intended as an overcomplete representation of a signal which allows real--time modification of a lot of its parameters, and therefore intended as a musical tool, there are possibilities to employ compression techniques to limit the storage needed to store sound expansions. \section{Notation} We use the following notation. $s_i$ represents (an element of) a discrete sequence or a vector. Multiple indices denote elements of discrete grids or matrices. Capital letters denote matrices and vectors. The elements of the $K \times L$ matrix $M$ are denoted as $m_{kl}$, with $k$ the row index running from $1$ to $K$, and $l$ the column index running from $1$ to $L$. $s(x)$ represents a real or complex function. All entities take their values from $\mathbb{R}$ or $\mathbb{C}$. We use $j = \sqrt{-1}$. We represent the discrete Fourier transform (DFT) using the matrix $F \in \mathbb{C}^{N \times N}$ with elements $f_{k+1,l+1} = \frac{1}{\sqrt{N}}\exp(j2\pi kl / N)$. We have $F^{-1} = F^H$ due to normalization, with $(.)^H$ denoting the complex conjugate transpose. The DFT of vector $X$ is then denoted as $F^H X$, and the inverse discrete Fourier transform (IDFT) of $X$ is $FX$. Entry: Complex Waveshaping Extras Date: Sat Jun 20 16:34:42 CEST 2009 Type: tex Stuff below this is just for inspiration. \subsection{More unrelated stuff} The space spanned by the $f_{a,\theta}$ is isomorphic to the space of matrices of the form \begin{equation} m'_{a,\theta} = e^{i\theta} \left( \begin{array}{rr} 1 & a \\ \bar a & 1 \\ \end{array} \right). \end{equation} $|a| \neq 1$ ensures $\det m'_{a,\theta} \neq 0$. Dividing the numerator and denominator of (\ref{themap}) by $\sqrt{1-|a|^2}$ does not change the value of $f_{a,\theta}$, so we will let the matrix with unit--modulus determinant \begin{equation} m_{a,\theta} = \frac{e^{i\theta}}{\sqrt{1-|a|^2}} \left( \begin{array}{rr} 1 & a \\ \bar a & 1 \\ \end{array} \right). \label{ma} \end{equation} correspond to $f_{a,\theta}$. The set of $m_{a,\theta}$ is a group under multiplication. The mapping between $f$ and $m$ is a group homomorphism mapping composition of $f$ to multiplication of $m$. The inverse of $m_{a,\theta}$ is given by $m_{-a,-\theta}$. This means that $f_{a,\theta} \circ f_{-a,-\theta} (z) = z$. The $m_{a,\theta}$ span a hyperbolic space where the inverse is given by the anti-hermitian of $m$. For the angular representation we write $a = r e^{i\beta}$. For $0 < |r| < 1$ we can set $r = \sin \gamma$. Transforming (\ref{themap}) we get \begin{equation} f_{\beta,\gamma}(z) = \frac{z + \sin \gamma e^{i\beta}}{1 + \sin\gamma e^{-i\beta} z}. \label{themap_angular} \end{equation} Transforming (\ref{ma}) in the same way we get \begin{equation} m_{\gamma,\beta} = \frac{1}{\cos \gamma} \left( \begin{array}{cc} 1 & r e^{i\beta} \\ r e^{-i\beta} & 1 \\ \end{array} \right). \label{ma_angular} \end{equation} Writing (\ref{themap_angular}) as a function of $\alpha$ with $z = e^{i\alpha}$ we get \begin{equation} f(\alpha,\beta,\gamma) = \frac{e^{i\alpha} + \sin \gamma e^{i\beta}}{1 + \sin\gamma e^{-i\beta} e^{i\alpha}}. \label{themap_alpha} \end{equation} \subsection{Lie groups} It would be nice to check out what can be said about modulation in algebras that can be represented by matrices over the reals or complex numbers. % I don't know much about this subject, but find it intriguing nonetheless. The oscillator $x(t) = \exp(i\omega t)$ could be generalized to \begin{equation} X(t) = \exp(\Omega t), \label{lieosc} \end{equation} where $X(t)$ is an element of a Lie group, and $\Omega t$ is the corresponding element of the Lie algebra. Using matrix representations of both allows us to construct methods that can be implemented on a digital computer. The oscillator (\ref{lieosc}) can be written as a set of coupled oscillators, by using the eigenvalue decomposition of $\Omega = S\Lambda S^{-1}$, assuming it exists. This gives \begin{equation} X(t) = S \exp(\Lambda t) S^{-1}. \end{equation} Take for example $\text{SO}(3)$ with its universal (double) cover $\text{Spin}(3)$ or $\text{SU}(2)$, which is diffeomorphic to the 3--sphere $\text{S}^3$, the unit quaternions. The Lie algebra $\text{su}(2)$ can be represented by the Pauli matrices \begin{equation} \sigma_1 = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right],\text{ } \sigma_2 = \left[ \begin{array}{cc} 0 & i \\ -i & 0 \end{array} \right],\text{ } \sigma_3 = \left[ \begin{array}{cc} 1 & 0 \\ 0 & -1 \end{array} \right].
\end{equation} Extended with $\sigma_0 = I$, the identity matrix, $(\sigma_0, i\sigma_1, i\sigma_2, i\sigma_3)$ gives a representation for the quaternion basis $(1,i,j,k)$ with $i^2 = j^2 = k^2 = ijk = -1$. Writing (\ref{lieosc}) in terms of the quaternions, we have \begin{equation} x(t) = \exp(q t). \end{equation} If $q$ is purely imaginary, $x$ will have unit norm. % Other examples include the biquaternions and general Clifford Algebras. % AARGH!! % This basis is equivalent to the quaternions % $a + bi + cj + dk$ with $-1 = i^2 = j^2 = k^2$, which are represented by % a\sigma_0 + ib\sigma_1 + ic\sigma_2 + id\sigma_3 = %\begin{equation} %\left[ %\begin{array}{cc} %a + id & c + ib \\ %-c + ib & a - id %\end{array} %\right]. %\end{equation} % http://en.wikipedia.org/wiki/Rotation_group % http://en.wikipedia.org/wiki/Spinor Entry: Polynomial Composition Date: Wed Jun 24 18:46:25 CEST 2009 Type: tex I'm trying to give a meaningful interpretation for polynomial composition. While addition and multiplication have a straightforward interpretation, composition seems to have less structure. Composing two polynomials yields $$p(q(x)) = \sum_{n=0}^N p_n \Big(\sum_{m=0}^M q_m x^m\Big)^n$$ which doesn't have a trivial closed form. It is a linear combination weighted by $p_n$ of self--convolutions of the original polynomial $q(x)$. Let's look at the self--convolution by investigating the squaring operation $$p(x)^2 = (\sum_{n=0}^N p_n x^n)^2 = \sum_{n=0}^{2N}(\sum_{k+l = n} p_k p_l)x^n$$ and the effect of higher powers of a simple polynomial $$(1 + x)^M = \sum_{m=0}^M \binom{M}{m} x^m.$$ The result is a polynomial which is in some sense ``flatter'' than the original, in that it is more spread out. It wouldn't be surprising if the Bernstein basis[2][4] were a good place to start looking for meaning. $$ \begin{align*} 1 &= [x + (1 - x)]^n \\ &= \sum_{k=0}^n \binom{n}{k}x^k(1-x)^{n-k} \\ &= \sum_{k=0}^n b_{k,n}(x). \end{align*}$$ Doing some random searches online I don't find anything special about polynomial composition, other than iterated functions (dynamical systems) which have a very rich behaviour that is not easily expressed in terms of coefficient values. This leads me to think that maybe it is better to study the polynomials as opaque \emph{maps} instead of focussing on linear combination of polynomial powers. There are some special cases. The Chebyshev polynomials obey the nesting property $T_n(T_m(x)) = T_{n m}(x)$, as do the monomials: $(x^m)^n = x^{nm}$. It looks like it makes more sense to look at composition from the effect it has on polynomial roots. Let's limit ourselves to monic polynomials. With $p(x) = \prod_k (x-p_k)$ the composition is written as $$p(q(x)) = \prod_k(q(x)-p_k).$$ This constructs a new polynomial as a union of the roots of the polynomials $q(x)-p_k$. Here each root $p_k$ \emph{moves} the roots of $q(x)$. % Is there a polynomial that, raised to a power yields a similar % expression? %% For $x \in I = [0,1]$ all the terms are positive which makes them %% map $I$ to itself. Let's collect other polynomials that are closed %% under composition on subsets of $\CC$ to get a better idea of %% composition. Chebyshev polynomials[5] are closed on $[-1,1]$. % I wonder what the link is between $p(x)^n$ with large $n$ and the % central limit theorem. % Also it is related to building higher order interpolating % polynomials from lower order (Neville interpolation [3]). EDIT: no % this is multiplication as it raises the order by one every % multiplication.
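Two quick numerical checks of the special cases mentioned above (toy code; the particular $p$ and $q$ are arbitrary): the Chebyshev nesting property, and the root picture of composition where each root $p_k$ of $p$ contributes the full root set of $q(x) - p_k$.

import numpy as np

x = np.linspace(-1, 1, 1001)
T = lambda n, x: np.cos(n * np.arccos(x))            # Chebyshev polynomials on [-1, 1]
print(np.allclose(T(3, T(4, x)), T(12, x)))          # nesting property -> True

p_roots = np.array([0.5, -1.0])                      # p(x) = (x - 0.5)(x + 1)
q = np.array([1.0, 0.0, -2.0])                       # q(x) = x^2 - 2, coefficients high -> low
comp = np.polymul(np.polysub(q, [0, 0, p_roots[0]]),
                  np.polysub(q, [0, 0, p_roots[1]])) # p(q(x)) = prod_k (q(x) - p_k)
print(np.sort(np.roots(comp)))
print(np.sort(np.concatenate([np.roots(np.polysub(q, [0, 0, r])) for r in p_roots])))  # same root set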
% This is related to interpolation, for which the function $f_a(x,y) = % x + a (y - x)$ plays an important role. %[1] http://www.math.utah.edu/~pa/MDS/bform.html %[2] entry://20070505-215632 %[3] http://en.wikipedia.org/wiki/Neville%27s_algorithm %[4] http://en.wikipedia.org/wiki/Bernstein_polynomial %[5] http://en.wikipedia.org/wiki/Chebyshev_polynomials %[6] entry://../library/071ac18443ca8d29abe3e2efcf114b1b Entry: dna vs. static code Date: Wed Jul 1 15:41:36 CEST 2009 One of the discussions I find interesting is the static vs. dynamic debate. To simplify: Smalltalk vs. Haskell Is code a construct from which one can create statically linked executable code (early binding), or should the links allow late binding? This is very similar to what you find in nature: genotype vs. pheonotype. Apparently evolution has come up with a mechanism that builds an image (body) from code (dna), and regularly discards the image to rebuild it from scratch. It looks to me that this is mostly about efficiency of storage: genotypes are a _lot_ smaller than phenotype. Most of the information of the phenotype is a consequence of the structure of the compiler. However, without the compiler there is no semantics. On the other hand, the pheonotype contains a lot of information and is "self-hosted" on top of physics, but this huge information store makes it vulnerable to bit rot. Of course this ignores information stored in the nervous system, but the analogy seems to hold in that "similar" phenotypes can be created as long as the compiler stays intact. Here compiler also includes social and environmental information. Entry: The Point Date: Thu Jul 2 01:06:27 CEST 2009 I think it's time to stop wanting to do original research, and try to understand what is already there through study and building useful things. Entry: preserving data Date: Wed Jul 8 16:35:36 CEST 2009 This interpreter + data business hinted at in the previous posts and recent struggle with build tools made me think about something: there isn't any static semantics for data we're storing. If something doesn't build you're fucked. Entry: Loose ideas about embedded programming Date: Tue Jul 14 16:39:22 CEST 2009 Some scrap ideas and frustrations that I can't place yet (contains duplication). * Maybe it's time to look at the semiconductor design industry. I think it is safe to say that all a semiconductor design company does is to try to get it right the first time. It's a hard life. Getting it wrong means going out of business. The best way to do this is to write compilers and verification systems that try to catch all possible errors. * To me embedded software development is really about tools: languages that make abstraction easier and and make debugging more straightforward. The hardware industry has known this for a very long time already. Maybe the reason is that there is more pressure to get things right in a minimal number of iterations, and that this industry is not so much hindered by software standards. But when they provide language support, they provide C and really only C. * What saddens me however is that the "don't touch anything" frame of mind seems to be quite important in this endeavor. Systems are complicated, but once they work they are left as is. Is there anything to learn from this? Is real software necessarily messy? Like biology? * (About me not liking C) What tears me apart in some sense is that it seems to be about closed mindedness. At one point there is the point that you really can't beat C and Unix as a base platform. 
These tools are there to stay. This is in some sense good: standards are good and make development cheaper. What saddens me is that because of the structure of C and especially the C preprocessor, the commercial success of C and Unix (Linux in particular) have eliminated two variables from the equiation: it is no longer feasible to pick an optimal implementation language or operating system. C and CPP are just good enough to solve most problems with a bit of developper's pain. * About Staapl to Lisp, Forth and C people: The weird thing is that the basic idea seems to be completely obvious to me, which is probably the reason why I find it so hard to explain. The only people that seem to get what it is about are Forth and Lisp people, and they think it is fairly trivial: an interesting excercise but "No thanks, I'll roll my own." Most embedded software engineers I talk to find it esoteric and far removed from the real world. I've even been accused of talking like "management". So why is this? Why aren't people concerned about the tools they are using? Why does the current patchwork of ad-hoc glue code that is called a GCC toolchain actually work? I found the Forth and Scheme communities to be quite happy with the gem they discovered and generally unwilling (or demotivated?) to spend any energy in advocating the general idea. I find myself in the middle of two worlds. What I wonder is why this is. My disgruntled response is that embedded engineers should learn about programming language design, as this is very helpful in shifting your idea about writing programs to writing compilers / debuggers / profilers. If anything it helps you to get a good idea about how dynamic and static relate to each other: how debugging and reaching optimal implementations are diametrically opposed and how this beast actually has a name. * Maybe the problem (building a standard meta-system for C) is just too complicated, and maybe it is true that each application does deserve its own ad-hoc system. I don't think this is the case however. I think the problem is really worse-is-better: that the C preprocessor is just good enough to get the job done in most of the cases, if you find someone skilled enough to actually pull it off, with lot's of TDD traps to catch the awkward errors CPP can produce. * I really can't ignore the C++ template langue in this sense. It is equally horrible, also probably because it is not meant as a general purpose language, but is used as such. The idea is that a macro system should really be a proper language: otherwise people will find ways to circumvent it. You can't squash turing completeness. Look at syntax-rules vs. syntax-case. The problem is how to merge this with static checks.. * "Premature Flexibilization". YAGNI: Ya Ain't Gonna Need It. Writing managable software seems to be about being able to grow in the correct direction from simple seeds. You need a language (and tools) that makes refactoring easy. Start with concrete and move to more abstract later, but make sure it is possible, and _don't_ waste too much time on making things too general. Move generality to the development tools, not the application. This is close to Chuck Moore's philosophy of early binding + writing highly specialized code. Only languages and low-level libraries are reusable. Languages provide the "general" and libraries the "specific". The space in between is the realm of complex disposable glue code. 
Move languages from dynamic (debuggable, specializable, easy to write) to static (robust, specialized, easy to maintain). * I don't like the idea of large test suits for exactly this reason: they try to keep a dynamic implementation under control in the case where you really want to keep most of the program as-is: where the _overall structure_ of your program is something you're more interested in. In this case a stricter static implementation might be a better idea than to try to keep all things open sticking to a dynamic implementation. * Once your program is correct, this virtual layer should be eliminated completely (unless it provides optimality in another sense: i.e. memory usage). This should be possible for any part of your program that provides flexibility that is not needed in the end product. It should be possible to seal up the end product in a "blob" that does _only_ what it is inteded to do, and nothing more. Simple behaviour is easier to get right. Compile time complexity is managable. Unnecessary run time complexity is a horrible waste of time. * Anything that is supposed to be configurable in the end product (as a feature) should have the semantics of a real (dynamic) programming language, or at least have a possible growth path toward one. If you don't acknowledge this early on, it will creep up later in a very bad form. The basic idea is if you want to allow your users to tweak your product to their needs, it you should go all the way and pick a standard scripting language + some high level sugar on top of that to making writing "dumb glue" easy. * I define "embedded programming" as developping a system that is supposed to be sealed up and forgotten about: a localized blob of intelligence that does not require human intervention to operate. If at some time you find out your system needs to be configurable, it is probably best to expose this as a real, complete programming language, possibly the debugging inspection system properly sealed down to eliminate security issues. * Once you realize this dynamic vs. static problem, it is very well possible to do it using standard tools (C, shell scripts and some C code generation) as long as you find a way to keep the structure of the metasystem under control. * Embedded programming is about writing debuggers. * The moral for embedded software seems to be to focus on cross-interface binding. Move as much complexity as possible to compile time. * What Staapl does is to explore this paradigm, currently sidestepping the issue of having to read and write C code by focussing on two languages (Forth and Scheme) with a simple syntactic structure. Staapl's own code base has been largely on this dynamic -> static track. * While C itself is a rock-solid standard that is not going anywhere, anything remotely high-level built on top of it quickly leads to a jungle of incompatibilities and ad-hoc solutions. * I need to open up GCC or LLVM to give compilable intermediate code. * The theoretical problem related to this seems to be the unification of macros (general purpose compile-time computations) with type systems and program verification. This endeavor seems to be mostly about finding ways around the halting problem (you can't say what a program does without running it, once it has conditional jumps) Generic program inspection is undecidable, so clever selective simplification of higher level semantics is necessary. * The one thing which is really important in a large software system is its large-scale stucture. 
You want it to be brittle, so that refactoring (changing internal interfaces on one side) immediately breaks the other side in the most spectacular way. It seems that static type systems, due to their conservatism, are better bet here than dynamic test suits. * Note that this approach (dynamic bootstrap + static metaprogramming) probably doesn't work in environments with highly volatile code. Embedded programming seems to me mostly about writing components that stay relatively fixed once they are built. This is in stark contrast with systems that deal a lot with "arbitrariness of business logic", one example being web-based user interfaces. * While Staapl might not be ready for commercial prime-time yet, I believe the ideas behind it are quite solid. I would like to continue applying it to practical problems to bring it closer to where it might pay off in the real world: serving as a basic tool for a development team of a couple of skilled engineers, and as a basic tool in the development of specialized software libraries. Next to that I would like to see these ideas carried over to a C-based programming system, to make integration into existing projects more straightforward. This is not a simple task. I've been trying to find a proper attack angle for many years, but it seems I'm not there yet in my current skill set. * All the design tools I know will hide the metaprogramming behind some dumbed-down user interface. From what I know of people actually using these tools is that these interfaces are really complicated and behave quite a lot as "magic". An example that generalized to source code is The MathWorks' Real-Time Workshop, which is more of an actual compiler for a specific language. Is the problem they solve just too complicated to expose these systems as programming languages? This basic principle would make sure that the end user is not limited in the way the system is composed with other components outside of the designer's vision. Entry: Speed reading Date: Thu Jul 16 14:38:47 CEST 2009 I'm accumulating books and papers much faster than I can possibly manage to ever read. However, if I could read all the books I have I could probably solve some problems I'm encountering in a simpler way, because these books create the right view by stimulating thought in a certain way.. I've found some way to deal with this: to collect whatever might be interesting and take some time here and there to go over good books and articles and try to extract a single idea from them. There is one reason why I pick up a book: I know there is something hidden behind the stroll through the stacking of abstractions that I can instantly grasp by looking inbetween the lines and finding that important idea. Good books simply _need_ to contain a huge amount of scaffolding to clearly express good ideas. However, this scaffolding is only there to get you to the "aha" if you're disciplined enough to follow the reasoning. There are two reasons why I put down a book: I get bored by details because the exposition doesn't immediately come to the point (basicly, I can't find what I'm looking for) or I get saturated with the abstraction level of the ideas on the path/part I know I have to read to get the idea (meaning I am not "smart" enough yet to accomodate the ideas in my concept space.) I'm probably going to stop altogether trying to read books from front to back in great deatail. It takes too much time and is overall suboptimal. 
I find it better to combine speed-reading with spend most time first trying to figure out why a problem needs solving and then to go through the incremental buildup of abstraction once more. And yes, this is advice given to me by many a wise (wo)man. It's just something you need to feel first I guess.. Entry: Energy and angles Date: Thu Jul 16 14:54:49 CEST 2009 I've talked before[1] about energy as the thing that makes stuff rotate. This is probably a bit ill-put. Energy describes a space within which motion is possible, and in general, these spaces are compact, leading necessarily to some form of "rotation". Otoh.. there is the E = hν thingy.. [1] entry://20060505-005951 Entry: Scroderiders Date: Sun Jul 19 12:11:56 CEST 2009 In "A Fire Upon The Deep" ... [1] http://groups.google.com.vn/group/rec.arts.sf.written/browse_thread/thread/cc9fb4f47bb52343 Entry: a working day Date: Tue Jul 21 14:24:30 CEST 2009 All attempts to bring structure in my day have failed miserably. For "normal" work there is little trouble: just do it. For the stimulating but difficult tasks however, some management might be necessary. Mostly I just work until I can't work no more, which often leads to dissatisfaction. Since I don't really need to steer the good days, let's try to capture the bad ones: Problems: * I can't think. Usually these are related to not getting enough sleep, being in a stressful environment, and getting hung up on a dead end and not wanting to realize it. This involves _accepting_ the stuck state and _getting out_ either by doing something non work-related, or picking a simple rote task from the list. * I can't focus - not knowing where to start. Pick a difficult task and try to advance it. Don't start anything new! Starting new things is a symptom of badly organized "next actions". Do this in the morning, 1 hour after waking. * I have no inspiration. This happens usually in the afternoon. What fixes it is to simply fix a problem by making it look significant. Methods: * Leave rote tasks for later. There are plenty of moments where thinking doesn't work, and it helps to just relax and do something less demanding. Make a list of rote tasks. This could involve soldering, sysadmin at 1/2 speed, houshold chores, ... * Reading works better at night as long as the material doesn't need too many leaps outside of known concepts. I find I have more patience at night. Let's call this "re-enforcement reading". Try to focus on simple ideas. * Don't waste the productive hours of the day to do simple things. This will bore you to death and kill motivation. Entry: ar and order of .o files Date: Tue Aug 18 10:35:48 CEST 2009 ( How not noticing important small details makes you doubt everything you know.. ) I'm getting linker errors. The rules are very simple: when you're linking .o, .a and .so files: make sure the order is right! Usually putting the low level things on which everything depends in the back works best. Actually, I've never really understood how this works exactly - my mental model of it was good enough uptil now. My guess is that all .o files get included in their entirety, but only if they are somehow referenced beforehand. But.. 
Now I have this:

  tom@zni:~/libprim/ex$ objdump -t main.o |grep yy_start
  0000000000000000       *UND*  0000000000000000 yy_start
  tom@zni:~/libprim/ex$ objdump -t sexp.o |grep yy_start
  0000000000000609 l     F .text 000000000000004f yy_start

but:

  tom@zni:~/libprim/ex$ gcc -o main main.o sexp.o
  main.o: In function `main':
  /home/tom/libprim/ex/main.c:3: undefined reference to `yy_start'
  collect2: ld returned 1 exit status

and:

  tom@zni:~/libprim/ex$ gcc -o main sexp.o main.o
  main.o: In function `main':
  /home/tom/libprim/ex/main.c:3: undefined reference to `yy_start'
  collect2: ld returned 1 exit status

I don't understand.. Damn this sucks: look at the little "l" in the objdump output: it means *LOCAL*. Indeed, the function was declared "static" through a macro. (So: a "static" definition has internal linkage and can never satisfy a reference from another object file, whatever the order; and the order rule itself only matters for .a archives, which are scanned left to right, pulling in just the members that resolve symbols still undefined at that point, while .o files on the command line are always included in their entirety.)

Entry: The Lost BASIC Generation
Date: Fri Sep 11 09:11:07 CEST 2009

Dijkstra mentioned in one of his reports that BASIC has done a lot of harm in forming the minds of programmers of my generation (1975). I got my first computer at the age of 12. It was an MSX that booted up with BASIC. I think this is a valid point. BASIC disconnects programming from mathematics.

I liked math when I was in high school. At the same time I loved computers. The strange thing is that I never saw the link between math and programming until much later in college. In retrospect, this is really strange. I remember one day trying to write a starfield demo, being able to express some mathematics, but not being able to turn it into an algorithm that actually computed a simulation. I've learned since then that this is actually quite normal :) Algorithms are strange beasts. And they are quite different from high-school math.

But that's not my point. What I mean is that if somebody had pointed me to functional programming or Lisp in that fertile time of age 13-15, when I was discovering machine language and C, things might have been quite different. I got into electrical engineering, which satisfied my desire to do something with math (algebra, numerics, DSP, control, ...) but at the same time killed any growth in programming techniques. All I knew was C/C++, and this led me to re-invent dynamic languages (mostly through Pure Data and later Forth and my Forth/Lisp hybrid PF) until I finally arrived at Scheme and the typed functional languages. It's only now that I re-connect math and software (not counting numerical algorithms based on linear algebra), after a long detour in C-land.

Anyways. This detour did give me a unique perspective. I feel I am now where I want to be: in a position where I want to make _implementation_ of numeric code easier, through better understanding of programming languages and machines. I felt the pain of low-level implementation, but I've recently realized that this doesn't have to be the end of the story.

Entry: Quit Caffeine
Date: Sun Dec 20 12:55:51 CET 2009

I quit coffee last Wednesday. Well, I had one, Wednesday after dinner, but none during the day or in the morning. Waking up is more difficult, but I find it works mostly on my motivation: it makes me a lot more lazy. I don't get hyped up about what I'm doing. When I actually get to working (once motivated) it doesn't seem to have such a large effect: I'm still able to work, and I'm even a bit more patient when things are not going as planned. Overall: even if I feel a bit down and tired, it seems better not to be so hyped up the whole day.

Entry: Overwhelmed by TODO
Date: Thu Dec 31 10:41:34 CET 2009

My interests are getting a bit dispersed.
I've got the day job now, so not much time to meander. Should I leave all of these on the back burner, and focus on a single thing? After quitting caffeine, the direct bursts of motivation in the morning are lacking. By now I should be recovered from any over-expenditure..

- fixing sweb: get rid of mutable graph structure
- explore FPGA board + Xilinx tools
- learn Haskell
- fix SC (better VM)
- port staapl to 16-bit PICs
- fix staapl docs
- make a standard forth for staapl
- make a compiler for PF
- do something with the OSD2
- autodiff and numerics
- fix staapl's asm expression language dep
- use staapl to build some uc circuits

Some of these are really not trivial and require a lot of work. The bigger question, however, is: where do I want to end up? It's not really the caffeine: it's a combination of relevance and perceived feasibility which can lead to intrinsic motivation, or more specifically the perception of a possibility of incremental growth without getting stuck in problems that require a huge amount of current context.

Currently the simplest task seems to be to learn Haskell. At least, I've found an incremental approach to it (book: ``Real World Haskell''). Second is the Scheme VM, which requires a single insight on how to refactor the current code without breaking it. Implementation should take only a day or so at full mental capacity.

I am really craving coffee at this moment. Maybe I should use it only on the weekend, or only at home, not at work? I'm giving in.. Funny. Immediately I also crave pastries. Maybe this wasn't such a good idea after all..

Entry: Learning boring stuff - There's light at the end of the tunnel.
Date: Thu Jan 14 22:41:51 CET 2010

Learning boring stuff is painful. I didn't really enjoy learning Java and the Android API. (And before that, even less EtherCAT and USB.) Now, after a very intensive month, I'm comfortable with it. I can see the straightforward design (of Android) behind the horrible mess that is a large collection of Java callbacks in a language that doesn't have a succinct syntax for closures or support for proper tail calls.

It does amaze me every time _how_ painful it is to fill your head with details for which you do not have any intrinsic motivation, especially after having had the luxury to be able to learn a lot of interesting things in the months before that new job. It's painful for a long time and then suddenly it is all obvious, and actually gets interesting, as the mind that was bogged down integrating a mountain of details has the freedom to think again and `be clever' using the recently learned tool, however ugly, in creative ways.

Entry: Free time
Date: Sat Jan 16 14:05:40 CET 2010

Scheme VM: Ha! I spent at least two quite full days to end up with something that needs some revision. It's not at all trivial! There are many design choices to make, and some of them are hard to abstract away.

Haskell: mostly reading.. See ``beautiful differentiation''. I still don't have a clear goal - except for a hardware mapper or a compiler like Staapl, where I would be more inclined to use OCaml/MetaOCaml.

Sweb: mostly thinking that I should keep the current implementation as is, but separate out some modules as purely functional code (the parsers + formatters).

Entry: Semantics is a stub
Date: Wed Mar 17 09:48:38 CET 2010

Compilation versus interpretation, this diagram again:

  data         int
   |     stx ====> fun
   |      :
   | comp :
   V      :
  data'   V
         stx'

Interpretation is used to give meaning to syntax and its manipulation. I.e.
to relate proof construction (logic) to function composition (semantics). On the more philosophical level one could argue that the real world consists of only structure and its manipulation, and that no such thing as meaning exists. Meaning is a "stub". I.e. the interpreter in the diagram above doesn't really map stx to fun. It is a formal transformation method that derives data' from (stx, data). Anyways..

Entry: Filtered Stock Market
Date: Thu Apr 22 09:40:37 EDT 2010

What would happen if transactions were not discrete events, but finite time price integrals? A world without conquest? A world without predators? Only herbivores? Essentially, it doesn't exist: all food consists of transactions (i.e. blades of grass). What matters is the size of the transaction.

Entry: TF-synth and nonlinearities
Date: Sun Apr 25 19:50:09 EDT 2010
Type: tex

The main issue in the ``analog modeling'' problem of music DSP is dealing with aliasing due to discontinuities and non-linearities. Using a dynwav TF representation, is it possible to solve this analytically? I.e. given a sum of sines and some sequence representation of a nonlinear function (i.e. a power series), can we compute the spectrum directly?

Briefly touched before in [1], this is the realm of function composition. Given a signal $x(t)$ and a wave shaper $y(x)$, a new signal $y \circ x$ can be constructed. The interesting question is: how does $y$ need to be represented when we have a representation of $x(t)$ as a spectrum sequence, and we want to get a similar spectral representation for $y(t) = y(x(t))$?

Is it possible to cheat by picking a particular representation of a periodic signal? I.e. start from phase modulation, and compose the modulators? If the wave shaping function $y$ is closed on the interval (the interesting ones that \emph{squash} a signal), there might be a way to compose shapers.

I'm looking at this in the wrong way. It's probably simpler to see this in terms of roots: polynomial composition turns a single root into an $n$--fold root by \emph{shifting} the other poly according to the roots of the first. With $p(x) = \prod_i (x-x_i)$, a composition $p \circ q$ with $q(x) = x + \epsilon$ gives $$(p \circ q)(x) = \prod_i (x + \epsilon - x_i).$$

%[1] entry://../math/20090620-152527
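A minimal worked case to make the question concrete (my own simplification, not the dynwav representation the entry is after): for a single sine through a power series shaper, the output spectrum can be written down directly, because powers of a cosine reduce to finitely many harmonics,
$$\cos^n \omega t = \frac{1}{2^n} \sum_{k=0}^{n} {n \choose k} \cos\big( (n-2k) \omega t \big), \qquad \textrm{e.g.} \quad \cos^2 \omega t = \frac{1}{2} + \frac{1}{2} \cos 2 \omega t.$$
So a degree $n$ shaper maps a line spectrum to a line spectrum with at most $n$ times the bandwidth, which is also exactly where the aliasing comes from; for a sum of sines the cross terms reduce the same way through the product--to--sum formulas.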
Entry: There are no events
Date: Fri Apr 30 16:55:16 EDT 2010

Maybe this is what is necessary to get a better understanding of many things related to music and DSP. People need events to think (words, language). Computers need events to work; they essentially _are_ the concept of discrete, directed time. However, such structures do not tend to lend themselves well to analysis because they are ultimately local, with no (obvious) connection to the global. They do not tie into their surroundings like continuous systems do. Working with continuous systems allows events to be decoupled from parameters. The latter are still a discrete set.

Entry: About abstraction
Date: Mon May 10 16:49:19 EDT 2010

Non-obvious obvious stuff. (NOOS?) If you have a concrete approach, the eventual much more abstract approach that you'll spend the whole night looking for is probably going to reduce to it. It won't solve the problem better, but it will make your original low-level solution seem pretty obvious. Probably, but not always.. That is, not if you failed to exploit an essential redundancy in your low-level approach. The question is: is this high level pattern-seeking activity worthwhile or not? From my experience, in the real world, where things just need to be done once and as quickly as possible, it is not worth it. However, by solving the problem abstractly, you will be able to solve more related problems and get a better understanding.

Entry: The day
Date: Mon Jun 21 12:02:43 CEST 2010

morning: input. study: absorb knowledge through asking the right questions. more idiosyncratic and stubborn, but open to external input.

noon: disappointment and confusion. can't do it all - do some mundane stuff instead.

day: output. creative action, problem/self driven. very idiosyncratic. brain seems to be organizing existing knowledge. not open to input.

evening: input. study: absorb knowledge through passive listening / more tolerant.

Entry: What is a difficult problem?
Date: Fri Jun 25 14:17:10 CEST 2010

One that has a difficult solution. Let's see:

- simple problem: a solution is clear intuitively at a glance
- complex problem: a factoring into simple problems is known
- intractable problem: no factoring is known

How to become an expert?

* Know the simple solutions of your domain.
* Don't build unnecessarily complex solutions through ill-informed decomposition.
* Build more intuitive understanding of composite solutions by familiarizing yourself with the ``inbetween space''.
* Recognize the smell of intractability.

Entry: Symplectic updates
Date: Thu Jul 8 10:17:52 CEST 2010

(For lack of a better name.) Probably completely unrelated to the previous post, but this makes me think of "symplectic updates", where a system state transition is factored into two steps: a position update and a velocity update (two triangular matrices). Such a discrete simulation of a continuous system keeps certain invariants intact (while still making slight errors). I wonder how that relates to Schur/Levinson style forward/backward prediction. I suspect it is actually the same..
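To make the two triangular factors concrete, a minimal sketch (my own toy example, a harmonic oscillator x'' = -w^2 x, not anything from the entry above): one step is a velocity update followed by a position update, each factor a unit triangular matrix acting on the state (x, v), so the composed step has determinant one and preserves phase space area even though it makes O(h) errors.

  -- Semi-implicit (symplectic) Euler for x'' = -w^2 x.
  -- Each update is one unit-triangular factor acting on (x, v).
  step :: Double -> Double -> (Double, Double) -> (Double, Double)
  step h w (x, v) = (x', v')
    where
      v' = v - h * w * w * x   -- lower triangular: [[1, 0], [-h*w^2, 1]]
      x' = x + h * v'          -- upper triangular: [[1, h], [0,      1]]

  -- e.g. take 5 (iterate (step 0.01 1.0) (1.0, 0.0))

Whether this is really the same mechanism as the Schur/Levinson forward/backward split, I don't know; the factorization into two triangular maps is at least the common shape.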
Entry: Selling stuff - the cost of wear
Date: Sat Jul 17 15:41:30 CEST 2010

I'm having trouble getting rid of my old music equipment. I don't think there is too much emotional value attached, so why don't I want to sell it? Isn't it simply conversion of (useless) material goods into cash?

The problem is price. How to think about it? Price should be the amount it would cost me to buy the same kind of gear back when I need it. Most of it has significant wear, and it is still worth more to me than I can probably charge for it. I.e. the burns are not worth the devaluation to me: I have an incorrect concept of the cost of wear. Maybe it is emotional value then: to me the wear is "character", but to a buyer it is just devaluation.

Entry: Dynamic range
Date: Mon Aug 2 08:30:25 CEST 2010

Playing a bit with my analog synth, and comparing it to what I can do with the Behringer digital encoder box, it's clear that solving the dynamic range problem is the essence of making a good controller / synth on a digital computer. Using analog pots, making small adjustments is really simple. The effective resolution is very high. So, the question is: can this be reproduced digitally, either using dither or something else? (Noise doesn't seem to be so much of a problem; the average is much more important.) Alternatively: what is needed to have a high-res analog controller?

Entry: Binary decorrelation
Date: Mon Aug 2 08:46:53 CEST 2010

What about simply using dithered PDM? S-D PDM gives periodic waveforms. Dithering could at least de-correlate the phase of these. Is it possible to see S-D PDM as a form of frequency-dependent phase width modulation? If the modulation can be made more explicit (i.e. what comes out has some more properties), then maybe decorrelation is feasible.

Entry: PDM: Binary Sequences and base functions
Date: Sat Aug 7 17:24:44 CEST 2010

Instead of trying to start from signals, is it possible to start from a set of base functions? Can signals be represented as a (linear?) composition of primitive signals with certain properties that carry over after combining them? The problem is that the useful linear operation (XOR) doesn't carry over to a linear operation on the densities (add -> mul). Can we use logarithms? Entropy? From afar.. If F(a+b) = F(a)F(b) then F behaves as an exponential, or + is really a multiplication. For XOR this is indeed the case. (quack quack)
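To pin down that last step (my own rephrasing, nothing beyond the identity itself): map a bit b in {0,1} to its sign s = (-1)^b. Then

  (-1)^(a XOR b) = (-1)^a (-1)^b

so XOR on bit streams becomes plain multiplication on sign streams. For the densities: a stream with ones-density d has mean sign m = 1 - 2d, and XOR-ing two independent streams multiplies the mean signs,

  1 - 2 d(a XOR b) = (1 - 2 d(a)) (1 - 2 d(b))

which is the add -> mul mismatch; the thing that does become additive is log |m|, so that would be the "logarithms" direction.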
Entry: int profile
Date: Tue Sep 7 16:22:29 CEST 2010

ubt orifuke (right hand misaligned)

Entry: DLL hell, API versions, reusability and modularity
Date: Fri Sep 10 13:09:08 CEST 2010

I think the important element isn't really reusability - it's modularity: being able to oversee a design by minimising coupling. Explicit interfaces are necessary for large projects, but they add a positive feedback loop that leads to even larger projects. What is necessary is "global API optimization": some mechanism that feeds the "global consequences" of APIs back into the APIs and _forces everyone_ to use them.

It's a bit like garbage collection and finalization: you need global information (who has a reference but isn't using it) when you perform a local operation (I want this resource now). Let's call this the "global feedback problem" (GFP) and hold it responsible for all problems in computer science. Yeah. GFP you suck.

From the other side: some local decisions have huge global impact, so in order to make good local decisions, you need to know their impact factor _before_ you make the decision. Is there a way to somehow express this more formally? Dependency graph of API decisions?

Entry: shell
Date: Sun Sep 19 00:26:58 CEST 2010

It's just a shell. Do not forget; it's easily shed.

Entry: I'm enjoying electronics
Date: Wed Dec 22 17:03:49 EST 2010

I've always been intrigued by electronics, but until recently the practical side of things didn't work so well. It requires a certain kind of "observational consistency" that doesn't work so well for my mind. Essentially, I'm not a good scientist. Like my mother and now my wife say: I don't look! I guess that's why I like programming better: more consistency, more reliable and redundant tools, easier debugging. With electronics, everything has to be right or it simply doesn't work, with very little feedback. Finding that error feedback seems to be 95% of the design process. Quite interesting and humbling, coming from paper-only design!

Entry: Basic equation solving
Date: Thu Dec 23 19:34:52 EST 2010

I've become quite bad at the on-paper manipulation for solving basic sets of equations. I guess I need to practice more. I always was sloppy, making many errors doing manipulations. It's gotten much worse though.

Entry: I'm fearful with money but fearless with time!
Date: Tue Dec 28 22:07:13 EST 2010

So what does a person need in this day and age of industrialized madness?

[1] http://www.moralquotes.com/index.php/a/2008/08/09/life_is_an_ongoing_process_of_choosing_b

Entry: Microwork
Date: Wed Dec 29 02:05:54 EST 2010

It would be great to create a more direct correspondence between need and effort for the more abstract jobs. Too bad there isn't a way to perform freelance skilled work on the micro-scale, i.e. I need something for this gizmo, so I work for 1 hour on something else. The problem is that the setup time is usually way too large, demand is of the form "want everything here, now", and the routing problem is way too complicated. Skilled work needs "storage space" in the form of credits for time spent.

Entry: Serendipity
Date: Fri Dec 31 01:28:10 EST 2010

Feynman's trick [1]: to keep a number of problems in your head such that when you learn a new trick you can see if it works with any of them. This generalizes quite well: work on many things at once, and switch between them often. I've found that getting stuck (or getting bored) means you need a break. Work on something else. If it's a real problem, and it's within your reach to solve, the solution will often come just by itself. The reality is that problems don't get solved by thinking hard. They get solved by creating lots and lots of context and allowing solutions to bubble up. I.e. be prepared, know your stuff, and you will find the way :)

[1] http://www.hooversbiz.com/2009/01/16/keep-big-problems-in-your-head/

Entry: Integer skipping
Date: Sun Jan 2 12:55:18 EST 2011

This is fairly vague and serves mostly as bait / rehash: I have a way to express the solution to a numerical problem exactly (as integers / rationals). However, the number of digits in the solution grows (linearly) with time, as each time step involves a multiplication by a rational number. How to "renormalize rationals"? More concretely, it involves a linear autoregressive system in the rationals. The main idea is that it really needs to be exact, as it involves subtraction that would kill accuracy if it were done in limited precision. This is very similar to the relation between harmony and melody, i.e. how harmonically ("semantically") different intervals can be glossed over and substituted with other ones. See also "almost equal composite numbers" [1].

[1] entry://../math/20110102-130526
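A minimal sketch of the digit growth (made-up coefficients, not the actual system from the entry): an exact AR(2) recursion over Haskell's Rational, printing the decimal sizes of numerator and denominator. Each step multiplies by fixed rationals, so the sizes grow roughly linearly with the number of steps, which is exactly the renormalization problem.

  import Data.Ratio ((%), numerator, denominator)

  -- Exact AR(2) over the rationals: x[n+1] = a*x[n] + b*x[n-1].
  ar2 :: Rational -> Rational -> Rational -> Rational -> [Rational]
  ar2 a b x0 x1 = go x0 x1
    where go u v = u : go v (a * v + b * u)

  -- Decimal digit counts of numerator and denominator.
  size :: Rational -> (Int, Int)
  size x = (length (show (numerator x)), length (show (denominator x)))

  main :: IO ()
  main = mapM_ (print . size) (take 30 (ar2 (9 % 10) (-7 % 8) 1 (1 % 3)))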
Entry: Interested in number theory
Date: Sun Jan 2 20:15:49 EST 2011

Mood: intrigued, tired, a bit dull/dumb.

It irks me that I don't "get" number theory. I mean, I don't get the point. I can see that it is really useful to know a bit about the integers and how they interact, but it seems that the structure you find when you start looking is very "jagged", compared to the "smooth" math that pervades engineering disciplines such as basic calculus and linear algebra over complex vector spaces. And yes, pun intended wrt. smooth. The discrete is truly mysterious, and the continuous seems to be only mysterious in the way it interacts with the discrete.

Entry: Our audio/video perception is not 1st order roll-off
Date: Thu Jan 6 13:10:05 EST 2011

What I mean is that some perceptual boundaries seem to be quite abrupt (possibly with a little bit of hysteresis due to attention). This contrasts with the "gradual" roll-off exposed by physical systems. I.e. there's no way we can hear 40kHz sound, no matter how loud it is. If our system were 1st order with a fixed noise floor, simple amplification would lift it over the threshold. FIXME: find a better way to express this and relate it to some actual measurements of hearing / vision.

Entry: Bytes in a row
Date: Tue Mar 15 15:01:26 EDT 2011

There's this word I keep forgetting. It is used to indicate that a bunch of data is stored at addresses without gaps, i.e. one after the other. My mind switched to using the word "consecutive", somehow blocking the recall of the word I'm looking for. Annoying, especially because this keeps happening, especially when I'm tired. I found the word, it's "contiguous"[1].

EDIT: Maybe I should name my band "contiguity theory" or something, so I won't forget. It just happened again..

[1] http://www.merriam-webster.com/dictionary/contiguous

Entry: On getting lazy
Date: Sat Jun 11 10:30:27 CEST 2011

I think it's happening to me. Why, I don't know exactly, but it has to do with the "wrong kind" of stimulus. When I'm left idling and "tricked" into remembering how good it feels to find things out, many of my old questions and projects come back. I have a lot of them, and the seeds are all over [1]. Thinking about it, there have been two kinds of wrong stimulus:

- Small things to fix, i.e. infrastructure maintenance.
- Paid contract work

They are not "bad" by themselves, but they take away the combination of boredom + freshness that is necessary to get to motivation. Basically, they stop making me think hard. I'd like to figure out how to fix that.

[1] entry://../topics

Entry: Getting Stuck
Date: Sat Oct 15 09:40:56 EDT 2011

Well voiced[1]. I have this all the time writing low-level system drivers (reading datasheets and OS manuals/code). Also when learning new things, at this moment that's Haskell's type system. Whenever a mountain of complexity threatens to bury me, it feels as if the last thing on earth I want to do is to get near it, though mostly this is not a conscious process, just as Jeff describes. Getting conscious of this problem and accepting it helps in overcoming it.

[1] http://www.jeffwofford.com/?p=835

Entry: Commutation
Date: Sat Oct 22 15:22:55 EDT 2011

I've been playing with Haskell type classes for representing meta languages, and it seems that most of the time expressing functionality is about expressing commutation relations. An interesting example of this is lambda and application for an embedded simply typed lambda calculus.

  -- Simply typed LC
  class LAM1 r where
    lit1 :: t -> r t
    app1 :: r (a -> b) -> r a -> r b
    lam1 :: (r a -> r b) -> r (a -> b)

Here app1 and lam1 deal with the commutation of the r and (->) type constructors.
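A minimal sketch of what an instance might look like (my own example, nothing from the original note): a direct evaluator where r is the identity wrapper, so commuting r with (->) is literally just wrapping and unwrapping.

  -- Direct evaluation instance of the LAM1 class above.
  newtype I a = I { runI :: a }

  instance LAM1 I where
    lit1 = I
    app1 (I f) (I a) = I (f a)
    lam1 f = I (runI . f . I)

  -- e.g. runI (app1 (lam1 (\x -> x)) (lit1 (42 :: Int)))  ==  42

A compiler-flavored instance would wrap syntax instead of values, and there lam1 is where the commutation actually costs something: turning a host function on representations into a representation of a function.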
Entry: Structure can't hide
Date: Sun Oct 23 12:38:02 EDT 2011

The crazy thing about mathematics is that even if you don't really know what you're doing, you can't really do anything wrong. I mean, you can make meaningless things that don't work, but you can't make things that _seem_ to work but don't. Of course this is obvious, but if you look at it from the perspective of a low-level C programmer who is getting beaten up by the Haskell type system, it is actually quite amazing.

Entry: Learning haskell
Date: Mon Oct 24 13:54:22 EDT 2011

I keep running into constructs where I can't get the details right. Mostly this has to do with commutation of type constructors: a (b c) vs b (a c), or mixing up functionality. Is there a way to abstract this pattern of misunderstanding? Or are some days just prone to not working out right? Often when I do get mixed up, I can't really focus on other things that are usually not such a problem. Is it just focus?

Entry: Getting confused takes a lot of energy
Date: Thu Oct 27 11:13:03 EDT 2011

I see this pattern pop up almost every day. If I stay in flow I can work for hours in a row. If I stumble out of flow and get confused, I get tired very quickly: my brain just stops working and I feel quite uncomfortable.

Entry: mutations
Date: Sat Oct 29 20:09:11 EDT 2011

I'm generating syntax trees with a state continuation monad. I get it wrong often, and this creates trees whose connections are not quite right. The difference in code is sometimes subtle, leading to widely different results..

Entry: Kernel vs. Periphery
Date: Wed Nov 2 23:13:11 EDT 2011

Many interesting systems have a structure with a highly intertwined core, where it's not clear where the center is, and a periphery that's clearly focused on the center. I.e. a programming language: lots of "feedback" on the inside, but usually libraries have more directional / bottom-up dependencies. Does this pattern have a name?

Entry: reverb: fractal compression?
Date: Mon Nov 28 20:56:34 EST 2011

Take a real impulse response, and try to compress its tail.

Entry: Studying programming language theory ...
Date: Thu Dec 15 22:13:17 EST 2011

.. messes with your head. It teaches you the habit of turning things upside down, seeing everything as the same, or just a little bit different. I'm referring more specifically to writing compilers, transforming data structures. The whole "code is data" business. Maybe I'm just educated in a slanted way, but this thinking sets me apart from most of my colleagues, who are mostly interested in getting the work done. Actually, maybe this means some of the madness I thought I lost is coming back? The kind that makes many mistakes and seems out of control, but produces a couple of gems here and there, hidden in debris.

Entry: Spatial density interface
Date: Tue Mar 27 14:39:12 CEST 2012

I never liked vector graphics, but I wonder: how to get a high-bandwidth interface to the brain other than using visuals or permanent surgery? Instead of thinking in objects at the level of representation, it might be best to let the brain do its object-recognition. This is how medical imaging works: give an MD some tools to see certain things and his/her brain will be trained to do the pattern recognition. So what data visualization should be about is just ways to project information into 2D or 3D images. 2D is straightforward, but can we do 3D? I.e. density maps? I'd like to find out how to do this density map thing with some 3D glasses. What's the thing to buy? Maybe that's a nice goal to apply that lurking DSP knowledge..

Entry: Bigger keys
Date: Thu May 3 12:48:34 EDT 2012

Got a KeyTronic KT800U2 big keyboard with big keys. It's been a while since I've used one of those. Need to learn how to type again. Curious if it will make a difference from the flatter Kensington. I'm not sure... maybe the previous one was better.. Let's sit it out for a bit.

Entry: Natural flicker, 4Hz
Date: Thu Jun 7 14:09:15 EDT 2012

This is going to sound crazy, but in my meditation exercises I've been seeing a "natural flicker" turn up in my visual field lately. More about that here [1]. But what I have seen before is that this range of frequencies is at the boundary of "events" and "vibrations". Stay below 4Hz and it feels like a rhythm with distinct beats; go above and it starts to feel like a continuous stream of vibrations that has its own quality, some mix of discrete events and continuous "buzz". Of course the transition is not a vague one, but it can be clearly experienced when doubling up. Try making a drum pattern of 4 Hz (240 bpm) and then doubling it to 8 Hz. There is a change of quality there that doesn't happen in the jumps from e.g. 2->4 and 8->12. At 16Hz it starts to become less of a "vibration" and more of a "sound". So I'd say vibration is between 6 and 16 Hz. Does this have something to do with the resonance frequency of the human body / flesh?
I did find this[2]: the RF (resonance frequency) of the human eye is about 18Hz ;) I wonder if, by making this frequency range more explicit in thinking about music, it could lead to some different way of composing rhythm?

[1] entry://../dharma/20120607-141016
[2] http://en.wikipedia.org/wiki/Infrasound

Entry: USA Dutch?
Date: Sat Jan 26 15:48:42 CET 2013

I heard yesterday that the main language in the U.S. was almost set to Dutch. I don't know if this is true, but maybe that's why U.S. English sounds so "kaks" :) urr urrr :)

Entry: Perpetually confused
Date: Mon Feb 18 13:09:04 CET 2013

I seem to do better in the role of perpetually confused student, as opposed to tutor or consultant (in the original sense). In a sense, I perform better when I don't 100% understand what I'm doing. It takes a long time to get to mastery, simplicity.

Entry: Appreciating research
Date: Sun Apr 21 10:07:04 EDT 2013

The best way to appreciate research is to try to solve a problem yourself, realize it is not easy, and then read a real solution. I have often found that my arrogance gets in the way, underestimating the complexity of the problem, or completely missing some structural element that can be exploited.

Entry: Distributed resource allocator
Date: Sun Dec 13 19:37:45 EST 2015

The problem with trade is that it is hard to do without a universal placeholder such as money. Is it possible to redefine money in terms of computer-assisted trade interactions? E.g. a system that keeps "score", to eliminate abuse. E.g. is it possible to introduce "local virtual money"? Is it possible to reconstruct stable money without state intervention? Something ephemeral, that can't be stockpiled? Maybe the stockpiling is the real problem?

Problems:

- generalization: you want an apple, here's a pear
- truncation: can't bring 100 people together to trade
- time management: drop off, pick up sequencing
- distributed architecture (this needs to run server-less)

Entry: Do we not see glue code?
Date: Fri Feb 12 15:16:33 EST 2016

Or glue logic. Whatever. The thing that's omnipresent in coding is the connection and transformation of arbitrary representations, i.e. the implementation of isomorphisms and projections. The "actual stuff" that's in a representation is merely a mental model, but it is that model that is important in reasoning about how data is combined into new data. I spend a very large amount of time shuffling between transformations. Is it possible to reduce that?

Entry: Everything is broken
Date: Sun Feb 28 21:01:54 EST 2016

Drunken idealist speaking. What can be done about it? The problem is interaction: APIs are incompatible, and state assumptions are wrong. There seems to be only one way, which is to build an AI that can do refactoring.

EDIT: 2016/12/27 Interestingly, there seems to be some merit to setting up "chains" to solve resource allocation for tasks, e.g. where to run code based on optimizing I/O overhead.

Entry: Programming is too low-level
Date: Tue Mar 8 12:34:53 EST 2016

What I find is that I still spend most of my time finding neat things to do given the tools I have. There is no true purpose in this other than the joy of finding things out. I am most confronted with this in the synth work. Bogged down in infrastructure, continuously. All previous attempts to build a workable programmer-computer interface have not reached the point of actually reducing the barrier. So what is the real problem?

Entry: Interfacing
Date: Mon Dec 26 17:56:13 EST 2016

( Slight manic freeflow - maybe there's something there. )
Interfacing has been the problem ever since I started programming: adapting formats so meaning can be connected. Shared meaning = connection after unpacking, translation of interfaces. The shared meaning is there because the translation exists. Human thinking seems to abstract away that translation layer, which is what makes programming so difficult: making the translation explicit and adapting the small differences in _actual_ meaning is where the bulk of the work goes. Because the bulk of the work is in the translation, it might be argued that meaning itself resides in the encoding/decoding process, connecting internal representation to effects in the world. 99% of the work I do is ad-hoc translation. Computation and algorithms by themselves are only a small subset. Is there a way to remove this boilerplate once and for all?

( The funny bit is that the last couple of posts talk only about this. )

Entry: Ideas with type errors
Date: Sat Jul 21 22:21:53 EDT 2018

Wine night, so free associations abound. Why I need a framework: my mind generates a lot of nonsense, but if I'm living in a formal system, that really constrains it enough to get to something. My mind generates a lot of ideas. Most of those are nonsense. Having a formal system right in front of you (a "nature") really helps to quickly test them and weed out the junk. Without that, it goes astray very quickly.

Entry: What about this log
Date: Sat Jul 21 22:24:23 EDT 2018

It's been a while. What's in here? I haven't read it yet, but I think it is about "intuitive madness", i.e. what happens when ideas that are based on some fundamental misunderstanding are allowed to proliferate without checks. Yeah, I started reading and this is indeed insanity. But there might be some leads in here. What it needs, though, is some formalization. It helps to write (Haskell) programs to encode some of this.