Hello, I'm looking for a method to smooth data. I have talked to the math wizards at my local university and googled the interwebs, but so far I have found (almost) nothing robust enough for my data. I have also looked into the Python packages scipy and numpy, and into R. What I am looking for is something like a moving average, but a moving average only works correctly if Xn - Xn-1 = const, which is not always the case. Basically, I'm looking for a method similar to a moving average, but one that takes into account that the sample spacing on the X axis might not be constant. For example, if you measure temperature every day for a year, you can smooth the data using a 10-day moving average. But if you measure temperature irregularly (twice on some days, not at all on others), then the moving average will fail. Thank you.
Honestly, if the data-smoothing tools in programs like Excel or some of the more complex packages won't do what you want, and you can't find a method that is 'robust' enough, your data may simply not be presentable in that way. Even Excel can do polynomial smoothing. And your example is a bad one: at best, it shows poor data-gathering practice.
Have you considered polynomial interpolation? Though, given n points and the representative polynomial f, there's no guarantee that f[a_(n+1)] will be anywhere near the actual value. Another problem is that if you have A LOT of points, the resulting polynomial might not be of any practical interest, since people like to keep things simple. If you're sufficiently skilled, you might be able to look at the graphed data and make an educated guess as to what sort of function you need (maybe a/ln(b*x), e^(a*x), a/(x+b) - c*sin(x^3), etc.) and just fit the constants a, b, c, .... That takes some experience, or a lot of time on your hands. Another possibility is to accept that the data sucks.
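For what it's worth, a least-squares polynomial fit (rather than exact interpolation through every point) is a one-liner in NumPy, and it doesn't care whether the X values are evenly spaced. The data and the degree below are made up purely for illustration:

```python
import numpy as np

# Hypothetical irregularly sampled data (made-up values)
x = np.array([0.0, 0.7, 1.1, 2.5, 3.0, 4.2, 5.0])
y = np.array([1.0, 1.9, 2.4, 4.8, 5.9, 8.1, 9.7])

# Least-squares fit of a low-order polynomial; degree 2 keeps it simple
coeffs = np.polyfit(x, y, deg=2)

# Evaluate the fitted polynomial at the original sample points
smooth = np.polyval(coeffs, x)
```

The fit minimizes the squared residuals over all points instead of passing through each one, which is what gives it the smoothing effect.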
Try taking the moving average over both the X and Y axes. E.g., if you want to average over sets of 10 consecutive points: \( \begin{align} \tilde{Y}_{n} \,&=\, \frac{Y_{n} \,+\, Y_{n+1} \,+\, \ldots \,+\, Y_{n+9}}{10} \\ \tilde{X}_{n} \,&=\, \frac{X_{n} \,+\, X_{n+1} \,+\, \ldots \,+\, X_{n+9}}{10} \,, \end{align} \)and plot the \(\tilde{Y}_{n}\)'s vs. the \(\tilde{X}_{n}\)'s, though I don't know about calling a method like this "robust".
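In Python/NumPy terms, the idea above could look something like this (the window size and the 'valid' edge handling are my own choices, not part of the method itself):

```python
import numpy as np

def xy_moving_average(x, y, window=10):
    """Average BOTH coordinates over sliding windows of `window`
    consecutive points, so irregular X spacing is averaged too."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only windows that lie fully inside the data,
    # so the output is len(x) - window + 1 points long
    return (np.convolve(x, kernel, mode="valid"),
            np.convolve(y, kernel, mode="valid"))
```

You'd then plot the returned \(\tilde{X}\) array against the \(\tilde{Y}\) array instead of the raw data.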
Since for weather the 4-hour variability can easily exceed the 24-hour variability, I'm not sure that smoothing irregularly sampled data is going to tell you more about the weather than about your sampling practice. However, it looks from first principles like you could adapt an infinite impulse response (IIR) filter to irregularly sampled data. Let \(\tau\) be your smoothing constant with units of time. Let \(t_i\) be the times of your data collection. Let \(x_i\) be the values of your data. Let \(y_0 = x_0\) to help smooth the initial transient. Then let \(y_i = ( 1 - \alpha_i ) x_i + \alpha_i y_{i-1}\) where \(\alpha_i = e^{- \frac{t_i - t_{i-1}}{\tau}}\). In the case of uniformly sampled data with period \(\Delta t\) this is an exponentially weighted moving average (EWMA), \(y_i = ( 1 - e^{- \frac{\Delta t}{\tau}} ) x_i + e^{- \frac{\Delta t}{\tau}} y_{i-1}\), and in the limit of a small ratio \(\frac{\Delta t}{\tau}\) this becomes \(y_i = \frac{\Delta t}{\tau} x_i + (1 - \frac{\Delta t}{\tau}) y_{i-1}\). But statistically this is just manipulation of your data and may not reflect the system you were measuring, especially when the signal has power at frequencies higher than half your average sample rate. It's only suitable for "guiding the eye" when you think a plain linear fit would not suit you. I think using AJ Johnson's non-uniform discrete Fourier transform, filtering out the high-frequency components, and then reconstructing the time series should also work (in the absence of high-frequency content missed by the sampling). Here's a pointer to it for MATLAB: http://www.mathworks.com/matlabcentral/newsreader/view_original/765504
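A minimal Python/NumPy sketch of the irregular-sample filter defined above (same \(\alpha_i = e^{-(t_i - t_{i-1})/\tau}\) and \(y_0 = x_0\); the language choice is mine, since the original post only gives the recurrence):

```python
import numpy as np

def irregular_ewma(t, x, tau):
    """IIR/EWMA smoother for irregularly sampled data:
    y_0 = x_0, then y_i = (1 - a_i) x_i + a_i y_{i-1}
    with a_i = exp(-(t_i - t_{i-1}) / tau)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # seed to damp the initial transient
    for i in range(1, len(x)):
        alpha = np.exp(-(t[i] - t[i - 1]) / tau)
        y[i] = (1.0 - alpha) * x[i] + alpha * y[i - 1]
    return y
```

Note that a longer gap between samples shrinks \(\alpha_i\), so the filter automatically trusts the new sample more after a long silence, which is exactly the adaptation to irregular spacing.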
Have you plotted the data points? This is usually a good idea prior to trying any curve fitting. Sometimes a plot of the data indicates that there is no reasonable function which will approximate it. If the appearance of the point plot is encouraging, you might try a spline fit (use Google). If my memory is not way off base, a spline fit uses least-squares approximating methods, but does not attempt to fit all the points with one function. It chooses sets of points and develops a different approximating function for each set. Using this method, data conforming to an exponential (or some esoteric) function can be closely approximated using third- or fourth-order polynomials. Note that an exponential function cannot be approximated over a large range by a single third- or fourth-order polynomial, but a spline fit will do a good job.
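One concrete tool for this (my suggestion, not the only spline-fit option) is SciPy's UnivariateSpline, which fits a piecewise-cubic smoothing spline by least squares instead of passing through every point, and which accepts irregularly spaced X values. The synthetic data below is just for illustration:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, 40))        # irregular sample times
y = np.sin(t) + rng.normal(0, 0.2, 40)     # noisy signal

# s > 0 trades closeness to the data for smoothness;
# s = 0 would interpolate through every point.
# A common heuristic is s ~ (number of points) * (noise variance).
spline = UnivariateSpline(t, y, s=len(t) * 0.2**2)
y_smooth = spline(t)
```

The larger you make `s`, the fewer knots the spline keeps and the smoother (but less faithful) the result.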
Thank you for the responses, I'll try out rpenner's and przyk's methods. I already have a script for cubic-spline (interpolation), but it passes through all the original data points, which is why I (sometimes) need to smooth the data before splining. I found out that OpenOffice Calc has cubic- and B-spline line-smoothing options, and it seems that the B-spline could be usable for smoothing out noise, and robust enough when the data is irregularly spaced on the X axis. [image: B-spline in OO Calc]
This doesn't look like data where you need to draw a line at all, unless you have a model that you are comparing to the data.