# (Real) Machine Learning

Discussion in 'Intelligence & Machines' started by BenTheMan, Oct 3, 2012.

1. ### BenTheMan | Dr. of Physics, Prof. of Love | Valued Senior Member

Messages: 8,967
Is anyone here an active researcher/practitioner of machine learning algorithms, or interested in learning more?

I've been using a few free, graphical tools to play around with some simple algorithms (Weka, Knime, RapidMiner), and I'm looking to expand into some of the Python modules for machine learning: Pycluster, pyml, etc. R also has some nice packages for machine learning that I'll probably get around to fooling with, too.

Are there any good references that people have found particularly helpful?

http://www.cs.waikato.ac.nz/ml/weka/book.html
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
http://rapid-i.com/content/view/181/190/
http://www.knime.org/

Anyway, if anyone else is interested in doing a self-study course, let's talk. My goal is to use the Python modules to solve an actual problem. Which modules and which problem are both undecided, but there are about 5 that I can think of that would mean a big bonus for me.

I'd say you should be proficient in either Python or R, or some other language with ML libraries (not sure about Java, C#, C++, though I suspect all of those languages have good libraries). Math-wise, if you know some basic probability and statistics, and some calculus, you're probably OK. I'm pretty math-proficient (probably better than you, with about 10 exceptions in SciForums), so I can explain anything you (or y'all) don't understand.

3. ### BenTheMan | Dr. of Physics, Prof. of Love | Valued Senior Member

Messages: 8,967
Hopefully this will spur some discussion, and we can set up a GitHub page or something.

5. ### Gustav | Banned | Banned

Messages: 12,575
thanks for the mention

/proud

7. ### quadraphonics | Bloodthirsty Barbarian | Valued Senior Member

Messages: 9,391
I'm pretty knowledgeable in machine learning and related math and algorithms. I don't do Python or R, or use much in the way of off-the-shelf libraries though (HTK is probably the only standard package I've used; generally I roll my own stuff). Anyway, feel free to ask me about whatever.

8. ### BenTheMan | Dr. of Physics, Prof. of Love | Valued Senior Member

Messages: 8,967
Glad to know there's an expert here

Do you have any general advice for getting started? Like, what types of problems seem to be best solved by AI/ML?

I can't be fully specific publicly about the problem I'm working on, I think, so if you can't say anything in general I understand. But I have a question about how well you can do when implementing algorithms yourself. (I realize this is a bit goofy to ask, given that you said you had little experience with out-of-the-box stuff, but I thought I'd ask anyway.)

We've implemented a simple machine learning algorithm that gets us an answer with about 10% accuracy in Knime. (We're trying to disaggregate a signal into its constituent components.) We're just using a basic perceptron to do some clustering analysis, and 10% accuracy is what I can get with a back-of-the-envelope calculation. Some people I work with see this as "Oh, machine learning won't work for this problem". I see this as "We used an out-of-the-box solution, and we may do better by implementing the algorithms ourselves". (There's also an issue with the data that I'll leave aside for now. It's a widely used, publicly available data set, but I have my doubts about its integrity.)

Do you have any opinion on this? I'm happy to dive into whatever language (Python, R, C++, C#, Java, in that order!) and hash out the algorithm if I have a reasonable chance of getting better results. It's kind of a back burner/stretch goal for me, so it would really look good to slap the bosses with a solution a month before bonus time.

9. ### Billy T | Use Sugar Cane Alcohol car Fuel | Valued Senior Member

Messages: 23,198
Is this thread only concerned with what one might call algorithmic computer learning or does it include what is commonly called neural networks that after training on a "training set" can solve related problems?

10. ### BenTheMan | Dr. of Physics, Prof. of Love | Valued Senior Member

Messages: 8,967
To my knowledge, "machine learning" basically means "some algorithm that uses a training set to build a model, and when fed new data that was not in the training set, can make accurate assessments based on the model". The algorithm (and thus, the class of models produced) is up to the researcher--from what I understand, neural nets are an example of this (I'm only really familiar with the perceptron). Less interesting models are also in this class: linear regression, for example. The exact algorithm is dictated by the data, and what you're trying to achieve.
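As a rough illustration of that definition (train on one set, then assess data that was not in it), here is a minimal sketch using the linear regression case mentioned above. The data points and the names `fit_line` and `predict` are made up for the example; nothing here comes from a particular library.

```python
# Toy illustration of "build a model from a training set, then apply it to new data".
# The data and function names are invented for this example.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted model to an input it has never seen."""
    slope, intercept = model
    return slope * x + intercept

# Training set: noiseless samples of y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

model = fit_line(train_x, train_y)

# x = 10 was not in the training set.
prediction = predict(model, 10.0)
print(prediction)
```

Swapping `fit_line` for a perceptron or any other learner changes the class of model, but the train-then-predict shape stays the same.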

That being said, I am speaking from a position of relative inexperience. Quad may want to correct me here.

11. ### Billy T | Use Sugar Cane Alcohol car Fuel | Valued Senior Member

Messages: 23,198
It is true, and was so 35+ years ago when I was playing around with what I call "connection machines" but all others call "neural networks,"* that they were emulated on conventional digital algorithmic machines. However, once they have been well trained, they can be much faster, lighter, and use less power if they really are hard-wired connection machines. They are a subgroup of what are called analogue computers/machines, with advantages over, but zero flexibility compared with, the modern digital computer.

For example, some (if not all) US torpedoes have electrical circuits to process acoustical signals to determine if they are nearing the intended target, which may zig-zag as it hears the torpedo coming, and when best to self-destruct (explode). For sinking a big ship, the torpedo does not want to hit it. Instead, at a depth below the ship about equal to the ship's length, the torpedo just makes a big gas bubble. Then both ends of the ship are supported by water but the middle falls into the bubble hole. Ships are very weak structures that easily buckle. In fact, just loading an oil tanker improperly (too much oil in chambers at one end) will lift the other end up, but not even out of the water when the stress ruptures it, sort of like the ship is trying to get support for the front end by breaking back down into the water.

* There is nothing "neural" about them, except that the neurologist Hebb did postulate how nerves interact to learn, and his mechanism was implemented in some of the first connection machines, but better learning schemes (ways to adjust the interconnection weights) were soon developed. Once these weights are known, the machine is just three layers of nodes interconnected by resistors (usually none from first to third layer). While I was exploring them, it was proven that a three layer machine can do anything a 4 or more layer machine can, so I think only three layers are used. (A machine of only wires and resistors is very cheap to mass-produce, and the "answer" can come out in less than three times the transit time of light!)
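To make the "three layers of nodes joined by fixed weights" picture concrete, here is a small sketch of a forward pass through such a machine. The weights are hand-picked for illustration (they compute XOR) rather than learned, and the hard-threshold nodes and all names are assumptions of this example, not a description of any particular hardware.

```python
def step(x):
    """Hard-threshold node: fires (1) when its weighted input is positive."""
    return 1 if x > 0 else 0

def layer(inputs, weights, biases):
    """One layer: each node sums its weighted inputs plus a bias, then thresholds."""
    return [
        step(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-picked (not trained) weights. Hidden node 0 acts like OR, hidden node 1 like AND.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]
# Output node fires when OR is on and AND is off, which is XOR.
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [-0.5]

def xor_net(x1, x2):
    """Input layer -> hidden layer -> output layer, with no first-to-third links."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 1, 1, 0]
```

Once the weights are fixed like this, evaluating the network is just multiply, add, and threshold, which is why it maps so naturally onto a hard-wired circuit.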

12. ### BenTheMan | Dr. of Physics, Prof. of Love | Valued Senior Member

Messages: 8,967
I see...so you're basically using a circuit to do the same thing. I guess this is where the name "Support Vector Machine" comes from?

13. ### Billy T | Use Sugar Cane Alcohol car Fuel | Valued Senior Member

Messages: 23,198
I never heard of it but - I dropped out of following this field 30+ years ago. I bet a good example of at least an analog computer in use is the control system for most building elevators.

14. ### quadraphonics | Bloodthirsty Barbarian | Valued Senior Member

Messages: 9,391
I'm not sure there is a good general answer to your second question; it really depends a lot on the specific application domain. Intuitively, machine learning solutions tend to pop up in places where you're dealing with some data analysis problem that has gotten to be too complex or interconnected to work your way through analytically, but for which you still have relatively high confidence that a solution is actually computable. Then you pick out an appropriate machine learning system, throw some training data at it, and see what pops out.

As far as general advice for getting started, all I can recommend is to familiarize yourself with the more popular, noteworthy techniques and start building intuition about what types of settings they do (and do not) excel in. This would be stuff like Gaussian Mixture Models/EM algorithm, HMMs/Baum-Welch algorithm, support vector machines, multilayer perceptrons/back-propagation, etc.

Right, the basic question of how feasible the problem is and what level of performance can be expected (even from an optimal system) always follows along closely behind any use of machine learning algorithms. On the one hand, you can investigate the performance boundaries by throwing more and more expensive, powerful machine learning architectures at it, and then look for a "knee" where the performance ceases to improve significantly with increases in the complexity of the machine learning system. On the other hand, you can turn to things like information theory to try to get performance bounds that do not depend on any particular implementation - in the case of signal decomposition, you can maybe look at what is the mutual information between each desired component and their mixture, to get an idea of how "separable" the components are in the first place. This assumes that you can do a toy-problem version, wherein you start with the components and then artificially combine them to get a test input - then you can use information theory to quantify the possible performance, and also run the actual system and see how close it comes to reconstructing the components you started with.
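A very small sketch of that toy-problem idea, under heavy assumptions: two independent discrete components, each equally likely to take one of two values, mixed by simple addition. The helper names and the component values are invented for illustration; a real signal would need a proper estimator rather than exact enumeration.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information(pairs):
    """Mutual information in bits between x and y, for equally likely (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

def component_vs_mixture(a_values, b_values):
    """Pair component a with the mixture a + b over all equally likely combinations."""
    return [(a, a + b) for a, b in product(a_values, b_values)]

# Components with overlapping ranges: the mixture value 1 is ambiguous.
overlapping = mutual_information(component_vs_mixture([0, 1], [0, 1]))
# Components with well-separated ranges: every mixture value identifies a.
separable = mutual_information(component_vs_mixture([0, 1], [0, 2]))

print(overlapping, separable)  # 0.5 bits versus 1.0 bit
```

Component `a` carries 1 bit in both cases, so the first mixture can at best recover half of it no matter how clever the learner is, while the second loses nothing.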

15. ### quadraphonics | Bloodthirsty Barbarian | Valued Senior Member

Messages: 9,391
No, support vector machines are named for the "support vectors" that figure prominently in the training of said machine. Basically they are classifiers which attempt to draw a hyperplane that separates two classes of data, with the optimality criterion that the hyperplane should be as far as possible from the closest training data in each class. Those training data are called the "support vectors," as they are the ones that determine the hyperplane parameters and so machine performance. I.e., they are the "support" that the hyperplane is built on top of.
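The idea is easiest to see in one dimension, where the "hyperplane" is just a threshold. The sketch below is a toy under that assumption (separable 1-D data, every negative below every positive); it is not how a general SVM is trained, and the function name is made up.

```python
def max_margin_threshold(negatives, positives):
    """1-D hard-margin separator, assuming all negatives lie below all positives."""
    left_support = max(negatives)    # closest negative point to the boundary
    right_support = min(positives)   # closest positive point to the boundary
    if left_support >= right_support:
        raise ValueError("classes are not separable by a single threshold")
    threshold = (left_support + right_support) / 2
    margin = (right_support - left_support) / 2
    return threshold, margin, (left_support, right_support)

negatives = [-3.0, -2.5, -1.0]
positives = [2.0, 3.5, 5.0]

threshold, margin, support_vectors = max_margin_threshold(negatives, positives)
print(threshold, margin, support_vectors)  # 0.5 1.5 (-1.0, 2.0)

# Dropping every point except the support vectors leaves the boundary unchanged.
reduced = max_margin_threshold([-1.0], [2.0])
print(reduced == (threshold, margin, support_vectors))  # True
```

The second call is the point: only the closest training data from each class determine the boundary, and the rest could be thrown away without changing it.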

16. ### Chipz | Banned | Banned

Messages: 838
Judging by the company you're with I am guessing you are attempting to decompose a noisy signal with non-ordinal properties. Clustering techniques will likely be fruitless. Most off the shelf products provide solutions better than 10%...
Apache products are typically very reliable; the Apache Machine Learner is named Mahout. https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Recently I have been reading through the white-papers of Numenta, https://www.numenta.com/technology.html.
I'm answering now briefly because I don't have internet at home and am busy at work. I planned on writing a simple implementation to test out the algorithms in the Numenta white papers. I'm proficient in C, C++, Python, D, and Haskell. If you come up with some good ideas which would be fun to implement, I would enjoy helping with the programming side. I have a github account I can send you through PM if you like.

17. ### AlphaNumeric | Fully ionized | Registered Senior Member

Messages: 6,702
I've done various bits of machine learning work, covering the usual (un)supervised learning, inference, model fitting, decision making etc. stuff for things like object detection and tracking, feature detection, world state inference, multi-agent systems and autonomy. Much of it has been the more fundamental abstract side of things, nothing coded up in a way where you could slot it into a pre-existing architecture. As a bit of a side hobby to try to motivate me to learn Java, Python and C++, I'm currently messing around with a Raspberry Pi, Kinect, the Arduino architecture and the MRPT open source Kinect interface, which allows for machine learning like object detection and classification, environment mapping and navigation (VSLAM stuff), object tracking and modelling (Kalman filters etc.) and loads more. By the end of what will likely take months, if not years, I hope to have used the little project to learn various software languages and hardware platforms, get significant experience with a variety of machine learning methods and build myself a little autonomous 'toy'. There's plenty online of people already doing such things, and for me the main obstacle is learning the languages many open source implementations of stuff like VSLAM are written in (i.e. C++). I could program a rudimentary SLAM algorithm myself, but there are people who do this for a research job, so I'll not produce anything better than what they release. Once you've got something which can 'see' and process the raw data (images, depth maps) into things like image feature lists, object identifications and environment point cloud maps, then you can start doing lots of individual stuff in a more interesting, from my point of view, way. Much of this will come in useful for some work I know I'll be doing in the coming year, so I'm getting a head start.

18. ### BenTheMan | Dr. of Physics, Prof. of Love | Valued Senior Member

Messages: 8,967

AN: As an aside, Java, Python, C++ is probably the correct order to learn the languages in. Java is starting to disappear, as is C++, in favor of lighter languages like Python and C#, but lots of people still code in Java. (And really, I think many of those people are doing so out of habit: at our shop we just compile Java into JavaScript.) I think C++ is limited to big enterprise offerings---it's pretty powerful, but it has a ton of overhead, and can be particularly difficult to debug. Also, the syntax is not particularly human-readable, especially when it comes to things like boost::lambda and function pointers, etc. One thing you may have noticed about Python vs. C++ is the speed of development---Python is just so much easier to code in. And, C# and Python both are untyped languages (or weakly typed?). Basically, you just declare variables without types and let the compiler/interpreter figure out the rest, and you use casting to ensure a particular interface. Also, both Python and C# handle namespaces a bit differently than C++---in both Python and C# the namespaces are just files. So C++ namespace = .py or .cs file. And finally, in Python and C# (I think the same is true in Java, but I don't know the language well enough) everything's a pointer, so the pointer logic that drives me insane is thankfully absent.
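On the "untyped (or weakly typed?)" aside: Python is usually described as dynamically but strongly typed, meaning names carry no declared type but values do, and mismatched types are not silently coerced. (C#, for comparison, is statically typed by default; its `var` keyword is compile-time type inference.) A short sketch of the Python side, with variable names chosen only for illustration:

```python
# Dynamic typing: the same name can be rebound to values of different types.
value = 42
kind_before = type(value).__name__   # "int"
value = "forty-two"
kind_after = type(value).__name__    # "str"

# Strong typing: mixing incompatible types raises instead of coercing.
try:
    result = 1 + "1"
except TypeError as exc:
    result = f"TypeError: {exc}"

print(kind_before, kind_after)
print(result)
```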

19. ### quadraphonics | Bloodthirsty Barbarian | Valued Senior Member

Messages: 9,391
I would skip Java entirely unless you have a clear, salient reason that you need to use it.

If only...

Yeah, no. C/C++ may no longer be the universally dominant language(s) they once were, but if you care about performance or are working a real-time/embedded context, you will be writing C/C++ - if not assembly.

Real-time applications, embedded stuff, and performance-sensitive work in general is still very much C/C++ territory.

If we're talking offline scientific processing, it may not be critical. But even there I strongly prefer it to scripting languages due to the runtime differences (which can be orders of magnitude).

There's definitely a considerable learning curve there, but once you're over it and have learned the good tools I find my workflow in C/C++ to be very efficient. Your mileage may vary, and I don't pretend that it didn't take considerable investment on my part to reach that point, but I do feel that the pay-off is worth it.

Something about striking the correct balance between clarity and transparency in code syntax belongs here. If you really don't care about the details of what goes on under the hood, then C may be too far over on the "transparent" side of the balance. But if you do care about that - which you probably will if you care about optimizing performance (which, in turn gets pretty important on serious machine learning projects due to the amounts of data and complexity of the processing) - then C is a pretty good compromise between those two ideals.

Runtime is at least as important as coding time when it comes to development speed, at least in applications like machine learning where there is a lot of data to sort through and a lot of processing to be done on it. Saving 5 minutes coding up a prototype does you no good if you have to wait an extra 2 hours to see whether that prototype actually works. The whole reason I got into C was exactly that the runtimes on the rapid-prototyping scripting language I was using to do machine learning experiments became unmanageable and slowed my development speed to a crawl. And once I got good at C, I found that there was very little difference in the time it takes me to code up prototypes. Meanwhile, I now get results within seconds of taking my finger off of the "enter" key, instead of having to wait for hours.

If your development time is dominated by coding time, and not runtime, then either you are working in a domain that is not performance-intensive, or you are not a proficient programmer. Machine learning does not fall into the former category. And while it's okay to not be proficient at programming, the relevant thing here is that your coding skills and speed will improve over time, but the runtime penalty of working in an interpreted language will only get more and more severe as your system gets more complicated and test cases proliferate. So choosing a programming language that emphasizes coding speed over runtime may look attractive at the outset, but will eventually bite you in the ass - you'll just end up having to stop what you're doing and switch to the faster languages anyway. And, yes, I am speaking from bitter experience there.

That is a flaw, not a feature. Programmers should think about type and scope when writing code, so they don't produce bloated garbage. Yes, this slows you down at first when you aren't used to it, but in the long run you learn to think in those terms at the outset and end up writing clean, efficient code at a good clip.

Also Python has whitespace dependencies, which I find infuriatingly inane.

All that said, one does not necessarily have to choose one or the other. You can always start out in a higher-level interpreted language like Python and then, as you start butting up against runtime limitations, go and replace the workhorse modules with optimized, compiled C libraries. This gets you the best of both worlds in the sense that your top-level organization and various glue logic can all be in Python, while the really intensive stuff gets handled by C. On the other hand, it is also the worst of both worlds in that you have to learn both of the languages and associated toolchains, as well as how to get them to play nicely together. But that can be broken down into stages - you can start out in Python, and then learn C as needed later on (this is how I did it, except with Matlab instead of Python).
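A cheap way to get a feel for that split without writing any C is to compare an explicit Python loop against built-ins that, in the standard CPython interpreter, do their iterating in compiled C. This is only a stand-in for a real compiled extension; the function names and the workload are invented for the example, and the timings will vary by machine.

```python
import operator
import timeit

def loop_sum_of_squares(values):
    """Interpreted version: every multiply and add runs as Python bytecode."""
    total = 0
    for v in values:
        total += v * v
    return total

def builtin_sum_of_squares(values):
    """Same result, but sum(), map() and operator.mul do the looping in C."""
    return sum(map(operator.mul, values, values))

values = list(range(1000))
loop_result = loop_sum_of_squares(values)
builtin_result = builtin_sum_of_squares(values)

# Time both versions; the Python code around them is just glue.
loop_time = timeit.timeit(lambda: loop_sum_of_squares(values), number=200)
builtin_time = timeit.timeit(lambda: builtin_sum_of_squares(values), number=200)
print(f"loop: {loop_time:.4f}s  builtin: {builtin_time:.4f}s")
```

Both functions return the same number, so the caller does not care which one it gets; that interchangeability is what makes the staged "prototype in Python, move the hot loop to C later" approach workable.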

More generally, when approaching these kinds of decisions, it's important at the outset to figure out what your goals and priorities are. If you are going to be developing and maintaining a large machine learning system as an ongoing product, then you're going to care more about performance. If you're going to make a career out of it, it's worthwhile to invest a few months up front to get up to speed on a language that is going to minimize your development time in the long run (and runtime will be very important there, for stuff like regression testing). If you are doing this as a class or just to play around for self-edification, then probably performance is not such a big concern and the ease and speed of getting the basic ideas coded up so you can play with them is the priority.

20. ### Chipz | Banned | Banned

Messages: 838
Ech.. I wasn't going to respond to this post until you called un-typed languages a flaw rather than a feature.

Agreed.

Agreed.

Ech... exceptionally wrong! It seems to me you don't know how to use run-time languages and seem to have decided untyped languages are inherently dynamic. If your C program took 5 seconds and your (say) Python program took 10 seconds, you probably don't know Python or too much of your cost is system based. And how often is your program constrained by system time and not i/o bound? Only when it's a bad program. Numpy and Scipy libraries controlled by a dynamic language often run at near C++ speeds. Control logic rarely takes even 1% of run-time. So if you are running that much slower while the rest of the developer world is not, the problem is you.

Prototyping Python with BLAS, Numpy, Scipy, PyTables and a whole host of other libraries will usually take HALF the time of a C++ prototype. Not in spite of being dynamic...but because of it! In a prototype you design preliminary data structures, perhaps tuples...or dictionaries, classes... whatever it may be and you initialize the control logic. In the prototype you may realize your design neglected a pass-through variable or a necessary call back or virtually anything else. In C++ what is the chain hierarchy which is needed to make even a trivial change? A small change can have huge implications on a C++ code base, whereas in Python it practically never does. Since part of the prototype is acknowledging that there will be unexpected hurdles, this matters even more!

Why should I always think of typing when I write code? Because you said so? Hell... isn't that a big benefit of the typedef and the macro-def? And what does typing have to do with bloat? You're talking absolute nonsense here. Haskell is a weakly typed compile time language which in algorithmic design will often outperform C or C++ due to its lazy evaluation. It will also result in far more code reuse. Your mind seems to be stuck in 1999.

Code:
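import Data.List (unfoldr)
-- Three lazy definitions of the infinite Fibonacci list
fib_unfold = unfoldr (\(f1, f2) -> Just (f1, (f2, f1 + f2))) (0, 1)
fib_seq    = 0 : 1 : zipWith (+) fib_seq (tail fib_seq)
fib        = 0 : scanl (+) 1 fib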

Is that inane too?

21. ### quadraphonics | Bloodthirsty Barbarian | Valued Senior Member

Messages: 9,391
On the contrary, these days I spend 100% of my work time working in a run-time interpreted language (Matlab). The problems that creates for me, and the work-arounds I've had to pursue for it, are exactly the basis for my views.

No. Moreover, my comments there didn't say anything about typing, so not sure where that is coming from.

I don't use Python, but the rule-of-thumb I have for comparing my interpreted programs vs. their compiled C equivalents is two orders of magnitude.

If Python gets closer than that, well, bully for Python. But I have plenty of friends who do use Python, and find themselves replacing the intensive functions with C in order to keep runtime under control (and/or relying on the various compiled extension libraries).

And, no, the issue is not system-based costs. It's just computationally-intensive functions. That kind of thing abounds in machine learning, what with the prolific use of iterative nonlinear optimization algorithms.

Right: that kind of thing - relying on compiled C/assembly-optimized libraries for the intensive stuff - is exactly one of the compromises I suggested.

It hardly goes to your contention that interpreted languages are plenty fast enough on their own. Those are exactly examples of cases wherein the performance of such was way too slow, and so they were replaced by compiled C libraries to handle the intensive stuff. That's a direct, explicit demonstration of the vastly superior speed of compiled languages for intensive numerical calculations.

Likewise, if those packages don't already do what you need to do, you're back to either accepting a huge performance penalty by using native Python, or writing and compiling your own C libraries. Just as I said above.

Alternatively, if what you mean by "using interpreted languages the right way" is "avoid doing anything computationally intensive directly with them, and instead rely on compiled libraries for that," then you are not disagreeing with me in the slightest.

I'm only running that much slower in cases where I have to rely on the interpreted languages exclusively. In cases where I am able to replace all of the intensive stuff with compiled C, as you recommend, there is little issue. This being why I recommended exactly such a compromise solution in the post you are responding to.

Sure, supposing those libraries already do what you need done. If not, you're going to have to either eat a huge runtime penalty, or go ahead and write the C code and compile your own library. Either way, you're only getting the speed because somebody took the time to write the code in optimized C and compile it.

I agree that C can present some extra difficulties there, but have found that thoughtful architecture of the general framework up-front tends to avoid most of that.

That said, I have frequently worked in the past by doing initial first-pass prototyping in an interpreted language, and then porting it to C once the basic shape of it was in place. That way I could do the really runtime-intensive parts of the development and maintenance at a reasonable clip.

These days, though, I'm stuck entirely in the interpreted languages and so stuck with nasty runtime penalties.

Well, you could try addressing the reasons I actually gave before complaining that I didn't give any.

The point is that your program is, ultimately, going to end up using data types. And so you might as well keep tabs on that, and ensure that the types you end up with are appropriate.

Perhaps this is my background in embedded targets showing through, though. If you aren't trying to run in real-time on a resource-limited, power-limited platform, you might not care much.

I'm not sure what you mean there - you're saying that typedef is a way to not think about typing? Certainly, it is useful for encapsulating platform type dependencies and such from the whole of the source code, but it is nevertheless all about thinking carefully about typing and having fine control over such.

To be clear, I'm not talking about the source code being bloated. I'm talking about ending up with target code that uses data types which are excessive for their purposes. That does not happen when you think about and specify types as you go. It tends to happen all of the time when you don't.

Well, again, my whole career is in the embedded target world, which indeed probably looks a lot like the big-iron world of 13-ish years ago.

22. ### Chipz | Banned | Banned

Messages: 838
With those clarifications, we actually are in agreement.

Though I will add a few things.

1. I am shocked they are doing machine learning in Matlab. Perhaps you're working in either Bayesian methods or clustering? It would seem difficult to define the structures necessary for a neural network or an inductive logic model. The latter I have little experience with btw.
2. I considered the libraries a /part/ of the language. A lot of physicists I meet have come to prefer Python applications to Matlab or C++ or even R. I assume it's because, given the (nearly) standard libraries available, they're still fast.

I btw am not an RTOS programmer.

23. ### quadraphonics | Bloodthirsty Barbarian | Valued Senior Member

Messages: 9,391
Nah, there's whole off-the-shelf Matlab toolboxes for all that stuff, some of which have been around since the 90's.

That said, Matlab is relatively flexible (if not efficient) in data structure/object terms. However, I'd have to say that the Matlab crowd generally is more EE types who think in procedural terms, rather than CS types who think in more data structure terms. While lots of the official Matlab toolboxes (written by CS guys) use object-oriented approaches, I've found that your average Matlab programmer is happy enough with variables and arrays and the occasional struct and leaving the higher-level data relationships implicit in the code. Probably because the end product is typically C code, possibly for an embedded target.

That's a fair enough definition. I think the issue comes down to whether one wants to do machine learning stuff with off-the-shelf libraries, or roll one's own. In the former case, indeed, the considerations about execution time do not really apply (at least to languages like Python that have ample, fast libraries available). But if you plan to code up your own machine learning algorithms, then you have to consider the trade-offs between (say) native Python and a compiled C library/program.

I definitely agree that languages like Python are much more elegant and less headache for the top-level i/o and control logic stuff. Indeed, I could very well see my way to a general recommendation of "use C heavily - but only as libraries."

Yeah, I'd say that the main things Matlab has going for it are the IDE, enterprise support, wide range of toolboxes, and other ancillary tools (Simulink, automatic C code generation, etc.). If you don't need that stuff, it becomes a hard case to make.