From ASA to CASA, what does the "C" stand for anyway?

CASA workshop @ DAFx

Wed, 6 July 2011 - 11:03

lagrange

Satellite workshop of the DAFx conference, organized by Mathieu Lagrange and Luis Gustavo Martins at Ircam, Paris.

September 23, 2011 (afternoon)

Introduction

Question Panel

For the videos of the presentations, please follow the links below.

Transcript of the main questions raised during the panel (thanks to Mathias Rossignol)

Q - (mainly as a follow-up to Jon's talk, so concerning chiefly speech) in the real world humans operate based on expectations; they use a model and take context into account. In automatic processing, how can models be introduced? How much prior? How to find a good balance?

Jon - this is why I backed off from having "understanding" in the title of my talk. Dealing with semantics and cognition is a distinct and tricky matter. It should probably be added on top of our system, but we're avoiding it at the moment.

What we do is try to separate, on one side, what depends on the environment/context (including cognitive context) and, on the other, what doesn't, and see what's the best we can do with the second part.

Boris - context information is something we really need in real-life applications. Temporal context -- what happens right before and after a sound -- is potentially very useful for disambiguation.

Jon - one thing that I can see being introduced realistically is the notion of a "domain" of background noise; for example, we know we're in a house, so we expect household noises in the background.

Josh - in texture recognition, context is necessary: rain, for example, is very similar to applause. But we don't really understand yet how that works.

Q - Mathieu, in your introductory talk you made an accusation that people in connected fields use ideas from CASA but never get into it in depth. For me, as an MIR person, what would be a good way to be more "serious" about CASA?

Mathieu - CASA is often seen as a "dangerous game": the goal is somewhat ill-defined, and it's very hard to evaluate. But even for me, who claim to be centrally interested in it, it's not perfectly clear, and I would like to take the opportunity to pass the question on to our experts here:

Sub-Q - if CASA can be defined as a goal in itself, how can we evaluate it?

Josh - It's hard to answer this question directly; there is obviously a cultural problem here: the technological community is focused on performance improvement, but there is also a need for fundamental research, to which CASA belongs.

Sub-Q - so are we not yet at the point where we can harness ASA?

Josh - no, that's not it, I wouldn't be that categorical. But if you're going to work on it, it's got to be long term.

Jon - there are some tasks where CASA is fundamental, especially when you need to mimic human behaviour, including human weaknesses. For example in robotics, a robot that communicates like you and has the same shortcomings is better because it more easily triggers a feeling of empathy.

Q - (Josh -> Jon) In your system you gather fragments into sources, so the assumption is that the grouping cues used to form fragments are always right; you don't call your fragments into question once they're formed. How could we take fallibility into account?

Jon - we use probabilistic grouping rules to form sources, but that's possible because we have a model to guide us. Having probabilistic fragments raises a problem of efficiency. And if you don't have a model, then you're stuck.

Luis - for music, could stream formation be the guiding rule? It's hard to make a model for each instrument!

Mathieu - worse than that, you'd need a model for each instrument *and* playing technique! To go back to Jon's work, that's what I appreciate about it: the way it manages to combine, in some way, a bottom-up and a top-down approach.

Concerning speech and music, I'd like to mention the work of Barbara Tillmann, in Lyon: it seems that every result on the perception of speech has a counterpart in music perception. So probably there are, if not dedicated brain areas, at least similar mechanisms at play.

Q (audience) - what about detecting genre on an excerpt and applying corresponding priors to analyze the whole?

Summarized answer - why not, but that raises the question of what tools you use to detect the genre; a somewhat "chicken and egg" problem.

Q - what about special attention phenomena, such as recognizing when your name is spoken, even in noise or in the background?

Jon - that's a typical keyword spotting task; so we can imagine having a "name spotting process" continually running in the background.

The interesting thing is that this suggests there must be, contrary to our assumptions, at least some processing of the "audio background" going on.

---

Josh (spontaneous remark, not question-related) - at the moment, research in psycho-acoustics is unfortunately very detached from machine listening, in part because psycho-acoustics deals a lot with artificial signals now, whereas machine listening is interested in real world sounds.

It seems to me essential now to try and understand how people listen to real world sounds. There are real insights to be gained there.

Mathieu - maybe music could be a good example? It's a sound organization system.

Josh - but it's engineered (by the composer) to make the listener able to perform good scene analysis. So it's a special case, dangerous for generalization.

Trevor - not always, though: an orchestra can sometimes "try" to sound like a single instrument.

Public - modern music increasingly uses sounds without any physical counterpart, which makes it an interesting challenge: it means you have to work on purely perceptual cues.

Public - CASA is hard for music notably because of synchronized onsets, etc., and that's also why MIR people are "scared" of CASA.

Jon - Josh's remark is also true of speech: it's also made to be listened to.

Luis - CASA is not necessarily source separation, but more the separation/identification of perceptual sound objects.

Public (Mathias) - one difference between speech and music, though, is that in speech you'll more commonly have active listening -- moving your head, asking the speaker to repeat.

Public - The problem of physically active listening is indeed important, but the notion of intellectually/perceptually active listening must be kept in mind too.

Initial Call

Auditory Scene Analysis (ASA) is the process by which the human auditory system organizes sound into perceptually meaningful elements. Inspired by the seminal work of Al Bregman (1990) and other researchers in perception and cognition, early computational systems were built by engineers or computer scientists such as David Mellinger (1991) or Dan Ellis (1996).

Strictly speaking, a CASA or “machine listening” system is a computational system whose general architecture or key component design is motivated by facts taken from ASA. However, since ASA is a Gestaltist theory that focuses on describing rather than explaining the phenomena it studies, computational enthusiasts are left with a largely open field of investigation.

Perhaps this lack of definition does not fit the way we do research nowadays, since papers strictly tackling this issue are relatively scarce. Yet informal discussions with experts in sound and musical audio processing confirm that making sense of strongly polyphonic signals is a fundamental problem that is interesting from both the methodological and the application points of view. Consequently, we (the organisers of this workshop) believe that there are fundamental questions that need to be raised and discussed in order to pave the way for research in this field.

Among others, those questions are:

From ASA to CASA: only insights?
- Is the knowledge transfer from ASA to CASA only qualitative?
- Are there other approaches in scientific fields such as biology, cognition, etc. that are also potentially meaningful for building powerful computational systems?

What is CASA?
- Is CASA a goal in itself?
- Can it be decomposed into well-defined tasks?

Is CASA worth pursuing?
- What are the major obstacles in contemporary CASA?
- How does it relate to other sound processing areas such as Blind Source Separation (BSS) or Music Information Retrieval (MIR)?

This workshop aims at bringing to the audience some background and new topics on ASA and CASA. Questions such as the ones cited above will then be raised and discussed with the help of the invited speakers.

We are delighted to have 4 confirmed invited speakers (in order of appearance):

Trevor Agus (ENS)
Josh McDermott (NYU)
Jon Barker (Sheffield Univ.)
Boris Defreville (Orelia)

The workshop will take place at Ircam, as a satellite event of the DAFx conference, on Friday Sept. 23. The tentative schedule is the following:

14h00: Welcome talk (Mathieu Lagrange)
14h30: Auditory Scene Analysis

Trevor Agus (Perceptual learning of novel sounds)

Josh McDermott (Sound texture perception via statistics of the auditory periphery)

15h30: Coffee Break
16h00: Machine Listening

Jon Barker (Probabilistic frameworks for Scene understanding)

Boris Defreville (Machine listening in everyday life)

17h00: Questions Panel
18h00: End of the workshop

** Please follow the orange links for more information about the talks. **

We would like as much as possible to "fuel" the discussion with questions or comments from the community. To do so, please ask questions or give comments on this page.

Acknowledgments:

This work has been partially supported by

ANR, the French funding agency, within the scope of the HOULE project
“Fundação para a Ciência e Tecnologia” and by the Portuguese Government, within the scope of the project "A Computational Framework for Sound Segregation in Music Signals", with reference PTDC/EIA-CCO/111050/2009.

Attached file	Size
lagrangeCasaWs11.pdf	1.08 MB