IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 2, FEBRUARY 2007

Auditory Segmentation Based on Onset and Offset Analysis

Guoning Hu and DeLiang Wang, Fellow, IEEE

Abstract—A typical auditory scene in a natural environment contains multiple sources. Auditory scene analysis (ASA) is the process in which the auditory system segregates a scene into streams corresponding to different sources. Segmentation is a major stage of ASA by which an auditory scene is decomposed into segments, each containing signal mainly from one source. We propose a system for auditory segmentation that analyzes onsets and offsets of auditory events. The proposed system first detects onsets and offsets, and then generates segments by matching corresponding onset and offset fronts. This is achieved through a multiscale approach. A quantitative measure is suggested for segmentation evaluation. Systematic evaluation shows that most of the target speech, including unvoiced speech, is correctly segmented, and that target speech and interference are well separated into different segments.

Index Terms—Auditory segmentation, event detection, multiscale analysis, onset and offset.

I. INTRODUCTION

In a natural environment, multiple sounds from different sources form a typical auditory scene. An effective system that segregates target speech in a complex acoustic environment is required for many applications, such as robust speech recognition in noise and hearing aid design. In these applications, a monaural (one-microphone) solution to speech segregation is often desirable. Many techniques have been developed to enhance speech monaurally, such as spectral subtraction [20] and hidden Markov models [30]. Such techniques tend to assume a priori knowledge or statistical properties of interference, and these assumptions are often too strong in realistic situations. Other approaches, including sinusoidal modeling [21] and comb filtering [11], attempt to extract speech by exploiting the harmonicity of voiced speech; obviously, such approaches cannot handle unvoiced speech. Monaural speech segregation remains a very challenging task.

On the other hand, the auditory system shows a remarkable capacity for monaural segregation of sound sources. This perceptual process is referred to as auditory scene analysis (ASA) [4]. According to Bregman, ASA takes place in the brain in two stages: the first stage decomposes an auditory scene into segments (or sensory elements), and the second stage groups segments into streams.

Manuscript received September 9, 2005; revised January 19, 2006. This work was supported in part by the Air Force Office of Scientific Research under Grant FA-0117, by the Air Force Research Laboratory under Grant FA-0093, and by the National Science Foundation under Grant IIS0081058. An earlier version of this paper was presented at the 2004 ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hong-Goo Kang. G. Hu is with the Biophysics Program, The Ohio State University, Columbus, OH 43210 USA (e-mail: hu.117@osu.edu). D. Wang is with the Department of Computer Science and Engineering and Center for Cognitive Science, The Ohio State University, Columbus, OH 43210 USA (e-mail: dwang@cse.ohio-state.edu). Digital Object Identifier 10.1109/TASL.
Considerable research has been carried out to develop computational auditory scene analysis (CASA) systems for sound separation, and such systems have obtained success in separating voiced speech [5], [8], [15], [18], [33], [34] (see [6], [12] for recent reviews). A typical CASA system decomposes an auditory scene into a matrix of time-frequency (T-F) units via bandpass filtering and time windowing. Then, the system separates sounds from different sources in two stages: segmentation and grouping. In segmentation, neighboring T-F units responding to the same source are merged into segments. In grouping, segments likely belonging to the same source are grouped together. We should clarify that the term segmentation in CASA has a different meaning from speech segmentation in speech processing, which refers to identifying temporal boundaries between speech units (e.g., phonemes or syllables) of clean speech. Auditory segmentation here occurs on a two-dimensional (2-D) time-frequency representation of the input scene. In addition, the scene as a rule contains multiple sound sources. Segmentation in CASA has a similar meaning to segmentation in visual analysis (more discussion below).

In addition to the conceptual importance of segmentation for ASA, a segment, as a region of T-F units, contains global information of the source that is missing from individual T-F units, such as spectral and temporal envelope. This information could be key for distinguishing sounds from different sources. As shown in [18], grouping segments instead of individual T-F units is more robust for segregating voiced speech. A recent model of robust automatic speech recognition operates directly on auditory segments [2]. In our view, effective segmentation provides a foundation for grouping and is essential for successful CASA.

Previous CASA systems generally form segments according to two assumptions [5], [8], [18], [33]. First, signal from the same source likely generates responses with similar temporal or periodic structure in neighboring auditory filters. Second, signals with good continuity in time likely originate from the same source. The first assumption works well for harmonic sounds, but not for noise-like signals, such as unvoiced speech. The second assumption is problematic when target and interference overlap significantly in time.

From a computational standpoint, auditory segmentation corresponds to image segmentation, which has been extensively studied in computer vision. In image segmentation, the main task is to find bounding contours of visual objects. These contours usually correspond to sudden changes of certain local image properties, such as luminance and color. In auditory segmentation, the corresponding task is to find onsets and offsets of individual auditory events, which correspond to sudden changes of acoustic energy.

In this paper, we propose a system for auditory segmentation based on onset and offset analysis of auditory events. Onsets and offsets are important ASA cues [4] because different sound sources in an environment seldom start and end at the same time. In addition, there is strong evidence for onset detection by auditory neurons [27]. There are several advantages to applying onset and offset analysis to auditory segmentation.
In the time domain, onsets and offsets form boundaries between sounds from different sources. Common onsets and offsets also provide natural cues for integrating sounds from the same source across frequency. In addition, since onset and offset cues are common to all types of sounds, the proposed system can in principle deal with both voiced and unvoiced speech. Specifically, we apply a multiscale analysis, motivated by the scale-space theory widely used in image segmentation [29], to onset and offset analysis for auditory segmentation. The advantage of a multiscale analysis is that it provides different levels of detail for an auditory scene, so that one can detect and localize auditory events at appropriate scales. Our multiscale segmentation takes place in three stages. First, an auditory scene is smoothed to different degrees (scales). Second, the system detects onsets and offsets at certain scales and forms segments by matching individual onset and offset fronts. Third, the system generates a final set of segments by integrating the analysis at different scales. Scale-space analysis for speech segmentation in the speech-processing sense (see the earlier discussion) has been studied before [24].

This paper is organized as follows. In Section II, we propose a working definition of an auditory event to clarify the computational goal of segmentation. In Section III, we first give a brief description of the system and then present the details of each stage. We propose a quantitative measure to evaluate the performance of auditory segmentation in Section IV. The results of the system on segmenting target speech in noise are reported in Section V. The paper concludes with a discussion in Section VI.

II. WHAT IS AN AUDITORY EVENT?

Because at any time countless acoustic events take place simultaneously in the world, one must limit the focus of CASA to the acoustic environment relevant to a listener; in other words, only events audible to a listener should be considered. To determine the audibility of a sound, two perceptual effects need to be considered. First, a sound must be audible on its own, i.e., its intensity must exceed a certain level, referred to as the absolute threshold in a frequency band [25]. Second, when there are multiple sounds in the same environment, a weaker sound tends to be masked by a stronger one [25]. Hence, we consider a sound to be audible in a local T-F region if it satisfies the following two criteria.
• Its intensity is above the absolute threshold.
• Its intensity is higher than the summed intensity of all other signals in that region.
The absolute threshold of a sound depends on frequency and differs among listeners [25]. For young adults with normal hearing, the absolute threshold is about 15 dB sound pressure level (SPL) within the frequency range of 300 Hz to 10 kHz [22]. Therefore, we take 15 dB SPL as a constant absolute threshold for the sake of simplicity.

Based on the above criteria, we define an auditory event as the collection of all the audible T-F regions of an acoustic event. Thus, the computational goal of auditory segmentation is to generate segments for contiguous T-F regions from the same auditory event. This goal is consistent with the ASA principle of exclusive allocation, that is, a T-F region should be attributed to only one event [4]. We note that the exclusive allocation principle seems to contradict the fact that acoustic signals tend to add linearly (see, e.g., [1]).
Besides the aforementioned auditory masking phenomenon, there is considerable evidence supporting this principle from both human speech intelligibility [7], [28] and automatic speech recognition [9], [28] studies (for an extensive discussion, see [32]).

Making this goal of auditory segmentation concrete requires a T-F representation of an acoustic input. Here, we employ a cochleagram representation of an acoustic signal, which refers to analyzing the signal in frequency by cochlear filtering (e.g., by a gammatone filterbank) followed by some form of nonlinear rectification corresponding to hair cell transduction, and in time through some form of windowing [23]. Specifically, we use a filterbank with 128 gammatone filters centered from 50 Hz to 8 kHz [26], and decompose filter responses into consecutive 20-ms windows with 10-ms window shifts [18], [33]. Fig. 1(a) shows such a cochleagram for a mixture of a target female utterance and crowd noise with music, with an overall signal-to-noise ratio (SNR) of 0 dB. Here, the nonlinear rectification is simply the response energy within each T-F unit.

With this T-F representation, we obtain the ideal segments of an event in an acoustic mixture as follows. First, we mark the audible T-F units of the event according to the premixing target and interference. Then we merge all marked units into spatially contiguous regions; each region then corresponds to a segment. Fig. 1(b) shows the resulting bounding contours (black line) of the target segments in the mixture. Gray regions form the background corresponding to the entire interference. Because the passbands of gammatone filters are relatively wide, particularly in the high-frequency range, adjacent harmonics may activate a number of adjacent filters. As a result, an ideal segment can combine several harmonics, as shown in Fig. 1(b).

Fig. 1. Sound mixture and its ideal speech segments. (a) Cochleagram of a female utterance, "That noise problem grows more annoying each day," mixed with a crowd noise with music. (b) Bounding contours (black line) of the ideal segments of the utterance. The total number of ideal segments is 96.

As a working definition, we consider a phoneme, a basic phonetic unit of speech, as an acoustic event. There are two issues in treating individual phonemes as events. First, two types of phonemes, stops and affricates, have clear boundaries between a closure and a subsequent release in the middle of the phoneme. Therefore, we treat a closure in a stop or an affricate as an event on its own. This way, the acoustic signal within each event is generally stable. The second issue is that neighboring phonemes can be coarticulated. As a result, coarticulation may lead to unnatural boundaries between some consecutive ideal segments. These ideal segments may be put together by a real segmentation system, creating a case of under-segmentation. Alternatively, one may define a syllable, a word, or even a whole utterance from the same speaker as an acoustic event. However, under such a definition, many valid acoustic boundaries between phonemes are not taken into account. Consequently, some ideal segments are likely to be divided by a segmentation system into smaller segments, creating a case of over-segmentation. We will come back to this issue in the evaluation and discussion sections.
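To make the notion of ideal segments more tangible, the sketch below marks audible T-F units from the premixed target and interference and merges them into contiguous regions. It assumes per-unit energies in dB SPL produced by some cochleagram front end (not shown); the function names and the use of 4-connectivity are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import label

def ideal_target_mask(target_db, interference_db, abs_threshold_db=15.0):
    """Mark the T-F units in which the premixed target is audible.

    Both inputs are (channels x frames) arrays of per-unit energy in dB SPL,
    computed from the target and the total interference before mixing. A unit
    is audible if (i) the target exceeds the 15-dB SPL absolute threshold and
    (ii) the target is stronger than everything else in that unit.
    """
    return (target_db > abs_threshold_db) & (target_db > interference_db)

def ideal_segments(audible_mask):
    """Merge audible units into spatially contiguous regions (ideal segments).

    4-connectivity over the (channel, frame) grid is assumed here; each
    labeled region corresponds to one ideal segment.
    """
    labels, num_segments = label(audible_mask)
    return labels, num_segments
```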
III. SYSTEM DESCRIPTION

Our system estimates the ideal segments of auditory events via an analysis of signal onsets and offsets. Onsets and offsets, corresponding to sudden intensity changes, tend to delineate auditory events. In addition, the onset and offset times of a segment, which is a part of an event, usually vary smoothly across frequency. Such smooth variation is partly due to the fact that certain speech events, such as stops and fricatives, exhibit smoothly varying onset and offset boundaries in certain ranges of frequency. Also, the passbands of neighboring frequency channels overlap significantly, and temporal alignment is an effective cue for grouping neighboring frequency channels. As shown in Fig. 1(b), even with strong interference, the boundaries of most segments are reasonably smooth across frequency.

Fig. 2 gives the diagram of our system. An acoustic mixture is first normalized so that its average intensity is 60 dB SPL. Then it is passed through a bank of gammatone filters [26] (see Section II). To extract its temporal envelope, the output from each filter channel is half-wave rectified, low-pass filtered (a filter with a 74.5-ms Kaiser window and a transition band from 30 to 60 Hz), and downsampled to 400 Hz. The temporal envelope, indicating the intensity of a filter output, is used for onset and offset analysis. Note that, unlike the cochleagram representation, we do not divide the temporal envelope into consecutive frames in this analysis.

Fig. 2. Diagram of the system. Note that the scale increases from bottom to top.

Onsets and offsets correspond to the peaks and valleys of the time derivative of the intensity. However, because of the intensity fluctuations within individual events, many peaks and valleys of the derivative do not correspond to real onsets and offsets. Therefore, in the smoothing stage, the intensity is smoothed over time to reduce these fluctuations. The system further smoothes the intensity over frequency to enhance the alignment of onsets and offsets. The degree of smoothing is called the scale: the larger the scale, the smoother the intensity becomes. In the stage of onset/offset detection and matching, the system detects onsets and offsets in each filter channel and merges detected onsets and offsets into onset and offset fronts if they occur at close times. It then matches individual onset and offset fronts to form segments.

As a result of smoothing, event onsets and offsets of small T-F regions may be blurred at a larger (coarser) scale. Consequently, the system may miss small events or generate segments combining different events, a case of under-segmentation. On the other hand, at a smaller (finer) scale, the system may be sensitive to insignificant intensity fluctuations within individual events. Consequently, the system tends to separate a continuous event into several segments, a case of over-segmentation. Therefore, it is difficult to obtain satisfactory segmentation with a single scale. Our system handles this issue by integrating onset/offset information across different scales in an orderly manner in the stage of multiscale integration, which yields the final set of segments. The last three stages are described in detail below.
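As a concrete illustration of the envelope front end just described, the sketch below half-wave rectifies one gammatone-filter output, low-pass filters it with a Kaiser-window FIR filter, and downsamples to 400 Hz. The sampling rate, the cutoff placed in the middle of the 30-60 Hz transition band, and the zero-phase filtering are assumptions made for readability, not details taken from the paper.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, resample_poly

def temporal_envelope(channel_output, fs=16000, env_fs=400):
    """Extract the temporal envelope of one gammatone-filter output.

    Half-wave rectification, then an FIR low-pass filter built from a
    ~74.5-ms Kaiser window with a 30-60 Hz transition band (cutoff placed at
    45 Hz, the middle of that band), then downsampling to 400 Hz.
    """
    rectified = np.maximum(channel_output, 0.0)
    numtaps = int(round(0.0745 * fs)) | 1                       # ~74.5-ms window, odd length
    lowpass = firwin(numtaps, cutoff=45.0, width=30.0, fs=fs)   # Kaiser design from the width
    smoothed = filtfilt(lowpass, [1.0], rectified)              # zero-phase, a simplification
    return resample_poly(smoothed, up=env_fs, down=fs)
```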
A. Smoothing

Smoothing corresponds to low-pass filtering. Our system first smoothes the intensity over time with a low-pass filter and then smoothes the intensity over frequency with a Gaussian kernel. Let $v(c, t)$ denote the initial intensity—the logarithmic temporal envelope—at time $t$ in filter channel $c$. We have

$v_T(c, t, \theta_t) = v(c, t) * h(t, \theta_t)$  (1)

$v(c, t, \theta_c, \theta_t) = v_T(c, t, \theta_t) * g(c, \theta_c)$  (2)

where $h(\cdot, \theta_t)$ is a low-pass filter over time whose passband in hertz corresponds to the time scale $\theta_t$, $g(\cdot, \theta_c)$ is a Gaussian function with zero mean and standard deviation $\theta_c$ applied across channels, and "$*$" denotes convolution. The parameter pair $(\theta_c, \theta_t)$ indicates the degree of smoothing: the larger the scale is, the smoother the intensity becomes. We refer to $(\theta_c, \theta_t)$ as the (2-D) scale, and the smoothed intensities at different scales form the so-called scale space [29]. Here we apply low-pass filtering instead of generic diffusion [29] for smoothing over time because this way it is more intuitive to decide the appropriate scales for segmentation according to the acoustic and perceptual properties of the target we are interested in (see Section III-C). In an earlier study, we applied anisotropic diffusion and obtained similar results [19].

Fig. 3. Smoothed intensity values at different scales. (a) Initial intensity for all the channels. (b) Smoothed intensity at the scale (1/2, 1/14). (c) Smoothed intensity at the scale (6, 1/14). (d) Smoothed intensity at the scale (6, 1/4). (e) Initial intensity in a channel centered at 560 Hz. (f) Smoothed intensity in the channel at the scale (1/2, 1/14). (g) Smoothed intensity in the channel at the scale (6, 1/14). (h) Smoothed intensity in the channel at the scale (6, 1/4). The input is the same as shown in Fig. 1(a).

As an example, Fig. 3 shows the initial and smoothed intensities for the input mixture shown in Fig. 1(a). Fig. 3(a) shows the initial intensity. The smoothed intensities at three scales, (1/2, 1/14), (6, 1/14), and (6, 1/4), are shown in Fig. 3(b)-(d), respectively. To display more detail, Fig. 3(e)-(h) shows the initial intensity and the smoothed intensities at these three scales in a single frequency channel centered at 560 Hz (see Section III-C for the implementation details of the low-pass filter). As shown in the figure, the smoothing process gradually reduces the intensity fluctuations. Local details of onsets and offsets also become blurred, but the major intensity changes corresponding to onsets and offsets are preserved.
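A minimal sketch of the two smoothing steps in (1) and (2), assuming the intensity is stored as a (channels x time) array sampled at 400 Hz, that the time scale maps to a passband edge of $1/\theta_t$ Hz, and that the frequency scale is the Gaussian standard deviation measured in channels. The Kaiser design from the 10-Hz transition width and the zero-phase filtering are simplifying assumptions.

```python
import numpy as np
from scipy.signal import firwin, filtfilt
from scipy.ndimage import gaussian_filter1d

def smooth_intensity(intensity, freq_scale, time_scale, env_fs=400):
    """Smooth the log envelope at the scale (freq_scale, time_scale).

    Step (1): low-pass filter each channel over time with a ~182.5-ms Kaiser
    window and a 10-Hz transition band; the passband edge is taken to be
    1/time_scale Hz (e.g., time_scale = 1/4 gives a 4-Hz passband).
    Step (2): convolve across channels with a zero-mean Gaussian whose
    standard deviation is freq_scale channels.
    """
    numtaps = int(round(0.1825 * env_fs)) | 1
    b = firwin(numtaps, cutoff=1.0 / time_scale, width=10.0, fs=env_fs)
    over_time = filtfilt(b, [1.0], intensity, axis=1)               # (1) smooth over time
    return gaussian_filter1d(over_time, sigma=freq_scale, axis=0)   # (2) smooth over frequency
```

At the three scales reported below, this would be called with (6, 1/4), (6, 1/14), and (1/2, 1/14).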
B. Onset/Offset Detection and Matching

At a given scale $(\theta_c, \theta_t)$, onset and offset candidates are detected by marking the peaks and valleys of the time derivative of the smoothed intensity,

$\partial v(c, t, \theta_c, \theta_t) / \partial t$.  (3)

An onset candidate is removed if the corresponding peak is smaller than a threshold, which suggests that the candidate is likely an insignificant intensity fluctuation. Since the peaks corresponding to true onsets are usually significantly higher than other peaks, we set this threshold from the mean and standard deviation of all the derivative values. We have also tested several alternative threshold settings, but the performance is not as good.

Then, in each filter channel, the system determines the offset time for each onset candidate. The offset time of an onset candidate is chosen among the offset candidates located between that onset candidate and the next one in the channel. The decision is simple if there is only one offset candidate in this range. When there are multiple offset candidates, we choose the one with the largest intensity decrease, i.e., the most negative time derivative. Note that there is at least one offset candidate between two onset candidates, since there is at least one local minimum between two local maxima.

Since frequency components with close onset or offset times likely arise from the same source, our system connects common onsets and offsets into onset and offset fronts. There are usually some onset time shifts between adjacent channels in response to the same event. This is because the onset times of the components of an acoustic event may vary across frequency. Masking by interference may further shift detected onset and offset times. Also, each gammatone filter introduces a small, frequency-dependent delay in its response. Based on these considerations, we allow a tolerance interval when connecting onset/offset candidates in neighboring frequency channels. Specifically, we connect an onset candidate with the closest onset candidate in an adjacent channel if their distance in time is less than a threshold; the same applies to offset candidates. This threshold should not be too small; otherwise, onsets (or offsets) from the same event will be prevented from joining together. On the other hand, a threshold that is too large will connect some onsets from different events together. As found in [10], [31], human listeners start to segregate two sounds when their onset times differ by 20-30 ms. Therefore, we choose 20 ms as the threshold. If an onset front thus formed occupies fewer than three channels, we do not process it further because such a front is likely insignificant. Onset and offset fronts are vertical contours across frequency in a cochleagram.

The next step is to match individual onset and offset fronts to form segments. Consider an onset front occupying a set of consecutive channels, together with the corresponding offset times determined as described earlier. The system first selects all the offset fronts that cross at least one of these offset times. Among them, the one that crosses the most of these offset times is chosen as the matching offset front, and all the channels of the onset front occupied by the matching offset front are labeled as matched. The offset times in these matched channels are updated to those of the matching offset front. If all the channels of the onset front are labeled as matched, the matching procedure is finished. Otherwise, the process repeats for the remaining unmatched channels. In the end, the T-F region between the onset front and the updated offset times yields a segment.

In the aforementioned segmentation, we assume that onset candidates in adjacent channels correspond to the same event if they are sufficiently close in time. This assumption may not always hold. To reduce the error of merging different sounds with similar onsets, we further require the corresponding temporal envelopes to be similar, since sounds from the same source usually produce similar temporal envelopes. More specifically, for an onset candidate, consider the closest onset candidate in an adjacent channel and the overlapping duration $T$ between the onset-offset intervals of the two candidates. The similarity between the temporal envelopes of the two channels over this duration is measured by their correlation (see [33]),

$C(c, c') = \frac{1}{|T|} \sum_{t \in T} \hat{r}(c, t)\, \hat{r}(c', t)$  (4)

where $\hat{r}(c, t)$ denotes the temporal envelope of channel $c$ normalized to zero mean and unit variance within $T$. Then, in forming onset fronts, we further require the temporal envelope correlation to be higher than a threshold $\theta_C$. By including this requirement, our system reduces the errors of accidentally merging sounds from different sources into one segment.
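A hedged sketch of the per-channel detection step in (3) and of the envelope correlation in (4) follows. The exact combination of mean and standard deviation used for the onset threshold is not reproduced here; mean plus one standard deviation is an assumption, as is the use of a simple numerical gradient for the time derivative.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_onsets_offsets(smoothed_channel, env_fs=400):
    """Return onset and offset candidate indices for one smoothed channel.

    Onsets are peaks and offsets are valleys of the time derivative (3);
    onset peaks below a statistic of the derivative are discarded
    (mean + one standard deviation is assumed here).
    """
    d = np.gradient(smoothed_channel) * env_fs      # derivative in dB per second
    onset_threshold = d.mean() + d.std()            # assumed threshold form
    onsets, _ = find_peaks(d, height=onset_threshold)
    offsets, _ = find_peaks(-d)                     # valleys of the derivative
    return onsets, offsets

def envelope_correlation(env_a, env_b):
    """Correlation (4) of two temporal envelopes over their overlapping span,
    each normalized to zero mean and unit variance."""
    a = (env_a - env_a.mean()) / (env_a.std() + 1e-12)
    b = (env_b - env_b.mean()) / (env_b.std() + 1e-12)
    return float(np.mean(a * b))
```

Candidate onsets would then be paired with the offsets that follow them and linked across channels only when they fall within the 20-ms tolerance and the correlation exceeds the threshold described above.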
C. Multiscale Integration

Our system integrates the analysis at different scales to form segments. It starts at a coarse scale, generating segments as described in Section III-B. Then, at a finer (smaller) scale, it locates more accurate onset and offset positions for these segments, and new segments can be created within the current background. Segments are also expanded along the onset and offset fronts formed at the current scale, as follows. Consider the onset and offset times of a segment occupying a set of consecutive channels. Note that lower-frequency channels are at lower positions in our cochleagram representation [see Fig. 1(a)]. The expansion considers the onset front at the current scale that crosses the onset time in the segment's highest channel and the offset front that crosses the corresponding offset time. If both of these fronts extend beyond the segment, i.e., occupy channels above the segment (channels with higher center frequencies), the segment expands to include the channels that are crossed by both the onset and the offset fronts. Similarly, the expansion considers the channels below the segment, i.e., the channels with lower center frequencies. At the end of expansion, segments with the same onset times in at least one channel are merged.

One could also start from a fine scale and then move to coarser scales. However, in this case, the chances of over-segmenting an input mixture are much higher, which is less desirable than under-segmentation, since larger segments are preferred in subsequent grouping (see Section IV).

In this study, we are interested in estimating T-F segments of speech. Since temporal envelope variations down to 4 Hz are essential for speech intelligibility [13], [14], the system starts segmentation at the time scale of 1/4. In addition, the system starts at the frequency scale of 6. We have also considered other starting scales; in both cases, the system performed slightly worse. In the results reported here, the system forms segments at three scales from coarse to fine: (6, 1/4), (6, 1/14), and (1/2, 1/14). At the finest scale, i.e., (1/2, 1/14), the system does not form new segments, since such segments tend to occupy insignificant T-F regions. The correlation threshold $\theta_C$ is 0.95, 0.95, and 0.85 at these scales, respectively; a larger $\theta_C$ is used at the first two scales because smoothing over frequency increases the similarity of temporal envelopes in adjacent channels. At each scale, a low-pass filter with a 182.5-ms Kaiser window and a 10-Hz transition band is applied for smoothing over time. Note that the passband of the filter corresponds to the time scale. We have also considered segmentation using more scales and with different types and parameters of the low-pass filter, and obtained similar results.

Fig. 4 shows the bounding contours of segments at different scales for the mixture in Fig. 1(a): Fig. 4(a) shows the segments formed at the starting scale (6, 1/4), and Fig. 4(b) and (c) show those from the multiscale integration of two and three scales, respectively. The background is represented by gray. Compared with the ideal segments in Fig. 1(b), the system captures a majority of the speech events at the largest scale, but misses some small segments. As the system integrates the analysis at smaller scales, more speech events are captured; at the same time, more segments from interference also appear. Note that the system does not specify the sound source for each segment; that is the task of grouping, which is not addressed here.

Fig. 4. Bounding contours of estimated segments from multiscale analysis. (a) One-scale analysis at the scale of (6, 1/4). (b) Two-scale analysis at the scales of (6, 1/4) and (6, 1/14). (c) Three-scale analysis at the scales of (6, 1/4), (6, 1/14), and (1/2, 1/14). The input is the same as shown in Fig. 1(a). The background is represented by gray.
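The coarse-to-fine schedule can be summarized with the short skeleton below. The per-scale work (front formation, refinement, and expansion) is abstracted into a single hypothetical callable, so this is only a structural sketch of the integration order and of the scale and threshold schedule reported above, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Scales (frequency scale, time scale) and per-scale envelope-correlation
# thresholds reported in Section III-C, ordered from coarse to fine.
SCALES: List[Tuple[float, float]] = [(6, 1 / 4), (6, 1 / 14), (1 / 2, 1 / 14)]
CORR_THRESHOLDS: List[float] = [0.95, 0.95, 0.85]

def integrate_scales(segment_at_scale: Callable[[Tuple[float, float], float, list, bool], list]) -> list:
    """Run segmentation from the coarsest to the finest scale.

    segment_at_scale(scale, corr_threshold, segments, allow_new) is a
    hypothetical callable standing in for one pass of Sections III-B and
    III-C: it relocates and expands the segments carried over from coarser
    scales and, when allow_new is True, creates new segments in the
    remaining background.
    """
    segments: list = []
    for i, (scale, theta_c) in enumerate(zip(SCALES, CORR_THRESHOLDS)):
        allow_new = i < len(SCALES) - 1   # no new segments at the finest scale
        segments = segment_at_scale(scale, theta_c, segments, allow_new)
    return segments
```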
IV. EVALUATION METRICS

Only a few previous models have explicitly addressed the problem of auditory segmentation [5], [8], [18], [33], but none have separately evaluated the segmentation performance. How to quantitatively evaluate segmentation results is a complex issue, since one has to consider various types of mismatch between a collection of ideal segments and that of estimated segments. On the other hand, similar issues also occur in image segmentation, which has been extensively studied in computer vision and image analysis. We have therefore adapted the region-based metrics of Hoover et al. [16], which have been widely used for evaluating image segmentation systems.

Our region-based evaluation compares estimated segments with the ideal segments of a target source, since in many situations one is interested only in target extraction. In other words, how the system segments interference is not considered in the evaluation. Hence, we treat all the T-F regions dominated by interference as the ideal background. Note that this can be extended to situations where one is interested in evaluating the segmentation of multiple sources, say, when interference is a competing talker. For example, one may evaluate how the system segments each source separately.

The general idea is to examine the overlap between ideal segments and estimated segments. Based on the degree of overlap, we label a T-F region as correct, under-segmented, over-segmented, missing, or mismatch. Fig. 5(a) illustrates these cases, where ovals represent ideal target segments (numbered with Arabic numerals) and rectangles represent estimated segments (numbered with Roman numerals).

Fig. 5. Illustration of different matching situations between ideal and estimated segments. (a) Correct segmentation, under-segmentation, over-segmentation, missing, and mismatch. (b) Multiple labels for one overlapping region. Here, an oval indicates an ideal segment and a rectangle an estimated one. The background is represented by gray.

As shown in Fig. 5(a), estimated segment I well covers ideal segment 1, and we label the overlapping region as correct; so is the overlap between segments 7 and VII. Segment III well covers two ideal segments, 3 and 4, and the overlapping regions are labeled as under-segmented. Segments IV and V are both well covered by segment 5, and the overlapping regions are labeled as over-segmented. All the remaining regions from ideal segments—segments 2 and 6 and the parts of segments 5 and 7 marked by diagonal lines—are labeled as missing. The black region in segment I belongs to the ideal background, but since it is merged with ideal segment 1 into an estimated segment, we label this black region as mismatch, as well as the black region in segment III.

Note the major differences among under-segmentation, missing, and mismatch. Under-segmentation denotes the error of combining multiple T-F regions belonging to different segments of the same source, whereas missing and mismatch denote the error of mixing T-F regions from different sources. Therefore, if an estimated segment combines T-F regions belonging to different speakers, it is not under-segmentation, but missing or mismatch, depending on the degree of overlap. Segment II is well covered by the ideal background, which is not considered in the evaluation. Much of segment VI is covered by the ideal background, and therefore we treat the white region of that segment the same as segment II (note the difference between I and VI).

Quantitatively, let $\{s_i\}$, $i = 0, 1, \ldots, M$, be the set of ideal segments, where $s_0$ indicates the ideal background and the others are the ideal segments of the target. Let $\{\hat{s}_j\}$, $j = 0, 1, \ldots, N$, be the estimated segments produced by the system, where each $\hat{s}_j$ corresponds to an estimated segment and $\hat{s}_0$ is the estimated background. Let $P_{ij}$ be the overlapping region between $s_i$ and $\hat{s}_j$. Furthermore, let $E(s_i)$, $E(\hat{s}_j)$, and $E(P_{ij})$ denote the corresponding energy in these regions. Given a threshold $\theta_E$, we define that an ideal segment $s_i$ is well covered by an estimated segment $\hat{s}_j$ if $P_{ij}$ includes most of the energy of $s_i$, that is,

$E(P_{ij}) > \theta_E \, E(s_i).$  (5)

Similarly, $\hat{s}_j$ is well covered by $s_i$ if

$E(P_{ij}) > \theta_E \, E(\hat{s}_j).$  (6)

For any $\theta_E > 0.5$, the above definition of well-coveredness ensures that an ideal segment is well covered by at most one estimated segment, and vice versa. Then we label a nonempty overlapping region as follows.
• A region $P_{ij}$, $i, j \neq 0$, is labeled as correct if $s_i$ and $\hat{s}_j$ are mutually well covered.
• Let $s_{i_1}, s_{i_2}, \ldots$ ($i_k \neq 0$) be all the ideal target segments that are well covered by one estimated segment $\hat{s}_j$, $j \neq 0$. The corresponding overlapping regions $P_{i_1 j}, P_{i_2 j}, \ldots$ are labeled as under-segmented if these regions combined include most of the energy of $\hat{s}_j$, that is,

$\sum_k E(P_{i_k j}) > \theta_E \, E(\hat{s}_j).$  (7)

• Let $\hat{s}_{j_1}, \hat{s}_{j_2}, \ldots$ ($j_k \neq 0$) be all the estimated segments that are well covered by one ideal segment $s_i$, $i \neq 0$. The corresponding overlapping regions $P_{i j_1}, P_{i j_2}, \ldots$ are labeled as over-segmented.
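The well-coveredness tests in (5)-(7) reduce to comparisons of overlap energy, as in the sketch below. Segments are represented as boolean masks over the cochleagram, and the default threshold of 0.5 is only an assumption; the paper leaves the value to be specified, and any value above one half preserves the at-most-one property noted above.

```python
import numpy as np

def overlap_energy(mask_a, mask_b, energy):
    """Energy E(P_ij) in the overlap of two segments, each given as a boolean
    (channels x frames) mask; `energy` holds the per-unit energy of the
    T-F representation."""
    return float(energy[mask_a & mask_b].sum())

def well_covered(covered, covering, energy, theta=0.5):
    """Test (5)/(6): the overlap holds more than a fraction `theta` of the
    covered segment's energy."""
    return overlap_energy(covered, covering, energy) > theta * float(energy[covered].sum())

def label_correct(ideal, estimated, energy, theta=0.5):
    """An overlapping region is labeled 'correct' when the ideal and the
    estimated segment are mutually well covered."""
    return (well_covered(ideal, estimated, energy, theta)
            and well_covered(estimated, ideal, energy, theta))
```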