1 Introduction

Support for hearing-impaired people has centred on hearing aid systems and visual methods, such as note-taking and sign language interpretation. In recent years, methods that utilize information and communication technologies, such as voice recognition technology, have been producing better results. However, some problems remain difficult to solve. For instance, hearing-impaired people face challenges at school and work in group work and meetings. In situations where multiple speakers are talking simultaneously, hearing-impaired people cannot obtain enough information through sign language or subtitles alone. Possible solutions are providing materials in advance, ensuring that multiple speakers do not talk at the same time, or having a moderator manage the speakers.

However, these solutions are not available at all times. This can decrease the amount of information exchanged and sometimes causes hearing-impaired people to stop participating. To solve this problem, we need solutions founded on principles other than the existing types of support and accommodation. If hearing-impaired people had the opportunity to obtain visually the same amount of information that other people receive through hearing, it would help solve these problems.

I am a member of the teaching staff at a university for students with hearing impairments. My teaching experience has made me aware of the difficulties hearing-impaired people face in groups where hearing people are the majority. Being one of only a few hearing-impaired people in a group of hearing people, such as at a learning session or a conference, is one of the most difficult situations for them.

2 Planning the System

In Japan, a common support method is to ask hearing people to limit their speech, for example by requiring speakers to raise a hand to take their turn so that multiple speakers do not talk at the same time.

However, such rules can be seen as scaling back the quality and quantity of communication. These support methods sometimes become a psychological burden for hearing-impaired people, making them hesitate to participate. To address these issues, methods were developed that subtitle speech in its entirety even when several people speak at the same time.

Note-taking was considered first. Displaying subtitles via note-taking is a common method of deaf support. It is often used in situations such as lectures, where the accuracy of the information matters more than the simultaneity of the subtitle display. In group conversation, by contrast, the simultaneity of the subtitle display is more important than the accuracy of the information.

I examined previous studies on speech and subtitling. According to a study by Maruyama et al. [1], the tolerable delay between voice and subtitle is about 1 s. A study by Shimogori et al. [2] indicates that the timing of subtitle display has a larger influence on the level of understanding than the accuracy of the subtitle text. These previous studies indicate that the delay between speech and subtitle appearance should be within 1 s.

Next, I considered methods that can produce subtitles within 1 s of speech. Note-taking is known to take some time from speech to subtitle display. Ariumi [3] reports delays from speech to subtitle display of about 4 s for fast note-takers and 5 to 10 s for slow note-takers. Subtitling the speech of several speakers is assumed to take even longer, because the note-taker must also identify who is speaking. Accordingly, I concluded that subtitling within 1 s is impossible with manual note-taking.

Next, subtitling by a voice recognition system was considered. The method and system developed in this study will ultimately be offered free of charge, so it was important that the voice recognition system also be free. First, I tried the voice recognition in Google Docs and found that it produces subtitles within 1 s of speech. However, it cannot subtitle the voices of several speakers separately, and because its output text is not divided by utterance, it is not suitable as a user interface for displaying several speakers' talk. Therefore, a method other than the voice recognition in Google Docs was examined. UD Talk [4] is subtitling software for Japanese that uses a voice recognition system and is free for personal use. It subtitles within 1 to 2 s of a voice, which subjectively feels about twice the delay of Google Docs. Like Google Docs, UD Talk has no UI adapted to several speakers' talk; to work around this, speakers must pass the UD Talk device to one another, and that handling takes time in addition to the voice recognition process. Thus, no free system existed that subtitles talk in Japanese and has a UI showing several speakers' talk. Moreover, these are all methods for subtitling talk in which hearing people offer information to a hearing-impaired person.

Another issue arises when a hearing-impaired person offers information to hearing people: it is difficult for the hearing-impaired person to join the conversation at a natural timing. A hearing-impaired person who can speak is able to join in by uttering. If uttering is not an option, however, the person who intends to speak must interrupt the conversation visually, for example by raising a hand.

This is a psychological barrier for hearing-impaired people. Another issue that prevents support is the installation cost of the system. If installation is expensive, the system will not be used fully even if it removes the stress of offering and receiving information for hearing-impaired people.

As stated at the beginning, this study aims to remove the difficulties hearing-impaired people face when communicating in a group of hearing people. Therefore, it is essential that the system be easy to install and require no special devices.

In Japanese speech recognition, many revisions are required for mistakes in the kana-to-kanji conversion process (for example, the homophones 箸 "chopsticks" and 橋 "bridge" are both written はし in kana). A UI that makes revising the text easy is therefore necessary.

These are the conditions the developed software must satisfy. The points emphasized in developing the system are as follows:

1. The duration from utterance to display of subtitles should be within approximately 1 s.

2. A user interface that makes it easy to distinguish the voices of several speakers is necessary.

3. The UI must make it easy for speech-handicapped and hearing-impaired people to participate in the discussion.

4. The system should be installable free of charge on any device.

5. The UI should have a function to revise the text recognized from the voice, and that function must make the text easy to edit.

3 Developing the System

First, the voice recognition engine was considered. Several systems recognize and transcribe speech in Japanese. In February 2019, when development started, IBM Watson Speech to Text, Microsoft Azure Speech to Text, and Google Cloud Speech-to-Text were available. Google Cloud Speech-to-Text was free of charge for up to 60 min per month. However, usage would clearly exceed the 60 min limit quickly, so it could not remain free of charge; another engine was necessary.

Next, the voice recognition system included in the Google Chrome browser was examined. It is free of charge when used via the Chrome browser, regardless of the minutes of use. Although it works only in Google Chrome, this engine reduces the installation cost mentioned above, so the system was developed as a Web application. A brief prototype that recognizes voice and displays subtitles was developed and examined. The delay from utterance to subtitle was found to be the same as in Google Docs, which means the delay can be kept within 1 s. This satisfies conditions 1 and 4 listed at the end of Sect. 2.
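The paper gives no source code; the following is a minimal sketch, in TypeScript, of how Chrome's built-in recognition can be driven through the Web Speech API. The helper names showInterim and commitFinal are illustrative assumptions; the interim results are what make a sub-second display delay possible.

```typescript
// Minimal sketch: Japanese speech recognition via the Web Speech API in
// Chrome. Helper names are illustrative, not the prototype's identifiers.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = 'ja-JP';        // recognize Japanese speech
recognition.interimResults = true; // emit partial hypotheses for low latency

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const text: string = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      commitFinal(text); // final hypothesis: ready for editing and determination
    } else {
      showInterim(text); // partial hypothesis: on screen well within 1 s
    }
  }
};

function showInterim(text: string): void {
  console.log('[interim]', text); // illustrative stub
}

function commitFinal(text: string): void {
  console.log('[final]', text);   // illustrative stub
}

recognition.start();
```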

The UI was considered next. The system needs a UI that makes it easy to distinguish the voices of several speakers. Research turned up a software method that separates several speakers' voices by independent component analysis. Still, the separation is more reliable when the voices are separated in hardware, so the system was designed so that each speaker connects through his or her own device, such as a laptop computer or smartphone. This feature satisfies condition 2.
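The paper does not specify how the per-device connections are transported; the sketch below assumes a WebSocket relay per room (the URL and helper names are illustrative) to show why one connection per speaker makes attribution trivial without any signal-level source separation.

```typescript
// Sketch under assumptions: a WebSocket relay per room. Each speaker's device
// opens its own connection, so utterances arrive already attributed to a
// speaker; no independent component analysis is needed.
const socket = new WebSocket('wss://example.org/rooms/ROOM_ID'); // illustrative URL

interface Utterance {
  speaker: string; // display name of this device's user
  text: string;    // determined (committed) subtitle text
}

function sendUtterance(speaker: string, text: string): void {
  const message: Utterance = { speaker, text };
  socket.send(JSON.stringify(message));
}

socket.onmessage = (event: MessageEvent) => {
  const utterance: Utterance = JSON.parse(event.data);
  appendToMainWindow(utterance); // render "speaker: text" in the shared window
};

function appendToMainWindow(u: Utterance): void {
  console.log(`${u.speaker}: ${u.text}`); // illustrative stub
}
```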

In the next step, a function to make it easy for hearing-impaired people to participate in the discussion was considered. Although many hearing-impaired people speak orally, for most of them it is difficult to speak clearly enough for the voice recognition system to process. Therefore, a text writing function was added to the UI. This function holds the transcribed voice, and that text can be revised in the UI. This feature satisfies conditions 3 and 5. Thus, the specifications satisfy every condition.

Figure 1 shows the appearance of the developed prototype: the display right after login. Figure 2 shows the screen after entering a room. The Japanese text in the images has been translated and annotated in English.

Fig. 1. Appearance of Prototype (Initial Screen).

Fig. 2. Appearance of Prototype (Each Room).

The system is used as follows. The user logs in with a Google account and then creates a room. Figure 1 shows the state in which only one room has been created; several rooms can be created as well. The user issues an invitation code to invite other users to the room. Figure 3 shows three users participating in a room and chatting. A guest user can join a room but cannot create one.
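The paper does not describe how invitation codes are generated; as an illustration only, a room code might be issued as below. The length and alphabet are assumptions, and the later feedback that the code is "too long" suggests the real codes were of comparable size.

```typescript
// Sketch (assumption): issuing a random invitation code for a room.
function issueInvitationCode(length: number = 24): string {
  // 32-character alphabet without look-alike characters (0/O, 1/I).
  const alphabet = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';
  const bytes = new Uint8Array(length);
  crypto.getRandomValues(bytes);
  // 256 is divisible by 32, so the modulo introduces no bias.
  return Array.from(bytes, b => alphabet[b % alphabet.length]).join('');
}

// A 24-character code is secure but tedious to type by hand, which
// motivates the later feedback asking for a shortened code or a QR code.
console.log(issueInvitationCode());
```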

Fig. 3. Appearance of the Prototype after Adjustment (Each Room).

The following explains Figure 3. The "Text Input Area" is used to input text like an ordinary text editor. In the prototype phase the Enter key did not work; the "Determine" button was used to fix the input text. The "Start Recording" button is a toggle: pressing it starts sound recording, and the button turns gray. The result of voice recognition is placed in the "Text Input Area" and stays there for a time the user sets; while the text is in this window, it can be revised. Pressing the "Determine" button, or the set time running out, determines the input, which then appears in the "Main Window" below. At the same time, the grayed "Start Recording" button returns to its default state, meaning users must press "Start Recording" before each utterance. The recognized text is not visible to other users while it is in the "Text Input Area"; only after it is determined do other users see it. A minimal sketch of this commit logic follows.
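This is a sketch under assumptions: the element ID, helper names, and the 5 s hold time are illustrative, since the paper does not show the prototype's code. It mirrors only the determine-or-timeout behavior described above.

```typescript
// Sketch of the "Text Input Area" commit logic. All identifiers and the
// hold time are assumptions for illustration.
const HOLD_MS = 5000; // assumed user-set time the text stays editable

let holdTimer: number | undefined;

function onRecognized(text: string): void {
  const input = document.getElementById('text-input-area') as HTMLInputElement;
  input.value = text;                                // editable recognition result
  window.clearTimeout(holdTimer);
  holdTimer = window.setTimeout(determine, HOLD_MS); // auto-determine on timeout
}

function determine(): void {
  const input = document.getElementById('text-input-area') as HTMLInputElement;
  if (input.value.trim() === '') return;
  broadcast(input.value); // only now does the text reach other users
  input.value = '';       // clear the area; "Start Recording" resets to default
}

function broadcast(text: string): void {
  // Illustrative stub: send the determined text to the shared "Main Window".
  console.log('determined:', text);
}
```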

4 Experiment and Feedback

After development, the prototype was tested in a practical experiment. The experiment was held at a Japanese company, in a situation where one hearing person spoke to nine hearing-impaired people.

However, the experiment was soon stopped because the operation was too burdensome: users had to press the "Start Recording" button for every utterance. Because the experiment was so short, that was the only feedback obtained. The UI was adjusted to recognize voice continuously, as sketched below, so that users need not act before each utterance. Figure 3 shows the adjusted prototype.
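How the continuous mode was implemented is not described in the paper; one plausible sketch, given that Chrome's recognizer stops on silence or errors, is to restart it from the onend handler:

```typescript
// Sketch (assumption): keep the recognizer running so users need not press
// "Start Recording" before every utterance. Chrome's recognizer can stop on
// silence or error, so it is restarted automatically from onend.
const Rec =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const rec = new Rec();
rec.lang = 'ja-JP';
rec.continuous = true;     // do not stop after a single utterance
rec.interimResults = true;

let listening = true; // cleared only when the user explicitly stops

rec.onend = () => {
  if (listening) rec.start(); // resume instead of waiting for a button press
};

rec.start();
```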

After the adjustment of the prototype, a second practical experiment was conducted, in which two Japanese companies participated. The prototype was used in the following situations:

1. Two hearing people and one hearing-impaired person talking

2. Four hearing people and two hearing-impaired people talking, as in a group meeting

The feedback from the experiment was as follows.

  • Sharing the typing status in the "Text Input Area" with other users would be preferable.

  • The Enter key should also determine the text, in addition to the "Determine" button.

  • "Start Recording" is not a suitable label; it should be changed.

  • Some recognition mistakes can be followed as long as the context is understood.

  • The occasional delay in voice recognition should be resolved.

  • Because the application works in a Web browser, it is easy to start using.

  • The invitation code is too long. Typing it is impractical, so the code has to be received over a network; a shortened code or a QR code would be preferable.

This feedback is being addressed in the ongoing revision of the prototype.

After the revision, further practical experiments are planned at several companies. Development and validation will continue, and further progress is expected.