
How to build suitable datasets for successful detection of audio deepfakes

Deepfakes are a significant threat to democracy as well as to private individuals and companies. Among other things, they make it possible to spread disinformation, steal intellectual property and commit fraud. While robust AI detection systems offer a possible solution, their effectiveness depends largely on the quality of the underlying data, simply put: »Garbage in, garbage out.« But how do you create a dataset that is well suited to identifying the ever-evolving deepfakes and enables robust detection? And what constitutes high-quality training data?

Deepfakes are a major threat to democracy, but also to private individuals and companies. For example, attackers use deepfakes to assume a false identity in video or phone conversations in order to obtain confidential information for industrial espionage, or to trigger fraudulent money transfers abroad. But what can we do to prevent this? There are three main approaches:

  1. Education and training: The public needs to be educated that both video and audio deepfakes exist and how they are being exploited. Additional measures include training the ear to identify fake audio tracks (see Deepfake Detection: Spot the Deepfake).
  2. Verification and signature of media contents: Technologies such as the Content Authenticity Initiative enable verification of media authenticity (see Content Authenticity Initiative).
  3. AI-assisted deepfake detection: This involves developing AI systems that can analyze unknown audio recordings and determine whether they are real or fake. These AI detectors are designed to reliably identify even the latest deepfakes, while deepfake creators are doing everything they can to avoid being detected. Similar to antivirus detection, this is an ongoing competition in which the defender’s goal is to raise the bar for the attacker so high that an attack is no longer worthwhile.

One example of the use of an AI-assisted detection system is the following analysis of an audio deepfake. In this fake recording, British Prime Minister Keir Starmer appears to say: »I don’t really care about my constituents and I’m deceiving them.« When the recording is analyzed on the platform Deepfake Total, it is recognized as a fake, as can be seen in the following screenshot: the Deepfake-O-Meter is red.

Figure 1: Analysis of the Keir Starmer audio deepfake through http://deepfake-total.com/

The MLAAD dataset

The basis of any AI-assisted deepfake detection is the underlying dataset. Suitable samples of original and fake audio tracks are collected, which are then used to train the detection model.
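
To make this concrete, the following is a deliberately simplified, purely illustrative sketch of what "training a detection model on labeled samples" means. The per-frame feature, the class distributions, and the threshold rule are all invented for illustration; real detectors operate on spectrogram features with deep neural networks trained on datasets such as MLAAD.

```python
import random
import statistics

# Toy illustration (not a real detector): each audio clip is modeled as a
# list of per-frame feature values, and a threshold classifier separates
# "bonafide" from "spoofed" clips. The feature itself is synthetic.

random.seed(0)

def make_clip(is_fake, n_frames=100):
    # Hypothetical assumption: fake clips have slightly higher feature values.
    base = 0.6 if is_fake else 0.4
    return [random.gauss(base, 0.1) for _ in range(n_frames)]

# A balanced toy dataset: label 1 = fake, 0 = real.
dataset = [(make_clip(label), label) for label in (0, 1) for _ in range(50)]

def clip_score(clip):
    return statistics.mean(clip)

# "Training": place the decision threshold midway between the class means.
mean_real = statistics.mean(clip_score(c) for c, y in dataset if y == 0)
mean_fake = statistics.mean(clip_score(c) for c, y in dataset if y == 1)
threshold = (mean_real + mean_fake) / 2

def predict(clip):
    return 1 if clip_score(clip) > threshold else 0

accuracy = sum(predict(c) == y for c, y in dataset) / len(dataset)
print(f"training accuracy: {accuracy:.2f}")
```

The sketch also shows why dataset quality matters: if the training clips came from only one generator, the learned threshold would capture that generator's quirks rather than deepfakes in general.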

The deepfake detection system shown above was trained on the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), which contributed to the high detection rate even for new and unknown audio deepfakes. The MLAAD dataset addresses two of the major challenges of audio deepfake detection:

  • Balanced TTS systems: Audio deepfakes are often created using text-to-speech (TTS) systems that can synthesize any text in the voice of the target person, as was the case with Keir Starmer. There is a large variety of TTS systems, each with its own characteristics. Some are particularly good at creating emotional speech, while others can create a near-perfect vocal resemblance to the target person. However, training detection systems on audio data from only a few TTS systems means they will only be able to detect the specific features of those systems.
    Deepfake detection requires large amounts of data: The more diverse the deepfake data in the training set, the better the detection. The MLAAD dataset currently includes 59 different TTS systems, more than any other dataset, and is continuously being expanded to further increase diversity.
  • Balanced languages: Similar to TTS systems, it is also important to include a large variety of languages. Frequently, conventional datasets only include English or Chinese audio tracks, even though deepfakes are created in many different languages. Deepfakes in other languages cannot be reliably detected by a detection system trained only in English, for example. MLAAD includes 35 different languages, again more than any other dataset. 
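
Checking this kind of balance is straightforward in practice. The sketch below counts samples per language and per TTS system from a dataset index; the field names and values are illustrative, not the actual MLAAD file format.

```python
from collections import Counter

# Hypothetical metadata records as they might appear in an anti-spoofing
# dataset index: one (language, tts_system) pair per fake sample.
samples = [
    ("en", "vits"), ("en", "vits"), ("en", "tacotron2"),
    ("de", "vits"), ("de", "xtts"), ("fr", "xtts"),
]

by_language = Counter(lang for lang, _ in samples)
by_system = Counter(tts for _, tts in samples)

print("samples per language:", dict(by_language))
print("samples per TTS system:", dict(by_system))
# A balanced dataset keeps these counts roughly equal across all languages
# and TTS systems; large imbalances bias the trained detector.
```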

As indicated above, we used 59 different TTS systems to create the MLAAD dataset. In some cases, we used purpose-built systems; in others, we assembled new pipelines from open-source collections. All systems were applied according to a standardized procedure, which is illustrated in the following figure:

Figure 2: Creating the MLAAD dataset.

As a starting point, we use the M-AILABS dataset, which contains audio tracks in eight languages: English, French, Spanish, Italian, Polish, German, Russian and Ukrainian. To increase variety, we automatically translate the transcripts of these recordings into one of 35 target languages where necessary. We then synthesize 1,000 audio tracks with each of the 59 TTS models, creating an unprecedented diversity of deepfake speech data.
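
The translate-then-synthesize procedure can be sketched as a nested loop over TTS models, target languages, and source transcripts. In this sketch, `translate()` and `synthesize()` are hypothetical placeholders for the real machine-translation and TTS components, and the model and language lists are shortened for illustration.

```python
# Sketch of the generation pipeline described above, with stand-in functions.

def translate(text, target_language):
    # Placeholder: a real pipeline would call a machine-translation model here.
    return f"[{target_language}] {text}"

def synthesize(text, tts_model):
    # Placeholder: a real pipeline would run the TTS model and return audio.
    return {"model": tts_model, "text": text}

source_transcripts = ["hello world", "good morning"]  # from M-AILABS (illustrative)
target_languages = ["de", "fr"]   # MLAAD covers 35 languages in total
tts_models = ["vits", "xtts"]     # MLAAD covers 59 TTS systems in total

corpus = []
for model in tts_models:
    for lang in target_languages:
        for text in source_transcripts:
            corpus.append(synthesize(translate(text, lang), model))

# One synthetic utterance per (model, language, transcript) combination.
print(len(corpus))  # 2 models * 2 languages * 2 transcripts = 8
```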

The MLAAD dataset as a base for research questions

Apart from the practical benefits that our dataset offers, there are also great benefits for the scientific community. Researchers can now examine in a controlled manner which characteristics of audio deepfakes can be detected, and with what accuracy. For example, it is possible to check whether German deepfakes are detected better or worse than English or Spanish ones. A dataset with 59 deepfake models also helps in other disciplines such as source tracing, whose purpose is to determine which AI system created a given deepfake. MLAAD has already been used by researchers in the USA, for example: https://arxiv.org/abs/2407.08016.
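
Source tracing can be framed as a multi-class classification problem: given a deepfake, decide which known generator produced it. The toy sketch below models each TTS system as leaving a characteristic "fingerprint" (here a single synthetic feature value) and attributes samples with a nearest-centroid rule; the fingerprint values and system names are invented, and real source tracing uses learned audio embeddings.

```python
import random
import statistics

random.seed(1)

# Hypothetical per-system fingerprint values.
system_fingerprint = {"vits": 0.2, "tacotron2": 0.5, "xtts": 0.8}

def make_sample(system):
    # A noisy observation of the system's fingerprint.
    return random.gauss(system_fingerprint[system], 0.05)

# Training data: labeled samples from each known system.
train = [(make_sample(s), s) for s in system_fingerprint for _ in range(30)]

# Nearest-centroid classifier over the per-system means.
centroids = {
    s: statistics.mean(x for x, label in train if label == s)
    for s in system_fingerprint
}

def trace(sample):
    return min(centroids, key=lambda s: abs(centroids[s] - sample))

print(trace(make_sample("xtts")))
```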

Wrap-Up

The reliability of AI detection models depends largely on the quality of the training data. Diversity and balance are the hallmarks of a high-quality dataset for the detection of deepfakes. The MLAAD dataset contains audio data in 35 languages and uses 59 different text-to-speech systems to cover a wide range of characteristics. It assists in developing robust anti-spoofing methods, analyzing the origin of deepfakes, and addressing other challenges.

Datasets such as MLAAD are a critical building block for AI-assisted deepfake detection to combat disinformation, safeguard democracy, and protect individuals and companies.

Author
Nicolas Müller

Dr. Nicolas Müller received his doctorate in computer science from the Technical University of Munich in 2022 with a dissertation on the »Integrity and Correctness of AI Datasets.« Prior to that, he completed a degree in mathematics, computer science and theology at the University of Freiburg, graduating with distinction in 2017. Since 2017, he has been a machine learning scientist in the Cognitive Security Technologies department at the Fraunhofer Institute for Applied and Integrated Security AISEC. His research focuses on the reliability of AI models, the identification of machine learning shortcuts, and AI-assisted audio deepfake detection.
