The Rise of Deepfakes - From a Cybersecurity perspective

Jan 25

Written By David I

Introduction

Impersonation has been in existence for a very long time. One of the earliest documented cases of fraudulent impersonation dates back to the biblical story of Jacob and Esau. In this account, Jacob deceives his blind father, Isaac, by pretending to be his brother Esau to receive the paternal blessing intended for the firstborn.

Deepfakes are a form of impersonation, but in more recent times, it has attained new heights with the introduction of artificial intelligence. So what is a deepfake? A deepfake is any type of media that has either been generated or manipulated using Artificial Intelligence.

From a cybersecurity perspective, deepfakes are a very serious threat. They are likely to take center stage in the coming years. According to Deloitte 91% of all breaches start with some form of social engineering. Social engineering has become more realistic and have the ability to fool even the most vigilant people. Lets talk about two ways that social engineering will become more dangerous with the introduction of deepfakes and the readily available D

Vishing
Vidshing (Video phishing)

Vishing is when a malicious actor calls a victim and attempts to fool them into doing something that would be harmful to them, someone they know of their organization. Vishing has been in existence for a long time, some prominent cases of vishing include:

IRS Phone Scam: During tax filing season, scammers often pose as IRS representatives in alarming or threatening phone calls to dupe unsuspecting taxpayers into handing over money or personal information. In one notorious example of vishing (voice phishing), a group posing as both IRS and immigration officials contacted over 50,000 individuals, stealing hundreds of millions of dollars by threatening arrest or deportation.

COVID-19 Scams: During the pandemic, scammers offered fraudulent products claiming to prevent or cure COVID-19. The Federal Communications Commission warned against such vishing schemes.

Drumroll.... this is the first time this term has been coined..... vidshing also known as Deepfake Phising. Vidshing (or Deepfake Phising.) is when a malicious actor makes a video call to someone using a deep fake for the conversation. Vidshing is already being used. In a case reported by CNN, a finance employee at a multinational firm was tricked into transferring $25 million after fraudsters used deepfake technology to impersonate the company's chief executive during a video call.

Deepfake Attack Scenario (the Trump Zoom call demo)

In this blog post we walk through a live deepfake scam scenario to demonstrate the tech in action Here is the scenario:

President Trump has chosen you for an award of $1million dollars. He calls you on zoom to personally congratulate you. For this scenario, we will use an existing speech that we can download from youtube. We start by cloning the target's voice.

Clone the targets voice

Voice cloning, the process of replicating a person's voice using technology, has evolved significantly over the centuries, progressing from mechanical devices to advanced artificial intelligence systems.

In the late 18th century, inventors like Wolfgang von Kempelen developed mechanical devices to emulate human speech. Von Kempelen's "acoustic-mechanical speech machine," described in 1791, used bellows and models of the tongue and lips to produce both consonant and vowel sounds.

The 1930s and 1940s saw the emergence of electronic devices such as the VODER (Voice Operated Demonstrator) and the VOCODER (Voice Operated Coder), developed by engineers like Homer Dudley at Bell Labs. The VODER was an electro-mechanical device that generated speech sounds by filtering waveforms, while the VOCODER analyzed speech into its fundamental tones and resonances.

In 1962, Bell Laboratories' researchers, including Max Mathews, used a computer (IBM 704) to synthesize singing, marking a significant milestone in digital speech synthesis.

The computer performed a rendition of "Daisy Bell," demonstrating the potential of computers to generate human-like speech.

The 21st century introduced deep learning techniques that revolutionized voice cloning. In 2016, by directly modeling raw audio waveforms, producing highly natural-sounding voices. Now virtually any voice can be cloned using deeplearning techniques provided that you have a reference audio

We will be using Llasa, a text-to-speech (TTS) system built upon the text-based LLaMA language model (in 1B, 3B, and 8B configurations). Llasa incorporates 65,536 speech tokens derived from the XCodec2 codebook and has been trained on 250,000 hours of bilingual Chinese-English speech. As a result, it can generate audio entirely from text input or by using a supplied speech prompt. To sync the audio generated to a video so that it looks like the person in the video is speaking the words, we will use the video-retalking library.

System Overview

For this demonstration our setup features four NVIDIA GPUs—one RTX 3080, one RTX 3060, and two RTX 3080 Ti cards—running on driver version 555.42.06 with CUDA 12.5. On the CPU side, we’re using an Intel® Core™ i7-9700K with 8 cores clocked at up to 5 GHz and support for virtualization (VT-x). This combination offers plenty of parallel horsepower for deep learning tasks, 3D rendering, or any GPU-accelerated workload.

Creating the audio impersonation.

Llasa-3B can be found here. The following Python script uses a pre-trained model (Llasa-3B) to clone the voice and generate a deepfake audio clip.

import argparse
import logging
import os
import torch
import torchaudio
import soundfile as sf
from typing import List

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from xcodec2.modeling_xcodec2 import XCodec2Model

# Set up logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Configure stream handler (to stdout) with a consistent format
console_handler = logging.StreamHandler()
console_formatter = logging.Formatter(
    fmt="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
console_handler.setFormatter(console_formatter)
logger.addHandler(console_handler)


def ids_to_speech_tokens(speech_ids: List[int]) -> List[str]:
    """
    Convert a list of integer speech IDs into the string token format <|s_12345|>.

    Args:
        speech_ids (List[int]): A list of integers representing speech IDs.

    Returns:
        List[str]: A list of strings where each string is in the format <|s_x|>,
            representing the speech ID tokens.
    """
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]


def extract_speech_ids(speech_tokens_str: List[str]) -> List[int]:
    """
    Convert a list of tokens like <|s_12345|> back into integer speech IDs (12345).

    Args:
        speech_tokens_str (List[str]): A list of tokens in the format <|s_12345|>.

    Returns:
        List[int]: A list of integer speech IDs extracted from the tokens.
    """
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith("<|s_") and token_str.endswith("|>"):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            logger.warning(f"Unexpected token: {token_str}")
    return speech_ids


def main() -> None:
    """
    Parse command-line arguments and run the LLaSA TTS process locally without Gradio.
    """
    parser = argparse.ArgumentParser(description="Run LLaSA TTS locally without Gradio.")
    parser.add_argument("--ref_audio", default="trr.mp3",
                        help="Path to the reference audio (wav, mp3, etc.)")
    parser.add_argument("--text", default="Good morning Adam, it is with great pleasure that i announce that you are a beneficiary of the 1 million dollars relief, granted to 2000 people as part of the new relief act. I requested this zoom meeting to personally congratulate you and wish you all the best. Thank you.",
                        help="Text to synthesize.")
    parser.add_argument("--output", default="output.wav",
                        help="Output WAV filename (default: output.wav)")
    args = parser.parse_args()

    # ----------------------
    # Load Tokenizer and Model
    # ----------------------
    llasa_3b = "HKUST-Audio/Llasa-3B"

    logger.info("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(llasa_3b,cache_dir='/media/agent/dige-backup/cache')

    logger.info("Loading LLaSA model with auto device mapping...")
    model = AutoModelForCausalLM.from_pretrained(
        llasa_3b,
        trust_remote_code=True,
        device_map="auto",          # <-- Auto device map
        torch_dtype=torch.float16,  # Optionally use fp16 if GPUs are available

    )

    logger.info("Loading XCodec2 model (moved to GPU if available)...")
    model_path = "HKUST-Audio/xcodec2"
    Codec_model = XCodec2Model.from_pretrained(model_path)
    Codec_model.eval()
    if torch.cuda.is_available():
        Codec_model.cuda()

    logger.info("Loading Whisper pipeline...")
    # Note: pipeline does not support device_map='auto' at this time.
    #       If you have a GPU, you can set device=0 (the first GPU).
    whisper_turbo_pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3-turbo",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device=0 if torch.cuda.is_available() else -1,  # 0 => first GPU, -1 => CPU
    )

    # ----------------------
    # Process reference audio
    # ----------------------
    logger.info(f"Reading reference audio from {args.ref_audio}")
    waveform, sample_rate = torchaudio.load(args.ref_audio)
    duration_sec = len(waveform[0]) / sample_rate

    # Trim if longer than 15 seconds
    if duration_sec > 15:
        logger.info("Trimming audio to first 15 seconds...")
        waveform = waveform[:, : sample_rate * 15]

    # Convert to mono if stereo
    if waveform.size(0) > 1:
        waveform_mono = torch.mean(waveform, dim=0, keepdim=True)
    else:
        waveform_mono = waveform

    # Resample to 16kHz
    logger.info("Resampling reference audio to 16 kHz...")
    prompt_wav = torchaudio.transforms.Resample(
        orig_freq=sample_rate, new_freq=16000
    )(waveform_mono)

    # ----------------------
    # Transcribe reference audio using Whisper
    # ----------------------
    logger.info("Transcribing reference audio...")
    prompt_text = whisper_turbo_pipe(prompt_wav[0].numpy())["text"].strip()
    logger.info(f"Reference audio transcribed text: '{prompt_text}'")

    # ----------------------
    # Build final text
    # ----------------------
    target_text = args.text.strip()
    if len(target_text) == 0:
        raise ValueError("Target text is empty. Please provide some text to generate speech.")
    elif len(target_text) > 300:
        logger.warning("Text is too long. Truncating to first 300 characters.")
        target_text = target_text[:300]

    input_text = prompt_text + " " + target_text

    # ----------------------
    # TTS generation
    # ----------------------
    logger.info("Encoding reference audio into speech tokens (XCodec2)...")
    with torch.no_grad():
        # Encode the prompt audio to speech tokens
        vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
        # vq_code_prompt shape => [batch=1, group=1, seq_len=?]
        vq_code_prompt = vq_code_prompt[0, 0, :]

        # Convert IDs to special speech tokens
        speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

        formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

        # Prepare the chat structure for LLaSA
        chat = [
            {"role": "user", "content": "Convert the text to speech:" + formatted_text},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + "".join(speech_ids_prefix)},
        ]

        # Tokenize using the chat template method
        input_ids = tokenizer.apply_chat_template(
            chat,
            tokenize=True,
            return_tensors="pt",
            continue_final_message=True
        )

        if torch.cuda.is_available():
            input_ids = input_ids.to("cuda")

        speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

        logger.info("Generating speech tokens...")
        outputs = model.generate(
            input_ids,
            max_length=2048,
            eos_token_id=speech_end_id,
            do_sample=True,
            top_p=1.0,
            temperature=0.8,
        )

        # Extract newly generated speech tokens
        generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix) : -1]

        # Convert tokens to string representation
        speech_tokens_str = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

        # Convert from <|s_12345|> to integer IDs
        speech_tokens_int = extract_speech_ids(speech_tokens_str)
        speech_tokens_int = torch.tensor(speech_tokens_int).unsqueeze(0).unsqueeze(0)
        if torch.cuda.is_available():
            speech_tokens_int = speech_tokens_int.cuda()

        logger.info("Decoding speech tokens to waveform...")
        gen_wav = Codec_model.decode_code(speech_tokens_int)

        # Keep only the newly generated portion (beyond the prompt length)
        gen_wav = gen_wav[:, :, prompt_wav.shape[1] :]

    # ----------------------
    # Save output
    # ----------------------
    out_wav = gen_wav[0, 0, :].cpu().numpy()
    sf.write(args.output, out_wav, 16000)
    logger.info(f"Done! Synthesized audio saved to {args.output}")


if __name__ == "__main__":
    main()

Running the above is easy, just pass in your audio file (the audio that contains the voice of the speaker you want to clone), add the text you want the speaker to say, and it will generate an output file that contains whatever you want it to say.

Thats it! you have generated your first cloned audio! Next we will sync the audio to some video that we fetch off the internet. Here is the audio we generated for our scenario:

Syncing the audio to video

To download a video from youtube you can use yt-dlp
Like we mentioned earlier, to sync this video to the audio, we will use video-retalking

Instructions are available on the github page. So we go ahead and run the script by passing in our audio and video, and the result is:

This video shows an example of a deepfake

If you pay close attention to the mouth area, you might see some tiny inconsistencies, but this will only get better with time.

PS: take care that for whatever video you select, ensure the person being impersonated is not too close to the camera, video-retalking is not very good with such videos. Also, the audio and video must be of the same length.

Streaming the Deepfake Live (or System Setup)

Now to stream it live to zoom. Below is a step-by-step guide on how to create a virtual (software) camera device on Ubuntu and feed video frames to it using Python. This allows other applications (like Zoom, Skype, etc.) to recognize your software-generated or processed stream as if it were coming from a normal webcam.

Here are the instructions:

Install and Load v4l2loopback

sudo apt-get update
sudo apt-get install v4l2loopback-dkms v4l2loopback-utils

This installs the kernel module v4l2loopback, which lets you create one or more virtual video devices (e.g., /dev/videoX).

Load the Kernel Module

After installing, load the module:

sudo modprobe v4l2loopback devices=1 video_nr=10 card_label="deepfake" exclusive_caps=0
devices=1 

This creates a single virtual camera at /dev/video10 labeled “deepfake. To confirm that it was created, you can run the following commands:

ls /dev/video*
v4l2-ctl --list-devices

Then log out and log back in.

Next we Install Python Dependencies:

pip3 install opencv-python pyvirtualcam

Now we write the python script. The python script below demonstrates how to capture frames from a video file and stream them to a virtual camera device. It uses the argparse module to allow users to specify the video path, virtual camera device, and optional frames-per-second override. Internally, OpenCV (cv2) reads each video frame in a loop, while pyvirtualcam sends the frames to the specified virtual camera mimicking a live video feed. If the video ends, the script loops back to the beginning, ensuring an uninterrupted stream. The code also includes best practices such as a main function guard, docstrings for clarity, type hints, and structured error handling.

#!/usr/bin/env python3
"""
A script to read frames from a video file and stream them to a virtual camera
using the pyvirtualcam library.

Example usage:
    python script_name.py --video-path generated.mp4 --device /dev/video10
"""

import argparse
import sys
import cv2
import pyvirtualcam

def parse_arguments() -> argparse.Namespace:
    """
    Parse command-line arguments.

    Returns:
        argparse.Namespace: An object containing the parsed command-line arguments.
    """
    parser = argparse.ArgumentParser(
        description="Stream video frames to a virtual camera."
    )
    parser.add_argument(
        "--video-path",
        type=str,
        default="generated.mp4",
        help="Path to the input video file."
    )
    parser.add_argument(
        "--device",
        type=str,
        default="/dev/video10",
        help="Path to the virtual camera device."
    )
    parser.add_argument(
        "--fps",
        type=float,
        default=0.0,
        help="Frames per second override. If 0, FPS is inferred from the video."
    )
    return parser.parse_args()

def main() -> None:
    """
    Main entry point of the script that captures frames from a video file and
    sends them to a specified virtual camera device in an infinite loop.
    """
    args = parse_arguments()

    # Initialize video capture from the specified video path
    cap = cv2.VideoCapture(args.video_path)
    if not cap.isOpened():
        print(f"Error: cannot open video file {args.video_path}")
        sys.exit(1)

    # Read properties from the video
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # If the user provided an FPS override, use it. Otherwise, get from video.
    fps = args.fps if args.fps != 0 else cap.get(cv2.CAP_PROP_FPS)
    if fps == 0 or fps is None:
        fps = 30.0  # fallback if we can't read FPS from the video

    print(f"Opening {args.video_path} at {width}x{height} @ {fps:.2f} FPS")

    # Create a virtual camera using the pyvirtualcam library
    with pyvirtualcam.Camera(width=width, height=height, fps=fps, device=args.device) as cam:
        print(f"Using virtual camera: {cam.device}")

        # Read frames from the video indefinitely
        while True:
            ret, frame = cap.read()
            if not ret:
                # If the video ends or fails to read, loop back to the start
                cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
                continue

            # Convert OpenCV's BGR frame to RGB for pyvirtualcam
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

            # Send the frame to the virtual camera
            cam.send(frame_rgb)

            # Wait until the next frame should be processed
            cam.sleep_until_next_frame()

    # Clean up the capture resource
    cap.release()

if __name__ == "__main__":
    main()

After running the script, start zoom, you should be able to select video10, (deepfake) as a source. Here is the script running on zoom:

Thats it! in this blog, we discussed the history of deepfakes, we talked about vishing and eventually vidshing. We discuss how dangerous deepfakes already are and what they could be in the future. In the next series we will discuss how to protect yourself and your organization.

David I

The Rise of Deepfakes - From a Cybersecurity perspective

Introduction

Deepfake Attack Scenario (the Trump Zoom call demo)

Clone the targets voice

Syncing the audio to video

Streaming the Deepfake Live (or System Setup)

DeepSeek: AI Breakthrough or Security Threat from China?

Crafting and Executing a Polyglot File: A PNG Image with Embedded Data

Beryllium