Deepfake gets an upgrade: change what a speaker says, and how their mouth moves, just by typing text

Photo by Christian Wiediger on Unsplash. This article is from the public account:

Originally meant to make video editing easier, this technology could become a nightmare for the truth.

In recent years, many techniques have emerged that use deep learning to edit video.

The most famous is Deepfake, which let celebrity face-swap videos run rampant on porn sites. Deep Video Portraits (DVP), introduced last year, can easily generate fake speech videos, which frightened news organizations and politicians.

If you think that is worrying enough, you are underestimating deep learning researchers. In their eyes, as long as it is done in the name of science, no technology is off limits, even one that could cause a serious moral crisis.

Recently, researchers developed a new deep learning technique for manipulating video. It can add, delete, or even rewrite a sentence so that the speaker in the video says whatever the editor wants, and the result looks so natural that it is as if the speaker had said it himself.

For example, a financial TV report originally said "Apple's stock price closed at $191.45", and the researchers changed the number to "$182.25". In English, the two numbers differ completely in pronunciation and mouth shape, yet the edit is hard to spot.

The scary part is how simple the manipulation is: you only need to edit the text of the video's transcript. The technique finds the position in the video that corresponds to the edited text, automatically generates the voice and face model, and then automatically composites them into a new video...
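To illustrate the first part of that idea, here is a minimal sketch of how an edited transcript could be compared with the original to locate the words, and the stretch of video, that need to be resynthesized. It is not the researchers' implementation. The word timings and the helper names `find_edits` and `edit_time_span` are assumptions made for illustration; only Python's standard `difflib` module is used.

```python
from difflib import SequenceMatcher

def find_edits(original_words, edited_words):
    """Return (operation, original word span, new words) for every changed region."""
    matcher = SequenceMatcher(a=original_words, b=edited_words, autojunk=False)
    return [
        (op, (i1, i2), edited_words[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

def edit_time_span(word_span, word_times):
    """Convert a span of replaced or deleted words into (start, end) seconds."""
    first, last = word_span
    return word_times[first][0], word_times[last - 1][1]

original = "Apple's stock price closed at 191.45 dollars".split()
edited = "Apple's stock price closed at 182.25 dollars".split()

# Hypothetical per-word (start, end) times in seconds, one pair per original word.
word_times = [(0.0, 0.4), (0.4, 0.8), (0.8, 1.1), (1.1, 1.5),
              (1.5, 1.6), (1.6, 3.2), (3.2, 3.7)]

for op, span, new_words in find_edits(original, edited):
    print(op, span, new_words, edit_time_span(span, word_times))
# prints: replace (5, 6) ['182.25'] (1.6, 3.2)
```

Only the changed span is returned, so everything outside it can be left untouched. Pure insertions (an empty original span) would need separate handling that this sketch leaves out.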

In a user study, the researchers found that 59.6% of the subjects judged videos edited with the technique to be real, while 20% judged the unedited videos to be fake.

In other words, video produced by this pipeline is good enough to fool most people's eyes.

The technology is not currently open to the public, and no editing software is available to ordinary users, because it is still in the research and testing phase. The researchers are from Stanford University, the Max Planck Institute for Informatics, Princeton University, and Adobe Research. The work has been submitted to SIGGRAPH 2019, a top computer graphics conference.

The following video shows how well the technique edits and how "real" the edited video looks:

This technology actually combines several deep learning methods, including speech recognition, mouth-shape (viseme) search, face recognition and reconstruction, and speech synthesis.

Put simply, the researchers first process the picture and sound of the video and separate out the frames and phonemes of the part to be modified. They then assemble the phonemes of the modified sentence, generate a new face model according to the pronunciation of those words, and finally blend and render everything into a new video.

Broken down, the steps are roughly as follows:

1) Input a video. It must be a talking-head video, that is, one in which the speaker's face (possibly with the upper body) fills the frame and speech is the main content;

2) Enter the words to be changed and the replacement text;

3) Use phoneme alignment to index the speech in the video for the later steps. Phonemes are the sound units that make up a word, much as the Chinese word for "apple" (pingguo) breaks down into the pinyin syllables ping and guo;

4) Use a mouth-shape (viseme) search to find, in the original video, the video segments and corresponding phonemes needed for the modification;

5a) On the audio side, the phonemes of the modified words are assembled and embedded into the original video;

5b) On the visual side, the face in the video is tracked and modeled, and the lower half of the face is then reconstructed for every frame according to the pronunciation of the modified words (most facial movement during speech does not involve anything above the nose), and a new, silent piece of video is rendered;

6) Synthesize the new speech from the speaker's voice data in the video, then mix and edit everything into a new video.
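Steps 3 and 4 can be pictured as building an index over the aligned speech and then querying it. The sketch below is illustrative only: it assumes a forced aligner has already produced `(phoneme, start, end)` triples, and the ARPAbet-style symbols, the timings, and the function names are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

def build_phoneme_index(aligned):
    """Map each phoneme to every (start, end) interval where it is spoken."""
    index = defaultdict(list)
    for phoneme, start, end in aligned:
        index[phoneme].append((start, end))
    return dict(index)

def missing_phonemes(index, target):
    """List the target phonemes that never occur in the source footage."""
    return [p for p in target if p not in index]

# Hypothetical aligner output for a short clip (times in seconds).
aligned = [
    ("ae", 0.00, 0.12), ("p", 0.12, 0.20), ("ah", 0.20, 0.27),
    ("l", 0.27, 0.40), ("p", 1.10, 1.18),
]

index = build_phoneme_index(aligned)
print(index["p"])                                   # [(0.12, 0.2), (1.1, 1.18)]
print(missing_phonemes(index, ["p", "l", "iy"]))    # ['iy']
```

An index like this shows at a glance which sounds the footage can supply and which it cannot, which is one reason longer input videos help.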

From left to right: frames corresponding to different phonemes. From top to bottom: the original video, the rendering, and the final composite.

The researchers recruited 138 people for a user study and asked them to watch three sets of videos and judge each as real or fake, that is, unedited or edited. The three sets are A (real), B (real), and C (a "fake video" that uses A as its base and replaces its words with the words and sentences of B). Moreover, the researchers told the subjects in advance that the study was about "video editing", so the subjects knew that some of what they saw was bound to be fake and were all the more alert in looking for giveaways.

59.6% of the subjects considered the C group to be a real video; 20% of the subjects considered the original, unedited video to be false.

The researchers also compared the new technique with Deepfake, MorphCut and DVP. They found that the new pipeline does better at synthesizing mouth movements and the inside of the mouth (teeth, tongue, and so on), whereas the inserted frames generated by its predecessors often look stiff, with flaws that are visible with a little attention.

Below: with Deepfake (Face2Face), ghost teeth appear in the inserted frame.

Below: DVP makes a clearly visible error when reconstructing the teeth.

Below: DVP has trouble reconstructing the upper-limb movements in the shot, which leads to a continuity error (a film and television term for an edit that produces an illogical picture, such as an object held in the hand disappearing between two frames).

A falsified Yoshua Bengio

Below: MorphCut (a feature in Adobe Premiere Pro that inserts computer-generated frames at an abrupt cut to smooth the picture) causes severe ghosting on the face of the person in the picture.

A pasted-together Yoshua Bengio.

The longer the input video, the better the final edit looks. With 40 minutes of footage, the researchers achieved their best results, those shown in the paper and the demo video. However, even with very little data, for example only two minutes of training video, the error rate of the final synthesized face is just 0.021, only 0.003 higher than with 40 minutes of video (0.018).


The paper mentions that there is no direct correlation between the length of the modified words and the quality of the result, but the results of the mouth-shape search and phoneme search do affect the final edit. For example, if the mouth shapes and pronunciation of the modified words never appear in the data set, the result may not be very good. (The parameter blending method the researchers use can partly make up for this. For example, "fox" can be assembled from "v" and "ox" and does not necessarily require a word containing "f".)
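The "fox" example works because some phonemes look alike on the lips even though they sound different. The following sketch shows that idea with a greedy search over mouth-shape classes. It is not the paper's parameter blending method: the tiny `VISEME` table, the ARPAbet-style symbols, the timings, and the function names are all simplifying assumptions.

```python
# Hypothetical, heavily simplified mouth-shape classes:
# phonemes in the same class look alike on the lips.
VISEME = {"p": "PBM", "b": "PBM", "m": "PBM", "f": "FV", "v": "FV"}

def viseme(phoneme):
    """Map a phoneme to its mouth-shape class (itself if it has no class)."""
    return VISEME.get(phoneme, phoneme)

def find_segments(source, target):
    """Greedily cover the target phonemes with the longest look-alike runs of source footage."""
    src = [viseme(p) for p, _, _ in source]
    tgt = [viseme(p) for p in target]
    segments, pos = [], 0
    while pos < len(tgt):
        best_len, best_start = 0, None
        for s in range(len(src)):
            n = 0
            while s + n < len(src) and pos + n < len(tgt) and src[s + n] == tgt[pos + n]:
                n += 1
            if n > best_len:
                best_len, best_start = n, s
        if best_len == 0:
            raise LookupError(f"no footage matches phoneme {target[pos]!r}")
        start = source[best_start][1]
        end = source[best_start + best_len - 1][2]
        segments.append((start, end, target[pos:pos + best_len]))
        pos += best_len
    return segments

# Hypothetical aligned footage of someone saying "very ... ox" (times in seconds).
source = [("v", 0.0, 0.1), ("eh", 0.1, 0.2), ("r", 0.2, 0.3), ("iy", 0.3, 0.4),
          ("aa", 1.0, 1.15), ("k", 1.15, 1.25), ("s", 1.25, 1.4)]

print(find_segments(source, ["f", "aa", "k", "s"]))
# [(0.0, 0.1, ['f']), (1.0, 1.4, ['aa', 'k', 's'])]
```

The speaker never says "f", but the frames for "v" share its mouth shape, so the search borrows them and takes the rest from "ox". When no look-alike exists at all, the sketch raises an error, which corresponds to the case where the result "may not be very good".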


After completing the above steps, the editor can modify the video at will, and if only a few words or sentences are changed, the time this takes is negligible compared with the training and preparation.

For example, within a day or two after a politician finishes a speech, a "fake video" that completely twists its meaning, yet shows no visible flaw, could appear online.

In the context of news, this technology immediately becomes deeply worrying. The approach requires a fair amount of computing power, so an ordinary passer-by may not be able to pull it off, but for hackers or hostile politicians who want to mount an organized smear campaign against a victim, the approach described in this paper could hardly be better suited.

Today, a British marketing agency posted a short clip of a Zuckerberg speech on its Instagram account. In the video, Zuckerberg, wearing his trademark "human" expression, says: "Imagine a person who has complete control over the stolen data of billions of people, all their secrets, their lives, their futures. I owe it all to Spectre. Spectre showed me that whoever controls the data controls the future."

Spectre is an installation art exhibition that the marketing agency is promoting, and the video is in fact marketing for the exhibition. The video itself was made with Deepfake-like technology from Canny AI, an Israeli company, and the voice was dubbed by someone who does not sound quite like Zuckerberg. The agency also made similar videos of other well-known people, such as Trump, Kim Kardashian, and Morgan Freeman.

If that video is harmless, another clip, made with no high technology at all, did great harm to one of the top politicians in the United States.

A few weeks ago, a video of House Speaker Nancy Pelosi appearing to slur her words circulated online. The video was soon found to have been edited with a very crude method to make her look drunk or as if she had had a stroke. Some social and video platforms, including Facebook, have refused to take the video down.

In today's environment of extreme polarization and confrontation, with fake news rampant, videos like these spread easily. More advanced technology will make such videos more convincing, and the harm to victims and the further tearing apart of society will only get worse.

The researchers state in the paper that the main purpose of the work is to lighten the workload of video editors (and of the content industry as a whole). For example, when an actor flubs a line or a shot is missed, a deep learning algorithm can now generate the correct picture and sound directly, with no need for an expensive reshoot.

Another important use case is translation. The paper (and its supplementary video) demonstrates generating video across languages: because the technique fundamentally works on mouth shapes and phonemes rather than words, it is not tied to one language (many European languages, for example, share phonemes).

If a film needs a Spanish version, the past practice was to translate it and dub it in post-production. With this technology, it becomes possible to generate the translated version directly, with accurate pronunciation and accurate mouth shapes.

Of course, a film is an extreme case. For something less demanding, say you are a makeup blogger who wants to reach an overseas audience, you could use this technology to generate videos in other languages, even if the pronunciation is not 100 percent accurate.

The last use case is creating a virtual voice assistant with a visual persona, such as an anime-style idol. With this technology, it should be possible to generate navigation guidance with a visible Lin Zhiling or Guo Degang. The researchers mention in the paper that, besides neural networks, their technique can also be used with the macOS speech synthesizer, which makes synthesizing the speech easier.

This article is from the public account Silicon Stars (ID: guixingren123). Authors: Spectrum, Doutzen

* This article represents the author's independent views and does not represent the position of Tiger Sniff Network. It was published on Tiger Sniff Network with the authorization of Silicon Stars and edited by Tiger Sniff Network. When reprinting, please credit the author in the text, keep the article intact (including Tiger Sniff's notes and other authorship information), and attach the source (Tiger Sniff Network) and a link to this page. Tiger Sniff reserves the right to pursue anyone who reprints the article without following these rules.
