DOI: http://dx.doi.org/10.26483/ijarcs.v8i8.4613
Volume 8, No. 8, September-October 2017
International Journal of Advanced Research in Computer Science
RESEARCH PAPER
Available Online at www.ijarcs.info
© 2015-19, IJARCS All Rights Reserved
MACHINE LEARNING FORENSICS: A NEW BRANCH OF DIGITAL FORENSICS
Prerak Bhatt
IFS, Gujarat Forensic Sciences University,
Gandhinagar, Gujarat
Parag H. Rughani, Ph. D.
IFS, Gujarat Forensic Sciences University,
Gandhinagar, Gujarat
Abstract— The objective of this research is to understand how machine learning can be used in digital crime and why it holds forensic importance, by setting up an environment to train artificial neural networks and then investigating and analyzing the trained systems to find artefacts that can be helpful in a forensic investigation.
Keywords- Machine Learning, ML forensics, AI forensics, TensorFlow, AI-related crimes
I. INTRODUCTION
Artificial Intelligence (AI) and Machine Learning (ML) have been around for a long time, but only now do we have enough computational power to effectively develop strong artificial neural networks (ANNs) in a sensible time frame, with the help of strong hardware and software support.
Companies such as Google, Amazon, and Samsung are investing heavily in AI technologies and funding related research. Google CEO Sundar Pichai announced the company's vision to be "AI-First" at Google I/O 2017, saying: "It's all about a transition, from searching and organizing the world's information to AI and machine learning." [1]
Google also unveiled the "TensorFlow Research Cloud" program, which provides researchers free access to 1,000 cloud TPUs (Tensor Processing Units) on the condition that they open source their code and research. [2]
As AI becomes widely available to more and more people, the potential for the technology to be used for malicious purposes increases significantly. To counter this and be prepared for the forensic challenges of crimes committed with AI or ML, forensic evaluation and analysis of the technology is necessary.
This paper demonstrates the implementation of an open source machine learning program, "DeepQA", and a forensic analysis of the system while the program was in training and testing modes. It also lists important artefact findings that can serve as a reference in future cases to prove or determine that a TensorFlow-based machine learning technique was used on provided evidence.
II. FORENSIC IMPORTANCE
AI and ML have great advantages and a bright future ahead. But the same technology can inevitably be used to craft, automate and execute serious crimes, some of which can be deadly for people.
For instance, hackers can develop an ANN that scans new versions of popular apps for unknown vulnerabilities, then exploits them and/or reports them back to the hacker. Done manually, pentesting an app can take a long time; with the help of ML it can be done quickly, across multiple different apps at the same time, with machine efficiency. It makes the hacker's job easy and fast.
Hackers are available for hire on the dark web. It is now possible that AI-powered bots will replace them, doing the job more efficiently and quickly than a human can, and earning their operators more money than before.
The task in the 2016 DARPA (Defense Advanced Research Projects Agency) Cyber Grand Challenge was to create an AI that could correct provided buggy code, patch vulnerabilities in its own system and look for intrusions by opponents, all with minimal human interaction. The winning prize was $2 million, and the competition lasted about 8 hours. [3]
"Spear phishing is going to become really, really good
when machine learning is incorporated into it on the attacking
side," says Dave Palmer, director of technology at Darktrace, a
cybersecurity firm which deploys machine learning in its
technology. [4]
There are thus many ways in which ML and AI can make a criminal's, hacker's or terrorist's job easier and quicker.
This is why ML and AI hold great forensic importance: it is a new field for forensic investigators, and the scope for research is large. We found no prior research on ML algorithms or frameworks that suggests how to investigate or identify whether AI or ML technology was used in the commission of a digital crime.
III. ENVIRONMENT SETUP
We need a specific environment in order to develop ANN-based programs. It requires a powerful CPU and/or GPU, because training an ANN model is a resource-consuming task.
A. Hardware
• CPU: Intel Core i5-6600K
• GPU: Nvidia GeForce GTX 1070
• RAM: 8GB
B. Software
• OS: Ubuntu 16.04
• Parallel Computing Platform: Nvidia CUDA
• ML Library/Framework: cuDNN & TensorFlow 1.0.0
• Language Platform: Python 2 & Python 3
• ML Program: DeepQA [5]
• Dataset: Cornell Movie Dialogs [6]
Plus other dependencies for the above-mentioned programs.
C. Why use GPU?
Training deep neural networks can be a time-consuming process. It involves a large number of matrix multiplications and other mathematical operations which, if parallelized, can be sped up significantly.
A typical workstation CPU currently has 8-10 cores, while a single GPU can have thousands of cores. Although individual GPU cores are slower than CPU cores, their sheer number more than compensates.
The Nvidia GeForce GTX 1070 GPU has 1920 CUDA cores clocked at 1506 MHz, with 8GB of VRAM.
D. Why TensorFlow?
Google open sourced TensorFlow on November 9, 2015. Since then it has grown into one of the most sophisticated and well-written libraries for machine learning, with support for both CPU and GPU acceleration. Its open source nature has attracted many machine learning programmers, who use it in creative ways to build different types of programs and services.
After the announcements at Google I/O, the use of TensorFlow will spread even further among developers, and more open source projects will be released.
For these reasons, I chose the TensorFlow library to build an ML ANN.
IV. IMPLEMENTATION
A. Ubuntu 16.04 & nVidia utilities:
I installed Ubuntu 16.04 on a workstation with separate /home, / (root) and SWAP partitions, and installed all the necessary programs such as Python 2, Python 3 and the other required dependencies.
I then installed the latest Nvidia graphics drivers, along with the CUDA and cuDNN utilities that give TensorFlow a bridge between Python and the GPU for training a neural network.
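Before installing TensorFlow, it is worth verifying that the driver and toolkit are visible to the system. As a hedged example (version output will differ per setup), the standard Nvidia utilities can be queried like this:

$nvidia-smi
$nvcc --version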
B. Tensorflow:
I cloned TensorFlow from its GitHub repository and configured the build by running the ./configure script in the tensorflow directory. Configuration includes specifying the location of the CUDA installation, the cuDNN installation, the compute capability of the GPU, the Python installation directory, etc.
After configuring TensorFlow correctly, I built a pip package and installed TensorFlow into my Python 3 installation.
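For reference, a TensorFlow 1.0-era source build typically followed a sequence like the one below. This is an illustrative sketch of the documented procedure; the exact flags and output paths depend on the choices made in ./configure:

$./configure
$bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl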
C. DeepQA ChatBot Program:
DeepQA is an open source neural conversational model. It uses an RNN (seq2seq model) for sentence prediction and is built on Python and TensorFlow. The advantage of this program is that it supports a variety of conversational datasets available for research purposes.
DeepQA also provides code to set up a Django web server that gives the chatbot a nice graphical interface to play with. The developer of the program is responsive to queries and keeps the project up to date.
Because of this program's versatility and active development, I chose it to train an ANN chatbot.
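For context, the kind of graph DeepQA constructs can be sketched with TensorFlow 1.0's legacy seq2seq module. This is a minimal sketch, not DeepQA's actual code; the sequence length, cell size, vocabulary size and embedding size below are illustrative placeholders:

import tensorflow as tf

# Token-ID placeholders for a fixed number of encoder/decoder time steps.
encoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(10)]
decoder_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(10)]

cell = tf.contrib.rnn.BasicLSTMCell(512)  # illustrative hidden size

# Embeds the token IDs, encodes the input sequence, decodes a reply.
outputs, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=40000,  # illustrative vocabulary size
    num_decoder_symbols=40000,
    embedding_size=64)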
D. Training chat bot:
To start training a neural network on the "Cornell movie dialogs" dataset, I entered the following command:
$python3 main.py --corpus cornell
Figure 1: Training on the Cornell corpus
Several parameters and variables are displayed in Figure 1. They are explained as follows:
An epoch usually means one iteration over all of the training data. For instance, if you have 20,000 images and a batch size of 100, then an epoch contains 20,000 / 100 = 200 steps.
The loss measures the error between two tensors, or between a tensor and zero. Losses can be used to measure the accuracy of a network in a regression task or for regularization purposes.
The perplexity metric in ML captures the degree of 'uncertainty' a model has in predicting (assigning probabilities to) some text. It is related to Shannon's entropy: the lower the entropy (uncertainty), the lower the perplexity.
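Concretely, for a model trained with an average cross-entropy loss per word, perplexity is just the exponential of that loss. A minimal illustration (the loss value is an example, not a measured one):

import math

loss = 4.5                   # example average cross-entropy per word
perplexity = math.exp(loss)  # ~90: as uncertain as a uniform pick among ~90 words
print(perplexity)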
You can finish training at any time by pressing Ctrl+C in the terminal; the program saves the model as a .ckpt file in the /save directory and exits. You can later resume training from the same step where it left off.
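The save-and-resume behaviour rests on TensorFlow's checkpoint mechanism. A minimal sketch of the underlying API; the variable and checkpoint path here are illustrative, not DeepQA's actual code:

import tensorflow as tf

weights = tf.Variable(tf.zeros([10]), name="weights")
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "save/model.ckpt")     # what saving on Ctrl+C amounts to
    saver.restore(sess, "save/model.ckpt")  # what resuming a run amounts to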
Training an ANN generally takes a good amount of time, depending on the size of the dataset and the number of epochs (full training cycles) you run. The parameters used to train the ANN also affect the training time.
It took me about 7 to 8 hours to train an ANN model with the default parameters for 30 epochs on the "Cornell Movie Dialogs" dataset. Initially, it did not give good results.
But trying different values for parameters such as the learning rate (lr) and maximum sentence length started to give better results.
In total I spent more than 24 hours training different models to improve the results, each run consisting of 30 epochs.
E. Testing chat bot:
To test the trained model and see how well it has learned from our dataset, I entered the following command:
$python3 main.py --test interactive
It provides a command line interface where you can type a
question or a message and the chatbot will reply to it.
Figure 2 – testing the chatbot
The bot's replies to the questions can be seen in the figure above. They are not great, but they are somewhat contextual to the questions asked.
Training it on a better dataset, for longer, and with a well-chosen learning rate and other parameters can give better results.
Now imagine we feed this model a dataset consisting of conversations between a bank's support employee and a client. Trained properly and for long enough, the bot could convince a client that they are talking to a real, legitimate person, and the client might trust it enough to reveal their information.
You can visualize the computational graph, the cost of the ANN and the word embeddings of our model with TensorBoard; just run the command tensorboard --logdir save/.
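The data TensorBoard renders comes from summary files written during training. As a hedged sketch of how such logs are produced with the TensorFlow 1.x summary API (the tracked variable here is illustrative):

import tensorflow as tf

cost = tf.Variable(0.0, name="cost")
tf.summary.scalar("cost", cost)  # track the cost value over time
merged = tf.summary.merge_all()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter("save/", sess.graph)  # matches --logdir save/
    writer.add_summary(sess.run(merged), global_step=0)
    writer.close()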
Word embedding is the collective name for a set of
language modelling and feature learning techniques in natural
language processing (NLP) where words or phrases from the
vocabulary are mapped to vectors of real numbers in a low-
dimensional space relative to the vocabulary size. [7]
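In TensorFlow terms, an embedding is a trainable matrix of shape vocabulary × dimension, and words are looked up by integer ID. A minimal sketch with illustrative sizes, not our model's actual parameters:

import tensorflow as tf

# 40,000-word vocabulary mapped to 64-dimensional vectors (illustrative sizes).
embedding = tf.Variable(tf.random_uniform([40000, 64], -1.0, 1.0))
word_ids = tf.constant([12, 407, 3])                   # example token IDs
vectors = tf.nn.embedding_lookup(embedding, word_ids)  # shape [3, 64]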
Figure 3 – TensorBoard word embeddings
The embeddings of our trained model can be seen in the screenshot above. The cloud is quite dense, which indicates a well-trained model containing a large number of word vectors.
V. FORENSIC ACQUISITION AND ANALYSIS
After training a functional neural network that produces decent output, it is time to forensically analyze the system for artefacts that help us determine that it was used to build and test a neural network based on TensorFlow.
A. Tools used:
• LiME: "Linux Memory Extractor" (LiME) is used to take live RAM dumps in .lime and .raw formats. [8]
• Rufus: used to create a bootable Ubuntu 16.04 USB thumb drive. [9]
• Disks Utility: part of the Ubuntu live system; it lets you create images of individual partitions or the whole disk. [10]
• EnCase: used to investigate disk images and RAM dumps, find relevant artefacts, and produce a report based on the findings and other technical information about the system. [11]
B. Live memory capture with LiME:
LiME is a Loadable Kernel Module (LKM) which allows volatile memory acquisition from Linux and Linux-based devices. It is loaded with the insmod command, passing the required arguments for its execution.
After cloning the LiME source code from GitHub, you need to build a kernel module that matches your running Linux kernel: a module built for one kernel cannot be loaded on another, and attempting to do so can in some cases be fatal for the OS.
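In practice the acquisition follows LiME's documented build-and-load steps. As a hedged example (the module filename encodes the running kernel version, and the output path here is illustrative):

$make
$sudo insmod lime-4.4.0-generic.ko "path=/home/user/TrainingRAMdump.lime format=lime"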
I loaded the LiME kernel module while the DeepQA program was in training mode.
Figure 4 – LiME while training
After taking the RAM dump while the program was training, I put the program in testing mode and took a second RAM dump the same way.
Figure 5 – LiME while testing
So now we have two RAM dumps: one taken while the system was in training mode and one taken while it was in testing mode.
1. TrainingRAMdump
2. TestingRAMdump
C. Disk acquisition with 'Disks' Utility:
"Disks" is a tool that comes preinstalled with Ubuntu 16.04.
It lets you manage your hard disk partitions. You can create
new partitions, edit partitions, shrink, extend, mount, unmount
and take logical images of partitions in .img format.
I created an Ubuntu 16.04 live bootable USB thumb drive
and booted it up on my system.
I launched Disks utility and selected /home partition.
Clicked on settings icon on left and selected 'create logical
image' of the partition and provided the location to store a bit-
by-bit image of /home partition.
Figure 6 - /home image
I repeated the procedure for the remaining partitions, / (root) and swap.
We now have logical images of all three partitions used by Ubuntu, to be loaded into EnCase for investigation. The reasons for acquiring these partitions are explained below:
/home: the DeepQA program is hosted on this partition, and the RAM dumps taken with LiME are stored here.
/ (root): TensorFlow and the other dependency programs were installed on this partition; it is also the parent of all directories on Ubuntu.
SWAP: swap is used for paging, so it may hold volatile data useful to the investigation.
D. Forensic Analysis on EnCase:
Now comes the most interesting part of this project: analyzing the RAM dumps and hard disk images to find relevant artefacts.
I chose EnCase to analyse the evidence because it provides state-of-the-art solutions for evidence analysis, processing and report generation, and its interface is clean and easy to use.
The biggest advantage of EnCase is that its results can be cited in courts of law in the USA, India and other major countries.
I created a new case in EnCase and entered the appropriate information, such as the case name, case number, examiner name, case ID, etc.
After creating the case, I added the evidence files one by one, starting with the logical image of the /home partition. After an image is added as an evidence file, the next step is to acquire it: EnCase produces a .E01 image from the raw image provided during the acquisition phase.
After acquiring the evidence image, I processed it with appropriate options such as System Info Parser, File Carver, Personal Information Extractor and Linux Artefact Parser.
I followed the same acquisition and processing procedure for the other two logical images, / (root) and SWAP.
Processing the SWAP partition did not yield any categorized data; it is shown as unallocated space, but some raw data can still be recovered from it.
After adding, acquiring and processing all the evidence files, the EnCase Evidence window looks like this:
Figure 7 – All evidence images
E. Findings:
I analyzed the evidence and found some concrete artefacts, explained below:
TensorFlow installation location:
One of the first and most important artefacts is whether TensorFlow is installed on the system.
TensorFlow is a Python library, so first check the location of the Python installation and then look for TensorFlow inside it. On Linux-based OSes, programs are installed under the /usr directory, so it is the first directory to examine for the Python installation.
TensorFlow is installed under /usr/local/lib/python3.5/dist-packages/tensorflow.
Figure 8 – TensorFlow location
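On a live system, the installed location can be cross-checked against the disk image by asking Python itself where the package resolves from; a quick illustrative command:

$python3 -c "import tensorflow as tf; print(tf.__file__)"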
Interestingly, this directory also contains other Python libraries that can be used as part of an ML program, such as speech_recognition and pyttsx.
Searching for keywords:
Seq2seq keyword:
seq2seq (sequence to sequence) is a TensorFlow class used to develop sequence-to-sequence recurrent neural network (RNN) models.
DeepQA is based on seq2seq modelling and is a recurrent neural network, so the probability of finding this class used in the creation of the model is high.
Netflix keyword:
The word 'Netflix' was part of the dataset used to train the ANN, so it is used as a search keyword to see whether it appears in the RAM dumps or on the SWAP partition.
GTX1070:
If the ANN was trained with TensorFlow on a GPU, the name of the GPU should appear somewhere in volatile data or in TensorFlow's parameters.
I added gtx1070 as a keyword to look for related artefacts, since a GTX 1070 GPU was used to train the chatbot.
Keyword hits:
It takes a good amount of time for EnCase to search all the evidence files for the provided keywords, but it checks them thoroughly and even shows whether a hit was found in unallocated space.
When the keyword search finishes, EnCase shows all the keyword hits in a single Keywords window, with the number of files and number of hits displayed next to each keyword's name. This can be seen in the screenshot below.
Figure 9 – Keyword hits
Seq2seq keyword hit:
Surprisingly, the seq2seq keyword got 277,875 hits across all evidence images, spread over 208 files. I analysed some of those files and found the following results.
I found the DeepQA Python script chatbot.py containing the seq2seq keyword; it can be seen that the TensorFlow seq2seq class is used in the code of this file.
Figure 10 – seq2seq hit
The same keyword was also found in the compiled file chatbot.pyc, which indicates that the script was run at least once on the system: a .pyc (compiled Python) file is normally generated only once the module has been imported, which for DeepQA happens when the program executes.
Figure 11 – seq2seq pyc
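One caveat worth noting for this inference: a .pyc can also be produced without running the program, for example via Python's standard py_compile module, so the artefact is strong but not absolute proof of execution:

$python3 -m py_compile chatbot.py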
The keyword seq2seq was also found in the model.ckpt file of our chatbot. This also confirms that an ANN was trained, since the model file is generated only once training of a neural network has started.
Figure 12 – seq2seq .ckpt
Netflix keyword hit:
I found Netflix keyword hits in some dataset files (.tsv). We can see that the word is mentioned in a conversation between two parties in the file content.
Figure 13 – Netflix dataset
I also found the Netflix keyword in a dataset.pkl file. The pickle module (.pkl) implements a fundamental but powerful algorithm for serializing and de-serializing a Python object structure. DeepQA stores its preprocessed dataset in a .pkl file so that it can be de-serialized quickly, including when the program runs in testing mode.
Figure 14 – Netflix pkl
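For illustration, this is how a .pkl file of that kind is written and read back; the structure shown is a placeholder, not DeepQA's actual dataset format:

import pickle

dataset = {"conversations": [["Hi", "Hello"]]}  # illustrative structure
with open("dataset.pkl", "wb") as f:
    pickle.dump(dataset, f, -1)                 # serialize, highest protocol
with open("dataset.pkl", "rb") as f:
    restored = pickle.load(f)                   # de-serialize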
One interesting find for this keyword was in the TrainingRAMdump file, captured while the program was training. This artefact confirms that the dataset containing the keyword was in use while the program was training.
Figure 15 – Netflix RAM
Based on the artefacts found for the Netflix keyword, I can confirm that the string 'netflix' was part of the dataset and that the dataset was in use while the program was training.
Gtx1070 keyword hit:
The keyword GTX1070 was found on the SWAP partition. Since the SWAP partition is treated as unallocated disk area, EnCase shows it as one single raw file.
The SWAP partition is used for paging, so since the keyword appears here alongside strings like 'tensorflow' in the content, we can conclude that GPU acceleration was used to train neural networks using TensorFlow.
Figure 16 – gtx1070 in SWAP
VI. CONCLUSION
Building an ANN with the help of machine learning has become easier with the introduction of Google's open source machine learning library, TensorFlow.
Having found the relevant artefacts in the investigation of the evidence images, I can conclude that a TensorFlow-based machine learning program was both trained and run on the system.
These findings can be used as a reference in future cases to detect or identify the use of machine learning libraries, algorithms and techniques.
However, a lot of research work remains to be done in this field. Deeper analysis of volatile information would be beneficial, and more in-depth analysis of neural networks may help us become more familiar with machine learning programs within the scope of digital forensics.
VII. REFERENCES
[1] https://venturebeat.com/2017/05/18/ai-weekly-google-shifts-from-mobile-first-to-ai-first-world/
[2] https://techcrunch.com/2017/05/17/the-tensorflow-research-cloud-program-gives-the-latest-cloud-tpus-to-scientists/
[3] https://techcrunch.com/2016/08/05/carnegie-mellons-mayhem-ai-takes-home-2-million-from-darpas-cyber-grand-challenge/
[4] http://www.zdnet.com/article/how-ai-powered-cyberattacks-will-make-fighting-hackers-even-harder/
[5] https://github.com/Conchylicultor/DeepQA
[6] http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
[7] https://en.wikipedia.org/wiki/Word_embedding
[8] https://github.com/504ensicsLabs/LiME
[9] https://rufus.akeo.ie/
[10] https://apps.ubuntu.com/cat/applications/precise/gnome-disk-utility/
[11] https://www.guidancesoftware.com/encase-forensic