JTeC: A Large Collection of Java Test Classes for Test Code Analysis and Processing

This repository is the companion for the dataset:

F. Coro, R. Verdecchia, E. Cruciani, B. Miranda, and A. Bertolino, "JTeC: A large collection of Java test classes for test code analysis and processing". Submitted for revision at MSR 2019 Data Showcase.

It contains the implementation of all the steps required to generate our dataset: (i) filtering of GitHub repositories, (ii) Java repository selection, (iii) test class identification, (iv) selection between original and forked repositories, and (v) local storage of test classes.

Dataset replication

To replicate the dataset, follow these steps:

Clone the repository
- git clone https://github.com/MSR19-JTeC/JTeC
Make sure to satisfy the following requirements:
- Have Python 3.0+ installed
- Possess a valid GitHub username and personal GitHub access token
Modify the file tokens.txt by replacing its fields with your personal GitHub username and access token
Execute the script that runs the JTeC generation steps in sequence (see Section "JTeC generation steps")
- sh JTeC_generator.sh

JTeC generation steps

The steps required to generate the dataset are implemented in the following four scripts, which must be executed sequentially in the order given. A brief description of each script follows:

repository_filtering.py - Script generating an index of GitHub public repositories (Step 1).
The final output of this script is a local .csv file containing, for each indexed public repository, the following fields: repositoryID, username of the repository creator, name of the repository, and programming languages associated with the repository.
selection_test_count.py - Script selecting Java repositories (Step 2) and identifying test classes of the selected repositories (Step 3). This script takes as parameter the programming language to be considered for the generation of the dataset, e.g. selection_test_count.py Java.
The final output of this script consists of a local .csv file containing the following information: user, repository, id, hash, date, n_tests, fork_id.
select.py - Script selecting, for each forked project, either the original or the fork, whichever contains more test classes (Step 4).
The final output of this script consists of a local .csv file containing the following information: user, repository, id, hash, date, n_tests, fork_id.
download_tests.py - Script downloading the test classes of the repositories selected by select.py (Step 5).
This script takes as input the list of repositories for which we want to download the test classes and create the dataset.
The final output of this script is: (i) the complete source code of the identified test classes, and (ii) a .csv file containing the following fields: user, repository, id, fork_id, hash, date, n_tests, SLOC, size.
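
To illustrate the fork selection of Step 4, here is a minimal sketch in Python. It is not the code of select.py. It assumes that fork_id holds the id of the original repository for a fork and is empty for an original, and that ties are resolved in favor of the row seen first; both are assumptions, as are the function name select_by_test_count and the sample rows, which are made up for illustration.

```python
import csv
import io

def select_by_test_count(rows):
    """Keep, for each original/fork family, the row with the most test classes.

    Assumption (not confirmed by the repository): fork_id is the id of the
    original repository for forks and is empty for originals.
    """
    best = {}
    for row in rows:
        # Group a fork with its original; an original forms its own family.
        family = row["fork_id"] or row["id"]
        current = best.get(family)
        # Strict comparison: on a tie, the row seen first is kept.
        if current is None or int(row["n_tests"]) > int(current["n_tests"]):
            best[family] = row
    return list(best.values())

# Hypothetical sample using the column names listed above.
SAMPLE = """user,repository,id,hash,date,n_tests,fork_id
alice,proj,1,aaa,2019-01-01,5,
bob,proj,2,bbb,2019-01-02,8,1
carol,other,3,ccc,2019-01-03,2,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selected = select_by_test_count(rows)
```

In this sample, the fork owned by bob replaces the original owned by alice because it has more test classes, while the unforked repository owned by carol is kept as is.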

Utility files

In addition to the scripts described in Section "JTeC generation steps", the dataset generation process makes use of two utility scripts and one utility file, namely:

request_manager.py - Script managing all GitHub requests and handling possible errors arising at request time, eventually returning a specific error number to the script that first sent the request.
credentials.py - Script loading from the file tokens.txt the username and access token required to query the GitHub API.
tokens.txt - Text file containing the GitHub username and personal GitHub access token.
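
As a rough illustration of how credentials could be loaded and used, the sketch below parses a username and token from text and builds an HTTP Basic Authorization header value. It is not the code of credentials.py. The layout of tokens.txt is not documented here, so the sketch assumes the username on the first non-empty line and the token on the second; the function names parse_credentials and basic_auth_header are hypothetical.

```python
import base64

def parse_credentials(text):
    """Return (username, token) parsed from the contents of a credentials file.

    Assumption: the first non-empty line is the username and the second
    non-empty line is the access token. The real tokens.txt may differ.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a username line and a token line")
    return lines[0], lines[1]

def basic_auth_header(username, token):
    """Build the value of an HTTP Basic Authorization header."""
    raw = "{}:{}".format(username, token).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

A caller would read tokens.txt, pass its contents to parse_credentials, and send the result of basic_auth_header as the Authorization header of each request. Whether the actual scripts authenticate this way is not stated in this README.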
