I’ll demonstrate how one can build a convolutional neural network capable of distinguishing between cancer types using a simple PyTorch classifier. The data and code used for training are publicly available and the training can be done on a personal computer, potentially even on a CPU.
Cancer is an unfortunate side-effect of our cells accumulating information errors over the course of our lives, leading to uncontrolled growth. As researchers, we investigate the patterns of these errors in order to understand the disease better. Seen from a data scientist's perspective, the human genome is a string of around three billion letters drawn from A, C, G, T (i.e. 2 bits of information per letter). A copying error or an external event can remove, insert, or change a letter, causing a mutation and potentially disrupting genomic function.
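The 2-bit figure can be checked with a quick calculation. The following is a minimal sketch: the encoding table and the pack helper are hypothetical and introduced only for illustration, and the genome length is the approximate figure quoted above.

```python
# Back-of-the-envelope size of a genome stored at 2 bits per letter.
GENOME_LENGTH = 3_000_000_000  # approximate figure from the text
BITS_PER_LETTER = 2            # four letters (A, C, G, T) -> log2(4) = 2 bits

total_bits = GENOME_LENGTH * BITS_PER_LETTER
total_megabytes = total_bits / 8 / 1_000_000

# Hypothetical 2-bit encoding, chosen only for this illustration.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(sequence: str) -> int:
    """Pack a DNA string into a single integer, 2 bits per letter."""
    value = 0
    for letter in sequence:
        value = (value << 2) | ENCODING[letter]
    return value


print(f"{total_megabytes:.0f} MB")  # 750 MB
print(pack("ACGT"))                 # 27, i.e. 0b00011011
```

In other words, an uncompressed genome fits comfortably in the memory of a personal computer.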
However, individual errors almost never lead to cancer development. The human body has multiple mechanisms to prevent cancer from developing, including dedicated proteins, the so-called tumor suppressors. A list of necessary conditions, the so-called “hallmarks of cancer”, must be met for a cell to sustain growth.

Therefore, changes to individual letters of the DNA are usually insufficient to cause self-sustained proliferative growth. The vast majority of mutation-mediated cancers (as opposed to other sources of cancer, for example the HPV virus) also exhibit copy number (CN) changes. These are large-scale events, often adding or removing millions of DNA bases at a time.

These vast changes to the structure of the genome lead to the loss of genes that would prevent the cancer from forming, while accumulating genes that promote cell growth. By sequencing the DNA of these cells, we can identify these changes, which quite often happen in regions specific to the cancer type. Copy number values for each allele can be derived from sequencing data using copy number callers.
Processing the Copy Number Profiles
One of the advantages of working with Copy Number (CN) profiles is that they are not biometric and therefore can be published without a need for access restrictions. This allows us to accumulate data over time from multiple studies to build datasets of sufficient size. However, the data coming from different studies is not always directly comparable, as it may be generated using different technologies, have different resolutions, or be pre-processed in different ways.
To obtain the data and jointly process and visualize them, we will be using the tool CNSistent, developed as part of work of the Institute for Computational Cancer Biology of the University Clinic, Cologne, Germany.
First we clone the repository, which includes the data, and check out the version used in this text:
git clone [email protected]:schwarzlab/cnsistent.git
cd cnsistent
git checkout v0.9.0
Since the data we will be using are inside of the repository (~1GB of data), it takes a few minutes to download. For cloning both Git and Git LFS must be present on the system.
Inside the repository is a requirements.txt file that lists all the dependencies, which can be installed using pip install -r requirements.txt. (Creating a virtual environment first is recommended.) Once the requirements are installed, CNSistent can be installed by running pip install -e . in the same folder. The -e flag installs the package from its source directory, which is necessary for access to the data through the API.
The repository contains raw data from three datasets: TCGA, PCAWG, and TRACERx. These first need to be pre-processed, which can be done by running the script bash ./scripts/data_process.sh.
Now that we have processed the datasets, we can load them using the CNSistent data utility library:
import cns.data_utils as cdu
samples_df, cns_df = cdu.main_load("imp")
print(cns_df.head())
Producing the following result:
| | sample_id | chrom | start | end | major_cn | minor_cn |
|---:|:------------|:--------|---------:|---------:|-----------:|-----------:|
| 0 | SP101724 | chr1 | 0 | 27256755 | 2 | 2 |
| 1 | SP101724 | chr1 | 27256755 | 28028200 | 3 | 2 |
| 2 | SP101724 | chr1 | 28028200 | 32976095 | 2 | 2 |
| 3 | SP101724 | chr1 | 32976095 | 33354394 | 5 | 2 |
| 4 | SP101724 | chr1 | 33354394 | 33554783 | 3 | 2 |
This table shows the copy number data with the following columns:
sample_id: the identifier of the sample,
chrom: the chromosome,
start: the start position of the segment (0-indexed inclusive),
end: the end position of the segment (0-indexed exclusive),
major_cn: the number of copies of the major allele (the bigger of the two),
minor_cn: the number of copies of the minor allele (the smaller of the two).
On the first line we can therefore see a segment stating that sample SP101724 has 2 copies of the major allele and 2 copies of the minor allele (4 in total) in the region of chromosome 1 from 0 to 27.26 megabases.
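To make the reading of a row concrete, the segment lengths and total copy numbers can be derived directly from the columns. This is a small sketch in plain Python using the five rows printed above; the variable names are illustrative and not part of CNSistent.

```python
# Rows copied from the table above:
# (sample_id, chrom, start, end, major_cn, minor_cn)
rows = [
    ("SP101724", "chr1", 0, 27256755, 2, 2),
    ("SP101724", "chr1", 27256755, 28028200, 3, 2),
    ("SP101724", "chr1", 28028200, 32976095, 2, 2),
    ("SP101724", "chr1", 32976095, 33354394, 5, 2),
    ("SP101724", "chr1", 33354394, 33554783, 3, 2),
]

lengths_mb = []
total_cns = []
for sample_id, chrom, start, end, major_cn, minor_cn in rows:
    # The end coordinate is exclusive, so the length is simply end - start.
    lengths_mb.append((end - start) / 1e6)
    total_cns.append(major_cn + minor_cn)

for (_, chrom, start, end, _, _), length, total in zip(rows, lengths_mb, total_cns):
    print(f"{chrom}:{start}-{end}  {length:6.2f} Mb  total CN = {total}")
```

Because the coordinates are half-open, the end of one segment equals the start of the next and the lengths add up without gaps or overlaps.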
The second dataframe we loaded, samples_df, contains the metadata for the samples. For our purposes only the type is important. We can investigate the available types by running:
import matplotlib.pyplot as plt
# Count the samples of each cancer type
type_counts = samples_df["type"].value_counts()
# Plot the counts as a bar chart
plt.figure(figsize=(10, 6))
type_counts.plot(kind='bar')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

In the table shown above, we can observe a potential problem with the data: the lengths of the individual segments are not uniform. The first segment is 27.26 megabases long, while the second one is only 0.77 megabases long. This is a problem for the neural network, which expects the input to be of a fixed size.
We could technically take all existing breakpoints and create segments between all breakpoints in the dataset, a so-called minimum consistent segmentation. This would, however, result in a huge number of segments: a quick check using len(cns_df["end"].unique()) shows that there are 823652 unique breakpoints.
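The idea of a minimum consistent segmentation can be sketched on toy data. The function below is a hypothetical illustration, not a CNSistent API: it assumes every sample covers the same region with sorted, half-open segments, and the two profiles are made up.

```python
def minimum_consistent_segmentation(profiles):
    """Split a region at every breakpoint seen in any sample.

    profiles: dict mapping sample name -> list of (start, end) segments,
    half-open, sorted, all covering the same region.
    """
    breakpoints = sorted(
        {pos for segs in profiles.values() for seg in segs for pos in seg}
    )
    # Consecutive breakpoints delimit the consistent segments.
    return list(zip(breakpoints[:-1], breakpoints[1:]))


# Made-up toy profiles over a region of length 100
profiles = {
    "sample_a": [(0, 40), (40, 100)],
    "sample_b": [(0, 25), (25, 70), (70, 100)],
}
segments = minimum_consistent_segmentation(profiles)
print(segments)  # [(0, 25), (25, 40), (40, 70), (70, 100)]
```

Every additional sample can only add breakpoints, which is why the number of segments grows so quickly across thousands of samples.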
Alternatively, we can use CNSistent to create a new segmentation using a binning algorithm. This will create segments of a fixed size, which can be used as input to the neural network. In our work we have determined 1–3 megabase segments to provide the best trade-off between accuracy and overfitting. We first create the segmentation and then apply it to obtain new CNS files using the following Bash script:
threads=8
cns segment whole --out "./out/segs_3MB.bed" --split 3000000 --remove gaps --filter 300000
for dataset in TRACERx PCAWG TCGA_hg19;
do
    cns aggregate ./out/${dataset}_cns_imp.tsv --segments ./out/segs_3MB.bed --out ./out/${dataset}_bin_3MB.tsv --samples ./out/${dataset}_samples.tsv --threads $threads
done
The loop processes each dataset separately, while maintaining the same segmentation. The --threads flag speeds up the process by running the aggregation in parallel; adjust the value according to the number of cores available.
The --remove gaps --filter 300000 arguments remove regions of low mappability (aka gaps) and filter out segments shorter than 300 Kb. The --split 3000000 argument creates segments of 3 Mb.
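Conceptually, binning replaces variable-length segments with one value per fixed-size bin. The sketch below is a simplified illustration and not CNSistent's algorithm: make_bins and aggregate are hypothetical helpers, the length-weighted mean is an assumed aggregation rule, gap removal is omitted, and the toy chromosome and segments are made up.

```python
def make_bins(chrom_length, bin_size, min_size):
    """Cut [0, chrom_length) into bins of bin_size and drop bins shorter than min_size."""
    bins = [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]
    return [(start, end) for start, end in bins if end - start >= min_size]


def aggregate(segments, bins):
    """Length-weighted mean copy number per bin.

    segments: list of (start, end, cn) with half-open coordinates.
    """
    result = []
    for b_start, b_end in bins:
        weighted, covered = 0.0, 0
        for s_start, s_end, cn in segments:
            overlap = min(b_end, s_end) - max(b_start, s_start)
            if overlap > 0:
                weighted += cn * overlap
                covered += overlap
        result.append(weighted / covered if covered else float("nan"))
    return result


# Toy chromosome of 9.2 Mb: the trailing 0.2 Mb bin falls below the 300 Kb filter
bins = make_bins(chrom_length=9_200_000, bin_size=3_000_000, min_size=300_000)
segments = [(0, 4_500_000, 2), (4_500_000, 9_200_000, 4)]
binned = aggregate(segments, bins)
print(bins)    # [(0, 3000000), (3000000, 6000000), (6000000, 9000000)]
print(binned)  # [2.0, 3.0, 4.0]
```

The middle bin straddles the breakpoint at 4.5 Mb, so under this assumed rule its value is the average of the two neighbouring copy numbers.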
Non-small-cell Lung Carcinoma
In this text we will focus on classification of non-small-cell lung carcinoma, which accounts for about 85% of all lung cancers, in particular the distinction between adenocarcinoma and squamous-cell carcinoma. It is important to differentiate between the two as their treatment regimes will be different and new methods give hope for non-invasive detection from blood samples or nasal swabs.
We will load the segments produced above using a provided utility function. Since we are classifying between two types of cancer, we can filter the samples to only include the relevant types, LUAD (adenocarcinoma) and LUSC (squamous cell carcinoma), and plot the first three samples:
import cns
samples_df, cns_df = cdu.main_load("3MB")
samples_df = samples_df.query("type in ['LUAD', 'LUSC']")
cns_df = cns.select_CNS_samples(cns_df, samples_df)
cns_df = cns.only_aut(cns_df)
cns.fig_lines(cns.cns_head(cns_df, n=3))
Major and minor copy number segments in 3MB bins for the first three samples. In this case all three samples come from multi-region sequencing of the same patient, demonstrating how heterogeneous cancer cells may be even within a single tumor.
Convolutional Neural Network Model
Running the code requires Python 3 with PyTorch 2+ to be installed and a Bash-compatible shell. An NVIDIA GPU is recommended for faster training, but not necessary.
First we define a convolutional neural network with three convolutional layers:
import torch.nn as nn
class CNSConvNet(nn.Module):
    def __init__(self, num_classes):
        super(CNSConvNet, self).__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv1d(in_channels=2, out_channels=16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2)
        )
        self.fc_layers = nn.Sequential(
            nn.LazyLinear(128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )
    def forward(self, x):
        x = self.conv_layers(x)
        x = x.view(x.size(0), -1)
        x = self.fc_layers(x)
        return x
This is a boilerplate deep CNN with 2 input channels, one for each allele, and 3 convolutional layers using a 1D kernel of size 3 and the ReLU activation function. Each convolutional layer is followed by a max pooling layer with a kernel size of 2. Convolution is traditionally used for edge detection, which is useful for us as we are interested in changes in the copy number, i.e. the edges of the segments.
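The edge-detection intuition can be illustrated with a hand-picked kernel. This is only a sketch: in the model the kernel weights are learned rather than fixed, each output channel also sums over input channels and adds a bias, and the toy profile is made up. The sliding dot product shown here is the cross-correlation that nn.Conv1d computes.

```python
def correlate1d(signal, kernel):
    """Sliding dot product without padding (cross-correlation)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Made-up copy number profile with breakpoints after the 3rd and 6th bin
profile = [2, 2, 2, 3, 3, 3, 1, 1]
edge_kernel = [-1, 0, 1]  # responds to a difference between neighbouring bins

response = correlate1d(profile, edge_kernel)
print(response)  # [0, 1, 1, 0, -2, -2]
```

The response is zero inside flat regions and non-zero around the breakpoints, with the sign indicating a gain or a loss.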
The output of the convolutional layers is then flattened and passed through two fully connected layers with dropout. The LazyLinear layer connects the flattened output of the 64 stacked channels to one layer of 128 nodes, without our needing to calculate how many nodes there are at the end of the convolution. This is where most of our parameters are, so we also apply dropout to prevent overfitting.
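The size that LazyLinear infers, and the claim that the first fully connected layer holds most of the parameters, can be checked by hand. The sketch below assumes a hypothetical input of 960 bins and 2 output classes; the actual number of bins depends on the segmentation and is not stated here.

```python
N_BINS = 960     # hypothetical number of 3 Mb bins (assumption)
NUM_CLASSES = 2  # LUAD and LUSC


def conv1d_params(c_in, c_out, kernel_size):
    """Number of weights plus biases in a Conv1d layer."""
    return c_in * c_out * kernel_size + c_out


length = N_BINS
for _ in range(3):
    # kernel_size=3 with padding=1 keeps the length unchanged;
    # MaxPool1d(kernel_size=2) halves it, rounding down.
    length //= 2

flat_features = 64 * length  # what LazyLinear sees after flattening

conv_params = (
    conv1d_params(2, 16, 3) + conv1d_params(16, 32, 3) + conv1d_params(32, 64, 3)
)
first_fc_params = flat_features * 128 + 128
last_fc_params = 128 * NUM_CLASSES + NUM_CLASSES

print(flat_features)    # 7680
print(conv_params)      # 7888
print(first_fc_params)  # 983168
print(last_fc_params)   # 258
```

Under this assumption the first fully connected layer accounts for more than 99% of all parameters, which is why dropout is applied there.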
Training the Model
First we have to convert from dataframes to Torch tensors. We use a utility function bins_to_features, which creates a 3D feature array of the format (samples, alleles, segments). In the process we also split the data into training and testing sets in the 4:1 ratio:
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# convert data to features and labels
features, samples_list, columns_df = cns.bins_to_features(cns_df)
# convert data to Torch tensors
X = torch.FloatTensor(features)
label_encoder = LabelEncoder()
y = torch.LongTensor(label_encoder.fit_transform(samples_df.loc[samples_list]["type"]))
# Test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Create dataloaders
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=32, shuffle=False)
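The (samples, alleles, segments) layout matches the (batch, channels, length) input that Conv1d expects. The following toy sketch shows how long-format rows map onto that layout; it is not the implementation of bins_to_features, and the rows are made up.

```python
# Toy long-format rows: (sample_id, bin_index, major_cn, minor_cn); values are made up
rows = [
    ("s1", 0, 2, 1), ("s1", 1, 3, 1), ("s1", 2, 2, 0),
    ("s2", 0, 1, 1), ("s2", 1, 1, 1), ("s2", 2, 4, 2),
]

sample_ids = sorted({row[0] for row in rows})
n_bins = 1 + max(row[1] for row in rows)

# features[sample][allele][bin], i.e. shape (samples, alleles, segments)
features = [[[0] * n_bins for _ in range(2)] for _ in sample_ids]
for sample_id, bin_index, major_cn, minor_cn in rows:
    i = sample_ids.index(sample_id)
    features[i][0][bin_index] = major_cn  # channel 0: major allele
    features[i][1][bin_index] = minor_cn  # channel 1: minor allele

shape = (len(features), len(features[0]), len(features[0][0]))
print(shape)        # (2, 2, 3)
print(features[1])  # [[1, 1, 4], [1, 1, 2]]
```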
We can now train the model using the following training loop with 20 epochs. The Adam optimizer and cross-entropy loss are typically used for classification tasks, so we use them here as well:
# setup the model, loss, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNSConvNet(num_classes=len(label_encoder.classes_)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Clear gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Print statistics
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_loader):.4f}')
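The loss used in the loop operates on raw scores (logits), which is why the model has no softmax layer at the end. As a sketch of what the cross-entropy computes for a single sample (nn.CrossEntropyLoss averages this over the batch by default), with made-up logits:

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Made-up logits for the two classes (index 0 and index 1)
confident_right = cross_entropy([3.0, -1.0], target=0)
undecided = cross_entropy([0.0, 0.0], target=0)
confident_wrong = cross_entropy([3.0, -1.0], target=1)

print(f"{confident_right:.4f} {undecided:.4f} {confident_wrong:.4f}")
```

A confident correct prediction costs almost nothing, an undecided one costs log(2), and a confident wrong prediction is penalised heavily.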
This concludes the training. Afterwards, we can evaluate the model and print the confusion matrix:
import numpy as np
from sklearn.metrics import confusion_matrix
import seaborn as sns
# Loop over batches in the test set and collect predictions
model.eval()
y_true = []
y_pred = []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        y_true.extend(labels.cpu().numpy())
        # The predicted class is the one with the highest score
        y_pred.extend(outputs.argmax(dim=1).cpu().numpy())
# Calculate accuracy and confusion matrix
accuracy = (np.array(y_true) == np.array(y_pred)).mean()
cm = confusion_matrix(y_true, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(3, 3), dpi=200)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix, accuracy={:.2f}'.format(accuracy))
plt.savefig("confusion_matrix.png", bbox_inches='tight')
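Beyond overall accuracy, the confusion matrix also yields per-class recall and precision, which matter when the two classes are not equally easy to detect. The sketch below uses a hypothetical matrix with made-up counts (not results from this article), laid out as scikit-learn does it with true classes in rows and predicted classes in columns.

```python
# Hypothetical confusion matrix; the counts are made up for illustration.
# Rows are true classes, columns are predicted classes.
classes = ["LUAD", "LUSC"]
cm = [
    [90, 10],
    [20, 80],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Recall: fraction of each true class that was recovered
recall = {c: cm[i][i] / sum(cm[i]) for i, c in enumerate(classes)}
# Precision: fraction of each predicted class that was correct
precision = {c: cm[i][i] / sum(row[i] for row in cm) for i, c in enumerate(classes)}

print(f"accuracy = {accuracy:.2f}")
for c in classes:
    print(f"{c}: recall = {recall[c]:.2f}, precision = {precision[c]:.2f}")
```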

The training process takes about 7 seconds total on an NVIDIA RTX 4090 GPU.
Conclusion
We have developed an efficient and accurate classifier of lung cancer subtype from copy number data. As we have shown, such models transfer well to new studies and sources of sequence data.
Mass-scale AI is often justified, among other things, as a “solution to cancer”. However, as in this article, small models with classical approaches usually serve their purpose well. Some even argue that the actual obstacle for machine learning in biology and medicine is not in solving problems, but in actually making an impact for patients.
Still, machine learning has been able to mostly solve at least one major puzzle of computational biology, reinvigorating the focus on machine learning for cancer. With some luck, we might be able to look at the next decade as the time we finally “solved” cancer.
Bonus: Cell2Sentence
The foundational models of today often contain information, such as which genes have high copy numbers in which cancers, as a part of their training set. This thought led to the creation of LLM-based approaches, like Cell2Sentence, where a dataset is converted to natural language and submitted to an LLM.
We can utilize this approach here. First we convert a random LUSC sample to a ranked list of genes, based on their copy number value:
import cns
import cns.data_utils as cdu
samples_df, cns_df = cdu.main_load("COSMIC")
cns_df = cns.add_total_cn(cns_df)
lusc_df = cns.select_cns_by_type(cns_df, samples_df, "LUSC")
lusc_one_df = cns.cns_head(lusc_df, 1)
rank_list = lusc_one_df.sort_values(by="total_cn", ascending=False)["name"].to_list()
Then we convert the list of genes to a natural language query:
query = "Below is a list of COSMIC {gene_names} ordered by descending total copy number in a cell. Your task is to identify the cancer type of that sample.\n\n"
query += "{gene_names}: " + " ".join(rank_list) + "\n\n"
query += "Answer:"
print(query)
When the query is submitted to ChatGPT 5, the model correctly identifies the LUSC cancer:
Based on the gene copy-number profile you listed, the strongest clue is the massive amplification of 11q13 (CCND1, FADD) together with SOX2, PIK3CA, TP63, EGFR, FGFR1, MYC, CCND1/CCND2/CCND3, CDK6, CDK4. This is a very characteristic pattern of squamous cell carcinoma of the lung (LUSC):
CCND1 amplification is one of the most frequent events in LUSC.
SOX2 amplification (3q26) is a defining driver of lung squamous carcinoma, rarely seen in adenocarcinoma.
TP63 amplification is again highly specific to squamous tumors.
Co-amplifications of PIK3CA, FGFR1, EGFR, MYC fit this profile.
The broad set of genes amplified (many regulators of squamous differentiation and RTK/RAS/PI3K signaling) reflects typical copy-number–driven LUSC biology.
By contrast, lung adenocarcinomas usually show KRAS, EGFR, ALK, ROS1, MET exon 14, ERBB2 point mutations/fusions and have fewer widespread squamous-type CNAs.
Answer: Lung squamous cell carcinoma (LUSC).
However, generating this output for one sample takes longer than classifying the whole dataset with our model, and it would cost around $200 in API fees to classify our whole dataset.