Evaluation

Reference (Ground Truth)

The annotation was carried out by two teams, each of which included a radiology expert and an experienced medical image processing scientist. After both teams completed their manual annotations, a third radiology expert and another medical imaging scientist reviewed the labels, which were then refined through discussions between the annotators and the reviewers. In the CT database, only the liver is annotated, whereas four abdominal organs (liver, spleen, right and left kidneys) are annotated in the MR database.


Figure 1. Annotation examples for various abdominal CT images and some challenging cases: (a) Very low contrast between liver and vena cava. (b) Low contrast between liver and muscle tissue. (c) High contrast between liver and adjacent tissues (i.e. right kidney, muscle, vena cava). (d) Atypical liver shape that causes an unclear boundary with the spleen.


Figure 2. Annotation examples for MR images: (a) T1-DUAL, (b) T2-SPIR.

Evaluation Metrics

According to previous studies in the literature, a single evaluation metric cannot be defined for the organ segmentation problem. Therefore, multiple evaluation metrics are used, as in previous challenges (e.g. SLIVER07). The four evaluation metrics are:

  1. Sørensen–Dice coefficient: Quantifies the overlap between the segmented and reference volumes as a dimensionless ratio (1 for a perfect segmentation, 0 for the worst case).
  2. Relative absolute volume difference (RAVD): Also compares the segmented and reference organs, but measures the difference between their volumes rather than their overlap (0% for a perfect segmentation, 100% for the worst case).
  3. Average symmetric surface distance (ASSD): Measures the average distance between the surface of the segmented object and that of the reference in 3D. First, the border voxels of the segmentation and the reference are determined; these are the object voxels that have at least one neighbor, within a predefined neighborhood, that does not belong to the object. For each border voxel, the closest border voxel in the other set is found, and the average of all these distances gives the ASSD (0 mm for a perfect segmentation, max distance of image for the worst case).
  4. Maximum symmetric surface distance (MSSD): Similar to ASSD, but particularly important for surgical operations (such as the living-donor liver surgery of database 1), as it determines the maximum margin of error by taking the largest of all calculated distances (0 mm for a perfect segmentation, max distance of image for the worst case).
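
The four metrics above can be illustrated with a minimal NumPy sketch. This is not the official evaluation code (see the repository linked at the end of this section); it only follows the definitions given in the list. The function names, the 6-connected neighborhood used to find border voxels, the `spacing` parameter (voxel size in mm) and the brute-force distance computation are all assumptions made for illustration, and the brute-force step is only practical for small volumes.

```python
import numpy as np

def dice(seg, ref):
    """Sørensen–Dice coefficient of two binary volumes (1 = perfect overlap)."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    total = seg.sum() + ref.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * np.logical_and(seg, ref).sum() / total

def ravd(seg, ref):
    """Relative absolute volume difference in percent (reference must be non-empty)."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    return abs(int(seg.sum()) - int(ref.sum())) / ref.sum() * 100.0

def border_voxels(mask):
    """Coordinates of object voxels with at least one background neighbor.

    Assumption: a 6-connected neighborhood; voxels outside the image
    are treated as background.
    """
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    interior = np.ones_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            # Neighbor along this axis; the wrapped-around slice is cropped away.
            interior &= np.roll(padded, shift, axis=axis)[core]
    return np.argwhere(mask & ~interior)

def surface_distances(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Nearest-border distances in both directions (seg->ref, ref->seg), in mm."""
    a = border_voxels(seg) * np.asarray(spacing)
    b = border_voxels(ref) * np.asarray(spacing)
    # Brute-force pairwise distances: O(N*M) memory, small volumes only.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1), d.min(axis=0)

def assd(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance in mm."""
    d_sr, d_rs = surface_distances(seg, ref, spacing)
    return (d_sr.sum() + d_rs.sum()) / (len(d_sr) + len(d_rs))

def mssd(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Maximum symmetric surface distance in mm."""
    d_sr, d_rs = surface_distances(seg, ref, spacing)
    return max(d_sr.max(), d_rs.max())

# Toy example: a 4x4x4 cube and the same cube shifted by one voxel.
ref = np.zeros((8, 8, 8), dtype=bool)
ref[2:6, 2:6, 2:6] = True
seg = np.zeros_like(ref)
seg[3:7, 2:6, 2:6] = True

print(dice(seg, ref), ravd(seg, ref), assd(seg, ref), mssd(seg, ref))
```

In this toy case the two cubes have equal volume, so RAVD is 0% even though the overlap is imperfect, which illustrates why the volume-based and overlap-based metrics are reported together.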

The results of these four metrics are converted to grades on a 0-100 scale with the help of predefined thresholds. Since the choice of thresholds has a critical effect on the overall score, the thresholds were calculated from the intra- and inter-user similarities among the radiology experts who created the ground truth.
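
As a rough illustration of how a metric value could be turned into a grade, the sketch below assumes a linear mapping between the best possible value (grade 100) and the threshold (grade 0), with anything beyond the threshold also graded 0. The text above does not specify the mapping or the threshold values, so the function `metric_to_grade` and the numbers in `EXAMPLE_THRESHOLDS` are hypothetical placeholders, not the values used by the challenge.

```python
def metric_to_grade(value, best, threshold):
    """Map a metric value to a 0-100 grade.

    Assumption: linear interpolation, with `best` giving 100 and
    `threshold` giving 0; values beyond the threshold are clipped to 0.
    Works for metrics where higher is better (e.g. Dice) and for
    metrics where lower is better (e.g. ASSD).
    """
    fraction = (value - threshold) / (best - threshold)
    return 100.0 * min(max(fraction, 0.0), 1.0)

# Hypothetical (best, threshold) pairs, for illustration only.
EXAMPLE_THRESHOLDS = {
    "dice": (1.0, 0.8),
    "ravd": (0.0, 5.0),    # percent
    "assd": (0.0, 15.0),   # mm
    "mssd": (0.0, 60.0),   # mm
}

def grade_case(metrics):
    """Grade each metric of one case and return the per-metric grades and their mean."""
    grades = {
        name: metric_to_grade(value, *EXAMPLE_THRESHOLDS[name])
        for name, value in metrics.items()
    }
    return grades, sum(grades.values()) / len(grades)

grades, mean_grade = grade_case(
    {"dice": 0.9, "ravd": 2.5, "assd": 3.0, "mssd": 70.0}
)
print(grades, mean_grade)
```

With such a scheme, a single metric that falls outside its threshold (the MSSD in this example) contributes a grade of 0 and pulls down the case average, which is why the choice of thresholds matters so much for the overall score.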

The evaluation code is available in MATLAB, Python and Julia. You may access it at https://github.com/emrekavur/CHAOS-evaluation