Reference (Ground Truth)

The annotation process was handled by two teams, each of which included a radiology expert and an experienced medical image processing scientist. After both teams completed their manual annotations, a third radiology expert and another medical imaging scientist analyzed the labels, which were then fine-tuned through discussions between annotators and controllers. In the CT database, only the liver is annotated, while four abdominal organs (liver, spleen, right and left kidneys) are annotated in the MR database.

Figure 6. Annotation examples for various abdominal CT images and some challenging cases: (a) Very low contrast between liver and vena cava. (b) Low contrast between liver and muscle tissue. (c) High contrast between liver and adjacent tissues (i.e. right kidney, muscle, vena cava). (d) Atypical liver shape that causes an unclear boundary with the spleen.

Figure 7. Annotation examples for MRI images (a) T1-DUAL, (b) T2-SPIR.

Evaluation Metrics

Both fully automatic and semi-automatic methods may participate, but they will be evaluated in separate categories. The amount of interaction is not limited, since some approaches may be very attractive from a clinical point of view, and it is hard to distinguish between the terms semi-automatic and interactive given that the amount and nature of the interaction vary substantially.
According to previous studies in the literature, no single evaluation metric is sufficient for the image segmentation problem. Since the results have to be analyzed in both two and three dimensions, we decided to use multiple evaluation metrics, as in previous challenges (e.g. SLIVER07). Although the source code of the SLIVER07 evaluation is publicly available, it has been rewritten by the organizers of this challenge in order to be able to change the weight of each metric. The organizers believe that such an analysis of the weights can provide valuable information about the diversity and complementarity of the results of different segmentation methods. The four evaluation metrics to be used are:

  1. Sørensen–Dice coefficient: Measures the overlap between the segmented and reference volumes as a dimensionless ratio (1 for a perfect segmentation).
  2. Relative absolute volume difference (RAVD): Measures the difference between the volumes of the segmented and reference organs, emphasizing volume difference rather than overlap (0% for a perfect segmentation).
  3. Average symmetric surface distance (ASSD): Determines the average distance between the surface of the segmented object and the surface of the reference in 3D. First, the border voxels of the segmentation and of the reference are collected; these are the object voxels that have at least one neighbor, within a predefined neighborhood, that does not belong to the object. For each border voxel, the closest border voxel in the other set is determined, and the average of all these distances gives the ASSD (0 mm for a perfect segmentation).
  4. Maximum symmetric surface distance (MSSD): Similar to ASSD, but particularly important for surgical operations (such as the living-donor liver surgery of database 1), as it determines the maximum margin of error by selecting the largest of all calculated distances (0 mm for a perfect segmentation).
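
The four metrics above can be illustrated with a minimal Python sketch. It is not the official evaluation code from the challenge repository. It assumes each segmentation is given as a set of (z, y, x) voxel indices, uses a 6-connected neighborhood to find border voxels (the text only says "predefined neighborhood"), and computes distances by brute force, which is only practical for small examples. All function names are hypothetical.

```python
import math
from itertools import product

# Assumption: 6-connected neighborhood; the challenge text does not name one.
NEIGHBORS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def dice(seg, ref):
    """Sorensen-Dice coefficient of two voxel sets (1.0 for a perfect match)."""
    if not seg and not ref:
        return 1.0
    return 2.0 * len(seg & ref) / (len(seg) + len(ref))

def ravd(seg, ref):
    """Relative absolute volume difference in percent (ref must be non-empty)."""
    return 100.0 * abs(len(seg) - len(ref)) / len(ref)

def border(voxels):
    """Object voxels with at least one neighbor that is not in the object."""
    return {
        v for v in voxels
        if any((v[0] + dz, v[1] + dy, v[2] + dx) not in voxels
               for dz, dy, dx in NEIGHBORS_6)
    }

def to_mm(voxels, spacing):
    """Scale voxel indices by the voxel spacing to get coordinates in mm."""
    return [tuple(i * s for i, s in zip(v, spacing)) for v in voxels]

def directed_distances(src, dst):
    """Distance from each point in src to its closest point in dst."""
    return [min(math.dist(a, b) for b in dst) for a in src]

def surface_distances(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Return (ASSD, MSSD) in mm; both sets must be non-empty."""
    seg_mm = to_mm(border(seg), spacing)
    ref_mm = to_mm(border(ref), spacing)
    d = directed_distances(seg_mm, ref_mm) + directed_distances(ref_mm, seg_mm)
    return sum(d) / len(d), max(d)

# Toy example: a 4x4x4 reference cube and the same cube shifted by one voxel.
ref = set(product(range(4), repeat=3))
seg = {(z, y, x + 1) for z, y, x in ref}

assd, mssd = surface_distances(seg, ref)
print(dice(seg, ref))  # 0.75
print(ravd(seg, ref))  # 0.0
print(round(assd, 3))  # 0.357
print(mssd)            # 1.0
```

The toy example shows why several metrics are used together: the shifted cube has exactly the right volume (RAVD of 0%) even though its overlap and surface distances are not perfect.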

The results of these four metrics will be converted to grades on a 0-100 scale with the help of pre-defined thresholds. Since the choice of thresholds has a critical effect on the overall score, they will be set by common agreement after discussion within the organization team. (The thresholds of SLIVER07 were observed to be quite strict, producing a score of 0 for many methods, which limits the diversity of successful results.)
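
One possible way to turn metric values into 0-100 grades is sketched below in Python. The text does not specify the mapping, so the linear interpolation between the perfect value and the threshold is an assumption, and the threshold values and equal weights are placeholders chosen only for illustration, not the values agreed on by the organizers. The function names are hypothetical.

```python
def metric_to_score(value, perfect, threshold):
    """Map a metric linearly to 0..100: 100 at `perfect`, 0 at `threshold` or worse."""
    fraction = (value - threshold) / (perfect - threshold)
    return 100.0 * min(1.0, max(0.0, fraction))

def overall_score(scores, weights):
    """Weighted mean of the per-metric scores."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Placeholder (perfect value, threshold) pairs; NOT the official thresholds.
LIMITS = {
    "dice": (1.0, 0.8),   # higher is better
    "ravd": (0.0, 5.0),   # percent, lower is better
    "assd": (0.0, 15.0),  # mm, lower is better
    "mssd": (0.0, 60.0),  # mm, lower is better
}
# Placeholder weights; equal weighting is an assumption.
WEIGHTS = {"dice": 1.0, "ravd": 1.0, "assd": 1.0, "mssd": 1.0}

# Hypothetical metric values for one segmentation result.
values = {"dice": 0.9, "ravd": 2.0, "assd": 3.0, "mssd": 30.0}

scores = {name: metric_to_score(values[name], *LIMITS[name]) for name in values}
final = overall_score(scores, WEIGHTS)
print({name: round(s, 1) for name, s in scores.items()})
print(round(final, 1))
```

Because the same function handles metrics where higher is better (Dice) and where lower is better (RAVD, ASSD, MSSD), changing a threshold or a weight only requires editing the two dictionaries.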

The evaluation code is already available in MATLAB, Python and Julia. You may access it at https://github.com/emrekavur/CHAOS-evaluation